It’s no wonder that chip manufacturers’ data sheets are full of errata lists, which describe deviations from the specification. So, why continue to ship? Because the redesign time, effort, and cost required to fix all errors can result in missing the market window. As long as the device functions acceptably in the application socket – with or without workarounds – the manufacturer and customer can live with the defects.
The foregoing issues are problematic enough in the case of one chip design. Now, just imagine the challenges faced by an auto manufacturer. The end-product is a vehicle with scores of such complex devices (see figure 2) – each with its own errata sheet. In bringing up a new vehicle design for production, a huge number of mismatches arise and must be resolved. A statistical overview by a leading European car manufacturer showed 11,000 issues – both large and small – to be resolved when assembling the modules for the first time. And electronics is one of the foremost pain points.
Figure 2: The number and variety of integrated circuits in a modern automobile
So, now the chip goes into general deployment and clocks up a significant number of operational hours. Then, suddenly, after months, individual modules start to fail sporadically. After a couple of weeks of fruitless investigation by the module designers, the problem is assumed to lie in the chip. But comparison of the failed chip with a properly functioning chip yields no further information, so the failure is passed to the semiconductor provider. After extensive tests using tighter test conditions, all too often the answer comes back: “no failure found.”
The number and frequency of such cases have been increasing inexorably over the past years. The problem is that a device test can ensure that a circuit fulfills its specified functions, but typically cannot uncover intricate, rarely-occurring malfunctions caused by design errors. The only way to uncover those is 100 percent verification of the design – a comprehensive verification that leaves no verification holes.
Where is that bug hiding?
We must take a tip from Sherlock Holmes: When one has eliminated all logical explanations, then the illogical explanation – even if apparently impossible – is correct. So, we must ask ourselves: what have we ignored or simply failed to anticipate in the verification? It is clear that our “comprehensive” simulations are not good enough. Our design is not perfect.
We must also cope with another misapprehension: the belief that a failure that seldom occurs cannot have a systematic cause. What about a bug that is buried so deep in a complex circuit that it is rarely activated? Or a bug that is triggered only by a meaningless, unknown, or unintended condition? What about a condition that occurs only when it rains, the passenger door is open, the left-side indicator is blinking and the brakes are applied? But this is surely an arbitrary example – normally, deep bugs don’t occur under such simply-described conditions!
The deeper the bug is embedded in the complexity, the lower the probability of finding it. A truly comprehensive simulation and analysis using established methods would soon take longer than the lifetime of the product. Even parallel processing of the design simulation – using as many engineers as are available – often cannot find the elusive bugs in an acceptable time.
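A rough back-of-the-envelope calculation illustrates why depth is so punishing. The sketch below is not taken from any particular verification flow. It assumes, purely for illustration, that a bug is triggered by one exact combination of independent binary conditions, that stimulus is uniformly random, and that the simulator manages one million vectors per second. The function names and the throughput figure are hypothetical, and real testbenches use directed and constrained stimulus rather than blind guessing.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def hit_probability(trigger_bits: int, num_vectors: int) -> float:
    """Chance that at least one of num_vectors random vectors matches
    the single trigger pattern made of trigger_bits binary conditions."""
    p_single = 2.0 ** -trigger_bits  # one vector hits with probability 2^-k
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p does not round away
    return -math.expm1(num_vectors * math.log1p(-p_single))


def years_for_confidence(trigger_bits: int,
                         vectors_per_second: float,
                         confidence: float = 0.95) -> float:
    """Simulation time in years needed to hit the trigger at least once
    with the given confidence (illustrative model only)."""
    p_single = 2.0 ** -trigger_bits
    # Solve 1 - (1 - p)^n = confidence for n
    vectors_needed = math.log1p(-confidence) / math.log1p(-p_single)
    return vectors_needed / vectors_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # 4 conditions: rain, door open, indicator blinking, brakes applied
    for bits in (4, 32, 64):
        years = years_for_confidence(bits, vectors_per_second=1e6)
        print(f"{bits:2d} trigger conditions: {years:.3g} years of simulation")
```

Under these assumptions, the four-condition example from the text is found almost instantly, while a trigger that depends on 64 independent conditions would take on the order of a million years of random simulation. The required effort doubles with every additional condition, which is why adding more simulators or more engineers only shifts the limit slightly.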
Even more difficult is a complex digital circuit surrounded by a world full of analog components, whether on the same chip or externally in the application (see figure 3). Such mixed-signal designs are used extensively in automotive applications because the digital processing domain must work with the outside world, which is analog.
Figure 3: The world is analog – even for digital chips
In practice, the bug-fixing dilemma is even greater. Even when the bug has been found, and one knows what to modify in order to remove it, there is still a long supply chain to the end user. Who dares to risk replacing the old chip with the new? After all, the debugged design can contain new errors. Often, there is no alternative but to live with the original bug, and work around it, rather than to jeopardize the established supply chain. The effort required to re-qualify the fixed design throughout the chain is often out of all proportion to the benefit derived.