SoC designers are learning the benefits of applying high-capacity formal verification techniques at every stage of the design. Our formal tools are powerful and versatile enough for a wide variety of tasks such as architectural exploration and RTL verification, all the way through post-silicon debug.
A good rule of thumb shows how costly a missed bug can be: a bug found in model testing is the least expensive to fix; if it’s found in component test, the cost goes up 10X; it goes up 10X again in system test; and another 10X if the bug makes it into production. You do the math. If you thought watching your team go down in the World Cup was depressing, try explaining to your boss how you let a bug loose in the field. Taking the analogy a bit further, using formal in the post-silicon lab is like having a really good goalie, because it’s the only method that covers finding the bug, fixing it, and verifying the fix, shaving untold engineering-hours from the design cycle and maybe even saving a job or two along the way.
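For readers who would rather not do the math by hand, here is a minimal Python sketch of the rule of thumb. It assumes each stage multiplies the cost by ten relative to the previous one and treats the cost of a bug caught in model testing as one unit; the stage names and the `relative_cost` helper are illustrative only.

```python
# Stages from the rule of thumb, earliest (cheapest) first.
STAGES = ["model test", "component test", "system test", "production"]


def relative_cost(stage: str, factor: int = 10) -> int:
    """Cost of fixing a bug at `stage`, relative to fixing it in model test.

    Assumption: the cost grows by `factor` at every later stage.
    """
    return factor ** STAGES.index(stage)


for stage in STAGES:
    print(f"{stage:>15}: {relative_cost(stage):>5}x")
```

Under that assumption, a bug that escapes to production costs 1000 times what it would have cost in model testing.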
Post-silicon debug means trying to reproduce bugs seen in the lab using directed-random-simulation and emulation. Often, though, these traditional approaches, unlike formal, cannot root cause the bug fast enough. Let’s look at a fairly typical scenario.
When a problem is encountered at this late stage, it can be hard to debug because there is so little visibility into the silicon in the post-silicon lab. You can see that something is wrong, but the symptom is often abstract and not well understood: the chip hangs, stops responding, drops packets, sends out wrong output, etc. The first step is to determine exactly what is happening. Many chips have on-chip trace extraction capability, e.g., controls to freeze the chip when certain events are identified, or on-chip logic analyzers that allow a selected group of signals to be muxed to external pins. This lets you extract a failure trace, capturing a limited number of signals for a number of cycles before (and maybe after) a problem is detected. The next step is to isolate and root cause the post-silicon bug. At this point, the trace shows that the chip is exhibiting illegal behavior, but it covers only the last N cycles of the run, and it is not known how this state was reached. Typically, the trace contains a limited number of signals, and it is difficult to choose the right set of signals to expose the problem.
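To make the "last N cycles" limitation concrete, here is a toy Python model of an on-chip trace buffer. It is a sketch under stated assumptions, not a description of any particular chip: the `TraceBuffer` class, the signal names, and the cycle numbers are all hypothetical. The buffer records only the signals that were selected for tracing, overwrites its oldest samples as it fills, and freezes when the failure becomes visible.

```python
from collections import deque


class TraceBuffer:
    """Toy on-chip trace buffer: keeps the last `depth` cycles of chosen signals."""

    def __init__(self, depth, signals):
        self.signals = tuple(signals)        # signals muxed out for tracing
        self.samples = deque(maxlen=depth)   # oldest samples are overwritten
        self.frozen = False

    def clock(self, cycle, state):
        """Record one cycle unless the buffer has been frozen."""
        if not self.frozen:
            self.samples.append((cycle, {s: state[s] for s in self.signals}))

    def freeze(self):
        self.frozen = True


def run_until_failure(fail_cycle, depth):
    """Run a fake chip until `error` is observed, then freeze the trace."""
    buf = TraceBuffer(depth, signals=["valid", "error"])
    cycle = 0
    while True:
        state = {
            "valid": 1,
            "error": int(cycle >= fail_cycle),
            "fifo_ptr": cycle % 8,   # internal signal that was NOT selected
        }
        buf.clock(cycle, state)
        if state["error"]:
            buf.freeze()
            return list(buf.samples)
        cycle += 1


# Hypothetical numbers: the real root cause occurs at cycle 100, but the
# failure only becomes visible at cycle 50_000 and the buffer holds 16 cycles.
ROOT_CAUSE_CYCLE = 100
trace = run_until_failure(fail_cycle=50_000, depth=16)
first_cycle, last_cycle = trace[0][0], trace[-1][0]
print(f"trace covers cycles {first_cycle}..{last_cycle}")
print("root cause captured:", first_cycle <= ROOT_CAUSE_CYCLE <= last_cycle)
```

In this toy run the trace ends at the visible failure, but the cycle where things actually went wrong fell out of the buffer long before, and the unselected `fifo_ptr` signal was never recorded at all.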
Figure 1 represents this dilemma: The last few cycles of the failing scenario can be observed, but how can the root cause of the problem be found? How can the designer know in which block the bug is located?
Typically, directed-random-simulation is used to isolate the bug, but can simulation reveal how this state was reached? With the weak controllability of simulation and other existing methods, it is not clear where the bug is happening; what is known is that it causes block D to act incorrectly. The bug happens after a 3-4 hour run in the lab when a certain kind of traffic is injected (e.g., only for read transactions on bus X). Finding the root cause with directed-random-simulation can be extremely difficult. If it takes four hours of real time with random traffic to hit the bug, how long will it take to reproduce it in simulation, which runs dramatically slower?
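A back-of-the-envelope estimate puts a number on that question. The Python sketch below assumes a 1 GHz chip clock and a full-chip RTL simulation running at about 1,000 cycles per second; both figures are illustrative assumptions, not measurements, so substitute your own.

```python
def simulation_hours(lab_hours, chip_hz, sim_cycles_per_sec):
    """Wall-clock hours a simulator needs to replay `lab_hours` of silicon time."""
    cycles = lab_hours * 3600 * chip_hz        # clock cycles executed in the lab
    return cycles / sim_cycles_per_sec / 3600  # seconds of simulation -> hours


# Assumed figures (hypothetical): 1 GHz silicon, 1 kHz effective simulation speed.
hours = simulation_hours(lab_hours=4, chip_hz=1e9, sim_cycles_per_sec=1e3)
years = hours / (24 * 365)
print(f"{hours:,.0f} hours of simulation, or about {years:,.0f} years")
```

With these assumptions, replaying the four-hour lab run cycle for cycle would take about four million hours of simulation, on the order of centuries, and that still presumes the random traffic happens to hit the same corner case.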