Tackling large-scale SoC and FPGA prototyping debug challenges
Brad Quinton, Tektronix
1/21/2013 11:06 AM EST
An FPGA-based prototype is a hardware-based implementation of an ASIC design that operates at high clock frequencies that closely represent the final ASIC while enabling non-intrusive monitoring of internal signals. Figure 1 shows the process for instrumenting and observing an FPGA-based prototype. Depending upon the size of the ASIC, the design may span multiple FPGAs. To test the system, engineers partition their RTL design among the FPGAs. Probes are added directly to the RTL to make specific signals available for observation. This instrumented design is then synthesized and downloaded to the FPGA prototype platform.

Figure 1. To monitor internal signals, probes are added directly to the RTL.
When the system is run, the RTL-based probe connected to each of the instrumented signals collects the signal’s value at each clock cycle. To enable the system to run at its full operating frequency and collect signal data in real time, the data is stored in a trace buffer in FPGA block RAM. An analyzer connected to the prototype then downloads the data collected from each instrumented signal from block RAM, giving engineers offline visibility into the system.
The chief limitation of this approach to date is that instrumenting signals requires significant amounts of block RAM and LUTs within the FPGA. Both resources are constrained by their fixed availability on the FPGA and by the fact that the ASIC or SoC design itself consumes most of them. For example, an FPGA may have 96 block RAMs, but if the ASIC design requires 86 of them, only 10 are left for debugging.
Three primary factors influence the number of block RAMs and LUTs required to instrument a system: the number of accessible signals, the observation width, and the trace depth. For example, the deeper the trace, the more block RAMs are needed. How a debugging system uses these block RAMs determines the efficiency of instrumentation and how much visibility engineers have into the system. The ability to probe more signals reduces how often the system must be recompiled. A wider observation width means more signals can be viewed with each run, potentially enabling faster identification of root causes. Finally, the ability to capture long traces is crucial for identifying and locating bugs. The types of bugs that are not caught during verification may require thousands or millions of cycles to manifest. Verifying software-driven functionality may span millions of clock cycles as well.
With traditional tools, engineers have to balance each of these factors and rarely achieve the level of visibility they need in a single pass. Designers must also consider how long it takes to recompile the system between debug iterations. Because instrumenting code involves synthesis and place and route, adjusting which signals are probed requires the system to be recompiled. Even when an incremental recompile is possible, recompiling commonly takes from 8 to 18 hours and is typically performed overnight. If new probes are needed during the day, the process is often a “go-home event,” as the new results will not be ready until the next day.
The standard debugging tools offered by FPGA vendors, such as ChipScope and SignalTAP, can probe a maximum of 1,024 signals and require extensive LUT and memory resources. For example, capturing all 1,024 signals at even a shallow trace depth of just 1,024 words requires 29 block RAMs (assuming a 36 Kb block RAM size). A trace this short is likely to cover too brief a time frame for many types of errors.
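The block RAM figure above can be checked with simple arithmetic. The following is a minimal sketch, assuming each probed signal is one bit wide, one sample is stored per signal per trace word, and a block RAM holds 36 x 1,024 bits; the function name is hypothetical and does not come from any vendor tool.

```python
import math

BRAM_BITS = 36 * 1024  # assumed 36 Kb block RAM, as stated in the text


def block_rams_needed(num_signals, trace_depth, bram_bits=BRAM_BITS):
    """Block RAMs needed to hold trace_depth samples of num_signals 1-bit signals."""
    total_bits = num_signals * trace_depth
    return math.ceil(total_bits / bram_bits)


# 1,024 signals x 1,024 words = 1,048,576 bits, or about 28.4 block RAMs
print(block_rams_needed(1024, 1024))  # 29
```

Rounding up is what turns roughly 28.4 block RAMs into the 29 quoted above; a real tool may need more once control logic and memory aspect ratios are taken into account.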
To create a longer buffer, fewer signals can be captured to enable a deeper trace with the same number of block RAM used. However, several new problems are introduced in the process. Trying to locate a bug, for example, with only 32 probes in a complex system with over 10 million RTL-level signals is like randomly opening the pages of a dictionary and hoping to find a specific word.
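The width-versus-depth trade-off can be put in numbers. This is a rough sketch under the same assumptions as the figures above (36 Kb block RAMs, one bit per probed signal); the function names are hypothetical, and the 29-block-RAM budget is simply the figure from the 1,024-signal example.

```python
BRAM_BITS = 36 * 1024  # assumed 36 Kb block RAM


def max_trace_depth(num_block_rams, num_probes, bram_bits=BRAM_BITS):
    """Deepest trace (in words) that fits when num_probes 1-bit signals share the budget."""
    return (num_block_rams * bram_bits) // num_probes


def probe_coverage(num_probes, total_signals):
    """Fraction of the design's signals visible in a single run."""
    return num_probes / total_signals


print(max_trace_depth(29, 1024))  # 1044
print(max_trace_depth(29, 32))    # 33408
print(f"{probe_coverage(32, 10_000_000):.6%}")
```

Cutting the probe count from 1,024 to 32 deepens the trace by a factor of 32, but the last line shows how small a share of a 10-million-signal design remains visible in any one run.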
The use of fewer probes also increases the number of iterations required to locate bugs. Because each iteration also requires synthesis and place and route, “go-home events” start to dominate debug time. This can stretch debugging of a single issue over weeks or months, leading to significant schedule delays. In fact, if a bug is particularly difficult to uncover, it may be necessary to develop a workaround and tape out with known bugs.
To increase the number of signals that can be instrumented, some tool vendors employ a mux network. A full crossbar mux would give concurrent access to any combination of a fixed number of the probed signals on the ASIC, but such an approach quickly becomes impractical in terms of the area required to implement the crossbar. For this reason, a simple n-input mux is commonly used. For example, an 8-to-1 mux can divide 1,024 signals into 8 predefined groups of 128 signals each, with one group selected per run. This makes 8 times as many signals accessible for the same number of block RAMs. However, signals from different groups cannot be observed in the same run, so engineers have to spend time carefully creating the signal groups or risk having to re-run the FPGA CAD flow.
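The grouping constraint can be illustrated with a small model. This is only a sketch: it assumes the 1,024 signals are numbered 0 to 1,023 and assigned to the 8 groups in contiguous blocks of 128, which is a hypothetical layout chosen for illustration and not how any particular tool assigns groups.

```python
GROUP_SIZE = 128  # signals per predefined group (from the example above)
NUM_GROUPS = 8    # inputs of the 8-to-1 mux


def group_of(signal_index):
    """Group that a signal belongs to, under the assumed contiguous layout."""
    if not 0 <= signal_index < GROUP_SIZE * NUM_GROUPS:
        raise ValueError("signal index out of range")
    return signal_index // GROUP_SIZE


def runs_needed(signals_of_interest):
    """Each run selects one group, so distinct groups touched = runs required."""
    return len({group_of(s) for s in signals_of_interest})


print(runs_needed([3, 40, 127]))   # 1: all three signals share group 0
print(runs_needed([3, 130, 900]))  # 3: groups 0, 1 and 7
```

Signals that fall in different groups can never be captured together in this model, which is why the grouping has to be planned before the design is compiled.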
The bottom line is that ASIC prototype debug involves compromise. Emulators offer a rich debug environment, but lack speed and involve considerable expense. FPGA prototypes are cost-effective, yet for larger SoC designs traditional tools haven’t kept pace with growing complexity; the showstopper is limited signal visibility at the RTL level due to resource constraints. If this latter problem could be solved, would emulators still have a place?

