An FPGA-based prototype is a hardware-based implementation of an ASIC design that
operates at high clock frequencies that closely represent the final ASIC
while enabling non-intrusive monitoring of internal signals. Figure 1
shows the process for instrumenting and observing an FPGA-based
prototype. Depending upon the size of the ASIC, the design may span
multiple FPGAs. To test the system, engineers partition their RTL design
among the FPGAs. Probes are added directly to the RTL to make specific
signals available for observation. This instrumented design is then
synthesized and downloaded to the FPGA prototype platform.
Figure 1. To monitor internal signals, probes are added directly to the RTL.
As the system is run, the RTL-based probe connected to each of the
instrumented signals collects the signal’s value at each clock cycle. To
enable the system to run at its full operating frequency and collect
signal data in real-time, the data is stored in a trace buffer in FPGA
block RAM. An analyzer connected to the prototype then downloads the
information collected from each of the instrumented signals from block
RAM, giving engineers offline visibility into the system.
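
The capture-then-download flow described above can be modeled in a few lines of software. This is only a toy sketch: the `TraceBuffer` class and its method names are hypothetical, and the wrap-around behavior (overwriting the oldest sample once the buffer is full) is an assumption, not something stated here about any particular tool.

```python
from collections import deque


class TraceBuffer:
    """Toy model of a fixed-depth trace buffer (hypothetical, not a vendor API)."""

    def __init__(self, depth):
        # Assumption: once full, the oldest sample is overwritten.
        self.samples = deque(maxlen=depth)
        self.cycle = 0

    def capture(self, values):
        """Store the probed signal values for the current clock cycle."""
        self.samples.append((self.cycle, tuple(values)))
        self.cycle += 1

    def download(self):
        """Return the stored samples, oldest first, for offline analysis."""
        return list(self.samples)


buf = TraceBuffer(depth=4)
for cycle in range(6):
    # Two made-up probed signals: bit 0 and bit 1 of a cycle counter.
    buf.capture([cycle & 1, (cycle >> 1) & 1])

trace = buf.download()
print(trace)
```

With a depth of 4 and 6 cycles run, only the last 4 cycles remain available for download, which is why trace depth matters so much in the discussion that follows.
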
The chief limitation to date of this approach is that instrumenting signals
requires the use of significant amounts of block RAM and LUTs within the
FPGA. Both of these resources are constrained by fixed availability on
the FPGA, as well as by the fact that the majority of these resources
are required by the ASIC or SoC design itself. For example, while an
FPGA may have 96 block RAMs, the ASIC design may require 86 of them,
leaving only 10 for use in debugging.
Three primary factors
influence the number of block RAMs and LUTs required to instrument a
system: the number of accessible signals, observation width, and trace
depth. For example, the deeper the trace, the more block RAMs are
needed. How a debugging system uses these block RAMs impacts the
efficiency of instrumentation and defines how much visibility engineers
have into the system. The ability to probe more signals reduces how
often the system must be recompiled. A wider observation width means
more signals can be viewed with each run, potentially enabling faster
identification of root causes. Finally, the ability to capture long
traces is crucial for identifying and locating bugs. The types of bugs
that are not caught during verification may require thousands or
millions of cycles to manifest. Verifying software-driven functionality
may span millions of clock cycles as well.
With traditional tools, engineers have to balance each of these factors and rarely
achieve the robust level of visibility they need in a single pass.
Designers must consider how long it takes to recompile the system
between debug iterations. Because instrumenting code involves synthesis
and place and route, adjusting which signals are probed requires the
system to be recompiled. Even when an incremental recompile is possible,
recompiling is a process that commonly takes from 8 to 18 hours and is
typically performed overnight. If new probes are needed during the day,
the process is often a “go-home event,” as the new results will not be
ready until the next day.
The standard debugging tools offered by
FPGA vendors, such as ChipScope and SignalTAP, can probe a maximum of
1,024 signals and require extensive LUT and memory resources. For
example, 29 block RAMs are required to capture even a shallow trace
depth of just 1,024 words (assuming a 36 Kb block RAM size). This is
likely too short a time frame for many types of errors.
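
The figure of 29 block RAMs can be reproduced with simple arithmetic. The sketch below assumes that all 1,024 signals are captured, that each signal contributes one bit per sample, and that there is no packing or control overhead; the function name `brams_needed` is illustrative only.

```python
BRAM_BITS = 36 * 1024  # one 36 Kb block RAM, as assumed in the text


def brams_needed(num_signals, trace_depth, bram_bits=BRAM_BITS):
    """Block RAMs needed to store trace_depth samples of num_signals bits each.

    Assumption: 1 bit per signal per sample, no overhead.
    """
    total_bits = num_signals * trace_depth
    return -(-total_bits // bram_bits)  # integer ceiling division


print(brams_needed(1024, 1024))  # 29
```

Under these assumptions, 1,024 signals x 1,024 samples is 1,048,576 bits, or about 28.4 block RAMs of 36,864 bits each, which rounds up to 29.
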
To create a longer buffer, fewer signals can be captured, which allows a
deeper trace with the same number of block RAMs. However, several
new problems are introduced in the process. Trying to locate a bug, for
example, with only 32 probes in a complex system with over 10 million
RTL-level signals is like randomly opening the pages of a dictionary and
hoping to find a specific word.
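
The width-versus-depth trade-off can be made concrete with a rough calculation. This is an idealized sketch assuming a fixed budget of 29 block RAMs of 36 Kb each, one bit per signal per sample, and perfect packing; the helper names are hypothetical.

```python
BRAM_BITS = 36 * 1024  # one 36 Kb block RAM, as assumed in the text


def max_trace_depth(num_brams, num_signals, bram_bits=BRAM_BITS):
    """Deepest trace (in samples) that fits, assuming 1 bit per signal per sample."""
    return (num_brams * bram_bits) // num_signals


def visibility_percent(num_probes, total_signals):
    """Share of the design's signals that are visible in one run."""
    return 100.0 * num_probes / total_signals


wide = max_trace_depth(29, 1024)   # all 1,024 probes
narrow = max_trace_depth(29, 32)   # only 32 probes
coverage = visibility_percent(32, 10_000_000)

print(wide, narrow)        # 1044 33408
print(f"{coverage:.5f}%")  # 0.00032%
```

Under these assumptions, dropping from 1,024 probes to 32 deepens the trace by a factor of 32, but leaves only about 0.00032% of a 10-million-signal design visible in any one run.
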
The use of fewer probes also
increases the number of iterations required to locate bugs. Because each
iteration also requires synthesis and place and route, “go-home events”
start to dominate debug time. This can stretch debugging of a single
issue over weeks or months, leading to significant scheduling delays. In
fact, if a bug is particularly difficult to uncover, it may be
necessary to develop a workaround and tape out with known bugs.
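
A back-of-the-envelope estimate shows how quickly iterations add up. The sketch assumes one debug iteration per working day (because each recompile runs overnight) and a five-day working week; both figures and the function name are illustrative assumptions.

```python
import math

WORKING_DAYS_PER_WEEK = 5  # assumption


def debug_weeks(iterations, iterations_per_day=1):
    """Calendar weeks needed, assuming each iteration waits on an overnight recompile."""
    days = math.ceil(iterations / iterations_per_day)
    return days / WORKING_DAYS_PER_WEEK


print(debug_weeks(20))  # 4.0
```

On these assumptions, a bug that takes 20 probe changes to corner consumes about four weeks, regardless of how quickly each captured trace is analyzed.
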
To increase the number of signals that can be instrumented, some tool
vendors employ a mux network. A full crossbar mux would give concurrent
access to any combination of the probed signals, up to the observation
width, but such an approach quickly becomes impractical in terms of the
die area required to implement the crossbar. For this reason, an n-input
simple mux is commonly used. For example, an 8-to-1 mux can divide 1,024
signals into 8 predefined groups of 128 signals each. This makes the
total number of signals that can be observed 8 times greater for the
same number of block RAMs. However, signals from different groups cannot
be observed in the same run, so engineers have to spend time carefully
creating the signal groups or risk having to rerun the FPGA CAD flow.
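
The grouping restriction of a simple mux can be illustrated as follows. This sketch uses the 8 groups of 128 signals from the example and assumes, purely for illustration, that consecutive signal indices are assigned to the same group; real tools let the engineer define the groups, and the function names are hypothetical.

```python
GROUP_SIZE = 128  # signals per mux group (from the 8-to-1 example)
NUM_GROUPS = 8


def group_of(signal_index):
    """Assumed assignment: consecutive signal indices share a group."""
    if not 0 <= signal_index < GROUP_SIZE * NUM_GROUPS:
        raise ValueError(f"signal index {signal_index} is not instrumented")
    return signal_index // GROUP_SIZE


def groups_required(signal_indices):
    """Distinct mux groups (one run each) needed to see all the given signals."""
    return sorted({group_of(s) for s in signal_indices})


def observable_in_one_run(signal_indices):
    """True if every requested signal sits behind the same mux selection."""
    return len(groups_required(signal_indices)) <= 1


print(observable_in_one_run([3, 100, 127]))  # True
print(observable_in_one_run([3, 200]))       # False
print(groups_required([3, 200, 1000]))       # [0, 1, 7]
```

Signals 3 and 200 fall into different groups here, so correlating them would take two runs, or a regrouping and another pass through the CAD flow.
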
The bottom line is that ASIC prototype debug
involves compromise. Emulators offer a rich debug environment, but lack
speed and involve considerable expense. FPGA prototypes are
cost-effective, yet for larger SoC designs traditional tools haven’t
kept pace with growing complexity, with the showstopper being signal
visibility at the RTL level due to resource constraints. If this latter
problem could be solved, would emulators still have a place?