With an ever-shortening development cycle, and often several generations of products being produced in parallel or in rapid succession, the need for standardized embedded tools and capabilities that enable quick analysis and debug of embedded intellectual property (IP) is a critical factor in keeping system-on-chip verification manageable.
As more processing elements, features and functions are simultaneously embedded into the silicon, the level of embedded complexity is beginning to outstrip the capability of standalone logic analyzer-, debugger- and emulator-based diagnostic tools. While these tools capture data off the system data bus, they work only as long as every access (read or write) occurs over the external data bus. This points to a widening gap in the controllability and, in particular, the visibility that such tools can provide into the internal operations of a complex system.
On-chip instrumentation is defined as an embedded block that provides both external visibility and access to the inner workings of an architecture. When properly implemented it provides a real-time "peephole" into the operations of key internal blocks that cannot otherwise be accessed in sufficient granularity on a real-time basis. Real-time visibility and monitoring of key interfaces and buses are increasingly crucial to understanding the dynamics of the operation of system architectures. As a general rule, debug visibility becomes increasingly problematic for highly integrated chips, which have extensive on-chip memory and caches, peripherals and a range of on-chip buses.
The key control and bus signals of interest in a deeply embedded system are often not brought out to the physical pins of the device, and are therefore inaccessible to traditional instrumentation. This inaccessibility inhibits verification of silicon operation and introduces many hardware and software integration roadblocks, since the design team must work out how traditional debug tools can be interfaced to the device at all.
Rather than a single clearly defined product, on-chip instrumentation (OCI) is, in many ways, a tool kit of resources and analysis philosophies to assist in debug of complex systems. Different OCI approaches and implementations are used to debug, for example, processors and buses. Optimized instrumentation for different processors can differ significantly, to allow support for architectural differences and features.
At its simplest level, instrumentation consists of one or more sets of blocks that allow collection, aggregation and concentration or compression of selected internal data of a system-on-chip (SoC) for tracing over time. The signals are exported over device pins for external postprocessing or visualization. Sophisticated instrumentation blocks support real-time data filtering, on-chip analysis of target performance and single or multicore triggering and breakpoint management within the target chip.
Since the instrumentation blocks must be integrated on-chip, typically they are provided as register-transfer-level (RTL) hardware IP; that is, as synthesizable VHDL or Verilog code that can be instantiated in the target design under test. RTL-based design allows easier implementation of scalable approaches that support performance vs. resource trade-offs, allowing debug instrumentation to be integrated over the full life cycle of a part. Typically, more focus on debug features is seen in early (presilicon) debug (such as hardware emulation or in FPGA devices), with a lesser feature set shipped in final silicon.
Central to most instrumentation capabilities is the tracing of data as it moves through the application or system. To address differing debug requirements, instrumentation blocks must support different implementations of trace collection. Typical requirements include the ability to trace in cycle, branch and timer modes.
Cycle mode collects all bus cycles generated by the core or cores. Branch mode collects all execution path changes, sometimes called branch trace messages. Timer trace mode records a frame with a time stamp each time an event is satisfied, providing basic performance-analysis measurements.
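For illustration, the host-side expansion of a branch-mode trace can be sketched as follows. This is a simplified model, not any vendor's actual trace format: it assumes fixed 4-byte instructions and branch messages that pair the address where a change of flow occurred with its target, from which the tools infer the sequential execution in between.

```python
# Reconstruct an execution path from branch trace messages (illustrative
# model: fixed 4-byte instructions, messages in program order).

def expand_branch_trace(start, branch_msgs, end):
    """branch_msgs: list of (branch_addr, target_addr) pairs."""
    path = []
    pc = start
    for branch_addr, target in branch_msgs:
        while pc != branch_addr:          # sequential execution is inferred
            path.append(pc)
            pc += 4
        path.append(pc)                   # the change-of-flow instruction itself
        pc = target
    while pc <= end:                      # straight-line tail after the last branch
        path.append(pc)
        pc += 4
    return path

# One branch at 0x1008 jumping to 0x1020 expands back to the full path.
path = expand_branch_trace(0x1000, [(0x1008, 0x1020)], 0x1028)
```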
Event recognition is widely used in conjunction with trace to capture information on events and operations in the SoC. Trace data values can be monitored and compared to provide real-time triggers for controlling event actions such as breakpoints and trace collection. Event recognizers can simultaneously look for bus address, data and control values, and be programmed to trigger on specific values or sequences such as address regions and data read or write cycle types. The event recognizers can control enable or disable of breakpoints and trace collection.
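The matching logic of such an event recognizer can be modeled in software as below. In hardware these are RTL comparators operating every bus clock; the field names and cycle-type encodings here are purely illustrative.

```python
# Software model of a simple bus event recognizer: a bus cycle is matched
# against a programmed address range, an optional cycle type and an
# optional data value (all names illustrative).

def make_recognizer(addr_lo, addr_hi, cycle_type=None, data=None):
    def match(cycle):                     # cycle: dict with addr/type/data
        if not (addr_lo <= cycle["addr"] <= addr_hi):
            return False
        if cycle_type is not None and cycle["type"] != cycle_type:
            return False
        if data is not None and cycle["data"] != data:
            return False
        return True
    return match

# Trigger on any write of 0xDEAD into the 0x4000-0x4FFF region.
trigger = make_recognizer(0x4000, 0x4FFF, cycle_type="write", data=0xDEAD)
```

In a real implementation the boolean output of such a comparator would be routed to trigger logic that arms or disarms breakpoints and trace collection.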
Data tracing based on recognizable events opens doors to new capabilities in real-time SoC analysis. The data trace mode provides real-time information about the status and data of a system's internal signals, including analysis of cache performance and internal memory, as well as data transfer operations that cannot otherwise effectively be extracted from a system. In-line or postprocessing of trace information allows for analysis of data flow performance or measurement of system characteristics, such as bus availability or cache hits and misses, which require long-term steady-state (measured over many cycles) system information.
Additional detection of events in traced data allows the development environment to flag specific features in the trace data as it flows through the application. In-circuit emulation techniques that rely on background debugger mode and JTAG implementations cannot provide this data in real time or with complete visibility of all internal interactions.
JTAG provides the default interfaces for most basic debug functions to embedded blocks. Supporting JTAG, trace and probe ports provide the additional I/O bandwidth needed for many on-chip instrumentation approaches. But even with these additional ports, the amount of debug information required can easily exceed the allocated debug interface bandwidth of an SoC.
To reduce the information sent over the interface, and thus increase the interface's effective performance, data compression and filtering can be used with minimal effect on the overall system cost. Clearly, the most useful approach to reducing the information from the debug port to the host development tool is to limit transmissions to new information and let the development tools infer the rest.
Only when a change of flow, such as an interrupt or branch, occurs would the system need to send the new beginning address. Also, if the debugging session must be real-time, then some constraints are needed on the information being sent. For instance, grouping data into relevant sets needed for specific export and analysis allows prioritized use of the debug port during run-time.
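The bandwidth saving from this change-of-flow approach can be sketched as the dual of trace expansion: drop every program-counter value the host can infer, and transmit only the non-sequential transitions. The model below again assumes fixed 4-byte instructions; real compressors typically also delta-encode the addresses they do send.

```python
# Compress a full program-counter trace down to change-of-flow messages
# (illustrative model: fixed 4-byte instructions).

def compress_pc_trace(pcs):
    """Keep only non-sequential transitions; the host infers the rest."""
    msgs = []
    for prev, cur in zip(pcs, pcs[1:]):
        if cur != prev + 4:               # change of flow: report it
            msgs.append((prev, cur))
    return msgs

full = [0x1000, 0x1004, 0x1008, 0x1020, 0x1024]
msgs = compress_pc_trace(full)            # one message instead of five PC values
```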
For processor characterization, the common method of performance monitoring is to provide a set of counters and a selectable set of processor and bus events to count. A counter reaching a programmed terminal count can generate an interrupt, allowing the processor to read and reset counters and log the information.
For an embedded processor, performance-monitoring modes and counters need to be set up and read out via the JTAG port so that the measurements do not interfere with processor execution. Also important is the ability to use hardware triggers to start and stop measurements, so that event counter results can be recorded at specific points in the program.
While a single point-to-point measurement can help in determining performance problems, capturing many occurrences of these event counts, in real time, is very useful. By designing the performance-monitoring module so that the counter outputs can be written to the internal or external trace module, a trigger can be set up to save the counter values into the trace buffer, reset them and restart them in a single clock cycle.
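This save-and-restart behavior can be modeled as follows. The event names, trigger interface and timestamps are illustrative; in hardware the snapshot, clear and restart all happen in one clock.

```python
# Model of a performance-monitoring counter bank that, on a hardware
# trigger, snapshots its values into a trace buffer and restarts
# counting (event names and timestamps are illustrative).

class PerfMonitor:
    def __init__(self, events):
        self.counts = {e: 0 for e in events}
        self.trace_buffer = []

    def on_event(self, event):
        self.counts[event] += 1

    def on_trigger(self, timestamp):
        # Save-and-restart: record the interval's counts, then clear them.
        self.trace_buffer.append((timestamp, dict(self.counts)))
        for e in self.counts:
            self.counts[e] = 0

pm = PerfMonitor(["cache_miss", "bus_wait"])
pm.on_event("cache_miss"); pm.on_event("cache_miss"); pm.on_event("bus_wait")
pm.on_trigger(timestamp=100)              # first measurement interval
pm.on_event("bus_wait")
pm.on_trigger(timestamp=200)              # second interval starts from zero
```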
The purpose of software performance analysis is to pinpoint the code that consumes significant execution time, and to determine why. In one simple, inexpensive method called hot-spot profiling, the processor's program counter is periodically sampled and binned to provide information on where the processor is spending most of its execution time. On-chip instrumentation to support hot-spot profiling provides a means of sampling the current PC without perturbing the target execution. A JTAG command samples the PC, then shifts the value out of the device for postprocessing.
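The host-side postprocessing amounts to binning each sampled PC into the function whose address range contains it. The symbol table below is hypothetical; in practice it comes from the application's link map or debug symbols.

```python
# Bin sampled program-counter values into functions for hot-spot
# profiling (the symbol table here is hypothetical).

from bisect import bisect_right

def bin_samples(symbols, samples):
    """symbols: sorted list of (start_addr, name); samples: raw PC values."""
    starts = [s for s, _ in symbols]
    hist = {}
    for pc in samples:
        i = bisect_right(starts, pc) - 1  # last function starting at or below pc
        name = symbols[i][1] if i >= 0 else "<unknown>"
        hist[name] = hist.get(name, 0) + 1
    return hist

symtab = [(0x1000, "main"), (0x1400, "filter_block"), (0x1C00, "isr")]
hist = bin_samples(symtab, [0x1404, 0x1410, 0x1500, 0x1004, 0x1C08])
```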
Another OCI-enhanced method of measuring software performance is for the processor to provide instruction-type information to the trace block: specifically, when a call or return instruction is executing or an interrupt service routine has been entered or exited. The trace hardware stores only the addresses of these subroutine call/return instructions, providing a qualified trace of call/return flow. Add time-stamping to the trace and it can be postprocessed to report minimum, maximum and average time in each function, exclusive or inclusive of nested functions.
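The inclusive-time part of that postprocessing can be sketched with a simple stack walk over the timestamped call/return messages. This assumes a single thread of execution and properly matched calls and returns; function names stand in for the addresses the real hardware records.

```python
# Postprocess a timestamped call/return trace into inclusive time per
# function (simplified: one thread, matched call/return pairs).

def function_times(trace):
    """trace: list of (timestamp, 'call' | 'ret', func_name) in time order."""
    stack, inclusive = [], {}
    for ts, kind, name in trace:
        if kind == "call":
            stack.append((name, ts))
        else:                              # 'ret': close the most recent call
            fname, t_in = stack.pop()
            inclusive[fname] = inclusive.get(fname, 0) + (ts - t_in)
    return inclusive

times = function_times([
    (0,  "call", "main"),
    (10, "call", "fir"),
    (40, "ret",  "fir"),
    (50, "ret",  "main"),
])
```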
A third method of software performance, at an RTOS task level, is enabled by OCI trace hardware that can qualify trace on a contiguous set of memory addresses. Most commercial real-time operating systems can enable instrumentation that writes the new active-task ID to a predefined memory location whenever a context switch occurs. Tracing the task ID values along with a duration time stamp results in an almost-zero overhead trace of task execution history.
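Turning that task-ID trace into an execution profile is straightforward: each task runs from its switch-in timestamp until the next context switch. The task names and times below are illustrative.

```python
# Convert a trace of (timestamp, task_id) context-switch writes into a
# per-task runtime profile (task IDs and times are illustrative).

def task_profile(switches, end_time):
    runtime = {}
    pairs = zip(switches, switches[1:] + [(end_time, None)])
    for (ts, tid), (next_ts, _) in pairs:
        runtime[tid] = runtime.get(tid, 0) + (next_ts - ts)
    return runtime

prof = task_profile([(0, "idle"), (5, "comms"), (12, "idle"), (20, "ui")], 25)
```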
OCI hardware that qualifies trace storage on a set of memory writes can provide another dimension of performance measurement. User-defined instrumentation, in the form of markers placed into source code, can provide measurement points for everything from system-level characterization down to very detailed application performance.
For multiprocessor SoC designs, one wants to know bus utilization, whether there is contention when cores attempt to access the shared bus and, if so, how much time each core waits for the bus. Specifically, a designer needs to know the waiting time for each peripheral (including memory) the core accesses from the bus. A performance monitor attached to a bus can measure these types of system factors. It can help determine if the priority scheme implemented by an arbiter for bus accesses is the best one for the system design.
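As a sketch of what such a monitor counts, the model below tallies, per bus master, cycles spent waiting for a grant versus cycles actively using the bus. The per-clock sampling format and master names are illustrative, not any particular bus protocol.

```python
# Model of a shared-bus performance monitor: per master, count cycles
# spent waiting for the arbiter vs. cycles granted (illustrative).

def bus_stats(cycles):
    """cycles: list of (requesting_masters, granted_master) per bus clock."""
    wait, active = {}, {}
    for requesting, granted in cycles:
        if granted is not None:
            active[granted] = active.get(granted, 0) + 1
        for m in requesting:
            if m != granted:               # requested but not granted: waiting
                wait[m] = wait.get(m, 0) + 1
    return wait, active

wait, active = bus_stats([
    ({"cpu", "dsp"}, "cpu"),               # dsp loses arbitration twice...
    ({"cpu", "dsp"}, "cpu"),
    ({"dsp"}, "dsp"),                      # ...then gets the bus
])
```

Feeding such counts through the save-and-restart trigger mechanism described earlier would give a time history of arbitration behavior rather than a single aggregate.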
For debugging and tweaking system software where there is interaction among multiple cores, instrumented code can be inserted in the source code at the important locations in each core. A bus trace tool such as an Amba AHB trace can be set up to qualify on writes to a block of shared memory (or a dummy slave device can be placed on the bus). Each core has a small set of instrumentation addresses; the resulting address indicates which core is writing its instrumentation markers. The trace then holds a history of all the locations that have executed in all the cores. With time stamping, the duration of various core events can be displayed, illuminating the parallelism of the code running on the different cores and how software synchronization occurs between them.
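Decoding such a shared-memory marker trace on the host reduces to mapping each write address back to the core that owns it. The address windows below are hypothetical; the point is only that the write address identifies the writer.

```python
# Decode multicore instrumentation markers from a shared-memory bus
# trace: each core owns a small address window, so the write address
# identifies the writing core (the address map here is hypothetical).

CORE_WINDOWS = {"core0": (0x8000, 0x800F), "core1": (0x8010, 0x801F)}

def decode_marker(addr, data, ts):
    for core, (lo, hi) in CORE_WINDOWS.items():
        if lo <= addr <= hi:
            return (ts, core, addr - lo, data)   # (time, core, marker id, value)
    return (ts, "<unknown>", None, data)

evt = decode_marker(0x8012, 0x1, ts=42)   # a marker written by core1
```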
Looking ahead to more complex systems, instrumentation will need sufficient "embedded intelligence" to interpret information passing between cores, determine what needs to be extracted for debug and perform other task-aware debug for on-chip RTOS or network protocol analysis. Equally challenging is presenting all the diverse debug information in a coherent, understandable way. As in many areas of complex system-on-chip design, new classes of instrumentation will be needed.
Rick Leatherman is President and Chief Executive Officer, Bruce Ableidinger is Director of Business Development and Neal Stollon is a Consulting Engineer for First Silicon Solutions Inc. (Portland, Ore.).
See related chart