While there is much emphasis on functional verification in the EDA and design communities today, there is little knowledge of tools that help optimize the performance or throughput of a given architecture. To be sure, functional verification is of vital importance. It's of little consequence how quickly the design completes its task if functionality is incorrect. Once correct functionality has been confirmed, however, there is significant value in delivering a product that exceeds specifications and outperforms its competitors.
This article explores several aspects of embedded system performance that can be evaluated pre-silicon, and it discusses how the data collected during simulation can guide design alterations that maximize throughput. The data for the analysis presented here is derived from functional simulation runs; thus, we will focus on functional modifications made to the design to improve performance. We will not examine critical timing, clock rates, or power consumption here.
Performance data collection
The quality of the performance results presented to the designer is ultimately determined by the environment in which the data is collected. Because we are discussing embedded system performance, an environment that can execute both firmware and hardware is required. While there is value in analyzing them separately, for example by profiling code with an instruction-set simulator (ISS) or by monitoring bus traffic with a logic simulator, this would miss the interactions between hardware and software. Examples of these effects include processor instruction fetches and data reads/writes on the bus, processor load on the memory sub-system, and bus arbitration conflicts between the processor and other bus masters in the design.
Candidate environments for co-simulation of hardware and software must have sufficient visibility to collect the required performance data. The candidates are hardware/software co-verification, hardware emulation incorporating a model of the physical processor or the physical processor itself, and logic simulation that instantiates a full functional model of the processor.
Of the three environments, hardware/software co-verification has the potential to provide the richest set of data. The ISS and software debugger that comprise co-verification processor models provide data for code profiling and cache hits and misses. The co-verification kernel processes all memory transactions from the processor out to the memory sub-system modeled in the logic simulator, satisfying the data requirements for graphing memory activity. Finally, instantiating a bus monitor in the logic simulator provides bus loading and arbitration delay data.
Hardware emulation may be the next best choice for performance data collection. With the processor represented by a physical device interfaced with a symbolic debugger, code profile information may be available. However, if the processor is implemented in the emulator or is an emulator primitive, there is little opportunity for logging the symbolic data required. A bus monitor can be instantiated in the emulated design in the same way it is with a logic simulator. The emulator can report on memory transactions as well.
The least effective environment is a logic simulator with a full functional processor model. Since it is difficult to integrate a software debugger with this type of model, the data required to drive software profiling is not available. Also, with the execution speed of the logic simulator limited to 10 or 20 instructions per second, not enough software can be run to provide meaningful results. Logic simulators can be instrumented to report bus and memory transactions, but without correlation back to software execution, they deliver hardware performance data at best, not the system analysis presented here.
To support the system performance analysis displays discussed below, use of a properly instrumented hardware/software co-verification tool is assumed.
Plotting the time each software module takes to execute gives a graphical representation of which functions are consuming the bulk of the CPU resources. While some ISS and software debugger combinations can deliver this data on their own, they omit the effects of hardware on software execution, including bus waits and the handling of interrupts. Integrating a hardware simulator with the ISS significantly improves the accuracy of software profiling, displaying the precise elapsed time in nanoseconds for each software function, including the time to service interrupts asserted by the hardware during function execution.
Useful formats for displaying this data include a bar chart indicating the percentage of total CPU resources consumed by each function and a Gantt chart, in which the sequence of function execution, calls, and returns is plotted along with the time each takes to perform. Knowing exactly how long a time-critical function takes to execute can prevent system errors, such as incomplete data transfers or dropped packets. In one customer example, the team was not certain that the RAM copy routine, which was part of software initialization, would complete within the required time. Additional hardware was being considered to perform the RAM copy, rather than using the processor. Knowing with confidence that the RAM copy routine ran within the specified window saved the time and effort required to develop RAM copy hardware.
Figure 1 -- Code profiling identifies which functions are consuming the most CPU time
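As a rough illustration of how the bar-chart percentages could be derived, the sketch below charges elapsed simulation time to whichever function is innermost on the call stack. The trace format (time in nanoseconds, "enter" or "exit", function name), the function names, and the timestamps are all hypothetical; a real co-verification tool would supply its own log format.

```python
from collections import defaultdict

def profile(events):
    """Return exclusive time in ns per function.

    events: time-ordered, properly nested (time_ns, kind, name) tuples,
    where kind is "enter" or "exit". This format is an assumption.
    """
    exclusive = defaultdict(int)
    stack = []        # names of the functions currently active
    last_time = None
    for time_ns, kind, name in events:
        if stack:
            # Charge the interval since the last event to the innermost function.
            exclusive[stack[-1]] += time_ns - last_time
        if kind == "enter":
            stack.append(name)
        else:
            stack.pop()
        last_time = time_ns
    return dict(exclusive)

def percentages(exclusive):
    """Convert exclusive times to a percentage of the total profiled time."""
    total = sum(exclusive.values())
    return {name: 100.0 * t / total for name, t in exclusive.items()}

# Hypothetical trace: main calls ram_copy, which is interrupted by timer_isr.
trace = [
    (0, "enter", "main"),
    (100, "enter", "ram_copy"),
    (400, "enter", "timer_isr"),
    (500, "exit", "timer_isr"),
    (900, "exit", "ram_copy"),
    (1000, "exit", "main"),
]
times = profile(trace)
shares = percentages(times)
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {pct:5.1f}%")
```

Because the interrupt service routine is treated as a nested call, its time is reported separately from the function it interrupted, which is the kind of hardware effect an ISS alone would not capture.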
Software profiling can prompt renewed attention to critical functions that do not execute within the required time. Fixes include rewriting the function with an eye toward better efficiency, implementing it in assembly code rather than C, changing interrupt priorities while the function is executing, and re-implementing the function in hardware rather than firmware.
Most embedded systems today place substantial demands on the memory subsystem. With the CPU and multiple hardware functions competing for access to shared memory regions, a complete system simulation may be the only way to spot memory bottlenecks prior to tape-out.
A graph of memory transactions over time highlights periods of peak memory utilization, allowing the designer to focus on the most critical demands for memory bandwidth. Peaks and valleys in the graph indicate an opportunity to better balance memory access, shifting less time-critical functions to a point where memory bandwidth is underutilized.
Figure 2 -- Memory transaction display shows processor reads, writes, and fetches
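The memory activity graph described above amounts to a histogram of transactions over time. The following minimal sketch bins a transaction log into fixed windows and picks out the busiest one. The log format (time in nanoseconds, transaction kind), the window size, and the sample data are assumptions made for illustration only.

```python
from collections import Counter

def bin_transactions(transactions, window_ns):
    """Count transactions per (window index, kind)."""
    counts = Counter()
    for time_ns, kind in transactions:
        counts[(time_ns // window_ns, kind)] += 1
    return counts

def totals_per_window(counts):
    """Collapse the per-kind counts into one total per window."""
    totals = Counter()
    for (window, _kind), n in counts.items():
        totals[window] += n
    return totals

# Hypothetical log of processor fetches, reads, and writes.
log = [
    (10, "fetch"), (20, "read"), (30, "fetch"), (40, "write"),
    (120, "fetch"),
    (310, "read"), (320, "read"), (330, "write"), (340, "fetch"), (350, "fetch"),
]
counts = bin_transactions(log, window_ns=100)
totals = totals_per_window(counts)
peak_window = max(totals, key=totals.get)
print("busiest window starts at", peak_window * 100, "ns")
```

Windows with no entries, such as the third one in this sample, are the valleys into which less time-critical accesses might be shifted.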
Understanding and correcting memory bottlenecks is made easier and faster through the correlation of a given point in time on the graph with the firmware or hardware function accessing memory. Displaying function names at the drop of a cursor allows the designer to visualize the relationship between memory activity and the responsible functions. Annotating memory reads and writes can improve time to insight.
Cache efficiency is also critical to system performance. Effective use of instruction and data caches speeds firmware execution and minimizes the CPU load on main memory. Plotting CPU cache hits and misses can lead to measurable improvements in system performance. Excessive cache misses can be a result of poor data locality for a given function, or simply an indication that caches are too small. Once identified, the cause of cache inefficiency can be easily corrected to speed firmware and hardware execution.
A significant benefit to working in the virtual world of hardware/software co-verification is the ability to quickly iterate on different cache configurations until the optimum solution is found. Fast iterations supported by a graphical software debugger and logic simulator, combined with the clarity of this memory display, make quick work of evaluating the benefits of a proposed change in cache size or algorithm. In contrast, attempting to optimize cache by working with a hardware prototype restricts the designer's options for change and provides indirect feedback on efforts to improve efficiency.
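To make the idea of iterating on cache configurations concrete, here is a small sketch that replays one address trace against direct-mapped caches of several sizes and reports the miss rate for each. The direct-mapped organization, the 32-byte line size, and the synthetic trace are assumptions chosen for brevity; they do not represent any particular processor or tool.

```python
def miss_rate(addresses, num_lines, line_bytes=32):
    """Simulate a direct-mapped cache and return the fraction of misses."""
    tags = [None] * num_lines      # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if tags[index] != block:
            misses += 1
            tags[index] = block    # fill the line on a miss
    return misses / len(addresses)

# Hypothetical trace: four word-by-word sweeps over a 4 KB buffer.
trace = [addr for _ in range(4) for addr in range(0, 4096, 4)]

for lines in (32, 64, 128, 256):
    size_kb = lines * 32 / 1024
    print(f"{size_kb:4.1f} KB cache: miss rate {miss_rate(trace, lines):.5f}")
```

In this synthetic case the miss rate stops improving once the cache holds the whole 4 KB working set, so a larger cache would add cost without benefit. That is the sort of knee a designer would look for when comparing configurations.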
Bus bandwidth may well be the most precious commodity in today's embedded designs. CPUs, DMA controllers, peripherals, and data search engines all compete for this resource. Charting bus utilization over time as a percentage of total bus bandwidth offers a unique view into the operation of the design. If the graph never hits maximum utilization, the designer can move on to other concerns. More commonly, the graph runs flat along the 100 percent line in places, indicating a bandwidth-limited function. This could be a DMA transfer, where use of every available bus cycle is common, or an unexpected peak in bus usage that warrants further investigation. Identifying and eliminating bottlenecks on the bus can result in dramatic improvements in system throughput.
Figure 3 -- Bus load display shows master reads and writes as a percentage of bus bandwidth
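A bus load chart of this kind can be approximated by dividing the run into windows and computing the share of busy cycles in each. The sketch below assumes the bus monitor reports busy periods as half-open, non-overlapping (start_cycle, end_cycle) intervals and that the run length is a multiple of the window size; both the format and the numbers are hypothetical.

```python
def bus_utilization(busy_intervals, window_cycles, total_cycles):
    """Return the percentage of busy cycles in each window.

    busy_intervals: non-overlapping, half-open (start_cycle, end_cycle) pairs.
    total_cycles is assumed to be a multiple of window_cycles.
    """
    num_windows = total_cycles // window_cycles
    busy = [0] * num_windows
    for start, end in busy_intervals:
        first = start // window_cycles
        last = (end - 1) // window_cycles
        for w in range(first, last + 1):
            w_start = w * window_cycles
            w_end = w_start + window_cycles
            # Add only the part of the interval that falls inside window w.
            busy[w] += min(end, w_end) - max(start, w_start)
    return [100.0 * b / window_cycles for b in busy]

# Hypothetical monitor output: short CPU accesses, then a long DMA burst.
intervals = [(0, 25), (50, 60), (100, 200), (250, 275)]
util = bus_utilization(intervals, window_cycles=100, total_cycles=300)
saturated = [w for w, u in enumerate(util) if u >= 100.0]
print(util, "saturated windows:", saturated)
```

Windows listed as saturated correspond to the flat stretches along the 100 percent line and are the natural starting point for investigation.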
Bus arbitration delay
Access to the bus by a bus master is governed by the bus arbiter. Balancing bus access by choosing the most effective arbitration scheme and correctly setting priorities is seldom easy to accomplish. These parameters are often adjusted to ensure that critical functions get sufficient access while lower priorities are not totally ignored.
Arbitration problems often manifest themselves in obscure ways. Buffers may back up and overflow, or data may be dropped entirely. It's seldom clear that the observed behavior is rooted in arbitration. With enough effort, the designer can trace the problem back to the source, but there are better ways to debug these phenomena.
A more direct indication of arbitration problems can be achieved by plotting how long each bus master waited for a bus grant after issuing a bus request. The trick of balancing bus access is easier to achieve by viewing arbitration delay, rather than monitoring changes in its secondary effects. Balancing often requires many iterations, as improving access for one master usually requires degrading access for others. Each iteration can be completed in less time by viewing the arbitration delay plot after a change in priority or arbitration scheme. Faster iterations not only save time but also improve the chances of achieving the optimum bus access balance for the design.
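The request-to-grant wait described above is straightforward to compute from a bus monitor log. The following sketch pairs each master's request with its next grant and summarizes the waits. The event format (cycle, master, kind), the master names, and the sample log are assumptions for illustration.

```python
from collections import defaultdict

def arbitration_delays(events):
    """Return {master: [wait_cycles, ...]} from time-ordered events.

    events: (cycle, master, kind) tuples, kind being "request" or "grant".
    """
    pending = {}                   # master -> cycle of its outstanding request
    delays = defaultdict(list)
    for cycle, master, kind in events:
        if kind == "request":
            # Keep the earliest request if one is already outstanding.
            pending.setdefault(master, cycle)
        elif kind == "grant" and master in pending:
            delays[master].append(cycle - pending.pop(master))
    return dict(delays)

def summarize(delays):
    """Report the worst-case and mean wait for each master."""
    return {m: {"max": max(w), "mean": sum(w) / len(w)}
            for m, w in delays.items()}

# Hypothetical log: the CPU has priority, so the DMA controller waits longer.
log = [
    (0, "cpu", "request"), (1, "cpu", "grant"),
    (2, "dma", "request"), (3, "cpu", "request"),
    (4, "cpu", "grant"), (12, "dma", "grant"),
    (20, "dma", "request"), (26, "dma", "grant"),
]
delays = arbitration_delays(log)
summary = summarize(delays)
print(summary)
```

Rerunning such a summary after each change in priority or arbitration scheme shows directly which masters gained access and which lost it.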
When evaluating functional verification tools for features such as speed, capacity, accuracy, and language support, one should also consider their ability to deliver performance analysis. These four examples detail the potential of this technology to drive effective tuning of embedded hardware and software to achieve optimum throughput and efficiency. A relatively small amount of incremental development effort can result in substantial gains in performance when the operational characteristics of the design are presented in a clear and flexible manner. Used effectively, the ability to quickly implement and analyze performance alternatives can produce an end product that is superior to competitive offerings.
Jim Kenney is a product manager in the SoC Verification Division at Mentor Graphics and is responsible for the Seamless co-Verification environment. He has over 25 years of experience in design validation and logic simulation and has spent the majority of his career at Mentor Graphics and GenRad (currently Teradyne).