A global leader in aerospace electronics needed to quantify the performance of their customers’ embedded code running on one of their delivered systems. They accomplished this goal by using a TLM 2.0 methodology to produce an executable system model and then executing software on that model to analyze the functional factors contributing to overall system-level performance.
The platform they needed to analyze was a multi-board system that processes incoming data packets. Onboard timers were used to synchronize data frames and initiate CPU processing via interrupts. The CPU must process each packet before the next packet is acquired in order to maintain real-time operation. The time from when the CPU finishes processing a packet until the arrival of the next packet is defined as CPU idle time. If enough idle time exists, additional capabilities can be added to the system software to gain more functionality and/or reliability. The premise of this project was that bottlenecks in the software's interaction with the hardware platform could be found and explained to the end customer, identifying potential areas for software optimization.
Figure 1. Observation of CPU idle and processing times.
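The idle-time metric defined above reduces to a simple calculation. The following is a minimal plain-C++ illustration with hypothetical numbers, not code or figures from the actual system:

```cpp
#include <cassert>

// Hypothetical numbers for illustration: a fixed frame period and a
// measured per-packet processing time, both in microseconds.
struct FrameTiming {
    double frame_period_us;   // time between packet arrivals
    double processing_us;     // CPU time spent processing one packet
};

// CPU idle time: from the end of processing until the next packet arrives.
double idle_time_us(const FrameTiming& t) {
    return t.frame_period_us - t.processing_us;
}

// Fraction of each frame available for added functionality/reliability.
double idle_fraction(const FrameTiming& t) {
    return idle_time_us(t) / t.frame_period_us;
}
```

If the idle fraction is healthy, the headroom can absorb new software features; if it approaches zero, the system is at risk of missing its real-time deadline.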
Previous attempts to analyze performance in the lab were not successful due to a lack of visibility into the hardware. The challenge was that the real-time nature of the application required detailed visibility of the hardware activity as the software was executing. It was not possible to stop and continue the application and get meaningful results. It was also difficult to track the details of hardware components in the physical system, such as the state and activity of the cache or the stalling effect on the processor when accessing slow peripherals. The inherent difficulty of analyzing the physical system was a major factor driving the need for a virtual prototype.

Modeling and Simulation Strategy
After reviewing the industry's available technological responses to this problem, our customer determined that creating an architecturally accurate performance model would be an effective way to gain the visibility required to understand how efficiently the software was using the hardware. The goal for this architectural model was to identify ways of optimizing the software to make more efficient use of the hardware.
A TLM simulation environment adds internal hardware visibility to significantly improve the understanding of system activity and performance factors. Additionally, the simulation system can be fully analyzed without any side effects that occur when attempting to analyze the real hardware. Our customer developed a successful strategy as follows:
- Create and test models of the interesting portion of the platform: the HW/SW (CPU) interaction
- Retarget an in-house Windows resident testbench application used in the lab for production test
- Leverage standardized benchmarks to ensure simulated latencies match real hardware
- Instrument the simulation environment to analyze system performance
They chose to create the architecture and analysis model by leveraging an available commercial tool that supports a standards-based approach. This included focusing on the creation of SystemC TLM 2.0-based models for the platform using the Mentor Graphics Vista tool set. The CPU, bus, memory, and some peripheral elements were supplied from the Mentor library. There were also a number of custom hardware elements that had to be created specifically for this design. By leveraging standards and tool capabilities, they were able to quickly assemble an architectural model of their system.
Challenges included the creation of efficient custom models. By using the transaction-focused techniques of TLM2.0, they were able to create accurate and efficient simulation models; but it did require a change in their thought process. Some of their early models were too dependent on the system clock. Changing their thinking from an RTL cycle-by-cycle perspective to a transaction perspective resulted in significant model performance improvements while maintaining the desired architectural accuracy. Some example model profiles include:
- Timer: Functionality is driven by register callbacks. Reads/Writes to the timer registers trigger activity and may schedule or cancel future timing interrupt events.
- CPU ISS with cache: This is an accurate cache model, with API for statistics gathering. The ISS models aggregate CPU instructions per cycle, yielding a reasonably accurate CPU performance model, without the simulation cost of a fully cycle accurate ISS.
- Frame/DMA engines: A combination of multiple hardware blocks handles frame boundary conditions as well as the DMA of frame data.
- Bridges and interconnects: These connect 8-bit slaves (e.g., PROM, memory) to 32-bit buses and model the timing and interconnects of the system backplane.
- Non-critical registers: Modeled in a block connected to a single bus port, this catches all un-mapped accesses.
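As an illustration of the register-callback style described for the timer model, here is a minimal plain-C++ sketch. It is not SystemC and not the actual model's API; the register offsets, enable bit, and event queue are all invented for the example. The key point it shows is the transaction perspective: a write to a control register schedules a future interrupt event directly, rather than counting clock cycles.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Illustrative stand-in for a simulation kernel's event queue:
// events are (time, action) pairs executed in time order.
struct EventQueue {
    std::multimap<uint64_t, std::function<void()>> events;
    uint64_t now = 0;
    void schedule(uint64_t t, std::function<void()> fn) {
        events.emplace(t, std::move(fn));
    }
    void run() {
        while (!events.empty()) {
            auto it = events.begin();
            now = it->first;
            auto fn = it->second;
            events.erase(it);
            fn();  // execute the action at its scheduled time
        }
    }
};

// Timer modeled at the transaction level: no per-clock activity.
// A write to the (hypothetical) control register arms the timer by
// scheduling a future interrupt event.
class TimerModel {
public:
    TimerModel(EventQueue& q, std::function<void()> irq)
        : q_(q), irq_(std::move(irq)) {}
    // Register write callback; only the arming path is sketched here.
    void write_reg(uint32_t offset, uint32_t value) {
        if (offset == kLoad) load_ = value;
        if (offset == kCtrl && (value & 1))  // enable bit set: arm the timer
            q_.schedule(q_.now + load_, [this] { irq_(); });
    }
private:
    static constexpr uint32_t kLoad = 0x0, kCtrl = 0x4;
    EventQueue& q_;
    std::function<void()> irq_;
    uint32_t load_ = 0;
};
```

Because nothing happens between register accesses and the scheduled event, the simulator does no per-cycle work for the timer, which is exactly the efficiency gain the move away from RTL-style thinking provided.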
Simulation Environment Assembly and Debug
Before running the production application and RTOS software, it was critical that the simulation first be fully functional. An in-house lab test program and associated platform software was used to validate the simulation. In the lab, the Windows resident application communicated to a hardware signaling box via Ethernet. A SystemC model was outfitted with socket calls to effectively replace the signaling box. The debug of the hardware models was accomplished through use of the (embedded CPU) software debugger, transaction-based hardware waveforms, and the SystemC debugging capabilities provided by the Vista tools. In this case, our customer had the luxury of knowing that the software diagnostics and testbench were correct. This allowed the virtual prototype to be verified with the same diagnostic software and the same lab test programs that were used with the physical system.
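The substitution of the signaling box can be pictured as coding the simulation model against the same channel interface the lab application already expects. This is a hedged sketch with invented names: the real model was outfitted with socket calls (the BSD socket/connect/send/recv family), for which an in-memory loopback stands in here to keep the example self-contained.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Abstract frame channel: the Windows-resident testbench only cares
// that frames go out and come back, not what sits on the other end.
// Names here are illustrative, not the actual model's API.
class SignalingChannel {
public:
    virtual ~SignalingChannel() = default;
    virtual void send(const std::vector<uint8_t>& frame) = 0;
    virtual std::vector<uint8_t> receive() = 0;
};

// In the real flow, a socket-backed implementation replaced the
// Ethernet-attached hardware signaling box; this in-memory loopback
// plays that role for the sketch.
class LoopbackChannel : public SignalingChannel {
public:
    void send(const std::vector<uint8_t>& frame) override {
        q_.push_back(frame);
    }
    std::vector<uint8_t> receive() override {
        auto f = q_.front();
        q_.pop_front();
        return f;
    }
private:
    std::deque<std::vector<uint8_t>> q_;
};
```

Keeping the interface identical is what allowed the unmodified lab application and diagnostics to drive the virtual prototype exactly as they drove the physical system.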
Model latencies were controlled by setting timing variables in a run-time simulation parameter file. A standard Dhrystone benchmark was run on the real hardware, and the simulation timing variables were then tweaked to match real Dhrystone performance. This was done with cache on when tuning aggregate CPU instructions per cycle and with cache off when tuning the path to/from memory. The simulation environment was tuned to be within a few percentage points of the real hardware.
Figure 2. Platform validation.
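The tuning step described above can be sketched as a small calibration loop. The parameter names are hypothetical stand-ins for the run-time parameter file's timing variables, and the simulation step is passed in as a callable; in the real flow, each adjustment meant rerunning the Dhrystone benchmark in simulation and comparing against the score measured on real hardware.

```cpp
#include <cmath>

// Timing knobs of the kind held in the run-time parameter file
// (names and units are illustrative).
struct TimingParams {
    double cpu_ipc;         // aggregate instructions per cycle (tuned cache on)
    double mem_latency_ns;  // path to/from memory (tuned cache off)
};

// Simplified calibration: scale one knob until the simulated benchmark
// score is within `tol` (e.g. 0.02 for "a few percentage points") of
// the real-hardware score. Returns the final relative error.
template <typename SimFn>
double calibrate_ipc(TimingParams& p, SimFn simulate,
                     double real_score, double tol) {
    for (int i = 0; i < 50; ++i) {
        double sim = simulate(p);                 // rerun the benchmark
        double err = (sim - real_score) / real_score;
        if (std::fabs(err) <= tol) return err;
        p.cpu_ipc *= real_score / sim;            // proportional correction
    }
    return (simulate(p) - real_score) / real_score;
}
```

The same loop shape applies to the cache-off memory-latency tuning; only the knob being scaled changes.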
A scalable transaction-level modeling methodology separates communication, functionality, and the architectural aspects of timing and power into distinct models. Such a model can run in Loosely Timed (LT) mode at very high speed, or it can switch to Approximately Timed (AT) mode, under software control, for more detailed performance and power evaluation. Using the TLM 2.0 AT mode, the system simulated only 300 times slower than the real system: a little over 10 seconds of real-time data processing was simulated in under an hour. Compared with our estimate that the same simulation at the RTL would have taken weeks, simulation performance was very good.
Interestingly, LT mode actually simulated faster than the real hardware. In this case, however, LT execution was not required, because the board already existed and the focus was performance analysis; LT will be leveraged in future projects where there is a need for system-level verification of hardware/software functionality.
Figure 3. TLM 2.0 LT and AT timing modes.
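The LT/AT trade-off can be caricatured in a few lines of plain C++ (the quantum size and names are illustrative, not the TLM 2.0 API): an LT initiator accumulates access delays in a local time offset and synchronizes with the simulation kernel only once per quantum, while an AT initiator synchronizes on every access, exposing contention and latency at the cost of simulation speed.

```cpp
#include <cstdint>

enum class TimingMode { LT, AT };

// Sketch of temporal decoupling: sync_count is a proxy for simulation
// cost (each sync stands in for a wait() call into the kernel).
struct Initiator {
    TimingMode mode;
    uint64_t local_offset_ns = 0;  // LT local time offset
    uint64_t sync_count = 0;
    static constexpr uint64_t kQuantumNs = 1000;  // illustrative quantum

    void access(uint64_t delay_ns) {
        local_offset_ns += delay_ns;
        if (mode == TimingMode::AT || local_offset_ns >= kQuantumNs) {
            ++sync_count;          // stand-in for wait(local_offset)
            local_offset_ns = 0;
        }
    }
};
```

Ten 100 ns accesses cost an LT initiator one synchronization but an AT initiator ten, which is the essence of why LT outran the real hardware while AT ran about 300x slower than real time.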
Gathering cache hit, miss, flush, and thrashing information was required for cache analysis. Based on our customer’s unique requirements, Mentor was able to provide callbacks for individual cache operations. User code was written to process the cache operations into useful statistics and generate a report at the end of simulation. At the start of the project, it was assumed that CPU idle time could be increased through more efficient use of the cache. However, this turned out not to be the case: the measured cache efficiency of 87% was better than the original 80% estimate, owing to the small memory footprint of the production software and the efficient coding of the data processing loops.
Figure 4. In upper right Vista screen capture, graphing the CPU bus latency reveals periodic access to a slow peripheral in the system. The lower left screen capture graphs data cache statistics: red—cumulative average cache hit rate; green—decaying average of last 100 cache accesses; blue—number of new memory pages.
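User code of the kind described, hung off the per-operation cache callbacks, might look like the following plain-C++ sketch. The hook names are hypothetical; the actual callbacks were supplied by the Vista tool set, and the real report also tracked decaying averages and page-level thrashing as shown in Figure 4.

```cpp
#include <cstdint>

// Accumulates per-operation cache callbacks into end-of-simulation
// statistics (hook names are hypothetical).
struct CacheStats {
    uint64_t hits = 0, misses = 0, flushes = 0;
    void on_hit()   { ++hits; }
    void on_miss()  { ++misses; }
    void on_flush() { ++flushes; }
    // Cumulative hit rate; a value like 0.87 corresponds to the
    // measured 87% cache efficiency reported in the text.
    double hit_rate() const {
        uint64_t total = hits + misses;
        return total ? static_cast<double>(hits) / total : 0.0;
    }
};
```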
Several potential improvements were found; most centered on latency in the operating system and hardware drivers. Examination of the hardware waveforms revealed a large delay from the end of a minor frame until the CPU's acknowledgment of a timer interrupt. Tracing through the code showed that this time was dominated by a large number of function layers in the operating system and device driver. It was also postulated that there was some inefficiency in the Ada compilation of the hardware interface code.
Another area of concern was high bus activity to a non-local memory (located far from the main bus in a standard part). Based on end-user feedback, the non-local memory was being used to store shadow register information. The shadow registers could be more efficiently stored in the main SRAM.
This work has resulted in a detailed report for our customer’s end-user, describing how the hardware can be better utilized in their next software release. In addition, a baseline simulation model will be used to explore architecture options for the next-generation hardware platform.
Product Specialist Consultant
With over 22 years in EDA, Mike Bradley has experience in hardware and software engineering as well as embedded, simulation, and emulation technologies. His hardware and software background coalesced with the Seamless product line and progressed to system-level design on pace with the industry’s rising abstraction curve. Mike has a BSEE from Rensselaer Polytechnic Institute.
Jon McDonald is Sr. Technical Marketing Engineer at Mentor Graphics. He received a BS in Electrical and Computer Engineering from Carnegie Mellon and an MS in Computer Science from Polytechnic University. He has been active in digital design, language-based design, and architectural modeling for over 15 years. Prior to joining Mentor Graphics, Mr. McDonald held senior technical engineering positions with Summit Design, Viewlogic Systems, and HHB Systems.
If you found this article to be of interest, visit EDA Designline, where you will find the latest design, technology, product, and news articles covering all aspects of Electronic Design Automation (EDA).