Architecture Performance Optimization
This section describes some of the experiments we have carried out using the performance model of our SoC platform. First we elaborate on the generation of stimuli, and then we report the results from some of the simulated scenarios.
Traffic Generation Utility
As the first step in setting up the architecture exploration experiments we created a small utility, which generates input trace files for the GFRBM initiators from a high-level traffic description. The traffic description is tailored to our image processing accelerators, which access the memory in a very specific way. The traffic description file contains the following attributes:
- Memory layout related attributes: number of accessed memory regions, start-address, image size, tile size, stride
- Transaction related attributes: access type (read/write), burst size, burst type, interleaving type (sequential/interleaved), inter-burst delay, inter-tile delay
These attributes can be easily derived from the specification of the blocks.
As an example, for a video preview with MMI use case, the Rendering Engine accesses seven memory regions:
- it reads the YUV Graphic Lists
- it reads the 3 memory regions storing Y, U, and V
- it reads the RGB Graphic Lists
- it reads the RGB overlay image
- it writes the RGB picture
The corresponding traffic description file for the write accesses to the RGB region has the following attributes:
- width of the memory region: 640
- height of the memory region: 240
- stride: 640
- width of one tile: 32
- height of one tile: 16
- burst-size: 8
- burst-type: INCR
- inter burst delay: 5
The generated traffic follows the same pattern as the real component: the RGB picture is stored in tiles in a 2-dimensional memory layout, and each tile is written as 16 INCR8 burst accesses. The cycle-accurate generation of the traffic is of utmost importance, because the performance of interconnect and memory subsystems is very dependent on the exact timing of the transactions.
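The tiled write pattern described above can be sketched in a few lines of Python. This is only an illustration, not the utility itself: the class name, the field names, and the output format are hypothetical. The pixel size (2 bytes) and the bus width (8 bytes per beat) are assumptions, chosen only because they make one tile row equal to one INCR8 burst, which reproduces the 16 bursts per tile stated above. The timing model (one beat per cycle, with the inter-burst delay counted after the last beat) is likewise a simplification.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class TrafficDescription:
    """Hypothetical container for the attributes of the RGB write region."""
    start_address: int           # base address of the region (assumed)
    width: int = 640             # region width in pixels
    height: int = 240            # region height in pixels
    stride: int = 640            # pixels between vertically adjacent rows
    tile_width: int = 32
    tile_height: int = 16
    burst_size: int = 8          # beats per burst (INCR8)
    inter_burst_delay: int = 5   # idle cycles between bursts
    bytes_per_pixel: int = 2     # ASSUMPTION, not given in the article
    bytes_per_beat: int = 8      # ASSUMPTION: 64-bit data bus


def generate_bursts(td: TrafficDescription) -> Iterator[Tuple[int, int]]:
    """Yield (issue_cycle, start_address) for each burst, tile by tile."""
    burst_bytes = td.burst_size * td.bytes_per_beat
    row_bytes = td.tile_width * td.bytes_per_pixel
    if row_bytes % burst_bytes != 0:
        raise ValueError("a tile row must be a whole number of bursts")
    bursts_per_row = row_bytes // burst_bytes

    cycle = 0
    for tile_y in range(0, td.height, td.tile_height):
        for tile_x in range(0, td.width, td.tile_width):
            # Walk the rows of one tile; consecutive rows are one stride apart.
            for row in range(td.tile_height):
                pixel_index = (tile_y + row) * td.stride + tile_x
                row_addr = td.start_address + pixel_index * td.bytes_per_pixel
                for b in range(bursts_per_row):
                    yield cycle, row_addr + b * burst_bytes
                    # Simplified timing: one beat per cycle plus the delay.
                    cycle += td.burst_size + td.inter_burst_delay
```

Under these assumptions the 640x240 region is covered by 300 tiles of 16 bursts each. Changing `inter_burst_delay` or the tile geometry is the kind of attribute change used to mimic different operating modes.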
By modifying the attributes of the traffic description we can mimic different design parameters and operating modes, such as the frame rate of the video camera. This makes it very easy for us to set up all kinds of scenarios that can occur in the real platform. The accuracy of the traffic generation has been validated by comparing the traces from the ESL model against reference traces. The reference traces were obtained from the development board of the previous design, which uses the same IP blocks for the multimedia subsystem.
Performance Analysis Scenarios
In this section we discuss experiments we conducted with the ESL performance model of our chip architecture. The absolute performance metrics are only available to our customers. Here we restrict the results to relative numbers.
A snapshot of the performance analysis results is depicted in Figure 2.
Figure 2. Performance analysis results.
The lower left view shows the contribution of each initiator to the overall transaction throughput. The upper right view shows the relative contention in each of the 3 output stages, which are connected to the input ports of the memory controller. The results are statistically aggregated over intervals of 500 microseconds to analyze the dynamics of the system over time. This view allows us to easily identify bottlenecks in the interconnect and memory subsystem.
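The interval-based aggregation behind such a view can be sketched as follows. This is a minimal illustration and not the analysis code of the tool: the trace format, a sequence of `(timestamp_us, initiator, num_bytes)` tuples, and the function name are hypothetical. Only the 500-microsecond window comes from the text. The result is expressed as relative shares, in line with the relative numbers reported here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_US = 500.0  # aggregation interval taken from the text


def throughput_share(
    transactions: Iterable[Tuple[float, str, int]],
    window_us: float = WINDOW_US,
) -> Dict[int, Dict[str, float]]:
    """Return, per time window, each initiator's share of transferred bytes.

    `transactions` is a hypothetical trace of
    (timestamp_us, initiator, num_bytes) tuples.
    """
    totals: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for timestamp_us, initiator, num_bytes in transactions:
        window = int(timestamp_us // window_us)
        totals[window][initiator] += num_bytes

    shares: Dict[int, Dict[str, float]] = {}
    for window, per_initiator in sorted(totals.items()):
        window_total = sum(per_initiator.values())
        if window_total == 0:
            continue  # no payload in this window, nothing to normalize
        shares[window] = {
            name: count / window_total for name, count in per_initiator.items()
        }
    return shares
```

For example, a trace with 64 bytes from the camera and 192 bytes from the display controller inside the first 500 microseconds gives shares of 0.25 and 0.75 for window 0.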
The following enumeration briefly summarizes the results we obtained from our performance analysis studies.
In the first scenario we investigated the impact of the data organization. The memory controller supports "full-row", "full page", and "bank-interleaved" operation modes, and it can map data differently into two separately configurable memory regions by means of a physical-to-logical address conversion.
We were able to simulate several combinations and find a good trade-off between throughput and power consumption.
Validation of Quality of Service
In this exercise we validated that the multimedia subsystem does not impair the performance of the other parts of the system (MCU, modem subsystem). The multimedia components (Camera, Rendering Engine, and Display Controller) share one port on the memory controller, whereas the other ports are reserved for other subsystems. We applied the stimuli representing the other subsystems to the memory controller ports and measured the resulting throughput and latency. Not surprisingly, the memory controller is able to separate the traffic streams from the different memory ports, so the low-latency requirements of the MCU are satisfied regardless of whether the multimedia subsystem is active.
Increased Bus Frequency
Here we investigated the potential for increasing the memory throughput by using a higher clock frequency for the memory controller. Increasing the memory controller clock by 17% increases the overall throughput by less than 5%. We discarded this option due to the high effort we foresee to implement this change.
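One way to read these two numbers is through a simple Amdahl-style model. This is an illustrative assumption of ours, not part of the simulation results: suppose a fraction f of the transfer time scales with the memory controller clock and the remainder does not. A 17% faster clock with a throughput gain S below 5% then bounds f as follows.

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{1.17}} < 1.05
\quad\Longrightarrow\quad
f < \frac{1 - 1/1.05}{1 - 1/1.17} \approx 0.33
```

Under this simplified model, less than about one third of the transfer time would be governed by the memory controller clock, which is consistent with the modest gain observed.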
Enable Bufferable Flag
The AHB protocol allows a "bufferable" flag to be specified for each transaction. The memory controller could take advantage of this information, because enabling the internal buffers would improve the memory bandwidth and reduce the transaction latency. However, this flag is currently not used by the multimedia subsystem. We added the bufferable flag to the STL stimuli files where applicable and found a 10% improvement compared to the default setting.
Memory Controller Configuration
We found that the current driver software for the memory controller does not exploit the full performance potential of this complex block. By adjusting the Quality-of-Service settings of the memory controller to the current operation mode the memory bandwidth can be significantly improved.
Summary and Outlook
We are very happy with the new way of doing performance analysis, as our initial work immediately provided value to our product design. Previously we used spreadsheets for very high-level analysis, and we validated the performance using emulation only once the RTL became available. Spreadsheets are no longer able to capture the effects of multiple levels of arbitration and queuing in multi-master systems. Emulation comes far too late and is not flexible enough for architecture performance studies; for example, it is not easy to vary the memory controller clock independently of the bus clock.
We have adopted CoWare Platform Architect together with the CoWare ESL Model Library. ESL design in a commercial tool environment gives us far more flexibility to explore architectural alternatives and quantify potential performance improvements. By changing the attributes of the traffic generation utility we can easily set up a large set of scenarios, which would be far more difficult with the real IP blocks. The ESL model also gives us a lot more flexibility, as we can freely modify the bus clock, the arbitration policy and priorities, and even the bus topology.
Moving forward, we are planning to replace the RTL model of the memory controller and memory with a SystemC transaction-level model (TLM). This will further improve the simulation speed and give us more flexibility to explore further architectural options. The SystemC TLM models will either be generated from the RTL model or be created manually by our central IP modeling team.
For the next generation of our NXP cellular systems product platforms, we will continue to use and extend this approach to carry out much broader architectural studies, for example to assess the benefit of replacing the AHB multi-layer bus with an AXI bus. Our next-generation products will encompass an increasingly integrated set of audio, image and video, and telecom features. This method will contribute to an optimization of the system architecture very early in the development phase.
About the Authors:
Danilo Piergentili is a system architect in the Feature Phone Product Line of the Mobile & Personal Business Unit at NXP Semiconductors. He graduated from the University of Rome "Tor Vergata" with a master's degree in electronics. He can be reached at: email@example.com.
David Coupe is a multimedia architect in the Feature Phone Product Line of the Mobile & Personal Business Unit at NXP Semiconductors. He graduated from ISEN in Lille (FR) with a microelectronics engineering diploma in 1987. He can be reached at: firstname.lastname@example.org.