Our team is chartered to validate and optimize the architecture of our NXP mobile phone chips. This is a very challenging application domain, as an ever-increasing set of multi-media and wireless communication functions needs to be integrated into one SoC. Next to a growing number of communication standards, today's mobile phones support a large variety of multi-media applications like MP3 audio, video recording and playback, and digital still camera.
The trend towards high-quality multimedia content and higher communication bandwidth drastically increases the complexity of the underlying SoC architecture. In previous designs a single application processor was sufficient to run the rather simple phone software and to control the modem subsystem. Today numerous dedicated IP blocks are necessary to perform the multimedia functions with the required performance and energy efficiency.
Figure 1. Block diagram of a multi-media mobile phone.
The high-level block-diagram of the multi-media subsystem of a mobile phone is depicted in Figure 1. The four components on the top are initiators on the bus, whereas the multi-port memory controller is a target.
- The Micro-Controller Unit (MCU) runs the high-level application software, such as the user interface and personal information management, and controls the other components in the system.
- The camera interface provides two main functions. It delivers YUV encoded data from a continuous incoming dataflow from the sensor (viewfinder function). It also delivers JPEG compressed frames either for single-shot or multi-shot mode (capture function). In either case, the produced data is written to memory as frame buffers of variable size (QVGA for viewfinder, 3 megapixels or more for JPEG data).
- The rendering engine reads the data produced by the camera interface blocks and combines it with the man-machine interface graphics. The rendering process includes color conversion, affine transformations (translation, rotation, scaling, mirroring, shearing), and blending operations. In viewfinder mode the camera interface produces YUV data whereas the man-machine interface graphics are typically in RGB format. The rendering engine produces a combined RGB image in the main memory.
- The display controller fetches the RGB frame from the memory and shifts the data to the frame buffer of the LCD screen. It acts like a smart DMA with basic color format conversion.
- Interconnect and memory subsystems of the platform are essentially the backbone of the entire SoC, and have to deliver the required communication bandwidth for all the IP blocks. They combine access to internal memory resources with access to common external memory.
Design Time Performance Analysis Issues
The goal of the architecture definition phase is to determine the optimal configuration of the design parameters in interconnect and memory subsystems, in order to deliver sufficient performance at minimal cost. In the past, the performance requirements were analyzed using spread-sheets. However, this static performance analysis approach is not applicable for the complexity of today's SoC platforms.
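To make the limitation concrete, the following is a minimal sketch of the kind of static, spreadsheet-style estimate referred to above, applied to the viewfinder path of Figure 1. The frame rate, the pixel formats, and the function names are illustrative assumptions and are not taken from the actual platform; the read of the man-machine interface graphics by the rendering engine is left out for brevity.

```python
# Static, spreadsheet-style bandwidth estimate for the viewfinder path.
# All numeric values are illustrative assumptions, not platform figures.

QVGA_WIDTH, QVGA_HEIGHT = 320, 240   # QVGA viewfinder resolution
FPS = 30                             # assumed frame rate
YUV_BYTES_PER_PIXEL = 2              # assumed YUV 4:2:2 layout
RGB_BYTES_PER_PIXEL = 2              # assumed RGB565 layout


def stream_bandwidth(width, height, bytes_per_pixel, fps):
    """Average bytes per second moved by one frame-buffer stream."""
    return width * height * bytes_per_pixel * fps


def viewfinder_streams():
    """Average bandwidth of each memory stream in viewfinder mode."""
    yuv = stream_bandwidth(QVGA_WIDTH, QVGA_HEIGHT, YUV_BYTES_PER_PIXEL, FPS)
    rgb = stream_bandwidth(QVGA_WIDTH, QVGA_HEIGHT, RGB_BYTES_PER_PIXEL, FPS)
    return {
        "camera write (YUV)": yuv,
        "render read (YUV)": yuv,
        "render write (RGB)": rgb,
        "display read (RGB)": rgb,
    }


if __name__ == "__main__":
    streams = viewfinder_streams()
    for name, bw in streams.items():
        print(f"{name:20s} {bw / 1e6:6.2f} MB/s")
    print(f"{'total':20s} {sum(streams.values()) / 1e6:6.2f} MB/s")
```

Such an estimate only yields average bandwidth per stream. It says nothing about burstiness, contention between initiators, or latency, which is where the static approach breaks down.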
Multiple Initiators: As shown in the block diagram, we have a much higher number of IP blocks, which act as masters on the interconnect architecture.
Dynamic traffic: The traffic generated by the multimedia accelerators is rather bursty and greatly varies depending on use cases. As an example, a viewfinder operation will deliver quite regular memory access since the data is processed in raster scan order. On the other hand, functions like video encoding or decoding tend to exhibit scattered memory accesses, especially with the latest generation of video CODECs. Another example is the influence of the frame buffer organization on the memory accesses: a coplanar organization will provide quite linear accesses, whereas a planar organization will require interleaved accesses through several frame buffer planes. Another factor influencing the traffic pattern is the dimension of the accessed objects: single dimension objects will again provide linear accesses, whereas the stride of 2-D objects will induce scattered and interleaved accesses. The combination of all possible configurations rapidly exceeds the capabilities of performance analysis using spread-sheets.
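The difference between linear and strided accesses can be sketched with a small address generator. This is an illustrative model only: the beat size, stride, and function names are assumptions, and a real initiator would additionally group beats into bursts.

```python
# Address sequences for a 1-D object versus a 2-D window inside a larger
# frame buffer. Beat size and all names are illustrative assumptions.

def linear_addresses(base, num_bytes, beat_bytes=4):
    """Addresses touched when reading a contiguous 1-D object."""
    return [base + off for off in range(0, num_bytes, beat_bytes)]


def window_addresses(base, stride_bytes, row_bytes, rows, beat_bytes=4):
    """Addresses touched when reading a 2-D window row by row.

    stride_bytes is the distance between the starts of two consecutive
    rows in memory (the width of the enclosing frame buffer).
    """
    addrs = []
    for row in range(rows):
        row_base = base + row * stride_bytes
        addrs.extend(row_base + off for off in range(0, row_bytes, beat_bytes))
    return addrs


def count_jumps(addrs, beat_bytes=4):
    """Number of places where the next address is not the adjacent beat."""
    return sum(1 for a, b in zip(addrs, addrs[1:]) if b - a != beat_bytes)


if __name__ == "__main__":
    flat = linear_addresses(base=0x1000, num_bytes=64)
    tile = window_addresses(base=0x1000, stride_bytes=640, row_bytes=16, rows=4)
    print(len(flat), count_jumps(flat))
    print(len(tile), count_jumps(tile))
```

Both sequences transfer the same number of bytes, but the 2-D window breaks the address stream at every row boundary. In a DRAM-based system each such break may touch a different row or bank, which is one reason the effective bandwidth depends on the access pattern and not only on the data volume.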
Arbitration: To cope with such a complex workload, we need multiple levels of arbitration and queuing in the bus matrix and in the multi-port memory controller. This hierarchical arbitration mechanism cannot be accurately predicted without a proper system simulation environment.
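Even a single arbitration level shows why the outcome is hard to predict by hand. The toy model below compares fixed-priority and round-robin arbitration for single-cycle requests. It is a deliberately simplified sketch with assumed names and traffic, not a model of the bus matrix or of the multi-port memory controller.

```python
from collections import deque

# Toy single-level arbiter: one grant per cycle, single-cycle requests.
# Initiator names and request timing are illustrative assumptions.

def simulate(arrivals, policy, cycles):
    """Return per-initiator wait times (grant cycle minus arrival cycle).

    arrivals: dict mapping initiator name to the set of cycles at which
              it issues one request; dict order defines fixed priority.
    policy:   "fixed" or "round_robin".
    """
    names = list(arrivals)
    queues = {n: deque() for n in names}
    waits = {n: [] for n in names}
    last = -1  # index of the most recently granted initiator
    for cycle in range(cycles):
        for n in names:
            if cycle in arrivals[n]:
                queues[n].append(cycle)
        if policy == "fixed":
            order = list(range(len(names)))
        elif policy == "round_robin":
            order = [(last + 1 + i) % len(names) for i in range(len(names))]
        else:
            raise ValueError(f"unknown policy: {policy}")
        for idx in order:
            name = names[idx]
            if queues[name]:
                waits[name].append(cycle - queues[name].popleft())
                last = idx
                break
    return waits


if __name__ == "__main__":
    traffic = {"mcu": {0, 1, 2, 3}, "camera": {0, 1, 2, 3}, "display": {0}}
    for policy in ("fixed", "round_robin"):
        print(policy, simulate(traffic, policy, cycles=12))
```

With this traffic the lowest-priority initiator is served much later under fixed priority than under round-robin, while the highest-priority initiator sees the opposite effect. Stacking several such stages with queues in between quickly makes the end-to-end latency impractical to derive without simulation.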
QoS: The Memory Controller offers advanced Quality-of-Service (QoS) features like bandwidth reservation for the multimedia blocks and a low latency access for the MCU.
This results in the following set of configuration parameters, which should be optimized by the architect:
Interconnect: bus-width, clock-period, topology, arbitration algorithm, priorities.
Memory Controller: bus-width, number of ports, low latency versus high bandwidth port, buffering, number of access beats.
The design space becomes even larger by the configuration parameters in the IP blocks: the data-layout of the video frames in the memory and the memory access pattern are highly configurable. This in turn has a significant impact on the effective DRAM bandwidth and access latency. One can typically face tens of parameters. The key challenge is then to isolate the most important ones and ensure the right tuning setup.
The ESL tool helps us to traverse the design space in a coordinated way. This is done by setting up simulation runs to sweep certain design parameters. The analysis results from the simulations are stored in separate databases. The comparison of the results unveils the significance of a design parameter with respect to a certain performance metric, such as the time it takes to render one frame, the bandwidth headroom on the bus, or the average latency of the MCU transactions. Understanding the significance of a design parameter allows us to decide whether the corresponding implementation cost would be justified by the performance improvement.
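The sweep-and-compare idea can be sketched in a few lines. In the sketch below, `run_simulation` is a hypothetical stand-in for one simulation run; it uses a toy analytic model with an assumed fixed overhead per burst and does not represent the ESL tool or its interface. The parameter names and values are likewise assumptions.

```python
from itertools import product
from statistics import mean

# Parameter sweep with a stand-in metric. run_simulation() is a toy
# analytic placeholder for a real simulation run (assumption).

def run_simulation(bus_width_bits, clock_period_ns, burst_beats):
    """Return the time in microseconds to move one QVGA frame."""
    frame_bytes = 320 * 240 * 2          # assumed 2 bytes per pixel
    beats = frame_bytes // (bus_width_bits // 8)
    bursts = beats // burst_beats
    cycles = beats + 4 * bursts          # assumed 4 overhead cycles per burst
    return cycles * clock_period_ns / 1000.0


def sweep(space):
    """Run every combination of parameter values in the design space."""
    names = list(space)
    results = []
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        results.append((config, run_simulation(**config)))
    return results


def significance(results, name):
    """Spread of the mean metric across the values of one parameter."""
    by_value = {}
    for config, metric in results:
        by_value.setdefault(config[name], []).append(metric)
    means = [mean(v) for v in by_value.values()]
    return max(means) - min(means)


if __name__ == "__main__":
    space = {
        "bus_width_bits": [32, 64],
        "clock_period_ns": [5, 10],
        "burst_beats": [4, 8],
    }
    results = sweep(space)
    for name in space:
        print(f"{name:16s} spread = {significance(results, name):7.1f} us")
```

Ranking the parameters by the spread they cause in the metric is one simple way to separate the few parameters that matter from the many that do not. Other measures of significance are equally possible.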
Run-Time Performance Analysis Issues
Many performance-relevant parameters can be configured at run-time by the embedded software. Therefore performance analysis is not only a design-time issue, but also a post-silicon run-time issue.
For example, during embedded software development we faced a performance limitation with our current GSM chips. Both the baseband and the application processor use the SDRAM memory controller. When the multimedia application was tested on the prototype board, the software developer saw black pixels in the image processed by the camera sensor. We analyzed this issue in the architecture group using the prototype board and logic analyzers. It was very difficult to identify the root cause by just looking at hardware traces coming from the board. It turned out that this was not a functional error in the hardware, but that the memory controller was not configured correctly by the software. This led to congestion in the queues of the memory arbiter, which in the end caused the pixel errors on the display. This issue could have been found and corrected earlier with a different method.
In summary, today's highly re-usable IP blocks offer many design-time and run-time configuration parameters to tune the block for the specific SoC. However, traditional design methodologies hinder us from taking advantage of all this flexibility.
Using spreadsheets is no longer an option, because QoS based queuing and arbitration of dynamic workloads makes it impossible to predict the actual performance and utilization for a specific configuration of interconnect and memory subsystems.
Using RTL simulation is not an option for architecture analysis due to the long turn-around time for compiling and running a simulation. We also lack statistical analysis capabilities for performance related metrics like throughput and latency of the different components in the system.
Emulation solves the simulation speed issue of RTL simulations and is heavily used for RTL sign-off. However, the late availability, the lack of performance analysis views, and the long turn-around times are not addressed, so emulation alone is not a good fit for early architecture analysis studies.
Development boards are typically used for post-silicon SW debugging, but this is not at all suitable for performance analysis. The visibility, especially into interconnect and memory architecture, is very limited. This makes it very hard to identify the origin of a performance issue.
The shortcomings of our current design and debugging methodology motivated us to try out a new approach based on commercially available ESL technology, which is described in the next section. We have selected CoWare Platform Architect as a SystemC-based ESL environment for platform capture and performance analysis. Together with the RTL co-simulation capabilities and the ESL model library it delivers all the necessary ingredients for our architecture exploration and validation use-model.
The goal was to build an ESL model of the performance relevant portion of the SoC platform. It is important to define the right modeling approach. For our use-case, we need cycle accuracy for interconnect and memory subsystems, which are at the center of our investigation. Also the traffic needs to be sufficiently accurate to reproduce specific scenarios. Our job is to define the SoC architecture, so obviously we don't want to spend too much time creating the models ourselves.
Given these requirements on the model accuracy and modeling effort, we decided to use a combination of models for the components in our platform.
The easiest part was the modeling of the AHB bus matrix. Here we used the fully cycle-accurate SystemC TLM model, which is available in the commercial ESL model library. We only had to instantiate, connect and configure the bus nodes according to our wishes. The AHB model provided all the configuration options of the real AHB protocol and is instrumented with all the integrated performance analysis views we need for our architectural investigation.
We did not have SystemC models of the initiator IP blocks (MCU, camera interface, rendering engine, and display controller), and we did not want to spend the time to create them. In any case, we are only interested in the bus transactions generated by these components and not in their functional behavior. Therefore we used the Generic File Reader Bus Master (GFRBM) provided by CoWare. This model reads in a transaction trace file in the Socket Transaction Language (STL) and generates the corresponding bus transactions. The GFRBM is generic in that it can be hooked to different bus protocols by means of transactors and generates cycle accurate traffic.
For the memory subsystem we used RTL co-simulation between the RTL memory controller and memory on the one side and the rest of the SystemC model on the other side. We had no ESL model of our proprietary memory subsystem available, and it would have been too much effort to create a cycle-accurate model of a complex multi-port memory controller ourselves. As a replacement we used the RTL co-simulation capability provided by the commercial ESL environment. The resulting simulation speed is sufficient for our architecture analysis work.
Assembling the platform from existing library elements and the RTL memory sub-system was straightforward and a matter of a few hours. Of course the majority of the blocks in the system are omitted or modeled using the trace-driven initiators. Still this partial model of our platform provides us with exactly the configurability and accuracy we need for our investigations.
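The principle of a trace-driven initiator can be illustrated with a short sketch. The trace format used here (cycle, R or W, hexadecimal address, length in bytes) is a hypothetical simplification invented for this example. It is not STL syntax, and the code does not reflect how the GFRBM is implemented.

```python
from dataclasses import dataclass

# Trace-driven traffic generation in miniature. The line format
# "<cycle> <R|W> <hex address> <length>" is a made-up simplification
# for illustration and is NOT the Socket Transaction Language.

@dataclass(frozen=True)
class Transaction:
    cycle: int      # cycle at which the request is issued
    kind: str       # "R" for read, "W" for write
    address: int    # start address
    length: int     # number of bytes


def parse_trace(text):
    """Parse a trace and return its transactions sorted by issue cycle."""
    transactions = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        cycle, kind, address, length = line.split()
        if kind not in ("R", "W"):
            raise ValueError(f"unknown transaction kind: {kind!r}")
        transactions.append(
            Transaction(int(cycle), kind, int(address, 16), int(length))
        )
    return sorted(transactions, key=lambda t: t.cycle)


def replay(transactions, send):
    """Hand each transaction to the bus-side callback in issue order."""
    for txn in transactions:
        send(txn)


if __name__ == "__main__":
    sample = """
    # cycle kind address    length
    0       W    0x80000000 32
    12      R    0x80001000 64
    8       W    0x80000020 32
    """
    replay(parse_trace(sample), send=print)
```

The functional behavior of the IP block is absent: only the timing, direction, address, and size of each access are reproduced, which is exactly what the interconnect and the memory controller observe.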
So far we used this performance model of our phone platform for two use-cases:
- We validated that the run-time performance limitation can be easily reproduced in the ESL environment.
- We carried out a set of experiments to explore architectural improvements for the next generation of the platform.
The next section talks in more detail about the architecture exploration experiments.
Validation of Run-Time Performance Analysis Issues
As a first experiment, we validated the performance limitation in the existing platform. We configured the memory controller in the same way as the real software does and stimulated the bus using the Generic File Reader Bus Masters. We converted the original board traces into STL files driving the GFRBM. This way we were able to reproduce the performance limitation as observed in the real system. The bus analysis views immediately revealed the contention on the memory controller ports as the root cause of the problem.
This exercise convinced us of the fidelity of the analysis results obtained from the ESL simulation environment. Subsequently we used the same setup for the investigation of architectural alternatives as described in the next section.