This paper presents the rationale and design behind the first big.LITTLE system from ARM, which is based on the high-performance Cortex-A15 processor, the energy-efficient Cortex-A7 processor, the coherent CCI-400 interconnect and supporting IP.
The range of performance demanded from modern, high-performance mobile platforms is unprecedented. Users expect platforms to handle high processing intensity tasks such as gaming and web browsing while providing long battery life for low processing intensity tasks such as texting, e-mail and audio.
In the first big.LITTLE system from ARM, a ‘big’ ARM Cortex-A15 processor is paired with a ‘LITTLE’ Cortex-A7 processor to create a system that can accomplish both high intensity and low intensity tasks in the most energy-efficient manner. Because the Cortex-A15 and Cortex-A7 processors are coherently connected via the CCI-400 interconnect, the system is flexible enough to support a variety of big.LITTLE use models, which can be tailored to the processing requirements of the tasks.
The central tenet of big.LITTLE is that the processors are architecturally identical. Both Cortex-A15 and Cortex-A7 implement the full ARMv7-A architecture, including the Virtualization and Large Physical Address Extensions. Accordingly, all instructions execute in an architecturally consistent way on both Cortex-A15 and Cortex-A7, albeit with different performance.
The implementation-defined feature set of Cortex-A15 and Cortex-A7 is also similar. Both processors can be configured to have between one and four cores, and both integrate a level-2 cache inside the processing cluster. Additionally, each processor implements a single AMBA 4 coherent interface that can be connected to a coherent interconnect such as CCI-400.
It is in the micro-architectures that the differences between Cortex-A15 and Cortex-A7 become clear. Cortex-A7 (Figure 1) is an in-order, non-symmetric dual-issue processor with a pipeline length of between 8 and 10 stages, whereas Cortex-A15 (Figure 2) is an out-of-order, sustained triple-issue processor with a pipeline length of between 15 and 24 stages.
Fig 1: Cortex-A7 pipeline
Fig 2: Cortex-A15 pipeline
Since the energy consumed by the execution of an instruction is partially related to the number of pipeline stages it must traverse, a significant difference in energy between Cortex-A15 and Cortex-A7 comes from the different pipeline lengths.
In general, the Cortex-A15 and Cortex-A7 micro-architectures follow different design philosophies. Where appropriate, Cortex-A15 trades energy efficiency for performance, while Cortex-A7 trades performance for energy efficiency.
A good example of these micro-architectural trade-offs is the level-2 cache design. Sharing a single level-2 cache between Cortex-A15 and Cortex-A7 would have been more area-efficient, but this part of the design benefits from being optimized for either energy efficiency or performance. As such, Cortex-A15 and Cortex-A7 each have their own integrated level-2 cache.
Table 1 illustrates the difference in performance and energy between Cortex-A15 and Cortex-A7 across a variety of benchmarks and micro-benchmarks. The first column describes the uplift in performance from Cortex-A7 to Cortex-A15, while the second column considers both the performance and power difference to show the improvement in energy efficiency from Cortex-A15 to Cortex-A7. All measurements are on complete, frequency optimized layouts of Cortex-A15 and Cortex-A7 using the same cell and RAM libraries. All code that is executed on Cortex-A7 is compiled for Cortex-A15.
Table 1: Cortex-A15 & Cortex-A7 performance & energy comparison
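The way the two columns of Table 1 relate to each other can be sketched in a few lines of Python. This is only an illustration: it assumes that energy efficiency is measured as performance per unit of power, and the function names and all numbers are hypothetical, not values taken from Table 1.

```python
def performance_uplift(perf_a15, perf_a7):
    """How many times faster Cortex-A15 runs a workload than Cortex-A7."""
    return perf_a15 / perf_a7


def efficiency_gain(perf_a15, power_a15, perf_a7, power_a7):
    """How many times more work per unit of energy Cortex-A7 delivers.

    Assumption: efficiency = performance / power.
    """
    efficiency_a15 = perf_a15 / power_a15
    efficiency_a7 = perf_a7 / power_a7
    return efficiency_a7 / efficiency_a15


# Hypothetical, normalized figures chosen only to show the arithmetic.
example = {"perf_a15": 2.0, "power_a15": 6.0, "perf_a7": 1.0, "power_a7": 1.0}

uplift = performance_uplift(example["perf_a15"], example["perf_a7"])
gain = efficiency_gain(**example)
print(f"performance uplift (A7 -> A15): {uplift:.1f}x")
print(f"energy efficiency gain (A15 -> A7): {gain:.1f}x")
```

With these made-up inputs the big processor is twice as fast but draws six times the power, so the LITTLE processor does about three times as much work per unit of energy.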
It should be observed from Table 1 that although Cortex-A7 is labeled the “LITTLE” processor its performance potential is considerable. In fact, due to micro-architecture advances Cortex-A7 provides higher performance than current Cortex-A8 based implementations for a fraction of the power. As such a significant amount of processing can remain on Cortex-A7 without resorting to Cortex-A15.
The system
To create a compelling big.LITTLE solution the system around the processors must also be considered.
A key part is the CCI-400 interconnect, which provides full coherency between Cortex-A15 and Cortex-A7 as well as I/O coherency for components such as a GPU. The interconnect is optimized around the transaction characteristics of Cortex-A15 and Cortex-A7 and around the paths to main memory and the rest of the system, so that a single solution is offered with the highest possible performance.
Fig 3: Cortex-A15 CCI Cortex-A7 system
Another element of the big.LITTLE system is a shared Generic Interrupt Controller (GIC-400). As well as being able to distribute up to 480 interrupts to Cortex-A15 and Cortex-A7, the GIC-400 is programmable, which allows interrupts to be migrated between any cores in the Cortex-A15 or Cortex-A7 clusters.
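The idea of migrating an interrupt between cores can be illustrated with a small software model. The sketch below does not access real hardware: it assumes a distributor that keeps one CPU target mask per interrupt, and the class, its methods, the limit of eight CPUs and the core numbering (cores 0-1 as Cortex-A15, cores 2-3 as Cortex-A7) are all hypothetical choices made for illustration.

```python
class DistributorModel:
    """Toy model of an interrupt distributor with one target mask per interrupt."""

    def __init__(self, num_interrupts=480, num_cpus=8):
        self.num_cpus = num_cpus
        # One byte per interrupt; bit n set means CPU n receives the interrupt.
        self.targets = bytearray(num_interrupts)

    def route(self, irq, cpu):
        """Deliver interrupt `irq` to exactly one CPU."""
        if not 0 <= cpu < self.num_cpus:
            raise ValueError(f"no such CPU: {cpu}")
        self.targets[irq] = 1 << cpu

    def target_cpus(self, irq):
        """Return the list of CPUs that currently receive `irq`."""
        mask = self.targets[irq]
        return [cpu for cpu in range(self.num_cpus) if mask & (1 << cpu)]

    def migrate(self, irq, src_cpu, dst_cpu):
        """Move `irq` from `src_cpu` to `dst_cpu` by rewriting its target mask."""
        if src_cpu not in self.target_cpus(irq):
            raise ValueError(f"interrupt {irq} is not routed to CPU {src_cpu}")
        self.route(irq, dst_cpu)


# Hypothetical core numbering for a 2 + 2 system.
BIG_CORES = [0, 1]      # Cortex-A15 cluster
LITTLE_CORES = [2, 3]   # Cortex-A7 cluster

gic = DistributorModel()
gic.route(irq=42, cpu=BIG_CORES[0])
gic.migrate(irq=42, src_cpu=BIG_CORES[0], dst_cpu=LITTLE_CORES[0])
print(gic.target_cpus(42))
```

In this model, migration is a single update of the target mask, which is what makes it practical to move an interrupt along with the software that handles it.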
From the perspective of trace and debug, both Cortex-A15 and Cortex-A7 offer trace solutions and are both compliant with the Debug v7.1 architecture. Full support for big.LITTLE debug and trace is provided through CoreSight SoC.
A final point to consider is that a big.LITTLE system incorporating Cortex-A15, Cortex-A7, CCI-400 and the GIC-400 is optimal for all big.LITTLE use models. There is no configurable feature or optimization that favors a particular use model. However, to reduce software complexity in the big.LITTLE task migration use model, it is recommended that the same number of cores be implemented in the Cortex-A15 cluster and the Cortex-A7 cluster.
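One way to see why equal core counts keep task migration software simple is sketched below. It assumes that migration moves the work of each LITTLE core to a fixed partner core in the big cluster, which the text above does not specify; the function names and core labels are hypothetical.

```python
def pair_cores(big_cores, little_cores):
    """Pair each LITTLE core with one big core (assumed one-to-one scheme)."""
    if len(big_cores) != len(little_cores):
        # With unequal clusters, software would need an extra placement policy.
        raise ValueError("clusters differ in size; one-to-one pairing is not possible")
    return dict(zip(little_cores, big_cores))


def migrate_up(task_placement, pairing):
    """Move every task from its LITTLE core to the paired big core."""
    return {task: pairing[core] for task, core in task_placement.items()}


pairing = pair_cores(
    big_cores=["A15-0", "A15-1"],
    little_cores=["A7-0", "A7-1"],
)
placement = {"ui": "A7-0", "audio": "A7-1"}
moved = migrate_up(placement, pairing)
print(moved)
```

When the clusters match in size, the mapping is a fixed lookup. With unequal sizes, the software would also have to decide which tasks share a core after migration.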