Part 2 describes the techniques BDTI used for optimizing DSP algorithms on the Cortex-R4. For more analysis of ARM cores, see "Can the ARM11 Handle DSP?"
In 2004, ARM announced its newest generation of licensable cores, called the "Cortex" family. Cortex cores span a wide range of performance levels, with Cortex M-series cores at the low end, Cortex R-series cores providing mid-range performance, and the Cortex A-series applications processors offering the highest performance. The first Cortex core to be announced was the Cortex-M3, and since then ARM has announced several others, including the Cortex-A8 and A9, the Cortex-M1, and the Cortex-R4.
The Cortex-R4 targets moderately demanding applications such as hard disk drives, inkjet printers, automotive safety systems, and wireless modems. It is marketed as a higher-performance replacement for the older ARM9E core. BDTI recently completed a benchmark analysis of the ARM Cortex-R4 core and is now releasing the first independent signal processing benchmark results for this processor. In this article, we'll take a look at its benchmark results and compare its performance to that of other ARM cores (including the ARM11, another moderate-performance core) and selected competitors.
Table 1 summarizes key attributes of selected ARM processor cores.
Table 1. Characteristics of selected ARM cores.
* Clock speed data provided by ARM, not verified by BDTI. Clock speeds for ARM9E and ARM11 are worst-case speeds in a TSMC CL013G process and ARM Artisan SAGE-X library. Clock speed for Cortex-R4 is worst-case for a 90 nm CLN90G Artisan Advantage implementation. High-end clock speed for Cortex-A8 is based on a custom implementation.
As shown in Table 1, the Cortex-R4 is a superscalar core that can issue and execute up to two instructions per cycle. Like the Cortex-A8, it supports the ARMv7 instruction set architecture and the Thumb-2 compressed instruction set, but the Cortex-R4 does not support the NEON signal processing extensions. As a result, its signal processing capabilities and features are much more limited than those of the Cortex-A8.
The Cortex-R4 as a Signal Processing Engine
The Cortex-R4 targets applications that include moderate signal processing requirements, and the core includes hardware and instructions to help improve its performance on this type of processing. For example, the Cortex-R4 supports SIMD (single instruction, multiple data) instructions that enable it to perform two 16-bit multiply-accumulate operations (MACs) per cycle; MAC operations are heavily used in many common signal processing algorithms, such as filters and FFTs.
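To make the dual-MAC idea concrete, here is a minimal Python sketch that emulates the arithmetic of one such operation: two signed 16-bit products added to a 32-bit accumulator in a single step. It is loosely modeled on the behavior of ARM's SMLAD instruction, but the helper names (pack16, dual_mac16, dot_product_dual_mac) are ours, the sample data is made up, and the sketch ignores overflow flags and timing entirely.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack16(lo, hi):
    """Pack two signed 16-bit samples into one 32-bit word (lo in bits 0-15)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def dual_mac16(acc, x_word, y_word):
    """One dual-MAC step: acc + x.lo*y.lo + x.hi*y.hi, wrapped to 32 bits."""
    x_lo, x_hi = to_signed(x_word, 16), to_signed(x_word >> 16, 16)
    y_lo, y_hi = to_signed(y_word, 16), to_signed(y_word >> 16, 16)
    return to_signed(acc + x_lo * y_lo + x_hi * y_hi, 32)


def dot_product_dual_mac(x, y):
    """Dot product of two equal, even-length int16 sequences, two taps per step."""
    acc = 0
    for i in range(0, len(x), 2):
        acc = dual_mac16(acc, pack16(x[i], x[i + 1]), pack16(y[i], y[i + 1]))
    return acc


# Made-up sample and coefficient values for illustration.
samples = [1000, -2000, 3000, -4000]
coeffs = [3, -1, 4, 2]
result = dot_product_dual_mac(samples, coeffs)
print(result)
```

A four-tap dot product takes two dual-MAC steps here instead of four single MACs, which is the kind of saving that matters in filter and FFT inner loops.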
To assess the Cortex-R4's signal processing capabilities and compare its performance to that of other processors, BDTI benchmarked the Cortex-R4 using the BDTI DSP Kernel Benchmarks, a suite of 12 key DSP algorithms such as FIR filters, FFTs, and a Viterbi decoder. These benchmarks are hand-optimized for each processor, typically in assembly language, and verified by BDTI. The BDTI DSP Kernel Benchmarks have been implemented on a wide variety of processor cores and chips, providing a range of comparison data for evaluating new processors.
BDTI uses processors' results on the DSP Kernel Benchmarks to generate an overall signal processing speed metric, the BDTImark2000. (When the benchmark performance is verified using a simulator rather than hardware, this metric is called the BDTIsimMark2000.) The BDTImark2000 metric combines the number of cycles required to execute each benchmark with the processor's instruction cycle rate (i.e., its clock speed) to determine the amount of time the processor requires to execute the benchmarks. For off-the-shelf chips, we use the fastest clock speed at which the chip is currently shipping. For licensable cores, the clock speed depends on how the core is fabricated. To enable apples-to-apples comparisons, BDTI typically uses clock speeds for cores fabricated in a TSMC 130 nm process under worst-case conditions. ARM has not reported this data for all of its cores, so BDTI has used alternate clock speeds in some cases, as noted in the table above.
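The cycle-count-to-time step can be sketched in a few lines of Python. This shows only the basic conversion (time equals cycles divided by clock rate); how the individual benchmark times are combined into the composite score is not described here, so the sketch does not attempt it. The two cores and their numbers are hypothetical.

```python
def execution_time_us(cycles, clock_mhz):
    """Time in microseconds to run `cycles` clock cycles at `clock_mhz` MHz."""
    return cycles / clock_mhz


# Hypothetical cycle counts for one benchmark on two made-up cores.
core_a = {"cycles": 12_000, "clock_mhz": 400}
core_b = {"cycles": 9_000, "clock_mhz": 200}

time_a = execution_time_us(**core_a)
time_b = execution_time_us(**core_b)
print(f"core A: {time_a} us, core B: {time_b} us")
```

In this made-up case, core B needs fewer cycles but still finishes later because of its slower clock, which is why the metric folds in clock speed rather than relying on cycle counts alone.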
In Figure 1, we present BDTIsimMark2000 scores for selected ARM cores, alongside BDTImark2000 scores for two off-the-shelf DSP processor chips for comparison.
Figure 1. BDTImark2000 scores for selected cores and chips. The BDTImark2000 is a composite DSP speed metric based on processors' results on the BDTI DSP Kernel Benchmarks. A higher score indicates a faster processor. ARM has not provided clock speeds for the Cortex-R4 and Cortex-A8 that conform to BDTI's uniform conditions for cores; therefore, the results for these two cores should not be compared to results for non-ARM cores.
As shown in Figure 1, the Cortex-R4 and ARM11 have similar signal processing performance. (For a full analysis of the ARM11's signal processing performance, see "Can the ARM11 Handle DSP?") The Cortex-R4 is not intended to replace the ARM11; rather, ARM positions the Cortex-R4 as a higher-performance replacement for the ARM9E. Compared to that processor, the Cortex-R4 is nearly three times as fast. Some of the speed increase is due to the Cortex-R4's more powerful architecture (we'll discuss this more later), and some is due to its faster clock speed.
At the clock speeds shown above, the Cortex-R4's signal processing speed is similar to that of the Texas Instruments TMS320C55x, a widely used, mid-range DSP chip. At this level of performance, the Cortex-R4 may be able to subsume the processing typically allocated to a low-cost DSP processor. At 450 MHz, the Cortex-A8 with NEON signal processing extensions is more than twice as fast as the 375 MHz Cortex-R4. (The 450 MHz clock speed used here to calculate benchmark results for the Cortex-A8 is the estimated speed of the core as fabricated in Texas Instruments' OMAP3410 chip.)
From the data presented in Figure 1, it's clear the clock rate accounts for only part of the signal processing speed differences among processors. The other factor is the processors' architectural "power"—that is, how much work each processor can accomplish in each clock cycle. In the next section, we'll look at some of the architectural differences that contribute to the performance numbers shown above.
Behind the Benchmarks
To evaluate and compare processors' architectural strengths and weaknesses, BDTI measures the number of instruction cycles required to execute each of the twelve benchmarks. Cycle counts don't directly assess a processor's signal processing speed (because speed also depends on clock rate) but they do provide a comparison of the relative power of the architecture. The lower the cycle count needed to execute a given amount of work, the more powerful the architecture.
Of course, processors that can execute the benchmarks in fewer cycles (and are therefore more powerful) may require more silicon area than less-powerful processors, or they may consume more energy. Furthermore, processor architects sometimes trade off architectural power for clock speed, so it's important not to assume that greater architectural power will necessarily yield a faster processor.
In Figure 2, we present the BDTIsimMark2000/MHz scores for selected processor cores and chips. This metric evaluates per-cycle throughput on optimized signal processing kernels, and is based on processors' results on the BDTI DSP Kernel Benchmarks.
Figure 2. BDTIsimMark2000/MHz scores for selected processor cores.
As shown in Figure 2, the Cortex-R4 has roughly the same cycle-count efficiency as the ARM11. This may seem surprising since the Cortex-R4 is superscalar and the ARM11 is not. However, the Cortex-R4's dual-issue capability is quite limited. For example, although it can execute an add or subtract operation in parallel with a load or store, it can't execute a MAC instruction in parallel with anything else. As a result, its signal processing throughput is only slightly higher than that of the ARM11.
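A toy issue-slot model in Python shows why such restricted pairing buys little in a MAC-heavy loop. It encodes only the pairing rule given above (add/subtract with load/store is allowed; a MAC issues alone) and assumes every other combination is unpairable. The instruction stream is a hypothetical unrolled dot-product loop, and the model ignores latencies, dependencies, and branches, so it is not a model of the actual Cortex-R4 pipeline.

```python
# Pairing rule taken from the text: an add/subtract ("alu") may issue with a
# load/store ("mem"); a MAC ("mac") always issues alone. Treating every other
# combination as unpairable is an assumption of this toy model.
PAIRABLE = {frozenset({"alu", "mem"})}


def issue_cycles(instructions, dual_issue):
    """Count issue cycles for an in-order stream, ignoring latencies and stalls."""
    cycles = 0
    i = 0
    while i < len(instructions):
        pair = frozenset(instructions[i:i + 2])
        if dual_issue and pair in PAIRABLE:
            i += 2  # two instructions share this cycle
        else:
            i += 1
        cycles += 1
    return cycles


# Hypothetical loop body, unrolled four times: eight packed loads, four dual
# MACs, and one subtract that updates the loop counter.
loop_body = ["mem", "alu", "mem", "mac"] + ["mem", "mem", "mac"] * 3

single = issue_cycles(loop_body, dual_issue=False)
dual = issue_cycles(loop_body, dual_issue=True)
print(single, dual)
```

Under these assumptions, only the loop-counter update finds a partner, so dual issue saves one cycle out of thirteen.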
The Cortex-R4 does have nearly twice the per-cycle signal processing throughput of the ARM9E, which is a single-MAC, single-issue core with very limited parallelism. The Cortex-R4 has twice the data bandwidth of the ARM9E and provides a number of SIMD arithmetic instructions, which the ARM9E lacks.
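The two ratios quoted for the Cortex-R4 versus the ARM9E fit together if overall speed is treated as per-cycle throughput multiplied by clock rate. The short Python sketch below works through that arithmetic, using the rounded figures 3 and 2 as stand-ins for "nearly three times" and "nearly twice". These are approximations for illustration rather than measured data, and the function name is ours.

```python
def implied_clock_ratio(total_speedup, per_cycle_ratio):
    """Clock-rate ratio implied by an overall speedup and a per-cycle ratio."""
    return total_speedup / per_cycle_ratio


# Rounded from the text's "nearly three times" and "nearly twice"; not measured data.
total_speedup = 3.0
per_cycle_ratio = 2.0

clock_ratio = implied_clock_ratio(total_speedup, per_cycle_ratio)
print(clock_ratio)
```

With these rounded inputs, roughly a factor of 1.5 of the overall gain would come from clock speed and a factor of 2 from the architecture.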
Compared to the Cortex-A8 with NEON, the Cortex-R4 has much lower per-cycle signal processing throughput. NEON increases the parallelism of many SIMD arithmetic operations from two to four (for example, the Cortex-A8 with NEON can perform four 16-bit multiplies in parallel, while the Cortex-R4 can do only two).
For comparison purposes, we've also included results for two licensable cores from other vendors: the MIPS 24KEc and the CEVA X1620. The 24KEc is a 32-bit general-purpose processor core with DSP-oriented instruction set extensions; the X1620 is a 16-bit DSP processor core. As shown in Figure 2, the CEVA X1620 has higher per-cycle throughput than all of the ARM cores shown here, though the Cortex-A8 with NEON is very close. The X1620 combines a VLIW (very long instruction word) architecture with SIMD capabilities and can issue and execute up to eight instructions per cycle. Like the Cortex-R4, the X1620 is a dual-MAC processor, but the CEVA core can perform more operations in parallel than the Cortex-R4 and, as a result, requires fewer cycles to execute the BDTI DSP Kernel Benchmarks. The MIPS 24KEc, on the other hand, is a single-issue device, and although it can execute two 16-bit MACs in parallel, it can only load 32 bits of data per cycle. Thus, it cannot always reach its maximum MAC throughput. Overall, its per-cycle throughput is somewhat lower than that of the Cortex-R4.
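The load-bandwidth limit described for the 24KEc can be illustrated with a back-of-the-envelope Python model. It is built only from the attributes listed above (single issue, two 16-bit MACs per instruction, 32 bits loaded per cycle) and is not a description of the actual 24KEc pipeline. The function and parameter names are ours, and the register-reuse scenario is hypothetical.

```python
import math


def cycles_per_dual_mac(load_bits_per_cycle, fresh_operand_bits):
    """Issue cycles for one dual 16-bit MAC on a single-issue core.

    One cycle goes to the MAC itself; the rest go to loading whatever operand
    bits are not already held in registers.
    """
    load_cycles = math.ceil(fresh_operand_bits / load_bits_per_cycle)
    return 1 + load_cycles


# A dual 16-bit MAC consumes two data samples and two coefficients: 4 * 16 bits.
no_reuse = cycles_per_dual_mac(load_bits_per_cycle=32, fresh_operand_bits=64)
# If the coefficients (hypothetically) stay in registers, only the data is loaded.
coeff_reuse = cycles_per_dual_mac(load_bits_per_cycle=32, fresh_operand_bits=32)

print(2 / no_reuse, 2 / coeff_reuse)  # sustained MACs per cycle in each case
```

In this model the peak rate of two MACs per cycle is reached only when every operand is already in a register; any operand that must be fetched costs an extra issue cycle.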
Achieving Maximum Performance
In evaluating processors, speed isn't everything—area, power consumption, ease of programming, and application development infrastructure may be just as important. Nonetheless, it's essential to make sure that the processor has the minimum speed needed to meet the application requirements. The benchmark results we've presented here should help system designers understand the relative signal processing capabilities of the Cortex-R4 core and determine whether it has sufficient speed for their application. However, we have one additional caveat. Achieving the performance results we've presented was not a trivial undertaking; each of the benchmarks was painstakingly hand-optimized in assembly language to squeeze the maximum performance from each processor.
Cortex-R4 users requiring maximum performance will need to perform a similar level of optimization, a process that can be more challenging than on previous-generation ARM cores due to the Cortex-R4's SIMD capabilities, superscalar execution, and deeper pipeline. In part 2, we'll describe some of the optimization techniques we've used for implementing signal processing algorithms on the Cortex-R4.
BDTI provides the industry's most trusted and widely used benchmarks for digital signal processing and video applications. Through benchmarks and analysis, BDTI enables engineers, marketers, and managers to make confident technical and business decisions about technologies for signal processing applications. For more BDTI resources, see www.BDTI.com.