Behind the Benchmarks
To evaluate and compare processors' architectural strengths and weaknesses, BDTI measures the number of instruction cycles required to execute each of the twelve benchmarks. Cycle counts don't directly measure a processor's signal processing speed (because speed also depends on clock rate), but they do allow a comparison of the relative power of the architectures. The fewer cycles an architecture needs to execute a given amount of work, the more powerful it is.
Of course, processors that can execute the benchmarks in fewer cycles (and are therefore more powerful) may require more silicon area than less-powerful processors, or they may consume more energy. Furthermore, processor architects sometimes trade off architectural power for clock speed, so it's important not to assume that greater architectural power will necessarily yield a faster processor.
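The relationship between cycle count, clock rate, and execution time can be sketched in a few lines of Python. This is a minimal illustration, not part of BDTI's methodology: the function name `execution_time_us` and all cycle counts and clock rates below are hypothetical values chosen only to show the trade-off.

```python
def execution_time_us(cycles: int, clock_mhz: float) -> float:
    """Return the time in microseconds needed to run `cycles` cycles.

    A clock of N MHz completes N cycles per microsecond, so the
    execution time is simply cycles divided by the clock rate in MHz.
    """
    return cycles / clock_mhz


# Hypothetical cores; these numbers are NOT benchmark results.
core_a_us = execution_time_us(cycles=1000, clock_mhz=300.0)  # fewer cycles, slower clock
core_b_us = execution_time_us(cycles=1500, clock_mhz=600.0)  # more cycles, faster clock

print(f"core A: {core_a_us:.2f} us, core B: {core_b_us:.2f} us")
```

With these made-up figures, the core that needs 50% more cycles still finishes sooner because its clock is twice as fast, which is why cycle counts alone do not predict speed.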
In Figure 2, we present the BDTIsimMark2000/MHz scores for selected processor cores and chips. This metric evaluates per-cycle throughput on optimized signal processing kernels, and is based on processors' results on the BDTI DSP Kernel Benchmarks.
Figure 2. BDTIsimMark2000/MHz scores for selected processor cores.
As shown in Figure 2, the Cortex-R4 has roughly the same cycle-count efficiency as the ARM11. This may seem surprising, since the Cortex-R4 is superscalar and the ARM11 is not. However, the Cortex-R4's dual-issue capability is quite limited. For example, although it can execute an add or subtract operation in parallel with a load or store, it can't execute a MAC instruction in parallel with anything else. As a result, its signal processing throughput is only slightly higher than that of the ARM11.
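To see why a restricted dual-issue rule helps little on MAC-heavy code, consider the toy model below. It is a deliberate simplification and an assumption on our part, not a description of the real pipeline: it encodes only the single pairing rule mentioned above (an add or subtract with a load or store), pairs only adjacent instructions in program order, and ignores latencies, data dependencies, and every other issue constraint. The names `can_pair` and `count_cycles` and the two instruction streams are hypothetical.

```python
PAIRABLE_ALU = {"add", "sub"}      # simple arithmetic ops in this toy model
PAIRABLE_MEM = {"load", "store"}   # memory ops in this toy model


def can_pair(first: str, second: str) -> bool:
    """True if the two ops are one add/sub plus one load/store."""
    return ((first in PAIRABLE_ALU and second in PAIRABLE_MEM) or
            (first in PAIRABLE_MEM and second in PAIRABLE_ALU))


def count_cycles(ops: list, dual_issue: bool) -> int:
    """Count issue cycles, greedily pairing adjacent ops when allowed."""
    cycles = 0
    i = 0
    while i < len(ops):
        if dual_issue and i + 1 < len(ops) and can_pair(ops[i], ops[i + 1]):
            i += 2   # both ops issue in the same cycle
        else:
            i += 1   # op issues alone (a MAC always lands here)
        cycles += 1
    return cycles


# Hypothetical instruction streams, 16 ops each.
mac_kernel = ["load", "mac"] * 8                     # MAC-dominated inner loop
mixed_code = ["load", "add", "store", "sub"] * 4     # add/sub interleaved with memory ops

for name, ops in (("mac_kernel", mac_kernel), ("mixed_code", mixed_code)):
    print(name, count_cycles(ops, dual_issue=False), count_cycles(ops, dual_issue=True))
```

In this model the mixed stream drops from 16 cycles to 8 with dual issue, while the MAC-dominated stream stays at 16, because no MAC ever shares a cycle with another instruction.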
The Cortex-R4 does have nearly twice the per-cycle signal processing throughput of the ARM9E, which is a single-MAC, single-issue core with very limited parallelism. The Cortex-R4 has twice the data bandwidth of the ARM9E and provides a number of SIMD arithmetic instructions, which the ARM9E lacks.
Compared to the Cortex-A8 with NEON, the Cortex-R4 has much lower per-cycle signal processing throughput. NEON increases the parallelism of many SIMD arithmetic operations from two to four (for example, the Cortex-A8 with NEON can perform four 16-bit multiplies in parallel, while the Cortex-R4 can do only two).
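The effect of doubling SIMD width can be estimated with a one-line calculation. The sketch below assumes, purely for illustration, that one SIMD multiply instruction issues per cycle and that nothing else (loads, latencies, loop overhead) limits throughput; the function name `simd_multiply_cycles` and the workload size are hypothetical.

```python
import math


def simd_multiply_cycles(num_multiplies: int, lanes: int) -> int:
    """Cycles to perform `num_multiplies` 16-bit multiplies, `lanes` at a time."""
    return math.ceil(num_multiplies / lanes)


# Hypothetical workload of 64 independent 16-bit multiplies.
two_wide = simd_multiply_cycles(64, lanes=2)    # two multiplies per cycle
four_wide = simd_multiply_cycles(64, lanes=4)   # four multiplies per cycle

print(two_wide, four_wide)
```

Under these idealized assumptions the four-wide case needs half as many cycles; real kernels will see a smaller gain whenever other resources become the bottleneck.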
For comparison purposes, we've also included results for two licensable cores from other vendors: the MIPS 24KEc and the CEVA X1620. The 24KEc is a 32-bit general-purpose processor core with DSP-oriented instruction set extensions; the X1620 is a 16-bit DSP processor core. As shown in Figure 2, the CEVA X1620 has higher per-cycle throughput than all of the ARM cores shown here, though the Cortex-A8 with NEON is very close. The X1620 combines a VLIW (very long instruction word) architecture with SIMD capabilities and can issue and execute up to eight instructions per cycle. Like the Cortex-R4, the X1620 is a dual-MAC processor, but the CEVA core can perform more operations in parallel than the Cortex-R4 and, as a result, requires fewer cycles to execute the BDTI DSP Kernel Benchmarks.
The MIPS 24KEc, on the other hand, is a single-issue device, and although it can execute two 16-bit MACs in parallel, it can load only 32 bits of data per cycle. Two 16-bit MACs can consume up to four 16-bit operands (64 bits) per cycle, so unless some operands are already held in registers, the load path cannot keep both MAC units busy. Thus, the 24KEc cannot always reach its maximum MAC throughput. Overall, its per-cycle throughput is somewhat lower than that of the Cortex-R4.
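The interaction between MAC units and load bandwidth can be captured in a simple upper-bound model. The sketch below is an assumption-laden illustration rather than a model of any specific core: it treats sustained throughput as the smaller of the number of MAC units and the number of MACs the load path can feed, and it ignores issue limits, latencies, and store traffic. The function name `sustained_macs_per_cycle` and its parameters are hypothetical.

```python
def sustained_macs_per_cycle(mac_units: int,
                             load_bits_per_cycle: int,
                             operand_bits: int = 16,
                             fresh_operands_per_mac: int = 2) -> float:
    """Upper bound on sustained MACs per cycle under a load-bandwidth limit.

    `fresh_operands_per_mac` is how many operands each MAC must fetch from
    memory; operands reused from registers do not count.
    """
    bits_needed_per_mac = operand_bits * fresh_operands_per_mac
    if bits_needed_per_mac == 0:
        return float(mac_units)          # no memory traffic: compute bound
    load_limit = load_bits_per_cycle / bits_needed_per_mac
    return min(float(mac_units), load_limit)


# A dual-MAC core with a 32-bit load path (the configuration described above).
no_reuse = sustained_macs_per_cycle(2, 32, fresh_operands_per_mac=2)
one_reused = sustained_macs_per_cycle(2, 32, fresh_operands_per_mac=1)

print(no_reuse, one_reused)
```

When both operands of every MAC must come from memory, this bound gives one MAC per cycle despite the two MAC units; when one operand per MAC is reused from a register, the bound rises to two.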
Achieving Maximum Performance
In evaluating processors, speed isn't everything: area, power consumption, ease of programming, and application development infrastructure may be just as important. Nonetheless, it's essential to make sure that the processor has the minimum speed needed to meet the application requirements. The benchmark results we've presented here should help system designers understand the relative signal processing capabilities of the Cortex-R4 core and determine whether it has sufficient speed for their application. However, we have one additional caveat. Achieving the performance results we've presented was not a trivial undertaking; each of the benchmarks was painstakingly hand-optimized in assembly language to squeeze the maximum performance from each processor.
Cortex-R4 users requiring maximum performance will need to perform a similar level of optimization, a process that can be more challenging than on previous-generation ARM cores due to the Cortex-R4's SIMD capabilities, superscalar execution, and deeper pipeline. In part 2, we'll describe some of the optimization techniques we've used for implementing signal processing algorithms on the Cortex-R4.
BDTI provides the industry's most trusted and widely used benchmarks for digital signal processing and video applications. Through benchmarks and analysis, BDTI enables engineers, marketers, and managers to make confident technical and business decisions about technologies for signal processing applications. For more BDTI resources, see www.BDTI.com.