MIPS, Inc. is a leading vendor of 32-bit licensable processor cores that are commonly found in set-top boxes and cable modems, among other products. In recent years, MIPS has augmented its cores with DSP features to increase their performance on the signal processing tasks increasingly found in their target applications.
In 2007, MIPS announced a new high-performance core, the 74K. The 74K is a dual-issue superscalar core that supports MIPS's next generation of DSP-oriented instruction set extensions, called DSP ASE Rev 2. (DSP ASE Rev 1 is used in the 74K's predecessor, the 24KE core, and in MIPS's multi-threaded 34K core.) The 74K core targets demanding multimedia and networking applications, such as WiMAX, DVD players, and VoIP. According to MIPS, the 74K core is fully synthesizable and operates at up to 1.11 GHz in a 65 nm process.
BDTI recently completed an evaluation of the 74K's signal processing features and suitability for its target applications. In this article, we'll share some highlights of our analysis.
The 74K is a 32-bit RISC CPU that implements the MIPS32 Release 2 instruction set architecture and supports the DSP ASE Revision 2 instruction-set extensions. The core can include an optional floating-point unit for support of floating-point computations. The 74K can issue and execute up to two instructions in parallel (or up to four instructions—two integer and two floating-point—if the floating-point unit is included). The base core contains a load/store unit and separate data path; one instruction can be executed in each of these two units in parallel. Figure 1 shows a block diagram of the 74K core.
Figure 1. MIPS 74K core.
The 74K core's data path contains a 32-bit ALU (which is part of the "integer execution unit," or "IEU") and a multiply/divide unit (MDU). The load/store unit (or AGEN pipe) and data path share thirty-two 32-bit general-purpose registers and four 64-bit accumulators.
Like other MIPS cores, the 74K uses a load/store architecture, where all ALU, shifter, and MDU operations operate on data from (and store results to) core registers. Load/store operations support 8-, 16-, and 32-bit data transfers to and from memory; 64-bit load/stores are only supported for the (optional) floating-point unit.
From a signal processing perspective, the multiply/divide unit is a key feature of the 74K core, because signal processing workloads typically require frequent multiply and multiply-accumulate (MAC) operations. The MDU supports 32x32-bit multiplies and MACs with single-cycle throughput, and also supports SIMD (single instruction, multiple data) multiplications or MACs of two sets of 16-bit data operands. (16-bit data is commonly used in signal processing applications.) Because the 74K can only transfer 32 bits of fixed-point data per cycle, however, it cannot provide sufficient data bandwidth to keep the dual multipliers fed with four new 16-bit operands per cycle. In comparison, most DSP processors and many DSP-enhanced general-purpose processors provide sufficient data memory bandwidth to keep up with their multiplication bandwidth. The 74K core's data bandwidth limitation may become a bottleneck in some applications, though algorithm transformations (such as "zipping" in filter algorithms) can sometimes be used to circumvent the bottleneck.
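To make the bandwidth point concrete, here is a minimal portable C sketch of a dot product built on a dual 16-bit multiply-accumulate. It is a behavioral model only, not MIPS code: the helper names (`to_s16`, `pack16x2`, `dual_mac16`, `dot16_packed`) are hypothetical, and the packing order (first sample in the upper half-word) is an assumption made for illustration. Note that each loop iteration reads two 32-bit words to feed a single dual MAC, which is the operand traffic the paragraph above describes.

```c
#include <stddef.h>
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed value (portable, no
   implementation-defined narrowing). Hypothetical helper. */
static int16_t to_s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v);
}

/* Pack two signed 16-bit samples into one 32-bit word.
   Assumed layout: 'hi' in bits 31..16, 'lo' in bits 15..0. */
static uint32_t pack16x2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of a dual 16-bit MAC: multiply the matching halves of a and b
   and add both products to a 64-bit accumulator. */
static int64_t dual_mac16(int64_t acc, uint32_t a, uint32_t b)
{
    int64_t p_hi = (int64_t)to_s16(a >> 16) * to_s16(b >> 16);
    int64_t p_lo = (int64_t)to_s16(a) * to_s16(b);
    return acc + p_hi + p_lo;
}

/* Dot product over packed 16-bit data, two samples per 32-bit word. */
static int64_t dot16_packed(const uint32_t *x, const uint32_t *h,
                            size_t n_words)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n_words; i++) {
        /* Two 32-bit reads (x[i] and h[i]) per dual MAC: four 16-bit
           operands are needed for every pair of multiplications. */
        acc = dual_mac16(acc, x[i], h[i]);
    }
    return acc;
}
```

On a core that can move only 32 bits of fixed-point data per cycle, a loop like this is limited by its loads rather than by its multipliers, unless the algorithm is restructured so that operands already held in registers are reused.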
The 74K core's 16-bit multiplication capabilities (in terms of the number of multiplications that can be executed in parallel per cycle) are identical to those of the 24KE, and comparable to those of medium-performance DSP chips, such as Texas Instruments' TMS320C55x and Analog Devices' Blackfin, and to mid-range licensable DSP cores, such as the Ceva Teaklite-III. They are not as powerful as the MAC capabilities of some high-performance processors, which can perform four (or more) 16-bit multiplications per cycle.
The ALU supports a variety of common 32-bit arithmetic and logic operations. It also supports a range of SIMD capabilities, including dual 16-bit adds, subtracts, shifts, and compares and quad 8-bit adds, subtracts, shifts, and compares. The SIMD arithmetic operations are useful for a wide range of signal processing algorithms, including FFTs, video, graphics, and Viterbi decoding, and are similar to those found on mid-range to high-end DSP processors. The 74K core ALU also supports specialized instructions that facilitate efficient SIMD processing, such as packing and unpacking of 16- or 8-bit operands within 32-bit registers. Many of the 74K core arithmetic operations include zero-overhead saturation or rounding, which are useful for maintaining signal fidelity in DSP applications. On some other DSP-enhanced general-purpose processors, rounding and saturation must be implemented manually, at the cost of additional cycles and program memory.
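As an illustration of what "implemented manually" means, the following C sketch models a saturating dual 16-bit add in software. It is a behavioral model under our own assumptions, not a description of any MIPS instruction: the function names (`to_s16`, `sat16`, `add_sat16x2`) are hypothetical. The explicit clamps in `sat16` are the extra work a processor must do in separate instructions when saturation is not built into the arithmetic operation.

```c
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed value. Hypothetical helper. */
static int16_t to_s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int16_t)(v >= 0x8000u ? (int32_t)v - 0x10000 : (int32_t)v);
}

/* Clamp a 32-bit intermediate result to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Add the two 16-bit lanes of a and b independently, saturating each
   lane instead of letting it wrap around. */
static uint32_t add_sat16x2(uint32_t a, uint32_t b)
{
    int32_t hi = (int32_t)to_s16(a >> 16) + to_s16(b >> 16);
    int32_t lo = (int32_t)to_s16(a) + to_s16(b);
    return ((uint32_t)(uint16_t)sat16(hi) << 16) | (uint16_t)sat16(lo);
}
```

For example, adding the lanes (32767, 1) and (1, 1) yields (32767, 2): the upper lane clamps at the largest positive value rather than wrapping to a large negative one, which in an audio or video signal would show up as a severe glitch.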
Dual, Asymmetric Pipelines
To support superscalar execution, the 74K core has dual asymmetric pipelines. One pipeline is used for computations (the ALU pipeline) and one is used for address generation and loads/stores (the AGEN pipeline). This design differs from superscalar processors that use symmetric pipelines, in which any instruction can execute in either pipe. Asymmetric pipelines can be a good match for signal processing tasks, which typically require computations to be executed in parallel with data loads and stores. The asymmetric approach yields a noticeable performance improvement over single-issue execution without requiring as much silicon area (and power) as fully symmetric pipelines.
Both of the 74K core pipelines are quite deep—14 stages for the ALU pipeline, and 15 stages for the address generation pipeline. The primary advantage of using deep pipelines is that they enable high clock speeds, and indeed, the 74K core's projected 1.11 GHz speed is impressive, particularly for a synthesizable core. One trade-off is that deep pipelines can cause long delays when the pipeline is flushed (such as during a branch). On the 74K core, this penalty is 12 cycles. To mitigate this effect and reduce branching penalties, the 74K core includes branch prediction hardware that uses three 256-entry branch history tables.
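To give a feel for how a branch history table limits flush penalties, here is a small C sketch of a generic textbook predictor: one 256-entry table of 2-bit saturating counters, indexed by the branch address. This is an assumption made purely for illustration; we are not describing how MIPS organizes or indexes the 74K's three tables. The names (`bht_t`, `bht_predict`, `bht_update`, `loop_branch_penalty`) are hypothetical, and the model charges the full 12-cycle penalty for every misprediction.

```c
#include <stdint.h>

#define BHT_ENTRIES   256  /* table size quoted in the article */
#define FLUSH_PENALTY 12   /* cycles per flush quoted in the article */

/* One 2-bit saturating counter (0..3) per entry; 2 or 3 means "taken". */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} bht_t;

static void bht_init(bht_t *t)
{
    for (unsigned i = 0; i < BHT_ENTRIES; i++)
        t->counter[i] = 1;  /* start at "weakly not taken" */
}

static int bht_predict(const bht_t *t, uint32_t pc)
{
    return t->counter[(pc >> 2) % BHT_ENTRIES] >= 2;
}

static void bht_update(bht_t *t, uint32_t pc, int taken)
{
    uint8_t *c = &t->counter[(pc >> 2) % BHT_ENTRIES];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}

/* Simulate the backward branch of a loop that runs 'iterations' times:
   taken on every pass except the last. Returns the cycles lost to
   mispredictions under the assumptions stated above. */
static unsigned loop_branch_penalty(bht_t *t, uint32_t pc,
                                    unsigned iterations)
{
    unsigned mispredicts = 0;
    for (unsigned i = 0; i < iterations; i++) {
        int taken = (i + 1 < iterations);
        if (bht_predict(t, pc) != taken)
            mispredicts++;
        bht_update(t, pc, taken);
    }
    return mispredicts * FLUSH_PENALTY;
}
```

In this model, a 100-iteration loop costs 24 cycles of flush penalty the first time it runs (one misprediction while the counter warms up, one at loop exit) and 12 cycles on later runs, compared with well over a thousand cycles if every taken branch caused a flush.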
Deep pipelines can also lead to long latencies for operations other than branches. For example, on the 74K core, 32x32-bit multiply instructions have either 5- or 7-cycle latencies (depending on the instruction variant), though all have single-cycle throughput. Such instruction latencies can create performance bottlenecks, because the processor may need to wait multiple cycles for results to become available. In some cases, the latencies can be concealed by software pipelining or instruction reordering, reducing their effect on performance. The 74K core also supports out-of-order instruction execution (described further below), which can automatically reorder instructions to help mitigate the effect of multi-cycle latencies and improve performance.
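The difference between latency and throughput can be captured with a simple, idealized cycle count. The C sketch below is our own back-of-the-envelope model, not a simulation of the 74K: it ignores issue limits, register pressure, and memory stalls, and the function names are hypothetical. It compares a chain of operations in which each one needs the previous result against the same number of independent operations issued back to back.

```c
/* Cycles for n operations when each must wait for the previous result:
   every operation pays the full latency. */
static unsigned cycles_dependent(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Cycles for n independent operations at one issue per cycle: the
   pipeline fills once, then one result completes every cycle. */
static unsigned cycles_overlapped(unsigned n, unsigned latency)
{
    return n == 0 ? 0 : latency + n - 1;
}
```

With a 5-cycle latency, eight dependent multiplies take 40 cycles in this model, while eight independent multiplies take 12. Closing that gap is the job of software pipelining, instruction reordering, and out-of-order execution.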
According to MIPS, the 74K core compiler is capable of software pipelining and is designed to reduce the performance penalties due to long latencies. The compiler will not always be able to identify and implement optimal code, however, and in some cases the programmer will need to hand-optimize assembly language to achieve the 74K core's maximum signal processing performance potential.
With the 74K core, the long instruction latencies will make the optimization process more challenging (since they will make it more difficult to understand the software flow), though probably not more difficult than optimizing assembly code for today's high-performance DSP processors.