The standardized MPEG family of audio compression schemes for high-quality audio transmission has become very common within the past few years. During that time, MPEG-1 Layer 3 (MP3) has become the audio compression standard used on the Internet. Building on that success, third-generation portable MP3 players are now on the market. Other market segments being addressed include cellular phones and wristwatches that double as portable MP3 players; terrestrial- as well as satellite-based broadcast systems; and MP3 players in the home stereo and automotive environments.
One reason for MP3's overwhelming success is that today even the cheapest available personal computer has enough computational power to decode an MP3 bit stream in real-time without any dropouts. On the other hand, for most of the applications mentioned above, a PC-like hardware concept is not acceptable: such an approach would be too large, too expensive, too power-consuming or some combination of the three. Increasingly, next-generation embedded hardware concepts are expected to decode MP3 as well.
In the case of a cellular phone, existing chip sets can be enhanced by an additional off-the-shelf MP3-decoder chip. The design question is whether an additional chip might be avoided by designing the next-generation chip set in such a way that it fulfills the minimum requirements of the MPEG Layer 3 algorithm.
The basic elements of the MPEG-1 and MPEG-2 decoding algorithms are roughly divided into four main sections: bit-stream demultiplexing, Huffman decoding of the spectral values, requantization and frequency-to-time mapping, which in MP3 is realized via a hybrid filter bank. The encoder runs the inverse chain: it starts by mapping the time signal to the frequency domain (T/F mapping); as part of a feedback loop, a psychoacoustic model controls the nonuniform quantizer; Huffman encoding performs redundancy reduction of the quantized spectral data; and finally, side information and audio data are multiplexed into an MPEG bit stream.
Although MPEG Layer 3 and MPEG-2 share the same basic structure, there are significant differences beyond the T/F mapping.
On the decoder side, the filter bank is the most time-consuming algorithm section, while the T/F mapping ranks as the most complex on the encoder side; the psychoacoustic model and the quantization stage are of similar complexity. The basic operation behind efficient implementations is the fast Fourier transform (FFT) butterfly. Rounding errors of 1/2 least significant bit (LSB) in each stage of the transform add up to a theoretical accuracy loss of 4.5 bits over the whole transform; in practice the loss will be higher. A minimum data-path width of 20 bits is therefore required, with 24 bits being better, to reach the full dynamic range of a CD's native 16-bit data format without any loss in accuracy.
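Where that per-stage rounding enters can be sketched with a Q15 radix-2 butterfly. This is a hypothetical illustration, not the reference implementation: each multiply rounds by half an LSB, and each stage spends one bit of headroom on the 1/2 input scaling that prevents overflow.

```c
#include <stdint.h>

/* Hypothetical Q15 complex type for a fixed-point FFT sketch. */
typedef struct { int16_t re, im; } cq15;

/* Q15 multiply with rounding: (a*b + 0.5 LSB) >> 15.
   The added half-LSB is exactly the per-stage rounding error
   the text refers to. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
}

/* One radix-2 decimation-in-time butterfly with twiddle factor w.
   Inputs are scaled by 1/2 to avoid overflow, which is where one
   bit of headroom is spent per stage. */
static void butterfly(cq15 *x0, cq15 *x1, cq15 w)
{
    int16_t tr = (int16_t)(q15_mul(x1->re, w.re) - q15_mul(x1->im, w.im));
    int16_t ti = (int16_t)(q15_mul(x1->re, w.im) + q15_mul(x1->im, w.re));
    int16_t ar = (int16_t)(x0->re / 2), ai = (int16_t)(x0->im / 2);
    x1->re = (int16_t)(ar - tr / 2);
    x1->im = (int16_t)(ai - ti / 2);
    x0->re = (int16_t)(ar + tr / 2);
    x0->im = (int16_t)(ai + ti / 2);
}
```

With roughly nine such stages in a typical transform length, the half-LSB errors accumulate to the 4.5-bit theoretical loss cited above.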
There are different requirements for the encoder control path. The calculation of the masking thresholds in the psychoacoustic model deals with energies per scale factor. Here, the precision requirements are much lower than for the data path. On the other hand, the dynamic range of energies is much higher than in the data path.
The amount of memory depends on the efficiency of the implementation. Our experience shows that about 4 kwords of RAM for I/O and processing buffers are sufficient for MPEG Layer 3 in stereo mode. For Huffman tables, quantizer tables, twiddle factors, polyphase window coefficients and the like, about 3 kwords of data ROM are required. The computational power very much depends on the efficiency of the instruction set and the underlying processor architecture.
Each algorithm phase mentioned above places specific demands on the architecture. For example, the basic instruction of digital signal-processing algorithms is the multiply-accumulate (MAC) operation; a pipelined DSP executes a MAC operation in a single clock cycle. For a low-cost, low-power solution, fixed-point arithmetic is a must: the key parameters here are power consumption and chip area.
Floating vs. fixed
Floating-point arithmetic is based on separate computations of the mantissa and exponent. Although separate computations are executed on different data, they are not completely independent. In the case of a floating-point addition, one mantissa has to be shifted to get equal exponents. Later, the result has to be normalized, which again impacts both the mantissa and the exponent.
Fixed-point arithmetic is quite similar to integer arithmetic: Simply imagine the binary point somewhere other than just to the right of the least-significant digit. Adding two such numbers can be done with an integer add, whereas multiplication requires some extra shifting.
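A minimal sketch of that idea in Q15 fractional format (one sign bit, 15 fraction bits): addition is a plain integer add, while multiplication produces a double-width product that must be shifted to realign the binary point.

```c
#include <stdint.h>

/* Q15 fractional format: 1 sign bit, 15 fraction bits; range [-1, 1). */
typedef int16_t q15;

/* Addition is an ordinary integer add (overflow/saturation ignored
   in this sketch). */
static q15 q15_add(q15 a, q15 b) { return (q15)(a + b); }

/* Multiplying two Q15 values yields a Q30 product; shifting right
   by 15 (truncating here) moves the binary point back to Q15. */
static q15 q15_mul(q15 a, q15 b)
{
    return (q15)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 × 0.5 is `q15_mul(16384, 16384)`, giving 8192, the Q15 encoding of 0.25.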
It's obvious that a more-complex floating-point unit requires substantially more chip area than a fixed-point unit. Furthermore, a floating-point arithmetic unit consumes more energy than a fixed-point unit.
Current fixed-point DSPs are based on a data representation with the binary point just to the right of the sign bit, resulting in a data range between 1 and -1 (fractional arithmetic). The accumulator stores the result of a multiplication with full accuracy to increase the precision of consecutive MAC operations, and guard (overflow) bits prevent overflow in intermediate processing steps.
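The benefit of a wide accumulator with guard bits can be modeled in C; here a 64-bit accumulator stands in for the extended hardware accumulator, so products are summed at full precision and rounded only once at the end. This is an illustrative sketch, not vendor code.

```c
#include <stdint.h>

/* Fixed-point dot product modeling a DSP MAC loop: the 64-bit
   accumulator plays the role of the hardware accumulator with
   guard bits above the 32-bit product width, so intermediate
   sums may exceed 1.0 without overflow. */
static int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int64_t acc = 0;                  /* full-precision accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];  /* Q30 products, no rounding yet */
    /* single rounding step at the end, back to Q15 */
    return (int32_t)((acc + (1 << 14)) >> 15);
}
```

Rounding once at the end, rather than after every product, is exactly what the full-accuracy accumulator buys.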
The psychoacoustic model requires calculations with limited accuracy but a high dynamic range. A fixed-point ALU is not well-suited to fulfill those needs. A possible solution is to calculate the algorithm sections with double-precision or with "pseudo-floating-point" data representation, a technique that involves the separate handling of the mantissa and exponent in software.
To speed these calculations, a simplified hardware floating-point unit can avoid the software overhead. This pseudo-floating-point unit need support neither an IEEE-compliant data format nor the full set of operators: a very simple unit with a 16-bit mantissa and an 8-bit exponent is sufficient, and instructions for converting between the fixed- and floating-point data types are easy to implement.
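Such a format can be modeled in software to show how little machinery it needs. The struct layout and function names below are illustrative assumptions, not a documented format: a 16-bit mantissa, an 8-bit exponent, a conversion from fixed point, and a multiply.

```c
#include <stdint.h>

/* Hypothetical pseudo-floating-point format: 16-bit signed mantissa,
   8-bit exponent; value = mant * 2^exp.  Deliberately not IEEE. */
typedef struct { int16_t mant; int8_t exp; } pfloat;

/* Convert a 32-bit fixed-point value by normalizing the mantissa
   into the upper half of the 16-bit range (precision is limited). */
static pfloat pf_from_i32(int32_t v)
{
    pfloat r = { 0, 0 };
    if (v == 0) return r;
    int e = 0;
    while (v > 32767 || v < -32768) { v >>= 1; e++; }
    while (v <= 16383 && v >= -16384) { v <<= 1; e--; }
    r.mant = (int16_t)v;
    r.exp  = (int8_t)e;
    return r;
}

/* Multiply: mantissas multiply, exponents add, then renormalize
   the product back into 16 bits. */
static pfloat pf_mul(pfloat a, pfloat b)
{
    int32_t m = (int32_t)a.mant * b.mant;  /* up to 31 bits */
    int e = a.exp + b.exp;
    while (m > 32767 || m < -32768) { m >>= 1; e++; }
    pfloat r = { (int16_t)m, (int8_t)e };
    return r;
}
```

Note there is no hidden bit, no rounding modes and no special values, which is why such a unit is far cheaper than an IEEE-compliant one.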
ALU design critical
The design of the arithmetic unit has a huge impact on another part of a single-chip solution: the chip area for RAM and data ROM. The IEEE single-precision floating-point format has a mantissa of 24 bits and an exponent of 8 bits. Compared with a 24-bit fixed-point data format, the floating-point data RAM requires about 30 percent more chip area. For a stereo decoder with some larger amount of memory, such as a few thousand words of RAM, the chip area for RAM is already larger than the area for the functional units. For more RAM-consuming algorithms, such as stereo encoders or multichannel decoders, the chip area for memory will determine the total chip area. In terms of area, the functional parts will be negligible.
To minimize the impact of fixed-point arithmetic on accuracy, fixed-point algorithms make heavy use of scaling and normalization. Either the maximum of the full spectrum or the local maximum of a scale factor band is normalized, and the other spectral values are scaled accordingly. Fast normalization and scaling require both the ability to calculate the exponent of a given value and a barrel shifter to execute the corresponding shift operation. The possibility of using the calculated exponent directly as an operand for the subsequent shift operation, without moving it from a result register to an operand register, speeds up this often-used get exponent/shift combination.
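The get exponent/shift pair can be sketched in C as follows; on a real DSP these are single EXP/NORM-style instructions and a barrel shift, while here they are modeled as loops (an assumed illustration, not a particular instruction set).

```c
#include <stdint.h>

/* "Get exponent": count how far a value can be shifted left
   without overflow (the number of redundant sign bits). */
static int norm_exp(int32_t v)
{
    if (v == 0) return 31;
    if (v < 0) v = ~v;           /* treat sign bits symmetrically */
    int e = 0;
    while ((v & 0x40000000) == 0) { v <<= 1; e++; }
    return e;
}

/* Normalize a scale-factor band by the exponent of its maximum so
   that subsequent fixed-point operations use the full data width.
   The shift goes through uint32_t to keep the left shift of
   negative values well-defined in C. */
static void normalize_band(int32_t *x, int n)
{
    int32_t max = 0;
    for (int i = 0; i < n; i++) {
        int32_t a = x[i] < 0 ? -x[i] : x[i];
        if (a > max) max = a;
    }
    int e = norm_exp(max);
    for (int i = 0; i < n; i++)
        x[i] = (int32_t)((uint32_t)x[i] << e);
}
```

Feeding the result of `norm_exp` straight into the shifter, with no intermediate register move, is the single-cycle combination the text describes.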
The calculation of transcendental functions on fixed-point devices is often realized by a combination of table lookup and some pre- and post-operations. Working with normalized values reduces the size of lookup tables without any loss in accuracy.
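As one possible illustration of the technique (the function choice and table size here are assumptions, not taken from the article), a base-2 logarithm splits naturally into the normalization exponent for the integer part and a small table on the normalized mantissa for the fractional part:

```c
#include <stdint.h>

/* Sketch: log2 of a 32-bit value via normalization plus a 16-entry
   lookup table.  Table values are round(256 * log2(1 + i/16)),
   i.e. the fractional part in Q8 for each mantissa interval. */
static const uint8_t log2_tab[16] = {
      0,  22,  44,  63,  82, 100, 118, 134,
    150, 165, 179, 193, 207, 220, 232, 244
};

/* Returns 256 * log2(v) (Q8) for v >= 1. */
static int32_t log2_q8(uint32_t v)
{
    int e = 31;
    while (!(v & 0x80000000u)) { v <<= 1; e--; }  /* normalize */
    int idx = (v >> 27) & 0x0F;                   /* top mantissa bits */
    return (int32_t)(e << 8) + log2_tab[idx];
}
```

Because the argument is normalized into [1, 2) first, 16 table entries cover the entire 32-bit input range; without normalization the table would have to span every possible magnitude.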
The address generation unit plays an important role in the case of complex FFTs. If sufficient address pointers do not exist for operands, results and twiddle factors, or if updated register values are not available in the next instruction, the minimum number of cycles per butterfly will not be reached. For processors with a single MAC unit, a value of six cycles per butterfly can be achieved. A lot of existing processors do not reach that value, at least not in combination with the bit-reversed addressing mode. Fetching operands and writing back results requires a powerful bus structure; a Harvard architecture is standard today.
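What the bit-reversed addressing mode computes can be modeled in software; a DSP performs this index update in the address generation unit for free, whereas the explicit loop below would cost cycles on every access.

```c
/* Software model of hardware bit-reversed addressing: reverse the
   low `bits` bits of index i.  Used to reorder FFT input/output
   without a separate permutation pass. */
static unsigned bitrev(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);  /* move lowest bit of i to the top of r */
        i >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 bits), index 1 (001) maps to 4 (100) and index 6 (110) maps to 3 (011).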
The primary application for 16-bit DSPs is speech coding, e.g., codecs for wireless telecommunications. As all major industry analysts acknowledge, the wireless communications market is driving developments in digital signal processing. On the other hand, the high-quality audio-coding market is also growing rapidly. It might be advantageous to introduce additional features to support 24-bit or 32-bit arithmetic, namely double-precision modes.
Texas Instruments Inc.'s new C55xx device, for example, comes with a dual MAC. It would have been wise to provide a mode coupling both 16-bit MACs in such a way that a 32 x 32 MAC could execute in a single clock cycle, but that is not the case with the TI part.
Data paths that allow transfers of a double word between memory and ALU in parallel with a double-precision operation are also a significant performance factor. In that respect, the C55xx is a perfect choice. If a bit-reversed or modulo addressing mode also works with double words, a 16-bit device is an adequate alternative.
Besides their limited data width, 16-bit DSPs may lack such other resources as the optimal number of hardware loops or the optimal modulo range. If work-arounds are necessary, they will usually increase the requirements on computational power.
The use of configurable DSP cores allows the architecture to be adapted to the specific needs of the algorithm. This approach can yield the best trade-off between processing power and chip size.
This article was presented in full at ICSPAT 2000.