The explosive growth of consumer electronics, and specifically of handheld devices such as cellular phones, PDAs, and portable media players (PMPs), has drastically changed the requirements placed on silicon providers. They can no longer design ICs targeted at only one or two multimedia codecs or wireless standards: consumers expect their devices to play media from different sources, encoded with a variety of standards and delivered over a variety of wireless networks. A new, more flexible design approach is therefore needed, one that allows new media standards to be adopted easily. In this article, we focus on the challenges and opportunities for video decoder and encoder engines.
Traditional RTL-based approach for designing video engines
The last generation of video ASICs was designed to decode and encode MPEG-2, because this is the standard used for DVDs. Some of these ASICs also supported MPEG-1 to enable VCD (video CD) playback. In most cases, the logical implementation strategy for such a single-application part was to design custom MPEG-2 decoders and encoders at the register transfer level (RTL). A representative MPEG-2 video ASIC architecture is shown in Figure 1, which shows the RTL blocks that comprise the video subsystem, the host controller, and the on-chip memory.
Figure 1: Representative MPEG-2 video ASIC architecture
The market has changed: video ASICs now need to support multiple video standards at multiple resolutions. An all-RTL approach no longer works, for the following reasons:
- The number and complexity of the RTL blocks grow as the number of supported standards increases.
- Implementing a new video standard, upgrading an existing implementation, or fixing bugs requires a silicon re-spin.
- Video codecs, particularly encoders, improve significantly in quality (bit rate, performance) over the 4-5 years following the first silicon implementation. Deploying these improvements in an all-RTL design also requires a silicon re-spin.
Using processors instead of fixed-RTL for video engines
So, what are the alternatives? Using a programmable processor would be the ideal solution, since it addresses all of the problems mentioned above: (1) it is easy to port codecs to a processor, (2) adopting new video standards, upgrading existing codecs, or fixing bugs can be done post-silicon in software, and (3) improvements in video codec implementations can be easily deployed using software upgrades.
Conventional processors, however, suffer from performance bottlenecks: they are designed for general-purpose code, not for video engines. Embedded DSPs are likewise not tailored to video; their hardware functional units, instructions, and interfaces target general-purpose DSP applications. As a result, implementing video codecs on conventional RISC and DSP processors forces them to run at very high clock frequencies, requires large amounts of memory, and burns a lot of power, making them unsuitable for portable applications.
This becomes evident from a simple analysis of the number of computations required in one of the video kernels. The Sum of Absolute Differences (SAD) is a key computation kernel performed during the motion estimation step of most video encoding algorithms. Motion estimation uses SAD to track the movement of a macroblock between two consecutive video frames: it sums the absolute differences between every pair of corresponding pixel values in the two macroblocks, and a small SAD indicates a close match.
A simple implementation of the SAD kernel is demonstrated by the following C code:
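One representative sketch, assuming 16x16 macroblocks of unsigned 8-bit pixels addressed through a caller-supplied row stride (the function name and interface here are illustrative, not taken from any particular codec):

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD of one 16x16 macroblock. 'stride' is the row pitch, in pixels,
   of the frame buffers holding the current and reference blocks. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            /* subtract, take the absolute value, accumulate */
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}
```

Each iteration performs exactly the subtract, absolute-value, accumulate sequence described below.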
The basic computation inside the SAD kernel is represented pictorially in Figure 2. As the figure shows, the kernel's inner loop subtracts corresponding pixels, takes the absolute value of the difference, and accumulates the result into a running sum.
Figure 2: Main computation in the Sum of Absolute Differences (SAD) kernel
Computing the SAD of two 16x16 macroblocks on a RISC processor requires 256 subtractions, 256 absolute-value computations, and 256 additions -- a total of 768 arithmetic operations, not including the loads and stores required to move the data around. Since this must be done for every macroblock in every frame, it is clearly computationally very expensive, and the cost grows with frame resolution and frame rate.
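To put numbers on this, a short C sketch of the operation count (the frame dimensions and frame rate are illustrative, and it counts only one SAD per macroblock, ignoring the many candidate positions a real motion-estimation search evaluates):

```c
/* Lower bound on arithmetic operations per second spent on SAD alone,
   assuming a single SAD per 16x16 macroblock per frame. */
long sad_ops_per_second(int width, int height, int fps)
{
    long ops_per_sad = 3L * 16 * 16;   /* subtract, abs, add per pixel = 768 */
    long mb_per_frame = (long)(width / 16) * (height / 16);
    return ops_per_sad * mb_per_frame * fps;
}
```

For CIF (352x288) at 30 frames/s, this already comes to over 9 million operations per second; a real search evaluates many candidate blocks per macroblock, multiplying that figure.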
In fact, on a mid-range general purpose RISC processor that has some DSP instructions such as multiplies and multiply-accumulates, performing H.264 Baseline decode at CIF resolution requires about 250 MHz and performing H.264 Baseline encode at CIF resolution requires more than 1 GHz! That translates to almost 500 mW for the processor core alone, not to mention the power being consumed by the memory and the rest of the video SoC. Clearly, this processor cannot be used as an embedded multimedia processor in a portable device.
Next: Configurable processors to the rescue