One of the most pressing challenges in developing advanced multimedia and communications chips is striking a balance between the needs of the designer and those of the chip: ease of development on one side, performance and instruction efficiency on the other. That balancing act is unavoidable in any programmable solution. Traditional digital signal processors (DSPs) and microprocessors raise issues in all of those areas. For instance, DSPs can attain better performance than a traditional microcontroller but are often difficult to program.
Improv's Jazz architecture was designed to be an effective platform on which to develop and run advanced multimedia and communications applications. Our experience with developing efficient software for H.263 videoconferencing, G.723.1 speech coding, MPEG-2 video and audio decoding, JPEG encoding, IEEE 1394 communications and DSL modems has shown that the platform can address a wide variety of data-intensive applications.
Improv has designed a Java class system implementing a specific system-level computational model, a directed control/data-flow network. The computational model provides structural modeling similar to HDLs (components with ports that are connected together), while the behavioral modeling is virtually identical to algorithm design in C. The structured use of Java is intuitive and has an added benefit: we can rapidly develop test benches that include virtual prototypes of user interfaces, controls and displays.
To highlight some of the critical design issues that arise in developing applications for a configurable platform, we will detail the development of a particular application for HDTV video decoding. The Improv HDTV video-decoder component incorporates the following features:
- Full support of the Grand Alliance HDTV specification, with resolutions up to 1,920 x 1,080 pixels at 30-frame/second interlaced scan and 1,280 x 720 pixels at 60-fps progressive scan.
- Implementation of all 18 ATSC-specified input video formats, with modular software configurations available to convert output to native display formats.
- Automatic 3:2 pull-down detection and coding.
- A bit-stream interface that accepts data via the widely used ARM High Performance Bus.
- A standard raster-scan digital video output section.
The Jazz PSA contains multiple processing engines and VLIW (very long instruction-word)-style instruction-level parallelism. That provides significant processing power by exploiting concurrent execution both at the task and instruction levels. Clearly, the key to building high-performance applications is exposing as much concurrency as possible in the application itself.
Our experience with C-based approaches has shown that it is difficult and counterintuitive to expose concurrency while programming in such a sequential language. Improv's Application Development Framework, on the other hand, makes concurrent descriptions part of the development model and, therefore, part of the natural development process.
The key is the use of a component-based, rather than a procedural, approach. That makes structural modeling of the application closer to development with a hardware-description language, where modular components can be created and then tied together.
The main transform of the HDTV application has a significant amount of concurrency.
The sequence of inverse quantization and inverse DCT (discrete cosine transform) can be run independently for separate 8 x 8 blocks of luminance and chrominance within a given macroblock. In addition, the motion compensation of a macroblock can be performed separately for the luminance and chrominance components.
The natural way to describe that functionality is to create an inverse-DCT and inverse-quantization component that operates on 8 x 8 blocks, and then to create six instances of those components. The data read and produced by each component can be attached to different data managers within the application.
Similarly, a component is created for performing motion compensation, and then instances of that component are created for the luminance and two chrominance components.
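The component-and-instance structure described above can be sketched in a few lines. This is only an illustration in Python, not Improv's Java class system: the `Component` class, its `inp`/`out` ports, the `inverse_quantize` placeholder kernel and the block names are all hypothetical. It shows one component type instantiated six times (four luminance and two chrominance blocks of a 4:2:0 macroblock), each with its own ports, so the instances can fire in any order.

```python
from collections import deque

class Component:
    """A task with one input port and one output port (both simple FIFOs)."""

    def __init__(self, name, kernel):
        self.name = name
        self.kernel = kernel      # behavioral code run on each token
        self.inp = deque()        # input port (hypothetical data manager)
        self.out = deque()        # output port (hypothetical data manager)

    def fire(self):
        """Consume one token from the input port and emit one result."""
        self.out.append(self.kernel(self.inp.popleft()))

def inverse_quantize(block, qscale=2):
    # Placeholder kernel: scales each coefficient. A real component would
    # follow this with the inverse DCT on the 8 x 8 block.
    return [coeff * qscale for coeff in block]

# Six 8 x 8 blocks per 4:2:0 macroblock: four luminance, two chrominance.
BLOCK_NAMES = ["Y0", "Y1", "Y2", "Y3", "Cb", "Cr"]

def decode_macroblock(blocks, order=None):
    """Create six instances of the same component and fire each once."""
    comps = {n: Component(n, inverse_quantize) for n in BLOCK_NAMES}
    for n in BLOCK_NAMES:
        comps[n].inp.append(blocks[n])
    # No instance shares a port with another, so the firing order is free.
    for n in (order or BLOCK_NAMES):
        comps[n].fire()
    return {n: comps[n].out.popleft() for n in BLOCK_NAMES}
```

Because no instance reads another's output, a scheduler is free to map the six instances onto different processing engines; the sequential loop here merely stands in for that mapping.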
HDTV decoding demands extremely high performance from critical MPEG-2 algorithms.
In HDTV, a 1080i (1,920 x 1,080-pixel interlaced) frame contains 8,160 16 x 16 macroblocks of pixels. To handle 30 frames/s, the decoder must process 244,800 macroblocks every second, leaving only about 4 microseconds per macroblock. On a 200-MHz processing engine, that translates to roughly 800 instruction cycles per macroblock.
Fortunately, on the Jazz architecture the application is allocated onto multiple processing engines, giving it, in our target Jazz configuration, around 800 instruction cycles for the variable-length decoding (VLD) and 2,400 instruction cycles for the transform per macroblock. Even so, meeting that budget required us to focus our software optimization efforts on a number of areas. Two of the most important are VLD and the inverse DCT.
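The single-engine budget can be checked with a short calculation. The sketch below is in Python with hypothetical helper names; it assumes the frame height is padded up to a whole number of 16-line macroblock rows (1,080 lines become 68 rows) and that one instruction issues per clock cycle.

```python
import math

def macroblocks_per_frame(width, height, mb_size=16):
    """Number of mb_size x mb_size macroblocks, rounding partial rows up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)

def cycles_per_macroblock(width, height, fps, clock_hz):
    """Instruction cycles one engine can spend on each macroblock."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)

mbs = macroblocks_per_frame(1920, 1080)      # 120 columns x 68 rows
mb_rate = mbs * 30                           # macroblocks per second
time_us = 1e6 / mb_rate                      # microseconds per macroblock
budget = cycles_per_macroblock(1920, 1080, 30, 200e6)

print(mbs, mb_rate, round(time_us, 2), round(budget))
```

The result is a little over 4 microseconds, or just over 800 cycles, per macroblock on a single 200-MHz engine.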
One of the problems many programmable solutions have in MPEG-2 decoding is managing the conflicting requirements of bit-stream processing with its VLD, and the transform process itself. Bit-stream processing is highly sequential and dominated by control and bit-extraction operations. Transform processing, on the other hand, is dominated by arithmetic operations and fixed-point management.
In virtually all the MPEG-2 decoder-chip implementations on the market today, VLD processing is handled by a specifically designed custom hardware block. That is because the operations required for extracting and reacting to specific bits from an incoming bit stream are typically expensive on a processing engine. However, some key features built into the Jazz architecture's general-purpose data path support more efficient handling of bit streams. These include instructions to mask bits in the ALU registers, 5-bit addition and modular addressing.
Rather than accessing the instructions directly, we have created a special bit-stream data manager with three additional methods, called GetBits, FlushBits and ShowBits. Each of those functions takes a single parameter to specify the number of bits. GetBits and ShowBits return the specified number of bits from the bit stream as an integer value, while FlushBits skips over the specified number of bits.
The compiler automatically inserts code at the beginning and end of the task that uses those operations to set up the bit stream and its current indexes. With that approach it takes three cycles to read a specified number of bits from a bit stream rather than the 12 to 17 cycles of a software-only approach.
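A software model of such a bit-stream manager might look like the following. It is a minimal sketch in Python, not the Jazz data manager itself: the class name, the MSB-first bit order (as used by MPEG-2 bit streams), the zero padding past the end of the buffer, and the assumption that ShowBits peeks without consuming while GetBits consumes are all ours.

```python
class BitStream:
    """Hypothetical MSB-first bit reader over a bytes buffer."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def show_bits(self, n: int) -> int:
        """Return the next n bits as an integer without consuming them."""
        value = 0
        for i in range(self.pos, self.pos + n):
            byte_index = i >> 3
            # Reads past the end of the buffer are treated as zero bits.
            byte = self.data[byte_index] if byte_index < len(self.data) else 0
            bit = (byte >> (7 - (i & 7))) & 1
            value = (value << 1) | bit
        return value

    def flush_bits(self, n: int) -> None:
        """Skip over the next n bits."""
        self.pos += n

    def get_bits(self, n: int) -> int:
        """Return the next n bits as an integer and consume them."""
        value = self.show_bits(n)
        self.flush_bits(n)
        return value
```

A variable-length decoder would typically call `show_bits` to index a code table, then `flush_bits` by the length of the code it matched. The per-bit loop here is exactly the kind of software-only cost that the masking and modular-addressing instructions are meant to avoid.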
There has been quite a bit of work within the industry on the most efficient ways to implement the inverse DCT. Most of that work has either centered on single-processor approaches or on custom hardware. As a result, most of the optimization work has focused on one key aspect: reducing the number of multiplications by favoring addition in the algorithm.
However, with the advent of VLIW architectures, the goal of optimization changes. The Jazz architecture provides concurrent operation of multiple ALU and multiplication operators. The main processing engines have four ALUs and one or two multipliers plus a shifter unit. Therefore, the most efficient algorithm for the inverse DCT on the Jazz architecture is one that balances the mix of additions and multiplications against the number of shifts needed for fixed-point management.
For our implementation, we chose a modified version of the Chen-Wang inverse-DCT algorithm requiring 11 multiplies and 29 adds. The result is a software task that takes about 1,000 instructions for a macroblock with an average of 5.5 operations per VLIW instruction.
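For reference, the widely published 8-point Chen-Wang factorization with 11 multiplies and 29 adds can be sketched as below. This is a floating-point Python illustration of the unmodified textbook form, not Improv's modified fixed-point version; the function names are hypothetical, and the final scaling by 1/sqrt(8) would normally be folded into the fixed-point shifts rather than performed as extra multiplies. A direct evaluation of the inverse-DCT definition is included for comparison.

```python
import math

def _w(k):
    return math.sqrt(2.0) * math.cos(k * math.pi / 16.0)

W1, W2, W3, W5, W6, W7 = (_w(k) for k in (1, 2, 3, 5, 6, 7))
# Constant sums and differences are precomputed, so they cost no run-time adds.
A1, B1 = W1 - W7, W1 + W7
A3, B3 = W3 - W5, W3 + W5
A2, B2 = W2 - W6, W2 + W6
INV_SQRT2 = 1.0 / math.sqrt(2.0)
SCALE = 1.0 / math.sqrt(8.0)

def idct8_fast(X):
    """8-point inverse DCT using 11 multiplies and 29 adds (before scaling)."""
    x0, x1, x2, x3 = X[0], X[4], X[6], X[2]
    x4, x5, x6, x7 = X[1], X[7], X[5], X[3]
    # Stage 1: odd coefficients (6 multiplies, 6 adds)
    x8 = W7 * (x4 + x5)
    x4 = x8 + A1 * x4
    x5 = x8 - B1 * x5
    x8 = W3 * (x6 + x7)
    x6 = x8 - A3 * x6
    x7 = x8 - B3 * x7
    # Stage 2: even coefficients and odd butterflies (3 multiplies, 9 adds)
    x8 = x0 + x1
    x0 = x0 - x1
    x1 = W6 * (x3 + x2)
    x2 = x1 - B2 * x2
    x3 = x1 + A2 * x3
    x1 = x4 + x6
    x4 = x4 - x6
    x6 = x5 + x7
    x5 = x5 - x7
    # Stage 3 (2 multiplies, 6 adds)
    x7 = x8 + x3
    x8 = x8 - x3
    x3 = x0 + x2
    x0 = x0 - x2
    x2 = INV_SQRT2 * (x4 + x5)
    x4 = INV_SQRT2 * (x4 - x5)
    # Stage 4: output butterflies (8 adds)
    out = [x7 + x1, x3 + x2, x0 + x4, x8 + x6,
           x8 - x6, x0 - x4, x3 - x2, x7 - x1]
    return [v * SCALE for v in out]

def idct8_reference(X):
    """Direct evaluation of the orthonormal 8-point inverse DCT definition."""
    out = []
    for n in range(8):
        acc = X[0] / math.sqrt(8.0)
        for k in range(1, 8):
            acc += 0.5 * X[k] * math.cos((2 * n + 1) * k * math.pi / 16.0)
        out.append(acc)
    return out
```

Applying the 1-D routine to the eight rows and then the eight columns of a block gives the 2-D transform. The staged structure also shows why the algorithm suits a VLIW engine: within each stage, most additions are independent of one another and of the multiplies.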