Design Article

Picking the right MPSoC-based video architecture: Part 1

Santanu Dutta, Jens Rennert, Tiehan Lv, Jiang Xu, Shengqi Yang, and Wayne Wolf

8/17/2009 5:45 PM EDT

The growing demand for multimedia video processing and its applications both results from and drives the continued development of hardware design and software techniques.

Aided by advancements in very large-scale integrated circuit (VLSI) manufacturing technology that has made possible the integration of increased functionality in smaller circuits, it is primarily the development of novel signal-processing architectures and design techniques that has brought audio, video, graphics, image, speech, and text processing together.

It has also prompted advanced multimedia video applications such as high-definition digital television, digital set-top boxes with time-shift functionality, 3D games, H.26x video conferencing, MPEG-4 interactivity, and so forth.

This series of articles considers MPSoC architectures for advanced video applications. Video applications are rapidly evolving along with the increases in computational power supplied by Moore's Law.

Although MPSoCs must be tailored to their primary application in order to squeeze the maximum amount of performance from the available silicon, the architecture should also be designed for flexibility in order to maximize the utility and longevity of the design.

Because the computational requirements of multimedia video processing are dominated by signal-processing tasks that demand complex, real-time processing of high volumes of data, we take a closer look at some of the recent trends in designing integrated circuits (ICs) for such systems.

We first look at several video applications in order to better understand the requirements placed on video MPSoCs. We of course consider video compression, the dominant application of digital video today. We also look at one of our own applications, the real-time gesture recognition system designed as part of the Princeton Smart Camera Project, as an example of a new generation of video applications.

We then spend a great deal of time identifying some of the recent trends in the design of multimedia SoCs and use the Philips Nexperia™ Home Entertainment Engine as a case study.

The specific topics touched on are: processor architectures, central processing unit (CPU) configurations, system and chip integration, intellectual property (IP) reuse, platform-based designs, communication bus structures, and design-for-testability (DFT) issues. We close with a brief discussion of trace-driven analysis of applications and architectures as part of the design of video MPSoC architectures.

Algorithms for compression/decompression
Video compression has been developed over the last 30 years and is now a mass-market item. Satellite television, terrestrial digital television, digital video cameras, and personal video recorders all make use of video compression methods. The original video compression systems occupied racks of equipment. Today, a great deal is known about how to reduce video compression algorithms to VLSI.

Most image and video compression methods are lossy. These methods take advantage of the fact that the human visual system is not equally sensitive to all features and changes to an image or image sequence.

Video and image compression methods evolved in parallel in their early days, but modern video standards make use of image compression techniques. The best-known family of image compression standards is JPEG. The JPEG-2000 standard incorporates wavelet-based compression, but a basic image compression technique that is also used in video compression is the discrete cosine transform (DCT).

The DCT is a frequency transform that is applied to blocks of images, typically 8 x 8 blocks. The DCT yields frequency decompositions of the block in two dimensions: x and y. Some of the frequency components can be discarded—a process known as quantization—to reduce the amount of information transmitted.

Perceptual coding strives to throw away coefficients such that the perceptual difference between the original and compressed images will be minimal; this generally means that high-frequency components, which correspond to fine details, are discarded. The discrete cosine transform has the form:
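In the widely used orthonormal convention, the two-dimensional DCT of an N x N block (N = 8 for the blocks described above) can be written as follows. The symbols f (pixel values), F (transform coefficients), and C (normalization factor) are notational assumptions made here for illustration.

```latex
F(u,v) = C(u)\,C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)
         \cos\!\left[\frac{(2x+1)u\pi}{2N}\right]
         \cos\!\left[\frac{(2y+1)v\pi}{2N}\right],
\qquad
C(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & k > 0 \end{cases}
```

Here F(0,0) is the DC value, and larger u and v correspond to higher spatial frequencies along the two block axes.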

Quantization is the term for the elimination (zeroing out) of some transform coefficients. After quantization, a lossless compression method (usually some combination of a Huffman coding and a run-length coding) is applied to the quantized coefficients in order to reduce their representation for transmission or storage.
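As a rough illustration of the transform and quantization steps, the following Python sketch applies a direct (deliberately slow) 8 x 8 DCT and then a uniform quantizer. The function names `dct_2d` and `quantize` and the single step size of 16 are assumptions made for this example; practical codecs use fast transforms and per-coefficient quantization tables.

```python
import math

N = 8  # block size assumed here, matching the 8 x 8 blocks discussed in the text


def dct_2d(block):
    """Direct 2-D DCT-II (orthonormal scaling) of an N x N block given as a list of rows."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization (an illustrative assumption): divide by the step and round.

    Coefficients smaller than half a step become zero, which is the
    "zeroing out" effect described in the text.
    """
    return [[int(round(value / step)) for value in row] for row in coeffs]


# A perfectly flat block has no detail, so only the DC coefficient survives.
flat_block = [[100] * N for _ in range(N)]
quantized = quantize(dct_2d(flat_block), step=16)
```

For the flat block above, the only nonzero quantized value is the DC term `quantized[0][0]`; every other coefficient rounds to zero.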

Because high-frequency components are often the first to be eliminated in a DCT coefficient set, the DCT coefficients are generally read in a zigzag pattern as shown in Figure 14.1 below.

Figure 14.1. A zigzag pattern in DCT coefficients

This pattern reads the coefficients starting at the DC value (the 0,0 coefficient) to the highest frequency component (the 7,7 value). If high-frequency coefficients are zeroed out, this pattern produces longer strings of zeroes than would be true from row- or column-oriented patterns; these longer strings of zeroes can be compressed by lossless compression methods.
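The effect of the scan order on the zero runs can be seen in a small Python sketch. The names `zigzag_order` and `run_length_encode`, the (zeros-before, value) pair format, and the "EOB" end-of-block marker are assumptions chosen for illustration and are not the exact syntax of any MPEG or JPEG bit stream.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col indexes one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)              # even diagonals run bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order


def run_length_encode(values):
    """Encode as (zeros_before, nonzero_value) pairs; "EOB" stands for the trailing zeros."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")
    return pairs


# Hypothetical quantized block: only four low-frequency coefficients are nonzero.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 50, 3, -2, 1

zigzag = run_length_encode([block[r][c] for r, c in zigzag_order()])
row_major = run_length_encode([v for row in block for v in row])
```

With this block, the zigzag scan yields `[(0, 50), (0, 3), (0, -2), (1, 1), "EOB"]`, whereas the row-by-row scan yields `[(0, 50), (0, 3), (6, -2), (0, 1), "EOB"]`: the zigzag order brings the nonzero low-frequency values to the front so that the remaining zeros collect into one long final run.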

Several video compression families exist: the H.26x standards for teleconferencing and the MPEG standards for video broadcast and distribution. Each of these families includes several standards, developed at different times for varying levels of hardware support and bandwidth.

Here we will concentrate on MPEG-style compression, that is, the basic techniques used in MPEG-1 and -2. The MPEG standard defines a bit stream format but does not determine the exact algorithms used to generate those bit streams.

This allows developers to improve their implementations of the standard—for picture quality, compression rate, power consumption, and so on—while maintaining compatibility with other manufacturers' devices. MPEG-1 and -2 are also designed for asymmetric applications, in which the transmitter is assumed to have more computational power than the receiver.

This is typical in broadcast, in which the transmitter is less cost-sensitive than the consumers' receivers; videoconferencing, in contrast, typically uses terminals of equal computational power at each end.

Figure 14.2 below shows the block diagram for an MPEG-1/2 style encoder. MPEG takes advantage of DCT-based compression. The other major compression operation is motion estimation/compensation.

Figure 14.2. Block diagram of an MPEG-1/2 style encoder

Whereas DCT works entirely within a frame, motion estimation compares data between frames. Two important data structures in MPEG video coding are the block (an 8 x 8 set of pixels) and the macroblock (a 16 x 16 set of pixels). DCT is performed on blocks; motion estimation is generally performed on macroblocks.

Motion estimation reduces a macroblock to a motion vector that describes how the macroblock is displaced from another frame. (Block motion estimation handles only translational motion.)

The receiver can then read the macroblock from its position in the reference frame in the receiver's frame store and apply the motion vector to reconstruct the motion-compensated frame.

As shown in Figure 14.3 below, the encoder compares the reference block with a search field at a number of different i,j offsets.

Figure 14.3. Motion estimation and motion vectors.

At each point it computes a two-dimensional correlation:
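One common way to write this computation for a candidate offset (i, j), assuming an N x N macroblock (N = 16 in MPEG-1/2) and pixel indices x and y introduced here for illustration, is:

```latex
\mathrm{SAD}(i,j) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1}
    \left| \, r(x,y) - s(x+i,\, y+j) \, \right|
```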

where r and s are the reference and search macroblocks, respectively. This computation is known as a sum-of-absolute-differences (SAD) computation. The offset with the smallest SAD value is selected to provide the motion vector.

This formula assumes full search of all possible locations in the search area. A number of other search schemes have been proposed to reduce the number of correlations performed during motion estimation.
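The full search described above can be sketched in Python as follows. This is a minimal illustration rather than an encoder's actual implementation: the function names `sad` and `full_search`, the list-of-rows frame representation, and the 4 x 4 block in the demo are all assumptions chosen for brevity (MPEG macroblocks are 16 x 16).

```python
def sad(ref, frame, top, left):
    """Sum of absolute differences between ref and the same-sized window of frame."""
    return sum(abs(ref[y][x] - frame[top + y][left + x])
               for y in range(len(ref))
               for x in range(len(ref[0])))


def full_search(ref, frame, origin, search_range):
    """Test every offset within +/- search_range of origin.

    Returns ((dy, dx), best_sad) for the offset with the smallest SAD.
    Offsets whose window would fall outside the frame are skipped.
    """
    oy, ox = origin
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = oy + dy, ox + dx
            if top < 0 or left < 0 or top + h > len(frame) or left + w > len(frame[0]):
                continue
            cost = sad(ref, frame, top, left)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best


# Synthetic 12 x 12 "reference frame" and a block cut from position (5, 6).
frame = [[(31 * y + 17 * x) % 256 for x in range(12)] for y in range(12)]
block = [row[6:10] for row in frame[5:9]]

# Pretend the block sits at (4, 4) in the current frame and search +/- 3 pixels.
vector, cost = full_search(block, frame, origin=(4, 4), search_range=3)
```

In this synthetic case the search returns the offset (1, 2) with a SAD of zero, after evaluating all 49 candidate positions.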

An astonishing number of algorithms have been developed for motion estimation. A comprehensive review of motion estimation algorithms is beyond the scope of this series, but representative algorithms include three-step search, four-step search, diamond search, one-dimensional full search, and modified log search.

Motion estimation developers measure both the image quality produced by their estimates of motion and the number of search steps. Key to reducing the time required for motion estimation is to reduce the number of points tested.

Fast search algorithms tend to have less uniform memory access patterns and be less pipelineable than simple full search. The most sophisticated algorithms test a small number of candidate positions; not only does the exact pattern of points vary depending on the visual content of the frame, but some algorithms also vary the number of candidate positions tested.
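To make the trade-off concrete, here is a hedged Python sketch of a three-step search, one of the representative fast algorithms named above. The function name `three_step_search` and its interface are assumptions: the `cost` argument is any callable that returns the matching error for an offset (in an encoder this would be the SAD of the macroblock at that offset), and bounds checking against the frame is omitted for brevity.

```python
def three_step_search(cost, start_step=4):
    """Coarse-to-fine search over offsets (dy, dx).

    Each round tests the eight neighbours of the current best offset at the
    current step size, then halves the step. With start_step=4 the search can
    reach offsets up to +/- 7. Returns (best_offset, best_cost, evaluations).
    """
    best = (0, 0)
    best_cost = cost(0, 0)
    evaluations = 1
    step = start_step
    while step >= 1:
        cy, cx = best  # centre of this round's 3 x 3 pattern
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dy == 0 and dx == 0:
                    continue  # the centre was already evaluated
                candidate = (cy + dy, cx + dx)
                c = cost(*candidate)
                evaluations += 1
                if c < best_cost:
                    best, best_cost = candidate, c
        step //= 2
    return best, best_cost, evaluations


# Synthetic, well-behaved error surface whose minimum lies at offset (3, -5).
def toy_cost(dy, dx):
    return abs(dy - 3) + abs(dx + 5)


result = three_step_search(toy_cost)
```

On this synthetic surface the search reaches the minimum with 25 cost evaluations, compared with 225 for a full search over the same +/- 7 window. Which positions are visited depends on the costs found in earlier rounds, which illustrates why such algorithms have less regular memory access than full search. On real image data the error surface is not guaranteed to be this well behaved, so a fast search may settle on a worse match than full search would find.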

Macroblocks often do not appear totally unchanged from one frame to another; to handle this problem, the encoder decodes the frame, compares it with the original, and then produces an error stream to describe the differences between the motion-compensated image and the original. The motion estimator may not find a sufficiently good match for a macroblock, in which case it transmits the block directly.

MPEG-1/2 define three types of frames. The I (for intra) frame does not use motion estimation; it uses only DCT. P (for predictive) frames use motion estimation in forward time: earlier frames are used to predict later frames.

B (for bidirectional) frames use motion estimation in both forward and backward time. Relatively few encoders available today produce B frames because of the large amount of memory and high computation rates required to perform bidirectional motion estimation.

The MPEG bit stream has a rich syntax. The system layer describes the relationship among audio, video, and additional material. The video layer is organized into groups of pictures (GOPs). Each GOP may consist of various combinations of I, P, and B frames.

Compressed video is generally accompanied by synchronized audio, so a word about MPEG audio encoding is appropriate. MPEG-1 defines three layers or levels of audio encoding. Layer 1 is the simplest layer; it applies subband coding followed by Huffman coding.

A subband coder uses a filter bank to decompose the input into multiple frequency bands. Each of the subbands tends to have higher correlation than the entire stream, allowing more efficient entropy coding. Layer 2 adds quantization to layer 1. Layer 3 is the most complex; it applies perceptual coding to achieve high-quality encoding at relatively low bit rates. (The term MP3 is derived from MPEG audio layer 3.)

MPEG-4 includes object-based encoding. Various coding operations can be performed on arbitrarily shaped blocks, not just square regions. Although some of the promise of MPEG-4 remains unfulfilled, object-based coding has become popular in multimedia applications such as DVD menus.

MPEG-7 is designed to describe multimedia libraries. MPEG-21 concentrates on rights management. The other numbers are unused.

Video Recognition
Video recognition builds on image recognition techniques. Image recognition relies on a hierarchy of operations, such as color segmentation, edge detection, and contour generation. Video adds motion information that can often be very useful in identifying important subjects.

A number of recognition problems have been defined. Human recognition is clearly an important category. Separate techniques have been developed for other types of subjects, ranging from animals to mechanical components.

Face detection and face recognition are two important human recognition problems. A face detector determines when a face is visible. A face recognizer, in contrast, identifies a person based on facial features.

In practice, most face recognizers will need face detectors because they will operate in relatively unstructured environments with people in a variety of positions, and so on. Gesture recognition is another important category of human recognition that identifies positions or movements of the body and classifies them into known types of gestures.

As systems-on-chips become more powerful, we expect to see new applications of digital video that will place their own demands on MPSoC architectures. Figure 14.4 below shows the basic flow of the gesture recognition system developed at Princeton University.

Figure 14.4. Block diagram of gesture recognition process.

Each frame is processed through five major phases. The first phase subtracts the background and classifies pixels by color, identifying flesh-tone and non-flesh-tone pixels. The contour-following phase extracts a continuous contour for each major region.

Ellipse fitting creates a closed curve to represent each contour. Graph matching creates an annotated graph that describes the relationships of the curves generated by ellipse fitting, and then compares the graph against a library of known graphs in order to determine the identity of the regions: head, torso, and so on.

We then apply hidden Markov models (HMMs) to the major body parts in order to analyze their behavior over time. The results of the HMMs are combined in a classifier to determine finally the gesture being performed at that stage.

This application requires an even greater diversity of processing than does video compression. Although some of the early operations are performed on 24-bit full-color pixels, later stages can be performed on 1-bit or 2-bit pixels that represent various color classifications. At even later stages we operate on graph models. The final stage uses a significant amount of floating-point arithmetic.
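The first phase and the drop in pixel width can be illustrated with a minimal Python sketch of background subtraction. Everything specific here is an assumption for illustration: the function name `foreground_mask`, the summed per-channel absolute-difference test, and the threshold of 30 are not taken from the Princeton system, whose actual classifier is not described in this article.

```python
def foreground_mask(frame, background, threshold=30):
    """Return a 1-bit mask: 1 where an (R, G, B) pixel differs from the background model.

    frame and background are lists of rows of (R, G, B) tuples with 8 bits per
    channel, i.e. 24-bit pixels in and 1-bit pixels out.
    """
    mask = []
    for frame_row, bg_row in zip(frame, background):
        mask.append([
            1 if sum(abs(p - q) for p, q in zip(pixel, bg_pixel)) > threshold else 0
            for pixel, bg_pixel in zip(frame_row, bg_row)
        ])
    return mask


# Tiny 2 x 2 example: one pixel differs strongly from a uniform dark background.
background = [[(10, 10, 10), (10, 10, 10)],
              [(10, 10, 10), (10, 10, 10)]]
frame = [[(12, 9, 11), (200, 150, 120)],
         [(10, 10, 10), (10, 14, 10)]]

mask = foreground_mask(frame, background)
```

Only the strongly different pixel is marked as foreground; later phases such as contour following can then work on this much narrower data type.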

Architectural Approaches to Video Processing
Early architectures for video concentrated on optimizing a single operation. To this end, different types of array processors and single-instruction multiple data (SIMD) machines were proposed for video operations, and SIMD has been widely used for motion estimation.

Figure 14.5 below shows the SIMD motion estimation architecture proposed by K.M. Yang and his co-researchers. This machine uses an interconnection network to shuffle the data values as required between processing elements (PEs) in order to minimize the number of frame-memory accesses to pixels.

Figure 14.5. A proposed motion estimation architecture

SIMD machines can implement regular video operations very efficiently. However, they generally offer only limited programmability. Furthermore, they generally restrict the maximum sizes of certain data objects. Although size restrictions may be acceptable for systems built to implement established standards, they may not accommodate new standards or applications.

As video algorithms became more sophisticated and chip sizes increased, attention moved to chips that implemented more complete video applications. In general, these early VLSI video systems were built as heterogeneous multiprocessors.

In some of the earliest systems, every major operation was built as a separate processing element. Since different blocks received very different utilizations, this caused some hardware to become idle. As chips became larger, lower rate operations tended to be swept into programmable multimedia processors.

Multimedia processors have been defined as a class of programmable processors that provide "multimedia on a chip" and are meant to accelerate the simultaneous processing of several different multimedia data types.

Early implementation of such multimedia-specific computing engines saw an effort toward supporting the multimedia applications (e.g., MPEG video playback, JPEG still image display, and so on) on general-purpose processors.

However, owing to the inefficiency inherent in trying to map multimedia operations (e.g., multiply-accumulate, saturation arithmetic, and so on) and data types (e.g., audio and video data requiring 8, 10, 16, 24, or 32 bits of precision) onto general-purpose computers, this approach soon gave way to multimedia enhanced processors, whereby the instruction-set architecture of a general-purpose processor was augmented via multimedia-specific instructions.

Instruction-set extensions to CPUs also take advantage of SIMD structures to offer improved implementation of regular video operations on traditional CPUs. The CPU's datapath is split into subwords. The major change to the datapath is to allow for the carry chain to be cut during subword operations.

The CPU instruction is applied to the split datapath: for example, an ADD instruction applied to operands that are split into four subwords will provide four results, all packed into the destination. Two of the earliest adopters of the philosophy were the PA-RISC from HP and the UltraSPARC from Sun.

Other quick inclusions in this exalted company were the MMX™-enriched Pentium processors from Intel, the Media-GX processor from Cyrix, and the K6 processor from Advanced Micro Devices.

In spite of offering a significant performance boost to general-purpose processors, however, the multimedia-enhanced processors had difficulty keeping up with the constant evolution of multimedia standards, sophisticated algorithms, and new applications.

This led to the shift toward developing programmable single-chip multimedia processors such as the Mpact from Chromatic Research (now a part of ATI Technologies), the TriMedia from Philips, the Alpha 21164 [597] from Digital Equipment Corporation, and the advanced products from Texas Instruments (TI).

Programmable architectures comprise both functional and memory (both on- and off-chip) units that allow processing of different tasks under software control, thereby trading area for flexibility.

Programmability thus incurs additional cost, not only in hardware for extra control units and program storage but also in software development. The big advantage, however, is that not only can many different algorithms now run on the same programmable hardware, but the flexible control mechanism can also support execution of algorithms with irregular and unpredictable data and operation flows.

Such programmable media-processing architectures are typically designed to utilize the data-, instruction-, or task-level parallelism inherent in the application and algorithms; special instructions and/or hardware are also designed at times to improve the processing efficiency.

The basic multimedia MPSoC architectures
Based on the design and operational principles involved in making effective use of the available parallelism, the following architectural concepts have found widespread use in the media-processing world:

SIMD: SIMD stream architectures are based on data parallelism. They are characterized by multiple datapaths executing the same operation in parallel on different data entities. An example of the SIMD concept can be found in the Multimedia Video Processor (MVP) from Texas Instruments.

Split-ALU: The split-arithmetic and logic unit (ALU) concept makes use of subword parallelism, whereby a number (e.g., four) of lower precision (e.g., 8-bit) data items are processed in parallel on a higher precision (e.g., 32-bit) ALU.

Of course, the ALU needs hardware extensions, for example, to prevent carry signals in addition operations from propagating across data boundaries. An example of the split-ALU concept can be found in Sun's Visual Instruction Set (VIS) design for its UltraSPARC processors.

VLIW: A very long instruction word (VLIW) machine provides a means to exploit instruction-level parallelism of multimedia algorithms by specifying, in a long instruction word, the concurrent execution of multiple operations on multiple functional units.

In contrast to superscalar machines that also try to extract instruction-level parallelism (dynamically in hardware), VLIW processors employ static instruction scheduling performed by software at compile time. An example of a VLIW machine is the Philips TriMedia processor.

MIMD: The multiple instruction and multiple data (MIMD) stream architectures try to exploit parallelism at both the task and the data level. An MIMD machine features multiple datapaths, each having its own control unit; different datapaths, therefore, can be programmed at the same time to perform different processing on different data streams.

MIMD machines can be further classified as tightly coupled with a shared memory or loosely coupled with a distributed memory. The SGI Power Challenge [602], from Silicon Graphics Inc., is an example of a shared-memory MIMD machine.

Specialized instructions: The idea here is to study specific multimedia algorithms, identify common operations, and introduce special hardware functional units in order to replace a longer sequence of standard, frequently occurring instructions by a specialized hardware-supported instruction in an effort to reduce instruction count and speed up program execution.

An example is Intel's MMX™ technology designed to accelerate multimedia applications for the Intel Pentium processors.

Co-processors: By incorporating one or more separate dedicated hardware modules adapted to specific tasks, co-processors allow execution of regular, compute-intensive tasks on dedicated hardware, whereas the less compute-intensive but irregular control and processing tasks are executed on one or more programmable processor cores.
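Of the concepts listed above, the split-ALU idea is the easiest to demonstrate in a few lines. The Python sketch below emulates in software what a split datapath does in hardware: it adds four unsigned 8-bit subwords packed into a 32-bit word while keeping carries from crossing lane boundaries. The function name `packed_add_u8` is an assumption for illustration, and the masking trick is a software stand-in for the cut carry chain, not a description of any particular processor's circuitry.

```python
LOW7 = 0x7F7F7F7F   # the low seven bits of every 8-bit lane
HIGH = 0x80808080   # the top bit of every 8-bit lane


def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping within each lane."""
    # Adding only the low seven bits of each lane can never carry into the next lane.
    partial = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then combined by XOR, which discards its carry-out.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


a = 0x01FF7F10
b = 0x0101010F

packed = packed_add_u8(a, b)        # lane-wise: 0x01+0x01, 0xFF+0x01, 0x7F+0x01, 0x10+0x0F
ordinary = (a + b) & 0xFFFFFFFF     # plain 32-bit add, where carries ripple across lanes
```

Here `packed` is 0x0200801F, because the 0xFF + 0x01 lane wraps to 0x00 without disturbing its neighbour, whereas `ordinary` is 0x0300801F, because the carry out of that lane leaks into the lane above it.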

We conclude from the above discussion that a variety of different processor architectures have been conceived and designed to date to support the emerging demands of multimedia.

However, as new algorithms and applications continue to put increasing demands on multimedia processors, in terms of dealing with data dependence, multiple media streams, and irregular data and control flow, exploration of even more innovative architectural concepts has become necessary. Simultaneous multithreading and reconfigurable computing are some of the current approaches.

Next in Part 2: Optimal CPU configurations and interconnections.

This series of articles is based on copyrighted material submitted by Santanu Dutta, Jens Rennert, Tiehan Lv, and Guang Yang to "Multiprocessor Systems-On-Chips," edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.

Santanu Dutta is a design engineering manager and technical lead in the connected multimedia solutions group at Philips Semiconductor, now NXP Semiconductor. Jens Rennert is senior systems engineer at Micronas GmbH. Tiehan Lv attended Princeton University, where he received a PhD in electrical engineering. He also has B.S. and M.Eng. degrees from Peking University. Guang Yang is a research scientist at the Philips Research Laboratories.

