Design Article

IMG1

H.264 video encoder motion search implementation on a programmable accelerator

Joe Hanson, Director of Technical Marketing, and Ricardo E. Gonzalez, Architect, Stretch Inc.

5/18/2007 2:45 AM EDT

The significant coding efficiency of H.264 video compression technology opens a wide range of new applications for streaming video over a variety of media. This increase in coding efficiency is due in part to the more effective use of inter-frame motion compression. This is achieved through the use of more reference pictures and variable block-size motion compensation.

Using the sum-of-absolute-differences (SAD) computations for macro-block comparisons, H.264 encoding supports the displacement of a reference block in 41 possible motion vectors. Even for standard definition video, the number of computations required exceeds the capabilities of most DSPs. To meet the computation requirements, designers have implemented complex architectures, such as mixing DSPs with FPGAs. These solutions have long development times and restrict the designer's ability to make algorithm enhancements without also making hardware changes.

The recently announced Stretch S6000 family of software configurable processors has been specifically optimized for high-performance video and wireless signal processing and features three technology innovations: the second-generation Instruction Set Extension Fabric (ISEF), the Programmable Accelerator, and the Processor Array. This article discusses the Programmable Accelerator.

The Programmable Accelerator consists of a series of highly optimized hardware primitives that can accelerate software implementations of various algorithms, including entropy coding, motion search, cryptography, and audio compression and processing.

Let's begin by providing an overview of the compute and bandwidth requirements for motion search in H.264. We'll then look at the software implementation of motion search that takes advantage of the hardware features of the Programmable Accelerator, and, finally, provide a summary of the performance.

H.264 motion search
As in most video compression algorithms, H.264 allows for a video frame to be encoded using only the current picture. Although this intra-coded picture, or I-frame, provides a minimal amount of compression, it does provide an anchor point that other frames can reference. Significantly improved encoding efficiency can then be realized using inter-frame prediction. Inter-frame prediction compares pixels in one frame to pixels from a previous frame (or frames), determines object displacement, and encodes only the motion vector and any difference between the two frames. Motion-based video compression is common. The same basic technique is used in MPEG2. However, H.264 supports searching for motion across a higher number of reference frames and the encoding of motion vector for blocks smaller than 16x16 pixels.

Next: Motion Search Calculations
Motion Search Calculations
Although the H.264 motion search calculation does not present a challenge in mathematical terms, it does in the sheer number of required calculations. An object's displacement, i.e. motion, is found by computing the SAD between the source and the reference frames using 16x16 blocks. Because objects in the frame can move in any direction and by any amount, this search is a very computationally intensive operation.

Advanced coding standards, such as H.264, support motion compensation on a 16x16 macroblock and smaller sub-blocks--8x16, 16x8, 8x8, 4x8, 8x4 and 4x4. For a single 16x16 macroblock, 41 possible motion vectors must be computed, as shown in Figure 1. For an H.264 macroblock SAD (or MB-SAD), it is necessary to compute 256 absolute differences between the source and the reference macroblocks at every pixel in the search area, as well as the SADs of every sub-block type, i.e. 16 at 4x4, 8 at 8x4, 8 at 4x8, etc.).


View full size

Figure 1: sub-macro block divisions

Computing the sum-of-absolute-differences between the various sub-blocks provides an approximation of the residue required to encode the information. However, when computing the cost of any particular choice, the cost of encoding the motion vector and the reference frame index must also be taken into account. Smaller values can be encoded using fewer bits. It is therefore common to compute a cost function using the following equation,

Costi= SADi + a (|x| + |y|) + b

where x and y represent the motion vector, a is the weighting function, and b represents the cost of encoding the reference frame index (the distance in time between the two frames).

The final step is to determine the optimal encoding for the entire (16x16) macro-block. Using 4x4 sub-blocks, for example, may result in smaller residue, but requires more motion vectors. This step must adhere to the H.264 restrictions regarding how macro-blocks can be encoded. For example, it is not allowed to encode a macro-block quadrant as a combination of 4x4 and 8x4 sub-blocks.

Next: Searching the video image
Searching the video image
The H.264 standard does not specify how motion search must be performed and there are many competing algorithms: full search, hierarchical search, diamond search, heuristic search, etc. It is beyond the scope of this document to describe all the competing algorithms, so instead it will focus on two of the more common algorithms implemented: full search and hierarchical search.

Full search looks for the source macro block in a rectangular region bounded by some coordinates. It computes all 41 cost vectors for every possible pixel position within the rectangular search region and determines which provides the best encoding opportunity. The horizontal and vertical search area determines the number of computations required.

Hierarchical search first performs a search over a decimated frame. Using the best motion vector found in the decimated frame as the starting point, the algorithm then performs a search over the original frame. Variations of the algorithm include decimating the frame n-times prior to the first search. When implementing a hierarchical search it is common to keep multiple vectors at each intermediate level. This provides more opportunities for finding a good match at the higher levels. For example, one can decimate the image two times in both the horizontal and the vertical dimensions and perform the initial search over a frame that is 1/16 of the original.

When performing the lower-levels of a hierarchical search, the best three motion vector candidates are kept. For each of these candidates, the search continues over a frame that is 1/16 the original. Again, the three best motion vectors are computed. For each of these candidates, the search proceeds over the original frame. From this operation, the single best motion vector can be determined. For sub-PEL accuracy, the image can then be interpolated and searched again within a smaller search range.

Data bandwidth
Performance is limited by not only computational capacity. A nave implementation of motion search is also limited by memory bandwidth. As an example, if the implementation simply moved a pointer by one pixel, the implementation would load a new reference block prior to each SAD calculation. A search range of 100x100 requires 10,000 load operations of 256 byte each to complete one macroblock search. Thus it is necessary to maximize data reuse to minimize bandwidth consumption.

Furthermore, memories provide significantly higher bandwidth when data is read in row-order. Accessing data in column-order can degrade bandwidth by an order of magnitude. The hardware support in the Programmable Accelerator was designed to minimize bandwidth consumption and maximize bandwidth available.

To achieve this, the Programmable Accelerator loads two side-by-side macroblocks of the reference image (pixel positions 0 - 31 from rows 0 - 15 of the search area). This data can be used to compute 16 full MB-SADs (with each MB-SAD computing the 41 motion vectors of interest). The hardware automatically keeps track of the minimum cost for each of the 41 vectors and the vector associated with the minimum. Once the 16 MB-SADs have been computed, it is only necessary to load the next row of data (row 16) and discard the top row (row 0). The data can then be used to compute another 16 full MB-SADs. Reusing the data in this fashion reduces the memory requirements to 1-byte per MB-SAD (discounting the initial load of data).

Next: The Algorithm Proceeds, Instructions
The Algorithm Continues
The algorithm proceeds downwards until it reaches the boundary of the search area. At this point, the motion vectors for the left-most 16-pixel column have been fully computed and it is necessary to move 16-pixels to the right. Loading the next reference macro-block to the right and then traversing up until reaching the top of the search area reduces bandwidth usage. This zigzag traversal repeats until reaching the right-most edge of the search area. Software can then read the optimal vector for each sub-block type from hardware registers and determine the best combination of sub-blocks to encode.

The Programmable Accelerator allows overlapping of load and computation, although there is some startup overhead to load in the source macro-block and the first two reference macro-blocks. The 48 load operations required cannot be overlapped with computation. This constant startup overhead means this method favors large search regions.


Figure 3: Minimizing data transfers for the search area

The SAD instructions support both 8-bit and 10-bit video data formats. Motion search performed using only the luma component and the 8-bit format requires only one byte per pixel. The Programmable Accelerator uses a load / store architecture with128-bit wide registers. Therefore, a single load operation can fetch an entire row of a macroblock.

Instructions
As described, each step in the motion search algorithm can be partitioned into three components: loading of images, computing the SADs, and manipulating the reference images for efficient processing. The Programmable Accelerator provides VLIW instructions for performing each of these specific functions. The programmer can then describe the algorithm as

Load_Reference_Block(); Compute_SAD(); Rotate_PIC_Block()

and the compiler/assembler will schedule and group the instructions appropriately. These instructions provide the acceleration needed to perform the algorithm. Stretch has used these instructions to develop a motion search function that is exposed through APIs for the system programmer. The programmer controls the search algorithm, the search range, and the cost functions for the motion vectors.

Next: Results
Results
The 256 pixel SAD computation, sub-macroblock summations, and motion vector weighting execute as a single instruction in the processor pipeline. The pipeline can sustain single cycle MB-SAD computations. Adding in the data transfer overhead, the processor can sustain 250 million MB -SADs / second or 64 billion pixel SADs / second.

Table 1 shows the utilization percent of the Programmable Accelerator for common video image resolutions and frame rates. The numbers are for a hierarchical motion search algorithm starting with a 4x decimated image of the original frame. For utilizations greater than 100 percent, an additional processor or processors would be required.


View full size

Table 1: Programmable Acceleration Utilization

Conclusion
This article described the requirements for using SAD calculations for motion search of video streams and a hardware accelerator for motion search implemented as extension instructions in the Stretch S6000 processor family. This high-performance motion search engine computes all 41 possible H.264 motion vectors for a 16x16 macroblock at a sustainable rate of 250 MB-SADs/ second. The Programmable Accelerator and its motion search engine, operating in conjunction with the software configurable processor in the S6000 family, are exposed to the programmer through a series of APIs that selects the search algorithms, controls the search range, and sets the costing functions for motion vectors selection. The implementation is scalable from standard definition to high-definition video streams and for surveillance to broadcast quality applications.

About the authors
Joe Hanson is the Director of Technical Marketing for Stretch Inc. He is responsible for developing the eco-system of technology supporting Stretch software configurable processors in compute intensive applications. Prior to Stretch, Mr. Hanson worked at Altera, where he held a variety of positions, including director level positions in applications and marketing for the Excalibur embedded processors and system level design tools. He holds Bachelor of Science degrees in Biological Sciences from Florida State University and Electrical Engineering from University of Alabama at Birmingham. Mr. Hanson holds three patents and has over 20 years of design experience in real-time data acquisition systems for medical and scientific instrumentation and professional audio equipment. He can be reached at hanson@stretchinc.com

Dr. Ricardo E. Gonzalez is a Principal Partner and Founder of Silicon Artisans, an innovative consulting firm that specializes in architecture and design of semiconductor circuits. Prior to founding Silicon Artisans, Dr. Gonzalez was an architect at Stretch, Inc., where he led the development of the S6 Software Configurable Processor Architecture. Before joining Stretch Dr. Gonzalez was a member of technical staff at Tensilica, Inc. He was a founding member of the hardware team and led the development of various configurable and extensible processors. Dr. Gonzalez also worked at Intel's Microcomputer Research Labs (MRL), where he explored novel branch architectures for high-performance processors. Dr. Gonzalez received his BS, MS, and PhD in Electrical Engineering from Stanford University in 1990, 1992 and 1997, respectively. His dissertation explores energy dissipation and energy efficiency in microprocessors.


print

email

rss

Bookmark and Share

Joinpost comment




Please sign in to post comment

Navigate to related information

Most Popular

Product Parts Search

Enter part number or keyword
PartsSearch


FeedbackForm