Design Article
H.264 video encoding with Stretch's S5000 software-configurable processor
Joe Hanson, Stretch Inc.
10/24/2005 1:15 AM EDT
Such consumption of bandwidth is highly impractical as well as unnecessary for most applications. Additionally, in network-based video applications, bandwidth has a direct impact on reliability. The larger a video stream, the more packets the network must transport successfully within increasingly tight latency constraints, which makes the stream more vulnerable to delay and subsequent packet loss. It is also critical that real-time video streams be able to coexist with other real-time streams such as voice data.
Lossy compression algorithms provide the most cost-effective means of retaining quality while lowering bandwidth requirements. Reducing screen resolution significantly reduces bandwidth, but in many cases the resulting reduction in image quality is unacceptable.

Figure 1: H.264 Encoder Architecture
Explicit vs Configurable Implementations
The MPEG video standard, through successive innovations, has continued to increase image quality while lowering bit rate through the use of increasingly complex algorithms. H.264, also known as MPEG-4 Part 10, was designed specifically to facilitate reliable transport of video over IP networks while delivering equivalent or better image quality than MPEG-2 at a substantially higher compression ratio.
The significant coding efficiency of H.264 enables a wide range of new applications for streaming video over a variety of media. Like other MPEG incarnations, the H.264 codec implementation is not explicitly defined. While the standard defines the syntax of the encoded bitstream and the method for decoding the bitstream, developers have the opportunity to introduce significant value through innovation and refinement of their codecs and their ability to deliver reliable real-time encoding of video.
To achieve these goals, developers require additional tools, new algorithms, and more computational capacity. From a technical perspective, the primary differences between H.264 and other MPEG standards are the use of multiple reference frames, wider search ranges, and smaller macro blocks for motion estimation, all of which ultimately translate to a heavier computational load. The efficiency of the encoder (see Figure 1) can be attributed to:
Intra-prediction utilizing Forward and Inverse Discrete Cosine Transforms (DCT), as well as Forward and Inverse Quantization
De-blocking filtering
Motion Estimation utilizing inter-frame comparisons to ½ PEL and ¼ PEL accuracy
The H.264 standard supports motion estimation on blocks ranging from 4x4 to 16x16 pixels, as well as residual data transforms on 4x4 blocks with a modified integer discrete cosine transform (DCT) to avoid rounding errors.
The Necessity for Hardware Acceleration
Each of these functions requires extensive processing that must be performed in real-time to be useful. Clearly, this calls for hardware acceleration. However, given that OEMs provide significant value in how they implement these functions, a programmable platform is essential to provide developers with the flexibility they require to continue to refine their algorithms and increase their competitive edge. At this stage of H.264 adoption, where codec implementation is a key differentiating factor, fixed ASICs are simply too inflexible.
While today's CPUs continue to keep pace with Moore's Law, traditional programmable processors do not have architectures that are well-suited for video processing. One alternative approach is to introduce hardware acceleration units called through intrinsic instructions; examples of such instructions include Intel's MMX/SSE2 and AMD's 3DNow extensions.
While hardware acceleration units can be designed to efficiently offload block-based and pixel-level processing tasks that are not well-suited to CPU architectures, such tasks tend to be very dataflow intensive. As a consequence, deep register pipelines with fast memory access are required to achieve the required real-time efficiency. For example, the intermediate results of Sum of Absolute Differences (SAD) calculations (covered in more detail below) for motion estimation do not fit well within the limited register space of traditional CPU and DSP architectures. Additionally, base implementations require hand-optimized assembly coding to maximize efficiency, a time-consuming process that creates an inflexible architecture-specific implementation that is difficult to build and innovate upon over time.
Programmable logic devices are often perceived to provide sufficient flexibility to effectively implement evolving proprietary algorithms. Such architectures typically employ a programmable processor to manage application-level tasks while a programmable logic device such as an FPGA manages the flow of real-time data. Such architectures, however, introduce new inefficiencies.
First, developers must decide upon an interconnect between the processor and programmable logic device. This choice of interconnect determines the throughput and therefore the efficiency of the overall architecture. If the interconnect operates synchronously, the hardware designer is burdened with stringent timing requirements. Alternatively, asynchronous interconnects introduce significant latency into the system, as well as forced processor stalls in order to complete the data exchange. Such an interface can also expose limitations in hardware handshaking, which further reduce the overall effective bandwidth of the interconnect.

Figure 2: Stretch Software-Configurable Processor
Using discrete devices (see Figure 2), the FPGA acts as a coprocessor for which the CPU must prepare and hand off data. As a consequence, the CPU must wait for results from the FPGA, creating an interdependency and latency that weakens the effect of pipelining operations. Additionally, system design is spread across two development environments, one for the FPGA and one for the CPU.
Often, the FPGA architecture is based on an existing software architecture that fails to meet current performance requirements. In order to improve performance, the hardware team recodes the critical computational blocks of the software algorithms in a hardware description language (HDL). Not only does such an implementation require a separate verification methodology, any changes to the software algorithm or context of the application—even something as conceptually simple as modifying the screen resolution—propagates changes down to the FPGA coprocessor.
This “trickle down” effect can significantly impact the timely delivery of a product to the market, especially for those applications with algorithms and standards that are continuing to evolve.
Integrated Programmable Logic
Software-configurable processors (SCPs) offer effective parallel operation by integrating programmable logic within the processor data path (see Figure 2). Entire functions written in C/C++ are compiled to form extension instructions that reside in the programmable fabric, enabling the efficient handoff of data. This allows the algorithm developer to flexibly and effectively manage changes to the algorithm through a single programming language (see Figure 3). Because the compute-intensive hardware implementation of an algorithm is generated at software compile-time, there is no need to hand-recode in assembly language or to hand off functioning C/C++ code to a hardware engineer to redesign, rewrite, and add a complex processor interface to its logic.

Figure 3: Development Flow for Stretch Software-Configurable Processor
For example, the Stretch S5000 family of software-configurable processors, based on the S5 compute engine, combines a RISC processor with Instruction Set Extension Fabric (ISEF) programmable logic resources integrated within the processor’s data path. Custom instructions are equivalent to native instructions, as they are accessed through the processor’s instruction pipeline. A 128-bit wide register set efficiently passes large amounts of data and context to the ISEF, enabling multiple data items to be processed in parallel while eliminating the bandwidth bottlenecks inherent in coprocessor-based architectures.
Through the use of extension instructions, software-configurable processors provide a substantial increase in performance when processing complex algorithms. The S5 compute engine, for example, running at 300 MHz achieves an out-of-box EEMBC Telemark benchmark score of 4.6. With the addition of extension instructions based on the original C/C++ and implemented in hardware by an optimizing compiler, the S5 outperforms assembly-language-optimized multi-GHz processors with a score of 877.
H.264 Acceleration in Action
The intensive computational requirements of algorithms used in H.264 provide an excellent example of how flexible hardware acceleration can provide substantial increases in overall system performance. Significant acceleration can be achieved by taking advantage of inherent parallelism in algorithmic processing.
Consider processing 4x4 blocks of luma pixels through a 2-D discrete cosine transform (DCT) and quantization step (see Figure 4 for the H.264 DCT and quantization matrices). The DCT matrix calculation can be reduced to 64 add and subtract operations by exploiting symmetry and common subexpressions, all of which can be combined into a single ISEF instruction. Quantization can be accelerated by avoiding cycle-intensive division operations, instead substituting simple multiply and shift operations. Altogether processing requires approximately 594 additions, 16 multiplies, and 288 muxes/decisions.
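The operation counts above can be made concrete with a plain software model. The following is a minimal sketch in Python, not ISEF code: it shows how a 4-point butterfly brings the 4x4 forward transform down to 64 add/subtract operations (8 butterflies of 8 operations each), and how quantization can be done with a multiply and a shift instead of a division. The function names are illustrative, and the multiplication-factor table is the one commonly tabulated for H.264 encoders; treat it as an assumption and check it against the standard before relying on it.

```python
# Forward 4x4 core transform matrix (integer approximation of the DCT).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

# Multiplication factors indexed by QP % 6. Columns are position classes:
# 0 = both indices even, 1 = both indices odd, 2 = mixed.
# Assumption: values as commonly tabulated; verify against the standard.
MF = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
      (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559)]


def butterfly_1d(a, b, c, d):
    """4-point transform using 8 add/subtract operations and 2 shifts."""
    s0, s1 = a + d, b + c
    s2, s3 = b - c, a - d
    return [s0 + s1, (s3 << 1) + s2, s0 - s1, s3 - (s2 << 1)]


def forward_transform_4x4(block):
    """Transform rows, then columns: 8 butterflies x 8 ops = 64 add/subtracts."""
    rows = [butterfly_1d(*r) for r in block]
    cols = [butterfly_1d(*c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def quantize_4x4(coeffs, qp, intra=True):
    """Quantize each coefficient with a multiply, an add, and a shift."""
    qbits = 15 + qp // 6
    offset = (1 << qbits) // (3 if intra else 6)
    out = []
    for i, row in enumerate(coeffs):
        out_row = []
        for j, w in enumerate(row):
            if i % 2 == 0 and j % 2 == 0:
                cls = 0
            elif i % 2 == 1 and j % 2 == 1:
                cls = 1
            else:
                cls = 2
            level = (abs(w) * MF[qp % 6][cls] + offset) >> qbits
            out_row.append(-level if w < 0 else level)
        out.append(out_row)
    return out


def reference_transform(block):
    """Plain matrix form CF * X * CF^T, used only to check the butterflies."""
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    cf_t = [list(r) for r in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)
```

In a software-configurable processor, the eight butterflies would be unrolled into parallel logic rather than executed in sequence; the sketch only demonstrates that the arithmetic needs no multiplies in the transform and no divides in the quantizer.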

Figure 4: 4x4 DCT and Quantization
A standard processor performing these operations on a 4x4 block requires over 1000 cycles. Since the ISEF’s 128-bit bus can load eight 16-bit prediction values in a single cycle, the optimizing compiler is able to exploit the inherent parallelism of the algorithm to perform the required operations as a single ISEF instruction. The same overall processing can be performed by a software-configurable processor in 105 cycles, resulting in more than a 10X increase in performance.
In terms of a real-time H.264 application, this means that these functions can be implemented on a 300 MHz software-configurable processor for a 720 x 480 @ 30 frames per second video stream while requiring only 14.2% CPU utilization. Even greater acceleration is possible by increasing parallelism: working with larger subblocks reduces the overall cycle count. For example, operating on two 4x4 blocks in parallel cuts execution time in half, dropping CPU utilization to 7.1%.
Another H.264 algorithm suitable for hardware acceleration is de-blocking. The key to accelerating de-blocking in a software implementation is to minimize the conditional code, and therefore the number of cycles, required to determine which values to calculate. From a hardware acceleration perspective, however, it is actually more efficient to create a single ISEF instruction that calculates all of the results in hardware and then chooses the appropriate result. This can be achieved by reordering the 128-bit result passed from the IDCT stage to the de-blocking extension instruction.
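The contrast between the two styles can be illustrated with a simplified model of the normal-strength edge filter for the two samples nearest the edge. This is a sketch under stated assumptions, not the extension instruction itself: it assumes a boundary strength below 4, takes the clipping limit tc as an already-derived input, and leaves out the optional p1/q1 updates. The function names are hypothetical.

```python
def clip3(lo, hi, v):
    """Clamp v to the inclusive range [lo, hi]."""
    return max(lo, min(hi, v))


def filter_edge_branchy(p1, p0, q0, q1, alpha, beta, tc):
    """Software style: test first, and compute only when filtering applies."""
    if abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta:
        delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
        return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)
    return p0, q0


def filter_edge_select(p1, p0, q0, q1, alpha, beta, tc):
    """Hardware style: always compute both candidates, then select one."""
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    filtered = (clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta))
    unfiltered = (p0, q0)
    # The three comparisons are independent and can be evaluated side by side.
    use_filter = ((abs(p0 - q0) < alpha)
                  & (abs(p1 - p0) < beta)
                  & (abs(q1 - q0) < beta))
    return filtered if use_filter else unfiltered
```

Both functions return the same values; the second simply has no data-dependent path through the arithmetic, which is the property that lets every result be computed in parallel logic with a final multiplexer choosing among them.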
Additional acceleration is possible through precalculation of macro block parameters (including bS, α, β, tc0, and chromaEdgeFlag) using state registers inside the ISEF. The inner loop of the de-blocking filter can compute two edges per instruction, requiring three cycles each for the horizontal and vertical filtering and about 20 cycles for loop overhead. Since the same extension instruction can be used both horizontally and vertically, there is zero overhead to switch between them. Given that each macro block has approximately 64 edges, 416 cycles (64 / 4 * 26) are required per macro block. Processing for a 720 x 480 @ 30 frames per second video stream requires 16.8 Mcycles/second, or approximately 5.2% processor utilization.
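The cycle budget can be reproduced with a few lines of arithmetic. The sketch below reads the per-block figure as applying to one 16x16 macro block, so a 720 x 480 frame holds 45 x 30 = 1,350 of them. The divisor of 4 and the 26 cycles per pass (3 + 3 + 20) are taken directly from the expression in the text; the function and parameter names are hypothetical.

```python
def deblock_cycles_per_second(width, height, fps, edges_per_mb=64,
                              edges_per_pass=4, cycles_per_pass=26):
    """Estimate de-blocking cycles per second for a given video format."""
    macroblocks_per_frame = (width // 16) * (height // 16)
    cycles_per_mb = edges_per_mb // edges_per_pass * cycles_per_pass
    return macroblocks_per_frame * cycles_per_mb * fps


# 1,350 macro blocks x 416 cycles x 30 frames = 16,848,000 cycles/second
print(deblock_cycles_per_second(720, 480, 30))
```

Changing the resolution or frame rate arguments gives a quick estimate of how the same instruction scales to other formats.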
The most data-intensive aspect of H.264 is motion estimation, which is based in large part upon repeated Sum of Absolute Differences (SAD) calculations to determine the best motion vector match. SAD calculations and comparisons make extensive reuse of intermediate results which, given the large number of such temporary results, do not fit well within the limited register space of traditional CPU and DSP architectures, nor can they be transferred to fixed arithmetic and multiplier units efficiently enough to maintain the highest levels of utilization. Given that motion estimation can consist of up to 41 SAD and 41 motion vector (MV) calculations per macro block, a full motion search requires 262K operations per macro block which, for D1 @ 30 frames per second, totals 10.6 GOPS. Developers often use heuristic algorithms to reduce the number of required calculations.
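Both figures can be checked with simple arithmetic. The count of 41 follows from the seven H.264 partition sizes of a 16x16 macro block. The 262K figure is reproduced here under an assumption that is not stated in the text: 256 pixel differences evaluated at each of 1,024 candidate positions (for example, a 32 x 32 search window). The names below are hypothetical.

```python
# Partition sizes (width, height) into which a 16x16 macro block can be split.
PARTITION_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def partitions_per_macroblock():
    """Count every partition of every size: 1 + 2 + 2 + 4 + 8 + 8 + 16."""
    return sum((16 // w) * (16 // h) for w, h in PARTITION_SIZES)


def full_search_ops_per_second(width, height, fps, search_positions=1024):
    """Assumed model: one operation per pixel per candidate position."""
    ops_per_macroblock = 16 * 16 * search_positions
    macroblocks_per_frame = (width // 16) * (height // 16)
    return macroblocks_per_frame * fps * ops_per_macroblock


print(partitions_per_macroblock())               # 41
print(full_search_ops_per_second(720, 480, 30))  # 10616832000, about 10.6 GOPS
```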
Because of their programmable logic nature, software-configurable architectures can perform 64 SAD calculations in parallel with a single extension instruction. Additionally, it is possible to store the 64 intermediate results in state registers to reuse the values in the next instruction, eliminating the need to load and store all of these results. Pipelining SAD calculations in this manner significantly increases overall compute capacity.
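The value of keeping intermediate results close at hand can be seen in a small software model. The sketch below is an illustration only, with hypothetical function names: it computes the sixteen 4x4 SADs of a macro block once and then derives the SADs of every larger partition by summing those stored values, so no pixel is read a second time. In hardware the stored values would live in state registers rather than a Python list.

```python
# Partition sizes (width, height) into which a 16x16 macro block can be split.
PARTITION_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 block."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def all_partition_sads(grid):
    """Derive the SAD of every partition by summing stored 4x4 results."""
    sads = {}
    for w, h in PARTITION_SIZES:
        bw, bh = w // 4, h // 4  # partition size in units of 4x4 sub-blocks
        sads[(w, h)] = [
            sum(grid[by + dy][bx + dx] for dy in range(bh) for dx in range(bw))
            for by in range(0, 4, bh) for bx in range(0, 4, bw)
        ]
    return sads
```

For one candidate position, `all_partition_sads(sad_4x4_grid(cur, ref))` yields 41 values in total, matching the per-macro-block count given above, while touching each of the 256 pixel pairs exactly once.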
Because the SAD extension instruction is created at compile time, algorithms can also be quickly optimized to perform estimations across a variety of search areas, number of frames, or motion vectors, enabling developers to leverage the same algorithm code across a wide range of applications and price points. The dynamic nature of software-configurable architectures means that extension instructions can be configured at run-time, enabling reuse of the same programmable hardware resources to accelerate multiple applications in a way not possible with fixed ASIC implementations. Configuration overhead can be reduced to zero by “ping-ponging” between configurations so that new extension instructions are available immediately.
Note that motion estimation also employs ½ and ¼ pixel predictions that require 9 SAD calculations computed in 9 directions around a pixel. Although a straightforward series of calculations, the number of cycles required is quite high. Using extension instructions, however, a 16x16 SAD with ¼-pixel precision requires only 133 cycles while a 4x4 SAD only needs 50.
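The nine-candidate pattern can be sketched as a two-stage refinement around the best integer-pixel match: first at half-pixel spacing, then at quarter-pixel spacing around the half-pixel winner. This is a generic illustration, not the extension instruction described above. Motion vectors are expressed in quarter-pixel units, and `cost` is a hypothetical callable that would return the SAD of the interpolated reference block at a given vector.

```python
def refine(center, step, cost):
    """Evaluate the centre and its 8 neighbours at the given spacing."""
    cx, cy = center
    candidates = [(cx + dx * step, cy + dy * step)
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return min(candidates, key=cost)


def subpel_search(integer_mv, cost):
    """Refine an integer-pixel vector to quarter-pixel precision.

    Vectors are in quarter-pixel units: a step of 2 is half a pixel,
    and a step of 1 is a quarter of a pixel.
    """
    start = (integer_mv[0] * 4, integer_mv[1] * 4)
    half_best = refine(start, 2, cost)
    return refine(half_best, 1, cost)
```

Each stage costs nine SAD evaluations in this model, which is why folding the evaluations into a single instruction has such a large effect on the cycle count.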
Freedom through Flexibility
With its recent adoption by the 3G, DVD Forum, and DVB, H.264 has become a prominent international digital video format. Its adaptability to multiple applications and markets makes it an essential video technology, but its computational requirements are intense and require hardware acceleration.
Software-configurable architectures provide a unified development methodology for creating cost-effective H.264 implementations. Developers gain flexibility and ease of development through a single development environment and a high-level software language, without tying an implementation to costly and time-consuming hand-optimization of either hardware or software. As a result, applications based on software-configurable architectures scale with little manual effort, since the compiler, not the developer, does the heavy lifting of optimizing the implementation.
By abstracting hardware as software, hardware and software can be optimized together, providing the efficient and flexible implementation developers need to cost-effectively bring real-time H.264 to market. Developers can improve datapath performance at a relatively low cost in CPU cycles by bringing the hardware cost associated with high performance to an economically deployable level.
Flexibility is key to deploying rapidly evolving standards like H.264. Re-engineering code and hardware, as is required by traditional approaches, is a high hurdle to deployment. Software-configurable architectures provide the flexible architecture that developers need to cost-effectively discover and implement innovation.
About the author
Joe Hanson is the Director of Business Development for Stretch. Previously, Joe spent eight years with Altera, serving as Director of Marketing for System Level Tools and Director of Marketing and Applications for the Excalibur Business Unit, Altera's embedded processor solutions group. He has B.S. degrees in Electrical Engineering and Biological Sciences and holds three patent awards. He can be reached at hanson@stretchinc.com.



