Critical to the adoption of digital video across the wide range of embedded applications is the ability to deliver the best image quality feasible for a particular screen resolution while operating under real-world constraints. Consider a typical uncompressed D1 video stream with an image size of 720 x 480 pixels, requiring 1.5 bytes per pixel for color. At 518 kB per frame, such a stream at 30 frames per second requires 15.5 MB per second of storage and transport bandwidth.
Such consumption of bandwidth is highly impractical as well as unnecessary for most applications. Additionally, in network-based video applications, bandwidth has a direct impact on reliability. The larger a video stream, the more packets the network must transport successfully within ever-tighter latency constraints, further increasing the vulnerability of the stream to delay and subsequent packet loss. It is also critical that real-time video streams be able to coexist with other real-time streams such as voice data.
Lossy compression algorithms provide the most cost-effective means of retaining quality while lowering bandwidth requirements. Reducing screen resolution significantly reduces bandwidth but in many cases, the resulting reduction in image quality is unacceptable.
Figure 1: H.264 Encoder Architecture
Explicit vs Configurable Implementations
The MPEG video standard, through successive innovations, has continued to increase image quality while lowering bit rate through the use of increasingly complex algorithms. H.264, also known as MPEG-4 Part 10, was designed specifically to facilitate reliable transport of video over IP networks while delivering equivalent or better image quality than MPEG-2 at a substantially higher compression ratio.
The significant coding efficiency of H.264 enables a wide range of new applications for streaming video over a variety of media. Like other MPEG incarnations, the H.264 codec implementation is not explicitly defined. While the standard defines the syntax of the encoded bitstream and the method for decoding the bitstream, developers have the opportunity to introduce significant value through innovation and refinement of their codecs and their ability to deliver reliable real-time encoding of video.
In order to achieve these goals, developers require additional tools, new algorithms, and more computational capacity. From a technical perspective, the primary differences between H.264 and other MPEG standards are the use of multiple reference frames, wider search ranges, and smaller macroblocks for motion estimation, all of which ultimately translate to increased computational intensity. The efficiency of the encoder (see Figure 1) can be attributed to:
Intra-prediction utilizing Forward and Inverse Discrete Cosine Transforms (DCT), as well as Forward and Inverse Quantization
Motion Estimation utilizing inter-frame comparisons to ½-pel and ¼-pel accuracy
The H.264 standard supports motion estimation on blocks ranging from 4x4 to 16x16 pixels, as well as residual data transforms on 4x4 blocks with a modified integer discrete cosine transform (DCT) to avoid rounding errors.
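The 4x4 core transform matrix is defined by the standard so that the forward and inverse passes use only integer arithmetic. A direct sketch of the forward transform follows; it omits the per-coefficient scaling that H.264 folds into the quantization step, and the function names are illustrative:

```c
/* H.264 4x4 forward core transform matrix: an integer approximation of the
 * DCT whose entries (1, 2) can be implemented with adds and shifts. */
static const int Cf[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* Y = Cf * X * Cf^T; all-integer arithmetic, so no rounding error accumulates. */
void forward_transform_4x4(const int X[4][4], int Y[4][4]) {
    int T[4][4];
    for (int i = 0; i < 4; i++)          /* T = Cf * X */
        for (int j = 0; j < 4; j++) {
            T[i][j] = 0;
            for (int k = 0; k < 4; k++)
                T[i][j] += Cf[i][k] * X[k][j];
        }
    for (int i = 0; i < 4; i++)          /* Y = T * Cf^T */
        for (int j = 0; j < 4; j++) {
            Y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                Y[i][j] += T[i][k] * Cf[j][k];
        }
}
```

Because the transform is exactly invertible in integer arithmetic, the encoder and decoder stay bit-exact, which is the rounding-error avoidance the standard is designed around.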
The Necessity for Hardware Acceleration
Each of these functions requires extensive processing that must be performed in real-time to be useful. Clearly, this calls for hardware acceleration. However, given that OEMs provide significant value in how they implement these functions, a programmable platform is essential to provide developers with the flexibility they require to continue to refine their algorithms and increase their competitive edge. Fixed ASICs are simply too inflexible at this stage of H.264 adoption, where codec implementation is a key differentiating factor.
While today's CPUs continue to keep pace with Moore's Law, traditional programmable processors do not have architectures that are well-suited for video processing. One alternative approach is to introduce hardware acceleration units called through intrinsic instructions; examples of such instructions include Intel's MMX/SSE2 and AMD's 3DNow extensions.
While hardware acceleration units can be designed to efficiently offload block-based and pixel-level processing tasks that are not well-suited to CPU architectures, such tasks tend to be very dataflow intensive. As a consequence, deep register pipelines with fast memory access are required to achieve the required real-time efficiency. For example, the intermediate results of Sum of Absolute Differences (SAD) calculations (covered in more detail below) for motion estimation do not fit well within the limited register space of traditional CPU and DSP architectures. Additionally, such implementations require hand-optimized assembly coding to maximize efficiency, a time-consuming process that creates an inflexible, architecture-specific implementation that is difficult to build and innovate upon over time.
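To see why SAD stresses register files and memory bandwidth, consider a minimal scalar sketch (the function name and the 16x16 block size are illustrative; a real encoder evaluates this at every candidate position in the search range):

```c
#include <stdlib.h>

/* Sum of Absolute Differences between a 16x16 block of the current frame
 * and a candidate block in the reference frame. `stride` is the width in
 * bytes of a full frame row, so the pointers can address blocks in place. */
unsigned sad_16x16(const unsigned char *cur, const unsigned char *ref, int stride) {
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[x] - ref[x]);
        cur += stride;   /* advance both pointers one frame row */
        ref += stride;
    }
    return sad;
}
```

Each candidate position touches 512 pixels and produces 256 intermediate differences to accumulate; multiplied across a wide search range and multiple reference frames, this is the dataflow that overwhelms a general-purpose register file.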
Programmable logic devices are often perceived to provide sufficient flexibility to effectively implement evolving proprietary algorithms. Such architectures typically employ a programmable processor to manage application-level tasks while a programmable logic device such as an FPGA manages the flow of real-time data. Such architectures, however, introduce new inefficiencies.
First, developers must decide upon an interconnect between the processor and programmable logic device. This choice of interconnect determines the throughput and therefore the efficiency of the overall architecture. If the interconnect operates synchronously, it burdens the hardware designer with stringent timing requirements. Alternatively, asynchronous interconnects introduce significant latency into the system, as well as forced processor stalls in order to complete the data exchange. Such an interface can also expose limitations in hardware handshaking, which further reduce the effective bandwidth of the interconnect.
Figure 2: Stretch Software-Configurable Processor
Using discrete devices (see Figure 2), the FPGA acts as a coprocessor for which the CPU must prepare and hand off data. As a consequence, the CPU must wait for results from the FPGA, creating interdependency and latency that weaken the effectiveness of pipelined operations. Additionally, system design is spread across two development environments, one for the FPGA and one for the CPU.
Often, the FPGA architecture is based on an existing software architecture that fails to meet current performance requirements. In order to improve performance, the hardware team recodes the critical computational blocks of the software algorithms in a hardware description language (HDL). Not only does such an implementation require a separate verification methodology, any changes to the software algorithm or context of the application—even something as conceptually simple as modifying the screen resolution—propagates changes down to the FPGA coprocessor.
This “trickle down” effect can significantly impact the timely delivery of a product to the market, especially for those applications with algorithms and standards that are continuing to evolve.
Next: Integrated Programmable Logic