

Revving video encoding on C64x/DM64x DSPs

Cheng Peng

6/20/2005 9:00 AM EDT

How does one improve video encoding performance on the Texas Instruments TMS320C64x/TMS320DM64x digital signal processor generation?

The conventional implementation of a video encoder (MPEG-2, MPEG-4, H.263) is based on macroblock-level processing: the encoder fetches a new macroblock (MB) only after the current MB has gone through all the processing steps. This intuitive approach comes with two drawbacks:

- The overall code size of a video encoder is usually bigger than the Level 1 program cache (L1P) on a C64x/DM64x DSP. Code has to be swapped between L1P and Level 2 memory (L2) each time a new MB is fetched, causing a significant cache-miss penalty.

- It is not efficient for the enhanced DMA (EDMA) controller to transfer a small chunk of data such as a single MB from an external video frame memory to internal memory.

To avoid the huge cache-miss penalty and CPU stalling, the algorithm can be broken into three loops, each of them a separate module that fits into L1P. Instead of processing a single MB at a time, each loop processes n macroblocks (an MB strip). The size of a strip is restricted only by the size of the available Level 1 data cache (L1D). The bigger n is, the better the EDMA data throughput we can expect.
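As a rough sizing aid, the strip length n can be derived from a memory budget. The sketch below assumes 8-bit 4:2:0 video, where one macroblock occupies 384 bytes (a 16x16 luma block plus two 8x8 chroma blocks). The function names, the `budget_bytes` figure and the number of buffers are illustrative assumptions, not values taken from a TI data sheet.

```c
#include <assert.h>
#include <stddef.h>

/* One 8-bit 4:2:0 macroblock: 16x16 luma + two 8x8 chroma blocks. */
enum { MB_BYTES = 16 * 16 + 2 * 8 * 8 };   /* 384 bytes */

/*
 * Largest strip length n such that num_buffers strips of n macroblocks
 * fit into budget_bytes. budget_bytes is a hypothetical amount of
 * on-chip memory reserved for input pixels.
 */
size_t max_strip_len(size_t budget_bytes, size_t num_buffers)
{
    assert(num_buffers > 0);
    return budget_bytes / (num_buffers * MB_BYTES);
}

/* Number of strips needed to cover one row of mbs_per_row macroblocks. */
size_t strips_per_row(size_t mbs_per_row, size_t n)
{
    assert(n > 0);
    return (mbs_per_row + n - 1) / n;   /* round up: last strip may be short */
}
```

With a hypothetical 8 KB budget shared by two buffers, `max_strip_len(8192, 2)` gives n = 10, and a 720-pixel-wide row of 45 macroblocks then needs `strips_per_row(45, 10)` = 5 strips.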

The three loops are:

- the MB encoding loop,

- the motion estimation loop, and

- the MB reconstruction loop.

As emphasized above, the n MBs of a strip are fetched together and pass through each processing loop as a group.
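The effect of this restructuring on program-cache traffic can be illustrated with a small host-side model. The sketch below does not encode anything: it only counts how often a different code module has to be brought in, as a crude proxy for L1P swaps. The stage names, the `trace_t` type and the order in which the loops run (motion estimation, then encoding, then reconstruction, the usual data-flow order of a hybrid encoder) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage identifiers; each stands for one code module. */
typedef enum { STAGE_NONE, STAGE_ME, STAGE_ENCODE, STAGE_RECON } stage_t;

typedef struct {
    stage_t last;       /* module that ran most recently */
    size_t  switches;   /* times a different module had to be brought in */
} trace_t;

/* Stand-in for running one module on one macroblock. */
void run_stage(trace_t *t, stage_t s)
{
    if (t->last != s) {
        t->switches++;
        t->last = s;
    }
}

/* Conventional order: every MB goes through all stages before the next MB. */
size_t switches_per_mb(size_t total_mbs)
{
    trace_t t = { STAGE_NONE, 0 };
    for (size_t mb = 0; mb < total_mbs; mb++) {
        run_stage(&t, STAGE_ME);
        run_stage(&t, STAGE_ENCODE);
        run_stage(&t, STAGE_RECON);
    }
    return t.switches;
}

/* Strip order: each loop runs over all MBs of a strip before the next loop. */
size_t switches_per_strip(size_t total_mbs, size_t n)
{
    trace_t t = { STAGE_NONE, 0 };
    assert(n > 0);
    for (size_t first = 0; first < total_mbs; first += n) {
        /* The last strip may hold fewer than n macroblocks. */
        size_t count = (total_mbs - first < n) ? total_mbs - first : n;
        for (size_t i = 0; i < count; i++) run_stage(&t, STAGE_ME);
        for (size_t i = 0; i < count; i++) run_stage(&t, STAGE_ENCODE);
        for (size_t i = 0; i < count; i++) run_stage(&t, STAGE_RECON);
    }
    return t.switches;
}
```

For a 720x480 frame (1350 macroblocks) and n = 45, the per-MB order yields 4050 module switches in this model, against 90 for the strip order. Real cache behavior depends on code layout and module sizes, so the counts indicate the trend only.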

For example, in the MB encoding loop, once n MBs have been fetched into internal memory, they are put through a discrete cosine transform (DCT), quantized and entropy-coded. This set of macroblocks is not flushed out of L1D until the MB encoding loop has completed. The corresponding code, including the DCT, quantization and variable-length coding kernels, likewise stays in L1P until all n MBs have been processed in this loop.

A ping-pong memory buffering scheme driven by the EDMA helps reduce the initial setup time needed to perform these loops for a strip of MBs. Ping-pong buffering also ensures minimal CPU stalling cycles because the transfers are overlapped with processing.
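The control flow of such a ping-pong scheme might look like the sketch below. It is a host-side model under stated assumptions: `dma_start` and `dma_wait` are hypothetical stand-ins for submitting an EDMA transfer and waiting for its completion, implemented here with a plain `memcpy`, so nothing actually runs in parallel. `process_strip` is a placeholder that sums bytes instead of running an encoding loop, and `STRIP_BYTES` is kept tiny for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { STRIP_BYTES = 8 };   /* tiny strip so the example stays readable */

/* Hypothetical stand-in for starting an EDMA transfer.
 * Here the copy is a plain memcpy, so nothing actually overlaps. */
void dma_start(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Hypothetical stand-in for waiting on the pending transfer. */
void dma_wait(void)
{
    /* On real hardware: block until the pending transfer has finished. */
}

/* Placeholder for one processing loop over a strip: sums the bytes. */
uint32_t process_strip(const uint8_t *strip, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += strip[i];
    return sum;
}

/* Processes num_strips strips from frame using two alternating buffers. */
uint32_t encode_ping_pong(const uint8_t *frame, size_t num_strips)
{
    uint8_t buf[2][STRIP_BYTES];
    uint32_t total = 0;

    if (num_strips == 0) return 0;

    dma_start(buf[0], frame, STRIP_BYTES);          /* prime the "ping" buffer */
    for (size_t s = 0; s < num_strips; s++) {
        size_t cur = s & 1;                         /* buffer holding strip s */
        dma_wait();                                 /* strip s has arrived */
        if (s + 1 < num_strips)                     /* prefetch strip s + 1 */
            dma_start(buf[cur ^ 1], frame + (s + 1) * STRIP_BYTES, STRIP_BYTES);
        /* On the target, this work would overlap with the transfer above. */
        total += process_strip(buf[cur], STRIP_BYTES);
    }
    return total;
}
```

The CPU waits only for the strip it is about to process, and the next strip is already in flight while the current one is being worked on. That is why only the first transfer contributes setup latency that cannot be hidden.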

Cheng Peng (c-peng2@ti.com), DSP video application engineer for Texas Instruments Inc. (Dallas)

