Design Article
Efficient Algorithms for MPEG-4 Video Decoding
J. Geoffrey Chase and Christopher Pretty
12/17/2002 12:00 AM EST
![]() |
As portable handheld communication devices proliferate, designers are adding more features to meet consumer demand such as digital video and audio. The ISO MPEG-4 standard is emerging as the predominant video and audio format for these applications.

Based on standard 16-bit telecom DSPs, these devices must very efficiently perform 8-bit video and 16-bit telecommunications algorithms to support these applications and meet low-power requirements. Dynamically reconfigurable DSPs achieve the high efficiency required via cycle-to-cycle reconfiguration while retaining flexibility and programmability. Due to the evolving nature of the standard and the cost of additional hardware upgrade and redesign, portable MPEG-4 implementations should be fully programmable.
For portable handheld devices, such as video cell phones, you need a simpler but still computationally intense implementation of the MPEG-4 standard. The MPEG-4 simple profile at level 1 is ideal for portable applications with a Quarter Common Intermediate Format (QCIF176x144 pixels), 10-15 frames per second (fps), and 64 kbps data rate. Initial estimates for this MPEG-4 decoder require at least 60-80 Mcycles on a typical 16-bit, dual-execution unit DSP.
Ways must be found to significantly reduce this value to enable programmable solutions.
Unlike the encoder, the decoder is explicitly specified by the MPEG-4 standard, leaving few choices for the system designer regarding what algorithms to employ. However, three of these algorithms, the discrete cosine transform (DCT), de-blocking filter, and de-ringing filter, consume an estimated 70% of the computational cost. Therefore, significant reductions for these algorithms via efficient algorithms and/or reconfigurable DSPs utilizing tightly coupled co-processors are necessary to enable the desired solutions. Figure 1 shows a simple block diagram of the algorithms required for decoding a simple profile MPEG-4 video stream.
Figure 1: Block diagram of simple MPEG-4 decoder showing the software boundary |
We propose two video-accelerating co-processors, designed specifically for 8-bit arithmetic and filtering and supporting 8-bit memory access, to accelerate MPEG-4 algorithms. A byte algebraic unit allows parallel add/sub, max/min, and accumulated distance operations. A byte manipulation unit that allows 8-bit words to be packed into and unpacked from 16-bit memory locations facilitates efficient 8-bit memory access. In addition, a first-order smoothing filter allows the filtering of two pixels per clock cycle, significantly improving the efficiency of the de-ringing filter. Figure 2 shows a basic schematic of a single, proposed co-processor.
Figure 2: Schematic of a proposed video PPM execution unit |
Table 1 lists the co-processor instructions, designed to ensure generality to a variety of video applications. The operand types are defined so that Mx describes an x-bit memory location, A depicts a 32-bit accumulator with access to the high and low 16-bits ALH, and R represents address registers.
| pack
unpack add subtract accumulated distance max min flag write to memory filter filter2 |
dest = pack(src1, src2)
dest1, dest2 = unpack(src) dest = src1 + src2 dest += src dest = src1 - src2 dest -= src dest += abs(src1 - src2) dest = ppmax(src1, src2) ppmax(dest, src) dest = ppmin(src1, src2) ppmin(dest, src) dest = flag(src) dest = ppfilter(src) dest1, dest2 = ppfilter2(src) |
M16
M8 M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | ALH M16 | ALH M16 | ALH M16 | ALH M8 M16 | M8 M16 | M8 |
M8
M16 M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M8 M16 | ALH M16 | ALH M16 | ALH M16 | ALH EST R8 R8 |
M8
M16 | M8 | A M16 | M8 | A M8 M16 | ALH M16 | ALH |
Table 1: Instructions for the proposed video-accelerating co-processor
where u, v, x, y = 0, 1, 2, ... N-1; x and y are spatial coordinates in the sample domain; u and v are coordinates in the transform domain; and C(u), C(v) = 1.0 unless u, v = 0, in which case they are C(u), C(v) = 0.7071. The DCT and the IDCT are operationally very similar and can be analyzed using a simplified representation defined as:
This form of the transform limits the memory accesses and MACs required per summation to two reads and one MAC, but requires more memory for the coefficients. Evaluating this equation directly, you must store 4096 16-bit coefficients and perform 4096 16-bit MACs per 8x8 block. These figures equate to 12.1 Mcycles on a dual-execution-unit DSP and 24.2 Mcycles on a single-execution unit-DSP, for a 10-fps QCIF session.
Examining the 4096 coefficients as a 64x64 matrix reveals complex symmetries, as illustrated in Figure 3, that you can leverage to reduce the computational load and memory use. Figure 3 shows that the rows of the coefficient matrix have modified mirror symmetry about the center. The values in the first block of eight columns of the reflected rows alternate positive and negative symmetry, while the second block alternates negative and positive symmetry. Within each smaller block of eight rows and eight columns, there is simple positive or negative mirror symmetry of the values about the center.
Figure 3: Symmetries present in the DCT coefficient matrix |
The following code fragment uses these symmetries to perform the IDCT (and, with minor modifications, the DCT) for an 8x8 block of pixels in 580 cycles. This approach results in a computational cost of 3.4 Mcycles for a 10-fps QCIF session and reduces the number of stored 16-bit coefficients to 1024, saving over 70 percent of the required data memory and computational resources. Note that the transform coefficients of Equation 3 must be represented by 16-bits to maintain precision and the input values are 9 bits, since they result from the subtraction of two 8-bit numbers in the motion-estimation process.
rep (4) block { // 4 blocks down
rep (4) block { // 4 lines/block
clr(a0,a1,a2,a3,a4,a5) || rep (4) block { // 4 blocks of 16
rep (4) single // 8 coeffs along
{
a0 += *r0++ * *r4++ || a1 += *r0++ * *r4++;
}
rep (3) single // 8 coeffs along
{
a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++;
}
a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++; //so rep single isn't last instr.
}
a0 -= a1 || a1 += a0;
a2 += a3 || a3 -= a2;
*(r1+=rn1) = a1 + a2 || *(r2+=rn2) = a0 + a3; // TT || BB
*(r3+=rn3) = a1 - a2 || *(r4+=rn4) = a0 - a3; // TB || BT
}
nop; // prevents R/W clash
}
An analysis of the algorithm shows that, given one line of 10 pixels spanning a block boundary, the initial gradient analysis can be performed in 16 cycles. The default smoothing filter requires an additional 23 cycles, while the DC offset smoothing filter consumes 48 cycles. A QCIF frame contains 18x22, 8x8 blocks of pixels with an associated 752 boundaries between them. Including the chrominance blocks, 930 boundaries are present, each with eight lines requiring processing. In the worst case, the DC offset mode would be required for all pixels, which translates into 4.8 Mcycles. In the best case, with the default mode used for all pixels, the computational cost is 2.9 Mcycles.
The deblocking filter operates on 8-bit pixels values and therefore may achieve considerable computational cost reductions when implemented on a dynamically reconfigurable DSP with two of the proposed video co-processors. Specifically, the DC offset smoothing filter requires numerous 8-bit compares, which are supported by the video co-processor instruction set, cutting the 48 cycles required for the DC offset filter to 21 cycles by comparing two 8-bit values simultaneously using a 16-bit register. This value is the same as the default filter, resulting in a total computational cost for the deblocking filter with the reconfigurable DSP of 2.6 Mcycles. The 46% reduction over the worst case, and 10% over the best case, results from dynamically reconfiguring the DSP as a highly parallel 8-bit processor.
Figure 4: Basic operations of the deringing post-filter |
a0h = *r2< shift a 24bit op.
rep(N/2) block {
a2h = *r2<
You can achieve significant performance improvements by implementing the threshold-acquisition function using two of the proposed video co-processors, since these operations are primarily 8 bits in length. The co-processors can perform the threshold acquisition for an 8x8 block of pixels in 24 cycles for a reduction of 91%, as illustrated in the following code, where the "plug" command is a dynamically reconfiguring call to the co-processor. The impact of the co-processors is not limited to the threshold acquisition. As is the case with the flag-write-to-memory and pixel-filtering instructions, the index-acquisition and adaptive-smoothing algorithms can also be performed more efficiently. In forming the binary index matrix, two cycles are required to compare and write for two pixels, while the video co-processor writes the flag values, 0 or 1, directly to memory, enabling this operation in a single cycle.
rep(4) block {
cliw ppcmp(r2++, r7++) //PP max/min functions compare 2 pairs of 8-bit numbers
{ //held in 2 16-bit numbers, returning results in a 16-word
plug1 ppmax(pa0h, *ma1) || ppmax(pa0l, *ma2)||
plug2 ppmin(pa1h, *ma1)|| ppmin(pa1l, *ma2);
}
}
cliw ppcmp2(r2, r7) //Using the PP max/min functions to reduce the resulting
{ //4 8-bit max and min's down to 2 of each before writing out
plug1 *ma1 = ppmax(pa0h, pa0l)||
plug2 *ma2 = ppmin(pa1h, pa1l);
}
a2h = extractz(*r2,a0); //Extracting the 2 max's to a2 for comparison
a2l = extractz(*r2,a1);
a3h = extractz(*r7,a0); //Extracting the 2 min's to a3 for comparison
a3l = extractz(*r7,a1);
max(a2h,a2l) || min(a3h,a3l); // Final comparison
The binary coefficients of the 9-tap adaptive filter, shown in Figure 5, coupled with the 8-bit pixel values, enable the first-order smoothing filter in the PPM to filter two pixels per cycle. This is possible since the pixel values need only be shifted and accumulated to perform the convolution. Using two co-processors requires 34 cycles per 8x8 block rather than the 320 cycles necessary without them. The overall computational cost of the de-ringing post-filter is 9.7 Mcycles when implemented without co-processors and only 0.9 Mcycles with them, a reduction of 91%. Table 2 summarizes the results for all three algorithms we have examined.
Algorithm |
Best Case |
Worst Case |
Best Case |
Worst Case |
| IDCT | ||||
| Deblocking Filter | ||||
| Deringing Filter | ||||
| Total | 16.0 | 17.9 | 6.9 | 6.9 |
Table 2: Summary of computational requirements (Mcycles)
Figure 5: Operation of a single-cycle dual-pixel filter |






