# Efficient Algorithms for MPEG-4 Video Decoding

As portable handheld communication devices proliferate, designers are adding more features to meet consumer demand such as digital video and audio. The ISO MPEG-4 standard is emerging as the predominant video and audio format for these applications. Based on standard 16-bit telecom DSPs, these devices must very efficiently perform 8-bit video and 16-bit telecommunications algorithms to support these applications and meet low-power requirements. Dynamically reconfigurable DSPs achieve the high efficiency required via cycle-to-cycle reconfiguration while retaining flexibility and programmability. Due to the evolving nature of the standard and the cost of additional hardware upgrade and redesign, portable MPEG-4 implementations should be fully programmable.

For portable handheld devices, such as video cell phones, you need a simpler but still computationally intense implementation of the MPEG-4 standard. The MPEG-4 simple profile at level 1 is ideal for portable applications with a Quarter Common Intermediate Format (QCIF—176x144 pixels), 10-15 frames per second (fps), and 64 kbps data rate. Initial estimates for this MPEG-4 decoder require at least 60-80 Mcycles on a typical 16-bit, dual-execution unit DSP. Ways must be found to significantly reduce this value to enable programmable solutions.

Unlike the encoder, the decoder is explicitly specified by the MPEG-4 standard, leaving few choices for the system designer regarding what algorithms to employ. However, three of these algorithms, the discrete cosine transform (DCT), de-blocking filter, and de-ringing filter, consume an estimated 70% of the computational cost. Therefore, significant reductions for these algorithms via efficient algorithms and/or reconfigurable DSPs utilizing tightly coupled co-processors are necessary to enable the desired solutions. **Figure 1** shows a simple block diagram of the algorithms required for decoding a simple profile MPEG-4 video stream.

Figure 1: Block diagram of simple MPEG-4 decoder showing the software boundary |

We propose two video-accelerating co-processors, designed specifically for 8-bit arithmetic and filtering and supporting 8-bit memory access, to accelerate MPEG-4 algorithms. A byte algebraic unit allows parallel add/sub, max/min, and accumulated distance operations. A byte manipulation unit that allows 8-bit words to be packed into and unpacked from 16-bit memory locations facilitates efficient 8-bit memory access. In addition, a first-order smoothing filter allows the filtering of two pixels per clock cycle, significantly improving the efficiency of the de-ringing filter. **Figure 2** shows a basic schematic of a single, proposed co-processor.

Figure 2: Schematic of a proposed video PPM execution unit |

**Table 1** lists the co-processor instructions, designed to ensure generality to a variety of video applications. The operand types are defined so that Mx describes an x-bit memory location, A depicts a 32-bit accumulator with access to the high and low 16-bits ALH, and R represents address registers.

pack
unpack add subtract accumulated distance max min flag write to memory filter filter2 |
dest = pack(src1, src2)
dest1, dest2 = unpack(src) dest = src1 + src2 dest += src dest = src1 - src2 dest -= src dest += abs(src1 - src2) dest = ppmax(src1, src2) ppmax(dest, src) dest = ppmin(src1, src2) ppmin(dest, src) dest = flag(src) dest = ppfilter(src) dest1, dest2 = ppfilter2(src) |
M16
M8 M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | ALH M16 | ALH M16 | ALH M16 | ALH M8 M16 | M8 M16 | M8 |
M8
M16 M16 | M8 | A M16 | M8 | A M16 | M8 | A M16 | M8 | A M8 M16 | ALH M16 | ALH M16 | ALH M16 | ALH EST R8 R8 |
M8
M16 | M8 | A M16 | M8 | A M8 M16 | ALH M16 | ALH |

**Table 1:**Instructions for the proposed video-accelerating co-processor

**Equations 1**and

**2**, respectively, define the NxN two-dimensional DCT and IDCT.

where u, v, x, y = 0, 1, 2, ... N-1; x and y are spatial coordinates in the sample domain; __u__ and v are coordinates in the transform domain; and C(u), C(v) = 1.0 unless u, v = 0, in which case they are C(u), C(v) = 0.7071. The DCT and the IDCT are operationally very similar and can be analyzed using a simplified representation defined as:

This form of the transform limits the memory accesses and MACs required per summation to two reads and one MAC, but requires more memory for the coefficients. Evaluating this equation directly, you must store 4096 16-bit coefficients and perform 4096 16-bit MACs per 8x8 block. These figures equate to 12.1 Mcycles on a dual-execution-unit DSP and 24.2 Mcycles on a single-execution unit-DSP, for a 10-fps QCIF session.

Examining the 4096 coefficients as a 64x64 matrix reveals complex symmetries, as illustrated in **Figure 3**, that you can leverage to reduce the computational load and memory use. **Figure 3** shows that the rows of the coefficient matrix have modified mirror symmetry about the center. The values in the first block of eight columns of the reflected rows alternate positive and negative symmetry, while the second block alternates negative and positive symmetry. Within each smaller block of eight rows and eight columns, there is simple positive or negative mirror symmetry of the values about the center.

Figure 3: Symmetries present in the DCT coefficient matrix |

The following code fragment uses these symmetries to perform the IDCT (and, with minor modifications, the DCT) for an 8x8 block of pixels in 580 cycles. This approach results in a computational cost of 3.4 Mcycles for a 10-fps QCIF session and reduces the number of stored 16-bit coefficients to 1024, saving over 70 percent of the required data memory and computational resources. Note that the transform coefficients of **Equation 3** must be represented by 16-bits to maintain precision and the input values are 9 bits, since they result from the subtraction of two 8-bit numbers in the motion-estimation process.

rep (4) block { // 4 blocks down
rep (4) block { // 4 lines/block
clr(a0,a1,a2,a3,a4,a5) || rep (4) block { // 4 blocks of 16
rep (4) single // 8 coeffs along
{
a0 += *r0++ * *r4++ || a1 += *r0++ * *r4++;
}
rep (3) single // 8 coeffs along
{
a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++;
}
a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++; //so rep single isn't last instr.
}
a0 -= a1 || a1 += a0;
a2 += a3 || a3 -= a2;
*(r1+=rn1) = a1 + a2 || *(r2+=rn2) = a0 + a3; // TT || BB
*(r3+=rn3) = a1 - a2 || *(r4+=rn4) = a0 - a3; // TB || BT
}
nop; // prevents R/W clash
}

**Figure 4**shows the basic operations performed in the filter for each macroblock comprising 8x8 pixel blocks. The threshold-acquisition function determines the maximum and minimum pixel values in each 8x8 block and uses these values to calculate a threshold value. Index acquisition uses the threshold value to form a binary index matrix in which a value of 1 is assigned to a pixel value greater than the threshold, while 0 means the value is lower. A 2D FIR filter is then applied to each pixel for which both the pixel and its eight neighbors all have indices of 1. The clipping process limits the maximum difference between the original and filtered values. The following code finds the maximum and minimum 8-bit values from an 8x8 block, the primary calculation required in the threshold acquisition. With two 8-bit values concatenated and stored in each 16-bit memory location, a shift is required to extract the data. Excluding overhead, it can be seen that for each block of N=64, 257 cycles are required for a total of 1.5 Mcycles for the entire frame.

Figure 4: Basic operations of the deringing post-filter |

a0h = *r2 shift a 24bit op.
rep(N/2) block {
a2h = *r2

You can achieve significant performance improvements by implementing the threshold-acquisition function using two of the proposed video co-processors, since these operations are primarily 8 bits in length. The co-processors can perform the threshold acquisition for an 8x8 block of pixels in 24 cycles for a reduction of 91%, as illustrated in the following code, where the "plug" command is a dynamically reconfiguring call to the co-processor. The impact of the co-processors is not limited to the threshold acquisition. As is the case with the flag-write-to-memory and pixel-filtering instructions, the index-acquisition and adaptive-smoothing algorithms can also be performed more efficiently. In forming the binary index matrix, two cycles are required to compare and write for two pixels, while the video co-processor writes the flag values, 0 or 1, directly to memory, enabling this operation in a single cycle.

rep(4) block {
cliw ppcmp(r2++, r7++) //PP max/min functions compare 2 pairs of 8-bit numbers
{ //held in 2 16-bit numbers, returning results in a 16-word
**plug1** ppmax(pa0h, *ma1) || ppmax(pa0l, *ma2)||
**plug2** ppmin(pa1h, *ma1)|| ppmin(pa1l, *ma2);
}
}
cliw ppcmp2(r2, r7) //Using the PP max/min functions to reduce the resulting
{ //4 8-bit max and min's down to 2 of each before writing out
**plug1** *ma1 = ppmax(pa0h, pa0l)||
**plug2** *ma2 = ppmin(pa1h, pa1l);
}
a2h = extractz(*r2,a0); //Extracting the 2 max's to a2 for comparison
a2l = extractz(*r2,a1);
a3h = extractz(*r7,a0); //Extracting the 2 min's to a3 for comparison
a3l = extractz(*r7,a1);
max(a2h,a2l) || min(a3h,a3l); // Final comparison

The binary coefficients of the 9-tap adaptive filter, shown in **Figure 5**, coupled with the 8-bit pixel values, enable the first-order smoothing filter in the PPM to filter two pixels per cycle. This is possible since the pixel values need only be shifted and accumulated to perform the convolution. Using two co-processors requires 34 cycles per 8x8 block rather than the 320 cycles necessary without them. The overall computational cost of the de-ringing post-filter is 9.7 Mcycles when implemented without co-processors and only 0.9 Mcycles with them, a reduction of 91%. **Table 2** summarizes the results for all three algorithms we have examined.

Algorithm |
Best Case |
Worst Case |
Best Case |
Worst Case |

IDCT |
||||

Deblocking Filter |
||||

Deringing Filter |
||||

Total |
16.0 |
17.9 |
6.9 |
6.9 |

**Table 2:**Summary of computational requirements (Mcycles)

Figure 5: Operation of a single-cycle dual-pixel filter |