Design Article

IMG1

Efficient Algorithms for MPEG-4 Video Decoding

J. Geoffrey Chase and Christopher Pretty

12/17/2002 12:00 AM EST


As portable handheld communication devices proliferate, designers are adding more features to meet consumer demand such as digital video and audio. The ISO MPEG-4 standard is emerging as the predominant video and audio format for these applications. Based on standard 16-bit telecom DSPs, these devices must very efficiently perform 8-bit video and 16-bit telecommunications algorithms to support these applications and meet low-power requirements. Dynamically reconfigurable DSPs achieve the high efficiency required via cycle-to-cycle reconfiguration while retaining flexibility and programmability. Due to the evolving nature of the standard and the cost of additional hardware upgrade and redesign, portable MPEG-4 implementations should be fully programmable.

For portable handheld devices, such as video cell phones, you need a simpler but still computationally intense implementation of the MPEG-4 standard. The MPEG-4 simple profile at level 1 is ideal for portable applications with a Quarter Common Intermediate Format (QCIF—176x144 pixels), 10-15 frames per second (fps), and 64 kbps data rate. Initial estimates for this MPEG-4 decoder require at least 60-80 Mcycles on a typical 16-bit, dual-execution unit DSP. Ways must be found to significantly reduce this value to enable programmable solutions.

Unlike the encoder, the decoder is explicitly specified by the MPEG-4 standard, leaving few choices for the system designer regarding what algorithms to employ. However, three of these algorithms, the discrete cosine transform (DCT), de-blocking filter, and de-ringing filter, consume an estimated 70% of the computational cost. Therefore, significant reductions for these algorithms via efficient algorithms and/or reconfigurable DSPs utilizing tightly coupled co-processors are necessary to enable the desired solutions. Figure 1 shows a simple block diagram of the algorithms required for decoding a simple profile MPEG-4 video stream.


Figure 1:  Block diagram of simple MPEG-4 decoder showing the software boundary

Video Accelerating Co-Processor and Dynamic Reconfigurability
This article uses the Carmel DSP from Infineon Technologies as a representative dual-execution unit, 16-bit telecom DSP since its performance is similar to many processors in its class. Configurable long instruction word (CLIW) technology enables dynamic reconfigurability, which you can get for similar DSPs using tightly coupled co-processors. More specifically, efficient execution of the 8-bit operations necessary for digital video means a 16-bit DSP must also behave as a highly parallel 8-bit processor, as enabled by dynamic reconfiguration.

We propose two video-accelerating co-processors, designed specifically for 8-bit arithmetic and filtering and supporting 8-bit memory access, to accelerate MPEG-4 algorithms. A byte algebraic unit allows parallel add/sub, max/min, and accumulated distance operations. A byte manipulation unit that allows 8-bit words to be packed into and unpacked from 16-bit memory locations facilitates efficient 8-bit memory access. In addition, a first-order smoothing filter allows the filtering of two pixels per clock cycle, significantly improving the efficiency of the de-ringing filter. Figure 2 shows a basic schematic of a single, proposed co-processor.


Figure 2:  Schematic of a proposed video PPM execution unit

Table 1 lists the co-processor instructions, designed to ensure generality to a variety of video applications. The operand types are defined so that Mx describes an x-bit memory location, A depicts a 32-bit accumulator with access to the high and low 16-bits ALH, and R represents address registers.

Instruction
Syntax
dest1 (2)
src1
src2
pack
unpack
add
 
subtract
 
accumulated distance
max
 
min
 
flag write to memory
filter
filter2
dest = pack(src1, src2)
dest1, dest2 = unpack(src)
dest = src1 + src2
dest += src
dest = src1 - src2
dest -= src
dest += abs(src1 - src2)
dest = ppmax(src1, src2)
ppmax(dest, src)
dest = ppmin(src1, src2)
ppmin(dest, src)
dest = flag(src)
dest = ppfilter(src)
dest1, dest2 = ppfilter2(src)
M16
M8
M16 | M8 | A
M16 | M8 | A
M16 | M8 | A
M16 | M8 | A
M16 | M8 | A
M16 | ALH
M16 | ALH
M16 | ALH
M16 | ALH
M8
M16 | M8
M16 | M8
M8
M16
M16 | M8 | A
M16 | M8 | A
M16 | M8 | A
M16 | M8 | A
M8
M16 | ALH
M16 | ALH
M16 | ALH
M16 | ALH
EST
R8
R8
M8
 
M16 | M8 | A
 
M16 | M8 | A
 
M8
M16 | ALH
 
M16 | ALH
 
 
 
 

Table 1:  Instructions for the proposed video-accelerating co-processor

Discrete Cosine Transform and its Inverse
The Discrete Cosine Transform and its inverse, the IDCT, are both implemented in an MPEG-4 codec. The IDCT is part of the decoder and is performed on blocks of 8x8 pixels, where each luminance macroblock is split into four 8x8 blocks and has two associated chrominance blocks. The top-left value in a transformed block represents the DC level, or average intensity of the block. The remaining 63 coefficients represent the spatial frequencies that may exist within the block. For MPEG-4, Equations 1 and 2, respectively, define the NxN two-dimensional DCT and IDCT.

where u, v, x, y = 0, 1, 2, ... N-1; x and y are spatial coordinates in the sample domain; u and v are coordinates in the transform domain; and C(u), C(v) = 1.0 unless u, v = 0, in which case they are C(u), C(v) = 0.7071. The DCT and the IDCT are operationally very similar and can be analyzed using a simplified representation defined as:

This form of the transform limits the memory accesses and MACs required per summation to two reads and one MAC, but requires more memory for the coefficients. Evaluating this equation directly, you must store 4096 16-bit coefficients and perform 4096 16-bit MACs per 8x8 block. These figures equate to 12.1 Mcycles on a dual-execution-unit DSP and 24.2 Mcycles on a single-execution unit-DSP, for a 10-fps QCIF session.

Examining the 4096 coefficients as a 64x64 matrix reveals complex symmetries, as illustrated in Figure 3, that you can leverage to reduce the computational load and memory use. Figure 3 shows that the rows of the coefficient matrix have modified mirror symmetry about the center. The values in the first block of eight columns of the reflected rows alternate positive and negative symmetry, while the second block alternates negative and positive symmetry. Within each smaller block of eight rows and eight columns, there is simple positive or negative mirror symmetry of the values about the center.


Figure 3:  Symmetries present in the DCT coefficient matrix

The following code fragment uses these symmetries to perform the IDCT (and, with minor modifications, the DCT) for an 8x8 block of pixels in 580 cycles. This approach results in a computational cost of 3.4 Mcycles for a 10-fps QCIF session and reduces the number of stored 16-bit coefficients to 1024, saving over 70 percent of the required data memory and computational resources. Note that the transform coefficients of Equation 3 must be represented by 16-bits to maintain precision and the input values are 9 bits, since they result from the subtraction of two 8-bit numbers in the motion-estimation process.


rep (4) block	{					     // 4 blocks down
    rep (4) block	{				     // 4 lines/block
	  clr(a0,a1,a2,a3,a4,a5) || rep (4) block  {	     // 4 blocks of 16	
		rep (4) single				     // 8 coeffs along
		{
		     a0 += *r0++ * *r4++ || a1 += *r0++ * *r4++;
		}
		rep (3) single				     // 8 coeffs along
		{
		     a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++;
		}
		a2 += *r0++ * *r4++ || a3 += *r0++ * *r4++; //so rep single isn't last instr.
	   }
	   a0 -= a1 || a1 += a0;
	   a2 += a3 || a3 -= a2;
	   *(r1+=rn1) = a1 + a2 || *(r2+=rn2) = a0 + a3; 	// TT || BB
	   *(r3+=rn3) = a1 - a2 || *(r4+=rn4) = a0 - a3;	// TB || BT
    }
    nop;						     // prevents R/W clash
}

De-Blocking Filter
The de-blocking filter is a post-processing filter designed to improve the quality of the output video sequence. The de-blocking filter smoothes the transitions between adjacent macroblocks to prevent a 'blocky' image on reconstruction. The filter operations are performed along the 8x8 block edges of luminance and chrominance data. One of two smoothing filters is selected by a threshold comparison. The sum of the differential gradients of the five pixels on either side of the block boundary is evaluated and compared to a predetermined threshold value. If the region is smooth the DC offset mode is applied, otherwise the default mode is used.

An analysis of the algorithm shows that, given one line of 10 pixels spanning a block boundary, the initial gradient analysis can be performed in 16 cycles. The default smoothing filter requires an additional 23 cycles, while the DC offset smoothing filter consumes 48 cycles. A QCIF frame contains 18x22, 8x8 blocks of pixels with an associated 752 boundaries between them. Including the chrominance blocks, 930 boundaries are present, each with eight lines requiring processing. In the worst case, the DC offset mode would be required for all pixels, which translates into 4.8 Mcycles. In the best case, with the default mode used for all pixels, the computational cost is 2.9 Mcycles.

The deblocking filter operates on 8-bit pixels values and therefore may achieve considerable computational cost reductions when implemented on a dynamically reconfigurable DSP with two of the proposed video co-processors. Specifically, the DC offset smoothing filter requires numerous 8-bit compares, which are supported by the video co-processor instruction set, cutting the 48 cycles required for the DC offset filter to 21 cycles by comparing two 8-bit values simultaneously using a 16-bit register. This value is the same as the default filter, resulting in a total computational cost for the deblocking filter with the reconfigurable DSP of 2.6 Mcycles. The 46% reduction over the worst case, and 10% over the best case, results from dynamically reconfiguring the DSP as a highly parallel 8-bit processor.

De-Ringing Filter
The de-ringing post-filter is an adaptive smoothing filter that removes sudden transitions from a frame. Although it is not required, the standard includes this filter for reference as it is typically implemented, even though it has the largest computational cost of any decoding algorithm. Figure 4 shows the basic operations performed in the filter for each macroblock comprising 8x8 pixel blocks. The threshold-acquisition function determines the maximum and minimum pixel values in each 8x8 block and uses these values to calculate a threshold value. Index acquisition uses the threshold value to form a binary index matrix in which a value of 1 is assigned to a pixel value greater than the threshold, while 0 means the value is lower. A 2D FIR filter is then applied to each pixel for which both the pixel and its eight neighbors all have indices of 1. The clipping process limits the maximum difference between the original and filtered values. The following code finds the maximum and minimum 8-bit values from an 8x8 block, the primary calculation required in the threshold acquisition. With two 8-bit values concatenated and stored in each 16-bit memory location, a shift is required to extract the data. Excluding overhead, it can be seen that for each block of N=64, 257 cycles are required for a total of 1.5 Mcycles for the entire frame.


Figure 4:  Basic operations of the deringing post-filter


a0h = *r2< shift a 24bit op.

rep(N/2) block  {					
	a2h = *r2<

You can achieve significant performance improvements by implementing the threshold-acquisition function using two of the proposed video co-processors, since these operations are primarily 8 bits in length. The co-processors can perform the threshold acquisition for an 8x8 block of pixels in 24 cycles for a reduction of 91%, as illustrated in the following code, where the "plug" command is a dynamically reconfiguring call to the co-processor.

The impact of the co-processors is not limited to the threshold acquisition. As is the case with the flag-write-to-memory and pixel-filtering instructions, the index-acquisition and adaptive-smoothing algorithms can also be performed more efficiently. In forming the binary index matrix, two cycles are required to compare and write for two pixels, while the video co-processor writes the flag values, 0 or 1, directly to memory, enabling this operation in a single cycle.


rep(4)  block  {
     cliw ppcmp(r2++, r7++)	//PP max/min functions compare 2 pairs of 8-bit numbers
     {                          //held in 2 16-bit numbers, returning results in a 16-word
       plug1 ppmax(pa0h, *ma1) || ppmax(pa0l, *ma2)||
       plug2 ppmin(pa1h, *ma1)|| ppmin(pa1l, *ma2);
     }
}
cliw ppcmp2(r2, r7)	       //Using the  PP max/min functions to reduce the resulting 
{                              //4 8-bit max and min's down to 2 of each before writing out
     plug1 *ma1 = ppmax(pa0h, pa0l)||
     plug2 *ma2 = ppmin(pa1h, pa1l);
}
a2h = extractz(*r2,a0);		//Extracting the 2 max's to a2 for comparison
a2l = extractz(*r2,a1);
	
a3h = extractz(*r7,a0);		//Extracting the 2 min's to a3 for comparison
a3l = extractz(*r7,a1);

max(a2h,a2l) || min(a3h,a3l);	// Final comparison

The binary coefficients of the 9-tap adaptive filter, shown in Figure 5, coupled with the 8-bit pixel values, enable the first-order smoothing filter in the PPM to filter two pixels per cycle. This is possible since the pixel values need only be shifted and accumulated to perform the convolution. Using two co-processors requires 34 cycles per 8x8 block rather than the 320 cycles necessary without them. The overall computational cost of the de-ringing post-filter is 9.7 Mcycles when implemented without co-processors and only 0.9 Mcycles with them, a reduction of 91%. Table 2 summarizes the results for all three algorithms we have examined.

Decoding
Algorithm
Non-Reconfigurable
Best Case
Non-Reconfigurable
Worst Case
Reconfigurable
Best Case
Reconfigurable
Worst Case
IDCT
3.4
3.4
3.4
3.4
Deblocking Filter
2.9
4.8
2.6
2.6
Deringing Filter
9.7
9.7
0.9
0.9
Total 16.0 17.9 6.9 6.9

Table 2:  Summary of computational requirements (Mcycles)


Figure 5:  Operation of a single-cycle dual-pixel filter

Conclusions
We have presented and analyzed the MPEG-4 decoding standard. We created efficient algorithms to minimize the computational intensity of the three algorithms that consume 71% of the computational resources. With the development of efficient IDCT, de-ringing filter, and de-blocking filter algorithms, we achieve a reduction of the estimated computational cost for these three algorithms, on a typical 16-bit dual-execution unit DSP, from 40-55 Mcycles to approximately 18 Mcycles. The impact of tightly coupled co-processors to actively reconfigure the DSP datapath results in a further reduction of 55-60% to 6.9 Mcycles. As a result, total estimated decoding cost is reduced by approximately 40-45%, and as much as 58-60%, using dynamic reconfigurability.


print

email

rss

Bookmark and Share

Joinpost comment




Please sign in to post comment

Navigate to related information

Most Popular

Product Parts Search

Enter part number or keyword
PartsSearch


FeedbackForm