Recent advances in signal processing combined with an increase in network capacity are paving the way for users to enjoy a host of services wherever they go and on a wide array of multimedia-capable devices. However, with the increasing number of ways to gain access to content and the various terminals that users rely on to interface with the network and play back their content, there are many technical challenges to consider.
The terminals themselves may support some multimedia coding formats but not others; they are also likely to vary in their display capabilities, processing power and memory capacity. Furthermore, the networks to which the terminals are connected often differ in their bandwidth constraints and error characteristics. Additionally, end users may impose constraints or preferences of their own.
Given such a complex environment, sending devices, including intermediate nodes in the transmission path, must deliver content according to current user, network and terminal characteristics. To do so, there also needs to be a way for devices to communicate those characteristics to one another.
Active requests for content can be made through multimedia servers. As the amount of content increases, it will be necessary to index the content for efficient browsing and retrieval. MPEG-7 is now in its final phases of standardizing audiovisual descriptors and description schemes to facilitate such processes. (For further information about MPEG-7, please visit www.mpeg-7.com.)
Instead of submitting active requests, content may also be pushed to the user. For example, such broadcast content as sporting event highlights or breaking news coverage can be made available in the form of a video clip with associated audio. Given the availability of a certain class of content and specified user preferences, the desired content can be sent directly to the user. Even today, people subscribe to services that send them the latest sports scores and stock quotes. It is easy to imagine this service being extended to include multimedia information.
These scenarios fall under the general category of Universal Multimedia Access (UMA). UMA refers to the end-to-end service and delivery of adapted content to different client devices over various networks, where adaptation and the ability to provide variations of the same content are the key notions.
The degree to which content must be adapted depends largely on the current application environment, where it is always assumed that there is some degree of interactivity. That interactivity may take the form of a user programming a personal video recorder (PVR) to filter and record favorite television programs, or actively searching for content through a PC or a mobile device such as a phone or PDA. Therefore, user constraints are present in every application environment. For PVR systems, media conversion is needed if the user would like a summary of a certain program, such as the highlights of last night's soccer game or the top stories from the 10 o'clock news.
The next application environment is one that has both network and user constraints, with the network constraints likely to be the dominant factor. For example, video streaming over an Internet Protocol network is subject to packet loss and delay. Those factors have a major impact on the quality of the received video; without media conversion to reduce the bit rate, the video may not be received at all, or may arrive only after a significant delay.
The final and most challenging application environment is one that must deal with all constraints. Access to multimedia content through mobile devices falls into this category. Not only are network conditions at their most severe, but the terminal itself also has limited processing power, memory capacity and display capabilities. For mobile devices that are capable of video playback, spatial-resolution reduction is required. If only still images can be shown, then key-frame extraction is also needed. For today's devices with only text and audio capabilities, alternative types of media conversion must be employed.
The simplest way to perform media conversion is to reconstruct the signal, possibly perform some processing on the reconstructed signal and then re-encode to meet current constraints. Although that cascaded approach leads to the best quality, it is complex to realize. In general, approaches that perform the conversion without fully reconstructing the original signal are preferred.
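To make the cascaded approach concrete, the sketch below expresses it as a plain decode, process and re-encode pipeline in Python. The decode, process and encode arguments are hypothetical stand-ins for a real codec's entry points, not any standard API:

```python
def cascaded_transcode(bitstream, decode, process, encode):
    """Fully decode, process in the pixel domain, then re-encode.

    Highest quality, but each stage carries the full cost of a
    decoder or encoder, including frame memory and motion search.
    """
    frames = decode(bitstream)              # complete reconstruction
    frames = [process(f) for f in frames]   # e.g. spatial down-sampling
    return encode(frames)                   # full re-encode to new constraints
```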
Arguably, the most challenging type of media to convert is compressed video, owing to the higher data rates and the temporal dependency among frames. In the cascaded approach, a decoding loop and an encoding loop are needed, each with its own frame memory and motion-compensation loop. In recent years, more efficient architectures that approximate this reference method have been proposed. The major simplification in these architectures is the use of a single loop, which compensates for any losses incurred by the conversion process; without such a loop, severe drift in the reconstructed sequence will occur.
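As a rough illustration of the single-loop idea, the following sketch requantizes DCT coefficients while feeding the motion-compensated requantization error back into each frame. It is a conceptual outline under simplified assumptions (dequantized coefficients as input, a caller-supplied motion_comp function), not a faithful model of any particular transcoder architecture:

```python
import numpy as np

def requantize_with_compensation(frames_dct, motion_comp, q2):
    """Single-loop requantization sketch (conceptual, not a real codec).

    frames_dct:  list of dequantized DCT coefficient arrays, one per frame
    motion_comp: hypothetical function applying a frame's motion field
    q2:          coarser quantizer step used to lower the bit rate
    """
    error = np.zeros_like(frames_dct[0])
    out = []
    for coeffs in frames_dct:
        # Fold the motion-compensated loss from earlier frames back in,
        # so errors do not accumulate (drift) along the prediction chain.
        corrected = coeffs + motion_comp(error)
        requant = np.round(corrected / q2) * q2   # coarser quantization
        error = corrected - requant               # loss added this frame
        out.append(requant)
    return out
```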
So far, the conversion of media content has been performed online-that is, after a request for the content has been made and current constraints have been assessed. However, this need not be the case. To alleviate some of the burdens of online conversion, it is good practice to store several variations of the same content. In this way, a suitable starting point can be selected and any online conversion simplified.
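A minimal sketch of that selection step, with an illustrative (made-up) list of stored variations, might look like this:

```python
# Hypothetical catalog of pre-stored variations of one program.
VARIATIONS = [
    {"format": "MPEG-2", "bitrate_kbps": 6000, "width": 720},
    {"format": "MPEG-4", "bitrate_kbps": 768,  "width": 352},
    {"format": "MPEG-4", "bitrate_kbps": 128,  "width": 176},
]

def pick_variation(max_kbps, max_width):
    """Return the richest stored variation that fits the constraints."""
    fits = [v for v in VARIATIONS
            if v["bitrate_kbps"] <= max_kbps and v["width"] <= max_width]
    return max(fits, key=lambda v: v["bitrate_kbps"]) if fits else None

print(pick_variation(max_kbps=1000, max_width=360))
# -> {'format': 'MPEG-4', 'bitrate_kbps': 768, 'width': 352}
```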
A final factor affecting the complexity of media conversion systems is the advancement of scalable coding schemes. Scalable coding schemes have been around for some time with little market impact. However, there has been increased interest in techniques that compress images and video into embedded bit streams. The embedded bit stream is significant because it may be truncated at any given point, with the quality of the reconstructed signal commensurate with the number of bits decoded.
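In practical terms, adapting an embedded bit stream can be as simple as cutting it at the byte budget, as the toy function below illustrates:

```python
def adapt_embedded(bitstream: bytes, byte_budget: int) -> bytes:
    """Truncate an embedded bit stream to fit a budget.

    By construction, the retained prefix is still decodable, just at
    lower quality, so no re-encoding is needed.
    """
    return bitstream[:byte_budget]
```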
For image coding, an embedded bit stream is enabled by the JPEG 2000 standard, and for video coding, it is enabled through the Fine Granular Scalability (FGS) tools adopted by the MPEG-4 standard.
JPEG 2000 also allows for spatial scalability in the embedded stream, while MPEG-4 also allows for FGS temporal scalability. The MPEG committee is now considering the needs and requirements for a more advanced form of scalable video coding that integrates spatial scalability and can provide improved coding efficiency at higher bit rates.
The most widely used form of transcoding is to reduce the bit rate of compressed video bit streams. Techniques for such a conversion are well-known, and efficient architectures exist in the literature.
Traditionally, reducing the bit rate is accomplished by changing the quantization parameter (QP) with which the signal has been coded. However, there are limits to the number of bits that can be saved by changing the QP alone. To overcome that limit, methods that reduce the frame rate, or the number of coded frames per second, may also be used. In a predictive coding scheme that uses motion vectors to predict current frames from a reference frame, reducing the number of frames implies that frames are dropped and motion-vector references may no longer be valid. Therefore, a means of adjusting the motion vectors is also needed, as sketched below.
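The sketch below illustrates one common approximation for that adjustment: the vectors of the dropped frames are summed into the vector of the next surviving frame so that it points at a valid reference. A single vector per frame is assumed for readability; real transcoders operate per macroblock and usually refine the result afterward:

```python
def compose_vectors(vectors, keep_every=2):
    """vectors[i] is frame i's motion vector into frame i-1 (frame 0 intra).

    Returns vectors for the kept frames, each re-pointed at the
    previous *kept* frame by summing across the dropped frames.
    Simple vector addition is an approximation; a refinement step
    is usually applied afterward.
    """
    out, carry = [], (0, 0)
    for i in range(1, len(vectors)):
        dx, dy = vectors[i]
        carry = (carry[0] + dx, carry[1] + dy)   # accumulate across drops
        if i % keep_every == 0:                  # this frame survives
            out.append(carry)
            carry = (0, 0)
    return out

print(compose_vectors([(0, 0), (1, 0), (2, 1), (0, 3), (1, 1)]))
# -> [(3, 1), (1, 4)]
```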
One key application of bit-rate reduction arises when several bit streams are multiplexed over a single channel of fixed bandwidth. In that context, the goal is to vary the number of bits given to each bit stream according to the complexity of its video sequence, ensuring efficient transmission over the shared link. In other words, each reconstructed sequence should have an equal average distortion over time, with more bits spent on more complex sequences and fewer bits on simpler ones. The same transcoding techniques also apply when a single bit stream must be transcoded to meet negotiated network conditions.
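A toy version of that allocation policy simply splits the channel budget in proportion to a per-stream complexity measure; the complexity numbers here are illustrative:

```python
def allocate_bits(complexities, channel_kbps):
    """Split a fixed channel budget in proportion to stream complexity."""
    total = sum(complexities)
    return [channel_kbps * c / total for c in complexities]

# Three programs sharing a 6-Mbit/s channel; the complexity values are
# made up (in practice they might derive from residual energy or past bits).
print(allocate_bits([4.0, 1.0, 1.0], channel_kbps=6000))
# -> [4000.0, 1000.0, 1000.0]
```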
As more image and video content is viewed on mobile devices with limited display size, converting to lower spatial resolutions will be a primary concern. Techniques that allow for such a conversion have been presented in some recent publications.
For video, the main issues include the down-sampling of texture information and the mapping of full-resolution motion vectors to low-resolution motion vectors. The down-sampling of macroblocks is typically performed in the compressed domain on discrete cosine transform (DCT) coefficients. To map the motion vectors, they should first be scaled to the lower spatial resolution, then undergo a refinement process so that the reduced-resolution motion vectors correspond better to the associated residual component that is also transmitted in the bit stream.
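A minimal sketch of the vector-mapping step for 2:1 down-sampling might scale the four co-located full-resolution vectors and combine them by averaging, one of several heuristics used before refinement:

```python
import numpy as np

def map_motion_vectors(mvs, factor=2):
    """Map four co-located full-resolution macroblock vectors to one
    reduced-resolution vector: scale to the smaller grid, then average.
    A subsequent refinement step would search a small window around
    the result.
    """
    mvs = np.asarray(mvs, dtype=float)
    scaled = mvs / factor          # adjust to the down-sampled grid
    return scaled.mean(axis=0)     # simple combination heuristic

print(map_motion_vectors([(4, 2), (6, 2), (4, 0), (2, 2)]))
# -> [2.   0.75]
```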
As with the requantization of DCT coefficients, the scaling of texture and motion information will also cause drift. To compensate for the drift, analogous closed-loop architectures are needed.
There are limits on how far the bit rate can be reduced at a given spatial resolution and, similarly, on how far the spatial resolution itself can be reduced. Video summarization provides a means of abstracting the content by transmitting only the most relevant material for a given set of constraints. A simple example is a video sequence with associated audio being reduced to a single key frame with a few lines of text.
To measure the performance of traditional conversion techniques, typical measures such as mean-square error can be used. However, for video summarization, it is difficult to quantify the performance of different summarization techniques. Some fidelity metrics based on visual characteristics of the content have been proposed to minimize visual-content redundancy among video frames.
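As a toy example of such a redundancy-driven approach, the sketch below selects a new key frame only when a frame differs sufficiently, in the mean-square sense, from the last selected one; the threshold is an illustrative tuning knob:

```python
import numpy as np

def select_key_frames(frames, threshold=100.0):
    """Keep a frame as a key frame only when it differs enough from
    the previously selected key frame (mean-square difference)."""
    keys = [0]
    for i in range(1, len(frames)):
        diff = frames[i].astype(float) - frames[keys[-1]].astype(float)
        if np.mean(diff ** 2) > threshold:
            keys.append(i)
    return keys

# Four synthetic 8x8 "frames" with gray levels 0, 2, 30 and 31.
frames = [np.full((8, 8), v, dtype=np.uint8) for v in (0, 2, 30, 31)]
print(select_key_frames(frames))   # -> [0, 2]
```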
Summaries and pointers
Of course, a summary may also be produced semiautomatically or manually, and it may be represented simply as pointers to frames or clips within the original compressed bit stream.
There are obviously many types of conversions that we have not discussed. Briefly, these include color-depth conversion, such as for a PDA with a 4-bit display; text translation between different languages; and image-format conversions, such as between GIF and JPEG.
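For instance, a hedged sketch of color-depth conversion for a 4-bit grayscale display could reduce 24-bit RGB to 16 gray levels via a luma transform and uniform quantization:

```python
import numpy as np

def to_4bit_gray(rgb):
    """rgb: uint8 array of shape (H, W, 3).

    Convert to luma using ITU-R BT.601 weights, then quantize
    uniformly to 16 levels (0..15) for a 4-bit display.
    """
    luma = rgb @ np.array([0.299, 0.587, 0.114])
    return (luma / 255 * 15).round().astype(np.uint8)
```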
With regard to video format conversion, conversion from MPEG-1/2 to MPEG-4 is expected to be in great demand. MPEG-1 content is widely available over the Internet, and MPEG-2 is the format used for digital television broadcast and DVD. Since MPEG-4 is the coding format adopted by the 3GPP specification, a syntax conversion would need to be applied.
The full version of this article is being presented this week at the IEEE International Conference on Consumer Electronics.