Introduction
A variety of audio compression technologies are in use today, each offering distinct advantages in compression ratio, coding delay, coding complexity or legacy-system compatibility. This makes a subset of audio codecs suited to particular systems, and makes working with multiple audio compression technologies indispensable.
In designing time-critical systems such as conferencing, broadcast transcoding or audio/video play-out systems, knowledge of the delay incurred during audio encoding and decoding becomes critical.
Figure 1 captures the various stages at which audio data encounters delay in different applications and systems. The delay encountered in audio systems can be broadly classified as follows:
Processing Delay: Computing any algorithm on a processor consumes a finite amount of time. This delay is inversely proportional to the CPU speed and to the CPU's ability to compute complex logic in minimal cycles.
Algorithmic delay: The core processing during audio compression and decompression works on frames of samples and inherently introduces delay. This delay is similar to that introduced by digital filtering.
Application Delay: Applications involve various forms of buffering, for smooth play-out or for packetization and streaming, which introduce delay. IP streaming or transmission over a constant bit-rate channel introduces further delay that depends on the available bandwidth.
The focus of this article is the algorithmic delay introduced by audio compression and decompression. This delay depends primarily on the chosen compression technology and can be quantified as the sample lag of the output with respect to the input.
This article also discusses using this knowledge of algorithmic delay to synchronize parametric processing modules like SBR (Spectral Band Replication), PS (Parametric Stereo) and MPEG Surround when used alongside transform coding technologies like AAC. The article aims to elucidate the algorithmic delays resulting from the various modules used in MPEG audio codecs, and compares the performance of the codecs with respect to this attribute.
The first few sections of the article examine the modules that introduce delay: filter banks (FB), the Modified Discrete Cosine Transform (MDCT) and the bit-reservoir module used in MPEG audio codecs. The later sections discuss the algorithmic delay encountered in popular MPEG codecs such as MPEG-1 Layer 2, MPEG-1 Layer 3, AAC-LC, AAC-LD, HE-AAC, HE-AAC v2 and MPEG Surround, and explain the synchronization required when using parametric encoding tools. The article assumes that the reader has an overview of MPEG audio compression technology and focuses primarily on the delay introduced during compression and decompression.
Modules Affecting Delay
This section discusses the various modules used in MPEG audio codecs that contribute to the algorithmic delay. The delay values derived here are used in subsequent sections to obtain the total algorithmic delay of specific MPEG audio codecs.
Framing Delay
In real-time capture systems, a frame's worth of data must be accumulated before it is fed to the encoder. This is the framing delay. Expressed as a number of samples, it is equal to the input frame size of the codec, K.
Δframing = K.
MDCT Delay
The Modified Discrete Cosine Transform is a lapped transform with the Time Domain Alias Cancellation (TDAC) property. The MDCT transforms 2W time-domain input samples into W coefficients, where 2W is the length of the MDCT window. Its ability to overcome blocking artifacts through a 50% overlap across blocks, without increasing the bit rate, makes it a powerful tool in signal compression.
For an MDCT with window size 2W, perfect reconstruction at the decoder requires an initial zero block of W samples. The delay is thus equal to half the window size.
Δmdct = W
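This W-sample delay can be checked numerically. Below is a minimal sketch (Python/NumPy, direct matrix-form transform with a sine window; illustrative, not an optimized implementation) that runs a signal through a windowed MDCT/IMDCT chain with 50% overlap-add, starting from an initial zero block, and recovers the input delayed by exactly W samples:

```python
import numpy as np

W = 64                                        # half the MDCT window length
n = np.arange(2 * W)
win = np.sin(np.pi * (n + 0.5) / (2 * W))     # sine (Princen-Bradley) window
basis = np.cos(np.pi / W * np.outer(n + 0.5 + W / 2, np.arange(W) + 0.5))

def mdct(block):
    # 2W windowed time samples -> W coefficients
    return (win * block) @ basis

def imdct(coeffs):
    # W coefficients -> 2W windowed time samples
    return win * (2.0 / W) * (basis @ coeffs)

rng = np.random.default_rng(0)
L = 4 * W
s = rng.standard_normal(L)
# Initial zero block of W samples, plus W trailing zeros to flush the overlap
padded = np.concatenate([np.zeros(W), s, np.zeros(W)])

y = np.zeros(L + 2 * W)
for t in range(L // W + 1):                   # 50% overlapped blocks, hop = W
    y[t * W : t * W + 2 * W] += imdct(mdct(padded[t * W : t * W + 2 * W]))

# y[W : W + L] now matches s: the chain delays the input by exactly W samples
```

The time-domain aliasing introduced by each block cancels across the overlap-add of neighboring blocks, which is why the reconstruction only becomes valid once the second block contributes, i.e. W samples late.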
Block Switching Delay
Varied window sizes are used to improve frequency resolution for stationary input and time resolution for transient input, thus reducing pre-echo artifacts on transients. Intermediate transition windows between long and short windows smooth the switching, as shown in Figure 2.
The block decision is taken at the encoder by looking beyond the current window and is conveyed to the decoder. If the length of the short windows in a frame is Nshort, then Nshort = Nlong / x, where x is the number of short windows in a short-window frame. To check for transients in the next frame, the duration highlighted in blue in the figure must be examined. Expressed in number of samples:
Nshort * (x + 1) / 2 = Nlong * (x + 1) / (2 * x)
= Nlong / 2 + Nshort / 2
Hence, the encoder block switching delay is Δblk_sw = Nlong (x + 1) / (2 * x)
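For the codecs discussed later in this article, the formula evaluates as follows (a small sketch; the integer division is exact for these parameter choices):

```python
def block_switch_delay(n_long, x):
    # Look-ahead of Nlong/2 + Nshort/2 samples = Nlong * (x + 1) / (2 * x)
    return n_long * (x + 1) // (2 * x)

print(block_switch_delay(576, 3))    # MP3: Nlong = 576,  x = 3 -> 384 samples
print(block_switch_delay(1024, 8))   # AAC: Nlong = 1024, x = 8 -> 576 samples
```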
Quadrature Mirror Filter Bank Delay
An N-channel filter bank splits the given time-domain signal y(n) into N sub-band signals yk(n), k=0:N-1, each decimated by a factor of N. Figure 3a shows an N-channel analysis and synthesis filter bank.
The filters Hk(z) (and Fk(z)), k > 0, are cosine-modulated versions of a symmetric low-pass prototype filter H0(z) (and F0(z)). Each of the Hk(z) filters can be decomposed into poly-phase components of the prototype filter.
Rearrangement for an optimal implementation can be done as shown in Figure 3b.
Let the low-pass linear-phase FIR filter H0(z) be of length M = LN, where M, N and L are integers, so that its N poly-phase FIR components El(z), l=0:N-1, are each of length L. The same analysis holds at the synthesis filter bank. The analysis and synthesis filters each contribute a delay of (M-1)/2 samples, while decimation/interpolation contributes a phase of -(N-1). Hence the combined delay of the analysis and synthesis filters is (M-1) – (N-1), or ΔQMFana_syn = M - N samples.
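As a quick numeric check (a sketch; the parameters are those of the MPEG-1 Layer 2 case discussed later in this article):

```python
def qmf_ana_syn_delay(M, N):
    # (M - 1) / 2 samples each for analysis and synthesis filtering,
    # minus the (N - 1)-sample phase absorbed by decimation/interpolation
    return (M - 1) - (N - 1)             # = M - N

print(qmf_ana_syn_delay(513, 32))        # 513-tap prototype, 32 bands -> 481
```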
Hybrid Filter Bank Delay
Hybrid filter banks were introduced in the parametric MPEG audio codecs to obtain non-uniformly spaced frequency bands from a uniformly spaced QMF bank. As the ear is more sensitive to lower frequencies, the parameters calculated there are more closely spaced than at higher frequencies.
The hybrid filter bank is implemented by extending the QMF bank with a second filtering stage that increases the resolution of the low-frequency bands. The high-frequency bands are delayed appropriately to stay synchronized with the hybrid filter output. If this symmetric FIR filter is of length P, the delay at each hybrid FB output is (P-1)/2 sub-band samples. Since these filters operate on the N-channel QMF bank, the total signal delay is,
Δhybrid = N * (P – 1) / 2 samples.
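As a numeric illustration (a sketch; the 13-tap sub-filter length assumed here is typical of the MPEG-4 PS hybrid filter bank but should be checked against the target specification):

```python
def hybrid_delay(N, P):
    # (P - 1) / 2 sub-band samples of delay, each worth N time-domain samples
    return N * (P - 1) // 2

print(hybrid_delay(64, 13))   # 64-band QMF, assumed 13-tap sub-filter -> 384
```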
Usage of Bit Reservoir
Bit demand is bound to be non-uniform across audio frames due to the non-stationary nature of audio signals. A bit-reservoir mechanism accommodates this variation in bit demand while maintaining an overall average bit rate. The size of the bit reservoir determines the maximum transmission delay, inclusive of the output-frame transmission delay. This delay is significant in real-time streaming applications.
Delay in MPEG codecs
Delays of some of the popular MPEG audio technologies are discussed in the following section.
MPEG1 Layer2 codec
The frame size K is 1152 samples, resulting in an equal framing delay. Time-to-frequency mapping is done using a 32-band QMF bank whose prototype filter has 513 taps. This results in a combined QMF analysis and synthesis delay (ΔQMFana_syn) of 481 samples. The MPEG-1 Layer 2 codec does not allow the number of bits to vary across frames and hence has no bit reservoir. The total delay is,
ΔMP12 = K + ΔQMFana_syn .
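Numerically (a sketch using the figures above):

```python
fs = 48000
K = 1152                                 # framing delay in samples
qmf_ana_syn = 513 - 32                   # ΔQMFana_syn = M - N = 481 samples
delta_mp12 = K + qmf_ana_syn             # 1633 samples, roughly 34 ms at 48 kHz
print(delta_mp12)
```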
MP3 codec
In MPEG-1 Layer 3, the input frame size is 1152 samples. The QMF delay is the same as that of the MPEG-1 Layer 2 codec. In Layer 3, the QMF output feeds an MDCT module for better frequency resolution. This MDCT operates on 36 samples, so its delay is 18 samples for each sub-band. The block-switching module replaces the long window with three short windows during transients; the corresponding delay (Δblk_sw) is 384 samples.
In MPEG-1 Layer 3, the number of output frames per second (fps) is constant. However, the compressed frame size can vary with bit-rate demand, and this variation is limited by the standard specification. The resulting bit-reservoir delay can be calculated as follows:
Δbr_mp3 = ceil {(512 * Fs) / (144 * BR)} * K + K, where K is 1152 samples for MPEG-1 and 576 samples for MPEG-2.
The total algorithmic delay, ΔMP3 can be expressed as:
ΔMP3 = K + ΔQMFana_syn + Δmdct + Δblk_sw + Δbr_mp3
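These terms can be evaluated together. The sketch below assumes 48 kHz, 128 kbps MPEG-1 operation, and expresses the per-sub-band 18-sample MDCT delay in full-rate samples as 18 × 32 = 576 (a conversion made here for illustration):

```python
import math

def mp3_delay(fs, br, K=1152):
    qmf = 513 - 32                       # ΔQMFana_syn = 481 samples
    mdct = 18 * 32                       # 18 samples per sub-band, 32 sub-bands
    blk_sw = 384                         # Δblk_sw for Nlong = 576, x = 3
    bitres = math.ceil((512 * fs) / (144 * br)) * K + K
    return K + qmf + mdct + blk_sw + bitres

print(mp3_delay(48000, 128000))          # 1152 + 481 + 576 + 384 + 3456 = 6049
```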
AAC-LC codec
AAC is the technology specified by the ISO/IEC MPEG-2 standard. Each frame is 1024 samples (K). The MDCT block requires a 2048-sample time input, implying an MDCT delay (Δmdct) of 1024 samples. Eight short windows constitute one short block, and the block-switching delay (Δblk_sw) works out to 576 samples. For the MPEG-2/4 AAC family of codecs, the bit-reservoir size is calculated for the maximum allowed bit rate, BRmax@Fs, at a given sampling frequency Fs as,
bitres_size = K * (BRmax@Fs )/Fs.
Maximum delay due to bit reservoir in terms of samples at a given bit rate BR, is as follows:
Δbr_aac = bitres_size * Fs/BR for AAC.
For a sampling frequency of 48 kHz, the standard specifies the maximum possible bit rate as 288 kbps per channel, implying maximum bit reservoir of 6144 bits per channel.
The algorithmic delay in AAC-LC would be,
ΔAAC-LC = K + Δmdct + Δblk_sw + Δbr_aac.
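For a concrete figure, the sketch below evaluates the total at 48 kHz and 128 kbps per channel (the 288 kbps / 6144-bit maximum comes from the text above):

```python
def aac_lc_delay(br, fs=48000, K=1024, br_max=288000):
    mdct = K                             # Δmdct = 1024 samples
    blk_sw = 576                         # Δblk_sw = K * (x + 1) / (2x), x = 8
    bitres_size = K * br_max // fs       # 6144 bits per channel at 48 kHz
    d_br = bitres_size * fs // br        # bit-reservoir delay in samples
    return K + mdct + blk_sw + d_br

print(aac_lc_delay(128000))              # 1024 + 1024 + 576 + 2304 = 4928
```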
AAC-LD codec
The AAC Low Delay codec was developed to reduce algorithmic delay for interactive two-way communication. It is based on MPEG-2 AAC audio compression technology. To reduce the algorithmic delay, the frame size is reduced to 512 (or 480) samples, implying a framing delay K of 512 (or 480) samples. The input to the MDCT is now 1024 (or 960) samples, with an MDCT delay (Δmdct) of 512 (or 480) samples. The block-switching module is eliminated to reduce algorithmic delay, and the bit reservoir is minimized to achieve the target delay. Assuming the extreme case of no bit reservoir and 480-sample frames, the delay encountered in AAC-LD would be,
ΔAAC-LD = K + Δmdct.
HE-AAC codec
The HE-AAC codec, as defined by the MPEG-4 standard, uses the SBR tool with AAC. The input frame size is 2048 samples. To extract the SBR parameters, a 64-band QMF analysis filter bank with a 641-tap prototype filter is used. The resulting delay, ΔQMFana, is 320 samples. In parallel with the SBR tool, the AAC part of the HE-AAC codec typically operates at half the input sampling rate. Δmdct and Δblk_sw result in a 1600-sample delay at Fs/2.
At the decoder, after AAC decoding, a 32-band analysis QMF bank is used to regenerate the lower half of the spectrum, resulting in a 160-sample delay at Fs/2. The high-frequency (HF) generator and envelope-adjustment modules in the SBR reconstruction add ΔHFgen_ENVadj (6 samples) of delay for each of the 64 channels of the QMF FB, resulting in an overall 384-sample delay.
A 64-band synthesis QMF bank converts the full-bandwidth signal back to the time domain, giving ΔQMFsyn = 257 samples of delay. The value of ΔQMFdec is thus 160 * 2 + 257 = 577 samples. The maximum bit-reservoir size is the same as for AAC. A block diagram of the codec is shown in Figure 4.
The total delay can be expressed as,
ΔHEAAC = ΔAAC-LC * 2 + Δdown_sampler + ΔQMFdec + ΔHFgen_ENVadj * N.
Synchronization of SBR parameters with AAC stream
The encoder transmits the SBR parameters, which are used at the decoder for high-frequency reconstruction from the low-frequency (LF) signal decoded by the AAC decoder. For a valid reconstruction, the extracted parameters must be in sync with the signal at the decoder. It can be seen that parameters extracted after ΔQMFana samples at the encoder are applied at the decoder after,
Δparam_apply = 2 * (Δmdct + Δblk_sw) + Δdown_sampler + ΔQMFdec – ΔQMFsyn + ΔHFgen_ENVadj * N
Hence,
Δsync_delay = Δparam_apply – ΔQMFana = P1 + P2 samples (refer to Figure 4).
P2 must be chosen as a multiple of the number of QMF frequency bands, and P1 is then Δsync_delay – P2. The input to the parameter extractor must therefore be delayed by P1 + P2 samples to achieve synchronization.
HE-AACv2 codec
In HE-AAC v2, the parametric stereo (PS) tool is used, in addition to the SBR tool, with AAC to further reduce the bit rate. The PS tool extracts the stereo cues from the signal at the encoder and down-mixes it to a mono signal, which is coded using SBR and AAC. The frame size is the same as for the HE-AAC codec. The additional blocks are the hybrid filter banks introduced in the encoder and decoder.
The PS parameters are extracted from the hybrid filter banks, followed by SBR parameter extraction from the QMF filter banks. Down-sampling for the AAC path is typically achieved using a 32-band QMF synthesis filter bank. At the decoder, a 32-band analysis QMF bank is used along with the SBR parameters to reconstruct the high-frequency part of the spectrum. This is followed by a hybrid filter bank to which the parameters (stereo cues) are applied to recreate the stereo signal. A block diagram of the HE-AAC v2 codec, with the hybrid analysis filter banks and the QMF analysis and synthesis filter banks at both the encoder and decoder ends, is shown in Figure 5.
It can be seen that,
ΔHEAACv2 = 2 * (ΔAAC-LC + ΔQMFana_syn + Δhybrid)
Synchronization of PS and SBR parameters with AAC stream
Here, the encoder transmits the PS and SBR parameters to the decoder to reconstruct the stereo signal. The parameters are extracted after (ΔQMFana + Δhybrid) samples at the encoder end and applied at the decoder after Δparam_apply, where,
Δparam_apply = (2 * (Δmdct + Δblk_sw) + 2 * ΔQMFana_syn – ΔQMFsyn + 2 * Δhybrid) samples.
Thus,
Δsync_delay = Δparam_apply – (ΔQMFana + Δhybrid) = P1 + P2 samples (refer to Figure 5).
The combined buffer P1 + P2 must be a multiple of the input frame size to achieve synchronization.
MPEG Surround codec
MPEG Surround (MPS) is a parametric coding technology that down-mixes an M-channel audio input to N channels, where N is less than M. To efficiently reconstruct the spatial image at the decoder, the encoder extracts the spatial cues and transmits them along with the down-mix signal. This spatial information requires a lower data rate than transmitting all M channels.
A key feature is the ability to scale the spatial image quality by varying the spatial parameter data rate. This codec uses a 64-channel QMF Bank for analysis. A block diagram of the MPEG Surround codec is shown in Figure 6.
To facilitate a common delay in both low-power (LP, real QMF) and high-quality (HQ, complex QMF) decoding, a delay of Δreal_to_complex_converter is introduced in the HQ path. A buffer D (Figure 6) is introduced to synchronize the down-mix signal with the spatial parameters, adding to the codec delay.
The algorithmic delay is
ΔMPS = K + 2 * (ΔQMFana_syn + Δhybrid) + Δreal_to_complex_converter + D
Synchronization of MPS parameters with down-mix stream
Both spatial parameters and down-mix signal are synchronized by introducing buffers P and D respectively, such that the following holds:
P = N * FrameSize such that P >= Δoffset
Δoffset is the number of samples by which the down-mix signal is offset with respect to the parameters. As can be seen from Figure 6, Δoffset = 1281 for the HQ (and 1601 for the LP) filter bank, and hence the down-mix delay is,
D = P – Δoffset
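Under the constraint above, P and D follow mechanically. A sketch (the 2048-sample down-mix frame size used here is an assumption for illustration; the actual FrameSize depends on the configuration and underlying codec):

```python
def mps_sync_buffers(frame_size, offset):
    n = -(-offset // frame_size)     # smallest N with N * frame_size >= offset
    p = n * frame_size
    return p, p - offset             # (P, D)

print(mps_sync_buffers(2048, 1281))  # HQ path: P = 2048, D = 767
print(mps_sync_buffers(2048, 1601))  # LP path: P = 2048, D = 447
```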
MPEG Surround's capabilities can be further exploited when combined with a perceptual codec; its frame size then depends on the underlying codec. Combined with an HE-AAC codec for a 5.1-to-stereo down-mix, with a 2048-sample frame size, the surround parameters are extracted from the hybrid FB, followed by the SBR parameters from the down-mixed QMF bank, in a manner similar to the HE-AAC v2 encoder.
The down-sampling required for the AAC codec is achieved by the QMF synthesis filter bank. At the decoder, a 32-band analysis QMF bank is used along with the SBR parameters to reconstruct the high-frequency part of the spectrum. The hybrid filter bank is then used to recreate the multichannel signal. The total delay is
ΔMPS_HEAAC= ΔHEAACv2 + Δreal_to_complex_converter + D
From a synchronization point of view, an additional delay of Δmdct + Δblk_sw at Fs/2 is added to the Δoffset value.
Summary
Knowledge of the delay encountered while transmitting audio data is key to designing time-critical systems. We have derived the algorithmic delay encountered in MPEG audio codecs and examined the modules that contribute to this delay. We have also looked at how to synchronize the parametric coding tools with the core waveform-coding algorithms. Table 1 summarizes the algorithmic delay of the various codecs discussed.
We observe that achieving good quality at lower bit rates comes at the cost of delay, as the comparison of AAC-LC and AAC-LD shows. AAC-LD, being the lowest-delay codec, is suited to telephony and conferencing applications, while codecs like HE-AAC and MPEG Surround are suited to broadcast and streaming applications due to the advantage they offer in bit rate and audio quality.
The delays of the popular audio codecs have been derived, serving as a ready reference for designs needing A/V sync, for transcoder applications in the broadcast space, for conferencing, and for other real-time systems with a delay budget.