The study of human sound perception is central to the development of audio codecs. Lossy audio data compression is based on the principle that some audio information is not perceivable and can therefore be discarded. Psychoacoustics is a complex field that involves both physiological and psychological factors.
Figure 1: The cochlea is a structure resembling a snail that is divided into three fluid-filled parts. Two are canals to transmit pressure and the third is the organ of Corti, which detects pressure impulses. Information is transformed into electrical impulses which travel along the auditory nerve to the brain.
While the outer and middle ear affect sound by filtering some frequencies, perception of the sound spectrum occurs in the inner ear, mainly in the cochlea. The cochlea is a fluid-filled spiral tube that resembles a snail shell. The basilar membrane runs along its interior and acts as a transducer from the acoustic to the neural domain. The basilar membrane is sensitive to frequency: different positions along it respond to different frequencies, so it performs something like an instantaneous Fourier transform of the sound waves traveling through the cochlea. The frequency information then reaches the brain through the auditory nerve.
Several psychoacoustic results are of interest for signal processing. The five main ones are the high frequency limit, the absolute threshold of hearing, the absolute threshold of pain, temporal masking and simultaneous masking.
HIGH FREQUENCY LIMIT
The highest sinusoidal frequency that a human can hear depends on the sound's intensity and the listener's age. While young people can hear up to 20 kHz, this limit decreases to about 10 kHz around 60 or 70 years of age. Most loudspeaker specifications guarantee an upper band limit of 15 kHz or higher. In digital recording, the usual upper limits are 22.05 kHz and 24 kHz, which correspond to sampling rates of 44.1 kHz and 48 kHz. According to psychoacoustics and the Nyquist limit, these sampling rates are enough to cover the entire audible spectrum. However, higher sampling rates can be desirable to reduce aliasing and noise.
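The relation between sampling rate and upper frequency limit can be checked with a few lines of Python. This is a minimal sketch; the function names and the 20 kHz default for the hearing limit are illustrative assumptions, not part of any standard API.

```python
def nyquist_limit_hz(sampling_rate_hz: float) -> float:
    """Highest frequency that can be represented at this sampling rate."""
    return sampling_rate_hz / 2.0


def covers_audible_band(sampling_rate_hz: float,
                        upper_hearing_limit_hz: float = 20_000.0) -> bool:
    """True if the Nyquist limit reaches the assumed upper hearing limit."""
    return nyquist_limit_hz(sampling_rate_hz) >= upper_hearing_limit_hz


for rate in (44_100, 48_000):
    print(rate, nyquist_limit_hz(rate), covers_audible_band(rate))
```

Both common rates pass the check, while a lower rate such as 32 kHz would not reach the 20 kHz limit assumed here.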
ABSOLUTE THRESHOLDS OF HEARING AND PAIN
The absolute threshold of hearing (ATH) is the minimum intensity of a pure tone that can be heard by the average human. This threshold is a function of frequency, with a minimum between 1 kHz and 5 kHz. The threshold of hearing has been standardized as a sound pressure of 20 µPa, which is the reference for sound pressure level (SPL) units in decibels (dBSPL).
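The frequency dependence of the ATH can be illustrated with a closed-form approximation. The sketch below uses a formula commonly attributed to Terhardt; it is not taken from this article, it is only one of several published fits, and the function name is an illustrative choice.

```python
import math


def ath_db_spl(frequency_hz: float) -> float:
    """Approximate threshold of hearing in dB SPL (Terhardt-style fit, assumed)."""
    f = frequency_hz / 1000.0  # the fit is expressed in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


# Scan the audible range in 100 Hz steps and find the most sensitive frequency.
frequencies = range(100, 16_000, 100)
most_sensitive_hz = min(frequencies, key=ath_db_spl)
print(most_sensitive_hz, round(ath_db_spl(most_sensitive_hz), 1))
```

Under this approximation the most sensitive frequency falls inside the 1 kHz to 5 kHz range mentioned above, and the threshold rises steeply toward both ends of the spectrum.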
Sound waves become hazardous and unbearable to human hearing when the intensity reaches the threshold of pain. A typical value for this threshold is 120 dBSPL. Like the ATH, the absolute threshold of pain (ATP) is frequency dependent, but with less variation across the audible bandwidth.
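To make the decibel scale concrete, the following sketch converts between sound pressure in pascals and level in dB SPL, assuming the conventional reference pressure of 20 µPa. The constant and function names are illustrative.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # conventional 0 dB SPL reference (20 micropascals)


def pressure_to_db_spl(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to the reference pressure."""
    return 20.0 * math.log10(pressure_pa / REFERENCE_PRESSURE_PA)


def db_spl_to_pressure(level_db_spl: float) -> float:
    """Sound pressure in pascals for a given level in dB SPL."""
    return REFERENCE_PRESSURE_PA * 10 ** (level_db_spl / 20.0)


print(pressure_to_db_spl(REFERENCE_PRESSURE_PA))  # level at the reference pressure
print(db_spl_to_pressure(120.0))                  # pressure at a typical threshold of pain
```

With this reference, a typical 120 dBSPL threshold of pain corresponds to a pressure of about 20 Pa, a million times the reference pressure.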
Human hearing has a large dynamic range (about 120 dB). The dynamic range of sampled sound varies with the number of bits per sample: about 42 dB for 8 bits, 90 dB for 16 bits and 138 dB for 24 bits (assuming 1 bit for sign). The most popular format is 16 bits because it provides a good tradeoff between memory space and fidelity. For applications where sound quality is not a priority, such as voice transmission, 8 bits is often sufficient.
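The figures above follow from roughly 6 dB of dynamic range per magnitude bit. A minimal sketch, assuming one bit is reserved for the sign as stated in the text (the function name is illustrative):

```python
import math


def dynamic_range_db(bits: int, sign_bits: int = 1) -> float:
    """Dynamic range of linear PCM, counting only the magnitude bits."""
    magnitude_bits = bits - sign_bits
    return 20.0 * math.log10(2 ** magnitude_bits)


for bits in (8, 16, 24):
    print(bits, round(dynamic_range_db(bits)))
```

Rounded to the nearest decibel, this gives 42, 90 and 138 dB for 8, 16 and 24 bits.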
TEMPORAL AND SIMULTANEOUS MASKING
When two pure tones are close in frequency and differ greatly in amplitude, the louder one makes the weaker one imperceptible. This effect is known as masking. When both tones are produced at the same time, the masking is simultaneous. If the tones are triggered with a small difference in time, the masking still occurs but is known as temporal masking.
The presence of a tone at a certain frequency raises the threshold of hearing within a certain bandwidth around it. Every spectral component inside this bandwidth with an amplitude smaller than the modified threshold of hearing is masked. Furthermore, band-limited noise masks any sufficiently weak pure tone inside its bandwidth. The audible frequency spectrum can be subdivided into several bands within which any weak sound may be masked in the presence of noise. There are several ways to obtain these subdivisions; one of them is the Bark scale. Below 500 Hz, the bands are 100 Hz wide. Above 500 Hz, the bandwidth grows roughly as 0.2f, where f is the frequency. The frequency band edges, in Hz, are: 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000 and 15500.
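The band edges listed above can be used directly to find the band that contains a given frequency. The sketch below is one possible lookup using a binary search; the function names and the zero-based band numbering are assumptions made for illustration.

```python
from bisect import bisect_right

# Band edges in Hz, as listed in the text (25 edges define 24 bands).
BARK_BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                      1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                      4400, 5300, 6400, 7700, 9500, 12000, 15500]


def critical_band_index(frequency_hz: float) -> int:
    """Zero-based index of the band whose range [low, high) contains the frequency."""
    if not BARK_BAND_EDGES_HZ[0] <= frequency_hz < BARK_BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the tabulated bands")
    return bisect_right(BARK_BAND_EDGES_HZ, frequency_hz) - 1


def critical_band_range(frequency_hz: float) -> tuple:
    """Lower and upper edge, in Hz, of the band containing the frequency."""
    index = critical_band_index(frequency_hz)
    return BARK_BAND_EDGES_HZ[index], BARK_BAND_EDGES_HZ[index + 1]


print(critical_band_index(1000), critical_band_range(1000))
```

A 1 kHz tone, for example, falls in the band from 920 Hz to 1080 Hz, so components within that range are the candidates for being masked by it.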
Masking is used in audio compression to determine which frequency components can be discarded or compressed more aggressively. First, the audio signal is decomposed into several critical bands using filter banks. Average amplitudes are calculated for each band and used to obtain the corresponding hearing thresholds. Components below the modified threshold are considered inaudible and are discarded. In this way the total entropy of the sound is reduced, which makes higher compression rates possible. This is especially true if lossless compression algorithms (such as Huffman coding) are applied to the sound after the frequency components have been discarded.
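The steps described above can be sketched in a highly simplified form. The example below assumes the signal has already been analyzed into (frequency, level) components, groups them by the band edges listed earlier, and discards any component that falls more than a fixed margin below its band's average amplitude. The 10 dB margin and all names are hypothetical; real codecs use far more detailed psychoacoustic models.

```python
import math
from bisect import bisect_right
from collections import defaultdict

BARK_BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                      1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                      4400, 5300, 6400, 7700, 9500, 12000, 15500]
MASKING_MARGIN_DB = 10.0  # hypothetical margin, chosen only for illustration


def band_index(frequency_hz):
    """Index of the band containing the frequency (no range validation)."""
    return bisect_right(BARK_BAND_EDGES_HZ, frequency_hz) - 1


def discard_masked(components, margin_db=MASKING_MARGIN_DB):
    """Keep only (frequency_hz, level_db) pairs at or above their band threshold."""
    bands = defaultdict(list)
    for frequency_hz, level_db in components:
        bands[band_index(frequency_hz)].append((frequency_hz, level_db))

    kept = []
    for members in bands.values():
        # Average the linear amplitudes in the band, then express that in dB.
        mean_amplitude = sum(10 ** (level / 20.0) for _, level in members) / len(members)
        threshold_db = 20.0 * math.log10(mean_amplitude) - margin_db
        kept.extend(c for c in members if c[1] >= threshold_db)
    return sorted(kept)


components = [(1000, 70.0), (1040, 20.0), (3000, 30.0)]
print(discard_masked(components))
```

In this toy input, the weak 1040 Hz component shares a band with the much louder 1000 Hz tone and is dropped, while the 3000 Hz component is alone in its band and is kept.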
Various audio codecs use psychoacoustic models of differing complexity and accuracy to determine which pieces of the audio signal are extraneous. By intelligently choosing a codec to match the demands of an audio system, designers can optimize the performance of their product for audio quality, memory use, processing requirements, or some combination of the three.
About the authors
Christopher Davis received his B.S. in Computer Science from the University of New Mexico in 2001 and his M.S. in Computer Science from the University of New Mexico in 2005. He is currently employed by Respec Information Technologies as a contractor for Sandia National Laboratories. He can be reached by email at firstname.lastname@example.org
Victor M. Vergara is a doctoral candidate in Electrical Engineering at the University of New Mexico. He received his B.S.E.E. degree from the University of Panama, and M.S.E.E. from UNM. He can be reached by email at email@example.com