Portland, Ore. - Just as digital image-processing algorithms can correct blemishes, poor lighting and bad color in a photograph, software created for digital signal processing can make an average singing voice sound like a trained one. An audio counterpart to digital image enhancement being devised at Purdue University (West Lafayette, Ind.) adds the features that mark virtuosos, such as perfect pitch, accurate phrasing, resonance and vibrato synced to the beat.
A team led by Mark (J.T.) Smith, who heads the School of Electrical and Computer Engineering at Purdue, began by analyzing hundreds of virtuoso performances to glean the traits that set the professional apart from the amateur and to capture those characteristics in software. Smith presented his work last week before the 145th Meeting of the Acoustical Society of America in Nashville, Tenn.
"We don't want people to think we have a real-time system yet; so far we have only worked with a database of voice segments," said Smith. "But you can judge for yourself just how good we have been able to make an average singer sound by listening to our befores and afters." Examples are posted at news.uns.purdue.edu/UNS/html4ever/030423.Smith.singing.html.
American idols, they're not
A sampling indicates why Smith says "if only that were true" when asked if his technique can make polished crooners of the tone-deaf. Nevertheless, the enhanced voices do make the singers sound as if they have instantly acquired a year or two of vocal lessons.
"Specifically, we perform modifications that change the resonant frequency of the voice," said Smith. "We correct pitch errors, we introduce vibrato and we time-scale the signal so that we can extend or shorten the duration of a sound if we need to."
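The three modifications Smith lists (pitch correction, vibrato and time-scaling) can be sketched in a few lines. The snippet below is illustrative only, not the Purdue system; the sample rate, the vibrato rate and depth, and the function names (`synth_tone`, `correct_pitch`) are all assumptions for the sketch.

```python
import numpy as np

SR = 16000  # sample rate in Hz; an illustrative choice, not from the research


def synth_tone(freq, dur, vibrato_hz=0.0, vibrato_depth=0.0):
    """Synthesize a sine tone, optionally frequency-modulated with vibrato.

    vibrato_hz    -- rate of the pitch wobble (singers often sit near 5-6 Hz)
    vibrato_depth -- peak deviation as a fraction of the base frequency
    """
    t = np.arange(int(SR * dur)) / SR
    # The instantaneous frequency wobbles sinusoidally around the base pitch.
    inst_freq = freq * (1.0 + vibrato_depth * np.sin(2 * np.pi * vibrato_hz * t))
    # Integrate instantaneous frequency to get phase, then take the sine.
    phase = 2 * np.pi * np.cumsum(inst_freq) / SR
    return np.sin(phase)


def correct_pitch(sung_hz, target_hz):
    """Return the frequency ratio that moves a sung pitch onto the target.

    A real system would apply this ratio inside its sinusoidal model;
    here it is just the number a pitch corrector needs to compute.
    """
    return target_hz / sung_hz


# A singer lands flat at 255 Hz when aiming for C4 (261.63 Hz):
ratio = correct_pitch(255.0, 261.63)
tone = synth_tone(261.63, 0.5, vibrato_hz=5.5, vibrato_depth=0.01)
```

Time-scaling, the third modification, would stretch or shrink `t` for each analyzed segment rather than resampling the raw waveform.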
Besides being perfectly on pitch, on the beat and using vibrato (a pleasing wavering in the frequency), the digitally enhanced singing voices sound fuller and richer.
"Professional singers have what is called a singer's formant," Smith said. He defined this as "a part of the frequency spectrum that is richer because it is emphasized, providing a boost to a certain part of the spectrum from a resonant cavity in the throat (their vocal tract) that produces a richer set of resonances than a normal voice." The Purdue technique can "adjust the spectral representation so that we add those kinds of resonances into the voice too," he said.
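A formant-style boost can be approximated with a standard peaking-EQ filter centered in the 2-to-3-kHz region where the singer's formant typically sits. This sketch uses the well-known Robert Bristow-Johnson audio-EQ-cookbook biquad, not the Purdue method; the center frequency, gain and Q below are illustrative guesses.

```python
import numpy as np


def peaking_eq(gain_db, f0, q, fs):
    """RBJ-cookbook peaking-EQ biquad: boosts a band around f0 by gain_db."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]


def freq_response(b, a, f, fs):
    """Magnitude of the biquad's response at frequency f."""
    z = np.exp(-2j * np.pi * f / fs)
    num = b[0] + b[1] * z + b[2] * z * z
    den = a[0] + a[1] * z + a[2] * z * z
    return abs(num / den)


# Boost ~3 kHz by 6 dB, a rough stand-in for a singer's-formant emphasis.
b, a = peaking_eq(gain_db=6.0, f0=3000.0, q=2.0, fs=16000.0)
gain = freq_response(b, a, 3000.0, 16000.0)  # roughly 2x amplitude at the peak
```

The filter leaves frequencies far from the peak nearly untouched, which is the point: only the formant band is enriched.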
According to Smith, engineers would be wrong to assume that any of this could be done with existing algorithms. His group has been at work on the problem since 1985, and the researchers met with repeated failures before arriving at a representation of virtuosity that enabled them to add a couple of years of voice lessons to anyone's untrained voice.
"We look at the results from good singers and those of bad singers, and try to understand those differences, and that's not so easy," said Smith. "I can't tell you how many experiments we have tried where we did not get what we were hoping for."
As a faculty member at the Georgia Institute of Technology, Smith worked with Georgia Tech graduate student Matthew Lee and former doctoral student Bryan George, who pioneered the successive-approximation method Smith's Purdue team still uses. Georgia Tech professor Mark Clements and his graduate student Michael Macon developed the method for changing typed lyrics into singing. Smith and Lee experimented with synthesizing musical instruments before Smith undertook the current version, designed to improve singing voices.
The key innovation, according to Smith, is using successive-approximation decomposition instead of standard Fourier decomposition.
"You can't just go into the domain of a normal Fourier decomposition and apply filters to it, because you will often introduce artifacts that sound unnatural; there are too many limits to what you can do," said Smith. By contrast, "our successive-approximation approach gives us more flexibility as to what you can modify."
Fourier decomposition represents a signal as a series of harmonics above a fundamental. For instance, if the fundamental frequency in the voice is 100 Hz, then the harmonics are 200, 300, 400 . . . Hz. The decomposed representation then is a weighted sum of sinusoids at those harmonic frequencies, and it is that sum that sounds unnatural when altered.
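A toy example makes the harmonic representation concrete. Assuming a synthetic "voice" built from a 100-Hz fundamental plus two weaker harmonics, the FFT recovers exactly the per-harmonic weights; the sample rate here is chosen so bins land on the harmonics, not taken from the research.

```python
import numpy as np

fs = 1000                 # sample rate (Hz); one second of signal gives 1-Hz bins
t = np.arange(fs) / fs

# A toy "voice": 100-Hz fundamental plus two weaker harmonics.
signal = (1.00 * np.sin(2 * np.pi * 100 * t)
          + 0.50 * np.sin(2 * np.pi * 200 * t)
          + 0.25 * np.sin(2 * np.pi * 300 * t))

# Normalize the FFT magnitudes so each bin reads out the sinusoid's amplitude.
spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)

# The Fourier representation is just the weights at the harmonic bins:
weights = {hz: spectrum[hz] for hz in (100, 200, 300)}
```

Modifying the voice in this representation means rescaling those weights, which, as Smith notes, tends to introduce audible artifacts.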
"The reason that successive approximation works so well is that it gives us a representation that we can modify without its sounding unnatural," said Smith. "We can freely change the pitch, change the time scale and [enlist] other useful parameters that help us do the modification."
One of the most important features applied to the sinusoidal model to make it sound more natural is an "overlap-add" construction, whereby voice samples are partitioned into segments that overlap, so that the resulting voice synthesis is smooth and uninterrupted.
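The overlap-add idea can be demonstrated with 50-percent-overlapped Hann windows, which sum to exactly 1 so the pieces rejoin seamlessly. This sketch shows only the analysis and resynthesis plumbing, with no modification step in between; the frame size and overlap are arbitrary choices, not parameters from the Purdue system.

```python
import numpy as np

N, hop = 512, 256  # frame length and hop (50 percent overlap)

# Periodic Hann window: 50-percent-overlapped copies sum to exactly 1,
# which is what makes overlap-add resynthesis seamless.
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)  # stand-in for a voice segment

# Analysis: cut overlapping windowed frames.
# Synthesis: add them back at the same offsets.
y = np.zeros_like(x)
for start in range(0, len(x) - N + 1, hop):
    frame = x[start:start + N] * win
    # (a real system would modify `frame` here before adding it back)
    y[start:start + N] += frame

# Away from the edges, the overlapping windows sum to 1 and the
# signal is reconstructed exactly.
interior = slice(N, len(x) - N)
err = np.max(np.abs(y[interior] - x[interior]))
```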
"We are not doing a linear-time-invariant filtering," Smith said. "What we are doing is decomposing the original signal into an overlap-add sum of sinusoidals. Then we modify the parameters and resynthesize the signal very quickly using the fast Fourier transform."
Adding it up
The Fourier method can be used in the synthesis stage because adding up a weighted sum is the same process whether it is a sum of harmonics or a sum of overlap-add frequencies, such as those gleaned from successive approximation.
"With the successive-approximation method, first we analyze the signal and choose the best matching sinusoidal approximation, which is analogous to the fundamental in normal Fourier decomposition," Smith explained. "Then we subtract that off and find the next best approximation, and the process repeats. At each step we take a successive approximation, resulting in a weighted sum of sinusoidals, with the frequency, amplitude and phase as our parameters."
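Smith's description maps onto a greedy "fit, subtract, repeat" loop. The sketch below implements a much-simplified version of that idea, picking the dominant FFT bin at each step and limiting frequencies to FFT-bin resolution; it is a stand-in for the concept, not the Purdue algorithm.

```python
import numpy as np


def successive_sine_approx(x, fs, n_components):
    """Greedy sinusoidal decomposition: at each step, fit the single best
    sinusoid (frequency, amplitude, phase), subtract it from the residual,
    and repeat.  Frequencies are restricted to exact FFT bins here."""
    residual = np.array(x, dtype=float)
    n = len(x)
    t = np.arange(n) / fs
    components = []
    for _ in range(n_components):
        spec = np.fft.rfft(residual)
        k = np.argmax(np.abs(spec[1:])) + 1      # dominant non-DC bin
        amp = 2.0 * np.abs(spec[k]) / n
        phase = np.angle(spec[k])
        freq = k * fs / n
        # Subtract the fitted sinusoid and keep its parameters.
        residual -= amp * np.cos(2 * np.pi * freq * t + phase)
        components.append((freq, amp, phase))
    return components, residual


# Two on-bin sinusoids are recovered exactly, one per pass.
fs, n = 1000, 1000
t = np.arange(n) / fs
x = 0.8 * np.cos(2 * np.pi * 110 * t) + 0.3 * np.cos(2 * np.pi * 347 * t)
comps, res = successive_sine_approx(x, fs, 2)
```

Each pass depends on the residual left by the previous one, which is exactly why, as the article notes next, the analysis resists parallelization.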
Compared with Fourier decomposition, which analyzes a fundamental and all its harmonics (integer multiples of the fundamental), the successive-approximation method is more difficult to perform in real time, because each successive approximation must be completed before proceeding to the next one. In normal Fourier methods, all steps can be done in parallel.
"The analysis is more difficult [than Fourier], but the resulting synthesis is very fast, because we can output using the fast Fourier transform just as if we had done normal Fourier decomposition," said Smith.
In the next few years, Smith's team plans to expand the kinds of voices and vocal ranges it can successfully improve. The team also aims to extend the voice samples in its database to full-length songs. Finally, Smith wants to speed up the analysis step so that it can eventually be done in real time on a DSP. "We're working toward doing a complete vocal piece, and we also need to be able to handle a wider variety of different singers," said Smith. "We are hoping that within the next year we will have done a complete piece."
Portions of the research were funded by the National Science Foundation.