A Text-Dependent Approach to Speaker Identification
ABOUT THE AUTHOR
A. Sankaranarayanan received a Bachelor's degree in Electronics and Telecommunication Engineering from the University of Mumbai and plans to pursue graduate studies in Electrical Engineering. His area of interest is speech signal processing.


Although digital fingerprint identification and iris scanning are extremely accurate indicators of an individual's identity, speaker identification is an emerging alternative. Speaker identification systems are popular in spite of their poorer accuracy vis-à-vis the other techniques previously mentioned because they are the least expensive to build (they can be implemented on any general-purpose computer) and are also noninvasive in nature. Speaker identification systems may be classified in two categories based on their principle of operation.
• Text-dependent systems, which make use of a fixed utterance for test and training, and rely on specific features of the test utterance in order to effect a match.
• Text-independent systems, which make use of different utterances for test and training, and rely on long-term statistical characteristics of speech for making a successful identification.
Text-dependent systems require less training than text-independent systems and are capable of producing good results with a fraction of the test speech sample required by a text-independent system.
Speech-Production Model
The development of a text-dependent speaker identification system requires a thorough understanding of the nature of speech and the model of speech production. At a relatively high level, speech may be thought of as being composed of a string of phonemes (basic sound units). The English language consists of approximately 42 phonemes.
Speech is produced by the flow of air through the various articulators such as the vocal tract, lips, tongue, and nose. Air is forced out of the lungs through the trachea and the glottis, where it passes through the vocal cords. The vocal cords, if tense, vibrate like an oscillator, but if relaxed, do not vibrate and simply let the air pass through. The air stream then passes through the pharynx cavity and, depending on the position of a movable flap called the velum, exits either through the oral cavity (mouth), or the nasal cavity (nostrils). In the former case, the tongue and the teeth may modify the flow of the air stream as well. Different positions of these articulators give rise to different types of sounds. All sounds can be divided into the following broad categories.
• Voiced sounds are produced whenever the vocal cords are tensed and vibrate. Vowels ('a', 'e', 'i', 'o', and 'u') and diphthongs fall in this category of sounds. The frequency of vibration of the vocal cords is called the pitch. Moreover, the vocal-tract configuration for these sounds results in a resonant structure; the vocal-tract resonance frequencies are known as formants.
• Unvoiced sounds are produced when the vocal cords are relaxed and, therefore, do not vibrate. Fricatives (sounds such as 'shh' and 'f') and aspirated sounds (whispered speech) are examples of unvoiced sounds. Turbulent airflow occurs either at the mouth (fricatives) or at the glottis (aspirated sounds) to produce speech that exhibits a distinct lack of periodicity. The spectrum of unvoiced sounds usually lacks resonant peaks and has a broadband structure.
• Plosive sounds are produced when there is a buildup of pressure due to constriction at some point in the vocal tract, followed by a sudden release that leads to transient excitation. This may occur with or without vocal-cord excitation. Examples of plosive sounds include the 'p' in 'pin' (an unvoiced plosive) and the 'b' in 'bin' (a voiced plosive).
A powerful tool for the analysis of speech is the source-filter model (Figure 1 shows a simplified version) of human speech production. This model is an approximate representation of the excitation source and the vocal tract. Although not very accurate for some types of sounds (especially unvoiced sounds), it provides a useful way to quantify several parameters that you can use for speaker identification.
Figure 1: The source-filter model of speech production.
The model in Figure 1 assumes two sources: the switch alternates between the glottal pulse generator (for voiced sounds) and the random noise generator (for unvoiced sounds). These sources are filtered by the vocal tract (represented by the time-varying filter). The figure omits some details (such as the mouth radiation model) for simplicity.
• The glottal pulse generator represents the vibration of the vocal cords and is the active source for production of voiced sounds such as vowels. It is also known as the buzz source. The period of the impulse train generated by this source is the pitch period; its reciprocal is the fundamental frequency of the utterance. The output frequency spectrum is rich in harmonics of the fundamental frequency.
• The random noise generator is responsible for generating the random turbulence and pressure-buildup waveform for unvoiced sounds such as the fricatives. It is sometimes called a hiss source. The frequency spectrum of this source is relatively flat; this explains the broadband nature of unvoiced sounds.
• You can represent the dynamic nature of the speech articulators constituting the human vocal tract by a time-varying digital filter, labeled in Figure 1 as the vocal-tract filter model. The parameters (coefficients) associated with this filter vary over a period of about 5 to 20 milliseconds, depending on the nature of the utterance, in step with the changing configuration of the vocal tract. Since you can model the vocal tract as a tube whose shape changes with time, it exhibits resonance at specific frequencies (formants). Peaks in the frequency response of the vocal-tract filter represent these formants.
The source-filter model assumes that it is possible to separate the excitation source from the vocal-tract filter, and also assumes an all-pole (autoregressive) vocal-tract filter. These assumptions are not entirely accurate for many speech sounds. Nevertheless, this model forms a very useful basis for understanding the nature of speech production and for quantifying several parameters that characterize speech.
Speaker-Identification Features
The source-filter model discussed in the previous section provides useful parameters for identifying a speaker. One such quantity is the pitch, or fundamental frequency, of speech. Pitch varies from one individual to another; pitch frequency is high for female voices and low for male voices. This suggests that pitch might be a suitable parameter to distinguish one speaker from another, or at least to narrow down the set of probable matches.
Analysis of the frequency spectrum of the test utterance also provides valuable information about speaker identity. The spectrum contains both pitch harmonics and vocal-tract resonant peaks, making it possible to identify the speaker with a high probability of being correct.
You can also use the vocal-tract filter parameters (filter coefficients) to good effect for speaker identification, because different speakers have different vocal-tract configurations for the same utterance.
In any text-dependent speaker identification system, an important decision is the choice of test utterance. As discussed in the previous section, the source-filter model is most accurate at representing voiced sounds, such as the vowels. Vowels have a definite, consistent pitch period, and the vocal-tract configuration for vowel utterances exhibits a clear formant (resonant) structure. The frequency spectrum corresponding to vowel utterances therefore contains a wealth of information that can be used for speaker identification. The prototype speaker identification system built by the author (described later in this paper) uses the vowels ('a', 'e', 'i', 'o', and 'u') as the test utterance.
Pitch-Period Estimation
A number of algorithms exist for pitch-period estimation. The two broad categories are time-domain algorithms and frequency-domain algorithms. Time-domain algorithms attempt to determine the pitch period directly from the speech waveform (examples include the Gold-Rabiner algorithm and the autocorrelation algorithm). Frequency-domain algorithms use some form of spectral analysis to determine the pitch period (an example is the method of cepstral truncation).
Although frequency-domain algorithms may yield higher accuracy, time-domain algorithms have the advantage that they can be implemented with minimal difficulty on a general-purpose digital computer. A computationally efficient algorithm due to Gold and Rabiner makes use of several pitch estimators operating in parallel to produce pitch-period estimates that are quite reliable. A brief description of the algorithm follows.
The algorithm begins by passing the speech signal through a low-pass filter with a cutoff frequency of 600-800 Hz, which removes the higher harmonics of the pitch frequency that might interfere with accurate pitch estimation. This is acceptable, since the pitch frequency rarely rises above 500 Hz, even for a high-pitched female voice.
The filtered speech signal is processed to generate six impulse trains. These impulse trains come from the local maxima and minima of the speech waveform; their function is to retain the periodicity of the speech signal while discarding features irrelevant to the process of pitch detection. The reason for using six impulse trains is that the algorithm must function with few errors even under extreme conditions (in the presence of harmonics). In many cases, only two or three of the six impulse trains will indicate the correct pitch period—the rest will be incorrect. However, the redundancy built into the algorithm ensures that it is able to determine the fundamental frequency with a low probability of error even in these cases.
The six impulse trains are fed to six identical pitch extractors. Each pitch extractor latches on to an impulse and holds it for a blanking interval, during which subsequent impulses are ignored. After the blanking interval, the latched value begins to decay exponentially. The decay period ends when the pitch extractor encounters an impulse that is greater in amplitude than the instantaneous amplitude of the decaying value. The time period between the initial impulse latch and the end of the decay phase is the new pitch-period estimate. The current average pitch estimate is calculated as the mean of the previous average pitch estimate and the new pitch-period estimate. New values for the blanking interval and exponential-decay constant are determined empirically from the current average pitch estimate.
The final pitch-period estimate is determined from the current and previous pitch estimates (and the sums of the current and previous pitch estimates) of each of the six pitch extractors through a process of consensus. This ensures the accuracy of the algorithm.
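The full Gold-Rabiner scheme is fairly involved; for comparison, the simpler time-domain autocorrelation method mentioned earlier can be sketched in a few lines of Python. This is an illustrative sketch only (the author's system is written in C++ and uses Gold-Rabiner, not this method):

```python
import math

def estimate_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Pick the autocorrelation peak within the plausible pitch-lag
    range and convert the winning lag to a frequency in Hz."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]           # remove DC offset
    lo, hi = int(fs / fmax), int(fs / fmin) # lag search range
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, min(hi, n)):
        val = sum(x[i] * x[i - lag] for i in range(lag, n))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

# Synthetic voiced frame: 200 Hz fundamental plus one harmonic.
fs = 11025
frame = [math.sin(2 * math.pi * 200 * t / fs)
         + 0.5 * math.sin(2 * math.pi * 400 * t / fs)
         for t in range(441)]               # one 40 ms frame
pitch = estimate_pitch(frame, fs)           # close to 200 Hz
```

The low-pass prefiltering step described above serves the same purpose here: it keeps strong harmonics from creating spurious autocorrelation peaks at submultiples of the true pitch period.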
The algorithm occasionally picks the wrong pitch-period estimate; this problem manifests itself as impulsive noise that occurs randomly in the pitch-estimate array and can cause serious errors during comparison. A low-pass filter will remove these impulses, but will 'spread' or 'blur' the noise over the pitch contour. A median filter, however, produces the desired result of removing most of the impulsive noise while retaining the original pitch contour (Figure 2). For most purposes, a three- or five-point median filter is suitable for eliminating noise in the pitch estimates.
Figure 2: The low-pass (moving average) filter 'blurs' the pitch contour by spreading the impulsive noise, while the median filter removes the impulsive noise without affecting the pitch contour. Both filters were five-point.
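The median-filter post-processing step is simple enough to sketch directly (illustrative Python; the author's implementation is in C++):

```python
def median_filter(x, width=5):
    """Running median of odd width; edges are padded by repeating
    the first and last samples so the output length matches."""
    k = width // 2
    padded = [x[0]] * k + list(x) + [x[-1]] * k
    return [sorted(padded[i:i + width])[k] for i in range(len(x))]

# A pitch contour with two impulsive errors (gross doubling/halving).
contour = [100, 101, 102, 210, 103, 104, 52, 105, 106, 107]
smoothed = median_filter(contour)
# The impulses at 210 and 52 are removed; the gentle upward trend
# of the contour is left intact, unlike with a moving average.
```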
Spectral Analysis: Wavelets
Spectral analysis of speech is complicated by the fact that the speech signal is nonstationary; in other words, it has a time-varying frequency spectrum that depends on the utterance. However, the speech articulators vary relatively slowly, so it is reasonable to assume that short segments (about 10-20 milliseconds) of speech are stationary. This leads to the idea of short-time techniques, in which analysis is carried out over such spectrally invariant segments (windows) of speech. The short-time Fourier transform is one of the most popular techniques in this category. It results in a spectrogram, or time-frequency plot, which illustrates the temporal variation of the spectral components of speech.
Although popular, the short-time Fourier transform is limited by the uncertainty principle of spectral analysis, which states that the product of the uncertainty in time and the uncertainty in frequency has a finite lower bound. In other words, resolution in time and frequency cannot be increased independently of one another: an increase in time resolution (a smaller window) results in a decrease in frequency resolution (spectral leakage), and vice versa. The short-time Fourier transform uses a nominally fixed window width, with the consequence that it can only provide fixed resolution in time and frequency.
Recently, a new technique known as the wavelet transform has emerged for spectral analysis of nonstationary signals. It makes use of special time functions known as wavelets, and provides a flexibility in time-frequency resolution unobtainable with the classical short-time Fourier transform. With wavelets, it is possible to analyze a signal at several levels of resolution, capturing transient, high-frequency bursts (with coarse frequency resolution) as well as slowly varying characteristics (with fine frequency resolution). It is therefore possible to trade off frequency resolution for better time resolution (for analyzing transients) and time resolution for better frequency resolution (for analyzing slow variations), a facility not afforded by the short-time Fourier transform.
The CWT (Continuous Wavelet Transform) is given by the following equation.
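Written in its standard form, using ψ(t) for the mother wavelet, a for scale, and b for translation (all defined below), the transform is:

```latex
\mathrm{CWT}_f(a,b) \;=\; \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty}
  f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt
  \qquad \text{(Equation 1)}
```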
f(t) is the nonstationary time signal to be analyzed. The function ψ(t) is called the mother wavelet. The mother wavelet is an oscillatory function having zero mean; most of its energy is confined to a small region near the origin. The parameter a is referred to as the scale or dilation. The scale specifies the time duration or 'stretch' of the wavelet; a large value of scale implies poor time resolution and increased frequency resolution, and vice versa. The parameter b is known as the translation. The translation specifies the position of the wavelet on the time axis. Both parameters are continuous.
You can interpret the CWT given by Equation 1 as a continuous-time convolution operation. The scale parameter specifies an infinite number of impulse responses with which to convolve the signal f(t). This interpretation is equivalent to passing the signal f(t) through an (infinite) bank of analog filters, each having an impulse response specified by one value of scale (Figure 3). The filters are of the bandpass variety (this is expected, since the mother wavelet has zero mean) and have the special property that their Q-factors (center frequency to bandwidth ratio) are equal.
Figure 3: Fourier transform of a wavelet for three values of scale. Note the bandpass nature of the filters. As the center frequency increases, the bandwidth of the filter increases in proportion, keeping their ratio (the Q-factor) constant. The filters have been normalized so that the energies of their impulse responses are equal.
The CWT is of little computational value. For implementation on a digital computer, you must discretize the scale and translation parameters. The discretization is usually dyadic, meaning the scale and translation parameters are integral powers of two. This leads to a representation of the continuous-time function as a linear combination of dyadically scaled and translated wavelets, known as the DWT (Discrete Wavelet Transform). There is a further complication. Although the DWT discretizes the scale and translation parameters, it still applies to a continuous-time function. Digital computers, on the other hand, work with a discrete version of the time signal itself (obtained by sampling the continuous-time signal at the Nyquist rate).
The above considerations lead to a modified form of the DWT that digital filters can implement. Samples of the discrete-time signal are considered to be the approximation coefficients of the signal at the highest (finest) possible level of resolution (labeled the 0th level of resolution). These represent the entire digital frequency range from 0 to π radians. A process of high-pass filtering with a half-band filter and downsampling produces the detail coefficients at the next (coarser) level of resolution (the first level). The detail coefficients represent the frequency range between π/2 and π radians. Similarly, the approximation coefficients at the first level of resolution are obtained by passing the signal through a low-pass filter and downsampling the result. These coefficients contain spectral information in the range 0 to π/2 radians. Continuing in this fashion, you can use the approximation coefficients at this coarser level to generate approximation and detail coefficients at further coarser levels (levels 2, 3, ...). At each level, the spectrum of the approximation coefficients is divided in two by the low-pass and high-pass filtering operations; thus the DWT is reduced to a form of dyadic subband filtering (Figure 4 illustrates a three-level decomposition).
Figure 4: Dyadic subband configuration for a discrete-time three-level decomposition, illustrating the subbands occupied by the detail coefficients at the first, second, and third levels of resolution. The spectrum (extending from the end of the third dyadic subband down to DC) occupied by the approximation coefficients at the coarsest (third) level is not shown.
This process is carried out recursively with a bank of digital filters until the required level of frequency resolution is achieved (for a speech signal band-limited to ~6 kHz, a seven-level analysis is usually sufficient). The process of generating the approximation and detail coefficients at the kth level of resolution, given the approximation coefficients at the (k-1)st level, is summarized by the schematic of Figure 5.
Figure 5: Generation of approximation and detail coefficients at a coarser level using approximation coefficients at the next finer level of resolution.
In Figure 5, a_{k}(n) and b_{k}(n) are the approximation and detail coefficients, respectively, at resolution level k. a_{k-1}(n) are the approximation coefficients at the (k-1)st level of resolution. h(n) is the low-pass (approximation) filter and g(n) is the high-pass (detail) filter. The exact nature (impulse response) of these filters depends on the wavelet chosen.
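The single analysis step of Figure 5 can be sketched in Python. This minimal illustration uses the two-tap Haar filter pair for brevity; the author's system uses the Daubechies D2 wavelet, whose filter taps differ but whose filter-and-downsample structure is identical:

```python
import math

def analysis_step(a_fine, h, g):
    """One DWT analysis step: filter the finer-level approximation
    coefficients with the low-pass filter h and the high-pass filter
    g, then keep every second output sample (downsampling by 2)."""
    n = len(a_fine)
    def filter_and_downsample(taps):
        out = []
        for i in range(1, n, 2):                    # downsample by 2
            acc = sum(c * a_fine[i - j]
                      for j, c in enumerate(taps) if 0 <= i - j < n)
            out.append(acc)
        return out
    return filter_and_downsample(h), filter_and_downsample(g)

# Haar (two-tap) orthonormal filter pair.
s = 1 / math.sqrt(2)
approx, detail = analysis_step([4.0, 6.0, 10.0, 12.0], [s, s], [s, -s])
# The slowly varying trend lands in the approximation band, the
# sample-to-sample differences in the detail band, and the total
# energy of the coefficients equals that of the input.
```

Applying the same step to `approx` would produce the second-level coefficients, and so on down the dyadic tree of Figure 4.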
Linear Predictive Analysis
LPA (Linear Predictive Analysis) is a powerful and popular technique for estimating the vocal-tract filter coefficients (predictor coefficients) which, as already mentioned, are useful for speaker identification because different speakers have different vocal-tract configurations for the same utterance. The basic premise of LPA is that you can approximate the current sample of the speech signal (within reasonable accuracy limits) as a linear combination of past samples of speech. The difference between the predicted sample and the actual sample is known as the prediction error. You can determine a set of predictor coefficients by minimizing the mean-squared prediction error. Thus, the theory of LPA is intimately tied to the source-filter model of speech production.
The number of coefficients used to characterize the time-varying vocal-tract filter is known as the order of the predictor. As already mentioned, the filter is treated as an all-pole system, also known as an autoregressive model. This imposes certain limitations: the filter is able to accurately model only voiced sounds, and introduces significant prediction error for unvoiced sounds. Moreover, the transfer function of the filter requires zeros to accurately model nasals, a facility the autoregressive model does not afford. In spite of these limitations, autoregressive LPA provides a sufficiently accurate model for speaker identification, especially if the test utterance comprises vowels.
The vocal-tract filter is a time-varying system, so a new set of predictor coefficients must be evaluated once every 10-20 milliseconds. The LPA algorithm typically sections the speech signal into windows of length 10-20 milliseconds, with an overlap of about 5-10 milliseconds. A set of linear equations (p equations, where p is the predictor order) results from minimizing the mean-squared error between the predicted and actual samples within the window. You can solve this set of equations using one of two techniques: the autocorrelation method or the covariance method. Although the latter results in faster convergence, the former guarantees a stable predictor and is more often used. The matrix form of these equations for the autocorrelation method is given by Equation 2.
In Equation 2, R(k) represents the short-time autocorrelation function of the speech signal, and (a_{1}, a_{2}, ..., a_{p}) represent the p predictor coefficients. The solution of this set of linear equations can be found using ordinary matrix inversion, but a computationally efficient iterative solution due to Levinson and Durbin is often employed. This algorithm exploits the special structure of the autocorrelation matrix in Equation 2 (the matrix is symmetric and has equal elements along each diagonal; such a matrix is said to possess the Toeplitz property).
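Equation 2 has the familiar normal-equation form R a = r, where R is the p-by-p Toeplitz matrix with entries R(|i-j|) and r = [R(1), ..., R(p)]^T. The Levinson-Durbin recursion that solves it can be sketched as follows (illustrative Python, not the author's C++ implementation):

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor
    coefficients. r[0..order] are the short-time autocorrelation
    values R(0)..R(p); returns (coefficients, residual energy)."""
    a = [0.0] * (order + 1)
    err = r[0]                              # zeroth-order error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]  # update lower coefficients
        a = new_a
        err *= 1.0 - k * k                  # shrink the error energy
    return a[1:], err

# Autocorrelation of an ideal first-order (AR(1)) source with
# coefficient 0.9: R(k) = 0.9**k.
coeffs, err = levinson_durbin([1.0, 0.9, 0.81], 2)
# coeffs recovers the generating coefficient: approximately [0.9, 0.0]
```

Each iteration raises the predictor order by one in O(p) operations, giving O(p^2) overall instead of the O(p^3) of general matrix inversion; the guaranteed |k| < 1 for valid autocorrelation data is what makes the resulting predictor stable.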
You can obtain a reasonably accurate estimate of the vocal-tract filter using a tenth- or twelfth-order predictor. The transfer function and frequency response of the vocal-tract filter can be easily determined once the predictor coefficients have been evaluated. Figure 6 shows the vocal-tract response for a 20-millisecond frame of the voiced utterance 'a' for two speakers. The spectrum is smooth and shows no harmonic ripple due to pitch. A clear formant structure is visible; the locations as well as the amplitudes of the formants differ between the two speakers, demonstrating the effectiveness of LPA for speaker identification.
Figure 6: Vocal-tract filter responses of two speakers uttering the voiced sound 'a'. A twelfth-order predictor was used to capture the vocal-tract resonant peaks during a 20-millisecond stationary period. Note the complete absence of pitch harmonics in the spectra and the clear formant structure (three formants). Also note the differences in amplitude and location of the formants for the two speakers.
Distance Metrics
During the training phase, the features described in the previous sections must be extracted from the training utterance and stored in a database (the collection of extracted features will henceforth be referred to as a profile). The test phase involves creation of a profile from the test utterance (which is the same as the training utterance in a text-dependent speaker-identification system) and comparison of this profile with those stored in the database. The profile in the database that is 'closest' to the test profile (subject to some independent threshold) is then declared a match. The measure of 'closeness' between two profiles is provided by suitable distance metrics. Different features within the profile may use different distance metrics.
The squared-Euclidean distance is eminently suitable for computing the distance between the pitch estimates of the two profiles. The squared-Euclidean distance between two N-dimensional pitch vectors {a_{1}, a_{2}, ..., a_{N}} and {b_{1}, b_{2}, ..., b_{N}} is given by Equation 3.
Pitch vectors extracted from speech will almost certainly be of different lengths, so the larger vector must be truncated to the size of the smaller one before Equation 3 is applied. The distance is also usually normalized to account for variability in pitch-vector length.
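Equation 3 is the sum of squared component-wise differences, D = sum of (a_i - b_i)^2. With the truncation and normalization just described, it can be sketched as follows (the exact normalization used by the author is an assumption here):

```python
def pitch_distance(p1, p2):
    """Normalized squared-Euclidean distance between pitch vectors.
    The longer vector is truncated to the shorter one's length (zip
    stops at the shorter input), and the sum is divided by that
    length so vectors of different sizes yield comparable values."""
    n = min(len(p1), len(p2))
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / n

# Pitch contours in Hz from a matching and a non-matching trial.
same_speaker = pitch_distance([120, 122, 121], [121, 122, 120])
different = pitch_distance([120, 122, 121], [210, 214, 208, 211])
# same_speaker is tiny; different is thousands of times larger
```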
The DWT coefficients contain spectral information in dyadic subbands whose location and extent depend on the level of resolution. One possible method for comparing two sets of DWT coefficients follows. For both DWTs, the fraction of normalized (per-sample) energy in each scale is evaluated, and the ratio of the corresponding fractional energies is taken (for similar DWTs, this ratio should be close to unity; it is inverted if less than unity). These ratios are weighted by a decreasing function of the form a^{n}, where 0.92 ≤ a ≤ 0.96. This is because ratios of fractional energies at higher scales are in greater error due to the smaller number of samples; assigning lower weights to these scales reduces the error in the final distance measure. The logarithm of each weighted ratio is then accumulated. For DWTs of two utterances by the same speaker, this distance is close to zero.
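One reading of this scheme in Python (a sketch under stated assumptions: the per-scale energies are taken as given, and "weighting" is interpreted as multiplying each log-ratio by a^n, which is the same as raising the folded ratio to the power a^n before taking its logarithm):

```python
import math

def dwt_band_distance(bands1, bands2, a=0.94):
    """Distance between two DWTs from per-scale fractional energies.
    bands1/bands2: per-sample energy in each scale, coarsest last.
    Ratios at coarser scales get geometrically smaller weights a**n
    because they are computed from fewer samples."""
    e1, e2 = sum(bands1), sum(bands2)
    d = 0.0
    for n, (x, y) in enumerate(zip(bands1, bands2)):
        r = (x / e1) / (y / e2)   # ratio of fractional energies
        if r < 1.0:
            r = 1.0 / r           # fold ratios below unity
        d += (a ** n) * math.log(r)
    return d

# Proportional energy profiles (same spectral shape, different
# overall level) give a distance of zero, as the text describes.
identical_shape = dwt_band_distance([4.0, 2.0, 1.0], [8.0, 4.0, 2.0])
```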
LPA provides only an approximate estimate of the vocal-tract frequency response. Due to noise, as well as the inexactness of the linear prediction model, the predictor coefficients obtained from two speech samples of the same utterance by the same individual will vary. The Itakura distance provides an estimate of the distance between two sets of linear predictor coefficients. The mathematical expression for this distance metric is given by Equation 4.
In Equation 4, a and â are the two predictor-coefficient vectors being compared. R is the autocorrelation matrix corresponding to the profile stored in the database (see Equation 2). This distance metric is accumulated for each frame of speech (after an initial adjustment to make the numbers of LPA frames equal). The final distance may be normalized to account for speech-rate variability.
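The Itakura distance is conventionally written as d(a, â) = log[(â R â^T) / (a R a^T)], where each coefficient vector is augmented with a leading 1 (error-filter form) so that the quadratic form gives a prediction-residual energy. Assuming Equation 4 takes this standard form, a sketch:

```python
import math

def itakura_distance(a_ref, a_test, R):
    """Itakura distance between two predictor-coefficient vectors.
    Vectors are in error-filter form [1, -a1, ..., -ap], and R is the
    autocorrelation matrix of the stored (reference) profile, so the
    quadratic form v R v^T is a prediction-residual energy."""
    def residual_energy(v):
        return sum(v[i] * R[i][j] * v[j]
                   for i in range(len(v)) for j in range(len(v)))
    return math.log(residual_energy(a_test) / residual_energy(a_ref))

# Toy first-order case: R built from R(0) = 1, R(1) = 0.9, for which
# the optimal (reference) predictor coefficient is 0.9.
R = [[1.0, 0.9], [0.9, 1.0]]
match = itakura_distance([1.0, -0.9], [1.0, -0.9], R)     # 0.0
mismatch = itakura_distance([1.0, -0.9], [1.0, -0.5], R)  # > 0
```

Because the reference coefficients minimize the residual energy under R, the ratio inside the logarithm is at least 1, so the distance is non-negative and zero only for identical coefficients.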
The final distance between two profiles is a weighted sum of the three distance metrics previously discussed. Weighting is necessary, since not all features are equally effective at identifying a speaker. The pitch estimates of two individuals may be similar, in which case the squared-Euclidean distance would be small. By contrast, the DWT and LPA coefficients are much better at identifying a speaker, yielding relatively small distances for a match and large distances for a mismatch.
Performance Criteria
The performance of a speakeridentification system is described in terms of three parameters:
• A false acceptance occurs when the system incorrectly identifies an unregistered individual as an enrolled one, or when one registered individual is mistaken for another. The FAR (False Acceptance Ratio) is the ratio of the number of false acceptances to the total number of trials. You can reduce the FAR by setting a strict (low) threshold.
• A false rejection occurs when the system incorrectly refuses to identify an individual who is registered with the system. The FRR (False Rejection Ratio) is the ratio of the number of false rejections to the total number of trials. You can minimize the FRR by setting the threshold to a liberal (high) value.
• The equal error rate is defined as the error rate offered by the system when the FAR and FRR are made equal to each other. You can obtain the equal error rate by plotting the FAR and FRR curves over a range of threshold values and finding their point of intersection.
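These definitions can be illustrated with a short sketch (the distance values below are hypothetical; the prototype's actual thresholds and scores are not reproduced here):

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor distances accepted (below threshold).
    FRR: fraction of genuine distances rejected (at or above it)."""
    far = sum(d < threshold for d in impostor) / len(impostor)
    frr = sum(d >= threshold for d in genuine) / len(genuine)
    return far, frr

# Hypothetical profile distances from matching and non-matching trials.
genuine = [0.5, 0.8, 1.1, 1.4, 2.0]
impostor = [1.8, 2.5, 3.0, 3.6, 4.2]
for t in (1.0, 2.0, 3.0):      # strict, normal, liberal thresholds
    far, frr = far_frr(genuine, impostor, t)
    # A strict (low) t gives low FAR and high FRR; a liberal (high)
    # t gives the reverse. The threshold where the two curves cross
    # yields the equal error rate.
```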
The requirements for low FAR and FRR are seen to be conflicting, and both parameters cannot be simultaneously lowered. However, a low FAR is vital for good speaker identification systems (otherwise security of the system would be jeopardized), and most systems are biased for good FAR performance at the expense of FRR.
Prototype System
The author has developed a small-scale prototype speaker identification system based on the principles described in the previous sections of this paper. The entire system has been developed using object-oriented concepts in the C++ language. An important design objective was to ensure a modular and highly portable system.
The prototype system uses a fixed training and test utterance comprising the English vowels ('a', 'e', 'i', 'o', and 'u') for the reasons discussed earlier. A sampling rate of 11,025 Hz is used, limiting the maximum analog frequency to ~5.5 kHz, which is sufficient to preserve all required information.
In the training phase, feature-extraction algorithms are used to create a profile from the speech sample. The Gold-Rabiner algorithm is used to estimate pitch; pitch post-processing makes use of a five-point median filter. Extraction of spectral information is accomplished using a seven-level DWT, yielding a peak frequency resolution of ~40 Hz at the lower end of the spectrum. The DWT makes use of a filter bank corresponding to the Daubechies (D2) wavelet. LPA is performed on the speech signal after first-order pre-emphasis (high-pass filtering) to compensate for the 6 dB/octave rolloff characteristic of the vocal tract. A twelfth-order predictor is used. Profiles thus created are stored in a local disk database.
In the test phase, the same features are used to create a profile from the test utterance. The test profile is then compared with the profiles in the database. The profiles in the database are indexed on overall average pitch, and a modified binary-search algorithm is used to retrieve the profiles more efficiently than a sequential search. The profile in the database that yields the smallest distance to the test profile is chosen (subject to an independent threshold) as the match. The system is adaptive; in other words, it is capable of tracking slight changes in speech patterns over multiple test utterances. A successful match causes the profile in the database to be updated upon request.
The system was tested with a group of fifteen speakers consisting of nine males and six females. Ten of the fifteen speakers were enrolled in the database. Three values of threshold (STRICT, NORMAL, and LIBERAL) were used to evaluate the performance of the system. Three trials were conducted for every individual for each value of threshold. The system performance characteristics, FAR and FRR, were determined for each threshold. The point of intersection of the FAR and FRR curves yielded the equal error rate. The system was found to yield a very low error rate (FAR and FRR) for registered individuals. The error rate (FAR) was, however, quite considerable for individuals not registered with the system. Tests also indicated that the system was resistant to minor changes in the utterance rate and intonation.
Acknowledgements