McClellan, S., Gibson, J.D., Ephraim, Y., Fussell, J.W., Wilcox, L.D., Bush, M.A., Gao, Y., Ramabhadran, B., Picheny, M. “Speech Signal Processing” The Electrical Engineering Handbook Ed. Richard C. Dorf Boca Raton: CRC Press LLC, 2000
© 2000 by CRC Press LLC

15 Speech Signal Processing

Stan McClellan, University of Alabama at Birmingham
Jerry D. Gibson, Texas A&M University
Yariv Ephraim, AT&T Bell Laboratories, George Mason University
Jesse W. Fussell, Department of Defense
Lynn D. Wilcox, FX Palo Alto Lab
Marcia A. Bush, Xerox Palo Alto Research Center
Yuqing Gao, IBM T.J. Watson Research Center
Bhuvana Ramabhadran, IBM T.J. Watson Research Center
Michael Picheny, IBM T.J. Watson Research Center

15.1 Coding, Transmission, and Storage
General Approaches • Model Adaptation • Analysis-by-Synthesis • Particular Implementations • Speech Quality and Intelligibility • Standardization • Variable Rate Coding • Summary and Conclusions
15.2 Speech Enhancement and Noise Reduction
Models and Performance Measures • Signal Estimation • Source Coding • Signal Classification • Comments
15.3 Analysis and Synthesis
Analysis of Excitation • Fourier Analysis • Linear Predictive Analysis • Homomorphic (Cepstral) Analysis • Speech Synthesis
15.4 Speech Recognition
Speech Recognition System Architecture • Signal Pre-Processing • Dynamic Time Warping • Hidden Markov Models • State-of-the-Art Recognition Systems
15.5 Large Vocabulary Continuous Speech Recognition
Overview of a Speech Recognition System • Hidden Markov Models As Acoustic Models for Speech Recognition • Speaker Adaptation • Modeling Context in Continuous Speech • Language Modeling • Hypothesis Search • State-of-the-Art Systems • Challenges in Speech Recognition • Applications

15.1 Coding, Transmission, and Storage

Stan McClellan and Jerry D. Gibson

Interest in speech coding is motivated by a wide range of applications, including commercial telephony, digital cellular mobile radio, military communications, voice mail, speech storage, and future personal communications networks. The goal of speech coding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application. At higher bit rates, such as 64 and 32 kbits/s, achieving good quality and intelligibility is not too difficult, but as the desired bit rate is lowered to 16 kbits/s and below, the problem becomes increasingly challenging. Depending on the application, many difficult constraints must be considered, including the issue of complexity.

For example, for the 32-kbits/s speech coding standard, the ITU-T¹ not only required highly intelligible, high-quality speech, but the coder also had to have low delay, withstand independent bit error rates up to $10^{-2}$, have acceptable performance degradation for several synchronous or asynchronous tandem connections, and pass some voiceband modem signals. Other applications may have different criteria. Digital cellular mobile radio in the U.S. has no low delay or voiceband modem signal requirements, but the speech data rates required are under 8 kbits/s and the transmission medium (or channel) can be very noisy and have relatively long fades. These considerations affect the speech coder chosen for a particular application.

As speech coder data rates drop to 16 kbits/s and below, perceptual criteria taking into account human auditory response begin to play a prominent role. For time-domain coders, the perceptual effects are incorporated using a frequency-weighted error criterion. The frequency-domain coders include perceptual effects by allocating

¹ International Telecommunications Union, Telecommunications Standardization Sector, formerly the CCITT.
The focus of this article is the contrast among the three most important classes of speech coders that have representative implementations in several international standards: time-domain coders, frequency-domain coders, and hybrid coders. In the following, we define these classifications, look specifically at the important characteristics of representative, general implementations of each class, and briefly discuss the rapidly changing national and international standardization efforts related to speech coding.

General Approaches

Time Domain Coders and Linear Prediction

Linear Predictive Coding (LPC) is a modeling technique that has seen widespread application among time-domain speech coders, largely because it is computationally simple and applicable to the mechanisms involved in speech production. In LPC, general spectral characteristics are described by a parametric model based on estimates of autocorrelations or autocovariances. The model of choice for speech is the all-pole or autoregressive (AR) model. This model is particularly suited for voiced speech because the vocal tract can be well modeled by an all-pole transfer function. In this case, the estimated LPC model parameters correspond to an AR process which can produce waveforms very similar to the original speech segment.

Differential Pulse Code Modulation (DPCM) coders (e.g., ITU-T G.721 ADPCM [CCITT, 1984]) and LPC vocoders (e.g., U.S. Federal Standard 1015 [National Communications System, 1984]) are examples of this class of time-domain predictive architecture. Code Excited Coders (e.g., ITU-T G.728 [Chen, 1990] and U.S. Federal Standard 1016 [National Communications System, 1991]) also utilize LPC spectral modeling techniques.¹

Based on the general spectral model, a predictive coder formulates an estimate of a future sample of speech based on a weighted combination of the immediately preceding samples.
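As a rough illustration of how an all-pole model can be fitted from autocorrelation estimates, the sketch below uses the Levinson-Durbin recursion. The text does not name a solution method, so the choice of recursion, the function names, and the synthetic AR(2) test signal are all assumptions made for this example.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation estimates r[0..max_lag] for one block of samples."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Returns (a, err), where a[i-1] holds a_i in
    s_hat(k) = sum_i a_i * s(k - i), and err is the remaining
    prediction-error energy.
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            a.append(0.0)  # nothing left to predict in this block
            continue
        # Correlation not yet explained by the order-(m-1) predictor.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err  # reflection coefficient of stage m
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


# Hypothetical test signal: an AR(2) process standing in for a speech block.
rng = random.Random(0)
s = [0.0, 0.0]
for _ in range(2000):
    s.append(1.3 * s[-1] - 0.6 * s[-2] + rng.gauss(0.0, 1.0))

a_est, err_est = levinson_durbin(autocorrelation(s, 2), 2)
print("estimated predictor coefficients:", a_est)
```

For a signal that really is autoregressive, the estimated coefficients should land close to the generating ones (1.3 and -0.6 here). Real speech would be analyzed block by block, normally after windowing.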
¹ However, codebook excitation is generally described as a hybrid coding technique.

FIGURE 15.1 Differential encoder transmitter with a pole-zero predictor.

The error in this estimate (the prediction residual) typically comprises a significant portion of the data stream of the encoded speech. The residual contains information that is important in speech perception and cannot be modeled in a straightforward fashion. The most familiar form of predictive coder is the classical Differential Pulse Code Modulation (DPCM) system shown in Fig. 15.1. In DPCM, the predicted value at time instant $k$, $\hat{s}(k|k-1)$, is subtracted from the input signal at time $k$, $s(k)$, to produce the prediction error signal $e(k)$. The prediction error is then approximated (quantized) and the quantized prediction error, $e_q(k)$, is coded (represented as a binary number) for transmission to the receiver. Simultaneously with the coding, $e_q(k)$ is summed with $\hat{s}(k|k-1)$ to yield a reconstructed version of the input sample, $\hat{s}(k)$. Assuming no channel errors, an identical reconstruction, distorted only by the effects of quantization, is accomplished at the receiver. At both the transmitter and receiver, the predicted value at time instant $k+1$ is derived using reconstructed values up through time $k$, and the procedure is repeated. The first DPCM systems had $\hat{B}(z) = 0$ and $\hat{A}(z) = \sum_{i=1}^{N} a_i z^{-i}$, where $\{a_i, i = 1, \ldots, N\}$ are the LPC coefficients and $z^{-1}$ represents unit delay, so that the predicted value was a weighted linear combination of previous reconstructed values, or
$$\hat{s}(k|k-1) = \sum_{i=1}^{N} a_i\,\hat{s}(k-i) \tag{15.1}$$

Later work showed that letting $\hat{B}(z) = \sum_{j=1}^{M} b_j z^{-j}$ improves the perceived quality of the reconstructed speech¹ by shaping the spectrum of the quantization noise to match the speech spectrum, as well as improving noisy-channel performance [Gibson, 1984]. To produce high-quality, highly intelligible speech, it is necessary that the quantizer and predictor parameters be adaptive to compensate for nonstationarities in the speech waveform.

Frequency Domain Coders

Coders that rely on spectral decomposition often use the usual set of sinusoidal basis functions from signal theory to represent the specific short-time spectral content of a segment of speech. In this case, the approximated signal consists of a linear combination of sinusoids with specified amplitudes and arguments (frequency, phase). For compactness, a countable subset of harmonically related sinusoids may be used. The two most prominent types of frequency domain coders are subband coders and multi-band coders.

Subband coders digitally filter the speech into nonoverlapping (as nearly as possible) frequency bands. After filtering, each band is decimated (effectively sampled at a lower rate) and coded separately using PCM, DPCM, or some other method. At the receiver, the bands are decoded, upsampled, and summed to reconstruct the speech. By allocating a different number of bits per sample to the subbands, the perceptually more important frequency bands can be coded with greater accuracy. The design and implementation of subband coders and the speech quality produced have been greatly improved by the development of digital filters called quadrature mirror filters (QMFs) [Johnston, 1980] and polyphase filters. These filters allow subband overlap at the encoder, which causes aliasing, but the reconstruction filters at the receiver can be chosen to eliminate the aliasing if quantization errors are small.
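The subband pipeline described above (filter, decimate, code each band, upsample, sum) can be sketched with the shortest possible quadrature mirror pair. The two-tap filters, the uniform quantizer, and all function names below are assumptions chosen for brevity; practical QMF banks such as those of [Johnston, 1980] use much longer filters.

```python
import math

C = 1.0 / math.sqrt(2.0)
H0 = [C, C]     # analysis low-pass
H1 = [C, -C]    # analysis high-pass, h1[n] = (-1)^n * h0[n]
G0 = [C, C]     # synthesis low-pass,  G0(z) = H0(z)
G1 = [-C, C]    # synthesis high-pass, G1(z) = -H0(-z)


def fir(x, h):
    """Causal FIR filter with zero initial state; output has len(x) samples."""
    return [sum(h[j] * x[n - j] for j in range(len(h)) if n >= j)
            for n in range(len(x))]


def upsample2(v):
    """Insert a zero after every sample."""
    out = [0.0] * (2 * len(v))
    out[0::2] = v
    return out


def quantize(v, step):
    """Uniform quantizer; step == 0 means no quantization."""
    return list(v) if step == 0 else [step * round(s / step) for s in v]


def subband_codec(x, low_step=0.0, high_step=0.0):
    """Split x (even length) into two bands, quantize each, and reconstruct."""
    low = quantize(fir(x, H0)[0::2], low_step)     # filter, decimate, code
    high = quantize(fir(x, H1)[0::2], high_step)
    y_low = fir(upsample2(low), G0)                # upsample, interpolate
    y_high = fir(upsample2(high), G1)
    return [p + q for p, q in zip(y_low, y_high)]  # sum the bands


x = [math.sin(0.3 * n) + 0.5 * math.sin(2.5 * n) for n in range(64)]
y_exact = subband_codec(x)                          # aliasing cancels
y_coded = subband_codec(x, low_step=0.01, high_step=0.2)
```

Without quantization the output is the input delayed by one sample, which shows the aliasing introduced by decimation cancelling in the synthesis stage. Giving the low band a finer step than the high band is a toy version of the unequal bit allocation described in the text.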
Multi-band coders perform a similar function by characterizing the contributions of individual sinusoidal components to the short-term speech spectrum. These parameters are then quantized, coded, transmitted, and used to configure a bank of tuned oscillators at the receiver. Outputs of the oscillators are mixed in proportion to the distribution of spectral energy present in the original waveform. An important requirement of multi-band coders is a capability to precisely determine perceptually significant spectral components and track the evolution of their energy and phase. Recent developments related to multi-band coding emphasize the use of harmonically related components with carefully intermixed spectral regions of bandlimited white noise. Sinusoidal Transform Coders (STC) and Multi-Band Excitation coders (MBE) are examples of this type of frequency domain coders.

Model Adaptation

Adaptation algorithms for coder predictor or quantizer parameters can be loosely grouped based on the signals that are used as the basis for adaptation. Generally, forward adaptive coder elements analyze the input speech (or a filtered version of it) to characterize predictor coefficients, spectral components, or quantizer parameters in a blockwise fashion. Backward adaptive coder elements analyze a reconstructed signal, which contains quantization noise, to adjust coder parameters in a sequential fashion. Forward adaptive coder elements can produce a more efficient model of speech signal characteristics, but introduce delay into the coder’s operation due to buffering of the signal. Backward adaptive coder elements do not introduce delay, but produce signal models that have lower fidelity with respect to the original speech due to the dependence on the noisy reconstructed signal.

Most low-rate coders rely on some form of forward adaptation. This requires moderate to high delay in processing for accuracy of parameter estimation (autocorrelations/autocovariances for LPC-based coders, sinusoidal resolution for frequency-domain coders). The allowance of significant delay for many coder architectures has enabled a spectrally matched pre- or post-processing step to reduce apparent quantization noise and provide significant perceptual improvements. Perceptual enhancements combined with analysis-by-synthesis optimization, and enabled by recent advances in high-power computing architectures such as digital signal processors, have tremendously improved speech coding results at medium and low rates.

¹ In this case, the predicted value is $\hat{s}(k|k-1) = \sum_{i=1}^{N} a_i\,\hat{s}(k-i) + \sum_{j=1}^{M} b_j\,e_q(k-j)$.
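The DPCM loop of Fig. 15.1, with the pole-zero predicted value given in the footnote above, can be sketched as follows. This is a minimal illustration with a fixed (non-adaptive) predictor and an unbounded uniform quantizer; the coefficient values, step size, and function names are assumptions, not values from any standard.

```python
import math


def predict(a, b, s_hist, e_hist):
    """Pole-zero prediction from past reconstructions and quantized errors."""
    return (sum(ai * s for ai, s in zip(a, s_hist))
            + sum(bj * e for bj, e in zip(b, e_hist)))


def dpcm_encode(samples, a, b, step):
    """Return one integer quantizer index per input sample."""
    s_hist = [0.0] * len(a)   # s_hat(k-1), s_hat(k-2), ...
    e_hist = [0.0] * len(b)   # e_q(k-1), e_q(k-2), ...
    indices = []
    for s in samples:
        pred = predict(a, b, s_hist, e_hist)
        idx = round((s - pred) / step)      # quantize e(k) = s(k) - prediction
        e_q = idx * step
        indices.append(idx)
        # Update state from reconstructed values only, so that the
        # encoder tracks exactly what the decoder will compute.
        s_hist = ([pred + e_q] + s_hist)[:len(a)]
        e_hist = ([e_q] + e_hist)[:len(b)]
    return indices


def dpcm_decode(indices, a, b, step):
    """Rebuild s_hat(k) from the received indices."""
    s_hist = [0.0] * len(a)
    e_hist = [0.0] * len(b)
    out = []
    for idx in indices:
        e_q = idx * step
        s_hat = predict(a, b, s_hist, e_hist) + e_q
        out.append(s_hat)
        s_hist = ([s_hat] + s_hist)[:len(a)]
        e_hist = ([e_q] + e_hist)[:len(b)]
    return out


signal = [math.sin(0.2 * k) for k in range(100)]
a, b, step = [0.9], [0.2], 0.05             # illustrative values only
decoded = dpcm_decode(dpcm_encode(signal, a, b, step), a, b, step)
max_err = max(abs(s - d) for s, d in zip(signal, decoded))
```

Because the predictor runs on reconstructed samples at both ends, the reconstruction error equals the quantization error of the current sample and does not accumulate. With this unbounded quantizer it stays within half a step. Setting `b = []` gives the all-pole case of Eq. (15.1).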
Analysis-by-Synthesis

A significant drawback to traditional “instantaneous” coding approaches such as DPCM lies in the perceptual or subjective relevance of the distortion measure and the signals to which it is applied. Thus, the advent of analysis-by-synthesis coding techniques marks an important milestone in the evolution of medium- to low-rate speech coding. An analysis-by-synthesis coder chooses the coder excitation by minimizing distortion between the original signal and the set of synthetic signals produced by every possible codebook excitation sequence. In contrast, time-domain predictive coders must produce an estimated prediction residual (innovations sequence) to drive the spectral shaping filter(s) of the LPC model, and the classical DPCM approach is to quantize the residual sequence directly using scalar or vector quantizers. The incorporation of frequency-weighted distortion in the optimization of analysis-by-synthesis coders is significant in that it de-emphasizes (increases the tolerance for) quantization noise surrounding spectral peaks. This effect is perceptually transparent since the ear is less sensitive to error around frequencies having higher energy [Atal and Schroeder, 1979]. This approach has resulted in significant improvements in low-rate coder performance, and recent increases in processor speed and power are crucial enabling factors for these applications. Analysis-by-synthesis coders based on linear prediction are generally described as hybrid coders since they fall between waveform coders and vocoders.

Particular Implementations

Currently, three coder architectures dominate the fields of medium and low-rate speech coding:

• Code-Excited Linear Prediction (CELP): an LPC-based technique which optimizes a vector of excitation samples (and/or pitch filter and lag parameters) using analysis-by-synthesis.
• Multi-Band Excitation (MBE): a direct spectral estimation technique which optimizes the spectral reconstruction error over a set of subbands using analysis-by-synthesis.
• Mixed-Excitation Linear Prediction (MELP): an optimized version of the traditional LPC vocoder which includes an explicit multiband model of the excitation signal.

Several realizations of these approaches have been adopted nationally and internationally as standard speech coding architectures at rates below 16 kbits/s (e.g., G.728, IMBE, and U.S. Federal Standard 1016). The success of these implementations is due to LPC-based analysis-by-synthesis with a perceptual distortion criterion or short-time frequency-domain modeling of a speech waveform or LPC residual. Additionally, the coders that operate at lower rates all benefit from forward adaptation methods which produce efficient, accurate parameter estimates.

CELP

The general CELP architecture is described as a blockwise analysis-by-synthesis selection of an LPC excitation sequence. In low-rate CELP coders, a forward-adaptive linear predictive analysis is performed at 20 to 30 msec intervals. The gross spectral characterization is used to reconstruct, via linear prediction, candidate speech segments derived from a constrained set of plausible filter excitations (the “codebook”). The excitation vector that produces the synthetic speech segment with smallest perceptually weighted distortion (with respect to the original speech) is chosen for transmission. Typically, the excitation vector is optimized more often than the LPC spectral model. The use of vectors rather than scalars for the excitation is significant in bit-rate reduction. The use of perceptual weighting in the CELP reconstruction stage and analysis-by-synthesis optimization of the dominant low-frequency (pitch) component are key concepts in maintaining good quality encoded speech at lower rates.
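The codebook selection just described can be sketched as an exhaustive search. Several things in this example are assumptions rather than details from the text: the weighting filter takes the commonly used form W(z) = A(z)/A(z/γ), filter memories are reset for every vector, the pitch (long-term) predictor is omitted, and the codebook is random. All names are hypothetical.

```python
import random


def synth(x, a):
    """All-pole filter 1/A(z), A(z) = 1 - sum_i a_i z^-i, zero initial state."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + sum(a[i] * y[n - 1 - i]
                          for i in range(len(a)) if n - 1 - i >= 0))
    return y


def inverse(x, a):
    """All-zero filter A(z): the prediction residual of x."""
    return [x[n] - sum(a[i] * x[n - 1 - i]
                       for i in range(len(a)) if n - 1 - i >= 0)
            for n in range(len(x))]


def weight(x, a, gamma):
    """Assumed perceptual weighting W(z) = A(z) / A(z/gamma)."""
    a_gamma = [ai * gamma ** (i + 1) for i, ai in enumerate(a)]
    return synth(inverse(x, a), a_gamma)


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def search_codebook(target, codebook, a, gamma=0.8):
    """Return (index, gain) minimizing the weighted squared error."""
    wt = weight(target, a, gamma)
    best = None
    for idx, code in enumerate(codebook):
        wy = weight(synth(code, a), a, gamma)    # synthesize, then weight
        energy = dot(wy, wy)
        if energy == 0.0:
            continue
        gain = dot(wt, wy) / energy              # optimal gain for this entry
        err = dot(wt, wt) - gain * dot(wt, wy)   # resulting weighted error
        if best is None or err < best[0]:
            best = (err, idx, gain)
    return best[1], best[2]


rng = random.Random(1)
a = [1.3, -0.6]                                  # illustrative LPC coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(16)]

# Target built from entry 3 with gain 2, so the search should recover both.
target = synth([2.0 * c for c in codebook[3]], a)
best_index, best_gain = search_codebook(target, codebook, a)
```

Only the index and gain are transmitted, so the cost per block is a few bits regardless of vector length. This is the bit-rate advantage of vector excitation noted in the text.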
CELP-based speech coders are the predominant coding methodologies for rates between 4 kbits/s and 16 kbits/s due to their excellent subjective performance. Some of the most notable are detailed below.

• ITU-T Recommendation G.728 (LD-CELP) [Chen, 1990] is a low-delay, backward adaptive CELP coder. In G.728, a low algorithmic delay (less than 2.5 msec) is achieved by using 1024 candidate excitation sequences, each only 5 samples long. A 50th-order LPC spectral model is used, and the coefficients are backward-adapted based on the transmitted excitation.
• The speech coder standardized by the CTIA for use in the U.S. (time-division multiple-access) 8 kbits/s digital cellular radio systems is called vector sum excited linear prediction (VSELP) [Gerson and Jasiuk, 1990]. VSELP is a forward-adaptive form of CELP where two excitation codebooks are used to reduce the complexity of encoding.
• Other approaches to complexity reduction in CELP coders are related to “sparse” codebook entries which have few nonzero samples per vector and “algebraic” codebooks which are based on integer lattices [Adoul and Lamblin, 1987]. In this case, excitation code vectors can be constructed on an as-needed basis instead of being stored in a table. ITU-T standardization of a CELP algorithm which uses lattice-based excitations has resulted in the 8 kbps G.729 (ACELP) coder.
• U.S. Federal Standard 1016 [National Communications System, 1991] is a 4.8 kbps CELP coder. It has both long- and short-term linear predictors which are forward adaptive, and so the coder has a relatively large delay (100 msec). This coder produces highly intelligible, good-quality speech in a variety of environments and is robust to independent bit errors.

Below about 4 kbps, the subjective quality of CELP coders is inferior to other architectures. Much research on variable-rate CELP implementations has resulted in alternative coder architectures which adjust their coding rates based on a number of channel conditions or sophisticated, speech-specific cues such as phonetic segmentation [Wang and Gersho, 1989; Paksoy et al., 1993]. Notably, most variable-rate CELP coders are implementations of finite-state CELP wherein a vector of speech cues controls the evolution of a state-machine to prescribe mode-dependent bit allocations for coder parameters. With these architectures, excellent speech quality at average rates below 2 kbps has been reported.

MBE

The MBE coder [Hardwick and Lim, 1991] is an efficient frequency-domain architecture partially based on the concepts of sinusoidal transform coding (STC) [McAulay and Quatieri, 1986]. In MBE, the instantaneous spectral envelope is represented explicitly by harmonic estimates in several subbands.
The performance of mBE coders at rates below 4 kbps is generally"better"than that of CELP-based schemes An MBE coder decomposes the instantaneous speech spectrum into subbands centered at harmonics of the fundamental glottal excitation(pitch). The spectral envelope of the signal is approximated by samples taken at pitch harmonics, and these harmonic amplitudes are compared to adaptive thresholds(which may be determined via analysis-by-synthesis)to determine subbands of high spectral activity. Subbands that are determined to be voiced"are labeled, and their energies and phases are encoded for transmission. Subbands having relatively low spectral activity are declared"unvoiced". These segments are approximated by an appropriately filtered segment of white noise, or a locally dense collection of sinusoids with random phase. Careful tracking of the evolution of individual spectral peaks and phases in successive frames is critical in the implementation of MBE-style coders An efficient implementation of an MBE coder was adopted for the International Maritime Satellite(INMar SAT) voice processor, and is known as Improved-MBE, or IMBE [ Hardwick and Lim, 1991]. This coder optimizes several components of the general MBE architecture, including grouping neighboring harmonics for subband voicing decisions, using non-integer pitch resolution for higher speaker fidelity, and differentially encoding the log-amplitudes of voiced harmonics using a DCT-based scheme. The IMBe coder requires high delay (about 80 msec), but produces very good quality encoded speech MELP The MELP coder [McCree and Barnwell, 1995] is based on the traditional LPC vocoder model where an LPC synthesis filter is excited by an impulse train( voiced speech) or white noise(unvoiced speech). The MELP excitation, however, has characteristics that are more similar to natural human speech. In particular, the MELP excitation can be a mixture of(possibly aperiodic) pulses with bandlimited noise. 
In MELP, the excitation spectrum is explicitly modeled using Fourier series coefficients and bandpass voicing strengths, and the time- domain excitation sequence is produced from the spectral model via an inverse transform. The synthetic xcitation sequence is then used to drive an LPC synthesizer which introduces formant spectral shaping Common thread In addition to the use of analysis-by-synthesis techniques and/or LPC modeling, a common thread between low-rate, forward adaptive CELP, MBE, and mElP coders is the dependence on an estimate of the fundamental glottal frequency, or pitch period. CELP coders typically employ a pitch or long-term predictor to characterize c 2000 by CRC Press LLC
1990]. VSELP is a forward-adaptive form of CELP where two excitation codebooks are used to reduce the complexity of encoding.
• Other approaches to complexity reduction in CELP coders are related to “sparse” codebook entries which have few nonzero samples per vector and “algebraic” codebooks which are based on integer lattices [Adoul and Lamblin, 1987]. In this case, excitation code vectors can be constructed on an as-needed basis instead of being stored in a table. ITU-T standardization of a CELP algorithm which uses lattice-based excitations has resulted in the 8 kbps G.729 (ACELP) coder.
• U.S. Federal Standard 1016 [National Communications System, 1991] is a 4.8 kbps CELP coder. It has both long- and short-term linear predictors which are forward adaptive, and so the coder has a relatively large delay (100 msec). This coder produces highly intelligible, good-quality speech in a variety of environments and is robust to independent bit errors.

Below about 4 kbps, the subjective quality of CELP coders is inferior to other architectures. Much research in variable-rate CELP implementations has resulted in alternative coder architectures which adjust their coding rates based on a number of channel conditions or sophisticated, speech-specific cues such as phonetic segmentation [Wang and Gersho, 1989; Paksoy et al., 1993]. Notably, most variable-rate CELP coders are implementations of finite-state CELP wherein a vector of speech cues controls the evolution of a state machine to prescribe mode-dependent bit allocations for coder parameters. With these architectures, excellent speech quality at average rates below 2 kbps has been reported.

MBE
The MBE coder [Hardwick and Lim, 1991] is an efficient frequency-domain architecture partially based on the concepts of sinusoidal transform coding (STC) [McAulay and Quatieri, 1986]. In MBE, the instantaneous spectral envelope is represented explicitly by harmonic estimates in several subbands.
The performance of MBE coders at rates below 4 kbps is generally “better” than that of CELP-based schemes.

An MBE coder decomposes the instantaneous speech spectrum into subbands centered at harmonics of the fundamental glottal excitation (pitch). The spectral envelope of the signal is approximated by samples taken at pitch harmonics, and these harmonic amplitudes are compared to adaptive thresholds (which may be determined via analysis-by-synthesis) to determine subbands of high spectral activity. Subbands that are determined to be “voiced” are labeled, and their energies and phases are encoded for transmission. Subbands having relatively low spectral activity are declared “unvoiced”. These segments are approximated by an appropriately filtered segment of white noise, or a locally dense collection of sinusoids with random phase. Careful tracking of the evolution of individual spectral peaks and phases in successive frames is critical in the implementation of MBE-style coders.

An efficient implementation of an MBE coder was adopted for the International Maritime Satellite (INMARSAT) voice processor, and is known as Improved-MBE, or IMBE [Hardwick and Lim, 1991]. This coder optimizes several components of the general MBE architecture, including grouping neighboring harmonics for subband voicing decisions, using non-integer pitch resolution for higher speaker fidelity, and differentially encoding the log-amplitudes of voiced harmonics using a DCT-based scheme. The IMBE coder requires high delay (about 80 msec), but produces very good quality encoded speech.

MELP
The MELP coder [McCree and Barnwell, 1995] is based on the traditional LPC vocoder model where an LPC synthesis filter is excited by an impulse train (voiced speech) or white noise (unvoiced speech). The MELP excitation, however, has characteristics that are more similar to natural human speech. In particular, the MELP excitation can be a mixture of (possibly aperiodic) pulses with bandlimited noise.
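The idea of a mixed excitation can be illustrated with a minimal Python sketch. It is not the MELP algorithm: the single `voicing` weight, the `jitter` and `noise_gain` parameters, and the first-difference filter used as a stand-in for band-limiting are all assumptions made for illustration.

```python
import random


def mixed_excitation(n_samples, pitch_period, voicing=0.7, jitter=0.0,
                     noise_gain=0.1, seed=0):
    """Mix a (possibly aperiodic) pulse train with crudely filtered noise.

    voicing    -- hypothetical mixing weight in [0, 1]; 1.0 gives pulses only
    jitter     -- fraction of the pitch period by which pulse spacing may vary
    noise_gain -- hypothetical scale applied to the noise branch
    """
    rng = random.Random(seed)

    # Pulse branch: unit pulses spaced one pitch period apart, with an
    # optional random perturbation of each spacing (aperiodic pulses).
    pulses = [0.0] * n_samples
    pos = 0
    while pos < n_samples:
        pulses[pos] = 1.0
        step = pitch_period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        pos += max(1, round(step))

    # Noise branch: white Gaussian noise passed through a first-difference
    # filter, used here only as a stand-in for a real band-pass filter.
    white = [rng.gauss(0.0, 1.0) for _ in range(n_samples + 1)]
    noise = [white[i + 1] - white[i] for i in range(n_samples)]

    # Weighted sum of the two branches.
    return [voicing * p + (1.0 - voicing) * noise_gain * z
            for p, z in zip(pulses, noise)]


# Fully voiced, strictly periodic case: only the pulses remain.
excitation = mixed_excitation(n_samples=160, pitch_period=40, voicing=1.0)
pulse_positions = [i for i, v in enumerate(excitation) if v != 0.0]
```

In a vocoder of this family, a sequence like `excitation` would then drive the LPC synthesis filter; lowering `voicing` or raising `jitter` moves the signal away from the buzzy pure impulse train of the traditional LPC model.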
In MELP, the excitation spectrum is explicitly modeled using Fourier series coefficients and bandpass voicing strengths, and the time-domain excitation sequence is produced from the spectral model via an inverse transform. The synthetic excitation sequence is then used to drive an LPC synthesizer which introduces formant spectral shaping.

Common Threads
In addition to the use of analysis-by-synthesis techniques and/or LPC modeling, a common thread between low-rate, forward adaptive CELP, MBE, and MELP coders is the dependence on an estimate of the fundamental glottal frequency, or pitch period. CELP coders typically employ a pitch or long-term predictor to characterize
the glottal excitation. MBE coders estimate the fundamental frequency and use this estimate to focus subband decompositions on harmonics. MELP coders perform pitch-synchronous excitation modeling. Overall coder performance is enhanced in the CELP and MBE coders with the use of sub-integer lags [Kroon and Atal, 1991]. This is equivalent to performing pitch estimation using a signal sampled at a higher sampling rate to improve the precision of the spectral estimate. Highly precise glottal frequency estimation improves the “naturalness” of coded speech at the expense of increased computational complexity, and in some cases increased bit rate.

Accurate characterization of pitch and LPC parameters can also be used to good advantage in postfiltering to reduce apparent quantization noise. These filters, usually derived from forward-adapted filter coefficients transmitted to the receiver as side-information, perform post-processing on the reconstructed speech which reduces perceptually annoying noise components [Chen and Gersho, 1995].

Speech Quality and Intelligibility
To compare the performance of two speech coders, it is necessary to have some indicator of the intelligibility and quality of the speech produced by each coder. The term intelligibility usually refers to whether the output speech is easily understandable, while the term quality is an indicator of how natural the speech sounds. It is possible for a coder to produce highly intelligible speech that is low quality in that the speech may sound very machine-like and the speaker is not identifiable. On the other hand, it is unlikely that unintelligible speech would be called high quality, but there are situations in which perceptually pleasing speech does not have high intelligibility. We briefly discuss here the most common measures of intelligibility and quality used in formal tests of speech coders.
DRT
The diagnostic rhyme test (DRT) was devised by Voiers [1977] to test the intelligibility of coders known to produce speech of lower quality. Rhyme tests are so named because the listener must determine which consonant was spoken when presented with a pair of rhyming words; that is, the listener is asked to distinguish between word pairs such as meat-beat, pool-tool, saw-thaw, and caught-taught. Each pair of words differs on only one of six phonemic attributes: voicing, nasality, sustention, sibilation, graveness, and compactness. Specifically, the listener is presented with one spoken word from a pair and asked to decide which word was spoken. The final DRT score is the percentage of responses computed according to $P = \frac{1}{T}(R - W) \times 100$, where R is the number correctly chosen, W is the number of incorrect choices, and T is the total number of word pairs tested. Usually, $75 \le \mathrm{DRT} \le 95$, with a good system being about 90 [Papamichalis, 1987].

MOS
The Mean Opinion Score (MOS) is an often-used performance measure [Jayant and Noll, 1984]. To establish a MOS for a coder, listeners are asked to classify the quality of the encoded speech in one of five categories: excellent (5), good (4), fair (3), poor (2), or bad (1). Alternatively, the listeners may be asked to classify the coded speech according to the amount of perceptible distortion present, i.e., imperceptible (5), barely perceptible but not annoying (4), perceptible and annoying (3), annoying but not objectionable (2), or very annoying and objectionable (1). The numbers in parentheses are used to assign a numerical value to the subjective evaluations, and the numerical ratings of all listeners are averaged to produce a MOS for the coder. A MOS between 4.0 and 4.5 usually indicates high quality. It is important to compute the variance of MOS values. A large variance, which indicates an unreliable test, can occur because participants do not know what categories such as good and bad mean.
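As a small worked illustration of the two scores just described, the following Python sketch computes the DRT percentage from the formula above and the mean and variance of a set of MOS ratings. The function names and the sample numbers are hypothetical, and the choice of the sample (n − 1) variance is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, variance


def drt_score(correct, wrong, total):
    """DRT percentage P = (R - W) / T * 100 for R correct and W wrong out of T."""
    if total <= 0:
        raise ValueError("total number of word pairs must be positive")
    return 100.0 * (correct - wrong) / total


def mos_summary(ratings):
    """Return (MOS, sample variance) for listener ratings on the 1-to-5 scale."""
    values = [float(r) for r in ratings]
    if any(v < 1.0 or v > 5.0 for v in values):
        raise ValueError("ratings must lie on the 5-point scale")
    return mean(values), variance(values)


# Hypothetical listening-test data, for illustration only.
p = drt_score(correct=95, wrong=5, total=100)
mos, mos_var = mos_summary([4, 4, 5, 3, 4])
```

With these made-up inputs, `p` is 90.0, `mos` is 4.0, and `mos_var` is 0.5; a noticeably larger variance across listeners would be the warning sign of an unreliable test mentioned above.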
It is sometimes useful to present examples of good and bad speech to the listeners before the test to calibrate the 5-point scale [Papamichalis, 1987]. The MOS values for a variety of speech coders and noise conditions are given in [Daumer, 1982].

DAM
The diagnostic acceptability measure (DAM) developed by Dynastat [Voiers, 1977] is an attempt to make the measurement of speech quality more systematic. For the DAM, it is critical that the listener crews be highly trained and repeatedly calibrated in order to get meaningful results. The listeners are each presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences, such as “Cats and dogs
each hate the other” and “The pipe began to rust while new”. The listener is asked to assign a number between 0 and 100 to characteristics in three classifications: signal qualities, background qualities, and total effect. The ratings of each characteristic are weighted and used in a multiple nonlinear regression. Finally, adjustments are made to compensate for listener performance. A typical DAM score is 45 to 55%, with 50% corresponding to a good system [Papamichalis, 1987].

The perception of “good quality” speech is a highly individual and subjective area. As such, no single performance measure has gained wide acceptance as an indicator of the quality and intelligibility of speech produced by a coder. Further, there is no substitute for subjective listening tests under the actual environmental conditions expected in a particular application. As a rough guide to the performance of some of the coders discussed here, we present the DRT, DAM, and MOS values in Table 15.1, which is adapted from [Spanias, 1994; Jayant, 1990]. From the table, it is evident that at 8 kbits/s and above, performance is quite good and that the 4.8 kbits/s CELP has substantially better performance than LPC-10e.

Standardization
The presence of international, national, and regional speech coding standards ensures the interoperability of coders among various implementations. As noted previously, several standard algorithms exist among the classes of speech coders. The ITU-T (formerly CCITT) has historically been a dominant factor in international standardization of speech coders, such as G.711, G.721, G.728, G.729, etc.
Additionally, the emergence of digital cellular telephony, personal communications networks, and multimedia communications has driven the formulation of various national or regional standard algorithms such as the GSM full and half-rate standards for European digital cellular, the CTIA full-rate TDMA and CDMA algorithms and their half-rate counterparts for U.S. digital cellular, full and half-rate Pitch-Synchronous CELP [Miki et al., 1993] for Japanese cellular, as well as speech coders for particular applications [ITU-TS, 1991].

The standardization efforts of the U.S. Federal Government for secure voice channels and military applications have had a historically significant impact on the evolution of speech coder technology. In particular, the recent re-standardization of the DoD 2400 bits/s vocoder algorithm has produced some competing algorithms worthy of mention here. Of the classes of speech coders represented among the algorithms competing to replace LPC-10, several implementations utilized STC or MBE architectures, some used CELP architectures, and others were novel combinations of multiband excitation with LPC modeling [McCree and Barnwell, 1995] or pitch-synchronous prototype waveform interpolation techniques [Kleijn, 1991].

The final results of the U.S. DoD standard competition are summarized in Table 15.2 for “quiet” and “office” environments. In the table, the column labeled “FOM” is the overall Figure of Merit used by the DoD Digital Voice Processing Consortium in selecting the coder. The FOM is a unitless combination of complexity and performance components, and is measured with respect to FS-1016. The complexity of a coder is a weighted combination of memory and processing power required. The performance of a coder is a weighted combination of four factors: quality (Q, measured via MOS), intelligibility (I, measured via DRT), speaker recognition (R), and communicability (C).
Recognizability and communicability for each coder were measured by tests comparing processed vs. unprocessed data, and effectiveness of communication in application-specific cooperative tasks [Schmidt-Nielsen and Brock, 1996; Kreamer and Tardelli, 1996]. The MOS and DRT scores were measured in a variety of common DoD environments. Each of the four “finalist” coders ranked first in one of the four categories examined (Q, I, R, C), as noted in Table 15.2.

TABLE 15.1 Speech Coder Performance Comparisons

Algorithm   Standardization       Rate       Subjective
(acronym)   Body      Identifier  (kbits/s)  MOS     DRT     DAM
μ-law PCM   ITU-T     G.711       64         4.3     95      73
ADPCM       ITU-T     G.721       32         4.1     94      68
LD-CELP     ITU-T     G.728       16         4.0     94a     70a
RPE-LTP     GSM       GSM         13         3.5     -       -
VSELP       CTIA      IS-54       8          3.5     -       -
CELP        U.S. DoD  FS-1016     4.8        3.13b   90.7b   65.4b
IMBE        Inmarsat  IMBE        4.1        3.4     -       -
LPC-10e     U.S. DoD  FS-1015     2.4        2.24b   86.2b   50.3b

a Estimated.
b From results of 1996 U.S. DoD 2400 bits/s vocoder competition.

The results of the standardization process were announced in April 1996. As indicated in Table 15.2, the new 2400 bits/s Federal Standard vocoder replacing LPC-10e is a version of the Mixed Excitation Linear Prediction (MELP) coder which uses several specific enhancements to the basic MELP architecture. These enhancements include multi-stage VQ of the formant parameters based on frequency-weighted bark-scale spectral distortion, direct VQ of the first 10 Fourier coefficients of the excitation using bark-weighted distortion, and a gain coding technique which is robust to channel errors [McCree et al., 1996].

Variable Rate Coding
Previous standardization efforts and discussion here have centered on fixed-rate coding of speech where a fixed number of bits are used to represent speech in digital form per unit of time. However, with recent developments in transmission architectures (such as CDMA), the implementation of variable-rate speech coding algorithms has become feasible. In variable-rate coding, the average data rate for conversational speech can be reduced by a factor of at least 2.

A variable-rate speech coding algorithm has been standardized by the CTIA for wideband (CDMA) digital mobile cellular telephony under IS-95. The algorithm, QCELP [Gardner et al., 1993], is the first practical variable-rate speech coder to be incorporated in a digital cellular system. QCELP is a multi-mode, CELP-type analysis-by-synthesis coder which uses blockwise spectral energy measurements and a finite-state machine to switch between one of four configurations.
Each configuration has a fixed rate of 1, 2, 4, or 8 kbits/s with a predetermined allocation of bits among coder parameters (coefficients, gains, excitation, etc.). The subjective performance of QCELP in the presence of low background noise is quite good, since the per-mode bit allocations and the mode-switching logic are well suited to high-quality speech. In fact, QCELP at an average rate of 4 kbits/s has been judged to be MOS-equivalent to VSELP, its 8 kbits/s, fixed-rate cellular counterpart. A time-averaged encoding rate of 4 to 5 kbits/s is not uncommon for QCELP; however, the average rate tends toward the 8 kbits/s maximum in the presence of moderate ambient noise. The topic of robust fixed-rate and variable-rate speech coding in the presence of significant background noise remains an open problem.

Much recent research in speech coding below 8 kbits/s has focused on multimode CELP architectures and efficient approaches to source-controlled mode selection [Das et al., 1995]. Multimode coders are able to quickly invoke a coding scheme and bit allocation specifically tailored to the local characteristics of the speech signal. This capability has proven useful in optimizing perceptual quality at low coding rates. In fact, the majority of algorithms currently proposed for half-rate European and U.S. digital cellular standards, as well as many algorithms considered for rates below 2.4 kbits/s, are multimode coders. The direct coupling between variable-rate (multimode) speech coding and the CDMA transmission architecture is an obvious benefit to both technologies.

TABLE 15.2 Speech Coder Performance Comparisons Taken from Results of 1996 U.S. DoD 2400 bits/s Vocoder Competition

Algorithm                        Quiet                Office
(acronym)  FOM     Rank  Best    MOS   DRT   DAM      MOS   DRT   DAM
MELP       2.616   1     I       3.30  92.3  64.5     2.96  91.2  52.7
PWI        2.347   2     Q       3.28  90.5  70.0     2.88  88.4  55.5
STC        2.026   3     R       3.08  89.9  63.8     2.82  91.5  54.1
IMBE       2.991   *     C       2.89  91.4  62.3     2.71  91.1  52.4
CELP       0.0     N/A   -       3.13  90.7  65.4     2.89  89.0  56.1
LPC-10e    -9.19   N/A   -       2.24  86.2  50.3     2.09  85.2  48.4

* Ineligible due to failure of the quality (MOS) criteria minimum requirements (better than CELP) in both quiet and office environments.
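
As a rough illustration of the mode switching described for QCELP in the Variable Rate Coding discussion, the following Python sketch selects one of the four rates per frame from a blockwise energy measurement and a small finite-state machine. It is not the IS-95 algorithm: the thresholds, the hangover count, the use of time-domain frame energy in place of a spectral measurement, and all names (RateSelector, frame_energy_db, average_rate_kbps) are assumptions chosen only for illustration.

```python
import math

RATES_KBPS = (1, 2, 4, 8)  # the four fixed rates named in the text


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (the floor avoids log of zero)."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, 1e-12))


class RateSelector:
    """Hypothetical finite-state rate selector; all constants are illustrative."""

    def __init__(self, thresholds_db=(-50.0, -40.0, -30.0), hangover=2):
        self.thresholds_db = thresholds_db  # assumed energy boundaries between modes
        self.hangover = hangover            # frames to hold a mode before stepping down
        self.mode = 0                       # index into RATES_KBPS
        self.hold = 0                       # remaining hangover frames

    def select(self, energy_db):
        # Target mode is the number of thresholds the frame energy exceeds (0..3).
        target = sum(energy_db > t for t in self.thresholds_db)
        if target >= self.mode:
            # Move up (or stay) immediately and re-arm the hangover counter.
            self.mode = target
            self.hold = self.hangover
        elif self.hold > 0:
            # Energy dropped: keep the current mode while the hangover lasts.
            self.hold -= 1
        else:
            # Hangover expired: step down one mode per frame.
            self.mode -= 1
        return RATES_KBPS[self.mode]


def average_rate_kbps(frames, selector):
    """Return (average rate, per-frame rates) for a sequence of frames."""
    rates = [selector.select(frame_energy_db(f)) for f in frames]
    return sum(rates) / len(rates), rates


if __name__ == "__main__":
    active = [0.5] * 160   # constant-amplitude stand-in for an active speech frame
    silent = [0.0] * 160   # stand-in for a pause
    avg, rates = average_rate_kbps([active] * 2 + [silent] * 6, RateSelector())
    print(rates, avg)      # [8, 8, 8, 8, 4, 2, 1, 1] 5.0
```

With two active frames followed by six silent ones, the sketch holds the full rate for the hangover period and then steps down, giving an average of 5 kbits/s over the eight frames. Longer pauses pull the average down further, which is the mechanism behind the rate reduction for conversational speech; conversely, background noise that keeps the measured energy above the thresholds would hold the selector near the 8 kbits/s maximum.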

Summary and Conclusions

The availability of general-purpose and application-specific digital signal processing chips and the ever-widening interest in digital communications have led to an increasing demand for speech coders. The worldwide desire to establish standards in a host of applications is a primary driving force for speech coder research and development. The speech coders that are available today for operation at 16 kbits/s and below are conceptually quite exotic compared with products available less than 10 years ago. The re-standardization of U.S. Federal Standard 1015 (LPC-10) at 2.4 kbits/s with performance constraints similar to those of FS-1016 at 4.8 kbits/s is an indicator of the rapid evolution of speech coding paradigms and VLSI architectures.

Other standards to be established in the near term include the European (GSM) and U.S. (CTIA) half-rate speech coders for digital cellular mobile radio. For the longer term, the specification of standards for forthcoming mobile personal communications networks will be a primary focus in the next 5 to 10 years.

In the preface to their book, Jayant and Noll [1984] state that “our understanding of speech and image coding has now reached a very mature point …” As of 1997, this statement rings truer than ever. The field is a dynamic one, however, and the wide range of commercial applications demands continual progress.

Defining Terms

Analysis-by-synthesis: Constructing several versions of a waveform and choosing the best match.
Predictive coding: Coding of time-domain waveforms based on a (usually) linear prediction model.
Frequency domain coding: Coding of frequency-domain characteristics based on a discrete time-frequency transform.
Hybrid coders: Coders that fall between waveform coders and vocoders in how they select the excitation.
Standard: An encoding technique adopted by an industry to be used in a particular application.
Mean Opinion Score (MOS): A popular method for classifying the quality of encoded speech based on a five-point scale.
Variable-rate coders: Coders that output different numbers of bits based on the time-varying characteristics of the source.

Related Topics

17.1 Digital Image Processing • 21.4 Example 3: Multirate Signal Processing

References

A. S. Spanias, “Speech coding: A tutorial review,” Proc. IEEE, 82, 1541–1575, October 1994.
A. Gersho, “Advances in speech and audio compression,” Proc. IEEE, 82, June 1994.
W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Amsterdam, Holland: Elsevier, 1995.
CCITT, “32-kbit/s adaptive differential pulse code modulation (ADPCM),” Red Book, III.3, 125–159, 1984.
National Communications System, Office of Technology and Standards, Federal Standard 1015: Analog to Digital Conversion of Voice by 2400 bit/second Linear Predictive Coding, 1984.
J.-H. Chen, “High-quality 16 kb/s speech coding with a one-way delay less than 2 ms,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 453–456, April 1990.
National Communications System, Office of Technology and Standards, Federal Standard 1016: Telecommunications: Analog to Digital Conversion of Radio Voice by 4800 bit/second Code Excited Linear Prediction (CELP), 1991.
J. Gibson, “Adaptive prediction for speech encoding,” IEEE ASSP Magazine, 1, 12–26, July 1984.
J. D. Johnston, “A filter family designed for use in quadrature mirror filter banks,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, CO, pp. 291–294, April 1980.
B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 247–254, June 1979.
I. Gerson and M. Jasiuk, “Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 461–464, April 1990.