LR Rabiner and RW Schafer, June 3

DRAFT: L. R. Rabiner and R. W. Schafer, June 3, 2009

CHAPTER 8. THE CEPSTRUM AND HOMOMORPHIC SPEECH PROCESSING

a short-time Fourier analysis is done first, resulting in a DFT, $X_m[k]$, for the $m$th frame. Then the DFT values are grouped together in critical bands and weighted by triangular weighting functions such as those depicted in Fig. 8.41. Note that the bandwidths in Fig. 8.41 are constant for center frequencies below 1 kHz and then increase exponentially up to 4 kHz (half the sampling rate), resulting in 24 "filters".

Figure 8.41: DFT weighting functions for mel-frequency-cepstrum computations. (Horizontal axis: frequency in Hz, 0 to 4000.)

The mel-spectrum of the $m$th frame is defined for $r = 1, 2, \ldots, R$ as

$$MF_m[r] = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \left| V_r[k] \, X_m[k] \right|^2 \tag{8.117a}$$

where $V_r[k]$ is the weighting function for the $r$th filter, ranging from DFT index $L_r$ to $U_r$, and

$$A_r = \sum_{k=L_r}^{U_r} \left| V_r[k] \right|^2 \tag{8.117b}$$

is a normalizing factor for the $r$th mel-filter. This normalization is built into the plot of Fig. 8.41. It is needed so that a perfectly flat input Fourier spectrum will produce a flat mel-spectrum. For each frame, a discrete cosine transform of the logarithm of the magnitude of the filter outputs is computed to form the function $\mathrm{mfcc}_m[n]$ as

$$\mathrm{mfcc}_m[n] = \frac{1}{R} \sum_{r=1}^{R} \log\left( MF_m[r] \right) \cos\left[ \frac{2\pi}{R} \left( r + \frac{1}{2} \right) n \right]. \tag{8.118}$$

Typically, $\mathrm{mfcc}_m[n]$ is evaluated for $n = 1, 2, \ldots, N_{\mathrm{mfcc}}$, where $N_{\mathrm{mfcc}}$ is less than the number of mel-filters, e.g., $N_{\mathrm{mfcc}} = 13$ and $R = 24$. Figure 8.42 shows the result of mfcc analysis of a frame of voiced speech in comparison with the short-time spectrum, LPC spectrum, and a homomorphically smoothed spectrum.$^{21}$ The large dots are the values of $\log(MF_m[r])$, and the line interpolated between them is a spectrum reconstructed by interpolation at the original DFT frequencies. Note that these spectra differ from one another in detail, but they all have peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum, of course, has more smoothing due to the structure of the filter bank.

$^{21}$ The speech signal was pre-emphasized by convolution with $\delta[n] - 0.97\,\delta[n-1]$ prior to analysis so as to equalize the levels of the formant resonances.

Figure 8.42: Comparison of spectral smoothing methods to mel-frequency analysis. (Curves: short-time Fourier transform; homomorphic smoothing, $n_{co} = 13$; LPC smoothing, $p = 12$; mel cepstrum smoothing, $N_{\mathrm{mfcc}} = 13$. Axes: log magnitude versus frequency in Hz, 0 to 4000.)

The mfcc parameters have become firmly established as the basic feature vector for many speech and acoustic pattern recognition problems. For this reason, new and efficient ways of computing $\mathrm{mfcc}[n]$ are of interest. An intriguing proposal is to use floating gate electronic technology to implement the filter bank and the DCT computation with microwatts of power [23].

8.7.5 Dynamic Cepstral Features

The set of mel frequency cepstral coefficients (mfcc) provides perceptually meaningful and smooth estimates of the speech spectra over time, and has been used effectively in a range of speech processing systems [18]. Since speech is inherently a dynamic signal, changing regularly in time, it is reasonable to seek a representation that includes some aspect of the dynamic nature of the speech signal. As such, Furui [4] proposed the use of estimates of the time derivatives (both first- and second-order derivatives) of the short-term cepstrum. Furui called the
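As an illustration of the mel-spectrum and mfcc definitions in Eqs. (8.117a), (8.117b), and (8.118), the following is a minimal NumPy sketch, not the authors' implementation. The helper names (`triangular_weights`, `mel_spectrum`, `mfcc`), the 512-point DFT, the Hamming window, and the particular band edges (100 Hz spacing below 1 kHz, geometric spacing from 1 kHz to 4 kHz) are all assumptions chosen only to give 24 filters with roughly the shape the text describes; the exact filters of Fig. 8.41 are not specified here.

```python
import numpy as np

def triangular_weights(n_bins, edges):
    """Hypothetical helper: triangular weighting functions V_r[k].

    `edges` holds R + 2 increasing positions in DFT-bin units; filter r
    rises from edges[r] to 1 at edges[r + 1] and falls to 0 at edges[r + 2].
    """
    R = len(edges) - 2
    k = np.arange(n_bins)
    V = np.zeros((R, n_bins))
    for r in range(R):
        lo, mid, hi = edges[r], edges[r + 1], edges[r + 2]
        rising = (k - lo) / (mid - lo)
        falling = (hi - k) / (hi - mid)
        # Outside [lo, hi] one of the ramps is negative, so clip to zero.
        V[r] = np.clip(np.minimum(rising, falling), 0.0, None)
    return V

def mel_spectrum(X, V):
    """Eqs. (8.117a)-(8.117b): normalized, weighted power in each band.

    Summing over all k equals summing over L_r..U_r because V_r[k] is
    zero outside that range.
    """
    A = np.sum(np.abs(V) ** 2, axis=1)               # A_r
    return np.sum(np.abs(V * X[np.newaxis, :]) ** 2, axis=1) / A

def mfcc(MF, n_mfcc):
    """Eq. (8.118) evaluated for n = 1, ..., n_mfcc."""
    R = len(MF)
    r = np.arange(1, R + 1)
    n = np.arange(1, n_mfcc + 1)
    basis = np.cos(2.0 * np.pi / R * np.outer(n, r + 0.5))
    return basis @ np.log(MF) / R

# Assumed analysis setup (illustrative only): 8 kHz sampling, 512-point DFT.
fs, n_fft = 8000, 512
edges_hz = np.concatenate([
    np.linspace(0.0, 1000.0, 11),                    # constant spacing below 1 kHz
    1000.0 * 4.0 ** (np.arange(1, 16) / 15.0),       # geometric spacing up to 4 kHz
])
edges = edges_hz / (fs / n_fft)                      # 26 edges -> R = 24 filters
V = triangular_weights(n_fft // 2 + 1, edges)

# Stand-in for one windowed speech frame (random data, not real speech).
frame = np.random.default_rng(0).standard_normal(n_fft) * np.hamming(n_fft)
X = np.fft.rfft(frame)                               # X_m[k], k = 0..n_fft/2
coeffs = mfcc(mel_spectrum(X, V), n_mfcc=13)         # 13 coefficients
```

One way to check the role of the normalization $A_r$: passing a constant-magnitude spectrum through `mel_spectrum` should give the same value in every band, and Eq. (8.118) then yields coefficients that are zero (to rounding error) for $n = 1, \ldots, 13$, since the cosine terms sum to zero over $r$.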