LR Rabiner and RW Schafer, June 3
DRAFT: L. R. Rabiner and R. W. Schafer, June 3, 2009<br />
486 CHAPTER 8. THE CEPSTRUM AND HOMOMORPHIC SPEECH PROCESSING<br />
a short-time Fourier analysis is done first, resulting in a DFT, $X_m[k]$, for the<br />
$m$th frame. Then the DFT values are grouped together in critical bands and<br />
weighted by triangular weighting functions such as those depicted in Fig. 8.41.<br />
Note that the bandwidths in Fig. 8.41 are constant for center frequencies below<br />
1 kHz and then increase exponentially up to 4 kHz (half the sampling rate),<br />
resulting in 24 “filters”.<br />
Figure 8.41: DFT weighting functions for mel-frequency-cepstrum computations. (Plot not reproduced; horizontal axis: frequency in Hz, 0 to 4000.)<br />
The mel-spectrum of the $m$th frame is defined for $r = 1, 2, \ldots, R$ as<br />
$$\mathrm{MF}_m[r] = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \left| V_r[k]\, X_m[k] \right|^2 \tag{8.117a}$$<br />
where $V_r[k]$ is the weighting function for the $r$th filter, ranging from DFT<br />
index $L_r$ to $U_r$, and<br />
$$A_r = \sum_{k=L_r}^{U_r} \left| V_r[k] \right|^2 \tag{8.117b}$$<br />
is a normalizing factor for the $r$th mel-filter. This normalization is built into the<br />
plot of Fig. 8.41. It is needed so that a perfectly flat input Fourier spectrum<br />
will produce a flat mel-spectrum. For each frame, a discrete cosine transform<br />
of the logarithm of the magnitude of the filter outputs is computed to form the<br />
function $\mathrm{mfcc}_m[n]$ as<br />
$$\mathrm{mfcc}_m[n] = \frac{1}{R} \sum_{r=1}^{R} \log\left(\mathrm{MF}_m[r]\right) \cos\left[ \frac{2\pi}{R} \left( r + \frac{1}{2} \right) n \right]. \tag{8.118}$$<br />
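Equations (8.117a), (8.117b), and (8.118) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`make_filter`, `mel_spectrum`, `mfcc`), the exact triangle shape, and the tiny 16-bin, four-filter layout are all assumptions made for the example, and the input is taken to be the DFT magnitudes $|X_m[k]|$ of one frame.

```python
import math

def make_filter(lo, center, hi):
    """Return (L_r, [V_r[k] for k = lo..hi]) for a triangle peaking at `center`.

    The triangle shape is an assumption for illustration only.
    """
    weights = []
    for k in range(lo, hi + 1):
        if k <= center:
            weights.append((k - lo + 1) / (center - lo + 1))
        else:
            weights.append((hi - k + 1) / (hi - center + 1))
    return lo, weights

def mel_spectrum(x_mag, filters):
    """Eqs. (8.117a)-(8.117b): normalized weighted energy for each filter."""
    mf = []
    for lo, weights in filters:
        a_r = sum(v * v for v in weights)  # normalizing factor A_r
        energy = sum((v * x_mag[lo + i]) ** 2 for i, v in enumerate(weights))
        mf.append(energy / a_r)
    return mf

def mfcc(mf, n_coeffs):
    """Eq. (8.118) evaluated for n = 1, ..., n_coeffs."""
    R = len(mf)
    log_mf = [math.log(v) for v in mf]
    return [
        sum(log_mf[r - 1] * math.cos(2 * math.pi / R * (r + 0.5) * n)
            for r in range(1, R + 1)) / R
        for n in range(1, n_coeffs + 1)
    ]

# Hypothetical 16-bin frame: four overlapping filters and a flat spectrum.
filters = [make_filter(0, 2, 4), make_filter(2, 4, 6),
           make_filter(4, 7, 10), make_filter(7, 11, 15)]
mf = mel_spectrum([2.0] * 16, filters)
coeffs = mfcc(mf, 3)
print(mf)
print(coeffs)
```

With the flat magnitude spectrum used here, every mel-spectrum value equals the squared magnitude (4), which is the effect of the normalization by $A_r$. The cosine sum of a constant then vanishes, up to rounding error, for $n = 1, 2, 3$.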
Typically, $\mathrm{mfcc}_m[n]$ is evaluated for $n = 1, 2, \ldots, N_{\mathrm{mfcc}}$, where $N_{\mathrm{mfcc}}$ is less than<br />
the number of mel-filters, e.g., $N_{\mathrm{mfcc}} = 13$ and $R = 24$. Figure 8.42 shows the<br />
result of mfcc analysis of a frame of voiced speech in comparison with the short-time<br />
spectrum, LPC spectrum, and a homomorphically smoothed spectrum.<sup>21</sup><br />
The large dots are the values of $\log(\mathrm{MF}_m[r])$ and the line interpolated between<br />
21 The speech signal was pre-emphasized by convolution with δ[n] − 0.97δ[n − 1] prior to<br />
analysis so as to equalize the levels of the formant resonances.
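The pre-emphasis described in footnote 21 amounts to the difference equation $y[n] = x[n] - 0.97\,x[n-1]$. A minimal sketch follows, assuming the sample before the start of the signal is zero and keeping the output the same length as the input (the full convolution would have one extra trailing sample). The function name `pre_emphasize` is hypothetical.

```python
def pre_emphasize(x, alpha=0.97):
    """Return y[n] = x[n] - alpha * x[n-1] for n = 0..len(x)-1, with x[-1] = 0."""
    y = []
    prev = 0.0  # assumed value of the sample before the signal starts
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y

# A unit impulse reproduces the filter's impulse response.
print(pre_emphasize([1.0, 0.0, 0.0]))
```

Because the filter subtracts most of the previous sample, slowly varying (low-frequency) content is attenuated relative to rapidly varying content.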