LR Rabiner and RW Schafer, June 3

DRAFT: L. R. Rabiner and R. W. Schafer, June 3, 2009
442 CHAPTER 8. THE CEPSTRUM AND HOMOMORPHIC SPEECH PROCESSING

Figure 8.11b shows the components of the complex cepstrum, with $\hat{x}[n]$ for $n < 0$ contributed by $x_2[n]$ and $\hat{x}[n]$ for $n \ge 0$ comprised of the complex cepstrums of $x_1[n]$ and $x_3[n]$. In particular, the impulses at multiples of $N_p = 15$ are due to the echoing caused by convolution with $x_3[n]$. Note that the contributions due to the pole at $z = a$ and the zero at $z = -1/b$ die out rapidly as $n \to \pm\infty$, respectively.

8.2.5 Minimum- and Maximum-Phase Signals

A general result for sequences of the form Eq. (8.26) is that they can be completely represented by only the real parts of their Fourier transforms [15]. Thus, since the real part of the discrete-time Fourier transform of the complex cepstrum is $\log |X(e^{j\omega})|$, we should be able to represent the complex cepstrum of minimum-phase signals by the logarithm of the magnitude of the Fourier transform alone. This can easily be shown by remembering that the real part of the Fourier transform is the Fourier transform of the even part of the sequence; i.e., since $\log |X(e^{j\omega})|$ is the Fourier transform of the cepstrum, then

$$c[n] = \frac{\hat{x}[n] + \hat{x}[-n]}{2}. \tag{8.34}$$

It follows from Eqs. (8.26) and (8.34) that since $\hat{x}[n] = 0$ for $n < 0$,

$$\hat{x}_{mnp}[n] = \begin{cases} 0 & n < 0 \\ c[n] & n = 0 \\ 2c[n] & n > 0, \end{cases} \tag{8.35}$$

where we use the notation mnp for minimum-phase signals, and mxp for maximum-phase signals. Thus, for minimum-phase sequences the complex cepstrum can be obtained by computing the cepstrum and then using Eq. (8.35).

Similar results can be obtained for maximum-phase signals. In this case, it can be seen from Eqs. (8.27) and (8.34) that, for maximum-phase signals,

$$\hat{x}_{mxp}[n] = \begin{cases} 0 & n > 0 \\ c[n] & n = 0 \\ 2c[n] & n < 0, \end{cases} \tag{8.36}$$

so again, the complex cepstrum of a maximum-phase signal can be computed from only $\log |X(e^{j\omega})|$.
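The folding rule of Eq. (8.35) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, approximates the cepstrum with a finite-length DFT (so the result is a slightly time-aliased version of $c[n]$), and uses the hypothetical test signal $x[n] = \delta[n] - a\,\delta[n-1]$ with $a = 0.5$, whose complex cepstrum is known in closed form to be $-a^n/n$ for $n > 0$. The function names `real_cepstrum` and `fold_to_min_phase` are illustrative only.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """Cepstrum c[n]: inverse DFT of log|X[k]| (time-aliased for finite nfft)."""
    spectrum = np.fft.fft(x, nfft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

def fold_to_min_phase(c):
    """Apply Eq. (8.35): 0 for n < 0, c[n] for n = 0, 2c[n] for n > 0.

    Assumes an even DFT length N; indices above N/2 stand for negative n.
    The sample at n = N/2 is shared by both sides and is kept unscaled.
    """
    nfft = len(c)
    half = nfft // 2
    xhat = np.zeros(nfft)
    xhat[0] = c[0]
    xhat[1:half] = 2.0 * c[1:half]
    xhat[half] = c[half]
    return xhat

# Hypothetical minimum-phase test signal: a single zero at z = a, |a| < 1.
a = 0.5
x = np.array([1.0, -a])

c = real_cepstrum(x, 1024)
xhat = fold_to_min_phase(c)

# Closed-form complex cepstrum of 1 - a z^{-1}: xhat[n] = -a^n / n for n > 0.
n = np.arange(1, 11)
expected = -(a ** n) / n
max_error = np.max(np.abs(xhat[n] - expected))
print("max deviation from closed form:", max_error)
```

The maximum-phase case of Eq. (8.36) would follow the same pattern, doubling the samples at negative $n$ (the upper half of the DFT index range) and zeroing those at positive $n$.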
8.3 Homomorphic Analysis of the Speech Model

A fundamental tenet of digital speech processing is that speech can be represented as the output of a linear, time-varying system whose properties vary slowly with time. This is embodied in the model of Figure 8.12, which emerged from our discussion of the physics of speech production. This leads to the basic principle of speech analysis that assumes that short segments of the speech signal can be modeled as the output of a linear time-invariant system excited either by a quasi-periodic impulse train or a random noise signal. As we have seen repeatedly in previous chapters, the fundamental problem of speech analysis is to reliably and robustly estimate the parameters of the model of Figure 8.12 (i.e., the pitch period control, the shape of the glottal pulse, the gains for the voiced or unvoiced excitation signals, the state of the voiced/unvoiced switch, the vocal tract parameters and the radiation model, as illustrated in Figure 8.12), and to measure the variations of these model control parameters with time.

Figure 8.12: General discrete-time model of speech production.

Since the excitation and impulse response of a linear time-invariant system are combined by convolution, the problem of speech analysis can also be viewed as a problem in separating the components of a convolution, and therefore, homomorphic systems and the cepstrum are useful tools for speech analysis.

In the model of Figure 8.12, the pressure signal at the lips, $s[n]$, for a voiced section of speech is represented as the convolution

$$s[n] = p[n] * h_V[n], \tag{8.37a}$$

where $p[n]$ is the quasi-periodic voiced excitation signal, and $h_V[n]$ represents the combined effect of the vocal tract impulse response $v[n]$, the glottal pulse $g[n]$, the radiation load response at the lips $r[n]$, and the voiced gain $A_V$. The effective impulse response, $h_V[n]$, is itself the convolution of $g[n]$, $v[n]$, and $r[n]$, including scaling by the voiced section gain control, $A_V$, i.e.,

$$h_V[n] = A_V \cdot g[n] * v[n] * r[n]. \tag{8.37b}$$

Recall that it is common usage, when it is not necessary to make a fine distinction, to refer to $h_V[n]$ as simply “the vocal tract impulse response for voiced
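
The view of Eq. (8.37a) as a convolution whose components can be separated may be illustrated with a small synthetic experiment. This is only a sketch under stated assumptions, not the model of Figure 8.12 itself: it assumes NumPy, replaces $p[n]$ with a hypothetical decaying impulse train of period `Np = 40` samples and decay `beta = 0.9`, and replaces $h_V[n]$ with a hypothetical single damped resonance. Because the DFT length exceeds the length of $s[n]$, the DFT of $s[n]$ is the product of the DFTs of $p[n]$ and $h_V[n]$, so the log magnitudes, and hence the cepstra, add.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """Cepstrum: inverse DFT of log|X[k]| (time-aliased for finite nfft)."""
    spectrum = np.fft.fft(x, nfft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

# Hypothetical excitation: 8 impulses spaced Np samples apart, decaying by beta.
Np, beta, n_pulses = 40, 0.9, 8
p = np.zeros((n_pulses - 1) * Np + 1)
p[::Np] = beta ** np.arange(n_pulses)

# Hypothetical stand-in for h_V[n]: one damped resonance, truncated to 200 samples.
n = np.arange(200)
h_v = 0.95 ** n * np.cos(0.2 * np.pi * n)

# Eq. (8.37a): the voiced segment is the convolution of excitation and response.
s = np.convolve(p, h_v)

nfft = 1024  # at least len(s), so the DFT of s equals the product of the DFTs
c_s = real_cepstrum(s, nfft)
c_p = real_cepstrum(p, nfft)
c_h = real_cepstrum(h_v, nfft)

# Convolution in time becomes addition in the cepstral domain.
additivity_error = np.max(np.abs(c_s - (c_p + c_h)))

# The excitation contributes an isolated peak at the pitch period, well above
# the low-quefrency region dominated by h_v.
search = np.arange(20, 200)
pitch_lag = search[np.argmax(c_s[search])]
print("additivity error:", additivity_error, "estimated period:", pitch_lag)
```

In this toy case the contribution of `h_v` is concentrated at small indices while the excitation appears as impulses at multiples of `Np`, which is what makes the two components separable in the cepstrum.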