LR Rabiner and RW Schafer, June 3

DRAFT: L. R. Rabiner and R. W. Schafer, June 3, 2009
Chapter 8. The Cepstrum and Homomorphic Speech Processing

[Figure 8.18: (a) Complex cepstrum of synthetic speech output, $\hat{s}[n]$. (b) Corresponding cepstrum of synthetic speech output, $c[n]$. Horizontal axis in both panels: quefrency $nT$ in ms, from −25 to 25.]

the log magnitude and continuous phase of the discrete-time Fourier transform $S(e^{j2\pi FT})$. These are, of course, the real and imaginary parts of $\hat{S}(e^{j2\pi FT})$, the discrete-time Fourier transform of the complex cepstrum, $\hat{s}[n]$. The heavy lines show the contributions to the log magnitude and continuous phase due to the overall system response, i.e.,

$$\hat{H}_V(e^{j2\pi FT}) = \log|H_V(e^{j2\pi FT})| + j\,\arg\{H_V(e^{j2\pi FT})\}.$$

The thin lines show the total log magnitude and continuous phase of the output of the system. Observe that the excitation introduces a periodic (in $F$) variation in both the log magnitude and continuous phase that is superimposed upon the more slowly varying components due to the system response. It is this periodic component that manifests itself in the cepstrum as the impulses at quefrencies that are multiples of $N_p$, and it is this behavior that motivated the original definition of the cepstrum by Bogert et al. [1]

[Figure 8.19: Frequency-domain representation of the complex cepstrum. (a) Log magnitude $\log|S(e^{j2\pi FT})|$ (real part of $\hat{S}(e^{j2\pi FT})$). (b) Continuous phase $\arg\{S(e^{j2\pi FT})\}$ (imaginary part of $\hat{S}(e^{j2\pi FT})$). (c) Principal value phase $\mathrm{ARG}\{S(e^{j2\pi FT})\}$. The heavy lines in (a) and (b) represent $\hat{H}_V(e^{j2\pi FT}) = \log|H_V(e^{j2\pi FT})| + j\,\arg\{H_V(e^{j2\pi FT})\}$. Panel titles: (a) Log Magnitude Spectrum of Synthetic Voiced Speech, (b) Continuous Phase of Synthetic Voiced Speech, (c) Principal Value Phase of Synthetic Voiced Speech. Horizontal axis in all panels: frequency in Hz, from 0 to 5000.]

8.3.2 Homomorphic Analysis of the Model for Unvoiced Speech

In Section 8.3.1, we considered an extended example of homomorphic analysis of the discrete-time model of voiced speech production. This analysis is exact for the assumed model, since we were able to determine the z-transforms of each of the convolutional components of the synthetic speech output. A completely similar analysis is not possible for the model for unvoiced speech production, since no z-transform representation exists directly for the random noise input signal itself. However, if we employ the autocorrelation and power spectrum representation for the model for unvoiced speech production, we can obtain similar results to those for voiced speech. Recall that for unvoiced speech, we have no glottal pulse excitation, so the model output is

$$s[n] = h_U[n] * u[n] = v[n] * r[n] * (A_U\, u[n]),$$

where $u[n]$ is a unit-variance white noise sequence. The autocorrelation representation of unvoiced speech is therefore

$$\phi_{ss}[n] = \phi_{vv}[n] * \phi_{rr}[n] * (A_U^2\, \delta[n]) = A_U^2\, \phi_{vv}[n] * \phi_{rr}[n], \qquad (8.47)$$

where $\phi_{vv}[n]$ and $\phi_{rr}[n]$ are the deterministic autocorrelation functions of the vocal tract and radiation systems respectively. These are combined by convolution. The z-transform of $\phi_{ss}[n]$ exists and is given by

$$\Phi_{ss}(z) = A_U^2\, \Phi_{vv}(z)\, \Phi_{rr}(z), \qquad (8.48)$$

where

$$\Phi_{vv}(z) = V(z)\, V(z^{-1}) \qquad (8.49a)$$

$$\Phi_{rr}(z) = R(z)\, R(z^{-1}) \qquad (8.49b)$$
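As a numerical illustration of Eqs. (8.47)-(8.49), the following sketch checks that the deterministic autocorrelation of the overall unvoiced system equals the scaled convolution of the component autocorrelations, and that its transform on the unit circle is $A_U^2 |V|^2 |R|^2$. It is not taken from the text: the sequences `v` and `r`, the gain `A_U`, the FFT length, and the helper `deterministic_autocorr` are all hypothetical choices made only for this example, and NumPy is assumed to be available.

```python
import numpy as np

def deterministic_autocorr(x):
    """Deterministic autocorrelation of a real finite-length sequence,
    returned for lags -(len(x)-1) ... +(len(x)-1)."""
    return np.correlate(x, x, mode="full")

# Hypothetical model components (assumptions for illustration only).
n = np.arange(40)
v = (0.9 ** n) * np.cos(0.3 * np.pi * n)  # truncated damped resonance as a stand-in for v[n]
r = np.array([1.0, -0.98])                # simple first-difference-like stand-in for r[n]
A_U = 0.5                                 # hypothetical unvoiced gain

# Overall system impulse response: A_U * (v[n] * r[n]).
h_U = A_U * np.convolve(v, r)

# Right-hand side of Eq. (8.47): A_U^2 * (phi_vv * phi_rr).
phi_vv = deterministic_autocorr(v)
phi_rr = deterministic_autocorr(r)
phi_ss_model = A_U**2 * np.convolve(phi_vv, phi_rr)

# Autocorrelation computed directly from the overall impulse response.
phi_hh = deterministic_autocorr(h_U)
time_domain_ok = np.allclose(phi_ss_model, phi_hh)

# Eqs. (8.48)-(8.49) on the unit circle: V(z)V(1/z) becomes |V|^2 for real v[n].
N = 256                                   # FFT length, at least len(phi_ss_model)
V = np.fft.fft(v, N)
R = np.fft.fft(r, N)
power_from_factors = A_U**2 * np.abs(V) ** 2 * np.abs(R) ** 2

# Place lag 0 at index 0 so that the transform of phi_ss is real.
padded = np.zeros(N)
padded[: len(phi_ss_model)] = phi_ss_model
padded = np.roll(padded, -(len(h_U) - 1))
Phi_ss = np.fft.fft(padded)

freq_domain_ok = (
    np.allclose(Phi_ss.real, power_from_factors)
    and np.allclose(Phi_ss.imag, 0.0, atol=1e-9)
)

print("time-domain check:", time_domain_ok)
print("frequency-domain check:", freq_domain_ok)
```

Both checks use only the deterministic system responses; for an actual white-noise input $u[n]$, the relation in Eq. (8.47) would hold for the statistical autocorrelation rather than for any single finite realization.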