LR Rabiner and RW Schafer, June 3

DRAFT: L. R. Rabiner and R. W. Schafer, June 3, 2009

CHAPTER 8. THE CEPSTRUM AND HOMOMORPHIC SPEECH PROCESSING

A maximum-phase sequence is defined by the properties $x_{mxp}[n] = 0$ and $\hat{x}_{mxp}[n] = 0$ for $n > 0$. Starting with Eq. (8.76) we can apply these constraints to obtain the following recursion relation for the complex cepstrum of a maximum-phase sequence:

$$
\hat{x}_{mxp}[n] =
\begin{cases}
\dfrac{x_{mxp}[n]}{x_{mxp}[0]} - \displaystyle\sum_{k=n+1}^{0} \left(\frac{k}{n}\right) \hat{x}_{mxp}[k]\, \frac{x_{mxp}[n-k]}{x_{mxp}[0]} & n < 0 \\[2ex]
\log\{x_{mxp}[0]\} & n = 0 \\[1ex]
0 & n > 0,
\end{cases}
\tag{8.84}
$$

and by reorganizing Eq. (8.84) we obtain the following recursion for the inverse characteristic system for maximum-phase signals:

$$
x_{mxp}[n] =
\begin{cases}
\hat{x}_{mxp}[n]\, x_{mxp}[0] + \displaystyle\sum_{k=n+1}^{0} \left(\frac{k}{n}\right) \hat{x}_{mxp}[k]\, x_{mxp}[n-k] & n < 0 \\[2ex]
\exp\{\hat{x}_{mxp}[0]\} & n = 0 \\[1ex]
0 & n > 0.
\end{cases}
\tag{8.85}
$$

8.5 Homomorphic Filtering of Natural Speech

We are now in a position to apply the concepts of the cepstrum and homomorphic filtering to a natural speech signal. Recall that the model for speech production, as shown in Figure 8.12, consists essentially of a slowly time-varying linear system excited by either a quasi-periodic impulse train or by random noise. Thus, it is appropriate to think of a short segment of voiced speech as having been taken from the steady-state output of a linear time-invariant system excited by a periodic impulse train. Similarly, a short segment of unvoiced speech can be thought of as resulting from the excitation of a linear time-invariant system by random noise.

The analysis of Section 8.3, which was based on exact z-transform representations of the components of the model, demonstrated that for this convolutional model the excitation and the vocal tract impulse response occupy largely separate regions of the cepstrum. The purpose of this section is to demonstrate that similar behavior results if short-time homomorphic analysis methods are employed with natural speech inputs.
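Before moving on to the speech model, the maximum-phase recursions of Eqs. (8.84) and (8.85) can be checked numerically. The following is a minimal sketch, not code from the text: the function names `maxphase_cepstrum` and `maxphase_inverse`, the dictionary representation keyed by the time index n, and the test sequence are all assumptions made for illustration. The test sequence x[n] = 2δ[n] − δ[n + 1] has z-transform 2(1 − 0.5z), whose single zero lies outside the unit circle, so its complex cepstrum should be log 2 at n = 0 and −(0.5)^m / m at n = −m.

```python
import math


def maxphase_cepstrum(x, n_min):
    """Complex cepstrum of a maximum-phase sequence via Eq. (8.84).

    x     : dict mapping n <= 0 to x[n]; missing entries are taken as 0.
            x[0] must be positive so that log{x[0]} is real.
    n_min : most negative index to compute (n_min <= 0).
    """
    x0 = x[0]
    xhat = {0: math.log(x0)}
    for n in range(-1, n_min - 1, -1):
        # Sum over k = n+1, ..., -1; the k = 0 term is multiplied by k/n = 0.
        acc = sum((k / n) * xhat[k] * x.get(n - k, 0.0)
                  for k in range(n + 1, 0))
        xhat[n] = x.get(n, 0.0) / x0 - acc / x0
    return xhat


def maxphase_inverse(xhat, n_min):
    """Recover x[n] from its complex cepstrum via Eq. (8.85)."""
    x = {0: math.exp(xhat[0])}
    for n in range(-1, n_min - 1, -1):
        # Only already-computed samples x[n+1], ..., x[-1] appear in the sum.
        acc = sum((k / n) * xhat.get(k, 0.0) * x[n - k]
                  for k in range(n + 1, 0))
        x[n] = xhat.get(n, 0.0) * x[0] + acc
    return x


# Assumed example: x[n] = 2*delta[n] - delta[n+1], i.e. X(z) = 2(1 - 0.5 z).
x_example = {0: 2.0, -1: -1.0}
xhat_example = maxphase_cepstrum(x_example, n_min=-6)
x_recovered = maxphase_inverse(xhat_example, n_min=-6)
```

Both loops run from n = −1 toward more negative n because each new value depends only on values at indices closer to zero. Substituting m = −n turns the pair into the corresponding minimum-phase recursions applied to the time-reversed sequence.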
8.5.1 A Model for Short-Time Cepstral Analysis of Speech

Following the approach presented in [12], we begin by assuming that, over the length of the window $L$, the speech signal $s[n]$ satisfies the convolution equation

$$
s[n] = e[n] * h[n], \qquad 0 \le n \le L - 1,
\tag{8.86}
$$

where $h[n]$ is the impulse response of the system from the point of excitation (at the glottis for voiced speech and at a constriction for unvoiced speech) to the radiation at the lips. In this analysis, the impulse response $h[n] = h_U[n]$ models the combined effects of the excitation gain, the vocal tract system, and radiation of sound at the lips for unvoiced speech, while $h[n] = h_V[n]$ contains an additional convolutional component due to the glottal pulse for voiced speech.¹²

Furthermore, we assume that the impulse response $h[n]$ is short compared to the length of the window so that the windowed segment can be represented as

$$
x[n] = w[n]\,s[n] = w[n]\,(e[n] * h[n]) \approx e_w[n] * h[n], \qquad 0 \le n \le L - 1,
\tag{8.87}
$$

where $e_w[n] = w[n]\,e[n]$; i.e., any tapering due to the analysis window is incorporated into the excitation as a slowly varying amplitude modulation. In the case of unvoiced speech, the excitation $e[n]$ would be white noise and $h[n] = h_U[n]$. In the case of voiced speech, $h[n] = h_V[n]$ and $e[n]$ would be a unit impulse train of the form

$$
e[n] = p[n] = \sum_{k=0}^{N_w - 1} \delta[n - kN_p],
\tag{8.88}
$$

where $N_w$ is the number of impulses in the window and $N_p$ is the discrete-time pitch period (measured in samples). For voiced speech, the windowed excitation is

$$
e_w[n] = w[n]\,p[n] = \sum_{k=0}^{N_w - 1} w_{N_p}[k]\,\delta[n - kN_p],
\tag{8.89}
$$

where $w_{N_p}[k]$ is the "time-sampled" window sequence defined as

$$
w_{N_p}[k] =
\begin{cases}
w[kN_p] & k = 0, 1, \ldots, N_w - 1 \\
0 & \text{otherwise.}
\end{cases}
\tag{8.90}
$$

From (8.89), the DTFT of $e_w[n]$ is

$$
E_w(e^{j\omega}) = \sum_{k=0}^{N_w - 1} w_{N_p}[k]\, e^{-j\omega k N_p} = W_{N_p}(e^{j\omega N_p}),
\tag{8.91}
$$

and from (8.91) it follows that $E_w(e^{j\omega})$ is periodic in $\omega$ with period $2\pi/N_p$. Therefore,

$$
\hat{X}(e^{j\omega}) = \log\{H_V(e^{j\omega})\} + \log\{E_w(e^{j\omega})\}
\tag{8.92}
$$

has two components: (1) $\log\{H_V(e^{j\omega})\}$, due to the vocal tract frequency response, which is slowly varying in $\omega$, and (2) $\log\{E_w(e^{j\omega})\} = \log\{W_{N_p}(e^{j\omega N_p})\}$, which is due to the excitation and is periodic with period $2\pi/N_p$.¹³ The complex cepstrum of the windowed speech segment $x[n]$ is therefore

$$
\hat{x}[n] = \hat{h}_V[n] + \hat{e}_w[n].
\tag{8.93}
$$

¹² Note that it is often convenient to incorporate the excitation gain ($A_V$ or $A_U$ in Figure 8.12) into $h[n]$ so that we can assume that $e[n]$ consists of unit impulses for voiced excitation and unit variance white noise for unvoiced excitation.

¹³ For signals sampled with sampling rate $F_s$, this period corresponds to $F_s/N_p$ Hz in cyclic analog frequency.
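The identity in Eq. (8.91) and the periodicity that follows from it can be verified numerically. The sketch below is only an illustration under stated assumptions: a Hamming window of length L = 401, a pitch period of Np = 80 samples, and a sampling rate of Fs = 8000 Hz are hypothetical values chosen here, not taken from the text, and the helper names `hamming`, `dtft`, `E_w`, and `W_Np` are likewise introduced for this example. It evaluates the DTFT of the windowed impulse train directly and compares it with the DTFT of the time-sampled window evaluated at ωNp.

```python
import cmath
import math


def hamming(length):
    """Hamming window w[n], 0 <= n <= length - 1 (assumed window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def dtft(x, omega):
    """Direct evaluation of the DTFT of a finite sequence starting at n = 0."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))


L = 401          # window length in samples (assumed)
Np = 80          # pitch period in samples (assumed)
Fs = 8000.0      # sampling rate in Hz (assumed)

w = hamming(L)
Nw = (L - 1) // Np + 1                                     # impulses inside the window
p = [1.0 if n % Np == 0 else 0.0 for n in range(L)]        # Eq. (8.88)
e_w = [w[n] * p[n] for n in range(L)]                      # Eq. (8.89)
w_Np = [w[k * Np] for k in range(Nw)]                      # Eq. (8.90)


def E_w(omega):
    """DTFT of the windowed excitation, left-hand side of Eq. (8.91)."""
    return dtft(e_w, omega)


def W_Np(omega):
    """DTFT of the time-sampled window sequence."""
    return dtft(w_Np, omega)


# Spacing of the excitation ripple in Hz, as in footnote 13.
ripple_spacing_hz = Fs / Np

# Largest deviation from Eq. (8.91) and from 2*pi/Np periodicity on a test grid.
test_omegas = [0.01 * i for i in range(1, 50)]
identity_error = max(abs(E_w(om) - W_Np(om * Np)) for om in test_omegas)
period_error = max(abs(E_w(om + 2 * math.pi / Np) - E_w(om)) for om in test_omegas)
```

Both error measures should be at the level of floating-point round-off. Because only Nw samples of the window survive the multiplication by p[n], the log spectrum of the excitation repeats every 2π/Np, which is what separates it from the slowly varying vocal tract term in Eq. (8.92).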