20.07.2014 Views

eurasip2007-groupdel.. - SRI Speech Technology and Research ...

eurasip2007-groupdel.. - SRI Speech Technology and Research ...

eurasip2007-groupdel.. - SRI Speech Technology and Research ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

6 EURASIP Journal on Audio, <strong>Speech</strong>, <strong>and</strong> Music Processing<br />

production, the transfer function of such a system is given<br />

by:<br />

H(z) =<br />

∑<br />

∑ k b k z k<br />

k a k z k . (20)<br />

The transfer function of the same system for the production<br />

of vowel assuming an all pole model is given by<br />

1<br />

H(z) = ∑ k=p<br />

k=0 a kz , k (21)<br />

1<br />

H(z) =<br />

1+ ∑ k=p<br />

k=1 a kz . k (22)<br />

Let the vowel be characterized by the frequencies F1, F2, F3.<br />

Hence the poles of the system are located at<br />

p = re ¦jωi . (23)<br />

By substituting (23)in(21), the system function for production<br />

of the ith formant now becomes<br />

H i (z) =<br />

But from resonance theory<br />

1<br />

1 2r cos ω i Tz 1 + r 2 z 2 . (24)<br />

r = e πBiT . (25)<br />

By substituting (25)in(24), the system function in (24)now<br />

becomes<br />

H i (z) =<br />

1<br />

1 2e πBiT cos ω i Tz 1 + e 2πBiT z 2 . (26)<br />

In the above array of equations, ω i corresponds to the ith<br />

formant frequency, B i to the b<strong>and</strong>width of the ith formant<br />

frequency, <strong>and</strong> T to the sampling period. Using (26), we<br />

generate a synthetic vowel with the following values: F1 =<br />

500 Hz, F2 = 1500 Hz, F3 = 3500 Hz, B i = 10% of F i ,<br />

<strong>and</strong> T = 0.0001 second corresponding to a sampling rate of<br />

10 KHz. Note that F1, F2, <strong>and</strong> F3 are the formant frequencies<br />

in Hz. We then extract the MODGDF, MFCC, <strong>and</strong> joint<br />

features (MODGDF + MFCC) from the synthesized vowel.<br />

To reconstruct the formants, we use the algorithm described<br />

above. The reconstructed formant structures derived from<br />

the MODGDF, MFCC, joint features (MODGDF + MFCC),<br />

<strong>and</strong> also RASTA filtered MFCC are shown in Figures 3(a),<br />

3(b), 3(c),<strong>and</strong>3(d), respectively. The illustrations are shown<br />

as spectrogram like plots 2 where the data along the Y-axis<br />

correspond to the DFT bins <strong>and</strong> the x-axis corresponds to<br />

the frame number. It is interesting to note that while the<br />

formants are reconstructed accurately by both the MOD-<br />

GDF <strong>and</strong> the MFCC as in Figures 3(a) <strong>and</strong> 3(b),respectively,<br />

joint features (MODGDF + MFCC) combine the formant<br />

2 The differences between the subplots are better visualized in color than in<br />

gray scale.<br />

Frequency<br />

(FFT bins)<br />

Frequency<br />

(FFT bins)<br />

Frequency<br />

(FFT bins)<br />

Frequency<br />

(FFT bins)<br />

250<br />

150<br />

50<br />

250<br />

150<br />

50<br />

250<br />

150<br />

50<br />

250<br />

150<br />

50<br />

10 20 30 40 50 60 70 80 90<br />

Frame number<br />

(a)<br />

10 20 30 40 50 60 70 80 90<br />

Frame number<br />

(b)<br />

10 20 30 40 50 60 70 80 90<br />

Frame number<br />

(c)<br />

10 20 30 40 50 60 70 80 90<br />

Frame number<br />

(d)<br />

0.1<br />

0.08<br />

0.06<br />

0.04<br />

0.4<br />

0.3<br />

0.2<br />

0.1<br />

0.25<br />

0.15<br />

0.05<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

Figure 3: Spectrogram-like plots to illustrate formant reconstructions<br />

for a synthetic vowel. (a) The short-time modified group delay<br />

spectra reconstructed from MODGDF, (b) the short-time power<br />

spectra reconstructed from MFCC, (c) the short-time composite<br />

spectra reconstructed from joint features (MODGDF+MFCC), <strong>and</strong><br />

(d) the short-time power spectra reconstructed from RASTA filtered<br />

MFCC.<br />

information in the individual features as in Figure 3(c). Itis<br />

expected that RASTA filtered MFCC shown in Figure 3(d)<br />

does not capture the formant structure for the synthetic signal.<br />

RASTA filtered MFCC reconstructions are illustrated<br />

here only to show that while the MODGDF works well on<br />

clean speech, RASTA MFCC fails, which is quite expected.<br />

Although one may hypothesize that individual features capture<br />

formant information well <strong>and</strong> the need for joint features<br />

is really not there, it would be significant to note that joint<br />

features combine information gathered from individual featuresasinFigure<br />

3(c). Similar spectrogram-like plots to illustrate<br />

formant reconstructions for a synthetic speech signal<br />

with varying formant trajectories are shown in Figure 4.Itis<br />

interesting to note that in Figure 4(a), all the 3 formants are<br />

visible. In Figure 4(b), while the first 2 formants are visible,<br />

the third formant is not clearly visible. In Figure 4(c), while<br />

the first 2 formants are clear, the third formant is further emphasized.<br />

Hence it is clear that joint features are able to combine<br />

the information that is available in both the MODGDF<br />

<strong>and</strong> MFCC.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!