eurasip2007-groupdel.. - SRI Speech Technology and Research ...
eurasip2007-groupdel.. - SRI Speech Technology and Research ...
eurasip2007-groupdel.. - SRI Speech Technology and Research ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
6 EURASIP Journal on Audio, <strong>Speech</strong>, <strong>and</strong> Music Processing<br />
production, the transfer function of such a system is given<br />
by:<br />
H(z) =<br />
∑<br />
∑ k b k z k<br />
k a k z k . (20)<br />
The transfer function of the same system for the production<br />
of vowel assuming an all pole model is given by<br />
1<br />
H(z) = ∑ k=p<br />
k=0 a kz , k (21)<br />
1<br />
H(z) =<br />
1+ ∑ k=p<br />
k=1 a kz . k (22)<br />
Let the vowel be characterized by the frequencies F1, F2, F3.<br />
Hence the poles of the system are located at<br />
p = re ¦jωi . (23)<br />
By substituting (23)in(21), the system function for production<br />
of the ith formant now becomes<br />
H i (z) =<br />
But from resonance theory<br />
1<br />
1 2r cos ω i Tz 1 + r 2 z 2 . (24)<br />
r = e πBiT . (25)<br />
By substituting (25)in(24), the system function in (24)now<br />
becomes<br />
H i (z) =<br />
1<br />
1 2e πBiT cos ω i Tz 1 + e 2πBiT z 2 . (26)<br />
In the above array of equations, ω i corresponds to the ith<br />
formant frequency, B i to the b<strong>and</strong>width of the ith formant<br />
frequency, <strong>and</strong> T to the sampling period. Using (26), we<br />
generate a synthetic vowel with the following values: F1 =<br />
500 Hz, F2 = 1500 Hz, F3 = 3500 Hz, B i = 10% of F i ,<br />
<strong>and</strong> T = 0.0001 second corresponding to a sampling rate of<br />
10 KHz. Note that F1, F2, <strong>and</strong> F3 are the formant frequencies<br />
in Hz. We then extract the MODGDF, MFCC, <strong>and</strong> joint<br />
features (MODGDF + MFCC) from the synthesized vowel.<br />
To reconstruct the formants, we use the algorithm described<br />
above. The reconstructed formant structures derived from<br />
the MODGDF, MFCC, joint features (MODGDF + MFCC),<br />
<strong>and</strong> also RASTA filtered MFCC are shown in Figures 3(a),<br />
3(b), 3(c),<strong>and</strong>3(d), respectively. The illustrations are shown<br />
as spectrogram like plots 2 where the data along the Y-axis<br />
correspond to the DFT bins <strong>and</strong> the x-axis corresponds to<br />
the frame number. It is interesting to note that while the<br />
formants are reconstructed accurately by both the MOD-<br />
GDF <strong>and</strong> the MFCC as in Figures 3(a) <strong>and</strong> 3(b),respectively,<br />
joint features (MODGDF + MFCC) combine the formant<br />
2 The differences between the subplots are better visualized in color than in<br />
gray scale.<br />
Frequency<br />
(FFT bins)<br />
Frequency<br />
(FFT bins)<br />
Frequency<br />
(FFT bins)<br />
Frequency<br />
(FFT bins)<br />
250<br />
150<br />
50<br />
250<br />
150<br />
50<br />
250<br />
150<br />
50<br />
250<br />
150<br />
50<br />
10 20 30 40 50 60 70 80 90<br />
Frame number<br />
(a)<br />
10 20 30 40 50 60 70 80 90<br />
Frame number<br />
(b)<br />
10 20 30 40 50 60 70 80 90<br />
Frame number<br />
(c)<br />
10 20 30 40 50 60 70 80 90<br />
Frame number<br />
(d)<br />
0.1<br />
0.08<br />
0.06<br />
0.04<br />
0.4<br />
0.3<br />
0.2<br />
0.1<br />
0.25<br />
0.15<br />
0.05<br />
0.6<br />
0.4<br />
0.2<br />
0<br />
Figure 3: Spectrogram-like plots to illustrate formant reconstructions<br />
for a synthetic vowel. (a) The short-time modified group delay<br />
spectra reconstructed from MODGDF, (b) the short-time power<br />
spectra reconstructed from MFCC, (c) the short-time composite<br />
spectra reconstructed from joint features (MODGDF+MFCC), <strong>and</strong><br />
(d) the short-time power spectra reconstructed from RASTA filtered<br />
MFCC.<br />
information in the individual features as in Figure 3(c). Itis<br />
expected that RASTA filtered MFCC shown in Figure 3(d)<br />
does not capture the formant structure for the synthetic signal.<br />
RASTA filtered MFCC reconstructions are illustrated<br />
here only to show that while the MODGDF works well on<br />
clean speech, RASTA MFCC fails, which is quite expected.<br />
Although one may hypothesize that individual features capture<br />
formant information well <strong>and</strong> the need for joint features<br />
is really not there, it would be significant to note that joint<br />
features combine information gathered from individual featuresasinFigure<br />
3(c). Similar spectrogram-like plots to illustrate<br />
formant reconstructions for a synthetic speech signal<br />
with varying formant trajectories are shown in Figure 4.Itis<br />
interesting to note that in Figure 4(a), all the 3 formants are<br />
visible. In Figure 4(b), while the first 2 formants are visible,<br />
the third formant is not clearly visible. In Figure 4(c), while<br />
the first 2 formants are clear, the third formant is further emphasized.<br />
Hence it is clear that joint features are able to combine<br />
the information that is available in both the MODGDF<br />
<strong>and</strong> MFCC.