Proceedings Fonetik 2009 - Institutionen för lingvistik
Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

On extending VTLN to phoneme-specific warping in automatic speech recognition

Daniel Elenius and Mats Blomberg
Department of Speech, Music and Hearing, KTH, Stockholm

Abstract
Phoneme- and formant-specific warping has been shown to decrease formant and cepstral mismatch. These findings have not yet been fully implemented in speech recognition. This paper discusses a few reasons why this may be the case. A small experimental study is also included, in which phoneme-independent warping is extended towards phoneme-specific warping. The results of this investigation did not show a significant decrease in error rate during recognition. This is in line with earlier experiments on the methods discussed in the paper.

Introduction
In ASR, mismatch between training and test conditions degrades performance. Much effort has therefore been invested in reducing this mismatch through normalization of the input speech and adaptation of the acoustic models towards the current test condition.

Phoneme-specific frequency scaling of a speech spectrum between speaker groups has been shown to reduce formant distance (Fant, 1975) and cepstral distance (Potamianos and Narayanan, 2003). Frequency scaling has also been performed as part of vocal tract length normalization (VTLN) to reduce spectral mismatch caused by speakers having different vocal tract lengths (Lee and Rose, 1996). However, in contrast to the findings above, this scaling is normally made without regard to sound class. Why has phoneme-specific frequency scaling in VTLN not yet been fully implemented in ASR (automatic speech recognition) systems?

Formant frequency mismatch was reduced by about one half when formant- and vowel-category-specific warping was applied, compared to uniform scaling (Fant, 1975). Phoneme-specific warping without formant-specific scaling has also been beneficial in terms of reducing cepstral distance (Potamianos and Narayanan, 2003).
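To make the notion of a warping factor concrete, the following is a minimal sketch of the piecewise-linear frequency warping commonly used in VTLN implementations. It is an illustration, not the implementation used in any of the cited studies; the breakpoint fraction (0.875 of the Nyquist frequency) and the per-category warp factors in the usage example are assumptions chosen for the sketch.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_max=8000.0, f_break_frac=0.875):
    """Warp frequency axis by factor alpha.

    Below the breakpoint fb the axis is scaled linearly by alpha;
    above fb a second linear segment keeps f_max mapped to f_max,
    so the warped axis stays within the analysis bandwidth.
    """
    f = np.asarray(f, dtype=float)
    fb = f_break_frac * f_max
    # slope of the upper segment, chosen so that f_max -> f_max
    slope = (f_max - alpha * fb) / (f_max - fb)
    return np.where(f <= fb, alpha * f, alpha * fb + slope * (f - fb))

# Usage sketch: phoneme-specific warping simply selects a different
# alpha per sound class (factors below are invented for illustration).
class_alpha = {"a": 1.08, "i": 0.96, "sil": 1.00}
freqs = np.linspace(0.0, 8000.0, 5)
warped_a = piecewise_linear_warp(freqs, class_alpha["a"])
```

Conventional VTLN would apply one alpha to the whole utterance; phoneme-specific warping replaces that single factor with a per-class lookup as above.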
In that study it was also found that warp factors differed more between phonemes for younger children than for older ones. The authors did not implement automatic selection of warp factors to be used during recognition. One reason presented was that the gain in practice could be limited by the need to correctly estimate a large number of warp factors. Phone clustering was suggested as a method to limit the number of warping factors that need to be estimated.

One method used in ASR is VTLN, which performs frequency warping during analysis of an utterance to reduce spectral mismatch caused by speakers having different vocal tract lengths (Lee and Rose, 1996). They steered the degree of warping by a time-independent warping factor which optimized the likelihood of the utterance given an acoustic model, using the maximum likelihood criterion. The method has been frequently used in recognition experiments with both adults and children (Welling, Kanthak and Ney, 1999; Narayanan and Potamianos, 2002; Elenius and Blomberg, 2005; Giuliani, Gerosa and Brugnara, 2006). A limitation of this approach is that time-invariant warping results in all phonemes, as well as non-speech segments, sharing a common warping factor.

In recent years, increased interest has been directed towards time-varying VTLN (Miguel et al., 2005; Maragakis et al., 2008). The former method estimates a frame-specific warping factor during a memory-less Viterbi decoding process, while the latter uses a two-pass strategy where warping factors are estimated based on an initial grouping of speech frames. The former method focuses on revising the hypothesis of what was said during warp estimation, while the latter focuses on sharing the same warp factor within each given group. Phoneme-specific warping can be implemented to some degree with either of these methods: either by explicitly forming phoneme-specific groups, or implicitly by estimating frame-specific warp factors.

However, none of the methods above presents a complete solution for phoneme-specific warping. One reason is that more than one instantiation of a phoneme can occur far apart in time. This introduces a long-distance dependency.
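The maximum-likelihood warp selection described above can be sketched as a grid search: for each candidate factor, the utterance is re-analysed with that warp and scored against the acoustic model, and the best-scoring factor is kept. The sketch below assumes a hypothetical `score_fn(utterance, alpha)` returning the model log-likelihood under warp `alpha`; the grid range (0.88 to 1.12 in steps of 0.02) is a typical choice but an assumption here, not taken from the cited work.

```python
import numpy as np

def select_warp_factor(utterance, score_fn,
                       alphas=np.arange(0.88, 1.13, 0.02)):
    """ML warp selection: return the alpha that maximizes the
    model score of the warped utterance (Lee & Rose style grid search)."""
    scores = [score_fn(utterance, a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

def select_group_warps(utterance, groups, score_fn,
                       alphas=np.arange(0.88, 1.13, 0.02)):
    """Two-pass-style extension: one warp factor per frame group
    (e.g. per phoneme class), instead of one per utterance."""
    return {g: select_warp_factor(frames, score_fn, alphas)
            for g, frames in groups.items()}

# Usage sketch with a toy score function peaking at alpha = 1.04:
toy_score = lambda u, a: -(a - 1.04) ** 2
best = select_warp_factor(None, toy_score)
```

With a single utterance-level call this is conventional VTLN; calling it once per phoneme group, as in `select_group_warps`, is the phoneme-specific extension, and the long-distance dependency noted above arises because one group's frames may be scattered across the utterance.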
