Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Estimating speaker characteristics for speech recognition

Mats Blomberg and Daniel Elenius
Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm

Abstract

A speaker-characteristic-based hierarchic tree of speech recognition models is designed. The leaves of the tree contain model sets, which are created by transforming a conventionally trained set using leaf-specific speaker profile vectors. The non-leaf models are formed by merging the models of their child nodes. During recognition, a maximum likelihood criterion is followed to traverse the tree from the root to a leaf. The computational load for estimating one- (vocal tract length) and four-dimensional speaker profile vectors (vocal tract length, two spectral slope parameters and model variance scaling) is reduced to a fraction of that of an exhaustive search among all leaf nodes. Recognition experiments on children’s connected digits using adult models exhibit similar recognition performance for the exhaustive and the one-dimensional tree search. Further error reduction is achieved with the four-dimensional tree. The estimated speaker properties are analyzed and discussed.

Introduction

Knowledge of speech production can play an important role in speech recognition by imposing constraints on the structure of trained and adapted models. In contrast, current conventional, purely data-driven speaker adaptation techniques put little constraint on the models. This makes them sensitive to recognition errors, and they require a sufficiently high initial accuracy in order to improve the quality of the models.

Several speaker characteristic properties have been proposed for this type of adaptation. The most commonly used is compensation for mismatch in vocal tract length, performed by Vocal Tract Length Normalization (VTLN) (Lee and Rose, 1998).
Other candidates, less explored, are voice source quality, articulation clarity, speech rate, accent, emotion, etc.

However, there are at least two problems connected to the approach. One is to establish the quantitative relation between the property and its acoustic manifestation. The second problem is that the estimation of these features quickly becomes computationally heavy, since each candidate value has to be evaluated in a complete recognition procedure, and the number of candidates needs to be sufficiently high to give the required precision of the estimate. This problem becomes particularly severe if more than one property is to be jointly optimized, since the number of evaluation points equals the product of the numbers of candidates for the individual properties. Two-stage techniques, e.g. (Lee and Rose, 1998) and (Akhil et al., 2008), reduce the computational requirements, unfortunately at the price of lower recognition performance, especially if the accuracy of the first recognition stage is low.

In this work, we approach the problem of excessive computational load by representing the range of the speaker profile vector as quantized values in a multi-dimensional binary tree. Each node contains an individual value, or an interval, of the profile vector and a corresponding model set. The standard exhaustive search for the best model among the leaf nodes can now be replaced by a traversal of the tree from the root to a leaf. This results in a significant reduction of the computational load.

There is an important argument for structuring the tree based on speaker characteristic properties rather than on acoustic observations. If we know the acoustic effect of modifying a certain property of this kind, we can predict models of speaker profiles outside their range in the adaptation corpus.
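The difference in cost between the exhaustive search and the root-to-leaf traversal described above can be illustrated with a small toy sketch. It is not the implementation used in this work. The following are assumptions made only for illustration: the candidate grids (`warp`, `slope`), the scoring function (a stand-in for the likelihood obtained from a full recognition pass), the rule of splitting the dimension that still covers the most candidates, and the use of an interval's mean value in place of a merged model set.

```python
from itertools import product


def make_score(true_profile):
    """Toy stand-in for the likelihood of an utterance given a model set.

    Assumption: a real system would run the recognizer with the node's
    model set; here the score simply peaks at a hidden 'true' profile.
    """
    def score(profile):
        return -sum((p - t) ** 2 for p, t in zip(profile, true_profile))
    return score


def centre(ranges, grids):
    """Profile vector of a node: mean of the candidate values it covers."""
    return tuple(
        sum(grid[lo:hi]) / (hi - lo) for (lo, hi), grid in zip(ranges, grids)
    )


def exhaustive_search(grids, score):
    """Score every leaf, i.e. every combination of candidate values."""
    leaves = list(product(*grids))
    return max(leaves, key=score), len(leaves)


def tree_search(grids, score):
    """Descend from the root to a leaf, scoring only the two children
    of the current node at each level."""
    ranges = [(0, len(grid)) for grid in grids]  # the root covers every candidate
    evaluations = 0
    while any(hi - lo > 1 for lo, hi in ranges):
        # Assumed split rule: halve the dimension covering the most candidates.
        dim = max(range(len(ranges)), key=lambda d: ranges[d][1] - ranges[d][0])
        lo, hi = ranges[dim]
        mid = (lo + hi) // 2
        children = []
        for child_range in ((lo, mid), (mid, hi)):
            child = list(ranges)
            child[dim] = child_range
            children.append(child)
        evaluations += 2
        # Follow the child with the higher score.
        ranges = max(children, key=lambda r: score(centre(r, grids)))
    return centre(ranges, grids), evaluations


# Hypothetical candidate grids: 8 warp factors and 4 spectral slope values.
warp = [0.80 + 0.05 * i for i in range(8)]
slope = [-6.0, -2.0, 2.0, 6.0]
score = make_score((1.07, 1.5))

leaf_exhaustive, n_exhaustive = exhaustive_search([warp, slope], score)
leaf_tree, n_tree = tree_search([warp, slope], score)
print(leaf_exhaustive, n_exhaustive)
print(leaf_tree, n_tree)
```

With these toy grids the exhaustive search scores 8 × 4 = 32 leaves, the product of the numbers of candidates, whereas the tree search scores two children at each of five levels, i.e. 10 nodes. Both arrive at the same leaf here because the toy score has a single peak; with real likelihoods the two searches are not guaranteed to agree.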
Extrapolating beyond the adaptation corpus in this way is generally not possible with the standard acoustic-only representation. In this report, we evaluate the prediction performance by training the models on adult speech and evaluating the recognition accuracy on children’s speech. The achieved results exhibit a substantial reduction in computational load while maintaining performance similar to that of an exhaustive grid search technique.

In addition to the recognized identity, the speaker properties are also estimated. As these can be represented in acoustic-phonetic terms, they are easier to interpret than the standard model parameters used in a recognizer. This
