Proceedings Fonetik 2009 - Department of Linguistics

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

provides a mechanism for feedback from speech recognition research to speech production knowledge.

Method

Tree generation

The tree is generated using a top-down design in the speaker profile domain, followed by a bottom-up merging process in the acoustic model domain. Initially, the root node is loaded with the full, sorted list of values for each dimension in the speaker profile vector. A number of child nodes are created, whose lists are obtained by binary splitting of each dimension list in the mother node. This tree generation process proceeds until each dimension list has a single value, which defines a leaf node. In this node, the dimension values define a unique speaker profile vector. This vector is used to predict a profile-specific model set by controlling the transformation of a conventionally trained original model set. When all child node models of a certain mother node have been created, they are merged into a model set at their mother node. The merging procedure is repeated upwards in the tree until the root model is reached. Each node in the tree now contains a model set which is defined by its list of speaker profile values. All models in the tree have the same structure and number of parameters.

Search procedure

During recognition of an utterance, the tree is used to select the speaker profile whose model set maximizes the score of the utterance. The recognition procedure starts by evaluating the child nodes of the root. The maximum-likelihood-scoring child node is selected for further search. This is repeated until a stop criterion is met, which can be that the leaf level or a specified intermediate level is reached. Another selection criterion may be the maximum-scoring node along the selected root-to-leaf path (path-max).
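The tree generation and greedy search described above can be sketched as follows. The node layout, the binary-split rule, the cross-product construction of child nodes, and the `score()` callback are illustrative assumptions, since the paper does not give an implementation:

```python
from itertools import product

def split(values):
    """Binary-split a sorted value list into two halves."""
    mid = len(values) // 2
    return values[:mid], values[mid:]

def build_tree(dim_lists):
    """Top-down tree over speaker-profile dimensions.

    dim_lists: one sorted value list per profile dimension.
    A leaf is reached when every dimension list holds a single value,
    i.e. a unique speaker profile vector.
    """
    node = {"dims": dim_lists, "children": []}
    if all(len(d) == 1 for d in dim_lists):
        return node  # leaf node
    # Child nodes are formed by binary splitting each dimension list
    # (assumed here: one child per combination of halves).
    halves = [split(d) if len(d) > 1 else (d,) for d in dim_lists]
    for combo in product(*halves):
        node["children"].append(build_tree([list(c) for c in combo]))
    return node

def greedy_search(root, score):
    """Descend the tree, keeping the maximum-likelihood child at each level.

    score(node) -> float is assumed to evaluate the utterance against
    the node's merged model set.
    """
    node = root
    while node["children"]:
        node = max(node["children"], key=score)
    return node  # selected leaf-level speaker profile
```

For two dimensions with two values each, the root gets four leaf children, and the search reduces to picking the best-scoring one; a path-max variant would instead keep the best node seen along the descent.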
This would account for the possibility that the nodes close to the leaves fit partial properties of a test speaker well but have to be combined with sibling nodes to give an overall good match.

Model transformations

We have selected a number of speaker properties to evaluate our multi-dimensional estimation approach. The current set contains a few basic properties, described below. These are similar, although not identical, to our work in (Blomberg and Elenius, 2008). Further development of the set will be addressed in future work.

VTLN

An obvious candidate as one element in the speaker profile vector is Vocal Tract Length Normalisation (VTLN). In this work, a standard two-segment piece-wise linear warping function projects the original model spectrum onto its warped spectrum. The procedure can be performed efficiently as a matrix multiplication in the standard acoustic representation of current speech recognition systems, MFCC (Mel Frequency Cepstral Coefficients), as shown by Pitz and Ney (2005).

Spectral slope

Our main intention with this feature is to compensate for differences in the voice source spectrum. However, since the operation is currently performed on all models, unvoiced and non-speech models will also be affected. The feature will thus perform an overall compensation of mismatch in spectral slope, whether caused by the voice source or by the transmission channel.

We use a first-order low-pass function to approximate the gross spectral shape of the voice source function. This corresponds to the effect of the parameter Ta in the LF voice source model (Fant, Liljencrants and Lin, 1985). In order to correctly modify a model in this feature, it is necessary to remove the characteristics of the training data and to insert those of the test speaker.
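A two-segment piece-wise linear warping function of the kind mentioned above can be sketched on the frequency axis as follows; the breakpoint `f0` and warping factor `alpha` are illustrative parameters, and the paper applies the equivalent transform as a matrix multiplication directly on MFCCs rather than on frequencies:

```python
def vtln_warp(f, alpha, f0, f_nyq):
    """Two-segment piece-wise linear VTLN frequency warping.

    Below breakpoint f0 the warp is linear with slope alpha; above it,
    a second linear segment is chosen so that the Nyquist frequency
    maps onto itself (a common convention, assumed here).
    """
    if f <= f0:
        return alpha * f
    # Second segment connects (f0, alpha*f0) to (f_nyq, f_nyq).
    slope = (f_nyq - alpha * f0) / (f_nyq - f0)
    return alpha * f0 + slope * (f - f0)
```

With `alpha = 1.1`, `f0 = 6000` and an 8 kHz Nyquist frequency, 3000 Hz warps to 3300 Hz while the band edge stays fixed at 8000 Hz.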
A transformation of this feature thus involves two parameters: an inverse filter for the training data and a filter for the test speaker.

This two-stage normalization technique gives us the theoretically attractive possibility to use separate transformations for the vocal tract transfer function and the voice source spectrum (at least in these parameters). After the inverse filter, there remains (in theory) only the vocal tract transfer function. Performing frequency warping at this position in the chain will thus not affect the original voice source of the model. The new source characteristics are inserted after the warping and are also unaffected. In contrast, conventional VTLN implicitly warps the voice source spectrum identically
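The two-stage slope transformation can be sketched on a magnitude spectrum as follows. The cutoff frequencies standing in for the training-data and test-speaker source shapes are hypothetical parameters, and operating on a sampled magnitude spectrum is a simplification of the model-domain transform the paper describes:

```python
import numpy as np

def lowpass_mag(f, fc):
    """Magnitude response of a first-order low-pass with cutoff fc,
    used here to approximate the gross voice source spectral shape."""
    return 1.0 / np.sqrt(1.0 + (f / fc) ** 2)

def retilt(spectrum, f, fc_train, fc_test):
    """Two-stage spectral-slope transformation.

    Stage 1: inverse-filter the training-data slope out of the spectrum.
    Stage 2: insert the test speaker's slope.
    Frequency warping applied between the two stages would leave both
    source spectra unaffected (see text).
    """
    inverse = spectrum / lowpass_mag(f, fc_train)  # remove training slope
    return inverse * lowpass_mag(f, fc_test)       # insert test slope
```

If the input spectrum is exactly the training low-pass shape, the output is exactly the test speaker's low-pass shape, which is the intended behaviour of the normalization.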
