
Proceedings Fonetik 2009 - Institutionen för lingvistik


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

ency due to a shared warping factor. For the frame-based method using a memory-less Viterbi process this is not naturally accounted for.

A second reason is that in an unsupervised two-pass strategy, initial mismatch causes recognition errors which limit the performance. Ultimately, initial errors in assigning frames to group identities will bias the final recognition phase towards the erroneous identities assigned in the first pass.

The objective of this paper is to assess the impact of phoneme-specific warping on an ASR system. First, issues with phoneme-specific warping are discussed. Then an experiment is set up to measure the accuracy of a system performing phoneme-specific VTLN. The results are then presented on a connected-digit task where the recognizer was trained on adult speech and evaluated on children's speech.

Phoneme-specific VTLN

This section describes some of the challenges in phoneme-specific vocal tract length normalization.

Selection of frequency warping function

Fant (1975) made a case for vowel-category- and formant-specific scaling in contrast to uniform scaling. This requires formant tracking and subsequent calculation of formant-specific scaling factors, which is possible during manual analysis. Following an identical approach under unsupervised ASR would require automatic formant tracking, which is a non-trivial problem without a final solution (Vargas et al., 2008).

Lee and Rose (1996) avoided explicit warping of formants by applying a common frequency warping function to all formants. Since the function is the same for all formants, no formant-frequency estimation is needed when applying this method. The warping function can be linear, piece-wise linear or non-linear. Uniform scaling of the frequency interval of formants is possible using a linear or piece-wise linear function.
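
As an illustration of a piece-wise linear warping function controlled by a single warping factor, the following is a minimal sketch. The text does not give the exact form of the function, so the shape used here is an assumption: frequencies below a breakpoint are scaled by the factor `alpha`, and a second linear segment maps the upper band edge `f_max` onto itself. The names `piecewise_linear_warp`, `f_max` and `break_ratio`, as well as their default values, are hypothetical.

```python
def piecewise_linear_warp(freq_hz, alpha, f_max=8000.0, break_ratio=0.8):
    """Warp a frequency with a two-segment piece-wise linear function.

    Assumed shape (not taken from the paper): scale by alpha up to a
    breakpoint, then follow a straight line that ends at (f_max, f_max),
    so the warped axis still covers exactly 0..f_max.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= f_max:
        raise ValueError("freq_hz must lie in [0, f_max]")

    # Place the breakpoint so that alpha * f_break never exceeds
    # break_ratio * f_max; this keeps the second segment increasing.
    f_break = break_ratio * f_max / max(alpha, 1.0)

    if freq_hz <= f_break:
        # Lower segment: uniform scaling by the warping factor.
        return alpha * freq_hz

    # Upper segment: line from (f_break, alpha * f_break) to (f_max, f_max).
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (freq_hz - f_break)
```

Under these assumptions, `alpha = 1.2` maps a 1000 Hz component to 1200 Hz while the band edge `f_max` stays fixed, and `alpha = 1.0` leaves all frequencies unchanged.
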
This uniform scaling could also be extended to a rough formant scaling, using a non-linear function, under the simplifying assumption that the formant regions do not overlap.

This paper focuses on uniform scaling of all formants. For this purpose a piece-wise linear warping function is used, where the amount of warping is controlled by a warping factor.

Warp factor estimation

Given a specific form of frequency warping, the question of the degree of warping remains. In Lee and Rose (1996) this was controlled by a common warping factor for all sound classes. The amount of warping was determined by selecting the warping factor that maximized the likelihood of the warped utterance given an acoustic model. In the general case this maximization lacks a simple closed form, and the estimation therefore involves an exhaustive search over a set of warping factors.

An alternative to warping the utterance is to transform the parameters of the acoustic model towards the utterance, thereby generating a warp-specific model. In this case, warp factor selection amounts to selecting the model that best fits the data, which is a standard classification problem. So given a set of warp-specific models, one can select the model that results in the maximum likelihood of the utterance.

Phoneme-specific warp estimation

Let us consider extending the method above to a phoneme-specific case. Instead of a scalar warping factor, a vector of warping factors can be estimated, with one factor per phoneme. The task is now to find the parameter vector that maximizes the likelihood of the utterance given the warped models. In theory this results in an exhaustive search over all combinations of warping factors. For 20 phonemes with 10 warp candidates, this amounts to 10^20 likelihood calculations. This is not practically feasible, and therefore an approximate method is needed.

Miguel et al. (2005) used a two-pass strategy. During the first pass a preliminary segmentation is made.
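
The model-selection view of warp factor estimation, and the size of the joint search in the phoneme-specific case, can be sketched as follows. This is a toy illustration under strong assumptions, not the recognizer used in the paper: the acoustic model is replaced by a single one-dimensional Gaussian, and a warp-specific model is assumed to be that Gaussian with its mean scaled by the warp factor. All function names are hypothetical.

```python
import math


def gaussian_log_likelihood(frames, mean, var):
    """Log-likelihood of scalar frames under one Gaussian (toy acoustic model)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


def select_warp_factor(frames, base_mean, var, candidates):
    """Exhaustive search: pick the candidate whose warp-specific model
    gives the highest log-likelihood of the frames.

    Assumption: warping the model simply scales its mean by the factor.
    """
    return max(
        candidates,
        key=lambda alpha: gaussian_log_likelihood(frames, alpha * base_mean, var),
    )


def joint_search_size(n_phonemes, n_candidates):
    """Number of warp-factor combinations when one factor is chosen per phoneme."""
    return n_candidates ** n_phonemes


# Frames centred near 1200 against a model centred at 1000 select 1.2.
best = select_warp_factor(
    [1180.0, 1210.0, 1195.0], 1000.0, 100.0 ** 2, [0.8, 0.9, 1.0, 1.1, 1.2]
)

# 20 phonemes with 10 candidates each give 10**20 combinations.
combinations = joint_search_size(20, 10)
```

A search over a single shared factor needs one likelihood evaluation per candidate, whereas the joint search grows exponentially with the number of phonemes, which is why an approximate method is needed.
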
This segmentation is then held constant during warp estimation to allow separate warp estimates to be made for each phoneme. Both a regular recognition phase and K-means grouping have been used in their region-based extension to VTLN.

The group-based warping method above relies on a two-pass strategy where a preliminary fixed classification is used during estimation of the warp factor, which is then applied in a final recognition phase. Initial recognition errors can ultimately cause a warp to be selected that maximizes the likelihood of an erroneous identity. Application of this warping factor will then
