Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

…bias the final recognition towards the erroneous identities. The severity of this hazard depends on the number of categories used and the kind of confusions made.

An alternative to a two-pass approach is to successively revise the hypothesis of what has been said as different warping factors are evaluated. Following this line of thought leads to estimating the warp factor and determining what was said in parallel. For a speech recognizer using Viterbi decoding, this can be implemented by adding a warp dimension to the phoneme-time trellis (Miguel et al., 2005). This leads to a frame-specific warping factor. Unconstrained, this would lead to a large amount of computation. Therefore a constraint on the time derivative of the warp factor was used to limit the search space.

A slowly varying warping factor might not be realistic even though individual articulators move slowly. One reason is that, given multiple sources of sound, a switch between them can cause an abrupt change in the warping factor. This switch can for instance be between speakers, to or from non-speech frames, or a change in place and manner of articulation. The change could be performed with a small movement that causes a substantial change in the airflow path. To some extent this could perhaps be taken into account using parallel warp candidates in the beam search used during recognition.

In this paper, model-based warping is performed. For each warp setting, the likelihood of the utterance given the set of warped models is calculated using the Viterbi algorithm. The warp set chosen is the one that results in the maximum likelihood of the utterance given the warped models. In contrast to the frame-based method, long-distance dependencies are taken into account. This is handled by warping the phoneme models used to recognize what was said. Thereby each instantiation of a model during recognition is forced to share the same warping factor. This was not the case in the frame-based method, which used a memoryless Viterbi decoding scheme for warp-factor selection.

Separate recognitions for each combination of warping factors were used to avoid relying on an initial recognition phase, as was done in the region-based method.

To cope with the huge search space, two approaches were taken in the current study: reducing the number of individual warp factors by clustering phonemes together, and supervised adaptation to a target group. A sketch of the resulting search is given below.
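To make the search concrete, the sketch below enumerates all combinations of warping factors over the phoneme groups and keeps the combination that maximizes the utterance likelihood, running one full recognition per combination as described above. It is a minimal illustration, not the authors' implementation: the warp grid, the group names, and the `warp_models` and `score_utterance` callables are assumed placeholders.

```python
# Minimal sketch of the model-based warp search, assuming hypothetical
# helpers: warp_models(models, factors) applies one warping factor per
# phoneme group, and score_utterance(features, models) returns the
# Viterbi log-likelihood of the utterance given those warped models.
from itertools import product

WARP_GRID = [0.88, 0.94, 1.00, 1.06, 1.12]  # illustrative candidate factors
GROUPS = ("sil_t_k", "rest")                # two phoneme groups, as in the study

def best_warp_set(features, models, warp_models, score_utterance):
    """Run one recognition per warp-factor combination; return the best."""
    best_ll, best_factors = float("-inf"), None
    for factors in product(WARP_GRID, repeat=len(GROUPS)):
        warped = warp_models(models, dict(zip(GROUPS, factors)))
        ll = score_utterance(features, warped)  # Viterbi log-likelihood
        if ll > best_ll:
            best_ll, best_factors = ll, factors
    return best_factors, best_ll
```

With two groups and five candidate factors this already costs 25 recognitions per utterance, which is why reducing the number of individual warp factors by clustering phonemes matters.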
Experimental study

Phoneme-specific warping has been explored in terms of WER (word error rate) in an experimental study. This investigation was made on a connected-digit string task. For this aim a recognition system was trained on adult speakers. This system was then adapted towards children by performing VTLT (vocal tract length transformation). A comparison between phoneme-independent and phoneme-specific adaptation through warping the models of the recognizer was conducted. Unsupervised warping during test was also conducted using two groups of phonemes with separate warping factors. The groups were formed by placing silence, /t/, and /k/ in one group and the remaining phonemes in the other.

Speech material

The corpora used for training and evaluation contain prompted digit strings recorded one at a time. Recordings were made using directional microphones close to the mouth. The experiments were performed for Swedish using two different corpora, SpeeCon and PF-STAR, for adults and children respectively.

PF-STAR consists of children's speech in multiple languages (Batliner et al., 2005). The Swedish part consists of 198 children of 4 to 8 years repeating oral prompts spoken by an adult speaker. In this study only connected-digit strings were used, to concentrate on acoustic modeling rather than language models. Each child was orally prompted to speak 10 three-digit strings, amounting to 30 digits per speaker. Recordings were performed in a separate room at daycare and after-school centers. During these recordings sound was picked up by a head-set mounted cardioid microphone, a Sennheiser ME 104. The signal was digitized at 24 bits @ 32 kHz using an external USB-based A/D converter. In the current study the recordings were down-sampled to 16 bits @ 16 kHz to match the format used in SpeeCon (a sketch of such a conversion is given after this section).

SpeeCon consists of both adults and children down to 8 years (Großkopf et al., 2002). In this study, only digit-string recordings were used. The subjects were prompted using text on a computer screen in an office environment. Recordings were made using the same kind of microphone as was used in PF-STAR. An analog high-pass filter with a cut-off frequency of 80 Hz was used, and digital conversion was performed…
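As a minimal sketch of the down-sampling step mentioned above, assuming Python with the soundfile and scipy libraries (the paper does not state which tool was actually used), a PF-STAR recording could be converted to the SpeeCon format like this:

```python
# Hypothetical conversion of a PF-STAR recording (24-bit @ 32 kHz)
# to the SpeeCon format (16-bit @ 16 kHz); the library choice and
# function names are assumptions, not the authors' procedure.
import soundfile as sf
from scipy.signal import resample_poly

def to_speecon_format(src_path: str, dst_path: str) -> None:
    audio, rate = sf.read(src_path)             # floats in [-1, 1]
    if rate != 32000:
        raise ValueError(f"expected 32 kHz input, got {rate} Hz")
    audio = resample_poly(audio, up=1, down=2)  # 32 kHz -> 16 kHz, anti-aliased
    sf.write(dst_path, audio, 16000, subtype="PCM_16")  # quantize to 16 bits
```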
