Proceedings Fonetik 2009 - Institutionen för lingvistik


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

to the vocal tract transfer function. Such an assumption is, to our knowledge, not supported by speech production theory.

Model variance

An additional source of difference between adults’ and children’s speech is the larger intra- and inter-speaker variability of the latter category (Potamianos and Narayanan, 2003). We account for this effect by increasing the model variances. This feature will also compensate for mismatch which can’t be modeled by the other profile features. Universal variance scaling is implemented by multiplying the diagonal covariance elements of the mixture components by a constant factor.

Experiments

One- and four-dimensional speaker profiles were used for evaluation. The one-dimensional speaker profile was frequency warping (VTLN). The four-dimensional profile consisted of frequency warping, the two voice source parameters and the variance scaling factor.

Speech corpora

The task of connected digits recognition in the mismatched case of child test data using adult training data was selected. Two corpora, the Swedish PF-Star children’s corpus (PF-Star-Sw) (Batliner et al., 2005) and TIDIGITS, were used for this purpose. In this report, we present the PF-Star results. Results on TIDIGITS will be published in other reports.

PF-Star-Sw consists of 198 children aged between 4 and 8 years. In the digit subset, each child was aurally prompted for ten 3-digit strings. Recordings were made in a separate room at day-care and after-school centers. Downsampling and re-quantization of the original specification of PF-Star-Sw was performed to 16 bits / 16 kHz.

Since PF-Star-Sw does not contain adult speakers, the training data was taken from the adult Swedish part of the SPEECON database (Großkopf et al., 2002). In that corpus, each speaker uttered one 10-digit string and four 5-digit strings, using text prompts on a computer screen.
The microphone signal was processed by an 80 Hz high-pass filter and digitized with 16 bits / 16 kHz. The same type of head-set microphone was used for PF-Star-Sw and SPEECON.

Training and evaluation sets consist of 60 speakers, resulting in a training data size of 1800 digits and a children’s test data of 1650 digits. The latter size is due to the failure of some children to produce all the three-digit strings.

The low age of the children combined with the fact that the training and testing corpora are separate makes the recognition task quite difficult.

Pre-processing and model configuration

A phone model representation of the vocabulary has been chosen in order to allow phoneme-dependent transformations. A continuous-distribution HMM system with word-internal, three-state triphone models is used. The output distribution is modeled by 16 diagonal covariance mixture components.

The cepstrum coefficients are derived from a 38-channel mel filterbank with 0-7600 Hz frequency range, 10 ms frame rate and 25 ms analysis window. The original models are trained with 18 MFCCs plus normalized log energy, and their delta and acceleration features. In the transformed models, reducing the number of MFCCs to 12 compensates for cepstral smoothing and results in a standard 39-element vector.

Test conditions

The frequency warping factor was quantized into 16 log-spaced values between 1.0 and 1.7, representing the amount of frequency expansion of the adult model spectra. The two voice source factors and the variance scaling factor, being judged as less informative, were quantized into 8 log-spaced values. The pole cut-off frequencies were varied between 100 and 4000 Hz and the variance scale factor ranged between 1.0 and 3.0.

The one-dimensional tree consists of 5 levels and 16 leaf nodes. The four-dimensional tree has the same number of levels and 8192 leaves.
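
The quantization grids described under Test conditions can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `log_spaced` and all variable names are hypothetical, the range endpoints are assumed to be included in each grid, and both voice source factors are assumed to be pole cut-off frequencies sharing the 100 to 4000 Hz range. Under those assumptions the product of the grid sizes, 16 x 8 x 8 x 8, reproduces the 8192 leaves stated for the four-dimensional tree.

```python
def log_spaced(lo, hi, n):
    """Return n values from lo to hi (both included) with a constant
    ratio between neighbouring values, i.e. evenly spaced in log scale."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio ** i for i in range(n)]


# Grid sizes and ranges are taken from the text; names are illustrative.
warp_factors = log_spaced(1.0, 1.7, 16)       # frequency warping (VTLN)
source_pole_1 = log_spaced(100.0, 4000.0, 8)  # assumed: first voice source factor, Hz
source_pole_2 = log_spaced(100.0, 4000.0, 8)  # assumed: second voice source factor, Hz
variance_scale = log_spaced(1.0, 3.0, 8)      # variance scaling factor

# Number of distinct profiles (leaf nodes) in each search space.
leaves_1d = len(warp_factors)
leaves_4d = (len(warp_factors) * len(source_pole_1)
             * len(source_pole_2) * len(variance_scale))

print(leaves_1d, leaves_4d)
```

The size of the four-dimensional space relative to the one-dimensional one (a factor of 512) illustrates why evaluating every leaf becomes expensive as profile dimensions are added.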
The exhaustive grid search was not performed for four dimensions, due to prohibitive computational requirements.

The node selection criterion during the tree search was varied to stop at different levels. An additional rule was to select the maximum-likelihood node of the traversed path from the root to a leaf node. These were compared against an exhaustive search among all leaf nodes.
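
The node selection rules above can be illustrated with a small sketch. It rests on several assumptions that the text does not confirm: the tree is descended greedily by moving to the highest-likelihood child at each level, every node already carries a precomputed log-likelihood (in the actual system this would come from scoring the utterance against the models transformed with that node's profile), and the `Node` class, function names and toy scores are purely hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    loglik: float                      # toy score, not a measured value
    children: List["Node"] = field(default_factory=list)


def greedy_path(root: Node, max_level: Optional[int] = None) -> List[Node]:
    """Descend from the root, always moving to the best-scoring child.
    Levels are counted from 1 at the root; the descent stops at
    max_level, or at a leaf when max_level is None."""
    path = [root]
    node = root
    while node.children and (max_level is None or len(path) < max_level):
        node = max(node.children, key=lambda c: c.loglik)
        path.append(node)
    return path


def select_stop_node(path: List[Node]) -> Node:
    """Rule 1: take the node where the descent stopped."""
    return path[-1]


def select_path_max(path: List[Node]) -> Node:
    """Rule 2: take the maximum-likelihood node anywhere on the path."""
    return max(path, key=lambda n: n.loglik)


def exhaustive_best_leaf(root: Node) -> Node:
    """Reference: score every leaf and keep the best one."""
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves.append(node)
    return max(leaves, key=lambda n: n.loglik)


# Toy three-level tree with invented log-likelihoods.
root = Node("root", -100.0, [
    Node("a", -90.0, [Node("a1", -95.0), Node("a2", -92.0)]),
    Node("b", -93.0, [Node("b1", -85.0), Node("b2", -97.0)]),
])

path = greedy_path(root)
print([n.name for n in path])            # nodes visited by the greedy descent
print(select_stop_node(path).name)       # node at the stopping level
print(select_path_max(path).name)        # best node on the traversed path
print(exhaustive_best_leaf(root).name)   # best leaf overall
```

In this toy tree the greedy descent visits only three nodes, while the exhaustive search scores all four leaves and finds a better one in the branch the descent never entered. That trade-off between search cost and the risk of missing the best leaf is what the comparison against the exhaustive search measures.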
