Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Training and recognition experiments were conducted using HTK (Young et al., 2005). Separate software was developed for the transformation and the model tree algorithms.

Results and discussion

Recognition results for the one- and four-element speaker profiles are presented in Table 1 for different search criteria, together with a baseline result for non-transformed models. The error rate of the one-dimensional tree-based search was as low as that of the exhaustive search, at a fraction (25-50%) of the computational load. This result is especially positive, considering that the latter search is guaranteed to find the global maximum-likelihood speaker vector (a schematic sketch of the greedy descent is given below, after Figure 2).

Even the profile-independent root node provides a substantial improvement over the baseline result. Since no estimation procedure is involved, this saves considerable computation.

With the four-dimensional speaker profile, the computational load is less than 1% of that of the exhaustive search. A minimum error rate is reached at stop levels two and three below the root. Four features yield consistent improvements over the single feature, except for the root criterion. Clearly, vocal tract length is very important, but spectral slope and variance scaling also make positive contributions.

Table 1. Number of recognition iterations and word error rate for the one- and four-dimensional speaker profiles.

Search alg.    No. iterations      WER (%)
               1-D      4-D        1-D     4-D
Baseline              1                32.2
Exhaustive      16     8192        11.5     -
Root             1        1        11.9    13.9
Level 1          2       16        12.2    11.1
Level 2          4       32        11.5    10.2
Level 3          6       48        11.2    10.2
Leaf             8       50        11.2    10.4
Path-max         9       51        11.9    11.6

Histograms of warp factors for individual utterances are presented in Figure 1. The distributions for the exhaustive and the 1-dimensional leaf search are very similar, which corresponds well with their small difference in recognition error rate. The 4-dimensional leaf search distribution differs from these, mainly in the peak region. The cause of its bimodal character calls for further investigation. A possible explanation may lie in the fact that the reference models are trained on both male and female speakers. Distinct parts of the trained models have probably been assigned to these two categories. The two peaks might reflect that some utterances are adjusted to the female parts of the models while others are adjusted to the male parts. This might be better captured by the more detailed four-dimensional estimation.

Figure 1. Histogram of estimated frequency warp factors for the three estimation techniques (1-dim exhaustive, 1-dim tree, 4-dim tree); warp factor on the horizontal axis, number of utterances on the vertical axis.

Figure 2 shows scatter diagrams of the average warp factor per speaker vs. body height for the one- and four-dimensional search. The largest difference between the plots occurs for the shortest speakers, for which the four-dimensional search shows more realistic values. This indicates that the latter makes more accurate estimates in spite of its larger deviation from a Gaussian distribution in Figure 1. This is also supported by a stronger correlation between warp factor and height (-0.55 vs. -0.64).

Figure 2. Scatter diagrams of warp factor vs. body height for the one- (left) and four-dimensional (right) search. Each sample point is an average of all utterances of one speaker.
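As a rough illustration of the two search procedures compared in Table 1, the following Python sketch shows one way a greedy model-tree descent over candidate speaker-profile vectors could be organised, next to an exhaustive evaluation of all candidates. It is a minimal sketch under stated assumptions: the node layout, the Profile and Score types, the stop_level argument and the scoring callback (standing for the log-likelihood of the utterance under models transformed with a candidate profile) are illustrative and not taken from the implementation used in the experiments.

```python
# Minimal sketch of greedy tree-based search over candidate speaker profiles,
# compared with exhaustive evaluation of all candidates. Illustration only:
# node layout, types and stop-level handling are assumptions.

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Profile = Tuple[float, ...]           # e.g. (warp factor, slope, var. scaling, ...)
Score = Callable[[Profile], float]    # log-likelihood of the utterance under
                                      # models transformed with this profile

@dataclass
class Node:
    profile: Profile                  # representative profile for this subtree
    children: List["Node"] = field(default_factory=list)

def exhaustive_search(candidates: List[Profile], score: Score) -> Profile:
    """Score every candidate profile; guaranteed to find the global maximum."""
    return max(candidates, key=score)

def tree_search(root: Node, score: Score, stop_level: Optional[int] = None) -> Profile:
    """Greedy descent: at each level, score the children of the current node and
    follow the best one; return the profile of the node where the descent stops.
    stop_level corresponds to the 'Level n' stop criteria in Table 1; None
    descends to a leaf. (A 'Path-max' criterion would instead return the
    best-scoring node encountered anywhere along the path.)"""
    node, level = root, 0
    while node.children and (stop_level is None or level < stop_level):
        node = max(node.children, key=lambda child: score(child.profile))
        level += 1
    return node.profile
```

With 16 candidate warp factors arranged as a binary tree, such a descent scores two children per level over four levels, on the order of eight likelihood evaluations instead of sixteen; in four dimensions the corresponding figure is a few tens of evaluations against 8192, which is in line with the iteration counts in Table 1.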
The operation of the spectral shape compensation is presented in Figure 3 as an average function over the speakers and for the two speakers with the largest positive and negative deviations from the average. The average function indicates a slope compensation of the frequency region below around 500 Hz.
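To make the idea of such a shape compensation concrete, the sketch below applies a simple low-frequency tilt to log filterbank energies. This is only an illustration under stated assumptions: the piecewise-linear form, the explicit 500 Hz knee and the parameter name slope_db_per_oct are invented for the example, whereas in the experiments the compensation function is estimated per speaker rather than fixed to this shape.

```python
import numpy as np

# Illustrative sketch: tilt the spectrum below a knee frequency by a fixed
# number of dB per octave. Form and parameters are assumptions, not the
# estimated compensation function described in the text.

def compensate_spectrum(log_energies: np.ndarray,
                        centre_freqs_hz: np.ndarray,
                        slope_db_per_oct: float,
                        knee_hz: float = 500.0) -> np.ndarray:
    """Boost (or attenuate) channels below knee_hz by slope_db_per_oct dB/octave."""
    octaves_below_knee = np.maximum(0.0, np.log2(knee_hz / centre_freqs_hz))
    correction_db = slope_db_per_oct * octaves_below_knee
    # log_energies are assumed to be natural-log power; convert dB to that scale
    return log_energies + correction_db * np.log(10.0) / 10.0
```

For example, with channel centre frequencies of 125, 250, 500 and 1000 Hz and a slope of 3 dB per octave, the correction boosts the two lowest channels by 6 and 3 dB respectively and leaves channels at or above the knee unchanged.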
