Discriminative Feature-Space Transforms Using Deep Neural Networks


3.1.2. Acoustic models

Words in the recognition lexicon are represented as sequences of phones, and phones are modeled with a 3-state left-to-right HMM topology that does not allow state skipping. Context-dependent states are obtained by growing phonetic decision trees which can ask questions within a ±2-phone cross-word context window. The acoustic models have 2200 HMM states with 64 Gaussians per state and are speaker-adaptively trained with feature-space MLLR (FMLLR) [7]. At test time, speaker adaptation is performed with VTLN, FMLLR and multiple regression-tree-based MLLR transforms. The recognition vocabulary contains 90K words, and decoding is done with a 4-gram language model containing 4M n-grams trained with modified Kneser-Ney smoothing.

3.1.3. Baseline discriminative training

Discriminative training is performed in both feature and model space using a variant of the MMI criterion called boosted (or margin-based) MMI (BMMI) [8]. The objective function used is a modification of BMMI proposed in [9], which uses a frame-based, state-level loss function instead of a phone-based accuracy measure. For the baseline model, we train two-level FMMI transforms with offset features following the recipe in [10]. The Gaussian posteriors and offset features used as input for the first-level transform are provided by 512 diagonal-covariance Gaussians. The second-level (or context) transform spans ±8 frames. We used 4 iterations of BMMI to train the transform, followed by an additional 4 BMMI iterations to estimate the Gaussian parameters in the resulting FMMI space. Table 1 summarizes the word error rates and BMMI objective function values for the baseline acoustic models after the different training steps (ML, FMMI, FMMI with model-space BMMI).
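For concreteness, the boosted MMI criterion of [8] maximizes (a sketch in our notation; the source's exact equations are not reproduced here)

    \mathcal{F}_{\mathrm{BMMI}} = \sum_{u} \log \frac{p(X_u \mid w_u)^{\kappa}\, P(w_u)}{\sum_{w} p(X_u \mid w)^{\kappa}\, P(w)\, e^{-b\, A(w,\, w_u)}}

where X_u is the observation sequence of utterance u, w_u its reference transcript, \kappa the acoustic scale, and b the boosting factor. Boosting scales up the contribution of competitors w in proportion to their error: A(w, w_u) is the accuracy of hypothesis w against the reference, computed here with the frame-based, state-level measure of [9] rather than phone accuracy.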
Table 1: Word error rates and BMMI objective function values for baseline acoustic models.

    Training  | WER   | BMMI objfun.
    ----------|-------|-------------
    ML        | 23.6% | 0.16
    FMMI      | 20.3% | 0.18
    FMMI-BMMI | 18.7% | 0.20

3.2. Neural network FMMI experiments

3.2.1. Architecture

The networks that were trained have the following configuration (a sketch of the resulting transform is given after the list):

• An input layer of dimension 17×40, corresponding to ±8 input frames (the same input as regular FMMI)
• A variable number of hidden layers with a hyperbolic tangent non-linearity
• An output layer with 40 units, also with a hyperbolic tangent non-linearity
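To make the data flow concrete, here is a minimal numpy sketch of applying such a transform to one utterance. It assumes that, as with linear FMMI, the 40-dimensional network output is used as an additive offset to the original features; the function and variable names are ours, not the paper's.

    import numpy as np

    def nn_fmmi_transform(frames, weights, biases):
        """Apply an NN-FMMI transform to one utterance (illustrative sketch).

        frames  : (T, 40) array of acoustic features
        weights : list of weight matrices, e.g. [(680, 1024), (1024, 40)]
        biases  : matching list of bias vectors
        """
        T, _ = frames.shape
        # Splice +/-8 neighboring frames into a 17x40 = 680-dim input,
        # replicating the first/last frame at the utterance boundaries.
        padded = np.pad(frames, ((8, 8), (0, 0)), mode="edge")
        spliced = np.stack([padded[t:t + 17].ravel() for t in range(T)])

        # Forward pass: hyperbolic tangent on every layer, including
        # the 40-unit output layer (Section 3.2.1).
        h = spliced
        for W, b in zip(weights, biases):
            h = np.tanh(h @ W + b)

        # Assumption: the output acts as an additive feature offset,
        # by analogy with linear FMMI offset features.
        return frames + h

    # Example with random weights for a 17x40/1024/40 configuration.
    rng = np.random.default_rng(0)
    W = [rng.uniform(-0.01, 0.01, s) for s in [(680, 1024), (1024, 40)]]
    b = [np.zeros(1024), np.zeros(40)]
    out = nn_fmmi_transform(rng.standard_normal((300, 40)), W, b)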
3.2.2. Direct training

In an initial experiment, we trained two NN-FMMI transforms, each with one hidden layer (1000 and 2048 units, respectively), directly with BMMI, starting from weights initialized uniformly over [−0.01, 0.01]. Training used stochastic updates after each utterance with a fixed learning rate η = 1e-5. The BMMI objective function improved only from 0.160 to 0.162 for both networks after 20 passes through the training data (epochs). The word error rate obtained with the smaller network is 23.1%. Compared to the values in Table 1, the BMMI objective function improved very little, which suggests that the BMMI criterion is hard to optimize directly and that training gets stuck in a poor local optimum very early.
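The update itself is ordinary gradient backpropagation, with the sign flipped because BMMI is maximized. Below is a minimal sketch of one per-utterance step; the gradient of the objective with respect to the network outputs would in practice come from lattice-based forward-backward statistics, which we stub out with synthetic values, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # 17x40/1000/40 network, weights initialized uniformly over
    # [-0.01, 0.01] as in Section 3.2.2.
    dims = [680, 1000, 40]
    weights = [rng.uniform(-0.01, 0.01, (m, n)) for m, n in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(n) for n in dims[1:]]

    def bmmi_sgd_step(spliced, dF_dout, eta=1e-5):
        """One per-utterance stochastic update.

        spliced : (T, 680) spliced input frames of one utterance
        dF_dout : (T, 40) gradient of the BMMI objective w.r.t. the
                  network outputs (stubbed below with random values).
        """
        # Forward pass, caching activations (tanh on every layer).
        acts = [spliced]
        for W, b in zip(weights, biases):
            acts.append(np.tanh(acts[-1] @ W + b))

        # Backpropagation; gradient ASCENT since BMMI is maximized.
        delta = dF_dout
        for i in reversed(range(len(weights))):
            delta = delta * (1.0 - acts[i + 1] ** 2)  # tanh'(z) = 1 - tanh(z)^2
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            delta = delta @ weights[i].T              # propagate before updating
            weights[i] += eta * grad_W
            biases[i] += eta * grad_b

    # Synthetic stand-in for one 300-frame utterance and its BMMI gradient.
    bmmi_sgd_step(rng.standard_normal((300, 680)), rng.standard_normal((300, 40)))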
3.2.3. MSE pretraining using linear FMMI offsets

In a subsequent experiment, we trained the NN-FMMI transforms with minimum squared error (MSE) between the outputs and the offsets produced by the best linear FMMI transform. The hope was that squared error is easier to optimize and that learning the linear FMMI offsets provides a good starting point for further optimization. We confirm this conjecture in Figure 1, where we plot the BMMI objective function for a 3-layer network (17x40/2048/40) with and without pretraining.

[Figure 1: BMMI objective function (0.160–0.178) versus training epoch (0–7) for NN-FMMI with one hidden layer, with and without MSE pretraining.]
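A minimal sketch of this pretraining step follows: the network is fit by squared-error gradient descent to the 40-dimensional offsets that the linear FMMI transform produces for the same frames. The learning rate and the synthetic data are our stand-ins, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative 17x40/2048/40 network (Section 3.2.3).
    W1 = rng.uniform(-0.01, 0.01, (680, 2048)); b1 = np.zeros(2048)
    W2 = rng.uniform(-0.01, 0.01, (2048, 40));  b2 = np.zeros(40)

    def mse_pretrain_step(x, fmmi_offsets, eta=1e-4):
        """One gradient-descent step on the squared error between the
        network outputs and the linear FMMI offsets for the same frames."""
        global W1, b1, W2, b2
        h = np.tanh(x @ W1 + b1)           # hidden layer
        y = np.tanh(h @ W2 + b2)           # predicted 40-dim offsets
        err = y - fmmi_offsets             # d(MSE)/dy up to a constant
        d2 = err * (1.0 - y ** 2)          # back through the output tanh
        d1 = (d2 @ W2.T) * (1.0 - h ** 2)  # back through the hidden tanh
        W2 -= eta * h.T @ d2; b2 -= eta * d2.sum(axis=0)
        W1 -= eta * x.T @ d1; b1 -= eta * d1.sum(axis=0)
        return float(np.mean(err ** 2))

    # One step on synthetic data: 300 spliced frames and their offsets.
    loss = mse_pretrain_step(rng.standard_normal((300, 680)),
                             rng.standard_normal((300, 40)))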
Table 2: Comparison of word error rates for models with NN-FMMI transforms after MSE pretraining.

    Architecture  | MSE   | BMMI
    --------------|-------|------
    17x40/1000/40 | 21.5% | –
    17x40/2048/40 | 21.1% | 20.7%
    17x40/4096/40 | 21.1% | –

In Table 2 we report the word error rates for three NN-FMMI transforms which differ in the size of the hidden layer after MSE pretraining. We also show the BMMI refinement result for the middle configuration. As can be seen, the performance improvement saturates as more units are added to the hidden layer. This prompted us to look at stacking the layers to create a deep neural network architecture.

3.2.4. Layer stacking

Here, the training proceeds as follows (see the sketch after this paragraph). The network is grown one layer at a time and initialized with the MSE-pretrained weights from the previous network. MSE pretraining is done for all the layers, not just for the newly added weights. Once the desired configuration is reached, all the weights are trained according to the BMMI objective function.
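A structural sketch of this greedy growth is below. Where exactly the fresh layer is inserted and how it is initialized are our assumptions; the paper specifies only that the existing layers keep the MSE-pretrained weights of the previous network and that MSE pretraining is then re-run over all layers. The mse_pretrain and bmmi_train calls are placeholders for the procedures of Sections 3.2.3 and 3.2.2.

    import numpy as np

    rng = np.random.default_rng(0)

    def grow_network(weights, biases, hidden_dim=1024):
        """Insert one freshly initialized hidden layer before the output
        layer; all previously pretrained layers are kept as-is."""
        weights.insert(-1, rng.uniform(-0.01, 0.01, (hidden_dim, hidden_dim)))
        biases.insert(-1, np.zeros(hidden_dim))
        return weights, biases

    # Greedy stacking: 17x40/1024*1/40 -> ... -> 17x40/1024*5/40.
    dims = [680, 1024, 40]
    weights = [rng.uniform(-0.01, 0.01, (m, n)) for m, n in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(n) for n in dims[1:]]
    for n_hidden in range(1, 6):
        # mse_pretrain(weights, biases)   # re-pretrain ALL layers (placeholder)
        if n_hidden < 5:
            weights, biases = grow_network(weights, biases)
    # bmmi_train(weights, biases)         # final BMMI refinement (placeholder)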
Table 3: Comparison of word error rates for NN-FMMI transforms after MSE pretraining and BMMI refinement.

    Architecture    | MSE   | BMMI  | Model-space
    ----------------|-------|-------|------------
    17x40/1024*1/40 | 21.5% | –     | –
    17x40/1024*2/40 | 20.8% | –     | –
    17x40/1024*3/40 | 20.8% | 20.1% | –
    17x40/1024*4/40 | 20.8% | 19.8% | 17.9%
    17x40/1024*5/40 | 20.7% | 19.9% | –

[Figure 2: MSE objective functions on held-out data versus epoch (0–30) for NN-FMMIs with 1 through 5 hidden layers (17x40/1024*1/40 to 17x40/1024*5/40).]

[Figure 3: BMMI objective functions versus epoch (0–16) for NN-FMMIs with 3, 4 and 5 hidden layers.]

In Figure 2 we show the MSE objective functions on held-out data (1/10th of the training data). The jumps correspond to the learning-rate annealing steps. Correspondingly, we plot the BMMI objective functions for the networks with 3, 4 and 5 hidden layers in Figure 3. Unlike the results from Section 3.2.2, here the BMMI criterion improves for networks with more parameters because of the pretraining step. The increased BMMI values also lead to improved word error rates, as shown in Table 3. The best result of 17.9% is obtained after model-space training for a 6-layer NN-FMMI transform.

4. Conclusion

We have presented a discriminative feature transform formulation using deep neural networks which are trained through boosted MMI gradient backpropagation. Pretraining is done one layer at a time by learning the offsets of a linear FMMI transform. Experimental results on English broadcast news transcription show the superiority of NN-FMMI over regular FMMI after model-space discriminative training. Because of the pretraining step, NN-FMMI requires the estimation of a linear FMMI transform first. Future work will address ways of removing this dependency by directly training the network with a regularized BMMI objective function and by employing second-order optimization methods [11], which have proven successful for neural network acoustic modeling [12].

5. Acknowledgments

This work was supported in part by DARPA under Grant HR0011-12-C-0015.³

6. References

[1] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Discriminatively trained features for speech recognition,” in Proc. ICASSP, 2005, pp. 961–964.
[2] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[3] F. Seide, G. Li, X. Chen, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011.
[4] G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.
[5] H. Hermansky, D. P. W. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. ICASSP, 2000.
[6] F. Grézl, M. Karafiát, S. Kontár, and J. Černocký, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Proc. ICASSP, 2007.
[7] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[8] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. ICASSP, 2008, pp. 4057–4060.
[9] G. Saon and D. Povey, “Penalty function maximization for large margin HMM training,” in Proc. Interspeech, 2008, pp. 920–923.
[10] D. Povey, “Improvements to fMPE for discriminative training of features,” in Proc. Interspeech, 2005, pp. 2977–2980.
[11] J. Martens, “Deep learning via Hessian-free optimization,” in Proc. ICML, 2010.
[12] B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable minimum Bayes risk training of neural network acoustic models using distributed Hessian-free optimization,” in Proc. Interspeech, 2012 (submitted).

³ Approved for Public Release, Distribution Unlimited. The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.
