Number of SGMM substates     1800   2700   4k     6k     9k     12k    16k
SGMM                         51.6   50.9   50.6   50.1   49.9   49.3   49.4
Per-speaker fMLLR            49.7   49.4   48.7   48.3   48.0   47.6   47.6
Per-utterance fMLLR          51.1   50.7   50.3   49.8   49.5   49.1   49.2
Per-utterance + subspaces    50.2   49.9   49.5   48.9   48.6   48.0   47.9

Table 1. fMLLR adaptation results (in % WER).

where the \tilde{W}_b form an orthonormal basis, i.e., \mathrm{tr}(\tilde{W}_b \tilde{W}_c^T) = \delta(b, c). With this subspace approach, Equation (34) is modified as:

  \tilde{\Delta} = \frac{1}{\beta} \sum_{b=1}^{B} \tilde{W}_b \, \mathrm{tr}(\tilde{W}_b \tilde{P}^T).   (43)

Note that in this method of calculation, the quantities \alpha_b^{(s)} are implicit and are never referred to in the calculation, but the updated W will still be constrained by the subspace. This simplifies the coding procedure, but at the cost of a slightly higher memory and storage requirement.

6.1. Training the bases

The auxiliary function improvement in the transformed space can be computed as \frac{1}{2} \mathrm{tr}(\tilde{\Delta} \tilde{P}^T) (up to a linear approximation). This is the same as \frac{1}{2} \mathrm{tr}\big(\frac{1}{\sqrt{\beta}} \tilde{P} \, \frac{1}{\sqrt{\beta}} \tilde{P}^T\big). So the auxiliary function improvement is the trace of the scatter of \frac{1}{\sqrt{\beta}} \tilde{P} projected onto the subspace.

The first step in training the basis is to compute the quantity \frac{1}{\sqrt{\beta}} \tilde{P}^{(s)} for each speaker s. We then compute the scatter matrix:

  S = \sum_s \mathrm{vec}\big(\tfrac{1}{\sqrt{\beta}} \tilde{P}^{(s)}\big) \, \mathrm{vec}\big(\tfrac{1}{\sqrt{\beta}} \tilde{P}^{(s)}\big)^T,   (44)

where \mathrm{vec}(M) denotes concatenating the rows of a matrix M into a vector. The column vectors u_b corresponding to the top B singular values in the SVD of S, S = U L V^T, give the bases \tilde{W}_b, i.e., u_b = \mathrm{vec}(\tilde{W}_b).

7. EXPERIMENTS

Our experiments are with an SGMM-style system on the CALLHOME English database; see [4] for system details. Results are without speaker adaptive training.

In Table 1 we show adaptation results for SGMM systems of varying model complexity [4]. We can see that the proposed method for fMLLR provides substantial improvements over an unadapted SGMM baseline when adapting using all the available data for a particular speaker.
The improvements are consistent with those obtained by a standard implementation of fMLLR over a baseline system that uses conventional GMMs.

When adapting per utterance (i.e., with little adaptation data), we see that normal fMLLR adaptation provides very modest gains (we use a minimum of 100 speech frames for adaptation, which gives good performance). However, using the subspace fMLLR with B = 100 basis transforms W_b (and the same minimum of 100 frames), we are able to get performance that is comparable to per-speaker adaptation.

8. CONCLUSIONS

In this paper we presented a novel estimation algorithm for fMLLR transforms with full-covariance models, which iteratively finds the gradient in a transformed space where the expected Hessian is proportional to unity. The proposed algorithm provides large improvements over a competitive unadapted SGMM baseline on an LVCSR task. It is also used to estimate a subspace-constrained fMLLR, which provides better results with limited adaptation data. The algorithm itself is independent of the SGMM framework and can be applied to any HMM that uses GMM emission densities.

9. REFERENCES

[1] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75-98, April 1998.

[2] K. Visweswariah, V. Goel, and R. Gopinath, "Structuring linear transforms for adaptation using training time information," in Proc. IEEE ICASSP, 2002, vol. 1, pp. 585-588.

[3] K. C. Sim and M. J. F. Gales, "Adaptation of precision matrix models on large vocabulary continuous speech recognition," in Proc. IEEE ICASSP, 2005, vol. I, pp. 97-100.

[4] D. Povey et al., "Subspace Gaussian mixture models for speech recognition," submitted to ICASSP, 2010.

[5] S. Axelrod et al., "Subspace constrained Gaussian mixture models for speech recognition," IEEE Trans. Speech Audio Process., vol. 13, no. 6, pp. 1144-1160, 2005.

[6] D. Povey, "A tutorial-style introduction to subspace Gaussian mixture models for speech recognition," Tech. Rep. MSR-TR-2009-111, Microsoft Research, 2009.

A. CALCULATING OPTIMAL STEP SIZE

The auxiliary function in the step size k is:

  Q(k) = \beta \log \det(A + k \Delta_{1:d,1:d}) + k m - \frac{1}{2} k^2 n,   (45)

  m = \mathrm{tr}(\Delta K^T) - \mathrm{tr}(\Delta G^T)   (46)

  n = \sum_j \mathrm{tr}\big(\Delta^T \Sigma_j^{-1} \Delta G_j\big)  (type 1)   (47)

  n = \sum_k \mathrm{tr}\big(\Delta^T A_k \Delta G_k\big)  (type 2)   (48)

where \Delta_{1:d,1:d} is the first d columns of \Delta. We use a Newton's method optimization for k. After computing

  B = (A + k \Delta_{1:d,1:d})^{-1} \Delta_{1:d,1:d}   (49)

  d_1 = \beta \, \mathrm{tr}(B) + m - k n   (50)

  d_2 = -\beta \, \mathrm{tr}(B B) - n   (51)

where d_1 and d_2 are the first and second derivatives of (45) with respect to k, we update k as:

  \hat{k} = k - d_1 / d_2.   (52)

At this point we check that Q(\hat{k}) \geq Q(k). If Q(\cdot) decreases, we keep halving the step, \hat{k} \leftarrow (k + \hat{k})/2, until Q(\hat{k}) \geq Q(k). The final k should typically be close to 1.
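The Newton update with step-halving described in Appendix A can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are invented for this sketch, the determinant is evaluated via slogdet (i.e., log |det|) for numerical safety, and Delta_d stands for the square block \Delta_{1:d,1:d}.

```python
import numpy as np

def aux_q(k, beta, A, Delta_d, m, n):
    """Q(k) of Eq. (45); slogdet returns log|det| without overflow."""
    _, logdet = np.linalg.slogdet(A + k * Delta_d)
    return beta * logdet + k * m - 0.5 * k * k * n

def optimal_step_size(beta, A, Delta_d, m, n, num_iters=10):
    """Newton's method for the step size k, Eqs. (49)-(52),
    with the step-halving safeguard from Appendix A."""
    k = 0.0
    for _ in range(num_iters):
        # Eq. (49): B = (A + k*Delta_d)^{-1} Delta_d
        B = np.linalg.solve(A + k * Delta_d, Delta_d)
        d1 = beta * np.trace(B) + m - k * n      # Eq. (50): first derivative
        d2 = -beta * np.trace(B @ B) - n         # Eq. (51): second derivative
        k_hat = k - d1 / d2                      # Eq. (52): Newton update
        # Halve the step toward k until Q no longer decreases.
        for _ in range(20):
            if aux_q(k_hat, beta, A, Delta_d, m, n) >= aux_q(k, beta, A, Delta_d, m, n):
                break
            k_hat = (k + k_hat) / 2.0
        k = k_hat
    return k
```

As a sanity check on synthetic scalar inputs (an assumption, not data from the paper): with beta = 1, A = [[1]], Delta_d = [[1]], m = n = 1, Q(k) = log(1 + k) + k - k^2/2 is maximized at k = sqrt(2), which the routine recovers.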
