
A versatile Gaussian splitting approach to non-linear state estimation and its application to noise-robust ASR

Volker Leutnant, Alexander Krueger, Reinhold Haeb-Umbach
Department of Communications Engineering, University of Paderborn, Germany
{leutnant,krueger,haeb}@nt.uni-paderborn.de

Abstract

In this work, a splitting and weighting scheme that allows for splitting a Gaussian density into a Gaussian mixture density (GMM) is extended to allow the mixture components to be arranged along arbitrary directions. The parameters of the Gaussian mixture are chosen such that the GMM and the original Gaussian still exhibit equal central moments up to an order of four. The covariances of the resulting mixture components have eigenvalues that are smaller than those of the covariance of the original distribution, which is a desirable property in the context of non-linear state estimation, since the underlying assumptions of the extended Kalman filter are better justified in this case. Application to speech feature enhancement in the context of noise-robust automatic speech recognition reveals the beneficial properties of the proposed approach in terms of a reduced word error rate on the Aurora 2 recognition task.

Index Terms: density splitting, moment matching

1. Introduction

The application of model-based speech feature enhancement to the prominent noise-robustness problem of today's automatic speech recognition (ASR) systems has gained considerable interest in recent years. Thereby, conditional Bayesian estimation is employed to infer the uncorrupted speech feature vectors from the corrupted observations, based on a priori models of the cepstral speech feature vectors x_t and the noise-only feature vectors n_t and a highly non-linear observation model relating the two to the corrupted observations y_t [1]:

    y_t = D log( e^{D⁺ x_t} + e^{D⁺ n_t} + 2 α_t ⊙ e^{D⁺ (x_t + n_t)/2} ),    (1)

where the exponential and the logarithm are applied element-wise and ⊙ denotes the element-wise product. Thereby, D and D⁺ denote the discrete cosine transform (DCT) matrix and its pseudo-inverse, respectively. Several approximations have been proposed to model the observation probability density p(y_t | x_t, n_t).
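As a concrete illustration of the observation model (1), the following numpy sketch evaluates y_t for given cepstral speech, noise, and phase-factor vectors. The function names are ours, and for simplicity a square orthonormal DCT-II matrix is used, so that D⁺ = D′; in a real front end D is a truncated DCT and D⁺ a true pseudo-inverse.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (one common normalization; the paper
    does not specify which variant is used)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def observe(x, n, alpha, D, D_pinv):
    """Phase-sensitive observation model, eq. (1):
    y = D log( e^{D+ x} + e^{D+ n} + 2 alpha * e^{D+ (x+n)/2} ),
    with exp/log element-wise and alpha multiplying element-wise."""
    ex = np.exp(D_pinv @ x)
    en = np.exp(D_pinv @ n)
    cross = 2.0 * alpha * np.exp(D_pinv @ (x + n) / 2.0)
    return D @ np.log(ex + en + cross)

# toy example in a 5-dimensional cepstral space
rng = np.random.default_rng(0)
D = dct_matrix(5)
D_pinv = np.linalg.pinv(D)     # equals D.T for the square orthonormal DCT
x, n = rng.normal(size=5), rng.normal(size=5)
y0 = observe(x, n, np.zeros(5), D, D_pinv)   # alpha = 0: phase ignored
```

Setting alpha to zero recovers the phase-insensitive model mentioned next; for x = n it reduces to y = x + log(2) · (D 1), which makes a convenient sanity check.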
The most prominent and at the same time simplest approach neglects any phase difference between the speech and the noise signal. However, it is well known and has already been verified experimentally that the more accurate model of (1), which includes the phase factor α_t, results in improved performance [2]. Since a numerical evaluation of the occurring integrals to obtain the observation probability p(y_t | x_t, n_t) is computationally very demanding, it usually is still approximated by a Gaussian density.

The two most prominent filters for non-linear estimation problems are the unscented [3] and the extended Kalman filter [4]. The unscented filter determines the mean and covariance of this Gaussian by means of so-called sigma points, which are drawn from a joint Gaussian a priori density of x_t and n_t and later on propagated through the non-linearity. In contrast, the extended Kalman filter utilizes lower-order terms of the Taylor series expansion of the non-linearity to obtain estimates of the mean and the covariance, and in general it is very sensitive to the choice of the expansion vector. Only when the spectral radius of the covariance matrix of the joint Gaussian prior is sufficiently small are the underlying assumptions approximately valid.

Driven by this idea, Alspach and Sorenson proposed to model the original a priori distribution by a Gaussian mixture model whose individual components exhibit smaller eigenvalues than the original distribution [5]. Only recently, van der Merwe et al. [6] and Faubel et al. [7] adopted this approach. While van der Merwe et al. applied a weighted expectation-maximization algorithm to samples drawn from the a priori density to obtain the parameters of a GMM, Faubel et al. proposed to iteratively increase the number of mixture components by splitting a selected Gaussian along one of the eigenvectors of the corresponding covariance. The mixture component to split and the eigenvector the split is performed along are thereby determined by the degree of non-linearity [7].
Both approaches can be considered computationally expensive, and thus, to the best of our knowledge, few if any attempts have been made to apply the concept to the problem of speech feature enhancement as is done here. Further, we will extend the theory to splits along arbitrary directions, not restricted to those defined by the eigenvectors of the covariance matrix.

The paper begins with a short review of approximating a GMM by a single Gaussian by means of minimizing their Kullback-Leibler divergence. Since the same criterion cannot be applied to split a Gaussian into a GMM without any constraints on the parameter set, a splitting and weighting scheme which allows for multiple splits in arbitrary directions is introduced next, followed by a discussion of the sensitivity to the choice of the parameters. The splitting and weighting scheme is then applied to model-based speech feature enhancement, and its performance is evaluated by means of recognition results on the Aurora 2 database [8].

2. Merging a GMM into a Single Gaussian

Given a Gaussian mixture probability density p(z), z ∈ R^D,

    p(z) = Σ_{m=0}^{M−1} w_m p_m(z) = Σ_{m=0}^{M−1} w_m N(z; μ_m, Σ_m),    (2)

where w_m, μ_m and Σ_m denote the weight, the mean and the covariance of the m-th Gaussian mixture component p_m(z), m ∈ {0, ..., M−1}, respectively, merging the GMM into a single Gaussian q(z) = N(z; μ̃, Σ̃) usually refers to finding its mean μ̃ and covariance Σ̃ such that the Kullback-Leibler (KL) divergence between p(z) and q(z) is minimized, resulting in

    μ̃ = Σ_{m=0}^{M−1} w_m μ_m,    (3)

    Σ̃ = Σ_{m=0}^{M−1} w_m Σ_m + Σ_{m=0}^{M−1} w_m (μ_m − μ̃)(μ_m − μ̃)′,    (4)

where (·)′ denotes the matrix/vector transpose operator. The covariance Σ̃ can thereby be regarded as being composed of a "within class" term S_W = Σ_{m=0}^{M−1} w_m Σ_m and a "between class" term S_B = Σ_{m=0}^{M−1} w_m (μ_m − μ̃)(μ_m − μ̃)′. This approach is also referred to as moment matching, for obvious reasons.

3. Splitting a Single Gaussian into a GMM

While merging a GMM into a single Gaussian by minimizing their KL-divergence is common practice, the objective here is the opposite: splitting a Gaussian into a GMM consisting of an arbitrary number of mixture components M. A splitting and a weighting scheme which allow for multiple splits in arbitrary directions are described next. To keep the objective of minimizing the KL-divergence, both schemes are driven by the key idea of matching the moments of the GMM to those of the Gaussian.

3.1. Splitting in an Arbitrary Direction

Splitting the Gaussian q(z) = N(z; μ̃, Σ̃) into a GMM p(z) with M = 2K+1 components along an arbitrary direction u_l ∈ R^D can be achieved by first shifting the mean 2K times and second transforming the covariance of the shifted density. Arranging the means of the M mixture components symmetrically with respect to μ̃ along u_l with an equidistant spacing, while transforming all covariances in the same manner, results in a GMM that is fully specified by the parameter set

    μ_0 = μ̃,                Σ_0,              w_0,
    μ_k = μ̃ + η k u_l,       Σ_k = Σ_0,        w_k,              (5)
    μ_{K+k} = μ̃ − η k u_l,   Σ_{K+k} = Σ_k,    w_{K+k} = w_k,

where k ∈ {1, ..., K}. All weights are positive and normalized such that w_0 + Σ_{k̃=1}^{2K} w_k̃ = 1. The parameter η ∈ R_{≥0} determines the distance between two adjacent means in terms of the length of the vector u_l.

The symmetric setup ensures that the first central moments of the Gaussian and the GMM match. Applying (4) and solving for Σ_0 to also match the second central moment yields

    Σ_0 = Σ̃ − β u_l u_l′,    β = 2η² Σ_{k=1}^{K} w_k k²,    (6)

where the parameter β ∈ R_{≥0} controls the covariance reduction, which necessarily has to be a function of the mean displacement η if the moments are to match.
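As a numerical illustration, the following numpy sketch (function names are ours, not from the paper) implements the moment-matching merge of (3)–(4) and the single-direction split of (5)–(6); merging the split components back recovers the original mean and covariance.

```python
import numpy as np

def merge_gmm(w, mus, Sigs):
    """Moment matching, eqs. (3)-(4): collapse a GMM to one Gaussian."""
    w = np.asarray(w, float)
    mus, Sigs = np.asarray(mus, float), np.asarray(Sigs, float)
    mean = w @ mus                                       # eq. (3)
    d = mus - mean
    cov = (np.einsum('m,mij->ij', w, Sigs)               # "within class" S_W
           + np.einsum('m,mi,mj->ij', w, d, d))          # "between class" S_B
    return mean, cov

def split_gaussian(mean, cov, u, eta, w0, wk):
    """Split N(mean, cov) into M = 2K+1 components along u, eqs. (5)-(6).
    wk holds (w_1, ..., w_K); the mirrored components reuse these weights,
    and the weights must satisfy w0 + 2*sum(wk) = 1."""
    wk = np.asarray(wk, float)
    K = len(wk)
    k = np.arange(1, K + 1)
    beta = 2.0 * eta**2 * np.sum(wk * k**2)              # eq. (6)
    Sig0 = cov - beta * np.outer(u, u)                   # eq. (6)
    mus = np.concatenate(([mean],
                          mean + eta * k[:, None] * u,
                          mean - eta * k[:, None] * u))
    w = np.concatenate(([w0], wk, wk))
    return w, mus, np.repeat(Sig0[None], 2 * K + 1, axis=0)

# split a 2-D Gaussian along its dominant eigen-direction with K = 1;
# u is the eigenvector scaled to sqrt(lambda_1), so beta in [0, 1] is safe
mean, cov = np.zeros(2), np.diag([4.0, 1.0])
u = np.array([2.0, 0.0])
w, mus, Sigs = split_gaussian(mean, cov, u, eta=0.5, w0=0.5, wk=[0.25])
```

Here β = 2 · 0.5² · 0.25 = 0.125, so Σ_0 = diag(3.5, 1): every component is narrower than the original Gaussian along the split direction, while the mixture as a whole keeps the original first two moments.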
Calling for Σ_0 to be symmetric and non-negative definite poses constraints on either the length of u_l or on β. While the symmetry is obviously given for any vector u_l ∈ R^D, the non-negative definiteness property

    z′ (Σ̃ − β u_l u_l′) z ≥ 0   ∀ z ∈ R^D    (7)

is preserved only by vectors in a certain subset of R^D. Expressing the covariance Σ̃ in terms of an orthonormal matrix composed of its eigenvectors, V = [v_1, ..., v_D], and a diagonal matrix composed of its eigenvalues, Λ = diag([λ_1, ..., λ_D]), one obtains

    z′ (V Λ V′ − β u_l u_l′) z ≥ 0   ∀ z ∈ R^D,    (8)

which is equivalent to

    z̃′ (I − β Λ^{−1/2} V′ u_l u_l′ V Λ^{−1/2}) z̃ ≥ 0   ∀ z̃ ∈ R^D,    (9)

with z̃ = Λ^{1/2} V′ z, since Λ^{1/2} V′ is full-rank.

The matrix I − β (Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′ in turn is non-negative definite if the eigenvalues of β (Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′ are smaller than or equal to one. Assuming without loss of generality that β is bound to the interval [0, 1], non-negative definiteness can be ensured by normalizing the vector u_l by the square root of the maximum eigenvalue of (Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′, which is just the length of the vector Λ^{−1/2} V′ u_l. The normalized vector ŭ_l is thus given by

    ŭ_l = u_l / √((Λ^{−1/2} V′ u_l)′ (Λ^{−1/2} V′ u_l)) = u_l / √(u_l′ Σ̃^{−1} u_l).    (10)

Note that if u_l coincides with an arbitrary eigenvector v_d of Σ̃, ŭ_l turns into ŭ_l = √λ_d v_d / ||v_d||.

3.2. Simultaneous Splitting in Multiple Directions

The extension to a simultaneous split in L arbitrary directions u_1, ..., u_L can be carried out in a similar fashion. The parameters of the resulting GMM consisting of M = 2KL+1 mixture components are given by

    μ_0 = μ̃,                  Σ_0 = Σ̃ − β Σ_{l=1}^{L} u_l u_l′,   w_0,
    μ_k^l = μ̃ + η k u_l,       Σ_k^l = Σ_0,                        w_k^l,              (11)
    μ_{K+k}^l = μ̃ − η k u_l,   Σ_{K+k}^l = Σ_k^l,                  w_{K+k}^l = w_k^l,

where l ∈ {1, ..., L} and k ∈ {1, ..., K}.
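The guarantee provided by the normalization (10) is easy to check numerically. The sketch below (our own naming, assuming numpy) normalizes a random direction against a random SPD covariance and verifies that the reduced covariance stays non-negative definite even at the extreme β = 1, along with the eigenvector special case noted above.

```python
import numpy as np

def normalize_direction(u, cov):
    """Normalize a split direction per eq. (10):
    u_norm = u / sqrt(u' Sigma^{-1} u), which guarantees that
    Sigma_0 = Sigma - beta * u_norm u_norm' is non-negative definite
    for any beta in [0, 1]."""
    return u / np.sqrt(u @ np.linalg.solve(cov, u))

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
cov = A @ A.T + 3.0 * np.eye(3)        # random symmetric positive-definite covariance
u = rng.normal(size=3)
un = normalize_direction(u, cov)

# even the maximal reduction beta = 1 leaves Sigma_0 non-negative definite
Sig0 = cov - 1.0 * np.outer(un, un)
```

At β = 1 the variance along the split direction is removed entirely, so Σ_0 acquires a zero eigenvalue but never a negative one.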
All split directions u_l again have to be normalized, this time by the square root of the maximum eigenvalue of Σ_{l=1}^{L} (Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′, to preserve the non-negative definiteness of all covariances.

Applying (4) to match the second central moment calls for

    [ Σ_{l=1}^{L} u_l u_l′ ] [ β − 2η² Σ_{k=1}^{K} w_k^l k² ] = 0,    (12)

which can simply be met for an arbitrary number of split directions L if the weights are chosen to be independent of the index l. The parameter η determining the displacement of the mixture means and the parameter β controlling the change in the covariances are thus again (compare (6)) related by

    β = 2η² Σ_{k=1}^{K} w_k^l k².    (13)

Since the normalization of the split directions requires 0 ≤ β ≤ 1, the mean shift η will be bound to η ∈ [0, η_max], which in general will depend on L and K. If the weights are independent of η, a closed-form solution for the upper bound η_max can be derived from (13). However, if the weights depend on η, a closed-form solution usually does not exist, as will be outlined next.

3.3. Weighting Scheme

Equation (13) relates the covariance reduction parameter β to the mean shift η and the chosen weights w_k^l. Weighting schemes where the center weight w_0 is specified and the remaining probability mass is equally distributed among the remaining 2KL non-center components to satisfy w_0 + 2 Σ_{l=1}^{L} Σ_{k=1}^{K} w_k^l = 1 are quite common [3]. However, since we split a Gaussian into 2K+1 components along a specified direction u_l, assigning the same weight to components that are close to the mean μ̃ and those that are further away from it may not be an appropriate choice. Instead, we assign each mixture component p_m(z) a weight that is proportional to the likelihood of its shifted mean under the initial Gaussian q(z), i.e.,

    w_0 = γ/C,    w_k^l = w_{K+k}^l = (1/C) e^{−(1/2) k² η² u_l′ Σ̃^{−1} u_l},    (14)

with the normalization constant C ∈ R_{>0}. The parameter γ determines how much weight is assigned to the center mixture component and moreover controls the higher moments, as will be utilized later.
Since the weights represent probabilities, γ is thereby constrained to be ∈ R_{≥0}, as opposed to the unscented filter [3], where also negative center weights may be chosen.
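A minimal sketch of the likelihood-proportional weighting (14) for a single split direction (L = 1), together with the covariance reduction β from (13), might look as follows; the function name and parameter layout are ours, and for a direction normalized as in (10) the quadratic form u′ Σ̃^{−1} u equals one.

```python
import numpy as np

def weights_eq14(gamma, eta, K, q):
    """Likelihood-proportional weights, eq. (14), for one split direction:
    w_0 = gamma/C,  w_k = exp(-0.5 k^2 eta^2 q)/C,  k = 1..K,
    where q = u' Sigma^{-1} u and C normalizes w_0 + 2 sum_k w_k = 1.
    Returns (w_0, [w_1, ..., w_K]); mirrored components reuse w_k."""
    k = np.arange(1, K + 1)
    raw = np.exp(-0.5 * k**2 * eta**2 * q)    # likelihood of each shifted mean
    C = gamma + 2.0 * raw.sum()                # normalization constant
    return gamma / C, raw / C

# q = 1 for a direction normalized as in eq. (10)
gamma, eta, K, q = 1.0, 0.6, 2, 1.0
w0, wk = weights_eq14(gamma, eta, K, q)
beta = 2.0 * eta**2 * np.sum(wk * np.arange(1, K + 1)**2)   # eq. (13)
```

Because the weights decay with the distance of the shifted mean from μ̃, near-center components carry more mass than remote ones, and since they depend on η, the bound β ≤ 1 has to be checked (or η_max found) numerically rather than in closed form.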

[Figure 2: Averaged recognition accuracies on test set A of the Aurora 2 database for an SNR of 0 dB; η ∈ [0, η_max], γ = γ_opt, for L=1, K ∈ {1,..,3} and L=2, K=1. The plot shows accuracy over η/η_max for the four setups, with η_max = 1.7321, 1.4192, 1.2323 and 1.7321, respectively; accuracies range from about 57% to 61%.]

For all four setups (L=1, K ∈ {1,..,3} and L=2, K=1), the averaged recognition accuracies exhibit the same characteristic: the accuracy first increases with η, reaches a plateau and finally declines when approaching η_max. The greatest improvement compared to the baseline (+3.5% absolute) is obtained with L=1 and K=2 at η/η_max = 0.8. Interestingly, increasing the number of split directions from L=1 to L=2 could not improve the results. This may be attributed to the fact that splitting (and filtering) is performed in the cepstral domain, where the energy component dominates the eigenvalues. However, the linearization of (1) with respect to the current split distribution and the estimation of the mean and covariance of the corresponding observation density require the transformation to the logarithmic mel power spectral domain, which is done by application of the pseudo-inverse of the DCT matrix. The variance reduction in the energy component can thus be considered to spread along all components of the state vector in the logarithmic mel power spectral domain, and thus to reduce the influence of the non-linearity on the linearization for all components of the observation model in that domain. Nevertheless, splitting in L > 1 directions may become useful for different observation models or inference problems.

With η approaching its bound η_max, β approaches one and the variances along the corresponding split directions become zero. Though this may be reminiscent of the unscented filter, there is a subtle but important difference. First, the proposed splitting scheme does not require the number of split directions to be equal to the number of (possibly augmented) state dimensions, as required by the unscented filter.
Thus, one can focus on those directions where the non-linearity is most severe. Second and most important, the proposed splitting scheme allows for merging the contributions of the individual splits at the posterior density level. This, however, is not possible under the unscented filter, which explicitly assumes the observation density to be Gaussian and thus renders the posterior distribution Gaussian, too. However, a detailed comparison to the unscented filter is not the focus of this paper and will be left for future research.

Finally, the computationally modest configuration (L=1, K=1, γ=γ_opt, η/η_max=0.8) is used to carry out feature enhancement on the complete test set A of the Aurora 2 database. This time, we also utilize the uncertainty about the clean speech estimates provided by the filter to carry out the final recognition with uncertainty decoding [11]. Results for both filters, the standard IEKF-α and the filter with the Gaussian sum approximation, are given in Tab. 1. In comparison to the standard IEKF-α, the application of the splitting scheme results in an improved recognition performance which under the standard decoder already reaches the performance of the uncertainty decoder applied to the standard IEKF-α. Additional application of the uncertainty decoder finally results in an accuracy of 87.41%. Note that the major gain is almost exclusively achieved under low SNR conditions, validating the assumption underlying our first experiment.

Table 1: Averaged recognition accuracies on test set A of the Aurora 2 database with standard and uncertainty decoding

    SNR [dB] |       IEKF-α           |  IEKF-α + splitting
             | standard | uncertainty | standard | uncertainty
    ---------+----------+-------------+----------+------------
       20    |  98.72   |   98.64     |  98.70   |   98.66
       15    |  97.18   |   97.31     |  97.21   |   97.30
       10    |  93.61   |   94.03     |  93.41   |   93.72
        5    |  82.82   |   84.12     |  83.16   |   84.66
        0    |  56.95   |   58.92     |  60.47   |   62.73
      avg    |  85.86   |   86.60     |  86.59   |   87.41

6. Conclusions

In this work, a splitting and weighting scheme has been described that allows for splitting a Gaussian density into a Gaussian mixture density. We were able to extend the existing theory to splits along arbitrary directions, not necessarily along the directions of the eigenvectors. The GMM and the original Gaussian exhibit equal central moments up to an order of four. However, the resulting mixture components' covariances have eigenvalues that are smaller than those of the covariance of the original distribution, thereby reducing the linearization error of a Taylor series expansion. Application to model-based speech feature enhancement on the Aurora 2 speech recognition task confirmed this property to be beneficial in the context of non-linear state estimation. The splitting contributes most under low SNR conditions, where the non-linearity is most severe and where the covariance of the predictive distribution is large, such that the assumptions inherent to the extended Kalman filter are contradicted most.

7. Acknowledgements

This work has been supported by Deutsche Forschungsgemeinschaft (DFG) under contract no. Ha 3455/6-1.

8. References

[1] J. Droppo, A. Acero, and L. Deng, "A nonlinear observation model for removing noise from corrupted speech log mel-spectral energies," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), 2002.
[2] V. Leutnant and R. Haeb-Umbach, "An analytic derivation of a phase-sensitive observation model for noise robust speech recognition," in Proc. Annu. Conf. of ISCA (Interspeech), 2009.
[3] S. J. Julier and J. K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, 1997.
[4] B. Bell and F. Cathey, "The iterated Kalman filter update as a Gauss-Newton method," IEEE Trans. Autom. Control, vol. 38, no. 2, 1993.
[5] D. Alspach and H. Sorenson, "Nonlinear Bayesian estimation using Gaussian sum approximations," IEEE Trans. Autom. Control, vol. 17, no. 4, 1972.
[6] R. van der Merwe and E. Wan, "Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models," in Proc. Int. Conf. Acoust., Speech and Sig. Process. (ICASSP), 2003.
[7] F. Faubel and D. Klakow, "Further improvement of the adaptive level of detail transform: Splitting in direction of the nonlinearity," in Proc. Europ. Sig. Process. Conf. (EUSIPCO), 2010.
[8] D. Pearce and H.-G. Hirsch, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), 2000.
[9] Z. Wiśniewski, "Vectors of kth order moments," Bulletins of the Stanisław Staszic Academy of Mining and Metallurgy, Geodesy b. 112, no. 1423, 1991.
[10] L. Isserlis, "On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables," Biometrika, vol. 12, no. 1/2, 1918.
[11] V. Ion and R. Haeb-Umbach, "A novel uncertainty decoding rule with applications to transmission error robust speech recognition," IEEE Trans. Audio, Speech, and Lang. Process., vol. 16, no. 5, 2008.
