
A Versatile Gaussian Splitting Approach to Non-linear State Estimation and its Application to Noise-Robust ASR

Volker Leutnant, Alexander Krueger, Reinhold Haeb-Umbach
Department of Communications Engineering, University of Paderborn, Germany
{leutnant,krueger,haeb}@nt.uni-paderborn.de

Abstract

In this work, a splitting and weighting scheme that allows for splitting a Gaussian density into a Gaussian mixture density (GMM) is extended to allow the mixture components to be arranged along arbitrary directions. The parameters of the Gaussian mixture are chosen such that the GMM and the original Gaussian still exhibit equal central moments up to an order of four. The resulting mixtures' covariances will have eigenvalues that are smaller than those of the covariance of the original distribution, which is a desirable property in the context of non-linear state estimation, since the underlying assumptions of the extended Kalman filter are better justified in this case. Application to speech feature enhancement in the context of noise-robust automatic speech recognition reveals the beneficial properties of the proposed approach in terms of a reduced word error rate on the Aurora 2 recognition task.

Index Terms: density splitting, moment matching

1. Introduction

The application of model-based speech feature enhancement to the prominent noise-robustness problem of today's automatic speech recognition (ASR) systems has gained considerable interest in recent years. Thereby, conditional Bayesian estimation is employed to infer the uncorrupted speech feature vectors from the corrupted observations, based on a priori models of the cepstral speech feature vectors x_t and the noise-only feature vectors n_t and a highly non-linear observation model relating the two to the corrupted observations y_t [1]:

    y_t = D \log\left( e^{D^{+} x_t} + e^{D^{+} n_t} + 2 \alpha_t \odot e^{D^{+}(x_t + n_t)/2} \right).
(1)

Here, D and D⁺ denote the discrete cosine transform (DCT) matrix and its pseudo-inverse, respectively. Several approximations have been proposed to model the observation probability density p(y_t | x_t, n_t). The most prominent and at the same time simplest approach neglects any phase difference between the speech and the noise signal. However, it is well known and has already been verified experimentally that the more accurate model of (1), which includes the phase factor α_t, results in improved performance [2]. Since a numerical evaluation of the occurring integrals to obtain the observation probability p(y_t | x_t, n_t) is computationally very demanding, it is usually still approximated by a Gaussian density.

The two most prominent filters for non-linear estimation problems are the unscented [3] and the extended Kalman filter [4]. The unscented filter determines the mean and covariance of this Gaussian by means of so-called sigma points, which are drawn from a joint Gaussian a priori density of x_t and n_t and later on propagated through the non-linearity. In contrast, the extended Kalman filter utilizes lower-order terms of the Taylor series expansion of the non-linearity to obtain estimates of the mean and the covariance, and in general it is very sensitive to the choice of the expansion vector. Only when the spectral radius of the covariance matrix of the joint Gaussian prior is sufficiently small are the underlying assumptions approximately valid.

Driven by this idea, Alspach and Sorenson proposed to model the original a priori distribution by a Gaussian mixture model whose individual mixture components exhibit smaller eigenvalues than the original distribution [5]. Only recently, van der Merwe et al. [6] and Faubel et al. [7] adopted this approach. While van der Merwe et al. applied a weighted expectation-maximization algorithm to samples drawn from the a priori density to obtain the parameters of a GMM, Faubel et al.
proposed to iteratively increase the number of mixture components by splitting a selected Gaussian along one of the eigenvectors of the corresponding covariance. The mixture component to split and the eigenvector the split is performed along are thereby determined by the degree of non-linearity [7]. Both approaches can be considered computationally expensive, and thus, to the best of our knowledge, none or very few attempts have been made to apply the concept to the problem of speech feature enhancement, as is done here. Further, we will extend the theory to splits along arbitrary directions, not restricted to those defined by the eigenvectors of the covariance matrix.

The paper begins with a short review of approximating a GMM by a single Gaussian by means of minimizing their Kullback-Leibler divergence. Since the same criterion cannot be applied to split a Gaussian into a GMM without any constraints on the parameter set, a splitting and weighting scheme which allows for multiple splits in arbitrary directions is introduced next, followed by a discussion on the sensitivity of the choice of the parameters. The splitting and weighting scheme is then applied to model-based speech feature enhancement, and its performance is evaluated by means of recognition results on the Aurora 2 database [8].

2. Merging a GMM into a Single Gaussian

Given a Gaussian mixture probability density p(z), z ∈ R^D,

    p(z) = \sum_{m=0}^{M-1} w_m p_m(z) = \sum_{m=0}^{M-1} w_m \mathcal{N}(z; \mu_m, \Sigma_m),   (2)

where w_m, μ_m and Σ_m denote the weight, the mean and the covariance of the m-th Gaussian mixture component p_m(z), m ∈ {0, …, M−1}, respectively, merging the GMM into a single Gaussian q(z) = N(z; μ̃, Σ̃) usually refers to finding its mean μ̃ and covariance Σ̃ such that the Kullback-Leibler (KL) divergence between p(z) and q(z) is minimized, resulting in

    \tilde{\mu} = \sum_{m=0}^{M-1} w_m \mu_m,   (3)

    \tilde{\Sigma} = \sum_{m=0}^{M-1} w_m \Sigma_m + \sum_{m=0}^{M-1} w_m (\mu_m - \tilde{\mu})(\mu_m - \tilde{\mu})',   (4)

where (·)′ denotes the matrix/vector transpose operator. The covariance Σ̃ can thereby be regarded as being composed of a "within-class" term S_W = Σ_{m=0}^{M−1} w_m Σ_m and a "between-class" term S_B = Σ_{m=0}^{M−1} w_m (μ_m − μ̃)(μ_m − μ̃)′. This approach is also referred to as moment matching, for obvious reasons.

3. Splitting a Single Gaussian into a GMM

While merging a GMM into a single Gaussian by minimizing their KL divergence is common practice, the objective here is the opposite: splitting a Gaussian into a GMM consisting of an arbitrary number of mixture components M. A splitting and a weighting scheme which allow for multiple splits in arbitrary directions are described next. To keep the objective of minimizing the KL divergence, both schemes are driven by the key idea of matching the moments of the GMM to those of the Gaussian.

3.1. Splitting in an Arbitrary Direction

Splitting the Gaussian q(z) = N(z; μ̃, Σ̃) into a GMM p(z) with M = 2K+1 components along an arbitrary direction u_l ∈ R^D can be achieved by first shifting the mean 2K times and second transforming the covariance of the shifted density. Arranging the means of the M mixture components symmetrically with respect to μ̃ along u_l with an equidistant spacing, while transforming all covariances in the same manner, results in a GMM that is fully specified by the parameter set

    \mu_0 = \tilde{\mu},                  \Sigma_0,                  w_0,
    \mu_k = \tilde{\mu} + \eta k u_l,     \Sigma_k = \Sigma_0,       w_k,              (5)
    \mu_{K+k} = \tilde{\mu} - \eta k u_l, \Sigma_{K+k} = \Sigma_k,   w_{K+k} = w_k,

where k ∈ {1, …, K}. All weights are positive and normalized such that w_0 + \sum_{\tilde{k}=1}^{2K} w_{\tilde{k}} = 1. The parameter η ∈ R_{≥0} determines the distance between two adjacent means in terms of the length of the vector u_l. The symmetric setup ensures that the first central moments of the Gaussian and the GMM match.
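The moment-matching merge of (3) and (4), which the splitting schemes have to be consistent with, can be sketched in a few lines of NumPy. This is a minimal illustration only; the function and variable names are mine, not the paper's:

```python
import numpy as np

def merge_gmm(weights, means, covs):
    """Merge a GMM into one Gaussian with matched first and second
    central moments (the minimum-KL Gaussian fit of eqs. (3)-(4))."""
    w = np.asarray(weights, dtype=float)    # shape (M,)
    mu = np.asarray(means, dtype=float)     # shape (M, D)
    sig = np.asarray(covs, dtype=float)     # shape (M, D, D)
    mu_t = w @ mu                           # eq. (3)
    diff = mu - mu_t
    s_w = np.einsum('m,mij->ij', w, sig)            # "within-class" term S_W
    s_b = np.einsum('m,mi,mj->ij', w, diff, diff)   # "between-class" term S_B
    return mu_t, s_w + s_b                  # eq. (4): S_W + S_B
```

For two unit-covariance components at [0, 0] and [2, 0] with equal weights, the merged Gaussian has mean [1, 0] and an inflated variance along the first axis, exactly the S_W + S_B decomposition described above.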
Applying (4) and solving for Σ_0 to also match the second central moment yields

    \Sigma_0 = \tilde{\Sigma} - \beta u_l u_l',   \beta = 2 \eta^2 \sum_{k=1}^{K} w_k k^2,   (6)

where the parameter β ∈ R_{≥0} controls the covariance reduction, which certainly has to be a function of the mean displacement η if the moments are to match. Calling for Σ_0 to be symmetric and non-negative definite poses constraints on either the length of u_l or on β. While the symmetry is obviously given for any vector u_l ∈ R^D, the non-negative definiteness property

    z' \left( \tilde{\Sigma} - \beta u_l u_l' \right) z \geq 0 \quad \forall z \in \mathbb{R}^D   (7)

is preserved only by vectors in a certain subset of R^D. Expressing the covariance Σ̃ in terms of an orthonormal matrix composed of its eigenvectors V = [v_1, …, v_D] and a diagonal matrix composed of its eigenvalues Λ = diag([λ_1, …, λ_D]), one obtains

    z' \left( V \Lambda V' - \beta u_l u_l' \right) z \geq 0 \quad \forall z \in \mathbb{R}^D,   (8)

which is equivalent to

    \tilde{z}' \left( I - \beta \Lambda^{-1/2} V' u_l u_l' V \Lambda^{-1/2} \right) \tilde{z} \geq 0 \quad \forall \tilde{z} \in \mathbb{R}^D,   (9)

with z̃ = Λ^{1/2} V′ z, since Λ^{1/2} V′ is full-rank.

The matrix I − β(Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′ in turn is non-negative definite if the eigenvalues of β(Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′ are smaller than or equal to one. Assuming without loss of generality that β is bound to the interval [0, 1], non-negative definiteness can be ensured by normalizing the vector u_l by the square root of the maximum eigenvalue of (Λ^{−1/2} V′ u_l)(Λ^{−1/2} V′ u_l)′, which is just the length of the vector Λ^{−1/2} V′ u_l. The normalized vector ŭ_l is thus given by

    \breve{u}_l = \frac{u_l}{\sqrt{(\Lambda^{-1/2} V' u_l)' (\Lambda^{-1/2} V' u_l)}} = \frac{u_l}{\sqrt{u_l' \tilde{\Sigma}^{-1} u_l}}.   (10)

Note that if u_l coincides with an arbitrary eigenvector v_d of Σ̃, ŭ_l turns into ŭ_l = √λ_d · v_d / ‖v_d‖.

3.2. Simultaneous Splitting in Multiple Directions

Extension to a simultaneous split in L arbitrary directions u_1, …, u_L can be carried out in a similar fashion.
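Before moving to multiple directions, the single-direction scheme of (5), (6) and (10) can be sketched as follows. This is a minimal NumPy illustration with the non-center weights supplied externally (the paper's own weighting rule comes in Section 3.3); function and variable names are mine:

```python
import numpy as np

def split_gaussian(mu_t, sig_t, u, eta, w):
    """Split N(mu_t, sig_t) into 2K+1 components along direction u,
    following eqs. (5), (6) and (10). `w` holds the non-center weights
    w_1..w_K; the center weight is w_0 = 1 - 2*sum(w)."""
    w = np.asarray(w, dtype=float)
    K = len(w)
    u = np.asarray(u, dtype=float)
    # eq. (10): normalize so that any beta in [0, 1] keeps Sigma_0 psd
    u_n = u / np.sqrt(u @ np.linalg.solve(sig_t, u))
    # eq. (6): covariance reduction matched to the mean displacement
    beta = 2.0 * eta**2 * np.sum(w * np.arange(1, K + 1)**2)
    sig0 = sig_t - beta * np.outer(u_n, u_n)
    # eq. (5): symmetric means, shared covariance, mirrored weights
    weights = np.concatenate(([1.0 - 2.0 * w.sum()], w, w))
    means = np.vstack([mu_t]
                      + [mu_t + eta * k * u_n for k in range(1, K + 1)]
                      + [mu_t - eta * k * u_n for k in range(1, K + 1)])
    covs = np.repeat(sig0[None], 2 * K + 1, axis=0)
    return weights, means, covs
```

Re-merging the resulting components via (3) and (4) recovers the original mean and covariance exactly, which is a convenient numerical check of the moment-matching construction.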
The parameters of the resulting GMM consisting of M = 2KL+1 mixture components are given by

    \mu_0 = \tilde{\mu},                    \Sigma_0 = \tilde{\Sigma} - \beta \sum_{l=1}^{L} u_l u_l',   w_0,
    \mu_k^l = \tilde{\mu} + \eta k u_l,     \Sigma_k^l = \Sigma_0,       w_k^l,                 (11)
    \mu_{K+k}^l = \tilde{\mu} - \eta k u_l, \Sigma_{K+k}^l = \Sigma_k^l, w_{K+k}^l = w_k^l,

where l ∈ {1, …, L} and k ∈ {1, …, K}. All split directions u_l again have to be normalized, this time by the square root of the maximum eigenvalue of \sum_{l=1}^{L} (\Lambda^{-1/2} V' u_l)(\Lambda^{-1/2} V' u_l)', to preserve the non-negative definiteness of all covariances. Applying (4) to match the second central moment calls for

    \sum_{l=1}^{L} u_l u_l' \left[ \beta - 2 \eta^2 \sum_{k=1}^{K} w_k^l k^2 \right] \overset{!}{=} 0,   (12)

which can simply be met for an arbitrary number of split directions L if the weights are chosen to be independent of the index l. The parameter η determining the displacement of the mixture means and the parameter β controlling the change in the covariances are thus again (compare (6)) related by

    \beta = 2 \eta^2 \sum_{k=1}^{K} w_k^l k^2.   (13)

Since the normalization of the split directions requires 0 ≤ β ≤ 1, the mean shift η will be bound to η ∈ [0, η_max], which in general will depend on L and K. If the weights are independent of η, a closed-form solution for the upper bound η_max can be derived from (13). However, if the weights depend on η, a closed-form solution usually does not exist, as will be outlined next.

3.3. Weighting Scheme

Equation (13) relates the covariance reduction parameter β to the mean shift η and the chosen weights w_k^l. Weighting schemes where the center weight w_0 is specified and the remaining probability mass is equally distributed among the remaining 2KL non-center mixture components to satisfy w_0 + 2 \sum_{l=1}^{L} \sum_{k=1}^{K} w_k^l = 1 are quite common [3]. However, since we split a Gaussian (2K+1)-times along a specified direction u_l, assigning the same weight to mixture components that are close to the mean μ̃ and to those that are further away from it may not be an appropriate choice.
Instead, we assign each mixture component p_m(z) a weight that is proportional to the likelihood of its shifted mean under the initial Gaussian q(z):

    w_0 = \frac{\gamma}{C},   w_k^l = w_{K+k}^l = \frac{1}{C} e^{-\frac{1}{2} k^2 \eta^2 u_l' \tilde{\Sigma}^{-1} u_l},   (14)

with the normalization constant C ∈ R_{>0}. The parameter γ determines how much weight is assigned to the centered mixture component and moreover controls the higher moments, as will be utilized later. Since the weights represent probabilities, γ is thereby constrained to be non-negative, as opposed to the unscented filter [3], where also negative center weights may be used.
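A minimal sketch of this weighting scheme combines (14) with the β-relation (13). For a single direction normalized as in (10), the exponent simplifies because ŭ_l′ Σ̃⁻¹ ŭ_l = 1; the parameter `q` below stands for u_l′ Σ̃⁻¹ u_l in the general case. Function and variable names are mine:

```python
import numpy as np

def likelihood_weights(gamma, eta, K, L=1, q=1.0):
    """Likelihood-proportional weights per eq. (14) for a (2KL+1)-component
    split, with `q` = u' Sigma^{-1} u (equal to 1 for a single direction
    normalized as in (10)). Returns (w_0, [w_1..w_K], beta) with beta
    from eq. (13)."""
    k = np.arange(1, K + 1)
    wk = np.exp(-0.5 * k**2 * eta**2 * q)   # unnormalized non-center weights
    C = gamma + 2.0 * L * wk.sum()          # enforces w0 + 2*L*sum(w_k) = 1
    w0, w = gamma / C, wk / C
    beta = 2.0 * eta**2 * np.sum(w * k**2)  # eq. (13)
    return w0, w, beta
```

Since the weights themselves depend on η, the bound η_max where β reaches one generally has to be determined numerically, in line with the remark at the end of Section 3.2.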

[Figure 2 (plot): averaged accuracy [%] over η/η_max for the setups L=1, K=1 (η_max = 1.7321); L=1, K=2 (η_max = 1.4192); L=1, K=3 (η_max = 1.2323); and L=2, K=1 (η_max = 1.7321).]

Figure 2: Averaged recognition accuracies on test set A of the Aurora 2 database for an SNR of 0 dB; η ∈ [0, η_max], γ = γ_opt, for L=1, K ∈ {1, …, 3} and L=2, K=1.

For all investigated setups (L=1, K ∈ {1, …, 3} and L=2, K=1), the averaged recognition accuracies exhibit the same characteristic: the accuracy first increases with η, reaches a plateau and finally declines when approaching η_max. The greatest improvement compared to the baseline (+3.5% absolute) is obtained with L=1 and K=2 at η/η_max = 0.8. Interestingly, increasing the number of split directions from L=1 to L=2 could not improve the results. This may be attributed to the fact that splitting (and filtering) is performed in the cepstral domain, where the energy component dominates the eigenvalues. However, the linearization of (1) with respect to the current split distribution and the estimation of the mean and covariance of the corresponding observation density require the transformation to the logarithmic mel power spectral domain, which is done by application of the pseudo-inverse of the DCT matrix. The variance reduction in the energy component can thus be considered to spread along all components of the state vector in the logarithmic mel power spectral domain, and thus to reduce the influence of the non-linearity on the linearization for all components of the observation model in that domain. Nevertheless, splitting in L > 1 directions may become useful for different observation models or inference problems.

With η approaching its bound η_max, β approaches one and the variances along the corresponding split directions become zero. Though this may be reminiscent of the unscented filter, there is a subtle but important difference to it.
First, the proposed splitting scheme does not require the number of split directions to be equal to the number of (possibly augmented) state dimensions, as required by the unscented filter. Thus, one can focus on those directions where the non-linearity is most severe. Second, and most importantly, the proposed splitting scheme allows for merging the contributions of the individual splits on the posterior density level. This, however, is not possible under the unscented filter, which explicitly assumes the observation density to be Gaussian and thus renders the posterior distribution to be Gaussian, too. However, a detailed comparison to the unscented filter is not the focus of this paper and will be left for future research.

Finally, the computationally modest configuration (L=1, K=1, γ=γ_opt, η/η_max=0.8) is used to carry out feature enhancement on the complete test set A of the Aurora 2 database. This time, we also utilize the uncertainty about the clean speech estimates provided by the filter to carry out the final recognition with uncertainty decoding [11]. Results for both filters, the standard IEKF-α and the filter with the Gaussian sum approximation, are given in Tab. 1. In comparison to the standard IEKF-α, the application of the splitting scheme results in an improved recognition performance which under the standard decoder already reaches the performance of the uncertainty decoder applied to the standard IEKF-α.

Table 1: Averaged recognition accuracies on test set A of the Aurora 2 database with standard and uncertainty decoding

              IEKF-α                    IEKF-α + splitting
  SNR [dB]    standard   uncertainty   standard   uncertainty
  20          98.72      98.64         98.70      98.66
  15          97.18      97.31         97.21      97.30
  10          93.61      94.03         93.41      93.72
  5           82.82      84.12         83.16      84.66
  0           56.95      58.92         60.47      62.73
  ∅           85.86      86.60         86.59      87.41
Additional application of the uncertainty decoder finally results in an accuracy of 87.41%. Note that the major gain is almost exclusively achieved under low SNR conditions, validating the assumption our first experiment was subject to.

6. Conclusions

In this work, a splitting and weighting scheme has been described that allows for splitting a Gaussian density into a Gaussian mixture density. We were able to extend the existing theory to splits along arbitrary directions, not necessarily along the directions of the eigenvectors. The GMM and the original Gaussian exhibit equal central moments up to an order of four. However, the resulting mixtures' covariances have eigenvalues that are smaller than those of the covariance of the original distribution, thereby reducing the linearization error of a Taylor series expansion. Application to model-based speech feature enhancement on the Aurora 2 speech recognition task confirmed this property to be beneficial in the context of non-linear state estimation. The splitting contributes most under low SNR conditions, where the non-linearity is most severe and where the covariance of the predictive distribution is large, such that the assumptions inherent to the extended Kalman filter are contradicted most.

7. Acknowledgements

This work has been supported by Deutsche Forschungsgemeinschaft (DFG) under contract no. Ha 3455/6-1.

8. References

[1] J. Droppo, A. Acero, and L. Deng, "A nonlinear observation model for removing noise from corrupted speech log mel-spectral energies," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), 2002.

[2] V. Leutnant and R. Haeb-Umbach, "An analytic derivation of a phase-sensitive observation model for noise robust speech recognition," in Proc. Annu. Conf. of ISCA (Interspeech), 2009.

[3] S. J. Julier and J. K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in Int. Symp. Aerospace/Defense Sensing, Simul.
and Controls, 1997.

[4] B. Bell and F. Cathey, "The iterated Kalman filter update as a Gauss-Newton method," IEEE Trans. Autom. Control, vol. 38, no. 2, 1993.

[5] D. Alspach and H. Sorenson, "Nonlinear Bayesian estimation using Gaussian sum approximations," IEEE Trans. Autom. Control, vol. 17, no. 4, 1972.

[6] R. van der Merwe and E. Wan, "Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models," in Proc. Int. Conf. Acoust., Speech and Sig. Process. (ICASSP), 2003.

[7] F. Faubel and D. Klakow, "Further improvement of the adaptive level of detail transform: Splitting in direction of the nonlinearity," in Proc. Europ. Sig. Process. Conf. (EUSIPCO), 2010.

[8] D. Pearce and H.-G. Hirsch, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), 2000.

[9] Z. Wiśniewski, "Vectors of kth order moments," Bulletins of the Stanisław Staszic Academy of Mining and Metallurgy, Geodesy b. 112, no. 1423, 1991.

[10] L. Isserlis, "On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables," Biometrika, vol. 12, no. 1/2, 1918.

[11] V. Ion and R. Haeb-Umbach, "A novel uncertainty decoding rule with applications to transmission error robust speech recognition," IEEE Trans. Audio, Speech, and Lang. Process., vol. 16, no. 5, 2008.