IX TELEKOMUNIKACIONI FORUM TELFOR'2001, Beograd, 20-22.11.2001.

CONNECTED WORDS RECOGNITION

Darko Pekar, Radovan Obradović, Vlado Delić
School of Engineering, University of Novi Sad, Yugoslavia

I INTRODUCTION

In [1] the same group of authors worked on a system for discrete digit recognition in the Serbian language. The problems in that work were identified, and many of them are overcome by the methods proposed in this paper. The new system not only surpasses the previous one in discrete word recognition, but also gives very good results when applied to rapidly pronounced connected digits in adverse conditions (background noise, telephone channel, etc.). The new system is also speaker independent, works in real time even on weak Pentium I configurations, and supports the Windows NT/9x operating systems. It requires no additional hardware and can work with several Dialogic sound cards, which offers an inexpensive way of providing CTI applications with ASR support.

II ACOUSTIC FEATURES VECTOR

The acoustic features vector consists of static and dynamic components.

Static features

In use are: 14 MFCC, signal energy, zero crossing rate and degree of voicing.

Dynamic features

To characterize feature change in time, the first derivative of the static features is used. We carried out experiments with two methods for first derivative calculation:
• the difference between two neighbouring frames, filtered in time with a filter of length 3, which significantly reduces its noisy component [1],
• first derivative calculation using regression coefficients.

Regression coefficients are obtained when the function is approximated by a polynomial using the minimum square error (MSE) criterion, and the first derivative of that function is then evaluated at the desired spot. This is done in a window of a certain length. If we use a second order polynomial, we obtain the following expression [6] (a short sketch of this computation is given at the end of Section III):

    r_{i,j} = \frac{\sum_{q=-Q}^{Q} q \, p_{i+q,j}}{\sum_{q=-Q}^{Q} q^2}    (2.1)

where 2Q+1 is the window duration (in frames), p_{i+q,j} is the static feature value, i is the frame number, and j is the static feature index.

III MODEL STRUCTURE

Words are modeled by an HMM with a simple linear structure, and each phoneme in a word is modeled with two states. The probability distribution of the acoustic vectors within one state is modeled by a mixture of 6 Gaussian distributions. It is common to combine Gaussian distributions in the following way:

    p(x) = \sum_{k=1}^{K} c_k \, e^{-\frac{1}{2} (x - \mu_k) U_k (x - \mu_k)'}    (3.1)

where x is the acoustic feature vector, \mu_k and U_k are the parameters of the Gaussian distributions, and K is the number of mixtures.

We used the "winner take all" approximation, in which the sum in (3.1) is replaced by the max operator:

    p(x) = \max_{k=1}^{K} c_k \, e^{-\frac{1}{2} (x - \mu_k) U_k (x - \mu_k)'}    (3.2)

In our opinion this approach is insufficiently represented in ASR practice, although according to our results it gives results comparable to the first method and can be implemented much more efficiently, because software realizations usually use the logarithm of the probability density function:

    \ln p(x) = \max_{k=1}^{K} \left[ \ln c_k - \frac{1}{2} (x - \mu_k) U_k (x - \mu_k)' \right]    (3.3)

We see that evaluating expression (3.3) requires no transcendental functions, only simple operations such as addition, subtraction and multiplication, which can be implemented very efficiently on DSP processors. We emphasize that the evaluation of expression (3.3) is very similar to the vector quantization algorithm.
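To make (2.1) concrete, here is a minimal sketch in Python (not from the paper; the half-window Q=2, the edge padding and the array layout are our assumptions) computing the regression-based deltas for a whole utterance:

```python
import numpy as np

def delta_features(static, Q=2):
    """Regression-based first derivative, eq. (2.1).

    static : (num_frames, num_features) array of static features p[i, j].
    Q      : half-window, so the regression window spans 2Q+1 frames
             (Q=2 is an assumption; the paper does not state its value).
    """
    num_frames, _ = static.shape
    # Pad by repeating edge frames so every frame has a full window.
    padded = np.pad(static, ((Q, Q), (0, 0)), mode="edge")
    denom = sum(q * q for q in range(-Q, Q + 1))      # sum of q^2
    deltas = np.zeros(static.shape, dtype=float)
    for i in range(num_frames):
        # Numerator: sum over q of q * p[i+q, j] (vectorized over j).
        num = sum(q * padded[i + Q + q] for q in range(-Q, Q + 1))
        deltas[i] = num / denom
    return deltas
```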
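Similarly, a minimal sketch of the "winner take all" score (3.3), assuming diagonal covariance matrices U_k (the paper does not state the covariance structure):

```python
import numpy as np

def wta_log_score(x, log_c, mu, inv_var):
    """ "Winner take all" state score, eq. (3.3).

    x       : (D,) acoustic feature vector.
    log_c   : (K,) log mixture weights ln c_k.
    mu      : (K, D) mixture means.
    inv_var : (K, D) inverse variances (diagonal U_k assumed).

    Only additions, subtractions and multiplications are needed at
    run time, as the paper notes; no exponentials or logarithms.
    """
    d = x - mu                                   # (K, D) differences
    quad = np.sum(d * d * inv_var, axis=1)       # (x - mu_k) U_k (x - mu_k)'
    return np.max(log_c - 0.5 * quad)            # best mixture wins
```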
IV MODEL PARAMETERS CALCULATION

All previous systems for discrete word recognition had a more or less standard training procedure: initial model determination, Baum-Welch reestimation by the Maximum Likelihood (ML) criterion [2], and some methods which perform discriminative training.

In the first step, the initial distribution of the feature vectors among the states is carried out in such a way that a similar accumulated difference between neighbouring frames is preserved within each state [2]. This method did not guarantee that one state would cover the same part of the same phoneme in every observation sequence. On the contrary, it was inevitable that one state would contain features from two or even more phonemes. None of the later training methods could correct this starting error. It was especially pronounced in the states in the middle of words, and had catastrophic consequences for recognition performance, because such states could "recognize" virtually anything. That is why our system requires a segmented speech database: the start and end point of each phoneme is provided for every observation sequence in the training set.
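As an illustration of how a segmented database can seed the initial frame-to-state assignment, here is a minimal sketch; the even split of each phoneme segment between its two states is our assumption, not necessarily the authors' exact procedure:

```python
def assign_frames_to_states(segments):
    """Map frames to HMM states using a segmented utterance.

    segments : list of (phoneme, start_frame, end_frame) triples,
               end_frame exclusive, as provided by the segmented database.
    Returns a list of (phoneme, state_index, frame) assignments, with
    each phoneme segment split between its two states.
    """
    assignments = []
    for phoneme, start, end in segments:
        mid = (start + end) // 2          # even split is an assumption
        for frame in range(start, mid):
            assignments.append((phoneme, 0, frame))   # first state
        for frame in range(mid, end):
            assignments.append((phoneme, 1, frame))   # second state
    return assignments

# Example: "da" segmented as /d/ over frames 0-5 and /a/ over frames 5-14.
print(assign_frames_to_states([("d", 0, 5), ("a", 5, 14)]))
```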


Optimization is possible and works well. The basic assumption is that the segmentations will not change significantly if we vary the parameters a little, so we keep the segmentation fixed during one set of iterations. This acceptable assumption enables a drastic speed-up of the algorithm. Namely, each model accumulates a certain metric along the whole observation sequence. Each state lasts several frames, so we can also speak of an accumulated metric per state. By varying only one state's parameters, we change only that state's contribution to the total metric, since the segmentation did not change. In this way we reduced the number of parameters which must be varied simultaneously to several hundred, and the new goal function value is calculated in several milliseconds (since we recalculate only the changed values). After these separate variations of all the states in all the models, we compute a new segmentation and begin a new set of iterations. This method can give acceptable results within several hours of training.

Depending on the chosen parameters and the training set size, this method can, in a relatively short period, achieve a recognition rate of 100% on the training set, with significant distinction between models. But if the training set is not large enough (and it never is), this leads to the so-called overfitting phenomenon, characterized by excessive model dependence on the training set and reduced efficiency on other sets. That is why corrective training should be aborted when system performance on an independent set begins to degrade.

V WORD SEQUENCE MODEL STRUCTURE

A word sequence is modeled as an array of words and non-speech intervals. A so-called non-speech state is established to describe the non-speech intervals. The whole vocabulary is represented by a large trellis with several constraints (a sketch of a search under these constraints follows at the end of this section):
• each sequence must begin with the non-speech state,
• the non-speech state duration is not limited,
• from the non-speech state it is possible to enter only the beginning state of any word,
• a word model has a linear structure and state durations bounded from below and above,
• from the last state of each word it is possible to enter only the non-speech state,
• each sequence ends with the non-speech state.

For the trellis search we used the Viterbi algorithm, which observes the mentioned constraints. Depending on the desired application, the Viterbi search can be "forced" through a previously determined sequence of words, which is achieved by creating an appropriate trellis.

Besides the non-speech state, which does not belong to any word, it is possible to establish other "words" which model different kinds of noises. Leaving the noises to the non-speech state results in an increased number of word insertions, because not all noises and silence can be modeled with only one state. Some experiments were conducted with one additional "word", and significant improvement was achieved.
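The following is a minimal sketch of a frame-synchronous Viterbi search over such a trellis, as referenced above; the function and state names are hypothetical, state names are assumed unique across words, and the state-duration limits are omitted for brevity:

```python
import numpy as np

def viterbi_digits(frame_scores, words):
    """Frame-synchronous Viterbi over a word / non-speech trellis.

    frame_scores : dict state_name -> (T,) array of per-frame log
                   scores (e.g. from eq. (3.3)); "ns" is non-speech.
    words        : dict word -> list of its state names in linear order.
    Enforces the Section V constraints: the sequence starts and ends
    in non-speech, a word is entered only from non-speech, and a word
    is left only into non-speech.
    """
    T = len(frame_scores["ns"])
    # State 0 is non-speech; each word contributes a linear chain.
    names, owner = ["ns"], [None]
    first, last = {}, {}
    for w, chain in words.items():
        first[w] = len(names)
        names.extend(chain)
        owner.extend([w] * len(chain))
        last[w] = len(names) - 1

    # Allowed predecessors for every state.
    preds = {0: [0] + [last[w] for w in words]}      # into non-speech
    for w in words:
        preds[first[w]] = [0, first[w]]              # enter word from ns
        for s in range(first[w] + 1, last[w] + 1):
            preds[s] = [s - 1, s]                    # linear word chain

    N = len(names)
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0, 0] = frame_scores["ns"][0]              # must start in ns
    for t in range(1, T):
        for s in range(N):
            p = max(preds[s], key=lambda q: delta[t - 1, q])
            delta[t, s] = delta[t - 1, p] + frame_scores[names[s]][t]
            back[t, s] = p

    # Must end in non-speech; backtrack the best path.
    path, s = [0], 0
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()
    # A word is recognized each time it is entered from non-speech.
    return [owner[path[i]] for i in range(1, len(path))
            if path[i - 1] == 0 and path[i] != 0]
```

Forcing the search through a known word sequence, as mentioned above, amounts to building the trellis from only that sequence's models and restricting the transitions accordingly.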
VI EXPERIMENT RESULTS

All experiments were conducted on recordings of telephone quality. The test set contained around 3500 recordings of both isolated and connected digits. Most of the tests were conducted with a reduced training set, because of the time required for system training. The connected words were pronounced fairly quickly, often in adverse conditions, so the recognition rate could be expected to be even better on a more slowly pronounced test set.

The aim of the first set of tests was to determine how the system behaves after a certain number of corrective training (CT) iterations. We also observed the influence of the number of parameters on the overfitting effect. There were 3 groups of parameters: in the first we had 2 HMM states per phone (spp) and 6 Gaussian mixtures (gm), in the second 2 spp and 3 gm, and in the third 1 spp and 3 gm.

[Graph 6.1: Recognition accuracy (%) vs. the number of CT iterations, for the 2 spp/6 gm, 2 spp/3 gm and 1 spp/3 gm parameter groups]

As expected, the overfitting effect is in direct correlation with the number of parameters, more precisely with the ratio of the number of parameters to the training set size. When the number of parameters is sufficiently large (the first case), after a certain number of iterations (concretely 8) the system degrades on the independent set. At that point training should be aborted.

The second test compares system performance with and without the additional word which models background noise (Graph 6.2). On the x-axis we have: the number of false recognitions (fr), insertions (i), deletions (d) and the error rate (er) per digit. As expected, the most persuasive advance is achieved on the insertions problem, although the number of false recognitions is also significantly reduced. The increase in deleted digits is a little unexpected, but the overall error rate is nevertheless diminished.

The third test compares system performance when the distribution function is calculated as a sum of weighted Gaussian mixtures (spg) and when it is calculated by the "winner take all" method (wta) (Graph 6.3).

[Graph 6.2: System performance (fr, i, d, er per digit, in %) for 10 digits with and without the "garbage" word]
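The paper does not define how fr, i and d are counted; one common way, shown here purely as an assumption, is a Levenshtein alignment of the recognized digit string against the reference:

```python
def digit_errors(ref, hyp):
    """Count substitutions ("false recognitions"), insertions and
    deletions via a standard Levenshtein alignment; the exact scoring
    used in the paper is an assumption.

    ref, hyp : lists of digits (reference and recognized sequences).
    Returns (false_recognitions, insertions, deletions, error_rate).
    """
    # dp[i][j] = (cost, fr, ins, del) for ref[:i] vs hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)            # deletion
    for j in range(1, len(hyp) + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])            # insertion
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = ref[i - 1] != hyp[j - 1]
            a, b, c = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            dp[i][j] = min([
                (a[0] + sub, a[1] + sub, a[2], a[3]),          # match/sub
                (b[0] + 1, b[1], b[2], b[3] + 1),              # deletion
                (c[0] + 1, c[1], c[2] + 1, c[3]),              # insertion
            ])
    _, fr, ins, dele = dp[len(ref)][len(hyp)]
    return fr, ins, dele, (fr + ins + dele) / max(len(ref), 1)

# Example: reference 1 2 3 vs recognized 1 3 3 4 ->
# one false recognition and one insertion.
print(digit_errors([1, 2, 3], [1, 3, 3, 4]))
```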
