“FROM [city]” and “TO [city]” for all cities. We also designed different CFGs for dates and times, which allowed expressions such as “MONDAY SEPTEMBER THIRD TWO THOUSAND ONE” and “TEN A.M. IN THE MORNING” and many more. The probabilities for paths in the grammars were assigned using common sense. Using these CFGs, a possible parsing of a travel domain sentence is shown below:

    I WANT TO FLY {FROM NEW YORK}_place {NEXT TUESDAY}_date {IN THE MORNING}_time

Although this approach seemed promising, we were disappointed to see that the error rate increased when the embedded CFGs were used. We have determined the following reasons for the failure of this approach:

• The weights (probabilities) within the CFGs are assigned subjectively, and this is not usually appropriate for speech recognition purposes. Trigram probabilities capture the word probabilities better. Ideally, we would like to train these weights using a training corpus, which we are currently exploring.

• The grammars should be designed to represent the same logical entities, where the context is expected to be exactly the same for each path in the grammar. For example, if “ON DELTA” is in the same grammar as “A DELTA FLIGHT”, that hurts performance, since the sequence “FLY ON DELTA” is more likely than “FLY A DELTA FLIGHT”, but the CFG inherently assumes they are the same. So the grammars should be designed keeping this point in mind.

• If a word appears in two or more grammars, or a word appears in different paths in a grammar, the LM probability of the word decreases compared to the n-gram LM. The reason is that when computing the LM probability of the current word, only one path through the grammar is considered. So grammars should be designed to make sure that each word appears in only one grammar and that it never (or rarely) occurs outside that grammar.

Because of the above problems, the best results are obtained when embedded CFG LM probabilities are interpolated with regular smoothed trigram LM probabilities. We used equal weighting between the two.
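As a small illustrative sketch of this interpolation (the function name and the probability values below are hypothetical, not taken from the system described here), the equal-weight combination of the two per-word probabilities can be written as:

def interpolated_lm_prob(p_cfg, p_trigram, weight=0.5):
    # Linear interpolation of an embedded-CFG LM probability with a
    # smoothed trigram LM probability; weight=0.5 corresponds to the
    # equal weighting mentioned above.
    return weight * p_cfg + (1.0 - weight) * p_trigram

# Hypothetical probabilities for the next word given its history:
p_cfg = 0.20      # probability of the word along its embedded-CFG path
p_trigram = 0.05  # probability of the word under the smoothed trigram LM
print(interpolated_lm_prob(p_cfg, p_trigram))  # 0.125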
We revised our CFGs after observing the above points, and redesigned new embedded CFGs corresponding to dates, times, cities and “city state” pairs, airports, and airlines. We experimented with different grammar designs. However, the best result we obtained was only slightly better than the class-based LM on some test sets and worse on others. We are exploring new grammar designs and working on ways to estimate within-grammar weights more accurately from training data. We present our experimental results next.

6. EXPERIMENTAL RESULTS

The acoustic and language models were developed and tested in the IBM Communicator system, which is used for flight reservations. We trained the acoustic models using 380 hours of travel domain speech from about 4000 speakers and 250 hours of general telephony domain speech. Language models were trained using 100K sentences from the travel domain collected from various sources. The numbers in the corpus were tagged using the method described in [10] to help in language modeling. A held-out set of 18K sentences was used to optimize the trigram smoothing parameters.

Our main test set is composed of a subset of the calls that the IBM Communicator system received during the NIST evaluation of various DARPA Communicator systems in June 2000 (1173 utterances out of 1262 received).

The baseline model is the IBM speech recognition system used during the June 2000 NIST evaluation. The acoustic model (old AM) is a generic telephony system trained from a wide variety of telephony data (600 hours of speech). The language model (old LM) is trained using 90K sentences from the travel domain and has LM classes and no compound words except city names.

As mentioned above, we added 10K more sentences to the LM training data, redesigned the classes, and added new classes and compound words as described in Section 5. A few such LMs were designed, and the best one (“new LM”) was chosen among them.

Table 1 shows the overall improvement in word error rate due to the new acoustic and language models. The results are on the main test set.

    Acoustic Model   Language Model   WER
    old AM           old LM           23.70%
    old AM           new LM           20.63%
    new AM           old LM           19.03%
    new AM           new LM           17.79%

    Table 1: Comparison between old and new AM and LM. Overall, a large reduction is shown.

As shown in Table 1, we achieved a significant reduction in error rate with the new acoustic and language models.
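As a quick sanity check on the size of that reduction (the arithmetic below simply uses the WER values from Table 1; the variable names are ours):

baseline_wer = 23.70  # old AM + old LM, from Table 1
best_wer = 17.79      # new AM + new LM, from Table 1

absolute_reduction = baseline_wer - best_wer
relative_reduction = 100.0 * absolute_reduction / baseline_wer

print(f"absolute WER reduction: {absolute_reduction:.2f} points")  # 5.91 points
print(f"relative WER reduction: {relative_reduction:.1f}%")        # 24.9%

In other words, moving from the old to the new acoustic and language models removes roughly a quarter of the word errors on the main test set.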
