SPEECH RECOGNITION FOR A TRAVEL RESERVATION SYSTEM

Hakan Erdoğan
IBM TJ Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
erdogan@us.ibm.com

ABSTRACT

In this paper, we present our work on speech recognition for a spoken dialog system for automatic travel reservations. The system can receive speech input from an analog line or from IP telephony through a web site. The system uses speech recognition, natural language understanding, air travel database access, dialog management, natural language generation and speech synthesis technologies to perform the task of an automated travel agent. Language modeling for a mixed-initiative spoken dialog system is a challenging problem. We explore class-based n-gram language models (LMs) for this domain. Details for designing LM classes and assigning non-uniform probabilities within the classes are provided. We point out disadvantages of class-based LMs and introduce compound word vocabularies to overcome the problems observed. Dialog state dependent LMs are explored, which enable incorporating semantic context into the language models. We also analyze the use of embedded context free grammar objects within LMs and point out the advantages and disadvantages of using them. The resulting LMs are implemented and evaluated using the IBM telephony speech recognition engine. Our efforts are shown to decrease the word error rate from 24% to 17% on an evaluation test set.

Keywords: speech recognition, spoken dialog systems, language modeling, context free grammars

1. INTRODUCTION

Automation of travel reservations over the web has been a successful business model for many internet companies. This is a step towards replacing human travel agents with intelligent machines. The next step towards a more human-like experience in automated travel reservations is to use automatic speech recognition and natural language understanding to interact with a machine and book travel reservations using a more natural communication medium, namely speech.

There exist commercial products that perform speech recognition and dialog management for just this purpose. However, current commercial systems have limitations. Callers have to know the system; they cannot say anything they want and have the system understand their needs. Also, these products do not provide a free dialog user interface: the users are directed by the system as to what they can say at a particular state in the dialog. The responses are modeled by context free grammars (CFGs) and hence have to be in a specific format.

DARPA initiated a program called "Communicator" which aims to provide a more natural dialog system for travel reservations compared to current commercial products. A key point in the Communicator program is that the systems have to have a mixed-initiative user interface. Ideally, the systems should be flexible enough to accommodate the needs of both advanced and beginner users, and they should let the users pick what they want to say at any point in the dialog. More information about this task can be found in [1, 2, 3, 4].

The IBM Communicator travel reservation system can be described as an automated travel agent that helps callers make airline reservations using natural language and dialog. The speech source can be POTS telephony or digital IP-based telephony through an e-commerce web site. The goal in the Communicator project is to design a system that behaves as close to a human operator as possible.
The system combines speech recognition, natural language understanding, dialog management, database access, language generation and speech synthesis to perform the desired task. It would be much simpler to design a speech recognition system that accepts limited input and uses fixed grammars as the possible input speech; the challenge comes from the fact that we would like more natural language to be used as input speech.

In this paper, we report our progress on the speech recognition portion of the IBM Communicator system. Our work can be divided into two areas: acoustic modeling and language modeling. We report our detailed acoustic modeling experiments in [5], which also includes a brief description of our language modeling work. In this paper, we focus on the language modeling portion of the recognition system. Some of the improvements mentioned here were included in the live deployed system; others are in the development stage and are planned to be included in the system in the near future.


2. SYSTEM DESCRIPTION

The IBM Communicator system has a Galaxy-II compliant [6] hub architecture, similar to the one described in [7]. The system is composed of a number of modules that interact with each other through the hub. The hub sequentially passes tokens to each module and maintains the state of the system. The modules include the recognizer, TTS, parser, dialog manager and language generation. The architecture is shown in Figure 1. It is possible to replace the telephony interface with an IP interface to acquire audio directly from a web site, which would make it possible to use the system over the internet instead of the current telephony interface.

[Figure 1: Travel reservation system architecture. A telephony (or IP telephony) interface, TTS server, recognition server, parser, dialog manager, travel database interface and natural language generation module communicate through a central hub.]

A typical interaction between the system and a user is as follows:

System: Hello there, please tell me about the first leg of your trip. For example you can say, I want to fly from Los Angeles to Chicago leaving next Tuesday.
User: I want to fly from Beijing to Los Angeles.
System: Flying from Beijing China to Los Angeles International Airport on what date?
User: Tomorrow.
System: What time would you like to leave on Saturday May fifth?
User: Anytime please.
System: I see a couple of flights. (The system goes on to give details about each flight.)
User: I would like the first one.
System: ...
User: ...

The system allows users to give more information in response to each prompt. For example, when asked for the date, the user could have said "Anytime tomorrow" to give both date and time in response to the date question. The user could also have said "No, I am going to Las Vegas not Los Angeles" to correct previous recognition and understanding errors, if any. This is due to the mixed-initiative nature of the system, which allows more natural interaction between the system and the users.

We focus on improving the speech recognition module in this work, specifically language modeling for speech recognition.

3. SPEECH RECOGNITION

Automatic speech recognition has been a research area for 30 years. Significant improvements in computational speed and speech recognition technology have made it an interesting area of business as well. The main goal in speech recognition is to have a machine recognize the words being spoken by an arbitrary person. Applications range from desktop dictation and telephony applications, such as voice portals and directory dialers, to embedded speech recognition applications such as command and control in a car. Initially, studies were done on limited vocabulary isolated speech recognition. More recently, large vocabulary continuous speech recognition systems have been developed and are being made commercially available by several companies.
The fundamental equation of speech recognition is

w^* = \arg\max_w P(w \mid \text{speech signal}),

that is, find the most likely word sequence given the acoustic data. We can rewrite this equation in the following form, which is easier to tackle:

w^* = \arg\max_w P(w \mid x(t)) = \arg\max_w P(x(t) \mid w) P(w),

where x(t) represents the acoustic signal. The first part of the objective function, P(x(t) | w), the probability of generating acoustics x(t) by speaking the words w, is handled by the acoustic model. The second part, P(w), is the probability of the words w themselves in the language. This probability depends on the context of the speech. Language modeling enables us to estimate this probability P(w). The score for each word is determined by adding the log probabilities of the acoustic and language models with appropriate weights. Evaluating the score for every word sequence in a language is impractical in large vocabulary recognition, and there are time alignment issues as well for continuous speech recognition. These issues are handled by efficient search strategies such as Viterbi search and A* search. At IBM, we use a stack decoder (speech recognizer) based on A* search.
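To make the weighted combination concrete, the following is a minimal Python sketch of ranking hypotheses by summed acoustic and scaled language model log probabilities. The hypotheses, scores and weight value are invented for illustration; this is not the IBM decoder's actual scoring code.

```python
def combined_score(acoustic_logprob, lm_logprob, lm_weight=10.0):
    """Weighted sum of acoustic and LM log probabilities.

    Decoders typically scale the LM score (and may add a word
    insertion penalty); lm_weight = 10.0 is only a placeholder.
    """
    return acoustic_logprob + lm_weight * lm_logprob

# Hypothetical hypotheses: (word sequence, log P(x(t)|w), log P(w)).
hypotheses = [
    ("i want to fly to boston", -4200.0, -18.3),
    ("i want to fly to austin", -4205.0, -17.9),
]

best = max(hypotheses, key=lambda h: combined_score(h[1], h[2]))
print(best[0])  # -> 'i want to fly to boston'
```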


Next, we explain acoustic modeling briefly, then detail our work on domain specific language modeling for travel reservations.

4. ACOUSTIC MODELING

State of the art acoustic modeling in speech recognition is performed with Hidden Markov Models (HMMs). A vector of features is extracted from the acoustic signal every 10 ms; typically MFCC features are used. Each phoneme is modeled by a three-state HMM, where the states correspond to the beginning, middle and end of the phoneme. It is also observed that each phoneme behaves differently in different contexts, so phonemes in different contexts are modeled separately. Each state is modeled by a probability distribution parameterized as a mixture of Gaussians. The parameters of an HMM are the Gaussian means, (co)variances, mixture weights and transition probabilities between contiguous states. There is a well known training algorithm for HMMs, the forward-backward (Baum-Welch) algorithm, in which acoustic data as well as the script corresponding to the acoustic data are used to train the model parameters.

We built domain specific acoustic models for our travel reservation system. The training data were collected from internal calls made to the IBM Communicator system, as well as generic telephony data from different domains. Since the travel domain recognizer is required to recognize city names and the training data did not contain all possible city names, it was necessary to add generic telephony data to increase the coverage of the acoustic model for all the different cities and airports. We explain our work on language modeling next.

5. LANGUAGE MODELING

Calculating the probability of a word in a given context is a problem that involves many parameters. The probability of a word in a certain position in a sentence can depend on the general topic being conveyed, the words that occur before and after it, the syntax of the sentence, the meaning (semantics) of the sentence, etc.

Some speech recognition systems use context free grammars (CFGs) to constrain what the user can say to the system. CFGs define a formal language, representable by a finite state machine (FSM), which only allows certain sentences to be spoken. Word sequences that do not correspond to the grammar receive a probability of zero. The probability of an acceptable word sequence is determined by probabilities assigned to each path in the grammar. Grammars also simplify the search problem in speech recognition by eliminating many possibilities during search. However, the applicability of CFGs in spoken language applications is limited. Another method to compute word sequence probabilities is the n-gram.

To simplify the word probability estimation problem, in n-gram language models the probability of a word w_i in a context is assumed to depend on the previous n-1 words only:

P(w_i \mid \text{context}) \approx P(w_i \mid w_{i-n+1} \ldots w_{i-2} w_{i-1}),

where w_j represents the word in position j. The trigram probability P(w_i | w_{i-2} w_{i-1}) and the bigram probability P(w_i | w_{i-1}) are the most used; the unigram probability P(w_i) assumes no context information. The maximum likelihood estimate for an n-gram is found by counting the occurrences of the specific n-gram in the training data and dividing by the count of its (n-1)-gram history. Usually, trigram language models are used, with the probabilities smoothed with bigram and unigram probabilities for robustness and to avoid zero probabilities:

P(w_i \mid \text{con.}) \approx \lambda_3 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i),

where the λ's are estimated from held-out data and depend on the counts of the word sequences w_{i-1} and w_{i-2} w_{i-1} in the training data.
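As an illustration of the estimation and smoothing above, here is a small Python sketch that counts n-grams from a toy corpus and computes the interpolated trigram probability. The fixed λ values stand in for the held-out estimation the paper describes; the corpus is invented.

```python
from collections import Counter

def train_ngrams(sentences):
    """Count unigrams, bigrams and trigrams from tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            uni[padded[i]] += 1
            bi[(padded[i - 1], padded[i])] += 1
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
    return uni, bi, tri

def interpolated_prob(w, h1, h2, uni, bi, tri, lambdas=(0.6, 0.3, 0.1)):
    """P(w | h2 h1) smoothed as l3*P_tri + l2*P_bi + l1*P_uni.

    The lambdas are fixed here for illustration; the paper estimates
    them on held-out data as a function of the history counts.
    """
    total = sum(uni.values())
    p_uni = uni[w] / total if total else 0.0
    p_bi = bi[(h1, w)] / uni[h1] if uni[h1] else 0.0
    h_count = bi[(h2, h1)]
    p_tri = tri[(h2, h1, w)] / h_count if h_count else 0.0
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

corpus = [["i", "want", "to", "fly", "to", "boston"],
          ["i", "want", "to", "fly", "to", "denver"]]
uni, bi, tri = train_ngrams(corpus)
print(interpolated_prob("boston", "to", "fly", uni, bi, tri))
```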
The reason for using a small value of n is that the number of parameters increases exponentially with n, and the training data becomes too sparse to estimate these parameters reliably. Thus, standard smoothed trigram LMs do not model longer correlations between words, nor do they take advantage of linguistic structure or knowledge. However, in most cases they provide satisfactory performance given enough training data and appropriate smoothing.

5.1. Class-based Language Models

The trigram language model is very powerful if there is enough training data to cover a large portion of the possible sentences in a domain. Most of the time, however, the training data does not cover all possible sentences and words, in which case word trigram language models do not accurately represent the language. Furthermore, some words are logically connected, and it makes sense to tie their probabilities together instead of relying on the individual training data counts for these words. Thus, we "regularize" the maximum likelihood estimation of trigram probabilities by using our prior knowledge. This approach increases coverage and context richness. In the travel domain, there are plenty of classes of words, such as months of the year, days of the week, etc. We use class-based language models to represent the tying among these similar words. Each word w is assigned to a class C(w), where the class can be a singleton set (in which case the word is itself a class) or a set with multiple elements. The word sequence probability for a class-based trigram then becomes

P(w_i \mid w_{i-2} w_{i-1}) = P(w_i \mid C(w_i)) P(C(w_i) \mid C(w_{i-2}) C(w_{i-1})),

where C(w) represents the word class for word w and P(w_i | C(w_i)) is the within-class probability (or weight) of word w_i in the class C(w_i). For certain classes we use uniform within-class probabilities, and for some others we use non-uniform weights.
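A minimal sketch of this factorization, with made-up class assignments, within-class weights and class trigram probabilities (none of the numbers come from the paper):

```python
# Hypothetical class assignments; singleton words act as their own
# class with within-class weight 1.0.
WORD_CLASS = {"dallas": "[city]", "houston": "[city]",
              "may": "[month]", "june": "[month]",
              "fly": "fly", "to": "to"}
WITHIN_CLASS = {("[city]", "dallas"): 0.7, ("[city]", "houston"): 0.3,
                ("[month]", "may"): 0.5, ("[month]", "june"): 0.5,
                ("fly", "fly"): 1.0, ("to", "to"): 1.0}
# Class trigram probabilities, as would be trained on a corpus with
# words replaced by their classes; this entry is a made-up value.
CLASS_TRIGRAM = {("fly", "to", "[city]"): 0.2}

def class_trigram_prob(w, h1, h2):
    """P(w | h2 h1) = P(w | C(w)) * P(C(w) | C(h2) C(h1))."""
    c, c1, c2 = WORD_CLASS[w], WORD_CLASS[h1], WORD_CLASS[h2]
    return WITHIN_CLASS[(c, w)] * CLASS_TRIGRAM.get((c2, c1, c), 0.0)

print(class_trigram_prob("dallas", "to", "fly"))   # 0.7 * 0.2 = 0.14
print(class_trigram_prob("houston", "to", "fly"))  # 0.3 * 0.2 = 0.06
```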


In our language model for the travel reservation system, we have 42 classes representing cities, states, airports, airlines, months, years, days of the month, hours, minutes and days of the week, as well as some other classes.

We use non-uniform within-class probabilities for cities, obtained as a weighted combination of unigram counts in the training data and airport volume data, so a big city with high airport volume receives a higher probability than a small city. For states, we obtained non-uniform weights by adding up the airport volume data for the airports within each state. For airlines, we used a similar technique where each airline's probability is proportional to its passenger volume. For the other classes, we used either uniform or heuristic weights. We iterated this process to achieve the best results. For example, we had a class of years with uniform weights for each year from 1990 to 2010. After observing some errors, we changed the weighting to be non-uniform and made the current and following years (1999, 2000, 2001, 2002) more likely, since it is highly unlikely that one would call the system wanting to reserve a flight for three years later. We achieved a significant error reduction with this kind of small modification.

LM classes are good for increasing coverage and enriching the context. However, we observed a problem with the city and state classes. In travel reservations, when people talk about their travel plans, they often pronounce the city and state name one after the other, as in "DALLAS TEXAS". So, if trigram training is used, the bigram "DALLAS TEXAS" has a high probability whereas "DULUTH TEXAS" has a low probability. However, if a class trigram is used with [city] and [state] classes, "[city] [state]" has high probability regardless of the individual city or state. This causes speech recognition errors such as "SIOUX CITY OHIO" (instead of "SIOUX CITY IOWA") and "DALLAS MINNESOTA" (instead of "DULUTH MINNESOTA"), which would have been highly unlikely with a regular trigram. These errors are highly undesirable for the travel reservation system, since it is hard for the dialog system to recover from this type of error. The same kind of problem occurs for "[city] [airport]" pairs. To overcome these significant problems, we used the compound word approach, which we explain next.
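As a rough sketch of how the non-uniform city weights described above might be derived, the following blends normalized training counts with normalized airport passenger volume. The paper only states that a weighted combination was used; the mixing weight alpha and the toy numbers are our assumptions.

```python
def city_weights(unigram_counts, airport_volume, alpha=0.5):
    """Blend normalized training counts with normalized airport volume
    to obtain within-class probabilities P(city | [city]).

    alpha is an illustrative mixing weight; the paper does not give
    the actual combination used.
    """
    total_c = sum(unigram_counts.values())
    total_v = sum(airport_volume.values())
    cities = set(unigram_counts) | set(airport_volume)
    return {c: alpha * unigram_counts.get(c, 0) / total_c
               + (1 - alpha) * airport_volume.get(c, 0) / total_v
            for c in cities}

# Toy numbers: a big hub ends up far more likely than a small city.
print(city_weights({"chicago": 120, "duluth": 3},
                   {"chicago": 38_000_000, "duluth": 300_000}))
```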
5.2. Compound Words

Trigram LMs work on words directly. It is advantageous to glue frequently occurring word sequences together and assume, for the purposes of the LM, that they always appear together. We concatenate such words; e.g., the word sequence "SAN FRANCISCO" can be glued together to form the single token "SAN-FRANCISCO", since it is highly likely that the word "SAN" will appear together with "FRANCISCO". In our system, city names such as "NEW YORK" and "LOS ANGELES" are combined by default. The advantages come from the facts that (1) we can use a larger word history (in trigrams) when we consider such a sequence as one word, and (2) longer words have a higher chance of being recognized correctly, since the number of confusable words decreases as words get longer and acoustically matching a longer word is harder than matching short words. So, if the decoder gets a high acoustic likelihood for a long word, it is most probably the correct word.

There are some disadvantages to using compound words. (1) We need to generate pronunciations for the compound words, of which there may be many variants depending on how many pronunciation variants there are for each individual word. (2) Including possible silence in between the words also increases the lexicon size. (3) When words are glued, if one of the constituent words occurs in another context, it becomes very unlikely (since its unigram probability falls) and recognizing it in other contexts becomes more difficult.

It may also be advantageous to tie other frequently occurring word sequences together. There is an automated method for determining compound words from training data [8], which uses the mutual information between words to decide whether to combine them. For the travel reservation domain, automatically generated compound words did not improve the recognition rate [4].

We used compound words to overcome the problem we faced with the language model classes [city] and [state]. We glued together the word sequences corresponding to "[city] [state]", allowing only the correct combinations such as "DALLAS TEXAS" or "LOS-ANGELES CALIFORNIA". These compound words were then put in an LM class of their own, [city state], to increase coverage. This method helped reduce the word error rate (WER) significantly. The error corrections were at places important for language understanding, and major errors in city names can be avoided using this method. Similarly, we glued together "[city] [airport]" pairs such as "NEW YORK LA GUARDIA" to aid recognition, and put them in an LM class of their own as well.

We also have other compound words, formed by choosing a subset of the word sequences generated by the method in [8]. These include compounds like "I WOULD LIKE TO FLY" and "THAT'S O.K." and many others.

Because of the third disadvantage of compound words, it is beneficial to interpolate the trigram probabilities with ones obtained without the compound words present. The best result is obtained when the regular trigram LM and the compound-word trigram LM are interpolated with equal weighting.
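The sketch below illustrates the two ingredients of this section under stated assumptions: a simplified pointwise-mutual-information criterion standing in for the data-driven method of [8], and a greedy left-to-right gluing pass. The threshold and the corpus are invented.

```python
import math
from collections import Counter

def compound_candidates(sentences, threshold=1.5):
    """Score adjacent word pairs by pointwise mutual information and
    keep those above a threshold as compound-word candidates.

    A simplified stand-in for the data-driven method of [8]; the
    threshold is arbitrary.
    """
    uni, bi, n = Counter(), Counter(), 0
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        n += len(words)
    return {pair for pair, c in bi.items()
            if math.log((c / n) /
                        ((uni[pair[0]] / n) * (uni[pair[1]] / n)))
            >= threshold}

def glue(words, compounds):
    """Greedy left-to-right pass replacing candidate pairs with one token."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in compounds:
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

corpus = [["fly", "to", "san", "francisco"],
          ["i", "live", "in", "san", "francisco"]]
pairs = compound_candidates(corpus)
print(glue(["fly", "to", "san", "francisco"], pairs))
# e.g. ['fly_to', 'san_francisco'] on this toy corpus
```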


5.3. Dialog State Dependent Language Modeling

The dialog manager in the travel reservation system can provide feedback to the speech recognition module about the state of the dialog. We use a form-based dialog manager (FDM) which fills out forms according to user input. At a given point in the dialog, the FDM maintains the state of the dialog and prompts for new information from the user to finish the transaction. This information is useful because it can help model the language of the response. For example, if the dialog manager is asking about the target destination, a city name is highly expected in the response. Similarly, if a question about the departure time is being presented, the response will most likely contain time information. The dialog state can also be inferred indirectly from the system prompt using some rules. The state dependent trigram probability is given by

P^{tri}_S(w_i) = P(w_i \mid w_{i-2} w_{i-1}, S),

where S represents the current dialog state. We have dialog states in our system corresponding to departure and arrival dates and times, departure and arrival city, airline, and confirmations. To use this information, we built language models trained from corpora of user responses recorded at each specific state. We interpolate these language models with the base trigram LM to obtain a different LM for each state. This is done because we do not have enough data for each state to train the LM, and because the answers do not always comply with the question, since this is an open dialog system. Dialog state feedback for language modeling improves the recognition rate by about 5% relative.

It should be noted that, since our training data for each state was limited, it is better to use a bigram language model for each state in order not to overtrain the system. However, if enough data can be obtained for each state, a trigram LM can be used.
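A sketch of the state-conditioned interpolation follows; the DictLM class, state name, probabilities and the 0.5 mixing weight are illustrative stand-ins, with the weight in practice tuned on held-out data.

```python
class DictLM:
    """Toy trigram LM backed by a dict keyed by (h2, h1, w)."""
    def __init__(self, probs, floor=1e-6):
        self.probs, self.floor = probs, floor

    def prob(self, w, h1, h2):
        return self.probs.get((h2, h1, w), self.floor)

def state_dependent_prob(w, h1, h2, base_lm, state_lms, state, mix=0.5):
    """P(w | h2 h1, S): interpolate the state-specific LM with the base
    trigram LM, falling back to the base LM for states without a model."""
    base = base_lm.prob(w, h1, h2)
    state_lm = state_lms.get(state)
    if state_lm is None:
        return base
    return mix * state_lm.prob(w, h1, h2) + (1 - mix) * base

base = DictLM({("fly", "to", "boston"): 0.010})
state_lms = {"departure_city": DictLM({("fly", "to", "boston"): 0.050})}
print(state_dependent_prob("boston", "to", "fly",
                           base, state_lms, "departure_city"))  # 0.030
```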
5.4. Embedded CFGs

Context free grammars (CFGs) represent a finite language which allows only certain word sequences; they can be represented by a finite state machine. CFGs are widely used in language processing for parsing sentences. CFGs with probabilistic paths through the grammar are called stochastic CFGs (SCFGs). DARPA Communicator is a free-form mixed-initiative dialog system, so direct application of context free grammars is not appropriate. However, within the utterances there are well structured portions (phrases), such as the parts describing the arrival or departure dates, times and places. The IBM speech recognition engine has the recently added capability of using embedded SCFG objects within the language models [9]. The portions of speech that match a grammar are tokenized with a special token, and a trigram LM is trained with the new tokens in place. During decoding, if a word is hypothesized that occurs in the first position of a grammar, then the grammar is hypothesized and that path is scored using the grammar. This scoring is incorporated into IBM's stack decoder. The process is equivalent to a language model probability as follows:

P(w_i) = P(\ldots w_i \mid G_j) P(G_j \mid G_{j-2} G_{j-1}),

where G_j represents the grammar object that includes a path containing the word w_i, and the trigram history is considered over the grammar objects (we allow single-word grammars as well). The function of embedded CFGs is similar to using compound words and LM classes with non-uniform within-class probabilities, so it has the advantages listed above for the compound word approach. The difference is that with CFGs there is no need to define compound words, so it is possible to capture a wider variety of word sequences with minimal effort. Another advantage is that a longer silence can be inserted between the individual words and they will still be recognized. However, when highly inclusive grammars are used, there are some disadvantages, as we explore in the following.

Since we had already worked on the compound words and classes mentioned before, we designed new embedded CFGs to have broader coverage than the LM classes, such as a place grammar containing expressions like "FROM [city] TO [city]" as well as "FROM [city]" and "TO [city]" for all cities. We also designed different CFGs for dates and times, which allowed expressions such as "MONDAY SEPTEMBER THIRD TWO THOUSAND ONE" and "TEN A.M. IN THE MORNING" and many more. The probabilities for paths in the grammars were assigned using common sense. Using these CFGs, a possible parsing of a travel domain sentence is:

I WANT TO FLY [FROM NEW YORK]_{place} [NEXT TUESDAY]_{date} [IN THE MORNING]_{time}
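The following sketch mimics only the tokenization step described above, replacing grammar matches in training text with a class-like token before n-gram training. Real embedded SCFG objects are richer than the regular expressions assumed here.

```python
import re

# Hypothetical grammars written as regular expressions over the word
# string; they only stand in for the SCFG matching step.
GRAMMARS = {
    "[place]": re.compile(r"\b(?:from|to) (?:new york|boston|denver)\b"),
    "[date]": re.compile(r"\bnext (?:monday|tuesday|wednesday)\b"),
}

def tokenize_for_lm(sentence: str) -> str:
    """Replace grammar matches with their token so the trigram LM is
    trained over sequences like 'i want to fly [place] [date]'."""
    for token, pattern in GRAMMARS.items():
        sentence = pattern.sub(token, sentence)
    return sentence

print(tokenize_for_lm("i want to fly from new york next tuesday"))
# -> 'i want to fly [place] [date]'
```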


Although this approach seemed promising, we were disappointed to see that the error rate increased when the embedded CFGs were used. We determined the following reasons for the failure of this approach:

• The weights (probabilities) within the CFGs are assigned subjectively, which is not usually appropriate for speech recognition purposes; trigram probabilities capture the word probabilities better. Ideally, we would like to train these weights using a training corpus, which we are currently exploring.

• The grammars should be designed to represent the same logical entities, where the context is expected to be exactly the same for each path in the grammar. For example, if "ON DELTA" is in the same grammar as "A DELTA FLIGHT", that hurts performance, since the sequence "FLY ON DELTA" is more likely than "FLY A DELTA FLIGHT" but the CFG inherently assumes they are the same. The grammars should be designed with this point in mind.

• If a word appears in two or more grammars, or in different paths within a grammar, the LM probability of the word decreases compared to the n-gram LM. The reason is that when computing the LM probability of the current word, only one path through the grammar is considered. So, grammars should be designed to make sure that each word appears in only one grammar and that it never (or rarely) occurs outside that grammar.

Because of the above problems, the best results are obtained when the embedded CFG LM probabilities are interpolated with regular smoothed trigram LM probabilities; we used equal weighting between the two.

We revised our CFGs after observing the above points and redesigned new embedded CFGs corresponding to dates, times, cities and "city state" pairs, airports and airlines. We experimented with different grammar designs. However, the best result we obtained was only slightly better than the class-based LM on some test sets and worse on others. We are exploring new grammar designs and working on ways to estimate the within-grammar weights more accurately from training data. We present our experimental results next.

6. EXPERIMENTAL RESULTS

The acoustic and language models were developed and tested in the IBM Communicator system, which is used for flight reservations. We trained the acoustic models using 380 hours of travel domain speech from about 4000 speakers and 250 hours of general telephony domain speech. Language models were trained using 100K sentences from the travel domain collected from various sources. The numbers in the corpus were tagged using the method described in [10] to help in language modeling. A held-out set of 18K sentences was used to optimize the trigram smoothing parameters.

Our main test set is composed of a subset of the calls that the IBM Communicator system received during the NIST evaluation of the various DARPA Communicator systems in June 2000 (1173 utterances out of 1262 received).

The baseline model is the IBM speech recognition system used during the June 2000 NIST evaluation. The acoustic model (old AM) is a generic telephony system trained from a wide variety of telephony data (600 hours of speech).
The language model (old LM) is trained using 90K sentences from the travel domain and has LM classes and no compound words except city names.

As mentioned above, we added 10K more sentences to the LM training data, redesigned the classes, and added new classes and compound words as described in Section 5. A few such LMs were designed, and the best one ("new LM") was chosen among them.

Table 1 shows the overall improvement in word error rate due to the new acoustic and language models. The results are on the main test set.

Acoustic Model   Language Model   WER
old AM           old LM           23.70%
old AM           new LM           20.63%
new AM           old LM           19.03%
new AM           new LM           17.79%

Table 1: Comparison between the old and new AM and LM; a large overall reduction is shown.

As shown in Table 1, we achieved a significant reduction in error rate with the new acoustic and language models. The addition of compound words and the adjustment of the non-uniform within-class weights improved the error rate by about 15% relative.


To test dialog state dependent language modeling, we used six dialog states, and all dialog feedback states were mapped to these six states. For each state, we used about 500-1000 state dependent sentences for training. We did not have enough data for dialog state dependent LM training; we believe dialog state feedback can improve performance more if more training data is collected.

The effects of the dialog state feedback and the embedded CFGs are shown in Table 2, on three test sets. "main" is the same test set as in Table 1; the other two are decodings of the data received by two other sites during the evaluation. For anonymity, the names of the sites are not mentioned.

Table 2 shows that the dialog state feedback reduces the error rate by about 5%. For the other sites, the dialog state was estimated from their system prompts (for testing purposes) using some rules, since the real feedback information was not available. For the embedded CFG LM results in the third row, no dialog state feedback is used. Embedded CFGs increased the error rate for the IBM-received calls but reduced it for the other two sites. This could be due to the different nature of the calls received by different sites: some sites have a more directed dialog system that encourages more structured responses, which might explain why the embedded grammars helped more for the other sites.

Language Model           main     test2    test3
new LM                   17.79%   19.70%   27.74%
dialog state dependent   16.91%   19.15%   27.45%
embedded CFGs            19.19%   17.23%   27.23%

Table 2: Comparison showing the LM with NLU state feedback and with embedded grammars. The same acoustic model is used throughout.

7. CONCLUSION

Our innovative work in language modeling reduced speech recognition errors by a substantial amount on a travel reservation spoken dialog system. Most of the reduction was obtained by carefully retraining language models using domain specific data and by increasing the LM coverage using classes. Including "[city] [state]" compound words in the vocabulary was shown to be effective for better performance. Dialog context dependent language modeling improved the error rate consistently over all test sets. However, using embedded CFG objects in n-gram language models did not uniformly improve the error rate. Our work is in progress in several directions: (1) better design of embedded grammars, (2) rescoring lattices using more elaborate LM techniques using CFGs, and (3) automatic training of within-grammar weights using a training corpus.

8. ACKNOWLEDGEMENTS

The author would like to thank Yuqing Gao, Andy Sakrajda, Adwait Ratnaparkhi, Kevin Smith, Xiaoqiang Luo, Kishore Papineni, Todd Ward, Salim Roukos, and many other members of the IBM HLT department for designing, building and developing the other parts of the dialog system.

REFERENCES

[1] E. Levin et al., "The AT&T-Darpa Communicator mixed-initiative spoken dialogue system," in International Conference on Spoken Language Processing, 2000.
[2] B. Pellom, W. Ward, and S. Pradhan, "The CU Communicator: An architecture for dialogue systems," in International Conference on Spoken Language Processing, 2000.
[3] A. I. Rudnicky et al., "Task and domain specific modelling in the Carnegie Mellon Communicator system," in International Conference on Spoken Language Processing, 2000.
[4] A. Aaron et al., "Speech recognition for Darpa Communicator," in Proc. IEEE Conf. Acoust. Speech Sig. Proc., 2001.
[5] Y. Gao, H. Erdogan, Y. Li, V. Goel, and M. Picheny, "Recent advances in speech recognition system for IBM Darpa Communicator," in European Conference on Speech Communication and Technology, 2001.
[6] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue, "Galaxy-II: A reference architecture for conversational system development," in International Conference on Spoken Language Processing, 1998.
[7] K. Davies et al., "The IBM conversational telephony system for financial applications," in European Conference on Speech Communication and Technology, 1999.
[8] G. Saon and M. Padmanabhan, "Data-driven approach to designing compound words for continuous speech recognition," in Automatic Speech Recognition and Understanding Workshop, 1999.
[9] M. Monkowski, "Embedded grammar objects within the language models," Technical report, IBM, 2001.
[10] X. Luo and M. Franz, "Semantic tokenization of verbalized numbers in language modeling," in International Conference on Spoken Language Processing, volume 1, pp. 158-161, 2000.
