classes of words, such as months of the year, days of the week, etc. We use class-based language models to represent the tying among these similar words. Each word w is assigned to a class C(w), where the class can be a singleton set (in which case we can think of the word as itself being a class) or a set with multiple elements. In this case, the word sequence probability for a class-based trigram becomes:

    P(w_i | w_{i-2} w_{i-1}) = P(w_i | C(w_i)) P(C(w_i) | C(w_{i-2}) C(w_{i-1})),

where C(w) represents the word class for word w and P(w_i | C(w_i)) is the within-class probability (or weight) of word w_i in the class C(w_i). For certain classes we use uniform within-class probabilities, and for others we use nonuniform weights.

In our language model for the travel reservation system, we have 42 classes representing cities, states, airports, airlines, months, years, days in a month, hours, minutes, and days of the week, as well as some other classes. We use nonuniform within-class probabilities for cities, obtained as a weighted combination of unigram counts in the training data and airport volume data, so a big city with high airport volume receives a higher probability than a small city. For states, we obtained nonuniform weights by adding up the airport volume data for the airports within each state. For airlines, we used a similar technique, where each airline's probability is proportional to its passenger volume. For the other classes, we used either uniform or heuristic weights. We iterated this process to achieve the best results. For example, we had a class of years with uniform weights for each year from 1990 to 2010. After observing some errors, we changed the weighting to be nonuniform, making the current and following years (1999, 2000, 2001, 2002) more likely, since it is highly unlikely that a caller would want to reserve a flight three years in the future. Small modifications of this kind achieved a significant error reduction.
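To make the factorization concrete, here is a minimal sketch in Python. The class map, within-class weights, and class trigram probabilities below are made-up stand-ins, not the tables used in the actual system (which derives city weights from training counts and airport volume data):

```python
# Toy illustration of the class-based trigram factorization above.

WORD_CLASS = {
    "DALLAS": "[city]", "DULUTH": "[city]",
    "TEXAS": "[state]", "MINNESOTA": "[state]",
}

def word_class(w: str) -> str:
    # A word outside every multi-word class forms its own singleton class.
    return WORD_CLASS.get(w, w)

# Within-class weights P(w | C(w)); for [city] the weights would combine
# training unigram counts with airport volume, so DALLAS outweighs DULUTH.
WITHIN_CLASS = {"DALLAS": 0.6, "DULUTH": 0.1, "TEXAS": 0.5, "MINNESOTA": 0.2}

# Class trigram probabilities P(C_i | C_{i-2} C_{i-1}) estimated from
# class-tagged training text (hypothetical values here).
CLASS_TRIGRAM = {("FLY", "TO", "[city]"): 0.30}

def trigram_prob(w2: str, w1: str, w: str) -> float:
    """P(w | w2 w1) = P(w | C(w)) * P(C(w) | C(w2) C(w1))."""
    c = word_class(w)
    within = 1.0 if c == w else WITHIN_CLASS[w]  # singleton class: P = 1
    return within * CLASS_TRIGRAM.get((word_class(w2), word_class(w1), c), 0.0)

print(trigram_prob("FLY", "TO", "DALLAS"))  # 0.6 * 0.30 = 0.18
print(trigram_prob("FLY", "TO", "DULUTH"))  # 0.1 * 0.30 = 0.03
```

With these toy numbers, a high-volume city such as DALLAS scores six times higher than DULUTH in the same context, mirroring the volume-based weighting described above.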
LM classes are good for increasing coverage and enriching the context. However, we observed a problem with the classes we used for cities and states. In travel reservations, when people talk about their travel plans, they often pronounce the city and state names one after the other, as in "DALLAS TEXAS". If trigram training is used, the bigram "DALLAS TEXAS" has a high probability whereas "DULUTH TEXAS" has a low probability. However, if a class trigram is used with [city] and [state] classes, "[city] [state]" has a high probability regardless of the individual city or state. This causes speech recognition errors such as "SIOUX CITY OHIO" (instead of "SIOUX CITY IOWA") and "DALLAS MINNESOTA" (instead of "DULUTH MINNESOTA"), which would have been highly unlikely if a regular trigram had been used. These errors are highly undesirable for the travel reservation system, since it is hard for the dialog system to recover from them. The same kind of problem occurs for "[city] [airport]" pairs. To overcome these problems, we used the compound word approach, which we explain next.

5.2. Compound Words

Trigram LMs work on words directly. It is advantageous to glue frequently occurring word sequences together and to assume, for the purposes of the LM, that they always appear together. We concatenate such words; e.g., the word sequence "SAN FRANCISCO" can be glued together to form the single token "SAN-FRANCISCO", since it is highly likely that the word "SAN" will appear together with "FRANCISCO". In our system, city names are combined by default, as in "NEW-YORK" and "LOS-ANGELES". The advantages come from two facts: (1) we can use a larger word history (in trigrams) when we consider the sequence as one word; (2) longer words have a higher chance of being recognized correctly, since the number of confusable words decreases as words get longer and acoustically matching a longer word is harder than matching a short one. So, if the decoder gets a high acoustic likelihood for a long word, it is most probably the correct word.

There are some disadvantages to using compound words. (1) Pronunciations must be generated for the compound words, and there may be many variants depending on how many pronunciation variants exist for each constituent word. (2) Allowing optional silence between the constituent words further increases the lexicon size. (3) When words are glued together, a constituent word that occurs in another context becomes very unlikely (since its unigram probability falls), and recognizing it in those other contexts becomes more difficult.

It might also be advantageous to tie other frequently occurring word sequences together. There is an automated method for determining compound words from training data [8]. This method uses the mutual information between words to decide whether to combine them; a rough sketch of the idea appears below. For the travel reservation domain, automatically generated compound words did not improve the recognition rate [4].
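As an illustration of the mutual-information criterion, here is a minimal sketch. The whitespace tokenization, pointwise-mutual-information scoring, minimum-count cutoff, and threshold are assumptions for illustration; the exact selection criteria of the method in [8] may differ:

```python
import math
from collections import Counter

def compound_candidates(corpus, min_count=5, threshold=3.0):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Pairs scoring above `threshold` are candidates for gluing into a
    single compound token; both parameters are illustrative.
    """
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)

    scored = []
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_count:
            continue  # skip pairs too rare to estimate reliably
        # PMI = log [ P(w1 w2) / (P(w1) P(w2)) ], using corpus counts
        pmi = math.log(n12 * total / (unigrams[w1] * unigrams[w2]))
        if pmi >= threshold:
            scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: -item[1])

def glue(sentence, compounds):
    """Rewrite a sentence, replacing selected pairs by hyphenated tokens."""
    words, out, i = sentence.split(), [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in compounds:
            out.append(words[i] + "-" + words[i + 1])  # e.g. SAN-FRANCISCO
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(glue("I WANT TO FLY TO SAN FRANCISCO", {("SAN", "FRANCISCO")}))
# -> I WANT TO FLY TO SAN-FRANCISCO
```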
We have used compound words to overcome the problem we faced with the language model classes [city] and [state]. We glued together the word sequences corresponding to "[city] [state]", allowing only the correct combinations, such as "DALLAS TEXAS" or "LOS-ANGELES CALIFORNIA". These compound words