
Large Vocabulary Continuous Speech Recognition - Berlin Chen


Large Vocabulary Continuous Speech Recognition

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. X. L. Aubert, "An Overview of Decoding Techniques for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, vol. 16, 2002
2. H. Ney, "Progress in Dynamic Programming Search for LVCSR," Proceedings of the IEEE, August 2000
3. S. Ortmanns, H. Ney, "The Time-Conditioned Approach in Dynamic Programming Search for LVCSR," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, November 2000
4. S. Ortmanns, H. Ney, X. Aubert, "A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, vol. 11, 1997


Why Is LVCSR Difficult? (1/2)

• The software complexity of a search algorithm is considerably high
• The effort required to build an efficient decoder is quite large


Why Is LVCSR Difficult? (2/2)

• Maximum approximation of the decoding problem:

$$
\hat{W} = \arg\max_{W} p(X \mid W)\,P(W)
        = \arg\max_{w_1^n} \Big\{ P(w_1^n) \sum_{s_1^T} p(X, s_1^T \mid w_1^n) \Big\}
        \approx \arg\max_{w_1^n} \Big\{ P(w_1^n)^{\alpha} \max_{s_1^T} p(X, s_1^T \mid w_1^n) \Big\}
$$

  – $s_1^T$: the state sequence corresponding to the word sequence $w_1^n$
  – $\alpha$: a heuristic language model scaling factor


Two Major Constituents of LVCSR

• Front-end processing is a spectral analysis that derives feature vectors to capture salient spectral characteristics of the speech input
  – Digital signal processing
  – Feature extraction
• Linguistic decoding combines word-level matching and sentence-level search to perform an inverse operation that decodes the message from the speech waveform
  – Acoustic modeling
  – Language modeling
  – Search


Classification of Approaches to LVCSR Along Two Axes

• Network expansion
  – Dynamic expansion of the search space during decoding
    • Tree-structured n-gram network
    • Re-entrant (word-conditioned) vs. start-synchronous (time-conditioned)
  – (Full) static expansion of the search space prior to decoding
    • Weighted Finite-State Transducers (WFST), AT&T
      – Composition (A ∘ L ∘ G), determinization, (pushing), and minimization of WFSTs
    • Static tree-based representations
• Search strategy
  – Time-synchronous (Viterbi + beam pruning): breadth-first (parallel)
  – Time-asynchronous (A* or stack decoding): best-first (sequential)


Word-Conditioned vs. Time-Conditioned Search (1/3)

• Word-conditioned search
  – A virtual tree copy is explored for each active LM history
  – More complicated in HMM state manipulation (dependent on the LM complexity)
• Time-conditioned search
  – A virtual tree copy is entered at each time frame by the word-end hypotheses ending at that time
  – More complicated in LM-level recombination

(Figure: schematic of word-conditioned vs. time-conditioned tree copies)


Word-Conditioned vs. Time-Conditioned Search (2/3)

• Schematic comparison
  – Word-conditioned search: the number of tree copies depends on the complexity of the LM constraints (at most |V|^(n-1) copies, e.g. |V| for a bigram and |V|² for a trigram constraint, where |V| is the lexicon size)
  – Time-conditioned search: the number of tree copies is independent of the complexity of the LM constraints, but depends on the duration of the utterance (in practice about 30–50 copies are active, given the average word duration)


Word-Conditioned vs. Time-Conditioned Search (3/3)

• Word-conditioned search (with a bigram LM) vs. time-conditioned search (with an m-gram LM)
• Both algorithms are stated in a "look-backward" manner


Word-Conditioned Tree-Copy Search (WC)

• Assume trigram language modeling is used
  – The pronunciation lexicon is structured as a tree
  – Due to the constraints of n-gram language modeling, a word's occurrence depends on the previous n-1 words
  – Search through all possible tree copies from the start time to the end time of the utterance to find the best sequence of word hypotheses
• Each search node is identified by the tuple (history, arc, state); language model look-ahead and acoustic look-ahead can be applied (see the sketch below)

(Figure: trigram tree copies over the Chinese word lattice 飛蛾/肺炎 → 一群/疫情 → 影響 → 生活/生命 → 很大/頗大 → 造成, with histories such as P(‧| sil 飛蛾), P(‧| 飛蛾 一群), …; |V| trees at the first word position and |V|² trees thereafter)
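A minimal sketch of how such a search node might be represented in C++. All names are illustrative and not taken from the NTNU decoder; the point is only that a hypothesis is keyed by the (history, arc, state) triple named on the slide.

    #include <array>
    #include <cstdint>

    // One active path hypothesis in word-conditioned tree-copy search.
    // It is uniquely identified by the (history, arc, state) triple.
    struct WCHypothesis {
        std::array<int32_t, 2> history; // previous n-1 word IDs (n = 3 for a trigram LM)
        int32_t arc;                    // arc (phone unit) within the lexical tree copy
        int16_t state;                  // HMM state index inside that arc
        double  score;                  // accumulated log-probability of the partial path
        int32_t backpointer;            // predecessor word-end hypothesis, for traceback
    };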


Lexical/Phonetic Tree (1/2)

• Each arc stands for a phonetic unit
• The application of language modeling is delayed until leaf nodes are reached
  – Word identities are only defined at leaf nodes (solution: language model smearing / look-ahead)
  – Each leaf node is shared by words having the same pronunciation

(Figure: a leaf node for the word "they", reached with P(they | say) and P(they | tell))


Lexical/Phonetic Tree (2/2)

• Reasons for using the lexical/phonetic tree
  – States corresponding to phones that are common to different words are shared by different hypotheses
    • A compact representation of the acoustic-phonetic search space
  – The uncertainty about a word's identity is much higher at its beginning than at its ending
    • More computation is required at the beginning of a word than toward its end


Linear Lexicon vs. Tree Lexicon (for Word-Conditioned Search)

• In the general n-gram case (n ≥ 1), the total number of prefix tree copies equals |V|^(n-1), where |V| is the lexicon size and n the LM order
• When using a (flat) linear lexicon, in the general n-gram case (n ≥ 2) the total number of copies of the linear lexicon is only |V|^(n-2), since the word identity is already known at the word start
  – E.g., for trigram modeling: |V| linear-lexicon copies vs. |V|² tree copies


Context-Dependent Phonetic Constraints

• Context-dependent modeling (e.g., cross-word (CW) triphone modeling) is a crucial factor regarding the search space at the junctions between successive words
  – Implementing CW modeling would be much more difficult for time-conditioned than for word-conditioned search (why?)
• The current NTNU system is implemented with intra-word INITIAL/FINAL or triphone modeling


More Details on Word-Conditioned Search


Word-Conditioned Search (1/3)

• Word (history)-conditioned search
  – A virtual/imaginary tree copy is explored for each linguistic context of the active search hypotheses
  – Search hypotheses are recombined at tree root nodes according to the language model (i.e., the history)
    • For n-gram language modeling: retain distinct (n-1)-gram word histories
  – Time is the independent variable; expansion proceeds in parallel in a "breadth-first" manner

(Figure: tree copies for the histories 一群 影響 and 疫情 影響, with probabilities such as P(‧| 一群 影響) and P(‧| 影響 生活); |V|² trees per position under a trigram LM)


Word-Conditioned Search (2/3)

• Integration of acoustic and linguistic knowledge by means of three-dimensional coordinates
  – A network (dynamically) built to describe sentences in terms of words
    • Language models supply the network transition probabilities
  – A network (statically) built to describe words in terms of phones
    • The pronunciation dictionary (organized as a phonetic tree)
    • Transition penalties are applied
  – A network (statically) built to describe a phone unit in terms of sequences of HMM states
    • Spectral vectors derived from the speech signal are consumed here


Word-Conditioned Search (3/3)

• Three basic operations are performed (backtracking information must also be maintained):
  – Acoustic-level recombination within tree arcs (Viterbi search; a sketch of this step follows the slide)

    $$Q_{v_1^{n-1}}(t, s; \text{arc}) = \max_{s'} \big[ Q_{v_1^{n-1}}(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – Tree arc extensions (the beginning state $S_0$ of an arc is fed from the ending state $S_M$ of its predecessor arc')

    $$Q_{v_1^{n-1}}(t, S_0; \text{arc}) = Q_{v_1^{n-1}}(t-1, S_M; \text{arc}')$$

  – Language-model-level recombination: word-end hypotheses sharing the same history are recombined, and the result starts up the next tree copy

    $$H(v_2^n; t) = \max_{v_1} \big[ Q_{v_1^{n-1}}(t, S_{v_n}; \text{arc}_E)\, P(v_n \mid v_1^{n-1})^{\alpha} \big], \qquad Q_{v_2^n}(t, S_0; \text{arc}_B) = H(v_2^n; t)$$

(Figure: for n = 3, the word-end hypotheses 一群 影響 生活 and 疫情 影響 生活 are recombined using P(生活 | 一群 影響) and P(生活 | 疫情 影響))
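A minimal sketch of the acoustic-level recombination step in C++, assuming log-domain scores and a generic HMM topology described by a transition matrix; the function and variable names are illustrative, not the NTNU implementation.

    #include <algorithm>
    #include <limits>
    #include <vector>

    constexpr double kLogZero = -std::numeric_limits<double>::infinity();

    // One time step of within-arc Viterbi recombination for a single tree arc:
    //   Q(t, s) = max_{s'} [ Q(t-1, s') + log P(s | s') ] + log p(x_t | s)
    // 'prev' holds Q(t-1, .) for this arc, 'logTrans[sPrev][s]' the transition
    // log-probabilities, and 'logEmit[s]' the emission log-likelihoods at frame t.
    std::vector<double> ViterbiStep(const std::vector<double>& prev,
                                    const std::vector<std::vector<double>>& logTrans,
                                    const std::vector<double>& logEmit) {
        const size_t numStates = prev.size();
        std::vector<double> cur(numStates, kLogZero);
        for (size_t s = 0; s < numStates; ++s) {
            double best = kLogZero;
            for (size_t sp = 0; sp < numStates; ++sp)        // recombine over predecessors
                best = std::max(best, prev[sp] + logTrans[sp][s]);
            if (best > kLogZero) cur[s] = best + logEmit[s]; // add acoustic score of frame t
        }
        return cur;
    }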


WC: Implementation Issues (1/3)

• Path hypotheses at each time frame are differentiated based on
  – The n-1 word history (for an n-gram LM)
  – The phone unit (i.e., the tree arc)
  – The HMM state
• Organization of active path hypotheses (states)
  – Hierarchical organization ("a data-driven approach"): the active HMM states (path hypotheses) at frame t are stored per active tree (LM history), per active arc, per active state
    • Node(history, arc, state); NT = number of active trees, NA = number of active arcs, NS = number of active states


WC: Implementation Issues (2/3)

• Organization of active path hypotheses (states)
  – Flat organization: the active HMM states at frame t are kept in a single associative list indexed by (history v_1^{n-1}, arc a_k, state s_j); processing frame t generates the new active states at frame t+1 in a data-driven manner
  – Acoustic-level recombination is performed on this list:

    $$Q_{v_1^{n-1}}(t, s; \text{arc}) = \max_{s'} \big[ Q_{v_1^{n-1}}(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – The C++ STL (Standard Template Library) is suitable for such access: any HMM state can be located in O(log NS) time, where NS is the number of active HMM states (see the sketch below)
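A minimal sketch of the flat organization using the C++ STL, assuming bigram histories (a single predecessor word ID) and log-domain scores; the type and function names are illustrative.

    #include <cstdint>
    #include <map>
    #include <tuple>

    // Key = (LM history, tree arc, HMM state); value = best log score so far.
    // std::map gives O(log NS) lookup/insert, matching the slide's estimate.
    using StateKey     = std::tuple<int32_t /*history*/, int32_t /*arc*/, int16_t /*state*/>;
    using ActiveStates = std::map<StateKey, double>;

    // Recombination when two partial paths reach the same (history, arc, state):
    // keep only the better score (Viterbi / maximum approximation).
    inline void Recombine(ActiveStates& active, const StateKey& key, double logScore) {
        auto [it, inserted] = active.emplace(key, logScore);
        if (!inserted && logScore > it->second) it->second = logScore;
    }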


WC: Implementation Issues (3/3)

• The pronunciation lexicon is organized as a "trie" (tree) structure, using a first-child/next-sibling representation; the field comments below are inferred from the names and from how the fields are used later (a usage sketch follows the slide):

    typedef struct Tree {            /* one node (arc) of the pronunciation trie  */
        short  Model_ID;             /* phone/acoustic model labeling this arc    */
        short  WD_NO;                /* number of words ending at this node       */
        int   *WD_ID;                /* IDs of those words (NULL if none)         */
        int    Leaf;                 /* non-zero if this node is a word end       */
        double Unigram;              /* unigram LM look-ahead value at this node  */
        struct Tree *Child;          /* first child                               */
        struct Tree *Brother;        /* next sibling                              */
        struct Tree *Father;         /* parent                                    */
    } DEF_LEXICON_TREE;

(Figure: a tree with nodes A–K and the corresponding trie representation)
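A minimal sketch of how one pronunciation might be inserted into this trie; the helper names (GetOrAddChild, InsertWord) are illustrative, memory-management and error handling are omitted, and the root node is assumed to be zero-initialized.

    #include <cstdlib>

    /* Find (or create) the child of 'node' labeled with phone unit 'model_id'. */
    static DEF_LEXICON_TREE *GetOrAddChild(DEF_LEXICON_TREE *node, short model_id) {
        for (DEF_LEXICON_TREE *c = node->Child; c != NULL; c = c->Brother)
            if (c->Model_ID == model_id) return c;          /* shared prefix found */
        DEF_LEXICON_TREE *c =
            (DEF_LEXICON_TREE *)std::calloc(1, sizeof(DEF_LEXICON_TREE));
        c->Model_ID = model_id;
        c->Father   = node;
        c->Brother  = node->Child;                          /* prepend to sibling list */
        node->Child = c;
        return c;
    }

    /* Insert one pronunciation (a sequence of phone-model IDs) into the trie. */
    void InsertWord(DEF_LEXICON_TREE *root, const short *phones, int len, int word_id) {
        DEF_LEXICON_TREE *node = root;
        for (int i = 0; i < len; ++i)
            node = GetOrAddChild(node, phones[i]);
        node->Leaf  = 1;                                    /* mark the word end */
        node->WD_ID = (int *)std::realloc(node->WD_ID, (node->WD_NO + 1) * sizeof(int));
        node->WD_ID[node->WD_NO++] = word_id;               /* homophones share this leaf */
    }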


WC: Pruning of Path Hypotheses (1/5)

• Viterbi search
  – Belongs to the class of breadth-first search
    • Time-synchronous: hypotheses terminate at the same point in time and can therefore be compared
  – The number of search hypotheses grows exponentially, so pruning away unlikely (incorrect) paths is needed
    • Viterbi beam search: hypotheses whose likelihood falls within a fixed radius (beam) of the most likely hypothesis are retained
    • The beam size is determined empirically or adaptively


WC: Pruning of Path Hypotheses (2/5)

• Pruning techniques (see the sketch below)
  1. Standard beam pruning (acoustic-level pruning): retain only hypotheses with a score close to the best hypothesis

     $$\text{Thr}_{AC}(t) = \Big[ \max_{v_1^{n-1},\, s,\, \text{arc}} Q_{v_1^{n-1}}(t, s; \text{arc}) \Big] \times f_{AC}, \qquad Q_{v_1^{n-1}}(t, s; \text{arc}) < \text{Thr}_{AC}(t) \Rightarrow \text{pruned!}$$

  2. Language model pruning (word-level pruning): applied to word-end or tree start-up hypotheses

     $$\text{Thr}_{LM}(t) = \Big[ \max_{v_1^{n-1}} Q_{v_1^{n-1}}(t, S_0; \text{arc}_B) \Big] \times f_{LM}, \qquad Q_{v_1^{n-1}}(t, S_0; \text{arc}_B) < \text{Thr}_{LM}(t) \Rightarrow \text{pruned!}$$
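A minimal sketch of both pruning steps in C++, assuming scores are kept in the log domain, so the multiplicative beam factors f_AC and f_LM become additive log beams; the container and names are illustrative.

    #include <algorithm>
    #include <vector>

    struct Hyp { double logScore; bool isWordEnd; };

    // Apply acoustic-level and word-level beam pruning to the active hypotheses
    // of one time frame. logBeamLM is typically tighter (smaller) than logBeamAC.
    void PruneFrame(std::vector<Hyp>& active, double logBeamAC, double logBeamLM) {
        if (active.empty()) return;
        double bestAC = active.front().logScore;
        double bestLM = -1e300;
        for (const Hyp& h : active) {
            bestAC = std::max(bestAC, h.logScore);
            if (h.isWordEnd) bestLM = std::max(bestLM, h.logScore);
        }
        const double thrAC = bestAC - logBeamAC;   // Thr_AC(t)
        const double thrLM = bestLM - logBeamLM;   // Thr_LM(t), word ends only

        active.erase(std::remove_if(active.begin(), active.end(),
                                    [&](const Hyp& h) {
                                        if (h.logScore < thrAC) return true;      // acoustic pruning
                                        return h.isWordEnd && h.logScore < thrLM; // LM pruning
                                    }),
                     active.end());
    }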


WC: Pruning of Path Hypotheses (3/5)

• Pruning techniques (cont.)
  – Stricter pruning is applied at word ends: the word-level threshold is tighter than the acoustic-level one
  – Reasons
    • A single path hypothesis is propagated into multiple word ends (words with the same pronunciation)
    • A large number of arcs (models) of the newly generated tree copies are about to be activated, which poses severe requirements on the system memory


WC: Pruning of Path Hypotheses (4/5)

• A simple dynamic pruning mechanism in the NTNU system
  – Acoustic-level pruning


WC: Pruning of Path Hypotheses (5/5)

• A simple dynamic pruning mechanism in the NTNU system
  – Word-level (language-model-level) pruning


WC: Language Model Look-ahead (1/2)

• Language model probabilities are incorporated as early in the search as possible
• For each tree arc a, let W(a) be the set of words reachable through a; the look-ahead factor is
  – Unigram look-ahead: $\pi(a) = \max_{w \in W(a)} P(w)$
  – Bigram look-ahead: $\pi_v(a) = \max_{w \in W(a)} P(w \mid v)$
• The look-ahead factor anticipates the language model probability for each state hypothesis and is used in the pruning comparison (see the sketch below):

  $$\tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) = \pi_{v_1^{n-1}}(a)\, Q_{v_1^{n-1}}(t, s; \text{arc}), \qquad \text{Thr}_{AC}(t) = \Big[ \max_{s,\,\text{arc}} \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) \Big] \times f_{AC}, \qquad \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) < \text{Thr}_{AC}(t) \Rightarrow \text{pruned!}$$

• Implementation of bigram look-ahead would be much more difficult for time-conditioned than for word-conditioned search (why?)
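A minimal sketch of the pruning comparison with a look-ahead factor, assuming the look-ahead values have already been propagated to every trie node (as in the recursive routine on the next slide) and that all scores are log-domain; the names (ArcHyp, logLookAhead) are illustrative.

    #include <algorithm>
    #include <vector>

    struct ArcHyp {
        double logScore;      // Q(t, s; arc) for this state hypothesis
        double logLookAhead;  // log pi(a): best LM probability reachable below this arc
    };

    // Pruning with LM look-ahead: compare Q~ = Q * pi(a) (additive in the log
    // domain) against a beam around the best anticipated score of the frame.
    void PruneWithLookAhead(std::vector<ArcHyp>& active, double logBeam) {
        if (active.empty()) return;
        double best = -1e300;
        for (const ArcHyp& h : active)
            best = std::max(best, h.logScore + h.logLookAhead);
        const double thr = best - logBeam;
        active.erase(std::remove_if(active.begin(), active.end(),
                                    [&](const ArcHyp& h) {
                                        return h.logScore + h.logLookAhead < thr;
                                    }),
                     active.end());
    }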


WC: Language Model Look-ahead (2/2)

• A simple recursive function for calculating the unigram LM look-ahead (after the traversal, each node's Unigram field holds the maximum unigram probability over all words reachable below it):

    // Entry point: propagate unigram look-ahead values over the whole lexical tree.
    void SpeechClass::Calculate_Word_Tree_Unigram()
    {
        if (Root == (struct Tree *)NULL) return;
        Do_Calculate_Word_Tree_Unigram(Root);
    }

    // Post-order traversal: a node pushes its (already maximized) unigram value
    // up to its father only after its subtree (Child) and its siblings (Brother)
    // have been processed.
    void SpeechClass::Do_Calculate_Word_Tree_Unigram(struct Tree *ptrNow)
    {
        if (ptrNow == (struct Tree *)NULL) return;
        Do_Calculate_Word_Tree_Unigram(ptrNow->Brother);
        Do_Calculate_Word_Tree_Unigram(ptrNow->Child);
        if (ptrNow->Father != (struct Tree *)NULL)
            if (ptrNow->Unigram > ptrNow->Father->Unigram)
                ptrNow->Father->Unigram = ptrNow->Unigram;
    }


WC: Acoustic Look-ahead (1/3)

• The Chinese language is well known for its monosyllabic structure, in which each Chinese word is composed of one or more syllables
  – Utilize syllable-level heuristics to enhance search efficiency and help make the right decisions when pruning
  – How do we design a suitable structure for estimating the heuristics of the unexplored portion of a speech utterance?
• The pruning score combines the decoding score with the look-ahead heuristic:

  $$\log F_{v_1^{n-1}}(t, s; \text{arc}) = \log \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) + \log \text{Heuristics}(t, s; \text{arc})$$


WC: Acoustic Look-ahead (2/3)

• Possible structures for heuristic estimation for the Chinese language
  – INITIALs, FINALs, or syllables, organized as a lattice based on the lexical tree
• The heuristic is combined with the actual decoding score and the language model look-ahead score:

  $$\log \tilde{D}_k(t, h, \text{arc}_q, s) = m'_1 \cdot \log D_k(t, h, \text{arc}_q, s) + m'_2 \cdot \log L_{LM}(\text{arc}_k) + m'_3 \cdot \log L_{AC}(t, \text{arc}_q, s)$$

(Figure: lattices of INITIALs, FINALs, and syllables built over the lexical tree)


WC: Acoustic Look-ahead (3/3)

• Tested on 4 hours of radio broadcast news (Chen et al., ICASSP 2004)
  – Kneser-Ney backoff smoothing was used for language modeling
• The recognition efficiency for TC improves significantly (a relative improvement of 41.61% was obtained), while the time spent on acoustic look-ahead (a real-time factor of 0.004) is almost negligible


More Details on Time-Conditioned Search


TC: Decomposition of Search History (1/3)

• Define the conditional probability that word w produces the acoustic segment $x_{\tau+1}^t$:

  $$h(w; \tau, t) = \max_{s_{\tau+1}^t} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid w\big)$$

• If the whole word history is kept, the joint probability of observing $x_1^t$ and $w_1^n$ ending at t is

  $$G(w_1^n; t) = P(w_1^n) \max_{s_1^t} p\big(x_1^t, s_1^t \mid w_1^n\big)
               = \max_{\tau}\big\{ P(w_n \mid w_1^{n-1})\, G(w_1^{n-1}; \tau)\, h(w_n; \tau, t) \big\}
               = P(w_n \mid w_1^{n-1}) \cdot \max_{\tau}\big\{ G(w_1^{n-1}; \tau)\, h(w_n; \tau, t) \big\}$$

• If only the latest m-1 word history is kept instead (with the DP recursion):

  $$G(w_1^n; t) \approx H(v_2^m; t), \qquad v_2^m = w_{n-m+2}^n$$
  $$H(v_2^m; t) = \max_{w_1^n:\, w_{n-m+2}^n = v_2^m} \Big\{ P(w_1^n) \max_{s_1^t} p\big(x_1^t, s_1^t \mid w_1^n\big) \Big\}
               = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\big\{ H(v_1^{m-1}; \tau)\, h(v_m; \tau, t) \big\} \Big\}$$


TC: Decomposition of Search History (2/3)

• Tree-internal search hypotheses (for an internal node s, not a word-end node), with the last word starting at time τ+1
• If the whole word history is kept:

  $$Q_\tau(t; s) = \max_{w_1^n} \Big\{ P(w_1^n) \max_{s_1^\tau} p\big(x_1^\tau, s_1^\tau \mid w_1^n\big) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big) \Big\}
              = \max_{w_1^n} G(w_1^n; \tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)$$

• If only the latest m-1 word history is kept instead:

  $$Q_\tau(t; s) \approx \max_{v_2^m:\, v_2^m = w_{n-m+2}^n} H(v_2^m; \tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)
               = H(\tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)$$

  – where $H(\tau) = \max_{v_2^m} H(v_2^m; \tau)$; the word identity ("-") is still unknown inside the tree


TC: Decomposition of Search History (3/3)

• Three basic operations are performed
  – Acoustic-level recombination within tree arcs (Viterbi search)

    $$Q_\tau(t, s; \text{arc}) = \max_{s'} \big[ Q_\tau(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – Tree arc extensions (the beginning state $S_0$ of an arc is fed from the ending state $S_M$ of its predecessor arc')

    $$Q_\tau(t, S_0; \text{arc}) = Q_\tau(t-1, S_M; \text{arc}')$$

  – Word-level recombination: the word score $h(v_m; \tau, t)$ is recovered from the tree copy started at τ (which was seeded with $H(\tau)$), and the result seeds the next tree copy (see the sketch below)

    $$H(v_2^m; t) = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\big\{ H(v_1^{m-1}; \tau)\, h(v_m; \tau, t) \big\} \Big\}
                 = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\Big\{ H(v_1^{m-1}; \tau)\, \frac{Q_\tau(t, S_{v_m}; \text{arc}_E)}{H(\tau)} \Big\} \Big\}$$
    $$Q_t(t, S_0; \text{arc}_B) = H(t) = \max_{v_2^m} H(v_2^m; t)$$
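A minimal sketch of the word-level recombination for the bigram case (m = 2) in C++, working in the log domain. It assumes the word-end scores log h(w; τ, t) are already available per start time τ, and the LM is a hypothetical callable logP(w, v) = log P(w | v); all names are illustrative.

    #include <map>

    // Word-level recombination of time-conditioned search at frame t:
    //   H(w; t) = max over v, tau of { log P(w | v) + log H(v; tau) + log h(w; tau, t) }
    // logH: tau -> (v -> log H(v; tau));  logh: tau -> (w -> log h(w; tau, t)).
    template <typename LMFunc>
    std::map<int, double> RecombineWordLevel(
            const std::map<int, std::map<int, double>>& logH,
            const std::map<int, std::map<int, double>>& logh,
            LMFunc logP) {
        std::map<int, double> logH_t;                           // w -> log H(w; t)
        for (const auto& [tau, wordEnds] : logh) {
            auto itH = logH.find(tau);
            if (itH == logH.end()) continue;                    // no history ends at tau
            for (const auto& [w, hScore] : wordEnds) {
                for (const auto& [v, HScore] : itH->second) {   // maximize over v and tau
                    double cand = logP(w, v) + HScore + hScore;
                    auto [it, inserted] = logH_t.emplace(w, cand);
                    if (!inserted && cand > it->second) it->second = cand;
                }
            }
        }
        return logH_t;
    }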


TC vs. WC: Conflicting Expectations

• Word-conditioned hypotheses seem to be more compact
  – Since the word boundary information has been discarded when performing path recombination
• The number of time-conditioned hypotheses might be smaller, because the number of possible start times (1, …, T) is always smaller than the number of word histories in the word-conditioned search
  – E.g., for a recognition task with a 10,000-word lexicon and a 2000-frame utterance: at most 2 × 10³ start times (TC), versus up to |V|^(n-1) LM histories (WC), i.e. 10⁴ for a bigram or 10⁸ for a trigram LM


TC vs. A* Search (1/2)

• A* for LVCSR
  – Performed with the maximum approximation
  – The partial word sequence hypotheses (w_1^n; τ), associated with the scores G(w_1^n; τ), play the central role
    • Multistack: time-specific priority queues are used to rank the partial word sequence hypotheses (w_1^n; τ) at each time τ (see the sketch after the next slide)
    • A conservative estimate of the probability score (LM + acoustic) extending from (w_1^n; τ) to the end of the speech signal is needed (difficult to achieve!)
      – Instead, an approximation is obtained by normalizing G(w_1^n; τ) with respect to max over w_1^n of G(w_1^n; τ)


TC vs. A* Search (2/2)

• A* operations (time-asynchronous, "look-forward")
  – Select the most promising hypothesis (w_1^n; τ); the best hypotheses with the shortest end time τ are extended first
    • When a word-end state is reached at time t, incorporate the LM probability and update the relevant queue
  – Pruning is applied at
    • The level of partial word sequence hypotheses (w_1^n; τ)
    • The level of acoustic word hypotheses (w_{n+1}; τ, t)
    • The level of the DP recursion on the tree-internal state hypotheses
• TC for LVCSR
  – The operations (algorithm) are performed in a time-synchronous, "look-backward" manner
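A minimal sketch of the multistack (time-specific priority queues) used by such a stack decoder, in C++. The class and field names (MultiStack, PartialHyp, normScore) are illustrative; scores are assumed to be the normalized G values mentioned on the previous slide.

    #include <cstdint>
    #include <queue>
    #include <vector>

    // A partial word-sequence hypothesis (w_1^n; tau) with its normalized score.
    struct PartialHyp {
        std::vector<int32_t> words;   // w_1^n
        int32_t endTime;              // tau
        double  normScore;            // G(w_1^n; tau) normalized by the frame-best score
    };

    struct WorseScore {
        bool operator()(const PartialHyp& a, const PartialHyp& b) const {
            return a.normScore < b.normScore;   // max-heap: best hypothesis on top
        }
    };

    // Multistack: one priority queue per end time tau, so hypotheses are only
    // ranked against others ending at the same frame.
    class MultiStack {
    public:
        explicit MultiStack(int numFrames) : stacks_(numFrames) {}

        void Push(PartialHyp hyp) { stacks_[hyp.endTime].push(std::move(hyp)); }

        // Pop the best hypothesis from the earliest non-empty stack
        // (shortest end time first, as described on the slide).
        bool PopBest(PartialHyp* out) {
            for (auto& stack : stacks_) {
                if (!stack.empty()) {
                    *out = stack.top();
                    stack.pop();
                    return true;
                }
            }
            return false;
        }

    private:
        std::vector<std::priority_queue<PartialHyp, std::vector<PartialHyp>, WorseScore>> stacks_;
    };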


Word Graph Rescoring


Word Graph Rescoring (1/6)

• Tree-copy search with a higher-order (trigram, four-gram, etc.) language model (LM) would be very time-consuming and impractical
• A word graph provides word alternatives in regions of the speech signal where the ambiguity about the actually spoken words is high
  – It decouples acoustic recognition (matching) from language model application
    • More complicated long-span language models (such as PLSA, LSA, trigger-based LMs, etc.) can then be applied in the word graph rescoring process


Word Graph Rescoring (2/6)

• If word hypotheses ending at a given time frame have scores higher than a predefined threshold, their associated decoding information (the word start and end frames, the identities of the current and predecessor words, and the acoustic score) can be kept in order to build a word graph for later higher-order language model rescoring
  – E.g., a bigram LM is used in the tree search procedure, while a trigram LM is used in the word graph rescoring procedure
• Keep track of word hypotheses whose scores are very close to the locally optimal hypothesis but that do not survive the recombination process

(Figure: with a bigram history assumed, both 台灣 入聯 and 中華民國 入聯 lead to 活動 via P(活動 | 台灣 入聯) and P(活動 | 中華民國 入聯), at times t₁ < t₂ < t₃ < t)


Word Graph Rescoring (3/6)

• If a bigram LM is employed in the tree-copy search
  – For a word hypothesis w ending at time t, its beginning time and its immediate predecessor word v are retained:

    $$\tau(t; v, w) = B_v(t, S_w; \text{arc}_E)$$

  – The acoustic score of the word hypothesis is also retained:

    $$AC(w; \tau, t) = \frac{Q_v(t, S_w; \text{arc}_E)}{H(v; \tau)}$$

  – For each possible word hypothesis, not only the word segment with the best predecessor word is recorded; segments that do not survive the recombination, e.g. AC(w; τ₀, t), AC(w; τ₁, t), AC(w; τ₂, t) for different start times, are also kept (see the sketch below)
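A minimal sketch of the bookkeeping record such a word graph might keep per word hypothesis; the struct and field names are illustrative, not taken from the NTNU system.

    #include <cstdint>
    #include <vector>

    // One recorded word hypothesis (a candidate word-graph edge): the word, its
    // time span, its immediate predecessor word, and its scores.
    struct WordArc {
        int32_t word;          // word identity w
        int32_t predecessor;   // immediate predecessor word v (bigram bookkeeping)
        int32_t startFrame;    // tau + 1, first frame of w
        int32_t endFrame;      // t, last frame of w
        double  acousticScore; // log AC(w; tau, t)
        double  lmScore;       // log P(w | v) used during the first (bigram) pass
    };

    // All hypotheses kept at frame t, including those that lost the recombination
    // but scored close to the local optimum; these become the graph's edges.
    using WordEndList = std::vector<WordArc>;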


Word Graph Rescoring (4/6)

• Bookkeeping at the word level
  – When word hypotheses are recombined into one hypothesis to start up the next tree
    • Not only the word segments (arcs) with the best predecessor word are recorded
    • But among hypotheses that share the same LM history, only the best one is kept (the "word-pair" approximation)

(Figure: word hypotheses such as 肺炎(SIL), 疫情(肺炎), 疫情(飛蛾), 影響(疫情), 生活(影響), 生命(影響), SIL(生活), SIL(生命), each annotated with its predecessor word in parentheses)


Word Graph Rescoring (5/6)

• "Word-pair" approximation: for each path hypothesis, the position of the word boundary between the last two words is independent of the other words of the path hypothesis

  $$B_v(w, t) = \tau$$

• An illustration of the word-pair approximation

(Figure: for w = 影響 ending at t with predecessor v = 疫情, the boundary τ between v and w is the same whether the earlier history is b = 飛蛾 or a = 肺炎)


Word Graph Rescoring (6/6)

• A word graph built with the word-pair approximation (a rescoring sketch follows the slide)
  – Each edge stands for a word hypothesis
  – The node at the right end of an edge denotes the word end
    • There is a maximum on the number of incoming word edges for a node
    • There is no maximum on the number of outgoing edges for a node

(Figure: a word graph over SIL, 肺炎, 飛蛾, 疫情, 一群, 隱藏, 影響, 生活, 生命, SIL, with trigram probabilities such as P(影響 | 飛蛾 疫情) and P(影響 | 肺炎 疫情) applied during rescoring)
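A minimal sketch of trigram rescoring over such a word graph by dynamic programming, in C++. It assumes a simple edge-list representation with node IDs 0..finalNode, edges given in topological order, no self-loops, and a hypothetical callable logTrigram(w, u, v) = log P(w | u v); none of these names come from the original system.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <map>
    #include <utility>
    #include <vector>

    struct Edge {                        // one word hypothesis (edge) in the word graph
        int32_t fromNode, toNode;        // word-boundary nodes it connects
        int32_t word;                    // word identity
        double  acousticScore;           // log acoustic score AC(w; tau, t)
    };

    // Rescore a word graph with a trigram LM. The DP state at a node is the last
    // two words (u, v), so P(w | u v) can be applied exactly at every edge.
    double RescoreWithTrigram(const std::vector<Edge>& edges, int32_t startNode,
                              int32_t finalNode,
                              double (*logTrigram)(int32_t w, int32_t u, int32_t v)) {
        const double kLogZero = -std::numeric_limits<double>::infinity();
        using History = std::pair<int32_t, int32_t>;                   // (u, v)
        std::vector<std::map<History, double>> best(finalNode + 1);
        best[startNode][{-1, -1}] = 0.0;                               // -1 marks the sentence start

        for (const Edge& e : edges) {
            for (const auto& [hist, score] : best[e.fromNode]) {
                double cand = score + e.acousticScore + logTrigram(e.word, hist.first, hist.second);
                History newHist{hist.second, e.word};
                auto [it, inserted] = best[e.toNode].emplace(newHist, cand);
                if (!inserted && cand > it->second) it->second = cand; // Viterbi recombination
            }
        }

        double bestFinal = kLogZero;
        for (const auto& [hist, score] : best[finalNode])
            bestFinal = std::max(bestFinal, score);
        return bestFinal;
    }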


Configuration of the NTNU System

• Feature extraction
  – HLDA (Heteroscedastic Linear Discriminant Analysis) + MLLT (Maximum Likelihood Linear Transformation) + MVN (Mean and Variance Normalization)
• Language modeling
  – Bigram/trigram LMs trained with the SRI toolkit
• Acoustic modeling
  – 151 RCD INITIAL/FINAL models trained with the HTK toolkit
  – Intra-word triphone modeling is currently under development
• Look-ahead schemes
  – Unigram language model look-ahead
  – Utterance-level acoustic look-ahead
