
Large Vocabulary Continuous Speech Recognition - Berlin Chen


Large Vocabulary Continuous Speech Recognition

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. X. L. Aubert, "An Overview of Decoding Techniques for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, vol. 16, 2002
2. H. Ney, "Progress in Dynamic Programming Search for LVCSR," Proceedings of the IEEE, August 2000
3. S. Ortmanns, H. Ney, "The Time-Conditioned Approach in Dynamic Programming Search for LVCSR," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, November 2000
4. S. Ortmanns, H. Ney, X. Aubert, "A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition," Computer Speech and Language, vol. 11, 1997


Why Is LVCSR Difficult? (1/2)

• The software complexity of a search algorithm is considerably high
• The effort required to build an efficient decoder is quite large


Why Is LVCSR Difficult? (2/2)

• Maximum approximation of the decoding problem:

$$
\hat{W} = \arg\max_{W} p(X \mid W)\,P(W)
        = \arg\max_{w_1^n} \Big\{ P(w_1^n) \sum_{s_1^T} p(X, s_1^T \mid w_1^n) \Big\}
        \approx \arg\max_{w_1^n} \Big\{ P(w_1^n)^{\alpha} \max_{s_1^T} p(X, s_1^T \mid w_1^n) \Big\}
$$

  – $s_1^T$: the state sequence corresponding to the word sequence $w_1^n$
  – $\alpha$: a heuristic language model scaling factor


Two Major Constituents of LVCSR

• Front-end processing is a spectral analysis that derives feature vectors to capture salient spectral characteristics of the speech input
  – Digital signal processing
  – Feature extraction
• Linguistic decoding combines word-level matching and sentence-level search to perform an inverse operation that decodes the message from the speech waveform
  – Acoustic modeling
  – Language modeling
  – Search


Classification of Approaches to LVCSR Along Two Axes

• Network expansion
  – Dynamic expansion of the search space during decoding
    • Tree-structured n-gram network
    • Re-entrant (word-conditioned) vs. start-synchronous (time-conditioned)
  – (Full) static expansion of the search space prior to decoding
    • Weighted Finite-State Transducers (WFST), AT&T
      – Composition (A ∘ L ∘ G), determinization, (pushing), and minimization of WFSTs
    • Static tree-based representations
• Search strategy
  – Time-synchronous (Viterbi + beam pruning): breadth-first (parallel)
  – Time-asynchronous (A* or stack decoding): best-first (sequential)


Word-Conditioned vs. Time-Conditioned Search (1/3)

• Word-conditioned search
  – A virtual tree copy is explored for each active LM history
  – More complicated in HMM state manipulation (dependent on the LM complexity)
• Time-conditioned search
  – A virtual tree copy is entered at each time frame by the word-end hypotheses ending at that time
  – More complicated in LM-level recombination

(Figure: schematic of word-conditioned vs. time-conditioned tree copies)


Word-Conditioned vs. Time-Conditioned Search (2/3)

• Schematic comparison
  – Word-conditioned search: the number of tree copies depends on the complexity of the LM constraints (at most |V|^(n-1) copies, e.g. |V| for a bigram and |V|² for a trigram constraint, where |V| is the lexicon size)
  – Time-conditioned search: the number of tree copies is independent of the complexity of the LM constraints, but depends on the duration of the utterance (in practice about 30–50 copies are active, given the average word duration)


Word-Conditioned vs. Time-Conditioned Search (3/3)

• Word-conditioned search (with a bigram LM) vs. time-conditioned search (with an m-gram LM)
• Both algorithms are stated in a "look-backward" manner


Word-Conditioned Tree-Copy Search (WC)

• Assume trigram language modeling is used
  – The pronunciation lexicon is structured as a tree
  – Due to the constraints of n-gram language modeling, a word's occurrence depends on the previous n-1 words
  – Search through all possible tree copies from the start time to the end time of the utterance to find the best sequence of word hypotheses
• Each search node is identified by the tuple (history, arc, state); language model look-ahead and acoustic look-ahead can be applied (see the sketch below)

(Figure: trigram tree copies over the Chinese word lattice 飛蛾/肺炎 → 一群/疫情 → 影響 → 生活/生命 → 很大/頗大 → 造成, with histories such as P(‧| sil 飛蛾), P(‧| 飛蛾 一群), …; |V| trees at the first word position and |V|² trees thereafter)
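A minimal sketch of how such a search node might be represented in C++. All names are illustrative and not taken from the NTNU decoder; the point is only that a hypothesis is keyed by the (history, arc, state) triple named on the slide.

    #include <array>
    #include <cstdint>

    // One active path hypothesis in word-conditioned tree-copy search.
    // It is uniquely identified by the (history, arc, state) triple.
    struct WCHypothesis {
        std::array<int32_t, 2> history; // previous n-1 word IDs (n = 3 for a trigram LM)
        int32_t arc;                    // arc (phone unit) within the lexical tree copy
        int16_t state;                  // HMM state index inside that arc
        double  score;                  // accumulated log-probability of the partial path
        int32_t backpointer;            // predecessor word-end hypothesis, for traceback
    };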


Lexical/Phonetic Tree (1/2)

• Each arc stands for a phonetic unit
• The application of language modeling is delayed until leaf nodes are reached
  – Word identities are only defined at leaf nodes (solution: language model smearing / look-ahead)
  – Each leaf node is shared by words having the same pronunciation

(Figure: a leaf node for the word "they", reached with P(they | say) and P(they | tell))


Lexical/Phonetic Tree (2/2)

• Reasons for using the lexical/phonetic tree
  – States corresponding to phones that are common to different words are shared by different hypotheses
    • A compact representation of the acoustic-phonetic search space
  – The uncertainty about a word's identity is much higher at its beginning than at its ending
    • More computation is required at the beginning of a word than toward its end


Linear Lexicon vs. Tree Lexicon (for Word-Conditioned Search)

• In the general n-gram case (n ≥ 1), the total number of prefix tree copies equals |V|^(n-1), where |V| is the lexicon size and n the LM order
• When using a (flat) linear lexicon, in the general n-gram case (n ≥ 2) the total number of copies of the linear lexicon is only |V|^(n-2), since the word identity is already known at the word start
  – E.g., for trigram modeling: |V| linear-lexicon copies vs. |V|² tree copies


Context-Dependent Phonetic Constraints

• Context-dependent modeling (e.g., cross-word (CW) triphone modeling) is a crucial factor regarding the search space at the junctions between successive words
  – Implementing CW modeling would be much more difficult for time-conditioned than for word-conditioned search (why?)
• The current NTNU system is implemented with intra-word INITIAL/FINAL or triphone modeling


More Details on Word-Conditioned Search


Word-Conditioned Search (1/3)

• Word (history)-conditioned search
  – A virtual/imaginary tree copy is explored for each linguistic context of the active search hypotheses
  – Search hypotheses are recombined at tree root nodes according to the language model (i.e., the history)
    • For n-gram language modeling: retain distinct (n-1)-gram word histories
  – Time is the independent variable; expansion proceeds in parallel in a "breadth-first" manner

(Figure: tree copies for the histories 一群 影響 and 疫情 影響, with probabilities such as P(‧| 一群 影響) and P(‧| 影響 生活); |V|² trees per position under a trigram LM)


Word-Conditioned Search (2/3)

• Integration of acoustic and linguistic knowledge by means of three-dimensional coordinates
  – A network (dynamically) built to describe sentences in terms of words
    • Language models supply the network transition probabilities
  – A network (statically) built to describe words in terms of phones
    • The pronunciation dictionary (organized as a phonetic tree)
    • Transition penalties are applied
  – A network (statically) built to describe a phone unit in terms of sequences of HMM states
    • Spectral vectors derived from the speech signal are consumed here


Word-Conditioned Search (3/3)

• Three basic operations are performed (backtracking information must also be maintained):
  – Acoustic-level recombination within tree arcs (Viterbi search; a sketch of this step follows the slide)

    $$Q_{v_1^{n-1}}(t, s; \text{arc}) = \max_{s'} \big[ Q_{v_1^{n-1}}(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – Tree arc extensions (the beginning state $S_0$ of an arc is fed from the ending state $S_M$ of its predecessor arc')

    $$Q_{v_1^{n-1}}(t, S_0; \text{arc}) = Q_{v_1^{n-1}}(t-1, S_M; \text{arc}')$$

  – Language-model-level recombination: word-end hypotheses sharing the same history are recombined, and the result starts up the next tree copy

    $$H(v_2^n; t) = \max_{v_1} \big[ Q_{v_1^{n-1}}(t, S_{v_n}; \text{arc}_E)\, P(v_n \mid v_1^{n-1})^{\alpha} \big], \qquad Q_{v_2^n}(t, S_0; \text{arc}_B) = H(v_2^n; t)$$

(Figure: for n = 3, the word-end hypotheses 一群 影響 生活 and 疫情 影響 生活 are recombined using P(生活 | 一群 影響) and P(生活 | 疫情 影響))
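A minimal sketch of the acoustic-level recombination step in C++, assuming log-domain scores and a generic HMM topology described by a transition matrix; the function and variable names are illustrative, not the NTNU implementation.

    #include <algorithm>
    #include <limits>
    #include <vector>

    constexpr double kLogZero = -std::numeric_limits<double>::infinity();

    // One time step of within-arc Viterbi recombination for a single tree arc:
    //   Q(t, s) = max_{s'} [ Q(t-1, s') + log P(s | s') ] + log p(x_t | s)
    // 'prev' holds Q(t-1, .) for this arc, 'logTrans[sPrev][s]' the transition
    // log-probabilities, and 'logEmit[s]' the emission log-likelihoods at frame t.
    std::vector<double> ViterbiStep(const std::vector<double>& prev,
                                    const std::vector<std::vector<double>>& logTrans,
                                    const std::vector<double>& logEmit) {
        const size_t numStates = prev.size();
        std::vector<double> cur(numStates, kLogZero);
        for (size_t s = 0; s < numStates; ++s) {
            double best = kLogZero;
            for (size_t sp = 0; sp < numStates; ++sp)        // recombine over predecessors
                best = std::max(best, prev[sp] + logTrans[sp][s]);
            if (best > kLogZero) cur[s] = best + logEmit[s]; // add acoustic score of frame t
        }
        return cur;
    }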


WC: Implementation Issues (1/3)

• Path hypotheses at each time frame are differentiated based on
  – The n-1 word history (for an n-gram LM)
  – The phone unit (i.e., the tree arc)
  – The HMM state
• Organization of active path hypotheses (states)
  – Hierarchical organization ("a data-driven approach"): the active HMM states (path hypotheses) at frame t are stored per active tree (LM history), per active arc, per active state
    • Node(history, arc, state); NT = number of active trees, NA = number of active arcs, NS = number of active states


WC: Implementation Issues (2/3)

• Organization of active path hypotheses (states)
  – Flat organization: the active HMM states at frame t are kept in a single associative list indexed by (history v_1^{n-1}, arc a_k, state s_j); processing frame t generates the new active states at frame t+1 in a data-driven manner
  – Acoustic-level recombination is performed on this list:

    $$Q_{v_1^{n-1}}(t, s; \text{arc}) = \max_{s'} \big[ Q_{v_1^{n-1}}(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – The C++ STL (Standard Template Library) is suitable for such access: any HMM state can be located in O(log NS) time, where NS is the number of active HMM states (see the sketch below)
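A minimal sketch of the flat organization using the C++ STL, assuming bigram histories (a single predecessor word ID) and log-domain scores; the type and function names are illustrative.

    #include <cstdint>
    #include <map>
    #include <tuple>

    // Key = (LM history, tree arc, HMM state); value = best log score so far.
    // std::map gives O(log NS) lookup/insert, matching the slide's estimate.
    using StateKey     = std::tuple<int32_t /*history*/, int32_t /*arc*/, int16_t /*state*/>;
    using ActiveStates = std::map<StateKey, double>;

    // Recombination when two partial paths reach the same (history, arc, state):
    // keep only the better score (Viterbi / maximum approximation).
    inline void Recombine(ActiveStates& active, const StateKey& key, double logScore) {
        auto [it, inserted] = active.emplace(key, logScore);
        if (!inserted && logScore > it->second) it->second = logScore;
    }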


WC: Implementation Issues (3/3)

• The pronunciation lexicon is organized as a "trie" (tree) structure, using a first-child/next-sibling representation; the field comments below are inferred from the names and from how the fields are used later (a usage sketch follows the slide):

    typedef struct Tree {            /* one node (arc) of the pronunciation trie  */
        short  Model_ID;             /* phone/acoustic model labeling this arc    */
        short  WD_NO;                /* number of words ending at this node       */
        int   *WD_ID;                /* IDs of those words (NULL if none)         */
        int    Leaf;                 /* non-zero if this node is a word end       */
        double Unigram;              /* unigram LM look-ahead value at this node  */
        struct Tree *Child;          /* first child                               */
        struct Tree *Brother;        /* next sibling                              */
        struct Tree *Father;         /* parent                                    */
    } DEF_LEXICON_TREE;

(Figure: a tree with nodes A–K and the corresponding trie representation)
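A minimal sketch of how one pronunciation might be inserted into this trie; the helper names (GetOrAddChild, InsertWord) are illustrative, memory-management and error handling are omitted, and the root node is assumed to be zero-initialized.

    #include <cstdlib>

    /* Find (or create) the child of 'node' labeled with phone unit 'model_id'. */
    static DEF_LEXICON_TREE *GetOrAddChild(DEF_LEXICON_TREE *node, short model_id) {
        for (DEF_LEXICON_TREE *c = node->Child; c != NULL; c = c->Brother)
            if (c->Model_ID == model_id) return c;          /* shared prefix found */
        DEF_LEXICON_TREE *c =
            (DEF_LEXICON_TREE *)std::calloc(1, sizeof(DEF_LEXICON_TREE));
        c->Model_ID = model_id;
        c->Father   = node;
        c->Brother  = node->Child;                          /* prepend to sibling list */
        node->Child = c;
        return c;
    }

    /* Insert one pronunciation (a sequence of phone-model IDs) into the trie. */
    void InsertWord(DEF_LEXICON_TREE *root, const short *phones, int len, int word_id) {
        DEF_LEXICON_TREE *node = root;
        for (int i = 0; i < len; ++i)
            node = GetOrAddChild(node, phones[i]);
        node->Leaf  = 1;                                    /* mark the word end */
        node->WD_ID = (int *)std::realloc(node->WD_ID, (node->WD_NO + 1) * sizeof(int));
        node->WD_ID[node->WD_NO++] = word_id;               /* homophones share this leaf */
    }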


WC: Pruning of Path Hypotheses (1/5)

• Viterbi search
  – Belongs to the class of breadth-first search
    • Time-synchronous: hypotheses terminate at the same point in time and can therefore be compared
  – The number of search hypotheses grows exponentially, so pruning away unlikely (incorrect) paths is needed
    • Viterbi beam search: hypotheses whose likelihood falls within a fixed radius (beam) of the most likely hypothesis are retained
    • The beam size is determined empirically or adaptively


WC: Pruning of Path Hypotheses (2/5)

• Pruning techniques (see the sketch below)
  1. Standard beam pruning (acoustic-level pruning): retain only hypotheses with a score close to the best hypothesis

     $$\text{Thr}_{AC}(t) = \Big[ \max_{v_1^{n-1},\, s,\, \text{arc}} Q_{v_1^{n-1}}(t, s; \text{arc}) \Big] \times f_{AC}, \qquad Q_{v_1^{n-1}}(t, s; \text{arc}) < \text{Thr}_{AC}(t) \Rightarrow \text{pruned!}$$

  2. Language model pruning (word-level pruning): applied to word-end or tree start-up hypotheses

     $$\text{Thr}_{LM}(t) = \Big[ \max_{v_1^{n-1}} Q_{v_1^{n-1}}(t, S_0; \text{arc}_B) \Big] \times f_{LM}, \qquad Q_{v_1^{n-1}}(t, S_0; \text{arc}_B) < \text{Thr}_{LM}(t) \Rightarrow \text{pruned!}$$
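A minimal sketch of both pruning steps in C++, assuming scores are kept in the log domain, so the multiplicative beam factors f_AC and f_LM become additive log beams; the container and names are illustrative.

    #include <algorithm>
    #include <vector>

    struct Hyp { double logScore; bool isWordEnd; };

    // Apply acoustic-level and word-level beam pruning to the active hypotheses
    // of one time frame. logBeamLM is typically tighter (smaller) than logBeamAC.
    void PruneFrame(std::vector<Hyp>& active, double logBeamAC, double logBeamLM) {
        if (active.empty()) return;
        double bestAC = active.front().logScore;
        double bestLM = -1e300;
        for (const Hyp& h : active) {
            bestAC = std::max(bestAC, h.logScore);
            if (h.isWordEnd) bestLM = std::max(bestLM, h.logScore);
        }
        const double thrAC = bestAC - logBeamAC;   // Thr_AC(t)
        const double thrLM = bestLM - logBeamLM;   // Thr_LM(t), word ends only

        active.erase(std::remove_if(active.begin(), active.end(),
                                    [&](const Hyp& h) {
                                        if (h.logScore < thrAC) return true;      // acoustic pruning
                                        return h.isWordEnd && h.logScore < thrLM; // LM pruning
                                    }),
                     active.end());
    }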


WC: Pruning of Path Hypotheses (3/5)

• Pruning techniques (cont.)
  – Stricter pruning is applied at word ends: the word-level threshold is tighter than the acoustic-level one
  – Reasons
    • A single path hypothesis is propagated into multiple word ends (words with the same pronunciation)
    • A large number of arcs (models) of the newly generated tree copies are about to be activated, which poses severe requirements on the system memory


WC: Pruning of Path Hypotheses (4/5)

• A simple dynamic pruning mechanism in the NTNU system
  – Acoustic-level pruning


WC: Pruning of Path Hypotheses (5/5)

• A simple dynamic pruning mechanism in the NTNU system
  – Word-level (language-model-level) pruning


WC: Language Model Look-ahead (1/2)

• Language model probabilities are incorporated as early in the search as possible
• For each tree arc a, let W(a) be the set of words reachable through a; the look-ahead factor is
  – Unigram look-ahead: $\pi(a) = \max_{w \in W(a)} P(w)$
  – Bigram look-ahead: $\pi_v(a) = \max_{w \in W(a)} P(w \mid v)$
• The look-ahead factor anticipates the language model probability for each state hypothesis and is used in the pruning comparison (see the sketch below):

  $$\tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) = \pi_{v_1^{n-1}}(a)\, Q_{v_1^{n-1}}(t, s; \text{arc}), \qquad \text{Thr}_{AC}(t) = \Big[ \max_{s,\,\text{arc}} \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) \Big] \times f_{AC}, \qquad \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) < \text{Thr}_{AC}(t) \Rightarrow \text{pruned!}$$

• Implementation of bigram look-ahead would be much more difficult for time-conditioned than for word-conditioned search (why?)
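A minimal sketch of the pruning comparison with a look-ahead factor, assuming the look-ahead values have already been propagated to every trie node (as in the recursive routine on the next slide) and that all scores are log-domain; the names (ArcHyp, logLookAhead) are illustrative.

    #include <algorithm>
    #include <vector>

    struct ArcHyp {
        double logScore;      // Q(t, s; arc) for this state hypothesis
        double logLookAhead;  // log pi(a): best LM probability reachable below this arc
    };

    // Pruning with LM look-ahead: compare Q~ = Q * pi(a) (additive in the log
    // domain) against a beam around the best anticipated score of the frame.
    void PruneWithLookAhead(std::vector<ArcHyp>& active, double logBeam) {
        if (active.empty()) return;
        double best = -1e300;
        for (const ArcHyp& h : active)
            best = std::max(best, h.logScore + h.logLookAhead);
        const double thr = best - logBeam;
        active.erase(std::remove_if(active.begin(), active.end(),
                                    [&](const ArcHyp& h) {
                                        return h.logScore + h.logLookAhead < thr;
                                    }),
                     active.end());
    }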


WC: Language Model Look-ahead (2/2)

• A simple recursive function for calculating the unigram LM look-ahead (after the traversal, each node's Unigram field holds the maximum unigram probability over all words reachable below it):

    // Entry point: propagate unigram look-ahead values over the whole lexical tree.
    void SpeechClass::Calculate_Word_Tree_Unigram()
    {
        if (Root == (struct Tree *)NULL) return;
        Do_Calculate_Word_Tree_Unigram(Root);
    }

    // Post-order traversal: a node pushes its (already maximized) unigram value
    // up to its father only after its subtree (Child) and its siblings (Brother)
    // have been processed.
    void SpeechClass::Do_Calculate_Word_Tree_Unigram(struct Tree *ptrNow)
    {
        if (ptrNow == (struct Tree *)NULL) return;
        Do_Calculate_Word_Tree_Unigram(ptrNow->Brother);
        Do_Calculate_Word_Tree_Unigram(ptrNow->Child);
        if (ptrNow->Father != (struct Tree *)NULL)
            if (ptrNow->Unigram > ptrNow->Father->Unigram)
                ptrNow->Father->Unigram = ptrNow->Unigram;
    }


WC: Acoustic Look-ahead (1/3)

• The Chinese language is well known for its monosyllabic structure, in which each Chinese word is composed of one or more syllables
  – Utilize syllable-level heuristics to enhance search efficiency and help make the right decisions when pruning
  – How do we design a suitable structure for estimating the heuristics of the unexplored portion of a speech utterance?
• The pruning score combines the decoding score with the look-ahead heuristic:

  $$\log F_{v_1^{n-1}}(t, s; \text{arc}) = \log \tilde{Q}_{v_1^{n-1}}(t, s; \text{arc}) + \log \text{Heuristics}(t, s; \text{arc})$$


WC: Acoustic Look-ahead (2/3)

• Possible structures for heuristic estimation for the Chinese language
  – INITIALs, FINALs, or syllables, organized as a lattice based on the lexical tree
• The heuristic is combined with the actual decoding score and the language model look-ahead score:

  $$\log \tilde{D}_k(t, h, \text{arc}_q, s) = m'_1 \cdot \log D_k(t, h, \text{arc}_q, s) + m'_2 \cdot \log L_{LM}(\text{arc}_k) + m'_3 \cdot \log L_{AC}(t, \text{arc}_q, s)$$

(Figure: lattices of INITIALs, FINALs, and syllables built over the lexical tree)


WC: Acoustic Look-ahead (3/3)

• Tested on 4 hours of radio broadcast news (Chen et al., ICASSP 2004)
  – Kneser-Ney backoff smoothing was used for language modeling
• The recognition efficiency for TC improves significantly (a relative improvement of 41.61% was obtained), while the time spent on acoustic look-ahead (a real-time factor of 0.004) is almost negligible


More Details on Time-Conditioned Search


TC: Decomposition of Search History (1/3)

• Define the conditional probability that word w produces the acoustic segment $x_{\tau+1}^t$:

  $$h(w; \tau, t) = \max_{s_{\tau+1}^t} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid w\big)$$

• If the whole word history is kept, the joint probability of observing $x_1^t$ and $w_1^n$ ending at t is

  $$G(w_1^n; t) = P(w_1^n) \max_{s_1^t} p\big(x_1^t, s_1^t \mid w_1^n\big)
               = \max_{\tau}\big\{ P(w_n \mid w_1^{n-1})\, G(w_1^{n-1}; \tau)\, h(w_n; \tau, t) \big\}
               = P(w_n \mid w_1^{n-1}) \cdot \max_{\tau}\big\{ G(w_1^{n-1}; \tau)\, h(w_n; \tau, t) \big\}$$

• If only the latest m-1 word history is kept instead (with the DP recursion):

  $$G(w_1^n; t) \approx H(v_2^m; t), \qquad v_2^m = w_{n-m+2}^n$$
  $$H(v_2^m; t) = \max_{w_1^n:\, w_{n-m+2}^n = v_2^m} \Big\{ P(w_1^n) \max_{s_1^t} p\big(x_1^t, s_1^t \mid w_1^n\big) \Big\}
               = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\big\{ H(v_1^{m-1}; \tau)\, h(v_m; \tau, t) \big\} \Big\}$$


TC: Decomposition of Search History (2/3)

• Tree-internal search hypotheses (for an internal node s, not a word-end node), with the last word starting at time τ+1
• If the whole word history is kept:

  $$Q_\tau(t; s) = \max_{w_1^n} \Big\{ P(w_1^n) \max_{s_1^\tau} p\big(x_1^\tau, s_1^\tau \mid w_1^n\big) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big) \Big\}
              = \max_{w_1^n} G(w_1^n; \tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)$$

• If only the latest m-1 word history is kept instead:

  $$Q_\tau(t; s) \approx \max_{v_2^m:\, v_2^m = w_{n-m+2}^n} H(v_2^m; \tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)
               = H(\tau) \cdot \max_{s_{\tau+1}^t:\, s_t = s} p\big(x_{\tau+1}^t, s_{\tau+1}^t \mid -\big)$$

  – where $H(\tau) = \max_{v_2^m} H(v_2^m; \tau)$; the word identity ("-") is still unknown inside the tree


TC: Decomposition of Search History (3/3)

• Three basic operations are performed
  – Acoustic-level recombination within tree arcs (Viterbi search)

    $$Q_\tau(t, s; \text{arc}) = \max_{s'} \big[ Q_\tau(t-1, s'; \text{arc})\, P(s \mid s'; \text{arc}) \big] \cdot p(x_t \mid s; \text{arc})$$

  – Tree arc extensions (the beginning state $S_0$ of an arc is fed from the ending state $S_M$ of its predecessor arc')

    $$Q_\tau(t, S_0; \text{arc}) = Q_\tau(t-1, S_M; \text{arc}')$$

  – Word-level recombination: the word score $h(v_m; \tau, t)$ is recovered from the tree copy started at τ (which was seeded with $H(\tau)$), and the result seeds the next tree copy (see the sketch below)

    $$H(v_2^m; t) = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\big\{ H(v_1^{m-1}; \tau)\, h(v_m; \tau, t) \big\} \Big\}
                 = \max_{v_1}\Big\{ P(v_m \mid v_1^{m-1}) \cdot \max_{\tau}\Big\{ H(v_1^{m-1}; \tau)\, \frac{Q_\tau(t, S_{v_m}; \text{arc}_E)}{H(\tau)} \Big\} \Big\}$$
    $$Q_t(t, S_0; \text{arc}_B) = H(t) = \max_{v_2^m} H(v_2^m; t)$$
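A minimal sketch of the word-level recombination for the bigram case (m = 2) in C++, working in the log domain. It assumes the word-end scores log h(w; τ, t) are already available per start time τ, and the LM is a hypothetical callable logP(w, v) = log P(w | v); all names are illustrative.

    #include <map>

    // Word-level recombination of time-conditioned search at frame t:
    //   H(w; t) = max over v, tau of { log P(w | v) + log H(v; tau) + log h(w; tau, t) }
    // logH: tau -> (v -> log H(v; tau));  logh: tau -> (w -> log h(w; tau, t)).
    template <typename LMFunc>
    std::map<int, double> RecombineWordLevel(
            const std::map<int, std::map<int, double>>& logH,
            const std::map<int, std::map<int, double>>& logh,
            LMFunc logP) {
        std::map<int, double> logH_t;                           // w -> log H(w; t)
        for (const auto& [tau, wordEnds] : logh) {
            auto itH = logH.find(tau);
            if (itH == logH.end()) continue;                    // no history ends at tau
            for (const auto& [w, hScore] : wordEnds) {
                for (const auto& [v, HScore] : itH->second) {   // maximize over v and tau
                    double cand = logP(w, v) + HScore + hScore;
                    auto [it, inserted] = logH_t.emplace(w, cand);
                    if (!inserted && cand > it->second) it->second = cand;
                }
            }
        }
        return logH_t;
    }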


TC vs. WC: Conflicting Expectations

• Word-conditioned hypotheses seem to be more compact
  – Since the word boundary information has been discarded when performing path recombination
• The number of time-conditioned hypotheses might be smaller, because the number of possible start times (1, …, T) is always smaller than the number of word histories in the word-conditioned search
  – E.g., for a recognition task with a 10,000-word lexicon and a 2000-frame utterance: at most 2 × 10³ start times (TC), versus up to |V|^(n-1) LM histories (WC), i.e. 10⁴ for a bigram or 10⁸ for a trigram LM


TC vs. A* Search (1/2)

• A* for LVCSR
  – Performed with the maximum approximation
  – The partial word sequence hypotheses (w_1^n; τ), associated with the scores G(w_1^n; τ), play the central role
    • Multistack: time-specific priority queues are used to rank the partial word sequence hypotheses (w_1^n; τ) at each time τ (see the sketch after the next slide)
    • A conservative estimate of the probability score (LM + acoustic) extending from (w_1^n; τ) to the end of the speech signal is needed (difficult to achieve!)
      – Instead, an approximation is obtained by normalizing G(w_1^n; τ) with respect to max over w_1^n of G(w_1^n; τ)


TC vs. A* Search (2/2)

• A* operations (time-asynchronous, "look-forward")
  – Select the most promising hypothesis (w_1^n; τ); the best hypotheses with the shortest end time τ are extended first
    • When a word-end state is reached at time t, incorporate the LM probability and update the relevant queue
  – Pruning is applied at
    • The level of partial word sequence hypotheses (w_1^n; τ)
    • The level of acoustic word hypotheses (w_{n+1}; τ, t)
    • The level of the DP recursion on the tree-internal state hypotheses
• TC for LVCSR
  – The operations (algorithm) are performed in a time-synchronous, "look-backward" manner
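A minimal sketch of the multistack (time-specific priority queues) used by such a stack decoder, in C++. The class and field names (MultiStack, PartialHyp, normScore) are illustrative; scores are assumed to be the normalized G values mentioned on the previous slide.

    #include <cstdint>
    #include <queue>
    #include <vector>

    // A partial word-sequence hypothesis (w_1^n; tau) with its normalized score.
    struct PartialHyp {
        std::vector<int32_t> words;   // w_1^n
        int32_t endTime;              // tau
        double  normScore;            // G(w_1^n; tau) normalized by the frame-best score
    };

    struct WorseScore {
        bool operator()(const PartialHyp& a, const PartialHyp& b) const {
            return a.normScore < b.normScore;   // max-heap: best hypothesis on top
        }
    };

    // Multistack: one priority queue per end time tau, so hypotheses are only
    // ranked against others ending at the same frame.
    class MultiStack {
    public:
        explicit MultiStack(int numFrames) : stacks_(numFrames) {}

        void Push(PartialHyp hyp) { stacks_[hyp.endTime].push(std::move(hyp)); }

        // Pop the best hypothesis from the earliest non-empty stack
        // (shortest end time first, as described on the slide).
        bool PopBest(PartialHyp* out) {
            for (auto& stack : stacks_) {
                if (!stack.empty()) {
                    *out = stack.top();
                    stack.pop();
                    return true;
                }
            }
            return false;
        }

    private:
        std::vector<std::priority_queue<PartialHyp, std::vector<PartialHyp>, WorseScore>> stacks_;
    };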


Word Graph Rescoring


Word Graph Rescoring (1/6)

• Tree-copy search with a higher-order (trigram, four-gram, etc.) language model (LM) would be very time-consuming and impractical
• A word graph provides word alternatives in regions of the speech signal where the ambiguity about the actually spoken words is high
  – It decouples acoustic recognition (matching) from language model application
    • More complicated long-span language models (such as PLSA, LSA, trigger-based LMs, etc.) can then be applied in the word graph rescoring process


Word Graph Rescoring (2/6)

• If word hypotheses ending at a given time frame have scores higher than a predefined threshold, their associated decoding information (the word start and end frames, the identities of the current and predecessor words, and the acoustic score) can be kept in order to build a word graph for later higher-order language model rescoring
  – E.g., a bigram LM is used in the tree search procedure, while a trigram LM is used in the word graph rescoring procedure
• Keep track of word hypotheses whose scores are very close to the locally optimal hypothesis but that do not survive the recombination process

(Figure: with a bigram history assumed, both 台灣 入聯 and 中華民國 入聯 lead to 活動 via P(活動 | 台灣 入聯) and P(活動 | 中華民國 入聯), at times t₁ < t₂ < t₃ < t)


Word Graph Rescoring (3/6)

• If a bigram LM is employed in the tree-copy search
  – For a word hypothesis w ending at time t, its beginning time and its immediate predecessor word v are retained:

    $$\tau(t; v, w) = B_v(t, S_w; \text{arc}_E)$$

  – The acoustic score of the word hypothesis is also retained:

    $$AC(w; \tau, t) = \frac{Q_v(t, S_w; \text{arc}_E)}{H(v; \tau)}$$

  – For each possible word hypothesis, not only the word segment with the best predecessor word is recorded; segments that do not survive the recombination, e.g. AC(w; τ₀, t), AC(w; τ₁, t), AC(w; τ₂, t) for different start times, are also kept (see the sketch below)
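A minimal sketch of the bookkeeping record such a word graph might keep per word hypothesis; the struct and field names are illustrative, not taken from the NTNU system.

    #include <cstdint>
    #include <vector>

    // One recorded word hypothesis (a candidate word-graph edge): the word, its
    // time span, its immediate predecessor word, and its scores.
    struct WordArc {
        int32_t word;          // word identity w
        int32_t predecessor;   // immediate predecessor word v (bigram bookkeeping)
        int32_t startFrame;    // tau + 1, first frame of w
        int32_t endFrame;      // t, last frame of w
        double  acousticScore; // log AC(w; tau, t)
        double  lmScore;       // log P(w | v) used during the first (bigram) pass
    };

    // All hypotheses kept at frame t, including those that lost the recombination
    // but scored close to the local optimum; these become the graph's edges.
    using WordEndList = std::vector<WordArc>;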


Word Graph Rescoring (4/6)

• Bookkeeping at the word level
  – When word hypotheses are recombined into one hypothesis to start up the next tree
    • Not only the word segments (arcs) with the best predecessor word are recorded
    • But among hypotheses that share the same LM history, only the best one is kept (the "word-pair" approximation)

(Figure: word hypotheses such as 肺炎(SIL), 疫情(肺炎), 疫情(飛蛾), 影響(疫情), 生活(影響), 生命(影響), SIL(生活), SIL(生命), each annotated with its predecessor word in parentheses)


Word Graph Rescoring (5/6)

• "Word-pair" approximation: for each path hypothesis, the position of the word boundary between the last two words is independent of the other words of the path hypothesis

  $$B_v(w, t) = \tau$$

• An illustration of the word-pair approximation

(Figure: for w = 影響 ending at t with predecessor v = 疫情, the boundary τ between v and w is the same whether the earlier history is b = 飛蛾 or a = 肺炎)


Word Graph Rescoring (6/6)

• A word graph built with the word-pair approximation (a rescoring sketch follows the slide)
  – Each edge stands for a word hypothesis
  – The node at the right end of an edge denotes the word end
    • There is a maximum on the number of incoming word edges for a node
    • There is no maximum on the number of outgoing edges for a node

(Figure: a word graph over SIL, 肺炎, 飛蛾, 疫情, 一群, 隱藏, 影響, 生活, 生命, SIL, with trigram probabilities such as P(影響 | 飛蛾 疫情) and P(影響 | 肺炎 疫情) applied during rescoring)
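A minimal sketch of trigram rescoring over such a word graph by dynamic programming, in C++. It assumes a simple edge-list representation with node IDs 0..finalNode, edges given in topological order, no self-loops, and a hypothetical callable logTrigram(w, u, v) = log P(w | u v); none of these names come from the original system.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <map>
    #include <utility>
    #include <vector>

    struct Edge {                        // one word hypothesis (edge) in the word graph
        int32_t fromNode, toNode;        // word-boundary nodes it connects
        int32_t word;                    // word identity
        double  acousticScore;           // log acoustic score AC(w; tau, t)
    };

    // Rescore a word graph with a trigram LM. The DP state at a node is the last
    // two words (u, v), so P(w | u v) can be applied exactly at every edge.
    double RescoreWithTrigram(const std::vector<Edge>& edges, int32_t startNode,
                              int32_t finalNode,
                              double (*logTrigram)(int32_t w, int32_t u, int32_t v)) {
        const double kLogZero = -std::numeric_limits<double>::infinity();
        using History = std::pair<int32_t, int32_t>;                   // (u, v)
        std::vector<std::map<History, double>> best(finalNode + 1);
        best[startNode][{-1, -1}] = 0.0;                               // -1 marks the sentence start

        for (const Edge& e : edges) {
            for (const auto& [hist, score] : best[e.fromNode]) {
                double cand = score + e.acousticScore + logTrigram(e.word, hist.first, hist.second);
                History newHist{hist.second, e.word};
                auto [it, inserted] = best[e.toNode].emplace(newHist, cand);
                if (!inserted && cand > it->second) it->second = cand; // Viterbi recombination
            }
        }

        double bestFinal = kLogZero;
        for (const auto& [hist, score] : best[finalNode])
            bestFinal = std::max(bestFinal, score);
        return bestFinal;
    }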


Configuration of the NTNU System

• Feature extraction
  – HLDA (Heteroscedastic Linear Discriminant Analysis) + MLLT (Maximum Likelihood Linear Transformation) + MVN (Mean and Variance Normalization)
• Language modeling
  – Bigram/trigram LMs trained with the SRI toolkit
• Acoustic modeling
  – 151 RCD INITIAL/FINAL models trained with the HTK toolkit
  – Intra-word triphone modeling is currently under development
• Look-ahead schemes
  – Unigram language model look-ahead
  – Utterance-level acoustic look-ahead
