The dissertation of Andreas Stolcke is approved: University of ...

More documents

Recommendations

Info

2. For each candidate"I!computeLet"$ 3. be the merge that maximizes +9,",4@@@4@the merged model",the samples ) in . These states are chained with transitions of probability 1, such that a sample 1 £££1 ”•4@@4@@@4,@@CHAPTER 3. HIDDEN MARKOV MODELS 40A. Get some new samples )and incorporate into the current model 4.B. Loop:1. Compute a set of candidate merges! from among the states of model 4.0. )g0 .0 and its posterior probability+-,",0. )g0 . Then let 4#1 :6#">0 .5. Let H :6 HEN 1.+9,# 1. )g0ŒÏ4. If +-, 4. )g0 , break from the loop.C. If the data is exhausted, break from the loop and 4 return as the induced model.Incremental merging might in principle produce results worse than the batch version since theevaluation step doesn’t have as much data at its disposal. However, we didn’t find this to be a significantdisadvantage in practice. One can optimize the number of samples incorporated in each step A (the batch size)for overall speed. This requires balancing the gains due to smaller model size against the constant overhead ofeach execution of step B. The best value will depend on the data and how much merging is actually possibleon each iteration; we found between 1 and 10 samples at a time to be good choices.One has to be careful not to start merging with extremely small models, such as that resulting fromincorporating only a few short samples. Many of the priors discussed earlier contain logarithmic terms thatapproach singularities (log 0) in this case, which can produce poor results, usually by leading to extrememerging. That can easily be prevented by incorporating a larger number of samples (say, 10 to 20) beforegoing on to the first merging step.Further modifications to the simple best-first search strategy are discussed in Section 3.4.5.3.4 Implementation IssuesIn this section we elaborate on the implementation of the various steps in the generic HMM mergingalgorithm presented in Section 3.3.5.3.4.1 Efficient sample incorporationIn the simplest case this step creates a dedicated state for each instance of a symbol in any ofÔ 1is generated by a 1£££ u Ôstateusequence . 1 can be reached from via a transition of probability 1· u u. where is the total number ofÔsamples. State connects to with probability 1. All states u @)/.corresponding output symbol with probability 1.@ ½ u 1 u, •emit their
”•CHAPTER 3. HIDDEN MARKOV MODELS 41In this way, repeated samples lead to multiple paths through the model, all generating the samplestring. The total probability of a string according to the initial model is thus ] Õ z Öof string 1 . It follows that the initial model constitutes a maximum likelihood model for the data ) .1• , i.e., the relative frequencyNote that corresponding states in equivalent paths can be merged without loss of model likelihood.This is generally what the merging loop does in its initial passes.A trivial optimization at this stage is to avoid the initial multiplicity of paths and check for eachnew sample whether it is already accounted for by an existing path. If so, only the first transition probabilityhas to be updated.The idea of shortcutting the merging of samples into the existing model could be pursued furtheralong the lines of Thomason & Granum (1986). Using an extension of the Viterbi algorithm, the new samplecan be aligned with the existing model states, recruiting new states only where necessary. Such an alignmentcouldn’t effect all the possible merges, e.g., it wouldn’t be able to generate loops, but it could further reducethe initial number of states in the model, thereby saving computation in subsequent steps. 63.4.2 Computing candidate mergesThe general case here is to examine all of the 1 .ƒÞã.,.ñÞã. L 102possible pairs of states in the currentmodel. The quadratic cost in the number of states explains the importance of the various strategies to keepthe number of states in the initial model small.We have explored various application specific strategies to narrow down the set of worthwhilecandidates. For example, if the cost of a merge is usually dominated by the cost of merging the outputdistributions of the states involved, we might index states according to their emission characteristics andconsider only pairs of states with similar outputs. This constraint can be removed later after all other mergingpossibilities have been exhausted. The resulting strategy (first merging only same-outputs states, followedby general merging) not only speeds up the algorithm, it is also generally a good heuristic in incrementalmerging to prevent premature merges that are likely to be assessed differently in the light of new data.Sometimes hard knowledge about the target model structure can further constrain the search. Forexample, word models for speech recognition are usually not allowed to generate arbitrary repetitions ofsubsequences (see Section 3.6.2). All merges creating loops (perhaps excepting self-loops) can therefore beeliminated in this case.3.4.3 Model evaluation using Viterbi pathsTo find the posterior probability of a potential model, we need to evaluate the structural priorand, depending on the goal of the maximization, either find the maximum posterior probability(MAP)4TÃB0+-,estimates for the model parameters , or evaluate the integral for +-, )À. 4/ÃE0 given by equation (2.20).UÄMAP estimation of HMM parameters could be done using the Baum-Welch iterative reestimation(EM) method, by taking the Dirichlet prior into account in the reestimation step. However, this would6 This optimization is as yet unimplemented.
Page 1 and 2: The dissertation of Andreas Stolcke
Page 3 and 4: Bayesian Learning of Probabilistic
Page 5 and 6: iAcknowledgmentsLife and work in Be
Page 7 and 8: iiiContentsList of FiguresList of T
Page 9 and 10: CONTENTSv4.5.4 Summary and Discussi
Page 14 and 15: CHAPTER 1. INTRODUCTION 2Instance-b
Page 16 and 17: CHAPTER 1. INTRODUCTION 4A.0.830.33
Page 18 and 19: CHAPTER 1. INTRODUCTION 6the ¨ 0 l
Page 20 and 21: ..1 £££1; 450,1 £££1; 450CHAP
Page 22 and 23: VU=@U@@=U===UCHAPTER 2. FOUNDATIONS
Page 24 and 25: ,,vv,v,v,,directly. However, note t
Page 26 and 27: 4@@@@-@b@6@˜--@@@0@@@@@CHAPTER 2.
Page 28 and 29: 6tt,u ·¥¸¹u ,10ºtu ,2 10Yt ¸
Page 30 and 31: CHAPTER 2. FOUNDATIONS 18As more da
Page 32 and 33: CHAPTER 2. FOUNDATIONS 20Global mod
Page 34 and 35: CHAPTER 2. FOUNDATIONS 22¡An expli
Page 36 and 37: ÊS==66@N,ÆÆ=NÆ00ÆÊ=S=N0Æ=#@0
Page 38 and 39: 666CHAPTER 2. FOUNDATIONS 262.5.7 P
Page 40 and 41: It, u¦¸¹u Ù 0w6¬tt,_, u Ù 0
Page 42 and 43: , uu!¸¹u Ù 0c6,u ,ÔÔ0 ö1 ö1
Page 44 and 45: CHAPTER 3. HIDDEN MARKOV MODELS 32T
Page 46 and 47: CHAPTER 3. HIDDEN MARKOV MODELS 34R
Page 48 and 49: 4ÿ= ê•4TÃE0&Ò¢¡? •ç1 Lht
Page 50 and 51: 6666ò U1ò +9,9. 4 20+-,¡ . 4 10C
Page 54 and 55: 6\“ç%&ät\“ç tè ä, u¦¸¹u
Page 56 and 57: , u1 ¸¼u Ù 0 and , u3 ¸¹u Ù 0
Page 58 and 59: CHAPTER 3. HIDDEN MARKOV MODELS 46l
Page 60 and 61: CHAPTER 3. HIDDEN MARKOV MODELS 48c
Page 62 and 63: CHAPTER 3. HIDDEN MARKOV MODELS 50d
Page 64 and 65: CHAPTER 3. HIDDEN MARKOV MODELS 520
Page 66 and 67: correlation between initial and fin
Page 68 and 69: ,CHAPTER 3. HIDDEN MARKOV MODELS 56
Page 72 and 73: ,CHAPTER 3. HIDDEN MARKOV MODELS 60
Page 74 and 75: CHAPTER 3. HIDDEN MARKOV MODELS 62b
Page 76 and 77: CHAPTER 3. HIDDEN MARKOV MODELS 64t
Page 78 and 79: CHAPTER 3. HIDDEN MARKOV MODELS 66t
Page 80 and 81: CHAPTER 3. HIDDEN MARKOV MODELS 68s
Page 86 and 87: CHAPTER 3. HIDDEN MARKOV MODELS 74b
Page 88 and 89: domain. 3 In short, we will leave o
Page 90 and 91: ,,,,,£CHAPTER 4. STOCHASTIC CONTEX
Page 92 and 93: 9 ¸)Ô ¸ 9 ¸Ô 1 2 £££;,ÔÔC
Page 94 and 95: ¸= ¸= ¸.1.2¸¸) 1_) 20&6#=,,,,
Page 96 and 97: CHAPTER 4. STOCHASTIC CONTEXT-FREE
Page 102 and 103:
CHAPTER 4. STOCHASTIC CONTEXT-FREE
Page 104 and 105:
==Ì==ÌCHAPTER 4. STOCHASTIC CONTE
Page 106 and 107:
,= ===I¸theybÜ„thiscg„\\ ¸¸
Page 108 and 109:
Page 110 and 111:
Page 112 and 113:
Page 114 and 115:
Page 116 and 117:
104Chapter 5Probabilistic Attribute
Page 118 and 119:
,1makingandCHAPTER 5. PROBABILISTIC
Page 120 and 121:
CHAPTER 5. PROBABILISTIC ATTRIBUTE
Page 122 and 123:
Page 124 and 125:
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
Page 132 and 133:
Page 134 and 135:
122Chapter 6Efficient parsing with
Page 136 and 137:
1z1CHAPTER 6. EFFICIENT PARSING WIT
Page 138 and 139:
and each state in set ( -ÏH ):6)
Page 140 and 141:
NPDetVTVI P CHAPTER 6. EFFICIENT PA
Page 142 and 143:
CHAPTER 6. EFFICIENT PARSING WITH S
Page 144 and 145:
) In particular, the string probabi
Page 146 and 147:
H : d=:6) ¸ÆÆÆ:6) ¸ .V=i£Ù )
Page 148 and 149:
©) 6#=©,©,©,,ÆNNö,²NL++and>+
Page 150 and 151:
) The probabilistic unit-production
Page 152 and 153:
¸0 ¸ 29¸¸ 99W9 [;t£ ? 1 ?1u 6
Page 154 and 155:
The forward and inner probabilities
Page 156 and 157:
9 itself²²NN++9 ¸0ÌLL++??1£,=C
Page 158 and 159:
,,by nonterminals. Multiplying this
Page 160 and 161:
6CHAPTER 6. EFFICIENT PARSING WITH
Page 162 and 163:
description. Again, we ignore this
Page 164 and 165:
for all pairs of states d =¸+= Š
Page 166 and 167:
,9 ¸0 : 0¸ £j9¸ A )z9£CHAPTER
Page 168 and 169:
Page 170 and 171:
are then summed over all nontermina
Page 172 and 173:
+CHAPTER 6. EFFICIENT PARSING WITH
Page 174 and 175:
1CHAPTER 6. EFFICIENT PARSING WITH
Page 176 and 177:
Page 178 and 179:
= ¸Let t,t) ¸ .V=i£,t,t,6666yyyy
Page 180 and 181:
168Chapter 7-grams from Stochastic
Page 182 and 183:
CHAPTER 7. -GRAMS FROM STOCHASTIC
Page 184 and 185:
)ÅÆÅÅÅÆÅÅÅÆÅÅÅÆÅÅÅ
Page 186 and 187:
-grams CCHAPTER 7. -GRAMS FROM ST
Page 188 and 189:
,?Ó,tÌ?L A0,I 1N I N A A 2 A 3 N
Page 190 and 191:
,,CHAPTER 7. -GRAMS FROM STOCHASTI
Page 192 and 193:
Consider the following problem: sta
Page 194 and 195:
CHAPTER 8. FUTURE DIRECTIONS 1828.2
Page 196 and 197:
184BibliographyAHO, ALFRED V., RAVI
Page 198 and 199:
BIBLIOGRAPHY 186DAGAN, IDO, FERNAND
Page 200 and 201:
BIBLIOGRAPHY 188——, & ——. 1
Page 202 and 203:
BIBLIOGRAPHY 190QUINLAN, J. ROSS, &
Page 204:
BIBLIOGRAPHY 192WALLACE, C. S., & P
show all

The dissertation of Andreas Stolcke is approved: University of ...

Create successful ePaper yourself

Delete template?

Save as template?