The dissertation of Andreas Stolcke is approved: University of ...

CHAPTER 3. HIDDEN MARKOV MODELS

local posterior probability maxima in the space of HMM structures constructed by successive merging operations.

By far the most common problem found in practice is that the stopping criterion is triggered too early, since a single merging step alone decreases the posterior model probability, although additional related steps might eventually increase it. This happens although in the vast majority of cases the first step is in the right direction. The straightforward solution to this problem is to add a ‘lookahead’ to the best-first strategy. The stopping criterion is modified to trigger only after a fixed number of steps > 1 have produced no improvement; merging still proceeds along the best-first path. Due to this, the lookahead depth does not entail an exponential increase in computation as a full tree search would. The only additional cost is the work performed by looking ahead in vain at the end of a merging sequence. That cost is amortized over several samples if incremental merging with a batch size > 1 is being used.

Best-first merging with lookahead has been our method of choice for almost all applications, using lookaheads between 2 and 5. However, we have also experimented with beam search strategies. In these, a set of working models is kept at each time, either limited in number (say, the top-scoring ones), or by the difference in score to the current best model. On each inner loop of the search algorithm, all current models are modified according to the possible merges, and among the pool thus generated the best ones according to the beam criterion are retained. (By including the unmerged models in the pool we get the effect of a lookahead.)

Some duplication of work results from the fact that different sequences of merges can lead to the same final HMM structure. To remove such gratuitous duplicates from the beam we attach a list of disallowed merges to each model, which is propagated from a model to its successors generated by merging. Multiple successors of the same model have the list extended so that later successors cannot produce identical results from simply permuting the merge sequence.

The resulting beam search version of our algorithm does indeed produce superior results on data that requires aligning long substrings of states, and where the quality of the alignment can only be evaluated after several coordinated merging steps. On the other hand, beam search is considerably more expensive than best-first search and may not be worth a marginal improvement.

All results in Section 3.6 were obtained using best-first search with lookahead. Nevertheless, improved search strategies and heuristics for merging remain an important problem for future research.

3.5 Related Work

Many of the ideas used in our approach to Bayesian HMM induction are not new by themselves, and can be found in similar forms in the vast literatures on grammar induction and statistical inference.
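As a rough illustration of the best-first strategy with lookahead discussed in the search section above, the following Python sketch follows the best candidate at every step and stops only after a fixed number of consecutive steps have failed to improve on the best score seen so far. All names here (`best_first_with_lookahead`, `successors`, `score`, and the toy score table) are assumptions made for illustration and are not taken from the original implementation. For brevity a "model" is reduced to an integer counting merge steps, and the toy scores stand in for log posterior probabilities.

```python
from typing import Callable, Iterable, Tuple, TypeVar

M = TypeVar("M")  # hypothetical model type; any object the callbacks understand


def best_first_with_lookahead(
    initial: M,
    successors: Callable[[M], Iterable[M]],
    score: Callable[[M], float],
    lookahead: int,
) -> Tuple[M, float]:
    """Follow the best-first path; stop after `lookahead` consecutive
    steps without improvement and return the best model seen."""
    best_model, best_score = initial, score(initial)
    current = initial
    steps_without_improvement = 0
    while steps_without_improvement < lookahead:
        candidates = list(successors(current))
        if not candidates:
            break  # nothing left to merge
        # Best-first: always move to the best candidate, even if it
        # scores worse than the best model found so far.
        current = max(candidates, key=score)
        current_score = score(current)
        if current_score > best_score:
            best_model, best_score = current, current_score
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
    return best_model, best_score


# Toy stand-in for a merging sequence: the first step lowers the score,
# the second step more than recovers it, later steps only lose ground.
TOY_SCORES = [0.0, -1.0, 2.0, 1.5, 1.0, 0.5]


def toy_successors(n: int) -> list:
    return [n + 1] if n + 1 < len(TOY_SCORES) else []


def toy_score(n: int) -> float:
    return TOY_SCORES[n]


if __name__ == "__main__":
    # A lookahead of 1 is plain best-first search and stops too early.
    print(best_first_with_lookahead(0, toy_successors, toy_score, lookahead=1))  # (0, 0.0)
    # A lookahead of 2 steps past the temporary dip.
    print(best_first_with_lookahead(0, toy_successors, toy_score, lookahead=2))  # (2, 2.0)
```

Because only the single best candidate is expanded at each step, the extra work is limited to the steps taken in vain before stopping, which matches the cost argument made for the lookahead strategy.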
3.5.1 Non-probabilistic finite-state models

At the most basic level we have the concept of state merging, which is implicit in the notion of state equivalence classes, and as such is pervasively used in much of automata theory (Hopcroft & Ullman 1979). It has also been applied to the induction of non-probabilistic automata (Angluin & Smith 1983).

Still in the field of non-probabilistic automata induction, Tomita (1982) has used a simple hill-climbing procedure combined with a goodness measure based on positive/negative samples to search the space of possible models. This strategy is obviously similar in spirit to our best-first search method (which uses a probabilistic goodness criterion based on positive samples alone).

The incremental version of the merging algorithm, in which samples are incorporated into a preliminary model structure one at a time, is similar in spirit (but not in detail) to the automata learning algorithm proposed by Porat & Feldman (1991), which induces finite-state models from positive-only, lexicographically ordered samples.

3.5.2 Bayesian approaches

The Bayesian approach to grammatical inference goes back at least to Horning (1969), where a procedure is proposed for finding the grammar with highest posterior probability given the data, using an enumeration of all candidate models in order of decreasing prior probability. While this procedure can be proven to converge to the maximum posterior probability grammar after a finite number of steps, it was found to be impractical when applied to the induction of context-free grammars. Horning’s approach can be applied to any enumerable grammatical domain, but there is no reason to believe that the simple enumerative approach would be feasible in any but the most restricted applications. The HMM merging approach can be seen as an attempt to make the Bayesian strategy workable by operating in a more data-driven manner, while sacrificing optimality of the result.

3.5.3 State splitting algorithms

Enumerative search for finding the best model structure is also used by Bell et al. (1990)⁸ to find optimal text compression models (for a given number of states), although they clearly state that this is not a feasible practical approach. They also suggest both state merging and splitting as ways of constructing model structure dynamically from data, although the former is dismissed as being too inefficient for their purposes.⁹ Although the underlying intuitions are very similar, there are some significant conceptual differences between our work and their compression-oriented approaches.

First, the evaluation functions used are invariably entropy (i.e., likelihood) based, and there is no formalized notion of a trade-off between model

⁸ Thanks to Fernando Pereira for pointing out this reference. It is amazing how much overlap, apparently without mutual knowledge, there is between the text compression field and probabilistic computational linguistics. For example, the problem of smoothing zero-probability estimates and the solutions using mixtures (Bahl et al. 1983) or back-off models (Katz 1987) all have almost perfect analogs in the various strategies for building code spaces for compression models.

⁹ Bell et al. (1990) attribute the state merging idea to Evans (1971).