Returning to the example, we now choose to merge states 2 and 6 (step 3). This step decreases the log likelihood (from log L = -0.602 to log L = -0.829), but it is the smallest decrease that can be achieved by any of the potential merges. Following that, states 1 and 5 can be merged without penalty (step 4). The resulting HMM is the minimal model generating the target language {ab, abab}, but what prevents us from merging further, to obtain an HMM for (ab)+?

It turns out that merging the remaining two states reduces the likelihood much more drastically than the previous, 'good' generalization step, from log L = -0.829 to log L = -3.465 (i.e., three decimal orders of magnitude). A preliminary answer, therefore, is to set the threshold small enough to allow only desirable generalizations. A more satisfactory answer is provided by the Bayesian methods described below.

Note that further data may well justify the generalization to a model for (ab)+. This data-driven character is one of the central aspects of model merging.

A domain-specific justification for model merging in the case of HMMs applies. It can be seen from the example that the structure of the generating HMM can always be recovered by an appropriate sequence of state merges from the initial model, provided that the available data 'covers' all of the generating model, i.e., each emission and transition is exercised at least once. Informally, this is because the initial model is obtained by 'unrolling' the paths used in generating the samples in the target model. The iterative merging process, then, is an attempt to undo the unrolling, tracing a search through the model space back to the generating model. Of course, the best-first heuristic is not guaranteed to find the appropriate sequence of merges, or, less critically, it may result in a model that is only weakly equivalent to the generating model.
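To make the thresholded best-first search concrete, the following is a minimal Python sketch of the merging loop described above. It is an illustration only, not the implementation used in this dissertation; log_likelihood, merge_states, and the states attribute are assumed helper names.

    from itertools import combinations

    def best_first_merge(model, data, threshold):
        """Greedily merge HMM states, at each step accepting the merge that
        costs the least log likelihood, until even the best remaining merge
        would lower log L(data) by more than `threshold`."""
        current = log_likelihood(model, data)  # assumed helper: log10 P(data | model)
        while len(model.states) > 1:
            # Evaluate every candidate state pair; keep the least harmful merge.
            best_model, best_ll = None, float("-inf")
            for s1, s2 in combinations(model.states, 2):
                candidate = merge_states(model, s1, s2)  # assumed helper: new HMM
                ll = log_likelihood(candidate, data)
                if ll > best_ll:
                    best_model, best_ll = candidate, ll
            if current - best_ll > threshold:
                break  # only undesirable generalizations remain
            model, current = best_model, best_ll
        return model

In the example above, any threshold between 0.227 (the cost of merging states 2 and 6) and 2.636 (the cost of the final, overgeneralizing merge) would accept steps 3 and 4 but stop the search at the minimal model for {ab, abab}.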
3.3.3 Priors for Hidden Markov Models

From the previous discussion it is clear that the choice of the prior distribution is important, since it is the term in (2.13) that drives generalization. We take the approach that priors should be subject to experimentation and empirical comparison of their ability to lead to useful generalization. The choice of a prior represents an intermediate level of probabilistic modeling, between the global choice of model formalism (HMMs, in our case) and the choice of a particular instance from a model class (e.g., a specific HMM structure and parameters). The model merging approach ideally replaces the usually poorly constrained choice of low-level parameters with a more robust choice of (few) prior parameters. As long as it does not assign zero probability to the correct model, the choice of prior is eventually overwhelmed by a sufficient amount of data. In practice, the ability to find the correct model may be limited by the search strategy used, in our case, the merging process.

HMMs are a special kind of parameterized graph structure. Unsurprisingly, many aspects of the priors discussed in this section can be found in Bayesian approaches to the induction of graph-based models in other domains, e.g., Bayesian networks (Cooper & Herskovits 1992; Buntine 1991) and decision trees (Buntine 1992).
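Since (2.13) is cited but not restated here, it may help to recall its general shape; assuming it is the standard Bayesian posterior over models M given data X (a reconstruction from the surrounding discussion, not a quotation of Chapter 2):

    P(M \mid X) \;=\; \frac{P(M)\,P(X \mid M)}{P(X)} \;\propto\; P(M)\,P(X \mid M)

The prior P(M) is then the factor that can favor a smaller, merged model even when the likelihood P(X | M) alone would prefer the unmerged one.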
