The dissertation of Andreas Stolcke is approved: University of ...
CHAPTER 3. HIDDEN MARKOV MODELS

Description Length priors

We can use the MDL framework as discussed in Section 2.5.6 to derive simple priors for HMM structures from various coding schemes. For example, a natural way to encode the transitions and emissions in an HMM is to simply enumerate them. Each transition can be encoded using log(|Q| + 1) bits, since there are |Q| possible transitions, plus a special 'end' marker which allows us not to encode the missing transitions explicitly. The total description length for all transitions from state q is thus n_t(q) log(|Q| + 1). Similarly, all emissions from q can be coded using n_e(q) log(|Σ| + 1) bits.^4 The resulting prior

    P(M) ∝ ∏_{q ∈ Q} (|Q| + 1)^{-n_t(q)} (|Σ| + 1)^{-n_e(q)}        (3.8)

where n_t(q) and n_e(q) denote the number of transitions and emissions out of state q, has the property that small differences in the number of states matter little compared to differences in the total number of transitions and emissions.

We have seen in Section 2.5.7 that the preferred criterion for maximization is the posterior of the model structure, P(M_S | X), which requires integrating out the parameters θ_M. In Section 3.4 we give a solution for this computation that relies on the approximation of sample likelihoods by Viterbi paths.

3.3.4 Why are smaller HMMs preferred?

Intuitively, we want an HMM induction algorithm to prefer 'smaller' models over 'larger' ones, other things being equal.
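As an illustration, the description-length prior of (3.8) is straightforward to compute for a given structure. The sketch below (our own illustration, with hypothetical helper names, not code from the text) scores HMM structures in log space and shows that a structure with fewer transitions and emissions receives a shorter description length, and hence a higher prior.

```python
import math

def description_length(n_states, n_symbols, trans_per_state, emits_per_state):
    """Total code length in bits of an HMM structure under the enumeration
    scheme behind prior (3.8): each transition costs log2(|Q|+1) bits and
    each emission costs log2(|Sigma|+1) bits."""
    trans_bits = sum(nt * math.log2(n_states + 1) for nt in trans_per_state)
    emit_bits = sum(ne * math.log2(n_symbols + 1) for ne in emits_per_state)
    return trans_bits + emit_bits

def log_prior(n_states, n_symbols, trans_per_state, emits_per_state):
    """log2 P(M), up to the normalizing constant: minus the description length."""
    return -description_length(n_states, n_symbols, trans_per_state, emits_per_state)

# Two toy structures over |Q| = 3 states and |Sigma| = 2 symbols:
sparse = log_prior(3, 2, trans_per_state=[1, 1, 1], emits_per_state=[1, 1, 1])
dense  = log_prior(3, 2, trans_per_state=[3, 3, 3], emits_per_state=[2, 2, 2])
assert sparse > dense  # fewer transitions/emissions => higher prior
```

Note that adding a state to the sparse structure changes the per-transition cost only from log2(4) to log2(5) bits, whereas adding transitions grows the total linearly, which is exactly the qualitative property claimed for (3.8).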
This can be interpreted as a special case of 'Occam's razor,' or the scientific maxim that simpler explanations are to be preferred unless more complex explanations are required to explain the data.

Once the notions of model size (or explanation complexity) and goodness of explanation are quantified, this principle can be modified to include a trade-off between the criteria of simplicity and data fit. This is precisely what the Bayesian approach does, since in optimizing the product P(M) P(X | M) a compromise between simplicity (embodied in the prior) and fit to the data (high model likelihood) is found.

But how is it that the HMM priors discussed in the previous section lead to a preference for 'smaller' or 'simpler' models? Two answers present themselves: one has to do with the general phenomenon of 'Occam factors' found in Bayesian inference; the other is related, but specific to the way HMMs partition data for purposes of 'explaining' it. We will discuss each in turn.

3.3.4.1 Occam factors

Consider the following scenario. Two pundits, M_1 and M_2, are asked for their predictions regarding an upcoming election involving a number of candidates. Each pundit has his/her own 'model' of the political process. We will identify these models with their respective proponents, and try to evaluate each according

^4 The basic idea of encoding transitions and emissions by enumeration has various more sophisticated variants.
For example, one could base the enumeration of transitions on a canonical ordering of states, such that only log |Q| + log(|Q| − 1) + ··· bits are required. Or one could use the k-out-of-n-bit integer coding scheme described in Cover & Thomas (1991) and used for MDL inference in Quinlan & Rivest (1989). Any reasonable Bayesian inference procedure should not be sensitive to such minor differences in the prior, unless it is used with too little data. Our goal here is simply to suggest priors that have reasonable qualitative properties, and are at the same time computationally convenient.
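To see that these coding variants differ only marginally, the following sketch (our own illustration, not from the text) compares the bits needed to describe which n_t of |Q| possible transitions leave a state under three schemes: plain enumeration at log2(|Q| + 1) bits each, enumeration under a canonical ordering of the remaining states, and a k-out-of-n style code that first sends n_t and then the index of the chosen subset among C(|Q|, n_t) possibilities.

```python
import math

def bits_enumeration(n_states, n_trans):
    # Plain enumeration: log2(|Q|+1) bits per transition (the '+1' is the
    # alphabet slot reserved for the 'end' marker).
    return n_trans * math.log2(n_states + 1)

def bits_canonical(n_states, n_trans):
    # Canonical ordering: each successive target is drawn from a shrinking
    # set of not-yet-used states, so the per-choice cost decreases.
    return sum(math.log2(n_states - i) for i in range(n_trans))

def bits_subset(n_states, n_trans):
    # k-out-of-n style: send n_trans, then the subset index among
    # C(|Q|, n_trans) equally likely subsets.
    return math.log2(n_states + 1) + math.log2(math.comb(n_states, n_trans))

for nt in (1, 3, 5):
    print(nt,
          round(bits_enumeration(10, nt), 2),
          round(bits_canonical(10, nt), 2),
          round(bits_subset(10, nt), 2))
```

For |Q| = 10 the three schemes stay within a few bits of one another per state, which supports the point above: with any reasonable amount of data, such small prior differences are swamped by the likelihood term.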