The dissertation of Andreas Stolcke is approved: University of ...

CHAPTER 3. HIDDEN MARKOV MODELS

distribution for a target language was generated by assigning uniform probabilities at all choice points in a given HMM topology.

The induced models were compared using a variety of techniques. A simple quantitative comparison is obtained by computing the log likelihood on a test set. This is proportional to the negative of (the empirical estimate of) the cross-entropy, which reaches a minimum when the two distributions are identical.

To evaluate the HMM topology induced by Baum-Welch training, the resulting models are pruned, i.e., transitions and emissions with probability close to zero are deleted. The resulting structure can then be compared with that of the target model, or one generated by merging. The pruning criterion used throughout was that a transition or emission had an expected count of less than 10⁻³ given the training set.

Specifically, we would like to check that the resulting model generates exactly the same discrete language as the target model. This can be done empirically (with arbitrarily high accuracy) using a simple Monte-Carlo experiment. First, the target model is used to generate a reasonably large number of samples, which are then parsed using the HMM under evaluation. Samples which cannot be parsed indicate that the induction process has produced a model that is not sufficiently general. This can be interpreted as overfitting the training data.

Conversely, we can generate samples from the HMM in question and check that these can be parsed by the target model.
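The Monte-Carlo parse test described above can be sketched as follows. This is a hypothetical illustration, not code from the dissertation: the dict-based HMM representation, the 'END' pseudo-successor, and the example models for the language (ab)+ are all inventions for the sake of the example.

```python
import random

# A minimal discrete-output HMM as plain dicts. States emit one symbol per
# visit; 'END' is a pseudo-successor marking termination.

def forward_prob(hmm, seq):
    """Total probability that the HMM generates exactly `seq`
    (0.0 means the string is outside the model's language)."""
    if not seq:
        return 0.0
    states = list(hmm['init'])
    alpha = {s: hmm['init'][s] * hmm['emit'][s].get(seq[0], 0.0) for s in states}
    for x in seq[1:]:
        alpha = {t: sum(alpha[s] * hmm['trans'][s].get(t, 0.0) for s in states)
                    * hmm['emit'][t].get(x, 0.0)
                 for t in states}
    return sum(alpha[s] * hmm['trans'][s].get('END', 0.0) for s in states)

def sample_string(hmm, rng):
    """Draw one string from the HMM's distribution."""
    states = list(hmm['init'])
    s = rng.choices(states, weights=[hmm['init'][q] for q in states])[0]
    out = []
    while True:
        syms = list(hmm['emit'][s])
        out.append(rng.choices(syms, weights=[hmm['emit'][s][x] for x in syms])[0])
        succ = list(hmm['trans'][s])
        s = rng.choices(succ, weights=[hmm['trans'][s][t] for t in succ])[0]
        if s == 'END':
            return out

def parse_coverage(generator, parser, n=1000, seed=0):
    """Fraction of strings sampled from `generator` that `parser` accepts."""
    rng = random.Random(seed)
    return sum(forward_prob(parser, sample_string(generator, rng)) > 0.0
               for _ in range(n)) / n

# Example target: the language (ab)+, with states A and B emitting 'a' and 'b'.
target = {
    'init':  {'A': 1.0, 'B': 0.0},
    'emit':  {'A': {'a': 1.0}, 'B': {'b': 1.0}},
    'trans': {'A': {'B': 1.0}, 'B': {'A': 0.5, 'END': 0.5}},
}
# An overfitted induction result that generates only the single string 'ab'.
overfit = {
    'init':  {'A': 1.0, 'B': 0.0},
    'emit':  {'A': {'a': 1.0}, 'B': {'b': 1.0}},
    'trans': {'A': {'B': 1.0}, 'B': {'END': 1.0}},
}
```

A coverage below 1 for `parse_coverage(target, overfit)` flags overfitting; swapping the roles of the two models, as in the paragraph above, tests for overgeneralization instead.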
If not, the induction has overgeneralized.

In some cases we also inspected the resulting HMM structures, mostly to gain an intuition for the possible ways in which things can go wrong. Some examples of this are presented below.

Note that the outcome of the Baum-Welch algorithm may (and indeed does, in our experience) vary greatly with the initial parameter settings. We therefore set the initial transition and emission probabilities randomly from a uniform distribution, and repeated each Baum-Welch experiment 10 times. Merging, on the other hand, is deterministic, so for each group of runs only a single merging experiment was performed for comparison purposes.

Another source of variation in the Baum-Welch method is the fixed total number of model parameters. In our case, the HMMs are fully parameterized for a given number of states and set of possible emissions; so the number of parameters can be simply characterized by the number of states in the HMM.¹²

In the experiments reported here we tried two variants: one in which the number of states was equal to that of the target model (the minimal number of states for the language at hand), and a second one in which additional states were provided.

Finally, the nature of the training samples was also varied. Besides a standard random sample from the target distribution we also experimented with a ‘minimal’ selection of representative samples, chosen to be characteristic of the HMM topology in question.
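The random-restart setup described above can be sketched as follows. This is a hypothetical illustration of the idea, not the dissertation's code: every probability row of a fully parameterized HMM is drawn uniformly at random and renormalized, and the draw is repeated for each restart.

```python
import random

def random_hmm_params(n_states, alphabet, rng):
    """Random initial parameters for a fully parameterized HMM with
    `n_states` states (initial/final states not counted, as in the text).
    Each distribution is a uniform random draw, renormalized to sum to 1."""
    def row(keys):
        w = [rng.random() for _ in keys]
        z = sum(w)
        return {k: x / z for k, x in zip(keys, w)}
    states = list(range(n_states))
    init = row(states)                                   # start-state distribution
    trans = {s: row(states + ['END']) for s in states}   # 'END' = termination
    emit = {s: row(list(alphabet)) for s in states}
    return init, trans, emit

# Ten independent restarts, mirroring the 10 repeated Baum-Welch runs above.
restarts = [random_hmm_params(3, 'ab', random.Random(seed)) for seed in range(10)]
```

Each restart yields a different starting point, which is what makes the Baum-Welch outcomes vary from run to run, whereas merging needs no such repetition.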
This selection is heuristic and can be characterized as follows: “From the list of all strings generated by the target model, pick a minimal subset, in order of decreasing probability, such that each transition and emission in the target model is exercised at least once, and such that each looping transition is exemplified by a sample that traverses the loop at least twice.” The minimal

¹² By convention, we exclude initial and final states in these counts.
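A greedy sketch of the selection heuristic quoted above: complete state/emission paths are enumerated best-first, i.e. in order of decreasing probability, and a string is kept whenever its path exercises a transition or emission not yet covered. This is hypothetical code, not the dissertation's; for brevity it implements only the coverage condition and omits the loop-traversed-twice refinement, and it reuses the dict HMM encoding with an 'END' pseudo-successor assumed in the earlier sketches.

```python
import heapq

def minimal_sample(hmm, max_pop=100000):
    """Strings of `hmm`, in decreasing probability, kept until every
    transition and emission has been exercised at least once."""
    needed = {('trans', s, t) for s, succ in hmm['trans'].items()
              for t, p in succ.items() if p > 0}
    needed |= {('emit', s, x) for s, em in hmm['emit'].items()
               for x, p in em.items() if p > 0}
    heap = []  # entries: (-prob, ((state, symbol), ...)[, done-flag])
    for s, p0 in hmm['init'].items():
        for x, pe in hmm['emit'][s].items():
            if p0 * pe > 0:
                heapq.heappush(heap, (-(p0 * pe), ((s, x),)))
    chosen, covered = [], set()
    while heap and covered < needed and max_pop > 0:
        max_pop -= 1
        entry = heapq.heappop(heap)
        negp, pairs = entry[0], entry[1]
        s = pairs[-1][0]
        if len(entry) == 3:  # complete path, terminated at s
            used = ({('emit', q, x) for q, x in pairs}
                    | {('trans', pairs[i][0], pairs[i + 1][0])
                       for i in range(len(pairs) - 1)}
                    | {('trans', s, 'END')})
            if not used <= covered:  # this string exercises something new
                covered |= used
                chosen.append(''.join(x for _, x in pairs))
            continue
        p_end = hmm['trans'][s].get('END', 0.0)
        if p_end > 0:  # a completed version of this path
            heapq.heappush(heap, (negp * p_end, pairs, True))
        for t, pt in hmm['trans'][s].items():  # one-step extensions
            if t == 'END' or pt == 0:
                continue
            for x, pe in hmm['emit'][t].items():
                if pe > 0:
                    heapq.heappush(heap, (negp * pt * pe, pairs + ((t, x),)))
    return chosen

# Example target from the earlier sketch: the language (ab)+.
target = {
    'init':  {'A': 1.0, 'B': 0.0},
    'emit':  {'A': {'a': 1.0}, 'B': {'b': 1.0}},
    'trans': {'A': {'B': 1.0}, 'B': {'A': 0.5, 'END': 0.5}},
}
```

For this target the heuristic selects 'ab' (covering both emissions, A→B, and B→END) and then 'abab' (covering the looping transition B→A); the full heuristic, with the loop condition included, would additionally demand a sample traversing that loop twice.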
