At the very least, the method chosen should be

- well-defined, i.e., correspond to some probabilistic model that represents a proper distribution over all strings;
- unbiased with respect to the methods being compared, to the extent possible.

Standard back-off models (where a second model is consulted if, and only if, the first one returns probability zero) do not yield consistent probabilities unless they are combined with 'discounting' of probabilities to ensure that the total probability mass sums to unity (Katz 1987). The discounting scheme, as well as various smoothing approaches (e.g., adding a fixed number of virtual 'Dirichlet' samples to the parameter estimates), tend to be specific to the model used, and are therefore inherently problematic when comparing different model-building methods.

To overcome these problems, we chose to use the mixture model approach described in Section 2.3.1. The target models to be evaluated are combined with a simple back-off model that guarantees non-zero probabilities, e.g., a bigram grammar with smoothed parameters. This back-off grammar is identical in structure for all target models. Unlike discrete back-off schemes, the target and the back-off model are both always consulted for the probability they assign to a given sample; these probabilities are then weighted and averaged according to a mixture proportion.

When comparing two model induction methods, we first let each induce a structure. Each is built into a mixture model, and both the component model parameters and the mixture proportions are estimated using the EM procedure for generic mixture distributions. To get meaningful estimates for the mixture proportions, the HMM structure is induced based on a subset of the training data, and the full training data is then used to estimate the parameters, including the mixture weights. This holding out of training data makes the mixture model approach similar to the deleted interpolation method (Jelinek & Mercer 1980). The main difference is that the component parameters are estimated jointly with the mixture proportions.¹⁸ In our experiments we always used half of the training data in the structure induction phase, adding the other half during the EM estimation phase. Also, to ensure that the back-off model receives a non-zero prior probability, we estimate the mixture proportions under a simple symmetrical Dirichlet prior with α₁ = α₂ = 1.5.
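To make the evaluation scheme concrete, the following is a minimal sketch of the two-component mixture and the EM re-estimation of its proportion. The scoring functions target_prob and backoff_prob are hypothetical stand-ins for the probabilities assigned by the target HMM and the back-off grammar, the iteration count is an illustrative placeholder, and the sketch re-estimates only the mixture weight, whereas the actual procedure re-estimates the component parameters jointly with it.

```python
def mixture_prob(x, lam, target_prob, backoff_prob):
    """Probability of sample x under the two-component mixture:
    lam * P_target(x) + (1 - lam) * P_backoff(x)."""
    return lam * target_prob(x) + (1.0 - lam) * backoff_prob(x)


def estimate_mixture_weight(heldout, target_prob, backoff_prob,
                            alpha=1.5, iterations=50):
    """EM (MAP) estimate of the mixture proportion on held-out samples,
    under a symmetrical Dirichlet(alpha, alpha) prior on the weights."""
    lam = 0.5  # uninformative starting point
    n = len(heldout)
    for _ in range(iterations):
        # E-step: posterior probability that the target model generated x
        resp = []
        for x in heldout:
            pt = lam * target_prob(x)
            pb = (1.0 - lam) * backoff_prob(x)
            resp.append(pt / (pt + pb))  # pb > 0, since the back-off is smoothed
        # M-step: the Dirichlet prior contributes (alpha - 1) virtual samples
        # per component, keeping both weights away from zero
        lam = (sum(resp) + alpha - 1.0) / (n + 2.0 * (alpha - 1.0))
    return lam
```

Because both components are always consulted, the mixture assigns non-zero probability to every sample even when the target HMM itself does not, which is what makes comparisons between differently induced structures well-defined.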
3.6.2.4 Results and discussion

HMM merging was evaluated in two variants, with and without the single-output constraint. In each version, three settings of the structure prior weight were tried: 0.25, 0.5, and 1.0. Similarly, for Baum-Welch training, the preset number of states in the fully parameterized HMM was set to 1.0, 1.5, and 1.75 times the longest sample length. For comparison purposes, we also included the performance of the unmerged maximum-likelihood HMM, and of a biphone grammar of the kind used in the mixture models built to evaluate the other model types. Table 3.1 summarizes the results of these experiments.

¹⁸ This difference can be traced to the different goals: in deleted interpolation the main goal is to gauge the reliability of parameter estimates, whereas here we want to assess the different structures.
