CHAPTER 3. HIDDEN MARKOV MODELS

The table also includes useful summary statistics of the model sizes obtained, and the time it took to compute the models. The latter figures are obviously only a very rough measure of computational demands, and their comparison suffers from the fact that the implementation of each of the methods may well be optimized in idiosyncratic ways. Nevertheless, these figures should give an approximate idea of what to expect in a realistic application of the induction methods involved.

One important general conclusion from these experiments is that both the merged models and those obtained by Baum-Welch training do significantly better than the two 'dumb' approaches, the bigram grammar and the ML HMM (which is essentially a list of observed samples). We can therefore conclude that it pays to try to generalize from the data, either using our Bayesian approach or Baum-Welch on an HMM of suitable size.

Overall, the differences in scores even between the simplest approach (bigram) and the best-scoring one (obtained by merging) are quite small, with phone perplexities ranging from 1.985 to 1.849. This is not surprising given the specialized nature and small size of the sample corpus. Unfortunately, this also leaves very little room for significant differences when comparing alternative methods. However, the advantage of the best model merging result (unconstrained outputs) is still significant compared to the best Baum-Welch result (size factor 1.5) (p = 0.041). Such small differences in log probabilities would probably be irrelevant when the resulting HMMs are embedded in a speech recognition system.

Perhaps the biggest advantage of the merging approach in this application is the compactness of the resulting models. The merged models are considerably smaller than the comparable Baum-Welch HMMs. This is important for any of the standard algorithms operating on HMMs, which typically scale linearly with the number of transitions (or quadratically with the number of states). Besides this advantage in production use, the training times for Baum-Welch grow quadratically with the number of states during the structure induction phase, since that phase requires fully parameterized HMMs, in which the number of transitions (and hence the per-iteration cost) is quadratic in the number of states. This scaling is clearly visible in the run times we observed.

Although we have not done a word-by-word comparison of the HMM structures derived by merging and Baum-Welch, the summary of model sizes seems to confirm our earlier finding (Section 3.6.1) that Baum-Welch training needs a certain redundancy in 'model real estate' to be effective in finding good-fitting models: smaller size factors give poor fits, whereas sufficiently large HMMs will tend to overfit the training data.
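To put the perplexity figures quoted above in perspective, and assuming the standard definition of test-set perplexity as the inverse geometric mean probability per phone, a model M evaluated on a test set containing N phones in total has

\[
\mathrm{PP}(M) \;=\; P(\mbox{test set} \mid M)^{-1/N}
\;=\; \exp\!\Bigl(-\tfrac{1}{N}\,\log P(\mbox{test set} \mid M)\Bigr),
\]

so the gap between perplexities of 1.985 and 1.849 corresponds to about log 1.985 - log 1.849 = 0.07 nats (roughly 0.10 bits) of log probability per phone.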
The choice of the weight given to the prior in HMM merging (Section 3.4.4) controls the model size in an indirect way: larger values lead to more generalization and smaller HMMs. For best results, this value can be set based on previous experience with representative data. This could effectively be done in a cross-validation-like procedure, in which generalization is successively increased, starting with small prior weights. Due to the nature of the merging algorithm, this can be done incrementally, i.e., the outcome of merging with a small weight can be submitted to further merging at a larger value, until further increases reduce generalization on the cross-validation data.
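To make this schedule concrete, the following is a minimal sketch of how the incremental procedure could be organized. It is not the implementation used for the experiments above: merge_hmm and heldout_log_prob are hypothetical stand-ins for a single merging pass at a given prior weight and for scoring the cross-validation data, and the specific weight values are illustrative only.

# Hypothetical sketch of the incremental prior-weight schedule described above.
# merge_hmm(model, data, prior_weight) performs one merging pass at the given
# prior weight; heldout_log_prob(model, data) scores held-out data. Both are
# assumed interfaces, not functions from this dissertation's implementation.

def tune_prior_weight(initial_model, train_data, heldout_data,
                      weights=(0.1, 0.25, 0.5, 1.0, 2.0)):
    """Increase the prior weight step by step, reusing the previously merged
    model as the starting point, and stop once generalization on the
    held-out data no longer improves."""
    best_model = initial_model
    best_score = heldout_log_prob(initial_model, heldout_data)
    model = initial_model
    for w in sorted(weights):          # small weights first; merging only ever collapses states
        model = merge_hmm(model, train_data, prior_weight=w)
        score = heldout_log_prob(model, heldout_data)
        if score < best_score:         # further generalization now hurts on held-out data
            break
        best_model, best_score = model, score
    return best_model

Reusing the previously merged model as the starting point is what makes the schedule cheap: since merging only collapses states, the model obtained with a small weight is a valid starting point for further merging at a larger one.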
