complexity and data fit. Second, since the finite-state models they investigate act as encoders/decoders of text, they are deterministic, i.e., the current state and the next input symbol determine a unique next state (it follows that each string has a unique derivation). This constrains the model space and allows states to be identified with string suffixes, which is the basis of all their algorithms. Finally, the models have no end states, since they are supposed to encode continuous text. This is actually a minor difference, since we can view the end-of-sentence as a special symbol, so that the final state is simply one that is dedicated to emitting that special symbol.

Bell et al. (1990) suggest state splitting as a more efficient induction technique for adaptively finding a finite-state model structure. In this approach, states are successively duplicated and differentiated according to their preceding context, whenever such a move promises to help the prediction of the following symbol. Ron et al. (1994) give a reformulation and formal analysis of this idea in terms of an information-theoretic evaluation function.

Interestingly, Bell et al. (1990) show that such a state splitting strategy confines the power of the finite-state model to that of a finite-context model. In models of this type there is always a finite bound k, such that the last k preceding symbols uniquely determine the distribution of the next symbol. In other words, state-based models derived by this kind of splitting are essentially n-gram models with variable (but bounded) context. This restriction applies equally to the algorithm of Ron et al. (1994).

By contrast, consider the HMM depicted in Figure 3.2, which is used below as a benchmark model. It describes a language in which the context needed for correct prediction of the final symbol is unbounded. Such a model can be found without difficulty by simple best-first merging. The major advantage of the splitting approach is that it is guaranteed to find the appropriate model if enough data is presented and if the target language is in fact finite-context.
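To make the finite-context restriction concrete, the following is a minimal sketch (hypothetical code; not the actual algorithm of Bell et al. (1990) or Ron et al. (1994)) of a predictor whose context is variable but bounded by k:

```python
from collections import defaultdict

class BoundedContextModel:
    """Variable-context (bounded) next-symbol predictor.

    Illustrative sketch: the last at most k symbols determine the
    next-symbol distribution, as in a variable-order n-gram model.
    """

    def __init__(self, k):
        self.k = k  # hard bound on context length
        # context (string suffix) -> next symbol -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, string):
        for i, symbol in enumerate(string):
            # credit the symbol to every context suffix of length 0..k
            for length in range(min(i, self.k) + 1):
                self.counts[string[i - length:i]][symbol] += 1

    def predict(self, history):
        # back off to the longest context suffix seen in training
        for length in range(min(self.k, len(history)), -1, -1):
            context = history[len(history) - length:]
            if context in self.counts:
                dist = self.counts[context]
                total = sum(dist.values())
                return {s: c / total for s, c in dist.items()}
        return {}
```

Whatever the value of k, two histories that differ only in symbols more than k positions back map to the same context and therefore receive identical predictions. An HMM, by contrast, can carry such a distinction in its hidden state indefinitely, which is how a model like that of Figure 3.2 can make the final symbol depend on unboundedly distant context.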
3.5.4 Other probabilistic approaches

Another probabilistic approach to HMM structure induction, similar to ours, is described by Thomason & Granum (1986). The basic idea is to incrementally build a model structure by incorporating new samples using an extended form of Viterbi alignment. New samples are aligned to the existing model so as to maximize their likelihood, while allowing states to be inserted or deleted for alignment purposes. The procedure is limited to HMMs that have a left-to-right ordering of states, however; in particular, no loops are allowed.

In a sense this approach can be seen as an approximation to Bayesian HMM merging for this special class of models. The approximation in this case is twofold: the likelihood (not the posterior) is maximized, and only the likelihood of a single sample (rather than the entire data set) is considered.

Haussler et al. (1992) apply HMMs trained by the Baum-Welch method to the problem of protein primary structure alignment. Their model structures are mostly of a fixed, linear form, but subject to limited modification by a heuristic that inserts states ('stretches' the model) or deletes states ('shrinks' the model) based on the estimated probabilities.

Somewhat surprisingly, the work by Brown et al. (1992) on the construction of class-based n-gram
