The dissertation of Andreas Stolcke is approved: University of ...

More documents

Recommendations

Info

CHAPTER 1. INTRODUCTION 2Instance-based parts of a model can coexist with generalized ones, depending on the degree of similarity¡among the observed samples, allowing the model to adapt to non-uniform coverage of the sample space.The generalization process is driven and controlled by a uniform, probabilistic metric: the Bayesian¡posterior probability of a model, integrating both criteria of goodness-of-fit with respect to the data anda notion of model simplicity (‘Occam’s Razor’).Our 1 approach is quite general in nature and scope (comparable to, say, Mitchell’s version spaces(Mitchell 1982)) and needs to be instantiated in concrete domains to study its utility and practicality. Wewill do that with three different types of probabilistic models: Hidden Markov Models (HMMs), stochasticcontext-free grammars (SCFGs), and simple probabilistic attribute grammars (PAGs).Following this introduction, Chapter 2 presents the basic concepts and mathematical formalismsunderlying probabilistic language models and Bayesian learning, and also introduces our approach to learningin general terms.Chapter 3 (HMMs), Chapter 4 (SCFGs) and Chapter 5 (attribute grammars) describe the particularversions of the learning approach for the various types of languages models. Unfortunately, these chapters(except for Chapter 3) are not entirely self-contained, as they form a natural progression in both ideas andformalisms presented.The following two chapters address various computational problems aside from learning that arisein connection with probabilistic context-free language models. Chapter 6 deals with probabilistic parsing andChapter 7 gives an algorithm for approximating context-free grammars with much simpler ¢ -gram models.These two chapters are nearly self-contained and need not be read in any particular order (with respect eachother or the preceding chapters).future research.Chapter 8 discusses general open issues arising from the present work and gives an outlook onVirtually all probabilistic grammar types and algorithms described in the following chapters havebeen implemented and integrated in an object-oriented framework in CommonLisp/CLOS. The result purportsto be a flexible and extensible environment for experimentation with probabilistic language models. Thedocumentation for this system is available separately (Stolcke 1994).background.The remainder of this introductiongives general motivationand highlights,as well as some historical1.2 Structural Learning of Probabilistic GrammarsProbabilistic language models (or grammars) have firmly established themselves in a number ofareas in recent time (automatic speech recognition being one of the major applications). One importantfactor is their probabilistic nature itself: they can be used to make weighted predictions about future data1 The first person plural will be used throughout, both for stylistic uniformity and to reflect the fact that much of this work was donein collaboration with others. Bibliographic references to co-authored publications are given at the end of this chapter.
CHAPTER 1. INTRODUCTION 3based (or conditioned) on evidence seen in the past, using the framework of probability theory as a consistentmathematical basis.However, this fundamental feature is only truly useful because probabilistic models are also adaptable:there are effective algorithms for tuning a model based on previously observed data, so as to optimizeits predictions on new data (assuming old and new data obey the same statistics).1.2.1 Probabilistic finite-state modelsAn example in point are the probabilistic finite-state models known as Hidden Markov models(HMMs) routinely used in speech recognition to model the phone sequences making up the words to berecognized. The top part of Figure 1.1 shows a simple word model for “and.” Each phonetic realization ofthe word corresponds to a path through the network of states and transitions, with the probabilities indicated.Given a network structure and a training corpus data to be modeled, there are standard algorithms foroptimizing (or estimating) the probability parameters of the HMM to fit the data.However, a more fundamental problem is how to obtain a suitable model structure in first place.The basic idea here will be to construct initial model networks from the observed data (as shown in the bottompart of Figure 1.1), and then gradually transform them into a more compact and general form by a processcalled model merging. We will see that there is a fundamental tension between optimizing the fit of the modelto the observed data, and the goal of generalizing to new data. We will use the Bayesian notions of prior andposterior model probabilities to formalize these conflicting goals and derive a combined criterion that allowsfinding a compromise between them.Chapter 3 describes this approach to structural learning of HMMs and discusses many of the issuesand methods recurring in later chapters.1.2.2 The Miniature Language (* Learning 0) TaskAn additional motivation for the present work came from a seemingly simple task proposed byFeldman et al. (1990): construct a machine learner that could generalize from usage examples of a naturallanguage fragment to novel instances, for an arbitrary natural language. Figure 1.2 shows the essentialelements of this miniature language learning problem, informally known as “¨ the 0” task. The goal isto ‘learn’ ¨ the 0 language from exposure to pairs of corresponding two-dimensional pictures and naturallanguage descriptions. Both the syntax and semantics of the language were intentionally limited to makethe problem more manageable. The purpose of the proposal was to highlight certain fundamental problemswith traditional cognitive science theories, including issue such as dependence on the underlying conceptualsystem, grounding of meaning and categorization in perception, and others which are explored in recent andongoing research (Regier 1992; Feldman et al. 1994).For our purposes we can abstract a (much simpler) subproblem from this interdisciplinary task:given pairs of sentences and associated idealized semantics (e.g., in first-order logic formulae), construct anadequate formal description of the relation between these two for the given language.
Page 1 and 2: The dissertation of Andreas Stolcke
Page 3 and 4: Bayesian Learning of Probabilistic
Page 5 and 6: iAcknowledgmentsLife and work in Be
Page 7 and 8: iiiContentsList of FiguresList of T
Page 9 and 10: CONTENTSv4.5.4 Summary and Discussi
Page 16 and 17: CHAPTER 1. INTRODUCTION 4A.0.830.33
Page 18 and 19: CHAPTER 1. INTRODUCTION 6the ¨ 0 l
Page 20 and 21: ..1 £££1; 450,1 £££1; 450CHAP
Page 22 and 23: VU=@U@@=U===UCHAPTER 2. FOUNDATIONS
Page 24 and 25: ,,vv,v,v,,directly. However, note t
Page 26 and 27: 4@@@@-@b@6@˜--@@@0@@@@@CHAPTER 2.
Page 28 and 29: 6tt,u ·¥¸¹u ,10ºtu ,2 10Yt ¸
Page 30 and 31: CHAPTER 2. FOUNDATIONS 18As more da
Page 32 and 33: CHAPTER 2. FOUNDATIONS 20Global mod
Page 34 and 35: CHAPTER 2. FOUNDATIONS 22¡An expli
Page 36 and 37: ÊS==66@N,ÆÆ=NÆ00ÆÊ=S=N0Æ=#@0
Page 38 and 39: 666CHAPTER 2. FOUNDATIONS 262.5.7 P
Page 40 and 41: It, u¦¸¹u Ù 0w6¬tt,_, u Ù 0
Page 42 and 43: , uu!¸¹u Ù 0c6,u ,ÔÔ0 ö1 ö1
Page 44 and 45: CHAPTER 3. HIDDEN MARKOV MODELS 32T
Page 46 and 47: CHAPTER 3. HIDDEN MARKOV MODELS 34R
Page 48 and 49: 4ÿ= ê•4TÃE0&Ò¢¡? •ç1 Lht
Page 50 and 51: 6666ò U1ò +9,9. 4 20+-,¡ . 4 10C
Page 52 and 53: 2. For each candidate"I!computeLet"
Page 54 and 55: 6\“ç%&ät\“ç tè ä, u¦¸¹u
Page 56 and 57: , u1 ¸¼u Ù 0 and , u3 ¸¹u Ù 0
Page 58 and 59: CHAPTER 3. HIDDEN MARKOV MODELS 46l
Page 60 and 61: CHAPTER 3. HIDDEN MARKOV MODELS 48c
Page 62 and 63:
CHAPTER 3. HIDDEN MARKOV MODELS 50d
Page 64 and 65:
CHAPTER 3. HIDDEN MARKOV MODELS 520
Page 66 and 67:
correlation between initial and fin
Page 68 and 69:
,CHAPTER 3. HIDDEN MARKOV MODELS 56
Page 70 and 71:
Page 72 and 73:
,CHAPTER 3. HIDDEN MARKOV MODELS 60
Page 74 and 75:
CHAPTER 3. HIDDEN MARKOV MODELS 62b
Page 76 and 77:
CHAPTER 3. HIDDEN MARKOV MODELS 64t
Page 78 and 79:
CHAPTER 3. HIDDEN MARKOV MODELS 66t
Page 80 and 81:
CHAPTER 3. HIDDEN MARKOV MODELS 68s
Page 82 and 83:
Page 84 and 85:
Page 86 and 87:
CHAPTER 3. HIDDEN MARKOV MODELS 74b
Page 88 and 89:
domain. 3 In short, we will leave o
Page 90 and 91:
,,,,,£CHAPTER 4. STOCHASTIC CONTEX
Page 92 and 93:
9 ¸)Ô ¸ 9 ¸Ô 1 2 £££;,ÔÔC
Page 94 and 95:
¸= ¸= ¸.1.2¸¸) 1_) 20&6#=,,,,
Page 96 and 97:
CHAPTER 4. STOCHASTIC CONTEXT-FREE
Page 98 and 99:
Page 100 and 101:
Page 102 and 103:
Page 104 and 105:
==Ì==ÌCHAPTER 4. STOCHASTIC CONTE
Page 106 and 107:
,= ===I¸theybÜ„thiscg„\\ ¸¸
Page 108 and 109:
Page 110 and 111:
Page 112 and 113:
Page 114 and 115:
Page 116 and 117:
104Chapter 5Probabilistic Attribute
Page 118 and 119:
,1makingandCHAPTER 5. PROBABILISTIC
Page 120 and 121:
CHAPTER 5. PROBABILISTIC ATTRIBUTE
Page 122 and 123:
Page 124 and 125:
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
Page 132 and 133:
Page 134 and 135:
122Chapter 6Efficient parsing with
Page 136 and 137:
1z1CHAPTER 6. EFFICIENT PARSING WIT
Page 138 and 139:
and each state in set ( -ÏH ):6)
Page 140 and 141:
NPDetVTVI P CHAPTER 6. EFFICIENT PA
Page 142 and 143:
CHAPTER 6. EFFICIENT PARSING WITH S
Page 144 and 145:
) In particular, the string probabi
Page 146 and 147:
H : d=:6) ¸ÆÆÆ:6) ¸ .V=i£Ù )
Page 148 and 149:
©) 6#=©,©,©,,ÆNNö,²NL++and>+
Page 150 and 151:
) The probabilistic unit-production
Page 152 and 153:
¸0 ¸ 29¸¸ 99W9 [;t£ ? 1 ?1u 6
Page 154 and 155:
The forward and inner probabilities
Page 156 and 157:
9 itself²²NN++9 ¸0ÌLL++??1£,=C
Page 158 and 159:
,,by nonterminals. Multiplying this
Page 160 and 161:
6CHAPTER 6. EFFICIENT PARSING WITH
Page 162 and 163:
description. Again, we ignore this
Page 164 and 165:
for all pairs of states d =¸+= Š
Page 166 and 167:
,9 ¸0 : 0¸ £j9¸ A )z9£CHAPTER
Page 168 and 169:
Page 170 and 171:
are then summed over all nontermina
Page 172 and 173:
+CHAPTER 6. EFFICIENT PARSING WITH
Page 174 and 175:
1CHAPTER 6. EFFICIENT PARSING WITH
Page 176 and 177:
Page 178 and 179:
= ¸Let t,t) ¸ .V=i£,t,t,6666yyyy
Page 180 and 181:
168Chapter 7-grams from Stochastic
Page 182 and 183:
CHAPTER 7. -GRAMS FROM STOCHASTIC
Page 184 and 185:
)ÅÆÅÅÅÆÅÅÅÆÅÅÅÆÅÅÅ
Page 186 and 187:
-grams CCHAPTER 7. -GRAMS FROM ST
Page 188 and 189:
,?Ó,tÌ?L A0,I 1N I N A A 2 A 3 N
Page 190 and 191:
,,CHAPTER 7. -GRAMS FROM STOCHASTI
Page 192 and 193:
Consider the following problem: sta
Page 194 and 195:
CHAPTER 8. FUTURE DIRECTIONS 1828.2
Page 196 and 197:
184BibliographyAHO, ALFRED V., RAVI
Page 198 and 199:
BIBLIOGRAPHY 186DAGAN, IDO, FERNAND
Page 200 and 201:
BIBLIOGRAPHY 188——, & ——. 1
Page 202 and 203:
BIBLIOGRAPHY 190QUINLAN, J. ROSS, &
Page 204:
BIBLIOGRAPHY 192WALLACE, C. S., & P
show all

The dissertation of Andreas Stolcke is approved: University of ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?