
The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

Cook et al. (1976) exhibit a procedure similar to ours, using a somewhat different set of operators.[9] Their approach is also aimed at probabilistic SCFGs, but uses a conceptually quite different evaluation function, which will be discussed in more detail below, as it illustrates a fundamental feature of the Bayesian philosophy adopted here.

Langley (1994) discusses a non-probabilistic CFG induction approach using the same merging and chunking operators as described here, which in turn is based on that of Wolff (1978). Langley's CFG learner also alternates between merging and chunking. No incremental learning strategy is described, although adding one along the lines presented here seems straightforward. The evaluation function is non-probabilistic, but incorporates several heuristics to control data fit and a bias towards grammar 'simplicity,' measured by the total length of production RHSs. A comparison with our Bayesian criterion highlights the considerable conceptual and practical simplification gained from using probabilities as the universal 'currency' of the evaluation metric.

The present approach was derived as a minimal extension of the HMM merging approach to SCFGs (see Section 3.5 for origins of the state merging concept). As such, it is also related to various induction methods for non-probabilistic CFGs that rely on structured (parse-tree skeleton) samples to form tree equivalence classes that correspond to the nonterminals in a CFG (Fass 1983; Sakakibara 1990). As we have seen, merging alone is sufficient as an induction operator if fully bracketed samples are provided.

4.4.3 Cook's Grammatical Inference by Hill Climbing

Cook et al. (1976) present a hill-climbing search procedure for SCFGs that shares many of the features and ideas of ours. Among these is the best-first approach, and an evaluation metric that aims to balance 'complexity' of the grammar against 'discrepancy' relative to the target distribution. A crucial difference is that only the relative frequencies of the samples, serving as an approximation to the true target distribution, are used.

Discrepancy of grammar and samples is evaluated by a metric that combines elements of the standard relative entropy with an ad-hoc measure of string complexity. Complexity of the grammar is likewise measured by a mix of rule entropy and rule complexity.[10] Discrepancy and complexity are then combined in a weighted sum, where the weighting factor is set empirically (although the induction procedure is apparently quite robust with respect to the exact value of this parameter).

To see the conceptual difference from the Bayesian approach, consider the introductory example from Section 4.3.2. The four samples ab, aabb, aaabbb, aaaabbbb, observed with relative frequencies (10, 5, 2, 1), are good evidence for a generalization to the target grammar that generates a^n b^n. However, if the same samples were observed with hundred-fold frequencies (1000, 500, 200, 100), then the hypothesis a^n b^n should become rather unlikely (in the absence of any additional samples, such as a^5 b^5, a^6 b^6, etc.). Indeed, our Bayesian learner will refrain from this generalization, due to the 100-fold increased loss in log likelihood. Note that Cook's criterion, which depends only on the relative frequencies, cannot distinguish these two situations.

[9] Thanks to Eugene Charniak for pointing out this reference, which seems to be less well-known and accessible than it deserves.
[10] The exact rationale for these measures is not entirely clear, as the complexities of strings are determined independently of the underlying model, which is inconsistent with the standard information-theoretic (and MDL) approach.
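The sample-size argument can be made concrete with a small numerical sketch. The code below is illustrative, not the thesis's implementation: it assumes the generalized grammar is S -> a S b | a b with its recursion probability estimated by maximum likelihood, and compares its log likelihood against a grammar that simply lists each observed string with its relative frequency. Because uniformly scaling all counts leaves both ML estimates unchanged, the likelihood loss incurred by generalizing grows exactly linearly with sample size, while any fixed prior preference for the compact grammar does not.

```python
import math

# Illustrative counts from the example: strings a^n b^n, n = 1..4,
# observed 10, 5, 2, 1 times respectively.
counts = {1: 10, 2: 5, 3: 2, 4: 1}

def loglik_specific(counts):
    """Log likelihood under a grammar that lists each observed string
    with its relative frequency (no generalization)."""
    n_total = sum(counts.values())
    return sum(c * math.log(c / n_total) for c in counts.values())

def loglik_general(counts):
    """Log likelihood under the generalized grammar S -> a S b | a b,
    assigning P(a^n b^n) = p^(n-1) * (1-p), with p estimated by ML."""
    recursions = sum((n - 1) * c for n, c in counts.items())
    total_uses = recursions + sum(counts.values())  # one terminating rule per string
    p = recursions / total_uses
    return sum(c * ((n - 1) * math.log(p) + math.log(1 - p))
               for n, c in counts.items())

def likelihood_loss(counts):
    """Nats of log likelihood given up by adopting the generalization."""
    return loglik_specific(counts) - loglik_general(counts)

small = likelihood_loss(counts)
large = likelihood_loss({n: 100 * c for n, c in counts.items()})
print(small, large)  # the loss at 100x the counts is exactly 100x larger
```

At the original counts the loss is under a nat, so a modest description-length prior advantage for the compact grammar can outweigh it; at hundred-fold counts the same prior advantage is swamped, which is exactly why the Bayesian learner refrains from generalizing there.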
