The dissertation of Andreas Stolcke is approved: University of ...

CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

The criterion used for model evaluation is the posterior of the model structure, as discussed in Section 2.5.7. It is approximated using the same Viterbi method as described for HMMs (Section 3.4.3). This again has the advantage that the posterior is decomposed into a product of terms, one for each grammar nonterminal. Changes to the posterior are computed efficiently by recomputing just the terms pertaining to nonterminals affected by the merging or chunking operation.

4.3.5 Search strategies

The question of how to search efficiently for good sequences of merging operators becomes more pressing in the case of SCFGs. The main reason is that the introduction of the new operator, chunking, creates a more complex topology in the search space. In addition, the evaluation of a chunking step is not directly comparable to merging, as chunking does not have a generalizing effect on the grammar (it can only affect the prior contribution to the posterior).

We have experimented with several search strategies for SCFG learning, discussed below. Clearly more sophisticated ones are possible, and await further study.

Best-first search. This is the straightforward extension of our approach to merging with HMMs. All operator types and application instances are pooled for the purpose of comparison, and at each step the locally best one is chosen. This is combined with the simple linear look-ahead extension described in Section 3.4.5 to help overcome local maxima.

This simple approach often fails because chunking typically has to be followed by several merging steps to produce an overall improvement. The look-ahead feature often doesn’t help here as other chunks get in the way between a chunking step and the ‘right’ successive merging choices.

Multi-level best-first search. One possible solution to the above problem is to make the search procedure aware of the different nature of the two operators, by constraining the way in which they interact.
Empirically, the following simple extension of the best-first paradigm seems to work generally well for many SCFGs.

The basic idea is that the search operates on two distinct levels, associated with merging and chunking, respectively. Search at the merging level consists of a best-first sequence of merging steps (with look-ahead). Search at the second level chooses the locally best chunking step, and then proceeds with a search at level 1. (Clearly, this approach could be generalized to any number of search levels.)

Notice that in this approach, the chunking steps are not evaluated by trying an exhaustive sequence of merges following each possible choice. This would entail an overhead that is quite significant even in small cases.

Beam search. In a beam search the locality of the search is relaxed by considering a pool of relatively good models simultaneously, rather than only a single one as in best-first search. In Section 3.3 we remarked that beam search for HMMs seems to only very rarely give worthwhile improvements over the best-first approach.
However, beam search can produce significantly improved search results for many SCFGs, precisely because of the interaction between the different search operators and the impact this has on the proper evaluation of choices. If the beam width is made sufficiently large, the effects of combinations of merging and chunking will be assessed correctly, even if the two types of operators are not treated specially. (Adding a multi-level approach here might result in further improvements to efficiency and/or results, but hasn’t been investigated yet.)

For concreteness we give a brief account of the beam search algorithm used, especially since the open-ended nature of the search (absence of a goal state) makes it different from standard beam search algorithms found in the literature. The beam is a list of nodes ordered by descending evaluation score. In our case, a node corresponds to a model (grammar), and the evaluation function is its posterior probability. Nodes can be either expanded or unexpanded, depending on whether successor nodes have been generated from them, using the search operators. When a node is expanded its descendants are inserted into the beam as unexpanded nodes, unless they are already found there. A single step in the beam search consists of expanding all unexpanded nodes in the current beam, using all available operators.

The scope of the beam search is determined by two parameters: The beam depth is the maximum total number of nodes in the beam, whereas the beam width gives the number of unexpanded nodes allowed in the beam. During expansion of the beam, low-scoring nodes are truncated from the beam so as to not exceed either width or depth. (Alternatively, one may also limit nodes in the beam to those scoring within a certain tolerance of the current best node.)
The combination of conditions delimiting the elements of the beam is also known as the beam criterion.

The search terminates when no unexpanded nodes satisfy the beam criterion, i.e., only expanded nodes remain in the beam. The first, best-scoring node from the beam is returned as the result of the search.

Beam search as described here is a generalization of the (one-level) best-first search introduced in Section 3.4.5: a beam search of depth $n$ and width one is equivalent to a best-first search with $n$ steps of lookahead.

Search in grammar spaces raises the question of how the equivalence of two models should be determined efficiently. This is necessary to avoid duplicate models from crowding out worthwhile contenders in the beam. Duplicates are generated pervasively, as the same operators, such as merging, applied in different order often yield identical results. To address this problem for SCFGs and similar types of models we use a two-pronged approach. First, efficient methods for computing the posteriors of models, without necessarily computing the full models themselves, are applied, using incremental evaluation strategies as described in Section 3.4.3. If two models have different posterior probabilities they must be structurally different. Secondly, if necessary, we compute a hash function of the CFG structure, which is a pseudo-random number that depends only on the structure of the productions, but not on their order or the names of the nonterminals used. If two grammars yield the same hash code they are considered identical for the purposes of beam search. This leaves a small probability that a model might mistakenly be discarded.[6]

[6] If the hash function were optimal, that probability would be $2^{-28}$ in the current implementation.
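The open-ended beam search described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The names `Node`, `truncate`, `beam_search`, and the toy score table are hypothetical, and the caller-supplied `key` function merely stands in for the duplicate test on models. The sketch assumes a finite search space without cycles (as with merging, where models only shrink), which is what guarantees termination here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    model: object
    score: float
    expanded: bool = False


def truncate(beam, depth, width):
    """Keep the best-scoring nodes: at most `depth` nodes in total and
    at most `width` unexpanded nodes. Returns a list sorted by score."""
    kept, unexpanded = [], 0
    for node in sorted(beam, key=lambda n: n.score, reverse=True):
        if len(kept) >= depth:
            break
        if not node.expanded:
            if unexpanded >= width:
                continue  # width exceeded: drop this unexpanded node
            unexpanded += 1
        kept.append(node)
    return kept


def beam_search(initial, successors, score, key, depth, width):
    """Open-ended beam search: there is no goal state, so the search
    stops when only expanded nodes remain in the beam."""
    if depth < 1 or width < 1:
        raise ValueError("depth and width must be at least 1")
    beam = [Node(initial, score(initial))]
    while any(not n.expanded for n in beam):
        in_beam = {key(n.model) for n in beam}
        # One step: expand every currently unexpanded node.
        for node in [n for n in beam if not n.expanded]:
            node.expanded = True
            for child in successors(node.model):
                k = key(child)
                if k not in in_beam:  # skip models already in the beam
                    in_beam.add(k)
                    beam.append(Node(child, score(child)))
        beam = truncate(beam, depth, width)
    return beam[0]  # best-scoring node


# Toy search space (purely illustrative): models are list indices,
# and each model x has successors x + 1 and x + 2.
SCORES = [0, 3, 1, 2, 1, 6, 0, 0]


def toy_successors(x):
    return [y for y in (x + 1, x + 2) if y < len(SCORES)]


greedy = beam_search(0, toy_successors, lambda x: SCORES[x], lambda x: x,
                     depth=1, width=1)
wide = beam_search(0, toy_successors, lambda x: SCORES[x], lambda x: x,
                   depth=4, width=2)
print(greedy.model, wide.model)  # 1 5
```

In the toy space, the narrowest beam stops at the local maximum (model 1, score 3), while the larger beam carries lower-scoring intermediate models long enough to reach model 5 with score 6.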
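One way to obtain a hash code that ignores production order and nonterminal names is sketched below. The hash function actually used is not specified in the text, so this sketch substitutes a different, well-known technique: iterative colour refinement in the style of the Weisfeiler-Lehman test. All names (`grammar_hash`, `_digest`, `rounds`) are hypothetical. The sketch assumes productions are given as `(lhs, rhs)` pairs with string symbols, that nonterminals are exactly the symbols appearing on a left-hand side, and that the start symbol needs no special treatment.

```python
import hashlib


def _digest(obj):
    """Deterministic short hash of a printable Python structure."""
    return hashlib.sha256(repr(obj).encode()).hexdigest()[:16]


def grammar_hash(productions, rounds=3):
    """Hash a CFG so that the result depends on the structure of the
    productions, but not on their order or on nonterminal names.

    productions: iterable of (lhs, rhs) pairs, rhs a tuple of symbols.
    """
    productions = list(productions)
    nonterminals = {lhs for lhs, _ in productions}
    # Every nonterminal starts with the same colour; names are never hashed.
    color = {nt: "N" for nt in nonterminals}
    for _ in range(rounds):
        new_color = {}
        for nt in nonterminals:
            # Describe each production of nt by the colours of its
            # right-hand side; terminals are kept literally (prefixed).
            sigs = sorted(
                tuple(color.get(sym, "t:" + sym) for sym in rhs)
                for lhs, rhs in productions
                if lhs == nt
            )
            new_color[nt] = _digest((color[nt], sigs))
        color = new_color
    # Sorting removes any dependence on nonterminal identity or order.
    return _digest(sorted(color.values()))


g1 = [("S", ("NP", "VP")), ("NP", ("the", "N")),
      ("N", ("dog",)), ("VP", ("barks",))]
# Same grammar with nonterminals renamed and productions reordered.
g2 = [("X3", ("barks",)), ("X0", ("X1", "X3")),
      ("X2", ("dog",)), ("X1", ("the", "X2"))]
# Structurally different: right-hand side of the first rule is swapped.
g3 = [("S", ("VP", "NP")), ("NP", ("the", "N")),
      ("N", ("dog",)), ("VP", ("barks",))]

print(grammar_hash(g1) == grammar_hash(g2))  # True
print(grammar_hash(g1) == grammar_hash(g3))  # False
```

Equal hash codes do not prove that two grammars are identical: colour refinement cannot separate every pair of distinct structures, and the truncated digest can collide. As with any hash-based duplicate test, a distinct model could therefore occasionally be treated as a duplicate.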