The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 7. $n$-GRAMS FROM STOCHASTIC CONTEXT-FREE GRAMMARS

algorithm (cf. Sections 2.3.2, 4.2.2, 6.5.2). In the absence of human expertise the grammar induction methods of Chapter 4 or Chapter 3 may be used.²

There are good arguments that SCFGs are in principle not adequate probabilistic models for natural languages, due to the conditional independence assumptions they embody (Magerman & Marcus 1991; Jones & Eisner 1992b; Briscoe & Carroll 1993). The main criticisms are that production probabilities are independent of expansion context (e.g., whether a noun phrase is realized in subject or object position), and that lexical co-occurrences, as well as lexical/syntactical contingencies, cannot easily be represented, resulting in poor probabilistic estimates for these phenomena. Such shortcomings can be partly remedied by using SCFGs with very specific, semantically oriented categories and rules (Jurafsky et al. 1994b). If the goal is to use $n$-grams nevertheless, then their computation from a more constrained SCFG is still useful since the results can be interpolated with raw $n$-gram estimates for smoothing. An experiment illustrating this approach is reported below.

On the other hand, even if more sophisticated language models give better results, $n$-grams will most likely still be important in applications such as speech recognition. The standard speech decoding technique of frame-synchronous dynamic programming (Ney 1984) is based on a first-order Markov assumption, which is satisfied by bigram models (as well as by Hidden Markov Models), but not by more complex models incorporating non-local or higher-order constraints (including SCFGs). A standard approach is therefore to use simple language models to generate a preliminary set of candidate hypotheses. These hypotheses, e.g., represented as word lattices or $N$-best lists (Schwartz & Chow 1990), are re-evaluated later using additional criteria that can afford to be more costly due to the more constrained outcomes.
In this type of setting, the techniques developed here can be used to compile probabilistic knowledge encoded in the more elaborate language models into $n$-gram estimates that improve the quality of the hypotheses generated by the decoder.

Finally, comparing directly estimated, reliable $n$-grams with those compiled from other language models is a potentially useful method for evaluating the models in question.

For the purpose of this chapter we assume that computing $n$-grams from SCFGs is of either practical or theoretical interest and concentrate on the computational aspects of the problem.

It should be noted that there are alternative, unrelated methods for addressing the problem of the large parameter space in $n$-gram models. For example, Brown et al. (1992) describe an approach based on grouping words into classes, thereby reducing the number of conditional probabilities in the model. Dagan et al. (1994) explore similarities between words to interpolate bigram estimates involving words with similar syntagmatic distributions.

The technique of compiling higher-level grammatical models into lower-level ones is not entirely new: Zue et al. (1991) report building a word-pair grammar from more elaborate language models to achieve good coverage, by random generation of sentences. We essentially propose a solution for extending this approach to the probabilistic realm. The need for obtaining $n$-gram estimates from SCFGs originated in the BeRP speech understanding system already mentioned elsewhere in this thesis (Jurafsky et al. 1994a).

² This chapter describes an $n$-gram algorithm specifically for SCFGs. However, the methods described here are easily adapted to the simpler HMM case.
The previous solution to the problem was to estimate $n$-gram probabilities from the SCFG by counting on randomly generated artificial samples.

7.3 The Algorithm

7.3.1 Normal form for SCFGs

Unlike in other parts of this thesis, we cannot get around the need to normalize the grammar to Chomsky Normal Form (CNF). A CFG is in CNF if all productions are of the form

\[ X \rightarrow Y Z \quad \text{or} \quad X \rightarrow a, \]

where $X$, $Y$, $Z$ are nonterminals and $a$ is a terminal.

Any CFG structure can be converted into a weakly equivalent CNF grammar (Hopcroft & Ullman 1979), and in the case of SCFGs the probabilities can be assigned such that the string probabilities remain unchanged.³ Furthermore, parses in the original grammar can be reconstructed from corresponding CNF parses.

In short, we can, without loss of generality, assume that the SCFG in question is in CNF. The algorithm described here in fact generalizes to the more general Canonical Two-Form (Graham et al. 1980) format, and in the case of bigrams ($n = 2$) it can even be modified to work directly for arbitrary SCFGs. Still, the CNF form is convenient, and to keep the exposition simple we assume all SCFGs to be in CNF.

7.3.2 Probabilities from expectations

The first key insight towards a solution is that the $n$-gram probabilities can be obtained from the associated expected frequencies for $n$-grams and $(n-1)$-grams:

\[ P(w_n \mid w_1 \ldots w_{n-1}) = \frac{c(w_1 \ldots w_n \mid L)}{c(w_1 \ldots w_{n-1} \mid L)} \tag{7.1} \]

where $c(w \mid L)$ stands for the expected count of occurrences of the substring $w$ in a sentence of $L$.⁴

Proof. Write the expectation for $n$-grams recursively in terms of those of order $n-1$ and the conditional $n$-gram probabilities:

\[ c(w_1 \ldots w_n \mid L) = c(w_1 \ldots w_{n-1} \mid L) \, P(w_n \mid w_1 \ldots w_{n-1}). \]

Therefore, if we can compute $c(w \mid G)$ for all substrings $w$ of lengths $n$ and $n-1$ for a SCFG $G$, we immediately have an $n$-gram grammar for the language generated by $G$.

³ Preservation of string probabilities is trivial if the grammar has no null or unit productions. In cases where it does, an algorithm similar to the one in Section 6.4.7 can be used to update the probabilities.

⁴ The only counts appearing here are expectations, so we will not be using special notation to make a distinction between observed and expected values.
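The ratio of expected counts in Eq. 7.1 can be checked by brute force on a toy example. The sketch below is not the chapter's algorithm: it replaces the SCFG with a small, explicitly enumerated language whose sentence probabilities are given directly, so that expected substring counts can be summed sentence by sentence. The three-sentence distribution, the function names, and the boundary tokens `<s>` and `</s>` (added here so that every context word is followed by some token) are all assumptions made for the illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy language: sentence -> probability (must sum to 1).
TOY_LANGUAGE = {
    "a b": Fraction(1, 2),
    "a a b": Fraction(1, 4),
    "b": Fraction(1, 4),
}


def expected_ngram_counts(language, n):
    """Expected number of occurrences of each length-n substring per sentence.

    Sentences are padded with assumed boundary tokens <s> and </s>.
    """
    counts = defaultdict(Fraction)
    for sentence, prob in language.items():
        words = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - n + 1):
            # Each occurrence contributes the probability of its sentence.
            counts[tuple(words[i:i + n])] += prob
    return counts


def ngram_prob(language, ngram):
    """P(w_n | w_1 ... w_{n-1}) as the ratio of expected counts (Eq. 7.1)."""
    ngram = tuple(ngram)
    if len(ngram) < 2:
        raise ValueError("this sketch only handles n >= 2")
    numerator = expected_ngram_counts(language, len(ngram))[ngram]
    denominator = expected_ngram_counts(language, len(ngram) - 1)[ngram[:-1]]
    return numerator / denominator


if __name__ == "__main__":
    print(ngram_prob(TOY_LANGUAGE, ("a", "b")))  # 3/4
    print(ngram_prob(TOY_LANGUAGE, ("a", "a")))  # 1/4
```

In this toy language the word `a` is expected once per sentence, and it is followed by `b` with expected count 3/4 and by `a` with expected count 1/4, so the two conditional probabilities sum to one. Enumerating sentences is only feasible for finite toy languages; for an actual SCFG the expected counts have to be computed from the grammar itself.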