The computation of prefix probabilities for SCFGs is generally useful for applications and has been solved with the LRI algorithm (Jelinek & Lafferty 1991). In Chapter 6 we have seen how this computation can be carried out efficiently for sparsely parameterized SCFGs using a probabilistic version of Earley's parser. Computing suffix probabilities is obviously a symmetrical task; for example, one could create a 'mirrored' SCFG (reversing the order of right-hand side symbols in all productions) and then run any prefix probability computation on that mirror grammar.

Note that in the case of bigrams, only a particularly simple form of prefix/suffix probabilities is required, namely, the 'left-corner' and 'right-corner' probabilities, each of which can be obtained from a single matrix inversion (Jelinek & Lafferty 1991), corresponding to the left-corner matrix used in the probabilistic Earley parser (as well as the corresponding right-corner matrix).

Finally, it is interesting to compare the relative ease with which one can solve the substring expectation problem to the seemingly similar problem of finding substring probabilities: the probability that a nonterminal X generates (one or more instances of) a substring x. The latter problem is studied by Corazza et al. (1991), and shown to lead to a non-linear system of equations. The crucial difference here is that expectations are additive with respect to the cases in Figure 7.1, whereas the corresponding probabilities are not, since the three cases can occur in the same string.

7.3.5 n-grams containing string boundaries

A complete n-gram grammar includes strings delimited by a special marker denoting the beginning and end of a string. In Section 2.2.2 we introduced the symbol '$' for this purpose.

To generate expectations for (n - 1)-grams adjoining the string boundaries, the original SCFG is augmented with a new top-level production

    S' -> $ S $

where S is the old start symbol and S' becomes the start symbol of the augmented grammar. The algorithm is then simply applied to the augmented grammar to give the desired n-gram probabilities including the '$' marker.

7.4 Efficiency and Complexity Issues

Summarizing from the previous section, we can compute any n-gram probability by solving two linear systems of equations of the form (7.3): one with x being the n-gram itself and one for the (n - 1)-gram prefix x_1 ... x_{n-1}. The latter computation can be shared among all n-grams with the same prefix, so that essentially one system needs to be solved for each n-gram we are interested in. The good news here is that the work required is linear in the number of n-grams, and correspondingly limited if one needs probabilities for only a subset of all possible n-grams. For example, one could compute these probabilities on demand and cache the results.
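To make the two grammar transformations described above concrete, the following is a minimal sketch (not part of the dissertation), assuming a toy encoding of an SCFG as a list of (left-hand side, right-hand side, probability) triples; the function names and the toy grammar are hypothetical.

def mirror(productions):
    """Reverse every right-hand side, so that suffix probabilities of the
    original grammar become prefix probabilities of the mirrored grammar."""
    return [(lhs, tuple(reversed(rhs)), p) for (lhs, rhs, p) in productions]

def augment_with_boundaries(productions, start, new_start="S'", marker="$"):
    """Add the top-level production S' -> $ S $ so the same substring
    expectation algorithm also yields n-grams containing the '$' marker."""
    return [(new_start, (marker, start, marker), 1.0)] + list(productions)

# Toy SCFG: S -> a b (0.7) | S a (0.3)
grammar = [("S", ("a", "b"), 0.7), ("S", ("S", "a"), 0.3)]
print(mirror(grammar))                        # S -> b a | a S
print(augment_with_boundaries(grammar, "S"))  # adds S' -> $ S $

Note that mirroring leaves the rule probabilities untouched, which is why prefix probabilities computed on the mirrored grammar coincide with suffix probabilities of the original.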
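The on-demand strategy mentioned at the end of Section 7.4 can be sketched in the same spirit. In this illustration, solve_expectation is a hypothetical stand-in for solving one linear system of the form (7.3), and an n-gram probability is formed as a ratio of cached substring expectations, so the (n - 1)-gram prefix solution is reused across all n-grams sharing that prefix.

from functools import lru_cache

def make_ngram_model(solve_expectation):
    # solve_expectation(substring) is assumed to solve one linear system of
    # the form (7.3) and return the expected count of `substring`.
    cached = lru_cache(maxsize=None)(solve_expectation)

    def ngram_probability(ngram):
        # Probability of the last symbol given the (n-1)-gram prefix, as a
        # ratio of expectations; the prefix expectation is cached and shared
        # by every n-gram with that prefix.
        return cached(tuple(ngram)) / cached(tuple(ngram[:-1]))

    return ngram_probability

# Toy usage with a dummy solver (not a real SCFG computation):
model = make_ngram_model(lambda s: 1.0 / (1 + len(s)))
print(model(("a", "b", "c")))  # 0.75 under the dummy solver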
