12.07.2015 Views

The dissertation of Andreas Stolcke is approved: University of ...

Chapter 7. N-grams from Stochastic Context-Free Grammars

The previous solution to the problem was to estimate $n$-gram probabilities from the SCFG by counting on randomly generated artificial samples.

7.3 The Algorithm

7.3.1 Normal form for SCFGs

Unlike in other parts of this thesis, we cannot get around the need to normalize the grammar to Chomsky Normal Form (CNF). A CFG is in CNF if all productions are of the form

\[ X \rightarrow Y Z \quad \text{or} \quad X \rightarrow a, \]

where $X$, $Y$, $Z$ are nonterminals and $a$ is a terminal.

Any CFG structure can be converted into a weakly equivalent CNF grammar (Hopcroft & Ullman 1979), and in the case of SCFGs the probabilities can be assigned such that the string probabilities remain unchanged. [3] Furthermore, parses in the original grammar can be reconstructed from corresponding CNF parses.

In short, we can, without loss of generality, assume that the SCFG in question is in CNF. The algorithm described here in fact generalizes to the more general Canonical Two-Form (Graham et al. 1980) format, and in the case of bigrams ($n = 2$) it can even be modified to work directly for arbitrary SCFGs. Still, the CNF form is convenient, and to keep the exposition simple we assume all SCFGs to be in CNF.

7.3.2 Probabilities from expectations

The first key insight towards a solution is that the $n$-gram probabilities can be obtained from the associated expected frequencies for $n$-grams and $(n-1)$-grams:

\[ P(w_n \mid w_1 w_2 \ldots w_{n-1}) = \frac{c(w_1 \ldots w_n \mid L)}{c(w_1 \ldots w_{n-1} \mid L)} \tag{7.1} \]

where $c(w \mid L)$ stands for the expected count of occurrences of the substring $w$ in a sentence of $L$. [4]

Proof. Write the expectation for $n$-grams recursively in terms of those of order $n-1$ and the conditional $n$-gram probabilities:

\[ c(w_1 \ldots w_n \mid L) = c(w_1 \ldots w_{n-1} \mid L) \, P(w_n \mid w_1 w_2 \ldots w_{n-1}). \]

Therefore, if we can compute $c(w \mid G)$ for all substrings $w$ of lengths $n$ and $n-1$ for a SCFG $G$, we immediately have an $n$-gram grammar for the language generated by $G$.

[3] Preservation of string probabilities is trivial if the grammar has no null or unit productions. In cases where it does, an algorithm similar to the one in Section 6.4.7 can be used to update the probabilities.

[4] The only counts appearing here are expectations, so we will not be using special notation to make a distinction between observed and expected values.
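The relationship in (7.1) can be illustrated on a language small enough to enumerate. The following is a minimal sketch, not the algorithm developed in this chapter: it obtains the expected counts by brute-force summation over an explicitly listed sentence distribution, whereas the point of the chapter is to compute the same expectations directly from the SCFG. The names `expected_counts`, `ngram_prob`, `toy_language`, and the end-of-sentence marker `END` are all hypothetical and introduced only for this illustration. Appending an end marker is an assumption of the sketch; it ensures that every occurrence of a context is followed by some token.

```python
from collections import defaultdict
from fractions import Fraction

END = "</s>"  # hypothetical end-of-sentence marker (an assumption of this sketch)


def expected_counts(language, n):
    """Return the expected count c(w | L) of every n-token substring w.

    `language` maps each sentence (a tuple of tokens) to its probability.
    Each occurrence of w in a sentence contributes that sentence's
    probability to the expectation.
    """
    counts = defaultdict(Fraction)  # missing substrings default to 0
    for sentence, prob in language.items():
        tokens = sentence + (END,)
        for i in range(len(tokens) - n + 1):
            counts[tokens[i:i + n]] += prob
    return counts


def ngram_prob(language, ngram):
    """P(w_n | w_1 ... w_{n-1}) as a ratio of expected counts, for n >= 2.

    Raises ZeroDivisionError if the context never occurs in the language.
    """
    n = len(ngram)
    numerator = expected_counts(language, n)[ngram]
    denominator = expected_counts(language, n - 1)[ngram[:-1]]
    return numerator / denominator


# Toy finite language with exact rational probabilities (illustrative data).
toy_language = {
    ("a", "b"): Fraction(1, 2),
    ("a", "a", "b"): Fraction(1, 4),
    ("b",): Fraction(1, 4),
}

p_b_given_a = ngram_prob(toy_language, ("a", "b"))
p_a_given_a = ngram_prob(toy_language, ("a", "a"))
print(p_b_given_a, p_a_given_a)
```

Because the sketch uses exact fractions, the conditional probabilities for a given context can be checked to sum to one: here the context `a` is followed only by `a` or `b`, and the two ratios add up to 1.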
