The dissertation of Andreas Stolcke is approved: University of ...

CHAPTER 7. N-GRAMS FROM STOCHASTIC CONTEXT-FREE GRAMMARS

                      Training corpus   Test corpus
No. of sentences      2621              364
No. of words          16974             2208
Bigram vocabulary     1064
Bigram coverage       100%              77%
SCFG productions      1177
SCFG vocabulary       655
SCFG coverage         63%              51%

Table 7.1: BeRP corpora and language model statistics. Coverage is measured by the percentage of sentences parsed with non-zero probability by a given language model.

expectations and 108959 linear systems for bigram expectations. The process takes about 9 hours on a SPARCstation 10 using a non-optimized Lisp implementation.[6]

The experiments and results described below overlap with those reported in Jurafsky et al. (1994b).

In experiment 1, the recognizer used bigrams that were estimated directly from the training corpus, without any smoothing, resulting in a word error rate of 33.7%.

In experiment 2, a different set of bigram probabilities was used, computed from the context-free grammar, whose probabilities had previously been estimated from the same training corpus using standard EM techniques. This resulted in a word error rate of 32.9%. This may seem surprisingly good given the low coverage of the underlying CFGs, but notice that the conversion into bigrams is bound to result in a less constraining language model, effectively increasing coverage. For comparison purposes we also ran the same experiment with bigrams computed indirectly by Monte-Carlo sampling from the SCFG, using 200,000 samples.
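As a minimal sketch (not the thesis code), the same unsmoothed relative-frequency routine covers both experiment 1 (bigrams counted from the raw training sentences) and the Monte-Carlo baseline (the identical counting applied to sentences sampled from the SCFG); the function name and toy corpus below are illustrative only.

```python
from collections import defaultdict

def bigram_mle(sentences):
    """Estimate P(w2 | w1) by relative frequency, padding with <s>/</s> markers.
    No smoothing: unseen bigrams get zero probability, as in experiment 1."""
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            counts[w1][w2] += 1
    # Normalize each row of counts into a conditional distribution.
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}

# Toy corpus (illustrative, not BeRP data):
corpus = [["i", "want", "food"], ["i", "want", "cheap", "food"]]
probs = bigram_mle(corpus)
# P(want | i) = 1.0, P(cheap | want) = 0.5
```

For the Monte-Carlo comparison, `sentences` would instead be 200,000 samples drawn from the estimated SCFG; rare words the sampler never emits then receive zero counts, which is exactly the shortcoming the exact computation avoids.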
The result was slightly worse (33.3%), confirming that the precise computation has an inherent advantage, as it cannot omit words or constructions to which the SCFG assigns very low probability.

Finally, in experiment 3, the bigrams generated from the SCFG were augmented by those from the raw training data, in a proportion of 200,000 : 2500. We have not attempted to optimize this mixture proportion, e.g., by deleted interpolation (Jelinek & Mercer 1980).[7] With the bigram estimates thus obtained, the word error rate dropped to 29.6%, which represents a statistically significant improvement over experiments 1 and 2.

Table 7.2 summarizes these figures and also adds two more points of comparison: a pure SCFG language model and a mixture model that interpolates between bigram and SCFG. Notice that the latter case is different from experiment 3, where the language model used is a standard bigram, albeit one that was obtained by 'mixing' counts obtained both from the data and from the SCFG. The system referred to here, on

[6] One inefficiency is that the actual number of nonterminals (and hence the rank of the coefficient matrix) is 445, as the grammar is converted to the Simple Normal Form introduced in Chapter 4.

[7] This proportion comes about because in the original system, predating the method described here, bigrams had to be estimated from the SCFG by random sampling. Generating 200,000 sentence samples was found to give good converging estimates for the bigrams. The bigrams from the raw training sentences were then simply added to the randomly generated ones. We later verified that the bigrams estimated from the SCFG were indeed identical to the ones computed directly using the method described here.
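The count pooling of experiment 3 can be sketched as follows: two bigram count tables (one derived from the SCFG, scaled as if from 200,000 sentences; one from the roughly 2,500 raw training sentences) are simply added before normalizing. This is a hedged toy illustration, with all names and numbers below invented for the example, not taken from the thesis.

```python
from collections import defaultdict

def mix_counts(scfg_counts, corpus_counts):
    """Pool two bigram count tables, then renormalize each row into P(w2 | w1).
    Pooling raw counts is distinct from interpolating two probability models."""
    mixed = defaultdict(lambda: defaultdict(float))
    for table in (scfg_counts, corpus_counts):
        for w1, nxt in table.items():
            for w2, c in nxt.items():
                mixed[w1][w2] += c
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in mixed.items()}

# Toy tables: SCFG-derived counts dominate, mirroring the 200,000 : 2500 ratio
# in spirit (the actual counts are hypothetical).
scfg = {"want": {"food": 80.0, "cheap": 20.0}}
data = {"want": {"cheap": 10.0}}
probs = mix_counts(scfg, data)
# P(food | want) = 80/110, P(cheap | want) = 30/110
```

The mixture model of Table 7.2, by contrast, would interpolate the two models' probabilities at query time rather than merging their counts into a single bigram table.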
