12.07.2015 Views

The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 2. FOUNDATIONS

The number of parameters in n-gram models grows exponentially with n, and only the cases n = 2 (bigram models) and n = 3 (trigram models) are of practical importance. Bigram and trigram models are popular for various applications, especially speech decoding (Ney 1984), to approximate the true distributions of language elements (characters, words, etc.), which are known to violate the independence assumption embodied in (2.1).

Because (2.1) is essentially a truncated version of the true joint probability given by (2.2), n-grams are in some sense a natural choice for this approximation, and are appropriate to the extent that symbol occurrences tend to be more and more independent as the distance between the occurrences increases. Of course, there are important cases in natural language and elsewhere where this assumption is blatantly wrong. For example, the distribution of lexical elements in natural languages is constrained by phrase structures that can relate two (or more) words over essentially arbitrary distances. This is the main motivation for moving, at a minimum, to the stochastic context-free models that are one of the main subjects of this thesis.

2.2.3 Probabilistic grammars as random string generators

One can always abstractly associate a probabilistic language with a corresponding random string generator, i.e., a device that generates strings stochastically according to the distribution that the language defines. However, a probabilistic grammar usually describes the language by making this generator concrete. For example, for n-grams a generator would output string symbols left-to-right, always choosing the next symbol according to a probability lookup table indexed by the (n-1)-tuple of previous symbols ...
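
The left-to-right generator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's own formulation: the names `generate`, `Table` and `BIGRAM`, the padding marker `START`, the end marker `END`, and the `max_len` cutoff are all hypothetical, since the text here does not say how a string begins or ends.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical boundary markers (assumptions, not taken from the text).
START = "<s>"
END = "</s>"

# Lookup table: (n-1)-tuple of previous symbols -> distribution over next symbol.
Table = Dict[Tuple[str, ...], Dict[str, float]]


def generate(table: Table, n: int, rng: random.Random, max_len: int = 50) -> List[str]:
    """Emit symbols left to right, sampling each one from the table row
    selected by the (n-1) most recent symbols."""
    context = (START,) * (n - 1)  # pad the initial context
    output: List[str] = []
    while len(output) < max_len:
        dist = table[context]
        symbols = list(dist)
        weights = [dist[s] for s in symbols]
        nxt = rng.choices(symbols, weights=weights, k=1)[0]
        if nxt == END:
            break
        output.append(nxt)
        # Slide the context window: drop the oldest symbol, append the newest.
        context = (context + (nxt,))[1:]
    return output


# Toy bigram table (n = 2) over the symbols "a" and "b"; probabilities are made up.
BIGRAM: Table = {
    (START,): {"a": 1.0},
    ("a",): {"b": 0.5, END: 0.5},
    ("b",): {"a": 1.0},
}

if __name__ == "__main__":
    print(" ".join(generate(BIGRAM, n=2, rng=random.Random(0))))
```

With this toy table every generated string alternates "a" and "b", starting with "a". The table needs one row per possible (n-1)-tuple of symbols, which is where the exponential growth in the number of parameters comes from.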
