
Surprisingly, many research papers come with conclusions such as "Our model provides 2% improvement in perplexity over 3-gram with Good-Turing discounting and 0.3% reduction of WER, thus we have achieved new state of the art results." This is a clearly misleading statement. Thus, great care must be taken in the proper evaluation and comparison of techniques.

2.2 N-gram Models

The probability of a sequence of symbols (usually words) is computed using the chain rule as

P(w) = \prod_{i=1}^{N} P(w_i | w_1 \ldots w_{i-1})    (2.4)
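
For illustration (an example added here, not taken from the original text), the chain rule expands the probability of a three-word sentence as

P(\text{the cat sat}) = P(\text{the}) \, P(\text{cat} | \text{the}) \, P(\text{sat} | \text{the cat})

so the model only ever needs estimates of each word's probability given its preceding history.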

The most frequently used language models are based on n-gram statistics, which are basically word co-occurrence frequencies. The maximum likelihood estimate of the probability of word A in context H is then computed as

P(A|H) = \frac{C(HA)}{C(H)}    (2.5)
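
A minimal sketch of this estimate in Python (the function name, toy corpus, and counts below are illustrative assumptions, not taken from the thesis):

from collections import Counter

def mle_ngram_probs(sentences, order=3):
    """Estimate P(A|H) = C(HA) / C(H) for all n-grams of the given order."""
    ngram_counts = Counter()    # C(HA)
    context_counts = Counter()  # C(H)
    for sentence in sentences:
        words = sentence.split()
        for i in range(order - 1, len(words)):
            history = tuple(words[i - order + 1:i])
            ngram_counts[history + (words[i],)] += 1
            context_counts[history] += 1
    return {ngram: count / context_counts[ngram[:-1]]
            for ngram, count in ngram_counts.items()}

# Toy usage: trigram probabilities from two sentences.
probs = mle_ngram_probs(["the cat sat on the mat", "the cat ran"], order=3)
print(probs[("the", "cat", "sat")])   # C(the cat sat) / C(the cat) = 1 / 2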

where C(HA) is the number of times that the sequence of words HA has occurred in the training data. The context H can consist of several words; for the usual trigram models, |H| = 2. For H = ∅, the model is called a unigram model, and it does not take the history into account.

As many of these probability estimates are going to be zero (for all words that were not seen in the training data in a particular context H), smoothing needs to be applied. This works by redistributing probability mass between seen and unseen (zero-frequency) events, exploiting the fact that some estimates, mostly those based on single observations, are greatly overestimated. A detailed overview and empirical evaluation of common smoothing techniques can be found in [29].
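
To make the idea of redistributing probability mass concrete, the following sketch uses simple add-delta (additive) smoothing; this is a deliberately simpler technique than the Good-Turing and Kneser-Ney methods referenced in this section, and all names and counts are illustrative:

from collections import Counter

def additive_smoothed_prob(word, history, ngram_counts, context_counts,
                           vocab_size, delta=0.5):
    """Add-delta smoothing: every event, seen or unseen, gets
    (C(HA) + delta) / (C(H) + delta * |V|), so zero-frequency
    n-grams receive a small share of the probability mass."""
    return ((ngram_counts[history + (word,)] + delta)
            / (context_counts[history] + delta * vocab_size))

# Toy bigram counts kept in Counters, so missing keys count as zero.
ngram_counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
context_counts = Counter({("the",): 3})
print(additive_smoothed_prob("cat", ("the",), ngram_counts, context_counts, vocab_size=10))
print(additive_smoothed_prob("fish", ("the",), ngram_counts, context_counts, vocab_size=10))  # unseen, but nonzero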

The most important factors that influence the quality of the resulting n-gram model are the choice of the order and of the smoothing technique. In this thesis, we will report results obtained with the most popular variants: Good-Turing smoothing [34] and modified Kneser-Ney smoothing [36][29]. Modified Kneser-Ney smoothing (KN) is reported to consistently provide the best results among smoothing techniques, at least for word-based language models [24].
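
As a practical aside (not part of the original text), a trigram model with interpolated Kneser-Ney smoothing can be sketched with NLTK's lm module, assuming NLTK is installed; note that KneserNeyInterpolated uses a single fixed discount, so it is not identical to the modified Kneser-Ney variant cited above:

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus: each sentence is a list of tokens.
train_sentences = [["the", "cat", "sat", "on", "the", "mat"],
                   ["the", "dog", "sat"]]

# Build padded n-grams (up to trigrams) and the vocabulary.
train_ngrams, vocab = padded_everygram_pipeline(3, train_sentences)

lm = KneserNeyInterpolated(order=3)
lm.fit(train_ngrams, vocab)

# Probability of "sat" given the context "the cat".
print(lm.score("sat", ["the", "cat"]))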

The most significant advantages of models based on n-gram statistics are speed (prob-
