
than a sentence) will be called a “long-span model”. Examples of long-span models are cache models and topic models.
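To make the idea concrete, the following is a minimal, hypothetical sketch of a unigram cache model: it estimates P_cache(w|h) as the relative frequency of w among the most recent words of the history. The class name, parameter values and methods are illustrative assumptions, not taken from this thesis.

```python
from collections import Counter, deque

class UnigramCache:
    """Toy unigram cache LM: P_cache(w | h) is the relative frequency of w
    among the last `capacity` words of the history (capacity is an assumed value)."""

    def __init__(self, capacity=200):
        self.history = deque(maxlen=capacity)
        self.counts = Counter()

    def prob(self, word):
        # Relative frequency of `word` in the cached history; 0 if the cache is empty.
        if not self.history:
            return 0.0
        return self.counts[word] / len(self.history)

    def update(self, word):
        # If the cache is full, the append below evicts the oldest word,
        # so its count must be decremented first.
        if len(self.history) == self.history.maxlen:
            self.counts[self.history[0]] -= 1
        self.history.append(word)
        self.counts[word] += 1
```

In evaluation, such a cache probability would typically not be used alone, but interpolated with an n-gram model as in Eq. (4.1) below.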

Comparison of the performance of a short-span model (such as a 4-gram LM) against a combination of a short-span and a long-span model (such as 4-gram + cache) is very popular in the literature, as it leads to large improvements in perplexity. However, the reduction of word error rate in speech recognition obtained with long-span models is usually quite small. As was mentioned previously, this is caused by the fact that perplexity is commonly evaluated while assuming a perfect history, which is a false assumption, as the history in speech recognition is typically very noisy¹. Typical examples of such experiments are various novel ways of computing cache-like models. Joshua Goodman's report [24] is a good reference for those interested in a deeper criticism of typical language modeling research.

To avoid these mistakes, the performance of each individual model is reported and compared to a modified Kneser-Ney smoothed 5-gram (which is essentially the state of the art among n-gram models), and further compared to a combination of the 5-gram model with a unigram cache model. After that, we report the results obtained by using all models together, with an analysis of which models provide the most complementary information in the combination, and which models discover patterns that can be better discovered by other techniques.

4.2 Penn Treebank Dataset<br />

One of the most widely used data sets for evaluating the performance of statistical language models is the Penn Treebank portion of the WSJ corpus (denoted here as the Penn Treebank Corpus). It has been used previously by many researchers with exactly the same data preprocessing (the same training, validation and test data, and the same vocabulary limited to 10K words). This is quite rare in the language modeling field, and it allows us to directly compare the performance of different techniques and their combinations, as many researchers were kind enough to provide us their results for the following comparison. Combination of the models is done by using linear interpolation; for a combination of two models M1 and M2, this means

P_{M_{12}}(w|h) = \lambda P_{M_1}(w|h) + (1 - \lambda) P_{M_2}(w|h)    (4.1)

¹ Thanks to Dietrich Klakow for pointing this out.
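As a minimal sketch of Eq. (4.1), assuming each model exposes a precomputed probability for the same word and history (the function and variable names are illustrative, not from this thesis):

```python
def interpolate(p_m1, p_m2, lam):
    """Linear interpolation of two language model probabilities, Eq. (4.1)."""
    return lam * p_m1 + (1.0 - lam) * p_m2

# Illustrative numbers only: combine a 5-gram probability with a cache probability.
p_combined = interpolate(p_m1=0.012, p_m2=0.045, lam=0.8)
print(p_combined)  # 0.8 * 0.012 + 0.2 * 0.045 = 0.0186
```

The value of the weight λ is not specified here; in practice it would be tuned on the validation data, for example to minimize perplexity.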

