Statistical Language Models based on Neural Networks - Faculty of ...
than a sentence) will be called a "long-span model". An example of a long-span model is
a cache model, or a topic model.

Comparing the performance of a short-span model (such as a 4-gram LM) against a
combination of a short-span and a long-span model (such as 4-gram + cache) is very
popular in the literature, as it leads to large improvements in perplexity. However, the
reduction of the word error rate in speech recognition obtained by using long-span models
is usually quite small. As was mentioned previously, this is caused by the fact that
perplexity is commonly evaluated while assuming a perfect history, which is a false
assumption, as the history in speech recognition is typically very noisy¹. Typical examples
of such experiments are novel ways of computing cache-like models. Joshua Goodman's
report [24] is a good reference for those who are interested in more insight into criticism
of typical language modeling research.
To avoid these mistakes, the performance of individual models is reported and compared
to a modified Kneser-Ney smoothed 5-gram (which is basically the state-of-the-art among
n-gram models), and further compared to a combination of a 5-gram model with a
unigram cache model. After that, we report the results after using all models together,
with an analysis of which models provide the most complementary information in the
combination, and which models discover patterns that can be better discovered by other
techniques.
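A unigram cache model of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the exact model used in the experiments: the function name, the `alpha` smoothing constant, and the interface are assumptions made for the sketch.

```python
from collections import Counter

def unigram_cache_prob(word, history, vocab_size, alpha=0.1):
    # Sketch of a unigram cache model: a word's probability is boosted
    # if it already appeared in the recent history (the "cache").
    # `alpha` smooths toward a uniform distribution over the vocabulary,
    # so unseen words still receive nonzero probability.
    counts = Counter(history)
    total = len(history)
    cache_p = counts[word] / total if total > 0 else 0.0
    return (1 - alpha) * cache_p + alpha / vocab_size
```

In practice such a cache model is not used alone; its value comes from interpolation with an n-gram model, since recently seen words tend to reappear.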
4.2 Penn Treebank Dataset<br />
One of the most widely used data sets for evaluating the performance of statistical language
models is the Penn Treebank portion of the WSJ corpus (denoted here as the Penn Treebank
Corpus). It has been previously used by many researchers, with exactly the same data
preprocessing (the same training, validation and test data, and the same vocabulary limited
to 10K words). This is quite rare in the language modeling field, and allows us to directly
compare the performance of different techniques and their combinations, as many researchers
were kind enough to provide us their results for the following comparison. Combination
of the models is further done by using linear interpolation; for a combination of two models
M1 and M2 this means
P_M12(w|h) = λ P_M1(w|h) + (1 − λ) P_M2(w|h)    (4.1)
¹ Thanks to Dietrich Klakow for pointing this out.
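Equation 4.1 can be illustrated with a short sketch that interpolates two probability distributions over the same vocabulary (function and variable names are illustrative; the example distributions are invented for the demonstration):

```python
def interpolate(dist1, dist2, lam):
    # Linear interpolation of two model distributions, as in Eq. 4.1:
    #   P_M12(w|h) = lam * P_M1(w|h) + (1 - lam) * P_M2(w|h)
    # Both inputs must be distributions over the same vocabulary.
    assert set(dist1) == set(dist2)
    return {w: lam * dist1[w] + (1 - lam) * dist2[w] for w in dist1}

# Toy distributions for a single history h:
ngram = {"cat": 0.7, "dog": 0.2, "fish": 0.1}
cache = {"cat": 0.2, "dog": 0.5, "fish": 0.3}
mix = interpolate(ngram, cache, lam=0.6)
# mix["cat"] = 0.6 * 0.7 + 0.4 * 0.2 = 0.50
```

Note that if both inputs are valid probability distributions, the interpolated result sums to 1 for any λ in [0, 1]; λ is typically tuned on validation data.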