Table 7.5: Additional results on the Microsoft Sentence Completion Challenge task.

Model                 Accuracy [%]
filtered KN5          47.7
filtered RNNME-100    48.8
RNNME combination     55.4
n-gram models. These results are summarized in Table 7.5, where models trained on modified training data are denoted as filtered. Combination of RNNME models trained on the original and the filtered training data then provides the best result on this task so far, about 55% accuracy.
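The combination of models is typically realized by linearly interpolating the probabilities that the individual models assign to each candidate sentence. The sketch below illustrates this under the assumption that every model exposes a sentence log-probability; the score() interface, the function names and the weights are illustrative and not taken from this work.

```python
import math

def sentence_logprob(model, sentence):
    """Placeholder: log P(sentence) under a single language model (assumed interface)."""
    return model.score(sentence)

def interpolated_logprob(models, weights, sentence):
    """Linear interpolation of sentence probabilities, computed in log space
    (log-sum-exp) to avoid underflow for long sentences."""
    logs = [math.log(w) + sentence_logprob(m, sentence)
            for m, w in zip(models, weights)]
    top = max(logs)
    return top + math.log(sum(math.exp(x - top) for x in logs))

def pick_completion(models, weights, candidate_sentences):
    """Choose the candidate sentence (one per candidate word) with the
    highest interpolated probability."""
    return max(candidate_sentences,
               key=lambda s: interpolated_logprob(models, weights, s))
```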
As the task itself is very interesting and shows what language modeling research can focus on in the future, the next chapter will include some of my ideas on how good test sets for measuring the quality of language models should be created.
7.4 Speech Recognition of Morphologically Rich Languages
N-gram language models usually work quite well for English, but not so well for other languages. The reason is that for morphologically rich languages, the number of word units is much larger, as new words are formed easily using simple rules, for example by adding word endings. Having two or more separate sources of information (such as stem and ending) in a single token increases the number of parameters in n-gram models that have to be estimated from the training data. Thus, higher order n-gram models usually do not give much improvement. Another problem is that for these languages, much less training data is usually available.
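A toy example can make the growth in the number of parameters concrete; the stems and endings below are made up for illustration and are not drawn from any corpus used in this work.

```python
from itertools import product

stems = ["work", "play", "talk"]           # toy stems
endings = ["", "s", "ed", "ing", "er"]     # toy inflectional endings

# When every stem+ending combination is treated as a separate token, the
# vocabulary is multiplied, and an n-gram model has to estimate statistics
# over all of the resulting surface forms.
vocab = [stem + end for stem, end in product(stems, endings)]
print(len(vocab))        # 15 surface forms instead of 3 stems
print(len(vocab) ** 3)   # 3375 possible trigrams instead of 27 over stems
```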
To illustrate the problem, we have used the Penn Treebank Corpus as described in Chapter 4, and added two bits of random information to every token. This should increase the entropy of a model trained on these modified data by more than two bits, as it is not possible to revert the process (the information that certain words are similar has to be obtained just from the statistical similarity of their occurrence).
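A minimal sketch of this corruption is shown below; the file names and the exact way the two bits are attached to each token are assumptions made for the example, not the procedure used in the experiments.

```python
import random

random.seed(1)  # fixed seed so the corruption is reproducible

# Append two random bits (a value 0-3) to every token, so that each word is
# split into four variants with equal probability of occurrence.
with open("ptb.train.txt") as fin, open("ptb.train.2bits.txt", "w") as fout:
    for line in fin:
        corrupted = ["%s-%d" % (token, random.getrandbits(2))
                     for token in line.split()]
        fout.write(" ".join(corrupted) + "\n")
```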
As the n-gram models cannot perform any clustering, it must be expected that their performance will degrade significantly. On the other hand, RNN models can perform clustering well, thus the increase of entropy should be lower. Results with simple RNN models with the same architecture and KN5 models with no discounts are shown in Table 7.6.