ALP can be used to obtain prior probabilities of any sequential data, thus it provides a theoretical solution to statistical language modeling. As mentioned before, ALP is not computable (because of the halting problem); however, it is mentioned here to justify our later experiments with model combination. Different language modeling techniques can be seen as individual components in eq. 2.2, where instead of using the description length of individual models for normalization, we use the performance of the model on some validation data to obtain its weight². More details about concepts such as ALP and Minimum Description Length (MDL) will be given in Chapter 8.
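As a simple illustration of this kind of model combination, the following sketch (with hypothetical function names, restricted to two component models for brevity) chooses the interpolation weight by minimizing cross-entropy on held-out validation data, rather than deriving it from description lengths:

    import math

    def interpolate(models, weights, history, word):
        # Probability of `word` under a weighted mixture of component models.
        # Each model is a callable returning a (nonzero) p(word | history).
        return sum(w * m(history, word) for m, w in zip(models, weights))

    def tune_weight(model_a, model_b, validation, steps=20):
        # Grid search over the interpolation weight of two models, keeping
        # the weight that minimizes cross-entropy (bits per word) on the
        # validation data - the "performance on validation data" that is
        # used here in place of a description length.
        best_w, best_h = None, float("inf")
        for i in range(steps + 1):
            w = i / steps
            h = 0.0
            for history, word in validation:
                p = interpolate((model_a, model_b), (w, 1.0 - w), history, word)
                h -= math.log2(p)
            h /= len(validation)
            if h < best_h:
                best_w, best_h = w, h
        return best_w, best_h

The same idea extends to more than two components; the grid search is only the simplest possible way to fit the weights.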
Another work worth mentioning was done by Mahoney [44], who has shown that the problem of finding the best model of data is actually equivalent to the problem of general data compression. Compression can be seen as two problems: data modeling and coding. Since coding is solved optimally by arithmetic coding, data compression can be seen as just a data modeling problem. Mahoney, together with M. Hutter, also organizes a competition whose aim is to reach the best possible compression results on a given data set (mostly containing Wikipedia text), known as the Hutter Prize competition. As compression of text is almost equivalent to the language modeling task, I follow the same idea and try to reach the best achievable results on a single well-known data set, the Penn Treebank Corpus, where it is possible to compare (and combine) results of techniques developed by several other researchers.
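To make the link between modeling and compression explicit, the following sketch (the model interface is an assumption made here, and the small bookkeeping overhead of a real arithmetic coder is ignored) computes the number of bits an ideal arithmetic coder would need, which is simply the negative log2 probability that the model assigns to the data:

    import math

    def ideal_code_length_bits(model, text):
        # Number of bits an ideal arithmetic coder would need to encode `text`,
        # given a model returning p(next symbol | preceding symbols): each
        # symbol costs -log2 of its predicted probability, so the compressed
        # size is determined entirely by the quality of the model.
        total = 0.0
        for i, symbol in enumerate(text):
            p = model(text[:i], symbol)
            total += -math.log2(p)
        return total

    # A hypothetical uniform model over 27 characters compresses nothing:
    uniform = lambda history, symbol: 1.0 / 27
    print(ideal_code_length_bits(uniform, "hello world"))  # 11 * log2(27), about 52.3 bits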
An important drawback of perplexity is that it obscures achieved improvements. Usually, improvements in perplexity are measured as a percentage decrease over the baseline value, which is a misleading but widely accepted practice. In Table 2.1, it is shown that a constant relative perplexity improvement translates to different entropy reductions. For example, it will be shown in Chapter 7 that advanced LM techniques provide similar relative reductions of entropy for word-based and character-based models, while a perplexity comparison would fail completely in such a case. Thus, perplexity results will be reported as a convenient measure for quick comparison, but improvements will be mainly reported using entropy.
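The following sketch, using hypothetical perplexity values rather than the numbers from Table 2.1, shows why: the same 10% perplexity reduction corresponds to a much smaller relative entropy reduction for a typical word-level model than for a character-level one:

    import math

    def entropy(ppl):
        # Per-token entropy in bits corresponding to a given perplexity.
        return math.log2(ppl)

    # Hypothetical baseline perplexities chosen only for illustration:
    for name, base_ppl in [("word-level", 140.0), ("character-level", 3.0)]:
        new_ppl = base_ppl * 0.9                    # the same "10% perplexity improvement"
        dh = entropy(base_ppl) - entropy(new_ppl)   # absolute entropy reduction in bits
        rel = 100.0 * dh / entropy(base_ppl)        # relative entropy reduction
        print(f"{name}: {dh:.3f} bits, {rel:.1f}% relative entropy reduction")
    # word-level: 0.152 bits, 2.1% relative entropy reduction
    # character-level: 0.152 bits, 9.6% relative entropy reduction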
² It can be argued that since most of the models commonly used in language modeling are not Turing-complete - such as finite state machines - using the description length of these models would be inappropriate.