ALP can be used to obtain prior probabilities of any sequential data, thus it provides a theoretical solution to statistical language modeling. As mentioned before, ALP is not computable (because of the halting problem); however, it is mentioned here to justify our later experiments with model combination. Different language modeling techniques can be seen as individual components in eq. 2.2, where instead of using the description length of the individual models for normalization, we use the performance of each model on some validation data to obtain its weight². More details about concepts such as ALP and Minimum Description Length (MDL) will be given in Chapter 8.
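One way to make this combination concrete is linear interpolation of the component models, with the mixture weights estimated to maximize the likelihood of the validation data. The following sketch only illustrates that idea; the EM-style weight estimation and the toy probabilities are my own and are not taken from eq. 2.2:

def em_interpolation_weights(component_probs, num_iters=20):
    """component_probs: one entry per validation token, each a list of the
    probabilities that every component LM assigned to the true token."""
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(num_iters):
        expected = [0.0] * k
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for i in range(k):
                expected[i] += weights[i] * probs[i] / mix  # responsibility of model i
        total = sum(expected)
        weights = [e / total for e in expected]
    return weights

# Toy example: per-token probabilities from two hypothetical LMs on validation data.
validation = [[0.10, 0.02], [0.05, 0.20], [0.30, 0.15], [0.01, 0.08]]
print(em_interpolation_weights(validation))  # weights that maximize validation likelihood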

Another work worth mentioning was done by Mahoney [44], who has shown that the problem of finding the best models of data is equivalent to the problem of general data compression. Compression can be seen as two problems: data modeling and coding. Since coding is optimally solved by arithmetic coding, data compression can be seen simply as a data modeling problem. Mahoney, together with M. Hutter, also organizes a competition with the aim of reaching the best possible compression results on a given data set (mostly containing Wikipedia text), known as the Hutter Prize competition. As compression of text is almost equivalent to the language modeling task, I follow the same idea and try to reach the best achievable results on a single well-known data set, the Penn Treebank Corpus, where it is possible to compare (and combine) results of techniques developed by several other researchers.
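To see why coding adds nothing beyond modeling, note that an idealized arithmetic coder spends about -log2 P(w | history) bits on each token, so the total code length equals the negative log2-likelihood that the model assigns to the text. A minimal sketch of this relation, with a toy unigram model standing in for an actual language model:

import math
from collections import Counter

text = "the cat sat on the mat".split()
counts = Counter(text)
model = {w: c / len(text) for w, c in counts.items()}  # toy unigram LM

bits = sum(-math.log2(model[w]) for w in text)         # ideal arithmetic-code length
print(f"{bits:.2f} bits total, {bits / len(text):.2f} bits per word")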

An important drawback of perplexity is that it obscures achieved improvements. Usually, improvements of perplexity are measured as a percentage decrease over the baseline value, which is a mistaken but widely accepted practice. In Table 2.1, it is shown that a constant relative perplexity improvement translates to different entropy reductions. For example, it will be shown in Chapter 7 that advanced LM techniques provide similar relative reductions of entropy for word-based and character-based models, while a perplexity comparison would completely fail in such a case. Thus, perplexity results will be reported as a good measure for quick comparison, but improvements will mainly be reported using entropy.
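A small numeric illustration of this effect (the baseline values here are made up): the same 20 % relative perplexity reduction corresponds to very different relative entropy reductions at word-level and character-level scales.

import math

for baseline_ppl in (150.0, 5.0):            # roughly word-level vs. character-level scale
    improved_ppl = 0.8 * baseline_ppl        # a fixed 20 % perplexity reduction
    h0, h1 = math.log2(baseline_ppl), math.log2(improved_ppl)
    print(f"PPL {baseline_ppl:6.1f} -> {improved_ppl:6.1f}: entropy {h0:.2f} -> {h1:.2f} bits, "
          f"relative reduction {100 * (h0 - h1) / h0:.1f} %")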

² It can be argued that since most of the models commonly used in language modeling are not Turing-complete (such as finite state machines), using the description length of these models would be inappropriate.
