Statistical Language Models based on Neural Networks - Faculty of ...
words, but also additional symbols such as brackets and quotation marks. These follow simple patterns that cannot be described efficiently with n-gram models (for example, after a left bracket, a right bracket should be expected). Even with a limited RNNME-160 model, the entropy reduction over the KN5 model with no count cutoffs was about 7%: perplexity was reduced from 110 to 80. By combining the static and dynamic RNNME-160 models, perplexity decreased further to 58. It would not be surprising if a combination of several larger static and dynamic RNNME models trained on the MT corpora provided a perplexity reduction of over 50% against the KN5 model with no count cutoffs; it would be interesting to investigate this task in the future.
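The static and dynamic model combination above can be sketched as a linear interpolation of per-word probabilities, from which perplexity is computed. This is a minimal illustration with toy probabilities and an assumed interpolation weight, not the thesis's actual experimental setup:

```python
import math

def perplexity(probs):
    """Perplexity from per-word probabilities: 2^(-(1/N) * sum(log2 p))."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

def interpolate(p_a, p_b, lam=0.5):
    """Linear interpolation of two models' per-word probabilities."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_a, p_b)]

# Toy per-word probabilities from two hypothetical models
# (illustrative numbers only).
p_static  = [0.05, 0.10, 0.02, 0.08]
p_dynamic = [0.08, 0.06, 0.05, 0.04]

combined = interpolate(p_static, p_dynamic)
print(perplexity(p_static), perplexity(combined))
```

Interpolation helps when the models make complementary errors: a word assigned low probability by one model can be rescued by the other, which bounds the per-word log loss.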
7.2 Data Compression
Compression of text files is a problem almost equivalent to statistical language modeling. The objective of state-of-the-art text compressors is to predict the next character or word with as high a probability as possible, and to encode it using arithmetic coding. Decompression is then a symmetric process, where models are built while processing the decoded data in the same way as they were built during compression. Thus, the objective is simply to maximize P(w|h). An example of such a compression approach is the state-of-the-art compression program 'PAQ' by M. Mahoney [45, 46].
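The link between prediction and compression can be made concrete with a minimal float-precision sketch of model-driven arithmetic coding (a real coder works with integer intervals and adaptive models; the static toy model here is an assumption for illustration). The better the model predicts the next symbol, the less the interval shrinks per step, so fewer bits identify the message:

```python
def cumulative(model, symbol):
    """Return (low, high) cumulative-probability bounds of `symbol`."""
    lo = 0.0
    for s, p in model:
        if s == symbol:
            return lo, lo + p
        lo += p
    raise KeyError(symbol)

def encode(symbols, model):
    """Narrow [0,1) by each symbol's probability slice; return a point inside."""
    low, high = 0.0, 1.0
    for s in symbols:
        c_lo, c_hi = cumulative(model, s)
        span = high - low
        low, high = low + span * c_lo, low + span * c_hi
    return (low + high) / 2  # any number in [low, high) identifies the message

def decode(code, model, n):
    """Symmetric process: replay the same interval narrowing on the code."""
    out = []
    low, high = 0.0, 1.0
    for _ in range(n):
        target = (code - low) / (high - low)
        lo = 0.0
        for s, p in model:
            if lo <= target < lo + p:
                out.append(s)
                span = high - low
                low, high = low + span * lo, low + span * (lo + p)
                break
            lo += p
    return out

# Static toy model; a real compressor would update P(w|h) adaptively.
model = [("a", 0.6), ("b", 0.3), ("c", 0.1)]
msg = list("abac")
code = encode(msg, model)
assert decode(code, model, len(msg)) == msg
```

Because encoder and decoder apply identical model updates in the same order, an adaptive model needs no side information: the decoder reconstructs the model state from the already-decoded prefix, exactly the symmetry described above.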
The difference between language modeling and data compression is that the data are not known in advance and have to be processed on-line. As described in the previous chapters, training neural network models with a hidden layer requires several passes over the training data to reach top performance. Thus, it was not clear whether RNN LMs could be practically applied to data compression problems.
As data compression is not the topic of this thesis, I will not describe the experiments in great detail. However, it is worth observing that the compression ratio given by RNN LMs can be accurately estimated using the RNNLM toolkit: the size of a compressed text file is the size of the (possibly also compressed) vocabulary plus the entropy of the training data during the first training epoch (simply the average entropy per word times the number of words).
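The estimate just described reduces to simple arithmetic; a sketch with hypothetical numbers (the vocabulary size, entropy, and word count below are illustrative assumptions, not measured values):

```python
import math

def estimated_compressed_bytes(vocab_bytes, avg_entropy_bits_per_word, num_words):
    """Compressed-size estimate: vocabulary size plus first-epoch entropy
    (average entropy per word times number of words), converted to bytes."""
    data_bits = avg_entropy_bits_per_word * num_words
    return vocab_bytes + math.ceil(data_bits / 8)

# Hypothetical figures for illustration only.
print(estimated_compressed_bytes(vocab_bytes=50_000,
                                 avg_entropy_bits_per_word=9.5,
                                 num_words=1_000_000))
```

First-epoch entropy is the right quantity because, during on-line compression, each word is predicted by a model trained only on the preceding data, which is exactly the situation during the first training pass.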
For the further experiments, I have implemented arithmetic coding into a special version of the RNNLM toolkit, so the following results were obtained with a real RNN-based data compressor. As the class-based RNN architecture was used, first the class of the predicted word is encoded using arithmetic coding, and then the specific word, as the text data are