
[Figure: entropy per word on the WSJ test data (y-axis, 7.0 to 9.0) plotted against the number of training tokens (x-axis, 10^5 to 10^8, log scale); two curves, KN5 and KN5+RNN.]

Figure 5.1: Improvements with increasing amount of training data - WSJ (JHU setup). Note that the size of the hidden layer is tuned for optimal performance, and increases with the amount of the training data.

5.1.3 Conclusion of WSJ Experiments (JHU setup)

The possible improvements increase with more training data on this particular setup. This is a very positive result; the drawback is that with an increased amount of training data, such as billions of words, the computational complexity of RNN models becomes prohibitively large. However, we dealt with the computational complexity in the previous chapter, and it should be feasible to train good RNN models even on data sets with more than a billion words by using the class-based RNNME architecture.
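
To make the scaling argument concrete, the following minimal sketch (in Python, with illustrative names; the maximum entropy part of RNNME is omitted) shows the class-based factorization of the output layer, which replaces one softmax over the full vocabulary V with a softmax over roughly sqrt(V) classes plus one over the words of a single class:

    import numpy as np

    def class_factored_prob(h, W_class, W_word, word2class, class_members, w):
        # P(w | h) = P(class(w) | h) * P(w | class(w), h).
        # Normalizing over the classes and over the words of one class
        # reduces the output-layer cost from O(V) to roughly O(sqrt(V)).
        c = word2class[w]
        s_class = W_class @ h                   # one score per class
        p_class = np.exp(s_class - s_class.max())
        p_class /= p_class.sum()

        members = class_members[c]              # indices of the words in class c
        s_word = W_word[members] @ h            # scores only within class c
        p_word = np.exp(s_word - s_word.max())
        p_word /= p_word.sum()

        return p_class[c] * p_word[members.index(w)]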

Similarly to the experiments with the Penn Treebank Corpus, I tried to achieve the lowest possible perplexity. However, this time just two RNN LMs were used, and the combination of models included only static RNN LMs, dynamic RNN LMs (with a single learning rate α = 0.1) and a Kneser-Ney smoothed 5-gram model with a cache.
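
As an illustration, combining such models by linear interpolation amounts to mixing their per-word probabilities with weights tuned on held-out data; the probabilities and weights below are made up for the example:

    import math

    def interpolate(probs, weights):
        # Linearly interpolated probability of one word under several LMs.
        return sum(l * p for l, p in zip(weights, probs))

    # Hypothetical per-word probabilities from the three model types:
    # static RNN LM, dynamic RNN LM, KN5 with cache.
    p = interpolate([0.012, 0.015, 0.008], [0.4, 0.3, 0.3])
    print(-math.log2(p))   # this word's contribution to the per-word entropy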

The Good-Turing smoothed trigram has perplexity 246 on the test data; the best combination of models had perplexity 108, which is more than 56% lower (entropy reduction 15.0%). The 5-gram with modified Kneser-Ney smoothing has perplexity 212 on this task, so the combined result is 49% lower (entropy reduction 12.6%). Thus, although the combination experiments were much more restricted than in the case of PTB, the entropy reduction is still significant.
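
The entropy reductions quoted above follow directly from the perplexities, since per-word entropy is the base-2 logarithm of perplexity; a quick check:

    import math

    def entropy_reduction(ppl_baseline, ppl_model):
        # Relative entropy reduction: 1 - log2(PPL_model) / log2(PPL_baseline).
        return 1.0 - math.log2(ppl_model) / math.log2(ppl_baseline)

    print(entropy_reduction(246, 108))   # 0.1495... -> 15.0% vs. the GT trigram
    print(entropy_reduction(212, 108))   # 0.1259... -> 12.6% vs. the modified KN 5-gram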

