
Figure 6.9: Improvements over the KN4 model obtained with RNN and RNNME models with increasing size of the hidden layer. (The plot shows entropy reduction per word over KN4, in bits, against the hidden layer size, with curves for RNN+KN4 and RNNME+KN4.)
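The quantity on the vertical axis of Figure 6.9, the per-word entropy reduction over KN4 in bits, follows directly from the perplexities of the two models, since the cross-entropy per word equals the base-2 logarithm of the perplexity. A minimal sketch of the computation (the function name is illustrative, and the example values simply reuse the two perplexities mentioned below):

```python
import math

def entropy_reduction_bits(ppl_baseline, ppl_model):
    """Per-word entropy difference (in bits) between a model and a baseline.

    Entropy per word equals log2(perplexity), so the reduction is the
    difference of the two logarithms; negative values mean the model has
    lower entropy than the baseline, matching the negative scale used in
    Figure 6.9.
    """
    return math.log2(ppl_model) - math.log2(ppl_baseline)

# Illustration: a model with perplexity 117 against a baseline with
# perplexity 131 gives roughly -0.16 bits per word.
print(entropy_reduction_bits(131, 117))
```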

decreased from 131 to 117. The training progress of the RNN, the RNNME-40 with a 1G hash and the new RNNME-40 with an 8G hash is shown in Figure 6.10. Unfortunately, we were not able to run new lattice rescoring experiments due to the graduation of Anoop Deoras and limitations on the use of the IBM recognizer, but it can be expected that even the WER would be much lower with the new models with a larger hash and more features. Lastly, experiments with even more features were performed: adding 5-gram features does not seem to help, while adding skip-1 gram features helps a bit.
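The maximum entropy part of the RNNME model stores its n-gram feature weights in a large hash table, so the hash size (1G versus 8G above) bounds the number of distinct weights and determines how often different feature contexts collide. The sketch below illustrates hashed feature indexing of this kind, including an optional skip-1 gram feature; the primes, the mixing scheme and the function name are illustrative assumptions, not the implementation used in these experiments.

```python
# Minimal sketch: hashing n-gram and skip-1 gram history features into a
# shared weight table for a maximum entropy component. The primes and the
# exact mixing scheme are illustrative assumptions.

PRIMES = [81551, 104729, 224737, 350377]

def feature_indices(history, hash_size, max_order=4, use_skip1=True):
    """Map a word history (word ids, most recent last) to indices of
    hashed n-gram features up to max_order, plus an optional skip-1
    gram feature that skips the most recent word."""
    indices = []
    h = 0
    # Progressively longer contexts: last word, last two words, ...
    for order, wid in enumerate(reversed(history[-(max_order - 1):]), start=1):
        h = h * PRIMES[order % len(PRIMES)] + wid + 1
        indices.append(h % hash_size)
    # Skip-1 gram feature: the word two positions back, skipping the last word.
    if use_skip1 and len(history) >= 2:
        indices.append(((history[-2] + 1) * PRIMES[0]) % hash_size)
    return indices

# A larger hash_size (e.g. 8e9 instead of 1e9 weights) reduces collisions
# between distinct feature contexts, at the cost of memory.
print(feature_indices([15, 42, 7, 99], hash_size=1_000_000_000))
```

In implementations of this kind, the feature index is typically combined further with the identity of the predicted word, so that each (context, word) pair selects its own weight; with a smaller table many distinct pairs share a weight, which is one reason why enlarging the hash can improve the results.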

It is also interesting to compare the performance of the RNN and RNNME architectures as the amount of training data increases. With more training data, the optimal size of the hidden layer increases, as the model must have enough parameters to encode all the patterns. In the previous chapter, it was shown that the improvements from the neural network language models actually increase with more training data, which is a very optimistic result. However, with more training data it is also necessary to increase the size of the hidden layer; here we show that if the hidden layer size is kept constant, the simple RNN architecture provides smaller improvements over the baseline n-gram model as the number of training words increases. A very interesting empirical result is that the RNNME architecture still
