
Table 5.1: Comparison of advanced language modeling techniques on the WSJ task (37M training tokens).

Model                     Dev WER [%]   Eval WER [%]
Baseline - KN5                12.2          17.2
Discriminative LM [79]        11.5          16.9
Joint LM [23]                  -            16.7
Static RNN                    10.3          14.5
Static RNN + KN               10.2          14.5
Adapted RNN                    9.7          14.2
Adapted RNN + KN               9.7          14.2
3 interpolated RNN LMs         9.5          13.9
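The "+ KN" rows and the "3 interpolated RNN LMs" row refer to linear interpolation of per-word probabilities from several models. As a minimal sketch of this combination (the interpolation weight 0.5 and the probability values below are assumptions for illustration, not values from these experiments):

import math

def interpolate(p_rnn, p_kn, lam=0.5):
    # Per-word probability of the linearly interpolated model.
    return lam * p_rnn + (1.0 - lam) * p_kn

def perplexity(word_probs):
    # Perplexity over a list of per-word probabilities.
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Hypothetical per-word probabilities from the two models on a short test text.
p_rnn = [0.12, 0.04, 0.30, 0.08]
p_kn  = [0.10, 0.06, 0.20, 0.05]
p_mix = [interpolate(a, b) for a, b in zip(p_rnn, p_kn)]
print(perplexity(p_rnn), perplexity(p_kn), perplexity(p_mix))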

Table 5.2: Comparison of results on the WSJ dev set (JHU setup) obtained with models trained on different amounts of data.

               PPL            WER [%]       Improvement [%]
# words     KN5   +RNN      KN5   +RNN      Entropy    WER
223K        415    333       -      -         3.7       -
675K        390    298      15.6   13.9       4.5      10.9
2233K       331    251      14.9   12.9       4.8      13.4
6.4M        283    200      13.6   11.7       6.1      14.0
37M         212    133      12.2   10.2       8.7      16.4

to the way humans process natural language. I believe that advanced techniques exist that are able to model a richer set of patterns in the language, and that these should actually become increasingly better than n-grams as more training data is available. Thus, I performed experiments to check whether RNN LMs behave in this way.

Results with increasingly large subsets of the training data for the WSJ-JHU task are shown in Table 5.2. Both the relative entropy reductions and the relative word error rate reductions increase with more training data. This is a very optimistic result, and it confirms that the original motivation for using neural net language models was correct: by using a distributed representation of the history instead of sparse coding, the neural net models can represent certain patterns in the language more efficiently than the n-gram models. The same results are also shown in Figure 5.1, where the trend is easier to see.
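The relative improvements in Table 5.2 can be recomputed from the PPL and WER columns; the entropy reduction is the relative decrease of the log of the perplexity, which is consistent with the tabulated values. A small sketch (the helper names are mine, not from the original text) reproducing the 37M-word row:

import math

def relative_entropy_reduction(ppl_baseline, ppl_new):
    # Relative reduction of per-word entropy (log of perplexity); the log base cancels.
    return 100.0 * (math.log(ppl_baseline) - math.log(ppl_new)) / math.log(ppl_baseline)

def relative_wer_reduction(wer_baseline, wer_new):
    # Relative reduction of word error rate.
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# 37M-word row of Table 5.2: PPL 212 -> 133, WER 12.2 -> 10.2
print(relative_entropy_reduction(212, 133))  # ~8.7
print(relative_wer_reduction(12.2, 10.2))    # ~16.4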

