Statistical Language Models based on Neural Networks - Faculty of ...
improvements actually increased - this can also be explained by the fact that the WSJ-JHU
setup is about 40x larger.

5.2 Kaldi WSJ Setup
Additional experiments on the Wall Street Journal task were performed using n-best lists
generated with the open-source speech recognition toolkit Kaldi [60], trained on the SI-84 data
further described in [62]. The acoustic models used in the following experiments were
based on triphones and GMMs. Several advantages of using Kaldi, such as better repeatability
of the performed experiments, were already mentioned at the beginning of this
chapter (although Kaldi is still being developed, it should be easy to repeat the following
experiments with slightly better results, as the RNN rescoring code is integrated in the Kaldi
toolkit). Note that this setup is also not the state of the art, as with more training data
and advanced acoustic modeling techniques it is possible to obtain better baseline results.
Rescoring experiments with RNN LMs on a state-of-the-art setup are the subject of the following
chapter.
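The n-best rescoring procedure used throughout these experiments can be sketched as follows. The log-linear combination of acoustic and language model scores is standard; the `lm_scale` and word insertion penalty values here are illustrative assumptions, not the settings actually tuned for this setup.

```python
import math

def rescore_nbest(nbest, lm_logprob, lm_scale=15.0, wip=0.0):
    """Pick the best hypothesis from an n-best list after LM rescoring.

    nbest:      list of (words, acoustic_logscore) pairs from the decoder
    lm_logprob: function scoring a word sequence under the new LM (log domain)
    lm_scale, wip: LM scale and word insertion penalty - illustrative values,
                   normally tuned on a development set
    """
    best, best_score = None, -math.inf
    for words, ac_score in nbest:
        # combine acoustic score with the rescoring LM score log-linearly
        score = ac_score + lm_scale * lm_logprob(words) + wip * len(words)
        if score > best_score:
            best, best_score = words, score
    return best
```

In practice the new LM score is often itself an interpolation of the RNN model with the baseline n-gram model before being plugged into this combination.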
I used 1000-best lists generated by Stefan Kombrink in the following experiments. The
test sets are the same as for the JHU setup. This time I trained RNNME models to save
time - it is possible to achieve very good results even with a tiny hidden layer. For
the ME part of the model, I used unigram, bigram, trigram and fourgram features, with
a hash size of 2G parameters. The vocabulary was limited to the 20K words used by the decoder.
The training data consisted of 37M tokens, of which 1% was used as heldout data. The
training data were shuffled to increase the speed of convergence during training; however, due
to the homogeneity of the corpus, the automatic sorting technique described in Chapter 6
was not used. The results are summarized in Table 5.3.
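The hashed n-gram features of the ME part can be sketched as below: each n-gram history is mapped by a hash function into a fixed-size parameter table, so no explicit n-gram inventory has to be stored. The particular hash function and per-order seeding here are illustrative assumptions, not the exact scheme of the toolkit used in the experiments.

```python
HASH_SIZE = 2_000_000_000  # 2G direct-connection parameters, as in the text

def history_feature_indices(history, max_order=4):
    """Return one hashed feature index per n-gram order (unigram..fourgram).

    history: list of word ids preceding the predicted word.
    The hash is built cumulatively, so each higher order extends the
    context of the previous one by one more history word.
    """
    indices = []
    h = 0
    for order in range(1, max_order + 1):
        if order == 1:
            indices.append(0)  # unigram feature maps to a shared slot
            continue
        if len(history) < order - 1:
            break  # not enough context for this order
        # fold the next-older history word into the running hash
        h = (h * 1000003 + history[-(order - 1)] + 1) % HASH_SIZE
        indices.append((h + order) % HASH_SIZE)
    return indices
```

Each returned index selects a weight in the direct input-to-output connections; hash collisions trade a small amount of accuracy for constant memory.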
It can be seen that RNNME models improve PPL and WER significantly even with a tiny
hidden layer, such as 10 neurons. However, to reach top performance, it
is useful to train models as large as possible. While training small RNNME models (such
as those with fewer than 100 neurons in the hidden layer) takes several hours, training the
largest models takes a few days. After combining all RNNME models, the performance
still improves; however, adding unsupervised adaptation resulted in a rather insignificant
improvement - note that Eval 92 contains 333 utterances and Eval 93 only 213, thus
there is noise in the WER results due to the small amount of test data.
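Combining several models as above is typically done by linear interpolation of their per-word probabilities. A minimal sketch, assuming equal interpolation weights (in practice the weights are tuned on heldout data):

```python
def interpolate(prob_dists, weights=None):
    """Linearly interpolate word probability distributions from several models.

    prob_dists: list of dicts mapping word -> probability, one per model.
    weights:    interpolation weights summing to 1; equal weights by default
                (an illustrative assumption - normally tuned on heldout data).
    """
    if weights is None:
        weights = [1.0 / len(prob_dists)] * len(prob_dists)
    combined = {}
    for w, dist in zip(weights, prob_dists):
        for word, p in dist.items():
            combined[word] = combined.get(word, 0.0) + w * p
    return combined
```

Since each component distribution sums to one and the weights sum to one, the interpolated distribution is again a valid probability distribution.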