
improvements actually increased; this can also be explained by the fact that the WSJ-JHU setup is about 40 times larger.

5.2 Kaldi WSJ Setup

Additional experiments on the Wall Street Journal task were performed using n-best lists generated with the open-source speech recognition toolkit Kaldi [60], trained on the SI-84 data further described in [62]. The acoustic models used in the following experiments were based on triphones and GMMs. Several advantages of using Kaldi, such as better repeatability of the performed experiments, were already mentioned at the beginning of this chapter (although Kaldi is still being developed, it should be easy to repeat the following experiments with slightly better results, as RNN rescoring code is integrated in the Kaldi toolkit). Note that this setup is also not the state of the art; with more training data and advanced acoustic modeling techniques, it is possible to obtain better baseline results. Rescoring experiments with RNN LMs on a state-of-the-art setup are the subject of the following chapter.
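For concreteness, the sketch below shows one simple way n-best rescoring with an RNN LM can be organized: the language model score of each hypothesis is replaced (or interpolated) and the list is re-ranked. The hypothesis fields, the sentence-level log-score interpolation, and the rnn_logprob function are illustrative assumptions, not the exact interface of the Kaldi integration.

```python
# Minimal n-best rescoring sketch (field names, score scales and the
# log-domain interpolation are assumptions made for illustration only).

def rescore_nbest(nbest, rnn_logprob, ngram_weight=0.5, lm_scale=12.0):
    """Re-rank an n-best list after adding an RNN LM score.

    nbest       : list of dicts with 'words', 'acoustic', 'ngram_lm' log-scores
    rnn_logprob : function mapping a word sequence to its RNN LM log-probability
    ngram_weight: weight of the original n-gram LM score (0 = use RNN LM only)
    lm_scale    : language model scale used when combining with the acoustic score
    """
    rescored = []
    for hyp in nbest:
        rnn_score = rnn_logprob(hyp["words"])
        # Rough sentence-level approximation of the per-word probability
        # interpolation that is normally used in practice.
        lm_score = ngram_weight * hyp["ngram_lm"] + (1.0 - ngram_weight) * rnn_score
        total = hyp["acoustic"] + lm_scale * lm_score
        rescored.append((total, hyp))
    # The best hypothesis is the one with the highest combined score.
    rescored.sort(key=lambda x: x[0], reverse=True)
    return rescored[0][1]
```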

I used 1000-best lists generated by Stefan Kombrink in the following experiments. The test sets are the same as for the JHU setup. This time I trained RNNME models to save time; it is possible to achieve very good results even with a tiny hidden layer. For the ME part of the model, I used unigram, bigram, trigram and fourgram features, with a hash size of 2G parameters. The vocabulary was limited to the 20K words used by the decoder. Training data consisted of 37M tokens, of which 1% was used as heldout data. The training data were shuffled to increase the speed of convergence during training; however, due to the homogeneity of the corpus, the automatic sorting technique described in Chapter 6 was not used. The results are summarized in Table 5.3.
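The ME part of an RNNME model can be viewed as a large, sparse weight table indexed by hashed n-gram features. The sketch below only illustrates how unigram through fourgram features can be mapped into a fixed-size parameter array such as the 2G-parameter hash mentioned above; the hash function and table layout are assumptions, not the toolkit's actual code.

```python
# Illustrative sketch of hash-based n-gram (maximum entropy) features.

HASH_SIZE = 2 * 10**9  # roughly 2G parameters, as in the experiments above

def ngram_feature_indices(history, word_id, max_order=4):
    """Map (history, predicted word) pairs of order 1..4 into the hashed table.

    history : list of previous word ids, most recent last
    word_id : id of the predicted word
    Returns one index per active n-gram feature.
    """
    indices = []
    for order in range(1, max_order + 1):
        context = tuple(history[-(order - 1):]) if order > 1 else ()
        # Simple multiplicative hash of the (context, word) pair.
        h = 5381
        for w in context + (word_id,):
            h = (h * 31 + w) & 0xFFFFFFFFFFFF
        indices.append(h % HASH_SIZE)
    return indices

# During training and scoring, the ME contribution to a word's score is
# simply the sum of the weights stored at these indices.
```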

It can be seen that RNNME models improve PPL and WER significantly even with a tiny hidden layer, such as 10 neurons. However, to reach top performance, it is useful to train models as large as possible. While training of small RNNME models (such as those with fewer than 100 neurons in the hidden layer) takes several hours, training the largest models takes a few days. After combining all RNNME models, the performance still improves; however, adding unsupervised adaptation resulted in a rather insignificant improvement. Note that Eval 92 contains 333 utterances and Eval 93 only 213, thus there is noise in the WER results due to the small amount of test data.
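Combining the models amounts to linear interpolation of the word probabilities assigned by the individual RNNME models. A minimal sketch is given below; the uniform weights are an assumption, as in practice the weights would be tuned on the heldout data.

```python
# Linear interpolation of several language models (sketch; uniform weights
# are an assumption and would normally be tuned on heldout data).

def interpolate(prob_fns, weights=None):
    """Return a function giving the interpolated P(word | history)."""
    if weights is None:
        weights = [1.0 / len(prob_fns)] * len(prob_fns)
    def combined(word, history):
        return sum(w * p(word, history) for w, p in zip(weights, prob_fns))
    return combined
```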
