Statistical Language Models based on Neural Networks - Faculty of ...
were: 400 classes, hidden layer size up to 800 neurons. Other hyper-parameters such as
interpolation weights were tuned on the WSJ'92 set (333 sentences), and the WSJ'93 set
used for evaluation consists of 465 sentences.

Note that this setup is quite simple, as the acoustic models that were used to generate
the n-best lists for this task were not state of the art. Also, the corresponding language
models used in the previous research were trained on only a limited amount of training
data (37M-70M words); better performance could be expected by using more training data,
which is easily affordable for this task. The same holds for the vocabulary: a
20K word list was used, although it would be simple to use a larger one. Thus, the experiments
on this setup are not meant to beat the state of the art, but to allow comparison with
other LM techniques and to provide more insight into the performance of the RNN LMs.
5.1.1 Results on the JHU Setup
Results with RNN models and competitive techniques are summarized in Table 5.1. The
best RNN models have a very high optimal weight when combined with the KN5 baseline model,
and in fact discarding the n-gram model completely does not significantly affect the
results. Interpolation of three RNN models gives the best results: the word error rate
is reduced by about 20% relative. Other techniques, such as a discriminatively trained
language model and a joint LM (structured model), provide smaller improvements, only
about 2-3% relative reduction of WER on the evaluation set.
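The interpolation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the model probabilities, the 0.8/0.2 weight split, and the function names are hypothetical, and a real rescoring pass would obtain the per-word probabilities from the RNN LM and the KN5 model over hypotheses in the n-best lists.

```python
# Hedged sketch of linear interpolation of language model probabilities,
# as used when combining an RNN LM with an n-gram (KN5) baseline during
# n-best list rescoring. All numbers below are illustrative placeholders.

import math

def interpolate_logprob(word_probs, weights):
    """Interpolated log10 probability of a single word.

    word_probs: per-model probabilities P_m(w | history)
    weights:    interpolation weights lambda_m, summing to 1
    """
    p = sum(lam * p_m for lam, p_m in zip(weights, word_probs))
    return math.log10(p)

def hypothesis_score(per_word_probs, weights):
    """Total interpolated log-probability of one hypothesis, summed
    over its words; hypotheses are then re-ranked by this score
    (optionally combined with the acoustic score)."""
    return sum(interpolate_logprob(probs, weights) for probs in per_word_probs)

# Toy example: two models (RNN LM, KN5) scoring a three-word hypothesis.
probs = [(0.20, 0.10), (0.05, 0.02), (0.30, 0.25)]
weights = (0.8, 0.2)  # hypothetical: a high weight on the RNN model
score = hypothesis_score(probs, weights)
```

Because the optimal RNN weight is close to 1.0 here, setting it to exactly 1.0 (i.e. dropping the n-gram term) changes the scores, and hence the ranking, very little.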
The adapted RNN model is not evaluated as a dynamic RNN LM as described in the
previous chapters, but simply as a static model that is re-trained on the 1-best lists. This was
done due to performance issues: it becomes relatively slow to work with RNN models that
are continuously updated, especially in the n-best list rescoring framework. Adaptation
itself provides a relatively small improvement, especially with the large models.
5.1.2 Performance with Increasing Size of the Training Data
It was observed by Joshua Goodman that with an increasing amount of training data,
the improvements provided by many advanced language modeling techniques vanish, with
a possible conclusion that it might be sufficient to train basic n-gram models on huge
amounts of data to obtain good performance [24]. This is sometimes interpreted as an
argument against language modeling research; however, as was mentioned in the
introduction of this thesis, simple counting of words in different contexts is far from being close