
were: 400 classes, hidden layer size up to 800 neurons. Other hyper-parameters such as interpolation weights were tuned on the WSJ'92 set (333 sentences), and the WSJ'93 set used for evaluation consists of 465 sentences.
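To make the weight tuning concrete, the sketch below performs a simple grid search over the interpolation coefficient between an RNN LM and the n-gram baseline, choosing the value that minimizes perplexity on a held-out set such as WSJ'92. The function name and the input format (per-word probabilities from each model) are hypothetical illustrations, not part of the actual experimental toolkit.

    import math

    def tune_interpolation_weight(heldout_probs, step=0.05):
        """Grid search for L in P(w|h) = L*P_rnn(w|h) + (1-L)*P_kn5(w|h).

        heldout_probs: list of (p_rnn, p_kn5) pairs, one per word of the
        held-out set; each entry is the probability that the corresponding
        model assigns to the correct next word (assumed input format).
        Returns the weight with the lowest held-out perplexity.
        """
        best_weight, best_ppl = 0.0, float("inf")
        weight = 0.0
        while weight <= 1.0 + 1e-9:
            log_prob = 0.0
            for p_rnn, p_kn5 in heldout_probs:
                log_prob += math.log(weight * p_rnn + (1.0 - weight) * p_kn5)
            ppl = math.exp(-log_prob / len(heldout_probs))
            if ppl < best_ppl:
                best_weight, best_ppl = weight, ppl
            weight += step
        return best_weight, best_ppl

    # Toy usage (illustrative numbers only, not real WSJ data):
    # weight, ppl = tune_interpolation_weight([(0.12, 0.08), (0.30, 0.25), (0.05, 0.09)])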

Note that this setup is very simple, as the acoustic models used to generate the n-best lists for this task were not state of the art. Also, the corresponding language models used in the previous research were trained on just a limited amount of training data (37M-70M words); better performance can be expected by using more training data, which is easily affordable for this task. The same holds for the vocabulary: a 20K word list was used, although it would be simple to use a larger one. Thus, the experiments on this setup are not supposed to beat the state of the art, but to allow comparison to other LM techniques and to provide more insight into the performance of the RNN LMs.

5.1.1 Results on the JHU Setup

Results with RNN models and competitive techniques are summarized in Table 5.1. The best RNN models have a very high optimal weight when combined with the KN5 baseline model; in fact, discarding the n-gram model completely does not significantly affect the results. Interpolation of three RNN models gives the best results: the word error rate is reduced by about 20% relative. Other techniques, such as a discriminatively trained language model and a joint LM (structured model), provide smaller improvements, only about 2-3% relative reduction of WER on the evaluation set.
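The combination behind the best result can be restated as a standard linear interpolation of the component models (the notation below is a clarifying restatement, with the concrete weight values being those tuned on the held-out set):

    P(w \mid h) = \sum_{i=1}^{3} \lambda_i P_{\text{RNN}_i}(w \mid h) + \lambda_4 P_{\text{KN5}}(w \mid h), \qquad \sum_{i=1}^{4} \lambda_i = 1,

where setting \lambda_4 close to zero corresponds to the observation above that the n-gram model can be discarded with almost no loss.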

The adapted RNN model is not evaluated as the dynamic RNN LM described in the previous chapters, but simply as a static model that is re-trained on the 1-best hypotheses. This was done due to performance issues: it becomes relatively slow to work with RNN models that are continuously updated, especially in the n-best list rescoring framework. Adaptation itself provides a relatively small improvement, especially with the large models.
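For context, n-best list rescoring in this kind of setup typically combines the acoustic score with the new LM score and a word insertion penalty, then keeps the highest-scoring hypothesis. The sketch below is a minimal illustration of that step; the field names, the lm_logprob function, and the scale values are assumptions for illustration, not the actual rescoring scripts used in these experiments.

    def rescore_nbest(nbest, lm_logprob, lm_scale=12.0, word_penalty=0.0):
        """Re-rank one n-best list with a new language model.

        nbest: list of dicts with 'words' (list of tokens) and 'acoustic'
        (acoustic log-score) -- a hypothetical input format.
        lm_logprob: function mapping a word sequence to its log-probability
        under the new LM, e.g. the interpolated RNN+KN5 model.
        """
        best_hyp, best_score = None, float("-inf")
        for hyp in nbest:
            score = (hyp["acoustic"]
                     + lm_scale * lm_logprob(hyp["words"])
                     + word_penalty * len(hyp["words"]))
            if score > best_score:
                best_hyp, best_score = hyp, score
        return best_hyp

In this framework, the adapted model mentioned above is obtained simply by continuing training of the static RNN LM on the recognized 1-best outputs before the rescoring pass is repeated.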

5.1.2 Performance with Increasing Size of the Training Data

It was observed by Joshua Goodman that with an increasing amount of training data, the improvements provided by many advanced language modeling techniques vanish, with a possible conclusion that it might be sufficient to train basic n-gram models on huge amounts of data to obtain good performance [24]. This is sometimes interpreted as an argument against language modeling research; however, as was mentioned in the introduction of this thesis, simple counting of words in different contexts is far from being close

