
hidden layer, both for their stand-alone version and even after combining them with the backoff model. RNN models without direct connections must sacrifice a lot of parameters to describe simple patterns, while in the presence of direct connections, the hidden layer of the neural network may focus on discovering information complementary to the direct connections. A comparison of the improvements over the baseline n-gram model given by RNN and RNNME models with increasing size of the hidden layer is provided in Figure 6.9.
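
To make the role of the direct connections concrete, the following is a minimal sketch of how an RNNME-style output distribution can combine the recurrent hidden layer with hashed n-gram features; the function name, the hashing scheme and the dimensions are illustrative assumptions, not the implementation used in these experiments.

```python
import numpy as np

def rnnme_output(hidden, context_words, V, direct_weights, hash_size):
    """Combine the recurrent hidden layer with direct n-gram connections.

    hidden         : hidden-layer activation h(t) of the RNN, shape (H,)
    context_words  : ids of the previous words, most recent last
    V              : output weight matrix, shape (vocab_size, H)
    direct_weights : one large shared weight array for the hashed n-gram features
    """
    vocab_size = V.shape[0]
    scores = V @ hidden                      # contribution of the hidden layer

    # contribution of the direct connections: each n-gram history is hashed
    # together with every candidate output word into the shared weight array
    for order in range(1, len(context_words) + 1):
        history = tuple(context_words[-order:])
        for w in range(vocab_size):
            idx = hash((order, history, w)) % hash_size
            scores[w] += direct_weights[idx]

    scores -= scores.max()                   # softmax over the vocabulary
    probs = np.exp(scores)
    return probs / probs.sum()
```

With this split, simple n-gram regularities can be captured cheaply by the direct weights, leaving the hidden layer free to model longer-range patterns.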

Most importantly, we have observed good performance when we used the RNNME model for rescoring experiments. Reductions of word error rate on the RT04 evaluation set are summarized in Table 6.5. The model with direct parameters and 40 neurons in the hidden layer performs almost as well as the model without direct parameters and with 320 neurons. This means that only 40² recurrent weights have to be trained, instead of 320², to achieve a similar WER.
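
For context, rescoring typically re-ranks an n-best list from the decoder by combining the acoustic score with the score of the new language model; the sketch below assumes log-domain scores and a hypothetical scaling factor, not the exact setup of these experiments.

```python
def rescore_nbest(nbest, lm_logprob, lm_scale=12.0):
    """Re-rank an n-best list with a new language model.

    nbest      : list of (words, acoustic_logscore) pairs from the decoder
    lm_logprob : function returning the log-probability of a word sequence
    lm_scale   : hypothetical language-model scaling factor
    """
    best_score, best_words = float("-inf"), None
    for words, am_score in nbest:
        total = am_score + lm_scale * lm_logprob(words)
        if total > best_score:
            best_score, best_words = total, words
    return best_words
```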

The best result reported in Table 6.5, WER 11.70%, was achieved by interpolating three models: RNN-640, RNN-480 and another RNN-640 model trained on a subset of the training data (the corpora LDC97T22, LDC98T28, LDC2005T16 and BN03 were used - see Table 6.1). It is likely that further combination with RNNME models would yield even better results.
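
The interpolation of models referred to above is a linear combination of their word probabilities; the sketch below uses uniform weights by default, which is an assumption rather than the setting used for the reported result.

```python
def interpolate(models, weights=None):
    """Linearly interpolate word probabilities from several language models.

    models  : list of functions p(word, history) -> probability
    weights : interpolation weights summing to 1 (uniform by default)
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)

    def p(word, history):
        return sum(w * m(word, history) for w, m in zip(weights, models))

    return p

# illustrative usage with three hypothetical model functions:
# p_mix = interpolate([p_rnn640, p_rnn480, p_rnn640_subset])
```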

6.6.3 Further Results with RNNME<br />

Motivated by the success of the RNNME architecture, I have later performed additional experiments with the RNNME models. The models were improved by adding unigram and four-gram features, and by using a larger hash array. The new results are summarized in Table 6.6.
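
The added unigram and four-gram features follow the same idea as the other direct connections: features of every order are mapped into one shared weight array, so adding feature orders or enlarging the array changes only the hashing, not the model structure. A minimal sketch, with an illustrative hash function:

```python
def feature_indices(history, word, hash_size):
    """Map n-gram features of orders 1 to 4 to slots in a shared hash array.

    history   : ids of the previous words, most recent last
    word      : id of the candidate next word
    hash_size : size of the shared weight array (billions of entries in the
                8G/16G configurations; any size works for this sketch)
    """
    indices = []
    for n_context in range(0, 4):            # 0 = unigram, 3 = four-gram feature
        context = tuple(history[-n_context:]) if n_context > 0 else ()
        indices.append(hash((n_context, context, word)) % hash_size)
    return indices
```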

It can be seen that by using more features and more memory for the hash, the perplexity results improved considerably. The RNNME-0 model with 16G features alone is better than the baseline backoff 4-gram model, and after their interpolation, the perplexity is reduced from the baseline 140 to 125. Using 16G features is impractical due to memory complexity, so additional experiments were performed with 8G features. By using as few as 10 neurons in the hidden layer, the perplexity on the evaluation set was reduced from 137 to 127 - even after interpolation with the backoff model, the difference remains significant (126 vs. 120).
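
For reference, the perplexity figures quoted here follow from the per-word probabilities on the evaluation set, optionally after interpolation with the backoff model; a minimal sketch, with the interpolation weight as a placeholder:

```python
import math

def perplexity(sentences, prob, backoff_prob=None, lam=0.5):
    """Perplexity of a model, optionally interpolated with a backoff LM.

    sentences    : iterable of word sequences from the evaluation set
    prob         : p(word | history) of the evaluated model
    backoff_prob : optional p(word | history) of the backoff n-gram model
    lam          : assumed interpolation weight (0.5 is only a placeholder)
    """
    log_sum, count = 0.0, 0
    for sentence in sentences:
        for i, word in enumerate(sentence):
            p = prob(word, sentence[:i])
            if backoff_prob is not None:
                p = lam * p + (1.0 - lam) * backoff_prob(word, sentence[:i])
            log_sum += math.log(p)
            count += 1
    return math.exp(-log_sum / count)
```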

Even models with more neurons, such as RNNME-40, improved considerably - we can see that by using more memory and more features, the perplexity of the RNNME-40 model

