
[Figure 6.2 plot: perplexity (y-axis, 0–1600) versus sorted chunks (x-axis, 0–600), with two curves: held-out set and eval set.]

Figure 6.2: Perplexity of data chunks sorted by their performance on the development data.

about 318M tokens³. In Figure 6.3, we show three variants of training the RNN LM: training on all concatenated training corpora in the natural order, standard stochastic gradient descent using all data (randomized order of sentences), and training on the reduced and sorted set.

We can conclude that stochastic gradient descent reduces the number of training epochs required before convergence is reached, compared to training on all data in the natural order of sentences. However, training on the sorted data results in significantly lower final perplexity on the development set: we observe a reduction of around 10%. In Table 6.2, we show that these improvements carry over to the evaluation set.
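The data selection scheme can be summarized as follows. The snippet below is a minimal sketch, not the actual toolkit or configuration used in this chapter: the training data is split into chunks, each chunk is scored by the development-set perplexity of a model estimated on that chunk alone, the worst-matching chunks are dropped, and the kept data is ordered so that the most in-domain chunks are seen last during training. The helpers train_chunk_lm and perplexity are hypothetical placeholders for whatever trainer and evaluator is used for the chunk models.

```python
# Illustrative sketch of perplexity-based data selection and sorting.
# train_chunk_lm() and perplexity() are hypothetical placeholders for any
# language-model trainer/evaluator (e.g. an n-gram toolkit or rnnlm).

def split_into_chunks(sentences, n_chunks):
    """Split the list of training sentences into roughly equal chunks."""
    size = max(1, len(sentences) // n_chunks)
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def reduce_and_sort(sentences, dev_set, n_chunks=600, keep_fraction=0.5):
    chunks = split_into_chunks(sentences, n_chunks)

    # Development-set perplexity of a model trained on each chunk alone;
    # lower perplexity means the chunk is closer to the target domain.
    scored = [(perplexity(train_chunk_lm(c), dev_set), c) for c in chunks]

    # Sort from worst to best match and drop the worst chunks, so that the
    # kept data ends with the most in-domain chunks (seen last in training).
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = scored[int(len(scored) * (1.0 - keep_fraction)):]
    return [sentence for _, chunk in kept for sentence in chunk]
```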

6.5 Experiments with large RNN models<br />

The perplexity of the large 4-gram model with modified Kneser-Ney smoothing (denoted later as KN4) is 144 on the development set and 140 on the evaluation set. We can see in Table 6.2 that RNN models with 80 neurons are still far from this performance.
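For reference, the perplexity values compared here are the exponentiated average negative log-probability per token. The short sketch below shows this computation, assuming a list of per-token natural-log probabilities is already available; the function name is illustrative and not part of any toolkit mentioned in this chapter.

```python
import math

def perplexity_from_logprobs(log_probs):
    """Perplexity from per-token natural-log probabilities log P(w_i | history):
    PPL = exp(-(1/N) * sum_i log P(w_i | history_i)).
    Lower values (e.g. the 144 / 140 reported for KN4) mean the model assigns
    a higher average probability to the text."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Example: if every token were predicted with probability 1/144,
# the resulting perplexity would be exactly 144.
print(perplexity_from_logprobs([math.log(1.0 / 144)] * 3))  # ~144.0
```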

³ Training a standard n-gram model on the reduced set results in about the same perplexity as an n-gram model trained on all data.

