
[Figure 6.10 (line plot): entropy on the y-axis (approximately 6.8 to 8.4) versus training epoch on the x-axis (1 to 8), with curves for KN4, RNN-40, RNNME-40 (1G), and RNNME-40 (8G).]

Figure 6.10: Comparison of training progress of the RNN model with 40 neurons and RNNME with 40 neurons (1G hash and 8G hash). Entropy is calculated on the development set.

6.6.4 Language Learning by RNN

Statistical language modeling has been criticized by linguists, for example by Chomsky as mentioned in the introduction, for its inability to distinguish between grammatical and ungrammatical sentences that are completely novel. Chomsky's famous examples were 'colorless green ideas sleep furiously' and 'furiously sleep ideas green colorless'. Unless an enormous amount of training data is used, n-gram models are not able to assign different probabilities to these two sentences, although the first one is grammatical and thus should be more likely than the second one.

For the following simple experiment, the language models introduced in the previous sections were used, namely the RNN-640 model and the large KN4 n-gram model trained on the Broadcast News data. Interestingly, the n-gram model does not contain any of the bigrams that occur in the test sentences, so it has to back off to unigram statistics when estimating the probability of every word (except for the first word, which is conditioned on the start-of-sentence symbol, and the end-of-sentence symbol; in these cases, bigram statistics were used).
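To make the backoff behaviour concrete, the following sketch scores a sentence with a toy backed-off bigram model: whenever a bigram was unseen in training, the model falls back to a discounted unigram probability. The probability tables, the constant backoff weight, and the helper names are illustrative assumptions, not the actual KN4 model used in this experiment; for simplicity every word backs off here, whereas in the real experiment the sentence-boundary bigrams were available.

```python
import math

# Toy illustration (not the thesis' KN4 model): a backed-off bigram LM.
# 'unigram_prob' and 'bigram_prob' are hypothetical, pre-estimated tables;
# 'backoff_weight' stands in for the usual backoff normalization term.
unigram_prob = {"colorless": 1e-6, "green": 1e-4, "ideas": 1e-5,
                "sleep": 1e-5, "furiously": 1e-6, "</s>": 1e-2}
bigram_prob = {}          # no bigram from the test sentences was seen in training
backoff_weight = 0.4      # assumed constant weight applied on backoff

def word_logprob(prev, word):
    # Use the bigram estimate if it exists, otherwise back off to the unigram.
    if (prev, word) in bigram_prob:
        return math.log(bigram_prob[(prev, word)])
    return math.log(backoff_weight) + math.log(unigram_prob[word])

def sentence_logprob(words):
    logp, prev = 0.0, "<s>"
    for w in words + ["</s>"]:
        logp += word_logprob(prev, w)
        prev = w
    return logp

s1 = "colorless green ideas sleep furiously".split()
s2 = "furiously sleep ideas green colorless".split()
# Both sentences contain the same words, so under unigram backoff their
# scores are essentially identical - the point made in the text above.
print(sentence_logprob(s1), sentence_logprob(s2))
```

Because both sentences reduce to the same set of unigram terms under backoff, any difference in their n-gram probabilities can come only from the few boundary bigrams, which is why the gap reported in Table 6.7 is small.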

The difference in the probabilities assigned to the two test sentences by the n-gram model is only minor, as can be seen in Table 6.7. On the other hand, the RNN-640 model assigns about
