Statistical Language Models based on Neural Networks - Faculty of ...

Table 5.1: Comparison of advanced language modeling techniques on the WSJ task (37M training tokens).

Model                     Dev WER [%]   Eval WER [%]
Baseline - KN5            12.2          17.2
Discriminative LM [79]    11.5          16.9
Joint LM [23]             -             16.7
Static RNN                10.3          14.5
Static RNN + KN           10.2          14.5
Adapted RNN               9.7           14.2
Adapted RNN + KN          9.7           14.2
3 interpolated RNN LMs    9.5           13.9

Table 5.2: Comparison of results on the WSJ dev set (JHU setup) obtained with models trained on different amounts of data.

# words   PPL           WER [%]       Improvement [%]
          KN5    +RNN   KN5    +RNN   Entropy   WER
223K      415    333    -      -      3.7       -
675K      390    298    15.6   13.9   4.5       10.9
2233K     331    251    14.9   12.9   4.8       13.4
6.4M      283    200    13.6   11.7   6.1       14.0
37M       212    133    12.2   10.2   8.7       16.4

to the way humans process natural language. I believe that advanced techniques exist that are able to model a richer set of patterns in the language, and these should actually get increasingly better than n-grams with more training data. Thus, I performed experiments to check whether RNN LMs behave in this way. Results with increasingly large subsets of the training data for the WSJ-JHU task are shown in Table 5.2. Both the relative entropy reductions and the relative word error rate reductions increase with more training data. This is a very optimistic result, and it confirms that the original motivation for using neural net language models was correct: by using a distributed representation of the history instead of sparse coding, neural net models can represent certain patterns in the language more efficiently than n-gram models. The same results are also shown in Figure 5.1, where it is easier to see the trend.
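The "Improvement" columns of Table 5.2 can be reproduced from the perplexity and WER columns. The sketch below assumes that entropy per word is the logarithm of perplexity, so that the relative entropy reduction is 1 - log(PPL_new) / log(PPL_base), and that the WER improvement is the plain relative reduction. The function names are illustrative and do not come from the thesis or any toolkit.

```python
import math


def relative_entropy_reduction(ppl_base, ppl_new):
    """Relative reduction of per-word entropy, in percent.

    Assumes entropy = log(perplexity); the ratio of two entropies does
    not depend on the base of the logarithm.
    """
    return 100.0 * (1.0 - math.log(ppl_new) / math.log(ppl_base))


def relative_wer_reduction(wer_base, wer_new):
    """Relative reduction of word error rate, in percent."""
    return 100.0 * (wer_base - wer_new) / wer_base


# Rows copied from Table 5.2:
# (training words, PPL KN5, PPL KN5+RNN, WER KN5, WER KN5+RNN)
rows = [
    ("223K", 415, 333, None, None),
    ("675K", 390, 298, 15.6, 13.9),
    ("2233K", 331, 251, 14.9, 12.9),
    ("6.4M", 283, 200, 13.6, 11.7),
    ("37M", 212, 133, 12.2, 10.2),
]

for words, ppl_kn5, ppl_rnn, wer_kn5, wer_rnn in rows:
    entropy_gain = relative_entropy_reduction(ppl_kn5, ppl_rnn)
    if wer_kn5 is None:
        wer_gain = "-"  # no WER reported for the smallest subset
    else:
        wer_gain = f"{relative_wer_reduction(wer_kn5, wer_rnn):.1f}"
    print(f"{words:>6}  entropy: {entropy_gain:.1f}%  WER: {wer_gain}")
```

Under these assumptions the computed values agree with the table to one decimal place, which suggests the table reports relative reductions rather than absolute ones.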
[Figure 5.1 (plot): entropy per word on the WSJ test data (y-axis) against the number of training tokens (x-axis, logarithmic), for KN5 and KN5+RNN.]

Figure 5.1: Improvements with increasing amount of training data - WSJ (JHU setup). Note that the size of the hidden layer is tuned for optimal performance, and increases with the amount of training data.

5.1.3 Conclusion of WSJ Experiments (JHU setup)

The possible improvements increase with more training data on this particular setup. This is a very positive result; the drawback is that with larger amounts of training data, such as billions of words, the computational complexity of RNN models is prohibitively large. However, we dealt with the computational complexity in the previous chapter, and it should be doable to train good RNN models even on data sets with more than a billion words by using the class-based RNNME architecture.

Similarly to the experiments with the Penn Treebank Corpus, I tried to achieve the lowest possible perplexity. However, this time just two RNN LMs were used, and the combination of models included just static RNN LMs, dynamic RNN LMs (with a single learning rate α = 0.1) and a Kneser-Ney smoothed 5-gram model with a cache. The Good-Turing smoothed trigram has perplexity 246 on the test data; the best combination of models had perplexity 108, which is more than 56% lower (entropy reduction 15.0%). The 5-gram with modified Kneser-Ney smoothing has perplexity 212 on this task, thus the combined result is 49% lower (entropy reduction 12.6%). Thus, although the combination experiments were much more restricted than in the case of PTB, the entropy
