Statistical Language Models based on Neural Networks - Faculty of ...


hidden layer, both for their stand-alone version and even after combining them with the backoff model. RNN models without direct connections must sacrifice a lot of parameters to describe simple patterns, while in the presence of direct connections, the hidden layer of the neural network may focus on discovering information complementary to the direct connections. A comparison of the improvements over the baseline n-gram model given by RNN and RNNME models with increasing size of the hidden layer is provided in Figure 6.9.

Most importantly, we have observed good performance when we used the RNNME model for rescoring experiments. Reductions of word error rate on the RT04 evaluation set are summarized in Table 6.5. The model with direct parameters and 40 neurons in the hidden layer performs almost as well as the model without direct parameters and with 320 neurons. This means that we have to train only 40^2 recurrent weights, instead of 320^2, to achieve similar WER.

The best result reported in Table 6.5, WER 11.70%, was achieved by using interpolation of three models: RNN-640, RNN-480 and another RNN-640 model trained on a subset of the training data (the corpora LDC97T22, LDC98T28, LDC2005T16 and BN03 were used - see Table 6.1). It is likely that further combination with RNNME models would yield even better results.

6.6.3 Further Results with RNNME

Motivated by the success of the RNNME architecture, I have later performed additional experiments with the RNNME models. The models were improved by adding unigram and four-gram features, and by using a larger hash array. The new results are summarized in Table 6.6.

It can be seen that by using more features and more memory for the hash, the perplexity results improved considerably. The RNNME-0 with 16G features alone is better than the baseline backoff 4-gram model, and after their interpolation, the perplexity is reduced to 125 from the baseline 140.
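The interpolation and perplexity figures quoted above can be illustrated with a minimal sketch. It assumes the usual linear interpolation of per-word probabilities and perplexity defined as 2 raised to the average negative log2 probability. The function names, the interpolation weight of 0.5 and the per-word probabilities are invented for illustration and are not values from the experiments; only the perplexities 140 and 125 are taken from the text, and they are used purely for the arithmetic.

```python
import math

def interpolate(p_a, p_b, lam):
    """Linear interpolation of two models: lam * p_a + (1 - lam) * p_b, per word."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]

def perplexity(probs):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    entropy_bits = -sum(math.log2(p) for p in probs) / len(probs)
    return 2.0 ** entropy_bits

def entropy_change_bits(ppl_base, ppl_new):
    """Change in per-word entropy (bits) when perplexity moves from ppl_base to ppl_new."""
    return math.log2(ppl_new) - math.log2(ppl_base)

# Hypothetical probabilities assigned to the same four test words by two models.
p_backoff = [0.02, 0.10, 0.005, 0.30]   # made-up values, stands in for the backoff model
p_rnnme = [0.05, 0.04, 0.020, 0.25]     # made-up values, stands in for an RNNME model
p_mix = interpolate(p_rnnme, p_backoff, lam=0.5)  # lam=0.5 is an assumption

print(perplexity(p_backoff), perplexity(p_rnnme), perplexity(p_mix))

# Perplexities quoted in the text: baseline 140, interpolated 125.
print(entropy_change_bits(140.0, 125.0))
```

Under these definitions, the reduction from 140 to 125 corresponds to roughly 0.16 bits of entropy per word, which is the unit used on the vertical axis of Figure 6.9.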
Using 16G features is impractical due to memory complexity, thus additional experiments were performed with 8G features. By using as few as 10 neurons in the hidden layer, we can see that the perplexity on the evaluation set was reduced from 137 to 127 - even after interpolation with the backoff model, the difference is significant (126 to 120). Even models with more neurons, such as RNNME-40, improved considerably - we can see that by using more memory and more features, the perplexity of the RNNME-40 model decreased from 131 to 117. The training progress of RNN, RNNME-40 with 1G hash and the new RNNME-40 with 8G hash is shown in Figure 6.10.

[Figure 6.9: Improvements over the KN4 model obtained with RNN and RNNME models with increasing size of the hidden layer. Horizontal axis: hidden layer size (log scale); vertical axis: entropy reduction per word over KN4 [bits]; curves: RNN + KN4 and RNNME + KN4.]

Unfortunately, we were not able to run new lattice rescoring experiments due to the graduation of Anoop Deoras and restrictions on the use of the IBM recognizer, but it can be expected that the WER would also be much lower with the new models with larger hash and more features. Lastly, experiments with even more features were performed - adding 5-gram features does not seem to help, while adding skip-1 gram features helps a bit.

It is also interesting to compare the performance of the RNN and RNNME architectures as the amount of training data increases. With more training data, the optimal size of the hidden layer increases, as the model must have enough parameters to encode all patterns. In the previous chapter, it was shown that the improvements from the neural net language models actually increase with more training data, which is a very optimistic result. However, with more training data it is also necessary to increase the size of the hidden layer - here we show that if the hidden layer size is kept constant, the simple RNN architecture provides smaller improvements over the baseline n-gram model as the number of training words increases. A very interesting empirical result is that the RNNME architecture still
