Statistical Language Models based on Neural Networks - Faculty of ...
4.6 Conclusion of the Model Combination Experiments
We have achieved new state-of-the-art results on the well-known Penn Treebank Corpus, reducing the perplexity from the baseline 141.2 to 83.5 by combining many advanced language modeling techniques. Perplexity was further reduced to 79.4 by using adaptive linear interpolation of models and by using a larger learning rate for dynamic RNN models. These experiments were already described in [51].
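The linear interpolation mentioned above combines the probability estimates of several models into a single estimate using non-negative weights that sum to one; a minimal sketch follows, where the model names, probability values, and weights are illustrative placeholders rather than values from these experiments.

```python
# Hedged sketch: linear interpolation of several language model
# probabilities for a single predicted word. In practice the weights
# are tuned on held-out data (and, in the adaptive variant, updated
# as text is processed); here they are fixed illustrative values.

def interpolate(probs, weights):
    """Combine per-model probabilities P_m(w | h) into one estimate.

    probs and weights are parallel lists; weights must sum to 1 so
    that the result remains a valid probability.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, probs))

# Illustrative probabilities assigned to the next word by three models
# (an RNN LM, a KN5 n-gram model, and a cache model).
p_rnn, p_kn5, p_cache = 0.12, 0.05, 0.30
p = interpolate([p_rnn, p_kn5, p_cache], [0.5, 0.3, 0.2])
print(round(p, 4))  # 0.5*0.12 + 0.3*0.05 + 0.2*0.30 = 0.135
```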
In subsequent experiments, we were able to obtain perplexity 78.8 by also including in the model combination the RNNME models that will be described in Chapter 6. This corresponds to an 11.8% reduction of entropy over a 5-gram model with modified Kneser-Ney smoothing and no count cutoffs, which is more than twice the entropy reduction achieved by the best previously published result on the Penn Treebank data set.
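The entropy figure follows directly from the reported perplexities: per-word entropy is the base-2 logarithm of perplexity, and the relative reduction compares the combined model (perplexity 78.8) against the KN5 baseline (perplexity 141.2). A short sketch of that arithmetic:

```python
import math

def entropy_reduction(ppl_baseline, ppl_model):
    """Relative entropy reduction between two models, where per-word
    entropy H = log2(perplexity)."""
    h_base = math.log2(ppl_baseline)
    h_model = math.log2(ppl_model)
    return (h_base - h_model) / h_base

# KN5 baseline perplexity 141.2 vs. combined model perplexity 78.8,
# as reported in the text above.
r = entropy_reduction(141.2, 78.8)
print(f"{100 * r:.1f}%")  # 11.8%
```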
It is quite important and interesting to realize that we can rely on just a few techniques to reach near-optimal performance. The combination of RNNLMs and a KN5 model with a cache is very simple and straightforward. All these techniques are purely data driven, with no need for extra domain knowledge. This is in contrast to techniques that rely, for example, on syntactic parsers, which require human-annotated data. Thus, my conclusion from the experiments with the Penn Treebank corpus is that techniques that focus on the modeling outperform techniques that focus on the features and attempt to incorporate knowledge provided by human experts. This might suggest that the task of learning language should focus more on the learning itself than on hand-designing features and complex models by linguists. I believe that systems that rely on extra information provided by humans may be useful in the short term, but in the long term, machine learning algorithms will improve and overtake rule-based systems, as there is great availability of unstructured data. Just by looking at the evolution of the speech recognition field, it is possible to observe this drift towards statistical learning. Interestingly, research scientists from big companies such as Google also claim that systems without special linguistic features work as well as, if not better than, those with them [58].