Statistical Language Models based on Neural Networks - Faculty of ...
Table 7.4: Accuracy of different language modeling techniques on the Microsoft Sentence
Completion Challenge task. Human performance is 91% accuracy [83].

Model        Perplexity   Accuracy [%]
random       -            20.0
GT3          92           36.0
GT5          87           38.4
KN5          84           40.0
RNNME-50     68           43.5
RNNME-100    65           45.1
RNNME-200    63           47.7
RNNME-300    60           49.3
models, the perplexity - with the difference that the sentence completion challenge focuses
on the informative words that occur infrequently. Results obtained with various n-gram
models and RNNME models are summarized in Table 7.4. The models were trained on
about 50M tokens using a 200K vocabulary, as a link to the training data was provided
in [83].
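Perplexity, reported in the second column of Table 7.4, can be computed directly from the per-token probabilities that a model assigns to the test data. A minimal sketch (the function name and interface are illustrative, not taken from the thesis):

```python
import math

def perplexity(log_probs):
    """Perplexity from natural-log token probabilities.

    log_probs: list of ln p(w_i | context) for each test token;
    perplexity = exp(-(1/N) * sum of log-probabilities).
    """
    return math.exp(-sum(log_probs) / len(log_probs))
```

For example, a model that assigns probability 0.25 to every token has perplexity 4, i.e. it is as uncertain as a uniform choice among four words.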
RNNME language models perform much better than the usual n-gram models on this
task: their ability to represent longer context patterns is clearly useful here. While
n-gram models achieve accuracy about 20% above random performance, the largest RNNME
model is almost 30% above it. Still, this is far from human performance,
which is 91% accuracy.
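The evaluation itself amounts to scoring each candidate completion with the language model and choosing the most probable sentence. The following sketch illustrates this with a toy add-one-smoothed bigram model; the smoothing choice and all names are illustrative assumptions, not the GT/KN/RNNME models from Table 7.4:

```python
from collections import Counter
import math

def train_bigram(tokens):
    """Return a log-probability scorer for a toy add-one-smoothed bigram LM."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def logprob(sentence):
        # sum of smoothed bigram log-probabilities over the sentence
        lp = 0.0
        for prev, cur in zip(sentence, sentence[1:]):
            lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
        return lp

    return logprob

def complete(sentence_with_blank, candidates, logprob):
    """Fill the blank token "___" with each candidate and keep the best-scoring one."""
    def fill(word):
        return [word if t == "___" else t for t in sentence_with_blank]
    return max(candidates, key=lambda w: logprob(fill(w)))
```

Any language model that assigns a probability to a whole sentence can be plugged in as the scorer, which is why perplexity and completion accuracy are so closely related on this task.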
We can think of models that would focus more on the task itself: the basic objective
function for usual language models is to minimize the entropy of the training data, while in
the case of the sentence completion challenge, we are more interested in capturing patterns
between infrequent words. A simple task-specific modification is to train models
on data from which frequent words are discarded. This reduces the number of possible
parameters of n-gram models, freeing them to capture regularities between infrequent words. In the
following experiments, the 200 most frequent words were discarded from both the training
and test data.
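The preprocessing step described above is straightforward to sketch: build the stoplist from the training counts, then filter both corpora with it. The function name and parameters below are illustrative, not taken from the thesis:

```python
from collections import Counter

def discard_frequent(train_tokens, test_tokens, k=200):
    """Drop the k most frequent training words from both train and test data.

    The stoplist is taken from the *training* counts only, so train and
    test are filtered consistently, as in the experiments described above.
    """
    stoplist = {w for w, _ in Counter(train_tokens).most_common(k)}
    keep = lambda toks: [t for t in toks if t not in stoplist]
    return keep(train_tokens), keep(test_tokens)
```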
It can be observed that n-gram models trained on such modified training
data give much better accuracy. However, as a lot of possibly important information
is discarded, the RNNME models do not have the possibility to significantly overcome the