ALP can be used to obtain prior probabilities of any sequential data, thus it provides a theoretical solution to statistical language modeling. As mentioned before, ALP is not computable (because of the halting problem); however, it is mentioned here to justify our later experiments with model combination. Different language modeling techniques can be seen as individual components in eq. 2.2, where instead of using the description length of individual models for normalization, we use the performance of the model on some validation data to obtain its weight². More details about concepts such as ALP and Minimum Description Length (MDL) will be given in Chapter 8.
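As a simple illustration of this kind of model combination, the following sketch (with hypothetical function names, restricted to two component models for brevity) chooses the interpolation weight by minimizing cross-entropy on held-out validation data, rather than deriving it from description lengths:

    import math

    def interpolate(models, weights, history, word):
        # Probability of `word` under a weighted mixture of component models.
        # Each model is a callable returning a (nonzero) p(word | history).
        return sum(w * m(history, word) for m, w in zip(models, weights))

    def tune_weight(model_a, model_b, validation, steps=20):
        # Grid search over the interpolation weight of two models, keeping
        # the weight that minimizes cross-entropy (bits per word) on the
        # validation data - the "performance on validation data" that is
        # used here in place of a description length.
        best_w, best_h = None, float("inf")
        for i in range(steps + 1):
            w = i / steps
            h = 0.0
            for history, word in validation:
                p = interpolate((model_a, model_b), (w, 1.0 - w), history, word)
                h -= math.log2(p)
            h /= len(validation)
            if h < best_h:
                best_w, best_h = w, h
        return best_w, best_h

The same idea extends to more than two components; the grid search is only the simplest possible way to fit the weights.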
Another work worth mentioning was done by Mahoney [44], who has shown that the problem of finding the best model of data is actually equivalent to the problem of general data compression. Compression can be seen as two problems: data modeling and coding. Since coding is solved optimally by arithmetic coding, data compression can be seen as just a data modeling problem. Mahoney, together with M. Hutter, also organizes a competition whose aim is to reach the best possible compression results on a given data set (mostly containing Wikipedia text), known as the Hutter Prize competition. As compression of text is almost equivalent to the language modeling task, I follow the same idea and try to reach the best achievable results on a single well-known data set, the Penn Treebank Corpus, where it is possible to compare (and combine) results of techniques developed by several other researchers.
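To make the link between modeling and compression explicit, the following sketch (the model interface is an assumption made here, and the small bookkeeping overhead of a real arithmetic coder is ignored) computes the number of bits an ideal arithmetic coder would need, which is simply the negative log2 probability that the model assigns to the data:

    import math

    def ideal_code_length_bits(model, text):
        # Number of bits an ideal arithmetic coder would need to encode `text`,
        # given a model returning p(next symbol | preceding symbols): each
        # symbol costs -log2 of its predicted probability, so the compressed
        # size is determined entirely by the quality of the model.
        total = 0.0
        for i, symbol in enumerate(text):
            p = model(text[:i], symbol)
            total += -math.log2(p)
        return total

    # A hypothetical uniform model over 27 characters compresses nothing:
    uniform = lambda history, symbol: 1.0 / 27
    print(ideal_code_length_bits(uniform, "hello world"))  # 11 * log2(27), about 52.3 bits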
An important drawback of perplexity is that it obscures achieved improvements. Usually, improvements in perplexity are measured as a percentage decrease over the baseline value, which is a misleading but widely accepted practice. In Table 2.1, it is shown that a constant relative perplexity improvement translates to different entropy reductions. For example, it will be shown in Chapter 7 that advanced LM techniques provide similar relative reductions of entropy for word-based and character-based models, while a perplexity comparison would fail completely in such a case. Thus, perplexity results will be reported as a convenient measure for quick comparison, but improvements will be mainly reported using entropy.
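The following sketch, using hypothetical perplexity values rather than the numbers from Table 2.1, shows why: the same 10% perplexity reduction corresponds to a much smaller relative entropy reduction for a typical word-level model than for a character-level one:

    import math

    def entropy(ppl):
        # Per-token entropy in bits corresponding to a given perplexity.
        return math.log2(ppl)

    # Hypothetical baseline perplexities chosen only for illustration:
    for name, base_ppl in [("word-level", 140.0), ("character-level", 3.0)]:
        new_ppl = base_ppl * 0.9                    # the same "10% perplexity improvement"
        dh = entropy(base_ppl) - entropy(new_ppl)   # absolute entropy reduction in bits
        rel = 100.0 * dh / entropy(base_ppl)        # relative entropy reduction
        print(f"{name}: {dh:.3f} bits, {rel:.1f}% relative entropy reduction")
    # word-level: 0.152 bits, 2.1% relative entropy reduction
    # character-level: 0.152 bits, 9.6% relative entropy reduction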
² It can be argued that since most of the models commonly used in language modeling are not Turing-complete - such as finite state machines - using the description length of these models would be inappropriate.