Statistical Language Models based on Neural Networks - Faculty of ...
or using matrix-vector notation as

U(t+1) = U(t) + w(t) e_h(t)^T α − U(t) β.    (3.15)
Note that only one neuron is active at a given time in the input vector w(t). As can be
seen from equation 3.14, the weight change for neurons with zero activation is zero,
so the computation can be sped up by updating only the weights that correspond to the
active input neuron. The recurrent weights W are updated as
w_lj(t+1) = w_lj(t) + s_l(t−1) e_hj(t) α − w_lj(t) β    (3.16)

or using matrix-vector notation as

W(t+1) = W(t) + s(t−1) e_h(t)^T α − W(t) β.    (3.17)
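The update rules above, and the speedup obtained from the 1-of-V input coding, can be sketched in NumPy as follows (a minimal illustration with hypothetical sizes; the error vector e_h(t) is assumed to be already computed by backpropagation, α is the learning rate and β the regularization parameter):

```python
import numpy as np

V, H = 5, 3               # hypothetical vocabulary and hidden-layer sizes
alpha, beta = 0.1, 1e-4   # learning rate and regularization parameter

rng = np.random.default_rng(0)
U = rng.standard_normal((V, H)) * 0.1   # input -> hidden weights
W = rng.standard_normal((H, H)) * 0.1   # recurrent hidden -> hidden weights

w_t = np.zeros(V)
w_t[2] = 1.0                            # 1-of-V encoding of the current word w(t)
s_prev = rng.standard_normal(H)         # previous hidden state s(t-1)
e_h = rng.standard_normal(H)            # error vector e_h(t) at the hidden layer

# Equation (3.15): U(t+1) = U(t) + w(t) e_h(t)^T alpha - U(t) beta
U_full = U + np.outer(w_t, e_h) * alpha - U * beta

# Since w(t) is one-hot, the gradient term of the outer product is zero
# for all rows except the one of the active input neuron; updating only
# that row (the decay term still touches all weights) gives the same result.
U_sparse = U - U * beta                 # weight decay for all weights
U_sparse[2] += e_h * alpha              # gradient term only for the active word

assert np.allclose(U_full, U_sparse)

# Equation (3.17): W(t+1) = W(t) + s(t-1) e_h(t)^T alpha - W(t) beta
W_new = W + np.outer(s_prev, e_h) * alpha - W * beta
```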
3.3.1 Backpropagation Through Time
The training algorithm presented in the previous section is further denoted as normal
backpropagation, as the RNN is trained in the same way as a normal feedforward network
with one hidden layer, with the only exception that the state of the input layer depends
on the state of the hidden layer from the previous time step.
However, such a training approach is not optimal: the network tries to optimize
prediction of the next word given the previous word and the previous state of the
hidden layer, but no effort is devoted to actually storing in the hidden-layer state
information that can be useful in the future. If the network remembers some long-context
information in the state of the hidden layer, it is so more by luck than by design.
However, a simple extension of the training algorithm can ensure that the network will
learn what information to store in the hidden layer - this is the so-called Backpropagation
Through Time algorithm. The idea is simple: a recurrent neural network with one hidden
layer that is used for N time steps can be seen as a deep feedforward network with
N hidden layers (where the hidden layers have the same dimensionality and the unfolded
recurrent weight matrices are identical). This idea has already been described in [53], and
is illustrated in Figure 3.2.
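This unfolded view can be sketched as follows (a minimal NumPy illustration with hypothetical sizes; the sigmoid activation and the recurrence s(t) = f(U^T w(t) + W^T s(t−1)) are assumed from the recurrent-network definitions earlier in the chapter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, H, N = 5, 3, 4                        # hypothetical vocabulary, hidden size, steps
rng = np.random.default_rng(1)
U = rng.standard_normal((V, H)) * 0.1    # input -> hidden weights
W = rng.standard_normal((H, H)) * 0.1    # recurrent weights, shared across all steps

words = [0, 2, 1, 4]                     # indices of the last N input words
s = np.zeros(H)                          # initial hidden state
states = []
for w_idx in words:
    # Each step is one "layer" of the unfolded feedforward network;
    # because the input is one-hot, U^T w(t) is simply row w_idx of U.
    s = sigmoid(U[w_idx] + W.T @ s)
    states.append(s)
```

Because the same U and W are reused in every unfolded layer, gradients from all N layers accumulate into the single shared set of weights, which is what lets the network learn what to store in the hidden state.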
Such a deep feedforward network can be trained by normal gradient descent. Errors