where $\mathbf{d}(t)$ is a target vector that represents the word $w(t+1)$ that should have been predicted (encoded again as a 1-of-V vector). Note that it is important to use cross entropy and not mean square error (MSE), which is a common mistake. The network would still work, but the results would be suboptimal (at least, if our objective is to minimize entropy, perplexity or word error rate, or to maximize compression ratio). Weights $\mathbf{V}$ between the hidden layer $\mathbf{s}(t)$ and the output layer $\mathbf{y}(t)$ are updated as
$$v_{jk}(t+1) = v_{jk}(t) + s_j(t)\, e_{o_k}(t)\, \alpha \tag{3.9}$$
where $\alpha$ is the learning rate, $j$ iterates over the size of the hidden layer and $k$ over the size of the output layer, $s_j(t)$ is the output of the $j$-th neuron in the hidden layer and $e_{o_k}(t)$ is the error gradient of the $k$-th neuron in the output layer. If L2 regularization is used, the equation changes to
$$v_{jk}(t+1) = v_{jk}(t) + s_j(t)\, e_{o_k}(t)\, \alpha - v_{jk}(t)\, \beta \tag{3.10}$$
where $\beta$ is the regularization parameter; in the following experiments its value is $\beta = 10^{-6}$. Regularization is used to keep the weights close to zero². Using matrix-vector notation, equation 3.10 changes to
$$\mathbf{V}(t+1) = \mathbf{V}(t) + \mathbf{s}(t)\, \mathbf{e_o}(t)^T \alpha - \mathbf{V}(t)\, \beta. \tag{3.11}$$
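A minimal NumPy sketch of this update (the function name is illustrative; the sketch assumes a softmax output layer trained with cross entropy, for which the output error gradient takes the simple form $\mathbf{e_o}(t) = \mathbf{d}(t) - \mathbf{y}(t)$):

```python
import numpy as np

def update_output_weights(V, s, e_o, alpha=0.1, beta=1e-6):
    """One SGD step for the output weights V, as in equation 3.11.

    V   : (H, K) weights between hidden layer s(t) and output layer y(t)
    s   : (H,)   hidden-layer activations s(t)
    e_o : (K,)   output error gradient; for softmax + cross entropy
                 this is d(t) - y(t) (the assumption stated above)
    """
    # np.outer(s, e_o) is the outer product s(t) e_o(t)^T: weight v_jk
    # moves by alpha * s_j(t) * e_ok(t); -V * beta is the L2 decay term.
    return V + np.outer(s, e_o) * alpha - V * beta
```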
Next, gradients of errors are propagated from the output layer to the hidden layer

$$\mathbf{e_h}(t) = d_h\!\left(\mathbf{e_o}(t)^T \mathbf{V},\; t\right), \tag{3.12}$$
where the error vector is obtained using the function $d_h()$ that is applied element-wise:

$$d_{h_j}(x, t) = x\, s_j(t)\, (1 - s_j(t)). \tag{3.13}$$
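Continuing the sketch, equations 3.12 and 3.13 combine into a few lines; with $\mathbf{V}$ stored as an (H, K) matrix as above, the product $\mathbf{e_o}(t)^T \mathbf{V}$ is computed as `V @ e_o` so that the result is a hidden-sized vector:

```python
def hidden_error(V, s, e_o):
    """Backpropagate the output error to the hidden layer (eqs. 3.12-3.13)."""
    x = V @ e_o                # e_o(t)^T V: spreads the output error
                               # over the hidden neurons
    return x * s * (1.0 - s)   # d_hj(x, t) = x * s_j(t) * (1 - s_j(t)),
                               # the derivative of the sigmoid activation
```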
Weights $\mathbf{U}$ between the input layer $\mathbf{w}(t)$ and the hidden layer $\mathbf{s}(t)$ are then updated as

$$u_{ij}(t+1) = u_{ij}(t) + w_i(t)\, e_{h_j}(t)\, \alpha - u_{ij}(t)\, \beta \tag{3.14}$$
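Finally, a sketch of equation 3.14. Because $\mathbf{w}(t)$ is a 1-of-V vector, the gradient term touches only a single row of $\mathbf{U}$ (an index lookup in practice), while the decay term applies to every weight:

```python
def update_input_weights(U, w, e_h, alpha=0.1, beta=1e-6):
    """One SGD step for the input weights U, as in equation 3.14.

    U   : (V, H) weights between input layer w(t) and hidden layer s(t)
    w   : (V,)   1-of-V encoding of the current word w(t)
    e_h : (H,)   hidden-layer error from equation 3.12
    """
    # w has a single nonzero element, so np.outer(w, e_h) updates only
    # the row for the current word; -U * beta decays all weights.
    return U + np.outer(w, e_h) * alpha - U * beta
```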
² A quick explanation of regularization appeals to Occam's razor: simpler solutions should be preferred, and small numbers can be stored more compactly than large ones; thus, models with small weights should generalize better.