
where $d(t)$ is a target vector that represents the word $w(t+1)$ that should have been predicted (encoded again as a 1-of-$V$ vector). Note that it is important to use cross entropy and not mean square error (MSE), which is a common mistake. The network would still work, but the results would be suboptimal (at least, if our objective is to minimize entropy, perplexity or word error rate, or to maximize the compression ratio). Weights $\mathbf{V}$ between the hidden layer $s(t)$ and the output layer $y(t)$ are updated as

$$v_{jk}(t+1) = v_{jk}(t) + s_j(t)\, e_{o_k}(t)\, \alpha \qquad (3.9)$$

where $\alpha$ is the learning rate, $j$ iterates over the size of the hidden layer and $k$ over the size of the output layer, $s_j(t)$ is the output of the $j$-th neuron in the hidden layer and $e_{o_k}(t)$ is the error gradient of the $k$-th neuron in the output layer. If L2 regularization is used, the equation changes to

$$v_{jk}(t+1) = v_{jk}(t) + s_j(t)\, e_{o_k}(t)\, \alpha - v_{jk}(t)\, \beta \qquad (3.10)$$

where $\beta$ is the regularization parameter; in the following experiments its value is $\beta = 10^{-6}$. Regularization is used to keep the weights close to zero². Using matrix-vector notation, equation 3.10 becomes

$$\mathbf{V}(t+1) = \mathbf{V}(t) + \mathbf{s}(t)\, \mathbf{e}_o(t)^T \alpha - \mathbf{V}(t)\, \beta. \qquad (3.11)$$
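To make the notation concrete, the following minimal NumPy sketch performs one step of the output-layer update in equation 3.11. The dimensions, the random initialization and the target word index are assumptions made only for the example; the output error vector is computed as $d(t) - y(t)$, following the cross-entropy gradient for the softmax output discussed above.

```python
# Minimal sketch of eq. 3.11 (assumed names and sizes; not the thesis toolkit).
import numpy as np

hidden_size, vocab_size = 100, 10000   # assumed layer sizes
alpha, beta = 0.1, 1e-6                # learning rate and L2 regularization

rng = np.random.default_rng(0)
V = rng.uniform(-0.1, 0.1, (hidden_size, vocab_size))  # hidden -> output weights
s = rng.random(hidden_size)                            # hidden activations s(t)

a = s @ V                        # pre-softmax activations, shape (vocab_size,)
y = np.exp(a - a.max())
y /= y.sum()                     # softmax output y(t)

d = np.zeros(vocab_size)
d[42] = 1.0                      # 1-of-V target d(t); index 42 is hypothetical
e_o = d - y                      # output error gradient under cross entropy

# Eq. 3.11: outer product of s(t) and e_o(t), plus weight decay on V.
V += np.outer(s, e_o) * alpha - V * beta
```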

Next, gradients of the errors are propagated from the output layer to the hidden layer

$$\mathbf{e}_h(t) = d_h\!\left(\mathbf{e}_o(t)^T \mathbf{V},\, t\right), \qquad (3.12)$$

where the error vector is obtained using the function $d_h()$ that is applied element-wise

$$d_{hj}(x, t) = x\, s_j(t)\,(1 - s_j(t)). \qquad (3.13)$$
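Continuing the same sketch, the propagation step of equations 3.12 and 3.13 reduces to a single line, with the sigmoid derivative $s_j(t)\,(1 - s_j(t))$ applied element-wise:

```python
# Eqs. 3.12-3.13: backpropagate e_o through V and the sigmoid derivative.
e_h = (e_o @ V.T) * s * (1.0 - s)   # hidden-layer error, shape (hidden_size,)
```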

Weights $\mathbf{U}$ between the input layer $w(t)$ and the hidden layer $s(t)$ are then updated as

$$u_{ij}(t+1) = u_{ij}(t) + w_i(t)\, e_{hj}(t)\, \alpha - u_{ij}(t)\, \beta \qquad (3.14)$$
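The input-layer update of equation 3.14 can be sketched in the same way. Because $w(t)$ is a 1-of-$V$ vector, the outer product is non-zero only in the row of $\mathbf{U}$ that corresponds to the current word; the dense form below follows the equation literally, and the word index is again a hypothetical example value.

```python
# Eq. 3.14: update input -> hidden weights U (dense form; weight decay is
# applied to every element, while the gradient term touches a single row).
U = rng.uniform(-0.1, 0.1, (vocab_size, hidden_size))  # input -> hidden weights
w = np.zeros(vocab_size)
w[7] = 1.0                        # 1-of-V input w(t); index 7 is hypothetical
U += np.outer(w, e_h) * alpha - U * beta
```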

² A quick explanation of why regularization helps follows from Occam's razor: simpler solutions should be preferred, and small numbers can be stored more compactly than large ones; thus, models with small weights should generalize better.

