
Figure 3.2: Recurrent neural network unfolded as a deep feedforward network, here for 3 time steps back in time. (Figure labels: input words w(t-2), w(t-1), w(t); hidden states s(t-3) through s(t); output y(t); weight matrices U, W, V.)
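To make the unfolded structure concrete, here is a minimal sketch (Python/NumPy, not taken from the thesis) of the forward pass assumed by Figure 3.2, using the common simple-recurrent formulation s(t) = sigmoid(U w(t) + W s(t-1)) and y(t) = softmax(V s(t)); the matrix names U, W, V follow the figure, while the function and variable names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward(words, U, W, V, s_init):
    """Run the recurrent network over a word sequence.

    words   -- list of 1-of-N encoded input vectors w(t)
    U, W, V -- input->hidden, recurrent and hidden->output weight matrices
    s_init  -- hidden state before the first word
    Returns the hidden states s(1..T) and output distributions y(1..T);
    the stored hidden states are exactly what BPTT needs later.
    """
    states, outputs = [s_init], []
    for w in words:
        s = sigmoid(U @ w + W @ states[-1])   # s(t) depends on w(t) and s(t-1)
        y = softmax(V @ s)                    # y(t): distribution over the next word
        states.append(s)
        outputs.append(y)
    return states, outputs
```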

are propagated from the hidden layer s(t) to the hidden layer from the previous time step s(t−1) and the recurrent weight matrix (denoted as W in Figure 3.2) is updated. Error propagation is done recursively as follows (note that the algorithm requires the states of the hidden layer from the previous time steps to be stored):

\[
e_h(t-\tau-1) = d_h\left(e_h(t-\tau)^{T}\, W,\ t-\tau-1\right) \qquad (3.18)
\]
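As a rough illustration of the recursion in equation 3.18 and of the truncated unfolding discussed below, here is a Python/NumPy sketch (not from the thesis). It assumes sigmoid hidden units, so that d_h multiplies the incoming error elementwise by s(t)(1 − s(t)) (the role played by equation 3.13, which is not reproduced in this excerpt); all function and variable names are illustrative.

```python
import numpy as np

def d_h(x, s_t):
    """Assumed elementwise derivative for sigmoid hidden units:
    d_h(x, t) = x * s(t) * (1 - s(t))."""
    return x * s_t * (1.0 - s_t)

def bptt_errors(e_h_t, states, W, tau_max):
    """Propagate the hidden-layer error back in time (eq. 3.18).

    e_h_t   -- error vector at the hidden layer for the current step, e_h(t)
    states  -- stored hidden states, states[k] = s(k), with states[-1] = s(t)
    W       -- recurrent weight matrix
    tau_max -- number of unfolding steps (truncated BPTT)
    Returns [e_h(t), e_h(t-1), ..., e_h(t - tau_max)].
    """
    errors = [e_h_t]
    for tau in range(tau_max):
        # eq. 3.18: e_h(t-tau-1) = d_h(e_h(t-tau)^T W, t-tau-1)
        prev_state = states[-(tau + 2)]            # s(t - tau - 1)
        errors.append(d_h(errors[-1] @ W, prev_state))
    return errors
```

In the standard formulation, the gradient for the recurrent matrix W is then accumulated from the outer products of e_h(t−τ) with s(t−τ−1) over the unfolded steps, which is why the earlier hidden states have to be stored.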

The function d_h is defined in equation 3.13. The unfolding can be applied for as many time steps as there are training examples already seen; however, the error gradients quickly vanish as they are backpropagated in time [4] (in rare cases the errors can explode), so several steps of unfolding are sufficient (this is sometimes referred to as truncated BPTT). While for word-based LMs it seems to be sufficient to unfold the network for about 5 time steps, it is interesting to notice that this still allows the network to learn to store

