
or using matrix-vector notation as

U(t+1) = U(t) + w(t) e_h(t)^T α − U(t) β.    (3.15)

Note that only one neuron is active at a given time in the input vector w(t). As can be seen from equation 3.14, the weight change for neurons with zero activation is zero, thus the computation can be sped up by updating only the weights that correspond to the active input neuron. The recurrent weights W are updated as

w_{lj}(t+1) = w_{lj}(t) + s_l(t−1) e_{hj}(t) α − w_{lj}(t) β    (3.16)

or using matrix-vector notation as

W(t+1) = W(t) + s(t−1) e_h(t)^T α − W(t) β    (3.17)
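
For illustration, the updates above can be sketched in code as follows. This is a minimal numpy sketch rather than the actual implementation: the shape conventions (U stored as vocabulary × hidden, W as hidden × hidden, to match the matrix forms above) and the names w_idx, s_prev, e_h, alpha and beta are assumptions of the sketch, and the weight-decay term is applied only to the touched row of U for brevity.

```python
import numpy as np

def update_weights(U, W, w_idx, s_prev, e_h, alpha, beta):
    """Sketch of the updates in equations 3.15-3.17.

    Because w(t) is a one-hot vector, the gradient term w(t) e_h(t)^T is
    non-zero only in the row of U that belongs to the active input neuron
    (the current word, w_idx), so only that row needs to be touched.
    Applying the decay term -U(t)*beta only to the same row is a
    simplification of this sketch; formally it applies to the whole matrix."""
    U[w_idx] += alpha * e_h - beta * U[w_idx]       # sparse form of eq. 3.15
    W += alpha * np.outer(s_prev, e_h) - beta * W   # eq. 3.17 (per-element: eq. 3.16)
    return U, W
```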

3.3.1 Backpropagation Through Time

The training algorithm presented in the previous section is further denoted as normal backpropagation, as the RNN is trained in the same way as a normal feedforward network with one hidden layer, with the only exception that the state of the input layer depends on the state of the hidden layer from the previous time step.
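
To make this description concrete, the forward step of such a network could look as follows. This is a minimal sketch under the model assumptions used earlier in this chapter (sigmoid hidden layer, softmax output layer, output weight matrix denoted V here); the shapes and variable names follow the illustrative conventions of the previous sketch and are not taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(U, W, V, w_idx, s_prev):
    """One forward step of the recurrent network, treated as a feedforward
    network with a single hidden layer: the effective input consists of the
    current one-hot word w(t) (given by its index w_idx) together with the
    hidden state s(t-1) copied over from the previous time step.
    Shapes assumed: U (vocab, hidden), W (hidden, hidden), V (hidden, vocab)."""
    s = sigmoid(U[w_idx] + W.T @ s_prev)   # hidden state s(t)
    y = softmax(V.T @ s)                   # distribution over the next word
    return s, y
```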

However, it can be seen that such a training approach is not optimal: the network tries to optimize prediction of the next word given the previous word and the previous state of the hidden layer, but no effort is devoted to actually storing in the hidden layer state information that can be useful in the future. If the network remembers some long context information in the state of the hidden layer, it is so more by luck than by design.

However, a simple extension of the training algorithm can ensure that the network will learn what information to store in the hidden layer: this is the so-called Backpropagation Through Time algorithm. The idea is simple: a recurrent neural network with one hidden layer that is used for N time steps can be seen as a deep feedforward network with N hidden layers (where the hidden layers have the same dimensionality and the unfolded recurrent weight matrices are identical). This idea has already been described in [53], and is illustrated in Figure 3.2.
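
The unfolding and the corresponding gradient computation could be sketched as follows. Again, this is a minimal illustrative sketch rather than the actual implementation: the names and shape conventions follow the previous sketches, the weight-decay terms are omitted, the error is injected only at the last time step, and the state before the unfolded window is simply taken as zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bptt_update(U, W, V, words, target, alpha, N):
    """Backpropagation through time for one prediction, sketched by
    unfolding the network for the last N time steps.

    `words` holds the indices of the last N input words (oldest first) and
    `target` is the index of the word to be predicted after the last of them.
    The unfolded network is a deep feedforward network with N hidden layers
    that all share the matrices U and W, so their gradients are accumulated
    across the unfolded steps."""
    hidden = W.shape[0]

    # forward pass through the unfolded network, keeping every hidden state
    states = [np.zeros(hidden)]                  # stands in for s(t-N)
    for w_idx in words:
        states.append(sigmoid(U[w_idx] + W.T @ states[-1]))

    # error at the output layer, for the last time step only
    y = softmax(V.T @ states[-1])
    e_o = -y
    e_o[target] += 1.0

    # error at the topmost copy of the hidden layer
    e_h = (V @ e_o) * states[-1] * (1.0 - states[-1])
    V += alpha * np.outer(states[-1], e_o)

    # backward pass through the N unfolded hidden layers; because the
    # weights are shared, the gradients for U and W are summed over steps
    dU = np.zeros_like(U)
    dW = np.zeros_like(W)
    for k in range(N, 0, -1):
        dU[words[k - 1]] += e_h                  # one-hot input: single row
        dW += np.outer(states[k - 1], e_h)       # s(t-1) e_h(t)^T
        e_h = (W @ e_h) * states[k - 1] * (1.0 - states[k - 1])
    U += alpha * dU
    W += alpha * dW
    return states[-1]
```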

Such a deep feedforward network can be trained by normal gradient descent. Errors

