Statistical Language Models based on Neural Networks - Faculty of ...
or using matrix-vector notation as

U(t+1) = U(t) + w(t) e_h(t)^T α − U(t) β.    (3.15)
Note that only one neuron is active at a given time in the input vector w(t). As can be
seen from equation 3.14, the weight change for neurons with zero activation is zero,
so the computation can be sped up by updating only the weights that correspond to the
active input neuron. The recurrent weights W are updated as
w_lj(t+1) = w_lj(t) + s_l(t−1) e_hj(t) α − w_lj(t) β    (3.16)

or using matrix-vector notation as

W(t+1) = W(t) + s(t−1) e_h(t)^T α − W(t) β.    (3.17)
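The update rules above, and the speedup obtained from the 1-of-V input coding, can be sketched in NumPy as follows (a minimal illustration with hypothetical sizes; the error vector e_h(t) is assumed to be already computed by backpropagation, α is the learning rate and β the regularization parameter):

```python
import numpy as np

V, H = 5, 3               # hypothetical vocabulary and hidden-layer sizes
alpha, beta = 0.1, 1e-4   # learning rate and regularization parameter

rng = np.random.default_rng(0)
U = rng.standard_normal((V, H)) * 0.1   # input -> hidden weights
W = rng.standard_normal((H, H)) * 0.1   # recurrent hidden -> hidden weights

w_t = np.zeros(V)
w_t[2] = 1.0                            # 1-of-V encoding of the current word w(t)
s_prev = rng.standard_normal(H)         # previous hidden state s(t-1)
e_h = rng.standard_normal(H)            # error vector e_h(t) at the hidden layer

# Equation (3.15): U(t+1) = U(t) + w(t) e_h(t)^T alpha - U(t) beta
U_full = U + np.outer(w_t, e_h) * alpha - U * beta

# Since w(t) is one-hot, the gradient term of the outer product is zero
# for all rows except the one of the active input neuron; updating only
# that row (the decay term still touches all weights) gives the same result.
U_sparse = U - U * beta                 # weight decay for all weights
U_sparse[2] += e_h * alpha              # gradient term only for the active word

assert np.allclose(U_full, U_sparse)

# Equation (3.17): W(t+1) = W(t) + s(t-1) e_h(t)^T alpha - W(t) beta
W_new = W + np.outer(s_prev, e_h) * alpha - W * beta
```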
3.3.1 Backpropagation Through Time
The training algorithm presented in the previous section is further denoted as normal
backpropagation, as the RNN is trained in the same way as a normal feedforward network
with one hidden layer, with the only exception that the state of the input layer depends
on the state of the hidden layer from the previous time step.
However, such a training approach is not optimal: the network tries to optimize
prediction of the next word given the previous word and the previous state of the
hidden layer, but no effort is devoted to actually storing in the hidden-layer state
information that can be useful in the future. If the network remembers some long-context
information in the state of the hidden layer, it is so more by luck than by design.
However, a simple extension of the training algorithm can ensure that the network will
learn what information to store in the hidden layer - this is the so-called Backpropagation
Through Time algorithm. The idea is simple: a recurrent neural network with one hidden
layer that is used for N time steps can be seen as a deep feedforward network with
N hidden layers (where the hidden layers have the same dimensionality and the unfolded
recurrent weight matrices are identical). This idea has already been described in [53], and
is illustrated in Figure 3.2.
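This unfolded view can be sketched as follows (a minimal NumPy illustration with hypothetical sizes; the sigmoid activation and the recurrence s(t) = f(U^T w(t) + W^T s(t−1)) are assumed from the recurrent-network definitions earlier in the chapter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, H, N = 5, 3, 4                        # hypothetical vocabulary, hidden size, steps
rng = np.random.default_rng(1)
U = rng.standard_normal((V, H)) * 0.1    # input -> hidden weights
W = rng.standard_normal((H, H)) * 0.1    # recurrent weights, shared across all steps

words = [0, 2, 1, 4]                     # indices of the last N input words
s = np.zeros(H)                          # initial hidden state
states = []
for w_idx in words:
    # Each step is one "layer" of the unfolded feedforward network;
    # because the input is one-hot, U^T w(t) is simply row w_idx of U.
    s = sigmoid(U[w_idx] + W.T @ s)
    states.append(s)
```

Because the same U and W are reused in every unfolded layer, gradients from all N layers accumulate into the single shared set of weights, which is what lets the network learn what to store in the hidden state.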
Such a deep feedforward network can be trained by normal gradient descent. Errors