Statistical Language Models based on Neural Networks - Faculty of ...
y(t) = g(V s(t))    (3.5)
The output layer y represents a probability distribution of the next word w_{t+1} given
the history. The time complexity of one training or test step is proportional to
O = H × H + H × V = H × (H + V)    (3.6)
where H is the size of the hidden layer and V is the size of the vocabulary.
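As a rough illustration of equations (3.5) and (3.6), the sketch below computes the output layer for a given hidden state and evaluates the per-step cost. It assumes that g is a softmax (the text only requires that y be a probability distribution) and that the H × H term corresponds to the recurrent hidden-layer update. The names `V_matrix`, `vocab_size`, `output_layer`, and `step_cost` are illustrative and do not come from the original text.

```python
import math


def softmax(z):
    """Numerically stable softmax (assumed form of g in eq. 3.5)."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(V_matrix, s):
    """y(t) = g(V s(t)); V_matrix has one row per vocabulary word."""
    pre_activations = [sum(v * s_j for v, s_j in zip(row, s)) for row in V_matrix]
    return softmax(pre_activations)


def step_cost(H, vocab_size):
    """Cost of one step according to eq. 3.6: H x H + H x V = H x (H + V)."""
    return H * (H + vocab_size)


# Tiny example: 2 hidden units, 3-word vocabulary.
y = output_layer([[0.1, 0.2], [0.0, -0.1], [0.3, 0.3]], [0.5, 0.9])
cost = step_cost(100, 10000)
```

With H = 100 and V = 10000, the H × V term accounts for almost all of the cost, which shows how the output layer dominates the computation for large vocabularies.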
3.3 Learning Algorithm
Both the feedforward and the recurrent architecture of the neural network model can be
trained by stochastic gradient descent using the well-known backpropagation (BP) algorithm [65].
However, for better performance, the so-called backpropagation through time (BPTT) algorithm can
be used to propagate gradients of errors in the network back in time through the recurrent
weights, so that the model is trained to capture useful information in the state of the
hidden layer. With simple BP training, the recurrent network performs poorly in some
cases, as will be shown later (some comparison was already presented in [50]). The BPTT
algorithm has been described in [65], and a good description for a practical implementation
is in [9].
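To make the idea of propagating errors "back in time through the recurrent weights" more concrete, the following is a minimal sketch and not the implementation referenced above. It assumes a sigmoid hidden layer, so the derivative at each step is s(1 - s), and a recurrent weight matrix `W` indexed as `W[j][i]` (from previous hidden unit i to current hidden unit j). The function name `bptt_errors` and its arguments are hypothetical.

```python
def bptt_errors(W, states, err_now, steps):
    """Unfold the hidden-layer error backwards through the recurrent weights.

    W        -- recurrent weights, W[j][i]: previous unit i -> current unit j
    states   -- hidden states ordered in time; states[-1] is the current one
    err_now  -- error vector at the hidden layer for the current time step
    steps    -- how many time steps to go back (truncated to available history)

    Returns a list of error vectors: index 0 is the current step,
    index k is the error k steps in the past.
    """
    steps = min(steps, len(states) - 1)
    errors = [err_now]
    err = err_now
    for k in range(1, steps + 1):
        prev_state = states[-1 - k]
        # Multiply by the transpose of W to send the error to the previous step.
        back = [
            sum(W[j][i] * err[j] for j in range(len(err)))
            for i in range(len(prev_state))
        ]
        # Scale by the sigmoid derivative at the previous state (assumption).
        err = [b * s * (1.0 - s) for b, s in zip(back, prev_state)]
        errors.append(err)
    return errors
```

Each returned error vector would then be used to update the weights feeding the hidden layer at the corresponding time step. With plain BP, only the first entry of this list is used.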
With stochastic gradient descent, the weight matrices of the network are updated
after presenting every example. A cross entropy criterion is used to obtain the gradient of an
error vector in the output layer, which is then backpropagated to the hidden layer and, in
the case of BPTT, through the recurrent connections backwards in time. During training,
validation data are used for early stopping and to control the learning rate. Training
iterates over all training data in several epochs before convergence is achieved; usually,
8-20 epochs are needed. As will be shown in Chapter 6, the convergence speed of
training can be improved by randomizing the order of sentences in the training data, effectively
reducing the number of required training epochs (this was already observed in [5], and we
provide more details in [52]).
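The per-example update described above can be sketched for the output layer as follows. This is a simplified illustration under stated assumptions: the output activation is a softmax, the target is a 1-of-N vector for the next word, and only the hidden-to-output matrix is updated. With softmax and cross entropy, the error vector at the output reduces to the target minus the prediction. The names `sgd_output_step`, `V_matrix`, and `target_index` are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax (assumed output activation)."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_output_step(V_matrix, s, target_index, alpha=0.1):
    """One stochastic gradient step on the output weights for one example.

    Returns the cross entropy loss before the update and the error vector
    passed down to the hidden layer (before any hidden-layer derivative).
    """
    y = softmax([sum(v * s_j for v, s_j in zip(row, s)) for row in V_matrix])
    # Error vector: 1-of-N target minus predicted distribution.
    e_o = [(1.0 if k == target_index else 0.0) - y_k for k, y_k in enumerate(y)]
    # Error sent to the hidden layer, computed with the weights before the update.
    e_h = [
        sum(V_matrix[k][j] * e_o[k] for k in range(len(e_o)))
        for j in range(len(s))
    ]
    # In-place weight update after presenting this single example.
    for k, row in enumerate(V_matrix):
        for j in range(len(row)):
            row[j] += alpha * e_o[k] * s[j]
    return -math.log(y[target_index]), e_h


V_matrix = [[0.0, 0.0], [0.0, 0.0]]
loss_before, _ = sgd_output_step(V_matrix, [1.0, 0.5], target_index=0)
loss_after, _ = sgd_output_step(V_matrix, [1.0, 0.5], target_index=0)
```

Presenting the same example twice should lower the loss on the second pass, since the first update moves the weights toward the target word.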
The learning rate is controlled as follows. The starting learning rate is α = 0.1. The
same learning rate is used as long as a significant improvement on the validation data is
observed (in further experiments, we consider a reduction of the entropy by more than
0.3% to be a significant improvement). After no significant improvement is observed, the learning