
y(t) = g (Vs(t)) (3.5)

The output layer y represents a probability distribution of the next word w_{t+1} given the history. The time complexity of one training or test step is proportional to

O = H × H + H × V = H × (H + V) (3.6)

where H is the size of the hidden layer and V is the size of the vocabulary.
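As a rough illustration of Eq. 3.5 and Eq. 3.6, the following minimal numpy sketch computes the output distribution, assuming that g is the softmax function; the concrete sizes and the names V_out, s_t and vocab_size are illustrative assumptions, not taken from the text.

```python
import numpy as np

H, vocab_size = 100, 10000   # illustrative sizes only

rng = np.random.default_rng(0)
V_out = rng.normal(scale=0.1, size=(vocab_size, H))  # output weight matrix (V in Eq. 3.5)
s_t = rng.normal(size=H)                             # hidden layer state s(t)

def softmax(a):
    a = a - a.max()          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# y(t) = g(V s(t)): a probability distribution over the next word w_{t+1}
y_t = softmax(V_out @ s_t)
assert np.isclose(y_t.sum(), 1.0)

# The output product above costs H x vocab_size multiplications; together
# with the H x H recurrent update of s(t) this gives O = H x (H + vocab_size),
# matching Eq. 3.6.
```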

3.3 Learning Algorithm

Both the feedforward and the recurrent architecture of the neural network model can be trained by stochastic gradient descent using the well-known backpropagation algorithm [65]. However, for better performance, the so-called backpropagation through time (BPTT) algorithm can be used to propagate gradients of errors in the network back in time through the recurrent weights, so that the model is trained to capture useful information in the state of the hidden layer. With simple BP training, the recurrent network performs poorly in some cases, as will be shown later (some comparison was already presented in [50]). The BPTT algorithm has been described in [65], and a good description of a practical implementation is in [9].
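As a sketch of how errors can be propagated back in time through the recurrent weights, the code below unrolls the hidden layer for a fixed number of steps; it assumes a sigmoid hidden activation and illustrative names (W_rec, tau, states), and is only a truncated-BPTT outline, not the exact implementation described in [65] or [9].

```python
import numpy as np

def bptt_hidden_gradients(d_hidden, states, W_rec, tau):
    """Propagate the error arriving at the current hidden state back
    through `tau` steps of the recurrent connections.

    d_hidden : error at the current hidden state s(t), shape (H,)
    states   : past hidden states [..., s(t-1), s(t)], each shape (H,),
               with at least tau + 1 entries
    W_rec    : recurrent weight matrix, shape (H, H)
    tau      : number of steps to unfold back in time
    """
    grad_W_rec = np.zeros_like(W_rec)
    delta = d_hidden
    for step in range(tau):
        s_cur = states[-1 - step]    # s(t - step)
        s_prev = states[-2 - step]   # s(t - step - 1)
        # derivative of the sigmoid activation used for the hidden layer
        delta = delta * s_cur * (1.0 - s_cur)
        # accumulate the gradient for the recurrent weights
        grad_W_rec += np.outer(delta, s_prev)
        # pass the error one step further back in time
        delta = W_rec.T @ delta
    return grad_W_rec
```

With tau = 1 this degenerates to simple BP training, where the recurrent weights only receive the error from the current time step.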

With stochastic gradient descent, the weight matrices of the network are updated after presenting every example. A cross entropy criterion is used to obtain the gradient of the error vector in the output layer, which is then backpropagated to the hidden layer, and in the case of BPTT, through the recurrent connections backwards in time. During training, validation data are used for early stopping and to control the learning rate. Training iterates over all training data for several epochs before convergence is achieved - usually, 8-20 epochs are needed. As will be shown in Chapter 6, the convergence speed of the training can be improved by randomizing the order of sentences in the training data, effectively reducing the number of required training epochs (this was already observed in [5], and we provide more details in [52]).
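The per-example update with the cross entropy criterion can be sketched as follows; with softmax outputs, the error vector at the output layer is simply y(t) minus the one-hot target. The function name, the learning rate argument and the V_out matrix are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def sgd_output_update(y_t, target_index, s_t, V_out, lr=0.1):
    """One per-example SGD step at the output layer.

    Returns the error vector backpropagated to the hidden layer,
    which is the starting point for (truncated) BPTT.
    """
    e_out = y_t.copy()
    e_out[target_index] -= 1.0            # softmax + cross entropy: error = y(t) - one-hot target
    d_hidden = V_out.T @ e_out            # error passed back to the hidden layer
    V_out -= lr * np.outer(e_out, s_t)    # update after presenting this single example
    return d_hidden
```

The epoch loop would then visit the training sentences in a randomized order, e.g. by shuffling the list of sentences before each epoch, which corresponds to the randomization mentioned above.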

The learning rate is controlled as follows. The starting learning rate is α = 0.1. The same learning rate is used as long as significant improvement on the validation data is observed (in further experiments, we consider a reduction of the entropy by more than 0.3% to be a significant improvement). After no significant improvement is observed, the learning
