Statistical Language Models based on Neural Networks - Faculty of ...
Figure 6.1: Feedforward neural network 4-gram model (on the left) and Recurrent neural network language model (on the right).
6.2 Computational Complexity
The computational complexity of a basic neural network language model is very high for several reasons, and there have been many attempts to address almost all of them. The training time of an N-gram feedforward neural network language model is proportional to
I × W × ((N − 1) × D × H + H × V),  (6.3)
where I is the number of training epochs before convergence of the training is achieved, W is the number of tokens in the training set (in usual cases, words plus end-of-sentence symbols), N is the N-gram order, D is the dimensionality of words in the low-dimensional space, H is the size of the hidden layer and V is the size of the vocabulary (see Figure 6.1). The term (N − 1) × D is equal to the size of the projection layer.
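The magnitudes involved in (6.3) can be illustrated with a short calculation; the parameter values below are hypothetical examples chosen only for illustration, not taken from the text:

```python
# Training cost of a feedforward N-gram NNLM, following Eq. (6.3).
# All parameter values here are hypothetical illustrative choices.
def ffnn_cost(I, W, N, D, H, V):
    """I epochs over W tokens; each token costs (N-1)*D*H operations
    for the projection-to-hidden step plus H*V for the output layer."""
    return I * W * ((N - 1) * D * H + H * V)

# Example: 10 epochs, 1M training tokens, 4-gram model, 100-dim
# word vectors, 200 hidden units, 50k-word vocabulary.
cost = ffnn_cost(I=10, W=1_000_000, N=4, D=100, H=200, V=50_000)
print(f"{cost:.2e}")  # prints 1.01e+14; the H*V output term dominates
```

Note how the output-layer term H × V (10,000,000 here) dwarfs the projection term (N − 1) × D × H (60,000 here), which is why much later work focuses on reducing the cost of the output layer.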
The recurrent neural network language model has computational complexity

I × W × (H × H + H × V).  (6.4)
It can be seen that for increasing order N, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one (in fact, N has no meaning in an RNN LM).
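The contrast between (6.3) and (6.4) can be sketched numerically; D, H and V below are hypothetical example values:

```python
# Per-token cost of the feedforward model (Eq. 6.3) vs. the recurrent
# model (Eq. 6.4) as the N-gram order grows. D, H, V are hypothetical.
D, H, V = 100, 200, 50_000

def ff_per_token(N):
    # (N-1)*D*H projection-to-hidden operations plus H*V output operations
    return (N - 1) * D * H + H * V

def rnn_per_token():
    # H*H recurrent-to-hidden operations plus H*V output operations
    return H * H + H * V

for N in (3, 5, 10):
    print(N, ff_per_token(N), rnn_per_token())
# ff_per_token grows by D*H = 20,000 operations per extra context word,
# while rnn_per_token does not depend on N at all.
```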
Assuming that the maximum entropy model uses a feature set f with full N-gram features (from unigrams up to order N) and that it is trained using on-line stochastic gradient descent in the same way as the neural network models, its computational complexity is

I × W × N × V.  (6.5)
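Since the per-token cost in (6.5) is N × V rather than H × V, and the order N is typically much smaller than the hidden-layer size H, the maximum entropy model is considerably cheaper per token than the neural models; a sketch with the same hypothetical values as above:

```python
# Per-token costs implied by Eqs. (6.4) and (6.5).
# N, H, V are hypothetical illustrative values.
N, H, V = 4, 200, 50_000

maxent = N * V           # Eq. (6.5): N feature sets, each scoring V outputs
rnn = H * H + H * V      # Eq. (6.4): recurrent step plus output layer

print(maxent, rnn)       # prints 200000 10040000
# With N = 4 and H = 200, the RNN LM costs roughly 50x more per token.
```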