
Figure 6.1: Feedforward neural network 4-gram model (on the left) and recurrent neural network language model (on the right).

6.2 Computational Complexity

The computational complexity of a basic neural network language model is very high for several reasons, and there have been many attempts to deal with almost all of them. The training time of an N-gram feedforward neural network language model is proportional to

I × W × ((N − 1) × D × H + H × V),    (6.3)

where I is the number of training epochs performed before convergence of the training is achieved, W is the number of tokens in the training set (in usual cases, words plus end-of-sentence symbols), N is the N-gram order, D is the dimensionality of the word representations in the low-dimensional space, H is the size of the hidden layer and V is the size of the vocabulary (see Figure 6.1). The term (N − 1) × D is equal to the size of the projection layer.
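To give a rough idea of the magnitudes involved, the cost in Equation 6.3 can be evaluated for one particular configuration, as in the short Python sketch below. All hyperparameter values there are illustrative assumptions, not settings taken from the experiments in this thesis.

# Illustrative evaluation of the feedforward NNLM training cost from Eq. (6.3).
# All hyperparameter values are assumptions chosen only for illustration.
I = 10          # training epochs until convergence (assumed)
W = 1_000_000   # tokens in the training set (assumed)
N = 4           # N-gram order (a 4-gram model, as in Figure 6.1)
D = 100         # dimensionality of the word representations (assumed)
H = 500         # hidden layer size (assumed)
V = 100_000     # vocabulary size (assumed)

projection_to_hidden = (N - 1) * D * H   # first term of Eq. (6.3), per token
hidden_to_output = H * V                 # second term of Eq. (6.3), per token
per_token = projection_to_hidden + hidden_to_output
total = I * W * per_token

print(f"per-token operations: {per_token:,}")
print(f"total training operations, Eq. (6.3): {total:,}")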

The recurrent neural network language model has computational complexity

I × W × (H × H + H × V).    (6.4)

It can be seen that with increasing order N, the complexity of the feedforward architecture increases linearly, while it remains constant for the recurrent one (in fact, N has no meaning in the RNN LM).
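This behaviour can be illustrated by evaluating the per-token parts of Equations 6.3 and 6.4 for increasing N, as in the following sketch; the values of D, H and V are again only illustrative assumptions.

# Sketch: per-token cost of the feedforward model (Eq. 6.3) grows linearly
# with the order N, while the recurrent model (Eq. 6.4) does not depend on N.
# D, H and V are assumed values, used only to illustrate the trend.
D, H, V = 100, 500, 100_000

for N in range(2, 11):
    ff_per_token = (N - 1) * D * H + H * V   # feedforward, per token
    rnn_per_token = H * H + H * V            # recurrent, per token (no N)
    print(f"N={N:2d}  feedforward={ff_per_token:,}  recurrent={rnn_per_token:,}")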

Assuming that the maximum entropy model uses a feature set f with full N-gram features (from unigrams up to order N) and that it is trained using on-line stochastic gradient descent in the same way as the neural network models, its computational complexity is

I × W × N × V.    (6.5)
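Under the same illustrative assumptions as before, the per-token parts of Equations 6.3, 6.4 and 6.5 can be compared directly, as in the following sketch; note that the maximum entropy cost depends on N and V, but not on D or H.

# Sketch comparing the per-token costs implied by Eqs. (6.3), (6.4) and (6.5),
# i.e. the expressions without the common I * W factor, under one assumed
# configuration (values chosen only for illustration).
N, D, H, V = 4, 100, 500, 100_000

feedforward = (N - 1) * D * H + H * V   # Eq. (6.3)
recurrent = H * H + H * V               # Eq. (6.4)
maxent = N * V                          # Eq. (6.5)

for name, cost in (("feedforward", feedforward),
                   ("recurrent", recurrent),
                   ("maximum entropy", maxent)):
    print(f"{name:16s}{cost:,} operations per token")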
