
n-gram probabilities. However, it was shown in [40] that this simple technique degrades performance very significantly for small values of S, and even with small S such as 2000, the complexity induced by the H × V term is still very large.

More successful approaches are based on Goodman's trick for speeding up maximum entropy models using classes [25]. Each word from the vocabulary is assigned to a single class, and only the probability distribution over the classes is computed first. In the second step, the probability distribution over words that are members of a particular class is computed (we know this class from the predicted word whose probability we are trying to estimate). As the number of classes can be very small (several hundred), this is a much more effective approach than using shortlists, and the performance degradation is smaller. We have shown that meaningful classes can be formed very easily, by considering only unigram frequencies of words [50]. Similar approaches have been described in [40] and [57].
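To make the two-step computation concrete, the following is a minimal sketch in Python of a frequency-based class assignment and of the factorization P(w | h) = P(c | h) · P(w | c, h). The names (assign_classes, word_probability, U_class, U_word) are illustrative only and do not correspond to any particular implementation; the hidden-layer state s is assumed to be already computed for the current history h.

```python
import numpy as np

def assign_classes(unigram_counts, num_classes=100):
    """Assign words to classes by unigram frequency: sort words by count and
    cut the sorted list so that each class covers roughly 1/num_classes of
    the total unigram probability mass (a simple frequency-binning scheme)."""
    total = sum(unigram_counts.values())
    words = sorted(unigram_counts, key=unigram_counts.get, reverse=True)
    word2class, mass, cls = {}, 0.0, 0
    for w in words:
        word2class[w] = cls
        mass += unigram_counts[w] / total
        if mass > (cls + 1) / num_classes and cls < num_classes - 1:
            cls += 1
    return word2class

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_probability(s, w, word2class, class_members, U_class, U_word):
    """Two-step factorization: P(w | h) = P(c | h) * P(w | c, h).
    s               ... hidden-layer state for the current history h
    U_class         ... output weights of the class layer (num_classes x H)
    U_word[c]       ... output weights of the words in class c (|c| x H)
    class_members[c]... words of class c, ordered as the rows of U_word[c]"""
    c = word2class[w]
    p_class = softmax(U_class @ s)              # distribution over classes
    p_in_class = softmax(U_word[c] @ s)         # distribution within class c
    return p_class[c] * p_in_class[class_members[c].index(w)]
```

Instead of evaluating all V output units, only the class layer and the (usually small) set of words sharing the class of the predicted word are computed, which is the source of the speedup.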

6.2.4 Reduction of Size of the Hidden Layer

Another way to reduce H × V is to choose a small value of H. For example, in [8], H = 100 is used when the amount of training data is over 600M words. However, we will show that a small hidden layer is insufficient to obtain good performance when the amount of training data is large, as long as the usual neural net LM architecture is used. In Section 6.6, a novel architecture of neural net LM is described, denoted as RNNME (recurrent neural network trained jointly with a maximum entropy model). It allows small hidden layers to be used for models that are trained on huge amounts of data, with very good performance (much better than what can be achieved with the traditional architecture).
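As a rough illustration of the idea behind RNNME (the exact formulation and implementation of the maximum entropy part are given in Section 6.6), the sketch below assumes that the output scores of the recurrent part and of the direct n-gram connections are simply summed before the softmax; the names rnnme_output, V and D are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnme_output(s, ngram_features, V, D):
    """Sketch of a joint RNN + maximum entropy output layer: the score of each
    word is the sum of the recurrent part (V @ s, with a small hidden state s)
    and a maximum entropy part given by direct weights D indexed by the sparse
    n-gram features of the current history.
    V ... output weights of the RNN part (vocab_size x H)
    D ... direct (maxent) weights        (vocab_size x num_features)"""
    scores = V @ s
    for f in ngram_features:      # only a few features are active per history
        scores += D[:, f]
    return softmax(scores)
```

With the direct connections carrying much of the modeling burden, the hidden layer can remain small even when the training set is huge.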

6.2.5 Parallelization

Computation in artificial neural network models can be parallelized quite easily. It is possible either to divide the matrix-times-vector computation between several CPUs, or to process several examples at once, which turns it into a matrix-times-matrix computation that can be optimized by existing libraries such as BLAS. In the context of NN LMs, Schwenk has reported a speedup of several times by exploiting parallelization [68].
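The effect of processing several examples at once can be shown with a short sketch: stacking B hidden-state vectors into a matrix replaces B matrix-times-vector products with a single matrix-times-matrix product that is executed by an optimized BLAS routine (here through numpy). The sizes H, V and B below are arbitrary and chosen only for illustration.

```python
import numpy as np
import time

H, V, B = 200, 50000, 128                        # hidden size, vocabulary size, batch size
U = np.random.randn(V, H).astype(np.float32)     # output weight matrix
S = np.random.randn(H, B).astype(np.float32)     # B hidden-state vectors, stacked as columns

# One example at a time: B separate matrix-times-vector products.
t0 = time.time()
scores_loop = np.stack([U @ S[:, b] for b in range(B)], axis=1)
t1 = time.time()

# All examples at once: a single matrix-times-matrix product (BLAS gemm).
scores_batch = U @ S
t2 = time.time()

assert np.allclose(scores_loop, scores_batch, atol=1e-3)
print(f"loop: {t1 - t0:.3f}s   batched: {t2 - t1:.3f}s")
```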

It might seem that recurrent networks are much harder to parallelize, as the state of the hidden layer depends on the previous state. However, one can parallelize just the

