
where s is a state of the hidden layer. For the feedforward NN LM architecture introduced by Bengio et al. in [5], the state of the hidden layer depends on a projection layer that is formed as a projection of the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations.
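
As a concrete illustration, a minimal forward pass of this architecture might be sketched in Python/NumPy as follows. This is not code from the thesis; the layer sizes and weight names (C, Wh, Wy) are hypothetical, and training is omitted:

import numpy as np

# Hypothetical sizes, not taken from the thesis.
V, D, H, N = 10000, 50, 100, 4   # vocabulary, projection dim, hidden dim, n-gram order

rng = np.random.default_rng(0)
C  = rng.normal(0, 0.1, (V, D))             # shared projection (word feature) matrix
Wh = rng.normal(0, 0.1, (H, (N - 1) * D))   # projection layer -> hidden layer
Wy = rng.normal(0, 0.1, (V, H))             # hidden layer -> output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feedforward_nnlm(context):
    """P(w | N-1 previous words): project each context word into the
    low-dimensional space, concatenate, then one hidden layer + softmax."""
    x = np.concatenate([C[w] for w in context])   # projection layer
    s = np.tanh(Wh @ x)                           # hidden layer state s
    return softmax(Wy @ s)                        # distribution over the vocabulary

p = feedforward_nnlm([12, 7, 345])   # three-word history for a 4-gram model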

Alternatively, the current state of the hidden layer can depend on the most recent word and on the state of the hidden layer in the previous time step; time is thus not represented explicitly. This recurrence allows the hidden layer to hold a low-dimensional representation of the entire history (in other words, it provides the model with a short-term memory). Such an architecture is denoted as a recurrent neural network based language model (RNN LM), and it was described in Chapter 3. In Chapter 4, we have shown that the RNN LM achieves state-of-the-art performance on the well-known Penn Treebank Corpus, and that it outperforms standard feedforward NN LM architectures, as well as many other advanced language modeling techniques.
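
The recurrence itself is compact. A minimal sketch in the same style, following the form of the RNN LM update described in Chapter 3 (new state from the current word and the previous state, sigmoid activation, softmax output), with hypothetical sizes and weight names:

import numpy as np

vocab, hidden = 10000, 100           # hypothetical sizes
rng = np.random.default_rng(0)
U  = rng.normal(0, 0.1, (hidden, vocab))    # current word -> hidden
W  = rng.normal(0, 0.1, (hidden, hidden))   # previous state -> hidden (recurrence)
Vo = rng.normal(0, 0.1, (vocab, hidden))    # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """One time step: the new state depends only on the current word and
    the previous state, so the entire history is compressed into s."""
    s = 1.0 / (1.0 + np.exp(-(U[:, word_id] + W @ s_prev)))   # sigmoid activation
    y = softmax(Vo @ s)                                       # P(next word | history)
    return s, y

s = np.zeros(hidden)
for w in [12, 7, 345]:               # feed a short word-id sequence
    s, y = rnn_step(w, s)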

It is interesting to see that maximum entropy models trained with just n-gram features have almost the same performance as the usual backoff models with modified Kneser-Ney smoothing, as reported in Table 4.1. On the other hand, neural network models, due to their ability to cluster similar words (or similar histories), outperform the state-of-the-art backoff models. Moreover, neural net language models are complementary to backoff models, and further gains can be obtained by linearly interpolating them.
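
The interpolation itself is elementary; a sketch, with the weight treated as a placeholder that would be tuned on held-out data:

def interpolate(p_neural, p_backoff, lam=0.5):
    """Linear interpolation of two model probabilities for the same word;
    lam = 0.5 is a placeholder, in practice tuned on held-out data."""
    return lam * p_neural + (1.0 - lam) * p_backoff

p = interpolate(0.012, 0.008)   # e.g. RNN LM and Kneser-Ney estimates of one word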

We can view a maximum entropy model as a neural net model with no hidden layer, where the input layer that represents all features is directly connected to the output layer. Such a model has already been described in [81], where it was shown that it can be trained to perform similarly to a Kneser-Ney smoothed n-gram model, although only on a very limited task due to its memory complexity.
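
Under this view, prediction reduces to summing the output-layer weights of the active binary features and normalizing with a softmax. A sketch with hypothetical names and deliberately small sizes (the full weight matrix, one row per feature and one column per word, is exactly what makes the memory complexity prohibitive):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def maxent_predict(active_features, Lambda):
    """Maximum entropy model as a network with no hidden layer: binary
    n-gram features in the input are connected directly to the softmax
    output, so the logits are just a sum of the weight rows of the
    features that fire for the current history."""
    z = np.zeros(Lambda.shape[1])
    for f in active_features:       # indices of active (binary) features
        z += Lambda[f]              # Lambda[f]: weights from feature f to all words
    return softmax(z)

Lambda = np.zeros((5000, 1000))     # hypothetical feature and vocabulary sizes
p = maxent_predict([42, 1337], Lambda)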

Maximum entropy language models have usually been trained by special algorithms, such as generalized iterative scaling. Interestingly, we will show that a maximum entropy language model can be trained using the same algorithm as the neural net models: stochastic gradient descent with early stopping. This leads to a very simple implementation of the training, and it allows us to train both models jointly, as will be shown later.
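
A sketch of such a training loop; the model interface (gradient_step, valid_entropy) is hypothetical, and real setups typically also anneal the learning rate before stopping:

def train_sgd_early_stopping(model, train_data, valid_data, lr=0.1, max_epochs=50):
    """Plain stochastic gradient descent with early stopping: after every
    pass over the training data, entropy on held-out data is checked, and
    training stops once it no longer improves."""
    best = float("inf")
    for epoch in range(max_epochs):
        for example in train_data:
            model.gradient_step(example, lr)    # one stochastic gradient update
        h = model.valid_entropy(valid_data)     # held-out entropy after the epoch
        if h >= best:                           # no improvement: stop early
            break
        best = h
    return model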

