Statistical Language Models based on Neural Networks - Faculty of ...
where s is a state of the hidden layer. For the feedforward NN LM architecture introduced by Bengio et al. in [5], the state of the hidden layer depends on a projection layer, which is formed as a projection of the N − 1 most recent words into a low-dimensional space. After the model is trained, similar words have similar low-dimensional representations.
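The projection step can be sketched as follows. This is a minimal illustration only, not the implementation from [5]; the toy dimensions and the shared embedding table are assumptions made for the example.

```python
# Sketch of the projection layer in a feedforward NN LM (illustrative;
# dimensions and initialization are arbitrary toy values).
N = 4                 # model order: condition on N - 1 = 3 recent words
vocab, dim = 10, 2    # toy vocabulary size and embedding dimensionality

# shared projection matrix: one low-dimensional vector per word
embeddings = [[0.1 * w + 0.01 * d for d in range(dim)] for w in range(vocab)]

def project(history):
    """Concatenate the embeddings of the N - 1 most recent words;
    this concatenation is the input to the hidden layer."""
    assert len(history) == N - 1
    out = []
    for w in history:
        out.extend(embeddings[w])
    return out

x = project([3, 7, 1])
print(len(x))  # (N - 1) * dim = 6
```

Because the embedding table is shared across all context positions, words that occur in similar contexts are pushed toward similar vectors during training, which is where the clustering behavior comes from.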
Alternatively, the current state of the hidden layer can depend on the most recent word and the state of the hidden layer in the previous time step; time is thus not represented explicitly. This recurrence allows the hidden layer to form a low-dimensional representation of the entire history (in other words, it provides the model with a short-term memory). This architecture is denoted as a recurrent neural network based language model (RNN LM), and it was described in Chapter 3. In Chapter 4, we have shown that the RNN LM achieves state-of-the-art performance on the well-known Penn Treebank Corpus, and that it outperforms standard feedforward NN LM architectures, as well as many other advanced language modeling techniques.
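The recurrence described above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation; the sigmoid activation and the toy weight values are assumptions following the standard simple-RNN formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(w_t, s_prev, U, W):
    """One time step: the new hidden state depends only on the current
    word (one-hot index w_t) and the previous hidden state s_prev."""
    hidden = len(s_prev)
    s_t = []
    for j in range(hidden):
        # U[j][w_t] selects the input weight for the current word;
        # the W term carries the compressed history forward in time.
        a = U[j][w_t] + sum(W[j][k] * s_prev[k] for k in range(hidden))
        s_t.append(sigmoid(a))
    return s_t

# toy dimensions: vocabulary of 5 words, hidden layer of size 3
U = [[0.1 * (j + i) for i in range(5)] for j in range(3)]
W = [[0.01] * 3 for _ in range(3)]

s = [0.0, 0.0, 0.0]
for word in [2, 0, 4]:       # process a short word sequence
    s = rnn_step(word, s, U, W)
print(len(s))  # hidden state stays 3-dimensional, yet summarizes the history
```

The key point is that the hidden state has fixed size no matter how long the sequence is, which is exactly the short-term memory mentioned above.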
It is interesting that maximum entropy models trained with just n-gram features have almost the same performance as standard backoff models with modified Kneser-Ney smoothing, as reported in Table 4.1. On the other hand, neural network models, due to their ability to cluster similar words (or similar histories), outperform the state-of-the-art backoff models. Moreover, neural net language models are complementary to backoff models, and further gains can be obtained by linearly interpolating them.
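Linear interpolation of two models is a weighted average of their word distributions; a small sketch follows (the weight lam and the toy probabilities are illustrative, and in practice the weight is tuned on held-out data):

```python
def interpolate(p_nn, p_backoff, lam=0.5):
    """Linear interpolation of two language model distributions over
    the same vocabulary; lam weights the neural net model."""
    return {w: lam * p_nn[w] + (1 - lam) * p_backoff[w] for w in p_nn}

# toy next-word distributions from the two models
p_nn      = {"a": 0.7, "b": 0.3}
p_backoff = {"a": 0.5, "b": 0.5}

p = interpolate(p_nn, p_backoff, lam=0.5)
print(p["a"])  # 0.5 * 0.7 + 0.5 * 0.5 = 0.6
```

Because the two models make different kinds of errors, the mixture can assign better probabilities than either model alone.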
We can view a maximum entropy model as a neural net model with no hidden layer, in which the input layer that represents all features is directly connected to the output layer. Such a model has already been described in [81], where it was shown that it can be trained to perform similarly to a Kneser-Ney smoothed n-gram model, although only on a very limited task due to memory complexity.
Maximum entropy language models have usually been trained by special algorithms, such as generalized iterative scaling. Interestingly, we will show that a maximum entropy language model can be trained using the same algorithm as the neural net models: stochastic gradient descent with early stopping. This leads to a very simple implementation of the training, and allows us to train both models jointly, as will be shown later.
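The view of a maximum entropy model as a neural net without a hidden layer, trained by plain SGD, can be sketched as follows. This is a minimal illustration under stated assumptions: the feature indexing, toy dimensions, and learning rate are invented for the example, not taken from the thesis.

```python
import math

# Maximum entropy LM as a net with no hidden layer: active n-gram
# features connect directly to the softmax output layer, and the
# weights are trained by stochastic gradient descent.
vocab = 3
features = 4                      # e.g. indicator features for n-grams
W = [[0.0] * features for _ in range(vocab)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sgd_step(active, target, lr=0.1):
    """One SGD update: active = indices of the features firing for the
    current history, target = index of the observed next word."""
    scores = [sum(W[w][f] for f in active) for w in range(vocab)]
    probs = softmax(scores)
    for w in range(vocab):
        # gradient of the cross-entropy loss w.r.t. the score of word w
        grad = probs[w] - (1.0 if w == target else 0.0)
        for f in active:
            W[w][f] -= lr * grad

# repeatedly observe word 1 after the feature set {0, 2}
for _ in range(100):
    sgd_step([0, 2], target=1)

probs = softmax([sum(W[w][f] for f in [0, 2]) for w in range(vocab)])
print(probs[1])   # probability of the observed word grows toward 1
```

The same update rule works unchanged for a network with a hidden layer, which is what makes joint training of both models straightforward.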