”Every time I fire a linguist out of my group, the accuracy goes up³.”
We may understand Jelinek’s statement as an observation that with decreased complexity of the system and increased generality of the approaches, the performance goes up. It is then not so surprising to see general-purpose algorithms beat the very specific ones, although the task-specific algorithms may clearly have better initial results.
Neural network language models will be described in more detail in Chapter 2. These models are today among the state-of-the-art techniques, and we will demonstrate their performance on several data sets, on each of which it is unmatched by other techniques.
The main advantage of NNLMs over n-grams is that the history is no longer seen as an exact sequence of n − 1 words H, but rather as a projection of H into some lower-dimensional space. This reduces the number of parameters in the model that have to be trained, and results in automatic clustering of similar histories. While this might sound the same as the motivation for class-based models, the main difference is that NNLMs project all words into the same low-dimensional space, and there can be many degrees of similarity between words.
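To make this concrete, the following minimal sketch (not the thesis implementation) shows how a history of n − 1 words can be projected into a shared low-dimensional space by a lookup matrix. The sizes, the function name project_history, and the word indices are illustrative assumptions, and the random matrix stands in for a projection that would in practice be learned:

```python
import numpy as np

V, D = 10000, 50                         # vocabulary size, projection dimension (assumed)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(V, D))   # projection matrix; learned during training in a real NNLM

def project_history(word_ids):
    """Map a history of n-1 word indices to the concatenation of
    their D-dimensional projections, as in a feedforward NNLM."""
    return np.concatenate([P[w] for w in word_ids])

# Two histories differing in one word map to two points in the same
# continuous space; once P is trained, similar words receive nearby
# rows, so similar histories end up close together and share statistics.
h1 = project_history([12, 345, 678])
h2 = project_history([12, 346, 678])
print(np.linalg.norm(h1 - h2))
```

Because every word shares the same projection matrix, the distance between two projected histories is graded rather than all-or-nothing, which is exactly what a hard class assignment cannot express.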
The main weak point of these models is their very large computational complexity, which usually makes it prohibitive to train them on the full training set using the full vocabulary. I will deal with these issues in this work by proposing simple and effective speed-up techniques. Experiments and results obtained with neural network models trained on over 400M words with a large vocabulary will be reported, which is to my knowledge the largest data set that a proper NNLM has been trained on⁴.
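For typical configurations, the dominant cost lies in the output layer, whose size equals the vocabulary. The following back-of-the-envelope estimate, with purely illustrative sizes (a feedforward NNLM of order n, projection dimension D, hidden size H, vocabulary V), sketches why a full vocabulary is prohibitive:

```python
# Rough per-word cost of one forward pass in a feedforward NNLM.
# Values are assumptions for illustration, not the thesis setup.
n, D, H, V = 5, 100, 500, 200000

hidden_ops = (n - 1) * D * H   # projection-to-hidden multiplications
output_ops = H * V             # hidden-to-output (softmax) multiplications
print(f"hidden: {hidden_ops:,}  output: {output_ops:,}")
# With these sizes, the output layer costs roughly 500x more than the
# hidden layer, which is the term the speed-up techniques must attack.
```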
2.4 Introduction to Data Sets and Experimental Setups
In this work, I would like to avoid mistakes that are often mentioned in criticism of current research in statistical language modeling. It is usually claimed that new techniques are studied in very specific systems, using weak or ambiguous baselines. Comparability of the achieved results is thus very low, if any. This leads to much
³ Although Jelinek himself later claimed that the original statement was ”Every time a linguist leaves my group, the accuracy goes up”, the former one gained more popularity.
⁴ I am aware of experiments with even more training data (more than 600M words) [8], but the resulting model in that work uses a small hidden layer, which, as will be shown later, prohibits training a model with competitive performance on such an amount of training data.