
“Every time I fire a linguist out of my group, the accuracy goes up.”³

We may understand Jelinek's statement as an observation that as the complexity of the system decreases and the generality of the approaches increases, the performance goes up. It is then not so surprising to see general-purpose algorithms beat the very specific ones, although the task-specific algorithms may clearly have better initial results.

Neural network language models will be described in more detail in Chapter 3. These models are today among the state-of-the-art techniques, and we will demonstrate their performance on several data sets, on each of which it is unmatched by other techniques.

The main advantage of NNLMs over n-grams is that the history is no longer seen as an exact sequence of n − 1 words H, but rather as a projection of H into some lower-dimensional space. This reduces the number of parameters in the model that have to be trained, resulting in automatic clustering of similar histories. While this might sound the same as the motivation for class-based models, the main difference is that NNLMs project all words into the same low-dimensional space, and there can be many degrees of similarity between words.
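As a concrete illustration of this projection idea, the minimal sketch below maps an (n − 1)-word history to a single low-dimensional vector through a shared projection matrix. The toy vocabulary, the dimensions, and the random (untrained) projection matrix are assumptions for illustration only, not the actual model configuration used in this work.

import numpy as np

# Toy vocabulary and sizes (illustrative assumptions only).
vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3}
V, D = len(vocab), 8              # vocabulary size, projection dimension

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(V, D))   # shared projection matrix; learned during training in a real NNLM

def project_history(words):
    """Map an (n-1)-word history to one low-dimensional vector by
    concatenating the projections of its words."""
    return np.concatenate([P[vocab[w]] for w in words])

h1 = project_history(["the", "cat"])
h2 = project_history(["the", "dog"])

# Unlike an n-gram model, which treats "the cat" and "the dog" as two
# unrelated histories, their projections live in the same space and can
# be arbitrarily close, so similar histories are clustered automatically.
cosine = np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2))
print(f"cosine similarity of the two projected histories: {cosine:.3f}")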

The main weak point of these models is their very large computational complexity, which usually prohibits training them on the full training set with the full vocabulary. I will deal with these issues in this work by proposing simple and effective speed-up techniques. Experiments and results obtained with neural network models trained on over 400M words with a large vocabulary will be reported, which is to my knowledge the largest set that a proper NNLM has been trained on⁴.
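To give a rough sense of where this complexity comes from, the back-of-the-envelope sketch below compares the per-word cost of a full softmax output layer with that of a class-factored output layer, one widely used kind of speed-up; the sizes H, V and C are illustrative assumptions, not the exact configurations evaluated later in this work.

# Rough per-word multiplication counts for the output layer of an NNLM
# (illustrative assumptions only, not the exact figures used in this work).
H = 200        # hidden layer size (assumed)
V = 200_000    # full vocabulary size (assumed)
C = 400        # number of word classes, roughly sqrt(V) (assumed)

full_softmax = H * V              # normalize over the whole vocabulary
factored     = H * (C + V // C)   # class layer + words within the predicted class

print(f"full softmax: {full_softmax:,} mult/word")
print(f"factored    : {factored:,} mult/word")
print(f"speed-up    : {full_softmax / factored:.0f}x")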

2.4 Introduction to Data Sets and Experimental Setups

In this work, I would like to avoid mistakes that are often mentioned in criticism of current research in statistical language modeling. It is usually claimed that new techniques are studied in very specific systems, using weak or ambiguous baselines. Comparability of the achieved results is very low, if there is any at all. This leads to much

³ Although Jelinek himself later claimed that the original statement was “Every time a linguist leaves my group, the accuracy goes up”, the former one gained more popularity.

4 I am aware <strong>of</strong> experiments with even more training data (more than 600M words) [8], but the resulting<br />

model in that work uses a small hidden layer, which as it will be shown later prohibits to train a model<br />

with competitive performance <strong>on</strong> such amount <strong>of</strong> training data.<br />
