
where λ is the interpolation weight of the model M1. As long as both models produce correct probability distributions and λ ∈ [0, 1], the linear interpolation also produces a correct probability distribution. It has been reported that log-linear interpolation of models can in some cases work significantly better than linear interpolation (especially when combining long-span and short-span language models), but log-linear interpolation requires renormalization of the probability distribution and is thus much more computationally expensive [35]:

P_{M12}(w|h) = (1 / Z_λ(h)) · P_{M1}(w|h)^{λ1} × P_{M2}(w|h)^{λ2}    (4.2)

where Z_λ(h) is the normalization term. Because of the normalization term, we need to consider the full probability distribution given by both models, while for linear interpolation it is enough to interpolate the probabilities given by both models for an individual word. The previous equations can easily be extended to a combination of more than two models by using a separate weight for each model.
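For illustration, the following is a minimal sketch in Python (assuming hypothetical model objects with a prob(word, history) method and a shared vocabulary; these names are not from the original experiments). It shows why log-linear interpolation is more expensive: computing Z_λ(h) requires a pass over the whole vocabulary, while linear interpolation only needs the two probabilities of the current word.

    # Minimal sketch of linear vs. log-linear interpolation of two language models.
    # Hypothetical interface: each model exposes prob(word, history) -> float,
    # and `vocab` is the shared closed vocabulary of both models.

    def linear_interpolation(m1, m2, lam, word, history):
        # P(w|h) = lam * P_M1(w|h) + (1 - lam) * P_M2(w|h)
        # Only the two probabilities of the given word are needed.
        return lam * m1.prob(word, history) + (1.0 - lam) * m2.prob(word, history)

    def log_linear_interpolation(m1, m2, lam1, lam2, word, history, vocab):
        # Eq. (4.2): P(w|h) = P_M1(w|h)^lam1 * P_M2(w|h)^lam2 / Z_lam(h)
        # The normalization term Z_lam(h) requires one pass over the full vocabulary.
        score = lambda w: (m1.prob(w, history) ** lam1) * (m2.prob(w, history) ** lam2)
        z = sum(score(w) for w in vocab)  # the expensive part
        return score(word) / z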

The Penn Treebank Corpus was divided as follows: sections 0-20 were used as the training data (930k tokens), sections 21-22 as the validation data (74k tokens) and sections 23-24 as the test data (82k tokens). All words outside the 10K vocabulary were mapped to a special token (unknown word) in all PTB data sets, so there are no Out-Of-Vocabulary (OOV) words.
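As a small illustrative sketch (Python; the function name and the "<unk>" token string are assumptions, not necessarily the exact preprocessing used for these data sets), the OOV mapping amounts to replacing every word outside the 10K vocabulary:

    def map_oov(tokens, vocab, unk="<unk>"):
        # Replace every token outside the (10K) vocabulary with the unknown-word token,
        # so the resulting data sets contain no OOV words.
        return [w if w in vocab else unk for w in tokens]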

4.3 Performance of Individual Models

The performance of all individual models used in the following experiments is presented in Table 4.1. First, we will give references and provide brief details about the individual models. Then we will compare the performance of the models, combine them together and finally analyze the contributions of all individual models and techniques. I would also like to mention here that the following experiments were performed with the help of Anoop Deoras, who reimplemented some of the advanced LM techniques that are mentioned in the comparison. Some of the following results are also based on the work of other researchers, as will be mentioned later.

