Statistical Language Models based on Neural Networks - Faculty of ...
where λ is the interpolation weight of the model M1. As long as both models produce correct probability distributions and λ ∈ [0, 1], the linear interpolation produces a correct probability distribution. It has been reported that log-linear interpolation of models can in some cases work significantly better than linear interpolation (especially when combining long-span and short-span language models), but log-linear interpolation requires renormalization of the probability distribution and is thus much more computationally expensive [35]:
PM12(w|h) = (1 / Zλ(h)) · PM1(w|h)^λ1 × PM2(w|h)^λ2    (4.2)
where Zλ(h) is the normalization term. Because of the normalization term, we need to consider the full probability distribution given by both models, while for the linear interpolation it is enough to interpolate the probabilities given by both models for an individual word. The previous equations can easily be extended to a combination of more than two models by having a separate weight for each model.
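The computational difference between the two combination schemes can be sketched as follows. This is an illustrative toy example, not code from the thesis; the three-word vocabulary, the distributions, and the weights are invented. Note that the log-linear variant must sum over the full vocabulary to compute the normalization term Zλ(h) from Equation 4.2, while the linear variant only touches the words it is asked about.

```python
def linear_interpolation(p1, p2, lam):
    """Linearly interpolate two distributions over the same vocabulary.

    For lam in [0, 1] the result is itself a valid probability
    distribution, so no renormalization is needed.
    """
    return {w: lam * p1[w] + (1.0 - lam) * p2[w] for w in p1}


def log_linear_interpolation(p1, p2, lam1, lam2):
    """Log-linearly interpolate two distributions (Eq. 4.2).

    The weighted product of probabilities is no longer normalized,
    so we must sum over the FULL vocabulary to obtain Z_lambda(h).
    """
    unnorm = {w: (p1[w] ** lam1) * (p2[w] ** lam2) for w in p1}
    z = sum(unnorm.values())  # normalization term Z_lambda(h)
    return {w: u / z for w, u in unnorm.items()}


# Toy distributions over a three-word vocabulary (values invented):
pm1 = {"a": 0.5, "b": 0.3, "c": 0.2}
pm2 = {"a": 0.2, "b": 0.2, "c": 0.6}

lin = linear_interpolation(pm1, pm2, lam=0.5)
loglin = log_linear_interpolation(pm1, pm2, lam1=0.5, lam2=0.5)

# Both results sum to 1, but the log-linear one needed an explicit Z:
assert abs(sum(lin.values()) - 1.0) < 1e-12
assert abs(sum(loglin.values()) - 1.0) < 1e-12
```

In practice the sum defining Zλ(h) runs over the entire vocabulary for every history h, which is exactly why the log-linear combination is much more expensive than the linear one.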
The Penn Treebank Corpus was divided as follows: sections 0-20 were used as the training data (930k tokens), sections 21-22 as the validation data (74k tokens) and sections 23-24 as the test data (82k tokens). All words outside the 10K vocabulary were mapped to a special token (unknown word) in all PTB data sets; thus, there are no Out-Of-Vocabulary (OOV) words.
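The OOV mapping described above amounts to a simple token-level substitution; a minimal sketch follows. The vocabulary and the token name `<unk>` here are example choices for illustration, not taken from the thesis.

```python
def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every token outside the vocabulary with a special
    unknown-word token, so the resulting data contains no OOV words."""
    return [t if t in vocab else unk for t in tokens]


# Toy vocabulary standing in for the 10K PTB word list:
vocab = {"the", "cat", "sat"}
print(map_oov(["the", "dog", "sat"], vocab))  # ['the', '<unk>', 'sat']
```

Because every word in the training, validation, and test sets is either in the vocabulary or mapped to the unknown-word token, perplexities computed on these sets are directly comparable across models.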
4.3 Performance of Individual Models
The performance of all individual models used in the further experiments is presented in Table 4.1. First, we will give references and provide brief details about the individual models. Then we will compare the performance of the models, combine them together, and finally analyze the contributions of all individual models and techniques. I would also like to mention here that the following experiments were performed with the help of Anoop Deoras, who reimplemented some of the advanced LM techniques that are mentioned in the comparison. Some of the following results are also based on the work of other researchers, as will be mentioned later.