Statistical Language Models based on Neural Networks - Faculty of ...
THE SKY THIS ... WAS BLUE
N-gram models with N = 4 are unable to efficiently model such common patterns in the language. With N = 10, we can see that the number of variations is so large that we cannot realistically hope to have enough training data for n-gram models to capture such long-context patterns: we would essentially have to see each specific variation in the training data, which is infeasible in practical situations.
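To make the sparsity argument concrete, the following minimal sketch (not from the text; the vocabulary size of 10,000 is an illustrative assumption) compares the number of possible distinct word sequences for N = 4 and N = 10:

```python
# Illustrative sketch: count possible n-grams for an assumed vocabulary size.
V = 10_000  # assumed vocabulary size (a typical order of magnitude, not from the text)

for n in (4, 10):
    # With V word types, there are V**n possible distinct n-grams.
    print(f"possible {n}-grams: {V**n:.1e}")
```

Even a corpus with trillions of tokens provides only on the order of 10^12 n-gram instances, a vanishing fraction of the 10^40 possible 10-grams, which is why most long contexts are simply never observed in training data.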
Another type of pattern that n-gram models cannot model efficiently is the similarity of individual words. A popular example is:
PARTY WILL BE ON ...
Considering that only two or three variations of this sentence are present in the training data, such as PARTY WILL BE ON MONDAY and PARTY WILL BE ON TUESDAY, n-gram models will not be able to assign a meaningful probability to a novel (but similar) sequence such as PARTY WILL BE ON FRIDAY, even if days of the week appeared in the training data frequently enough to reveal some similarity among them.
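This failure mode can be demonstrated directly. The sketch below (a hypothetical toy corpus and helper, not code from the text) trains a maximum-likelihood bigram model and shows that the continuation FRIDAY receives zero probability after ON, even though FRIDAY itself occurs in the training data:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only.
corpus = [
    "PARTY WILL BE ON MONDAY".split(),
    "PARTY WILL BE ON TUESDAY".split(),
    "I WORK EVERY FRIDAY".split(),  # FRIDAY is in the vocabulary, just never after ON
]

n = 2  # a bigram model, for brevity; the same argument holds for any n
counts = defaultdict(Counter)
for sent in corpus:
    for i in range(len(sent) - n + 1):
        context, word = tuple(sent[i:i + n - 1]), sent[i + n - 1]
        counts[context][word] += 1

def prob(context, word):
    """Maximum-likelihood estimate P(word | context); zero for unseen pairs."""
    c = counts[tuple(context)]
    return c[word] / sum(c.values()) if c else 0.0

print(prob(["ON"], "MONDAY"))  # 0.5
print(prob(["ON"], "FRIDAY"))  # 0.0 -- no similarity among weekdays is exploited
```

The model has no mechanism to generalize from MONDAY and TUESDAY to FRIDAY; smoothing would assign FRIDAY a nonzero but uninformative probability, no larger than that of any other unseen word.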
As language modeling is closely related to artificial intelligence and language learning, a great number of different language modeling techniques, and many variations of each, can be found in the research literature published over the past thirty years. While it is beyond the scope of this work to describe all of these techniques in detail, we will at least give a short introduction to the important ones and provide references for further details.
2.3.1 Cache Language Models
As stated previously, one of the most obvious drawbacks of n-gram models is their inability to represent longer-term patterns. It has been empirically observed that many words, especially rare ones, have a significantly higher chance of occurring again if they occurred in the recent history. Cache models [32] are designed to exploit this regularity, and are often implemented as another n-gram model that is estimated dynamically from the recent history (usually a few hundred words are considered) and interpolated with the
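A minimal sketch of the cache idea, assuming a unigram cache interpolated with a precomputed static probability (the cache size and interpolation weight below are arbitrary illustrative choices, and real cache models typically maintain a dynamic n-gram model over the recent history rather than a unigram count):

```python
from collections import Counter

CACHE_SIZE = 300  # roughly "a few hundred words" of recent history (assumed value)
LAMBDA = 0.2      # interpolation weight of the cache (assumed value)

def cache_interpolated_prob(word, history, static_prob):
    """P(w) = lambda * P_cache(w) + (1 - lambda) * P_static(w)."""
    recent = history[-CACHE_SIZE:]
    cache = Counter(recent)
    p_cache = cache[word] / len(recent) if recent else 0.0
    return LAMBDA * p_cache + (1 - LAMBDA) * static_prob

# A rare word that occurred repeatedly in recent history gets a large boost
# over its static probability:
history = ["THE", "DEFENDANT", "SAID"] * 10
print(cache_interpolated_prob("DEFENDANT", history, static_prob=1e-5))
```

Here the cache raises the probability of DEFENDANT by several orders of magnitude over its static estimate, capturing exactly the "recently seen words recur" regularity described above.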