Statistical Language Models based on Neural Networks - Faculty of ...
The notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term. (Chomsky, 1969)
Still, entropy and perplexity remain very useful measures. The simple reason is that in real-world applications (such as speech recognizers), there is a strong positive correlation between the perplexity of the involved language model and the system's performance [24].
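To make the relation between the two measures concrete, the following sketch computes perplexity as two to the power of the per-word cross-entropy; the function name and the toy probabilities are illustrative, not taken from the cited systems:

```python
import math

def perplexity(word_probs):
    """Perplexity of a language model over a test sequence.

    `word_probs` holds the model's probability for each observed word;
    perplexity = 2 ** (cross-entropy in bits per word), so lower is better.
    """
    assert all(0.0 < p <= 1.0 for p in word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** cross_entropy

# A model that assigns uniform probability 1/4 to each word behaves like
# a fair choice among 4 alternatives, so its perplexity is 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```

A sharper model that concentrates probability on the correct words yields a lower cross-entropy and hence lower perplexity, which is the quantity observed to correlate with recognizer performance.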
More theoretical reasons for using entropy as a measure of performance come from an artificial intelligence point of view [42]. If we want to build an intelligent agent that will maximize its reward over time, we have to maximize its ability to predict the outcomes of its own actions. Given that such an agent is supposed to work in the real world, where it will encounter complex regularities including natural language, we cannot hope for success unless the agent has the ability to find and exploit existing patterns in such data. It is known that Turing machines (or equivalent formalisms) can represent any algorithm (in other words, any pattern or regularity). However, no algorithm is known that would find all possible patterns in given data. On the contrary, it has been proven that such an algorithm cannot exist in general, due to the halting problem (for some algorithms, the output is not computationally decidable because of potential infinite recursion).
Very inspiring work on this topic was done by Solomonoff [70], who has shown an optimal solution to the general prediction problem called Algorithmic probability. Despite the fact that it is uncomputable, it provides very interesting insight into concepts such as patterns, regularities, information, noise and randomness. Solomonoff's solution is to average over all possible (infinitely many) models of the given data, while normalizing by their description length. The algorithmic probability (ALP) of a string x is defined as
$$P_M(x) = \sum_{i=0}^{\infty} 2^{-|S_i(x)|}, \qquad (2.2)$$
where P_M(x) denotes the probability of string x with respect to machine M, and |S_i(x)| is the description length of x (or of any sequence that starts with x) given the i-th model of x. Thus, the shortest descriptions dominate the final value of the algorithmic probability of the string x. More information about ALP, as well as proofs of its interesting properties (for example, invariance to the choice of the machine M, as long as M is universal), can be found in [70].
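Although the full sum in (2.2) is uncomputable (it ranges over all programs of a universal machine), a finite truncation illustrates how the weighting works. The sketch below assumes a small, hypothetical list of description lengths |S_i(x)| in bits; it is a toy illustration of the 2^(-length) weighting, not an implementation of ALP itself:

```python
def truncated_alp(description_lengths):
    """Finite truncation of the ALP sum in Eq. (2.2).

    `description_lengths` is a hypothetical list of |S_i(x)| values in
    bits; each description contributes 2 ** (-length), so the shortest
    description dominates the total.
    """
    return sum(2.0 ** (-length) for length in description_lengths)

# Hypothetical descriptions of 3, 5, and 10 bits: the 3-bit one
# contributes 0.125, already most of the total of about 0.157.
print(truncated_alp([3, 5, 10]))  # -> 0.1572265625
```

The example makes the dominance property from the text visible: adding longer descriptions changes the sum only marginally, which is why the shortest model of the data determines the algorithmic probability almost entirely.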