
The notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. (Chomsky, 1969)

Still, we can consider entropy and perplexity as very useful measures. The simple reason is that in real-world applications (such as speech recognizers), there is a strong correlation between the perplexity of the involved language model and the system's performance: lower perplexity generally translates into a lower error rate [24].
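To make the measure concrete, below is a minimal Python sketch of how perplexity is computed from a model's per-word probabilities on a test sequence; the function name and the toy probabilities are illustrative assumptions, not values taken from [24].

import math

def perplexity(word_probs):
    # Perplexity of a model on a test sequence, given the model's
    # probability for each word in that sequence:
    #   PPL = 2 ** (-(1/N) * sum_i log2 p(w_i | history_i))
    n = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log_prob / n)

# Toy example: a model assigning these probabilities to a 4-word
# test sentence has perplexity of roughly 3.8 -- it is about as
# "surprised" as a uniform choice among ~4 words at each step.
print(perplexity([0.25, 0.5, 0.125, 0.3]))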

More theoretical reasons for using entropy as a measure of performance come from an artificial intelligence point of view [42]. If we want to build an intelligent agent that will maximize its reward over time, we have to maximize its ability to predict the outcomes of its own actions. Given that such an agent is supposed to work in the real world, where it can encounter complex regularities including natural language, we cannot hope for success unless the agent is able to find and exploit existing patterns in such data. It is known that Turing machines (or equivalent formalisms) can represent any algorithm (in other words, any pattern or regularity). However, no algorithm is known that would find all possible patterns in given data. On the contrary, it has been proved that such an algorithm cannot exist in general, due to the halting problem (for some algorithms, the output is not computationally decidable because of potential infinite recursion).

A very inspiring work on this topic was done by Solomonoff [70], who gave an optimal solution to the general prediction problem, called Algorithmic Probability. Despite being uncomputable, it provides very interesting insight into concepts such as patterns, regularities, information, noise and randomness. Solomonoff's solution is to average over all possible (infinitely many) models of the given data, while normalizing by their description length. The Algorithmic Probability (ALP) of a string x is defined as

P_M(x) = \sum_{i=0}^{\infty} 2^{-|S_i(x)|},    (2.2)

where $P_M(x)$ denotes the probability of string x with respect to machine M, and $|S_i(x)|$ is the description length of x (or of any sequence that starts with x) given the i-th model of x. Thus, the shortest descriptions dominate the final value of the algorithmic probability of the string x. More information about ALP, as well as proofs of its interesting properties (for example, invariance to the choice of the machine M, as long as M is universal), can be found in [70].
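As an illustration only (true ALP is uncomputable and requires a universal machine M), the following sketch approximates the spirit of Equation 2.2 by brute-force enumeration of bit-string programs for a deliberately trivial, non-universal machine; the machine, the length bound, and all names below are assumptions made for this example, not Solomonoff's construction.

from itertools import product

def toy_machine(program, n):
    # A deliberately simple stand-in for M: it outputs its program
    # repeated over and over, truncated to n symbols. This machine is
    # NOT universal, and the programs are not prefix-free, so the sum
    # below only illustrates how short descriptions dominate.
    if not program:
        return ""
    return (program * (n // len(program) + 1))[:n]

def alp_approximation(x, max_len=12):
    # Enumerate every bit-string program up to max_len and sum
    # 2 ** (-|p|) over programs whose output starts with x,
    # mirroring the 2 ** (-|S_i(x)|) terms of Equation 2.2.
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if toy_machine(p, len(x)) == x:
                total += 2.0 ** (-length)
    return total

# A regular string has a short description ("01" repeated), so it
# receives a much higher value than an irregular string of the same
# length, which is only described by programs of its full length:
print(alp_approximation("010101"))  # ~0.42 with max_len=12
print(alp_approximation("001011"))  # ~0.11 with max_len=12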

