Statistical Language Models based on Neural Networks - Faculty of ...
2.1 Evaluation

2.1.1 Perplexity
Evaluation of the quality of different language models is usually done using either perplexity or word error rate. Both metrics have important properties, as well as drawbacks, which we will briefly mention here. The perplexity (PPL) of a word sequence w is defined as
PPL = \sqrt[K]{\prod_{i=1}^{K} \frac{1}{P(w_i|w_{1...i-1})}} = 2^{-\frac{1}{K} \sum_{i=1}^{K} \log_2 P(w_i|w_{1...i-1})}    (2.1)
Perplexity is closely related to the cross entropy between the model and some test data¹. It can be seen as the exponential of the average per-word entropy of the test data. For example, if the model encodes each word from the test data in 8 bits on average, the perplexity is 256. There are several practical reasons to use perplexity rather than entropy: first, it is easier to remember absolute values in the usual perplexity range of 100-200 than the corresponding numbers between 6.64 and 7.64 bits. Second, it looks better to report that some new technique yields a 10% improvement in perplexity rather than a 2% reduction of entropy, although both results refer to the same improvement (in this example, we assume a baseline perplexity of 200). Probably most importantly, perplexity can be easily evaluated (given some held-out or test data), and as it is closely related to entropy, the model that yields the lowest perplexity is in some sense the closest model to the true model that generated the data.
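The relationship in Equation 2.1 can be checked numerically. The sketch below (a minimal illustration, with a hypothetical `perplexity` helper not taken from the thesis) computes PPL from a list of per-word conditional probabilities and reproduces the example above: a model that encodes each word in 8 bits on average, i.e. assigns each word probability 2⁻⁸, has perplexity 256.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence, given P(w_i | w_1...i-1) for each word.

    Implements PPL = 2^(-(1/K) * sum_i log2 P(w_i | w_1...i-1)),
    i.e. 2 raised to the average per-word cross entropy in bits.
    """
    K = len(word_probs)
    avg_bits = -sum(math.log2(p) for p in word_probs) / K
    return 2.0 ** avg_bits

# Each word encoded in exactly 8 bits on average -> perplexity 2^8 = 256.
print(perplexity([2.0 ** -8] * 10))  # 256.0
```

Note that averaging log probabilities (rather than multiplying raw probabilities and taking the K-th root) avoids numerical underflow on long sequences, even though the two forms in Equation 2.1 are mathematically identical.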
There has been great effort in the past to discover models that best represent the patterns found in both real and artificial sequential data. Interestingly, there has been limited cooperation between researchers working in different fields, which gave rise to a high diversity of techniques. Natural language was viewed by many as a special case of a sequence of discrete symbols, and its structure was supposedly best captured by various limited artificial grammars (such as context-free grammars), with strong linguistic motivation.
The question of the validity of the statistical approach to describing natural language has been raised many times in the past, with perhaps the most widely known statement coming from Noam Chomsky:
¹ For simplification, it is later denoted simply as entropy.