2.1 Evaluation

2.1.1 Perplexity

The quality of different language models is usually evaluated using either perplexity or word error rate. Both metrics have important properties as well as drawbacks, which we briefly mention here. The perplexity (PPL) of a word sequence w is defined as

PPL = \sqrt[K]{\prod_{i=1}^{K} \frac{1}{P(w_i \mid w_{1 \ldots i-1})}} = 2^{-\frac{1}{K} \sum_{i=1}^{K} \log_2 P(w_i \mid w_{1 \ldots i-1})}    (2.1)
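As a minimal illustration of Eq. 2.1, the following Python sketch computes perplexity from the conditional probabilities a model assigns to each word of a test sequence; the probability values used here are hypothetical and serve only to show the arithmetic.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence, given the conditional probabilities
    P(w_i | w_1...i-1) that a model assigned to each of its K words (Eq. 2.1)."""
    K = len(word_probs)
    # Average per-word log2 probability; perplexity is 2 raised to its negation.
    avg_log2 = sum(math.log2(p) for p in word_probs) / K
    return 2.0 ** (-avg_log2)

# Hypothetical probabilities a model might assign to a four-word test sequence.
probs = [0.2, 0.05, 0.1, 0.25]
print(perplexity(probs))  # roughly 7.95
```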

Perplexity is closely related to the cross entropy between the model and some test data¹. It can be seen as two raised to the average per-word entropy, in bits, of the test data. For example, if the model encodes each word from the test data on average in 8 bits, the perplexity is 256. There are several practical reasons to use perplexity rather than entropy: first, it is easier to remember absolute values in the usual perplexity range of 100-200 than the corresponding values between 6.64 and 7.64 bits. Second, it looks better to report that some new technique yields a 10% improvement in perplexity rather than a 2% reduction of entropy, although both results refer to the same improvement (in this example, we assume a baseline perplexity of 200). Probably most importantly, perplexity can be easily evaluated (if we have some held-out or test data), and as it is closely related to entropy, the model that yields the lowest perplexity is in some sense the closest to the true model that generated the data.
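The numbers in the preceding paragraph can be checked directly from the entropy-perplexity relation PPL = 2^H; the short sketch below reproduces the 8 bits → 256 example and the observation that a 10% perplexity improvement at a baseline of 200 corresponds to roughly a 2% entropy reduction.

```python
import math

# Entropy of H bits per word corresponds to perplexity 2**H.
print(2 ** 8)                        # 8 bits per word -> perplexity 256

baseline_ppl = 200.0
improved_ppl = 0.9 * baseline_ppl    # a 10% perplexity improvement
h_base = math.log2(baseline_ppl)     # about 7.64 bits
h_impr = math.log2(improved_ppl)     # about 7.49 bits
print(1.0 - h_impr / h_base)         # about 0.02, i.e. a 2% entropy reduction
```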

There has been great effort in the past to discover models that would best represent the patterns found in both real and artificial sequential data, and interestingly enough, there has been limited cooperation between researchers working in different fields, which gave rise to a high diversity of techniques. Natural language was viewed by many as a special case of a sequence of discrete symbols, and its structure was supposedly best captured by various limited artificial grammars (such as context-free grammars), with strong linguistic motivation.

The question of validity of the statistical approach for describing natural language has been raised many times in the past, with maybe the most widely known statement coming from Noam Chomsky:

¹ For simplification, it is later denoted simply as entropy.

