
Classification II – Sentiment Analysis

Accounting for unseen words and other oddities

When we calculated the preceding probabilities, we actually cheated ourselves. We were not calculating the real probabilities, but only rough approximations by means of fractions. We assumed that the training corpus would tell us the whole truth about the real probabilities. It did not. A corpus of only six tweets obviously cannot give us all the information about every tweet that has ever been written. For example, there certainly are tweets containing the word "text"; it is just that we have never seen them. Apparently, our approximation is very rough, so we should account for that. This is often done in practice with "add-one smoothing".

Add-one smoothing is sometimes also referred to as additive smoothing or Laplace smoothing. Note that Laplace smoothing has nothing to do with Laplacian smoothing, which is related to the smoothing of polygon meshes. If we do not smooth by one but by an adjustable parameter alpha greater than zero, it is called Lidstone smoothing.

It is a very simple technique that adds one to all counts. The underlying assumption is that even if we have not seen a given word in the whole corpus, there is still a chance that it exists and that our sample of tweets simply happened to not include it. So, with add-one smoothing we pretend that we have seen every occurrence once more than we actually did. That means that instead of calculating the following:
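$$P(\text{"awesome"} \mid C) = \frac{k}{N}$$

(Here, $C$ stands for the class, $N$ for the number of tweets of that class, and $k$ for how many of those tweets contain the word in question, for instance "awesome".)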

We now calculate:
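$$P(\text{"awesome"} \mid C) = \frac{k + 1}{N + 2}$$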

Why do we add 2 in the denominator? We have to make sure that the end result is again a probability. Therefore, we have to normalize the counts so that all probabilities sum up to one. As, in our current dataset, "awesome" can occur in a tweet either zero times or once, we have two cases. And indeed, we get 1 as the total probability:
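$$P(\text{"awesome"} \mid C) + P(\text{no "awesome"} \mid C) = \frac{k + 1}{N + 2} + \frac{(N - k) + 1}{N + 2} = \frac{N + 2}{N + 2} = 1$$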

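To see the effect in code, here is a minimal sketch of additive smoothing for the "does the tweet contain this word?" feature. The tweets, the helper name smoothed_prob, and the default alpha=1.0 are illustrative assumptions, not the book's example code:

# A minimal sketch of add-one (Lidstone) smoothing for a binary
# "does the tweet contain this word?" feature. The tweets below are
# made up for illustration; they are not the book's example corpus.

def smoothed_prob(word, tweets, alpha=1.0):
    """P(word occurs | class) with additive smoothing.

    alpha=1.0 gives add-one (Laplace) smoothing; any alpha > 0
    gives Lidstone smoothing. The denominator adds 2*alpha because
    the feature has two outcomes: the word occurs or it does not.
    """
    n = len(tweets)  # number of tweets of this class
    k = sum(word in tweet.lower().split() for tweet in tweets)  # tweets containing the word
    return (k + alpha) / (n + 2 * alpha)

positive_tweets = [
    "this is so awesome",
    "awesome day",
    "what a great text",
]

# "awesome" appears in 2 of 3 tweets: unsmoothed 2/3, smoothed (2+1)/(3+2)
print(smoothed_prob("awesome", positive_tweets))   # 0.6
# "crazy" never appears: unsmoothed 0/3, smoothed (0+1)/(3+2)
print(smoothed_prob("crazy", positive_tweets))     # 0.2

The same idea is what the alpha parameter of scikit-learn's naive Bayes classifiers controls: alpha=1.0 corresponds to add-one smoothing, while other positive values give Lidstone smoothing.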
