08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 8

Now, we iterate over all the movies:

    for i in range(nmovies):
        movie_likeness[i] = all_correlations(reviews[:,i], reviews.T)
        movie_likeness[i,i] = -1

We set the diagonal to -1; otherwise, the most similar movie to any movie is itself, which is true, but very unhelpful. This is the same trick we used in Chapter 2, Learning How to Classify with Real-world Examples, when we first introduced nearest neighbor classification. Based on this matrix, we can easily write a function that estimates a rating:

    def nn_movie(movie_likeness, reviews, uid, mid):
        # movie indices sorted by similarity to movie mid (ascending)
        likes = movie_likeness[mid].argsort()
        # reverse the sorting so that the most alike are at the beginning
        likes = likes[::-1]
        # return the rating for the most similar movie that user uid has rated
        for ell in likes:
            if reviews[uid,ell] > 0:
                return reviews[uid,ell]

How well does the preceding function do? Fairly well: its RMSE is only 0.85.
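To make the pieces above concrete, here is a minimal sketch that runs the nearest-neighbor estimate on a tiny, made-up ratings matrix and defines an RMSE helper. Several things here are assumptions for illustration only: the toy data, the rmse function, and the loop-based all_correlations stand-in (it assumes the function returns the Pearson correlation between its first argument and each row of its second, as its use in the text suggests). A rating of 0 is taken to mean "not rated", matching the `> 0` test in nn_movie.

```python
import numpy as np

def all_correlations(bait, target):
    # Stand-in (assumption): correlation between bait and each row of target
    return np.array([np.corrcoef(bait, row)[0, 1] for row in target])

def nn_movie(movie_likeness, reviews, uid, mid):
    # Movies ordered from most to least similar to movie mid
    likes = movie_likeness[mid].argsort()[::-1]
    for ell in likes:
        if reviews[uid, ell] > 0:
            return reviews[uid, ell]

def rmse(predicted, actual):
    # Root mean squared error between two equally long sequences
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Toy data (made up): rows are users, columns are movies, 0 means "not rated"
reviews = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [5, 5, 1, 1],
], dtype=float)

nmovies = reviews.shape[1]
movie_likeness = np.zeros((nmovies, nmovies))
for i in range(nmovies):
    movie_likeness[i] = all_correlations(reviews[:, i], reviews.T)
    movie_likeness[i, i] = -1

# User 0 has not rated movie 2; its closest neighbor is movie 3,
# which user 0 rated 1, so that rating is returned.
pred = nn_movie(movie_likeness, reviews, uid=0, mid=2)
print(pred)
```

Note that this toy run computes the likeness matrix from all users, including the one being predicted, so it only illustrates the mechanics and says nothing about generalization.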

The preceding code does not show you all of the details of the cross-validation. While it would work well in production as it is written, for testing, we need to make sure we have recomputed the likeness matrix afresh without using the user that we are currently testing on (otherwise, we contaminate the test set and we have an inflated estimate of generalization). Unfortunately, this takes a long time, and we do not need the full matrix for each user. You should compute only what you need. This makes the code slightly more complex than the preceding examples. On the companion website for this book, you will find code with all the hairy details. There you will also find a much faster implementation of the all_correlations function.
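As an illustration of how such a speed-up might look, the following is a sketch of a vectorized correlation function. It is not the implementation from the companion website; the name all_correlations_fast is hypothetical, and the sketch assumes that all_correlations(bait, target) returns the Pearson correlation between bait and each row of target. It replaces a per-row loop with a single matrix-vector product.

```python
import numpy as np

def all_correlations_fast(bait, target):
    """Pearson correlation between bait (1-D) and every row of target (2-D).

    Hypothetical sketch; rows with zero variance yield nan,
    as they would with np.corrcoef.
    """
    bait = np.asarray(bait, dtype=float)
    target = np.asarray(target, dtype=float)
    # Center the vector and each row around their own means
    bait_c = bait - bait.mean()
    target_c = target - target.mean(axis=1, keepdims=True)
    # Numerator: one dot product per row, done as a matrix-vector product
    num = target_c @ bait_c
    # Denominator: product of the two norms, per row
    denom = np.sqrt((bait_c ** 2).sum() * (target_c ** 2).sum(axis=1))
    return num / denom

# Compare against a straightforward loop on random data
rng = np.random.RandomState(0)
target = rng.rand(6, 20)
bait = rng.rand(20)
slow = np.array([np.corrcoef(bait, row)[0, 1] for row in target])
fast = all_correlations_fast(bait, target)
print(np.allclose(slow, fast))
```

Like the loop version it is compared against, this sketch treats every entry as a real value, so unrated entries stored as 0 still enter the correlation.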

Combining multiple methods<br />

We can now combine the methods given in the earlier section into a single prediction. For example, we could average the predictions. This is normally good enough, but there is no reason to think that both predictions are similarly good and should thus have the exact same weight of 0.5. It might be that one is better.
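The difference between a plain average and an unequal weighting can be sketched as follows. The function names combine and fit_weights, the example numbers, and the use of ordinary least squares to choose the weights are all assumptions made for illustration; least squares is just one possible way to pick weights and is not necessarily the approach the text goes on to use.

```python
import numpy as np

def combine(pred_a, pred_b, weight_a=0.5):
    # Weighted average of two predictions; weight_a=0.5 is the plain average
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

def fit_weights(pred_a, pred_b, actual):
    # Least-squares weights (no intercept) for combining two predictors.
    # In practice these should be fit on held-out data, not on the test set.
    X = np.column_stack([pred_a, pred_b])
    weights, *_ = np.linalg.lstsq(X, np.asarray(actual, dtype=float), rcond=None)
    return weights

# Made-up predictions from two methods and the "true" ratings
pred_a = np.array([1.0, 2.0, 3.0, 4.0])
pred_b = np.array([4.0, 3.0, 2.0, 2.0])
actual = 0.8 * pred_a + 0.2 * pred_b  # constructed so method a deserves more weight

print(combine(pred_a, pred_b))              # equal weights
print(fit_weights(pred_a, pred_b, actual))  # learned weights
```

Because the example ratings are constructed as 0.8 times the first prediction plus 0.2 times the second, the fitted weights recover roughly those values, whereas the plain average would give each method 0.5.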
