Building Machine Learning Systems with Python - Richert, Coelho
Chapter 8
Now, we iterate over all the movies:

    for i in range(nmovies):
        movie_likeness[i] = all_correlations(reviews[:,i], reviews.T)
        movie_likeness[i,i] = -1

We set the diagonal to -1; otherwise, the most similar movie to any movie is
itself, which is true, but very unhelpful. This is the same trick we used in
Chapter 2, Learning How to Classify with Real-world Examples, when we first
introduced nearest neighbor classification. Based on this matrix, we can easily
write a function that estimates a rating:

    def nn_movie(movie_likeness, reviews, uid, mid):
        # indices of all movies, sorted by increasing likeness to movie mid
        likes = movie_likeness[mid].argsort()
        # reverse the sorting so that most alike are in
        # beginning
        likes = likes[::-1]
        # returns the rating for the most similar movie available
        for ell in likes:
            if reviews[uid,ell] > 0:
                return reviews[uid,ell]

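To make the two steps above concrete, here is a minimal, self-contained sketch on a tiny ratings matrix. The toy data is invented for illustration, and the `all_correlations` shown here is only an assumed, straightforward version built on `np.corrcoef` (one correlation per row of `target`); it is not necessarily the implementation the book uses. The user index is written as `uid` throughout, and a rating of 0 is taken to mean "not rated".

```python
import numpy as np

def all_correlations(bait, target):
    # Assumed helper: corrs[i] is the Pearson correlation between
    # the vector bait and the row target[i].
    return np.array([np.corrcoef(bait, row)[0, 1] for row in target])

def nn_movie(movie_likeness, reviews, uid, mid):
    # Movies ordered from most to least similar to movie mid
    likes = movie_likeness[mid].argsort()[::-1]
    # Return the user's rating for the most similar movie they have rated
    for ell in likes:
        if reviews[uid, ell] > 0:
            return reviews[uid, ell]

# Hypothetical toy data: 5 users (rows) x 3 movies (columns), 0 = not rated
reviews = np.array([
    [0, 4, 1],   # user 0 has not rated movie 0
    [5, 5, 1],
    [4, 4, 2],
    [1, 2, 5],
    [2, 1, 4],
], dtype=float)

nmovies = reviews.shape[1]
movie_likeness = np.zeros((nmovies, nmovies))
for i in range(nmovies):
    movie_likeness[i] = all_correlations(reviews[:, i], reviews.T)
    movie_likeness[i, i] = -1   # a movie must not be its own nearest neighbor

# Movie 1 is the movie most correlated with movie 0 in this toy data,
# so the estimate is the rating user 0 gave to movie 1.
prediction = nn_movie(movie_likeness, reviews, uid=0, mid=0)
print(prediction)
```

Note that, as in the loop above, the correlations here are computed over the raw columns, zeros included; how missing ratings should be treated is a design decision this sketch does not address.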
How well does the preceding function do? Fairly well: its RMSE is
only 0.85.
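For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. A minimal sketch follows; the function name `rmse` and the example numbers are made up for illustration and are unrelated to the 0.85 figure quoted above.

```python
import numpy as np

def rmse(predicted, actual):
    # Root mean squared error between two equally long sequences
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Hypothetical held-out ratings and the corresponding predictions
actual = [4, 3, 5, 2]
predicted = [5, 3, 4, 2]

# Two predictions are off by one star, two are exact:
# mean squared error = (1 + 0 + 1 + 0) / 4 = 0.5
error = rmse(predicted, actual)
print(error)
```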
The preceding code does not show all of the details of the
cross-validation. While it would work well in production as written,
for testing we need to recompute the likeness matrix afresh without
using the user that we are currently testing on (otherwise, we
contaminate the test set and obtain an inflated estimate of
generalization). Unfortunately, this takes a long time, and we do not
need the full matrix for each user. You should compute only what you
need. This makes the code slightly more complex than the preceding
examples. On the companion website for this book, you will find code
with all the hairy details. There you will also find a much faster
implementation of the all_correlations function.
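As a rough illustration of the idea (not the companion code, which is not reproduced here), the sketch below estimates one held-out rating. It drops the test user's row before computing correlations and computes only the single row of the likeness matrix that is needed. The function name `predict_held_out` and the toy data are assumptions made for this example.

```python
import numpy as np

def predict_held_out(reviews, uid, mid):
    # Hypothetical helper: estimate reviews[uid, mid] without letting
    # user uid influence the movie-to-movie similarities.
    others = np.delete(reviews, uid, axis=0)   # all users except uid
    # Only row mid of the likeness matrix is needed for this prediction
    likeness = np.array([np.corrcoef(others[:, mid], col)[0, 1]
                         for col in others.T])
    likeness[np.isnan(likeness)] = -1          # constant columns give NaN
    likeness[mid] = -1                         # never pick the movie itself
    for ell in likeness.argsort()[::-1]:
        # Skip mid so the true rating can never leak into the prediction
        if ell != mid and reviews[uid, ell] > 0:
            return reviews[uid, ell]
    return None                                # user rated no other movie

# Hypothetical toy data: 4 users x 3 movies, 0 = not rated
reviews = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 5],
], dtype=float)

# Pretend the rating of user 0 for movie 0 is the held-out test value
estimate = predict_held_out(reviews, uid=0, mid=0)
print(estimate, "vs. true rating", reviews[0, 0])
```

Computing a single row per test case costs one pass over the movies instead of the full movies-by-movies matrix, which is the saving the note above alludes to.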
Combining multiple methods
We can now combine the methods given in the earlier section into a single
prediction. For example, we could average the predictions. This is normally good
enough, but there is no reason to think that both predictions are similarly good
and should thus have the exact same weight of 0.5. It might be that one is better.
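The difference between a plain average and unequal weights can be sketched as follows. The predictions and ratings are invented, and fitting the weights by ordinary least squares with `np.linalg.lstsq` is just one possible way to choose them, shown here as an assumption rather than as the method the text goes on to use.

```python
import numpy as np

def rmse(predicted, actual):
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Hypothetical predictions from two methods, plus the true ratings
pred_a = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
pred_b = np.array([3.0, 3.0, 4.0, 1.0, 5.0])
actual = np.array([4.0, 3.0, 5.0, 1.0, 4.0])

# Option 1: plain average, that is, a weight of 0.5 for each method
simple_average = 0.5 * pred_a + 0.5 * pred_b

# Option 2: let the data choose the weights by least squares.
# In practice the weights should be fitted on held-out data that is
# separate from the data used for the final evaluation.
X = np.column_stack([pred_a, pred_b])
weights, *_ = np.linalg.lstsq(X, actual, rcond=None)
weighted = X @ weights

print("weights:", weights)
print("RMSE, plain average:", rmse(simple_average, actual))
print("RMSE, fitted weights:", rmse(weighted, actual))
```

On the data used for fitting, the least squares weights can never do worse than the 0.5/0.5 average, because the plain average is one of the combinations the fit considers; whether the gain carries over to new data has to be checked separately.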