01.04.2015 Views


Regression – Recommendations

reg.fit(xc, y[train] - x1)
# We need to perform the same normalization while testing
xc, x1 = movie_norm(x[test])
# Predict all test rows in one call
p = reg.predict(xc).ravel()
# Add the per-movie mean back before comparing with the true ratings
e = (p + x1) - y[test]
err += np.sum(e * e)

We have not yet explained the movie_norm function. It performs per-movie normalization, because some movies are simply better overall and receive higher average marks:

def movie_norm(x):
    xc = x.copy().toarray()

We cannot use xc.mean(1) because the zeros, which stand for missing ratings, would count toward the mean. We only want the mean of the ratings that were actually given:

    x1 = np.array([xi[xi > 0].mean() for xi in xc])

In some cases a row has no ratings at all and its mean is NaN, so we replace those values with zeros using np.nan_to_num:

    x1 = np.nan_to_num(x1)

Now we normalize the input by removing the mean value from the non-zero entries:

    for i in range(xc.shape[0]):
        xc[i] -= (xc[i] > 0) * x1[i]

Implicitly, this also leaves the missing ratings at zero, which after centering corresponds to the movie's average rating. Finally, we return the normalized array and the means:

    return xc, x1
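To see the normalization at work on a small input, here is a minimal sketch. It assumes a dense NumPy array rather than a sparse matrix, and it replaces the row loop with a vectorized computation. The name movie_norm_dense and the toy ratings are hypothetical and are not part of the original code.

```python
import numpy as np


def movie_norm_dense(x):
    """Center each row on the mean of its non-zero entries.

    Hypothetical dense-input variant of movie_norm: `x` is a 2-D array
    in which 0 means "not rated".
    """
    xc = np.asarray(x, dtype=float).copy()
    rated = xc > 0                        # mask of ratings actually given
    counts = rated.sum(axis=1)            # number of ratings per row
    sums = (xc * rated).sum(axis=1)       # sum over rated entries only
    # Mean over rated entries; rows without any rating get a mean of 0,
    # which avoids producing NaN in the first place.
    x1 = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Subtract the row mean from the rated entries only.
    xc -= rated * x1[:, None]
    return xc, x1


# Toy ratings, invented for illustration: 3 rows, 0 = no rating.
ratings = np.array([
    [5, 0, 3],
    [0, 0, 0],
    [2, 4, 0],
])
centered, means = movie_norm_dense(ratings)
print(means)
print(centered)
```

The second row has no ratings at all, so its mean falls back to zero and the row is left unchanged, which mirrors what np.nan_to_num achieves in the loop-based version.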

You might have noticed that we converted the input to a regular (dense) array. This has the added advantage of making the optimization much faster: scikit-learn works well with sparse inputs, but dense arrays are much faster, provided they fit in memory. When they do not, you are forced to use sparse arrays.
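Whether a dense array fits in memory can be estimated before converting. The following is a rough sketch only: the helper name dense_nbytes and the matrix shape are assumptions chosen for illustration, not values from the text.

```python
import numpy as np


def dense_nbytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed to hold an n_rows x n_cols dense array of `dtype`."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


# Hypothetical shape, used only to illustrate the calculation.
n_rows, n_cols = 1500, 900
size_mb = dense_nbytes(n_rows, n_cols) / 1e6
print("dense float64 array needs about %.1f MB" % size_mb)
```

If the estimate is well below the available memory, converting with toarray() is a reasonable choice; otherwise the sparse representation has to stay.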

Compared with simply guessing the average value for that user, this approach is 80 percent better. The results are not spectacular, but it is a start. On the one hand, this is a very hard problem and we cannot expect to be right with every prediction; we perform better when users have given us more reviews. On the other hand, regression is a blunt tool for this job. Note how we learned a completely separate model for each user. In the next chapter, we will look at other methods that go beyond regression for approaching this problem. In those models, we integrate the information from all users and all movies in a more intelligent manner.
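The comparison against the "guess the user's average" baseline mentioned above can be sketched as follows. This is one plausible reading of "percent better", namely the relative reduction in summed squared error. The helper names and all numbers are invented for illustration and are not results from the text.

```python
import numpy as np


def squared_error(pred, truth):
    """Sum of squared differences between predictions and true ratings."""
    diff = np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sum(diff ** 2))


def relative_improvement(model_err, baseline_err):
    """Fraction by which the model's error is lower than the baseline's."""
    return 1.0 - model_err / baseline_err


# Toy ratings for a single user (hypothetical values).
y_train = np.array([4.0, 3.0, 5.0, 4.0])
y_test = np.array([5.0, 3.0])

# Baseline: always predict the user's mean training rating.
baseline_pred = np.full_like(y_test, y_train.mean())
# Stand-in for the model's de-normalized predictions (p + x1).
model_pred = np.array([4.5, 3.5])

baseline_err = squared_error(baseline_pred, y_test)
model_err = squared_error(model_pred, y_test)
gain = relative_improvement(model_err, baseline_err)
print("baseline: %.2f  model: %.2f  improvement: %.0f%%"
      % (baseline_err, model_err, 100 * gain))
```

In the per-user setting described here, both errors would be accumulated over the cross-validation folds and over all users before the ratio is taken.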

