08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho



Chapter 7

Loading the dataset is just basic Python, so let us jump ahead to the learning. We have a sparse matrix whose entries range from 1 to 5 wherever a user has rated a movie; most of the entries are zero, denoting that the user has not rated that movie. This time, for variety, we are going to use the LassoCV class as the regression method:

from sklearn.linear_model import LassoCV
reg = LassoCV(fit_intercept=True, alphas=[.125,.25,.5,1.,2.,4.])
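As a minimal sketch of how this estimator behaves, the following fits the same constructor call on synthetic data. The arrays X and y, the random seed, and cv=4 are assumptions made for illustration and are not part of the chapter's dataset. After fitting, the alpha_ attribute holds whichever of the supplied candidates scored best in the inner cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X = rng.randn(80, 10)
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.randn(80)

# Same candidate penalties as in the text
alphas = [.125, .25, .5, 1., 2., 4.]
reg = LassoCV(fit_intercept=True, alphas=alphas, cv=4)
reg.fit(X, y)

# The selected penalty is one of the supplied candidates
print(reg.alpha_)
```

Because the candidate list is short, the inner search stays cheap, which matters when one model is fitted per user.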

By passing the constructor an explicit set of alphas, we can constrain the values that the inner cross-validation will use. You may note that the values are powers of two, from 1/8 up to 4. We will now write a function which learns a model for the user i:

# isolate this user
u = reviews[i]

We are only interested in the movies that the user u rated, so we must build up the index of those. There are a few NumPy tricks in here. First, u.toarray() converts from a sparse matrix to a regular array. Then, we ravel() that array to convert from a row array (that is, a two-dimensional array with a first dimension of 1) to a simple one-dimensional array. We compare it with zero and ask where this comparison is true. The result, ps, is an array of indices; those indices correspond to movies that the user has rated:

u = u.toarray().ravel()
ps, = np.where(u > 0)
# Build an array with indices 0 to N-1, except i
us = np.delete(np.arange(reviews.shape[0]), i)
# rows of x: movies rated by user i; columns: the other users
x = reviews[us][:,ps].T

Finally, we select only the movies that the user has rated:

y = u[ps]
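The indexing steps above can be traced on a small example. This is a sketch under stated assumptions: the 4-by-5 ratings matrix below is invented for illustration, and it is stored as a SciPy CSR matrix, which may differ from the sparse format the chapter's loader produces.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy ratings: 4 users (rows) x 5 movies (columns), 0 = not rated
reviews = sparse.csr_matrix(np.array([
    [5, 0, 3, 0, 1],
    [4, 0, 0, 2, 0],
    [0, 1, 5, 0, 0],
    [2, 0, 4, 0, 5],
]))

i = 0
u = reviews[i]              # 1 x 5 sparse row for user i
u = u.toarray().ravel()     # dense one-dimensional array of length 5
ps, = np.where(u > 0)       # np.where returns a 1-tuple here; unpack it
us = np.delete(np.arange(reviews.shape[0]), i)  # every user index except i
x = reviews[us][:, ps].T    # rows: movies rated by user i; columns: other users
y = u[ps]                   # user i's ratings for those movies

print(ps, us, y)
print(x.toarray())
```

Each row of x describes one movie by the ratings the other users gave it, and the matching entry of y is the rating that user i gave that movie, so the regression learns to predict user i's ratings from everyone else's.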

Cross-validation is set up as before. Because we have many users, we are going to use only four folds (more would take a long time, and we have enough training data with just 75 percent of the data):

# assumes: from sklearn.model_selection import KFold
err = 0
kf = KFold(n_splits=4)
for train, test in kf.split(x):
    # Now we perform a per-movie normalization
    # this is explained below
    xc,x1 = movie_norm(x[train])
