Building Machine Learning Systems with Python - Richert, Coelho
Chapter 7<br />
The loading of the dataset is just basic Python, so let us jump ahead to the learning.<br />
We have a sparse matrix whose entries range from 1 to 5 wherever a user has rated<br />
a movie; most entries are zero, denoting that the user has not rated that<br />
movie. This time, for variety, we are going to use the LassoCV class as our<br />
regression method:<br />
from sklearn.linear_model import LassoCV<br />
reg = LassoCV(fit_intercept=True, alphas=[.125,.25,.5,1.,2.,4.])<br />
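As a minimal sketch of what the alphas argument does, the snippet below fits the same kind of LassoCV object on synthetic data. The data, the random seed, and the explicit cv=4 setting are assumptions made for illustration only; they are not taken from the movie ratings example.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic regression data (an assumption, used only for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_coef + 0.1 * rng.normal(size=60)

# Candidate penalties: the same grid as in the text
alphas = [.125, .25, .5, 1., 2., 4.]

# cv=4 is an illustrative choice for the inner cross-validation
reg = LassoCV(fit_intercept=True, alphas=alphas, cv=4)
reg.fit(X, y)

# The inner cross-validation can only pick one of the supplied candidates
chosen_alpha = float(reg.alpha_)
print(chosen_alpha)
```

After fitting, reg.alpha_ holds the penalty chosen by the inner cross-validation, and reg.coef_ holds the coefficients of the model refitted with that penalty.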
By passing the constructor an explicit set of alphas, we can constrain the values that<br />
the inner cross-validation will use. You may note that the values are powers of<br />
two, from 1/8 up to 4. We will now write a function which learns a model<br />
for the user i:<br />
# isolate this user<br />
u = reviews[i]<br />
We are only interested in the movies that the user u rated, so we must build up the<br />
index of those. There are a few NumPy tricks in here: u.toarray() converts from a<br />
sparse matrix to a regular array. Then, we ravel() that array to convert from a row<br />
array (that is, a two-dimensional array with a first dimension of 1) to a simple<br />
one-dimensional array. We test which entries are greater than zero and ask where this comparison is true.<br />
The result, ps, is an array of indices; those indices correspond to movies that the user<br />
has rated:<br />
u = u.toarray().ravel()<br />
ps, = np.where(u > 0)<br />
# Build an array with indices [0...N] except i<br />
us = np.delete(np.arange(reviews.shape[0]), i)<br />
x = reviews[us][:,ps].T<br />
Finally, we select only the movies that the user has rated:<br />
y = u[ps]<br />
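The indexing steps above can be traced on a tiny example. The following is a sketch under stated assumptions: the 4-by-5 ratings matrix is made up, and the choice of CSR format is ours, since the text does not say which sparse format reviews uses.

```python
import numpy as np
from scipy import sparse

# Toy ratings (made up): 4 users (rows) x 5 movies (columns); 0 means "not rated"
reviews = sparse.csr_matrix(np.array([
    [5, 0, 3, 0, 1],
    [4, 0, 0, 2, 0],
    [0, 1, 5, 0, 4],
    [2, 0, 4, 0, 5],
]))

i = 0                                # the user we want to model
u = reviews[i].toarray().ravel()     # 1 x 5 sparse row -> flat array of length 5
ps, = np.where(u > 0)                # indices of the movies user i has rated

# All user indices except i
us = np.delete(np.arange(reviews.shape[0]), i)

# One row per movie rated by user i, one column per other user
x = reviews[us][:, ps].T
# Target: the ratings user i gave to those same movies
y = u[ps]

print(ps, us, x.shape, y)
```

Each row of x describes one movie through the ratings the other users gave it, and y holds the rating user i gave that movie, so the regression learns to predict this user's ratings from everyone else's.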
Cross-validation is set up as before. Because we have many users, we are going to<br />
use only four folds (more would take a long time, and we have enough training data<br />
with just 75 percent of the data):<br />
err = 0<br />
kf = KFold(len(y), n_folds=4)<br />
for train,test in kf:<br />
    # Now we perform a per-movie normalization<br />
    # this is explained below<br />
    xc,x1 = movie_norm(x[train])