Regression – Recommendations

The LinearRegression class implements OLS regression as follows:

import numpy as np
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True)

We set the fit_intercept parameter to True in order to add a bias term. This is exactly what we had done before, but in a more convenient interface:

lr.fit(x, y)
p = lr.predict(x)
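This is the kind of computation that fit_intercept=True replaces: appending a column of ones to the feature matrix and solving the least-squares problem directly. A minimal sketch of the manual version, assuming x is the two-dimensional NumPy feature array and y the target vector used above:

# Append a column of ones so that the last coefficient plays the role of the bias
X1 = np.hstack([x, np.ones((len(x), 1))])
w, _, _, _ = np.linalg.lstsq(X1, y)
# These predictions match lr.predict(x) up to numerical precision
p_manual = X1.dot(w)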

Learning and prediction are performed just as for classification. We can then compute the error of the predictions on the training data:

e = p - y
total_error = np.sum(e * e)  # sum of squares
rmse_train = np.sqrt(total_error / len(p))
print('RMSE on training: {}'.format(rmse_train))

We have used a different procedure to compute the root mean square error on the training data. Of course, the result is the same as we had before: 4.6 (it is always good to have these sanity checks to make sure we are doing things correctly).
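The same number can also be obtained with scikit-learn's built-in metric instead of summing the squared errors by hand; a minimal sketch, assuming the p and y arrays from above:

from sklearn.metrics import mean_squared_error

# mean_squared_error returns the mean of squared residuals; RMSE is its square root
rmse_train = np.sqrt(mean_squared_error(y, p))
print('RMSE on training: {}'.format(rmse_train))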

Now, we will use the KFold class to build a 10-fold cross-validation loop and test the generalization ability of linear regression:

from sklearn.cross_validation import KFold

kf = KFold(len(x), n_folds=10)
err = 0
for train, test in kf:
    lr.fit(x[train], y[train])
    p = lr.predict(x[test])
    e = p - y[test]
    err += np.sum(e * e)

rmse_10cv = np.sqrt(err / len(x))
print('RMSE on 10-fold CV: {}'.format(rmse_10cv))

With cross-validation, we obtain a more conservative estimate (that is, the error is greater): 5.6. As in the case of classification, this is a better estimate of how well we could generalize to predict prices.
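In scikit-learn 0.18 and later, the cross_validation module was replaced by model_selection, and KFold no longer takes the number of samples. A sketch of the same loop under that newer API, assuming the x, y, and lr objects from above:

from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
err = 0
for train, test in kf.split(x):
    lr.fit(x[train], y[train])
    p = lr.predict(x[test])
    e = p - y[test]
    err += np.sum(e * e)
rmse_10cv = np.sqrt(err / len(x))
print('RMSE on 10-fold CV: {}'.format(rmse_10cv))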

Ordinary least squares is fast at learning time and returns a simple model, which is fast at prediction time, as the sketch below illustrates. For these reasons, it should often be the first model that you use in a regression problem. However, we are now going to see more advanced methods.
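The simplicity of the fitted model is easy to see: prediction is nothing more than a dot product with the learned coefficient vector plus the intercept. A minimal sketch, assuming the fitted lr object and the two-dimensional x from above:

# Equivalent to lr.predict(x): one matrix-vector product plus the bias term
p_fast = x.dot(lr.coef_) + lr.intercept_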
