
Chapter 7

So, we can see that the data lies between -7.9 and -0.5. Now that we have a feel for the range of the data, we can check what happens when we use OLS to predict. Note that we can use exactly the same classes and methods as before:

import numpy as np
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True)
lr.fit(data, target)
p = lr.predict(data)  # predict on the training data; returns a flat array
e = p - target        # e is the 'error': difference of prediction and reality
total_sq_error = np.sum(e*e)
rmse_train = np.sqrt(total_sq_error/len(p))
print(rmse_train)

The error is not exactly zero because of rounding error, but it is very close: 0.0025 (much smaller than the standard deviation of the target, which is the natural comparison value).

When we use cross-validation (the code is very similar to what we used before in the Boston example), we get something very different: 0.78. Remember that the standard deviation of the data is only 0.6. This means that if we always "predict" the mean value of -3.5, we have a root mean square error of 0.6! So, with OLS, the training error is insignificant, but when generalizing, it is very large and the prediction is actually harmful: we would have done better (in terms of root mean square error) by simply predicting the mean value every time!
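The cross-validation loop itself is not listed on this page; the following is a minimal sketch of how it might look. The use of KFold with 5 folds, shuffling, and a fixed random seed are assumptions for illustration, not details from the book; data and target are the arrays used above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Illustrative cross-validation loop (5 folds and shuffling are assumptions)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
total_sq_error = 0.0
for train_idx, test_idx in kf.split(data):
    lr = LinearRegression(fit_intercept=True)
    lr.fit(data[train_idx], target[train_idx])
    e = lr.predict(data[test_idx]) - target[test_idx]  # held-out errors for this fold
    total_sq_error += np.sum(e*e)
rmse_cv = np.sqrt(total_sq_error/len(target))  # RMSE over all held-out predictions
print(rmse_cv)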

Training and generalization error

When the number of features is greater than the number of examples, you always get zero training error with OLS, but this is rarely a sign that your model will do well in terms of generalization. In fact, you may get zero training error and have a completely useless model.
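To see this effect in isolation, here is a small synthetic illustration (not from the book): the target is pure noise, yet with twice as many features as examples, OLS reproduces it exactly on the training data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic illustration: 10 examples, 20 features, target is pure noise
rng = np.random.RandomState(0)
X = rng.randn(10, 20)
y = rng.randn(10)

lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)
p = lr.predict(X)
print(np.sqrt(np.mean((p - y)**2)))  # training RMSE: zero up to floating point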

One solution, naturally, is to use regularization to counteract the overfitting. We can try the same cross-validation loop with an elastic net learner, having set the penalty parameter to 1. Now, we get an RMSE of 0.4, which is better than just "predicting the mean". In a real-life problem, it is hard to know when we have done all we can, as perfect prediction is almost always impossible.
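For completeness, the cross-validation loop above might be adapted to an elastic net as sketched here. Mapping the book's "penalty parameter" to scikit-learn's alpha argument is an assumption, and the default mix of L1 and L2 penalties is kept.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

# Same loop as before, now with an elastic net learner; alpha=1.0 stands in
# for the penalty parameter of 1 from the text (this mapping is an assumption)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
total_sq_error = 0.0
for train_idx, test_idx in kf.split(data):
    en = ElasticNet(alpha=1.0)
    en.fit(data[train_idx], target[train_idx])
    e = en.predict(data[test_idx]) - target[test_idx]
    total_sq_error += np.sum(e*e)
print(np.sqrt(total_sq_error/len(target)))  # cross-validated RMSE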

