Building Machine Learning Systems with Python - Richert, Coelho

Chapter 7

Root mean squared error and prediction

The root mean squared error corresponds approximately to an estimate of the standard deviation. Since most of the data lies within at most two standard deviations of the mean, we can double our RMSE to obtain a rough confidence interval. This is only fully valid if the errors are normally distributed, but it is roughly correct even if they are not.
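To make this concrete, here is a minimal sketch of the calculation using NumPy on a few made-up predictions (the values are illustrative only, not from the Boston data):

```python
import numpy as np

# Hypothetical true values and model predictions, just to show the arithmetic.
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.4, 5.0])

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Doubling the RMSE gives a rough +/- band around each prediction,
# assuming the errors are approximately normal.
interval = 2 * rmse
```

A prediction of 4.0 with this model would then come with a rough interval of 4.0 ± `interval`.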

Multidimensional regression

So far, we have only used a single variable for prediction, the number of rooms per dwelling. We will now use all the data we have to fit a model using multidimensional regression. We now try to predict a single output (the average house price) based on multiple inputs.

The code looks very much like before:

x = boston.data
# we still add a bias term, but now we must use np.concatenate,
# which concatenates two arrays/lists, because we
# have several input variables in v
x = np.array([np.concatenate([v, [1]]) for v in boston.data])
y = boston.target
s,total_error,_,_ = np.linalg.lstsq(x,y)

Now, the root mean squared error is only 4.7! This is better than what we had before, which indicates that the extra variables did help. Unfortunately, we can no longer easily display the results as we have a 14-dimensional regression.
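Note that `total_error` returned by `np.linalg.lstsq` is the sum of squared residuals, not the RMSE itself, so one more step is needed to recover the 4.7 figure. The sketch below shows that step on synthetic data standing in for the Boston features (13 input variables plus a bias column; the dataset itself and the random weights are assumptions for illustration):

```python
import numpy as np

# Synthetic stand-in for boston.data / boston.target: 100 samples,
# 13 features, target generated from a random linear model plus noise.
rng = np.random.RandomState(0)
data = rng.rand(100, 13)
target = data @ rng.rand(13) + rng.randn(100) * 0.1

# Append a bias term of 1 to every sample, as in the text.
x = np.array([np.concatenate([v, [1]]) for v in data])

# rcond=None avoids a deprecation warning in newer NumPy versions.
s, total_error, _, _ = np.linalg.lstsq(x, target, rcond=None)

# total_error is the sum of squared residuals; divide by the number
# of samples and take the square root to obtain the RMSE.
rmse = np.sqrt(total_error[0] / len(target))
```

Running the same computation on the Boston data is what yields the RMSE of 4.7 quoted above.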

Cross-validation for regression

If you remember when we first introduced classification, we stressed the importance of cross-validation for checking the quality of our predictions. In regression, this is not always done. In fact, we have only measured the error on the training data so far. This is a mistake if you want to confidently infer the generalization ability. Since ordinary least squares is a very simple model, this is often not a very serious mistake (the amount of overfitting is slight). However, we should still test this empirically, which we will do now using scikit-learn. We will also use its linear regression classes as they will be easier to replace with more advanced methods later in the chapter:

from sklearn.linear_model import LinearRegression
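To preview how cross-validated error can be measured with this class, here is a minimal sketch using k-fold splits (the synthetic data and the 5-fold choice are assumptions for illustration, not the book's exact setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for the Boston features and target.
rng = np.random.RandomState(0)
X = rng.rand(200, 13)
y = X @ rng.rand(13) + rng.randn(200) * 0.1

# Fit on each training fold, measure RMSE on the held-out fold.
errors = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    lr = LinearRegression()
    lr.fit(X[train_idx], y[train_idx])
    pred = lr.predict(X[test_idx])
    errors.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

cv_rmse = np.mean(errors)
```

The cross-validated RMSE is typically somewhat higher than the training RMSE, and that gap is exactly the overfitting the text warns about.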
