Root mean squared error and prediction

The root mean squared error corresponds approximately to an estimate of the standard deviation. Since most of the data is at most two standard deviations from the mean, we can double our RMSE to obtain a rough confidence interval. This is only completely valid if the errors are normally distributed, but it is roughly correct even if they are not.
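For example, as a minimal sketch (the RMSE and prediction values here are hypothetical, chosen only to illustrate the arithmetic):

rmse = 4.7 # hypothetical RMSE of a fitted model (in thousands of dollars)
prediction = 25.0 # hypothetical predicted house price (in thousands of dollars)
# about 95 percent of observations fall within two standard deviations,
# so prediction +/- 2 * rmse gives a rough 95 percent interval
lower, upper = prediction - 2 * rmse, prediction + 2 * rmse
print(lower, upper) # 15.6 34.4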

Multidimensional regression

So far, we have only used a single variable for prediction: the number of rooms per dwelling. We will now use all the data we have to fit a model using multidimensional regression. We now try to predict a single output (the average house price) based on multiple inputs.

The code looks very much like before:

import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data
# we still add a bias term, but now we must use np.concatenate, which
# concatenates two arrays/lists, because we
# have several input variables in v
x = np.array([np.concatenate((v, [1])) for v in boston.data])
y = boston.target
s, total_error, _, _ = np.linalg.lstsq(x, y)
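The second value returned by np.linalg.lstsq is the sum of squared residuals, so the RMSE quoted below can be recovered from it directly; a minimal sketch, continuing from the code above:

# total_error is a length-1 array holding the sum of squared residuals;
# dividing by the number of examples and taking the square root gives the RMSE
rmse = np.sqrt(total_error[0] / len(x))
print(rmse)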

Now, the root mean squared error is only 4.7! This is better than what we had before, which indicates that the extra variables did help. Unfortunately, we can no longer easily display the results, as we have a 14-dimensional regression.

Cross-validation for regression

If you remember when we first introduced classification, we stressed the importance of cross-validation for checking the quality of our predictions. In regression, this is not always done. In fact, we have only discussed the training error so far. This is a mistake if you want to confidently infer the generalization ability. Since ordinary least squares is a very simple model, this is often not a very serious mistake (the amount of overfitting is slight). However, we should still test this empirically, which we will do now using scikit-learn. We will also use its linear regression classes, as they will be easier to replace with more advanced methods later in the chapter:

from sklearn.linear_model import LinearRegression
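As a minimal sketch of cross-validated regression (this uses the modern sklearn.model_selection API rather than whatever the original code used, and assumes the x and y arrays built above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

lr = LinearRegression()
kf = KFold(n_splits=5)
errors = []
for train, test in kf.split(x):
    # fit on the training folds, then measure squared error on the held-out fold
    lr.fit(x[train], y[train])
    p = lr.predict(x[test])
    errors.append((p - y[test]) ** 2)
# aggregate the squared errors from all held-out folds into one RMSE
rmse_cv = np.sqrt(np.mean(np.concatenate(errors)))
print(rmse_cv)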
