

Here I use the term model selection to mean which predictor variables (including powers and interactions) we will use. If we have data on many predictors, we almost certainly will not be able to use them all, for the following reason:

15.17.1 The Overfitting Problem in Regression

Recall that in Section 15.10 we mentioned that we could add polynomial terms to a regression model. But if we carry this notion to its extreme, we get absurd results. If we fit a polynomial of degree 99 to our 100 points, we can make our fitted curve pass exactly through every point! That would clearly give us a meaningless, useless curve; we would simply be fitting the noise.
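As a quick illustration, here is a small R simulation sketch; the setup (100 noisy points around a sine curve, and all the variable names) is my own hypothetical example, not from the text:

# Hypothetical simulation: polynomial fits of increasing degree to n = 100
# noisy points; the true mean function here is an arbitrary choice
set.seed(9999)
n <- 100
x <- runif(n)
y <- sin(2*pi*x) + rnorm(n,sd=0.3)
# residual sum of squares of a degree-d polynomial fit
rss <- function(d) sum(resid(lm(y ~ poly(x,d)))^2)
sapply(c(1,2,5,20,99),rss)

The degree-99 fit has 100 coefficients for 100 points, so its residual sum of squares is essentially 0; the curve passes through every point, yet it would predict new data terribly.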

Recall that we analyzed this problem in Section 14.1.4, in our chapter on modeling. There we noted an absolutely fundamental principle in statistics:

In choosing between a simpler model and a more complex one, the latter is more accurate only if either

• we have enough data to support it, or

• the complex model is sufficiently different from the simpler one.

This is extremely important in regression analysis, because we often have so many candidate variables that we can easily form highly complex models.

In the regression context, the phrase "we have enough data to support the model" means (in the parametric model case) that we have enough data so that the confidence intervals for the β_i will be reasonably narrow. For fixed n, the more complex the model, the wider the resulting confidence intervals will tend to be.
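To see this effect numerically, consider the following R sketch, a made-up setting (names and values are mine) in which only the first of 20 candidate predictors is actually related to Y:

# Hypothetical example: same data, same n, but compare the confidence
# interval for the coefficient of X1 under a 1-predictor model and a
# 20-predictor model
set.seed(9999)
n <- 50
x <- matrix(rnorm(n*20),nrow=n)   # 20 candidate predictors
y <- 1.5*x[,1] + rnorm(n)         # only the first one matters
d <- data.frame(y,x)              # columns y, X1, ..., X20
smallfit <- lm(y ~ X1, data=d)
bigfit <- lm(y ~ ., data=d)
confint(smallfit)["X1",]  # interval for the X1 coefficient
confint(bigfit)["X1",]    # tends to be wider, from the same data

Both intervals are centered near the true value 1.5, but the one from the 20-predictor model tends to be noticeably wider, even though both fits use exactly the same data.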

If we use too many predictor variables,¹¹ our data is "diluted," by being "shared" by so many of the β̂_i. As a result, Var(β̂_i) will be large, with big implications: whether our goal is Prediction or Understanding, our estimates will be so poor that neither goal is achieved.

On the other hand, if some predictor variable is really important (i.e., its β_i is far from 0), then it may pay to include it, even though the confidence intervals might get somewhat wider.

The questions raised in turn by the above considerations, namely "How much data is enough data?" and "How different from 0 is sufficiently different?", are addressed below in Section 15.17.3.

A detailed mathematical example of overfitting in regression is presented in my paper A Careful Look at the Use of Statistical Methodology in Data Mining (book chapter), by N. Matloff, in

¹¹In the ALOHA example above, b, b², b³ and b⁴ are separate predictors, even though they are of course correlated.
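That correlation is easy to verify in R; for instance, taking b uniform on (0,1) (an arbitrary choice on my part):

# powers of a predictor are separate predictors, yet strongly correlated;
# the uniform distribution here is just an illustrative choice
set.seed(9999)
b <- runif(1000)
round(cor(cbind(b,b^2,b^3,b^4)),2)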
