16.11.2012 Views

Data Mining Methods and Models


MULTICOLLINEARITY

TABLE 3.12 Regression Results, with Variance Inflation Factors Indicating a Multicollinearity Problem

The regression equation is
Rating = 52.2 - 2.20 Sugars + 4.14 Fiber + 2.59 shelf 2 - 0.0421 Potass

Predictor       Coef    SE Coef        T       P    VIF
Constant      52.184      1.632    31.97   0.000
Sugars       -2.1953     0.1854   -11.84   0.000    1.4
Fiber         4.1449     0.7433     5.58   0.000    6.5
shelf 2        2.588      1.805     1.43   0.156    1.4
Potass      -0.04208    0.02520    -1.67   0.099    6.7

S = 6.06446   R-Sq = 82.3%   R-Sq(adj) = 81.4%

Analysis of Variance

Source            DF        SS       MS       F       P
Regression         4   12348.8   3087.2   83.94   0.000
Residual Error    72    2648.0     36.8
Total             76   14996.8

Source     DF   Seq SS
Sugars      1   8701.7
Fiber       1   3416.1
shelf 2     1    128.5
Potass      1    102.5
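The summary statistics in Table 3.12 can be checked directly from the Analysis of Variance entries. The following is a minimal sketch of that arithmetic, not part of the original output. The helper `implied_r_sq` is a hypothetical name, and it assumes the standard definition VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing predictor i on the other predictors.

```python
import math

# Values copied from the Analysis of Variance portion of Table 3.12.
ss_regression = 12348.8
ss_error = 2648.0
ss_total = 14996.8
df_regression = 4
df_error = 72
df_total = 76

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_sq = ss_regression / ss_total                    # compare with R-Sq
r_sq_adj = 1 - ms_error / (ss_total / df_total)    # compare with R-Sq(adj)
s = math.sqrt(ms_error)                            # compare with S
f_stat = ms_regression / ms_error                  # compare with F


def implied_r_sq(vif):
    """Hypothetical helper: invert VIF = 1 / (1 - R_i^2) to recover R_i^2."""
    return 1 - 1 / vif


print(f"R-Sq      = {r_sq:.3f}")
print(f"R-Sq(adj) = {r_sq_adj:.3f}")
print(f"S         = {s:.4f}")
print(f"F         = {f_stat:.2f}")
print(f"R^2 implied by VIF 6.5 (Fiber):  {implied_r_sq(6.5):.3f}")
print(f"R^2 implied by VIF 6.7 (Potass): {implied_r_sq(6.7):.3f}")
```

Under that definition, the VIF values for Fiber and Potass mean that roughly 85% of the variability in each is explained by the remaining predictors, which is what signals the multicollinearity problem.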

Note that we need to standardize the variables involved in the composite, to avoid the possibility that the greater variability of one of the variables will overwhelm that of the other variable. For example, the standard deviation of fiber among all cereals is 2.38 grams, and the standard deviation of potassium is 71.29 milligrams. (The grams/milligrams scale difference is not at issue here. What is relevant is the difference in variability, even on their respective scales.) Figure 3.11 illustrates the difference in variability.

We therefore proceed to perform the regression of nutritional rating on the variables $\text{sugars}_z$ and shelf 2, and $W = (\text{fiber}_z + \text{potassium}_z)/2$. The results are provided in Table 3.13.
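Forming the composite can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the data are synthetic stand-ins (not the cereals data set), the function names `standardize` and `composite` are hypothetical, and the sample standard deviation is assumed for the z-scores.

```python
import random
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def composite(a_z, b_z):
    """Average two standardized variables element by element."""
    return [(a + b) / 2 for a, b in zip(a_z, b_z)]


# Synthetic stand-ins for fiber (grams) and potassium (milligrams).
# Potassium is built to track fiber closely, mimicking the collinearity.
random.seed(0)
fiber = [max(0.0, random.gauss(2.0, 2.4)) for _ in range(77)]
potassium = [30 * f + random.gauss(40, 25) for f in fiber]

fiber_z = standardize(fiber)
potassium_z = standardize(potassium)
w = composite(fiber_z, potassium_z)

# After standardizing, both inputs have standard deviation 1, so neither
# one dominates the composite the way raw potassium would dominate raw fiber.
print("raw sd:", statistics.stdev(fiber), statistics.stdev(potassium))
print("z sd:  ", statistics.stdev(fiber_z), statistics.stdev(potassium_z))
print("mean of W:", statistics.mean(w))
```

Averaging the raw values instead would produce a variable driven almost entirely by potassium, because its spread is so much larger on its own scale.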

Note first that the multicollinearity problem seems to have been resolved, with the VIF values all near 1. Note also, however, that the regression results are rather disappointing, with the values of $R^2$, $R^2_{\text{adj}}$, and $s$ all underperforming the model results found in Table 3.7, from the model $y = \beta_0 + \beta_1(\text{sugars}) + \beta_2(\text{fiber}) + \beta_4(\text{shelf 2}) + \varepsilon$, which did not even include the potassium variable.
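The drop in VIF after replacing two correlated predictors with their composite can be illustrated with a simplified sketch. It assumes only two predictors at a time, in which case the VIF of each reduces to 1/(1 - r^2), where r is their Pearson correlation. The model in the text has more predictors, so this is a simplification. The data are synthetic and the function names are hypothetical.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


def standardize(values):
    """Return z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Synthetic data: potassium tracks fiber; sugars is generated independently.
random.seed(1)
n = 77
sugars = [random.gauss(7, 4) for _ in range(n)]
fiber = [random.gauss(2, 2.4) for _ in range(n)]
potassium = [30 * f + random.gauss(40, 25) for f in fiber]

# Before: fiber and potassium enter as separate, highly correlated predictors.
vif_before = vif_two_predictors(fiber, potassium)

# After: they are replaced by the composite W, paired here with sugars.
w = [(a + b) / 2 for a, b in zip(standardize(fiber), standardize(potassium))]
vif_after = vif_two_predictors(sugars, w)

print("VIF with fiber and potassium separate:", vif_before)
print("VIF with sugars and composite W:      ", vif_after)
```

A VIF near 1 says only that the predictors no longer overlap. It says nothing about how well the composite predicts the response, which is why the fit statistics can still disappoint.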

What is going on here? The problem stems from the fact that the fiber variable is a very good predictor of nutritional rating, especially when coupled with sugar content, as we shall see later when we perform best subsets regression. Therefore,
