2DkcTXceO


Features of Big Data

[Figure 43.2: density plots. Left panel: panels (a) and (b) show densities of γ_n and σ_n² for p = 10, 100, 1000, 5000. Right panel: panels (a) and (b) show densities of γ_n and σ_n² for s = 1, 2, 5, 10.]

FIGURE 43.2
Distributions of spurious correlations. Left panel: Distributions of γ_n for the null model when |Ŝ| = 1 and their associated estimates of σ² = 1 for various choices of p. Right panel: Distributions of γ_n for the model Y = 2X_1 + 0.3X_2 + ε and their associated estimates of σ² = 1 for various choices of |Ŝ| but fixed p = 1000. The sample size n = 50. Adapted from Fan et al. (2012).

model Y = 2X_1 + 0.3X_2 + ε and use the stepwise selection method to recruit variables. Again, the spurious variables are selected mainly due to their spurious correlation with ε, the unobserved but realized vector of random noises. As shown in the two right panels of Figure 43.2, the spurious correlation is very large and σ̂² gets notably more biased when |Ŝ| gets larger.

Underestimation of residual variance leads the statistical inference astray. Variables are declared statistically significant that are not in reality, and this leads to faulty scientific conclusions.

43.5 Incidental endogeneity

High dimensionality also gives rise to incidental endogeneity. Scientists collect covariates that are potentially related to the response. As there are many covariates, some of those variables can be incidentally correlated with the residual noise. This can cause model selection inconsistency and incorrect
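
The spurious-correlation effect described for the null model can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the procedure of Fan et al. (2012): covariates and noise are drawn as independent standard normals, the single covariate with the largest absolute sample correlation with Y is selected (so |Ŝ| = 1), and the residual variance is estimated from a least-squares fit with an intercept, dividing by n - 2. The function names `spurious_stats` and `simulate` are hypothetical.

```python
import numpy as np


def spurious_stats(n, p, rng):
    """One draw under the null model Y = eps, with X independent of Y.

    Returns (gamma, sigma2_hat): the largest absolute sample correlation
    between Y and any covariate, and the residual variance estimate after
    regressing Y on that single best covariate (assumed: with intercept).
    """
    X = rng.standard_normal((n, p))   # covariates, unrelated to Y
    y = rng.standard_normal(n)        # pure noise, true sigma^2 = 1

    # Sample correlation of every column of X with y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    gamma = np.max(np.abs(corr))

    # For simple regression with an intercept, RSS = SST * (1 - r^2).
    rss = (yc @ yc) * (1.0 - gamma**2)
    sigma2_hat = rss / (n - 2)
    return gamma, sigma2_hat


def simulate(n=50, ps=(10, 100, 1000, 5000), reps=200, seed=0):
    """Average gamma and sigma2_hat over `reps` draws for each p."""
    rng = np.random.default_rng(seed)
    out = {}
    for p in ps:
        draws = np.array([spurious_stats(n, p, rng) for _ in range(reps)])
        out[p] = draws.mean(axis=0)   # (mean gamma, mean sigma2_hat)
    return out


if __name__ == "__main__":
    for p, (g, s2) in simulate().items():
        print(f"p={p:5d}  mean max|corr|={g:.3f}  mean sigma2_hat={s2:.3f}")
```

Under these assumptions, the average maximum correlation should grow with p while the average variance estimate falls further below the true value of 1, which is the qualitative pattern the text attributes to spurious correlation. The same mechanism is what makes some covariates incidentally correlated with the noise when many are collected.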
