11.07.2015 Views

2DkcTXceO


J. Fan 511

[Figure panels: "Realizations of two independent normal" (scatter plot); "Maximum absolute sample correlation" (density, p = 10^3 and p = 10^4); "Maximum absolute multiple correlation" (density, p = 10^3 and p = 10^4)]

FIGURE 43.1 Illustration of spurious correlation. Left panel: a typical realization of $Z_1$ with its most spuriously correlated variable ($p = 1000$); middle and right panels: distributions of $\hat{r}$ and $\hat{R}$ for $p = 1000$ and $p = 10{,}000$. The sample size is $n = 50$.

if the genes in $\hat{S}$ are truly responsible for a disease, we may end up wrongly pronouncing $Z_1$ as the gene that is responsible for the disease.

We now examine the impact of spurious correlation on statistical inference. Consider a linear model

$$Y = X^\top \beta + \varepsilon, \qquad \sigma^2 = \mathrm{var}(\varepsilon).$$

The residual variance based on a selected set $\hat{S}$ of variables is

$$\hat{\sigma}^2 = \frac{1}{n - |\hat{S}|}\, Y^\top (I_n - P_{\hat{S}})\, Y, \qquad P_{\hat{S}} = X_{\hat{S}} (X_{\hat{S}}^\top X_{\hat{S}})^{-1} X_{\hat{S}}^\top.$$

When the variables are not data selected and the model is unbiased, the degrees-of-freedom adjustment makes the residual variance estimate unbiased. However, the situation is completely different when the variables are data selected. For example, when $\beta = 0$, one has $Y = \varepsilon$ and all selected variables are spurious. If the number of selected variables $|\hat{S}|$ is much smaller than $n$, then

$$\hat{\sigma}^2 = \frac{1}{n - |\hat{S}|}\, (1 - \gamma_n^2)\, \|\varepsilon\|^2 \approx (1 - \gamma_n^2)\, \sigma^2,$$

where $\gamma_n^2 = \varepsilon^\top P_{\hat{S}}\, \varepsilon / \|\varepsilon\|^2$. Therefore, $\sigma^2$ is underestimated by a fraction of about $\gamma_n^2$.

Suppose that we select only one spurious variable. This variable must then be the one most correlated with $Y$ or, equivalently, with $\varepsilon$. Because the spurious correlation is high, the bias is large. The two left panels of Figure 43.2 depict the distributions of $\gamma_n$ along with the associated estimates of $\hat{\sigma}^2$ for different choices of $p$. Clearly, the bias increases with the dimension $p$.

When multiple spurious variables are selected, the biases of residual variance estimation become more pronounced, since the spurious correlation gets larger, as demonstrated in Figure 43.1. To illustrate this, consider the linear
