11.07.2015 Views

2DkcTXceO


510 Features of Big Data

learning studies, also give rise to intensive computation. When the sample size is large, the computation of summary statistics such as correlations among all variables is expensive. Yet statistical methods often involve repeated evaluations of such functions. Parallel computing and other updating techniques are required. Therefore, scalability of techniques to both dimensionality and the number of cases should be borne in mind when developing statistical procedures.

43.4 Spurious correlation

Spurious correlation is a feature of high dimensionality. It refers to variables that are not correlated theoretically but whose sample correlation is high. To illustrate the concept, consider a random sample of size $n = 50$ of $p$ independent standard $N(0, 1)$ random variables. Thus the population correlation between any two random variables is zero and their corresponding sample correlation should be small. This is indeed the case when the dimension is small in comparison with the sample size. When $p$ is large, however, spurious correlations start to appear. To illustrate this point, let us compute

$$\hat{r} = \max_{j \ge 2} \widehat{\mathrm{corr}}(Z_1, Z_j),$$

where $\widehat{\mathrm{corr}}(Z_1, Z_j)$ is the sample correlation between variables $Z_1$ and $Z_j$. Similarly, we can compute

$$\hat{R} = \max_{|S| = 5} \widehat{\mathrm{corr}}(Z_1, Z_S), \tag{43.1}$$

which is the maximum multiple correlation between $Z_1$ and $Z_S$ with $1 \notin S$, namely, the correlation between $Z_1$ and its best linear predictor using $Z_S$. In the implementation, we use the forward selection algorithm as an approximation to compute $\hat{R}$, which yields a value no larger than $\hat{R}$ but avoids computing all $\binom{p}{5}$ multiple $R^2$ in (43.1). This experiment is repeated 200 times.

The empirical distributions of $\hat{r}$ and $\hat{R}$ are shown in Figure 43.1. The spurious correlation $\hat{r}$ is centered around .45 for $p = 1000$ and .55 for $p = 10{,}000$. The corresponding values are .85 and .91 when the multiple correlation $\hat{R}$ is used. Theoretical results on the order of the spurious correlation $\hat{r}$ are given in Cai and Jiang (2012) and Fan et al. (2012), but the order of $\hat{R}$ remains unknown.

The impact of spurious correlation includes false scientific discoveries and false statistical inferences. In terms of scientific discoveries, $Z_1$ and $Z_{\hat{S}}$ are practically indistinguishable when $n = 50$, given that their correlation is around .9 for a set $\hat{S}$ with $|\hat{S}| = 5$. If $Z_1$ represents the expression of a gene that is responsible for a disease, we can discover five genes $\hat{S}$ that have a similar predictive power even though they are unrelated to the disease. Similarly,
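The simulation described in Section 43.4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`max_sample_correlation`, `forward_multiple_correlation`, `simulate`), the seed, and the particular greedy rule (at each step, add the variable that most reduces the residual sum of squares) are choices made here, since the text says only that "the forward selection algorithm" is used.

```python
import numpy as np


def max_sample_correlation(z):
    """Max over j >= 2 of the sample correlation between column 1 and column j."""
    zc = z - z.mean(axis=0)               # center each variable
    zc /= np.linalg.norm(zc, axis=0)      # scale each column to unit length
    return (zc[:, 1:].T @ zc[:, 0]).max()


def forward_multiple_correlation(z, size=5):
    """Greedy forward-selection approximation to the multiple correlation
    between column 1 and its best `size`-variable linear predictor.

    Assumed rule: at each step add the variable that most reduces the
    residual sum of squares, given the variables already selected.
    """
    zc = z - z.mean(axis=0)
    y = zc[:, 0]
    x = zc[:, 1:]
    residual = y.copy()
    tss = y @ y
    for _ in range(size):
        norms = np.einsum("ij,ij->j", x, x)
        # Columns already selected have been projected to (near) zero;
        # dividing by inf gives them a score of zero.
        safe = np.where(norms > 1e-12, norms, np.inf)
        scores = (x.T @ residual) ** 2 / safe
        j = int(np.argmax(scores))
        q = x[:, j] / np.sqrt(norms[j])
        residual = residual - (q @ residual) * q   # update the fit
        x = x - np.outer(q, q @ x)                 # orthogonalize the rest
    return np.sqrt(1.0 - (residual @ residual) / tss)


def simulate(n=50, p=1000, reps=200, size=5, seed=0):
    """Repeat the experiment `reps` times; return arrays of r-hat and R-hat."""
    rng = np.random.default_rng(seed)
    r_hat = np.empty(reps)
    big_r_hat = np.empty(reps)
    for b in range(reps):
        z = rng.standard_normal((n, p))   # independent N(0, 1) variables
        r_hat[b] = max_sample_correlation(z)
        big_r_hat[b] = forward_multiple_correlation(z, size)
    return r_hat, big_r_hat
```

Calling `simulate(n=50, p=1000)` and taking the median of each returned array should give values in the neighborhood of those reported in the text for p = 1000, although exact numbers depend on the seed and on the selection rule assumed above. Because the first greedy step picks the variable with the largest squared correlation and each later step can only increase the fit, the forward-selection value is never smaller than the pairwise maximum in the same replicate.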
