Features of Big Data

As an illustration, we simulate n data points from each of the populations N(µ_0, I_p) and N(µ_1, I_p), in which p = 4500, µ_0 = 0, and µ_1 is a realization of a mixture of a point mass at 0 with probability .98 and the standard double exponential distribution with probability .02. Therefore, most components have no discriminative power, yet some components are very powerful for classification. Indeed, among the 2%, or about 90, realizations from the double exponential distribution, several components are very large and many are small.

Consider the distance-based classifier, which classifies x to class 1 when

‖x − µ_1‖² ≤ ‖x − µ_0‖², or equivalently, β⊤(x − µ) ≥ 0,

where β = µ_1 − µ_0 and µ = (µ_0 + µ_1)/2. Letting Φ denote the cumulative distribution function of a standard normal random variable, we find that the misclassification rate is Φ(−‖µ_1 − µ_0‖/2), which is effectively zero because, by the Law of Large Numbers,

‖µ_1 − µ_0‖ ≈ √(4500 × .02 × 1) ≈ 9.48.

However, when β is estimated by the sample mean, the resulting classification rule behaves like a random guess due to the accumulation of noise.

To aid intuition, we drew n = 100 data points from each class and selected the best m features from the p-dimensional space according to the absolute values of the components of µ_1. This is an infeasible procedure, but it can be well estimated when m is small (Fan and Fan, 2008). We then projected the m-dimensional data onto their first two principal components. Figure 43.5 presents these projections for various values of m. Clearly, when m = 2, the two projections have high discriminative power. They still do when m = 100, as the accumulated signal offsets the accumulated noise. There are about 90 non-vanishing signals, though some are very small; their overall expected magnitude is approximately 9.48, as noted above. When m = 500 or 4500, the two projections have no discriminative power at all, due to noise accumulation. See also Hall et al. (2005) for a geometric representation of high-dimension, low-sample-size data for further intuition.

43.7 Sparsest solution in high confidence set

To attenuate the noise accumulation issue, we frequently impose sparsity on the underlying parameter β_0. At the same time, the information on β_0 contained in the data comes through statistical modeling, and the latter is summarized by confidence sets for β_0 in R^p. Combining these two pieces of information, a general solution to high-dimensional statistics is naturally the sparsest solution in a high-confidence set.
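A well-known concrete instance of this principle is the Dantzig selector of Candès and Tao, which replaces the non-convex sparsest-solution problem with its ℓ1 relaxation and minimizes ‖β‖_1 inside the confidence set {β : ‖X⊤(y − Xβ)‖_∞ ≤ λ}. A minimal sketch (not from this text; the linear-programming reformulation with auxiliary variables u_j ≥ |β_j| is standard, and all names and the toy problem are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Minimum-l1 point of the confidence set
    {beta : ||X.T @ (y - X @ beta)||_inf <= lam},
    solved as a linear program over z = [beta, u] with |beta_j| <= u_j."""
    n, p = X.shape
    G, b = X.T @ X, X.T @ y
    c = np.concatenate([np.zeros(p), np.ones(p)])   # objective: sum(u) = ||beta||_1
    I, Z = np.eye(p), np.zeros((p, p))
    A_ub = np.vstack([
        np.hstack([ I, -I]),   #  beta - u <= 0
        np.hstack([-I, -I]),   # -beta - u <= 0
        np.hstack([-G,  Z]),   #  X.T y - G beta <= lam
        np.hstack([ G,  Z]),   #  G beta - X.T y <= lam
    ])
    b_ub = np.concatenate([np.zeros(2 * p), lam - b, lam + b])
    bounds = [(None, None)] * p + [(0, None)] * p   # beta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

# Noiseless toy problem: one active coefficient out of eight.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))
beta_true = np.zeros(8)
beta_true[0] = 2.0
beta_hat = dantzig_selector(X, X @ beta_true, lam=1.0)
```

With noiseless data and a small λ, the recovered vector is close to the sparse truth: the confidence-set constraint pins down the fit while the ℓ1 objective drives the inactive coordinates to zero.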
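Returning to the classification example, the oracle error formula Φ(−‖µ_1 − µ_0‖/2) and the effect of replacing the true means by sample means can both be checked by direct simulation. A minimal sketch, assuming the double exponential draws are scaled to unit variance so that ‖µ_1 − µ_0‖ ≈ √(4500 × .02) ≈ 9.48 as in the text; the test-sample size is an arbitrary choice:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, n = 4500, 100

# mu_1: point mass at 0 w.p. .98, double exponential w.p. .02
# (scaled to unit variance, so ||mu_1 - mu_0|| is about sqrt(p * .02)).
mu0 = np.zeros(p)
mu1 = np.where(rng.random(p) < 0.02,
               rng.laplace(scale=1 / np.sqrt(2), size=p), 0.0)
beta, mu = mu1 - mu0, (mu0 + mu1) / 2

# Test sample: 500 points per class.
X0 = rng.standard_normal((500, p)) + mu0
X1 = rng.standard_normal((500, p)) + mu1

def error_rate(b, m):
    """Misclassification rate of the rule: assign class 1 iff b.(x - m) >= 0."""
    wrong0 = ((X0 - m) @ b >= 0).sum()   # class-0 points sent to class 1
    wrong1 = ((X1 - m) @ b < 0).sum()    # class-1 points sent to class 0
    return (wrong0 + wrong1) / 1000

# Oracle rule (true means) versus the rule estimated from
# n = 100 training points per class.
err_oracle = error_rate(beta, mu)
T0 = rng.standard_normal((n, p)) + mu0
T1 = rng.standard_normal((n, p)) + mu1
beta_hat = T1.mean(axis=0) - T0.mean(axis=0)
mu_hat = (T0.mean(axis=0) + T1.mean(axis=0)) / 2
err_est = error_rate(beta_hat, mu_hat)

print(f"||mu1 - mu0|| = {np.linalg.norm(beta):.2f}")
print(f"theoretical error, known means: {norm.cdf(-np.linalg.norm(beta) / 2):.2e}")
print(f"empirical error, known means:     {err_oracle:.4f}")
print(f"empirical error, estimated means: {err_est:.4f}")
```

The oracle rule is essentially error-free, in line with Φ(−9.48/2); the estimated rule pays for the p − 90 pure-noise coordinates that the sample means drag into β̂.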
