13.07.2015 Views

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

SHOW MORE
SHOW LESS
  • No tags were found...

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

FinchModern methods for <strong>the</strong> detection of multivariate outlierseach observation has associated with it a value for <strong>the</strong> generalizedvariance. Observations that are more distant from <strong>the</strong> bulk of <strong>the</strong><strong>data</strong> will have larger values of <strong>the</strong> generalized variance. For p = 2variables, observations with a generalized variance greater thanq 3 + 1.5(q 3 − q 1 ) (5)would be considered outliers, where q 1 <strong>and</strong> q 2 are <strong>the</strong> lower <strong>and</strong>upper quartiles,respectively,of <strong>the</strong> generalized variances. For morethan two variables, <strong>the</strong> generalized variances are compared with√M G + χ 2 0.975.p (q 3 − q 1 ), (6)where M G is <strong>the</strong> median of <strong>the</strong> generalized variance values <strong>and</strong>χ 2 0.975.p . Figure 3 is a plot of X <strong>and</strong> Y, with outliers identified using<strong>the</strong> MGV approach. In this graph, outliers are denoted by 0, whichis different than notation used in Figures 1 <strong>and</strong> 2. These graphs areincluded here exactly as taken from <strong>the</strong> R software output, whichwill be used extensively in <strong>the</strong> following examples.PROJECTION-BASED OUTLIER DETECTIONAno<strong>the</strong>r alternative for identifying multivariate outliers is basedon <strong>the</strong> notion of <strong>the</strong> depth of one <strong>data</strong> point among a set of o<strong>the</strong>rpoints. The idea of depth was described by Tukey (1975), <strong>and</strong>later exp<strong>and</strong>ed upon by Donoho <strong>and</strong> Gasko (1992). In general,depth can be thought of as <strong>the</strong> relative location of an observationvis-à-vis ei<strong>the</strong>r edge (upper or lower) in a set of <strong>data</strong>.In <strong>the</strong> univariate case, this simply means determining to whichedge a given observation more closely lies (i.e., maximum orminimum value), <strong>and</strong> <strong>the</strong>n calculating <strong>the</strong> proportion of casesbetween that observation <strong>and</strong> its closest edge. The larger thisproportion, <strong>the</strong> deeper <strong>the</strong> observation lies in <strong>the</strong> univariateFIGURE 3 | Scatterplot of observations identified as outliers based on<strong>the</strong> MGV method.<strong>data</strong>. While ma<strong>the</strong>matically somewhat more challenging, conceptuallyprojection methods of multivariate outlier detectionwork in much <strong>the</strong> same way. However, ra<strong>the</strong>r than determining<strong>the</strong> proximity to a single edge, <strong>the</strong> algorithm must identify<strong>the</strong> proximity to <strong>the</strong> edge of <strong>the</strong> multivariate space. This processis carried out using <strong>the</strong> method of projection that is describedbelow.For <strong>the</strong> purposes of this explanation, we will avoid presenting<strong>the</strong> ma<strong>the</strong>matical equations that underlie <strong>the</strong> projection-basedoutlier detection approach. The interested reader is encouragedto refer to Wilcox (2005) for a more technical treatment of thismethodology. Following is a conceptual description of two commonlyused approaches to carrying out this technique when p = 2.In <strong>the</strong> first method, <strong>the</strong> algorithm begins by identifying <strong>the</strong> multivariatecenter of <strong>the</strong> <strong>data</strong> using an acceptable approach, such as<strong>the</strong> multivariate mean after application of MCD or MVE. Next,for each point (X i ) <strong>the</strong> following steps are carried out:(1) A line is drawn connecting <strong>the</strong> multivariate center <strong>and</strong> pointX i .(2) A line perpendicular to <strong>the</strong> line in 1 is <strong>the</strong>n drawn from eachof <strong>the</strong> o<strong>the</strong>r observations, X j .(3) The location where <strong>the</strong> line in 2 intersects with <strong>the</strong> line in 1 is<strong>the</strong> projected depth (d ij ) of that <strong>data</strong> point for <strong>the</strong> line.(4) Steps 1–3 are <strong>the</strong>n repeated such that each of <strong>the</strong> n <strong>data</strong> pointsis connected to <strong>the</strong> multivariate center of <strong>the</strong> <strong>data</strong> with a line,<strong>and</strong> corresponding values of d ij are calculated for each of <strong>the</strong>o<strong>the</strong>r observations.(5) For a given observation, each of its depth values, d ij , is comparedwith a st<strong>and</strong>ard (to be described below). If for any singleprojection <strong>the</strong> observation is an outlier, <strong>the</strong>n it is classified asan outlier for purposes of future analyses.As mentioned earlier, <strong>the</strong>re is an alternative approach to <strong>the</strong>projection method, which is not based on finding <strong>the</strong> multivariatecenter of <strong>the</strong> distribution. Ra<strong>the</strong>r, all (n 2 − n)/2 possible linesare drawn between all pairs of observations in <strong>the</strong> <strong>data</strong>set. Then,<strong>the</strong> approach outlined above for calculating d ij is used for eachof <strong>the</strong>se lines. Thus, ra<strong>the</strong>r than having n−1 such d ij values, eachobservation will have (n2 −n)2− 1 indicators of depth. In all o<strong>the</strong>rways, this second approach is identical to <strong>the</strong> first, however. Priorresearch has demonstrated that this second method might be moreaccurate than <strong>the</strong> first, but it is not clear how great an advantageit actually has in practice (Wilcox, 2005). Fur<strong>the</strong>rmore, because itmust examine all possible lines in <strong>the</strong> set of <strong>data</strong>, method 2 canrequire quite a bit more computational time, particularly for large<strong>data</strong>sets.The literature on multivariate outlier detection using <strong>the</strong>projection-based method includes two different criteria againstwhich an observation can be judged as an outlier. The first of <strong>the</strong>seis essentially identical to that used for <strong>the</strong> MGV in Eq. 6, with <strong>the</strong>exception that M G is replaced by M D , <strong>the</strong> median of <strong>the</strong> d ij for thatprojection. Observations associated with values of d ij larger thanthis cut score are considered to be outliers for that projection. Analternative comparison criterion is√ ( )M D + χ 2 MADi0.975.p 0.6745(7)www.frontiersin.org July 2012 | Volume 3 | Article 211 | 63

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!