Modern methods for the detection of multivariate outliers

Table 1 | Summary of outlier detection methods.

| Method | Equation/approach | Reference | Strengths | Weaknesses |
|---|---|---|---|---|
| $D_i^2$ | $(\mathbf{x}_i - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})$ | Mahalanobis (1936) | Intuitively easy to understand; easy to calculate; familiar to other researchers | Sensitive to outliers; assumes data are continuous |
| MVE | Identify the subset of the data contained within the ellipsoid that has minimized volume | Rousseeuw and Leroy (1987) | Yields a mean with the maximum possible breakdown point | May remove as much as 50% of the sample |
| MCD | Identify the subset of the data that minimizes the determinant of the covariance matrix | Rousseeuw and van Driessen (1999) | Yields a mean with the maximum possible breakdown point | May remove as much as 50% of the sample |
| MGV | Calculate a MAD version of $D^2$ as $\sum_{j=1}^{n}\sqrt{\sum_{l=1}^{p}\left(\frac{x_{jl}-x_{il}}{\mathrm{MAD}_l}\right)^2}$ to identify the most central points; calculate the variance of this central set as additional observations are added one by one; examine this generalized variance and retain those with values less than the adjusted median $M_G + \sqrt{\chi^2_{0.975,\,p}}\,(q_3 - q_1)$ | Wilcox (2005) | Typically removes fewer observations than either MVE or MCD | Generally does not have as high a breakdown point as MVE or MCD |
| P1 | Identify the multivariate center of the data using MCD or MVE and then determine each point's relative distance from this center (depth); use the MGV criterion based on this depth to identify outliers | Donoho and Gasko (1992) | Approximates an affine equivariant outlier detection method; may not exclude as many cases as MVE or MCD | Will not typically lead to a mean with the maximum possible breakdown point |
| P2 | Identify all possible lines between all pairs of observations in order to determine the depth of each point | Donoho and Gasko (1992) | Some evidence that this method is more accurate than P1 in terms of identifying outliers | Extensive computational time, particularly for large datasets |
| P3 | Same approach as P1 except that the criterion for identifying outliers is $M_D + \sqrt{\chi^2_{0.975,\,p}}\left(\frac{\mathrm{MAD}_i}{0.6745}\right)$ | Donoho and Gasko (1992) | May yield a mean with a higher breakdown point than other projection methods | Will likely lead to the exclusion of more observations as outliers than other projection approaches |

In the expression for $D_i^2$,

- $\mathbf{x}_i$ = vector of scores on the set of $p$ variables for subject $i$
- $\bar{\mathbf{x}}$ = vector of sample means on the set of $p$ variables
- $\mathbf{S}$ = covariance matrix for the $p$ variables
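For readers who want to compute this quantity directly, the following is a minimal sketch (not part of the original article; it assumes NumPy and SciPy are available) that calculates $D_i^2$ for every observation and flags cases using the $\chi^2$-quantile cutoff discussed below.

```python
# Minimal sketch (not from the original article): compute the Mahalanobis
# distance D_i^2 for every row of a data matrix X and flag observations
# whose distance exceeds a chi-square quantile with p degrees of freedom.
import numpy as np
from scipy.stats import chi2

def mahalanobis_d2(X):
    """Return D_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar) for each row of X."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)                 # vector of sample means
    S = np.cov(X, rowvar=False)           # sample covariance matrix
    S_inv = np.linalg.inv(S)
    diff = X - xbar
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

def flag_outliers(X, alpha=0.005):
    """Flag rows whose D_i^2 exceeds the chi-square (1 - alpha) quantile."""
    d2 = mahalanobis_d2(X)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])   # p degrees of freedom
    return d2 > cutoff

# Example: 100 observations on p = 3 variables, with one implanted outlier
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] = [8.0, -8.0, 8.0]
print(np.where(flag_outliers(X))[0])      # row 0 should be flagged
```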
A number of recommendations exist in the literature for identifying when this value is large, i.e., when an observation might be an outlier. The approach used here will be to compare $D_i^2$ to the $\chi^2$ distribution with $p$ degrees of freedom and declare an observation to be an outlier if its value exceeds the quantile for some inverse probability, i.e., $\chi^2_p(0.005)$ (Mahalanobis, 1936).

$D^2$ is easy to compute using existing software and allows for direct hypothesis testing regarding outlier status (Wilcox, 2005). Despite these advantages, $D^2$ is sensitive to outliers because it is based on the sample covariance matrix, $\mathbf{S}$, which is itself sensitive to outliers (Wilcox, 2005). In addition, $D^2$ assumes that the data are continuous and not categorical, so that when the data are ordinal, for example, it may be inappropriate for outlier detection (Zijlstra et al., 2007). Given these problems, researchers have developed alternative approaches to multivariate outlier detection that are more robust and more flexible than $D^2$.

MINIMUM VOLUME ELLIPSOID

One of the earliest alternative approaches to outlier detection was the Minimum Volume Ellipsoid (MVE), developed by Rousseeuw and Leroy (1987). In concept, the goal of this method is to identify a subsample of observations of size $h$ (where $h < n$) that creates the smallest-volume ellipsoid of data points, based on the values of the variables. By definition, this ellipsoid should be free of outliers, and estimates of central tendency and dispersion would be obtained using just this subset of observations. The MVE approach to dealing with outliers can, in practice, be all but intractable to carry out, as the number of possible ellipsoids to investigate will typically be quite large. Therefore, an alternative approach is to take a large number of random samples of size $h$ with replacement, where

$$h = \frac{n}{2} + 1, \qquad (2)$$

and calculate the volume of the ellipsoid created by each. The final sample to be used in further analyses is the one that yields the smallest ellipsoid. An example of such an ellipsoid based on MVE can be seen in Figure 1. The circles represent observations that have been retained, while those marked with a star represent outliers that will be removed from the sample for future analyses.

MINIMUM COVARIANCE DETERMINANT

The minimum covariance determinant (MCD) approach to outlier detection is similar to the MVE in that it searches for a portion of the data that eliminates the presence and impact of outliers. However, whereas MVE seeks to do this by minimizing the volume of an ellipsoid created by the retained points, MCD does it by minimizing the determinant of the covariance matrix of the retained observations.
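To illustrate how the MCD criterion is typically applied in practice, the sketch below uses scikit-learn's MinCovDet estimator, which is based on the FAST-MCD algorithm of Rousseeuw and van Driessen (1999). This example is not from the original article; the data, the robust cutoff, and the use of the estimator's default support size (roughly half the sample, in the spirit of Eq. 2) are illustrative assumptions.

```python
# Sketch (not from the original article): robust outlier detection with the
# minimum covariance determinant, via scikit-learn's FAST-MCD implementation.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:5] += 10.0                              # implant five gross outliers

# Fit the MCD estimator; by default it searches for the subset of roughly
# half the observations whose covariance matrix has minimal determinant.
mcd = MinCovDet(random_state=0).fit(X)

# Squared Mahalanobis distances computed from the robust MCD
# location and covariance estimates.
d2 = mcd.mahalanobis(X)

# Flag observations beyond the chi-square 0.995 quantile with p = 3 df.
cutoff = chi2.ppf(0.995, df=X.shape[1])
print(np.where(d2 > cutoff)[0])            # rows 0-4 should appear here
```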
