FinchModern methods for <strong>the</strong> detection of multivariate outliersTable 3 | Correlations.Variable TIMEDRS ATTDRUG ATTHOUSEFULL DATA SET (N = 465)TIMEDRS 1.00 0.10 0.13ATTDRUG 0.10 1.00 0.03ATTHOUSE 0.13 0.03 1.00MAHALANOBIS DISTANCE (N = 452)TIMEDRS 1.00 0.07 0.08ATTDRUG 0.07 1.00 0.02ATTHOUSE 0.08 0.02 1.00MCD (N = 235)TIMEDRS 1.00 0.25 0.19ATTDRUG 0.25 1.00 0.26ATTHOUSE 0.19 0.26 1.00MVE (N = 235)TIMEDRS 1.00 0.33 0.05ATTDRUG 0.33 1.00 0.32ATTHOUSE 0.05 0.32 1.00MGV (N = 463)TIMEDRS 1.00 0.07 0.10ATTDRUG 0.07 1.00 0.02ATTHOUSE 0.10 0.02 1.00PROJECTION 1 (N = 425)TIMEDRS 1.00 0.06 0.10ATTDRUG 0.06 1.00 0.03ATTHOUSE 0.10 0.03 1.00PROJECTION 2 (N = 422)TIMEDRS 1.00 0.04 0.11ATTDRUG 0.04 1.00 0.03ATTHOUSE 0.11 0.03 1.00PROJECTION 3 (N = 425)TIMEDRS 1.00 0.06 0.10ATTDRUG 0.06 1.00 0.03ATTHOUSE 0.10 0.03 1.00<strong>and</strong> MGV <strong>data</strong>sets. In addition, probably as a result of removinga number of individuals with large values, <strong>the</strong> mean of TIME-DRS was substantially lower for <strong>the</strong> MCD <strong>and</strong> MVE <strong>data</strong> thanfor <strong>the</strong> full, Mahalanobis, <strong>and</strong> MGV <strong>data</strong>sets. As noted with <strong>the</strong>boxplot, <strong>the</strong> three projection methods produced means, skewness,<strong>and</strong> kurtosis values that generally fell between those of MCD/MVE<strong>and</strong> <strong>the</strong> o<strong>the</strong>r approaches. Also of interest in this regard is <strong>the</strong> relativesimilarity in distributional characteristics of <strong>the</strong> ATTDRUGvariable across outlier detection methods. This would suggest that<strong>the</strong>re were few if any outliers present in <strong>the</strong> <strong>data</strong> for this variable.Finally, <strong>the</strong> full <strong>and</strong> MGV <strong>data</strong>sets had kurtosis values that weresomewhat larger than those of <strong>the</strong> o<strong>the</strong>r methods included here.Finally, in order to ascertain how <strong>the</strong> various outlier detectionmethods impacted relationships among <strong>the</strong> variables we estimatedcorrelations for each approach, with results appearing in Table 3.For <strong>the</strong> full <strong>data</strong>, correlations among <strong>the</strong> three variables were alllow, with <strong>the</strong> largest being 0.13. When outliers were detected <strong>and</strong>removed using D 2 , MGV <strong>and</strong> <strong>the</strong> three projection methods, <strong>the</strong>correlations were attenuated even more. In contrast, correlationscalculated using <strong>the</strong> MCD <strong>data</strong> were uniformly larger than thoseof any method, except for <strong>the</strong> value between TIMEDRS <strong>and</strong>ATTDRUG, which was larger for <strong>the</strong> MVE <strong>data</strong>. On <strong>the</strong> o<strong>the</strong>rh<strong>and</strong>, <strong>the</strong> correlation between TIMEDRS <strong>and</strong> ATTHOUSE wassmaller for MVE than for any of <strong>the</strong> o<strong>the</strong>r methods used here.DISCUSSIONThe purpose of this study was to demonstrate seven methods ofoutlier detection designed especially for multivariate <strong>data</strong>. Thesemethods were compared based upon distributions of individualvariables, <strong>and</strong> relationships among <strong>the</strong>m. The strategy involvedfirst identification of outlying observations followed by <strong>the</strong>irremoval prior to <strong>data</strong> analysis. A brief summary of results foreach methodology appears in Table 4. These results were markedlydifferent across methods, based upon both distributional <strong>and</strong>correlational measures. Specifically, <strong>the</strong> MGV <strong>and</strong> Mahalanobisdistance approaches resulted in <strong>data</strong> that was fairly similar to<strong>the</strong> full <strong>data</strong>. In contrast, MCD <strong>and</strong> MVE both created <strong>data</strong>setsthat were very different, with variables more closely adhering to<strong>the</strong> normal distribution, <strong>and</strong> with generally (though not universally)larger relationships among <strong>the</strong> variables. It is important tonote that <strong>the</strong>se latter two methods each removed 230 observations,or nearly half of <strong>the</strong> <strong>data</strong>, which may be of some concernin particular applications <strong>and</strong> which will be discussed in moredetail below. Indeed, this issue is not one to be taken lightly.While <strong>the</strong> MCD/MVE approaches produced <strong>data</strong>sets that weremore in keeping with st<strong>and</strong>ard assumptions underlying many statisticalprocedures (i.e., normality), <strong>the</strong> representativeness of <strong>the</strong>sample may be called into question. Therefore, it is importantthat researchers making use of ei<strong>the</strong>r of <strong>the</strong>se methods closelyinvestigate whe<strong>the</strong>r <strong>the</strong> resulting sample resembles <strong>the</strong> population.Indeed, it is possible that identification of such a large proportionof <strong>the</strong> sample as outliers is really more of an indication that <strong>the</strong>population is actually bimodal. Such major differences in performancedepending upon <strong>the</strong> methodology used point to <strong>the</strong> needfor researchers to be familiar with <strong>the</strong> panoply of outlier detectionapproaches available. This work provides a demonstration of<strong>the</strong> methods, comparison of <strong>the</strong>ir relative performance with a wellknown <strong>data</strong>set, <strong>and</strong> computer software code so that <strong>the</strong> reader canuse <strong>the</strong> methods with <strong>the</strong>ir own <strong>data</strong>.There is not one universally optimal approach for identifyingoutliers, as each research problem presents <strong>the</strong> <strong>data</strong> analyst withspecific challenges <strong>and</strong> questions that might be best addressedusing a method that is not optimal in ano<strong>the</strong>r scenario. This studyhelps researchers <strong>and</strong> <strong>data</strong> analysts to see <strong>the</strong> range of possibilitiesavailable to <strong>the</strong>m when <strong>the</strong>y must address outliers in <strong>the</strong>ir <strong>data</strong>.In addition, <strong>the</strong>se results illuminate <strong>the</strong> impact of using <strong>the</strong> variousmethods for a representative <strong>data</strong>set, while <strong>the</strong> R code in <strong>the</strong>Appendix provides <strong>the</strong> researcher with <strong>the</strong> software tools necessaryto use each technique. A major issue that researchers mustconsider is <strong>the</strong> tradeoff between a method with a high breakdownpoint (i.e., that is impervious to <strong>the</strong> presence of many outliers)<strong>and</strong> <strong>the</strong> desire to retain as much of <strong>the</strong> <strong>data</strong> as possible. From thisexample, it is clear that <strong>the</strong> methods with <strong>the</strong> highest breakdownpoints, MCD/MVE, retained <strong>data</strong> that more clearly conformed to<strong>the</strong> normal distribution than did <strong>the</strong> o<strong>the</strong>r approaches, but at <strong>the</strong>cost of approximately half of <strong>the</strong> original <strong>data</strong>. Thus, researchersmust consider <strong>the</strong> purpose of <strong>the</strong>ir efforts to detect outliers. If <strong>the</strong>ywww.frontiersin.org July 2012 | Volume 3 | Article 211 | 67
FinchModern methods for <strong>the</strong> detection of multivariate outliersTable 4 | Summary of results for outlier detection methods.MethodOutliersremovedImpact on distributionsImpact oncorrelationsCommentsD 2 i 13 Reduced skewness <strong>and</strong> kurtosis whencompared to full <strong>data</strong> set, but did not fullyeliminate <strong>the</strong>m. Reduced variation inTIMEDRSMVE 230 Largely eliminated skewness <strong>and</strong> greatlylowered kurtosis in TIMEDRS. Also reducedkurtosis in ATTHOUSE when compared tofull <strong>data</strong>. Greatly lowered both <strong>the</strong> mean <strong>and</strong>st<strong>and</strong>ard deviation of TIMEDRSComparable correlationsto <strong>the</strong> full <strong>data</strong>setResulted in markedlyhigher correlations fortwo pairs of variables,than was seen with <strong>the</strong>o<strong>the</strong>r methods, exceptfor MCDMCD 230 Very similar pattern to that displayed by MVE Yielded relatively higherMGV 2 Yielded distributional results very similar tothose of <strong>the</strong> full <strong>data</strong>setP1 40 Resulted in lower mean, st<strong>and</strong>ard deviation,skewness, <strong>and</strong> kurtosis values for TIMEDRSwhen compared to <strong>the</strong> full <strong>data</strong>, D 2 , <strong>and</strong>MGV, though not when compared to MVE<strong>and</strong> MCD. Yielded comparable skewness <strong>and</strong>kurtosis to o<strong>the</strong>r methods for ATTDRUG <strong>and</strong>ATTHOUSE, <strong>and</strong> somewhat greater variationfor <strong>the</strong>se o<strong>the</strong>r variables, as wellcorrelation values thanany of <strong>the</strong> o<strong>the</strong>rmethods, except MVE,<strong>and</strong> no very low valuesVery similar correlationstructure as found in <strong>the</strong>full <strong>data</strong>set <strong>and</strong> for D 2Very comparablecorrelation results to <strong>the</strong>full <strong>data</strong>set, as well as D 2<strong>and</strong> MGVResulted in a sample with somewhat less skewed <strong>and</strong>kurtotic variables, though <strong>the</strong>y did remain clearlynon-normal in nature. The correlations among <strong>the</strong>variables remained low, as with <strong>the</strong> full <strong>data</strong>setReduced <strong>the</strong> sample size substantially, but also yieldedvariables with distributional characteristics much morefavorable to use with common statistical analyses; i.e.,very little skewness or kurtosis. In addition, correlationcoefficients were generally larger than for <strong>the</strong> o<strong>the</strong>rmethods, suggesting greater linearity in relationshipsamong <strong>the</strong> variablesProvided a sample with very characteristics to that ofMVEIdentified very few outliers, leading to a sample that didnot differ meaningfully from <strong>the</strong> originalAppears to find a “middle ground” between MVE/MCD<strong>and</strong> D 2 /MGV in terms of <strong>the</strong> number of outliersidentified <strong>and</strong> <strong>the</strong> resulting impact on variabledistributions <strong>and</strong> correlationsP2 43 Very similar results to P1 Very similar results to P1 Provided a sample yielding essentially <strong>the</strong> same resultsP3 40 Identical results to P1 Identical results to P1 In this case, resulted in an identical sample to that of P1as P1are seeking a “clean” set of <strong>data</strong> upon which <strong>the</strong>y can run a varietyof analyses with little or no fear of outliers having an impact, <strong>the</strong>nmethods with a high breakdown point, such as MCD <strong>and</strong> MVE areoptimal. On <strong>the</strong> o<strong>the</strong>r h<strong>and</strong>, if an examination of outliers revealsthat <strong>the</strong>y are from <strong>the</strong> population of interest, <strong>the</strong>n a more carefulapproach to dealing with <strong>the</strong>m is necessary. Removing such outlierscould result in a <strong>data</strong>set that is more tractable with respectto commonly used parametric statistical analyses but less representativeof <strong>the</strong> general population than is desired. Of course, <strong>the</strong>converse is also true in that a <strong>data</strong>set replete with outliers mightproduce statistical results that are not generalizable to <strong>the</strong> populationof real interest, when <strong>the</strong> outlying observations are not partof this population.FUTURE RESEARCHThere are a number of potential areas for future research in <strong>the</strong>area of outlier detection. Certainly, future work should use <strong>the</strong>semethods with o<strong>the</strong>r extant <strong>data</strong>sets having different characteristicsthan <strong>the</strong> one featured here. For example, <strong>the</strong> current set of<strong>data</strong> consisted of only three variables. It would be interesting tocompare <strong>the</strong> relative performance of <strong>the</strong>se methods when morevariables are present. Similarly, examining <strong>the</strong>m for much smallergroups would also be useful, as <strong>the</strong> current sample is fairly largewhen compared to many that appear in social science research. Inaddition, a simulation study comparing <strong>the</strong>se methods with oneano<strong>the</strong>r would also be warranted. Specifically, such a study couldbe based upon <strong>the</strong> generation of <strong>data</strong>sets with known outliers<strong>and</strong> distributional characteristics of <strong>the</strong> non-outlying cases. Thevarious detection methods could <strong>the</strong>n be used <strong>and</strong> <strong>the</strong> resultingretained <strong>data</strong>sets compared to <strong>the</strong> known non-outliers in terms of<strong>the</strong>se various characteristics. Such a study would be quite useful ininforming researchers regarding approaches that might be optimalin practice.CONCLUSIONThis study should prove helpful to those faced with a multivariateoutlier problem in <strong>the</strong>ir <strong>data</strong>. Several methods of outlierdetection were demonstrated <strong>and</strong> great differences among <strong>the</strong>mwere observed, in terms of <strong>the</strong> characteristics of <strong>the</strong> observationsretained. These findings make it clear that researchers must bevery thoughtful in <strong>the</strong>ir treatment of outlying observations. Simplyrelying on Mahalanobis Distance because it is widely used<strong>Frontiers</strong> in Psychology | Quantitative Psychology <strong>and</strong> Measurement July 2012 | Volume 3 | Article 211 | 68