Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

More documents

Recommendations

Info

FinchModern methods for the detection of multivariate outliersTable 3 | Correlations.Variable TIMEDRS ATTDRUG ATTHOUSEFULL DATA SET (N = 465)TIMEDRS 1.00 0.10 0.13ATTDRUG 0.10 1.00 0.03ATTHOUSE 0.13 0.03 1.00MAHALANOBIS DISTANCE (N = 452)TIMEDRS 1.00 0.07 0.08ATTDRUG 0.07 1.00 0.02ATTHOUSE 0.08 0.02 1.00MCD (N = 235)TIMEDRS 1.00 0.25 0.19ATTDRUG 0.25 1.00 0.26ATTHOUSE 0.19 0.26 1.00MVE (N = 235)TIMEDRS 1.00 0.33 0.05ATTDRUG 0.33 1.00 0.32ATTHOUSE 0.05 0.32 1.00MGV (N = 463)TIMEDRS 1.00 0.07 0.10ATTDRUG 0.07 1.00 0.02ATTHOUSE 0.10 0.02 1.00PROJECTION 1 (N = 425)TIMEDRS 1.00 0.06 0.10ATTDRUG 0.06 1.00 0.03ATTHOUSE 0.10 0.03 1.00PROJECTION 2 (N = 422)TIMEDRS 1.00 0.04 0.11ATTDRUG 0.04 1.00 0.03ATTHOUSE 0.11 0.03 1.00PROJECTION 3 (N = 425)TIMEDRS 1.00 0.06 0.10ATTDRUG 0.06 1.00 0.03ATTHOUSE 0.10 0.03 1.00and MGV datasets. In addition, probably as a result of removinga number of individuals with large values, the mean of TIME-DRS was substantially lower for the MCD and MVE data thanfor the full, Mahalanobis, and MGV datasets. As noted with theboxplot, the three projection methods produced means, skewness,and kurtosis values that generally fell between those of MCD/MVEand the other approaches. Also of interest in this regard is the relativesimilarity in distributional characteristics of the ATTDRUGvariable across outlier detection methods. This would suggest thatthere were few if any outliers present in the data for this variable.Finally, the full and MGV datasets had kurtosis values that weresomewhat larger than those of the other methods included here.Finally, in order to ascertain how the various outlier detectionmethods impacted relationships among the variables we estimatedcorrelations for each approach, with results appearing in Table 3.For the full data, correlations among the three variables were alllow, with the largest being 0.13. When outliers were detected andremoved using D 2 , MGV and the three projection methods, thecorrelations were attenuated even more. In contrast, correlationscalculated using the MCD data were uniformly larger than thoseof any method, except for the value between TIMEDRS andATTDRUG, which was larger for the MVE data. On the otherhand, the correlation between TIMEDRS and ATTHOUSE wassmaller for MVE than for any of the other methods used here.DISCUSSIONThe purpose of this study was to demonstrate seven methods ofoutlier detection designed especially for multivariate data. Thesemethods were compared based upon distributions of individualvariables, and relationships among them. The strategy involvedfirst identification of outlying observations followed by theirremoval prior to data analysis. A brief summary of results foreach methodology appears in Table 4. These results were markedlydifferent across methods, based upon both distributional andcorrelational measures. Specifically, the MGV and Mahalanobisdistance approaches resulted in data that was fairly similar tothe full data. In contrast, MCD and MVE both created datasetsthat were very different, with variables more closely adhering tothe normal distribution, and with generally (though not universally)larger relationships among the variables. It is important tonote that these latter two methods each removed 230 observations,or nearly half of the data, which may be of some concernin particular applications and which will be discussed in moredetail below. Indeed, this issue is not one to be taken lightly.While the MCD/MVE approaches produced datasets that weremore in keeping with standard assumptions underlying many statisticalprocedures (i.e., normality), the representativeness of thesample may be called into question. Therefore, it is importantthat researchers making use of either of these methods closelyinvestigate whether the resulting sample resembles the population.Indeed, it is possible that identification of such a large proportionof the sample as outliers is really more of an indication that thepopulation is actually bimodal. Such major differences in performancedepending upon the methodology used point to the needfor researchers to be familiar with the panoply of outlier detectionapproaches available. This work provides a demonstration ofthe methods, comparison of their relative performance with a wellknown dataset, and computer software code so that the reader canuse the methods with their own data.There is not one universally optimal approach for identifyingoutliers, as each research problem presents the data analyst withspecific challenges and questions that might be best addressedusing a method that is not optimal in another scenario. This studyhelps researchers and data analysts to see the range of possibilitiesavailable to them when they must address outliers in their data.In addition, these results illuminate the impact of using the variousmethods for a representative dataset, while the R code in theAppendix provides the researcher with the software tools necessaryto use each technique. A major issue that researchers mustconsider is the tradeoff between a method with a high breakdownpoint (i.e., that is impervious to the presence of many outliers)and the desire to retain as much of the data as possible. From thisexample, it is clear that the methods with the highest breakdownpoints, MCD/MVE, retained data that more clearly conformed tothe normal distribution than did the other approaches, but at thecost of approximately half of the original data. Thus, researchersmust consider the purpose of their efforts to detect outliers. If theywww.frontiersin.org July 2012 | Volume 3 | Article 211 | 67
FinchModern methods for the detection of multivariate outliersTable 4 | Summary of results for outlier detection methods.MethodOutliersremovedImpact on distributionsImpact oncorrelationsCommentsD 2 i 13 Reduced skewness and kurtosis whencompared to full data set, but did not fullyeliminate them. Reduced variation inTIMEDRSMVE 230 Largely eliminated skewness and greatlylowered kurtosis in TIMEDRS. Also reducedkurtosis in ATTHOUSE when compared tofull data. Greatly lowered both the mean andstandard deviation of TIMEDRSComparable correlationsto the full datasetResulted in markedlyhigher correlations fortwo pairs of variables,than was seen with theother methods, exceptfor MCDMCD 230 Very similar pattern to that displayed by MVE Yielded relatively higherMGV 2 Yielded distributional results very similar tothose of the full datasetP1 40 Resulted in lower mean, standard deviation,skewness, and kurtosis values for TIMEDRSwhen compared to the full data, D 2 , andMGV, though not when compared to MVEand MCD. Yielded comparable skewness andkurtosis to other methods for ATTDRUG andATTHOUSE, and somewhat greater variationfor these other variables, as wellcorrelation values thanany of the othermethods, except MVE,and no very low valuesVery similar correlationstructure as found in thefull dataset and for D 2Very comparablecorrelation results to thefull dataset, as well as D 2and MGVResulted in a sample with somewhat less skewed andkurtotic variables, though they did remain clearlynon-normal in nature. The correlations among thevariables remained low, as with the full datasetReduced the sample size substantially, but also yieldedvariables with distributional characteristics much morefavorable to use with common statistical analyses; i.e.,very little skewness or kurtosis. In addition, correlationcoefficients were generally larger than for the othermethods, suggesting greater linearity in relationshipsamong the variablesProvided a sample with very characteristics to that ofMVEIdentified very few outliers, leading to a sample that didnot differ meaningfully from the originalAppears to find a “middle ground” between MVE/MCDand D 2 /MGV in terms of the number of outliersidentified and the resulting impact on variabledistributions and correlationsP2 43 Very similar results to P1 Very similar results to P1 Provided a sample yielding essentially the same resultsP3 40 Identical results to P1 Identical results to P1 In this case, resulted in an identical sample to that of P1as P1are seeking a “clean” set of data upon which they can run a varietyof analyses with little or no fear of outliers having an impact, thenmethods with a high breakdown point, such as MCD and MVE areoptimal. On the other hand, if an examination of outliers revealsthat they are from the population of interest, then a more carefulapproach to dealing with them is necessary. Removing such outlierscould result in a dataset that is more tractable with respectto commonly used parametric statistical analyses but less representativeof the general population than is desired. Of course, theconverse is also true in that a dataset replete with outliers mightproduce statistical results that are not generalizable to the populationof real interest, when the outlying observations are not partof this population.FUTURE RESEARCHThere are a number of potential areas for future research in thearea of outlier detection. Certainly, future work should use thesemethods with other extant datasets having different characteristicsthan the one featured here. For example, the current set ofdata consisted of only three variables. It would be interesting tocompare the relative performance of these methods when morevariables are present. Similarly, examining them for much smallergroups would also be useful, as the current sample is fairly largewhen compared to many that appear in social science research. Inaddition, a simulation study comparing these methods with oneanother would also be warranted. Specifically, such a study couldbe based upon the generation of datasets with known outliersand distributional characteristics of the non-outlying cases. Thevarious detection methods could then be used and the resultingretained datasets compared to the known non-outliers in terms ofthese various characteristics. Such a study would be quite useful ininforming researchers regarding approaches that might be optimalin practice.CONCLUSIONThis study should prove helpful to those faced with a multivariateoutlier problem in their data. Several methods of outlierdetection were demonstrated and great differences among themwere observed, in terms of the characteristics of the observationsretained. These findings make it clear that researchers must bevery thoughtful in their treatment of outlying observations. Simplyrelying on Mahalanobis Distance because it is widely usedFrontiers in Psychology | Quantitative Psychology and Measurement July 2012 | Volume 3 | Article 211 | 68
Page 2 and 3:
FRONTIERS COPYRIGHTSTATEMENT© Copy
Page 4 and 5:
Table of Contents05 Is Data Cleanin
Page 7 and 8:
OsborneAssumptions and data cleanin
Page 9 and 10:
ORIGINAL RESEARCH ARTICLEpublished:
Page 12 and 13:
Hoekstra et al.Why assumptions are
Page 14 and 15:
Page 16 and 17:
Page 18 and 19: ORIGINAL RESEARCH ARTICLEpublished:
Page 20 and 21: García-PérezStatistical conclusio
Page 30 and 31: Sheng and ShengEffect of non-normal
Page 42 and 43: REVIEW ARTICLEpublished: 12 April 2
Page 44 and 45: Nimon et al.The assumption of relia
Page 56 and 57: TressoldiPower replication unreliab
Page 58 and 59: TressoldiPower replication unreliab
Page 60 and 61: ORIGINAL RESEARCH ARTICLEpublished:
Page 62 and 63: FinchModern methods for the detecti
Page 72 and 73: MINI REVIEW ARTICLEpublished: 28 Au
Page 74 and 75: NimonStatistical assumptionsand Del
Page 76 and 77: NimonStatistical assumptionsFor exa
Page 78 and 79: Kraha et al.Interpreting multiple r
Page 94 and 95: SmithsonComparing moderation of slo
Page 102 and 103: REVIEW ARTICLEpublished: 01 March 2
Page 104 and 105: Flora et al.Factor analysis assumpt
Page 118 and 119:
Flora et al.Factor analysis assumpt
Page 120 and 121:
Page 122 and 123:
Page 124 and 125:
Kasper and ÜnlüAssumptions of fac
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
Page 132 and 133:
Page 134 and 135:
Page 136 and 137:
Page 138 and 139:
Page 140 and 141:
Page 142 and 143:
Page 144 and 145:
Lages and JaworskaHow predictable a
Page 146 and 147:
Page 148 and 149:
Page 150 and 151:
Page 152 and 153:
Cummiskey et al.Testing assumptions
Page 154 and 155:
Page 156 and 157:
Page 158:
show all

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Create successful ePaper yourself

Delete template?

Save as template?