Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

More documents

Recommendations

Info

ORIGINAL RESEARCH ARTICLEpublished: 05 July 2012doi: 10.3389/fpsyg.2012.00211Distribution of variables by method of outlier detectionW. Holmes Finch*Department of Educational Psychology, Ball State University, Muncie, IN, USAEdited by:Jason W. Osborne, Old DominionUniversity, USAReviewed by:Matt Jans, The United States CensusBureau, USAAvi Allalouf, National Institute forTesting and Evaluation, Israel*Correspondence:W. Holmes Finch, Department ofEducational Psychology, Ball StateUniversity, Muncie, IN 47304, USA.e-mail: whfinch@bsu.eduThe presence of outliers can very problematic in data analysis, leading statisticians todevelop a wide variety of methods for identifying them in both the univariate and multivariatecontexts. In case of the latter, perhaps the most popular approach has been Mahalanobisdistance, where large values suggest an observation that is unusual as compared to thecenter of the data. However, researchers have identified problems with the applicationof this metric such that its utility may be limited in some situations. As a consequence,other methods for detecting outlying observations have been developed and studied. However,a number of these approaches, while apparently robust and useful have not madetheir way into general practice in the social sciences. Thus, the goal of this study was todescribe some of these methods and demonstrate them using a well known dataset froma popular multivariate textbook widely used in the social sciences. Results demonstratedthat the methods do indeed result in datasets with very different distributional characteristics.These results are discussed in light of how they might be used by researchers andpractitioners.Keywords: Mahalanobis distance, minimum covariance determinant, minimum generalized variance, minimumvolume ellipsoid, outliers, projectionINTRODUCTIONThe presence of outliers is a ubiquitous and sometimes problematicaspect of data analysis. They can result from a varietyof processes, including data recording and entry errors, obtainingsamples from other than the target population, and samplingunusual individuals from the target population itself (Kruskal,1988). Based on the standard definition of outliers, it is entirelypossible that a dataset may not have any such cases, or it mighthave many. Given that they can arise from very different processes,outliers should not all be treated in the same manner. For example,those caused by data collection problems are likely to be removedfrom the sample prior to analysis, while those that are simplyunusual members of the target population would be retained fordata analysis. Finally, Kruskal noted that in some cases understandingthe mechanism that caused outliers is the most importantaspect of a given study. In other words, outliers can themselvesprovide useful information to researchers, and are not necessarilyproblematic in the sense of being bad data. The focus of thismanuscript is not on the mechanism giving rise to outliers, butrather on methods for detecting them once they are present in thesample.A number of authors have sought to precisely define whatconstitutes an outlier (e.g., Evans, 1999), and methods for detectingand dealing with them once detected remain an active areaof research. It is well known that outliers can have a dramaticimpact on the performance of common statistical analysessuch as Pearson’s correlation coefficient (Marascuilo and Serlin,1988), univariate, and multivariate means comparisons (Kirk,1995; Huberty and Olejnik, 2006), cluster analysis (Kaufman andRousseeuw, 2005), multivariate means comparisons, and factoranalysis (Brown, 2006), among others. For this reason researchersare strongly encouraged to investigate their data for the presence ofoutliers prior to conducting data analysis (Tabachnick and Fidell,2007).In the multivariate context, the most commonly recommendedapproach for outlier detection is the Mahalanobis Distance (D 2 ).While this approach can be an effective tool for such purpose,it also has weaknesses that might render it less than effectivein many circumstances (Wilcox, 2005). The focus of this manuscriptis on describing several alternative methods for multivariateoutlier detection; i.e., observations that have unusual patternson multiple variables as opposed to extreme scores on a singlevariable (univariate outliers). In addition, these approacheswill be demonstrated, along with D 2 , using a set of data takenfrom Tabachnick and Fidell (2007). The demonstration will utilizefunctions from the R software package for both outlier detectionand data analysis after removal of the outliers. It shouldbe noted that the focus of this manuscript is not on attemptingto identify some optimal approach for dealing with outliersonce they have been identified, which is an area of statistics itselfreplete with research, and which is well beyond the scope ofthis study. Suffice it to say that identification of outliers is onlythe first step in the process, and much thought must be givento how outliers will be handled. In the current study, they willbe removed from the dataset in order to clearly demonstratethe differential impact of the various outlier detection methodson the data and subsequent analyses. However, it is notrecommended that this approach to outliers be taken in everysituation.IMPACT OF OUTLIERS IN MULTIVARIATE ANALYSISOutliers can have a dramatic impact on the results of commonmultivariate statistical analyses. For example, they can distort correlationcoefficients (Marascuilo and Serlin, 1988; Osborne andwww.frontiersin.org July 2012 | Volume 3 | Article 211 | 59
FinchModern methods for the detection of multivariate outliersOverbay, 2004), and create problems in regression analysis, evenleading to the presence of collinearity among the set of predictorvariables in multiple regression (Pedhazur, 1997). Distortions tothe correlation may in turn lead to biased sample estimates, as outliersartificially impact the degree of linearity present between apair of variables (Osborne and Overbay, 2004). In addition, methodsbased on the correlation coefficient such as factor analysisand structural equation modeling are also negatively impactedby the presence of outliers in data (Brown, 2006). Cluster analysisis particularly sensitive to outliers with a distortion of clusterresults when outliers are the center or starting point of the analysis(Kaufman and Rousseeuw, 2005). Outliers can also themselvesform a cluster, which is not truly representative of the broaderarray of values in the population. Outliers have also been shownto detrimentally impact testing for mean differences using ANOVAthrough biasing group means where they are present (Osborne andOverbay, 2004).While outliers can be problematic from a statistical perspective,it is not always advisable to remove them from the data. When theseobservations are members of the target population, their presencein the dataset can be quite informative regarding the nature of thepopulation (e.g., Mourão-Miranda et al., 2011). To remove outliersfrom the sample in this case would lead to loss of informationabout the population at large. In such situations, outlier detectionwould be helpful in terms of identifying members of the targetpopulation who are unusual when compared to the rest, but theseindividuals should not be removed from the sample (Zijlstra et al.,2011).METHODS OF MULTIVARIATE OUTLIER DETECTIONGiven the negative impact that outliers can have on multivariatestatistical methods, their accurate detection is an importantmatter to consider prior to data analysis (Tabachnick and Fidell,2007; Stevens, 2009). In popular multivariate statistics texts, thereader is recommended to use D 2 for multivariate outlier detection,although as is described below, there are several alternativesfor multivariate outlier detection that may prove to be more effectivethan this standard approach. Prior to discussing these methodshowever, it is important to briefly discuss general qualities thatmake for an effective outlier detection method. Readers interestedin a more detailed treatment are referred to two excellent texts byWilcox (2005, 2010).When thinking about the impact of outliers, perhaps the keyconsideration is the breakdown point of the statistical analysisin question. The breakdown point can be thought of as theminimum proportion of a sample that can consist of outliersafter which point they will have a notable impact on the statisticof interest. In other words, if a statistic has a breakdownpoint of 0.1, then 10% of the sample could consist of outlierswithout markedly impacting the statistic. However, if the nextobservation beyond this 10% was also an outlier, the statisticin question would then be impacted by its presence (Maronnaet al., 2006). Comparatively, a statistic with a breakdown pointof 0.3 would be relatively more impervious to outliers, as itwould not be impacted until more than 30% of the sample wasmade up of outliers. Of course, it should be remembered thatthe degree of this impact is dependent on the magnitude of theoutlying observation, such that more extreme outliers would havea greater impact on the statistic than would a less extreme value.A high breakdown point is generally considered to be a positiveattribute.While the breakdown point is typically thought of as a characteristicof a statistic, it can also be a characteristic of a statistic inconjunction with a particular method of outlier detection. Thus,if a researcher calculates the sample mean after removing outliersusing a method such as D 2 , the breakdown point of the combinationof mean and outlier detection method will be different thanthat of the mean by itself. Finally, although having a high breakdownpoint is generally desirable, it is also true that statistics withhigher breakdown points (e.g., the median, the trimmed mean) areoften less accurate in estimating population parameters when thedata are drawn from a multivariate normal distribution (Gentonand Lucas, 2003).Another important property for a statistical measure of location(e.g., mean) is that it exhibit both location and scale equivariance(Wilcox, 2005). Location equivariance means that ifa constant is added to each observation in the data set, themeasure of location will be increased by that constant value.Scale equivariance occurs when multiplication of each observationin the data set by a constant leads to a change in themeasure of location by the same constant. In other words, thescale of measurement should not influence relative comparisonsof individuals within the sample or relative comparisonsof group measures of location such as the mean. In the contextof multivariate data, these properties for measures of locationare referred to as affine equivariance. Affine equivariance extendsthe notion of equivariance beyond changes in location and scaleto measures of multivariate dispersion. Covariance matrices areaffine equivariant, for example, though they are not particularlyrobust to the presence of outliers (Wilcox, 2005). A viableapproach to dealing with multivariate outliers must maintainaffine equivariance.Following is a description of several approaches for outlierdetection. For the most part, these descriptions are presentedconceptually, including technical details only when they are vitalto understanding how the methods work. References are providedfor the reader who is interested in learning more about thetechnical aspects of these approaches. In addition to these descriptions,Table 1 also includes a summary of each method, includingfundamental equations, pertinent references, and strengths andweaknesses.MAHALANOBIS DISTANCEThe most commonly recommended approach for multivariateoutlier detection is D 2 , which is based on a measure of multivariatedistance first introduced by Mahalanobis (1936), and whichhas been used in a wide variety of contexts. D 2 has been suggestedas the default outlier detection method by a number of authorsin popular textbooks used by researchers in a variety of fields(e.g., Johnson and Wichern, 2002; Tabachnick and Fidell, 2007).In practice, a researcher would first calculate the value of D 2 foreach subject in the sample, as follows:D 2 i = (x i − ¯x) ′ S −1 (x i − ¯x) (1)Frontiers in Psychology | Quantitative Psychology and Measurement July 2012 | Volume 3 | Article 211 | 60
Page 2 and 3:
FRONTIERS COPYRIGHTSTATEMENT© Copy
Page 4 and 5:
Table of Contents05 Is Data Cleanin
Page 7 and 8:
OsborneAssumptions and data cleanin
Page 9 and 10: ORIGINAL RESEARCH ARTICLEpublished:
Page 12 and 13: Hoekstra et al.Why assumptions are
Page 18 and 19: ORIGINAL RESEARCH ARTICLEpublished:
Page 20 and 21: García-PérezStatistical conclusio
Page 30 and 31: Sheng and ShengEffect of non-normal
Page 42 and 43: REVIEW ARTICLEpublished: 12 April 2
Page 44 and 45: Nimon et al.The assumption of relia
Page 56 and 57: TressoldiPower replication unreliab
Page 58 and 59: TressoldiPower replication unreliab
Page 62 and 63: FinchModern methods for the detecti
Page 72 and 73: MINI REVIEW ARTICLEpublished: 28 Au
Page 74 and 75: NimonStatistical assumptionsand Del
Page 76 and 77: NimonStatistical assumptionsFor exa
Page 78 and 79: Kraha et al.Interpreting multiple r
Page 94 and 95: SmithsonComparing moderation of slo
Page 102 and 103: REVIEW ARTICLEpublished: 01 March 2
Page 104 and 105: Flora et al.Factor analysis assumpt
Page 110 and 111:
Flora et al.Factor analysis assumpt
Page 112 and 113:
Page 114 and 115:
Page 116 and 117:
Page 118 and 119:
Page 120 and 121:
Page 122 and 123:
Page 124 and 125:
Kasper and ÜnlüAssumptions of fac
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
Page 132 and 133:
Page 134 and 135:
Page 136 and 137:
Page 138 and 139:
Page 140 and 141:
Page 142 and 143:
Page 144 and 145:
Lages and JaworskaHow predictable a
Page 146 and 147:
Page 148 and 149:
Page 150 and 151:
Page 152 and 153:
Cummiskey et al.Testing assumptions
Page 154 and 155:
Page 156 and 157:
Page 158:
show all

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Create successful ePaper yourself

Delete template?

Save as template?