13.07.2015 Views

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

ORIGINAL RESEARCH ARTICLEpublished: 05 July 2012doi: 10.3389/fpsyg.2012.00211Distribution of variables by method of outlier detectionW. Holmes Finch*Department of Educational Psychology, Ball State University, Muncie, IN, USAEdited by:Jason W. Osborne, Old DominionUniversity, USAReviewed by:Matt Jans, The United States CensusBureau, USAAvi Allalouf, National Institute forTesting <strong>and</strong> Evaluation, Israel*Correspondence:W. Holmes Finch, Department ofEducational Psychology, Ball StateUniversity, Muncie, IN 47304, USA.e-mail: whfinch@bsu.eduThe presence of outliers can very problematic in <strong>data</strong> analysis, leading statisticians todevelop a wide variety of methods for identifying <strong>the</strong>m in both <strong>the</strong> univariate <strong>and</strong> multivariatecontexts. In case of <strong>the</strong> latter, perhaps <strong>the</strong> most popular approach has been Mahalanobisdistance, where large values suggest an observation that is unusual as compared to <strong>the</strong>center of <strong>the</strong> <strong>data</strong>. However, researchers have identified problems with <strong>the</strong> applicationof this metric such that its utility may be limited in some situations. As a consequence,o<strong>the</strong>r methods for detecting outlying observations have been developed <strong>and</strong> studied. However,a number of <strong>the</strong>se approaches, while apparently robust <strong>and</strong> useful have not made<strong>the</strong>ir way into general practice in <strong>the</strong> social sciences. Thus, <strong>the</strong> goal of this study was todescribe some of <strong>the</strong>se methods <strong>and</strong> demonstrate <strong>the</strong>m using a well known <strong>data</strong>set froma popular multivariate textbook widely used in <strong>the</strong> social sciences. Results demonstratedthat <strong>the</strong> methods do indeed result in <strong>data</strong>sets with very different distributional characteristics.These results are discussed in light of how <strong>the</strong>y might be used by researchers <strong>and</strong>practitioners.Keywords: Mahalanobis distance, minimum covariance determinant, minimum generalized variance, minimumvolume ellipsoid, outliers, projectionINTRODUCTIONThe presence of outliers is a ubiquitous <strong>and</strong> sometimes problematicaspect of <strong>data</strong> analysis. They can result from a varietyof processes, including <strong>data</strong> recording <strong>and</strong> entry errors, obtainingsamples from o<strong>the</strong>r than <strong>the</strong> target population, <strong>and</strong> samplingunusual individuals from <strong>the</strong> target population itself (Kruskal,1988). Based on <strong>the</strong> st<strong>and</strong>ard definition of outliers, it is entirelypossible that a <strong>data</strong>set may not have any such cases, or it mighthave many. Given that <strong>the</strong>y can arise from very different processes,outliers should not all be treated in <strong>the</strong> same manner. For example,those caused by <strong>data</strong> collection problems are likely to be removedfrom <strong>the</strong> sample prior to analysis, while those that are simplyunusual members of <strong>the</strong> target population would be retained for<strong>data</strong> analysis. Finally, Kruskal noted that in some cases underst<strong>and</strong>ing<strong>the</strong> mechanism that caused outliers is <strong>the</strong> most importantaspect of a given study. In o<strong>the</strong>r words, outliers can <strong>the</strong>mselvesprovide useful information to researchers, <strong>and</strong> are not necessarilyproblematic in <strong>the</strong> sense of being bad <strong>data</strong>. The focus of thismanuscript is not on <strong>the</strong> mechanism giving rise to outliers, butra<strong>the</strong>r on methods for detecting <strong>the</strong>m once <strong>the</strong>y are present in <strong>the</strong>sample.A number of authors have sought to precisely define whatconstitutes an outlier (e.g., Evans, 1999), <strong>and</strong> methods for detecting<strong>and</strong> dealing with <strong>the</strong>m once detected remain an active areaof research. It is well known that outliers can have a dramaticimpact on <strong>the</strong> performance of common statistical analysessuch as Pearson’s correlation coefficient (Marascuilo <strong>and</strong> Serlin,1988), univariate, <strong>and</strong> multivariate means comparisons (Kirk,1995; Huberty <strong>and</strong> Olejnik, 2006), cluster analysis (Kaufman <strong>and</strong>Rousseeuw, 2005), multivariate means comparisons, <strong>and</strong> factoranalysis (Brown, 2006), among o<strong>the</strong>rs. For this reason researchersare strongly encouraged to investigate <strong>the</strong>ir <strong>data</strong> for <strong>the</strong> presence ofoutliers prior to conducting <strong>data</strong> analysis (Tabachnick <strong>and</strong> Fidell,2007).In <strong>the</strong> multivariate context, <strong>the</strong> most commonly recommendedapproach for outlier detection is <strong>the</strong> Mahalanobis Distance (D 2 ).While this approach can be an effective tool for such purpose,it also has weaknesses that might render it less than effectivein many circumstances (Wilcox, 2005). The focus of this manuscriptis on describing several alternative methods for multivariateoutlier detection; i.e., observations that have unusual patternson multiple variables as opposed to extreme scores on a singlevariable (univariate outliers). In addition, <strong>the</strong>se approacheswill be demonstrated, along with D 2 , using a set of <strong>data</strong> takenfrom Tabachnick <strong>and</strong> Fidell (2007). The demonstration will utilizefunctions from <strong>the</strong> R software package for both outlier detection<strong>and</strong> <strong>data</strong> analysis after removal of <strong>the</strong> outliers. It shouldbe noted that <strong>the</strong> focus of this manuscript is not on attemptingto identify some optimal approach for dealing with outliersonce <strong>the</strong>y have been identified, which is an area of statistics itselfreplete with research, <strong>and</strong> which is well beyond <strong>the</strong> scope ofthis study. Suffice it to say that identification of outliers is only<strong>the</strong> first step in <strong>the</strong> process, <strong>and</strong> much thought must be givento how outliers will be h<strong>and</strong>led. In <strong>the</strong> current study, <strong>the</strong>y willbe removed from <strong>the</strong> <strong>data</strong>set in order to clearly demonstrate<strong>the</strong> differential impact of <strong>the</strong> various outlier detection methodson <strong>the</strong> <strong>data</strong> <strong>and</strong> subsequent analyses. However, it is notrecommended that this approach to outliers be taken in everysituation.IMPACT OF OUTLIERS IN MULTIVARIATE ANALYSISOutliers can have a dramatic impact on <strong>the</strong> results of commonmultivariate statistical analyses. For example, <strong>the</strong>y can distort correlationcoefficients (Marascuilo <strong>and</strong> Serlin, 1988; Osborne <strong>and</strong>www.frontiersin.org July 2012 | Volume 3 | Article 211 | 59

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!