Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Finch | Modern methods for the detection of multivariate outliers

where MAD_i is the median of all |d_ij − M_D|. Here, MAD_i is scaled by the divisor 0.6745 so that it approximates the standard deviation obtained when sampling from a normal distribution. Regardless of the criterion used, an observation is declared to be an outlier in general if it is an outlier for any of the projections.

GOALS OF THE CURRENT STUDY

The primary goal of this study was to describe alternatives to D² for multivariate outlier detection. As noted above, D² has a number of weaknesses in this regard, making it less than optimal for many research situations. Researchers need outlier detection methods on which they can depend, particularly given the sensitivity of many multivariate statistical techniques to the presence of outliers. Those described here may well fill that niche. A secondary goal of this study was to demonstrate, for a well known dataset, the impact of the various outlier detection methods on measures of location, variation and covariation. It is hoped that this study will serve as a useful reference for applied researchers working in the multivariate context who need to ascertain whether any observations are outliers, and if so which ones.

MATERIALS AND METHODS

The Women’s Health and Drug study that is described in detail in Tabachnick and Fidell (2007) was used for demonstrative purposes. This dataset was selected because it appears in this very popular text in order to demonstrate data screening and as such was deemed an excellent source for demonstrating the methods studied here. A subset of the variables were used in the study, including number of visits to a health care provider (TIMEDRS), attitudes toward medication (ATTDRUG) and attitudes toward housework (ATTHOUSE). These variables were selected because they were featured in Tabachnick and Fidell’s own analysis of the data. The sample used in this study consisted of 465 females aged 20–59 years who were randomly sampled from the San Fernando Valley in California and interviewed in 1976. Further description of the sample and the study from which it was drawn can be found in Tabachnick and Fidell.

In order to explore the impact of the various outlier detection methods included here, a variety of statistical analyses were conducted subsequent to the application of each approach. In particular, distributions of the three variables were examined for the

FIGURE 4 | Continued

Frontiers in Psychology | Quantitative Psychology and Measurement | July 2012 | Volume 3 | Article 211 | 64
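
The MAD scaling described at the top of this page, together with the "outlier for any projection" rule, can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names, the layout of the distance matrix (one row per projection i, one column per observation j), and the `cutoff` multiplier are assumptions made for illustration, since the cutoff criterion itself is defined on an earlier page that is not reproduced here.

```python
import numpy as np

def scaled_mad(d):
    """Return (M_D, MAD / 0.6745) for one projection's distances d.

    M_D is the median of the distances; MAD is the median of |d - M_D|.
    Dividing by 0.6745 makes the MAD approximate the standard deviation
    when sampling from a normal distribution.
    """
    d = np.asarray(d, dtype=float)
    m_d = np.median(d)
    mad = np.median(np.abs(d - m_d))
    return m_d, mad / 0.6745

def flag_projection(d, cutoff):
    """Flag distances exceeding M_D + cutoff * scaled MAD.

    `cutoff` is a hypothetical multiplier supplied by the caller.
    """
    d = np.asarray(d, dtype=float)
    m_d, s = scaled_mad(d)
    return d > m_d + cutoff * s

def flag_any_projection(dist, cutoff):
    """Declare observation j an outlier if any projection i flags it.

    `dist` is assumed to have shape (projections, observations).
    """
    dist = np.asarray(dist, dtype=float)
    flags = np.array([flag_projection(row, cutoff) for row in dist])
    return flags.any(axis=0)

# Made-up distances, for illustration only (not data from the study).
example = np.array([
    [1.0, 2.0, 3.0, 4.0, 100.0],  # projection 1: last point is far out
    [1.0, 2.0, 3.0, 4.0, 5.0],    # projection 2: nothing unusual
])
outliers = flag_any_projection(example, cutoff=3.0)
```

In this made-up example only the fifth observation is flagged, because it exceeds the threshold in the first projection even though it looks unremarkable in the second.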
