Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

More documents

Recommendations

Info

FinchModern methods for the detection of multivariate outlierseach observation has associated with it a value for the generalizedvariance. Observations that are more distant from the bulk of thedata will have larger values of the generalized variance. For p = 2variables, observations with a generalized variance greater thanq 3 + 1.5(q 3 − q 1 ) (5)would be considered outliers, where q 1 and q 2 are the lower andupper quartiles,respectively,of the generalized variances. For morethan two variables, the generalized variances are compared with√M G + χ 2 0.975.p (q 3 − q 1 ), (6)where M G is the median of the generalized variance values andχ 2 0.975.p . Figure 3 is a plot of X and Y, with outliers identified usingthe MGV approach. In this graph, outliers are denoted by 0, whichis different than notation used in Figures 1 and 2. These graphs areincluded here exactly as taken from the R software output, whichwill be used extensively in the following examples.PROJECTION-BASED OUTLIER DETECTIONAnother alternative for identifying multivariate outliers is basedon the notion of the depth of one data point among a set of otherpoints. The idea of depth was described by Tukey (1975), andlater expanded upon by Donoho and Gasko (1992). In general,depth can be thought of as the relative location of an observationvis-à-vis either edge (upper or lower) in a set of data.In the univariate case, this simply means determining to whichedge a given observation more closely lies (i.e., maximum orminimum value), and then calculating the proportion of casesbetween that observation and its closest edge. The larger thisproportion, the deeper the observation lies in the univariateFIGURE 3 | Scatterplot of observations identified as outliers based onthe MGV method.data. While mathematically somewhat more challenging, conceptuallyprojection methods of multivariate outlier detectionwork in much the same way. However, rather than determiningthe proximity to a single edge, the algorithm must identifythe proximity to the edge of the multivariate space. This processis carried out using the method of projection that is describedbelow.For the purposes of this explanation, we will avoid presentingthe mathematical equations that underlie the projection-basedoutlier detection approach. The interested reader is encouragedto refer to Wilcox (2005) for a more technical treatment of thismethodology. Following is a conceptual description of two commonlyused approaches to carrying out this technique when p = 2.In the first method, the algorithm begins by identifying the multivariatecenter of the data using an acceptable approach, such asthe multivariate mean after application of MCD or MVE. Next,for each point (X i ) the following steps are carried out:(1) A line is drawn connecting the multivariate center and pointX i .(2) A line perpendicular to the line in 1 is then drawn from eachof the other observations, X j .(3) The location where the line in 2 intersects with the line in 1 isthe projected depth (d ij ) of that data point for the line.(4) Steps 1–3 are then repeated such that each of the n data pointsis connected to the multivariate center of the data with a line,and corresponding values of d ij are calculated for each of theother observations.(5) For a given observation, each of its depth values, d ij , is comparedwith a standard (to be described below). If for any singleprojection the observation is an outlier, then it is classified asan outlier for purposes of future analyses.As mentioned earlier, there is an alternative approach to theprojection method, which is not based on finding the multivariatecenter of the distribution. Rather, all (n 2 − n)/2 possible linesare drawn between all pairs of observations in the dataset. Then,the approach outlined above for calculating d ij is used for eachof these lines. Thus, rather than having n−1 such d ij values, eachobservation will have (n2 −n)2− 1 indicators of depth. In all otherways, this second approach is identical to the first, however. Priorresearch has demonstrated that this second method might be moreaccurate than the first, but it is not clear how great an advantageit actually has in practice (Wilcox, 2005). Furthermore, because itmust examine all possible lines in the set of data, method 2 canrequire quite a bit more computational time, particularly for largedatasets.The literature on multivariate outlier detection using theprojection-based method includes two different criteria againstwhich an observation can be judged as an outlier. The first of theseis essentially identical to that used for the MGV in Eq. 6, with theexception that M G is replaced by M D , the median of the d ij for thatprojection. Observations associated with values of d ij larger thanthis cut score are considered to be outliers for that projection. Analternative comparison criterion is√ ( )M D + χ 2 MADi0.975.p 0.6745(7)www.frontiersin.org July 2012 | Volume 3 | Article 211 | 63
FinchModern methods for the detection of multivariate outlierswhere MAD i is the median of all |d ij − M D |. Here, MAD i is scaledby the divisor 0.6745 so that it approximates the standard deviationobtained when sampling from a normal distribution. Regardlessof the criterion used, an observation is declared to be an outlier ingeneral if it is an outlier for any of the projections.GOALS OF THE CURRENT STUDYThe primary goal of this study was to describe alternatives to D 2 formultivariate outlier detection. As noted above, D 2 has a numberof weaknesses in this regard, making it less than optimal for manyresearch situations. Researchers need outlier detection methodson which they can depend, particularly given the sensitivity ofmany multivariate statistical techniques to the presence of outliers.Those described here may well fill that niche. A secondarygoal of this study was to demonstrate, for a well known dataset,the impact of the various outlier detection methods on measuresof location, variation and covariation. It is hoped that this studywill serve as a useful reference for applied researchers workingin the multivariate context who need to ascertain whether anyobservations are outliers, and if so which ones.MATERIALS AND METHODSThe Women’s Health and Drug study that is described in detailin Tabachnick and Fidell (2007) was used for demonstrative purposes.This dataset was selected because it appears in this verypopular text in order to demonstrate data screening and as suchwas deemed an excellent source for demonstrating the methodsstudied here. A subset of the variables were used in the study,including number of visits to a health care provider (TIMEDRS),attitudes toward medication (ATTDRUG) and attitudes towardhousework (ATTHOUSE). These variables were selected becausethey were featured in Tabachnick and Fidell’s own analysis of thedata. The sample used in this study consisted of 465 females aged20–59 years who were randomly sampled from the San FernandoValley in California and interviewed in 1976. Further descriptionof the sample and the study from which it was drawn can be foundin Tabachnick and Fidell.In order to explore the impact of the various outlier detectionmethods included here, a variety of statistical analyses wereconducted subsequent to the application of each approach. In particular,distributions of the three variables were examined for theFIGURE 4 | ContinuedFrontiers in Psychology | Quantitative Psychology and Measurement July 2012 | Volume 3 | Article 211 | 64
Page 2 and 3:
FRONTIERS COPYRIGHTSTATEMENT© Copy
Page 4 and 5:
Table of Contents05 Is Data Cleanin
Page 7 and 8:
OsborneAssumptions and data cleanin
Page 9 and 10:
ORIGINAL RESEARCH ARTICLEpublished:
Page 12 and 13:
Hoekstra et al.Why assumptions are
Page 14 and 15: Hoekstra et al.Why assumptions are
Page 16 and 17: Hoekstra et al.Why assumptions are
Page 18 and 19: ORIGINAL RESEARCH ARTICLEpublished:
Page 20 and 21: García-PérezStatistical conclusio
Page 30 and 31: Sheng and ShengEffect of non-normal
Page 42 and 43: REVIEW ARTICLEpublished: 12 April 2
Page 44 and 45: Nimon et al.The assumption of relia
Page 56 and 57: TressoldiPower replication unreliab
Page 58 and 59: TressoldiPower replication unreliab
Page 60 and 61: ORIGINAL RESEARCH ARTICLEpublished:
Page 62 and 63: FinchModern methods for the detecti
Page 72 and 73: MINI REVIEW ARTICLEpublished: 28 Au
Page 74 and 75: NimonStatistical assumptionsand Del
Page 76 and 77: NimonStatistical assumptionsFor exa
Page 78 and 79: Kraha et al.Interpreting multiple r
Page 94 and 95: SmithsonComparing moderation of slo
Page 102 and 103: REVIEW ARTICLEpublished: 01 March 2
Page 104 and 105: Flora et al.Factor analysis assumpt
Page 114 and 115:
Flora et al.Factor analysis assumpt
Page 116 and 117:
Page 118 and 119:
Page 120 and 121:
Page 122 and 123:
Page 124 and 125:
Kasper and ÜnlüAssumptions of fac
Page 126 and 127:
Page 128 and 129:
Page 130 and 131:
Page 132 and 133:
Page 134 and 135:
Page 136 and 137:
Page 138 and 139:
Page 140 and 141:
Page 142 and 143:
Page 144 and 145:
Lages and JaworskaHow predictable a
Page 146 and 147:
Page 148 and 149:
Page 150 and 151:
Page 152 and 153:
Cummiskey et al.Testing assumptions
Page 154 and 155:
Page 156 and 157:
Page 158:
show all

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Create successful ePaper yourself

Delete template?

Save as template?