Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Finch | Modern methods for the detection of multivariate outliers

FIGURE 4 | Boxplots.

datasets created by the various outlier detection methods, as well as the full dataset. The strategy in this study was to remove all observations that were identified as outliers by each method, thus creating datasets for each approach that included only those not deemed to be outliers. It is important to note that this is not typically recommended practice, nor is it being suggested here. Rather, the purpose of this study was to demonstrate the impact of each method on the data itself. Therefore, rather than take the approach of examining each outlier carefully to ascertain whether it was truly part of the target population, the strategy was to remove those cases identified as outliers prior to conducting statistical analyses. In this way, it was hoped that the reader could clearly see the way in which each detection method worked and how this might impact resulting analyses.

In terms of the actual data analysis, the focus was on describing the resulting datasets. Therefore, distributional characteristics of each variable within each method were calculated, including the mean, median, standard deviation, skewness, kurtosis, and first and third quartiles. In addition, distributions of the variables were examined using the boxplot. Finally, in order to demonstrate the impact of these approaches on relational measures, Pearson's correlation coefficient was estimated between each pair of variables. All statistical analyses, including identification of outliers, were carried out using the R software package, version 2.12.1 (R Foundation for Statistical Computing, 2010). The R code used to conduct these analyses appears in the Appendix at the end of the manuscript.

RESULTS

An initial examination of the full dataset using boxplots appears in Figure 4. It is clear that in particular the variable TIMEDRS is positively skewed with a number of fairly large values, even while the median is well under 10.
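The descriptive measures listed in the data-analysis description (mean, median, standard deviation, skewness, kurtosis, first and third quartiles, and pairwise Pearson correlations) can be illustrated with a minimal sketch. The study itself used R and its code is in the Appendix, which is not part of this excerpt, so the Python below is only a stand-in. The function names `describe` and `pearson` are hypothetical, the data are made up, and the moment-based skewness and non-excess kurtosis formulas are an assumption, since the excerpt does not say which convention the R code followed.

```python
import math
import statistics as st
from itertools import combinations


def describe(values):
    """Return the descriptive measures named in the text for one variable."""
    n = len(values)
    mean = st.fmean(values)
    # Central moments (divide by n) used for moment-based skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # "inclusive" interpolates between order statistics of the sample itself.
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": st.median(values),
        "sd": st.stdev(values),       # sample SD, n - 1 denominator
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,     # non-excess form: 3 for a normal distribution
        "q1": q1,
        "q3": q3,
    }


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = st.fmean(x), st.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up toy data; the variable names are borrowed from the text only as labels.
data = {
    "TIMEDRS": [1, 2, 2, 3, 4, 6, 9, 30],
    "ATTDRUG": [7, 8, 8, 7, 9, 6, 8, 7],
    "ATTHOUSE": [21, 24, 27, 23, 25, 20, 26, 22],
}

for name, values in data.items():
    print(name, {k: round(v, 2) for k, v in describe(values).items()})

# One correlation per pair of variables, as in the study's relational measures.
for a, b in combinations(data, 2):
    print(a, b, round(pearson(data[a], data[b]), 2))
```

Running `describe` once per outlier-screened dataset would give one column of a table like the descriptive-statistics table in the article, although the numbers will differ if the original analysis used other skewness, kurtosis, or quantile definitions.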
Descriptive statistics for the full dataset (Table 2) show that indeed, all of the variables are fairly kurtotic, particularly TIMEDRS, which also displays a strong positive skew. Finally, the correlations among the three variables for the full dataset appear in Table 3. All of these are below 0.15, indicating fairly weak relationships among the measures. However, it is not clear to what extent these correlations may be impacted by the distributional characteristics just described.

www.frontiersin.org | July 2012 | Volume 3 | Article 211 | 65

Table 2 | Descriptive statistics.

Variable      Full      D²     MCD     MVE     MGV      P1      P2      P3
N              465     452     235     235     463     425     422     425

MEAN
TIMEDRS       7.90    6.67    2.45    3.37    7.64    5.27    5.17    5.27
ATTDRUG       7.69    7.67    7.69    7.68    7.68    7.65    7.64    7.65
ATTHOUSE     23.53   23.56   23.36   23.57   23.50   23.54   23.54   23.54

MEDIAN
TIMEDRS       4.00    4.00    2.00    3.00    2.00    2.00    2.00    2.00
ATTDRUG       8.00    8.00    8.00    8.00    7.00    7.00    7.00    7.00
ATTHOUSE     24.00   24.00   23.00   23.00   21.00   21.00   21.00   21.00

Q1
TIMEDRS       2.00    2.00    1.00    2.00    2.00    2.00    2.00    2.00
ATTDRUG       7.00    7.00    7.00    7.00    7.00    7.00    7.00    7.00
ATTHOUSE     21.00   21.00   21.00   21.00   21.00   21.00   21.00   21.00

Q3
TIMEDRS      10.00    9.00    4.00    5.00    9.50    7.00    7.00    7.00
ATTDRUG       9.00    8.25    8.00    8.00    8.00    8.00    8.00    8.00
ATTHOUSE     27.00   26.25   26.00   26.00   26.50   26.00   26.75   26.00

STANDARD DEVIATION
TIMEDRS      10.95    7.35    1.59    2.22   10.23    4.72    4.58    4.72
ATTDRUG       1.16    1.16    0.89    0.81    1.15    1.56    1.52    1.56
ATTHOUSE      4.48    4.22    3.67    3.40    4.46    4.25    4.26    4.25

SKEWNESS
TIMEDRS       3.23    2.07    0.23    0.46    3.15    1.12    1.08    1.12
ATTDRUG      −0.12   −0.11   −0.16    0.01   −0.12   −0.09   −0.09   −0.09
ATTHOUSE     −0.45   −0.06   −0.03    0.06   −0.46   −0.03   −0.03   −0.03

KURTOSIS
TIMEDRS      15.88    7.92    2.32    2.49   15.77    3.42    3.29    3.42
ATTDRUG       2.53    2.51    2.43    2.31    2.54    2.51    2.51    2.51
ATTHOUSE      4.50    2.71    2.16    2.17    4.54    2.69    2.68    2.69

Given these distributional issues, the researcher working with this dataset would be well advised to investigate the possibility that outliers are present. For this example, we can use R to calculate D² for each observation, with appropriate program code appearing in the Appendix. In order to identify an observation as an outlier, we compare the D² value to the chi-square distribution with degrees of freedom equal to the number of variables (three in this case), and α = 0.001.
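The D² screening rule just described can be sketched as follows. This is not the article's R code; it is a standard-library Python illustration under stated assumptions. The function names (`chi2_cdf_df3`, `chi2_critical_df3`, `solve`, `mahalanobis_d2`) are hypothetical, the data are simulated, and the closed-form chi-square CDF used for the cutoff holds only for three degrees of freedom, which matches the three variables in this example.

```python
import math
import random


def chi2_cdf_df3(x):
    """Closed-form chi-square CDF, valid only for 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def chi2_critical_df3(alpha, lo=0.0, hi=100.0):
    """Bisection search for the value whose upper-tail probability is alpha."""
    target = 1.0 - alpha
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_df3(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes a is non-singular; a singular covariance matrix would fail here.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mahalanobis_d2(rows):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    # Sample covariance matrix (n - 1 denominator).
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    d2 = []
    for r in rows:
        diff = [r[j] - mean[j] for j in range(p)]
        z = solve(cov, diff)  # z = cov^{-1} diff, without forming the inverse
        d2.append(sum(d * zj for d, zj in zip(diff, z)))
    return d2


# Simulated data: 60 ordinary cases plus one deliberately extreme case.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
rows.append([10.0, 10.0, 10.0])

cutoff = chi2_critical_df3(alpha=0.001)
d2 = mahalanobis_d2(rows)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(f"cutoff = {cutoff:.3f}; flagged rows: {flagged}")
```

Because the centroid and covariance matrix are estimated from all cases, including the extreme one, the flagged case also influences the very distances used to judge it. That is one reason the robust alternatives compared in the article (MCD, MVE, and others) estimate location and scatter differently.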
Using this criterion, 13 individuals were identified as outliers. In order to demonstrate the impact of using D² for outlier detection, these individuals were removed, and descriptive graphics and statistics were generated for the remaining 452 observations. An examination of the boxplot for the Mahalanobis data reveals that the range of values is more truncated than for the original, particularly for TIMEDRS. A similar result is evident in the descriptive statistics found in Table 2, where we can see that the standard deviation, skewness, kurtosis and mean are all smaller in the Mahalanobis data for TIMEDRS. In contrast, the removal of the 13 outliers identified by D² did not result in great changes for the distributional characteristics of ATTDRUG or ATTHOUSE. The correlations among the three variables were comparable to those in the full dataset, if not slightly smaller.

As discussed previously, there are some potential problems with using D² as a method for outlier detection. For this reason, other approaches have been suggested for use in the context of multivariate data in particular. A number of these, including MCD, MVE, MGV, and three projection methods, were applied to this dataset, followed by generation of graphs and descriptive statistics as was done for both the full dataset and the Mahalanobis data. Boxplots for the three variables after outliers were identified and removed by each of the methods appear in Figure 4. As can be seen, the MCD and MVE approaches resulted in data that appear to be the least skewed, particularly for TIMEDRS. In contrast, the data from MGV were very similar to those of the full dataset, while the three projection methods resulted in data that appeared to lie between MCD/MVE and the Mahalanobis data in terms of the distributions of the three variables. An examination of Table 2 confirms that making use of the different outlier detection methods results in datasets with markedly different distributional characteristics.
Of particular note are differences between MCD/MVE as compared to the full dataset, and the Mahalanobis and MGV data. Specifically, the skewness and kurtosis evident in these two samples were markedly lower than those of any of the other datasets, particularly the full

Frontiers in Psychology | Quantitative Psychology and Measurement | July 2012 | Volume 3 | Article 211 | 66