García-Pérez Statistical conclusion validity

how SCV could be improved, a few words are worth saying about how Bayesian approaches fare on SCV.

THE BAYESIAN APPROACH

Advocates of Bayesian approaches to data analysis, hypothesis testing, and model selection (e.g., Jennison and Turnbull, 1990; Wagenmakers, 2007; Matthews, 2011) overemphasize the problems of the frequentist approach and praise the solutions offered by the Bayesian approach: Bayes factors (BFs) for hypothesis testing, credible intervals for interval estimation, Bayesian posterior probabilities, the Bayesian information criterion (BIC) as a tool for model selection and, above all else, strict reliance on observed data and independence of the sampling plan (i.e., fixed vs. sequential sampling). There is unquestionable merit in these alternatives, and a fair comparison with their frequentist counterparts requires a detailed analysis that is beyond the scope of this paper. Yet, I cannot resist the temptation of commenting on the presumed problems of the frequentist approach and also on the standing of the Bayesian approach with respect to SCV.

One of the preferred objections to p values is that they relate to data that were never collected and which, thus, should not affect the decision of what hypothesis the observed data support or fail to support.
Intuitively appealing as it may seem, the argument is flawed because the referent for a p value is not other data sets that could have been observed in undone replications of the same experiment. Instead, the referent is the properties of the test statistic itself, which is guaranteed to have the declared sampling distribution when data are collected as assumed in the derivation of such distribution. Statistical tests are calibrated procedures with known properties, and this calibration is what makes their results interpretable. As is the case for any other calibrated procedure or measuring instrument, the validity of the outcome rests only on adherence to the usage specifications. And, of course, the test statistic and the resultant p value on application cannot be blamed for the consequences of a failure to collect data properly or to apply the appropriate statistical test.

Consider a two-sample t test for means. Those who need a referent may want to notice that the p value for the data from a given experiment relates to the uncountable times that such a test has been applied to data from any experiment in any discipline.
Calibration of the t test ensures that proper use with a significance level of, say, 5% will reject a true null hypothesis on 5% of occasions, no matter what the experimental hypothesis is, what the variables are, what the data are, what the experiment is about, who carries it out, or in what research field. What a p value indicates is how tenable it is that the t statistic would attain the observed value if the null were correct, with only a trivial link to the data observed in the experiment of concern. And this only places in a precise quantitative framework the logic that the man on the street uses to judge, for instance, that getting struck by lightning four times over the past 10 years is not something that could identically have happened to anybody else, or that the source of a politician's huge and untraceable earnings is not the result of allegedly winning top lottery prizes numerous times over the past couple of years. In any case, the advantage of the frequentist approach as regards SCV is that the probability of a Type-I or a Type-II error can be clearly and unequivocally stated, which is not to be mistaken for a statement that a p value is the probability of a Type-I error in the current case, or that it is a measure of the strength of evidence against the null that the current data provide.
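The calibration claim can be checked directly by simulation (a minimal sketch, not from the paper; the normal populations, sample size, and number of replications are illustrative assumptions): when both samples come from the same population, a 5%-level two-sample t test rejects close to 5% of the time, regardless of what the numbers are taken to measure.

```python
import numpy as np
from scipy import stats

def rejection_rate(n_sims=20000, n=25, alpha=0.05, seed=0):
    """Proportion of alpha-level two-sample t tests that reject a TRUE null,
    i.e., when both samples are drawn from the same normal population."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, size=n)  # group 1; the null is true:
        b = rng.normal(0.0, 1.0, size=n)  # group 2 has the same mean
        rejections += stats.ttest_ind(a, b).pvalue < alpha
    return rejections / n_sims

print(rejection_rate())  # close to the nominal 0.05
```

Replacing the normal populations with any pair of identical populations (and a test whose assumptions they satisfy) leaves the long-run rejection rate at the nominal level, which is the sense in which the test is a calibrated instrument.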
The most prevalent problems of p values are their potential for misuse and their widespread misinterpretation (Nickerson, 2000). But misuse or misinterpretation do not make NHST and p values uninterpretable or worthless.

Bayesian approaches are claimed to be free of these presumed problems, yielding a conclusion that is exclusively grounded on the data. In a naive account of Bayesian hypothesis testing, Malakoff (1999) attributes to biostatistician Steven Goodman the assertion that the Bayesian approach "says there is an X% probability that your hypothesis is true–not that there is some convoluted chance that if you assume the null hypothesis is true, you will get a similar or more extreme result if you repeated your experiment thousands of times." Besides being misleading and reflecting a poor understanding of the logic of calibrated NHST methods, what goes unmentioned in this and other accounts is that the Bayesian potential to find out the probability that the hypothesis is true will not materialize without two crucial extra pieces of information. One is the a priori probability of each of the competing hypotheses, which certainly does not come from the data.
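The point about the prior can be made concrete with Bayes' rule for two hypotheses (a hypothetical numerical sketch; the likelihood values are made up for illustration): holding the likelihoods of the data fixed, the "probability that your hypothesis is true" still moves with the prior, which the data do not supply.

```python
def posterior_prob_h1(lik_h1, lik_h0, prior_h1):
    """Bayes' rule for two competing hypotheses:
    P(H1 | data) = P(data | H1) P(H1) / P(data)."""
    prior_h0 = 1.0 - prior_h1
    evidence = lik_h1 * prior_h1 + lik_h0 * prior_h0  # P(data)
    return lik_h1 * prior_h1 / evidence

# Identical data (identical likelihoods), three different priors:
# the posterior probability of H1 ranges widely even though the
# data's contribution has not changed at all.
for prior in (0.1, 0.5, 0.9):
    print(prior, round(posterior_prob_h1(0.08, 0.02, prior), 3))
```

With these made-up likelihoods the posterior runs from about 0.31 to about 0.97 as the prior moves from 0.1 to 0.9, which is exactly the "crucial extra piece of information" the text refers to.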
The other is the probability of the observed data under each of the competing hypotheses, which has the same origin as the frequentist p value and whose computation requires distributional assumptions that must necessarily take the sampling method into consideration. In practice, Bayesian hypothesis testing generally computes BFs and the result might be stated as "the alternative hypothesis is x times more likely than the null," although the probability that this type of statement is wrong is essentially unknown. The researcher may be content with a conclusion of this type, but how much of these odds comes from the data and how much comes from the extra assumptions needed to compute a BF is undecipherable. In many cases research aims at gathering and analyzing data to make informed decisions such as whether application of a treatment should be discontinued, whether changes should be introduced in an educational program, whether daytime headlights should be enforced, or whether in-car use of cell phones should be forbidden. Like frequentist analyses, Bayesian approaches do not guarantee that the decisions will be correct.
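How much of a BF comes from the data and how much from the auxiliary assumptions can be illustrated with the simplest case, a binomial test of θ = 0.5 (a hypothetical sketch, not from the paper; the Beta priors and the 60-successes-in-100-trials data are made up for illustration): the same data yield "the alternative is x times more likely" for quite different x depending on the prior placed on θ under the alternative.

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function, via log-gamma for numerical stability
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf10_binomial(k, n, a, b):
    """Bayes factor for k successes in n Bernoulli trials:
    H1: theta ~ Beta(a, b)  versus  H0: theta = 0.5.
    The binomial coefficient is common to both marginal
    likelihoods and cancels in the ratio."""
    log_m1 = log_beta(k + a, n - k + b) - log_beta(a, b)  # marginal under H1
    log_m0 = n * log(0.5)                                 # likelihood under H0
    return exp(log_m1 - log_m0)

# Same data (60 successes in 100 trials), different priors under H1:
for a, b in [(1, 1), (10, 10), (0.5, 0.5)]:
    print((a, b), round(bf10_binomial(60, 100, a, b), 2))
```

With these illustrative numbers, a Beta(10, 10) prior reports the alternative as roughly twice as likely as the null, while the uniform and Jeffreys priors report odds at or below even for the very same data: the stated odds are inseparable from the prior chosen.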
One may argue that stating how much more likely one hypothesis is over another bypasses the decision to reject or not reject any of them and, then, that Bayesian approaches to hypothesis testing are free of Type-I and Type-II errors. Although this is technically correct, the problem remains from the perspective of SCV: Statistics is only a small part of a research process whose ultimate goal is to reach a conclusion and make a decision, and researchers are in a better position to defend their claims if they can supplement them with a statement of the probability with which those claims are wrong.

Interestingly, analyses of decisions based on Bayesian approaches have revealed that they are no better than frequentist decisions as regards Type-I and Type-II errors and that parametric assumptions (i.e., the choice of prior and the assumed distribution of the observations) crucially determine the performance of Bayesian methods. For instance, Bayesian estimation is also subject to potentially large bias and lack of precision (Alcalá-Quintana and García-Pérez, 2004; García-Pérez and Alcalá-Quintana, 2007), the coverage probability of Bayesian credible intervals can be worse

www.frontiersin.org August 2012 | Volume 3 | Article 325 | 23
than that of frequentist confidence intervals (Agresti and Min, 2005; Alcalá-Quintana and García-Pérez, 2005), and the Bayesian posterior probability in hypothesis testing can be arbitrarily large or small (Zaslavsky, 2010). On another front, use of BIC for model selection may discard a true model as often as 20% of the time, while a concurrent 0.05-size chi-square test rejects the true model between 3 and 7% of the time, closely approximating its stated performance (García-Pérez and Alcalá-Quintana, 2012). In any case, the probabilities of Type-I and Type-II errors in practical decisions made from the results of Bayesian analyses will always be unknown and beyond control.

IMPROVING THE SCV OF RESEARCH

Most breaches of SCV arise from a poor understanding of statistical procedures and the resultant inadequate usage. These problems can be easily corrected, as illustrated in this paper, but the problems would not have arisen if researchers had had better statistical training in the first place. There was a time in which one simply could not run statistical tests without a moderate understanding of NHST. But these days the application of statistical tests is only a mouse-click away, and all that students regard as necessary is learning the rule by which p values pouring out of statistical software tell them whether the hypothesis is to be accepted or rejected, as the study of Hoekstra et al.
(2012) seems to reveal.

One way to eradicate the problem is by improving statistical education at undergraduate and graduate levels, perhaps not just focusing on giving formal training on a number of methods but by providing students with the necessary foundations that will subsequently allow them to understand and apply methods for which they received no explicit formal training. In their analysis of statistical errors in published papers, Milligan and McFillen (1984, p. 461) concluded that "in doing projects, it is not unusual for applied researchers or students to use or apply a statistical procedure for which they have received no formal training. This is as inappropriate as a person conducting research in a given content area before reading the existing background literature on the topic. The individual simply is not prepared to conduct quality research. The attitude that statistical technology is secondary or less important to a person's formal training is shortsighted. Researchers are unlikely to master additional statistical concepts and techniques after leaving school. Thus, the statistical training in many programs must be strengthened. A single course in experimental design and a single course in multivariate analysis is probably insufficient for the typical student to master the course material.
Someone who is trained only in theory and content will be ill-prepared to contribute to the advancement of the field or to critically evaluate the research of others." But statistical education does not seem to have changed much over the subsequent 25 years, as revealed by survey studies conducted by Aiken et al. (1990), Friedrich et al. (2000), Aiken et al. (2008), and Henson et al. (2010). Certainly some work remains to be done in this arena, and I can only second the proposals made in the papers just cited. But there is also the problem of the unhealthy over-reliance on narrow-breadth, clickable software for data analysis, which practically obliterates any efforts that are made to teach and promote alternatives (see the list of "Pragmatic Factors" discussed by Borsboom, 2006, pp. 431–434).

The last trench in the battle against breaches of SCV is occupied by journal editors and reviewers. Ideally, they also watch for problems in these respects. There is no known in-depth analysis of the review process in psychology journals (but see Nickerson, 2005), and some evidence reveals that the focus of the review process is not always on the quality or validity of the research (Sternberg, 2002; Nickerson, 2005). Simmons et al.
(2011) and Wicherts et al. (2012) have discussed empirical evidence of inadequate research and review practices (some of which threaten SCV), and they have proposed detailed schemes through which feasible changes in editorial policies may help eradicate not only common threats to SCV but also other threats to research validity in general. I can only second proposals of this type. Reviewers and editors have the responsibility of filtering out (or requesting amendments to) research that does not meet the journal's standards, including SCV. The analyses of Milligan and McFillen (1984) and Nieuwenhuis et al. (2011) reveal a sizeable number of published papers with statistical errors. This indicates that something remains to be done in this arena too, and some journals have indeed started to take action (see Aickin, 2011).

ACKNOWLEDGMENTS

This research was supported by grant PSI2009-08800 (Ministerio de Ciencia e Innovación, Spain).

REFERENCES

Agresti, A., and Min, Y. (2005). Frequentist performance of Bayesian confidence intervals for comparing proportions in 2 × 2 contingency tables. Biometrics 61, 515–523.

Ahn, C., Overall, J. E., and Tonidandel, S. (2001). Sample size and power calculations in repeated measurement analysis. Comput. Methods Programs Biomed. 64, 121–124.

Aickin, M. (2011). Test ban: policy of the Journal of Alternative and Complementary Medicine with regard to an increasingly common statistical error. J. Altern. Complement. Med. 17, 1093–1094.

Aiken, L. S., West, S. G., and Millsap, R. E. (2008).
Doctoral training in statistics, measurement, and methodology in psychology: replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. Am. Psychol. 63, 32–50.

Aiken, L. S., West, S. G., Sechrest, L., and Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: a survey of PhD programs in North America. Am. Psychol. 45, 721–734.

Albers, W., Boon, P. C., and Kallenberg, W. C. M. (2000). The asymptotic behavior of tests for normal means based on a variance pre-test. J. Stat. Plan. Inference 88, 47–57.

Alcalá-Quintana, R., and García-Pérez, M. A. (2004). The role of parametric assumptions in adaptive Bayesian estimation. Psychol. Methods 9, 250–271.

Alcalá-Quintana, R., and García-Pérez, M. A. (2005). Stopping rules in Bayesian adaptive threshold estimation. Spat. Vis. 18, 347–374.

Anscombe, F. J. (1953). Sequential estimation. J. R. Stat. Soc. Ser. B 15, 1–29.

Anscombe, F. J. (1954). Fixed sample-size analysis of sequential observations. Biometrics 10, 89–100.

Armitage, P., McPherson, C. K., and Rowe, B. C. (1969). Repeated significance tests on accumulating data. J. R. Stat. Soc. Ser. A 132, 235–244.

Armstrong, L., and Marks, L. E. (1997). Differential effect of stimulus context on perceived length: implications for the horizontal–vertical illusion. Percept. Psychophys. 59, 1200–1213.

Austin, J. T., Boyle, K. A., and Lualhati, J. C. (1998). Statistical conclusion validity for organizational science researchers: a review. Organ. Res. Methods 1, 164–208.

Baddeley, A., and Wilson, B. A. (2002). Prose recall and amnesia: implications for the structure of working memory.
Neuropsychologia 40, 1737–1743.

Frontiers in Psychology | Quantitative Psychology and Measurement | August 2012 | Volume 3 | Article 325 | 24