Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers
García-Pérez | Statistical conclusion validity

which the COAST rule was reportedly used, although unintentionally. And not a single citation is to be found in WoS from papers reporting the use of the extensions and modifications of Botella et al. (2006) or Ximenez and Revuelta (2007). Perhaps researchers in psychology invariably use fixed sampling, but it is hard to believe that "data peeking" or "data monitoring" was never used, or that the results of such interim analyses never led researchers to collect some more data. Wagenmakers (2007, p. 785) regretted that "it is not clear what percentage of p values reported in experimental psychology have been contaminated by some form of optional stopping. There is simply no information in Results sections that allows one to assess the extent to which optional stopping has occurred." This uncertainty was quickly resolved by John et al. (2012), who surveyed over 2,000 psychologists with highly revealing results: respondents admitted to the practices of data peeking, data monitoring, or conditional stopping at rates that varied between 20 and 60%.

Besides John et al.'s (2012) proposal that authors disclose these details in full and Simmons et al.'s (2011) proposed list of requirements for authors and guidelines for reviewers, the solution to the problem is simple: use strategies that control Type-I error rates under repeated testing and optional stopping. These strategies have been widely used in biomedical research for decades (Bauer and Köhne, 1994; Mehta and Pocock, 2011).
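The contamination that data peeking produces is easy to quantify with a simulation. The sketch below is not from any of the works cited here; it assumes a hypothetical design in which a one-sample z test of a true null hypothesis (known unit variance) is recomputed after every 10 observations up to a maximum of 100, stopping as soon as p < .05:

```python
import math
import random

def p_two_sided(sample):
    """Two-sided p value of a one-sample z test of H0: mu = 0 (sd known to be 1)."""
    n = len(sample)
    z = sum(sample) / n * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def significant_run(peek, n_max=100, step=10, alpha=0.05):
    """Simulate one experiment under H0; return True if it ends 'significant'."""
    data = []
    for _ in range(n_max):
        data.append(random.gauss(0.0, 1.0))
        # data peeking: test after every block of `step` observations and
        # stop collecting as soon as the result is significant
        if peek and len(data) % step == 0 and p_two_sided(data) < alpha:
            return True
    return p_two_sided(data) < alpha

random.seed(12345)
reps = 5000
fixed_rate = sum(significant_run(peek=False) for _ in range(reps)) / reps
peeking_rate = sum(significant_run(peek=True) for _ in range(reps)) / reps
print(f"fixed sampling (n = 100): {fixed_rate:.3f}")   # close to the nominal .05
print(f"peeking every 10 obs:     {peeking_rate:.3f}") # well above .05
```

With 10 interim looks the empirical Type-I error rate roughly triples or quadruples the nominal .05, even though every individual test is computed correctly.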
There is no reason that psychological research should ignore them and give up efficient research with control of Type-I error rates, particularly when these strategies have also been adapted and further developed for use under the most common designs in psychological research (Frick, 1998; Botella et al., 2006; Ximenez and Revuelta, 2007; Fitts, 2010a,b).

It should also be stressed that not all instances of repeated testing or optional stopping without control of Type-I error rates threaten SCV. A breach of SCV occurs only when the conclusion regarding the research question is based on the use of these practices. For an acceptable use, consider the study of Xu et al. (2011). They investigated order preferences in primates to find out whether primates preferred to receive the best item first rather than last. Their procedure involved several experiments and they declared that "three significant sessions (two-tailed binomial tests per session, p < 0.05) or 10 consecutive non-significant sessions were required from each monkey before moving to the next experiment. The three significant sessions were not necessarily consecutive (...) Ten consecutive non-significant sessions were taken to mean there was no preference by the monkey" (p. 2304). In this case, the use of repeated testing with optional stopping at a nominal significance level of 0.05 for each individual test is part of the operational definition of an outcome variable used as a criterion to proceed to the next experiment.
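The probability of misclassification under this criterion can be worked out exactly. A sketch, assuming for illustration that each session tests significant with probability exactly .05 when the monkey has no preference (the discreteness of the binomial makes the true per-session rate somewhat lower, so this is an upper-bound illustration, not Xu et al.'s own calculation):

```python
from functools import lru_cache

P_SIG = 0.05  # assumed P(one session tests significant | no preference)

@lru_cache(maxsize=None)
def p_misclassify(n_sig, run):
    """P(eventually reaching 3 significant sessions), given n_sig significant
    sessions so far and a current run of `run` consecutive non-significant ones."""
    if n_sig == 3:   # classified as having a preference
        return 1.0
    if run == 10:    # classified as having no preference
        return 0.0
    return (P_SIG * p_misclassify(n_sig + 1, 0)
            + (1.0 - P_SIG) * p_misclassify(n_sig, run + 1))

risk = p_misclassify(0, 0)
# The run counter resets after every significant session, so the three
# "one significant session before 10 consecutive non-significant ones"
# stages are independent, giving the closed form below.
closed_form = (1.0 - (1.0 - P_SIG) ** 10) ** 3
print(f"misclassification risk: {risk:.4f}")  # about .065
```

Under these assumptions, a monkey with no preference is classified as having one about 6.5% of the time, a known and fixed value, as the text goes on to note.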
And, in any event, the overall probability of misclassifying a monkey according to this criterion is certainly fixed at a known value that can easily be worked out from the significance level declared for each individual binomial test. One may object to the value of the resultant risk of misclassification, but this does not raise concerns about SCV.

In sum, the use of repeated testing with optional stopping threatens SCV for lack of control of Type-I and Type-II error rates. A simple way around this is to refrain from these practices and adhere to the fixed-sampling assumptions of statistical tests; otherwise, use the statistical methods that have been developed for use with repeated testing and optional stopping.

PRELIMINARY TESTS OF ASSUMPTIONS

To derive the sampling distribution of the test statistics used in parametric NHST, some assumptions must be made about the probability distribution of the observations or about the parameters of these distributions. The assumptions of normality of distributions (in all tests), homogeneity of variances (in Student's two-sample t test for means or in ANOVAs involving between-subjects factors), sphericity (in repeated-measures ANOVAs), homoscedasticity (in regression analyses), or homogeneity of regression slopes (in ANCOVAs) are well-known cases. The data on hand may or may not meet these assumptions, and some parametric tests have been devised under alternative assumptions (e.g., Welch's test for two-sample means, or correction factors for the degrees of freedom of F statistics from ANOVAs).
Most introductory statistics textbooks emphasize that the assumptions underlying statistical tests must be formally tested to guide the choice of a suitable test statistic for the null hypothesis of interest. Although this recommendation seems reasonable, following it has serious consequences for SCV.

Numerous studies conducted over the past decades have shown that the two-stage approach of testing assumptions first and subsequently testing the null hypothesis of interest has severe effects on Type-I and Type-II error rates. It may seem at first sight that this is simply the result of cascaded binary decisions, each of which has its own Type-I and Type-II error probabilities; yet it is the result of more complex interactions of Type-I and Type-II error rates that do not have fixed (empirical) probabilities across the cases that end up treated one way or the other according to the outcome of the preliminary test: the resultant Type-I and Type-II error rates of the conditional test cannot be predicted from those of the preliminary and conditioned tests. A thorough analysis of the factors that affect the Type-I and Type-II error rates of two-stage approaches is beyond the scope of this paper, but readers should be aware that nothing suggests in principle that a two-stage approach might be adequate.
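The behavior of such a conditional test can be reproduced in simulation. The sketch below is not drawn from any of the studies cited here: it assumes an illustrative two-group design (n = 10 vs. 40, standard deviations 2 vs. 1, both means equal) in which a preliminary two-sided F test of equal variances decides between Student's and Welch's t tests; the t and F tail areas are computed from the regularized incomplete beta function so the example stays self-contained:

```python
import math
import random

def betacf(a, b, x):
    # Continued-fraction evaluation for the regularized incomplete beta function.
    MAXIT, EPS, FPMIN = 200, 3.0e-12, 1.0e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) >= FPMIN else FPMIN)
    h = d
    for m in range(1, MAXIT + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) >= FPMIN else FPMIN)
            c = 1.0 + aa / c
            c = c if abs(c) >= FPMIN else FPMIN
            h *= d * c
        if abs(d * c - 1.0) < EPS:
            break
    return h

def betainc(a, b, x):
    # Regularized incomplete beta I_x(a, b).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b

def mean_var(s):
    n = len(s)
    m = sum(s) / n
    return m, sum((u - m) ** 2 for u in s) / (n - 1)

def t_p(t, df):
    # Two-sided p value for a t statistic with df degrees of freedom.
    return betainc(df / 2.0, 0.5, df / (df + t * t))

def student_p(x, y):
    n1, n2 = len(x), len(y)
    (m1, v1), (m2, v2) = mean_var(x), mean_var(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t_p(t, n1 + n2 - 2)

def welch_p(x, y):
    n1, n2 = len(x), len(y)
    (m1, v1), (m2, v2) = mean_var(x), mean_var(y)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_p(t, df)

def var_f_p(x, y):
    # Preliminary two-sided F test of equal variances (larger variance on top).
    (_, v1), (_, v2) = mean_var(x), mean_var(y)
    if v1 >= v2:
        f, d1, d2 = v1 / v2, len(x) - 1, len(y) - 1
    else:
        f, d1, d2 = v2 / v1, len(y) - 1, len(x) - 1
    return min(1.0, 2.0 * betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f)))

random.seed(99)
reps = 4000
n1, sd1, n2, sd2 = 10, 2.0, 40, 1.0  # larger variance in the smaller group
student = welch = two_stage = 0
for _ in range(reps):
    x = [random.gauss(0.0, sd1) for _ in range(n1)]
    y = [random.gauss(0.0, sd2) for _ in range(n2)]
    ps, pw = student_p(x, y), welch_p(x, y)
    student += ps < 0.05
    welch += pw < 0.05
    # two-stage: Student's t if the preliminary test "accepts" equal variances
    two_stage += (ps if var_f_p(x, y) > 0.05 else pw) < 0.05
print(f"Student's t:  {student / reps:.3f}")   # far above the nominal .05
print(f"Welch's t:    {welch / reps:.3f}")     # close to .05
print(f"two-stage:    {two_stage / reps:.3f}")
```

Under these assumptions Student's t is badly liberal, Welch's t holds the nominal rate, and the two-stage procedure lands at some intermediate value that depends on the power of the preliminary test, illustrating why its error rates cannot be predicted from those of its components.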
The situations that have been most thoroughly studied include preliminary goodness-of-fit tests for normality before conducting a one-sample t test (Easterling and Anderson, 1978; Schucany and Ng, 2006; Rochon and Kieser, 2011), preliminary tests of equality of variances before conducting a two-sample t test for means (Gans, 1981; Moser and Stevens, 1992; Zimmerman, 1996, 2004; Hayes and Cai, 2007), preliminary tests of both equality of variances and normality preceding two-sample t tests for means (Rasch et al., 2011), or preliminary tests of homoscedasticity before regression analyses (Caudill, 1988; Ng and Wilcox, 2011). These and other studies provide evidence that strongly advises against conducting preliminary tests of assumptions. Almost all of these authors explicitly recommended against these practices and hoped that the misleading and misguided advice given in introductory textbooks would be removed. Wells and Hintze (2007, p. 501) concluded that "checking the assumptions using the same data that are to be analyzed, although attractive due to its empirical nature, is a fruitless endeavor because of its negative ramifications on the actual test of interest." The ramifications consist of substantial but

Frontiers in Psychology | Quantitative Psychology and Measurement | August 2012 | Volume 3 | Article 325 | 20
