Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers
García-Pérez | Statistical conclusion validity

which the COAST rule was reportedly used, although unintentionally. And not a single citation is to be found in WoS from papers reporting the use of the extensions and modifications of Botella et al. (2006) or Ximenez and Revuelta (2007). Perhaps researchers in psychology invariably use fixed sampling, but it is hard to believe that "data peeking" or "data monitoring" was never used, or that the results of such interim analyses never led researchers to collect some more data. Wagenmakers (2007, p. 785) regretted that "it is not clear what percentage of p values reported in experimental psychology have been contaminated by some form of optional stopping. There is simply no information in Results sections that allows one to assess the extent to which optional stopping has occurred." This uncertainty was quickly resolved by John et al. (2012), who surveyed over 2,000 psychologists with highly revealing results: respondents admitted to the practices of data peeking, data monitoring, or conditional stopping at rates that varied between 20 and 60%.

Besides John et al.'s (2012) proposal that authors disclose these details in full and Simmons et al.'s (2011) proposed list of requirements for authors and guidelines for reviewers, the solution to the problem is simple: use strategies that control Type-I error rates under repeated testing and optional stopping. These strategies have been widely used in biomedical research for decades (Bauer and Köhne, 1994; Mehta and Pocock, 2011).
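The contamination that data peeking produces is easy to quantify with a simulation. The sketch below is not from any of the works cited here; it assumes a hypothetical design in which a one-sample z test of a true null hypothesis (known unit variance) is recomputed after every 10 observations up to a maximum of 100, stopping as soon as p < .05:

```python
import math
import random

def p_two_sided(sample):
    """Two-sided p value of a one-sample z test of H0: mu = 0 (sd known to be 1)."""
    n = len(sample)
    z = sum(sample) / n * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def significant_run(peek, n_max=100, step=10, alpha=0.05):
    """Simulate one experiment under H0; return True if it ends 'significant'."""
    data = []
    for _ in range(n_max):
        data.append(random.gauss(0.0, 1.0))
        # data peeking: test after every block of `step` observations and
        # stop collecting as soon as the result is significant
        if peek and len(data) % step == 0 and p_two_sided(data) < alpha:
            return True
    return p_two_sided(data) < alpha

random.seed(12345)
reps = 5000
fixed_rate = sum(significant_run(peek=False) for _ in range(reps)) / reps
peeking_rate = sum(significant_run(peek=True) for _ in range(reps)) / reps
print(f"fixed sampling (n = 100): {fixed_rate:.3f}")   # close to the nominal .05
print(f"peeking every 10 obs:     {peeking_rate:.3f}") # well above .05
```

With 10 interim looks the empirical Type-I error rate roughly triples or quadruples the nominal .05, even though every individual test is computed correctly.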
There is no reason that psychological research should ignore them and give up efficient research with control of Type-I error rates, particularly when these strategies have also been adapted and further developed for use under the most common designs in psychological research (Frick, 1998; Botella et al., 2006; Ximenez and Revuelta, 2007; Fitts, 2010a,b).

It should also be stressed that not all instances of repeated testing or optional stopping without control of Type-I error rates threaten SCV. A breach of SCV occurs only when the conclusion regarding the research question is based on the use of these practices. For an acceptable use, consider the study of Xu et al. (2011). They investigated order preferences in primates to find out whether primates preferred to receive the best item first rather than last. Their procedure involved several experiments and they declared that "three significant sessions (two-tailed binomial tests per session, p < 0.05) or 10 consecutive non-significant sessions were required from each monkey before moving to the next experiment. The three significant sessions were not necessarily consecutive (...) Ten consecutive non-significant sessions were taken to mean there was no preference by the monkey" (p. 2304). In this case, the use of repeated testing with optional stopping at a nominal significance level of 0.05 for each individual test is part of the operational definition of an outcome variable used as a criterion to proceed to the next experiment.
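The probability of misclassification under this criterion can be worked out exactly. A sketch, assuming for illustration that each session tests significant with probability exactly .05 when the monkey has no preference (the discreteness of the binomial makes the true per-session rate somewhat lower, so this is an upper-bound illustration, not Xu et al.'s own calculation):

```python
from functools import lru_cache

P_SIG = 0.05  # assumed P(one session tests significant | no preference)

@lru_cache(maxsize=None)
def p_misclassify(n_sig, run):
    """P(eventually reaching 3 significant sessions), given n_sig significant
    sessions so far and a current run of `run` consecutive non-significant ones."""
    if n_sig == 3:   # classified as having a preference
        return 1.0
    if run == 10:    # classified as having no preference
        return 0.0
    return (P_SIG * p_misclassify(n_sig + 1, 0)
            + (1.0 - P_SIG) * p_misclassify(n_sig, run + 1))

risk = p_misclassify(0, 0)
# The run counter resets after every significant session, so the three
# "one significant session before 10 consecutive non-significant ones"
# stages are independent, giving the closed form below.
closed_form = (1.0 - (1.0 - P_SIG) ** 10) ** 3
print(f"misclassification risk: {risk:.4f}")  # about .065
```

Under these assumptions, a monkey with no preference is classified as having one about 6.5% of the time, a known and fixed value, as the text goes on to note.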
And, in any event, the overall probability of misclassifying a monkey according to this criterion is certainly fixed at a known value that can easily be worked out from the significance level declared for each individual binomial test. One may object to the value of the resultant risk of misclassification, but this does not raise concerns about SCV.

In sum, the use of repeated testing with optional stopping threatens SCV for lack of control of Type-I and Type-II error rates. A simple way around this is to refrain from these practices and adhere to the fixed-sampling assumptions of statistical tests; otherwise, use the statistical methods that have been developed for use with repeated testing and optional stopping.

PRELIMINARY TESTS OF ASSUMPTIONS

To derive the sampling distribution of the test statistics used in parametric NHST, some assumptions must be made about the probability distribution of the observations or about the parameters of these distributions. The assumptions of normality of distributions (in all tests), homogeneity of variances (in Student's two-sample t test for means or in ANOVAs involving between-subjects factors), sphericity (in repeated-measures ANOVAs), homoscedasticity (in regression analyses), or homogeneity of regression slopes (in ANCOVAs) are well-known cases. The data on hand may or may not meet these assumptions, and some parametric tests have been devised under alternative assumptions (e.g., Welch's test for two-sample means, or correction factors for the degrees of freedom of F statistics from ANOVAs).
Most introductory statistics textbooks emphasize that the assumptions underlying statistical tests must be formally tested to guide the choice of a suitable test statistic for the null hypothesis of interest. Although this recommendation seems reasonable, following it has serious consequences for SCV.

Numerous studies conducted over the past decades have shown that the two-stage approach of testing assumptions first and subsequently testing the null hypothesis of interest has severe effects on Type-I and Type-II error rates. It may seem at first sight that this is simply the result of cascaded binary decisions, each of which has its own Type-I and Type-II error probabilities; yet it is the result of more complex interactions of Type-I and Type-II error rates that do not have fixed (empirical) probabilities across the cases that end up treated one way or the other according to the outcome of the preliminary test: the resultant Type-I and Type-II error rates of the conditional test cannot be predicted from those of the preliminary and conditioned tests. A thorough analysis of the factors that affect the Type-I and Type-II error rates of two-stage approaches is beyond the scope of this paper, but readers should be aware that nothing suggests in principle that a two-stage approach might be adequate.
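The behavior of such a conditional test can be reproduced in simulation. The sketch below is not drawn from any of the studies cited here: it assumes an illustrative two-group design (n = 10 vs. 40, standard deviations 2 vs. 1, both means equal) in which a preliminary two-sided F test of equal variances decides between Student's and Welch's t tests; the t and F tail areas are computed from the regularized incomplete beta function so the example stays self-contained:

```python
import math
import random

def betacf(a, b, x):
    # Continued-fraction evaluation for the regularized incomplete beta function.
    MAXIT, EPS, FPMIN = 200, 3.0e-12, 1.0e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) >= FPMIN else FPMIN)
    h = d
    for m in range(1, MAXIT + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) >= FPMIN else FPMIN)
            c = 1.0 + aa / c
            c = c if abs(c) >= FPMIN else FPMIN
            h *= d * c
        if abs(d * c - 1.0) < EPS:
            break
    return h

def betainc(a, b, x):
    # Regularized incomplete beta I_x(a, b).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b

def mean_var(s):
    n = len(s)
    m = sum(s) / n
    return m, sum((u - m) ** 2 for u in s) / (n - 1)

def t_p(t, df):
    # Two-sided p value for a t statistic with df degrees of freedom.
    return betainc(df / 2.0, 0.5, df / (df + t * t))

def student_p(x, y):
    n1, n2 = len(x), len(y)
    (m1, v1), (m2, v2) = mean_var(x), mean_var(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t_p(t, n1 + n2 - 2)

def welch_p(x, y):
    n1, n2 = len(x), len(y)
    (m1, v1), (m2, v2) = mean_var(x), mean_var(y)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_p(t, df)

def var_f_p(x, y):
    # Preliminary two-sided F test of equal variances (larger variance on top).
    (_, v1), (_, v2) = mean_var(x), mean_var(y)
    if v1 >= v2:
        f, d1, d2 = v1 / v2, len(x) - 1, len(y) - 1
    else:
        f, d1, d2 = v2 / v1, len(y) - 1, len(x) - 1
    return min(1.0, 2.0 * betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f)))

random.seed(99)
reps = 4000
n1, sd1, n2, sd2 = 10, 2.0, 40, 1.0  # larger variance in the smaller group
student = welch = two_stage = 0
for _ in range(reps):
    x = [random.gauss(0.0, sd1) for _ in range(n1)]
    y = [random.gauss(0.0, sd2) for _ in range(n2)]
    ps, pw = student_p(x, y), welch_p(x, y)
    student += ps < 0.05
    welch += pw < 0.05
    # two-stage: Student's t if the preliminary test "accepts" equal variances
    two_stage += (ps if var_f_p(x, y) > 0.05 else pw) < 0.05
print(f"Student's t:  {student / reps:.3f}")   # far above the nominal .05
print(f"Welch's t:    {welch / reps:.3f}")     # close to .05
print(f"two-stage:    {two_stage / reps:.3f}")
```

Under these assumptions Student's t is badly liberal, Welch's t holds the nominal rate, and the two-stage procedure lands at some intermediate value that depends on the power of the preliminary test, illustrating why its error rates cannot be predicted from those of its components.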
The situations that have been most thoroughly studied include preliminary goodness-of-fit tests for normality before conducting a one-sample t test (Easterling and Anderson, 1978; Schucany and Ng, 2006; Rochon and Kieser, 2011), preliminary tests of equality of variances before conducting a two-sample t test for means (Gans, 1981; Moser and Stevens, 1992; Zimmerman, 1996, 2004; Hayes and Cai, 2007), preliminary tests of both equality of variances and normality preceding two-sample t tests for means (Rasch et al., 2011), or preliminary tests of homoscedasticity before regression analyses (Caudill, 1988; Ng and Wilcox, 2011). These and other studies provide evidence that strongly advises against conducting preliminary tests of assumptions. Almost all of these authors explicitly recommended against these practices and hoped that the misleading and misguided advice given in introductory textbooks would be removed. Wells and Hintze (2007, p. 501) concluded that "checking the assumptions using the same data that are to be analyzed, although attractive due to its empirical nature, is a fruitless endeavor because of its negative ramifications on the actual test of interest." The ramifications consist of substantial but

Frontiers in Psychology | Quantitative Psychology and Measurement | August 2012 | Volume 3 | Article 325 | 20
