Nimon et al. The assumption of reliable data

APPENDIX

EXAMPLE WRITE-UP FOR SAMPLE, INSTRUMENT, AND RESULT SECTIONS

SAMPLE

A convenience sample of 420 students (200 fifth graders, 220 sixth graders) was drawn from a suburban public intermediate school in the southwest of the United States and included 190 Whites (100 males, 90 females; 135 regular education, 55 gifted education), 105 Blacks (55 males, 50 females; 83 regular education, 22 gifted education), 95 Hispanics (48 males, 47 females; 84 regular education, 11 gifted education), 18 Asians (9 males, 9 females; 13 regular education, 5 gifted education), and 12 Others (5 males, 7 females; 10 regular education, 2 gifted education). Forty-five percent of the school's students were classified as high-poverty, as defined by the number of students receiving free lunch. None of the students were in special education. Parental and/or student consent was obtained for 94% of the students, providing a high response rate.

INSTRUMENT

Marat (2005) included an instrument that contained several predictors of self-efficacy (see Pintrich et al., 1991; Bandura, 2006). In the present study, five constructs were included: Motivation Strategies (MS; 5 items); Cognitive Strategies (CS; 15 items); Resource Management Strategies (RMS; 12 items); Self-Regulated Learning (SRL; 16 items); and Self-Assertiveness (SA; 6 items). No modifications were made to the items or the subscales, but only the five subscales listed above were administered from the original instrument. The English version of the instrument was administered on paper to students by the researchers during regular class time and used a five-point Likert scale anchored from 1 (not well) to 5 (very well).
Composite scores were created for each construct by averaging the items for each subscale.

RESULTS

Coefficient alpha was calculated for the data in hand, resulting in acceptable levels of reliability for MS (0.82, 0.84), CS (0.85, 0.84), RMS (0.91, 0.83), SRL (0.84, 0.86), and SA (0.87, 0.83), fall and spring, respectively (Thompson, 2003a).

GIFTED AND REGULAR STUDENTS

Table A1 provides the reliability coefficients, bivariate correlations, means, and SDs for each factor, disaggregated by gifted and regular students.

Table A1 | Bivariate correlations, means, SD, and reliability coefficient disaggregated by gifted and regular education students.

α, Coefficient alpha; M, mean; SD, standard deviation. Bivariate correlations below the diagonal are for Gifted students and above the diagonal are for Regular education students. MS, motivation strategies; CS, cognitive strategies; RMS, resource management strategies; SRL, self-regulated learning; SA, self-assertiveness.

www.frontiersin.org April 2012 | Volume 3 | Article 102 | 53
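As an illustration of the two calculations named in the write-up above (composite scores formed by averaging the items of a subscale, and coefficient alpha computed on the data in hand), the following is a minimal sketch in Python. The function names and the response table are hypothetical and are not taken from the study; alpha is computed with the standard formula k/(k-1) × (1 - sum of item variances / variance of the total score), using sample variances.

```python
from statistics import mean, variance


def composite_scores(rows):
    """Average each respondent's item responses into one composite score."""
    return [mean(row) for row in rows]


def coefficient_alpha(rows):
    """Coefficient (Cronbach's) alpha for a respondents-by-items table."""
    k = len(rows[0])  # number of items in the subscale
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*rows))  # transpose: one tuple of responses per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: 5 students x 4 items on a 1-5 Likert scale.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]

print("composite scores:", composite_scores(responses))
print("coefficient alpha:", round(coefficient_alpha(responses), 3))
```

Because alpha is a property of scores rather than of the instrument, a sketch like this would be run separately for each subscale and each administration (for example, fall and spring).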
PERSPECTIVE ARTICLE
published: 04 July 2012
doi: 10.3389/fpsyg.2012.00218

Replication unreliability in psychology: elusive phenomena or “elusive” statistical power?

Patrizio E. Tressoldi*
Dipartimento di Psicologia Generale, Università di Padova, Padova, Italy

Edited by:
Jason W. Osborne, Old Dominion University, USA

Reviewed by:
Fiona Fidler, University of Melbourne, Australia
Darrell Hull, University of North Texas, USA
Donald Sharpe, University of Regina, Canada

*Correspondence:
Patrizio E. Tressoldi, Dipartimento di Psicologia Generale, Università di Padova, Padova, Italy.
e-mail: patrizio.tressoldi@unipd.it

The focus of this paper is to analyze whether the unreliability of results related to certain controversial psychological phenomena may be a consequence of their low statistical power. Under Null Hypothesis Statistical Testing (NHST), still the most widely used statistical approach, unreliability derives from the failure to refute the null hypothesis, in particular when exact or quasi-exact replications of experiments are carried out. Taking as examples the results of meta-analyses related to four different controversial phenomena (subliminal semantic priming, incubation effect for problem solving, unconscious thought theory, and non-local perception), it was found that, except for semantic priming on categorization, the statistical power of the typical study to detect the expected effect size (ES) is low or very low. The low power in most studies undermines the use of NHST to study phenomena with moderate or low ESs.
We conclude by providing some suggestions on how to increase the statistical power or use different statistical approaches to help discriminate whether the results obtained may or may not be used to support or to refute the reality of a phenomenon with small ES.

Keywords: incubation effect, non-local perception, power, subliminal priming, unconscious thought theory

INTRODUCTION

ARE THERE ELUSIVE PHENOMENA OR IS THERE AN “ELUSIVE” POWER TO DETECT THEM?

When may a phenomenon be considered to be real or very probable, following the rules of current scientific methodology? Among the many requirements, there is a substantial consensus that replication is one of the most fundamental (Schmidt, 2009). In other words, a phenomenon may be considered real or very probable when it has been observed many times and preferably by different people or research groups. Whereas a failure to replicate is quite expected in the case of conceptual replication, or when the experimental procedure or materials entail relevant modifications, a failure in the case of an exact or quasi-exact replication gives rise to serious concerns about the reality of the phenomenon under investigation.
This is the case in the four phenomena used as examples in this paper, namely, semantic subliminal priming, incubation effects on problem solving, unconscious thought, and non-local perception (NLP; e.g., Kennedy, 2001; Pratte and Rouder, 2009; Waroquier et al., 2009).

The focus of this paper is to demonstrate that for all phenomena with a moderate or small effect size (ES), approximately below 0.5 if we refer to standardized differences such as Cohen’s d, the typical study shows a power level insufficient to detect the phenomenon under investigation.

Given that the majority of statistical analyses are based on the Null Hypothesis Statistical Testing (NHST) frequentist approach, a failure to replicate is determined by the failure to reject the (nil) null hypothesis H0, usually with α = 0.05. Even if this procedure is considered incorrect because the frequentist approach only supports H0 rejection and not H0 validity (see footnote 1), it may be tolerated if there is proof of a high level of statistical power, as recommended in the recent APA statistical recommendations (American Psychological Association, 2010): “Power: When applying inferential statistics, take seriously the statistical power considerations associated with the tests of hypotheses. Such considerations relate to the likelihood of correctly rejecting the tested hypotheses, given a particular alpha level, ES, and sample size. In that regard, routinely provide evidence that the study has sufficient power to detect effects of substantive interest.
Be similarly careful in discussing the role played by sample size in cases in which not rejecting the null hypothesis is desirable (i.e., when one wishes to argue that there are no differences), when testing various assumptions underlying the statistical model adopted (e.g., normality, homogeneity of variance, homogeneity of regression), and in model fitting” (p. 30).

HOW MUCH POWER?

Statistical power depends on three classes of parameters: (1) the significance level (i.e., the Type I error probability) of the test, (2) the size(s) of the sample(s) used for the test, and (3) an ES parameter defining H1 and thus indexing the degree of deviation from H0 in the underlying population.

Power analysis should be used prospectively to calculate the minimum sample size required so that one can reasonably detect an effect of a given size. Power analysis can also be used to calculate

Footnote 1. The line of reasoning from “the null hypothesis is false” to “the theory is therefore true” involves the logical fallacy of affirming the consequent: “If the theory is true, the null hypothesis will prove to be false. The null hypothesis proved to be false; therefore, the theory must be true” (Nickerson, 2000).

www.frontiersin.org July 2012 | Volume 3 | Article 218 | 54
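To make the relation among the three classes of parameters listed under HOW MUCH POWER? concrete, the following is a minimal sketch in Python. It assumes a two-sided comparison of two independent groups of equal size and uses the normal approximation to the t-test, so it is an illustration only and not the procedure used in this paper; the function names are hypothetical. With small samples the approximation tends to understate the required sample size slightly compared with an exact calculation based on the noncentral t distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for Cohen's d."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value under H0
    shift = abs(d) * sqrt(n_per_group / 2)       # expected test statistic under H1
    # Probability of landing in either rejection region when H1 is true.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate minimum n per group needed to reach the target power."""
    if d == 0:
        raise ValueError("d must be non-zero: no sample size detects a null effect")
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)


# Prospective use: sample size needed for small, medium, and large effects.
for effect_size in (0.2, 0.5, 0.8):
    n = n_per_group_for_power(effect_size)
    print(f"d = {effect_size}: n per group = {n}, "
          f"power = {power_two_sample(effect_size, n):.3f}")
```

Holding the significance level fixed, the required sample size grows with the inverse square of the ES, which is why studies of effects with d below 0.5 need far larger samples than studies of large effects to reach the same power.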