BLOK ET AL. / EFFECTIVENESS OF EARLY CHILDHOOD EDUCATION

mental conditions and one control condition. Two experimental comparisons were within the domain of our research question. One was that between the CC condition (control condition, i.e., no preschool, no follow-up programme) and the EE condition, in which students received preschool education for the first 5 years of their life, followed by a supplementary Educational Support Program from kindergarten through 2nd grade. The second comparison we extracted from the available data was that between the CC condition and the EC condition, in which students received only the preschool programme and not the follow-up programme. The data we used came from Ramey, Campbell, Burchinal, Skinner, Gardner, and Ramey (2000), who presented IQ data from 5 to 15 years old, and school performance data from 8 to 15 years old.

The results of the Chicago Child-Parent Center and Expansion Program (CPC&EP) have also been widely represented in the research literature. We selected the outcomes reported in Reynolds (1994) as our reference data, because the intervention groups presented there were the most relevant to our research question. We extracted one experimental comparison from these data, namely the full intervention group with follow-on (denoted as PS + KG + PG-3 by Reynolds) contrasted with the non-CPC comparison group.

We found two studies on the effectiveness of a supplementary emergent literacy curriculum compared to a standard Head Start programme (Whitehurst, Epstein, Angell, Payne, Crone, & Fishel, 1994; Whitehurst, Zevenbergen, Crone, Schultz, & Velting, 1999).
The second study is a replication of the first, and includes a follow-up of both the original cohort and the replication cohort. Unfortunately, the outcomes of the two studies were not reported independently, as the second article (Whitehurst et al., 1999) combined the results of both cohorts. This left us no choice but to use the outcomes of the second article only, as it concerned the biggest sample and provided follow-up results. The two Whitehurst et al. studies therefore resulted in one experimental comparison, namely Head Start with an emergent literacy add-on contrasted with a Head Start-only condition.

After all decisions had been made, 34 experimental comparisons remained.

Coding of variables

The experimental comparisons in the database were coded for several characteristics (see Table 1). Variables 1–3 concern design characteristics, variables 4–11 concern sample characteristics, and variables 12–17 concern characteristics of the experimental intervention. Because most experimental comparisons resulted in multiple outcomes, the other variables (variables 18–25) were coded at the level of effect sizes.

Table 1
Coding scheme for the experimental comparisons, and reliability of coding

Variable | Scale | Inter-coder reliability (a)
1. Subject assignment | 0 = strictly controlled (randomisation or matching at subject level); 1 = no strict control (randomisation or matching at group level, post hoc comparison, or no control at all) | 87
2. Treatment fidelity | 0 = high in most respects; 1 = unknown | 100
3. Intervention in control group | 0 = standard programme, not under control of experimenter; 1 = unknown programme or no programme at all | 92
4. Nation | 0 = USA; 1 = other than USA | 100
5. Recency of programme (year implementation started) | Numerical (minus 1900) | .93 (b)
6. Size of experimental group | Number of students | 1.00 (b)
7. Size of control group | Number of students | .99 (b)
8. Mean age of students at onset of study | Number of months (before birth coded as 0) | .96 (b)
9. Percentage of students from ethnic minorities | Percentage | .96 (b)
10. Level of education of parents | 1 = low; 2 = mixed; 9 = unknown | 87
11. Level of income of parents | 1 = low; 2 = mixed; 9 = unknown | 93
12. Delivery mode | 1 = home-based; 2 = centre-based; 3 = combination of home- and centre-based | 96
13. Length of programme | Number of months (a year equals 10 months, unless otherwise indicated by experimenter) | .99 (b)
14. Intensity of programme | Number of hours per week | .91 (b)
15. Continuation after K | 0 = no; 1 = yes | 100
16. Inclusion of social or economic support | 0 = no; 1 = yes | 85
17. Inclusion of coaching of parenting skills | 0 = no; 1 = yes | 85
18. Effect size at pretest | Numerical | 1.00 (b)
19. Standard error of pretest effect size | Numerical | .94 (b)
20. Domain of the posttest | 0 = cognition; 1 = socioemotional development | 93
21. Time of measurement of posttest | Number of months after intervention ended, coded on a time scale of years | 1.00 (b)
22. Type of posttest score | 0 = observed score; 1 = gain score or score adjusted for covariates | 94
23. Type of posttest effect size | 0 = derived by reviewers; 1 = reported by experimenters | 100
24. Effect size at posttest | Numerical | 1.00 (b)
25. Standard error of posttest effect size | Numerical | .94 (b)

(a) Percentage of classifications agreed upon by the two coders, unless otherwise indicated.
(b) Product–moment correlation between the codes of the two coders.
INTERNATIONAL JOURNAL OF BEHAVIORAL DEVELOPMENT, 2005, 29 (1), 35–47

Hedges' unbiased estimate d was used as the effect size estimate (variables 18 and 24). This statistic uses the within-group standard deviation as the method of standardisation, and includes a correction factor to obviate the bias resulting from small samples. The standard error of the effect size (variables 19 and 25) was estimated following Hedges and Olkin (1985, p. 86, Eq. 15). Whenever possible, we used observed scores to calculate effect sizes. Several experimenters, however, reported only gain scores or scores adjusted for covariates, as indicated by variable 22 (type of posttest score).

Some reported outcomes were inherently negative, for instance when behaviour ratings referred to negative behaviour. In these cases, outcomes were recoded simply by changing the sign. This correction procedure was applied to the studies by Goodson et al. (2000), Johnson and Walker (1987), Scarr and McCartney (1988), and Seitz, Rosenbaum, and Apfel (1985).

Two independent coders coded all the studies. Inter-coder reliability was estimated as the rate of agreement in the case of a nominal scale, or as the product–moment correlation in the case of an interval scale. The results are reported in the last column of Table 1. Reliability proved satisfactory, ranging between 85 and 100% for the nominal variables, and between .91 and 1.00 for the interval variables.
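The effect-size computation described earlier in this section can be sketched as follows. This is our own illustration, not the authors' code: the bias-correction factor 1 − 3/(4·df − 1) is the standard approximation to Hedges' correction, and the standard-error formula is the usual large-sample approximation consistent with Hedges and Olkin (1985):

```python
import math

def hedges_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Standardised mean difference using the pooled within-group SD,
    with the small-sample bias correction (Hedges' unbiased estimate)."""
    df = n_e + n_c - 2
    s_pooled = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / df)
    g = (mean_e - mean_c) / s_pooled      # uncorrected standardised difference
    j = 1 - 3 / (4 * df - 1)              # small-sample correction factor
    return j * g

def se_d(d, n_e, n_c):
    """Approximate standard error of the effect size d."""
    n = n_e + n_c
    return math.sqrt(n / (n_e * n_c) + d**2 / (2 * n))
```

With 20 students per condition, a 2-point difference, and a within-group SD of 4, this yields d of about 0.49 with a standard error of about 0.32, illustrating the limited precision of the small samples discussed below.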
In the case of divergent codes, final codes were established by mutual agreement and used in the subsequent analyses.

Many study designs either did not incorporate a pretest or did not report sufficient statistics to estimate an effect size capturing the initial differences between conditions at the pretest. We were able to determine pretest effect sizes for only 40% of our cases. To prevent an excessive loss of data, we decided to impute a value of zero for missing pretest effect sizes. This value is close to the mean found for the cases in which a pretest effect size could be estimated (mean 0.06, with a corresponding standard error of 0.04).

Integration of effects

The coding phase resulted in a file containing 207 different outcomes (171 in the cognitive domain, 36 in the socioemotional domain) from the 34 experimental comparisons. We analysed the data in two steps. First, we aggregated effect sizes to the level of the experimental comparisons. This aggregation was performed separately for each domain and each time of measurement, varying from 0 to 180 months after the intervention ended. It was conducted by weighted integration, in which the results were weighted in inverse proportion to their standard error (i.e., the greater the standard error, the smaller the weight). The aggregated effect sizes and the corresponding standard errors were estimated following Hedges and Olkin (1985, p. 112, Eqs. 8 and 9). This aggregation model assumes the results within one study to be homogeneous, differing from each other only on the basis of random differences between the outcome variables.
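The weighted integration described above can be sketched as follows. We use inverse-variance weights, the usual reading of the Hedges and Olkin fixed-effect aggregation, so that outcomes with larger standard errors receive smaller weights:

```python
import math

def aggregate(effect_sizes, standard_errors):
    """Inverse-variance weighted aggregation of several outcomes
    into one pooled effect size and its standard error."""
    weights = [1.0 / se**2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effect_sizes)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se
```

For two equally precise outcomes of 0.5 and 0.3 (each with SE 0.1), this returns a pooled effect size of 0.4 with an SE below 0.1, smaller than either constituent SE.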
The standard errors of aggregated effect sizes are generally smaller than the standard errors of the constituent effect sizes, which seems a fair reward for using more than one outcome measure. All aggregation calculations were done with the Meta program (Schwarzer, 1989). This first step resulted in 85 different outcomes (71 in the cognitive domain, 14 in the socioemotional domain).

As a second step, the outcomes were integrated into an overall effect size, separately for each domain. The integration was performed according to the random effects model (Hedges & Olkin, 1985). The model we specified acknowledges the hierarchical and longitudinal nature of our data. It splits the effect size d_ijt for experimental comparison j from study i at moment t into two components, namely a true effect size δ_ijt and an error component e_ijt. The true effect sizes are assumed to vary across measurement moments t, comparisons j, and studies i. The variance of δ_ijt is explained by the regression model:

δ_ijt = γ_0 + Σ_n γ_n Z_nijt + u_ijt + v_ij + w_i    (1)

where γ_0 is the grand mean, the Z_nijt are characteristics of the studies, comparisons, and measurement moments (n being the index referring to the characteristics), and u_ijt, v_ij, and w_i are residual error terms at the three levels distinguished. The model makes it possible to distinguish between three variance components, viz.
σ²_u (the variance between measurement moments t), σ²_v (the variance between experimental comparisons j), and σ²_w (the variance between studies i). The model also makes it possible to test whether any of these variance components differs significantly from zero, using the test statistic Q. If study outcomes are heterogeneous, it is worthwhile trying to relate the heterogeneity to the various characteristics Z_nijt. If not, the study outcomes are homogeneous and no explanatory variables need to be introduced in equation (1). The specification and testing of models was carried out with MLwiN, using restricted maximum likelihood estimation (Goldstein et al., 1998; Hox, 2002). Analyses were performed separately for the two domains (cognition and socioemotional development).

Results

Description of the studies in the database

This subsection briefly describes the studies in our database, which yielded 34 different comparisons (Table 2). Assignment of the subjects to the different conditions of a comparison proceeded according to strict guidelines (at random, by matching, or by blocking) in only 16 cases. In the other cases, less strict procedures were followed (e.g., random assignment or matching of intact groups), or assignment was not under the control of the investigator.

Treatment fidelity was reported to be high in all or most respects for 11 of the 34 comparisons. For the other comparisons, no information could be found. However, this does not necessarily mean that the treatment was jeopardised. We found the same lack of information with respect to the control condition.
Students in the control condition mostly followed a "standard programme".

The sample size was generally small, averaging 77 for both the experimental and the control conditions. This average excludes the outlying large sample of the study by Goodson et al. (2000), which featured about 1600 children in both conditions. The experimental group contained more than 100 students in only 8 of the 34 comparisons. Evidently, such small sample sizes imply generally low power to detect a difference in outcomes. Most students belonged to an ethnic minority group (average: 81%, taking experimental and control groups together). Student age at the start of the intervention programme showed considerable variation, ranging from pre-birth to 64 months (average 37 months). Both the socioeconomic status and the income of the parents were