The Influence of Alternative Scale Formats on the Generalizability of Data Obtained From Direct Behavior Rating Single-Item Scales (DBR-SIS)

Assessment for Effective Intervention, XX(X), 1–7
© 2012 Hammill Institute on Disabilities
DOI: 10.1177/1534508412441966
http://aei.sagepub.com

Amy M. Briesch, PhD (1), Stephen P. Kilgus, PhD (2), Sandra M. Chafouleas, PhD (3), T. Chris Riley-Tillman, PhD (4), and Theodore J. Christ, PhD (5)

(1) Northeastern University, Boston, MA, USA
(2) East Carolina University, Greenville, NC, USA
(3) University of Connecticut, Storrs, CT, USA
(4) University of Missouri, Columbia, MO, USA
(5) University of Minnesota, Minneapolis, MN, USA

Corresponding Author: Amy M. Briesch, Northeastern University, Department of Counseling and Applied Educational Psychology, 404 International Village, 360 Huntington Avenue, Boston, MA 02115, USA. Email: a.briesch@neu.edu

Abstract

The current study served to extend previous research on the scaling construction of Direct Behavior Rating (DBR) in order to explore the potential flexibility of DBR to fit various intervention contexts. One hundred ninety-eight undergraduate students viewed the same classroom footage but rated student behavior using one of eight randomly assigned scales, which differed with regard to the number of gradients, the length of the scale, and the use of a discrete versus continuous format. Descriptively, mean ratings typically fell within the same scale gradient across conditions. Furthermore, results of generalizability analyses revealed negligible variance attributable to the facet of scale type or to interaction terms involving this facet. Implications for DBR scale construction within the context of intervention-related decision making are presented and discussed.

Keywords: emotional/behavioral disorders, rating scales, social–emotional

Although traditionally conceived of as an intervention tool (e.g., the daily behavior report card; Chafouleas, Riley-Tillman, & McDougal, 2002; Vannest et al., 2010), use of Direct Behavior Rating (DBR) has gained recent interest within an assessment context in light of the highlighted need to identify appropriate tools for use within problem-solving models (Chafouleas, Volpe, Gresham, & Cook, 2010). That is, within a proactive model of service delivery, decisions regarding student performance and response to instructional and behavioral supports must be made efficiently and for a greater number of students. As a result, there exists a need to identify assessment tools that are both effective (i.e., psychometrically defensible) and feasible for regular use. It has been suggested that the use of DBR may meet both requirements (Chafouleas, Riley-Tillman, & Sugai, 2007). DBR involves conducting a single rating of operationally defined behavior(s) of interest (e.g., academic engagement) at the end of a prespecified rating period (e.g., a math lesson or unit; Chafouleas, Riley-Tillman, & Sugai, 2007). Surveys of DBR use have included both teacher and school psychologist samples, with results generally suggesting moderate to high use for assessment purposes (Chafouleas, Riley-Tillman, & Sassu, 2006; Riley-Tillman, Chafouleas, Briesch, & Eckert, 2008). However, a clear understanding of what actual usage looks like in practice, particularly with regard to issues of instrumentation, has not been provided.

In an effort to systematize empirical work related to building a psychometric base of evidence for DBR use in assessment, researchers have focused on investigations relevant to DBR Single-Item Scales (DBR-SIS; Christ, Riley-Tillman, & Chafouleas, 2009). Fundamentally, DBR-SIS can be described as a unipolar graphic rating scale, in that (a) the left end of the scale represents an absence of the behavior (i.e., 0%) whereas the right indicates a strong presence (i.e., 100%), and (b) graphic descriptions (e.g., anchors) are typically provided along the scale (Christ & Boice, 2009). Beyond these general categorizations, however, variability in scale format has been noted across studies with regard to the gradients and anchors used. Some investigations have used a DBR-SIS consisting of a line divided into six equal gradients (i.e., 0–5), with percentage anchors at each gradient (e.g., 0%, 20%, . . . 100%; Chafouleas, Riley-Tillman, Sassu, LaFrance, & Patwa, 2007). In contrast, other studies have used an 11-gradient line (i.e., 0–10), with percentage anchors included at only three points (i.e., 0%, 50%, 100%; Chafouleas et al., 2010). More recent work has suggested that ratings may yield technically defensible data provided that at least six gradients are used (Chafouleas, Christ, & Riley-Tillman, 2009; Christ et al., 2009). Direct comparisons of these scale formats are needed, however, to evaluate the consistency of obtained data.

Literature spanning several decades supports a relation between scale construction and the technical adequacy of data. For instance, although somewhat inconsistent, numerous studies have revealed a relationship between the number of scale gradients and psychometric defensibility. Weng (2004) identified a positive correlation between the number of gradients and both coefficient alpha and test–retest reliability coefficients. Researchers have also considered the influence of multiple graphic rating scale components, including line length. For example, Revill, Robinson, Rosen, and Hogg (1976) compared 5-, 10-, 15-, and 20-cm lines, finding the shortest line to be associated with the greatest error. Finally, other work (e.g., Preston & Colman, 2000) has compared categorical and continuous scales. Some findings have indicated that raters demonstrate a preference for categorical scales given that they are (a) more consistent with natural judgments and (b) less time-intensive to code than their continuous counterparts (Ramsay, 1973). Others, however, have specified that the use of a continuous scale may offer greater specificity, particularly when very few ratings are to be made or in the absence of summation, as is the case with DBR-SIS (Christ & Boice, 2009).

To date, two studies have investigated such issues (e.g., gradients, anchoring) related to DBR-SIS construction. In a study by Chafouleas et al. (2009), 125 undergraduate students were asked to indicate the total amount of time (0–60 seconds) that the target behavior (i.e., visually distracted, active manipulation) was observed on a 100-mm line marked with three qualitative anchors (i.e., never, sometimes, always). Scales differed, however, across three within-participant experimental conditions based on the number of scale gradients (i.e., 6, 10, 14) applied to the line. Generalizability study results were found to be roughly similar across scale gradient conditions, with the greatest proportions of rating variance attributable to error (i.e., pro,e; range = 35%–39%), differences between raters (i.e., o; range = 17%–23%), and changes in the rank ordering of students across time (i.e., p × o; range = 11%–17%). Results therefore suggested that the number of gradients applied to an otherwise identical scale should not affect the reliability of DBR-SIS data.

More recently, 81 undergraduate student participants were provided with a 100-mm, 10-gradient DBR-SIS to rate student levels of academic engagement and disruptive behavior across a series of video clips (Riley-Tillman, Christ, Chafouleas, Boice-Mallach, & Briesch, 2010). In this case, each participant was randomly assigned to use either a proportional scale (i.e., 0%, 20%, 90%) or an absolute scale (i.e., 1 min, 4 min).
Generalizability findings indicated that neither approach to scale anchoring was associated with greater rating accuracy. However, the authors suggested that proportional scaling might be more efficient, as interpretation does not require knowledge of the observation duration. One limitation associated with each of the aforementioned studies was that all ratings were conducted using a 100-mm line. Given previous evidence suggesting that scale length may influence rating error (Revill et al., 1976), it is unclear whether the results specific to scaling gradients and anchoring generalize to other scale formats.

Initial findings of consistency across approaches to scale construction appear to support the potential flexibility in constructing DBR-SIS; however, such flexibility and the related variability in construction across studies can also limit the cumulative interpretation of findings from DBR-SIS research. For example, variations may restrict the degree to which psychometric results apply, and thus it is recommended that direct comparisons be made to determine whether these scales function differently. In this way, school-based professionals interested in using DBR to inform defensible intervention-based decisions would be provided guidance with regard to the flexibility of scale construction. Thus, the purpose of the current study was to expand upon existing research on DBR-SIS instrumentation through an evaluation of the influence of various scale formats on the generalizability of ratings of student behavior. These formats differed with regard to the number of scale gradients provided (i.e., 5 or 10), the length of the scale itself (i.e., 50 or 100 mm), and the use of either a discrete or continuous scale. Given that minimal differences between scale types have been previously noted in the literature, it was hypothesized that evidence of convergent validity across scale types would be identified in the current study.

Method

Participants and Setting

Participants included 198 undergraduate students enrolled in an introductory psychology course at a large southeastern university and represented a sample of convenience. Per Human Subjects Institutional Review Board–approved procedures, written informed consent was obtained from each participant prior to enrollment. The majority (58%) of participants were female, and participants identified themselves as either White (61%) or Black (24%). Roughly 30% of participants were currently enrolled in a teacher education program.


Materials

Videotape. As the design of the study required that all raters be able to view an identical snapshot of behavior, preexisting video footage of elementary school–aged students was used. Parental consent had previously been obtained for the children to serve as actors during a period of simulated classroom instruction designed to ensure sufficient variability in the behaviors of interest. Most of the simulated instruction was unscripted; however, general visual cues (e.g., signs reading "get out of seat") were provided to ensure that students displayed a variety of both appropriate behaviors (e.g., passive and active engagement) and inappropriate behaviors (e.g., calling out, noncompliance). The final four 3-min video clips were chosen because several children could be clearly seen in the frame exhibiting a range of behaviors typically observed in educational settings.

DBR-SIS forms. All DBR-SIS forms required participants to observe and rate two common classroom behaviors: academic engagement (AE) and disruptive behavior (DB). For the purposes of this study, Shapiro's (2004) definition of academic engagement used in the Behavioral Observation of Students in Schools (BOSS) was adopted. That is, AE was defined as actively (e.g., writing, raising hand) or passively (e.g., listening to the teacher, reading silently) participating in classroom activities. DB was defined as any student action that interrupts regular school or classroom activities (e.g., being out of seat, calling out). Although all participants rated the same target behaviors, the graphic presentation of the DBR-SIS varied across conditions. Scales varied with regard to (a) scale type (i.e., continuous, discrete), (b) number of scale gradients provided (i.e., 5, 10), and (c) scale length (i.e., 50 mm, 100 mm). This resulted in a total of eight experimental groups, with roughly 25 participants assigned to each condition.

Procedure

Participants were randomly assigned to an experimental condition (i.e., DBR-SIS format) on entering one of five possible experimental sessions (each 25–30 min in duration) scheduled across a 3-week period. Because different scales were used across conditions, examiner instructions remained general. Participants were allowed 4 min to independently review the behavioral definitions, as well as an example of how to use the scale to conduct their ratings. Participants were allowed continuous access to definitions, instructions, and rating examples throughout study proceedings. Next, all participants were asked to carefully observe four of the eight students present in the classroom and to rate the AE and DB displayed by each student immediately following each 3-min clip. All participants were given up to 2 min to complete their ratings using only the scale type assigned to their condition. In total, each participant conducted 32 ratings (i.e., four target students × four video clips × two behaviors).

Dependent Measure

The primary dependent variable was the DBR score assigned to each target student. Data coders used 12-inch rulers to determine the exact point at which each participant's rating fell. To facilitate direct comparisons across DBR types, millimeter values were subsequently converted to an 11-point (i.e., 0–10) discrete scale using a ±5-mm margin of error for each discrete value. That is, ratings falling less than 5 mm below, or 5 mm above, each decile (i.e., 10, 20, etc.) were considered to fall within that decile (e.g., 47 mm and 54 mm both equate to 5 on the discrete scale). Intercoder reliability was determined for 10% of cases and was found to be extremely high (M = 1.00).
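To make the conversion concrete, the following minimal sketch (not the authors' coding procedure) applies the ±5-mm rule described above. It assumes the 100-mm case and simple nearest-decile rounding; the article does not state how marks falling exactly 5 mm from a decile, or marks made on the 50-mm scales, were handled, so the rescaling step below is an assumption.

```python
def mm_to_discrete(mm: float, line_length_mm: int = 100) -> int:
    """Convert a mark (mm from the left anchor) to the 0-10 discrete scale.

    Rough sketch of the conversion described above: a mark within 5 mm of a
    decile on the 100-mm line receives that decile's value. Marks on the
    50-mm scales are rescaled first (an assumption; the article does not
    specify this step), and exact 5-mm boundaries follow Python rounding.
    """
    position_pct = mm / line_length_mm * 100.0  # rescale to the 100-mm metric
    return int(min(10, max(0, round(position_pct / 10.0))))


# Example from the text: 47 mm and 54 mm both map to 5 on the discrete scale.
assert mm_to_discrete(47) == 5
assert mm_to_discrete(54) == 5
```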
Data Analysis

Subsequent to data screening, descriptive statistics were first reviewed in order to examine the degree to which ratings differed across scale types. Generalizability theory (GT; see Brennan, 2001, for a comprehensive discussion) was then used to determine the percentage of rating variance attributable to relevant sources of error (i.e., facets). In the current investigation, the facet of scale type was of primary importance; however, variance attributable to differences across persons, observations, and raters was also examined. Generalizability (G) studies were conducted using a partially nested, random-effects model in which person (p; i.e., student) was crossed with observation (o; i.e., video clip) and with rater (r; i.e., study participant) nested within scale type (s; i.e., assigned participant condition), that is, p × o × (r:s). The variance components derived from a G study can be used in a decision (D) study to generate reliability-like coefficients for the purposes of both relative and absolute decision making. Given that the purpose of the current study was to assess validity (i.e., evidence of convergent validity across scale types) rather than reliability, however, only G study results are presented. All variance components were derived in SPSS 17.0 using an ANOVA with Type III sum of squares.
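As a point of reference for how the percentages reported in Table 2 follow from this analysis, the sketch below (not the authors' SPSS syntax) simply divides each estimated variance component of the p × o × (r:s) model by the total. The academic engagement components are those reported later in Table 2; small discrepancies from the published percentages reflect rounding of the components.

```python
# Minimal sketch: proportion of total rating variance attributable to each
# facet of the p x o x (r:s) random-effects G study. The academic engagement
# (AE) variance components are taken from Table 2; percentages may differ
# from the published values by about a point because the components are rounded.
ae_variance_components = {
    "person (p)": 0.48,
    "scale type (s)": 0.00,
    "observation (o)": 0.09,
    "rater:scale type (r:s)": 1.24,
    "person x scale type": 0.03,
    "person x observation": 4.45,
    "observation x scale type": 0.01,
    "person x scale type x observation": 0.01,
    "error (residual + r:s interactions)": 4.28,
}

total_variance = sum(ae_variance_components.values())
for facet, variance in ae_variance_components.items():
    share = 100.0 * variance / total_variance
    print(f"{facet:<38s} {variance:5.2f}  {share:4.1f}%")
```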


Results

Rating Descriptives

Descriptively, results suggested that ratings did not vary substantively across DBR formats, with 91% of mean ratings fluctuating no more than 1 point between DBR scales. In fact, for a given student and video clip, the mean range of DBR scores across scaling conditions was 1.12 points for AE and 1.02 points for DB. The largest range (1.70 points) was observed for the rating of Student 3's AE during Clip 4 (see Table 1).

Table 1. Means and Standard Deviations for Academic Engagement Across Groups by Clip and Student

Clip 1
          Group 1      Group 2      Group 3      Group 4      Group 5      Group 6      Group 7      Group 8
Student   M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD
1         6.14  2.06   6.56  2.43   6.41  2.50   6.00  2.06   6.17  2.06   5.79  2.96   5.86  2.17   6.56  2.42
2         7.84  2.33   7.68  2.49   7.83  2.41   7.04  2.59   7.46  2.90   8.04  1.99   8.02  2.20   8.27  1.92
3         6.59  3.05   7.08  3.09   7.57  2.69   6.72  2.63   7.28  2.91   7.91  2.44   7.52  2.85   7.29  3.21
4         5.63  2.69   5.84  2.79   5.37  2.22   5.46  2.32   4.67  3.14   6.23  2.86   6.27  2.38   6.13  2.76

Clip 2
          Group 1      Group 2      Group 3      Group 4      Group 5      Group 6      Group 7      Group 8
Student   M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD
1         6.20  1.71   6.82  2.14   6.62  2.63   6.71  1.62   7.15  1.86   6.45  2.81   5.95  2.48   6.75  1.93
2         7.59  2.67   7.76  2.44   7.68  3.05   7.31  2.30   8.04  2.73   8.38  2.04   8.25  2.58   7.96  2.91
3         7.80  2.44   7.94  2.58   8.15  2.59   7.58  2.61   7.96  2.62   8.09  2.59   8.70  1.80   7.93  3.06
4         6.90  2.27   7.46  1.95   6.98  2.54   6.98  2.04   7.22  2.53   7.28  2.67   7.41  2.34   8.27  1.64

Clip 3
          Group 1      Group 2      Group 3      Group 4      Group 5      Group 6      Group 7      Group 8
Student   M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD
1         2.27  2.19   2.82  1.93   2.98  2.53   3.31  2.54   2.54  2.56   2.36  1.98   2.50  2.13   2.64  2.02
2         7.61  1.83   8.16  2.11   8.13  2.41   7.67  1.87   7.80  2.62   8.21  2.00   8.09  2.11   7.80  2.50
3         8.12  1.89   7.02  3.13   7.77  2.50   6.94  2.54   7.61  2.90   8.17  2.37   7.39  2.90   7.95  2.64
4         8.71  1.78   8.58  1.60   8.49  1.68   8.19  1.50   8.30  2.30   9.06  1.36   8.93  1.35   8.91  1.24

Clip 4
          Group 1      Group 2      Group 3      Group 4      Group 5      Group 6      Group 7      Group 8
Student   M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD     M     SD
1         3.24  2.17   4.29  2.35   4.04  2.37   3.71  2.53   3.28  2.14   3.94  2.55   4.07  2.24   3.84  2.51
2         0.52  1.39   0.78  1.95   0.77  1.74   1.19  2.16   0.20  0.72   1.15  2.76   0.86  1.88   0.60  1.76
3         5.90  2.26   7.60  2.25   6.45  2.73   6.33  2.38   6.33  2.56   6.60  2.58   7.34  1.78   6.93  2.41
4         8.16  1.95   8.59  2.06   7.78  2.55   7.98  1.42   7.98  2.24   8.47  2.02   8.55  1.85   8.69  1.82
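The small sketch below (added for illustration, not part of the original analysis) shows how the ranges reported above follow from Table 1: for each student within a clip, the range is the difference between the largest and smallest group means. Only the Clip 4 academic engagement means are entered, which is where the largest range (1.70, Student 3) occurs.

```python
# Per-student range of group means for Clip 4 (AE values from Table 1).
clip4_ae_group_means = {
    "Student 1": [3.24, 4.29, 4.04, 3.71, 3.28, 3.94, 4.07, 3.84],
    "Student 2": [0.52, 0.78, 0.77, 1.19, 0.20, 1.15, 0.86, 0.60],
    "Student 3": [5.90, 7.60, 6.45, 6.33, 6.33, 6.60, 7.34, 6.93],
    "Student 4": [8.16, 8.59, 7.78, 7.98, 7.98, 8.47, 8.55, 8.69],
}

for student, means in clip4_ae_group_means.items():
    print(f"{student}: range across the 8 scale conditions = "
          f"{max(means) - min(means):.2f}")
# Student 3 yields 1.70, the largest range reported in the text.
```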
Generalizability Analyses

GT was used to evaluate whether DBR outcomes were similar across scaling conditions, as well as to consider alternative facets or interactions to which rating variance may be attributed. Variance component analyses revealed several consistencies across the behaviors examined (see Table 2). First, the greatest proportion of rating variance (42% AE, 40% DB) was explained by the interaction between persons and occasions. This indicates that the rank order of students varied substantially from one video clip to the next, which follows from the fact that video clips were purposively selected based on the behavioral variability demonstrated. Most pertinent to the purpose of the study, nearly all facets and interactions involving scale type (i.e., scale type, persons by scale type, occasions by scale type, persons by occasions by scale type) were found to contribute negligible variance to the model. Across scales, no overall rating differences were identified and the rank order of students did not change. A notable percentage of variance (12% AE, 5% DB) was attributable, however, to the term involving raters nested within scale types.

Table 2. Full Model Generalizability Study Results: Person, Scale, Rater:Scale, Occasion

                                        AE              DB
Facet(s)                           Var     %Var    Var     %Var
Person                             0.48      5     3.81     28
Scale type                         0.00      0     0.00      0
Observation                        0.09      1     0.37      3
Rater:scale type                   1.24     12     0.70      5
Person × scale type                0.03      0     0.03      0
Person × observation               4.45     42     5.57     40
Observation × scale type           0.01      0     0.01      0
Person × scale × observation       0.01      0     0.01      0
Error (a)                          4.28     41     3.30     24
Total                             10.57    100 (b) 13.79   100 (b)

Note. AE = academic engagement; DB = disruptive behavior; Var = variance calculated using Type III sum of squares; %Var = percentage of total variance.
a. Includes residual along with interactions involving r:s.
b. Values rounded to 100%.

Although many consistencies were observed across behaviors, two differences should be noted. First, the facet of person explained 28% of the variance in ratings of DB but only 5% in the case of AE, suggesting that student engagement levels were more consistent when averaged across raters and observations. Relatedly, a significant proportion of the rating variance for AE (41%) remained unexplained by facets of the model, in comparison to a smaller proportion for DB (24%). Finally, the nested term involving the main effect of raters and the interaction between raters and scale types explained a greater proportion of variance for AE (12%) than for DB (5%). Although this term cannot be neatly interpreted given the confounded effects, it generally suggests that raters were more consistent in their assessment of DB than AE.

Discussion

The purpose of the current study was to examine the influence of DBR scale construction on obtained ratings. Overall, ratings were not found to vary significantly across the DBR-SIS formats employing different gradients (5 vs. 10), lengths (50 mm vs. 100 mm), and scaling approaches (continuous vs. discrete). This is consistent with previous scaling research, which found no statistically significant differences between scales employing either (a) 5 or 10 scale gradients (Bendig, 1954; Chafouleas et al., 2009; Matell & Jacoby, 1972) or (b) continuous or categorical scales (Cicchetti, Showalter, & Tyrer, 1985; Rasmussen, 1989). Raw ratings converted to an 11-point discrete scale were found to be relatively similar across groups, with mean DBR scores falling within 1.12 points when rating AE and 1.02 points when rating DB.

Results of generalizability analyses also supported the finding of minimal differences across DBR scale formats. Analyses demonstrated that negligible variance (0%) was attributable to the facet of scale type (indicating no overall differences in rating behavior across scales) or to the interaction between person and scale type (indicating no changes in the rank order of students across scales). A small proportion of variance was, however, attributable to the term involving raters nested within scale types (12% AE, 5% DB), generally suggesting that how a particular scale was used differed from one rater to another. Such a finding is not surprising given that significant rating differences have been noted across DBR users in the absence of rater training (Briesch, Chafouleas, & Riley-Tillman, 2010; Chafouleas, Christ, et al., 2007). This finding does, however, further support the need to consider DBR recordings within rater. In addition, results were consistent with the findings of Chafouleas, Briesch, and colleagues (2010), in which greater differences between raters were noted when rating AE (8%) than DB (1%).
The higher salience of disruptive behavior may explain why these ratings have been more consistent and highlights the fact that different behavior targets may warrant varying levels of rater training.

It is also worth noting that the greatest proportion of variance in ratings of both AE and DB (i.e., roughly 40%) was attributable to the interaction between persons and occasions, thus indicating that the rank order of students varied widely from one observation occasion to the next. Such a finding was not surprising given that video clips were purposively selected to ensure behavioral variability. These results do, however, lend support to the potential role of DBR-SIS in monitoring student behavior, particularly with regard to sensitivity to change. Capacity for sensitivity to change is particularly important to ensure that behavioral data are reflective of actual variation in the trend or level of student behavior. The size of the variance component for the interaction between persons and occasions thus supports the notion that DBR-SIS data are sufficiently sensitive to behavioral changes over time. This is a key finding for classroom teachers and support personnel interested in using DBR to monitor student response to intervention and detect changes in behavior over time.

Finally, 24% (DB) to 41% (AE) of the observed rating variance was subsumed under the residual error term, suggesting the influence of other factors that were either uncontrolled for in the model (e.g., time of day) or uninterpretable because of the nesting of facets. A recent study, for example, found that the interaction between raters and persons (i.e., differential rating of particular students) accounted for a significant proportion of the variance (20%) in DBR ratings (Briesch et al., 2010). Because of the nesting of raters within scale types in the current study, the effect of this interaction could not be independently estimated; however, this effect may help to account for some degree of the residual error observed.

Limitations

One limitation of the current study was that all ratings originally made on a continuous scale were converted to discrete values. This was deemed necessary in order to make meaningful comparisons between the ratings; however, some degree of rating variability was inevitably lost. Although the eight scales appear relatively equivalent, mean ratings would likely appear somewhat different had the continuous values been used.

Second, despite precedent for the methodology within the DBR literature (e.g., Chafouleas et al., 2009; Riley-Tillman et al., 2010) and beyond (Sterling-Turner & Watson, 2002; Sterling-Turner, Watson, Wildmon, Watkins, & Little, 2001), the use of undergraduate students as research participants is considered a limitation. Although previous research has highlighted similarities between teacher-generated ratings and those conducted by external raters who do not have competing demands on their attention (i.e., research assistants; e.g., Chafouleas, Briesch, et al., 2010), it is unknown whether undergraduate students are as attuned to classroom behaviors as actual teachers or even graduate students. Therefore, potential differences in cognitive set may attenuate generalization to a population of actual educators. This limitation to external validity was deemed necessary in order to simultaneously explore a number of different DBR scales; however, further research is needed to determine whether the current findings would generalize when conducted with actual teachers in applied classroom settings.

Third, the nature of the experimental design, in which participants were randomly assigned to scale conditions, served to limit the data analyses that could be conducted. Had the design been fully crossed (e.g., all participants rate all students using all scales), it would have been possible to examine the proportion of variance in ratings attributable specifically to rater differences. The benefits in terms of supplemental evidence, however, were outweighed by the costs in terms of participant resources and by the defined primary purpose of this study.

Implications for Researchers and Practitioners

Overall, results support one of the proposed features of DBR-SIS; namely, that the method may be flexibly constructed (Chafouleas, Riley-Tillman, & Sugai, 2007). This holds implications for school-based users of DBR-SIS, as it suggests that choices of (a) a continuous or categorical scale, (b) the number of scale gradients, and (c) the length of the DBR-SIS graphic line can be based on the population and target of measurement, without potential sacrifice to technical adequacy. For example, those individuals working with younger students may prefer to use a smaller number of scale gradients in order to assist with scale interpretability. In addition, scales may be flexibly selected to align with the intervention (and corresponding target behavior) of interest. When implementing an intervention designed to increase compliance, for example, raters may find it more natural to make categorical judgments (e.g., never, sometimes, always compliant) than to use a continuous scale (e.g., 0%, 50%, 100% compliant). The ability to flexibly fit scale development to the idiosyncrasies of individual cases may therefore enhance the extent to which DBR-SIS data inform intervention-related decisions. Scales may be adapted to provide information closely matched to referral concerns, resulting in more appropriate conclusions.
This is analogous to the manner in which a practitioner may choose the systematic direct observation coding system (e.g., event recording, interval sampling) that is best suited to the problem behavior dimension and referral question.

Despite the current findings, we suggest that in the absence of replication, it should not be assumed that the current results will automatically apply to any DBR-SIS format. For example, it would be inadvisable to assume that evidence of the concurrent validity of a 50-mm, 5-point continuous DBR-SIS also applies to a 100-mm, 10-point continuous DBR-SIS. Although results support such generalization within the current study, caution is recommended in application to past or future research on DBR-SIS.

Acknowledgments

Special thanks are extended to Teri LeBel and Christina Boice-Mallach for their assistance with the preparation of materials and data collection.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Preparation of this article was supported by a grant from the Institute of Education Sciences, U.S. Department of Education (R324B060014). Opinions expressed herein do not necessarily reflect the position of the U.S. Department of Education, and such endorsements should not be inferred.

References

Bendig, A. W. (1954). Reliability and the number of rating scale categories. Journal of Applied Psychology, 38, 38–40.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag.
Briesch, A. M., Chafouleas, S. M., & Riley-Tillman, T. C. (2010). Generalizability and dependability of behavior assessment methods to estimate academic engagement: A comparison of systematic direct observation and Direct Behavior Rating. School Psychology Review, 39, 408–421.
Chafouleas, S. M., Briesch, A. M., Riley-Tillman, T. C., Christ, T. J., Black, A., & Kilgus, S. P. (2010). An investigation of the generalizability and dependability of Direct Behavior Rating–Single Item Scales (DBR-SIS) to measure academic engagement and disruptive behavior of middle school students. Journal of School Psychology, 48, 219–246. doi:10.1016/j.jsp.2010.02.001
Chafouleas, S. M., Christ, T. J., & Riley-Tillman, T. C. (2009). Generalizability of scaling gradients on direct behavior ratings. Educational and Psychological Measurement, 69, 157–173. doi:10.1177/0013164408322005
Chafouleas, S. M., Christ, T., Riley-Tillman, T. C., Briesch, A. M., & Chanese, J. A. (2007). Generalizability and dependability of Daily Behavior Report Cards to measure social behavior of preschoolers. School Psychology Review, 36, 63–79.
Chafouleas, S. M., Riley-Tillman, T. C., & McDougal, J. L. (2002). Good, bad, or in-between: How does the daily behavior report card rate? Psychology in the Schools, 39, 157–169.
Chafouleas, S. M., Riley-Tillman, T. C., & Sassu, K. A. (2006). Acceptability and reported use of daily behavior report cards among teachers. Journal of Positive Behavior Interventions, 8, 174–182.
Chafouleas, S. M., Riley-Tillman, T. C., Sassu, K. A., LaFrance, M. J., & Patwa, S. S. (2007). Daily behavior report cards: An investigation of the consistency of on-task data across raters and methods. Journal of Positive Behavior Interventions, 9, 30–37.
Chafouleas, S. M., Riley-Tillman, T. C., & Sugai, G. (2007). School-based behavioral assessment: Informing intervention and instruction. New York, NY: Guilford.
Chafouleas, S. M., Volpe, R. J., Gresham, F. M., & Cook, C. R. (2010). School-based behavioral assessment within problem-solving models: Current status and future directions. School Psychology Review, 39, 343–349.
Christ, T. J., & Boice, C. H. (2009). Rating scale items: A brief review of nomenclature, components, and formatting to inform the development of Direct Behavior Rating (DBR). Assessment for Effective Intervention, 34, 242–250. doi:10.1177/1534508409336182
Christ, T. J., Riley-Tillman, T. C., & Chafouleas, S. M. (2009). Foundation for the development and use of Direct Behavior Rating (DBR) to assess and evaluate student behavior. Assessment for Effective Intervention, 34, 201–213. doi:10.1177/1534508409340390
Cicchetti, D. V., Showalter, D., & Tyrer, P. J. (1985). Scale categories on levels of interrater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31–36. doi:10.1177/014662168500900103
Matell, M. S., & Jacoby, J. (1972). Is there an optimal number of alternatives for Likert-scale items? Effects of testing time and scale properties. Journal of Applied Psychology, 56, 506–509. doi:10.1037/h0033601
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1–15. doi:10.1016/S0001-6918(99)00050-5
Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38, 513–532. doi:10.1007/BF02291492
Rasmussen, J. L. (1989). Analysis of Likert-scale data: A reinterpretation of Gregoire and Driver. Psychological Bulletin, 105, 167–170. doi:10.1037/0033-2909.105.1.167
Revill, S. I., Robinson, J. O., Rosen, M., & Hogg, M. I. (1976). The reliability of a linear analogue for evaluating pain. Anaesthesia, 31, 1191–1198.
Riley-Tillman, T. C., Chafouleas, S. M., Briesch, A. M., & Eckert, T. L. (2008). Daily behavior report cards and systematic direct observation: An investigation of the acceptability, reported training and use, and decision reliability among school psychologists. Journal of Behavioral Education, 17, 313–327. doi:10.1007/s10864-008-9070-5
Riley-Tillman, T. C., Christ, T. J., Chafouleas, S. M., Boice-Mallach, C. H., & Briesch, A. M. (2010). The impact of observation duration on the accuracy of data obtained from Direct Behavior Rating (DBR). Journal of Positive Behavior Interventions, 13, 119–128.
Shapiro, E. S. (2004). Academic skills problems workbook. New York, NY: Guilford.
Sterling-Turner, H. E., & Watson, T. S. (2002). An analog investigation of the relationship between treatment acceptability and treatment integrity. Journal of Behavioral Education, 11, 39–50. doi:10.1023/A:1014333305011
Sterling-Turner, H. E., Watson, T. S., Wildmon, M., Watkins, C., & Little, E. (2001). Investigating the relationship between training type and treatment integrity. School Psychology Quarterly, 16, 56–67. doi:10.1521/scpq.16.1.56.19157
Vannest, K., Davis, J., Davis, C., Mason, B. A., & Burke, M. D. (2010). Effective intervention for behavior with a Daily Behavior Report Card. School Psychology Review, 39, 654–672.
Weng, L. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64, 956–972. doi:10.1177/0013164404268674
