Book Reviews Assessing Educational Measurement: Ovations ...

I don’t want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don’t want, to what I want but can’t know. That’s called inference.

In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.

A Look Under the Hood<br />

The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.

Validity<br />

The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as “the most fundamental consideration in developing and evaluating tests” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as “the foundation for virtually all of our measurement work” (Frisbie, 2005, p. 21).

Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick’s treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that “the most contentious topic in validity is the role of consequences” (p. 8); regarding the latter, Shepard (1993) notes that “Messick’s analysis does not help to identify which validity questions are essential to support a test use” (p. 427). The abstruseness of Messick’s prose has presented an initial barrier to the discussion of both problems, with one commentator opining that “questioning Messick’s theory of validity is akin to carving a Thanksgiving armadillo” (Markus, 1998, p. 7).

Many measurement specialists had high hopes that Kane’s (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane’s treatment of validity is far more succinct and accessible than Messick’s. However, Kane’s chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.

The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane’s (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane’s providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick’s work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made” (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited, and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.

Standard Setting<br />

Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into groups such as Pass/Fail, Basic, Proficient, or Advanced, and other labeled performance categories. The topic of standard setting apparently has arrived, because an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.
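
To make the definition concrete, here is a minimal sketch (not drawn from the book) of how a set of cut scores partitions a score scale into labeled performance categories. The thresholds, labels, and the classify function are hypothetical, invented purely for illustration; actual cut scores are the product of a formal standard-setting study.

```python
from bisect import bisect_right

# Hypothetical cut scores on a 0-100 scale, for illustration only.
# In practice these values would come from a standard-setting study.
CUT_SCORES = [40, 60, 80]
LABELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(score: float) -> str:
    """Map a test score to its performance category.

    A score equal to a cut score falls into the higher category;
    each cut score marks the bottom of the category above it.
    """
    return LABELS[bisect_right(CUT_SCORES, score)]

print(classify(59))   # Basic
print(classify(60))   # Proficient
print(classify(95))   # Advanced
```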

Hambleton and Pitoniak’s (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).

Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut

