Book Reviews Assessing Educational Measurement: Ovations ...
I don’t want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don’t want, to what I want but can’t know. That’s called inference.
In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.
A Look Under the Hood<br />
The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.
Validity
The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as “the most fundamental consideration in developing and evaluating tests” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as “the foundation for virtually all of our measurement work” (Frisbie, 2005, p. 21).
Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick’s treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that “the most contentious topic in validity is the role of consequences” (p. 8); regarding the latter, Shepard (1993) notes that “Messick’s analysis does not help to identify which validity questions are essential to support a test use” (p. 427). The abstruseness of Messick’s prose has presented an initial barrier to the discussion of both problems, with one commentator opining that “questioning Messick’s theory of validity is akin to carving a Thanksgiving armadillo” (Markus, 1998, p. 7).
Many measurement specialists had high hopes that Kane’s (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane’s treatment of validity is far more succinct and accessible than Messick’s. However, Kane’s chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.
The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane’s (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane’s providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick’s work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made” (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.
Standard Setting
Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into groups such as Pass/Fail, Basic, Proficient, or Advanced, and other labeled performance categories. The topic of standard setting apparently has arrived because an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.
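The classification role of cut scores described above can be sketched in a few lines of code. This is only an illustration, not any procedure from the chapter: the numeric cut scores and category labels below are hypothetical, whereas in practice a standard-setting panel would derive them through one of the procedures Hambleton and Pitoniak catalogue.

```python
import bisect

# Hypothetical cut scores partitioning a 0-100 score scale into four
# labeled performance categories. Real cut scores would be recommended
# by a standard-setting panel, not chosen arbitrarily as here.
CUT_SCORES = [40, 60, 80]
LABELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(score: float) -> str:
    """Map a test score to its performance category.

    A score at or above a cut score falls into the higher category,
    so bisect_right finds the index of the matching label.
    """
    return LABELS[bisect.bisect_right(CUT_SCORES, score)]
```

For example, under these hypothetical cuts, `classify(72)` falls between the cuts at 60 and 80 and so returns `"Proficient"`; the substantive work of standard setting lies entirely in justifying where those boundaries belong.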
Hambleton and Pitoniak’s (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut
MARCH 2008