Assessing Educational Measurement: Ovations, Omissions, Opportunities

Educational Measurement (4th ed.). Robert L. Brennan (Ed.). Westport, CT: Praeger, 2006. 796 pp., $125.00 (cloth). ISBN 0-275-98125-8.

Reviewed by Gregory J. Cizek
Not all readers of Educational Researcher are likely to be familiar with the field of educational measurement. The field is sometimes referred to as psychometrics, but in plain language educational measurement is essentially testing—a specialization concerned with developing and evaluating the procedures used to make inferences about learning, achievement, aptitudes, interests, and other constructs in education.
Before jumping directly into a review of the most recent edition of Educational Measurement, edited by Robert L. Brennan (2006a), it seems appropriate to provide some background. Two words—quality control—describe the fundamental interest of measurement specialists. The passion of psychometricians is ensuring that the data that result from the use of tests are of the highest quality possible. Psychometricians seek to ensure that the information generated by testing instruments, observational protocols, and other such procedures provides consistent and accurate portrayals of the students and systems to which those tools are applied.

At only modest risk of overstatement, I would assert that the obsession with high-quality information should not be the exclusive province of psychometricians but is rightly seen as a primary interest of all social scientists (and a key weakness in graduate student preparation). Along these lines, Cone and Foster (1991) have argued as follows:
[Educational Researcher, Vol. 37, No. 2 (March 2008), pp. 96–100. DOI: 10.3102/0013189X08315727. © 2008 AERA. http://er.aera.net]
Scholars commonly acknowledge that developments in all areas of science follow appropriate measurement techniques. . . . One only has to think of the value of the microscope for biology and chemistry, the telescope for astronomy, and magnetic resonance imaging for contemporary medicine to support this point. In psychology, the use of the most sophisticated structural equation modeling, time series analysis, or meta-analytic methodology is only as strong as the data used, and these data depend on the quality of the measures used in their collection. . . . Graduate students learn complex, sophisticated statistical procedures to test data obtained in elegant, internally and externally valid experimental designs. But they are rarely exposed to the training needed to evaluate whether the data they obtain so cleverly and analyze so complexly are any good in the first place. (p. 653)
In short, the concern of educational measurement specialists is a consequential one. The fourth edition of Educational Measurement, the field's definitive resource, seeks to cover this important terrain comprehensively.

The rest of this review is organized into three sections. The first defines some key terms and provides additional background. The second offers specific descriptive and evaluative comments. The third situates the current edition of Educational Measurement in historical context and provides suggestions for the next edition.
Key Terms and Background

In the preceding section I used two terms—construct and inference—that are essential to understanding the focus of Educational Measurement. Construct refers to the targets of measurement. In the social sciences a construct is a label used to describe a characteristic on which people vary. The characteristics measured by most tests are referred to as constructs because they are not directly observable but are "constructed." For example, although a characteristic such as honesty does not exist in a physical sense, it is nonetheless one on which people are observed to vary. We use the label honest to describe people who behave more or less regularly in ways regarded as ethical and the label dishonest to describe those whose actions are regarded as unethical. The construct label is helpful in describing these regularities for purposes of identifying individual differences and communicating clearly about them. In educational research, nearly all areas of study concern constructs, for example, persistence, reading comprehension, readiness, teamwork, and persuasive writing ability, to name just a few.
The constructs of interest to social scientists must be studied indirectly by means of the instruments and scoring procedures developed to measure them. However, a gap always exists between the information yielded by an instrument and any conclusion (e.g., score or classification decision) about the underlying characteristic that the instrument purports to measure. The gap exists because the conclusion necessarily is based on a limited sample of information, observations, or responses. The conclusion, interpretation, or meaning that is drawn regarding the underlying characteristic is called an inference. The indirect measurement is necessarily a proxy, and inference is required whenever one wishes to use the observed measurement as an indication of standing on the unobservable characteristic. This reality was expressed neatly by Wright (1994), who, in the context of achievement testing, described the gap in this way:

I don't want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don't want, to what I want but can't know. That's called inference.
In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.
A Look Under the Hood

The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.
Validity

The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as "the most fundamental consideration in developing and evaluating tests" (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as "the foundation for virtually all of our measurement work" (Frisbie, 2005, p. 21).
Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick's treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that "the most contentious topic in validity is the role of consequences" (p. 8); regarding the latter, Shepard (1993) notes that "Messick's analysis does not help to identify which validity questions are essential to support a test use" (p. 427). The abstruseness of Messick's prose has presented an initial barrier to the discussion of both problems, with one commentator opining that "questioning Messick's theory of validity is akin to carving a Thanksgiving armadillo" (Markus, 1998, p. 7).
Many measurement specialists had high hopes that Kane's (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane's treatment of validity is far more succinct and accessible. However, Kane's chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.

The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane's (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane's providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick's work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that "agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made" (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.
Standard Setting

Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into labeled performance categories such as Pass/Fail or Basic, Proficient, and Advanced. The topic of standard setting apparently has arrived because an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.
Hambleton and Pitoniak's (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut scores based on the standard error of measurement. However, they do not explain why the expected error in an examinee's observed score is an appropriate basis for an adjustment; nor do they suggest how decision-making bodies should incorporate expected measurement error into explicit considerations of false-positive and false-negative decisions. In addition, some treatment of using observed variation in standard-setting participants' cut-score recommendations as a basis for adjustments would be desirable.
Classroom Assessment

The topic of classroom assessment has been neglected in previous editions of Educational Measurement. Thus it is noteworthy that the latest edition contains a separate chapter on the subject, perhaps because of widening recognition of the potentially potent effects of high-quality classroom assessments on student learning (see, e.g., Black & Wiliam, 1998). At only 24 pages, however, the chapter by Shepard (2006) in the latest edition is far too brief. There are only three shorter chapters in the volume: one that provides an overview of group score assessments (e.g., the National Assessment of Educational Progress and the Trends in International Mathematics and Science Study), one on second-language testing, and the editor's introduction. It is not clear how chapter lengths were decided, but evidence revealing that teachers make classroom decisions based on assessment information every 2 to 3 minutes (Shavelson & Stern, 1981) and the substantial research base on classroom assessment that has accumulated in the past 20 years suggest that the coverage of classroom assessment could have been greatly expanded.
The few pages devoted to classroom assessment might also have been apportioned differently to dive directly into the most important aspects of the topic. For example, precious space was spent recounting the missteps of earlier practice, recalling early IQ tests, such as Army Alpha, and so on. Although it is clear that such missteps occurred, published discoveries in the Journal of the American Medical Association or American Psychologist are not routinely introduced by archaeologies of earlier practice involving vital humors or homuncular man. Moreover, in the course of referencing various tests, the chapter reinforces a false dichotomy. Rather than clearly distinguish between the legitimate and totally different purposes of large-scale and classroom assessments, the chapter perhaps unwittingly contributes to an either/or perspective that casts one purpose as bad and the other as good. Hmmm. . . . Let's think about this. Should we choose large-scale standardized tests that are "formal" and "technical" and represent "single-moment-in-time" measures of "isolated" and "decontextualized" topics based on "outmoded" expectations? Or should we opt for tests that are "contemporary," "embedded," and "ongoing" assessments that offer "authentic" and "flexible" measurement of "deeper" understandings?
Although these problems detract from the chapter, there is much that compensates for them. For example, Shepard (2006) frequently and effectively highlights the essential connections between classroom assessment and cognitive psychology, and a portion of the chapter on learning progressions provides a clear example of what assessment in writing would look like if based on how writing skill develops. The chapter also contains information on the use of rubrics to aid students in understanding the criteria that characterize successful learning and on the kinds of self-assessment activities and performance feedback that are most effective in enhancing learning. Finally, at a time when much educational testing is increasingly under attack, Shepard unapologetically defends the simple but powerful and persistent finding that "students appear to study more and learn more if they expect to be tested" (p. 637).
Like the other two chapters reviewed here, the chapter on classroom assessment omits some topics that would have been desirable to include. For example, additional treatment of methods for conducting observations or checking on the quality of those observations and further discussion of how teachers synthesize sources of classroom information for decision making would be helpful. Although Shepard (2006) reviews some of the literature on grading, the chapter might have benefited from concrete examples of and rationales for grading models that can be defended for reporting student achievement, and those that are less defensible as well. Finally, although bias in large-scale testing has practically been eliminated as a result of focused attention to the problem in that context, the potential for bias in classroom assessments would seem to loom large. A compilation of research, guidelines, and methods relevant to minimizing bias in the classroom assessment context is sorely needed.
Crosscutting Comments on Historical Context

Brennan's (2006a) edition of Educational Measurement follows three previous editions, edited by Linn (1989), Thorndike (1971), and Lindquist (1951). There is a great deal of outstanding scholarship in the fourth edition, and surely it was a mammoth undertaking to compile a volume representing the state of the art in a discipline so diverse. Brennan has succeeded in circumscribing the domain in a comprehensive manner and assembling individual chapters of exceptionally high quality.
Another reviewer (Wainer, 2007) of the fourth edition judged that little has been learned about key measurement topics since the publication of the third edition, asking, "How much new has happened in reliability since 1989?" (p. 485); that reviewer generally advised against purchasing this edition. I disagree. Although it is true that the chapters on reliability and item response theory cover much of the same ground as those chapters did in the third edition, evaluating the latest volume on that basis is a judgment made on an unrepresentative sample. The other 80% of the fourth edition documents substantial advances in research and new developments on topics such as cognitive psychology (Mislevy, 2006); technology in testing (Drasgow, Luecht, & Bennett, 2006); accountability (Koretz & Hamilton, 2006); scoring, reporting, and test security (Cohen & Wollack, 2006); performance assessment (Lane & Stone, 2006); and others.
My own evaluation is that "adequate yearly progress"—to invoke a popular phrase these days—has been made in the field since the publication of the third edition. The fourth edition of Educational Measurement is an essential update for measurement specialists and for social science researchers in general.

The field of measurement is dynamic. Indeed, it is changing so rapidly that, although the fourth edition is still quite new, it may not be too early to begin planning for the fifth. The task of ensuring the rigor, accuracy, and readability of discrete chapters is challenging, but I would urge that an additional perspective be considered for the next edition. Adequate progress seems to be only a modest goal; more radical aims should be contemplated.
For one thing, it seems to me that previous editions of Educational Measurement have uniformly and implicitly defined the universe of testing as consisting nearly exclusively of large-scale, standardized assessments. It is a curious contrast: Although so much educational testing and assessment occur at the level of the individual student and teacher or at the classroom level, the content of each edition of Educational Measurement is terribly tilted toward the technologies of testing programs such as the SAT, ACT, and GRE. It is as if the Federal Aviation Administration were to consider aviation safety with exclusive reference to commercial airlines, ignoring the much greater volume of private aircraft flights each day. Clearly, the evolving technologies of computer adaptive testing, item response theory, generalizability theory, and differential item functioning warrant documentation and dissemination; and it is true that the results of large-scale tests are often consequential. However, it is equally true that these developments pertain to a narrow slice of educational assessment and that classroom testing and grading are consequential in their own right. The inclusion of a chapter on classroom assessment in the fourth edition is commendable and definitely a step in the right direction, but this initiative must be broadened.
Accordingly, it seems appropriate to recommend that educational measurement be (re)considered more broadly, that balkanization of topics be avoided, and that cross-level perspectives be integrated and crosscutting questions be addressed, to the extent possible, in each chapter. For example, how should teachers think about setting standards on classroom tests? What are appropriate ways to consider the reliability of alternate assessments and other tests administered to sometimes very small samples? How might coherence between classroom assessments and state-level content standards be promoted? Are there any differences in appropriate testing accommodations for classroom and large-scale tests? What sources of validity evidence are appropriate for tests at different levels, with differing purposes, or with differing consequences?
Although the recommendation for greater integration might seem unrealistic, the current edition of Educational Measurement actually contains a remarkably comprehensive example of the kind of integrated treatment that could serve as a model for chapters in the next edition. The chapter by Lane and Stone (2006) on performance assessment deftly weaves together treatments of reliability, cognitive psychology, scoring, measurement models, classroom assessment concerns, computer-aided testing, validity, test design, fairness, and other concerns in a way that fully covers the identified topic of the chapter but does not duplicate the essential content of other chapters in the volume.
Finally, to inform thinking about the next edition, it may be illuminating to look backward. A historical note appears in the preface to the first edition of Educational Measurement. It refers the reader to a preceding volume, The Construction and Use of Achievement Examinations (Hawkes, Lindquist, & Mann, 1936), which was produced by the same publisher as the subsequent editions and could fairly claim to be the real first edition in the series. That earliest volume contained a chapter by McConn (1936), whose observations would easily be at home in the latest edition:

When one begins to meditate upon [achievement tests], one can hardly fail to be astonished by their multiplicity. . . . We are impelled to ask, why do we give such countless tests? Probably many persons will answer immediately that the obvious and legitimate purpose of practically all this achievement testing is the maintenance of standards; which seems to mean either one or both of two things: the imposition and enforcement of a prescribed curriculum; or the enforcement of some minimum degree of attainment. (p. 446)
McConn (1936) also asks some of the policy questions that are being asked today and identifies a gap in the measurement literature:

What do we accomplish by all this testing . . . anyway? Is it worth all the effort and money it costs? Do we perchance do harm instead of good, or harm as well as good with our examinations, and especially through the uses we make of their results? In short, it seems that we need not only techniques, but also some philosophy . . . dealing with the right uses of such instruments and their wrong uses or abuses. (p. 443)
In conclusion, the fourth edition of Educational Measurement clearly succeeds in capturing the state of the art in the field. However, although this new edition documents substantial advances in the technology of testing, McConn's observations highlight the presence of lingering challenges to be addressed—challenges related to the social, political, and educational contexts in which the science of psychometrics has long been situated. By tradition, the first two chapters of each edition of Educational Measurement are devoted to the essential topics of validity and reliability, respectively. Two additional chapters would be a welcome enhancement to the next edition. The first would be an initial chapter, preceding those on validity and reliability, that would begin to articulate a philosophy—or perhaps multiple philosophies—of educational testing and provide a context for relating those foundational ideas to the technological advances chronicled in each edition. The second would describe various models for how the enterprise of educational measurement can be integrated across levels of a planned educational assessment system; such a chapter would explicitly probe possible structures for effectively melding classroom assessment, large-scale testing in elementary and secondary schools, and postsecondary assessments. The challenge ahead lies in enhancing the utility of each component in the system for consumers of the results while retaining the fidelity of each component to its intended measurement objective.
REFERENCES<br />
American <strong>Educational</strong> Research Association,<br />
American Psychological Association, &<br />
National Council on <strong>Measurement</strong> in Education.<br />
(1999). Standards for educational and<br />
psychological testing. Washington, DC: American<br />
<strong>Educational</strong> Research Association.<br />
Black, P., & Wiliam, D. (1998). Assessment and<br />
classroom learning. Assessment in Education,<br />
5(1), 7–74.<br />
Brennan, R. L. (Ed.). (2006a). <strong>Educational</strong> measurement<br />
(4th ed.). Westport, CT: Praeger.<br />
Brennan, R. L. (2006b). Perspectives on the<br />
evolution and future of educational measurement.<br />
In R. L. Brennan (Ed.), <strong>Educational</strong><br />
measurement (4th ed., pp. 1–16). Westport,<br />
CT: Praeger.<br />
Cizek, G. J. (2005). Adapting testing technology<br />
to serve accountability aims: The case of<br />
vertically moderated standard setting. Applied<br />
<strong>Measurement</strong> in Education, 18(1), 1–10.<br />
Cohen, A. S., & Wollack, J. A (2006). Test<br />
administration, security, scoring, and reporting.<br />
In R. L. Brennan (Ed.), <strong>Educational</strong> measurement<br />
(4th ed., pp. 355–386). Westport,<br />
CT: Praeger.<br />
MARCH 2008<br />
99
Cone, J. D., & Foster, S. L. (1991). Training in measurement: Always the bridesmaid. American Psychologist, 46, 653–654.
Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–516). Westport, CT: Praeger.
Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24(3), 21–28.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: Praeger.
Hawkes, H. E., Lindquist, E. F., & Mann, C. R. (Eds.). (1936). The construction and use of achievement examinations. Boston: Houghton Mifflin.
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485–514). Washington, DC: American Council on Education.
Kane, M. T. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Praeger.
Koretz, D. M., & Hamilton, L. S. (2006). Testing for accountability in K–12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: Praeger.
EDUCATIONAL RESEARCHER
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: Praeger.
Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). Washington, DC: American Council on Education.
Markus, K. A. (1998). Science, measurement, and validity: Is completion of Samuel Messick’s synthesis possible? Social Indicators Research, 45(1), 7–34.
McConn, M. (1936). The uses and abuses of examinations. In H. E. Hawkes, E. F. Lindquist, & C. R. Mann (Eds.), The construction and use of achievement examinations (pp. 443–478). Boston: Houghton Mifflin.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–306). Westport, CT: Praeger.
Shavelson, R. J., & Stern, P. (1981). Research on teachers’ pedagogical thoughts, judgments, decisions, and behavior. Review of Educational Research, 51, 455–498.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Shepard, L. A. (2006). Classroom assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 624–646). Westport, CT: Praeger.
Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Wainer, H. (2007). A psychometric cicada: Educational Measurement returns. Educational Researcher, 36, 485–486.
Wright, B. D. (1994). Introduction to the Rasch model [Videocassette]. Available from College of Education, University of Denver.
AUTHOR
GREGORY J. CIZEK is a professor of educational measurement and evaluation at the University of North Carolina, Chapel Hill, School of Education, CB 3500, Chapel Hill, NC 27599–3500; cizek@unc.edu. His research focuses on standard setting, test security, and validity.
Manuscript received January 4, 2008
Revision received January 8, 2008
Accepted January 8, 2008