
Book Reviews

Assessing Educational Measurement: Ovations, Omissions, Opportunities

Educational Measurement (4th ed.). Robert L. Brennan (Ed.). Westport, CT: Praeger, 2006. 796 pp., $125.00 (cloth). ISBN 0-275-98125-8.

Reviewed by Gregory J. Cizek

Educational Researcher, Vol. 37, No. 2, pp. 96–100. DOI: 10.3102/0013189X08315727. © 2008 AERA. http://er.aera.net

Not all readers of Educational Researcher are likely to be familiar with the field of educational measurement. The field is sometimes referred to as psychometrics, but in plain language educational measurement is essentially testing—a specialization concerned with developing and evaluating the procedures used to make inferences about learning, achievement, aptitudes, interests, and other constructs in education.

Before jumping directly into a review of the most recent edition of Educational Measurement, edited by Robert L. Brennan (2006a), it seems appropriate to provide some background. Two words—quality control—describe the fundamental interest of measurement specialists. The passion of psychometricians is ensuring that the data that result from the use of tests are of the highest quality possible. Psychometricians seek to ensure that the information generated by testing instruments, observational protocols, and other such procedures provides consistent and accurate portrayals of the students and systems to which those tools are applied.

At only modest risk of overstatement, I would assert that the obsession with high-quality information should not be the exclusive province of psychometricians but is rightly seen as a primary interest of all social scientists (and a key weakness in graduate student preparation). Along these lines, Cone and Foster (1991) have argued as follows:


Scholars commonly acknowledge that developments in all areas of science follow appropriate measurement techniques. . . . One only has to think of the value of the microscope for biology and chemistry, the telescope for astronomy, and magnetic resonance imaging for contemporary medicine to support this point. In psychology, the use of the most sophisticated structural equation modeling, time series analysis, or meta-analytic methodology is only as strong as the data used, and these data depend on the quality of the measures used in their collection. . . . Graduate students learn complex, sophisticated statistical procedures to test data obtained in elegant, internally and externally valid experimental designs. But they are rarely exposed to the training needed to evaluate whether the data they obtain so cleverly and analyze so complexly are any good in the first place. (p. 653)

In short, the concern of educational measurement specialists is a consequential one. The fourth edition of Educational Measurement, the field’s definitive resource, seeks to cover this important terrain comprehensively.

The rest of this review is organized into three sections. The first defines some key terms and provides additional background. The second offers specific descriptive and evaluative comments. The third situates the current edition of Educational Measurement in historical context and provides suggestions for the next edition.

Key Terms and Background

In the preceding section I used two terms—construct and inference—that are essential to understanding the focus of Educational Measurement. Construct refers to the targets of measurement. In the social sciences a construct is a label used to describe a characteristic on which people vary. The characteristics measured by most tests are referred to as constructs because they are not directly observable but are “constructed.” For example, although a characteristic such as honesty does not exist in a physical sense, it is nonetheless one on which people are observed to vary. We use the label honest to describe people who behave more or less regularly in ways regarded as ethical and the label dishonest to describe those whose actions are regarded as unethical. The construct label is helpful in describing these regularities for purposes of identifying individual differences and communicating clearly about them. In educational research, nearly all areas of study concern constructs, for example, persistence, reading comprehension, readiness, teamwork, and persuasive writing ability, to name just a few.

The constructs of interest to social scientists must be studied indirectly by means of the instruments and scoring procedures developed to measure them. However, a gap always exists between the information yielded by an instrument and any conclusion (e.g., score or classification decision) about the underlying characteristic that the instrument purports to measure. The gap exists because the conclusion necessarily is based on a limited sample of information, observations, or responses. The conclusion, interpretation, or meaning that is drawn regarding the underlying characteristic is called an inference. The indirect measurement is necessarily a proxy, and inference is required whenever one wishes to use the observed measurement as an indication of standing on the unobservable characteristic. This reality was expressed neatly by Wright (1994), who, in the context of achievement testing, described the gap in this way:


I don’t want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don’t want, to what I want but can’t know. That’s called inference. 
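
The Rasch model that Wright championed gives that leap one concrete form; the following sketch and its notation are my own, offered for readers outside the field, not a formulation taken from the volume. The model expresses the probability of the observable event, a correct answer, in terms of two quantities that cannot be observed directly, examinee ability $\theta_s$ and item difficulty $b_i$:

\[
  P(X_{is} = 1 \mid \theta_s, b_i) = \frac{\exp(\theta_s - b_i)}{1 + \exp(\theta_s - b_i)}.
\]

Inference runs backward through such a model: from a limited sample of observed responses $X_{is}$, one estimates $\theta_s$, the examinee’s standing on the unobservable construct.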

In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.

A Look Under the Hood

The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.

Validity

The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as “the most fundamental consideration in developing and evaluating tests” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as “the foundation for virtually all of our measurement work” (Frisbie, 2005, p. 21).

Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick’s treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that “the most contentious topic in validity is the role of consequences” (p. 8); regarding the latter, Shepard (1993) notes that “Messick’s analysis does not help to identify which validity questions are essential to support a test use” (p. 427). The abstruseness of Messick’s prose has presented an initial barrier to the discussion of both problems, with one commentator opining that “questioning Messick’s theory of validity is akin to carving a Thanksgiving armadillo” (Markus, 1998, p. 7).

Many measurement specialists had high hopes that Kane’s (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane’s treatment of validity is far more succinct and accessible. However, Kane’s chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.

The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane’s (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane’s providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick’s work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made” (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.

Standard Setting

Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into labeled performance categories such as Pass/Fail or Basic, Proficient, and Advanced. The topic of standard setting apparently has arrived: an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.

Hambleton and Pitoniak’s (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).

Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut scores based on the standard error of measurement. However, they do not explain why the expected error in an examinee’s observed score is an appropriate basis for an adjustment; nor do they suggest how decision-making bodies should incorporate expected measurement error into explicit considerations of false-positive and false-negative decisions. In addition, some treatment of using observed variation in standard-setting participants’ cut-score recommendations as a basis for adjustments would be desirable.
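
To see what is at stake in such an adjustment, consider a brief illustration (the figures and the one-SEM rule here are my own, not the chapter’s). In classical test theory the standard error of measurement is

\[
  SEM = s_X \sqrt{1 - \rho_{XX'}},
\]

where $s_X$ is the standard deviation of observed scores and $\rho_{XX'}$ is the test’s reliability. For a test with $s_X = 10$ and reliability of .91, $SEM = 10\sqrt{.09} = 3$ score points. Lowering a recommended cut score of 60 by one SEM, to 57, reduces false negatives (truly proficient examinees who fail) at the price of more false positives; raising the cut to 63 does the reverse. The direction and size of any adjustment therefore presuppose a judgment about which classification error is more costly, and that is precisely the judgment left unexamined.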

Classroom Assessment

The topic of classroom assessment has been neglected in previous editions of Educational Measurement. Thus it is noteworthy that the latest edition contains a separate chapter on the subject, perhaps because of widening recognition of the potentially potent effects of high-quality classroom assessments on student learning (see, e.g., Black & Wiliam, 1998). At only 24 pages, however, the chapter by Shepard (2006) in the latest edition is far too brief. There are only three shorter chapters in the volume: one that provides an overview of group score assessments (e.g., the National Assessment of Educational Progress and the Trends in International Mathematics and Science Study), one on second-language testing, and the editor’s introduction. It is not clear how chapter lengths were decided, but evidence revealing that teachers make classroom decisions based on assessment information every 2 to 3 minutes (Shavelson & Stern, 1981) and the substantial research base on classroom assessment that has accumulated in the past 20 years suggest that the coverage of classroom assessment could have been greatly expanded.

The few pages devoted to classroom assessment might also have been apportioned differently to dive directly into the most important aspects of the topic. For example, precious space was spent recounting the missteps of earlier practice, recalling early IQ tests, such as Army Alpha, and so on. Although it is clear that such missteps occurred, published discoveries in the Journal of the American Medical Association or American Psychologist are not routinely introduced by archaeologies of earlier practice involving vital humors or homuncular man.

Moreover, in the course of referencing various tests, the chapter reinforces a false dichotomy. Rather than clearly distinguish between the legitimate and totally different purposes of large-scale and classroom assessments, the chapter perhaps unwittingly contributes to an either/or perspective that casts one purpose as bad and the other as good. Hmmm. . . . Let’s think about this. Should we choose large-scale standardized tests that are “formal” and “technical” and represent “single-moment-in-time” measures of “isolated” and “decontextualized” topics based on “outmoded” expectations? Or should we opt for tests that are “contemporary,” “embedded,” and “ongoing” assessments that offer “authentic” and “flexible” measurement of “deeper” understandings?

Although these problems detract from the chapter, there is much that compensates for them. For example, Shepard (2006) frequently and effectively highlights the essential connections between classroom assessment and cognitive psychology, and a portion of the chapter on learning progressions provides a clear example of what assessment in writing would look like if based on how writing skill develops. The chapter also contains information on the use of rubrics to aid students in understanding the criteria that characterize successful learning and on the kinds of self-assessment activities and performance feedback that are most effective in enhancing learning. Finally, at a time when much educational testing is increasingly under attack, Shepard unapologetically defends the simple but powerful and persistent finding that “students appear to study more and learn more if they expect to be tested” (p. 637).

Like the other two chapters reviewed here, the chapter on classroom assessment omits some topics that would have been desirable to include. For example, additional treatment of methods for conducting observations or checking on the quality of those observations and further discussion of how teachers synthesize sources of classroom information for decision making would be helpful. Although Shepard (2006) reviews some of the literature on grading, the chapter might have benefited from concrete examples of and rationales for grading models that can be defended for reporting student achievement, and those that are less defensible as well. Finally, although bias in large-scale testing has practically been eliminated as a result of focused attention to the problem in that context, the potential for bias in classroom assessments would seem to loom large. A compilation of research, guidelines, and methods relevant to minimizing bias in the classroom assessment context is sorely needed.

Crosscutting Comments on Historical Context

Brennan’s (2006a) edition of Educational Measurement follows three previous editions, edited by Linn (1989), Thorndike (1971), and Lindquist (1951). There is a great deal of outstanding scholarship in the fourth edition, and surely it was a mammoth undertaking to compile a volume representing the state of the art in a discipline so diverse. Brennan has succeeded in circumscribing the domain in a comprehensive manner and assembling individual chapters of exceptionally high quality.

Another reviewer (Wainer, 2007) of the fourth edition judged that little has been learned about key measurement topics since the publication of the third edition, asking, “How much new has happened in reliability since 1989?” (p. 485); that reviewer generally advised against purchasing this edition. I disagree. Although it is true that the chapters on reliability and item response theory cover much of the same ground as those chapters did in the third edition, evaluating the latest volume on that basis is a judgment made on an unrepresentative sample. The other 80% of the fourth edition documents substantial advances in research and new developments on topics such as cognitive psychology (Mislevy, 2006); technology in testing (Drasgow, Luecht, & Bennett, 2006); accountability (Koretz & Hamilton, 2006); scoring, reporting, and test security (Cohen & Wollack, 2006); performance assessment (Lane & Stone, 2006); and others.

My own evaluation is that “adequate yearly progress”—to invoke a popular phrase these days—has been made in the field since the publication of the third edition. The fourth edition of Educational Measurement is an essential update for measurement specialists and for social science researchers in general.

The field of measurement is dynamic. Indeed, it is changing so rapidly that, although the fourth edition is still quite new, it may not be too early to begin planning for the fifth. The task of ensuring the rigor, accuracy, and readability of discrete chapters is challenging, but I would urge that an additional perspective be considered for the next edition. Adequate progress seems to be only a modest goal; more radical aims should be contemplated.

For one thing, it seems to me that previous editions of Educational Measurement have uniformly and implicitly defined the universe of testing as consisting nearly exclusively of large-scale, standardized assessments. It is a curious contrast: Although so much educational testing and assessment occur at the level of the individual student and teacher or at the classroom level, the content of each edition of Educational Measurement is terribly tilted toward the technologies of testing programs such as the SAT, ACT, and GRE. It is as if the Federal Aviation Administration were to consider aviation safety with exclusive reference to commercial airlines, ignoring the much greater volume of private aircraft flights each day. Clearly, the evolving technologies of computer adaptive testing, item response theory, generalizability theory, and differential item functioning warrant documentation and dissemination; and it is true that the results of large-scale tests are often consequential. However, it is equally true that these developments pertain to a narrow slice of educational assessment and that classroom testing and grading are consequential in their own right. The inclusion of a chapter on classroom assessment in the fourth edition is commendable and definitely a step in the right direction, but this initiative must be broadened.

Accordingly, it seems appropriate to recommend that educational measurement be (re)considered more broadly, that balkanization of topics be avoided, and that cross-level perspectives be integrated and crosscutting questions be addressed, to the extent possible, in each chapter. For example, how should teachers think about setting standards on classroom tests? What are appropriate ways to consider the reliability of alternate assessments and other tests administered to sometimes very small samples? How might coherence between classroom assessments and state-level content standards be promoted? Are there any differences in appropriate testing accommodations for classroom and large-scale tests? What sources of validity evidence are appropriate for tests at different levels, with differing purposes, or with differing consequences?

Although the recommendation for greater integration might seem unrealistic, the current edition of Educational Measurement actually contains a remarkably comprehensive example of the kind of integrated treatment that could serve as a model for chapters in the next edition. The chapter by Lane and Stone (2006) on performance assessment deftly weaves together treatments of reliability, cognitive psychology, scoring, measurement models, classroom assessment concerns, computer-aided testing, validity, test design, fairness, and other concerns in a way that fully covers the identified topic of the chapter but does not duplicate the essential content of other chapters in the volume.

Finally, to inform thinking about the next edition, it may be illuminating to look backward. A historical note appears in the preface to the first edition of Educational Measurement. It refers the reader to a preceding volume, The Construction and Use of Achievement Examinations (Hawkes, Lindquist, & Mann, 1936), which was produced by the same publisher as the subsequent editions and could fairly claim to be the real first edition in the series. That earliest volume contained a chapter by McConn (1936), whose observations would easily be at home in the latest edition:

When one begins to meditate upon [achievement tests], one can hardly fail to be astonished by their multiplicity. . . . We are impelled to ask, why do we give such countless tests? Probably many persons will answer immediately that the obvious and legitimate purpose of practically all this achievement testing is the maintenance of standards; which seems to mean either one or both of two things: the imposition and enforcement of a prescribed curriculum; or the enforcement of some minimum degree of attainment. (p. 446)

McConn (1936) also asks some of the policy questions that are being asked today and identifies a gap in the measurement literature:

What do we accomplish by all this testing . . . anyway? Is it worth all the effort and money it costs? Do we perchance do harm instead of good, or harm as well as good with our examinations, and especially through the uses we make of their results? In short, it seems that we need not only techniques, but also some philosophy . . . dealing with the right uses of such instruments and their wrong uses or abuses. (p. 443)

In conclusion, the fourth edition of Educational Measurement clearly succeeds in capturing the state of the art in the field. However, although this new edition documents substantial advances in the technology of testing, McConn’s observations highlight the presence of lingering challenges to be addressed—challenges related to the social, political, and educational contexts in which the science of psychometrics has long been situated. By tradition, the first two chapters of each edition of Educational Measurement are devoted to the essential topics of validity and reliability, respectively. Two additional chapters would be a welcome enhancement to the next edition. The first would be an initial chapter, preceding those on validity and reliability, that would begin to articulate a philosophy—or perhaps multiple philosophies—of educational testing and provide a context for relating those foundational ideas to the technological advances chronicled in each edition. The second would describe various models for how the enterprise of educational measurement can be integrated across levels of a planned educational assessment system; such a chapter would explicitly probe possible structures for effectively melding classroom assessment, large-scale testing in elementary and secondary schools, and postsecondary assessments. The challenge ahead lies in enhancing the utility of each component in the system for consumers of the results while retaining the fidelity of each component to its intended measurement objective.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–74.

Brennan, R. L. (Ed.). (2006a). Educational measurement (4th ed.). Westport, CT: Praeger.

Brennan, R. L. (2006b). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: Praeger.

Cizek, G. J. (2005). Adapting testing technology to serve accountability aims: The case of vertically moderated standard setting. Applied Measurement in Education, 18(1), 1–10.

Cohen, A. S., & Wollack, J. A. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355–386). Westport, CT: Praeger.


Cone, J. D., & Foster, S. L. (1991). Training in measurement: Always the bridesmaid. American Psychologist, 46, 653–654.

Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–516). Westport, CT: Praeger.

Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24(3), 21–28.

Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: Praeger.

Hawkes, H. E., Lindquist, E. F., & Mann, C. R. (1936). The construction and use of achievement examinations. Boston: Houghton Mifflin.

Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485–514). Washington, DC: American Council on Education.

Kane, M. T. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Praeger.

Koretz, D. M., & Hamilton, L. S. (2006). Testing for accountability in K–12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: Praeger.

Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: Praeger.

Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.

Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). Washington, DC: American Council on Education.

Markus, K. A. (1998). Science, measurement, and validity: Is completion of Samuel Messick’s synthesis possible? Social Indicators Research, 45(1), 7–34.

McConn, M. (1936). The uses and abuses of examinations. In H. E. Hawkes, E. F. Lindquist, & C. R. Mann (Eds.), The construction and use of achievement examinations (pp. 443–478). Boston: Houghton Mifflin.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–306). Westport, CT: Praeger.

Shavelson, R. J., & Stern, P. (1981). Research on teachers’ pedagogical thoughts, judgments, decisions, and behavior. Review of Educational Research, 51, 455–498.

Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.

Shepard, L. A. (2006). Classroom assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 624–646). Westport, CT: Praeger.

Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Wainer, H. (2007). A psychometric cicada: Educational Measurement returns. Educational Researcher, 36, 485–486.

Wright, B. D. (1994). Introduction to the Rasch model [Videocassette]. Available from College of Education, University of Denver.

AUTHOR

GREGORY J. CIZEK is a professor of educational measurement and evaluation at the University of North Carolina, Chapel Hill, School of Education, CB 3500, Chapel Hill, NC 27599-3500; cizek@unc.edu. His research focuses on standard setting, test security, and validity.

Manuscript received January 4, 2008
Revision received January 8, 2008
Accepted January 8, 2008
