Assessing Educational Measurement: Ovations, Omissions, Opportunities

Educational Measurement (4th ed.). Robert L. Brennan (Ed.). Westport, CT: Praeger, 2006. 796 pp., $125.00 (cloth). ISBN 0-275-98125-8.

Reviewed by Gregory J. Cizek
Not all readers of Educational Researcher are likely to be familiar with the field of educational measurement. The field is sometimes referred to as psychometrics, but in plain language educational measurement is essentially testing—a specialization concerned with developing and evaluating the procedures used to make inferences about learning, achievement, aptitudes, interests, and other constructs in education.
Before jumping directly into a review of the most recent edition of Educational Measurement, edited by Robert L. Brennan (2006a), it seems appropriate to provide some background. Two words—quality control—describe the fundamental interest of measurement specialists. The passion of psychometricians is ensuring that the data that result from the use of tests are of the highest quality possible. Psychometricians seek to ensure that the information generated by testing instruments, observational protocols, and other such procedures provides consistent and accurate portrayals of the students and systems to which those tools are applied.

At only modest risk of overstatement, I would assert that the obsession with high-quality information should not be the exclusive province of psychometricians but is rightly seen as a primary interest of all social scientists (and a key weakness in graduate student preparation). Along these lines, Cone and Foster (1991) have argued as follows:
[Educational Researcher, Vol. 37, No. 2 (March 2008), pp. 96–100. DOI: 10.3102/0013189X08315727. © 2008 AERA. http://er.aera.net]
Scholars commonly acknowledge that developments in all areas of science follow appropriate measurement techniques. . . . One only has to think of the value of the microscope for biology and chemistry, the telescope for astronomy, and magnetic resonance imaging for contemporary medicine to support this point. In psychology, the use of the most sophisticated structural equation modeling, time series analysis, or meta-analytic methodology is only as strong as the data used, and these data depend on the quality of the measures used in their collection. . . . Graduate students learn complex, sophisticated statistical procedures to test data obtained in elegant, internally and externally valid experimental designs. But they are rarely exposed to the training needed to evaluate whether the data they obtain so cleverly and analyze so complexly are any good in the first place. (p. 653)
In short, the concern of educational measurement specialists is a consequential one. The fourth edition of Educational Measurement, the field's definitive resource, seeks to cover this important terrain comprehensively.

The rest of this review is organized into three sections. The first defines some key terms and provides additional background. The second offers specific descriptive and evaluative comments. The third situates the current edition of Educational Measurement in historical context and provides suggestions for the next edition.
Key Terms and Background

In the preceding section I used two terms—construct and inference—that are essential to understanding the focus of Educational Measurement. Construct refers to the targets of measurement. In the social sciences a construct is a label used to describe a characteristic on which people vary. The characteristics measured by most tests are referred to as constructs because they are not directly observable but are "constructed." For example, although a characteristic such as honesty does not exist in a physical sense, it is nonetheless one on which people are observed to vary. We use the label honest to describe people who behave more or less regularly in ways regarded as ethical and the label dishonest to describe those whose actions are regarded as unethical. The construct label is helpful in describing these regularities for purposes of identifying individual differences and communicating clearly about them. In educational research, nearly all areas of study concern constructs, for example, persistence, reading comprehension, readiness, teamwork, and persuasive writing ability, to name just a few.
The constructs of interest to social scientists must be studied indirectly by means of the instruments and scoring procedures developed to measure them. However, a gap always exists between the information yielded by an instrument and any conclusion (e.g., score or classification decision) about the underlying characteristic that the instrument purports to measure. The gap exists because the conclusion necessarily is based on a limited sample of information, observations, or responses. The conclusion, interpretation, or meaning that is drawn regarding the underlying characteristic is called an inference. The indirect measurement is necessarily a proxy, and inference is required whenever one wishes to use the observed measurement as an indication of standing on the unobservable characteristic. This reality was expressed neatly by Wright (1994), who, in the context of achievement testing, described the gap in this way:

I don't want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don't want, to what I want but can't know. That's called inference.
In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.
A Look Under the Hood

The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.
Validity

The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as "the most fundamental consideration in developing and evaluating tests" (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as "the foundation for virtually all of our measurement work" (Frisbie, 2005, p. 21).
Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick's treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that "the most contentious topic in validity is the role of consequences" (p. 8); regarding the latter, Shepard (1993) notes that "Messick's analysis does not help to identify which validity questions are essential to support a test use" (p. 427). The abstruseness of Messick's prose has presented an initial barrier to the discussion of both problems, with one commentator opining that "questioning Messick's theory of validity is akin to carving a Thanksgiving armadillo" (Markus, 1998, p. 7).
Many measurement specialists had high hopes that Kane's (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane's treatment of validity is far more succinct and accessible. However, Kane's chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.

The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane's (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane's providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick's work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that "agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made" (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.
Standard Setting

Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into labeled performance categories such as Pass/Fail or Basic, Proficient, and Advanced. The topic of standard setting apparently has arrived because an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.
Hambleton and Pitoniak's (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut scores based on the standard error of measurement. However, they do not explain why the expected error in an examinee's observed score is an appropriate basis for an adjustment; nor do they suggest how decision-making bodies should incorporate expected measurement error into explicit considerations of false-positive and false-negative decisions. In addition, some treatment of using observed variation in standard-setting participants' cut-score recommendations as a basis for adjustments would be desirable.
Classroom Assessment

The topic of classroom assessment has been neglected in previous editions of Educational Measurement. Thus it is noteworthy that the latest edition contains a separate chapter on the subject, perhaps because of widening recognition of the potentially potent effects of high-quality classroom assessments on student learning (see, e.g., Black & Wiliam, 1998). At only 24 pages, however, the chapter by Shepard (2006) in the latest edition is far too brief. There are only three shorter chapters in the volume: one that provides an overview of group score assessments (e.g., the National Assessment of Educational Progress and the Trends in International Mathematics and Science Study), one on second-language testing, and the editor's introduction. It is not clear how chapter lengths were decided, but evidence revealing that teachers make classroom decisions based on assessment information every 2 to 3 minutes (Shavelson & Stern, 1981) and the substantial research base on classroom assessment that has accumulated in the past 20 years suggest that the coverage of classroom assessment could have been greatly expanded.
The few pages devoted to classroom assessment might also have been apportioned differently to dive directly into the most important aspects of the topic. For example, precious space was spent recounting the missteps of earlier practice, recalling early IQ tests, such as Army Alpha, and so on. Although it is clear that such missteps occurred, published discoveries in the Journal of the American Medical Association or American Psychologist are not routinely introduced by archaeologies of earlier practice involving vital humors or homuncular man. Moreover, in the course of referencing various tests, the chapter reinforces a false dichotomy. Rather than clearly distinguish between the legitimate and totally different purposes of large-scale and classroom assessments, the chapter perhaps unwittingly contributes to an either/or perspective that casts one purpose as bad and the other as good. Hmmm. . . . Let's think about this. Should we choose large-scale standardized tests that are "formal" and "technical" and represent "single-moment-in-time" measures of "isolated" and "decontextualized" topics based on "outmoded" expectations? Or should we opt for tests that are "contemporary," "embedded," and "ongoing" assessments that offer "authentic" and "flexible" measurement of "deeper" understandings?
Although these problems detract from the chapter, there is much that compensates for them. For example, Shepard (2006) frequently and effectively highlights the essential connections between classroom assessment and cognitive psychology, and a portion of the chapter on learning progressions provides a clear example of what assessment in writing would look like if based on how writing skill develops. The chapter also contains information on the use of rubrics to aid students in understanding the criteria that characterize successful learning and on the kinds of self-assessment activities and performance feedback that are most effective in enhancing learning. Finally, at a time when much educational testing is increasingly under attack, Shepard unapologetically defends the simple but powerful and persistent finding that "students appear to study more and learn more if they expect to be tested" (p. 637).
Like the other two chapters reviewed here, the chapter on classroom assessment omits some topics that would have been desirable to include. For example, additional treatment of methods for conducting observations or checking on the quality of those observations and further discussion of how teachers synthesize sources of classroom information for decision making would be helpful. Although Shepard (2006) reviews some of the literature on grading, the chapter might have benefited from concrete examples of and rationales for grading models that can be defended for reporting student achievement, and those that are less defensible as well. Finally, although bias in large-scale testing has practically been eliminated as a result of focused attention to the problem in that context, the potential for bias in classroom assessments would seem to loom large. A compilation of research, guidelines, and methods relevant to minimizing bias in the classroom assessment context is sorely needed.
Crosscutting Comments on Historical Context

Brennan's (2006a) edition of Educational Measurement follows three previous editions, edited by Linn (1989), Thorndike (1971), and Lindquist (1951). There is a great deal of outstanding scholarship in the fourth edition, and surely it was a mammoth undertaking to compile a volume representing the state of the art in a discipline so diverse. Brennan has succeeded in circumscribing the domain in a comprehensive manner and assembling individual chapters of exceptionally high quality.
Another reviewer (Wainer, 2007) of the fourth edition judged that little has been learned about key measurement topics since the publication of the third edition, asking, "How much new has happened in reliability since 1989?" (p. 485); that reviewer generally advised against purchasing this edition. I disagree. Although it is true that the chapters on reliability and item response theory cover much of the same ground as those chapters did in the third edition, evaluating the latest volume on that basis is a judgment made on an unrepresentative sample. The other 80% of the fourth edition documents substantial advances in research and new developments on topics such as cognitive psychology (Mislevy, 2006); technology in testing (Drasgow, Luecht, & Bennett, 2006); accountability (Koretz & Hamilton, 2006); scoring, reporting, and test security (Cohen & Wollack, 2006); performance assessment (Lane & Stone, 2006); and others.
My own evaluation is that "adequate yearly progress"—to invoke a popular phrase these days—has been made in the field since the publication of the third edition. The fourth edition of Educational Measurement is an essential update for measurement specialists and for social science researchers in general.

The field of measurement is dynamic. Indeed, it is changing so rapidly that, although the fourth edition is still quite new, it may not be too early to begin planning for the fifth. The task of ensuring the rigor, accuracy, and readability of discrete chapters is challenging, but I would urge that an additional perspective be considered for the next edition. Adequate progress seems to be only a modest goal; more radical aims should be contemplated.
For one thing, it seems to me that previous editions of Educational Measurement have uniformly and implicitly defined the universe of testing as consisting nearly exclusively of large-scale, standardized assessments. It is a curious contrast: Although so much educational testing and assessment occur at the level of the individual student and teacher or at the classroom level, the content of each edition of Educational Measurement is terribly tilted toward the technologies of testing programs such as the SAT, ACT, and GRE. It is as if the Federal Aviation Administration were to consider aviation safety with exclusive reference to commercial airlines, ignoring the much greater volume of private aircraft flights each day. Clearly, the evolving technologies of computer adaptive testing, item response theory, generalizability theory, and differential item functioning warrant documentation and dissemination; and it is true that the results of large-scale tests are often consequential. However, it is equally true that these developments pertain to a narrow slice of educational assessment and that classroom testing and grading are consequential in their own right. The inclusion of a chapter on classroom assessment in the fourth edition is commendable and definitely a step in the right direction, but this initiative must be broadened.
Accordingly, it seems appropriate to recommend that educational measurement be (re)considered more broadly, that balkanization of topics be avoided, and that cross-level perspectives be integrated and crosscutting questions be addressed, to the extent possible, in each chapter. For example, how should teachers think about setting standards on classroom tests? What are appropriate ways to consider the reliability of alternate assessments and other tests administered to sometimes very small samples? How might coherence between classroom assessments and state-level content standards be promoted? Are there any differences in appropriate testing accommodations for classroom and large-scale tests? What sources of validity evidence are appropriate for tests at different levels, with differing purposes, or with differing consequences?
Although the recommendation for greater integration might seem unrealistic, the current edition of Educational Measurement actually contains a remarkably comprehensive example of the kind of integrated treatment that could serve as a model for chapters in the next edition. The chapter by Lane and Stone (2006) on performance assessment deftly weaves together treatments of reliability, cognitive psychology, scoring, measurement models, classroom assessment concerns, computer-aided testing, validity, test design, fairness, and other concerns in a way that fully covers the identified topic of the chapter but does not duplicate the essential content of other chapters in the volume.
Finally, to inform thinking about the next edition, it may be illuminating to look backward. A historical note appears in the preface to the first edition of Educational Measurement. It refers the reader to a preceding volume, The Construction and Use of Achievement Examinations (Hawkes, Lindquist, & Mann, 1936), which was produced by the same publisher as the subsequent editions and could fairly claim to be the real first edition in the series. That earliest volume contained a chapter by McConn (1936), whose observations would easily be at home in the latest edition:

When one begins to meditate upon [achievement tests], one can hardly fail to be astonished by their multiplicity. . . . We are impelled to ask, why do we give such countless tests? Probably many persons will answer immediately that the obvious and legitimate purpose of practically all this achievement testing is the maintenance of standards; which seems to mean either one or both of two things: the imposition and enforcement of a prescribed curriculum; or the enforcement of some minimum degree of attainment. (p. 446)
McConn (1936) also asks some of the policy questions that are being asked today and identifies a gap in the measurement literature:

What do we accomplish by all this testing . . . anyway? Is it worth all the effort and money it costs? Do we perchance do harm instead of good, or harm as well as good with our examinations, and especially through the uses we make of their results? In short, it seems that we need not only techniques, but also some philosophy . . . dealing with the right uses of such instruments and their wrong uses or abuses. (p. 443)
In conclusion, the fourth edition of Educational Measurement clearly succeeds in capturing the state of the art in the field. However, although this new edition documents substantial advances in the technology of testing, McConn's observations highlight the presence of lingering challenges to be addressed—challenges related to the social, political, and educational contexts in which the science of psychometrics has long been situated. By tradition, the first two chapters of each edition of Educational Measurement are devoted to the essential topics of validity and reliability, respectively. Two additional chapters would be a welcome enhancement to the next edition. The first would be an initial chapter, preceding those on validity and reliability, that would begin to articulate a philosophy—or perhaps multiple philosophies—of educational testing and provide a context for relating those foundational ideas to the technological advances chronicled in each edition. The second would describe various models for how the enterprise of educational measurement can be integrated across levels of a planned educational assessment system; such a chapter would explicitly probe possible structures for effectively melding classroom assessment, large-scale testing in elementary and secondary schools, and postsecondary assessments. The challenge ahead lies in enhancing the utility of each component in the system for consumers of the results while retaining the fidelity of each component to its intended measurement objective.
REFERENCES<br />
American <strong>Educational</strong> Research Association,<br />
American Psychological Association, &<br />
National Council on <strong>Measurement</strong> in Education.<br />
(1999). Standards for educational and<br />
psychological testing. Washington, DC: American<br />
<strong>Educational</strong> Research Association.<br />
Black, P., & Wiliam, D. (1998). Assessment and<br />
classroom learning. Assessment in Education,<br />
5(1), 7–74.<br />
Brennan, R. L. (Ed.). (2006a). <strong>Educational</strong> measurement<br />
(4th ed.). Westport, CT: Praeger.<br />
Brennan, R. L. (2006b). Perspectives on the<br />
evolution and future of educational measurement.<br />
In R. L. Brennan (Ed.), <strong>Educational</strong><br />
measurement (4th ed., pp. 1–16). Westport,<br />
CT: Praeger.<br />
Cizek, G. J. (2005). Adapting testing technology<br />
to serve accountability aims: The case of<br />
vertically moderated standard setting. Applied<br />
<strong>Measurement</strong> in Education, 18(1), 1–10.<br />
Cohen, A. S., & Wollack, J. A (2006). Test<br />
administration, security, scoring, and reporting.<br />
In R. L. Brennan (Ed.), <strong>Educational</strong> measurement<br />
(4th ed., pp. 355–386). Westport,<br />
CT: Praeger.<br />
MARCH 2008<br />
99
Cone, J. D., & Foster, S. L. (1991). Training in measurement: Always the bridesmaid. American Psychologist, 46, 653–654.
Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–516). Westport, CT: Praeger.
Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24(3), 21–28.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: Praeger.
Hawkes, H. E., Lindquist, E. F., & Mann, C. R. (Eds.). (1936). The construction and use of achievement examinations. Boston: Houghton Mifflin.
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485–514). Washington, DC: American Council on Education.
Kane, M. T. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Praeger.
Koretz, D. M., & Hamilton, L. S. (2006). Testing for accountability in K–12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: Praeger.
EDUCATIONAL RESEARCHER
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: Praeger.
Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). Washington, DC: American Council on Education.
Markus, K. A. (1998). Science, measurement, and validity: Is completion of Samuel Messick’s synthesis possible? Social Indicators Research, 45(1), 7–34.
McConn, M. (1936). The uses and abuses of examinations. In H. E. Hawkes, E. F. Lindquist, & C. R. Mann (Eds.), The construction and use of achievement examinations (pp. 443–478). Boston: Houghton Mifflin.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–306). Westport, CT: Praeger.
Shavelson, R. J., & Stern, P. (1981). Research on teachers’ pedagogical thoughts, judgments, decisions, and behavior. Review of Educational Research, 51, 455–498.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Shepard, L. A. (2006). Classroom assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 624–646). Westport, CT: Praeger.
Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Wainer, H. (2007). A psychometric cicada: Educational Measurement returns. Educational Researcher, 36, 485–486.
Wright, B. D. (1994). Introduction to the Rasch model [Videocassette]. Available from College of Education, University of Denver.
AUTHOR
GREGORY J. CIZEK is a professor of educational measurement and evaluation at the University of North Carolina, Chapel Hill, School of Education, CB 3500, Chapel Hill, NC 27599–3500; cizek@unc.edu. His research focuses on standard setting, test security, and validity.
Manuscript received January 4, 2008
Revision received January 8, 2008
Accepted January 8, 2008