Book Reviews Assessing Educational Measurement: Ovations ...
I don’t want to know which questions you answered correctly. I want to know how much . . . you know. I need to leap from what I know and don’t want, to what I want but can’t know. That’s called inference.
In short, the field of educational measurement focuses on evaluating and enhancing the quality of the information generated by tests or, more precisely, the accuracy and dependability of inferences about constructs. The reference work Educational Measurement is a compendium of current best practices for accomplishing that purpose.
A Look Under the Hood<br />
The fourth edition of Educational Measurement consists of 21 chapters. My own expertise in the field of educational measurement is markedly narrower than the content constituting the entire volume, so I will offer brief comments on only three chapters, followed by observations pertaining to the book as a whole. The chapters discussed in the following paragraphs address validity, standard setting, and classroom assessment.
Validity
The topic of validity is a natural choice for the first chapter in any comprehensive treatment of educational measurement. After all, validity has been identified as “the most fundamental consideration in developing and evaluating tests” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 9) and as “the foundation for virtually all of our measurement work” (Frisbie, 2005, p. 21).
Unfortunately, validity has been a concept in turmoil ever since the third edition of Educational Measurement, in which Messick (1989) attempted to propose a grand, unifying theory of validity. The ensuing years have witnessed much discontent related to what found its way into Messick’s treatise (e.g., test consequences) and what was left out (e.g., practical guidance for validation efforts). Regarding the former, Brennan (2006b) states that “the most contentious topic in validity is the role of consequences” (p. 8); regarding the latter, Shepard (1993) notes that “Messick’s analysis does not help to identify which validity questions are essential to support a test use” (p. 427). The abstruseness of Messick’s prose has presented an initial barrier to the discussion of both problems, with one commentator opining that “questioning Messick’s theory of validity is akin to carving a Thanksgiving armadillo” (Markus, 1998, p. 7).
Many measurement specialists had high hopes that Kane’s (2006) chapter on validity in the most recent edition of Educational Measurement would address many of the difficulties in what had been the state of validity theory. Kane’s treatment of validity is far more succinct and accessible than Messick’s. However, Kane’s chapter does not so much refine or extend a theory from one edition to the next as present a qualitatively different approach, offered without strong refutation of the previous formulation or a clear and comprehensive integration of the old and new perspectives.
The new validity chapter does begin to develop some concrete steps for validation efforts, rooted largely in Kane’s (1992) previous work that encourages an explicit validity argument to support intended test-score inferences. Those who do the difficult work of test validation surely will appreciate Kane’s providing this potential strategy to guide their efforts. However, in this new validity chapter Kane appears to shy away from directly confronting the glaring weaknesses in Messick’s work. For example, he does not address the logical error of attempting to incorporate consequences as a part of validity; and he does not offer guidance about how the necessary precondition of validation efforts—namely, a clear statement about intended score inferences—should be determined. Instead, Kane (2006) proposes a negotiation procedure among an unspecified amalgam of interests, noting that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made” (p. 60). It seems appropriate that Kane has formally suggested that explicit, a priori consideration be given to the potential stakeholders affected by tests, but before such a proposal can be implemented, much more must be learned about how the appropriate stakeholders for any situation should be identified or limited and about how to conduct and arbitrate what could often be (at least in high-stakes contexts) contentious negotiations. Overall, although more work surely will be done to further refine some vexing aspects of validity, the chapter clearly provides welcome advances in validity theory and practice while highlighting the challenges for theoretical refinements and applications in the future.
Standard Setting
Standard setting is the art and science of establishing cut scores on tests, that is, the scores used to classify test takers into groups such as Pass/Fail, Basic, Proficient, or Advanced, and other labeled performance categories. The topic of standard setting apparently has arrived because an entire chapter on it, by Hambleton and Pitoniak (2006), is included in the fourth edition of Educational Measurement. In the third edition, the topic was embedded in a chapter on student competency testing (Jaeger, 1989). Because of the current ubiquity of standard setting and because of the high stakes that are sometimes associated with test performance, it seems appropriate that the topic has received careful attention in this edition.
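The classification role of cut scores described above can be sketched in a few lines of code. This is only an illustration, not any procedure from the chapter: the numeric cut scores and category labels below are hypothetical, whereas in practice a standard-setting panel would derive them through one of the procedures Hambleton and Pitoniak catalogue.

```python
import bisect

# Hypothetical cut scores partitioning a 0-100 score scale into four
# labeled performance categories. Real cut scores would be recommended
# by a standard-setting panel, not chosen arbitrarily as here.
CUT_SCORES = [40, 60, 80]
LABELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(score: float) -> str:
    """Map a test score to its performance category.

    A score at or above a cut score falls into the higher category,
    so bisect_right finds the index of the matching label.
    """
    return LABELS[bisect.bisect_right(CUT_SCORES, score)]
```

For example, under these hypothetical cuts, `classify(72)` falls between the cuts at 60 and 80 and so returns `"Proficient"`; the substantive work of standard setting lies entirely in justifying where those boundaries belong.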
Hambleton and Pitoniak’s (2006) chapter on setting performance standards provides the most comprehensive and balanced treatment of the subject to date. Since the previous edition of Educational Measurement, the repertoire of standard-setting methods has greatly expanded. Those who must establish cut scores have a broader array of procedures from which to choose, and options have been developed to better match the method with the assessment format, context, and other considerations. Hambleton and Pitoniak catalogue and provide brief descriptions of many of the available procedures. More important than the cataloging of methods, however, is that the details on each method are embedded in a comprehensive description of the typical steps in the standard-setting process, including (among others) developing performance-level descriptors; choosing, training, and providing feedback to participants; evaluating and documenting the process; and compiling validity evidence. Overall, the chapter strikes an appropriate balance of theory, procedural guidance, and grounding in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999).
Given more space, the authors might have paid greater attention to standard setting on alternate assessments, methods for integrating or choosing among results yielded by different methods, and rationales and methods for adjusting both cut scores for a single test and a system of cut scores across grade levels or subjects, known as vertically moderated standard setting (Cizek, 2005). For example, Hambleton and Pitoniak (2006) consider adjustments to recommended cut
MARCH 2008