View - Waisman Laboratory for Brain Imaging and Behavior


this complexity was at least partly the reason why relatively few generalizability studies were being conducted. I decided to try to publicize, teach, and simplify generalizability theory for graduate students and measurement practitioners. At about this time, with the assistance of Kane and Gillmore (and later Noreen Webb and Xiaohong Gao), I began an every-other-year training session on G theory for the AERA and NCME Annual Meetings. My first effort at writing a simpler treatment of G theory (Brennan, 1977) was a paper that was rejected by a major journal; the editor described it as being “too propaedeutic.” Just about that time Jay Millman, who was then president of NCME, asked me to consider writing a monograph on generalizability theory for publication by NCME. With the encouragement of Michael Kane and David Jarjoura, I agreed, but, when I completed the monograph almost 3 years later, NCME was no longer interested in publishing it! ACT, however, did publish Elements of Generalizability Theory (Brennan, 1983).

I had long felt that a simpler treatment of G theory was not enough to get the theory used more widely by practitioners. They also needed a computer program. So, at the same time I was writing Elements of Generalizability Theory, I was designing a computer program called GENOVA (Crick & Brennan, 1983) that would be coordinated with the monograph. My computer skills were not adequate for programming GENOVA, however. That task was undertaken by Joe Crick, a colleague from graduate school at Harvard, who somehow managed to translate my math and handwritten input-output layouts into workable FORTRAN code while serving as Director of the Computing Center at the University of Massachusetts, Boston.

Several expositions of G theory were published in the late 1980s and early 1990s, all of which are briefer and less demanding than Cronbach et al. (1972) or Brennan (1983, 1992a).
Shavelson, Webb, and Rowley (1989) provided a particularly readable journal article that summarizes G theory, and in the same year Feldt and Brennan (1989) devoted about one third of their chapter on reliability to G theory. In 1991, Shavelson and Webb published a relatively short monograph entitled Generalizability Theory: A Primer. Brennan (1992b) provided a very brief introduction intended primarily for classroom use.

Interest in performance testing in the late 1980s led to a mini-boom in generalizability analyses and considerably greater publicity for G theory. It seemed evident to practitioners that G theory was eminently well-suited to analyzing scores from such tests. In particular, practitioners realized that understanding the results of a performance test necessitated grappling with two or more facets simultaneously, especially tasks and raters. The relevance of G theory in such contexts is especially well illustrated by Richard Shavelson and his colleagues in a series of presentations and articles involving science and mathematics performance assessments, in particular (see, e.g., Gao, Brennan, & Shavelson, 1994; Shavelson, Baxter, & Gao, 1993; Shavelson, Baxter, & Pine, 1991, 1992). Also, Brennan and Johnson (1995) and Brennan (1996b) consider some theoretical and applied issues in performance testing from the perspective of G theory.

New assessments such as performance tests recently motivated Cronbach, Linn, Brennan, and Haertel (1995) to state: “Assessments depart from traditional measurements in ways that require extensions and modifications of generalizability analysis. . . . Assessments pose problems that reach beyond available psychometric theory” (p. 1). The Cronbach et al. (1995) report and a recent journal article revision (Cronbach, Linn, Brennan, & Haertel, 1997) suggest a number of problems that need to be researched, and they propose some recommended solutions.
These articles emphasize the importance of estimates of absolute standard errors of measurement for many of the types of decisions that are typically made with performance assessments. Also, these articles urge that an analysis of error for group means explicitly recognize that pupils are nested in classes and schools. Whether to treat pupils as fixed or random in such analyses is discussed in some detail (see also Brennan, 1995a).

In their 1972 monograph, Cronbach and his colleagues illustrated the applicability of G theory largely by reanalyzing some already published data in the psychology and education literature. Since 1972, in addition to topics already cited in this overview, G theory has been used to study issues such as classroom teaching (e.g., Erlich & Borich, 1979; Erlich & Shavelson, 1976); program evaluation (e.g., Gillmore, 1983); the use of tables of specifications in educational testing (e.g., Jarjoura & Brennan, 1982, 1983; Kolen & Jarjoura, 1984); counseling and development (Webb, Rowley, & Shavelson, 1988); setting performance standards (Brennan, 1995b); job performance (Webb, Shavelson, Kim, & Chen, 1989); neuroticism and coping with anger (Atkinson & Violato, 1994); and aspects of physiology, including blood pressure (Llabre et al., 1988; Saab et al., 1992).

Unfinished Work

G theory has a protean quality. The procedures and even the issues take on a new form in every context. G theory enables you to ask your questions better; what is most significant for you cannot be supplied from the outside. (Cronbach, 1976, p. 199)

In this sense, G theory is a continuous work in progress, and none of the research reviewed here can be deemed complete. Still, there are some important theoretical and statistical topics that clearly need to be addressed more fully than they have been, and there are potential areas of application where the theory has been largely unused as yet.
Although G theory has been applied in a number of contexts, the coverage is not balanced, and one might expect that after 25 years many more generalizability analyses would have been conducted than are reported in the literature. Most published generalizability analyses are in the education literature, perhaps because those who are most knowledgeable about G theory tend to be employed in colleges of education, educational testing companies, and related organizations.

Winter 1997

Clearly,
however, G theory has potential applicability wherever measurement procedures are employed. In particular, G theory seems very much underutilized in psychological and medical areas.

It is often stated that G theory “blurs the distinction between reliability and validity” (Cronbach et al., 1972, p. 380). Yet, very little of the G theory literature directly addresses validation issues. A notable exception is Kane’s (1982) treatment of “A Sampling Model for Validity,” which is clearly one of the major theoretical contributions to the literature on G theory in the last 25 years. In his article, Kane clearly begins to make explicit links between G theory and issues traditionally subsumed under validity. Still, many of the contributions that G theory probably could make to the validation of particular measurement procedures are unexplored, and it seems reasonable to speculate that more theoretical contributions are possible.

By the early 1960s, Cronbach and his colleagues had pretty much completed their development of univariate G theory. It provided a coherent framework for considering most, if not all, of the reliability literature that had been developed to that time. About 1966, they began work on multivariate G theory, in which each of the levels of one or more fixed facets is associated with a distinct universe score. Although it might be claimed that not all of univariate G theory is novel, multivariate G theory (the generalizability of profiles) is clearly a unique contribution of Cronbach and his colleagues (Cronbach et al., 1972, chapters 9 and 10). In commenting on multivariate G theory, Cronbach has stated:

Despite the long-standing interest Gleser and I had in profiles, all of G theory down to 1966 considered one score at a time. . . . A decade of work was required to expose the twists and turns of the simpler univariate multifacet theory, so surely much multivariate theory remains to be developed. (Cronbach, 1991, p.
394)

Shavelson and Webb (1981) in their review of G theory discuss some developments in multivariate G theory since the Cronbach et al. (1972) monograph. Since their review, there have been other articles published on the subject (e.g., Brennan, Gao, & Colton, 1995; Gao, Shavelson, Brennan, & Baxter, 1996; Jarjoura & Brennan, 1982, 1983; Kolen & Jarjoura, 1984; Nußbaum, 1984; Webb, Shavelson, & Maddahian, 1983). Also, Brennan (1983, 1992a) and Shavelson, Webb, and Rowley (1989) provide illustrative multivariate analyses. However, it is still true that “much multivariate theory remains to be developed” (Cronbach, 1991, p. 394).

In my opinion, the conceptual framework of G theory is more central, and likely to be more enduring, than the statistical machinery used to carry out generalizability analyses. However, the statistical procedures are still important. Since estimates of variance components are so central, any issue associated with such estimates is of particular concern. For example, the stability of estimated variance components was considered by Cronbach et al. (1972) and subsequently studied by Smith (1978, 1981, 1982), Brennan (1994), and Gao (1996) among others.

It has long been recognized that conditional SEMs are not constant for all examinees. Lord’s (1957, 1959) articles provide perhaps the best known formula for conditional SEMs, a formula based on an absolute definition of error. Conditional, relative-error SEMs in G theory were considered by Jarjoura (1986). Recently, Brennan (1996a) has extended the work of Lord and Jarjoura, but much more research remains to be done.

Almost all of G theory and its applications to date effectively assume that the scores used to make decisions about the objects of measurement (usually examinees) are raw scores or linear transformations of raw scores.
Often, however, the scale scores actually used are nonlinear transformations, and there is no necessary reason to believe that results based on a generalizability analysis of raw scores are directly relevant for such scale scores. One common example is the conversion of raw scores on tasks to “pass/not pass” status on an assessment (see Cronbach et al., 1995, 1997). Recently, Brennan and Lee (1997) have considered some approaches to estimating conditional SEMs for nonlinear transformations of raw scores, but the role of nonlinear transformations in G theory is still largely unexplored.

Brennan (1984) discusses a number of other statistical topics relevant to G theory, topics that are by no means thoroughly researched as yet. In particular, practitioners need more readily available procedures for performing generalizability analyses in unbalanced situations.

Twenty-five years ago, in commenting about the future of G theory, Cronbach et al. (1972) stated that:

Because our model treats conditions within a facet as unordered, it will not deal adequately with the stability of scores that are subject to trends, or to order effects arising from the measurement process. . . . A large contribution will be made by the development of a model for treating ordered facets. (p. 364)

Such a contribution has yet to be made. Furthermore, Rogosa and Ghandour (1991) suggest that G theory may not be applicable to certain statistical models for behavioral observations (situations in which time is a facet). Their research deserves further consideration, because it seems to provide results that are inconsistent with G theory (and other traditional psychometric models).

The final paragraph of The Dependability of Behavioral Measurements (Cronbach et al., 1972, p. 388) states:

Today’s reader, coming to a fully elaborated generalizability theory for the first time, no doubt finds it forbidding.
As measurement specialists become accustomed to its language and its ways of treating data, this strangeness will pass. As the theory is put in different words by successive writers, it will be rounded into smoother form. As it becomes more integrated with other recent developments in error theory, and with the validation theory of which it is a part, it will become inseparable from the measurement theory of the next generation.

The predictions of Cronbach and his colleagues are only partly ful-

Educational Measurement: Issues and Practice