
Kelley, T. L. (1923). Statistical method. 

New York: Macmillan. 

Kelley, T. L. (1942). The reliability coefficient. 

Psychometrika, 7, 75-83. 

Kuder, G. F., & Richardson, M. W. 

(1937). The theory of estimation of 

test reliability. Psychometrika, 2, 

151-160. 

Lord, F. M., & Novick, M. R. (1968). Statistical 

theories of mental test scores. 

Reading, MA: Addison-Wesley.

Novick, M. R. (1966). The axioms and 

principal results of classical test theory. 

Journal of Mathematical Psychology, 

3, 1-18. 

Pearson, K. (1896). Mathematical contributions 

to the theory of evolution- 

III. Regression, heredity and

panmixia. Philosophical Transactions,

A, 187, 252-318. 

Pearson, K. (1904). On the laws of inheritance 

in man. II. On the inheritance

of the mental and moral 

characters in man, and its comparison 

with the inheritance of physical 

characters. Biometrika, 3, 131-190. 

Pearson, K. (1930). The life, letters, and 

labours of Francis Galton. Vol. IIIA.

Correlation, personal identification

and eugenics. Cambridge: The University 

Press. 

Pearson, K., & Lee, A. (1903). On the

laws of inheritance in man. I. Inheritance 

of physical characters. Biometrika, 

2, 357-462. 

Read, C. B. (1985). Normal distribution. 

In S. Kotz & N. L. Johnson (Eds.), 

Encyclopedia of statistical sciences 

(Vol. 6, pp. 347-359). Toronto: Wiley. 

Richardson, M. W. (1936). Notes on the 

rationale of item analysis. Psychometrika,

1(1), 69-76. 

Rulon, P. J. (1939). A simplified procedure 

for determining the reliability of 

a test by split-halves. Harvard Educational 

Review, 9, 99-103. 

Sheynin, O. B. (1968). On the early history

of the law of large numbers. Biometrika,

55, 459-467. 

Spearman, C. (1904). The proof and 

measurement of association between 

two things. American Journal of Psychology, 

15, 72-101. 

Spearman, C. (1907). Demonstration of 

formulae for true measurement of 

correlation. American Journal of Psychology, 

18, 160-169. 

Spearman, C. (1910). Correlation calculated 

from faulty data. British 

Journal of Psychology, 3, 271-295. 

Thurstone, L. L. (1932). The reliability 

and validity of tests. Ann Arbor, MI: 

N. p. 

Venn, J. (1888). The logic of chance (3rd 

ed.). London: Macmillan. 

Walker, H. M. (1929). Studies in the history 

of statistical method. Baltimore: 

Williams & Wilkins. 

A Perspective on the History of Generalizability Theory

Robert L. Brennan 

University of Iowa 

What psychometric and scientific perspectives influenced the development of G theory? What practical testing problems gave impetus to its adoption? What work remains to be done?

Overviews of various parts of the history of generalizability (G) theory are provided elsewhere. An indispensable starting point is the preface and parts of the first chapter of Cronbach, Gleser, Nanda, and Rajaratnam (1972), entitled The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. The Cronbach et al. monograph is still the most definitive treatment of G theory. Shavelson and Webb (1981) review the G theory literature from 1973-1980, and Shavelson, Webb, and Rowley (1989) cover additional contributions in the 1980s. A very brief historical overview is provided by Brennan (1983, 1992a, pp. 1-2). In addition, Cronbach (1976, 1989, 1991) offers numerous perspectives on G theory and its history. Cronbach (1991) is particularly rich with first-person reflections.

This historical overview is not intended to repeat everything already covered in published reviews, although a summary is provided. Parts of this article are based largely on my personal experience with G theory. Consequently, this article provides a somewhat idiosyncratic perspective on the history of G theory and what I perceive as unfinished work for the theory. Almost certainly, other reviewers would see the landscape somewhat differently.

Robert L. Brennan is Lindquist Professor of Educational Measurement and Director of the Iowa Testing Programs, University of Iowa, 334A Lindquist Center, Iowa City, IA 52242. His specializations are generalizability theory, equating, and scaling.

Educational Measurement: Issues and Practice, Winter 1997

Theory Development and Enabling Work

In discussing the genesis of G theory, Cronbach (1991) states:

In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser's collaboration, a kind of handbook of measurement theory. . . .

“Since reliability has been studied 

thoroughly and is now understood,” 

I suggested to the team, 

“let us devote our first few weeks 

to outlining that section of the 

handbook, to get a feel for the undertaking.” 

We learned humility 

the hard way-the enterprise 

never got past that topic. Not 

until 1972 did the book appear 

(Cronbach, Gleser, Nanda, & Rajaratnam) 

that exhausted our 

findings on reliability reinterpreted 

as generalizability. Even 

then, we did not exhaust the topic. 

When we tried initially to summarize 

prominent, seemingly 

transparent, convincingly argued 

papers on test reliability, the messages 

conflicted. (pp. 391-392) 

To resolve these conflicts, Cronbach 

and his colleagues devised a 

rich conceptual framework and married 

it to analysis of random effects 

variance components. The net effect 

is “a tapestry that interweaves ideas 

from at least two dozen authors” 

(Cronbach, 1991, p. 394). 

It is not uncommon for G theory 

to be described as the application of 

analysis of variance (ANOVA) to 

classical test theory. This characterization 

of the theory is inadequate, 

at best, and probably more misinformative 

than useful, except in one

respect. It does correctly suggest 

that the parents of G theory can be 

viewed as classical test theory and 

analysis of variance. The G theory 

child, however, is both more and less 

than the simple conjunction of its 

parents. In particular, G theory is 

not a replacement for classical theory, 

although it does liberalize the 

theory. Also, not all of ANOVA is 

relevant to G theory; indeed, some 

perspectives on ANOVA are inconsistent 

with G theory (see Brennan, 

1984). 

The statistical machinery employed 

in G theory has its genesis in 

Fisher’s (1925) work on factorial designs. 

However, G theory has no 

substantive role for hypothesis testing. 

Rather, it emphasizes the estimation 

of random effects variance 

components, a subject that was researched

by statisticians in the late 

1940s (see, e.g., Crump, 1946, and 

particularly Eisenhart, 1947). This 

research was brought to Cronbach’s 

attention by a graduate student, 

Milton Meux, about 1957 (L. J. 

Cronbach, personal communication, 

April 18, 1997) at approximately the

same time that Cornfield and Tukey 

(1956) published their rules for expressing 

expected mean square 

equations in terms of variance components. 

By 1950, there was a rich literature 

on reliability from the perspective 

of classical test theory. Most of 

this literature had been superbly 

summarized by Gulliksen (1950), 

which included chapters on experimental 

methods for estimating reliability, 

as well as reliability 

estimated by item homogeneity, what

came to be called internal consistency 

estimates. Such estimates 

included, of course, Hoyt’s (1941) 

ANOVA version of Kuder and 

Richardson’s (1937) KR20 index. It 

is not quite true, however, that Hoyt 

was the first to apply ANOVA to 

measurement problems. An earlier 

contribution was made by Burt 

(1936) in his treatment of the analysis 

of examination marks. 

Gulliksen’s (1950) book was published 

before Cronbach’s widely 

cited 1951 article that introduced 

Coefficient α. For the next several

years, a great deal of research on reliability 

formed the backdrop for G 

theory. Finlayson’s (1951) study of 

grades assigned to essays was probably 

the first treatment of reliability 

in terms of variance components. 

Shortly thereafter Pilliner (1952) 

provided theoretical relations between 

intraclass correlations and 

ANOVA (see also Haggard, 1958). 

Cronbach (1947) had expressed 

the concern that some type of multifacet 

analysis was needed to resolve 

inconsistencies in some estimates of 

reliability. The 1950s were years in 

which various researchers began to 

exploit the fact that ANOVA could 

handle multiple facets simultaneously. 

Particular examples include 

Loveland’s (1952) doctoral dissertation, 

work by Medley, Mitzel, and 

Doi (1956) on classroom observations, 

and Burt’s (1955) treatment of 

test reliability estimated by analysis 

of variance. Most importantly, Lindquist 

(1953, chap. 16) laid out an 

extensive exposition of multifacet 

theory that focused on the estimation 

of variance components in reliability 

studies. Lindquist demonstrated 

that multifacet analyses 

lead to alternative definitions of 

error and reliability coefficients. 

Lindquist’s chapter clearly foreshadowed 

important parts of G theory. 

Cronbach was on the faculty at 

the University of Chicago from 1946 

to 1948. He recalls that: 

Five minutes with Joseph 

Schwab had a profound influence. 

. . . In some context 

Schwab remarked that biologists 

have to decide what to count as a 

species. . . . Schwab was acute 

enough to catch my flicker of surprise 

and force home the idea of 

scientist as construer rather than 

as discoverer of categories the 

Creator had in mind. That conversation 

. . . resonates in my 

thinking to this day. (Cronbach, 

1989, p. 72, italics added) 

Given this perspective, it is not 

surprising that G theory requires 

that investigators define the conditions 

of measurement of interest 

to them. The theory effectively disavows 

any notion of there being a 

correct set of conditions of measurement, 

but it is clear that the particular 

tasks or items used are not a 

sufficient specification of a measurement 

procedure. These notions 

are central to the conceptual framework 

of G theory, but they are not 

entirely novel. 

Guttman once made the provocative 

remark that a test belongs to 

several sets, and therefore has 

several reliabilities. “List as 

many 4-letter words that begin 

with t as you can.” That word-fluency 

task fits into at least three 

families: 4-letter words beginning 

with a specified letter, t words of 

a specified length, and 4-letter 

words with t in a specified position. 

The investigator’s theory, 

rather than an abstract concept of 

truth and error, determines 

which family contains tests that 

“measure the same variable.” 

(Cronbach, 1991, p. 394) 

In 1951, Ebel published an article 

on the reliability of ratings in which 

he essentially considered two types 

of error variance: one that included,

and another that excluded, 

rater main effects. In the process of 

doing so, Ebel also considered single-facet 

crossed and nested designs. 

It wasn’t until G theory was 

fully formulated that the issues 

Ebel grappled with were truly clarified 

in the distinction between relative (δ)

and absolute (Δ) error for

various designs. Very much the 

same problems were considered by 

Lord (1955, 1957, 1959) in a classic 

series of articles about conditional 

standard errors of measurement 

(SEMs) and reliability under the assumptions 

of what came to be called 

the binomial error model (see also 

Lord, 1962). In effect, the rater 

main effects in Ebel’s article play 

the role of the item main effects in 

Lord’s articles. In addition, Lord’s 

articles clearly specify what came to 

be called randomly parallel tests. 
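The two kinds of error variance at issue here can be illustrated with a minimal sketch for a single-facet crossed design (persons by raters or items, one score per cell). It assumes the usual ANOVA estimators of random effects variance components; the function names and the small score matrix are hypothetical and are not taken from Ebel or Lord. Relative error (δ) excludes the rater or item main effect, and absolute error (Δ) includes it.

```python
def variance_components_p_by_i(scores):
    """ANOVA estimates of variance components for a crossed p x i design.

    scores: list of rows (persons), each a list of item (or rater) scores.
    Returns estimates for persons ("p"), items ("i"), and the residual
    ("pi,e": person-by-item interaction confounded with other error).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for the two main effects and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean square equations for the components.
    return {
        "p": (ms_p - ms_pi) / n_i,
        "i": (ms_i - ms_pi) / n_p,
        "pi,e": ms_pi,
    }


def error_variances(vc, n_items):
    """Relative and absolute error variance for a mean over n_items."""
    relative = vc["pi,e"] / n_items                # item main effect excluded
    absolute = (vc["i"] + vc["pi,e"]) / n_items    # item main effect included
    return relative, absolute


# Hypothetical data: 5 persons by 4 dichotomously scored items.
scores = [
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
vc = variance_components_p_by_i(scores)
rel_var, abs_var = error_variances(vc, n_items=len(scores[0]))
```

Whenever the estimated item (or rater) component is positive, the absolute error variance exceeds the relative error variance, which is the difference between the two definitions of error that Ebel considered.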

The issues Lord was grappling 

with had a clear influence on the development 

of G theory. According to 

Cronbach (personal communication, 

1996), about 1957, Lord visited the 

Cronbach team in Urbana. Their 

discussions suggested that the error 

in Lord’s formulation of the binomial

error model (which treated one

person at a time, that is, a completely

nested design) could not be 

the same error as that in classical 

theory for a crossed design. (Lord 

basically acknowledges this in his 

1962 article.) This insight was eventually 

captured in the distinction 

between δ and Δ in G theory, and it

illustrated that errors of measurement 

are influenced by the choice of 

design. Lord’s binomial error model 

is probably best known as a simple 

way to estimate conditional SEMs 

and as an important precursor to 

strong true score theory, but it is 

also associated with important insights 

that became an integral part 

of G theory. 

The genius of Cronbach and his 

colleagues was their creation of a 

conceptual framework and use of a 

methodology (variance components 

analysis) that integrated the contributions 

of numerous researchers, 

even when some contributions 

seemed to conflict with one another. 

The essential features of univariate 

G theory were largely completed 

with technical reports in 1960- 

1961, each with a different first 

author. These were revised into 

three journal articles, each with a 

different first author (Cronbach, Rajaratnam, 

& Gleser, 1963; Gleser, 

Cronbach, & Rajaratnam, 1965; and 

Rajaratnam, Cronbach, & Gleser, 

1965). In 1964 Cronbach moved to 

Stanford. Shortly thereafter, Harinder 

Nanda’s studies on interbattery

reliability provided part of the

motivation for the development of 

multivariate G theory (considered 

later). This very major extension of 

the univariate model is part of the 

reason it took more than 10 years 

after the 1960-1961 reports for 

Cronbach et al. (1972) to appear in 

print. It is still the most intensive 

and extensive treatment of G theory. 

Applications and Extensions With 

Some Personal Reflections 

Many investigators who come to

employ generalizability theory in

their research do so only after concluding 

that more conventional approaches 

seem inadequate. That 

was indeed the motivation that led 

me to generalizability theory. In the 

late 1960s and early 1970s, I served 

as a consultant on evaluations of the 

Head Start and Follow Through 

Programs, and the National Day 

Care Study. A distinguishing common 

characteristic of these studies 

was that the treatments were applied 

to whole classrooms and evaluated 

using certain measurement 

procedures. A very natural question 

to ask, then, was, “How shall we estimate 

the reliability of classroom 

mean scores for these measurement 

procedures?” A number of discussions

convinced many of us that the 

problem was not getting an estimate; 

rather, the problem was that 

we had too many estimates, and no 

obvious way to choose among them. 

Early in the summer of 1972, I set 

myself the goal of resolving this 

paradox by the end of the summer. 

It did not take that long. The library 

at SUNY at Stony Brook where I 

was a beginning assistant professor 

had a brand new book entitled The 

Dependability of Behavioral Measurements 

(Cronbach et al., 1972). 

After studying it night and day for a 

week, the answer was obvious: the

different estimates we were getting 

were related to different universes 

of generalization when class means

were the objects of measurement. 

This insight eventually led to my 

first publication on generalizability 

theory (Brennan, 1975). Shortly 

thereafter, Michael Kane joined the 

faculty of education at Stony Brook, 

and I discovered that he and some of 

his former colleagues at the University 

of Illinois had been working on 

exactly the same problem in the 

context of student evaluations of 

teaching (see, e.g., Kane, Gillmore, 

& Crooks, 1976). Our common interest 

in this problem led to a joint article 

(Kane & Brennan, 1977). 

The Cronbach et al. (1972) formulation 

of G theory was general 

enough to permit any set of conditions 

(e.g., persons, classes, items) 

to be the objects of measurement 

facet. In that sense, the work on 

class means in the early-to-mid- 

1970s was more of an illustration 

than a substantive contribution to 

the theory. In a series of articles 

about the symmetry of G theory, 

Cardinet and his colleagues emphasized 

the role that facets other than 

persons might play as objects of 

measurement (e.g., Cardinet & 

Allal, 1983; Cardinet, Tourneur, & 

Allal, 1976a, 1976b, 1981). 

At the same time that Kane and I 

were working on our class means article, 

we were intrigued with the 

idea of using generalizability theory 

to address issues surrounding the 

reliability of criterion-referenced (or 

domain-referenced) scores, which 

was a very hot topic in the early-to-mid-1970s.

Our initial forays into 

this area (Brennan & Kane, 1977a, 

1977b) were based on a very simple 

idea: use absolute error rather than

relative error in defining indices and 

signal-noise ratios. This work was 

later summarized and somewhat extended 

by Brennan (1984). The research 

that Kane and I did on 

domain-referenced scores and class 

means was so clearly co-equal that 

we flipped a coin to decide on first 

authorship. To follow blindly the 

alphabetize-by-last-name convention 

would have grossly misrepresented 

our relative contributions. 
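As a rough illustration of what it means to use absolute error in place of relative error, the sketch below computes, for a persons-by-items design, a generalizability coefficient based on relative error alongside an index of dependability and a signal-noise ratio based on absolute error. The variance components are made-up numbers, the function names are hypothetical, and the formulas are the standard single-facet forms rather than anything quoted from the 1977 articles.

```python
def relative_error_variance(var_pi_e, n_items):
    """Relative error variance for a mean score over n_items."""
    return var_pi_e / n_items


def absolute_error_variance(var_i, var_pi_e, n_items):
    """Absolute error variance: adds the item main effect component."""
    return (var_i + var_pi_e) / n_items


def generalizability_coefficient(var_p, var_pi_e, n_items):
    """Universe score variance over itself plus relative error variance."""
    return var_p / (var_p + relative_error_variance(var_pi_e, n_items))


def dependability_index(var_p, var_i, var_pi_e, n_items):
    """Same ratio, but with absolute error variance in the denominator."""
    return var_p / (var_p + absolute_error_variance(var_i, var_pi_e, n_items))


def signal_noise_ratio(var_p, var_i, var_pi_e, n_items):
    """Universe score variance divided by absolute error variance."""
    return var_p / absolute_error_variance(var_i, var_pi_e, n_items)


# Hypothetical variance components for persons, items, and the residual.
var_p, var_i, var_pi_e, n_items = 0.04, 0.01, 0.20, 10

e_rho2 = generalizability_coefficient(var_p, var_pi_e, n_items)
phi = dependability_index(var_p, var_i, var_pi_e, n_items)
snr = signal_noise_ratio(var_p, var_i, var_pi_e, n_items)
```

Because absolute error variance can never be smaller than relative error variance when the item component is nonnegative, the absolute-error index is never larger than the relative-error coefficient for the same design.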

In 1981, Shavelson and Webb 

published a review of G theory for 

the years 1973-1980. Actually, their 

article is much more than a review 

of 8 years of literature; it is also an

excellent summary of G theory that 

is highly relevant and readable 

today. Only some of the work they 

review has been discussed here. 

By the late 1970s, I had read 

Cronbach et al. (1972) cover-to-cover 

three times, but parts of it still challenged 

me. I agreed with their statement 

that “the book is complexly 

organized and by no means simple 

to follow” (Cronbach et al., 1972, 

p. 3). It seemed likely to me that 


this complexity was at least partly 

the reason why relatively few generalizability 

studies were being conducted. 

I decided to try to publicize, 

teach, and simplify generalizability 

theory for graduate students and 

measurement practitioners. At 

about this time, with the assistance 

of Kane and Gillmore (and later 

Noreen Webb and Xiaohong Gao), I 

began an every-other-year training 

session on G theory for the AERA 

and NCME Annual Meetings. 

My first effort at writing a simpler 

treatment of G theory (Brennan, 

1977) was a paper that was 

rejected by a major journal; the editor

described it as being “too 

propaedeutic.” Just about that time 

Jay Millman, who was then president 

of NCME, asked me to consider 

writing a monograph on generalizability 

theory for publication by 

NCME. With the encouragement of 

Michael Kane and David Jarjoura, I 

agreed, but, when I completed the 

monograph almost 3 years later, 

NCME was no longer interested in 

publishing it! ACT, however, did 

publish Elements of Generalizability 

Theory (Brennan, 1983). 

I had long felt that a simpler 

treatment of G theory was not 

enough to get the theory used more 

widely by practitioners. They also 

needed a computer program. So, at 

the same time I was writing Elements 

of Generalizability Theory, I 

was designing a computer program 

called GENOVA (Crick & Brennan, 

1983) that would be coordinated 

with the monograph. My computer 

skills were not adequate for programming 

GENOVA, however. That 

task was undertaken by Joe Crick, a 

colleague from graduate school at 

Harvard, who somehow managed to 

translate my math and handwritten 

input-output layouts into workable 

FORTRAN code while serving as 

Director of the Computing Center at 

the University of Massachusetts, 

Boston. 

Several expositions of G theory 

were published in the late 1980s and 

early 1990s, all of which are briefer

and less demanding than Cronbach 

et al. (1972) or Brennan (1983, 

1992a). Shavelson, Webb, and Rowley 

(1989) provided a particularly 

readable journal article that summarizes 

G theory, and in the same 

year Feldt and Brennan (1989) devoted

about one third of their chapter

on reliability to G theory. In 

1991, Shavelson and Webb published 

a relatively short monograph 

entitled Generalizability Theory: A 

Primer. Brennan (1992b) provided a 

very brief introduction intended primarily 

for classroom use. 

Interest in performance testing in 

the late 1980s led to a mini-boom in 

generalizability analyses and considerably 

greater publicity for G 

theory. It seemed evident to practitioners 

that G theory was eminently 

well-suited to analyzing scores from 

such tests. In particular, practitioners 

realized that understanding the 

results of a performance test necessitated 

grappling with two or more 

facets simultaneously, especially

tasks and raters. The relevance of G 

theory in such contexts is especially 

well illustrated by Richard Shavelson 

and his colleagues in a series of 

presentations and articles involving 

science and mathematics performance 

assessments, in particular 

(see, e.g., Gao, Brennan, & Shavelson, 

1994; Shavelson, Baxter, & 

Gao, 1993; Shavelson, Baxter, & 

Pine, 1991, 1992). Also, Brennan 

and Johnson (1995) and Brennan 

(1996b) consider some theoretical

and applied issues in performance 

testing from the perspective of G 

theory. 

New assessments such as performance 

tests recently motivated 

Cronbach, Linn, Brennan, and 

Haertel (1995) to state: “Assessments 

depart from traditional measurements 

in ways that require 

extensions and modifications of generalizability 

analysis. . . . Assessments 

pose problems that reach 

beyond available psychometric theory” 

(p. 1). The Cronbach et al. 

(1995) report and a recent journal 

article revision (Cronbach, Linn, 

Brennan, & Haertel, 1997) suggest a

number of problems that need to be 

researched, and they propose some 

recommended solutions. These articles 

emphasize the importance of estimates 

of absolute standard errors 

of measurement for many of the 

types of decisions that are typically 

made with performance assessments. 

Also, these articles urge that 

an analysis of error for group means 

explicitly recognizes that pupils are 

nested in classes and schools. 

Whether to treat pupils as fixed or 

random in such analyses is discussed 

in some detail (see, also, 

Brennan, 1995a).

In their 1972 monograph, Cronbach 

and his colleagues illustrated 

the applicability of G theory largely 

by reanalyzing some already published 

data in the psychology and 

education literature. Since 1972, in 

addition to topics already cited in 

this overview, G theory has been 

used to study issues such as classroom 

teaching (e.g., Erlich & Borich, 

1979; Erlich & Shavelson, 1976); 

program evaluation (e.g., Gillmore, 

1983); the use of tables of specifications 

in educational testing (e.g., 

Jarjoura & Brennan, 1982, 1983; 

Kolen & Jarjoura, 1984); counseling 

and development (Webb, Rowley, & 

Shavelson, 1988); setting performance 

standards (Brennan, 1995b); 

job performance (Webb, Shavelson, 

Kim, & Chen, 1989); neuroticism 

and coping with anger (Atkinson &

Violato, 1994); and aspects of physiology, 

including blood pressure 

(Llabre et al., 1988; Saab et al., 

1992). 

Unfinished Work 

G theory has a protean quality. 

The procedures and even the issues 

take on a new form in every 

context. G theory enables you to 

ask your questions better; what is 

most significant for you cannot be 

supplied from the outside. (Cronbach, 

1976, p. 199) 

In this sense, G theory is a continuous 

work in progress, and none 

of the research reviewed here can be 

deemed complete. Still, there are 

some important theoretical and statistical 

topics that clearly need to be 

addressed more fully than they 

have been, and there are potential 

areas of application where the theory 

has been largely unused as yet. 

Although G theory has been applied 

in a number of contexts, the 

coverage is not balanced and one 

might expect that after 25 years 

many more generalizability analyses 

would have been conducted than 

are reported in the literature. Most 

published generalizability analyses 

are in the education literature, perhaps 

because those who are most 

knowledgeable about G theory tend 

to be employed in colleges of education, 

educational testing companies, 

and related organizations. Clearly, 


however, G theory has potential applicability 

wherever measurement 

procedures are employed. In particular, 

G theory seems very much 

underutilized in psychological and 

medical areas. 

It is often stated that G theory 

“blurs the distinction between reliability 

and validity” (Cronbach et al., 

1972, p. 380). Yet, very little of the G 

theory literature directly addresses 

validation issues. A notable exception 

is Kane’s (1982) treatment of “A 

Sampling Model for Validity,” which 

is clearly one of the major theoretical 

contributions to the literature 

on G theory in the last 25 years. In 

his article, Kane clearly begins to 

make explicit links between G theory 

and issues traditionally subsumed 

under validity. Still, many of 

the contributions that G theory 

probably could make to the validation 

of particular measurement procedures 

are unexplored, and it 

seems reasonable to speculate that 

more theoretical contributions are 

possible. 

By the early 1960s, Cronbach and 

his colleagues had pretty much completed 

their development of univariate 

G theory. It provided a coherent 

framework for considering most, if 

not all, of the reliability literature 

that had been developed to that 

time. About 1966, they began work 

on multivariate G theory, in which 

each of the levels of one or more 

fixed facets is associated with a distinct 

universe score. Although it 

might be claimed that not all of univariate 

G theory is novel, multivariate 

G theory (the generalizability of 

profiles) is clearly a unique contribution 

of Cronbach and his colleagues 

(Cronbach et al., 1972, 

chapters 9 and 10). In commenting 

on multivariate G theory, Cronbach 

has stated: 

Despite the long-standing interest 

Gleser and I had in profiles, 

all of G theory down to 1966 considered 

one score at a time. . . . A 

decade of work was required to 

expose the twists and turns of the 

simpler univariate multifacet 

theory, so surely much multivariate 

theory remains to be developed. 

(Cronbach, 1991, p. 394) 

Shavelson and Webb (1981) in 

their review of G theory discuss 

some developments in multivariate 

G theory since the Cronbach et al. 


(1972) monograph. Since their review, 

there have been other articles 

published on the subject (e.g., Brennan, 

Gao, & Colton, 1995; Gao, 

Shavelson, Brennan, & Baxter, 

1996; Jarjoura & Brennan, 1982, 

1983; Kolen & Jarjoura, 1984; Nußbaum,

1984; Webb, Shavelson, & 

Maddahian, 1983). Also, Brennan 

(1983, 1992a) and Shavelson, Webb, 

and Rowley (1989) provide illustrative 

multivariate analyses. However, 

it is still true that “much multivariate 

theory remains to be developed”

(Cronbach, 1991, p. 394). 

In my opinion, the conceptual 

framework of G theory is more central, 

and likely to be more enduring, 

than the statistical machinery 

used to carry out generalizability 

analyses. However, the statistical 

procedures are still important. 

Since estimates of variance components 

are so central, any issue associated 

with such estimates is of 

particular concern. For example, 

the stability of estimated variance 

components was considered by 

Cronbach et al. (1972) and subsequently 

studied by Smith (1978, 

1981, 1982), Brennan (1994), and

Gao (1996) among others. 

It has long been recognized that 

conditional SEMs are not constant 

for all examinees. Lord’s (1957, 

1959) articles provide perhaps the 

best known formula for conditional 

SEMs, a formula based on an absolute

definition of error. Conditional, 

relative-error SEMs in G 

theory were considered by Jarjoura 

(1986). Recently, Brennan (1996a) 

has extended the work of Lord and 

Jarjoura, but much more research 

remains to be done. 
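For reference, Lord's binomial error model gives a simple closed form for the absolute-error conditional SEM of a raw score x on n dichotomously scored items: the square root of x(n - x)/(n - 1). The sketch below is one way to compute it; the function name and the 40-item test length are illustrative assumptions.

```python
import math


def lord_conditional_sem(raw_score, n_items):
    """Conditional SEM under the binomial error model.

    raw_score: number-correct score, between 0 and n_items inclusive.
    n_items:   number of dichotomously scored items (at least 2).
    """
    if n_items < 2:
        raise ValueError("n_items must be at least 2")
    if not 0 <= raw_score <= n_items:
        raise ValueError("raw_score must lie between 0 and n_items")
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))


# Conditional SEMs across the score range of a hypothetical 40-item test.
n_items = 40
sems = {x: lord_conditional_sem(x, n_items) for x in range(0, n_items + 1, 10)}
```

The values are zero at the extremes of the score range, largest at the midpoint, and symmetric about it, which makes concrete the point that conditional SEMs are not constant for all examinees.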

Almost all of G theory and its applications 

to date effectively assume 

that the scores used to make decisions 

about the objects of measurement 

(usually examinees) are raw 

scores or linear transformations of 

raw scores. Often, however, the 

scale scores actually used are nonlinear 

transformations, and there is 

no necessary reason to believe that 

results based on a generalizability 

analysis of raw scores are directly 

relevant for such scale scores. One 

common example is the conversion 

of raw scores on tasks to “pass/not pass”

status on an assessment (see 

Cronbach et al., 1995, 1997). Recently, 

Brennan and Lee (1997) 

have considered some approaches to 

estimating conditional SEMs for 

nonlinear transformation of raw 

scores, but the role of nonlinear 

transformations in G theory is still 

largely unexplored. 

Brennan (1984) discusses a number 

of other statistical topics relevant 

to G theory, topics that are

by no means thoroughly researched 

as yet. In particular, practitioners 

need more readily available procedures 

for performing generalizability 

analyses in unbalanced 

situations.

Twenty-five years ago, in commenting 

about the future of G theory, 

Cronbach et al. (1972) stated 

that: 

Because our model treats conditions 

within a facet as unordered, 

it will not deal adequately with 

the stability of scores that are 

subject to trends, or to order 

effects arising from the measurement 

process. . . . A large contribution 

will be made by the development 

of a model for treating 

ordered facets. (p. 364) 

Such a contribution has yet to be 

made. Furthermore, Rogosa and 

Ghandour (1991) suggest that G 

theory may not be applicable to certain 

statistical models for behavioral 

observations, namely situations in

which time is a facet. Their research 

deserves further consideration, because 

it seems to provide results 

that are inconsistent with G theory 

(and other traditional psychometric 

models). 

The final paragraph of The Dependability 

of Behavioral Measurements 

(Cronbach et al., 1972, p. 388) 

states: 

Today’s reader, coming to a fully 

elaborated generalizability theory 

for the first time, no doubt finds it 

forbidding. As measurement specialists 

become accustomed to its 

language and its ways of treating 

data, this strangeness will pass. 

As the theory is put in different 

words by successive writers, it 

will be rounded into smoother 

form. As it becomes more integrated 

with other recent developments 

in error theory, and with 

the validation theory of which it 

is a part, it will become inseparable 

from the measurement theory 

of the next generation. 

The predictions of Cronbach and 

his colleagues are only partly fulfilled,

as yet, but they are coming to

pass. 

References 

Atkinson, M., & Violato, C. (1994). 

Neuroticism and coping with anger: 

The trans-situational consistency of 

coping responses. Journal of Personality 

and Individual Differences, 17, 

769-782. 

Brennan, R. L. (1975). The calculation 

of reliability from a split-plot factorial 

design. Educational and Psychological 

Measurement, 35, 779-788. 

Brennan, R. L. (1977). Generalizability 

analyses: Principles and procedures 

(ACT Technical Bulletin No. 26). Iowa 

City: American College Testing. 

Brennan, R. L. (1983). Elements of generalizability

theory. Iowa City: American 

College Testing. 

Brennan, R. L. (1984). Estimating the 

dependability of the scores. In R. A. 

Berk (Ed.), A guide to criterion-referenced 

test construction (pp. 292-334). 

Baltimore: Johns Hopkins University 

Press. 

Brennan, R. L. (1992a). Elements of generalizability 

theory (rev. ed.). Iowa 

City: American College Testing. 

Brennan, R. L. (1992b). Generalizability 

theory. Educational Measurement: 

Issues and Practice, 11(4), 27-34. 

Brennan, R. L. (1994). Variance components 

in generalizability theory. In 

C. R. Reynolds (Ed.), Cognitive assessment: 

A multidisciplinary perspective 

(pp. 175-207). New York: Plenum. 

Brennan, R. L. (1995a). The conventional 

wisdom about group mean 

scores. Journal of Educational Measurement, 

14, 385-396. 

Brennan, R. L. (1995b). Standard setting 

from the perspective of generalizability 

theory. In Proceedings of the 

joint conference on standard setting 

for large-scale assessments (Vol. II, 

pp. 269-287). Washington, DC: National 

Center for Education Statistics 

and National Assessment Governing 

Board. 

Brennan, R. L. (1996a). Conditional 

standard errors of measurement in 

generalizability theory (ITP Occasional 

Paper No. 40). Iowa City: 

University of Iowa, Iowa Testing Programs. 

Brennan, R. L. (1996b). Generalizability 

of performance assessments. In Technical 

issues in performance assessments 

(pp. 19-58). Washington, DC: 

National Center for Education Statistics. 

Brennan, R. L., Gao, X., & Colton, D. A. 

(1995). Generalizability analyses of 

Work Keys listening and writing tests. 

Educational and Psychological Measurement, 

55, 157-176. 

Brennan, R. L., & Johnson, E. G. (1995). 

Generalizability of performance assessments. 

Educational Measurement: 

Issues and Practice, 14(4), 9-12. 

Brennan, R. L., & Kane, M. T. (1977a). 

An index of dependability for mastery 

tests. Journal of Educational Measurement, 

14, 277-289. 

Brennan, R. L., & Kane, M. T. (1977b). 

Signal/noise ratios for domain-referenced 

tests. Psychometrika, 42, 

609-625. 

Brennan, R. L., & Lee, W. C. (1997). 

Conditional standard errors of measurement 

for scale scores using binomial 

and compound binomial assumptions 

(ITP Occasional Paper No. 

41). Iowa City: University of Iowa, 

Iowa Testing Programs. 

Burt, C. (1936). The analysis of examination 

marks. In P. Hartog & E. C. 

Rhodes (Eds.), The marks of examiners 

(pp. 245-314). London: Macmillan. 

Burt, C. (1955). Test reliability estimated 

by analysis of variance. British 

Journal of Statistical Psychology, 8, 

103-118. 

Cardinet, J., & Allal, L. (1983). Estimation 

of generalizability parameters. 

In L. J. Fyans (Ed.), New directions 

for testing and measurement: Generalizability 

theory: Inferences and practical 

applications (No. 18, pp. 17-48). 

San Francisco: Jossey-Bass. 

Cardinet, J., Tourneur, Y., & Allal, L. 

(1976a). The generalizability of surveys 

of educational outcomes. In D. N. 

M. de Gruijter & L. J. T. van der 

Kamp (Eds.), Advances in psychological 

and educational measurement 

(pp. 185-198). New York: Wiley. 


Cardinet, J., Tourneur, Y., & Allal, L. 

(1976b). The symmetry of generalizability 

theory: Applications to educational 

measurement. Journal of Educational 



(1981). Extensions of generalizability 

theory and its applications in educational 

measurement. Journal of Educational 


Cornfield, J., & Tukey, J. W. (1956). 

Average values of mean squares in 

factorials. Annals of Mathematical 

Statistics, 27, 907-949. 

Crick, J. E., & Brennan, R. L. (1983). 

Manual for GENOVA: A generalized 

analysis of variance system (ACT 

Technical Bulletin No. 43). Iowa City: 

American College Testing. 

Cronbach, L. J. (1947). Test “reliability”: 

Its meaning and determination. Psychometrika, 

12(1), 1-16. 

Cronbach, L. J. (1951). Coefficient alpha 

and the internal structure of tests. 

Psychometrika, 16, 292-334. 

Cronbach, L. J. (1976). On the design 

of educational measures. In D. N. M. 

de Gruijter & L. J. T. van der 

Kamp (Eds.), Advances in psychological 

and educational measurement 

(pp. 199-208). New York: Wiley. 

Cronbach, L. J. (1989). Lee J. Cronbach. 

In G. Lindzey (Ed.), A history of psychology 

in autobiography (Vol. VIII, 

pp. 63-93). Stanford: Stanford University 

Press. 

Cronbach, L. J. (1991). Methodological 

studies-A personal retrospective. In 

R. E. Snow & D. E. Wiley (Eds.), Improving 

inquiry in social science: A 

volume in honor of Lee J. Cronbach 

(pp. 385-400). Hillsdale, NJ: Erlbaum. 

Cronbach, L. J., Gleser, G. C., Nanda, 

H., & Rajaratnam, N. (1972). The dependability 

of behavioral measurements: 

Theory of generalizability for 

scores and profiles. New York: Wiley. 

(Out of print but available from Books 

on Demand) 

Cronbach, L. J., Linn, R. L., Brennan, 

R. L., & Haertel, E. (1995). Generalizability 

analysis for educational assessments 

(Evaluation comment). Los 

Angeles: University of California, 

Center for Research on Evaluation, 

Standards, and Student Testing. 

Cronbach, L. J., Linn, R. L., Brennan, 

R. L., & Haertel, E. (1997). Generalizability 

analysis for performance assessments 

of student achievement 

or school effectiveness. Educational 

and Psychological Measurement, 57, 

373-399. 

Cronbach, L. J., Rajaratnam, N., & 

Gleser, G. C. (1963). Theory of 

generalizability: A liberalization of 

reliability theory. British Journal of 

Statistical Psychology, 16, 137-163. 

Crump, S. L. (1946). The estimation of 

variance components in analysis of 

variance. Biometrics Bulletin, 2, 7-11. 

Ebel, R. L. (1951). Estimation of the reliability 

of ratings. Psychometrika, 16, 

407-424. 

Eisenhart, C. (1947). The assumptions 

underlying analysis of variance. Biometrics, 

3, 1-21. 

Erlich, O., & Borich, C. (1979). Occurrence 

and generalizability of scores on 

a classroom interaction instrument. 

Journal of Educational Measurement, 

16, 11-18. 

Erlich, O., & Shavelson, R. J. (1976). 

Application of generalizability theory 

to the study of teaching (Tech. Rep. 

No. 76-9-1). San Francisco: Far West 

Laboratory. 

Feldt, L. S., & Brennan, R. L. (1989). 

Reliability. In R. L. Linn (Ed.), Educational 

measurement (3rd ed., 

pp. 127-144). New York: Macmillan. 

Finlayson, D. S. (1951). The reliability 

of marking essays. British Journal of 

Educational Psychology, 35, 143-162. 

Fisher, R. A. (1925). Statistical methods 

for research workers. London: Oliver 

& Boyd. 

Gao, X. (1996). Sampling variability 

and generalizability of Work Keys listening 

and writing scores (ACT Research 

Report No. 96-1). Iowa City: 

ACT. 

Gao, X., Brennan, R. L., & Shavelson, R. 

J. (1994, April). Estimating generalizability 

of matrix-sampled science 

performance assessments. Paper presented 

at the Annual Meeting of the 

American Educational Research Association, 

New Orleans. 

Gao, X., Shavelson, R. J., Brennan, R. L., 

& Baxter, G. P. (1996, April). A multivariate 

generalizability theory approach 

to convergent validity of 

performance-based assessment. Paper 

presented at the Annual Meeting of 

the National Council on Measurement 

in Education, New York. 

Gillmore, G. M. (1983). Generalizability 

theory: Applications to program evaluation. 

In L. J. Fyans (Ed.), New directions 

for testing and measurement: 

Generalizability theory: Inferences 

and practical applications (No. 18, 

pp. 3-16). San Francisco: Jossey-Bass. 

Gleser, G. C., Cronbach, L. J., & Rajaratnam, 

N. (1965). Generalizability 

of scores influenced by multiple 

sources of variance. Psychometrika, 

30, 395-418. 

Gulliksen, H. (1950). Theory of mental 

tests. New York: Wiley. 

Haggard, E. A. (1958). Intraclass correlation 

and the analysis of variance. 

New York: Dryden. 

Hoyt, C. J. (1941). Test reliability estimated 

by analysis of variance. Psychometrika, 

6, 153-160. 

Jarjoura, D. (1986). An estimator of 

examinee-level measurement error 

variance that considers test form difficulty 

adjustments. Applied Psychological 

Measurement, 10, 175-186. 

Jarjoura, D., & Brennan, R. L. (1982). 

A variance components model for 

measurement procedures associated 

with a table of specifications. Applied 

Psychological Measurement, 6, 

161-171. 

Jarjoura, D., & Brennan, R. L. (1983). 

Multivariate generalizability models 

for tests developed according to a 

table of specifications. In L. J. Fyans 

(Ed.), New directions for testing and 

measurement: Generalizability theory: 

Inferences and practical applications 

(No. 18, pp. 83-101). San Francisco: 

Jossey-Bass. 

Kane, M. T. (1982). A sampling model 

for validity. Applied Psychological 


Kane, M. T., & Brennan, R. L. (1977). 

The generalizability of class means. 

Review of Educational Research, 47, 

267-292. 

Kane, M. T., Gillmore, G. M., & Crooks, 

T. J. (1976). Student evaluations of 

teaching: The generalizability of class 

means. Journal of Educational Measurement, 

13, 171-183. 

Kolen, M. J., & Jarjoura, D. (1984). Item 

profile analysis for tests developed according 

to a table of specifications. 

Applied Psychological Measurement, 

8, 219-230. 

Kuder, G. F., & Richardson, M. W. 

(1937). The theory of the estimation of 

test reliability. Psychometrika, 2, 

151-160. 

Lindquist, E. F. (1953). Design and 

analysis of experiments in psychology 

and education. Boston: Houghton 

Mifflin. 

Llabre, M. M., Ironson, G. H., Spitzer, 

S. B., Gellman, M. D., Weidler, D. J., 

& Schneiderman, N. (1988). How 

many blood pressure measurements 

are enough? An application of generalizability 

theory to the study of blood 

pressure reliability. Psychophysiology, 

25, 97-105. 

Lord, F. M. (1955). Estimating test reliability. 

Educational and Psychological 

Measurement, 15, 325-336. 

Lord, F. M. (1957). Do tests of the same 

length have the same standard errors 

of measurement? Educational 


510-521. 

Lord, F. M. (1959). Tests of the same 

length do have the same standard 

error of measurement. Educational 


233-239. 

Lord, F. M. (1962). Test reliability: A 

correction. Educational and Psychological 

Measurement, 22, 511-512. 

Loveland, E. H. (1952). Measurement of 

factors affecting test-retest reliability. 

Unpublished doctoral dissertation, 

University of Tennessee. 

Medley, D. M., Mitzel, H. E., & Doi, 

A. N. (1956). Analysis of variance 

models and their use in a three-way 

design without replication. Journal 

of Experimental Education, 24, 

221-229. 

Nussbaum, A. (1984). Multivariate generalizability 

theory in educational 

measurement: An empirical study. 

Applied Psychological Measurement, 

8, 219-230. 

Pilliner, A. E. G. (1952). The application 

of analysis of variance to problems 

of correlation. British Journal 

of Psychology, Statistical Section, 5, 

31-38. 

Rajaratnam, N., Cronbach, L. J., & 

Gleser, G. C. (1965). Generalizability 

of stratified-parallel tests. Psychometrika, 

30, 39-56. 

Rogosa, D., & Ghandour, G. (1991). 

Statistical models for behavioral 

observations. Journal of Educational 

Statistics, 3, 157-252. 

Saab, P. G., Llabre, M. M., Hurwitz, 

B. E., Frame, C. A., Reineke, L. J., Fins, 

A. I., McCalla, J., Cieply, L. K., & 

Schneiderman, N. (1992). Myocardial 

and peripheral vascular responses to 

behavioral challenges and their stability 

in black and white Americans. 

Psychophysiology, 29, 384-397. 

Shavelson, R. J., Baxter, G. P., & Gao, 

X. (1993). Sampling variability of 

performance assessments. Journal 

of Educational Measurement, 30, 

215-232. 

Shavelson, R. J., Baxter, G. P., & Pine, 

J. (1991). Performance assessments 

in science. Applied Measurement in 

Education, 4, 347-362. 

Shavelson, R. J., Baxter, G. P., & Pine, 

J. (1992). Performance assessments: 

The rhetoric and reality. Educational 

Researcher, 21(4), 22-27. 

Shavelson, R. J., & Webb, N. M. (1981). 

Generalizability theory: 1973-1980. 

British Journal of Mathematical and 

Statistical Psychology, 34, 133-166. 

Shavelson, R. J., & Webb, N. M. (1991). 

Generalizability theory: A primer. 

Newbury Park, CA: Sage. 

Shavelson, R. J., Webb, N. M., & 

Rowley, G. L. (1989). Generalizability 

theory. American Psychologist, 6, 

922-932. 

Smith, P. L. (1978). Sampling errors of 

variance components in small sample 

generalizability studies. Journal 

of Educational Statistics, 3, 319-346. 

Smith, P. L. (1981). Gaining accuracy in 

generalizability theory: Using multiple 

designs. Journal of Educational 

Measurement, 18, 147-154. 

Smith, P. L. (1982). A confidence interval 

approach for variance component 

estimates in the context of 

generalizability theory. Educational 


459-466. 

Webb, N. M., Rowley, G. L., & Shavelson, 

R. J. (1988). Using generalizability 

theory in counseling and 

development. Measurement and Evaluation 

in Counseling and Development, 

21, 81-90. 

Webb, N. M., Shavelson, R. J., Kim, K. 

S., & Chen, Z. (1989). Reliability (generalizability) 

of job performance measurements: 

Navy machinist mates. 

Military Psychology, 1, 91-110. 

Webb, N. M., Shavelson, R. J., & Maddahian, 

E. (1983). Multivariate generalizability 

theory. In L. J. Fyans (Ed.), 

New directions for testing and measurement: 

Generalizability theory: Inferences 

and practical applications 

(No. 18, pp. 67-81). San Francisco: 

Jossey-Bass.
