
Kelley, T. L. (1923). Statistical method. New York: Macmillan.

Kelley, T. L. (1942). The reliability coefficient. Psychometrika, 7, 75-83.

Kuder, G. F., & Richardson, M. W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151-160.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions, A, 187, 252-318.

Pearson, K. (1904). On the laws of inheritance in man. II. On the inheritance of the mental and moral characters in man, and its comparison with the inheritance of physical characters. Biometrika, 3, 131-190.

Pearson, K. (1930). The life, letters, and labours of Francis Galton. Vol. IIIA. Correlation, personal identification and eugenics. Cambridge: The University Press.

Pearson, K., & Lee, A. (1903). On the laws of inheritance in man. I. Inheritance of physical characters. Biometrika, 2, 357-462.

Read, C. B. (1985). Normal distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 6, pp. 347-359). Toronto: Wiley.

Richardson, M. W. (1936). Notes on the rationale of item analysis. Psychometrika, 1(1), 69-76.

Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.

Sheynin, O. B. (1968). On the early history of the law of large numbers. Biometrika, 55, 459-467.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 160-169.

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.

Thurstone, L. L. (1932). The reliability and validity of tests. Ann Arbor, MI: N. p.

Venn, J. (1888). The logic of chance (3rd ed.). London: Macmillan.

Walker, H. M. (1929). Studies in the history of statistical method. Baltimore: Williams & Wilkins.

A Perspective on the History of Generalizability Theory

Robert L. Brennan

University of Iowa

What psychometric and scientific perspectives influenced the development of G theory? What practical testing problems gave impetus to its adoption? What work remains to be done?

Educational Measurement: Issues and Practice, Winter 1997

Robert L. Brennan is Lindquist Professor of Educational Measurement and Director of the Iowa Testing Programs, University of Iowa, 334A Lindquist Center, Iowa City, IA 52242. His specializations are generalizability theory, equating, and scaling.

Overviews of various parts of the history of generalizability (G) theory are provided elsewhere. An indispensable starting point is the preface and parts of the first chapter of Cronbach, Gleser, Nanda, and Rajaratnam (1972), entitled The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. The Cronbach et al. monograph is still the most definitive treatment of G theory. Shavelson and Webb (1981) review the G theory literature from 1973-1980, and Shavelson, Webb, and Rowley (1989) cover additional contributions in the 1980s. A very brief historical overview is provided by Brennan (1983, 1992a, pp. 1-2). In addition, Cronbach (1976, 1989, 1991) offers numerous perspectives on G theory and its history. Cronbach (1991) is particularly rich with first-person reflections.

This historical overview is not intended to repeat everything already covered in published reviews, although a summary is provided. Parts of this article are based largely on my personal experience with G theory. Consequently, this article provides a somewhat idiosyncratic perspective on the history of G theory and what I perceive as unfinished work for the theory. Almost certainly, other reviewers would see the landscape somewhat differently.

Theory Development and Enabling Work

In discussing the genesis of G theory, Cronbach (1991) states:

In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser's collaboration, a kind of handbook of measurement theory. . . . “Since reliability has been studied thoroughly and is now understood,” I suggested to the team, “let us devote our first few weeks to outlining that section of the handbook, to get a feel for the undertaking.” We learned humility the hard way-the enterprise never got past that topic. Not until 1972 did the book appear (Cronbach, Gleser, Nanda, & Rajaratnam) that exhausted our findings on reliability reinterpreted as generalizability. Even then, we did not exhaust the topic. When we tried initially to summarize prominent, seemingly transparent, convincingly argued papers on test reliability, the messages conflicted. (pp. 391-392)

To resolve these conflicts, Cronbach and his colleagues devised a rich conceptual framework and married it to analysis of random effects variance components. The net effect is “a tapestry that interweaves ideas from at least two dozen authors” (Cronbach, 1991, p. 394).

It is not uncommon for G theory to be described as the application of analysis of variance (ANOVA) to classical test theory. This characterization of the theory is inadequate, at best, and probably more misinformative than useful, except in one respect. It does correctly suggest that the parents of G theory can be viewed as classical test theory and analysis of variance. The G theory child, however, is both more and less than the simple conjunction of its parents. In particular, G theory is not a replacement for classical theory, although it does liberalize the theory. Also, not all of ANOVA is relevant to G theory; indeed, some perspectives on ANOVA are inconsistent with G theory (see Brennan, 1984).

The statistical machinery employed in G theory has its genesis in Fisher’s (1925) work on factorial designs. However, G theory has no substantive role for hypothesis testing. Rather, it emphasizes the estimation of random effects variance components, a subject that was researched by statisticians in the late 1940s (see, e.g., Crump, 1946, and particularly Eisenhart, 1947). This research was brought to Cronbach’s attention by a graduate student, Milton Meux, about 1957 (L. J. Cronbach, personal communication, April 18, 1997) at approximately the same time that Cornfield and Tukey (1956) published their rules for expressing expected mean square equations in terms of variance components.

By 1950, there was a rich literature on reliability from the perspective of classical test theory. Most of this literature had been superbly summarized by Gulliksen (1950), which included chapters on experimental methods for estimating reliability, as well as reliability estimated by item homogeneity, what came to be called internal consistency estimates. Such estimates included, of course, Hoyt’s (1941) ANOVA version of Kuder and Richardson’s (1937) KR20 index. It is not quite true, however, that Hoyt was the first to apply ANOVA to measurement problems. An earlier contribution was made by Burt (1936) in his treatment of the analysis of examination marks.

Gulliksen’s (1950) book was published before Cronbach’s widely cited 1951 article that introduced Coefficient α. For the next several years, a great deal of research on reliability formed the backdrop for G theory. Finlayson’s (1951) study of grades assigned to essays was probably the first treatment of reliability in terms of variance components. Shortly thereafter Pilliner (1952) provided theoretical relations between intraclass correlations and ANOVA (see also Haggard, 1958).

Cronbach (1947) had expressed the concern that some type of multifacet analysis was needed to resolve inconsistencies in some estimates of reliability. The 1950s were years in which various researchers began to exploit the fact that ANOVA could handle multiple facets simultaneously. Particular examples include Loveland’s (1952) doctoral dissertation, work by Medley, Mitzel, and Doi (1956) on classroom observations, and Burt’s (1955) treatment of test reliability estimated by analysis of variance. Most importantly, Lindquist (1953, chap. 16) laid out an extensive exposition of multifacet theory that focused on the estimation of variance components in reliability studies. Lindquist demonstrated that multifacet analyses lead to alternative definitions of error and reliability coefficients. Lindquist’s chapter clearly foreshadowed important parts of G theory.

Cronbach was on the faculty at the University of Chicago from 1946 to 1948. He recalls that:

Five minutes with Joseph Schwab had a profound influence. . . . In some context Schwab remarked that biologists have to decide what to count as a species. . . . Schwab was acute enough to catch my flicker of surprise and force home the idea of scientist as construer rather than as discoverer of categories the Creator had in mind. That conversation . . . resonates in my thinking to this day. (Cronbach, 1989, p. 72, italics added)

Given this perspective, it is not surprising that G theory requires that investigators define the conditions of measurement of interest to them. The theory effectively disavows any notion of there being a correct set of conditions of measurement, but it is clear that the particular tasks or items used are not a sufficient specification of a measurement procedure. These notions are central to the conceptual framework of G theory, but they are not entirely novel.

Guttman once made the provocative remark that a test belongs to several sets, and therefore has several reliabilities. “List as many 4-letter words that begin with t as you can.” That word-fluency task fits into at least three families: 4-letter words beginning with a specified letter, t words of a specified length, and 4-letter words with t in a specified position. The investigator’s theory, rather than an abstract concept of truth and error, determines which family contains tests that “measure the same variable.” (Cronbach, 1991, p. 394)

In 1951, Ebel published an article on the reliability of ratings in which he essentially considered two types of error variance: one that included, and another that excluded, rater main effects. In the process of doing so, Ebel also considered single-facet crossed and nested designs. It wasn’t until G theory was fully formulated that the issues Ebel grappled with were truly clarified in the distinction between relative (δ) and absolute (Δ) error for various designs. Very much the same problems were considered by Lord (1955, 1957, 1959) in a classic series of articles about conditional standard errors of measurement (SEMs) and reliability under the assumptions of what came to be called the binomial error model (see also Lord, 1962). In effect, the rater main effects in Ebel’s article play the role of the item main effects in Lord’s articles. In addition, Lord’s articles clearly specify what came to be called randomly parallel tests.
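
The difference between an error variance that excludes rater main effects and one that includes them can be made concrete with a small numerical sketch. What follows is an illustration only, not a procedure taken from this article: it assumes a fully crossed persons-by-raters design with one score per cell, solves the standard random effects expected mean square equations for the variance components, and then forms a relative error variance (rater main effect excluded) and an absolute error variance (rater main effect included). The function names and the ratings matrix are hypothetical.

```python
def variance_components_p_by_r(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores: list of rows, one row per person, one column per rater,
    with exactly one observation in each cell (an assumption of this sketch).
    """
    n_p = len(scores)
    n_r = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Sums of squares for persons, raters, and the residual
    # (person-by-rater interaction confounded with other error).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pr = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Solve the expected mean square equations for the random effects model.
    # Estimates can be negative in small samples; they are returned unchanged.
    return {
        "p": (ms_p - ms_pr) / n_r,
        "r": (ms_r - ms_pr) / n_p,
        "pr,e": ms_pr,
    }


def error_variances(vc, n_raters):
    """Return (relative, absolute) error variance for a mean over n_raters."""
    relative = vc["pr,e"] / n_raters                 # rater main effect excluded
    absolute = (vc["r"] + vc["pr,e"]) / n_raters     # rater main effect included
    return relative, absolute


# Hypothetical ratings: 4 persons (rows) scored by 3 raters (columns).
ratings = [[7, 5, 6], [4, 3, 5], [8, 6, 9], [5, 4, 4]]
components = variance_components_p_by_r(ratings)
relative, absolute = error_variances(components, n_raters=3)
print(components)
print(relative, absolute)
```

When raters differ in severity, the rater component is positive and the absolute error variance exceeds the relative one; when only the rank ordering of persons matters, the relative error variance is the relevant quantity.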

The issues Lord was grappling with had a clear influence on the development of G theory. According to Cronbach (personal communication, 1996), about 1957, Lord visited the Cronbach team in Urbana. Their discussions suggested that the error in Lord’s formulation of the binomial error model (which treated one person at a time, that is, a completely nested design) could not be the same error as that in classical theory for a crossed design. (Lord basically acknowledges this in his 1962 article.) This insight was eventually captured in the distinction between δ and Δ in G theory, and it illustrated that errors of measurement are influenced by the choice of design. Lord’s binomial error model is probably best known as a simple way to estimate conditional SEMs and as an important precursor to strong true score theory, but it is also associated with important insights that became an integral part of G theory.

The genius of Cronbach and his colleagues was their creation of a conceptual framework and use of a methodology (variance components analysis) that integrated the contributions of numerous researchers, even when some contributions seemed to conflict with one another.

The essential features of univariate G theory were largely completed with technical reports in 1960-1961, each with a different first author. These were revised into three journal articles, each with a different first author (Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965; and Rajaratnam, Cronbach, & Gleser, 1965). In 1964 Cronbach moved to Stanford. Shortly thereafter, Harinder Nanda’s studies on interbattery reliability provided part of the motivation for the development of multivariate G theory (considered later). This very major extension of the univariate model is part of the reason it took more than 10 years after the 1960-1961 reports for Cronbach et al. (1972) to appear in print. It is still the most intensive and extensive treatment of G theory.

Applications and Extensions With Some Personal Reflections

Many investigators who come to employ generalizability theory in their research do so only after concluding that more conventional approaches seem inadequate. That was indeed the motivation that led me to generalizability theory. In the late 1960s and early 1970s, I served as a consultant on evaluations of the Head Start and Follow Through Programs, and the National Day Care Study. A distinguishing common characteristic of these studies was that the treatments were applied to whole classrooms and evaluated using certain measurement procedures. A very natural question to ask, then, was, “How shall we estimate the reliability of classroom mean scores for these measurement procedures?” A number of discussions convinced many of us that the problem was not getting an estimate; rather, the problem was that we had too many estimates, and no obvious way to choose among them.

Early in the summer of 1972, I set myself the goal of resolving this paradox by the end of the summer. It did not take that long. The library at SUNY at Stony Brook, where I was a beginning assistant professor, had a brand new book entitled The Dependability of Behavioral Measurements (Cronbach et al., 1972). After I had studied it night and day for a week, the answer was obvious: the different estimates we were getting were related to different universes of generalization when class means were the objects of measurement. This insight eventually led to my first publication on generalizability theory (Brennan, 1975). Shortly thereafter, Michael Kane joined the faculty of education at Stony Brook, and I discovered that he and some of his former colleagues at the University of Illinois had been working on exactly the same problem in the context of student evaluations of teaching (see, e.g., Kane, Gillmore, & Crooks, 1976). Our common interest in this problem led to a joint article (Kane & Brennan, 1977).

The Cronbach et al. (1972) formulation of G theory was general enough to permit any set of conditions (e.g., persons, classes, items) to be the objects of measurement facet. In that sense, the work on class means in the early-to-mid-1970s was more of an illustration than a substantive contribution to the theory. In a series of articles about the symmetry of G theory, Cardinet and his colleagues emphasized the role that facets other than persons might play as objects of measurement (e.g., Cardinet & Allal, 1983; Cardinet, Tourneur, & Allal, 1976a, 1976b, 1981).

At the same time that Kane and I were working on our class means article, we were intrigued with the idea of using generalizability theory to address issues surrounding the reliability of criterion-referenced (or domain-referenced) scores, which was a very hot topic in the early-to-mid-1970s. Our initial forays into this area (Brennan & Kane, 1977a, 1977b) were based on a very simple idea: use absolute error rather than relative error in defining indices and signal-noise ratios. This work was later summarized and somewhat extended by Brennan (1984). The research that Kane and I did on domain-referenced scores and class means was so clearly co-equal that we flipped a coin to decide on first authorship. To follow blindly the alphabetize-by-last-name convention would have grossly misrepresented our relative contributions.

In 1981, Shavelson and Webb published a review of G theory for the years 1973-1980. Actually, their article is much more than a review of 8 years of literature; it is also an excellent summary of G theory that is highly relevant and readable today. Only some of the work they review has been discussed here.

By the late 1970s, I had read Cronbach et al. (1972) cover-to-cover three times, but parts of it still challenged me. I agreed with their statement that “the book is complexly organized and by no means simple to follow” (Cronbach et al., 1972, p. 3). It seemed likely to me that this complexity was at least partly the reason why relatively few generalizability studies were being conducted. I decided to try to publicize, teach, and simplify generalizability theory for graduate students and measurement practitioners. At about this time, with the assistance of Kane and Gillmore (and later Noreen Webb and Xiaohong Gao), I began an every-other-year training session on G theory for the AERA and NCME Annual Meetings.

My first effort at writing a simpler treatment of G theory (Brennan, 1977) was a paper that was rejected by a major journal; the editor described it as being “too propaedeutic.” Just about that time Jay Millman, who was then president of NCME, asked me to consider writing a monograph on generalizability theory for publication by NCME. With the encouragement of Michael Kane and David Jarjoura, I agreed, but, when I completed the monograph almost 3 years later, NCME was no longer interested in publishing it! ACT, however, did publish Elements of Generalizability Theory (Brennan, 1983).

I had long felt that a simpler treatment of G theory was not enough to get the theory used more widely by practitioners. They also needed a computer program. So, at the same time I was writing Elements of Generalizability Theory, I was designing a computer program called GENOVA (Crick & Brennan, 1983) that would be coordinated with the monograph. My computer skills were not adequate for programming GENOVA, however. That task was undertaken by Joe Crick, a colleague from graduate school at Harvard, who somehow managed to translate my math and handwritten input-output layouts into workable FORTRAN code while serving as Director of the Computing Center at the University of Massachusetts, Boston.

Several expositions of G theory were published in the late 1980s and early 1990s, all of which are briefer and less demanding than Cronbach et al. (1972) or Brennan (1983, 1992a). Shavelson, Webb, and Rowley (1989) provided a particularly readable journal article that summarizes G theory, and in the same year Feldt and Brennan (1989) devoted about one third of their chapter on reliability to G theory. In 1991, Shavelson and Webb published a relatively short monograph entitled Generalizability Theory: A Primer. Brennan (1992b) provided a very brief introduction intended primarily for classroom use.

Interest in performance testing in the late 1980s led to a mini-boom in generalizability analyses and considerably greater publicity for G theory. It seemed evident to practitioners that G theory was eminently well-suited to analyzing scores from such tests. In particular, practitioners realized that understanding the results of a performance test necessitated grappling with two or more facets simultaneously, especially tasks and raters. The relevance of G theory in such contexts is especially well illustrated by Richard Shavelson and his colleagues in a series of presentations and articles involving science and mathematics performance assessments, in particular (see, e.g., Gao, Brennan, & Shavelson, 1994; Shavelson, Baxter, & Gao, 1993; Shavelson, Baxter, & Pine, 1991, 1992). Also, Brennan and Johnson (1995) and Brennan (1996b) consider some theoretical and applied issues in performance testing from the perspective of G theory.

New assessments such as performance tests recently motivated Cronbach, Linn, Brennan, and Haertel (1995) to state: “Assessments depart from traditional measurements in ways that require extensions and modifications of generalizability analysis. . . . Assessments pose problems that reach beyond available psychometric theory” (p. 1). The Cronbach et al. (1995) report and a recent journal article revision (Cronbach, Linn, Brennan, & Haertel, 1997) suggest a number of problems that need to be researched, and they propose some recommended solutions. These articles emphasize the importance of estimates of absolute standard errors of measurement for many of the types of decisions that are typically made with performance assessments. Also, these articles urge that an analysis of error for group means explicitly recognize that pupils are nested in classes and schools. Whether to treat pupils as fixed or random in such analyses is discussed in some detail (see also Brennan, 1995a).

In their 1972 monograph, Cronbach and his colleagues illustrated the applicability of G theory largely by reanalyzing some already published data in the psychology and education literature. Since 1972, in addition to topics already cited in this overview, G theory has been used to study issues such as classroom teaching (e.g., Erlich & Borich, 1979; Erlich & Shavelson, 1976); program evaluation (e.g., Gillmore, 1983); the use of tables of specifications in educational testing (e.g., Jarjoura & Brennan, 1982, 1983; Kolen & Jarjoura, 1984); counseling and development (Webb, Rowley, & Shavelson, 1988); setting performance standards (Brennan, 1995b); job performance (Webb, Shavelson, Kim, & Chen, 1989); neuroticism and coping with anger (Atkinson & Violato, 1994); and aspects of physiology, including blood pressure (Llabre et al., 1988; Saab et al., 1992).

Unfinished Work

G theory has a protean quality. The procedures and even the issues take on a new form in every context. G theory enables you to ask your questions better; what is most significant for you cannot be supplied from the outside. (Cronbach, 1976, p. 199)

In this sense, G theory is a continuous work in progress, and none of the research reviewed here can be deemed complete. Still, there are some important theoretical and statistical topics that clearly need to be addressed more fully than they have been, and there are potential areas of application where the theory has been largely unused as yet.

Although G theory has been applied in a number of contexts, the coverage is not balanced, and one might expect that after 25 years many more generalizability analyses would have been conducted than are reported in the literature. Most published generalizability analyses are in the education literature, perhaps because those who are most knowledgeable about G theory tend to be employed in colleges of education, educational testing companies, and related organizations. Clearly, however, G theory has potential applicability wherever measurement procedures are employed. In particular, G theory seems very much underutilized in psychological and medical areas.

It is often stated that G theory<br />

“blurs the distinction between reliability<br />

<strong>and</strong> validity” (Cronbach et al.,<br />

1972, p. 380). Yet, very little of the G<br />

theory literature directly addresses<br />

validation issues. A notable exception<br />

is Kane’s (1982) treatment of “A<br />

Sampling Model for Validity,” which<br />

is clearly one of the major theoretical<br />

contributions to the literature<br />

on G theory in the last 25 years. In<br />

his article, Kane clearly begins to<br />

make explicit links between G theory<br />

and issues traditionally subsumed<br />

under validity. Still, many of<br />

the contributions that G theory<br />

probably could make to the validation<br />

of particular measurement procedures<br />

are unexplored, <strong>and</strong> it<br />

seems reasonable to speculate that<br />

more theoretical contributions are<br />

possible.<br />

By the early 1960s, Cronbach and<br />

his colleagues had pretty much completed<br />

their development of univariate<br />

G theory. It provided a coherent<br />

framework for considering most, if<br />

not all, of the reliability literature<br />

that had been developed to that<br />

time. About 1966, they began work<br />

on multivariate G theory, in which<br />

each of the levels of one or more<br />

fixed facets is associated with a distinct<br />

universe score. Although it<br />

might be claimed that not all of univariate<br />

G theory is novel, multivariate<br />

G theory (the generalizability of<br />

profiles) is clearly a unique contribution<br />

of Cronbach and his colleagues<br />

(Cronbach et al., 1972,<br />

chapters 9 and 10). In commenting<br />

on multivariate G theory, Cronbach<br />

has stated:<br />

Despite the long-standing interest<br />

Gleser and I had in profiles,<br />

all of G theory down to 1966 considered<br />

one score at a time. . . . A<br />

decade of work was required to<br />

expose the twists and turns of the<br />

simpler univariate multifacet<br />

theory, so surely much multivariate<br />

theory remains to be developed.<br />

(Cronbach, 1991, p. 394)<br />

Shavelson and Webb (1981) in<br />

their review of G theory discuss<br />

some developments in multivariate<br />

G theory since the Cronbach et al.<br />


(1972) monograph. Since their review,<br />

there have been other articles<br />

published on the subject (e.g., Brennan,<br />

Gao, & Colton, 1995; Gao,<br />

Shavelson, Brennan, & Baxter,<br />

1996; Jarjoura & Brennan, 1982,<br />

1983; Kolen & Jarjoura, 1984; Nußbaum,<br />

1984; Webb, Shavelson, &<br />

Maddahian, 1983). Also, Brennan<br />

(1983, 1992a) and Shavelson, Webb,<br />

and Rowley (1989) provide illustrative<br />

multivariate analyses. However,<br />

it is still true that “much multivariate<br />

theory remains to be developed”<br />

(Cronbach, 1991, p. 394).<br />

In my opinion, the conceptual<br />

framework of G theory is more central,<br />

and likely to be more enduring,<br />

than the statistical machinery<br />

used to carry out generalizability<br />

analyses. However, the statistical<br />

procedures are still important.<br />

Since estimates of variance components<br />

are so central, any issue associated<br />

with such estimates is of<br />

particular concern. For example,<br />

the stability of estimated variance<br />

components was considered by<br />

Cronbach et al. (1972) and subsequently<br />

studied by Smith (1978,<br />

1981, 1982), Brennan (1994), and<br />

Gao (1996) among others.<br />

It has long been recognized that<br />

conditional SEMs are not constant<br />

for all examinees. Lord’s (1957,<br />

1959) articles provide perhaps the<br />

best known formula for conditional<br />

SEMs, a formula based on an absolute<br />

definition of error. Conditional,<br />

relative-error SEMs in G<br />

theory were considered by Jarjoura<br />

(1986). Recently, Brennan (1996a)<br />

has extended the work of Lord and<br />

Jarjoura, but much more research<br />

remains to be done.<br />
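As a rough illustration of why conditional SEMs differ across examinees, the sketch below computes the binomial-error form commonly attributed to Lord: for a test of n dichotomously scored items and a number-correct score x, the estimated conditional SEM is the square root of x(n - x)/(n - 1). The function name and the 40-item example are illustrative assumptions, not taken from the article, and the sketch does not cover the relative-error or G theory extensions mentioned above.

```python
import math


def lord_conditional_sem(num_correct: int, num_items: int) -> float:
    """Conditional SEM for a number-correct score under a binomial error model.

    Hypothetical helper for illustration only. Assumes dichotomous (0/1)
    items and an absolute definition of error.
    """
    if num_items < 2:
        raise ValueError("num_items must be at least 2")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must lie between 0 and num_items")
    # Estimated conditional error variance: x(n - x) / (n - 1)
    error_variance = num_correct * (num_items - num_correct) / (num_items - 1)
    return math.sqrt(error_variance)


if __name__ == "__main__":
    n = 40  # assumed test length, chosen only for the example
    for x in (0, 10, 20, 30, 40):
        print(x, round(lord_conditional_sem(x, n), 3))
```

Under this formula the conditional SEM is zero at the extreme scores and largest for mid-range scores, which is one simple way to see that a single overall SEM cannot describe every examinee.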

Almost all of G theory and its applications<br />

to date effectively assume<br />

that the scores used to make decisions<br />

about the objects of measurement<br />

(usually examinees) are raw<br />

scores or linear transformations of<br />

raw scores. Often, however, the<br />

scale scores actually used are nonlinear<br />

transformations, and there is<br />

no necessary reason to believe that<br />

results based on a generalizability<br />

analysis of raw scores are directly<br />

relevant for such scale scores. One<br />

common example is the conversion<br />

of raw scores on tasks to “pass/not pass”<br />

status on an assessment (see<br />

Cronbach et al., 1995, 1997). Recently,<br />

Brennan and Lee (1997)<br />

have considered some approaches to<br />

estimating conditional SEMs for<br />

nonlinear transformation of raw<br />

scores, but the role of nonlinear<br />

transformations in G theory is still<br />

largely unexplored.<br />

Brennan (1984) discusses a number<br />

of other statistical topics relevant<br />

to G theory, topics that are<br />

by no means thoroughly researched<br />

as yet. In particular, practitioners<br />

need more readily available procedures<br />

for performing generalizability<br />

analyses in unbalanced<br />

situations.<br />

Twenty-five years ago, in commenting<br />

about the future of G theory,<br />

Cronbach et al. (1972) stated<br />

that:<br />

Because our model treats conditions<br />

within a facet as unordered,<br />

it will not deal adequately with<br />

the stability of scores that are<br />

subject to trends, or to order<br />

effects arising from the measurement<br />

process. . . . A large contribution<br />

will be made by the development<br />

of a model for treating<br />

ordered facets. (p. 364)<br />

Such a contribution has yet to be<br />

made. Furthermore, Rogosa and<br />

Ghandour (1991) suggest that G<br />

theory may not be applicable to certain<br />

statistical models for behavioral<br />

observations, namely situations in<br />

which time is a facet. Their research<br />

deserves further consideration, because<br />

it seems to provide results<br />

that are inconsistent with G theory<br />

(and other traditional psychometric<br />

models).<br />

The final paragraph of The Dependability<br />

of Behavioral Measurements<br />

(Cronbach et al., 1972, p. 388)<br />

states:<br />

Today’s reader, coming to a fully<br />

elaborated generalizability theory<br />

for the first time, no doubt finds it<br />

forbidding. As measurement specialists<br />

become accustomed to its<br />

language and its ways of treating<br />

data, this strangeness will pass.<br />

As the theory is put in different<br />

words by successive writers, it<br />

will be rounded into smoother<br />

form. As it becomes more integrated<br />

with other recent developments<br />

in error theory, and with<br />

the validation theory of which it<br />

is a part, it will become inseparable<br />

from the measurement theory<br />

of the next generation.<br />

The predictions of Cronbach and<br />

his colleagues are only partly fulfilled,<br />

as yet, but they are coming to<br />

pass.<br />

References<br />

Atkinson, M., & Violato, C. (1994).<br />

Neuroticism and coping with anger:<br />

The trans-situational consistency of<br />

coping responses. Journal of Personality<br />

and Individual Differences, 17,<br />

769-782.<br />

Brennan, R. L. (1975). The calculation<br />

of reliability from a split-plot factorial<br />

design. Educational and Psychological<br />

Measurement, 35, 779-788.<br />

Brennan, R. L. (1977). Generalizability<br />

analyses: Principles and procedures<br />

(ACT Technical Bulletin No. 26). Iowa<br />

City: American College Testing.<br />

Brennan, R. L. (1983). Elements of generalizability<br />

theory. Iowa City: American<br />

College Testing.<br />

Brennan, R. L. (1984). Estimating the<br />

dependability of the scores. In R. A.<br />

Berk (Ed.), A guide to criterion-referenced<br />

test construction (pp. 292-334).<br />

Baltimore: Johns Hopkins University<br />

Press.<br />

Brennan, R. L. (1992a). Elements of generalizability<br />

theory (rev. ed.). Iowa<br />

City: American College Testing.<br />

Brennan, R. L. (1992b). Generalizability<br />

theory. Educational Measurement:<br />

Issues <strong>and</strong> Practice, 11(4), 27-34.<br />

Brennan, R. L. (1994). Variance components<br />

in generalizability theory. In<br />

C. R. Reynolds (Ed.), Cognitive assessment:<br />

A multidisciplinary perspective<br />

(pp. 175-207). New York: Plenum.<br />

Brennan, R. L. (1995a). The conventional<br />

wisdom about group mean<br />

scores. Journal of Educational Measurement,<br />

14, 385-396.<br />

Brennan, R. L. (1995b). Standard setting<br />

from the perspective of generalizability<br />

theory. In Proceedings of the<br />

joint conference on standard setting<br />

for large-scale assessments (Vol. II,<br />

pp. 269-287). Washington, DC: National<br />

Center for Education Statistics<br />

and National Assessment Governing<br />

Board.<br />

Brennan, R. L. (1996a). Conditional<br />

standard errors of measurement in<br />

generalizability theory (ITP Occasional<br />

Paper No. 40). Iowa City:<br />

University of Iowa, Iowa Testing Programs.<br />

Brennan, R. L. (1996b). Generalizability<br />

of performance assessments. In Technical<br />

issues in performance assessments<br />

(pp. 19-58). Washington, DC:<br />

National Center for Education Statistics.<br />

Brennan, R. L., Gao, X., & Colton, D. A.<br />

(1995). Generalizability analyses of<br />

work keys listening and writing tests.<br />

Educational and Psychological Measurement,<br />

55, 157-176.<br />

Brennan, R. L., & Johnson, E. G. (1995).<br />

Generalizability of performance assessments.<br />

Educational Measurement:<br />

Issues and Practice, 14(4), 9-12.<br />

Brennan, R. L., & Kane, M. T. (1977a).<br />

An index of dependability for mastery<br />

tests. Journal of Educational Measurement,<br />

14, 277-289.<br />

Brennan, R. L., & Kane, M. T. (1977b).<br />

Signal/noise ratios for domain-referenced<br />

tests. Psychometrika, 42,<br />

609-625.<br />

Brennan, R. L., & Lee, W. C. (1997).<br />

Conditional standard errors of measurement<br />

for scale scores using binomial<br />

and compound binomial assumptions<br />

(ITP Occasional Paper No.<br />

41). Iowa City: University of Iowa,<br />

Iowa Testing Programs.<br />

Burt, C. (1936). The analysis of examination<br />

marks. In P. Hartog & E. C.<br />

Rhodes (Eds.), The marks of examiners<br />

(pp. 245-314). London: Macmillan.<br />

Burt, C. (1955). Test reliability estimated<br />

by analysis of variance. British<br />

Journal of Statistical Psychology, 8,<br />

103-118.<br />

Cardinet, J., & Allal, L. (1983). Estimation<br />

of generalizability parameters.<br />

In L. J. Fyans (Ed.), New directions<br />

for testing and measurement: Generalizability<br />

theory: Inferences and practical<br />

applications (No. 18, pp. 17-48).<br />

San Francisco: Jossey-Bass.<br />

Cardinet, J., Tourneur, Y., & Allal, L.<br />

(1976a). The generalizability of surveys<br />

of educational outcomes. In D. N.<br />

M. de Gruijter & L. J. T. van der<br />

Kamp (Eds.), Advances in psychological<br />

and educational measurement<br />

(pp. 185-198). New York: Wiley.<br />

Cardinet, J., Tourneur, Y., & Allal, L.<br />

(1976b). The symmetry of generalizability<br />

theory: Applications to educational<br />

measurement. Journal of Educational<br />

Measurement, 13, 119-135.<br />

Cardinet, J., Tourneur, Y., & Allal, L.<br />

(1981). Extensions of generalizability<br />

theory and its applications in educational<br />

measurement. Journal of Educational<br />

Measurement, 18, 183-204.<br />

Cornfield, J., & Tukey, J. W. (1956).<br />

Average values of mean squares in<br />

factorials. Annals of Mathematical<br />

Statistics, 27, 907-949.<br />

Crick, J. E., & Brennan, R. L. (1983).<br />

Manual for GENOVA: A generalized<br />

analysis of variance system (ACT<br />

Technical Bulletin No. 43). Iowa City:<br />

American College Testing.<br />

Cronbach, L. J. (1947). Test “reliability”:<br />

Its meaning and determination. Psychometrika,<br />

12(1), 1-16.<br />

Cronbach, L. J. (1951). Coefficient alpha<br />

and the internal structure of tests.<br />

Psychometrika, 16, 292-334.<br />

Cronbach, L. J. (1976). On the design<br />

of educational measures. In D. N. M.<br />

de Gruijter & L. J. T. van der<br />

Kamp (Eds.), Advances in psychological<br />

and educational measurement<br />

(pp. 199-208). New York: Wiley.<br />

Cronbach, L. J. (1989). Lee J. Cronbach.<br />

In G. Lindzey (Ed.), A history of psychology<br />

in autobiography (Vol. VIII,<br />

pp. 63-93). Stanford: Stanford University<br />

Press.<br />

Cronbach, L. J. (1991). Methodological<br />

studies-A personal retrospective. In<br />

R. E. Snow & D. E. Wiley (Eds.), Improving<br />

inquiry in social science: A<br />

volume in honor of Lee J. Cronbach<br />

(pp. 385-400). Hillsdale, NJ: Erlbaum.<br />

Cronbach, L. J., Gleser, G. C., N<strong>and</strong>a,<br />

H., & Rajaratnam, N. (1972). The dependability<br />

of behavioral measurements:<br />

Theory of generalizability for<br />

scores and profiles. New York: Wiley.<br />

(Out of print but available from Books<br />

on Demand)<br />

Cronbach, L. J., Linn, R. L., Brennan,<br />

R. L., & Haertel, E. (1995). Generalizability<br />

analysis for educational assessments<br />

(Evaluation comment). Los<br />

Angeles: University of California,<br />

Center for Research on Evaluation,<br />

Standards, and Student Testing.<br />

Cronbach, L. J., Linn, R. L., Brennan,<br />

R. L., & Haertel, E. (1997). Generalizability<br />

analysis for performance assessments<br />

of student achievement<br />

or school effectiveness. Educational<br />

and Psychological Measurement, 57,<br />

373-399.<br />

Cronbach, L. J., Rajaratnam, N., &<br />

Gleser, G. C. (1963). Theory of<br />

generalizability: A liberalization of<br />

reliability theory. British Journal of<br />

Statistical Psychology, 16, 137-163.<br />

Crump, S. L. (1946). The estimation of<br />

variance components in analysis of<br />

variance. Biometrics Bulletin, 2, 7-11.<br />

Ebel, R. L. (1951). Estimation of the reliability<br />

of ratings. Psychometrika, 16,<br />

407-424.<br />

Eisenhart, C. (1947). The assumptions<br />

underlying analysis of variance. Biometrics,<br />

3, 1-21.<br />

Erlich, O., & Borich, C. (1979). Occurrence<br />

and generalizability of scores on<br />

a classroom interaction instrument.<br />

Journal of Educational Measurement,<br />

16, 11-18.<br />

Erlich, O., & Shavelson, R. J. (1976).<br />

Application of generalizability theory<br />

to the study of teaching (Tech. Rep.<br />

No. 76-9-1). San Francisco: Far West<br />

Laboratory.<br />

Feldt, L. S., & Brennan, R. L. (1989).<br />

Reliability. In R. L. Linn (Ed.), Educational<br />

measurement (3rd ed.,<br />

pp. 127-144). New York: Macmillan.<br />

Finlayson, D. S. (1951). The reliability<br />

of marking essays. British Journal of<br />

Educational Psychology, 35, 143-162.<br />



Fisher, R. A. (1925). Statistical methods<br />

for research workers. London: Oliver<br />

& Boyd.<br />

Gao, X. (1996). Sampling variability<br />

and generalizability of work keys listening<br />

and writing scores (ACT Research<br />

Report No. 96-1). Iowa City:<br />

ACT.<br />

Gao, X., Brennan, R. L., & Shavelson, R.<br />

J. (1994, April). Estimating generalizability<br />

of matrix-sampled science<br />

performance assessments. Paper presented<br />

at the Annual Meeting of the<br />

American Educational Research Association,<br />

New Orleans.<br />

Gao, X., Shavelson, R. J., Brennan, R. L.,<br />

& Baxter, G. P. (1996, April). A multivariate<br />

generalizability theory approach<br />

to convergent validity of<br />

performance-based assessment. Paper<br />

presented at the Annual Meeting of<br />

the National Council on Measurement<br />

in Education, New York.<br />

Gillmore, G. M. (1983). Generalizability<br />

theory: Applications to program evaluation.<br />

In L. J. Fyans (Ed.), New directions<br />

for testing and measurement:<br />

Generalizability theory: Inferences<br />

and practical applications (No. 18,<br />

pp. 3-16). San Francisco: Jossey-Bass.<br />

Gleser, G. C., Cronbach, L. J., & Rajaratnam,<br />

N. (1965). Generalizability<br />

of scores influenced by multiple<br />

sources of variance. Psychometrika,<br />

30, 395-418.<br />

Gulliksen, H. (1950). Theory of mental<br />

tests. New York: Wiley.<br />

Haggard, E. A. (1958). Intraclass correlation<br />

and the analysis of variance.<br />

New York: Dryden.<br />

Hoyt, C. J. (1941). Test reliability estimated<br />

by analysis of variance. Psychometrika,<br />

6, 153-160.<br />

Jarjoura, D. (1986). An estimator of<br />

examinee-level measurement error<br />

variance that considers test form difficulty<br />

adjustments. Applied Psychological<br />

Measurement, 10, 175-186.<br />

Jarjoura, D., & Brennan, R. L. (1982).<br />

A variance components model for<br />

measurement procedures associated<br />

with a table of specifications. Applied<br />

Psychological Measurement, 6,<br />

161-171.<br />

Jarjoura, D., & Brennan, R. L. (1983).<br />

Multivariate generalizability models<br />

for tests developed according to a<br />

table of specifications. In L. J. Fyans<br />

(Ed.), New directions for testing and<br />

measurement: Generalizability theory:<br />

Inferences and practical applications<br />

(No. 18, pp. 83-101). San Francisco:<br />

Jossey-Bass.<br />

Kane, M. T. (1982). A sampling model<br />

for validity. Applied Psychological<br />

Measurement, 6, 125-160.<br />

Kane, M. T., & Brennan, R. L. (1977).<br />

The generalizability of class means.<br />

Review of Educational Research, 47,<br />

267-292.<br />

Kane, M. T., Gillmore, G. M., & Crooks,<br />

T. J. (1976). Student evaluations of<br />

teaching: The generalizability of class<br />

means. Journal of Educational Measurement,<br />

13, 171-183.<br />

Kolen, M. J., & Jarjoura, D. (1984). Item<br />

profile analysis for tests developed according<br />

to a table of specifications.<br />

Applied Psychological Measurement,<br />

8, 219-230.<br />

Kuder, G. F., & Richardson, M. W.<br />

(1937). The theory of the estimation of<br />

test reliability. Psychometrika, 2,<br />

151-160.<br />

Lindquist, E. F. (1953). Design and<br />

analysis of experiments in psychology<br />

and education. Boston: Houghton<br />

Mifflin.<br />

Llabre, M. M., Ironson, G. H., Spitzer,<br />

S. B., Gellman, M. D., Weidler, D. J.,<br />

& Schneiderman, N. (1988). How<br />

many blood pressure measurements<br />

are enough? An application of generalizability<br />

theory to the study of blood<br />

pressure reliability. Psychophysiology,<br />

25, 97-105.<br />

Lord, F. M. (1955). Estimating test reliability.<br />

Educational and Psychological<br />

Measurement, 15, 325-336.<br />

Lord, F. M. (1957). Do tests of the same<br />

length have the same standard errors<br />

of measurement? Educational<br />

and Psychological Measurement, 17,<br />

510-521.<br />

Lord, F. M. (1959). Tests of the same<br />

length do have the same standard<br />

error of measurement. Educational<br />

and Psychological Measurement, 19,<br />

233-239.<br />

Lord, F. M. (1962). Test reliability: A<br />

correction. Educational and Psychological<br />

Measurement, 22, 511-512.<br />

Lovel<strong>and</strong>, E. H. (1952). Measurement of<br />

factors affecting test-retest reliability.<br />

Unpublished doctoral dissertation,<br />

University of Tennessee.<br />

Medley, D. M., Mitzel, H. E., & Doi,<br />

A. N. (1956). Analysis of variance<br />

models and their use in a three-way<br />

design without replication. Journal<br />

of Experimental Education, 24,<br />

221-229.<br />

Nußbaum, A. (1984). Multivariate generalizability<br />

theory in educational<br />

measurement: An empirical study.<br />

Applied Psychological Measurement,<br />

8, 219-230.<br />

Pilliner, A. E. G. (1952). The application<br />

of analysis of variance to problems<br />

of correlation. British Journal<br />

of Psychology, Statistical Section, 5,<br />

31-38.<br />

Rajaratnam, N., Cronbach, L. J., &<br />

Gleser, G. C. (1965). Generalizability<br />

of stratified-parallel tests. Psychometrika,<br />

30, 39-56.<br />

Rogosa, D., & Ghandour, G. (1991).<br />

Statistical models for behavioral<br />

observations. Journal of Educational<br />

Statistics, 3, 157-252.<br />

Saab, P. G., Llabre, M. M., Hurwitz,<br />

B. E., Frame, C. A., Reineke, L. J., Fins,<br />

A. I., McCalla, J., Cieply, L. K., &<br />

Schneiderman, N. (1992). Myocardial<br />

and peripheral vascular responses to<br />

behavioral challenges and their stability<br />

in black and white Americans.<br />

Psychophysiology, 29, 384-397.<br />

Shavelson, R. J., Baxter, G. P., & Gao,<br />

X. (1993). Sampling variability of<br />

performance assessments. Journal<br />

of Educational Measurement, 30,<br />

215-232.<br />

Shavelson, R. J., Baxter, G. P., & Pine,<br />

J. (1991). Performance assessments<br />

in science. Applied Measurement in<br />

Education, 4, 347-362.<br />

Shavelson, R. J., Baxter, G. P., & Pine,<br />

J. (1992). Performance assessments:<br />

The rhetoric and reality. Educational<br />

Researcher, 21(4), 22-27.<br />

Shavelson, R. J., & Webb, N. M. (1981).<br />

Generalizability theory: 1973-1980.<br />

British Journal of Mathematical and<br />

Statistical Psychology, 34, 133-166.<br />

Shavelson, R. J., & Webb, N. M. (1991).<br />

Generalizability theory: A primer.<br />

Newbury Park, CA: Sage.<br />

Shavelson, R. J., Webb, N. M., &<br />

Rowley, G. L. (1989). Generalizability<br />

theory. American Psychologist, 6,<br />

922-932.<br />

Smith, P. L. (1978). Sampling errors of<br />

variance components in small sample<br />

generalizability studies. Journal<br />

of Educational Statistics, 3, 319-346.<br />

Smith, P. L. (1981). Gaining accuracy in<br />

generalizability theory: Using multiple<br />

designs. Journal of Educational<br />

Measurement, 18, 147-154.<br />

Smith, P. L. (1982). A confidence interval<br />

approach for variance component<br />

estimates in the context of<br />

generalizability theory. Educational<br />

and Psychological Measurement, 42,<br />

459-466.<br />

Webb, N. M., Rowley, G. L., & Shavelson,<br />

R. J. (1988). Using generalizability<br />

theory in counseling and<br />

development. Measurement and Evaluation<br />

in Counseling and Development,<br />

21, 81-90.<br />

Webb, N. M., Shavelson, R. J., Kim, K.<br />

S., & Chen, Z. (1989). Reliability (generalizability)<br />

of job performance measurements:<br />

Navy machinist mates.<br />

Military Psychology, 1, 91-110.<br />

Webb, N. M., Shavelson, R. J., & Maddahian,<br />

E. (1983). Multivariate generalizability<br />

theory. In L. J. Fyans (Ed.),<br />

New directions for testing and measurement:<br />

Generalizability theory: Inferences<br />

and practical applications<br />

(No. 18, pp. 67-81). San Francisco:<br />

Jossey-Bass.<br />

Educational Measurement: Issues and Practice
