View - Waisman Laboratory for Brain Imaging and Behavior
Kelley, T. L. (1923). Statistical method.<br />
New York: Macmillan.<br />
Kelley, T. L. (1942). The reliability coefficient.<br />
Psychometrika, 7, 75-83.<br />
Kuder, G. F., & Richardson, M. W.<br />
(1937). The theory of estimation of<br />
test reliability. Psychometrika, 2,<br />
151-160.<br />
Lord, F. M., & Novick, M. R. (1968). Statistical<br />
theories of mental test scores.<br />
Reading, MA: Addison-Wesley.<br />
Novick, M. R. (1966). The axioms and<br />
principal results of classical test theory.<br />
Journal of Mathematical Psychology,<br />
3, 1-18.<br />
Pearson, K. (1896). Mathematical contributions<br />
to the theory of evolution.<br />
III. Regression, heredity and<br />
panmixia. Philosophical Transactions,<br />
A, 187, 252-318.<br />
Pearson, K. (1904). On the laws of inheritance<br />
in man. II. On the inheritance<br />
of the mental and moral<br />
characters in man, and its comparison<br />
with the inheritance of physical<br />
characters. Biometrika, 3, 131-190.<br />
Pearson, K. (1930). The life, letters, and<br />
labours of Francis Galton. Vol. IIIA.<br />
Correlation, personal identification<br />
and eugenics. Cambridge: The University<br />
Press.<br />
Pearson, K., & Lee, A. (1903). On the<br />
laws of inheritance in man. I. Inheritance<br />
of physical characters. Biometrika,<br />
2, 357-462.<br />
Read, C. B. (1985). Normal distribution.<br />
In S. Kotz & N. L. Johnson (Eds.),<br />
Encyclopedia of statistical sciences<br />
(Vol. 6, pp. 347-359). Toronto: Wiley.<br />
Richardson, M. W. (1936). Notes on the<br />
rationale of item analysis. Psychometrika,<br />
1(1), 69-76.<br />
Rulon, P. J. (1939). A simplified procedure<br />
for determining the reliability of<br />
a test by split-halves. Harvard Educational<br />
Review, 9, 99-103.<br />
Sheynin, O. B. (1968). On the early history<br />
of the law of large numbers. Biometrika,<br />
55, 459-467.<br />
Spearman, C. (1904). The proof and<br />
measurement of association between<br />
two things. American Journal of Psychology,<br />
15, 72-101.<br />
Spearman, C. (1907). Demonstration of<br />
formulae for true measurement of<br />
correlation. American Journal of Psychology,<br />
18, 160-169.<br />
Spearman, C. (1910). Correlation calculated<br />
from faulty data. British<br />
Journal of Psychology, 3, 271-295.<br />
Thurstone, L. L. (1932). The reliability<br />
and validity of tests. Ann Arbor, MI:<br />
N. p.<br />
Venn, J. (1888). The logic of chance (3rd<br />
ed.). London: Macmillan.<br />
Walker, H. M. (1929). Studies in the history<br />
of statistical method. Baltimore:<br />
Williams & Wilkins.<br />
A Perspective on the History of<br />
Generalizability Theory<br />
Robert L. Brennan<br />
University of Iowa<br />
What psychometric and scientific perspectives influenced<br />
the development of G theory? What practical<br />
testing problems gave impetus to its adoption? What<br />
work remains to be done?<br />
Overviews of various parts of<br />
the history of generalizability<br />
(G) theory are provided elsewhere.<br />
An indispensable starting point is<br />
the preface and parts of the first<br />
chapter of Cronbach, Gleser, Nanda,<br />
and Rajaratnam (1972) entitled The<br />
Dependability of Behavioral Measurements:<br />
Theory of Generalizability<br />
for Scores and Profiles. The<br />
Cronbach et al. monograph is still<br />
the most definitive treatment of G<br />
theory. Shavelson and Webb (1981)<br />
review the G theory literature from<br />
1973-1980, and Shavelson, Webb,<br />
and Rowley (1989) cover additional<br />
contributions in the 1980s. A very<br />
brief historical overview is provided<br />
by Brennan (1983, 1992a, pp. 1-2).<br />
In addition, Cronbach (1976, 1989,<br />
1991) offers numerous perspectives<br />
on G theory and its history. Cronbach<br />
(1991) is particularly rich with<br />
first-person reflections.<br />
This historical overview is not intended<br />
to repeat everything already<br />
covered in published reviews, although<br />
a summary is provided.<br />
Parts of this article are based<br />
largely on my personal experience<br />
with G theory. Consequently, this<br />
article provides a somewhat idiosyncratic<br />
perspective on the history of G<br />
theory and what I perceive as unfinished<br />
work for the theory. Almost<br />
certainly, other reviewers would see<br />
the landscape somewhat differently.<br />
Theory Development and Enabling<br />
Work<br />
In discussing the genesis of G<br />
theory, Cronbach (1991) states:<br />
In 1957 I obtained funds from the<br />
National Institute of Mental<br />
Health to produce, with Gleser's<br />
collaboration, a kind of handbook<br />
of measurement theory. . . .<br />
“Since reliability has been studied<br />
thoroughly and is now understood,”<br />
I suggested to the team,<br />
“let us devote our first few weeks<br />
to outlining that section of the<br />
handbook, to get a feel for the undertaking.”<br />
We learned humility<br />
the hard way – the enterprise<br />
never got past that topic. Not<br />
until 1972 did the book appear<br />
(Cronbach, Gleser, Nanda, & Rajaratnam)<br />
that exhausted our<br />
findings on reliability reinterpreted<br />
as generalizability. Even<br />
then, we did not exhaust the topic.<br />
When we tried initially to summarize<br />
prominent, seemingly<br />
transparent, convincingly argued<br />
papers on test reliability, the messages<br />
conflicted. (pp. 391-392)<br />
Robert L. Brennan is Lindquist Professor<br />
of Educational Measurement and<br />
Director of the Iowa Testing Programs,<br />
University of Iowa, 334A Lindquist<br />
Center, Iowa City, IA 52242. His specializations<br />
are generalizability theory,<br />
equating, and scaling.<br />
14 Educational Measurement: Issues and Practice
To resolve these conflicts, Cronbach<br />
and his colleagues devised a<br />
rich conceptual framework and married<br />
it to analysis of random effects<br />
variance components. The net effect<br />
is “a tapestry that interweaves ideas<br />
from at least two dozen authors”<br />
(Cronbach, 1991, p. 394).<br />
It is not uncommon for G theory<br />
to be described as the application of<br />
analysis of variance (ANOVA) to<br />
classical test theory. This characterization<br />
of the theory is inadequate,<br />
at best, and probably more misinformative<br />
than useful, except in one<br />
respect. It does correctly suggest<br />
that the parents of G theory can be<br />
viewed as classical test theory and<br />
analysis of variance. The G theory<br />
child, however, is both more and less<br />
than the simple conjunction of its<br />
parents. In particular, G theory is<br />
not a replacement for classical theory,<br />
although it does liberalize the<br />
theory. Also, not all of ANOVA is<br />
relevant to G theory; indeed, some<br />
perspectives on ANOVA are inconsistent<br />
with G theory (see Brennan,<br />
1984).<br />
The statistical machinery employed<br />
in G theory has its genesis in<br />
Fisher’s (1925) work on factorial designs.<br />
However, G theory has no<br />
substantive role for hypothesis testing.<br />
Rather, it emphasizes the estimation<br />
of random effects variance<br />
components, a subject that was researched<br />
by statisticians in the late<br />
1940s (see, e.g., Crump, 1946, and<br />
particularly Eisenhart, 1947). This<br />
research was brought to Cronbach’s<br />
attention by a graduate student,<br />
Milton Meux, about 1957 (L. J.<br />
Cronbach, personal communication,<br />
April 18, 1997) at approximately the<br />
same time that Cornfield and Tukey<br />
(1956) published their rules for expressing<br />
expected mean square<br />
equations in terms of variance components.<br />
By 1950, there was a rich literature<br />
on reliability from the perspective<br />
of classical test theory. Most of<br />
this literature had been superbly<br />
summarized by Gulliksen (1950),<br />
which included chapters on experimental<br />
methods for estimating reliability,<br />
as well as reliability<br />
estimated by item homogeneity, what<br />
came to be called internal consistency<br />
estimates. Such estimates<br />
included, of course, Hoyt’s (1941)<br />
ANOVA version of Kuder and<br />
Richardson’s (1937) KR20 index. It<br />
is not quite true, however, that Hoyt<br />
was the first to apply ANOVA to<br />
measurement problems. An earlier<br />
contribution was made by Burt<br />
(1936) in his treatment of the analysis<br />
of examination marks.<br />
Gulliksen’s (1950) book was published<br />
before Cronbach’s widely<br />
cited 1951 article that introduced<br />
Coefficient α. For the next several<br />
years, a great deal of research on reliability<br />
formed the backdrop for G<br />
theory. Finlayson’s (1951) study of<br />
grades assigned to essays was probably<br />
the first treatment of reliability<br />
in terms of variance components.<br />
Shortly thereafter Pilliner (1952)<br />
provided theoretical relations between<br />
intraclass correlations and<br />
ANOVA (see also Haggard, 1958).<br />
Cronbach (1947) had expressed<br />
the concern that some type of multifacet<br />
analysis was needed to resolve<br />
inconsistencies in some estimates of<br />
reliability. The 1950s were years in<br />
which various researchers began to<br />
exploit the fact that ANOVA could<br />
handle multiple facets simultaneously.<br />
Particular examples include<br />
Loveland’s (1952) doctoral dissertation,<br />
work by Medley, Mitzel, and<br />
Doi (1956) on classroom observations,<br />
and Burt’s (1955) treatment of<br />
test reliability estimated by analysis<br />
of variance. Most importantly, Lindquist<br />
(1953, chap. 16) laid out an<br />
extensive exposition of multifacet<br />
theory that focused on the estimation<br />
of variance components in reliability<br />
studies. Lindquist demonstrated<br />
that multifacet analyses<br />
lead to alternative definitions of<br />
error and reliability coefficients.<br />
Lindquist’s chapter clearly foreshadowed<br />
important parts of G theory.<br />
Cronbach was on the faculty at<br />
the University of Chicago from 1946<br />
to 1948. He recalls that:<br />
Five minutes with Joseph<br />
Schwab had a profound influence.<br />
. . . In some context<br />
Schwab remarked that biologists<br />
have to decide what to count as a<br />
species. . . . Schwab was acute<br />
enough to catch my flicker of surprise<br />
and force home the idea of<br />
scientist as construer rather than<br />
as discoverer of categories the<br />
Creator had in mind. That conversation<br />
. . . resonates in my<br />
thinking to this day. (Cronbach,<br />
1989, p. 72, italics added)<br />
Given this perspective, it is not<br />
surprising that G theory requires<br />
that investigators define the conditions<br />
of measurement of interest<br />
to them. The theory effectively disavows<br />
any notion of there being a<br />
correct set of conditions of measurement,<br />
but it is clear that the particular<br />
tasks or items used are not a<br />
sufficient specification of a measurement<br />
procedure. These notions<br />
are central to the conceptual framework<br />
of G theory, but they are not<br />
entirely novel.<br />
Guttman once made the provocative<br />
remark that a test belongs to<br />
several sets, and therefore has<br />
several reliabilities. “List as<br />
many 4-letter words that begin<br />
with t as you can.” That word-fluency<br />
task fits into at least three<br />
families: 4-letter words beginning<br />
with a specified letter, t words of<br />
a specified length, and 4-letter<br />
words with t in a specified position.<br />
The investigator’s theory,<br />
rather than an abstract concept of<br />
truth and error, determines<br />
which family contains tests that<br />
“measure the same variable.”<br />
(Cronbach, 1991, p. 394)<br />
In 1951, Ebel published an article<br />
on the reliability of ratings in which<br />
he essentially considered two types<br />
of error variance, one that included,<br />
and another that excluded,<br />
rater main effects. In the process of<br />
doing so, Ebel also considered single-facet<br />
crossed and nested designs.<br />
It wasn’t until G theory was<br />
fully formulated that the issues<br />
Ebel grappled with were truly clarified<br />
in the distinction between relative<br />
(δ) and absolute (Δ) error for<br />
various designs. Very much the<br />
same problems were considered by<br />
Lord (1955, 1957, 1959) in a classic<br />
series of articles about conditional<br />
standard errors of measurement<br />
(SEMs) and reliability under the assumptions<br />
of what came to be called<br />
the binomial error model (see also<br />
Lord, 1962). In effect, the rater<br />
main effects in Ebel’s article play<br />
the role of the item main effects in<br />
Lord’s articles. In addition, Lord’s<br />
articles clearly specify what came to<br />
be called randomly parallel tests.<br />
Winter 1997 15
The issues Lord was grappling<br />
with had a clear influence on the development<br />
of G theory. According to<br />
Cronbach (personal communication,<br />
1996), about 1957, Lord visited the<br />
Cronbach team in Urbana. Their<br />
discussions suggested that the error<br />
in Lord’s formulation of the binomial<br />
error model (which treated one<br />
person at a time, that is, a completely<br />
nested design) could not be<br />
the same error as that in classical<br />
theory for a crossed design. (Lord<br />
basically acknowledges this in his<br />
1962 article.) This insight was eventually<br />
captured in the distinction<br />
between δ and Δ in G theory, and it<br />
illustrated that errors of measurement<br />
are influenced by the choice of<br />
design. Lord’s binomial error model<br />
is probably best known as a simple<br />
way to estimate conditional SEMs<br />
and as an important precursor to<br />
strong true score theory, but it is<br />
also associated with important insights<br />
that became an integral part<br />
of G theory.<br />
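The distinction between relative (δ) and absolute (Δ) error can be made concrete with a small numerical sketch. Nothing below comes from the article itself: it assumes the standard textbook expected-mean-square estimators for a crossed persons × items (p × i) random effects design, the scores are hypothetical, and the function names are chosen only for illustration.

```python
def variance_components_p_by_i(scores):
    """Estimate variance components for a crossed p x i design.

    `scores` is a list of rows (one per person), each row holding that
    person's score on every item. Uses the usual ANOVA mean squares.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual (pi,e).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean square equations for the components.
    return {
        "p": (ms_p - ms_pi) / n_i,
        "i": (ms_i - ms_pi) / n_p,
        "pi,e": ms_pi,
    }


def error_variances(components, n_items):
    """Return (relative, absolute) error variance for a mean over n_items."""
    relative = components["pi,e"] / n_items            # sigma^2(delta)
    absolute = relative + components["i"] / n_items    # sigma^2(Delta)
    return relative, absolute


# Hypothetical data: 5 persons by 4 items.
scores = [
    [7, 5, 6, 3],
    [8, 6, 7, 4],
    [6, 4, 6, 2],
    [9, 7, 8, 6],
    [5, 4, 4, 2],
]
components = variance_components_p_by_i(scores)
relative, absolute = error_variances(components, n_items=4)
print(components)
print("relative:", relative, "absolute:", absolute)
```

In this sketch the item main effect enters absolute error but not relative error, which is the role the rater main effect plays in Ebel's two error variances. In a completely nested design, where each person receives different items, the item and residual effects cannot be separated, so under the same textbook assumptions the two error variances coincide.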
The genius of Cronbach and his<br />
colleagues was their creation of a<br />
conceptual framework and use of a<br />
methodology (variance components<br />
analysis) that integrated the contributions<br />
of numerous researchers,<br />
even when some contributions<br />
seemed to conflict with one another.<br />
The essential features of univariate<br />
G theory were largely completed<br />
with technical reports in 1960-1961,<br />
each with a different first<br />
author. These were revised into<br />
three journal articles, each with a<br />
different first author (Cronbach, Rajaratnam,<br />
& Gleser, 1963; Gleser,<br />
Cronbach, & Rajaratnam, 1965; and<br />
Rajaratnam, Cronbach, & Gleser,<br />
1965). In 1964 Cronbach moved to<br />
Stanford. Shortly thereafter, Harinder<br />
Nanda’s studies on interbattery<br />
reliability provided part of the<br />
motivation for the development of<br />
multivariate G theory (considered<br />
later). This very major extension of<br />
the univariate model is part of the<br />
reason it took more than 10 years<br />
after the 1960-1961 reports for<br />
Cronbach et al. (1972) to appear in<br />
print. It is still the most intensive<br />
and extensive treatment of G theory.<br />
Applications and Extensions With<br />
Some Personal Reflections<br />
Many investigators who come to<br />
employ generalizability theory in<br />
their research do so only after concluding<br />
that more conventional approaches<br />
seem inadequate. That<br />
was indeed the motivation that led<br />
me to generalizability theory. In the<br />
late 1960s and early 1970s, I served<br />
as a consultant on evaluations of the<br />
Head Start and Follow Through<br />
Programs, and the National Day<br />
Care Study. A distinguishing common<br />
characteristic of these studies<br />
was that the treatments were applied<br />
to whole classrooms and evaluated<br />
using certain measurement<br />
procedures. A very natural question<br />
to ask, then, was, “How shall we estimate<br />
the reliability of classroom<br />
mean scores for these measurement<br />
procedures?” A number of discussions<br />
convinced many of us that the<br />
problem was not getting an estimate;<br />
rather, the problem was that<br />
we had too many estimates, and no<br />
obvious way to choose among them.<br />
Early in the summer of 1972, I set<br />
myself the goal of resolving this<br />
paradox by the end of the summer.<br />
It did not take that long. The library<br />
at SUNY at Stony Brook where I<br />
was a beginning assistant professor<br />
had a brand new book entitled The<br />
Dependability of Behavioral Measurements<br />
(Cronbach et al., 1972).<br />
After studying it night and day for a<br />
week, the answer was obvious: the<br />
different estimates we were getting<br />
were related to different universes<br />
of generalization when class means<br />
were the objects of measurement.<br />
This insight eventually led to my<br />
first publication on generalizability<br />
theory (Brennan, 1975). Shortly<br />
thereafter, Michael Kane joined the<br />
faculty of education at Stony Brook,<br />
and I discovered that he and some of<br />
his former colleagues at the University<br />
of Illinois had been working on<br />
exactly the same problem in the<br />
context of student evaluations of<br />
teaching (see, e.g., Kane, Gillmore,<br />
& Crooks, 1976). Our common interest<br />
in this problem led to a joint article<br />
(Kane & Brennan, 1977).<br />
The Cronbach et al. (1972) formulation<br />
of G theory was general<br />
enough to permit any set of conditions<br />
(e.g., persons, classes, items)<br />
to be the objects of measurement<br />
facet. In that sense, the work on<br />
class means in the early-to-mid-1970s<br />
was more of an illustration<br />
than a substantive contribution to<br />
the theory. In a series of articles<br />
about the symmetry of G theory,<br />
Cardinet and his colleagues emphasized<br />
the role that facets other than<br />
persons might play as objects of<br />
measurement (e.g., Cardinet &<br />
Allal, 1983; Cardinet, Tourneur, &<br />
Allal, 1976a, 1976b, 1981).<br />
At the same time that Kane and I<br />
were working on our class means article,<br />
we were intrigued with the<br />
idea of using generalizability theory<br />
to address issues surrounding the<br />
reliability of criterion-referenced (or<br />
domain-referenced) scores, which<br />
was a very hot topic in the early-to-mid-1970s.<br />
Our initial forays into<br />
this area (Brennan & Kane, 1977a,<br />
1977b) were based on a very simple<br />
idea: use absolute error rather than<br />
relative error in defining indices and<br />
signal-noise ratios. This work was<br />
later summarized and somewhat extended<br />
by Brennan (1984). The research<br />
that Kane and I did on<br />
domain-referenced scores and class<br />
means was so clearly co-equal that<br />
we flipped a coin to decide on first<br />
authorship. To follow blindly the<br />
alphabetize-by-last-name convention<br />
would have grossly misrepresented<br />
our relative contributions.<br />
In 1981, Shavelson and Webb<br />
published a review of G theory for<br />
the years 1973-1980. Actually, their<br />
article is much more than a review<br />
of 8 years of literature; it is also an<br />
excellent summary of G theory that<br />
is highly relevant and readable<br />
today. Only some of the work they<br />
review has been discussed here.<br />
By the late 1970s, I had read<br />
Cronbach et al. (1972) cover-to-cover<br />
three times, but parts of it still challenged<br />
me. I agreed with their statement<br />
that “the book is complexly<br />
organized and by no means simple<br />
to follow” (Cronbach et al., 1972,<br />
p. 3). It seemed likely to me that<br />
this complexity was at least partly<br />
the reason why relatively few generalizability<br />
studies were being conducted.<br />
I decided to try to publicize,<br />
teach, and simplify generalizability<br />
theory for graduate students and<br />
measurement practitioners. At<br />
about this time, with the assistance<br />
of Kane and Gillmore (and later<br />
Noreen Webb and Xiaohong Gao), I<br />
began an every-other-year training<br />
session on G theory for the AERA<br />
and NCME Annual Meetings.<br />
My first effort at writing a simpler<br />
treatment of G theory (Brennan,<br />
1977) was a paper that was<br />
rejected by a major journal; the editor<br />
described it as being “too<br />
propaedeutic.” Just about that time<br />
Jay Millman, who was then president<br />
of NCME, asked me to consider<br />
writing a monograph on generalizability<br />
theory for publication by<br />
NCME. With the encouragement of<br />
Michael Kane and David Jarjoura, I<br />
agreed, but, when I completed the<br />
monograph almost 3 years later,<br />
NCME was no longer interested in<br />
publishing it! ACT, however, did<br />
publish Elements of Generalizability<br />
Theory (Brennan, 1983).<br />
I had long felt that a simpler<br />
treatment of G theory was not<br />
enough to get the theory used more<br />
widely by practitioners. They also<br />
needed a computer program. So, at<br />
the same time I was writing Elements<br />
of Generalizability Theory, I<br />
was designing a computer program<br />
called GENOVA (Crick & Brennan,<br />
1983) that would be coordinated<br />
with the monograph. My computer<br />
skills were not adequate for programming<br />
GENOVA, however. That<br />
task was undertaken by Joe Crick, a<br />
colleague from graduate school at<br />
Harvard, who somehow managed to<br />
translate my math and handwritten<br />
input-output layouts into workable<br />
FORTRAN code while serving as<br />
Director of the Computing Center at<br />
the University of Massachusetts,<br />
Boston.<br />
Several expositions of G theory<br />
were published in the late 1980s and<br />
early 1990s, all of which are briefer<br />
and less demanding than Cronbach<br />
et al. (1972) or Brennan (1983,<br />
1992a). Shavelson, Webb, and Rowley<br />
(1989) provided a particularly<br />
readable journal article that summarizes<br />
G theory, and in the same<br />
year Feldt and Brennan (1989) devoted<br />
about one third of their chapter<br />
on reliability to G theory. In<br />
1991, Shavelson and Webb published<br />
a relatively short monograph<br />
entitled Generalizability Theory: A<br />
Primer. Brennan (1992b) provided a<br />
very brief introduction intended primarily<br />
for classroom use.<br />
Interest in performance testing in<br />
the late 1980s led to a mini-boom in<br />
generalizability analyses and considerably<br />
greater publicity for G<br />
theory. It seemed evident to practitioners<br />
that G theory was eminently<br />
well-suited to analyzing scores from<br />
such tests. In particular, practitioners<br />
realized that understanding the<br />
results of a performance test necessitated<br />
grappling with two or more<br />
facets simultaneously, especially<br />
tasks and raters. The relevance of G<br />
theory in such contexts is especially<br />
well illustrated by Richard Shavelson<br />
and his colleagues in a series of<br />
presentations and articles involving<br />
science and mathematics performance<br />
assessments, in particular<br />
(see, e.g., Gao, Brennan, & Shavelson,<br />
1994; Shavelson, Baxter, &<br />
Gao, 1993; Shavelson, Baxter, &<br />
Pine, 1991, 1992). Also, Brennan<br />
and Johnson (1995) and Brennan<br />
(1996b) consider some theoretical<br />
and applied issues in performance<br />
testing from the perspective of G<br />
theory.<br />
New assessments such as performance<br />
tests recently motivated<br />
Cronbach, Linn, Brennan, and<br />
Haertel (1995) to state: “Assessments<br />
depart from traditional measurements<br />
in ways that require<br />
extensions and modifications of generalizability<br />
analysis. . . . Assessments<br />
pose problems that reach<br />
beyond available psychometric theory”<br />
(p. 1). The Cronbach et al.<br />
(1995) report and a recent journal<br />
article revision (Cronbach, Linn,<br />
Brennan, & Haertel, 1997) suggest a<br />
number of problems that need to be<br />
researched, and they propose some<br />
recommended solutions. These articles<br />
emphasize the importance of estimates<br />
of absolute standard errors<br />
of measurement for many of the<br />
types of decisions that are typically<br />
made with performance assessments.<br />
Also, these articles urge that<br />
an analysis of error for group means<br />
explicitly recognizes that pupils are<br />
nested in classes and schools.<br />
Whether to treat pupils as fixed or<br />
random in such analyses is discussed<br />
in some detail (see, also,<br />
Brennan, 1995a).<br />
In their 1972 monograph, Cronbach<br />
and his colleagues illustrated<br />
the applicability of G theory largely<br />
by reanalyzing some already published<br />
data in the psychology and<br />
education literature. Since 1972, in<br />
addition to topics already cited in<br />
this overview, G theory has been<br />
used to study issues such as classroom<br />
teaching (e.g., Erlich & Borich,<br />
1979; Erlich & Shavelson, 1976);<br />
program evaluation (e.g., Gillmore,<br />
1983); the use of tables of specifications<br />
in educational testing (e.g.,<br />
Jarjoura & Brennan, 1982, 1983;<br />
Kolen & Jarjoura, 1984); counseling<br />
and development (Webb, Rowley, &<br />
Shavelson, 1988); setting performance<br />
standards (Brennan, 1995b);<br />
job performance (Webb, Shavelson,<br />
Kim, & Chen, 1989); neuroticism<br />
and coping with anger (Atkinson &<br />
Violato, 1994); and aspects of physiology,<br />
including blood pressure<br />
(Llabre et al., 1988; Saab et al.,<br />
1992).<br />
Unfinished Work<br />
G theory has a protean quality.<br />
The procedures and even the issues<br />
take on a new form in every<br />
context. G theory enables you to<br />
ask your questions better; what is<br />
most significant for you cannot be<br />
supplied from the outside. (Cronbach,<br />
1976, p. 199)<br />
In this sense, G theory is a continuous<br />
work in progress, and none<br />
of the research reviewed here can be<br />
deemed complete. Still, there are<br />
some important theoretical and statistical<br />
topics that clearly need to be<br />
addressed more fully than they<br />
have been, and there are potential<br />
areas of application where the theory<br />
has been largely unused as yet.<br />
Although G theory has been applied<br />
in a number of contexts, the<br />
coverage is not balanced and one<br />
might expect that after 25 years<br />
many more generalizability analyses<br />
would have been conducted than<br />
are reported in the literature. Most<br />
published generalizability analyses<br />
are in the education literature, perhaps<br />
because those who are most<br />
knowledgeable about G theory tend<br />
to be employed in colleges of education,<br />
educational testing companies,<br />
and related organizations. Clearly,<br />
however, G theory has potential applicability<br />
wherever measurement<br />
procedures are employed. In particular,<br />
G theory seems very much<br />
underutilized in psychological and<br />
medical areas.<br />
It is often stated that G theory<br />
“blurs the distinction between reliability<br />
and validity” (Cronbach et al.,<br />
1972, p. 380). Yet, very little of the G<br />
theory literature directly addresses<br />
validation issues. A notable exception<br />
is Kane’s (1982) treatment of “A<br />
Sampling Model for Validity,” which<br />
is clearly one of the major theoretical<br />
contributions to the literature<br />
on G theory in the last 25 years. In<br />
his article, Kane clearly begins to<br />
make explicit links between G theory<br />
and issues traditionally subsumed<br />
under validity. Still, many of<br />
the contributions that G theory<br />
probably could make to the validation<br />
of particular measurement procedures<br />
are unexplored, <strong>and</strong> it<br />
seems reasonable to speculate that<br />
more theoretical contributions are<br />
possible.<br />
By the early 1960s, Cronbach and<br />
his colleagues had pretty much completed<br />
their development of univariate<br />
G theory. It provided a coherent<br />
framework for considering most, if<br />
not all, of the reliability literature<br />
that had been developed to that<br />
time. About 1966, they began work<br />
on multivariate G theory, in which<br />
each of the levels of one or more<br />
fixed facets is associated with a distinct<br />
universe score. Although it<br />
might be claimed that not all of univariate<br />
G theory is novel, multivariate<br />
G theory (the generalizability of<br />
profiles) is clearly a unique contribution<br />
of Cronbach and his colleagues<br />
(Cronbach et al., 1972,<br />
chapters 9 and 10). In commenting<br />
on multivariate G theory, Cronbach<br />
has stated:<br />
Despite the long-standing interest<br />
Gleser and I had in profiles,<br />
all of G theory down to 1966 considered<br />
one score at a time. . . . A<br />
decade of work was required to<br />
expose the twists and turns of the<br />
simpler univariate multifacet<br />
theory, so surely much multivariate<br />
theory remains to be developed.<br />
(Cronbach, 1991, p. 394)<br />
Shavelson and Webb (1981) in<br />
their review of G theory discuss<br />
some developments in multivariate<br />
G theory since the Cronbach et al.<br />
(1972) monograph. Since their review,<br />
there have been other articles<br />
published on the subject (e.g., Brennan,<br />
Gao, & Colton, 1995; Gao,<br />
Shavelson, Brennan, & Baxter,<br />
1996; Jarjoura & Brennan, 1982,<br />
1983; Kolen & Jarjoura, 1984; Nußbaum,<br />
1984; Webb, Shavelson, &<br />
Maddahian, 1983). Also, Brennan<br />
(1983, 1992a) and Shavelson, Webb,<br />
and Rowley (1989) provide illustrative<br />
multivariate analyses. However,<br />
it is still true that “much multivariate<br />
theory remains to be developed”<br />
(Cronbach, 1991, p. 394).<br />
In my opinion, the conceptual<br />
framework of G theory is more central,<br />
and likely to be more enduring,<br />
than the statistical machinery<br />
used to carry out generalizability<br />
analyses. However, the statistical<br />
procedures are still important.<br />
Since estimates of variance components<br />
are so central, any issue associated<br />
with such estimates is of<br />
particular concern. For example,<br />
the stability of estimated variance<br />
components was considered by<br />
Cronbach et al. (1972) and subsequently<br />
studied by Smith (1978,<br />
1981, 1982), Brennan (1994), and<br />
Gao (1996) among others.<br />
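As a minimal sketch of what such estimates involve, the following Python function computes the usual ANOVA (expected mean squares) estimates of the variance components for a single-facet crossed persons by items design with one observation per cell. The function name, the dictionary keys, and the toy data are illustrative assumptions and are not taken from any of the studies cited above.

```python
def variance_components_p_by_i(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` is a list of rows, one per person, each row holding that
    person's score on every item (no missing cells).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    if n_p < 2 or n_i < 2:
        raise ValueError("need at least two persons and two items")
    if any(len(row) != n_i for row in scores):
        raise ValueError("every person must have a score on every item")

    grand_mean = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual (pi,e).
    ss_p = n_i * sum((m - grand_mean) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand_mean) ** 2 for m in item_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    # Mean squares.
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "p": (ms_p - ms_pi) / n_i,
        "i": (ms_i - ms_pi) / n_p,
        "pi,e": ms_pi,
    }


# Toy data (hypothetical): three persons by two items, purely additive.
toy_scores = [[0, 1], [2, 3], [4, 5]]
components = variance_components_p_by_i(toy_scores)
print(components)
```

With small samples, estimates obtained this way can vary considerably from one data set to another and can even be negative, which is one reason the stability of estimated variance components is a concern.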
It has long been recognized that<br />
conditional SEMs are not constant<br />
for all examinees. Lord’s (1957,<br />
1959) articles provide perhaps the<br />
best known formula for conditional<br />
SEMs, a formula based on an absolute<br />
definition of error. Conditional,<br />
relative-error SEMs in G<br />
theory were considered by Jarjoura<br />
(1986). Recently, Brennan (1996a)<br />
has extended the work of Lord and<br />
Jarjoura, but much more research<br />
remains to be done.<br />
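For dichotomously scored items, Lord's conditional SEM for an examinee with number-correct score X on n items is commonly written as the square root of X(n - X)/(n - 1). The sketch below assumes that form; the function name and the 40-item example are hypothetical and serve only to show how the SEM changes with score level.

```python
import math


def lord_conditional_sem(num_correct, num_items):
    """Conditional SEM (raw-score metric) under Lord's binomial-error formula."""
    if num_items < 2:
        raise ValueError("num_items must be at least 2")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must lie between 0 and num_items")
    return math.sqrt(num_correct * (num_items - num_correct) / (num_items - 1))


# Illustration on a hypothetical 40-item test: the SEM is largest for
# middle scores and shrinks to zero at the extremes.
for x in (0, 10, 20, 30, 40):
    print(x, round(lord_conditional_sem(x, 40), 3))
```

The loop makes the point of the paragraph concrete: examinees with different scores on the same test do not share one SEM.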
Almost all of G theory and its applications<br />
to date effectively assume<br />
that the scores used to make decisions<br />
about the objects of measurement<br />
(usually examinees) are raw<br />
scores or linear transformations of<br />
raw scores. Often, however, the<br />
scale scores actually used are nonlinear<br />
transformations, and there is<br />
no necessary reason to believe that<br />
results based on a generalizability<br />
analysis of raw scores are directly<br />
relevant for such scale scores. One<br />
common example is the conversion<br />
of raw scores on tasks to “pass/not pass”<br />
status on an assessment (see<br />
Cronbach et al., 1995, 1997). Recently,<br />
Brennan and Lee (1997)<br />
have considered some approaches to<br />
estimating conditional SEMs for<br />
nonlinear transformations of raw<br />
scores, but the role of nonlinear<br />
transformations in G theory is still<br />
largely unexplored.<br />
Brennan (1984) discusses a number<br />
of other statistical topics relevant<br />
to G theory, topics that are<br />
by no means thoroughly researched<br />
as yet. In particular, practitioners<br />
need more readily available procedures<br />
for performing generalizability<br />
analyses in unbalanced<br />
situations.<br />
Twenty-five years ago, in commenting<br />
about the future of G theory,<br />
Cronbach et al. (1972) stated<br />
that:<br />
Because our model treats conditions<br />
within a facet as unordered,<br />
it will not deal adequately with<br />
the stability of scores that are<br />
subject to trends, or to order<br />
effects arising from the measurement<br />
process. . . . A large contribution<br />
will be made by the development<br />
of a model for treating<br />
ordered facets. (p. 364)<br />
Such a contribution has yet to be<br />
made. Furthermore, Rogosa and<br />
Ghandour (1991) suggest that G<br />
theory may not be applicable to certain<br />
statistical models for behavioral<br />
observations, namely situations in<br />
which time is a facet. Their research<br />
deserves further consideration, because<br />
it seems to provide results<br />
that are inconsistent with G theory<br />
(and other traditional psychometric<br />
models).<br />
The final paragraph of The Dependability<br />
of Behavioral Measurements<br />
(Cronbach et al., 1972, p. 388)<br />
states:<br />
Today’s reader, coming to a fully<br />
elaborated generalizability theory<br />
for the first time, no doubt finds it<br />
forbidding. As measurement specialists<br />
become accustomed to its<br />
language and its ways of treating<br />
data, this strangeness will pass.<br />
As the theory is put in different<br />
words by successive writers, it<br />
will be rounded into smoother<br />
form. As it becomes more integrated<br />
with other recent developments<br />
in error theory, and with<br />
the validation theory of which it<br />
is a part, it will become inseparable<br />
from the measurement theory<br />
of the next generation.<br />
The predictions of Cronbach and<br />
his colleagues are only partly fulfilled,<br />
as yet, but they are coming to<br />
pass.<br />
References<br />
Atkinson, M., & Violato, C. (1994).<br />
Neuroticism and coping with anger:<br />
The trans-situational consistency of<br />
coping responses. Journal of Personality<br />
and Individual Differences, 17,<br />
769-782.<br />
Brennan, R. L. (1975). The calculation<br />
of reliability from a split-plot factorial<br />
design. Educational and Psychological<br />
Measurement, 35, 779-788.<br />
Brennan, R. L. (1977). Generalizability<br />
analyses: Principles and procedures<br />
(ACT Technical Bulletin No. 26). Iowa<br />
City: American College Testing.<br />
Brennan, R. L. (1983). Elements of generalizability<br />
theory. Iowa City: American<br />
College Testing.<br />
Brennan, R. L. (1984). Estimating the<br />
dependability of the scores. In R. A.<br />
Berk (Ed.), A guide to criterion-referenced<br />
test construction (pp. 292-334).<br />
Baltimore: Johns Hopkins University<br />
Press.<br />
Brennan, R. L. (1992a). Elements of generalizability<br />
theory (rev. ed.). Iowa<br />
City: American College Testing.<br />
Brennan, R. L. (1992b). Generalizability<br />
theory. Educational Measurement:<br />
Issues and Practice, 11(4), 27-34.<br />
Brennan, R. L. (1994). Variance components<br />
in generalizability theory. In<br />
C. R. Reynolds (Ed.), Cognitive assessment:<br />
A multidisciplinary perspective<br />
(pp. 175-207). New York: Plenum.<br />
Brennan, R. L. (1995a). The conventional<br />
wisdom about group mean<br />
scores. Journal of Educational Measurement,<br />
14, 385-396.<br />
Brennan, R. L. (1995b). Standard setting<br />
from the perspective of generalizability<br />
theory. In Proceedings of the<br />
joint conference on standard setting<br />
for large-scale assessments (Vol. II,<br />
pp. 269-287). Washington, DC: National<br />
Center for Education Statistics<br />
and National Assessment Governing<br />
Board.<br />
Brennan, R. L. (1996a). Conditional<br />
standard errors of measurement in<br />
generalizability theory (ITP Occasional<br />
Paper No. 40). Iowa City:<br />
University of Iowa, Iowa Testing Programs.<br />
Brennan, R. L. (1996b). Generalizability<br />
of performance assessments. In Technical<br />
issues in performance assessments<br />
(pp. 19-58). Washington, DC:<br />
National Center for Education Statistics.<br />
Brennan, R. L., Gao, X., & Colton, D. A.<br />
(1995). Generalizability analyses of<br />
work keys listening and writing tests.<br />
Educational and Psychological Measurement,<br />
55, 157-176.<br />
Brennan, R. L., & Johnson, E. G. (1995).<br />
Generalizability of performance assessments.<br />
Educational Measurement:<br />
Issues and Practice, 14(4), 9-12.<br />
Brennan, R. L., & Kane, M. T. (1977a).<br />
An index of dependability for mastery<br />
tests. Journal of Educational Measurement,<br />
14, 277-289.<br />
Brennan, R. L., & Kane, M. T. (1977b).<br />
Signal/noise ratios for domain-referenced<br />
tests. Psychometrika, 42,<br />
609-625.<br />
Brennan, R. L., & Lee, W. C. (1997).<br />
Conditional standard errors of measurement<br />
for scale scores using binomial<br />
and compound binomial assumptions<br />
(ITP Occasional Paper No.<br />
41). Iowa City: University of Iowa,<br />
Iowa Testing Programs.<br />
Burt, C. (1936). The analysis of examination<br />
marks. In P. Hartog & E. C.<br />
Rhodes (Eds.), The marks of examiners<br />
(pp. 245-314). London: Macmillan.<br />
Burt, C. (1955). Test reliability estimated<br />
by analysis of variance. British<br />
Journal of Statistical Psychology, 8,<br />
103-118.<br />
Cardinet, J., & Allal, L. (1983). Estimation<br />
of generalizability parameters.<br />
In L. J. Fyans (Ed.), New directions<br />
for testing and measurement: Generalizability<br />
theory: Inferences and practical<br />
applications (No. 18, pp. 17-48).<br />
San Francisco: Jossey-Bass.<br />
Cardinet, J., Tourneur, Y., & Allal, L.<br />
(1976a). The generalizability of surveys<br />
of educational outcomes. In D. N.<br />
M. de Gruijter & L. J. T. van der<br />
Kamp (Eds.), Advances in psychological<br />
and educational measurement<br />
(pp. 185-198). New York: Wiley.<br />
Cardinet, J., Tourneur, Y., & Allal, L.<br />
(1976b). The symmetry of generalizability<br />
theory: Applications to educational<br />
measurement. Journal of Educational<br />
Measurement, 13, 119-135.<br />
Cardinet, J., Tourneur, Y., & Allal, L.<br />
(1981). Extensions of generalizability<br />
theory and its applications in educational<br />
measurement. Journal of Educational<br />
Measurement, 18, 183-204.<br />
Cornfield, J., & Tukey, J. W. (1956).<br />
Average values of mean squares in<br />
factorials. Annals of Mathematical<br />
Statistics, 27, 907-949.<br />
Crick, J. E., & Brennan, R. L. (1983).<br />
Manual for GENOVA: A generalized<br />
analysis of variance system (ACT<br />
Technical Bulletin No. 43). Iowa City:<br />
American College Testing.<br />
Cronbach, L. J. (1947). Test “reliability”:<br />
Its meaning and determination. Psychometrika,<br />
12(1), 1-16.<br />
Cronbach, L. J. (1951). Coefficient alpha<br />
and the internal structure of tests.<br />
Psychometrika, 16, 292-334.<br />
Cronbach, L. J. (1976). On the design<br />
of educational measures. In D. N. M.<br />
de Gruijter & L. J. T. van der<br />
Kamp (Eds.), Advances in psychological<br />
and educational measurement<br />
(pp. 199-208). New York: Wiley.<br />
Cronbach, L. J. (1989). Lee J. Cronbach.<br />
In G. Lindzey (Ed.), A history of psychology<br />
in autobiography (Vol. VIII,<br />
pp. 63-93). Stanford: Stanford University<br />
Press.<br />
Cronbach, L. J. (1991). Methodological<br />
studies-A personal retrospective. In<br />
R. E. Snow & D. E. Wiley (Eds.), Improving<br />
inquiry in social science: A<br />
volume in honor of Lee J. Cronbach<br />
(pp. 385-400). Hillsdale, NJ: Erlbaum.<br />
Cronbach, L. J., Gleser, G. C., Nanda,<br />
H., & Rajaratnam, N. (1972). The dependability<br />
of behavioral measurements:<br />
Theory of generalizability for<br />
scores and profiles. New York: Wiley.<br />
(Out of print but available from Books<br />
on Demand)<br />
Cronbach, L. J., Linn, R. L., Brennan,<br />
R. L., & Haertel, E. (1995). Generalizability<br />
analysis for educational assessments<br />
(Evaluation comment). Los<br />
Angeles: University of California,<br />
Center for Research on Evaluation,<br />
Standards, and Student Testing.<br />
Cronbach, L. J., Linn, R. L., Brennan,<br />
R. L., & Haertel, E. (1997). Generalizability<br />
analysis for performance assessments<br />
of student achievement<br />
or school effectiveness. Educational<br />
and Psychological Measurement, 57,<br />
373-399.<br />
Cronbach, L. J., Rajaratnam, N., &<br />
Gleser, G. C. (1963). Theory of<br />
generalizability: A liberalization of<br />
reliability theory. British Journal of<br />
Statistical Psychology, 16, 137-163.<br />
Crump, S. L. (1946). The estimation of<br />
variance components in analysis of<br />
variance. Biometrics Bulletin, 2, 7-11.<br />
Ebel, R. L. (1951). Estimation of the reliability<br />
of ratings. Psychometrika, 16,<br />
407-424.<br />
Eisenhart, C. (1947). The assumptions<br />
underlying analysis of variance. Biometrics,<br />
3, 1-21.<br />
Erlich, O., & Borich, C. (1979). Occurrence<br />
and generalizability of scores on<br />
a classroom interaction instrument.<br />
Journal of Educational Measurement,<br />
16, 11-18.<br />
Erlich, O., & Shavelson, R. J. (1976).<br />
Application of generalizability theory<br />
to the study of teaching (Tech. Rep.<br />
No. 76-9-1). San Francisco: Far West<br />
Laboratory.<br />
Feldt, L. S., & Brennan, R. L. (1989).<br />
Reliability. In R. L. Linn (Ed.), Educational<br />
measurement (3rd ed.,<br />
pp. 127-144). New York: Macmillan.<br />
Finlayson, D. S. (1951). The reliability<br />
of marking essays. British Journal of<br />
Educational Psychology, 35, 143-162.<br />
Winter 1997 19
Fisher, R. A. (1925). Statistical methods<br />
for research workers. London: Oliver<br />
& Boyd.<br />
Gao, X. (1996). Sampling variability<br />
and generalizability of work keys listening<br />
and writing scores (ACT Research<br />
Report No. 96-1). Iowa City:<br />
ACT.<br />
Gao, X., Brennan, R. L., & Shavelson, R.<br />
J. (1994, April). Estimating generalizability<br />
of matrix-sampled science<br />
performance assessments. Paper presented<br />
at the Annual Meeting of the<br />
American Educational Research Association,<br />
New Orleans.<br />
Gao, X., Shavelson, R. J., Brennan, R. L.,<br />
& Baxter, G. P. (1996, April). A multivariate<br />
generalizability theory approach<br />
to convergent validity of<br />
performance-based assessment. Paper<br />
presented at the Annual Meeting of<br />
the National Council on Measurement<br />
in Education, New York.<br />
Gillmore, G. M. (1983). Generalizability<br />
theory: Applications to program evaluation.<br />
In L. J. Fyans (Ed.), New directions<br />
for testing and measurement:<br />
Generalizability theory: Inferences<br />
and practical applications (No. 18,<br />
pp. 3-16). San Francisco: Jossey-Bass.<br />
Gleser, G. C., Cronbach, L. J., & Rajaratnam,<br />
N. (1965). Generalizability<br />
of scores influenced by multiple<br />
sources of variance. Psychometrika,<br />
30, 395-418.<br />
Gulliksen, H. (1950). Theory of mental<br />
tests. New York: Wiley.<br />
Haggard, E. A. (1958). Intraclass correlation<br />
and the analysis of variance.<br />
New York: Dryden.<br />
Hoyt, C. J. (1941). Test reliability estimated<br />
by analysis of variance. Psychometrika,<br />
6, 153-160.<br />
Jarjoura, D. (1986). An estimator of<br />
examinee-level measurement error<br />
variance that considers test form difficulty<br />
adjustments. Applied Psychological<br />
Measurement, 10, 175-186.<br />
Jarjoura, D., & Brennan, R. L. (1982).<br />
A variance components model for<br />
measurement procedures associated<br />
with a table of specifications. Applied<br />
Psychological Measurement, 6,<br />
161-171.<br />
Jarjoura, D., & Brennan, R. L. (1983).<br />
Multivariate generalizability models<br />
for tests developed according to a<br />
table of specifications. In L. J. Fyans<br />
(Ed.), New directions for testing and<br />
measurement: Generalizability theory:<br />
Inferences and practical applications<br />
(No. 18, pp. 83-101). San Francisco:<br />
Jossey-Bass.<br />
Kane, M. T. (1982). A sampling model<br />
for validity. Applied Psychological<br />
Measurement, 6, 125-160.<br />
Kane, M. T., & Brennan, R. L. (1977).<br />
The generalizability of class means.<br />
Review of Educational Research, 47,<br />
267-292.<br />
Kane, M. T., Gillmore, G. M., & Crooks,<br />
T. J. (1976). Student evaluations of<br />
teaching: The generalizability of class<br />
means. Journal of Educational Measurement,<br />
13, 171-183.<br />
Kolen, M. J., & Jarjoura, D. (1984). Item<br />
profile analysis for tests developed according<br />
to a table of specifications.<br />
Applied Psychological Measurement,<br />
8, 219-230.<br />
Kuder, G. F., & Richardson, M. W.<br />
(1937). The theory of the estimation of<br />
test reliability. Psychometrika, 2,<br />
151-160.<br />
Lindquist, E. F. (1953). Design and<br />
analysis of experiments in psychology<br />
and education. Boston: Houghton<br />
Mifflin.<br />
Llabre, M. M., Ironson, G. H., Spitzer,<br />
S. B., Gellman, M. D., Weidler, D. J.,<br />
& Schneiderman, N. (1988). How<br />
many blood pressure measurements<br />
are enough? An application of generalizability<br />
theory to the study of blood<br />
pressure reliability. Psychophysiology,<br />
25, 97-105.<br />
Lord, F. M. (1955). Estimating test reliability.<br />
Educational and Psychological<br />
Measurement, 15, 325-336.<br />
Lord, F. M. (1957). Do tests of the same<br />
length have the same standard errors<br />
of measurement? Educational<br />
and Psychological Measurement, 17,<br />
510-521.<br />
Lord, F. M. (1959). Tests of the same<br />
length do have the same standard<br />
error of measurement. Educational<br />
and Psychological Measurement, 19,<br />
233-239.<br />
Lord, F. M. (1962). Test reliability: A<br />
correction. Educational and Psychological<br />
Measurement, 22, 511-512.<br />
Loveland, E. H. (1952). Measurement of<br />
factors affecting test-retest reliability.<br />
Unpublished doctoral dissertation,<br />
University of Tennessee.<br />
Medley, D. M., Mitzel, H. E., & Doi,<br />
A. N. (1956). Analysis of variance<br />
models and their use in a three-way<br />
design without replication. Journal<br />
of Experimental Education, 24,<br />
221-229.<br />
Nußbaum, A. (1984). Multivariate generalizability<br />
theory in educational<br />
measurement: An empirical study.<br />
Applied Psychological Measurement,<br />
8, 219-230.<br />
Pilliner, A. E. G. (1952). The application<br />
of analysis of variance to problems<br />
of correlation. British Journal<br />
of Psychology, Statistical Section, 5,<br />
31-38.<br />
Rajaratnam, N., Cronbach, L. J., &<br />
Gleser, G. C. (1965). Generalizability<br />
of stratified-parallel tests. Psychometrika,<br />
30, 39-56.<br />
Rogosa, D., & Ghandour, G. (1991).<br />
Statistical models for behavioral<br />
observations. Journal of Educational<br />
Statistics, 3, 157-252.<br />
Saab, P. G., Llabre, M. M., Hurwitz,<br />
B. E., Frame, C. A., Reineke, L. J., Fins,<br />
A. I., McCalla, J., Cieply, L. K., &<br />
Schneiderman, N. (1992). Myocardial<br />
and peripheral vascular responses to<br />
behavioral challenges and their stability<br />
in black and white Americans.<br />
Psychophysiology, 29, 384-397.<br />
Shavelson, R. J., Baxter, G. P., & Gao,<br />
X. (1993). Sampling variability of<br />
performance assessments. Journal<br />
of Educational Measurement, 30,<br />
215-232.<br />
Shavelson, R. J., Baxter, G. P., & Pine,<br />
J. (1991). Performance assessments<br />
in science. Applied Measurement in<br />
Education, 4, 347-362.<br />
Shavelson, R. J., Baxter, G. P., & Pine,<br />
J. (1992). Performance assessments:<br />
The rhetoric and reality. Educational<br />
Researcher, 21(4), 22-27.<br />
Shavelson, R. J., & Webb, N. M. (1981).<br />
Generalizability theory: 1973-1980.<br />
British Journal of Mathematical and<br />
Statistical Psychology, 34, 133-166.<br />
Shavelson, R. J., & Webb, N. M. (1991).<br />
Generalizability theory: A primer.<br />
Newbury Park, CA: Sage.<br />
Shavelson, R. J., Webb, N. M., &<br />
Rowley, G. L. (1989). Generalizability<br />
theory. American Psychologist, 6,<br />
922-932.<br />
Smith, P. L. (1978). Sampling errors of<br />
variance components in small sample<br />
generalizability studies. Journal<br />
of Educational Statistics, 3, 319-<br />
346.<br />
Smith, P. L. (1981). Gaining accuracy in<br />
generalizability theory: Using multiple<br />
designs. Journal of Educational<br />
Measurement, 18, 147-154.<br />
Smith, P. L. (1982). A confidence interval<br />
approach for variance component<br />
estimates in the context of<br />
generalizability theory. Educational<br />
and Psychological Measurement, 42,<br />
459-466.<br />
Webb, N. M., Rowley, G. L., & Shavelson,<br />
R. J. (1988). Using generalizability<br />
theory in counseling and<br />
development. Measurement and Evaluation<br />
in Counseling and Development,<br />
21, 81-90.<br />
Webb, N. M., Shavelson, R. J., Kim, K.<br />
S., & Chen, Z. (1989). Reliability (generalizability)<br />
of job performance measurements:<br />
Navy machinist mates.<br />
Military Psychology, 1, 91-110.<br />
Webb, N. M., Shavelson, R. J., & Maddahian,<br />
E. (1983). Multivariate generalizability<br />
theory. In L. J. Fyans (Ed.),<br />
New directions for testing and measurement:<br />
Generalizability theory: Inferences<br />
and practical applications<br />
(No. 18, pp. 67-81). San Francisco:<br />
Jossey-Bass.<br />
20 Educational Measurement: Issues and Practice