Using Item Response Theory to Score the Myers-Briggs Type Indicator

Taken either singly or together, these criticisms are potentially quite serious. For example, if factor analytic evidence consistently indicates that the 4-factor view of the MBTI is implausible, its psychometric defensibility in assessment situations would be called into question. Likewise, the lack of bimodal distributions in the preference scores, as well as the nontrivial rates of type changes seen in test-retest situations, have been viewed by many researchers as representing serious challenges to the psychometric quality of the MBTI. In the following sections we examine each of these issues.

Criticisms of the MBTI’s Factor Structure

Several exploratory factor analyses of the MBTI have been reported, and some of them (e.g., Comrey, 1983; Sipps, Alexander, & Friedt, 1985) have produced factor structures that their authors viewed as being inconsistent with the predicted 4-factor model. This fact has been cited by critics of the MBTI (e.g., Pittenger, 1993, pp. 474-476) as support for the more general conclusion that “the MBTI does not provide the assessment of personality types that it claims” (Pittenger, p. 475).

However, a number of other exploratory factor analytic studies of the MBTI (e.g., Harvey, Murry, & Stamoulis, 1995; Tischler, 1994; Tzeng, Outcalt, Boyer, Ware, & Landis, 1984) have reported results that show an extremely high degree of correspondence between the recovered factor solutions and the predicted 4-factor structure. What conclusions regarding the MBTI’s factor structure or construct validity should be drawn based on these apparently conflicting findings?

In our assessment, the fact that several exploratory studies have reported findings that closely match the predicted 4-factor structure (e.g., Harvey et al., 1995; Tischler, 1994) is consistent with -- but not definitive proof of -- the validity of the MBTI’s predicted dimensional structure.
Of greater importance, the fact that some exploratory studies produced solutions that did not match the predicted 4-factor structure (e.g., Sipps et al., 1985) says very little either pro or con, given (a) the less-than-optimal sample sizes and factor-analytic decision rules that characterized those studies, as well as (b) the inherent inability of exploratory methods to test the validity of a hypothesized factor model.

Regarding the former issue, the Comrey (1983) and Sipps et al. (1985) findings were based on factor-analytic decisions (e.g., principal components analysis, Varimax rotation) that have been repeatedly criticized in the psychometric literature (e.g., Cliff, 1987; Lee & Comrey, 1979; Snook & Gorsuch, 1989; Tucker, Koopman, & Linn, 1969). With respect to sample size, the Comrey (1983) study had only a 2.5:1 ratio of subjects to items; in such small samples, the likelihood of finding unstable results due to the effects of sampling error increases significantly. In contrast, among the exploratory studies that reported results that were consistent with the predicted 4-factor structure, the Harvey et al.
(1995) study had a 12:1 ratio of subjects to items, and the Tischler (1994) study had a 22:1 ratio; results obtained in samples of these sizes should be much more likely to be stable and valid than those obtained in smaller samples.

Regarding the latter issue, the results of any exploratory factor analysis -- even one performed in a very large sample -- are fundamentally incapable of answering what is essentially a confirmatory question: namely, to what degree does the hypothesized factor structure provide a plausible representation of the observed item-level data? That is, among its other limitations (e.g., subjectivity with respect to determining the number of factors to retain), the exploratory factor model exhibits a fundamental indeterminacy with respect to factor rotation (i.e., an infinite number of different orthogonal or oblique transformations of the factor solution can be made without changing the degree to which it can reproduce, or ‘fit,’ the data matrix). Thus, if the predicted structure is not recovered, this fact provides essentially no evidence regarding the degree to which the hypothesized model would be capable of providing a level of fit that is as good as, or better than, that which is produced by the obtained factor solution.

Fortunately, confirmatory factor analytic methods (e.g., James, Mulaik, & Brett, 1982; Jöreskog & Sörbom, 1981) were developed to address precisely this kind of question. Unlike exploratory factor analysis, confirmatory factor analysis allows the researcher to directly test the degree to which a hypothesized factor model is consistent with the variance/covariance matrix that is observed among the instrument’s items. A major strength of confirmatory factor analysis is that it allows for the possibility of falsifying a hypothesized factor model (i.e., showing that it is inconsistent with the observed data).
That is, if the predicted factor pattern is found to provide a poor level of fit to the observed data, this fact can provide compelling evidence against the validity or plausibility of the predicted factor structure. Thus, although confirmatory methods cannot prove that a given good-fitting model is the best possible model for an instrument (theoretically, it is always possible to postulate the existence of an alternative model that demonstrates an even higher level of fit), they are nevertheless extremely valuable by virtue of their ability to reject poor-fitting models and to rank competing models with respect to the degree to which they fit the observed data.

Although critics of the MBTI’s psychometric properties typically do not cite them, several confirmatory factor analyses of the MBTI have been reported (e.g., Harvey, Murry, & Stamoulis, 1995; Harvey, Murry, & Markham, 1995; Johnson & Saunders, 1990; Thompson & Borrello, 1989), and their results have consistently supported the validity of the predicted 4-factor structure. When considered on its own (e.g., Johnson & Saunders, 1990; Thompson & Borrello, 1989), the predicted MBTI factor structure has been found to provide a plausible representation of the latent structure of this instrument. Of even greater importance, when the predicted 4-factor MBTI model was compared against the alternative factor models advanced by Comrey (1983) and Sipps et al. (1985), the predicted MBTI structure was found to be superior to both of these competing views of its dimensionality (Harvey, Murry, & Stamoulis, 1995). Indeed, the results of the Harvey et al. (1995) study suggested that both the Sipps et al. (1985) and Comrey (1983) models were fundamentally misspecified (i.e., based on the extremely high correlations that were estimated between some of their factors).

However, these factor analytic studies have identified some issues that deserve further study. For example, in the exploratory studies, some MBTI items were found to load strongly on more than one factor; additionally, in both exploratory and confirmatory studies, a nontrivial percentage of the items exhibited only moderate-to-small loadings on their primary factors. Ideally, to maximize the independence and measurement precision of the scales, we would prefer that items load only on the predicted factor, and that all items in a scale demonstrate moderate-to-large loadings on their underlying factor. These findings suggest that the item pools for each of the four main MBTI scales could be broadened to include additional items with higher loadings on the desired latent construct.

Additionally, in studies that examined oblique factor models, consistently nonzero correlations between the SN and JP factors were reported (e.g., Harvey et al., 1995; Pittenger, 1993, p. 475), a finding that has also been seen when the traditional prediction ratio method is used to calculate MBTI preference scores (e.g., Webb, 1964). That is, there is some tendency for individuals who prefer Sensing to be more likely to favor Judging than Perceiving, and for those who favor Intuition to be more likely to favor Perceiving than Judging. Ideally, from a theoretical standpoint (e.g., Myers, 1980, pp.
2-9) one might argue that the four preferences should be mutually orthogonal. However, it must be noted that these SN-JP correlations have generally been quite modest in magnitude (e.g., in the .20’s to .40’s, representing only 4%-16% of shared variance), and that at this point we cannot determine whether the lack of orthogonality is due to redundancy in the conceptual definition of the SN and JP preferences, limitations of the items used to measure these constructs, sampling error, or a combination of the above factors, or whether it simply reflects the fact that some combinations of scores on these two dimensions occur more frequently than others (e.g., SJ is much more common than SP). Further research conducted in larger and more carefully stratified samples is necessary to resolve this question.

In sum, although some secondary issues remain unresolved, a review of the factor analytic research findings indicates quite conclusively that the major criticisms that have been raised regarding the MBTI’s factor structure (e.g., Comrey, 1983; Pittenger, 1993) are not supported by the data, particularly the results of confirmatory factor analyses. On the contrary, a large and growing body of evidence indicates that (a) four major factors underlie the items that are used to compute the MBTI preference scores, (b) the items that define these factors are precisely those that were predicted to do so by the MBTI’s developers, and (c) of all of the competing factor structures that have been proposed to date, the a priori 4-factor solution provides the most plausible representation of the MBTI’s latent structure.

Criticisms Regarding Type Stability and Bimodality

Thus, when one considers the entirety of the factor analytic evidence, the MBTI’s hypothesized 4-factor structure performs quite well; clearly, this is encouraging news for proponents of the MBTI.
However, with respect to criticisms that focus on preference score bimodality and type stability in test-retest situations, until recently there has been less cause for encouragement.

Type stability. The fact that a nontrivial percentage of MBTI respondents change their type assignments on at least one preference dimension on repeated testing has been well documented. For example, Carskadon (1977) reported relatively high test-retest reliabilities over five-week intervals (.78 - .87) for preference scores; however, on retesting, 19% of the subjects changed type on the EI preference, 11% changed on SN, 17% on TF, and 16% on JP. Other studies have produced similar findings: for example, Myers and McCaulley (1985, p. 173) summarized the results of 20 test-retest studies, finding that full-profile type stability rates ranged from 24%-61%, with an average of only 43% of the subjects remaining the same on all four scales on retesting.

Although the levels of test-retest reliability obtained using the continuous preference scores have generally been quite respectable, the levels of instability in the categorical type assignments have presented an inviting target for critics of the MBTI. For example, Pittenger (1993) noted that because “Jung and Briggs and Myers conceived of personality as an invariant” (p. 471), “if each of the 16 types is to represent a very different personality trait, it is hard to reconcile a test that allows individuals to make radical shifts in their type” (p. 472). Under this argument, switching poles on even one of the four preference dimensions represents a significant substantive and interpretative change.

In our assessment, it is unlikely that the majority of these apparent changes in type -- especially those that occur over relatively short intervals of a few weeks or months -- reflect true changes in preference.
Instead, as has been speculated by a number of authors (e.g., Harvey & Murry, 1994; Pittenger, 1993), it is much more likely that these changes are the result of the action of measurement error; in particular, measurement error occurring in the vicinity of the type cutoff score.

That is, for individuals whose true preference scores lie close to the type cutoff point, even a relatively small amount of measurement error could cause their observed preference scores to lie on opposite sides of the cutoff over repeated testings (giving the erroneous appearance of a type switch), despite the fact that the true preferences remain constant over time (i.e., as would be predicted by type theory). For such individuals, the most direct way to improve the MBTI’s level of type stability would be to increase its measurement precision (or reliability).

It is important to note that the above interpretation does not rule out the possibility that some percentage of respondents who appear to change types on repeated testings may truly change their scores on one or more preference dimensions, or that some individuals may simply appear to change types due to careless responding, situational factors, or deliberate misrepresentation. On the contrary, it simply provides an explanation for why individuals who do not suffer from true fluctuations in their preferences would appear to change their types.

In short, the important question concerns the relative percentage of individuals who appear to change type on repeated testing due simply to the action of measurement error near the type cutoff. If such individuals constitute a large percentage of those whose type assignments change on retesting, a strategy for improving the MBTI to reduce such occurrences would then be evident (i.e., increasing its measurement precision near the type cutoff score).

Bimodality. The issue of preference score bimodality is closely linked with the issue of type stability. Although some demonstrations of preference bimodality have been reported in select samples having strongly differentiated types (e.g., Rytting, Ware, & Prince, 1994), there is overwhelming evidence to indicate that MBTI preference distributions in large, unselected samples are not bimodal (e.g., Harvey & Murry, 1994; Hicks, 1984; McCrae & Costa, 1989; Stricker & Ross, 1964). Although this lack of bimodality in MBTI preference scores does not necessarily invalidate the type-based theory on which the instrument is based, it does present a tempting target for critics of the MBTI.
As Pittenger (1993) noted, findings of lack of bimodality “give reason to suspect the claims that types represent separate populations, and that small quantitative differences between scores represent a significant qualitative difference in personality” (p. 471).

Regardless of whether or not one agrees with the assertion that the MBTI must demonstrate bimodal score distributions (as we describe below, in our assessment bimodality is not strictly necessary), the fact remains that the type stability, measurement precision, and bimodality issues are closely linked. Because all psychological tests contain some degree of measurement error, whenever a cutoff score is used to dichotomize a continuous scale it becomes highly advantageous to minimize the relative number of people who score near the cutoff. This is done in order to minimize the chance that even relatively minor errors of measurement could cause a person’s observed score to fall on the opposite side of the cutoff from their true score (i.e., an erroneous type classification). As Pittenger (1993) noted, “an accurate and durable assessment of type cannot be made for those subjects whose scores are close to the zero point [i.e., type cutoff] and [who therefore] have a high probability of crossing that boundary” (p. 472) due simply to the action of measurement error.

In essence, a lack of bimodality in the preference score distributions may exacerbate the problem of type misclassifications due to measurement error near the cutoff score (i.e., because center-weighted distributions have a much higher percentage of individuals scoring near the cutoff). Thus, if measurement precision (i.e., reliability) is held constant, increasing the number of people who score near the type cutoff will unavoidably increase the number of erroneous type classifications, both in test-retest and single-administration situations.
It follows that as a practical matter, the reliability of a scale that is to be dichotomized may need to be significantly higher than the level that would be considered adequate for a test in which a cutoff score is not imposed. Thus, on totally pragmatic grounds, bimodal preference score distributions are much more desirable than center-weighted ones because they reduce the number of erroneous type classifications that would be expected due to measurement error at the cutoff.

As was noted above, one might legitimately question whether it is necessary for a type-based instrument to produce bimodal distributions. Although many researchers (e.g., Pittenger, 1993; Stricker & Ross, 1964) appear to have accepted the argument that bimodal distributions are necessary based largely on theoretical arguments (e.g., Myers with Myers, 1980), opposing arguments can be offered (e.g., Mitchell, 1995). Indeed, at a strictly pragmatic level, there is no difference between setting a cutoff score on the MBTI scales for the purpose of assigning individuals to type categories versus setting a cutoff score on any other psychological scale that lacks a bimodal distribution (which is, of course, the case for most psychological scales). That is, cutoff scores are frequently -- and appropriately -- used with tests that demonstrate center-weighted, normal distributions. For example, in organizational selection it is commonplace to rank employees based on their scores on a cognitive ability test, and to only consider those who score above a minimum cutoff for hiring. In such situations, rarely if ever does the practitioner expect the employment test to demonstrate bimodality, or to minimize the density of the distribution near the cutoff point.
Clearly, bimodality is not a necessary condition for setting a cutoff score on a psychological test.

Thus, although one can argue that bimodality is not a prerequisite for the MBTI to be judged psychometrically adequate, it is nonetheless a highly desirable characteristic due to the MBTI’s use of a cutoff score to assign individuals to the categorical types. Based on the above discussion of the effect of measurement error at the cutoff, it is clear that the bimodality and type-stability issues are inextricably linked, and that the maximum improvement in MBTI test-retest type stability would be expected to occur when improvements in both bimodality and measurement precision at the cutoff are achieved. Thus, one does not have to accept the theory-based argument that a type-based instrument must produce bimodal score distributions in order to appreciate the practical advantages that would obtain if the MBTI’s preference scores were more bimodal in nature.

Strategies for Addressing these Issues

Of all of the criticisms of the MBTI that have been raised to date, it is our assessment that the type-instability issue is one of the most troublesome. That is, if it is true that preferences are inborn, and that by adulthood most individuals achieve reasonably well differentiated types (e.g., Myers & McCaulley, 1988; Myers with Myers, 1980), one would definitely not expect to find that only 24%-61% of individuals retain the same type on all four MBTI dimensions on repeated testing, especially when the administrations are given only a few weeks or months apart. Indeed, when interpreting the empirical data regarding test-retest type stability and preference score distribution shape, critics of the MBTI have concluded that “the patterns of data do not suggest that there is reason to believe that there are 16 unique types of personality” (Pittenger, 1993, p. 483), and that “the four-letter type code is not a stable personality characteristic” (p. 472).

It is important to realize that such conclusions are based on a critical -- and untested -- assumption: namely, that the lack of bimodality and the observed levels of type instability reflect flaws in the MBTI itself. Interestingly, little or no consideration has been given to the alternative viewpoint that these empirical findings do not reflect flaws in the MBTI or its underlying theory, but instead are caused by limitations in the scoring system that is used to convert item responses into the preference scores that are dichotomized to form type assignments.
We contend that before sweeping conclusions regarding the validity of the MBTI can be drawn, researchers must first determine whether improvements in bimodality and type stability can be achieved via modifications to the techniques that are used to score the MBTI and assign categorical types.

Without doubt, the answer to the question of whether revisions to the MBTI scoring system would be able to improve type stability and/or preference score bimodality is of fundamental importance. That is, if a new scoring system were to be developed that is capable of producing more bimodally shaped preference distributions in large, unselected samples of MBTI respondents, this would effectively destroy a key line of evidence on which criticisms of the MBTI instrument -- as well as the type-based theory on which it is founded -- have been based (e.g., Pittenger, 1993, p. 471). Likewise, if a scoring system capable of producing improvements in the MBTI’s measurement precision near the cutoff were to be produced, increased type stability in test-retest situations would be predicted to result, thereby addressing the remaining major empirical criticism of the MBTI.

What strategies, then, should be followed to modify the MBTI’s scoring procedures so as to achieve the objectives of increased bimodality and measurement precision? Given that the lack of bimodality is hardly a new occurrence, having been present in the MBTI’s earlier scoring systems as well (e.g., Stricker & Ross, 1964), there is little reason to believe that simply updating the prediction-ratio based preference scoring weights using new samples of respondents would lead to significant changes in the shapes of the preference score distributions. Indeed, it is unlikely that any alternative number-right or weighted number-right scoring technique that takes a linear-model based approach would be any more likely than the existing weighting system to produce bimodality or improved measurement precision.
For example, Harvey and Murry (1994) examined two alternative scoring methods (i.e., an unweighted count of the number of items answered in the keyed direction, and a linear-model based weighting system using factor scoring coefficients), finding that neither produced any meaningful reductions in the center-weightedness of the preference distributions.

One possibility for improving the test-retest type stability that has been suggested involves increasing the number of categories into which individuals are classified on each preference dimension (Harvey & Murry, 1994). For example, earlier versions of the MBTI were scored using a 3-category system: the two bipolar types (e.g., ‘E’ or ‘I’), plus an indeterminate ‘x’ classification for individuals who scored close to the type cutoff (e.g., see Myers & McCaulley, 1985, chapter 9). It seems reasonable to hypothesize that a sizable percentage of the individuals who switch types on repeated administrations of the MBTI are those whose preference scores lie close to the cutoff. For such individuals, a change of only a few preference score units could cause them to be classified into the opposite type on repeated testing. Stopping the practice of forcing these type-indeterminate individuals into bipolar type categories might produce significant improvements in test-retest stability. Of course, even if an ‘indeterminate’ category is added, the performance of such a system would be greatly facilitated if the shapes of the preference score distributions were also made more bimodal, thereby reducing the number of type-indeterminate individuals.

With respect to methods for changing the procedures used to compute MBTI preference scores in order to improve measurement precision and bimodality, in our assessment the strategy that holds the greatest promise is to use item response theory (IRT) techniques (e.g., Lord & Novick, 1968).
Although only a few studies using IRT scoring of the MBTI have been conducted (Harvey & Murry, 1994; Harvey, Murry, & Markham, 1994; Thomas & Harvey, 1995), their results have been very encouraging. Specifically, they demonstrated that switching to IRT scoring -- without making any substantive changes to the MBTI items themselves -- produces (a) strongly bimodal preference distributions in large, unselected samples of respondents; and (b) scales that produce their maximum measurement precision in the vicinity of the type cutoff (e.g., Harvey & Murry, 1994). Related IRT research (Thomas & Harvey, 1995) has revealed that the degree of measurement precision of the MBTI scales can be further improved through the addition of new items.
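
The link drawn above between distribution shape, measurement precision, and test-retest type stability can be illustrated with a small simulation. The Python sketch below is illustrative only and is not taken from any of the cited studies: the normal error model, the two hypothetical true-score distributions (both scaled to a variance of roughly 1.0, so that reliability is comparable at a given error SD), the error SDs, and the sample size are all assumptions chosen for demonstration.

```python
import random


def retest_agreement(true_scores, error_sd, rng):
    """Proportion of people placed on the same side of the cutoff (0.0) on two testings."""
    same = 0
    for true_score in true_scores:
        # Each testing adds independent, normally distributed measurement error.
        first = true_score + rng.gauss(0.0, error_sd)
        second = true_score + rng.gauss(0.0, error_sd)
        if (first > 0.0) == (second > 0.0):
            same += 1
    return same / len(true_scores)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_people = 20000        # assumed sample size

# Center-weighted true preferences: a single cluster centered on the cutoff.
center_weighted = [rng.gauss(0.0, 1.0) for _ in range(n_people)]

# Bimodal true preferences: clusters at -0.9 and +0.9 with SD 0.436,
# chosen so that the overall variance is also approximately 1.0.
bimodal = [rng.gauss(rng.choice((-0.9, 0.9)), 0.436) for _ in range(n_people)]

results = {}
for error_sd in (0.5, 0.25):
    results[error_sd] = (
        retest_agreement(center_weighted, error_sd, rng),
        retest_agreement(bimodal, error_sd, rng),
    )
    print(f"error SD {error_sd}: center-weighted {results[error_sd][0]:.3f}, "
          f"bimodal {results[error_sd][1]:.3f}")
```

Under these assumptions, single-scale agreement should be higher for the bimodal distribution at each error level, and higher at the smaller error SD for both distributions, which is the pattern the argument predicts.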

IRT Methods in the Context of the MBTI

Before reviewing the results of these studies, we will first provide a brief tutorial on IRT methods, paying specific attention to the ways in which traditional IRT terminology must be translated into the terminology of type theory and the MBTI. Historically, IRT terminology has been deeply rooted in right/wrong, ability-oriented testing methods. Although this ability-oriented terminology is useful in the context of scoring right/wrong, multiple-choice test items, it is somewhat counterproductive when one is attempting to understand how IRT would be used to score personality tests in which (a) “right” or “wrong” answers do not exist, (b) the notion of item “difficulty” has little or no intuitive meaning, and (c) the susceptibility of items to “guessing the correct answer” is not typically a cause for concern.

In this section we briefly describe the fundamentals of IRT methods as they relate to the MBTI; however, a detailed description of IRT is beyond the scope of this article. The reader is referred to one of the standard IRT texts (e.g., Hambleton, Swaminathan, & Rogers, 1991; Hulin, Drasgow, & Parsons, 1983; Lord & Novick, 1968) for a more comprehensive treatment. Our primary goal is to describe the basics of the IRT approach to measurement and explicate the terminological differences that exist between standard descriptions of IRT methods and their application to the specific case of the MBTI.

IRT Terminology

The latent construct, or θ. In IRT, as in classical test theory (CTT), a primary focus of testing is to derive an estimate of each examinee’s score on the latent construct (or set of four bipolar constructs, in the case of the MBTI) being assessed. In CTT, this quantity is termed the true score; in IRT, it is typically termed the latent trait score (which is abbreviated θ, or theta).
In both cases, this score is an unobserved, hypothetical construct (e.g., Intelligence, Extraversion) on which people are assumed to differ, but which cannot be directly quantified. Thus, we are forced to estimate examinees’ scores on the latent construct based on their responses to a set of test items.

The term “latent trait” has a tendency to set off alarms for proponents of type-based theories of personality; indeed, this usage of the term “trait” represents our first encounter with the semantic difficulties that can occur when applying IRT (which is also known as Latent Trait Theory) to the MBTI. It must be stressed that this use of the term “trait” when describing the latent construct being estimated by IRT in no way implies a taking-of-sides in the ongoing “trait vs. type” debate (e.g., Block & Ozer, 1982; Gangestad & Snyder, 1991; Mendelsohn, Weiss, & Feimer, 1982). That is, although the MBTI is based on the notion of discrete types of personality, the MBTI has always used scores on continuous bipolar scales in order to assess the strength and direction of the preference for EI, SN, TF, and JP (i.e., the prediction-ratio based preference scores; e.g., Myers & McCaulley, 1988, p. 9). By dichotomizing these preference scores, individuals can subsequently be assigned to categorical types.

Throughout our discussion of how IRT methods can be used to score the MBTI, it is critically important to keep in mind that the MBTI preference scores estimated using the traditional prediction-ratio method correspond directly to the θ scores estimated by IRT. Thus, IRT takes precisely the same logical approach that has always been used in the MBTI: that is, describing both the strength and direction of the preference for the EI, SN, TF, and JP dimensions using four bipolar continua. Only the computational method involved in computing these continuous preference scores is different.
In effect, whenever the term ‘trait’ or ‘latent trait’ appears in a discussion of IRT methods, one can simply substitute the term ‘preference score’ to understand how IRT would be used to score the MBTI.

Probability of a correct response (PCR). The other quantity that is of fundamental interest in IRT is the likelihood that a given respondent will make a “correct” response to a given item. In ability-oriented testing, we have a clear understanding of what a correct vs. incorrect item response means, and we can easily compute and interpret the percentages of people who respond correctly to each test item. However, when IRT is applied to the MBTI (or to any other test that does not employ right-versus-wrong scoring), what meaning do we attach to this concept?

As it turns out, the lack of a “correct” response to each item poses absolutely no problem with respect to applying IRT scoring methods to the MBTI. That is, although there are no “right” or “wrong” responses, in the traditional MBTI scoring system each possible item response is keyed toward one or the other of the poles of the item’s assigned preference dimension (e.g., the response “thinking” from the word-pair “thinking vs. feeling” is keyed toward the “T” pole of the TF dimension, and the “feeling” response is keyed toward the “F” pole). This keying of items with respect to the poles of each preference continuum provides us with the information that is needed to use IRT to score the MBTI.

In essence, IRT methods simply require that each item be scored dichotomously; although it is common to do so, it is not mandatory that this scoring system be couched in terms of a “correct” versus “incorrect” response. For the MBTI, we need only pick one of the two poles of each scale (e.g., for the EI scale, the “I” preference) as the keyed pole; this choice is essentially arbitrary, and for maximum similarity to the traditional prediction-ratio scoring system (e.g., Myers & McCaulley, 1988, p.
9), item responses have been keyed toward the I, N, F, and P poles in MBTI IRT studies (e.g., Harvey & Murry, 1994). Once a keyed pole is chosen, each MBTI item response is dichotomously scored by determining whether or not it is in the keyed direction. Using the above example, if an individual chose the “thinking” alternative from the “thinking vs. feeling” word pair, this response would not be in the keyed (i.e., “F”) direction; therefore, it would be scored as a zero.
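
The dichotomous keying step described above is simple enough to sketch in a few lines of Python. The function name, the data layout, and the sample responses below are hypothetical and serve only as illustration; the one element taken from the text is the convention of keying toward the I, N, F, and P poles.

```python
# Keyed pole for each scale, following the I, N, F, P convention described in the text.
KEYED_POLE = {"EI": "I", "SN": "N", "TF": "F", "JP": "P"}


def score_response(scale, chosen_pole, keyed_pole=KEYED_POLE):
    """Return 1 if the chosen alternative is keyed toward the scale's keyed pole, else 0."""
    if chosen_pole not in set(scale):
        raise ValueError(f"{chosen_pole!r} is not a pole of the {scale} scale")
    return 1 if chosen_pole == keyed_pole[scale] else 0


# Hypothetical responses: (scale, pole toward which the chosen alternative is keyed).
# The first entry mirrors the text's example of choosing "thinking" on a TF item.
responses = [("TF", "T"), ("TF", "F"), ("EI", "I"), ("EI", "E"), ("JP", "P")]
scored = [score_response(scale, pole) for scale, pole in responses]
print(scored)

# Keying toward the opposite poles simply swaps every 0 and 1.
REVERSED_POLE = {"EI": "E", "SN": "S", "TF": "T", "JP": "J"}
rescored = [score_response(scale, pole, REVERSED_POLE) for scale, pole in responses]
print(rescored)
```

The second pass shows, in miniature, why the choice of keyed pole is arbitrary: the two scorings carry the same information with the direction reversed.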

It must be stressed that this choice of a keyed direction for each scale is entirely arbitrary, and that IRT scoring works equally well regardless of which pole is chosen as the keyed response. That is, the choice of the keyed pole simply determines the direction of the scale (i.e., because the type cutoff point is assigned a value of zero, preference scores that lie in the keyed direction receive positive numbers, and preferences toward the non-keyed pole receive negative numbers). Reversing the keyed pole simply reverses the direction of the θ score continuum.

The item characteristic curve (ICC). The foundation of the IRT approach is the ICC; each item on a test will have its own ICC. In essence, the ICC answers the question, “How are individuals’ scores on the latent construct (i.e., preferences) related to their observed probabilities of endorsing this MBTI item in the keyed (i.e., INFP) direction?” The ICC depicts the form of the functional relation that exists between the latent construct and the PCR. In practice, there are many different ways in which this functional relationship between θ scores and PCRs can be modeled.

One of the simplest ways in which preference scores can be related to the observed item endorsement rates is a model in which higher scores on the latent preference construct are linearly associated with higher likelihoods of endorsing the item in the keyed direction. Hypothetical Item 1 in Figure 1 illustrates an ICC that is primarily linear in nature. In Figure 1, the horizontal axis represents the latent preference score (θ), and the vertical axis represents the likelihood that individuals holding a given preference would endorse this item in the keyed direction (i.e., the PCR). The ICC shows how scores on the latent preference scale correspond to observed item-endorsement rates.

If the ICC for Item 1 in Figure 1 had been obtained for an actual MBTI item (e.g., on the EI scale, one that asked respondents to choose between “good mixer” vs.
“quiet andreserved”), and the EI items were keyed toward theIntrovert pole, individuals having positive scores on the θscale would be Introverts, and those having negative scoreswould be Extraverts (a value of θ = 0.0 serves as the typecutoff score, and the θ metric is scaled in z units). Just aswith traditional prediction-ratio based preference scores,scores that are further away from the type cutoff denotestronger preferences toward that pole of the preferencecontinuum. To determine the predicted likelihood that agroup of individuals who share a given θ score wouldendorse a given item in the keyed direction, simply locatethe desired θ score on the x-axis, and then draw a verticalline until the ICC is reached. By projecting a horizontalline leftward to the y-axis from the ICC, the PCR valueassociated with that θ score can be determined.For example, in Figure 1 individuals who score 0.0 onθ have no clear preference for either the “E” or “I” poles;we would expect 50% of them to endorse this item in the“I” direction and 50% to endorse this item in the “E”direction (note the vertical line drawn at θ = 0, and thehorizontal line drawn at PCR = 0.5). In contrast, whenconsidering a group of individuals who hold a strongpreference toward the Introvert pole (e.g., at θ = +2.5), aPCR value of over 0.80 would be predicted; that is, over80% of these strong Introverts would be expected toendorse the ‘I’ alternative (i.e., “quiet and reserved”), andless than 20% would be expected to endorse the ‘E’alternative (i.e., “good mixer”). Conversely, among agroup of individuals demonstrating a very strong Extravertpreference (e.g., θ = -3.0), a PCR of approximately 0.14would be expected (i.e., only 14% of these strongExtraverts would say they are “quiet and reserved”,whereas 86% would say they are “good mixers”).In sharp contrast to the linear ICC described above, astep function ICC might exist. 
In a step function, a cutoff score on the θ preference scale is effectively present, such that all individuals who score below a given level of θ will fail to endorse the item in the keyed direction, and all individuals who score above this cutoff will endorse it in the keyed direction. Hypothetical Item 2 in Figure 1 depicts an ICC that approximates a step function: here, the cutoff point is at θ = 0.0, and effectively all those who score lower than -0.1 (i.e., the Extraverts) would endorse the non-keyed response (“good mixer”), and all those above 0.1 (i.e., the Introverts) would endorse the keyed response (“quiet and reserved”). Only in the very narrow range of approximately -0.1 to +0.1 around the cutoff point would we observe Extraverts endorsing the “I” alternative and Introverts endorsing the “E” alternative.

Step-function ICCs possess appealing properties in the context of a type-based assessment instrument like the MBTI. That is, if two distinct types of people exist, almost all of the people whose continuous preference scores lie below the cutoff value for Item 2 would be expected to not endorse a response alternative that is keyed toward the opposite pole, whereas almost all of those who score above the cutoff would be expected to endorse the item in the keyed direction. Indeed, if true step-function ICCs like Item 2’s existed in practice, one could effectively develop a single-item test that would measure each individual’s MBTI preference with great accuracy (i.e., if the step function cutoff point coincided precisely with the “natural” cutoff that exists between the two types).

Item information functions. The reason that step-function ICCs are potentially so desirable is that they convey a great deal of information regarding each individual’s standing on each MBTI preference dimension. However, step functions are limited in the sense that the information they provide is confined to a relatively narrow range of scores (i.e., those who score near the cutoff point that defines the “step”).
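The contrast between the two hypothetical ICCs can be sketched numerically. The functions below are illustrative assumptions rather than the curves plotted in Figure 1: the linear ICC uses an assumed slope of 0.12 PCR units per θ unit (chosen because it passes through 0.5 at θ = 0 and lands near the values quoted for θ = +2.5 and θ = -3.0), and the step ICC is idealized as a perfect step at θ = 0.

```python
def linear_icc(theta: float, slope: float = 0.12) -> float:
    """Roughly linear ICC through PCR = 0.5 at theta = 0, clipped to [0, 1].

    The slope of 0.12 is an assumption made for illustration only.
    """
    return min(1.0, max(0.0, 0.5 + slope * theta))


def step_icc(theta: float, cutoff: float = 0.0) -> float:
    """Idealized step ICC: the keyed response is endorsed only above the cutoff."""
    if theta > cutoff:
        return 1.0
    if theta < cutoff:
        return 0.0
    return 0.5  # assumption: exactly at the cutoff both responses are equally likely


for theta in (-3.0, -0.5, 0.0, 0.5, 2.5):
    print(f"theta={theta:+.1f}  linear={linear_icc(theta):.2f}  step={step_icc(theta):.2f}")
```

Under these assumptions the step ICC returns the same PCR for a weak (θ = 0.5) and a strong (θ = 2.5) preference toward the keyed pole, which is why its information is confined to the neighborhood of the cutoff.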
In the context of IRT, the term “information” is used to describe an item’s ability to discriminate between individuals who hold different scores on the latent preference continuum. That is, if the size of the difference between two individuals’ scores on the latent preference continuum is held constant, increasing the amount of information provided by an item makes it easier to discriminate between those individuals (i.e., with respect to the likelihood that they would endorse the item in the keyed direction).

IRT methods allow us to quantify the amount of information provided by each item at any given level of the θ scale via the item information function (IIF). Figure 2 presents the IIFs for the two hypothetical items depicted in Figure 1. As these IIFs illustrate, the linear ICC seen for Item 1 provides a consistent (but small) amount of information across the entire range of θ scores. In contrast, the step-function ICC seen for Item 2 provides a great deal of information near the cutoff point, but very little information elsewhere. Thus, for individuals who endorse Item 2 in the keyed direction, we can be quite confident that their θ scores lie above the cutoff point; however, we have virtually no ability to determine whether they hold a strong, intermediate, or weak preference toward the “I” pole based on their endorsement of Item 2 in the keyed direction. That is, in terms of the expected PCR, there is virtually no difference between a strong (e.g., θ = 2.5) versus a weak (e.g., θ = 0.5) “I” preference with respect to the responses to Item 2; hence, it provides very little information outside the narrow band surrounding its cutoff point.

Of course, due to the action of measurement error, it is extremely unlikely that in an actual testing situation we would encounter ICCs that break as sharply as the one depicted for hypothetical Item 2. More commonly, ICCs tend to assume an intermediate form between the two extremes depicted in Figure 1, producing variants of an “S” shaped ICC. Thus, when applying IRT methods, the fundamental question concerns the kind of ICC that one chooses to employ when modeling the relations between the latent construct and the observed item endorsement rates.
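The information contrast described for the two hypothetical items can be approximated with the general expression for a dichotomous item, in which information at θ equals the squared slope of the ICC divided by P(θ)[1 - P(θ)]. The sketch below is an assumption-laden illustration, not a reconstruction of Figure 2: the linear ICC uses an assumed slope of 0.12, and the step function is replaced by a very steep logistic curve (assumed steepness of 10) so that a slope can be computed.

```python
import math


def linear_icc(theta: float, slope: float = 0.12) -> float:
    """Assumed linear ICC; only meaningful while the result stays inside (0, 1)."""
    return 0.5 + slope * theta


def near_step_icc(theta: float, steepness: float = 10.0) -> float:
    """Very steep logistic curve standing in for a step function (assumption)."""
    return 1.0 / (1.0 + math.exp(-steepness * theta))


def item_information(icc, theta: float, h: float = 1e-5) -> float:
    """Information = (dP/dtheta)^2 / (P * (1 - P)), with a numerical derivative."""
    p = icc(theta)
    if p <= 0.0 or p >= 1.0:
        return 0.0  # the curve is flat at its floor or ceiling
    slope = (icc(theta + h) - icc(theta - h)) / (2.0 * h)
    return slope * slope / (p * (1.0 - p))


for theta in (0.0, 0.5, 2.5):
    print(
        f"theta={theta:.1f}  "
        f"linear={item_information(linear_icc, theta):.4f}  "
        f"near-step={item_information(near_step_icc, theta):.6f}"
    )
```

With these assumed curves the near-step item is far more informative than the linear item at the cutoff and far less informative away from it, which mirrors the qualitative pattern described above.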
In particular, the choice between fitting a linear versus a nonlinear model is critical: as can be seen from the ICCs in Figure 1, it would be profoundly misleading to fit a linear ICC to an item that possessed a true ICC like the one depicted for Item 2. Likewise, it would be highly misleading to force a step-function ICC onto an item that demonstrated an ICC like the one seen for Item 1.

IRT Models for Dichotomously Scored Test Items

IRT models differ primarily in terms of the assumptions they make regarding the ways in which scores on the latent construct (θ) can relate to observed item endorsement rates (PCR). These differences are reflected in the number of parameters that must be estimated in order to “fit” an ICC to each item’s responses.

1-parameter (Rasch) model. One of the simplest answers to the question of how the latent construct is related to the endorsement rates for each item is given by the 1-parameter, or Rasch, model (e.g., Rasch, 1960). Not surprisingly, in the 1-parameter model there is only one characteristic of each item that sets its ICC apart from the ICCs of the other items on the test. Using traditional IRT terminology, this parameter is the difficulty of the item.

Unfortunately, the difficulty parameter represents yet another example of the way in which traditional IRT terminology is awkward when applied to instruments that do not use right/wrong scoring. That is, in a traditional right/wrong test, we define a “difficult” item as being one that few respondents are able to answer correctly (i.e., one with a low p value); conversely, an “easy” item is defined as one that most respondents (even those who score very low on the construct being measured) are able to answer correctly.
However, with the MBTI we are concerned with the question of how likely it would be for a person to make an item response in the keyed direction (i.e., I, N, F, or P), not whether such a response is “right” or “wrong.”

In the present case, the difficulty of an item (denoted b) refers to the degree to which raters will tend to endorse the item in the keyed direction. Thus, items having numerically high b parameters will be the ones that only people who score high in the keyed preference direction will tend to endorse. In contrast, items having low b parameters will tend to be endorsed in the keyed direction even by individuals whose preferences lie strongly toward the non-keyed pole of the preference dimension. The scale of the b parameter is the same as the scale of θ (i.e., standard, or z, units).

An example should help to illustrate the way in which the b parameter can be used to differentiate between test items. Figure 3 presents the ICCs for three actual MBTI items drawn from the EI scale; these ICCs were computed by fitting the 1-parameter IRT model in a sample of 2,499 MBTI profiles (the sample used to compute this and subsequent figures was formed by sampling subjects from the databases used in the Harvey & Murry, 1994, and Harvey et al., 1995, studies, and then adding approximately 600 new raters, primarily college students, who were not used in those studies). Because the EI responses were keyed toward the “I” pole, individuals having Extravert preferences exhibit negative θ scores, and those having Introvert preferences exhibit positive θ scores. For reference, a horizontal line has been drawn at the 50% point of likelihood of item endorsement, and a vertical line at the type cutoff point (i.e., θ = 0.0).

The ICCs in Figure 3 depict the percentages of individuals who share a given θ score that would be expected to endorse each item in the “I” direction.
By comparing the levels of θ at which 50% of raters would be expected to endorse an item in the “I” direction, one can see the way in which the b parameter differentiates among test items. That is, Item 129 has the lowest b parameter; we would expect 50% of individuals who share the moderately strong “E” preference of -0.9 to endorse the “I” alternative for this item (i.e., “not interested in following the latest fashion”). In contrast, Item 33 has the highest b value; for it, the point at which 50% endorse the “I” response (“hard to get to know”) does not occur until a moderately strong “I” preference of 0.9 is achieved.

Thus, for any given level of θ (i.e., true preference on the EI dimension), we would expect to see the highest rates of “I” endorsement occurring for Item 129, followed by Item 50, with the lowest rates of “I” endorsement occurring for Item 33. For example, consider a group of moderately strong Introverts (i.e., θ = 0.9, which represents a score of almost one standard deviation above the mean EI preference score). Among this group of Introverts, we would expect 50% of them to describe themselves as “hard to get to know” (Item 33), 64% as “quiet and reserved” (Item 50), and 86% as “not interested in following the latest fashion” (Item 129). Conversely, for a group of θ = -0.9 Extraverts, we would expect to find that only about 12% describe themselves as “hard to get to know,” 20% as “quiet and reserved,” and 50% as “not interested in following the latest fashion.”

In general, regardless of the specific IRT model that is chosen, the substantive interpretation of the ICC will always be the same: that is, by drawing a line projecting vertically from a given θ score to the ICC, and then projecting a line horizontally to the PCR, one can determine the percentage of people sharing that true level of the preference who would be expected to endorse the item in the keyed direction.

How, then, should the IRT b parameter be interpreted in the context of the MBTI? As the results in Figure 3 illustrate, in the 1-parameter IRT model the only thing that differentiates one test item from another is the horizontal (left-right) location of the ICC on the latent preference scale. As a practical matter, the numerical value of the b parameter is defined directly in terms of the ICC: that is, b is equal to the value of θ that corresponds to a 50% likelihood of endorsing the item in the keyed direction. Thus, for the items presented in Figure 3, the b values are approximately -0.9, 0.35, and 0.9 for Items 129, 50, and 33, respectively.

The b parameter is useful for determining the point on the preference continuum (θ) at which the item will be maximally informative. As a general rule, an item will provide the most information regarding an individual’s θ score at the value of the b parameter (which, not surprisingly, coincides with the point at which the ICC demonstrates its sharpest slope).
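The way the b parameter shifts an ICC left or right can be sketched with the usual logistic form of the 1-parameter model. The b values below are the approximate ones quoted above; the common slope of 1.1 is an assumption (the article does not report it), so the printed percentages should be read as illustrative rather than as a reproduction of Figure 3.

```python
import math

# Approximate b values quoted in the text for the three EI items.
B_VALUES = {"Item 129": -0.9, "Item 50": 0.35, "Item 33": 0.9}

COMMON_SLOPE = 1.1  # assumption: a single slope shared by every item


def icc_1pl(theta: float, b: float, slope: float = COMMON_SLOPE) -> float:
    """Probability of a keyed response under a 1-parameter logistic ICC."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - b)))


for theta in (-0.9, 0.9):
    rates = ", ".join(
        f"{name}: {100 * icc_1pl(theta, b):.0f}%" for name, b in B_VALUES.items()
    )
    print(f"theta = {theta:+.1f} -> {rates}")
```

Because every item shares one slope, the curves never cross: at any θ the item with the lowest b has the highest expected endorsement rate, and each curve passes through 50% exactly at its own b.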
In this context, item information is synonymous with discriminating power (i.e., the ability to differentiate between individuals in terms of their standing on the θ scale of preference). That is, a difference of a given size (e.g., 0.5 θ units) between two individuals with respect to the strength of their preference will translate into a larger expected difference in PCRs as the slope of the ICC increases.

For example, consider Item 129 in Figure 3 (i.e., the leftmost ICC). At its most informative point, a change of one-half standard deviation (SD) in θ between two groups of Extraverts (i.e., -1.2 vs. -0.7) translates into a change of approximately 14 percentage points (i.e., 42% to 56%) in the likelihood of endorsing Item 129 in the ‘I’ direction. In contrast, the same magnitude of θ preference difference between two groups of individuals who score very strongly in the Introvert direction (e.g., 2.5 vs. 3.0) produces virtually no change in the PCRs (i.e., 97-98% “I” endorsement rates would be expected in both groups). Thus, Item 129 is much more informative or discriminating among moderate Extraverts than it is among individuals with strong Introvert preferences (nearly all of whom would endorse the item in the ‘I’ direction).

With respect to the implications of using IRT methods to score the MBTI, the b parameter provides very useful information on each item. In the MBTI, by virtue of the fact that many users are more interested in the categorical type scores than in the continuous preference scores, we need to set a cutoff score on the preference continuum to assign respondents into the type categories. Consequently, we would tend to prefer items that have b values that lie close to the θ = 0.0 point that divides each continuum into categorical types. Thus, considering the items presented in Figure 3, Item 50 would be much more useful than Item 129 with respect to locating individuals on one side or the other of the EI cutoff score.

Conceptually, then, the IRT approach is not especially complicated.
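The half-SD comparison for Item 129 discussed above can be sketched by evaluating one ICC at two pairs of θ values. As before, the logistic form and the slope of 1.1 are assumptions made for illustration; only b = -0.9 comes from the text.

```python
import math


def icc_1pl(theta: float, b: float, slope: float = 1.1) -> float:
    """1-parameter logistic ICC; the slope of 1.1 is an assumed value."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - b)))


def pcr_change(theta_low: float, theta_high: float, b: float) -> float:
    """Difference in expected endorsement rates between two theta levels."""
    return icc_1pl(theta_high, b) - icc_1pl(theta_low, b)


ITEM_129_B = -0.9  # approximate value quoted in the text

near_b = pcr_change(-1.2, -0.7, ITEM_129_B)    # half an SD, straddling b
far_from_b = pcr_change(2.5, 3.0, ITEM_129_B)  # half an SD, far above b

print(f"change near b:     {100 * near_b:.1f} percentage points")
print(f"change far from b: {100 * far_from_b:.1f} percentage points")
```

The same 0.5-unit difference in θ produces a sizable change in the expected endorsement rate near b and almost none far from it, which is the sense in which an item is most discriminating around its own b value.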
The main problem from a practical point of view lies in estimating the unknown b parameters for the MBTI items, and in estimating the scores on the latent preference construct (θ) for each person, given their responses to the test items and our knowledge of the item parameters. The main difference between the IRT approach and older CTT-based approaches to measurement is that we explicitly assume that the relation between the latent construct score and the observed item response may be nonlinear in nature.

2-parameter model. Unfortunately, the 1-parameter IRT model suffers from significant limitations, perhaps the most important being that it assumes that all items on the test are equally discriminating or informative. For many psychological tests (especially personality tests), this is probably an unrealistic assumption. That is, some test items are likely to be stronger indicators of an individual’s underlying preferences than other test items (a fact that is acknowledged by the existing MBTI scoring system, which differentially weights items when computing preference scores). In response to the need to allow test items to be differentially discriminating at their points of maximum discrimination, the 2-parameter IRT model was developed.

In essence, the 2-parameter IRT model is a superset of the 1-parameter model; in addition to the b (“location of maximum information”) parameter, a second parameter (abbreviated a, or the discrimination parameter) was added to allow for the fact that different test items will be differentially informative or discriminating regarding the latent construct. In practical terms, the a parameter defines the slope of the ICC at its point of inflection (which, in the 1- and 2-parameter IRT models, occurs at b units on the θ scale).

Using the 2-parameter model, Figure 4 depicts ICCs for three hypothetical items that have identical b parameters (in this case, b = 0.0), but which differ in terms of their a parameters (a = 0.35, 1.0, and 2.1 for Items 1-3, respectively).
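A minimal sketch of the 2-parameter logistic ICC for these three hypothetical items follows. The scaling constant D = 1.7 is an assumption (it is the conventional constant that brings the logistic curve close to the normal ogive); the article does not state which parameterization was used to draw Figure 4.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the article

# a parameters quoted in the text; all three items share b = 0.0.
A_VALUES = {"Item 1": 0.35, "Item 2": 1.0, "Item 3": 2.1}


def icc_2pl(theta: float, a: float, b: float = 0.0) -> float:
    """Probability of a keyed response under the 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def slope_at_b(a: float) -> float:
    """Slope of the 2PL ICC at theta = b, where P = 0.5 (derivative = D * a / 4)."""
    return D * a / 4.0


for name, a in A_VALUES.items():
    print(f"{name}: P(theta=b) = {icc_2pl(0.0, a):.2f}, slope at b = {slope_at_b(a):.3f}")
```

All three curves pass through 0.5 at θ = b; only the steepness at that point differs, and it grows in direct proportion to a.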
A comparison of the ICCs for these three items graphically illustrates the difference between the 1- and 2-parameter models, and highlights the importance of modeling both the point of maximum information as well as the amount of discrimination that occurs at the point of maximum information. Specifically, Figure 4 illustrates the way in which sharper ICC slopes enhance our ability to discriminate between individuals who differ in their θ scores.

That is, consider two groups of MBTI respondents: Group 1 consists of individuals who have a true EI preference of θ = -0.2 (i.e., a very slight preference toward “E”); Group 2 consists of individuals having a preference of θ = +0.2 (i.e., a slight “I” preference; vertical lines are drawn in Figure 4 at these locations). The horizontal lines drawn in Figure 4 depict the predicted item endorsement rates for Items 1 vs. 3 at these two θ levels. A comparison of the dotted (Item 3) and solid (Item 1) horizontal lines immediately indicates why higher a parameters are more desirable: for Item 1, a difference of only approximately 6% exists between the expected endorsement rates for Groups 1 versus 2; in contrast, a difference of over 36% exists for Item 3. Clearly, responses to Item 3 are much more sensitive to the relatively slight differences in θ scores that exist between Groups 1 and 2.

The implications for using the a parameters to assess the performance of items in the MBTI are not quite as straightforward as for the b parameters. On the one hand, one could argue that “more information is always better,” and that we should prefer items that produce larger amounts of information (i.e., sharper ICC slopes). However, especially in the case of an instrument like the MBTI that uses a cutoff score to dichotomize its continuous preference scores in order to assign categorical type values, the amount of information provided by each item must be balanced against the location on the θ scale at which the item produces its information.
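Both points can be sketched numerically under the 2-parameter logistic model: the gap in expected endorsement rates between Groups 1 and 2, and the information an item delivers at the type cutoff. The scaling constant D = 1.7 is an assumption, as is the second pair of items used for the cutoff comparison (a = 1.0 at b = 0.0 versus a = 2.1 at b = 2.0); the resulting gaps are of the same order as those read from Figure 4 but are not expected to match them exactly.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the article


def icc_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a keyed response under the 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def info_2pl(theta: float, a: float, b: float) -> float:
    """2PL item information: (D * a)^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)


def endorsement_gap(a: float, b: float, low: float = -0.2, high: float = 0.2) -> float:
    """Difference in expected endorsement rates between two theta levels."""
    return icc_2pl(high, a, b) - icc_2pl(low, a, b)


gap_item1 = endorsement_gap(a=0.35, b=0.0)
gap_item3 = endorsement_gap(a=2.1, b=0.0)
print(f"Group 2 minus Group 1: Item 1 = {gap_item1:.3f}, Item 3 = {gap_item3:.3f}")

# Hypothetical comparison at the type cutoff (theta = 0).
info_moderate_near_cutoff = info_2pl(0.0, a=1.0, b=0.0)
info_sharp_far_from_cutoff = info_2pl(0.0, a=2.1, b=2.0)
print(f"information at cutoff: moderate item = {info_moderate_near_cutoff:.3f}, "
      f"sharp but distant item = {info_sharp_far_from_cutoff:.3f}")
```

Under these assumptions the highly discriminating item located 2 SD from the cutoff contributes far less information at θ = 0 than the moderately discriminating item centered on the cutoff.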
Thus, we might very well prefer a moderately discriminating item to a highly discriminating item if the b parameter of the moderately discriminating item was located close to the type cutoff score, and the b for the highly discriminating item was located 2 SD units away from the type cutoff (i.e., causing it to produce relatively little information at the cutoff).

3-parameter model. Although the 2-parameter model’s ability to account for differentially discriminating items offers a valuable improvement over the 1-parameter model, the 2-parameter model can be criticized on the grounds that it assumes that all test items will have zero lower asymptotes for their ICCs (i.e., for individuals with very low scores on the θ scale, the ICCs will flatten out at a value that approaches zero). Although many test items will indeed reach an effectively zero lower asymptote within the normal range of scores (e.g., Items 2 and 3 in Figure 4 do so at -3 and -1.5 z, respectively), some will not.

In the context of right/wrong tests that are subject to attempts to guess the correct answer, it is common to observe nonzero lower asymptotes for the ICCs due to the willingness of respondents to guess when they do not know the correct answer (e.g., for a 4-alternative multiple choice math question, random guessing would be expected to produce a 25% success rate). In the context of instruments that do not use right/wrong scoring (e.g., the MBTI), nonzero lower asymptotes can also occur, although for reasons other than guessing.

In short, nonzero lower asymptotes for items on a personality inventory may reflect the fact that the items are sufficiently skewed in terms of their endorsement properties that even individuals who score very low on the θ scale (i.e., their preferences lie strongly toward the non-keyed alternative) will still endorse the item in the keyed direction at nontrivial rates.
The 3-parameter IRT model allows for this possibility by adding a third parameter for each item (abbreviated c) which defines the PCR that would be expected for people who score strongly toward the non-keyed preference pole (i.e., the effective lower asymptote of the ICC). Although we would not expect there to be many items in the MBTI for which large nonzero c parameters would occur, it is possible that some items would require a nonzero value for the c parameter.

Figure 5 presents the ICCs produced by fitting the 3-parameter IRT model to the three EI items depicted in Figure 3. As a comparison of Figures 3 vs. 5 makes readily apparent, a very different picture of item functioning is produced as a result of choosing a 1- vs. 3-parameter IRT model. In particular, Items 50 and 33 demonstrate a visibly sharper ICC slope than was produced in the 1-parameter model, whereas Item 129 demonstrates a significantly flatter slope than was seen in Figure 3. Figure 6 presents the item information functions for these three items; inspection of these IIFs shows that Item 50 produces substantially more information than Item 33, and that both produce far more information than Item 129 (which produces very little information at any value of θ). Item 50 is made even more desirable by the fact that the peak of its information function lies closest to the type cutoff score (i.e., θ = 0), which should make it the most useful of these three items with respect to distinguishing between individuals who score close to the type cutoff.

The results presented in Figure 5 also indicate that it is quite possible to find MBTI items that even raters who score very strongly toward the non-keyed end of the preference scale will endorse in the keyed direction at nontrivial rates.
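The role of the c parameter can be sketched with the standard 3-parameter logistic form, in which c raises the floor of the curve and the remaining (1 - c) of the probability range follows a 2-parameter logistic curve. The scaling constant D = 1.7 and the item parameters below are hypothetical values chosen for illustration; they are not the estimates behind Figure 5.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the article


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a keyed response under the 3-parameter logistic model."""
    if not 0.0 <= c < 1.0:
        raise ValueError("c must lie in [0, 1)")
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical parameters for a weakly discriminating item with a raised floor.
A, B, C = 0.6, -0.5, 0.25

floor = icc_3pl(-6.0, A, B, C)  # far toward the non-keyed pole
at_b = icc_3pl(B, A, B, C)      # at the item's location parameter

print(f"P at theta = -6.0: {floor:.3f} (approaches c = {C})")
print(f"P at theta = b:    {at_b:.3f} (equals (1 + c) / 2)")
```

Note that once c is nonzero, the endorsement rate at θ = b is (1 + c) / 2 rather than 50%, so b can no longer be read off the ICC at the 50% line.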
For example, the ICC for Item 129 shows that many extremely strong Extraverts endorse this item in the Introvert direction (e.g., at θ = -3.0, approximately 30% of these Extraverts endorse the “I” alternative, “not interested in following new fashions,” instead of the “E” response, “one of the first to follow a new fashion”). This ability to capture different kinds of item response patterns is a major advantage of the 3-parameter IRT model.

Test-level information and SE functions. An important advantage of IRT as a test development and scoring method is that it allows us to obtain a detailed look at the aggregate performance of collections of test items. In particular, we can calculate both test information functions (TIFs) and test standard error (SE) functions to assess the performance of an item pool. TIFs indicate the amount of information or measurement precision that is provided by a test at all possible levels of θ, whereas test-level SE functions indicate the degree of precision to be expected when estimating test scores for examinees at different levels of θ. Thus, the test SE functions represent a continuously variable analog to the global SEM estimate produced by CTT, indicating the degree of error that would be expected when estimating the “true” latent preference scores based on the observed patterns of item responses. Likewise, the test information functions represent a continuously variable analog to the unitary reliability coefficient estimated by CTT: that is, higher values reflect higher measurement precision and freedom from error, and lower values represent less measurement precision and increased uncertainty with respect to estimating scores on the latent construct.

Both of these functions represent tremendous improvements over the simplistic views of reliability and measurement error that are inherent in traditional CTT-based methods. That is, in classical approaches to testing, a test’s reliability is estimated as a single number that is presumed to be constant across the entire possible range of test scores. Likewise, a test’s standard error of measurement (SEM) is presumed to be constant across all possible test values. Both of these assumptions are tenuous; indeed, it is reasonable to expect that most tests will tend to be more precise for respondents who have “average” scores on the latent construct, and less precise for those individuals who hold extreme scores (i.e., tests targeted at an “average” population typically lack items that provide significant levels of information for individuals who score at the extremes of the distribution).

Figure 7 presents the TIFs for a scale composed of the three EI items contained in Figures 3 and 5, as well as for the full EI scale; Figure 8 presents the corresponding SE functions for the 3-item and full-length EI scales.
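The relation between item information, test information, and the SE function can be sketched as follows: the TIF at a given θ is the sum of the item information values, and the SE is the reciprocal of the square root of the TIF. The 3-parameter information formula is the standard one; the scaling constant D = 1.7 and the item parameters are hypothetical, and `ctt_sem` is included only to show how a single CTT reliability maps onto one constant SEM on a z-scaled metric.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the article


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a keyed response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL item information: (D*a)^2 * ((1 - P) / P) * ((P - c) / (1 - c))^2."""
    p = icc_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def scale_information(theta: float, items) -> float:
    """Test information function: the sum of the item information values."""
    return sum(info_3pl(theta, a, b, c) for a, b, c in items)


def standard_error(theta: float, items) -> float:
    """SE of the theta estimate implied by the test information at theta."""
    return 1.0 / math.sqrt(scale_information(theta, items))


def ctt_sem(reliability: float, sd: float = 1.0) -> float:
    """Classical SEM: one value assumed to hold across the whole score range."""
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical (a, b, c) triples; not the estimates behind Figures 7 and 8.
ITEMS = [(1.2, 0.3, 0.0), (0.9, 0.9, 0.05), (0.4, -0.9, 0.25)]

for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}  TIF={scale_information(theta, ITEMS):.3f}  "
          f"SE={standard_error(theta, ITEMS):.3f}")
print(f"CTT SEM for r_xx = .75: {ctt_sem(0.75):.2f}")
```

Because information is additive, adding items can only raise the TIF and lower the SE, which is why a full-length scale is expected to outperform a 3-item scale at every level of θ.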
As Figures 7-8 illustrate, significant improvements in test precision (i.e., higher TIFs, lower SEs) are achieved in the full-length EI scale relative to a 3-item scale. Additionally, both the TIFs and SEs show that measurement precision is not constant across the full range of θ-based preference scores, being significantly better in the middle range of θ scores (peaking at approximately θ = 0.25), and somewhat more precise for the Introvert half of the scale than for the Extravert half (see Figure 8).

These results clearly undermine the CTT assumption that reliability and SEM remain constant across the full range of MBTI preference scores. Based on past studies that have estimated the CTT reliability of the MBTI scales to lie in the .75-.85 range (e.g., Harvey & Murry, 1994; Myers & McCaulley, 1985), two horizontal lines have been drawn in Figures 7-8 at the levels of information/SE that correspond to r_xx = .75 (which produces SEM = .50 for z-scaled variables like θ) and r_xx = .85 (SEM = .39). A comparison of the TIFs and SEs for the full EI scale against these CTT reference lines indicates that the θ scores estimated by IRT would be expected to significantly exceed the levels of measurement precision implied by the unitary CTT estimates in the middle range of θ-based preference scores (i.e., from approximately -0.5 to +1.0 for the .39 SEM, and -1.0 to 1.5 for the .50 SEM), and to fall short of the levels of precision implied by the CTT results outside these ranges.

It is important to stress that these findings do not imply that IRT-based scoring is less precise than CTT-based number-right scoring for preferences that lie outside the above intervals.
On the contrary, they indicate that the levels of measurement precision implied by CTT’s unitary r_xx and SEM statistics are likely to underestimate the effective level of precision for preference scores that fall within approximately .5 to 1 SD of the type cutoff score, and to increasingly overestimate the precision of measurement for preference scores that lie strongly toward either pole of the preference scale.

Is IRT Appropriate for the MBTI?

By this point, the reader might well feel that he or she has seen at least one ICC too many, and perhaps be wondering whether it is really necessary to go to the trouble required to fit these nonlinear ICCs to the MBTI responses. Without a doubt, the IRT approach is somewhat more complex than the prediction-ratio technique that has traditionally been used to score the MBTI. In short, one might question whether or not the increased complexity inherent in the IRT approach is worth the trouble, and whether any evidence exists to indicate that the IRT model actually provides a good “fit” to the MBTI item response patterns.

Fortunately, a very direct method exists for assessing the “fit” of the IRT model; it involves an examination of empirically derived ICCs. Empirical ICCs are essentially scatterplots, defined as follows: the vertical axis of the plot represents the observed rate of item endorsement (PCR), the horizontal axis represents discrete levels of the latent preference score, and the points in the plot represent the percentage of respondents at each level of the latent preference score that endorse the item in the keyed direction.
By visually examining this scatterplot of mean item endorsement rates, we can get an idea of the “true” nature of the relationship between the latent preference dimension and the observed likelihood of item endorsement in the keyed direction for the various levels of the latent construct.

Empirically derived ICCs provide an ideal vehicle for assessing the fit of the IRT model by virtue of the fact that they do not “force” any particular model (e.g., the 3-parameter IRT model) onto the data. That is, the ICCs presented in Figures 3 and 5 are the ones that were produced by fitting the 1- and 3-parameter IRT models to the MBTI item responses; although they look impressive, they essentially have to follow the IRT model, and there is no guarantee that they will actually provide a good fit to the data. In contrast, the empirically derived ICCs are free to adopt any shape that is appropriate for the data. Thus, to the extent that the ICCs produced by the IRT models match the shape of the empirical ICCs, we would conclude that the IRT model provides a good degree of fit to the MBTI data.
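The construction of an empirical ICC can be sketched as a simple binning exercise: respondents are grouped into discrete score intervals, and the proportion endorsing the item in the keyed direction is computed within each interval. The function name `empirical_icc`, the bin width, and the toy data below are all hypothetical; an observed scale score (with the type cutoff at zero) stands in for the unknown latent score.

```python
import math
from collections import defaultdict


def empirical_icc(scores, responses, bin_width=10.0):
    """Return (bin midpoint, proportion keyed, n) for each occupied score bin.

    scores    -- observed preference scores, with the type cutoff at zero
    responses -- 1 if the item was answered in the keyed direction, else 0
    """
    if len(scores) != len(responses):
        raise ValueError("scores and responses must have the same length")
    tallies = defaultdict(lambda: [0, 0])  # bin index -> [keyed count, total]
    for score, response in zip(scores, responses):
        index = math.floor(score / bin_width)
        tallies[index][0] += response
        tallies[index][1] += 1
    return [
        ((index + 0.5) * bin_width, keyed / total, total)
        for index, (keyed, total) in sorted(tallies.items())
    ]


# Toy data for illustration only.
toy_scores = [-25, -15, -12, 5, 8, 25]
toy_responses = [0, 0, 1, 1, 0, 1]

for midpoint, proportion, n in empirical_icc(toy_scores, toy_responses):
    print(f"score bin centered at {midpoint:+.0f}: {100 * proportion:.0f}% keyed (n = {n})")
```

Nothing in this procedure imposes a functional form on the points, so any S-shape that emerges reflects the data rather than the model; the per-bin n is returned because proportions based on very few respondents are unstable.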

As a practical matter, the main difficulty that arises when computing empirical ICCs is in finding a satisfactory method for estimating the latent construct scores. Because we don’t know the “true” preference scores for each examinee, and we can’t use the θ scores that are estimated using IRT (i.e., to avoid creating a logical circularity), it is customary to use the total score on the scale as the best available estimate of the true score. In the present case, the scores computed using the prediction-ratio (PR) preference scoring weights for Form F were used as the estimate of each person’s true score on the latent construct (virtually identical results were also obtained when we used the simple unweighted percentage of items that were answered in the keyed direction as the estimate of the latent construct).

Computationally, the empirical ICCs (see Figures 9-12 for the EI items used in the previous examples, and Figures 13-15 for the top items from the SN, TF, and JP scales) were produced as follows: (a) each person’s net preference score was calculated using the Form F scoring key and placed on a scale with the type cutoff at zero (i.e., preferences toward the keyed pole received positive values, and those toward the non-keyed pole received negative scores); (b) subgroups of raters were formed by breaking the sample into discrete intervals based on their PR preference score (e.g., in Figure 9, all raters scoring 53 toward the “E” pole); (c) for each subgroup, we calculated the percentage of raters in that subgroup that endorsed the item in the keyed direction (e.g., Figure 9 shows that for Item 50, 0% of the raters in the subgroup scoring 53 toward “E” endorsed the item in the “I” direction); finally, (d) for each subgroup, we plotted the percentage of raters that endorsed the item in the keyed direction against the subgroup’s PR-based preference score (smoothed spline interpolations were fitted through this scatterplot in an attempt to capture the “true” ICC for each item).

It is important to emphasize again that unlike the ICCs presented in Figures 3 and 5 (which were estimated using IRT methods and which therefore must follow the form dictated by the 1- or 3-parameter IRT model), the empirically derived ICCs presented in Figures 9-15 are completely unconstrained by the IRT model. Accordingly, they can take on any form that is appropriate in order to depict the functional relationship (if any) that exists between each item response and the traditional PR-based preference scores. Thus, to the degree that we see agreement between the empirically derived ICCs and the ICCs that were generated from the IRT parameter estimates, we will interpret such agreement as validation of the appropriateness of the IRT approach.

As the results in Figures 9-11 illustrate, although the unconstrained empirical ICCs provide a very poor match to the ICCs that were produced using the 1-parameter IRT model (Figure 3), they provide a very good match to the ICCs produced by the 3-parameter model (Figure 5). For example, the empirical ICC for Item 50 demonstrates a very nonlinear, highly discriminating shape (Figure 9); this curve closely matches the ICC estimated by the 3-parameter IRT model (Figure 5) in terms of both its shape and its relative location on the θ axis. Likewise, the empirical ICCs in Figures 10 and 11 for Items 33 and 129 agree quite closely with the 3-parameter model ICCs (Figure 5).

In all cases, there is remarkably little “scatter” around the line that we fit to each scatterplot, a fact that further supports the validity and advisability of using the 3-parameter IRT model to score the MBTI. When one considers the fact that some of these subgroup percentage-endorsement statistics (i.e., the squares in Figures 9-11) are based on quite small Ns, the correspondence between the empirically vs. IRT-derived ICCs becomes even more impressive.
To facilitate the comparison of these ICCs, the empirically derived ICCs for EI items 33, 50, and 129 are presented superimposed upon one another in Figure 12. As a comparison of Figures 5 vs. 12 indicates, there is a great deal of similarity between the empirically vs. IRT-derived ICCs; this similarity is even more notable when one considers the profound differences that exist between the methods that were used to compute the scores that define the horizontal axes in Figure 5 (i.e., maximum likelihood-based estimation of θ using the parameters estimated for the 3-parameter IRT model) vs. Figure 12 (i.e., prediction-ratio based preference scores based on the Form F scoring system).

As a further indicator of the generalizability of the above findings, empirically derived ICCs for high-performance items drawn from the SN, TF, and JP scales (i.e., identified using the Harvey & Murry, 1994, IRT parameters) are presented in Figures 13-15. Inspection of these ICCs again reveals the existence of markedly nonlinear functional relationships between preference scores and the likelihood of endorsing MBTI items in the keyed direction. Clearly, an S-shaped ICC is the most appropriate representation for these MBTI items. As with the EI items, the results in Figures 13-15 indicate that although some items demonstrate their highest discriminating power (i.e., ICC slope) at the type cutoff point (Figure 13), others produce their maximum discriminating power at points below (e.g., Figure 14) and above (e.g., Figure 15) the type cutoff point.
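For reference, the S-shaped curve discussed above can be written in the conventional three-parameter logistic form. The expression below assumes the standard textbook parameterization, with a_i as the discrimination parameter, b_i as the location parameter, and c_i as the lower asymptote; the optional scaling constant D is omitted, and the symbol c_i is an assumption of this sketch.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}},
\qquad
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i} = \frac{a_i\,(1 - c_i)}{4}
```

Under this form the ICC is steepest at θ = b_i, so items with different b_i values discriminate best at different points along the preference continuum.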
The fact that different items tend to produce their maximum discrimination at different points along the preference score continuum is easily modeled using IRT methods (i.e., by assigning different b parameters to the items).

To provide something of a baseline against which to judge the results in Figures 13-15, Figures 16-17 depict empirical ICCs computed by plotting item-endorsement rates against preference scores for dimensions other than the predicted one for the item in question. The ICC shown in Figure 16 is typical of such ICCs; this scatterplot shows that there is virtually no association between scores on the EI preference scale and subgroup item-endorsement percentages on Item 85 (a JP item). Note that there is an appreciably higher level of “scatter” around the line of best fit in this plot, as compared to the empirical ICCs computed for items on their predicted preference dimensions (Figures 9-15), indicating that (as expected) JP item endorsement rates are not consistently predictive of EI preferences.

There are exceptions to the pattern of non-association depicted in Figure 16, however, and most involve comparisons between the SN and JP dimensions. For example, Figure 17 presents a scatterplot of PCR values for Item 85 -- which, as Figure 15 illustrates, is a highly discriminating item with respect to the JP dimension -- against the PR-based preference scores for the SN dimension. As the empirically derived ICC in Figure 17 illustrates, there is a relatively strong (and linear) association between these two axes, such that higher scores on the “N” preference are associated with higher likelihood of endorsing Item 85 in the “P” (i.e., “unplanned” over “scheduled”) direction. This finding is consistent with the oft-reported positive correlation between the SN and JP preference scores (e.g., Harvey & Murry, 1994), and does not necessarily represent cause for concern. Indeed, in cases in which MBTI items are found to have consistent functional relationships with multiple latent preference scales, the possibility of using multidimensional IRT models that are capable of making use of the “collateral information” contained in such items becomes worthy of further study.

Figure 18 presents an empirical ICC in which item endorsement rates for EI Item 116 are plotted against the PR-based EI preferences. As in the earlier empirical ICCs, the results in Figure 18 demonstrate a strong level of fit between the actual MBTI item response patterns and the 3-parameter IRT model.
However, the most notable aspect regarding Item 116’s empirical ICC is that although this item demonstrates strong discriminating power with respect to the EI preference, this discrimination occurs relatively far from the EI type cutoff point (i.e., approximately 41 PR preference units toward the “I” pole). That is, Introverts must possess quite a strong preference toward the “I” pole before they begin to choose the “detached” alternative over the “sociable” alternative in significant numbers.

In view of the fact that Item 116 provides relatively little discriminating power at the type cutoff point, it is not surprising to find that the traditional PR-based scoring system does not view it as being an especially useful one with respect to assessing the EI preference. However, as the empirical ICC in Figure 18 clearly indicates, this item is very useful in discriminating between individuals exhibiting moderate vs. strong preferences toward the “I” pole of the EI scale. This ability to assess the discriminating power of each MBTI item across the full range of preference scores represents yet another point of superiority of the IRT approach over the traditional PR-based scoring system, which is primarily sensitive only to an item’s discriminating power in the vicinity of the type cutoff score.

In sum, using only the observed MBTI endorsement rates and the preference scores produced by the traditional PR-based scoring system, the above findings demonstrate that (a) the relationship between MBTI preferences and observed item endorsement rates is decidedly nonlinear for many items; (b) MBTI items differ widely with respect to the amount of information and discrimination they provide; and (c) the location on the preference scale at which each item provides its maximum information varies considerably for different MBTI items.
These findings strongly support the appropriateness and potential usefulness of the 3-parameter IRT model as a vehicle for capturing the complex dynamics involved in responding to the MBTI’s items. In addition, these results argue strongly against the notion that simpler models (e.g., the 1-parameter IRT model, or systems based on a weighted or unweighted linear model) can provide an adequate representation of the complexity of these item responses. In short, these empirical ICCs indicate that the 3-parameter IRT model provides a very good degree of fit to the MBTI item responses. We turn finally to a review of findings from studies that have attempted to apply the IRT approach to scoring the MBTI.

IRT Research on the MBTI

Empirical studies evaluating IRT-based approaches to scoring the MBTI have only recently begun to appear. However, the results of these initial studies have been very encouraging, especially regarding the ability of IRT scoring to address two of the most-criticized aspects of the MBTI: namely, preference score bimodality, and the degree of measurement precision that exists in the vicinity of the type cutoff scores. Additionally, IRT-based methods of estimating MBTI preference scores offer advantages in other areas, in particular, quantifying the quality or internal consistency of an individual’s profile of MBTI item responses (e.g., to detect potentially invalid profiles).

Bimodal Distributions

As we noted in our review of criticisms that have been raised regarding the MBTI, many authors have attacked it on the grounds that its preference score distributions are not bimodal (e.g., Pittenger, 1993; Stricker & Ross, 1964). Indeed, as the results presented in Harvey and Murry (1994) illustrated, PR-based preference score distributions are highly center-weighted and platykurtic.
This lack of bimodality has at least two important implications: (a) it provides ammunition to those who attempt to challenge the validity of Myers’ type-based personality theory (i.e., if there are basically two distinct “types” of people on each of the MBTI dimensions, it would not be unreasonable to expect to find a somewhat bimodal shape in the preference score distributions); and (b) it exacerbates the already difficult process of accurately assigning individuals to discrete type categories (i.e., whenever a cutoff score is used, we would strongly prefer to minimize the number of individuals who score near the cutoff; unfortunately, the PR-based preference score distributions locate a sizable number of individuals near the cutoff point).

Fortunately, the results of the Harvey and Murry (1994) study -- which was the first to derive and evaluate an IRT-based scoring system for the MBTI -- indicated quite clearly that when the 3-parameter IRT model is used to estimate scores on the continuous preference scales, the resulting score distributions are strongly bimodal. To update these findings, we used the database from which the above empirical ICC results were produced (which adds a number of individuals to the sample used in Harvey & Murry, 1994). Figure 19 presents the frequency distribution for the EI scale’s PR-based preference scores (Figure 19 contains a frequency-count bar for each discrete PR-preference value). In contrast, Figure 20 presents the distribution of the EI θ-based preference score estimates (θ scores contain a significantly higher number of discrete score values; consequently, to facilitate comparison, the number of frequency bars in Figure 20 has been matched to the number of discrete PR-based preference values).

A comparison of Figures 19 vs. 20 indicates that the θ-based preference distribution is strongly bimodal in shape, whereas the PR-based preference scores exhibit a relatively flat distribution in which many individuals score near the type cutoff (very similar results are seen for the remaining three preference dimensions). Although some respondents do indeed score in the vicinity of the type cutoff in the IRT-based distribution, there is a pronounced decrease in the density of individuals scoring in the cutoff region between the two very pronounced modes (which are located approximately ±0.5 units on either side of the type cutoff). A visual examination of the two distributions suggests that fewer individuals score close to the cutoff point in the θ- vs.
PR-based distributions.

Thus, regarding the issue of preference score bimodality, the evidence available to date indicates quite convincingly that bimodal score distributions can be produced by simply changing the technology that is used to estimate preference scores from the observed MBTI item responses. Although bimodal preference distributions have been found in highly selected samples of individuals who demonstrate very strong type differentiation (e.g., Rytting et al., 1994), they have not been seen in larger, more representative samples (e.g., Stricker & Ross, 1964); this fact has been trumpeted by MBTI critics as a serious flaw in both the MBTI instrument and Myers’ type-based personality theory that inspired it. If these results are found by subsequent research to be generalizable to non-student-based samples (which we have every reason to expect, given both the relatively large size of our sample and the fact that the students who attend major universities typically represent a diverse cross-section of the general population), this fact will effectively eliminate one of the major arguments raised by MBTI critics.

Measurement Precision

As we noted in our review of criticisms of the MBTI, many authors have expressed concerns regarding its measurement precision; in particular, the level of score stability that is seen in test-retest situations, and its ability to correctly assign individuals who score close to the type cutoffs to type categories (e.g., Pittenger, 1993).
Earlier, we identified two strategies that could be used to improve the level of test-retest stability and the MBTI’s ability to correctly classify individuals into type categories: (a) decreasing the number of individuals who score close to the type cutoffs by increasing the bimodality of the preference score distributions; and (b) revising the MBTI scoring system to produce a higher level of precision in the vicinity of the type cutoff score.

As a visual examination of the results presented in Figures 19-20 suggests, switching from a PR- to a θ-based scoring system for the MBTI -- without changing a single test item -- appears to provide a means for addressing the bimodality issue. In an attempt to more precisely address the question of whether θ-based scoring reduces the number of individuals scoring close to the type cutoffs, we standardized the PR-based preference scores to have the same mean and SD as the θ-based preferences, and then counted the number of individuals who scored within a band of a given size around each scale’s type cutoff score. Values of ±0.25 and ±0.35 were used when setting these bands; 0.25 is a somewhat arbitrary value, whereas 0.35 approximates the size of a ±1 SEM confidence interval for a scale having a .85 reliability, as well as the size of the SE that would be expected when estimating θ scores at the type cutoff point (see Figure 8). Individuals who score within these bands should be much more likely to be incorrectly classified into a categorical type due to the action of measurement error (either in a single administration, or in a test-retest situation) than those who score outside these zones.

Table 1 presents the numbers of individuals scoring within these two intervals for the PR- and θ-based preferences. As the breakdowns in Table 1 indicate, PR-based preference scoring consistently locates a larger percentage of respondents in the “zone of uncertainty” around the cutoff than the θ-based scoring system.
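The band-counting comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis actually run: the helper names and toy scores are hypothetical, and the sketch assumes that the PR cutoff (zero on the PR scale) should be carried through the same linear rescaling as the scores. The `sem` helper implements the classical formula SD × sqrt(1 − reliability), which for SD = 1 and a reliability of .85 gives roughly 0.39, in the neighborhood of the 0.35 band.

```python
import numpy as np

def match_mean_sd(source, target):
    """Return (slope, intercept) that map `source` onto the mean and SD of `target`."""
    slope = np.std(target) / np.std(source)
    intercept = np.mean(target) - slope * np.mean(source)
    return slope, intercept

def pct_within_band(scores, half_width, cutoff=0.0):
    """Percentage of scores lying within +/- half_width of the cutoff."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * np.mean(np.abs(scores - cutoff) <= half_width)

def sem(sd, reliability):
    """Classical standard error of measurement."""
    return sd * np.sqrt(1.0 - reliability)

# Hypothetical scores for seven respondents (not data from the study).
pr = np.array([-41.0, -23.0, -5.0, 1.0, 7.0, 19.0, 37.0])
theta = np.array([-1.1, -0.7, -0.5, 0.4, 0.6, 0.8, 1.2])

slope, intercept = match_mean_sd(pr, theta)
pr_rescaled = slope * pr + intercept
pr_cutoff = intercept  # assumption: the PR cutoff of zero is rescaled with the scores

for half_width in (0.25, 0.35):
    print(half_width,
          pct_within_band(pr_rescaled, half_width, pr_cutoff),
          pct_within_band(theta, half_width))
```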
Using the number of individuals classified within the ±0.25 and ±0.35 bands by the traditional PR-based scoring system as the basis for comparison, the IRT-based scoring system produces reductions of 37% and 27%, respectively, in the number of MBTI profiles that fall within this zone of uncertainty.

Likewise, comparing the number of individuals that fall within the zone of uncertainty using IRT versus PR scoring, the results in Table 1 indicate that 54% and 36% of the profiles that fall within the uncertainty zone using PR scoring fall outside the zone when using IRT scoring for the .25 and .35 bands, respectively. Conversely, only 4% and 3% of the profiles that fall outside of the uncertainty zone using PR scoring fall inside the zone when using IRT scoring. Again, these results illustrate the sizable reductions in the percentage of individuals who score close to the type cutoff point that are produced simply by switching from a PR-based to a θ-based scoring system for the MBTI item responses.

Figures 21 and 22 present more information on the performance of the IRT-based scoring system; Figure 21 shows a scatterplot of the EI preference scores estimated by PR- vs. IRT-methods, whereas Figure 22 shows a scatterplot of IRT-based preference scores for the EI vs. SN scales. As the plot in Figure 21 illustrates, there is a strong -- but decidedly nonlinear -- association between θ- vs. PR-based preference score estimates. For example, for individuals receiving an identical PR preference score, Figure 21 illustrates how they can receive a relatively broad range of θ-based preference scores. This illustrates a major advantage of θ-based scoring: that is, it doesn’t just matter how many items are endorsed in the keyed direction, it is critically important to determine which items are endorsed in each direction. In short, answers to highly discriminating items are much more diagnostic than answers to items that possess low a parameters; IRT-based scoring automatically takes these factors into account when estimating each individual’s θ-based preference score. Thus, two individuals with the same overall number of “keyed” answers might receive very different θ-based preference scores, depending on which items were endorsed.

The reductions in distribution density near the type cutoff scores that are illustrated in Figures 20 and 22, and quantified in Table 1, provide reason for optimism regarding the ability of IRT scoring to improve the measurement precision of the MBTI (as manifest by test-retest type stability, or with respect to agreement with type values obtained via “true type” methods).
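The earlier point that two respondents with the same number of keyed answers can receive different θ estimates can be illustrated with a small maximum likelihood sketch. Everything here is hypothetical: the three item parameters are invented, the lower asymptotes are set to zero for simplicity, and a plain grid search stands in for whatever estimation routine an operational IRT program would use.

```python
import numpy as np

def p_keyed(theta, a, b, c):
    """3-parameter logistic probability of answering in the keyed direction."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def ml_theta(responses, a, b, c, grid=None):
    """Grid-search maximum likelihood estimate of theta for one respondent."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    u = np.asarray(responses, dtype=float)
    # One row per candidate theta, one column per item.
    p = p_keyed(grid[:, None], a[None, :], b[None, :], c[None, :])
    log_lik = (u * np.log(p) + (1.0 - u) * np.log(1.0 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

# Hypothetical item pool: one highly discriminating item and two weak ones.
a = np.array([2.0, 0.5, 0.5])
b = np.array([0.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 0.0])  # assumption: no lower asymptote

# Both respondents give exactly one keyed answer, but to different items.
theta_first = ml_theta([1, 0, 0], a, b, c)   # keyed answer on the strong item
theta_second = ml_theta([0, 1, 0], a, b, c)  # keyed answer on a weak item
print(theta_first, theta_second)
```

In this toy pool the respondent who endorses the highly discriminating item receives the higher estimate, even though both response patterns contain a single keyed answer.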
For example, in Figure 22, areas of much higher density can be seen in the bivariate distribution of the EI and SN scales (i.e., at the points at which the bimodal peaks are present in the univariate frequency distributions); likewise, areas of low density are seen in areas in which we would prefer to have few if any respondents (e.g., at 0 on both scales, the relatively sparsely populated square in the center of the scatterplot). Researchers now need to conduct empirical studies that compare PR- vs. θ-based MBTI scoring systems in test-retest and “true type” settings; if, as we hypothesize, θ-based scoring is capable of producing improvements in test-retest type stability and higher levels of agreement between MBTI- and “true type”-based type assignments, another major class of criticisms of the MBTI could thereby be addressed.

However, it must be noted that the above results, as well as those obtained in the Harvey, Murry, and Markham (1994) study that examined the measurement precision of various short-form versions of the MBTI, are not uniformly positive. Indeed, these research findings indicate that considerable “room for improvement” exists with respect to the MBTI’s measurement precision. For example, even using the relatively small ±0.25 uncertainty interval in Table 1, 11% of the individuals in the sample have θ-based preference scores that lie close to the type cutoff score, and 19% score in this region using the more liberal ±0.35 interval.
Although these rates represent sizable reductions with respect to the numbers of individuals that fall within the uncertainty region using PR scoring (which locates 18% and 25% of the sample within these zones, respectively), we would ideally prefer to see the number of individuals scoring close to the cutoff approach zero.

Expanding the MBTI item pools to contain new items -- in particular, items that produce highly discriminating ICCs like those presented in Figures 9 and 13-15 -- is the most likely way in which to further improve the MBTI’s measurement precision. As the results of the Harvey, Murry, and Stamoulis (1995) and Harvey and Murry (1994) studies demonstrated, there are relatively few “high-performance” items in the Form G/F item pools; many items demonstrate only moderate levels of discrimination, and a number of items produce relatively poor levels of information (e.g., Figure 11).

The degree to which the MBTI could benefit from the addition of new, high-performance items was demonstrated by the Thomas and Harvey (1995) study, which attempted to write new items that would parallel the content domains of the existing four MBTI scales. Containing an item pool of 200 new items (50 per scale), the Work Styles Inventory (WSI; Thomas, 1994) was field tested on a sample of 583 college students. Based on analyses of this database, Thomas and Harvey (1995) identified a number of the WSI items that, when added to the existing MBTI item pools, produced significantly higher TIFs for the MBTI scales. Figure 23 presents the TIFs for the EI scale that were computed using the Form F MBTI item pool, a long and short version of the WSI EI items, and the combined WSI-plus-MBTI pool.

An inspection of the TIFs presented in Figure 23 reveals that, as hypothesized, it is indeed possible to write new, high-performance items for the four main scales of the MBTI. When added to the existing MBTI scales, these new items produce substantial improvements in the TIFs, relative to the levels produced by the Form F items.
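The effect of adding items on a TIF can be sketched with the standard 3-parameter item information formula, summing item information over the pool. The item parameters below are hypothetical and are not the Form F or WSI estimates; the function names are likewise illustrative, and the scaling constant D is omitted.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Item information under the 3-parameter logistic model (no scaling constant)."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    q = 1.0 - p
    return a**2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def pool_information(theta, a, b, c):
    """TIF: the sum of the item information functions at each theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    info = item_information(theta[:, None], a[None, :], b[None, :], c[None, :])
    return info.sum(axis=1)

theta_grid = np.linspace(-3.0, 3.0, 61)

# Hypothetical existing pool: moderate discrimination.
a_old = np.array([0.8, 1.0, 0.6])
b_old = np.array([0.2, 0.3, 0.1])
c_old = np.zeros(3)

# Hypothetical new items: higher discrimination, located toward the "I" pole.
a_new = np.array([2.0, 1.8])
b_new = np.array([0.8, 0.7])
c_new = np.zeros(2)

tif_old = pool_information(theta_grid, a_old, b_old, c_old)
tif_combined = pool_information(theta_grid,
                                np.concatenate([a_old, a_new]),
                                np.concatenate([b_old, b_new]),
                                np.concatenate([c_old, c_new]))

# Standard error of the theta estimate implied by the TIF at each theta.
se_combined = 1.0 / np.sqrt(tif_combined)
```

Because item information is never negative, adding items can only raise the TIF; where along θ it rises depends on the b parameters of the added items.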
Of course, the results in Figure 23 also indicate that the WSI items leave some “room for improvement,” in particular, with respect to the location of the additional information they provide. That is, the Form F item pool has a TIF that is somewhat biased in favor of assessing individuals scoring toward the “I” pole of the EI scale (i.e., its TIF peaks at approximately 0.25 in the “I” direction). The WSI items, by comparison, are strongly biased in favor of higher precision in the “I” direction, with TIFs peaking at approximately 0.8 units in the “I” direction. For practical use, we would prefer the TIFs to be symmetric, and centered on the cutoff point between the two types. Thus, additional items that produced their highest levels of discrimination in the “E” direction would be needed to balance out these new items.

It is also possible that the measurement precision of the MBTI item pools can be enhanced through the use of some of the “research” items that are included on longer forms of the MBTI (e.g., Form F, J). For example, Form J contains over 190 items that are not part of the Form F/G scoring system; it seems reasonable to hypothesize that the addition of these “research” items to the Form F/G item pools should also produce improvements in the TIFs for the four major MBTI scales. Additional research is needed to evaluate the degree to which new high-performance items can be obtained from the existing “research” item pool.

Conclusions

In this article, we identified a small number of general classes of criticisms that have been directed toward the MBTI. Based on our review, the first of these classes of criticisms -- which claims that the MBTI items do not measure the four latent constructs they seek to measure -- was found to be sharply inconsistent with empirical research findings, particularly the results of recent large-sample exploratory and confirmatory factor analyses. The second class of criticisms -- which involves claims to the effect that the MBTI is flawed because it does not produce bimodally shaped distributions of preference scores -- was likewise found to be unsupported by the data when one considers preference score distributions computed using IRT-based scoring methods. Although traditional PR-based preference scores do not exhibit bimodality, IRT’s θ-based preference score distributions were found to be sharply bimodal in large, unselected samples.

Using the research findings currently available to us, we were unable to dismiss the final class of criticisms -- which deals with claims to the effect that the MBTI is flawed because its levels of test-retest type stability are less than perfect.
However, based on the reductions in the relative number of individuals who score close to the type cutoffs that occur when IRT-based scoring methods are used, as well as the potential for the MBTI’s measurement precision to be increased via the addition of new items, we conclude that it is reasonable to hypothesize that significant improvements in the MBTI’s test-retest type stability may be achievable by switching to IRT-based scoring and/or lengthening the MBTI item pools. Research implementing these strategies is now needed in order that we may determine the degree to which these measurement-precision-based criticisms can be dismissed as convincingly as we have dealt with criticisms based on the MBTI’s factor structure and the bimodality of its preference score distributions.

We also attempted to provide an overview of the IRT model, focusing on the way in which IRT’s traditional “right-wrong” terminology can be adapted to the domain of assessment instruments that are not couched in “right-wrong” terms, and on ways in which one can assess whether the IRT model “fits” the observed item responses. Regarding this latter issue, the results we presented using empirically derived ICCs -- which, by definition, are in no way influenced by the assumptions made by the IRT model -- showed quite convincingly that many MBTI items do indeed demonstrate nonlinear relations with the latent preference constructs, and that the MBTI items differ sharply with respect to both the amount and location of the information they provide with respect to the underlying MBTI preferences.

In conclusion, it is important to note that the traditional prediction-ratio based system of estimating MBTI preference scores has worked well for decades, and it has been very valuable to practitioners by virtue of providing them with a means of scoring the instrument and assigning individuals to type categories.
Clearly, any new system for scoring the MBTI must offer significant advantages or features that cannot be obtained using the traditional PR-based method. In short, we must ask whether it is worth the trouble to change to a new scoring system. Based on the above results, we conclude that IRT-based scoring does offer the kind -- and magnitude -- of improvement needed to justify the change to a new MBTI scoring system.

Specifically, advantages offered by IRT scoring include the following: (a) it produces bimodal score distributions that decrease the number of individuals who score close to the type cutoffs; (b) it offers a scoring system that allows us to differentially weight item responses based on each item’s discriminating power, the point at which it provides its maximum information, and the degree to which individuals who score strongly in the non-keyed direction will tend to endorse it in the keyed direction (all of which should produce more precise estimates of each person’s scores on the preference scales); (c) it allows the development of a version of the MBTI that can be administered using computerized adaptive testing (CAT) technology (which has the potential to significantly reduce testing time while keeping the precision of measurement high); (d) it can produce quantitative indices of the quality and internal consistency of an individual’s MBTI item response profile using appropriateness indices (these may be valuable in identifying invalid response profiles and in resolving cases of type indeterminacy); and (e) it allows sensitive, item-level studies of the degree to which MBTI items tend to perform differently for individuals in different demographic categories (e.g., to identify items suffering from potential gender- or race-based bias).

IRT-based MBTI research has finally started to appear, and although much has been accomplished, much remains to be done.
In particular, studies are needed to determine the degree to which IRT scoring is capable of producing higher test-retest type stability and/or agreement with “true type” assessments, the degree to which MBTI items suffer from race- or sex-based bias, the amount of reduction in testing time that may be possible by using CAT-based administration, the amount of success that may be achieved by using appropriateness indices to spot aberrant or internally inconsistent response profiles, and the degree to which the measurement precision of the MBTI scales can be enhanced via the addition of new items (either from the currently unused “research” items, or from other sources).

References

Block, J., & Ozer, D. J. (1982). Two types of psychologists: Remarks on the Mendelsohn, Weiss, and Feimer contribution. Journal of Personality and Social Psychology, 42, 1171-1181.
Briggs, K. C., & Myers, I. B. (1976). Myers-Briggs Type Indicator: Form F. Palo Alto: Consulting Psychologists Press.
Carlson, J. (1985). Recent assessments of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 49(4), 356-365.
Carlyn, M. (1977). An assessment of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 41, 461-473.
Carskadon, T. G. (1977). Test-retest reliabilities of continuous scores on the Myers-Briggs Type Indicator. Psychological Reports, 41, 1011-1012.
Cliff, N. (1987). The eigenvalue-greater-than-one rule and the reliability of components. Psychological Bulletin, 103, 276-279.
Coe, C. K. (1992). The MBTI: Potential uses and misuses in personnel administration. Public Personnel Management, 21(4), 511-523.
Comrey, A. L. (1983). An evaluation of the Myers-Briggs Type Indicator. Academic Psychology Bulletin, 5, 115-129.
Gangestad, S. W., & Snyder, M. (1991). Taxonomic analysis redux: Some statistical considerations for testing a latent class model. Journal of Personality and Social Psychology, 61, 141-146.
Garden, A. (1989). Organisational size as a variable in type analysis and employee turnover. Journal of Psychological Type, 17, 3-13.
Gauld, V., & Sink, D. (1985). The MBTI as a diagnostic tool in organization development interventions. Journal of Psychological Type, 9, 24-29.
Gough, H. G. (1976). Studying creativity by means of word association tests. Journal of Applied Psychology, 61, 348-353.
Hall, W. B., & MacKinnon, D. W. (1969). Personality inventory correlates of creativity among Architects. Journal of Applied Psychology, 53, 322-326.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harvey, R. J., & Murry, W. D. (1994). Scoring the Myers-Briggs Type Indicator: Empirical comparison of preference score versus latent-trait methods. Journal of Personality Assessment, 62, 116-129.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1994). Evaluation of three short form versions of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 63, 181-184.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1995, May). A “Big Five” Scoring System for the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Harvey, R. J., Murry, W. D., & Stamoulis, D. (1995). Unresolved issues in the dimensionality of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 55, 535-544.
Harvey, R. J., & Thomas, L. A. (1995, May). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Hulin, C., Drasgow, F., & Parsons, C. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Hartzler, G. J., & Hartzler, M. T. (1982). Management uses of the Myers-Briggs Type Indicator. Research in Psychological Type, 5, 20-29.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis. Beverly Hills: Sage.
Johnson, D. A., & Saunders, D. R. (1990). Confirmatory factor analysis of the Myers-Briggs Type Indicator -- Expanded Analysis Report. Educational and Psychological Measurement, 50, 561-571.
Joreskog, K. G., & Sorbom, D. (1981). LISREL V: Analysis of linear structural relationships by maximum likelihood and least squares methods. Chicago: International Educational Services.
Kirton, M. J. (1976). Adaptors and innovators: A description and measure. Journal of Applied Psychology, 61, 622-629.
Lee, H. B., & Comrey, A. L. (1979). Distortions in a commonly used factor analytic procedure. Multivariate Behavioral Research, 14, 301-321.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. C. (1972). A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56, 347-367.
McCarley, N., & Carskadon, T. G. (1983). Test-retest reliabilities of scales and subscales of the Myers-Briggs Type Inventory and of criteria for clinical interpretive hypotheses involving them. Research in Psychological Type, 6, 24-36.
Mendelsohn, G. A., Weiss, D. S., & Feimer, N. R. (1982). Conceptual and empirical analysis of the typological implications of patterns of socialization and femininity. Journal of Personality and Social Psychology, 42, 1157-1170.
Miller, M. L., & Thayer, J. F. (1989). On the existence of discrete classes in personality: Is self-monitoring the correct joint to carve? Journal of Personality and Social Psychology, 57, 143-155.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic methods. Mooresville, IN: Scientific Software.
Mitchell, W. (1995). A clash of paradigms: Why bimodality, ANOVA interactions, and discontinuities are irrelevant criteria for typologies. Unpublished manuscript.
Moore, T. (1987). Personality tests are back. Fortune, March 30, 74-82.
Myers, I. B. (1962). The Myers-Briggs Type Indicator manual. Princeton, NJ: Educational Testing Service.
Myers, I. B., & McCaulley, M. H. (1985). A guide to the development and use of the Myers-Briggs Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
Myers, I. B., with Myers, P. B. (1980). Gifts differing. Palo Alto, CA: Consulting Psychologists Press.
Pittenger, D. J. (1993). The utility of the Myers-Briggs Type Indicator. Review of Educational Research, 63, 467-488.
Poilitt, I. (1982). Managing differences in industry. Research in Psychological Type, 5, 4-19.
Rytting, M., Ware, R., & Prince, R. A. (1994). Bimodal distributions in a sample of CEOs: Validating evidence for the MBTI. Journal of Psychological Type, 31, 16-23.
Sample, J. A., & Hoffman, J. L. (1986). The MBTI as a management and organizational tool. Journal of Psychological Type, 11, 47-50.
Sipps, G. J., Alexander, R. A., & Friedt, L. (1985). Item analysis of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 45, 789-796.
Stricker, L. J., & Ross, J. (1964). Some correlates of a Jungian personality inventory. Psychological Reports, 14, 623-643.
Thomas, L. A. (1994). Unpublished Master’s thesis, Virginia Polytechnic Institute and State University.
Thomas, L. A., & Harvey, R. J. (1995, April). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Thompson, B., & Borrello, G. M. (1986). Construct validity of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 46, 745-752.
Thompson, B., & Borrello, G. M. (1989, January). A confirmatory factor analysis of data from the Myers-Briggs Type Indicator. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor-analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-460.
Tzeng, O. C. S., Outcalt, D., Boyer, S. L., Ware, R., & Landis, D. (1984). Item validity of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 48, 255-256.

Figure 1. ICCs for two hypothetical items that illustrate the range of relations that can exist between the latent construct (θ, on the horizontal axis) and the observed likelihood of item endorsement in the keyed direction (PCR, on the y axis). Item 1 defines an almost linear function, whereas Item 2 approximates a step function. These ICCs were generated using a 2-parameter IRT model in which the b parameters were 0.0, and the a parameters were 0.35 and 17.0 for Items 1 and 2, respectively.
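As an illustration of how curves like those in Figure 1 arise, the following minimal Python sketch evaluates a 2-parameter logistic ICC at a few values of θ, using the a and b values given in the caption. The function name icc_2pl and the scaling constant D are assumptions introduced here; the caption does not state whether a logistic (D = 1) or normal-ogive-scaled (D = 1.7) metric was used, so D = 1 is assumed.

```python
import math


def icc_2pl(theta, a, b=0.0, D=1.0):
    """Probability of the keyed response under a 2-parameter logistic model.

    a is the discrimination parameter, b the location parameter.
    D is a scaling constant (assumed 1.0 here; 1.7 gives the normal metric).
    """
    z = D * a * (theta - b)
    return 1.0 / (1.0 + math.exp(-z))


# Item 1 (a = 0.35) changes slowly with theta; Item 2 (a = 17.0) jumps near b.
for theta in (-1.0, -0.2, 0.0, 0.2, 1.0):
    p1 = icc_2pl(theta, a=0.35)
    p2 = icc_2pl(theta, a=17.0)
    print(f"theta={theta:+.1f}  Item 1 PCR={p1:.3f}  Item 2 PCR={p2:.3f}")
```

Under these assumptions both items cross PCR = .50 at θ = 0.0, but only Item 2 separates respondents who lie just on either side of the cutoff.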

Figure 2. Item information functions for the two hypothetical items presented in Figure 1. The horizontal axis represents the levels of theta, whereas the vertical axis reflects the amount of information contained in each item, across the different levels of theta.
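For readers who wish to reproduce the general shape of these functions, the sketch below uses the standard 2-parameter logistic result that item information equals D²a²P(θ)[1 − P(θ)]. The function names and the choice D = 1 are assumptions made for illustration, not details taken from the analyses reported here.

```python
import math


def icc_2pl(theta, a, b=0.0, D=1.0):
    """2-parameter logistic ICC (D = 1.0 is an assumed scaling constant)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def item_information_2pl(theta, a, b=0.0, D=1.0):
    """Item information for the 2PL model: D^2 * a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)


# Information peaks at theta = b, where P = .5, so the peak equals (D*a)^2 / 4.
for a in (0.35, 17.0):
    peak = item_information_2pl(0.0, a)
    away = item_information_2pl(1.0, a)
    print(f"a={a:5.2f}  info at theta=0: {peak:.4f}  info at theta=1: {away:.6f}")
```

The weakly discriminating item offers a small amount of information spread over a wide range of theta, whereas the step-function item concentrates nearly all of its information in a narrow band around its b parameter.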

Figure 3. 1-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction (the vertical line serves as the cutoff between the types). The PCR (vertical) axis indicates the expected percentage of individuals who would endorse the item in the keyed (“I”) direction for each level of theta (the horizontal line denotes the point at which we would expect 50% of the examinees to endorse the item in the keyed direction). The dotted vertical lines indicate the levels of theta at which 50% of those who hold that preference would endorse the item in the “I” direction.

Figure 4. 2-parameter ICCs for three hypothetical EI items that differ only in terms of their a (discrimination) parameters (Item 1 has a = .35, Item 2 = 1.0, and Item 3 = 2.1). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction; higher scores on the PCR (vertical) axis reflect a higher likelihood of endorsing the keyed (“I”) response. The two vertical lines on the theta axis are drawn to reflect a “slight” preference (Myers & McCaulley, 1985, p. 58) in the “E” (-0.2) and “I” (+0.2) directions. The solid horizontal lines identify the different item endorsement (PCR) rates for Item 1 at these two preferences; the dotted horizontal lines identify the PCRs for Item 3.
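The comparison drawn by the horizontal lines in Figure 4 can be approximated numerically. The sketch below computes, for each of the three a values in the caption, the difference in PCR between a slight “I” preference (θ = +0.2) and a slight “E” preference (θ = -0.2). It assumes b = 0.0 for all three items and a logistic metric with D = 1, neither of which is stated in the caption; pcr_gap is a hypothetical helper name.

```python
import math


def icc_2pl(theta, a, b=0.0, D=1.0):
    """2-parameter logistic ICC; b = 0.0 and D = 1.0 are assumptions."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def pcr_gap(a, theta_low=-0.2, theta_high=0.2):
    """Difference in keyed-response probability between two theta levels."""
    return icc_2pl(theta_high, a) - icc_2pl(theta_low, a)


# Larger a values separate slight E and slight I preferences more sharply.
for label, a in (("Item 1", 0.35), ("Item 2", 1.0), ("Item 3", 2.1)):
    print(f"{label} (a={a}): PCR gap between theta=-0.2 and +0.2 = {pcr_gap(a):.3f}")
```

Under these assumptions the gap for Item 3 is several times larger than the gap for Item 1, which is the point the solid and dotted horizontal lines are intended to convey.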

Figure 5. 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). Higher PCRs are associated with increased levels of endorsement of the “I” alternative.

Figure 6. Item information functions for 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). The vertical axis reflects the amount of information contained in each item, across the different levels of theta.

Figure 7. Test information functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of information contained in the collection of items in each test, across the different levels of theta (larger values are better). The lower horizontal line denotes the amount of information necessary to produce a 0.5 standard error (SE) when estimating the theta score from the item responses; the upper horizontal line corresponds to the level required to produce a 0.39 SE (i.e., the level that would be predicted if the CTT-based reliability of the MBTI scales was 0.85).

Figure 8. Test standard error (SE) functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of precision in estimating the theta score, at each level of theta (smaller values are better). The upper horizontal line denotes an SE of 0.5; the lower line denotes an SE of 0.39 (which corresponds to a CTT reliability of 0.85).
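The reference lines in Figures 7 and 8 are linked by two standard relations: the SE of a theta estimate is the reciprocal square root of the test information, and, for a scale with unit variance, the CTT standard error of measurement is the square root of one minus the reliability. The sketch below works through that arithmetic; the function names are hypothetical, and the unit-variance assumption is ours rather than something stated in the captions.

```python
import math


def se_from_information(info):
    """Standard error of the theta estimate implied by test information."""
    return 1.0 / math.sqrt(info)


def information_for_se(se):
    """Test information needed to reach a target standard error."""
    return 1.0 / se ** 2


def se_from_reliability(reliability, sd=1.0):
    """CTT standard error of measurement; sd = 1.0 is an assumed theta metric."""
    return sd * math.sqrt(1.0 - reliability)


print(f"Information needed for SE = 0.50: {information_for_se(0.50):.2f}")
print(f"Information needed for SE = 0.39: {information_for_se(0.39):.2f}")
print(f"SE implied by reliability 0.85:   {se_from_reliability(0.85):.3f}")
```

Because information and SE are inversely related, the higher of the two information lines in Figure 7 corresponds to the lower of the two SE lines in Figure 8.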

Figure 9. Empirically derived ICC for a high-performance MBTI item from the EI scale (number 50, “good mixer” vs. quiet and reserved). The horizontal axis denotes the EI preference scores (positive values indicating “I” preference, negative values indicating “E” preference) computed using the Form F scoring system. The curved line drawn through the points is a smoothed spline interpolation. The squares denote the actual percentages of individuals at each level of the EI preference who endorsed the item in the “I” direction. Here, higher PCRs are associated with increased likelihood of endorsing the “quiet and reserved” alternative.

Figure 10. Empirically derived ICC for a moderate-performance MBTI item from the EI scale (number 33, easy vs. hard to get to know). Here, higher PCRs are associated with an increased likelihood of endorsing the “hard to get to know” alternative.

Figure 11. Empirically derived ICC for a low-performance MBTI item from the EI scale (number 129, one of first to follow a new fashion vs. not interested). Here, higher PCRs are associated with increased likelihood of endorsing the “not interested in following fashion” alternative.

Figure 12. Overlaid empirically derived ICCs for EI items 33, 50, and 129. A comparison of these ICCs against those produced by the 3-parameter IRT model presented in Figure 5 provides compelling evidence regarding the appropriateness of using the 3-parameter IRT model to score the MBTI.

Figure 13. Empirically derived ICC for a high-performance MBTI item from the SN scale (number 104, concrete v. abstract); scores to the right of the vertical line represent “N” preferences, whereas those to the left represent “S” preferences. Here, higher PCRs are associated with increased likelihood of endorsing the “abstract” alternative.

Figure 14. Empirically derived ICC for a high-performance MBTI item from the TF scale (number 114, feeling v. thinking); scores to the right of the vertical line represent “F” preferences, whereas those to the left represent “T” preferences. Higher PCRs are associated with increased likelihood of endorsing the “feeling” response.

Figure 15. Empirically derived ICC for a high-performance MBTI item from the JP scale (number 85, scheduled v. unplanned); scores to the right of the vertical line represent “P” preferences, whereas those to the left represent “J” preferences. Higher PCRs are associated with increased likelihood of endorsing the “unplanned” response.

Figure 16. Empirically derived ICC for a high-performance JP item (85, scheduled v. unplanned) using scores on the EI preference dimension as the horizontal axis (scores to the right of the vertical line denote “I” preferences, whereas those to the left represent “E” preferences). As would be expected, there is virtually no association between EI preferences and the likelihood of endorsing this item in the “unplanned” (“P”) direction.

Figure 17. Empirically derived ICC for a high-performance JP item (85, scheduled v. unplanned) using scores on the SN preference dimension as the horizontal axis (scores to the right of the vertical line denote “N” preferences, whereas those to the left represent “S” preferences). Reflecting the fact that the SN and JP preferences are not orthogonal, a consistent association can be observed between SN preferences and the PCR rates for this JP item (as expected, intuitives tend to endorse this item in the “unplanned” direction at higher rates than sensors).

Figure 18. Empirically derived ICC for EI item 116 (detached v. sociable) using scores on the EI preference dimension as the horizontal axis. This illustrates an item that would likely be viewed as a low-performance item by the traditional prediction-ratio-based scoring system, but which is viewed as a strongly discriminating item by IRT. The reason for this discrepancy lies in the fact that this item provides its best discrimination for relatively strong Introverts (e.g., in the 40-50 range toward “I”).

Figure 19. Frequency distribution for PR-based preference scores (using Form F key) on the EI dimension.

Figure 20. Frequency distribution for IRT-based preference score estimates on the EI dimension.

Figure 21. Scatterplot of EI preference scores estimated using the traditional PR-based formula (horizontal axis) versus the IRT-based method (vertical axis). The line drawn through the points is the linear regression line.

Figure 22. Scatterplot of EI (vertical axis) versus SN (horizontal axis) preference scores estimated using IRT methods. Note the areas of higher density at approximately 0.5 z units above and below the type cutoff, and the area of low density at the cutoff on each scale (i.e., 0.0).

Figure 23. Test information functions for the EI scales using the Form F MBTI item pools, the 22- and 35-item pools for the EI scale of the Work Styles Inventory (WSI; Thomas, 1994), and the combined MBTI plus WSI EI item pool. Horizontal lines correspond to the levels of information that would produce SE values in estimating theta of .25 and .50.

Table 1
Numbers of MBTI Profiles Scoring Within a Given “Zone of Uncertainty” around the Cutoffs

Each cell gives: Number of Profiles / % of Total / % of Row / % of Column.

±0.25 Interval Around the Cutoff

                                              Outside the Cutoff Region      Inside the Cutoff Region       Total
                                              on PR-Preference               on PR-Preference
Outside Cutoff Region on θ-Based Preference   1984 / 79.4% / 89.3% / 96.4%   237 / 9.5% / 10.7% / 53.6%     2221 / 88.9%
Inside Cutoff Region on θ-Based Preference    73 / 2.9% / 26.3% / 3.6%       205 / 8.2% / 73.7% / 46.4%     278 / 11.1%
Total                                         2057 / 82.3%                   442 / 17.7%                    2499 / 100%

±0.35 Interval Around the Cutoff

                                              Outside the Cutoff Region      Inside the Cutoff Region       Total
                                              on PR-Preference               on PR-Preference
Outside Cutoff Region on θ-Based Preference   1809 / 72.4% / 88.9% / 96.9%   226 / 9.0% / 11.1% / 35.8%     2035 / 81.4%
Inside Cutoff Region on θ-Based Preference    58 / 2.3% / 12.5% / 3.1%       406 / 16.3% / 87.5% / 64.2%    464 / 18.6%
Total                                         1867 / 74.7%                   632 / 25.3%                    2499 / 100%
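The percentages in Table 1 can be checked from the four cell counts of each panel. The sketch below does so for the ±0.25 panel (1984, 237, 73, and 205 profiles). It assumes that rows index the θ-based classification and columns the PR-based classification, which is the arrangement under which the printed row and column percentages are reproduced; the function name crosstab_percentages is hypothetical.

```python
def crosstab_percentages(table):
    """Return per-cell % of total, % of row, and % of column for a 2-D count table."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cells = []
    for i, row in enumerate(table):
        cells.append([
            {
                "n": n,
                "pct_total": 100.0 * n / total,
                "pct_row": 100.0 * n / row_totals[i],
                "pct_col": 100.0 * n / col_totals[j],
            }
            for j, n in enumerate(row)
        ])
    return cells, row_totals, col_totals, total


# Rows: outside/inside the cutoff region on the theta-based preference (assumed).
# Columns: outside/inside the cutoff region on the PR-based preference (assumed).
counts_025 = [[1984, 237],
              [73, 205]]

cells, row_totals, col_totals, total = crosstab_percentages(counts_025)
for row in cells:
    for cell in row:
        print(f"n={cell['n']:5d}  total={cell['pct_total']:.1f}%  "
              f"row={cell['pct_row']:.1f}%  col={cell['pct_col']:.1f}%")
print("row totals:", row_totals, "column totals:", col_totals, "N:", total)
```

Recomputed this way, the values agree with the printed percentages to within about 0.1 percentage point; the small residual differences are consistent with rounding in the original output.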
