Using Item Response Theory to Score the Myers-Briggs Type Indicator
Taken either singly or together, these criticisms are potentially quite serious. For example, if factor analytic evidence consistently indicates that the 4-factor view of the MBTI is implausible, its psychometric defensibility in assessment situations would be called into question. Likewise, the lack of bimodal distributions in the preference scores, as well as the nontrivial rates of type changes seen in test-retest situations, have been viewed by many researchers as representing serious challenges to the psychometric quality of the MBTI. In the following sections we examine each of these issues.

Criticisms of the MBTI’s Factor Structure

Several exploratory factor analyses of the MBTI have been reported, and some of them (e.g., Comrey, 1983; Sipps, Alexander, & Friedt, 1985) have produced factor structures that their authors viewed as being inconsistent with the predicted 4-factor model. This fact has been cited by critics of the MBTI (e.g., Pittenger, 1993, pp. 474-476) as support for the more general conclusion that “the MBTI does not provide the assessment of personality types that it claims” (Pittenger, 1993, p. 475).

However, a number of other exploratory factor analytic studies of the MBTI (e.g., Harvey, Murry, & Stamoulis, 1995; Tischler, 1994; Tzeng, Outcalt, Boyer, Ware, & Landis, 1984) have reported results that show an extremely high degree of correspondence between the recovered factor solutions and the predicted 4-factor structure. What conclusions regarding the MBTI’s factor structure or construct validity should be drawn based on these apparently conflicting findings?

In our assessment, the fact that several exploratory studies have reported findings that closely match the predicted 4-factor structure (e.g., Harvey et al., 1995; Tischler, 1994) is consistent with -- but not definitive proof of -- the validity of the MBTI’s predicted dimensional structure. Of greater importance, the fact that some exploratory studies produced solutions that did not match the predicted 4-factor structure (e.g., Sipps et al., 1985) says very little either pro or con, given (a) the less-than-optimal sample sizes and factor-analytic decision rules that characterized those studies, as well as (b) the inherent inability of exploratory methods to test the validity of a hypothesized factor model.

Regarding the former issue, the Comrey (1983) and Sipps et al. (1985) findings were based on factor-analytic decisions (e.g., principal components analysis, Varimax rotation) that have been repeatedly criticized in the psychometric literature (e.g., Cliff, 1987; Lee & Comrey, 1979; Snook & Gorsuch, 1989; Tucker, Koopman, & Linn, 1969). With respect to sample size, the Comrey (1983) study demonstrated only a 2.5:1 ratio of subjects to items; in such small samples, the likelihood of finding unstable results due to the effects of sampling error increases significantly. In contrast, among the exploratory studies that reported results that were consistent with the predicted 4-factor structure, the Harvey et al. (1995) study had a 12:1 ratio of subjects to items, and the Tischler (1994) study had a 22:1 ratio; results obtained in samples of these sizes should be much more likely to be stable and valid than those obtained in smaller samples.

Regarding the latter issue, the results of any exploratory factor analysis -- even one performed in a very large sample -- are fundamentally incapable of answering what is essentially a confirmatory question: namely, to what degree does the hypothesized factor structure provide a plausible representation of the observed item-level data? That is, among its other limitations (e.g., subjectivity with respect to determining the number of factors to retain), the exploratory factor model exhibits a fundamental indeterminacy with respect to factor rotation (i.e., an infinite number of different orthogonal or oblique transformations of the factor solution can be made without changing the degree to which it can reproduce, or ‘fit,’ the data matrix). Thus, if the predicted structure is not recovered, this fact provides essentially no evidence regarding the degree to which the hypothesized model would be capable of providing a level of fit that is as good as, or better than, that which is produced by the obtained factor solution.

Fortunately, confirmatory factor analytic methods (e.g., James, Mulaik, & Brett, 1982; Jöreskog & Sörbom, 1981) were developed to address precisely this kind of question. Unlike exploratory factor analysis, confirmatory factor analysis allows the researcher to directly test the degree to which a hypothesized factor model is consistent with the variance/covariance matrix that is observed among the instrument’s items. A major strength of confirmatory factor analysis is that it allows for the possibility of falsifying a hypothesized factor model (i.e., showing that it is inconsistent with the observed data).
That is, if the predicted factor pattern is found to provide a poor level of fit to the observed data, this fact can provide compelling evidence against the validity or plausibility of the predicted factor structure. Thus, although confirmatory methods cannot prove that a given good-fitting model is the best possible model for an instrument (theoretically, it is always possible to postulate the existence of an alternative model that demonstrates an even higher level of fit), they are nevertheless extremely valuable by virtue of their ability to reject poor-fitting models and to rank competing models with respect to the degree to which they fit the observed data.

Although studies that criticize the psychometric properties of the MBTI typically do not cite their findings, several confirmatory factor analyses of the MBTI have been reported (e.g., Harvey, Murry, & Stamoulis, 1995; Harvey, Murry, & Markham, 1995; Johnson & Saunders, 1990; Thompson & Borrello, 1989), and their results have consistently supported the validity of the predicted 4-factor structure. When considered on its own (e.g., Johnson & Saunders, 1990; Thompson & Borrello, 1989), the predicted MBTI factor structure has been found to provide a plausible representation of the latent structure of this instrument. Of even greater importance, when the predicted 4-factor MBTI model was compared against the alternative factor models advanced by Comrey (1983) and Sipps et al. (1985), the predicted MBTI structure was found to be superior to both of these competing views of its dimensionality (Harvey, Murry, & Stamoulis, 1995). Indeed, the results of the Harvey, Murry, and Stamoulis (1995) study suggested that both the Sipps et al. (1985) and Comrey (1983) models were fundamentally misspecified (i.e., based on the extremely high correlations that were estimated between some of their factors).

However, these factor analytic studies have identified some issues that deserve further study. For example, in the exploratory studies, some MBTI items were found to load strongly on more than one factor; additionally, in both exploratory and confirmatory studies, a nontrivial percentage of the items exhibited only moderate-to-small loadings on their primary factors. Ideally, to maximize the independence and measurement precision of the scales, we would prefer that items load only on the predicted factor, and that all items in a scale demonstrate moderate-to-large loadings on their underlying factor.
These findings suggest that the item pools for each of the four main MBTI scales could be broadened to include additional items with higher loadings on the desired latent construct.

Additionally, in studies that examined oblique factor models, consistently nonzero correlations between the SN and JP factors were reported (e.g., Harvey et al., 1995; Pittenger, 1993, p. 475), a finding that has also been seen when the traditional prediction ratio method is used to calculate MBTI preference scores (e.g., Webb, 1964). That is, there is some tendency for individuals who prefer Sensing to be more likely to favor Judging than Perceiving, and for those who favor Intuition to be more likely to favor Perceiving than Judging. Ideally, from a theoretical standpoint (e.g., Myers, 1980, pp. 2-9) one might argue that the four preferences should be mutually orthogonal. However, it must be noted that these SN-JP correlations have generally been quite modest in magnitude (e.g., in the .20’s to .40’s, representing only 4%-16% of shared variance), and that at this point we cannot determine whether the lack of orthogonality is due to redundancy in the conceptual definition of the SN and JP preferences, limitations of the items used to measure these constructs, sampling error, or a combination of the above factors, or whether it simply reflects the fact that some combinations of scores on these two dimensions occur more frequently than others (e.g., SJ is much more common than SP). Further research conducted in larger and more carefully stratified samples is necessary to resolve this question.

In sum, although some secondary issues remain unresolved, a review of the factor analytic research findings indicates quite conclusively that the major criticisms that have been raised regarding the MBTI’s factor structure (e.g., Comrey, 1983; Pittenger, 1993) are not supported by the data, particularly the results of confirmatory factor analyses.
On the contrary, a large and growing body of evidence indicates that (a) four major factors underlie the items that are used to compute the MBTI preference scores, (b) the items that define these factors are precisely those that were predicted to do so by the MBTI’s developers, and (c) of all of the competing factor structures that have been proposed to date, the a priori 4-factor solution provides the most plausible representation of the MBTI’s latent structure.

Criticisms Regarding Type Stability and Bimodality

Thus, when one considers the entirety of the factor analytic evidence, the MBTI’s hypothesized 4-factor structure performs quite well; clearly, this is encouraging news for proponents of the MBTI. However, with respect to criticisms that focus on preference score bimodality and type stability in test-retest situations, until recently there has been less cause for encouragement.

Type stability. The fact that a nontrivial percentage of MBTI respondents change their type assignments on at least one preference dimension on repeated testing has been well documented. For example, Carskadon (1977) reported relatively high test-retest reliabilities over five-week intervals (.78-.87) for preference scores; however, on retesting, 19% of the subjects changed type on the EI preference, 11% changed on SN, 17% on TF, and 16% on JP. Other studies have produced similar findings: for example, Myers and McCaulley (1985, p. 173) summarized the results of 20 test-retest studies, finding that full-profile type stability rates ranged from 24% to 61%, with an average of only 43% of the subjects remaining the same on all four scales on retesting.

Although the levels of test-retest reliability obtained using the continuous preference scores have generally been quite respectable, the levels of instability in the categorical type assignments have presented an inviting target for critics of the MBTI. For example, Pittenger (1993) noted that because “Jung and Briggs and Myers conceived of personality as an invariant” (p. 471), “if each of the 16 types is to represent a very different personality trait, it is hard to reconcile a test that allows individuals to make radical shifts in their type” (p. 472). Under this argument, switching poles on even one of the four preference dimensions represents a significant substantive and interpretative change.

In our assessment, it is unlikely that the majority of these apparent changes in type -- especially those that occur over relatively short intervals of a few weeks or months -- reflect true changes in preference.
Instead, as has been speculated by a number of authors (e.g., Harvey & Murry, 1994; Pittenger, 1993), it is much more likely that these changes are the result of the action of measurement error; in particular, measurement error occurring in the vicinity of the type cutoff score.

That is, for individuals whose true preference scores lie close to the type cutoff point, even a relatively small amount of measurement error could cause their observed preference scores to lie on opposite sides of the cutoff over repeated testings (giving the erroneous appearance of a type switch), despite the fact that the true preferences remain constant over time (i.e., as would be predicted by type theory). For such individuals, the most direct way to improve the MBTI’s level of type stability would be to increase its measurement precision (or reliability).

It is important to note that the above interpretation does not rule out the possibility that some percentage of respondents who appear to change types on repeated testings may truly change their scores on one or more preference dimensions, or that some individuals may simply appear to change types due to careless responding, situational factors, or deliberate misrepresentation. On the contrary, it simply provides an explanation for why individuals who do not suffer from true fluctuations in their preferences would appear to change their types.

In short, the important question concerns the relative percentage of individuals who appear to change type on repeated testing due simply to the action of measurement error near the type cutoff. If such individuals constitute a large percentage of those whose type assignments change on retesting, a strategy for improving the MBTI to reduce such occurrences would then be evident (i.e., increasing its measurement precision near the type cutoff score).

Bimodality. The issue of preference score bimodality is closely linked with the issue of type stability.
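The measurement-error account of apparent type switching described above can be illustrated with a small calculation. The sketch below is not a procedure taken from the MBTI literature; it assumes a simple classical model in which each observed preference score equals an unchanging true score plus independent, normally distributed error, and the function names (`normal_cdf`, `switch_probability`) and the cutoff of zero are hypothetical choices made for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def switch_probability(true_score: float, error_sd: float, cutoff: float = 0.0) -> float:
    """Chance that two administrations land on opposite sides of the cutoff.

    Assumption (illustrative only): observed = true + error, where the error
    is Normal(0, error_sd), drawn independently on each occasion, and the
    true score itself never changes between administrations.
    """
    p_above = normal_cdf((true_score - cutoff) / error_sd)
    # Either "above then below" or "below then above".
    return 2.0 * p_above * (1.0 - p_above)


if __name__ == "__main__":
    # Distance of the true score from the cutoff, in units of the error SD.
    for distance in (0.0, 0.5, 1.0, 2.0, 3.0):
        prob = switch_probability(distance, error_sd=1.0)
        print(f"true score {distance:.1f} error SDs from cutoff: P(switch) = {prob:.3f}")
```

Under these assumptions the probability of an apparent switch is largest for a true score exactly at the cutoff and falls off quickly as the true score moves away from it, which is the pattern the argument above relies on.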
Although some demonstrations of preference bimodality have been reported in select samples having strongly differentiated types (e.g., Rytting, Ware, & Prince, 1994), there is overwhelming evidence to indicate that MBTI preference distributions in large, unselected samples are not bimodal (e.g., Harvey & Murry, 1994; Hicks, 1984; McCrae & Costa, 1989; Stricker & Ross, 1964). Although this lack of bimodality in MBTI preference scores does not necessarily invalidate the type-based theory on which the instrument is based, it does present a tempting target for critics of the MBTI. As Pittenger (1993) noted, findings of lack of bimodality “give reason to suspect the claims that types represent separate populations, and that small quantitative differences between scores represent a significant qualitative difference in personality” (p. 471).

Regardless of whether or not one agrees with the assertion that the MBTI must demonstrate bimodal score distributions (as we describe below, in our assessment bimodality is not strictly necessary), the fact remains that the type stability, measurement precision, and bimodality issues are closely linked. Because all psychological tests contain some degree of measurement error, whenever a cutoff score is used to dichotomize a continuous scale it becomes highly advantageous to minimize the relative number of people who score near the cutoff.
This is done in order to minimize the chance that even relatively minor errors of measurement could cause a person’s observed score to fall on the opposite side of the cutoff from their true score (i.e., an erroneous type classification). As Pittenger (1993) noted, “an accurate and durable assessment of type cannot be made for those subjects whose scores are close to the zero point [i.e., type cutoff] and [who therefore] have a high probability of crossing that boundary” (p. 472) due simply to the action of measurement error.

In essence, a lack of bimodality in the preference score distributions may exacerbate the problem of type misclassifications due to measurement error near the cutoff score (i.e., because center-weighted distributions have a much higher percentage of individuals scoring near the cutoff). Thus, if measurement precision (i.e., reliability) is held constant, increasing the number of people who score near the type cutoff will unavoidably increase the number of erroneous type classifications, both in test-retest and single-administration situations. It follows that as a practical matter, the reliability of a scale that is to be dichotomized may need to be significantly higher than the level that would be considered adequate for a test in which a cutoff score is not imposed.
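The link between distribution shape and erroneous classifications described above can be sketched numerically. The example below is only an illustration under stated assumptions: true scores follow either a standard normal (center-weighted) density or a hypothetical two-component normal mixture (bimodal), the error standard deviation (`ERROR_SD`) is held fixed rather than the reliability coefficient, and all names and parameter values are invented for the example rather than estimated from MBTI data.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_rate(density, error_sd, cutoff=0.0, lo=-8.0, hi=8.0, steps=4000):
    """Expected share of people whose observed score is on the wrong side of the cutoff.

    Midpoint-rule integration of density(t) * P(error carries t across the cutoff),
    assuming observed = true + Normal(0, error_sd) error.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * width
        p_wrong = normal_cdf(-abs(t - cutoff) / error_sd)
        total += density(t) * p_wrong * width
    return total


def center_weighted(t: float) -> float:
    """Hypothetical center-weighted true-score density: Normal(0, 1)."""
    return normal_pdf(t, 0.0, 1.0)


def bimodal(t: float) -> float:
    """Hypothetical bimodal true-score density: equal mix of Normal(-1.5, 0.5) and Normal(1.5, 0.5)."""
    return 0.5 * normal_pdf(t, -1.5, 0.5) + 0.5 * normal_pdf(t, 1.5, 0.5)


ERROR_SD = 0.4  # same measurement error for both shapes (an assumption)
rate_center = misclassification_rate(center_weighted, ERROR_SD)
rate_bimodal = misclassification_rate(bimodal, ERROR_SD)

if __name__ == "__main__":
    print(f"center-weighted: {rate_center:.4f}")
    print(f"bimodal:         {rate_bimodal:.4f}")
```

With the same amount of measurement error, the center-weighted shape yields the higher expected misclassification rate simply because more of its mass sits near the cutoff. Note that fixing the error standard deviation is a simplification; the two hypothetical distributions differ in true-score variance, so their reliability coefficients are not identical.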
Thus, on totally pragmatic grounds, bimodal preference score distributions are much more desirable than center-weighted ones because they reduce the number of erroneous type classifications that would be expected due to measurement error at the cutoff.

As was noted above, one might legitimately question whether it is necessary for a type-based instrument to produce bimodal distributions. Although many researchers (e.g., Pittenger, 1993; Stricker & Ross, 1964) appear to have accepted the argument that bimodal distributions are necessary based largely on theoretical arguments (e.g., Myers with Myers, 1980), opposing arguments can be offered (e.g., Mitchell, 1995). Indeed, at a strictly pragmatic level, there is no difference between setting a cutoff score on the MBTI scales for the purpose of assigning individuals to type categories versus setting a cutoff score on any other psychological scale that lacks a bimodal distribution (which is, of course, the case for most psychological scales). That is, cutoff scores are frequently -- and appropriately -- used with tests that demonstrate center-weighted, Normal distributions. For example, in organizational selection it is commonplace to rank employees based on their scores on a cognitive ability test, and to only consider those who score above a minimum cutoff for hiring.
In such situations, rarely if ever does the practitioner expect the employment test to demonstrate bimodality, or to minimize the density of the distribution near the cutoff point. Clearly, bimodality is not a necessary condition for setting a cutoff score on a psychological test.

Thus, although one can argue that bimodality is not a prerequisite characteristic in order for the MBTI to be judged psychometrically adequate, it is nonetheless a highly desirable characteristic due to the MBTI’s use of a cutoff score to assign individuals to the categorical types. Based on the above discussion of the effect of measurement error at the cutoff, it is clear that the bimodality and type-stability issues are inextricably linked, and that the maximum improvement in MBTI test-retest type stability would be expected to occur when improvements in both bimodality and measurement precision at the cutoff are achieved. Thus, one does not have to accept the theory-based argument that a type-based instrument must produce bimodal score distributions in order to appreciate the practical advantages that would obtain if the MBTI’s preference scores were more bimodal in nature.

Strategies for Addressing these Issues

Of all of the criticisms of the MBTI that have been raised to date, it is our assessment that the type-instability issue is one of the most troublesome. That is, if it is true that preferences are inborn, and that by adulthood most individuals achieve reasonably well differentiated types (e.g., Myers & McCaulley, 1988; Myers with Myers, 1980), one would definitely not expect to find only 24%-61% of individuals retaining the same type on all four MBTI dimensions on repeated testing, especially when the administrations are given only a few weeks or months apart. Indeed, when interpreting the empirical data regarding test-retest type stability and preference score distribution shape, critics of the MBTI have concluded that “the patterns of data do not suggest that there is reason to believe that there are 16 unique types of personality” (Pittenger, 1993, p. 483), and that “the four-letter type code is not a stable personality characteristic” (p. 472).

It is important to realize that such conclusions are based on a critical -- and untested -- assumption: namely, that the lack of bimodality and the observed levels of type instability reflect flaws in the MBTI itself.
Interestingly, little or no consideration has been given to the alternative viewpoint that these empirical findings do not reflect flaws in the MBTI or its underlying theory, but instead are caused by limitations in the scoring system that is used to convert item responses into the preference scores that are dichotomized to form type assignments. We contend that before sweeping conclusions regarding the validity of the MBTI can be drawn, researchers must first determine whether improvements in bimodality and type stability can be achieved via modifications to the techniques that are used to score the MBTI and assign categorical types.

Without doubt, the answer to the question of whether revisions to the MBTI scoring system would be able to improve type stability and/or preference score bimodality is of fundamental importance. That is, if a new scoring system were to be developed that is capable of producing more bimodally shaped preference distributions in large, unselected samples of MBTI respondents, this would effectively destroy a key line of evidence on which criticisms of the MBTI instrument -- as well as the type-based theory on which it is founded -- have been based (e.g., Pittenger, 1993, p. 471).
Likewise, if a scoring system capable of producing improvements in the MBTI’s measurement precision near the cutoff were to be produced, increased type stability in test-retest situations would be predicted to result, thereby addressing the remaining major empirical criticism of the MBTI.

However, what strategies should be followed to modify the MBTI’s scoring procedures in order to achieve the objectives of increased bimodality and measurement precision? Given that the lack of bimodality is hardly a new occurrence, having been present in its earlier scoring systems as well (e.g., Stricker & Ross, 1964), there is little reason to believe that simply updating the prediction-ratio based preference scoring weights using new samples of respondents would lead to significant changes in the shapes of the preference score distributions. Indeed, it is unlikely that any alternative number-right or weighted number-right scoring technique that takes a linear-model based approach would be any more likely than the existing weighting system to produce bimodality or improved measurement precision.
For example, Harvey and Murry (1994) examined two alternative scoring methods (i.e., an unweighted count of the number of items answered in the keyed direction, and a linear-model based weighting system using factor scoring coefficients), finding that neither produced any meaningful reductions in the center-weightedness of the preference distributions.

One possibility for improving the test-retest type stability that has been suggested involves increasing the number of categories into which individuals are classified on each preference dimension (Harvey & Murry, 1994). For example, earlier versions of the MBTI were scored using a 3-category system: the two bipolar types (e.g., ‘E’ or ‘I’), plus an indeterminate ‘x’ classification for individuals who scored close to the type cutoff (e.g., see Myers & McCaulley, 1985, chapter 9). It seems reasonable to hypothesize that a sizable percentage of the individuals who switch types on repeated administrations of the MBTI are those whose preference scores lie close to the cutoff. For such individuals, a change of only a few preference score units could cause them to be classified into the opposite type on repeated testing. Stopping the practice of forcing these type-indeterminate individuals into bipolar type categories might produce significant improvements in test-retest stability.
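The 3-category idea just described can be sketched in a few lines. This is a hypothetical illustration, not the scoring rule of any MBTI form: the width of the indeterminate band, the sign convention (scores below the cutoff map to the first pole), the tie rule at the cutoff, and the example retest scores are all assumptions introduced here.

```python
def classify(score: float, band: float = 0.0, poles=("E", "I"), cutoff: float = 0.0) -> str:
    """Assign a pole, or 'x' when the score lies within `band` units of the cutoff.

    With band=0.0 this reduces to the usual two-category dichotomy.
    Assumed convention: scores below the cutoff map to poles[0]; a score
    exactly at the cutoff maps to poles[1] (an arbitrary tie rule).
    """
    if abs(score - cutoff) < band:
        return "x"
    return poles[0] if score < cutoff else poles[1]


def is_type_switch(first: str, second: str) -> bool:
    """Count a switch only when both occasions give definite but opposite poles."""
    return "x" not in (first, second) and first != second


def count_switches(pairs, band: float) -> int:
    """Number of apparent pole reversals across (test, retest) score pairs."""
    return sum(
        is_type_switch(classify(test, band), classify(retest, band))
        for test, retest in pairs
    )


# Hypothetical (test, retest) preference scores for four respondents.
retest_pairs = [(-3.0, 4.0), (-21.0, -15.0), (2.0, -1.0), (9.0, 13.0)]

if __name__ == "__main__":
    for band in (0.0, 5.0):
        print(f"band = {band}: {count_switches(retest_pairs, band)} apparent switches")
```

In this toy data the two respondents who cross the cutoff both lie within the indeterminate band, so neither is counted as reversing poles once the 'x' category is allowed. Whether a move between 'x' and a definite pole should itself count as a change is a design decision that this sketch deliberately leaves out.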
Of course, even if an ‘indeterminate’ category is added, the performance of such a system would be greatly facilitated if the shapes of the preference score distributions were also made more bimodal, thereby reducing the number of type-indeterminate individuals.

With respect to methods for changing the procedures used to compute MBTI preference scores in order to improve measurement precision and bimodality, in our assessment the strategy that holds the greatest promise is to use item response theory (IRT) techniques (e.g., Lord & Novick, 1968). Although only a few studies using IRT scoring of the MBTI have been conducted (Harvey & Murry, 1994; Harvey, Murry, & Markham, 1994; Thomas & Harvey, 1995), their results have been very encouraging. Specifically, they demonstrated that switching to IRT scoring -- without making any substantive changes to the MBTI items themselves -- produces (a) strongly bimodal preference distributions in large, unselected samples of respondents; and (b) scales that produce their maximum measurement precision in the vicinity of the type cutoff (e.g., Harvey & Murry, 1994). Related IRT research (Thomas & Harvey, 1995) has revealed that the degree of measurement precision of the MBTI scales can be further improved through the addition of new items.
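The three-category classification idea discussed above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the preference score is taken to be centered so that the type cutoff sits at 0.0, the EI scale is used as the example, and the function names and the `band` half-width of 0.25 are hypothetical choices rather than values taken from any MBTI scoring manual.

```python
def classify_preference(score, band=0.25, keyed="I", nonkeyed="E"):
    """Assign a category from a continuous preference score.

    score: preference score with the type cutoff at 0.0
           (positive values point toward the keyed pole).
    band:  hypothetical half-width of the indeterminate zone; band=0.0
           approximates forced two-category typing.
    """
    if abs(score) <= band:
        return "x"  # too close to the cutoff to classify
    return keyed if score > 0 else nonkeyed


def category_changed(score_test, score_retest, band=0.25):
    """True when the assigned category differs between two administrations."""
    first = classify_preference(score_test, band)
    second = classify_preference(score_retest, band)
    return first != second


# A respondent who drifts slightly across the cutoff between administrations.
print(category_changed(0.1, -0.1, band=0.0))   # forced bipolar typing
print(category_changed(0.1, -0.1, band=0.25))  # with an indeterminate zone
```

In this sketch the respondent who drifts from 0.1 to -0.1 switches type under forced bipolar classification but remains ‘x’ when the band is used. A respondent who crosses the edge of the band (e.g., from 0.3 to 0.2) would still change category, so an indeterminate zone can reduce, but not eliminate, classification changes.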
IRT Methods in the Context of the MBTI

Before reviewing the results of these studies, we will first provide a brief tutorial on IRT methods, paying specific attention to the ways in which traditional IRT terminology must be translated into the terminology of type theory and the MBTI. Historically, IRT terminology has been deeply rooted in right/wrong, ability-oriented testing methods. Although this ability-oriented terminology is useful in the context of scoring right/wrong, multiple-choice test items, it is somewhat counterproductive when one is attempting to understand how IRT would be used to score personality tests in which (a) “right” or “wrong” answers do not exist, (b) the notion of item “difficulty” has little or no intuitive meaning, and (c) the susceptibility of items to “guessing the correct answer” is not typically a cause for concern.

In this section we briefly describe the fundamentals of IRT methods as they relate to the MBTI; however, a detailed description of IRT is beyond the scope of this article. The reader is referred to one of the standard IRT texts (e.g., Hambleton, Swaminathan, & Rogers, 1991; Hulin, Drasgow, & Parsons, 1983; Lord & Novick, 1968) for a more comprehensive treatment.
Our primary goal is to describe the basics of the IRT approach to measurement and explicate the terminological differences that exist between standard descriptions of IRT methods and their application to the specific case of the MBTI.

IRT Terminology

The latent construct, or θ. In IRT, as in classical test theory (CTT), a primary focus of testing is to derive an estimate of each examinee’s score on the latent construct (or set of four bipolar constructs, in the case of the MBTI) being assessed. In CTT, this quantity is termed the true score; in IRT, it is typically termed the latent trait score (which is abbreviated θ, or theta). In both cases, this score is an unobserved, hypothetical construct (e.g., Intelligence, Extraversion) on which people are assumed to differ, but which cannot be directly quantified. Thus, we are forced to estimate examinees’ scores on the latent construct based on their responses to a set of test items.

The term “latent trait” has a tendency to set off alarms for proponents of type-based theories of personality; indeed, this usage of the term “trait” represents our first encounter with the semantic difficulties that can occur when applying IRT (which is also known as Latent Trait Theory) to the MBTI. It must be stressed that this use of the term “trait” when describing the latent construct being estimated by IRT in no way implies a taking-of-sides in the ongoing “trait vs.
type” debate (e.g., Block & Ozer, 1982; Gangestad & Snyder, 1991; Mendelsohn, Weiss, & Feimer, 1982). That is, although the MBTI is based on the notion of discrete types of personality, the MBTI has always used scores on continuous bipolar scales in order to assess the strength and direction of the preference for EI, SN, TF, and JP (i.e., the prediction-ratio based preference scores; e.g., Myers & McCaulley, 1988, p. 9). By dichotomizing these preference scores, individuals can subsequently be assigned to categorical types.

Throughout our discussion of how IRT methods can be used to score the MBTI, it is critically important to keep in mind that the MBTI preference scores estimated using the traditional prediction-ratio method correspond directly to the θ scores estimated by IRT. Thus, IRT takes precisely the same logical approach that has always been used in the MBTI: that is, describing both the strength and direction of the preference for the EI, SN, TF, and JP dimensions using four bipolar continua. Only the computational method involved in computing these continuous preference scores is different. In effect, whenever the term ‘trait’ or ‘latent trait’ appears in a discussion of IRT methods, one can simply substitute the term ‘preference score’ to understand how IRT would be used to score the MBTI.

Probability of a correct response (PCR).
The other quantity that is of fundamental interest in IRT is the likelihood that a given respondent will make a “correct” response to a given item. In ability-oriented testing, we have a clear understanding of what a correct vs. incorrect item response means, and we can easily compute and interpret the percentages of people who respond correctly to each test item. However, when IRT is applied to the MBTI (or to any other test that does not employ right-versus-wrong scoring), what meaning do we attach to this concept?

As it turns out, the lack of a “correct” response to each item poses absolutely no problem with respect to applying IRT scoring methods to the MBTI. That is, although there are no “right” or “wrong” responses, in the traditional MBTI scoring system each possible item response is keyed toward one or the other of the poles of the item’s assigned preference dimension (e.g., the response “thinking” from the word-pair “thinking vs. feeling” is keyed toward the “T” pole of the TF dimension, and the “feeling” response is keyed toward the “F” pole).
This keying of items with respect to the poles of each preference continuum provides us with the information that is needed to use IRT to score the MBTI.

In essence, IRT methods simply require that each item be scored dichotomously; although it is common to do so, it is not mandatory that this scoring system be couched in terms of a “correct” versus “incorrect” response. For the MBTI, we need only pick one of the two poles of each scale (e.g., for the EI scale, the “I” preference) as the keyed pole; this choice is essentially arbitrary, and for maximum similarity to the traditional prediction-ratio scoring system (e.g., Myers & McCaulley, 1988, p. 9), item responses have been keyed toward the I, N, F, and P poles in MBTI IRT studies (e.g., Harvey & Murry, 1994). Once a keyed pole is chosen, each MBTI item response is dichotomously scored by determining whether or not it is in the keyed direction. Using the above example, if an individual chose the “thinking” alternative from the “thinking vs. feeling” word pair, this response would not be in the keyed (i.e., “F”) direction; therefore, it would be scored as a zero.
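The dichotomous keying rule just described can be written as a short sketch. The data layout (pairs of scale name and the pole of the chosen alternative) and the function names are hypothetical conveniences for illustration; only the keyed poles (I, N, F, P) come from the text.

```python
# Keyed pole for each scale, following the convention described above.
KEYED_POLE = {"EI": "I", "SN": "N", "TF": "F", "JP": "P"}


def score_response(scale, chosen_pole):
    """Return 1 if the chosen alternative points toward the keyed pole, else 0."""
    return 1 if chosen_pole == KEYED_POLE[scale] else 0


def score_items(responses):
    """Score a list of (scale, pole_of_chosen_alternative) pairs as 0/1 values."""
    return [score_response(scale, pole) for scale, pole in responses]


# "thinking" from the "thinking vs. feeling" word pair is keyed toward T,
# which is not the keyed (F) direction, so it is scored as zero.
print(score_response("TF", "T"))
print(score_items([("EI", "I"), ("EI", "E"), ("TF", "F")]))
```

Swapping the entries in `KEYED_POLE` (for example, keying EI toward “E”) would flip every 0/1 score on that scale, which is the code-level counterpart of reversing the direction of the θ continuum.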
It must be stressed that this choice of a keyed direction for each scale is entirely arbitrary, and that IRT scoring works equally well regardless of which pole is chosen as the keyed response. That is, the choice of the keyed pole simply determines the direction of the scale (i.e., because the type cutoff point is assigned a value of zero, preference scores that lie in the keyed direction receive positive numbers, and preferences toward the non-keyed pole receive negative numbers). Reversing the keyed pole simply reverses the scale of the θ score continuum.

The item characteristic curve (ICC). The foundation of the IRT approach is the ICC; each item on a test will have its own ICC. In essence, the ICC answers the question, “How are individuals’ scores on the latent construct (i.e., preferences) related to their observed probabilities of endorsing this MBTI item in the keyed (i.e., INFP) direction?” The ICC depicts the form of the functional relation that exists between the latent construct and the PCR. In practice, there are many different ways in which this functional relationship between θ scores and PCRs can be modeled.

One of the simplest ways in which preference scores can be related to the observed item endorsement rates is a model in which higher scores on the latent preference construct are linearly associated with higher likelihoods of endorsing the item in the keyed direction.
Hypothetical Item 1 in Figure 1 illustrates an ICC that is primarily linear in nature. In Figure 1, the horizontal axis represents the latent preference score (θ), and the vertical axis represents the likelihood that individuals holding a given preference would endorse this item in the keyed direction (i.e., the PCR). The ICC shows how scores on the latent preference scale correspond to observed item-endorsement rates.

If the ICC for Item 1 in Figure 1 had been obtained for an actual MBTI item (e.g., on the EI scale, one that asked respondents to choose between “good mixer” vs. “quiet and reserved”), and the EI items were keyed toward the Introvert pole, individuals having positive scores on the θ scale would be Introverts, and those having negative scores would be Extraverts (a value of θ = 0.0 serves as the type cutoff score, and the θ metric is scaled in z units). Just as with traditional prediction-ratio based preference scores, scores that are further away from the type cutoff denote stronger preferences toward that pole of the preference continuum. To determine the predicted likelihood that a group of individuals who share a given θ score would endorse a given item in the keyed direction, simply locate the desired θ score on the x-axis, and then draw a vertical line until the ICC is reached.
By projecting a horizontal line leftward to the y-axis from the ICC, the PCR value associated with that θ score can be determined.

For example, in Figure 1 individuals who score 0.0 on θ have no clear preference for either the “E” or “I” poles; we would expect 50% of them to endorse this item in the “I” direction and 50% to endorse this item in the “E” direction (note the vertical line drawn at θ = 0, and the horizontal line drawn at PCR = 0.5). In contrast, when considering a group of individuals who hold a strong preference toward the Introvert pole (e.g., at θ = +2.5), a PCR value of over 0.80 would be predicted; that is, over 80% of these strong Introverts would be expected to endorse the ‘I’ alternative (i.e., “quiet and reserved”), and less than 20% would be expected to endorse the ‘E’ alternative (i.e., “good mixer”). Conversely, among a group of individuals demonstrating a very strong Extravert preference (e.g., θ = -3.0), a PCR of approximately 0.14 would be expected (i.e., only 14% of these strong Extraverts would say they are “quiet and reserved”, whereas 86% would say they are “good mixers”).

In sharp contrast to the linear ICC described above, a step function ICC might exist.
In a step function, a cutoff score on the θ preference scale is effectively present, such that all individuals who score below a given level of θ will fail to endorse the item in the keyed direction, and all individuals who score above this cutoff will endorse it in the keyed direction. Hypothetical ICC 2 in Figure 1 depicts an ICC that approximates a step function: here, the cutoff point is at θ = 0.0, and effectively all those who score lower than -0.1 (i.e., the Extraverts) would endorse the non-keyed response (“good mixer”), and all those above 0.1 (i.e., the Introverts) would endorse the keyed response (“quiet and reserved”). At the cutoff point, only in the very narrow range of approximately -0.1 to +0.1 would we observe Extraverts endorsing the “I” alternative and Introverts endorsing the “E” alternative.

Step-function ICCs possess appealing properties in the context of a type-based assessment instrument like the MBTI. That is, if two distinct types of people exist, almost all of the people whose continuous preference scores lie below the cutoff value for Item 2 would be expected to not endorse a response alternative that is keyed toward the opposite pole, whereas almost all of those who score above the cutoff would be expected to endorse the item in the keyed direction.
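One way to picture the two hypothetical ICCs is to treat both as logistic curves that differ only in steepness. The sketch below does this under explicit assumptions: the logistic form, the function name, and the two slope values (0.6 and 50) are illustrative choices made here to roughly mimic the values quoted from Figure 1, not parameters reported in the article.

```python
import math


def logistic_icc(theta, slope, location=0.0):
    """Probability of a keyed ('I') response at preference score theta."""
    z = slope * (theta - location)
    return 1.0 / (1.0 + math.exp(-z))


SHALLOW_SLOPE = 0.6  # assumed: gentle, nearly linear curve resembling Item 1
STEEP_SLOPE = 50.0   # assumed: very sharp curve approximating the step-like Item 2

for theta in (-3.0, -0.1, 0.0, 0.1, 2.5):
    p_item1 = logistic_icc(theta, SHALLOW_SLOPE)
    p_item2 = logistic_icc(theta, STEEP_SLOPE)
    print(f"theta={theta:+.1f}  item1={p_item1:.2f}  item2={p_item2:.2f}")
```

With the shallow slope, the curve passes through 0.50 at θ = 0.0, gives roughly 0.82 at θ = +2.5, and roughly 0.14 at θ = -3.0, which is in line with the values described for Item 1. With the steep slope, nearly all of the change in endorsement probability occurs between about -0.1 and +0.1.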
Indeed, if true step-function ICCs like Item 2’s existed in practice, one could effectively develop a single-item test that would measure each individual’s MBTI preference with great accuracy (i.e., if the step function cutoff point coincided precisely with the “natural” cutoff that exists between the two types).

Item information functions. The reason that step-function ICCs are potentially so desirable is that they convey a great deal of information regarding each individual’s standing on each MBTI preference dimension. However, step functions are limited in the sense that the information they provide is confined to a relatively narrow range of scores (i.e., those who score near the cutoff point that defines the “step”). In the context of IRT, the term “information” is used to describe an item’s ability to discriminate between individuals who hold different scores on the latent preference continuum. That is, if the size of the difference between two individuals’ scores on the latent preference continuum is held constant, increasing the amount of information provided by an item makes it easier to discriminate between those individuals (i.e., with respect to the likelihood that they would endorse the item in the keyed direction).

IRT methods allow us to quantify the amount of information provided by each item at any given level of the θ scale via the item information function (IIF). Figure 2 presents the IIFs for the two hypothetical items listed in Figure 1. As these IIFs illustrate, the linear ICC seen for Item 1 provides a consistent – but small – amount of information across the entire range of θ scores. In contrast, the step-function ICC seen for Item 2 provides a great deal of information near the cutoff point, but very little information elsewhere. Thus, for individuals who endorse Item 2 in the keyed direction, we can be quite confident that their θ scores lie above the cutoff point; however, we have virtually no ability to determine whether they hold a strong, intermediate, or weak preference toward the “I” pole based on their endorsement of Item 2 in the keyed direction.
That is, in terms of the expected PCR, there is virtually no difference between a strong (e.g., θ = 2.5) versus a weak (e.g., θ = 0.5) “I” preference with respect to the responses to Item 2; hence, it provides very little information outside the narrow band surrounding its cutoff point.

Of course, due to the action of measurement error, it is extremely unlikely that in an actual testing situation we would encounter ICCs that break as sharply as the one depicted for hypothetical Item 2. More commonly, ICCs tend to assume an intermediate form between the two extremes depicted in Figure 1, producing variants of an “S”-shaped ICC. Thus, when applying IRT methods, the fundamental question concerns the kind of ICC that one chooses to employ when modeling the relations between the latent construct and the observed item endorsement rates. In particular, the choice between fitting a linear versus a nonlinear model is critical: as can be seen from the ICCs in Figure 1, it would be profoundly misleading to fit a linear ICC to an item that possessed a true ICC like the one depicted for Item 2.
Likewise, it would be highly misleading to force a step-function ICC onto an item that demonstrated an ICC like the one seen for Item 1.

IRT Models for Dichotomously Scored Test Items

IRT models differ primarily in terms of the assumptions they make regarding the ways in which scores on the latent construct (θ) can relate to observed item endorsement rates (PCR). These differences are reflected in the number of parameters that must be estimated in order to “fit” an ICC to each item’s responses.

1-parameter (Rasch) model. One of the simplest answers to the question of how the latent construct is related to the endorsement rates for each item is given by the 1-parameter, or Rasch, model (e.g., Rasch, 1960). Not surprisingly, in the 1-parameter model there is only one characteristic of each item that sets its ICC apart from the ICCs of the other items on the test. Using traditional IRT terminology, this parameter is the difficulty of the item.

Unfortunately, the difficulty parameter represents yet another example of the way in which traditional IRT terminology is awkward when applied to instruments that do not use right/wrong scoring.
That is, in a traditional right/wrong test, we define a “difficult” item as being one that few respondents are able to answer correctly (i.e., one with a low p value); conversely, an “easy” item is defined as one that most respondents (even those who score very low on the construct being measured) are able to answer correctly. However, with the MBTI we are concerned with the question of how likely it would be for a person to make an item response in the keyed direction (i.e., I, N, F, or P), not whether such a response is “right” or “wrong.”

In the present case, the difficulty of an item (denoted b) refers to the degree to which raters will tend to endorse the item in the keyed direction. Thus, items having numerically high b parameters will be the ones that only people who score high in the keyed preference direction will tend to endorse. In contrast, items having low b parameters will tend to be endorsed in the keyed direction even by individuals whose preferences lie strongly toward the non-keyed pole of the preference dimension. The scale of the b parameter is the same as the scale of θ (i.e., standard, or z, units).

An example should help to illustrate the way in which the b parameter can be used to differentiate between test items.
Figure 3 presents the ICCs for three actual MBTI items drawn from the EI scale; these ICCs were computed by fitting the 1-parameter IRT model in a sample of 2,499 MBTI profiles (the sample used to compute this and subsequent figures was formed by sampling subjects from the databases used in the Harvey & Murry, 1994, and Harvey et al., 1995, studies, and then adding approximately 600 new raters – primarily college students – who were not used in those studies). Because the EI responses were keyed toward the “I” pole, individuals having Extravert preferences exhibit negative θ scores, and those having Introvert preferences exhibit positive θ scores. For reference, a horizontal line has been drawn at the 50% point of likelihood of item endorsement, and a vertical line at the type cutoff point (i.e., θ = 0.0).

The ICCs in Figure 3 depict the percentages of individuals who share a given θ score that would be expected to endorse each item in the “I” direction. By comparing the levels of θ at which 50% of raters would be expected to endorse an item in the “I” direction, one can see the way in which the b parameter differentiates among test items. That is, Item 129 has the lowest b parameter; we would expect 50% of individuals who share the moderately strong “E” preference of -0.9 to endorse the “I” alternative for this item (i.e., “not interested in following the latest fashion”).
In contrast, Item 33 has the highest b value; for it, the point at which 50% endorse the “I” response (“hard to get to know”) does not occur until a moderately strong “I” preference of 0.9 is achieved.

Thus, for any given level of θ (i.e., true preference on the EI dimension), we would expect to see the highest rates of “I” endorsement occurring for Item 129, followed by Item 50, with the lowest rates of “I” endorsement occurring for Item 33. For example, consider a group of moderately
strong Introverts (i.e., θ = 0.9, which represents a score of almost one standard deviation above the mean EI preference score). Among this group of Introverts, we would expect 50% of them to describe themselves as “hard to get to know” (Item 33), 64% as “quiet and reserved” (Item 50), and 86% as “not interested in following the latest fashion” (Item 129). Conversely, for a group of θ = -0.9 Extraverts, we would expect to find that only about 12% describe themselves as “hard to get to know,” 20% as “quiet and reserved,” and 50% as “not interested in following the latest fashion.”

In general, regardless of the specific IRT model that is chosen, the substantive interpretation of the ICC will always be the same: that is, by drawing a line projecting vertically from a given θ score to the ICC, and then projecting a line horizontally to the PCR axis, one can determine the percentage of people sharing that true level of the preference who would be expected to endorse the item in the keyed direction.

How, then, should the IRT b parameter be interpreted in the context of the MBTI? As the results in Figure 3 illustrate, in the 1-parameter IRT model the only thing that differentiates one test item from another is the horizontal (left-right) location of the ICC on the latent preference scale.
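The endorsement rates quoted for Figure 3 can be approximated with a short sketch of a 1-parameter logistic ICC. Two things here are assumptions rather than facts from the article: the logistic functional form with a common slope of 1.0, and the function name `rasch_icc`. The b values are the approximate item locations reported for Figure 3 (about -0.9, 0.35, and 0.9 for Items 129, 50, and 33).

```python
import math

# Approximate item locations (b) reported for the three EI items in Figure 3.
B_VALUES = {"Item 129": -0.9, "Item 50": 0.35, "Item 33": 0.9}


def rasch_icc(theta, b, slope=1.0):
    """1-parameter ICC: probability of an 'I' response at preference theta.

    The slope is one constant shared by every item; the value 1.0 is an
    assumption made for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-slope * (theta - b)))


for theta in (0.9, -0.9):
    rates = {item: round(100 * rasch_icc(theta, b)) for item, b in B_VALUES.items()}
    print(f"theta={theta:+.1f}: {rates}")
```

Under these assumptions the computed rates fall within a few percentage points of the values read from Figure 3 (for example, about 86% for Item 129 and 50% for Item 33 at θ = 0.9). Because the three curves share one slope, they differ only in where along θ each one crosses the 50% line.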
As a practical matter, the numerical value of the b parameter is defined directly in terms of the ICC: that is, b is equal to the value of θ that corresponds to a 50% likelihood of endorsing the item in the keyed direction. Thus, for the items presented in Figure 3, the b values are approximately -0.9, 0.35, and 0.9 for Items 129, 50, and 33, respectively.

The b parameter is useful for determining the point on the preference continuum (θ) at which the item will be maximally informative. As a general rule, an item will provide the most information regarding an individual’s θ score at the value of the b parameter (which, not surprisingly, coincides with the point at which the ICC demonstrates its sharpest slope). In this context, item information is synonymous with discriminating power (i.e., the ability to differentiate between individuals in terms of their standing on the θ scale of preference). That is, a difference of a given size (e.g., 0.5 θ units) between two individuals with respect to the strength of their preference will translate into a larger expected difference in PCRs as the slope of the ICC increases.

For example, consider Item 129 in Figure 3 (i.e., the leftmost ICC). At its most informative point, a change of one-half standard deviation (SD) in θ between two groups of Extraverts (i.e., -1.2 vs.
-0.7) translates into a change of approximately 14 percentage points (i.e., 42% to 56%) in the likelihood of endorsing Item 129 in the “I” direction. In contrast, the same magnitude of θ preference difference between two groups of individuals who score very strongly in the Introvert direction (e.g., 2.5 vs. 3.0) produces virtually no change in the PCRs (i.e., 97-98% “I” endorsement rates would be expected in both groups). Thus, Item 129 is much more informative or discriminating among moderate Extraverts than it is among individuals with strong Introvert preferences (nearly all of whom would endorse the item in the “I” direction).

With respect to the implications of using IRT methods to score the MBTI, the b parameter provides very useful information on each item. Because many MBTI users are more interested in the categorical type scores than in the continuous preference scores, we need to set a cutoff score on the preference continuum to assign respondents to the type categories. Consequently, we would tend to prefer items that have b values that lie close to the θ = 0.0 point that divides each continuum into categorical types.
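The percentages quoted above for Figure 3 can be approximated with a short sketch of a 1-parameter logistic ICC. The function name `icc_1pl` and the common slope of 1.07 are assumptions introduced here for illustration (the slope was picked so that the curve roughly reproduces the quoted endorsement rates; it is not an estimate reported in the text). The b values are the approximate locations given for Items 129, 50, and 33.

```python
import math

# Hypothetical common slope shared by all items (1-parameter model).
# Chosen only so the curve roughly matches the percentages quoted in the text.
COMMON_SLOPE = 1.07

# Approximate item locations (b) reported for the three EI items in Figure 3.
B_VALUES = {"Item 129": -0.9, "Item 50": 0.35, "Item 33": 0.9}


def icc_1pl(theta, b, slope=COMMON_SLOPE):
    """Expected proportion endorsing the item in the keyed ("I") direction."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - b)))


# Expected endorsement rates for moderate Extraverts and moderate Introverts.
for theta in (-0.9, 0.9):
    rates = {name: round(100 * icc_1pl(theta, b)) for name, b in B_VALUES.items()}
    print(f"theta = {theta:+.1f}: {rates}")

# The same half-SD contrast is far more informative near b than far from it.
near_b = icc_1pl(-0.7, -0.9) - icc_1pl(-1.2, -0.9)
far_from_b = icc_1pl(3.0, -0.9) - icc_1pl(2.5, -0.9)
print(f"Item 129 change near b: {100 * near_b:.1f} points")
print(f"Item 129 change far from b: {100 * far_from_b:.1f} points")
```

Under these assumptions, the half-SD contrast for Item 129 near its b value shifts the expected endorsement rate by a double-digit number of percentage points, whereas the same contrast at θ = 2.5 vs. 3.0 shifts it by roughly one point.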
Thus, considering the items presented in Figure 3, Item 50 would be much more useful than Item 129 with respect to locating individuals on one side or the other of the EI cutoff score.

Conceptually, then, the IRT approach is not especially complicated. The main problem from a practical point of view lies in estimating the unknown b parameters for the MBTI items, and in estimating the scores on the latent preference construct (θ) for each person, given their responses to the test items and our knowledge of the item parameters. The main difference between the IRT approach and older CTT-based approaches to measurement is that we explicitly assume that the relation between the latent construct score and the observed item response may be nonlinear in nature.

2-parameter model. Unfortunately, the 1-parameter IRT model suffers from significant limitations, perhaps the most important being that it assumes that all items on the test are equally discriminating or informative. For many psychological tests (especially personality tests), this is probably an unrealistic assumption. That is, some test items are likely to be stronger indicators of an individual’s underlying preferences than other test items (a fact that is acknowledged by the existing MBTI scoring system, which differentially weights items when computing preference scores).
In response to the need to allow test items to be differentially discriminating at their points of maximum discrimination, the 2-parameter IRT model was developed.

In essence, the 2-parameter IRT model is a superset of the 1-parameter model; in addition to the b (“location of maximum information”) parameter, a second parameter (abbreviated a, or the discrimination parameter) was added to allow for the fact that different test items will be differentially informative or discriminating regarding the latent construct. In practical terms, the a parameter defines the slope of the ICC at its inflection point (which, in the 1- and 2-parameter IRT models, occurs at b units on the θ scale).

Using the 2-parameter model, Figure 4 depicts ICCs for three hypothetical items that have identical b parameters (in this case, b = 0.0), but which differ in terms of their a parameters (a = 0.35, 1.0, and 2.1 for Items 1-3, respectively). A comparison of the ICCs for these three items graphically illustrates the difference between the 1- and 2-parameter models, and highlights the importance of modeling both the point of maximum information as well as
the amount of discrimination that occurs at the point of maximum information. Specifically, Figure 4 illustrates the way in which sharper ICC slopes enhance our ability to discriminate between individuals who differ in their θ scores.

That is, consider two groups of MBTI respondents: Group 1 consists of individuals who have a true EI preference of θ = -0.2 (i.e., a very slight preference toward “E”); Group 2 consists of individuals having a preference of θ = +0.2 (i.e., a slight “I” preference; vertical lines are drawn in Figure 4 at these locations). The horizontal lines drawn in Figure 4 depict the predicted item endorsement rates for Items 1 vs. 3 at these two θ levels. A comparison of the dotted (Item 3) and solid (Item 1) horizontal lines immediately indicates why higher a parameters are more desirable: for Item 1, a difference of only approximately 6% exists between the expected endorsement rates for Groups 1 versus 2; in contrast, a difference of over 36% exists for Item 3. Clearly, responses to Item 3 are much more sensitive to the relatively slight differences in θ scores that exist between Groups 1 and 2.

The implications for using the a parameters to assess the performance of items in the MBTI are not quite as straightforward as for the b parameters. On the one hand, one could argue that “more information is always better,” and that we should prefer items that produce larger amounts of information (i.e., sharper ICC slopes).
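The Group 1 vs. Group 2 comparison can be sketched with a 2-parameter logistic ICC. The functional form with scaling constant D = 1.7 is an assumption (the text does not state the exact form used to draw Figure 4), and the helper names `icc_2pl` and `endorsement_gap` are hypothetical. The a and b values are those given for Items 1 and 3.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the text


def icc_2pl(theta, a, b):
    """Expected proportion endorsing the item in the keyed direction."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def endorsement_gap(a, b=0.0, low=-0.2, high=0.2):
    """Difference in expected endorsement between Group 2 and Group 1."""
    return icc_2pl(high, a, b) - icc_2pl(low, a, b)


gap_item1 = endorsement_gap(a=0.35)  # flattest ICC in Figure 4
gap_item3 = endorsement_gap(a=2.1)   # steepest ICC in Figure 4

print(f"Item 1 gap: {100 * gap_item1:.1f} percentage points")
print(f"Item 3 gap: {100 * gap_item3:.1f} percentage points")
```

Under this assumed form the Item 1 gap comes out near the 6% quoted above and the Item 3 gap is several times larger, in the same neighborhood as (though not identical to) the value read from Figure 4.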
However, especially in the case of an instrument like the MBTI that uses a cutoff score to dichotomize its continuous preference scores in order to assign categorical type values, the amount of information provided by each item must be balanced against the location on the θ scale at which the item produces its information. Thus, we might very well prefer a moderately discriminating item to a highly discriminating item if the b parameter of the moderately discriminating item was located close to the type cutoff score, and the b for the highly discriminating item was located 2 SD units away from the type cutoff (i.e., causing it to produce relatively little information at the cutoff).

3-parameter model. Although the 2-parameter model’s ability to account for differentially discriminating items offers a valuable improvement over the 1-parameter model, the 2-parameter model can be criticized on the grounds that it assumes that all test items will have zero lower asymptotes for their ICCs (i.e., for individuals with very low scores on the θ scale, the ICCs will flatten out at a value that approaches zero).
Although many test items will indeed reach an effectively zero lower asymptote within the normal range of scores (e.g., Items 2 and 3 in Figure 4 do so at -3 and -1.5 z, respectively), some will not.

In the context of right/wrong tests that are subject to attempts to guess the correct answer, it is common to observe nonzero lower asymptotes for the ICCs due to the willingness of respondents to guess when they do not know the correct answer (e.g., for a 4-alternative multiple choice math question, random guessing would be expected to produce a 25% success rate). In the context of instruments that do not use right/wrong scoring (e.g., the MBTI), nonzero lower asymptotes can also occur, although for reasons other than guessing.

In short, nonzero lower asymptotes for items on a personality inventory may reflect the fact that the items are sufficiently skewed in terms of their endorsement properties that even individuals who score very low on the θ scale (i.e., their preferences lie strongly toward the non-keyed alternative) will still endorse the item in the keyed direction at nontrivial rates. The 3-parameter IRT model allows for this possibility by adding a third parameter for each item (abbreviated c) which defines the PCR that would be expected for people who score strongly toward the non-keyed preference pole (i.e., the effective lower asymptote of the ICC).
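A minimal sketch of the 3-parameter ICC shows how c acts as a floor on the curve. The logistic form with D = 1.7 is an assumption (the exact parameterization is not given in this section), and the item parameters below are hypothetical values chosen only to illustrate the lower asymptote.

```python
import math

D = 1.7  # assumed logistic scaling constant


def icc_3pl(theta, a, b, c):
    """3-parameter ICC: c is the lower asymptote, a the slope, b the location."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: moderate slope, located at the cutoff, floor of 25%.
a, b, c = 1.0, 0.0, 0.25
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}: expected endorsement = {icc_3pl(theta, a, b, c):.3f}")
```

With c = 0 the function reduces to the 2-parameter curve, and at θ = b the expected endorsement rate is halfway between c and 1 rather than exactly 50%.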
Although we would not expect there to be many items in the MBTI for which large nonzero c parameters would occur, it is possible that some items would require a nonzero value for the c parameter.

Figure 5 presents the ICCs produced by fitting the 3-parameter IRT model to the three EI items depicted in Figure 3. As a comparison of Figures 3 vs. 5 makes readily apparent, a very different picture of item functioning is produced as a result of choosing a 1- vs. 3-parameter IRT model. In particular, Items 50 and 33 demonstrate a visibly sharper ICC slope than was produced in the 1-parameter model, whereas Item 129 demonstrates a significantly flatter slope than was seen in Figure 3. Figure 6 presents the item information functions (IIFs) for these three items; inspection of these IIFs shows that Item 50 produces substantially more information than Item 33, and that both produce far more information than Item 129 (which produces very little information at any value of θ).
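The trade-off between how much information an item provides and where it provides it can be sketched with the standard item information function for the 3-parameter logistic model. The D = 1.7 scaling and all item parameters below are assumptions for illustration; they are not the estimates behind Figure 6.

```python
import math

D = 1.7  # assumed logistic scaling constant


def icc_3pl(theta, a, b, c=0.0):
    """3-parameter logistic ICC (c = 0 gives the 2-parameter model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Item information for the 3-parameter logistic model at a given theta."""
    p = icc_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


CUTOFF = 0.0  # type cutoff on the theta scale

# Hypothetical items: a moderate item at the cutoff vs. a sharp item 2 SD away.
moderate_at_cutoff = item_information(CUTOFF, a=1.0, b=0.0)
sharp_but_distant = item_information(CUTOFF, a=2.1, b=2.0)
# A nonzero lower asymptote reduces the information an item can provide.
moderate_with_floor = item_information(CUTOFF, a=1.0, b=0.0, c=0.3)

print(f"moderate item at cutoff: {moderate_at_cutoff:.3f}")
print(f"sharp item 2 SD away:    {sharp_but_distant:.3f}")
print(f"moderate item, c = 0.3:  {moderate_with_floor:.3f}")
```

In this sketch the moderately discriminating item contributes far more information at the type cutoff than the highly discriminating item located 2 SD away, which is the situation in which the less discriminating item would be preferred for type assignment.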
Item 50 is made even more desirable by the fact that the peak of its information function lies closest to the type cutoff score (i.e., θ = 0), which should make it the most useful of these three items with respect to distinguishing between individuals who score close to the type cutoff.

The results presented in Figure 5 also indicate that it is quite possible to find MBTI items that even raters who score very strongly toward the non-keyed end of the preference scale will endorse in the keyed direction at nontrivial rates. For example, the ICC for Item 129 shows that many extremely strong Extraverts endorse this item in the Introvert direction (e.g., at θ = -3.0, approximately 30% of these Extraverts endorse the “I” alternative, “not interested in following new fashions,” instead of the “E” response, “one of the first to follow a new fashion”). This ability to capture different kinds of item response patterns is a major advantage of the 3-parameter IRT model.

Test-level information and SE functions. An important advantage of IRT as a test development and scoring method is that it allows us to obtain a detailed look at the aggregate performance of collections of test items. In particular, we can calculate both test information functions (TIFs) and test standard error (SE) functions to assess the performance of an item pool.
TIFs indicate the amount of information or measurement precision that is provided by a test at all possible levels of θ, whereas test-level SE functions
indicate the degree of precision to be expected when estimating test scores for examinees at different levels of θ. Thus, the test SE functions represent a continuously variable analog to the global SEM estimate produced by CTT, indicating the degree of error that would be expected when estimating the “true” latent preference scores based on the observed patterns of item responses. Likewise, the test information functions represent a continuously variable analog to the unitary reliability coefficient estimated by CTT: that is, higher values reflect higher measurement precision and freedom from error, and lower values represent less measurement precision and increased uncertainty with respect to estimating scores on the latent construct.

Both of these functions represent tremendous improvements over the simplistic views of reliability and measurement error that are inherent in traditional CTT-based methods. That is, in classical approaches to testing, a test’s reliability is estimated as a single number that is presumed to be constant across the entire possible range of test scores. Likewise, a test’s standard error of measurement (SEM) is presumed to be constant across all possible test values.
Both of these assumptions are tenuous; indeed, it is reasonable to expect that most tests will tend to be more precise for respondents who have “average” scores on the latent construct, and less precise for those individuals who hold extreme scores (i.e., tests targeted at an “average” population typically lack items that provide significant levels of information for individuals who score at the extremes of the distribution).

Figure 7 presents the TIFs for a scale composed of the three EI items contained in Figures 3 and 5, as well as for the full EI scale; Figure 8 presents the corresponding SE functions for the 3-item and full-length EI scales. As Figures 7-8 illustrate, significant improvements in test precision (i.e., higher TIFs, lower SEs) are achieved in the full-length EI scale relative to a 3-item scale. Additionally, both the TIFs and SEs show that measurement precision is not constant across the full range of θ-based preference scores, being significantly better in the middle range of θ scores (peaking at approximately θ = 0.25), and somewhat more precise for the Introvert half of the scale than for the Extravert half (see Figure 8).

These results clearly undermine the CTT assumption that reliability and SEM remain constant across the full range of MBTI preference scores.
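The relation between the two test-level functions can be sketched as follows, assuming the usual IRT results that the TIF is the sum of the item information functions and that SE(θ) is the reciprocal square root of the TIF. The item pools below are hypothetical (they are not the EI item estimates), and D = 1.7 is an assumed scaling constant.

```python
import math

D = 1.7  # assumed logistic scaling constant


def item_information(theta, a, b, c=0.0):
    """Item information for the 3-parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def scale_information(theta, items):
    """Test information function: sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def standard_error(theta, items):
    """Test-level SE of the theta estimate at a given theta."""
    return 1.0 / math.sqrt(scale_information(theta, items))


# Hypothetical (a, b, c) triples: a 3-item scale and a longer scale.
short_scale = [(1.2, 0.3, 0.0), (0.9, 0.9, 0.05), (0.4, -0.9, 0.3)]
long_scale = short_scale + [(1.0, b, 0.0) for b in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(
        f"theta = {theta:+.1f}: "
        f"SE(3 items) = {standard_error(theta, short_scale):.2f}, "
        f"SE(10 items) = {standard_error(theta, long_scale):.2f}"
    )
```

Even with made-up items, the sketch shows the two patterns described above: adding items lowers the SE everywhere, and the SE is smallest in the region where the items' b values are concentrated.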
Based on past studies that have estimated the CTT reliability of the MBTI scales to lie in the .75-.85 range (e.g., Harvey & Murry, 1994; Myers & McCaulley, 1985), two horizontal lines have been drawn in Figures 7-8 at the levels of information/SE that correspond to r_xx = .75 (which produces SEM = .50 for z-scaled variables like θ) and r_xx = .85 (SEM = .39). A comparison of the TIFs and SEs for the full EI scale against these CTT reference lines indicates that the θ scores estimated by IRT would be expected to significantly exceed the levels of measurement precision implied by the unitary CTT estimates in the middle range of θ-based preference scores (i.e., from approximately -0.5 to +1.0 for the .39 SEM, and -1.0 to 1.5 for the .50 SEM), and to fall short of the levels of precision implied by the CTT results outside these ranges.

It is important to stress that these findings do not imply that IRT-based scoring is less precise than CTT-based number-right scoring for preferences that lie outside the above intervals.
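The SEM values quoted for the two reference lines follow from the classical relation SEM = SD × sqrt(1 − r_xx), with SD = 1 for a z-scaled variable. The conversion to an information level via 1 / SEM² is an assumption about how the horizontal lines in Figure 7 were placed; the function names are hypothetical.

```python
import math


def sem_from_reliability(r_xx, sd=1.0):
    """Classical standard error of measurement for a given reliability."""
    return sd * math.sqrt(1.0 - r_xx)


def information_from_sem(sem):
    """Information level at which an IRT SE would equal the given SEM."""
    return 1.0 / sem ** 2


for r_xx in (0.75, 0.85):
    sem = sem_from_reliability(r_xx)
    info = information_from_sem(sem)
    print(f"r_xx = {r_xx:.2f}: SEM = {sem:.2f}, equivalent information = {info:.2f}")
```

Wherever the TIF lies above the equivalent information level (or the SE function lies below the SEM line), the IRT-based estimate is more precise than the single CTT figure implies.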
On the contrary, they indicate that the levels of measurement precision implied by CTT’s unitary r_xx and SEM statistics are likely to underestimate the effective level of precision for preference scores that fall within approximately .5 to 1 SD of the type cutoff score, and to increasingly overestimate the precision of measurement for preference scores that lie strongly toward either pole of the preference scale.

Is IRT Appropriate for the MBTI?

By this point, the reader might well feel that he or she has seen at least one ICC too many, and perhaps be wondering whether it is really necessary to go to the trouble required to fit these nonlinear ICCs to the MBTI responses. Without a doubt, the IRT approach is somewhat more complex than the prediction-ratio technique that has traditionally been used to score the MBTI. In short, one might question whether or not the increased complexity inherent in IRT is worth the trouble, and whether any evidence exists to indicate that the IRT model actually provides a good “fit” to the MBTI item response patterns.

Fortunately, a very direct method exists for assessing the “fit” of the IRT model; it involves an examination of empirically derived ICCs.
Empirical ICCs are essentially scatterplots, defined as follows: the vertical axis of the plot represents the observed rate of item endorsement (PCR), the horizontal axis represents discrete levels of the latent preference score, and the points in the plot represent the percentage of respondents at each level of the latent preference score that endorse the item in the keyed direction. By visually examining this scatterplot of mean item endorsement rates, we can get an idea of the “true” nature of the relationship between the latent preference dimension and the observed likelihood of item endorsement in the keyed direction for the various levels of the latent construct.

Empirically derived ICCs provide an ideal vehicle for assessing the fit of the IRT model by virtue of the fact that they do not “force” any particular model (e.g., the 3-parameter IRT model) onto the data. That is, the ICCs presented in Figures 3 and 5 are the ones that were produced by fitting the 1- and 3-parameter IRT models to the MBTI item responses; although they look impressive, they essentially have to follow the IRT model, and there is no guarantee that they will actually provide a good fit to the data. In contrast, the empirically derived ICCs are free to adopt any shape that is appropriate for the data.
Thus, to the extent that the ICCs produced by the IRT models match the shape of the empirical ICCs, we would conclude that the IRT model provides a good degree of fit to the MBTI data.
As a practical matter, the main difficulty that arises when computing empirical ICCs is in finding a satisfactory method for estimating the latent construct scores. Because we don’t know the “true” preference scores for each examinee, and we can’t use the θ scores that are estimated using IRT (i.e., to avoid creating a logical circularity), it is customary to use the total score on the scale as the best available estimate of the true score. In the present case, the scores computed using the prediction-ratio (PR) preference scoring weights for Form F were used as the estimate of each person’s true score on the latent construct (virtually identical results were also obtained when we used the simple unweighted percentage of items that were answered in the keyed direction as the estimate of the latent construct).

Computationally, the empirical ICCs (see Figures 9-12 for the EI items used in the previous examples, and Figures 13-15 for the top items from the SN, TF, and JP scales) were produced as follows: (a) each person’s net preference score was calculated using the Form F scoring key and placed on a scale that placed the type cutoff at zero (i.e., preferences toward the keyed pole received positive values, and those toward the non-keyed pole received negative scores); (b) subgroups of raters were formed by breaking the sample into discrete intervals based on
their PR preference score (e.g., in Figure 9, all raters scoring 53 toward the “E” pole); (c) for each subgroup, we calculated the percentage of raters in that subgroup that endorsed the item in the keyed direction (e.g., Figure 9 shows that for Item 50, 0% of the raters in the subgroup scoring 53 toward “E” endorsed the item in the “I” direction); finally, (d) for each subgroup, we plotted the percentage of raters that endorsed the item in the keyed direction against the subgroup’s PR-based preference score (smoothed spline interpolations were fitted through this scatterplot in an attempt to capture the “true” ICC for each item).

It is important to emphasize again that unlike the ICCs presented in Figures 3 and 5 -- which were estimated using IRT methods and which therefore must follow the form dictated by the 1- or 3-parameter IRT model -- the empirically derived ICCs presented in Figures 9-15 are completely unconstrained by the IRT model. Accordingly, they can take on any form that is appropriate in order to depict the functional relationship (if any) that exists between each item response and the traditional PR-based preference scores.
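Steps (b) and (c) of the empirical-ICC procedure, together with the points that would be plotted in step (d), can be sketched as below. The function name `empirical_icc`, the bin width, and the tiny data set are all hypothetical and made up for illustration; the sketch assumes step (a) has already produced signed preference scores with the type cutoff at zero, and it omits the smoothing spline.

```python
import math
from collections import defaultdict


def empirical_icc(scores, responses, bin_width):
    """Tabulate keyed-direction endorsement rates by preference-score interval.

    scores    -- signed preference scores (type cutoff at zero)
    responses -- 1 if the item was endorsed in the keyed direction, else 0
    Returns (interval_midpoint, percent_endorsed, n) tuples sorted by midpoint.
    """
    tallies = defaultdict(lambda: [0, 0])  # interval index -> [endorsed, total]
    for score, endorsed in zip(scores, responses):
        index = math.floor(score / bin_width)
        tallies[index][0] += endorsed
        tallies[index][1] += 1

    curve = []
    for index in sorted(tallies):
        endorsed, total = tallies[index]
        midpoint = (index + 0.5) * bin_width
        curve.append((midpoint, 100.0 * endorsed / total, total))
    return curve


# Made-up scores and responses for a single item (not MBTI data).
scores = [-45, -41, -33, -21, -15, -9, -3, 5, 11, 17, 25, 31, 39, 47]
responses = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1]

curve = empirical_icc(scores, responses, bin_width=20)
for midpoint, percent, n in curve:
    print(f"score ~ {midpoint:+.0f}: {percent:5.1f}% endorsed (n = {n})")
```

Returning the subgroup size alongside each percentage makes it easy to see which points rest on very few respondents, which matters when judging the scatter around a fitted curve.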
Thus, to the degree that we see agreement between the empirically derived ICCs and the ICCs that were generated from the IRT parameter estimates, we will interpret such agreement as validation of the appropriateness of the IRT approach.

As the results in Figures 9-11 illustrate, although the unconstrained empirical ICCs provide a very poor match to the ICCs that were produced using the 1-parameter IRT model (Figure 3), they provide a very good match to the ICCs produced by the 3-parameter model (Figure 5). For example, the empirical ICC for Item 50 demonstrates a very nonlinear, highly discriminating shape (Figure 9); this curve closely matches the ICC estimated by the 3-parameter IRT model (Figure 5) in terms of both its shape and its relative location on the θ axis. Likewise, the empirical ICCs in Figures 10 and 11 for Items 33 and 129 agree quite closely with the 3-parameter model ICCs (Figure 5).

In all cases, there is remarkably little “scatter” around the line that we fit to each scatterplot, a fact that further supports the validity and advisability of using the 3-parameter IRT model to score the MBTI. When one considers the fact that some of these subgroup percentage-endorsement statistics (i.e., the squares in Figures 9-11) are based on quite small Ns, the correspondence between the empirically vs. IRT-derived ICCs becomes even more impressive.
To facilitate the comparison of these ICCs, the empirically derived ICCs for EI Items 33, 50, and 129 are presented superimposed upon one another in Figure 12. As a comparison of Figures 5 vs. 12 indicates, there is a great deal of similarity between the empirically vs. IRT-derived ICCs; this similarity is even more notable when one considers the profound differences that exist between the methods that were used to compute the scores that define the horizontal axes in Figure 5 (i.e., maximum likelihood-based estimation of θ using the parameters estimated for the 3-parameter IRT model) vs. Figure 12 (i.e., prediction-ratio-based preference scores based on the Form F scoring system).

As a further indicator of the generalizability of the above findings, empirically derived ICCs for high-performance items drawn from the SN, TF, and JP scales (i.e., identified using the Harvey & Murry, 1994, IRT parameters) are presented in Figures 13-15. Inspection of these ICCs again reveals the existence of markedly nonlinear functional relationships between preference scores and the likelihood of endorsing MBTI items in the keyed direction. Clearly, an S-shaped ICC is the most appropriate representation for these MBTI items.
As with the EI items, the results in Figures 13-15 indicate that although some items demonstrate their highest discriminating power (i.e., ICC slope) at the type cutoff point (Figure 13), others produce their maximum discriminating power at points below (e.g., Figure 14) and above (e.g., Figure 15) the type cutoff point. The fact that different items tend to produce their maximum discrimination at different points along the preference score continuum is easily modeled using IRT methods (i.e., by assigning different b parameters to the items).

To provide something of a baseline against which to judge the results in Figures 13-15, Figures 16-17 depict empirical ICCs computed by plotting item-endorsement rates against preference scores for dimensions other than the predicted one for the item in question. The ICC shown in Figure 16 is typical of such ICCs; this scatterplot shows that there is virtually no association between scores on the EI preference scale and subgroup item-endorsement percentages on Item 85 (a JP item). Note that there is an appreciably higher level of “scatter” around the line of best fit in this plot, as compared to the empirical ICCs computed for items on their predicted preference dimensions (Figures
9-15), indicating that (as expected) JP item endorsement rates are not consistently predictive of EI preferences.

There are exceptions to the pattern of non-association depicted in Figure 16, however, and most involve comparisons between the SN and JP dimensions. For example, Figure 17 presents a scatterplot of PCR values for Item 85 -- which, as Figure 15 illustrates, is a highly discriminating item with respect to the JP dimension -- against the PR-based preference scores for the SN dimension. As the empirically derived ICC in Figure 17 illustrates, there is a relatively strong (and linear) association between these two axes, such that higher scores on the “N” preference are associated with higher likelihood of endorsing Item 85 in the “P” (i.e., “unplanned” over “scheduled”) direction. This finding is consistent with the oft-reported positive correlation between the SN and JP preference scores (e.g., Harvey & Murry, 1994), and does not necessarily represent cause for concern. Indeed, in cases in which MBTI items are found to have consistent functional relationships with multiple latent preference scales, the possibility of using multidimensional IRT models that are capable of making use of the “collateral information” contained in such items becomes worthy of further study.

Figure 18 presents an empirical ICC in which item endorsement rates for EI Item 116 are plotted against the PR-based EI preferences.
As in the earlier empirical ICCs, the results in Figure 18 demonstrate a strong level of fit between the actual MBTI item response patterns and the 3-parameter IRT model. However, the most notable aspect of Item 116’s empirical ICC is that although this item demonstrates strong discriminating power with respect to the EI preference, the location of this discrimination occurs relatively far from the EI type cutoff point (i.e., approximately 41 PR preference units toward the “I” pole). That is, Introverts must possess quite a strong preference toward the “I” pole before they begin to choose the “detached” alternative over the “sociable” alternative in significant numbers.

In view of the fact that Item 116 provides relatively little discriminating power at the type cutoff point, it is not surprising to find that the traditional PR-based scoring system does not view it as being an especially useful one with respect to assessing the EI preference. However, as the empirical ICC in Figure 18 clearly indicates, this item is very useful in discriminating between individuals exhibiting moderate vs. strong preferences toward the “I” pole of the EI scale.
This ability to assess the discriminating power of each MBTI item across the full range of preference scores represents yet another point of superiority of the IRT approach over the traditional PR-based scoring system, which is primarily sensitive only to an item’s discriminating power in the vicinity of the type cutoff score.

In sum, using only the observed MBTI endorsement rates and the preference scores produced by the traditional PR-based scoring system, the above findings demonstrate that (a) the relationship between MBTI preferences and observed item endorsement rates is decidedly nonlinear for many items; (b) MBTI items differ widely with respect to the amount of information and discrimination they provide; and (c) the location on the preference scale at which each item provides its maximum information varies considerably for different MBTI items. These findings strongly support the appropriateness and potential usefulness of the 3-parameter IRT model as a vehicle for capturing the complex dynamics involved in responding to the MBTI’s items. In addition, these results argue strongly against the notion that simpler models (e.g., the 1-parameter IRT model, or systems based on a weighted or unweighted linear model) can provide an adequate representation of the complexity of these item responses.
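
The role of the three item parameters can be illustrated with a minimal sketch of the 3-parameter logistic ICC. The parameter values below are hypothetical (they are not estimates for any MBTI item), θ = 0 is assumed to mark the type cutoff, and the logistic form is written without the optional 1.7 scaling constant. The sketch shows that an item's steepest slope falls at its b parameter, so items with different b values discriminate best at different points on the preference continuum.

```python
import math

def icc_3pl(theta, a, b, c):
    """P(keyed response | theta) under the 3-parameter logistic model.

    a = discrimination, b = location, c = lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def icc_slope(theta, a, b, c):
    """Analytic slope (first derivative) of the ICC at theta."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * (1.0 - c) * logistic * (1.0 - logistic)

# Hypothetical items as (a, b, c); values chosen for illustration only.
items = {
    "peaks_at_cutoff": (1.8, 0.0, 0.05),
    "peaks_below_cutoff": (1.8, -0.8, 0.05),
    "peaks_above_cutoff": (1.8, 0.8, 0.05),
}

grid = [i / 100 for i in range(-300, 301)]  # theta from -3.00 to 3.00

# For each item, find the grid point at which the ICC is steepest.
steepest = {
    name: max(grid, key=lambda t, p=params: icc_slope(t, *p))
    for name, params in items.items()
}

for name, theta_star in steepest.items():
    print(f"{name}: steepest slope near theta = {theta_star:+.2f}")
```

Because the three hypothetical items share the same a and c values, they differ only in where along θ they are most informative, which is the situation described for Figures 13-15.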
In short, these empirical ICCs indicate that the 3-parameter IRT model provides a very good degree of fit to the MBTI item responses. We turn finally to a review of findings from studies that have attempted to apply the IRT approach to scoring the MBTI.

IRT Research on the MBTI

Empirical studies evaluating IRT-based approaches to scoring the MBTI have only recently begun to appear. However, the results of these initial studies have been very encouraging, especially regarding the ability of IRT scoring to address two of the most-criticized aspects of the MBTI: namely, preference score bimodality, and the degree of measurement precision that exists in the vicinity of the type cutoff scores. Additionally, IRT-based methods of estimating MBTI preference scores offer advantages in other areas, in particular, quantifying the quality or internal consistency of an individual’s profile of MBTI item responses (e.g., to detect potentially invalid profiles).

Bimodal Distributions

As we noted in our review of criticisms that have been raised regarding the MBTI, many authors have attacked it on the grounds that its preference score distributions are not bimodal (e.g., Pittenger, 1993; Stricker & Ross, 1964). Indeed, as the results presented in Harvey and Murry (1994) illustrated, PR-based preference score distributions are highly center-weighted and platykurtic.
This lack of bimodality has at least two important implications: (a) it provides ammunition to those who attempt to challenge the validity of Myers’ type-based personality theory (i.e., if there are basically two distinct “types” of people on each of the MBTI dimensions, it would not be unreasonable to expect to find a somewhat bimodal shape in the preference score distributions); and (b) it exacerbates the already difficult process of accurately assigning individuals to discrete type categories (i.e., whenever a cutoff score is used, we would strongly prefer to minimize the number of individuals who score near the cutoff; unfortunately, the PR-based preference score distributions locate a sizable number of individuals near the cutoff point).
Fortunately, the results of the Harvey and Murry (1994) study -- which was the first to derive and evaluate an IRT-based scoring system for the MBTI -- indicated quite clearly that when the 3-parameter IRT model is used to estimate scores on the continuous preference scales, the resulting score distributions are strongly bimodal. Updating these findings using the database from which the above empirical ICC results were produced (i.e., which adds a number of individuals to the sample used in Harvey & Murry, 1994), Figure 19 presents the frequency distribution for the EI scale’s PR-based preference scores (Figure 19 contains a frequency-count bar for each discrete PR-preference value). In contrast, Figure 20 presents the distribution of the EI θ-based preference score estimates (θ scores contain a significantly higher number of discrete score values; consequently, to facilitate comparison, the number of frequency bars in Figure 20 has been matched to the number of discrete PR-based preference values).

A comparison of Figures 19 vs. 20 indicates that the θ-based preference distribution is strongly bimodal in shape, whereas the PR-based preference scores exhibit a relatively flat distribution in which many individuals score near the type cutoff (very similar results are seen for the remaining three preference dimensions).
Although some respondents do indeed score in the vicinity of the type cutoff in the IRT-based distribution, there is a pronounced decrease in the density of individuals scoring in the cutoff region between the two very pronounced modes (which are located approximately 0.5 units on either side of the type cutoff). A visual examination of the two distributions suggests that fewer individuals score close to the cutoff point in the θ-based than in the PR-based distribution.

Thus, regarding the issue of preference score bimodality, the evidence available to date indicates quite convincingly that bimodal score distributions can be produced by simply changing the technology that is used to estimate preference scores from the observed MBTI item responses. Although bimodal preference distributions have been found in highly selected samples of individuals who demonstrate very strong type differentiation (e.g., Rytting et al., 1994), they have not been seen in larger, more representative samples (e.g., Stricker & Ross, 1964); this fact has been trumpeted by MBTI critics as a serious flaw in both the MBTI instrument and Myers’ type-based personality theory that inspired the MBTI.
If these results are found by subsequent research to be generalizable to non-student-based samples (which we have every reason to expect, given both the relatively large size of our sample and the fact that the students who attend major universities typically represent a diverse cross-section of the general population), this fact will effectively eliminate one of the major arguments raised by MBTI critics.

Measurement Precision

As we noted in our review of criticisms of the MBTI, many authors have expressed concerns regarding its measurement precision; in particular, the level of score stability that is seen in test-retest situations, and its ability to correctly assign individuals who score close to the type cutoffs to type categories (e.g., Pittenger, 1993).
Earlier, we identified two strategies that could be taken to improve the level of test-retest stability and the MBTI’s ability to correctly classify individuals into type categories: (a) decreasing the number of individuals who score close to the type cutoffs by increasing the bimodality of the preference score distributions; and (b) revising the MBTI scoring system to produce a higher level of precision in the vicinity of the type cutoff score.

As a visual examination of the results presented in Figures 19-20 suggests, switching from a PR- to a θ-based scoring system for the MBTI – without changing a single test item – appears to provide a means for addressing the bimodality issue.
In an attempt to more precisely address the question of whether θ-based scoring reduces the number of individuals scoring close to the type cutoffs, we standardized the PR-based preference scores to have the same mean and SD as the θ-based preferences, and then counted the number of individuals who scored within a band of a given size around each scale’s type cutoff score. Values of ±0.25 and ±0.35 were used when setting these bands; 0.25 is a somewhat arbitrary value, whereas 0.35 approximates the size of a ±1 SEM confidence interval for a scale having a .85 reliability, as well as the size of the SE that would be expected when estimating θ scores at the type cutoff point (see Figure 8). Individuals who score within these bands should be much more likely to be incorrectly classified into a categorical type due to the action of measurement error (either in a single administration, or in a test-retest situation) than those who score outside these zones.

Table 1 presents the numbers of individuals scoring within these two intervals for the PR- and θ-based preferences. As the breakdowns in Table 1 indicate, PR-based preference scoring consistently locates a larger percentage of respondents in the “zone of uncertainty” around the cutoff than the θ-based scoring system.
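
The band-counting procedure described above can be sketched as follows. The scores are made-up illustrative values (not the study data), the function names are hypothetical, and the PR cutoff is assumed to sit at a PR preference score of 0; the cutoff is passed through the same linear rescaling as the scores. The sketch also evaluates the classical formula SEM = SD × sqrt(1 − reliability).

```python
import math
import statistics

def sem(reliability, sd=1.0):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def rescaler(scores, target_mean, target_sd):
    """Return a linear map that gives `scores` the target mean and SD."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return lambda x: target_mean + target_sd * (x - mean) / sd

def share_in_band(scores, cutoff, half_width):
    """Proportion of scores lying within +/- half_width of the cutoff."""
    return sum(abs(x - cutoff) <= half_width for x in scores) / len(scores)

# Made-up scores for ten respondents (illustration only).
pr_scores = [-35, -21, -9, -3, -1, 1, 5, 13, 27, 41]
theta_scores = [-1.4, -0.9, -0.6, -0.5, -0.4, 0.4, 0.5, 0.7, 1.0, 1.5]

# Put the PR scores on the same mean and SD as the theta scores.
to_theta_metric = rescaler(
    pr_scores, statistics.fmean(theta_scores), statistics.pstdev(theta_scores)
)
pr_rescaled = [to_theta_metric(x) for x in pr_scores]
pr_cutoff = to_theta_metric(0.0)  # assumed PR cutoff, mapped the same way

for half_width in (0.25, 0.35):
    pr_share = share_in_band(pr_rescaled, pr_cutoff, half_width)
    irt_share = share_in_band(theta_scores, 0.0, half_width)
    print(f"+/-{half_width}: PR {pr_share:.0%}, theta {irt_share:.0%}")

print(f"SEM at reliability .85, unit SD: {sem(0.85):.3f}")
```

For a unit-SD scale, the classical formula gives an SEM of about 0.39 at a reliability of .85, which is in the neighborhood of the 0.35 band used above.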
Using the number of individuals classified within the ±0.25 and ±0.35 bands by the traditional PR-based scoring system as the basis for comparison, the IRT-based scoring system produces reductions of 37% and 27%, respectively, in the number of MBTI profiles that fall within this zone of uncertainty.

Likewise, comparing the number of individuals that fall within the zone of uncertainty using IRT versus PR scoring, the results in Table 1 indicate that 54% and 36% of the profiles that fall within the uncertainty zone using PR scoring fall outside the zone when using IRT scoring for the .25 and .35 bands, respectively. Conversely, only 4% and 3% of the profiles that fall outside of the uncertainty zone using PR scoring fall inside the zone when using IRT scoring. Again, these results illustrate the sizable reductions in the percentage of individuals who score close to the type cutoff point that are produced simply by
switching from a PR-based to a θ-based scoring system for the MBTI item responses.

Figures 21 and 22 present more information on the performance of the IRT-based scoring system; Figure 21 shows a scatterplot of the EI preference scores estimated by PR- vs. IRT-based methods, whereas Figure 22 shows a scatterplot of IRT-based preference scores for the EI vs. SN scales. As the plot in Figure 21 illustrates, there is a strong – but decidedly nonlinear – association between θ- and PR-based preference score estimates. For example, for individuals receiving an identical PR preference score, Figure 21 illustrates how they can receive a relatively broad range of θ-based preference scores. This illustrates a major advantage of θ-based scoring: that is, it doesn’t just matter how many items are endorsed in the keyed direction, it is critically important to determine which items are endorsed in each direction.
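
To make this point concrete, the following minimal sketch estimates θ by maximum likelihood (via a simple grid search) for two hypothetical respondents who give the same number of keyed answers, but to different items. The four items and their parameters are invented for illustration, positive θ is assumed to point in the keyed direction, and operational IRT software typically uses more refined estimators than a grid search.

```python
import math

def p_keyed(theta, a, b, c):
    """3-parameter logistic probability of a keyed response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for (a, b, c), keyed in zip(items, responses):
        p = p_keyed(theta, a, b, c)
        total += math.log(p) if keyed else math.log(1.0 - p)
    return total

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items (a, b, c): two highly discriminating, two weak.
items = [(2.0, 0.0, 0.1), (2.0, 0.0, 0.1), (0.4, 0.0, 0.1), (0.4, 0.0, 0.1)]

# Both respondents give exactly two keyed answers.
respondent_a = [1, 1, 0, 0]  # keyed on the two discriminating items
respondent_b = [0, 0, 1, 1]  # keyed on the two weak items

theta_a = estimate_theta(items, respondent_a)
theta_b = estimate_theta(items, respondent_b)
print(f"respondent A: theta = {theta_a:+.2f}")
print(f"respondent B: theta = {theta_b:+.2f}")
```

Under these assumed parameters, a simple count of keyed answers cannot distinguish the two respondents, whereas the likelihood-based estimates fall on opposite sides of θ = 0.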
In short, answers to highly discriminating items are much more diagnostic than answers to items that possess low a parameters; IRT-based scoring automatically takes these factors into account when estimating each individual’s θ-based preference score. Thus, two individuals with the same overall number of “keyed” answers might receive very different θ-based preference scores, depending on which items were endorsed.

The reductions in distribution density near the type cutoff scores that are illustrated in Figures 20 and 22, and quantified in Table 1, provide reason for optimism regarding the ability of IRT scoring to improve the measurement precision of the MBTI (as manifested by test-retest type stability, or with respect to agreement with type values obtained via “true type” methods). For example, in Figure 22, areas of much higher density can be seen in the bivariate distribution of the EI and SN scales (i.e., at the points at which the bimodal peaks are present in the univariate frequency distributions); likewise, areas of low density are seen in areas in which we would prefer to have few if any respondents (e.g., at 0 on both scales, the relatively sparsely populated square in the center of the scatterplot). Researchers now need to conduct empirical studies that compare PR- vs.
θ-based MBTI scoring systems in test-retest and “true type” settings; if, as we hypothesize, θ-based scoring is capable of producing improvements in test-retest type stability and higher levels of agreement between MBTI- and “true type”-based type assignments, another major class of criticisms of the MBTI could thereby be addressed.

However, it must be noted that the above results, as well as those obtained in the Harvey, Murry, and Markham (1994) study that examined the measurement precision of various short-form versions of the MBTI, are not uniformly positive. Indeed, these research findings indicate that considerable “room for improvement” exists with respect to the MBTI’s measurement precision. For example, even using the relatively small ±0.25 uncertainty interval in Table 1, 11% of the individuals in the sample have θ-based preference scores that lie close to the type cutoff score, and 19% score in this region using the more liberal ±0.35 interval.
Although these rates represent sizable reductions with respect to the numbers of individuals that fall within the uncertainty region using PR scoring (which locates 18% and 25% of the sample within these zones, respectively), we would ideally prefer to see the number of individuals scoring close to the cutoff approach zero.

Expanding the MBTI item pools to contain new items – in particular, items that produce highly discriminating ICCs like those presented in Figures 9 and 13-15 – is the most likely way in which to further improve the MBTI’s measurement precision. As the results of the Harvey, Murry, and Stamoulis (1995) and Harvey and Murry (1994) studies demonstrated, there are relatively few “high-performance” items in the Form G/F item pools; many items demonstrate only moderate levels of discrimination, and a number of items produce relatively poor levels of information (e.g., Figure 11).

The degree to which the MBTI could benefit from the addition of new, high-performance items was demonstrated by the Thomas and Harvey (1995) study, which attempted to write new items that would parallel the content domains of the existing four MBTI scales. Containing an item pool of 200 new items (50 per scale), the Work Styles Inventory (WSI; Thomas, 1994) was field tested on a sample of 583 college students.
Based on analyses of this database, Thomas and Harvey (1995) identified a number of the WSI items that, when added to the existing MBTI item pools, produced significantly higher TIFs for the MBTI scales. Figure 23 presents the TIFs for the EI scale that were computed using the Form F MBTI item pool, a long and short version of the WSI EI items, and the combined WSI-plus-MBTI pool.

An inspection of the TIFs presented in Figure 23 reveals that, as hypothesized, it is indeed possible to write new, high-performance items for the four main scales of the MBTI. When added to the existing MBTI scales, these new items produce substantial improvements in the TIFs, relative to the levels produced by the Form F items. Of course, the results in Figure 23 also indicate that the WSI items leave some “room for improvement,” in particular, with respect to the location of the additional information they provide. That is, the Form F item pool has a TIF that is somewhat biased in favor of assessing individuals scoring toward the “I” pole of the EI scale (i.e., its TIF peaks at approximately 0.25 in the “I” direction). In contrast, the WSI items are strongly biased in favor of higher precision in the “I” direction, with TIFs peaking at approximately 0.8 units in the “I” direction.
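
The way in which added items raise and relocate a TIF can be sketched with the standard 3-parameter item information function, I(θ) = a² [(P − c)/(1 − c)]² (1 − P)/P, summed over the items in a pool; the SE of the θ estimate is then approximately 1/sqrt(TIF). The item parameters below are hypothetical stand-ins (not the Form F or WSI estimates), positive θ is assumed to point toward the “I” pole, and the 1.7 scaling constant is omitted.

```python
import math

def p_keyed(theta, a, b, c):
    """3-parameter logistic probability of a keyed response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information under the 3-parameter logistic model."""
    p = p_keyed(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def tif(theta, items):
    """Test information function: sum of the item informations."""
    return sum(item_information(theta, *item) for item in items)

def peak_location(items, lo=-3.0, hi=3.0, steps=601):
    """Grid point at which the TIF is largest."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: tif(t, items))

# Hypothetical pools as lists of (a, b, c); illustration only.
existing_pool = [(1.0, 0.2, 0.1), (0.8, 0.3, 0.1), (0.6, 0.0, 0.1)]
new_items = [(1.8, 0.8, 0.1), (1.6, 0.7, 0.1)]
combined = existing_pool + new_items

for label, pool in (("existing", existing_pool), ("combined", combined)):
    info = tif(0.0, pool)
    print(
        f"{label}: TIF at cutoff = {info:.2f}, "
        f"SE at cutoff = {1 / math.sqrt(info):.2f}, "
        f"peak near theta = {peak_location(pool):+.2f}"
    )
```

In this toy pool the added items raise the information available at the cutoff but pull the peak of the TIF further toward the positive pole, which mirrors the pattern described for Figure 23.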
For practical use, we would prefer the TIFs to be symmetric, and centered on the cutoff point between the two types. Thus, additional items that produced their highest levels of discrimination in the “E” direction would be needed to balance out these new items.

It is also possible that the measurement precision of the MBTI item pools can be enhanced through the use of some of the “research” items that are included on longer forms of the MBTI (e.g., Form F, J). For example, Form J contains
over 190 items that are not part of the Form F/G scoring system; it seems reasonable to hypothesize that the addition of these “research” items to the Form F/G item pools should also produce improvements in the TIFs for the four major MBTI scales. Additional research is needed to evaluate the degree to which new high-performance items can be obtained from the existing “research” item pool.

Conclusions

In this article, we identified a small number of general classes of criticisms that have been directed toward the MBTI. Based on our review, the first of these classes of criticisms – which claims that the MBTI items do not measure the four latent constructs they seek to measure – was found to be sharply inconsistent with empirical research findings, particularly the results of recent large-sample exploratory and confirmatory factor analyses. The second class of criticisms – which involves claims to the effect that the MBTI is flawed because it does not produce bimodally shaped distributions of preference scores – was likewise found to be unsupported by the data when one considers preference score distributions computed using IRT-based scoring methods.
Although traditional PR-based preference scores do not exhibit bimodality, IRT’s θ-based preference score distributions were found to be sharply bimodal in large, unselected samples.

Using the research findings currently available to us, we were unable to dismiss the final class of criticisms, which deals with claims to the effect that the MBTI is flawed because its levels of test-retest type stability are less than perfect. However, based on the reductions in the relative number of individuals who score close to the type cutoffs that occur when IRT-based scoring methods are used, as well as the potential for the MBTI’s measurement precision to be increased via the addition of new items, we conclude that it is reasonable to hypothesize that significant improvements in the MBTI’s test-retest type stability may be achievable by switching to IRT-based scoring and/or lengthening the MBTI item pools.
Research implementing these strategies is now needed in order that we may determine the degree to which these measurement-precision-based criticisms can be dismissed as convincingly as we have dealt with criticisms based on the MBTI’s factor structure and the bimodality of its preference score distributions.

We also attempted to provide an overview of the IRT model, focusing on the way in which IRT’s traditional “right-wrong” terminology can be adapted to the domain of assessment instruments that are not couched in “right-wrong” terms, and on ways in which one can assess whether the IRT model “fits” the observed item responses. Regarding this latter issue, the results we presented using empirically derived ICCs – which, by definition, are in no way influenced by the assumptions made by the IRT model – showed quite convincingly that many MBTI items do indeed demonstrate nonlinear relations with the latent preference constructs, and that the MBTI items differ sharply with respect to both the amount and location of the information they provide with respect to the underlying MBTI preferences.

In conclusion, it is important to note that the traditional prediction-ratio based system of estimating MBTI preference scores has worked well for decades, and it has been very valuable to practitioners by virtue of providing them with a means of scoring the instrument and assigning individuals to type categories.
Clearly, any new system for scoring the MBTI must offer significant advantages or features that cannot be obtained using the traditional PR-based method. In short, we must ask whether it is worth the trouble to change to a new scoring system. Based on the above results, we conclude that IRT-based scoring does offer the kind – and magnitude – of improvement needed to justify the change to a new MBTI scoring system.

Specifically, advantages offered by IRT scoring include the following: (a) it produces bimodal score distributions that decrease the number of individuals who score close to the type cutoffs; (b) it offers a scoring system that allows us to differentially weight item responses based on each item’s discriminating power, the point at which it provides its maximum information, and the degree to which individuals who score strongly in the non-keyed direction will tend to endorse it in the keyed direction (all of which should produce more precise estimates of each person’s scores on the preference scales); (c) it allows the development of a version of the MBTI that can be administered using computerized adaptive testing (CAT) technology (which has the potential to significantly reduce testing time while keeping the precision of measurement high); (d) it can produce quantitative indices of the quality and internal consistency of an individual’s MBTI item response profile using appropriateness indices (these may be valuable in identifying invalid response profiles and in
resolving cases of type indeterminacy); and (e) it allows sensitive, item-level studies of the degree to which MBTI items tend to perform differently for individuals in different demographic categories (e.g., to identify items suffering from potential gender- or race-based bias).

IRT-based MBTI research has finally started to appear, and although much has been accomplished, much remains to be done. In particular, studies are needed to determine the degree to which IRT scoring is capable of producing higher test-retest type stability and/or agreement with “true type” assessments, the degree to which MBTI items suffer from race- or sex-based bias, the amount of reduction in testing time that may be possible by using CAT-based administration, the amount of success that may be achieved by using appropriateness indices to spot aberrant or internally inconsistent response profiles, and the degree to which the measurement precision of the MBTI scales can be enhanced via the addition of new items (either from the currently unused “research” items, or from other sources).
References
Block, J., & Ozer, D. J. (1982). Two types of psychologists: Remarks on the Mendelsohn, Weiss, and Feimer contribution. Journal of Personality and Social Psychology, 42, 1171-1181.
Briggs, K. C., & Myers, I. B. (1976). Myers-Briggs Type Indicator: Form F. Palo Alto: Consulting Psychologists Press.
Carlson, J. (1985). Recent assessments of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 49(4), 356-365.
Carlyn, M. (1977). An assessment of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 41, 461-473.
Carskadon, T. G. (1977). Test-retest reliabilities of continuous scores on the Myers-Briggs Type Indicator. Psychological Reports, 41, 1011-1012.
Cliff, N. (1987). The eigenvalue-greater-than-one rule and the reliability of components. Psychological Bulletin, 103, 276-279.
Coe, C. K. (1992). The MBTI: Potential uses and misuses in personnel administration. Public Personnel Management, 21(4), 511-523.
Comrey, A. L. (1983). An evaluation of the Myers-Briggs Type Indicator. Academic Psychology Bulletin, 5, 115-129.
Gangestad, S. W., & Snyder, M. (1991). Taxonomic analysis redux: Some statistical considerations for testing a latent class model. Journal of Personality and Social Psychology, 61, 141-146.
Garden, A. (1989). Organisational size as a variable in type analysis and employee turnover. Journal of Psychological Type, 17, 3-13.
Gauld, V., & Sink, D. (1985). The MBTI as a diagnostic tool in organization development interventions. Journal of Psychological Type, 9, 24-29.
Gough, H. G. (1976). Studying creativity by means of word association tests. Journal of Applied Psychology, 61, 348-353.
Hall, W. B., & MacKinnon, D. W. (1969). Personality inventory correlates of creativity among architects. Journal of Applied Psychology, 53, 322-326.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harvey, R. J., & Murry, W. D. (1994). Scoring the Myers-Briggs Type Indicator: Empirical comparison of preference score versus latent-trait methods. Journal of Personality Assessment, 62, 116-129.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1994). Evaluation of three short form versions of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 63, 181-184.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1995, May). A “Big Five” Scoring System for the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Harvey, R. J., Murry, W. D., & Stamoulis, D. (1995). Unresolved issues in the dimensionality of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 55, 535-544.
Harvey, R. J., & Thomas, L. A. (1995, May). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Hulin, C., Drasgow, F., & Parsons, C. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Hartzler, G. J., & Hartzler, M. T. (1982). Management uses of the Myers-Briggs Type Indicator. Research in Psychological Type, 5, 20-29.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis. Beverly Hills: Sage.
Johnson, D. A., & Saunders, D. R. (1990). Confirmatory factor analysis of the Myers-Briggs Type Indicator -- Expanded Analysis Report. Educational and Psychological Measurement, 50, 561-571.
Joreskog, K. G., & Sorbom, D. (1981). LISREL V: Analysis of linear structural relationships by maximum likelihood and least squares methods. Chicago: International Educational Services.
Kirton, M. J. (1976). Adaptors and innovators: A description and measure. Journal of Applied Psychology, 61, 622-629.
Lee, H. B., & Comrey, A. L. (1979). Distortions in a commonly used factor analytic procedure. Multivariate Behavioral Research, 14, 301-321.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. C. (1972). A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56, 347-367.
McCarley, N., & Carskadon, T. G. (1983). Test-retest reliabilities of scales and subscales of the Myers-Briggs Type Inventory and of criteria for clinical interpretive hypotheses involving them. Research in Psychological Type, 6, 24-36.
Mendelsohn, G. A., Weiss, D. S., & Feimer, N. R. (1982). Conceptual and empirical analysis of the typological implications of patterns of socialization and femininity. Journal of Personality and Social Psychology, 42, 1157-1170.
Miller, M. L., & Thayer, J. F. (1989). On the existence of discrete classes in personality: Is self-monitoring the correct joint to carve? Journal of Personality and Social Psychology, 57, 143-155.
Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic methods. Mooresville, IN: Scientific Software.
Mitchell, W. (1995). A clash of paradigms: Why bimodality, ANOVA interactions, and discontinuities are irrelevant criteria for typologies. Unpublished manuscript.
Moore, T. (1987). Personality tests are back. Fortune, March 30, 74-82.
Myers, I. B. (1962). The Myers-Briggs Type Indicator manual. Princeton, NJ: Educational Testing Service.
Myers, I. B., & McCaulley, M. H. (1985). A guide to the development and use of the Myers-Briggs Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
Myers, I. B., with Myers, P. B. (1980). Gifts differing. Palo Alto, CA: Consulting Psychologists Press.
Pittenger, D. J. (1993). The utility of the Myers-Briggs Type Indicator. Review of Educational Research, 63, 467-488.
Poilitt, I. (1982). Managing differences in industry. Research in Psychological Type, 5, 4-19.
Rytting, M., Ware, R., & Prince, R. A. (1994). Bimodal distributions in a sample of CEOs: Validating evidence for the MBTI. Journal of Psychological Type, 31, 16-23.
Sample, J. A., & Hoffman, J. L. (1986). The MBTI as a management and organizational tool. Journal of Psychological Type, 11, 47-50.
Sipps, G. J., Alexander, R. A., & Friedt, L. (1985). Item analysis of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 45, 789-796.
Stricker, L. J., & Ross, J. (1964). Some correlates of a Jungian personality inventory. Psychological Reports, 14, 623-643.
Thomas, L. A. (1994). Unpublished Master’s thesis, Virginia Polytechnic Institute and State University.
Thomas, L. A., & Harvey, R. J. (1995, April). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Thompson, B., & Borrello, G. M. (1986). Construct validity of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 46, 745-752.
Thompson, B., & Borrello, G. M. (1989, January). A confirmatory factor analysis of data from the Myers-Briggs Type Indicator. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor-analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-460.
Tzeng, O. C. S., Outcalt, D., Boyer, S. L., Ware, R., & Landis, D. (1984). Item validity of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 48, 255-256.
Figure 1. ICCs for two hypothetical items that illustrate the range of relations that can exist between the latent construct (θ, on the horizontal axis) and the observed likelihood of item endorsement in the keyed direction (PCR, on the vertical axis). Item 1 defines an almost linear function, whereas Item 2 approximates a step function. These ICCs were generated using a 2-parameter IRT model in which the b parameters were 0.0, and the a parameters were 0.35 and 17.0 for Items 1 and 2, respectively.
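As a rough illustration of the two curves described in Figure 1, the following Python sketch evaluates a 2-parameter logistic ICC with b = 0.0 and a = 0.35 or 17.0. It assumes the plain logistic form P(θ) = 1 / (1 + exp(-a(θ - b))) with no 1.7 scaling constant, which the caption does not specify; the function name `icc_2pl` is ours and purely illustrative.

```python
import math

def icc_2pl(theta, a, b=0.0):
    """Probability of a keyed response under a 2-parameter logistic model.

    Assumption: unscaled logistic metric (no D = 1.7 constant).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Discrimination (a) values taken from the Figure 1 caption; b = 0.0 for both.
items = {"Item 1": 0.35, "Item 2": 17.0}

for theta in (-2.0, -0.5, 0.0, 0.5, 2.0):
    row = ", ".join(
        f"{name}: {icc_2pl(theta, a):.3f}" for name, a in items.items()
    )
    print(f"theta = {theta:+.1f} -> {row}")
```

Under these assumptions Item 1 rises only from about .33 to about .67 between θ = -2 and θ = +2, whereas Item 2 moves from essentially 0 to essentially 1 within a narrow band around θ = 0.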
Figure 2. Item information functions for the two hypothetical items presented in Figure 1. The horizontal axis represents the levels of theta, whereas the vertical axis reflects the amount of information contained in each item, across the different levels of theta.
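For readers who want to reproduce the general shape of these information functions, the sketch below uses the standard 2-parameter logistic result I(θ) = a²P(θ)(1 - P(θ)). As before, the unscaled logistic metric is an assumption, and the helper names are hypothetical rather than taken from any scoring program.

```python
import math

def icc_2pl(theta, a, b=0.0):
    """2-parameter logistic ICC (assumes no D = 1.7 scaling constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b=0.0):
    """Item information for the 2PL model: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# a parameters from the Figure 1 caption; both items have b = 0.0.
for name, a in (("Item 1", 0.35), ("Item 2", 17.0)):
    values = [item_information_2pl(t, a) for t in (-1.0, 0.0, 1.0)]
    print(name, [f"{v:.4f}" for v in values])
```

In this sketch the information peaks at θ = b with height a²/4, so the low-discrimination item contributes a small, nearly flat amount of information across theta, while the step-like item concentrates a large amount of information in a very narrow region around the cutoff.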
Figure 3. 1-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction (the vertical line serves as the cutoff between the types). The PCR (vertical) axis indicates the expected percentage of individuals who would endorse the item in the keyed (“I”) direction for each level of theta (the horizontal line denotes the point at which we would expect 50% of the examinees to endorse the item in the keyed direction). The dotted vertical lines indicate the levels of theta at which 50% of those who hold that preference would endorse the item in the “I” direction.
Figure 4. 2-parameter ICCs for three hypothetical EI items that differ only in terms of their a (discrimination) parameters (Item 1 has a = .35, Item 2 has a = 1.0, and Item 3 has a = 2.1). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction; higher scores on the PCR (vertical) axis reflect a higher likelihood of endorsing the keyed (“I”) response. The two vertical lines on the theta axis are drawn to reflect a “slight” preference (Myers & McCaulley, 1985, p. 58) in the “E” (-0.2) and “I” (+0.2) directions. The solid horizontal lines identify the different item endorsement (PCR) rates for Item 1 at these two preferences; the dotted horizontal lines identify the PCRs for Item 3.
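The comparison that Figure 4 draws at the two “slight” preference points can be approximated numerically. The sketch below assumes b = 0.0 for all three items (the caption says only that the items differ in a) and an unscaled logistic metric; both are our assumptions, as are the variable names.

```python
import math

def icc_2pl(theta, a, b=0.0):
    """2-parameter logistic ICC (assumes no D = 1.7 scaling constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Theta values for the "slight" E and I preferences given in the caption.
SLIGHT_E, SLIGHT_I = -0.2, 0.2

# Discrimination parameters from the caption; b = 0.0 is assumed.
discriminations = {"Item 1": 0.35, "Item 2": 1.0, "Item 3": 2.1}

gaps = {}
for name, a in discriminations.items():
    p_e = icc_2pl(SLIGHT_E, a)
    p_i = icc_2pl(SLIGHT_I, a)
    gaps[name] = p_i - p_e  # difference in expected endorsement rates
    print(f"{name}: PCR(E) = {p_e:.3f}, PCR(I) = {p_i:.3f}, gap = {gaps[name]:.3f}")
```

Under these assumptions the expected endorsement rates of slight Extraverts and slight Introverts differ by roughly .035 for Item 1 and roughly .21 for Item 3, which is the sense in which the more discriminating item separates the two groups more sharply near the cutoff.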
Figure 5. 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). Higher PCRs are associated with increased levels of endorsement of the “I” alternative.
Figure 6. Item information functions for 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). The vertical axis reflects the amount of information contained in each item, across the different levels of theta.
Figure 7. Test information functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of information contained in the collection of items in each test, across the different levels of theta (larger values are better). The lower horizontal line denotes the amount of information necessary to produce a 0.5 standard error (SE) when estimating the theta score from the item responses; the upper horizontal line corresponds to the level required to produce a 0.39 SE (i.e., the level that would be predicted if the CTT-based reliability of the MBTI scales was 0.85).
Figure 8. Test standard error (SE) functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of precision in estimating the theta score, at each level of theta (smaller values are better). The upper horizontal line denotes an SE of 0.5; the lower line denotes an SE of 0.39 (which corresponds to a CTT reliability of 0.85).
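The reference lines in Figures 7 and 8 follow from two standard relations: the standard error of the theta estimate is the reciprocal square root of the test information, and, for a score scaled to unit variance, the CTT standard error of measurement is the square root of one minus the reliability. The sketch below works through that arithmetic; the unit-variance scaling is an assumption on our part, and the function names are illustrative.

```python
import math

def se_from_information(information):
    """Standard error of the theta estimate: SE = 1 / sqrt(I)."""
    return 1.0 / math.sqrt(information)

def information_from_se(se):
    """Test information needed to reach a target SE: I = 1 / SE^2."""
    return 1.0 / (se * se)

def se_from_reliability(reliability, sd=1.0):
    """CTT standard error of measurement, assuming scores with SD = sd."""
    return sd * math.sqrt(1.0 - reliability)

print(f"SE implied by reliability .85: {se_from_reliability(0.85):.3f}")
for target_se in (0.5, 0.39, 0.25):
    print(f"SE = {target_se}: information needed = {information_from_se(target_se):.2f}")
```

On these assumptions a reliability of .85 implies an SE of about .39, an SE of .5 requires an information value of 4.0, an SE of .39 requires about 6.6, and an SE of .25 requires 16.0.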
Figure 9. Empirically derived ICC for a high-performance MBTI item from the EI scale (number 50, “good mixer” vs. quiet and reserved). The horizontal axis denotes the EI preference scores (positive values indicating “I” preference, negative values indicating “E” preference) computed using the Form F scoring system. The curved line drawn through the points is a smoothed spline interpolation. The squares denote the actual percentages of individuals at each level of the EI preference who endorsed the item in the “I” direction. Here, higher PCRs are associated with increased likelihood of endorsing the “quiet and reserved” alternative.
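The plotted squares in an empirical ICC of this kind are simply the proportion of respondents at each preference-score level who chose the keyed alternative. A minimal sketch of that tabulation follows. The data are invented toy values, the binning width is arbitrary, and the function name `empirical_icc` is hypothetical; the smoothed spline mentioned in the caption is not reproduced here.

```python
import math
from collections import defaultdict

def empirical_icc(scores, responses, bin_width=10):
    """Return sorted (bin_midpoint, proportion_keyed, n) tuples.

    scores    : preference scores (negative = "E", positive = "I")
    responses : 1 if the item was endorsed in the keyed ("I") direction, else 0
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [keyed count, total count]
    for score, keyed in zip(scores, responses):
        idx = math.floor(score / bin_width)
        counts[idx][0] += int(keyed)
        counts[idx][1] += 1
    return [
        ((idx + 0.5) * bin_width, keyed / total, total)
        for idx, (keyed, total) in sorted(counts.items())
    ]

# Toy data for illustration only; not taken from the study.
toy_scores = [-35, -31, -22, -15, -8, -3, 3, 9, 14, 21, 33, 37]
toy_responses = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

curve = empirical_icc(toy_scores, toy_responses, bin_width=20)
for midpoint, proportion, n in curve:
    print(f"score bin centered at {midpoint:+.0f}: PCR = {proportion:.2f} (n = {n})")
```

With real data, each bin would need enough respondents for the proportion to be stable, which is one reason a smoothed curve is drawn through the points rather than connecting them directly.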
Figure 10. Empirically derived ICC for a moderate-performance MBTI item from the EI scale (number 33, easy vs. hard to get to know). Here, higher PCRs are associated with an increased likelihood of endorsing the “hard to get to know” alternative.
Figure 11. Empirically derived ICC for a low-performance MBTI item from the EI scale (number 129, one of first to follow a new fashion vs. not interested). Here, higher PCRs are associated with increased likelihood of endorsing the “not interested in following fashion” alternative.
Figure 12. Overlaid empirically derived ICCs for EI items 33, 50, and 129. A comparison of these ICCs against those produced by the 3-parameter IRT model presented in Figure 5 provides compelling evidence regarding the appropriateness of using the 3-parameter IRT model to score the MBTI.
Figure 13. Empirically derived ICC for a high-performance MBTI item from the SN scale (number 104, concrete vs. abstract); scores to the right of the vertical line represent “N” preferences, whereas those to the left represent “S” preferences. Here, higher PCRs are associated with increased likelihood of endorsing the “abstract” alternative.
Figure 14. Empirically derived ICC for a high-performance MBTI item from the TF scale (number 114, feeling vs. thinking); scores to the right of the vertical line represent “F” preferences, whereas those to the left represent “T” preferences. Higher PCRs are associated with increased likelihood of endorsing the “feeling” response.
Figure 15. Empirically derived ICC for a high-performance MBTI item from the JP scale (number 85, scheduled vs. unplanned); scores to the right of the vertical line represent “P” preferences, whereas those to the left represent “J” preferences. Higher PCRs are associated with increased likelihood of endorsing the “unplanned” response.
Figure 16. Empirically derived ICC for a high-performance JP item (85, scheduled vs. unplanned) using scores on the EI preference dimension as the horizontal axis (scores to the right of the vertical line denote “I” preferences, whereas those to the left represent “E” preferences). As would be expected, there is virtually no association between EI preferences and the likelihood of endorsing this item in the “unplanned” (“P”) direction.
Figure 17. Empirically derived ICC for a high-performance JP item (85, scheduled vs. unplanned) using scores on the SN preference dimension as the horizontal axis (scores to the right of the vertical line denote “N” preferences, whereas those to the left represent “S” preferences). Reflecting the fact that the SN and JP preferences are not orthogonal, a consistent association can be observed between SN preferences and the PCR rates for this JP item (as expected, intuitives tend to endorse this item in the “unplanned” direction at higher rates than sensors).
Figure 18. Empirically derived ICC for EI item 116 (detached vs. sociable) using scores on the EI preference dimension as the horizontal axis. This illustrates an item that would likely be viewed as a low-performance item by the traditional prediction-ratio-based scoring system, but which is viewed as a strongly discriminating item by IRT. The reason for this discrepancy lies in the fact that this item provides its best discrimination for relatively strong Introverts (e.g., in the 40-50 range toward “I”).
Figure 19. Frequency distribution for PR-based preference scores (using Form F key) on the EI dimension.
Figure 20. Frequency distribution for IRT-based preference score estimates on the EI dimension.
Figure 21. Scatterplot of EI preference scores estimated using the traditional PR-based formula (horizontal axis) versus the IRT-based method (vertical axis). The line drawn through the points is the linear regression line.
Figure 22. Scatterplot of EI (vertical axis) versus SN (horizontal axis) preference scores estimated using IRT methods. Note the areas of higher density at approximately 0.5 z units above and below the type cutoff, and the area of low density at the cutoff on each scale (i.e., 0.0).
Figure 23. Test information functions for the EI scales using the Form F MBTI item pools, the 22- and 35-item pools for the EI scale of the Work Styles Inventory (WSI; Thomas, 1994), and the combined MBTI plus WSI EI item pool. Horizontal lines correspond to the levels of information that would produce SE values in estimating theta of .25 and .50.
Table 1
Numbers of MBTI Profiles Scoring Within a Given “Zone of Uncertainty” around the Cutoffs

Cell entries: Number of Profiles (% of Total; % of Row; % of Column)

±0.25 Interval Around the Cutoff

| | Outside Cutoff Region on θ-Based Preference | Inside Cutoff Region on θ-Based Preference | Total |
|---|---|---|---|
| Outside the Cutoff Region on PR-Preference | 1984 (79.4%; 89.3%; 96.4%) | 73 (2.9%; 26.3%; 3.6%) | 2057 (82.3%) |
| Inside the Cutoff Region on PR-Preference | 237 (9.5%; 10.7%; 53.6%) | 205 (8.2%; 73.7%; 46.4%) | 442 (17.7%) |
| Total | 2221 (88.9%) | 278 (11.1%) | 2499 (100%) |

±0.35 Interval Around the Cutoff

| | Outside Cutoff Region on θ-Based Preference | Inside Cutoff Region on θ-Based Preference | Total |
|---|---|---|---|
| Outside the Cutoff Region on PR-Preference | 1809 (72.4%; 88.9%; 96.9%) | 58 (2.3%; 12.5%; 3.1%) | 1867 (74.7%) |
| Inside the Cutoff Region on PR-Preference | 226 (9.0%; 11.1%; 35.8%) | 406 (16.3%; 87.5%; 64.2%) | 632 (25.3%) |
| Total | 2035 (81.4%) | 464 (18.6%) | 2499 (100%) |
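The percentages in Table 1 can be checked from the four cell counts alone. The sketch below does this for the ±0.25 panel, treating the PR-based classification as the first index and the θ-based classification as the second; that orientation, and all names in the code, are our reading of the flattened table rather than something the source states.

```python
def crosstab_percentages(table):
    """Compute cell percentages for a 2 x 2 table of counts.

    table[i][j]: PR-based classification i (0 = outside, 1 = inside the cutoff
    region) crossed with theta-based classification j (same coding).
    """
    total = sum(sum(row) for row in table)
    pr_totals = [sum(row) for row in table]            # PR-based marginals
    theta_totals = [sum(col) for col in zip(*table)]   # theta-based marginals
    cells = []
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            cells.append({
                "cell": (i, j),
                "n": n,
                "pct_total": 100.0 * n / total,
                "pct_pr_marginal": 100.0 * n / pr_totals[i],
                "pct_theta_marginal": 100.0 * n / theta_totals[j],
            })
    return cells, total

# Counts from the +/-0.25 panel of Table 1.
cells, total = crosstab_percentages([[1984, 73], [237, 205]])
for c in cells:
    print(c["cell"], c["n"],
          f"{c['pct_total']:.1f}% of total,",
          f"{c['pct_pr_marginal']:.1f}% of PR marginal,",
          f"{c['pct_theta_marginal']:.1f}% of theta marginal")
```

Run on these counts, the cells sum to 2,499 and the computed values agree with the printed percentages to within about 0.1 percentage point. In this orientation the figures printed as “% of Row” (e.g., 89.3% = 1984/2221) are shares of the θ-based totals and those printed as “% of Column” (e.g., 1984/2057) are shares of the PR-based totals, which suggests the labels may have been defined with the θ-based classification as the row variable.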