10.07.2015 Views

Using Item Response Theory to Score the Myers-Briggs Type Indicator


Taken either singly or together, these criticisms are potentially quite serious. For example, if factor analytic evidence consistently indicates that the 4-factor view of the MBTI is implausible, its psychometric defensibility in assessment situations would be called into question. Likewise, the lack of bimodal distributions in the preference scores, as well as the nontrivial rates of type changes seen in test-retest situations, have been viewed by many researchers as representing serious challenges to the psychometric quality of the MBTI. In the following sections we examine each of these issues.

Criticisms of the MBTI’s Factor Structure

Several exploratory factor analyses of the MBTI have been reported, and some of them (e.g., Comrey, 1983; Sipps, Alexander, & Friedt, 1985) have produced factor structures that their authors viewed as being inconsistent with the predicted 4-factor model. This fact has been cited by critics of the MBTI (e.g., Pittenger, 1993, pp. 474-476) as support for the more general conclusion that “the MBTI does not provide the assessment of personality types that it claims” (Pittenger, p. 475).

However, a number of other exploratory factor analytic studies of the MBTI (e.g., Harvey, Murry, & Stamoulis, 1995; Tischler, 1994; Tzeng, Outcalt, Boyer, Ware, & Landis, 1984) have reported results that show an extremely high degree of correspondence between the recovered factor solutions and the predicted 4-factor structure. What conclusions regarding the MBTI’s factor structure or construct validity should be drawn based on these apparently conflicting findings?

In our assessment, the fact that several exploratory studies have reported findings that closely match the predicted 4-factor structure (e.g., Harvey et al., 1995; Tischler, 1994) is consistent with -- but not definitive proof of -- the validity of the MBTI’s predicted dimensional structure. Of greater importance, the fact that some exploratory studies produced solutions that did not match the predicted 4-factor structure (e.g., Sipps et al., 1985) says very little either pro or con, given (a) the less-than-optimal sample sizes and factor-analytic decision rules that characterized those studies, as well as (b) the inherent inability of exploratory methods to test the validity of a hypothesized factor model.

Regarding the former issue, the Comrey (1983) and Sipps et al. (1985) findings were based on factor-analytic decisions (e.g., principal components analysis, Varimax rotation) that have been repeatedly criticized in the psychometric literature (e.g., Cliff, 1987; Lee & Comrey, 1979; Snook & Gorsuch, 1989; Tucker, Koopman, & Linn, 1969). With respect to sample size, the Comrey (1983) study had only a 2.5:1 ratio of subjects to items; in such small samples, the likelihood of finding unstable results due to the effects of sampling error increases significantly. In contrast, among the exploratory studies that reported results that were consistent with the predicted 4-factor structure, the Harvey et al. (1995) study had a 12:1 ratio of subjects to items, and the Tischler (1994) study had a 22:1 ratio; results obtained in samples of these sizes should be much more likely to be stable and valid than those obtained in smaller samples.

Regarding the latter issue, the results of any exploratory factor analysis -- even one performed in a very large sample -- are fundamentally incapable of answering what is essentially a confirmatory question: namely, to what degree does the hypothesized factor structure provide a plausible representation of the observed item-level data? That is, among its other limitations (e.g., subjectivity with respect to determining the number of factors to retain), the exploratory factor model exhibits a fundamental indeterminacy with respect to factor rotation (i.e., an infinite number of different orthogonal or oblique transformations of the factor solution can be made without changing the degree to which it can reproduce, or ‘fit,’ the data matrix). Thus, if the predicted structure is not recovered, this fact provides essentially no evidence regarding the degree to which the hypothesized model would be capable of providing a level of fit that is as good as, or better than, that which is produced by the obtained factor solution.

Fortunately, confirmatory factor analytic methods (e.g., James, Mulaik, & Brett, 1982; Jöreskog & Sörbom, 1981) were developed to address precisely this kind of question. Unlike exploratory factor analysis, confirmatory factor analysis allows the researcher to directly test the degree to which a hypothesized factor model is consistent with the variance/covariance matrix that is observed among the instrument’s items. A major strength of confirmatory factor analysis is that it allows for the possibility of falsifying a hypothesized factor model (i.e., showing that it is inconsistent with the observed data). That is, if the predicted factor pattern is found to provide a poor level of fit to the observed data, this fact can provide compelling evidence against the validity or plausibility of the predicted factor structure. Thus, although confirmatory methods cannot prove that a given good-fitting model is the best possible model for an instrument (theoretically, it is always possible to postulate the existence of an alternative model that demonstrates an even higher level of fit), they are nevertheless extremely valuable by virtue of their ability to reject poor-fitting models and to rank competing models with respect to the degree to which they fit the observed data.

Although studies that criticize the psychometric properties of the MBTI typically do not cite their findings, several confirmatory factor analyses of the MBTI have been reported (e.g., Harvey, Murry, & Stamoulis, 1995; Harvey, Murry, & Markham, 1995; Johnson & Saunders, 1990; Thompson & Borrello, 1989), and their results have consistently supported the validity of the predicted 4-factor structure. When considered on its own (e.g., Johnson & Saunders, 1990; Thompson & Borrello, 1989), the predicted MBTI factor structure has been found to provide a plausible representation of the latent structure of this instrument. Of even greater importance, when the
predicted 4-factor MBTI model was compared against the alternative factor models advanced by Comrey (1983) and Sipps et al. (1985), the predicted MBTI structure was found to be superior to both of these competing views of its dimensionality (Harvey, Murry, & Stamoulis, 1995). Indeed, the results of the Harvey et al. (1995) study suggested that both the Sipps et al. (1985) and Comrey (1983) models were fundamentally misspecified (i.e., based on the extremely high correlations that were estimated between some of their factors).

However, these factor analytic studies have identified some issues that deserve further study. For example, in the exploratory studies, some MBTI items were found to load strongly on more than one factor; additionally, in both exploratory and confirmatory studies, a nontrivial percentage of the items exhibited only moderate-to-small loadings on their primary factors. Ideally, to maximize the independence and measurement precision of the scales, we would prefer that items load only on the predicted factor, and that all items in a scale demonstrate moderate-to-large loadings on their underlying factor. These findings suggest that the item pools for each of the four main MBTI scales could be broadened to include additional items with higher loadings on the desired latent construct.

Additionally, in studies that examined oblique factor models, consistently nonzero correlations between the SN and JP factors were reported (e.g., Harvey et al., 1995; Pittenger, 1993, p. 475), a finding that has also been seen when the traditional prediction ratio method is used to calculate MBTI preference scores (e.g., Webb, 1964). That is, there is some tendency for individuals who prefer Sensing to be more likely to favor Judging than Perceiving, and for those who favor Intuition to be more likely to favor Perceiving than Judging. Ideally, from a theoretical standpoint (e.g., Myers, 1980, pp. 2-9) one might argue that the four preferences should be mutually orthogonal. However, it must be noted that these SN-JP correlations have generally been quite modest in magnitude (e.g., in the .20’s to .40’s, representing only 4% - 16% of shared variance), and that at this point we cannot determine whether the lack of orthogonality is due to redundancy in the conceptual definition of the SN and JP preferences, limitations of the items used to measure these constructs, sampling error, a combination of the above factors, or whether it simply reflects the fact that some combinations of scores on these two dimensions occur more frequently than others (e.g., SJ is much more common than SP). Further research conducted in larger and more carefully stratified samples is necessary to resolve this question.

In sum, although some secondary issues remain unresolved, a review of the factor analytic research findings indicates quite conclusively that the major criticisms that have been raised regarding the MBTI’s factor structure (e.g., Comrey, 1983; Pittenger, 1993) are not supported by the data, particularly the results of confirmatory factor analyses. On the contrary, a large and growing body of evidence indicates that (a) four major factors underlie the items that are used to compute the MBTI preference scores, (b) the items that define these factors are precisely those that were predicted to do so by the MBTI’s developers, and (c) of all of the competing factor structures that have been proposed to date, the a priori 4-factor solution provides the most plausible representation of the MBTI’s latent structure.

Criticisms Regarding Type Stability and Bimodality

Thus, when one considers the entirety of the factor analytic evidence, the MBTI’s hypothesized 4-factor structure performs quite well; clearly, this is encouraging news for proponents of the MBTI. However, with respect to criticisms that focus on preference score bimodality and type stability in test-retest situations, until recently there has been less cause for encouragement.

Type stability. The fact that a nontrivial percentage of MBTI respondents change their type assignments on at least one preference dimension on repeated testing has been well documented. For example, Carskadon (1977) reported relatively high test-retest reliabilities over five-week intervals (.78 - .87) for preference scores; however, on retesting, 19% of the subjects changed type on the EI preference, 11% changed on SN, 17% on TF, and 16% on JP. Other studies have produced similar findings: for example, Myers and McCaulley (1985, p. 173) summarized the results of 20 test-retest studies, finding that full-profile type stability rates ranged from 24%-61%, with an average of only 43% of the subjects remaining the same on all four scales on retesting.

Although the levels of test-retest reliability obtained using the continuous preference scores have generally been quite respectable, the levels of instability in the categorical type assignments have presented an inviting target for critics of the MBTI. For example, Pittenger (1993) noted that because “Jung and Briggs and Myers conceived of personality as an invariant” (p. 471), “if each of the 16 types is to represent a very different personality trait, it is hard to reconcile a test that allows individuals to make radical shifts in their type” (p. 472). Under this argument, switching poles on even one of the four preference dimensions represents a significant substantive and interpretative change.

In our assessment, it is unlikely that the majority of these apparent changes in type -- especially those that occur over relatively short intervals of a few weeks or months -- reflect true changes in preference. Instead, as has been speculated by a number of authors (e.g., Harvey & Murry, 1994; Pittenger, 1993), it is much more likely that these changes are the result of the action of measurement error; in particular, measurement error occurring in the vicinity of the type cutoff score.

That is, for individuals whose true preference scores lie close to the type cutoff point, even a relatively small amount of measurement error could cause their observed preference scores to lie on opposite sides of the cutoff over repeated testings (giving the erroneous appearance of a type
switch), despite the fact that the true preferences remain constant over time (i.e., as would be predicted by type theory). For such individuals, the most direct way to improve the MBTI’s level of type stability would be to increase its measurement precision (or reliability).

It is important to note that the above interpretation does not rule out the possibility that some percentage of respondents who appear to change types on repeated testings may truly change their scores on one or more preference dimensions, or that some individuals may simply appear to change types due to careless responding, situational factors, or deliberate misrepresentation. On the contrary, it simply provides an explanation for why individuals who do not suffer from true fluctuations in their preferences would appear to change their types.

In short, the important question concerns the relative percentage of individuals who appear to change type on repeated testing due simply to the action of measurement error near the type cutoff. If such individuals constitute a large percentage of those whose type assignments change on retesting, a strategy for improving the MBTI to reduce such occurrences would then be evident (i.e., increasing its measurement precision near the type cutoff score).

Bimodality. The issue of preference score bimodality is closely linked with the issue of type stability. Although some demonstrations of preference bimodality have been reported in select samples having strongly differentiated types (e.g., Rytting, Ware, & Prince, 1994), there is overwhelming evidence to indicate that MBTI preference distributions in large, unselected samples are not bimodal (e.g., Harvey & Murry, 1994; Hicks, 1984; McCrae & Costa, 1989; Stricker & Ross, 1964). Although this lack of bimodality in MBTI preference scores does not necessarily invalidate the type-based theory on which the instrument is based, it does present a tempting target for critics of the MBTI. As Pittenger (1993) noted, findings of lack of bimodality “give reason to suspect the claims that types represent separate populations, and that small quantitative differences between scores represent a significant qualitative difference in personality” (p. 471).

Regardless of whether or not one agrees with the assertion that the MBTI must demonstrate bimodal score distributions (as we describe below, in our assessment bimodality is not strictly necessary), the fact remains that the type stability, measurement precision, and bimodality issues are closely linked. Because all psychological tests contain some degree of measurement error, whenever a cutoff score is used to dichotomize a continuous scale it becomes highly advantageous to minimize the relative number of people who score near the cutoff. This is done in order to minimize the chance that even relatively minor errors of measurement could cause a person’s observed score to fall on the opposite side of the cutoff from their true score (i.e., an erroneous type classification). As Pittenger (1993) noted, “an accurate and durable assessment of type cannot be made for those subjects whose scores are close to the zero point [i.e., type cutoff] and [who therefore] have a high probability of crossing that boundary” (p. 472) due simply to the action of measurement error.

In essence, a lack of bimodality in the preference score distributions may exacerbate the problem of type misclassifications due to measurement error near the cutoff score (i.e., because center-weighted distributions have a much higher percentage of individuals scoring near the cutoff). Thus, if measurement precision (i.e., reliability) is held constant, increasing the number of people who score near the type cutoff will unavoidably increase the number of erroneous type classifications, both in test-retest and single-administration situations. It follows that as a practical matter, the reliability of a scale that is to be dichotomized may need to be significantly higher than the level that would be considered adequate for a test in which a cutoff score is not imposed. Thus, on totally pragmatic grounds, bimodal preference score distributions are much more desirable than center-weighted ones because they reduce the number of erroneous type classifications that would be expected due to measurement error at the cutoff.

As was noted above, one might legitimately question whether it is necessary for a type-based instrument to produce bimodal distributions. Although many researchers (e.g., Pittenger, 1993; Stricker & Ross, 1964) appear to have accepted the argument that bimodal distributions are necessary based largely on theoretical arguments (e.g., Myers with Myers, 1980), opposing arguments can be offered (e.g., Mitchell, 1995). Indeed, at a strictly pragmatic level, there is no difference between setting a cutoff score on the MBTI scales for the purpose of assigning individuals to type categories versus setting a cutoff score on any other psychological scale that lacks a bimodal distribution (which is, of course, the case for most psychological scales). That is, cutoff scores are frequently -- and appropriately -- used with tests that demonstrate center-weighted, Normal distributions. For example, in organizational selection it is commonplace to rank employees based on their scores on a cognitive ability test, and to only consider those who score above a minimum cutoff for hiring. In such situations, rarely if ever does the practitioner expect the employment test to demonstrate bimodality, or to minimize the density of the distribution near the cutoff point. Clearly, bimodality is not a necessary condition for setting a cutoff score on a psychological test.

Thus, although one can argue that bimodality is not a prerequisite characteristic in order for the MBTI to be judged psychometrically adequate, it is nonetheless a highly desirable characteristic due to the MBTI’s use of a cutoff score to assign individuals to the categorical types. Based on the above discussion of the effect of measurement error at the cutoff, it is clear that the bimodality and type-stability issues are inextricably linked, and that the maximum improvement in MBTI test-retest type stability would be expected to occur when improvements in both bimodality and measurement precision at the cutoff are achieved.

Thus, one does not have to accept the theory-based argument that a type-based instrument must produce bimodal score distributions in order to appreciate the
practical advantages that would obtain if <strong>the</strong> MBTI’spreference scores were more bimodal in nature.Strategies for Addressing <strong>the</strong>se IssuesOf all of <strong>the</strong> criticisms of <strong>the</strong> MBTI that have beenraised <strong>to</strong> date, it is our assessment that <strong>the</strong> type-instabilityissue is one of <strong>the</strong> most troublesome. That is, if it is truethat preferences are inborn, and that by adulthood mostindividuals achieve reasonably well differentiated types(e.g., <strong>Myers</strong> & McCaulley, 1988; <strong>Myers</strong> with <strong>Myers</strong>,1980), one would definitely not expect <strong>to</strong> find from 24%-61% of individuals changing types on at least one MBTIdimension on repeated testing, especially when <strong>the</strong>administrations are given only a few weeks or monthsapart. Indeed, when interpreting <strong>the</strong> empirical dataregarding test-retest type stability and preference scoredistribution shape, critics of <strong>the</strong> MBTI have concluded that“<strong>the</strong> patterns of data do not suggest that <strong>the</strong>re is reason <strong>to</strong>believe that <strong>the</strong>re are 16 unique types of personality”(Pittenger, 1993, p. 483), and that “<strong>the</strong> four-letter type codeis not a stable personality characteristic” (p. 472).It is important <strong>to</strong> realize that such conclusions arebased on a critical -- and untested -- assumption: namely,that <strong>the</strong> lack of bimodality and <strong>the</strong> observed levels of typeinstability reflect flaws in <strong>the</strong> MBTI itself. 
Interestingly,little or no consideration has been given <strong>to</strong> <strong>the</strong> alternativeviewpoint that <strong>the</strong>se empirical findings do not reflect flawsin <strong>the</strong> MBTI or its underlying <strong>the</strong>ory, but instead are causedby limitations in <strong>the</strong> scoring system that is used <strong>to</strong> convertitem responses in<strong>to</strong> <strong>the</strong> preference scores that aredicho<strong>to</strong>mized <strong>to</strong> form type assignments. We contend thatbefore sweeping conclusions regarding <strong>the</strong> validity of <strong>the</strong>MBTI can be drawn, researchers must first determinewhe<strong>the</strong>r improvements in bimodality and type stability canbe achieved via modifications <strong>to</strong> <strong>the</strong> techniques that areused <strong>to</strong> score <strong>the</strong> MBTI and assign categorical types.Without doubt, <strong>the</strong> answer <strong>to</strong> <strong>the</strong> question of whe<strong>the</strong>rrevisions <strong>to</strong> <strong>the</strong> MBTI scoring system would be able <strong>to</strong>improve type stability and/or preference score bimodality isof fundamental importance. That is, if a new scoringsystem were <strong>to</strong> be developed that is capable of producingmore bimodally shaped preference distributions in large,unselected samples of MBTI respondents, this wouldeffectively destroy a key line of evidence on whichcriticisms of <strong>the</strong> MBTI instrument -- as well as <strong>the</strong> typebased<strong>the</strong>ory on which it is founded -- have been based(e.g., Pittenger, 1993, p. 471). 
Likewise, if a scoringsystem capable of producing improvements in <strong>the</strong> MBTI’smeasurement precision near <strong>the</strong> cu<strong>to</strong>ff were <strong>to</strong> be produced,increased type stability in test-retest situations would bepredicted <strong>to</strong> result, <strong>the</strong>reby addressing <strong>the</strong> remaining majorempirical criticism of <strong>the</strong> MBTI.However, what strategies should be followed in order<strong>to</strong> modify <strong>the</strong> MBTI’s scoring procedures in order <strong>to</strong>achieve <strong>the</strong> objectives of increased bimodality andmeasurement precision? Given that <strong>the</strong> lack of bimodalityis hardly a new occurrence, having been present in itsearlier scoring systems as well (e.g., Stricker & Ross,1964), <strong>the</strong>re is little reason <strong>to</strong> believe that simply updating<strong>the</strong> prediction-ratio based preference scoring weights usingnew samples of respondents would lead <strong>to</strong> significantchanges in <strong>the</strong> shapes of <strong>the</strong> preference score distributions.Indeed, it is unlikely that any alternative number-right orweighted number-right scoring technique that takes alinear-model based approach would be any more likelythan <strong>the</strong> existing weighting system <strong>to</strong> produce bimodality orimproved measurement precision. 
For example, Harveyand Murry (1994) examined two alternative scoringmethods (i.e., an unweighted count of <strong>the</strong> number of itemsanswered in <strong>the</strong> keyed direction, and a linear-model basedweighting system using fac<strong>to</strong>r scoring coefficients), findingthat nei<strong>the</strong>r produced any meaningful reductions in <strong>the</strong>center-weightedness of <strong>the</strong> preference distributions.One possibility for improving <strong>the</strong> test-retest typestability that has been suggested involves increasing <strong>the</strong>number of categories in<strong>to</strong> which individuals are classifiedon each preference dimension (Harvey & Murry, 1994).For example, earlier versions of <strong>the</strong> MBTI were scoredusing a 3-category system: <strong>the</strong> two bipolar types (e.g., ‘E’or ‘I’), plus an indeterminate ‘x’ classification forindividuals who scored close <strong>to</strong> <strong>the</strong> type cu<strong>to</strong>ff (e.g., see<strong>Myers</strong> & McCauley, 1985, chapter 9). It seems reasonable<strong>to</strong> hypo<strong>the</strong>size that a sizable percentage of <strong>the</strong> individualswho switch types on repeated administrations of <strong>the</strong> MBTIare those whose preference scores lie close <strong>to</strong> <strong>the</strong> cu<strong>to</strong>ff.For such individuals, a change of only a few preferencescore units could cause <strong>the</strong>m <strong>to</strong> be classified in<strong>to</strong> <strong>the</strong>opposite type on repeated testing. S<strong>to</strong>pping <strong>the</strong> practice offorcing <strong>the</strong>se type-indeterminate individuals in<strong>to</strong> bipolartype categories might produce significant improvements intest-rest stability. 
Of course, even if an ‘indeterminate’ category is added, the performance of such a system would be greatly facilitated if the shapes of the preference score distributions were also made more bimodal, thereby reducing the number of type-indeterminate individuals.

With respect to methods for changing the procedures used to compute MBTI preference scores in order to improve measurement precision and bimodality, in our assessment the strategy that holds the greatest promise is to use item response theory (IRT) techniques (e.g., Lord & Novick, 1968). Although only a few studies using IRT scoring of the MBTI have been conducted (Harvey & Murry, 1994; Harvey, Murry, & Markham, 1994; Thomas & Harvey, 1995), their results have been very encouraging. Specifically, they demonstrated that switching to IRT scoring, without making any substantive changes to the MBTI items themselves, produces (a) strongly bimodal preference distributions in large, unselected samples of respondents; and (b) scales that produce their maximum measurement precision in the vicinity of the type cutoff (e.g., Harvey & Murry, 1994). Related IRT research (Thomas & Harvey, 1995) has revealed that the degree of measurement precision of the MBTI scales can be further improved through the addition of new items.
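The three-category idea discussed earlier in this section (two bipolar types plus an indeterminate ‘x’ band around the cutoff) can be sketched as a simple classification rule. This is only an illustration: the function names, the EI labels, and the band width of 0.25 score units are assumptions chosen for the example and are not taken from any MBTI scoring manual.

```python
def classify(score, cutoff=0.0, band=0.25):
    """Assign 'I', 'E', or the indeterminate 'x' from a continuous preference score.

    Scores above the cutoff are treated as lying toward the 'I' pole (an assumed
    convention); scores within `band` units of the cutoff are left unclassified.
    The default band width is hypothetical.
    """
    if abs(score - cutoff) <= band:
        return "x"
    return "I" if score > cutoff else "E"


def type_changed(first_score, second_score, band=0.25):
    """True if the assigned category differs between two administrations."""
    return classify(first_score, band=band) != classify(second_score, band=band)


# A respondent who drifts from -0.1 to +0.1 switches type when everyone is
# forced into a pole (band=0.0), but not when an indeterminate band is used.
forced = type_changed(-0.1, 0.1, band=0.0)
banded = type_changed(-0.1, 0.1, band=0.25)
```

With these assumed settings, a small shift across the cutoff counts as a type change only under the forced two-category rule, which is the mechanism the text suggests may account for much of the observed test-retest instability.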


IRT Methods in the Context of the MBTI

Before reviewing the results of these studies, we will first provide a brief tutorial on IRT methods, paying specific attention to the ways in which traditional IRT terminology must be translated into the terminology of type theory and the MBTI. Historically, IRT terminology has been deeply rooted in right/wrong, ability-oriented testing methods. Although this ability-oriented terminology is useful in the context of scoring right/wrong, multiple-choice test items, it is somewhat counterproductive when one is attempting to understand how IRT would be used to score personality tests in which (a) “right” or “wrong” answers do not exist, (b) the notion of item “difficulty” has little or no intuitive meaning, and (c) the susceptibility of items to “guessing the correct answer” is not typically a cause for concern.

In this section we briefly describe the fundamentals of IRT methods as they relate to the MBTI; however, a detailed description of IRT is beyond the scope of this article. The reader is referred to one of the standard IRT texts (e.g., Hambleton, Swaminathan, & Rogers, 1991; Hulin, Drasgow, & Parsons, 1983; Lord & Novick, 1968) for a more comprehensive treatment.
Our primary goal is to describe the basics of the IRT approach to measurement and explicate the terminological differences that exist between standard descriptions of IRT methods and their application to the specific case of the MBTI.

IRT Terminology

The latent construct, or θ. In IRT, as in classical test theory (CTT), a primary focus of testing is to derive an estimate of each examinee’s score on the latent construct (or set of four bipolar constructs, in the case of the MBTI) being assessed. In CTT, this quantity is termed the true score; in IRT, it is typically termed the latent trait score (which is abbreviated θ, or theta). In both cases, this score is an unobserved, hypothetical construct (e.g., Intelligence, Extraversion) on which people are assumed to differ, but which cannot be directly quantified. Thus, we are forced to estimate examinees’ scores on the latent construct based on their responses to a set of test items.

The term “latent trait” has a tendency to set off alarms for proponents of type-based theories of personality; indeed, this usage of the term “trait” represents our first encounter with the semantic difficulties that can occur when applying IRT (which is also known as Latent Trait Theory) to the MBTI. It must be stressed that this use of the term “trait” when describing the latent construct being estimated by IRT in no way implies a taking-of-sides in the ongoing “trait vs.
type” debate (e.g., Block & Ozer, 1982; Gangestad & Snyder, 1991; Mendelsohn, Weiss, & Feimer, 1982). That is, although the MBTI is based on the notion of discrete types of personality, the MBTI has always used scores on continuous bipolar scales in order to assess the strength and direction of the preference for EI, SN, TF, and JP (i.e., the prediction-ratio based preference scores; e.g., Myers & McCaulley, 1988, p. 9). By dichotomizing these preference scores, individuals can subsequently be assigned to categorical types.

Throughout our discussion of how IRT methods can be used to score the MBTI, it is critically important to keep in mind that the MBTI preference scores estimated using the traditional prediction-ratio method correspond directly to the θ scores estimated by IRT. Thus, IRT takes precisely the same logical approach that has always been used in the MBTI: that is, describing both the strength and direction of the preference for the EI, SN, TF, and JP dimensions using four bipolar continua. Only the method used to compute these continuous preference scores is different. In effect, whenever the term ‘trait’ or ‘latent trait’ appears in a discussion of IRT methods, one can simply substitute the term ‘preference score’ to understand how IRT would be used to score the MBTI.

Probability of a correct response (PCR).
The other quantity that is of fundamental interest in IRT is the likelihood that a given respondent will make a “correct” response to a given item. In ability-oriented testing, we have a clear understanding of what a correct vs. incorrect item response means, and we can easily compute and interpret the percentages of people who respond correctly to each test item. However, when IRT is applied to the MBTI (or to any other test that does not employ right-versus-wrong scoring), what meaning do we attach to this concept?

As it turns out, the lack of a “correct” response to each item poses absolutely no problem with respect to applying IRT scoring methods to the MBTI. That is, although there are no “right” or “wrong” responses, in the traditional MBTI scoring system each possible item response is keyed toward one or the other of the poles of the item’s assigned preference dimension (e.g., the response “thinking” from the word-pair “thinking vs. feeling” is keyed toward the “T” pole of the TF dimension, and the “feeling” response is keyed toward the “F” pole).
This keying of items with respect to the poles of each preference continuum provides us with the information that is needed to use IRT to score the MBTI.

In essence, IRT methods simply require that each item be scored dichotomously; although it is common to do so, it is not mandatory that this scoring system be couched in terms of a “correct” versus “incorrect” response. For the MBTI, we need only pick one of the two poles of each scale (e.g., for the EI scale, the “I” preference) as the keyed pole; this choice is essentially arbitrary, and for maximum similarity to the traditional prediction-ratio scoring system (e.g., Myers & McCaulley, 1988, p. 9), item responses have been keyed toward the I, N, F, and P poles in MBTI IRT studies (e.g., Harvey & Murry, 1994). Once a keyed pole is chosen, each MBTI item response is dichotomously scored by determining whether or not it is in the keyed direction. Using the above example, if an individual chose the “thinking” alternative from the “thinking vs. feeling” word pair, this response would not be in the keyed (i.e., “F”) direction; therefore, it would be scored as a zero.
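The dichotomous scoring rule just described can be sketched in a few lines. The keyed poles (I, N, F, P) follow the text; the function names and the representation of a response as a single pole letter are assumptions made for this illustration only.

```python
# Keyed pole for each scale, as described in the text (I, N, F, and P).
KEYED_POLE = {"EI": "I", "SN": "N", "TF": "F", "JP": "P"}


def score_response(scale, response_pole, keyed=KEYED_POLE):
    """Return 1 if the chosen alternative points toward the keyed pole, else 0.

    `scale` is a two-letter scale name such as "TF"; `response_pole` is the
    pole that the chosen alternative is keyed toward (hypothetical encoding).
    """
    if response_pole not in scale:
        raise ValueError(f"{response_pole!r} is not a pole of the {scale} scale")
    return 1 if response_pole == keyed[scale] else 0


def reverse_keying(scored_items):
    """Rescore 0/1 items as if the opposite pole had been chosen as keyed."""
    return [1 - s for s in scored_items]


# Choosing "thinking" from "thinking vs. feeling" is a T response on the TF
# scale, so it is scored 0; choosing "feeling" would be scored 1.
thinking_score = score_response("TF", "T")
feeling_score = score_response("TF", "F")
```

Because reversing the keyed pole only swaps the 0 and 1 codes, the same response data are retained either way, which is why the choice of keyed pole can be treated as arbitrary.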


It must be stressed that this choice of a keyed direction for each scale is entirely arbitrary, and that IRT scoring works equally well regardless of which pole is chosen as the keyed response. That is, the choice of the keyed pole simply determines the direction of the scale (i.e., because the type cutoff point is assigned a value of zero, preference scores that lie in the keyed direction receive positive numbers, and preferences toward the non-keyed pole receive negative numbers). Reversing the keyed pole simply reverses the scale of the θ score continuum.

The item characteristic curve (ICC). The foundation of the IRT approach is the ICC; each item on a test will have its own ICC. In essence, the ICC answers the question, “How are individuals’ scores on the latent construct (i.e., preferences) related to their observed probabilities of endorsing this MBTI item in the keyed (i.e., INFP) direction?” The ICC depicts the form of the functional relation that exists between the latent construct and the PCR. In practice, there are many different ways in which this functional relationship between θ scores and PCRs can be modeled.

One of the simplest ways in which preference scores can be related to the observed item endorsement rates is a model in which higher scores on the latent preference construct are linearly associated with higher likelihoods of endorsing the item in the keyed direction.
Hypothetical Item 1 in Figure 1 illustrates an ICC that is primarily linear in nature. In Figure 1, the horizontal axis represents the latent preference score (θ), and the vertical axis represents the likelihood that individuals holding a given preference would endorse this item in the keyed direction (i.e., the PCR). The ICC shows how scores on the latent preference scale correspond to observed item-endorsement rates.

If the ICC for Item 1 in Figure 1 had been obtained for an actual MBTI item (e.g., on the EI scale, one that asked respondents to choose between “good mixer” vs. “quiet and reserved”), and the EI items were keyed toward the Introvert pole, individuals having positive scores on the θ scale would be Introverts, and those having negative scores would be Extraverts (a value of θ = 0.0 serves as the type cutoff score, and the θ metric is scaled in z units). Just as with traditional prediction-ratio based preference scores, scores that are further away from the type cutoff denote stronger preferences toward that pole of the preference continuum. To determine the predicted likelihood that a group of individuals who share a given θ score would endorse a given item in the keyed direction, simply locate the desired θ score on the x-axis, and then draw a vertical line until the ICC is reached.
By projecting a horizontal line leftward to the y-axis from the ICC, the PCR value associated with that θ score can be determined.

For example, in Figure 1 individuals who score 0.0 on θ have no clear preference for either the “E” or “I” poles; we would expect 50% of them to endorse this item in the “I” direction and 50% to endorse this item in the “E” direction (note the vertical line drawn at θ = 0, and the horizontal line drawn at PCR = 0.5). In contrast, when considering a group of individuals who hold a strong preference toward the Introvert pole (e.g., at θ = +2.5), a PCR value of over 0.80 would be predicted; that is, over 80% of these strong Introverts would be expected to endorse the ‘I’ alternative (i.e., “quiet and reserved”), and less than 20% would be expected to endorse the ‘E’ alternative (i.e., “good mixer”). Conversely, among a group of individuals demonstrating a very strong Extravert preference (e.g., θ = -3.0), a PCR of approximately 0.14 would be expected (i.e., only 14% of these strong Extraverts would say they are “quiet and reserved”, whereas 86% would say they are “good mixers”).

In sharp contrast to the linear ICC described above, a step function ICC might exist.
In a step function, a cutoff score on the θ preference scale is effectively present, such that all individuals who score below a given level of θ will fail to endorse the item in the keyed direction, and all individuals who score above this cutoff will endorse it in the keyed direction. Hypothetical Item 2 in Figure 1 depicts an ICC that approximates a step function: here, the cutoff point is at θ = 0.0, and effectively all those who score lower than -0.1 (i.e., the Extraverts) would endorse the non-keyed response (“good mixer”), and all those above 0.1 (i.e., the Introverts) would endorse the keyed response (“quiet and reserved”). Only in the very narrow range of approximately -0.1 to +0.1 around the cutoff point would we observe Extraverts endorsing the “I” alternative and Introverts endorsing the “E” alternative.

Step-function ICCs possess appealing properties in the context of a type-based assessment instrument like the MBTI. That is, if two distinct types of people exist, almost all of the people whose continuous preference scores lie below the cutoff value for Item 2 would be expected to not endorse a response alternative that is keyed toward the opposite pole, whereas almost all of those who score above the cutoff would be expected to endorse the item in the keyed direction.
Indeed, if true step-function ICCs like Item 2’s existed in practice, one could effectively develop a single-item test that would measure each individual’s MBTI preference with great accuracy (i.e., if the step function cutoff point coincided precisely with the “natural” cutoff that exists between the two types).

Item information functions. The reason that step-function ICCs are potentially so desirable is that they convey a great deal of information regarding each individual’s standing on each MBTI preference dimension. However, step functions are limited in the sense that the information they provide is confined to a relatively narrow range of scores (i.e., those who score near the cutoff point that defines the “step”). In the context of IRT, the term “information” is used to describe an item’s ability to discriminate between individuals who hold different scores on the latent preference continuum. That is, if the size of the difference between two individuals’ scores on the latent preference continuum is held constant, increasing the amount of information provided by an item makes it easier to discriminate between those individuals (i.e., with respect


to the likelihood that they would endorse the item in the keyed direction).

IRT methods allow us to quantify the amount of information provided by each item at any given level of the θ scale via the item information function (IIF). Figure 2 presents the IIFs for the two hypothetical items listed in Figure 1. As these IIFs illustrate, the linear ICC seen for Item 1 provides a consistent, but small, amount of information across the entire range of θ scores. In contrast, the step-function ICC seen for Item 2 provides a great deal of information near the cutoff point, but very little information elsewhere. Thus, for individuals who endorse Item 2 in the keyed direction, we can be quite confident that their θ scores lie above the cutoff point; however, we have virtually no ability to determine whether they hold a strong, intermediate, or weak preference toward the “I” pole based on their endorsement of Item 2 in the keyed direction.
That is, in terms of the expected PCR, there is virtually no difference between a strong (e.g., θ = 2.5) and a weak (e.g., θ = 0.5) “I” preference with respect to the responses to Item 2; hence, it provides very little information outside the narrow band surrounding its cutoff point.

Of course, due to the action of measurement error, it is extremely unlikely that in an actual testing situation we would encounter ICCs that break as sharply as the one depicted for hypothetical Item 2. More commonly, ICCs tend to assume an intermediate form between the two extremes depicted in Figure 1, producing variants of an “S”-shaped ICC. Thus, when applying IRT methods, the fundamental question concerns the kind of ICC that one chooses to employ when modeling the relations between the latent construct and the observed item endorsement rates. In particular, the choice between fitting a linear versus a nonlinear model is critical: as can be seen from the ICCs in Figure 1, it would be profoundly misleading to fit a linear ICC to an item that possessed a true ICC like the one depicted for Item 2.
Likewise, it would be highly misleading to force a step-function ICC onto an item that demonstrated an ICC like the one seen for Item 1.

IRT Models for Dichotomously Scored Test Items

IRT models differ primarily in terms of the assumptions they make regarding the ways in which scores on the latent construct (θ) can relate to observed item endorsement rates (PCR). These differences are reflected in the number of parameters that must be estimated in order to “fit” an ICC to each item’s responses.

1-parameter (Rasch) model. One of the simplest answers to the question of how the latent construct is related to the endorsement rates for each item is given by the 1-parameter, or Rasch, model (e.g., Rasch, 1960). Not surprisingly, in the 1-parameter model there is only one characteristic of each item that sets its ICC apart from the ICCs of the other items on the test. Using traditional IRT terminology, this parameter is the difficulty of the item.

Unfortunately, the difficulty parameter represents yet another example of the way in which traditional IRT terminology is awkward when applied to instruments that do not use right/wrong scoring.
That is, in a traditional right/wrong test, we define a “difficult” item as being one that few respondents are able to answer correctly (i.e., one with a low p value); conversely, an “easy” item is defined as one that most respondents (even those who score very low on the construct being measured) are able to answer correctly. However, with the MBTI we are concerned with the question of how likely it would be for a person to make an item response in the keyed direction (i.e., I, N, F, or P), not whether such a response is “right” or “wrong.”

In the present case, the difficulty of an item (denoted b) refers to the degree to which raters will tend to endorse the item in the keyed direction. Thus, items having numerically high b parameters will be the ones that only people who score high in the keyed preference direction will tend to endorse. In contrast, items having low b parameters will tend to be endorsed in the keyed direction even by individuals whose preferences lie strongly toward the non-keyed pole of the preference dimension. The scale of the b parameter is the same as the scale of θ (i.e., standard, or z, units).

An example should help to illustrate the way in which the b parameter can be used to differentiate between test items.
Figure 3 presents the ICCs for three actual MBTI items drawn from the EI scale; these ICCs were computed by fitting the 1-parameter IRT model in a sample of 2,499 MBTI profiles (the sample used to compute this and subsequent figures was formed by sampling subjects from the databases used in the Harvey & Murry, 1994, and Harvey et al., 1995, studies, and then adding approximately 600 new raters, primarily college students, who were not used in those studies). Because the EI responses were keyed toward the “I” pole, individuals having Extravert preferences exhibit negative θ scores, and those having Introvert preferences exhibit positive θ scores. For reference, a horizontal line has been drawn at the 50% point of likelihood of item endorsement, and a vertical line at the type cutoff point (i.e., θ = 0.0).

The ICCs in Figure 3 depict the percentages of individuals who share a given θ score that would be expected to endorse each item in the “I” direction. By comparing the levels of θ at which 50% of raters would be expected to endorse an item in the “I” direction, one can see the way in which the b parameter differentiates among test items. That is, Item 129 has the lowest b parameter; we would expect 50% of individuals who share the moderately strong “E” preference of -0.9 to endorse the “I” alternative for this item (i.e., “not interested in following the latest fashion”).
In contrast, Item 33 has the highest b value; for it, the point at which 50% endorse the “I” response (“hard to get to know”) does not occur until a moderately strong “I” preference of 0.9 is achieved.

Thus, for any given level of θ (i.e., true preference on the EI dimension), we would expect to see the highest rates of “I” endorsement occurring for Item 129, followed by Item 50, with the lowest rates of “I” endorsement occurring for Item 33. For example, consider a group of moderately


strong Introverts (i.e., θ = 0.9, which represents a score of almost one standard deviation above the mean EI preference score). Among this group of Introverts, we would expect 50% of them to describe themselves as “hard to get to know” (Item 33), 64% as “quiet and reserved” (Item 50), and 86% as “not interested in following the latest fashion” (Item 129). Conversely, for a group of θ = -0.9 Extraverts, we would expect to find that only about 12% describe themselves as “hard to get to know,” 20% as “quiet and reserved,” and 50% as “not interested in following the latest fashion.”

In general, regardless of the specific IRT model that is chosen, the substantive interpretation of the ICC will always be the same: that is, by drawing a line projecting vertically from a given θ score to the ICC, and then projecting a line horizontally to the PCR axis, one can determine the percentage of people sharing that true level of the preference who would be expected to endorse the item in the keyed direction.

How, then, should the IRT b parameter be interpreted in the context of the MBTI? As the results in Figure 3 illustrate, in the 1-parameter IRT model the only thing that differentiates one test item from another is the horizontal (left-right) location of the ICC on the latent preference scale.
As a practical matter, the numerical value of the b parameter is defined directly in terms of the ICC: that is, b is equal to the value of θ that corresponds to a 50% likelihood of endorsing the item in the keyed direction. Thus, for the items presented in Figure 3, the b values are approximately -0.9, 0.35, and 0.9 for Items 129, 50, and 33, respectively.

The b parameter is useful for determining the point on the preference continuum (θ) at which the item will be maximally informative. As a general rule, an item will provide the most information regarding an individual’s θ score at the value of the b parameter (which, not surprisingly, coincides with the point at which the ICC demonstrates its steepest slope). In this context, item information is synonymous with discriminating power (i.e., the ability to differentiate between individuals in terms of their standing on the θ scale of preference). That is, a difference of a given size (e.g., 0.5 θ units) between two individuals with respect to the strength of their preference will translate into a larger expected difference in PCRs as the slope of the ICC increases.

For example, consider Item 129 in Figure 3 (i.e., the leftmost ICC). At its most informative point, a change of one-half standard deviation (SD) in θ between two groups of Extraverts (i.e., -1.2 vs.
-0.7) translates into a change of approximately 14% (i.e., 42% to 56%) in the likelihood of endorsing Item 129 in the ‘I’ direction. In contrast, the same magnitude of θ preference difference between two groups of individuals who score very strongly in the Introvert direction (e.g., 2.5 vs. 3.0) produces virtually no change in the PCRs (i.e., 97-98% “I” endorsement rates would be expected in both groups). Thus, Item 129 is much more informative or discriminating among moderate Extraverts than it is among individuals with strong Introvert preferences (nearly all of whom would endorse the item in the ‘I’ direction).

With respect to the implications of using IRT methods to score the MBTI, the b parameter provides very useful information on each item. In the MBTI, because many users are more interested in the categorical type scores than in the continuous preference scores, we need to set a cutoff score on the preference continuum to assign respondents into the type categories. Consequently, we would tend to prefer items that have b values that lie close to the θ = 0.0 point that divides each continuum into categorical types.
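The Item 129 comparison and the preference for items located near the cutoff can both be checked with a short sketch. It assumes the same unit-slope logistic 1-parameter curve as an illustration (the scaling is an assumption, and the names are hypothetical); under that curve the item information at θ is P(θ)[1 - P(θ)], which peaks at θ = b.

```python
import math

def icc_1pl(theta, b):
    """Unit-slope logistic 1-parameter ICC (assumed scaling)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information_1pl(theta, b):
    """Item information under the unit-slope 1PL: P * (1 - P)."""
    p = icc_1pl(theta, b)
    return p * (1.0 - p)

B_ITEM_129 = -0.9  # approximate value reported for Figure 3

# Half-SD difference near the item's location versus far above it.
change_near = icc_1pl(-0.7, B_ITEM_129) - icc_1pl(-1.2, B_ITEM_129)
change_far = icc_1pl(3.0, B_ITEM_129) - icc_1pl(2.5, B_ITEM_129)
print(f"change near b: {change_near:.3f}, change far from b: {change_far:.3f}")

# Information each Figure 3 item provides at the type cutoff (theta = 0).
B_VALUES = {"Item 129": -0.9, "Item 50": 0.35, "Item 33": 0.9}
info_at_cutoff = {item: item_information_1pl(0.0, b) for item, b in B_VALUES.items()}
print({item: round(info, 3) for item, info in info_at_cutoff.items()})
```

In this sketch the same half-SD difference moves the expected endorsement rate by roughly twelve percentage points near the item's location and by about one point at the far Introvert end, and Item 50 carries the most information at the cutoff because its b value lies closest to zero.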
Thus, considering the items presented in Figure 3, Item 50 would be much more useful than Item 129 with respect to locating individuals on one side or the other of the EI cutoff score.

Conceptually, then, the IRT approach is not especially complicated. The main problem from a practical point of view lies in estimating the unknown b parameters for the MBTI items, and in estimating the scores on the latent preference construct (θ) for each person, given their responses to the test items and our knowledge of the item parameters. The main difference between the IRT approach and older CTT-based approaches to measurement is that we explicitly assume that the relation between the latent construct score and the observed item response may be nonlinear in nature.

2-parameter model. Unfortunately, the 1-parameter IRT model suffers from significant limitations, perhaps the most important being that it assumes that all items on the test are equally discriminating or informative. For many psychological tests (especially personality tests), this is probably an unrealistic assumption. That is, some test items are likely to be stronger indicators of an individual’s underlying preferences than other test items (a fact that is acknowledged by the existing MBTI scoring system, which differentially weights items when computing preference scores).
In response to the need to allow test items to be differentially discriminating at their points of maximum discrimination, the 2-parameter IRT model was developed.

In essence, the 2-parameter IRT model is a superset of the 1-parameter model; in addition to the b (“location of maximum information”) parameter, a second parameter (abbreviated a, or the discrimination parameter) was added to allow for the fact that different test items will be differentially informative or discriminating regarding the latent construct. In practical terms, the a parameter defines the slope of the ICC at its inflection point (which, in the 1- and 2-parameter IRT models, occurs at b units on the θ scale).

Using the 2-parameter model, Figure 4 depicts ICCs for three hypothetical items that have identical b parameters (in this case, b = 0.0), but which differ in terms of their a parameters (a = 0.35, 1.0, and 2.1 for Items 1-3, respectively). A comparison of the ICCs for these three items graphically illustrates the difference between the 1- and 2-parameter models, and highlights the importance of modeling both the point of maximum information as well as the amount of discrimination that occurs at the point of maximum information. Specifically, Figure 4 illustrates the way in which sharper ICC slopes enhance our ability to discriminate between individuals who differ in their θ scores. That is, consider two groups of MBTI respondents: Group 1 consists of individuals who have a true EI preference of θ = -0.2 (i.e., a very slight preference toward “E”); Group 2 consists of individuals having a preference of θ = +0.2 (i.e., a slight “I” preference; vertical lines are drawn in Figure 4 at these locations). The horizontal lines drawn in Figure 4 depict the predicted item endorsement rates for Items 1 vs. 3 at these two θ levels. A comparison of the dotted (Item 3) and solid (Item 1) horizontal lines immediately indicates why higher a parameters are more desirable: for Item 1, a difference of only approximately 6% exists between the expected endorsement rates for Groups 1 versus 2; in contrast, a difference of over 36% exists for Item 3. Clearly, responses to Item 3 are much more sensitive to the relatively slight differences in θ scores that exist between Groups 1 and 2.

The implications for using the a parameters to assess the performance of items in the MBTI are not quite as straightforward as for the b parameters. On the one hand, one could argue that “more information is always better,” and that we should prefer items that produce larger amounts of information (i.e., sharper ICC slopes).
However, especially in the case of an instrument like the MBTI that uses a cutoff score to dichotomize its continuous preference scores in order to assign categorical type values, the amount of information provided by each item must be balanced against the location on the θ scale at which the item produces its information. Thus, we might very well prefer a moderately discriminating item to a highly discriminating item if the b parameter of the moderately discriminating item was located close to the type cutoff score, and the b for the highly discriminating item was located 2 SD units away from the type cutoff (i.e., causing it to produce relatively little information at the cutoff).

3-parameter model. Although the 2-parameter model’s ability to account for differentially discriminating items offers a valuable improvement over the 1-parameter model, the 2-parameter model can be criticized on the grounds that it assumes that all test items will have zero lower asymptotes for their ICCs (i.e., for individuals with very low scores on the θ scale, the ICCs will flatten out at a value that approaches zero).
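The behavior of the 2-parameter model discussed so far (the Figure 4 endorsement gaps, the trade-off between discrimination and location, and the zero lower asymptote) can be illustrated with a small sketch. This is an illustration under stated assumptions rather than the authors' computation: the logistic form with scaling constant D = 1.7 is an assumption (the text does not state the scaling used for Figure 4), and the two items in the trade-off comparison are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the text

def icc_2pl(theta, a, b):
    """2-parameter logistic ICC with discrimination a and location b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def information_2pl(theta, a, b):
    """2PL item information: (D*a)^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)

# Figure 4: three items with b = 0.0 and a = 0.35, 1.0, 2.1.
FIGURE_4_A = {"Item 1": 0.35, "Item 2": 1.0, "Item 3": 2.1}

# Gap in expected endorsement between theta = +0.2 and theta = -0.2.
gaps = {item: icc_2pl(0.2, a, 0.0) - icc_2pl(-0.2, a, 0.0)
        for item, a in FIGURE_4_A.items()}
print({item: round(gap, 3) for item, gap in gaps.items()})

# Trade-off at the cutoff (theta = 0): a moderate item located at the
# cutoff versus a sharper item located 2 SD away (hypothetical items).
info_moderate_near = information_2pl(0.0, a=1.0, b=0.0)
info_sharp_far = information_2pl(0.0, a=2.1, b=2.0)
print(round(info_moderate_near, 4), round(info_sharp_far, 4))

# Lower tails: the 2PL curve always heads toward zero as theta decreases.
print(round(icc_2pl(-3.0, 1.0, 0.0), 4), round(icc_2pl(-1.5, 2.1, 0.0), 4))
```

With this assumed scaling the gaps come out near 6% for Item 1 and near 34% for Item 3, in line with the approximate values read from Figure 4, and the sharper but distant item contributes far less information at the cutoff than the moderate item located there.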
Although many test items will indeed reach an effectively zero lower asymptote within the normal range of scores (e.g., Items 2 and 3 in Figure 4 do so at -3 and -1.5 z, respectively), some will not.

In the context of right/wrong tests that are subject to attempts to guess the correct answer, it is common to observe nonzero lower asymptotes for the ICCs due to the willingness of respondents to guess when they do not know the correct answer (e.g., for a 4-alternative multiple choice math question, random guessing would be expected to produce a 25% success rate). In the context of instruments that do not use right/wrong scoring (e.g., the MBTI), nonzero lower asymptotes can also occur, although for reasons other than guessing.

In short, nonzero lower asymptotes for items on a personality inventory may reflect the fact that the items are sufficiently skewed in terms of their endorsement properties that even individuals who score very low on the θ scale (i.e., their preferences lie strongly toward the non-keyed alternative) will still endorse the item in the keyed direction at nontrivial rates. The 3-parameter IRT model allows for this possibility by adding a third parameter for each item (abbreviated c) which defines the PCR that would be expected for people who score strongly toward the non-keyed preference pole (i.e., the effective lower asymptote of the ICC).
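The role of the c parameter can be seen in a minimal sketch of the 3-parameter logistic curve. The scaling constant D = 1.7 and the item parameters below are assumptions chosen only for illustration; they are not estimates for any MBTI item.

```python
import math

def icc_3pl(theta, a, b, c, d=1.7):
    """3-parameter logistic ICC: lower asymptote c, discrimination a,
    location b. The scaling constant d = 1.7 is an assumption."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))

# Hypothetical item: moderate discrimination, located at the cutoff,
# with a lower asymptote of 25%.
A, B, C = 1.0, 0.0, 0.25

for theta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"theta={theta:+.1f}  expected endorsement={icc_3pl(theta, A, B, C):.3f}")
```

Two properties follow directly from the formula: the curve flattens out at c rather than at zero for very low θ, and at θ = b the expected endorsement rate is c + (1 - c)/2 rather than 50%. Setting c = 0 recovers the 2-parameter curve.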
Although we would not expect there to be many items in the MBTI for which large nonzero c parameters would occur, it is possible that some items would require a nonzero value for the c parameter.

Figure 5 presents the ICCs produced by fitting the 3-parameter IRT model to the three EI items depicted in Figure 3. As a comparison of Figures 3 vs. 5 makes readily apparent, a very different picture of item functioning is produced as a result of choosing a 1- vs. 3-parameter IRT model. In particular, Items 50 and 33 demonstrate a visibly sharper ICC slope than was produced in the 1-parameter model, whereas Item 129 demonstrates a significantly flatter slope than was seen in Figure 3. Figure 6 presents the item information functions (IIFs) for these three items; inspection of these IIFs shows that Item 50 produces substantially more information than Item 33, and that both produce far more information than Item 129 (which produces very little information at any value of θ).
Item 50 is made even more desirable by the fact that the peak of its information function lies closest to the type cutoff score (i.e., θ = 0), which should make it the most useful of these three items with respect to distinguishing between individuals who score close to the type cutoff.

The results presented in Figure 5 also indicate that it is quite possible to find MBTI items that even raters who score very strongly toward the non-keyed end of the preference scale will endorse in the keyed direction at nontrivial rates. For example, the ICC for Item 129 shows that many extremely strong Extraverts endorse this item in the Introvert direction (e.g., at θ = -3.0, approximately 30% of these Extraverts endorse the “I” alternative, “not interested in following new fashions,” instead of the “E” response, “one of the first to follow a new fashion”). This ability to capture different kinds of item response patterns is a major advantage of the 3-parameter IRT model.

Test-level information and SE functions. An important advantage of IRT as a test development and scoring method is that it allows us to obtain a detailed look at the aggregate performance of collections of test items. In particular, we can calculate both test information functions (TIFs) and test standard error (SE) functions to assess the performance of an item pool.
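As a rough sketch of how these quantities relate, the test information at a given θ is the sum of the item information values at that θ, and the standard error is the reciprocal of the square root of the test information. The code below uses the standard 3-parameter item information formula with an assumed scaling constant D = 1.7; the item parameters are hypothetical and are not the MBTI estimates, and the function names are illustrative.

```python
import math

D = 1.7  # assumed logistic scaling constant

def icc_3pl(theta, a, b, c):
    """3-parameter logistic ICC."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information: (D*a)^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = icc_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def scale_information(theta, items):
    """Test information function: sum of item information at theta."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def standard_error(theta, items):
    """Test-level SE at theta: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(scale_information(theta, items))

# Hypothetical (a, b, c) triples for a 3-item scale and a longer scale.
SHORT_SCALE = [(1.2, -0.5, 0.05), (1.0, 0.0, 0.0), (1.5, 0.4, 0.10)]
LONG_SCALE = SHORT_SCALE * 4

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(standard_error(theta, SHORT_SCALE), 3),
          round(standard_error(theta, LONG_SCALE), 3))
```

In this toy pool the SE is smallest near the middle of the θ range, where the items are located, and grows toward the extremes. Quadrupling the number of items quadruples the information and therefore halves the SE at every θ.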
TIFs indicate the amount of information or measurement precision that is provided by a test at all possible levels of θ, whereas test-level SE functions indicate the degree of precision to be expected when estimating test scores for examinees at different levels of θ. Thus, the test SE functions represent a continuously variable analog to the global SEM estimate produced by CTT, indicating the degree of error that would be expected when estimating the “true” latent preference scores based on the observed patterns of item responses. Likewise, the test information functions represent a continuously variable analog to the unitary reliability coefficient estimated by CTT: that is, higher values reflect higher measurement precision and freedom from error, and lower values represent less measurement precision and increased uncertainty with respect to estimating scores on the latent construct.

Both of these functions represent tremendous improvements over the simplistic views of reliability and measurement error that are inherent in traditional CTT-based methods. That is, in classical approaches to testing, a test’s reliability is estimated as a single number that is presumed to be constant across the entire possible range of test scores. Likewise, a test’s standard error of measurement (SEM) is presumed to be constant across all possible test values.
Both of these assumptions are tenuous; indeed, it is reasonable to expect that most tests will tend to be more precise for respondents who have “average” scores on the latent construct, and less precise for those individuals who hold extreme scores (i.e., tests targeted at an “average” population typically lack items that provide significant levels of information for individuals who score at the extremes of the distribution).

Figure 7 presents the TIFs for a scale composed of the three EI items contained in Figures 3 and 5, as well as for the full EI scale; Figure 8 presents the corresponding SE functions for the 3-item and full-length EI scales. As Figures 7-8 illustrate, significant improvements in test precision (i.e., higher TIFs, lower SEs) are achieved in the full-length EI scale relative to a 3-item scale. Additionally, both the TIFs and SEs show that measurement precision is not constant across the full range of θ-based preference scores, being significantly better in the middle range of θ scores (peaking at approximately θ = 0.25), and somewhat more precise for the Introvert half of the scale than for the Extravert half (see Figure 8).

These results clearly undermine the CTT assumption that reliability and SEM remain constant across the full range of MBTI preference scores.
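To compare an IRT standard error function against a single CTT reliability, the reliability first has to be converted to an SEM on the same scale. For a z-scaled variable (SD = 1) the classical relation is SEM = SD × sqrt(1 - reliability), and an SE translates to an information level as 1/SE². The short sketch below applies these relations to the .75 and .85 reliabilities used as reference values for the MBTI scales; the function names are illustrative.

```python
import math

def ctt_sem(reliability, sd=1.0):
    """Classical standard error of measurement: sd * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - reliability)

def information_for_se(se):
    """Information level that corresponds to a given standard error."""
    return 1.0 / se ** 2

for r_xx in (0.75, 0.85):
    sem = ctt_sem(r_xx)
    print(f"r_xx={r_xx:.2f}  SEM={sem:.2f}  information={information_for_se(sem):.2f}")
```

These conversions give SEM values of .50 and roughly .39, which correspond to information levels of 4.0 and about 6.7 on a z-scaled θ metric.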
Based on past studies that have estimated the CTT reliability of the MBTI scales to lie in the .75-.85 range (e.g., Harvey & Murry, 1994; Myers & McCaulley, 1985), two horizontal lines have been drawn in Figures 7-8 at the levels of information/SE that correspond to r_xx = .75 (which produces SEM = .50 for z-scaled variables like θ) and r_xx = .85 (SEM = .39). A comparison of the TIFs and SEs for the full EI scale against these CTT reference lines indicates that the θ scores estimated by IRT would be expected to significantly exceed the levels of measurement precision implied by the unitary CTT estimates in the middle range of θ-based preference scores (i.e., from approximately -0.5 to +1.0 for the .39 SEM, and -1.0 to 1.5 for the .50 SEM), and to fall short of the levels of precision implied by the CTT results outside these ranges.

It is important to stress that these findings do not imply that IRT-based scoring is less precise than CTT-based number-right scoring for preferences that lie outside the above intervals.
On the contrary, they indicate that the levels of measurement precision implied by CTT’s unitary r_xx and SEM statistics are likely to underestimate the effective level of precision for preference scores that fall within approximately .5 to 1 SD of the type cutoff score, and to increasingly overestimate the precision of measurement for preference scores that lie strongly toward either pole of the preference scale.

Is IRT Appropriate for the MBTI?

By this point, the reader might well feel that he or she has seen at least one ICC too many, and perhaps be wondering whether it is really necessary to go to the trouble required to fit these nonlinear ICCs to the MBTI responses. Without a doubt, the IRT approach is somewhat more complex than the prediction-ratio technique that has traditionally been used to score the MBTI. In short, one might question whether or not the increased complexity inherent in IRT is worth the trouble, and whether any evidence exists to indicate that the IRT model actually provides a good “fit” to the MBTI item response patterns.

Fortunately, a very direct method exists for assessing the “fit” of the IRT model; it involves an examination of empirically derived ICCs.
Empirical ICCs are essentially scatterplots, defined as follows: the vertical axis of the plot represents the observed rate of item endorsement (PCR), the horizontal axis represents discrete levels of the latent preference score, and the points in the plot represent the percentage of respondents at each level of the latent preference score that endorse the item in the keyed direction. By visually examining this scatterplot of mean item endorsement rates, we can get an idea of the “true” nature of the relationship between the latent preference dimension and the observed likelihood of item endorsement in the keyed direction for the various levels of the latent construct.

Empirically derived ICCs provide an ideal vehicle for assessing the fit of the IRT model by virtue of the fact that they do not “force” any particular model (e.g., the 3-parameter IRT model) onto the data. That is, the ICCs presented in Figures 3 and 5 are the ones that were produced by fitting the 1- and 3-parameter IRT models to the MBTI item responses; although they look impressive, they essentially have to follow the IRT model, and there is no guarantee that they will actually provide a good fit to the data. In contrast, the empirically derived ICCs are free to adopt any shape that is appropriate for the data.
Thus, to the extent that the ICCs produced by the IRT models match the shape of the empirical ICCs, we would conclude that the IRT model provides a good degree of fit to the MBTI data.
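The definition of an empirical ICC given above can be sketched in a few lines: respondents are grouped into discrete levels of the preference score, and the percentage endorsing the item in the keyed direction is computed within each level. The interval width, the function name, and the toy data below are hypothetical and serve only to show the mechanics; no model is fitted.

```python
import math
from collections import defaultdict

def empirical_icc(scores, responses, bin_width=10):
    """Group respondents into discrete score intervals and return, for each
    interval, (lower bound, percent endorsing in keyed direction, n).

    scores    : preference scores with the type cutoff at zero
    responses : 1 if the item was endorsed in the keyed direction, else 0
    """
    tallies = defaultdict(lambda: [0, 0])  # level -> [keyed count, n]
    for score, response in zip(scores, responses):
        level = math.floor(score / bin_width) * bin_width
        tallies[level][0] += response
        tallies[level][1] += 1
    return [(level, 100.0 * keyed / n, n)
            for level, (keyed, n) in sorted(tallies.items())]

# Made-up data for eight respondents (illustration only).
toy_scores = [-25, -21, -5, -3, 4, 6, 22, 27]
toy_responses = [0, 0, 0, 1, 1, 0, 1, 1]

for level, percent, n in empirical_icc(toy_scores, toy_responses):
    print(level, percent, n)
```

Plotting the percentages against the interval levels gives the scatterplot described in the text. The subgroup sizes are returned as well because points based on very few respondents are less stable.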


As a practical matter, the main difficulty that arises when computing empirical ICCs is in finding a satisfactory method for estimating the latent construct scores. Because we don’t know the “true” preference scores for each examinee, and we can’t use the θ scores that are estimated using IRT (i.e., to avoid creating a logical circularity), it is customary to use the total score on the scale as the best available estimate of the true score. In the present case, the scores computed using the prediction-ratio (PR) preference scoring weights for Form F were used as the estimate of each person’s true score on the latent construct (virtually identical results were also obtained when we used the simple unweighted percentage of items that were answered in the keyed direction as the estimate of the latent construct).

Computationally, the empirical ICCs (see Figures 9-12 for the EI items used in the previous examples, and Figures 13-15 for the top items from the SN, TF, and JP scales) were produced as follows: (a) each person’s net preference score was calculated using the Form F scoring key and expressed on a scale that placed the type cutoff at zero (i.e., preferences toward the keyed pole received positive values, and those toward the non-keyed pole received negative scores); (b) subgroups of raters were formed by breaking the sample into discrete intervals based on
their PR preference score (e.g., in Figure 9, all raters scoring 53 toward the “E” pole); (c) for each subgroup, we calculated the percentage of raters in that subgroup that endorsed the item in the keyed direction (e.g., Figure 9 shows that for Item 50, 0% of the raters in the subgroup scoring 53 toward “E” endorsed the item in the “I” direction); finally, (d) for each subgroup, we plotted the percentage of raters that endorsed the item in the keyed direction against the subgroup’s PR-based preference score (smoothed spline interpolations were fitted through this scatterplot in an attempt to capture the “true” ICC for each item).

It is important to emphasize again that unlike the ICCs presented in Figures 3 and 5 (which were estimated using IRT methods and which therefore must follow the form dictated by the 1- or 3-parameter IRT model), the empirically derived ICCs presented in Figures 9-15 are completely unconstrained by the IRT model. Accordingly, they can take on any form that is appropriate in order to depict the functional relationship (if any) that exists between each item response and the traditional PR-based preference scores.
Thus, to the degree that we see agreement between the empirically derived ICCs and the ICCs that were generated from the IRT parameter estimates, we will interpret such agreement as validation of the appropriateness of the IRT approach.

As the results in Figures 9-11 illustrate, although the unconstrained empirical ICCs provide a very poor match to the ICCs that were produced using the 1-parameter IRT model (Figure 3), they provide a very good match to the ICCs produced by the 3-parameter model (Figure 5). For example, the empirical ICC for Item 50 demonstrates a very nonlinear, highly discriminating shape (Figure 9); this curve closely matches the ICC estimated by the 3-parameter IRT model (Figure 5) in terms of both its shape and its relative location on the θ axis. Likewise, the empirical ICCs in Figures 10 and 11 for Items 33 and 129 agree quite closely with the 3-parameter model ICCs (Figure 5).

In all cases, there is remarkably little “scatter” around the line that we fit to each scatterplot, a fact that further supports the validity and advisability of using the 3-parameter IRT model to score the MBTI. When one considers the fact that some of these subgroup percentage-endorsement statistics (i.e., the squares in Figures 9-11) are based on quite small Ns, the correspondence between the empirically vs. IRT-derived ICCs becomes even more impressive.
To facilitate the comparison of these ICCs, the empirically derived ICCs for EI items 33, 50, and 129 are presented superimposed upon one another in Figure 12. As a comparison of Figures 5 vs. 12 indicates, there is a great deal of similarity between the empirically vs. IRT-derived ICCs; this similarity is even more notable when one considers the profound differences that exist between the methods that were used to compute the scores that define the horizontal axes in Figure 5 (i.e., maximum likelihood-based estimation of θ using the parameters estimated for the 3-parameter IRT model) vs. Figure 12 (i.e., prediction-ratio based preference scores based on the Form F scoring system).

As a further indicator of the generalizability of the above findings, empirically derived ICCs for high-performance items drawn from the SN, TF, and JP scales (i.e., identified using the Harvey & Murry, 1994, IRT parameters) are presented in Figures 13-15. Inspection of these ICCs again reveals the existence of markedly nonlinear functional relationships between preference scores and the likelihood of endorsing MBTI items in the keyed direction. Clearly, an S-shaped ICC is the most appropriate representation for these MBTI items.
As with the EI items, the results in Figures 13-15 indicate that although some items demonstrate their highest discriminating power (i.e., ICC slope) at the type cutoff point (Figure 13), others produce their maximum discriminating power at points below (e.g., Figure 14) and above (e.g., Figure 15) the type cutoff point. The fact that different items tend to produce their maximum discrimination at different points along the preference score continuum is easily modeled using IRT methods (i.e., by assigning different b parameters to the items).

To provide something of a baseline against which to judge the results in Figures 13-15, Figures 16-17 depict empirical ICCs computed by plotting item-endorsement rates against preference scores for dimensions other than the predicted one for the item in question. The ICC shown in Figure 16 is typical of such ICCs; this scatterplot shows that there is virtually no association between scores on the EI preference scale and subgroup item-endorsement percentages on Item 85 (a JP item). Note that there is an appreciably higher level of “scatter” around the line of best fit in this plot, as compared to the empirical ICCs computed for items on their predicted preference dimensions (Figures 9-15), indicating that (as expected) JP item endorsement rates are not consistently predictive of EI preferences.

There are exceptions to the pattern of non-association depicted in Figure 16, however, and most involve comparisons between the SN and JP dimensions. For example, Figure 17 presents a scatterplot of PCR values for Item 85 (which, as Figure 15 illustrates, is a highly discriminating item with respect to the JP dimension) against the PR-based preference scores for the SN dimension. As the empirically derived ICC in Figure 17 illustrates, there is a relatively strong (and linear) association between these two axes, such that higher scores on the “N” preference are associated with higher likelihood of endorsing Item 85 in the “P” (i.e., “unplanned” over “scheduled”) direction. This finding is consistent with the oft-reported positive correlation between the SN and JP preference scores (e.g., Harvey & Murry, 1994), and does not necessarily represent cause for concern. Indeed, in cases in which MBTI items are found to have consistent functional relationships with multiple latent preference scales, the possibility of using multidimensional IRT models that are capable of making use of the “collateral information” contained in such items becomes worthy of further study.

Figure 18 presents an empirical ICC in which item endorsement rates for EI Item 116 are plotted against the PR-based EI preferences.
As in the earlier empirical ICCs, the results in Figure 18 demonstrate a strong level of fit between the actual MBTI item response patterns and the 3-parameter IRT model. However, the most notable aspect of Item 116’s empirical ICC is that although this item demonstrates strong discriminating power with respect to the EI preference, this discrimination occurs relatively far from the EI type cutoff point (i.e., approximately 41 PR preference units toward the “I” pole). That is, Introverts must possess quite a strong preference toward the “I” pole before they begin to choose the “detached” alternative over the “sociable” alternative in significant numbers.

In view of the fact that Item 116 provides relatively little discriminating power at the type cutoff point, it is not surprising to find that the traditional PR-based scoring system does not view it as being an especially useful one with respect to assessing the EI preference. However, as the empirical ICC in Figure 18 clearly indicates, this item is very useful in discriminating between individuals exhibiting moderate vs. strong preferences toward the “I” pole of the EI scale.
This ability to assess the discriminating power of each MBTI item across the full range of preference scores represents yet another point of superiority of the IRT approach over the traditional PR-based scoring system, which is primarily sensitive only to an item’s discriminating power in the vicinity of the type cutoff score.

In sum, using only the observed MBTI endorsement rates and the preference scores produced by the traditional PR-based scoring system, the above findings demonstrate that (a) the relationship between MBTI preferences and observed item endorsement rates is decidedly nonlinear for many items; (b) MBTI items differ widely with respect to the amount of information and discrimination they provide; and (c) the location on the preference scale at which each item provides its maximum information varies considerably for different MBTI items. These findings strongly support the appropriateness and potential usefulness of the 3-parameter IRT model as a vehicle for capturing the complex dynamics involved in responding to the MBTI’s items. In addition, these results argue strongly against the notion that simpler models (e.g., the 1-parameter IRT model, or systems based on a weighted or unweighted linear model) can provide an adequate representation of the complexity of these item responses.
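The three findings summarized above correspond to the parameters of the 3-parameter logistic model. The following is a minimal sketch, not the estimation code used in the studies discussed here. It assumes the common logistic form without the 1.7 scaling constant, and the item labels and (a, b, c) values are hypothetical, chosen only to mimic items that discriminate at, below, and above the type cutoff.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a keyed response at preference level theta.

    a = discrimination (slope), b = location, c = lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the 3PL model (logistic metric)."""
    p = icc_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical (a, b, c) values; these are NOT estimates from MBTI data.
ITEMS = {
    "at_cutoff": (1.8, 0.0, 0.0),
    "below_cutoff": (1.8, -1.0, 0.0),
    "above_cutoff": (1.8, 1.0, 0.0),
    "weak_item": (0.5, 0.0, 0.0),
}

GRID = [x / 10.0 for x in range(-30, 31)]  # theta from -3.0 to 3.0

def peak_location(a, b, c):
    """Grid point at which the item is most informative."""
    return max(GRID, key=lambda t: item_information(t, a, b, c))

if __name__ == "__main__":
    for name, (a, b, c) in ITEMS.items():
        peak = peak_location(a, b, c)
        print(name, peak, round(item_information(peak, a, b, c), 3))
```

With c set to 0, each item's information peaks exactly at its b value and the peak height grows with the square of a; with c greater than 0 the peak moves slightly above b. A model with a single shared slope cannot reproduce the difference between the strongly and weakly discriminating items in this sketch.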
In short, these empirical ICCs indicate that the 3-parameter IRT model provides a very good degree of fit to the MBTI item responses. We turn finally to a review of findings from studies that have attempted to apply the IRT approach to scoring the MBTI.

IRT Research on the MBTI

Empirical studies evaluating IRT-based approaches to scoring the MBTI have only recently begun to appear. However, the results of these initial studies have been very encouraging, especially regarding the ability of IRT scoring to address two of the most-criticized aspects of the MBTI: namely, preference score bimodality, and the degree of measurement precision that exists in the vicinity of the type cutoff scores. Additionally, IRT-based methods of estimating MBTI preference scores offer advantages in other areas, in particular, quantifying the quality or internal consistency of an individual’s profile of MBTI item responses (e.g., to detect potentially invalid profiles).

Bimodal Distributions

As we noted in our review of criticisms that have been raised regarding the MBTI, many authors have attacked it on the grounds that its preference score distributions are not bimodal (e.g., Pittenger, 1993; Stricker & Ross, 1964). Indeed, as the results presented in Harvey and Murry (1994) illustrated, PR-based preference score distributions are highly center-weighted and platykurtic.
This lack of bimodality has at least two important implications: (a) it provides ammunition to those who attempt to challenge the validity of Myers’ type-based personality theory (i.e., if there are basically two distinct “types” of people on each of the MBTI dimensions, it would not be unreasonable to expect to find a somewhat bimodal shape in the preference score distributions); and (b) it exacerbates the already difficult process of accurately assigning individuals to discrete type categories (i.e., whenever a cutoff score is used, we would strongly prefer to minimize the number of individuals who score near the cutoff; unfortunately, the PR-based preference score distributions locate a sizable number of individuals near the cutoff point).


Fortunately, the results of the Harvey and Murry (1994) study, which was the first to derive and evaluate an IRT-based scoring system for the MBTI, indicated quite clearly that when the 3-parameter IRT model is used to estimate scores on the continuous preference scales, the resulting score distributions are strongly bimodal. To update these findings, we used the database from which the above empirical ICC results were produced (which adds a number of individuals to the sample used in Harvey & Murry, 1994). Figure 19 presents the frequency distribution for the EI scale’s PR-based preference scores (Figure 19 contains a frequency-count bar for each discrete PR-preference value). In contrast, Figure 20 presents the distribution of the EI θ-based preference score estimates (θ scores contain a significantly higher number of discrete score values; consequently, to facilitate comparison, the number of frequency bars in Figure 20 has been matched to the number of discrete PR-based preference values).

A comparison of Figures 19 and 20 indicates that the θ-based preference distribution is strongly bimodal in shape, whereas the PR-based preference scores exhibit a relatively flat distribution in which many individuals score near the type cutoff (very similar results are seen for the remaining three preference dimensions).
Although some respondents do indeed score in the vicinity of the type cutoff in the IRT-based distribution, there is a pronounced decrease in the density of individuals scoring in the cutoff region between the two very pronounced modes (which are located approximately ±0.5 units on either side of the type cutoff). A visual examination of the two distributions suggests that fewer individuals score close to the cutoff point in the θ-based distribution than in the PR-based distribution.

Thus, regarding the issue of preference score bimodality, the evidence available to date indicates quite convincingly that bimodal score distributions can be produced simply by changing the technology that is used to estimate preference scores from the observed MBTI item responses. Although bimodal preference distributions have been found in highly selected samples of individuals who demonstrate very strong type differentiation (e.g., Rytting et al., 1994), they have not been seen in larger, more representative samples (e.g., Stricker & Ross, 1964); this fact has been trumpeted by MBTI critics as a serious flaw in both the MBTI instrument and Myers’ type-based personality theory that inspired the MBTI.
If these results are found by subsequent research to be generalizable to non-student-based samples (which we have every reason to expect, given both the relatively large size of our sample and the fact that the students who attend major universities typically represent a diverse cross-section of the general population), this finding will effectively eliminate one of the major arguments raised by MBTI critics.

Measurement Precision

As we noted in our review of criticisms of the MBTI, many authors have expressed concerns regarding its measurement precision; in particular, the level of score stability that is seen in test-retest situations, and its ability to correctly assign individuals who score close to the type cutoffs to type categories (e.g., Pittenger, 1993).
Earlier, we identified two strategies that could be taken to improve the level of test-retest stability and the MBTI’s ability to correctly classify individuals into type categories: (a) decreasing the number of individuals who score close to the type cutoffs by increasing the bimodality of the preference score distributions; and (b) revising the MBTI scoring system to produce a higher level of precision in the vicinity of the type cutoff score.

As a visual examination of the results presented in Figures 19-20 suggests, switching from a PR- to a θ-based scoring system for the MBTI, without changing a single test item, appears to provide a means for addressing the bimodality issue.
In an attempt to address more precisely the question of whether θ-based scoring reduces the number of individuals scoring close to the type cutoffs, we standardized the PR-based preference scores to have the same mean and SD as the θ-based preferences, and then counted the number of individuals who scored within a band of a given size around each scale’s type cutoff score. Values of ±0.25 and ±0.35 were used when setting these bands; 0.25 is a somewhat arbitrary value, whereas 0.35 approximates the size of a ±1 SEM confidence interval for a scale having a .85 reliability, as well as the size of the SE that would be expected when estimating θ scores at the type cutoff point (see Figure 8). Individuals who score within these bands should be much more likely to be incorrectly classified into a categorical type due to the action of measurement error (either in a single administration, or in a test-retest situation) than those who score outside these zones.

Table 1 presents the numbers of individuals scoring within these two intervals for the PR- and θ-based preferences. As the breakdowns in Table 1 indicate, PR-based preference scoring consistently locates a larger percentage of respondents in the “zone of uncertainty” around the cutoff than the θ-based scoring system.
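The band-counting procedure can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the rescaling is a simple linear transformation, the cutoff is passed through the same transformation as the scores, the SEM uses the classical formula SD × sqrt(1 − reliability), and the score lists are toy data.

```python
import math
import statistics

def sem(sd, reliability):
    """Classical standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)

def rescale(scores, cutoff, target_mean, target_sd):
    """Linearly map scores (and the cutoff) onto a target mean and SD."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)

    def to_target(x):
        return target_mean + target_sd * (x - mean) / sd

    return [to_target(x) for x in scores], to_target(cutoff)

def share_in_band(scores, cutoff, half_width):
    """Proportion of scores within +/- half_width of the cutoff."""
    inside = sum(1 for x in scores if abs(x - cutoff) <= half_width)
    return inside / len(scores)

if __name__ == "__main__":
    # Toy data only; the real PR and theta scores are not reproduced here.
    pr_scores = [-41, -23, -9, -3, 1, 5, 11, 27, 39, 45]
    theta_scores = [-1.4, -0.9, -0.6, -0.5, -0.1, 0.4, 0.5, 0.7, 0.9, 1.5]
    t_mean = statistics.fmean(theta_scores)
    t_sd = statistics.pstdev(theta_scores)
    pr_rescaled, pr_cutoff = rescale(pr_scores, 0.0, t_mean, t_sd)
    for half_width in (0.25, 0.35):
        print(half_width,
              share_in_band(pr_rescaled, pr_cutoff, half_width),
              share_in_band(theta_scores, 0.0, half_width))
    print(round(sem(1.0, 0.85), 3))
```

On a scale with an SD of exactly 1, the classical formula gives an SEM of about 0.39 for a reliability of .85, in the same neighborhood as the 0.35 band; the exact figure depends on the SD of the θ scale, which this sketch simply assumes.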
Using the number of individuals classified within the ±0.25 and ±0.35 bands by the traditional PR-based scoring system as the basis for comparison, the IRT-based scoring system produces reductions of 37% and 27%, respectively, in the number of MBTI profiles that fall within this zone of uncertainty.

Likewise, when profiles are cross-classified by whether they fall within the zone of uncertainty under IRT versus PR scoring, the results in Table 1 indicate that 54% and 36% of the profiles that fall within the uncertainty zone using PR scoring fall outside the zone when using IRT scoring for the .25 and .35 bands, respectively. Conversely, only 4% and 3% of the profiles that fall outside of the uncertainty zone using PR scoring fall inside the zone when using IRT scoring. Again, these results illustrate the sizable reductions in the percentage of individuals who score close to the type cutoff point that are produced simply by switching from a PR-based to a θ-based scoring system for the MBTI item responses.

Figures 21 and 22 present more information on the performance of the IRT-based scoring system; Figure 21 shows a scatterplot of the EI preference scores estimated by PR- vs. IRT-methods, whereas Figure 22 shows a scatterplot of IRT-based preference scores for the EI vs. SN scales. As the plot in Figure 21 illustrates, there is a strong (but decidedly nonlinear) association between θ- and PR-based preference score estimates. For example, Figure 21 illustrates how individuals receiving an identical PR preference score can receive a relatively broad range of θ-based preference scores. This illustrates a major advantage of θ-based scoring: it matters not only how many items are endorsed in the keyed direction, but also which items are endorsed in each direction.
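The point that identical counts of keyed answers can yield different θ estimates can be illustrated with a small maximum-likelihood sketch. It is a simplified stand-in rather than the scoring procedure used in the studies cited here: the four item parameter triples are hypothetical, the lower asymptote is set to 0, and θ is located by a plain grid search instead of the iterative estimation that IRT software normally performs.

```python
import math

def icc_3pl(theta, a, b, c=0.0):
    """3PL probability of a keyed response (logistic metric)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = icc_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items: two highly discriminating, two weakly discriminating.
ITEMS = [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]

pattern_a = [1, 1, 0, 0]  # keyed answers on the two strong items
pattern_b = [0, 0, 1, 1]  # keyed answers on the two weak items

theta_a = estimate_theta(pattern_a, ITEMS)
theta_b = estimate_theta(pattern_b, ITEMS)

if __name__ == "__main__":
    print(sum(pattern_a), sum(pattern_b))  # same number of keyed answers
    print(theta_a, theta_b)                # estimates on opposite sides of 0
```

Both patterns contain two keyed answers, so a count-based score treats them alike, yet the likelihood places the first respondent on the keyed side of the scale and the second on the opposite side because the strongly discriminating items carry more weight.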
In short, answers to highly discriminating items are much more diagnostic than answers to items that possess low a parameters; IRT-based scoring automatically takes these factors into account when estimating each individual’s θ-based preference score. Thus, two individuals with the same overall number of “keyed” answers might receive very different θ-based preference scores, depending on which items were endorsed.

The reductions in distribution density near the type cutoff scores that are illustrated in Figures 20 and 22, and quantified in Table 1, provide reason for optimism regarding the ability of IRT scoring to improve the measurement precision of the MBTI (as manifest by test-retest type stability, or with respect to agreement with type values obtained via “true type” methods). For example, in Figure 22, areas of much higher density can be seen in the bivariate distribution of the EI and SN scales (i.e., at the points at which the bimodal peaks are present in the univariate frequency distributions); likewise, areas of low density are seen in areas in which we would prefer to have few if any respondents (e.g., at 0 on both scales, the relatively sparsely populated square in the center of the scatterplot). Researchers now need to conduct empirical studies that compare PR- vs. θ-based MBTI scoring systems in test-retest and “true type” settings; if, as we hypothesize, θ-based scoring is capable of producing improvements in test-retest type stability and higher levels of agreement between MBTI- and “true type”-based type assignments, another major class of criticisms of the MBTI could thereby be addressed.

However, it must be noted that the above results, as well as those obtained in the Harvey, Murry, and Markham (1994) study that examined the measurement precision of various short-form versions of the MBTI, are not uniformly positive. Indeed, these research findings indicate that considerable “room for improvement” exists with respect to the MBTI’s measurement precision. For example, even using the relatively small ±0.25 uncertainty interval in Table 1, 11% of the individuals in the sample have θ-based preference scores that lie close to the type cutoff score, and 19% score in this region using the more liberal ±0.35 interval.
Although these rates represent sizable reductions with respect to the numbers of individuals that fall within the uncertainty region using PR scoring (which locates 18% and 25% of the sample within these zones, respectively), we would ideally prefer to see the number of individuals scoring close to the cutoff approach zero.

Expanding the MBTI item pools to contain new items (in particular, items that produce highly discriminating ICCs like those presented in Figures 9 and 13-15) is the most likely way in which to further improve the MBTI’s measurement precision. As the results of the Harvey, Murry, and Stamoulis (1995) and Harvey and Murry (1994) studies demonstrated, there are relatively few “high performance” items in the Form G/F item pools; many items demonstrate only moderate levels of discrimination, and a number of items produce relatively poor levels of information (e.g., Figure 11).

The degree to which the MBTI could benefit from the addition of new, high-performance items was demonstrated by the Thomas and Harvey (1995) study, which attempted to write new items that would parallel the content domains of the existing four MBTI scales. Containing an item pool of 200 new items (50 per scale), the Work Styles Inventory (WSI; Thomas, 1994) was field tested on a sample of 583 college students.
Based on analyses of this database, Thomas and Harvey (1995) identified a number of the WSI items that, when added to the existing MBTI item pools, produced significantly higher TIFs for the MBTI scales. Figure 23 presents the TIFs for the EI scale that were computed using the Form F MBTI item pool, a long and short version of the WSI EI items, and the combined WSI-plus-MBTI pool.

An inspection of the TIFs presented in Figure 23 reveals that, as hypothesized, it is indeed possible to write new, high-performance items for the four main scales of the MBTI. When added to the existing MBTI scales, these new items produce substantial improvements in the TIFs, relative to the levels produced by the Form F items. Of course, the results in Figure 23 also indicate that the WSI items leave some “room for improvement,” in particular, with respect to the location of the additional information they provide. That is, the Form F item pool has a TIF that is somewhat biased in favor of assessing individuals scoring toward the “I” pole of the EI scale (i.e., its TIF peaks at approximately 0.25 in the “I” direction). The WSI items are biased even more strongly in favor of higher precision in the “I” direction, with TIFs peaking at approximately 0.8 units in the “I” direction.
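Because a test information function is the sum of the information functions of the items in a pool, the effect of adding new items can be sketched directly. The sketch below is illustrative only: the two pools and their (a, b, c) values are hypothetical rather than taken from the Form F or WSI analyses, positive θ is assumed to point toward the “I” pole, and the conditional standard error is taken as 1/sqrt(TIF).

```python
import math

def item_information(theta, a, b, c=0.0):
    """Item information for the 3PL model (logistic metric)."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def tif(theta, items):
    """Test information: the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def conditional_se(theta, items):
    """Standard error of the theta estimate implied by the TIF."""
    return 1.0 / math.sqrt(tif(theta, items))

# Hypothetical pools; NOT the actual Form F or WSI item parameters.
EXISTING = [(0.8, -0.4, 0.0), (1.0, 0.2, 0.0), (0.9, 0.6, 0.0), (0.7, 0.4, 0.0)]
NEW = [(1.8, 0.8, 0.0), (2.0, 0.7, 0.0), (1.9, 0.9, 0.0)]
COMBINED = EXISTING + NEW

GRID = [x / 10.0 for x in range(-30, 31)]  # theta from -3.0 to 3.0

def peak(items):
    """Grid point at which the pool's TIF is highest."""
    return max(GRID, key=lambda t: tif(t, items))

if __name__ == "__main__":
    for label, pool in (("existing", EXISTING), ("new", NEW), ("combined", COMBINED)):
        print(label, peak(pool), round(conditional_se(0.0, pool), 3))
```

In this toy setup the added items raise the TIF everywhere, but most of the gain sits away from the cutoff at 0, which mirrors the concern about where the additional information is located.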
For practical use, we would prefer the TIFs to be symmetric, and centered on the cutoff point between the two types. Thus, additional items that produced their highest levels of discrimination in the “E” direction would be needed to balance out these new items.

It is also possible that the measurement precision of the MBTI item pools can be enhanced through the use of some of the “research” items that are included on longer forms of the MBTI (e.g., Form F, J). For example, Form J contains over 190 items that are not part of the Form F/G scoring system; it seems reasonable to hypothesize that the addition of these “research” items to the Form F/G item pools should also produce improvements in the TIFs for the four major MBTI scales. Additional research is needed to evaluate the degree to which new high-performance items can be obtained from the existing “research” item pool.

Conclusions

In this article, we identified a small number of general classes of criticisms that have been directed toward the MBTI. Based on our review, the first of these classes of criticisms (which claims that the MBTI items do not measure the four latent constructs they seek to measure) was found to be sharply inconsistent with empirical research findings, particularly the results of recent large-sample exploratory and confirmatory factor analyses. The second class of criticisms (which involves claims to the effect that the MBTI is flawed because it does not produce bimodally shaped distributions of preference scores) was likewise found to be unsupported by the data when one considers preference score distributions computed using IRT-based scoring methods.
Although traditional PR-based preference scores do not exhibit bimodality, IRT’s θ-based preference score distributions were found to be sharply bimodal in large, unselected samples.

Using the research findings currently available to us, we were unable to dismiss the final class of criticisms (which deals with claims to the effect that the MBTI is flawed because its levels of test-retest type stability are less than perfect). However, based on the reductions in the relative number of individuals who score close to the type cutoffs that occur when IRT-based scoring methods are used, as well as the potential for the MBTI’s measurement precision to be increased via the addition of new items, we conclude that it is reasonable to hypothesize that significant improvements in the MBTI’s test-retest type stability may be achievable by switching to IRT-based scoring and/or lengthening the MBTI item pools.
Research implementing these strategies is now needed in order that we may determine the degree to which these measurement-precision based criticisms can be dismissed as convincingly as we have dealt with criticisms based on the MBTI’s factor structure and the bimodality of its preference score distributions.

We also attempted to provide an overview of the IRT model, focusing on the way in which IRT’s traditional “right-wrong” terminology can be adapted to the domain of assessment instruments that are not couched in “right-wrong” terms, and on ways in which one can assess whether the IRT model “fits” the observed item responses. Regarding this latter issue, the results we presented using empirically derived ICCs (which, by definition, are in no way influenced by the assumptions made by the IRT model) showed quite convincingly that many MBTI items do indeed demonstrate nonlinear relations with the latent preference constructs, and that the MBTI items differ sharply with respect to both the amount and location of the information they provide with respect to the underlying MBTI preferences.

In conclusion, it is important to note that the traditional prediction-ratio based system of estimating MBTI preference scores has worked well for decades, and it has been very valuable to practitioners by virtue of providing them with a means of scoring the instrument and assigning individuals to type categories.
Clearly, any new system for scoring the MBTI must offer significant advantages or features that cannot be obtained using the traditional PR-based method. In short, we must ask whether it is worth the trouble to change to a new scoring system. Based on the above results, we conclude that IRT-based scoring does offer the kind, and magnitude, of improvement needed to justify the change to a new MBTI scoring system.

Specifically, advantages offered by IRT scoring include the following: (a) it produces bimodal score distributions that decrease the number of individuals who score close to the type cutoffs; (b) it offers a scoring system that allows us to differentially weight item responses based on each item’s discriminating power, the point at which it provides its maximum information, and the degree to which individuals who score strongly in the non-keyed direction will tend to endorse it in the keyed direction (all of which should produce more precise estimates of each person’s scores on the preference scales); (c) it allows the development of a version of the MBTI that can be administered using computerized adaptive testing (CAT) technology (which has the potential to significantly reduce testing time while keeping the precision of measurement high); (d) it can produce quantitative indices of the quality and internal consistency of an individual’s MBTI item response profile using appropriateness indices (these may be valuable in identifying invalid response profiles and in resolving cases of type indeterminacy); and (e) it allows sensitive, item-level studies of the degree to which MBTI items tend to perform differently for individuals in different demographic categories (e.g., to identify items suffering from potential gender- or race-based bias).

IRT-based MBTI research has finally started to appear, and although much has been accomplished, much remains to be done. In particular, studies are needed to determine the degree to which IRT scoring is capable of producing higher test-retest type stability and/or agreement with “true type” assessments, the degree to which MBTI items suffer from race- or sex-based bias, the amount of reduction in testing time that may be possible by using CAT-based administration, the amount of success that may be achieved by using appropriateness indices to spot aberrant or internally inconsistent response profiles, and the degree to which the measurement precision of the MBTI scales can be enhanced via the addition of new items (either from the currently unused “research” items, or from other sources).


References

Block, J., & Ozer, D. J. (1982). Two types of psychologists: Remarks on the Mendelsohn, Weiss, and Feimer contribution. Journal of Personality and Social Psychology, 42, 1171-1181.
Briggs, K. C., & Myers, I. B. (1976). Myers-Briggs Type Indicator: Form F. Palo Alto: Consulting Psychologists Press.
Carlson, J. (1985). Recent assessments of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 49(4), 356-365.
Carlyn, M. (1977). An assessment of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 41, 461-473.
Carskadon, T. G. (1977). Test-retest reliabilities of continuous scores on the Myers-Briggs Type Indicator. Psychological Reports, 41, 1011-1012.
Cliff, N. (1987). The eigenvalue-greater-than-one rule and the reliability of components. Psychological Bulletin, 103, 276-279.
Coe, C. K. (1992). The MBTI: Potential uses and misuses in personnel administration. Public Personnel Management, 21(4), 511-523.
Comrey, A. L. (1983). An evaluation of the Myers-Briggs Type Indicator. Academic Psychology Bulletin, 5, 115-129.
Gangestad, S. W., & Snyder, M. (1991). Taxonomic analysis redux: Some statistical considerations for testing a latent class model. Journal of Personality and Social Psychology, 61, 141-146.
Garden, A. (1989). Organisational size as a variable in type analysis and employee turnover. Journal of Psychological Type, 17, 3-13.
Gauld, V., & Sink, D. (1985). The MBTI as a diagnostic tool in organization development interventions. Journal of Psychological Type, 9, 24-29.
Gough, H. G. (1976). Studying creativity by means of word association tests. Journal of Applied Psychology, 61, 348-353.
Hall, W. B., & MacKinnon, D. W. (1969). Personality inventory correlates of creativity among architects. Journal of Applied Psychology, 53, 322-326.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hartzler, G. J., & Hartzler, M. T. (1982). Management uses of the Myers-Briggs Type Indicator. Research in Psychological Type, 5, 20-29.
Harvey, R. J., & Murry, W. D. (1994). Scoring the Myers-Briggs Type Indicator: Empirical comparison of preference score versus latent-trait methods. Journal of Personality Assessment, 62, 116-129.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1994). Evaluation of three short form versions of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 63, 181-184.
Harvey, R. J., Murry, W. D., & Markham, S. E. (1995, May). A “Big Five” scoring system for the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Harvey, R. J., Murry, W. D., & Stamoulis, D. (1995). Unresolved issues in the dimensionality of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 55, 535-544.
Harvey, R. J., & Thomas, L. A. (1995, May). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Hulin, C., Drasgow, F., & Parsons, C. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis. Beverly Hills: Sage.
Johnson, D. A., & Saunders, D. R. (1990). Confirmatory factor analysis of the Myers-Briggs Type Indicator -- Expanded Analysis Report. Educational and Psychological Measurement, 50, 561-571.
Joreskog, K. G., & Sorbom, D. (1981). LISREL V: Analysis of linear structural relationships by maximum likelihood and least squares methods. Chicago: International Educational Services.
Kirton, M. J. (1976). Adaptors and innovators: A description and measure. Journal of Applied Psychology, 61, 622-629.
Lee, H. B., & Comrey, A. L. (1979). Distortions in a commonly used factor analytic procedure. Multivariate Behavioral Research, 14, 301-321.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McCarley, N., & Carskadon, T. G. (1983). Test-retest reliabilities of scales and subscales of the Myers-Briggs Type Inventory and of criteria for clinical interpretive hypotheses involving them. Research in Psychological Type, 6, 24-36.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. C. (1972). A study of job characteristics and job dimensions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56, 347-367.
Mendelsohn, G. A., Weiss, D. S., & Feimer, N. R. (1982). Conceptual and empirical analysis of the typological implications of patterns of socialization and femininity. Journal of Personality and Social Psychology, 42, 1157-1170.
Miller, M. L., & Thayer, J. F. (1989). On the existence of discrete classes in personality: Is self-monitoring the correct joint to carve? Journal of Personality and Social Psychology, 57, 143-155.


Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic methods. Mooresville, IN: Scientific Software.
Mitchell, W. (1995). A clash of paradigms: Why bimodality, ANOVA interactions, and discontinuities are irrelevant criteria for typologies. Unpublished manuscript.
Moore, T. (1987). Personality tests are back. Fortune, March 30, 74-82.
Myers, I. B. (1962). The Myers-Briggs Type Indicator manual. Princeton, NJ: Educational Testing Service.
Myers, I. B., & McCaulley, M. H. (1985). A guide to the development and use of the Myers-Briggs Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
Myers, I. B., with Myers, P. B. (1980). Gifts differing. Palo Alto, CA: Consulting Psychologists Press.
Pittenger, D. J. (1993). The utility of the Myers-Briggs Type Indicator. Review of Educational Research, 63, 467-488.
Poilitt, I. (1982). Managing differences in industry. Research in Psychological Type, 5, 4-19.
Rytting, M., Ware, R., & Prince, R. A. (1994). Bimodal distributions in a sample of CEOs: Validating evidence for the MBTI. Journal of Psychological Type, 31, 16-23.
Sample, J. A., & Hoffman, J. L. (1986). The MBTI as a management and organizational tool. Journal of Psychological Type, 11, 47-50.
Sipps, G. J., Alexander, R. A., & Friedt, L. (1985). Item analysis of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 45, 789-796.
Stricker, L. J., & Ross, J. (1964). Some correlates of a Jungian personality inventory. Psychological Reports, 14, 623-643.
Thomas, L. A. (1994). Unpublished Master’s thesis, Virginia Polytechnic Institute and State University.
Thomas, L. A., & Harvey, R. J. (1995, April). Improving the measurement precision of the Myers-Briggs Type Indicator. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Orlando.
Thompson, B., & Borrello, G. M. (1986). Construct validity of the Myers-Briggs Type Indicator. Educational and Psychological Measurement, 46, 745-752.
Thompson, B., & Borrello, G. M. (1989, January). A confirmatory factor analysis of data from the Myers-Briggs Type Indicator. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston.
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor-analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-460.
Tzeng, O. C. S., Outcalt, D., Boyer, S. L., Ware, R., & Landis, D. (1984). Item validity of the Myers-Briggs Type Indicator. Journal of Personality Assessment, 48, 255-256.


Figure 1. ICCs for two hypothetical items that illustrate the range of relations that can exist between the latent construct (θ, on the horizontal axis) and the observed likelihood of item endorsement in the keyed direction (PCR, on the y axis). Item 1 defines an almost linear function, whereas Item 2 approximates a step function. These ICCs were generated using a 2-parameter IRT model in which the b parameters were 0.0, and the a parameters were 0.35 and 17.0 for Items 1 and 2, respectively.
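As a rough illustration of how curves like those described for Figure 1 can be generated, the sketch below evaluates a 2-parameter logistic function, P(θ) = 1 / (1 + exp(−D·a·(θ − b))), using the a and b values given in the caption. The function name `icc_2pl` and the scaling constant `D` are assumptions introduced here; the caption does not say whether a scaling constant (such as 1.7) was applied, so `D` defaults to 1.0.

```python
import math

def icc_2pl(theta, a, b, D=1.0):
    """Probability of a keyed response under a 2-parameter logistic model.

    D is an optional scaling constant (assumed to be 1.0; not stated in the caption).
    """
    z = D * a * (theta - b)
    # Numerically stable logistic, so very steep items (large a) do not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Discrimination (a) values from the Figure 1 caption; b = 0.0 for both items.
ITEMS = {"Item 1": 0.35, "Item 2": 17.0}

for name, a in ITEMS.items():
    curve = ", ".join(f"{icc_2pl(t, a, 0.0):.3f}" for t in (-2, -1, -0.2, 0, 0.2, 1, 2))
    print(f"{name} (a={a}): {curve}")
```

With a = 0.35 the probabilities change slowly across the whole theta range, whereas with a = 17.0 they jump from near 0 to near 1 within a narrow band around b, which is the near-linear versus step-function contrast the caption describes.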


Figure 2. Item information functions for the two hypothetical items presented in Figure 1. The horizontal axis represents the levels of theta, whereas the vertical axis reflects the amount of information contained in each item, across the different levels of theta.
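For a 2-parameter logistic item, the standard item information function is I(θ) = D²a²P(θ)[1 − P(θ)], which peaks at θ = b with height D²a²/4. The sketch below applies that formula to the two hypothetical items; the function names and the default D = 1.0 are assumptions made for illustration, not details taken from the paper.

```python
import math

def icc_2pl(theta, a, b, D=1.0):
    """2-parameter logistic ICC (D = 1.0 is an assumed scaling constant)."""
    z = D * a * (theta - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def item_information_2pl(theta, a, b, D=1.0):
    """Item information for a 2PL item: (D*a)^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

# Compare the two hypothetical items (a = 0.35 and a = 17.0, b = 0.0) at a few theta levels.
for theta in (-1.0, -0.1, 0.0, 0.1, 1.0):
    low = item_information_2pl(theta, 0.35, 0.0)
    high = item_information_2pl(theta, 17.0, 0.0)
    print(f"theta={theta:+.1f}  a=0.35: {low:.4f}  a=17.0: {high:.4f}")
```

Under these assumptions the highly discriminating item carries a large amount of information, but only in a narrow region around its b parameter; away from that point the weakly discriminating item can actually be the more informative of the two.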


Figure 3. 1-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction (the vertical line serves as the cutoff between the types). The PCR (vertical) axis indicates the expected percentage of individuals who would endorse the item in the keyed (“I”) direction for each level of theta (the horizontal line denotes the point at which we would expect 50% of the examinees to endorse the item in the keyed direction). The dotted vertical lines indicate the levels of theta at which 50% of those who hold that preference would endorse the item in the “I” direction.


Figure 4. 2-parameter ICCs for three hypothetical EI items that differ only in terms of their a (discrimination) parameters (Item 1 has a = .35, Item 2 = 1.0, and Item 3 = 2.1). On the theta (horizontal) axis, positive values indicate a preference in the “I” direction, and negative values indicate a preference in the “E” direction; higher scores on the PCR (vertical) axis reflect a higher likelihood of endorsing the keyed (“I”) response. The two vertical lines on the theta axis are drawn to reflect a “slight” preference (Myers & McCaulley, 1985, p. 58) in the “E” (-0.2) and “I” (+0.2) directions. The solid horizontal lines identify the different item endorsement (PCR) rates for Item 1 at these two preferences; the dotted horizontal lines identify the PCRs for Item 3.
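The comparison drawn in this caption can be worked through numerically. The sketch below computes the expected endorsement rates at θ = −0.2 and θ = +0.2 for each of the three a values, and the gap between them. It assumes b = 0.0 for all three items and a scaling constant D = 1.0; neither value is stated in the caption, so the resulting numbers are illustrative only.

```python
import math

def icc_2pl(theta, a, b=0.0, D=1.0):
    """2-parameter logistic ICC; b = 0.0 and D = 1.0 are assumptions."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def endorsement_gap(a, theta_low=-0.2, theta_high=0.2):
    """Difference in expected endorsement rate between two theta levels."""
    return icc_2pl(theta_high, a) - icc_2pl(theta_low, a)

# Discrimination values from the Figure 4 caption.
for label, a in (("Item 1", 0.35), ("Item 2", 1.0), ("Item 3", 2.1)):
    p_e = icc_2pl(-0.2, a)   # slight "E" preference
    p_i = icc_2pl(+0.2, a)   # slight "I" preference
    print(f"{label} (a={a}): PCR at -0.2 = {p_e:.3f}, at +0.2 = {p_i:.3f}, "
          f"gap = {endorsement_gap(a):.3f}")
```

Under these assumptions the low-discrimination item separates slight “E” from slight “I” respondents by only a few percentage points, while the high-discrimination item separates them by roughly twenty, which is the point the solid and dotted horizontal lines are meant to convey.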


Figure 5. 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). Higher PCRs are associated with increased levels of endorsement of the “I” alternative.


Figure 6. Item information functions for 3-parameter ICCs for EI items 33 (easy vs. hard to get to know), 50 (“good mixer” vs. quiet and reserved), and 129 (one of first to follow a new fashion vs. not interested). The vertical axis reflects the amount of information contained in each item, across the different levels of theta.
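For readers who want to see how 3-parameter ICCs and their information functions are computed, the sketch below uses the standard 3-parameter logistic form, P(θ) = c + (1 − c) / (1 + exp(−D·a·(θ − b))), and the corresponding information function I(θ) = D²a²·(Q/P)·[(P − c)/(1 − c)]², where Q = 1 − P. The estimated parameters for items 33, 50, and 129 are not given in the captions, so the values in the usage loop are hypothetical placeholders, as are the function names and the default D = 1.0.

```python
import math

def icc_3pl(theta, a, b, c, D=1.0):
    """3-parameter logistic ICC with lower asymptote c (D = 1.0 is assumed)."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic

def item_information_3pl(theta, a, b, c, D=1.0):
    """Item information for a 3PL item; reduces to the 2PL formula when c = 0."""
    p = icc_3pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical parameters (NOT the estimates for any MBTI item).
a, b, c = 1.2, 0.3, 0.1
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  P={icc_3pl(theta, a, b, c):.3f}  "
          f"I={item_information_3pl(theta, a, b, c):.3f}")
```

A nonzero c lowers the information an item provides relative to a 2-parameter item with the same a and b, because some keyed responses at low theta levels are attributed to the lower asymptote instead of the trait.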


Figure 7. Test information functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of information contained in the collection of items in each test, across the different levels of theta (larger values are better). The lower horizontal line denotes the amount of information necessary to produce a 0.5 standard error (SE) when estimating the theta score from the item responses; the upper horizontal line corresponds to the level required to produce a 0.39 SE (i.e., the level that would be predicted if the CTT-based reliability of the MBTI scales was 0.85).


Figure 8. Test standard error (SE) functions for a 3-item EI scale formed from items 33, 50, and 129 versus one formed from all of the Form F EI items. The vertical axis reflects the amount of precision in estimating the theta score, at each level of theta (smaller values are better). The upper horizontal line denotes an SE of 0.5; the lower line denotes an SE of 0.39 (which corresponds to a CTT reliability of 0.85).
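The reference lines in the test information and standard error plots follow from two standard relations: in IRT, SE(θ) = 1/√I(θ), and in classical test theory the standard error of measurement is SD·√(1 − reliability). The sketch below works through the arithmetic, assuming theta is scaled to a standard deviation of 1.0 (an assumption made here so the two metrics are comparable); the function names are illustrative.

```python
import math

def se_from_information(information):
    """IRT standard error of the theta estimate at a given information level."""
    return 1.0 / math.sqrt(information)

def information_for_se(se):
    """Test information required to reach a target standard error."""
    return 1.0 / se ** 2

def se_from_reliability(reliability, sd=1.0):
    """CTT standard error of measurement; sd = 1.0 assumes a z-score metric."""
    return sd * math.sqrt(1.0 - reliability)

print(f"Information needed for SE = 0.50: {information_for_se(0.50):.2f}")
print(f"Information needed for SE = 0.39: {information_for_se(0.39):.2f}")
print(f"Information needed for SE = 0.25: {information_for_se(0.25):.2f}")
print(f"SE implied by reliability 0.85:   {se_from_reliability(0.85):.3f}")
```

A reliability of 0.85 implies an SE of about 0.39 on a unit-variance scale, which matches the pairing of those two values in the captions; an SE of 0.5 corresponds to a test information of 4.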


Figure 9. Empirically derived ICC for a high-performance MBTI item from the EI scale (number 50, “good mixer” vs. quiet and reserved). The horizontal axis denotes the EI preference scores (positive values indicating “I” preference, negative values indicating “E” preference) computed using the Form F scoring system. The curved line drawn through the points is a smoothed spline interpolation. The squares denote the actual percentages of individuals at each level of the EI preference who endorsed the item in the “I” direction. Here, higher PCRs are associated with increased likelihood of endorsing the “quiet and reserved” alternative.
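An empirically derived ICC of this kind can be approximated by grouping respondents by preference score and computing the proportion in each group who gave the keyed response. The sketch below shows one minimal way to do that; the function name, the bin width, and the toy data are all hypothetical, and the spline smoothing mentioned in the caption is deliberately left out.

```python
import math
from collections import defaultdict

def empirical_icc(scores, responses, bin_width=10):
    """Return sorted (bin_midpoint, proportion_keyed, n) tuples.

    scores    : preference scores (negative = "E", positive = "I")
    responses : 1 if the item was endorsed in the keyed direction, else 0
    bin_width : width of each score bin (an arbitrary choice for illustration)
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [keyed count, total count]
    for score, keyed in zip(scores, responses):
        idx = math.floor(score / bin_width)
        counts[idx][0] += int(keyed)
        counts[idx][1] += 1
    return [((idx + 0.5) * bin_width, keyed / n, n)
            for idx, (keyed, n) in sorted(counts.items())]

# Toy data, invented purely to exercise the function.
toy_scores = [-35, -31, -12, -15, 5, 8, 27, 33]
toy_responses = [0, 0, 0, 1, 1, 0, 1, 1]
for midpoint, proportion, n in empirical_icc(toy_scores, toy_responses, bin_width=20):
    print(f"bin midpoint {midpoint:+.0f}: {proportion:.2f} keyed (n={n})")
```

With real data, bins containing few respondents give unstable proportions, which is one reason a smoothed curve is usually drawn through the points.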


Figure 10. Empirically derived ICC for a moderate-performance MBTI item from the EI scale (number 33, easy vs. hard to get to know). Here, higher PCRs are associated with an increased likelihood of endorsing the “hard to get to know” alternative.


Figure 11. Empirically derived ICC for a low-performance MBTI item from the EI scale (number 129, one of first to follow a new fashion vs. not interested). Here, higher PCRs are associated with increased likelihood of endorsing the “not interested in following fashion” alternative.


Figure 12. Overlaid empirically derived ICCs for EI items 33, 50, and 129. A comparison of these ICCs against those produced by the 3-parameter IRT model presented in Figure 5 provides compelling evidence regarding the appropriateness of using the 3-parameter IRT model to score the MBTI.


Figure 13. Empirically derived ICC for a high-performance MBTI item from the SN scale (number 104, concrete v. abstract); scores to the right of the vertical line represent “N” preferences, whereas those to the left represent “S” preferences. Here, higher PCRs are associated with increased likelihood of endorsing the “abstract” alternative.


Figure 14. Empirically derived ICC for a high-performance MBTI item from the TF scale (number 114, feeling v. thinking); scores to the right of the vertical line represent “F” preferences, whereas those to the left represent “T” preferences. Higher PCRs are associated with increased likelihood of endorsing the “feeling” response.


Figure 15. Empirically derived ICC for a high-performance MBTI item from the JP scale (number 85, scheduled v. unplanned); scores to the right of the vertical line represent “P” preferences, whereas those to the left represent “J” preferences. Higher PCRs are associated with increased likelihood of endorsing the “unplanned” response.


Figure 16. Empirically derived ICC for a high-performance JP item (85, scheduled v. unplanned) using scores on the EI preference dimension as the horizontal axis (scores to the right of the vertical line denote “I” preferences, whereas those to the left represent “E” preferences). As would be expected, there is virtually no association between EI preferences and the likelihood of endorsing this item in the “unplanned” (“P”) direction.


Figure 17. Empirically derived ICC for a high-performance JP item (85, scheduled v. unplanned) using scores on the SN preference dimension as the horizontal axis (scores to the right of the vertical line denote “N” preferences, whereas those to the left represent “S” preferences). Reflecting the fact that the SN and JP preferences are not orthogonal, a consistent association can be observed between SN preferences and the PCR rates for this JP item (as expected, intuitives tend to endorse this item in the “unplanned” direction at higher rates than sensors).


Figure 18. Empirically derived ICC for EI item 116 (detached v. sociable) using scores on the EI preference dimension as the horizontal axis. This illustrates an item that would likely be viewed as a low-performance item by the traditional prediction-ratio-based scoring system, but which is viewed as a strongly discriminating item by IRT. The reason for this discrepancy lies in the fact that this item provides its best discrimination for relatively strong Introverts (e.g., in the 40-50 range toward “I”).


Figure 19. Frequency distribution for PR-based preference scores (using Form F key) on the EI dimension.


Figure 20. Frequency distribution for IRT-based preference score estimates on the EI dimension.


Figure 21. Scatterplot of EI preference scores estimated using the traditional PR-based formula (horizontal axis) versus the IRT-based method (vertical axis). The line drawn through the points is the linear regression line.


Figure 22. Scatterplot of EI (vertical axis) versus SN (horizontal axis) preference scores estimated using IRT methods. Note the areas of higher density at approximately 0.5 z units above and below the type cutoff, and the area of low density at the cutoff on each scale (i.e., 0.0).


Figure 23. Test information functions for the EI scales using the Form F MBTI item pools, the 22- and 35-item pools for the EI scale of the Work Styles Inventory (WSI; Thomas, 1994), and the combined MBTI plus WSI EI item pool. Horizontal lines correspond to the levels of information that would produce SE values in estimating theta of .25 and .50.


Table 1
Numbers of MBTI Profiles Scoring Within a Given “Zone of Uncertainty” Around the Cutoffs

Cell entries: number of profiles (% of total, % of row, % of column). Rows give the region on the θ-based preference; columns give the region on the PR preference.

±0.25 Interval Around the Cutoff
                             Outside cutoff region (PR)     Inside cutoff region (PR)     Total
Outside cutoff region (θ)    1984 (79.4%, 89.3%, 96.4%)     237 (9.5%, 10.7%, 53.6%)      2221 (88.9%)
Inside cutoff region (θ)     73 (2.9%, 26.3%, 3.6%)         205 (8.2%, 73.7%, 46.4%)      278 (11.1%)
Total                        2057 (82.3%)                   442 (17.7%)                   2499 (100%)

±0.35 Interval Around the Cutoff
                             Outside cutoff region (PR)     Inside cutoff region (PR)     Total
Outside cutoff region (θ)    1809 (72.4%, 88.9%, 96.9%)     226 (9.0%, 11.1%, 35.8%)      2035 (81.4%)
Inside cutoff region (θ)     58 (2.3%, 12.5%, 3.1%)         406 (16.3%, 87.5%, 64.2%)     464 (18.6%)
Total                        1867 (74.7%)                   632 (25.3%)                   2499 (100%)
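The percentages in Table 1 can be recomputed from the four cell counts alone. The sketch below does this for the ±0.25 interval, assuming that rows index the θ-based region and columns index the PR-based region; that orientation is an assumption chosen here because it reproduces the “% of Row” and “% of Column” figures as labeled. The function name and the returned structure are illustrative.

```python
def crosstab_percentages(counts):
    """Compute total, row, and column percentages for a table of counts.

    counts[i][j] is the number of profiles in row i, column j.
    Returns (cells, row_totals, col_totals, grand_total), where cells[i][j]
    is a dict with keys "n", "pct_total", "pct_row", and "pct_col".
    """
    grand_total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    cells = []
    for i, row in enumerate(counts):
        cells.append([
            {
                "n": n,
                "pct_total": 100.0 * n / grand_total,
                "pct_row": 100.0 * n / row_totals[i],
                "pct_col": 100.0 * n / col_totals[j],
            }
            for j, n in enumerate(row)
        ])
    return cells, row_totals, col_totals, grand_total

# Cell counts for the +/-0.25 interval.
# Assumed layout: rows = theta-based (outside, inside); columns = PR-based (outside, inside).
counts_025 = [[1984, 237],
              [73, 205]]

cells, row_totals, col_totals, total = crosstab_percentages(counts_025)
for row in cells:
    print("  ".join(
        f"{c['n']:>4} ({c['pct_total']:.1f}%, {c['pct_row']:.1f}%, {c['pct_col']:.1f}%)"
        for c in row))
print("row totals:", row_totals, "column totals:", col_totals, "N =", total)
```

The recomputed values agree with the tabled percentages to within a tenth of a percentage point. Summing the diagonal cells gives the number of profiles that the two scoring methods place on the same side of the zone boundary (1984 + 205 = 2189 of 2499 for the ±0.25 interval).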
