Fundamentals of Clinical Research for Radiologists: ROC Analysis


…most lenient decision threshold whereby all test results are positive for disease. An empiric ROC curve has h − 1 additional coordinates, where h is the number of unique test results in the sample. In Table 1 there are 200 test results, one for each of the 200 patients in the sample, but there are only five unique results: normal, benign, probably benign, suspicious, and malignant. Thus, h = 5, and there are four coordinates plotted in Figure 1 corresponding to the four decision thresholds described in Table 1.

The line connecting the (0, 0) and (1, 1) coordinates is called the "chance diagonal" and represents the ROC curve of a diagnostic test with no ability to distinguish patients with versus those without disease. An ROC curve that lies above the chance diagonal, such as the ROC curve for our fictitious mammography example, has some diagnostic ability. The further away an ROC curve is from the chance diagonal, and therefore the closer to the upper left-hand corner, the better discriminating power and diagnostic accuracy the test has.

In characterizing the accuracy of a diagnostic (or screening) test, the ROC curve of the test provides much more information about how the test performs than just a single estimate of the test's sensitivity and specificity [1, 2]. Given a test's ROC curve, a clinician can examine the trade-offs in sensitivity versus specificity for various decision thresholds. Based on the relative costs of false-positive and false-negative errors and the pretest probability of disease, the clinician can choose the optimal decision threshold for each patient. This idea is discussed in more detail in a later section of this article. Often, patient management is more complex than is allowed with a decision threshold that classifies the test results into positive or negative. For example, in mammography suspicious and malignant findings are usually followed up with biopsy, probably benign findings usually result in a follow-up mammogram in 3–6 months, and normal and benign findings are considered negative.

When comparing two or more diagnostic tests, ROC curves are often the only valid method of comparison. Figure 2 illustrates two scenarios in which an investigator, comparing two diagnostic tests, could be misled by relying on only a single sensitivity–specificity pair. Consider Figure 2A. Suppose a more expensive or risky test (represented by ROC curve Y) was reported to have the following accuracy: sensitivity = 0.40, specificity = 0.90 (labeled as coordinate 1 in Fig. 2A); a less expensive or less risky test (represented by ROC curve X) was reported to have the following accuracy: sensitivity = 0.80, specificity = 0.65 (labeled as coordinate 2 in Fig. 2A).
If the investigator is looking for the test with better specificity, then he or she may choose the more expensive, risky test, not realizing that a simple change in the decision threshold of the less expensive, cheaper test could provide the desired specificity at an even higher sensitivity (coordinate 3 in Fig. 2A).

Now consider Figure 2B. The ROC curve for test Z is superior to that of test X for a narrow range of FPRs (0.0–0.08); otherwise, diagnostic test X has superior accuracy. A comparison of the tests' sensitivities at low FPRs would be misleading unless the diagnostic tests are useful only at these low FPRs.

TABLE 1: Construction of Receiver Operating Characteristic Curve Based on Fictitious Mammography Data

Mammography Results    Pathology/Follow-Up Results       Decision Rules 1–4
(BI-RADS Score)        Not Malignant    Malignant        FPR          Sensitivity
Normal                 65               5                (1) 35/100   95/100
Benign                 10               15               (2) 25/100   80/100
Probably benign        15               10               (3) 10/100   70/100
Suspicious             7                60               (4) 3/100    10/100
Malignant              3                10
Total                  100              100

Note.—Decision rule 1 classifies normal mammography findings as negative; all others are positive. Decision rule 2 classifies normal and benign mammography findings as negative; all others are positive. Decision rule 3 classifies normal, benign, and probably benign findings as negative; all others are positive. Decision rule 4 classifies normal, benign, probably benign, and suspicious findings as negative; malignant is the only finding classified as positive. BI-RADS = Breast Imaging Reporting and Data System, FPR = false-positive rate.

To compare two or more diagnostic tests, it is convenient to summarize the tests' accuracies with a single summary measure. Several such summary measures are used in the literature. One is Youden's index, defined as sensitivity + specificity − 1 [2]. Note, however, that Youden's index is affected by the choice of the decision threshold used to define sensitivity and specificity. Thus, different decision thresholds yield different values of the Youden's index for the same diagnostic test.

Another summary measure commonly used is the probability of a correct diagnosis, often referred to simply as "accuracy" in the literature. It can be shown that the probability of a correct diagnosis is equivalent to

probability (correct diagnosis) = PREVs × sensitivity + (1 − PREVs) × specificity,   (1)

where PREVs is the prevalence of disease in the sample. That is, this summary measure of accuracy is affected not only by the choice of the decision threshold but also by the prevalence of disease in the study sample [2]. Thus, even slight changes in the prevalence of disease in the population of patients being tested can lead to different values of "accuracy" for the same test.

Summary measures of accuracy derived from the ROC curve describe the inherent accuracy of a diagnostic test because they are not affected by the choice of the decision threshold and they are not affected by the prevalence of disease in the study sample. Thus, these summary measures are preferable to Youden's index and the probability of a correct diagnosis [2].
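To make the threshold dependence concrete, the arithmetic behind Table 1 and equation (1) can be scripted in a few lines. The following sketch is not part of the original article; it assumes only the Table 1 counts, and the variable names are illustrative. It recomputes the four (FPR, sensitivity) pairs and shows how Youden's index and the prevalence-weighted "accuracy" of equation (1) change as the decision rule changes.

```python
import numpy as np

# Table 1 counts, ordered from least to most suspicious BI-RADS-style category.
categories = ["normal", "benign", "probably benign", "suspicious", "malignant"]
not_malignant = np.array([65, 10, 15, 7, 3])    # 100 patients without cancer
malignant = np.array([5, 15, 10, 60, 10])       # 100 patients with cancer

prev = malignant.sum() / (malignant.sum() + not_malignant.sum())  # sample prevalence (0.5)

# Decision rule k calls categories k..4 positive (rule 1: anything above "normal"
# is positive; rule 4: only "malignant" is positive), as in the Table 1 footnote.
for k in range(1, len(categories)):
    sens = malignant[k:].sum() / malignant.sum()
    fpr = not_malignant[k:].sum() / not_malignant.sum()
    spec = 1.0 - fpr
    youden = sens + spec - 1.0                      # Youden's index
    accuracy = prev * sens + (1.0 - prev) * spec    # equation (1)
    print(f"rule {k}: FPR = {fpr:.2f}, sensitivity = {sens:.2f}, "
          f"Youden = {youden:.2f}, accuracy = {accuracy:.2f}")
```

Running this reproduces the FPR and sensitivity columns of Table 1 and illustrates the point made above: Youden's index and the probability of a correct diagnosis move with the decision rule (and, for equation 1, with prevalence), whereas the ROC curve displays all four operating points at once.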
The most popular summary measure of accuracy is the area under the ROC curve, often denoted as "AUC" for area under curve. It ranges in value from 0.5 (chance) to 1.0 (perfect discrimination or accuracy). The chance diagonal in Figure 1 has an AUC of 0.5. In Figure 2A the areas under both ROC curves are the same, 0.841. There are three interpretations for the AUC: the average sensitivity over all false-positive rates; the average specificity over all sensitivities [3]; and the probability that, when presented with a randomly chosen patient with disease and a randomly chosen patient without disease, the results of the diagnostic test will rank the patient with disease as having higher suspicion for disease than the patient without disease [4].

The AUC is often too global a summary measure. Instead, for a particular clinical application, a decision threshold is chosen so that the diagnostic test will have a low FPR (e.g., FPR < 0.10) or a high sensitivity (e.g., sensitivity > 0.80).


In these circumstances, the accuracy of the test at the specified FPRs (or specified sensitivities) is a more meaningful summary measure than the area under the entire ROC curve. The partial area under the ROC curve, PAUC (e.g., the PAUC where FPR < 0.10, or the PAUC where sensitivity > 0.80), is then an appropriate summary measure of the diagnostic test's accuracy. In Figure 2B, the PAUCs for the two tests where the FPR is between 0.0 and 0.20 are the same, 0.112. For interpretation purposes, the PAUC is often divided by its maximum value, given by the range (i.e., maximum − minimum) of the FPRs (or false-negative rates [FNRs]) [5]. The PAUC divided by its maximum value is called the partial area index and takes on values between 0.5 and 1.0, as does the AUC. It is interpreted as the average sensitivity for the FPRs examined (or average specificity for the FNRs examined). In our example, the range of the FPRs of interest is 0.20 − 0.0 = 0.20; thus, the average sensitivity for FPRs less than 0.20 for diagnostic tests X and Z in Figure 2B is 0.56.

Although the ROC curve has many advantages in characterizing the accuracy of a diagnostic test, it also has some limitations. One criticism is that the ROC curve extends beyond the clinically relevant area of potential clinical interpretation. Of course, the PAUC was developed to address this criticism. Another criticism is that it is possible for a diagnostic test with perfect discrimination between diseased and nondiseased patients to have an AUC of 0.5. Hilden [6] describes this unusual situation and offers solutions. When comparing two diagnostic tests' accuracies, the tests' ROC curves can cross, as in Figure 2. A comparison of these tests based only on their AUCs can be misleading. Again, the PAUC attempts to address this limitation. Last, some [6, 7] criticize the ROC curve, and especially the AUC, for not incorporating the pretest probability of disease and the costs of misdiagnoses.

Fig. 1.—Empiric and fitted (or "smooth") receiver operating characteristic (ROC) curves constructed from mammography data in Table 1. Four labeled points on empiric curve (dotted line) correspond to four decision thresholds used to estimate sensitivity and specificity. Area under curve (AUC) for empiric ROC curve is 0.863 and for fitted curve (solid line) is 0.876.
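As a rough illustration of how the AUC, a PAUC, and the partial area index could be computed from empiric operating points, the sketch below applies simple trapezoidal integration to the Table 1 points. It is a minimal sketch rather than the article's software; the helper functions and point lists are assumptions made here for illustration.

```python
import numpy as np

def empiric_auc(fpr, sens):
    """Trapezoidal area under a piecewise-linear (empiric) ROC curve."""
    fpr, sens = np.asarray(fpr, float), np.asarray(sens, float)
    order = np.argsort(fpr)
    fpr, sens = fpr[order], sens[order]
    return float(np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2.0))

def partial_auc(fpr, sens, fpr_max):
    """PAUC over FPR in [0, fpr_max], interpolating linearly at the cutoff."""
    fpr, sens = np.asarray(fpr, float), np.asarray(sens, float)
    order = np.argsort(fpr)
    fpr, sens = fpr[order], sens[order]
    grid = np.union1d(fpr[fpr <= fpr_max], [fpr_max])
    return empiric_auc(grid, np.interp(grid, fpr, sens))

# Empiric points from Table 1, plus the (0, 0) and (1, 1) corners.
fpr = [0.00, 0.03, 0.10, 0.25, 0.35, 1.00]
sens = [0.00, 0.10, 0.70, 0.80, 0.95, 1.00]

auc = empiric_auc(fpr, sens)            # 0.86325, the 0.863 quoted for Fig. 1
pauc = partial_auc(fpr, sens, 0.20)     # PAUC for FPR <= 0.20
partial_area_index = pauc / 0.20        # divide by the FPR range examined
print(auc, pauc, partial_area_index)
```

With these points the full area is 0.86325, matching the empiric AUC of 0.863 quoted in the Figure 1 caption; the PAUC and index printed here describe the same fictitious mammography data, not the curves of Figure 2.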
The ROC Study

Weinstein et al. [1] describe the common features of a study of the accuracy of a diagnostic test. These include samples from both patients with and those without the disease of interest and a reference standard for determining whether positive test results are true-positives or false-positives, and whether negative test results are true-negatives or false-negatives. They also discuss the need to blind reviewers who are interpreting test images and other relevant biases common to these types of studies.

In ROC studies we also require that the test results, or the interpretations of the test images, be assigned a numeric value or rank. These numeric measurements or ranks are the basis for defining the decision thresholds that yield the estimates of sensitivity and specificity that are plotted to form the ROC curve. Some diagnostic tests yield an objective measurement (e.g., attenuation value of a lesion). The decision thresholds for constructing the ROC curve are based on increasing the values of the attenuation coefficient.

Fig. 2.—Two examples illustrate advantages of receiver operating characteristic (ROC) curves (see text for explanation) and comparing summary measures of accuracy. A, ROC curve Y (dotted line) has same area under curve (AUC) as ROC curve X (solid line), but lower partial area under curve (PAUC) when false-positive rate (FPR) is ≤ 0.20, and higher PAUC when FPR > 0.20. B, ROC curve Z (dashed line) has same PAUC as curve X (solid line) when FPR ≤ 0.20 but lower AUC.

Fig. 3.—Unobserved binormal distribution that was assumed to underlie test results in Table 1. Distribution for nondiseased patients was arbitrarily centered at 0 with SD of 1 (i.e., µ0 = 0 and σ0 = 1). Binormal parameters were estimated to be A = 2.27 and B = 1.70. Thus, distribution for diseased patients is centered at µ1 = 1.335 with SD of σ1 = 0.588. Four cutoffs, z1, z2, z3, and z4, correspond to four decision thresholds in Table 1. If underlying test value is less than z1, then mammographer assigns test result of "normal." If the underlying test value is less than z2 but greater than z1, then mammographer assigns test result of "benign," and so forth.


Other diagnostic tests must be interpreted by a trained observer, often a radiologist, and so the interpretation is subjective. Two general scales are often used in radiology for observers to assign a value to their subjective interpretation of an image. One scale is the 5-point rank scale: 1 = definitely normal, 2 = probably normal, 3 = possibly abnormal or equivocal, 4 = probably abnormal, and 5 = definitely abnormal. The other popular scale is the 0–100% confidence scale, where 0% implies that the observer is completely confident in the absence of the disease of interest, and 100% implies that the observer is completely confident in the presence of the disease of interest. The two scales have strengths and weaknesses [2, 8], but both are reasonably well suited to radiology research. In mammography a rating scale already exists, the BI-RADS score, which can be used to form decision thresholds from least to most suspicion for the presence of breast cancer.

When the diagnostic test requires a subjective interpretation by a trained reviewer, the reviewer becomes part of the diagnostic process [9]. Thus, to properly characterize the accuracy of the diagnostic test, we must include multiple reviewers in the study. This is the so-called MRMC (multiple-reader multiple-case) ROC study. Much has been written about the design and analysis of MRMC studies [10–20]. We mention here only the basic design of MRMC studies, and in a later subsection we describe their statistical analysis.

The usual design for the MRMC study is a factorial design, in which every reviewer interprets the image (or images if there is more than one test) of every patient. Thus, if there are R reviewers, C patients, and I diagnostic tests, then each reviewer interprets C × I images, and the study involves R × C × I total interpretations. The accuracy of each reviewer with each diagnostic test is characterized by an ROC curve, so R × I ROC curves are constructed. Constructing pooled or consensus ROC curves is not the goal of these studies. Rather, the primary goals are to document the variability in diagnostic test accuracy between reviewers and report the average, or typical, accuracy of reviewers. In order for the results of the study to be generalizable to the relevant patient and reviewer populations, representative samples from both populations are needed for the study. Often expert reviewers take part in studies of diagnostic test accuracy, but the accuracy for a nonexpert may be considerably less.

Fig. 4.—Three receiver operating characteristic (ROC) curves with same binormal parameter A (i.e., A = 1.0) but different values for parameter B of 3.0 (3σ1 = σo), 1.0 (σ1 = σo), and 0.33 (σ1 = 3σo). When B = 3.0, ROC curve dips below chance diagonal; this is called an improper ROC curve [2].
An excellent illustration of the issues involved in sampling reviewers for an MRMC study can be found in the study by Beam et al. [21].

Examples of ROC Studies in Radiology

The radiology literature, and the clinical laboratory and more general medical literature, contain many excellent examples of how ROC curves are used to characterize the accuracy of a diagnostic test and to compare accuracies of diagnostic tests. We briefly describe here three recent examples of ROC curves being used in the radiology literature.

Kim et al. [22] conducted a prospective study to determine if rectal distention using warm water improves the accuracy of MRI for preoperative staging of rectal cancer. After MRI, the patients underwent surgical resection, considered the gold standard regarding the invasion of adjacent structures and regional lymph node involvement. Four observers, unaware of the pathology results, independently scored the MR images using 4- and 5-point rating scales. Using statistical methods for MRMC studies [13], the authors determined that typical reviewers' accuracy for determining outer wall penetration is improved with rectum distention, but that reviewer accuracy for determining regional lymph node involvement is not affected.

Osada et al. [23] used ROC analysis to assess the ability of MRI to predict fetal pulmonary hypoplasia. They imaged 87 fetuses, measuring both lung volume and signal intensity. An ROC curve based on lung volume showed that lung volume has some ability to discriminate between fetuses who will have good versus those who will have poor respiratory outcome after birth. An ROC curve based on the combined information from lung volume and signal intensity, however, has superior accuracy. For more information on the optimal way to combine measures or test results, see the article by Pepe and Thompson [24].

In a third study, Zheng et al. [25] assessed how the accuracy of a mammographic computer-aided detection (CAD) scheme was affected by restricting the maximum number of regions that could be identified as positive. Using a sample of 300 cases with a malignant mass and 200 normals, the investigators applied their CAD system, each time reducing the maximum number of positive regions that the CAD system could identify from seven to one. A special ROC technique called "free-response receiver operating characteristic curves" (FROC) was used. The horizontal axis of the FROC curve differs from that of the traditional ROC curve in that it gives the average number of false-positives per image. Zheng et al. concluded that limiting the maximum number of positive regions that the CAD could identify improves the overall accuracy of CAD in mammography.
For more information on FROC curves and related methods, I refer you to other articles [26–29].

Statistical Methods for ROC Analysis

Fitting Smooth ROC Curves

In Figure 1 we saw the empiric ROC curve for the test results in Table 1. The curve was constructed with line segments connecting the observed points on the ROC curve. Empiric ROC curves often have a jagged appearance, as seen in Figure 1, and often lie slightly below the "true," smooth, ROC curve—that is, the test's ROC curve if it were constructed with an infinite number of points (not just the four points in Fig. 1) and an infinitely large sample size. A smooth curve gives us a better idea of the relationship between the diagnostic test and the disease.


In this subsection we describe some methods for constructing smooth ROC curves.

The most popular method of fitting a smooth ROC curve is to assume that the test results (e.g., the BI-RADS scores in Table 1) come from two unobserved distributions, one distribution for the patients with disease and one for the patients without the disease. Usually it is assumed that these two distributions can be transformed to normal distributions, referred to as the binormal assumption. It is the unobserved, underlying distributions that we assume can be transformed to follow a binormal distribution, and not the observed test results. Figure 3 illustrates the hypothesized unobserved binormal distribution estimated for the observed BI-RADS results in Table 1. Note how the distributions for the diseased and nondiseased patients overlap.

Let the unobserved binormal variables for the nondiseased and diseased patients have means µ0 and µ1, and variances σ0² and σ1², respectively. Then it can be shown [30] that the ROC curve is completely described by two parameters:

A = (µ1 − µ0) / σ1   (2)

B = σ0 / σ1.   (3)

(See Appendix 1 for a formula that links parameters A and B to the ROC curve.) Figure 4 illustrates three ROC curves. Parameter A was set to be constant at 1.0 and parameter B varies as follows: 0.33 (the underlying distribution of the diseased patients is three times more variable than that of the nondiseased patients), 1.0 (the two distributions have the same SD), and 3.0 (the underlying distribution of the nondiseased patients is three times more variable than that of the diseased patients). As one can see, the curves differ dramatically with changes in parameter B. Parameter A, on the other hand, determines how far the curve is above the chance diagonal (where A = 0); for a constant B parameter, the greater the value of A, the higher the ROC curve lies (i.e., greater accuracy).

Parameters A and B can be estimated from data such as in Table 1 using maximum likelihood methods [30, 31]. For the data in Table 1, the maximum likelihood estimates (MLEs) of parameters A and B are 2.27 and 1.70, respectively; the smooth ROC curve is given in Figure 1. Fortunately, some useful software [32] has been written to perform the necessary calculations of A and B, along with estimation of the area under the smooth curve (see next subsection), its SE and confidence interval (CI), and CIs for the ROC curve itself (see Appendix 1).

Dorfman and Alf [30] suggested a statistical test to evaluate whether the binormal assumption is reasonable for a given data set.
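Once A and B are available (from ROCKIT or any other maximum likelihood routine), the fitted curve can be traced with the parameterization given in Appendix 1. The short sketch below is illustrative only and simply plugs in the estimates quoted above.

```python
import numpy as np
from scipy.stats import norm

def binormal_roc(A, B, n_points=200):
    """(FPR, sensitivity) pairs along the smooth binormal ROC curve (Appendix 1)."""
    c = np.linspace(-4.0, 4.0, n_points)      # cutoffs on the unobserved scale
    fpr = 1.0 - norm.cdf(c)
    sens = 1.0 - norm.cdf(B * c - A)
    return fpr, sens

def binormal_auc(A, B):
    """Closed-form area under the smooth binormal ROC curve (Appendix 1)."""
    return float(norm.cdf(A / np.sqrt(1.0 + B ** 2)))

A, B = 2.27, 1.70                      # MLEs reported for the Table 1 data
fpr, sens = binormal_roc(A, B)         # points that trace the fitted curve in Fig. 1
print(binormal_auc(A, B))              # ~0.875; Fig. 1 quotes 0.876 (rounding of A and B)
```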
Others [33, 34] have shown through empiric investigation and simulation studies that many different underlying distributions are well approximated by the binormal assumption.

When the diagnostic test results are themselves a continuous measurement (e.g., CT attenuation values, or measured lesion diameter), it may not be necessary to assume the existence of an unobserved, underlying distribution. Sometimes continuous-scale test results themselves follow a binormal distribution, but caution should be taken that the fit is good (see the article by Goddard and Hinberg [35] for a discussion of the resulting bias when the distribution is not truly binormal yet the binormal distribution is assumed). Zou et al. [36] suggest using a Box-Cox transformation to transform data to binormality. Alternatively, one can use software like ROCKIT [32] that will bin the test results into an optimal number of categories and apply the same maximum likelihood methods as mentioned earlier for rating data like the BI-RADS scores.

TABLE 2: Estimating Area Under Empirical Receiver Operating Characteristic Curve

Test Result (Nondiseased)   Test Result (Diseased)   Score   No. of Pairs       Score × No. of Pairs
Normal                      Normal                   1/2     65 × 5 = 325       162.5
Normal                      Benign                   1       65 × 15 = 975      975
Normal                      Probably benign          1       65 × 10 = 650      650
Normal                      Suspicious               1       65 × 60 = 3,900    3,900
Normal                      Malignant                1       65 × 10 = 650      650
Benign                      Normal                   0       10 × 5 = 50        0
Benign                      Benign                   1/2     10 × 15 = 150      75
Benign                      Probably benign          1       10 × 10 = 100      100
Benign                      Suspicious               1       10 × 60 = 600      600
Benign                      Malignant                1       10 × 10 = 100      100
Probably benign             Normal                   0       15 × 5 = 75        0
Probably benign             Benign                   0       15 × 15 = 225      0
Probably benign             Probably benign          1/2     15 × 10 = 150      75
Probably benign             Suspicious               1       15 × 60 = 900      900
Probably benign             Malignant                1       15 × 10 = 150      150
Suspicious                  Normal                   0       7 × 5 = 35         0
Suspicious                  Benign                   0       7 × 15 = 105       0
Suspicious                  Probably benign          0       7 × 10 = 70        0
Suspicious                  Suspicious               1/2     7 × 60 = 420       210
Suspicious                  Malignant                1       7 × 10 = 70        70
Malignant                   Normal                   0       3 × 5 = 15         0
Malignant                   Benign                   0       3 × 15 = 45        0
Malignant                   Probably benign          0       3 × 10 = 30        0
Malignant                   Suspicious               0       3 × 60 = 180       0
Malignant                   Malignant                1/2     3 × 10 = 30        15
Total                                                        10,000 pairs       8,632.5
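The bookkeeping in Table 2 can be reproduced directly from the Table 1 counts; the scoring rule it applies (1, 1/2, or 0 for each diseased–nondiseased pair) is spelled out in the Estimating the Area Under the ROC Curve subsection below. This is a minimal sketch, not the article's software, and the names are illustrative.

```python
import numpy as np

# Table 1 counts by category, indexed from least (0) to most (4) suspicious.
not_malignant = np.array([65, 10, 15, 7, 3])    # nondiseased patients
malignant = np.array([5, 15, 10, 60, 10])       # diseased patients

total_score = 0.0
for i, n_nondiseased in enumerate(not_malignant):
    for j, n_diseased in enumerate(malignant):
        if j > i:
            score = 1.0      # diseased result more suspicious than nondiseased result
        elif j == i:
            score = 0.5      # tied results
        else:
            score = 0.0      # diseased result less suspicious
        total_score += score * n_nondiseased * n_diseased   # one score per patient pair

n_pairs = not_malignant.sum() * malignant.sum()              # 100 x 100 = 10,000 pairs
print(total_score, total_score / n_pairs)                    # 8632.5 and 0.86325
```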


More elaborate models for the ROC curve that can take into account covariates (e.g., the patient's age, symptoms) have also been developed in the statistics literature [37–39] and will become more accessible as new software is written.

TABLE 3: Fictitious Data Comparing the Accuracy of Two Diagnostic Tests

                                       ROC Curve X    ROC Curve Y
Estimated AUC                          0.841          0.841
Estimated SE of AUC                    0.041          0.045
Estimated PAUC where FPR < 0.20        0.112          0.071
Estimated SE of PAUC                   0.019          0.014
Estimated covariance                   0.00001
Z test comparing PAUCs                 Z = [0.112 − 0.071] / √[0.019² + 0.014² − 0.00002]
95% CI for difference in PAUCs         [0.112 − 0.071] ± 1.96 × √[0.019² + 0.014² − 0.00002]

Note.—AUC = area under the curve, PAUC = partial area under the curve, CI = confidence interval.

Estimating the Area Under the ROC Curve

Estimation of the area under the smooth curve, assuming a binormal distribution, is described in Appendix 1. In this subsection, we describe and illustrate estimation of the area under the empiric ROC curve. The process of estimating the area under the empiric ROC curve is nonparametric, meaning that no assumptions are made about the distribution of the test results or about any hypothesized underlying distribution. The estimation works for tests scored with a rating scale, a 0–100% confidence scale, or a true continuous-scale variable.

The process of estimating the area under the empiric ROC curve involves four simple steps: First, the test result of a patient with disease is compared with the test result of a patient without disease. If the former test result indicates more suspicion of disease than the latter test result, then a score of 1 is assigned. If the test results are identical, then a score of 1/2 is assigned. If the diseased patient has a test result indicating less suspicion for disease than the test result of the nondiseased patient, then a score of 0 is assigned. It does not matter which diseased and nondiseased patient you begin with. Using the data in Table 1 as an illustration, suppose we start with a diseased patient assigned a test result of "normal" and a nondiseased patient assigned a test result of "normal." Because their test results are the same, this pair is assigned a score of 1/2.

Second, repeat the first step for every possible pair of diseased and nondiseased patients in your sample. In Table 1 there are 100 diseased patients and 100 nondiseased patients, thus 10,000 possible pairs.
Because there are only five unique test results, the 10,000 possible pairs can be scored easily, as in Table 2.

Third, sum the scores of all possible pairs. From Table 2, the sum is 8,632.5.

Fourth, divide the sum from step 3 by the number of pairs in the study sample. In our example we have 10,000 pairs. Dividing the sum from step 3 by 10,000 gives us 0.86325, which is our estimate of the area under the empiric ROC curve. Note that this method of estimating the area under the empiric ROC curve gives the same result as one would obtain by fitting trapezoids under the curve and summing the areas of the trapezoids (so-called trapezoid method).

The variance of the estimated area under the empiric ROC curve is given by DeLong et al. [40] and can be used for constructing CIs; software programs are available for estimating the nonparametric AUC and its variance [41].

Comparing the AUCs or PAUCs of Two Diagnostic Tests

To test whether the AUC (or PAUC) of one diagnostic test (denoted by AUC1) equals the AUC (or PAUC) of another diagnostic test (AUC2), the following test statistic is calculated:

Z = [AUC1 − AUC2] / √[var1 + var2 − 2 × cov],   (4)

where var1 is the estimated variance of AUC1, var2 is the estimated variance of AUC2, and cov is the estimated covariance between AUC1 and AUC2. When different samples of patients undergo the two diagnostic tests, the covariance equals zero. When the same sample of patients undergoes both diagnostic tests (i.e., a paired study design), then the covariance is not generally equal to zero and is often positive. The estimated variances and covariances are standard output for most ROC software [32, 41].

The test statistic Z follows a standard normal distribution. For a two-tailed test with a significance level of 0.05, the critical values are −1.96 and +1.96. If Z is less than −1.96, then we conclude that the accuracy of diagnostic test 2 is superior to that of diagnostic test 1; if Z exceeds +1.96, then we conclude that the accuracy of diagnostic test 1 is superior to that of diagnostic test 2.

A two-sided CI for the difference in AUC (or PAUC) between two diagnostic tests can be calculated from

LL = [AUC1 − AUC2] − zα/2 × √[var1 + var2 − 2 × cov]   (5)

UL = [AUC1 − AUC2] + zα/2 × √[var1 + var2 − 2 × cov],   (6)

where LL is the lower limit of the CI, UL is the upper limit, and zα/2 is a value from the standard normal distribution corresponding to a probability of α/2. For example, to construct a 95% CI, α = 0.05, thus zα/2 = 1.96.

Consider the ROC curves in Figure 2A. The estimated areas under the smooth ROC curves of the two tests are the same, 0.841. The PAUCs where the FPR is less than 0.20, however, differ. From the estimated variances and covariance in Table 3, the value of the Z statistic for comparing the PAUCs is 1.77, which is not statistically significant.
The 95% CI for the difference in PAUCs is more informative: (−0.004 to 0.086); the CI for the partial area index is (−0.02 to 0.43). The CI contains large positive differences, suggesting that more research is needed to investigate the relative accuracies of these two diagnostic tests for FPRs less than 0.20.
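Equations (4)–(6) reduce the Table 3 comparison above to a few lines. A minimal sketch, plugging in the Table 3 estimates (the reported SEs are squared to serve as var1 and var2):

```python
import math

# Estimates from Table 3 (PAUC where FPR < 0.20, paired design).
pauc_x, pauc_y = 0.112, 0.071
se_x, se_y = 0.019, 0.014
cov = 0.00001                                   # estimated covariance

se_diff = math.sqrt(se_x**2 + se_y**2 - 2.0 * cov)
z = (pauc_x - pauc_y) / se_diff                 # equation (4): 1.77, not significant

z_crit = 1.96                                   # two-sided test at the 0.05 level
lower = (pauc_x - pauc_y) - z_crit * se_diff    # equation (5): about -0.004
upper = (pauc_x - pauc_y) + z_crit * se_diff    # equation (6): about 0.086
print(round(z, 2), round(lower, 3), round(upper, 3))
```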


Analysis of MRMC ROC Studies

Multiple published methods discuss performing the statistical analysis of MRMC studies [13–20]. The methods are used to construct CIs for diagnostic accuracy and statistical tests for assessing differences in accuracy between tests. A statistical overview of the methods is given elsewhere [10]. Here, we briefly mention some of the key issues of MRMC ROC analyses.

Fixed- or random-effects models.—The MRMC study has two samples, a sample of patients and a sample of reviewers. If the study results are to be generalized to patients similar to ones in the study sample and to reviewers similar to ones in the study sample, then a statistical analysis that treats both patients and reviewers as random effects should be used [13, 14, 17–20]. If the study results are to be generalized to just patients similar to ones in the study sample, then the patients are treated as random effects but the reviewers should be treated as fixed effects [13–20]. Some of the statistical methods can treat reviewers as either random or fixed, whereas other methods treat reviewers only as fixed effects.

Parametric or nonparametric.—Some of the methods rely on models that make strong assumptions about how the accuracies of the reviewers are correlated and distributed (parametric methods) [13, 14], other methods are more flexible [15, 20], and still others make no assumptions (nonparametric methods) [16–19]. The parametric methods may be more powerful when their assumptions are met, but often it is difficult to determine if the assumptions are met.

Covariates.—Reviewers' accuracy may be affected by their training or experience or by characteristics of the patients (e.g., age, sex, stage of disease, comorbidities). These variables are called covariates. Some of the statistical methods [15, 20] have models that can include covariates. These models provide valuable insight into the variability between reviewers and between patients.

Software.—Software is available for public use for some of the methods [32, 42, 43]; the authors of the other methods may be able to provide software if contacted.

Determining Sample Size for ROC Studies

Many issues must be considered in determining the number of patients needed for an ROC study. We list several of the key issues and some useful references here, followed by a simple illustration. Software is also available for determining the required sample size for some ROC study designs [32, 41].

1. Is it an MRMC ROC study? Many radiology studies include more than one reviewer but are not considered MRMC studies. MRMC studies usually involve five or more reviewers and focus on estimating the average accuracy of the reviewers. In contrast, many radiology studies include two or three reviewers to get some idea of the interreviewer variability. Estimation of the required sample size for MRMC studies requires balancing the number of reviewers in the reviewer sample with the number of patients in the patient sample. See [14, 44] for formulae for determining sample sizes for MRMC studies and [45] for sample size tables for MRMC studies. Sample size determination for non-MRMC studies is based on the number of patients needed.

2. Will the study involve a single diagnostic test or compare two or more diagnostic tests?
ROC studies comparing two or more diagnostic tests are common. These studies focus on the difference between AUCs or PAUCs of the two (or more) diagnostic tests. Sample size can be based either on planning for enough statistical power to detect a clinically important difference, or on constructing a CI for the difference in accuracies that is narrow enough to make clinically relevant conclusions from the study. In studies of one diagnostic test, we often focus on the magnitude of the test's AUC or PAUC, basing sample size on the desired width of a CI.

3. If two or more diagnostic tests are being compared, will it be a paired or unpaired study design, and are the accuracies of the tests hypothesized to be different or equivalent? Paired designs almost always require fewer patients than an unpaired design, and so are used whenever they are logistically, ethically, and financially feasible. Studies that are performed to determine whether two or more tests have the same accuracy are called equivalency studies. Often in radiology a less invasive diagnostic test, or a quicker imaging sequence, is developed and compared with the standard test. The investigator wants to know if the test is similar in accuracy to the standard test. Equivalency studies often require a larger sample size than studies in which the goal is to show that one test has superior accuracy to another test. The reason is that to show equivalence the investigator must rule out all large differences between the tests—that is, the CI for the difference must be very narrow.

4. Will the patients be recruited in a prospective or retrospective fashion? In prospective designs, patients are recruited based on their signs or symptoms, so at the time of recruitment it is unknown whether the patient has the disease of interest. In contrast, in retrospective designs patients are recruited based on their known true disease status (as determined by the gold or reference standard) [2]. Both designs are used commonly in radiology. Retrospective studies often require fewer patients than prospective designs.

5. What will be the ratio of nondiseased to diseased patients in the study sample? Let k denote the ratio of the number of nondiseased to diseased patients in the study sample. For retrospective studies k is usually decided in the design phase of the study. For prospective designs k is unknown in the design phase but can be estimated by (1 − PREVp) / PREVp, where PREVp is the prevalence of disease in the relevant population. A range of values for PREVp should be considered when determining sample size.

6. What summary measure of accuracy will be used? In this article we have focused mainly on the AUC and PAUC, but others are possible (see [2]). The choice of summary measure determines which variance function formula will be used in calculating sample size. Note that the variance function is related to the variance by the following formula: variance = VF / N, where VF is the variance function and N is the number of study patients with disease.

7. What is the conjectured accuracy of the diagnostic test?
The conjectured accuracy is needed to determine the expected difference in accuracy between two or more diagnostic tests. Also, the magnitude of the accuracy affects the variance function. In the following example, we present the variance function for the AUC; see Zhou et al. [2] for formulae for other variance functions.

Consider the following example. Suppose an investigator wants to conduct a study to determine if MRI can distinguish benign from malignant breast lesions. Patients with a suspicious lesion detected on mammography will be prospectively recruited to undergo MRI before biopsy. The pathology results will be the reference standard. The MR images will be interpreted independently by two reviewers; they will score the lesions using a 0–100% confidence scale. An ROC curve will be constructed for each reviewer; AUCs will be estimated, and 95% CIs for the AUCs will be constructed. If MRI shows some promise, the investigator will plan a larger MRMC study.

The investigator expects 20–40% of patients to have pathologically confirmed breast cancer (PREVp = 0.2–0.4); thus, k = 1.5–4.0. The investigator expects the AUC of MRI to be approximately 0.80 or higher. The variance function of the AUC often used for sample size calculations is as follows:

VF = (0.0099 × e^(−A²/2)) × [(5 × A² + 8) + (A² + 8) / k],   (7)


where A is the parameter from the binormal distribution. Parameter A can be calculated from A = φ⁻¹(AUC) × 1.414, where φ⁻¹ is the inverse of the cumulative normal distribution function [2]. For our example, AUC = 0.80; thus φ⁻¹(0.80) = 0.84 and A = 1.18776. The variance function, VF, equals (0.00489) × [(15.05387) + (9.41077) / 4.0] = 0.08512, where we have set k = 4.0. For k = 1.5, VF = 0.10429.

Suppose the investigator wants a 95% CI no wider than 0.10. That is, if the estimated AUC from the study is 0.80, then the lower bound of the CI should not be less than 0.75 and the upper bound should not exceed 0.85. A formula for calculating the required sample size for a CI is

N = [(zα/2)² × VF] / L²,   (8)

where zα/2 = 1.96 for a 95% CI and L is the desired half-width of the CI. Here, L = 0.05. N is the number of patients with disease needed for the study; the total number of patients needed for the study is N × (1 + k). For our example, N equals [1.96² × 0.08512] / 0.05² = 130.8 for k = 4.0, and 160.3 for k = 1.5. Thus, depending on the unknown prevalence of breast cancer in the study sample, the investigator needs to recruit perhaps as few as 401 total patients (if the sample prevalence is 40%) but perhaps as many as 654 (if the sample prevalence is only 20%).
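The sample-size arithmetic above is easy to check in code. The sketch below assumes the same inputs as the worked example (AUC = 0.80, CI half-width L = 0.05, k = 4.0 or 1.5) and is illustrative only; it reproduces the quoted numbers to within the rounding of φ⁻¹(0.80) ≈ 0.84 used in the text.

```python
import math
from scipy.stats import norm

def diseased_needed(auc, k, half_width, alpha=0.05):
    """Diseased patients needed for a CI of the given half-width (equations 7 and 8)."""
    A = norm.ppf(auc) * math.sqrt(2.0)                # binormal parameter A
    vf = (0.0099 * math.exp(-A * A / 2.0)) * ((5.0 * A**2 + 8.0) + (A**2 + 8.0) / k)
    z = norm.ppf(1.0 - alpha / 2.0)                   # 1.96 for a 95% CI
    return (z**2 * vf) / half_width**2

for k in (4.0, 1.5):                                  # prevalence of 20% and 40%
    n_diseased = diseased_needed(auc=0.80, k=k, half_width=0.05)
    n_total = n_diseased * (1.0 + k)                  # total sample size, N x (1 + k)
    print(f"k = {k}: N = {n_diseased:.1f} diseased, about {n_total:.0f} patients in total")
```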
Finding the Optimal Point on the Curve

Metz [46] derived a formula for determining the optimal decision threshold on the ROC curve, where "optimal" is in terms of minimizing the overall costs. "Costs" can be defined as monetary costs, patient morbidity and mortality, or both. The slope, m, of the ROC curve at the optimal decision threshold is

m = [(1 − PREVp) / PREVp] × [(CFP − CTN) / (CFN − CTP)],   (9)

where CFP, CTN, CFN, and CTP are the costs of false-positive, true-negative, false-negative, and true-positive results, respectively. Once m is estimated, the optimal decision threshold is the one for which sensitivity and specificity maximize the following expression: [sensitivity − m(1 − specificity)] [47].

Examining the ROC curve labeled X in Figure 2, we see that the slope is very steep in the lower left where both the sensitivity and FPR are low, and is close to zero at the upper right where the sensitivity and FPR are high. The slope takes on a high value when the patient is unlikely to have the disease or the cost of a false-positive is large; for these situations, a low FPR is optimal. The slope takes on a value near zero when the patient is likely to have the disease or treatment for the disease is beneficial and carries little risk to healthy patients; in these situations, a high sensitivity is optimal [3]. A nice example of a study using this equation is given in [48]. See also work by Greenhouse and Mantel [49] and Linnet [50] for determining the optimal decision threshold when a desired level for the sensitivity, specificity, or both is specified a priori.
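A brief sketch of equation (9) and the threshold-selection expression from Zweig and Campbell [47]; the prevalence and cost values below are made up purely for illustration, and the operating points are the empiric Table 1 points.

```python
import numpy as np

def optimal_slope(prev, c_fp, c_tn, c_fn, c_tp):
    """Slope m of the ROC curve at the cost-minimizing threshold (equation 9)."""
    return (1.0 - prev) / prev * (c_fp - c_tn) / (c_fn - c_tp)

def best_operating_point(fpr, sens, m):
    """Index of the point maximizing sensitivity - m * (1 - specificity)."""
    fpr, sens = np.asarray(fpr, float), np.asarray(sens, float)
    return int(np.argmax(sens - m * fpr))     # note: 1 - specificity = FPR

# Empiric operating points from Table 1; hypothetical prevalence and costs.
fpr = [0.03, 0.10, 0.25, 0.35]
sens = [0.10, 0.70, 0.80, 0.95]
m = optimal_slope(prev=0.10, c_fp=1.0, c_tn=0.0, c_fn=10.0, c_tp=0.0)
i = best_operating_point(fpr, sens, m)
print(m, fpr[i], sens[i])    # with these made-up costs, the most lenient rule wins
```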
Conclusion

Applications of ROC curves in the medical literature have increased greatly in the past few decades, and with this expansion many new statistical methods of ROC analysis have been developed. These include methods that correct for common biases like verification bias and imperfect gold standard bias, methods for combining the information from multiple diagnostic tests (i.e., optimal combinations of tests) and multiple studies (i.e., meta-analysis), and methods for analyzing clustered data (i.e., multiple observations from the same patient). Interested readers can search directly for these statistical methods or consult two recently published books on ROC curve analysis and related topics [2, 39]. Available software for ROC analysis allows investigators to easily fit, evaluate, and compare ROC curves [41, 51], although users should be cautious about the validity of the software and check the underlying methods and assumptions.

Acknowledgments

I thank the two series' coeditors and an outside statistician for their helpful comments on an earlier draft of this manuscript.

References

1. Weinstein S, Obuchowski NA, Lieber ML. Clinical evaluation of diagnostic tests. AJR 2005;184:14–19
2. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: Wiley-Interscience, 2002
3. Metz CE. Some practical issues of experimental design and data analysis in radiologic ROC studies. Invest Radiol 1989;24:234–245
4. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36
5. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989;9:190–195
6. Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991;11:95–101
7. Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med 2000;19:431–440
8. Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: some quantitative considerations. Acad Radiol 2001;8:328–334
9. Beam CA, Baker ME, Paine SS, Sostman HD, Sullivan DC. Answering unanswered questions: proposal for a shared resource in clinical diagnostic radiology research. Radiology 1992;183:619–620
10. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol 2004;11:980–995
11. Obuchowski NA. Multi-reader ROC studies: a comparison of study designs. Acad Radiol 1995;2:709–716
12. Roe CA, Metz CE. Variance-component modeling in the analysis of receiver operating characteristic index estimates. Acad Radiol 1997;4:587–600
13. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992;27:723–731
14. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Acad Radiol 1995;2:S22–S29
15. Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Stat Med 1996;15:1807–1826
16. Song HH. Analysis of correlated ROC areas in diagnostic testing. Biometrics 1997;53:370–382
17. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 2000;7:341–349
18. Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: the case of unequal variance structure across modalities. Acad Radiol 2001;8:605–615
19. Beiden SV, Wagner RF, Campbell G, Chan HP. Analysis of uncertainties in estimates of components of variance in multivariate ROC analysis. Acad Radiol 2001;8:616–622
20. Ishwaran H, Gatsonis CA. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000;28:731–750
21. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996;156:209–213
22. Kim MJ, Lim JS, Oh YT, et al. Preoperative MRI of rectal cancer with and without rectal water filling: an intraindividual comparison. AJR 2004;182:1469–1476
23. Osada H, Kaku K, Masuda K, Iitsuka Y, Seki K, Sekiya S. Quantitative and qualitative evaluations of fetal lung with MR imaging. Radiology 2004;231:887–892
24. Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics 2000;1:123–140


25. Zheng B, Leader JK, Abrams G, et al. Computer-aided detection schemes: the effect of limiting the number of cued regions in each case. AJR 2004;182:579–583
26. Chakraborty DP, Winter LHL. Free-response methodology: alternative analysis and a new observer-performance experiment. Radiology 1990;174:873–881
27. Chakraborty DP. Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data. Med Phys 1989;16:561–568
28. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996;23:1709–1725
29. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000;7:516–525
30. Dorfman DD, Alf E. Maximum likelihood estimation of parameters of signal detection theory: a direct solution. Psychometrika 1968;33:117–124
31. Dorfman DD, Alf E. Maximum-likelihood estimation of parameters of signal detection theory and determination of confidence intervals: rating method data. J Math Psychol 1969;6:487–496
32. ROCKIT and LABMRMC. Available at: xray.bsd.uchicago.edu/krl/KRL_ROCsoftware_index.htm. Accessed December 13, 2004
33. Swets JA. Empirical ROCs in discrimination and diagnostic tasks: implications for theory and measurement of performance. Psychol Bull 1986;99:181–198
34. Hanley JA. The robustness of the binormal assumption used in fitting ROC curves. Med Decis Making 1988;8:197–203
35. Goddard MJ, Hinberg I. Receiver operating characteristic (ROC) curves and non-normal data: an empirical study. Stat Med 1990;9:325–337
36. Zou KH, Tempany CM, Fielding JR, Silverman SG. Original smooth receiver operating characteristic curve estimation from continuous data: statistical methods for analyzing the predictive value of spiral CT of ureteral stones. Acad Radiol 1998;5:680–687
37. Pepe MS. A regression modeling framework for receiver operating characteristic curves in medical diagnostic testing. Biometrika 1997;84:595–608
38. Pepe MS. An interpretation for the ROC curve using GLM procedures. Biometrics 2000;56:352–359
39. Pepe MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press, 2003
40. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837–844
41. ROC analysis. Available at: www.bio.ri.ccf.org/Research/ROC/index.html. Accessed December 13, 2004
42. OBUMRM. Available at: www.bio.ri.ccf.org/OBUMRM/OBUMRM.html. Accessed December 13, 2004
43. The University of Iowa Department of Radiology, The Medical Image Perception Laboratory. MRMC 2.0. Available at: perception.radiology.uiowa.edu. Accessed December 13, 2004
44. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Acad Radiol (in press)
45. Obuchowski NA. Sample size tables for receiver operating characteristic studies. AJR 2000;175:603–608
46. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283–298
47. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–577
48. Somoza E, Mossman D. "Biological markers" and psychiatric diagnosis: risk-benefit balancing using ROC analysis. Biol Psychiatry 1991;29:811–826
49. Greenhouse SW, Mantel N. The evaluation of diagnostic tests. Biometrics 1950;6:399–412
50. Linnet K. Comparison of quantitative diagnostic tests: type I error, power, and sample size. Stat Med 1987;6:147–158
51. Stephan C, Wesseling S, Schink T, Jung K. Comparison of eight computer programs for receiver-operating characteristic analysis. Clin Chem 2003;49:433–439
52. Ma G, Hall WJ. Confidence bands for receiver operating characteristic curves. Med Decis Making 1993;13:191–197

APPENDIX 1. Area Under the Curve and Confidence Intervals with Binormal Model

Under the binormal assumption, the receiver operating characteristic (ROC) curve is the collection of points given by

[1 − φ(c), 1 − φ(B × c − A)],

where c ranges from −∞ to +∞ and represents all the possible values of the underlying binormal distribution, and φ is the cumulative normal distribution evaluated at c. For example, for a false-positive rate of 0.10, φ(c) is set equal to 0.90; from tables of the cumulative normal distribution, we have φ(1.28) = 0.90. Suppose A = 2.0 and B = 1.0; then the sensitivity = 1 − φ(−0.72) = 1 − 0.2358 = 0.7642.

ROCKIT [32] gives a confidence interval (CI) for sensitivity at particular false-positive rates (i.e., pointwise CIs). A CI for the entire ROC curve (i.e., a simultaneous CI) is described by Ma and Hall [52].

Under the binormal distribution assumption, the area under the smooth ROC curve (AUC) is given by

AUC = φ[A / √(1 + B²)].

For the example above, AUC = φ[2.0 / √2.0] = φ[1.414] = 0.921.

The variance of the full area under the ROC curve is given as standard output in programs like ROCKIT [32]. An estimator for the variance of the partial area under the curve (PAUC) was given by McClish [5]; a Fortran program is available for estimating the PAUC and its variance [41].
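The Appendix 1 arithmetic can be checked in a few lines (a sketch using the same A = 2.0, B = 1.0 example):

```python
from math import sqrt
from scipy.stats import norm

A, B = 2.0, 1.0
c = norm.ppf(0.90)                       # phi(c) = 0.90, i.e., FPR = 0.10 (c ~ 1.28)
sensitivity = 1.0 - norm.cdf(B * c - A)  # ~0.764 at that false-positive rate
auc = norm.cdf(A / sqrt(1.0 + B**2))     # ~0.921, the closed-form AUC
print(round(sensitivity, 3), round(auc, 3))
```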
The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
