ROC Analysis

TABLE 3: Fictitious Data Comparing the Accuracy of Two Diagnostic Tests

                                      ROC Curve X    ROC Curve Y
Estimated AUC                         0.841          0.841
Estimated SE of AUC                   0.041          0.045
Estimated PAUC where FPR < 0.20       0.112          0.071
Estimated SE of PAUC                  0.019          0.014
Estimated covariance                  0.00001
Z test comparing PAUCs                Z = [0.112 − 0.071] / √[0.019² + 0.014² − 0.00002]
95% CI for difference in PAUCs        [0.112 − 0.071] ± 1.96 × √[0.019² + 0.014² − 0.00002]

Note.—AUC = area under the curve, PAUC = partial area under the curve, CI = confidence interval.

binormality. Alternatively, one can use software like ROCKIT [32] that will bin the test results into an optimal number of categories and apply the same maximum likelihood methods as mentioned earlier for rating data like the BI-RADS scores.

More elaborate models for the ROC curve that can take into account covariates (e.g., the patient's age, symptoms) have also been developed in the statistics literature [37–39] and will become more accessible as new software is written.

Estimating the Area Under the ROC Curve

Estimation of the area under the smooth curve, assuming a binormal distribution, is described in Appendix 1. In this subsection, we describe and illustrate estimation of the area under the empiric ROC curve. The process of estimating the area under the empiric ROC curve is nonparametric, meaning that no assumptions are made about the distribution of the test results or about any hypothesized underlying distribution.
The estimation works for tests scored with a rating scale, a 0–100% confidence scale, or a true continuous-scale variable.

The process of estimating the area under the empiric ROC curve involves four simple steps. First, the test result of a patient with disease is compared with the test result of a patient without disease. If the former test result indicates more suspicion of disease than the latter test result, then a score of 1 is assigned. If the test results are identical, then a score of 1/2 is assigned. If the diseased patient has a test result indicating less suspicion for disease than the test result of the nondiseased patient, then a score of 0 is assigned. It does not matter which diseased and nondiseased patient you begin with. Using the data in Table 1 as an illustration, suppose we start with a diseased patient assigned a test result of "normal" and a nondiseased patient assigned a test result of "normal." Because their test results are the same, this pair is assigned a score of 1/2.

Second, repeat the first step for every possible pair of diseased and nondiseased patients in your sample. In Table 1 there are 100 diseased patients and 100 nondiseased patients, thus 10,000 possible pairs. Because there are only five unique test results, the 10,000 possible pairs can be scored easily, as in Table 2.

Third, sum the scores of all possible pairs. From Table 2, the sum is 8,632.5.

Fourth, divide the sum from step 3 by the number of pairs in the study sample. In our example we have 10,000 pairs. Dividing the sum from step 3 by 10,000 gives us 0.86325, which is our estimate of the area under the empiric ROC curve.
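The four steps above amount to computing the Wilcoxon–Mann–Whitney statistic. A minimal Python sketch follows; the ratings are small hypothetical values (not the Table 1 data), and it assumes higher values indicate more suspicion of disease:

```python
def empiric_auc(diseased, nondiseased):
    """Area under the empiric ROC curve via pairwise scoring.

    Each (diseased, nondiseased) pair scores 1 if the diseased patient's
    result is more suspicious, 1/2 on ties, and 0 otherwise (steps 1-2);
    the scores are summed (step 3) and divided by the number of possible
    pairs (step 4).
    """
    total = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                total += 1.0
            elif d == n:
                total += 0.5
    return total / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings (1 = normal ... 5 = definitely abnormal):
auc = empiric_auc([3, 4, 5, 5], [1, 2, 2, 3])
print(auc)  # pair-score sum of 15.5 over 16 pairs -> 0.96875
```

With 100 diseased and 100 nondiseased patients, the same loop scores all 10,000 pairs, reproducing the hand calculation in the text.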
Note that this method of estimating the area under the empiric ROC curve gives the same result as one would obtain by fitting trapezoids under the curve and summing the areas of the trapezoids (the so-called trapezoid method).

The variance of the estimated area under the empiric ROC curve is given by DeLong et al. [40] and can be used for constructing CIs; software programs are available for estimating the nonparametric AUC and its variance [41].

Comparing the AUCs or PAUCs of Two Diagnostic Tests

To test whether the AUC (or PAUC) of one diagnostic test (denoted by AUC_1) equals the AUC (or PAUC) of another diagnostic test (AUC_2), the following test statistic is calculated:

Z = [AUC_1 − AUC_2] / √[var_1 + var_2 − 2 × cov], (4)

where var_1 is the estimated variance of AUC_1, var_2 is the estimated variance of AUC_2, and cov is the estimated covariance between AUC_1 and AUC_2. When different samples of patients undergo the two diagnostic tests, the covariance equals zero. When the same sample of patients undergoes both diagnostic tests (i.e., a paired study design), then the covariance is not generally equal to zero and is often positive. The estimated variances and covariances are standard output for most ROC software [32, 41].

The test statistic Z follows a standard normal distribution. For a two-tailed test with a significance level of 0.05, the critical values are −1.96 and +1.96.
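As a quick check on the Table 3 calculations, equation (4) and the accompanying 95% CI can be evaluated in a few lines of Python. Note that the tabulated 0.019 and 0.014 are standard errors, so they must be squared to obtain the variances:

```python
import math

def z_statistic(auc1, auc2, var1, var2, cov):
    """Equation (4): Z = (AUC_1 - AUC_2) / sqrt(var_1 + var_2 - 2*cov)."""
    return (auc1 - auc2) / math.sqrt(var1 + var2 - 2.0 * cov)

# Table 3 PAUC values; the SEs 0.019 and 0.014 are squared into variances.
z = z_statistic(0.112, 0.071, 0.019**2, 0.014**2, 0.00001)
print(round(z, 2))  # 1.77 -- between -1.96 and +1.96, so not significant

# 95% CI for the difference in PAUCs, built the same way:
diff = 0.112 - 0.071
half_width = 1.96 * math.sqrt(0.019**2 + 0.014**2 - 2 * 0.00001)
print(round(diff - half_width, 3), round(diff + half_width, 3))  # -0.004 0.086
```

This reproduces the Z = 1.77 statistic and the (−0.004, 0.086) interval reported in the text.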
If Z is less than −1.96, then we conclude that the accuracy of diagnostic test 2 is superior to that of diagnostic test 1; if Z exceeds +1.96, then we conclude that the accuracy of diagnostic test 1 is superior to that of diagnostic test 2.

A two-sided CI for the difference in AUC (or PAUC) between two diagnostic tests can be calculated from

LL = [AUC_1 − AUC_2] − z_α/2 × √[var_1 + var_2 − 2 × cov] (5)

UL = [AUC_1 − AUC_2] + z_α/2 × √[var_1 + var_2 − 2 × cov], (6)

where LL is the lower limit of the CI, UL is the upper limit, and z_α/2 is a value from the standard normal distribution corresponding to a probability of α/2. For example, to construct a 95% CI, α = 0.05, thus z_α/2 = 1.96.

Consider the ROC curves in Figure 2A. The estimated areas under the smooth ROC curves of the two tests are the same, 0.841. The PAUCs where the FPR is less than 0.20, however, differ. From the estimated variances and covariance in Table 3, the value of the Z statistic for comparing the PAUCs is 1.77, which is not statistically significant. The 95% CI for the difference in PAUCs is more informative: (−0.004 to 0.086); the CI for the partial area index is (−0.02 to 0.43). The CI contains large positive differences, suggesting that more research is needed to investigate the relative accuracies of these two diagnostic tests for FPRs less than 0.20.

Analysis of MRMC ROC Studies

Multiple published methods describe the statistical analysis of MRMC studies [13–20]. The methods are used to construct CIs for diagnostic accuracy and statistical tests for assessing differences in accuracy between tests.
A statistical overview of the methods is given elsewhere [10]. Here, we briefly mention some of the key issues of MRMC ROC analyses.

AJR:184, February 2005 369
Obuchowski

Fixed- or random-effects models.—The MRMC study has two samples, a sample of patients and a sample of reviewers. If the study results are to be generalized to patients similar to those in the study sample and to reviewers similar to those in the study sample, then a statistical analysis that treats both patients and reviewers as random effects should be used [13, 14, 17–20]. If the study results are to be generalized to just patients similar to those in the study sample, then the patients are treated as random effects but the reviewers should be treated as fixed effects [13–20]. Some of the statistical methods can treat reviewers as either random or fixed, whereas other methods treat reviewers only as fixed effects.

Parametric or nonparametric.—Some of the methods rely on models that make strong assumptions about how the accuracies of the reviewers are correlated and distributed (parametric methods) [13, 14]; other methods are more flexible [15, 20]; and still others make no assumptions (nonparametric methods) [16–19]. The parametric methods may be more powerful when their assumptions are met, but often it is difficult to determine if the assumptions are met.

Covariates.—Reviewers' accuracy may be affected by their training or experience or by characteristics of the patients (e.g., age, sex, stage of disease, comorbidities). These variables are called covariates. Some of the statistical methods [15, 20] have models that can include covariates.
These models provide valuable insight into the variability between reviewers and between patients.

Software.—Software is available for public use for some of the methods [32, 42, 43]; the authors of the other methods may be able to provide software if contacted.

Determining Sample Size for ROC Studies

Many issues must be considered in determining the number of patients needed for an ROC study. We list several of the key issues and some useful references here, followed by a simple illustration. Software is also available for determining the required sample size for some ROC study designs [32, 41].

1. Is it an MRMC ROC study? Many radiology studies include more than one reviewer but are not considered MRMC studies. MRMC studies usually involve five or more reviewers and focus on estimating the average accuracy of the reviewers. In contrast, many radiology studies include two or three reviewers to get some idea of the interreviewer variability. Estimation of the required sample size for MRMC studies requires balancing the number of reviewers in the reviewer sample with the number of patients in the patient sample. See [14, 44] for formulae for determining sample sizes for MRMC studies and [45] for sample size tables for MRMC studies. Sample size determination for non-MRMC studies is based on the number of patients needed.

2. Will the study involve a single diagnostic test or compare two or more diagnostic tests?
ROC studies comparing two or more diagnostic tests are common. These studies focus on the difference between AUCs or PAUCs of the two (or more) diagnostic tests. Sample size can be based either on planning for enough statistical power to detect a clinically important difference or on constructing a CI for the difference in accuracies that is narrow enough to make clinically relevant conclusions from the study. In studies of one diagnostic test, we often focus on the magnitude of the test's AUC or PAUC, basing sample size on the desired width of a CI.

3. If two or more diagnostic tests are being compared, will it be a paired or unpaired study design, and are the accuracies of the tests hypothesized to be different or equivalent? Paired designs almost always require fewer patients than an unpaired design and so are used whenever they are logistically, ethically, and financially feasible. Studies that are performed to determine whether two or more tests have the same accuracy are called equivalency studies. Often in radiology a less invasive diagnostic test, or a quicker imaging sequence, is developed and compared with the standard test. The investigator wants to know if the new test is similar in accuracy to the standard test. Equivalency studies often require a larger sample size than studies in which the goal is to show that one test has superior accuracy to another test. The reason is that to show equivalence the investigator must rule out all large differences between the tests—that is, the CI for the difference must be very narrow.

4. Will the patients be recruited in a prospective or retrospective fashion? In prospective designs, patients are recruited based on their signs or symptoms, so at the time of recruitment it is unknown whether the patient has the disease of interest. In contrast, in retrospective designs patients are recruited based on their known true disease status (as determined by the gold or reference standard) [2]. Both designs are used commonly in radiology. Retrospective studies often require fewer patients than prospective designs.

5. What will be the ratio of nondiseased to diseased patients in the study sample? Let k denote the ratio of the number of nondiseased to diseased patients in the study sample. For retrospective studies, k is usually decided in the design phase of the study. For prospective designs, k is unknown in the design phase but can be estimated by (1 − PREV_p) / PREV_p, where PREV_p is the prevalence of disease in the relevant population. A range of values for PREV_p should be considered when determining sample size.

6. What summary measure of accuracy will be used? In this article we have focused mainly on the AUC and PAUC, but others are possible (see [2]). The choice of summary measure determines which variance function formula will be used in calculating sample size. Note that the variance function is related to the variance by the following formula: variance = VF / N, where VF is the variance function and N is the number of study patients with disease.

7. What is the conjectured accuracy of the diagnostic test? The conjectured accuracy is needed to determine the expected difference in accuracy between two or more diagnostic tests. Also, the magnitude of the accuracy affects the variance function.
In the following example, we present the variance function for the AUC; see Zhou et al. [2] for formulae for other variance functions.

Consider the following example. Suppose an investigator wants to conduct a study to determine if MRI can distinguish benign from malignant breast lesions. Patients with a suspicious lesion detected on mammography will be prospectively recruited to undergo MRI before biopsy. The pathology results will be the reference standard. The MR images will be interpreted independently by two reviewers; they will score the lesions using a 0–100% confidence scale. An ROC curve will be constructed for each reviewer; AUCs will be estimated, and 95% CIs for the AUCs will be constructed. If MRI shows some promise, the investigator will plan a larger MRMC study.

The investigator expects 20–40% of patients to have pathologically confirmed breast cancer (PREV_p = 0.2–0.4); thus, k = 1.5–4.0. The investigator expects the AUC of MRI to be approximately 0.80 or higher. The variance function of the AUC often used for sample size calculations is as follows:

VF = (0.0099 × e^(−A²/2)) × [(5 × A² + 8) + (A² + 8) / k], (7)
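Equation (7) is straightforward to evaluate in code. The sketch below assumes that A is the binormal parameter corresponding to the conjectured AUC, A = √2 × Φ⁻¹(AUC); this excerpt does not define A, so treat that mapping as an assumption for illustration. It also uses variance = VF / N from issue 6 and the k = (1 − PREV_p) / PREV_p estimate from issue 5:

```python
import math
from statistics import NormalDist

def variance_function(auc, k):
    """Equation (7): VF = (0.0099 * e^(-A^2/2)) * ((5*A^2 + 8) + (A^2 + 8)/k).

    A is assumed here to be the binormal parameter sqrt(2) * Phi^{-1}(AUC);
    k is the ratio of nondiseased to diseased patients in the sample.
    """
    a = math.sqrt(2.0) * NormalDist().inv_cdf(auc)
    return (0.0099 * math.exp(-a * a / 2.0)) * ((5 * a * a + 8) + (a * a + 8) / k)

# Issue 5: k estimated from the expected prevalence range of 0.2-0.4.
for prev in (0.2, 0.4):
    print(round((1 - prev) / prev, 1))  # 4.0, then 1.5

# Conjectured AUC of 0.80 with k = 2 nondiseased per diseased patient:
vf = variance_function(0.80, 2.0)
print(round(vf, 4))
# Issue 6: variance of the estimated AUC with, say, N = 50 diseased patients.
print(vf / 50)
```

Plugging candidate values of N into variance = VF / N then lets the investigator check whether the resulting CI for the AUC would be acceptably narrow.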