ROC Analysis

TABLE 3: Fictitious Data Comparing the Accuracy of Two Diagnostic Tests

                                      ROC Curve X    ROC Curve Y
Estimated AUC                         0.841          0.841
Estimated SE of AUC                   0.041          0.045
Estimated PAUC where FPR < 0.20       0.112          0.071
Estimated SE of PAUC                  0.019          0.014
Estimated covariance                  0.00001
Z test comparing PAUCs                Z = [0.112 − 0.071] / √[0.019² + 0.014² − 0.00002]
95% CI for difference in PAUCs        [0.112 − 0.071] ± 1.96 × √[0.019² + 0.014² − 0.00002]

Note.—AUC = area under the curve, PAUC = partial area under the curve, CI = confidence interval.

binormality. Alternatively, one can use software like ROCKIT [32] that will bin the test results into an optimal number of categories and apply the same maximum likelihood methods as mentioned earlier for rating data like the BI-RADS scores.

More elaborate models for the ROC curve that can take into account covariates (e.g., the patient's age, symptoms) have also been developed in the statistics literature [37–39] and will become more accessible as new software is written.

Estimating the Area Under the ROC Curve

Estimation of the area under the smooth curve, assuming a binormal distribution, is described in Appendix 1. In this subsection, we describe and illustrate estimation of the area under the empiric ROC curve. The process of estimating the area under the empiric ROC curve is nonparametric, meaning that no assumptions are made about the distribution of the test results or about any hypothesized underlying distribution.
The estimation works for tests scored with a rating scale, a 0–100% confidence scale, or a true continuous-scale variable.

The process of estimating the area under the empiric ROC curve involves four simple steps. First, the test result of a patient with disease is compared with the test result of a patient without disease. If the former test result indicates more suspicion of disease than the latter test result, then a score of 1 is assigned. If the test results are identical, then a score of 1/2 is assigned. If the diseased patient has a test result indicating less suspicion for disease than the test result of the nondiseased patient, then a score of 0 is assigned. It does not matter which diseased and nondiseased patient you begin with. Using the data in Table 1 as an illustration, suppose we start with a diseased patient assigned a test result of "normal" and a nondiseased patient assigned a test result of "normal." Because their test results are the same, this pair is assigned a score of 1/2.

Second, repeat the first step for every possible pair of diseased and nondiseased patients in your sample. In Table 1 there are 100 diseased patients and 100 nondiseased patients, thus 10,000 possible pairs. Because there are only five unique test results, the 10,000 possible pairs can be scored easily, as in Table 2.

Third, sum the scores of all possible pairs. From Table 2, the sum is 8,632.5.

Fourth, divide the sum from step 3 by the number of pairs in the study sample. In our example we have 10,000 pairs. Dividing the sum from step 3 by 10,000 gives us 0.86325, which is our estimate of the area under the empiric ROC curve.
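The four steps above amount to computing the Wilcoxon–Mann–Whitney statistic. A minimal Python sketch follows; the ratings are small hypothetical values (not the Table 1 data), and it assumes higher values indicate more suspicion of disease:

```python
def empiric_auc(diseased, nondiseased):
    """Area under the empiric ROC curve via pairwise scoring.

    Each (diseased, nondiseased) pair scores 1 if the diseased patient's
    result is more suspicious, 1/2 on ties, and 0 otherwise (steps 1-2);
    the scores are summed (step 3) and divided by the number of possible
    pairs (step 4).
    """
    total = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                total += 1.0
            elif d == n:
                total += 0.5
    return total / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings (1 = normal ... 5 = definitely abnormal):
auc = empiric_auc([3, 4, 5, 5], [1, 2, 2, 3])
print(auc)  # pair-score sum of 15.5 over 16 pairs -> 0.96875
```

With 100 diseased and 100 nondiseased patients, the same loop scores all 10,000 pairs, reproducing the hand calculation in the text.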
Note that this method of estimating the area under the empiric ROC curve gives the same result as one would obtain by fitting trapezoids under the curve and summing the areas of the trapezoids (the so-called trapezoid method).

The variance of the estimated area under the empiric ROC curve is given by DeLong et al. [40] and can be used for constructing CIs; software programs are available for estimating the nonparametric AUC and its variance [41].

Comparing the AUCs or PAUCs of Two Diagnostic Tests

To test whether the AUC (or PAUC) of one diagnostic test (denoted by AUC_1) equals the AUC (or PAUC) of another diagnostic test (AUC_2), the following test statistic is calculated:

Z = [AUC_1 − AUC_2] / √[var_1 + var_2 − 2 × cov], (4)

where var_1 is the estimated variance of AUC_1, var_2 is the estimated variance of AUC_2, and cov is the estimated covariance between AUC_1 and AUC_2. When different samples of patients undergo the two diagnostic tests, the covariance equals zero. When the same sample of patients undergoes both diagnostic tests (i.e., a paired study design), then the covariance is not generally equal to zero and is often positive. The estimated variances and covariances are standard output for most ROC software [32, 41].

The test statistic Z follows a standard normal distribution. For a two-tailed test with a significance level of 0.05, the critical values are −1.96 and +1.96.
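As a quick check on the Table 3 calculations, equation (4) and the accompanying 95% CI can be evaluated in a few lines of Python. Note that the tabulated 0.019 and 0.014 are standard errors, so they must be squared to obtain the variances:

```python
import math

def z_statistic(auc1, auc2, var1, var2, cov):
    """Equation (4): Z = (AUC_1 - AUC_2) / sqrt(var_1 + var_2 - 2*cov)."""
    return (auc1 - auc2) / math.sqrt(var1 + var2 - 2.0 * cov)

# Table 3 PAUC values; the SEs 0.019 and 0.014 are squared into variances.
z = z_statistic(0.112, 0.071, 0.019**2, 0.014**2, 0.00001)
print(round(z, 2))  # 1.77 -- between -1.96 and +1.96, so not significant

# 95% CI for the difference in PAUCs, built the same way:
diff = 0.112 - 0.071
half_width = 1.96 * math.sqrt(0.019**2 + 0.014**2 - 2 * 0.00001)
print(round(diff - half_width, 3), round(diff + half_width, 3))  # -0.004 0.086
```

This reproduces the Z = 1.77 statistic and the (−0.004, 0.086) interval reported in the text.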
If Z is less than −1.96, then we conclude that the accuracy of diagnostic test 2 is superior to that of diagnostic test 1; if Z exceeds +1.96, then we conclude that the accuracy of diagnostic test 1 is superior to that of diagnostic test 2.

A two-sided CI for the difference in AUC (or PAUC) between two diagnostic tests can be calculated from

LL = [AUC_1 − AUC_2] − z_α/2 × √[var_1 + var_2 − 2 × cov] (5)

UL = [AUC_1 − AUC_2] + z_α/2 × √[var_1 + var_2 − 2 × cov], (6)

where LL is the lower limit of the CI, UL is the upper limit, and z_α/2 is a value from the standard normal distribution corresponding to a probability of α/2. For example, to construct a 95% CI, α = 0.05, thus z_α/2 = 1.96.

Consider the ROC curves in Figure 2A. The estimated areas under the smooth ROC curves of the two tests are the same, 0.841. The PAUCs where the FPR is less than 0.20, however, differ. From the estimated variances and covariance in Table 3, the value of the Z statistic for comparing the PAUCs is 1.77, which is not statistically significant. The 95% CI for the difference in PAUCs is more informative: (−0.004 to 0.086); the CI for the partial area index is (−0.02 to 0.43). The CI contains large positive differences, suggesting that more research is needed to investigate the relative accuracies of these two diagnostic tests for FPRs less than 0.20.

Analysis of MRMC ROC Studies

Multiple published methods describe the statistical analysis of MRMC studies [13–20]. The methods are used to construct CIs for diagnostic accuracy and statistical tests for assessing differences in accuracy between tests.
A statistical overview of the methods is given elsewhere [10]. Here, we briefly mention some of the key issues of MRMC ROC analyses.

AJR:184, February 2005 369
Obuchowski

Fixed- or random-effects models.—The MRMC study has two samples, a sample of patients and a sample of reviewers. If the study results are to be generalized to patients similar to those in the study sample and to reviewers similar to those in the study sample, then a statistical analysis that treats both patients and reviewers as random effects should be used [13, 14, 17–20]. If the study results are to be generalized to just patients similar to those in the study sample, then the patients are treated as random effects but the reviewers should be treated as fixed effects [13–20]. Some of the statistical methods can treat reviewers as either random or fixed, whereas other methods treat reviewers only as fixed effects.

Parametric or nonparametric.—Some of the methods rely on models that make strong assumptions about how the accuracies of the reviewers are correlated and distributed (parametric methods) [13, 14]; other methods are more flexible [15, 20]; and still others make no assumptions (nonparametric methods) [16–19]. The parametric methods may be more powerful when their assumptions are met, but often it is difficult to determine if the assumptions are met.

Covariates.—Reviewers' accuracy may be affected by their training or experience or by characteristics of the patients (e.g., age, sex, stage of disease, comorbidities). These variables are called covariates. Some of the statistical methods [15, 20] have models that can include covariates.
These models provide valuable insight into the variability between reviewers and between patients.

Software.—Software is available for public use for some of the methods [32, 42, 43]; the authors of the other methods may be able to provide software if contacted.

Determining Sample Size for ROC Studies

Many issues must be considered in determining the number of patients needed for an ROC study. We list several of the key issues and some useful references here, followed by a simple illustration. Software is also available for determining the required sample size for some ROC study designs [32, 41].

1. Is it an MRMC ROC study? Many radiology studies include more than one reviewer but are not considered MRMC studies. MRMC studies usually involve five or more reviewers and focus on estimating the average accuracy of the reviewers. In contrast, many radiology studies include two or three reviewers to get some idea of the interreviewer variability. Estimation of the required sample size for MRMC studies requires balancing the number of reviewers in the reviewer sample with the number of patients in the patient sample. See [14, 44] for formulae for determining sample sizes for MRMC studies and [45] for sample size tables for MRMC studies. Sample size determination for non-MRMC studies is based on the number of patients needed.

2. Will the study involve a single diagnostic test or compare two or more diagnostic tests?
ROC studies comparing two or more diagnostic tests are common. These studies focus on the difference between AUCs or PAUCs of the two (or more) diagnostic tests. Sample size can be based either on planning for enough statistical power to detect a clinically important difference or on constructing a CI for the difference in accuracies that is narrow enough to make clinically relevant conclusions from the study. In studies of one diagnostic test, we often focus on the magnitude of the test's AUC or PAUC, basing sample size on the desired width of a CI.

3. If two or more diagnostic tests are being compared, will it be a paired or unpaired study design, and are the accuracies of the tests hypothesized to be different or equivalent? Paired designs almost always require fewer patients than an unpaired design and so are used whenever they are logistically, ethically, and financially feasible. Studies that are performed to determine whether two or more tests have the same accuracy are called equivalency studies. Often in radiology a less invasive diagnostic test, or a quicker imaging sequence, is developed and compared with the standard test. The investigator wants to know if the new test is similar in accuracy to the standard test. Equivalency studies often require a larger sample size than studies in which the goal is to show that one test has superior accuracy to another test. The reason is that to show equivalence the investigator must rule out all large differences between the tests—that is, the CI for the difference must be very narrow.

4. Will the patients be recruited in a prospective or retrospective fashion? In prospective designs, patients are recruited based on their signs or symptoms, so at the time of recruitment it is unknown whether the patient has the disease of interest. In contrast, in retrospective designs patients are recruited based on their known true disease status (as determined by the gold or reference standard) [2]. Both designs are used commonly in radiology. Retrospective studies often require fewer patients than prospective designs.

5. What will be the ratio of nondiseased to diseased patients in the study sample? Let k denote the ratio of the number of nondiseased to diseased patients in the study sample. For retrospective studies, k is usually decided in the design phase of the study. For prospective designs, k is unknown in the design phase but can be estimated by (1 − PREV_p) / PREV_p, where PREV_p is the prevalence of disease in the relevant population. A range of values for PREV_p should be considered when determining sample size.

6. What summary measure of accuracy will be used? In this article we have focused mainly on the AUC and PAUC, but others are possible (see [2]). The choice of summary measure determines which variance function formula will be used in calculating sample size. Note that the variance function is related to the variance by the following formula: variance = VF / N, where VF is the variance function and N is the number of study patients with disease.

7. What is the conjectured accuracy of the diagnostic test? The conjectured accuracy is needed to determine the expected difference in accuracy between two or more diagnostic tests. Also, the magnitude of the accuracy affects the variance function.
In the following example, we present the variance function for the AUC; see Zhou et al. [2] for formulae for other variance functions.

Consider the following example. Suppose an investigator wants to conduct a study to determine if MRI can distinguish benign from malignant breast lesions. Patients with a suspicious lesion detected on mammography will be prospectively recruited to undergo MRI before biopsy. The pathology results will be the reference standard. The MR images will be interpreted independently by two reviewers; they will score the lesions using a 0–100% confidence scale. An ROC curve will be constructed for each reviewer; AUCs will be estimated, and 95% CIs for the AUCs will be constructed. If MRI shows some promise, the investigator will plan a larger MRMC study.

The investigator expects 20–40% of patients to have pathologically confirmed breast cancer (PREV_p = 0.2–0.4); thus, k = 1.5–4.0. The investigator expects the AUC of MRI to be approximately 0.80 or higher. The variance function of the AUC often used for sample size calculations is as follows:

VF = (0.0099 × e^(−A²/2)) × [(5 × A² + 8) + (A² + 8) / k], (7)
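Equation (7) is straightforward to evaluate in code. The sketch below assumes that A is the binormal parameter corresponding to the conjectured AUC, A = √2 × Φ⁻¹(AUC); this excerpt does not define A, so treat that mapping as an assumption for illustration. It also uses variance = VF / N from issue 6 and the k = (1 − PREV_p) / PREV_p estimate from issue 5:

```python
import math
from statistics import NormalDist

def variance_function(auc, k):
    """Equation (7): VF = (0.0099 * e^(-A^2/2)) * ((5*A^2 + 8) + (A^2 + 8)/k).

    A is assumed here to be the binormal parameter sqrt(2) * Phi^{-1}(AUC);
    k is the ratio of nondiseased to diseased patients in the sample.
    """
    a = math.sqrt(2.0) * NormalDist().inv_cdf(auc)
    return (0.0099 * math.exp(-a * a / 2.0)) * ((5 * a * a + 8) + (a * a + 8) / k)

# Issue 5: k estimated from the expected prevalence range of 0.2-0.4.
for prev in (0.2, 0.4):
    print(round((1 - prev) / prev, 1))  # 4.0, then 1.5

# Conjectured AUC of 0.80 with k = 2 nondiseased per diseased patient:
vf = variance_function(0.80, 2.0)
print(round(vf, 4))
# Issue 6: variance of the estimated AUC with, say, N = 50 diseased patients.
print(vf / 50)
```

Plugging candidate values of N into variance = VF / N then lets the investigator check whether the resulting CI for the AUC would be acceptably narrow.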