
Tips for Learners of Evidence-Based Medicine


CMAJ 2005: Tips for Learners of Evidence-Based Medicine: A 5-Part Series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 1. relative risk reduction, absolute risk reduction and number needed to treat. Can Med Assoc J 2004;171:353–358.

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for learners of evidence-based medicine: 2. measures of precision (confidence intervals). Can Med Assoc J 2004;171:611–615.

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of evidence-based medicine: 3. measures of observer variability (kappa statistic). Can Med Assoc J 2004;171:1369–1373.

Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4. assessing heterogeneity of primary studies in systematic reviews and whether to combine their results. Can Med Assoc J 2005;172:661–665.

Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based medicine: 5. the effect of spectrum of disease on the performance of diagnostic tests. Can Med Assoc J 2005;172:385–390.

DOI:10.1503/cmaj.1021197

Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat

Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

Review (Synthèse). CMAJ • Aug. 17, 2004; 171 (4): 353. © 2004 Canadian Medical Association or its licensors. See related article page 347.

Physicians, patients and policy-makers are influenced not only by the results of studies but also by how authors present the results.[1–4] Depending on which measures of effect authors choose, the impact of an intervention may appear very large or quite small, even though the underlying data are the same. In this article we present 3 measures of effect (relative risk reduction, absolute risk reduction and number needed to treat) in a fashion designed to help clinicians understand and use them. We have organized the article as a series of “tips” or exercises. This means that you, the reader, will have to do some work in the course of reading this article (we are assuming that most readers are practitioners, as opposed to researchers and educators).

The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.[5,6] A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/4/353/DC1.

Clinician learners’ objectives

Understanding risk and risk reduction
• Learn how to determine control and treatment event rates in published studies.
• Learn how to determine relative and absolute risk reductions from published studies.
• Understand how relative and absolute risk reductions usually apply to different populations.

Balancing benefits and adverse effects in individual patients
• Learn how to use a known relative risk reduction to estimate the risk of an event for a patient undergoing treatment, given an estimate of that patient’s risk of the event without treatment.
• Learn how to use absolute risk reductions to assess whether the benefits of therapy outweigh its harms.

Calculating and using number needed to treat
• Develop an understanding of the concept of number needed to treat (NNT) and how it is calculated.
• Learn how to interpret the NNT and develop an understanding of how the “threshold NNT” varies depending on the patient’s values and preferences, the severity of possible outcomes and the adverse effects (harms) of therapy.

Tip 1: Understanding risk and risk reduction

You can calculate relative and absolute risk reductions using simple mathematical formulas (see Appendix 1). However, you might find it easier to understand the concepts through visual presentation. Fig. 1A presents data from a hypothetical trial of a new drug for acute myocardial infarction, showing the 30-day mortality rate in a group of patients at high risk for the adverse event (e.g., elderly patients with congestive heart failure and anterior wall infarction). On the basis of information in Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in people not taking the new drug and those who are taking it.)

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

We can describe the difference in mortality (event) rates in both relative and absolute terms. In this case, these high-risk patients had a relative risk reduction of 25% and an absolute risk reduction of 10%.

Now, let’s consider Fig. 1B, which shows the results of a second hypothetical trial of the same new drug, but in a patient population with a lower risk for the outcome (e.g., younger patients with uncomplicated inferior wall myocardial infarction). Looking at Fig. 1B, how would you describe the effect of the new drug?

The relative risk reduction with the new drug remains at 25%, but the event rate is lower in both groups, and hence the absolute risk reduction is only 2.5%.

Although the relative risk reduction might be similar across different risk groups (a safe assumption in many if not most cases[7,8]), the absolute gains, represented by absolute risk reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains constant.

[Fig. 1, panels A and B: bar charts of risk for outcome of interest, % (axis from 0 to 40), for placebo and treatment. Panel A: Trial 1, high-risk patients. Panel B: Trial 1, high-risk patients, and Trial 2, low-risk patients.]

Risk and risk reduction: definitions

Event rate: the number of people experiencing an event as a proportion of the number of people in the population

Relative risk reduction: the difference in event rates between 2 groups, expressed as a proportion of the event rate in the untreated group; usually constant across populations with different risks[7,8]

Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the underlying risk of an event in the individual patient

These phenomena may be factors in the design of drug trials. For example, a drug may be tested in severely affected people in whom the absolute risk reduction is likely to be impressive, but is subsequently marketed for use by less severely affected patients, in whom the absolute risk reduction will be substantially less.

The bottom line

Relative risk reduction is often more impressive than absolute risk reduction. Furthermore, the lower the event rate in the control group, the larger the difference between relative risk reduction and absolute risk reduction.

Among high-risk patients in trial 1, the event rate in the control group (placebo) is 40 per 100 patients, and the event rate in the treatment group is 30 per 100 patients.

Absolute risk reduction (also called the risk difference) is the simple difference in the event rates (40% – 30% = 10%).

Relative risk reduction is the difference between the event rates in relative terms. Here, the event rate in the treatment group is 25% less than the event rate in the control group (i.e., the 10% absolute difference expressed as a proportion of the control rate is 10/40 or 25% less).

Among low-risk patients in trial 2, the event rate in the control group (placebo) is only 10%. If the treatment is just as effective in these low-risk patients, what event rate can we expect in the treatment group?

The event rate in the treated group would be 25% less than in the control group or 7.5%. Therefore, the absolute risk reduction for the low-risk patients (second pair of columns) is only 2.5%, even though the relative risk reduction is the same as for the high-risk patients (first pair of columns).

Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial infarction. The bars represent the 30-day mortality rate in different groups of patients with acute myocardial infarction and heart failure. A: Trial involving patients at high risk for the adverse outcome. B: Trials involving a group of patients at high risk for the adverse outcome and another group of patients at low risk for the adverse outcome.
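The arithmetic behind Fig. 1 can be sketched in a few lines of Python. This is only an illustration using the hypothetical event rates from the figure; the function names are ours and do not come from the article or its Appendix 1.

```python
def absolute_risk_reduction(control_rate_pct, treatment_rate_pct):
    """Absolute risk reduction in percentage points: control minus treatment."""
    return control_rate_pct - treatment_rate_pct


def relative_risk_reduction(control_rate_pct, treatment_rate_pct):
    """Relative risk reduction as a proportion of the control event rate."""
    return (control_rate_pct - treatment_rate_pct) / control_rate_pct


def expected_treatment_rate(control_rate_pct, rrr):
    """Event rate expected with treatment if the relative risk reduction holds."""
    return control_rate_pct * (1 - rrr)


# Trial 1 (high-risk patients): 40% with placebo, 30% with treatment.
arr_high = absolute_risk_reduction(40.0, 30.0)
rrr = relative_risk_reduction(40.0, 30.0)

# Trial 2 (low-risk patients): 10% with placebo, same relative risk reduction assumed.
treated_low = expected_treatment_rate(10.0, rrr)
arr_low = absolute_risk_reduction(10.0, treated_low)

print(f"High risk: ARR = {arr_high:.1f}%, RRR = {rrr:.0%}")
print(f"Low risk: treated rate = {treated_low:.1f}%, ARR = {arr_low:.1f}%")
```

The same relative risk reduction applied to a lower control event rate yields a smaller absolute risk reduction, which is the point of the figure.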


Tip 2: Balancing benefits and adverse effects in individual patients

In prescribing medications or other treatments, physicians consider both the potential benefits and the potential harms. We have just demonstrated that the benefits of treatment (presented as absolute risk reductions) will generally be greater in patients at higher risk of adverse outcomes than in patients at lower risk of adverse outcomes. You must now incorporate the possibility of harm into your decision-making.

First, you need to quantify the potential benefits. Assume you are managing 2 patients for high blood pressure and are considering the use of a new antihypertensive drug, drug X, for which the relative risk reduction for stroke over 3 years is 33%, according to published randomized controlled trials.

Pat is a 69-year-old woman whose blood pressure during a routine examination is 170/100 mm Hg; her blood pressure remains unchanged when you see her again 3 weeks later. She is otherwise well and has no history of cardiovascular or cerebrovascular disease. You assess her risk of stroke at about 1% (or 1 per 100) per year.[9]

Dorothy is also 69 years of age, and her blood pressure is the same as Pat’s, 170/100 mm Hg; however, because she had a stroke recently, you assess her risk of subsequent stroke as higher than Pat’s, perhaps 10% per year.[10]

One way of determining the potential benefit of a new treatment is to complete a benefit table such as Table 1A. To do this, insert your estimated 3-year event rates for Pat and Dorothy, and then apply the relative risk reduction (33%) expected if they take drug X. It is clear from Table 1A that the absolute risk reduction for patients at higher risk (such as Dorothy) is much greater than for those at lower risk (such as Pat).

Table 1A: Benefit table*

| Patient group | 3-yr event rate for stroke, %: no treatment | 3-yr event rate for stroke, %: with treatment (drug X) | Absolute risk reduction, % (no treatment – treatment) |
| --- | --- | --- | --- |
| At lower risk (e.g., Pat) | 3 | 2 | 1 |
| At higher risk (e.g., Dorothy) | 30 | 20 | 10 |

*Based on data from a randomized controlled trial of drug X, which reported a 33% relative risk reduction for the outcome (stroke) over 3 years.

Now, you need to factor the potential harms (adverse effects associated with using the drug) into the clinical decision. In the clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in patients who received the drug (relative risk of 3). A population-based study has reported the risk of severe gastric bleeding for women in your patients’ age group at about 0.1% per year (regardless of their risk of stroke). These data can now be added to the table to allow a more balanced assessment of the benefits and harms that could arise from treatment (Table 1B).

Table 1B: Benefit and harm table*

| Patient group | Stroke, 3-yr event rate, %: no treatment | Stroke, 3-yr event rate, %: with treatment (drug X) | Stroke: absolute risk reduction (no treatment – treatment) | Severe gastric bleeding, 3-yr event rate, %: no treatment | Severe gastric bleeding, 3-yr event rate, %: with treatment (drug X) | Severe gastric bleeding: absolute risk increase (treatment – no treatment) |
| --- | --- | --- | --- | --- | --- | --- |
| At lower risk (e.g., Pat) | 3 | 2 | 1 | 0.3 | 0.9 | 0.6 |
| At higher risk (e.g., Dorothy) | 30 | 20 | 10 | 0.3 | 0.9 | 0.6 |

*Based on data from randomized controlled trials of drug X reporting a 33% relative risk reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect (severe gastric bleeding) over the same period.

Considering the results of this process, would you give drug X to Pat, to Dorothy or to both?

In making your decisions, remember that there is not necessarily one “right answer” here. Your analysis might go something like this:

Pat will experience a small benefit (absolute risk reduction over 3 years of about 1%), but this will be considerably offset by the increased risk of gastric bleeding (absolute risk increase over 3 years of 0.6%). The potential benefit for Dorothy (absolute risk reduction over 3 years of about 10%) is much greater than the increased risk of harm (absolute risk increase over 3 years of 0.6%). Therefore, the benefit of treatment is likely to be greater for Dorothy (who is at higher risk of stroke) than for Pat (who is at lower risk).

Assessment of the balance between benefits and harms depends on the value that patients place on reducing their risk of stroke in relation to the increased risk of gastric bleeding. Many patients might be much more concerned about the former than the latter.


Number needed to treat: definitions

Number needed to treat: the number of patients who would have to receive the treatment for 1 of them to benefit; calculated as 100 divided by the absolute risk reduction expressed as a percentage (or 1 divided by the absolute risk reduction expressed as a proportion; see Appendix 1)

Number needed to harm: the number of patients who would have to receive the treatment for 1 of them to experience an adverse effect; calculated as 100 divided by the absolute risk increase expressed as a percentage (or 1 divided by the absolute risk increase expressed as a proportion)
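The two definitions above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and rounding to the nearest whole patient is our assumption (it matches the article's 100/2.8 = 36, although some authors prefer to round up). The inputs are the 2.8% absolute risk reduction from the streptokinase row of Table 2 and the 0.6% absolute risk increase from Table 1B.

```python
def nnt_from_percentage(arr_percent):
    """Number needed to treat when the ARR is given as a percentage."""
    if arr_percent <= 0:
        raise ValueError("absolute risk reduction must be positive")
    return 100 / arr_percent


def nnt_from_proportion(arr_proportion):
    """Number needed to treat when the ARR is given as a proportion."""
    if arr_proportion <= 0:
        raise ValueError("absolute risk reduction must be positive")
    return 1 / arr_proportion


def nnh_from_percentage(ari_percent):
    """Number needed to harm: same formula, applied to the absolute risk increase."""
    if ari_percent <= 0:
        raise ValueError("absolute risk increase must be positive")
    return 100 / ari_percent


# Both forms of the ARR give the same NNT.
print(round(nnt_from_percentage(2.8)), round(nnt_from_proportion(0.028)))

# Severe gastric bleeding with drug X: absolute risk increase of 0.6% over 3 years.
print(round(nnh_from_percentage(0.6)))
```

Keeping the unrounded value inside the functions and rounding only for display avoids compounding rounding error if the result is reused.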

The bottom line

When available, trial data regarding relative risk reductions (or increases), combined with estimates of baseline (untreated) risk in individual patients, provide the basis for clinicians to balance the benefits and harms of therapy for their patients.
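The benefit and harm table for Pat and Dorothy can be rebuilt from its inputs. The sketch below is a rough illustration, not the authors' method: the function name and dictionary keys are hypothetical, the 33% relative risk reduction is treated as one third so that the results match the table's rounded figures, and 3-year rates are approximated as 3 times the annual rate, as the article's numbers imply.

```python
def benefit_and_harm(baseline_stroke_pct, baseline_bleed_pct,
                     rrr=1 / 3, harm_relative_risk=3.0):
    """Return 3-year event rates (%) with and without treatment, plus ARR and ARI."""
    treated_stroke = baseline_stroke_pct * (1 - rrr)
    treated_bleed = baseline_bleed_pct * harm_relative_risk
    return {
        "stroke_untreated": baseline_stroke_pct,
        "stroke_treated": treated_stroke,
        "arr": baseline_stroke_pct - treated_stroke,
        "bleed_untreated": baseline_bleed_pct,
        "bleed_treated": treated_bleed,
        "ari": treated_bleed - baseline_bleed_pct,
    }


YEARS = 3
ANNUAL_BLEED_PCT = 0.1  # population-based estimate quoted in the text
annual_stroke_pct = {"Pat": 1.0, "Dorothy": 10.0}

for name, annual_rate in annual_stroke_pct.items():
    row = benefit_and_harm(annual_rate * YEARS, ANNUAL_BLEED_PCT * YEARS)
    print(f"{name}: stroke {row['stroke_untreated']:.0f}% -> {row['stroke_treated']:.0f}% "
          f"(ARR {row['arr']:.0f}%), bleeding {row['bleed_untreated']:.1f}% -> "
          f"{row['bleed_treated']:.1f}% (ARI {row['ari']:.1f}%)")
```

Only the baseline stroke risk differs between the two patients, so the absolute risk increase for bleeding is identical while the absolute risk reduction for stroke differs tenfold.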

Tip 3: Calculating and using number needed to treat

Some physicians use another measure of risk and benefit, the number needed to treat (NNT), in considering the consequences of treating or not treating. The NNT is the number of patients to whom a clinician would need to administer a particular treatment to prevent 1 patient from having an adverse outcome over a predefined period of time. (It also reflects the likelihood that a particular patient to whom treatment is administered will benefit from it.) If, for example, the NNT for a treatment is 10, the practitioner would have to give the treatment to 10 patients to prevent 1 patient from having the adverse outcome over the defined period, and each patient who received the treatment would have a 1 in 10 chance of being a beneficiary.

If the absolute risk reduction is large, you need to treat only a small number of patients to observe a benefit in at least some of them. Conversely, if the absolute risk reduction is small, you must treat many people to observe a benefit in just a few.

An analogous calculation to the one used to determine the NNT can be used to determine the number of patients who would have to be treated for 1 patient to experience an adverse event. This is the number needed to harm (NNH), which is the inverse of the absolute risk increase.

How comfortable are you with estimating the NNT for a given treatment? For example, consider the following questions: How many 60-year-old patients with hypertension would you have to treat with diuretics for a period of 5 years to prevent 1 death? How many people with myocardial infarction would you have to treat with β-blockers for 2 years to prevent 1 death? How many people with acute myocardial infarction would you have to treat with streptokinase to prevent 1 person from dying in the next 5 weeks? Compare your answers with estimates derived from published studies (Table 2). How accurate were your estimates? Are you surprised by the size of the NNT values?

Table 2: Benefit table for patients with cardiovascular problems

| Clinical question | Event rate, %: control group | Event rate, %: treatment group | ARR, % | NNT |
| --- | --- | --- | --- | --- |
| What is the reduction in risk of stroke within 5 years among 60-year-old patients with hypertension who are treated with diuretics?[11] | 2.9 | 1.9 | 1.00 | 100 |
| What is the reduction in risk of death within 2 years after MI among 60-year-old patients treated with β-blockers?[12] | 9.8 | 7.3 | 2.50 | 40 |
| What is the reduction in risk of death within 5 weeks after acute MI among 60-year-old patients treated with streptokinase?[13] | 12.0 | 9.2 | 2.80 | 36 |

Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.

Physicians often experience problems in this type of exercise, usually because they are unfamiliar with the calculation of NNT. Here is one way to think about it. If a disease has a mortality rate of 100% without treatment and therapy reduces that mortality rate to 50%, how many people would you need to treat to prevent 1 death? From the numbers given, you can probably figure out that treating 100 patients with the otherwise fatal disease results in 50 survivors. This is equivalent to 1 out of every 2 treated. Since all were destined to die, the NNT to prevent 1 death is 2. The formula reflected in this calculation is as follows: the NNT to prevent 1 adverse outcome equals the inverse of the absolute risk reduction. Table 3 illustrates this concept further. Note that, if the absolute risk reduction is presented as a percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same answer, so use whichever you find easier.

Table 3: Calculation of NNT from absolute risk reduction*

| Form of absolute risk reduction | Calculation of NNT | Example |
| --- | --- | --- |
| Percentage (e.g., 2.8%) | 100/ARR | 100/2.8 = 36 |
| Proportion (e.g., 0.028) | 1/ARR | 1/0.028 = 36 |

*Using absolute risk reduction in last row of Table 2.[13]

It can be challenging for clinicians to estimate the baseline risks for specific populations. For example, some physicians may have little idea of the risk of stroke over 5 years among patients with hypertension. Physicians may also overestimate the effect of treatment, which leads them to ascribe larger absolute risk reductions and smaller NNT values than are actually the case.[14]

Now that you know how to determine the NNT from the absolute risk reduction, you must also consider whether the NNT is reasonable. In other words, what is the maximum NNT that you and your patients will accept as justifying the benefits and harms of therapy? This is referred to as the threshold NNT.[15] If the calculated NNT is above the threshold, the benefits are not large enough (or the risk of harm is too great) to warrant initiating the therapy.

Determinants of the threshold NNT include the patient’s own values and preferences, the severity of the outcome that would be prevented, and the costs and side effects of the intervention. Thus, the threshold NNT will almost certainly be different for different patients, and there is no simple answer to the question of when an NNT is sufficiently low to justify initiating treatment.

The bottom line

NNT is a concise, clinically useful presentation of the effect of an intervention. You can easily calculate it from the absolute risk reduction (just remember to check whether the absolute risk reduction is presented as a percentage or a proportion and use a numerator of 100 or 1 accordingly). Be careful not to overestimate the effect of treatments (i.e., use a value of absolute risk reduction that is too high) and thus underestimate the NNT.

Conclusions

Clinicians seeking to apply clinical evidence to the care of individual patients need to understand and be able to calculate relative risk reduction, absolute risk reduction and NNT from data presented in clinical trials and systematic reviews. We have described and defined these concepts and presented tabular tools and equations to help clinicians overcome common pitfalls in acquiring these skills.

This article has been peer reviewed.

From the School of Public Health, University of Sydney, Sydney, Australia (Barratt); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); the Department of Medicine, University of British Columbia, Vancouver, BC (Hatala); Mount Sinai Medical Center, New York, NY (McGinn); the Department of Internal Medicine, University of the Philippines College of Medicine, Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Alexandra Barratt contributed tip 2, drafted the manuscript, coordinated input from coauthors and reviewers and from field-testing and revised all drafts. Peter Wyer edited drafts and provided guidance in developing the final format. Rose Hatala contributed tip 1, coordinated the internal review process and provided comments throughout development of the manuscript. Thomas McGinn contributed tip 3 and provided comments throughout development of the manuscript. Antonio Dans reviewed all drafts and provided comments throughout development of the manuscript. Sheri Keitz conducted field-testing of the tips and contributed material from the field-testing to the manuscript. Virginia Moyer reviewed and contributed to the final version of the manuscript. Gordon Guyatt helped to write the manuscript (as an editor and coauthor).

References

1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing<br />

effect <strong>of</strong> relative and absolute risk. J Gen Intern Med 1993;8:543-8.<br />

2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results<br />

are summarized can affect treatment decisions. Am J Med 1992;92:121-4.<br />

3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method <strong>of</strong><br />

reporting trial results alter perceptions <strong>of</strong> therapeutic effectiveness? Ann Intern<br />

Med 1992;117:916-21.<br />

4. Fahey T, Griffiths S, Peters TJ. <strong>Evidence</strong> based purchasing: understanding<br />

results <strong>of</strong> clinical trials and systematic reviews. BMJ 1995;311:1056-60.<br />

5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures<br />

<strong>of</strong> association. In: Guyatt G, Rennie D, editors. The users’ guides to the<br />

medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />

Publications; 2002. p. 351-68.<br />

6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />

<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series.<br />

CMAJ 2004;171(4):347-8.<br />

7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study <strong>of</strong> the<br />

effect <strong>of</strong> the control rate as a predictor <strong>of</strong> treatment efficacy in meta-analysis<br />

<strong>of</strong> clinical trials. Stat Med 1998;17:1923-42.<br />

8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number<br />

needed to treat? An empirical study <strong>of</strong> summary effect measures in metaanalyses.<br />

Int J Epidemiol 2002;31:72-6.<br />

9. SHEP Cooperative Research Group. Prevention <strong>of</strong> stroke by anti-hypertensive<br />

drug treatment in older persons with isolated systolic hypertension. Final<br />

results <strong>of</strong> the Systolic Hypertension in the Elderly Program (SHEP). JAMA<br />

1991;265:3255-64.<br />

10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) <strong>of</strong><br />

75mg aspirin as secondary prophylaxis after cerebrovascular events. Lancet<br />

1991;338:1345-9.<br />

11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert<br />

SR. Health outcomes associated with antihypertensive therapies used as<br />

first-line agents. A systematic review and meta-analysis. JAMA 1997;277:<br />

739-45.<br />

12. β-Blocker Heart Attack Trial Research Group. A randomized trial <strong>of</strong> propranolol<br />

in patients with acute myocardial infarction. I. Mortality results.<br />

JAMA 1982;247:1707-14.<br />

13. ISIS-2 Collaborative Group. Randomised trial <strong>of</strong> intravenous streptokinase,<br />

oral aspirin, both or neither among 17 187 cases <strong>of</strong> suspected acute myocardial<br />

infarction: ISIS-2. Lancet 1988;2:349-60.<br />

14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number<br />

needed to treat: a clinically useful nomogram in its proper context. BMJ 1996;<br />

312:426-9.<br />

15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective<br />

treatment be used? Derivation <strong>of</strong> the threshold number needed to treat<br />

and the minimum event rate <strong>for</strong> treatment. J Clin Epidemiol 2001;54:253-62.<br />

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net


CMAJ AUG. 17, 2004; 171 (4) 357


Barratt et al<br />

Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong><br />

Working Group: Peter C. Wyer (project director), Columbia<br />

University College <strong>of</strong> Physicians and Surgeons, New York, NY;<br />

Deborah Cook, Gordon Guyatt (general editor), Ted Haines,<br />

Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose<br />

Hatala (internal review coordinator), Department <strong>of</strong> <strong>Medicine</strong>,<br />

University <strong>of</strong> British Columbia, Vancouver, BC; Robert Hayward<br />

(editor, online version), Bruce Fisher, University <strong>of</strong> Alberta,<br />

Edmonton, Alta.; Sheri Keitz (field-test coordinator), Durham<br />

Veterans Affairs Medical Center and Duke University, Durham,<br />

NC; Alexandra Barratt, University <strong>of</strong> Sydney, Sydney, Australia;<br />

Pamela Charney, Albert Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY;<br />

Antonio L. Dans, University <strong>of</strong> the Philippines College <strong>of</strong><br />

<strong>Medicine</strong>, Manila, The Philippines; Barnet Eskin, Morristown<br />

Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory<br />

University, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />

Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />

McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />

Montori, Department <strong>of</strong> <strong>Medicine</strong>, Mayo Clinic College <strong>of</strong><br />

<strong>Medicine</strong>, Rochester, Minn.; Virginia Moyer, University <strong>of</strong> Texas,<br />

Houston, Tex.; Thomas B. Newman, University of California, San Francisco, Calif.;

Jim Nishikawa, University of Ottawa, Ottawa,

Ont.; W. Scott Richardson, Wright State University, Dayton,<br />

Ohio; Mark C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />

Appendix 1: Formulas for commonly used measures of therapeutic effect

Measure of effect          Formula
Relative risk              (Event rate in intervention group) ÷ (event rate in control group)
Relative risk reduction    1 – relative risk, or (absolute risk reduction) ÷ (event rate in control group)
Absolute risk reduction    (Event rate in control group) – (event rate in intervention group)
Number needed to treat     1 ÷ (absolute risk reduction)
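The formulas in Appendix 1 can be sketched in a few lines of Python. This is only an illustration: the function name and the example counts (2 of 4 events in the control group, 1 of 4 in the intervention group) are assumptions chosen for the sketch, and the absolute risk reduction is taken as control rate minus intervention rate so that a benefit comes out positive, consistent with relative risk reduction = absolute risk reduction ÷ control event rate.

```python
def effect_measures(control_events, control_n, treated_events, treated_n):
    """Return the four measures of effect for a two-arm trial.

    Hypothetical helper for illustration; it assumes a nonzero control
    event rate.
    """
    control_rate = control_events / control_n
    treated_rate = treated_events / treated_n
    relative_risk = treated_rate / control_rate
    # Control minus intervention, so a benefit is a positive number.
    absolute_risk_reduction = control_rate - treated_rate
    if absolute_risk_reduction != 0:
        number_needed_to_treat = 1 / absolute_risk_reduction
    else:
        number_needed_to_treat = float("inf")  # no difference between arms
    return {
        "relative_risk": relative_risk,
        "relative_risk_reduction": 1 - relative_risk,
        "absolute_risk_reduction": absolute_risk_reduction,
        "number_needed_to_treat": number_needed_to_treat,
    }


measures = effect_measures(2, 4, 1, 4)
for name, value in measures.items():
    print(f"{name}: {value}")
```

Both routes to the relative risk reduction (1 minus relative risk, or absolute risk reduction divided by the control event rate) give the same number, which is a quick check on hand calculations.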



DOI:10.1503/cmaj.1031667<br />

<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />

2. Measures <strong>of</strong> precision (confidence intervals)<br />

In the first article in this series, 1 we presented an approach<br />

to understanding how to estimate a treatment’s<br />

effectiveness that covered relative risk reduction, absolute<br />

risk reduction and number needed to treat. But how<br />

precise are these estimates <strong>of</strong> treatment effect?<br />

In reading the results <strong>of</strong> clinical trials, clinicians <strong>of</strong>ten<br />

come across 2 related but different statistical measures <strong>of</strong> an<br />

estimate’s precision: p values and confidence intervals. The p<br />

value describes how <strong>of</strong>ten apparent differences in treatment<br />

effect that are as large as or larger than those observed in a<br />

particular trial will occur in a long run <strong>of</strong> identical trials if in<br />

fact no true effect exists. If the observed differences are sufficiently<br />

unlikely to occur by chance alone, investigators reject<br />

the hypothesis that there is no effect. For example, consider<br />

a randomized trial comparing diuretics with placebo<br />

that finds a 25% relative risk reduction <strong>for</strong> stroke with a p<br />

value <strong>of</strong> 0.04. This p value means that, if diuretics were in<br />

fact no different in effectiveness than placebo, we would expect,<br />

by the play <strong>of</strong> chance alone, to observe a reduction —<br />

or increase — in relative risk <strong>of</strong> 25% or more in 4 out <strong>of</strong><br />

100 identical trials.<br />

Although they are useful <strong>for</strong> investigators planning how<br />

large a study needs to be to demonstrate a particular magnitude<br />

<strong>of</strong> effect, p values fail to provide clinicians and patients<br />

with the in<strong>for</strong>mation they most need, i.e., the range<br />

<strong>of</strong> values within which the true effect is likely to reside.<br />

However, confidence intervals provide exactly that in<strong>for</strong>mation<br />

in a <strong>for</strong>m that pertains directly to the process <strong>of</strong> deciding<br />

whether to administer a therapy to patients. If the<br />

range <strong>of</strong> possible true effects encompassed by the confidence<br />

interval is overly wide, the clinician may choose to<br />

administer the therapy only selectively or not at all.<br />

Confidence intervals are there<strong>for</strong>e the topic <strong>of</strong> this article.<br />

For a nontechnical explanation <strong>of</strong> p values and their<br />

limitations, we refer interested readers to the Users’ Guides<br />

to the Medical Literature. 2<br />

As with the first article in this series, 1 we present the in<strong>for</strong>mation<br />

as a series <strong>of</strong> “tips” or exercises. This means that<br />

you, the reader, will have to do some work in the course <strong>of</strong><br />

reading the article. The tips we present here have been<br />

adapted from approaches developed by educators experienced<br />

in teaching evidence-based medicine skills to clinicians.<br />

2-4 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/171/6/611/DC1.

Review

Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C. Wyer, Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

Clinician learners’ objectives<br />

Making confidence intervals intuitive<br />

• Understand the dynamic relation between confidence<br />

intervals and sample size.<br />

Interpreting confidence intervals<br />

• Understand how the confidence intervals around estimates<br />

<strong>of</strong> treatment effect can affect therapeutic decisions.<br />

Estimating confidence intervals <strong>for</strong> extreme<br />

proportions<br />

• Learn a shortcut <strong>for</strong> estimating the upper limit <strong>of</strong> the<br />

95% confidence intervals <strong>for</strong> proportions with very<br />

small numerators and <strong>for</strong> proportions with numerators<br />

very close to the corresponding denominators.<br />

Tip 1: Making confidence intervals intuitive<br />

Imagine a hypothetical series of 5 trials (of equal duration but different sample sizes) in which investigators have experimented with treatments for patients who have a particular condition (elevated low-density lipoprotein cholesterol) to determine whether a drug (a novel cholesterol-lowering agent) would work better than a placebo to prevent strokes (Table 1A). The smallest trial enrolled only 8 patients, and the largest enrolled 2000 patients, and half of the patients in each trial underwent the experimental treatment. Now imagine that all of the trials showed a relative risk reduction for the treatment group of 50% (meaning that patients in the drug treatment group were only half as likely as those in the placebo group to have a stroke). In each individual trial, how confident can we be that the true value of the relative risk reduction is important for patients (i.e., “patient-important”)? 5 If you were to look at the studies individually, which ones would lead you to recommend the treatment unequivocally to your patients?

Teachers of evidence-based medicine:

See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/171/6/611/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.

CMAJ • SEPT. 14, 2004; 171 (6) 611

© 2004 Canadian Medical Association or its licensors

Most clinicians might intuitively guess that we could be<br />

more confident in the results <strong>of</strong> the larger trials. Why is this?<br />

In the absence <strong>of</strong> bias or systematic error, the results <strong>of</strong> a trial<br />

can be interpreted as an estimate <strong>of</strong> the true magnitude <strong>of</strong> effect<br />

that would occur if all possible eligible patients had been<br />

included. When only a few <strong>of</strong> these patients are included, the<br />

play <strong>of</strong> chance alone may lead to a result that is quite different<br />

from the true value. Confidence intervals are a numeric<br />

measure <strong>of</strong> the range within which such variation is likely to<br />

occur. The 95% confidence intervals that we <strong>of</strong>ten see in<br />

biomedical publications represent the range within which we<br />

are likely to find the underlying true treatment effect.<br />

To gain a better appreciation of confidence intervals, go back to Table 1A (don’t look yet at Table 1B!) and take a guess at what you think the confidence intervals might be for the 5 trials presented. In a moment you’ll see how your estimates compare to 95% confidence intervals calculated using a formula, but for now, try figuring out intervals that you intuitively feel to be appropriate.

Table 1A: Relative risk and relative risk reduction observed in 5 successively larger hypothetical trials

Control event rate    Treatment event rate    Relative risk, %    Relative risk reduction, %*
2/4                   1/4                     50                  50
10/20                 5/20                    50                  50
20/40                 10/40                   50                  50
50/100                25/100                  50                  50
500/1000              250/1000                50                  50

*Calculated as the absolute difference between the control and treatment event rates (expressed as a fraction or a percentage), divided by the control event rate. In the first row in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control event rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be (3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate were 75% and the treatment event rate were 25%, the relative risk reduction would be (75% – 25%) ÷ 75% = 67%.

Now, consider the first trial, in which 2 out of 4 patients who receive the control intervention and 1 out of 4 patients who receive the experimental treatment suffer a stroke. The risk in the treatment group is half that in the control group, which gives us a relative risk of 50% and a relative risk reduction of 50% (see Table 1A). 1,6

Given the substantial relative risk reduction, would you be ready to recommend this treatment to a patient? Before you answer this question, consider whether it is plausible, with so few patients in the study, that the investigators might just have gotten lucky and the true treatment effect is really a 50% increase in relative risk. In other words, is it plausible that the true event rate in the group that received treatment was 3 out of 4 instead of 1 out of 4? If you accept that this large, harmful effect might represent the underlying truth, would you also accept that a relative risk reduction of 90%, i.e., a very large benefit of treatment, is consistent with the experimental data in these few patients? To the extent that these suggestions are plausible, we can intuitively create a range of plausible truth of “–50% to 90%” surrounding the relative risk reduction of 50% that was actually observed.

Now, do this for each of the other 4 trials. In the trial with 20 patients in each group, 10 of those in the control group suffered a stroke, as did 5 of those in the treatment group. Both the relative risk and the relative risk reduction are again 50%. Do you still consider it plausible that the true event rate in the treatment group is 15 out of 20 rather than 5 out of 20 (the same proportions as we considered in the smaller trial)? If not, what about 12 out of 20? The latter would represent a 20% increase in risk over the control rate (12/20 v. 10/20). A true relative risk reduction of 90% may still be plausible, given the observed results and the numbers of patients involved. In short, given this larger number of patients and the lower chance of a “bad sample,” the “range of plausible truth” around the observed relative risk reduction of 50% might be narrower, perhaps from a relative risk increase of 20% (represented as –20%) to a relative risk reduction of 90%.

You can develop similar intuitively derived confidence intervals for the larger trials. We’ve done this in Table 1B, which also shows the 95% confidence intervals that we calculated using a statistical program called StatsDirect (available commercially through www.statsdirect.com). You can see that in some instances we intuitively overestimated or underestimated the intervals relative to those we derived using the statistical formulas.

Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively larger hypothetical trials

                                                                                  CI around relative risk reduction, %
Control event rate    Treatment event rate    Relative risk, %    Relative risk reduction, %    Intuitive CI*    Calculated 95% CI*†
2/4                   1/4                     50                  50                            –50 to 90        –174 to 92
10/20                 5/20                    50                  50                            –20 to 90        –14 to 79.5
20/40                 10/40                   50                  50                            0 to 90          9.5 to 73.4
50/100                25/100                  50                  50                            20 to 80         26.8 to 66.4
500/1000              250/1000                50                  50                            40 to 60         43.5 to 55.9

*Negative values represent an increase in risk relative to control. See text for further explanation.
†Calculated by statistical software.

The bottom line<br />

Confidence intervals in<strong>for</strong>m clinicians about the range<br />

within which the true treatment effect might plausibly lie,<br />

given the trial data. Greater precision (narrower confidence<br />

intervals) results from larger sample sizes and consequent<br />

larger number <strong>of</strong> events. Statisticians (and statistical s<strong>of</strong>tware)<br />

can calculate 95% confidence intervals around any<br />

estimate <strong>of</strong> treatment effect.<br />
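To see how sample size drives the width of the interval, the sketch below computes an approximate 95% confidence interval for the relative risk reduction in each of the 5 hypothetical trials. It assumes the common large-sample method based on the logarithm of the relative risk; this is not necessarily the method StatsDirect used, so the limits will differ from the calculated column of Table 1B, most noticeably for the smallest trials. The function name is an assumption made for this illustration.

```python
import math


def rrr_confidence_interval(control_events, control_n,
                            treated_events, treated_n, z=1.96):
    """Approximate CI for the relative risk reduction (log relative-risk method).

    Assumes every event count is greater than zero. Returns a tuple
    (point estimate, lower limit, upper limit) as fractions.
    """
    rr = (treated_events / treated_n) / (control_events / control_n)
    se_log_rr = math.sqrt(
        1 / treated_events - 1 / treated_n
        + 1 / control_events - 1 / control_n
    )
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)
    # RRR = 1 - RR, so the upper RR limit gives the lower RRR limit.
    return 1 - rr, 1 - rr_high, 1 - rr_low


trials = [
    (2, 4, 1, 4),
    (10, 20, 5, 20),
    (20, 40, 10, 40),
    (50, 100, 25, 100),
    (500, 1000, 250, 1000),
]
for c_events, c_n, t_events, t_n in trials:
    rrr, low, high = rrr_confidence_interval(c_events, c_n, t_events, t_n)
    print(f"{c_events}/{c_n} v. {t_events}/{t_n}: "
          f"RRR {rrr:.0%}, approximate 95% CI {low:.1%} to {high:.1%}")
```

The point estimate is 50% in every row, while the interval narrows steadily as the number of patients and events grows.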

Tip 2: Interpreting confidence intervals

You should now have an understanding of the relation between the width of the confidence interval around a measure of outcome in a clinical trial and the number of participants and events in that study. You are ready to consider whether a study is sufficiently large, and the resulting confidence intervals sufficiently narrow, to reach a definitive conclusion about recommending the therapy, after taking into account your patient’s values, preferences and circumstances.

The concept of a minimally important treatment effect proves useful in considering the issue of when a study is large enough and has therefore generated confidence intervals that are narrow enough to recommend for or against the therapy. This concept requires the clinician to think about the smallest amount of benefit that would justify therapy.

Consider a set of hypothetical trials. Fig. 1A displays the results of trial 1. The uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. For the medical condition being investigated, assume that a 1% absolute risk reduction is the smallest benefit that patients would consider to outweigh the downsides of therapy. Given the information in Fig. 1A, would you recommend this treatment to your patients if the point estimate represented the truth? What if the upper boundary of the confidence interval represented the truth? Or the lower boundary?

For all 3 of these questions, the answer is yes, provided that 1% is in fact the smallest patient-important difference. Thus, the trial is definitive and allows a strong inference about the treatment decision.

In the case of trial 2 (see Fig. 1B), would your patients choose to undergo the treatment if either the point estimate or the upper boundary of the confidence interval represented the true effect? What about the lower boundary? The answer regarding the lower boundary is no, because the effect is less than the smallest difference that patients would consider large enough for them to undergo the treatment. Although trial 2 shows a “positive” result (i.e., the confidence interval does not encompass zero), the sample size was inadequate and the result remains compatible with risk reductions below the minimal patient-important difference.

[Figure 1, panels A to C. Horizontal axis: % absolute risk reduction, from –5 to 5; negative values are labelled “Treatment harms” and positive values “Treatment helps.” Curves are labelled Trial 1, Trial 2, Trial 3 and Trial 4.]

Fig. 1: Results of 4 hypothetical trials. For the medical condition under investigation, an absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients would consider important enough to warrant undergoing treatment. In each case, the uppermost point of the bell curve is the observed treatment effect (the point estimate), and the tails of the bell curve represent the boundaries of the 95% confidence interval. See text for further explanation.

When a study result is positive, you can determine<br />

whether the sample size was adequate by checking the lower<br />

boundary <strong>of</strong> the confidence interval, the smallest plausible<br />

treatment effect compatible with the results. If this value is<br />

greater than the smallest difference your patients would<br />

consider important, the sample size is adequate and the trial<br />

result definitive. However, if the lower boundary falls below<br />

the smallest patient-important difference, leaving patients<br />

uncertain as to whether taking the treatment is in their best<br />

interest, the trial is not definitive. The sample size is inadequate,<br />

and further trials are required.<br />

What happens when the confidence interval <strong>for</strong> the effect<br />

<strong>of</strong> a therapy includes zero (where zero means “no effect”<br />

and hence a negative result)?<br />

For studies with negative results — those that do not exclude<br />

a true treatment effect <strong>of</strong> zero — you must focus on<br />

the other end <strong>of</strong> the confidence interval, that representing<br />

the largest plausible treatment effect consistent with the<br />

trial data. You must consider whether the upper boundary<br />

<strong>of</strong> the confidence interval falls below the smallest difference<br />

that patients might consider important. If so, the sample<br />

size is adequate, and the trial is definitively negative (see<br />

trial 3 in Fig. 1C). Conversely, if the upper boundary exceeds<br />

the smallest patient-important difference, then the<br />

trial is not definitively negative, and more trials with larger<br />

sample sizes are needed (see trial 4 in Fig. 1C).<br />

The bottom line<br />

To determine whether a trial with a positive result is sufficiently<br />

large, clinicians should focus on the lower boundary <strong>of</strong><br />

the confidence interval and determine if it is greater than the<br />

smallest treatment benefit that patients would consider important<br />

enough to warrant taking the treatment. For studies<br />

with a negative result, clinicians should examine the upper<br />

boundary <strong>of</strong> the confidence interval to determine if this value<br />

is lower than the smallest treatment benefit that patients<br />

would consider important enough to warrant taking the treatment.<br />

In either case, if the confidence interval overlaps the<br />

smallest treatment benefit that is important to patients, then<br />

the study is not definitive and a larger study is needed.<br />
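The decision rule summarized above can be written as a short function. This is a minimal sketch under stated assumptions: the function name and the example intervals are hypothetical (they are not read from Fig. 1), effects are expressed as absolute risk reductions in percentage points, the smallest patient-important benefit is greater than zero, and an interval lying entirely below zero is simply grouped with the definitively negative results.

```python
def interpret_trial(ci_lower, ci_upper, smallest_important_benefit):
    """Classify a trial from its 95% CI and the smallest important benefit.

    All three arguments use the same units (here, absolute risk reduction
    in percentage points). Illustrative helper, not a validated tool.
    """
    if ci_lower > 0:
        # "Positive" result: the interval excludes zero, so check its lower end.
        if ci_lower > smallest_important_benefit:
            return "definitive positive"
        return "positive, not definitive"
    # "Negative" result: the interval does not exclude zero, so check its upper end.
    if ci_upper < smallest_important_benefit:
        return "definitive negative"
    return "negative, not definitive"


threshold = 1.0  # smallest patient-important absolute risk reduction, in %
hypothetical_intervals = [(2.0, 5.0), (0.3, 4.0), (-0.5, 0.8), (-1.0, 3.0)]
for lower, upper in hypothetical_intervals:
    print(f"{lower} to {upper}: {interpret_trial(lower, upper, threshold)}")
```

In both "not definitive" cases the interval straddles the threshold, which is the signal that a larger study is needed.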

Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for proportions with 0 in the numerator

n       Observed proportion    3/n       Upper limit of 95% CI
20      0/20                   3/20      0.15 or 15%
100     0/100                  3/100     0.03 or 3%
300     0/300                  3/300     0.01 or 1%
1000    0/1000                 3/1000    0.003 or 0.3%

Tip 3: Estimating confidence intervals <strong>for</strong><br />

extreme proportions<br />

When reviewing journal articles, readers <strong>of</strong>ten encounter<br />

proportions with small numerators or with numerators very<br />

close in size to the denominators. Both situations raise the<br />

same issue. For example, an article might assert that a treatment<br />

is safe because no serious complications occurred in the<br />

20 patients who received it; another might claim near-perfect<br />

sensitivity <strong>for</strong> a test that correctly identified 29 out <strong>of</strong> 30<br />

cases <strong>of</strong> a disease. However, in many cases such articles do<br />

not present confidence intervals <strong>for</strong> these proportions.<br />

The first step <strong>of</strong> this tip is to learn the “rule <strong>of</strong> 3” <strong>for</strong><br />

zero numerators, 7 and the next step is to learn an extension<br />

(which might be called the “rule <strong>of</strong> 5, 7, 9 and 10”) <strong>for</strong> numerators<br />

<strong>of</strong> 1, 2, 3 and 4. 8<br />

Consider the following example. Twenty people undergo<br />

surgery, and none suffer serious complications. Does<br />

this result allow us to be confident that the true complication<br />

rate is very low, say less than 5% (1 out <strong>of</strong> 20)? What<br />

about 10% (2 out <strong>of</strong> 20)?<br />

You will probably appreciate that if the true complication rate were 5% (1 in 20), it wouldn’t be that unusual to observe no complications in a sample of 20 (this would happen in about 36% of such samples), but for increasingly higher true rates, the chance of observing no complications in a sample of 20 gets increasingly smaller.

What we are after is the upper limit <strong>of</strong> a 95% confidence<br />

interval <strong>for</strong> the proportion 0/20. The following is a<br />

simple rule <strong>for</strong> calculating this upper limit: if an event occurs<br />

0 times in n subjects, the upper boundary <strong>of</strong> the 95%<br />

confidence interval <strong>for</strong> the event rate is about 3/n (Table 2).<br />
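One way to see why 3/n works is to compare it with the true rate at which observing 0 events in n subjects would still happen 5% of the time, that is, the value of p solving (1 – p)^n = 0.05. The sketch below makes that comparison. Treating this 5% criterion as the definition of the upper limit is an assumption of the illustration; statistical software may report a slightly different limit depending on the method it uses. The function names are also assumptions.

```python
def rule_of_three(n):
    """Shortcut upper limit for an observed proportion of 0/n."""
    return 3 / n


def upper_limit_zero_events(n, alpha=0.05):
    """True rate p at which 0 events in n subjects has probability alpha.

    Solves (1 - p) ** n = alpha for p.
    """
    return 1 - alpha ** (1 / n)


for n in (20, 100, 300, 1000):
    print(f"n = {n}: 3/n = {rule_of_three(n):.4f}, "
          f"solved limit = {upper_limit_zero_events(n):.4f}")
```

The shortcut is slightly conservative for small samples and becomes very close as n grows, because –ln(0.05) is about 3.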

You can use the same formula when the observed proportion is 100%, by translating 100% into its complement. For example, imagine that the authors of a study on a diagnostic test report 100% sensitivity when the test is performed for 20 patients who have the disease. That means that the test identified all 20 with the disease as positive and identified none as falsely negative. You would like to know how low the sensitivity of the test could be, given that it was 100% for a sample of 20 patients. Using the 3/n rule for the proportion of false negatives (0 out of 20), we find that the proportion of false negatives could be as high as 15% (3 out of 20). Subtract this result from 100% to obtain the lower limit of the 95% confidence interval for the sensitivity (in this example, 85%).

What if the numerator is not zero but is still very small? There is a shortcut rule for small numerators other than zero (i.e., 1, 2, 3 or 4) (Table 3).

Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*

Observed numerator    Numerator for calculating approximate upper limit of 95% CI
0                     3
1                     5
2                     7
3                     9
4                     10

*For any observed numerator listed in the left hand column, divide the corresponding numerator in the right hand column by the number of study subjects to get the approximate upper limit of the 95% CI. For example, if the sample size is 15 and the observed numerator is 3, the upper limit of the 95% confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.

For example, out <strong>of</strong> 20 people receiving surgery imagine<br />

that 1 person suffers a serious complication, yielding an observed<br />

proportion <strong>of</strong> 1/20 or 5%. Using the corresponding<br />

value from Table 3 (i.e., 5) and the sample size, we find that<br />

the upper limit <strong>of</strong> the 95% confidence interval will be<br />

about 5/20 or 25%. If 2 <strong>of</strong> the 20 (10%) had suffered complications,<br />

the upper limit would be about 7/20, or 35%.<br />
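The shortcut in Table 3, together with the complement trick for proportions close to 100%, fits in a small lookup. The sketch below is illustrative only; the function names are assumptions, and the shortcut is defined only for observed numerators of 0 to 4.

```python
# Observed numerator -> numerator used for the approximate upper limit (Table 3).
UPPER_LIMIT_NUMERATOR = {0: 3, 1: 5, 2: 7, 3: 9, 4: 10}


def approx_upper_limit(events, n):
    """Approximate upper limit of the 95% CI for a proportion events/n."""
    if events not in UPPER_LIMIT_NUMERATOR:
        raise ValueError("shortcut applies only to numerators of 0 to 4")
    # A proportion cannot exceed 1, so cap the estimate for very small n.
    return min(1.0, UPPER_LIMIT_NUMERATOR[events] / n)


def approx_lower_limit_near_one(successes, n):
    """Approximate lower limit when successes is within 4 of n.

    Applies the shortcut to the complement (n - successes) and subtracts
    the result from 1.
    """
    return 1 - approx_upper_limit(n - successes, n)


print(approx_upper_limit(1, 20))            # 1 complication in 20 patients
print(approx_upper_limit(2, 20))            # 2 complications in 20 patients
print(approx_lower_limit_near_one(20, 20))  # sensitivity of 20/20
print(approx_lower_limit_near_one(29, 30))  # sensitivity of 29/30
```

For the test that correctly identified 29 of 30 cases, the shortcut suggests the sensitivity could plausibly be as low as about 83%, which is a useful check on a claim of near-perfect sensitivity.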

The bottom line<br />

Although statisticians (and statistical s<strong>of</strong>tware) can calculate<br />

95% confidence intervals, clinicians can readily estimate<br />

the upper boundary <strong>of</strong> confidence intervals <strong>for</strong> proportions<br />

with very small numerators. These estimates highlight the<br />

greater precision attained with larger sample sizes and help<br />

to calibrate intuitively derived confidence intervals.<br />

Conclusions<br />

Clinicians need to understand and interpret confidence<br />

intervals to properly use research results in making decisions.<br />

They can use thresholds, based on differences that<br />

patients are likely to consider important, to interpret confidence<br />

intervals and to judge whether the results are definitive<br />

or whether a larger study (with more patients and<br />

events) is necessary. For proportions with extremely small<br />

numerators, a simple rule is available <strong>for</strong> estimating the upper<br />

limit <strong>of</strong> the confidence interval.<br />

This article has been peer reviewed.<br />

From the Department <strong>of</strong> <strong>Medicine</strong>, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester,<br />

Minn. (Montori); the Hospital <strong>Medicine</strong> Unit, Division <strong>of</strong> General <strong>Medicine</strong>,<br />

Emory University, Atlanta, Ga. (Kleinbart); the Departments <strong>of</strong> Epidemiology and<br />

Biostatistics and <strong>of</strong> Pediatrics, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />

Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University<br />

Medical Center, Durham, NC (Keitz); the Columbia University College <strong>of</strong><br />

Physicians and Surgeons, New York, NY (Wyer); the Department <strong>of</strong> Pediatrics,<br />

University <strong>of</strong> Texas, Houston, Tex. (Moyer); and the Departments <strong>of</strong> <strong>Medicine</strong><br />

and <strong>of</strong> Clinical Epidemiology and Biostatistics, McMaster University, Hamilton,<br />

Ont. (Guyatt)<br />

Competing interests: None declared.<br />

Contributors: Victor Montori, as principal author, decided on the structure and<br />

flow <strong>of</strong> the article, and oversaw and contributed to the writing <strong>of</strong> the manuscript.<br />

Jennifer Kleinbart reviewed the manuscript at all phases <strong>of</strong> development and contributed<br />

to the writing <strong>of</strong> tip 1. Thomas Newman developed the original idea <strong>for</strong><br />

tip 3 and reviewed the manuscript at all phases <strong>of</strong> development. Sheri Keitz used<br />

all <strong>of</strong> the tips as part <strong>of</strong> a live teaching exercise and submitted comments, suggestions<br />

and the possible variations that are described in the article. Peter Wyer reviewed<br />

and revised the final draft <strong>of</strong> the manuscript to achieve uni<strong>for</strong>m adherence<br />

with <strong>for</strong>mat specifications. Virginia Moyer reviewed and revised the final draft <strong>of</strong><br />

the manuscript to improve clarity and style. Gordon Guyatt developed the original<br />

ideas <strong>for</strong> tips 1 and 2, reviewed the manuscript at all phases <strong>of</strong> development, contributed<br />

to the writing as coauthor, and reviewed and revised the final draft <strong>of</strong> the<br />

manuscript to achieve accuracy and consistency <strong>of</strong> content as general editor.<br />

References<br />


1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. <strong>Tips</strong> <strong>for</strong><br />

learners <strong>of</strong> evidence-based medicine: 1. Relative risk reduction, absolute risk<br />

reduction and number needed to treat. CMAJ 2004;171(4):353-8.<br />

2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results:<br />

hypothesis testing. In: Guyatt G, Rennie D, editors. Users’ guides to the<br />

medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />

Press; 2002. p. 329-38.<br />

3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results:<br />

confidence intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the<br />

medical literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA<br />

Press; 2002. p. 339-49.<br />

4. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />

<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series<br />

[editorial]. CMAJ 2004;171(4):347-8.<br />

5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the<br />

center: in our practice, and in our use <strong>of</strong> language. ACP J Club 2004;140:A11-2.<br />

6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures<br />

<strong>of</strong> association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical<br />

literature: a manual <strong>of</strong> evidence-based clinical practice. Chicago: AMA Press;<br />

2002. p. 351-68.<br />

7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right?<br />

Interpreting zero numerators. JAMA 1983;249:1743-5.<br />

8. Newman TB. If almost nothing goes wrong, is almost everything all right?<br />

[letter]. JAMA 1995;274:1013.<br />

Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working<br />

Group: Peter C. Wyer (project director), College <strong>of</strong> Physicians and<br />

Surgeons, Columbia University, New York, NY; Deborah Cook,<br />

Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,<br />

McMaster University, Hamilton, Ont.; Rose Hatala (internal<br />

review coordinator), University <strong>of</strong> British Columbia, Vancouver,<br />

BC; Robert Hayward (editor, online version), Bruce Fisher,<br />

University <strong>of</strong> Alberta, Edmonton, Alta.; Sheri Keitz (field test<br />

coordinator), Durham Veterans Affairs Medical Center and Duke<br />

University Medical Center, Durham, NC; Alexandra Barratt,<br />

University <strong>of</strong> Sydney, Sydney, Australia; Pamela Charney, Albert<br />

Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY; Antonio L. Dans,<br />

University <strong>of</strong> the Philippines College <strong>of</strong> <strong>Medicine</strong>, Manila, The<br />

Philippines; Barnet Eskin, Morristown Memorial Hospital,<br />

Morristown, NJ; Jennifer Kleinbart, Emory University School <strong>of</strong><br />

<strong>Medicine</strong>, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />

Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />

McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />

Montori, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester, Minn.;<br />

Virginia Moyer, University <strong>of</strong> Texas, Houston, Tex.; Thomas B.<br />

Newman, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />

Calif.; Jim Nishikawa, University <strong>of</strong> Ottawa, Ottawa, Ont.;<br />

Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;<br />

W. Scott Richardson, Wright State University, Dayton, Ohio; Mark<br />

C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />

Pelham NY 10803, USA; fax 212 305-6792; pwyer@worldnet.att.net<br />

Articles to date in this series<br />

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S,<br />

et al. Tips for learners of evidence-based medicine: 1.<br />

Relative risk reduction, absolute risk reduction and<br />

number needed to treat. CMAJ 2004;171(4):353-8.<br />

CMAJ SEPT. 14, 2004; 171 (6) 615


Correspondance<br />

ical journals [editorial]. CMAJ 1984;130:1412.<br />

11. Bero LA, Galbraith A, Rennie D. The publication<br />

<strong>of</strong> sponsored symposiums in medical journals.<br />

N Engl J Med 1992;327:1135-40.<br />

Competing interests: None declared.<br />

DOI:10.1503/cmaj.1041329<br />

Online access to a<br />

<strong>for</strong>-pr<strong>of</strong>it CMAJ<br />

Wayne Kondro, quoting CMA Secretary-General<br />

Bill Tholl, reports<br />

that “Physicians will continue to receive<br />

their free subscription to CMAJ as a benefit<br />

<strong>of</strong> association membership ‘<strong>for</strong> the<br />

<strong>for</strong>eseeable future’” after CMA Publications<br />

is sold to CMA Holdings in January<br />

2004. 1 That’s all to the good — but what<br />

then <strong>of</strong> CMAJ’s worldwide readers? Will<br />

access to CMAJ remain free <strong>for</strong> all online<br />

users, despite the shift to <strong>for</strong>-pr<strong>of</strong>it status?<br />

I found it strange that this issue was not<br />

addressed in Kondro’s news article.<br />

Adam L. Scheffler<br />

Independent researcher<br />

Chicago, Ill.<br />

Reference<br />

1. Kondro W. CMAJ enters <strong>for</strong>-pr<strong>of</strong>it market.<br />

CMAJ 2004;171(11):1334.<br />

DOI:10.1503/cmaj.1041759<br />

[Editor’s note]<br />

CMAJ’s editors have addressed the<br />

topic <strong>of</strong> open access in this issue’s<br />

Editorial (see page 149).<br />

DOI:10.1503/cmaj.1041760<br />

Correction<br />

In part 2 <strong>of</strong> the series “<strong>Tips</strong> <strong>for</strong> learners<br />

<strong>of</strong> evidence-based medicine” 1 the<br />

in<strong>for</strong>mation in Fig. 1 did not fully correspond<br />

with the in<strong>for</strong>mation provided in<br />

the text. Specifically, the data for hypothetical<br />

trial 2 in Fig. 1B should have<br />

been centred at 5% absolute risk reduction,<br />

as described in the text; instead, the<br />

figure showed trial 2 as being centred at<br />

about 6.5% absolute risk reduction. The<br />

corrected figure is presented here.<br />

162 JAMC • 18 JANV. 2005; 172 (2)<br />

[Fig. 1 (corrected), panels A, B and C: bell curves for hypothetical trials 1 to 4 plotted against % absolute risk reduction (axis from -5 to 5; values below 0 labelled “Treatment harms”, values above 0 labelled “Treatment helps”).]<br />

Reference<br />

1. Montori VM, Kleinbart J, Newman TB, Keitz S,<br />

Wyer PC, Moyer V, et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong><br />

evidence-based medicine: 2. Measures <strong>of</strong> precision<br />

(confidence intervals). CMAJ 2004;171(6):<br />

611-5.<br />

DOI:10.1503/cmaj.1041761<br />


Fig. 1: Results <strong>of</strong> 4 hypothetical trials. For the medical condition under investigation,<br />

an absolute risk reduction <strong>of</strong> 1% (double vertical rule) is the smallest benefit<br />

that patients would consider important enough to warrant undergoing treatment. In<br />

each case, the uppermost point <strong>of</strong> the bell curve is the observed treatment effect<br />

(the point estimate), and the tails <strong>of</strong> the bell curve represent the boundaries <strong>of</strong> the<br />

95% confidence interval. See the text 1 <strong>for</strong> further explanation.


DOI:10.1503/cmaj.1031981<br />

<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />

3. Measures <strong>of</strong> observer variability (kappa statistic)<br />

Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig,<br />

Gordon Guyatt, <strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group<br />

Imagine that you’re a busy family physician and that<br />

you’ve found a rare free moment to scan the recent literature.<br />

Reviewing your preferred digest <strong>of</strong> abstracts,<br />

you notice a study comparing emergency physicians’ interpretation<br />

<strong>of</strong> chest radiographs with radiologists’ interpretations.<br />

1 The article catches your eye because you have frequently<br />

found that your own reading <strong>of</strong> a radiograph differs<br />

from both the <strong>of</strong>ficial radiologist reading and an un<strong>of</strong>ficial<br />

reading by a different radiologist, and you’ve wondered<br />

about the extent <strong>of</strong> this disagreement and its implications.<br />

Looking at the abstract, you find that the authors have reported<br />

the extent <strong>of</strong> agreement using the κ statistic. You recall<br />

that κ stands <strong>for</strong> “kappa” and that you have encountered this<br />

measure <strong>of</strong> agreement be<strong>for</strong>e, but your grasp <strong>of</strong> its meaning<br />

remains tentative. You there<strong>for</strong>e choose to take a quick glance<br />

at the authors’ conclusions as reported in the abstract and to<br />

defer downloading and reviewing the full text <strong>of</strong> the article.<br />

Practitioners, such as the family physician just described,<br />

may benefit from understanding measures <strong>of</strong> observer variability.<br />

For many studies in the medical literature, clinician<br />

readers will be interested in the extent <strong>of</strong> agreement among<br />

multiple observers. For example, do the investigators in a<br />

clinical study agree on the presence or absence <strong>of</strong> physical,<br />

radiographic or laboratory findings? Do investigators involved<br />

in a systematic overview agree on the validity <strong>of</strong> an<br />

article, or on whether the article should be included in the<br />

analysis? In perusing these types <strong>of</strong> studies, where investigators<br />

are interested in quantifying agreement, clinicians<br />

will <strong>of</strong>ten come across the kappa statistic.<br />

In this article we present tips aimed at helping clinical<br />

learners to use the concepts <strong>of</strong> kappa when applying diagnostic<br />

tests in practice. The tips presented here have been<br />

adapted from approaches developed by educators experienced<br />

in teaching evidence-based medicine skills to clinicians.<br />

2 A related article, intended for people who teach<br />

these concepts to clinicians, is available online at<br />

www.cmaj.ca/cgi/content/full/171/11/1369/DC1.<br />

Clinician learners’ objectives<br />

Defining the importance <strong>of</strong> kappa<br />

• Understand the difference between measuring agreement<br />

and measuring agreement beyond chance.<br />

• Understand the implications <strong>of</strong> different values <strong>of</strong> kappa.<br />

Calculating kappa<br />


• Understand the basics <strong>of</strong> how the kappa score is<br />

calculated.<br />

• Understand the importance <strong>of</strong> “chance agreement” in<br />

estimating kappa.<br />

Calculating chance agreement<br />

• Understand how to calculate the kappa score given different<br />

distributions <strong>of</strong> positive and negative results.<br />

• Understand that the more extreme the distributions <strong>of</strong><br />

positive and negative results, the greater the agreement<br />

that will occur by chance alone.<br />

• Understand how to calculate chance agreement, agreement<br />

beyond chance and kappa <strong>for</strong> any set <strong>of</strong> assessments<br />

by 2 observers.<br />

Tip 1: Defining the importance <strong>of</strong> kappa<br />

A common stumbling block <strong>for</strong> clinicians is the basic<br />

concept <strong>of</strong> agreement beyond chance and, in turn, the importance<br />

<strong>of</strong> correcting <strong>for</strong> chance agreement. People making<br />

a decision on the basis <strong>of</strong> presence or absence <strong>of</strong> an element<br />

<strong>of</strong> the physical examination, such as Murphy’s sign,<br />

will sometimes agree simply by chance. The kappa statistic<br />

corrects <strong>for</strong> this chance agreement and tells us how much<br />

<strong>of</strong> the possible agreement over and above chance the reviewers<br />

have achieved.<br />

A simple example should help to clarify the importance<br />

<strong>of</strong> correcting <strong>for</strong> chance agreement. Two radiologists independently<br />

read the same 100 mammograms. Reader 1 is<br />

having a bad day and reads all the films as negative without<br />

looking at them in great detail. Reader 2 reads the<br />

films more carefully and identifies 4 of the 100 mammograms<br />

as positive (suspicious for malignancy). How would<br />

you characterize the level of agreement between these 2<br />

radiologists?<br />

Teachers of evidence-based medicine:<br />

See the “Tips for teachers” version of this article online<br />

at www.cmaj.ca/cgi/content/full/171/11/1369/DC1. It<br />

contains the exercises found in this article in fill-in-the-blank<br />

format, commentaries from the authors on the<br />

challenges they encounter when teaching these concepts<br />

to clinician learners and links to useful online resources.<br />

CMAJ • NOV. 23, 2004; 171 (11) 1369<br />

© 2004 Canadian Medical Association or its licensors<br />

The percent agreement between them is 96%, even<br />

though one <strong>of</strong> the readers has, on cursory review, decided<br />

to call all <strong>of</strong> the results negative. Hence, measuring the<br />

simple percent agreement overestimates the degree <strong>of</strong> clinically<br />

important agreement in a fashion that is misleading.<br />

The role <strong>of</strong> kappa is to indicate how much the 2 observers<br />

agree beyond the level <strong>of</strong> agreement that could be expected<br />

by chance. Table 1 presents a rating system that is commonly<br />

used as a guideline <strong>for</strong> evaluating kappa scores.<br />
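To make the mammogram example concrete, here is a minimal Python sketch that computes the simple percent agreement, the agreement expected by chance and the resulting kappa for the 2 readers. It assumes that chance agreement is derived from each reader's own proportion of positive calls (the usual convention for kappa with 2 observers), and the function name `agreement_summary` and the choice of which 4 films Reader 2 calls positive are hypothetical.

```python
def agreement_summary(reader1, reader2):
    """Return (observed agreement, chance agreement, kappa) as fractions of 1.

    reader1 and reader2 are equal-length lists of booleans (True = positive call).
    """
    if len(reader1) != len(reader2) or not reader1:
        raise ValueError("both readers must rate the same, non-empty set of films")
    n = len(reader1)
    observed = sum(a == b for a, b in zip(reader1, reader2)) / n
    p1 = sum(reader1) / n  # proportion of films reader 1 calls positive
    p2 = sum(reader2) / n  # proportion of films reader 2 calls positive
    # Agreement expected if the two sets of calls were unrelated.
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    if chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return observed, chance, (observed - chance) / (1 - chance)


# Reader 1 calls every film negative; Reader 2 calls 4 of 100 positive.
reader1 = [False] * 100
reader2 = [True] * 4 + [False] * 96
observed, chance, kappa = agreement_summary(reader1, reader2)
```

In this sketch the observed agreement and the chance agreement both come out at about 96%, so kappa is essentially 0: none of the impressive-looking agreement goes beyond what chance alone would produce.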

Purely to illustrate the range <strong>of</strong> kappa scores that readers<br />

can expect to encounter, Table 2 gives some examples <strong>of</strong><br />

commonly reported assessments and the kappa scores that<br />

resulted when investigators studied their reproducibility.<br />

The bottom line<br />

If clinicians neglect the possibility <strong>of</strong> chance agreement,<br />

they will come to misleading conclusions about the reproducibility<br />

<strong>of</strong> clinical tests. The kappa statistic allows us to<br />

measure agreement above and beyond that expected by<br />

chance alone. Examples <strong>of</strong> kappa scores <strong>for</strong> frequently ordered<br />

tests sometimes show surprisingly poor levels <strong>of</strong><br />

agreement beyond chance.<br />

Table 1: Qualitative classification of kappa values as degree of agreement beyond chance 3<br />

Kappa value | Degree of agreement beyond chance<br />

0 | None<br />

0–0.2 | Slight<br />

0.2–0.4 | Fair<br />

0.4–0.6 | Moderate<br />

0.6–0.8 | Substantial<br />

0.8–1.0 | Almost perfect<br />

Table 2: Representative kappa values for common tests and clinical assessments<br />

Assessment | Kappa value<br />

Interpretation of T wave changes on an exercise stress test 4 | 0.25<br />

Presence of jugular venous distension 5 | 0.56<br />

Detection of alcohol dependence using CAGE questionnaire 6 | 0.75<br />

Presence of goitre 7 | 0.82–0.95<br />

Bone marrow interpretation by hematologist 8 | 0.84<br />

Straight leg raising test 9 | 0.82<br />

Diagnosis of pulmonary embolus by helical CT 10 | 0.82<br />

Diagnosis of lower extremity arterial disease by arteriography 11 | 0.39–0.64<br />


Tip 2: Calculating kappa<br />

What is the maximum potential <strong>for</strong> agreement between<br />

2 observers doing a clinical assessment, such as<br />

presence or absence <strong>of</strong> Murphy’s sign in patients with<br />

abdominal pain? In Fig. 1, the upper horizontal bar represents<br />

100% agreement between 2 observers. For the hypothetical<br />

situation represented in the figure, the estimated<br />

chance agreement between the 2 observers is 50%.<br />

This would occur if, <strong>for</strong> example, each <strong>of</strong> the 2 observers<br />

randomly called half <strong>of</strong> the assessments positive. Given<br />

this in<strong>for</strong>mation, what is the possible agreement beyond<br />

chance?<br />

The vertical line in Fig. 1 intersects the horizontal bars<br />

at the 50% point that we identified as the expected agreement<br />

by chance. All agreement to the right <strong>of</strong> this line corresponds<br />

to agreement beyond chance. Hence the maximum<br />

agreement beyond chance is 50% (100% – 50%).<br />

The other number you need to calculate the kappa score<br />

is the degree <strong>of</strong> agreement beyond chance. The observed<br />

agreement, as shown by the lower horizontal bar in Fig. 1,<br />

is 75%, so the degree <strong>of</strong> agreement beyond chance is 25%<br />

(75% – 50%).<br />

Kappa is calculated as the observed agreement beyond<br />

chance (25%) divided by the maximum agreement beyond<br />

chance (50%); here, kappa is 0.50.<br />
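The arithmetic in this tip can be sketched in a few lines of Python. The function names `kappa` and `describe` are hypothetical; `describe` maps a score to the labels of Table 1, and because the bands in that table share their boundary values, the sketch assumes that a boundary value belongs to the lower band and that scores of 0 or less are labelled “None”.

```python
def kappa(observed_agreement: float, chance_agreement: float) -> float:
    """Observed agreement beyond chance divided by maximum agreement beyond chance."""
    if not (0 <= observed_agreement <= 1 and 0 <= chance_agreement < 1):
        raise ValueError("agreements are fractions of 1; chance must be below 1")
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


def describe(kappa_value: float) -> str:
    """Label a kappa score using the bands of Table 1 (boundary handling is an assumption)."""
    if kappa_value <= 0:
        return "None"  # Table 1 lists only 0; negative scores are treated the same here
    bands = ((0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
             (0.8, "Substantial"), (1.0, "Almost perfect"))
    for upper, label in bands:
        if kappa_value <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


score = kappa(0.75, 0.50)  # observed 75%, chance 50%
label = describe(score)
```

For the situation in Fig. 1, `score` is 0.5 and `label` is "Moderate".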

[Fig. 1 diagram: agreement expected by chance, 50%; possible agreement above chance, 50%; observed agreement, 75%; observed agreement above chance, 25%; kappa = 25/50 = 0.5 (moderate agreement).]<br />

Fig. 1: Two observers independently assess the presence or<br />

absence <strong>of</strong> a finding or outcome. Each observer determines<br />

that the finding is present in exactly 50% <strong>of</strong> the subjects. Their<br />

assessments agree in 75% <strong>of</strong> the cases. The yellow horizontal<br />

bar represents potential agreement (100%), and the turquoise<br />

bar represents actual agreement. The portion <strong>of</strong> each coloured<br />

bar that lies to the left <strong>of</strong> the dotted vertical line represents the<br />

agreement expected by chance (50%). The observed agreement<br />

above chance is half <strong>of</strong> the possible agreement above<br />

chance. The ratio <strong>of</strong> these 2 numbers is the kappa score.


The bottom line<br />

Kappa allows us to measure agreement above and beyond<br />

that expected by chance alone. We calculate kappa by<br />

estimating the chance agreement and then comparing the<br />

observed agreement beyond chance with the maximum<br />

possible agreement beyond chance.<br />

Tip 3: Calculating chance agreement<br />

A conceptual understanding <strong>of</strong> kappa may still leave the<br />

actual calculations a mystery. The following example is intended<br />

<strong>for</strong> those who desire a more complete understanding<br />

<strong>of</strong> the kappa statistic.<br />

Let us assume that 2 hopeless clinicians are assessing the<br />

presence <strong>of</strong> Murphy’s sign in a group <strong>of</strong> patients. They<br />

have no idea what they are doing, and their evaluations are<br />

no better than blind guesses. Let us say they are each<br />

guessing the presence and absence <strong>of</strong> Murphy’s sign in a<br />

50:50 ratio: half the time they guess that Murphy’s sign is<br />

present, and the other half that it is absent. If you were<br />

completing a 2 × 2 table, with these 2 clinicians evaluating<br />

the same 100 patients, how would the cells, on average, get<br />

filled in?<br />

Fig. 2 represents the completed 2 × 2 table. Guessing at<br />

random, the 2 hopeless clinicians have agreed on the assessments<br />

<strong>of</strong> 50% <strong>of</strong> the patients. How did we arrive at the<br />

numbers shown in the table? According to the laws <strong>of</strong><br />

chance, each clinician guesses that half <strong>of</strong> the 50 patients<br />

assessed as positive by the other clinician (i.e., 25 patients)<br />

have Murphy’s sign.<br />

How would this exercise work if the same 2 hopeless<br />

clinicians were to randomly guess that 60% <strong>of</strong> the patients<br />

had a positive result <strong>for</strong> Murphy’s sign? Fig. 3 provides the<br />

answer in this situation. The clinicians would agree <strong>for</strong> 52<br />

<strong>of</strong> the 100 patients (or 52% <strong>of</strong> the time) and would disagree<br />

for 48 of the patients. In a similar way, using 2 × 2 tables<br />

for higher and higher positive proportions (i.e., how often<br />

the observer makes the diagnosis), you can figure out how<br />

often the observers will, on average, agree by chance alone<br />

(as delineated in Table 3).<br />

 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />

Clinician 2: sign present | 25 | 25 | 50<br />

Clinician 2: sign absent | 25 | 25 | 50<br />

Total | 50 | 50 |<br />

Fig. 2: Agreement table for 2 hopeless clinicians who randomly<br />

guess whether Murphy’s sign is present or absent in 100 patients<br />

with abdominal pain. Each clinician determines that half<br />

of the patients have a positive result. The numbers in each box<br />

reflect the number of patients in each agreement category.<br />

At this point, we have demonstrated 2 things. First, even<br />

if the reviewers have no idea what they are doing, there will<br />

be substantial agreement by chance alone. Second, the<br />

magnitude of the agreement by chance increases as the<br />

proportion of positive (or negative) assessments moves further from 50%.<br />
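The chance-agreement figures in Table 3 can be reproduced with a short Python sketch. It assumes, as in Figs. 2 and 3, that both observers guess independently and call the same proportion of patients positive; the function name `chance_agreement` and the variable `table3` are hypothetical.

```python
def chance_agreement(proportion_positive: float) -> float:
    """Expected agreement when 2 observers guess at random with the same positive rate."""
    p = proportion_positive
    # They agree when both say "present" (p * p) or both say "absent" ((1 - p) * (1 - p)).
    return p * p + (1 - p) * (1 - p)


# Proportion positive (%) -> agreement by chance (%), rounded to whole percentages.
table3 = {pct: round(100 * chance_agreement(pct / 100)) for pct in (50, 60, 70, 80, 90)}
```

Because the expression is symmetric in p and 1 − p, a positive rate of 10% yields the same chance agreement as a rate of 90%, which is why extreme distributions in either direction inflate agreement by chance.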

But how can we calculate kappa when the clinicians<br />

whose assessments are being compared are no longer<br />

“hopeless,” in other words, when their assessments reflect a<br />

level <strong>of</strong> expertise that one might actually encounter in practice?<br />

It’s not very hard.<br />

Let’s take a simple example, returning to the premise<br />

that each <strong>of</strong> the 2 clinicians assesses Murphy’s sign as being<br />

present in 50% <strong>of</strong> the patients. Here, we assume that<br />

the 2 clinicians now have some knowledge <strong>of</strong> Murphy’s<br />

sign and their assessments are no longer random. Each<br />

decides that 50% <strong>of</strong> the patients have Murphy’s sign and<br />

50% do not, but they still don’t agree on every patient.<br />

Rather, <strong>for</strong> 40 patients they agree that Murphy’s sign is<br />

present, and <strong>for</strong> 40 patients they agree that Murphy’s sign<br />

is absent. Thus, they agree on the diagnosis <strong>for</strong> 80% <strong>of</strong><br />

the patients, and they disagree <strong>for</strong> 20% <strong>of</strong> the patients<br />

(see Fig. 4A). How do we calculate the kappa score in this<br />

situation?<br />

Recall that if each clinician found that 50% <strong>of</strong> the patients<br />

had Murphy’s sign but their decision about the presence <strong>of</strong><br />

the sign in each patient was random, the clinicians would be<br />

in agreement 50% <strong>of</strong> the time, each cell <strong>of</strong> the 2 × 2 table<br />

would have 25 patients (as shown in Fig. 2), chance agreement<br />

would be 50%, and maximum agreement beyond<br />

chance would also be 50%.<br />

 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />

Clinician 2: sign present | 36 | 24 | 60<br />

Clinician 2: sign absent | 24 | 16 | 40<br />

Total | 60 | 40 |<br />

Fig. 3: As in Fig. 2, the 2 clinicians again guess at random<br />

whether Murphy’s sign is present or absent. However, each<br />

clinician now guesses that the sign is present in 60 of the 100<br />

patients. Under these circumstances, of the 60 patients for<br />

whom clinician 1 guesses that the sign is present, clinician 2<br />

guesses that it is present in 60%; 60% of 60 is 36 patients. Of<br />

the 60 patients for whom clinician 1 guesses that the sign is<br />

present, clinician 2 guesses that it is absent in 40%; 40% of 60<br />

is 24 patients. Of the 40 patients for whom clinician 1 guesses<br />

that the sign is absent, clinician 2 guesses that it is present in<br />

60%; 60% of 40 is 24 patients. Of the 40 patients for whom<br />

clinician 1 guesses that the sign is absent, clinician 2 guesses<br />

that it is absent in 40%; 40% of 40 is 16 patients.<br />

The no-longer-hopeless clinicians’ agreement on 80%<br />

<strong>of</strong> the patients is there<strong>for</strong>e 30% above chance. Kappa is a<br />

comparison <strong>of</strong> the observed agreement above chance with<br />

the maximum agreement above chance: 30%/50% = 60%<br />

<strong>of</strong> the possible agreement above chance, which gives these<br />

clinicians a kappa <strong>of</strong> 0.6, as shown in Fig. 4B.<br />

A<br />

 | Clinician 1: sign present | Clinician 1: sign absent<br />

Clinician 2: sign present | 40 | 10<br />

Clinician 2: sign absent | 10 | 40<br />

B<br />

 | Clinician 1: sign present | Clinician 1: sign absent | Total<br />

Clinician 2: sign present | 40 (25) | 10 (25) | 50<br />

Clinician 2: sign absent | 10 (25) | 40 (25) | 50<br />

Total | 50 | 50 |<br />

κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)<br />

= (80% – 50%) ÷ (100% – 50%)<br />

= 30% ÷ 50%<br />

= 0.6<br />

Fig. 4: Two clinicians who have been trained to assess Murphy’s<br />

sign in patients with abdominal pain do an actual assessment<br />

on 100 patients. A: A 2 × 2 table reflecting actual agreement<br />

between the 2 clinicians. B: A 2 × 2 table illustrating the<br />

correct approach to determining the kappa score. The numbers<br />

in parentheses correspond to the results that would be expected<br />

were each clinician randomly guessing that half of the<br />

patients had a positive result (as in Fig. 2).<br />

Table 3: Chance agreement when 2 observers randomly assign positive<br />

and negative results, for successively higher rates of a positive call<br />

Proportion positive (%) | Agreement by chance (%)<br />

50 | 50<br />

60 | 52<br />

70 | 58<br />

80 | 68<br />

90 | 82<br />

Formula for calculating kappa<br />

(Observed agreement – agreement expected by chance) ÷<br />

(100% – agreement expected by chance)<br />

Another way of expressing this formula:<br />

(Observed agreement beyond chance) ÷ (maximum<br />

possible agreement beyond chance)<br />

Hence, to calculate kappa when only 2 alternatives are<br />

possible (e.g., presence or absence <strong>of</strong> a finding), you need<br />

just 2 numbers: the percentage <strong>of</strong> patients that the 2 assessors<br />

agreed on and the expected agreement by chance.<br />

Both can be determined by constructing a 2 × 2 table exactly<br />

as illustrated above.<br />
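The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: the function name `cohen_kappa` is our own, and chance agreement is derived from the row and column totals of the table, which for the Fig. 4 data reproduces the expected counts of 25 per cell.

```python
def cohen_kappa(table):
    """Return (observed, expected, kappa) for a square agreement table.

    table[i][j] is the number of patients placed in category i by one
    assessor and in category j by the other. Proportions are returned
    on a 0-1 scale rather than as percentages.
    """
    n = sum(sum(row) for row in table)
    k = len(table)

    # Observed agreement: proportion of patients on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n

    # Agreement expected by chance, from the marginal totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2

    # Kappa is undefined when chance agreement is already 100%.
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Data from Fig. 4A: both clinicians agree on 40 + 40 of 100 patients.
fig_4a = [[40, 10], [10, 40]]
observed, expected, kappa = cohen_kappa(fig_4a)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
# prints: observed=0.80 expected=0.50 kappa=0.60
```

The same function accepts tables with more than 2 categories, although the article deals only with the 2-alternative case.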

The bottom line<br />

Chance agreement is not always 50%; rather, it varies<br />

from one clinical situation to another. When the prevalence<br />

<strong>of</strong> a disease or outcome is low, 2 observers will guess<br />

that most patients are normal and the symptom <strong>of</strong> the disease<br />

is absent. This situation will lead to a high percentage<br />

<strong>of</strong> agreement simply by chance. When the prevalence is<br />

high, there will also be high apparent agreement, with most<br />

patients judged to exhibit the symptom. Kappa measures<br />

the agreement after correcting <strong>for</strong> this variable degree <strong>of</strong><br />

chance agreement.<br />
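The dependence of chance agreement on the rate of positive calls, as tabulated in Table 3, can be reproduced with a short sketch. It assumes, as the table does, that the 2 observers guess independently and call a result positive at the same rate; the function name `chance_agreement` is hypothetical.

```python
def chance_agreement(p_positive):
    """Chance agreement when 2 observers independently call a result
    positive with the same probability p_positive (0-1 scale)."""
    both_positive = p_positive * p_positive
    both_negative = (1 - p_positive) * (1 - p_positive)
    return both_positive + both_negative


# Reproduce the rows of Table 3 (values in percent).
for pct in (50, 60, 70, 80, 90):
    print(pct, round(100 * chance_agreement(pct / 100)))
```

Because the expression is symmetric, a positive-call rate of 10% gives the same chance agreement as a rate of 90%, which is why both low and high prevalence inflate raw agreement.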

Conclusions<br />


Armed with this understanding <strong>of</strong> kappa as a measure <strong>of</strong><br />

agreement between different observers, you are able to return<br />

to the study <strong>of</strong> agreement in chest radiography interpretations<br />

between emergency physicians and radiologists 1<br />

in a more in<strong>for</strong>med fashion. You learn from the abstract<br />

that the kappa score <strong>for</strong> overall agreement between the 2<br />

classes <strong>of</strong> practitioners was 0.40, with a 95% confidence<br />

interval ranging from 0.35 to 0.46. This means that the<br />

agreement between emergency physicians and radiologists<br />

represented 40% <strong>of</strong> the potentially achievable agreement<br />

beyond chance. You understand that this kappa score<br />

would be conventionally considered to represent fair to<br />

moderate agreement but is inferior to many <strong>of</strong> the kappa<br />

values listed in Table 2. You are now much more confident<br />

about going to the full text <strong>of</strong> the article to review the<br />

methods and assess the clinical applicability <strong>of</strong> the results to<br />

your own patients.<br />

The ability to understand measures <strong>of</strong> variability in data<br />

presented in clinical trials and systematic reviews is an important<br />

skill <strong>for</strong> clinicians. We have presented a series <strong>of</strong><br />

tips developed and used by experienced teachers <strong>of</strong> evidence-based<br />

medicine <strong>for</strong> the purpose <strong>of</strong> facilitating such<br />

understanding.


This article has been peer reviewed.<br />

From the Department <strong>of</strong> <strong>Medicine</strong>, Division <strong>of</strong> General Internal <strong>Medicine</strong><br />

(McGinn), and the Department <strong>of</strong> Geriatrics (Leipzig), Mount Sinai Medical Center,<br />

New York, NY; the Columbia University College <strong>of</strong> Physicians and Surgeons,<br />

New York, NY (Wyer); the Departments <strong>of</strong> Epidemiology and Biostatistics and <strong>of</strong><br />

Pediatrics, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco, Calif. (Newman);<br />

Durham Veterans Affairs Medical Center and Duke University Medical<br />

Center, Durham, NC (Keitz); and the Departments <strong>of</strong> <strong>Medicine</strong> and <strong>of</strong> Clinical<br />

Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)<br />

Competing interests: None declared.<br />

Contributors: Thomas McGinn developed the original idea <strong>for</strong> tips 1 and 2 and, as<br />

principal author, oversaw and contributed to the writing <strong>of</strong> the manuscript.<br />

Thomas Newman and Rosanne Leipzig reviewed the manuscript at all phases of<br />

development and contributed to the writing as coauthors. Sheri Keitz used all <strong>of</strong><br />

the tips as part <strong>of</strong> a live teaching exercise and submitted comments, suggestions<br />

and the possible variations that are described in the article. Peter Wyer reviewed<br />

and revised the final draft <strong>of</strong> the manuscript to achieve uni<strong>for</strong>m adherence with<br />

<strong>for</strong>mat specifications. Gordon Guyatt developed the original idea <strong>for</strong> tip 3, reviewed<br />

the manuscript at all phases <strong>of</strong> development, contributed to the writing as a<br />

coauthor, and, as general editor, reviewed and revised the final draft <strong>of</strong> the manuscript<br />

to achieve accuracy and consistency <strong>of</strong> content.<br />

References<br />

1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs<br />

in the emergency department: Is the radiologist really necessary? Postgrad<br />

Med J 2003;79:214-7.<br />

2. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />

<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series<br />

[editorial]. CMAJ 2004;171(4):347-8.<br />

3. Maclure M, Willett WC. Misinterpretation and misuse <strong>of</strong> the kappa statistic.<br />

Am J Epidemiol 1987;126:161-9.<br />

4. Blackburn H. The exercise electrocardiogram: differences in interpretation.<br />

Report <strong>of</strong> a technical group on exercise electrocardiography. Am J Cardiol<br />

1968;21:871-80.<br />

5. Cook DJ. Clinical assessment <strong>of</strong> central venous pressure in the critically ill.<br />

Am J Med Sci 1990;299:175-8.<br />

6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between<br />

CAGE interviews and written CAGE questionnaires? Alcohol Clin Exp Res<br />

2000;24:733-6.<br />

7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey <strong>of</strong> thyroid enlargement<br />

in two general practices in Great Britain. BMJ 1963;1:29-34.<br />

8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis<br />

<strong>of</strong> iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9.<br />

9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award<br />

in clinical sciences. Reproducibility <strong>of</strong> physical signs in low-back pain. Spine<br />

1989;14:908-18.<br />

10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et<br />

al. Per<strong>for</strong>mance <strong>of</strong> helical computed tomography in unselected outpatients<br />

with suspected pulmonary embolism. Ann Intern Med 2001;135:88-97.<br />

11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ.<br />

Interobserver variation in interpretation <strong>of</strong> arteriography and management <strong>of</strong><br />

severe lower leg arterial disease. Eur J Vasc Endovasc Surg 2001;21:417-22.<br />

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 914 738-9368; pwyer@att.net<br />

Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa<br />

Articles to date in this series<br />

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.<br />

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.<br />



DOI:10.1503/cmaj.1031920<br />

Tips for learners of evidence-based medicine:<br />

4. Assessing heterogeneity of primary studies<br />

in systematic reviews and whether to combine<br />

their results<br />

Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine<br />

Teaching Tips Working Group<br />

Clinicians wishing to quickly answer a clinical question<br />

may seek a systematic review, rather than searching<br />

<strong>for</strong> primary articles. Such a review is also called a<br />

meta-analysis when the investigators have used statistical<br />

techniques to combine results across studies. Databases useful for this purpose include the Cochrane Library (www.thecochranelibrary.com) and the ACP Journal Club (www.acpjc.org; use the search term “review”), both of which are available through personal or institutional subscription.<br />

Clinicians can use systematic reviews to guide clinical practice<br />

if they are able to understand and interpret the results.<br />

Systematic reviews differ from traditional reviews in that<br />

they are usually confined to a single focused question,<br />

which serves as the basis <strong>for</strong> systematic searching, selection<br />

and critical evaluation <strong>of</strong> the relevant research. 1 Authors <strong>of</strong><br />

systematic reviews use explicit methods to minimize bias<br />

and consider using statistical techniques to combine the results<br />

<strong>of</strong> individual studies. When appropriate, such pooling<br />

allows a more precise estimate <strong>of</strong> the magnitude <strong>of</strong> benefit<br />

or harm <strong>of</strong> a therapy. It may also increase the applicability<br />

<strong>of</strong> the result to a broader range <strong>of</strong> patient populations.<br />

Clinicians encountering a meta-analysis frequently find<br />

the pooling process mysterious. Specifically, they wonder<br />

how authors decide whether the ranges <strong>of</strong> patients, interventions<br />

and outcomes are too broad to sensibly pool the<br />

results <strong>of</strong> the primary studies.<br />

In this article we present an approach to evaluating potentially<br />

important differences in the results <strong>of</strong> individual<br />

studies being considered <strong>for</strong> a meta-analysis. These differences<br />

are frequently referred to as heterogeneity. 1 Our discussion<br />

focuses on the qualitative, rather than the statistical,<br />

assessment <strong>of</strong> heterogeneity (see Box 1).<br />

Two concepts are commonly implied in the assessment<br />

<strong>of</strong> heterogeneity. The first is an assessment <strong>for</strong> heterogeneity<br />

within 4 key elements <strong>of</strong> the design <strong>of</strong> the original studies:<br />

the patients, interventions, outcomes and methods. This<br />

assessment bears on the question <strong>of</strong> whether pooling the results<br />

is at all sensible. The second concept relates to assessing<br />

heterogeneity among the results <strong>of</strong> the original studies.<br />

Even if the study designs are similar, the researchers must<br />

decide whether it is useful to combine the primary studies’ results. Our discussion assumes a basic familiarity with how investigators present the magnitude 2,3 and precision 4 of treatment effects in individual randomized trials.<br />

CMAJ • MAR. 1, 2005; 172 (5) 661<br />

© 2005 CMA Media Inc. or its licensors<br />

Review<br />

Synthèse<br />

The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians. 1,5,6 A related article, intended for people who teach these concepts to clinicians, is available online at www.cmaj.ca/cgi/content/full/172/5/661/DC1.<br />

Clinician learners’ objectives<br />

Qualitative assessment <strong>of</strong> the design <strong>of</strong> primary<br />

studies<br />

• Understand the concepts <strong>of</strong> heterogeneity <strong>of</strong> study design<br />

among the individual studies included in a systematic<br />

review.<br />

Qualitative assessment <strong>of</strong> the results <strong>of</strong> primary<br />

studies<br />

• Understand how to qualitatively determine the appropriateness<br />

<strong>of</strong> pooling estimates <strong>of</strong> effect from the individual<br />

studies by assessing (1) the degree <strong>of</strong> overlap <strong>of</strong><br />

the confidence intervals around these point estimates <strong>of</strong><br />

effect and (2) the disparity between the point estimates<br />

themselves.<br />

• Understand how to estimate the “true” value <strong>of</strong> the estimate<br />

<strong>of</strong> effect from a graphic display <strong>of</strong> the results <strong>of</strong><br />

individual studies.<br />

Teachers <strong>of</strong> evidence-based medicine:<br />


See the “Tips for teachers” version of this article online at www.cmaj.ca/cgi/content/full/172/5/661/DC1. It contains the exercises found in this article in fill-in-the-blank format, commentaries from the authors on the challenges they encounter when teaching these concepts to clinician learners and links to useful online resources.


Hatala et al<br />

Box 1: Statistical assessments <strong>of</strong> heterogeneity<br />

Meta-analysts typically use 2 statistical approaches to evaluate the extent of variability in results between studies: Cochran’s Q test and the I² statistic.<br />

Cochran’s Q test<br />

• Cochran’s Q test is the traditional test <strong>for</strong> heterogeneity. It<br />

begins with the null hypothesis that all <strong>of</strong> the apparent<br />

variability is due to chance. That is, the true underlying<br />

magnitude <strong>of</strong> effect (whether measured with a relative risk,<br />

an odds ratio or a risk difference) is the same across studies.<br />

• The test then generates a probability, based on a χ² distribution, that differences in results between studies as extreme as or more extreme than those observed could occur simply by chance.<br />

• If the p value is low (say, less than 0.1) investigators should<br />

look hard <strong>for</strong> possible explanations <strong>of</strong> variability in results<br />

between studies (including differences in patients,<br />

interventions, measurement <strong>of</strong> outcomes and study design).<br />

• As the p value gets very low (less than 0.01) we may be<br />

increasingly uncom<strong>for</strong>table about using single best estimates<br />

<strong>of</strong> treatment effects.<br />

• The traditional test <strong>for</strong> heterogeneity is limited, in that it may<br />

be underpowered (when studies have included few patients it<br />

may be difficult to reject the null hypothesis even if it is false)<br />

or overpowered (when sample sizes are very large, small and<br />

unimportant differences in magnitude <strong>of</strong> effect may<br />

nevertheless generate low p values).<br />

I² statistic<br />

• The I² statistic, the second approach to measuring heterogeneity, attempts to deal with potential underpowering or overpowering. I² provides an estimate of the percentage of variability in results across studies that is likely due to true differences in treatment effect, as opposed to chance.<br />

• When I² is 0%, chance provides a satisfactory explanation for the variability we have observed, and we are more likely to be comfortable with a single pooled estimate of treatment effect.<br />

• As I² increases, we get increasingly uncomfortable with a single pooled estimate, and the need to look for explanations of variability other than chance becomes more compelling.<br />

• For example, one rule of thumb characterizes I² of less than 0.25 (25%) as low heterogeneity, 0.25 to 0.5 (25% to 50%) as moderate heterogeneity and over 0.5 (50%) as high heterogeneity.<br />
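Box 1 describes the 2 statistics without giving formulas. The sketch below shows one standard way to compute them, assuming each study supplies an effect estimate (for example, a log relative risk) and its standard error, and using inverse-variance weights. The function name and the study values are hypothetical, and the p value for Q is omitted because it requires a χ² distribution function that is not in the Python standard library.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, degrees of freedom, I2) for a set of study results.

    effects     -- per-study effect estimates on an additive scale
    std_errors  -- per-study standard errors of those estimates
    I2 is returned as a proportion between 0 and 1.
    """
    # Inverse-variance weights: precise studies count for more.
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Q: weighted squared deviations of each study from the pooled value.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2: share of Q in excess of what chance alone would produce.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Hypothetical data: 3 studies with equal precision but disparate results.
q, df, i2 = cochran_q_and_i2([-0.5, -0.1, 0.3], [0.2, 0.2, 0.2])
print(f"Q={q:.2f} df={df} I2={i2:.2f}")
# prints: Q=8.00 df=2 I2=0.75
```

By the rule of thumb in Box 1, an I² of 0.75 would count as high heterogeneity.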


Tip 1: Qualitative assessment <strong>of</strong> the design <strong>of</strong><br />

primary studies<br />

Consider the following 3 hypothetical systematic reviews.<br />

For which <strong>of</strong> these systematic reviews does it make<br />

sense to combine the primary studies?<br />

• A systematic review <strong>of</strong> all therapies <strong>for</strong> all types <strong>of</strong> cancer,<br />

intended to generate a single estimate <strong>of</strong> the impact<br />

<strong>of</strong> these therapies on mortality.<br />

• A systematic review that examines the effect <strong>of</strong> different<br />

antibiotics, such as tetracyclines, penicillins and chloramphenicol,<br />

on improvement in peak expiratory flow<br />

rates and days <strong>of</strong> illness in patients with acute exacerbation<br />

<strong>of</strong> obstructive lung disease, including chronic<br />

bronchitis and emphysema. 7<br />

• A systematic review <strong>of</strong> the effectiveness <strong>of</strong> tissue plasminogen<br />

activator (tPA) compared with no treatment<br />

or placebo in reducing mortality among patients with<br />

acute myocardial infarction. 8<br />

Most clinicians would instinctively reject the first <strong>of</strong><br />

these proposed reviews as overly broad but would be com<strong>for</strong>table<br />

with the idea <strong>of</strong> combining the results <strong>of</strong> trials relevant<br />

to the third question. What about the second review?<br />

What aspects <strong>of</strong> the primary studies must be similar to justify<br />

combining their results in this systematic review?<br />

Table 1 lists features that would be relevant to the<br />

question considered in the second review and categorizes<br />

them according to the 4 key elements <strong>of</strong> study design: the<br />

patients, interventions, outcomes and methods <strong>of</strong> the primary<br />

studies. Combining results is appropriate when the<br />

biology is such that across the range <strong>of</strong> patients, interventions,<br />

outcomes and study methods, one can anticipate<br />

more or less the same magnitude <strong>of</strong> treatment effect.<br />

In other words, the judgement as to whether the primary<br />

studies are similar enough to be combined in a systematic<br />

review is based on whether the underlying pathophysiology<br />

would predict a similar treatment effect across<br />

the range <strong>of</strong> patients, interventions, outcomes and study<br />

methods of the primary studies. If you think back to the first systematic review (all therapies for all cancers), you probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).<br />

Table 1: Relevant features of study design to be considered when deciding whether to pool studies in a systematic review (for a review examining the effect of antibiotics in patients with obstructive lung disease)<br />
Patients: patient age; patient sex; type of lung disease (e.g., emphysema, chronic bronchitis)<br />
Interventions: same antibiotic in all studies; same class of antibiotic in all studies; comparison of antibiotic with placebo; comparison of one antibiotic with another<br />
Outcomes: death; peak expiratory flow; forced expiratory volume in the first second<br />
Study methods: all randomized trials; only blinded randomized trials; cohort studies<br />

If you were inclined to reject pooling the results <strong>of</strong> the<br />

studies to be considered in the second systematic review, you<br />

might have reasoned that we would expect substantially different<br />

effects with different antibiotics, different infecting<br />

agents or different underlying lung pathology. If you were<br />

inclined to accept pooling <strong>of</strong> results in this review, you might<br />

argue that the antibiotics used in the different studies are all<br />

effective against the most common organisms underlying<br />

pulmonary exacerbations. You might also assert that the biology<br />

<strong>of</strong> an acute exacerbation <strong>of</strong> an obstructive lung disease<br />

(e.g., inflammation) is similar, despite variability in the underlying<br />

pathology. In other words, we would expect more<br />

or less the same effect across agents and across patients.<br />

Finally, you probably accepted the validity <strong>of</strong> pooling results<br />

<strong>for</strong> the third systematic review — tPA <strong>for</strong> myocardial<br />

infarction — because you consider that the mechanism <strong>of</strong><br />

myocardial infarction is relatively constant across a broad<br />

range <strong>of</strong> patients.<br />

The bottom line<br />

• Similarity in the aspects <strong>of</strong> primary study design outlined<br />

in Table 1 (patients, interventions, outcomes,<br />

study methods) guides the decision as to whether it<br />

makes sense to combine the results <strong>of</strong> primary studies<br />

in a systematic review.<br />

• The range <strong>of</strong> characteristics <strong>of</strong> the primary studies<br />

across which it is sensible to combine results is a matter<br />

<strong>of</strong> judgment based on the researcher’s understanding <strong>of</strong><br />

the underlying biology <strong>of</strong> the disease.<br />

Tip 2: Qualitative assessment <strong>of</strong> the results <strong>of</strong><br />

primary studies<br />

You should now understand that combining the results <strong>of</strong><br />

different studies is sensible only when we expect more or less<br />

the same magnitude <strong>of</strong> treatment effects across the range <strong>of</strong><br />

patients, interventions and outcomes that the investigators<br />

have included in their systematic review. However, even<br />

when we are confident <strong>of</strong> the similarity in design among the<br />

individual studies, we may still wonder whether the results <strong>of</strong><br />

the studies should be pooled. The following graphic demonstration<br />

shows how to qualitatively assess the results <strong>of</strong> the<br />

primary studies to decide if meta-analysis (i.e., statistical<br />

pooling) is appropriate. You can find discussions <strong>of</strong> quantitative,<br />

or statistical, approaches to the assessment <strong>of</strong> heterogeneity<br />

elsewhere (see Box 1 or Higgins and associates 9 ).<br />

Consider the results <strong>of</strong> the studies in 2 hypothetical systematic<br />

reviews (Fig. 1A and Fig. 1B). The central vertical<br />

line, labelled “no difference,” represents a treatment effect <strong>of</strong><br />

0. This would be equivalent to a risk ratio or relative risk <strong>of</strong> 1<br />

or an absolute or relative risk reduction <strong>of</strong> 0. 2 Values to the<br />


left <strong>of</strong> the “no difference” line indicate that the treatment is<br />

superior to the control, whereas those to the right <strong>of</strong> the line<br />

indicate that the control is superior to the treatment. For<br />

each <strong>of</strong> the 4 studies represented in the figures, the dot represents<br />

the point estimate <strong>of</strong> the treatment effect (the value<br />

observed in the study), and the horizontal line represents the<br />

confidence interval around that observed effect. For which<br />

systematic review does it make sense to combine results? Decide<br />

on the answer to this question be<strong>for</strong>e you read on.<br />

[Fig. 1, panels A and B: plots of point estimates and confidence intervals for 4 studies each. The horizontal axis runs from “Favours new treatment” on the left, through the central “No difference” line, to “Favours control” on the right.]<br />

Fig. 1: Results of the studies in 2 hypothetical systematic reviews. The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.<br />

You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?<br />

[Fig. 2: plot with the same horizontal axis (“Favours new treatment”, “No difference”, “Favours control”).]<br />

Fig. 2: Point estimates and confidence intervals for 4 studies. Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.<br />

[Fig. 3: plot with the same horizontal axis, with a large diamond at the bottom labelled “Pooled estimate of underlying effect”.]<br />

Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B. The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.<br />

Be<strong>for</strong>e you answer this question, consider the studies<br />

represented in Fig. 2. Here, the point estimates <strong>of</strong> 2 studies<br />

are on the “favours new treatment” side <strong>of</strong> the “no difference”<br />

line, and the point estimates <strong>of</strong> 2 other studies are on<br />

the “favours control” side. However, all 4 point estimates<br />

are very close to the “no difference” line, and, in this case,<br />

investigators doing a systematic review will be satisfied that<br />

it is appropriate to pool the results. There<strong>for</strong>e, it is not the<br />

position <strong>of</strong> the point estimates relative to the “no difference”<br />

line that determines the appropriateness <strong>of</strong> pooling.<br />

There are 2 criteria <strong>for</strong> not combining the results <strong>of</strong><br />

studies in a meta-analysis: highly disparate point estimates<br />

and confidence intervals with little overlap, both <strong>of</strong> which<br />

are exemplified by Fig. 1A. When pooling is appropriate on<br />

the basis <strong>of</strong> these criteria, where is the best estimate <strong>of</strong> the<br />

underlying magnitude <strong>of</strong> effect likely to be? Look again at<br />

Fig. 1B and make a guess. Now look at Fig. 3.<br />
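The 2 criteria named above are meant to be judged by eye, but a small sketch may help make them concrete. The helper names and the interval values below are hypothetical; they are chosen only to resemble the patterns of Fig. 1A and Fig. 1B, since the figures themselves carry no numeric scale. Requiring a region shared by all intervals is stricter than the article's "little overlap", so treat this as an aid to looking at the plot and not as a test.

```python
def common_overlap(intervals):
    """Return the (lower, upper) region shared by every confidence
    interval, or None if no single region lies inside all of them."""
    lower = max(lo for lo, hi in intervals)
    upper = min(hi for lo, hi in intervals)
    return (lower, upper) if lower <= upper else None


def point_estimate_spread(estimates):
    """Distance between the most extreme point estimates."""
    return max(estimates) - min(estimates)


# Hypothetical intervals resembling Fig. 1B: similar results.
similar = [(-0.6, -0.1), (-0.5, 0.0), (-0.7, -0.2), (-0.55, -0.15)]
# Hypothetical intervals resembling Fig. 1A: disparate results.
disparate = [(-0.9, -0.6), (-0.2, 0.1), (0.3, 0.6), (-0.5, -0.3)]

print(common_overlap(similar))    # (-0.5, -0.2)
print(common_overlap(disparate))  # None
```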

The pooled estimate at the bottom <strong>of</strong> Fig. 3 is centred on<br />

the midpoint <strong>of</strong> the area <strong>of</strong> overlap <strong>of</strong> the confidence intervals<br />

around the estimates <strong>of</strong> the individual trials. It provides our<br />

best guess as to the underlying treatment effect. Of course, we<br />

cannot actually know the “truth” and must be content with<br />

potentially misleading estimates. The intent <strong>of</strong> a meta-analysis<br />

is to include enough studies to narrow the confidence interval<br />

around the resulting pooled estimate sufficiently to provide estimates<br />

<strong>of</strong> benefit <strong>for</strong> our patients in which we can be confident.<br />

Thus, our best estimate <strong>of</strong> the truth will lie in the area <strong>of</strong><br />

overlap among the confidence intervals around the point estimates<br />

<strong>of</strong> treatment effect presented in the primary studies.<br />
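To make the idea of a pooled estimate concrete, the following Python sketch pools 4 hypothetical studies with a fixed-effect, inverse-variance weighted average. This is one common pooling method and is an assumption here; the article does not say which method produced the pooled estimate in Fig. 3. The effect sizes (log relative risks), standard errors and function names are all invented for illustration.

```python
import math


def confidence_interval(estimate, std_error, z=1.96):
    """95% confidence interval for one estimate (normal approximation)."""
    return estimate - z * std_error, estimate + z * std_error


def common_overlap(intervals):
    """Region shared by every interval, or None if there is no such region."""
    lower = max(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper) if lower <= upper else None


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Hypothetical log relative risks and standard errors for 4 trials.
estimates = [-0.30, -0.20, -0.25, -0.35]
std_errors = [0.20, 0.15, 0.25, 0.30]

intervals = [confidence_interval(e, se) for e, se in zip(estimates, std_errors)]
overlap = common_overlap(intervals)
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
pooled_ci = confidence_interval(pooled, pooled_se)

print("Region shared by all confidence intervals:", overlap)
print("Pooled estimate:", round(pooled, 3), "95% CI:", pooled_ci)
```

In this made-up data set the pooled estimate falls inside the region shared by all 4 confidence intervals, and its standard error is smaller than that of any single study, which is the narrowing of the confidence interval that the text describes.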

What is the clinician to do when presented with results<br />

such as those in Fig. 1A? If the investigators have done a<br />

good job <strong>of</strong> planning and executing the meta-analysis, they<br />

will provide some assistance. 6 Be<strong>for</strong>e examining the study<br />

results in detail, they will have generated a priori hypotheses<br />

to explain the heterogeneity in magnitude <strong>of</strong> effect across<br />

studies that they are liable to encounter. These hypotheses<br />

will include differences in patients (effects may be larger in<br />

sicker patients), in interventions (larger doses may result in<br />

larger effects), in outcomes (longer follow-up may diminish<br />

the magnitude <strong>of</strong> effect) and in study design (methodologically<br />

weaker studies may generate larger effects).<br />

The investigators will then have examined the extent to<br />

which these hypotheses can explain the differences in magnitude<br />

<strong>of</strong> effect across studies. These subgroup analyses<br />

may be misleading, but if they meet 7 criteria suggested<br />

elsewhere 10 (see Box 2), they may provide credible and satisfying<br />

explanations <strong>for</strong> the variability in results.<br />

The bottom line<br />


• Readers can decide <strong>for</strong> themselves whether there is<br />

clinically important heterogeneity among the results <strong>of</strong><br />

primary studies through a qualitative assessment <strong>of</strong> the<br />

graphic results. This assessment is based on the amount<br />

<strong>of</strong> disparity among the individual point estimates and<br />

the degree <strong>of</strong> overlap among the confidence intervals.<br />

Box 2: Questions to ask when evaluating a subgroup<br />

analysis in a meta-analysis 10<br />

• Was the subgroup comparison based on a within-study,<br />

rather than a between-study, comparison?<br />

• Is the magnitude <strong>of</strong> the difference in effect between<br />

subgroups large?<br />

• Is the effect consistent across studies?<br />

• Is the difference in effect statistically significant?<br />

• Was the subgroup analysis planned in advance by the<br />

trialists?<br />

• Were many subgroup analyses per<strong>for</strong>med and selectively<br />

reported?<br />

• Is the difference in effect in the subgroup supported by a<br />

biological hypothesis?<br />

Conclusions<br />

Understanding the concept <strong>of</strong> heterogeneity in a systematic<br />

review or meta-analysis is central to a full appreciation<br />

<strong>of</strong> the implications <strong>of</strong> such reviews <strong>for</strong> clinical practice.<br />

We have presented 2 tips aimed at helping clinical readers<br />

overcome commonly encountered difficulties in understanding<br />

this concept.<br />

This article has been peer reviewed.<br />

From the Department <strong>of</strong> <strong>Medicine</strong>, University <strong>of</strong> British Columbia, Vancouver, BC<br />

(Hatala); Durham Veterans Affairs Medical Center and Duke University Medical<br />

Center, Durham, NC (Keitz); the Columbia University College <strong>of</strong> Physicians and<br />

Surgeons, New York, NY (Wyer); and the Departments <strong>of</strong> <strong>Medicine</strong> and <strong>of</strong> Clinical<br />

Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)<br />

Competing interests: None declared.<br />

Contributors: Rose Hatala modified the original ideas <strong>for</strong> tips 1 and 2, drafted the<br />

manuscript, coordinated input from reviewers and field-testing, and revised all drafts.<br />

Sheri Keitz used all <strong>of</strong> the tips as part <strong>of</strong> a live teaching exercise and submitted comments,<br />

suggestions and the possible variations that are described in the article. Peter<br />

Wyer reviewed and revised the final draft <strong>of</strong> the manuscript to achieve uni<strong>for</strong>m adherence<br />

with <strong>for</strong>mat specifications. Gordon Guyatt developed the original ideas <strong>for</strong><br />

tips 1 and 2, reviewed the manuscript at all phases <strong>of</strong> development, contributed to<br />

the writing as a coauthor, and, as general editor, reviewed and revised the final draft<br />

<strong>of</strong> the manuscript to achieve accuracy and consistency <strong>of</strong> content.<br />

References<br />

1. Oxman A, Guyatt G, Cook D, Montori V. Summarizing the evidence. In:<br />

Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual <strong>for</strong><br />

evidence-based clinical practice. Chicago: AMA Press; 2002. p. 155-73.<br />

2. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al, <strong>for</strong> the<br />

<strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group. <strong>Tips</strong> <strong>for</strong> learners<br />

<strong>of</strong> evidence-based medicine: 1. Relative risk reduction, absolute risk reduction<br />

and number needed to treat. CMAJ 2004;171(4):353-8.<br />

3. Guyatt G, Cook D, Devereaux PJ, Meade M, Straus S. Therapy. In: Guyatt<br />

G, Rennie D, editors. Users’ guides to the medical literature: a manual <strong>for</strong> evidence-based<br />

clinical practice. Chicago: AMA Press; 2002. p. 55-79.<br />

4. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al,<br />

<strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group. <strong>Tips</strong> <strong>for</strong><br />

learners <strong>of</strong> evidence-based medicine: 2. Measures <strong>of</strong> precision (confidence intervals).<br />

CMAJ 2004;171(6):611-5.<br />

5. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. <strong>Tips</strong><br />

<strong>for</strong> learning and teaching evidence-based medicine: introduction to the series.<br />

CMAJ 2004;171(4):347-8.<br />

6. Montori V, Hatala R, Guyatt G. Summarizing the evidence: evaluating differences<br />

in study results. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature:<br />

a manual <strong>for</strong> evidence-based clinical practice. Chicago: AMA Press; 2002. p. 547-52.<br />

7. Saint S, Bent S, Vittingh<strong>of</strong>f E, Grady D. Antibiotics in chronic obstructive<br />

pulmonary disease exacerbations. JAMA 1995;273:957-60.<br />

8. Held PH, Teo KK, Yusuf S. Effects <strong>of</strong> tissue-type plasminogen activator and<br />

anisoylated plasminogen streptokinase activator complex on mortality in acute<br />

myocardial infarction. Circulation 1990;82:1668-74.<br />

9. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency<br />

in meta-analyses. BMJ 2003;327:557-60.<br />

10. Oxman A, Guyatt G. When to believe a subgroup analysis. In: Guyatt G,<br />

Rennie D, editors. Users’ guides to the medical literature: a manual <strong>for</strong> evidence-based<br />

clinical practice. Chicago: AMA Press; 2002. p. 553-65.<br />

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />

Pelham NY 10804; fax 914 738-9368; pwyer@att.net<br />

Members <strong>of</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working<br />

Group: Peter C. Wyer (project director), College <strong>of</strong> Physicians and<br />

Surgeons, Columbia University, New York, NY; Deborah Cook,<br />

Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke,<br />

McMaster University, Hamilton, Ont.; Rose Hatala (internal<br />

review coordinator), University <strong>of</strong> British Columbia, Vancouver,<br />

BC; Robert Hayward (editor, online version), Bruce Fisher,<br />

University <strong>of</strong> Alberta, Edmonton, Alta.; Sheri Keitz (field test<br />

coordinator), Durham Veterans Affairs Medical Center and Duke<br />

University Medical Center, Durham, NC; Alexandra Barratt,<br />

University <strong>of</strong> Sydney, Sydney, Australia; Pamela Charney, Albert<br />

Einstein College <strong>of</strong> <strong>Medicine</strong>, Bronx, NY; Antonio L. Dans,<br />

University <strong>of</strong> the Philippines College <strong>of</strong> <strong>Medicine</strong>, Manila, The<br />

Philippines; Barnet Eskin, Morristown Memorial Hospital,<br />

Morristown, NJ; Jennifer Kleinbart, Emory University School <strong>of</strong><br />

<strong>Medicine</strong>, Atlanta, Ga.; Hui Lee, <strong>for</strong>merly Group Health Centre,<br />

Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas<br />

McGinn, Mount Sinai Medical Center, New York, NY; Victor M.<br />

Montori, Mayo Clinic College <strong>of</strong> <strong>Medicine</strong>, Rochester, Minn.;<br />

Virginia Moyer, University <strong>of</strong> Texas, Houston, Tex.; Thomas B.<br />

Newman, University <strong>of</strong> Cali<strong>for</strong>nia, San Francisco, San Francisco,<br />

Calif.; Jim Nishikawa, University <strong>of</strong> Ottawa, Ottawa, Ont.;<br />

Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain;<br />

W. Scott Richardson, Wright State University, Dayton, Ohio; Mark<br />

C. Wilson, University <strong>of</strong> Iowa, Iowa City, Iowa<br />

Articles to date in this series<br />


Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S,<br />

et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine: 1.<br />

Relative risk reduction, absolute risk reduction and<br />

number needed to treat. CMAJ 2004;171(4):353-8.<br />

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC,<br />

Moyer V, et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />

2. Measures <strong>of</strong> precision (confidence intervals).<br />

CMAJ 2004;171(6):611-5.<br />

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt<br />

G, et al. <strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />

3. Measures <strong>of</strong> observer variability (kappa statistic).<br />

CMAJ 2004;171(11):1369-73.<br />

CMAJ MAR. 1, 2005; 172 (5) 665


DOI:10.1503/cmaj.1031666<br />

<strong>Tips</strong> <strong>for</strong> learners <strong>of</strong> evidence-based medicine:<br />

5. The effect <strong>of</strong> spectrum <strong>of</strong> disease on the<br />

per<strong>for</strong>mance <strong>of</strong> diagnostic tests<br />

Victor M. Montori, Peter Wyer, Thomas B. Newman, Sheri Keitz, Gordon Guyatt,<br />

<strong>for</strong> the <strong>Evidence</strong>-<strong>Based</strong> <strong>Medicine</strong> Teaching <strong>Tips</strong> Working Group<br />

For clinicians to use a diagnostic test in clinical practice,<br />

they need to know how well the test distinguishes<br />

between those who have the suspected disease<br />

or condition and those who do not. If investigators<br />

choose clinically inappropriate populations <strong>for</strong> their study<br />

<strong>of</strong> a diagnostic test and thereby introduce what is sometimes<br />

called spectrum bias, the results may seriously mislead<br />

clinicians.<br />

In this article we present a series <strong>of</strong> examples that illustrate<br />

why clinicians need to pay close attention to the populations<br />

enrolled in studies <strong>of</strong> diagnostic test per<strong>for</strong>mance<br />

be<strong>for</strong>e they apply the results <strong>of</strong> those studies to their own<br />

patients. After working through these examples, you should<br />

understand which characteristics <strong>of</strong> a study population are<br />

likely to result in misleading interpretations <strong>of</strong> test results<br />

and which are not.<br />

The tips in this article are adapted from approaches developed<br />

by educators with experience in teaching evidence-based<br />

medicine principles to clinicians. 1,2 A related article,<br />

intended <strong>for</strong> people who teach these concepts to clinicians,<br />

is available online at www.cmaj.ca/cgi/content/full/173/4/385/DC1.<br />

Clinician learners’ objectives<br />

“Ideal” spectrum <strong>of</strong> disease<br />

• Understand the importance <strong>of</strong> spectrum <strong>of</strong> disease in<br />

the evaluation <strong>of</strong> diagnostic test characteristics.<br />

Prevalence, spectrum and test characteristics<br />

• Understand the lack <strong>of</strong> impact <strong>of</strong> disease prevalence on<br />

sensitivity, specificity and likelihood ratios.<br />

• Understand the impact <strong>of</strong> disease prevalence or likelihood<br />

on the probability <strong>of</strong> the target condition (posttest<br />

probability) after test results are available.<br />

Tip 1: “Ideal” spectrum <strong>of</strong> disease<br />

Let’s consider a clinical example that illustrates the concept<br />

<strong>of</strong> “disease spectrum” in relation to diagnostic tests.<br />

CMAJ • AUG. 16, 2005; 173 (4) 385<br />

© 2005 CMA Media Inc. or its licensors<br />


Brain natriuretic peptide (BNP) is a hormone secreted by<br />

the ventricles in the heart in response to expansion. Plasma<br />

levels <strong>of</strong> BNP increase when acute or chronic congestive<br />

heart failure is present. Consequently, investigators have<br />

suggested using BNP levels to distinguish congestive heart<br />

failure from other causes <strong>of</strong> acute dyspnea among patients<br />

presenting to emergency departments. 3<br />

One highly publicized study reported promising results<br />

using a BNP cut<strong>of</strong>f point <strong>of</strong> 100 pg/mL. 4,5 This cut<strong>of</strong>f point<br />

means that patients with BNP levels greater than<br />

100 pg/mL are considered to have a “positive” test result<br />

<strong>for</strong> congestive heart failure and those with levels below this<br />

threshold are considered to have a “negative” test result.<br />

The investigators compared the number <strong>of</strong> diagnoses <strong>of</strong><br />

congestive heart failure using BNP levels with those using a<br />

criterion standard (or “gold standard”) defined by established<br />

clinical and imaging criteria. Commentaries have<br />

challenged the investigators’ estimates <strong>of</strong> the sensitivity and<br />

specificity <strong>of</strong> the BNP test at the proposed cut<strong>of</strong>f point on<br />

the basis that clinicians were already confident with respect<br />

to the likelihood <strong>of</strong> congestive heart failure in most <strong>of</strong> the<br />

patients in the study. 6,7<br />

Ideally, the ability <strong>of</strong> a test to correctly identify patients<br />

with and without a particular disease would not vary between<br />

patients. However, if you are a clinician, you already<br />

intuitively understand that a test may per<strong>for</strong>m better when<br />

it is used to evaluate patients with more severe disease than<br />

it would with patients whose disease is less advanced and<br />

less obvious. You also appreciate that diagnostic tests are<br />

not needed when the disease is either clinically obvious or<br />

sufficiently unlikely that you need not seriously consider it.<br />

Teachers <strong>of</strong> evidence-based medicine:<br />

See the “<strong>Tips</strong> <strong>for</strong> teachers” version <strong>of</strong> this article online<br />

at www.cmaj.ca/cgi/content/full/173/4/385/DC1. It<br />

contains the exercises found in this article in fill-in-the-blank<br />

<strong>for</strong>mat, commentaries from the authors on the<br />

challenges they encounter when teaching these<br />

concepts to clinician learners and links to useful online<br />

resources.<br />


A study <strong>of</strong> the per<strong>for</strong>mance <strong>of</strong> a diagnostic test involves<br />

per<strong>for</strong>ming that test on patients with and without the disease<br />

or condition <strong>of</strong> interest together with a second test or<br />

investigation that we will call the “criterion standard.” We<br />

accept the results <strong>of</strong> the second test as the criterion by<br />

which the results <strong>of</strong> the test under investigation are assessed.<br />

In designing such a study, investigators sometimes<br />

choose both patients in whom the disease is unequivocally<br />

advanced and patients who are unequivocally free <strong>of</strong> disease,<br />

such as healthy, asymptomatic volunteers. This approach<br />

ensures the validity <strong>of</strong> the criterion standard and<br />

may be appropriate in the early stages <strong>of</strong> developing a test.<br />

However, any study done with a population that lacks diagnostic<br />

uncertainty may produce a biased estimate <strong>of</strong> a test’s<br />

per<strong>for</strong>mance relative to that produced by<br />

a study restricted to patients <strong>for</strong> whom<br />

the test would be clinically indicated.<br />

Returning to the use <strong>of</strong> BNP levels to<br />

test <strong>for</strong> congestive heart failure among patients<br />

with acute dyspnea, consider Fig. 1.<br />

The horizontal axis represents increasing<br />

values <strong>of</strong> BNP. The 2 bell curves constitute<br />

hypothetical probability density plots<br />

<strong>of</strong> the distribution <strong>of</strong> BNP values among<br />

patients with and without congestive<br />

heart failure. 8 The height at any point in<br />

either curve reflects the proportion <strong>of</strong><br />

emergency patients in the particular subgroup<br />

with the corresponding BNP value.<br />

Aside from the choice <strong>of</strong> cut<strong>of</strong>f value, this<br />

figure does not reflect the results <strong>of</strong> any<br />

actual study.<br />

The bell curve on the left in Fig. 1 represents<br />

the hypothetical distribution <strong>of</strong><br />

BNP values in a group <strong>of</strong> young patients<br />

with known asthma and no risk factors <strong>for</strong><br />

congestive heart failure. They will tend to<br />

have low levels <strong>of</strong> circulating BNP. The<br />

bell curve on the right represents the distribution<br />

<strong>of</strong> BNP values among older patients<br />

with unequivocal and severe congestive<br />

heart failure. Such patients will<br />

have test results clustered on the high end<br />

<strong>of</strong> the scale.<br />

If Fig. 1 accurately represented the<br />

per<strong>for</strong>mance <strong>of</strong> the BNP test in distinguishing<br />

between all patients with and<br />

without congestive heart failure as the<br />

cause <strong>of</strong> their symptoms, the test would<br />

be very useful. The 2 curves demonstrate<br />

very little overlap. For BNP values below<br />

90 pg/mL (point A), no patients have<br />

congestive heart failure, and <strong>for</strong> BNP values<br />

above 110 pg/mL (point B), all patients<br />

have congestive heart failure. This<br />


means, assuming that Fig. 1 reflects reality, that you can be<br />

completely certain about the diagnosis <strong>for</strong> all people with<br />

BNP values below 90 pg/mL or above 110 pg/mL. Only<br />

<strong>for</strong> patients whose BNP values are between 90 and<br />

110 pg/mL is there residual uncertainty about their likelihood<br />

<strong>of</strong> congestive heart failure.<br />

However, be<strong>for</strong>e you embrace a test on the basis <strong>of</strong> its<br />

per<strong>for</strong>mance among patients in whom the presence or absence<br />

<strong>of</strong> disease is unequivocal, you need to consider the<br />

likely distribution <strong>of</strong> test results in a population <strong>of</strong> patients<br />

<strong>for</strong> whom you would be less certain.<br />

In Fig. 2, imagine that the entire study population is<br />

made up <strong>of</strong> middle-aged patients, all <strong>of</strong> whom have chronic<br />

congestive heart failure and recurrent asthma. The distributions<br />

<strong>of</strong> BNP values in the subgroups with and without<br />

acute congestive heart failure are both much closer to the<br />

middle <strong>of</strong> the range. The extent <strong>of</strong> the overlap <strong>of</strong> the curves<br />

between points A and B is much greater, which means that<br />

there is residual uncertainty about the disease status <strong>of</strong> a<br />

large proportion <strong>of</strong> the patients even after the BNP test has<br />

been per<strong>for</strong>med.<br />

Fig. 1: Hypothetical probability density distributions <strong>of</strong> measured plasma brain<br />

natriuretic peptide (BNP) levels in 2 subgroups <strong>of</strong> a study population. The cut<strong>of</strong>f<br />

point <strong>for</strong> a diagnosis <strong>of</strong> congestive heart failure (CHF) is 100 pg/mL. Patients with a<br />

negative test result <strong>for</strong> CHF (left-hand curve) are younger, with known asthma and<br />

no risk factors <strong>for</strong> CHF. The patients with confirmed CHF are older, and the disease<br />

is clinically severe and unequivocal. Clinicians in the emergency department have<br />

little uncertainty regarding the cause <strong>of</strong> dyspnea in any <strong>of</strong> these patients.<br />

(Axes: BNP level, pg/mL, from 0 to 200, horizontal; proportion <strong>of</strong> patients, vertical. Curves: patients without acute CHF and patients with acute CHF. Points A and B are marked on the horizontal axis.)<br />

Fig. 2: These hypothetical probability density distributions reflect a study population<br />

<strong>of</strong> middle-aged patients who all have recurrent asthma and chronic CHF.<br />

The patients whose dyspnea is caused by asthma exacerbations look clinically<br />

similar to those whose symptoms are caused by acute CHF.<br />

(Axes: BNP level, pg/mL, from 0 to 200, horizontal; proportion <strong>of</strong> patients, vertical. Curves: patients without acute CHF and patients with acute CHF. Points A and B are marked on the horizontal axis.)<br />

It may be helpful to note that the sensitivity <strong>of</strong> the BNP<br />

test at a cut<strong>of</strong>f value <strong>of</strong> 100 pg/mL (the proportion <strong>of</strong> patients<br />

with acute congestive heart failure whose BNP level<br />

is greater than 100 pg/mL) is defined in Fig. 1 and Fig. 2 as<br />

the percentage <strong>of</strong> the total area <strong>of</strong> the right-hand curve that<br />

lies to the right <strong>of</strong> the cut<strong>of</strong>f value. Notice that this percentage<br />

is markedly lower in Fig. 2 than in Fig. 1. The<br />

same is true <strong>of</strong> specificity, which is the proportion <strong>of</strong> patients<br />

without acute congestive heart failure whose BNP<br />

level is less than 100 pg/mL. This is defined in the figures<br />

as the proportion <strong>of</strong> the left-hand curve that lies to the left<br />

<strong>of</strong> the cut<strong>of</strong>f point. Again this percentage is appreciably<br />

lower in Fig. 2 compared with Fig. 1.<br />
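The area-under-the-curve definitions above can be checked numerically. The Python sketch below treats each bell curve as a normal distribution and reads sensitivity and specificity off the 100 pg/mL cutoff. The means and standard deviations are invented to mimic the general shape of Fig. 1 and Fig. 2; they are assumptions for illustration and do not come from any study, and the function name is hypothetical.

```python
from statistics import NormalDist

CUTOFF = 100.0  # pg/mL, the cutoff point used in the article


def sensitivity_specificity(without_chf, with_chf, cutoff=CUTOFF):
    """Return (sensitivity, specificity) for a 'positive if above cutoff' rule."""
    # Sensitivity: share of the with-CHF curve lying to the right of the cutoff.
    sensitivity = 1.0 - with_chf.cdf(cutoff)
    # Specificity: share of the without-CHF curve lying to the left of the cutoff.
    specificity = without_chf.cdf(cutoff)
    return sensitivity, specificity


# Hypothetical curves (means and SDs are invented, not taken from any study).
# Well-separated subgroups, in the spirit of Fig. 1.
fig1_like = sensitivity_specificity(
    NormalDist(mu=60, sigma=15), NormalDist(mu=140, sigma=15)
)
# Clinically similar subgroups, in the spirit of Fig. 2.
fig2_like = sensitivity_specificity(
    NormalDist(mu=85, sigma=15), NormalDist(mu=115, sigma=15)
)

for label, (sens, spec) in [("Fig. 1-like", fig1_like), ("Fig. 2-like", fig2_like)]:
    print(f"{label}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With the same cutoff, moving the 2 curves toward the middle of the range lowers both values, which is the pattern the text describes for Fig. 2.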

These theoretical concerns play out (albeit with a lesser<br />

magnitude <strong>of</strong> impact than depicted in Fig. 1 and Fig. 2) in<br />

studies <strong>of</strong> the BNP test as a diagnostic tool. In the BNP<br />

study to which we have referred, the sensitivity and specificity<br />

<strong>of</strong> the test using the 100 pg/mL cut<strong>of</strong>f were 90%<br />

and 76% respectively when all patients were included. 4<br />

Only about 25% <strong>of</strong> the study population were judged by<br />

the treating physicians to be in the intermediate range <strong>of</strong><br />

probability <strong>of</strong> acute congestive heart failure. 5 When only<br />

patients in this subgroup were considered in a number <strong>of</strong><br />

studies, the sensitivity and specificity <strong>of</strong> the BNP test at a<br />

cut<strong>of</strong>f point <strong>of</strong> 100 pg/mL were only 88% and 55% respectively. 7<br />
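For readers who work with likelihood ratios, the reported sensitivities and specificities can be restated with the standard formulas for a dichotomous test: LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The Python sketch below applies them to the 2 pairs of values quoted above; the function name is hypothetical, and the resulting ratios are derived here, not reported in the cited studies.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a test with a single positive/negative cutoff."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Values quoted in the text for the 100 pg/mL cutoff.
all_patients = likelihood_ratios(sensitivity=0.90, specificity=0.76)
intermediate_probability = likelihood_ratios(sensitivity=0.88, specificity=0.55)

print("All patients:             LR+ = %.2f, LR- = %.2f" % all_patients)
print("Intermediate probability: LR+ = %.2f, LR- = %.2f" % intermediate_probability)
```

The positive likelihood ratio falls from about 3.75 to about 1.96 in the subgroup with genuine diagnostic uncertainty, so a positive result shifts the probability of congestive heart failure far less in the patients for whom the test is actually needed.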

The range <strong>of</strong> disease states found among the patients<br />

in the population upon which a test is to be used is commonly<br />

referred to as “disease spectrum.” In making your<br />

final assessment on the value <strong>of</strong> a test,<br />

consider the spectrum <strong>of</strong> the disease or<br />

condition in which you are interested.<br />

You don’t need to differentiate healthy<br />

patients from patients with severe disease.<br />

Rather, you must differentiate<br />

those who have the disease from those<br />

who do not among all those who appear<br />

as if they might have it. The “right”<br />

population <strong>for</strong> a diagnostic test study includes<br />

(1) those in whom we are uncertain<br />

<strong>of</strong> the diagnosis; (2) those in whom<br />

we will use the test in clinical practice to<br />

resolve our uncertainty; and (3) patients<br />

with the disease who have a wide spectrum<br />

<strong>of</strong> severity and patients without the<br />

disease who have symptoms commonly<br />

associated with it.<br />

Readers familiar with the concept and<br />

interpretation <strong>of</strong> likelihood ratios <strong>for</strong> diagnostic<br />

test results 1 may find it useful to<br />

note that the likelihood ratio <strong>for</strong> any given test value is represented<br />

by the ratio <strong>of</strong> the heights <strong>of</strong> the 2 curves at that point<br />

on the horizontal axis (Fig. 3). The point on the horizontal<br />

axis below the intersection <strong>of</strong> the 2 curves is the test result<br />

with a likelihood ratio <strong>of</strong> 1. Fig. 3 also identifies test<br />

values corresponding to likelihood ratios <strong>of</strong> 0.25 and 4.<br />

Comparing Fig. 1 and Fig. 2 once more, you will notice<br />

that the relative heights <strong>of</strong> the 2 curves, and hence the likelihood<br />

ratios, corresponding to a given BNP level will<br />

change as the curves move closer together and the area <strong>of</strong><br />

overlap increases.<br />
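The relation between curve heights and likelihood ratios can be sketched in Python by dividing the height of the with-disease curve by the height of the without-disease curve at a given test value. As before, the normal curves and their parameters are invented for illustration and are not taken from any study; the function name is hypothetical.

```python
from statistics import NormalDist


def likelihood_ratio_at(value, without_disease, with_disease):
    """Height of the with-disease curve divided by the without-disease curve."""
    return with_disease.pdf(value) / without_disease.pdf(value)


# Hypothetical BNP distributions (pg/mL); parameters are assumptions.
separated = (NormalDist(mu=60, sigma=15), NormalDist(mu=140, sigma=15))
overlapping = (NormalDist(mu=85, sigma=15), NormalDist(mu=115, sigma=15))

for bnp in (90, 100, 110):
    lr_separated = likelihood_ratio_at(bnp, *separated)
    lr_overlapping = likelihood_ratio_at(bnp, *overlapping)
    print(f"BNP {bnp}: LR = {lr_separated:.2f} (separated), "
          f"{lr_overlapping:.2f} (overlapping)")
```

In both hypothetical populations the curves cross at 100 pg/mL, where the likelihood ratio is 1, but the same BNP value of 110 pg/mL carries a much smaller likelihood ratio once the curves move closer together.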

The bottom line<br />

• Test per<strong>for</strong>mance will vary with the spectrum <strong>of</strong> disease<br />

within a study population. 9<br />

• The sensitivity and specificity <strong>of</strong> a test, when it is used<br />

to differentiate patients who obviously do not have the<br />

disease from patients who obviously do, likely overestimate<br />

its per<strong>for</strong>mance when the test is applied in a clinical<br />

context characterized by diagnostic uncertainty.<br />

Definitions<br />

Disease spectrum: The range <strong>of</strong> the disease states found<br />

among patients who make up the population upon<br />

which a test is to be used.<br />

Per<strong>for</strong>mance <strong>of</strong> diagnostic tests: Measures derived from<br />

the percentage <strong>of</strong> patients with and without disease<br />

identified by a particular test result, with disease<br />

positivity defined through the application <strong>of</strong> an<br />

acceptable criterion standard to each patient in a study.<br />

Sensitivity and specificity are examples <strong>of</strong> such measures.<br />

Fig. 3: Likelihood ratios (LRs) and spectrum <strong>of</strong> disease. The likelihood ratio <strong>of</strong> a<br />

test result represented by a point on the horizontal line is the height <strong>of</strong> the right-hand<br />

bell curve (patients with the disease <strong>of</strong> interest) divided by the height <strong>of</strong> the<br />

left-hand bell curve (patients without the disease <strong>of</strong> interest) at that point.<br />

(Axes: increasing test value, horizontal; proportion <strong>of</strong> patients, vertical. Curves: patients without acute CHF and patients with acute CHF. Marked test results: LR = 0.25, LR = 1 and LR = 4.)<br />



Tip 2: Prevalence, spectrum and test<br />

characteristics<br />

You may have learned the rule <strong>of</strong> thumb that post-test<br />

probabilities (which are closely related to predictive values)<br />

vary with disease prevalence, but sensitivities, specificities<br />

and likelihood ratios do not. Is this true? The answer is<br />

“yes,” provided that disease spectrum remains the same in<br />

high- and low-prevalence populations. In the discussion<br />

that follows, <strong>for</strong> purposes <strong>of</strong> simplicity, we use the term<br />

“prevalence” to denote the likelihood that any patient randomly<br />

selected from the study population has the disease or<br />

condition as defined by the criterion standard. This is not<br />

the same thing as the probability <strong>of</strong> disease in any individual<br />

patient.<br />

Referring once again to Fig. 1, let’s consider 3 cases. In<br />

the first, we’ll assume that there were 1000 patients in each<br />

subgroup: 1000 in whom congestive heart failure was unequivocally<br />

the cause <strong>of</strong> their dyspnea and 1000 in whom<br />

asthma was almost certainly the cause. The prevalence <strong>of</strong><br />

congestive heart failure is 50%. Each bell curve corresponds<br />

to the distribution <strong>of</strong> BNP values within the respective<br />

subgroup. Now consider a second case, where there are<br />

2000 older patients with severe congestive heart failure and<br />

1000 younger patients with recurrent asthma and no risk<br />

factors <strong>for</strong> congestive heart failure. The prevalence <strong>of</strong> congestive<br />

heart failure is 67%. Finally, consider a third case,<br />

where 2000 patients with asthma and 1000 patients with severe<br />

congestive heart failure are studied. The prevalence <strong>of</strong><br />

congestive heart failure is 33%.<br />

Fig. 4: Changes in disease prevalence have no effect on diagnostic test characteristics.<br />

Columns in each panel: Test result | Pregnant | Not pregnant | Total<br />

Panel A<br />

Positive test result | 95 (A) | 1 (B) | 96<br />

Negative test result | 5 (C) | 99 (D) | 104<br />

Total | 100 | 100 | 200<br />

Panel B<br />

Positive test result | 380 (A × 4) | 1 (B) | 381<br />

Negative test result | 20 (C × 4) | 99 (D) | 119<br />

Total | 400 | 100 | 500<br />

Panel C<br />

Positive test result | 95 (A) | 4 (B × 4) | 99<br />

Negative test result | 5 (C) | 396 (D × 4) | 401<br />

Total | 100 | 400 | 500<br />

In each case the height of either curve at any particular BNP level still corresponds to the proportion of patients in that group who have that test value. Changes in the number of patients in either group will not alter these proportions, and the performance of the test, as measured by sensitivity, specificity or likelihood ratios, will be unaffected. The performance of the BNP test in identifying patients with and without acute congestive heart failure remains the same. Hence, when the spectrum remains the same, the prevalence of congestive heart failure within the study population is irrelevant to the estimation of test characteristics.<br />
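The scaling argument can be made explicit with a short derivation. The notation is an assumption introduced here for illustration: for a chosen BNP cutoff, A and C are the numbers of patients with congestive heart failure who test above and below it, B and D are the corresponding numbers of patients with asthma, and k and m are arbitrary positive factors by which the two groups are enlarged or shrunk.

```latex
\begin{align*}
\text{sensitivity} &= \frac{kA}{kA + kC} = \frac{A}{A + C} \\
\text{specificity} &= \frac{mD}{mB + mD} = \frac{D}{B + D} \\
\text{prevalence}  &= \frac{k(A + C)}{k(A + C) + m(B + D)}
\end{align*}
```

The factors cancel in the first two lines but not in the third, so prevalence can move between 33% and 67% across the 3 cases while sensitivity and specificity stay fixed, provided the distribution of BNP values within each group (the spectrum) is unchanged.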

Let’s take a different clinical example. The ICON urine test for pregnancy (Beckman Coulter, Inc., Fullerton, Calif.) has a very high sensitivity and specificity when performed later than 2 weeks postconception.10<br />

Fig. 4A: Women attending a screening clinic in a geographic area characterized by moderate population growth are tested for pregnancy. Half of the women are pregnant; hence, the prevalence of pregnancy is 50% in this setting. The ICON test has a sensitivity of 95% and a specificity of 99%. By definition, 95% of the 100 pregnant women (95% sensitivity) will have a positive test result, and 99% of the 100 nonpregnant women (99% specificity) will have a negative test result. The sensitivity is influenced by the proportion of women who present less than 2 weeks after conception.<br />

Fig. 4B: The same test is performed in a similar clinic located in a geographic area characterized by high population growth. Four times as many women are pregnant as are not, so the prevalence of pregnancy has increased to 80%. The percentage of pregnant women who have positive test results remains the same (380/400), and the sensitivity of the test remains 95% in this population. The percentage of nonpregnant women who have a negative test result is also unchanged at 99%.<br />

Fig. 4C: The same pregnancy test is now used in a clinic serving a population characterized by low population growth. Only one-fifth of women are pregnant. The sensitivity remains the same despite a decrease in the proportion of pregnant women from 50% to 20%. The specificity (the proportion of nonpregnant women with a negative test result) remains the same despite an increase in the proportion of nonpregnant women to 80%. Once again, the prevalence of pregnancy in the population is irrelevant to the estimation of test characteristics.<br />
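The arithmetic behind the 3 panels of Fig. 4 can be checked with a minimal Python sketch. The cell counts are those printed in the figure; the function and variable names are illustrative choices made here, not part of the article.

```python
from fractions import Fraction

# Cell counts from Fig. 4, ordered (A, B, C, D):
# A = pregnant, positive test      B = not pregnant, positive test
# C = pregnant, negative test      D = not pregnant, negative test
FIG4 = {
    "A": (95, 1, 5, 99),    # moderate population growth
    "B": (380, 1, 20, 99),  # high population growth (pregnant column x 4)
    "C": (95, 4, 5, 396),   # low population growth (nonpregnant column x 4)
}


def characteristics(a, b, c, d):
    """Return (prevalence, sensitivity, specificity) as exact fractions."""
    prevalence = Fraction(a + c, a + b + c + d)
    sensitivity = Fraction(a, a + c)   # positives among the pregnant
    specificity = Fraction(d, b + d)   # negatives among the nonpregnant
    return prevalence, sensitivity, specificity


for panel, cells in FIG4.items():
    prev, sens, spec = characteristics(*cells)
    print(f"Fig. 4{panel}: prevalence {float(prev):.0%}, "
          f"sensitivity {float(sens):.0%}, specificity {float(spec):.0%}")
```

Exact fractions are used so that the equality of sensitivity and specificity across panels is not blurred by floating-point rounding.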



The ICON test is a qualitative, and inherently dichotomized, test: both clinicians and patients recognize that it is not possible to be “a little bit pregnant.” In short, although estimates of performance values for the ICON test vary in the literature,11,12 the performance of the test in detecting pregnancy is likely to be uniform if the percentage of subjects who are less than 2 weeks postconception does not vary.<br />

For the purpose of our demonstration, let’s assume that ICON test results are positive in 95% of women who are pregnant and negative in 99% of women who are not. Fig. 4 shows the sensitivity and specificity of the test when it is administered in 3 different geographic locations with high, moderate and low population growth and where the proportion of women presenting within 2 weeks of conception is constant. Again, for simplicity, we are considering only the prevalence of pregnancy in the population being studied (in other words, the percentage of women tested who are pregnant). A practitioner might estimate the probability of pregnancy in an individual patient to be higher or lower than this on the basis of clinical features such as use of birth control methods, history of recent sexual activity and past history of gynecologic disease. As Fig. 4 shows, the prevalence of pregnancy in the population has no effect on the estimation of test characteristics.<br />

There are many examples of conditions that may present with equal severity in people with different demographic characteristics (age, sex, ethnicity) but that are much more prevalent in one group than in another. Mild osteoarthritis of the knee is rare among young patients but common among older patients. Asymptomatic thyroid abnormalities are rare among men but common among women. In both examples, diagnostic tests will have the same sensitivity, specificity and likelihood ratios in young and old patients and in men and women respectively.<br />
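Likelihood ratios share this independence from prevalence because they are built only from sensitivity and specificity. As a worked illustration, the standard definitions of the positive and negative likelihood ratios are applied here to the values this article assumes for the ICON test (sensitivity 95%, specificity 99%); the symbols are notation chosen for this sketch.

```latex
\begin{align*}
LR_{+} &= \frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{0.95}{0.01} = 95 \\
LR_{-} &= \frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{0.05}{0.99} \approx 0.05
\end{align*}
```

Neither expression contains the prevalence, which is why the same likelihood ratios apply to groups in which a condition is rare and groups in which it is common, as long as the spectrum of disease is the same.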

However, with higher prevalence, a higher proportion of those with a positive test result will in fact have the disease for which they are being tested. Referring to Fig. 4, in the population with a lower prevalence of pregnancy, 95 of 99 women (96%) with positive test results are pregnant (Fig. 4C) compared with 380 of 381 women (99.7%) in the population with a higher prevalence (Fig. 4B). The likelihood of the condition or disease among patients who have a positive test result is sometimes referred to as the predictive value of a test (more precisely, the positive predictive value). The predictive value corresponds to the post-test probability of the disease when the test result is positive. Unlike sensitivity, specificity or likelihood ratios, predictive values are strongly influenced by changes in prevalence in the population being tested.<br />
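The dependence of the predictive value on prevalence can be sketched in a few lines of Python using Bayes' theorem. The sensitivity and specificity are the values assumed for the ICON test in this article; the function name and the list of settings are illustrative choices made here.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of the condition given a positive result (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Values assumed in the article for the ICON test.
SENSITIVITY = 0.95
SPECIFICITY = 0.99

# Prevalence of pregnancy in the 3 settings of Fig. 4.
SETTINGS = [("Fig. 4C", 0.20), ("Fig. 4A", 0.50), ("Fig. 4B", 0.80)]

for label, prevalence in SETTINGS:
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"{label}: prevalence {prevalence:.0%}, predictive value {ppv:.1%}")
```

With these inputs the function agrees with the counts in Fig. 4: 95 of 99 at 20% prevalence, 95 of 96 at 50% and 380 of 381 at 80%. The test characteristics are held fixed throughout; only the prevalence argument changes.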

Although differences in prevalence alone should not affect the sensitivity or specificity of a test, in many clinical settings disease prevalence and severity may be related. For instance, rheumatoid arthritis seen in a family physician’s office will be relatively uncommon, and most patients will have a relatively mild case. In contrast, rheumatoid arthritis will be common in a rheumatologist’s office, and patients will tend to have relatively severe disease. Tests to diagnose rheumatoid arthritis in the rheumatologist’s waiting area (e.g., hand inspection for joint deformity) are likely to be relatively more sensitive not because of the increased prevalence but because of the spectrum of disease present (e.g., degree and extent of joint deformity) in this setting.<br />
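One way to see why spectrum, rather than prevalence, drives this difference is to treat the sensitivity in a setting as a weighted average of severity-specific sensitivities. The sketch below does this in Python. Every number in it is hypothetical and chosen only for illustration; the article gives no such figures for hand inspection.

```python
def overall_sensitivity(case_mix, sensitivity_by_severity):
    """Average the severity-specific sensitivities, weighted by the share of
    diseased patients at each severity level in a given setting."""
    assert abs(sum(case_mix.values()) - 1.0) < 1e-9, "case mix must sum to 1"
    return sum(share * sensitivity_by_severity[severity]
               for severity, share in case_mix.items())


# HYPOTHETICAL values, not taken from the article.
SENSITIVITY_BY_SEVERITY = {"mild": 0.30, "severe": 0.90}
FAMILY_PRACTICE = {"mild": 0.80, "severe": 0.20}   # mostly mild disease
RHEUMATOLOGY = {"mild": 0.20, "severe": 0.80}      # mostly severe disease

for name, mix in [("family practice", FAMILY_PRACTICE),
                  ("rheumatology clinic", RHEUMATOLOGY)]:
    sens = overall_sensitivity(mix, SENSITIVITY_BY_SEVERITY)
    print(f"{name}: sensitivity {sens:.0%}")
```

Prevalence does not appear anywhere in the calculation: only the mix of mild and severe cases among patients who have the disease changes between the two settings, and that alone is enough to shift the sensitivity.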

The bottom line<br />

• Disease prevalence has no direct effect on test characteristics (e.g., likelihood ratios, sensitivity and specificity).<br />

• Spectrum of disease and disease prevalence have different effects on diagnostic test characteristics.<br />

Conclusions<br />

Clinicians need to understand how and when the choice of patients for a diagnostic test study may affect the performance of the test. Both disease spectrum in patients with the condition of interest and the spectrum of competing conditions in patients without the condition of interest can affect the test’s apparent diagnostic power. Despite the potentially powerful impact of disease spectrum and competing conditions, changes in prevalence that do not reflect changes in spectrum will not alter test performance.<br />

This article has been peer reviewed.<br />

From the Knowledge and Encounter Research Unit, Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minn. (Montori); the Departments of Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco (Newman); Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Columbia University College of Physicians and Surgeons, New York, NY (Wyer); and the Departments of Medicine and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)<br />

Competing interests: None declared.<br />

Contributors: Victor Montori, as principal author, oversaw and contributed to the writing of the manuscript. Thomas Newman reviewed the manuscript at all phases of development and contributed to the writing as coauthor of tip 2. Sheri Keitz used all tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are reported in the manuscript. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence with format specifications. Gordon Guyatt developed the original idea for tips 1 and 2, reviewed the manuscript at all phases of development, contributed to the writing as coauthor, and reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content as general editor.<br />

References<br />

1. Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA Press; 2002. p. 121-40.<br />

2. Wyer P, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series. CMAJ 2004;171(4):347-8.<br />

3. Dao Q, Krishnaswamy P, Kazanegra R, Harrison A, Amirnovin R, Lenert L, et al. Utility of B-type natriuretic peptide in the diagnosis of congestive heart failure in an urgent-care setting. J Am Coll Cardiol 2001;37:379-85.<br />

4. Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al.; Breathing Not Properly Multinational Study Investigators. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347:161-7.<br />

5. McCullough PA, Nowak RM, McCord J, Hollander JE, Herrmann HC, Steg PG, et al. B-type natriuretic peptide and clinical judgment in emergency diagnosis of heart failure: analysis from Breathing Not Properly (BNP) Multinational Study. Circulation 2002;106:416-22.<br />

CMAJ AUG. 16, 2005; 173 (4) 389



6. Hohl CM, Mitelman BY, Wyer P, Lang E. Should emergency physicians use B-type natriuretic peptide testing in patients with unexplained dyspnea? Can J Emerg Med 2003;5:162-5.<br />

7. Schwam E. B-type natriuretic peptide for diagnosis of heart failure in emergency department patients: a critical appraisal. Acad Emerg Med 2004;11:686-91.<br />

8. Tandberg D, Deely JJ, O’Malley AJ. Generalized likelihood ratios for quantitative diagnostic test scores. Am J Emerg Med 1997;15:694-9.<br />

9. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061-6.<br />

10. Product insert. Available: www.beckman.com/literature/ClinDiag/08109.D.pdf (accessed 13 Jul 2005).<br />

11. Lauszus FF. Clinical trial <strong>of</strong> 2 highly sensitive pregnancy tests — Tandem<br />

ICON HCG-urine and OPCO On-step Pacific Biotech. Ugeskr Laeger 1992;<br />

154:2069-70.<br />

12. Mishalani SH, Seliktar J, Braunstein GD. Four rapid serum–urine combination assays of choriogonadotropin (hCG) compared and assessed for their utility in quantitative determinations of hCG. Clin Chem 1994;40:1944-99.<br />

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave.,<br />

Pelham NY 10804; fax 212 305-6792; pwyer@att.net<br />

Members of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia, Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.; Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa<br />

Articles to date in this series<br />


Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ 2004;171(4):353-8.<br />

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171(6):611-5.<br />

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G, et al. Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic). CMAJ 2004;171(11):1369-73.<br />

Hatala R, Keitz S, Wyer P, Guyatt G; for the Evidence-Based Medicine Teaching Tips Working Group. Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results. CMAJ 2005;172(5):661-5.<br />
