
BMJ Statistical Notes Series
List by JM Bland: http://www-users.york.ac.uk/~mb55/pubs/pbstnote.htm

1 Bland JM, Altman DG. Correlation, regression and repeated data. BMJ 1994;308:896
2 Bland JM, Altman DG. Regression towards the mean. BMJ 1994;308:1499
3 Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552
4 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102
5 Altman DG, Bland JM. Diagnostic tests 3: receiver operating characteristic plots. BMJ 1994;309:188
6 Bland JM, Altman DG. One- and two-sided tests of significance. BMJ 1994;309:248
7 Bland JM, Altman DG. Some examples of regression towards the mean. BMJ 1994;309:780
8 Altman DG, Bland JM. Quartiles, quintiles, centiles, and other quantiles. BMJ 1994;309:996
9 Bland JM, Altman DG. Matching. BMJ 1994;309:1128
10 Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ 1995;310:170
11 Altman DG, Bland JM. The normal distribution. BMJ 1995;310:298
12 Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 1, correlation within subjects. BMJ 1995;310:446
13 Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 2, correlation between subjects. BMJ 1995;310:633
14 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485
15 Altman DG, Bland JM. Presentation of numerical data. BMJ 1996;312:572
16 Bland JM, Altman DG. Logarithms. BMJ 1996;312:700
17 Bland JM, Altman DG. Transforming data. BMJ 1996;312:770
18 Bland JM, Altman DG. Transformations, means and confidence intervals. BMJ 1996;312:1079
19 Bland JM, Altman DG. The use of transformations when comparing two means. BMJ 1996;312:1153 [Dataset bpl 04591]
20 Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472-1473
21 Bland JM, Altman DG. Measurement error. BMJ 1996;312:1654
22 Bland JM, Altman DG. Measurement error and correlation coefficients. BMJ 1996;313:41-42 (includes corrected table)
23 Bland JM, Altman DG. Measurement error proportional to the mean. BMJ 1996;313:106



24 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. BMJ 1996;313:486
25 Matthews JNS, Altman DG. Interaction 2: compare effect sizes not P values. BMJ 1996;313:808
26 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity. BMJ 1996;313:862
27 Altman DG, Bland JM. Detecting skewness from summary information. BMJ 1996;313:1200
28 Bland JM, Altman DG. Cronbach's alpha. BMJ 1997;314:572
29 Altman DG, Bland JM. Units of analysis. BMJ 1997;314:1874
30 Bland JM, Kerry SM. Trials randomised in clusters. BMJ 1997;315:600
31 Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998;316:54
32 Bland JM, Kerry SM. Weighted comparison of means. BMJ 1998;316:129
33 Kerry SM, Bland JM. Sample size in cluster randomisation. BMJ 1998;316:549
34 Kerry SM, Bland JM. The intra-cluster correlation coefficient in cluster randomisation. BMJ 1998;316:1455
35 Altman DG, Bland JM. Generalisation and extrapolation. BMJ 1998;317:409-410
36 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-469
37 Bland JM, Altman DG. Bayesians and frequentists. BMJ 1998;317:1151
38 Bland JM, Altman DG. Survival probabilities (the Kaplan-Meier method). BMJ 1998;317:1572
39 Altman DG, Bland JM. Treatment allocation in controlled trials: why randomise? BMJ 1999;318:1209
40 Altman DG, Bland JM. Variables and parameters. BMJ 1999;318:1667
41 Altman DG, Bland JM. How to randomise. BMJ 1999;319:703-704
42 Bland JM, Altman DG. The odds ratio. BMJ 2000;320:1468
43 Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504
44 Altman DG, Schulz KF. Concealing treatment allocation in randomised trials. BMJ 2001;323:446-447
45 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow up measurements. BMJ 2001;323:1123-1124
46 Bland JM, Altman DG. Validating scales and indexes. BMJ 2002;324:606-607
47 Altman DG, Bland JM. Interaction revisited: the difference between two estimates. BMJ 2003;326:219
48 Bland JM, Altman DG. The logrank test. BMJ 2004;328:1073



49 Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004;329:168-169
50 Altman DG, Bland JM. Treatment allocation by minimisation. BMJ 2005;330:843
51 Altman DG, Bland JM. Standard deviations and standard errors. BMJ 2005;331:903
52 Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ 2006;332:1080
53 Altman DG, Bland JM. Missing data. BMJ 2007;334:424
54 Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;339:170
55 Bland JM, Altman DG. Analysis of continuous data from small samples. BMJ 2009;339:961-962
56 Bland JM, Altman DG. Comparisons within randomised groups can be very misleading. BMJ 2011;342:d561
57 Bland JM, Altman DG. Correlation in restricted ranges of data. BMJ 2011;343:577
58 Altman DG, Bland JM. How to obtain the confidence interval from a P value. BMJ 2011;343:681
59 Altman DG, Bland JM. How to obtain the P value from a confidence interval. BMJ 2011;343:682
60 Altman DG, Bland JM. Brackets (parentheses) in formulas. BMJ 2011;343:578




Statistics Notes
Measurement error
J Martin Bland, Douglas G Altman

This is the 21st in a series of occasional notes on medical statistics.

Several measurements of the same quantity on the same subject will not in general be the same. This may be because of natural variation in the subject, variation in the measurement process, or both. For example, table 1 shows four measurements of lung function in each of 20 schoolchildren (taken from a larger study1). The first child shows typical variation, having peak expiratory flow rates of 190, 220, 200, and 200 l/min.

Let us suppose that the child has a "true" average value over all possible measurements, which is what we really want to know when we make a measurement. Repeated measurements on the same subject will vary around the true value because of measurement error. The standard deviation of repeated measurements on the same subject enables us to measure the size of the measurement error. We shall assume that this standard deviation is the same for all subjects, as otherwise there would be no point in estimating it. The main exception is when the measurement error depends on the size of the measurement, usually with measurements becoming more variable as the magnitude of the measurement increases. We deal with this case in a subsequent statistics note. The common standard deviation of repeated measurements is known as the within-subject standard deviation, which we shall denote by s_w.

To estimate the within-subject standard deviation, we need several subjects with at least two measurements for each. In addition to the data, table 1 also shows the mean and standard deviation of the four readings for each child. To get the common within-subject standard deviation we actually average the variances, the squares of the standard deviations. The mean within-subject variance is 460.52, so the estimated within-subject standard deviation is s_w = √460.5 = 21.5 l/min. The calculation is easier using a program that performs one way analysis of variance2 (table 2). The value called the residual mean square is the within-subject variance. The analysis of variance method is the better approach in practice, as it deals automatically with the case of subjects having different numbers of observations. We should check the assumption that the standard deviation is unrelated to the magnitude of the measurement. This can be done graphically, by plotting the individual subjects' standard deviations against their means (fig 1). Any important relation should be fairly obvious, but we can check analytically by calculating a rank correlation coefficient. For the figure there does not appear to be a relation (Kendall's τ = 0.16, P = 0.3).

A common design is to take only two measurements per subject. In this case the method can be simplified because the variance of two observations is half the square of their difference. So if the difference between the two observations for subject i is d_i, the within-subject standard deviation s_w is given by s_w = √(Σd_i²/2n), where n is the number of subjects. We can check for a relation between standard deviation and mean by plotting for each subject the absolute value of the difference, that is, ignoring any sign, against the mean.

The measurement error can be quoted as s_w. The difference between a subject's measurement and the true value would be expected to be less than 1.96 s_w for 95% of observations. Another useful way of presenting measurement error is sometimes called the repeatability, which is √2 × 1.96 s_w, or 2.77 s_w. The difference between two measurements for the same subject is expected to be less than 2.77 s_w for 95% of pairs of observations. For the data in table 1 the repeatability is 2.77 × 21.5 = 60 l/min. The large variability in peak expiratory flow rate is well known, so individual readings of peak expiratory flow are seldom used. The variable used for analysis in the study from which table 1 was taken was the mean of the last three readings.1

Other ways of describing the repeatability of measurements will be considered in subsequent notes.

Table 1: Repeated peak expiratory flow rate (PEFR, l/min) measurements for 20 schoolchildren

Child  1st  2nd  3rd  4th  Mean    SD
1      190  220  200  200  202.50  12.58
2      220  200  240  230  222.50  17.08
3      260  260  240  280  260.00  16.33
4      210  300  280  265  263.75  38.60
5      270  265  280  270  271.25   6.29
6      280  280  270  275  276.25   4.79
7      260  280  280  300  280.00  16.33
8      275  275  275  305  282.50  15.00
9      280  290  300  290  290.00   8.16
10     320  290  300  290  300.00  14.14
11     300  300  310  300  302.50   5.00
12     270  250  330  370  305.00  55.08
13     320  330  330  330  327.50   5.00
14     335  320  335  375  341.25  23.58
15     350  320  340  365  343.75  18.87
16     360  320  350  345  343.75  17.02
17     330  340  380  390  360.00  29.44
18     335  385  360  370  362.50  21.02
19     400  420  425  420  416.25  11.09
20     430  460  480  470  460.00  21.60

Fig 1: Individual subjects' standard deviations plotted against their means (scatter plot, not reproduced).

Table 2: One way analysis of variance for the data of table 1

Source of variation         Degrees of freedom  Sum of squares  Mean square  Variance ratio (F)
Children                    19                  285318.44       15016.78     32.6
Residual (within subjects)  60                  27631.25        460.52

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor Bland.

1 Bland JM, Holland WW, Elliott A. The development of respiratory symptoms in a cohort of Kent schoolchildren. Bull Physio-path Resp 1974;10:699-716.
2 Altman DG, Bland JM. Comparing several groups using analysis of variance. BMJ 1996;312:1472.

BMJ 1996;312:1654
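As a quick check of the arithmetic above, the short Python sketch below (an editorial addition, not part of the original note) estimates the within-subject standard deviation from the table 1 readings by averaging the per-child variances, then forms the repeatability as 2.77 s_w. It reproduces the within-subject variance of about 460.5 and s_w of about 21.5 l/min quoted in the note.

```python
import numpy as np

# Four PEFR readings (l/min) for each of the 20 schoolchildren in table 1.
pefr = np.array([
    [190, 220, 200, 200], [220, 200, 240, 230], [260, 260, 240, 280],
    [210, 300, 280, 265], [270, 265, 280, 270], [280, 280, 270, 275],
    [260, 280, 280, 300], [275, 275, 275, 305], [280, 290, 300, 290],
    [320, 290, 300, 290], [300, 300, 310, 300], [270, 250, 330, 370],
    [320, 330, 330, 330], [335, 320, 335, 375], [350, 320, 340, 365],
    [360, 320, 350, 345], [330, 340, 380, 390], [335, 385, 360, 370],
    [400, 420, 425, 420], [430, 460, 480, 470],
])

# Within-subject variance = mean of the per-child sample variances
# (equivalently, the residual mean square from one way analysis of variance).
within_var = pefr.var(axis=1, ddof=1).mean()
s_w = np.sqrt(within_var)            # within-subject standard deviation
repeatability = 2.77 * s_w           # 95% limit for the difference between two readings

print(f"within-subject variance: {within_var:.2f}")  # about 460.5
print(f"s_w: {s_w:.1f} l/min")                       # about 21.5
print(f"repeatability: {repeatability:.0f} l/min")   # close to the 60 l/min quoted
```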


" "'.~..~~ ~.v~. a"uu,vu,-~ o'V'-" 'v UCla. u.C VL"~llL'" o.;VI1o.;urr~I11;<br />

complete destruction of the femoral head. Blood cultures infection. If the diagnosis is missed or delayed, the con-<br />

grew S auTeUS. His C reactive protein peaked at 122 sequences are serious: the joint destruction may<br />

mg/!. The hip was drained surgically of large amounts of preclude successful arthroplasty or, perhaps worse, a<br />

pus, culture of which grew S aureus and Proteus mirabi- hip replacement may be inserted into an unrecognised<br />

lis. He ~as treated with a prolonged course of high dose septic environment.<br />

antibiotics, with some clinical improvement but The most userJl non-specific tests seem to be the<br />

continuing poor mobility. erythrocyte sedimentation rate and measurement of C .<br />

n an ..~e~ctive ?ro~ein; the single most useful specific test is ~<br />

Her DIscussion Jomt asplranon and culture. .<br />

first These patients were all elderly and had pre-existing We recommend consideration of septic arthritis in<br />

ng/l. osteoarthritis and concurrent infection elsewhere. any patient with an apparently acute exacerbation of an<br />

pus None, however, had other systemic conditions predis- osteoarthritic joint, particularly if there is a possibility of<br />

nar'y posing to infection, such as diabetes, except for the sec- coexistent infection elsewhere. Other possible nonayed<br />

ond patient, who had a myeloproliferative disorder. The infective causes of a rapid deterioration in symptoms<br />

,ated development of septic arthritis <strong>by</strong> haematogenous include pseudogout and avascular necrosis, and these<br />

was spread was associated with increasing hip pain and will also need to be considered.<br />

: she rapid destruction of the femoral head. This was accom- Funding: None.<br />

1 few<br />

panied <strong>by</strong> a delay in diagnosis of up to six months.<br />

Infection in the presence of existing inflammatory<br />

Conflict ofinterest:None.<br />

ldio- joint disease, particularly rheumatoid arthritis, is well<br />

mtis known,' It is much rarer to see this in association with I Gardner GC, Weisman MH. Pyarthrosis in patients with rhewnatoid<br />

ISIng the much commoner osteoarthrins, although It IS<br />

arthritis,<br />

40 years.<br />

a<br />

Am<strong>JM</strong>ed<br />

report of<br />

1990;88:503-11.<br />

13 cases and a review of the literature from the past<br />

ttage<br />

mm<br />

recognised! In common with other bone and joint<br />

. m fi ectlOnS, . th e presentanon . 0 f sepnc . ar thr InS " has 2 Goldenberg DL, Cohen AS. Acute infectious arthritis. Am J Med<br />

3 Vincent 1976;60:369-77. GM, Amirault )D. Septic arthritis in the elderly. Chn Orthop<br />

)fhis<br />

with<br />

changed in recent years from the usual florid illness. 1991;251:241-5.<br />

leral<br />

atec-<br />

:rred .<br />

Statistics Notes
Measurement error and correlation coefficients
J Martin Bland, Douglas G Altman

This is the 22nd in a series of occasional notes on medical statistics.

Measurement error is the variation between measurements of the same quantity on the same individual.1 To quantify measurement error we need repeated measurements on several subjects. We have discussed the within-subject standard deviation as an index of measurement error,1 which we like as it has a simple clinical interpretation. Here we consider the use of correlation coefficients to quantify measurement error.

A common design for the investigation of measurement error is to take pairs of measurements on a group of subjects, as in table 1. When we have pairs of observations it is natural to plot one measurement against the other. The resulting scatter diagram (fig 1, not reproduced) may tempt us to calculate a correlation coefficient between the first and second measurement. There are difficulties in interpreting this correlation coefficient. In general, the correlation between repeated measurements will depend on the variability between subjects. Samples containing subjects who differ greatly will produce larger correlation coefficients than will samples containing similar subjects. For example, suppose we split this group, in whom we have measured forced expiratory volume in one second (FEV1), into two subsamples, the first 10 subjects and the second 10 subjects. As table 1 is ordered by the first FEV1 measurement, both subsamples vary less than does the whole sample. The correlation for the first subsample is r = 0.63 and for the second it is r = 0.31, both less than r = 0.77 for the full sample. The correlation coefficient thus depends on the way the sample is chosen, and it has meaning only for the population from which the study subjects can be regarded as a random sample. If we select subjects to give a wide range of the measurement, the natural approach when investigating measurement error, this will inflate the correlation coefficient.

The correlation coefficient between repeated measurements is often called the reliability of the measurement method. It is widely used in the validation of psychological measures such as scales of anxiety and depression, where it is known as the test-retest reliability. In such studies it is quoted for different populations (university students, psychiatric outpatients, etc) because the correlation coefficient differs between them as a result of differing ranges of the quantity being measured. The user has to select the correlation from the study population most like the user's own.

Another problem with the use of the correlation coefficient between the first and second measurements is discussed in the fragment that follows table 2 below.

Table 1: Pairs of measurements of FEV1 (litres) a few weeks apart from 20 Scottish schoolchildren, taken from a larger study (D Strachan, personal communication). The table printed with the original note was incorrect; the corrected data are given under "Correct data for Statistics Note" below.

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, PO Box 777, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor Bland.

BMJ 1996;313:41-2


Table 2: One way analysis of variance for the data in table 1

Source of variation  Degrees of freedom  Sum of squares  Mean square  Variance ratio (F)
Children             19                  1.52981         0.08052      7.4

In practice, there will usually be little difference between r and r_I (the intraclass correlation coefficient) for true repeated measurements. If, however, there is a systematic change from the first measurement to the second, as might be caused by a learning effect, r_I will be much less than r.


Correct data for Statistics Note

There was an error in Statistics Note 22, Measurement error and correlation coefficients, Bland JM and Altman DG, 1996, British Medical Journal 313, 41-2. The wrong table of data was printed. The correct data are:

Sub.  1st   2nd    Sub.  1st   2nd
1     1.20  1.24   11    1.62  1.68
2     1.21  1.19   12    1.64  1.61
3     1.42  1.83   13    1.65  2.05
4     1.43  1.38   14    1.74  1.80
5     1.49  1.60   15    1.78  1.76
6     1.58  1.36   16    1.80  1.76
7     1.58  1.65   17    1.80  1.82
8     1.59  1.60   18    1.85  1.73
9     1.60  1.58   19    1.88  1.82
10    1.61  1.61   20    1.92  2.00
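The comparison described in the note can be repeated with the corrected data above. The minimal Python sketch below (an editorial addition using NumPy; the note itself contains no code) computes the correlation between first and second FEV1 measurements for the whole sample and for the two halves of the sample ordered by the first reading, illustrating how restricting the range of the measurement tends to reduce the correlation coefficient. The note quotes r = 0.77, 0.63, and 0.31 for its own calculation.

```python
import numpy as np

# Corrected FEV1 data (litres): first and second measurements for the 20 children,
# listed in order of the first measurement.
first = [1.20, 1.21, 1.42, 1.43, 1.49, 1.58, 1.58, 1.59, 1.60, 1.61,
         1.62, 1.64, 1.65, 1.74, 1.78, 1.80, 1.80, 1.85, 1.88, 1.92]
second = [1.24, 1.19, 1.83, 1.38, 1.60, 1.36, 1.65, 1.60, 1.58, 1.61,
          1.68, 1.61, 2.05, 1.80, 1.76, 1.76, 1.82, 1.73, 1.82, 2.00]

def r(x, y):
    """Pearson correlation coefficient between two sequences."""
    return np.corrcoef(x, y)[0, 1]

# Full sample versus the two halves ordered by the first reading: each half
# covers a narrower range of FEV1 than the whole sample.
print(f"all 20 subjects: r = {r(first, second):.2f}")
print(f"first 10:        r = {r(first[:10], second[:10]):.2f}")
print(f"last 10:         r = {r(first[10:], second[10:]):.2f}")
```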




Statistics notes
Bayesians and frequentists
J Martin Bland, Douglas G Altman

There are two competing philosophies of statistical analysis: the Bayesian and the frequentist. The frequentists are much the larger group, and almost all the statistical analyses which appear in the BMJ are frequentist. The Bayesians are much fewer and until recently could only snipe at the frequentists from the high ground of university departments of mathematical statistics. Now the increasing power of computers is bringing Bayesian methods to the fore.

Bayesian methods are based on the idea that unknown quantities, such as population means and proportions, have probability distributions. The probability distribution for a population proportion expresses our prior knowledge or belief about it, before we add the knowledge which comes from our data. For example, suppose we want to estimate the prevalence of diabetes in a health district. We could use the knowledge that the percentage of diabetics in the United Kingdom as a whole is about 2%, so we expect the prevalence in our health district to be fairly similar. It is unlikely to be 10%, for example. We might have information based on other datasets that such rates vary between 1% and 3%, or we might guess that the prevalence is somewhere between these values. We can construct a prior distribution which summarises our beliefs about the prevalence in the absence of specific data. We can do this with a distribution having mean 2 and standard deviation 0.5, so that two standard deviations on either side of the mean are 1% and 3%. (The precise mathematical form of the prior distribution depends on the particular problem.)

Suppose we now collect some data by a sample survey of the district population. We can use the data to modify the prior probability distribution to tell us what we now think the distribution of the population percentage is; this is the posterior distribution. For example, if we did a survey of 1000 subjects and found 15 (1.5%) to be diabetic, the posterior distribution would have mean 1.7% and standard deviation 0.3%. We can calculate a set of values, a 95% credible interval (1.2% to 2.4% for the example), such that there is a probability of 0.95 that the percentage of diabetics is within this set. The frequentist analysis, which ignores the prior information, would give an estimate 1.5% with standard error 0.4% and 95% confidence interval 0.8% to 2.5%. This is similar to the results of the Bayesian method, as is usually the case, but the Bayesian method gives an estimate nearer the prior mean and a narrower interval.

Frequentist methods regard the population value as a fixed, unvarying (but unknown) quantity, without a probability distribution. Frequentists then calculate confidence intervals for this quantity, or significance tests of hypotheses concerning it. Bayesians reasonably object that this does not allow us to use our wider knowledge of the problem. Also, it does not provide what researchers seem to want, which is to be able to say that there is a probability of 95% that the population value lies within the 95% confidence interval, or that the probability that the null hypothesis is true is less than 5%. It is argued that researchers want this, which is why they persistently misinterpret confidence intervals and significance tests in this way.

A major difficulty, of course, is deciding on the prior distribution. This is going to influence the conclusions of the study, yet it may be a subjective synthesis of the available information, so the same data analysed by different investigators could lead to different conclusions. Another difficulty is that Bayesian methods may lead to intractable computational problems. (All widely available statistical packages use frequentist methods.)

Most statisticians have become Bayesians or frequentists as a result of their choice of university. They did not know that Bayesians and frequentists existed until it was too late and the choice had been made. There have been subsequent conversions. Some who were taught the Bayesian way discovered that when they had huge quantities of medical data to analyse the frequentist approach was much quicker and more practical, although they may remain Bayesian at heart. Some frequentists have had Damascus road conversions to the Bayesian view. Many practising statisticians, however, are fairly ignorant of the methods used by the rival camp and too busy to have time to find out.

The advent of very powerful computers has given a new impetus to the Bayesians. Computer intensive methods of analysis are being developed, which allow new approaches to very difficult statistical problems, such as the location of geographical clusters of cases of a disease. This new practicability of the Bayesian approach is leading to a change in the statistical paradigm, and a rapprochement between Bayesians and frequentists.1 2 Frequentists are becoming curious about the Bayesian approach and more willing to use Bayesian methods when they provide solutions to difficult problems. In the future we expect to see more Bayesian analyses reported in the BMJ. When this happens we may try to use Statistics notes to explain them, though we may have to recruit a Bayesian to do it.

We thank David Spiegelhalter for comments on a draft.

1 Breslow N. Biostatistics and Bayes (with discussion). Statist Sci 1990;5:269-98.
2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials (with discussion). J R Statist Soc A 1994;157:357-416.

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor Bland.

BMJ 1998;317:1151
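The numerical example above can be reproduced approximately with a normal approximation to the Bayesian update, as in the Python sketch below (an editorial addition; the note does not specify the exact form of its prior, so the credible interval here only roughly matches the 1.2% to 2.4% quoted). Combining a normal prior with a normal likelihood weights each by its precision, the reciprocal of its variance.

```python
import math

# Prior: prevalence about 2% with SD 0.5%. Data: 15 diabetics in 1000 subjects.
prior_mean, prior_sd = 2.0, 0.5                  # per cent
p_hat = 100 * 15 / 1000                          # observed prevalence, per cent
se = 100 * math.sqrt(0.015 * 0.985 / 1000)       # standard error of the estimate

# Precision-weighted combination of the prior and the data (normal-normal update).
w_prior, w_data = 1 / prior_sd**2, 1 / se**2
post_mean = (w_prior * prior_mean + w_data * p_hat) / (w_prior + w_data)
post_sd = math.sqrt(1 / (w_prior + w_data))

print(f"frequentist estimate: {p_hat:.1f}% (SE {se:.1f}%)")
print(f"posterior:            {post_mean:.1f}% (SD {post_sd:.1f}%)")  # about 1.7% and 0.3%
print(f"approximate 95% credible interval: "
      f"{post_mean - 1.96 * post_sd:.1f}% to {post_mean + 1.96 * post_sd:.1f}%")
```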


Statistics Notes
Survival probabilities (the Kaplan-Meier method)
J Martin Bland, Douglas G Altman

As we have observed,1 analysis of survival data requires special techniques because some observations are censored as the event of interest has not occurred for all patients. For example, when patients are recruited over two years one recruited at the end of the study may be alive at one year follow up, whereas one recruited at the start may have died after two years. The patient who died has a longer observed survival than the one who still survives and whose ultimate survival time is unknown.

The table shows data from a study of conception in subfertile women.2 The event is conception, and women "survived" until they conceived. One woman conceived after 16 months (menstrual cycles), whereas several were followed for shorter time periods during which they did not conceive; their time to conceive was thus censored.

We wish to estimate the proportion surviving (not having conceived) by any given time, which is also the estimated probability of survival to that time for a member of the population from which the sample is drawn. Because of the censoring we use the Kaplan-Meier method. For each time interval we estimate the probability that those who have survived to the beginning will survive to the end. This is a conditional probability (the probability of being a survivor at the end of the interval on condition that the subject was a survivor at the beginning of the interval). Survival to any time point is calculated as the product of the conditional probabilities of surviving each time interval. These data are unusual in representing months (menstrual cycles); usually the conditional probabilities relate to days. The calculations are simplified by ignoring times at which there were no recorded survival times (whether events or censored times).

In the example, the probability of surviving for two months is the probability of surviving the first month times the probability of surviving the second month given that the first month was survived. Of 38 women, 32 survived the first month, or 0.842. Of the 32 women at the start of the second month ("at risk" of conception), 27 had not conceived by the end of the month. The conditional probability of surviving the second month is thus 27/32 = 0.844, and the overall probability of surviving (not conceiving) after two months is 0.842 × 0.844 = 0.711. We continue in this way to the end of the table, or until we reach the last event. Observations censored at a given time affect the number still at risk at the start of the next month. The estimated probability changes only in months when there is a conception. In practice, a computer is used to do these calculations. Standard errors and confidence intervals for the estimated survival probabilities can be found by Greenwood's method.3 Survival probabilities are usually presented as a survival curve (figure). The "curve" is a step function, with sudden changes in the estimated probability corresponding to times at which an event was observed. The times of the censored data are indicated by short vertical lines.

There are three assumptions in the above. Firstly, we assume that at any time patients who are censored have the same survival prospects as those who continue to be followed. This assumption is not easily testable. Censoring may be for various reasons. In the conception study some women had received hormone treatment to promote ovulation, and others had stopped trying to conceive. Thus they were no longer part of the population we wanted to study, and their survival times were censored. In most studies some subjects drop out for reasons unrelated to the condition under study (for example, emigration). If, however, for some patients in this study censoring was related to failure to conceive this would have biased the estimated survival probabilities downwards.

Secondly, we assume that the survival probabilities are the same for subjects recruited early and late in the study. In a long term observational study of patients with cancer, for example, the case mix may change over the period of recruitment, or there may be an innovation in ancillary treatment. This assumption may be tested, provided we have enough data to estimate survival curves for different subsets of the data.

Thirdly, we assume that the event happens at the time specified. This is not a problem for the conception data, but could be, for example, if the event were recurrence of a tumour which would be detected at a regular examination. All we would know is that the event happened between two examinations. This imprecision would bias the survival probabilities upwards. When the observations are at regular intervals this can be allowed for quite easily, using the actuarial method.3

Formal methods are needed for testing hypotheses about survival in two or more groups. We shall describe the logrank test for comparing curves and the more complex Cox regression model in future Notes.

Table: Time (months) to conception or censoring in 38 subfertile women after laparoscopy and hydrotubation2

Conceived: 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 6, 6, 9, 9, 9, 10, 13, 16
Did not conceive (censored): 2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24

Figure: Survival curve showing probability of not conceiving among 38 subfertile women after laparoscopy and hydrotubation2 (step function, not reproduced).

1 Altman DG, Bland JM. Time to event (survival) data. BMJ 1998;317:468-9.
2 Luthra P, Bland JM, Stanton SL. Incidence of pregnancy after laparoscopy and hydrotubation. BMJ 1982;284:1013-4.
3 Parmar MKB, Machin D. Survival analysis: a practical approach. Chichester: Wiley, 37, 47-9.

Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, head. Correspondence to: Professor Bland.

BMJ 1998;317:1572
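The product of conditional probabilities described above can be computed directly, as in this short Python sketch (an editorial addition, not part of the original note), which applies the Kaplan-Meier method to the conception data in the table. The first two rows of its output reproduce the 0.842 and 0.711 worked out in the text.

```python
from collections import Counter

# Months to conception (events) and months of follow up without conception
# (censored observations) for the 38 women in the table above.
conceived = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4,
             6, 6, 9, 9, 9, 10, 13, 16]
censored = [2, 3, 4, 7, 7, 8, 8, 9, 9, 9, 11, 24, 24]

events = Counter(conceived)      # conceptions at each time point
losses = Counter(censored)       # censorings at each time point

at_risk = len(conceived) + len(censored)
surv = 1.0
print("month  at risk  conceptions  S(t)")
for t in sorted(set(conceived) | set(censored)):
    d, c = events[t], losses[t]
    if d:  # the Kaplan-Meier estimate changes only when an event occurs
        surv *= (at_risk - d) / at_risk
        print(f"{t:5d}  {at_risk:7d}  {d:11d}  {surv:.3f}")
    at_risk -= d + c             # those censored at time t leave the risk set afterwards
```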


Statistics notes
Treatment allocation in controlled trials: why randomise?
Douglas G Altman, J Martin Bland

Since 1991 the BMJ has had a policy of not publishing trials that have not been properly randomised, except in rare cases where this can be justified.1 Why?

The simplest approach to evaluating a new treatment is to compare a single group of patients given the new treatment with a group previously treated with an alternative treatment. Usually such studies compare two consecutive series of patients in the same hospital(s). This approach is seriously flawed. Problems will arise from the mixture of retrospective and prospective studies, and we can never satisfactorily eliminate possible biases due to other factors (apart from treatment) that may have changed over time. Sacks et al compared trials of the same treatments in which randomised or historical controls were used and found a consistent tendency for historically controlled trials to yield more optimistic results than randomised trials.2 The use of historical controls can be justified only in tightly controlled situations of relatively rare conditions, such as in evaluating treatments for advanced cancer.

The need for contemporary controls is clear, but there are difficulties. If the clinician chooses which treatment to give each patient there will probably be differences in the clinical and demographic characteristics of the patients receiving the different treatments. Much the same will happen if patients choose their own treatment or if those who agree to have a treatment are compared with refusers. Similar problems arise when the different treatment groups are at different hospitals or under different consultants. Such systematic differences, termed bias, will lead to an overestimate or underestimate of the difference between treatments. Bias can be avoided by using random allocation.

A well known example of the confusion engendered by a non-randomised study was the study of the possible benefit of vitamin supplementation at the time of conception in women at high risk of having a baby with a neural tube defect.3 The investigators found that the vitamin group subsequently had fewer babies with neural tube defects than the placebo control group. The control group included women ineligible for the trial as well as women who refused to participate. As a consequence the findings were not widely accepted, and the Medical Research Council later funded a large randomised trial to answer the question in a way that would be widely accepted.4

The main reason for using randomisation to allocate treatments to patients in a controlled trial is to prevent biases of the types described above. We want to compare the outcomes of treatments given to groups of patients which do not differ in any systematic way. Another reason for randomising is that statistical theory is based on the idea of random sampling. In a study with random allocation the differences between treatment groups behave like the differences between random samples from a single population. We know how random samples are expected to behave and so can compare the observations with what we would expect if the treatments were equally effective.

The term random does not mean the same as haphazard but has a precise technical meaning. By random allocation we mean that each patient has a known chance, usually an equal chance, of being given each treatment, but the treatment to be given cannot be predicted. If there are two treatments the simplest method of random allocation gives each patient an equal chance of getting either treatment; it is equivalent to tossing a coin. In practice most people use either a table of random numbers or a random number generator on a computer. This is simple randomisation. Possible modifications include block randomisation, to ensure closely similar numbers of patients in each group, and stratified randomisation, to keep the groups balanced for certain prognostic patient characteristics. We discuss these extensions in a subsequent Statistics note.

Fifty years after the publication of the first randomised trial5 the technical meaning of the term randomisation continues to elude some investigators. Journals continue to publish "randomised" trials which are no such thing. One common approach is to allocate treatments according to the patient's date of birth or date of enrolment in the trial (such as giving one treatment to those with even dates and the other to those with odd dates), by the terminal digit of the hospital number, or simply alternately into the different treatment groups. While all of these approaches are in principle unbiased, being unrelated to patient characteristics, problems arise from the openness of the allocation system.1 Because the treatment is known when a patient is considered for entry into the trial this knowledge may influence the decision to recruit that patient and so produce treatment groups which are not comparable.

Of course, situations exist where randomisation is simply not possible.6 The goal here should be to retain all the methodological features of a well conducted randomised trial7 other than the randomisation.

1 Altman DG. Randomisation. BMJ 1991;302:1481-2.
2 Sacks H, Chalmers TC, Smith H. Randomized versus historical controls for clinical trials. Am J Med 1982;72:233-40.
3 Smithells RW, Sheppard S, Schorah CJ, Seller MJ, Nevin NC, Harris R, et al. Possible prevention of neural-tube defects by periconceptional vitamin supplementation. Lancet 1980;i:339-40.
4 MRC Vitamin Study Research Group. Prevention of neural tube defects: results of the Medical Research Council vitamin study. Lancet 1991;338:131-7.
5 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. BMJ 1948;2:769-82.
6 Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ 1996;312:1215-8.
7 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT Statement. JAMA 1996;276:637-9.

ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Department of Public Health Sciences, St George's Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. Correspondence to: Professor Altman.

BMJ 1999;318:1209



Statistics notes<br />

Variables and parameters<br />

Douglas G Altman, J Martin <strong>Bland</strong><br />

Like all specialist areas, statistics has developed its own<br />

language. As we have noted before, 1 much confusion<br />

may arise when a word in common use is also given a<br />

technical meaning. Statistics abounds in such terms,<br />

including normal, random, variance, significant, etc.<br />

Two commonly confused terms are variable and<br />

parameter; here we explain and contrast them.<br />

Information recorded about a sample of individuals<br />

(often patients) comprises measurements such as<br />

blood pressure, age, or weight and attributes such as<br />

blood group, stage of disease, and diabetes. Values of<br />

these will vary among the subjects; in this context<br />

blood pressure, weight, blood group and so on are<br />

variables. Variables are quantities which vary from<br />

individual to individual.<br />

By contrast, parameters do not relate to actual<br />

measurements or attributes but to quantities defining a<br />

theoretical model. The figure shows the distribution of<br />

measurements of serum albumin in 481 white men<br />

aged over 20 with mean 46.14 and standard deviation<br />

3.08 g/l. For the empirical data the mean and SD are<br />

called sample estimates. They are properties of the collection<br />

of individuals. Also shown is the normal 1 distribution<br />

which fits the data most closely. It too has mean<br />

46.14 and SD 3.08 g/l. For the theoretical distribution<br />

the mean and SD are called parameters. There is not<br />

one normal distribution but many, called a family of<br />

distributions. Each member of the family is defined <strong>by</strong><br />

its mean and SD, the parameters 1 which specify the<br />

particular theoretical normal distribution with which<br />

we are dealing. In this case, they give the best estimate<br />

of the population distribution of serum albumin if we<br />

can assume that in the population serum albumin has<br />

a normal distribution.<br />

Most statistical methods, such as t tests, are called<br />

parametric because they estimate parameters of some<br />

underlying theoretical distribution. Non-parametric<br />

methods, such as the Mann-Whitney U test and the log<br />

rank test for survival data, do not assume any particular<br />

family for the distribution of the data and so do not<br />

estimate any parameters for such a distribution.<br />

Another use of the word parameter relates to its<br />

original mathematical meaning as the value(s) defining<br />

one of a family of curves. If we fit a regression model,<br />

such as that describing the relation between lung function<br />

and height, the slope and intercept of this line<br />

<strong>BMJ</strong> VOLUME 318 19 JUNE 1999 <strong>www</strong>.bmj.com<br />

Downloaded from<br />

bmj.com on 28 February 2008<br />

20 Dionne CE, Koepsell TD, Von Korff M, Deyo RA, Barlow WE, Checkoway<br />

H. Predicting long-term functional limitations among back pain patients<br />

in primary care. J Clin Epidemiol 1997;50:31-43.<br />

21 Macfarlane GJ, Thomas E, Papageorgiou AC, Schollum J, Croft PR. The<br />

natural history of chronic pain in the community: a better prognosis than<br />

in the clinic? J Rheumatol 1996;23:1617-20.<br />

22 Troup JDG, Martin JW, Lloyd DCEF. Back pain in industry. A prospective<br />

survey. Spine 1981;6:61-9.<br />

23 Burton AK, Tillotson KM. Prediction of the clinical course of low-back<br />

trouble using multivariable models. Spine 1991;16:7-14.<br />

24 Pope MH, Rosen JC, Wilder DG, Frymoyer JW. The relation between biomechanical<br />

and psychological factors in patients with low-back pain.<br />

Spine 1980;5:173-8.<br />

Frequency<br />

(Accepted 31 March 1999)<br />

80<br />

60<br />

40<br />

20<br />

0<br />

35 40 45 50 55<br />

Albumin (g/l)<br />

Measurements of serum albumin in 481 white men aged over 20<br />

(data from Dr W G Miller)<br />

(more generally known as regression coefficients) are<br />

the parameters defining the model. They have no<br />

meaning for individuals, although they can be used to<br />

predict an individual’s lung function from their height.<br />

In some contexts parameters are values that can be<br />

altered to see what happens to the performance of<br />

some system. For example, the performance of a<br />

screening programme (such as positive predictive<br />

value or cost effectiveness) will depend on aspects such<br />

as the sensitivity and specificity of the screening test. If<br />

we look to see how the performance would change if,<br />

say, sensitivity and specificity were improved, then we<br />

are treating these as parameters rather than using the<br />

values observed in a real set of data.<br />

Parameter is a technical term which has only<br />

recently found its way into general use, unfortunately<br />

without keeping its correct meaning. It is common in<br />

medical journals to find variables incorrectly called<br />

parameters (but not in the <strong>BMJ</strong> we hope 2 ). Another<br />

common misuse of parameter is as a limit or boundary,<br />

as in “within certain parameters.” This misuse seems to<br />

have arisen from confusion between parameter and<br />

perimeter.<br />

Misuse of medical terms is rightly deprecated. Like<br />

other language errors it leads to confusion and the loss<br />

of valuable distinction. Misuse of non-medical terms<br />

should be viewed likewise.<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. The normal distribution. <strong>BMJ</strong> 1995;310:298.<br />

2 Endpiece: What’s a parameter? <strong>BMJ</strong> 1998;316:1877.<br />

Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF; J Martin <strong>Bland</strong>, professor of medical statistics, Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE. Correspondence to: Professor Altman. <strong>BMJ</strong> 1999;318:1667<br />


Much work still needs to be done to achieve this. To be<br />

useful in health policy at this level, all the targets need<br />

to be elaborated further and clear, practical statements<br />

must be made on their operation—especially the four<br />

targets on health policy and sustainable health systems.<br />

The WHO should stimulate the discussion of these<br />

important targets, but it should also be careful about<br />

being too prescriptive about health systems since this<br />

could be counterproductive.<br />

In addition, more attention should be given to the<br />

usefulness of the targets in member states. One way of<br />

doing this is to rank the countries <strong>by</strong> target and to<br />

divide them into three groups. A specific level could be<br />

set for each group. For example, for target 2, three such<br />

groups could be distinguished as follows:<br />

• Countries that have already achieved this target<br />
• Countries for which the global target is achievable and challenging<br />
• Countries that find the global target hard to achieve and therefore “demotivating.”<br />

The first group needs stricter target levels, and the<br />

third group less stringent ones. If a breakdown of this<br />

kind is made for each target, some countries may be<br />

classified in different groups for different targets. In this<br />

way, the targets will provide an insight into the health<br />

status of the population and could be useful for policy<br />

makers in member states in encouraging action and<br />

allocating their resources.<br />

We thank Dr J Visschedijk and Professor L J Gunning-Schepers<br />

and other referees of this article for their helpful comments.<br />

Funding: This study was commissioned <strong>by</strong> Policy Action<br />

Coordination at the WHO and supported <strong>by</strong> an unrestricted<br />

educational grant from Merck & Co Inc, New Jersey, USA.<br />

Competing interests: None declared.<br />


Statistics notes<br />

How to randomise<br />

Douglas G Altman, J Martin <strong>Bland</strong><br />

We have explained why random allocation of<br />

treatments is a required feature of controlled trials. 1<br />

Here we consider how to generate a random allocation<br />

sequence.<br />

Almost always patients enter a trial in sequence<br />

over a prolonged period. In the simplest procedure,<br />

simple randomisation, we determine each patient’s<br />

treatment at random independently with no constraints.<br />

With equal allocation to two treatment groups<br />

this is equivalent to tossing a coin, although in practice<br />

coins are rarely used. Instead we use computer generated<br />

random numbers or tables of random numbers. Suitable tables can be found in<br />

most statistics textbooks. The table shows an example 2 :<br />

the numbers can be considered as either random digits<br />

from 0 to 9 or random integers from 0 to 99.<br />

For equal allocation to two treatments we could<br />

take odd and even numbers to indicate treatments A<br />

and B respectively. We must then choose an arbitrary<br />


place to start and also the direction in which to read<br />

the table. The first 10 two digit numbers from a starting<br />

place in column 2 are 85 80 62 36 96 56 17 17 23 87,<br />

which translate into the sequence ABBBBBAAAA<br />
for the first 10 patients. We could instead have taken<br />

each digit on its own, or numbers 00 to 49 for A and 50<br />

to 99 for B. There are countless possible strategies; it<br />

makes no difference which is used.<br />

We can easily generalise the approach. With three<br />

groups we could use 01 to 33 for A, 34 to 66 for B, and<br />

67 to 99 for C (00 is ignored). We could allocate treatments<br />

A and B in proportions 2 to 1 <strong>by</strong> using 01 to 66<br />

for A and 67 to 99 for B.<br />
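A minimal Python sketch of simple randomisation is given below (our illustration, not the authors' software; the seed, sample sizes, and group labels are arbitrary). It mirrors the strategies just described: random integers from 0 to 99 mapped to treatments, for equal and for 2 to 1 allocation.<br />

```python
# Illustrative sketch of simple randomisation: each patient's treatment is
# decided independently from a random integer between 0 and 99.
import random

random.seed(1)  # arbitrary seed, used only so the lists can be reproduced

def two_groups(n):
    """Equal allocation: 00-49 -> A, 50-99 -> B."""
    return ["A" if random.randint(0, 99) <= 49 else "B" for _ in range(n)]

def two_to_one(n):
    """Allocation in proportions 2 to 1: 01-66 -> A, 67-99 -> B (00 ignored)."""
    allocations = []
    while len(allocations) < n:
        r = random.randint(0, 99)
        if r == 0:
            continue
        allocations.append("A" if r <= 66 else "B")
    return allocations

print("".join(two_groups(10)))
print("".join(two_to_one(10)))
```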

At any point in the sequence the numbers of<br />

patients allocated to each treatment will probably<br />

differ, as in the above example. But sometimes we want<br />

to keep the numbers in each group very close at all<br />

times. Block randomisation (also called restricted<br />

Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF; J Martin <strong>Bland</strong>, professor of medical statistics, Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE. Correspondence to: Professor Altman. <strong>BMJ</strong> 1999;319:703–4<br />

Excerpt from a table of random digits. 2 The numbers used in the<br />

example are shown in bold<br />

89 11 77 99 94<br />

35 83 73 68 20<br />

84 <strong>85</strong> 95 45 52<br />
56 <strong>80</strong> 93 52 82<br />
97 <strong>62</strong> 98 71 39<br />
79 <strong>36</strong> 13 72 99<br />
34 <strong>96</strong> 98 54 89<br />
69 <strong>56</strong> 88 97 43<br />
09 <strong>17</strong> 78 78 02<br />
83 <strong>17</strong> 39 84 16<br />
24 <strong>23</strong> 36 44 14<br />
39 <strong>87</strong> 30 20 41<br />

75 18 53 77 83<br />

33 93 39 24 81<br />

22 52 01 86 71<br />

randomisation) is used for this purpose. For example, if<br />

we consider subjects in blocks of four at a time there<br />

are only six ways in which two get A and two get B:<br />

1: AABB   2: ABAB   3: ABBA   4: BBAA   5: BABA   6: BAAB<br />

We choose blocks at random to create the<br />

allocation sequence. Using the single digits of the previous<br />

random sequence and omitting numbers outside<br />

the range 1 to 6 we get 5 6 2 3 6 6 5 6 1 1. From these<br />
we can construct the block allocation sequence<br />
BABA / BAAB / ABAB / ABBA / BAAB, and so on.<br />

The numbers in the two groups at any time can never<br />

differ <strong>by</strong> more than half the block length. Block size is<br />

normally a multiple of the number of treatments.<br />

Large blocks are best avoided as they control balance<br />

less well. It is possible to vary the block length, again at<br />

random, perhaps using a mixture of blocks of size 2, 4,<br />

or 6.<br />
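A minimal Python sketch of block randomisation, with block length varied at random among 2, 4, and 6, is shown below (our illustration; the seed and numbers are arbitrary).<br />

```python
# Illustrative sketch of block (restricted) randomisation: within every block
# the treatments occur equally often, so the group sizes can never differ by
# more than half the block length. Block length is itself chosen at random.
import random

random.seed(2)  # arbitrary seed for reproducibility

def block_randomisation(n_patients, block_sizes=(2, 4, 6)):
    allocations = []
    while len(allocations) < n_patients:
        size = random.choice(block_sizes)     # vary the block length at random
        block = ["A", "B"] * (size // 2)      # equal numbers of A and B
        random.shuffle(block)                 # random order within the block
        allocations.extend(block)
    return allocations[:n_patients]

print("".join(block_randomisation(12)))
```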

While simple randomisation removes bias from the<br />

allocation procedure, it does not guarantee, for example,<br />

that the individuals in each group have a similar<br />

age distribution. In small studies especially some<br />

chance imbalance will probably occur, which might<br />

complicate the interpretation of results. We can use<br />

stratified randomisation to achieve approximate<br />

balance of important characteristics without sacrificing<br />

the advantages of randomisation. The method is to<br />

produce a separate block randomisation list for each<br />

subgroup (stratum). For example, in a study to<br />

One hundred years ago<br />

Generalisation of salt infusions<br />

The subcutaneous infusion of salt solution has proved of great<br />

benefit in the treatment of collapse after severe operations. The<br />

practice, it may be said, developed from two sources: the new<br />

method of transfusion where water, instead of another person’s<br />

blood, is injected into the patient’s veins; and flushing of the<br />

peritoneum, introduced <strong>by</strong> Lawson Tait. After flushing, much of<br />

the fluid left in the peritoneum is absorbed into the circulation,<br />

greatly to the patient’s advantage. Dr. Clement Penrose has tried<br />

the effect of subcutaneous salt infusions as a last extremity in<br />

severe cases of pneumonia. He continues this treatment with<br />

inhalations of oxygen. He has had experience of three cases, all<br />

considered hopeless, and succeeded in saving one. In the other<br />

two the prolongation of life and the relief of symptoms were so<br />

marked that Dr. Penrose regretted that the treatment had not<br />


compare two alternative treatments for breast cancer it<br />

might be important to stratify <strong>by</strong> menopausal status.<br />

Separate lists of random numbers should then be constructed<br />

for premenopausal and postmenopausal<br />

women. It is essential that stratified treatment<br />

allocation is based on block randomisation within each<br />

stratum rather than simple randomisation; otherwise<br />

there will be no control of balance of treatments within<br />

strata, so the object of stratification will be defeated.<br />
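The following Python sketch (ours, for illustration only; the stratum names, list lengths, and block size are assumptions) shows the mechanics: a separate block randomised list is prepared for each stratum, and the next unused allocation is taken from the list matching the patient's stratum.<br />

```python
# Illustrative sketch of stratified randomisation: a separate block randomised
# list is prepared for each stratum, so treatments stay balanced within strata.
import random

random.seed(3)  # arbitrary seed for reproducibility

def blocked_list(n, block_size=4):
    allocations = []
    while len(allocations) < n:
        block = ["A", "B"] * (block_size // 2)
        random.shuffle(block)
        allocations.extend(block)
    return allocations[:n]

# Assumed strata, e.g. menopausal status in the breast cancer example.
lists = {stratum: blocked_list(20) for stratum in ("premenopausal", "postmenopausal")}
next_index = {stratum: 0 for stratum in lists}

def allocate(stratum):
    """Take the next unused allocation from the list for the patient's stratum."""
    i = next_index[stratum]
    next_index[stratum] += 1
    return lists[stratum][i]

print(allocate("premenopausal"), allocate("postmenopausal"), allocate("premenopausal"))
```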

Stratified randomisation can be extended to two or<br />

more stratifying variables. For example, we might want<br />

to extend the stratification in the breast cancer trial to<br />

tumour size and number of positive nodes. A separate<br />

randomisation list is needed for each combination of<br />

categories. If we had two tumour size groups (say < 4 cm and > 4 cm) and three groups for node involvement (0,<br />

1-4, > 4) as well as menopausal status, then we have<br />

2 × 3 × 2 = 12 strata, which may exceed the limit of what<br />

is practical. Also with multiple strata some of the combinations<br />

of categories may be rare, so the intended<br />

treatment balance is not achieved.<br />

In a multicentre study the patients within each centre<br />

will need to be randomised separately unless there<br />

is a central coordinated randomising service. Thus<br />

“centre” is a stratifying variable, and there may be other<br />

stratifying variables as well.<br />

In small studies it is not practical to stratify on more<br />

than one or perhaps two variables, as the number of<br />

strata can quickly approach the number of subjects.<br />

When it is really important to achieve close similarity<br />

between treatment groups for several variables<br />

minimisation can be used—we discuss this method in a<br />

separate Statistics note. 3<br />

We have described the generation of a random<br />

sequence in some detail so that the principles are clear.<br />

In practice, for many trials the process will be done <strong>by</strong><br />

computer. Suitable software is available at <strong>http</strong>://<br />

<strong>www</strong>.sghms.ac.uk/phs/staff/jmb/jmb.htm.<br />

We shall also consider in a subsequent note the<br />

practicalities of using a random sequence to allocate<br />

treatments to patients.<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Treatment allocation in controlled trials: why randomise?<br />

<strong>BMJ</strong> 1999;318:1209.<br />

2 Altman DG. Practical statistics for medical research. London: Chapman and<br />

Hall, 1990: 540-4.<br />

3 Treasure T, MacRae KD. Minimisation: the platinum standard for trials?<br />

<strong>BMJ</strong> 1998;317:362-3.<br />

been employed earlier. Several physicians have adopted Dr.<br />

Penrose’s method, and with the most gratifying results. The cases<br />

are reported fully in the Bulletin of the Johns Hopkins Hospital for<br />

July last. The infusions of salt solution were administered just as<br />

after an operation. The salt solution, at a little above body<br />

temperature, is poured into a graduated bottle connected <strong>by</strong> a<br />

rubber tube with a needle. The pressure is regulated <strong>by</strong> elevating<br />

the bottle, or <strong>by</strong> means of a rubber bulb with valves; the needle is<br />

introduced into the connective tissue under the breast or under<br />

the integuments of the thighs. There can be no doubt that<br />

subcutaneous saline infusions are increasing in popularity, and<br />

little doubt that their use will be greatly extended in medicine as<br />

well as surgery.<br />

(<strong>BMJ</strong> 1899;ii:933)<br />



J Martin <strong>Bland</strong>, professor of medical statistics, Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE; Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: Professor <strong>Bland</strong>. <strong>BMJ</strong> 2000;320:1468<br />

Statistics <strong>Notes</strong><br />

The odds ratio<br />


J Martin <strong>Bland</strong>, Douglas G Altman<br />

In recent years odds ratios have become widely used in<br />

medical reports—almost certainly some will appear in<br />

today’s <strong>BMJ</strong>. There are three reasons for this. Firstly,<br />

they provide an estimate (with confidence interval) for<br />

the relationship between two binary (“yes or no”) variables.<br />

Secondly, they enable us to examine the effects of<br />

other variables on that relationship, using logistic<br />

regression. Thirdly, they have a special and very<br />

convenient interpretation in case-control studies (dealt<br />

with in a future note).<br />

The odds are a way of representing probability,<br />

especially familiar for betting. For example, the odds<br />

that a single throw of a die will produce a six are 1 to<br />

5, or 1/5. The odds is the ratio of the probability that<br />

the event of interest occurs to the probability that it<br />

does not. This is often estimated <strong>by</strong> the ratio of the<br />

number of times that the event of interest occurs to<br />

the number of times that it does not. The table shows<br />

data from a cross sectional study showing the<br />

prevalence of hay fever and eczema in 11 year old<br />

children. 1 The probability that a child with eczema will<br />

also have hay fever is estimated <strong>by</strong> the proportion<br />

141/561 (25.1%). The odds is estimated <strong>by</strong> 141/420.<br />

Similarly, for children without eczema the probability<br />

of having hay fever is estimated <strong>by</strong> 928/14 453 (6.4%)<br />

and the odds is 928/13 525. We can compare the<br />

groups in several ways: <strong>by</strong> the difference between the<br />

proportions, 141/561 − 928/14 453 = 0.187 (or 18.7<br />

percentage points); the ratio of the proportions, (141/<br />

561)/(928/14 453) = 3.91 (also called the relative<br />

risk); or the odds ratio, (141/420)/(928/<br />

13 525) = 4.89.<br />

Association between hay fever and eczema in 11 year old children 1<br />
Eczema | Hay fever: Yes | Hay fever: No | Total<br />
Yes | 141 | 420 | 561<br />
No | 928 | 13 525 | 14 453<br />
Total | 1069 | 13 945 | 15 522<br />
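For reference, the three comparisons can be written compactly as below (a standard formulation we have added for clarity; p1 and p2 denote the proportions with hay fever among children with and without eczema).<br />

```latex
% p1 and p2 (our notation) are the proportions with hay fever among children
% with and without eczema respectively.
\[
\text{difference} = p_1 - p_2, \qquad
\text{relative risk} = \frac{p_1}{p_2}, \qquad
\text{odds ratio} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{141/420}{928/13\,525} = 4.89 .
\]
```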

Now, suppose we look at the table the other way<br />

round, and ask what is the probability that a child with<br />

hay fever will also have eczema? The proportion is<br />

141/1069 (13.2%) and the odds is 141/928. For a<br />

child without hay fever, the proportion with eczema is<br />

420/13 945 (3.0%) and the odds is 420/13 525. Comparing<br />

the proportions this way, the difference is 141/<br />

1069 − 420/13 945 = 0.102 (or 10.2 percentage<br />

points); the ratio (relative risk) is (141/1069)/(420/<br />

13 945) = 4.38; and the odds ratio is (141/928)/(420/<br />

13 525) = 4.89. The odds ratio is the same whichever<br />

way round we look at the table, but the difference and<br />

ratio of proportions are not. It is easy to see why this is.<br />

The two odds ratios are (141/420)/(928/13 525) and<br />
(141/928)/(420/13 525), which can both be rearranged to give<br />
(141 × 13 525)/(420 × 928).<br />

If we switch the order of the categories in the rows and<br />

the columns, we get the same odds ratio. If we switch<br />

the order for the rows only or for the columns only, we<br />

get the reciprocal of the odds ratio, 1/4.89 = 0.204.<br />

These properties make the odds ratio a useful<br />

indicator of the strength of the relationship.<br />

The sample odds ratio is limited at the lower end,<br />

since it cannot be negative, but not at the upper end,<br />

and so has a skew distribution. The log odds ratio, 2<br />

however, can take any value and has an approximately<br />

Normal distribution. It also has the useful property that<br />

if we reverse the order of the categories for one of the<br />

variables, we simply reverse the sign of the log odds<br />

ratio: log(4.89) = 1.59, log(0.204) = − 1.59.<br />

We can calculate a standard error for the log odds<br />

ratio and hence a confidence interval. The standard<br />

error of the log odds ratio is estimated simply <strong>by</strong> the<br />

square root of the sum of the reciprocals of the four<br />

frequencies. For the example, the standard error is √(1/141 + 1/420 + 1/928 + 1/13 525) = 0.103.<br />

A 95% confidence interval for the log odds ratio is<br />

obtained as 1.96 standard errors on either side of the<br />

estimate. For the example, the log odds ratio is<br />

log e(4.89) = 1.588 and the confidence interval is<br />

1.588±1.96×0.103, which gives 1.386 to 1.790. We can<br />

antilog these limits to give a 95% confidence interval<br />

for the odds ratio itself, 2 as exp(1.386) = 4.00 to<br />

exp(1.790) = 5.99. The observed odds ratio, 4.89, is not<br />

in the centre of the confidence interval because of the<br />

asymmetrical nature of the odds ratio scale. For this<br />

reason, in graphs odds ratios are often plotted using a<br />

logarithmic scale. The odds ratio is 1 when there is no<br />

relationship. We can test the null hypothesis that the<br />

odds ratio is 1 <strong>by</strong> the usual χ² test for a two <strong>by</strong> two<br />

table.<br />
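The whole calculation is easy to reproduce; the short Python sketch below (our addition, not part of the original note) recovers the odds ratio, its logarithm, the standard error, and the 95% confidence interval from the four cell counts in the table.<br />

```python
# Reproduces the worked example: odds ratio and 95% confidence interval from
# the four cell counts of the two by two table (hay fever by eczema).
from math import exp, log, sqrt

a, b = 141, 420        # eczema: hay fever yes / no
c, d = 928, 13_525     # no eczema: hay fever yes / no

odds_ratio = (a / b) / (c / d)             # 4.89
log_or = log(odds_ratio)                   # 1.588
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)    # 0.103

lower = exp(log_or - 1.96 * se_log_or)     # 4.00
upper = exp(log_or + 1.96 * se_log_or)     # 5.99

print(f"odds ratio {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```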

Despite their usefulness, odds ratios can cause difficulties<br />

in interpretation. 3 We shall review this debate<br />

and also discuss odds ratios in logistic regression and<br />

case-control studies in future Statistics <strong>Notes</strong>.<br />

We thank Barbara Butland for providing the data.<br />

1 Strachan DP, Butland BK, Anderson HR. Incidence and prognosis of<br />

asthma and wheezing illness from early childhood to age 33 in a national<br />

British cohort. <strong>BMJ</strong>. 1996;312:1195-9.<br />

2 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Transforming data. <strong>BMJ</strong> 1996;312:770.<br />

3 Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence-Based<br />

Med 1996;1:164-6.<br />



Simon J Day, manager, clinical biometrics, Leo Pharmaceuticals, Princes Risborough, Buckinghamshire HP27 9RR; Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: S J Day. <strong>BMJ</strong> 2000;321:504<br />

Statistics <strong>Notes</strong><br />

Blinding in clinical trials and other studies<br />

Simon J Day, Douglas G Altman<br />


Human behaviour is influenced <strong>by</strong> what we know or<br />

believe. In research there is a particular risk of expectation<br />

influencing findings, most obviously when there is<br />

some subjectivity in assessment, leading to biased<br />

results. Blinding (sometimes called masking) is used to<br />

try to eliminate such bias.<br />

It is a tenet of randomised controlled trials that the<br />

treatment allocation for each patient is not revealed<br />

until the patient has irrevocably been entered into the<br />

trial, to avoid selection bias. This sort of blinding, better<br />

referred to as allocation concealment, will be discussed<br />

in a future statistics note. In controlled trials the term<br />

blinding, and in particular “double blind,” usually<br />

refers to keeping study participants, those involved<br />

with their management, and those collecting and analysing<br />

clinical data unaware of the assigned treatment,<br />

so that they should not be influenced <strong>by</strong> that<br />

knowledge.<br />

The relevance of blinding will vary according to<br />

circumstances. Blinding patients to the treatment they<br />

have received in a controlled trial is particularly important<br />

when the response criteria are subjective, such as<br />

alleviation of pain, but less important for objective criteria,<br />

such as death. Similarly, medical staff caring for<br />

patients in a randomised trial should be blinded to<br />

treatment allocation to minimise possible bias in<br />

patient management and in assessing disease status.<br />

For example, the decision to withdraw a patient from a<br />

study or to adjust the dose of medication could easily<br />

be influenced <strong>by</strong> knowledge of which treatment group<br />

the patient has been assigned to.<br />

In a double blind trial neither the patient nor the<br />

caregivers are aware of the treatment assignment.<br />

Blinding means more than just keeping the name of<br />

the treatment hidden. Patients may well see the<br />

treatment being given to patients in the other<br />

treatment group(s), and the appearance of the drug<br />

used in the study could give a clue to its identity. Differences<br />

in taste, smell, or mode of delivery may also<br />

influence efficacy, so these aspects should be identical<br />

for each treatment group. Even colour of medication<br />

has been shown to influence efficacy. 1<br />

In studies comparing two active compounds, blinding<br />

is possible using the “double dummy” method. For<br />

example, if we want to compare two medicines, one<br />

presented as green tablets and one as pink capsules, we<br />

could also supply green placebo tablets and pink<br />

placebo capsules so that both groups of patients would<br />

take one green tablet and one pink capsule.<br />

Blinding is certainly not always easy or possible.<br />

Single blind trials (where either only the investigator or<br />

only the patient is blind to the allocation) are<br />

sometimes unavoidable, as are open (non-blind) trials.<br />

In trials of different styles of patient management,<br />

surgical procedures, or alternative therapies, full blinding<br />

is often impossible.<br />

In a double blind trial it is implicit that the<br />

assessment of patient outcome is done in ignorance of<br />

the treatment received. Such blind assessment of<br />

outcome can often also be achieved in trials which are<br />

open (non-blinded). For example, lesions can be<br />

photographed before and after treatment and assessed<br />

<strong>by</strong> someone not involved in running the trial. Indeed,<br />

blind assessment of outcome may be more important<br />

than blinding the administration of the treatment,<br />

especially when the outcome measure involves subjectivity.<br />

Despite the best intentions, some treatments have<br />

unintended effects that are so specific that their occurrence<br />

will inevitably identify the treatment received to<br />

both the patient and the medical staff. Blind<br />

assessment of outcome is especially useful when this is<br />

a risk.<br />

In epidemiological studies it is preferable that the<br />

identification of “cases” as opposed to “controls” be<br />

kept secret while researchers are determining each<br />

subject’s exposure to potential risk factors. In many<br />

such studies blinding is impossible because exposure<br />

can be discovered only <strong>by</strong> interviewing the study<br />

participants, who obviously know whether or not they<br />

are a case. The risk of differential recall of important<br />

disease related events between cases and controls must<br />

then be recognised and if possible investigated. 2 As a<br />

minimum the sensitivity of the results to differential<br />

recall should be considered. Blinded assessment of<br />

patient outcome may also be valuable in other<br />

epidemiological studies, such as cohort studies.<br />

Blinding is important in other types of research<br />

too. For example, in studies to evaluate the performance<br />

of a diagnostic test those performing the test must<br />

be unaware of the true diagnosis. In studies to evaluate<br />

the reproducibility of a measurement technique the<br />

observers must be unaware of their previous measurement(s)<br />

on the same individual.<br />

We have emphasised the risks of bias if adequate<br />

blinding is not used. This may seem to be challenging<br />

the integrity of researchers and patients, but bias associated<br />

with knowing the treatment is often subconscious.<br />

On average, randomised trials that have not<br />

used appropriate levels of blinding show larger<br />

treatment effects than blinded studies. 3 Similarly, diagnostic<br />

test performance is overestimated when the reference<br />

test is interpreted with knowledge of the test<br />

result. 4 Blinding makes it difficult to bias results<br />

intentionally or unintentionally and so helps ensure<br />

the credibility of study conclusions.<br />

1 De Craen A<strong>JM</strong>, Roos PJ, de Vries AL, Kleijnen J. Effect of colour of drugs:<br />

systematic review of perceived effect of drugs and their effectiveness. <strong>BMJ</strong><br />

1996;313:1624-6.<br />

2 Barry D. Differential recall bias and spurious associations in case/control<br />

studies. Stat Med 1996;15:2603-16.<br />

3 Schulz KF, Chalmers I, Hayes R, Altman DG. Empirical evidence of bias:<br />

dimensions of methodological quality associated with estimates of treatment<br />

effects in controlled trials. JAMA 1995;273:408-12.<br />

4 Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der<br />

Meulen JH, et al. Empirical evidence of design-related bias in studies of<br />

diagnostic tests. JAMA 1999;282:1061-6.<br />



Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF; Kenneth F Schulz, vice president, Quantitative Sciences, Family Health International, PO Box 13950, Research Triangle Park, NC 27709, USA. Correspondence to: D G Altman. <strong>BMJ</strong> 2001;323:446–7<br />


Irrationality, the market, and quality of care<br />

Consider the irrationality of a person who pays extra<br />

so as not to share a hotel room with a colleague while<br />

on a business trip. He does this because he values<br />

privacy but he also scoffs at taking out long term care<br />

insurance to guarantee a private room in a nursing<br />

home. Why is he willing to risk sharing a room for the<br />

rest of his life with a person he does not like? This<br />

common irrationality is often masked <strong>by</strong><br />

rationalisations such as “I would rather die than have<br />

to live in a nursing home.” Yet we know that when the<br />

time comes most prefer the limited pleasures of life in<br />

a nursing home to suicide<br />

their feet. There are even more fundamental reasons<br />

why depending on the rationality of the market will<br />

never work well for quality of care (box). Sensible<br />

policy for providing nursing home care requires a<br />

larger welfare state, a larger regulatory state, and<br />

encouragement of public, non-profit providers.<br />

Australia’s recent experience shows that to head in the<br />

opposite direction is medically, economically, and<br />

politically irrational.<br />

Competing interests: None declared.<br />


Statistics <strong>Notes</strong><br />

Concealing treatment allocation in randomised trials<br />

Douglas G Altman, Kenneth F Schulz<br />

We have previously explained why random allocation<br />

of treatments is a required design feature of controlled<br />

trials 1 and explained how to generate a random allocation<br />

sequence. 2 Here we consider the importance of<br />

concealing the treatment allocation until the patient is<br />

entered into the trial.<br />

Regardless of how the allocation sequence has<br />

been generated—such as <strong>by</strong> simple or stratified<br />

randomisation 2 —there will be a prespecified sequence<br />

of treatment allocations. In principle, therefore, it is<br />

possible to know what treatment the next patient will<br />

get at the time when a decision is taken to consider the<br />

patient for entry into the trial.<br />

The strength of the randomised trial is based on<br />

aspects of design which eliminate various types of bias.<br />

Randomisation of patients to treatment groups<br />

eliminates bias <strong>by</strong> making the characteristics of the<br />

patients in two (or more) groups the same on average,<br />

and stratification with blocking may help to reduce<br />

chance imbalance in a particular trial. 2 All this good<br />

work can be undone if a poor procedure is adopted to<br />

implement the allocation sequence. In any trial one or<br />

more people must determine whether each patient is<br />

eligible for the trial, decide whether to invite the<br />

patient to participate, explain the aims of the trial and<br />

the details of the treatments, and, if the patient agrees<br />

to participate, determine what treatment he or she will<br />

receive.<br />

Suppose it is clear which treatment a patient will<br />

receive if he or she enters the trial (perhaps because<br />

there is a typed list showing the allocation sequence).<br />

Each of the above steps may then be compromised<br />

because of conscious or subconscious bias. Even when<br />

the sequence is not easily available, there is strong<br />

anecdotal evidence of frequent attempts to discover<br />

the sequence through a combination of a misplaced<br />

belief that this will be beneficial to patients and lack of<br />

understanding of the rationale of randomisation. 3<br />

How can the allocation sequence be concealed?<br />

Firstly, the person who generates the allocation<br />

sequence should not be the person who determines<br />

eligibility and entry of patients. Secondly, if possible the<br />

mechanism for treatment allocation should use people<br />

not involved in the trial. A common procedure,<br />

especially in larger trials, is to use a central telephone<br />

randomisation system. Here patient details are<br />

supplied, eligibility confirmed, and the patient entered<br />

into the trial before the treatment allocation is divulged<br />

(and it may still be blinded 4 ). Another excellent allocation<br />

concealment mechanism, common in drug trials,<br />

is to get the allocation done <strong>by</strong> a pharmacy. The interventions<br />

are sealed in serially numbered containers<br />

(usually bottles) of equal appearance and weight<br />

according to the allocation sequence.<br />

If external help is not available the only other<br />

system that provides a plausible defence against allocation<br />

bias is to enclose assignments in serially<br />

numbered, opaque, sealed envelopes. Apart from<br />

neglecting to mention opacity, this is the method used<br />

in the famous 1948 streptomycin trial (see box). This<br />



Description of treatment allocation in the MRC<br />

streptomycin trial 5<br />

“Determination of whether a patient would be treated<br />

<strong>by</strong> streptomycin and bed-rest (S case) or <strong>by</strong> bed-rest<br />

alone (C case) was made <strong>by</strong> reference to a statistical<br />

series based on random sampling numbers drawn up<br />

for each sex at each centre <strong>by</strong> Professor Bradford Hill;<br />

the details of the series were unknown to any of the<br />

investigators or to the co-ordinator and were<br />

contained in a set of sealed envelopes, each bearing on<br />

the outside only the name of the hospital and a<br />

number. After acceptance of a patient <strong>by</strong> the panel,<br />

and before admission to the streptomycin centre, the<br />

appropriate numbered envelope was opened at the<br />

central office; the card inside told if the patient was to<br />

be an S or a C case, and this information was then<br />

given to the medical officer of the centre.”<br />

method is not immune to corruption, 3 particularly if<br />

poorly executed. However, with care, it can be a good<br />

mechanism for concealing allocation. We recommend<br />

that investigators ensure that the envelopes are opened<br />

sequentially, and only after the participant’s name and<br />

other details are written on the appropriate envelope. 3<br />

If possible, that information should also be transferred<br />

to the assigned allocation <strong>by</strong> using pressure sensitive<br />

paper or carbon paper inside the envelope. If an investigator<br />

cannot use numbered containers, envelopes<br />

represent the best available allocation concealment<br />

mechanism without involving outside parties, and may<br />

sometimes be the only feasible option. We suspect,<br />

however, that in years to come we will see greater use of<br />

external “third party” randomisation.<br />

The public health benefits of mobile phones<br />

The bread and butter of public health on call is<br />

identifying contacts in the case of suspected<br />

meningococcal disease. On the whole this is<br />

straightforward but can occasionally cause difficulties.<br />

Most areas that I have worked in include several<br />

universities, and during October it is common to<br />

experience the problem of contact tracing in the<br />

student population.<br />

There are two main problems. The first is how to<br />

define household contacts when the index patient lives<br />

in a hall of residence containing several hundred<br />

students. Finding the appropriate university protocol<br />

and not being too concerned about the different<br />

approaches adopted <strong>by</strong> neighbouring universities can<br />

reduce the number of sleepless nights. The second<br />

problem is harder. “Close kissing contacts” among 18<br />

year olds who have been set free from parental control<br />

for the first time is a minefield. My experience suggests<br />

that it is best to assume there will be lots and that names<br />

and contact details will not necessarily have been<br />

obtained. By the end of a weekend on call, you will feel<br />

like a cross between a detective and an “agony aunt.”<br />

One year I volunteered to cover Christmas weekend<br />

in the belief that at least the students would be gone <strong>by</strong><br />

then. I could not have been more mistaken. To add a<br />

further difficulty, the index patient presented to<br />

hospital on the night of the last day of term, and all<br />

contacts had already set off to the far reaches of the<br />

country. I could not believe my luck when the friend<br />


The desirability of concealing the allocation was<br />

recognised in the streptomycin trial 5 (see box). Yet the<br />

importance of this key element of a randomised trial<br />

has not been widely recognised. Empirical evidence of<br />

the bias associated with failure to conceal the<br />

allocation 6 7 and explicit requirement to discuss this<br />

issue in the CONSORT statement 8 seem to be leading<br />

to wider recognition that allocation concealment is an<br />

essential aspect of a randomised trial.<br />

Allocation concealment is completely different<br />

from (double) blinding. 4 It is possible to conceal the<br />

randomisation in every randomised trial. Also,<br />

allocation concealment seeks to eliminate selection<br />

bias (who gets into the trial and the treatment they are<br />

assigned). By contrast, blinding relates to what happens<br />

after randomisation, is not possible in all trials, and<br />

seeks to reduce ascertainment bias (assessment of<br />

outcome).<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Treatment allocation in controlled trials: why randomise?<br />

<strong>BMJ</strong> 1999;318:1209.<br />

2 Altman DG, <strong>Bland</strong> <strong>JM</strong>. How to randomise. <strong>BMJ</strong> 1999;319:703-4.<br />

3 Schulz KF. Subverting randomization in controlled trials. JAMA<br />

1995;274:1456-8.<br />

4 Day SJ, Altman DG. Blinding in clinical trials and other studies. <strong>BMJ</strong><br />

2000;321:504.<br />

5 Medical Research Council. Streptomycin treatment of pulmonary<br />

tuberculosis: a Medical Research Council investigation. <strong>BMJ</strong> 1948;2:<br />

769-82.<br />

6 Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does<br />

quality of reports of randomised trials affect estimates of intervention<br />

efficacy reported in meta-analyses. Lancet 1998;352:609-13.<br />

7 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias:<br />

dimensions of methodological quality associated with estimates of treatment<br />

effects in controlled trials. JAMA 1995;273:408-12.<br />

8 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving<br />

the quality of reporting of randomized controlled trials: the<br />

CONSORT statement. JAMA 1996;276:637-9.<br />

accompanying the patient produced both their mobile<br />

phones and confidently reassured me that between the<br />

two of them they would have the mobile numbers of<br />

all 15 “household” contacts. She was right, and in just<br />

over two hours all of them had been contacted.<br />

There has been much coverage in the medical and<br />

popular press about the potential health hazards of<br />

mobile phones, and if these fears are realised the 100%<br />

ownership among this small sample of students is<br />

worrying. However, in terms of contact tracing for<br />

suspected meningococcal disease, mobile phones have<br />

potential health benefits not just for their owners but<br />

also for the mental health of public health doctors. Of<br />

course, this may not solve the “close kissing contact”<br />

problem.<br />

Debbie Lawlor senior lecturer in epidemiology and public<br />

health, University of Bristol<br />



Statistics <strong>Notes</strong><br />

Analysing controlled trials with baseline and follow up<br />

measurements<br />

Andrew J Vickers, Douglas G Altman<br />

In many randomised trials researchers measure a continuous<br />

variable at baseline and again as an outcome<br />

assessed at follow up. Baseline measurements are common<br />

in trials of chronic conditions where researchers<br />

want to see whether a treatment can reduce<br />

pre-existing levels of pain, anxiety, hypertension, and<br />

the like.<br />

<strong>Statistical</strong> comparisons in such trials can be made<br />

in several ways. Comparison of follow up (posttreatment)<br />

scores will give a result such as “at the end<br />

of the trial, mean pain scores were 15 mm (95% confidence<br />

interval 10 to 20 mm) lower in the treatment<br />

group.” Alternatively a change score can be calculated<br />

<strong>by</strong> subtracting the follow up score from the baseline<br />

score, leading to a statement such as “pain reductions<br />

were 20 mm (16 to 24 mm) greater on treatment than<br />

control.” If the average baseline scores are the same in<br />

each group the estimated treatment effect will be the<br />

same using these two simple approaches. If the<br />

treatment is effective the statistical significance of the<br />

treatment effect <strong>by</strong> the two methods will depend on the<br />

correlation between baseline and follow up scores. If<br />

the correlation is low using the change score will add<br />

variation and the follow up score is more likely to show<br />

a significant result. Conversely, if the correlation is high<br />

using only the follow up score will lose information<br />

and the change score is more likely to be significant. It<br />

is incorrect, however, to choose whichever analysis<br />

gives a more significant finding. The method of analysis<br />

should be specified in the trial protocol.<br />

Some use change scores to take account of chance<br />

imbalances at baseline between the treatment groups.<br />

However, analysing change does not control for<br />

baseline imbalance because of regression to the<br />

mean 12 : baseline values are negatively correlated with<br />

change because patients with low scores at baseline<br />

generally improve more than those with high scores. A<br />

better approach is to use analysis of covariance<br />

(ANCOVA), which, despite its name, is a regression<br />

method. 3 In effect two parallel straight lines (linear<br />

regression) are obtained relating outcome score to<br />

baseline score in each group. They can be summarised<br />

as a single regression equation:<br />

follow up score =<br />

constant + a×baseline score + b×group<br />


where a and b are estimated coefficients and group<br />

is a binary variable coded 1 for treatment and 0 for<br />

control. The coefficient b is the effect of interest—the<br />

estimated difference between the two treatment<br />

groups. In effect an analysis of covariance adjusts each<br />

patient’s follow up score for his or her baseline score,<br />

but has the advantage of being unaffected <strong>by</strong> baseline<br />

differences. If, <strong>by</strong> chance, baseline scores are worse in<br />

the treatment group, the treatment effect will be<br />

underestimated <strong>by</strong> a follow up score analysis and overestimated<br />

<strong>by</strong> looking at change scores (because of<br />

regression to the mean). By contrast, analysis of covariance<br />

gives the same answer whether or not there is<br />

baseline imbalance.<br />

As an illustration, Kleinhenz et al randomised 52<br />

patients with shoulder pain to either true or sham acupuncture. 4<br />
Patients were assessed before and after<br />

treatment using a 100 point rating scale of pain and<br />

function, with lower scores indicating poorer outcome.<br />

There was an imbalance between groups at baseline,<br />

with better scores in the acupuncture group (see table).<br />

Analysis of post-treatment scores is therefore biased.<br />

The authors analysed change scores, but as baseline<br />

and change scores are negatively correlated (about<br />

r= −0.25 within groups) this analysis underestimates<br />

the effect of acupuncture. From analysis of covariance<br />

we get:<br />

follow up score =<br />

24 + 0.71×baseline score + 12.7×group<br />

(see figure). The coefficient for group (b) has a useful<br />

interpretation: it is the difference between the mean<br />

change scores of each group. In the above example it<br />

can be interpreted as “pain and function score<br />

improved <strong>by</strong> an estimated 12.7 points more on average<br />

in the treatment group than in the control group.” A<br />

95% confidence interval and P value can also be calculated<br />

for b (see table). 5 The regression equation<br />

provides a means of prediction: a patient with a<br />

baseline score of 50, for example, would be predicted<br />

to have a follow up score of 72.2 on treatment and 59.5<br />

on control.<br />
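As a sketch of how such an analysis of covariance can be run as a regression, the Python fragment below fits the model by ordinary least squares. It is our illustration only: the baseline, follow up, and group arrays are invented placeholder values, not the Kleinhenz data.<br />

```python
# Illustrative sketch of analysis of covariance as a regression:
#   follow_up = constant + a*baseline + b*group
# where group is 0 (control) or 1 (treatment) and b estimates the treatment
# effect. The arrays are made-up placeholder values, NOT the trial data.
import numpy as np

baseline  = np.array([55.0, 48.0, 62.0, 51.0, 67.0, 59.0, 45.0, 70.0])
follow_up = np.array([60.0, 55.0, 80.0, 58.0, 85.0, 78.0, 50.0, 90.0])
group     = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Design matrix with a column of ones for the constant term.
X = np.column_stack([np.ones_like(baseline), baseline, group])
coef, *_ = np.linalg.lstsq(X, follow_up, rcond=None)
constant, a, b = coef

print(f"follow up score = {constant:.1f} + {a:.2f} x baseline + {b:.1f} x group")
print(f"predicted at baseline 50: control {constant + a*50:.1f}, treatment {constant + a*50 + b:.1f}")
```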

An additional advantage of analysis of covariance is<br />

that it generally has greater statistical power to detect a<br />

treatment effect than the other methods. 6 For example,<br />

a trial with a correlation between baseline and follow<br />

Results of trial of acupuncture for shoulder pain 4 (pain scores, mean (SD))<br />
Analysis | Placebo group (n=27) | Acupuncture group (n=25) | Difference between means (95% CI) | P value<br />
Baseline | 53.9 (14) | 60.4 (12.3) | 6.5 | <br />
Follow up | 62.3 (17.9) | 79.6 (17.1) | 17.3 (7.5 to 27.1) | 0.0008<br />
Change score* | 8.4 (14.6) | 19.2 (16.1) | 10.8 (2.3 to 19.4) | 0.014<br />
ANCOVA | | | 12.7 (4.1 to 21.3) | 0.005<br />
*Analysis reported <strong>by</strong> authors.<br />


Andrew J Vickers, assistant attending research methodologist, Integrative Medicine Service, Biostatistics Service, Memorial Sloan-Kettering Cancer Center, New York, New York 10021, USA; Douglas G Altman, professor of statistics in medicine, ICRF Medical Statistics Group, Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF. Correspondence to: Dr Vickers, vickersa@mskcc.org. <strong>BMJ</strong> 2001;323:1123–4<br />


[Figure: post-treatment score plotted against pretreatment score for the acupuncture and placebo groups, with fitted lines. Squares show mean values for the two groups. The estimated difference between the groups from analysis of covariance is the vertical distance between the two lines]<br />

up scores of 0.6 that required 85 patients for analysis of<br />

follow up scores, would require 68 for a change score<br />

analysis but only 54 for analysis of covariance.<br />
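Those sample sizes follow from the relative variances of the three estimators; the expressions below are standard results we have added for clarity (assuming equal variances at baseline and follow up and correlation ρ between them), not part of the original note.<br />

```latex
% Relative variance of the estimated treatment effect under the three analyses,
% assuming equal variances at baseline and follow up and correlation rho.
\[
\mathrm{Var}_{\text{follow up}} \propto 1, \qquad
\mathrm{Var}_{\text{change score}} \propto 2(1-\rho), \qquad
\mathrm{Var}_{\text{ANCOVA}} \propto 1-\rho^{2}.
\]
% With rho = 0.6 the required sample sizes scale in the same ratios:
% 85 x 2(1 - 0.6) = 68 and 85 x (1 - 0.6^2) = 54.4, i.e. about 54.
```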

The efficiency gains of analysis of covariance compared<br />

with a change score are low when there is a high<br />

correlation (say r > 0.8) between baseline and follow up<br />

measurements. This will often be the case, particularly<br />

in stable chronic conditions such as obesity. In these<br />

A memorable patient<br />

Informed consent<br />


I first met Ivy three years ago when she came for her<br />

29th oesophageal dilatation. She was an 86 year old<br />

spinster, deaf without speech from childhood, and the<br />

only sign language she knew was thumbs up, which she<br />

would use for saying good morning or for showing<br />

happiness. She had no next of kin and had lived in a<br />

residential home for the past 50 years. She developed a<br />

benign oesophageal stricture in 1992 and came to the<br />

endoscopy unit for repeated dilatations. The carers in<br />

the residential home used to say that she enjoyed her<br />

“days out” at the endoscopy unit.<br />

We would explain the procedure to her in sign<br />

language. She would use the thumbs up sign and make<br />

a cross on the dotted line on the consent form. She<br />

would enter the endoscopy room smiling, put her left<br />

arm out to be cannulated, turn to her left side for<br />

endoscopy, and when fully awake would show her<br />

thumbs up again. Every time after her dilatation the<br />

nursing staff would question why an expandable<br />

oesophageal stent was not being considered. We would<br />

conclude that the indications for an expandable stent<br />

in benign strictures are not well established.<br />

Her need for dilatation was becoming more<br />

frequent, and so on her 46th dilatation we decided to<br />

refer her to our regional centre for the insertion of a<br />

stent. She had an expandable stent inserted, and in his<br />

report the endoscopist mentioned the risk of the stent<br />

migrating down in the stomach beyond the stricture.<br />

Six weeks later she developed a bolus obstruction. At<br />

endoscopy it was noted that the stent had indeed<br />

migrated down. She consented to another stent. Four<br />

weeks later she had another bolus obstruction that<br />

could not be completely removed at the first attempt,<br />

situations, analysis of change scores can be a<br />

reasonable alternative, particularly if restricted<br />

randomisation is used to ensure baseline comparability<br />

between groups. 7 Analysis of covariance is the<br />

preferred general approach, however.<br />

As with all analyses of continuous data, the use of<br />

analysis of covariance depends on some assumptions<br />

that need to be tested. In particular, data transformation,<br />

such as taking logarithms, may be indicated. 8<br />

Lastly, analysis of covariance is a type of multiple<br />

regression and can be seen as a special type of adjusted<br />

analysis. The analysis can thus be expanded to include<br />

additional prognostic variables (not necessarily continuous),<br />

such as age and diagnostic group.<br />

We thank Dr J Kleinhenz for supplying the raw data from his<br />

study.<br />

1 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Regression towards the mean. <strong>BMJ</strong><br />

1994;308:1499.<br />

2 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Some examples of regression towards the mean.<br />

<strong>BMJ</strong> 1994;309:780.<br />

3 Senn S. Baseline comparisons in randomized clinical trials. Stat Med<br />

1991;10:1157-9.<br />

4 Kleinhenz J, Streitberger K, Windeler J, Gussbacher A, Mavridis G, Martin<br />

E. Randomised clinical trial comparing the effects of acupuncture and<br />

a newly designed placebo needle in rotator cuff tendonitis. Pain<br />

1999;83:235-41.<br />

5 Altman DG, Gardner MJ. Regression and correlation. In: Altman DG,<br />

Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed.<br />

London: <strong>BMJ</strong> Books, 2000:73-92.<br />

6 Vickers AJ. The use of percentage change from baseline as an outcome in<br />

a controlled trial is statistically inefficient: a simulation study. BMC Med<br />

Res Methodol 2001;1:16.<br />

7 Altman DG, <strong>Bland</strong> <strong>JM</strong>. How to randomise. <strong>BMJ</strong> 1999;319:703-4.<br />

8 <strong>Bland</strong> <strong>JM</strong>, Altman DG. The use of transformation when comparing two<br />

means. <strong>BMJ</strong> 1996;312:1153.<br />

and she was brought back the following day for<br />

removal of the bolus <strong>by</strong> endoscopy.<br />

She came to the endoscopy room but did not have<br />

her familiar smile. She looked around for a minute, got<br />

off her trolley, and walked out. Everyone in the<br />

endoscopy room understood that she was trying to say,<br />

“I’ve had enough.”<br />

She did not come back for a repeat endoscopy, and<br />

she stayed nil <strong>by</strong> mouth on intravenous fluids. Two<br />

weeks later she died of an aspiration pneumonia. We<br />

think she understood all the procedures she had<br />

agreed to. We also think it was informed consent. I<br />

hope we were right. She gave us a very clear message<br />

without saying a word on her last visit to the<br />

endoscopy room.<br />

Do we really understand what aphasic patients are<br />

trying to tell us when we get informed consent for<br />

invasive procedures? We should try to read the<br />

non-verbal messages very carefully.<br />

I Tiwari associate specialist in gastroenterology, Broomfield<br />

Hospital, Chelmsford<br />

We welcome articles of up to 600 words on topics<br />

such as A memorable patient, A paper that changed my<br />

practice, My most unfortunate mistake, or any other piece<br />

conveying instruction, pathos, or humour. If possible<br />

the article should be supplied on a disk. Permission is<br />

needed from the patient or a relative if an identifiable<br />

patient is referred to. We also welcome contributions<br />

for “Endpieces,” consisting of quotations of up to 80<br />

words (but most are considerably shorter) from any<br />

source, ancient or modern, which have appealed to<br />

the reader.<br />



Papers p569

Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Correspondence to: Professor Bland, mbland@sghms.ac.uk

BMJ 2002;324:606–7


The quality and reliability of health information on<br />

the internet remains of paramount concern in Europe,<br />

as elsewhere. Self regulatory codes of ethics for health<br />

websites abound, yet the quality and practices of many<br />

are highly questionable.<br />

Little progress seems to have been made,<br />

moreover, in assuring consumers that the information<br />

they share with health websites will not be misused.<br />

Several US studies have already concluded that<br />

websites’ privacy practices do not match their<br />

proclaimed policies. 5 In an attempt to counter this erosion<br />

of trust in Europe, the European Commission’s<br />

guidelines for quality criteria for health related<br />

websites have recognised that there is no shortage of<br />

legislation in the field of privacy and security. 6 They<br />

have drawn specific attention to a new recommendation<br />

regarding online data collection adopted in<br />

May 2001 that explains how European directives on<br />

issues such as data protection should be applied to the<br />

most common processing tasks carried out via the<br />

internet. 7<br />

The challenge facing Europe’s health professionals<br />

and policymakers is to carefully craft the development<br />

of new approaches to the supervision of medical and<br />

pharmaceutical practice. Their ultimate goal is to raise<br />

Statistics <strong>Notes</strong><br />

Validating scales and indexes<br />

J Martin <strong>Bland</strong>, Douglas G Altman<br />

An index of quality is a measurement like any other,<br />

whether it is assessing a website, as in today’s <strong>BMJ</strong>, 1 a<br />

clinical trial used in a meta-analysis, 2 or the quality of life experienced by a patient. 3 As with all measurements,

we have to decide whether it measures what we<br />

want it to measure, and how well.<br />

The simplest measurements, such as length and<br />

distance, can be validated <strong>by</strong> an objective criterion. The<br />

earliest criteria must have been biological: the length of<br />

a pace, a foot, a thumb. The obvious problem, that the<br />

criterion varies from person to person, was eventually<br />

solved <strong>by</strong> establishing a fundamental unit and defining<br />

all others in terms of it. Other measurements can then<br />

be defined in terms of a fundamental unit. To define a<br />

unit of weight we find a handy substance which<br />

appears the same everywhere, such as water. The unit<br />

of weight is then the weight of a volume of water specified<br />

in the basic unit of length, such as 100 cubic centimetres.<br />

Such measurements have criterion validity,<br />

meaning that we can take some known quantity and<br />

compare our measurement with it.<br />

For some measurements no such standard is possible.<br />

Cardiac stroke volume, for example, can be<br />

measured only indirectly. Direct measurement, <strong>by</strong><br />

collecting all the blood pumped out of the heart over a<br />

series of beats, would involve rather drastic interference<br />

with the system. Our criterion becomes agreement<br />

with another indirect measurement. Indeed, we sometimes<br />

have to use as a standard a method which we<br />

know produces inaccurate measurements.<br />

consumers’ confidence in online healthcare. They must<br />

ensure that the mechanisms are put in place where<strong>by</strong><br />

health professionals themselves can benefit from using<br />

the internet, while still ensuring the highest standards<br />

of medical practice.<br />

Avienda was formerly known as the Centre for Law Ethics and<br />

Risk in Telemedicine.<br />

Competing interests: None declared.<br />

1 <strong>http</strong>://news.bbc.co.uk/hi/english/uk/england/newsid_1752000/<br />

1752670.stm (accessed 5 Feb 2002).<br />

2 Case C-322/01: Reference for a preliminary ruling <strong>by</strong> the Landgericht<br />

Frankfurt am Main <strong>by</strong> order of that court of 10 August 2001 in the case<br />

of Deutscher Apothekerverband e.V. against DocMorris NV and Jacques<br />

Waterval. Official Journal of the European Communities No C 2001<br />

December 8:348/10.<br />

3 Council Directive 1992/28/EEC of 31 March 1992 on the advertising of<br />

medicinal products for human use. (Articles 1(3) and 3(1).) Official Journal<br />

of the European Communities No L 1995 11 February:32/26.<br />

4 Directive 2000/31/EC on mutual recognition of primary medical and<br />

specialist medical qualifications and minimum standards of training. Official<br />

Journal of the European Communities No L 2001 July 31:206/1-51.<br />

5 Schwartz J. Medical websites faulted on privacy. Washington Post 2000<br />

February 1.<br />

6 <strong>http</strong>://europa.eu.int/information_society/eeurope/ehealth/quality/<br />

draft_guidelines/index_en.htm (accessed 5 Feb 2002).<br />

7 European Commission. Recommendation 2/2001 on certain minimum<br />

requirements for collecting personal data on-line in the European<br />

Union. Adopted on 17 May 2001. <strong>http</strong>://europa.eu.int/comm/<br />

internal_market/en/dataprot/wpdocs/wp43en.htm (accessed 25 Jan<br />

2002).<br />

Some quantities are even more difficult to measure<br />

and evaluate. Cardiac stroke volume does at least have<br />

an objective reality; a physical quantity of blood is<br />

pumped out of the heart when it beats. Anxiety and<br />

depression do not have a physical reality but are useful<br />

artificial constructs. They are measured <strong>by</strong> questionnaire<br />

scales, where answers to a series of questions<br />

related to the concept we want to measure are<br />

combined to give a numerical score. Website quality is<br />

similar. We are measuring a quantity which is not precisely<br />

defined, and there is no instrument with which<br />

we can compare any measure we might devise. How<br />

are we to assess the validity of such a scale?<br />

The relevant theory was developed in the social sciences<br />

in the context of questionnaire scales. 4 First we<br />

might ask whether the scale looks right, whether it asks<br />

about the sorts of thing which we think of as being<br />

related to anxiety or website quality. If it appears to be<br />

correct, we call this face validity. Next we might ask<br />

whether it covers all the aspects which we want to<br />

measure. A phobia scale which asked about fear of<br />

dogs, spiders, snakes, and cats but ignored height, confined<br />

spaces, and crowds would not do this. We call<br />

appropriate coverage of the subject matter content<br />

validity.<br />

Our scale may look right and cover the right things,<br />

but what other evidence can we bring to the question<br />

of validity? One question we can ask is whether our<br />

score has the relationships with other variables that we<br />

would expect. For example, does an anxiety measure<br />



distinguish between psychiatric patients and medical<br />

patients? Do we get different anxiety scores from<br />

students before and after an examination? Does a<br />

measure of depression predict suicide attempts? We<br />

call the property of having appropriate relationships<br />

with other variables construct validity.<br />

We can also ask whether the items which together<br />

compose the scale are related to one another: does the<br />

scale have internal consistency? If not, do the items really<br />

measure the same thing? On the other hand, if the<br />

items are too similar, some may be redundant. Highly<br />

correlated items in a scale may make the scale overlong<br />

and may lead to some aspects being overemphasised,<br />

impairing the content validity. A handy summary<br />

measure for this feature is Cronbach’s alpha. 5<br />
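For readers who want to compute this summary measure themselves, here is a minimal sketch of Cronbach's alpha; the item scores below are invented purely for illustration.

    # A minimal sketch of Cronbach's alpha for a multi-item scale.
    import numpy as np

    def cronbach_alpha(items):
        """items: 2D array with one row per subject and one column per item."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                            # number of items
        item_vars = items.var(axis=0, ddof=1)         # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Five subjects answering a four item questionnaire (made-up scores).
    scores = [[3, 4, 3, 4],
              [2, 2, 3, 2],
              [4, 5, 4, 5],
              [1, 2, 2, 1],
              [3, 3, 4, 3]]
    print(round(cronbach_alpha(scores), 2))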

A scale must also be repeatable and be sufficiently<br />

objective to give similar results for different observers.<br />

If a measurement is repeatable, in that someone who<br />

has a high score on one occasion tends to have a high<br />

score on another, it must be measuring something.<br />

With physical measurements, it is often possible for the<br />

same observer (or different observers) to make<br />

repeated measurements in quick succession. When<br />

there is a subjective element in the measurement the<br />

observer can be blinded from their first measurement,<br />

Honour a physician with the honour due unto him<br />

A few years ago my general practitioner told me that<br />

anyone aged over 40 with upper abdominal discomfort<br />

needed investigating. At the local teaching hospital, a<br />

pleasant young doctor did a gastroscopy, which<br />

showed a mass in my stomach wall. I was sent for a<br />

barium meal. A consultant radiologist took the x ray<br />

films, instructing me briskly to turn this way and that<br />

but not otherwise paying me any attention. He told me<br />

to wait a few minutes while he checked the films to see<br />

if all the views were satisfactory. I sat alone in the room<br />

for about five minutes.<br />

From the moment the consultant re-entered I could<br />

see that he was slightly agitated. “I’m terribly sorry,” he<br />

called out as he came through the door at the far end.<br />

And then again, “I’m terribly sorry.” Perhaps these<br />

words of regret, coupled with the concern on his face,<br />

might not have had the effect they did had I not been a<br />

man with an abdominal mass on his mind. At this<br />

moment of truth and reckoning, certain visions swam<br />

before my eyes.<br />

Three strides later, he was in front of me and<br />

looking me full in the face: “I’m terribly sorry, I hadn’t<br />

realised you were a doctor.” In his hand was the request<br />

form, and I could see that my general practitioner had<br />

written “ex-SR here” in one corner. He must have<br />

spotted this when checking the form as he looked at<br />

the preliminary plates. Though no further x rays were<br />

needed, he proceeded a little breathlessly to deliver<br />

three or four minutes of almost a caricature of caring,<br />

empathic interest in a patient. What branch of<br />

medicine was I in, and where did I work? Good<br />

heavens, that must be tough. Is that an Australian<br />

accent I hear? A St Mary’s old boy, ah yes. What did I<br />

think about. . .?<br />

I don’t mean to imply that this was insincere, merely<br />

splendidly different from his earlier matter of factness<br />

and economy of word. I had thought nothing of this at<br />

the time: in such a bread and butter procedure I had<br />


and different observers can make simultaneous<br />

measurements. In assessing the reliability of a website<br />

quality scale, it is easy to get several observers to apply<br />

the scale independently. With websites, repeat assessments<br />

need to be close in time because their content<br />

changes frequently (as does bmj.com). With questionnaires,<br />

either self administered or recorded <strong>by</strong> an<br />

observer, repeat measurements need to be far enough<br />

apart in time for the earlier responses to be forgotten,<br />

yet not so far apart that the underlying quantity being<br />

measured might have changed. Such data enable us to<br />

evaluate test-retest reliability. If two measures have comparable<br />

face, content, and construct validity the more<br />

repeatable one may be preferred for the study of a<br />

given population.<br />

1 Gagliardi A, Jadad AR. Examination of instruments used to rate quality of<br />

health information on the internet: chronicle of a voyage with an unclear<br />

destination. <strong>BMJ</strong> 2002;324:569-73.<br />

2 Jüni P, Altman DG, Egger M. Assessing the quality of controlled clinical<br />

trials. <strong>BMJ</strong> 2001;323:42-6.<br />

3 Muldoon MF, Barger SD, Flory JD, Manuck B. What are quality of life<br />

measurements measuring? <strong>BMJ</strong> 1998;316:542-5.<br />

4 Streiner DL, Norman GR. Health measurement scales: a practical<br />

guide to their development and use. 2nd ed. Oxford: Oxford University Press,<br />

1996.<br />

5 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Cronbach’s alpha. <strong>BMJ</strong> 1997;314:572.<br />

no more reason to expect the doctor to engage with<br />

me as a person than I would the phlebotomist taking a<br />

routine blood sample. Clearly, this consultant saw<br />

things similarly as a rule, but when the patient was a<br />

doctor the aesthetics of the encounter changed. He<br />

had apologised three times for what he felt was a lapse<br />

on his part, arising from his failure to notice what was<br />

written in the corner of the request form. Perhaps he<br />

thought I knew that my general practitioner had<br />

written this and that I expected this of a medical<br />

referral, and thus expected to be recognised <strong>by</strong> him<br />

not just as a patient but also as a colleague. He seemed<br />

to see this as my due. (As it happens, I didn’t.)<br />

I had forgotten this incident, but it was brought back<br />

to me <strong>by</strong> the aftermath of the Bristol cardiac surgery<br />

debacle, and <strong>by</strong> the publicity surrounding other recent<br />

medical scandals. These have all put a spotlight on<br />

relations between doctors, who seem to offer each<br />

other acknowledgement and empathy, as my<br />

consultant had sought belatedly to do to me. The<br />

general public may be coming to suspect that this<br />

collegiate solidarity is somehow not in their interests,<br />

associating it with mutual protectiveness and thus with<br />

cover-ups of medical malpractice. It is too soon to say<br />

how the profession will react, but my consultant was an<br />

older man and my guess is that, with younger<br />

generations of doctors, we will see the waning of a<br />

tradition whose roots lie with Hippocrates. For it was<br />

his oath that bound doctors to look well on each other<br />

(and not charge each other for their services).<br />

It’s another story, but I found out later that the mass<br />

was the gastroscopy instrument itself distorting the<br />

stomach wall, misdiagnosed <strong>by</strong> an inexperienced<br />

registrar. No special treatment there, anyway.<br />

Derek Summerfield consultant psychiatrist, CASCAID,<br />

South London and Maudsley NHS Trust, London<br />



Statistics <strong>Notes</strong><br />

Interaction revisited: the difference between two estimates<br />

Douglas G Altman, J Martin <strong>Bland</strong><br />

We often want to compare two estimates of the same<br />

quantity derived from separate analyses. Thus we might<br />

want to compare the treatment effect in subgroups in a<br />

randomised trial, such as two age groups. The term for<br />

such a comparison is a test of interaction. In earlier Statistics<br />

<strong>Notes</strong> we discussed interaction in terms of heterogeneity<br />

of treatment effect. 1–3 Here we revisit interaction<br />

and consider the concept more generally.<br />

The comparison of two estimated quantities, such as<br />

means or proportions, each with its standard error, is a<br />

general method that can be applied widely. The two estimates<br />

should be independent, not obtained from the<br />

same individuals—examples are the results from<br />

subgroups in a randomised trial or from two independent<br />

studies. The samples should be large. If the estimates<br />

are E1 and E2 with standard errors SE(E1) and SE(E2), then the difference d = E1 − E2 has standard error SE(d) = √[SE(E1)² + SE(E2)²] (that is, the square root of the

sum of the squares of the separate standard errors). This<br />

formula is an example of a well known relation that the<br />

variance of the difference between two estimates is the<br />

sum of the separate variances (here the variance is the<br />

square of the standard error). Then the ratio z=d/SE(d)<br />

gives a test of the null hypothesis that in the population<br />

the difference d is zero, <strong>by</strong> comparing the value of z to<br />

the standard normal distribution. The 95% confidence<br />

interval for the difference is d − 1.96SE(d) to d + 1.96SE(d).

We illustrated this for means and proportions, 3<br />

although we did not show how to get the standard<br />

error of the difference. Here we consider comparing<br />

relative risks or odds ratios. These measures are always<br />

analysed on the log scale because the distributions of<br />

the log ratios tend to be closer to normal than those of the ratios themselves.

In a meta-analysis of non-vertebral fractures in randomised<br />

trials of hormone replacement therapy the<br />

estimated relative risk from 22 trials was 0.73 (P=0.02) in<br />

favour of hormone replacement therapy. 4 From 14 trials<br />

of women aged on average < 60 years the relative risk<br />

was 0.67 (95% confidence interval 0.46 to 0.98; P=0.03).<br />

From eight trials of women aged >60 the relative risk<br />

was 0.88 (0.71 to 1.08; P=0.22). In other words, in<br />

younger women the estimated treatment benefit was a<br />

33% reduction in risk of fracture, which was statistically<br />

significant, compared with a 12% reduction in older<br />

women, which was not significant. But are the relative<br />

risks from the subgroups significantly different from<br />

each other? We show how to answer this question using<br />

just the summary data quoted.<br />

Because the calculations were made on the log scale,<br />

comparing the two estimates is complex (see table). We<br />

need to obtain the logs of the relative risks and their<br />

confidence intervals (rows 2 and 4). 5 As 95% confidence<br />

intervals are obtained as 1.96 standard errors either side<br />

of the estimate, the SE of each log relative risk is<br />

obtained <strong>by</strong> dividing the width of its confidence interval<br />

<strong>by</strong> 2×1.96 (row 6). The estimated difference in log<br />

relative risks is d = E1 − E2 = −0.2726 and its standard error


0.2206 (row 8). From these two values we can test the<br />

interaction and estimate the ratio of the relative risks<br />

(with confidence interval). The test of interaction is the<br />

ratio of d to its standard error: z=−0.2726/<br />

0.2206=−1.24, which gives P=0.2 when we refer it to a<br />

table of the normal distribution. The estimated<br />

interaction effect is exp( − 0.2726)=0.76. (This value can<br />

also be obtained directly as 0.67/0.88=0.76.) The<br />

confidence interval for this effect is − 0.7050 to 0.1598<br />

on the log scale (row 9). Transforming back to the relative<br />

risk scale, we get 0.49 to 1.17 (row 12). There is thus<br />

no good evidence to support a different treatment effect<br />

in younger and older women.<br />
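The whole calculation can be reproduced in a few lines. The sketch below starts from the two published subgroup relative risks and their confidence intervals and recovers the test of interaction and the ratio of relative risks quoted above; scipy is assumed only for the tail area of the normal distribution.

    # A minimal sketch comparing two independent relative risks on the log scale.
    from math import exp, log, sqrt
    from scipy.stats import norm

    def se_from_ci(lower, upper):
        # SE of the log relative risk from the width of its 95% CI
        return (log(upper) - log(lower)) / (2 * 1.96)

    e1, se1 = log(0.67), se_from_ci(0.46, 0.98)   # trials of women aged <60
    e2, se2 = log(0.88), se_from_ci(0.71, 1.08)   # trials of women aged >60

    d = e1 - e2                                   # difference of log relative risks
    se_d = sqrt(se1 ** 2 + se2 ** 2)
    z = d / se_d                                  # test of interaction
    p = 2 * norm.sf(abs(z))                       # two sided P value
    ratio = exp(d)                                # ratio of relative risks
    low, high = exp(d - 1.96 * se_d), exp(d + 1.96 * se_d)
    print(f"z={z:.2f}, P={p:.2f}, ratio of RRs={ratio:.2f} ({low:.2f} to {high:.2f})")

Run as it stands, this reproduces the published figures: z of about −1.24, P of about 0.2, and a ratio of relative risks of 0.76 (0.49 to 1.17).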

The same approach is used for comparing odds<br />

ratios. Comparing means or regression coefficients is<br />

simpler as there is no log transformation. The two estimates<br />

must be independent: the method should not be<br />

used to compare a subset with the whole group, or two<br />

estimates from the same patients.<br />

There is limited power to detect interactions, even in<br />

a meta-analysis combining the results from several studies.<br />

As this example illustrates, even when the two<br />

estimates and P values seem very different the test of<br />

interaction may not be significant. It is not sufficient for<br />

the relative risk to be significant in one subgroup and<br />

not in another. Conversely, it is not correct to assume<br />

that when two confidence intervals overlap the two estimates<br />

are not significantly different. 6 <strong>Statistical</strong> analysis<br />

should be targeted on the question in hand, and not<br />

based on comparing P values from separate analyses. 2<br />

1 Altman DG, Matthews JNS. Interaction 1: Heterogeneity of effects. <strong>BMJ</strong><br />

1996;313:486.<br />

2 Matthews JNS, Altman DG. Interaction 2: Compare effect sizes not P<br />

values. <strong>BMJ</strong> 1996;313:808.<br />

3 Matthews JNS, Altman DG. Interaction 3: How to examine heterogeneity.<br />

<strong>BMJ</strong> 1996;313:862.<br />

4 Torgerson DJ, Bell-Syer SEM. Hormone replacement therapy and<br />

prevention of nonvertebral fractures. A meta-analysis of randomized<br />

trials. JAMA 2001;285:2891-7.<br />

5 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Logarithms. <strong>BMJ</strong> 1996;312:700.<br />

6 <strong>Bland</strong> M, Peacock J. Interpreting statistics with confidence. Obstetrician<br />

and Gynaecologist (in press).<br />

Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Department of Public Health Sciences, St George’s Hospital Medical School, London SW17 0RE: J Martin Bland, professor of medical statistics. Correspondence to: D G Altman, doug.altman@cancer.org.uk

BMJ 2003;326:219

Calculations for comparing two estimated relative risks

                           Group 1                        Group 2
1  RR                      0.67                           0.88
2  *log RR                 −0.4005 (E1)                   −0.1278 (E2)
3  95% CI for RR           0.46 to 0.98                   0.71 to 1.08
4  *95% CI for log RR      −0.7765 to −0.0202             −0.3425 to 0.0770
5  Width of CI             0.7563                         0.4195
6  SE [=width/(2×1.96)]    0.1929                         0.1070
Difference between log relative risks
7  d [=E1−E2]              −0.4005 − (−0.1278) = −0.2726
8  SE(d)                   √(0.1929² + 0.1070²) = 0.2206
9  CI(d)                   −0.2726 ± 1.96×0.2206, or −0.7050 to 0.1598
10 Test of interaction     z = −0.2726/0.2206 = −1.24 (P=0.2)
Ratio of relative risks (RRR)
11 RRR = exp(d)            exp(−0.2726) = 0.76
12 CI(RRR)                 exp(−0.7050) to exp(0.1598), or 0.49 to 1.17
*Values obtained by taking natural logarithms of values on the preceding row.


Statistics <strong>Notes</strong><br />

The logrank test<br />

J Martin <strong>Bland</strong>, Douglas G Altman<br />

We often wish to compare the survival experience of<br />

two (or more) groups of individuals. For example, the<br />

table shows survival times of 51 adult patients with<br />

recurrent malignant gliomas 1<br />

tabulated <strong>by</strong> type of<br />

tumour and indicating whether the patient had died or<br />

was still alive at analysis—that is, their survival time was<br />

censored. 2 As the figure shows, the survival curves differ,<br />

but is this sufficient to conclude that in the population<br />

patients with anaplastic astrocytoma have worse<br />

survival than patients with glioblastoma?<br />

We could compute survival curves 3 for each group<br />

and compare the proportions surviving at any specific<br />

time. The weakness of this approach is that it does not<br />

provide a comparison of the total survival experience of<br />

the two groups, but rather gives a comparison at some<br />

arbitrary time point(s). In the figure the difference in<br />

survival is greater at some times than others and eventually<br />

becomes zero. We describe here the logrank test, the<br />

most popular method of comparing the survival of<br />

groups, which takes the whole follow up period into<br />

account. It has the considerable advantage that it does<br />

not require us to know anything about the shape of the<br />

survival curve or the distribution of survival times.<br />

The logrank test is used to test the null hypothesis<br />

that there is no difference between the populations in<br />

the probability of an event (here a death) at any time<br />

point. The analysis is based on the times of events<br />

(here deaths). For each such time we calculate the<br />

observed number of deaths in each group and the<br />

number expected if there were in reality no difference<br />

between the groups. The first death was in week 6,<br />

when one patient in group 1 died. At the start of this<br />

week, there were 51 subjects alive in total, so the risk of<br />

death in this week was 1/51. There were 20 patients in<br />

group 1, so, if the null hypothesis were true, the<br />

expected number of deaths in group 1 is 20 × 1/51 =<br />

0.39. Likewise, in group 2 the expected number of<br />

deaths is 31 × 1/51 = 0.61. The second event<br />

occurred in week 10, when there were two deaths.<br />

There were now 19 and 31 patients at risk (alive) in the<br />

two groups, one having died in week 6, so the<br />

probability of death in week 10 was 2/50. The<br />

expected numbers of deaths were 19 × 2/50 = 0.76<br />

and 31 × 2/50 = 1.24 respectively.<br />

The same calculations are performed each time an<br />

event occurs. If a survival time is censored, that<br />

individual is considered to be at risk of dying in the<br />

week of the censoring but not in subsequent weeks.<br />

This way of handling censored observations is the<br />

same as for the Kaplan-Meier survival curve. 3<br />

From the calculations for each time of death, the<br />

total numbers of expected deaths were 22.48 in group<br />

1 and 19.52 in group 2, and the observed numbers of<br />

deaths were 14 and 28. We can now use a χ² test of the null hypothesis. The test statistic is the sum of (O − E)²/E for each group, where O and E are the totals of the observed and expected events. Here (14 − 22.48)²/22.48 + (28 − 19.52)²/19.52 = 6.88. The degrees of freedom are the number of groups minus one, i.e. 2 − 1 = 1. From a table of the χ² distribution we get

P < 0.01, so that the difference between the groups is<br />

statistically significant. There is a different method of<br />

calculating the test statistic, 4 but we prefer this<br />

approach as it extends easily to several groups. It is<br />

also possible to test for a trend in survival across<br />

ordered groups. 4 Although we have shown how the<br />

calculation is made, we strongly recommend the use of<br />

statistical software.<br />
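For readers who want to see what such software does, here is a minimal sketch of the calculation described above: the expected number of deaths in each group is accumulated over the event times and compared with the observed number. The function is a generic two group logrank test; the example call uses a few made-up survival times rather than the glioma data, but applied to the 51 times in the table it should give expected deaths of about 22.5 and 19.5 and the χ² value of about 6.9 quoted above.

    # A minimal sketch of the two group logrank test.
    import numpy as np
    from scipy.stats import chi2

    def logrank(times, died, group):
        """times: follow up times; died: 1=death, 0=censored; group: 0 or 1."""
        times, died, group = map(np.asarray, (times, died, group))
        observed = np.array([died[group == g].sum() for g in (0, 1)])
        expected = np.zeros(2)
        for t in np.unique(times[died == 1]):      # each time at which deaths occur
            at_risk = times >= t                   # still being followed at time t
            deaths = ((times == t) & (died == 1)).sum()
            for g in (0, 1):                       # expected deaths under the null
                expected[g] += at_risk[group == g].sum() * deaths / at_risk.sum()
        stat = ((observed - expected) ** 2 / expected).sum()
        return stat, chi2.sf(stat, df=1)           # 1 degree of freedom for 2 groups

    stat, p = logrank(times=[6, 10, 10, 13, 21, 30],
                      died=[1, 1, 1, 0, 1, 1],
                      group=[0, 1, 1, 1, 0, 0])
    print(round(stat, 2), round(p, 3))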

The logrank test is based on the same assumptions<br />

as the Kaplan-Meier survival curve, 3 namely that censoring

is unrelated to prognosis, the survival probabilities<br />

are the same for subjects recruited early and late in<br />

the study, and the events happened at the times specified.<br />

Deviations from these assumptions matter most if<br />

they are satisfied differently in the groups being<br />

compared, for example if censoring is more likely in<br />

one group than another.<br />

The logrank test is most likely to detect a difference<br />

between groups when the risk of an event is<br />

consistently greater for one group than another. It is<br />

unlikely to detect a difference when survival curves<br />

cross, as can happen when comparing a medical with a<br />

surgical intervention. When analysing survival data, the<br />

survival curves should always be plotted.<br />

Because the logrank test is purely a test of<br />

significance it cannot provide an estimate of the size of<br />

the difference between the groups or a confidence<br />

interval. For these we must make some assumptions<br />

about the data. Common methods use the hazard ratio,<br />

including the Cox proportional hazards model, which<br />

we shall describe in a future Statistics Note.<br />

Competing interests: None declared.<br />

[Figure: Survival curves for women with glioma by diagnosis. Survival proportion (0 to 1.0) plotted against time (0 to 208 weeks) for the two groups, astrocytoma and glioblastoma.]

1 Rostomily RC, Spence AM, Duong D, McCormick K, <strong>Bland</strong> M, Berger<br />

MS. Multimodality management of recurrent adult malignant gliomas:<br />

results of a phase II multiagent chemotherapy study and analysis of<br />

cytoreductive surgery. Neurosurgery 1994;35:378-8.<br />

2 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Time to event (survival) data. <strong>BMJ</strong><br />

1998;317:468-9.<br />

3 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Survival probabilities. The Kaplan-Meier method.<br />

<strong>BMJ</strong> 1998;317:1572.<br />

4 Altman DG. Practical statistics for medical research. London: Chapman &<br />

Hall, 1991: 371-5.<br />

Department of Health Sciences, University of York, York YO10 5DD: J Martin Bland, professor of health statistics. Cancer Research UK/NHS Centre for Statistics in Medicine, Institute of Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Correspondence to: Professor Bland

BMJ 2004;328:1073

Weeks to death or censoring in 51 adults with recurrent gliomas 1 (A=astrocytoma, G=glioblastoma)
A (n=20): 6, 13, 21, 30, 31*, 37, 38, 47*, 49, 50, 63, 79, 80*, 82*, 82*, 86, 98, 149*, 202, 219
G (n=31): 10, 10, 12, 13, 14, 15, 16, 17, 18, 20, 24, 24, 25, 28, 30, 33, 34*, 35, 37, 40, 40, 40*, 46, 48, 70*, 76, 81, 82, 91, 112, 181
*Censored survival time (still alive at follow up).


Papers<br />

What is already known on this topic<br />

Epidural analgesia during labour is effective but<br />

has been associated with increased rates of<br />

instrumental delivery<br />

Studies have included women of mixed parity and<br />

high concentrations of epidural anaesthetic<br />

What this study adds<br />


Epidural infusions with low concentrations of local<br />

anaesthetic are unlikely to increase the risk of<br />

caesarean section in nulliparous women<br />

Although epidural analgesia is associated with an<br />

increased risk of instrumental vaginal delivery,<br />

operator bias cannot be excluded<br />

Epidural analgesia is associated with a longer<br />

second stage labour and increased oxytocin<br />

requirement, but the importance of these is<br />

unclear as maternal analgesia and neonatal<br />

outcome may be better with epidural analgesia<br />

Differences in protocols for management of labour<br />

could have contributed to the differences in rates of<br />

instrumental vaginal delivery.<br />

Another limitation was the large number of women<br />

who changed from parenteral opioids to epidural<br />

analgesia. Our intention to treat approach would likely<br />

render any estimation of the effects of epidural analgesia<br />

more conservative, but it was necessary to prevent<br />

selection bias. Comparing a policy of offering epidural<br />

analgesia with one of offering parenteral opioids<br />

reflects real life. As the definitions of stages of labour<br />

varied between trials, we were unable to determine if<br />

epidural analgesia prolonged the first stage, and the<br />

actual duration can only be estimated. Unlike previous<br />

reviews, we focused on nulliparous women because the<br />

indications for, and risks of, caesarean section differ<br />

with parity.<br />

We limited our analysis to trials that used infusions<br />

of bupivacaine with concentrations of 0.125% or less,<br />

to reflect current practice. In a randomised controlled<br />

trial, low concentrations have been shown to reduce<br />

the rate of instrumental delivery. 7<br />

Epidural analgesia may increase the risk of<br />

instrumental delivery <strong>by</strong> several mechanisms. Reduction<br />

of serum oxytocin levels can result in a weakening<br />

of uterine activity. 9 10 This may be due in part to intravenous<br />

fluid infusions being given before epidural<br />

analgesia, reducing oxytocin secretion. 11 The increased<br />

use of oxytocin after starting epidural analgesia may<br />

indicate attempts at speeding up labour. Maternal<br />

efforts at expulsion can also be impaired, causing fetal<br />

malposition during descent. 12 Previously, the association<br />

of neonatal morbidity and mortality with longer<br />

labour (second stage longer than two hours) had justified<br />

expediting delivery, leading to increased rates of<br />

instrumental delivery. 13<br />

Delaying pushing until the fetus’s head is visible or<br />

until one hour after reaching full cervical dilation may<br />

reduce the incidence of instrumental delivery and its<br />

attendant morbidity. 13 Although women receiving<br />

epidural analgesia had a longer second stage labour,<br />

this was not associated with poorer neonatal outcome<br />

in our analysis.<br />

We thank Christopher R Palmer (University of Cambridge) for<br />

guiding us and Shiv Sharma (University of Texas Southwestern<br />

Medical Centre) for helping with the data.<br />

Contributors: See bmj.com<br />

Funding: None.<br />

Competing interests: None declared.<br />

Ethical approval: Not required.<br />

1 Hawkins JL, Koonin LM, Palmer SK, Gibbs CP. Anesthesia-related deaths<br />

during obstetric delivery in the United States, 1979-1990. Anesthesiology<br />

1997;86:277-84.<br />

2 Chestnut DH. Anesthesia and maternal mortality. Anesthesiology 1997;86:273-6.
3 Roberts CL, Algert CS, Douglas I, Tracy SK, Peat B. Trends in labour and birth interventions among low-risk women in New South Wales. Aust NZ J Obstet Gynaecol 2002;42:176-81.

4 Halpern SH, Leighton BL, Ohlsson A, Barrett JF, Rice A. Effect of<br />

epidural vs parenteral opioid analgesia on the progress of labor: a metaanalysis.<br />

JAMA 1998;280:2105-10.<br />

5 Zhang J, Klebanoff MA, DerSimonian R. Epidural analgesia in association<br />

with duration of labor and mode of delivery: a quantitative review. Am J<br />

Obstet Gynecol 1999;180:970-7.<br />

6 Howell CJ. Epidural versus non-epidural analgesia for pain relief in<br />

labour. Cochrane Database Syst Rev 2000;CD000331.<br />

7 Comparative Obstetric Mobile Epidural Trial (COMET) Study Group<br />

UK. Effect of low-dose mobile versus traditional epidural techniques on<br />

mode of delivery: a randomised controlled trial. Lancet 2001;358:19-23.<br />

8 Scottish Intercollegiate Guidelines Network. Methodology checklist 2: randomised controlled trials. www.sign.ac.uk/guidelines/fulltext/50/checklist2.html (accessed 7 Mar 2004).
9 Goodfellow CF, Hull MG, Swaab DF, Dogterom J, Buijs RM. Oxytocin deficiency at delivery with epidural analgesia. Br J Obstet Gynaecol 1983;90:214-9.

10 Bates RG, Helm CW, Duncan A, Edmonds DK. Uterine activity in the second<br />

stage of labour and the effect of epidural analgesia. Br J Obstet Gynaecol<br />

1985;92:1246-50.<br />

11 Cheek TG, Samuels P, Miller F, Tobin M, Gutsche BB. Normal saline i.v.<br />

fluid load decreases uterine activity in active labour. Br J Anaesth<br />

1996;77:632-5.<br />

12 Newton ER, Schroeder BC, Knape KG, Bennett BL. Epidural analgesia<br />

and uterine function. Obstet Gynecol 1995;85:749-55.<br />

13 Miller AC. The effects of epidural analgesia on uterine activity and labor.<br />

Int J Obstet Anesth 1997;6:2-18.

(Accepted 18 March 2004)<br />

doi 10.1136/bmj.38097.590810.7C<br />

Corrections and clarifications<br />

The logrank test<br />

An error in this Statistics <strong>Notes</strong> article <strong>by</strong> J Martin<br />

<strong>Bland</strong> and Douglas G Altman persisted to<br />

publication (1 May, p 1073). The third sentence—<br />

relating to the values in a table of survival<br />

times—should have ended: “but is this sufficient to<br />

conclude that in the population, patients with<br />

anaplastic astrocytoma have better [not ‘worse’]<br />

survival than patients with glioblastoma?”<br />

Minerva<br />

In the seventh item in the 17 April issue (p 964),<br />

Minerva admits to not being religious. Maybe this is<br />

why she reported that in the Bible Daniel has 21<br />

chapters. It doesn’t; it has only 12.<br />

Exploring perspectives on death<br />

In this News article <strong>by</strong> Caroline White (17 April,<br />

p 911), we referred to average life expectancy as<br />

running to 394 000 minutes—a rather brief life<br />

span, some might think, given that there are only<br />

525 600 minutes in one non-leap year. Assuming<br />

that the average life expectancy referred to in the<br />

article was intended to relate to UK males (that is,<br />

75 years), some straightforward maths reveals that<br />

75 years (even discounting the extra length of leap<br />

years) in fact contain 39 420 000 minutes. Quite a<br />

lot of minutes were lost somewhere in our<br />

publication process.<br />



Screening and Test Evaluation Program, School of Public Health, University of Sydney, NSW 2006, Australia: Jonathan J Deeks, senior research biostatistician. Cancer Research UK/NHS Centre for Statistics in Medicine, Institute for Health Sciences, Oxford OX3 7LF: Douglas G Altman, professor of statistics in medicine. Correspondence to: Mr Deeks, Jon.Deeks@cancer.org.uk

BMJ 2004;329:168–9


Statistics <strong>Notes</strong><br />

Diagnostic tests 4: likelihood ratios<br />

Jonathan J Deeks, Douglas G Altman<br />

The properties of a diagnostic or screening test are often<br />

described using sensitivity and specificity or predictive<br />

values, as described in previous <strong>Notes</strong>. 12 Likelihood<br />

ratios are alternative statistics for summarising diagnostic<br />

accuracy, which have several particularly powerful<br />

properties that make them more useful clinically than<br />

other statistics. 3<br />

Each test result has its own likelihood ratio, which<br />

summarises how many times more (or less) likely<br />

patients with the disease are to have that particular<br />

result than patients without the disease. More formally,<br />

it is the ratio of the probability of the specific test result<br />

in people who do have the disease to the probability in<br />

people who do not.<br />

A likelihood ratio greater than 1 indicates that the<br />

test result is associated with the presence of the disease,<br />

whereas a likelihood ratio less than 1 indicates that the<br />

test result is associated with the absence of disease. The<br />

further likelihood ratios are from 1 the stronger the<br />

evidence for the presence or absence of disease. Likelihood<br />

ratios above 10 and below 0.1 are considered to<br />

provide strong evidence to rule in or rule out<br />

diagnoses respectively in most circumstances. 4 When<br />

tests report results as being either positive or negative<br />

the two likelihood ratios are called the positive<br />

likelihood ratio and the negative likelihood ratio.<br />

The table shows the results of a study of the value of<br />

a history of smoking in diagnosing obstructive airway<br />

disease. 5 Smoking history was categorised into four<br />

groups according to pack years smoked (packs per day<br />

× years smoked). The likelihood ratio for each category<br />

is calculated <strong>by</strong> dividing the percentage of patients with<br />

obstructive airway disease in that category <strong>by</strong> the<br />

percentage without the disease in that category. For<br />

example, among patients with the disease 28% had 40+<br />

smoking pack years compared with just 1.4% of<br />

patients without the disease. The likelihood ratio is<br />

thus 28.4/1.4 = 20.3. A smoking history of more than<br />

40 pack years is strongly predictive of a diagnosis of<br />

obstructive airway disease as the likelihood ratio is substantially<br />

higher than 10. Although never smoking or<br />

smoking less than 20 pack years both point to not having<br />

obstructive airway disease, their likelihood ratios<br />

are not small enough to rule out the disease with<br />

confidence.<br />

Smoking habit      Obstructive airway disease
(pack years)       Yes (n (%))    No (n (%))    Likelihood ratio             95% CI
≥40                42 (28.4)      2 (1.4)       (42/148)/(2/144) = 20.4      5.04 to 82.8
20-40              25 (16.9)      24 (16.7)     (25/148)/(24/144) = 1.01     0.61 to 1.69
0-20               29 (19.6)      51 (35.4)     (29/148)/(51/144) = 0.55     0.37 to 0.82
Never smoked or smoked for
Likelihood ratios are ratios of probabilities, and can be treated in the same way as risk ratios for the purposes of calculating confidence intervals 6


test results. Forcing dichotomisation on multicategory<br />

test results may discard useful diagnostic information.<br />

Likelihood ratios can be used to help adapt the<br />

results of a study to your patients. To do this they make<br />

use of a mathematical relationship known as Bayes<br />

theorem that describes how a diagnostic finding<br />

changes our knowledge of the probability of abnormality.<br />

3 The post-test odds that the patient has the<br />

disease are estimated <strong>by</strong> multiplying the pretest odds<br />

<strong>by</strong> the likelihood ratio. The use of odds rather than<br />

risks makes the calculation slightly complex (box) but a<br />

nomogram can be used to avoid having to make<br />

conversions between odds and probabilities (figure). 7<br />

Both the figure and the box illustrate how a prior<br />

probability of obstructive airway disease of 0.1 (based,<br />

say, on presenting features) is updated to a probability<br />

of 0.7 with the knowledge that the patient had smoked<br />

for more than 40 pack years.<br />
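The same updating can be done without the nomogram. The sketch below converts a pretest probability to odds, multiplies by the likelihood ratio of 20.4 for a smoking history of more than 40 pack years, and converts back to a probability, reproducing the figure of roughly 0.7 quoted above.

    # A minimal sketch of Bayes theorem on the odds scale.
    def post_test_probability(pretest_prob, likelihood_ratio):
        pretest_odds = pretest_prob / (1 - pretest_prob)   # probability -> odds
        post_odds = pretest_odds * likelihood_ratio        # apply the likelihood ratio
        return post_odds / (1 + post_odds)                 # odds -> probability

    # Pretest probability 0.1 and likelihood ratio 20.4 (>40 pack years)
    print(round(post_test_probability(0.1, 20.4), 2))      # about 0.7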

In clinical practice it is essential to know how a<br />

particular test result predicts the risk of abnormality.<br />

Sensitivities and specificities 1 do not do this: they<br />

describe how abnormality (or normality) predicts<br />

particular test results. Predictive values 2 do give<br />

probabilities of abnormality for particular test results,<br />

but depend on the prevalence of abnormality in the<br />

A memorable patient<br />

Living history<br />

I was finally settling down at my desk when the pager<br />

bleeped: it was the outpatients’ department. An extra<br />

patient had been added to the afternoon list—would I<br />

see him?<br />

The patient was a slightly built man in his 60s. He<br />

had brought recent documentation from another<br />

hospital. I asked about his presenting complaint.<br />

“Well, I’ll try, but I wasn’t aware of everything that<br />

happened. That’s why I’ve brought my wife—she was<br />

with me at the time.”<br />

This was turning out to be one of those perfect<br />

neurological consultations: documents from another<br />

hospital, a witness account, an articulate patient. The<br />

only question would be whether it was seizure,<br />

syncope, or transient ischaemic attack. As we went<br />

through his medical history, I studied his records and<br />

for the first time noticed the phrase “Tetralogy of<br />

Fallot.”<br />

“Yes, my lifelong diagnosis,” he smiled. “I was<br />

operated on.”<br />

I saw the faint dusky blue colour of his lips.<br />

“Blalock-Taussig shunt?” I asked, as dim memories<br />

from medical school somehow came back into focus.<br />

“Yes, twice. By Blalock.” He paused. “In those days<br />

the operation was only done at Johns Hopkins—all the<br />

patients went there. I remember my second operation<br />

well. I was 12 at the time.”<br />

“And the doctors?” I asked.<br />

“Yes, especially Dr Taussig. She would come around<br />

with her entourage every so often. Deaf as a post, she<br />

was.”<br />

“What? Taussig deaf?”<br />

“Yes. She had an amplifier attached to her<br />

stethoscope to examine patients.”<br />

I asked to examine him, only too aware that it was<br />

more for my benefit than his.<br />

A half smile suggested that he had read my<br />

thoughts: “Of course.”<br />


study sample and can rarely be generalised beyond the<br />

study (except when the study is based on a suitable<br />

random sample, as is sometimes the case for<br />

population screening studies). Likelihood ratios provide<br />

a solution as they can be used to calculate the<br />

probability of abnormality, while adapting for varying<br />

prior probabilities of the chance of abnormality from<br />

different contexts.<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Diagnostic tests 1: sensitivity and specificity. <strong>BMJ</strong><br />

1994;308:1552.<br />

2 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Diagnostic tests 2: predictive values. <strong>BMJ</strong><br />

1994;309:102.<br />

3 Sackett DL, Straus S, Richardson WS, Rosenberg W, Haynes RB. Evidencebased<br />

medicine.How to practise and teach EBM. 2nd ed. Edinburgh: Churchill<br />

Livingstone, 2000:67-93.<br />

4 Jaeschke R, Guyatt G, Lijmer J. Diagnostic tests. In: Guyatt G, Rennie D,<br />

eds. Users’ guides to the medical literature. Chicago: AMA Press,<br />

2002:121-40.<br />

5 Straus SE, McAlister FA, Sackett DL, Deeks JJ. The accuracy of patient<br />

history, wheezing, and laryngeal measurements in diagnosing obstructive<br />

airway disease. CARE-COAD1 Group. Clinical assessment of the<br />

reliability of the examination-chronic obstructive airways disease. JAMA<br />

2000;283:1853-7.<br />

6 Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN,<br />

Gardner MJ, eds. Statistics with confidence. 2nd ed. London: <strong>BMJ</strong> Books,<br />

2000:105-19.<br />

7 Fagan TJ. Letter: Nomogram for Bayes theorem. N Engl J Med<br />

1975;293:257.<br />

To even my unpractised technique, his<br />

cardiovascular signs were a museum piece: absent left<br />

subclavian pulse, big arciform scars on the anterior<br />

chest created <strong>by</strong> the surgeon who had saved his life,<br />

central cyanosis, right-sided systolic murmurs, loud<br />

pulmonary valve closure sound (iatrogenic pulmonary<br />

hypertension, I reasoned), pulsatile liver—all these and<br />

undoubtedly more noted <strong>by</strong> the physician who first<br />

understood his condition.<br />

That night, I read about his doctors. Helen Taussig<br />

indeed had had substantial hearing impairment, a<br />

disability that would have meant the end of a career in<br />

cardiology for a less able clinician. I also learnt of the<br />

greater challenges of sexual prejudice that she fought<br />

and overcame all her life. I learnt about Alfred Blalock,<br />

the young doctor denied a residency at Johns Hopkins<br />

only to be invited back in his later years to head its<br />

surgical unit.<br />

The experiences of a life in medicine are sometimes<br />

overwhelming. For weeks, I reflected on the<br />

perspectives opened to me <strong>by</strong> this unassuming patient.<br />

The curious irony of a man with a life threatening<br />

condition who had outlived his saviours; the<br />

extraordinary vision of his all-too-human doctors; the<br />

opportunity to witness history played out in the course<br />

of a half-hour consultation. And memories were<br />

jogged, too: the words of my former professor of<br />

medicine, who showed us cases of “Fallot’s” and first<br />

told us about Taussig, the woman cardiologist; the<br />

portrait of Blalock adorning a surgical lecture theatre<br />

in medical college.<br />

(And my patient’s neurological examination and<br />

investigations? Non-contributory. I still don’t know<br />

whether it was seizure, syncope, or transient ischaemic<br />

attack.)<br />

Giridhar P Kalamangalam clinical fellow in epilepsy,<br />

department of neurology, Cleveland Clinic Foundation,<br />

Cleveland OH, USA<br />



Statistics <strong>Notes</strong><br />

Treatment allocation <strong>by</strong> minimisation<br />

Douglas G Altman, J Martin <strong>Bland</strong><br />

In almost all controlled trials treatments are allocated<br />

<strong>by</strong> randomisation. Blocking and stratification can be<br />

used to ensure balance between groups in size and<br />

patient characteristics. 1 But stratified randomisation<br />

using several variables is not effective in small trials.<br />

The only widely acceptable alternative approach is<br />

minimisation, 2 3 a method of ensuring excellent<br />

balance between groups for several prognostic factors,<br />

even in small samples. With minimisation the<br />

treatment allocated to the next participant enrolled in<br />

the trial depends (wholly or partly) on the characteristics<br />

of those participants already enrolled. The aim is<br />

that each allocation should minimise the imbalance<br />

across multiple factors.<br />

Table 1 shows some baseline characteristics in a<br />

controlled trial comparing two types of counselling in<br />

relation to dietary intake. 4 Minimisation was used for<br />

the four variables shown, and the two groups were<br />

clearly very similar in all of these variables. Such good<br />

balance for important prognostic variables helps the<br />

credibility of the comparisons. How is it achieved?<br />

Minimisation is based on a different principle from<br />

randomisation. The first participant is allocated a treatment<br />

at random. For each subsequent participant we<br />

determine which treatment would lead to better<br />

balance between the groups in the variables of interest.<br />

The dietary behaviour trial used minimisation<br />

based on the four variables in table 1. Suppose that<br />

after 40 patients had entered this trial the numbers in<br />

each subgroup in each treatment group were as shown<br />

in table 2. (Note that two or more categories need to be<br />

constructed for continuous variables.)<br />

The next enrolled participant is a black woman<br />

aged 52, who is a non-smoker. If we were to allocate her<br />

to behavioural counselling, the imbalance would be<br />

increased in sex distribution (12+1 v 11 women), in age<br />

(7+1 v 5 aged > 50), and in smoking (14+1 v 12<br />

non-smoking) and decreased in ethnicity (4+1 v 5<br />

black). We formalise this <strong>by</strong> summing over the four<br />

variables the numbers of participants with the same<br />

characteristics as this new recruit already in the trial:<br />

Behavioural: 12 (sex) +7 (age) +4 (ethnicity) +14 (smoking) = 37<br />

Nutrition: 11+5+5+12=33<br />

Imbalance is minimised <strong>by</strong> allocating this person to<br />

the group with the smaller total (or at random if the<br />

totals are the same). Allocation to behavioural counselling would increase the imbalance; allocation to nutrition would decrease it.

Table 1 Baseline characteristics in two groups 4
                           Behavioural counselling    Nutrition counselling
Women                      82                         84
Mean (SD) age (years)      43.3 (13.8)                43.2 (14.0)
Ethnicity:
  White                    94                         96
  Black                    37                         32
  Asian                    3                          5
Current smokers            47                         44

Table 2 Hypothetical distribution of baseline characteristics after 40 patients had been enrolled in the trial
                           Behavioural counselling (n=20)    Nutrition counselling (n=20)
Women                      12                                11
Age >50                    7                                 5
Ethnicity:
  White                    15                                15
  Black                    4                                 5
  Asian                    1                                 0
Current smokers            6                                 8

At this point there are two options. The chosen<br />

treatment could simply be taken as the one with the<br />

lower score; or we could introduce a random element.<br />

We use weighted randomisation so that there is a high<br />

chance (eg 80%) of each participant getting the<br />

treatment that minimises the imbalance. The use of a<br />

random element will slightly worsen the overall imbalance<br />

between the groups, but balance will be much<br />

better for the chosen variables than with simple<br />

randomisation. A random element also makes the allocation<br />

more unpredictable, although minimisation is a<br />

secure allocation system when used <strong>by</strong> an independent<br />

person.<br />

After the treatment is determined for the current<br />

participant the numbers in each group are updated<br />

and the process repeated for each subsequent<br />

participant. If at any time the totals for the two groups<br />

are the same, then the choice should be made using<br />

simple randomisation. The method extends to trials of<br />

more than two treatments.<br />
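The allocation rule described above is easy to automate. The sketch below is a minimal illustration in Python, not the program used in practice (the authors point to the free minim program in reference 6); the factor names, category codes, and the 80% weighting are illustrative assumptions.

```python
import random

# Minimal sketch of minimisation for a two arm trial.
# Factor names and categories are illustrative only.
FACTORS = ["sex", "age_group", "ethnicity", "smoker"]

def minimise(new_patient, allocated, arms=("behavioural", "nutrition"), p_preferred=0.8):
    """Return the arm for new_patient, given the participants already allocated.

    new_patient: dict of factor -> category, eg {"sex": "F", "age_group": ">50", ...}
    allocated:   list of (arm, patient_dict) tuples for earlier participants.
    """
    # For each arm, count existing participants who share each of the new
    # patient's characteristics, and sum these counts over the factors.
    totals = {arm: 0 for arm in arms}
    for arm, patient in allocated:
        for factor in FACTORS:
            if patient[factor] == new_patient[factor]:
                totals[arm] += 1

    best, other = sorted(arms, key=lambda a: totals[a])
    if totals[best] == totals[other]:
        return random.choice(arms)            # tie: simple randomisation
    # Weighted randomisation: usually, but not always, choose the arm
    # that minimises the imbalance.
    return best if random.random() < p_preferred else other
```

Applied to the counts in table 2, the totals for the recruit described above would be 37 for behavioural and 33 for nutrition counselling, matching the worked example, so the nutrition arm would usually be chosen.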

Minimisation is a valid alternative to ordinary randomisation,<br />

2 3 5 and has the advantage, especially in<br />

small trials, that there will be only minor differences<br />

between groups in those variables used in the<br />

allocation process. Such balance is especially desirable<br />

where there are strong prognostic factors and modest<br />

treatment effects, such as oncology. Minimisation is<br />

best performed with the aid of software—for example,<br />

minim, a free program. 6 Its use makes trialists think<br />

about prognostic factors at the outset and helps ensure<br />

adherence to the protocol as a trial progresses. 7<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. How to randomise. <strong>BMJ</strong> 1999;319:703-4.<br />

2 Treasure T, MacRae KD. Minimisation: the platinum standard for trials.<br />

<strong>BMJ</strong> 1998;317:362-3.<br />

3 Scott NW, McPherson GC, Ramsay CR, Campbell MK. The method of<br />

minimization for allocation to clinical trials: a review. Control Clin Trials<br />

2002;23:662-74.<br />

4 Steptoe A, Perkins-Porras L, McKay C, Rink E, Hilton S, Cappuccio FP.<br />

Behavioural counselling to increase consumption of fruit and vegetables<br />

in low income adults: randomised trial. <strong>BMJ</strong> 2003;326:855-8.<br />

5 Buyse M, McEntegart D. Achieving balance in clinical trials: an<br />

unbalanced view from the European regulators. Applied Clin Trials<br />

2004;13:36-40.<br />

6 Evans S, Royston P, Day S. Minim: allocation <strong>by</strong> minimisation in clinical<br />

trials. http://www-users.york.ac.uk/~mb55/guide/minim.htm (accessed

24 October 2004).<br />

7 Day S. Commentary: Treatment allocation <strong>by</strong> the method of<br />

minimisation. <strong>BMJ</strong> 1999;319:947-8.<br />

Education and debate
Cancer Research UK Medical Statistics Group, Centre for Statistics in Medicine, Oxford OX3 7LF
Douglas G Altman, professor of statistics in medicine
Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics
Correspondence to: D Altman doug.altman@cancer.org.uk
BMJ 2005;330:843




Statistics <strong>Notes</strong><br />

Standard deviations and standard errors<br />

Douglas G Altman, J Martin <strong>Bland</strong><br />

The terms “standard error” and “standard deviation”<br />

are often confused. 1 The contrast between these two<br />

terms reflects the important distinction between data<br />

description and inference, one that all researchers<br />

should appreciate.<br />

The standard deviation (often SD) is a measure of<br />

variability. When we calculate the standard deviation of a<br />

sample, we are using it as an estimate of the variability of<br />

the population from which the sample was drawn. For<br />

data with a normal distribution, 2 about 95% of individuals<br />

will have values within 2 standard deviations of the<br />

mean, the other 5% being equally scattered above and<br />

below these limits. Contrary to popular misconception,<br />

the standard deviation is a valid measure of variability<br />

regardless of the distribution. About 95% of observations<br />

of any distribution usually fall within the 2 standard<br />

deviation limits, though those outside may all be at one<br />

end. We may choose a different summary statistic, however,<br />

when data have a skewed distribution. 3<br />

When we calculate the sample mean we are usually<br />

interested not in the mean of this particular sample, but<br />

in the mean for individuals of this type—in statistical<br />

terms, of the population from which the sample comes.<br />

We usually collect data in order to generalise from them<br />

and so use the sample mean as an estimate of the mean<br />

for the whole population. Now the sample mean will<br />

vary from sample to sample; the way this variation<br />

occurs is described <strong>by</strong> the “sampling distribution” of the<br />

mean. We can estimate how much sample means will<br />

vary from the standard deviation of this sampling distribution,<br />

which we call the standard error (SE) of the estimate<br />

of the mean. As the standard error is a type of<br />

standard deviation, confusion is understandable.<br />

Another way of considering the standard error is as a<br />

measure of the precision of the sample mean.<br />

The standard error of the sample mean depends<br />

on both the standard deviation and the sample size, <strong>by</strong><br />

the simple relation SE = SD/√(sample size). The standard<br />

error falls as the sample size increases, as the extent<br />

of chance variation is reduced—this idea underlies the<br />

sample size calculation for a controlled trial, for<br />



example. By contrast the standard deviation will not<br />

tend to change as we increase the size of our sample.<br />

So, if we want to say how widely scattered some<br />

measurements are, we use the standard deviation. If we<br />

want to indicate the uncertainty around the estimate of<br />

the mean measurement, we quote the standard error of<br />

the mean. The standard error is most useful as a means<br />

of calculating a confidence interval. For a large sample,<br />

a 95% confidence interval is obtained as the values<br />

1.96×SE either side of the mean. We will discuss confidence<br />

intervals in more detail in a subsequent Statistics<br />

Note. The standard error is also used to calculate P values<br />

in many circumstances.<br />
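As a concrete illustration of the distinction, the short sketch below computes both quantities for a made-up sample of 25 weights; the data and variable names are invented, and the 1.96 multiplier is the large sample value discussed above.

```python
import math
import statistics

# Illustrative data only: 25 hypothetical weights (kg).
weights = [62.1, 71.4, 68.0, 74.3, 65.2, 80.1, 59.8, 77.5, 70.2, 66.9,
           73.8, 69.5, 64.1, 72.0, 75.6, 61.3, 78.9, 67.4, 70.9, 63.7,
           76.2, 68.8, 71.1, 65.8, 69.0]

n = len(weights)
mean = statistics.mean(weights)
sd = statistics.stdev(weights)      # describes the scatter of individuals
se = sd / math.sqrt(n)              # describes the precision of the sample mean

# Large sample 95% confidence interval for the population mean.
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(f"mean {mean:.1f} kg, SD {sd:.1f}, SE {se:.2f}, 95% CI {ci[0]:.1f} to {ci[1]:.1f}")
```

Doubling the sample size would leave the SD essentially unchanged but shrink the SE (and hence the confidence interval) by a factor of about √2.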

The principle of a sampling distribution applies to<br />

other quantities that we may estimate from a sample,<br />

such as a proportion or regression coefficient, and to<br />

contrasts between two samples, such as a risk ratio or<br />

the difference between two means or proportions. All<br />

such quantities have uncertainty due to sampling variation,<br />

and for all such estimates a standard error can be<br />

calculated to indicate the degree of uncertainty.<br />

In many publications a ± sign is used to join the<br />

standard deviation (SD) or standard error (SE) to an<br />

observed mean—for example, 69.4±9.3 kg. That<br />

notation gives no indication whether the second figure<br />

is the standard deviation or the standard error (or<br />

indeed something else). A review of 88 articles<br />

published in 2002 found that 12 (14%) failed to<br />

identify which measure of dispersion was reported<br />

(and three failed to report any measure of variability). 4<br />

The policy of the <strong>BMJ</strong> and many other journals is to<br />

remove ± signs and request authors to indicate clearly<br />

whether the standard deviation or standard error is<br />

being quoted. All journals should follow this practice.<br />

Competing interests: None declared.<br />

1 Nagele P. Misuse of standard error of the mean (SEM) when reporting<br />

variability of a sample. A critical evaluation of four anaesthesia journals.<br />

Br J Anaesthesiol 2003;90:514-6.<br />

2 Altman DG, <strong>Bland</strong> <strong>JM</strong>. The normal distribution. <strong>BMJ</strong> 1995;310:298.<br />

3 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Quartiles, quintiles, centiles, and other quantiles.<br />

<strong>BMJ</strong> 1994;309:996.<br />

4 Olsen CH. Review of the use of statistics in Infection and Immunity. Infect<br />

Immun 2003;71:6689-92.<br />

Education and debate
Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman, professor of statistics in medicine
Department of Health Sciences, University of York, York YO10 5DD
J Martin Bland, professor of health statistics
Correspondence to: Prof Altman doug.altman@cancer.org.uk
BMJ 2005;331:903



Practice
Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman, professor of statistics in medicine
MRC Clinical Trials Unit, London NW1 2DA
Patrick Royston, professor of statistics
Correspondence to: Professor Altman doug.altman@cancer.org.uk
BMJ 2006;332:1080


Statistics <strong>Notes</strong><br />

The cost of dichotomising continuous variables<br />

Douglas G Altman, Patrick Royston<br />

Measurements of continuous variables are made in all<br />

branches of medicine, aiding in the diagnosis and<br />

treatment of patients. In clinical practice it is helpful to<br />

label individuals as having or not having an attribute,<br />

such as being “hypertensive” or “obese” or having<br />

“high cholesterol,” depending on the value of a

continuous variable.<br />

Categorisation of continuous variables is also common<br />

in clinical research, but here such simplicity is<br />

gained at some cost. Though grouping may help data<br />

presentation, notably in tables, categorisation is unnecessary<br />

for statistical analysis and it has some serious<br />

drawbacks. Here we consider the impact of converting<br />

continuous data to two groups (dichotomising), as this<br />

is the most common approach in clinical research. 1<br />

What are the perceived advantages of forcing all<br />

individuals into two groups? A common argument is<br />

that it greatly simplifies the statistical analysis and leads<br />

to easy interpretation and presentation of results. A<br />

binary split—for example, at the median—leads to a<br />

comparison of groups of individuals with high or low<br />

values of the measurement, leading in the simplest case<br />

to a t test or χ² test and an estimate of the difference

between the groups (with its confidence interval). There<br />

is, however, no good reason in general to suppose that<br />

there is an underlying dichotomy, and if one exists there<br />

is no reason why it should be at the median. 2<br />

Dichotomising leads to several problems. Firstly,<br />

much information is lost, so the statistical power to<br />

detect a relation between the variable and patient outcome<br />

is reduced. Indeed, dichotomising a variable at<br />

the median reduces power <strong>by</strong> the same amount as<br />

would discarding a third of the data. 2 3 Deliberately discarding<br />

data is surely inadvisable when research<br />

studies already tend to be too small. Dichotomisation<br />

may also increase the risk of a positive result being a<br />

false positive. 4 Secondly, one may seriously underestimate<br />

the extent of variation in outcome between<br />

groups, such as the risk of some event, and<br />

considerable variability may be subsumed within each<br />

group. Individuals close to but on opposite sides of the<br />

cutpoint are characterised as being very different<br />

rather than very similar. Thirdly, using two groups<br />

conceals any non-linearity in the relation between the<br />

variable and outcome. Presumably, many who dichotomise<br />

are unaware of the implications.<br />

If dichotomisation is used where should the<br />

cutpoint be? For a few variables there are recognised<br />

cutpoints, such as > 25 kg/m 2 to define “overweight”<br />

based on body mass index. For some variables, such as<br />

age, it is usual to take a round number, usually a multiple<br />

of five or 10. The cutpoint used in previous studies<br />

may be adopted. In the absence of a prior cutpoint<br />

the most common approach is to take the sample<br />

median. However, using the sample median implies<br />

that various cutpoints will be used in different studies<br />

so that their results cannot easily be compared,<br />

seriously hampering meta-analysis of observational<br />

studies. 5 Nevertheless, all these approaches are<br />

preferable to performing several analyses and<br />

choosing that which gives the most convincing result.<br />

Use of this so called “optimal” cutpoint (usually that<br />

giving the minimum P value) runs a high risk of a spuriously<br />

significant result; the difference in the outcome<br />

variable between the groups will be overestimated,<br />

perhaps considerably; and the confidence interval will be too narrow. 6 7 This strategy should never be used.

When regression is being used to adjust for the effect<br />

of a confounding variable, dichotomisation will run the<br />

risk that a substantial part of the confounding remains. 4 7

Dichotomisation is not much used in epidemiological<br />

studies, where the use of several categories is preferred.<br />

Using multiple categories (to create an “ordinal”<br />

variable) is generally preferable to dichotomising. With<br />

four or five groups the loss of information can be quite<br />

small, but there are complexities in analysis.<br />

Instead of categorising continuous variables, we prefer<br />

to keep them continuous. We could then use<br />

linear regression rather than a two sample t test, for<br />

example. If we were concerned that a linear regression<br />

would not truly represent the relation between the<br />

outcome and predictor variable, we could explore<br />

whether some transformation (such as a log transformation)<br />

would be helpful. 7 8 As an example, in a regression

analysis to develop a prognostic model for patients with<br />

primary biliary cirrhosis, a carefully developed model<br />

with bilirubin as a continuous explanatory variable<br />

explained 31% more of the variability in the data than<br />

when bilirubin distribution was split at the median. 7<br />
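A small simulation can make the cost visible. The sketch below assumes a simple linear relation between a continuous predictor and a continuous outcome, then compares a regression analysis that keeps the predictor continuous with a t test after splitting at the median; the variables, effect size, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(0, 1, n)               # continuous predictor (illustrative)
y = 0.5 * x + rng.normal(0, 1, n)     # outcome linearly related to x

# Analysis 1: keep x continuous (simple linear regression).
cont = stats.linregress(x, y)

# Analysis 2: dichotomise x at the sample median and compare the two groups.
high = x > np.median(x)
dich = stats.ttest_ind(y[high], y[~high])

print(f"continuous:   slope = {cont.slope:.2f}, P = {cont.pvalue:.4f}")
print(f"dichotomised: difference = {y[high].mean() - y[~high].mean():.2f}, "
      f"P = {dich.pvalue:.4f}")
```

Across repeated runs the continuous analysis will usually give stronger evidence of the association, reflecting the information discarded by the median split.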

Competing interests: None declared.<br />

1 Del Priore G, Zandieh P, Lee MJ. Treatment of continuous data as<br />

categoric variables in obstetrics and gynecology. Obstet Gynecol<br />

1997;89:351-4.<br />

2 MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of<br />

dichotomization of quantitative variables. Psychol Meth 2002;7:19-40.<br />

3 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.<br />

4 Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous<br />

confounding variable is categorized in logistic regression analyses.<br />

Stat Med 2004;23:1159-78.<br />

5 Buettner P, Garbe C, Guggenmoos-Holzmann I. Problems in defining<br />

cutoff points of continuous prognostic factors: example of tumor<br />

thickness in primary cutaneous melanoma. J Clin Epidemiol<br />

1997;50:1201-10.<br />

6 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using<br />

“optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer<br />

Inst 1994;86:829-35.<br />

7 Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous<br />

predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.<br />

8 Royston P, Sauerbrei W. Building multivariable regression models with<br />

continuous covariates in clinical epidemiology–with an emphasis on<br />

fractional polynomials. Methods Inf Med 2005;44:561-71.<br />

Endpiece<br />

Reading and reflecting<br />

Reading without reflecting is like eating without<br />

digesting.<br />

Edmund Burke<br />

Kamal Samanta, retired general practitioner,<br />

Den<strong>by</strong> Dale, West Yorkshire<br />



PRACTICE
STATISTICS NOTES
Missing data
Douglas G Altman,1 J Martin Bland2
1 Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD
Correspondence to: Professor Altman doug.altman@cancer.org.uk
BMJ 2007;334:424
doi:10.1136/bmj.38977.682025.2C

Almost all studies have some missing observations.<br />

Yet textbooks and software commonly assume that<br />

data are complete, and the topic of how to handle<br />

missing data is not often discussed outside statistics<br />

journals.<br />

There are many types of missing data and different<br />

reasons for data being missing. Both issues affect the<br />

analysis. Some examples are:<br />

(1) In a postal questionnaire survey not all the selected individuals respond;
(2) In a randomised trial some patients are lost to follow-up before the end of the study;
(3) In a multicentre study some centres do not measure a particular variable;
(4) In a study in which patients are assessed frequently some data are missing at some time points for unknown reasons;
(5) Occasional data values for a variable are missing because some equipment failed;
(6) Some laboratory samples are lost in transit or technically unsatisfactory;
(7) In a magnetic resonance imaging study some very obese patients are excluded as they are too large for the machine;
(8) In a study assessing quality of life some patients die during the follow-up period.

The prime concern is always whether the available<br />

data would be biased. If the fact that an observation

is missing is unrelated both to the unobserved value<br />

(and hence to patient outcome) and the data that are<br />

available this is called “missing completely at random.”<br />

For cases 5 and 6 above that would be a safe<br />

assumption. Sometimes data are missing in a predictable<br />

way that does not depend on the missing value<br />

itself but which can be predicted from other data—as<br />

in case 3. Confusingly, this is known as “missing at<br />

random.” In the common cases 1 and 2, however, the

missing data probably depend on unobserved values,<br />

called “missing not at random,” and hence their lack<br />

may lead to bias.<br />

In general, it is important to be able to examine

whether missing data may have introduced bias. For<br />

example, if we know nothing at all about the nonresponders<br />

to a survey then we can do little to explore<br />

possible bias. Thus a high response rate is necessary for<br />

reliable answers. 1 Sometimes, though, some information<br />

is available. For example, if the survey sample is<br />

chosen from a register that includes age and sex, then<br />

the responders and non-responders can be compared<br />

on these variables. At the very least this gives some<br />

pointers to the representativeness of the sample. Nonresponders<br />

often (but not always) have a worse medical<br />

prognosis than those who respond.<br />

A few missing observations are a minor nuisance,<br />

but a large amount of missing data is a major threat<br />

to a study’s integrity. Non-response is a particular<br />

problem in pair-matched studies, such as some casecontrol<br />

studies, as it is unclear how to analyse data<br />

from the unmatched individuals. Loss of patients<br />

also reduces the power of the trial. Where losses are<br />

expected it is wise to increase the target sample size<br />

to allow for losses. This cannot eliminate the potential<br />

bias, however.<br />

Missing data are much more common in retrospective<br />

studies, in which routinely collected data are<br />

subsequently used for a different purpose. 2 When information<br />

is sought from patients’ medical notes, the notes<br />

often do not say whether or not a patient was a smoker<br />

or had a particular procedure carried out. It is tempting

to assume that the answer is no when there is no<br />

indication that the answer is yes, but this is generally<br />

unwise.<br />

No really satisfactory solution exists for missing data,<br />

which is why it is important to try to maximise data<br />

collection. The main ways of handling missing data in<br />

analysis are: (a) omitting variables which have many<br />

missing values; (b) omitting individuals who do not<br />

have complete data; and (c) estimating (imputing) what<br />

the missing values were.<br />

Omitting everyone without complete data is known<br />

as complete case (or available case) analysis and is<br />

probably the most common approach. When only a<br />

very few observations are missing little harm will be<br />

done, but when many are missing omitting all patients<br />

without full data might result in a large proportion of<br />

the data being discarded, with a major loss of statistical<br />

power. The results may be biased unless the data<br />

are missing completely at random. In general it is

advisable not to include in an analysis any variable<br />

that is not available for a large proportion of the sample.<br />

The main alternative approach to case deletion is<br />

imputation, where<strong>by</strong> missing values are replaced <strong>by</strong><br />

some plausible value predicted from that individual’s<br />

available data. Imputation has been the topic of much

recent methodological work; we will consider some of<br />

the simpler methods in a separate Statistics Note.<br />
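The two main analysis strategies can be illustrated with a toy data frame. The sketch below is not a recommendation of single mean imputation, which understates uncertainty; the variable names and values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a partly missing covariate (systolic blood pressure).
df = pd.DataFrame({
    "age":     [54, 61, 47, 70, 58],
    "sbp":     [142, np.nan, 128, 155, np.nan],
    "outcome": [1, 0, 0, 1, 1],
})

# (b) Complete case analysis: drop anyone with any missing value.
complete = df.dropna()

# (c) A very simple single imputation: replace missing values by the mean of
# the observed values (shown only to illustrate the idea of imputation).
imputed = df.fillna({"sbp": df["sbp"].mean()})

print(len(df), len(complete))        # 5 participants, only 3 with complete data
print(imputed["sbp"].tolist())       # missing values replaced by the observed mean
```

Even in this tiny example, complete case analysis discards 40% of the participants, and the results would be unbiased only if those values were missing completely at random.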

Competing interests: None declared.<br />

1 Evans SJW. Good surveys guide. <strong>BMJ</strong> 1991;302:302-3.<br />

2 Burton A, Altman DG. Missing covariate data within cancer prognostic<br />

studies: a review of current reporting and proposed guidelines. Br J<br />

Cancer 2004;91:4-8.<br />

This is part of a series of occasional articles on statistics and handling<br />

data in research.<br />



PRACTICE
STATISTICS NOTES
Parametric v non-parametric methods for data analysis
Douglas G Altman,1 J Martin Bland2
1 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD
2 Department of Health Sciences, University of York, York YO10 5DD
Correspondence to: Professor Altman doug.altman@csm.ox.ac.uk
Cite this as: BMJ 2009;338:a3167
doi: 10.1136/bmj.a3167

Continuous data arise in most areas of medicine.<br />

Familiar clinical examples include blood pressure,<br />

ejection fraction, forced expiratory volume in 1 second<br />

(FEV 1 ), serum cholesterol, and anthropometric measurements.<br />

Methods for analysing continuous data fall<br />

into two classes, distinguished <strong>by</strong> whether or not they<br />

make assumptions about the distribution of the data.<br />

Theoretical distributions are described <strong>by</strong> quantities<br />

called parameters, notably the mean and standard<br />

deviation. 1 Methods that use distributional assumptions<br />

are called parametric methods, because we estimate<br />

the parameters of the distribution assumed for the data.<br />

Frequently used parametric methods include t tests and<br />

analysis of variance for comparing groups, and least<br />

squares regression and correlation for studying the relation<br />

between variables. All of the common parametric<br />

methods (“t methods”) assume that in some way the data<br />

follow a normal distribution and also that the spread of<br />

the data (variance) is uniform either between groups or<br />

across the range being studied. For example, the two<br />

sample t test assumes that the two samples of observations<br />

come from populations that have normal distributions<br />

with the same standard deviation. The importance<br />

of the assumptions for t methods diminishes as sample<br />

size increases.<br />

Alternative methods, such as the sign test, Mann-<br />

Whitney test, and rank correlation, do not require the<br />

data to follow a particular distribution. They work<br />

<strong>by</strong> using the rank order of observations rather than<br />

the measurements themselves. Methods which do not<br />

require us to make distributional assumptions about<br />

the data, such as the rank methods, are called non-parametric<br />

methods. The term non-parametric applies to<br />

the statistical method used to analyse data, and is not<br />

a property of the data. 1 As tests of significance, rank<br />

methods have almost as much power as t methods to<br />

detect a real difference when samples are large, even<br />

for data which meet the distributional requirements.<br />

Non-parametric methods are most often used to<br />

analyse data which do not meet the distributional<br />

requirements of parametric methods. In particular,<br />


skewed data are frequently analysed <strong>by</strong> non-parametric<br />

methods, although data transformation can often make<br />

the data suitable for parametric analyses. 2<br />

Data that are scores rather than measurements may<br />

have many possible values, such as quality of life scales<br />

or data from visual analogue scales, while others have<br />

only a few possible values, such as Apgar scores or<br />

stage of disease. Scores with many values are often analysed<br />

using parametric methods, whereas those with<br />

few values tend to be analysed using rank methods, but<br />

there is no clear boundary between these cases.<br />

To compensate for the advantage of being free of<br />

assumptions about the distribution of the data, rank<br />

methods have the disadvantage that they are mainly<br />

suited to hypothesis testing and no useful estimate is<br />

obtained, such as the average difference between two<br />

groups. Estimates and confidence intervals are easy<br />

to find with t methods. Non-parametric estimates and<br />

confidence intervals can be calculated, however, but<br />

depend on extra assumptions which are almost as<br />

strong as those for t methods. 3 Rank methods have the<br />

added disadvantage of not generalising to more complex<br />

situations, most obviously when we wish to use<br />

regression methods to adjust for several other factors.<br />

Rank methods can generate strong views, with<br />

some people preferring them for all analyses and<br />

others believing that they have no place in statistics. We<br />

believe that rank methods are sometimes useful, but<br />

parametric methods are generally preferable as they<br />

provide estimates and confidence intervals and generalise<br />

to more complex analyses.<br />

The choice of approach may also be related to sample<br />

size, as the distributional assumptions are more<br />

important for small samples. We consider the analysis<br />

of small data sets in a subsequent Statistics <strong>Notes</strong>.<br />
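To make the choice concrete, the sketch below analyses the same made-up, positively skewed samples three ways: a two sample t test on the raw data, the Mann-Whitney test, and a t test after log transformation. The numbers are invented and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Made-up, positively skewed measurements in two groups (eg serum concentrations).
a = np.array([1.2, 1.9, 2.3, 2.8, 3.5, 4.1, 5.0, 7.2, 9.8, 15.4])
b = np.array([0.8, 1.1, 1.6, 2.0, 2.4, 2.9, 3.3, 4.6, 6.1, 8.7])

t_raw = stats.ttest_ind(a, b)                                  # parametric, raw scale
u_test = stats.mannwhitneyu(a, b, alternative="two-sided")     # rank based
t_log = stats.ttest_ind(np.log(a), np.log(b))                  # parametric, log scale

print(f"t test (raw scale): P = {t_raw.pvalue:.3f}")
print(f"Mann-Whitney test:  P = {u_test.pvalue:.3f}")
print(f"t test (log scale): P = {t_log.pvalue:.3f}")
```

Only the parametric analyses also yield an estimated difference (and confidence interval), on the raw or log scale respectively, which is the main reason given above for generally preferring them.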

Competing interests: None declared.<br />

Provenance and peer review: Commissioned, not externally peer reviewed.<br />

1 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Variables and parameters. <strong>BMJ</strong> 1999;318:1667.<br />

2 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Transforming data. <strong>BMJ</strong> 1996;312:770.<br />

3 Campbell MJ, Gardner MJ. Medians and their differences. In: Altman<br />

DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence.<br />

2nd ed. London: <strong>BMJ</strong> Books, 2000:36-44.<br />

First day jitters
On my first day of clinical rotations, I was confident that I wouldn’t be another clueless medical student. I had worked diligently in my basic science courses to learn the pathophysiology of clinically relevant diseases and rehearsed for hours on end how to perform an impressive history and physical examination. Judging from the case-based questions I was doing, it seemed that my first day would be a walk in the park.
Then I saw my first patient, and it was as though all the mental cue cards that I had meticulously arranged and rehearsed fell from my hands to the floor in utter disarray. I wouldn’t want to share with you all the embarrassment, but, suffice to say, a quick thinking resident had to add the words “… of gainful employment” to the end of a question that I had unthinkingly asked a 55 year old man: “So when would you say was your last period?”
Bharat Kumar, student, Saba University School of Medicine, Rye Brook, NY, USA bharatk85@gmail.com
Cite this as: BMJ 2009;339:b2380



1 Department of Health Sciences, University of York, York YO10 5DD
2 Centre for Statistics in Medicine, University of Oxford, Wolfson College Annexe, Oxford OX2 6UD
Correspondence to: Professor Bland mb55@york.ac.uk
Cite this as: BMJ 2009;338:a3166
doi: 10.1136/bmj.a3166

Parametric methods, including t tests, correlation,<br />

and regression, require the assumption that the data<br />

follow a normal distribution and that variances<br />

are uniform between groups or across ranges. 1 In<br />

small samples these assumptions are particularly<br />

important, so this setting seems ideal for rank (nonparametric)<br />

methods, which make no assumptions<br />

about the distribution of the data; they use the rank<br />

order of observations rather than the measurements<br />

themselves. 1 Unfortunately, rank methods are least<br />

effective in small samples. Indeed, for very small<br />

samples, they cannot yield a significant result<br />

whatever the data. For example, when using the<br />

Mann-Whitney test for comparing two samples of

fewer than four observations a statistically significant<br />

difference is impossible: any data give P>0.05.<br />

Similarly, the Wilcoxon paired test, the sign test,<br />

and Spearman’s and Kendall’s rank correlation coefficients<br />

cannot produce P<0.05 for very small samples.



paired sample test, but this method assumes that the<br />

differences have a symmetrical distribution, which<br />

they do not. The sign test is preferred here; it tests<br />

the null hypothesis that non-zero differences are<br />

equally likely to be positive or negative, using the<br />

binomial distribution. We have 1 negative and 11<br />

positive differences, which gives P=0.006. Hence<br />

the original authors failed to detect a difference<br />

because they used an inappropriate analysis.<br />
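The sign test calculation quoted above can be reproduced directly from the binomial distribution. A minimal check (assuming scipy 1.7 or later for stats.binomtest):

```python
from scipy import stats

# Sign test for the 12 non-zero differences described above:
# 1 negative and 11 positive, where under the null hypothesis each
# difference is positive with probability 0.5.
result = stats.binomtest(k=1, n=12, p=0.5, alternative="two-sided")
print(f"P = {result.pvalue:.3f}")   # 0.006, matching the value quoted above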

We have often come across the idea that we<br />

should not use t distribution methods for small samples<br />

but should instead use rank based methods.<br />

The statement is sometimes that we should not use t<br />

methods at all for samples of fewer than six observations.<br />

4 But, as we noted, rank based methods cannot<br />

produce anything useful for such small samples.<br />

Draw on general experience

The aversion to parametric methods for small<br />

samples may arise from the inability to assess the distribution shape when there are so few observations. How can we tell whether data follow a normal distribution if we have only a few observations? The answer is that we have not only the data to be analysed, but usually also experience of other sets of measurements of the same thing. In addition, general experience tells us that body size measurements are usually approximately normal, as are the logarithms of many blood concentrations and the square roots of counts.

We thank Jonathan Cowley for the data in table 1.
Competing interests: None declared.
Provenance and peer review: Commissioned, not externally peer reviewed.
1 Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;338:a3167.
2 Baker CJ, Kasper DL, Edwards MS, Schiffman G. Influence of preimmunization antibody levels on the specificity of the immune response to related polysaccharide antigens. N Engl J Med 1980;303:173-8.
3 Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991:224-5.
4 Siegel S. Nonparametric statistics for the behavioral sciences. 1st ed. Tokyo: McGraw-Hill Kogakusha, 1956:32.

Research methods and reporting<br />

Research methods and reporting is for “how to”<br />

articles—those that discuss the nuts and bolts of doing<br />

and writing up research, are actionable and readable,<br />

and will warrant appraisal <strong>by</strong> the <strong>BMJ</strong> ’s research team<br />

and statisticians. These articles should be aimed at a<br />

general medical audience that includes doctors of all<br />

disciplines and other health professionals working<br />

in and outside the UK. You should not assume that<br />

readers will know about organisations or practices that<br />

are specific to a single discipline.<br />

We welcome articles on all kinds of medical and<br />

health services research methods that will be relevant<br />

and useful to <strong>BMJ</strong> readers, whether that research<br />

is quantitative or qualitative, clinical or not. This<br />

includes articles that propose and explain practical and<br />

theoretical developments in research methodology and<br />

for those on improving the clarity and transparency of<br />

reports about research studies, protocols, and results.<br />

This section is for the “how?” of research, while<br />

the “what, why, when, and who cares?” will usually<br />

belong elsewhere. Studies evaluating ways to<br />

conduct and report research should go to the <strong>BMJ</strong> ’s<br />

Research section; articles debating research concepts,<br />

frameworks, and translation into practice and policy<br />

should go to Analysis, Editorials, or Features; and those<br />

expressing personal opinions should go to Personal<br />

View.<br />

Articles for Research methods and reporting should<br />

include:<br />

• Up to 2000 words set out under informative<br />

subheadings. For some submissions, this might be<br />

published in full on bmj.com with a shorter version in<br />

the print <strong>BMJ</strong><br />

• A separate introduction (“standfirst”) of 100-150 words<br />

spelling out what the article is about and emphasising<br />

its importance<br />


• Explicit evidence to support key statements and a<br />

brief explanation of the strength of the evidence<br />

(published trials, systematic reviews, observational<br />

studies, expert opinion, etc)<br />

• No more than 20 references, in Vancouver style,<br />

presenting the evidence on which the key statements<br />

in the paper are made<br />

• Up to three tables, boxes, or illustrations (clinical<br />

photographs, imaging, line drawings, figures—we<br />

welcome colour) to enhance the text and add to<br />

or substantiate key points made in the body of the<br />

article<br />

• A summary box with up to four short single<br />

sentences, in the form of bullet points, highlighting<br />

the article’s main points<br />

• A box of linked information such as websites for<br />

those who want to pursue the subject in more depth<br />

(this is optional)<br />

• Web extras: we may be able to publish on bmj.<br />

com some additional boxes, figures, references (in a<br />

separate reference list numbered w1,w2,w3, etc, and<br />

marked as such in the main text of the article)<br />

• Suggestions for linked podcasts or video clips, as<br />

appropriate<br />

• A statement of sources and selection criteria. As well<br />

as the standard statements of funding, competing<br />

interests, and contributorship, please provide at<br />

the end of the paper a 100-150 word paragraph<br />

(excluded from the word count) explaining the<br />

paper’s provenance. This should include the<br />

relevant experience or expertise of the author(s)<br />

and the sources of information used to prepare the<br />

paper. It should also give details of each author’s<br />

role in producing the article and name one as<br />

guarantor.<br />



<strong>BMJ</strong> 2011;342:d561 doi: 10.1136/bmj.d561 Page 1 of 3<br />

Research Methods & Reporting<br />

RESEARCH METHODS & REPORTING<br />

STATISTICS NOTES<br />

Comparisons within randomised groups can be very<br />

misleading<br />

J Martin <strong>Bland</strong> professor of health statistics 1 , Douglas G Altman professor of statistics in medicine 2<br />

1 Department of Health Sciences, University of York, York YO10 5DD; 2 Centre for Statistics in Medicine, University of Oxford, Oxford OX2 6UD<br />

When we randomise trial participants into two or more<br />

intervention groups, we do this to remove bias; the groups will,<br />

on average, be comparable in every respect except the treatment<br />

which they receive. Provided the trial is well conducted, without<br />

other sources of bias, any difference in the outcome of the<br />

groups can then reasonably be attributed to the different<br />

interventions received. In a previous note we discussed the<br />

analysis of those trials in which the primary outcome measure<br />

is also measured at baseline. We discussed several valid<br />

analyses, observing that “analysis of covariance” (a regression<br />

method) is the method of choice. 1<br />

Rather than comparing the randomised groups directly, however,<br />

researchers sometimes look at the change in the measurement<br />

between baseline and the end of the trial; they test whether there<br />

was a significant change from baseline, separately in each<br />

randomised group. They may then report that this difference is<br />

significant in one group but not in the other, and conclude that<br />

this is evidence that the groups, and hence the treatments, are<br />

different. One such example was a recent trial in which<br />

participants were randomised to receive either an “anti-ageing”<br />

cream or the vehicle as a placebo. 2 A wrinkle score was recorded<br />

at baseline and after six months. The authors gave the results<br />

of significance tests comparing the score with baseline for each<br />

group separately, reporting the active treatment group to have<br />

a significant difference (P=0.013) and the vehicle group not<br />

(P=0.11). Their interpretation was that the cosmetic cream<br />

resulted in significant clinical improvement in facial wrinkles.<br />

But we cannot validly draw this conclusion, because the lack<br />

of a significant difference in the vehicle group does not provide<br />

good evidence that the anti-ageing product is superior. 3<br />

The essential feature of a randomised trial is the comparison<br />

between groups. Within group analyses do not address a<br />

meaningful question: the question is not whether there is a<br />

change from baseline, but whether any change is greater in one<br />

group than the other. It is not possible to draw valid inferences<br />

<strong>by</strong> comparing P values. In particular, there is an inflated risk of<br />

a false positive result, which we shall illustrate with a simulation.<br />

Correspondence to: Professor M <strong>Bland</strong> martin.bland@<strong>york</strong>.ac.uk<br />

The table shows simulated data for a randomised trial with two<br />

groups of 30 participants. Data were drawn from the same<br />

population, so there is no systematic difference between the two<br />

groups. The true baseline measurements had a mean of 10.0<br />

with standard deviation (SD) 2.0, and the outcome measurement<br />

was equal to the baseline plus an increase of 0.5 and a random<br />

element with SD 1.0. The difference between mean outcomes<br />

is 0.22 (95% confidence interval –0.75 to 0.34, P=0.5), adjusting<br />

for the baseline <strong>by</strong> analysis of covariance. 1 The difference is<br />

not statistically significant, which is not surprising because we<br />

know that the null hypothesis of no difference in the population<br />

is true. If we compare baseline with outcome for each group<br />

using a paired t test, however, for group A the difference is<br />

statistically significant, P=0.03, for group B it is not significant,<br />

P = 0.2. These results are quite similar to those of the anti-ageing<br />

cream trial. 2<br />
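A simulation along these lines is easy to set up. The sketch below uses the same model (baseline mean 10, SD 2; a true change of 0.5 with residual SD 1 in both groups) but, for brevity, compares the groups with a t test on the change scores rather than the analysis of covariance used here; the seed and run count are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, runs = 30, 1000
one_sig_within = 0   # runs with exactly one "significant" within-group change
sig_between = 0      # runs with a significant between-group difference

for _ in range(runs):
    base_a = rng.normal(10, 2, n)
    base_b = rng.normal(10, 2, n)
    out_a = base_a + 0.5 + rng.normal(0, 1, n)   # identical true effect in both arms
    out_b = base_b + 0.5 + rng.normal(0, 1, n)

    p_a = stats.ttest_rel(out_a, base_a).pvalue                     # within group A
    p_b = stats.ttest_rel(out_b, base_b).pvalue                     # within group B
    p_ab = stats.ttest_ind(out_a - base_a, out_b - base_b).pvalue   # between groups

    one_sig_within += (p_a < 0.05) != (p_b < 0.05)
    sig_between += p_ab < 0.05

print(f"between-group test significant in {100 * sig_between / runs:.1f}% of runs")
print(f"exactly one within-group test significant in "
      f"{100 * one_sig_within / runs:.1f}% of runs")
```

The between-group comparison is "significant" in only about the nominal 5% of runs, whereas a substantial proportion of runs show the misleading pattern of one significant within-group change and one not, even though the two arms receive identical treatment.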

We would not wish to draw any conclusions from one<br />

simulation. In 1000 runs, the difference between groups had<br />

P



difference were 0.37 rather than 0.5, we would expect 50% of<br />

samples to have just one significant difference.<br />

The anti-ageing trial is not the only one where we have seen<br />

this misleading approach applied to randomised trial data. 3 We<br />

even found it once in the <strong>BMJ</strong>! 5<br />

Contributors: <strong>JM</strong>B and DGA jointly wrote and agreed the text, <strong>JM</strong>B did<br />

the statistical analysis.<br />

Competing interests: All authors have completed the Unified Competing<br />

Interest form at <strong>www</strong>.icmje.org/coi_disclosure.pdf (available on request<br />

from the corresponding author) and declare: no support from any<br />

organisation for the submitted work; no financial relationships with any<br />

organisations that might have an interest in the submitted work in the<br />

previous 3 years; no other relationships or activities that could appear<br />

to have influenced the submitted work.<br />

1 Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow-up<br />

measurements. <strong>BMJ</strong> 2001;323:1123-4.<br />

2 Watson REB, Ogden S, Cotterell LF, Bowden JJ, Bastrilles JY, Long SP, et al. A cosmetic<br />

‘anti-ageing’ product improves photoaged skin: a double-blind, randomized controlled<br />

trial. Br J Dermatol 2009;161:419-26.<br />

3 <strong>Bland</strong> <strong>JM</strong>. Evidence for an ‘anti-ageing’ product may not be so clear as it appears. Br J<br />

Dermatol 2009;161:1207-8.<br />

4 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Interaction revisited: the difference between two estimates. <strong>BMJ</strong><br />

2003;326:219.<br />

5 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Informed consent. <strong>BMJ</strong> 1993;306:928.<br />

Cite this as: <strong>BMJ</strong> 2011;342:d561<br />




Table 1 | Simulated data from a randomised trial comparing two groups of 30 patients, with a true change from baseline but no difference between groups (sorted by baseline values within each group)

              Group A                              Group B
No.   Baseline  6 months  Change       Baseline  6 months  Change
1     6.4       7.1        0.7         6.8       7.9        1.1
2     6.6       5.6       -1.0         7.2       7.5        0.3
3     7.3       8.3        1.0         7.2       6.9       -0.3
4     7.7       9.1        1.4         7.4       6.9       -0.5
5     7.7       9.5        1.8         7.5       8.3        0.8
6     7.9       9.6        1.7         7.5       9.4        1.9
7     8.0       8.5        0.5         8.3       9.0        0.7
8     8.0       8.5        0.5         8.4       8.8        0.4
9     8.1       9.1        1.0         8.7       8.0       -0.7
10    9.2       9.6        0.4         9.0       7.2       -1.8
11    9.3       8.7       -0.6         9.2       7.1       -2.1
12    9.6       10.7       1.1         9.6       10.6       1.0
13    9.7       9.0       -0.7         9.9       11.0       1.1
14    9.8       9.0       -0.8         10.1      11.5       1.4
15    9.8       8.0       -1.8         10.2      10.4       0.2
16    10.2      11.1       0.9         10.3      11.0       0.7
17    10.3      11.5       1.2         10.4      9.9       -0.5
18    10.6      9.1       -1.5         10.5      11.3       0.8
19    10.6      12.0       1.4         10.7      9.9       -0.8
20    10.7      13.2       2.5         10.8      10.7      -0.1
21    10.9      9.7       -1.2         10.8      11.8       1.0
22    11.1      12.2       1.1         11.1      10.0      -1.1
23    11.2      10.8      -0.4         11.1      13.2       2.1
24    11.8      11.9       0.1         11.4      11.8       0.4
25    12.3      12.2      -0.1         11.6      12.1       0.5
26    12.4      12.6       0.2         11.7      11.5      -0.2
27    13.1      15.0       1.9         12.0      12.7       0.7
28    13.2      13.8       0.6         12.3      13.7       1.4
29    13.3      14.1       0.8         13.7      12.6      -1.1
30    13.7      14.2       0.5         13.9      13.7      -0.2
Mean  10.02     10.46      0.44        9.98      10.21      0.24
SD    2.06      2.29       1.06        1.90      2.09       1.02



1 Department of Health Sciences,<br />

University of York, York YO10 5DD<br />

2 Centre for Statistics in Medicine,<br />

University of Oxford, Oxford<br />

OX2 6UD<br />

Correspondence to: Professor<br />

M <strong>Bland</strong><br />

martin.bland@<strong>york</strong>.ac.uk<br />

Cite this as: <strong>BMJ</strong> 2011;342:d556<br />

doi: 10.1136/bmj.d556<br />

Competing interests: All authors<br />

have completed the Unified<br />

Competing Interest form at <strong>www</strong>.<br />

icmje.org/coi_disclosure.pdf<br />

(available on request from the<br />

corresponding author) and declare:<br />

no support from any organisation<br />

for the submitted work; no<br />

financial relationships with any<br />

organisations that might have an<br />

interest in the submitted work in<br />

the previous three years; no other<br />

relationships or activities that could<br />

appear to have influenced the<br />

submitted work.<br />

Correlation between abdominal circumference and body mass index (BMI) in 1450 adult patients with diabetes
BMI group: r (individual category rows partly lost in extraction)
…35: 0.86
All patients: 0.85

STATISTICS NOTES<br />

Correlation in restricted ranges of data<br />

J Martin <strong>Bland</strong>, 1 Douglas G Altman 2<br />

In a study of 150 adult diabetic patients there was a strong<br />

correlation between abdominal circumference and body<br />

mass index (BMI) (r = 0.85). 1 The authors went on to report<br />

that the correlation differed in different BMI categories as<br />

shown in the table.<br />

The authors’ interpretation of these data was that in<br />

patients with low or high BMI values (BMI <25 or >35 kg/m²) the correlation was strong, but in those with

BMI values between 25 and 35 kg/m 2 the correlation was<br />

weak or missing. They concluded that measuring abdominal<br />

circumference is of particular importance in subjects with<br />

the most frequent BMI category (25 to 35 kg/m 2 ).<br />

When we restrict the range of one of the variables, a<br />

correlation coefficient will be reduced. For example, fig 1<br />

shows some BMI and abdominal circumference measurements<br />

from a different population. Although these people<br />

are from a rather thinner population, the correlation coefficient<br />

is very similar, r = 0.82 (P
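The restriction of range effect described here is easy to reproduce by simulation. In the sketch below the two variables are drawn from a bivariate normal distribution with a correlation of 0.85 chosen for illustration; restricting one variable to a narrow middle band markedly reduces the observed correlation, even though the underlying relation is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 2000, 0.85                      # illustrative values only

# Bivariate normal "BMI" and "abdominal circumference" style variables.
x = rng.normal(0, 1, n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)

r_all = np.corrcoef(x, y)[0, 1]
middle = (x > -0.5) & (x < 0.5)          # restrict x to a narrow middle band
r_restricted = np.corrcoef(x[middle], y[middle])[0, 1]

print(f"all data: r = {r_all:.2f}; restricted range: r = {r_restricted:.2f}")
```

With these settings the correlation falls from about 0.85 in the full sample to roughly 0.4 in the restricted band, without any change in the relation between the two variables.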


RESEARCH METHODS AND<br />

REPORTING, p 682<br />

1 Centre for Statistics in Medicine,<br />

University of Oxford, Oxford<br />

OX2 6UD<br />

2 Department of Health Sciences,<br />

University of York, Heslington, York<br />

YO10 5DD<br />

Correspondence to: D G Altman<br />

doug.altman@csm.ox.ac.uk<br />

Cite this as: <strong>BMJ</strong> 2011;343:d2090<br />

doi: 10.1136/bmj.d2090<br />

RESEARCH METHODS<br />

& REPORTING<br />

STATISTICS NOTES<br />

How to obtain the confidence interval from a P value<br />

Douglas G Altman, 1 J Martin <strong>Bland</strong> 2<br />

Confidence intervals (CIs) are widely used in reporting statistical<br />

analyses of research data, and are usually considered to be more informative than P values from significance tests. 1 2

Some published articles, however, report estimated effects and<br />

P values, but do not give CIs (a practice <strong>BMJ</strong> now strongly discourages).<br />

Here we show how to obtain the confidence interval<br />

when only the observed effect and the P value were reported.<br />

The method is outlined in the box below in which we have<br />

distinguished two cases.<br />

(a) Calculating the confidence interval for a difference<br />

We consider first the analysis comparing two proportions or<br />

two means, such as in a randomised trial with a binary outcome<br />

or a measurement such as blood pressure.<br />

For example, the abstract of a report of a randomised trial<br />

included the statement that “more patients in the zinc group<br />

than in the control group recovered <strong>by</strong> two days (49% v 32%,<br />

P=0.032).” 5 The difference in proportions was Est = 17 percentage<br />

points, but what is the 95% confidence interval (CI)?<br />

Following the steps in the box we calculate the CI as<br />

fo llows:<br />

z = –0.862+ √[0.743 – 2.404×log(0.032)] = 2.141;<br />

SE = 17/2.141 = 7.940, so that 1.96×SE = 15.56<br />

percentage points;<br />

95% CI is 17.0 – 15.56 to 17.0 + 15.56, or 1.4 to 32.6<br />

percentage points.<br />

(b) Calculating the confidence interval for a ratio (log<br />

transformation needed)<br />

The calculation is trickier for ratio measures, such as risk ratio,<br />

odds ratio, and hazard ratio. We need to log transform the estimate<br />

and then reverse the procedure, as described in a previous<br />

Statistics Note. 6<br />

For example, the abstract of a report of a cohort study<br />

includes the statement that “In those with a [diastolic blood<br />

pressure] reading of 95-99 mm Hg the relative risk was 0.30<br />

Steps to obtain the CI for an estimate of effect from the P value and the estimate (Est)<br />

(a) CI for a difference<br />

• Calculate the test statistic for a normal distribution test, z, from P 3 : z = −0.862 + √[0.743 −<br />

2.404×log(P)]<br />

• Calculate the standard error: SE = Est/z (ignoring minus signs)<br />

• Calculate the 95% CI: Est –1.96×SE to Est + 1.96×SE.<br />

(b) CI for a ratio<br />

• For a ratio measure, such as a risk ratio, the above formulas should be used with the<br />

estimate Est on the log scale (eg, the log risk ratio). Step 3 gives a CI on the log scale; to<br />

derive the CI on the natural scale we need to exponentiate (antilog) Est and its CI. 4<br />

All P values are two sided.<br />

All logarithms are natural (ie, to base e). 4<br />

For a 90% CI, we replace 1.96 <strong>by</strong> 1.65; for a 99% CI we use 2.57.<br />

(P=0.034).” 7 What is the confidence interval around 0.30?<br />

Following the steps in the box we calculate the CI:<br />

z = −0.862 + √[0.743 − 2.404×log(0.034)] = 2.117;<br />

Est = log (0.30) = −1.204;<br />

SE = −1.204/2.117 = −0.569 but we ignore the minus<br />

sign, so SE = 0.569, and 1.96×SE = 1.115;<br />

95% CI on log scale = −1.204 − 1.115 to −1.204 + 1.115 =<br />

−2.319 to −0.089;<br />

95% CI on natural scale = exp (−2.319) = 0.10 to exp<br />

(−0.089) = 0.91.<br />

Hence the relative risk is estimated to be 0.30 with 95% CI<br />

0.10 to 0.91.<br />
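As a check on the arithmetic, the steps in the box are easy to script. The sketch below (in Python; the function names z_from_p and ci_from_p are ours, not from the note) implements only the formulas given above and reproduces both worked examples.<br />

    import math

    def z_from_p(p):
        # Normal test statistic z from a two sided P value, using the
        # approximation of Lin (reference 3): z = -0.862 + sqrt(0.743 - 2.404 ln P)
        return -0.862 + math.sqrt(0.743 - 2.404 * math.log(p))

    def ci_from_p(est, p, ratio=False, z_crit=1.96):
        # 95% CI from an estimate and its two sided P value.
        # For ratio measures (risk, odds, or hazard ratios) set ratio=True:
        # the calculation is done on the log scale and exponentiated at the end.
        e = math.log(est) if ratio else est
        se = abs(e) / z_from_p(p)               # standard error, ignoring the sign
        lo, hi = e - z_crit * se, e + z_crit * se
        return (math.exp(lo), math.exp(hi)) if ratio else (lo, hi)

    # Zinc trial: difference of 17 percentage points, P=0.032 -> about 1.4 to 32.6
    print(ci_from_p(17, 0.032))
    # Cohort study: relative risk 0.30, P=0.034 -> about 0.10 to 0.91
    print(ci_from_p(0.30, 0.034, ratio=True))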

Limitations of the method<br />

The methods described can be applied in a wide range of<br />

settings, including the results from meta-analysis and regression<br />

analyses. The main context where they are not correct is<br />

in small samples where the outcome is continuous and the<br />

analysis has been done <strong>by</strong> a t test or analysis of variance,<br />

or the outcome is dichotomous and an exact method has<br />

been used for the confidence interval. However, even here<br />

the methods will be approximately correct in larger studies<br />

with, say, 60 patients or more.<br />

P values presented as inequalities<br />

Sometimes P values are very small and so are presented only as inequalities, such as P<0.001. The method can still be applied by taking P equal to the stated bound; the resulting confidence interval will then be somewhat too wide, erring on the side of caution.<br />

When a report says only P>0.05 or that the difference is not significant, things are more difficult. If we apply the method described here, using P=0.05, the confidence interval will be too narrow. We must remember that the estimate is even poorer than the calculated confidence interval would suggest.<br />

1 Gardner MJ, Altman DG. Confidence intervals rather than P values:<br />

estimation rather than hypothesis testing. <strong>BMJ</strong> 1986;292:746-50.<br />

2 Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. <strong>BMJ</strong> 2010;340:c869.<br />

3 Lin J-T. Approximating the normal tail probability and its inverse for use<br />

on a pocket calculator. Appl Stat 1989;38:69-70.<br />

4 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Statistics <strong>Notes</strong>. Logarithms. <strong>BMJ</strong> 1996;312:700.<br />

5 Roy SK, Hossain MJ, Khatun W, Chakraborty B, Chowdhury S, Begum A, et al. Zinc supplementation in children with cholera in Bangladesh: randomised controlled trial. <strong>BMJ</strong> 2008;336:266-8.<br />

6 Altman DG, <strong>Bland</strong> <strong>JM</strong>. Interaction revisited: the difference between two<br />

estimates. <strong>BMJ</strong> 2003;326:219.<br />

7 Lindblad U, Råstam L, Rydén L, Ranstam J, Isacsson S-O, Berglund G. Control of blood pressure and risk of first acute myocardial infarction: Skaraborg hypertension project. <strong>BMJ</strong> 1994;308:681.<br />



RESEARCH METHODS & REPORTING<br />


1 Centre for Statistics in Medicine,<br />

University of Oxford, Oxford<br />

OX2 6UD<br />

2 Department of Health Sciences,<br />

University of York, Heslington, York<br />

YO10 5DD<br />

Correspondence to: D G Altman<br />

doug.altman@csm.ox.ac.uk<br />

Cite this as: <strong>BMJ</strong> 2011;343:d2304<br />

doi: 10.1136/bmj.d2304<br />


How to obtain the P value from a confidence interval<br />

Douglas G Altman, 1 J Martin <strong>Bland</strong> 2<br />

We have shown in a previous Statistics Note 1 how we can<br />

calculate a confidence interval (CI) from a P value. Some<br />

published articles report confidence intervals, but do not<br />

give corresponding P values. Here we show how a confidence<br />

interval can be used to calculate a P value, should<br />

this be required. This might also be useful when the P<br />

value is given only imprecisely (eg, as P<0.05).<br />
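In outline, the reverse calculation rests on the usual normal theory relations: recover the standard error from the width of the 95% CI, form z = Est/SE, and convert z to a two sided P value. The sketch below is ours, written in Python under those assumptions (the function name p_from_ci and the use of math.erfc are our choices, not taken from the note):<br />

    import math

    def p_from_ci(est, lower, upper, ratio=False, z_crit=1.96):
        # Two sided P value for an estimate with its 95% CI, assuming an
        # approximately normal sampling distribution. For ratio measures
        # set ratio=True so the calculation is done on the log scale.
        if ratio:
            est, lower, upper = math.log(est), math.log(lower), math.log(upper)
        se = (upper - lower) / (2 * z_crit)  # the 95% CI spans 2 x 1.96 standard errors
        z = abs(est) / se
        return math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))

    # The example from the companion note: a difference of 17 percentage points
    # with 95% CI 1.4 to 32.6 gives P of about 0.03, consistent with P=0.032.
    print(round(p_from_ci(17, 1.4, 32.6), 3))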


RESEARCH METHODS & REPORTING<br />

1 Centre for Statistics in Medicine,<br />

University of Oxford, Oxford<br />

OX2 6UD<br />

2 Department of Health Sciences,<br />

University of York, York YO10 5DD<br />

Correspondence to: DG Altman<br />

doug.altman@csm.ox.ac.uk<br />

Cite this as: <strong>BMJ</strong> 2011;343:d570<br />

doi: 10.1136/bmj.d570<br />


Brackets (parentheses) in formulas<br />

Douglas G Altman, 1 J Martin <strong>Bland</strong> 2<br />

Each year, new health sciences postgraduate students in<br />

York are given a simple maths test. Each year the majority<br />

of them fail to calculate 20 – 3 × 5 correctly. According<br />

to the conventional rules of arithmetic, division and<br />

multiplication are done before addition and subtraction,<br />

so 20 − 3 × 5 = 20 − 15 = 5. Many students work from<br />

left to right and calculate 20 − 3 × 5 as 17 × 5 = 85. If<br />

that was what was actually meant, we would need to use<br />

brackets: (20 − 3) × 5 = 17 × 5 = 85. Brackets tell us that<br />

the enclosed part must be evaluated first. That convention<br />

is part of various mnemonic acronyms that indicate<br />

the order of operations, such as BODMAS (Brackets, Of<br />

(that is, power of), Divide, Multiply, Add, Subtract) and<br />

PEMDAS (Parentheses, Exponentiation, Multiplication,<br />

Division, Addition, Subtraction). 1<br />
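Programming languages follow the same precedence convention, so the point is easy to verify, for example at a Python prompt:<br />

    print(20 - 3 * 5)    # 5: multiplication is done before the subtraction
    print((20 - 3) * 5)  # 85: the brackets force the subtraction to be done first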

Schoolchildren learn the basic rules about how to construct<br />

and interpret mathematical formulas. 1 The conventions<br />

exist to ensure that there is absolutely no ambiguity,<br />

as mathematics (unlike prose) has no redundancy, so any<br />

mistake may have serious consequences. Our experience<br />

is that mistakes are quite common when formulas are presented<br />

in medical journal articles. A particular concern is<br />

that brackets are often omitted or misused. The following<br />

examples are typical, and we mean nothing personal <strong>by</strong><br />

choosing them.<br />

Example 1<br />

In a discussion of methods for analysing diagnostic test<br />

accuracy, Collinson 2 wrote:<br />

Sensitivity = TP/TP + FN<br />

where TP = true positive and FN = false negative. The formula<br />

should, of course, be:<br />

Sensitivity = TP/(TP + FN).<br />
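The difference is not merely typographical: typed into a calculator or a programming language as printed, the formula gives a nonsensical answer. A short Python illustration with made-up counts (TP = 80, FN = 20, not taken from the cited paper):<br />

    TP, FN = 80, 20                # illustrative counts only

    wrong = TP / TP + FN           # evaluated as (TP / TP) + FN = 1.0 + 20 = 21.0
    sensitivity = TP / (TP + FN)   # 0.8, the intended quantity
    print(wrong, sensitivity)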

Example 2<br />

For a non-statistical example, Leyland 3 wrote that the<br />

total optical power of the cornea is:<br />

P = P₁ + P₂ − t/n₂(P₁P₂)<br />

where P₁ = n₂ − n₁/r₁ and P₂ = n₃ − n₂/r₂. Here n₁, n₂, and n₃ are refractive indices, r₁, r₂, and t are distances in metres, and P, P₁, and P₂ are powers in dioptres. But he should have written P₁ = (n₂ − n₁)/r₁ and P₂ = (n₃ − n₂)/r₂. P₁ = n₂ − n₁/r₁ is clearly wrong dimensionally, as P₁ is in dioptres (1/metre), n₂ and n₁ are ratios and so pure numbers, and r₁ is in metres. Also, it is not clear whether t/n₂(P₁P₂) means (t/n₂)P₁P₂, which it does, or t/(n₂P₁P₂).<br />
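To see how much the missing brackets matter, the two readings can be evaluated numerically. The values below are rough textbook figures for the human cornea, used purely for illustration and not taken from Leyland's paper:<br />

    n1, n2, n3 = 1.000, 1.376, 1.336    # refractive indices: air, cornea, aqueous humour
    r1, r2, t = 0.0077, 0.0068, 0.0005  # radii of curvature and corneal thickness, in metres

    # With the brackets the formulas need:
    P1 = (n2 - n1) / r1                 # about +48.8 dioptres
    P2 = (n3 - n2) / r2                 # about -5.9 dioptres
    P = P1 + P2 - (t / n2) * P1 * P2    # about 43 dioptres, a plausible corneal power

    # Taken literally as printed:
    P1_wrong = n2 - n1 / r1             # about -128, dimensionally meaningless
    print(round(P, 1), round(P1_wrong, 1))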

Do such errors matter? Certainly. In our experience the<br />

calculations are usually correct in the paper, but anyone<br />

using the published formula would go wrong. Sometimes,<br />

however, the incorrect formula was used, as in the<br />

following case.<br />

Example 3<br />

In their otherwise exemplary evaluation of the chronic<br />

ankle instability scale, Eechaute et al 4 made a mistake in<br />

their formula for the minimal detectable change (MDC)<br />

or repeatability coefficient, 5 writing: MDC = 2.04 × √(2 ×<br />

SEM). Here SEM is the standard error of a measurement<br />

or within subject standard deviation. 5 This formula uses<br />

2.04 where 2 or 1.96 is more usual, 6 but, much more seriously,<br />

the SEM should not be included within the square<br />

root, as the brackets indicate. This might be dismissed<br />

as a simple typographical error, but the authors actually<br />

used this incorrect formula. Their value of SEM was 2.7<br />

points, so they calculated the minimal detectable change<br />

as 2.04 × √(2 × 2.7) = 4.7. They should have calculated<br />

2.04 × √(2) × 2.7 = 7.8. Their erroneous formula makes<br />

the scale appear considerably more reliable than it actually<br />

is. 6 The formula is also wrong in terms of dimensions,<br />

because the minimum clinical difference should<br />

be in the same units as the measurement, not in square<br />

root units.<br />
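Using the authors' reported SEM of 2.7 points, the two readings of the formula are easy to compare (a minimal Python check):<br />

    import math

    SEM = 2.7  # standard error of measurement, in scale points

    mdc_as_printed = 2.04 * math.sqrt(2 * SEM)  # 4.7: SEM wrongly inside the square root
    mdc_correct    = 2.04 * math.sqrt(2) * SEM  # 7.8: the repeatability coefficient
    print(round(mdc_as_printed, 1), round(mdc_correct, 1))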

Some mistakes in formulas may be present in a submitted<br />

manuscript, but others might be introduced in the<br />

publication process. For example, problems sometimes<br />

arise when a displayed formula is converted to an “in-text”<br />

formula as part of the editing, and the implications<br />

are not realised or not noticed <strong>by</strong> either editing staff or<br />

authors. Often it is necessary to insert brackets when<br />

reformatting a formula. So the simple displayed fraction, p over 1 − p, should be changed to p/(1 − p) if moved to the text.<br />

Formulas in published articles may be used <strong>by</strong> others,<br />

so mistakes may lead to substantive errors in research. It<br />

is essential that authors and editors check all formulas<br />

carefully.<br />

Acknowledgements: We are very grateful to Phil McShane for pointing out a<br />

mistake in an earlier version of this statistics note.<br />

Contributors: DGA and <strong>JM</strong>B jointly wrote and agreed the text.<br />

Competing interests: All authors have completed the Unified Competing<br />

Interest form at <strong>www</strong>.icmje.org/coi_disclosure.pdf (available on request from<br />

the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an<br />

interest in the submitted work in the previous three years; no other relationships<br />

or activities that could appear to have influenced the submitted work.<br />

1 Wikipedia. Order of operations. [cited 2010 Nov 23]. <strong>http</strong>://<br />

en.wikipedia.org/wiki/Order_of_operations.<br />

2 Collinson P. Of bombers, radiologists, and cardiologists: time to ROC.<br />

Heart 1998;80:215-7.<br />

3 Leyland M. Validation of Orbscan II posterior corneal curvature<br />

measurement for intraocular lens power calculation. Eye<br />

2004;18:357-60.<br />

4 Eechaute C, Vaes P, Duquet W. The chronic ankle instability scale:<br />

clinimetric properties of a multidimensional, patient-assessed<br />

instrument. Phys Ther Sport 2008;9:57-66.<br />

5 <strong>Bland</strong> <strong>JM</strong>, Altman DG. Statistics notes. Measurement error. <strong>BMJ</strong><br />

1996;312:744.<br />

6 <strong>Bland</strong> <strong>JM</strong>. Minimal detectable change. Phys Ther Sport 2009;10:39.<br />

