# Statistics at Square One - FOSJC UNESP



To my father

Statistics at Square One

Tenth edition

T D V Swinscow and M J Campbell

Professor of Medical Statistics, Institute of General Practice and Primary Care, School of Health and Related Research, University of Sheffield, Northern General Hospital, Sheffield

© BMJ Books 2002

BMJ Books is an imprint of the BMJ Publishing Group

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording and/or otherwise, without the prior written permission of the publishers.

First edition 1976, Second edition 1977, Third edition 1978, Fourth edition 1978, Fifth edition 1979, Sixth edition 1980, Seventh edition 1980, Eighth edition 1983, Ninth edition 1996, Second impression 1997, Third impression 1998, Fourth impression 1999, Tenth edition 2002

by BMJ Books, BMA House, Tavistock Square, London WC1H 9JR

www.bmjbooks.com

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0 7279 1552 5

Typeset by SIVA Math Setters, Chennai, India

Printed and bound in Spain by GraphyCems, Navarra

## Contents

Preface

1. Data display and summary
2. Summary statistics for quantitative and binary data
3. Populations and samples
4. Statements of probability and confidence intervals
5. Differences between means: type I and type II errors and power
6. Confidence intervals for summary statistics of binary data
7. The t tests
8. The χ² tests
9. Exact probability test
10. Rank score tests
11. Correlation and regression
12. Survival analysis
13. Study design and choosing a statistical test

Answers to exercises

Appendix

Index

## Preface

There are three main upgrades to this 10th edition. The first is to acknowledge that almost everyone now has access to a personal computer and to the World Wide Web, so the instructions for data analysis with a pocket calculator have been removed. Details of calculations remain for readers to replicate, since otherwise statistical analysis becomes too ‘black box’. References are made to computer packages. Some of the analyses are now available on standard spreadsheet packages such as Microsoft Excel, and there are extensions to such packages for more sophisticated analyses. Also, there is now extensive free software on the Web for doing most of the calculations described here. For a list of software and other useful statistical information on the Web, one can try http://www.execpc.com/~helberg/statistics.html or http://members.aol.com/johnp71/javastat.html. For a free general statistical package, I would suggest the Center for Disease Control program EPI-INFO at http://www.cdc.gov/epo/epi/epiinfo.htm. A useful glossary of statistical terms has been given through the STEPS project at http://www.stats.gla.ac.uk/steps/glossary/index.html. For simple online calculations such as chi-squared tests or Fisher’s exact test one could try SISA from http://home.clara.net/sisa/. Sample size calculations are available at http://www.stat.uiowa.edu/~rlenth/Power/index.html. For calculating confidence intervals I recommend a commercial package, the BMJ’s own CIA, details of which are available from http://www.medschool.soton.ac.uk/cia/. Of course, free software comes with no guarantee of accuracy, and for serious analysis one should use a commercial package such as SPSS, SAS, STATA, Minitab or StatsDirect.

The availability of software means that we are no longer restricted to tables to look up P values. I have retained the tables in this edition, because they are still useful, but the book now promotes exact statements of probability, such as P = 0·031, rather than 0·01 < P < 0·05. These are easily obtainable from many packages such as Microsoft Excel.

The second upgrade is that I have considerably expanded the section on the description of binary data. Thus the book now deals with relative risk, odds ratios, number needed to treat/harm and other aspects of binary data that have arisen through evidence-based medicine. I now feel that much elementary medical statistics is best taught through the use of binary data, which features prominently in the medical literature.

The third upgrade is to add a section on reading and reporting statistics in the medical literature. Many readers will not have to perform a statistical calculation, but all will have to read and interpret statistical results in the medical literature. Despite efforts by statistical referees, presentation of statistical information in the medical literature is poor, and I thought it would be useful to have some tips readily available.

The book now has a companion, Statistics at Square Two, and reference is made to that book for the more advanced topics.

I have updated the references and taken the opportunity to correct a few typos and obscurities. I thank readers for alerting me to these, particularly Mr A F Dinah. Any remaining errors are my own.

M J Campbell

http://www.shef.ac.uk/personal/m/michaelcampbell/index.html

## 1 Data display and summary

### Types of data

The first step, before any calculations or plotting of data, is to decide what type of data one is dealing with. There are a number of typologies, but one that has proven useful is given in Table 1.1. The basic distinction is between quantitative variables (for which one asks “how much?”) and categorical variables (for which one asks “what type?”).

Table 1.1 Examples of types of data.

| Class | Type | Examples |
|---|---|---|
| Quantitative | Measured | Blood pressure, height, weight, age |
| Quantitative | Counted | Number of children in a family; number of attacks of asthma per week; number of cases of AIDS in a city |
| Categorical | Ordinal (ordered categories) | Grade of breast cancer; better, same, worse; disagree, neutral, agree |
| Categorical | Nominal (unordered categories) | Sex (male/female); alive or dead; blood group O, A, B, AB |

Quantitative variables can be either measured or counted. Measured variables, such as height, can in theory take any value within a given range and are termed continuous. However, even continuous variables can only be measured to a certain degree of accuracy. Thus age is often measured in years, height in centimetres. Examples of crude measured variables are shoe and hat sizes, which only take a limited range of values. Counted variables are counts within a given time or area. Examples of counted variables are number of children in a family and number of attacks of asthma per week.

Categorical variables are either nominal (unordered) or ordinal (ordered). Nominal variables with just two levels are often termed binary. Examples of binary variables are male/female, diseased/not diseased, alive/dead. Variables with more than two categories where the order does not matter are also termed nominal, such as blood group O, A, B, AB. These are not ordered since one cannot say that people in blood group B lie between those in A and those in AB. Sometimes, however, the categories can be ordered, and the variable is termed ordinal. Examples include grade of breast cancer and reactions to some statement such as “agree”, “neither agree nor disagree” and “disagree”. In this case the order does matter and it is usually important to account for it.

Variables shown in the top section of Table 1.1 can be converted to ones below by using “cut off points”. For example, blood pressure can be turned into a nominal variable by defining “hypertension” as a diastolic blood pressure greater than 90 mmHg, and “normotension” as blood pressure less than or equal to 90 mmHg. Height (continuous) can be converted into “short”, “average” or “tall” (ordinal).

In general it is easier to summarise categorical variables, and so quantitative variables are often converted to categorical ones for descriptive purposes. To make a clinical decision about a patient, one does not need to know the exact serum potassium level (continuous) but whether it is within the normal range (nominal). It may be easier to think of the proportion of the population who are hypertensive than the distribution of blood pressure. However, categorising a continuous variable reduces the amount of information available, and statistical tests will in general be more sensitive (that is, they will have more power; see Chapter 5 for a definition of power) for a continuous variable than for the corresponding nominal one, although more assumptions may have to be made about the data. Categorising data is therefore useful for summarising results, but not for statistical analysis. However, it is often not appreciated that the choice of appropriate cut off points can be difficult, and different choices can lead to different conclusions about a set of data.

These definitions of types of data are not unique, nor are they mutually exclusive, and are given as an aid to help an investigator decide how to display and analyse data. Data which are effectively counts, such as death rates, are commonly analysed as continuous if the disease is not rare. One should not debate overlong the typology of a particular variable!

### Stem and leaf plots

Before any statistical calculation, even the simplest, is performed the data should be tabulated or plotted. If they are quantitative and relatively few, say up to about 30, they are conveniently written down in order of size.

For example, a paediatric registrar in a district general hospital is investigating the amount of lead in the urine of children from a nearby housing estate. In a particular street there are 15 children whose ages range from 1 year to under 16, and in a preliminary study the registrar has found the amounts of urinary lead (µmol/24 h) given in Table 1.2.

Table 1.2 Urinary concentration of lead in 15 children from housing estate (µmol/24 h).

0·6, 2·6, 0·1, 1·1, 0·4, 2·0, 0·8, 1·3, 1·2, 1·5, 3·2, 1·7, 1·9, 1·9, 2·2

A simple way to order, and also to display, the data is to use a stem and leaf plot. To do this we need to abbreviate the observations to two significant digits. In the case of the urinary concentration data, the digit to the left of the decimal point is the “stem” and the digit to the right the “leaf”.

We first write the stems in order down the page. We then work along the data set, writing the leaves down “as they come”. Thus, for the first data point, we write a 6 opposite the 0 stem. We thus obtain the plot shown in Figure 1.1.

| Stem | Leaf |
|---|---|
| 0 | 6 1 4 8 |
| 1 | 1 3 2 5 7 9 9 |
| 2 | 6 0 2 |
| 3 | 2 |

Figure 1.1 Stem and leaf “as they come”.
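The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `stem_and_leaf` and `format_plot` are hypothetical, and the sketch assumes every observation has exactly one decimal place, as in Table 1.2. Stems with no observations are simply omitted.

```python
from collections import defaultdict

# Urinary lead concentrations from Table 1.2 (µmol/24 h).
lead = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
        3.2, 1.7, 1.9, 1.9, 2.2]


def stem_and_leaf(values, ordered=False):
    """Return {stem: [leaves]} for values with one decimal place.

    The stem is the digit before the decimal point and the leaf is the
    digit after it. Leaves stay "as they come" unless ordered=True.
    """
    plot = defaultdict(list)
    for value in values:
        tenths = round(value * 10)       # 2.6 -> 26, avoids float truncation
        stem, leaf = divmod(tenths, 10)  # 26 -> stem 2, leaf 6
        plot[stem].append(leaf)
    if ordered:
        for leaves in plot.values():
            leaves.sort()
    return dict(sorted(plot.items()))


def format_plot(plot):
    """Render the plot as text, one stem per line."""
    return "\n".join(
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in plot.items()
    )


print(format_plot(stem_and_leaf(lead)))
# 0 | 6 1 4 8
# 1 | 1 3 2 5 7 9 9
# 2 | 6 0 2
# 3 | 2
```

Calling `stem_and_leaf(lead, ordered=True)` sorts the leaves within each stem, which is the ordering step that follows in the text.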

We then order the leaves, as in Figure 1.2.

| Stem | Leaf |
|---|---|
| 0 | 1 4 6 8 |
| 1 | 1 2 3 5 7 9 9 |
| 2 | 0 2 6 |
| 3 | 2 |

Figure 1.2 Ordered stem and leaf plot.

The advantage of first setting the figures out in order of size and not simply feeding them straight from notes into a calculator (for example, to find their mean) is that the relation of each to the next can be looked at. Is there a steady progression, a noteworthy hump, a considerable gap? Simple inspection can disclose irregularities. Furthermore, a glance at the figures gives information on their range. The smallest value is 0·1 and the largest is 3·2 µmol/24 h. Note that the range can mean two numbers (smallest, largest) or a single number (largest minus smallest). We will usually use the former when displaying data, but when talking about summary measures (see Chapter 2) we will think of the range as a single number.

### Median

To find the median (or mid point) we need to identify the point which has the property that half the data are greater than it, and half the data are less than it. For 15 points, the mid point is clearly the eighth largest, so that seven points are less than the median, and seven points are greater than it. This is easily obtained from Figure 1.2 by counting from the top to the eighth leaf, which is 1·50 µmol/24 h.

To find the median for an even number of points, the procedure is as follows. Suppose the paediatric registrar obtained a further set of 16 urinary lead concentrations from children living in the countryside in the same county as the hospital (Table 1.3).

Table 1.3 Urinary concentration of lead in 16 rural children (µmol/24 h).

0·2, 0·3, 0·6, 0·7, 0·8, 1·5, 1·7, 1·8, 1·9, 1·9, 2·0, 2·0, 2·1, 2·8, 3·1, 3·4

To obtain the median we average the eighth and ninth points (1·8 and 1·9) to get 1·85 µmol/24 h. In general, if n is even, we average the (n/2)th largest and the (n/2 + 1)th largest observations.

The main advantage of using the median as a measure of location is that it is “robust” to outliers. For example, if we had accidentally written 34 rather than 3·4 in Table 1.3, the median would still have been 1·85. One disadvantage is that it is tedious to order a large number of observations by hand (there is usually no “median” button on a calculator).

An interesting property of the median is shown by first subtracting the median from each observation, and changing the negative signs to positive ones (taking the absolute difference). For the data in Table 1.2 the median is 1·5 and the absolute differences are 0·9, 1·1, 1·4, 0·4, 1·1, 0·5, 0·7, 0·2, 0·3, 0·0, 1·7, 0·2, 0·4, 0·4, 0·7. The sum of these is 10·0. It can be shown that no other data point will give a smaller sum. Thus the median is the point ‘nearest’ all the other data points.

### Measures of variation

It is informative to have some measure of the variation of observations about the median. The range is very susceptible to what are known as outliers, points well outside the main body of the data. For example, if we had made the mistake of writing 32 instead of 3·2 in Table 1.2, then the range would be written as 0·1 to 32 µmol/24 h, which is clearly misleading.

A more robust approach is to divide the distribution of the data into four, and find the points below which 25%, 50% and 75% of the distribution lie. These are known as quartiles, and the median is the second quartile. The variation of the data can be summarised in the interquartile range, the distance between the first and third quartile, often abbreviated to IQR. With small data sets and if the sample size is not divisible by 4, it may not be possible to divide the data set into exact quarters, and there are a variety of proposed methods to estimate the quartiles. A simple, consistent method is to find the points which are themselves medians between each end of the range and the median. Thus, from Figure 1.2, there are eight points between and including the smallest, 0·1, and the median, 1·5. Thus the mid point lies between 0·8 and 1·1, or 0·95. This is the first quartile. Similarly the third quartile is mid way between 1·9 and 2·0, or 1·95. Thus, the interquartile range is 0·95 to 1·95 µmol/24 h.
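The median and the “median of each half” rule for quartiles can be checked with a short Python sketch. The function names `quartiles` and `abs_sum` are hypothetical. For an odd sample size the median is counted in both halves, as in the worked example above; for an even sample size the sketch assumes the data are split into two equal halves, which the text does not spell out.

```python
from statistics import median

urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
         3.2, 1.7, 1.9, 1.9, 2.2]           # Table 1.2
rural = [0.2, 0.3, 0.6, 0.7, 0.8, 1.5, 1.7, 1.8, 1.9, 1.9,
         2.0, 2.0, 2.1, 2.8, 3.1, 3.4]      # Table 1.3


def quartiles(values):
    """Return (first quartile, median, third quartile).

    Each quartile is the median of one half of the ordered data.
    If n is odd, the overall median belongs to both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                      # 8 when n is 15 or 16
    lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)


q1, med, q3 = quartiles(urban)
print(round(q1, 2), round(med, 2), round(q3, 2))   # 0.95 1.5 1.95

# Robustness: writing 34 for 3.4 does not move the rural median.
rural_typo = [34 if x == 3.4 else x for x in rural]
print(round(median(rural), 2), round(median(rural_typo), 2))   # 1.85 1.85


def abs_sum(centre):
    """Sum of absolute differences of the urban data from `centre`."""
    return sum(abs(x - centre) for x in urban)


print(round(abs_sum(med), 1))                # 10.0
```

Evaluating `abs_sum` at each of the other observations gives a larger total, in line with the property of the median described above.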

### Data display

The simplest way to show data is a dot plot. Figure 1.3 shows the data from Tables 1.2 and 1.3 together with the median for each set. Take care if you use a scatterplot option to plot these data: you may find the points with the same value are plotted on top of each other.

Figure 1.3 Dot plot of urinary lead concentrations for urban and rural children (with medians). Vertical axis: lead concentration (µmol/24 h); groups: urban children (n = 15) and rural children (n = 16).

Sometimes the points in separate plots may be linked in some way, for example the data in Tables 1.2 and 1.3 may result from a matched case–control study (see Chapter 13 for a description of this type of study) in which individuals from the countryside were matched by age and sex with individuals from the town. If possible, the links should be maintained in the display, for example by joining matching individuals in Figure 1.3. This can lead to a more sensitive way of examining the data.

When the data sets are large, plotting individual points can be cumbersome. An alternative is a box–whisker plot. The box is marked by the first and third quartile, and the whiskers extend to the range. The median is also marked in the box, as shown in Figure 1.4.
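Returning to the caution about coincident points in a dot plot: one way around it, sketched below, is to spread tied values sideways before plotting. This is an illustrative assumption rather than a method given in the text; the function name `dot_positions` and its `centre` and `step` parameters are hypothetical, and the resulting coordinates could be handed to any scatter-plot routine.

```python
from collections import Counter

# Urban data from Table 1.2 (µmol/24 h); the value 1.9 occurs twice.
urban = [0.6, 2.6, 0.1, 1.1, 0.4, 2.0, 0.8, 1.3, 1.2, 1.5,
         3.2, 1.7, 1.9, 1.9, 2.2]


def dot_positions(values, centre=0.0, step=0.1):
    """Return (x, y) pairs for a dot plot, spreading tied values sideways.

    A value y that occurs k times gets k x-positions placed
    symmetrically about `centre`, `step` apart, so no dot hides another.
    """
    positions = []
    for y, k in sorted(Counter(values).items()):
        for i in range(k):
            x = centre + (i - (k - 1) / 2) * step
            positions.append((x, y))
    return positions


points = dot_positions(urban)
tied = [(round(x, 2), y) for x, y in points if y == 1.9]
print(tied)        # [(-0.05, 1.9), (0.05, 1.9)]
```

Untied values keep the horizontal position `centre`, so the column of dots still reads as a single group.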

Figure 1.4 Box–whisker plot of data from Figure 1.3. Vertical axis: lead concentration (µmol/24 h); groups: urban children (n = 15) and rural children (n = 16).

It is easy to include more information in a box–whisker plot. One method, which is implemented in some computer programs, is to extend the whiskers only to points that are 1·5 times the interquartile range below the first quartile or above the third quartile, and to show remaining points as dots, so that the number of outlying points is shown.

### Histograms

Suppose the paediatric registrar referred to earlier extends the urban study to the entire estate in which the children live. He obtains figures for the urinary lead concentration in 140 children aged over 1 year and under 16. We can display these data as a grouped frequency table (Table 1.4). They can also be displayed as a histogram, as in Figure 1.5.

Table 1.4 Lead concentration in 140 urban children.

| Lead concentration (µmol/24 h) | Number of children |
|---|---|
| 0– | 2 |
| 0·4– | 7 |
| 0·8– | 10 |
| 1·2– | 16 |
| 1·6– | 23 |
| 2·0– | 28 |
| 2·4– | 19 |
| 2·8– | 16 |
| 3·2– | 11 |
| 3·6– | 7 |
| 4·0–4·4 | 1 |
| Total | 140 |

Figure 1.5 Histogram of data from Table 1.4 (n = 140). Horizontal axis: lead concentration (µmol/24 h); vertical axis: number of children.

### Bar charts

Suppose, of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Figures from the census suggest that for this age group, throughout the county, 50% live in owner occupied houses, 30% in council houses, and 20% in private rented accommodation. Type of accommodation is a categorical variable, which can be displayed in a bar chart. We first express our data as percentages: 14% owner occupied, 50% council house, 36% private rented. We then display the data as a bar chart. The sample size should always be given (Figure 1.6).

Figure 1.6 Bar chart of housing data for 140 children and comparable census data. Vertical axis: percentage of subjects; categories: owner occupier, council housing, private rental; series: survey and census.

### Common questions

#### What is the distinction between a histogram and a bar chart?

Alas, with modern graphics programs the distinction is often lost. A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars. It is a mistake to use a bar chart to display a summary statistic such as a mean, particularly when it is accompanied by some measure of variation to produce a “dynamite plunger plot” [1]. It is better to use a box–whisker plot.

#### How many groups should I have for a histogram?

In general one should choose enough groups to show the shape of a distribution, but not so many that the shape is lost in the noise.

It is partly aesthetic judgement but, in general, between 5 and 15, depending on the sample size, gives a reasonable picture. Try to keep the intervals (known also as “bin widths”) equal. With equal intervals the height of the bars and the area of the bars are both proportional to the number of subjects in the group. With unequal intervals this link is lost, and interpretation of the figure can be difficult.

### Displaying data in papers

• The general principle should be, as far as possible, to show the original data and to try not to obscure the design of a study in the display. Within the constraints of legibility, show as much information as possible. Thus if a data set is small (say, less than 20 points) a dot plot is preferred to a box–whisker plot.

• When displaying the relationship between two quantitative variables, use a scatter plot (Chapter 11) in preference to categorising one or both of the variables.

• If data points are matched or from the same patient, link them with lines where possible.

• Pie-charts are another way to display categorical data, but they are rarely better than a bar-chart or a simple table.

• To compare the distribution of two or more data sets, it is often better to use box–whisker plots side by side than histograms. Another common technique is to treat the histograms as if they were bar-charts, and plot the bars for each group adjacent to each other.

• When quoting a range or interquartile range, give the two numbers that define it, rather than the difference.

### Exercises

Exercise 1.1

From the 140 children whose urinary concentration of lead was investigated 40 were chosen who were aged at least 1 year but under 5 years. The following concentrations of copper (in µmol/24 h) were found.

0·70, 0·45, 0·72, 0·30, 1·16, 0·69, 0·83, 0·74, 1·24, 0·77,
0·65, 0·76, 0·42, 0·94, 0·36, 0·98, 0·64, 0·90, 0·63, 0·55,
0·78, 0·10, 0·52, 0·42, 0·58, 0·62, 1·12, 0·86, 0·74, 1·04,
0·65, 0·66, 0·81, 0·48, 0·85, 0·75, 0·73, 0·50, 0·34, 0·88

Find the median, range and quartiles.

### Reference

1 Campbell MJ. Present numerical results. In: Reece D, ed. How to do it, Vol. 2. London: BMJ Publishing Group, 1995:77–83.

## 2 Summary statistics for quantitative and binary data

Summary statistics summarise the essential information in a data set into a few numbers, which, for example, can be communicated verbally. The median and the interquartile range discussed in Chapter 1 are examples of summary statistics. Here we discuss summary statistics for quantitative and binary data.

### Mean and standard deviation

The median is known as a measure of location; that is, it tells us where the data are. As stated in Chapter 1, we do not need to know all the data values exactly to calculate the median; if we made the smallest value even smaller or the largest value even larger, it would not change the value of the median. Thus the median does not use all the information in the data and so it can be shown to be less efficient than the mean or average, which does use all values of the data. To calculate the mean we add up the observed values and divide by their number. The total of the values obtained in Table 1.2 was 22·5 µmol/24 h, which is divided by their number, 15, to give a mean of 1·50 µmol/24 h. This familiar process is conveniently expressed by the following symbols:

$$\bar{x} = \frac{\sum x}{n}.$$

$\bar{x}$ (pronounced “x bar”) signifies the mean; $x$ is each of the values of urinary lead; $n$ is the number of these values; and $\sum$, the Greek capital sigma (English “S”) denotes “sum of”. A major disadvantage of the mean is that it is sensitive to outlying points. For example, replacing 2·2 with 22 in Table 1.2 increases the mean to 2·82 µmol/24 h, whereas the median will be unchanged.

A feature of the mean is that it is the value that minimises the sum of the squared differences of the observations from a point; in contrast, the median minimises the sum of the absolute differences from a point (Chapter 1). For the data in Table 1.2, the first observation is 0·6 and the square of the difference from the mean is (0·6 − 1·5)² = 0·81. The sum of the squares for all the observations is 9·96 (see Table 2.1). No value other than 1·50 will give a smaller sum. It is also true that the sum of the differences (now allowing both negative and positive values) of the observations from the mean will always be zero.

As well as measures of location we need measures of how variable the data are. We met two of these measures, the range and interquartile range, in Chapter 1.

The range is an important measurement, for figures at the top and bottom of it denote the findings furthest removed from the generality. However, they do not give much indication of the average spread of observations about the mean. This is where the standard deviation (SD) comes in.

The theoretical basis of the standard deviation is complex and need not trouble the user. We will discuss sampling and populations in Chapter 3. A practical point to note here is that, when the population from which the data arise has a distribution that is approximately “Normal” (or Gaussian), then the standard deviation provides a useful basis for interpreting the data in terms of probability.

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population. The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Many biological characteristics conform to a Normal distribution closely enough for it to be commonly used, for example heights of adult men and women, blood pressures in a healthy population, random errors in many types of laboratory measurements and biochemical data. Figure 2.1 shows a Normal curve calculated from the diastolic blood pressures of 500 men, with mean 82 mmHg and standard deviation 10 mmHg. The limits representing ± 1 SD, ± 2 SD and ± 3 SD about the mean are marked. A more extensive set of values is given in Table A in the Appendix.
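The limits marked about the mean can be reproduced by simple arithmetic. The sketch below does this for the quoted mean of 82 mmHg and standard deviation of 10 mmHg, and uses `NormalDist` from Python's standard `statistics` module to report the proportion of an exactly Normal distribution lying inside each pair of limits. The function name `sd_limits` is hypothetical, and treating the population as exactly Normal is an assumption made for illustration.

```python
from statistics import NormalDist


def sd_limits(mean, sd, k):
    """Return (low, high, proportion inside) for mean ± k SD.

    The proportion is computed for an exactly Normal distribution
    with the given mean and standard deviation.
    """
    low, high = mean - k * sd, mean + k * sd
    dist = NormalDist(mu=mean, sigma=sd)
    return low, high, dist.cdf(high) - dist.cdf(low)


# Diastolic blood pressure values quoted for Figure 2.1 (mmHg).
for k in (1, 2, 3):
    low, high, inside = sd_limits(82, 10, k)
    print(f"mean ± {k} SD: {low} to {high} mmHg, {inside:.1%}")
# mean ± 1 SD: 72 to 92 mmHg, 68.3%
# mean ± 2 SD: 62 to 102 mmHg, 95.4%
# mean ± 3 SD: 52 to 112 mmHg, 99.7%
```

The proportions do not depend on the particular mean and standard deviation chosen; only the limits themselves change.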

Figure 2.1 Normal curve calculated from diastolic blood pressures of 500 men, mean 82 mmHg, standard deviation 10 mmHg. Horizontal axis: diastolic blood pressure (mmHg); vertical axis: number of men; the limits $\bar{x}$ ± 1 SD, $\bar{x}$ ± 2 SD and $\bar{x}$ ± 3 SD are marked.

The reason why the standard deviation is such a useful measure of the scatter of the observations is this: if the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it ($\bar{x}$ ± 1 SD) includes about 68% of the observations; a range of two standard deviations above and two below ($\bar{x}$ ± 2 SD) about 95% of the observations; and of three standard deviations above and three below ($\bar{x}$ ± 3 SD) about 99·7% of the observations. Consequently, if we know the mean and standard deviation of a set of observations, we can obtain some useful information by simple arithmetic. By putting one, two or three standard deviations above and below the mean we can estimate the range of values that would be expected to include about 68%, 95% and 99·7% of the observations.

### Standard deviation from ungrouped data

The standard deviation is a summary measure of the differences of each observation from the mean of all the observations. If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. Consequently the squares of the differences are added. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares, and the square root is taken to bring the measurements back to the units we started with. (The division by the number of observations minus one instead of the number of observations itself to obtain the mean square is because “degrees of freedom” must be used. In these circumstances they are one less than the total. The theoretical justification for this need not trouble the user in practice.)

To gain an intuitive feel for degrees of freedom, consider choosing a chocolate from a box of n chocolates. Every time we come to choose a chocolate we have a choice, until we come to the last one (normally one with a nut in it!), and then we have no choice. Thus we have n − 1 choices in total, or “degrees of freedom”.

The calculation of the standard deviation is illustrated in Table 2.1 with the 15 readings in the preliminary study of urinary lead concentrations (Table 1.2). The readings are set out in column (1). In column (2) the difference between each reading and the mean is recorded. The sum of the differences is 0. In column (3) the differences are squared, and the sum of those squares is given at the bottom of the column.

The sum of the squares of the differences (or deviations) from the mean, 9·96, is now divided by the total number of observations minus one, to give a quantity known as the variance.
Thus,In this case we find:(∑x − x – ) 2Variance = .n − 19·96Variance = = 0·7114 (µmol/24 h)142 .Finally, the square root of the variance provides the standarddevi<strong>at</strong>ion:√⎯⎯⎯⎯⎯(∑x − x – ) 2SD = ,n − 115
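The steps just described can be sketched in Python, assuming the 15 urinary lead readings (µmol/24 h) that make up column (1) of Table 2.1. The variable names are illustrative; `statistics.stdev` from the standard library, which also divides by n − 1, is used only as a cross-check.

```python
import math
import statistics

# The 15 urinary lead readings (µmol/24 h) from column (1) of Table 2.1.
lead = [0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5,
        1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2]

n = len(lead)
mean = sum(lead) / n                       # x-bar

deviations = [x - mean for x in lead]      # column (2): x - x-bar
squared = [d ** 2 for d in deviations]     # column (3): (x - x-bar) squared

sum_of_squares = sum(squared)
variance = sum_of_squares / (n - 1)        # divide by the degrees of freedom
sd = math.sqrt(variance)                   # back to the original units

print(f"n = {n}, mean = {mean:.2f}")
print("deviations sum to zero:",
      math.isclose(sum(deviations), 0.0, abs_tol=1e-9))
print(f"sum of squares = {sum_of_squares:.2f}")
print(f"variance = {variance:.4f}, SD = {sd:.3f}")
print(f"statistics.stdev cross-check: {statistics.stdev(lead):.3f}")
```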

Table 2.1 Calculation of standard deviation.

| (1) Lead concentration (µmol/24 h), x | (2) Differences from mean, x − x̄ | (3) Differences squared, (x − x̄)² | (4) Observations in col (1) squared, x² |
|---|---|---|---|
| 0·1 | −1·4 | 1·96 | 0·01 |
| 0·4 | −1·1 | 1·21 | 0·16 |
| 0·6 | −0·9 | 0·81 | 0·36 |
| 0·8 | −0·7 | 0·49 | 0·64 |
| 1·1 | −0·4 | 0·16 | 1·21 |
| 1·2 | −0·3 | 0·09 | 1·44 |
| 1·3 | −0·2 | 0·04 | 1·69 |
| 1·5 | 0 | 0 | 2·25 |
| 1·7 | 0·2 | 0·04 | 2·89 |
| 1·9 | 0·4 | 0·16 | 3·61 |
| 1·9 | 0·4 | 0·16 | 3·61 |
| 2·0 | 0·5 | 0·25 | 4·00 |
| 2·2 | 0·7 | 0·49 | 4·84 |
| 2·6 | 1·1 | 1·21 | 6·76 |
| 3·2 | 1·7 | 2·89 | 10·24 |
| Total 22·5 | 0 | 9·96 | 43·71 |

n = 15, x̄ = 1·50.

Taking the square root of the variance, 0·7114, we get

SD = √0·7114 = 0·843 µmol/24 h.

This procedure illustrates the structure of the standard deviation, in particular that the two extreme values 0·1 and 3·2 contribute most to the sum of the differences squared.

Calculator procedure

Calculators often have two buttons for the SD, σₙ and σₙ₋₁. These use divisors n and n − 1 respectively. The symbol σ is the Greek lower case “s”, for standard deviation.

The calculator formulae use the relationship

$$\sigma_n^2 = \frac{1}{n}\sum (x - \bar{x})^2 = \frac{1}{n}\left[\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right] = \frac{\sum x^2}{n} - \bar{x}^2.$$
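The relationship above can be checked numerically. The sketch below, again assuming the Table 2.1 readings, computes σₙ² once from the squared deviations and once from the column totals ∑x and ∑x². The function names are illustrative and are not meant to mirror any particular calculator's routine.

```python
import math

# The 15 urinary lead readings (µmol/24 h) from column (1) of Table 2.1.
lead = [0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5,
        1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2]


def variance_n_direct(xs):
    """sigma_n squared from the definition: mean squared deviation, divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_n_shortcut(xs):
    """sigma_n squared from the totals: sum(x^2)/n minus (sum(x)/n) squared."""
    n = len(xs)
    sum_x = sum(xs)                    # total of column (1)
    sum_x2 = sum(x * x for x in xs)    # total of column (4)
    return sum_x2 / n - (sum_x / n) ** 2


direct = variance_n_direct(lead)
shortcut = variance_n_shortcut(lead)
print(f"sum of x = {sum(lead):.1f}, "
      f"sum of x squared = {sum(x * x for x in lead):.2f}")
print(f"sigma_n squared: direct {direct:.4f}, shortcut {shortcut:.4f}")
print(f"sigma_n = {math.sqrt(direct):.3f}")
```

Both routes divide by n, so the result is the σₙ version of the variance, not the n − 1 version used for the sample standard deviation.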

∑x² means square the x’s and then add them. The right hand expression can be easily memorised as “mean of the squares minus the mean squared”. The variance σ²ₙ₋₁ is obtained from

$$\sigma_{n-1}^2 = \sigma_n^2 \times \frac{n}{n - 1}.$$

The above equation can be seen to be true in Table 2.1, where the sum of the squares of the observations, ∑x², is given as 43·71. We thus obtain

43·71 − (22·5)²/15 = 9·96,

the same value given for the total in column (3). Care should be taken because this formula involves subtracting two large numbers to get a small one, and can lead to incorrect results if the numbers are very large. For example, try finding the standard deviation of 100 001, 100 002, 100 003 on a calculator. The correct answer is 1, but many calculators will give 0 because of rounding error. The solution is to subtract a large number from each of the observations (say 100 000) and calculate the standard deviation on the remainders, namely 1, 2 and 3. The variability of a set of numbers is unaffected if we change every member of the set by adding or subtracting the same constant.

Standard deviation from grouped data

We can also calculate a standard deviation for count variables. For example, in addition to studying the lead concentration in the urine of 140 children, the paediatrician asked how often each of them had been examined by a doctor during the year. After collecting the information he tabulated the data shown in Table 2.2 columns (1) and (2).
The mean is calculated by multiplying column (1) by column (2), adding the products, and dividing by the total number of observations.

Table 2.2 Calculation of the standard deviation from count data.

| (1) Number of visits to or by doctor | (2) Number of children | (3) Col (2) × Col (1) | (4) Col (1) squared | (5) Col (2) × Col (4) |
|---|---|---|---|---|
| 0 | 2 | 0 | 0 | 0 |
| 1 | 8 | 8 | 1 | 8 |
| 2 | 27 | 54 | 4 | 108 |
| 3 | 45 | 135 | 9 | 405 |
| 4 | 38 | 152 | 16 | 608 |
| 5 | 15 | 75 | 25 | 375 |
| 6 | 4 | 24 | 36 | 144 |
| 7 | 1 | 7 | 49 | 49 |
| Total | 140 | 455 | | 1697 |

Mean number of visits = 455/140 = 3·25

As we did for continuous data, to calculate the standard deviation we square each of the observations in turn. In this case the observation is the number of visits, but because we have several children in each class, shown in column (2), each squared number (column (4)) must be multiplied by the number of children. The sum of squares is given at the foot of column (5), namely 1697. We then use the calculator formula to find the variance:

Variance = (1697 − 455²/140)/139 = 1·57

and

SD = √1·57 = 1·25.

Note that although the number of visits is not Normally distributed, the distribution is reasonably symmetrical about the mean. The approximate 95% range is given by

3·25 − 2 × 1·25 = 0·75 to 3·25 + 2 × 1·25 = 5·75.

This excludes two children with no visits and five children with six or more visits. Thus there are 7 out of 140 = 5·0% outside the theoretical 95% range.

Note that it is common for discrete quantitative variables to have what is known as a skewed distribution, that is, they are not symmetrical. One clue to lack of symmetry from derived statistics is when the mean and the median differ considerably. Another is when the standard deviation is of the same order of magnitude as the mean, but the observations must be non-negative. Sometimes a transformation will convert a skewed distribution into a symmetrical one. When the data are counts, such as number of visits to a doctor, often the square root transformation will help, and if there are no zero or negative values a logarithmic transformation may render the distribution more symmetrical.

Data transformation

An anaesthetist measures the pain of a procedure using a 100 mm visual analogue scale on seven patients. The results are given in Table 2.3, together with the logₑ transformation (the ln button on a calculator).

Table 2.3 Results from pain score on seven patients (mm).

Original scale: 1, 1, 2, 3, 3, 6, 56

logₑ scale: 0, 0, 0·69, 1·10, 1·10, 1·79, 4·03

The data are plotted in Figure 2.2, which shows that the outlier does not appear so extreme in the logged data. The mean and median are 10·29 and 3, respectively, for the original data, with a standard deviation of 20·23. Where the mean is bigger than the median, the distribution is positively skewed. For the logged data the mean and median are 1·24 and 1·10 respectively, which are relatively close, indicating that the logged data have a more symmetrical distribution. Thus it would be better to analyse the log transformed data in statistical tests than to use the original scale.

Figure 2.2 Dot plots of original and logged data from pain scores. (Axes: pain score (mm) and log pain score.)

In reporting these results, the median of the raw data would be given, but it should be explained that the statistical test was carried out on the transformed data. Note that the median of the logged data is the same as the log of the median of the raw data; however, this is not true for the mean. The mean of the logged data is not necessarily equal to the log of the mean of the raw data. The antilog (exp or eˣ on a calculator) of the mean of the logged data is known as the geometric mean, and is often a better summary statistic than the mean for data from positively skewed distributions. For these data the geometric mean is 3·47 mm.

A number of articles have discussed transforming variables.¹,² A number of points can be made:

• If two groups are to be compared, a transformation that reduces the skewness of an outcome variable often results in the standard deviations of the variable in the two groups being similar.

• A log transform is the only one that will give sensible results on a back transformation.

• Transforming variables is not “cheating”. Some variables are measured naturally on a log scale (for example, pH).

Between subjects and within subjects standard deviation

If repeated measurements are made of, say, blood pressure on an individual, these measurements are likely to vary.
This is within subject, or intrasubject, variability, and we can calculate a standard deviation of these observations. If the observations are close together in time, this standard deviation is often described as the measurement error. Measurements made on different subjects vary according to between subject, or intersubject, variability. If many observations were made on each individual, and the average taken, then we can assume that the intrasubject variability has been averaged out and the variation in the average values is due solely to the intersubject variability. Single observations on individuals clearly contain a mixture of intersubject and intrasubject variation, but we cannot separate the two since the within subject variability cannot be estimated with only one observation per subject. The coefficient of variation (CV%) is the intrasubject standard deviation divided by the mean, expressed as a percentage. It is often quoted as a measure of repeatability for biochemical assays, when an assay is carried out on several occasions on the same sample. It has the advantage of being independent of the units of measurement, but also numerous theoretical disadvantages. It is usually nonsensical to use the coefficient of variation as a measure of between subject variability.

Summarising relationships between binary variables

There are a number of ways of summarising the outcome from binary data. These include the absolute risk reduction, the relative risk, the relative risk reduction, the number needed to treat and the odds ratio. We discuss how to calculate these and their uses in this section.

Kennedy et al.³ report on the study of acetazolamide and furosemide versus standard therapy for the treatment of posthaemorrhagic ventricular dilatation (PHVD) in premature babies. The outcome was death or a shunt placement by 1 year of age. The results are given in Table 2.4.

The standard method of summarising binary outcomes is to use proportions or percentages. Thus 35 out of 76 children died or had a shunt under standard therapy, and this is expressed as 35/76 or 0·46. This is often expressed as a percentage, 46%, and for a prospective study such as this the proportion can be thought of as a probability of an event happening or a risk.
Thus under the standard therapy there was a risk of 0·46 of dying or getting a shunt by 1 year of age. In the drug plus standard therapy the risk was 49/75 = 0·65.

Table 2.4 Results from PHVD trial.³

| | Death/shunt | No death/shunt | Total |
|---|---|---|---|
| Standard therapy | 35 | 41 | 76 |
| Drug plus standard therapy | 49 | 26 | 75 |

In clinical trials, what we really want is to look at the contrast between differing therapies. We can do this by looking at the difference in risks, or alternatively the ratio of risks. The difference is usually expressed as the control risk minus the experimental risk and is known as the absolute risk reduction (ARR). The difference in risks in this case is 0·46 − 0·65 = −0·19 or −19%. The negative sign indicates that the experimental treatment in this case appears to be doing harm. One way of thinking about this is that if 100 patients were treated under standard therapy and 100 treated under drug therapy, we would expect 46 to have died or have had a shunt in the standard therapy and 65 in the experimental therapy. Thus an extra 19 had a shunt placement or died under drug therapy. Another way of looking at this is to ask: how many patients would be treated for one extra person to be harmed by the drug therapy? Nineteen extra adverse events resulted from treating 100 patients and so 100/19 = 5·26 patients would be treated for 1 adverse event. Thus roughly if 6 patients were treated with standard therapy and 6 with drug therapy, we would expect 1 extra patient to die or require a shunt in the drug therapy. This is known as the number needed to harm (NNH) and is simply expressed as the inverse of the absolute risk reduction, with the sign ignored. When the new therapy is beneficial it is known as the number needed to treat (NNT)⁴ and in this case the ARR will be positive. For screening studies it is known as the number needed to screen, that is, the number of people that have to be screened to prevent one serious event or death.⁵

The number needed to treat has been suggested by Sackett et al.⁶ as a useful and clinically intuitive way of thinking about the outcome of a clinical trial. For example, in a clinical trial of pravastatin against usual therapy to prevent coronary events in men with moderate hypercholesterolaemia and no history of myocardial infarction, the NNT is 42. Thus you would have to treat 42 men with pravastatin to prevent one extra coronary event, compared with the usual therapy. It is claimed that this is easier to understand than the relative risk reduction, or other summary statistics, and can be used to decide whether an effect is “large” by comparing the NNT for different therapies.

However, it is important to realise that comparison between NNTs can only be made if the baseline risks are similar. Thus, suppose a new therapy managed to reduce 5 year mortality of Creutzfeldt–Jakob disease from 100% on standard therapy to 90% on the new treatment. This would be a major breakthrough and has an NNT of 1/(1 − 0·9) = 10. In contrast, a drug that reduced mortality from 50% to 40% would also have an NNT of 10, but would have much less impact.
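The arithmetic behind the absolute risk reduction and the number needed to treat or harm is short enough to sketch in Python. The function names below are illustrative, the risks are the rounded values quoted in the text, and rounding the result up to a whole patient follows the “roughly 6 patients” reading above, not any fixed rule.

```python
import math


def absolute_risk_reduction(control_risk, experimental_risk):
    """ARR = control risk minus experimental risk (negative means harm)."""
    return control_risk - experimental_risk


def number_needed(control_risk, experimental_risk):
    """1/|ARR|: read as NNT when ARR > 0 and as NNH when ARR < 0."""
    arr = absolute_risk_reduction(control_risk, experimental_risk)
    if arr == 0:
        raise ValueError("risks are equal, so the number needed is undefined")
    return 1 / abs(arr)


# PHVD trial, rounded risks from the text: 0.46 (standard), 0.65 (drug).
arr = absolute_risk_reduction(0.46, 0.65)
nnh = number_needed(0.46, 0.65)
print(f"ARR = {arr:.2f}, NNH = {nnh:.2f}, about {math.ceil(nnh)} patients")

# Same NNT, very different baseline risks (the two mortality examples).
for control, new in ((1.00, 0.90), (0.50, 0.40)):
    print(f"{control:.0%} -> {new:.0%}: NNT = {number_needed(control, new):.0f}")
```

The last two lines give the same NNT for both mortality examples, which is why an NNT should be read alongside the baseline risk.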

We can also express the outcome as a risk ratio or relative risk (RR), which is the ratio of the two risks. Here it is taken as the control risk divided by the experimental risk, namely 0·46/0·65 = 0·71. With a relative risk less than 1, the risk of an event is lower in the control group. The relative risk is often used in cohort studies. It is important to consider the absolute risk as well. The risk of deep vein thrombosis in women on a new type of contraceptive is 30 per 100 000 women years, compared to 15 per 100 000 women years for women on the old treatment. Thus the relative risk is 2, which shows that the new type of contraceptive carries quite a high risk of deep vein thrombosis. However, an individual woman need not be unduly concerned since she has a probability of 0·0003 of getting a deep vein thrombosis in 1 year on the new drug, which is much less than if she were pregnant!

We can also consider the relative risk reduction (RRR), which is (control risk minus experimental risk)/control risk, often expressed as a percentage. With the relative risk taken as control risk over experimental risk, as here, this equals 1 − 1/RR (it would be 1 − RR if the relative risk were taken as experimental risk over control risk). In the PHVD trial it is (0·46 − 0·65)/0·46 = −0·41. Thus a patient in the drug arm of the PHVD trial is at approximately 41% higher risk of experiencing an adverse event relative to the risk of a patient in the standard therapy group.

When the data come from a cross-sectional study or a case–control study (see Chapter 13 for a discussion of these types of studies) then rather than risks, we often use odds. The odds of an event happening are the ratio of the probability that it happens to the probability that it does not. If P is the probability of an event we have:

$$\text{Odds (event)} = \frac{P}{1 - P}.$$

Thus the odds of a die showing a “6” on a single throw are 1:5. This is often also expressed as a proportion: thus 1/5 = 0·2.

To illustrate why this may be useful consider Table 2.5, showing the prevalence of hay fever and eczema in a cross-sectional survey of children aged 11.⁷,⁸

Table 2.5 Association between hay fever and eczema in children aged 11.⁷,⁸

| | Hay fever present | Hay fever absent | Total |
|---|---|---|---|
| Eczema present | 141 | 420 | 561 |
| Eczema absent | 928 | 13 525 | 14 453 |
| Total | 1069 | 13 945 | 15 014 |

If you have hay fever the risk of eczema is 141/1069 = 0·132 and the odds are 0·132/(1 − 0·132) = 0·152. Note that this could be calculated from the ratio of those with and without eczema: 141/928 = 0·152. If you do not have hay fever the risk of eczema is 420/13 945 = 0·030, and the odds are 420/13 525 = 0·031. Thus the relative risk of having eczema, given that you have hay fever, is 0·132/0·030 = 4·4. We can also consider the odds ratio, which is 0·152/0·031 = 4·90.

We can consider the table the other way around, and ask what is the risk of hay fever given that a child has eczema. In this case the two risks are 141/561 = 0·251 and 928/14 453 = 0·064, and the relative risk is 0·251/0·064 = 3·92. Thus the relative risk of hay fever given that a child has eczema is 3·92, which is not the same as the relative risk of eczema given that a child has hay fever. However, the two respective odds are 141/420 = 0·336 and 928/13 525 = 0·069 and the odds ratio is 0·336/0·069 = 4·87, which to the limits of rounding is the same as the odds ratio for eczema, given a child has hay fever.

The fact that the two odds ratios are the same can be seen from the fact that

$$\text{OR} = \frac{141 \times 13\,525}{928 \times 420},$$

which remains the same if we switch rows and columns.

Another useful property of the odds ratio is that the odds ratio for an event not happening is just the inverse of the odds ratio for it happening. Thus, the odds ratio for not getting eczema, given that a child has hay fever, is just 1/4·90 = 0·204. This is not true of the relative risk: the relative risk for not having hay fever, given that a child has eczema, is (420/561)/(13 525/14 453) = 0·80, which is not 1/3·92 = 0·26.

Table 2.6 Odds ratios and relative risks for different values of absolute risks (P₁, P₂).

| P₁ | P₂ | Relative risk, P₁/P₂ | Odds ratio, [P₁/(1 − P₁)]/[P₂/(1 − P₂)] |
|---|---|---|---|
| 0·1 | 0·05 | 2·00 | 2·11 |
| 0·3 | 0·15 | 2·00 | 2·43 |
| 0·5 | 0·25 | 2·00 | 3·00 |
| 0·7 | 0·35 | 2·00 | 4·33 |
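A short Python sketch, assuming the counts of Table 2.5, makes the contrast concrete: the relative risk depends on which variable is treated as the outcome, while the odds ratio does not. The helper names are illustrative. The final loop recomputes the odds ratios of Table 2.6 from P₁ and P₂.

```python
def relative_risk(a, b, c, d):
    """Risk in the first row, a/(a+b), over risk in the second row, c/(c+d)."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc of a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)


# Table 2.5 counts: eczema present/absent by hay fever present/absent.
both, eczema_only = 141, 420
hay_fever_only, neither = 928, 13525

# Eczema as the outcome: rows are hay fever present, hay fever absent.
rr_eczema = relative_risk(both, hay_fever_only, eczema_only, neither)
or_eczema = odds_ratio(both, hay_fever_only, eczema_only, neither)

# Hay fever as the outcome: rows are eczema present, eczema absent.
rr_hay_fever = relative_risk(both, eczema_only, hay_fever_only, neither)
or_hay_fever = odds_ratio(both, eczema_only, hay_fever_only, neither)

print(f"eczema given hay fever: RR = {rr_eczema:.2f}, OR = {or_eczema:.2f}")
print(f"hay fever given eczema: RR = {rr_hay_fever:.2f}, OR = {or_hay_fever:.2f}")

# Table 2.6: a relative risk fixed at 2 with increasing absolute risks.
for p1, p2 in ((0.1, 0.05), (0.3, 0.15), (0.5, 0.25), (0.7, 0.35)):
    or_from_risks = (p1 / (1 - p1)) / (p2 / (1 - p2))
    print(f"P1 = {p1}, P2 = {p2}: RR = {p1 / p2:.2f}, OR = {or_from_risks:.2f}")
```

Because the sketch works from the unrounded counts, its figures can differ slightly in the second decimal place from hand calculations that round the risks and odds first.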

Table 2.6 demonstrates an important fact: the odds ratio is a close approximation to the relative risk when the absolute risks are low, but is a poor approximation if the absolute risks are high. The odds ratio is the main summary statistic to be obtained from case–control studies (see Chapter 13 for a description of case–control studies). When the assumption of a low absolute risk holds true (which is usually the situation for case–control studies) then the odds ratio is assumed to approximate the relative risk that would have been obtained if a cohort study had been conducted.

Choice of summary statistics for binary data

Table 2.7 gives a summary of the different methods of summarising a binary outcome for a prospective study such as a clinical trial.

Table 2.7 Methods of summarising a binary outcome in a two group prospective study.

| Term | Formula | Observed in PHVD trial |
|---|---|---|
| Absolute risk reduction (ARR) | P₁ − P₂ | 0·46 − 0·65 = −0·19 |
| Relative risk (RR) | P₁/P₂ | 0·46/0·65 = 0·71 |
| Relative risk reduction (RRR) | (P₁ − P₂)/P₁ | (0·46 − 0·65)/0·46 = −0·41 |
| Number needed to treat or harm (NNT or NNH) | 1/\|P₁ − P₂\| | 1/0·19 = 5·26, rounded up to 6 |
| Odds ratio (OR) | [P₁/(1 − P₁)]/[P₂/(1 − P₂)] | (0·46/0·54)/(0·65/0·35) = 0·46 |

Risk in group 1 (control) is P₁, risk in group 2 is P₂.
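The formulae of Table 2.7 can be collected into one small Python function. This is a sketch under stated assumptions: the function and key names are illustrative, the risks are the rounded values 0·46 and 0·65 used in the text, P₁ is the control risk and P₂ the experimental risk, the relative risk is taken as P₁/P₂ as tabulated, and the relative risk reduction follows the tabulated formula (P₁ − P₂)/P₁.

```python
import math


def summarise_binary_outcome(p1, p2):
    """Summary measures for risks p1 (control) and p2 (experimental)."""
    arr = p1 - p2                        # absolute risk reduction
    return {
        "ARR": arr,
        "RR": p1 / p2,                   # relative risk as defined in Table 2.7
        "RRR": arr / p1,                 # relative risk reduction
        "NNT or NNH": 1 / abs(arr),      # the sign of ARR says which one it is
        "OR": (p1 / (1 - p1)) / (p2 / (1 - p2)),
    }


# Rounded risks from the PHVD trial: 35/76 = 0.46 and 49/75 = 0.65.
summary = summarise_binary_outcome(0.46, 0.65)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("patients treated per extra adverse event:",
      math.ceil(summary["NNT or NNH"]))
```

The function assumes the two risks differ and lie strictly between 0 and 1; equal risks would make the number needed to treat undefined.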

Note how in the PHVD trial the odds ratio and relative risk differ markedly, because the event rates are quite high.

Common questions

When should I quote the mean and when should I quote the median to describe my data?

It is a commonly held misapprehension that for Normally distributed data one uses the mean, and for non-Normally distributed data one uses the median. Alas, this is not so: if the data are approximately Normally distributed the mean and the median will be close; if the data are not Normally distributed then both the mean and the median may give useful information.

Consider a variable that takes the value 1 for males and 0 for females. This is clearly not Normally distributed. However, the mean gives the proportion of males in the group, whereas the median merely tells us which group contained more than 50% of the people. Similarly, the mean from ordered categorical variables can be more useful than the median, if the ordered categories can be given meaningful scores. For example, a lecture might be rated as 1 (poor) to 5 (excellent). The usual statistic for summarising the result would be the mean. For some outcome variables (such as cost) one might be interested in the mean, whatever the distribution of the data, since from the mean can be derived the total cost for a group. However, in the situation where there is a small group at one extreme of a distribution (for example, annual income) then the median will be more “representative” of the distribution.

When should I use a standard deviation to summarise variability?

The standard deviation is only interpretable as a summary measure for variables that have approximately symmetric distributions. It is often used to describe the characteristics of a group, for example, in the first table of a paper describing a clinical trial. It is often used, in my view incorrectly, to describe variability for measurements that are not plausibly Normal, such as age. For these variables, the range or interquartile range is a better measure. The standard deviation should not be confused with the standard error, which is described in Chapter 3, where the distinction between the two is spelled out.

When should I quote an odds ratio and when should I quote a relative risk?

The odds ratio is difficult to understand, and most people think of it as a relative risk anyway. Thus for prospective studies the relative risk should be easy to derive and should be quoted, and not the odds ratio. For case–control studies one has no option but to quote the odds ratio. For cross-sectional studies one has a choice, and if it is not clear which variables are causal and which are outcome, then the odds ratio has the advantage of being symmetric, in that it gives the same answer if the causal and outcome variables are swapped. A major reason for quoting odds ratios is that they are the output from logistic regression, an advanced technique discussed in Statistics at Square Two.⁹ These are quoted, even for prospective studies, because of the nice statistical properties of odds ratios. In this situation it is important to label the odds ratios correctly and consider situations in which they may not be good approximations to relative risks.

Reading and displaying summary statistics

• In general, display means to one more significant digit than the original data, and SDs to two significant figures more. Try to avoid the temptation to spurious accuracy offered by computer printouts and calculator displays!

• Consider carefully if the quoted summary statistics correctly summarise the data. If a mean and standard deviation are quoted, is it reasonable to assume 95% of the population are within 2 SDs of the mean? (Hint: if the mean and standard deviation are about the same size, and if the observations must be positive, then the distribution will be skewed.)

• If a relative risk is quoted, is it in fact an odds ratio? Is it reasonable to assume that the odds ratio is a good approximation to a relative risk?

• If an NNT is quoted, what are the absolute levels of risk? If you are trying to evaluate a therapy, does the absolute level of risk given in the paper correspond to what you might expect in your own patients?

• Do not use the ± symbol for indicating an SD.

Exercises

Exercise 2.1

In the campaign against smallpox a doctor inquired into the number of times 150 people aged 16 and over in an Ethiopian village had been vaccinated. He obtained the following figures: never, 12 people; once, 24; twice, 42; three times, 38; four times, 30; five times, 4. What is the mean number of times those people had been vaccinated and what is the standard deviation? Is the standard deviation a good measure of variation in this case?

Exercise 2.2

Obtain the mean and standard deviation of the data in Exercise 1.1 and an approximate 95% range. Which points are excluded from the range mean − 2 SD to mean + 2 SD? What proportion of the data is excluded?

Exercise 2.3

In a prospective study of 241 men and 222 women undergoing elective inpatient surgery, 37 men and 61 women suffered nausea and vomiting in the recovery room.¹⁰ Find the relative risk and odds ratio for nausea and vomiting for women compared to men.

References

1 Altman DG, Bland JM. Statistics notes: Transforming data. BMJ 1996;312:770.

2 Altman DG, Bland JM. Statistics notes: Detecting skewness from summary information. BMJ 1996;313:1200.

3 International PHVD Drug Trial Group. International randomised controlled trial of acetazolamide and furosemide in posthaemorrhagic ventricular dilatation in infancy. Lancet 1998;352:433–40.

4 Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ 1995;310:452–4.

5 Rembold CR. Number needed to screen: development of a statistic for disease screening. BMJ 1998;317:307–12.

6 Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. How to practice and teach EBM. New York: Churchill Livingstone, 1997.

7 Strachan DP, Butland BK, Anderson HR. Incidence and prognosis of asthma and wheezing from early childhood to age 33 in a national British cohort. BMJ 1996;312:1195–9.

8 Bland JM, Altman DG. Statistics notes: The odds ratio. BMJ 2000;320:1468.

9 Campbell MJ. Statistics at Square Two. London: BMJ Books, 2001.

10 Myles PS, Mcleod ADM, Hunt JO, Fletcher H. Sex differences in speed of emergency and quality of recovery after anaesthetic: cohort study. BMJ 2001;322:710–11.

## 3 Populations and samples

### Populations

In statistics the term “population” has a slightly different meaning from the one given to it in ordinary speech. It need not refer only to people or to animate creatures (the population of Britain, for instance, or the dog population of London). Statisticians also speak of a population of objects, or events, or procedures, or observations, including such things as the quantity of lead in urine, visits to the doctor or surgical operations. A population is thus an aggregate of creatures, things, cases and so on.

Although a statistician should define clearly the relevant population, he or she may not be able to enumerate it exactly. For instance, in ordinary usage the population of England denotes the number of people within England’s boundaries, perhaps as enumerated at a census. But a physician might embark on a study to try to answer the question “What is the average systolic blood pressure of Englishmen aged 40–59?”. But who are the “Englishmen” referred to here? Not all Englishmen live in England, and the social and genetic background of those that do may vary. A surgeon may study the effects of two alternative operations for gastric ulcer. But how old are the patients? What sex are they? How severe is their disease? Where do they live? And so on. The reader needs precise information on such matters to draw valid inferences from the sample that was studied to the population being considered.

Statistics such as averages and standard deviations, when taken from populations, are referred to as population parameters. They are often denoted by Greek letters; the population mean is denoted by µ (mu) and the standard deviation denoted by σ (lower case sigma).

### Samples

A population commonly contains too many individuals to study conveniently, so an investigation is often restricted to one or more samples drawn from it. A well chosen sample will contain most of the information about a particular population parameter, but the relation between the sample and the population must be such as to allow true inferences to be made about a population from that sample.

Consequently, the first important attribute of a sample is that every individual in the population from which it is drawn must have a known non-zero chance of being included in it; a natural suggestion is that these chances should be equal. We would like the choices to be made independently; in other words, the choice of one subject will not affect the chance of other subjects being chosen. To ensure this we make the choice by means of a process in which chance alone operates, such as spinning a coin or, more usually, the use of a table of random numbers. A limited table is given in Table F (in the Appendix), and more extensive ones have been published [1–4]. A sample so chosen is called a random sample. The word “random” does not describe the sample as such, but the way in which it is selected.

To draw a satisfactory sample sometimes presents greater problems than to analyse statistically the observations made on it. A full discussion of the topic is beyond the scope of this book, but guidance is readily available [1,2]. In this book only an introduction is offered.

Before drawing a sample the investigator should define the population from which it is to come.
Sometimes he or she can completely enumerate its members before beginning analysis: for example, all the livers studied at necropsy over the previous year, or all the patients aged 20–44 admitted to hospital with perforated peptic ulcer in the previous 20 months. In retrospective studies of this kind numbers can be allotted serially from any point in the table to each patient or specimen. Suppose we have a population of size 150, and we wish to take a sample of size 5. Table F contains a set of computer generated random digits arranged in groups of five. Choose any row and column, say the last column of five digits. Read only the first three digits, and go down the column starting with the first row. Thus we have 265, 881, 722, etc. If a number appears between 001 and 150 then we include it in our sample. Thus, in order, in the sample will be subjects numbered 24, 59, 107, 73, and 65. If necessary we can carry on down the next column to the left until the full sample is chosen.

The use of random numbers in this way is generally preferable to taking every alternate patient or every fifth specimen, or acting on some other such regular plan. The regularity of the plan can occasionally coincide by chance with some unforeseen regularity in the presentation of the material for study: for example, by hospital appointments being made by patients from certain practices on certain days of the week, or specimens being prepared in batches in accordance with some schedule.

As susceptibility to disease generally varies in relation to age, sex, occupation, family history, exposure to risk, inoculation state, country lived in or visited, and many other genetic or environmental factors, it is advisable to examine samples when drawn to see whether they are, on average, comparable in these respects. The random process of selection is intended to make them so, but sometimes it can by chance lead to disparities. To guard against this possibility the sampling may be stratified. This means that a framework is laid down initially, and the patients or objects of the study in a random sample are then allotted to the compartments of the framework. For instance, the framework might have a primary division into males and females and then a secondary division of each of those categories into five age groups, the result being a framework with ten compartments.
It is then important to bear in mind that the distributions of the categories in two samples made up on such a framework may be truly comparable, but they will not reflect the distribution of these categories in the population from which the sample is drawn unless the compartments in the framework have been designed with that in mind. For instance, equal numbers might be admitted to the male and female categories, but males and females are not equally numerous in the general population, and their relative proportions vary with age. This is known as stratified random sampling. To take a sample from a long list, a compromise between strict theory and practicalities is known as a systematic random sample. In this case we choose subjects a fixed interval apart on the list, say every tenth subject, but we choose the starting point within the first interval at random.

### Unbiasedness and precision

The terms unbiased and precision have acquired special meanings in statistics. When we say that a measurement is unbiased we mean that the average of a large set of unbiased measurements will be close to the true value. When we say it is precise we mean that it is repeatable. Repeated measurements will be close to one another, but not necessarily close to the true value. We would like a measurement that is both accurate and precise. Some authors equate unbiasedness with accuracy, but this is not universal and others use the term accuracy to mean a measurement that is both unbiased and precise. Strike [5] gives a good discussion of the problem.

An estimate of a parameter taken from a random sample is known to be unbiased. As the sample size increases, it gets more precise.

### Randomisation

Another use of random number tables is to randomise the allocation of treatments to patients in a clinical trial. This ensures that there is no bias in treatment allocation and, in the long run, the subjects in each treatment group are comparable in both known and unknown prognostic factors. A common method is to use blocked randomisation. This is to ensure that at regular intervals there are equal numbers in the two groups. Usual sizes for blocks are two, four, six, eight, and ten. Suppose we chose a block size of ten. A simple method using Table F (Appendix) is to choose the first five unique digits in any row. If we chose the first row, the first five unique digits are 3, 5, 6, 8 and 4. Thus we would allocate the third, fourth, fifth, sixth and eighth subjects to one treatment and the first, second, seventh, ninth, and tenth to the other. If the block size was less than ten we would ignore digits bigger than the block size.
To allocate further subjects to treatment, we carry on along the same row, choosing the next five unique digits for the first treatment. In randomised controlled trials it is advisable to change the block size from time to time to make it more difficult to guess what the next treatment is going to be.

It is important to realise that patients in a randomised trial are not a random sample from the population of people with the disease in question but rather a highly selected set of eligible and willing patients. However, randomisation ensures that in the long run any differences in outcome in the two treatment groups are due solely to differences in treatment.
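The three uses of chance described in this chapter (a simple random sample of 5 from a population of 150, a systematic sample with a random starting point, and blocked randomisation with a block size of ten) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's pseudo-random number generator in place of Table F, it shuffles each block rather than reading off the first five unique digits of a row, and the function names, the treatment labels A and B and the seed are all hypothetical.

```python
import random


def simple_random_sample(population_size, sample_size, rng):
    """Return sample_size distinct subject numbers from 1..population_size."""
    return rng.sample(range(1, population_size + 1), sample_size)


def systematic_sample(population_size, interval, rng):
    """Random start within the first interval, then every interval-th subject."""
    start = rng.randint(1, interval)
    return list(range(start, population_size + 1, interval))


def blocked_allocation(n_blocks, block_size, rng):
    """Allocate treatments A and B with equal numbers inside every block."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even to split into two equal groups")
    schedule = []
    for _ in range(n_blocks):
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)  # chance alone decides the order within the block
        schedule.extend(block)
    return schedule


rng = random.Random(2002)  # arbitrary seed (an assumption) so the run can be repeated
sample = simple_random_sample(150, 5, rng)
systematic = systematic_sample(150, 10, rng)
schedule = blocked_allocation(n_blocks=3, block_size=10, rng=rng)

print("Simple random sample:", sorted(sample))
print("Systematic sample:", systematic)
print("Blocked allocation:", "".join(schedule))
```

Shuffling a block that already holds five of each treatment gives the same guarantee as the table method: after every tenth subject the two groups are equal in size. In a real trial the schedule would be prepared and concealed before recruitment begins.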

### Variation between samples

Even if we ensure that every member of a population has a known, and usually an equal, chance of being included in a sample, it does not follow that a series of samples drawn from one population and fulfilling this criterion will be identical. They will show chance variations from one to another, and the variation may be slight or considerable. For example, a series of samples of the body temperature of healthy people would show very little variation from one to another, but the variation between samples of the systolic blood pressure would be considerable. Thus the variation between samples depends partly on the amount of variation in the population from which they are drawn.

Furthermore, it is a matter of common observation that a small sample is a much less certain guide to the population from which it was drawn than a large sample. In other words, the greater the number of members of a population that are included in a sample the more chance will that sample have of accurately representing the population, provided a random process is used to construct the sample. A consequence of this is that, if two or more samples are drawn from a population, the larger they are the more likely they are to resemble each other, again provided that the random sampling technique is followed. Thus the variation between samples depends partly also on the size of the sample. Usually, however, we are not in a position to take a random sample; our sample is simply those subjects available for study. This is a “convenience” sample.
For valid generalisations to be made we would like to assert that our sample is in some way representative of the population as a whole and for this reason the first stage in a report is to describe the sample, say by age, sex and disease status, so that other readers can decide if it is representative of the type of patients they encounter.

### Standard error of the mean

If we draw a series of samples and calculate the mean of the observations in each, we have a series of means. These means generally conform to a Normal distribution, and they often do so even if the observations from which they were obtained do not (see Exercise 3.3). This can be proven mathematically and is known as the central limit theorem. The series of means, like the series of observations in each sample, has a standard deviation. The standard error of the mean of one sample is an estimate of the standard deviation that would be obtained from the means of a large number of samples drawn from that population.

As noted above, if random samples are drawn from a population their means will vary from one to another. The variation depends on the variation of the population and the size of the sample. We do not know the variation in the population so we use the variation in the sample as an estimate of it. This is expressed in the standard deviation. If we now divide the standard deviation by the square root of the number of observations in the sample we have an estimate of the standard error of the mean, $\mathrm{SEM} = \mathrm{SD}/\sqrt{n}$. It is important to realise that we do not have to take repeated samples in order to estimate the standard error: there is sufficient information within a single sample. However, the conception is that if we were to take repeated random samples from the population, this is how we would expect the mean to vary, purely by chance.

A general practitioner in Yorkshire has a practice which includes part of a town with a large printing works and some of the adjacent sheep farming country. With her patients’ informed consent she has been investigating whether the diastolic blood pressure of men aged 20–44 differs between the printers and the farm workers.
For this purpose she has obtained a random sample of 72 printers and 48 farm workers and calculated the means and standard deviations shown in Table 3.1.

Table 3.1 Mean diastolic blood pressures of printers and farmers.

| | Number | Mean diastolic blood pressure (mmHg) | Standard deviation (mmHg) |
|---|---|---|---|
| Printers | 72 | 88 | 4·5 |
| Farmers | 48 | 79 | 4·2 |

To calculate the standard errors of the two mean blood pressures the standard deviation of each sample is divided by the square root of the number of the observations in the sample.

Printers: SEM = 4·5/√72 = 0·53 mmHg

Farmers: SEM = 4·2/√48 = 0·61 mmHg

These standard errors may be used to study the significance of the difference between the two means, as described in successive chapters.
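As a check on the arithmetic, the two standard errors can be reproduced from the figures in Table 3.1. The sketch below is illustrative only; the function name and the dictionary layout are assumptions made for the example.

```python
import math


def standard_error_of_mean(sd, n):
    """Estimate the SEM from a single sample: SD divided by the square root of n."""
    return sd / math.sqrt(n)


# (number, mean diastolic pressure in mmHg, standard deviation in mmHg) from Table 3.1
groups = {"Printers": (72, 88.0, 4.5), "Farmers": (48, 79.0, 4.2)}

sem = {name: standard_error_of_mean(sd, n) for name, (n, mean, sd) in groups.items()}

for name, value in sem.items():
    print(f"{name}: SEM = {value:.2f} mmHg")
```

The farmers have the smaller standard deviation but the larger standard error, because their sample is smaller.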

### Standard error of a proportion or a percentage

Just as we can calculate a standard error associated with a mean, so we can also calculate a standard error associated with a percentage or a proportion. Here the size of the sample will affect the size of the standard error but the amount of variation is determined by the value of the percentage or proportion in the population itself, and so we do not need an estimate of the standard deviation. For example, a senior surgical registrar in a large hospital is investigating acute appendicitis in people aged 65 and over. As a preliminary study he examines the hospital case notes over the previous 10 years and finds that of 120 patients in this age group with a diagnosis confirmed at operation 73 (60·8%) were women and 47 (39·2%) were men.

If p represents one percentage, 100 − p represents the other. Then the standard error of each of these percentages is obtained by (1) multiplying them together, (2) dividing the product by the number in the sample, and (3) taking the square root:

$$\text{SE percentage} = \sqrt{\frac{p(100 - p)}{n}}$$

which for the appendicitis data given above is as follows:

$$\text{SE percentage} = \sqrt{\frac{60{\cdot}8 \times 39{\cdot}2}{120}} = 4{\cdot}46$$

### Problems with non-random samples

In general we do not have the luxury of a random sample; we have to make do with what is available, a convenience sample. In order to be able to make generalisations we should investigate whether biases could have crept in, which mean that the patients available are not typical. Common biases are:

• hospital patients are not the same as ones seen in the community;
• volunteers are not typical of non-volunteers;
• patients who return questionnaires are different from those who do not.

In order to persuade the reader that the patients included are typical it is important to give as much detail as possible at the beginning of a report of the selection process and some demographic data such as age, sex, social class and response rate.

### Common questions

#### What is an acceptable response rate from a survey?

If one were taking a sample to estimate some population parameter, then one would like as high a response rate as possible. It is conventional to accept 65–70% as reasonable. Note, however, that valid inferences can be made on much smaller response rates provided no biases in response occur. If one has data on the non-responders such as age or gender it is useful to report it with the responders, to see if there are any obvious discrepancies.

#### Given measurements on a sample, what is the difference between a standard deviation and a standard error?

A sample standard deviation is an estimate of the population parameter σ; that is, it is an estimate of the variability of the observations. Since the population is unique, it has a unique standard deviation, which may be large or small depending on how variable are the observations. We would not expect the sample standard deviation to get smaller because the sample gets larger. However, a large sample would provide a more precise estimate of the population standard deviation σ than a small sample.

A standard error, on the other hand, is a measure of precision of an estimate of a population parameter.
A standard error is always attached to a parameter, and one can have standard errors of any estimate, such as mean, median, fifth centile, even the standard error of the standard deviation. Since one would expect the precision of the estimate to increase with the sample size, the standard error of an estimate will decrease as the sample size increases.

#### When should I use a standard deviation to describe data and when should I use a standard error?

It is a common mistake to try and use the standard error to describe data. Usually it is done because the standard error is smaller, and so the study appears more precise. If the purpose is to describe the data (for example, so that one can see if the patients are typical) and if the data are plausibly Normal, then one should use the standard deviation (mnemonic D for Description and D for Deviation). If the purpose is to describe the outcome of a study, for example to estimate the prevalence of a disease, or the mean height of a group, then one should use a standard error (or, better, a confidence interval; see Chapter 4) (mnemonic E for Estimate and E for Error). Thus, in a paper one would describe the sample in the first table (and so give, say, SD of blood pressure, and age range) and give estimates of the effects in the second table (and so give SEs of blood pressure differences).

### Reading and reporting populations and samples

• Report carefully how the sample was chosen. Were the subjects patients who happened to be about? Were they consecutive patients who satisfied certain criteria?
• Report differences in defining demographics between the sample and the remaining population.
• List the reasons for exclusions, and compare responders and non-responders for key variables. In questionnaire surveys, describe how the target population was obtained, and give numbers of people who refused to complete the questionnaire or who were “not available”.
• When reading a paper, from the information given ask whether the results are generalisable.

### Exercises

#### Exercise 3.1

The mean urinary lead concentration in 140 children was 2·18 µmol/24 h, with standard deviation 0·87. What is the standard error of the mean?

#### Exercise 3.2

In Table F, what is the distribution of the digits, and what are the mean and standard deviation?

#### Exercise 3.3

For the first column of five digits in Table F take the mean value of the five digits, and do this for all rows of five digits in the column. What would you expect a histogram of the means to look like? What would you expect the mean and standard deviation to be?

### References

1. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
2. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford: Blackwell Scientific Publications, 1994.
3. Campbell MJ, Machin D. Medical statistics: a commonsense approach, 3rd ed. Chichester: John Wiley, 1999.
4. Fisher RA, Yates F. Statistical tables for biological, agricultural and medical research, 6th ed. London: Longman, 1974.
5. Strike PW. Statistical methods in laboratory medicine. Oxford: Butterworth-Heinemann, 1991:255.

## 4 Statements of probability and confidence intervals

We have seen that when a set of observations has a Normal distribution multiples of the standard deviation mark certain limits on the scatter of the observations. For instance, 1·96 (or approximately 2) standard deviations above and 1·96 standard deviations below the mean (± 1·96 SD) mark the points within which 95% of the observations lie.

### Reference ranges

We noted in Chapter 1 that 140 children had a mean urinary lead concentration of 2·18 µmol/24 h, with standard deviation 0·87. The points that include 95% of the observations are 2·18 ± (1·96 × 0·87), giving an interval of 0·48 to 3·89. One of the children had a urinary lead concentration just over 4·0 µmol/24 h. This observation is greater than 3·89 and so falls in the 5% beyond the 95% probability limits. We can say that the probability of each of such observations occurring is 5%. Another way of looking at this is to see that if one chose one child at random out of the 140, the chance that their urinary lead concentration exceeded 3·89, or was less than 0·48, would be 5%. This probability is usually expressed as a fraction of 1 rather than of 100, and written P < 0·05.

Standard deviations thus set limits about which probability statements can be made. Some of these are set out in Table A in the Appendix.
To use Table A to estimate the probability of finding an observed value, say a urinary lead concentration of 4·8 µmol/24 h, in sampling from the same population of observations as the 140 children provided, we proceed as follows. The distance of the new observation from the mean is 4·8 − 2·18 = 2·62. How many standard deviations does this represent? Dividing the difference by the standard deviation gives 2·62/0·87 = 3·01. This number is greater than 2·576 but less than 3·291 in Table A, so the probability of finding a deviation as large as or more extreme than this lies between 0·01 and 0·001, which may be expressed as 0·001 < P < 0·01.
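The reference range and the probability statement for the urinary lead data can also be obtained without a table. The sketch below assumes a Normal distribution and replaces the lookup in Table A with the complementary error function from Python's standard library; the function names are hypothetical. Because it starts from the rounded mean and standard deviation quoted in the text, the last digit of a limit may differ slightly from the printed figures.

```python
import math


def reference_range(mean, sd, z=1.96):
    """Limits within which about 95% of Normally distributed observations lie."""
    return mean - z * sd, mean + z * sd


def two_sided_p(value, mean, sd):
    """Number of SDs from the mean, and the chance of a deviation at least that large."""
    z = abs(value - mean) / sd
    # For a standard Normal variable Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return z, math.erfc(z / math.sqrt(2))


low, high = reference_range(2.18, 0.87)
z, p = two_sided_p(4.8, 2.18, 0.87)

print(f"95% reference range: {low:.2f} to {high:.2f}")
print(f"z = {z:.2f}, two-sided P = {p:.4f}")
```

The computed probability falls between the two tabulated limits, in agreement with the statement that P lies between 0·001 and 0·01.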

range of two standard errors above and two below the mean of these means. This common mean would be expected to lie very close to the mean of the population. So the standard error of a mean provides a statement of probability about the difference between the mean of the population and the mean of the sample.

In our sample of 72 printers, the standard error of the mean was 0·53 mmHg. The sample mean plus or minus 1·96 times its standard error gives the following two figures:

88 + (1·96 × 0·53) = 89·04 mmHg

88 − (1·96 × 0·53) = 86·96 mmHg

This is called the 95% confidence interval (CI), and we can say that there is only a 5% chance that the interval 86·96 to 89·04 mmHg excludes the mean of the population. If we took the mean plus or minus three times its standard error, the interval would be 86·41 to 89·59. This is the 99·7% CI, and the chance of this interval excluding the population mean is 1 in 370. Confidence intervals provide the key to a useful device for arguing from a sample back to the population from which it came.

The standard error for the percentage of male patients with appendicitis, described in Chapter 3, was 4·46. This is also the standard error of the percentage of female patients with appendicitis, since the formula remains the same if p is replaced by 100 − p. With this standard error we can get 95% CIs on the two percentages:

60·8 ± (1·96 × 4·46) = 52·1 to 69·5

39·2 ± (1·96 × 4·46) = 30·5 to 47·9

These CIs exclude 50%. Can we conclude that females are more likely to get appendicitis? This is the subject of the rest of the book, namely inference.

With small samples (say, under 30 observations) larger multiples of the standard error are needed to set confidence limits.
This subject is discussed under the t distribution (Chapter 7).

There is much confusion over the interpretation of the probability attached to CIs. To understand it we have to resort to the concept of repeated sampling. Imagine taking repeated samples of the same size from the same population. For each sample calculate a 95% CI. Since the samples are different, so are the CIs. We know that 95% of these intervals will include the population parameter. However, without any additional information we cannot say which ones! Thus with only one sample, and no other information about the population parameter, we can say there is a 95% chance of including the parameter in our interval. Note that this does not mean that we would expect with 95% probability that the mean from another sample is in this interval. In this case we are considering differences between two sample means, which is the subject of the next chapter.

### Common questions

#### What is the difference between a reference range and a confidence interval?

There is precisely the same relationship between a reference range and a confidence interval as between the standard deviation and the standard error. The reference range refers to individuals and the confidence interval to estimates. It is important to realise that samples are not unique. Different investigators taking samples from the same population will obtain different estimates of the population parameter, and have different 95% confidence intervals. However, we know that for 95 of every 100 investigators the confidence interval will include the population parameter (we just don’t know which ones).

### Reading and reporting confidence intervals

• In general, confidence intervals are best restricted to the main outcome of a study, which is often a contrast (that is, a difference) between means or percentages.
There is now a great emphasis on CIs in the literature, and some authors attach CIs to every estimate in a paper, which is not a good idea!
• The Vancouver guidelines state: “When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals)” [1].
• To avoid confusion with negative numbers it is best to quote a confidence interval as ‘a to b’ rather than ‘a−b’.
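The large-sample confidence intervals worked through in this chapter, for the printers' mean blood pressure and for the appendicitis percentages, can be reproduced with the short sketch below. It is an illustration rather than a general-purpose routine: the function names are assumptions, the multiplier 1·96 applies only to large samples, and the intervals are printed in the ‘a to b’ form recommended above.

```python
import math


def confidence_interval(estimate, se, z=1.96):
    """Large-sample interval: the estimate plus or minus z standard errors."""
    return estimate - z * se, estimate + z * se


def se_percentage(p, n):
    """Standard error of a percentage p observed in a sample of size n."""
    return math.sqrt(p * (100 - p) / n)


mean_ci = confidence_interval(88, 0.53)   # printers: mean 88 mmHg, SEM 0.53 mmHg
se_p = se_percentage(60.8, 120)           # the same value results for 39.2%
women_ci = confidence_interval(60.8, se_p)
men_ci = confidence_interval(39.2, se_p)

print(f"Printers, 95% CI: {mean_ci[0]:.2f} to {mean_ci[1]:.2f} mmHg")
print(f"Women, 95% CI: {women_ci[0]:.1f}% to {women_ci[1]:.1f}%")
print(f"Men, 95% CI: {men_ci[0]:.1f}% to {men_ci[1]:.1f}%")
```

Passing `z=3` to `confidence_interval` gives the wider interval that the text describes as the 99·7% CI.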

### Exercises

#### Exercise 4.1

A count of malaria parasites in 100 fields with a 2 mm oil immersion lens gave a mean of 35 parasites per field, standard deviation 11·6 (note that, although the counts are quantitative discrete, the counts can be assumed to follow a Normal distribution because the average is large). On counting one more field the pathologist found 52 parasites. Does this number lie outside the 95% reference range? What is the reference range?

#### Exercise 4.2

What is the 95% confidence interval for the mean of the population from which this sample count of parasites was drawn?

### Reference

1 International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. BMJ 1988;296:401–5.

## 5 Differences between means: type I and type II errors and power

We saw in Chapter 3 that the mean of a sample has a standard error, and a mean that departs by more than twice its standard error from the population mean would be expected by chance only in about 5% of samples. Likewise, the difference between the means of two samples has a standard error. We do not usually know the population mean, so we may suppose that the mean of one of our samples estimates it. The sample mean may happen to be identical with the population mean, but it more probably lies somewhere above or below the population mean, and there is a 95% chance that it is within 1·96 standard errors of it.

Consider now the mean of the second sample. If the sample comes from the same population its mean will also have a 95% chance of lying within 1·96 standard errors of the population mean, but if we do not know the population mean we have only the means of our samples to guide us. Therefore, if we want to know whether they are likely to have come from the same population, we ask whether they lie within a certain interval, represented by their standard errors, of each other.

### Large-sample standard error of difference between means

If $SD_1$ represents the standard deviation of sample 1 and $SD_2$ the standard deviation of sample 2, $n_1$ the number in sample 1 and $n_2$ the number in sample 2, a formula denoting the standard error of the difference between two means is:

$$\mathrm{SE(diff)} = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}} \tag{5.1}$$

The computation is straightforward. Square the standard deviation of sample 1 and divide by the number of observations in the sample:

$$SD_1^2 / n_1$$

Square the standard deviation of sample 2 and divide by the number of observations in the sample:

$$SD_2^2 / n_2$$

Add these:

$$\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}$$

Take the square root, to give equation (5.1). This is an estimate of the standard error of the difference between the two means which is used when both the samples are large (more than 30 subjects).

### Large-sample confidence interval for the difference in two means

From the data in Table 3.1 the general practitioner wants to compare the mean of the printers' blood pressures with the mean of the farmers' blood pressures. The figures are set out first as in Table 5.1 (which repeats Table 3.1).

Table 5.1 Mean diastolic blood pressures of printers and farmers.

| | Number | Mean diastolic blood pressure (mmHg) | Standard deviation (mmHg) |
|---|---|---|---|
| Printers | 72 | 88 | 4·5 |
| Farmers | 48 | 79 | 4·2 |

Analysing these figures in accordance with the formula given above, we have:

$$\mathrm{SE(diff)} = \sqrt{\frac{4.5^2}{72} + \frac{4.2^2}{48}} = 0.805\ \text{mmHg}$$
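The arithmetic of equation (5.1) can be checked with a few lines of Python. This is a minimal sketch using the figures of Table 5.1; the function name `se_diff_means` is an assumption made for illustration.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Large-sample standard error of the difference between two means (equation 5.1)."""
    # Square each SD, divide by its sample size, add, then take the square root
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Table 5.1: printers (n = 72, SD = 4.5) and farmers (n = 48, SD = 4.2)
se = se_diff_means(4.5, 72, 4.2, 48)
print(round(se, 3))  # 0.805 (mmHg)
```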

The difference between the means is 88 − 79 = 9 mmHg. For large samples we can calculate a 95% confidence interval for the difference in means as

9 − 1·96 × 0·805 to 9 + 1·96 × 0·805

which is

7·42 to 10·58 mmHg

For a small sample we need to modify this procedure, as described in Chapter 7.

### Null hypothesis and type I error

In comparing the mean blood pressures of the printers and the farmers we are testing the hypothesis that the two samples came from the same population of blood pressures. The hypothesis that there is no difference between the population from which the printers' blood pressures were drawn and the population from which the farmers' blood pressures were drawn is called the null hypothesis.

But what do we mean by "no difference"? Chance alone will almost certainly ensure that there is some difference between the sample means, for they are most unlikely to be identical. Consequently, we set limits within which we shall regard the samples as not having any significant difference. If we set the limits at twice the standard error of the difference, and regard a mean outside this range as coming from another population, we shall on average be wrong about one time in 20 if the null hypothesis is in fact true. If we do obtain a mean difference bigger than two standard errors we are faced with two choices: either an unusual event has happened, or the null hypothesis is incorrect. Imagine tossing a coin five times and getting the same face each time. This has nearly the same probability (6·3%) as obtaining a mean difference bigger than two standard errors when the null hypothesis is true. Do we regard it as a lucky event or suspect a biased coin? If we are unwilling to believe in unlucky events, we reject the null hypothesis, in this case that the coin is a fair one.
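Two of the figures quoted above can be verified directly: the large-sample 95% confidence interval for the difference in blood pressures, and the comparison between the coin-tossing probability and the chance of a difference beyond two standard errors. The sketch below assumes a Normal sampling distribution and uses Python's standard `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

diff, se = 88 - 79, 0.805            # difference and standard error from the text
z95 = NormalDist().inv_cdf(0.975)    # two-sided 95% multiplier, about 1.96
lower, upper = diff - z95 * se, diff + z95 * se
print(f"{lower:.2f} to {upper:.2f} mmHg")  # 7.42 to 10.58 mmHg

# Five tosses of a fair coin all showing the same face (all heads or all tails)
p_coin = 2 * 0.5 ** 5
# Two-sided probability of a difference more than two standard errors from zero
# when the null hypothesis is true
p_two_se = 2 * (1 - NormalDist().cdf(2))
print(round(p_coin, 4), round(p_two_se, 4))  # 0.0625 0.0455
```

The two probabilities, about 6·3% and about 4·6%, are of similar size, which is the point of the coin analogy.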

To reject the null hypothesis when it is true is to make what is known as a type I error. The level at which a result is declared significant is known as the type I error rate, often denoted by α. We try to show that a null hypothesis is unlikely, not its converse (that it is likely), so a difference which is greater than the limits we have set, and which we therefore regard as "significant", makes the null hypothesis unlikely. However, a difference within the limits we have set, and which we therefore regard as "non-significant", does not make the hypothesis likely. To repeat an old adage, absence of evidence is not evidence of absence.

A range of not more than two standard errors is often taken as implying "no difference", but there is nothing to stop investigators choosing a range of three standard errors (or more) if they want to reduce the chances of a type I error.

### Testing for differences of two means

To find out whether the difference in blood pressure of printers and farmers could have arisen by chance the general practitioner erects the null hypothesis that there is no significant difference between them. The question is, how many multiples of its standard error does the difference in means represent? Since the difference in means is 9 mmHg and its standard error is 0·805 mmHg, the answer is: 9/0·805 = 11·2. We usually denote the ratio of an estimate to its standard error by z, that is, z = 11·2. Reference to Table A in the Appendix shows that z is far beyond the figure of 3·291 standard deviations, representing a probability of 0·001 (or 1 in 1000).
The probability of a difference of 11·2 standard errors or more occurring by chance is therefore exceedingly low, and correspondingly the null hypothesis that these two samples came from the same population of observations is exceedingly unlikely. The probability is known as the P value and may be written P

other approach is to compute the probability of getting the observed value, or one that is more extreme, if the null hypothesis were correct. This is the P value. If this is less than a specified level (usually 5%) then the result is declared significant and the null hypothesis is rejected. These two approaches, the estimation and hypothesis testing approach, are complementary. Imagine that the 95% CI just captured the value zero; what would be the P value? A moment's thought should convince one that it is 2·5%. This is known as a one sided P value, because it is the probability of getting the observed result or one bigger than it. However, the 95% CI is two sided, because it excludes not only the 2·5% above the upper limit but also the 2·5% below the lower limit. To support the complementarity of the confidence interval approach and the null hypothesis testing approach, most authorities double the one sided P value to obtain a two sided P value (see below for the distinction between one sided and two sided tests).

Sometimes an investigator knows a mean from a very large number of observations and wants to compare the mean of her sample with it. We may not know the standard deviation of the large number of observations or the standard error of their mean but this need not hinder the comparison if we can assume that the standard error of the mean of the large number of observations is near zero or at least very small in relation to the standard error of the mean of the small sample.

This is because in equation (5.1) for calculating the standard error of the difference between the two means, when $n_1$ is very large then $SD_1^2/n_1$ becomes so small as to be negligible.
The formula thus reduces to

$$\sqrt{\frac{SD_2^2}{n_2}}$$

which is the same as that for standard error of the sample mean. Consequently, we find the standard error of the mean of the sample and divide it into the difference between the means.

For example, a large number of observations has shown that the mean count of erythrocytes (measured × 10¹² per litre) in men is 5·5. In a sample of 100 men a mean count of 5·35 was found, with standard deviation 1·1. The standard error of this mean is $SD/\sqrt{n}$, 1·1/√100 = 0·11. The difference between the two means is 5·5 − 5·35 = 0·15. This difference, divided by the standard error, gives z = 0·15/0·11 = 1·36. This figure is well below the 5% level of 1·96

and in fact is below the 10% level of 1·645 (see Table A). We therefore conclude that the difference could have arisen by chance.

### Alternative hypothesis and type II error

It is important to realise that when we are comparing two groups a non-significant result does not mean that we have proved the two samples come from the same population; it simply means that we have failed to prove that they do not come from the same population.

When planning studies it is useful to think of what differences are likely to arise between the two groups, or what would be clinically worthwhile; for example, what do we expect to be the improved benefit from a new treatment in a clinical trial? This leads to a study hypothesis, which is a difference we would like to demonstrate. To contrast the study hypothesis with the null hypothesis, it is often called the alternative hypothesis. If we do not reject the null hypothesis when in fact there is a difference between the groups, we make what is known as a type II error. The type II error rate is often denoted as β. The power of a study is defined as 1 − β and is the probability of rejecting the null hypothesis when it is false. The most common reason for type II errors is that the study is too small.

The relationship between type I and type II errors is shown in Table 5.2. One has to imagine a series of cases, in some of which the null hypothesis is true and in some of which it is false.
In either situation we carry out a significance test, which sometimes is significant and sometimes not.

The concept of power is only relevant when a study is being planned (see Chapter 13 for sample size calculations). After a study has been completed, we wish to make statements not about hypothetical alternative hypotheses but about the data, and the way to do this is with estimates and confidence intervals.¹

### Common questions

#### Why is the P value not the probability that the null hypothesis is true?

A moment's reflection should convince you that the P value could not be the probability that the null hypothesis is true. Suppose we got exactly the same value for the mean in two samples (if the samples were small and the observations coarsely rounded this would not be uncommon); the difference between the means is zero. The probability of getting the observed result (zero)

or a result more extreme (a result that is either positive or negative) is unity; that is, we can be certain that we must obtain a result which is positive, negative or zero. However, we can never be certain that the null hypothesis is true, especially with small samples, so clearly the statement that the P value is the probability that the null hypothesis is true is in error. We can think of it as a measure of the strength of evidence against the null hypothesis, but since it is critically dependent on the sample size we should not compare P values to argue that a difference found in one group is more "significant" than a difference found in another.

Table 5.2 Relationship between type I and type II errors.

| Test result | Null hypothesis false | Null hypothesis true |
|---|---|---|
| Significant | Power | Type I error |
| Not significant | Type II error | |

#### What is the difference between a one sided and a two sided test?

Consider a test to compare the population means of two groups A and B. A one sided test considers as an alternative hypothesis that the mean of A is greater than the mean of B. A two sided test considers as an alternative that the mean of A is either greater or less than that of B. Since the one sided test requires a stronger assumption, it is more powerful. However, it leaves one in a dilemma if the observed mean of A is much less than the observed mean of B. In theory one cannot abandon the one sided alternative hypothesis and choose a two sided one instead. Since, in most cases, we are genuinely uncertain as to which direction to choose, two sided tests are almost universal.
The main exception may be when one is trying to show two treatments are equivalent, for example lumpectomy compared with radical mastectomy for breast cancer. Then the question is whether lumpectomy is worse for survival, and we are not worried if it might be better.

### Reading and reporting P values

• If possible present a precise P value (e.g. P = 0·031) rather than a range such as 0·02 < P < 0·05 (see Chapters 7 and 8).

• It is unnecessary to go beyond two significant figures for P values, and for small values P < 0·001 will usually suffice.
• Some computer programs give small values as P = 0·000. Report P < 0·001 instead.

### Exercises

#### Exercise 5.1

In one group of 62 patients with iron deficiency anaemia the haemoglobin level was 12·2 g/dl, standard deviation 1·8 g/dl; in another group of 35 patients it was 10·9 g/dl, standard deviation 2·1 g/dl.

What is the standard error of the difference between the two means, and what is the significance of the difference? What is the difference? Give an approximate 95% confidence interval for the difference.

#### Exercise 5.2

If the mean haemoglobin level in the general population is taken as 14·4 g/dl, what is the standard error of the difference between the mean of the first sample and the population mean and what is the significance of this difference?

### Reference

1 Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Books, 2000.

## 6 Confidence intervals for summary statistics of binary data

### Standard error of difference between percentages or proportions

The surgical registrar who investigated appendicitis cases, referred to in Chapter 3, wonders whether the percentages of men and women in the sample differ from the percentages of all the other men and women aged 65 and over admitted to the surgical wards during the same period. After excluding his sample of appendicitis cases, so that they are not counted twice, he makes a rough estimate of the number of patients admitted in those 10 years and finds it to be between about 12 000 and 13 000. He selects a systematic random sample of 640 patients, of whom 363 (56·7%) are women and 277 (43·3%) are men.

The percentage of women in the appendicitis sample was 60·8% and differs from the percentage of women in the general surgical sample by 60·8 − 56·7 = 4·1%. Is this difference of any significance? In other words, could it have arisen by chance?

There are two ways of calculating the standard error of the difference between two percentages: one is based on the null hypothesis that the two groups come from the same population; the other on the alternative hypothesis that they are different. For Normally distributed variables these two are the same if the standard deviations are assumed to be the same, but in the binary case the standard deviations depend on the estimates of the proportions, and so if these are different then so are the standard deviations. Usually, however, even in the binary case, both methods give almost the same result.
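The claim that the two methods give almost the same result can be checked numerically for the appendicitis data, using the two formulas set out in the sections that follow. This is only a sketch: the function names are assumptions, and the counts (73 women among 120 appendicitis cases, 363 women among 640 other surgical patients) are those used later in the chapter.

```python
import math

def se_alternative(x1, n1, x2, n2):
    """SE of a difference in percentages, each sample using its own percentage."""
    p1, p2 = 100 * x1 / n1, 100 * x2 / n2
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

def se_null(x1, n1, x2, n2):
    """SE of a difference in percentages using the pooled percentage p."""
    p = 100 * (x1 + x2) / (n1 + n2)
    return math.sqrt(p * (100 - p) / n1 + p * (100 - p) / n2)

# Women: 73 of 120 appendicitis cases, 363 of 640 other surgical patients
print(round(se_alternative(73, 120, 363, 640), 2))  # 4.87
print(round(se_null(73, 120, 363, 640), 2))         # 4.92
```

The first figure is the one used for the confidence interval and the second the one used for the significance test; here they differ by about 1%.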

### Confidence interval for a difference in proportions or percentages

The calculation of the standard error of a difference in proportions $p_1 - p_2$ follows the same logic as the calculation of the standard error of the difference between two means: sum the squares of the individual standard errors and then take the square root. It is based on the alternative hypothesis that there is a real difference in proportions (further discussion on this point is given in the 'Common questions' section at the end of this chapter).

$$\mathrm{SE}(p_1 - p_2) = \sqrt{\frac{p_1(100 - p_1)}{n_1} + \frac{p_2(100 - p_2)}{n_2}}$$

Note that this is an approximate formula; the exact one would use the population proportions rather than the sample estimates.

With our appendicitis data we have

$$\sqrt{\frac{60.8 \times 39.2}{120} + \frac{56.7 \times 43.3}{640}} = 4.87$$

Thus a 95% confidence interval for the difference in percentages is

4·1 − 1·96 × 4·87 to 4·1 + 1·96 × 4·87 = −5·4 to 13·6%.

### Significance test for a difference in two proportions

For a significance test we have to use a slightly different formula, based on the null hypothesis that both samples have a common population proportion, estimated by p:

$$\mathrm{SE(diff\ \%)} = \sqrt{\frac{p(100 - p)}{n_1} + \frac{p(100 - p)}{n_2}}$$

To obtain p we must amalgamate the two samples and calculate the percentage of women in the two combined; 100 − p is then the

percentage of men in the two combined. The numbers in each sample are $n_1$ and $n_2$. We have:

Number of women in the samples: 73 + 363 = 436
Number of people in the samples: 120 + 640 = 760
Percentage of women: (436 × 100)/760 = 57·4
Percentage of men: (324 × 100)/760 = 42·6

Putting these numbers in the formula, we find the standard error of the difference between the percentages is

$$\sqrt{\frac{57.4 \times 42.6}{120} + \frac{57.4 \times 42.6}{640}} = 4.92$$

This is very close to the standard error estimated under the alternative hypothesis.

The difference between the percentage of women (and men) in the two samples was 4·1%. To find the probability attached to this difference we divide it by its standard error: z = 4·1/4·92 = 0·83. From Table A in the Appendix we find that P is about 0·4 and so the difference between the percentages in the two samples could have been due to chance alone, as might have been expected from the confidence interval. Note that this test gives results identical to those obtained by the χ² test without continuity correction (described in Chapter 8).

### Confidence interval for an odds ratio

The appendicitis data can be rewritten as a 2 × 2 table, given in Table 6.1.

Table 6.1 Appendicitis data.

| | Surgical cases (not appendicitis) | Appendicitis cases |
|---|---|---|
| Males | 277 (a) | 47 (b) |
| Females | 363 (c) | 73 (d) |
| Total | 640 | 120 |

As discussed in Chapter 2, the odds ratio associated with this table is OR = ad/bc = (277 × 73)/(47 × 363) = 1·185. This means that women are at about 20% greater risk of having

appendicitis than men. It turns out to be easier to calculate the standard error of the $\log_e \mathrm{OR}$. The standard error is given by:¹

$$\mathrm{SE}(\log_e \mathrm{OR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

For the appendicitis data we get

$$\mathrm{SE}(\log_e \mathrm{OR}) = \sqrt{\frac{1}{277} + \frac{1}{47} + \frac{1}{363} + \frac{1}{73}} = 0.203$$

The $\log_e$ of the odds ratio is 0·170. Thus a 95% confidence interval for the $\log_e \mathrm{OR}$ is given by:

0·170 − 1·96 × 0·203 to 0·170 + 1·96 × 0·203

or −0·229 to 0·568

To get a 95% CI for OR we need to take anti-logs ($e^x$). Thus a 95% CI for the odds ratio is $e^{-0.229}$ to $e^{0.568}$, which is 0·80 to 1·77. Note how this confidence interval is not symmetric about the odds ratio of 1·19, in contrast to that for a difference in proportions.

If there was no difference in the proportion of males to females for the two surgical groups, we would expect the OR to be 1. Thus if the 95% CI excludes 1, we can say there is a significant difference between the groups. In this case, the confidence interval includes 1, in agreement with the earlier lack of significance from the statistical test.

### Standard error of a total

The total number of deaths in a town from a particular disease varies from year to year. If the population of the town or area where they occur is fairly large, say, some thousands, and provided that the deaths are independent of one another, the standard error of the number of deaths from a specified cause is given approximately by its square root, $\sqrt{n}$. Further, the standard error of the difference between two numbers of deaths, $n_1$ and $n_2$, can be taken as $\sqrt{n_1 + n_2}$.
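As an aside, the odds ratio interval worked through earlier on this page can be reproduced in a few lines. This is a minimal sketch using the cell labels a, b, c, d of Table 6.1; the function name `odds_ratio_ci` is an assumption for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with a 95% confidence interval computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    # Build the interval for log(OR), then take anti-logs
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, se_log, lower, upper

# Table 6.1: a = 277, b = 47, c = 363, d = 73
or_, se_log, lower, upper = odds_ratio_ci(277, 47, 363, 73)
print(f"OR = {or_:.3f}, SE(log OR) = {se_log:.3f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Working on the log scale is what makes the resulting interval asymmetric about the odds ratio.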

The standard error of the difference between two totals can be used to estimate the significance of that difference by dividing the difference by its standard error:

$$z = \frac{n_1 - n_2}{\sqrt{n_1 + n_2}} \tag{6.1}$$

It is important to note that the deaths must be independently caused; for example, they must not be the result of an epidemic such as influenza. The reports of the deaths must likewise be independent; for example, the criteria for diagnosis must be consistent from year to year and not suddenly change in accordance with a new fashion or test, and the population at risk must be the same size over the period of study.

In spite of its limitations this method has its uses. For instance, in Carlisle the number of deaths from ischaemic heart disease for one year was 276. Is this significantly higher than the total for the previous year, which was 246? The difference is 30. The standard error of the difference is √(276 + 246) = 22·8. We then take z = 30/22·8 = 1·313. That is, the difference is clearly much less than 1·96 times its standard error, the limit at the 5% level of probability. Reference to Table A shows that P = 0·2. The difference could therefore easily be a chance fluctuation.

This method should be regarded as giving no more than approximate but useful guidance, and is unlikely to be valid over a period of more than very few years owing to changes in diagnostic techniques. An extension of it to the study of paired alternatives follows.

### Paired alternatives

Sometimes it is possible to record the results of treatment or some sort of test or investigation as one of two alternatives.
For instance, two treatments or tests might be carried out on pairs obtained by matching individuals chosen by random sampling, or the pairs might consist of successive treatments of the same individual (see Chapter 7 for a comparison of pairs by the t test). The result might then be recorded as "responded or did not respond", "improved or did not improve", "positive or negative", and so on. This type of study yields results that can be set out as shown in Table 6.2.

Table 6.2 Layout for paired data.

| Member of pair receiving treatment A | Member of pair receiving treatment B |
|---|---|
| Responded | Responded |
| Responded | Did not respond |
| Did not respond | Responded |
| Did not respond | Did not respond |

The significance of the results can then be simply tested by McNemar's test in the following way. Ignore the first and fourth rows, and examine the second and third rows. Let the larger number of pairs in either row be called $n_1$ and the smaller number of pairs in either row be $n_2$. We may then use formula (6.1) to obtain the result, z. This is approximately Normally distributed under the null hypothesis, and its probability can be read from Table A.

However, in practice, the fairly small numbers that form the subject of this type of investigation make a correction advisable. We therefore diminish the difference between $n_1$ and $n_2$ by using the following formula:

$$z = \frac{|n_1 - n_2| - 1}{\sqrt{n_1 + n_2}} \tag{6.2}$$

where the vertical lines mean "take the absolute value". Again, the result is Normally distributed, and its probability can be read from Table A.

As for the unpaired case, there is a slightly different formula for the standard error used to calculate the confidence interval.¹ Suppose N is the total number of pairs. Then

$$\mathrm{SE(diff)} = \frac{1}{N}\sqrt{n_1 + n_2 - \frac{(n_1 - n_2)^2}{N}}$$

For example, a registrar in the gastroenterological unit of a large hospital in an industrial city sees a considerable number of patients with severe recurrent aphthous ulcer of the mouth. Claims have been

made that a recently introduced preparation stops the pain of these ulcers and promotes quicker healing than existing preparations.

Over a period of 6 months the registrar selected every patient with this disorder and paired them off as far as possible by reference to age, sex and frequency of ulceration. Finally she had 108 patients in 54 pairs. To one member of each pair, chosen by the toss of a coin, she gave treatment A, which she and her colleagues in the unit had hitherto regarded as the best; to the other member she gave the new treatment, B. Both forms of treatment are local applications, and they cannot be made to look alike. Consequently, to avoid bias in the assessment of the results a colleague recorded the results of treatment without knowing which patient in each pair had which treatment. The results are shown in Table 6.3.

Here $n_1$ = 23, $n_2$ = 10. Entering these values in formula (6.2), we obtain

$$z = \frac{(23 - 10) - 1}{\sqrt{23 + 10}} = \frac{12}{\sqrt{33}} = 2.089$$

The probability value associated with 2·089 is about 0·04 (Table A). Without the continuity correction we obtain z = 2·26 and P about 0·025. Therefore we may conclude that treatment A gave significantly better results than treatment B. The standard error used to calculate the confidence interval is:

$$\mathrm{SE(diff)} = \frac{1}{54}\sqrt{(23 + 10) - \frac{(23 - 10)^2}{54}} = \frac{1}{54}\sqrt{33 - \frac{169}{54}} = 0.101$$

The observed difference in proportions is

$$\frac{23}{54} - \frac{10}{54} = 0.241$$
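The paired calculations above, and the comparison of two totals from formula (6.1), reduce to the same small computation on two counts. The sketch below is one way to check the figures; the function names `z_totals` and `se_paired_diff` are assumptions, not the book's notation.

```python
import math

def z_totals(n1, n2, continuity=False):
    """Formula (6.1); with continuity=True, the corrected formula (6.2)."""
    numerator = abs(n1 - n2) - (1 if continuity else 0)
    return numerator / math.sqrt(n1 + n2)

def se_paired_diff(n1, n2, n_pairs):
    """Standard error of the difference in paired proportions (for the confidence interval)."""
    return math.sqrt(n1 + n2 - (n1 - n2) ** 2 / n_pairs) / n_pairs

n1, n2, n_pairs = 23, 10, 54  # discordant pairs and total pairs in the ulcer example
print(round(z_totals(n1, n2, continuity=True), 3))  # 2.089
print(round(z_totals(n1, n2), 2))                   # 2.26
diff = (n1 - n2) / n_pairs
se = se_paired_diff(n1, n2, n_pairs)
print(round(diff, 3), round(se, 3))                 # 0.241 0.101

# The same function covers the Carlisle deaths (276 against 246)
print(round(z_totals(276, 246), 3))                 # 1.313
```

Only the discordant pairs (23 and 10) enter the test statistic; the 21 concordant pairs affect the result only through the total number of pairs used in the standard error.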

Table 6.3 Results of treating aphthous ulcer in 54 pairs of patients.

| Member of pair receiving treatment A | Member of pair receiving treatment B | Pairs of patients |
|---|---|---|
| Responded | Responded | 16 |
| Responded | Did not respond | 23 |
| Did not respond | Responded | 10 |
| Did not respond | Did not respond | 5 |
| Total | | 54 |

The 95% confidence interval for the difference in proportions is therefore

0·241 − 1·96 × 0·101 to 0·241 + 1·96 × 0·101 = 0·042 to 0·439

Although this does not include zero, the confidence interval is quite wide, reflecting uncertainty as to the true difference because the sample size is small. An exact method is also available.¹

### Common questions

#### Why is the standard error used for calculating a confidence interval for the difference in two proportions different from the standard error used for calculating the significance?

For nominal variables the standard deviation is not independent of the mean. If we suppose that a nominal variable simply takes the value 0 or 1, then the mean is simply the proportion of 1s and the standard deviation is directly dependent on the mean, being largest when the mean is 0·5. The null and alternative hypotheses are hypotheses about means, either that they are the same (null) or different (alternative). Thus for nominal variables the standard deviations (and thus the standard errors) will also be different for the null and alternative hypotheses. For a confidence interval, the alternative hypothesis is assumed to be true, whereas for a significance test the null hypothesis is assumed to be true. In general the difference in the values of the two methods of calculating the standard errors is likely to be small, and use of either would lead to the same inferences.
The reason why this is mentioned here is that there is a close connection between the test of significance

STATISTICS AT SQUARE ONEdescribed in this chapter and the χ 2 test described in Chapter 8.The difference in the arithmetic for the significance test, and th<strong>at</strong> forcalcul<strong>at</strong>ing the confidence interval, could lead some readers to believeth<strong>at</strong> they are unrel<strong>at</strong>ed, whereas in fact they are complementary.The problem does not arise with continuous variables, where thestandard devi<strong>at</strong>ion is usually assumed independent of the mean,and is also assumed to be the same value under both the null andaltern<strong>at</strong>ive hypotheses.It is worth pointing out th<strong>at</strong> the formula for calcul<strong>at</strong>ing thestandard error of an estim<strong>at</strong>e is not necessarily unique; it dependson underlying assumptions, and so different assumptions or studydesigns will lead to different estim<strong>at</strong>es for standard errors for d<strong>at</strong>asets th<strong>at</strong> might be numerically identical.ExercisesExercise 6.1In an obstetric hospital 17·8% of 320 babies were delivered byforceps in 1975. Wh<strong>at</strong> is the standard error of this percentage? Inanother hospital in the same region 21·2% of 185 babies weredelivered by forceps. Wh<strong>at</strong> is the standard error of the differencebetween the percentages <strong>at</strong> this hospital and the first? Wh<strong>at</strong> is thedifference between these percentages of forceps delivery with a95% confidence interval and wh<strong>at</strong> is its significance?Exercise 6.2Calcul<strong>at</strong>e the difference in proportions (also know as the absoluterisk reduction—see Chapter 2) and a 95% confidence interval forthe difference in proportions for the d<strong>at</strong>a given in Table 2.4.Exercise 6.3A derm<strong>at</strong>ologist tested a new topical applic<strong>at</strong>ion for the tre<strong>at</strong>mentof psoriasis on 48 p<strong>at</strong>ients. 
He applied it to the lesions on one part of the patient’s body and what he considered to be the best traditional remedy to the lesions on another but comparable part of the body, the choice of area being made by the toss of a coin. In three patients both areas of psoriasis responded; in 28 patients the disease responded to the traditional remedy but hardly or not at all to the new one; in 13 it responded to the new one but hardly or not at all to the traditional remedy; and in four cases neither remedy caused an appreciable response. Did either remedy cause a significantly better response than the other?

Note in this exercise that the purpose of the study would be to determine which treatment to try first, in the absence of any further information. Clearly, if the first treatment fails, there is little lost in trying the other.

#### Exercise 6.4

For the data in Table 2.5 calculate the odds ratio and a 95% confidence interval.

### Reference

1 Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Publishing Group, 2000.

## 7 The t tests

Previously we have considered how to test the null hypothesis that there is no difference between the mean of a sample and the population mean, and no difference between the means of two samples. We obtained the difference between the means by subtraction, and then divided this difference by the standard error of the difference. If the difference is 1·96 times its standard error or more it is likely to occur by chance with a frequency of only 1 in 20, or less.

With small samples, where more chance variation must be allowed for, these ratios are not entirely accurate because the uncertainty in estimating the standard error has been ignored. Some modification of the procedure of dividing the difference by its standard error is needed, and the technique to use is the t test. Its foundations were laid by WS Gosset, writing under the pseudonym “Student”, so that it is sometimes known as Student’s t test. The procedure does not differ greatly from the one used for large samples, but is preferable when the number of observations is less than 60, and certainly when they amount to 30 or less.

The application of the t distribution to the following four types of problem will now be considered:

1. The calculation of a confidence interval for a sample mean.
2. The mean and standard deviation of a sample are calculated and a value is postulated for the mean of the population. How significantly does the sample mean differ from the postulated population mean?
3. The means and standard deviations of two samples are calculated. Could both samples have been taken from the same population?
4. Paired observations are made on two samples (or in succession on one sample). What is the significance of the difference between the means of the two sets of observations?

In each case the problem is essentially the same: to establish multiples of standard errors to which probabilities can be attached. These multiples are the number of times a difference can be divided by its standard error. We have seen that with large samples 1·96 times the standard error has a probability of 5% or less, and 2·576 times the standard error a probability of 1% or less (see Table A in the Appendix). With small samples these multiples are larger, and the smaller the sample the larger they become.

### Confidence interval for the mean from a small sample

A rare congenital disease, Everley’s syndrome, generally causes a reduction in concentration of blood sodium. This is thought to provide a useful diagnostic sign as well as a clue to the efficacy of treatment. Little is known about the subject, but the director of a dermatological department in a London teaching hospital is known to be interested in the disease and has seen more cases than anyone else. Even so, he has seen only 18. The patients were all aged between 20 and 44.

The mean blood sodium concentration of these 18 cases was 115 mmol/l, with standard deviation 12 mmol/l.
Assuming that blood sodium concentration is Normally distributed, what is the 95% confidence interval within which the mean of the total population of such cases may be expected to lie?

The data are set out as follows:

| Quantity | Value |
|---|---|
| Number of observations | 18 |
| Mean blood sodium concentration | 115 mmol/l |
| Standard deviation | 12 mmol/l |
| Standard error of mean | SD/√n = 12/√18 = 2·83 mmol/l |

To find the 95% CI above and below the mean we now have to find a multiple of the standard error. In large samples we have seen that the multiple is 1·96 (Chapter 4). For small samples we use the table of t given in Table B. As the sample becomes smaller t becomes larger for any particular level of probability. Conversely, as the sample becomes larger t becomes smaller and approaches the values given in Table A, reaching them for infinitely large samples.

Since the size of the sample influences the value of t, the size of the sample is taken into account in relating the value of t to probabilities in the table. Some useful parts of the full t table appear in Table B. The left hand column is headed d.f. for “degrees of freedom”. The use of these was noted in the calculation of the standard deviation (Chapter 2). In practice, the degrees of freedom amount in these circumstances to one less than the number of observations in the sample. With these data we have 18 − 1 = 17 d.f. This is because only 17 observations plus the total of the observations are needed to specify the sample, the 18th being determined by subtraction.

To find the number by which we must multiply the standard error to give the 95% CI we enter Table B at 17 in the left hand column and read across to the column headed 0·05 to discover the number 2·110. The 95% CIs of the mean are now set as follows:

mean − 2·110 SE to mean + 2·110 SE

which gives us:

115 − (2·110 × 2·83) to 115 + (2·110 × 2·83) = 109·03 to 120·97 mmol/l

We may then say, with a 95% chance of being correct, that the range 109·03 to 120·97 mmol/l includes the population mean.

Likewise, from Table B, the 99% confidence interval of the mean is as follows:

mean − 2·898 SE to mean + 2·898 SE

which gives:

115 − (2·898 × 2·83) to 115 + (2·898 × 2·83) = 106·80 to 123·20 mmol/l

### Difference of sample mean from population mean (one sample t test)

Estimations of plasma calcium concentration in the 18 patients with Everley’s syndrome gave a mean of 3·2 mmol/l, with standard deviation 1·1. Previous experience from a number of investigations and published reports had shown that the mean was commonly close to 2·5 mmol/l in healthy people aged 20–44, the age range of the patients. Is the mean in these patients abnormally high?

We make the following assumptions:

- The data are representative of people with Everley’s syndrome (in this case we have a convenience sample).
- The data are quantitative and plausibly Normally distributed.
- The data are independent of each other. This assumption is most important. In general, repeated measurements on the same individual are not independent. If we had 18 measures of plasma calcium on 15 patients, then we would have only 15 independent observations.

We set the figures out as follows:

| Quantity | Value |
|---|---|
| Mean of general population, µ | 2·5 mmol/l |
| Mean of sample, x̄ | 3·2 mmol/l |
| Standard deviation of sample, SD | 1·1 mmol/l |
| Standard error of sample mean, SD/√n = 1·1/√18 | 0·26 mmol/l |
| Difference between means, µ − x̄ = 2·5 − 3·2 | −0·7 mmol/l |

t is the difference between the means divided by the standard error of the sample mean:

$$t = \frac{\mu - \bar{x}}{SD/\sqrt{n}}$$

t = −0·7/0·26 = −2·69

Degrees of freedom, n − 1 = 18 − 1 = 17

Ignoring the sign of the t value, and entering Table B at 17 degrees of freedom, we find that 2·69 comes between probability values of 0·02 and 0·01 (the precise value, from Excel, is P = 0·015). It is therefore unlikely that the sample with mean 3·2 came from the population with mean 2·5, and we may conclude that the sample mean is, at least statistically, unusually high. Whether it should be regarded clinically as abnormally high is something that needs to be considered separately by the physician in charge of that case.
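The two calculations for the patients with Everley’s syndrome (the confidence interval for mean blood sodium and the one sample t test for plasma calcium) can be sketched in Python. This is only an illustration under stated assumptions: the function names are invented for the example, the multiplier 2·110 is simply the Table B value quoted above, and the P value is obtained by numerically integrating the t density rather than by reading a table. Because the code works with the unrounded standard error, t comes out nearer −2·70 than the −2·69 obtained by hand.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two sided P value, integrating the density from 0 to |t| by Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability of lying between 0 and |t|
    return 1 - 2 * central_area


def mean_ci(mean, sd, n, multiplier):
    """Confidence interval: mean plus or minus multiplier times the standard error."""
    se = sd / math.sqrt(n)
    return mean - multiplier * se, mean + multiplier * se


def one_sample_t(population_mean, sample_mean, sd, n):
    """t statistic (population mean minus sample mean, as in the text) and its d.f."""
    se = sd / math.sqrt(n)
    return (population_mean - sample_mean) / se, n - 1


# Blood sodium: mean 115, SD 12, n = 18; 2.110 is the Table B value for 17 d.f.
sodium_low, sodium_high = mean_ci(115, 12, 18, 2.110)

# Plasma calcium: sample mean 3.2, SD 1.1, n = 18, postulated population mean 2.5.
t_calcium, df_calcium = one_sample_t(2.5, 3.2, 1.1, 18)
p_calcium = two_sided_p(t_calcium, df_calcium)

print(f"95% CI for mean sodium: {sodium_low:.2f} to {sodium_high:.2f} mmol/l")
print(f"calcium: t = {t_calcium:.2f}, d.f. = {df_calcium}, P = {p_calcium:.3f}")
```

Taking the multiplier from the table keeps the example close to the hand calculation; a statistical package would compute it directly from the degrees of freedom.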

### Difference between means of two samples

Here we apply a modified procedure for finding the standard error of the difference between two means and testing the size of the difference by this standard error (see Chapter 5 for large samples). For large samples we used the standard deviation of each sample, computed separately, to calculate the standard error of the difference between the means. For small samples we calculate a combined standard deviation for the two samples. The following example illustrates the procedure.

The addition of bran to the diet has been reported to benefit patients with diverticulosis. Several different bran preparations are available, and a clinician wants to test the efficacy of two of them on patients, since favourable claims have been made for each. Among the consequences of administering bran that require testing is the transit time through the alimentary canal. Does it differ in the two groups of patients taking these two preparations?

The null hypothesis is that the two groups come from the same population. By random allocation the clinician selects two groups of patients aged 40–64 with diverticulosis of comparable severity. Sample 1 contains 15 patients who are given treatment A, and sample 2 contains 12 patients who are given treatment B. The transit times of food through the gut are measured by a standard technique with marked pellets and the results are recorded, in order of increasing time, in Table 7.1.

The following assumptions are made:

- The two samples come from distributions that may differ in their mean value, but not in their standard deviation.
- The observations are independent of each other.
- The data are quantitative and plausibly Normally distributed. (Note in the case of a randomised trial this last assumption is less critical: see “Common questions”.)

The data are shown in Figure 7.1. The assumptions of approximate Normality and equality of variance are satisfied. The design suggests that the observations are indeed independent. Since it is possible for the difference in mean transit times for treatments to be positive or negative, we will employ a two sided test. With treatment A the mean transit time was 68·40 h and with treatment B 83·42 h. What is the significance of the difference, 15·02 h?

Table 7.1 Transit times of marker pellets through the alimentary canal of patients with diverticulosis on two types of treatment: unpaired comparison.

| Transit times (h) | Sample 1 (treatment A) | Sample 2 (treatment B) |
|---|---|---|
| | 44 | 52 |
| | 51 | 64 |
| | 52 | 68 |
| | 55 | 74 |
| | 60 | 79 |
| | 62 | 83 |
| | 66 | 84 |
| | 68 | 88 |
| | 69 | 95 |
| | 71 | 97 |
| | 71 | 101 |
| | 76 | 116 |
| | 82 | |
| | 91 | |
| | 108 | |
| Total | 1026 | 1001 |
| Mean | 68·40 | 83·42 |

Figure 7.1 Transit times for two bran preparations. (Plot of transit times in hours, on an axis from 40 to 120, for treatment A and treatment B.)

The procedure is as follows. Obtain the standard deviation in sample 1, $s_1$, and the standard deviation in sample 2, $s_2$. Multiply the square of the standard deviation of sample 1 by the degrees of freedom, which is the number of subjects minus one:

$$(n_1 - 1)s_1^2$$

Repeat for sample 2:

$$(n_2 - 1)s_2^2$$

Add the two together and divide by the total degrees of freedom to give the pooled variance:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

The standard error of the difference between the means is

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

which can also be written

$$SE(\bar{x}_1 - \bar{x}_2) = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

When the difference between the means is divided by this standard error the result is t. Thus,

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$

The table of the t distribution (Table B), which gives two sided P values, is entered at $(n_1 - 1) + (n_2 - 1)$ degrees of freedom.
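The steps above can be sketched in Python using the transit times of Table 7.1. The function name `pooled_t` and the dictionary it returns are illustrative choices, not part of the original text; `statistics.variance` uses the n − 1 divisor, which is what the procedure requires. The difference is taken as treatment B minus treatment A so that t is positive.

```python
import math
from statistics import mean, variance

# Transit times (h) from Table 7.1
treatment_a = [44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108]
treatment_b = [52, 64, 68, 74, 79, 83, 84, 88, 95, 97, 101, 116]


def pooled_t(sample_1, sample_2):
    """Two sample t test assuming equal standard deviations in the two populations."""
    n1, n2 = len(sample_1), len(sample_2)
    var1, var2 = variance(sample_1), variance(sample_2)  # squared SDs, n - 1 divisor
    # Weight each variance by its degrees of freedom, then divide by the total d.f.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var / n1 + pooled_var / n2)
    t = (mean(sample_1) - mean(sample_2)) / se
    return {"pooled_var": pooled_var, "se": se, "t": t, "df": n1 + n2 - 2}


result = pooled_t(treatment_b, treatment_a)
print(f"pooled variance = {result['pooled_var']:.2f}")
print(f"SE of difference = {result['se']:.3f}")
print(f"t = {result['t']:.3f} on {result['df']} d.f.")
```

The value of t is then referred to Table B at the stated degrees of freedom, exactly as for the hand calculation.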

For the transit times of Table 7.1:

| | Treatment A | Treatment B |
|---|---|---|
| Number of patients | n₁ = 15 | n₂ = 12 |
| Mean | x̄₁ = 68·40 | x̄₂ = 83·42 |
| Standard deviation | s₁ = 16·474 | s₂ = 17·635 |

s²ₚ = (14 × 271·3927 + 11 × 310·9932) / ((15 − 1) + (12 − 1)) = 288·82

SE(x̄₁ − x̄₂) = √(288·82/15 + 288·82/12) = √(288·82 × (1/15 + 1/12)) = 6·582

t = (83·42 − 68·40)/6·582 = 2·282

Table B shows that at (15 − 1) + (12 − 1) = 25 degrees of freedom, t = 2·282 lies between 2·060 (P = 0·05) and 2·485 (P = 0·02) (precise Excel value 0·031). This degree of probability is smaller than the conventional level of 5%. The null hypothesis that there is no difference between the means is therefore somewhat unlikely.

A 95% confidence interval is given by

$$\bar{x}_1 - \bar{x}_2 \pm t_{(n_1 + n_2 - 2)} \times SE$$

Taking the difference as treatment B minus treatment A, this becomes

83·42 − 68·40 ± 2·06 × 6·582

15·02 − 13·56 to 15·02 + 13·56

or 1·46 to 28·58 h

### Unequal standard deviations

If the standard deviations in the two groups are markedly different, for example if the ratio of the larger to the smaller is greater than 2, then one of the assumptions of the t test (that the two samples come from populations with the same standard deviation) is unlikely to hold. An approximate test, due to Satterthwaite and described by Armitage and Berry,¹ which allows for unequal standard deviations, is described below. Another version of this test is known as Welch’s test.

Rather than use the pooled estimate of variance, compute

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

This is the large-sample estimate given in equation (5.1), and is analogous to calculating the standard error of the difference in two proportions under the alternative hypothesis as described in Chapter 6.

We now compute

$$d = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$$

We then test this using a t statistic, in which the degrees of freedom are:

$$\text{d.f.} = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Although this may look very complicated, it can be evaluated very easily on a calculator without having to write down intermediate steps. It can produce a value for the degrees of freedom which is not an integer, and so not available in the tables. In this case one should round to the nearest integer. Many statistical packages now carry out this test as the default, and to get the equal variances t statistic one has to specifically ask for it. The unequal variances t test tends to be less powerful than the usual t test if the variances are in fact the same, since it uses fewer assumptions. However, it should not be used indiscriminately because, if the standard deviations are different, how can we interpret a non-significant difference in means, for example? Often a better strategy is to try a data transformation, such as taking logarithms as described in Chapter 2. Transformations that render distributions closer to Normality often also make the standard deviations similar. If a log transformation is successful, use the usual t test on the logged data.

For the data in Table 7.1, we find that SE = 6·632, d = 2·26 and d.f. = 22·9, or approximately 23. The tabulated values for 5% and 2% from Table B are 2·069 and 2·500, and so this gives 0·02 < P < 0·05 (precise Excel value 0·034), very close to that given by the equal variance test. This might be expected, because the standard deviations in the original data set are very similar, and so using the unequal variances t test gives very similar results to the t test which assumes equal variances. It can also be shown that if the numbers of observations in each group are similar, the usual t test is quite robust.

### Difference between means of paired samples (paired t test)

When the effects of two alternative treatments or experiments are compared, for example in crossover trials, randomised trials in which randomisation is between matched pairs, or matched case–control studies (see Chapter 13), it is sometimes possible to make comparisons in pairs.
Matching controls for the matched variables, and so can lead to a more powerful study.

The test is derived from the single sample t test, using the following assumptions:

- The data are quantitative.
- The distribution of the differences (not the original data) is plausibly Normal.
- The differences are independent of each other.

The first case to consider is when each member of the sample acts as his own control. Whether treatment A or treatment B is given first or second to each member of the sample should be determined by the use of a table of random numbers (Table F). In this way any effect of one treatment on the other, even indirectly through the patient’s attitude to treatment for instance, can be minimised. Occasionally it is possible to give both treatments simultaneously, as in the treatment of a skin disease by applying a remedy to the skin on opposite sides of the body.

Table 7.2 Transit times of marker pellets through the alimentary canal of 12 patients with diverticulosis on two types of treatment. Transit times are in hours.

| Patient | Treatment A | Treatment B | Difference A − B |
|---|---|---|---|
| 1 | 63 | 55 | 8 |
| 2 | 54 | 62 | −8 |
| 3 | 79 | 108 | −29 |
| 4 | 68 | 77 | −9 |
| 5 | 87 | 83 | 4 |
| 6 | 84 | 78 | 6 |
| 7 | 92 | 79 | 13 |
| 8 | 57 | 94 | −37 |
| 9 | 66 | 69 | −3 |
| 10 | 53 | 66 | −13 |
| 11 | 76 | 72 | 4 |
| 12 | 63 | 77 | −14 |
| Total | 842 | 920 | −78 |
| Mean | 70·17 | 76·67 | −6·5 |

Let us use as an example the studies of bran in the treatment of diverticulosis discussed earlier. The clinician wonders whether transit time would be shorter if bran is given in the same dosage in three meals during the day (treatment A) or in one meal (treatment B). A random sample of patients with disease of comparable severity and aged 20–44 is chosen and the two treatments administered on two successive occasions, the order of the treatments also being determined from the table of random numbers. The alimentary transit times and the differences for each pair of treatments are set out in Table 7.2.

In calculating t on the paired observations we work with the difference, d, between the members of each pair. Our first task is to find the mean of the differences between the observations and then the standard error of the mean, proceeding as follows. Find the mean of the differences, $\bar{d}$. Find the standard deviation of the differences, SD. Calculate the standard error of the mean:

$$SE(\bar{d}) = SD/\sqrt{n}$$

To calculate t, divide the mean of the differences by the standard error of the mean:

$$t = \frac{\bar{d}}{SE(\bar{d})}$$
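As a sketch of the paired calculation, the Python below works through the differences of Table 7.2. The function name `paired_t` is an illustrative assumption, and the multiplier 2·201 used for the 95% confidence interval is assumed to be the tabulated t value for 11 degrees of freedom at P = 0·05.

```python
import math
from statistics import mean, stdev

# Transit times (h) from Table 7.2, one entry per patient
treatment_a = [63, 54, 79, 68, 87, 84, 92, 57, 66, 53, 76, 63]
treatment_b = [55, 62, 108, 77, 83, 78, 79, 94, 69, 66, 72, 77]


def paired_t(first, second):
    """Paired t test: a one sample t test applied to the within-pair differences."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    d_bar = mean(differences)   # mean of the differences
    sd = stdev(differences)     # SD of the differences, n - 1 divisor
    se = sd / math.sqrt(n)      # standard error of the mean difference
    return d_bar, sd, se, d_bar / se, n - 1


d_bar, sd, se, t, df = paired_t(treatment_a, treatment_b)

# 95% CI for the mean difference; 2.201 is the assumed table value for 11 d.f.
ci_low, ci_high = d_bar - 2.201 * se, d_bar + 2.201 * se

print(f"mean difference = {d_bar:.1f} h, SD = {sd:.1f}, SE = {se:.2f}")
print(f"t = {t:.3f} on {df} d.f.")
print(f"95% CI {ci_low:.1f} to {ci_high:.1f} h")
```

Note that only the 12 differences enter the calculation; the two columns of raw transit times are never compared as independent groups.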

The table of the t distribution is entered at n − 1 degrees of freedom (number of pairs minus 1).

For the data from Table 7.2 we have

d̄ = −6·5

SD = 15·1

SE(d̄) = 4·37

t = −6·5/4·37 = −1·487

Entering Table B at 11 degrees of freedom (n − 1) and ignoring the minus sign, we find that this value lies between 0·697 and 1·796. Reading off the probability value, we see that 0·1 < P < 0·5 (precise Excel value 0·165). The null hypothesis is that there is no difference between the mean transit times on these two forms of treatment. From our calculations, it is not disproved. However, this does not mean that the two treatments are equivalent. To help us decide this we calculate the confidence interval.

A 95% CI for the mean difference is given by

$$\bar{d} \pm t_{n-1} \times SE(\bar{d})$$

In this case t₁₁ at P = 0·05 is 2·201 (Table B) and so the 95% CI is:

−6·5 − 2·201 × 4·37 to −6·5 + 2·201 × 4·37 h = −16·1 to 3·1 h.

This interval is quite wide, so that, even though it contains zero, we cannot really conclude that the two preparations are equivalent, and should look to a larger study.

The second case of a paired comparison to consider is when two samples are chosen and each member of sample 1 is paired with one member of sample 2, as in a matched case–control study. As the aim is to test the difference, if any, between two types of treatment, the choice of members for each pair is designed to make them as alike as possible. The more alike they are, the more apparent will be any differences due to treatment, because they will not be confused with differences in the results caused by disparities between members of the pair. The likeness within the pairs applies to attributes relating to the study in question. For instance, in a test for a drug reducing blood pressure the colour of the patients’ eyes would probably be irrelevant, but their resting diastolic blood pressure could well provide a basis for selecting the pairs. Another (perhaps related) basis is the prognosis for the disease in patients: in general, patients with a similar prognosis are best paired. Whatever criteria are chosen, it is essential that the pairs are constructed before the treatment is given, for the pairing must be uninfluenced by knowledge of the effects of treatment.

### Further methods

Suppose we had a clinical trial with more than two treatments. It is not valid to compare each treatment with the other treatments using t tests because the overall type I error rate α will be bigger than the conventional level set for each individual test. A method of controlling this is to use a one way analysis of variance.¹,²

### Common questions

#### Should I test my data for Normality before using the t test?

It would seem logical that, because the t test assumes Normality, one should test for Normality first. The problem is that the test for Normality is dependent on the sample size. With a small sample a non-significant result does not mean that the data come from a Normal distribution. On the other hand, with a large sample, a significant result does not mean that we could not use the t test, because the t test is robust to moderate departures from Normality; that is, the P value obtained can be validly interpreted. There is something illogical about using one significance test conditional on the results of another significance test. In general it is a matter of knowing and looking at the data. One can “eyeball” the data and if the distributions are not extremely skewed, and particularly if (for the two sample t test) the numbers of observations are similar in the two groups, then the t test will be valid. The main problem is often that outliers will inflate the standard deviations and render the test less sensitive. Also, it is not generally appreciated that if the data originate from a randomised controlled trial, then the process of randomisation will ensure the validity of the t test, irrespective of the original distribution of the data.

#### Should I test for equality of the standard deviations before using the usual t test?

The same argument prevails here as for the previous question about Normality. The test for equality of variances is dependent on the sample size. A rule of thumb is that if the ratio of the larger to the smaller standard deviation is greater than 2, then the unequal variance test should be used. In view of the fact that when the variances are similar, the equal variance and unequal variance t test tend to agree, there is a good argument for always doing the unequal variance test. With a computer one can easily do both the equal and the unequal variance t test and see if the answers differ.

#### Why should I use a paired test if my data are paired? What happens if I don’t?

Pairing provides information about an experiment, and the more information that can be provided in the analysis the more sensitive the test. One of the major sources of variability is that between subjects. By repeating measures within subjects, each subject acts as his or her own control, and the between subjects variability is removed. In general this means that if there is a true difference between the pairs the paired test is more likely to pick it up: it is more powerful. When the pairs are generated by matching, the matching criteria may not be important. In this case, the paired and unpaired tests should give similar results.

### Reading and reporting t tests

- Lack of Normality is not too much of a worry because, as stated earlier, the t test is remarkably robust to lack of Normality, particularly if the numbers of data points are similar in the two groups. However, one should check that the data are independent, and that the variances are similar in the two groups.
- Always report the t statistic, its degrees of freedom and P value. Also give the estimate and confidence interval. Thus, the transit time study for a two sample t test would be reported as

  t = 2·28, d.f. = 25, P = 0·031, difference in means 15·02 h, 95% CI 1·46 to 28·58 h
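The remark under “Common questions” that a computer makes it easy to run both the equal and the unequal variance t test can be illustrated with a short Python sketch on the Table 7.1 data, ending with a line in the reporting style shown above. Everything here is an assumption made for illustration: the function names are invented, the P values come from numerically integrating the t density (so they need not agree with a package to the last digit), and the multiplier 2·060 for the confidence interval is the Table B value for 25 degrees of freedom.

```python
import math
from statistics import mean, variance

# Transit times (h) from Table 7.1
treatment_a = [44, 51, 52, 55, 60, 62, 66, 68, 69, 71, 71, 76, 82, 91, 108]
treatment_b = [52, 64, 68, 74, 79, 83, 84, 88, 95, 97, 101, 116]


def t_pdf(x, df):
    """Density of Student's t distribution; df need not be an integer."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two sided P value by Simpson's rule on the interval from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3


def equal_variance_test(x, y):
    """Usual two sample t test with a pooled variance."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / se, n1 + n2 - 2, se


def unequal_variance_test(x, y):
    """Approximate test with separate variances and Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    se = math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return (mean(x) - mean(y)) / se, df, se


t_eq, df_eq, se_eq = equal_variance_test(treatment_b, treatment_a)
t_uneq, df_uneq, se_uneq = unequal_variance_test(treatment_b, treatment_a)
p_eq, p_uneq = two_sided_p(t_eq, df_eq), two_sided_p(t_uneq, df_uneq)

difference = mean(treatment_b) - mean(treatment_a)
ci_low, ci_high = difference - 2.060 * se_eq, difference + 2.060 * se_eq  # Table B, 25 d.f.

print(f"equal variances:   t = {t_eq:.2f}, d.f. = {df_eq}, P = {p_eq:.3f}, "
      f"difference in means {difference:.2f} h, 95% CI {ci_low:.2f} to {ci_high:.2f} h")
print(f"unequal variances: d = {t_uneq:.2f}, d.f. = {df_uneq:.1f}, P = {p_uneq:.3f}")
```

Because the two standard deviations are so similar here, the two tests give nearly the same answer; a marked disagreement would be a prompt to look again at the data.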

### Exercises

#### Exercise 7.1

In 22 patients with an unusual liver disease the plasma alkaline phosphatase was found by a certain laboratory to have a mean value of 39 King–Armstrong units, standard deviation 3·4 units. What is the 95% confidence interval within which the mean of the population of such cases whose specimens come to the same laboratory may be expected to lie?

#### Exercise 7.2

In the 18 patients with Everley’s syndrome the mean level of plasma phosphate was 1·7 mmol/l, standard deviation 0·8. If the mean level in the general population is taken as 1·2 mmol/l, what is the significance of the difference between that mean and the mean of these 18 patients?

#### Exercise 7.3

In two wards for elderly women in a geriatric hospital the following levels of haemoglobin (g/dl) were found:

Ward A: 12·2, 11·1, 14·0, 11·3, 10·8, 12·5, 12·2, 11·9, 13·6, 12·7, 13·4, 13·7

Ward B: 11·9, 10·7, 12·3, 13·9, 11·1, 11·2, 13·3, 11·4, 12·0, 11·1

What is the difference between the mean levels in the two wards, and what is its significance? What is the 95% confidence interval for the difference in treatments?

#### Exercise 7.4

A new treatment for varicose ulcer is compared with a standard treatment on 10 matched pairs of patients, where treatment between pairs is decided using random numbers. The outcome is the number of days from start of treatment to healing of ulcer. One doctor is responsible for treatment and a second doctor assesses healing without knowing which treatment each patient had. The following treatment times (in days) were recorded.

Standard treatment: 35, 104, 27, 53, 72, 64, 97, 121, 86, 41

New treatment: 27, 52, 46, 33, 37, 82, 51, 92, 68, 62

What is the mean difference in the healing time, the value of t, the number of degrees of freedom, and the probability? What is the 95% confidence interval for the difference?

### References

1 Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford: Blackwell Scientific Publications, 1994:112–13.

2 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.

8 The χ² tests

The distribution of a categorical variable in a sample often needs to be compared with the distribution of a categorical variable in another sample. For example, over a period of 2 years a psychiatrist has classified by socioeconomic class the women aged 20–64 admitted to her unit suffering from self poisoning (sample A). At the same time she has likewise classified the women of similar age admitted to a gastroenterological unit in the same hospital (sample B). She has employed the Registrar General’s five socioeconomic classes, and generally classified the women by reference to their father’s or husband’s occupation. The results are set out in Table 8.1.

The psychiatrist wants to investigate whether the distribution of the patients by social class differed in these two units. She therefore erects the null hypothesis that there is no difference between the two distributions. This is what is tested by the chi squared (χ²) test (pronounced with a hard ch to rhyme with “sky”). By default, all χ² tests are two sided.

It is important to emphasise here that χ² tests may be carried out for this purpose only on the actual numbers of occurrences, not on percentages, proportions, means of observations, or other derived statistics. Note, we distinguish here the Greek (χ²) for the test and the distribution and the Latin (X²) for the calculated statistic, which is what is obtained from the test.

The χ² test is carried out in the following steps. For each observed number (O) in the table find an “expected” number (E); this procedure is discussed below. Subtract each expected number from each observed number:

O − E

Square the difference:

(O − E)².

Divide the squares so obtained for each cell of the table by the expected number for that cell:

(O − E)²/E.

X² is the sum of the (O − E)²/E values.

Table 8.1 Distribution by socioeconomic class of patients admitted to self poisoning (sample A) and gastroenterological (sample B) units.

| Socioeconomic class | Sample A (a) | Sample B (b) | Total (n = a + b) | Proportion in group A (p = a/n) |
|---|---|---|---|---|
| I | 17 | 5 | 22 | 0·77 |
| II | 25 | 21 | 46 | 0·54 |
| III | 39 | 34 | 73 | 0·53 |
| IV | 42 | 49 | 91 | 0·46 |
| V | 32 | 25 | 57 | 0·56 |
| Total | 155 | 134 | 289 | |

To calculate the expected number for each cell of the table consider the null hypothesis, which in this case is that the numbers in each cell are proportionately the same in sample A as they are in sample B. We therefore construct a parallel table in which the proportions are exactly the same for both samples. This has been done in columns (2) and (3) of Table 8.2. The proportions are obtained from the totals column in Table 8.1 and are applied to the totals row. For instance, in Table 8.2, column (2), 11·80 = (22/289) × 155; 24·67 = (46/289) × 155; in column (3) 10·20 = (22/289) × 134; 21·33 = (46/289) × 134 and so on.

Thus by simple proportions from the totals we find an expected number to match each observed number. The sum of the expected numbers for each sample must equal the sum of the observed numbers for each sample, which is a useful check. We now subtract each expected number from its corresponding observed number. The results are given in columns (4) and (5) of Table 8.2. Here two points may be noted.

1. The sum of these differences always equals zero in each column.
2. Each difference for sample A is matched by the same figure, but with the opposite sign, for sample B.

Again these are useful checks.

Table 8.2 Calculation of the χ² test on figures in Table 8.1.

| Class (1) | Expected numbers, A (2) | Expected numbers, B (3) | O − E, A (4) | O − E, B (5) | (O − E)²/E, A (6) | (O − E)²/E, B (7) |
|---|---|---|---|---|---|---|
| I | 11·80 | 10·20 | 5·20 | −5·20 | 2·292 | 2·651 |
| II | 24·67 | 21·33 | 0·33 | −0·33 | 0·004 | 0·005 |
| III | 39·15 | 33·85 | −0·15 | 0·15 | 0·001 | 0·001 |
| IV | 48·81 | 42·19 | −6·81 | 6·81 | 0·950 | 1·099 |
| V | 30·57 | 26·43 | 1·43 | −1·43 | 0·067 | 0·077 |
| Total | 155·00 | 134·00 | 0 | 0 | 3·314 | 3·833 |

X² = 3·314 + 3·833 = 7·147. d.f. = 4. 0·10 < P < 0·50 (precise P = 0·13).

The figures in columns (4) and (5) are then each squared and divided by the corresponding expected numbers in columns (2) and (3). The results are given in columns (6) and (7). Finally these results, (O − E)²/E, are summed to give X².

A helpful technical procedure in calculating the expected numbers may be noted here. Most electronic calculators allow successive multiplication by a constant multiplier by a short cut of some kind. To calculate the expected numbers a constant multiplier for each sample is obtained by dividing the total of the sample by the grand total for both samples. In Table 8.1 for sample A this is 155/289 = 0·5363. This fraction is then successively multiplied by 22, 46, 73, 91 and 57. For sample B the fraction is 134/289 = 0·4637. This too is successively multiplied by 22, 46, 73, 91 and 57. The results are shown in Table 8.2, columns (2) and (3).

Having obtained a value for X² = ∑[(O − E)²/E] we look up in a table of χ² distribution the probability attached to it (Table C in the Appendix). Just as with the t table, we must enter this table at a certain number of degrees of freedom.
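The arithmetic of Table 8.2 can also be reproduced with a short script. What follows is a minimal sketch in Python, a language chosen here purely for illustration; the function names are hypothetical. It assumes the degrees of freedom are (rows − 1) × (columns − 1), and it uses a closed form for the upper tail probability that holds only when the degrees of freedom are even, as they are for Table 8.1.

```python
import math

# Observed counts from Table 8.1: (sample A, sample B) for classes I to V.
observed = [(17, 5), (25, 21), (39, 34), (42, 49), (32, 25)]


def chi_squared(table):
    """Return (X2, degrees of freedom) for a rows x columns table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    x2 = 0.0
    for row, row_total in zip(table, row_totals):
        for o, col_total in zip(row, col_totals):
            e = row_total * col_total / grand_total  # expected number E
            x2 += (o - e) ** 2 / e                   # contribution (O - E)^2 / E
    df = (len(table) - 1) * (len(col_totals) - 1)
    return x2, df


def p_value_even_df(x2, df):
    """Upper tail of the chi-squared distribution (illustrative; even df only)."""
    half = x2 / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


x2, df = chi_squared(observed)
p = p_value_even_df(x2, df)
print(f"X2 = {x2:.3f}, d.f. = {df}, P = {p:.2f}")
```

Because the script carries unrounded expected numbers, its X² can differ from the hand calculation in the third decimal place.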
To ascertain the degrees of freedom requires some care.

When a comparison is made between one sample and another, as in Table 8.1, a simple rule is that the degrees of freedom equal (number of columns minus one) × (number of rows minus one), not counting the row and column containing the totals. For the data in Table 8.1 this gives (2 − 1) × (5 − 1) = 4. Another way of looking at this is to ask for the minimum number of figures that must be supplied in Table 8.1, in addition to all the totals, to allow us to complete the whole table. Four numbers disposed anyhow in samples A and B, provided they are in separate rows, will suffice.

Entering Table C at four degrees of freedom and reading along the row, we find that the value of X² (7·147) lies between 3·357 and 7·779. The corresponding probability is: 0·10 < P < 0·50 (precise P = 0·13). This is well above the conventionally significant level of 0·05, so the null hypothesis is not disproved. It is therefore quite conceivable that in the distribution of the patients between socioeconomic classes the population from which sample A was drawn was the same as the population from which sample B was drawn.

Quick method

The above method of calculating X² illustrates the nature of the statistic clearly and is often used in practice. A quicker method, similar to the quick method for calculating the standard deviation, is particularly suitable for use with electronic calculators.¹

The data are set out as in Table 8.1. Take the left hand column of figures (sample A) and call each observation a. Their total, which is 155, is then ∑a.

Let p be the proportion formed when each observation a is divided by the corresponding figure in the total column. Thus here p in turn equals 17/22, 25/46, … , 32/57.

Let p̄ be the proportion formed when the total of the observations in the left hand column, ∑a, is divided by the total of all the observations. Here p̄ = 155/289.
Let q̄ = 1 − p̄, which is the same as 134/289. Then

$$X^2 = \frac{\sum pa - \bar{p}\sum a}{\bar{p}\,\bar{q}}.$$

Fourfold tables

A special form of the χ² test is particularly common in practice and quick to calculate. It is applicable when the results of an investigation can be set out in a “fourfold table” or “2 × 2 contingency table”.

For example, the practitioner whose data we displayed in Table 3.1 believed that the wives of the printers and farmers should be encouraged to breastfeed their babies. She has records for her practice going back over 10 years, in which she has noted whether the mother breastfed the baby for at least 3 months or not, and these records show whether the husband was a printer or a sheep farmer (or another occupation less well represented in her practice). The figures from her records are set out in Table 8.3.

Table 8.3 Numbers of wives of printers and farmers who breastfed their babies for less than 3 months or for 3 months or more.

| | Breastfed for up to 3 months | Breastfed for 3 months or more | Total |
|---|---|---|---|
| Printers’ wives | 36 | 14 | 50 |
| Farmers’ wives | 30 | 25 | 55 |
| Total | 66 | 39 | 105 |

The disparity seems considerable, for, although 28% of the printers’ wives breastfed their babies for three months or more, as many as 45% of the farmers’ wives did so. What is its significance?

The null hypothesis is set up that there is no difference between printers’ wives and farmers’ wives in the period for which they breastfed their babies. The χ² test on a fourfold table may be carried out by a formula that provides a short cut to the conclusion. If a, b, c and d are the numbers in the cells of the fourfold table as shown in Table 8.4 (in this case variable 1 is breastfeeding, coded 0 for < 3 months and 1 for ≥ 3 months, and variable 2 is husband’s occupation, coded 0 for printer and 1 for farmer), X² is calculated from the following formula:

$$X^2 = \frac{(ad - bc)^2 (a + b + c + d)}{(a + b)(c + d)(b + d)(a + c)}$$

With a fourfold table there is one degree of freedom in accordance with the rule given earlier.

Using the above formula we find X² = 3·418 for the breastfeeding data.
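The short cut formula is easy to check numerically. The sketch below is an illustration only, with hypothetical function names; it applies the formula to the four cells of Table 8.3 and obtains P from the standard Normal distribution, relying on the fact that the square of a standard Normal variable follows a χ² distribution with one degree of freedom.

```python
import math


def fourfold_chi_squared(a, b, c, d):
    """Short cut X2 for a 2 x 2 table with cells a, b (first row) and c, d (second row)."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))


def p_value_1df(x2):
    """Two sided P for X2 with 1 d.f., using the relation Z^2 ~ chi-squared(1)."""
    return math.erfc(math.sqrt(x2 / 2))


# Table 8.3: printers' wives 36 and 14, farmers' wives 30 and 25.
x2 = fourfold_chi_squared(36, 14, 30, 25)
p = p_value_1df(x2)
print(f"X2 = {x2:.3f}, P = {p:.4f}")
```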
Entering the χ² table with one degree of freedom, we read along the row and find that 3·418 lies between 2·706 and 3·841. Therefore 0·05 < P < 0·10 (precise P = 0·064). So, despite an apparently considerable difference between the proportions of printers’ wives and the farmers’ wives breastfeeding their babies for 3 months or more, the probability of this result or one more extreme occurring by chance is more than 5%.

Table 8.4 Notation for two group χ² test.

| | Variable 1 = 0 | Variable 1 = 1 | Total |
|---|---|---|---|
| Variable 2 = 0 | a | b | a + b |
| Variable 2 = 1 | c | d | c + d |
| Total | a + c | b + d | a + b + c + d |

We can in fact get the precise P value from Table A. It can be shown that if Z has a Normal distribution with mean 0 and SD 1, then Z² has a chi-squared distribution with 1 d.f. The square root of the chi-squared value, 3·418, is 1·85. From Table A the value tabulated for 1·8 is 0·072, and that for 1·9 is 0·057. The midpoint is 0·0645, which is close to the value from Excel. However, we do need the chi-squared tables for more than one degree of freedom. This also means there is a direct correspondence between the z value calculated for the difference in two proportions described in Chapter 6 and a chi-squared statistic (see Exercise 8.7).

We now calculate a confidence interval of the differences between the two proportions, as described in Chapter 6. In this case we use the standard error based on the observed data, not the null hypothesis. We could calculate the CI on either the rows or the columns, and it is important that we compare proportions of the outcome variable, that is, breastfeeding.

P₁ = 14/50 = 0·280, P₂ = 25/55 = 0·455, P₂ − P₁ = 0·175

$$SE(P_2 - P_1) = \sqrt{\frac{0{\cdot}28 \times 0{\cdot}72}{50} + \frac{0{\cdot}455 \times 0{\cdot}545}{55}} = 0{\cdot}0924$$

The 95% CI is

0·175 − 1·96 × 0·0924 to 0·175 + 1·96 × 0·0924 = −0·006 to 0·356
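The confidence interval calculation can be sketched in a few lines. This is an illustrative example rather than a prescribed implementation; the function name is hypothetical, the difference is taken as farmers' wives minus printers' wives, and the multiplier 1·96 is the usual Normal value for a 95% interval.

```python
import math


def diff_proportions_ci(r1, n1, r2, n2, z=1.96):
    """Return (difference p2 - p1, its SE, lower limit, upper limit).

    The standard error is based on the observed proportions,
    not on the pooled proportion under the null hypothesis.
    """
    p1, p2 = r1 / n1, r2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff, se, diff - z * se, diff + z * se


# Breastfed for 3 months or more: 14 of 50 printers' wives, 25 of 55 farmers' wives.
diff, se, lower, upper = diff_proportions_ci(14, 50, 25, 55)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI {lower:.3f} to {upper:.3f}")
```

The limits may differ from the hand calculation in the third decimal place, because the script does not round the proportions before combining them.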

Thus the 95% CI is wide, and includes zero, as one might expect because the χ² test was not significant at the 5% level.

Small numbers

When the numbers in a 2 × 2 contingency table are small, the χ² approximation becomes poor. The following recommendations may be regarded as a sound guide.² In fourfold tables a χ² test is inappropriate if the total of the table is less than 20, or if the total lies between 20 and 40 and the smallest expected (not observed) value is less than 5; in contingency tables with more than one degree of freedom it is inappropriate if more than about one fifth of the cells have expected values less than 5 or any cell an expected value of less than 1. An alternative to the χ² test for fourfold tables is known as Fisher’s exact test and is described in Chapter 9.

When the values in a fourfold table are fairly small a “correction for continuity”, known as Yates’ correction, may be applied.³ Although there is no precise rule defining the circumstances in which to use Yates’ correction, a common practice is to incorporate it into χ² calculations on tables with a total of under 100 or with any cell containing a value less than 10. The χ² test on a fourfold table is then modified as follows:

$$X_c^2 = \frac{\left[\,|ad - bc| - \tfrac{1}{2}(a + b + c + d)\right]^2 (a + b + c + d)}{(a + b)(c + d)(b + d)(a + c)}.$$

The vertical bars on either side of ad − bc mean that the smaller of those two products is taken from the larger. Half the total of the four values is then subtracted from that difference to provide Yates’ correction.
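As a rough illustration, the corrected formula can be coded directly and compared with the uncorrected one. The function names below are hypothetical, and the cell values are those of Table 8.3.

```python
def fourfold_chi_squared(a, b, c, d):
    """Uncorrected X2 for a fourfold table."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))


def yates_chi_squared(a, b, c, d):
    """X2 with Yates' correction: half the table total is taken from |ad - bc|."""
    n = a + b + c + d
    corrected = abs(a * d - b * c) - n / 2
    return corrected ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))


# Table 8.3: printers' wives 36 and 14, farmers' wives 30 and 25.
x2 = fourfold_chi_squared(36, 14, 30, 25)
x2_c = yates_chi_squared(36, 14, 30, 25)
print(f"uncorrected X2 = {x2:.3f}, corrected X2 = {x2_c:.3f}")
```

For these figures the corrected statistic is smaller than the uncorrected one, which is the direction the correction is designed to work in.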
The effect of the correction is to reduce the value of X².

Applying this to the figures in Table 8.3 gives the following result:

$$X_c^2 = \frac{\left[\,|(36 \times 25) - (30 \times 14)| - (105/2)\right]^2 \times 105}{66 \times 39 \times 55 \times 50} = 2{\cdot}711.$$

The precise value of P is now 0·10. It is still not significant, but the P value is larger than it was in the previous calculation. In fourfold tables containing lower frequencies than Table 8.3 the increase in P value by Yates’ correction may change a result from significant to non-significant; in any case, care should be exercised when making decisions from small samples.

Comparing proportions

Earlier in this chapter we compared two samples by the χ² test to answer the question “Are the distributions of the members of these two samples between five classes significantly different?”. Another way of putting this is to ask “Are the relative proportions of the two samples the same in each class?”.

For example, an industrial medical officer of a large factory wants to immunise the employees against influenza. Five vaccines of various types based on the current viruses are available, but nobody knows which is preferable. From the work force 1350 employees agree to be immunised with one of the vaccines in the first week of December, so the medical officer randomises individuals into five approximately equal treatment groups using a computer generated random number scheme. In the first week of the following March he examines the records he has been keeping to see how many employees got influenza and how many did not. These records are classified by the type of vaccine used (Table 8.5).

In Table 8.6 the figures are analysed by the χ² test. For this we have to determine the expected values. The null hypothesis is that there is no difference between vaccines in their efficacy against influenza.
We therefore assume that the proportion of employees contracting influenza is the same for each vaccine as it is for all combined. This proportion is derived from the total who got influenza, and is 225/1350. To find the expected number in each vaccine group who would contract the disease we multiply the actual numbers in the Total column of Table 8.5 by this proportion. Thus 280 × (225/1350) = 46·7; 250 × (225/1350) = 41·7; and so on. Likewise the proportion who did not get influenza is 1125/1350. The expected numbers of those who would avoid the disease are calculated in the same way from the totals in Table 8.5, so that 280 × (1125/1350) = 233·3; 250 × (1125/1350) = 208·3; and so on. The procedure is thus the same as shown in Tables 8.1 and 8.2.

Table 8.5 People who did or did not get influenza after inoculation with one of five vaccines.

| Type of vaccine | Got influenza | Avoided influenza | Total | Proportion got influenza |
|---|---|---|---|---|
| I | 43 | 237 | 280 | 0·15 |
| II | 52 | 198 | 250 | 0·21 |
| III | 25 | 245 | 270 | 0·09 |
| IV | 48 | 212 | 260 | 0·18 |
| V | 57 | 233 | 290 | 0·20 |
| Total | 225 | 1125 | 1350 | |

Table 8.6 Calculation of χ² test on figures in Table 8.5.

| Type of vaccine | Expected numbers, got influenza | Expected numbers, avoided influenza | O − E, got influenza | O − E, avoided influenza | (O − E)²/E, got influenza | (O − E)²/E, avoided influenza |
|---|---|---|---|---|---|---|
| I | 46·7 | 233·3 | −3·7 | 3·7 | 0·293 | 0·059 |
| II | 41·7 | 208·3 | 10·3 | −10·3 | 2·544 | 0·509 |
| III | 45·0 | 225·0 | −20·0 | 20·0 | 8·889 | 1·778 |
| IV | 43·3 | 216·7 | 4·7 | −4·7 | 0·510 | 0·102 |
| V | 48·3 | 241·7 | 8·7 | −8·7 | 1·567 | 0·313 |
| Total | 225·0 | 1125·0 | 0 | 0 | 13·803 | 2·761 |

X² = 16·564, d.f. = 4, P = 0·002.

The calculations in Table 8.6 show that X² with four degrees of freedom is 16·564, and 0·001 < P < 0·01 (precise P = 0·002). This is a highly significant result. But what does it mean?

Splitting of χ²

Inspection of Table 8.6 shows that the largest contribution to the total X² comes from the figures for vaccine III. They are 8·889 and 1·778, which together equal 10·667. If this figure is subtracted from the total X², 16·564 − 10·667 = 5·897. This gives an approximate figure for X² for the remainder of the table with three degrees of freedom (by removing the vaccine III we have reduced the table to four rows and two columns). We then find that P = 0·117, a non-significant result. However, this is only a rough approximation.
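The row by row contributions to X² that drive this approximate split can be sketched as follows. This is a minimal illustration with hypothetical names. It works from unrounded expected numbers, so its totals agree with the hand calculation in Table 8.6 only to about two decimal places.

```python
# Table 8.5: (got influenza, avoided influenza) for each vaccine.
observed = {
    "I": (43, 237),
    "II": (52, 198),
    "III": (25, 245),
    "IV": (48, 212),
    "V": (57, 233),
}


def row_contributions(table):
    """Return each row's contribution to X2, the sum of (O - E)^2 / E over its cells."""
    col_totals = [sum(col) for col in zip(*table.values())]
    grand_total = sum(col_totals)
    contributions = {}
    for label, row in table.items():
        row_total = sum(row)
        total = 0.0
        for o, col_total in zip(row, col_totals):
            e = row_total * col_total / grand_total  # expected number E
            total += (o - e) ** 2 / e
        contributions[label] = total
    return contributions


contrib = row_contributions(observed)
total_x2 = sum(contrib.values())
largest = max(contrib, key=contrib.get)   # row contributing most to X2
remainder = total_x2 - contrib[largest]   # approximate X2 for the other rows

print(f"total X2 = {total_x2:.2f}")
print(f"largest contribution: vaccine {largest}, {contrib[largest]:.3f}")
print(f"approximate X2 for the remaining rows = {remainder:.2f}")
```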
To check it exactly we apply the χ² test to the figures in Table 8.5 minus the row for vaccine III. In other words, the test is now performed on the figures for vaccines I, II, IV, and V. On these figures X² = 2·983; d.f. = 3; P = 0·394. Thus the probability falls within the same broad limits as obtained by the approximate short cut given above. We can conclude that the figures for vaccine III are responsible for the highly significant result of the total X² of 16·564.

But this is not quite the end of the story. Before concluding from these figures that vaccine III is superior to the others we ought to carry out a check on other possible explanations for the disparity. The process of randomisation in the choice of the persons to receive each of the vaccines should have balanced out any differences between the groups, but some may have remained by chance. The sort of questions worth examining now are: Were the people receiving vaccine III as likely to be exposed to infection as those receiving the other vaccines? Could they have had a higher level of immunity from previous infection? Were they of comparable socioeconomic status? Of similar age on average? Were the sexes comparably distributed? Although some of these characteristics could have been more or less balanced by stratified randomisation, it is as well to check that they have in fact been equalised before attributing the numerical discrepancy in the result to the potency of the vaccine.

χ² test for trend

Table 8.1 is a 5 × 2 table, because there are five socioeconomic classes and two samples. Socioeconomic groupings may be thought of as an example of an ordered categorical variable, as there are some outcomes (for example, mortality) in which it is sensible to state that (say) social class II is between social class I and social class III.
The χ² test described at that stage did not make use of this information; if we had interchanged any of the rows, the value of X² would have been exactly the same. Looking at the proportions p in Table 8.1 we can see that there is no real ordering by social class in the proportions of self poisoning; social class V is between social classes I and II. However, in many cases, when the outcome variable is an ordered categorical variable, a more powerful test can be devised which uses this information.

Consider a randomised controlled trial of health promotion in general practice to change people’s eating habits.⁴ Table 8.7 gives the results from a review at 2 years, to look at the change in the proportion eating poultry.

Table 8.7 Change in eating poultry in randomised trial.⁴

| | Intervention (a) | Control (b) | Total (n) | Proportion in intervention (p = a/n) | Score (x) |
|---|---|---|---|---|---|
| Increase | 100 | 78 | 178 | 0·56 | 1 |
| No change | 175 | 173 | 348 | 0·50 | 0 |
| Decrease | 42 | 59 | 101 | 0·42 | −1 |
| Total | 317 | 310 | 627 | 0·51 | |

If we give each category a score x the χ² test for trend is calculated in the following way. Let

$$E_{xp} = \sum ax - \frac{\sum a \sum nx}{N}$$

and

$$E_{xx} = \sum nx^2 - \frac{\left(\sum nx\right)^2}{N}.$$

Then

$$X^2 = \frac{E_{xp}^2}{E_{xx}\,\bar{p}\,\bar{q}},$$

where N is the total sample size, p̄ = ∑a/N and q̄ = ∑b/N.

For the data in Table 8.7,

N = 627
∑a = 317
∑ax = 100 × 1 + 175 × 0 − 42 × 1 = 58
∑nx = 178 × 1 + 348 × 0 − 101 × 1 = 77
E_xp = 58 − 317 × 77/627 = 19·07
∑nx² = 178 × 1² + 348 × 0² + 101 × (−1)² = 279
(∑nx)²/N = 77²/627 = 9·46
E_xx = 279 − 9·46 = 269·54
p̄ = 317/627 = 0·5056, q̄ = 310/627 = 0·4944

Thus

X² = 19·07²/(269·54 × 0·5056 × 0·4944) = 5·40
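The trend calculation can be checked with a short script. The sketch below is illustrative and its function name is hypothetical; it follows the definitions of E_xp, E_xx, p̄ and q̄ given above, takes q̄ as 1 − p̄ (which equals ∑b/N because n = a + b in every row), and obtains P from the Normal relation for one degree of freedom. Evaluated without intermediate rounding, the expression comes to roughly 5·4.

```python
import math


def chi_squared_trend(a, n, x):
    """X2 for trend from counts a, row totals n and scores x (1 d.f.)."""
    big_n = sum(n)
    sum_a = sum(a)
    sum_ax = sum(ai * xi for ai, xi in zip(a, x))
    sum_nx = sum(ni * xi for ni, xi in zip(n, x))
    sum_nx2 = sum(ni * xi ** 2 for ni, xi in zip(n, x))
    e_xp = sum_ax - sum_a * sum_nx / big_n
    e_xx = sum_nx2 - sum_nx ** 2 / big_n
    p_bar = sum_a / big_n
    q_bar = 1 - p_bar
    return e_xp ** 2 / (e_xx * p_bar * q_bar)


# Table 8.7: intervention counts, row totals and scores for
# increase, no change and decrease in eating poultry.
a = [100, 175, 42]
n = [178, 348, 101]
x = [1, 0, -1]

x2_trend = chi_squared_trend(a, n, x)
p_trend = math.erfc(math.sqrt(x2_trend / 2))  # two sided P with 1 d.f.
print(f"X2 for trend = {x2_trend:.2f}, P = {p_trend:.3f}")
```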

This has one degree of freedom because the linear scoring means that when one expected value is given all the others can be determined directly, and we find P = 0·02. The usual χ² test gives a value of X² = 5·51; d.f. = 2; P = 0·064. Thus the more sensitive χ² test for trend yields a significant result because the test uses more information about the experimental design. The values for the scores are to some extent arbitrary. However, it is usual to choose them equally spaced on either side of zero. Thus if there are four groups the scores would be −3, −1, +1, +3, and for five groups −2, −1, 0, +1, +2. The X² statistic is quite robust to other values for the scores provided that they are steadily increasing or steadily decreasing.

Note that this is another way of splitting the overall X² statistic. The overall X² will always be greater than the X² for trend, but because the latter uses only one degree of freedom, it is often associated with a smaller probability. Although one is often counselled not to decide on a statistical test after having looked at the data, it is obviously sensible to look at the proportions to see if they are plausibly monotonic (go steadily up or down) with the ordered variable, especially if the overall χ² test is non-significant.

Comparison of an observed and a theoretical distribution

In the cases so far discussed the observed values in one sample have been compared with the observed values in another. But sometimes we want to compare the observed values in one sample with a theoretical distribution.

For example, a geneticist has a breeding population of mice in his laboratory.
Some are entirely white, some have a small patch of brown hairs on the skin, and others have a large patch. According to the genetic theory for the inheritance of these coloured patches of hair the population of mice should consist of 51·0% entirely white, 40·8% with a small brown patch, and 8·2% with a large brown patch. In fact, among the 784 mice in the laboratory 380 are entirely white, 330 have a small brown patch, and 74 have a large brown patch. Do the proportions differ from those expected?

The data are set out in Table 8.8. The expected numbers are calculated by applying the theoretical proportions to the total, namely 0·510 × 784, 0·408 × 784 and 0·082 × 784. The degrees of freedom are calculated from the fact that the only constraint is that the total for the expected cases must equal the total for the observed cases, and so the degrees of freedom are the number of rows minus one. Thereafter the procedure is the same as in previous calculations of X². In this case it comes to 2·875. The χ² table is entered at two degrees of freedom. We find that P = 0·238. Consequently, the null hypothesis of no difference between the observed distribution and the theoretically expected one is not disproved. The data conform to the theory.

Table 8.8 Calculation of X² for comparison between actual distribution and theoretical distribution.

| Mice | Observed cases | Theoretical proportions | Expected cases | O − E | (O − E)²/E |
|---|---|---|---|---|---|
| Entirely white | 380 | 0·510 | 400 | −20 | 1·0000 |
| Small brown patch | 330 | 0·408 | 320 | 10 | 0·3125 |
| Large brown patch | 74 | 0·082 | 64 | 10 | 1·5625 |
| Total | 784 | 1·000 | 784 | 0 | 2·8750 |

McNemar’s test

McNemar’s test for paired nominal data was described in Chapter 6, using a Normal approximation. In view of the relationship between the Normal distribution and the χ² distribution with one degree of freedom, we can recast the McNemar test as a variant of a χ² test. The results are often expressed as in Table 8.9.

Table 8.9 Notation for the McNemar test.

| Second subject of pair (variable 2) | First subject of pair (variable 1) = 0 | First subject of pair (variable 1) = 1 | Total |
|---|---|---|---|
| 0 | e | f | e + f |
| 1 | g | h | g + h |
| Total | e + g | f + h | n |

McNemar’s test is then

$$X^2 = \frac{(f - g)^2}{f + g} \quad \text{with 1 d.f.}$$

or, with a continuity correction,

$$X_c^2 = \frac{(|f - g| - 1)^2}{f + g} \quad \text{with 1 d.f.}$$

The data from Table 6.3 are recast as shown in Table 8.10.

Table 8.10 Data from Table 6.3 for McNemar’s test.

| Second subject of pair | First subject of pair: responded | First subject of pair: did not respond |
|---|---|---|
| Responded | 16 | 10 |
| Did not respond | 23 | 5 |

Thus

X² = (10 − 23)²/(10 + 23) = 5·12

or

X²c = (|10 − 23| − 1)²/(10 + 23) = 4·36

We find that for the χ² values, each with 1 d.f., P = 0·024 and P = 0·037, respectively. The result is identical to that given using the Normal approximation described in Chapter 6, which gives z statistics as the square root of the chi squared values.

Extensions of the χ² test

If the outcome variable in a study is nominal, the χ² test can be extended to look at the effect of more than one input variable, for example to allow for confounding variables. This is most easily done using multiple logistic regression, a generalisation of multiple regression, which is described in Chapter 11. If the data are matched, then a further technique (conditional logistic regression) should be employed. These are described in more detail in Statistics at Square Two.⁵
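Returning briefly to McNemar’s test on the figures in Table 8.10, the two versions of the statistic can be sketched in code. This is an illustration under stated assumptions: the function name is hypothetical, only the discordant counts f and g are used, and P is obtained from the Normal relation for one degree of freedom.

```python
import math


def mcnemar(f, g, continuity=False):
    """Return (X2, P) for McNemar's test on the discordant counts f and g."""
    diff = abs(f - g)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction, never below zero
    x2 = diff ** 2 / (f + g)
    p = math.erfc(math.sqrt(x2 / 2))  # two sided P with 1 d.f.
    return x2, p


# Table 8.10: discordant pairs f = 10 and g = 23.
x2, p = mcnemar(10, 23)
x2_c, p_c = mcnemar(10, 23, continuity=True)
print(f"X2 = {x2:.2f}, P = {p:.3f}")
print(f"with continuity correction: X2 = {x2_c:.2f}, P = {p_c:.3f}")
```

The concordant pairs (16 and 5 in Table 8.10) do not enter the calculation at all.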

Common questions

I have matched data, but the matching criteria were very weak. Should I use McNemar’s test?

The general principle is that if the data are matched in any way, the analysis should take account of it. If the matching is weak then the matched analysis and the unmatched analysis should agree. In some cases when there are a large number of pairs with the same outcome, it would appear that McNemar’s test is discarding a lot of information, and so is losing power. However, imagine we are trying to decide which of two high jumpers is the better. They each jump over a bar at a fixed height, and then the height is increased. It is only when one fails to jump a given height and the other succeeds that a winner can be announced. It does not matter how many jumps both have cleared.

Reading and reporting chi squared tests
• It is conventional to use the Greek χ² to denote the test and the distribution, but the Latin X² for the observed statistic; for example, “from the χ² test we obtained X² = 5·1 …”.
• Chi squared tests with many degrees of freedom are often ill specified and the results should be treated with caution. If non-significant, they do not imply that certain results are unlikely by chance, and if significant it is unclear what has been proven.
• Report the X² statistic, the degrees of freedom and the P values, with the estimated difference and a 95% confidence interval for the true difference. Thus in the breastfeeding example we write X² = 3·418, d.f. = 1, P = 0·064, difference in proportions 0·17, 95% CI −0·006 to 0·356.

Exercises

Exercise 8.1
In a trial of a new drug against a standard drug for the treatment of depression the new drug caused some improvement in 56% of 73 patients and the standard drug some improvement in 41% of 70 patients. The results were assessed in five categories as follows:

New treatment: much improved 18, improved 23, unchanged 15, worse 9, much worse 8

Standard treatment: much improved 12, improved 17, unchanged 19, worse 13, much worse 9

What is the value of X² which takes no account of the ordered value of data, what is the value of the X² test for trend, and the P value? How many degrees of freedom are there? What is the value of P in each case?

Exercise 8.2
An outbreak of pediculosis capitis is being investigated in a girls’ school containing 291 pupils. Of 130 children who live in a nearby housing estate 18 were infested and of 161 who live elsewhere 37 were infested. What is the X² value of the difference, and what is its significance? Find the difference in infestation rates and a 95% confidence interval for the difference.

Exercise 8.3
The 55 affected girls were divided at random into two groups of 29 and 26. The first group received a standard local application and the second group a new local application. The efficacy of each was measured by clearance of the infestation after one application. By this measure the standard application failed in ten cases and the new application in five. What is the X² value of the difference (with Yates’ correction), and what is its significance? What is the difference in clearance rates and an approximate 95% confidence interval?

Exercise 8.4
A general practitioner reviewed all patient notes in four practices for 1 year. Newly diagnosed cases of asthma were noted, and whether or not the case was referred to hospital. The following referrals were found (total cases in parentheses): practice A, 14 (103); practice B, 11 (92); practice C, 39 (166); practice D, 31 (221).
Wh<strong>at</strong> are the X 2 and P values for the distribution of thereferrals in these practices? Do they suggest th<strong>at</strong> any one practicehas significantly more referrals than others?Exercise 8.5Carry out a chi squared test on the d<strong>at</strong>a from Table 2.4 todetermine if there is an associ<strong>at</strong>ion between therapy and de<strong>at</strong>h or93

STATISTICS AT SQUARE ONEshunt in the PHVD trial. How does this correspond with theresults of the confidence interval calcul<strong>at</strong>ion in Exercise 6.2?Exercise 8.6Carry out a chi squared test on the d<strong>at</strong>a from Table 2.5, todetermine if there is an associ<strong>at</strong>ion between hay fever and eczema.How does this result compare with the results for the confidenceintervals for odds r<strong>at</strong>ios in Exercise 6.4?Exercise 8.7Carry out a chi squared test on the d<strong>at</strong>a from Table 6.1, todetermine if there is a gender difference in appendicitis r<strong>at</strong>es.Contrast the value of chi-squared with the z st<strong>at</strong>istic on page 54.References1 Snedecor GW, Cochran WG. St<strong>at</strong>istical methods, 7th ed. Ames: Iowa St<strong>at</strong>eUniversity Press, 1980:47.2 Cochran WG. Some methods for strengthening the common χ 2 tests. Biometrics1956;10:417–51.3 Y<strong>at</strong>es F. Contingency tables involving small numbers and the χ 2 test. J Roy St<strong>at</strong>Soc Suppl 1934;1:217–35.4 Capples ME, McKnight A. Randomised controlled trial of health promotions ingeneral practice for p<strong>at</strong>ients <strong>at</strong> high cardiovascular risk. BMJ 1994;309:993–6.5 Campbell MJ. <strong>St<strong>at</strong>istics</strong> <strong>at</strong> square two. London: BMJ Books, 2001.94

## 9 Exact probability test

Sometimes in a comparison of the frequency of observations in a fourfold table the numbers are too small for the χ² test (Chapter 8). The exact probability test devised by Fisher, Irwin and Yates¹ provides a way out of the difficulty. Tables based on it have been published (for example, by Geigy²) showing levels at which the null hypothesis can be rejected. The method will be described here because, with the aid of a calculator, the exact probability is easily computed.

Consider the following circumstances. Some soldiers are being trained as parachutists. One rather windy afternoon 55 practice jumps take place at two localities, dropping zone A and dropping zone B. Of 15 men who jump at dropping zone A, five suffer sprained ankles, and of 40 who are allocated dropping zone B, two suffer this injury. The casualty rate at dropping zone A seems unduly high, so the medical officer in charge decides to investigate the disparity. Is it a difference that might be expected by chance? If not, it deserves deeper study. The figures are set out in Table 9.1. The null hypothesis is that there is no difference in the probability of injury generating the proportion of injured men at each dropping zone.

Table 9.1 Numbers of men injured and uninjured in parachute training at two dropping zones.

| | Injured | Uninjured | Total |
|---|---|---|---|
| Dropping zone A | 5 | 10 | 15 |
| Dropping zone B | 2 | 38 | 40 |
| Total | 7 | 48 | 55 |

The method to be described calculates the exact probability of observing the particular set of frequencies in the table if the marginal totals (that is, the totals in the last row and column) are kept at their present values. But to the probability of getting this particular set of frequencies we have to add the probability of getting a set of frequencies showing greater disparity between the two dropping zones. This is because we are concerned to know the probability not only of the observed figures but also of the more extreme cases. This may seem obscure, but it ties in with the idea of calculating tail areas in the continuous case.

For convenience of computation the table is changed round so that the smallest number ends up in the top left hand cell. We therefore begin by constructing Table 9.2 from Table 9.1 by transposing the upper and lower rows.

Table 9.2 Numbers in Table 9.1 rearranged for exact probability test.

| | Injured | Uninjured | Total |
|---|---|---|---|
| Dropping zone B | 2 | 38 | 40 |
| Dropping zone A | 5 | 10 | 15 |
| Total | 7 | 48 | 55 |

The number of possible tables with these marginal totals is 8, that is, the smallest marginal total plus one. The eight sets are illustrated in Table 9.3. They are numbered in accordance with the top left hand cell. The figures in our example appear in set 2.

For the general case we can use the following notation:¹

| | 1st variable = 1 | 1st variable = 0 | Total |
|---|---|---|---|
| 2nd variable = 1 | $a$ | $b$ | $r_1$ |
| 2nd variable = 0 | $c$ | $d$ | $r_2$ |
| Total | $s_1$ | $s_2$ | $N$ |

The exact probability for any table is now determined from the following formula:

$$\frac{r_1!\,r_2!\,s_1!\,s_2!}{N!\,a!\,b!\,c!\,d!}$$

Table 9.3 Sets of frequencies in Table 9.2 with same marginal totals. In every set the row totals are 40 and 15, the column totals are 7 and 48, and the grand total is 55.

| Set | a | b | c | d |
|---|---|---|---|---|
| 0 | 0 | 40 | 7 | 8 |
| 1 | 1 | 39 | 6 | 9 |
| 2 | 2 | 38 | 5 | 10 |
| 3 | 3 | 37 | 4 | 11 |
| 4 | 4 | 36 | 3 | 12 |
| 5 | 5 | 35 | 2 | 13 |
| 6 | 6 | 34 | 1 | 14 |
| 7 | 7 | 33 | 0 | 15 |

The exclamation mark denotes “factorial” and means successive multiplication by cardinal numbers in descending series; for example, 4! means 4 × 3 × 2 × 1. By convention, 0! = 1. Factorial functions are available on most calculators, but care is needed not to exceed the maximum number available on the calculator. Generally factorials can be cancelled out for easy computation on a calculator (see the worked example below).

With this formula we have to find the probability attached to the observations in Table 9.1, which is equivalent to Table 9.2, and is denoted by set 2 in Table 9.3. We also have to find the probabilities attached to the more extreme cases. If ad − bc is negative, then the extreme cases are obtained by progressively decreasing cells a and d and increasing b and c by the same amount. If ad − bc is positive, then progressively increase cells a and d and decrease b and c by the same amount.³ For Table 9.2 ad − bc is negative and so the more extreme cases are sets 0 and 1.

The best way of doing this is to start with set 0. Call the probability attached to this set $P_0$. Then, applying the formula, we get:

$$P_0 = \frac{40!\,15!\,7!\,48!}{55!\,0!\,40!\,7!\,8!}$$

This cancels down to

$$P_0 = \frac{15!\,48!}{55!\,8!}$$

For computation on a calculator the factorials can be cancelled out further by removing 8! from 15! and 48! from 55! to give

$$\frac{15 \times 14 \times 13 \times 12 \times 11 \times 10 \times 9}{55 \times 54 \times 53 \times 52 \times 51 \times 50 \times 49}$$

We now start from the left and divide and multiply alternately. However, on an eight digit calculator we would thereby obtain the result 0·0000317 which does not give enough significant figures. Consequently we first multiply the 15 by 1000. Alternate dividing and multiplying then gives 0·0317107. We continue to work with this figure, which is $P_0 \times 1000$, and we now enter it in the memory while also retaining it on the display.

Remembering that we are now working with units 1000 times larger than the real units, to calculate the probability for set 1 we take the value of $P_0$, multiply it by b and c from set 0, and divide it by a and d from set 1. That is,

$$P_1 = P_0 \times \frac{b_0 \times c_0}{a_1 \times d_1} = 0{\cdot}0317107 \times \frac{40 \times 7}{1 \times 9} = 0{\cdot}9865551.$$

The figure for $P_1$ is retained on the display.

Likewise, to calculate the probability for set 2:

$$P_2 = P_1 \times \frac{b_1 \times c_1}{a_2 \times d_2} = 0{\cdot}9865551 \times \frac{39 \times 6}{2 \times 10} = 11{\cdot}542694$$

This is as far as we need go, but for illustration we will calculate the probabilities for all possible tables for the given marginal totals.

| Set | Probability |
|---|---|
| 0 | 0·0000317 |
| 1 | 0·0009866 |
| 2 | 0·0115427 |
| 3 | 0·0664581 |
| 4 | 0·2049126 |
| 5 | 0·3404701 |
| 6 | 0·2837251 |
| 7 | 0·0918729 |
| Total | 0·9999998 |

A useful check is that all the probabilities should sum to one (within the limits of rounding).

The observed set has a probability of 0·0115427. The P value is the probability of getting the observed set, or one more extreme. A one tailed P value would be

0·0115427 + 0·0009866 + 0·0000317 = 0·01256

and this is the conventional approach. (This value can be obtained using the free package SISA; see http://home.clara.net/sisa/fisher.htm.) Armitage and Berry¹ favour the mid P value, which counts only half the probability of the observed set. It is given by

0·5 × 0·0115427 + 0·0009866 + 0·0000317 = 0·0068.

To get the two tailed value we double the one tailed result, thus P = 0·025 for the conventional or P = 0·0136 for the mid P approach.

The conventional approach to calculating the P value for Fisher’s exact test has been shown to be conservative (that is, it requires more evidence than is necessary to reject a false null hypothesis). The mid P is less conservative (that is, more powerful) and also has some theoretical advantages. This is the one we advocate. For larger samples the P value obtained from a χ² test with Yates’ correction will correspond to the conventional approach, and the P value from the uncorrected test will correspond to the mid P value.
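The calculator steps above can be checked with a short script. What follows is a minimal sketch in Python, not the SISA or CIA implementation: it evaluates the probability of set 0 from the factorial formula, steps through the remaining sets with the same multiply-and-divide recurrence, and then forms the conventional one tailed and mid P values. The function and variable names are illustrative assumptions, and the sketch assumes the first column total is the smallest marginal total.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Exact probability of one fourfold table, given its marginal totals."""
    r1, r2 = a + b, c + d          # row totals
    s1, s2 = a + c, b + d          # column totals
    n = r1 + r2
    numerator = factorial(r1) * factorial(r2) * factorial(s1) * factorial(s2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def all_set_probabilities(r1, r2, s1):
    """Probabilities for sets 0..s1 (assumes s1 is the smallest marginal total)."""
    a, b, c, d = 0, r1, s1, r2 - s1           # cells of set 0
    probs = [table_probability(a, b, c, d)]
    while c > 0:
        # Next set: multiply by the current b and c, divide by the next a and d.
        probs.append(probs[-1] * b * c / ((a + 1) * (d + 1)))
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return probs


# Table 9.2: row totals 40 and 15, first column total 7, observed set 2.
probs = all_set_probabilities(40, 15, 7)
observed = 2
one_tailed = sum(probs[:observed + 1])                  # observed set or more extreme
mid_p = 0.5 * probs[observed] + sum(probs[:observed])   # half weight on observed set

print([round(p, 7) for p in probs])
print(round(one_tailed, 5), round(mid_p, 4))
```

Working with exact integers for the factorials avoids the calculator's overflow problem, so no scaling by 1000 is needed here.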

In either case, the P value is less than the conventional 5% level; the medical officer can conclude that there is a problem in dropping zone A. The calculation of confidence intervals for the difference in proportions for small samples is complicated and the methods described in Chapter 6 are only strictly valid for large samples. We use the program CIA⁴ to do the calculations. This gives the estimated difference as 28·3% and the 95% CI is 6·8% to 53·5%. (The methods of Chapter 6 give 3·5% to 53·1%.)

### Common questions

#### Why is Fisher’s test called an exact test?

Because of the discrete nature of the data, and the limited amount of it, combinations of results which give the same marginal totals can be listed, and probabilities attached to them. Thus, given these marginal totals we can work out exactly what is the probability of getting an observed result, in the same way that we can work out exactly the probability of getting six heads out of ten tosses of a fair coin. One difficulty for calculating confidence intervals is that there may not be values which correspond “exactly” to 95%, so we cannot get an “exact” 95% confidence interval but (say) one with a 97% coverage or one with a 94% coverage.

Note that we should distinguish between “exact” P values and “precise” P values. The latter are simply P values calculated from approximate tests such as the chi squared test, but where we can now calculate a precise P value rather than a range such as 0·01 < P < 0·05.

### Reading and reporting Fisher’s exact test

- Present the results given earlier as: “Injury rate in dropping zone A was 33%, in dropping zone B 5%; difference 28% (95% confidence interval 6·8 to 53·5% CIA) (P = 0·0136, Fisher’s exact test mid P).”
- When reading about Fisher’s exact test, always ask whether the P value is one sided or two sided and beware confidence intervals that use large-sample approximations such as those given in Chapter 6.

### Exercise

#### Exercise 9.1

Of 30 men employed in a small workshop 18 worked in one department and 12 in another department. In one year five of the 18 reported sick with septic hands, and of the 12 men in the other department one did so. Is there a difference in the departments and how would you report this result?

### References

1. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford: Blackwell Scientific Publications, 1994:1234.
2. Lentner C (ed.). Geigy scientific tables, 8th ed. Basle: Geigy, 1982.
3. Strike PW. Statistical methods in laboratory medicine. Oxford: Butterworth-Heinemann, 1991.
4. Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Books, 2000.

## 10 Rank score tests

Population distributions are characterised, or defined, by parameters such as the mean and standard deviation. For skew distributions we would need to know other parameters such as the degree of skewness before the distribution could be identified uniquely, but the mean and standard deviation are sufficient to identify the Normal distribution uniquely. The t test described earlier depends for its validity on an assumption that the data originate from a Normally distributed population and that, when two groups are compared, any difference between the two samples arises only because the populations differ in their mean value. However, if we were concerned that the data did not originate from a Normally distributed population, then there are tests available which do not make use of this assumption. Because the data are no longer Normally distributed, the distribution cannot be characterised by a few parameters, and so the tests are often called “nonparametric”. This is somewhat of a misnomer because, as we shall see, to be able to say anything useful about the population we must compare parameters. As was mentioned in Chapter 5, if the sample sizes in both groups are large, then lack of Normality is of less concern, and the large-sample tests described in that chapter would apply.

Rank sum tests were described by Wilcoxon and, independently, by Mann and Whitney; these tests have since been shown to be equivalent. Convention has now ascribed the Wilcoxon test to paired data and the Mann–Whitney U test to unpaired data.

### Paired samples

Boogert et al.¹ (data also given in Shott²) used ultrasound to record foetal movements before and after chorionic villus sampling. The percentage of time the foetus spent moving is given in Table 10.1 for ten pregnant women.

Table 10.1 Wilcoxon test on foetal movement before and after chorionic villus sampling.¹,²

| Patient no. | Before sampling (2) | After sampling (3) | Difference (before − after) (4) | Rank (5) | Signed rank (6) |
|---|---|---|---|---|---|
| 1 | 25 | 18 | 7 | 9 | 9 |
| 2 | 24 | 27 | −3 | 5·5 | −5·5 |
| 3 | 28 | 25 | 3 | 5·5 | 5·5 |
| 4 | 15 | 20 | −5 | 8 | −8 |
| 5 | 20 | 17 | 3 | 5·5 | 5·5 |
| 6 | 23 | 24 | −1 | 1·5 | −1·5 |
| 7 | 21 | 24 | −3 | 5·5 | −5·5 |
| 8 | 20 | 22 | −2 | 3 | −3 |
| 9 | 20 | 19 | 1 | 1·5 | 1·5 |
| 10 | 27 | 19 | 8 | 10 | 10 |

If we are concerned that the differences in percentage of time spent moving are unlikely to be Normally distributed we could use the Wilcoxon signed rank test using the following assumptions:

1. The paired differences are independent.
2. The differences come from a symmetrical distribution.

We do not need to perform a test to ensure that the differences come from a symmetrical distribution: an “eyeball” test will suffice. A plot of the differences in column (4) of Table 10.1 is given in Figure 10.1, which shows that the distribution of the differences is plausibly symmetrical.

Figure 10.1 Plot of differences in fetal movement with mean value. (Vertical axis: difference before and after sampling.)

The differences are then ranked in column (5), ignoring their signs and omitting any zero differences. When two or more differences are identical each is allotted the point half way between the ranks they would fill if distinct, irrespective of the plus or minus sign. For instance, the differences of −1 (patient 6) and +1 (patient 9) fill ranks 1 and 2. As (1 + 2)/2 = 1·5, they are allotted rank 1·5. In column (6) the ranks are repeated from column (5), but to each is attached the sign of the difference from column (4). A useful check is that the sum of the ranks must add to n(n + 1)/2. In this case 10(10 + 1)/2 = 55.

The numbers representing the positive ranks and the negative ranks in column (6) are added up separately and only the smaller of the two totals is used. Irrespective of its sign, the total is referred to Table D in the Appendix against the number of pairs used in the investigation. Rank totals larger than those in the table are non-significant at the level of probability shown. In this case the smaller of the two rank totals is 23·5. This is larger than the number (8) given for ten pairs in Table D and so the result is not significant. A confidence interval associated with the test is described by Campbell and Gardner³ and by Altman et al.,⁴ and is easily obtained from the program CIA.⁵ The median difference is zero. CIA gives the 95% confidence interval as −2·50 to 4·00. This is quite narrow and so from this small study we can conclude that we have little evidence that chorionic villus sampling alters the movement of the foetus.

Note, perhaps contrary to intuition, that the Wilcoxon test, although a rank test, may give a different value if the data are transformed, say by taking logarithms. Thus it may be worth plotting the distribution of the differences for a number of transformations to see if they make the distribution appear more symmetrical.

### Unpaired samples

A senior registrar in the rheumatology clinic of a district hospital has designed a clinical trial of a new drug for rheumatoid arthritis. Twenty patients were randomised into two groups of ten to receive either the standard therapy A, or a new treatment B. The plasma globulin fractions after treatment are listed in Table 10.2.

Table 10.2 Plasma globulin fraction after randomisation to treatment A or B.

| Treatment | Plasma globulin fraction |
|---|---|
| A | 38, 26, 29, 41, 36, 31, 32, 30, 35, 33 |
| B | 45, 28, 27, 38, 40, 42, 39, 39, 34, 45 |

We wish to test whether the new treatment has changed the plasma globulin, and we are worried about the assumption of Normality. The first step is to plot the data (Figure 10.2).

Figure 10.2 Plasma globulin fraction after treatments A or B with mean values. (Vertical axis: plasma globulin fraction; horizontal axis: treatment A, treatment B.)

The clinician was concerned about the lack of Normality of the underlying distribution of the data and so decided to use a nonparametric test. The appropriate test, the Mann–Whitney U test, is computed as follows.

The observations in the two samples are combined into a single series and ranked in order, but in the ranking the figures from one sample must be distinguished from those of the other. The data appear as set out in Table 10.3. To save space they have been set out in two columns, but a single ranking is done. The figures for sample B are set in bold type. Again the sum of the ranks is n(n + 1)/2.

Table 10.3 Combined results of Table 10.2.

| Globulin fraction | Rank | Globulin fraction | Rank |
|---|---|---|---|
| 26 | 1 | 36 | 11 |
| **27** | 2 | 38 | 12·5 |
| **28** | 3 | **38** | 12·5 |
| 29 | 4 | **39** | 14·5 |
| 30 | 5 | **39** | 14·5 |
| 31 | 6 | **40** | 16 |
| 32 | 7 | 41 | 17 |
| 33 | 8 | **42** | 18 |
| **34** | 9 | **45** | 19·5 |
| 35 | 10 | **45** | 19·5 |

Totals of ranks: sample A, 81·5; sample B, 128·5.

The ranks for the two samples are now added separately, and the smaller total is used. It is referred to Table E, with $n_1$ equal to the number of observations in one sample and $n_2$ equal to the number of observations in the other sample. In this case they both equal 10. At $n_1 = 10$ and $n_2 = 10$ the upper part of the table shows the figure 78. The smaller total of the ranks is 81·5. Since this is slightly larger than 78 it does not reach the 5% level of probability. The result is therefore not significant at that level. In the lower part of Table E, which gives the figures for the 1% level of probability, the figure for $n_1 = 10$ and $n_2 = 10$ is 71. As expected, the observed total is even further from 71 than from the 5% figure of 78.

To calculate a meaningful confidence interval we assume that if the two samples come from different populations the distribution of these populations differs only in that one appears shifted to the left or right of the other. This means, for example, that we do not expect one sample to be strongly right skewed and one to be strongly left skewed. If the assumption is reasonable then a confidence interval for the median difference can be calculated.³,⁴ Note that the CIA computer program does not calculate the difference in medians, but rather the median of all possible differences between the two samples. This is usually close to the median difference and has theoretical advantages. From CIA we find that the difference in medians (estimated in this way) is −5·5 and the approximate 95% confidence interval is −10 to 1·0. As might be expected from the significance test, this interval includes zero. Although this result is not significant it would be unwise to conclude that there was no evidence that treatments A and B differed because the confidence interval is quite wide. This suggests that a larger study should be planned.

If the two samples are of unequal size a further calculation is needed after the ranking has been carried out as in Table 10.3. Let $n_1$ be the number of patients or objects in the smaller sample and $T_1$ the total of the ranks for that sample. Let $n_2$ be the number of patients or objects in the larger sample. Then calculate $T_2$ from the following formula:

$$T_2 = n_1(n_1 + n_2 + 1) - T_1.$$

Finally enter Table E with the smaller of $T_1$ or $T_2$. As before, only totals smaller than the critical points in Table E are significant. See Exercise 10.2 for an example of this method.

If there are only a few ties (that is, values in the data that are equal), affecting say less than 10% of the data, then for sample sizes outside the range of Table E we can calculate

$$z = \frac{\left|T_1 - n_1(n_1 + n_2 + 1)/2\right|}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}}.$$

On the null hypothesis that the two samples come from the same population, z is approximately Normally distributed, mean zero and standard deviation one, and can be referred to Table A to calculate the P value.

From the data of Table 10.2 we obtain

$$z = \frac{\left|81{\cdot}5 - 10 \times 21/2\right|}{\sqrt{10 \times 10 \times 21/12}} = 1{\cdot}78$$

and from Table A we find that P is about 0·075, which corroborates the earlier result.

Data that are not Normally distributed can sometimes be transformed by the use of logarithms or some other method to make them Normally distributed, and a t test performed. Consequently the best procedure to adopt may require careful thought. The extent and nature of the difference between two samples is often brought out more clearly by standard deviations and t tests than by nonparametric tests.
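The rank totals, the Normal approximation and the shift estimate for the data of Table 10.2 can be reproduced with a short script. This is a minimal sketch in Python using only the standard library; it is not the CIA program, the helper name `midranks` is an illustrative assumption, and the z statistic is the version given above with no correction for ties.

```python
from math import erfc, sqrt
from statistics import median


def midranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i..j-1 would hold ranks i+1..j; their mean is (i + 1 + j) / 2.
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]


treatment_a = [38, 26, 29, 41, 36, 31, 32, 30, 35, 33]
treatment_b = [45, 28, 27, 38, 40, 42, 39, 39, 34, 45]
n1, n2 = len(treatment_a), len(treatment_b)

ranks = midranks(treatment_a + treatment_b)   # one ranking of the combined series
t1 = sum(ranks[:n1])                          # rank total for sample A
t2 = sum(ranks[n1:])                          # rank total for sample B

# Normal approximation to the rank total, two sided P value.
z = abs(t1 - n1 * (n1 + n2 + 1) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_two_sided = erfc(z / sqrt(2))

# Median of all possible differences A - B between the two samples.
shift = median([a - b for a in treatment_a for b in treatment_b])

print(t1, t2, round(z, 2), round(p_two_sided, 3), shift)
```

The rank totals come out as 81·5 and 128·5 and the median of all pairwise differences as −5·5, in agreement with the text; the P value is computed from the unrounded z, so it may differ slightly in the third decimal place from a value read off Table A.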

It is an interesting observation that the Mann–Whitney U test is unaffected by simple transformations, but the Wilcoxon signed rank test is affected. This is because the rank of a set of numbers and the rank of (say) the log of these numbers are the same. However, the rank of the difference in a set of numbers and the rank of the difference in their logs are not necessarily the same.

### Common questions

#### Nonparametric tests are valid for both non-Normally distributed data and Normally distributed data, so why not use them all the time?

It would seem prudent to use nonparametric tests in all cases, which would save one the bother of testing for Normality. Parametric tests are preferred, however, for the following reasons:

- As I have tried to emphasise in this book, we are rarely interested in a significance test alone; we would like to say something about the population from which the samples came, and this is best done with estimates of parameters and confidence intervals.
- It is difficult to do flexible modelling with nonparametric tests, for example allowing for confounding factors using multiple regression (see Chapter 11).

#### Do nonparametric tests compare medians?

It is a commonly held belief that a Mann–Whitney U test is in fact a test for differences in medians. However, two groups could have the same median and yet have a significant Mann–Whitney U test. Consider the following data for two groups, each with 100 observations. Group 1: 0 (× 98), 1, 2; Group 2: 0 (× 51), 1, 2 (× 48). The median in both cases is 0, but from the Mann–Whitney test P < 0·0001.

Only if we are prepared to make the additional assumption that the difference in the two groups is simply a shift in location (that is, the distribution of the data in one group is simply shifted by a fixed amount from the other) can we say that the test is a test of the difference in medians. However, if the groups have the same distribution, then a shift in location will move medians and means by the same amount and so the difference in medians is the same as the difference in means. Thus the Mann–Whitney U test is also a test for the difference in means.
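The equal-medians example can be checked numerically. The sketch below is only an illustration in standard-library Python: it uses the Normal approximation to the rank total given in the section on unpaired samples, without a correction for ties, which is an assumption of this sketch rather than the method of any particular package. A tie correction would reduce the standard deviation and so make z larger still. The helper name is hypothetical.

```python
from math import erfc, sqrt
from statistics import median

group1 = [0] * 98 + [1, 2]
group2 = [0] * 51 + [1] + [2] * 48


def rank_total(sample, other):
    """Total of the midranks of `sample` in the combined ranking."""
    combined = sample + other
    total = 0.0
    for value in sample:
        below = sum(1 for v in combined if v < value)
        equal = combined.count(value)
        # Tied values occupy ranks below+1 .. below+equal; use their mean.
        total += below + (equal + 1) / 2
    return total


n1, n2 = len(group1), len(group2)
t1 = rank_total(group1, group2)

# Normal approximation (no tie correction), two sided P value.
z = abs(t1 - n1 * (n1 + n2 + 1) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_two_sided = erfc(z / sqrt(2))

print(median(group1), median(group2), t1, round(z, 2), p_two_sided)
```

Both medians are 0, yet the rank total for group 1 lies far below its expected value of 10 050, which is why the test is highly significant.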

#### How is the Mann–Whitney U test related to the t test?

If one were to input the ranks of the data rather than the data themselves into a two sample t test program, the P value obtained would be very close to that produced by a Mann–Whitney U test.

### Reading and reporting rank score tests

When reading a result from a nonparametric test, one is often faced with a bald P value. One should then ask what hypothesis is being tested. As shown above, one has to make further assumptions before statements concerning parameters such as means can be made.

A Mann–Whitney U test such as that described earlier should be reported as “Mann–Whitney P = 0·075, difference in medians −5·5, 95% CI −10 to 1·0”. Often, because of inappropriate software, confidence intervals for nonparametric tests are not reported, but there are now a number of packages which will calculate them. For the two sample test, a confidence interval is only interpretable if the distribution of the outcome variable is similar in the two groups; if the distributions are markedly different, then the two groups differ in more than just a shift in location. Differences in spread can be as important as differences in medians.⁶

### Exercises

#### Exercise 10.1

A new treatment in the form of tablets for the prophylaxis of migraine has been introduced, to be taken before an impending attack. Twelve patients agree to try this remedy in addition to the usual general measures they take, subject to advice from their doctor on the taking of analgesics also.

A crossover trial with identical placebo tablets is carried out over a period of 8 months. The numbers of attacks experienced by each patient on, firstly, the new treatment and, secondly, the placebo were as follows:

| Patient | New treatment | Placebo |
|---|---|---|
| 1 | 4 | 2 |
| 2 | 12 | 6 |
| 3 | 6 | 6 |
| 4 | 3 | 5 |
| 5 | 15 | 9 |
| 6 | 10 | 11 |
| 7 | 2 | 4 |
| 8 | 5 | 6 |
| 9 | 11 | 3 |
| 10 | 4 | 7 |
| 11 | 6 | 0 |
| 12 | 2 | 5 |

In a Wilcoxon signed rank test what is the smaller total of ranks? Is it significant at the 5% level?

#### Exercise 10.2

Another doctor carried out a similar pilot study with this preparation on eight patients, giving the same placebo to ten other patients. The numbers of migraine attacks experienced by the patients over a period of 6 months were as follows.

Group receiving new preparation: patient (1) 8; (2) 6; (3) 0; (4) 3; (5) 14; (6) 5; (7) 11; (8) 2

Group receiving placebo: patient (9) 7; (10) 10; (11) 4; (12) 11; (13) 2; (14) 8; (15) 8; (16) 6; (17) 1; (18) 5

In a Mann–Whitney two sample test what is the smaller total of ranks? Which sample of patients provides it? Is the difference significant at the 5% level?

### References

1. Boogert A, Manhigh A, Visser GHA. The immediate effects of chorionic villus sampling on fetal movements. Am J Obstet Gynecol 1987;157:137–9.
2. Shott S. Statistics for health professionals. Philadelphia: WB Saunders, 1990.
3. Campbell MJ, Gardner MJ. Calculating confidence intervals for some non-parametric analyses. BMJ 1988;296:1369–71.
4. Altman DG, Machin D, Bryant T, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Books, 2000:37–9.
5. Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Books, 2000.
6. Hart A. Mann–Whitney test is not just a test of medians: differences in spread can be important. BMJ 2001;323:391–3.

11 Correl<strong>at</strong>ion andregressionThe word correl<strong>at</strong>ion is used in everyday life to denote some formof associ<strong>at</strong>ion. We might say th<strong>at</strong> we have noticed a correl<strong>at</strong>ionbetween foggy days and <strong>at</strong>tacks of wheeziness. However, inst<strong>at</strong>istical terms we use correl<strong>at</strong>ion to denote associ<strong>at</strong>ion betweentwo quantit<strong>at</strong>ive variables. We also assume th<strong>at</strong> the associ<strong>at</strong>ion islinear, th<strong>at</strong> one variable increases or decreases a fixed amount for aunit increase or decrease in the other. The other technique th<strong>at</strong>is often used in these circumstances is regression, which involvesestim<strong>at</strong>ing the best straight line to summarise the associ<strong>at</strong>ion.Correl<strong>at</strong>ion coefficientThe degree of associ<strong>at</strong>ion is measured by a correl<strong>at</strong>ion coefficient,denoted by r. It is sometimes called Pearson’s correl<strong>at</strong>ion coefficientafter its origin<strong>at</strong>or, and is a measure of linear associ<strong>at</strong>ion. If acurved line is needed to express the rel<strong>at</strong>ionship, other and morecomplic<strong>at</strong>ed measures of the correl<strong>at</strong>ion must be used.The correl<strong>at</strong>ion coefficient is measured on a scale th<strong>at</strong> variesfrom + 1 through 0 to − 1. Complete or perfect correl<strong>at</strong>ion betweentwo variables is expressed by either + 1 or − 1. When one variableincreases as the other increases the correl<strong>at</strong>ion is positive; whenone decreases as the other increases it is neg<strong>at</strong>ive. Completeabsence of correl<strong>at</strong>ion is represented by 0. 
Figure 11.1 gives some graphical representations of correlation.

Looking at data: scatter diagrams

When an investigator has collected two series of observations and wishes to see whether there is a relationship between them, he or she should first construct a scatter diagram. The vertical scale
represents one set of measurements and the horizontal scale the other. If one set of observations consists of experimental results and the other consists of a time scale or observed classification of some kind, it is usual to put the experimental results on the vertical axis. These represent what is called the “dependent variable”. The “independent variable”, such as time or height or some other observed classification, is measured along the horizontal axis, or baseline.

[Figure 11.1 Correlation illustrated. Panels: (a) r = +1; (b) r = −1; (c) r = 0; (d) curved line.]

The words “independent” and “dependent” could puzzle the beginner because it is sometimes not clear what is dependent on what. This confusion is a triumph of common sense over misleading terminology, because often each variable is dependent on some third variable, which may or may not be mentioned. It is reasonable, for instance, to think of the height of children as dependent on age rather than the converse, but consider a positive correlation between mean tar yield and nicotine yield of certain brands of cigarette.¹ The nicotine liberated is unlikely to have its origin in the tar: both vary in parallel with some other factor or
factors in the composition of the cigarettes. The yield of the one does not seem to be “dependent” on the other in the sense that, on average, the height of a child depends on his age. In such cases it often does not matter which scale is put on which axis of the scatter diagram. However, if the intention is to make inferences about one variable from the other, the observations from which the inferences are to be made are usually put on the baseline. As a further example, a plot of monthly deaths from heart disease against monthly sales of ice cream would show a negative association. However, it is hardly likely that eating ice cream protects from heart disease! It is simply that the mortality rate from heart disease is inversely related, and ice cream consumption positively related, to a third factor, namely environmental temperature.

Calculation of the correlation coefficient

Table 11.1 Height and pulmonary anatomical dead space in 15 children.

Child number   Height (cm), x   Dead space (ml), y
(1)            (2)              (3)
1              110              44
2              116              31
3              124              43
4              129              45
5              131              56
6              138              79
7              142              57
8              150              56
9              153              58
10             155              92
11             156              78
12             159              64
13             164              88
14             168              112
15             174              101
Total          2169             1004
Mean           144·6            66·933

A paediatric registrar has measured the pulmonary anatomical dead space (in ml) and height (in cm) of 15 children. The data are given in Table 11.1 and the scatter diagram is shown in Figure 11.2. Each dot represents one child, and it is placed at the point corresponding to the measurement of the height (horizontal axis) and the dead space (vertical axis). The registrar now inspects the
pattern to see whether it seems likely that the area covered by the dots centres on a straight line or whether a curved line is needed. In this case the paediatrician decides that a straight line can adequately describe the general trend of the dots. His next step will therefore be to calculate the correlation coefficient.

[Figure 11.2 Scatter diagram of relation in 15 children between height and pulmonary anatomical dead space. Horizontal axis: height of children (cm); vertical axis: anatomical dead space (ml).]

When making the scatter diagram (Figure 11.2) to show the heights and pulmonary anatomical dead spaces in the 15 children, the paediatrician sets out figures as in columns (1), (2), and (3) of Table 11.1. It is helpful to arrange the observations in serial order of the independent variable when one of the two variables is clearly identifiable as independent. The corresponding figures for the dependent variable can then be examined in relation to the increasing series for the independent variable. In this way we get the same picture, but in numerical form, as appears in the scatter diagram.

The calculation of the correlation coefficient is as follows, with x representing the values of the independent variable (in this case height) and y representing the values of the dependent variable (in this case anatomical dead space). The formula to be used is:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2 \sum (y - \bar{y})^2\right]}}$$

which can be shown to be equal to:

$$r = \frac{\sum xy - n\bar{x}\bar{y}}{(n - 1)\,SD(x)\,SD(y)}$$

For the data in Table 11.1 we get

$$r = \frac{5426{\cdot}6}{14 \times 19{\cdot}3679 \times 23{\cdot}6476} = 0{\cdot}846$$

The correlation coefficient of 0·846 indicates a strong positive correlation between size of pulmonary anatomical dead space and height of child. But in interpreting correlation it is important to remember that correlation is not causation. There may or may not be a causative connection between the two correlated variables. Moreover, if there is a connection it may be indirect.

A part of the variation in one of the variables (as measured by its variance) can be thought of as being due to its relationship with the other variable and another part as due to undetermined (often “random”) causes. The part due to the dependence of one variable on the other is measured by $r^2$. For these data $r^2$ = 0·716, so we can say that 72% of the variation between children in size of the anatomical dead space is accounted for by the height of the child. If we wish to label the strength of the association, for absolute values of r, 0–0·19 is regarded as very weak, 0·2–0·39 as weak, 0·40–0·59 as moderate, 0·6–0·79 as strong and 0·8–1 as very strong correlation, but these are rather arbitrary limits, and the context of the results should be considered.

Significance test

To test whether the association is merely apparent, and might have arisen by chance, use the t test in the following form:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} \qquad (11.1)$$
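The hand calculation of r, and the t statistic of equation (11.1), can be checked with a few lines of code. The following is a minimal sketch in Python, assuming the figures of Table 11.1 are typed in as two lists; the function names `pearson_r` and `t_statistic` are illustrative choices and do not come from the text.

```python
import math

# Table 11.1: height (cm) and anatomical dead space (ml) of 15 children
height = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and products about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Equation (11.1): t = r * sqrt((n - 2) / (1 - r**2)), on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r = pearson_r(height, dead_space)
t = t_statistic(r, len(height))
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, t = {t:.2f}")
```

Because the script does not round r before using it, the t statistic may differ in the last digit from a value worked by hand with r = 0·846.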

The t table (Table B in the Appendix) is entered at n − 2 degrees of freedom.

For example, the correlation coefficient for the data in Table 11.1 was 0·846. The number of pairs of observations was 15. Applying equation (11.1), we have:

$$t = 0{\cdot}846\sqrt{\frac{15 - 2}{1 - 0{\cdot}846^2}} = 5{\cdot}72$$

Entering Table B at 15 − 2 = 13 degrees of freedom we find that at t = 5·72, P < 0·001, so the correlation coefficient may be regarded as highly significant. Thus (as could be seen immediately from the scatter plot) we have a very strong correlation between dead space and height which is most unlikely to have arisen by chance.

The assumptions governing this test are as follows:

• Both variables are plausibly Normally distributed.
• There is a linear relationship between them.
• The null hypothesis is that there is no association between them.

The test should not be used for comparing two methods of measuring the same quantity, such as two methods of measuring peak expiratory flow rate. Its use in this way appears to be a common mistake, with a significant result being interpreted as meaning that one method is equivalent to the other.
The reasons have been extensively discussed,² but it is worth recalling that a significant result tells us little about the strength of a relationship. From the formula it should be clear that even with a very weak relationship (say r = 0·1) we would get a significant result with a large enough sample (say n > 1000).

Spearman rank correlation

A plot of the data may reveal outlying points well away from the main body of the data, which could unduly influence the calculation of the correlation coefficient. Alternatively, the variables may be quantitative discrete such as a mole count, or ordered categorical such as a pain score. A nonparametric procedure, due
to Spearman, is to replace the observations by their ranks in the calculation of the correlation coefficient. This results in a simple formula for Spearman’s rank correlation, $r_s$:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference in the ranks of the two variables for a given individual. Thus we can derive Table 11.2 from the data in Table 11.1. From this we get that

$$r_s = 1 - \frac{6 \times 60{\cdot}5}{15 \times (225 - 1)} = 0{\cdot}8920$$

In this case the value is very close to that of the Pearson correlation coefficient. For n > 10, the Spearman rank correlation coefficient can be tested for significance using the t test given earlier.

Table 11.2 Derivation of Spearman rank correlation from data of Table 11.1.

Child number   Rank height   Rank dead space   d       d²
1              1             3                 2       4
2              2             1                 −1      1
3              3             2                 −1      1
4              4             4                 0       0
5              5             5·5               0·5     0·25
6              6             11                5       25
7              7             7                 0       0
8              8             5·5               −2·5    6·25
9              9             8                 −1      1
10             10            13                3       9
11             11            10                −1      1
12             12            9                 −3      9
13             13            12                −1      1
14             14            15                1       1
15             15            14                −1      1
Total                                                  60·5
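The ranking in Table 11.2 and the resulting rank correlation can be reproduced programmatically. The sketch below, in Python, assumes tied values share the mean of the ranks they occupy (as the two children with a dead space of 56 ml do in the table); the helper names `average_ranks` and `spearman_rs` are hypothetical and introduced only for illustration.

```python
# Table 11.1: height (cm) and anatomical dead space (ml) of 15 children
height = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]


def average_ranks(values):
    """Rank from 1 upward; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at sorted position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Shortcut formula r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


rs, sum_d2 = spearman_rs(height, dead_space)
print(f"sum of d squared = {sum_d2}, r_s = {rs:.4f}")
```

When ranks are tied the shortcut formula is only approximate, so a routine that computes the Pearson coefficient of the ranks directly can give a value that differs slightly in the fourth decimal place.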

The regression equation

Correlation describes the strength of an association between two variables, and is completely symmetrical: the correlation between A and B is the same as the correlation between B and A. However, if the two variables are related it means that when one changes by a certain amount the other changes on average by a certain amount. For instance, in the children described earlier greater height is associated, on average, with greater anatomical dead space. If y represents the dependent variable and x the independent variable, this relationship is described as the regression of y on x.

The relationship can be represented by a simple equation called the regression equation. In this context “regression” (the term is a historical anomaly) simply means that the average value of y is a “function” of x, that is, it changes with x.

The regression equation representing how much y changes with any given change of x can be used to construct a regression line on a scatter diagram, and in the simplest case this is assumed to be a straight line. The direction in which the line slopes depends on whether the correlation is positive or negative. When the two sets of observations increase or decrease together (positive) the line slopes upwards from left to right; when one set decreases as the other increases the line slopes downwards from left to right. As the line must be straight, it will probably pass through few, if any, of the dots.
Given that the association is well described by a straight line, we have to define two features of the line if we are to place it correctly on the diagram. The first of these is its distance above the baseline; the second is its slope. These are expressed in the following regression equation:

$$y = \alpha + \beta x$$

With this equation we can find a series of values of $y_{\mathrm{fit}}$, the dependent variable, that correspond to each of a series of values of x, the independent variable. The parameters α and β have to be estimated from the data. The parameter α signifies the distance above the baseline at which the regression line cuts the vertical (y) axis; that is, the value of y when x = 0. The parameter β (the regression coefficient) signifies the amount by which change in x must be multiplied to give the corresponding average change in y, or the amount y changes for a unit increase in x. In this way it
represents the degree to which the line slopes upwards or downwards.

The regression equation is often more useful than the correlation coefficient. It enables us to predict y from x and gives us a better summary of the relationship between the two variables. If, for a particular value of x, $x_i$, the regression equation predicts a value of $y_{\mathrm{fit}}$, the prediction error is $y_i - y_{\mathrm{fit}}$. It can easily be shown that any straight line passing through the mean values $\bar{x}$ and $\bar{y}$ will give a total prediction error $\sum (y_i - y_{\mathrm{fit}})$ of zero because the positive and negative terms exactly cancel. To remove the negative signs we square the differences, and the regression equation is chosen to minimise the sum of squares of the prediction errors,

$$S^2 = \sum (y_i - y_{\mathrm{fit}})^2$$

We denote the sample estimates of α and β by a and b. It can be shown that the one straight line that minimises $S^2$, the least squares estimate, is given by

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and

$$a = \bar{y} - b\bar{x}$$

It can be shown that

$$b = \frac{\sum xy - n\bar{x}\bar{y}}{(n - 1)\,SD(x)^2} \qquad (11.2)$$

which is of use because we have calculated all the components of equation (11.2) in the calculation of the correlation coefficient.

The calculation of the correlation coefficient on the data in Table 11.1 gave the following:

$$\sum xy = 150\,605, \quad SD(x) = 19{\cdot}3679, \quad \bar{y} = 66{\cdot}93, \quad \bar{x} = 144{\cdot}6$$

Applying these figures to the formulae for the regression coefficients, we have:

$$b = \frac{150\,605 - 15 \times 66{\cdot}93 \times 144{\cdot}6}{14 \times 19{\cdot}3679^2} = \frac{5426{\cdot}6}{5251{\cdot}6} = 1{\cdot}033\ \mathrm{ml/cm}$$

$$a = 66{\cdot}93 - (1{\cdot}033 \times 144{\cdot}6) = -82{\cdot}4$$

Therefore, in this case, the equation for the regression of y on x becomes

$$y = -82{\cdot}4 + 1{\cdot}033x$$

This means that, on average, for every increase in height of 1 cm the increase in anatomical dead space is 1·033 ml over the range of measurements made.

The line representing the equation is shown superimposed on the scatter diagram of the data in Figure 11.3. The way to draw the line is to take three values of x, one on the left side of the scatter diagram, one in the middle and one on the right, and substitute these in the equation, as follows:

If x = 110, y = (1·033 × 110) − 82·4 = 31·2
If x = 140, y = (1·033 × 140) − 82·4 = 62·2
If x = 170, y = (1·033 × 170) − 82·4 = 93·2

Although two points are enough to define the line, three are better as a check. Instead of the middle point one could use the mean for x, since the line must go through the point defined by the mean of x and y. Thus the mean for x is 144·6, and if x = 144·6, y = 1·033 × 144·6 − 82·4 = 66·9, which is the mean for y. Having put the points on a scatter diagram, we simply draw the line through them.

The standard error of the slope SE(b) is given by

$$SE(b) = \frac{S_{\mathrm{res}}}{\sqrt{\sum (x - \bar{x})^2}} \qquad (11.3)$$

where $S_{\mathrm{res}}$ is the residual standard deviation, given by:

$$S_{\mathrm{res}} = \sqrt{\frac{\sum (y - y_{\mathrm{fit}})^2}{n - 2}}$$

This can be shown to be algebraically equal to

$$\sqrt{\frac{SD(y)^2 (1 - r^2)(n - 1)}{n - 2}}$$

We already have to hand all of the terms in this expression. Thus $S_{\mathrm{res}}$ is the square root of 23·6476² × (1 − 0·846²) × 14/13 = 171·2029, that is, 13·08445. The denominator of (11.3) is 72·4680. Thus SE(b) = 13·08445/72·4680 = 0·18055.

[Figure 11.3 Regression line drawn on scatter diagram relating height and pulmonary anatomical dead space in 15 children (Figure 11.2). Horizontal axis: height of children (cm); vertical axis: anatomical dead space (ml).]

We can test whether the slope is significantly different from zero by:

$$t = b/SE(b) = 1{\cdot}033/0{\cdot}18055 = 5{\cdot}72$$

Again, this has n − 2 = 15 − 2 = 13 degrees of freedom. The assumptions governing this test are as follows:

• The prediction errors are approximately Normally distributed. Note that this does not mean that the x or y variable has to be Normally distributed.
• The relationship between the two variables is linear.
• The scatter of points about the line is approximately constant; we would not wish the variability of the dependent variable to be growing as the independent variable increases. If this is the case, try taking logarithms of both the x and y variables.

Note that the test of significance for the slope gives exactly the same value of P as the test of significance for the correlation coefficient. Although the two tests are derived differently, they are algebraically equivalent, which makes intuitive sense, since a test for a significant correlation should be the same as a test for a significant slope.

We can obtain a 95% confidence interval for b from

$$b - t_{0{\cdot}05} \times SE(b) \quad \mathrm{to} \quad b + t_{0{\cdot}05} \times SE(b)$$

where the t statistic from Table B has 13 degrees of freedom, and is equal to 2·160. Thus for our data the 95% confidence interval is

1·033 − 2·160 × 0·18055 to 1·033 + 2·160 × 0·18055 = 0·643 to 1·422

Regression lines give us useful information about the data they are collected from. They show how one variable changes on average with another, and they can be used to find out what one variable is likely to be when we know the other, provided that we ask this question within the limits of the scatter diagram. To project the line at either end (to extrapolate) is always risky because the relationship between x and y may change or some kind of cut off point may exist.
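The slope, intercept, standard error, t statistic and confidence interval worked out above can be reproduced from the raw figures of Table 11.1. The following Python sketch is one way to do it, not a definitive implementation: the function name `fit_line` is an illustrative assumption, and the critical value 2·160 is simply copied from the text rather than computed.

```python
import math

# Table 11.1: height (cm) and anatomical dead space (ml) of 15 children
height = [110, 116, 124, 129, 131, 138, 142, 150, 153, 155, 156, 159, 164, 168, 174]
dead_space = [44, 31, 43, 45, 56, 79, 57, 56, 58, 92, 78, 64, 88, 112, 101]

T_CRIT_13_DF = 2.160  # two-sided 5% point of t on 13 df, as quoted in the text


def fit_line(x, y):
    """Least squares intercept a and slope b, residual SD and SE of the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual standard deviation uses n - 2 degrees of freedom
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(residual_ss / (n - 2))
    se_b = s_res / math.sqrt(sxx)  # equation (11.3)
    return a, b, s_res, se_b


a, b, s_res, se_b = fit_line(height, dead_space)
t = b / se_b
ci_low = b - T_CRIT_13_DF * se_b
ci_high = b + T_CRIT_13_DF * se_b
print(f"y = {a:.1f} + {b:.3f}x, SE(b) = {se_b:.4f}, t = {t:.2f}")
print(f"95% CI for slope: {ci_low:.3f} to {ci_high:.3f}")
```

Because the script keeps full precision throughout, its intercept, standard error and interval limits differ slightly from the hand-worked figures, which round intermediate values such as r and the mean of y.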
For instance, a regression line might be drawn relating the chronological age of some children to their bone age, and it might be a straight line between, say, the ages of 5 and 10 years, but to project it up to the age of 30 would clearly lead to error. Computer packages will often produce the intercept from a regression equation, with no warning that it may be totally meaningless. Consider a regression of blood pressure against age in middle aged men. The regression coefficient is often positive, indicating that blood pressure increases with age. The intercept
is often close to zero, but it would be wrong to conclude that this is a reliable estimate of the blood pressure in newly born male infants!

More advanced methods

It is possible to have more than one independent variable, in which case the method is known as multiple regression.³,⁴ This is the most versatile of statistical methods and can be used in many situations. For example it can allow for more than one predictor; age as well as height in the above example. It can also allow for covariates: in a clinical trial the dependent variable may be outcome after treatment, the first independent variable can be binary, 0 for placebo and 1 for active treatment, and the second independent variable may be a baseline variable, measured before treatment, but likely to affect outcome.

Common questions

If two variables are correlated are they causally related?

It is a common error to confuse correlation and causation. All that correlation shows is that the two variables are associated. There may be a third variable, a confounding variable, that is related to both of them. For example, much as with the heart disease example given earlier, monthly deaths by drowning and monthly sales of ice cream are positively correlated, but no one would say the relationship was causal!

How do I test the assumptions underlying linear regression?

First, always look at the scatter plot and ask whether it is linear. Having obtained the regression equation, calculate the residuals $e_i = y_i - y_{\mathrm{fit}}$.
A histogram of $e_i$ will reveal departures from Normality and a plot of $e_i$ versus $y_{\mathrm{fit}}$ will reveal whether the residuals increase in size as $y_{\mathrm{fit}}$ increases.

When should I use correlation and when should I use regression?

If there is a clear causal pathway, then generally it is better to use regression, although quoting the correlation coefficient as a measure of the strength of the relationship is helpful. In
epidemiological surveys, where one is interested only in the strength of a relationship, correlations are preferable. For example, in a survey comparing ovarian cancer rates by country with the sales of oral contraceptives, it is the existence of a relationship that is of interest, and so a correlation coefficient would be a useful summary.

Reading and reporting correlation and regression

• Make clear what type of correlation coefficient is being used (e.g. Pearson or Spearman).
• Ask whether the relationship is really linear. Do the authors produce evidence that the model is reasonable? For example, do they discuss the distribution of the residuals?
• Quote the correlation coefficient and its P value, or the regression slope, 95% CI for the slope, the t statistic, degrees of freedom and P value. Thus for the data in Table 11.1 we would write, “regression slope of dead space against height is 1·033 ml/cm (95% CI 0·643 to 1·422), P < 0·001, r = 0·846”.

Exercises

Exercise 11.1

A study was carried out into the attendance rate at a hospital of people in 16 different geographical areas, over a fixed period of time. The distance of the centre of each area from the hospital was measured in miles.
The results were as follows:

(1) 21%, 6·8; (2) 12%, 10·3; (3) 30%, 1·7; (4) 8%, 14·2; (5) 10%, 8·8; (6) 26%, 5·8; (7) 42%, 2·1; (8) 31%, 3·3; (9) 21%, 4·3; (10) 15%, 9·0; (11) 19%, 3·2; (12) 6%, 12·7; (13) 18%, 8·2; (14) 12%, 7·0; (15) 23%, 5·1; (16) 34%, 4·1

What is the correlation coefficient between the attendance rate and mean distance of the geographical area?

Exercise 11.2

Find the Spearman rank correlation for the data given in Exercise 11.1.

Exercise 11.3

If the values of x from the data in Exercise 11.1 represent mean distance of the area from the hospital and values of y represent attendance rates, what is the equation for the regression of y on x? What does it mean?

Exercise 11.4

Find the standard error and 95% confidence interval for the slope.

References

1 Russell MAH, Cole PY, Idle MS, Adams L. Carbon monoxide yields of cigarettes and their relation to nicotine yield and type of filter. BMJ 1975;3:71–3.
2 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
3 Armitage P, Berry G. Statistical methods in medical research, 3rd edn. Oxford: Blackwell Scientific Publications, 1994:312–41.
4 Campbell MJ. Statistics at square two. London: BMJ Books, 2001.

12 Survival analysis

Survival analysis is concerned with studying the time between entry to a study and a subsequent event. Originally the analysis was concerned with time from treatment until death, hence the name, but survival analysis is applicable to many areas as well as mortality. Recent examples include: time to discontinuation of a contraceptive, maximum dose of bronchoconstrictor required to reduce a patient’s lung function to 80% of baseline, time taken to exercise to maximum tolerance, time that a transdermal patch can be left in place, and time for a leg fracture to heal.

When the outcome of a study is the time between one event and another, a number of problems can occur.

• The times are most unlikely to be Normally distributed.
• We cannot afford to wait until events have happened to all the subjects, for example until all are dead. Some patients might have left the study early: they are lost to follow up. Thus the only information we have about some patients is that they were still alive at the last follow up. These are termed censored observations.

Kaplan–Meier survival curve

We look at the data using a Kaplan–Meier survival curve.¹ Suppose that the survival times, including censored observations, after entry into the study (ordered by increasing duration) of a group of n subjects are $t_1, t_2, \ldots, t_n$. The proportion of subjects, S(t), surviving beyond any follow up time ($t_p$) is estimated by

$$S(t) = \frac{r_1 - d_1}{r_1} \times \frac{r_2 - d_2}{r_2} \times \cdots \times \frac{r_p - d_p}{r_p}$$

Here $t_p$ is the largest survival time less than or equal to t; $r_i$ is the number of subjects alive just before time $t_i$ (the ith ordered survival time); and $d_i$ denotes the number who died at time $t_i$, where i can be any value between 1 and p. For censored observations $d_i = 0$.

Method

Order the survival times by increasing duration, starting with the shortest one. At each event (i) work out the number alive immediately before the event ($r_i$). Before the first event all the patients are alive and so S(t) = 1. If we denote the start of the study as $t_0$, where $t_0 = 0$, then we have $S(t_0) = 1$. We can now calculate the survival probability $S(t_i)$, for each value of i from 1 to n, by means of the following recurrence formula.

Given the number of events (deaths), $d_i$, at time $t_i$ and the number alive, $r_i$, just before $t_i$ calculate

$$S(t_i) = \frac{r_i - d_i}{r_i}\, S(t_{i-1})$$

We do this only for the events and not for censored observations. The survival curve is unchanged at the time of a censored observation, but at the next event after the censored observation the number of people “at risk” is reduced by the number censored between the two events.

Example of calculation of survival curve

McIllmurray and Turkie² describe a clinical trial of 49 patients for the treatment of Dukes’ C colorectal cancer. The data for the two treatments, γ-linolenic acid and a control, are given in Table 12.1.²,³ The calculation of the Kaplan–Meier survival curve for the 25 patients randomly assigned to receive γ-linolenic acid is described in Table 12.2. The + sign indicates censored data.
Until 6 months after treatment, there are no deaths, so S(t) = 1. The effect of the censoring is to remove from the alive group those that are censored. By 6 months two subjects have been censored and so the number alive just before 6 months is 23. There are two deaths at 6 months. Thus,

$$S(6) = 1 \times \frac{23 - 2}{23} = 0{\cdot}9130$$

Table 12.1 Survival in 49 patients with Dukes’ C colorectal cancer randomly assigned to either γ-linolenic acid or control treatment.

Treatment                     Survival time (months)
γ-linolenic acid (n = 25)     1+, 5+, 6, 6, 9+, 10, 10, 10+, 12, 12, 12, 12, 12+, 13+, 15+, 16+, 20+, 24, 24+, 27+, 32, 34+, 36+, 36+, 44+
Control (n = 24)              3+, 6, 6, 6, 6, 8, 8, 12, 12, 12+, 15+, 16+, 18+, 18+, 20, 22+, 24, 28+, 28+, 28+, 30, 30+, 33+, 42

Table 12.2 Calculation of survival curve for 25 patients randomly assigned to receive γ-linolenic acid.

Case   Survival time   Number alive   Deaths   Proportion surviving   Cumulative proportion
(i)    (months) (t_i)  (r_i)          (d_i)    (r_i − d_i)/r_i        surviving S(t)
0      0                              0        –                      1
1      1+              25             0        1                      1
2      5+              24             0        1                      1
3      6               23             2        0·9130                 0·9130
4      6
5      9+              21             0        1                      0·9130
6      10              20             2        0·90                   0·8217
7      10
8      10+
9      12              17             4        0·7647                 0·6284
10     12
11     12
12     12
13     12+
14     13+             12             0        1                      0·6284
15     15+             11             0        1                      0·6284
16     16+             10             0        1                      0·6284
17     20+             9              0        1                      0·6284
18     24              8              1        0·875                  0·5498
19     24+
20     27+             6              0        1                      0·5498
21     32              5              1        0·80                   0·4399
22     34+
23     36+
24     36+
25     44+

We now reduce the number alive (“at risk”) by 2. The censored event at 9 months reduces the “at risk” set to 20. At 10 months there are two deaths, so the proportion surviving is 18/20 = 0·90 and the cumulative proportion surviving is 0·913 × 0·90 = 0·8217.
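The recurrence used to build Table 12.2 is short enough to express directly in code. The sketch below, in Python, is offered as an illustration under stated assumptions: the survival times are entered as a string in the notation of Table 12.1 (a trailing + marks a censored time), and the helper names `parse_times` and `kaplan_meier` are hypothetical. A subject censored at the same time as a death is counted as still at risk at that time, which is how the table treats the patient censored at 10 months.

```python
# gamma-linolenic acid group from Table 12.1; "+" marks a censored observation
GLA_TIMES = ("1+ 5+ 6 6 9+ 10 10 10+ 12 12 12 12 12+ 13+ 15+ 16+ 20+ "
             "24 24+ 27+ 32 34+ 36+ 36+ 44+")


def parse_times(raw):
    """Turn '6 9+ 10' into [(6, True), (9, False), (10, True)] as (time, died)."""
    return [(int(tok.rstrip("+")), not tok.endswith("+")) for tok in raw.split()]


def kaplan_meier(observations):
    """Return (time, number at risk, deaths, cumulative survival) per event time."""
    survival = 1.0
    rows = []
    for t in sorted({time for time, died in observations if died}):
        # Everyone whose time is >= t is still alive just before t
        at_risk = sum(1 for time, _ in observations if time >= t)
        deaths = sum(1 for time, died in observations if died and time == t)
        survival *= (at_risk - deaths) / at_risk
        rows.append((t, at_risk, deaths, survival))
    return rows


gla = parse_times(GLA_TIMES)
for t, at_risk, deaths, survival in kaplan_meier(gla):
    print(f"t = {t:2d}  at risk = {at_risk:2d}  deaths = {deaths}  S(t) = {survival:.4f}")
```

Only event times produce a row, since the curve does not change at a censored observation; the censored subjects simply drop out of the at-risk count at the next event.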

The cumulative survival is conveniently stored in the memory of a calculator. As one can see, the effect of the censored observations is to reduce the number at risk without affecting the survival curve S(t).

Finally we plot the survival curve, as shown in Figure 12.1. The censored observations are shown as ticks on the line.

[Figure 12.1 Survival curve of 25 patients with Dukes’ C colorectal cancer treated with γ-linolenic acid. Horizontal axis: time (months); vertical axis: probability of survival.]

Log rank test

To compare two survival curves produced from two groups A and B, we use the rather curiously named log rank test,¹ so called because it can be shown to be related to a test that uses the logarithms of the ranks of the data.

The assumptions used in this test are as follows:

• The survival times are ordinal or continuous.
• The risk of an event in one group relative to the other does not change with time. Thus if γ-linolenic acid reduces the risk of death in patients with colorectal cancer, then this risk reduction does not change with time (the so called proportional hazards assumption).

We first order the data for the two groups combined, as shown in Table 12.3 for the patients with colorectal cancer. As for the Kaplan–Meier survival curve, we now consider each event in turn, starting at time t = 0. At each event (death) at time $t_i$ we consider the total number alive ($r_i$) and the total number still alive in group A ($r_{Ai}$) up to that point. If we had a total of $d_i$ events at time $t_i$, then, under the null hypothesis, we consider what proportion of these would have been expected in group A. Clearly, the more people at risk in one group the more deaths (under the null hypothesis) we would expect. Thus the expected number of events in A is

$$E_{Ai} = \frac{r_{Ai}\, d_i}{r_i}.$$

The effect of the censored observations is to reduce the numbers at risk, but they do not contribute to the expected numbers.

Finally, we add the total number of expected events in group A, $E_A = \sum E_{Ai}$. If the total number of expected events in group B is $E_B$, we can deduce $E_B$ from $E_B = n - E_A$, where $n$ is the total number of observed events in the two groups combined. We do not calculate the expected number beyond the last event, in this case at time 42 months. Also, we would stop calculating the expected values if any survival times greater than the point we were at were found in one group only.

Now, to test the null hypothesis of equal risk in the two groups we compute

$$X^2 = \frac{(O_A - E_A)^2}{E_A} + \frac{(O_B - E_B)^2}{E_B}$$

where $O_A$ and $O_B$ are the total number of events in groups A and B. We compare $X^2$ to a χ² distribution with one degree of freedom (one, because we have two groups and one constraint, namely that the total expected events must equal the total observed).

The calculation for the colorectal data is given in Table 12.3.
The first non-censored event occurs at 6 months, at which there are six events. By that time 46 patients are at risk, of whom 23 are in group A. Thus we would expect 6 × 23/46 = 3 to be in group A. At 8 months we have 46 − 6 = 40 patients at risk of whom 23 − 2 = 21 are in group A. There are two events, of which we would expect 2 × 21/40 = 1·05 to occur in group A.

The total expected number of events in A is $E_A$ = 11·3745. The total number of events is 22, $O_A$ = 10, $O_B$ = 12. Thus $E_B$ = 10·6255. Thus

$$X^2 = \frac{(10 - 11{\cdot}37)^2}{11{\cdot}37} + \frac{(12 - 10{\cdot}63)^2}{10{\cdot}63} = 0{\cdot}34.$$

We compare this with the χ² table given in Table C in the Appendix, to find that P > 0·10 (precise P = 0·56).

The relative risk can be estimated by $r = (O_A/E_A)/(O_B/E_B)$. The standard error of the log relative risk is given by⁴

$$SE(\log(r)) = \sqrt{1/E_A + 1/E_B}.$$

Thus we find r = 0·779, and so log(r) = −0·250. SE(log(r)) = 0·427, and so an approximate 95% confidence interval for log(r) is −1·087 to 0·587. A 95% confidence interval for r is therefore $e^{-1{\cdot}087}$ to $e^{0{\cdot}587}$, which is 0·34 to 1·80.

This would imply that γ-linolenic acid reduced mortality to about 78% of that in the control group (a reduction of about 22%), but with a very wide confidence interval. In view of the very small χ² statistic, we have little evidence that this result would not have arisen by chance.

### Further methods

In the same way that multiple regression is an extension of linear regression, an extension of the log rank test is called Cox regression. This was developed by DR Cox and allows for prognostic factors. It is beyond the scope of this book, but is described elsewhere.⁴⁻⁶
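The relative risk and its approximate 95% confidence interval from the log rank calculation above can be checked with a short Python sketch. The function name `log_rank_relative_risk` and the multiplier `z = 1.96` are assumptions made for this illustration. The inputs are the observed and expected totals quoted in the text, so results may differ in the third decimal place from values worked out with rounded figures.

```python
import math


def log_rank_relative_risk(o_a, e_a, o_b, e_b, z=1.96):
    """Relative risk r = (O_A/E_A)/(O_B/E_B) with an approximate 95% CI.

    The interval is built on the log scale using
    SE(log r) = sqrt(1/E_A + 1/E_B) and then transformed back.
    """
    r = (o_a / e_a) / (o_b / e_b)
    se_log_r = math.sqrt(1 / e_a + 1 / e_b)
    low = math.exp(math.log(r) - z * se_log_r)
    high = math.exp(math.log(r) + z * se_log_r)
    return r, se_log_r, low, high


# Observed and expected totals for the colorectal cancer data.
r, se, low, high = log_rank_relative_risk(10, 11.3745, 12, 10.6255)
print(f"r = {r:.3f}, SE(log r) = {se:.3f}, 95% CI {low:.2f} to {high:.2f}")

# r is a ratio of risks, so the estimated proportional reduction is 1 - r.
print(f"estimated reduction in risk: {100 * (1 - r):.0f}%")
```

Working on the log scale keeps the interval within positive values, which is why it is not symmetric about r.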

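Table 12.3 Calculation of log rank statistics for 49 patients randomly assigned to receive γ-linolenic acid (A) or control (B).

| Survival time (months) ($t_i$) | Group | Total at risk ($r$) | Number of events ($d_i$) | Total at risk in group A ($r_{Ai}$) | Expected number of events in A ($E_{Ai}$) |
|---|---|---|---|---|---|
| 0 | | 49 | | | |
| 1+ | A | 49 | 0 | 25 | 0 |
| 3+ | B | 48 | 0 | 24 | 0 |
| 5+ | A | 47 | 0 | 24 | 0 |
| 6 | A | 46 | 6 | 23 | 3·0 |
| 6 | A | | | | |
| 6 | B | | | | |
| 6 | B | | | | |
| 6 | B | | | | |
| 6 | B | | | | |
| 8 | B | 40 | 2 | 21 | 1·05 |
| 8 | B | | | | |
| 9+ | A | 38 | 0 | 21 | 0 |
| 10 | A | 37 | 2 | 20 | 1·0811 |
| 10 | A | | | | |
| 10+ | A | | | | |
| 12 | A | 34 | 6 | 17 | 3·0 |
| 12 | A | | | | |
| 12 | A | | | | |
| 12 | A | | | | |
| 12 | B | | | | |
| 12 | B | | | | |
| 12+ | A | | | | |
| 12+ | B | | | | |
| 13+ | A | 26 | 0 | 12 | 0 |
| 15+ | A | 25 | 0 | 11 | 0 |
| 15+ | B | 24 | 0 | 10 | 0 |
| 16+ | A | 23 | 0 | 10 | 0 |
| 16+ | B | 22 | 0 | 9 | 0 |
| 18+ | B | 21 | 0 | 9 | 0 |
| 18+ | B | | | | |
| 20 | B | 19 | 1 | 9 | 0·4736 |
| 20+ | A | | | | |
| 22+ | B | 17 | 0 | 8 | 0 |
| 24 | A | 16 | 2 | 8 | 1·0 |
| 24 | B | | | | |
| 24+ | A | | | | |
| 27+ | A | 13 | 0 | 6 | 0 |
| 28+ | B | 12 | 0 | 5 | 0 |
| 28+ | B | | | | |
| 28+ | B | | | | |
| 30 | B | 9 | 1 | 5 | 0·5555 |
| 30+ | B | | | | |
| 32 | A | 7 | 1 | 5 | 0·7143 |
| 33+ | B | 6 | 0 | 4 | 0 |
| 34+ | A | 5 | 0 | 4 | 0 |
| 36+ | A | 4 | 0 | 3 | 0 |
| 36+ | A | | | | |
| 42 | B | 2 | 1 | 1 | 0·50 |
| 44+ | A | | | | |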
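The working in Table 12.3 can be reproduced with the following Python sketch. It is an illustration only: the helper names `parse` and `log_rank`, and the `+` encoding for censored times, are assumptions made here. The P value uses the identity that, for one degree of freedom, the upper tail of the χ² distribution equals `erfc(sqrt(x/2))`, which avoids any dependency outside the standard library.

```python
import math
from collections import Counter

# Survival times (months) from Table 12.1; a trailing "+" marks censoring.
GROUP_A = ("1+ 5+ 6 6 9+ 10 10 10+ 12 12 12 12 12+ 13+ 15+ 16+ 20+ "
           "24 24+ 27+ 32 34+ 36+ 36+ 44+").split()        # gamma-linolenic acid
GROUP_B = ("3+ 6 6 6 6 8 8 12 12 12+ 15+ 16+ 18+ 18+ "
           "20 22+ 24 28+ 28+ 28+ 30 30+ 33+ 42").split()  # control


def parse(raw):
    """Return (all follow-up times, Counter of death times)."""
    times = [int(x.rstrip("+")) for x in raw]
    deaths = Counter(int(x) for x in raw if not x.endswith("+"))
    return times, deaths


def log_rank(raw_a, raw_b):
    """Return O_A, O_B, E_A, E_B, X^2 and the P value (1 d.f.)."""
    times_a, deaths_a = parse(raw_a)
    times_b, deaths_b = parse(raw_b)
    e_a = 0.0
    for t in sorted(set(deaths_a) | set(deaths_b)):
        r_a = sum(1 for u in times_a if u >= t)      # at risk in A at time t
        r = r_a + sum(1 for u in times_b if u >= t)  # at risk in both groups
        d = deaths_a[t] + deaths_b[t]                # events at time t
        e_a += d * r_a / r                           # E_Ai = r_Ai * d_i / r_i
    o_a, o_b = sum(deaths_a.values()), sum(deaths_b.values())
    e_b = (o_a + o_b) - e_a                          # E_B = n - E_A
    x2 = (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    p = math.erfc(math.sqrt(x2 / 2))                 # chi-squared tail, 1 d.f.
    return o_a, o_b, e_a, e_b, x2, p


o_a, o_b, e_a, e_b, x2, p = log_rank(GROUP_A, GROUP_B)
print(f"O_A = {o_a}, O_B = {o_b}, E_A = {e_a:.4f}, E_B = {e_b:.4f}")
print(f"X^2 = {x2:.2f}, P = {p:.2f}")
```

Because the sketch carries full precision, its expected total can differ from the table in the fourth decimal place, since some tabulated entries are shortened to four decimals.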

### Common questions

#### Do I need to test for a constant relative risk before doing the log rank test?

This is a similar problem to testing for Normality for a t test. The log rank test is quite “robust” against departures from proportional hazards, but care should be taken. If the Kaplan–Meier survival curves cross then this is a clear departure from proportional hazards, and the log rank test should not be used. This can happen, for example, in a two drug trial for cancer, if one drug is very toxic initially but produces more long term cures. In this case there is no simple answer to the question “is one drug better than the other?”, because the answer depends on the time scale.

#### If I don’t have any censored observations, do I need to use survival analysis?

Not necessarily; you could use a rank test such as the Mann–Whitney U test, but the survival method would yield an estimate of risk, which is often required, and lends itself to a useful way of displaying the data.

### Reading and reporting survival analysis

- Beware of reading too much into the right hand part of a Kaplan–Meier plot, unless you have data to show that there is still a reasonably sized sample there, since it may be based on very small numbers. It is useful to print the numbers at risk at intervals along the survival time axis.
- For survival curves with a high proportion of survivors, it is helpful to truncate the vertical axis, and not plot acres of white space.
- Relative risks are difficult to interpret, and for communication it can be helpful to choose a time (say, 2 years for the γ-linolenic acid data) and give the estimated proportions surviving at that point.
- The log rank test should be presented as “X² (log rank) = 0·34, d.f. = 1, P = 0·56. Estimated relative risk 0·78, 95% CI 0·34 to 1·80.”

### Exercises

#### Exercise 12.1

Twenty patients, ten of normal weight and ten severely overweight, underwent an exercise stress test, in which they had to lift a progressively increasing load for up to 12 minutes, but they were allowed to stop earlier if they could do no more. On two occasions the equipment failed before 12 minutes. The times (in minutes) achieved were:

- Normal weight: 4, 10, 12*, 2, 8, 12*, 8†, 6, 9, 12*
- Overweight: 7†, 5, 11, 6, 3, 9, 4, 1, 7, 12*

An asterisk denotes that the end of the test was reached, and a dagger denotes equipment failure. (I am grateful to C Osmond for these data.) What are the observed and expected values? What is the value of the log rank test to compare these groups?

#### Exercise 12.2

What is the risk of stopping in the normal weight group compared with the overweight group? Give a 95% confidence interval.

### References

1. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomised clinical trials requiring prolonged observation of each patient: II. Analysis and examples. Br J Cancer 1977;35:1–39.
2. McIllmurray MB, Turkie W. Controlled trial of γ-linolenic acid in Dukes’ C colorectal cancer. BMJ 1987;294:1260, 295:475.
3. Altman DG, Machin D, Bryant TN, Gardner MJ (eds). Statistics with confidence, 2nd ed. London: BMJ Books, 2000: Chapter 9.
4. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford: Blackwell Scientific Publications, 1994:477–81.
5. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
6. Campbell MJ. Statistics at square two. London: BMJ Books, 2001.

## 13 Study design and choosing a statistical test

### Design

In many ways the design of a study is more important than the analysis. A badly designed study can never be retrieved, whereas a poorly analysed one can usually be reanalysed.¹ Consideration of design is also important because the design of a study will govern how the data are to be analysed.

Most medical studies consider an input, which may be a medical intervention or exposure to a potentially toxic compound, and an output, which is some measure of health that the intervention is supposed to affect. The simplest way to categorise studies is with reference to the time sequence in which the input and output are studied.

The most powerful studies are prospective studies, and the paradigm for these is the randomised controlled trial. In this, subjects with a disease are randomised to one of two (or more) treatments, one of which may be a control treatment. Methods of randomisation have been described in Chapter 3. The importance of randomisation is that we know that in the long run (that is, with a large number of subjects) treatment groups will be balanced in known and unknown prognostic factors. However, with small studies the chances of imbalance in some important prognostic factors are quite high. Another important feature of randomisation is that one cannot predict in advance which treatment a patient will receive. This matters, because knowledge of likely treatment may affect whether the subject is recruited into the trial. It is important that the treatments are concurrent, that is, that the active and control treatments occur in the same period of time.
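The remark that small randomised studies can easily be imbalanced in a prognostic factor can be illustrated with an exact calculation. The sketch below is not from the original text. It assumes a simplified setting in which an even number of patients is split at random into two equal groups, and the function name `prob_imbalance` and the example numbers are hypothetical. Under that assumption the number of patients with the factor in one group follows a hypergeometric distribution.

```python
from math import comb


def prob_imbalance(n_per_group, n_with_factor, min_gap):
    """P(|count in group A - count in group B| >= min_gap) under random allocation.

    2 * n_per_group patients, of whom n_with_factor carry the prognostic
    factor, are split at random into two groups of equal size.
    """
    total = 2 * n_per_group
    denom = comb(total, n_per_group)
    p = 0.0
    for k in range(min(n_per_group, n_with_factor) + 1):  # k = carriers in group A
        gap = abs(k - (n_with_factor - k))
        if gap >= min_gap:
            # comb() returns 0 for impossible splits, so no extra checks are needed.
            ways = comb(n_with_factor, k) * comb(total - n_with_factor, n_per_group - k)
            p += ways / denom
    return p


# Hypothetical example: half the patients carry the factor, and "imbalance"
# means the two groups differ by at least 40% of the group size.
small = prob_imbalance(n_per_group=10, n_with_factor=10, min_gap=4)
large = prob_imbalance(n_per_group=100, n_with_factor=100, min_gap=40)
print(f"20 patients:  P(imbalance) = {small:.3f}")
print(f"200 patients: P(imbalance) = {large:.2e}")
```

With these made-up numbers the same proportional imbalance is fairly likely in the small trial and negligible in the large one, which is the sense in which balance holds only “in the long run”.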

To allow for the therapeutic effect of simply being given treatment, the control may consist of a placebo, an inert substance that is physically identical to the active compound. If possible a study should be double blinded, with neither the investigator nor the subject being aware of what treatment the subject is undergoing. Sometimes it is impossible to blind the subjects, for example when the treatment is some form of health education, but often it is possible to ensure that the people evaluating the outcome are unaware of the treatment.

A parallel group design is one in which treatment and control are allocated to different individuals. Examples of a parallel group trial are given in Table 7.1, in which different bran preparations have been tested on different individuals, Table 8.5 comparing different vaccines and Table 8.7 evaluating health promotion. A matched parallel design comes about when randomisation is between matched pairs, such as in Exercise 6.3, in which randomisation was between different parts of a patient’s body.

A crossover study is one in which two or more treatments are applied sequentially to the same subject. Clearly this type of study can only be considered for chronic conditions, where treatment is not expected to cure the patient and where withdrawal of treatment leads to a return to a baseline level which is relatively stable. The advantages are that each subject then acts as his or her own control and so fewer subjects may be required.
The main disadvantage is that there may be a carryover effect in that the action of the second treatment is affected by the first treatment. An example of a crossover trial is given in Table 7.2, in which different dosages of bran are compared within the same individual.

Another type of trial is a sequential trial, where the outcome is evaluated after each patient or a group of patients, rather than after all the patients have been gathered into the study. It is useful if patient recruitment is slow and the outcome evaluated quickly, as for levels of nausea after an anaesthetic. Cluster trials are trials where groups of patients, rather than individual patients, are randomised. They may occur in primary care, where general practitioners are randomised to different training packages, or public health, where different areas receive different health promotion campaigns. Further details on clinical trials are available in a number of excellent books.²⁻⁵

One of the major threats to validity of a clinical trial is compliance. Patients are likely to drop out of trials if the treatment is unpleasant, and often fail to take medication as prescribed. It is usual to adopt a pragmatic approach and analyse by intention to treat, that is, analyse the data by the treatment that the subject was assigned to, not the one they actually took. Of course, if the patients have dropped out of a trial, it may be impossible to get measurements from them, unless they can be obtained by proxy, such as whether the subject is alive or dead. The alternative to an intention to treat analysis is to analyse per protocol or on study. The trouble with this type of analysis is that we no longer have a randomised comparison and so comparisons may be biased. Dropouts should be reported by treatment group. Checklists for writing reports on clinical trials are available.⁶,⁷

A quasi-experimental design is one in which treatment allocation is not random. An example of this is given in Table 9.1 in which injuries are compared in two dropping zones. This is subject to potential biases in that the reason why a person is allocated to a particular dropping zone may be related to their risk of a sprained ankle. If all the novices jumped together in one dropping zone, this alone may explain the difference in injury rates.

A cohort study is one in which subjects, initially free of disease, are followed up over a period of time. Some will be exposed to some risk factor, for example cigarette smoking. The outcome may be death, and we may be interested in relating the risk factor to a particular cause of death. Clearly, these have to be large, long term studies and tend to be costly to carry out.
If records have been kept routinely in the past then a historical cohort study may be carried out, an example of which is the appendicitis study discussed in Chapter 6. Here, the cohort is all cases of appendicitis admitted over a given period, and a sample of the records could be inspected retrospectively. A typical example would be to look at birth weight records and relate birth weight to disease in later life.

These studies differ in essence from retrospective studies, which start with diseased subjects and then examine possible exposure. Such case–control studies are commonly undertaken as a preliminary investigation, because they are relatively quick and inexpensive. The comparison of the blood pressure in farmers and printers given in Chapter 3 is an example of a case–control study. It is retrospective because we argued from the blood pressure to the occupation and did not start out with subjects assigned to occupation. There are many confounding factors in case–control studies. For example, does occupational stress cause high blood pressure, or do people prone to high blood pressure choose stressful occupations?

A particular problem is recall bias, in that the cases, with the disease, are more motivated to recall apparently trivial episodes in the past than controls, who are disease free.

Cross-sectional studies are common and include surveys, laboratory experiments and studies to examine the prevalence of a disease. Studies validating instruments and questionnaires are also cross-sectional studies. The study of urinary concentration of lead in children described in Chapter 1 and the study of the relationship between height and pulmonary anatomical dead space in Chapter 11 are cross-sectional studies. The main problem of cross-sectional studies is ensuring the sample is representative of the population from which it is taken. Different types of sample are described in Chapter 3.

### Sample size

One of the most common questions asked of a statistician about design is the number of patients to include. It is an important question, because if a study is too small it will not be able to answer the question posed, and would be a waste of time and money. It could also be deemed unethical because patients may be put at risk with no apparent benefit. However, studies should not be too large because resources would be wasted if fewer patients would have sufficed. The sample size for a continuous outcome depends on four critical quantities: the type I and type II error rates α and β (discussed in Chapter 5), the variability of the data σ², and the effect size d. In a trial the effect size is the amount by which we would expect the two treatments to differ, or is the minimum difference that would be clinically worthwhile.
For a binary outcome we need to specify α and β, and proportions $P_1$ and $P_2$ where $P_1$ is the expected outcome under the control intervention and $P_1 - P_2$ is the minimum clinical difference which it is worthwhile detecting.

Usually α and β are fixed at 5% and 20% (or 10%), respectively. A simple formula for a two group parallel trial with a continuous outcome is that the required sample size per group is given by $n = 16\sigma^2/d^2$ for two sided α of 5% and β of 20%. For example, in a trial to reduce blood pressure, if a clinically worthwhile effect for diastolic blood pressure is 5 mmHg and the between subjects standard deviation is 10 mmHg, we would require n = 16 × 100/25 = 64 patients per group in the study. For binary data this becomes

$$n = \frac{8\,\bigl(P_1(1 - P_1) + P_2(1 - P_2)\bigr)}{(P_1 - P_2)^2}.$$

Thus suppose that in the PHVD trial of Chapter 2 the standard therapy resulted in 45% of children requiring a shunt or dying. We wished to reduce this to 35%. In the formula we express the percentages as proportions. Then we would require 8 × (0·45 × 0·55 + 0·35 × 0·65)/0·1² = 380 subjects per group to have an 80% chance of detecting the specified difference at 5% significance. The sample size is proportional to the square of the standard deviation of the data (the variance) and inversely proportional to the square of the effect size. Doubling the effect size reduces the sample size by a factor of 4: it is much easier to detect large effects! In practice, the sample size is often fixed by other criteria, such as finance or resources, and the formula is used to determine a realistic effect size. If this is too large, then the study will have to be abandoned or increased in size. Machin et al. give advice on sample size calculations for a wide variety of study designs.⁸ There are also a number of websites that allow free sample size calculations, for example http://www.stat.uiowa.edu/~rlenth/Power/index.html.

### Choice of test

In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?”. In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test.
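Returning briefly to the sample size formulae given in the Sample size section, the two worked examples can be checked with a small Python sketch. The function names are illustrative assumptions. The helper `exact_multiplier` is also an assumption rather than something stated in the text: it uses the usual Normal approximation, in which the constant is 2(z₁₋α/₂ + z₁₋β)², to suggest where the rounded multiplier of 16 (and 8 for binary data) plausibly comes from.

```python
from statistics import NormalDist


def n_continuous(sd, effect, multiplier=16):
    """Patients per group for a continuous outcome: n = 16 * sd^2 / effect^2."""
    return multiplier * sd ** 2 / effect ** 2


def n_binary(p1, p2, multiplier=8):
    """Patients per group for a binary outcome with proportions p1 and p2."""
    return multiplier * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2


def exact_multiplier(alpha=0.05, power=0.80):
    """Assumed origin of the '16': 2 * (z_{1 - alpha/2} + z_{power})^2."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2


print(n_continuous(sd=10, effect=5))        # blood pressure example
print(round(n_binary(0.45, 0.35)))          # PHVD example
print(round(exact_multiplier(), 2))         # compare with the rounded value 16
print(n_continuous(sd=10, effect=10))       # doubling the effect size
```

Passing a different `multiplier` gives the corresponding rule for other choices of α and β, for example `exact_multiplier(power=0.90)` for a β of 10%.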
It is important to decide a priori which hypotheses are confirmatory (that is, are testing some presupposed relationship), and which are exploratory (are suggested by the data). No single study can support a whole series of hypotheses.

A sensible plan is to limit severely the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as very tentative until confirmed by subsequent studies. A useful guide is to use a Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0·05/n. Thus if there were two independent hypotheses a result would be declared significant only if P < 0·025. Note that, since tests are rarely independent, this is a very conservative procedure, one unlikely to reject the null hypothesis.
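The Bonferroni rule described above amounts to one line of arithmetic, sketched here in Python. The function name `bonferroni_significant` and the two P values are hypothetical and serve only to show the rule in use.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Compare each P value with the Bonferroni-corrected level alpha / n."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical P values for two confirmatory hypotheses.
threshold, flags = bonferroni_significant([0.03, 0.01])
print(f"corrected significance level: {threshold}")
print(f"declared significant: {flags}")
```

In this made-up example a P value of 0·03 would pass the usual 5% level but not the corrected level of 0·025, which shows how conservative the procedure is.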

The investigator should then ask “are the data independent?”. This can be difficult to decide but, as a rule of thumb, results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case–control study in which the controls were matched to the cases by age, sex and social class, are not independent. It is generally true that the analysis should reflect the design, and so a matched design should be followed by a matched analysis. Results measured over time require special care.⁹ One of the most common mistakes in statistical analysis is to treat correlated variables as if they were independent. For example, suppose we were looking at treatment of leg ulcers, in which some people had an ulcer on each leg. We might have 20 subjects with 30 ulcers, but the number of independent pieces of information is 20 because the state of ulcers on each leg for one person may be influenced by the state of health of the person and an analysis that considered ulcers as independent observations would be incorrect. For a correct analysis of mixed paired and unpaired data, consult a statistician.

The next question is “what types of data are being measured?”. The test used should be determined by the data. The choice of test for matched or paired data is described in Table 13.1 and for independent data in Table 13.2.

It is helpful to decide the input variables and the outcome variables.
For example, in a clinical trial the input variable is the type of treatment (a nominal variable) and the outcome may be some clinical measure, perhaps Normally distributed. The required test is then the t test (Table 13.2). However, if the input variable is continuous, say a clinical score, and the outcome is nominal, say cured or not cured, logistic regression is the required analysis. A t test in this case may help but would not give us what we require, namely the probability of a cure for a given value of the clinical score. As another example, suppose we have a cross-sectional study in which we ask a random sample of people whether they think their general practitioner is doing a good job, on a five point scale, and we wish to ascertain whether women have a higher opinion of general practitioners than men have. The input variable is gender, which is nominal. The outcome variable is the five point ordinal scale. Each person’s opinion is independent of the others, so we have independent data. From Table 13.2 we should use a χ² test for trend, or a Mann–Whitney U test (with correction for ties). Note, however, that if some people share a general practitioner and others do not, then the data are not independent and a more sophisticated analysis is called for.

Note that Tables 13.1 and 13.2 should be considered as guides only, and each case should be considered on its merits.

Table 13.1 Choice of statistical test from paired or matched observations.

| Variable | Test |
|---|---|
| Nominal | McNemar’s test |
| Ordinal (ordered categories) | Wilcoxon |
| Quantitative (discrete or non-Normal) | Wilcoxon |
| Quantitative (Normal*) | Paired t test |

\* It is the difference between the paired observations that should be plausibly Normal.

Table 13.2 Choice of statistical test for independent observations.

| Input variable | Outcome: Nominal | Outcome: Categorical (>2 categories) | Outcome: Ordinal | Outcome: Quantitative discrete | Outcome: Quantitative non-Normal | Outcome: Quantitative Normal |
|---|---|---|---|---|---|---|
| Nominal | χ² or Fisher’s | χ² | χ² trend or Mann–Whitney | Mann–Whitney | Mann–Whitney or log rankᵃ | Student’s t test |
| Categorical (>2 categories) | χ² | χ² | Kruskal–Wallisᵇ | Kruskal–Wallisᵇ | Kruskal–Wallisᵇ | Analysis of varianceᶜ |
| Ordinal (ordered categories) | χ² trend or Mann–Whitney | ᵉ | Spearman rank | Spearman rank | Spearman rank | Spearman rank or linear regressionᵈ |
| Quantitative discrete | Logistic regression | ᵉ | ᵉ | Spearman rank | Spearman rank | Spearman rank or linear regressionᵈ |
| Quantitative non-Normal | Logistic regression | ᵉ | ᵉ | ᵉ | Plot data and Pearson or Spearman rank | Plot data and Pearson or Spearman rank and linear regression |
| Quantitative Normal | Logistic regression | ᵉ | ᵉ | ᵉ | Linear regressionᵈ | Pearson and linear regression |

ᵃ If data are censored.

ᵇ The Kruskal–Wallis test is used for comparing ordinal or non-Normal variables for more than two groups, and is a generalisation of the Mann–Whitney U test. The technique is beyond the scope of this book, but is described in more advanced books¹⁰⁻¹² and is available in common software (Epi-Info, Minitab, SPSS).

ᶜ Analysis of variance is a general technique; one version (one way analysis of variance) is used to compare Normally distributed variables for more than two groups, and is the parametric equivalent of the Kruskal–Wallis test.

ᵈ If the outcome variable is the dependent variable, then provided the residuals (see Chapter 11) are plausibly Normal, then the distribution of the independent variable is not important.

ᵉ There are a number of more advanced techniques, such as Poisson regression, for dealing with these situations.¹² However, they require certain assumptions and it is often easier to either dichotomise the outcome variable or treat it as continuous.

### Reading and reporting on the design of a study

- There should always be a succinct statement in the abstract of a report stating the type of study, for example ‘a case–control study of 50 cases and 60 controls’.
- The sample size should be justified by a power-based statement in the methods section. Note that this should be written down before the study is carried out. Retrospective power calculations, saying what the power would have been, are not helpful; use a confidence interval instead.
- It should be possible to assign a statistical method to each P value quoted.

### Exercises

State the type of study described in each of the following.

#### Exercise 13.1

To investigate the relationship between egg consumption and heart disease, a group of patients admitted to hospital with myocardial infarction were questioned about their egg consumption. A group of patients matched by age and sex admitted to a fracture clinic were also questioned about their egg consumption using an identical protocol.

#### Exercise 13.2

To investigate the relationship between certain solvents and cancer, all employees at a factory were questioned about their exposure to an industrial solvent, and the amount and length of exposure measured. These subjects were regularly monitored, and after 10 years a copy of the death certificate for all those who had died was obtained.

#### Exercise 13.3

A survey was conducted of all nurses employed at a particular hospital. Among other questions, the questionnaire asked about the grade of the nurse and whether she was satisfied with her career prospects.

#### Exercise 13.4

To evaluate a new back school, patients with lower back pain were randomly allocated to either the new school or to conventional occupational therapy. After 3 months they were questioned about their back pain, and observed lifting a weight by independent monitors.

#### Exercise 13.5

A new triage system has been set up at the local accident and emergency unit. To evaluate it the waiting times of patients were measured for 6 months and compared with the waiting times at a comparable nearby hospital.

### References

1. Campbell MJ, Machin D. Medical statistics: A common-sense approach, 3rd ed. Chichester: Wiley, 1999.
2. Pocock SJ. Clinical trials: A practical approach. Chichester: Wiley, 1983.
3. Senn SJ. The design and analysis of cross-over trials. Chichester: Wiley, 1992.
4. Whitehead J. The design and analysis of sequential clinical trials, revised 2nd ed. Chichester: Wiley, 1997.
5. Donner A, Klar N. Design and analysis of cluster randomised trials in health research. London: Arnold, 2000.
6. Gardner MJ, Machin D, Campbell MJ, Altman DG. Statistical checklists. In: Altman DG, Machin D, Bryant TN, Gardner MJ (eds), Statistics with confidence. London: BMJ Books, 2000.
7. Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001;285:1987–91.
8. Machin D, Campbell MJ, Fayers P, Pinol A. Statistical tables for the design of clinical studies, 2nd ed. Oxford: Blackwell Scientific Publications, 1997.
9. Matthews JNS, Altman DG, Campbell MJ, Royston JP. Analysis of serial measurements in medical research. BMJ 1990;300:230–5.
10. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
11. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford: Blackwell Scientific Publications, 1994.
12. Campbell MJ. Statistics at square two. London: BMJ Books, 2001.

Answers to exercises

1.1 Median 0·71, range 0·10 to 1·24, first quartile 0·535, third quartile 0·84 µmol/24 h.
2.1 Mean = 2·41, SD = 1·27. No, because the mean is less than 2 SDs and data are positive.
2.2 Mean = 0·697 µmol/24 h, SD = 0·2410 µmol/24 h, range 0·215 to 1·179 µmol/24 h. Points 0·10 and 1·24. 2/40 or 5%.
2.3 Relative risk 1·79, odds ratio 2·09.
3.1 SE(mean) = 0·074 µmol/24 h.
3.2 A uniform or flat distribution. Population mean 4·5, population SD 2·87.
3.3 The distribution will be approximately Normal, mean 4·5 and SD 2·87/√5 = 1·28.
4.1 The reference range is 12·26 to 57·74, and so the observed value of 52 is included in it.
4.2 95% confidence interval 32·73 to 37·27.
5.1 0·42 g/dl, z = 3·08, 0·001 < P < 0·01, difference = 1·3 g/dl, 95% CI 0·48 to 2·12 g/dl.
5.2 0·23 g/dl, P < 0·001.
6.1 SE (percentage) = 2·1%, SE (difference) = 3·7%, difference = 3·4%. 95% CI −3·9 to 10·7%, z = 0·94, P = 0·35.
6.2 Difference in proportions 0·193, 95% CI 0·033 to 0·348.
6.3 Yes, the traditional remedy, z = 2·2, P = 0·028.
6.4 OR = 4·89, 95% CI 4·00 to 5·99.
7.1 37·5 to 40·5 KA units.
7.2 t = 2·652, d.f. = 17, 0·01 < P < 0·02 (P = 0·017).
7.3 0·56 g/dl, t = 1·243, d.f. = 20, 0·1

8.1 Standard χ² = 3·295, d.f. = 4, P = 0·51. Trend χ² = 2·25, d.f. = 1, P = 0·13.
8.2 X² = 3·916, d.f. = 1, 0·02 < P < 0·05 (P = 0·048), difference in rates 9%, 95% CI 0·3 to 17·9%.
8.3 X² = 0·931, d.f. = 1, 0·1 < P < 0·5 (P = 0·33), difference in rates 15%, 95% CI −7·7 to 38%.
8.4 X² = 8·949, d.f. = 3, 0·02 < P < 0·05 (P = 0·03). Yes, practice C; if this is omitted the remaining practices give X² = 0·241, d.f. = 2, P = 0·89. (Both χ² tests by quick method.)
8.5 X² = 5·685, d.f. = 1, P = 0·017. This is statistically significant and the CI in (6.2) does not include zero.
8.6 X² = 285·96, d.f. = 1, P < 0·001. Highly significant and the CI in (6.4) is a long way from 1.
8.7 X² = 0·6995, d.f. = 1, P = 0·40. The z value was 0·83 and 0·83 × 0·83 = 0·6889, which is the same within the limits of rounding.
9.1 Sickness rate in first department = 28%, in second department 8%, difference 20% (approximate 95% CI = −6 to 45%, P = 0·24 (Fisher's exact test mid P)). P is calculated from 2 × (0·5 × 0·173 + 0·031).
10.1 Smaller total = −24. No.
10.2 Mann–Whitney statistic = 74. The group on the new remedy. No.
11.1 r = −0·848.
11.2 r_s = −0·865.
11.3 y = 36·1 − 2·34x. This means that, on average, for every 1 mile increase in mean distance the attendance rate drops by 2·34%. This can be safely accepted only within the area measured here.
11.4 SE = 0·39, 95% CI = −2·34 − 2·145 × 0·39 to −2·34 + 2·145 × 0·39 = −3·2 to −1·5%.
12.1 O_A = 6, E_A = 8·06, O_B = 8, E_B = 5·94. Log rank X² = 1·24, d.f. = 1, 0·1 < P < 0·5 (P = 0·27).
12.2 Risk = 0·55, 95% CI 0·19 to 1·60.
13.1 Matched case–control study.
13.2 Cohort study.
13.3 Cross-sectional study.
13.4 Randomised controlled trial.
13.5 Quasi-experimental design.
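Several of the answers above rest on short calculations that are easy to re-check. The sketch below redoes the arithmetic quoted in answers 3.2, 3.3, 8.7, 9.1 and 11.4 using only the Python standard library. The variable names are illustrative and not from the book; the reading of 0·173 and 0·031 in answer 9.1 as "probability of the observed table" and "probability of more extreme tables" is an assumption, and the multiplier 2·145 in answer 11.4 is taken as quoted (it matches the Table B entry for 14 d.f. at P = 0·05).

```python
import math
import statistics

# Answer 3.2: the digits 0-9, each equally likely, form a flat population.
digits = list(range(10))
pop_mean = statistics.mean(digits)     # 4.5
pop_sd = statistics.pstdev(digits)     # population SD, about 2.87

# Answer 3.3: SD of the means of samples of 5 digits (SD / sqrt(n)).
se_mean = 2.87 / math.sqrt(5)          # about 1.28

# Answer 8.7: z squared, to compare with the chi-squared statistic 0.6995.
z_squared = 0.83 * 0.83                # about 0.6889

# Answer 9.1: two sided mid P, as written in the answer.
# Assumption: 0.173 is the probability of the observed table and
# 0.031 the total probability of the more extreme tables.
mid_p = 2 * (0.5 * 0.173 + 0.031)      # about 0.24

# Answer 11.4: 95% CI for the slope, slope +/- t * SE.
slope, se_slope, t_multiplier = -2.34, 0.39, 2.145
ci_lower = slope - t_multiplier * se_slope
ci_upper = slope + t_multiplier * se_slope

print(f"3.2  mean = {pop_mean}, SD = {pop_sd:.2f}")
print(f"3.3  SD of sample means = {se_mean:.2f}")
print(f"8.7  z squared = {z_squared:.4f}")
print(f"9.1  mid P = {mid_p:.2f}")
print(f"11.4 95% CI = {ci_lower:.1f} to {ci_upper:.1f}")
```

Each printed value should agree with the corresponding answer to the number of decimal places the book reports.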

Appendix

[Figure: Normal distribution curve with tail areas of P/2 beyond −z and z]

Table A Probabilities related to multiples of standard deviations for a normal distribution

| Number of standard deviations (z) | Probability of getting an observation at least as far from the mean (two sided P) |
|---|---|
| 0·0 | 1·00 |
| 0·1 | 0·92 |
| 0·2 | 0·84 |
| 0·3 | 0·76 |
| 0·4 | 0·69 |
| 0·5 | 0·62 |
| 0·6 | 0·55 |
| 0·674 | 0·500 |
| 0·7 | 0·48 |
| 0·8 | 0·42 |
| 0·9 | 0·37 |
| 1·0 | 0·31 |
| 1·1 | 0·27 |
| 1·2 | 0·23 |
| 1·3 | 0·19 |
| 1·4 | 0·16 |
| 1·5 | 0·13 |
| 1·6 | 0·11 |
| 1·645 | 0·100 |
| 1·7 | 0·089 |
| 1·8 | 0·072 |
| 1·9 | 0·057 |
| 1·96 | 0·050 |
| 2·0 | 0·045 |
| 2·1 | 0·036 |
| 2·2 | 0·028 |
| 2·3 | 0·021 |
| 2·4 | 0·016 |
| 2·5 | 0·012 |
| 2·576 | 0·010 |
| 3·0 | 0·0027 |
| 3·291 | 0·0010 |
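The entries of Table A can be recomputed from the standard Normal distribution. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the function name `two_sided_p` is illustrative and not from the book. Computed values may differ from the printed table in the last digit because of rounding.

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Probability of an observation at least |z| SDs from the mean, in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Recompute the rows of Table A that correspond to the usual significance levels.
for z in (0.674, 1.645, 1.96, 2.576, 3.291):
    print(f"z = {z:<5}  two sided P = {two_sided_p(z):.4f}")
```

For example, `two_sided_p(1.96)` gives the familiar 0·05 to three decimal places.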

Table B Distribution of t (two tailed)

| d.f. | P = 0·5 | P = 0·1 | P = 0·05 | P = 0·02 | P = 0·01 | P = 0·001 |
|---|---|---|---|---|---|---|
| 1 | 1·000 | 6·314 | 12·706 | 31·821 | 63·657 | 636·619 |
| 2 | 0·816 | 2·920 | 4·303 | 6·965 | 9·925 | 31·598 |
| 3 | 0·765 | 2·353 | 3·182 | 4·541 | 5·841 | 12·941 |
| 4 | 0·741 | 2·132 | 2·776 | 3·747 | 4·604 | 8·610 |
| 5 | 0·727 | 2·015 | 2·571 | 3·365 | 4·032 | 6·859 |
| 6 | 0·718 | 1·943 | 2·447 | 3·143 | 3·707 | 5·959 |
| 7 | 0·711 | 1·895 | 2·365 | 2·998 | 3·499 | 5·405 |
| 8 | 0·706 | 1·860 | 2·306 | 2·896 | 3·355 | 5·041 |
| 9 | 0·703 | 1·833 | 2·262 | 2·821 | 3·250 | 4·781 |
| 10 | 0·700 | 1·812 | 2·228 | 2·764 | 3·169 | 4·587 |
| 11 | 0·697 | 1·796 | 2·201 | 2·718 | 3·106 | 4·437 |
| 12 | 0·695 | 1·782 | 2·179 | 2·681 | 3·055 | 4·318 |
| 13 | 0·694 | 1·771 | 2·160 | 2·650 | 3·012 | 4·221 |
| 14 | 0·692 | 1·761 | 2·145 | 2·624 | 2·977 | 4·140 |
| 15 | 0·691 | 1·753 | 2·131 | 2·602 | 2·947 | 4·073 |
| 16 | 0·690 | 1·746 | 2·120 | 2·583 | 2·921 | 4·015 |
| 17 | 0·689 | 1·740 | 2·110 | 2·567 | 2·898 | 3·965 |
| 18 | 0·688 | 1·734 | 2·101 | 2·552 | 2·878 | 3·922 |
| 19 | 0·688 | 1·729 | 2·093 | 2·539 | 2·861 | 3·883 |
| 20 | 0·687 | 1·725 | 2·086 | 2·528 | 2·845 | 3·850 |
| 21 | 0·686 | 1·721 | 2·080 | 2·518 | 2·831 | 3·819 |
| 22 | 0·686 | 1·717 | 2·074 | 2·508 | 2·819 | 3·792 |
| 23 | 0·685 | 1·714 | 2·069 | 2·500 | 2·807 | 3·767 |
| 24 | 0·685 | 1·711 | 2·064 | 2·492 | 2·797 | 3·745 |
| 25 | 0·684 | 1·708 | 2·060 | 2·485 | 2·787 | 3·725 |
| 26 | 0·684 | 1·706 | 2·056 | 2·479 | 2·779 | 3·707 |
| 27 | 0·684 | 1·703 | 2·052 | 2·473 | 2·771 | 3·690 |
| 28 | 0·683 | 1·701 | 2·048 | 2·467 | 2·763 | 3·674 |
| 29 | 0·683 | 1·699 | 2·045 | 2·462 | 2·756 | 3·659 |
| 30 | 0·683 | 1·697 | 2·042 | 2·457 | 2·750 | 3·646 |
| 40 | 0·681 | 1·684 | 2·021 | 2·423 | 2·704 | 3·551 |
| 60 | 0·679 | 1·671 | 2·000 | 2·390 | 2·660 | 3·460 |
| 120 | 0·677 | 1·658 | 1·980 | 2·358 | 2·617 | 3·373 |
| ∞ | 0·674 | 1·645 | 1·960 | 2·326 | 2·576 | 3·291 |

Adapted by permission of the authors and publishers from Table III of Fisher and Yates: Statistical Tables for Biological, Agricultural and Medical Research, published by Longman Group Ltd., London (previously published by Oliver & Boyd, Edinburgh).

Table C Distribution of χ²

| d.f. | P = 0·50 | P = 0·10 | P = 0·05 | P = 0·02 | P = 0·01 | P = 0·001 |
|---|---|---|---|---|---|---|
| 1 | 0·455 | 2·706 | 3·841 | 5·412 | 6·635 | 10·827 |
| 2 | 1·386 | 4·605 | 5·991 | 7·824 | 9·210 | 13·815 |
| 3 | 2·366 | 6·251 | 7·815 | 9·837 | 11·345 | 16·268 |
| 4 | 3·357 | 7·779 | 9·488 | 11·668 | 13·277 | 18·465 |
| 5 | 4·351 | 9·236 | 11·070 | 13·388 | 15·086 | 20·517 |
| 6 | 5·348 | 10·645 | 12·592 | 15·033 | 16·812 | 22·457 |
| 7 | 6·346 | 12·017 | 14·067 | 16·622 | 18·475 | 24·322 |
| 8 | 7·344 | 13·362 | 15·507 | 18·168 | 20·090 | 26·125 |
| 9 | 8·343 | 14·684 | 16·919 | 19·679 | 21·666 | 27·877 |
| 10 | 9·342 | 15·987 | 18·307 | 21·161 | 23·209 | 29·588 |
| 11 | 10·341 | 17·275 | 19·675 | 22·618 | 24·725 | 31·264 |
| 12 | 11·340 | 18·549 | 21·026 | 24·054 | 26·217 | 32·909 |
| 13 | 12·340 | 19·812 | 22·362 | 25·472 | 27·688 | 34·528 |
| 14 | 13·339 | 21·064 | 23·685 | 26·873 | 29·141 | 36·123 |
| 15 | 14·339 | 22·307 | 24·996 | 28·259 | 30·578 | 37·697 |
| 16 | 15·338 | 23·542 | 26·296 | 29·633 | 32·000 | 39·252 |
| 17 | 16·338 | 24·769 | 27·587 | 30·995 | 33·409 | 40·790 |
| 18 | 17·338 | 25·989 | 28·869 | 32·346 | 34·805 | 42·312 |
| 19 | 18·338 | 27·204 | 30·144 | 33·687 | 36·191 | 43·820 |
| 20 | 19·337 | 28·412 | 31·410 | 35·020 | 37·566 | 45·315 |
| 21 | 20·337 | 29·615 | 32·671 | 36·343 | 38·932 | 46·797 |
| 22 | 21·337 | 30·813 | 33·924 | 37·659 | 40·289 | 48·268 |
| 23 | 22·337 | 32·007 | 35·172 | 38·968 | 41·638 | 49·728 |
| 24 | 23·337 | 33·196 | 36·415 | 40·270 | 42·980 | 51·179 |
| 25 | 24·337 | 34·382 | 37·652 | 41·566 | 44·314 | 52·620 |
| 26 | 25·336 | 35·563 | 38·885 | 42·856 | 45·642 | 54·052 |
| 27 | 26·336 | 36·741 | 40·113 | 44·140 | 46·963 | 55·476 |
| 28 | 27·336 | 37·916 | 41·337 | 45·419 | 48·278 | 56·893 |
| 29 | 28·336 | 39·087 | 42·557 | 46·693 | 49·588 | 58·302 |
| 30 | 29·336 | 40·256 | 43·773 | 47·962 | 50·892 | 59·703 |

Adapted by permission of the authors and publishers from Table IV of Fisher and Yates: Statistical Tables for Biological, Agricultural and Medical Research, published by Longman Group Ltd., London (previously published by Oliver & Boyd, Edinburgh).
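For 1 and 2 degrees of freedom the tail probability of χ² has a simple closed form, so the first two rows of Table C (and the exact P values quoted in the answers to Chapter 8) can be checked without special software. This is a sketch under that restriction only: it does not cover d.f. of 3 or more, and the function names are illustrative, not from the book. With 1 d.f. the statistic is the square of a standard Normal deviate, so P equals the two sided Normal probability of its square root; with 2 d.f. the tail probability is exp(−χ²/2).

```python
import math


def chi2_p_1df(x2: float) -> float:
    """Upper tail probability of chi-squared with 1 d.f. (two sided Normal P of sqrt(x2))."""
    return math.erfc(math.sqrt(x2 / 2))


def chi2_p_2df(x2: float) -> float:
    """Upper tail probability of chi-squared with 2 d.f."""
    return math.exp(-x2 / 2)


# Table C critical values at P = 0.05
print(f"{chi2_p_1df(3.841):.3f}")   # d.f. = 1
print(f"{chi2_p_2df(5.991):.3f}")   # d.f. = 2

# Statistics quoted in answers 8.2 and 8.4 (second test, d.f. = 2)
print(f"{chi2_p_1df(3.916):.3f}")
print(f"{chi2_p_2df(0.241):.2f}")
```

The last two lines should reproduce the P values 0·048 and 0·89 given in answers 8.2 and 8.4.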

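Table D Wilcoxon test on paired samples: 5% and 1% levels of P

| Number of pairs | 5% Level | 1% Level |
|---|---|---|
| 7 | 2 | 0 |
| 8 | 2 | 0 |
| 9 | 6 | 2 |
| 10 | 8 | 3 |
| 11 | 11 | 5 |
| 12 | 14 | 7 |
| 13 | 17 | 10 |
| 14 | 21 | 13 |
| 15 | 25 | 16 |
| 16 | 30 | 19 |

Reprinted (slightly abbreviated) by permission of the publisher from Statistical Methods, 6th edition, by George W Snedecor and William G Cochran (copyright), 1967, Iowa State University Press, Ames, Iowa, USA.

Table E Mann–Whitney test on unpaired samples: 5% and 1% levels of P

5% critical points of rank sums

| n₂ ↓ / n₁ → | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | | | 10 |
| 5 | | 6 | 11 | 17 |
| 6 | | 7 | 12 | 18 | 26 |
| 7 | | 7 | 13 | 20 | 27 | 36 |
| 8 | 3 | 8 | 14 | 21 | 29 | 38 | 49 |
| 9 | 3 | 8 | 15 | 22 | 31 | 40 | 51 | 63 |
| 10 | 3 | 9 | 15 | 23 | 32 | 42 | 53 | 65 | 78 |
| 11 | 4 | 9 | 16 | 24 | 34 | 44 | 55 | 68 | 81 | 96 |
| 12 | 4 | 10 | 17 | 26 | 35 | 46 | 58 | 71 | 85 | 99 | 115 |
| 13 | 4 | 10 | 18 | 27 | 37 | 48 | 60 | 73 | 88 | 103 | 119 | 137 |
| 14 | 4 | 11 | 19 | 28 | 38 | 50 | 63 | 76 | 91 | 106 | 123 | 141 | 160 |
| 15 | 4 | 11 | 20 | 29 | 40 | 52 | 65 | 79 | 94 | 110 | 127 | 145 | 164 | 185 |
| 16 | 4 | 12 | 21 | 31 | 42 | 54 | 67 | 82 | 97 | 114 | 131 | 150 | 169 |
| 17 | 5 | 12 | 21 | 32 | 43 | 56 | 70 | 84 | 100 | 117 | 135 | 154 |
| 18 | 5 | 13 | 22 | 33 | 45 | 58 | 72 | 87 | 103 | 121 | 139 |
| 19 | 5 | 13 | 23 | 34 | 46 | 60 | 74 | 90 | 107 | 124 |
| 20 | 5 | 14 | 24 | 35 | 48 | 62 | 77 | 93 | 110 |
| 21 | 6 | 14 | 25 | 37 | 50 | 64 | 79 | 95 |
| 22 | 6 | 15 | 26 | 38 | 51 | 66 | 82 |
| 23 | 6 | 15 | 27 | 39 | 53 | 68 |
| 24 | 6 | 16 | 28 | 40 | 55 |
| 25 | 6 | 16 | 28 | 42 |
| 26 | 7 | 17 | 29 |
| 27 | 7 | 17 |
| 28 | 7 |

1% critical points of rank sums

| n₂ ↓ / n₁ → | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | | | | 15 |
| 6 | | | 10 | 16 | 23 |
| 7 | | | 10 | 17 | 24 | 32 |
| 8 | | | 11 | 17 | 25 | 34 | 43 |
| 9 | | 6 | 11 | 18 | 26 | 35 | 45 | 56 |
| 10 | | 6 | 12 | 19 | 27 | 37 | 47 | 58 | 71 |
| 11 | | 6 | 12 | 20 | 28 | 38 | 49 | 61 | 74 | 87 |
| 12 | | 7 | 13 | 21 | 30 | 40 | 51 | 63 | 76 | 90 | 106 |
| 13 | | 7 | 14 | 22 | 31 | 41 | 53 | 65 | 79 | 93 | 109 | 125 |
| 14 | | 7 | 14 | 22 | 32 | 43 | 54 | 67 | 81 | 96 | 112 | 129 | 147 |
| 15 | | 8 | 15 | 23 | 33 | 44 | 56 | 70 | 84 | 99 | 115 | 133 | 151 | 171 |
| 16 | | 8 | 15 | 24 | 34 | 46 | 58 | 72 | 86 | 102 | 119 | 137 | 155 |
| 17 | | 8 | 16 | 25 | 36 | 47 | 60 | 74 | 89 | 105 | 122 | 140 |
| 18 | | 8 | 16 | 26 | 37 | 49 | 62 | 76 | 92 | 108 | 125 |
| 19 | 3 | 9 | 17 | 27 | 38 | 50 | 64 | 78 | 94 | 111 |
| 20 | 3 | 9 | 18 | 28 | 39 | 52 | 66 | 81 | 97 |
| 21 | 3 | 9 | 18 | 29 | 40 | 53 | 68 | 83 |
| 22 | 3 | 10 | 19 | 29 | 42 | 55 | 70 |
| 23 | 3 | 10 | 19 | 30 | 43 | 57 |
| 24 | 3 | 10 | 20 | 31 | 44 |
| 25 | 3 | 11 | 20 | 32 |
| 26 | 3 | 11 | 21 |
| 27 | 4 | 11 |
| 28 | 4 |

n₁ and n₂ are the numbers of cases in the two groups. If the groups are unequal in size, n₁ refers to the smaller.

Reproduced by permission of the author and publisher from White C, Biometrics 1952;8:33.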
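To use Table E, the observations from both groups are ranked together and the ranks belonging to the smaller group are summed. The sketch below illustrates that calculation on made-up data. Several things here are assumptions rather than the book's own procedure: the data are hypothetical, tied values are given average ranks, the smaller of the two possible rank totals (T and n₁(n₁ + n₂ + 1) − T) is the one compared with the table, and the critical value 12 is the 5% entry read from Table E for n₁ = 4, n₂ = 6.

```python
def rank_sum_smaller_group(group_a, group_b):
    """Return (n1, n2, T): group sizes (n1 the smaller) and the smaller rank total."""
    smaller, larger = sorted((list(group_a), list(group_b)), key=len)
    pooled = sorted(smaller + larger)

    def average_rank(value):
        # Rank of a value in the pooled data; tied values share the average rank.
        first = pooled.index(value) + 1
        ties = pooled.count(value)
        return first + (ties - 1) / 2

    n1, n2 = len(smaller), len(larger)
    total = sum(average_rank(v) for v in smaller)
    other_total = n1 * (n1 + n2 + 1) - total   # rank total counted from the other end
    return n1, n2, min(total, other_total)


# Hypothetical measurements, for illustration only.
new_remedy = [1.2, 3.4, 2.2, 5.1]
old_remedy = [6.0, 7.1, 4.8, 8.3, 9.0, 5.5]

n1, n2, rank_total = rank_sum_smaller_group(new_remedy, old_remedy)
CRITICAL_5_PERCENT = 12   # assumed reading of Table E (5%) for n1 = 4, n2 = 6

print(f"n1 = {n1}, n2 = {n2}, rank total = {rank_total}")
print("significant at 5%" if rank_total <= CRITICAL_5_PERCENT else "not significant at 5%")
```

A rank total at or below the tabulated critical point is taken as significant at that level, which is the case for these illustrative data.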

Table F Random numbers

35368 65415 14425 97294 44734 54870 84495 39332 72708 52000 02219 86130 30264 56203 26518
93023 53965 19527 72819 42973 38037 37056 13200 09831 41367 40828 25938 05655 99010 88115
92226 65530 10966 29733 73902 19009 74733 68041 83166 92796 64846 79200 38776 09312 72234
15542 85361 44069 61445 82994 45169 79458 52221 37132 67125 62700 83475 99850 31670 50750
96424 65745 74877 48473 54281 67837 11167 74898 83136 10498 10660 65810 16373 80382 21874
17946 97751 54049 83077 03256 51947 88278 23891 53495 07101 95811 73035 83017 18532 59650
71495 36712 01513 30802 47228 52799 97961 82519 22756 69151 09052 38681 38858 38807 02422
16762 98574 78301 62647 29247 22936 62778 56694 70597 48880 33162 76138 97425 78283 42063
37969 66660 77823 54923 75832 99974 13868 94446 99521 44775 76649 00502 73424 21068 87880
25471 88920 39906 81436 70910 02631 93238 41952 87493 33559 64733 24688 78583 31506 24845
68507 79643 15204 84794 60093 29874 61851 05751 21960 70131 42137 73723 19252 23912 77751
67385 88293 46249 53036 47309 68803 15155 28222 06764 92367 25490 18494 42546 75268 05988
58948 40572 79817 40486 40494 20843 07388 74732 71655 17445 28489 84528 93922 67324 59120
70476 23299 17965 93629 28988 82399 81811 86373 91600 99962 28784 77326 24912 81992 66011
72887 41730 95940 54210 58480 96724 41954 91803 43078 85644 50014 93038 56037 79787 10707
70205 26256 91417 78629 16268 47156 32065 54588 74250 24739 04128 53966 74106 70159 80428
78883 36361 28182 51842 61426 27799 75951 58854 77236 04606 26949 56428 28495 41766 50059
89970 55101 66660 36953 02774 45020 54988 19226 44811 96941 70693 68847 07633 22289 94290
34382 04274 02116 37857 72075 90908 56584 67907 15075 63216 49006 24748 34289 55142 91206
16999 91140 64818 23018 09217 46068 32467 63844 72589 35456 44840 90800 50692 33298 74323
16329 39676 37510 35590 45888 77371 58301 79434 17500 48320 08953 18242 15133 24137 07323
31983 83436 93006 12640 00403 91457 62602 12245 27670 61492 89166 69421 79505 47104 50817
92780 80153 81458 82215 71536 03586 44007 85679 68186 85375 15373 57441 10034 74455 18466
70834 75678 78777 79731 06046 02386 18059 89623 65480 69345 49447 10358 74307 68861 87853
10100 85365 77687 36241 87563 06298 81828 40194 30647 36237 17793 50680 63701 39522 86006
84265 60501 17148 13657 40775 64773 62103 16356 99405 08598 81881 62732 36765 11895 63933
74041 62109 30831 62133 29462 30144 62081 79158 09737 72614 74806 25554 50911 43289 30344
02882 45141 58967 19688 48208 65679 18296 19080 03529 46017 33799 45518 31075 39740 93387
67647 56443 57816 49471 23525 76582 30085 90312 07397 42747 04242 58569 80087 45598 34374
99668 68326 47357 94812 65654 01097 55260 80990 46748 06416 93919 64520 54666 82278 59328
12013 30983 00370 40243 44457 18279 69740 39061 00548 21321 11249 48478 14917 26056 89506
55581 69068 66561 75671 07363 22939 93007 45319 48358 27534 60873 51076 20823 28185 49038
74957 53949 40414 15035 90232 28946 78073 75923 43081 16030 32935 30947 64395 03271 21345
65073 60950 92314 02037 82817 33518 49680 20095 51301 91889 78488 75298 29067 11355 69994
05110 83292 51335 64460 37648 72915 99688 62628 41297 36039 04436 82738 76614 55630 35803
54053 98104 12386 15646 89759 55889 14513 96192 19957 06186 40853 38011 97401 04047 66722
52351 72086 70257 83693 62924 79060 79683 03143 10627 45371 78404 50185 67515 65094 91111
10759 18901 07590 07727 37140 95782 41994 71688 72341 73665 66833 14138 20949 91852 42847
67322 87517 27043 12936 81043 27338 81679 88420 28220 65441 55517 96640 60178 84161 64239
37634 07842 34936 26836 48230 52786 01114 61335 39149 34268 70089 93491 91616 22522 06577
90556 62996 52252 42541 12781 40917 41661 96994 88818 93137 45130 34502 40479 65832 79294
07067 12854 23166 49012 56479 22674 69603 47846 91920 19188 94206 30370 50741 79932 88916
82945 28472 46267 45857 67101 39905 25753 75462 87523 01394 10135 26758 88652 34480 37901
33399 81517 64127 82407 23689 46598 23814 89327 87057 67715 30785 58496 38661 23259 19631
51428 25572 62696 33117 66242 11735 68466 90598 30201 25770 96006 48256 60967 49546 74989
45246 23347 48896 15828 69240 93948 27855 21999 19155 72859 78754 40094 39323 37570 73953
24384 49141 78464 73448 78883 25730 24813 36087 47883 50473 38354 25620 08787 61463 95219
43550 53461 42673 12646 87988 01411 58160 76833 53423 45490 23316 84940 81917 52712 10575
67691 02660 28326 46648 00840 02753 12403 29024 03017 28175 23557 64382 71324 17581 63090
49360 13426 04763 85671 40498 18689 99523 50400 00562 02112 00219 84376 42585 90350 96349
42432 49348 10219 99564 70165 82692 85914 81874 60401 37323 80781 59989 00844 82734 60942
68547 85157 26956 52508 10019 18964 03084 21624 95686 76579 53032 44148 74984 81609 42544
26081 21040 57502 30827 61940 50305 13410 22158 91529 35888 48318 13355 12491 31827 31256
16113 01090 72822 51906 23547 06985 93466 74652 33329 18298 75319 55988 76412 47573 49236
88368 50633 62276 50244 14896 21158 49633 92045 25400 49228 20287 69106 32732 88075 20196
37861 95795 39254 87408 16929 87171 38600 61330 80663 56488 43425 08589 53842 39410 55751
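One common way to use a table of random digits such as Table F is to number the members of a population, read the digits off in groups as wide as the largest number, and keep each group that names a member not already chosen. The sketch below illustrates this with the first three blocks of the first row of Table F. The population size of 50, the sample size of 4, the starting point and the function name are all assumptions made for illustration, not details taken from the book.

```python
def sample_from_digits(digits: str, population_size: int, sample_size: int) -> list:
    """Pick distinct subject numbers (1..population_size) by reading fixed-width digit groups."""
    width = len(str(population_size))          # 2 digits for a population of 50
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        number = int(digits[start:start + width])
        # Skip numbers outside the population and numbers already drawn.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# First three blocks of the first row of Table F, with the spaces removed.
row = "35368 65415 14425".replace(" ", "")
sample = sample_from_digits(row, population_size=50, sample_size=4)
print(sample)   # [35, 36, 15, 14]
```

Reading the digits in pairs gives 35, 36, 86, 54, 15, 14; the values 86 and 54 fall outside the population of 50 and are skipped, leaving subjects 35, 36, 15 and 14.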
