BMJ Statistical Notes Series List by JM Bland: http://www-users.york ...

Practice

Statistics Notes

The cost of dichotomising continuous variables

Douglas G Altman, Patrick Royston

Cancer Research UK/NHS Centre for Statistics in Medicine, Wolfson College, Oxford OX2 6UD
Douglas G Altman, professor of statistics in medicine

MRC Clinical Trials Unit, London NW1 2DA
Patrick Royston, professor of statistics

Correspondence to: Professor Altman doug.altman@cancer.org.uk

BMJ 2006;332:1080

Downloaded from bmj.com on 28 February 2008

Measurements of continuous variables are made in all branches of medicine, aiding in the diagnosis and treatment of patients. In clinical practice it is helpful to label individuals as having or not having an attribute, such as being “hypertensive” or “obese” or having “high cholesterol,” depending on the value of a continuous variable.

Categorisation of continuous variables is also common in clinical research, but here such simplicity is gained at some cost. Though grouping may help data presentation, notably in tables, categorisation is unnecessary for statistical analysis and it has some serious drawbacks. Here we consider the impact of converting continuous data to two groups (dichotomising), as this is the most common approach in clinical research.[1]

What are the perceived advantages of forcing all individuals into two groups? A common argument is that it greatly simplifies the statistical analysis and leads to easy interpretation and presentation of results. A binary split (for example, at the median) leads to a comparison of groups of individuals with high or low values of the measurement, leading in the simplest case to a t test or χ² test and an estimate of the difference between the groups (with its confidence interval). There is, however, no good reason in general to suppose that there is an underlying dichotomy, and if one exists there is no reason why it should be at the median.[2]

Dichotomising leads to several problems.
Firstly, much information is lost, so the statistical power to detect a relation between the variable and patient outcome is reduced. Indeed, dichotomising a variable at the median reduces power by the same amount as would discarding a third of the data.[2,3] Deliberately discarding data is surely inadvisable when research studies already tend to be too small. Dichotomisation may also increase the risk of a positive result being a false positive.[4] Secondly, one may seriously underestimate the extent of variation in outcome between groups, such as the risk of some event, and considerable variability may be subsumed within each group. Individuals close to but on opposite sides of the cutpoint are characterised as being very different rather than very similar. Thirdly, using two groups conceals any non-linearity in the relation between the variable and outcome. Presumably, many who dichotomise are unaware of the implications.

If dichotomisation is used where should the cutpoint be? For a few variables there are recognised cutpoints, such as > 25 kg/m² to define “overweight” based on body mass index. For some variables, such as age, it is usual to take a round number, usually a multiple of five or 10. The cutpoint used in previous studies may be adopted. In the absence of a prior cutpoint the most common approach is to take the sample median. However, using the sample median implies that various cutpoints will be used in different studies so that their results cannot easily be compared, seriously hampering meta-analysis of observational studies.[5] Nevertheless, all these approaches are preferable to performing several analyses and choosing that which gives the most convincing result.
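
The statement above that a median split costs about as much power as discarding a third of the data can be checked with a small simulation. The sketch below is illustrative only and is not taken from the article: it assumes a normally distributed predictor `x` linearly related to an outcome `y`, and the sample size, correlation, and seed are arbitrary choices. Under those assumptions the correlation with the outcome shrinks by a factor of sqrt(2/π) after a median split, so the relative efficiency is 2/π, or roughly 0.64.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_split_efficiency(n=100_000, rho=0.5, seed=1):
    """Squared ratio of the correlations after and before a median split.

    Assumes x is standard normal and y depends linearly on x with
    correlation rho (hypothetical values chosen for illustration).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    noise_sd = math.sqrt(1 - rho ** 2)
    y = [rho * xi + noise_sd * rng.gauss(0, 1) for xi in x]

    cut = sorted(x)[n // 2]                        # sample median
    high = [1.0 if xi > cut else 0.0 for xi in x]  # dichotomised predictor

    r_continuous = pearson(x, y)
    r_dichotomised = pearson(high, y)
    return (r_dichotomised / r_continuous) ** 2


efficiency = median_split_efficiency()
print(f"simulated relative efficiency: {efficiency:.3f}")
print(f"theoretical value 2/pi:        {2 / math.pi:.3f}")
print(f"equivalent fraction of data discarded: {1 - efficiency:.2f}")
```

The squared correlation ratio is used here as a stand-in for relative efficiency; cutpoints further from the median, or predictors that are not normally distributed, would give a different figure.
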
Use of this so called “optimal” cutpoint (usually that giving the minimum P value) runs a high risk of a spuriously significant result; the difference in the outcome variable between the groups will be overestimated, perhaps considerably; and the confidence interval will be too narrow.[6,7] This strategy should never be used.

When regression is being used to adjust for the effect of a confounding variable, dichotomisation will run the risk that a substantial part of the confounding remains.[4,7]

Dichotomisation is not much used in epidemiological studies, where the use of several categories is preferred. Using multiple categories (to create an “ordinal” variable) is generally preferable to dichotomising. With four or five groups the loss of information can be quite small, but there are complexities in analysis.

Instead of categorising continuous variables, we prefer to keep them continuous. We could then use linear regression rather than a two sample t test, for example. If we were concerned that a linear regression would not truly represent the relation between the outcome and predictor variable, we could explore whether some transformation (such as a log transformation) would be helpful.[7,8] As an example, in a regression analysis to develop a prognostic model for patients with primary biliary cirrhosis, a carefully developed model with bilirubin as a continuous explanatory variable explained 31% more of the variability in the data than when bilirubin distribution was split at the median.[7]

Competing interests: None declared.

1 Del Priore G, Zandieh P, Lee MJ. Treatment of continuous data as categoric variables in obstetrics and gynecology. Obstet Gynecol 1997;89:351-4.
2 MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Meth 2002;7:19-40.
3 Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.
4 Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Stat Med 2004;23:1159-78.
5 Buettner P, Garbe C, Guggenmoos-Holzmann I. Problems in defining cutoff points of continuous prognostic factors: example of tumor thickness in primary cutaneous melanoma. J Clin Epidemiol 1997;50:1201-10.
6 Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst 1994;86:829-35.
7 Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
8 Royston P, Sauerbrei W. Building multivariable regression models with continuous covariates in clinical epidemiology–with an emphasis on fractional polynomials. Methods Inf Med 2005;44:561-71.

Endpiece

Reading and reflecting

Reading without reflecting is like eating without digesting.
Edmund Burke

Kamal Samanta, retired general practitioner, Denby Dale, West Yorkshire

1080 BMJ VOLUME 332 6 MAY 2006 bmj.com
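
The warning in the note above against choosing the cutpoint that gives the smallest P value can be illustrated by simulation. The sketch below is an assumption-laden illustration and not the authors' analysis: the predictor and outcome are generated independently, so every “significant” result is a false positive. The sample size of 100, the candidate cutpoints at every fifth percentile from the 20th to the 80th, and the seed are all hypothetical choices. Because every comparison has the same 98 degrees of freedom, picking the largest absolute t statistic is equivalent to picking the minimum P value.

```python
import math
import random

# Approximate two-sided 5% critical value of Student's t with 98 degrees
# of freedom (two groups totalling n = 100).
T_CRIT = 1.984


def t_stat(a, b):
    """Pooled-variance two sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def false_positive_rates(n_sims=1000, n=100, seed=2):
    """False positive rates for a median split and for the 'best' cutpoint."""
    rng = random.Random(seed)
    # Candidate cutpoints as ranks: 20th, 25th, ..., 80th percentile of x.
    cut_ranks = [int(n * p / 100) for p in range(20, 85, 5)]
    hits_median = 0
    hits_best = 0
    for _ in range(n_sims):
        # x and y are independent, so there is no true relation to find.
        pairs = sorted((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n))
        y_by_x = [y for _, y in pairs]  # outcomes ordered by the predictor
        t_median = abs(t_stat(y_by_x[: n // 2], y_by_x[n // 2:]))
        t_best = max(abs(t_stat(y_by_x[:k], y_by_x[k:])) for k in cut_ranks)
        hits_median += t_median > T_CRIT
        hits_best += t_best > T_CRIT
    return hits_median / n_sims, hits_best / n_sims


rate_median, rate_best = false_positive_rates()
print(f"false positive rate, prespecified median split: {rate_median:.3f}")
print(f"false positive rate, minimum P value cutpoint:  {rate_best:.3f}")
```

In a run of this kind the prespecified split should stay near the nominal 5%, while the data-driven cutpoint should exceed it by a wide margin; the exact figures depend on the number and range of cutpoints tried.
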
Practice

Statistics Notes

Missing data

Douglas G Altman [1], J Martin Bland [2]

[1] Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford OX2 6UD
[2] Department of Health Sciences, University of York, York YO10 5DD

Correspondence to: Professor Altman doug.altman@cancer.org.uk

BMJ 2007;334:424
doi:10.1136/bmj.38977.682025.2C

Almost all studies have some missing observations. Yet textbooks and software commonly assume that data are complete, and the topic of how to handle missing data is not often discussed outside statistics journals.

There are many types of missing data and different reasons for data being missing. Both issues affect the analysis. Some examples are:

(1) In a postal questionnaire survey not all the selected individuals respond;
(2) In a randomised trial some patients are lost to follow-up before the end of the study;
(3) In a multicentre study some centres do not measure a particular variable;
(4) In a study in which patients are assessed frequently some data are missing at some time points for unknown reasons;
(5) Occasional data values for a variable are missing because some equipment failed;
(6) Some laboratory samples are lost in transit or technically unsatisfactory;
(7) In a magnetic resonance imaging study some very obese patients are excluded as they are too large for the machine;
(8) In a study assessing quality of life some patients die during the follow-up period.

The prime concern is always whether the available data would be biased. If the fact that an observation is missing is unrelated both to the unobserved value (and hence to patient outcome) and the data that are available this is called “missing completely at random.” For cases 5 and 6 above that would be a safe assumption. Sometimes data are missing in a predictable way that does not depend on the missing value itself but which can be predicted from other data, as in case 3.
Confusingly, this is known as “missing at random.” In the common cases 1 and 2, however, the missing data probably depend on unobserved values, called “missing not at random,” and hence their lack may lead to bias.

In general, it is important to be able to examine whether missing data may have introduced bias. For example, if we know nothing at all about the non-responders to a survey then we can do little to explore possible bias. Thus a high response rate is necessary for reliable answers.[1] Sometimes, though, some information is available. For example, if the survey sample is chosen from a register that includes age and sex, then the responders and non-responders can be compared on these variables. At the very least this gives some pointers to the representativeness of the sample. Non-responders often (but not always) have a worse medical prognosis than those who respond.

A few missing observations are a minor nuisance, but a large amount of missing data is a major threat to a study’s integrity. Non-response is a particular problem in pair-matched studies, such as some case-control studies, as it is unclear how to analyse data from the unmatched individuals. Loss of patients also reduces the power of the trial. Where losses are expected it is wise to increase the target sample size to allow for losses. This cannot eliminate the potential bias, however.

Missing data are much more common in retrospective studies, in which routinely collected data are subsequently used for a different purpose.[2] When information is sought from patients’ medical notes, the notes often do not say whether or not a patient was a smoker or had a particular procedure carried out. It is tempting to assume that the answer is no when there is no indication that the answer is yes, but this is generally unwise.

No really satisfactory solution exists for missing data, which is why it is important to try to maximise data collection.
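
The three missingness mechanisms defined above can be contrasted in a small simulation. This is a sketch under stated assumptions, not a method from the article: `age` is a hypothetical covariate that is always observed, `score` is a hypothetical outcome with true mean 0 that is sometimes missing, and missingness follows an arbitrary logistic model. The regression adjustment is one simple way of using the observed covariate and is shown only to make the distinction between the mechanisms visible.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least squares intercept and slope for y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n=50_000, seed=3):
    """Return {mechanism: (naive mean, covariate-adjusted mean)} for score."""
    rng = random.Random(seed)
    age = [rng.gauss(0, 1) for _ in range(n)]                # always observed
    score = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in age]   # true mean is 0

    def p_missing(driver):
        # Assumed model: chance of being missing rises with the driver.
        return 1 / (1 + math.exp(-1.5 * driver))

    observed = {
        # Missingness unrelated to anything.
        "MCAR": [rng.random() > 0.5 for _ in range(n)],
        # Missingness depends only on the observed covariate.
        "MAR": [rng.random() > p_missing(a) for a in age],
        # Missingness depends on the value that goes missing.
        "MNAR": [rng.random() > p_missing(s) for s in score],
    }

    results = {}
    for name, keep in observed.items():
        obs_age = [a for a, k in zip(age, keep) if k]
        obs_score = [s for s, k in zip(score, keep) if k]
        naive = mean(obs_score)  # mean of the available values only
        intercept, slope = fit_line(obs_age, obs_score)
        adjusted = intercept + slope * mean(age)  # predict at the overall mean age
        results[name] = (naive, adjusted)
    return results


results = simulate()
for name, (naive, adjusted) in results.items():
    print(f"{name}: naive mean {naive:+.3f}, adjusted mean {adjusted:+.3f}")
```

Under these assumptions the naive mean should sit close to 0 only when data are missing completely at random; using the observed covariate should recover the true mean when data are missing at random, and neither estimate should do so when data are missing not at random.
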
The main ways of handling missing data in analysis are: (a) omitting variables which have many missing values; (b) omitting individuals who do not have complete data; and (c) estimating (imputing) what the missing values were.

Omitting everyone without complete data is known as complete case (or available case) analysis and is probably the most common approach. When only a very few observations are missing little harm will be done, but when many are missing omitting all patients without full data might result in a large proportion of the data being discarded, with a major loss of statistical power. The results may be biased unless the data are missing completely at random. In general it is advisable not to include in an analysis any variable that is not available for a large proportion of the sample.

The main alternative approach to case deletion is imputation, whereby missing values are replaced by some plausible value predicted from that individual’s available data. Imputation has been the topic of much recent methodological work; we will consider some of the simpler methods in a separate Statistics Note.

Competing interests: None declared.

1 Evans SJW. Good surveys guide. BMJ 1991;302:302-3.
2 Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer 2004;91:4-8.

This is part of a series of occasional articles on statistics and handling data in research.

424 BMJ | 24 February 2007 | Volume 334