Empirical estimates of bias associated with non-random allocation

TABLE 19 Impact of observed increased variability with sample size

                                                IST – 14 regions    IST – 10 UK cities  ECST – 8 regions
Observed ratio of SDs for concurrent controls   2.5                 1.8                 1.01
Increase in variance in log OR attributable to
non-random allocation                           0.607               0.570               0.014

Total sample size                               Multipliers to confidence interval width to give correct coverage
100                                             1.9                 1.5                 1.0
200                                             2.5                 1.8                 1.0
500                                             3.8                 2.6                 1.1
1000                                            5.2                 3.5                 1.2
2000                                            7.3                 4.8                 1.3
5000                                            12                  7.5                 1.7
10000                                           16                  11                  2.2
20000                                           23                  15                  2.9
50000                                           36                  24                  4.4

studies may be an order of magnitude too narrow to describe correctly the true uncertainty in their results, but that there are differences in the adjustments that are needed in different situations. For example, the confidence interval calculated from a concurrently controlled study of 1000 participants may be five times too narrow to describe the true uncertainty for regional IST-type comparisons, three times too narrow to describe the true uncertainty in UK city IST-type comparisons, but only 20% too narrow for regional ECST-type comparisons. For sample sizes of 10,000 the confidence intervals are estimated to be more than 10 times too narrow for the IST situations and half the width needed for the ECST situation.
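One way to read the multipliers in Table 19 is as the factor by which a standard confidence interval for the log OR would need to be widened once the extra variance attributed to non-random allocation is added. The following is a minimal sketch of that arithmetic, not the calculation used in the report. It assumes that the multiplier is the square root of the ratio of total to standard variance and that the standard variance falls as k/n. The function name `ci_width_multiplier` and the constant `k` are hypothetical and chosen only for illustration.

```python
import math


def ci_width_multiplier(n, extra_var, k=20.0):
    """Factor by which a standard CI for the log OR would need widening.

    Assumptions (not taken from the report): the standard sampling
    variance of the log OR is k / n, and the extra variance due to
    non-random allocation simply adds to it. The default k = 20.0 is a
    hypothetical placeholder, not an estimate from the IST or ECST.
    """
    if n <= 0 or k <= 0 or extra_var < 0:
        raise ValueError("n and k must be positive and extra_var non-negative")
    standard_var = k / n
    total_var = standard_var + extra_var
    # CI width is proportional to the standard error, hence the square root.
    return math.sqrt(total_var / standard_var)


# Extra variance for the IST 14-region comparisons, as given in Table 19.
for n in (100, 1000, 10000):
    print(n, round(ci_width_multiplier(n, extra_var=0.607), 1))
```

Because the extra variance does not shrink as the sample size grows while the standard variance does, the multiplier rises with n, which is the pattern seen in every column of the table.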
Of course, in practice one would not know to what extent the standard confidence interval under-represented the true uncertainty.

Generalisability and limitations of the findings

The value of these findings and estimates depends on the generalisability of the results obtained from the IST and ECST and the degree to which the slightly artificial methodology and samples used in these evaluations are representative of the reality of non-randomised studies.

Generalisability

The IST and ECST were chosen for this investigation as (a) they were large trials, (b) they had an outcome which was not rare, (c) they were multicentre trials and (d) the trialists were willing to provide reduced and anonymised data sets suitable for our analyses. Other than the fact that both trials relate to stroke medicine, the trials differ considerably. One is a trial of pharmacological agents (aspirin and heparin) whereas the other is a trial of a surgical procedure (carotid endarterectomy). The treatment in one is acute, being given immediately after the patients have suffered a severe stroke, whereas in the other it is preventive, being given to high-risk patients. It is difficult to argue that these trials can be regarded as representative and therefore that the results are generalisable. However, their results should be regarded as being indicative of the biases associated with the use of non-random controls. Ideally these resampling study methods should be repeated in more trials.
In the case of this project, the time required to generate the resampling studies and the difficulty in obtaining data sets from multicentre clinical trials prevented additional evaluations being undertaken.

It is important also to consider whether the time trend observed in the ECST is likely to be typical of those that may be observed in other areas of healthcare, especially as it is in agreement with the trends observed by Sacks and colleagues in their review across six clinical contexts.[27] The trend is one of patient outcomes improving over time. It is consistent with a general pattern of average outcomes improving with progress in medical care, which may apply across all medical specialities. However, this argument assumes that the case-mix of patients being treated is stable, which may not be the case. In some circumstances changes in case-mix over time, for good reason, may lead to apparent increases in adverse outcomes. For example, if medical information leads to knowledge that the treatment is not suited to patients at low risk, then a change to excluding lower risk patients from receiving that treatment
may lead to increases in average event rates. Historically controlled studies undertaken in such a situation may be prone to underestimating the benefits of treatment and may even falsely conclude that treatment does more harm than good.

Health Technology Assessment 2003; Vol. 7: No. 27

The lack of systematic bias with the use of geographical controls is based on the presumption that geographical differences act in a haphazard manner, and are as likely to lead to overestimates of treatment effects as to underestimates. The random manner in which concurrent control groups were selected in the resampling exercise ensured that across a large number of studies these differences would be seen to balance each other out, albeit possibly increasing unpredictability. This result does not indicate that geographically controlled studies are unbiased. In reality, a single comparison between two areas is likely to be biased, as are meta-analyses of several studies, although the direction in which the bias acts may be unknown. In addition, if an investigator chose a geographical control group with knowledge of the likely differences in case-mix, it would be possible for the selection to be manipulated (consciously or subconsciously) in such a way as to introduce a particular bias, akin to the bias observed in RCTs when treatment allocation is not concealed.[20]

Similarly, we should consider whether the mechanisms leading to unpredictability in bias, especially in studies generated from the IST, are likely to apply widely across different clinical areas. Tables 38 and 40 in Appendix 8 show that the case-mix of patients recruited to the IST varied between locations, both internationally and between cities in the UK. These haphazard differences, together with differences in other unknown risk factors and aspects of patient management and outcome assessment, will have caused the unpredictability in the bias that was observed.
Evidence is available in all areas of medicine that such differences exist, and therefore it seems reasonable to conclude that the unpredictable behaviour in biases will be observed elsewhere, although the degree of unpredictability may vary.

Limitations of the resampling methodology

The resampling method used participants recruited to a randomised controlled trial to generate non-randomised studies. This, of course, is not what happens in reality, but there are reasons to believe that our approach is more likely to have led to underestimates than overestimates of bias.

The degree of bias in a non-randomised study depends on the similarity of the two groups from which treated participants and controls are drawn. Sampling these groups from the same randomised trial is likely to have reduced such differences for the following reasons:

1. All participants included in the RCT will have been judged to have been suitable for either treatment. In a non-randomised study participants who are suitable for only one of the two treatments may have been recruited to that arm: there is usually no formal assessment that they would have been considered suitable for the alternative. This difference will nearly always act to increase differences in outcome between the groups.
2. The RCT was conducted according to a protocol, describing methods for recruiting, assessing, treating and evaluating the patients. This will have reduced the variability within the trial. Although some non-randomised studies are organised using a protocol, many are not.
3. All participants included in the trial were recruited prospectively.
In non-randomised studies, especially those using historical controls, participants are likely to have been 'retrospectively' included in the study, potentially introducing additional bias.

On balance, it could be argued that using randomly chosen international comparisons for selection of concurrent controls may be regarded as rather artificial and likely to have increased differences between groups. In reality, a concurrent control group in a non-randomised study would be selected to minimise likely differences between groups, and such long-distance geographical comparisons would probably be avoided. It is perhaps more realistic to focus on the magnitude of the biases observed in the concurrent comparisons generated from the UK cities in the IST as being more representative of what might occur in reality. The unsystematic bias seen here was less than that observed in international comparisons, but still large enough to lead many studies falsely to obtain significant findings of both benefit and harm.

Importantly, we have concentrated on only one aspect of quality in non-randomised studies: there are other biases to which they are susceptible in the same way as are RCTs.

© Queen’s Printer and Controller of HMSO 2003. All rights reserved.
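The idea behind the resampling exercise discussed in this section, drawing treated participants from one location and concurrent controls from another, can be illustrated with a small simulation. This is a simplified sketch on synthetic data, not the authors' code or data: the regional event risks are hypothetical, there is no true treatment effect, and the only difference between locations is their baseline risk. The function names `log_odds_ratio` and `simulate` are likewise assumptions made for the example.

```python
import math
import random
import statistics


def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log OR with a 0.5 continuity correction to avoid division by zero."""
    a, b = events_t + 0.5, n_t - events_t + 0.5
    c, d = events_c + 0.5, n_c - events_c + 0.5
    return math.log((a * d) / (b * c))


def simulate(region_risks, n_per_arm, n_studies, concurrent, rng):
    """Return log ORs from resampled studies with no true treatment effect.

    concurrent=False: both arms come from the same region (as in an RCT).
    concurrent=True: controls come from a different, randomly chosen region.
    """
    results = []
    for _ in range(n_studies):
        treated_region = rng.randrange(len(region_risks))
        if concurrent:
            others = [r for r in range(len(region_risks)) if r != treated_region]
            control_region = rng.choice(others)
        else:
            control_region = treated_region
        p_t = region_risks[treated_region]
        p_c = region_risks[control_region]
        events_t = sum(rng.random() < p_t for _ in range(n_per_arm))
        events_c = sum(rng.random() < p_c for _ in range(n_per_arm))
        results.append(log_odds_ratio(events_t, n_per_arm, events_c, n_per_arm))
    return results


rng = random.Random(0)
# Hypothetical regional event risks: synthetic values, not IST or ECST data.
risks = [0.15, 0.20, 0.25, 0.30]
randomised = simulate(risks, 200, 500, concurrent=False, rng=rng)
geographic = simulate(risks, 200, 500, concurrent=True, rng=rng)

for label, values in (("randomised", randomised), ("geographic", geographic)):
    print(label, "mean:", round(statistics.mean(values), 3),
          "SD:", round(statistics.stdev(values), 3))
```

Under these assumptions the mean log OR stays close to zero in both designs, because the control region is chosen at random, but the spread of estimates is wider when controls come from a different region. That is the sense in which the bias is unsystematic across many studies yet unpredictable in any single one.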