Empirical evaluation of the ability of case-mix adjustment methodologies to control for selection bias cannot reveal directly the likely degree of residual confounding that may be present, and therefore we cannot gauge how biased adjusted results may still be. By comparing with results based on randomisation, our investigations suggest that the degree of underadjustment may be large. Indeed, our results may in fact be overoptimistic, as the covariate data used were recorded in a standard way according to trial protocols, and were complete for all participants. In many non-randomised studies measurement methods are not standardised. Also, covariate data are incomplete (especially in retrospective studies), leading to bias if the observations are not missing at random.

Our two greatest concerns are the potential increase in bias that could occur as a result of correlated misclassification of covariates, and the differences between conditional and unconditional estimates. Correlated misclassification is a problem inherent to the data, and cannot be adjusted for. It is very difficult to know the degree of misclassification and error in a variable, and impossible to know whether the variable being used is the ‘true’ confounder or just a proxy.
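The proxy-confounder concern can be illustrated with a toy simulation (not taken from the report; all numbers are hypothetical). The true confounder drives both treatment and outcome, the true treatment effect is zero, and the analyst only observes a noisy proxy of the confounder. Adjusting for the proxy removes only part of the confounding:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.normal(size=n)                                    # true confounder
t = (rng.random(n) < 1 / (1 + np.exp(-u))).astype(float)  # treatment depends on u
y = u + rng.normal(size=n)                                # outcome depends on u; true treatment effect is 0
x = u + rng.normal(size=n)                                # noisy proxy of u (what is actually recorded)

def adjusted_effect(covariates):
    """OLS coefficient on t when regressing y on t plus the given covariates."""
    design = np.column_stack([np.ones(n), t] + covariates)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

crude = adjusted_effect([])    # no adjustment: fully confounded
proxy = adjusted_effect([x])   # adjusting for the mismeasured proxy: partially confounded
full = adjusted_effect([u])    # adjusting for the true confounder: approximately unbiased

print(f"crude={crude:.3f}  proxy-adjusted={proxy:.3f}  fully adjusted={full:.3f}")
```

The proxy-adjusted estimate falls between the crude and fully adjusted estimates, but remains well away from the true null effect; no amount of modelling of the recorded variable can recover the adjustment the true confounder would have provided.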
These findings question the appropriateness of the strategy of including data on all available potential confounders when adjusting for case-mix, which has been the starting point of many risk adjustment methods used throughout healthcare. However, the same findings could be explained by the peculiar differences between unconditional and conditional estimates of treatment effects observed when results are expressed as ORs, although this mechanism only applied to estimates obtained from logistic regression and stratification methods.

The finding of high levels of residual confounding and the detrimental effect of adjustment were seen both in historically controlled studies, known to be prone to systematic bias, and in concurrently controlled studies, more prone to unpredictability in bias. The relationships were also noted in studies mimicking allocation by indication.

It is important to find out whether such destructive relationships between covariates are common. We have examined data from only two clinical situations, but in both we observed results that undermine the use of case-mix adjustment. Also, in the IST, case-mix adjustment was found to be detrimental in eight of the 14 regions.

There appears to be a small potential benefit of using propensity score methods over logistic regression for case-mix adjustment in terms of the consistency of estimates of treatment effects. While logistic regression always increased the range of observed treatment effects, propensity score methods did not.
This finding may indicate a greater role for propensity score methods in healthcare research, although in the particular applications investigated neither approach performed adequately.

For those critically appraising non-randomised studies, the recommendation to assess whether “investigators demonstrate similarity in all known determinants of outcome”138,139 has not been universally supported by our empirical investigations. The second recommendation, to assess whether they “adjust for these differences in analysis”, is also not supported empirically. Our analyses suggest that there are considerable complexities in assessing whether a case-mix adjustment analysis will increase or decrease bias.

These findings may have a major impact on the certainty which we assign to many effects in healthcare which have been made on the basis of using risk adjustment methods.
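The divergence between conditional and unconditional estimates noted in this chapter is partly an arithmetic property of the odds ratio itself (often called non-collapsibility), and arises even without confounding. A minimal worked example with illustrative numbers, not drawn from the report: two equal-sized strata, 1:1 allocation within each, identical stratum-specific ORs, yet a smaller marginal OR.

```python
# Risks by stratum: equal-sized strata, 1:1 allocation within each stratum,
# so the stratum variable is balanced between arms and is NOT a confounder.
risk = {  # stratum -> (risk if treated, risk if control); illustrative numbers
    "low":  (0.5, 0.2),
    "high": (0.8, 0.5),
}

def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Conditional (stratum-specific) ORs: identical in both strata.
conditional = {s: odds_ratio(p1, p0) for s, (p1, p0) in risk.items()}

# Marginal risks: average over the two equal-sized strata.
p1 = sum(p for p, _ in risk.values()) / 2
p0 = sum(p for _, p in risk.values()) / 2
marginal = odds_ratio(p1, p0)

print(conditional)         # both strata: OR = 4.0
print(round(marginal, 2))  # marginal OR = 3.45 -- smaller, despite no confounding
```

An adjusted (conditional) OR of 4.0 and an unadjusted (marginal) OR of 3.45 therefore need not signal removal of bias; for ORs the two quantities answer subtly different questions, which complicates comparisons between adjusted and unadjusted results.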
Health Technology Assessment 2003; Vol. 7: No. 27

Chapter 8

Discussion and conclusions

Chapters 3–7 have reported results from five separate evaluations concerning non-randomised studies. The results have been discussed in detail in each chapter. We summarise their main findings below.

Summary of key findings

Our review of previous empirical investigations of the importance of randomisation (Chapter 3) identified eight studies that fulfilled our inclusion criteria. Each investigation reported multiple comparisons of results of randomised and non-randomised studies. Although there was overlap in the comparisons included in these reviews, they reached different conclusions concerning the likely validity of non-randomised data, mainly reflecting weaknesses in the meta-epidemiological methodology that they all used, most notably that it was not able to account for confounding factors in the comparisons between randomised and non-randomised studies, nor to detect anything other than systematic bias.

We identified 194 tools that could be used to assess the quality of non-randomised studies (Chapter 4). Overall the tools were poorly developed: the majority did not provide a means of assessing the internal validity of non-randomised studies, and almost no attention was paid to the principles of scale development and evaluation. However, 14 tools were identified that included items related to each of our pre-specified core internal validity criteria, which related to assessment of allocation method, attempts to achieve comparability by design, identification of important prognostic factors and adjustment for differences in case-mix.
Six of the 14 tools were considered potentially suitable for use as quality assessment tools in systematic reviews, but all require some modification to meet all of our pre-specified criteria.

Of the 511 systematic reviews we identified that included non-randomised studies, only 169 (33%) assessed study quality, and only 46% of these reported the results of the quality assessment for each study (Chapter 5). This is lower than the rate of quality assessment in systematic reviews of randomised controlled trials.131 Among those that did assess study quality, a wide variety of quality assessment tools were used, some of which were designed only for use in evaluating RCTs, and many were designed by the review authors themselves. Most reviews (88%) did not assess key quality criteria of particular importance for the assessment of non-randomised studies. Sixty-nine reviews (41%) investigated the impact of quality on study results in a quantitative manner. The results of these analyses showed no consistent pattern in the way that study quality relates to treatment effects, and were confounded by the inclusion of a variety of study designs and studies of variable quality.

A unique ‘resampling’ method was used to generate multiple unconfounded comparisons between RCTs and historically controlled and concurrently controlled studies (Chapter 6). These empirical investigations identified two characteristics of the bias introduced by using non-random allocation. First, the use of historical controls can lead to systematic over- or underestimation of treatment effects, the direction of the bias depending on time trends in the case-mix of participants recruited to the study. In the studies used for the analyses, these time trends varied between study regions, and were therefore difficult to predict.
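The first mechanism can be sketched schematically (a toy illustration of the idea, not the report's actual resampling algorithm, and with invented numbers): within a simulated trial whose underlying event risk drifts over time, a randomised comparison recovers the true effect, while a ‘historically controlled’ comparison assembled from the same data, taking treated patients from later periods and controls from earlier ones, is systematically biased by the time trend.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_period, periods = 5_000, 4

# Simulate one large trial: within each period allocation is randomised,
# but the underlying event risk drifts over time (a case-mix time trend).
rows = []
for p in range(periods):
    base = 0.30 - 0.04 * p  # control-group event risk falls over time
    t = rng.integers(0, 2, n_per_period)
    y = (rng.random(n_per_period) < np.where(t == 1, base - 0.05, base)).astype(int)
    rows.append((np.full(n_per_period, p), t, y))
period, treat, outcome = (np.concatenate(c) for c in zip(*rows))

def risk_diff(treated_mask, control_mask):
    return outcome[treated_mask].mean() - outcome[control_mask].mean()

# Randomised comparison: treated vs control drawn from the same periods.
rct = risk_diff(treat == 1, treat == 0)

# 'Historically controlled' comparison assembled from the same trial:
# treated patients from the later periods, controls from the earlier ones.
hcs = risk_diff((treat == 1) & (period >= 2), (treat == 0) & (period < 2))

print(f"true effect = -0.050, RCT estimate = {rct:.3f}, historical-control estimate = {hcs:.3f}")
```

With a falling background risk the historical comparison exaggerates the benefit; had the trend run the other way it would have understated or reversed it, which is why the direction of this bias is hard to predict without knowing the local time trend.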
Second, the results of both study designs varied beyond what was expected from chance. In a very large sample of studies the biases causing the increased unpredictability on average cancelled each other out, but in individual studies the bias could be fairly large, and could act in either direction. These biases again relate to differences in case-mix, but the differences are neither systematic nor predictable.

Four commonly used methods of dealing with variations in case-mix were identified: (i) discarding comparisons between groups which differ in their baseline characteristics, (ii) regression modelling, (iii) propensity score methods and (iv) stratified analyses (Chapter 7). The methods were applied to the historically and concurrently controlled studies generated in Chapter 6, and also to studies designed to mimic ‘allocation by indication’. None of the methods successfully removed bias in

© Queen’s Printer and Controller of HMSO 2003. All rights reserved.