Evaluating Patient-Based Outcome Measures - NIHR Health ...

Criteria for selecting a patient-based outcome measure

patients beyond a certain level (Ganiats et al., 1992).

Distribution of baseline scores

The responsiveness of an instrument may also be influenced by the relationship of items in the instrument to the distribution of levels of difficulty or severity in the underlying construct. As a hypothetical example, it is possible to imagine an instrument designed to measure mobility where items mainly reflected ‘easy’ tasks; that is, the majority of respondents could be expected to report no problem, for example, in walking a very short distance. Because most items in the scale reflect ‘easy’ tasks, a large amount of change could be produced (i.e. the patient reports change over the majority of items) even when only a small amount of real improvement had occurred.

Stucki and colleagues (1995) show that the problem of the relationship of items to an underlying range of degrees of difficulty or seriousness is not entirely hypothetical. They provide evidence that many items from the physical ability scale of the SF-36 reflect intermediate rather than extreme levels of difficulty for patients undergoing total hip arthroplasty. Thus patients experiencing improvements at this intermediate level of physical difficulty can be expected to show high levels of gain according to SF-36, at least in part because of the range of items.

As Stucki and colleagues argue, this problem can arise from the ways in which scales are often developed, as described in earlier sections of this report, with emphasis upon high levels of agreement between items on a scale (internal reliability), rather than requiring items that reflect a full range of difficulty or severity of an underlying problem. We have already seen arguments against excessive reliance on inter-item agreement to develop instruments rehearsed by Kessler and Mroczek (1995) in the context of reliability, above.
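The effect of item clustering on apparent change can be illustrated with a small sketch. It assumes a deliberately simple threshold model, which is not taken from the report: a respondent reports no problem on an item whenever their ability level meets or exceeds that item's difficulty. The item difficulties, the 0 to 10 range and the names `scale_score`, `clustered_items` and `spread_items` are all hypothetical.

```python
# Hypothetical threshold model: a respondent reports "no problem" on an item
# when their ability level meets or exceeds the item's difficulty.

def scale_score(ability, item_difficulties):
    """Number of items on which the respondent reports no problem."""
    return sum(1 for d in item_difficulties if ability >= d)

# Ten items clustered in a narrow band of difficulty (illustrative values).
clustered_items = [4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9]
# Ten items spread evenly over the full 0-10 range of the construct.
spread_items = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]

before, after = 4.0, 5.0  # the same modest real improvement of one unit

clustered_change = (scale_score(after, clustered_items)
                    - scale_score(before, clustered_items))
spread_change = (scale_score(after, spread_items)
                 - scale_score(before, spread_items))

print(clustered_change)  # 9 items change on the clustered scale
print(spread_change)     # 1 item changes on the evenly spread scale
```

Under these assumptions the same one-unit improvement moves nine of ten items on the clustered scale but only one item on the evenly spread scale, so the size of the observed change depends on where the items sit as well as on how much the patient has improved.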
Here it is possible to see problems arising from excessive emphasis upon internal reliability in the context of responsiveness.

Summary

The need for an instrument to be responsive to changes that are of importance to patients should be of evident importance in the context of clinical trials. Whilst there are no universally agreed methods for assessing this property, at a more general level all discussions require evidence of statistically significant change of some form from observations made at separate times and when there is good reason to think that changes have occurred that are of importance to patients.

Precision

How precise are the scores of the instrument?

This review is primarily concerned with the use of patient-based outcome measures in the context of clinical trials. Investigators will need to examine the pattern of responses to health status measures in a trial to determine whether there are clear and important differences between the arms of a trial. They therefore need to examine a number of aspects of candidate instruments’ numerical properties which have not been clearly delineated in the literature, but which relate to the precision of distinctions made by an instrument.

Testa and Simonson (1996) refer to this property as ‘sensitivity’: ‘Although a measure may be responsive to changes in Q (quality of life), gradations in the metric of Z (the instrument) may not be adequate to reflect these changes. Sensitivity refers to the ability of the measurement to reflect true changes or differences in Q’ (1996: 836). Stewart (1992) also refers to this property as ‘sensitivity’. In particular, she refers to the number of distinctions an instrument makes; the fewer, the more insensitive it is likely to be. Kessler and Mroczek (1995) refer to this property as ‘precision’, which is probably less confusing since sensitivity has a number of other uses and meanings in this field.
As Kessler and Mroczek argue, an instrument may have high reliability but low precision if it makes only a small number of crude distinctions with regard to a dimension of health. Thus at the extreme one instrument might distinguish with high reliability only between those who are healthy and those who are ill. For the purposes of a trial, such an instrument would not be useful because it is degrees of change within the category of ‘unwell’ that are likely to be needed to evaluate results of the arms of the trial.

There are a number of ways in which the issue of precision has been raised in relation to patient-based outcome measures. This is fairly disparate evidence and it is reviewed under a number of more specific headings.

Precision of response categories

One of the main influences on the precision of an instrument is the format of response categories; i.e. the form in which respondents are able to give their answers. At one extreme answers may be given by respondents in terms of very basic distinctions, ‘yes’ or ‘no’. Binary response categories have the advantage of simplicity but there is evidence that they do not allow respondents to report degrees of difficulty or severity that they experience and consider important to distinguish (Donovan et al., 1993). Many instruments therefore allow for gradations of response, most commonly in the form of a Likert set of response categories:

– strongly agree
– agree
– uncertain
– disagree
– strongly disagree

or some equivalent set of ordinally related items:

– very satisfied
– satisfied
– neither satisfied nor dissatisfied
– dissatisfied
– very dissatisfied

Alternatively, response categories may require that respondents choose between different options of how frequently a problem occurs.

There is some evidence that there is increased precision from using seven rather than five response categories. A sample of older individuals with heart problems were assigned to questionnaires assessing satisfaction with various domains of life with either five- or seven-item response categories (Avis and Smith, 1994). The latter showed higher correlations with a criterion measure of QoL completed by respondents. However, there is little evidence in the literature of increased precision beyond seven categories.

The main alternative to Likert format response categories is the visual analogue scale, which would appear to offer considerably more precision. Respondents can mark any point on a continuous line to represent their experience and in principle this offers an extensive range of response categories. However, the evidence is not strong that the apparent precision is meaningful (Nord, 1991). Guyatt and colleagues (1987a) compared the responsiveness of a health-related QoL measure for respiratory function, using alternate forms of a Likert and visual analogue scale. They found no significant advantage for the visual analogue scale.
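The arithmetic behind the idea that more response categories permit finer distinctions can be sketched as follows. This is an illustration only, under strong assumptions that are not in the report: each respondent has an error-free 'true' value on a 0 to 100 scale, and categories divide that scale into equal bands. The function names and the 10-point gap are hypothetical, and the sketch does not reproduce any of the studies cited above.

```python
def categorise(value, n_categories, scale_max=100):
    """Map an integer value in 0..scale_max to a category 1..n_categories."""
    index = value * n_categories // scale_max + 1
    return min(index, n_categories)  # the top of the scale joins the last category

def fraction_distinguished(gap, n_categories, scale_max=100):
    """Share of pairs (v, v + gap) that fall into different categories."""
    starts = range(0, scale_max - gap + 1)
    differing = sum(
        1 for v in starts
        if categorise(v, n_categories, scale_max)
        != categorise(v + gap, n_categories, scale_max)
    )
    return differing / len(starts)

# Hypothetical question: how often does a 10-point true difference
# show up as a difference in the recorded response category?
results = {k: fraction_distinguished(10, k) for k in (2, 5, 7)}

for k, share in results.items():
    print(k, round(share, 2))
```

In this idealised setting the share of detectable differences rises with the number of categories. Because the sketch ignores how consistently respondents can actually use finer gradations, it should not be read as predicting that ever more categories, or a visual analogue scale, will yield more meaningful precision in practice.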
Similar results were found in a randomised trial setting, showing no advantage in responsiveness for visual analogue scales (Jaeschke et al., 1990). An additional concern cited earlier is the somewhat lower acceptability of visual analogue scales as a task. Overall, firm empirical evidence of superiority of visual analogue scales over Likert scales is difficult to find (Remington et al., 1979).

Health Technology Assessment 1998; Vol. 2: No. 14

Precision of numerical values

To be of use in clinical trials, what patients report in health status measures is generally transformed into numerical values or codes that, on the one hand, most accurately reflect differences between individuals and changes within individuals over time and, on the other hand, make possible statistical analysis of the size and importance of results. Clearly philosophical and epistemological issues can be raised about this process of assigning numerical values to subjective experience (Nordenfelt, 1994). These issues must be acknowledged but are beyond the scope of this review to address. Instead, we need to examine how the field has drawn upon psychometric, social scientific and statistical principles to produce pragmatically plausible numerical values that capture, as accurately as possible, subjective experiences that may in some way be related to health care interventions.

Two basically different methods of numerical scoring can be found amongst health status measures. On the one hand, the majority of instruments use somewhat arbitrary but common-sense based methods of simple ordinal values. For example, many instruments use Likert format response categories where degrees of agreement with a statement are given progressively higher values as agreement decreases: strongly agree = 1; agree = 2; neither agree nor disagree = 3; disagree = 4; strongly disagree = 5. The direction of such values is entirely arbitrary, and can be reversed so that greater agreement is given higher numerical value.
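The simple ordinal scoring described above, and the reversal of its direction, can be sketched in a few lines. The codes are those given in the text; the function name `score`, the three-item scale and the example responses are hypothetical.

```python
# Ordinal codes as given in the text: stronger agreement receives a lower value.
LIKERT_CODES = {
    "strongly agree": 1,
    "agree": 2,
    "neither agree nor disagree": 3,
    "disagree": 4,
    "strongly disagree": 5,
}

def score(response, reverse=False):
    """Return the ordinal code for a response, optionally reversing direction."""
    code = LIKERT_CODES[response]
    if reverse:
        # Reflect the code around the midpoint: 1 <-> 5, 2 <-> 4, 3 stays 3.
        return len(LIKERT_CODES) + 1 - code
    return code

# Hypothetical responses to a three-item scale.
responses = ["agree", "strongly agree", "disagree"]

total = sum(score(r) for r in responses)                          # 2 + 1 + 4
total_reversed = sum(score(r, reverse=True) for r in responses)   # 4 + 5 + 2

print(total, total_reversed)  # 7 11
```

Reversal changes only the direction of the scale: the ordering of respondents is mirrored and no information is gained or lost, which is why the choice of direction is arbitrary.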
It is worth noting that some instruments such as SF-36 recode numerical values so that items are expressed as percentages or proportions of the total scale score. To take a hypothetical example, an instrument may have six alternative responses for an assessment of pain, ranging in severity from, let us say, ‘no pain at all’ through to ‘severe pain all of the time’. Instead of scoring responses ‘1’, ‘2’, ‘3’ and so on, the scores may be transformed into percentages of a total: ‘17%’, ‘33%’, ‘50%’. Although this approach produces a range of values between 0 and 100, the simple and limited basis from which values are derived should be kept in mind. In particular, while it might appear that 33
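The percentage recoding in the hypothetical pain example can be sketched as below. The sketch follows only the arithmetic of the example in the text (each ordinal code expressed as a rounded percentage of the number of response options); it is not the published SF-36 scoring algorithm, and the function name `percent_of_total` is hypothetical.

```python
def percent_of_total(raw_score, n_responses):
    """Express an ordinal code 1..n_responses as a rounded percentage of n_responses."""
    if not 1 <= raw_score <= n_responses:
        raise ValueError("raw_score must lie between 1 and n_responses")
    return round(100 * raw_score / n_responses)

# Six response options for the hypothetical pain item in the text.
percentages = [percent_of_total(raw, 6) for raw in range(1, 7)]

print(percentages)  # [17, 33, 50, 67, 83, 100]
```

However the values are displayed, only six distinct scores are possible for this item, which reflects the simple and limited basis from which they are derived.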
