Evaluating Patient-Based Outcome Measures - NIHR Health ...

Criteria for selecting a patient-based outcome measure

patients beyond a certain level (Ganiats et al., 1992).

Distribution of baseline scores

The responsiveness of an instrument may also be influenced by the relationship of items in the instrument to the distribution of levels of difficulty or severity in the underlying construct. As a hypothetical example, it is possible to imagine an instrument designed to measure mobility where items mainly reflected ‘easy’ tasks; that is, the majority of respondents could be expected to report no problem, for example, in walking a very short distance. Because most items in the scale reflect ‘easy’ items, a large amount of change could be produced (i.e. the patient reports change over the majority of items) even when only a small amount of real improvement had occurred. Stucki and colleagues (1995) show that the problem of the relationship of items to an underlying range of degrees of difficulty or seriousness is not entirely hypothetical. They provide evidence that many items from the physical ability scale of the SF-36 reflect intermediate rather than extreme levels of difficulty for patients undergoing total hip arthroplasty. Thus patients experiencing improvements at this intermediate level of physical difficulty can be expected to experience high levels of gain according to the SF-36, at least in part because of the range of items. As Stucki and colleagues argue, this problem can arise from the ways in which scales are often developed, as described in earlier sections of this report, with emphasis upon high levels of agreement between items on a scale (internal reliability), rather than requiring items that reflect a full range of difficulty or severity of an underlying problem. We have already seen arguments against excessive reliance on inter-item agreement to develop instruments rehearsed by Kessler and Mroczek (1995) in the context of reliability, above. Here it is possible to see problems arising from excessive emphasis upon internal reliability in the context of responsiveness.
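The hypothetical mobility instrument can be illustrated numerically. The sketch below is ours, not drawn from the report: item difficulties and ability values are invented for illustration. A respondent is assumed to endorse an item whenever their ability exceeds the item's difficulty, and the score is simply the count of endorsed items. With difficulties clustered at the ‘easy’ end, a small true improvement flips most items; with difficulties spread across the full range, the observed change stays proportionate.

```python
# Hypothetical illustration: a 10-item mobility scale whose item
# difficulties cluster at the 'easy' end of the construct, versus
# one whose difficulties span the full range. All numbers invented.
easy_items   = [0.05, 0.10, 0.10, 0.15, 0.15, 0.20, 0.20, 0.25, 0.25, 0.30]
spread_items = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

def score(ability, difficulties):
    # A respondent endorses ("no problem") each item whose difficulty
    # lies below their ability; the score is the count of endorsements.
    return sum(1 for d in difficulties if ability > d)

# A small true improvement at the low end of the construct...
before, after = 0.10, 0.35

change_easy = score(after, easy_items) - score(before, easy_items)
change_spread = score(after, spread_items) - score(before, spread_items)

print(change_easy)    # 9: most 'easy' items flip, large apparent change
print(change_spread)  # 2: change in proportion to the true improvement
```

The same modest gain in underlying mobility produces a near-maximal change score on the ‘easy’-weighted scale but only a small one on the evenly spread scale, which is the inflation the text describes.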

Summary

The need for an instrument to be responsive to changes that are of importance to patients should be of evident importance in the context of clinical trials. Whilst there are no universally agreed methods for assessing this property, at a more general level all discussions require evidence of statistically significant change of some form from observations made at separate times and when there is good reason to think that changes have occurred that are of importance to patients.

Precision

How precise are the scores of the instrument?

This review is primarily concerned with the use of patient-based outcome measures in the context of clinical trials. Investigators will need to examine the pattern of responses to health status measures in a trial to determine whether there are clear and important differences between the arms of a trial. They therefore need to examine a number of aspects of candidate instruments’ numerical properties which have not been clearly delineated in the literature, but which relate to the precision of distinctions made by an instrument. Testa and Simonson (1996) refer to this property as ‘sensitivity’:

‘Although a measure may be responsive to changes in Q (quality of life), gradations in the metric of Z (the instrument) may not be adequate to reflect these changes. Sensitivity refers to the ability of the measurement to reflect true changes or differences in Q’ (1996: 836).

Stewart (1992) also refers to this property as ‘sensitivity’. In particular, she refers to the number of distinctions an instrument makes; the fewer, the more insensitive it is likely to be. Kessler and Mroczek (1995) refer to this property as ‘precision’, which is probably less confusing since sensitivity has a number of other uses and meanings in this field. As Kessler and Mroczek argue, an instrument may have high reliability but low precision if it makes only a small number of crude distinctions with regard to a dimension of health. Thus at the extreme one instrument might distinguish with high reliability only between those who are healthy and those who are ill. For the purposes of a trial, such an instrument would not be useful because it is degrees of change within the category of ‘unwell’ that are likely to be needed to evaluate results of the arms of the trial.
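The contrast between a reliable-but-crude instrument and a more precise one can be sketched as follows. This is our own minimal illustration, assuming invented severity values on a 0-100 scale and an invented cut-point; neither comes from the report.

```python
# Hypothetical illustration: two trial arms with different underlying
# severities (0-100, invented numbers), scored by a coarse instrument
# that only distinguishes 'healthy' from 'ill' and by a finer 0-10 scale.
severities_arm_a = [70, 75, 80]   # treated arm: somewhat less severe
severities_arm_b = [85, 90, 95]   # control arm: somewhat more severe

def coarse(severity, cutoff=50):
    # One crude distinction: 1 = ill, 0 = healthy (cut-point assumed).
    return 1 if severity > cutoff else 0

def fine(severity):
    # An 11-point scale: graded distinctions within the 'ill' range.
    return severity // 10

# The coarse instrument scores every patient in both arms identically...
print([coarse(s) for s in severities_arm_a])  # [1, 1, 1]
print([coarse(s) for s in severities_arm_b])  # [1, 1, 1]
# ...while the finer instrument separates the arms.
print([fine(s) for s in severities_arm_a])    # [7, 7, 8]
print([fine(s) for s in severities_arm_b])    # [8, 9, 9]
```

However reliably the coarse instrument reproduces its healthy/ill classification, the between-arm difference is invisible to it, which is the sense in which high reliability can coexist with low precision.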

There are a number of ways in which the issue of precision has been raised in relation to patient-based outcome measures. This is fairly disparate evidence and it is reviewed under a number of more specific headings.

Precision of response categories

One of the main influences on the precision of an instrument is the format of response categories; i.e. the form in which respondents are able to give their answers. At one extreme answers may be given by respondents in terms of very basic distinctions,
