Evaluating Patient-Based Outcome Measures - NIHR Health ...
Criteria for selecting a patient-based outcome measure
patients beyond a certain level (Ganiats et al., 1992).
Distribution of baseline scores
The responsiveness of an instrument may also be influenced by the relationship of its items to the distribution of levels of difficulty or severity in the underlying construct. As a hypothetical example, imagine an instrument designed to measure mobility in which the items mainly reflect 'easy' tasks; that is, the majority of respondents could be expected to report no problem with, for example, walking a very short distance. Because most items in the scale are 'easy', a large amount of change could be recorded (i.e. the patient reports change over the majority of items) even when only a small amount of real improvement had occurred. Stucki and colleagues (1995) show that the problem of the relationship of items to an underlying range of degrees of difficulty or seriousness is not entirely hypothetical. They provide evidence that many items from the physical ability scale of the SF-36 reflect intermediate rather than extreme levels of difficulty for patients undergoing total hip arthroplasty. Thus patients experiencing improvements at this intermediate level of physical difficulty can be expected to show high levels of gain according to the SF-36, at least in part because of the range of items. As Stucki and colleagues argue, this problem can arise from the way in which scales are often developed, as described in earlier sections of this report, with emphasis upon high levels of agreement between items on a scale (internal reliability) rather than upon items that reflect the full range of difficulty or severity of the underlying problem. We have already seen arguments against excessive reliance on inter-item agreement in developing instruments rehearsed by Kessler and Mroczek (1995) in the context of reliability, above. Here it is possible to see problems arising from excessive emphasis upon internal reliability in the context of responsiveness.
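The effect described above can be sketched in a small simulation. Everything here is hypothetical (made-up item difficulty thresholds and patient distribution, not data from Stucki and colleagues): a sum-scored scale whose items cluster at one level of difficulty records near-maximal change for a modest true improvement, whereas a scale whose items span the full range of difficulty does not.

```python
import random

random.seed(0)

def score(latent, thresholds):
    """Sum score: one point for each item the respondent can manage."""
    return sum(1 for t in thresholds if latent >= t)

# Hypothetical latent mobility on a 0-10 continuum.
# 'Clustered' scale: all items sit at an intermediate difficulty (~5),
# loosely analogous to the pattern Stucki et al. describe.
clustered = [4.5, 4.8, 5.0, 5.2, 5.5]
# 'Spread' scale: items cover the full range of difficulty.
spread = [1, 3, 5, 7, 9]

# A group of patients sitting just below the cluster of item difficulties,
# who then make a modest true gain on the latent continuum.
patients = [random.uniform(4.0, 4.5) for _ in range(200)]
improvement = 1.5

def mean_change(thresholds):
    return sum(score(p + improvement, thresholds) - score(p, thresholds)
               for p in patients) / len(patients)

# The clustered scale records change on (almost) every item;
# the spread scale records change on only one.
print(f"clustered items: mean change = {mean_change(clustered):.2f} of 5")
print(f"spread items:    mean change = {mean_change(spread):.2f} of 5")
```

The same modest latent improvement thus produces a far larger score change on the clustered scale, purely because of where its items sit.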
Summary
The need for an instrument to be responsive to changes that are of importance to patients is of evident importance in the context of clinical trials. Whilst there are no universally agreed methods for assessing this property, all approaches require, at a general level, evidence of statistically significant change of some form between observations made at separate times, when there is good reason to think that changes of importance to patients have occurred.
Precision
How precise are the scores of the instrument?
This review is primarily concerned with the use of patient-based outcome measures in the context of clinical trials. Investigators will need to examine the pattern of responses to health status measures in a trial to determine whether there are clear and important differences between the arms of a trial. They therefore need to examine a number of aspects of candidate instruments' numerical properties which have not been clearly delineated in the literature, but which relate to the precision of distinctions made by an instrument. Testa and Simonson (1996) refer to this property as 'sensitivity':

'Although a measure may be responsive to changes in Q (quality of life), gradations in the metric of Z (the instrument) may not be adequate to reflect these changes. Sensitivity refers to the ability of the measurement to reflect true changes or differences in Q' (1996: 836).
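Testa and Simonson's point can be illustrated with a deliberately coarse hypothetical instrument: if the metric of Z moves in steps larger than a true change in Q, that change goes unrecorded.

```python
def coarse_z(q):
    """Hypothetical instrument Z: scores quality of life Q (0-100) in steps of 25."""
    return 25 * round(q / 25)

before, after = 40, 50                    # a true 10-point improvement in Q
print(coarse_z(before), coarse_z(after))  # both map to 50: Z records no change
```

The underlying construct has genuinely improved, but the gradations of Z are too wide to reflect it.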
Stewart (1992) also refers to this property as 'sensitivity'. In particular, she refers to the number of distinctions an instrument makes: the fewer, the more insensitive it is likely to be. Kessler and Mroczek (1995) refer to this property as 'precision', which is probably less confusing since sensitivity has a number of other uses and meanings in this field. As Kessler and Mroczek argue, an instrument may have high reliability but low precision if it makes only a small number of crude distinctions with regard to a dimension of health. Thus, at the extreme, one instrument might distinguish with high reliability only between those who are healthy and those who are ill. For the purposes of a trial, such an instrument would not be useful, because it is degrees of change within the category of 'unwell' that are likely to be needed to evaluate the results of the arms of the trial.
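Kessler and Mroczek's contrast can also be sketched in a simulation. All of the details here are invented (latent-health distributions, cut-points, arm sizes): a two-category healthy/ill instrument returns near-identical mean scores in two trial arms that a ten-category instrument clearly separates, because the arms differ only within the 'unwell' range.

```python
import random

random.seed(1)

# Hypothetical trial: both arms are 'unwell' (latent health 0-100, mostly < 50),
# but the treatment arm is somewhat better off on average.
control   = [random.gauss(30, 8) for _ in range(100)]
treatment = [random.gauss(38, 8) for _ in range(100)]

def crude(latent):
    """Two-category instrument: healthy (1) vs ill (0), cut at 50."""
    return 1 if latent >= 50 else 0

def graded(latent):
    """Ten-category instrument: equal-width bands across 0-100."""
    return min(9, max(0, int(latent // 10)))

def arm_means(instrument):
    return (sum(map(instrument, control)) / len(control),
            sum(map(instrument, treatment)) / len(treatment))

print("crude: ", arm_means(crude))   # the arms barely separate
print("graded:", arm_means(graded))  # the arms separate clearly
```

The crude instrument could still be highly reliable (repeat administrations would classify almost everyone identically), yet it contributes almost nothing to distinguishing the arms of this hypothetical trial.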
The issue of precision has been raised in a number of ways in relation to patient-based outcome measures. The evidence is fairly disparate and is reviewed under a number of more specific headings.
Precision of response categories
One of the main influences on the precision of an instrument is the format of its response categories; that is, the form in which respondents are able to give their answers. At one extreme, answers may be given by respondents in terms of very basic distinctions,