Evaluating Patient-Based Outcome Measures - NIHR Health ...
Criteria for selecting a patient-based outcome measure
completed both (King et al., 1996). The reason for the low level of agreement is that the items of one scale focus upon companionship of family and friends, whilst the other instrument's social scale focuses upon the impact of disease on social activities. The same degree of disparate content was found in the social dimensions of instruments used to assess well-being in patients with rheumatoid arthritis (Fitzpatrick et al., 1991). Instruments may also differ in less obvious ways in their content when assessing dimensions such as physical function, about which more agreement might be expected. For example, the physical function of patients with rheumatoid arthritis is assessed in one health status instrument by items that ask respondents how much help they need to perform particular tasks, whereas another instrument addresses similar tasks but its items elicit the degree of difficulty respondents experience with them (Ziebland et al., 1993).
One commonly recommended solution to ensure that a trial will have an appropriate set of outcome measures is that one disease-specific and one generic instrument be used to assess outcomes (Cox et al., 1992; Bombardier et al., 1995). In this way, it is reasonably likely that both important proximal and distal effects of a treatment will be captured: the most immediate effects upon the disease as well as possible consequences that are harder to anticipate.
Summary
In more general terms, the appropriateness of an instrument for a trial will involve considering the other criteria we have identified and discuss below: evidence of reliability, feasibility, and so on. In the more specific terms with which we have summarised the rather disparate literature on appropriateness, the term requires that investigators consider as directly as possible how well the content of an instrument matches the intended purpose of their specific trial.
Reliability
Does the instrument produce results that are reproducible and internally consistent?
Reliability is concerned with the reproducibility and internal consistency of a measuring instrument. It assesses the extent to which the instrument is free from random error and may be considered as the amount of a score that is signal rather than noise. It is a very important property of any patient-based outcome measure in a clinical trial because it is essential to establish that any changes observed in a trial are due to the intervention and not to problems in the measuring instrument. As the random error of such a measure increases, so the size of the sample required to obtain a precise estimate of effects in a trial will increase. An unreliable measure may therefore underestimate the size of benefit obtained from an intervention. The reliability of a particular measure is not a fixed property, but is dependent upon the context and population studied (Streiner and Norman, 1995).
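The attenuating effect of random measurement error described above can be illustrated with a small simulation. This is a hypothetical sketch, not an analysis from the report: the numbers (a true standardised effect of 0.5, unit true-score variance) are invented, and reliability is read as the share of observed-score variance that is true-score variance.

```python
import random
import statistics

random.seed(1)

def observed_effect(true_effect, error_sd, n=5000):
    """Standardised group difference after adding random measurement error.

    Scores are a true component (SD 1) plus Gaussian error of SD error_sd;
    the treated group's true scores are shifted by true_effect.
    """
    treated = [true_effect + random.gauss(0, 1) + random.gauss(0, error_sd)
               for _ in range(n)]
    control = [random.gauss(0, 1) + random.gauss(0, error_sd)
               for _ in range(n)]
    # Standardise by the SD of the combined sample (a crude pooling)
    pooled_sd = statistics.pstdev(treated + control)
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

for error_sd in (0.0, 0.5, 1.0):
    # Reliability = var(true) / var(observed) under the classical model
    reliability = 1.0 / (1.0 + error_sd ** 2)
    effect = observed_effect(0.5, error_sd)
    print(f"reliability={reliability:.2f}  standardised effect={effect:.2f}")
```

As the error variance grows (reliability falls), the same true benefit appears smaller in standardised terms, which is why a larger sample is then needed for the same precision.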
The degree of reliability required of an instrument used to assess individuals is higher than that required to assess groups (Williams and Naylor, 1992; Nunnally and Bernstein, 1994). As is described below, reliability coefficients of 0.70 may be acceptable for measures in a study of a group of patients in a clinical trial. However, Nunnally and Bernstein (1994) recommend a reliability level of at least 0.90 for a measure if it is going to be used for decisions about an individual on the basis of his or her score. This higher requirement is because the confidence interval around an individual's true score is wide at reliabilities below this recommended level (Hayes et al., 1993). For a similar reason, Jaeschke and colleagues (1991) express extreme caution about the interpretation of QoL scores in n-of-1 trials. Our concern is with group applications such as trials, where the confidence interval around an estimate of the reliability of a measure narrows as sample size increases.
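The width of the interval around an individual's true score can be made concrete via the standard error of measurement from classical test theory, SEM = SD × √(1 − reliability). The scale below (scored 0–100 with a population SD of 10) is a hypothetical example, not one of the instruments discussed in the report:

```python
import math

def true_score_interval(score, sd, reliability, z=1.96):
    """Approximate 95% interval for an individual's true score.

    SEM = sd * sqrt(1 - reliability); interval = score +/- z * SEM.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return (score - z * sem, score + z * sem)

for r in (0.70, 0.90):
    lo, hi = true_score_interval(50, 10, r)
    print(f"reliability={r:.2f}: 95% interval {lo:.1f} to {hi:.1f} "
          f"(width {hi - lo:.1f})")
```

At a reliability of 0.70 the interval spans roughly 21 scale points, against about 12 at 0.90, which is why the higher threshold is recommended before acting on an individual's score.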
In practice, the evaluation of reliability is in terms of two different aspects of a measure: internal consistency and reproducibility (sometimes referred to as 'equivalence' and 'stability' respectively (Bohrnstedt, 1983)). The two measures derive from classical measurement theory, which regards any observation as the sum of two components: a true score and an error term (Bravo and Potvin, 1991).
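The classical model can be sketched numerically. In this hypothetical simulation (the variances are invented for illustration), each observed score is a true score plus independent random error, and the test–retest correlation of two administrations recovers the theoretical reliability, var(true) / (var(true) + var(error)):

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation, computed directly for portability."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

random.seed(7)
n = 20000
# True scores: mean 50, SD 10; measurement error: SD 5 on each occasion
true_scores = [random.gauss(50, 10) for _ in range(n)]
test = [t + random.gauss(0, 5) for t in true_scores]
retest = [t + random.gauss(0, 5) for t in true_scores]

theoretical = 10 ** 2 / (10 ** 2 + 5 ** 2)  # var(true) / var(observed) = 0.8
retest_r = pearson(test, retest)
print(f"theoretical reliability {theoretical:.2f}, "
      f"test-retest estimate {retest_r:.2f}")
```

The simulated test–retest correlation lands close to the theoretical 0.80, illustrating why reproducibility across administrations is treated as an estimate of reliability under this model.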
Internal consistency
Normally, more than one questionnaire item is used to measure a dimension or construct. This is because of a basic principle of measurement that several related observations will produce a more reliable estimate than one. For this to be true, the items all need to be homogeneous, that is, all measuring aspects of a single attribute or construct rather than different constructs (Streiner and Norman, 1995). The practical consequence of this expectation is that individual items should highly correlate with each other