Evaluating Patient-Based Outcome Measures - NIHR Health ...
Criteria for selecting a patient-based outcome measure<br />
an instrument has a high level of precision because<br />
scores are expressed as percentages, the range of<br />
actual possible values may still be quite small and<br />
scores are in no sense interval.<br />
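The point about percentage scores can be illustrated with a minimal sketch. It assumes a hypothetical instrument of five yes/no items, each scored 0 or 1 and summed; the function name and item count are illustrative and are not taken from any instrument discussed here.

```python
def possible_percentage_scores(n_items):
    """Return every percentage score attainable from n_items items scored 0 or 1."""
    return [100.0 * k / n_items for k in range(n_items + 1)]


# Hypothetical five-item instrument: the scale runs from 0 to 100,
# yet a respondent can only ever land on one of a handful of values.
scores = possible_percentage_scores(5)
n_distinct = len(set(scores))
print(scores)
print("distinct attainable values:", n_distinct)
```

Under these assumptions the apparent 0 to 100 range conceals only six attainable values, so the percentage format adds no real precision.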
In contrast to such common-sense methods<br />
of weighting are efforts to assess directly the<br />
relative severity or undesirability of different<br />
states. The SIP is an example of an instrument<br />
with a more sophisticated and more explicitly<br />
derived weighting system. Once the questionnaire<br />
items for the instrument had been identified, a<br />
panel of patients, health professionals and preprofessional<br />
students used category scaling to<br />
assign weights to items by making judgements<br />
of the relative severity of dysfunction of items<br />
(Bergner et al., 1976). To illustrate the impact of<br />
this weighting approach to questionnaire items,<br />
in the English version of the instrument, the most<br />
severe items in the body care and movement scale<br />
are ‘I am in a restricted position all the time’<br />
(–124) and ‘I do not have control of my bowels’<br />
(–124), whereas the least severe items are ‘I dress<br />
myself but do so very slowly’ (–043) and ‘I am very<br />
clumsy’ (–047). Separate weighting exercises on<br />
American and English versions by separate panels<br />
in the two language communities arrived at very<br />
similar weightings for items for the SIP (Patrick<br />
et al., 1985). Other instruments that include such<br />
explicitly derived weighting systems include the<br />
Nottingham Health Profile (NHP), QWB and<br />
EQ-5D.<br />
Two particularly striking problems emerge when the<br />
numerical values used in different patient-based<br />
outcome measures are examined. On the one hand, many<br />
instruments use methods of scoring items that are<br />
deceptively simple. Such scoring may nevertheless<br />
require strong assumptions; for example, if scores<br />
are analysed as interval-scale scores, the difference<br />
between the first and second responses on a<br />
five-point Likert scale is treated as equal to the<br />
difference between the fourth and fifth responses.<br />
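The equal-interval assumption can be made concrete with a short sketch. The two groups of responses and the alternative category values below are hypothetical, chosen only to show how a comparison of means depends on the spacing assumed between response categories.

```python
from statistics import mean

# Conventional coding: the five response categories are treated as equally spaced.
EQUAL_INTERVAL = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

# Hypothetical alternative values for the same categories (illustrative only):
# the order is preserved but the gaps between categories are unequal.
UNEQUAL_INTERVAL = {1: 0, 2: 10, 3: 50, 4: 60, 5: 100}


def mean_score(responses, coding):
    """Mean score of a group after mapping each response category to a value."""
    return mean(coding[r] for r in responses)


# Hypothetical responses (category chosen by each respondent).
group_a = [2, 2, 2, 2]
group_b = [1, 1, 3, 3]

equal_a = mean_score(group_a, EQUAL_INTERVAL)
equal_b = mean_score(group_b, EQUAL_INTERVAL)
unequal_a = mean_score(group_a, UNEQUAL_INTERVAL)
unequal_b = mean_score(group_b, UNEQUAL_INTERVAL)

print("equal spacing:  ", equal_a, equal_b)
print("unequal spacing:", unequal_a, unequal_b)
```

With equal spacing the two groups have identical means; with the alternative spacing they differ. Neither coding is claimed to be correct; the sketch only shows that the conclusion rests on the spacing assumed.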
On the other hand, scoring methods that attempt<br />
to estimate the values of response categories<br />
directly, such as the weighting system of the SIP,<br />
risk being deceptively precise. Their numerical<br />
exactness might lend pseudo-precision to an<br />
instrument. For investigators examining the<br />
numerical values of instruments, it is sensible to<br />
treat all scoring methods as weighted, differing<br />
only in how transparent the weights are, and to look<br />
beyond superficial aspects of precision to examine<br />
how weightings have been derived and validated.<br />
More pragmatically, it is appropriate to ask<br />
whether weighting systems make a difference<br />
(Björk and Roos, 1994). Sensitivity analysis may<br />
reveal that they make no significant difference to<br />
results. For example, Jenkinson and colleagues<br />
(1991) analysed patterns of change over time in<br />
health status for patients with rheumatoid arthritis<br />
by means of the FLP and NHP. Sensitivity to<br />
change as indicated by a battery of other clinical<br />
and laboratory measures was very similar, whether<br />
weighted or unweighted (items valued as ‘1’ or<br />
‘0’) versions of the instruments were used. Other<br />
studies have similarly suggested that weighted<br />
scales may not improve upon the sensitivity of<br />
unweighted scales (O’Neill et al., 1996).<br />
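A sensitivity analysis of this kind can be sketched as follows. The item weights and responses are hypothetical and are not taken from the SIP, FLP or NHP; the sketch simply scores the same responses with and without weights and checks how closely the two sets of scores agree.

```python
from math import sqrt

# Hypothetical item weights (illustrative only; not from any real instrument).
WEIGHTS = [12.4, 4.3, 4.7, 8.0, 6.1]

# Hypothetical responses, one row per respondent: 1 = item endorsed, 0 = not.
RESPONSES = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]


def weighted_score(answers, weights):
    """Sum of the weights of the endorsed items."""
    return sum(a * w for a, w in zip(answers, weights))


def unweighted_score(answers):
    """Count of endorsed items (each item valued as 1 or 0)."""
    return sum(answers)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


weighted = [weighted_score(row, WEIGHTS) for row in RESPONSES]
unweighted = [unweighted_score(row) for row in RESPONSES]
r = pearson(weighted, unweighted)

print("weighted:  ", weighted)
print("unweighted:", unweighted)
print("correlation:", round(r, 3))
```

If the correlation is very high and respondents are ranked in the same order under both scorings, as in this contrived data, the weights are unlikely to change substantive conclusions. Real data may of course behave differently.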
The response format of a patient-based outcome<br />
measure to some extent determines the kinds of<br />
statistical tests that may be used on it. This is here<br />
considered an aspect of precision in the sense that<br />
many instruments contain items that are at best<br />
ordinal in form (i.e. questionnaire items where<br />
there is an implied rank to responses: ‘very often’,<br />
‘quite often’ etc.) but not interval (i.e. where the<br />
interval between responses is of known value) or<br />
ratio (where there is a meaningful zero point). It<br />
might be argued that instruments that have only<br />
ordinal level measurement properties are capable<br />
of less precision (Haig et al., 1986). Certainly, a<br />
review of the statistical properties of a series of<br />
health status scales published in the literature<br />
concluded that the majority of scales were presented<br />
and analysed as if based on interval-level measurement<br />
when this property had not been established (Coste et al.,<br />
1995). Whilst it might be argued that an advantage<br />
of visual analogue scales over Likert-format answers<br />
is that they would enable more extensive use of<br />
parametric statistics, this needs to be balanced<br />
against the lower acceptability of visual analogue<br />
scale techniques and the risk of pseudo-precision<br />
that they involve (Aaronson, 1989).<br />
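The practical consequence of ordinal-level measurement can be sketched with a rank-based statistic. The example below hand-computes the Mann-Whitney U statistic, a standard non-parametric comparison of two groups, on hypothetical five-point responses. The recoding map is also hypothetical. The sketch shows that U depends only on the order of responses, not on the numerical values assigned to them.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical ordinal responses on a five-point scale.
group_a = [1, 2, 2, 3, 4]
group_b = [2, 3, 4, 4, 5]

# Hypothetical order-preserving recoding of the five categories.
RECODE = {1: 0, 2: 10, 3: 50, 4: 60, 5: 100}

u_original = mann_whitney_u(group_a, group_b)
u_recoded = mann_whitney_u(
    [RECODE[x] for x in group_a],
    [RECODE[x] for x in group_b],
)

print("U with original codes:", u_original)
print("U with recoded values:", u_recoded)
```

Because the statistic uses only comparisons between responses, any order-preserving recoding leaves it unchanged, whereas statistics based on means do not have this property.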
Mackenzie and Charlson (1986) reviewed trials<br />
employing ordinal scales in three medical journals<br />
over a 5-year period and found that many measures<br />
purporting to be ordinal were not. For example,<br />
values for the items of a scale were not truly<br />
hierarchical, so it was not clear whether lower<br />
numerical scores truly reflected worse<br />
underlying states.<br />
As Streiner and Norman (1995) point out, there is<br />
a large and unresolved literature as to the propriety<br />
of using interval-level statistics when it is unclear<br />
that there is a linear relationship of a measure to<br />
the underlying phenomenon. In practice, there<br />
may be many circumstances where cautious