29.06.2013 Views

Evaluating Patient-Based Outcome Measures - NIHR Health ...


Criteria for selecting a patient-based outcome measure

an instrument has a high level of precision because scores are expressed as percentages, the range of actual possible values may still be quite small and scores are in no sense interval.

In contrast to such common-sense methods of weighting are efforts to assess directly the relative severity or undesirability of different states. The SIP is an example of an instrument with a more sophisticated and more explicitly derived weighting system. Once the questionnaire items for the instrument had been identified, a panel of patients, health professionals and preprofessional students used category scaling to assign weights to items by making judgements of the relative severity of dysfunction of items (Bergner et al., 1976). To illustrate the impact of this approach to weighting questionnaire items: in the English version of the instrument, the most severe items in the body care and movement scale are ‘I am in a restricted position all the time’ (–124) and ‘I do not have control of my bowels’ (–124), whereas the least severe items are ‘I dress myself but do so very slowly’ (–043) and ‘I am very clumsy’ (–047). Separate weighting exercises on American and English versions by separate panels in the two language communities arrived at very similar weightings for items for the SIP (Patrick et al., 1985). Other instruments with such explicitly derived weighting systems include the Nottingham Health Profile (NHP), QWB and EQ-5D.

Two particularly striking problems emerge when the numerical values used in different patient-based outcome measures are examined. On the one hand, many instruments use methods of scoring items that are deceptively simple. Such scoring may nevertheless require strong assumptions: for example, analysing scores as interval-scale scores assumes that the difference between the first and second responses on a five-point Likert scale is the same as the difference between the fourth and fifth.
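
The equal-interval assumption can be illustrated with a minimal sketch. The responses and the alternative coding below are hypothetical and chosen only to make the point: two codings that both preserve the rank order of the five response categories can reverse the ordering of two group means.

```python
from statistics import mean

# Hypothetical responses to one five-point Likert item (1 = best, 5 = worst).
group_a = [3, 3, 3, 3]
group_b = [1, 1, 4, 5]

# Conventional coding: every step between adjacent categories counts as 1.
equal_steps = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
# Hypothetical alternative: same rank order, larger steps at the severe end.
unequal_steps = {1: 1, 2: 2, 3: 3, 4: 8, 5: 10}


def mean_score(responses, coding):
    """Mean of the coded responses, i.e. an interval-level summary."""
    return mean(coding[r] for r in responses)


def compare(coding):
    """Report which group has the higher mean score under a given coding."""
    a = mean_score(group_a, coding)
    b = mean_score(group_b, coding)
    if a > b:
        return "A higher"
    if b > a:
        return "B higher"
    return "equal"


print(compare(equal_steps))    # group A has the higher mean
print(compare(unequal_steps))  # group B has the higher mean
```

Nothing in the responses themselves indicates which coding is correct, which is why conclusions drawn from means of such scores rest on the assumed spacing between categories.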

On the other hand, scoring methods that attempt directly to estimate the values of response categories, such as the weighting system of the SIP, risk being deceptively precise. Their numerical exactness might lend pseudo-precision to an instrument. For investigators examining the numerical values of instruments, it is sensible to treat all scoring methods as weighted, differing only in how transparent weights are, and to look beyond superficial aspects of precision to examine how weightings have been derived and validated.

More pragmatically, it is appropriate to ask whether weighting systems make a difference (Björk and Roos, 1994). Sensitivity analysis may reveal that they make no significant difference to results. For example, Jenkinson and colleagues (1991) analysed patterns of change over time in health status for patients with rheumatoid arthritis by means of the FLP and NHP. Sensitivity to change as indicated by a battery of other clinical and laboratory measures was very similar, whether weighted or unweighted (items valued as ‘1’ or ‘0’) versions of the instruments were used. Other studies have similarly suggested that weighted scales may not improve upon the sensitivity of unweighted scales (O’Neill et al., 1996).
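
A sensitivity analysis of this kind can be sketched in a few lines. The example below is illustrative only: the item weights, the patients and their responses are all hypothetical, and it does not reproduce the method of any of the studies cited. It scores the same yes/no items with and without weights and checks whether the two versions agree on the direction of change for each patient.

```python
# Hypothetical severity weights for five yes/no items (not from any real instrument).
WEIGHTS = [12, 10, 4, 5, 8]

# Hypothetical patients: (items endorsed at baseline, items endorsed at follow-up).
PATIENTS = {
    "p1": ([1, 1, 0, 1, 0], [0, 1, 0, 0, 0]),
    "p2": ([0, 0, 1, 1, 0], [0, 0, 1, 0, 0]),
    "p3": ([1, 0, 1, 0, 1], [1, 0, 1, 1, 1]),
}


def unweighted_score(endorsed):
    """Percentage of items endorsed, each item valued as 1 or 0."""
    return 100 * sum(endorsed) / len(endorsed)


def weighted_score(endorsed, weights=WEIGHTS):
    """Weights of endorsed items as a percentage of the maximum possible total."""
    return 100 * sum(w for e, w in zip(endorsed, weights) if e) / sum(weights)


def direction(change):
    """Return -1, 0 or 1 for a falling, unchanged or rising score."""
    return (change > 0) - (change < 0)


def directions_agree(before, after):
    """True if weighted and unweighted scores change in the same direction."""
    unweighted_change = unweighted_score(after) - unweighted_score(before)
    weighted_change = weighted_score(after) - weighted_score(before)
    return direction(unweighted_change) == direction(weighted_change)


agreement = {pid: directions_agree(b, a) for pid, (b, a) in PATIENTS.items()}
print(agreement)
```

In this toy data set the two versions agree for every patient. With real data the comparison would normally be made against external clinical measures, and agreement cannot be assumed in advance.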

The response format of a patient-based outcome measure to some extent determines the kinds of statistical tests that may be used on it. This is considered here an aspect of precision in the sense that many instruments contain items that are at best ordinal in form (i.e. questionnaire items where there is an implied rank to responses: ‘very often’, ‘quite often’ etc.) but not interval (i.e. where the interval between responses is of known value) or ratio (where there is a meaningful zero point). It might be argued that instruments that have only ordinal level measurement properties are capable of less precision (Haig et al., 1986). Certainly, a review of the statistical properties of a series of health status scales published in the literature concluded that the majority of scales were presented and analysed as if based on interval-level measurement when this property had not been established (Coste et al., 1995). Whilst it might be argued that an advantage of visual analogue scales over Likert-format answers is that they would enable more extensive use of parametric statistics, this needs to be balanced against the lower acceptability of visual analogue scale techniques and the risk of pseudo-precision that this technique involves (Aaronson, 1989).

Mackenzie and Charlson (1986) reviewed trials employing ordinal scales in three medical journals over a 5-year period and found that many measures purporting to be ordinal were not. For example, values for the items of a scale were not truly hierarchical, so it was not clear whether lower numerical scores truly reflected worse underlying states.

As Streiner and Norman (1995) point out, there is a large and unresolved literature as to the propriety of using interval level statistics when it is unclear that there is a linear relationship of a measure to the underlying phenomenon. In practice, there may be many circumstances where cautious
