Evaluating Patient-Based Outcome Measures - NIHR Health ...

assumption of interval properties with ordinally based data does not seriously mislead.

One quite practical illustration of the need for caution is in the calculation of sample sizes for trials using patient-based outcome measures as a primary end-point. Using the SF-36 as an example, Julious and colleagues (1995) show that, for those dimensions of the SF-36 where the distribution of scores is highly skewed, sample size calculations differ considerably depending on whether parametric or non-parametric methods are used to estimate the required sample size.
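To make the parametric side of this comparison concrete, the following sketch implements the standard normal-approximation sample-size formula for comparing two means. The numerical values (a 10-point difference with SD 20) are illustrative assumptions only, not figures from Julious and colleagues (1995).

```python
# Sketch (illustrative, not from the report): the standard parametric
# sample-size formula for a two-group comparison of means, as used when
# a questionnaire dimension is treated as a normal interval scale.
from statistics import NormalDist
from math import ceil

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference `delta`
    with common standard deviation `sd` (two-sided test)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_b = z.inv_cdf(power)           # value corresponding to desired power
    return ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

# Assumed values: a 10-point difference on a 0-100 dimension with SD 20.
print(n_per_group(delta=10, sd=20))  # 63 per group
```

For a highly skewed dimension this normal-theory figure can diverge substantially from the answer given by a non-parametric calculation (for example, one based on the Mann-Whitney test), which is the point of the comparison cited above.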

Distribution of items over true range

The items and scores of different instruments may vary in how well they capture the full underlying range of problems experienced by patients. It is not easy to examine the relationship between the distinctions made by a measuring instrument and the true distribution of actual experiences, for the obvious reason that one usually does not have access to the true distribution other than through one's measuring instrument. Nevertheless, a number of arguments have been put forward to show that this is a real issue. Kessler and Mroczek (1995) illustrated this problem by means of an instrument to measure psychological distress. They showed that it was possible to select short-form versions, comprising small numbers of items taken from a full set of 32 items measuring distress, that, whilst all having the same reliability as the full 32-item scale and agreeing strongly with the total scale, differed markedly in their ability to discriminate between distressed and non-distressed individuals at different levels of an overall continuum of severity of psychological distress. One short-form version was most discriminating at low levels of distress, another at higher levels, and so on. A comparison that makes this point more intuitively understandable would be a range of intelligence tests: at one extreme, a test that could distinguish the very cleverest as a category from all others, who would be grouped together; at the opposite extreme, tests that would sensitively distinguish those with very low intelligence from all others. The ideal test, whether of health or intelligence, would have equal precision at every level.

Health Technology Assessment 1998; Vol. 2: No. 14

Another illustration of the problematic relationship between items and the 'true' distribution of what is being measured is provided by Stucki and colleagues' (1996) analysis of SF-36 physical ability scores in patients undergoing total hip replacement surgery. They showed that many of the items of this scale represent moderate levels of difficulty for patients to perform (e.g. 'bending, kneeling or stooping'); by contrast, there were only a few items that almost everyone could do with no difficulty (e.g. 'bathing and dressing yourself') and only a few items that were difficult for the majority to perform (e.g. 'walking more than a mile'). A direct consequence of this is that patients passing a difficulty level in the middle of this scale of the SF-36 are likely to have larger change scores than patients undergoing change at either the top or bottom of the range of difficulty of items, simply because of the larger number of items assessing moderate levels of difficulty. A similar set of observations about the 'maldistribution' of items of this scale of the SF-36 was made by another group of investigators (Haley et al., 1994). The most obvious consequences of this effect are two-fold: (i) the meaning of change scores for an instrument may need to be interpreted in the light of patients' baseline scores and (ii) instruments may need to ensure a more even distribution of items across the range of levels of severity or difficulty.
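The baseline-dependence of change scores described above can be sketched with a toy scale whose item difficulties cluster at moderate levels. The thresholds below are hypothetical, not the actual SF-36 items: the scale score is simply the number of items the patient can manage, and the same underlying improvement produces very different change scores depending on where the patient starts.

```python
# Sketch (hypothetical item difficulties, not the actual SF-36):
# a patient's scale score is the number of items whose difficulty
# threshold lies at or below their underlying ability (0-100 axis).
def scale_score(ability, thresholds):
    return sum(1 for t in thresholds if ability >= t)

# Ten items clustered at moderate difficulty (40-60), with only one
# very easy item (5) and one very hard item (95) -- a 'maldistribution'.
thresholds = [5, 40, 44, 48, 50, 52, 54, 56, 60, 95]

def change_score(baseline, gain=10):
    """Change in scale score for a fixed 10-point gain in ability."""
    return (scale_score(baseline + gain, thresholds)
            - scale_score(baseline, thresholds))

print(change_score(45))  # mid-range baseline: 4
print(change_score(5))   # near the floor: 0
print(change_score(80))  # near the ceiling: 0
```

The identical underlying gain registers as a four-item change in the middle of the range but as no change at all near the extremes, which is why baseline scores matter when interpreting change.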

Stucki and colleagues (1996) examined the distribution of items for the physical scale of the SF-36 using a variety of statistical methods, including Rasch analysis (discussed on page 37). It should also be possible for investigators to inspect instruments at a more informal and intuitive level, to consider whether there may be problems in the distribution of items in relation to the intended trial and patient group, and to see whether particular levels of severity of illness are under-represented.

Ceiling and floor effects in relation to precision

The problem of ceiling and floor effects has already been considered in the context of responsiveness. These effects are mentioned again here because, essentially, they may be viewed as problems arising from the precision and distribution of items in questionnaires. Studies were cited above in the context of responsiveness (Bindman et al., 1990; Gardiner et al., 1993) in which convincing evidence was found that some instruments did not allow patients with poor health status to report further deterioration; the questionnaires did not include items capturing the poorest levels of health. Potential solutions, depending on the overall format of the instrument, include adding a response category such as 'extremely poor' to questions and increasing the range of items, particularly those addressing more severe experiences. Bindman and colleagues (1990) suggest adding transition questions which directly invite respondents to say whether they are worse or better than at a previous assessment. However, this is an unwieldy solution in terms of the formatting of questionnaires.
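A simple first check for the floor and ceiling problems described above is to count the proportion of respondents at the scale's minimum and maximum scores. The helper and the 15% flagging threshold below are illustrative conventions, not rules taken from this report.

```python
# Sketch: screen a set of scale scores for floor and ceiling effects
# by counting respondents at the extremes of the scoring range.
def floor_ceiling(scores, minimum=0, maximum=100):
    """Return (floor %, ceiling %) -- the percentage of respondents
    at the lowest and highest possible scores. A common rule of thumb
    (an assumption here, not from the report) flags a problem when
    either percentage exceeds roughly 15%."""
    n = len(scores)
    floor = 100 * sum(s == minimum for s in scores) / n
    ceiling = 100 * sum(s == maximum for s in scores) / n
    return floor, ceiling

# Hypothetical data: a severely ill sample piled up at the bottom of
# the scale, leaving no room to record further deterioration.
scores = [0, 0, 0, 10, 20, 0, 35, 0, 50, 0]
print(floor_ceiling(scores))  # (60.0, 0.0)
```

A floor percentage this large would suggest the instrument lacks items capturing the poorest levels of health, exactly the defect reported in the studies cited above.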

