Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
assumption of interval properties with ordinally<br />
based data does not seriously mislead.<br />
One quite practical illustration of the need for<br />
caution is in the calculation of sample sizes for<br />
trials using patient-based outcome measures as a<br />
primary end-point. Using the SF-36 as example,<br />
Julious and colleagues (1995) show that for those<br />
dimensions of SF-36 where the distribution of<br />
scores are highly skewed, sample size calculations<br />
are very different if parametric and non-parametric<br />
methods are used to estimate required sample size.<br />
Distribution of items over true range<br />
The items and scores of different instruments may<br />
vary in how well they capture the full underlying<br />
range of problems experienced by patients. It is<br />
not easy to examine the relationship between the<br />
distinctions made by a measuring instrument and<br />
the true distribution of actual experiences, for the<br />
obvious reason that one usually does not have access<br />
to the true distribution other than through one’s<br />
measuring instrument. Nevertheless a number of<br />
arguments have been put forward that show that<br />
this is a real issue. Kessler and Mroczek (1995) have<br />
illustrated this problem by means of an instrument<br />
to measure psychological distress. They showed that<br />
it was possible to select short form versions of small<br />
numbers of items taken from a full set of 32 items<br />
measuring distress that, whilst all having the same<br />
reliability as the full 32-item scale and agreeing<br />
strongly with the total scale, differed markedly in<br />
ability to discriminate between distressed and not<br />
distressed individuals at different levels in an overall<br />
continuum of severity of psychological distress. One<br />
short form version was most discriminating at low<br />
levels of distress, and so on. A comparison that<br />
makes this point more intuitively understandable<br />
would be a range of intelligence tests, with, at one<br />
extreme a test that could distinguish the very<br />
cleverest as a category from all others who would<br />
be grouped together. At the opposite extreme,<br />
tests would sensitively distinguish those with very<br />
low intelligence from all others. The ideal test,<br />
whether of health or intelligence would have<br />
equal precision at every level.<br />
Another illustration of the problematic relationship<br />
between items and the ‘true’ distribution of<br />
what is being measured is provided by Stucki and<br />
colleagues’ (1996) analysis of SF-36 physical ability<br />
scores in patients undergoing total hip replacement<br />
surgery. They showed that many of the items<br />
of this scale represent moderate levels of difficulty<br />
for patients to perform (e.g. ‘bending, kneeling<br />
or stooping’); by contrast, there are only a few<br />
items that almost everyone could do with no<br />
<strong>Health</strong> Technology Assessment 1998; Vol. 2: No. 14<br />
difficulty (e.g. ‘bathing and dressing yourself’)<br />
and only a few items that were difficult for the<br />
majority to perform (e.g. ‘walking more than a<br />
mile’). A direct consequence of this is that patients<br />
passing a difficulty level in the middle of this scale<br />
of the SF-36 are more likely to have larger change<br />
scores than patients undergoing change at either<br />
the top or bottom of the range of difficulty of<br />
items, simply because of the larger number of items<br />
assessing moderate levels of difficulty. A similar set<br />
of observations about the ‘maldistribution’ of items<br />
of this scale of SF-36 was made by another group of<br />
investigators (Haley et al., 1994). The most obvious<br />
consequences of this effect are two-fold: (i) the<br />
meaning of change scores for instruments may<br />
need to be interpreted in the knowledge of baseline<br />
scores of patients and (ii) instruments may<br />
need to ensure a more even distribution of<br />
items across the range of levels of severity<br />
or difficulty.<br />
The distribution of items for the physical scale<br />
of SF-36 was examined by Stucki and colleagues<br />
(1996) by a variety of statistical methods including<br />
Rasch analysis (discussed on page 37) to address<br />
the issue of distribution. It should also be possible<br />
for investigators to inspect instruments at a more<br />
informal and intuitive level to consider whether<br />
there may be problems of the distribution of items<br />
in relation to the intended trial and patient group,<br />
to see whether particular levels of severity of illness<br />
are under-represented.<br />
Ceiling and floor effects in relation<br />
to precision<br />
The problem of ceiling and floor effects has already<br />
been considered in the context of responsiveness.<br />
They are mentioned again here because, essentially<br />
they may be viewed as problems arising from the<br />
precision and distribution of items in questionnaires.<br />
Studies were cited above in the context<br />
of responsiveness (Bindman et al., 1990; Gardiner<br />
et al., 1993) in which convincing evidence was<br />
found that some instruments did not allow patients<br />
with poor health status to report further deterioration.<br />
Questionnaires were found not to include<br />
items to capture the poorest levels of health. Potential<br />
solutions, depending on the overall format of<br />
the instrument, include adding a response category<br />
such as ‘extremely poor’ to questions and increasing<br />
the range of items, particularly addressing<br />
more severe experiences. Bindman and colleagues<br />
(1990) suggest adding transition questions which<br />
directly invite respondents to say whether they are<br />
worse or better than at a previous assessment.<br />
However, this is an unwieldy solution in terms<br />
of the formatting of questionnaires.<br />
35