Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
30<br />
Criteria for selecting a patient-based outcome measure<br />
considered small, 0.5 as medium and 0.8 or greater<br />
as large (Cohen, 1977, Kazis et al., 1989).<br />
Standardised response mean<br />
An alternative measure is the SRM. It only differs<br />
from an effect size in that the denominator is the<br />
standard deviation of change scores in the group in<br />
order to take account of variability in change rather<br />
than baseline scores (Liang et al., 1990). Because<br />
the denominator in the SRM examines response<br />
variance in an instrument whereas the effect size<br />
does not, Katz and colleagues (1992) consider<br />
that the SRM approach is more informative.<br />
Modified standardised response mean<br />
A third method of providing a standardised<br />
expression of responsiveness is that of Guyatt and<br />
colleagues (1987b). As with effect sizes and SRMs,<br />
the numerator of this statistic is the mean change<br />
score for a group. In this case, the denominator is<br />
the standard deviation of change scores for individuals<br />
who are identified by other means as stable<br />
(Tuley et al., 1991). This denominator provides an<br />
expression of the inherent variability of changes in<br />
an instrument, ‘an intuitive estimate of background<br />
noise’ (Liang, 1995). Unlike the two other expressions,<br />
this method requires independent evidence<br />
that patients are indeed stable, in the form of a<br />
transition question asked at follow-up. MacKenzie<br />
and colleagues (1986b) used a transition index<br />
and the modified SRM to test the responsiveness<br />
of the SIP.<br />
Relative efficiency<br />
Another approach is to compare the responsiveness<br />
of health status instruments when used in studies<br />
of treatments widely considered to be effective, so<br />
that it is very likely that significant changes actually<br />
occur. As applied by Liang and colleagues (1985)<br />
who developed this approach, the performance<br />
of different health status instruments is compared<br />
to a standard instrument amongst patients who<br />
are considered to have experienced substantial<br />
change. Thus they asked patients to complete a<br />
number of health status questionnaires before<br />
and after total joint replacement surgery. <strong>Health</strong><br />
status questionnaires were considered most<br />
responsive that produced the largest paired t-test<br />
score for pre and post surgical assessments. Liang<br />
and colleagues (1985) produce a standardised<br />
version of the use of t-statistics (termed ‘relative<br />
efficiency’), the square of the ratio of t-statistic<br />
for two instruments being compared. As noted<br />
earlier, much of the data in this field is nonparametric,<br />
so that t statistics are not appropriate<br />
and non-parametric forms of relative efficiency<br />
need to be used.<br />
Sensitivity and specificity of<br />
change scores<br />
Another approach to assessing responsiveness<br />
is to consider the change scores of a health status<br />
instrument as if they were a screening test to<br />
detect true change (Deyo and Inui, 1984). In<br />
other words, data for a health status measure<br />
are examined in terms of the sensitivity of change<br />
scores produced by an instrument (proportion<br />
of true changes detected) and specificity of<br />
change scores (proportion of individuals who<br />
are truly not changing, detected as stable). This<br />
method requires that investigators identify<br />
somewhat arbitrarily a specific change score<br />
that will be taken as of interest to examine, say an<br />
improvement of five points between observations.<br />
The sample is then divided according to whether<br />
or not they have reported five points improvement<br />
or not. Some external standard is also needed to<br />
determine ‘true’ change; commonly it is a<br />
consensus of the patient’s and or clinician’s<br />
retrospective judgement of change (Deyo and<br />
Inui, 1984). Essentially, the extent of agreement<br />
is then examined between change as defined in<br />
this hypothetical case as more than five points<br />
change and independent evidence of whether<br />
individuals have changed. The same analysis is<br />
then provided for specificity. Scores of less than<br />
five points are counted as ‘unchanged’. Individuals<br />
scores are then examined to determine how much<br />
this classification of individuals as unchanged<br />
agrees with independent evidence<br />
that they have not changed.<br />
This approach permits an assessment of the<br />
sensitivity and specificity of different amounts of<br />
change registered by an instrument. For example,<br />
an instrument which, for the sake of simplicity,<br />
has five possible scores (1–5), when applied on<br />
two occasions over time, may either produce<br />
identical scores on both occasions or register<br />
up to four points of change (for example a<br />
change from ‘5’ to ‘1’). If one has an external<br />
standard of whether change truly occurred, such<br />
as a consensus of patient’s and doctor’s opinion,<br />
one can assess how sensitive and specific to the<br />
external evidence of change are changes of one<br />
point, two points etc. As sensitivity improves,<br />
specificity may deteriorate.<br />
Receiver-operating characteristics<br />
Deyo and Centor (1986) extend the principle<br />
of measuring the sensitivity and specificity of<br />
an instrument against an external criterion by<br />
suggesting that information be synthesised into<br />
receiver-operating characteristics. Like the simpler<br />
version just described in the previous section, this