Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
Evaluating Patient-Based Outcome Measures - NIHR Health ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
26<br />
Criteria for selecting a patient-based outcome measure<br />
the product of these variables’ reliability coefficients.<br />
Therefore, given typical levels of reliability<br />
of patient-based variables, a correlation coefficient<br />
of 0.60 may be strong evidence in support of<br />
construct validity (McDowell and Newell, 1996).<br />
Because there is considerable vagueness and<br />
variability in the levels of correlation coefficients<br />
that authors cite as evidence of construct validity<br />
of new instruments, McDowell and Jenkinson<br />
(1996) recommend that expected correlations<br />
should be specified at the outset of studies to test<br />
instruments’ validity in order that it be possible<br />
for validity to be disproved.<br />
The most sophisticated form of testing construct<br />
validity is so-called ‘convergent and discriminant<br />
validity’ (Campbell and Fiske, 1959). This approach<br />
requires postulating that an instrument that we<br />
wish to test should have stronger relationships<br />
with some variables and weaker relationships with<br />
others. A new measure of mobility should correlate<br />
more strongly with existing measures of physical<br />
disability than with existing measures of emotional<br />
well-being. Essentially correlations are expected to<br />
be strongest with most related constructs and weakest<br />
with most distally related constructs. Typically,<br />
construct validity is examined by inspecting correlations<br />
of a new measure against a range of other<br />
evidence such as, disease staging, performance<br />
status, clinical or laboratory evidence of disease<br />
severity, illness behaviour, use of health services and<br />
related constructs of well-being (Spitzer et al., 1981;<br />
Fletcher, 1988; Sullivan et al., 1990; Aaronson et al.,<br />
1993). As an example of convergent and discriminant<br />
validation, Sullivan and colleagues (1990)<br />
examined validity of the scales of the SIP for use in<br />
rheumatoid arthritis with the expectation that the<br />
SIP physical function score should correlate most<br />
with various measures of disease severity and the<br />
SIP psychosocial scale should correlate most with<br />
other measures of mood and psychological wellbeing.<br />
Conversely, measures of physical function<br />
were expected to correlate less with measures of<br />
physical mood. Similarly, Morrow and colleagues<br />
(1992) examined the convergent-discriminant<br />
validity of the Functional Living Index-Cancer<br />
(FLIC) by examining relationships to other<br />
variables. As predicted the subscales of FLIC to<br />
measure gastrointestinal symptoms was significantly<br />
related to other ratings by patients of nausea and<br />
vomiting, but correlations were close to zero<br />
between psychological and social subscales of<br />
FLIC and patients’ separate reports of nausea<br />
and vomiting.<br />
The most demanding form of convergent and<br />
discriminant validity is the multitrait-multimethod<br />
matrix (Campbell and Fiske, 1959) in which two<br />
unrelated constructs are measured by two or more<br />
methods. In essence, it is expected that different<br />
measures of a single underlying trait should<br />
correlate most and different measures of different<br />
constructs least. Whilst potentially powerful in<br />
psychometric test development, it is difficult to<br />
apply to patient-based outcome measures because<br />
of problems of obtaining alternative methods of<br />
measuring constructs.<br />
Most instruments to assess outcomes from the<br />
patient’s point of view are multi-dimensional.<br />
They, for example, assess physical, psychological<br />
and social aspects of an illness within one questionnaire.<br />
This internal structure of an instrument can<br />
also be considered a set of assumed relationships<br />
between underlying constructs. At the very least,<br />
an instrument with sub-scales has implied that the<br />
instrument measures different underlying constructs<br />
by providing different sub-scales, rather than<br />
requiring that all items should simply be added to<br />
produce one score of one underlying construct.<br />
Normally instruments with multiple scales also<br />
assume particular underlying relationships for<br />
the constructs measured by the instrument; for<br />
example, scales of different aspects of emotional<br />
response to illness will correlate more with each<br />
other than with scales assessing physical function.<br />
This internal structure of instruments has also to<br />
be established by construct validation. The most<br />
common of methods for this purpose is statistical,<br />
particularly the use of factor analysis. Thus, factor<br />
analysis is often considered an aspect of<br />
construct validity.<br />
To simplify, factor analysis is the analysis of patterns<br />
of, in this field, items that go together to assess<br />
single underlying constructs. Typically, statistical<br />
analysis of answers of a sample of respondents to<br />
a pool of questionnaire items is used to reveal two<br />
or more sub-scales assessing distinct dimensions.<br />
In essence, the data can be checked to see whether<br />
individual questionnaire items correlate more with<br />
the scale of which they are a part and less with<br />
other scales to which they do not belong. Examples<br />
of instruments in which factor analysis has played<br />
a key role in establishing the internal structure<br />
of sub-scales include the Profile of Mood States<br />
(McNair et al., 1992) and the St George’s<br />
Respiratory Questionnaire (Jones et al., 1992).<br />
However, there are problems with factor analytic<br />
methods used in this context. Fayers and Hand<br />
(1997) provide important arguments and evidence<br />
against excessive reliance upon factor analysis<br />
alone to determine or evaluate the construct