Localized Supervised Metric Learning on ... - Researcher - IBM
Localized Supervised Metric Learning on Temporal Physiological Data<br />
Jimeng Sun, Daby Sow, Jianying Hu, Shahram Ebadollahi<br />
IBM T.J. Watson Research Center, New York, USA<br />
{jimeng, sowdaby, jyhu, ebad}@us.ibm.com<br />
Abstract<br />
Effective patient similarity assessment is important<br />
for clinical decision support. It enables the capture of<br />
past experience as manifested in the collective longitudinal<br />
medical records of patients to help clinicians assess<br />
the likely outcomes resulting from their decisions<br />
and actions. However, it is challenging to devise a patient<br />
similarity metric that is clinically relevant and semantically<br />
sound. Patient similarity is highly context-sensitive:<br />
it depends on factors such as the disease, the<br />
particular stage of the disease, and co-morbidities. One<br />
way to discern the semantics in a particular context is to<br />
take advantage of physicians’ expert knowledge as reflected<br />
in labels assigned to some patients. In this paper<br />
we present a method that leverages localized supervised<br />
metric learning to effectively incorporate such expert<br />
knowledge to arrive at semantically sound patient similarity<br />
measures. Experiments using data obtained from<br />
the MIMIC II database demonstrate the effectiveness of<br />
this approach.<br />
1. Introduction<br />
Medical records capture both observations of patients’<br />
health status and the decisions and actions taken<br />
by clinicians and care providers. Buried inside these<br />
records are nuggets of insight on the temporal evolution<br />
pattern of patient health status, and the effects of different<br />
clinical decisions on the trajectory of a disease.<br />
Tapping into this source of insight can be achieved by<br />
developing techniques for measuring cross-patient similarities.<br />
These techniques have the potential to improve<br />
patients’ clinical outcomes as essential tools for diagnostic<br />
and prognostic decision support.<br />
Figure 1 illustrates several aspects of the scenario<br />
that drives our research in this area. In this figure, a patient<br />
with available observations up to a decision point<br />
is presented to the system. The cohort of patients who<br />
are clinically similar to the query patient is retrieved.<br />
A clinician looks up the decisions and actions applied to<br />
the retrieved cohort and their consequences, and decides<br />
on the best course of action for the current<br />
patient. In addition, she can project the trajectory<br />
of the patient’s health status, as captured by the patient’s<br />
clinical factors and biomarkers, under the regime of any<br />
particular decision she makes.<br />
There are three fundamental challenges that need to<br />
be addressed before such a decision support mechanism<br />
can be realized:<br />
1. Alignment of the trajectories of patients’ temporal<br />
characteristics to make the records amenable to<br />
semantically and clinically sound comparison,<br />
2. Devising similarity measures that can reflect the<br />
clinical proximity or disparity between different<br />
patients,<br />
3. Coupling between decisions and their consequences<br />
as manifested in patient prognosis.<br />
The focus of this paper is on the second task. We propose<br />
two different methods for feature generation over<br />
multi-dimensional temporal patient data, and adopt a localized<br />
supervised metric learning approach to arrive at<br />
a semantically sound similarity measure for retrieving<br />
patients represented in the multi-dimensional feature<br />
space. The proposed method is tested using the MIMIC<br />
II database, which consists of physiological waveforms<br />
and accompanying clinical data obtained for ICU patients<br />
[1]. The study is carried out on 74 patients from<br />
this database, categorized into two groups based on different<br />
clinical conditions. Comparisons against unsupervised<br />
metric learning approaches on classification and<br />
retrieval accuracy are presented to illustrate the performance<br />
benefit of the proposed approach.<br />
2. Related Work<br />
In [7], Saeed and Mark reported work on retrieving<br />
similar patients using the same database, where<br />
they employed a multi-resolution description scheme<br />
for physiological temporal ICU data and used unsupervised<br />
similarity metrics for retrieving patients. The focus<br />
of that work was more on the symbolic representation<br />
of the temporal data to make them amenable for<br />
comparison. In this paper, we leverage both statistical<br />
methods and Wavelet methods to extract features over<br />
the temporal data.<br />
Supervised metric learning has been studied in the<br />
past [11, 12, 5]. The goal has been to learn a distance<br />
metric such that samples in the same class are close and<br />
those in different classes are far away. The common<br />
treatment is to add constraints and regularization terms<br />
to the objective function and then to solve it using optimization<br />
methods. To avoid a large number of constraints,<br />
in this paper we model this problem as a trace<br />
ratio problem, which can be solved effectively (similar<br />
to Wang et al. [9]).<br />
Figure 1. Retrieving patients based on their clinical similarity to a query patient and using the retrieved patients to project the evolution of the patient’s clinical characteristics.<br />
3. Localized Supervised Metric Learning<br />
In this section we present the supervised metric<br />
learning problem in the context of patient similarity<br />
measures. When a physician looks for similar patients<br />
in a database, the similarity is often based not only<br />
on quantitative measurements such as lab results, sensor<br />
measurements, age and sex, but also on the physician’s<br />
assessment of the disease type and stage. The<br />
assessment would potentially influence the relative importance<br />
a physician places on different measurements<br />
or groups of measurements. To compute this specific<br />
notion of similarity, we propose to learn a distance metric<br />
that can automatically adjust the importance of each<br />
numeric feature by leveraging the physician’s belief.<br />
Formally, quantitative measurements of a patient are<br />
represented by an N-dimensional feature vector $x$. Examples<br />
of features are the mean and variance of the sensor<br />
measures, or Wavelet coefficients. The prior belief<br />
of physicians is captured as labels on some of the patients.<br />
With this formulation, our goal is to learn a generalized<br />
Mahalanobis distance between patient $x_i$ and<br />
patient $x_j$ defined as:<br />
$$d_m(x_i, x_j) = \sqrt{(x_i - x_j)^T P (x_i - x_j)} \qquad (1)$$<br />
where $P \in \mathbb{R}^{N \times N}$ is called the precision matrix. Matrix<br />
$P$ is positive semi-definite and is used to incorporate<br />
the correlations between different feature dimensions.<br />
The key is to learn the optimal $P$ such that the<br />
resulting distance metric has the following properties:<br />
• Within-class compactness: patients of the same label<br />
are close together;<br />
• Between-class scatterness: patients of different labels<br />
are far away from each other.<br />
To formally measure these properties, we use two kinds<br />
of neighborhoods as defined in [10]. The homogeneous<br />
neighborhood of $x_i$, denoted as $N_i^o$, is the set of $k$ nearest<br />
patients of $x_i$ with the same label. The heterogeneous<br />
neighborhood of $x_i$, denoted as $N_i^e$, is the set of $k$ nearest<br />
patients of $x_i$ with different labels.<br />
Based on these two neighborhoods, we define the local<br />
compactness of point $x_i$ as<br />
$$C_i = \sum_{x_j \in N_i^o} d_m^2(x_i, x_j) \qquad (2)$$<br />
and the local scatterness of point $x_i$ as<br />
$$S_i = \sum_{x_k \in N_i^e} d_m^2(x_i, x_k) \qquad (3)$$<br />
The discriminability of the distance metric $d_m$ is defined<br />
as<br />
$$J = \frac{\sum_i C_i}{\sum_i S_i} = \frac{\sum_i \sum_{x_j \in N_i^o} (x_i - x_j)^T P (x_i - x_j)}{\sum_i \sum_{x_k \in N_i^e} (x_i - x_k)^T P (x_i - x_k)} \qquad (4)$$<br />
The goal is to find a $P$ that minimizes $J$, which is<br />
equivalent to minimizing the local compactness and<br />
maximizing the local scatterness simultaneously. In<br />
contrast with linear discriminant analysis [4], which<br />
seeks a discriminant subspace in a global sense,<br />
the localized supervised metric aims to learn a distance<br />
metric with enhanced local discriminability. To minimize<br />
$J$, we formulate the problem as a trace ratio minimization<br />
problem [9] and use the decomposed Newton’s<br />
method to find the solution [6].<br />
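To make the objective concrete, the following minimal sketch evaluates the discriminability $J$ of Eq. (4) for a given precision matrix. It is an illustration only, not the optimization procedure of [6] or [9]. The function names, the toy data, and the choice of plain Euclidean distance in the original feature space for ranking neighbors are assumptions; the text does not state which distance defines the neighborhoods.

```python
def quad_form(P, v):
    """Return v^T P v for a square matrix P given as nested lists."""
    n = len(v)
    return sum(v[a] * P[a][b] * v[b] for a in range(n) for b in range(n))


def diff(a, b):
    """Element-wise difference a - b."""
    return [u - w for u, w in zip(a, b)]


def neighborhoods(X, y, i, k):
    """Return (homogeneous, heterogeneous) neighbor indices of point i.

    Assumption: neighbors are ranked by squared Euclidean distance
    in the original feature space.
    """
    order = sorted(
        (j for j in range(len(X)) if j != i),
        key=lambda j: sum(d * d for d in diff(X[i], X[j])),
    )
    same = [j for j in order if y[j] == y[i]][:k]
    other = [j for j in order if y[j] != y[i]][:k]
    return same, other


def discriminability(X, y, P, k):
    """J = (sum_i C_i) / (sum_i S_i), with d_m^2 computed from P."""
    compactness = 0.0
    scatterness = 0.0
    for i in range(len(X)):
        same, other = neighborhoods(X, y, i, k)
        compactness += sum(quad_form(P, diff(X[i], X[j])) for j in same)
        scatterness += sum(quad_form(P, diff(X[i], X[j])) for j in other)
    return compactness / scatterness


# Toy data (hypothetical): the first feature separates the two labels,
# the second feature is within-class noise.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
y = [0, 0, 1, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]
first_feature_only = [[1.0, 0.0], [0.0, 0.0]]

print(discriminability(X, y, identity, k=1))
print(discriminability(X, y, first_feature_only, k=1))
```

On this toy data, a precision matrix that ignores the noisy second feature yields a smaller $J$ than the identity, which is the behavior the minimization is meant to reward.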
Since $P$ is a low-rank positive semi-definite matrix,<br />
we can decompose the precision matrix as $P = WW^T$,<br />
where $W \in \mathbb{R}^{N \times d}$ and $d \le N$. The distance metric<br />
can be rewritten as $d_m(x_i, x_j) = \|W^T x_i - W^T x_j\|$.<br />
Therefore, the distance metric is equivalent to the Euclidean<br />
distance over the low-dimensional projection $W^T x$.<br />
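The equivalence between the generalized Mahalanobis distance and the Euclidean distance after projection can be checked numerically. The sketch below uses an arbitrary, hypothetical $W$ with $N = 3$ and $d = 2$ (not a learned matrix); the helper names are assumptions made for illustration.

```python
import math


def gram(W):
    """Return P = W W^T for W given as N rows of d columns."""
    n, d = len(W), len(W[0])
    return [[sum(W[a][c] * W[b][c] for c in range(d)) for b in range(n)]
            for a in range(n)]


def project(W, x):
    """Return the d-dimensional projection W^T x."""
    return [sum(W[r][c] * x[r] for r in range(len(x)))
            for c in range(len(W[0]))]


def mahalanobis(P, xi, xj):
    """Return sqrt((xi - xj)^T P (xi - xj)), as in Eq. (1)."""
    v = [a - b for a, b in zip(xi, xj)]
    n = len(v)
    return math.sqrt(sum(v[a] * P[a][b] * v[b]
                         for a in range(n) for b in range(n)))


# Hypothetical projection matrix and two hypothetical patients.
W = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]]
xi = [1.0, 2.0, 3.0]
xj = [0.0, 1.0, 5.0]

d_metric = mahalanobis(gram(W), xi, xj)
d_projected = math.dist(project(W, xi), project(W, xj))
print(d_metric, d_projected)
```

Both values agree up to floating-point error, so retrieval under the learned metric can be carried out with any standard Euclidean nearest-neighbor search over the projected points.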
4. Data Description and Feature Extraction<br />
We have used the physiological data for 74 patients<br />
obtained from the MIMIC II database [1] in our experiments.<br />
Each patient is represented with 5 streams<br />
of sensor readings, sampled at 1-minute intervals: 1)<br />
SpO2, 2) heart rate (HR), 3) mean ABP (ABPmean),<br />
4) systolic ABP (ABPSys), and 5) diastolic ABP (ABPDias).<br />
All patients belong to one of two groups, H or C.<br />
Those in group H (36 patients) had experienced Arterial<br />
Hypotensive Episode (AHE) events during the forecast<br />
window, whereas those in group C (38 patients) did not<br />
experience any AHE within the forecast window. The<br />
start of the forecast window is timestamped in the data<br />
set ($T_0$) and its duration is 1 hour, in which an episode<br />
of AHE can occur. For this study, we focus on a 2-hour<br />
window around $T_0$ for each patient. Figure 2 illustrates<br />
the data from two patients, in which samples in the<br />
H group show higher variability than those in the C group.<br />
Physicians actually use the variability level of ABP to<br />
diagnose AHE [2].<br />
We have used two different schemes to represent<br />
the 2-hour temporal data for each patient: a statistical<br />
time domain method and a wavelet domain method. In<br />
the former, we compute the mean and variance of data<br />
from each sensor for each patient. Thus, each patient is<br />
represented in the time domain with a 10-dimensional<br />
vector. In the latter, the wavelet coefficients of the 2-hour<br />
window from each sensor are computed. We use the<br />
Daubechies-4 Wavelet [3] and keep the top-10 coefficients.<br />
Finally, the coefficients from all 5 sensors are<br />
vectorized into a 50-dimensional feature vector for each<br />
patient.<br />
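The statistical time domain representation can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the function name, and the use of population variance are choices made here (the text does not say whether population or sample variance is used), and the readings are synthetic. The wavelet representation is not covered by this sketch.

```python
from statistics import fmean, pvariance

# Sensor order follows the list in the text.
SENSORS = ("SpO2", "HR", "ABPmean", "ABPSys", "ABPDias")


def statistical_features(streams):
    """Map {sensor name: readings} to [mean, variance] per sensor.

    Assumption: population variance; the source does not specify.
    """
    features = []
    for name in SENSORS:
        readings = streams[name]
        features.append(fmean(readings))
        features.append(pvariance(readings))
    return features


# Synthetic 2-hour window at 1-minute sampling (120 readings per sensor,
# assuming no missing samples).
toy_patient = {name: [float(t % 4) for t in range(120)] for name in SENSORS}
vector = statistical_features(toy_patient)
print(len(vector), vector[:2])
```

With 5 sensors and 2 statistics each, the resulting vector has the 10 dimensions described above.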
5. Experiments<br />
From the feature extraction step described in Section<br />
4, we obtain 74 N-dimensional feature vectors,<br />
where N = 10 for the statistical method and N = 50<br />
for the Wavelet method. We then compare the following<br />
three distance metrics using the leave-one-out<br />
paradigm:<br />
• Expert uses the Euclidean distance of the variance of<br />
the mean ABP, as suggested in [2];<br />
• PCA uses the Euclidean distance over low-dimensional<br />
points after PCA (an unsupervised<br />
metric learning algorithm);<br />
• LSML uses the localized supervised metric learning<br />
method described in Section 3.<br />
Figure 2. Examples of multivariate time series data for (a) the H group and (b) the C group. H group patients show higher variability than those in the C group.<br />
Note that we do not make comparisons with global supervised<br />
metric learning methods like LDA [4] because,<br />
as shown in [5, 8], a localized metric usually performs<br />
better. The performance metrics include k-NN classification<br />
error rate and precision@10 retrieval results.<br />
The precision@10 of a query point is computed by retrieving<br />
the 10 nearest points under a specific distance metric<br />
and then computing the percentage of those retrieved<br />
points having the same label as the query point.<br />
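The two performance measures can be sketched as below. The code assumes the points have already been mapped into the space where Euclidean distance is the metric under evaluation (for example, the PCA or LSML projection). Function names, the majority-vote tie handling, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def nearest(points, query_index, k):
    """Indices of the k points closest to the query, excluding the query."""
    q = points[query_index]
    others = [j for j in range(len(points)) if j != query_index]
    others.sort(key=lambda j: math.dist(q, points[j]))
    return others[:k]


def precision_at_k(points, labels, query_index, k=10):
    """Fraction of the k retrieved points sharing the query's label."""
    hits = nearest(points, query_index, k)
    return sum(labels[j] == labels[query_index] for j in hits) / len(hits)


def loo_knn_error(points, labels, k=3):
    """Leave-one-out k-NN error: each point is classified by the others."""
    errors = 0
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest(points, i, k))
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / len(points)


# Hypothetical 1-dimensional projections of six patients.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels = ["C", "C", "C", "H", "H", "H"]

print(precision_at_k(points, labels, 0, k=2))
print(precision_at_k(points, labels, 0, k=3))
print(loo_knn_error(points, labels, k=3))
```

Averaging `precision_at_k` over all query indices gives a dataset-level retrieval score of the kind reported in the tables.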
Performance Comparison. To have a fair comparison,<br />
both PCA and LSML project data into a 1-dimensional<br />
space, since the Expert method only uses<br />
one feature, i.e., the variance of mean ABP. Table 1<br />
shows the classification results using a 3-NN classifier,<br />
and Table 2 shows the retrieval results. As can be<br />
observed in both tables, LSML outperforms both Expert<br />
and PCA on both statistical and Wavelet features,<br />
which confirms the importance of incorporating label information<br />
into the distance metric. We also observe that<br />
Wavelet features improve the performance significantly<br />
for LSML, where the classification error drops by more than half<br />
(from about 15% to less than 7%).<br />
Table 1. Classification error comparison<br />
Features | Expert | PCA | LSML<br />
Statistical | 0.2295 | 0.2131 | 0.1475<br />
Wavelet | NA | 0.2295 | 0.0656<br />
Table 2. Precision@10 retrieval results<br />
Features | Expert | PCA | LSML<br />
Statistical | 0.6120 | 0.5355 | 0.6557<br />
Wavelet | NA | 0.5410 | 0.7869<br />
Sensitivity Analysis. There are two parameters in the<br />
study: 1) the number of neighbors k in the k-NN classifier<br />
and 2) the dimensionality d of the resulting low-dimensional<br />
space (after PCA and LSML). Figure 3<br />
shows the results of the sensitivity analysis on these two parameters.<br />
Figure 3(a) plots classification error vs. k for<br />
all methods. Small k leads to lower classification error,<br />
which confirms the need for a localized distance metric.<br />
Figure 3(b) plots classification error vs. dimensionality<br />
d for all methods except Expert, which confirms the<br />
stability of LSML with respect to d.<br />
6. Conclusion and Discussion<br />
We have presented a method for deriving semantically<br />
sound similarity measures for retrieving patients<br />
represented by multi-dimensional time series. Our<br />
method uses both statistical and wavelet-based features<br />
to capture the characteristics of patients, and leverages<br />
localized supervised metric learning to incorporate<br />
physicians’ expert domain knowledge. Experiments using<br />
the MIMIC II database demonstrate the efficacy of<br />
this approach. In future work we plan to explore ways<br />
to explicitly incorporate temporal characteristics of the<br />
data to further improve metric learning in this particular<br />
context.<br />
Figure 3. LSML is stable with different parameter values: (a) stable with different k; (b) stable with different d.<br />
References<br />
[1] MIMIC II Database. http://physionet.org/physiobank/database/mimic2db/.<br />
[2] X. Chen, D. Xu, G. Zhang, and R. Mukkamala. Forecasting acute hypotensive episodes in intensive care patients based on a peripheral arterial blood pressure waveform. Computers in Cardiology, 36, 2000.<br />
[3] I. Daubechies. Ten Lectures on Wavelets. SIAM, Philadelphia, 1992.<br />
[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, 1990.<br />
[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighborhood component analysis. In NIPS, 2005.<br />
[6] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 2009.<br />
[7] M. Saeed and R. Mark. A novel method for the efficient retrieval of similar multiparameter physiologic time series using wavelet-based symbolic representations. In American Medical Informatics Association, 2006.<br />
[8] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. J. Mach. Learn. Res., 8, 2007.<br />
[9] F. Wang, J. Sun, T. Li, and N. Anerousis. Two heads better than one: Metric+active learning and its applications for IT service classification. In ICDM, 2009.<br />
[10] F. Wang and C. Zhang. Feature extraction by maximizing the neighborhood margin. In CVPR, 2007.<br />
[11] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, 2002.<br />
[12] L. Yang. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.