Localized Supervised Metric Learning on ... - Researcher - IBM

Localized Supervised Metric Learning on Temporal Physiological Data 

Jimeng Sun, Daby Sow, Jianying Hu, Shahram Ebadollahi 

IBM T.J. Watson Research Center, New York, USA 

{jimeng, sowdaby, jyhu, ebad}@us.ibm.com 

Abstract 

Effective patient similarity assessment is important 

for clinical decision support. It enables the capture of 

past experience as manifested in the collective longitudinal 

medical records of patients to help clinicians assess 

the likely outcomes resulting from their decisions 

and actions. However, it is challenging to devise a patient 

similarity metric that is clinically relevant and semantically 

sound. Patient similarity is highly context 

sensitive: it depends on factors such as the disease, the 

particular stage of the disease, and co-morbidities. One 

way to discern the semantics in a particular context is to 

take advantage of physicians’ expert knowledge as reflected 

in labels assigned to some patients. In this paper 

we present a method that leverages localized supervised 

metric learning to effectively incorporate such expert 

knowledge to arrive at semantically sound patient similarity 

measures. Experiments using data obtained from 

the MIMIC II database demonstrate the effectiveness of 

this approach. 

1. Introduction 

Medical records capture both observations of patients’ 

health status, and decisions and actions taken 

by clinicians and care providers. Buried inside these 

records are nuggets of insight on the temporal evolution 

pattern of patient health status, and the effects of different 

clinical decisions on the trajectory of a disease. 

Tapping into this source of insight can be achieved by 

developing techniques for measuring cross-patient similarity. 

These techniques have the potential to improve 

patients’ clinical outcomes as essential tools for diagnostic 

and prognostic decision support. 

Figure 1 illustrates several aspects of the scenario 

that drives our research in this area. In this figure, a patient 

with available observations up to a decision point 

is presented to the system. The cohort of patients who 

are clinically similar to the query patient is retrieved. 

A clinician looks up the decisions and actions applied to 

the retrieved cohort and their consequences and makes 

up her mind about the best course of action for the current 

patient. In addition, she can project the trajectory 

of the patient’s health status, as captured by the patient’s 

clinical factors and biomarkers, under the regime of any 

particular decision she makes. 

There are three fundamental challenges that need to 

be addressed before such a decision support mechanism 

can be realized: 

1. Alignment of the trajectories of patients’ temporal 

characteristics to make the records amenable to 

semantically and clinically sound comparison, 

2. Devising similarity measures that can reflect the 

clinical proximity or disparity between different 

patients, 

3. Coupling between decisions and their consequences 

as manifested in patient prognosis. 

The focus of this paper is on the second task. We propose 

two different methods for feature generation over 

multi-dimensional temporal patient data, and adopt a localized 

supervised metric learning approach to arrive at 

a semantically sound similarity measure for retrieving 

patients represented in the multi-dimensional feature 

space. The proposed method is tested using the MIMIC 

II database, which consists of physiological waveforms, 

and accompanying clinical data obtained for ICU patients 

[1]. The study is carried out on 74 patients from 

this database, categorized into 2 groups based on different 

clinical conditions. Comparisons against unsupervised 

metric learning approaches on classification and 

retrieval accuracy are presented to illustrate the performance 

benefit of the proposed approach. 

2. Related Work 

In [7], Saeed and Mark reported work on retrieving 

similar patients using the same database, where 

they employed a multi-resolution description scheme 

for physiological temporal ICU data and used unsupervised 

similarity metrics for retrieving patients. The focus 

of that work was more on the symbolic representation 

of the temporal data to make them amenable for comparison. In this paper, we leverage both statistical methods and Wavelet methods to extract features over the temporal data.

Figure 1. Retrieving patients based on their clinical similarity to a query patient and using the retrieved patients to project the evolution of the patient’s clinical characteristics.

Supervised metric learning has been studied in the past [11, 12, 5]. The goal has been to learn a distance metric such that samples in the same class are close and those in different classes are far apart. The common treatment is to add constraints and regularization terms to the objective function and then to solve it using optimization methods. To avoid a large number of constraints, in this paper we model the problem as a trace ratio problem, which can be solved effectively (similar to Wang et al. [9]).

3. Localized Supervised Metric Learning

In this section we present the supervised metric learning problem in the context of patient similarity measures. When a physician looks for similar patients in a database, the similarity is often based not only on quantitative measurements such as lab results, sensor measurements, age and sex, but also on the physician’s assessment of the disease type and stage. The assessment would potentially influence the relative importance a physician places on different measurements or groups of measurements. To compute this specific notion of similarity, we propose to learn a distance metric that can automatically adjust the importance of each numeric feature by leveraging the physician’s belief.

Formally, quantitative measurements of a patient are represented by an N-dimensional feature vector $x$. Examples of features are the mean and variance of the sensor measurements, or Wavelet coefficients. The prior belief of physicians is captured as labels on some of the patients. With this formulation, our goal is to learn a generalized Mahalanobis distance between patient $x_i$ and patient $x_j$ defined as:

$$d_m(x_i, x_j) = \sqrt{(x_i - x_j)^T P (x_i - x_j)} \tag{1}$$

where $P \in \mathbb{R}^{N \times N}$ is called the precision matrix. Matrix $P$ is positive semi-definite and is used to incorporate the correlations between different feature dimensions. The key is to learn the optimal $P$ such that the resulting distance metric has the following properties:

• Within-class compactness: patients with the same label are close together;

• Between-class scatterness: patients with different labels are far away from each other.

To formally measure these properties, we use two kinds of neighborhoods as defined in [10]. The homogeneous neighborhood of $x_i$, denoted as $\mathcal{N}_i^o$, is the set of k nearest patients of $x_i$ with the same label. The heterogeneous neighborhood of $x_i$, denoted as $\mathcal{N}_i^e$, is the set of k nearest patients of $x_i$ with different labels.
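The two neighborhoods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes neighbors are ranked by plain Euclidean distance in the original feature space (the text does not say which distance is used), and the function names and toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def neighborhoods(X, labels, i, k):
    """Return (homogeneous, heterogeneous) index lists for patient i.

    homogeneous:   the k nearest patients sharing the label of patient i
    heterogeneous: the k nearest patients with a different label
    Assumption: nearness is measured with Euclidean distance.
    """
    others = [j for j in range(len(X)) if j != i]
    # Rank every other patient by distance to patient i, nearest first.
    others.sort(key=lambda j: euclidean(X[i], X[j]))
    same = [j for j in others if labels[j] == labels[i]][:k]
    diff = [j for j in others if labels[j] != labels[i]][:k]
    return same, diff

# Toy data: four "H" patients and three "C" patients in a 2-D feature space.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0),
     (6.0, 5.0), (5.0, 7.0), (3.0, 3.0)]
labels = ["H", "H", "H", "C", "C", "C", "H"]

homo, hetero = neighborhoods(X, labels, 0, 2)
print(homo, hetero)  # [1, 2] [3, 4]
```

Note that patient 6 is closer to patient 0 than any "C" patient is, but it still does not enter the heterogeneous neighborhood because it carries the same label.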

Based on these two neighborhoods, we define the local 

compactness of point $x_i$ as

$$C_i = \sum_{x_j \in \mathcal{N}_i^o} d_m^2(x_i, x_j) \tag{2}$$

and the local scatterness of point $x_i$ as

$$S_i = \sum_{x_k \in \mathcal{N}_i^e} d_m^2(x_i, x_k) \tag{3}$$

The discriminability of the distance metric $d_m$ is defined as

$$J = \frac{\sum_i C_i}{\sum_i S_i} = \frac{\sum_i \sum_{x_j \in \mathcal{N}_i^o} (x_i - x_j)^T P (x_i - x_j)}{\sum_i \sum_{x_k \in \mathcal{N}_i^e} (x_i - x_k)^T P (x_i - x_k)} \tag{4}$$
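To make the ratio in Eq. (4) concrete, the following sketch evaluates J for a given matrix P on toy data. It is illustrative only: the neighborhood index lists are passed in as hypothetical precomputed inputs, matrices are plain nested lists, and all names are assumptions rather than part of the paper.

```python
def quad_form(diff, P):
    """Return diff^T P diff for a vector diff and a square matrix P."""
    n = len(diff)
    return sum(diff[a] * P[a][b] * diff[b] for a in range(n) for b in range(n))

def discriminability(X, homo, hetero, P):
    """Ratio J of total local compactness to total local scatterness.

    homo[i] and hetero[i] list the indices of the homogeneous and
    heterogeneous neighbors of patient i (assumed to be precomputed).
    Raises ZeroDivisionError if the total scatterness is zero.
    """
    compact = 0.0
    scatter = 0.0
    for i, xi in enumerate(X):
        for j in homo[i]:
            compact += quad_form([u - v for u, v in zip(xi, X[j])], P)
        for k in hetero[i]:
            scatter += quad_form([u - v for u, v in zip(xi, X[k])], P)
    return compact / scatter

# Two classes separated along the first axis: {0, 1} versus {2, 3}.
X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
homo = [[1], [0], [3], [2]]
hetero = [[2], [3], [0], [1]]

identity = [[1.0, 0.0], [0.0, 1.0]]          # plain Euclidean metric
first_axis_only = [[1.0, 0.0], [0.0, 0.0]]   # ignores the second feature

j_euclid = discriminability(X, homo, hetero, identity)
j_axis = discriminability(X, homo, hetero, first_axis_only)
print(j_euclid, j_axis)  # 0.0625 0.0
```

In this toy case the metric that ignores the uninformative second feature attains a lower J than the Euclidean metric, which is the behavior the learning objective rewards.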

The goal is to find a P that minimizes J , which is 

equivalent to minimizing the local compactness and 

maximizing the local scatterness simultaneously. In 

contrast with linear discriminant analysis [4], which seeks a discriminant subspace in a global sense, the localized supervised metric aims to learn a distance metric with enhanced local discriminability. To minimize $J$, we formulate the problem as a trace ratio minimization problem [9] and use the decomposed Newton’s method to find the solution [6].
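The trace ratio form can be made explicit with a short derivation. This is a sketch under the factorization $P = WW^T$ with $W \in \mathbb{R}^{N \times d}$ used in this section; the matrix names $M_C$ and $M_S$ are introduced here for illustration and do not appear in the paper.

```latex
\begin{align*}
(x_i - x_j)^T W W^T (x_i - x_j)
  &= \mathrm{tr}\!\left( W^T (x_i - x_j)(x_i - x_j)^T W \right), \\
M_C &= \sum_i \sum_{x_j \in \mathcal{N}_i^o} (x_i - x_j)(x_i - x_j)^T, \qquad
M_S = \sum_i \sum_{x_k \in \mathcal{N}_i^e} (x_i - x_k)(x_i - x_k)^T, \\
J &= \frac{\mathrm{tr}\!\left( W^T M_C W \right)}{\mathrm{tr}\!\left( W^T M_S W \right)}.
\end{align*}
```

The first line uses the fact that a scalar equals its own trace together with the cyclic property of the trace; summing over all neighbor pairs and using linearity of the trace gives the last line. Trace ratio problems of this kind are commonly solved under an orthonormality constraint $W^T W = I$; that constraint is an assumption here, since the text does not state it.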

Since $P$ is a low-rank positive semi-definite matrix, we can decompose the precision matrix as $P = WW^T$, where $W \in \mathbb{R}^{N \times d}$ and $d \le N$. The distance metric can be rewritten as $d_m(x_i, x_j) = \|W^T x_i - W^T x_j\|$. Therefore, the distance metric is equivalent to the Euclidean distance over the low-dimensional projection $W^T x$.
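The equivalence between the Mahalanobis form and the projected Euclidean form can be checked numerically. The sketch below uses a small hand-picked W (an arbitrary illustrative choice, not a learned one) and hypothetical helper names, with matrices stored as nested lists.

```python
import math

def mahalanobis(xi, xj, P):
    """Generalized Mahalanobis distance sqrt((xi - xj)^T P (xi - xj))."""
    diff = [a - b for a, b in zip(xi, xj)]
    n = len(diff)
    q = sum(diff[a] * P[a][b] * diff[b] for a in range(n) for b in range(n))
    return math.sqrt(q)

def project(W, x):
    """Return W^T x, where W has N rows and d columns."""
    d = len(W[0])
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(d)]

def projected_distance(xi, xj, W):
    """Euclidean distance between the projections W^T xi and W^T xj."""
    zi, zj = project(W, xi), project(W, xj)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zj)))

# Illustrative W with N = 3 input features and d = 2 projected dimensions.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
N, d = len(W), len(W[0])
# Precision matrix P = W W^T (N x N, positive semi-definite, rank <= d).
P = [[sum(W[r][c] * W[s][c] for c in range(d)) for s in range(N)]
     for r in range(N)]

xi = [1.0, 2.0, 3.0]
xj = [0.0, 1.0, 1.0]
print(mahalanobis(xi, xj, P), projected_distance(xi, xj, W))  # 5.0 5.0
```

Working in the projected space is cheaper when d is much smaller than N, since each patient can be projected once and then compared with ordinary Euclidean distance.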

4. Data Description and Feature Extraction 

We have used the physiological data for 74 patients 

obtained from the MIMIC II database [1] in our experiments. Each patient is represented with 5 streams of sensor readings, sampled at 1-minute intervals: 1) SpO2, 2) heart rate (HR), 3) mean ABP (ABPmean), 4) systolic ABP (ABPSys), and 5) diastolic ABP (ABPDias). All patients belong to one of two groups, H or C.

Those in group H (36 patients) experienced an acute hypotensive episode (AHE) during the forecast window, whereas those in group C (38 patients) did not experience any AHE within the forecast window. The start of the forecast window is timestamped in the data set ($T_0$) and its duration is 1 hour, in which an episode of AHE can occur. For this study, we focus on a 2-hour window around $T_0$ for each patient. Figure 2 illustrates the data from two patients, in which samples in the H group show higher variability than those in the C group.

Physicians actually use the variability level of ABP to 

diagnose AHE [2]. 

We have used two different schemes to represent 

the 2-hour temporal data for each patient: a statistical 

time domain method and a wavelet domain method. In 

the former, we compute the mean and variance of data 

from each sensor for each patient. Thus, each patient is 

represented in the time domain with a 10-dimensional 

vector. In the latter, the wavelet coefficients of the 2- 

hour window from each sensor are computed. We use 

Daubechies-4 Wavelet [3] and keep the top-10 coefficients. 

Finally, the coefficients from all 5 sensors are 

vectorized into a 50-dimensional feature vector for each 

patient. 
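The statistical time-domain representation described above can be sketched as follows. This is an illustrative example only: the ordering of the features (mean, then variance, sensor by sensor), the use of the population variance, and the toy readings are all assumptions, since the text does not specify them.

```python
from statistics import mean, pvariance

# The five sensor streams named in the text.
SENSORS = ["SpO2", "HR", "ABPmean", "ABPSys", "ABPDias"]

def statistical_features(streams):
    """Map {sensor name: list of readings} to a 10-dimensional vector.

    Assumptions: features are laid out as (mean, variance) per sensor in
    the order of SENSORS, and the population variance is used.
    """
    features = []
    for name in SENSORS:
        readings = streams[name]
        features.append(mean(readings))
        features.append(pvariance(readings))
    return features

# Toy readings (three samples per stream instead of a full 2-hour window).
toy = {
    "SpO2": [97.0, 98.0, 99.0],
    "HR": [80.0, 82.0, 84.0],
    "ABPmean": [70.0, 75.0, 80.0],
    "ABPSys": [110.0, 120.0, 130.0],
    "ABPDias": [60.0, 60.0, 60.0],
}
vec = statistical_features(toy)
print(len(vec))  # 10
```

A wavelet-domain counterpart would replace the two statistics per sensor with 10 retained coefficients per sensor, giving the 50-dimensional vector.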

5. Experiments 

From the feature extraction step described in section 

4, we obtain 74 N-dimensional feature vectors 

where N = 10 for the statistical method and N = 50 

for the Wavelet method. We then compare the following 

three distance metrics using the leave-one-out 

paradigm: 

• Expert uses the Euclidean distance on the variance of the mean ABP, as suggested in [2];

• PCA uses the Euclidean distance over low-dimensional points after PCA (an unsupervised metric learning algorithm);

• LSML uses the localized supervised metric learning method described in section 3.

Figure 2. Examples of multivariate time series data for the H and C groups: (a) samples in the H group; (b) samples in the C group. H group patients show higher variability than those in the C group.

Note that we do not make comparisons with global supervised 

metric learning methods like LDA [4] because 

as shown in [5, 8], localized metrics usually perform 

better. The performance metrics include k-NN classification 

error rate and precision@10 retrieval results. 

The precision@10 of a query point is computed by retrieving the 10 nearest points under a specific distance metric and then computing the fraction of those retrieved points that have the same label as the query point.
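The precision@k computation can be sketched as below. This is a minimal illustration on 1-dimensional toy points with a precomputed distance matrix; the function name, the toy data, and the choice to exclude the query itself from its own result list are assumptions.

```python
def precision_at_k(dist, labels, query, k=10):
    """Fraction of the k nearest neighbors of `query` that share its label.

    dist is a full pairwise distance matrix (nested lists). The query
    point itself is excluded from the retrieved set.
    """
    others = [j for j in range(len(labels)) if j != query]
    # Retrieve the k points closest to the query under the given metric.
    others.sort(key=lambda j: dist[query][j])
    top = others[:k]
    hits = sum(1 for j in top if labels[j] == labels[query])
    return hits / len(top)

# Toy 1-D projected points and their labels.
points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
labels = ["H", "H", "C", "H", "C", "C"]
dist = [[abs(p - q) for q in points] for p in points]

p_first = precision_at_k(dist, labels, query=0, k=3)  # neighbors: H, C, H
p_fifth = precision_at_k(dist, labels, query=4, k=2)  # neighbors: C, H
print(p_first, p_fifth)
```

Averaging this quantity over all query points gives a single retrieval score per distance metric.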

Performance Comparison. For a fair comparison, both PCA and LSML project the data into a 1-dimensional space, since the Expert method uses only one feature, i.e., the variance of the mean ABP. Table 1 shows the classification results using a 3-NN classifier, and Table 2 shows the retrieval results. As can be observed in both tables, LSML outperforms both Expert and PCA on both statistical and Wavelet features, which confirms the importance of incorporating label information into the distance metric. We also observe that Wavelet features improve the performance significantly for LSML: the classification error drops by more than half (from about 15% to less than 7%).

Table 1. Classification error comparison

Features               Expert   PCA      LSML
Statistical features   0.2295   0.2131   0.1475
Wavelet features       NA       0.2295   0.0656
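A leave-one-out k-NN error rate of the kind reported in Table 1 can be computed along the following lines. The sketch assumes the patients have already been projected to 1-dimensional points; the function name and the toy data are hypothetical, and the numbers it produces are unrelated to the table.

```python
from collections import Counter

def loo_knn_error(points, labels, k=3):
    """Leave-one-out k-NN classification error for 1-D projected points.

    Each point is classified by a majority vote among its k nearest
    neighbors, with the point itself held out.
    """
    errors = 0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: abs(points[j] - p))
        votes = Counter(labels[j] for j in others[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors += 1
    return errors / len(points)

# Toy projection: the last "C" point sits inside the "H" cluster.
points = [0.0, 0.5, 1.0, 1.6, 5.0, 5.5, 6.0, 1.2]
labels = ["H", "H", "H", "H", "C", "C", "C", "C"]

err = loo_knn_error(points, labels, k=3)
print(err)  # 0.125
```

With two classes, an odd k such as 3 avoids tied votes.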

Table 2. Precision@10 retrieval results

Features               Expert   PCA      LSML
Statistical features   0.6120   0.5355   0.6557
Wavelet features       NA       0.5410   0.7869

Sensitivity Analysis. There are two parameters in the study: 1) the number of neighbors k in the k-NN classifier and 2) the dimensionality d of the resulting low-dimensional space (after PCA and LSML). Figure 3 shows the results of the sensitivity analysis on these two parameters. Figure 3(a) plots classification error vs. k for all methods. Small k leads to lower classification error, which confirms the need for a localized distance metric. Figure 3(b) plots classification error vs. dimensionality d for all methods except Expert, which confirms the stability of LSML w.r.t. different d.

Figure 3. LSML is stable with different parameter values: (a) stable with different k; (b) stable with different d.

6. Conclusion and Discussion

We have presented a method for deriving semantically sound similarity measures for retrieving patients represented by multi-dimensional time series. Our method uses both statistical and wavelet-based features to capture the characteristics of patients, and leverages localized supervised metric learning to incorporate physicians’ expert domain knowledge. Experiments using the MIMIC II database demonstrate the efficacy of this approach. In future work we plan to explore ways to explicitly incorporate temporal characteristics of the data to further improve metric learning in this particular context.

References

[1] MIMIC II Database. http://physionet.org/physiobank/database/mimic2db/.

[2] X. Chen, D. Xu, G. Zhang, and R. Mukkamala. Forecasting acute hypotensive episodes in intensive care patients based on a peripheral arterial blood pressure waveform. Computers in Cardiology, 36, 2000.

[3] I. Daubechies. Ten Lectures on Wavelets. SIAM, Philadelphia, 1992.

[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. 

Academic Press, San Diego, California, 1990. 

[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. 

Neighborhood component analysis. In NIPS, 2005. 

[6] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. 

IEEE Transactions on Neural Networks, 2009. 

[7] M. Saeed and R. Mark. A novel method for the efficient 

retrieval of similar multiparameter physiologic time series 

using wavelet-based symbolic representations. In 

American Medical Informatics Association, 2006. 

[8] M. Sugiyama. Dimensionality reduction of multimodal 

labeled data by local fisher discriminant analysis. J. 

Mach. Learn. Res., 8, 2007. 

[9] F. Wang, J. Sun, T. Li, and N. Anerousis. Two heads 

better than one: Metric+active learning and its applications 

for IT service classification. In ICDM, 2009. 

[10] F. Wang and C. Zhang. Feature extraction by maximizing 

the neighborhood margin. In CVPR, 2007. 

[11] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. 

Distance metric learning, with application to clustering 

with side-information. In NIPS, 2002. 

[12] L. Yang. Distance metric learning: A comprehensive 

survey. Technical report, Michigan State University, 

2006.
