Department of Computer Science & Engineering

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger

John C. Blitzer

Lawrence K. Saul

University of Pennsylvania

GRASP Laboratory

k-nearest neighbor classification

1. Training data: points labeled +1 or -1

2. Insert test point (?)

3. Find k nearest neighbors

4. Let neighbors vote (2x vs. 1x in the pictured example)

5. Assign most common label (!)
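The voting procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `knn_predict` and the toy data are assumptions, chosen so that the vote comes out 2 to 1 as in the pictured example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Let the neighbors vote and return the majority label.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
train_points = [(1.0, 1.0), (1.5, 0.5), (3.0, 3.0), (4.0, 4.0), (1.2, 2.0)]
train_labels = [+1, +1, -1, -1, -1]

# The three nearest neighbors vote 2x for +1 and 1x for -1.
print(knn_predict(train_points, train_labels, query=(1.0, 1.2), k=3))
```

Note that the only modeling choice in this sketch is the distance function, which is what the rest of the talk sets out to learn.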

Strengths of k-NN classification

straightforward to implement

nonlinear decision boundaries

complexity independent of # of classes

asymptotic guarantees

Problem with k-NN

[Figure: a test point (?) among training points labeled +1 and -1; the test point ends up misclassified!!]

The problem: Euclidean distances

[Figure: the same data with one direction marked "A lot of information" and the other marked "Hardly any information"]

Amplify informative directions

Dampen non-informative directions

Finally!!

[Figure: after rescaling, the test point (?) receives the right label (!)]
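A small sketch of the effect, under made-up assumptions: the class depends only on the first coordinate, while the second coordinate is uninformative and has a much larger spread. The helper `nearest_label` and the scale factors are hypothetical, picked by hand rather than learned.

```python
import math

def nearest_label(points, labels, query, scale=(1.0, 1.0)):
    """1-NN label after multiplying each coordinate by the matching scale factor."""
    def rescale(p):
        return tuple(s * v for s, v in zip(scale, p))
    q = rescale(query)
    dists = [math.dist(rescale(p), q) for p in points]
    return labels[dists.index(min(dists))]

# Toy data: label is -1 for x < 0 and +1 for x > 0; the y coordinate is noise.
points = [(-1.0, 6.0), (-1.0, 0.0), (1.0, 0.0), (1.0, 12.0)]
labels = [-1, -1, +1, +1]
query = (0.8, 6.0)  # lies on the +1 side

# Plain Euclidean distance is dominated by the noisy direction.
print(nearest_label(points, labels, query))
# Amplify the informative direction (x10) and dampen the other (x0.1).
print(nearest_label(points, labels, query, scale=(10.0, 0.1)))
```

With the raw coordinates the nearest neighbor is the -1 point that happens to share the query's noisy coordinate; after rescaling, the nearest neighbor lies on the +1 side.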

Our goal: Learn the transformation

mapping $\vec{x}_i \longrightarrow L\vec{x}_i$

Input space $\longrightarrow$ Feature space (meaningful distances)

Intuition

[Figure legend: similarly labeled points, differently labeled points]

misclassified by 1-NN

impostors

closest similarly labeled point (“target neighbor”)

1. pull target neighbors closer

2. push impostors away

safety margin

Clean neighborhood!

Notation

$\vec{x}_i$: input vectors

$y_{ij} = 1$ if $\vec{x}_i$ and $\vec{x}_j$’s labels match, $0$ otherwise

$\eta_{ij} = 1$ if $\vec{x}_j$ is “target neighbor” of $\vec{x}_i$, $0$ otherwise

$[a]_+ = \max(a, 0)$ (hinge loss)

minimize: $\varepsilon(L) = \varepsilon_{\text{pull}}(L) + \varepsilon_{\text{push}}(L)$

$\varepsilon_{\text{pull}}(L) = \sum_{ij} \eta_{ij}\,\|L(\vec{x}_i - \vec{x}_j)\|^2$

$\eta_{ij}$: neighborhood indicator

$\|L(\vec{x}_i - \vec{x}_j)\|^2$: distance to target neighbor

$\varepsilon_{\text{push}}(L) = \sum_{ijl} \eta_{ij}(1 - y_{il})\left[1 + \|L(\vec{x}_i - \vec{x}_j)\|^2 - \|L(\vec{x}_i - \vec{x}_l)\|^2\right]_+$

$\eta_{ij}$: neighborhood indicator

$(1 - y_{il})$: impostor ($\vec{x}_l$ is labeled differently from $\vec{x}_i$)

$[\,\cdot\,]_+$: hinge loss

$\|L(\vec{x}_i - \vec{x}_j)\|^2$: neighborhood radius

$1$: safety margin

$\|L(\vec{x}_i - \vec{x}_l)\|^2$: distance to impostor

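Loss function

$\varepsilon(L) = \varepsilon_{\text{pull}}(L) + \varepsilon_{\text{push}}(L)$

1. pull target neighbors closer

$\varepsilon_{\text{pull}}(L) = \sum_{ij} \eta_{ij}\,\|L(\vec{x}_i - \vec{x}_j)\|^2$

2. push impostors away

$\varepsilon_{\text{push}}(L) = \sum_{ijl} \eta_{ij}(1 - y_{il})\left[1 + \|L(\vec{x}_i - \vec{x}_j)\|^2 - \|L(\vec{x}_i - \vec{x}_l)\|^2\right]_+$ (unit margin)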
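The two terms can be evaluated directly from their definitions. The following is a minimal, unoptimized sketch and not the authors' implementation; the names `sq_dist`, `lmnn_loss`, and the `targets` list (index lists encoding which pairs have target-neighbor indicator 1) are assumptions made for illustration.

```python
def sq_dist(L, a, b):
    """Squared distance ||L(a - b)||^2 for a matrix L given as a list of rows."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    mapped = [sum(row[k] * diff[k] for k in range(len(diff))) for row in L]
    return sum(v * v for v in mapped)

def lmnn_loss(L, X, labels, targets):
    """Return (eps_pull, eps_push) for the linear transformation L.

    targets[i] lists the indices j that are target neighbors of X[i].
    """
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(targets):
        for j in neighbors:
            d_ij = sq_dist(L, X[i], X[j])
            pull += d_ij                      # pull target neighbors closer
            for l in range(len(X)):
                if labels[l] != labels[i]:    # differently labeled point
                    d_il = sq_dist(L, X[i], X[l])
                    # Hinge with unit margin: penalize impostors only.
                    push += max(0.0, 1.0 + d_ij - d_il)
    return pull, push

# Toy data, made up for illustration: two points of class 0, one of class 1.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
labels = [0, 0, 1]
targets = [[1], [0], []]

identity = [[1.0, 0.0], [0.0, 1.0]]
rescaled = [[0.5, 0.0], [0.0, 4.0]]   # shrink the first axis, stretch the second
print(lmnn_loss(identity, X, labels, targets))
print(lmnn_loss(rescaled, X, labels, targets))
```

In this toy example the rescaling both shortens the target-neighbor distances and moves the differently labeled point outside the margin, so the push term drops to zero.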

$\varepsilon(L)$ is NOT CONVEX in L!!

[Figure: plot of $\varepsilon(L)$ showing a local minimum]

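Change of variable

Instead of learning L we learn $M = L^\top L$ with $M \succeq 0$.

$\|L(\vec{x}_i - \vec{x}_j)\|^2 = (\vec{x}_i - \vec{x}_j)^\top L^\top L\,(\vec{x}_i - \vec{x}_j)$

$= (\vec{x}_i - \vec{x}_j)^\top M\,(\vec{x}_i - \vec{x}_j)$

$= \|\vec{x}_i - \vec{x}_j\|_M^2$

M defines Mahalanobis distance

Loss function is convex in M!

[Figure: plot of $\varepsilon(M)$ with only a global minimum]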
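The identity behind the change of variable is easy to check numerically. This is a small sketch with made-up numbers; the helper names `gram`, `dist_L`, and `dist_M` are hypothetical.

```python
def gram(L):
    """Return M = L^T L for a matrix L given as a list of rows."""
    n_cols = len(L[0])
    return [[sum(row[a] * row[b] for row in L) for b in range(n_cols)]
            for a in range(n_cols)]

def dist_L(L, x, y):
    """Squared distance after mapping with L: ||L(x - y)||^2."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Ld = [sum(row[k] * d[k] for k in range(len(d))) for row in L]
    return sum(v * v for v in Ld)

def dist_M(M, x, y):
    """Squared Mahalanobis distance: (x - y)^T M (x - y)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[a] * M[a][b] * d[b] for a in range(n) for b in range(n))

L = [[2, 1], [0, 3]]
M = gram(L)
x, y = (1, 2), (3, -1)
print(M)
print(dist_L(L, x, y), dist_M(M, x, y))  # the two values agree
```

Because every M of the form L^T L yields a quadratic form that is a sum of squares, such an M is automatically positive semidefinite, which is the constraint that appears in the optimization problem.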

Convex optimization

minimize over M: $\varepsilon(M) = \varepsilon_{\text{pull}}(M) + \varepsilon_{\text{push}}(M)$

subject to: $M \succeq 0$

Semidefinite Programming Problem

Linear programming problem with positive semidefinite constraint.

Can be solved efficiently with alternating projection or cutting plane algorithm.
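Projection-based solvers need a way to map a symmetric matrix back onto the positive semidefinite cone. A common way to do this, sketched below with NumPy, is to clip negative eigenvalues to zero. This is an illustrative building block under that assumption, not the solver used in the talk; `project_psd` and `projected_step` are hypothetical names, and the gradient passed to `projected_step` is assumed to be supplied by the caller.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    M = (M + M.T) / 2.0                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T   # V diag(clipped) V^T

def projected_step(M, grad, step_size):
    """One projected (sub)gradient step: move against grad, then project back."""
    return project_psd(M - step_size * grad)

# This matrix has eigenvalues 3 and -1, so it is not PSD.
M = np.array([[1.0, 2.0], [2.0, 1.0]])
P = project_psd(M)
print(P)
print(np.linalg.eigvalsh(P))
```

Repeating `projected_step` with a (sub)gradient of the loss keeps every iterate feasible, which is the basic idea behind projection-style methods for this kind of problem.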

Two ways of classification

k-NN using Mahalanobis distance

1. Learn M that minimizes $\varepsilon(M)$

2. Use k-NN with Mahalanobis distance

Minimum loss classification

1. Learn M that minimizes $\varepsilon(M)$

2. For each test point choose label that minimizes the loss function
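The first option, k-NN with a Mahalanobis distance, can be sketched as follows. The matrix M here is set by hand for illustration rather than learned, and the function names and toy data are assumptions. Setting M to the identity recovers ordinary Euclidean k-NN.

```python
from collections import Counter

def mahalanobis_sq(M, x, y):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[r] * M[r][c] * d[c] for r in range(n) for c in range(n))

def knn_mahalanobis(M, X, labels, query, k=3):
    """Majority label among the k training points nearest to query under M."""
    order = sorted(range(len(X)), key=lambda i: mahalanobis_sq(M, X[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Toy data: the label depends only on the sign of the first coordinate.
X = [(-1.0, 6.0), (-1.0, 0.0), (-1.0, 3.0), (1.0, 0.0), (1.0, 12.0), (1.0, 9.0)]
labels = [-1, -1, -1, +1, +1, +1]
query = (0.8, 6.0)

identity = [[1.0, 0.0], [0.0, 1.0]]
M = [[100.0, 0.0], [0.0, 0.01]]   # hand-picked, not learned

print(knn_mahalanobis(identity, X, labels, query))  # Euclidean k-NN
print(knn_mahalanobis(M, X, labels, query))         # Mahalanobis k-NN
```

The minimum loss variant would instead try each candidate label for the test point and keep the one with the smallest loss; it is not sketched here.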

Results

[Bar chart: classification test error in % (axis from 0 to 20) on Olivetti faces, isolet, 20news, and MNIST, comparing Euclidean k-NN, Mahalanobis k-NN, Minimum Loss, and Multiclass SVM (Crammer and Singer 2001)]

Olivetti Faces

Train / Test images = 280/120

Total constraints = 76,440

Active constraints = 2,680

Training time = 2 mins

[Figure: test image with its Mahalanobis NN and Euclidean NN]

MNIST

Train / Test images = 60k/10k

Total constraints = 3.3 billion

Active constraints = 243,596

Training time = 3 hours

[Figure: test digit with its Mahalanobis NN and Euclidean NN]

Connection to previous work

Online and Batch Learning of Pseudo-Metrics

Shai Shalev-Shwartz (SHAIS@CS.HUJI.AC.IL), School of Computer Science & Engineering, The Hebrew University

Yoram Singer (SINGER@CS.HUJI.AC.IL), School of Computer Science & Engineering, The Hebrew University

Andrew Y. Ng (ANG@CS.STANFORD.EDU), Computer Science Department, Stanford University

Abstract

We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on MNIST dataset and on document filtering.

1. Introduction

Many problems in machine learning and statistics require the access to a metric over instances. For example, the performance of the nearest neighbor algorithm (Cover & Hart, 1967), multi-dimensional scaling (Cox & Cox, 1994) and clustering algorithms such as K-means (MacQueen, 1965), all depend critically on whether the metric they are given truly reflects the underlying relationships between the input instances. Several recent papers have focused on the problem of automatically learning a distance function from examples (Xing et al., 2003; Shental et al., 2002). These papers have focused on batch learning algorithms. A batch algorithm for learning a distance function is provided with a predefined set of examples. Each example consists of two instances and a binary label indicating whether the two instances are similar or dissimilar. The work of (Xing et al., 2003; Shental et al., 2002) used various techniques that are effective in batch settings, but do not have natural, computationally-efficient online versions. Furthermore, these algorithms did not come with any theoretical error guarantees. In this paper, we discuss, analyze, and experiment with an online algorithm for learning pseudo-metrics. As in a batch setting, we receive pairs of instances which may be similar or dissimilar. But in contrast to batch learning, in the online setting we need to extend a prediction on each pair as it is received. After predicting whether the current pair of instances is similar, we receive the correct feedback on the instances’ similarity or dissimilarity. Informally, the goal of the online algorithm is to minimize the number of prediction errors. Online learning algorithms enjoy several practical and theoretical advantages: They are often simple to implement; they are typically both memory and run-time efficient; they often come with formal guarantees in the form of worst case bounds on their performance; there exist several methods for converting from online to batch learning, which come with formal guarantees on the batch algorithm obtained through the conversion. Moreover, there are applications such as text filtering in which the set of examples is indeed not given all at once, but instead revealed in a sequential manner while predictions are requested on-the-fly.

The online algorithm we suggest incrementally learns a pseudo-metric and a threshold. As in (Xing et al., 2003), the pseudo-metrics we use are quadratic forms parametrized by positive semi-definite (PSD) matrices. At each time step, we get a pair of instances and calculate the distance between them according to our current pseudo-metric. We decide that the instances are similar if this distance is less than the current threshold and otherwise we say that the instances are dissimilar. After extending our prediction, we get the true similarity label of the pair of in-

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Learning a Similarity Metric Discriminatively, with Application to Face Verification

Sumit Chopra, Raia Hadsell, Yann LeCun

Courant Institute of Mathematical Sciences, New York University, New York, NY, USA

{sumit, raia, yann}@cs.nyu.edu

Abstract

We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the norm in the target space approximates the “semantic” distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.

1. Introduction

Traditional approaches to classification using discriminative methods, such as neural networks or support vector machines, generally require that all the categories be known in advance. They also require that training examples be available for all the categories. Furthermore, these methods are intrinsically limited to a fairly small number of categories (on the order of 100). Those methods are unsuitable for applications where the number of categories is very large, where the number of samples per category is small, and where only a subset of the categories is known at the time of training. Such applications include face recognition and face verification: the number of categories can be in the hundreds or thousands, with only a few examples per category. A common approach to this kind of problem is distance-based methods, which consist in computing a similarity metric between the pattern to be classified or verified and a library of stored prototypes. Another common approach is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that can extract information about the problem from the available data, without requiring specific information about the categories.

The solution presented in this paper is to learn a similarity metric from data. This similarity metric can later be used to compare or match new samples from previously-unseen categories (e.g. faces from people not seen during training). We present a new type of discriminative training method that is used to train the similarity metric. The method can be applied to classification problems where the number of categories is very large and/or where examples from all categories are not available at the time of training.

The main idea is to find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. More precisely, given a family of functions $G_W(X)$ parameterized by $W$, we seek to find a value of the parameter $W$ such that the similarity metric $E_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$ is small if $X_1$ and $X_2$ belong to the same category, and large if they belong to different categories. The system is trained on pairs of patterns taken from a training set. The loss function minimized by training minimizes $E_W(X_1, X_2)$ when $X_1$ and $X_2$ are from the same category, and maximizes $E_W(X_1, X_2)$ when they belong to different categories. No assumption is made about the nature of $G_W(X)$ other than differentiability with respect to $W$. Because the same function $G$ with the same parameter $W$ is used to process both

Xing et al, NIPS 2003

Shalev-Shwartz et al, ICML 2004

Goldberger et al, NIPS 2005

Chopra et al, CVPR 2005

Vapnik 1998

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger, John Blitzer and Lawrence K. Saul

Department of Computer and Information Science, University of Pennsylvania

Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104

{kilianw, blitzer, lsaul}@cis.upenn.edu

Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example, achieving a test error rate of 1.6% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss and the potential to work in nonlinear feature spaces by using the “kernel trick”. On the other hand, our framework requires no modification for problems with large numbers of classes.

1 Introduction

(DRAFT!!! Please do not distribute.) The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 13]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify nearest neighbors.

In the absence of prior knowledge, most kNN classifiers use simple Euclidean distances to measure the dissimilarities between examples represented as vector inputs. Euclidean distance metrics, however, do not capitalize on any statistical regularities in the data that might be estimated from a large training set of labeled examples.

Ideally, the distance metric for kNN classification should be adapted to the particular problem being solved. It can hardly be optimal, for example, to use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples [2, 6, 11, 12]. Even a simple linear transformation of input features has been shown to lead to significant improvements in kNN classification [6, 11]. Our work

Neighbourhood Components Analysis

Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov

Department of Computer Science, University of Toronto

{jacob,roweis,hinton,rsalakhu}@cs.toronto.edu

Abstract

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

1 Introduction

Nearest neighbor (KNN) is an extremely simple yet surprisingly effective method for classification. Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself.

However, KNN suffers from two very serious drawbacks. The first is computational, since it must store and search through the entire training set in order to classify a single test point. (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees [4].) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time.

2 Stochastic Nearest Neighbours for Distance Metric Learning

We begin with a labeled data set consisting of n real-valued input vectors $x_1, \ldots, x_n$ in $\mathbb{R}^D$ and corresponding class labels $c_1, \ldots, c_n$. We want to find a distance metric that maximizes

Online Online and and Batch Batch Learning Learning **of** **of** Pseudo-Metrics

Pseudo-Metrics

Shai Shai Shalev-Shwartz Shalev-Shwartz SHAIS@CS.HUJI.AC.IL

SHAIS@CS.HUJI.AC.IL

School School **of** **of** **Computer** **Computer** **Science** **Science** && **Engineering**, **Engineering**, The The Hebrew Hebrew University University

Yoram Yoram Singer Singer SINGER@CS.HUJI.AC.IL

SINGER@CS.HUJI.AC.IL

School School **of** **of** **Computer** **Computer** **Science** **Science** && **Engineering**, **Engineering**, The The Hebrew Hebrew University University

Andrew Y. Ng ANG@CS.STANFORD.EDU

**Computer** **Science** **Department**, Stanford University

Abstract

We describe and analyze an online algorithm for

supervised learning **of** pseudo-metrics. The algorithm

receives pairs **of** instances and predicts

their similarity according to a pseudo-metric.

The pseudo-metrics we use are quadratic forms

parameterized by positive semi-definite matrices.

The core **of** the algorithm is an update rule that

is based on successive projections onto the positive

semi-definite cone and onto half-space constraints

imposed by the examples. We describe

an efficient procedure for performing these projections,

derive a worst case mistake bound on

the similarity predictions, and discuss a dual version

**of** the algorithm in which it is simple to

incorporate kernel operators. The online algorithm

also serves as a building block for deriving

a large-margin batch algorithm. We demonstrate

the merits **of** the proposed approach by conducting

experiments on MNIST dataset and on document

filtering.

1. Introduction

Many problems in machine learning and statistics require

the access to a metric over instances. For example, the performance

**of** the nearest neighbor algorithm (Cover & Hart,

1967), multi-dimensional scaling (Cox & Cox, 1994) and

clustering algorithms such as K-means (MacQueen, 1965),

all depend critically on whether the metric they are given

truly reflects the underlying relationships between the input

instances. Several recent papers have focused on the

problem **of** automatically learning a distance function from

examples (Xing et al., 2003; Shental et al., 2002). These

papers have focused on batch learning algorithms. A batch

Appearing in Proceedings **of** the 21 st Andrew Y. Ng ANG@CS.STANFORD.EDU

**Computer** **Science** **Department**, Stanford University

Abstract

algorithm for learning a distance function is provided with

a predefined set **of** examples. Each example consists **of**

We describe and analyze an online algorithm for

two instances and a binary label indicating whether the

supervised learning **of** pseudo-metrics. The al-

two instances are similar or dissimilar. The work **of** (Xing

gorithm receives pairs **of** instances and predicts

et al., 2003; Shental et al., 2002) used various techniques

their similarity according to a pseudo-metric.

that are effective in batch settings, but do not have nat-

The pseudo-metrics we use are quadratic forms

ural, computationally-efficient online versions. Further-

parameterized by positive semi-definite matrices.

more, these algorithms did not come with any theoretical

The core **of** the algorithm is an update rule that

error guarantees. In this paper, we discuss, analyze, and

is based on successive projections onto the posi-

experiment with an online algorithm for learning pseudotive

semi-definite cone and onto half-space conmetrics.

As in a batch setting, we receive pairs **of** instances

straints imposed by the examples. We describe

which may be similar or dissimilar. But in contrast to

an efficient procedure for performing these pro-

batch learning, in the online setting we need to extend a

jections, derive a worst case mistake bound on


A batch algorithm for learning a distance function is provided with a predefined set of examples. Each example consists of two instances and a binary label indicating whether the two instances are similar or dissimilar. The work of (Xing et al., 2003; Shental et al., 2002) used various techniques that are effective in batch settings, but do not have natural, computationally-efficient online versions. Furthermore, these algorithms did not come with any theoretical error guarantees. In this paper, we discuss, analyze, and experiment with an online algorithm for learning pseudo-metrics. As in a batch setting, we receive pairs of instances which may be similar or dissimilar. But in contrast to batch learning, in the online setting we need to extend a prediction on each pair as it is received. After predicting whether the current pair of instances is similar, we receive the correct feedback on the instances’ similarity or dissimilarity. Informally, the goal of the online algorithm is to minimize the number of prediction errors. Online learning algorithms enjoy several practical and theoretical advantages: They are often simple to implement; they are typically both memory and run-time efficient; they often come with formal guarantees in the form of worst case bounds on their performance; there exist several methods for converting from online to batch learning, which come with formal guarantees on the batch algorithm obtained through the conversion. Moreover, there are applications such as text filtering in which the set of examples is indeed not given all at once, but instead revealed in a sequential manner while predictions are requested on-the-fly.

The online algorithm we suggest incrementally learns a pseudo-metric and a threshold. As in (Xing et al., 2003), the pseudo-metrics we use are quadratic forms parametrized by positive semi-definite (PSD) matrices. At each time step, we get a pair of instances and calculate the distance between them according to our current pseudo-metric. We decide that the instances are similar if this distance is less than the current threshold and otherwise we say that the instances are dissimilar. After extending our prediction, we get the true similarity label of the pair of in-

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
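The prediction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pseudo_metric` and `predict_similar`, the matrix `A`, and the threshold `b` are hypothetical, and the excerpt does not say whether the distance is squared, so the sketch assumes the squared quadratic form (x - y)^T A (x - y). The update rule (projections onto the PSD cone and onto half-spaces) is not shown.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def pseudo_metric(A: Matrix, x: Vector, y: Vector) -> float:
    """Quadratic form (x - y)^T A (x - y); A is assumed to be PSD."""
    d = [xi - yi for xi, yi in zip(x, y)]
    # Compute A @ d row by row, then take the inner product with d.
    Ad = [sum(a_ij * d_j for a_ij, d_j in zip(row, d)) for row in A]
    return sum(d_i * ad_i for d_i, ad_i in zip(d, Ad))


def predict_similar(A: Matrix, b: float, x: Vector, y: Vector) -> bool:
    """Predict 'similar' when the distance is below the current threshold b."""
    return pseudo_metric(A, x, y) < b


# Hypothetical example: A ignores the second coordinate entirely, so A is
# PSD but not positive definite (distinct points can be at distance zero).
A = [[1.0, 0.0],
     [0.0, 0.0]]
print(predict_similar(A, 0.5, [0.0, 0.0], [0.1, 9.0]))  # differs mostly in the ignored direction
print(predict_similar(A, 0.5, [0.0, 0.0], [2.0, 0.0]))  # differs in the weighted direction
```

Because `A` may have zero eigenvalues, two different inputs can be at distance zero, which is why the learned function is a pseudo-metric rather than a metric.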

Connection to previous work

Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL

School of Computer Science & Engineering, The Hebrew University

Yoram Singer SINGER@CS.HUJI.AC.IL

School of Computer Science & Engineering, The Hebrew University

Andrew Y. Ng ANG@CS.STANFORD.EDU

Computer Science Department, Stanford University

Abstract

We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on MNIST dataset and on document filtering.

1. Introduction

Many problems in machine learning and statistics require the access to a metric over instances. For example, the performance of the nearest neighbor algorithm (Cover & Hart, 1967), multi-dimensional scaling (Cox & Cox, 1994) and clustering algorithms such as K-means (MacQueen, 1965), all depend critically on whether the metric they are given truly reflects the underlying relationships between the input instances. Several recent papers have focused on the problem of automatically learning a distance function from examples (Xing et al., 2003; Shental et al., 2002). These papers have focused on batch learning algorithms. A batch


Online and Batch Learning of Pseudo-Metrics

Xing et al., NIPS 2003

Shalev-Shwartz et al., ICML 2004

Learning a Similarity Metric Discriminatively, with Application to Face Verification

Sumit Chopra Raia Hadsell Yann LeCun

Courant Institute of Mathematical Sciences

New York University

New York, NY, USA

{sumit, raia, yann}@cs.nyu.edu

Abstract

We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the norm in the target space approximates the “semantic” distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.

1. Introduction

Traditional approaches to classification using discriminative methods, such as neural networks or support vector machines, generally require that all the categories be known in advance. They also require that training examples be available for all the categories. Furthermore, these methods are intrinsically limited to a fairly small number of categories (on the order of 100). Those methods are unsuitable for applications where the number of categories is very large, where the number of samples per category is small, and where only a subset of the categories is known at the time of training. Such applications include face recognition and face verification: the number of categories can be in the hundreds or thousands, with only a few examples per category. A common approach to this kind of problem is distance-based methods, which consist in computing a similarity metric between the pattern to be classified or verified and a library of stored prototypes. Another common approach is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that can extract information about the problem from the available data, without requiring specific information about the categories.

The solution presented in this paper is to learn a similarity metric from data. This similarity metric can later be used to compare or match new samples from previously-unseen categories (e.g. faces from people not seen during training). We present a new type of discriminative training method that is used to train the similarity metric. The method can be applied to classification problems where the number of categories is very large and/or where examples from all categories are not available at the time of training.

The main idea is to find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. More precisely, given a family of functions $G_W(X)$ parameterized by $W$, we seek to find a value of the parameter $W$ such that the similarity metric $E_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$ is small if $X_1$ and $X_2$ belong to the same category, and large if they belong to different categories. The system is trained on pairs of patterns taken from a training set. The loss function minimized by training minimizes $E_W(X_1, X_2)$ when $X_1$ and $X_2$ are from the same category, and maximizes $E_W(X_1, X_2)$ when they belong to different categories. No assumption is made about the nature of $G_W(X)$ other than differentiability with respect to $W$. Because the same function $G$ with the same parameter $W$ is used to process both

Chopra et al., CVPR 2005
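The pairwise similarity metric described in the Chopra et al. excerpt can be sketched as below. This is only an illustration under assumptions: the paper's mapping is a convolutional network, whereas the sketch swaps in a plain linear map to stay short, and the names `G`, `energy`, and `W` are chosen for the example. The training loss is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def G(W: Matrix, x: Vector) -> List[float]:
    """Map an input pattern into the target space (assumed here: a linear map)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def energy(W: Matrix, x1: Vector, x2: Vector) -> float:
    """Euclidean distance between the two mapped patterns.

    The same function with the same parameter W processes both inputs,
    so energy(W, a, b) == energy(W, b, a).
    """
    g1, g2 = G(W, x1), G(W, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


# Hypothetical parameter: keep the first two coordinates, drop the third.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(energy(W, [1.0, 2.0, 3.0], [1.0, 2.0, -5.0]))  # 0.0: inputs differ only in the dropped coordinate
print(energy(W, [0.0, 0.0, 0.0], [3.0, 4.0, 0.0]))   # 5.0
```

Sharing `W` between both inputs is what makes the resulting measure symmetric in its two arguments.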

work

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger, John Blitzer and Lawrence K. Saul

Department of Computer and Information Science, University of Pennsylvania

Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104

{kilianw, blitzer, lsaul}@cis.upenn.edu

Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example, achieving a test error rate of 1.6% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss and the potential to work in nonlinear feature spaces by using the “kernel trick”. On the other hand, our framework requires no modification for problems with large numbers of classes.

1 Introduction

(DRAFT!!! Please do not distribute.) The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 13]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify nearest neighbors. In the absence of prior knowledge, most kNN classifiers use simple Euclidean distances to measure the dissimilarities between examples represented as vector inputs. Euclidean distance metrics, however, do not capitalize on any statistical regularities in the data that might be estimated from a large training set of labeled examples.

Ideally, the distance metric for kNN classification should be adapted to the particular problem being solved. It can hardly be optimal, for example, to use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples [2, 6, 11, 12]. Even a simple linear transformation of input features has been shown to lead to significant improvements in kNN classification [6, 11]. Our work
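To illustrate why a linear transformation of the inputs can change kNN predictions, here is a minimal sketch. The matrix `L`, the toy data, and the function names are assumptions made for the example; `L` is fixed by hand here, whereas the paper learns the metric by semidefinite programming, which is not shown. Squared Euclidean distance after mapping x to Lx equals a Mahalanobis distance with matrix L^T L.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def transform(L: Matrix, x: Vector) -> List[float]:
    """Apply the linear map x -> L x."""
    return [sum(l * xi for l, xi in zip(row, x)) for row in L]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_predict(L: Matrix, train: List[Tuple[Vector, int]],
                query: Vector, k: int = 3) -> int:
    """Majority vote among the k training points nearest to the query after mapping by L."""
    q = transform(L, query)
    ranked = sorted(train, key=lambda item: sq_dist(transform(L, item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the first coordinate separates the classes,
# the second coordinate is large-scale noise.
train = [([0.0, 5.0], 1), ([0.1, -5.0], 1), ([0.2, 6.0], 1),
         ([1.0, 0.0], -1), ([0.9, 0.5], -1), ([1.1, -0.5], -1)]
query = [0.1, 0.2]

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distance
dampen = [[1.0, 0.0], [0.0, 0.01]]    # shrink the noisy direction

print(knn_predict(identity, train, query))  # -1: the noisy coordinate dominates
print(knn_predict(dampen, train, query))    # 1: the informative coordinate dominates
```

The only difference between the two calls is the matrix, which is the quantity a metric-learning method would fit from labeled data.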

Neighbourhood Components Analysis

Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov

Department of Computer Science, University of Toronto

{jacob,roweis,hinton,rsalakhu}@cs.toronto.edu

Abstract

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

1 Introduction

Nearest neighbor (KNN) is an extremely simple yet surprisingly effective method for classification. Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself.

However, KNN suffers from two very serious drawbacks. The first is computational, since it must store and search through the entire training set in order to classify a single test point. (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees [4].) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time.

2 Stochastic Nearest Neighbours for Distance Metric Learning

We begin with a labeled data set consisting of n real-valued input vectors $x_1, \ldots, x_n$ in $\mathbb{R}^D$ and corresponding class labels $c_1, \ldots, c_n$. We want to find a distance metric that maximizes

Goldberger et al., NIPS 2005
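The stochastic leave-one-out score mentioned in the Goldberger et al. excerpt can be sketched as follows. The excerpt does not state the neighbour-selection rule, so this sketch assumes a softmax over negative squared distances in the transformed space, with a point never selecting itself. The names `neighbor_probs` and `expected_loo_score` and the toy data are hypothetical, and the optimization over the matrix `A` is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def transform(A: Matrix, x: Vector) -> List[float]:
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def neighbor_probs(A: Matrix, X: Sequence[Vector], i: int) -> List[float]:
    """Probability that point i selects each point j as its neighbour (assumed softmax rule)."""
    zi = transform(A, X[i])
    weights = []
    for j, xj in enumerate(X):
        if j == i:
            weights.append(0.0)  # a point never selects itself
        else:
            zj = transform(A, xj)
            weights.append(math.exp(-sum((a - b) ** 2 for a, b in zip(zi, zj))))
    total = sum(weights)
    return [w / total for w in weights]


def expected_loo_score(A: Matrix, X: Sequence[Vector], labels: Sequence[int]) -> float:
    """Expected number of points whose randomly selected neighbour shares their label."""
    score = 0.0
    for i in range(len(X)):
        p = neighbor_probs(A, X, i)
        score += sum(pj for j, pj in enumerate(p) if labels[j] == labels[i])
    return score


# Toy data (hypothetical): classes differ in the first coordinate only.
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]
first_only = [[1.0, 0.0], [0.0, 0.0]]  # ignores the uninformative coordinate

print(expected_loo_score(identity, X, labels))
print(expected_loo_score(first_only, X, labels))
```

Because the score is a smooth function of `A`, it can in principle be improved by gradient-based optimization, unlike the hard leave-one-out kNN error.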

Vapnik 1998



Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL

School of Computer Science & Engineering, The Hebrew University

Yoram Singer SINGER@CS.HUJI.AC.IL

School of Computer Science & Engineering, The Hebrew University

Andrew Y. Ng ANG@CS.STANFORD.EDU

Computer Science Department, Stanford University

Abstract

We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on MNIST dataset and on document filtering.

1. Introduction

Many problems in machine learning and statistics require the access to a metric over instances. For example, the performance of the nearest neighbor algorithm (Cover & Hart, 1967), multi-dimensional scaling (Cox & Cox, 1994) and clustering algorithms such as K-means (MacQueen, 1965), all depend critically on whether the metric they are given truly reflects the underlying relationships between the input instances. Several recent papers have focused on the problem of automatically learning a distance function from examples (Xing et al., 2003; Shental et al., 2002). These papers have focused on batch learning algorithms. A batch


algorithm for learning a distance function is provided with a predefined set of examples. Each example consists of two instances and a binary label indicating whether the two instances are similar or dissimilar. The work of (Xing et al., 2003; Shental et al., 2002) used various techniques that are effective in batch settings, but do not have natural, computationally-efficient online versions. Furthermore, these algorithms did not come with any theoretical error guarantees. In this paper, we discuss, analyze, and experiment with an online algorithm for learning pseudo-metrics. As in a batch setting, we receive pairs of instances which may be similar or dissimilar. But in contrast to batch learning, in the online setting we need to extend a prediction on each pair as it is received. After predicting whether the current pair of instances is similar, we receive the correct feedback on the instances’ similarity or dissimilarity. Informally, the goal of the online algorithm is to minimize the number of prediction errors. Online learning algorithms enjoy several practical and theoretical advantages: They are often simple to implement; they are typically both memory and run-time efficient; they often come with formal guarantees in the form of worst case bounds on their performance; there exist several methods for converting from online to batch learning, which come with formal guarantees on the batch algorithm obtained through the conversion. Moreover, there are applications such as text filtering in which the set of examples is indeed not given all at once, but instead revealed in a sequential manner while predictions are requested on-the-fly.

The online algorithm we suggest incrementally learns a pseudo-metric and a threshold. As in (Xing et al., 2003), the pseudo-metrics we use are quadratic forms parametrized by positive semi-definite (PSD) matrices. At each time step, we get a pair of instances and calculate the distance between them according to our current pseudo-metric. We decide that the instances are similar if this distance is less than the current threshold and otherwise we

prediction, we get the true similarity label of the pair of in-
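The prediction step described in this excerpt (a quadratic-form pseudo-metric compared against a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the example matrix `A`, and the threshold value are hypothetical, and the paper's update rule (the projections onto the PSD cone and half-spaces) is not reproduced here.

```python
import math


def pseudo_metric(A, x, y):
    """Distance sqrt((x - y)^T A (x - y)) for a PSD matrix A given as nested lists."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Matrix-vector product A @ diff, written out with plain lists.
    a_diff = [sum(a * d for a, d in zip(row, diff)) for row in A]
    quad = sum(d * ad for d, ad in zip(diff, a_diff))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(quad, 0.0))


def predict_similar(A, threshold, x, y):
    """Predict 'similar' when the pseudo-metric distance is below the threshold."""
    return pseudo_metric(A, x, y) < threshold


# Hypothetical PSD matrix that ignores the second coordinate entirely.
A = [[1.0, 0.0],
     [0.0, 0.0]]

print(pseudo_metric(A, [0.0, 0.0], [0.0, 5.0]))          # distinct points, zero distance
print(predict_similar(A, 1.0, [0.0, 0.0], [2.0, 0.0]))   # distance 2.0 exceeds threshold
```

The first call shows why the term pseudo-metric is used: two different points can be at distance zero when A has a zero eigenvalue along the direction that separates them.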

Courant Institute of Mathematical Sciences

New York University

New York, NY, USA

{sumit, raia, yann}@cs.nyu.edu

per category. A common approach to this kind of problem is distance-based methods, which consist in computing a similarity metric between the pattern to be classified or verified and a library of stored prototypes. Another common approach is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that can extract information about the problem from the available data, without requiring specific information about the categories.

The solution presented in this paper is to learn a similarity metric from data. This similarity metric can later be used to compare or match new samples from previously-unseen categories (e.g. faces from people not seen during training). We present a new type of discriminative training method that is used to train the similarity metric. The method can be applied to classification problems where the number of categories is very large and/or where examples from all categories are not available at the time of training.

The main idea is to find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. More precisely, given a family of functions $G_W(X)$ parameterized by $W$, we seek to find a value of the parameter $W$ such that the similarity metric $E_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$ is small if $X_1$ and $X_2$ belong to the same category, and large if they belong to different categories. The system is trained on pairs of patterns taken from a training set. The loss function minimized by training minimizes $E_W(X_1, X_2)$ when $X_1$ and $X_2$ are from the same category, and maximizes $E_W(X_1, X_2)$ when they belong to different categories. No assumption is made about the nature of $G_W(X)$ other than differentiability with respect to $W$. Because the same function $G$ with the same parameter $W$ is used to process both

Chopra et al,

CVPR 2005
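The similarity metric described in the Chopra et al. excerpt, a norm between two inputs after both pass through the same parameterized mapping, can be sketched as follows. The paper uses a convolutional network for the mapping; the linear map below is a hypothetical stand-in chosen only to keep the example short, and the names `linear_map` and `energy` are assumptions rather than anything from the paper.

```python
import math


def linear_map(W, x):
    """Hypothetical stand-in for the learned mapping: a plain linear map W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def energy(W, x1, x2, G=linear_map):
    """Euclidean norm of G(W, x1) - G(W, x2); the same G and W process both inputs."""
    g1, g2 = G(W, x1), G(W, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


# Hypothetical parameters: keep the first two input coordinates, drop the third.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

print(energy(W, [1.0, 2.0, 3.0], [1.0, 2.0, 9.0]))  # inputs differ only in the dropped coordinate
print(energy(W, [0.0, 0.0, 0.0], [3.0, 4.0, 0.0]))
```

Sharing one function and one parameter set across both inputs is what makes the result symmetric in its two arguments. Training would adjust W so that this value shrinks for same-category pairs and grows for different-category pairs.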

Online and Batch Learning of Pseudo-Metrics

metric learning

by semidefinite

programming

Xing et al, NIPS 2003

Shalev-Shwartz et al,

ICML 2004

work

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger, John Blitzer and Lawrence K. Saul

Department of Computer and Information Science, University of Pennsylvania

Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104

{kilianw, blitzer, lsaul}@cis.upenn.edu

Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example achieving a test error rate of 1.6% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss and the potential to work in nonlinear feature spaces by using the “kernel trick”. On the other hand, our framework requires no modification for problems with large numbers of classes.

1 Introduction

(DRAFT!!! Please do not distribute.) The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 13]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify nearest neighbors. In the absence of prior knowledge, most kNN classifiers use simple Euclidean distances to measure the dissimilarities between examples represented as vector inputs. Euclidean distance metrics, however, do not capitalize on any statistical regularities in the data that might be estimated from a large training set of labeled examples.
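To make the dependence on the distance metric concrete, here is a minimal sketch of the kNN rule in which distances are measured after a linear map L is applied to every input. The toy data and the matrix `L_damped` are hand-picked for illustration and are not learned; the function names are assumptions, and none of this is the semidefinite program from the paper.

```python
from collections import Counter


def transform(L, x):
    """Apply the linear map L (nested lists) to the vector x."""
    return [sum(l * xi for l, xi in zip(row, x)) for row in L]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_predict(L, train_x, train_y, query, k=3):
    """Majority label among the k training points nearest to the query after mapping by L."""
    q = transform(L, query)
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: squared_distance(transform(L, pair[0]), q),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: the first coordinate separates the classes, the second is mostly noise.
train_x = [[0.0, 0.0], [0.0, 8.0], [0.0, -8.0], [2.0, 4.0], [2.0, 5.0], [2.0, -7.0]]
train_y = [+1, +1, +1, -1, -1, -1]
query = [0.0, 4.5]

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distance
L_damped = [[1.0, 0.0], [0.0, 0.1]]   # hand-picked: dampens the noisy coordinate

print(knn_predict(identity, train_x, train_y, query))
print(knn_predict(L_damped, train_x, train_y, query))
```

With the identity map, two of the three nearest neighbours of the query carry the label -1 even though the query sits on the +1 side of the informative coordinate. Dampening the noisy coordinate changes the neighbour set and the vote.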

Ideally, the distance metric for kNN classification should be adapted to the particular problem being solved. It can hardly be optimal, for example, to use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples [2, 6, 11, 12]. Even a simple linear transformation of input features has been shown to lead to significant improvements in kNN classification [6, 11]. Our work

Neighbourhood Components Analysis

Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov

Department of Computer Science, University of Toronto

{jacob,roweis,hinton,rsalakhu}@cs.toronto.edu

Abstract

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

1 Introduction

Nearest neighbor (KNN) is an extremely simple yet surprisingly effective method for classification. Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself.

However, KNN suffers from two very serious drawbacks. The first is computational, since it must store and search through the entire training set in order to classify a single test point. (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees [4].) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time.
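The excerpt mentions a stochastic neighbour selection rule and an expected leave-one-out score but does not give the formula. The sketch below assumes the softmax form in which point i picks point j as its neighbour with probability proportional to exp(-||A x_i - A x_j||^2); that form, the function names, and the toy data are assumptions made for illustration, not a reproduction of the authors' code.

```python
import math


def transform(A, x):
    """Apply the linear map A (nested lists) to the vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def expected_loo_correct(A, xs, labels):
    """Expected number of points whose stochastically chosen neighbour shares their label."""
    z = [transform(A, x) for x in xs]
    total = 0.0
    for i in range(len(z)):
        # Unnormalised selection weights; a point never selects itself.
        weights = [
            0.0 if j == i
            else math.exp(-sum((a - b) ** 2 for a, b in zip(z[i], z[j])))
            for j in range(len(z))
        ]
        norm = sum(weights)
        if norm == 0.0:
            continue  # every weight underflowed; this point contributes nothing
        same = sum(w for j, w in enumerate(weights) if labels[j] == labels[i])
        total += same / norm
    return total


# Toy data: classes differ in the first coordinate; the second coordinate is noise.
xs = [[0.0, 0.0], [0.0, 3.0], [1.0, 0.0], [1.0, 3.0]]
labels = [0, 0, 1, 1]

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[3.0, 0.0], [0.0, 0.0]]  # hand-picked: amplify coordinate 1, drop coordinate 2

print(expected_loo_correct(identity, xs, labels))
print(expected_loo_correct(stretched, xs, labels))
```

Because the score is a smooth function of A, it can in principle be maximised by gradient methods. The rank-1 `stretched` matrix also hints at the low-rank option mentioned in the excerpt, since it maps the data onto a single informative direction.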

2 Stochastic Nearest Neighbours for Distance Metric Learning

We begin with a labeled data set consisting of n real-valued input vectors $x_1, \ldots, x_n$ in $\mathbb{R}^D$ and corresponding class labels $c_1, \ldots, c_n$. We want to find a distance metric that maximizes

Goldberger et al,

NIPS 2005

Vapnik 1998


Connection to previous work

Xing et al, NIPS 2003


Goldberger et al, NIPS 2005

Shalev-Shwartz et al, ICML 2004


Chopra et al,

CVPR 2005



k-NN


Vapnik 1998

Online and Batch Learning of Pseudo-Metrics

Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL
School of Computer Science & Engineering, The Hebrew University

Yoram Singer SINGER@CS.HUJI.AC.IL
School of Computer Science & Engineering, The Hebrew University

Andrew Y. Ng ANG@CS.STANFORD.EDU
Computer Science Department, Stanford University

Abstract

We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on MNIST dataset and on document filtering.
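The prediction step of such a method (a quadratic form compared against a threshold) might look like the sketch below. It is an illustration only: the names `A`, `b`, `quad_form_distance` and `predict_similar` are assumptions, the sketch thresholds the quadratic form directly (the excerpt does not say whether a square root is taken), and the projection-based update rule from the abstract is not implemented.

```python
def quad_form_distance(A, x, y):
    """Quadratic form (x - y)^T A (x - y) for a symmetric PSD matrix A."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))


def predict_similar(A, b, x, y):
    """Predict 'similar' when the distance falls below the threshold b."""
    return quad_form_distance(A, x, y) < b


# Hypothetical PSD matrix of rank 1: it ignores the second coordinate,
# so distinct points can be at distance zero (hence "pseudo"-metric).
A = [[1.0, 0.0],
     [0.0, 0.0]]
b = 1.0

same_first_coord = predict_similar(A, b, (0.0, 0.0), (0.0, 5.0))   # distance 0.0
far_first_coord = predict_similar(A, b, (0.0, 0.0), (2.0, 0.0))    # distance 4.0
```

The rank-deficient `A` is chosen on purpose: it shows why the positive semi-definite constraint yields a pseudo-metric rather than a metric.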

1. Introduction

Many problems in machine learning and statistics require the access to a metric over instances. For example, the performance of the nearest neighbor algorithm (Cover & Hart, 1967), multi-dimensional scaling (Cox & Cox, 1994) and clustering algorithms such as K-means (MacQueen, 1965), all depend critically on whether the metric they are given truly reflects the underlying relationships between the input instances. Several recent papers have focused on the problem of automatically learning a distance function from examples (Xing et al., 2003; Shental et al., 2002). These papers have focused on batch learning algorithms. A batch algorithm for learning a distance function is provided with a predefined set of examples. Each example consists of two instances and a binary label indicating whether the two instances are similar or dissimilar. The work of (Xing et al., 2003; Shental et al., 2002) used various techniques that are effective in batch settings, but do not have natural, computationally-efficient online versions. Furthermore, these algorithms did not come with any theoretical error guarantees. In this paper, we discuss, analyze, and experiment with an online algorithm for learning pseudo-metrics. As in a batch setting, we receive pairs of instances which may be similar or dissimilar. But in contrast to batch learning, in the online setting we need to extend a prediction on each pair as it is received. After predicting whether the current pair of instances is similar, we receive the correct feedback on the instances’ similarity or dissimilarity. Informally, the goal of the online algorithm is to minimize the number of prediction errors. Online learning algorithms enjoy several practical and theoretical advantages: They are often simple to implement; they are typically both memory and run-time efficient; they often come with formal guarantees in the form of worst case bounds on their performance; there exist several methods for converting from online to batch learning, which come with formal guarantees on the batch algorithm obtained through the conversion. Moreover, there are applications such as text filtering in which the set of examples is indeed not given all at once, but instead revealed in a sequential manner while predictions are requested on-the-fly.

The online algorithm we suggest incrementally learns a pseudo-metric and a threshold. As in (Xing et al., 2003), the pseudo-metrics we use are quadratic forms parametrized by positive semi-definite (PSD) matrices. At each time step, we get a pair of instances and calculate the distance between them according to our current pseudo-metric. We decide that the instances are similar if this distance is less than the current threshold and otherwise we [...] prediction, we get the true similarity label of the pair of in-

Appearing in Proceedings of the 21st International Conference [...] the authors.

Connection to previous work

Xing et al, NIPS 2003

Goldberger et al, NIPS 2005

Shalev-Shwartz et al, ICML 2004

Kilian Q. Weinberger, John Blitzer and Lawrence K. Saul
Department of Computer and Information Science, University of Pennsylvania
Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104
{kilianw, blitzer, lsaul}@cis.upenn.edu

Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification, for example, achieving a test error rate of 1.6% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss and the potential to work in nonlinear feature spaces by using the “kernel trick”. On the other hand, our framework requires no modification for problems with large numbers of classes.
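To make the role of the metric concrete, the sketch below computes nearest neighbors after a linear map L, so that squared distances take the Mahalanobis form ||L(x - y)||^2 with M = L^T L. It is only an illustration under stated assumptions: the matrix `hand_picked` is chosen by hand rather than learned by semidefinite programming, and all function names and data are hypothetical.

```python
def transform(L, x):
    """Apply the linear map L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def sq_distance(L, x, y):
    """Squared distance ||L(x - y)||^2, a Mahalanobis distance with M = L^T L."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(v * v for v in transform(L, diff))


def nearest_label(L, train_x, train_y, query):
    """1-NN prediction under the metric induced by L."""
    best = min(range(len(train_x)), key=lambda i: sq_distance(L, train_x[i], query))
    return train_y[best]


# Hypothetical toy data: the label depends only on the first coordinate,
# while the second coordinate carries no class information.
train_x = [(0.0, 0.0), (1.0, 3.0)]
train_y = [+1, -1]
query = (0.2, 3.0)  # its first coordinate is close to the +1 example

identity = [[1.0, 0.0], [0.0, 1.0]]        # plain Euclidean distance
hand_picked = [[10.0, 0.0], [0.0, 0.1]]    # amplify axis 0, dampen axis 1

euclidean_label = nearest_label(identity, train_x, train_y, query)
rescaled_label = nearest_label(hand_picked, train_x, train_y, query)
```

With the identity map the query is closest to the -1 example because the uninformative second coordinate dominates. After rescaling, the informative first coordinate decides and the query is assigned +1.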

Learning a Similarity Metric Discriminatively, with Application to Face Verification

Sumit Chopra Raia Hadsell Yann LeCun
Courant Institute of Mathematical Sciences
New York University
New York, NY, USA
sumit, raia, yann @cs.nyu.edu

Abstract

We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the norm in the target space approximates the “semantic” distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.

1. Introduction

Traditional approaches to classification using discriminative methods, such as neural networks or support vector machines, generally require that all the categories be known in advance. They also require that training examples be available for all the categories. Furthermore, these methods are intrinsically limited to a fairly small number of categories (on the order of 100). Those methods are unsuitable for applications where the number of categories is very large, where the number of samples per category is small, and where only a subset of the categories is known at the time of training. Such applications include face recognition and face verification: the number of categories can be in the hundreds or thousands, with only a few examples per category. A common approach to this kind of problem is distance-based methods, which consist in computing a similarity metric between the pattern to be classified or verified and a library of stored prototypes. Another common approach is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that can extract information about the problem from the available data, without requiring specific information about the categories.

The solution presented in this paper is to learn a similarity metric from data. This similarity metric can later be used to compare or match new samples from previously-unseen categories (e.g. faces from people not seen during training). We present a new type of discriminative training method that is used to train the similarity metric. The method can be applied to classification problems where the number of categories is very large and/or where examples from all categories are not available at the time of training.

The main idea is to find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. More precisely, given a family of functions [...] parameterized by [...], we seek to find a value of the parameter [...] such that the similarity metric [...] is small if [...] and [...] belong to the same category, and large if they belong to different categories. The system is trained on pairs of patterns taken from a training set. The loss function minimized by training minimizes [...] when [...] and [...] are from the same category, and maximizes [...] when they belong to different categories. No assumption is made about the nature of [...] other than differentiability with respect to [...]. Because the same function [...] with the same parameter [...] is used to process both


impostor based

loss function


Neighbourhood Components Analysis

Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov
Department of Computer Science, University of Toronto
{jacob,roweis,hinton,rsalakhu}@cs.toronto.edu

Abstract

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

1 Introduction

Nearest neighbor (KNN) is an extremely simple yet surprisingly effective method for classification. Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself.

However, KNN suffers from two very serious drawbacks. The first is computational, since it must store and search through the entire training set in order to classify a single test point. (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees [4].) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time.
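A stochastic neighbour selection rule and the resulting leave-one-out score can be sketched as follows. The excerpt does not give the formula, so the sketch assumes the commonly used softmax over negative squared distances in the transformed space; the names `neighbor_probs` and `expected_loo_accuracy`, the matrices and the data are all hypothetical, and no optimization of the matrix is performed.

```python
import math


def neighbor_probs(A, xs, i):
    """Probability that point i selects each point j as its neighbour (zero for j == i)."""
    def project(x):
        # A may have fewer rows than columns, giving a low-rank metric.
        return [sum(a * v for a, v in zip(row, x)) for row in A]

    zi = project(xs[i])
    weights = []
    for j, x in enumerate(xs):
        if j == i:
            weights.append(0.0)  # a point never selects itself
        else:
            zj = project(x)
            sq = sum((a - b) ** 2 for a, b in zip(zi, zj))
            weights.append(math.exp(-sq))
    total = sum(weights)
    return [w / total for w in weights]


def expected_loo_accuracy(A, xs, labels):
    """Average probability that a point selects a neighbour with its own label."""
    score = 0.0
    for i in range(len(xs)):
        probs = neighbor_probs(A, xs, i)
        score += sum(p for p, c in zip(probs, labels) if c == labels[i])
    return score / len(xs)


# Hypothetical toy data: the class is determined by the first coordinate only.
xs = [(0.0, 0.0), (0.2, 3.0), (1.0, 0.1), (1.2, 3.1)]
labels = [0, 0, 1, 1]

identity = [[1.0, 0.0], [0.0, 1.0]]
rescaled = [[3.0, 0.0], [0.0, 0.1]]

acc_identity = expected_loo_accuracy(identity, xs, labels)
acc_rescaled = expected_loo_accuracy(rescaled, xs, labels)
```

Because the score is a smooth function of the matrix entries, it can in principle be improved by gradient-based search, which is what makes the stochastic rule attractive compared with the discontinuous hard kNN error.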

2 Stochastic Nearest Neighbours for Distance Metric Learning

We begin with a labeled data set consisting of n real-valued input vectors x1, ..., xn in R^D and corresponding class labels c1, ..., cn. We want to find a distance metric that maximizes


Connection to previous work

Shalev-Shwartz et al,

ICML 2004

Learning a Similarity Metric Discriminatively, with Application to Face Verification

Sumit Chopra Raia Hadsell Yann LeCun

Courant Institute of Mathematical Sciences
New York University
New York, NY, USA
{sumit, raia, yann}@cs.nyu.edu

Abstract

We present a method for training a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The idea is to learn a function that maps input patterns into a target space such that the norm in the target space approximates the “semantic” distance in the input space. The method is applied to a face verification task. The learning process minimizes a discriminative loss function that drives the similarity metric to be small for pairs of faces from the same person, and large for pairs from different persons. The mapping from raw to the target space is a convolutional network whose architecture is designed for robustness to geometric distortions. The system is tested on the Purdue/AR face database which has a very high degree of variability in the pose, lighting, expression, position, and artificial occlusions such as dark glasses and obscuring scarves.

1. Introduction

Traditional approaches to classification using discriminative methods, such as neural networks or support vector machines, generally require that all the categories be known in advance. They also require that training examples be available for all the categories. Furthermore, these methods are intrinsically limited to a fairly small number of categories (on the order of 100). Those methods are unsuitable for applications where the number of categories is very large, where the number of samples per category is small, and where only a subset of the categories is known at the time of training. Such applications include face recognition and face verification: the number of categories can be in the hundreds or thousands, with only a few examples per category. A common approach to this kind of problem is distance-based methods, which consist in computing a similarity metric between the pattern to be classified or verified and a library of stored prototypes. Another common approach is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that can extract information about the problem from the available data, without requiring specific information about the categories.

The solution presented in this paper is to learn a similarity metric from data. This similarity metric can later be used to compare or match new samples from previously-unseen categories (e.g. faces from people not seen during training). We present a new type of discriminative training method that is used to train the similarity metric. The method can be applied to classification problems where the number of categories is very large and/or where examples from all categories are not available at the time of training.

The main idea is to find a function that maps input patterns into a target space such that a simple distance in the target space (say the Euclidean distance) approximates the “semantic” distance in the input space. More precisely, given a family of functions $G_W(X)$ parameterized by $W$, we seek to find a value of the parameter $W$ such that the similarity metric $E_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|$ is small if $X_1$ and $X_2$ belong to the same category, and large if they belong to different categories. The system is trained on pairs of patterns taken from a training set. The loss function minimized by training minimizes $E_W(X_1, X_2)$ when $X_1$ and $X_2$ are from the same category, and maximizes $E_W(X_1, X_2)$ when they belong to different categories. No assumption is made about the nature of $G_W(X)$ other than differentiability with respect to $W$. Because the same function $G$ with the same parameter $W$ is used to process both
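The pairwise similarity metric described in the excerpt above can be illustrated with a minimal sketch. It assumes the distance in the target space is Euclidean, which the excerpt offers only as an example. The linear map used as the mapping function and the names `g_w` and `energy` are hypothetical stand-ins; the paper itself uses a convolutional network.

```python
import math

def g_w(w, x):
    """Hypothetical stand-in for the learned mapping: a plain linear map W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def energy(w, x1, x2):
    """Euclidean distance in the target space; both inputs share the parameters w."""
    z1, z2 = g_w(w, x1), g_w(w, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

# Toy parameters (an assumption): keep the first input coordinate, ignore the second.
W = [[1.0, 0.0]]
same_pair = energy(W, [1.0, 5.0], [1.0, -3.0])  # inputs differ only in the ignored coordinate
diff_pair = energy(W, [1.0, 5.0], [4.0, 5.0])   # inputs differ in the kept coordinate
```

With these toy parameters the first pair maps to the same target point while the second pair is pushed apart, which is the behaviour training is meant to produce for same-category and different-category pairs.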

Online and Batch Learning of Pseudo-Metrics

Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL
School of Computer Science & Engineering, The Hebrew University
Yoram Singer SINGER@CS.HUJI.AC.IL
School of Computer Science & Engineering, The Hebrew University
Andrew Y. Ng ANG@CS.STANFORD.EDU
Computer Science Department, Stanford University

Abstract

We describe and analyze an online algorithm for supervised learning of pseudo-metrics. The algorithm receives pairs of instances and predicts their similarity according to a pseudo-metric. The pseudo-metrics we use are quadratic forms parameterized by positive semi-definite matrices. The core of the algorithm is an update rule that is based on successive projections onto the positive semi-definite cone and onto half-space constraints imposed by the examples. We describe an efficient procedure for performing these projections, derive a worst case mistake bound on the similarity predictions, and discuss a dual version of the algorithm in which it is simple to incorporate kernel operators. The online algorithm also serves as a building block for deriving a large-margin batch algorithm. We demonstrate the merits of the proposed approach by conducting experiments on MNIST dataset and on document filtering.

1. Introduction

Many problems in machine learning and statistics require the access to a metric over instances. For example, the performance of the nearest neighbor algorithm (Cover & Hart, 1967), multi-dimensional scaling (Cox & Cox, 1994) and clustering algorithms such as K-means (MacQueen, 1965), all depend critically on whether the metric they are given truly reflects the underlying relationships between the input instances. Several recent papers have focused on the problem of automatically learning a distance function from examples (Xing et al., 2003; Shental et al., 2002). These papers have focused on batch learning algorithms. A batch algorithm for learning a distance function is provided with a predefined set of examples. Each example consists of two instances and a binary label indicating whether the two instances are similar or dissimilar. The work of (Xing et al., 2003; Shental et al., 2002) used various techniques that are effective in batch settings, but do not have natural, computationally-efficient online versions. Furthermore, these algorithms did not come with any theoretical error guarantees. In this paper, we discuss, analyze, and experiment with an online algorithm for learning pseudo-metrics. As in a batch setting, we receive pairs of instances which may be similar or dissimilar. But in contrast to batch learning, in the online setting we need to extend a prediction on each pair as it is received. After predicting whether the current pair of instances is similar, we receive the correct feedback on the instances’ similarity or dissimilarity. Informally, the goal of the online algorithm is to minimize the number of prediction errors. Online learning algorithms enjoy several practical and theoretical advantages: They are often simple to implement; they are typically both memory and run-time efficient; they often come with formal guarantees in the form of worst case bounds on their performance; there exist several methods for converting from online to batch learning, which come with formal guarantees on the batch algorithm obtained through the conversion. Moreover, there are applications such as text filtering in which the set of examples is indeed not given all at once, but instead revealed in a sequential manner while predictions are requested on-the-fly.

The online algorithm we suggest incrementally learns a pseudo-metric and a threshold. As in (Xing et al., 2003), the pseudo-metrics we use are quadratic forms parametrized by positive semi-definite (PSD) matrices. At each time step, we get a pair of instances and calculate the distance between them according to our current pseudo-metric. We decide that the instances are similar if this distance is less than the current threshold and otherwise we [...] prediction, we get the true similarity label of the pair of in-

Appearing in Proceedings of the 21st International Conference [...] the authors.
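The prediction step described in the excerpt above (a quadratic-form distance compared against a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the comparison uses the squared distance, and the matrix `A` and threshold `b` are toy values rather than anything learned by the online update, which is not shown.

```python
def squared_distance(a, x, y):
    """Quadratic form (x - y)^T A (x - y) for a square matrix A given as a list of rows."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[i] * a[i][j] * d[j] for i in range(n) for j in range(n))

def predict_similar(a, threshold, x, y):
    """Predict 'similar' when the (squared) distance under A is below the threshold."""
    return squared_distance(a, x, y) < threshold

# Toy PSD matrix (an assumption): weights the first coordinate, ignores the second.
A = [[1.0, 0.0],
     [0.0, 0.0]]
b = 2.0
near = predict_similar(A, b, [0.0, 0.0], [1.0, 9.0])  # squared distance 1.0
far = predict_similar(A, b, [0.0, 0.0], [3.0, 0.0])   # squared distance 9.0
```

Because this `A` has a zero eigenvalue, two distinct points that differ only in the second coordinate are at distance zero, which is what makes the learned function a pseudo-metric rather than a metric.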

work

Neighbourhood Components Analysis

Jacob Goldberger, Sam Roweis, Geoff Hinton, Ruslan Salakhutdinov
Department of Computer Science, University of Toronto
{jacob,roweis,hinton,rsalakhu}@cs.toronto.edu

Abstract

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.

1 Introduction

Nearest neighbor (KNN) is an extremely simple yet surprisingly effective method for classification. Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself.

However, KNN suffers from two very serious drawbacks. The first is computational, since it must store and search through the entire training set in order to classify a single test point. (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees[4].) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time.

2 Stochastic Nearest Neighbours for Distance Metric Learning

We begin with a labeled data set consisting of n real-valued input vectors x_1, ..., x_n in R^D and corresponding class labels c_1, ..., c_n. We want to find a distance metric that maximizes

Xing et al, NIPS 2003

Goldberger et al,

Chopra et al, CVPR 2005

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Kilian Q. Weinberger, John Blitzer and Lawrence K. Saul
Department of Computer and Information Science, University of Pennsylvania
Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104
{kilianw, blitzer, lsaul}@cis.upenn.edu

Abstract

We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.6% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss and the potential to work in nonlinear feature spaces by using the “kernel trick”. On the other hand, our framework requires no modification for problems with large numbers of classes.

1 Introduction

(DRAFT!!! Please do not distribute.) The k-nearest neighbors (kNN) rule [3] is one of the oldest and simplest methods for pattern classification. Nevertheless, it often yields competitive results, and in certain domains, when cleverly combined with prior knowledge, it has significantly advanced the state-of-the-art [1, 13]. The kNN rule classifies each unlabeled example by the majority label among its k-nearest neighbors in the training set. Its performance thus depends crucially on the distance metric used to identify nearest neighbors. In the absence of prior knowledge, most kNN classifiers use simple Euclidean distances to measure the dissimilarities between examples represented as vector inputs. Euclidean distance metrics, however, do not capitalize on any statistical regularities in the data that might be estimated from a large training set of labeled examples.

Ideally, the distance metric for kNN classification should be adapted to the particular problem being solved. It can hardly be optimal, for example, to use the same distance metric for face recognition as for gender identification, even if in both tasks, distances are computed between the same fixed-size images. In fact, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples [2, 6, 11, 12]. Even a simple linear transformation of input features has been shown to lead to significant improvements in kNN classification [6, 11]. Our work

NIPS 2005
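To make the role of the linear transformation concrete, here is a minimal sketch of kNN classification in which distances are measured after mapping every point through a matrix L. Everything in it is an assumption for illustration: the function names, the toy data, and the hand-picked matrices. It does not learn L by semidefinite programming; it only shows how a fixed rescaling can change the kNN vote.

```python
from collections import Counter

def transform(l, x):
    """Apply the linear map L (list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in l]

def knn_predict(l, train, query, k=3):
    """Majority label among the k nearest training points, measured after applying L."""
    q = transform(l, query)

    def sq_dist(x):
        return sum((a - b) ** 2 for a, b in zip(transform(l, x), q))

    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data: the first coordinate carries the label, the second is large-scale noise.
train = [([0.0, 0.0], -1), ([0.1, 8.0], -1), ([0.2, -8.0], -1),
         ([1.0, 1.0], +1), ([1.1, 2.0], +1), ([0.9, -7.0], +1)]
query = [0.1, 1.5]  # first coordinate matches the -1 class

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distances
rescaled = [[10.0, 0.0], [0.0, 0.1]]  # amplify coordinate 1, dampen coordinate 2

euclidean_label = knn_predict(identity, train, query)
rescaled_label = knn_predict(rescaled, train, query)
```

On this toy data the Euclidean vote is dominated by the noisy second coordinate and returns +1, while the rescaled distances return -1, mirroring the "amplify informative directions, dampen non-informative directions" picture from the earlier slides.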

large margin classifier

Vapnik 1998

Conclusion

Novel algorithm for metric learning

Replaces linear decision boundary in SVM by kNN decision boundary

Training complexity independent of number of classes (unlike multiclass SVM)

Optimization is a semidefinite program

(Can be kernelized)

Future work


Locally adaptive k-NN that exploits manifold structure

Thank you!!!