
SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm

A DISSERTATION

in

Electrical and Systems Engineering

Presented to the Faculties of the University of Pennsylvania
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2008

Supervisor of Dissertation

Graduate Group Chair


COPYRIGHT

Jihun Hamm

2008


Acknowledgements

I deeply thank my advisor Dr. Daniel D. Lee for so many things. Besides providing financial and moral support for my graduate study, Daniel initiated me into the field of machine learning, which I barely knew about before working with him. From him I learned how to tackle problems from the ground up and stay fresh-minded. The greatest influence Daniel had on me was his energy and passion toward the goal. Being near him kept me and my colleagues stimulated and energized, and helped us endure some tough periods during the Ph.D. process.

I thank Dr. Lawrence Saul for inspiring me to follow my interest in manifold learning. During my early years working with him, his knowledge and intuition on the subject strongly shaped my approach to learning problems.

I am also grateful to the other professors who served on my thesis committee: Dr. Ali Jadbabaie, Dr. Jianbo Shi, Dr. Ben Taskar, and Dr. Ragini Verma. They provided valuable feedback to polish the thesis. Dr. Jean Gallier provided me with guidance on mathematical issues before and during the writing of the thesis.

I appreciate the support from my colleagues, especially my lab members: Dan Huang, Yuanqing Lin, Yung-kyun Noh, and Paul Vernaza. Besides sharing the enthusiasm for research, we shared many fun and sometimes stressful moments of daily life as graduate students. Yung-kyun has always been a pleasure to discuss any problem with. He was kind enough to read through the draft of the thesis and give suggestions.

Lastly, I thank my parents and my family for being who they are, and for understanding my excuses for not talking to them more often. My wife Sophia has always been by my side, and I cannot thank her enough for that.


ABSTRACT

SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm

Supervisor: Prof. Daniel D. Lee

In this thesis I propose a subspace-based learning paradigm for solving novel problems in machine learning. We often encounter subspace structures within data that lie inside a vector space. For example, the set of images of an object or a face under varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be approximated quite well with low-dimensional subspaces. Treating such subspaces as basic units of learning gives rise to challenges that conventional algorithms cannot handle well. In this work, I tackle subspace-based learning problems within the unifying framework of the Grassmann manifold, which is the set of linear subspaces of a Euclidean space. I propose positive definite kernels on this space, which provide easy access to the repository of kernel algorithms. Furthermore, I show that the Grassmann kernels can be extended to the set of affine and scaled subspaces. This extension allows us to handle larger classes of problems with little additional cost.

The proposed kernels can be used with any kernel method. In particular, I demonstrate the potential advantages of the proposed kernels with Discriminant Analysis techniques and Support Vector Machines for recognition and categorization tasks. Experiments with real image databases show not only the feasibility of the proposed framework but also the improved performance of the method compared with previously known methods.


Contents

1 INTRODUCTION
  1.1 Overview
  1.2 Contributions and related work
  1.3 Organization of the paper

2 BACKGROUND
  2.1 Introduction
  2.2 Kernel machines
    2.2.1 Motivation
    2.2.2 Mercer kernels
    2.2.3 Reproducing Kernel Hilbert Space
    2.2.4 Examples of kernels
    2.2.5 Generating new kernels from old kernels
    2.2.6 Distance and conditionally positive definite kernels
  2.3 Support Vector Machines
    2.3.1 Large margin linear classifier
    2.3.2 Dual problem and support vectors
    2.3.3 Extensions
    2.3.4 Generalization error and overfitting
  2.4 Discriminant Analysis
    2.4.1 Fisher Discriminant Analysis
    2.4.2 Nonparametric Discriminant Analysis
    2.4.3 Discriminant analysis in high-dimensional spaces
    2.4.4 Extension to nonlinear discriminant analysis

3 MOTIVATION: SUBSPACE STRUCTURE IN DATA
  3.1 Introduction
  3.2 Illumination subspaces in multi-lighting images
  3.3 Pose subspaces in multi-view images
  3.4 Video sequences of human motions
  3.5 Conclusion

4 GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES
  4.1 Introduction
  4.2 Stiefel and Grassmann manifolds
    4.2.1 Stiefel manifold
    4.2.2 Grassmann manifold
    4.2.3 Principal angles and canonical correlations
  4.3 Grassmann distances for subspaces
    4.3.1 Projection distance
    4.3.2 Binet-Cauchy distance
    4.3.3 Max Correlation
    4.3.4 Min Correlation
    4.3.5 Procrustes distance
    4.3.6 Comparison of the distances
  4.4 Experiments
    4.4.1 Experimental setting
    4.4.2 Results and discussion
  4.5 Conclusion

5 GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS
  5.1 Introduction
  5.2 Kernel functions for subspaces
    5.2.1 Projection kernel
    5.2.2 Binet-Cauchy kernel
    5.2.3 Indefinite kernels from other metrics
    5.2.4 Extension to nonlinear subspaces
  5.3 Experiments with synthetic data
    5.3.1 Synthetic data
    5.3.2 Algorithms
    5.3.3 Results and discussion
  5.4 Discriminant Analysis of subspaces
    5.4.1 Grassmann Discriminant Analysis
    5.4.2 Mutual Subspace Method (MSM)
    5.4.3 Constrained MSM (cMSM)
    5.4.4 Discriminant Analysis of Canonical Correlations (DCC)
  5.5 Experiments with real-world data
    5.5.1 Algorithms
    5.5.2 Results and discussion
  5.6 Conclusion

6 EXTENDED GRASSMANN KERNELS AND PROBABILISTIC DISTANCES
  6.1 Introduction
  6.2 Analysis of probabilistic distances and kernels
    6.2.1 Probabilistic distances and kernels
    6.2.2 Data as Mixture of Factor Analyzers
    6.2.3 Analysis of KL distance
    6.2.4 Analysis of Probability Product Kernel
  6.3 Extended Grassmann Kernel
    6.3.1 Motivation
    6.3.2 Extension to affine subspaces
    6.3.3 Extension to scaled subspaces
    6.3.4 Extension to nonlinear subspaces
  6.4 Experiments with synthetic data
    6.4.1 Synthetic data
    6.4.2 Algorithms
    6.4.3 Results and discussion
  6.5 Experiments with real-world data
    6.5.1 Algorithms
    6.5.2 Results and discussion
  6.6 Conclusion

7 CONCLUSION
  7.1 Summary
  7.2 Future work
    7.2.1 Theory
    7.2.2 Applications

Bibliography


List of Tables

4.1 Summary of the Grassmann distances. The distances can be defined as simple functions of both the basis Y and the principal angles θ_i, except for the arc-length, which involves matrix exponentials.

5.1 Classification rates of the Euclidean SVMs and the Grassmannian SVMs. The best rate for each dataset is highlighted in boldface.

6.1 Classification rates of the Euclidean SVMs and the Grassmann SVMs. The best rate for each dataset is highlighted in boldface.


List of Figures

2.1 Classification in the input space (Left) vs. a feature space (Right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map φ : R^2 → R^3, (x_1, x_2)' ↦ (x_1^2, √2 x_1 x_2, x_2^2)', which maps the elliptical decision boundary to a hyperplane. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.

2.2 Example of classifying two-class data with a hyperplane ⟨w, x⟩ + b = 0. In this case the data can be separated without error. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.

2.3 The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto one-dimensional directions denoted by the arrows. Projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas projection in the Fisher direction yields the least overlapping, and therefore most discriminant, one-dimensional distributions. Illustration from [58].

3.1 The first five principal components of a face, computed analytically from a 3D model (Top) and a sphere (Bottom). These images match well with the empirical principal components computed from a set of real images. The figure was taken from [61].

3.2 Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed illumination condition.

3.3 Yale Face Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

3.4 Yale Face Database: examples of basis images and (cumulative) singular values.

3.5 CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed illumination condition.

3.6 CMU-PIE Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

3.7 CMU-PIE Database: examples of basis images and (cumulative) singular values.

3.8 ETH-80 Database: all categories and objects at a fixed pose.

3.9 ETH-80 Database: all poses of an object from a category, used to compute the corresponding pose subspace of the object.

3.10 ETH-80 Database: examples of basis images and (cumulative) singular values.

3.11 IXMAS Database: video sequences of an actor performing 11 different actions viewed from a fixed camera.

3.12 IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in a Cartesian coordinate system and later converted to a cylindrical coordinate system to apply the FFT.

3.13 IXMAS Database: the 'kick' action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.

3.14 IXMAS Database: cylindrical coordinate representation of the volume V(r, θ, z), and the corresponding 1D FFT feature abs(FFT(V(r, θ, z))), shown at a few values of θ.

4.1 Principal angles and Grassmann distances. Let span(Y_i) and span(Y_j) be two subspaces of the Euclidean space R^D on the left. The distance between the two subspaces span(Y_i) and span(Y_j) can be measured using the principal angles θ = [θ_1, ..., θ_m]'. From the Grassmann manifold viewpoint, the subspaces span(Y_i) and span(Y_j) are considered as two points on the manifold G(m, D), whose Riemannian distance is related to the principal angles by d(Y_i, Y_j) = ‖θ‖_2. Various distances can be defined based on the principal angles.

4.2 Yale Face Database: face recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

4.3 CMU-PIE Database: face recognition rates from the 1NN classifier with the Grassmann distances.

4.4 ETH-80 Database: object categorization rates from the 1NN classifier with the Grassmann distances.

4.5 IXMAS Database: action recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

5.1 Doubly kernel method. The first kernel implicitly maps the two 'nonlinear subspaces' X_i and X_j to span(Y_i) and span(Y_j) via the map Φ : X → H_1, where 'nonlinear subspace' means the preimage X_i = Φ^{-1}(span(Y_i)) and X_j = Φ^{-1}(span(Y_j)). The second (Grassmann) kernel maps the points Y_i and Y_j on the Grassmann manifold G(m, D) to the corresponding points in H_2 via the map Ψ : G(m, D) → H_2, such as (5.3) or (5.5).

5.2 A two-dimensional subspace is represented by a triangular patch swept by two basis vectors. The positive and negative classes are color-coded blue and red, respectively. A: The two class centers Y_+ and Y_- around which other subspaces are randomly generated. B–D: Examples of randomly selected subspaces for 'easy', 'intermediate', and 'difficult' datasets.

5.3 Yale Face Database: face recognition rates from various discriminant analysis methods. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

5.4 CMU-PIE Database: face recognition rates from various discriminant analysis methods.

5.5 ETH-80 Database: object categorization rates from various discriminant analysis methods.

5.6 IXMAS Database: action recognition rates from various discriminant analysis methods.

6.1 Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold (Left), the set of linear subspaces, can alternatively be modeled as the set of flat (σ → 0) spheres (Y_i' Y_i = I_m) intersecting at the origin (u_i = 0). The right figure shows a general Mixture of Factor Analyzers not bound by these conditions.

6.2 The Mixture of Factor Analyzers model of the Grassmann manifold is the collection of linear homogeneous Factor Analyzers, shown as flat spheres intersecting at the origin (A). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B), and also to allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C).

6.3 The same affine span can be expressed with different offsets u_1, u_2, .... However, one can use the unique 'standard' offset û, which has the shortest length from the origin.

6.4 Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as two representations of linear subspaces (Left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (Right).

6.5 Yale Face Database: face recognition rates from various kernels. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

6.6 CMU-PIE Database: face recognition rates from various kernels.

6.7 ETH-80 Database: object categorization rates from various kernels.

6.8 IXMAS Database: action recognition rates from various kernels.


Chapter 1

INTRODUCTION

1.1 Overview

In machine learning problems the data commonly lie in a vector space, especially a Euclidean space. The Euclidean space is convenient for data representation, storage, and computation, and is geometrically intuitive as well.

There are, however, other kinds of non-Euclidean spaces more suitable for data outside the conventional Euclidean domain. The data domain I focus on in this thesis is one of those non-Euclidean spaces, in which each data sample is a linear subspace of a Euclidean space.

Researchers often encounter this non-conventional domain in computer vision problems. For example, a set of images of an object or a face under varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be empirically approximated quite well by low-dimensional subspaces. If the data consist of multiple sets of images, they can consequently be modeled as a collection of low-dimensional subspaces.

What are the potential advantages of having such structures? In the above example of face images, we can model the illumination variation of the data that is irrelevant to a recognition task by subspaces, and focus on learning the appropriate variation between those subspaces, such as the variation due to subject identity. This idea applies not only to illumination-varying faces but also to many other types of data for which we can model out the undesired factors with subspaces. Furthermore, representing data as a collection of subspaces is much more economical than keeping all the data samples as unorganized points, since we only need to store and handle the basis vectors. I refer to this approach of handling data as the subspace-based learning approach.

Few researchers have clearly defined and fully utilized the properties of such a space in learning problems. Since a collection of subspaces is non-Euclidean, one cannot benefit from the conveniences of the Euclidean space anymore. For a learning algorithm to work with the subspace representation of the data, it requires a suitable framework that is also convenient for the storage and computation of such data. This thesis provides the foundations of subspace-based learning using a novel framework and kernels.

To show the reader the scope and the depth of this work, I raise the following questions regarding the subject:

Questions

1. What are examples of subspace structure in real data?

2. Which non-Euclidean domain suits subspace-structured data?

3. What dissimilarity measures of subspaces are there, and what are their properties?

4. Can we define kernels for such a domain?

5. Are the kernels related to probabilistic distances?

6. Can we extend the framework to subspaces that are not exactly linear?

This thesis gives detailed and definitive answers to all of the questions above.

1.2 Contributions and related work

In this thesis I propose the Grassmann manifold framework for solving subspace-based problems. The Grassmann manifold is the set of fixed-dimensional linear subspaces and is an ideal model of the data under consideration. Grassmann manifolds have previously been used in signal processing and control [74, 36, 6], numerical optimization [20] (and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78]. In particular, there are many approaches that use the subspace concept for problem solving in computer vision [92, 64, 24, 43, 3]. However, these works do not explicitly or fully utilize the benefits of the Grassmann approach for subspace-based problems. In contrast, I make full use of the properties of the Grassmann manifold with a unifying framework that subsumes the previous approaches.

With the proposed framework, a dissimilarity between subspaces can be viewed as a distance function on the Grassmann manifold. I review several known distances, including the Arc-length, Projection, Binet-Cauchy, Max Corr, Min Corr, and Procrustes distances [20, 16], and provide analytical and empirical comparisons. Furthermore, I propose the Projection kernel as a legitimate kernel function on the Grassmann manifold. The Projection kernel is also used in [85], where it mainly serves as a similarity measure of subspaces rather than as a full-fledged kernel function on the Grassmann manifold. Another kernel I use in the thesis is the Binet-Cauchy kernel [90, 83]. I show that, in spite of the greater attention the Binet-Cauchy kernel has received, it is less useful than the Projection kernel with noisy data.

Using the two kernels as the representative kernels on the Grassmann manifold, I demonstrate the advantages of the Grassmann kernels over the Euclidean kernels on a classification problem with Support Vector Machines on synthetic datasets. To demonstrate the potential benefits of the kernels further, I apply the kernels to discriminant analysis on the Grassmann manifold and compare the approach with previously suggested algorithms for subspace-based discriminant analysis [92, 64, 24, 43]. In the previous methods, feature extraction is performed in the Euclidean space while non-Euclidean subspace distances are used in the objective. This inconsistency results in a difficult optimization and a weak guarantee of convergence, whereas the proposed approach with the Grassmann kernels is simpler and more effective, as evidenced by experiments with real image databases.

In this thesis I also investigate the relationship between probabilistic distances and the Grassmann kernels. If we assume the set of vectors is an i.i.d. sample from an arbitrary probability distribution, then it is possible to compare two such distributions of vectors with probabilistic similarity measures, such as the KL distance [47], the Chernoff distance [15], or the Bhattacharyya/Hellinger distance [10]. Furthermore, the Bhattacharyya affinity is in fact a positive definite kernel function on the space of distributions and has nice closed-form expressions for the exponential family [40]. The probabilistic distances and kernels have been used for recognizing hand-written digits and faces [70, 46, 96]. I provide a link between the probabilistic and the Grassmann views by modeling the subspace data as a limit of the Mixture of Factor Analyzers [27] under zero-mean and homogeneous conditions. The first result I show is that the KL distance reduces to the Projection kernel under this Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL distance, I propose an extension of the Projection kernel, which is originally confined to the set of linear subspaces, to the set of affine as well as scaled subspaces. I demonstrate the potential benefits of the extended kernels with Support Vector Machines and Kernel Discriminant Analysis, using synthetic and real image databases. The experiments show the superiority of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels, as well as over the Euclidean methods.

There is a related but independent problem of clustering unlabeled data points into multiple subspaces. Several approaches have been proposed in the literature. A traditional and inefficient technique is to use an EM algorithm [27] for a Mixture of Factor Analyzers (MFA), which models the data distribution as a superposition of Gaussian distributions. More recent work on clustering subspace data includes the K-subspaces method [37], which extends the K-means algorithm to the case of subspaces, and Generalized PCA [81], which represents subspaces with polynomials and solves algebraic equations to fit the data. These methods differ from the proposed method of this thesis in that they serve as a preprocessing step to generate subspace labels for the proposed subspace-based learning.

1.3 Organization of the paper

The rest of the paper is organized as follows:

• Chapter 2 provides background material for the thesis, including kernel theory, large margin classifiers, and discriminant analyses.

• Chapter 3 discusses theoretical and empirical evidence of inherent subspace structures in image and video databases, and describes procedures for preprocessing the databases.

• Chapter 4 introduces the Grassmann manifold as a common framework for subspace-based learning. Various distances on the Grassmann manifold are reviewed and analyzed in depth.

• Chapter 5 defines the Grassmann kernels and proposes their application to discriminant analysis. Comparisons with previously used algorithms are given.

• Chapter 6 examines the relationship between probabilistic distances and the Grassmann kernels. The chapter contains further discussion on extending the domain of subspace-based learning and presents the extended Grassmann kernels.

• Chapter 7 summarizes the contributions of the thesis and discusses future work related to the proposed methods.

• The Bibliography contains all the work referenced in this thesis.

The main chapters of the thesis are also divided into two parts. Chapters 3 and 4 integrate known facts and set up the framework for the thesis. Chapters 5 and 6 provide the main proposals, analyses, and experimental results.


Chapter 2

BACKGROUND

2.1 Introduction

In this chapter I review three topics: 1) kernel machines, 2) their application to large margin classification, and 3) their application to discriminant analysis. The theory behind kernel machines is helpful and partially necessary for understanding the kernels proposed in this thesis. The large margin classification and discriminant analysis algorithms will be used to test the proposed kernels in Chapters 5 and 6.

I provide a brief tutorial of the three topics based on well-known texts and papers such as [18, 69, 71]. Most of the proofs are omitted and can be found in the original texts.

2.2 Kernel machines

2.2.1 Motivation

Oftentimes it is neither very effective nor convenient to use the original data space to learn patterns in the data. For simplicity, let's assume the data $\mathcal{X}$ lie in a Euclidean space. When the patterns have a complex structure in the original data space, we can try to transform the data space nonlinearly into another space so that the learning task becomes easier in the transformed space. The new space is called a feature space, and the map is called a feature map.

Suppose we are trying to classify two-dimensional, two-class data (Figure 2.1). If the true class boundary is an ellipse in the input space, a linear classifier cannot classify the data correctly. However, when the input space is mapped to the feature space by

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2)' \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)',$$

the decision boundary becomes a hyperplane in three-dimensional space, and therefore the two classes can be perfectly separated by a simple linear classifier.

[Figure 2.1: Classification in the input space (Left) vs. a feature space (Right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map $\phi : \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2)' \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)'$, which maps the elliptical decision boundary to a hyperplane. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.]

Note that we mapped the data to the feature space of all (ordered) monomials of degree two, $(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and used a hyperplane in that space. We can use the same idea for a feature space of higher-degree monomials. However, if we map $x \in \mathbb{R}^D$ to the space of degree-$d$ monomials, the dimension of the feature space becomes

$$\binom{D + d - 1}{d},$$

which can be computationally infeasible even for moderate $D$ and $d$. This difficulty is easily circumvented by noting that we only need to compute inner products of points in the feature space to define a hyperplane. For the space of degree-2 monomials, the inner product can be computed from the original data by

$$\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \langle x, y \rangle^2,$$

which can be extended to degree-$d$ monomials by $\langle x, y \rangle^d$. The inner product in the feature space, such as $k(x, y) = \langle x, y \rangle^d$, is called a kernel function. From a user's point of view, a kernel function is simply a nonlinear similarity measure of data that corresponds to a linear similarity measure in a feature space that the user need not know explicitly. A formal definition will follow shortly.
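As a quick sanity check of the identity above, the following small Python sketch, which I add here for illustration (it is not part of the original text), compares the explicit degree-2 feature map with the direct kernel evaluation $\langle x, y \rangle^2$:

    import numpy as np

    def phi(x):
        """Explicit degree-2 monomial feature map R^2 -> R^3."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly2_kernel(x, y):
        """Kernel evaluation that never constructs the feature space."""
        return np.dot(x, y) ** 2

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(2), rng.standard_normal(2)

    # The two quantities agree up to floating-point error.
    print(np.dot(phi(x), phi(y)), poly2_kernel(x, y))

The agreement of the two numbers is exactly the 'kernel trick': the kernel value equals an inner product in a feature space that is never constructed explicitly.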

2.2.2 Mercer kernels

In this subsection I introduce Mercer's theorem, which characterizes the conditions under which a kernel function $k$ induces a feature map and a feature space. Let $\mathcal{X}$ denote the data space.

The case of finite $\mathcal{X}$

Definition 2.1 (Symmetric Positive Definite Matrix). A real $N \times N$ symmetric matrix $K$ is positive definite if

$$\sum_{i,j} c_i c_j K_{ij} \ge 0, \quad \text{for all } c_1, ..., c_N \ (c_i \in \mathbb{R}).$$

Consider a finite input space $\mathcal{X} = \{x_1, ..., x_N\}$ and a symmetric real-valued function $k(x, y)$. Let $K$ be the $N \times N$ matrix of the function evaluated at $\mathcal{X} \times \mathcal{X}$, $K_{ij} = k(x_i, x_j)$. Since $K$ is symmetric it can be diagonalized as $K = V \Lambda V'$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \le ... \le \lambda_N$ and $V$ is an orthonormal matrix whose columns are the corresponding eigenvectors. Let $v_i$ denote the $i$-th row of $V$:

$$V = [v_1' \cdots v_N']'.$$

If the matrix $K$ is positive definite, and therefore the eigenvalues are non-negative, then we can define the following feature map

$$\phi : \mathcal{X} \to \mathcal{H} = \mathbb{R}^N, \quad x_i \mapsto v_i \Lambda^{1/2}, \quad i = 1, ..., N,$$

where $\Lambda^{1/2}$ is the diagonal matrix $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, ..., \sqrt{\lambda_N})$. We now observe that the inner product in the feature space $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ coincides with the kernel matrix of the data:

$$\langle \phi(x_i), \phi(x_j) \rangle = v_i \Lambda v_j' = (V \Lambda V')_{ij} = K_{ij}.$$
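To make the finite-case construction concrete, here is a small numerical sketch (my illustration, not part of the original text): it eigendecomposes a Gram matrix $K = V \Lambda V'$, forms the feature vectors $\phi(x_i) = v_i \Lambda^{1/2}$, and checks that their inner products reproduce $K$.

    import numpy as np

    # Toy data and a polynomial kernel evaluated on a finite input space.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))                # N = 5 points in R^3
    K = (X @ X.T) ** 2                             # K_ij = <x_i, x_j>^2, positive semidefinite

    # Diagonalize K = V Lambda V' and build the feature map phi(x_i) = v_i Lambda^{1/2}.
    lam, V = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)                  # guard against tiny negative round-off
    Phi = V * np.sqrt(lam)                         # i-th row is phi(x_i)

    # The pairwise inner products of the features reproduce the kernel matrix.
    assert np.allclose(Phi @ Phi.T, K)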
The case of compact $\mathcal{X}$

Let's apply the intuition gained from the finite case to the infinite-dimensional case. Although further generalization to a finite measure space $(\mathcal{X}, \mu)$ is possible, we will deal with compact subsets of $\mathbb{R}^D$ as the domain.

Theorem 2.2 (Mercer). Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^D$. Suppose $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous symmetric function such that the integral operator $T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X})$,

$$(T_k f)(x) = \int_{\mathcal{X}} k(x, y) f(y)\, dy,$$

has the property

$$\int_{\mathcal{X}^2} k(x, y) f(x) f(y)\, dx\, dy \ge 0,$$

for all $f \in L_2(\mathcal{X})$. Then we have a uniformly convergent series

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$$

in terms of the normalized eigenfunctions $\psi_i \in L_2(\mathcal{X})$ of $T_k$ (normalized means $\|\psi_i\|_{L_2} = 1$).

The condition on $T_k$ is an extension of the positive definite condition for matrices. Let's define a sequence of feature maps from the operator eigenfunctions $\psi_i$:

$$\phi_d : \mathcal{X} \to \mathcal{H} = l_2^d, \quad x \mapsto \left(\sqrt{\lambda_1}\,\psi_1(x), ..., \sqrt{\lambda_d}\,\psi_d(x)\right),$$

for $d = 1, 2, ....$ Theorem 2.2 tells us that the sequence of maps $\phi_1, \phi_2, ...$ converges to a map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y)$. The theorem below is the formalization of this observation:

Theorem 2.3 (Mercer Kernel Map). If $\mathcal{X}$ is a compact subset of $\mathbb{R}^D$ and $k$ is a function satisfying the conditions of Theorem 2.2, then there is a feature map $\phi : \mathcal{X} \to \mathcal{H}$ into a feature space $\mathcal{H}$ where $k$ becomes an inner product

$$\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y),$$

for almost all $x, y \in \mathcal{X}$. Moreover, given any $\epsilon > 0$, there exists a map $\phi_n$ into an $n$-dimensional Hilbert space such that

$$|k(x, y) - \langle \phi_n(x), \phi_n(y) \rangle| < \epsilon$$

for almost all $x, y \in \mathcal{X}$.

Mercer's theorem gives us a construction of a feature space. In the next section we will look at a more general construction via the Reproducing Kernel Hilbert Space.

2.2.3 Reproducing Kernel Hilbert Space

Extending the notion of positive definiteness of matrices and compact operators, we can define the positive definiteness of a function on an arbitrary set $\mathcal{X}$ as follows:

Definition 2.4 (Positive Definite Kernel). Let $\mathcal{X}$ be any set, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a positive definite kernel function if

$$\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0,$$

for all $x_1, ..., x_n \ (x_i \in \mathcal{X})$ and $c_1, ..., c_n \ (c_i \in \mathbb{R})$, for any $n \in \mathbb{N}$.

In fact, the necessary and sufficient condition for a kernel to have an associated feature space and feature map is that the kernel be positive definite. Below are the three steps in [69] to construct the feature map $\phi$ and the feature space $\mathcal{H}$ from a given positive definite kernel $k$:

1. Define a vector space with $k$.

2. Endow it with an inner product with a reproducing property.

3. Complete the space to a Hilbert space.

First, we define $\mathcal{H}$ as the set of all linear combinations of functions of the form

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i),$$

for arbitrary $m \in \mathbb{N}$, $\alpha_1, ..., \alpha_m \ (\alpha_i \in \mathbb{R})$, and $x_1, ..., x_m \ (x_i \in \mathcal{X})$. It is not difficult to check that $\mathcal{H}$ is a vector space. Let $g(\cdot) = \sum_{j=1}^{n} \beta_j k(\cdot, y_j)$ be another function in the vector space for some $n \in \mathbb{N}$, $\beta_1, ..., \beta_n \ (\beta_j \in \mathbb{R})$, and $y_1, ..., y_n \ (y_j \in \mathcal{X})$. Next, we define the following inner product between $f$ and $g$:

$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j).$$

It is possible that the coefficients $\{\alpha_i\}$ and $\{\beta_j\}$ are not unique. That is, a function $f$ (or $g$) may be represented in multiple ways with different coefficients. To see that the inner product is still well-defined, note that

$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j) = \sum_{j} \beta_j f(y_j)$$

by definition. This shows that $\langle \cdot, \cdot \rangle$ does not depend on the particular expansion coefficients $\{\alpha_i\}$. Similarly, $\langle f, g \rangle = \sum_i \alpha_i g(x_i)$ shows that the inner product does not depend on $\{\beta_j\}$ either. The positivity $\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \ge 0$ follows from the positive definiteness of $k$. The other axioms are easily checked. One notable property of the defined inner product is as follows. By choosing $g(\cdot) = k(\cdot, y)$ we have $\langle f, k(\cdot, y) \rangle = f(y)$ by definition. Furthermore, with $f(\cdot) = k(\cdot, x)$ we have

$$\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y),$$

which is called the reproducing property.

Finally, the space can be completed to a Hilbert space, which is called the Reproducing Kernel Hilbert Space. Below is the formal definition of the space:

Definition 2.5 (Reproducing Kernel Hilbert Space). Let $\mathcal{X}$ be a nonempty set and $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) endowed with the inner product $\langle \cdot, \cdot \rangle$ if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following properties:

1. $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular, $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$.

2. $k$ spans $\mathcal{H}$, that is, $\mathcal{H} = \overline{\mathrm{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where $\overline{X}$ denotes the completion of the set $X$.

We have seen that an RKHS can be constructed from a positive definite kernel in three steps. The converse is also true: if an RKHS $\mathcal{H}$ is given, then a unique positive definite kernel can be defined as the inner product of the space $\mathcal{H}$.

Finally, we show that Mercer kernels are positive definite in the generalized sense.

Theorem 2.6 (Equivalence of Positive Definiteness). Let $\mathcal{X} = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is a positive definite kernel if and only if

$$\int_{[a,b] \times [a,b]} k(x, y) f(x) \overline{f(y)}\, dx\, dy \ge 0$$

for any continuous function $f : \mathcal{X} \to \mathbb{C}$.

In this regard, every Mercer kernel $k$ has an RKHS as a feature space for which $k$ is the reproducing kernel.
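The three-step construction above is easy to mimic numerically. The sketch below (added for illustration; the class name and the choice of the Gaussian RBF kernel are mine, not the thesis's) represents functions $f(\cdot) = \sum_i \alpha_i k(\cdot, x_i)$, implements the inner product $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, y_j)$, and checks the reproducing property $\langle f, k(\cdot, y) \rangle = f(y)$:

    import numpy as np

    def rbf(x, y, sigma=1.0):
        """Gaussian RBF kernel on R^D, a standard positive definite kernel."""
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    class PreHilbertFunction:
        """A function f(.) = sum_i alpha_i k(., x_i) in the span of kernel sections."""
        def __init__(self, kernel, alphas, points):
            self.kernel, self.alphas, self.points = kernel, np.asarray(alphas), points

        def __call__(self, x):
            return sum(a * self.kernel(x, xi) for a, xi in zip(self.alphas, self.points))

        def inner(self, other):
            """<f, g> = sum_{i,j} alpha_i beta_j k(x_i, y_j)."""
            return sum(a * b * self.kernel(xi, yj)
                       for a, xi in zip(self.alphas, self.points)
                       for b, yj in zip(other.alphas, other.points))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 2))
    f = PreHilbertFunction(rbf, [0.5, -1.0, 2.0], list(X))

    # Reproducing property: <f, k(., y)> equals the function value f(y).
    y = rng.standard_normal(2)
    k_y = PreHilbertFunction(rbf, [1.0], [y])
    print(f.inner(k_y), f(y))      # the two numbers agree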

2.2.4 Examples of kernels

There are an ever-expanding number of kernels for various types of data and applications, and we can only glimpse a portion of them here. Below is a list of the most often-used kernels for Euclidean data. Let $x, y \in \mathcal{X} \subset \mathbb{R}^D$.

• Homogeneous polynomial kernel: $k(x, y) = \langle x, y \rangle^d$.

• Nonhomogeneous polynomial kernel: $k(x, y) = (\langle x, y \rangle + c)^d$, $c \in \mathbb{R}$.

• Gaussian RBF kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, $\sigma > 0$. The Gaussian RBF kernel has the following characteristics: 1) the points in the feature space lie on a sphere, since $\|\phi(x)\|^2 = 1$; 2) the angle between two mapped points $\phi(x)$ and $\phi(y)$ is at most $\pi/2$; and 3) the feature space is infinite-dimensional.

These kernels were the first to be used with large margin classifiers [11]. They can be evaluated in closed form without having to construct the feature spaces explicitly. Further work has discovered other types of kernels that can be evaluated efficiently by a recursion. These include the following two kernels:

• All-subsets kernel [75]: Let $I = \{1, 2, ..., D\}$ be the index set for the variables $x_i$, $i \in I$. For every subset $A$ of $I$, let us define $\phi_A(x) = \prod_{i \in A} x_i$. For $A = \emptyset$ we define $\phi_\emptyset(x) = 1$. If $\phi(x)$ is the sequence $(\phi_A(x))_{A \subseteq I}$, then the all-subsets kernel is

$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{A \subseteq I} \phi_A(x) \phi_A(y) = \sum_{A \subseteq I} \prod_{i \in A} x_i y_i = \prod_{i=1}^{D} (1 + x_i y_i).$$

• ANOVA kernel [87]: It is defined similarly to the all-subsets kernel. Define $\phi(x)$ as the sequence $(\phi_A(x))_{A \subseteq I, |A| = d}$, where we restrict $A$ to the subsets of cardinality $d$. Then the kernel is

$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{|A| = d} \phi_A(x) \phi_A(y) = \sum_{1 \le i_1 < \cdots < i_d \le D} x_{i_1} y_{i_1} \cdots x_{i_d} y_{i_d}.$$
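Both of these kernels can indeed be evaluated without enumerating subsets: the all-subsets kernel is the closed-form product above, and the ANOVA kernel satisfies the standard recursion $K_s(m) = K_s(m-1) + x_m y_m K_{s-1}(m-1)$ over the first $m$ variables. The sketch below is my illustration of that recursion (the function names are mine, not from the thesis), checked against a brute-force sum over subsets:

    import numpy as np
    from itertools import combinations

    def all_subsets_kernel(x, y):
        """k(x, y) = prod_i (1 + x_i * y_i): sums phi_A over all subsets A implicitly."""
        return np.prod(1.0 + x * y)

    def anova_kernel(x, y, d):
        """ANOVA kernel of order d via the dynamic-programming recursion
        K_s(m) = K_s(m-1) + x_m * y_m * K_{s-1}(m-1)."""
        D = len(x)
        K = np.zeros((d + 1, D + 1))
        K[0, :] = 1.0                      # the empty product
        for s in range(1, d + 1):
            for m in range(1, D + 1):
                K[s, m] = K[s, m - 1] + x[m - 1] * y[m - 1] * K[s - 1, m - 1]
        return K[d, D]

    def anova_bruteforce(x, y, d):
        """Direct sum over all index subsets of size d (exponential, for checking only)."""
        return sum(np.prod([x[i] * y[i] for i in A])
                   for A in combinations(range(len(x)), d))

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    print(np.isclose(anova_kernel(x, y, 3), anova_bruteforce(x, y, 3)))   # True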


2.2.5 Generating new kernels from old kernels

Theorem 2.7. If $k_1(x, y)$ and $k_2(x, y)$ are positive definite kernels, then the following kernels are also positive definite:

1. Conic combination: $\alpha_1 k_1(x, y) + \alpha_2 k_2(x, y)$, $(\alpha_1, \alpha_2 > 0)$

2. Pointwise product: $k_1(x, y)\, k_2(x, y)$

3. Integration: $\int k(x, z)\, k(y, z)\, dz$

4. Product with a rank-1 kernel: $k(x, y) f(x) f(y)$

5. Limit: if $k_1(x, y), k_2(x, y), ...$ are positive definite kernels, then so is $\lim_{i \to \infty} k_i(x, y)$.

Proofs can be found in [69, 71].

Corollary 2.8. If $k$ is a positive definite kernel, then so are $f(k(x, y))$ and $\exp k(x, y)$, where $f : \mathbb{R} \to \mathbb{R}$ is any polynomial function with nonnegative coefficients.
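As a quick empirical illustration of these closure properties (my addition, not from the original text), one can build Gram matrices for two base kernels on random points and confirm that the combinations above remain positive semidefinite by inspecting their smallest eigenvalues:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 4))

    # Two base Gram matrices: a linear kernel and a Gaussian RBF kernel.
    K1 = X @ X.T
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K2 = np.exp(-sq / 2.0)

    def min_eig(K):
        return np.linalg.eigvalsh(K).min()

    # Conic combination, pointwise product, and exp of a kernel are all (numerically) PSD.
    for K in (2.0 * K1 + 3.0 * K2, K1 * K2, np.exp(K1)):
        print(min_eig(K) >= -1e-8)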

2.2.6 Distance and conditionally positive definite kernels

In this subsection I review the relationship between distances and conditionally positive definite kernels.

Distance and metric

Throughout the thesis I will use the term distance interchangeably with similarity measure, to denote an intuitive notion of 'closeness' between two patterns in the data. Therefore a distance $d(\cdot, \cdot)$ is any assignment of nonnegative values to a pair of points in a set $\mathcal{X}$. A metric is, however, a distance that satisfies additional axioms:

Definition 2.9 (Metric). A real-valued function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a metric if

1. $d(x_1, x_2) \ge 0$,

2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$,

3. $d(x_1, x_2) = d(x_2, x_1)$,

4. $d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)$,

for all $x_1, x_2, x_3 \in \mathcal{X}$.

Relationship between metric and kernel

The standard metric $d(\phi(x_1), \phi(x_2))$ in the feature space is the norm $\|\phi(x_1) - \phi(x_2)\|$ induced from the inner product. The metric can be written in terms of the kernel as

$$d^2(\phi(x_1), \phi(x_2)) = k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2). \qquad (2.1)$$

Therefore any RKHS is also a metric space $(\mathcal{H}, d)$ with the metric given above. Conversely, if a metric is given that is known to be induced from an inner product, then we can recover the inner product from the polarization of the metric:

$$\tilde{k}(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle = \frac{1}{2} \left( -\|\phi(x_1) - \phi(x_2)\|^2 + \|\phi(x_1)\|^2 + \|\phi(x_2)\|^2 \right).$$

This raises the following question: if we are given a set and a metric $(\mathcal{X}, d)$, can we determine whether $d$ is induced from a positive definite kernel? To answer the question we need the following definition.

Definition 2.10 (Conditionally Positive Definite Kernel). Let $\mathcal{X}$ be any set, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a conditionally positive definite kernel function if

$$\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0,$$

for all $x_1, ..., x_n \ (x_i \in \mathcal{X})$ and $c_1, ..., c_n \ (c_i \in \mathbb{R})$ such that $\sum_{i=1}^{n} c_i = 0$, for any $n \in \mathbb{N}$.

The question above is answered by the following theorem [67]:

Theorem 2.11 (Schoenberg). A metric space $(\mathcal{X}, d)$ can be embedded isometrically into a Hilbert space if and only if $-d^2(\cdot, \cdot)$ is conditionally positive definite.

As a corollary, we have

Corollary 2.12 ([35]). A metric $d$ is induced from a positive definite kernel if and only if

$$\tilde{k}(x_1, x_2) = -d^2(x_1, x_2)/2, \quad x_1, x_2 \in \mathcal{X} \qquad (2.2)$$

is conditionally positive definite.

It is known that one can use conditionally positive definite kernels just as positive definite kernels in learning problems that are invariant to the choice of origin [68].
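Equation (2.1) is what allows kernel methods to reason about feature-space geometry without explicit features. A minimal sketch (added for illustration, not part of the original text) evaluates the feature-space distance induced by a Gaussian RBF kernel directly from kernel values:

    import numpy as np

    def rbf(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    def feature_distance(k, x1, x2):
        """d(phi(x1), phi(x2)) from Eq. (2.1): sqrt(k(x1,x1) + k(x2,x2) - 2 k(x1,x2))."""
        return np.sqrt(max(k(x1, x1) + k(x2, x2) - 2.0 * k(x1, x2), 0.0))

    rng = np.random.default_rng(0)
    x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

    # For the RBF kernel, k(x, x) = 1, so every mapped point lies on the unit sphere
    # and the squared feature distance is 2 - 2 k(x1, x2) <= 2.
    print(feature_distance(rbf, x1, x2))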

2.3 Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning method used for classification. Due to its computational efficiency and theoretically well-understood generalization performance, the SVM has received a lot of attention in the last decade and is still one of the main topics in machine learning research. In this section I review the basics of the SVM.

I use the notation $\mathcal{D} = \{(x_1, y_1), ..., (x_N, y_N)\}$ to denote $N$ pairs of a training sample $x_i \in \mathbb{R}^D$ and its class label $y_i \in \{-1, 1\}$, $i = 1, 2, ..., N$.

2.3.1 Large margin linear classifier

Consider the problem of separating the two-class training data $\mathcal{D} = \{(x_1, y_1), ..., (x_N, y_N)\}$ with a hyperplane

$$\mathcal{P} : \langle w, x \rangle + b = 0.$$

Let's assume the data are linearly separable, that is, we can separate the data with a hyperplane without error (refer to Figure 2.2). Since the equation $\langle c \cdot w, x \rangle + c \cdot b = 0$ represents the same hyperplane for any nonzero $c \in \mathbb{R}$, we choose a canonical representation of the hyperplane by setting $\min_i |\langle w, x_i \rangle + b| = 1$. The linear separability can then be expressed as

$$y_i (\langle w, x_i \rangle + b) \ge 1, \quad i = 1, ..., N, \qquad (2.3)$$

and the distance of a point $x$ to the hyperplane $\mathcal{P}$ is given by

$$d(x, \mathcal{P}) = \frac{|\langle w, x \rangle + b|}{\|w\|}.$$

We define the margin of the hyperplane as twice the minimum distance between the training samples and the hyperplane,

$$\rho = 2 \min_i d(x_i, \mathcal{P}),$$

which under the canonical representation can be shown to equal $\rho = \frac{2}{\|w\|}$.

If the data are linearly separable, there are typically an infinite number of hyperplanes that separate the classes correctly. However, the main idea of the SVM is to choose the one that has the maximum margin. Therefore the maximum margin classifier is the solution to the optimization problem

$$\min_{w, b} \; \frac{1}{2} \|w\|^2, \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1, \; i = 1, ..., N. \qquad (2.4)$$

[Figure 2.2: Example of classifying two-class data with a hyperplane $\langle w, x \rangle + b = 0$. In this case the data can be separated without error. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.]

2.3.2 Dual problem and support vectors

The primal problem (2.4) is a convex optimization problem with linear constraints. By Lagrangian duality, solving the primal problem is equivalent to solving the dual problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to} \quad \alpha_i \ge 0, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0. \qquad (2.5)$$

The advantages of the dual formulation are two-fold: 1) the dual problem is often easier to solve than the primal problem, and 2) it provides a geometrically meaningful interpretation of the solution.

If $\alpha^*$ is the optimal solution of (2.5), then the optimal values of the primal variables are given by $w^* = \sum_i \alpha_i^* y_i x_i$ and $b = -\frac{1}{2} \langle w^*, x_+ + x_- \rangle$, where $x_+$ and $x_-$ are positive- and negative-class samples such that $\langle w, x_+ \rangle + b = 1$ and $\langle w, x_- \rangle + b = -1$, respectively.

The resulting classifier for test data is then

$$f(x) = \mathrm{sgn}(\langle w^*, x \rangle + b) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\Big), \quad \text{where} \quad \mathrm{sgn}(z) = \begin{cases} -1, & z < 0 \\ 0, & z = 0 \\ 1, & z > 0. \end{cases}$$

The Kuhn-Tucker conditions of the optimization problem require

$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0, \quad i = 1, ..., N.$$

This implies that only the points $x_i$ that satisfy $y_i (\langle w, x_i \rangle + b) = 1$ will have a nonzero dual variable $\alpha_i$. These points are called support vectors, since they are the only points needed to define the decision function in the linearly separable case.
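To see the sparsity of the dual solution in practice, the brief sketch below (my addition; it assumes the scikit-learn library, which the thesis does not mention) trains a linear SVM on separable toy data and inspects which training points become support vectors:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)

    # Only the points with nonzero alpha_i (the support vectors) define the hyperplane.
    print("number of support vectors:", len(clf.support_))
    print("w* =", clf.coef_.ravel(), " b =", clf.intercept_[0])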

2.3.3 Extensions

Non-separable case: soft-margin SVM

Suppose the data are not linearly separable, so that the constraints (2.3) need to be relaxed to

$$y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, ..., N,$$

for the problem to be feasible. A soft-margin SVM is defined by the optimization

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i, \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, ..., N, \qquad (2.6)$$

where $C$ is a fixed parameter that determines the trade-off between the margin and the classification error in the cost. The primal problem (2.6) also has an equivalent dual problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0. \qquad (2.7)$$

The regularization parameter $C$ should reflect prior knowledge of the amount of noise in the data.

Nonlinear separation: kernel SVM

We obtain a nonlinear version of the SVM by mapping the space $\mathcal{X}$ to an RKHS via a kernel function $k$. The kernel SVM is implemented simply by replacing the Euclidean inner product $\langle x_i, x_j \rangle$ with a given kernel function $k(x_i, x_j)$. After the replacement, the soft-margin SVM problem (2.7) becomes

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_i \alpha_i, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0, \qquad (2.8)$$

and the resulting decision function for test data is given in terms of the kernel function:

$$f(x) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i k(x_i, x) + b\Big).$$

Since $K_{ij} = k(x_i, x_j)$ is a fixed matrix, the optimization in the training phase is no more difficult than solving the linear SVM. The resulting decision function can classify highly nonlinear, complicated data distributions at the same cost as training the simple linear classifier.
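In practice this means a standard SVM solver only ever needs the Gram matrix $K_{ij} = k(x_i, x_j)$. The following sketch (my illustration, again assuming scikit-learn) trains a soft-margin SVM from a precomputed degree-2 polynomial kernel, the same kernel used in the motivating example of Section 2.2.1:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 2))
    y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)     # elliptical decision boundary

    def poly2_gram(A, B):
        """Gram matrix of the degree-2 polynomial kernel k(x, y) = <x, y>^2."""
        return (A @ B.T) ** 2

    X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]

    # The solver only needs kernel evaluations, never the explicit feature map.
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(poly2_gram(X_tr, X_tr), y_tr)
    accuracy = (clf.predict(poly2_gram(X_te, X_tr)) == y_te).mean()
    print(accuracy)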

2.3.4 Generalization error and overfitting

The success of SVM algorithms in practice can be ascribed to their ability to bound generalization errors. I will not go into this vast topic, but would like to point out the following fact: the maximization of the margin corresponds to the minimization of the capacity (or the complexity) of the hyperplane, which helps to avoid overfitting.

2.4 Discriminant Analysis<br />

A discriminant analysis technique is a method to find a low-dimensional subspace of the<br />

input space which preserves the discriminant features of multiclass data. Figure 2.3 illus-<br />

trates the idea for a two-class toy problem.<br />

I introduce two techniques: Fisher Discriminant Analysis (FDA) (or Linear Discrim-<br />

inant Analysis) [25] and Nonparametric Discriminant Analysis (NDA) [12]. Originally<br />

these algorithms are developed and used for low-dimensional Euclidean data. I will dis-<br />

cuss the challenges and solutions when the techniques are applied to high-dimensional data,<br />

and describe their extensions to nonlinear discrimination problems with kernels.<br />

Both FDA and NDA are discriminant analysis techniques which find a subspace that<br />

maximizes the ratio of between-class scatter Sb and within-class scatter Sw after the data<br />

are projected onto the subspace. The objective function for one-dimensional case is the<br />

Rayleigh quotient<br />

J(w) = w′ Sbw<br />

w ′ Sww , w ∈ RD ,<br />

24


Figure 2.3: The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto the one-dimensional directions denoted by the arrows. The projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas the projection in the Fisher direction yields the least overlapping, and therefore the most discriminant, one-dimensional distributions. This illustration is adapted from [58].

where D is the dimension of the data space. For multiclass data there are several options for the objective function [25]. The most widely used objective is the multiclass Rayleigh quotient

$$J(W) = \mathrm{tr}\left[ (W' S_w W)^{-1} W' S_b W \right] \tag{2.9}$$

where W is a D × d matrix, and d < D is the low-dimensional feature dimension. The quotient measures the class separability in the subspace span(W) similarly to the one-dimensional case.


2.4.1 Fisher Discriminant Analysis<br />

Let $\{x_1, \ldots, x_N\}$ be the data vectors and $\{y_1, \ldots, y_N\}$ be the class labels, $y_i \in \{1, \ldots, C\}$. Without loss of generality we assume the data are ordered according to the class labels: $1 = y_1 \le y_2 \le \ldots \le y_N = C$. Each class c has $N_c$ samples.

Let $\mu_c = \frac{1}{N_c}\sum_{\{i \mid y_i = c\}} x_i$ be the mean of class c, and $\mu = \frac{1}{N}\sum_i x_i$ be the global mean.

The between-scatter and within-scatter matrices of FDA are defined as follows:

$$S_b = \frac{1}{N}\sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)'$$

$$S_w = \frac{1}{N}\sum_{c=1}^{C}\sum_{\{i \mid y_i = c\}} (x_i - \mu_c)(x_i - \mu_c)'$$

When $S_w$ is nonsingular, which is typically the case for low-dimensional data (D < N), the optimal W is found from the leading eigenvectors of $S_w^{-1} S_b$. Since $S_w^{-1} S_b$ has rank C − 1, there are C − 1 sequential optima $W = \{w_1, \ldots, w_{C-1}\}$. By projecting data onto span(W), we achieve dimensionality reduction and feature extraction of the data onto the most discriminant directions.

To classify points with the simple k-NN classifier, one can use the distance of data<br />

projected onto span(W ), or use the Mahalanobis distance to the projected mean of each<br />

class:<br />

$$\arg\min_j\ d_j(x) = [W'(x - \mu_j)]'\,(W' S_w W)^{-1}\,[W'(x - \mu_j)]. \tag{2.10}$$
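As a concrete illustration of the procedure above, here is a minimal NumPy sketch under my own naming (not a reference implementation), with the regularization of (2.11) applied to keep $S_w$ invertible:

```python
import numpy as np

def fda_directions(X, y, sigma2=1e-3):
    """X: (N, D) data matrix, y: (N,) integer class labels. Returns W of shape (D, C-1)."""
    classes = np.unique(y)
    N, D = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
    Sb /= N
    Sw /= N
    Sw += sigma2 * np.eye(D)                             # regularization, cf. (2.11)
    # The discriminant directions are the leading eigenvectors of Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:len(classes) - 1]
    return evecs[:, order].real
```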

2.4.2 Nonparametric Discriminant Analysis<br />

The FDA is motivated by the simple scenario in which the class-conditional distribution

p(x|ci) is Gaussian or at least has a peak around its mean µi. However, this assumption is<br />

easily violated, for example, by a distribution that has multiple peaks. The NDA tries to relax the parametric Gaussian assumption a little. The between-scatter and within-scatter

matrices of NDA are defined as

$$S_b = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \in B_i} (x_i - x_j)(x_i - x_j)'$$

$$S_w = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \in W_i} (x_i - x_j)(x_i - x_j)',$$

where $B_i$ is the index set of the K nearest neighbors of $x_i$ that belong to classes different from that of $x_i$, and $W_i$ is the index set of the K nearest neighbors of $x_i$ that belong to the same class as $x_i$. While FDA uses the global class mean $\mu_i$ as a representative of each class, NDA

uses the local class mean around the point of interest. This results in a tolerance to the<br />

non-Gaussianity or multimodality of the classes. When the number of nearest neighbors K<br />

increases, NDA behaves similarly to FDA.<br />

For classification tasks one can also use the simple k-NN rule restricted to the span(W )<br />

or the Mahalanobis distance similarly to FDA, although k-NN is more consistent with the<br />

nonparametric assumption of NDA.<br />

2.4.3 Discriminant analysis in high-dimensional spaces<br />

In the previous explanation of FDA, we assumed Sw is nonsingular. However, this is not the<br />

case for high-dimensional data. Note the ranks of Sw and Sb of FDA can be at most N − C<br />

and C − 1 respectively. The maxima are achieved when the data are not co-linear which is<br />

very likely for high-dimensional data [93]. Because the number of features d cannot exceed<br />

the rank of Sb, FDA can extract at most C − 1 features. For NDA, when K > 1, the rank<br />

of Sw can be up to N − C and the rank of Sb can be up to N − 1. However, for small K<br />

the ranks of Sw and Sb will also be small. The number of features NDA can extract is also<br />

less than the rank of $S_b$.



Because $S_w$ spans at most N − C dimensions while the span of the data is at least (N − 1)-dimensional, there is always a nullspace of $S_w$ inside the span of the data. Without regularization, both FDA and NDA always return directions in this nullspace of $S_w$ as the maximizers of the Rayleigh quotients.

This is not preferable because even a small change in the data can make a big change in<br />

the solution. One solution suggested in [7] is to use Principal Component Analysis to<br />

first reduce the dimensionality of data by projecting them to a subspace spanned by the<br />

N − C largest eigenvectors. In this subspace $S_w$ is likely to be well-conditioned. Another solution is to regularize the ill-conditioned matrix $S_w$ by adding an isotropic noise matrix,

$$S_w \leftarrow S_w + \sigma^2 I, \tag{2.11}$$

where σ determines the amount of regularization. I use the regularization approach in this<br />

thesis.<br />

2.4.4 Extension to nonlinear discriminant analysis<br />

From the discussion of kernel machines in the previous section, we know that a linear<br />

algorithm can be extended to nonlinear algorithms by using kernels. The Kernel FDA, also<br />

known as the Generalized Discriminant Analysis, has been fully studied by [5, 56, 57]. To<br />

summarize, Kernel FDA can be formulated as follows.<br />

Let φ : G → H be the feature map, and Φ = [φ1 ... φN] be the feature matrix of the<br />

training points. Assuming the FDA direction w in the feature space is a linear combination<br />

of the feature vectors, w = Φα, we can rewrite the Rayleigh quotient in terms of α as<br />

$$J(\alpha) = \frac{\alpha' \Phi' S_B \Phi \alpha}{\alpha' \Phi' S_W \Phi \alpha} = \frac{\alpha' K\big(V - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N'\big) K \alpha}{\alpha' \big(K(I_N - V)K + \sigma^2 I_N\big)\alpha}, \tag{2.12}$$

where K is the kernel matrix, $\mathbf{1}_N$ is the uniform vector $[1 \ \cdots \ 1]'$ of length N, and V is a block-diagonal matrix whose c-th block is the uniform matrix $\frac{1}{N_c}\mathbf{1}_{N_c}\mathbf{1}_{N_c}'$,

$$V = \begin{pmatrix} \frac{1}{N_1}\mathbf{1}_{N_1}\mathbf{1}_{N_1}' & & \\ & \ddots & \\ & & \frac{1}{N_C}\mathbf{1}_{N_C}\mathbf{1}_{N_C}' \end{pmatrix}.$$

The term $\sigma^2 I_N$ is a regularizer that makes the computation stable. Similarly to FDA, the set of optimal $\alpha$'s is computed from the eigenvectors of $K_w^{-1} K_b$, where $K_b$ and $K_w$ are the quadratic matrices in the numerator and the denominator of (2.12):

$$K_b = K\big(V - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N'\big)K$$

$$K_w = K(I_N - V)K + \sigma^2 I_N.$$

The NDA can be similarly kernelized by the assumption w = Φα and is omitted here.<br />
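To make the formulation concrete, the following NumPy sketch (my own illustration, not the implementation of [5, 56, 57]) builds $K_b$ and $K_w$ from a precomputed kernel matrix and a label vector, and returns the optimal α's:

```python
import numpy as np

def kernel_fda(K, y, sigma2=1e-3):
    """K: (N, N) kernel matrix, y: (N,) integer labels. Returns alphas of shape (N, C-1)."""
    N = K.shape[0]
    classes = np.unique(y)
    V = np.zeros((N, N))                      # block-diagonal matrix V
    for c in classes:
        idx = np.where(y == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)
    one = np.ones((N, 1))
    Kb = K @ (V - one @ one.T / N) @ K
    Kw = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)
    # Optimal alphas: leading eigenvectors of Kw^{-1} Kb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Kw, Kb))
    order = np.argsort(-evals.real)[:len(classes) - 1]
    return evecs[:, order].real
```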



Chapter 3<br />

MOTIVATION: SUBSPACE STRUCTURE IN DATA

3.1 Introduction<br />

In this chapter I discuss theoretical and empirical evidence of the subspace structures which naturally appear in real-world data.

The most prominent examples of subspaces can be found in the image-based face recog-<br />

nition problem. Face images show large variability due to identity, pose change, illumina-<br />

tion condition, facial expression, and so on. The Principal Component Analysis (PCA) has<br />

been applied to construct low-dimensional models of the faces by [73, 44] and used for<br />

recognition by [79], known as the Eigenfaces. Although the Eigenfaces were originally ap-<br />

plied to model image variations across different people, they also explain the illumination<br />

variation of a single person exceptionally well [32, 21, 94]. Theoretical explanations to the<br />

low-dimensionality of the illumination variability have been proposed by [8, 62, 61, 4].<br />

When the data consist of illumination-varying images of multiple people, I can model the data as a collection of illumination subspaces, one from each person. In this way,


I can absorb the undesired variability of illumination as variability within subspaces, and<br />

emphasize the variability of subject identity as variability between the subspaces. This<br />

idea not only applies to illumination change but also to other types of data that have known<br />

linear substructures. This is the main idea of the subspace-based approach advocated in<br />

this thesis.<br />

More recent examples of subspace structure are found in the dynamical system models of video sequences of, for example, human actions or time-varying textures [19, 80, 78]. When each sequence is modeled by a dynamical system, I can compare those dynamical systems by comparing the linear spans of the observability matrices of each system, which is similar to comparing the image subspaces.

In the rest of this chapter I explain the details of computing subspaces and estimating<br />

dynamical systems from image data. The procedures are demonstrated with the well-known<br />

image databases: the Yale Face, CMU-PIE, ETH-80, and IXMAS databases.<br />

3.2 Illumination subspaces in multi-lighting images<br />

Suppose we have a convex-shaped Lambertian object and a single distant light source illuminating the object. If we ignore the attached and the cast shadows on the object for now, then the observed intensity x (irradiance) of a surface patch is linearly related to the incident intensity of the light s (radiance) by the Lambertian reflectance model

$$x = \rho\, \langle b, s \rangle,$$

where ρ is the albedo of the surface, b is the surface normal, and s is the light source vector.<br />

If the whole image x is a collection of D pixel values, that is, $x = [x_1, \ldots, x_D]'$, then

$$x = Bs,$$

where $B = [\alpha_1 b_1, \ldots, \alpha_D b_D]'$ is the D × 3 matrix of albedos and surface normals. Thus, the set of images under all possible illuminations consists of linear combinations of the column vectors of B,

$$X = \{Bs \mid s \in \mathbb{R}^3\},$$

which is at most a three-dimensional subspace. However, this is an unrealistic model since it allows negative light intensities.

We get a more realistic model by disallowing negative intensities and allowing attached shadows as follows: $x_i = \max(\alpha_i \langle b_i, s\rangle, 0)$, and therefore $x = \max(Bs, 0)$, where the max operation is performed for each row.

An image under multiple light sources is a combination of single-distant-light images, $x = \sum_k \max(Bs_k, 0)$. As can be seen from the equation, the set of such images under all illuminations forms a convex cone [8]. However, the dimension of the subspace in which the cone lies can be as large as the number of pixels D in general, which is inconsistent with the empirical observations.
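As a quick numerical illustration of this point (a toy example of my own, not part of the cited analyses), one can draw a random Lambertian "object" B, render images $x = \max(Bs_k, 0)$ under many random light directions, and inspect the singular value spectrum; most of the energy is captured by the first few components:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_lights = 500, 200                      # number of pixels and light directions
B = rng.standard_normal((D, 3))             # albedo-scaled surface normals (toy model)
S = rng.standard_normal((3, n_lights))
S /= np.linalg.norm(S, axis=0)              # unit light-source directions
X = np.maximum(B @ S, 0.0)                  # images with attached shadows
s = np.linalg.svd(X, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print("energy captured by the first 5 components:", energy[4])
```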

Theoretical explanations of the low dimensionality have been offered by [8, 62, 61, 4] using spherical harmonics. Although the mathematics of the model is rather involved, the main idea can be summarized as follows. The interaction between a distant light source and a Lambertian surface is a convolution on the unit sphere. If we adopt a frequency-domain

representation of the light distributions and the reflectance function, then the interaction<br />

of an arbitrary light distribution with the surface can be computed by multiplication of<br />

coefficients w.r.t. the spherical harmonics, analogous to the Fourier analysis on real lines.<br />

Since the max operation can be well approximated by convolution with a low-pass<br />

filter, the resultant set of all possible illuminations can be expressed using only a few (4

to 9) harmonic basis images. Figure 3.1 shows the analytically computed PCA from this<br />

model.<br />



Figure 3.1: The figure shows the first five principal components of a face, computed analytically from a 3D model (top) and a sphere (bottom). These images match well with the empirical principal components computed from a set of real images. The figure is adapted from [61].

In the following two subsections I introduce two well-known face databases and show<br />

the PCA results from the data to demonstrate the low-dimensionality of illumination-<br />

varying images.<br />

Yale Face Database<br />

It is often possible to acquire multi-view, multi-lighting images simultaneously with a spe-<br />

cial camera rig. The Yale face database and the Extended Yale face database [26] together<br />

consist of pictures of 38 subjects with 9 different poses and 45 different lighting conditions.<br />

The original image is gray-valued, is 640 × 480 in size, and includes redundant back-<br />

ground objects. I crop and align the face regions by manually choosing a few feature points<br />

(center of eyes and mouth, and nose tip) for each image. The cropped images are resized<br />

to 32 × 28 pixels (D = 896) and normalized to have a unit variance. Figure 3.2 shows the<br />

first 10 subjects and all 9 poses under a fixed illumination condition.<br />




Figure 3.2: Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed<br />

illumination condition.<br />



To compute subspaces, I use all 45 illumination conditions of a person under a fixed pose, as depicted in Figure 3.3. The m-dimensional orthonormal basis is computed from the Singular Value Decomposition (SVD) of this set of data, as follows.

Let $X = [x_1, \ldots, x_N]$ be the D × N data matrix pertaining to all illuminations of a person at a fixed pose, and let $X = USV'$ be the SVD of the data, where $U'U = UU' = I_D$, $V'V = VV' = I_N$, and S is a D × N matrix whose elements are zero except on the diagonal $\mathrm{diag}(S) = [s_1, s_2, \ldots, s_{\min(D,N)}]'$. If the singular values are ordered as $s_1 \ge s_2 \ge \ldots \ge s_{\min(D,N)}$, then the m-dimensional basis for this set is given by the first m columns of U. In the coming chapters I will use a range of values for m in the experiments. The SVD procedure above is the same as the PCA procedure except that the mean is not removed from the data. The role of the mean will be discussed further in Chapter 6. When the mean is ignored, the PCA eigenvalues are related to the singular values by $\lambda_1 = s_1^2$, $\lambda_2 = s_2^2$, and so on.
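The procedure is a few lines of NumPy (a straightforward transcription of the description above; the function name is mine):

```python
import numpy as np

def subspace_basis(X, m):
    """X: (D, N) matrix of N vectorized images. Returns a (D, m) orthonormal basis."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]                          # first m left singular vectors

X = np.random.randn(896, 45)                 # stand-in for 45 images, D = 32 x 28 = 896
Y = subspace_basis(X, m=5)
print(np.allclose(Y.T @ Y, np.eye(5)))       # orthonormality check: True
```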

A few of the orthonormal bases computed from the procedure are shown in Figure 3.4,<br />

along with the spectrum of the singular values.<br />

CMU-PIE Database<br />

The CMU-PIE database is another multi-view, multi-lighting face database acquired with a<br />

camera rig. The database [72] consists of images from 68 subjects under 13 different poses<br />

and 43 different lighting conditions. Among the 43 lighting conditions I use 21 lighting<br />

conditions which have full pose variations.<br />

The original image is color-valued, is 640 × 480 in size, and includes redundant back-<br />

ground objects. I crop and align the face regions by manually choosing a few feature points<br />

(center of eyes and mouth, and nose tip) for each image. Among the 13 poses I use only 7 poses and discard the 6 poses that are close to a profile view. This is done to facilitate

the cropping process. The cropped images are resized to 24 × 21 pixels (D = 504) and<br />

normalized to have a unit variance. Figure 3.5 shows the first 10 subjects at 7 poses under<br />



a fixed illumination condition.<br />

To compute subspaces, I use all 21 illumination conditions of a person at a fixed pose<br />

(refer to Figure 3.6). The m-dimensional orthonormal basis is computed from the Singular

Value Decomposition (SVD) of this set of data similarly to the Yale Face database.<br />

A few of the orthonormal bases computed from the database are shown in Figure 3.7,<br />

along with the spectrum of the singular values.<br />



Figure 3.3: Yale Face Database: all illumination conditions of a person at a fixed pose used<br />

to compute the corresponding illumination subspace.<br />


Figure 3.4: Yale Face Database: examples of basis images and (cumulative) singular values.<br />




Figure 3.5: CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed<br />

illumination condition.<br />



Figure 3.6: CMU-PIE Database: all illumination conditions of a person at a fixed pose used<br />

to compute the corresponding illumination subspace.<br />


Figure 3.7: CMU-PIE Database: examples of basis images and (cumulative) singular values.<br />



3.3 Pose subspaces in multi-view images<br />

We have seen that illumination change can be approximated well by linear subspace. The<br />

change in the pose of the object and/or the camera, however, is harder to analyze without<br />

knowing the 3D geometry of the object. Since we are often given 2D image data only and<br />

do not know the underlying geometry, it is useful to construct image-based models of an<br />

object or a face under pose changes.<br />

A popular multi-pose representation of images is the light-field representation, which models the radiance of light as a function of the 5D pose of the observer [29, 30, 95].

Theoretically, the light-field model provides pose-invariant recognition of images taken<br />

with arbitrary camera and pose when the illumination condition is fixed. Zhou et al. ex-<br />

tended the light-field model to a bilinear model which allows simultaneous change of pose<br />

and illumination [95]. An alternative method is proposed in [34] which uses a generative<br />

model of a warped illumination subspace. Image variations due to illumination change are<br />

accounted for by a low-dimensional linear subspace, whereas variations due to pose change<br />

are approximated by a geometric warping of images in the subspace.<br />

The studies above indicate the nonlinearity of the pose-varying images in general. How-<br />

ever, the dimensionality of the images as a nonlinear manifold is rather small, since there<br />

are at most 6 degrees of freedom for the pose space (=E(3)). Therefore, when the range<br />

of the pose variation is limited, the nonlinear structure can be contained inside a low-<br />

dimensional subspace, and the nonlinear submanifolds can be distinguished by their en-<br />

closing subspaces. Although a general method of adopting nonlinearity is possible and<br />

will be discussed in Section 5.2.4, here I use linear subspaces as the simplest model of the<br />

pose variations. The approximation by a subspace is demonstrated with the ETH-80 object<br />

database in the following subsection.<br />




Figure 3.8: ETH-80 Database: all categories and objects at a fixed pose.<br />

ETH-80 Database<br />

The ETH-80 [48] is an object database designed for object categorization test under varying<br />

poses. The database consists of pictures of 8 object categories: ‘apple’, ‘pear’, ‘tomato’, ‘cow’, ‘dog’, ‘horse’, ‘cup’, and ‘car’. Each category has 10 object instances that belong to the category, and each object is recorded under 41 different poses.

There are several versions of the data. The one I use is color-valued and 256 × 256 in<br />

size. The images are resized to 32 × 32 pixels (D = 1024) and normalized to have a unit<br />

variance. Figure 3.8 shows the 8 categories under 10 poses at a fixed viewpoint.<br />

From the spectrum I can determine how good the m-dimensional approximation is by<br />

looking at the value at m. For example, if the 5-th cumulative singular value is 0.92, it<br />

means that the 5-dimensional subspace captures 92 percent of the total variations of the<br />

data (including the bias of the mean).<br />



To compute subspaces, I use all 41 different poses of an object from a category as shown<br />

in Figure 3.9. The m-dimensional orthonormal basis is computed from the SVD of this set of

data. A few of the orthonormal bases computed from the database are shown in Figure 3.10<br />

along with the spectrum of the singular values.<br />



Figure 3.9: ETH-80 Database: all poses of an object from a category used to compute the<br />

corresponding pose subspace of the object.<br />


Figure 3.10: ETH-80 Database: examples of basis images and (cumulative) singular values.<br />



3.4 Video sequences of human motions<br />

Suppose we have a video sequence of a person performing an action. The sequence is more<br />

than just a set of images because of the temporal information contained in the sequence.<br />

To capture both the appearance and the temporal dynamics, we often use linear dynamical<br />

systems in modeling the sequence. In particular, the Auto-Regressive Moving Average<br />

(ARMA) has been used to model moving human bodies or textures in computer vision<br />

[19, 80]. The ARMA model can be described as follows.<br />

Let y(t) be the D × 1 observation vector and x(t) the d × 1 internal state vector, for t = 1, ..., T. Then the states evolve according to the linear time-invariant dynamics

$$x(t + 1) = Ax(t) + v(t) \tag{3.1}$$

$$y(t) = Cx(t) + w(t), \tag{3.2}$$

where v(t) and w(t) are additive noise terms.

A probabilistic version of the model assumes that the observations, the states, and the noise are Gaussian distributed, with $v(t) \sim \mathcal{N}(0, Q)$ and $w(t) \sim \mathcal{N}(0, R)$.

This allows us to use statistical techniques such as Maximum Likelihood Estimation to<br />

infer the states and to estimate parameters only from the observed data y(1), ..., y(T ). The<br />

estimation problem is known as the system identification problem and a good textbook on<br />

the topic is [52]. The estimation method I use in this thesis is one of the simplest and is described in the next section.

For now, let us go back to the original question of comparing image sequences using the ARMA model. If we have the parameters $A_i, C_i$ for each sequence $i = 1, \ldots, N$ in the data, the simplest method of comparing two such sequences is to measure the sum of squared differences

$$d_{i,j}^2 = \|A_i - A_j\|_F^2 + \|C_i - C_j\|_F^2, \tag{3.3}$$

ignoring the noise statistics, which are of less importance.

However, it is well-known that the parameters are not unique. If we change the basis<br />

for the state variables to define new state variables by ˆx = Gx, where G is any d × d<br />

nonsingular matrix, then the same system can be described with different coefficients such<br />

that<br />

$$\hat{x}(t + 1) = \hat{A}\hat{x}(t) + \hat{v}(t)$$

$$\hat{y}(t) = \hat{C}\hat{x}(t) + \hat{w}(t),$$

where $\hat{A} = GAG^{-1}$, $\hat{C} = CG^{-1}$, $\hat{v} = Gv$, and $\hat{w} = w$. Unfortunately, the simple

distance (3.3) is not invariant under the coordinate change. I will defer the discussion of<br />

other invariant distances for dynamical systems to the next two chapters, and proceed with<br />

the basic idea in this chapter.<br />

One of the coordinate-independent representations of the system is given by the infinite observability matrix [17]

$$O_{C,A} = \begin{pmatrix} C \\ CA \\ CA^2 \\ \vdots \end{pmatrix}, \tag{3.4}$$

which is the row-wise concatenation of the matrices $CA^n$ for $n = 0, 1, 2, \ldots$. Note that after


the coordinate change $\hat{x} = Gx$, the new observability matrix becomes

$$O_{\hat{C},\hat{A}} = \begin{pmatrix} CG^{-1} \\ CAG^{-1} \\ CA^2G^{-1} \\ \vdots \end{pmatrix} = O_{C,A}\, G^{-1}, \tag{3.5}$$

which is the original observability matrix multiplied by $G^{-1}$ on the right. This suggests that if we consider the linear span of the column vectors of O, instead of the matrix O itself, to represent the dynamical system, then the representation is clearly invariant to the choice

(infinite-dimensional) space of all possible ARMA models of the same size, each model<br />

of a sequence occupies the subspace spanned by the columns of O. In the next section I<br />

will introduce the IXMAS database and explain how I preprocess the data to compute this<br />

linear structure.<br />

IXMAS Database<br />

The INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multiview video database<br />

for view-invariant human action recognition [89]. The database consists of 11 daily-life motions (‘check watch’, ‘cross arms’, ‘scratch head’, ‘sit down’, ‘get up’, ‘turn around’, ‘walk’, ‘wave hand’, ‘punch’, ‘kick’, ‘pick up’), performed by 11 actors and repeated 3 times. The motions are recorded by 5 calibrated and synchronized cameras at 23 fps at 390 × 29 resolution. Figure 3.11 shows sample sequences of an actor performing the 11 actions at a fixed view.

The authors of [89] propose further processing of the database. The appearances of<br />

the actors such as clothes are irrelevant to actions, and therefore image silhouettes are<br />



computed to extract shapes from each camera. These silhouettes are combined to carve<br />

out the 3D visual hull of the actor represented by 3D occupancy volume data V (x, y, z) as<br />

shown in Figure 3.12.<br />

However, the actions performed by different actors still have a lot of variabilities as<br />

demonstrated in Figure 3.13.<br />

The variabilities irrelevant to action recognition include the following. Firstly, the actors have different heights and body shapes, and therefore the volumes have to be rescaled along each axis. Secondly, the actors freely choose their position and orientation, and therefore the volumes have to be centered and reoriented. The rescaling and centering can be done by computing the center of mass and second-order moments of the volumes and then standardizing the volumes. However, the orientation variability requires further processing. The authors suggest converting the Cartesian coordinate representation V(x, y, z) to the cylindrical coordinate representation V(r, θ, z) and then performing a 1D circular Fourier Transform along the θ axis to get FFT(V(r, θ, z)). By taking only the magnitude of the transform, the resultant feature |FFT(V(r, θ, z))| becomes rotation-invariant around the z-axis. The resultant feature of the 3D volume is a D = 16384 = 32³/2-dimensional vector. Note this FFT is computed per frame and is not to be confused with a temporal FFT along the frames.

Figure 3.14 shows a sample snapshot of the processed features.<br />
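A minimal sketch of this rotation-invariant feature, assuming the occupancy volume has already been resampled onto an (r, θ, z) grid (the resampling step is omitted, and keeping half of the spectrum is my reading of the stated dimensionality D = 32³/2):

```python
import numpy as np

def rotation_invariant_feature(V_cyl):
    """V_cyl: occupancy volume on an (r, theta, z) grid of shape (R, Theta, Z)."""
    # A rotation about z is a circular shift along theta, which leaves the FFT magnitude unchanged.
    F = np.abs(np.fft.fft(V_cyl, axis=1))      # 1D circular FFT along the theta axis
    F = F[:, :V_cyl.shape[1] // 2, :]          # half the spectrum (conjugate symmetry)
    return F.ravel()

V_cyl = np.random.rand(32, 32, 32)             # toy stand-in for a 32^3 volume
print(rotation_invariant_feature(V_cyl).shape) # (16384,) = 32^3 / 2
```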

ARMA model of data<br />

Once the features are computed for each action, actor and frame, we can proceed to model<br />

the feature sequences using the ARMA model.<br />

I estimate the parameters using a fast approximate method based on the SVD of the observed data [19]. Let $USV' = [y(1), \ldots, y(T)]$ be the SVD of the data. Then the parameters C, A, and the states $x(1), \ldots, x(T)$ are sequentially estimated by

$$\tilde{C} = U, \qquad \tilde{x}(t) = \tilde{C}' y(t)$$

$$\tilde{A} = \arg\min_{A} \sum_{t=1}^{T-1} \|\tilde{x}(t + 1) - A\tilde{x}(t)\|^2.$$

I used d = 5 as the dimension of the state space.

The estimated $A_i$ and $C_i$ matrices for each sequence $i = 1, \ldots, N$ are used to form a finite observability matrix of size (Dd) × d:

$$O_i = [\,C_i' \ \ (C_iA_i)' \ \ \cdots \ \ (C_iA_i^{d-1})'\,]'.$$

A total of 363 = 11 (actions) × 3 (trials) × 11 (actors) observability matrices are computed as the final subspace representation of the database.
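The following NumPy sketch transcribes this estimation procedure (my own code, not that of [19]): it estimates C and A from a feature sequence, builds the finite observability matrix, and orthonormalizes its columns to obtain the subspace representation used later:

```python
import numpy as np

def arma_observability(Y, d=5):
    """Y: (D, T) feature sequence. Returns a (D*d, d) orthonormal basis of span(O_i)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                               # C_tilde = first d left singular vectors
    X = C.T @ Y                                # x_tilde(t) = C' y(t)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # least-squares fit of x(t+1) ~ A x(t)
    O = np.vstack([C @ np.linalg.matrix_power(A, n) for n in range(d)])
    Q, _ = np.linalg.qr(O)                     # orthonormal basis of the column span
    return Q
```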

3.5 Conclusion<br />

In this chapter I aimed to provide motivations for subspace representation with examples<br />

from image databases, which range from illumination-varying faces to video sequences<br />

modeled as dynamical systems. The procedures for computing subspaces from these databases<br />

were described.<br />

The goal of the subspace-based learning approach is to use this inherent linear structure

to emphasize the desired information and to de-emphasize the unwanted variations in the<br />

data. This approach translates to 1) illumination-invariant face recognition for the Yale<br />

Face and CMU-PIE databases, 2) pose-invariant object categorization with the ETH-80<br />

database, and 3) the video-based action recognition with the IXMAS database. However,<br />

I add a caveat that the invariant recognition problems above are different from the more<br />

general problem of recognizing a single test image, since at least a few test images are required to reliably compute the subspace.

In the next three chapters, I will use the computed subspaces from the databases to test<br />

various algorithms for subspace-based learning.<br />




Figure 3.11: IXMAS Database: video sequences of an actor performing 11 different actions<br />

viewed from a fixed camera.<br />



Figure 3.12: IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in the Cartesian coordinate system and later converted to the cylindrical coordinate system to apply the FFT.




Figure 3.13: IXMAS Database: the ‘kick’ action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.




Figure 3.14: IXMAS Database: cylindrical coordinate representation of the volume<br />

V (r, θ, z), and the corresponding 1D FFT feature abs(F F T (V (r, θ, z))), shown at a few<br />

values of θ.<br />



Chapter 4<br />

GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES

4.1 Introduction<br />

In the previous chapter, I discussed the examples of linear subspace structures found in<br />

the real-world data. In this chapter I introduce the Grassmann manifold as the common<br />

framework of subspace-based learning algorithms. While a subspace is certainly a linear<br />

space, the collection of linear subspaces is a totally different space of its own, which is<br />

known as the Grassmann manifold. The Grassmann manifold, named after the renowned<br />

mathematician Hermann Günther Grassmann (1809-1877), has long been known for its in-

triguing mathematical properties, and as an example of homogeneous spaces of Lie groups<br />

[86, 13]. However, its applications in computer science and engineering have appeared<br />

rather recently; in signal processing and control [74, 36, 6], numerical optimization [20]<br />

(and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78].<br />

Moreover, many works have used the subspace concept without explicitly relating their<br />

works to this mathematical object [92, 64, 24, 90, 85, 43, 3]. One of the goals of this thesis is to provide a unified view of the subspace-based algorithms in the framework of the

Grassmann manifold.<br />

In this chapter I define the Grassmann distance which provides a measure of (dis)similarity<br />

of subspaces, and review the known distances including the Arc-length, Projection, Binet-<br />

Cauchy, Max Corr, Min Corr, and Procrustes distances. Some of these distances have been<br />

studied in [20, 16], and I provide a more thorough analysis and proofs in this chapter. Fur-<br />

thermore, these distances will be used in conjunction with a k-NN algorithm to demonstrate<br />

their potential in classification tasks using the databases prepared in the previous chapter.

4.2 Stiefel and Grassmann manifolds<br />

In this section I introduce the Stiefel and the Grassmann manifolds by summarizing nec-<br />

essary definitions and properties of these manifolds from [28, 20, 16]. Although these<br />

manifolds are not linear spaces, I introduce these manifolds as subsets of Euclidean spaces<br />

and use matrix representations. This helps to understand the nonlinear spaces intuitively<br />

and also facilitates computations on these spaces.<br />

4.2.1 Stiefel manifold<br />

Let Y be a D × m matrix whose elements are real numbers. In optimization problems with<br />

the matrix variable Y , we often formulate the notion of normality by an orthonormality<br />

condition $Y'Y = I_m$.¹ This feasible set is neither linear nor convex, and in fact it is the Stiefel

manifold defined as follows:<br />

Definition 4.1. An m-frame is a set of m orthonormal vectors in R D (m ≤ D). The Stiefel<br />

manifold S(m, D) is the set of m-frames in R D .<br />

¹ Although the term ‘orthogonal’ is the more standard one for this condition, I use the term ‘orthonormal’ to clarify that each column of Y has unit length.



The Stiefel manifold S(m, D) is represented by the set of D × m matrices Y such that $Y'Y = I_m$. Therefore we can rewrite it as

$$S(m, D) = \{\, Y \in \mathbb{R}^{D \times m} \mid Y'Y = I_m \,\}.$$

There are D × m variables in Y and $\frac{1}{2}m(m + 1)$ independent conditions in the constraint $Y'Y = I_m$. Hence S(m, D) is an analytical manifold of dimension $Dm - \frac{1}{2}m(m + 1) = \frac{1}{2}m(2D - m - 1)$.

For m = 1, S(m, D) is the familiar unit sphere in $\mathbb{R}^D$, and for m = D, S(m, D) is the orthogonal group O(D) of D × D orthogonal matrices.

The Stiefel manifold can also be thought of as the quotient space<br />

S(m, D) = O(D)/O(D − m),<br />

under the right-multiplication by orthonormal matrices. To see this point, let X = [Y | Y ⊥ ] ∈<br />

O(D) be a representer of Y ∈ S(m, D), where the first m columns form the m-frame we<br />

care about and Y ⊥ is any D × (D − m) matrix such that Y Y ′ + Y ⊥ (Y ⊥ ) ′ = ID. Then<br />

the only subgroup of O(D) which leaves the m-frame unchanged, is the set of the block-<br />

diagonal matrix diag(Im, RD−m) where RD−m is any matrix in O(D − m). That is, the<br />

m-frame of X after the right multiplication

$$X \begin{pmatrix} I_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp}] \begin{pmatrix} I_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp} R_{D-m}]$$

remains the same as the m-frame of X.


4.2.2 Grassmann manifold<br />

The Grassmann manifold is a mathematical object with several similarities to the Stiefel<br />

manifold. In optimization problems with a matrix variable Y , we occasionally have a cost<br />

function which is affected only by span(Y) – the linear subspace spanned by the column

vectors of Y – and not by the specific values of Y . Such a condition leads to the concept of<br />

the Grassmann manifold defined as follows:<br />

Definition 4.2. The Grassmann manifold G(m, D) is the set of m-dimensional linear sub-<br />

spaces of the R D .<br />

For a Euclidean representation of the manifold, consider the space $\mathbb{R}^{(0)}_{D,m}$ of all D × m matrices $Y \in \mathbb{R}^{D \times m}$ of full rank m, and consider the group of transformations $Y \to YL$, where L is any nonsingular m × m matrix. The group defines an equivalence relation in $\mathbb{R}^{(0)}_{D,m}$: two elements $Y_1, Y_2 \in \mathbb{R}^{(0)}_{D,m}$ are the same if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$. Hence the equivalence classes of $\mathbb{R}^{(0)}_{D,m}$ are in one-to-one correspondence with the points of the Grassmann manifold G(m, D), and G(m, D) is thought of as the quotient space

$$G(m, D) = \mathbb{R}^{(0)}_{D,m} \,/\, \mathbb{R}^{(0)}_{m,m}.$$

G(m, D) is an analytical manifold of dimension $Dm - m^2 = m(D - m)$, since for each Y, regarded as a point in $\mathbb{R}^{Dm}$, the set of all elements YL in the equivalence class is a surface in $\mathbb{R}^{Dm}$ of dimension $m^2$.

The special case m = 1 is called the real projective space RP D−1 which consists of all<br />

lines through the origin.<br />

The Grassmann manifold can also be thought of as the quotient space

$$G(m, D) = O(D) \,/\, \big(O(m) \times O(D - m)\big),$$

under the right-multiplication by orthonormal matrices. To see this, let $X = [Y \mid Y^{\perp}] \in O(D)$ be a representer of $Y \in G(m, D)$, where we only care about the span of the first m columns and $Y^{\perp}$ is any D × (D − m) matrix such that $YY' + Y^{\perp}(Y^{\perp})' = I_D$. Then the only subgroup of O(D) which leaves the span of the first m columns unchanged is the set of block-diagonal matrices $\mathrm{diag}(R_m, R_{D-m})$, where $R_m$ and $R_{D-m}$ are any two matrices in O(m) and O(D − m) respectively. That is, the span of the first m columns of X after the right multiplication

$$X \begin{pmatrix} R_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp}] \begin{pmatrix} R_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [YR_m \mid Y^{\perp} R_{D-m}]$$

is the same as the span of the first m columns of X.

From the quotient space representations, we see that G(m, D) = S(m, D)/O(m). This<br />

is the representation I use throughout the thesis. To summarize, an element of G(m, D) is<br />

represented by an orthonormal matrix Y ∈ R D×m such that Y ′ Y = Im, with the equiva-<br />

lence relation:<br />

Definition 4.3. $Y_1 \cong Y_2$ if and only if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$.

We can also write the equivalence relation as<br />

Corollary 4.4. $Y_1 \cong Y_2$ if and only if $Y_1 = Y_2 R_m$ for some orthonormal matrix $R_m \in O(m)$.

In this thesis I use more general geometry than the Riemannian geometry of the Grass-<br />

mann manifold, and do not discuss this subject further. I refer the interested readers to<br />

[86, 13, 20, 16, 2] for further reading.



4.2.3 Principal angles and canonical correlations<br />

A canonical distance between two subspaces is the Riemannian distance, which is the<br />

length of the geodesic path connecting the two corresponding points on the Grassmann<br />

manifold. However, there is a more intuitive and computationally efficient way of defining<br />

distances using the principal angles [28]. I define the principal angles / canonical correla-<br />

tions as follows:<br />

Definition 4.5. Let Y1 and Y2 be two orthonormal matrices of size D by m. The princi-<br />

pal angles 0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2 between two subspaces span(Y1) and span(Y2), are<br />

defined recursively by<br />

$$\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)}\ \max_{v_k \in \mathrm{span}(Y_2)}\ u_k' v_k, \quad \text{subject to}$$

$$u_k'u_k = 1,\ v_k'v_k = 1, \qquad u_k'u_i = 0,\ v_k'v_i = 0\ \ (i = 1, \ldots, k - 1).$$

The first principal angle θ1 is the smallest angle between a pair of unit vectors each<br />

from the two subspaces. The cosine of the principal angle is the first canonical correlation<br />

[39]. The k-th principal angle and canonical correlation are defined recursively. It is known<br />

[91, 20] that the principal angles are related to the geodesic (=arc length) distance as shown<br />

in Figure 4.1 by<br />

$$d^2_{\mathrm{Arc}}(Y_1, Y_2) = \sum_i \theta_i^2. \tag{4.1}$$

To compute the principal angles, we need not directly solve the maximization problem. Instead, the principal angles can be computed from the Singular Value Decomposition (SVD) of the product of the two matrices $Y_1'Y_2$,

$$Y_1'Y_2 = USV', \tag{4.2}$$




Figure 4.1: Principal angles and Grassmann distances. Let span(Yi) and span(Yj) be two<br />

subspaces in the Euclidean space R D on the left. The distance between two subspaces<br />

span(Yi) and span(Yj) can be measured using the principal angles θ = [θ1, ... , θm] ′ . In the<br />

Grassmann manifold viewpoint, the subspaces span(Yi) and span(Yj) are considered as<br />

two points on the manifold G(m, D), whose Riemannian distance is related to the principal<br />

angles by $d(Y_i, Y_j) = \|\theta\|_2$. Various distances can be defined based on the principal angles.

where U = [u1 ... um], V = [v1 ... vm], and S is the diagonal matrix S = diag(cos θ1 ... cos θm).<br />

The proof can be found in p.604 of [28]. The principal angles form a non-decreasing se-<br />

quence<br />

0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2,<br />

and consequently the canonical correlations form a non-increasing sequence<br />

1 ≥ cos θ1 ≥ · · · ≥ cos θm ≥ 0.<br />

Although the definition of principal angles can be extended to the cases where Y1 and<br />

Y2 have different number of columns, I assume Y1 and Y2 have the same size D by m<br />

throughout this thesis.<br />
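Computationally, the recipe in (4.2) is a one-line SVD; here is a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Y1, Y2: (D, m) orthonormal bases. Returns theta_1 <= ... <= theta_m."""
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)   # canonical correlations cos(theta_i)
    s = np.clip(s, -1.0, 1.0)                        # guard against round-off
    return np.arccos(s)                              # singular values descend, so angles ascend

# Arc-length distance of (4.1): d_Arc = np.linalg.norm(principal_angles(Y1, Y2))
```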

4.3 Grassmann distances for subspaces<br />

In this section I introduce a few subspace distances which appeared in the literature, and<br />

give analyses of the distances in terms of the principal angles.<br />



I use the term distance for any assignment of nonnegative values to each pair of points in<br />

the data space. A valid metric is, however, a distance that satisfies the additional axioms in<br />

Definition 2.9. Furthermore, a distance (or a metric) between subspaces has to be invariant<br />

under different basis representations. A distance that satisfies this condition is referred to<br />

as the Grassmann distance (or metric):<br />

Definition 4.6. Let d : R D×m × R D×m → R be a distance function. The function d is a<br />

Grassmann distance if d(Y1, Y2) = d(Y1R1, Y2R2), ∀R1, R2 ∈ O(m).<br />

4.3.1 Projection distance<br />

The Projection distance is defined as

$$d_{\mathrm{Proj}}(Y_1, Y_2) = \left( \sum_{i=1}^{m} \sin^2\theta_i \right)^{1/2} = \left( m - \sum_{i=1}^{m} \cos^2\theta_i \right)^{1/2}, \tag{4.3}$$

which is the 2-norm of the sines of the principal angles [20, 85].

An interesting property of this metric is that it can be computed from only the product $Y_1'Y_2$, whose importance will be revealed in the next chapter. From the relationship between the principal angles and the SVD of $Y_1'Y_2$ in (4.2) we get

$$d^2_{\mathrm{Proj}}(Y_1, Y_2) = m - \sum_{i=1}^{m} \cos^2\theta_i = m - \|Y_1'Y_2\|_F^2 = 2^{-1}\|Y_1Y_1' - Y_2Y_2'\|_F^2, \tag{4.4}$$

where $\|\cdot\|_F$ is the matrix Frobenius norm

$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2, \quad A \in \mathbb{R}^{m \times n}.$$

The Projection distance is a Grassmann distance since it is invariant to different representations, which can be easily seen from (4.4). Furthermore, the distance is a metric:



Lemma 4.7. The Projection distance dProj is a Grassmann metric.<br />

Proof. The nonnegativity, symmetry, and triangle inequality follow naturally from $\|\cdot\|_F$ being a matrix norm. The remaining condition to be shown is the necessary and sufficient condition

$$\|Y_1Y_1' - Y_2Y_2'\|_F = 0 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2).$$

Since $\|\cdot\|_F$ is a matrix norm, we have $\|Y_1Y_1' - Y_2Y_2'\|_F = 0 \iff Y_1Y_1' = Y_2Y_2'$. The proof of the next step, $Y_1Y_1' = Y_2Y_2' \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, is also simple and is given in the proof of Theorem 5.2.

4.3.2 Binet-Cauchy distance<br />

The Binet-Cauchy distance is defined as<br />

$$d_{\mathrm{BC}}(Y_1, Y_2) = \Big( 1 - \prod_i \cos^2\theta_i \Big)^{1/2}, \tag{4.5}$$

which involves the product of canonical correlations [90, 83]. The distance can also be<br />

computed from only the product $Y_1'Y_2$. From the relationship between the principal angles and the SVD of $Y_1'Y_2$ in (4.2) we get

$$d^2_{\mathrm{BC}}(Y_1, Y_2) = 1 - \prod_i \cos^2\theta_i = 1 - \det(Y_1'Y_2)^2. \tag{4.6}$$

The Binet-Cauchy distance is also invariant under different representations, and further-<br />

more is a metric:<br />

Lemma 4.8. The Binet-Cauchy distance dBC is a Grassmann metric.<br />

The proof of the lemma is trivial after I prove Theorem 5.4 later.<br />



There is an interesting relationship between this distance and the Martin distance in<br />

control theory [55]. Martin proposed a metric between two ARMA processes with the<br />

cepstrum of the models, which was later shown to be of the following form [17]:<br />

$$d_M(O_1, O_2)^2 = -\log \prod_{i=1}^{m} \cos^2\theta_i,$$

where $O_1$ and $O_2$ are the infinite observability matrices explained in the previous chapter:

$$O_1 = \begin{pmatrix} C_1 \\ C_1A_1 \\ C_1A_1^2 \\ \vdots \end{pmatrix}, \quad \text{and} \quad O_2 = \begin{pmatrix} C_2 \\ C_2A_2 \\ C_2A_2^2 \\ \vdots \end{pmatrix}.$$

Consequently, the Binet-Cauchy distance is directly related to the Martin distance by the following:

$$d_{\mathrm{BC}}(\mathrm{span}(O_1), \mathrm{span}(O_2)) = \exp\Big(-\tfrac{1}{2}\, d_M(O_1, O_2)^2\Big). \tag{4.7}$$

4.3.3 Max Correlation

The Max Correlation distance is defined as<br />

$$d_{\mathrm{MaxCor}}(Y_1, Y_2) = \left(1 - \cos^2\theta_1\right)^{1/2} = \sin\theta_1, \tag{4.8}$$

which is based on the largest canonical correlation cos θ1 (or the smallest principal angle<br />

θ1). The max correlation is an intuitive measure between two subspaces which was used<br />

often in previous works [92, 64, 24]. It is a Grassmann distance. However, it is not a metric<br />

and therefore has some limitations. For example, it is possible for two distinct subspaces<br />

span(Y1) and span(Y2) to have a zero distance $d_{\mathrm{MaxCor}} = 0$ if they intersect at a point other than the origin.

4.3.4 Min Correlation<br />

The min correlation distance is defined as<br />

$$d_{\mathrm{MinCor}}(Y_1, Y_2) = \left(1 - \cos^2\theta_m\right)^{1/2} = \sin\theta_m. \tag{4.9}$$

The min correlation is conceptually the opposite of the max correlation, in that it is based<br />

on the smallest canonical correlation (or the largest principal angle). This distance is also<br />

closely related to the definition of the Projection distance. Previously I rewrote the Projection distance as $d_{\mathrm{Proj}} = 2^{-1/2}\|Y_1Y_1' - Y_2Y_2'\|_F$. The min correlation can be similarly written as ([20])

$$d_{\mathrm{MinCor}} = \|Y_1Y_1' - Y_2Y_2'\|_2, \tag{4.10}$$

where $\|\cdot\|_2$ is the matrix 2-norm:

$$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}, \quad A \in \mathbb{R}^{m \times n}.$$

The proof can be found in p. 75 of [28].

This distance is also a metric:

Lemma 4.9. The Min Correlation distance dMinCor is a Grassmann metric.<br />

The proof is almost the same as the proof for the Projection distance, with $\|\cdot\|_F$ replaced by $\|\cdot\|_2$, and is omitted.



4.3.5 Procrustes distance<br />

The Procrustes distance is defined as

$$d_{\mathrm{Proc1}}(Y_1, Y_2) = 2\left(\sum_{i=1}^{m} \sin^2(\theta_i/2)\right)^{1/2}, \tag{4.11}$$

which is (up to the factor of 2) the vector 2-norm of $[\sin(\theta_1/2), \ldots, \sin(\theta_m/2)]$. There is an alternative definition.

The Procrustes distance is the minimum Euclidean distance between different representa-<br />

tions of two subspaces span(Y1) and span(Y2) ([20, 16]):<br />

$$d_{\mathrm{Proc1}}(Y_1, Y_2) = \min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1U - Y_2V\|_F, \tag{4.12}$$

where U and V are from (4.2). Let us first check that the equation above is true.

Proof. First note that

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| = \min_{R_1,R_2 \in O(m)} \|Y_1 - Y_2R_2R_1'\| = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|, \tag{4.13}$$

for $R_1, R_2, Q \in O(m)$. This also holds for $\|\cdot\|_2$. Using this equality, we have

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F^2 = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|_F^2 = \min_{Q} \mathrm{tr}(Y_1'Y_1 + Y_2'Y_2 - Y_1'Y_2Q - Q'Y_2'Y_1) = 2m - 2\max_{Q} \mathrm{tr}(Y_1'Y_2Q).$$

However, $\mathrm{tr}\, Y_1'Y_2Q = \mathrm{tr}\, USV'Q = \mathrm{tr}\, ST$, where $T = V'QU$ is another orthonormal matrix. Since S is diagonal, $\mathrm{tr}\, ST = \sum_i S_{ii}T_{ii} \le \sum_i S_{ii}$, and the maximum is achieved for $T = I_m$, or equivalently $Q = VU'$. Hence

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1 - Y_2VU'\|_F = \|Y_1U - Y_2V\|_F.$$

Finally, let us prove the equivalence of the two definitions (4.11) and (4.12).

Lemma 4.10.
$$\|Y_1U - Y_2V\|_F = 2\left(\sum_{i=1}^{m} \sin^2(\theta_i/2)\right)^{1/2}$$

Proof. Left-multiplying $Y_1U - Y_2V$ by $U'Y_1'$ gives

$$\|Y_1U - Y_2V\|_F = \|U'Y_1'Y_1U - U'Y_1'Y_2V\|_F = \|I_m - S\|_F,$$

since the norm does not change under the multiplication with the orthonormal matrix $U'Y_1'$. Since

$$\|I_m - S\|_F = \left(\sum_i (1 - \cos\theta_i)^2\right)^{1/2} = \left(\sum_i \big(2\sin^2(\theta_i/2)\big)^2\right)^{1/2},$$

we have the desired result.

The Procrustes distance is also called the chordal distance [20]. The author of [20] also suggests another version of the Procrustes distance using the matrix 2-norm:

$$d_{\mathrm{Proc2}}(Y_1, Y_2) = \|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2). \tag{4.14}$$

Let us check the validity of this definition:

Proof. Left-multiplying $Y_1U - Y_2V$ by $U'Y_1'$ gives

$$\|Y_1U - Y_2V\|_2 = \|U'Y_1'Y_1U - U'Y_1'Y_2V\|_2 = \|I_m - S\|_2.$$


From the definition of the matrix 2-norm, we have

$$\|I_m - S\|_2 = \max_{\|x\|=1} \|(I_m - S)x\|_2 = \max_{\|x\|=1} \left(\sum_i (1 - \cos\theta_i)^2 x_i^2\right)^{1/2} = \max_{\|x\|=1} \left(\sum_i \big(2\sin^2(\theta_i/2)\big)^2 x_i^2\right)^{1/2}.$$

Since $\sin^2(\theta_1/2) \le \ldots \le \sin^2(\theta_m/2)$, the sum is maximized for $(x_1, \ldots, x_m) = (0, \ldots, 0, 1)$, and therefore

$$\|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2).$$

Note that this version of the Procrustes distance has the immediate relationship with the<br />

min correlation distance:<br />

$$d^2_{\mathrm{MinCor}}(Y_1, Y_2) = \sin^2\theta_m = 1 - \big(1 - 2\sin^2(\theta_m/2)\big)^2 = 1 - \Big(1 - \tfrac{1}{2}\, d^2_{\mathrm{Proc2}}(Y_1, Y_2)\Big)^2. \tag{4.15}$$

Since the function $f(x) = \big(1 - (1 - x^2/2)^2\big)^{1/2}$ is a non-decreasing transform of the distance for $0 \le x \le 2$, the two distances are expected to behave similarly although not

exactly in the same manner.<br />

By definition, both versions of the Procrustes distances are invariant under different<br />

representations and furthermore are valid metrics:<br />

Lemma 4.11. The Procrustes distances dProc1 and dProc2 are Grassmann metrics.<br />

Proof. Nonnegativity and symmetry are immediate. For the triangle inequality, let us use the


equality (4.13) to show that

$$d_{\mathrm{Proc}}(Y_1, Y_2) + d_{\mathrm{Proc}}(Y_2, Y_3) = \min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| + \min_{R_2,R_3 \in O(m)} \|Y_2R_2 - Y_3R_3\|$$
$$= \min_{Q_1 \in O(m)} \|Y_1Q_1 - Y_2\| + \min_{Q_3 \in O(m)} \|Y_2 - Y_3Q_3\|$$
$$= \min_{Q_1,Q_3 \in O(m)} \big\{ \|Y_1Q_1 - Y_2\| + \|Y_2 - Y_3Q_3\| \big\}$$
$$\ge \min_{Q_1,Q_3 \in O(m)} \|Y_1Q_1 - Y_3Q_3\| = d_{\mathrm{Proc}}(Y_1, Y_3).$$

The remaining condition to show is the necessary and sufficient condition

$$\|Y_1U - Y_2V\| = 0 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2).$$

Since $\|\cdot\|$ is a matrix norm, the equality follows:

$$\|Y_1U - Y_2V\| = 0 \iff Y_1U = Y_2V.$$

The proof of $\mathrm{span}(Y_1) = \mathrm{span}(Y_2) \iff Y_1U = Y_2V$ is similar to the case of the Projection distance and is omitted.

4.3.6 Comparison of the distances<br />

Table 4.1 summarizes the distances introduced so far. When these distances are used for a<br />

learning task, the choice of the most appropriate distance for the task depends on several<br />

factors.<br />

Table 4.1: Summary of the Grassmann distances. The distances can be defined as simple functions of both the basis Y and the principal angles θi, except for the Arc-length, which involves matrix exponentials.

                  Arc Length       Projection                     Binet-Cauchy
d²(Y₁, Y₂)        ·                2⁻¹ ‖Y₁Y₁′ − Y₂Y₂′‖²_F          1 − det(Y₁′Y₂)²
In terms of θ     Σᵢ θᵢ²           Σᵢ sin²θᵢ                      1 − Πᵢ cos²θᵢ
Is a metric?      Yes              Yes                            Yes

                  Max Corr          Min Corr               Procrustes 1        Procrustes 2
d²(Y₁, Y₂)        1 − ‖Y₁′Y₂‖₂²     ‖Y₁Y₁′ − Y₂Y₂′‖₂²       ‖Y₁U − Y₂V‖²_F      ‖Y₁U − Y₂V‖₂²
In terms of θ     sin²θ₁            sin²θₘ                 4 Σᵢ sin²(θᵢ/2)     4 sin²(θₘ/2)
Is a metric?      No                Yes                    Yes                 Yes

The first factor is the distribution of the data. Since the distances are defined from particular functions of the principal angles, the best distance depends highly on the probability distribution of the principal angles of the given data. For example, the max correlation dMaxCor uses only the smallest principal angle θ1, and therefore can serve as a robust distance when the subspaces are highly scattered and noisy, whereas the min correlation dMinCor uses only the largest principal angle θm, and therefore is not a sensible choice. On the other hand, when the subspaces are concentrated and have nonzero intersections, dMaxCor will be close to zero for most of the data, and dMinCor may be more discriminative in this case. The second Procrustes distance dProc2 is also expected to behave similarly to dMinCor since it also uses only the largest principal angle. Besides, dMinCor and dProc2 are directly related by (4.15). The Arc-length dArc, the Projection distance dProj, and the first Procrustes distance dProc1 use all the principal angles. Therefore they have intermediate characteristics between dMaxCor and dMinCor, and will be useful for a wider range of data distributions. The Binet-Cauchy distance dBC also uses all the principal angles, but it behaves similarly to dMinCor for scattered subspaces, since the distance attains its maximum value (= 1) if at least one of the principal angles is π/2, due to the product form of dBC.
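For reference, a small self-contained NumPy sketch (my own, mirroring Table 4.1) that computes the squared Grassmann distances of two orthonormal bases from their principal angles:

```python
import numpy as np

def grassmann_distances(Y1, Y2):
    """Squared Grassmann distances of Table 4.1 for two (D, m) orthonormal bases."""
    s = np.clip(np.linalg.svd(Y1.T @ Y2, compute_uv=False), -1.0, 1.0)
    theta = np.arccos(s)                               # ascending principal angles
    c2, s2 = s**2, 1.0 - s**2
    return {
        'arc':    np.sum(theta**2),                    # sum theta_i^2
        'proj':   np.sum(s2),                          # sum sin^2 theta_i
        'bc':     1.0 - np.prod(c2),                   # 1 - prod cos^2 theta_i
        'maxcor': s2[0],                               # sin^2 theta_1
        'mincor': s2[-1],                              # sin^2 theta_m
        'proc1':  4.0 * np.sum(np.sin(theta / 2)**2),  # 4 sum sin^2(theta_i / 2)
        'proc2':  4.0 * np.sin(theta[-1] / 2)**2,      # 4 sin^2(theta_m / 2)
    }
```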

The second criterion for choosing the distance is the degree of structure in the distance.

Without any structure a distance can be used only with a simple K-Nearest Neighbor (K-<br />

NN) algorithm for classification. When a distance has an extra structure such as triangle<br />

inequality, for example, we can speed up the nearest neighbor searches by estimating lower<br />



and upper limits of unknown distances [23]. From this point of view, the max correlation<br />

dMaxCor is not a metric and may not be used with more sophisticated algorithms unlike the<br />

rest of the distances.<br />

4.4 Experiments<br />

In this section I make empirical comparisons of the Grassmann distances discussed so far<br />

by using the distances for classification tasks with real image databases.

4.4.1 Experimental setting<br />

In this section I use the subspaces computed from the four databases Yale Face, CMU-PIE,<br />

ETH-80 and IXMAS, and compare the performances of simple 1NN classifiers using the<br />

Grassmann distances.<br />

The training and the test sets are prepared by N-fold cross validation as follows. For the<br />

Yale Face and the CMU-PIE databases, I keep the subspaces corresponding to a particular<br />

pose from all subjects for testing, and use the remaining subspaces corresponding to other<br />

poses for training. This results in 9-fold and 7-fold cross validation tests for Yale Face and<br />

CMU-PIE respectively. For the ETH-80 database, I keep the subspaces of 8 objects – one<br />

from each category – for testing, and use the remaining subspaces for training, which is a<br />

10-fold cross validation. For the IXMAS database, I keep all the subspaces corresponding<br />

to a particular person for testing, and use the subspaces of other people for training, which<br />

is an 11-fold cross validation test.

As mentioned in the previous chapter, the subspace representation of the databases ab-<br />

sorbs the variability due to illumination, pose, and the choice of the state space respectively.<br />

The cross validation setting of this thesis tests whether the remaining variability between
subspaces is indeed useful for recognizing subjects, objects, or actions, regardless of



different poses, object instances, and actors, respectively.<br />

4.4.2 Results and discussion<br />

Figures 4.2–4.5 show the classification rates. I can summarize the results as follows:<br />

1. The best performing distances are different for each database: dMaxCor for Yale Face,<br />

dProj, dProc1 for CMU-PIE, dArc, dProj, dProc1 for ETH-80, and dProj, dProc1 for IXMAS<br />

databases. I interpret this as certain distances being better suited for discriminating<br />

the subspaces of a particular database.<br />

2. With the exception of dMaxCor for Yale Face, the three distances dArc, dProj, dProc1 are<br />

consistently better than dBC, dMinCor, dProc2. This grouping of the distances is theoretically
predicted in Section 4.3.6.

3. The dMinCor and dProc2 show exactly the same rates, since the former is monotonically<br />

related to the latter by (4.15). However, the two distances will show different rates
when they are used with more sophisticated algorithms than the K-NN.

4. With the exception of Yale Face, the three distances perform much better than the Eu-<br />

clidean distance does, which demonstrates the potential advantages of the subspace-<br />

based approach.<br />

5. For CMU-PIE and IXMAS, the rates increase overall as the subspace dimension m<br />

increases. For Yale Face, the rates of dBC and dProc2 drop as m increases, whereas the

rates of other distances remain the same. For ETH-80, the rates seem to have different<br />

peaks for each distance. This means that the choice of the subspace dimensionality m<br />

can have significant effects on the recognition rates when the simple K-NN algorithm<br />

is employed. However, it will be shown in the later chapters that m has less effect

on more sophisticated algorithms that are able to adapt to the peculiarities of the data.<br />



4.5 Conclusion<br />

In this chapter I introduced the Grassmann manifold as the framework for subspace-based<br />

algorithms, and reviewed several well-known Grassmann distances for measuring the dis-<br />

similarity of subspaces. These Grassmann distances are analyzed and compared in terms of<br />

how they use the principal angles to define dissimilarity of subspaces. In the classification<br />

task of real image databases with the 1NN algorithm, the best performing distance varied
depending on the data used. This suggests that we need some prior knowledge of the data
to choose the best distance a priori. However, most of the Grassmann distances performed

better than the Euclidean distance in 1NN classification, and behaved in groups as predicted<br />

from the analysis. In the next chapter I will present a more important criterion for choosing<br />

a distance: whether a distance is associated with a positive definite kernel or not.<br />

[Figure 4.2 plots recognition rate (%) versus subspace dimension m for the eight distances; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
  dArc     87.81  80.29  83.15  87.81  84.23  84.95  80.65  81.36  82.44
  dProj    87.81  80.29  83.87  87.46  86.02  84.23  84.59  85.30  85.66
  dBC      87.81  80.29  83.15  87.46  83.51  83.15  76.34  75.63  74.19
  dMaxCor  87.81  89.96  92.11  91.04  91.04  91.04  91.76  92.11  91.76
  dMinCor  87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84
  dProc1   87.81  80.29  83.15  87.46  84.95  84.95  82.80  83.51  82.80
  dProc2   87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84

Figure 4.2: Yale Face Database: face recognition rates from 1NN classifier with the Grassmann
distances. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


[Figure 4.3 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96
  dArc     72.28  52.03  65.46  72.92  76.33  83.16  85.93  88.70  88.49
  dProj    72.28  52.03  64.39  75.48  78.46  82.94  84.43  86.57  87.85
  dBC      72.28  51.81  65.88  72.28  76.76  81.24  84.01  85.93  82.30
  dMaxCor  72.28  66.95  65.03  64.61  64.18  63.97  64.61  64.61  64.39
  dMinCor  72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52
  dProc1   72.28  52.03  65.25  73.13  77.83  83.37  86.57  88.27  88.27
  dProc2   72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52

Figure 4.3: CMU-PIE Database: face recognition rates from 1NN classifier with the Grassmann
distances.


[Figure 4.4 plots categorization rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
  dArc     80.00  86.25  93.75  86.25  88.75  92.50  90.00  97.50  92.50
  dProj    80.00  85.00  95.00  92.50  88.75  92.50  90.00  96.25  95.00
  dBC      80.00  86.25  91.25  83.75  87.50  88.75  85.00  95.00  92.50
  dMaxCor  80.00  78.75  82.50  81.25  81.25  82.50  83.75  81.25  83.75
  dMinCor  80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25
  dProc1   80.00  86.25  93.75  86.25  88.75  92.50  90.00  96.25  91.25
  dProc2   80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25

Figure 4.4: ETH-80 Database: object categorization rates from 1NN classifier with the
Grassmann distances.


[Figure 4.5 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5
  dEucl    42.61  42.61  42.61  42.61  42.61
  dArc     61.82  68.79  76.06  76.67  78.18
  dProj    61.82  69.39  74.85  78.48  80.30
  dBC      61.82  67.58  71.82  73.03  73.03
  dMaxCor  61.82  65.76  63.03  67.58  66.97
  dMinCor  61.82  61.82  61.82  62.42  54.24
  dProc1   61.82  68.18  76.97  76.97  80.30
  dProc2   61.82  61.82  61.82  62.42  54.24

Figure 4.5: IXMAS Database: action recognition rates from 1NN classifier with the Grassmann
distances. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


Chapter 5<br />

<strong>GRASSMANN</strong> <strong>KERNELS</strong> AND<br />

DISCRIMINANT ANALYSIS<br />

5.1 Introduction<br />

In the previous chapter I defined subspace distances on the Grassmann manifold. However,<br />

with a distance structure only, there is a severe restriction on the possible operations with the

data. In this chapter, I show that it is possible to define positive definite kernel functions<br />

on the manifold, and thereby it is possible to transform the space to the familiar Hilbert<br />

space by virtue of the RKHS theory in Section 2.2. In particular, the Projection and the<br />

Binet-Cauchy distances presented in the previous chapter will be shown to be compatible<br />

with the Projection and the Binet-Cauchy kernels defined as follows:<br />

kProj(Y1, Y2) = ‖Y1′Y2‖²_F ,    kBC(Y1, Y2) = (det Y1′Y2)².


These kernels are discussed in detail in this chapter. The Binet-Cauchy kernel has been<br />

used as a similarity measure for sets [90] and dynamical systems [83]. 1 The Projection dis-<br />

tance has been used for face recognition [85], but the corresponding Projection kernel has<br />

not been explicitly used, and it is the main object of this chapter. I examine both kernels as<br />

the representative kernels on the Grassmann manifold. Advantages of the Grassmann ker-<br />

nels over the Euclidean kernels are demonstrated by a classification problem with Support<br />

Vector Machines (SVMs) on synthetic datasets.<br />

To demonstrate the potential benefits of the kernels further, I use the kernels in a dis-<br />

criminant analysis of subspaces. The proposed method will be contrasted with the previ-<br />

ously suggested subspace-based discriminant algorithms [92, 64, 24, 43]. Those previous<br />

methods adopt an inconsistent strategy: feature extraction is performed in the Euclidean<br />

space while non-Euclidean subspace distances are used. This inconsistency results in a<br />

difficult optimization and a weak guarantee of convergence. In the proposed approach of<br />

this chapter, the feature extraction and the distance measurement are integrated around the<br />

Grassmann kernel, resulting in a simpler and more familiar algorithm. Experiments with<br />

the image databases also show that the proposed method performs better than the previous<br />

methods.<br />

5.2 Kernel functions for subspaces<br />

Among the various distances presented in Chapter 4, only the Projection distance and the<br />

Binet-Cauchy distance are induced from positive definite kernels. This means that we can<br />

1 The authors of [83] use the term Binet-Cauchy kernel for a more abstract class of kernels for Fredholm
operators. The Binet-Cauchy kernel kBC in this thesis is a special case which is close to what those authors

call the Martin kernel.<br />



define the corresponding kernels kProj and kBC such that the following is true:<br />

d 2 (Y1, Y2) = k(Y1, Y1) + k(Y2, Y2) − 2k(Y1, Y2). (5.1)<br />

To define a kernel on the Grassmann manifold, let’s recall the definition of a positive<br />

definite kernel in Definition 2.4:<br />

A real symmetric function k is a (resp. conditionally) positive definite kernel function
if Σ_{i,j} ci cj k(xi, xj) ≥ 0 for all x1, ..., xn (xi ∈ X) and c1, ..., cn (ci ∈ R) for any n ∈ N
(resp. for all c1, ..., cn (ci ∈ R) such that Σ_{i=1}^{n} ci = 0).

Based on the Euclidean coordinates of subspaces, the Grassmann kernel is defined as<br />

follows:<br />

Definition 5.1. Let k : R D×m × R D×m → R be a real valued symmetric function<br />

k(Y1, Y2) = k(Y2, Y1). The function k is a Grassmann kernel if it is 1) positive definite and<br />

2) invariant to different representations:<br />

k(Y1, Y2) = k(Y1R1, Y2R2), ∀R1, R2 ∈ O(m).<br />

In the following sections I explicitly construct an isometry from (G, dProj or BC) to a<br />

Hilbert space (H, L2), and use the isometry to show that the Projection and the Binet-<br />

Cauchy kernels are Grassmann kernels.<br />

5.2.1 Projection kernel<br />

The Projection distance dProj can be understood by associating a subspace with a projection<br />

matrix by the following embedding [16]<br />

Ψ : G(m, D) → R D×D , span(Y ) ↦→ Y Y ′ . (5.2)<br />



The image Ψ(G(m, D)) is the set of rank-m orthogonal projection matrices, hence the
name Projection distance.

Theorem 5.2. The map

Ψ : G(m, D) → R^{D×D},  span(Y) ↦ Y Y′    (5.3)

is an embedding. In particular, it is an isometry from (G, dProj) to (R^{D×D}, ‖·‖_F).

Proof. 1. Well-defined: if span(Y1) = span(Y2), or equivalently Y1 = Y2R for some
R ∈ O(m), then Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2).

2. Injective: suppose Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2). After multiplying by Y1 and Y2 on
the right we get the equalities Y1 = Y2(Y2′Y1) and Y2 = Y1(Y1′Y2) respectively.
Let R = Y2′Y1; then Y1 = Y2R = Y1(R′R), which shows R ∈ O(m) and
therefore span(Y1) = span(Y2).

3. Isometry: ‖Ψ(Y1) − Ψ(Y2)‖_F = ‖Y1Y1′ − Y2Y2′‖_F = 2^{1/2} dProj(Y1, Y2).

Since we have a Euclidean embedding into (R^{D×D}, ‖·‖_F), the natural inner product of
this space is the trace tr[(Y1Y1′)(Y2Y2′)] = ‖Y1′Y2‖²_F. This provides us with the definition
of the Projection kernel:

Theorem 5.3. The Projection kernel

kProj(Y1, Y2) = ‖Y1′Y2‖²_F    (5.4)

is a Grassmann kernel.


Proof. The kernel is well-defined because kProj(Y1, Y2) = kProj(Y1R1, Y2R2) for any R1, R2 ∈
O(m). The positive definiteness follows from the properties of the Frobenius norm: for all
Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N, we have

Σ_{ij} ci cj ‖Yi′Yj‖²_F = Σ_{ij} ci cj tr(YiYi′YjYj′) = Σ_{ij} ci cj tr[(YiYi′)(YjYj′)]
  = tr( Σ_i ci YiYi′ )² = ‖ Σ_i ci YiYi′ ‖²_F ≥ 0.

The Projection kernel has a very simple form and requires only O(Dm) multiplications<br />

to evaluate. It is the main kernel I propose to use for subspace-based learning.<br />
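As an illustration of how little is needed to use the kernel in practice, the sketch below evaluates kProj for two orthonormal bases and checks the relation (5.1); the factor of 2 relative to dProj matches the 2^{1/2} in Theorem 5.2. This is a minimal numpy example of my own, not code from the thesis.

```python
import numpy as np

def orth_basis(A):
    # Orthonormal basis of the column span of A (thin QR factorization).
    Q, _ = np.linalg.qr(A)
    return Q

def k_proj(Y1, Y2):
    # Projection kernel: only the small m x m matrix Y1'Y2 is ever formed.
    return np.linalg.norm(Y1.T @ Y2, 'fro') ** 2

rng = np.random.default_rng(0)
D, m = 20, 3
Y1 = orth_basis(rng.standard_normal((D, m)))
Y2 = orth_basis(rng.standard_normal((D, m)))

# (5.1) with kProj recovers the squared embedding distance ||Y1Y1' - Y2Y2'||_F^2,
# which equals 2 * dProj^2 by Theorem 5.2.
lhs = k_proj(Y1, Y1) + k_proj(Y2, Y2) - 2 * k_proj(Y1, Y2)
rhs = np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, 'fro') ** 2
assert np.isclose(lhs, rhs)
```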

5.2.2 Binet-Cauchy kernel<br />

The Binet-Cauchy distance can also be understood by an embedding. Let s be a subset of<br />

{1, ..., D} with m elements s = {r1, ..., rm}, and Y (s) be the m × m matrix whose rows<br />

are the r1, ... , rm-th rows of Y . If s1, s2, ..., sn are all such choices of the subset ordered<br />

lexicographically, then the Binet-Cauchy embedding is defined as<br />

Ψ : G(m, D) → R^n,  span(Y) ↦ ( det Y^{(s1)}, ..., det Y^{(sn)} ),    (5.5)

where n = C(D, m) is the number of ways of choosing m rows out of the D rows. It is also an isometry
from (G, dBC) to (R^n, ‖·‖₂). The natural inner product in R^n is the dot product of the two
vectors,

Σ_{i=1}^{n} det Y1^{(si)} det Y2^{(si)},

which provides us with the definition of the Binet-Cauchy kernel.


Theorem 5.4. The Binet-Cauchy kernel

kBC(Y1, Y2) = (det Y1′Y2)² = det(Y1′Y2Y2′Y1)    (5.6)

is a Grassmann kernel.

Proof. First, the kernel is well-defined because kBC(Y1, Y2) = kBC(Y1R1, Y2R2) for any
R1, R2 ∈ O(m). To show that kBC is positive definite it suffices to show that k(Y1, Y2) =
det Y1′Y2 is positive definite. From the Binet-Cauchy identity [38, 90, 83], we have

det Y1′Y2 = Σ_s det Y1^{(s)} det Y2^{(s)}.

Therefore, for all Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N,

Σ_{ij} ci cj det Yi′Yj = Σ_{ij} ci cj Σ_s det Yi^{(s)} det Yj^{(s)}
  = Σ_s Σ_{ij} ci cj det Yi^{(s)} det Yj^{(s)} = Σ_s ( Σ_i ci det Yi^{(s)} )² ≥ 0.

Some other forms of the Binet-Cauchy kernel have also appeared in the literature. Note
that although det Y1′Y2 is also a Grassmann kernel, we prefer kBC(Y1, Y2) = (det Y1′Y2)².
The reason is that the latter is directly related to the principal angles by (det Y1′Y2)² =
Π_i cos²θi and therefore admits geometric interpretations, whereas the former cannot be
written directly in terms of the principal angles. That is, det Y1′Y2 ≠ Π_i cos θi in general.²
Another variant arcsin kBC(Y1, Y2) is also a positive definite kernel³, and its induced metric
d = (arccos(det Y1′Y2))^{1/2} is a conditionally positive definite metric.

² For example, det Y1′Y2 can be negative, whereas Π_i cos θi – a product of singular values – is nonnegative
by definition.
³ From Theorems 4.18 and 4.19 of [69].
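The Binet-Cauchy kernel is equally easy to evaluate; the short sketch below (an illustrative example, not code from the thesis) computes it and checks the principal-angle identity used above.

```python
import numpy as np

def k_bc(Y1, Y2):
    # Binet-Cauchy kernel: squared determinant of the m x m matrix Y1'Y2.
    return np.linalg.det(Y1.T @ Y2) ** 2

rng = np.random.default_rng(1)
D, m = 15, 4
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

# (det Y1'Y2)^2 = prod_i cos^2(theta_i), with cos(theta_i) the singular values of Y1'Y2.
cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
assert np.isclose(k_bc(Y1, Y2), np.prod(cosines ** 2))
```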


5.2.3 Indefinite kernels from other metrics<br />

Since the Projection distance and the Binet-Cauchy distance are derived from positive def-<br />

inite kernels, we have all the kernel-based algorithms for Hilbert spaces at our disposal. In<br />

contrast, other distances in the previous chapter are not associated with Grassmann kernels<br />

and can only be used with less powerful algorithms. Showing directly that a distance is not
associated with any kernel is harder than showing that such a kernel exists.

However, Theorem 2.12 can be used to make the task easier:<br />

A metric d is induced from a positive definite kernel if and only if

k̂(x1, x2) = −d²(x1, x2)/2,  x1, x2 ∈ X    (5.7)

is conditionally positive definite. The theorem allows us to show a metric's non-positive
definiteness by constructing an indefinite kernel matrix from (5.7) as a counterexample.

There have been efforts to use indefinite kernels for learning [59, 31], and several<br />

heuristics have been proposed to modify an indefinite kernel matrix to a positive definite<br />

matrix [60]. However, I do not advocate the use of the heuristics since they change the<br />

geometry of the original data.<br />
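The counterexample strategy is easy to carry out numerically. The sketch below, a hypothetical helper of my own rather than a routine from the thesis, builds k̂ = −d²/2 from a matrix of squared distances and inspects the spectrum of its double-centered version; a clearly negative eigenvalue certifies that the distance is not induced by any positive definite kernel. Applying it to a Gram matrix of, say, dArc or dMaxCor on a few dozen random subspaces is one way to produce the counterexample mentioned above.

```python
import numpy as np

def is_cpd(dist2_matrix, tol=1e-9):
    """Test conditional positive definiteness of k_hat = -D2/2, cf. (5.7).

    dist2_matrix is an N x N matrix of squared distances.  k_hat is conditionally
    positive definite iff it is PSD on the subspace {c : sum(c) = 0}, which we
    check by double-centering with J = I - 11'/N and inspecting the spectrum.
    """
    n = dist2_matrix.shape[0]
    k_hat = -0.5 * dist2_matrix
    J = np.eye(n) - np.ones((n, n)) / n      # projector onto zero-sum coefficient vectors
    eigvals = np.linalg.eigvalsh(J @ k_hat @ J)
    return eigvals.min() >= -tol
```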

5.2.4 Extension to nonlinear subspaces<br />

Linear subspaces in the original space can be generalized to ‘nonlinear’ subspaces by con-<br />

sidering linear subspaces in a RKHS, which is a trick that has been used successfully in<br />

kernel PCA [68]. 4 In [90, 85] the trick is shown to be applicable to the computation of the<br />

principal angles, called the kernel principal angles. Wolf and Shashua, in particular, use<br />

the trick to compute the Binet-Cauchy kernel. Note that these two kernels have different<br />

4 A ‘nonlinear’ subspace is an oxymoron. Technically speaking, it is a preimage of a linear subspace in
the RKHS.

[Figure 5.1 is a schematic of the doubly kernel method; only its caption is reproduced here.]

Figure 5.1: Doubly kernel method. The first kernel implicitly maps the two ‘nonlinear
subspaces’ Xi and Xj to span(Yi) and span(Yj) via the map Φ : X → H1, where the
‘nonlinear subspace’ means the preimage Xi = Φ⁻¹(span(Yi)) and Xj = Φ⁻¹(span(Yj)).
The second (= Grassmann) kernel maps the points Yi and Yj on the Grassmann manifold
G(m, D) to the corresponding points in H2 via the map Ψ : G(m, D) → H2, such as (5.3)
or (5.5).

roles and need to be distinguished. An illustration of this ‘doubly kernel’ method is given<br />

in Figure 5.1.<br />

The key point of the trick is that the principal angles between two subspaces in the<br />

RKHS can be derived only from the inner products of vectors in the original space. 5 Fur-<br />

thermore, the orthonormalization procedure in the feature space also requires the inner prod-

uct of vectors only. Below is a summary of the procedures in [90].<br />

1. Let Xi = {x^i_1, ..., x^i_{Ni}} be the i-th set of data and Φi = [φ(x^i_1), ..., φ(x^i_{Ni})] be the
image matrix of Xi in the feature space implicitly defined by a kernel function k,
e.g., the Gaussian RBF kernel.⁵

2. The orthonormal basis Yi of span(Φi) is then computed from the Gram-Schmidt
process in the RKHS: Φi = YiRi.

3. Finally, the product Yi′Yj in the feature space, used to define the Binet-Cauchy kernel
for example, is computed from the original data by

Yi′Yj = (Ri⁻¹)′ Φi′Φj Rj⁻¹ = (Ri⁻¹)′ [k(x^i_k, x^j_l)]_{kl} Rj⁻¹

(see the sketch below).

⁵ A similar idea is also used to define probability distributions in the feature space [46, 96], and will be
explained in the next chapter.
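A compact way to realize step 2 without an explicit Gram-Schmidt is to note that Φi′Φi = Ri′Ri, so Ri can be taken as the Cholesky factor of the i-th Gram matrix. The sketch below follows that route; it is my own illustrative reading of the procedure (the kernel choice and the regularization jitter are assumptions), not the implementation of [90].

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_gram(A, B, gamma=1.0):
    # Gram matrix [k(a_k, b_l)] for the Gaussian RBF kernel.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def feature_space_YtY(Xi, Xj, gamma=1.0, jitter=1e-10):
    """Compute Yi'Yj in the RKHS using only inner products of the original data."""
    Kii = rbf_gram(Xi, Xi, gamma) + jitter * np.eye(len(Xi))   # = Ri'Ri
    Kjj = rbf_gram(Xj, Xj, gamma) + jitter * np.eye(len(Xj))
    Kij = rbf_gram(Xi, Xj, gamma)                              # = Phi_i' Phi_j
    Ri = cholesky(Kii)                      # upper triangular, Kii = Ri' Ri
    Rj = cholesky(Kjj)
    # Yi'Yj = (Ri^{-1})' Kij Rj^{-1}, evaluated with two triangular solves.
    tmp = solve_triangular(Ri, Kij, trans='T', lower=False)
    return solve_triangular(Rj, tmp.T, trans='T', lower=False).T

# Principal angles in the feature space then follow from the SVD of Yi'Yj.
```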

Although this extension has been used to improve classification tasks with a few small<br />

databases [90], I will not use the extension in the thesis for the following reasons. First, the<br />

databases I use already have theoretical grounds for being linear subspaces, and we want to<br />

verify the linear subspace models. Second, the advantage of kernel tricks in general is most<br />

pronounced when the ambient space R D has a relatively small dimension D compared to<br />

the number of data samples N. This is obviously not the case with the data used in the

thesis. Further experiments with the nonlinear extension will be carried out in the future.<br />

5.3 Experiments with synthetic data<br />

In this section I demonstrate the application of the Grassmann kernels to a two-class clas-<br />

sification problem with Support Vector Machines (SVMs). Using synthetic data I will<br />

compare the classification performances of linear/nonlinear SVMs in the original space<br />

with the performances of the SVMs in the Grassmann space. The advantages of the sub-<br />

space approach over the conventional Euclidean approach for classification problems will

be discussed.<br />

[Figure 5.2 has four panels: A. Class centers, B. Easy, C. Intermediate, D. Difficult.]

Figure 5.2: A two-dimensional subspace is represented by a triangular patch swept by two
basis vectors. The positive and negative classes are color-coded by blue and red respectively.
A: The two class centers Y+ and Y− around which other subspaces are randomly
generated. B–D: Examples of randomly selected subspaces for the ‘easy’, ‘intermediate’, and
‘difficult’ datasets.

5.3.1 Synthetic data<br />

I generate three types of datasets: ‘easy’, ‘intermediate’, and ‘difficult’; these datasets differ<br />

in the amount of noise in the data.<br />

For each type of the data, I generate N = 100 subspaces in D = 6 dimensional Eu-<br />

clidean space, where each subspace is m = 2 dimensional. To generate two-class data, I<br />

first define two exemplar subspaces spanned by the following bases Y+ and Y−:

Y+ = (1/√6) [ [ 1  1  1 −1  1  1 ]′ , [ 1  1 −1  1  1 −1 ]′ ],
Y− = (1/√6) [ [ 1 −1  1  1 −1  1 ]′ , [ 1 −1 −1 −1 −1 −1 ]′ ].

The Y+ and Y− serve as the positive and the negative class centers respectively. The<br />

corresponding subspaces span(Y+) and span(Y−) have the principal angles θ1 = θ2 =<br />

arccos(1/3).<br />

The other subspaces Yi’s are generated by adding a Gaussian random matrix M to the<br />

bases Y+ or Y−, and then by applying SVD to compute the new perturbed bases:<br />

Yi = U,  where  UΣV′ = Y+ + Mi  (i = 1, ..., N/2),   UΣV′ = Y− + Mi  (i = N/2 + 1, ..., N),

where the elements of the matrix Mi are independent Gaussian variables [Mi]jk ∼ N (0, s 2 ).<br />

The standard deviation s controls the amount of noise; the s is chosen to be s = 0.2, 0.3<br />

and 0.4 for ‘easy’, ‘intermediate’, and ‘difficult’ datasets respectively. Figure 5.2 shows<br />

examples of the subspaces for the three datasets. Note that the subspaces become more<br />

cluttered and the class boundary becomes more irregular as s increases.<br />
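A compact simulation of this generative process is given below. It is a hedged numpy sketch of the construction just described; the random seed and helper names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, N = 6, 2, 100

Y_pos = np.array([[ 1,  1], [ 1,  1], [ 1, -1], [-1,  1], [ 1,  1], [ 1, -1]]) / np.sqrt(6)
Y_neg = np.array([[ 1,  1], [-1, -1], [ 1, -1], [ 1, -1], [-1, -1], [ 1, -1]]) / np.sqrt(6)

def perturbed_basis(Y_center, s):
    # Add i.i.d. N(0, s^2) noise to the basis and re-orthonormalize via SVD.
    U, _, _ = np.linalg.svd(Y_center + s * rng.standard_normal(Y_center.shape),
                            full_matrices=False)
    return U[:, :m]

def make_dataset(s):
    # First N/2 subspaces scattered around Y+, the remaining N/2 around Y-.
    Ys = [perturbed_basis(Y_pos, s) for _ in range(N // 2)] + \
         [perturbed_basis(Y_neg, s) for _ in range(N // 2)]
    labels = np.array([+1] * (N // 2) + [-1] * (N // 2))
    return Ys, labels

easy, intermediate, difficult = make_dataset(0.2), make_dataset(0.3), make_dataset(0.4)
```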

5.3.2 Algorithms<br />

I compare the performance of the Euclidean SVM with linear/polynomial/RBF kernels and<br />

the performance of SVM with Grassmann kernels. To test the Euclidean SVMs, I randomly<br />

sample n = 50 points from each subspace from a Gaussian distribution.<br />

There is an immediate handicap with a linear classifier in the original data space. Each<br />

subspace is symmetric with respect to the origin, that is, if x is a point on a subspace,<br />



then −x is also on the subspace. As a result, any hyperplane either 1) contains a subspace<br />

or 2) halves a subspace into two parts and yields 50 percent classification rate, which is<br />

useless. Therefore, if data lie on subspaces without further restrictions, a linear classifier<br />

(with a zero-bias) always fails to classify subspaces. To alleviate the problem with the<br />

Euclidean algorithms, I sample points from the intersection of the subspaces and the half-<br />

space {(x1, ..., x6) ∈ R 6 | x1 > 0}.<br />

To test the Grassmann SVM, I first estimate the basis Yi from the SVD of the same<br />

sampled points used for the Euclidean SVM, and then evaluate the Grassmann kernel func-<br />

tions.<br />

The following five kernels are compared:

1. Euclidean SVM with the linear kernel: k(x1, x2) = ⟨x1, x2⟩.

2. Euclidean SVM with the polynomial kernel: k(x1, x2) = (⟨x1, x2⟩ + 1)³.

3. Euclidean SVM with the Gaussian RBF kernel: k(x1, x2) = exp(−‖x1 − x2‖²/(2r²)). The
radius r is chosen to be one-fifth of the diameter of the data: r = 0.2 max_{ij} ‖xi − xj‖.

4. Grassmannian SVM with the Projection kernel: k(Y1, Y2) = ‖Y1′Y2‖²_F.

5. Grassmannian SVM with the Binet-Cauchy kernel: k(Y1, Y2) = (det Y1′Y2)².

For the Euclidean SVMs, I use the public-domain software SVM-light [42] with default<br />

parameters. For the Grassmann SVMs, I use a Matlab code with a nonnegative QP solver.<br />

I evaluate the algorithms with the leave-one-out test by holding out one subspace and<br />

training with the other N − 1 subspaces.<br />
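For the Grassmann SVMs, any solver that accepts a precomputed kernel matrix can play the role of the Matlab QP code; the sketch below uses scikit-learn's SVC as a stand-in (an assumption of mine, not the software used in the thesis) to run the leave-one-out protocol with the Projection kernel.

```python
import numpy as np
from sklearn.svm import SVC

def projection_gram(bases):
    # Gram matrix of the Projection kernel over a list of orthonormal bases.
    n = len(bases)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.linalg.norm(bases[i].T @ bases[j], 'fro') ** 2
    return K

def loo_accuracy(bases, labels, C=1.0):
    # labels: numpy array of +/-1 class labels, one per subspace.
    K, n, correct = projection_gram(bases), len(bases), 0
    for held in range(n):
        train = np.setdiff1d(np.arange(n), [held])
        clf = SVC(C=C, kernel='precomputed')
        clf.fit(K[np.ix_(train, train)], labels[train])
        correct += clf.predict(K[[held]][:, train])[0] == labels[held]
    return correct / n
```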

5.3.3 Results and discussion<br />

Table 5.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,<br />

averaged for 10 independent trials. The results show that the Grassmann SVM with the<br />



Table 5.1: Classification rates of the Euclidean SVMs and the Grassmannian SVMs. The
best rate for each dataset is highlighted by boldface.

                     Euclidean                  Grassmann
                  Lin     Poly     RBF        Proj       BC
  Easy           88.21   98.41   98.37      100.00    100.00
  Intermediate   80.08   92.46   92.72       98.80     98.00
  Difficult      72.01   81.14   81.73       91.30     90.60

Projection kernel outperforms the Euclidean SVMs. The Grassmann SVM with the
Binet-Cauchy kernel is a close second. The polynomial and RBF kernels both perform
better than the linear kernel, but not as well as the Grassmann kernels. The overall classi-

fication rates decrease as the data become more difficult to separate.<br />

The Grassmann kernels achieve better results for the two main reasons. First, when<br />

the data are highly cluttered as shown in Figure 5.2, the geometric prior of the subspace<br />

structures can disambiguate the points close to each other that the Euclidean distance can-<br />

not distinguish well. Second, the Grassmann approach implicitly maps the data from the<br />

original D-dimensional space to a higher-dimensional (m(D−m)) space where separating<br />

the subspaces becomes easier.<br />

In addition to having a superior classification performance with subspace-structured<br />

data, the Grassmann kernel method has a smaller computational cost. In the experiment<br />

above, for example, the Euclidean approach uses a kernel matrix of a size 5000 × 5000,<br />

whereas the Grassmann approach uses a kernel matrix of a size 100 × 100 which is n = 50<br />

times smaller than the Euclidean kernel matrix.<br />



5.4 Discriminant Analysis of subspaces

In this section I introduce a discriminant analysis method on the Grassmann manifold, and<br />

compare this method with other previously known discriminant techniques for subspaces.<br />

Since the image databases in Chapter 3 are highly multiclass 6 and lie in high dimensional<br />

space, I propose to use the discriminant analysis technique to reduce dimensionality and<br />

extract features of subspace data.<br />

5.4.1 Grassmann Discriminant Analysis<br />

It is straightforward to show the procedures of using the Projection and the Binet-Cauchy<br />

kernels with the Kernel FDA method introduced in Section 5.4. Recall that the cost function<br />

of Kernel FDA is as follows:<br />

J(α) = (α′Φ′S_BΦα) / (α′Φ′S_WΦα) = (α′K(V − 1_N 1_N′/N)Kα) / (α′(K(I_N − V)K + σ²I_N)α),    (5.8)
where K is the kernel matrix, σ is a regularization term, and the others are fixed terms.<br />

Since the method is already explained in detail, I only present a summary of the procedure<br />

below.<br />

6 Nc = 38, 68, 8, and 11 for the Yale Face, CMU-PIE, ETH-80, and IXMAS databases respectively.


Assume the D by m orthonormal bases {Yi} are already computed and given.<br />

Training:<br />

1. Compute the matrix [Ktrain]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi, Yj in the<br />

training set.<br />

2. Solve maxα J(α) in (5.8) by eigen-decomposition.<br />

3. Compute the (Nc − 1)-dimensional coefficients Ftrain = α ′ Ktrain.<br />

Testing:<br />

1. Compute the matrix [Ktest]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi in training<br />

set and Yj in the test set.<br />

2. Compute the (Nc − 1)-dim coefficients Ftest = α ′ Ktest.<br />

3. Perform 1-NN classification from the Euclidean distance between Ftrain and<br />

Ftest.<br />
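One way to realize the training stage is to solve the generalized eigenvalue problem underlying (5.8) directly. The sketch below is a simplified reading of that procedure; the construction of the class-indicator matrix V and the choice of eigensolver are standard kernel-FDA ingredients that I assume here rather than copy from the thesis.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_fda(K, y, sigma2=1e-3):
    """GDA coefficients alpha from a Grassmann kernel matrix K = [kProj(Yi, Yj)].

    y holds integer class labels; the returned alpha has Nc-1 columns, and the
    discriminant features are F = alpha.T @ K as in the procedure above.
    """
    N = len(y)
    classes = np.unique(y)
    V = np.zeros((N, N))
    for c in classes:                       # block matrix with 1/N_c inside each class block
        idx = np.where(y == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)
    ones = np.ones((N, N)) / N
    A = K @ (V - ones) @ K                                   # numerator of (5.8)
    B = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)         # denominator + regularization
    w, alpha = eigh(A, B)                                    # generalized symmetric eigenproblem
    return alpha[:, ::-1][:, :len(classes) - 1]              # top Nc-1 directions
```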

I call this method the Grassmann Discriminant Analysis to differentiate it from other<br />

discriminant methods for subspaces, which I review in the following sections.<br />

5.4.2 Mutual Subspace Method (MSM)<br />

The original MSM [92] performs simple 1-NN classification with dMax with no feature<br />

extraction. The method can be extended to any distance described in the thesis. Although<br />

there are attempts to use kernels for MSM [64], the kernel is used only to represent data in<br />

the original space, and the MSM algorithm is still a 1-NN classification.<br />



5.4.3 Constrained MSM (cMSM)<br />

Constrained MSM [24] is a technique that applies dimensionality reduction to the bases<br />

of the subspaces in the original space. Let G = Σ_i YiYi′ be the sum of the projection

matrices of the data and {v1, ..., vD} be the eigenvectors corresponding to the eigenvalues<br />

{λ1 ≤ ... ≤ λD} of G. The authors of [24] claim that the first few eigenvectors v1, ..., vd<br />

of G are more discriminative than the later eigenvectors, and suggest projecting the basis<br />

vectors of each subspace Yi onto span(v1, ..., vd), followed by normalization. However,
these procedures lack justification, as well as a clear criterion for choosing the dimension
d, on which, in our experience, the result crucially depends.

5.4.4 Discriminant Analysis of Canonical Correlations (DCC)<br />

The Discriminant Analysis of Canonical Correlations [43] can be understood as a non-<br />

parametric version of linear discrimination analysis using the Procrustes distance (4.11).<br />

The algorithm finds the discriminating direction w which maximizes the ratio L(w) =<br />

w ′ SBw/w ′ Sww, where Sb and Sw are the nonparametric between-class and within-class<br />

‘covariance’ matrices from Section 2.4.2:<br />

Sb = Σ_i Σ_{j∈Bi} (YiU − YjV)(YiU − YjV)′,
Sw = Σ_i Σ_{j∈Wi} (YiU − YjV)(YiU − YjV)′,

where U and V are from (4.2). Recall that tr(YiU − YjV )(YiU − YjV ) ′ = �YiU − YjV � 2 F<br />

is the Procrustes distance (squared). However, unlike my method, Sb and Sw do not admit<br />

a geometric interpretation as true covariance matrices, nor can they be kernelized directly.<br />

Another disadvantage of the DCC is the difficulty in optimization. The algorithm iterates<br />

the two stages of 1) maximizing the ratio L(w) and of 2) computing Sb and Sw, which<br />

results in a computational overhead and a weak theoretical support for global convergence.

5.5 Experiments with real-world data<br />

In this section I test the Grassmann Discriminant Analysis with the Yale Face, the CMU-<br />

PIE, the ETH-80 and the IXMAS databases, and compare its performance with those of<br />

other algorithms.<br />

5.5.1 Algorithms<br />

The following is the list of algorithms used in the test.<br />

1. Baseline: Euclidean FDA<br />

2. Grassmann Discriminant Analysis:<br />

• GDA1 (Projection kernel + kernel FDA)<br />

• GDA2 (Binet-Cauchy kernel + kernel FDA)<br />

For GDA1 and GDA2, the optimal values of σ are found by scanning through a range<br />

of values. The results do not seem to vary much as long as σ is small enough.<br />

3. Others<br />

• MSM (max corr)<br />

• cMSM (PCA+max corr)<br />

• DCC (NDA + Procrustes dist): For cMSM and DCC, the optimal dimension d is<br />

found by exhaustive searching. For DCC, we have used two nearest-neighbors<br />

for Bi and Wi in Section 5.4.4. However, increasing the number of nearest-<br />

neighbors does not change the results very much as was observed in [43]. In<br />

DCC the optimization is iterated for 5 times each.<br />



I evaluate the algorithms with the cross validation as explained in Section 4.4.1.<br />

5.5.2 Results and discussion<br />

Figures 5.3–5.6 show the classification rates. I can summarize the results as follows:<br />

1. The GDA1 shows significantly better performance than all the other algorithms for<br />

all datasets. However, the difference is less pronounced in the Yale Face database<br />

where the other discriminant algorithms also performed well.<br />

2. The overall rates are roughly in the order of (GDA1 > cMSM > DCC > others ).<br />

These three algorithms consistently outperform the baseline method, whereas GDA2<br />

and MSM occasionally lag behind the baseline.<br />

3. With the exception of the IXMAS database, the rates of the GDA1, MSM, cMSM,<br />

and DCC remain relatively the same as the subspace dimension m increases. For<br />

IXMAS, the rates seem to increase gradually as m increases in the given range.<br />

4. The GDA2 performs poorly in general and degrades fast as m increases. This can<br />

be ascribed to the properties of the Binet-Cauchy distance explained in Chapter 4.<br />

Due to its product form, the kernel matrix tends to be an identity as the subspace<br />

dimension increases, which is also empirically checked from data.<br />

5.6 Conclusion<br />

In this chapter I defined the Grassmann kernels for subspace-based learning, and showed<br />

constructions of the Projection kernel and the Binet-Cauchy kernel via isometric embed-<br />

dings. Although the embeddings can be used explicitly to represent a subspace as a D × D<br />

projection matrix or a DCm × 1 vector, as in [3], the equivalent kernel representations are<br />

preferred due to the storage and computation requirements.<br />



To demonstrate the potential advantages of the Grassmann kernels, I applied the kernel<br />

discriminant analysis algorithm to image databases represented as collections of subspaces.<br />

For its surprisingly simple form and usage, the proposed method with the Projection kernel<br />

outperformed the other state-of-the-art discriminant methods with the real data. However,<br />

the Binet-Cauchy kernel, when used in its naive form, is shown to be of limited value for

subspace-based learning problems. There are possibly other Grassmann kernels which are<br />

not derived from the two representative kernels, and it is left as a future work to discover<br />

them.<br />

[Figure 5.3 plots recognition rate (%) versus subspace dimension m for the six methods; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00
  GDA (Proj)  96.77  95.70  98.57  99.28  98.92  98.57  97.85  96.77  97.13
  GDA (BC)    96.77  95.34  96.77  96.42  83.87  72.76  55.20  48.03  44.09
  MSM         87.81  89.96  92.11  91.04  91.04  91.04  91.76  92.11  91.76
  cMSM        92.47  96.06  94.98  93.55  94.62  94.98  94.98  96.42  95.70
  DCC         54.12  96.06  94.98  95.70  93.91  94.62  96.42  94.98  93.55

Figure 5.3: Yale Face Database: face recognition rates from various discriminant analysis
methods. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


[Figure 5.4 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  60.73  60.73  60.73  60.73  60.73  60.73  60.73  60.73  60.73
  GDA (Proj)  88.27  74.84  89.77  87.21  91.68  92.54  93.82  93.60  95.31
  GDA (BC)    88.27  71.43  82.52  64.82  58.64  47.55  43.07  39.87  36.25
  MSM         72.28  66.95  65.03  64.61  64.18  63.97  64.61  64.61  64.39
  cMSM        73.13  71.22  67.59  68.23  69.72  69.94  70.15  72.71  72.49
  DCC         77.19  78.89  66.52  63.75  64.61  67.59  67.59  67.59  65.03

Figure 5.4: CMU-PIE Database: face recognition rates from various discriminant analysis
methods.


[Figure 5.5 plots categorization rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00
  GDA (Proj)  88.75  90.00  96.25  97.50  95.00  95.00  95.00  96.25  96.25
  GDA (BC)    88.75  87.50  90.00  81.25  72.50  60.00  51.25  41.25  48.75
  MSM         80.00  78.75  82.50  81.25  81.25  82.50  83.75  81.25  83.75
  cMSM        88.75  91.25  92.50  93.75  93.75  91.25  92.50  93.75  93.75
  DCC         65.00  88.75  83.75  88.75  87.50  87.50  85.00  85.00  85.00

Figure 5.5: ETH-80 Database: object categorization rates from various discriminant analysis
methods.


[Figure 5.6 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5
  FDA (Eucl)  54.87  54.87  54.87  54.87  54.87
  GDA (Proj)  69.09  80.30  84.55  84.24  85.15
  GDA (BC)    69.09  60.00  50.91  36.36  25.15
  MSM         61.82  65.76  63.03  67.58  66.97
  cMSM        63.64  69.39  73.33  77.58  78.48
  DCC         38.79  61.82  74.55  76.67  77.58

Figure 5.6: IXMAS Database: action recognition rates from various discriminant analysis
methods.


Chapter 6<br />

EXTENDED <strong>GRASSMANN</strong> <strong>KERNELS</strong><br />

AND PROBABILISTIC DISTANCES<br />

6.1 Introduction<br />

So far I have modeled the data as the set of linear subspaces. To relax this geometric as-<br />

sumption of the data, let’s take a step back from the assumption and take a probabilistic<br />

view of the data. Let’s suppose a set of vectors consists of i.i.d. samples from an arbitrary prob-

ability distribution. Then it is possible to compare two such distributions of vectors with<br />

probabilistic similarity measures, such as the KL distance 1 [47], the Chernoff distance [15],<br />

or the Bhattacharyya/Hellinger distance [10], to name a few [70, 40, 46, 96]. Furthermore,<br />

the Bhattacharyya affinity is in fact a positive definite kernel function on the space of dis-<br />

tributions and has nice closed-form expressions for the exponential family [40].<br />

In this chapter, I investigate the relationship between the Grassmann kernels and the prob-

abilistic distances. The link is provided by the probabilistic generalization of subspaces<br />

with a Factor Analyzer [22], which is a Gaussian distribution that resembles a pancake.<br />

1 By distance I mean any nonnegative measure of similarity and not necessarily a metric.<br />



The first result I show is that the KL distance is reduced to the Projection kernel under<br />

the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit<br />

and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL<br />

distance, I propose an extension of the Projection kernel which is originally confined to the<br />

set of linear subspaces, to the set of affine as well as scaled subspaces. For this I introduce<br />

the affine Grassmann manifold and kernels.<br />

I demonstrate the extended kernels with the Support Vector Machines and the Kernel<br />

Discriminant Analysis using synthetic and real image databases. The experiments show the<br />

advantages of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels.<br />

6.2 Analysis of probabilistic distances and kernels<br />

In this section I introduce several well-known probabilistic distances, and establish their<br />

relationships with the Grassmann distances and kernels.<br />

6.2.1 Probabilistic distances and kernels<br />

Various probabilistic distances between distributions have been proposed in the literature.<br />

Some of them yield closed-form expressions for the exponential family and are convenient<br />

for analysis. Below is a short list of those distances.<br />

• KL distance :<br />

J(p1, p2) = ∫ p1(x) log( p1(x)/p2(x) ) dx    (6.1)

The KL distance is probably the most frequently used distance in learning problems.<br />

It is sometimes called the relative entropy and plays a fundamental role in information

theory.<br />



• KL distance (symmetric) :<br />

JKL(p1, p2) = ∫ [p1(x) − p2(x)] log( p1(x)/p2(x) ) dx    (6.2)

Since the original KL distance is asymmetric, this symmetrized version is often used<br />

instead. I exclusively use the symmetric version in the chapter. This distance is still<br />

not a valid metric.<br />

• Chernoff distance:

JCher(p1, p2) = − log ∫ p1^{α1}(x) p2^{α2}(x) dx,   (α1 + α2 = 1, α1, α2 > 0)    (6.3)

The Chernoff distance is asymmetric. A symmetric version of the distance with
α1 = α2 = 1/2 is known as the Bhattacharyya distance:

JBhat(p1, p2) = − log ∫ [p1(x) p2(x)]^{1/2} dx    (6.4)

• Hellinger distance:

JHel(p1, p2) = ∫ ( √p1(x) − √p2(x) )² dx    (6.5)

The Hellinger distance is directly related to the Bhattacharyya distance by JHel =
2(1 − exp(−JBhat)).

One can also define similarity measures instead of the dissimilarity measures above.<br />

Jebara and Kondor [40] proposed the Probability Product kernel<br />

kProb(p1, p2) = ∫ p1^α(x) p2^α(x) dx,   (α > 0).    (6.6)



By construction, this kernel is positive definite in the space of normalized probability distri-<br />

butions [40]. This kernel includes the Bhattacharrya and the Expected Likelihood kernels<br />

as special cases:<br />

• Bhattacharyya kernel (α = 1/2):

kBhat(p1, p2) = ∫ [p1(x) p2(x)]^{1/2} dx    (6.7)

• Expected Likelihood kernel (α = 1):

kEL(p1, p2) = ∫ p1(x) p2(x) dx    (6.8)

The probabilistic distances are closely related to each other. For example, the Hellinger<br />

distance forms a bound on the KL distance [77], and the Bhattacharyya distance and the<br />

KL distance are both instances of the Rényi divergence [63]. However the behaviors of<br />

the distances are quite different under my data model. I examine the KL distance and the<br />

Probability Product kernel in particular.

6.2.2 Data as Mixture of Factor Analyzers<br />

The probabilistic distances in the previous section are not restricted to specific distributions.<br />

However, I will model the data distribution as the Mixture of Factor Analyzers (MFA) [27].<br />

If we have i = 1, ..., N sets in the data, then each set is considered as i.i.d. samples from<br />

the i-th Factor Analyzer

x ∼ pi(x) = N(ui, Ci),   Ci = YiYi′ + σ²I_D,    (6.9)

[Figure 6.1 compares the two models; only its caption is reproduced here.]

Figure 6.1: Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold
(left), the set of linear subspaces, can alternatively be modeled as the set of flat
(σ → 0) spheres (Yi′Yi = Im) intersecting at the origin (ui = 0). The right figure shows a
general Mixture of Factor Analyzers which are not bound by these conditions.

where ui ∈ R D is the mean, Yi is a full-rank D × m matrix (D > m), and σ is the ambient<br />

noise level. The factor analyzer model is a practical substitute for a Gaussian distribution<br />

when the dimensionality D of the images is greater than the number of samples n in a set,
in which case it is impossible to estimate the full covariance C, let alone invert it.

More importantly, I use the factor analyzer distribution to provide the link between the<br />

Grassmann manifold and the space of probabilistic distributions. In fact a linear subspace<br />

can be considered as the ‘flattened’ (σ → 0) limit of a zero-mean (ui = 0), homogeneous
(Yi′Yi = Im) factor analyzer distribution, as depicted in Figure 6.1.

Some linear algebra<br />

Let’s summarize some linear algebraic shortcuts to analyze the distances. The inversion<br />

lemma will be used several times. For σ > 0, we have the identity:<br />

Ci⁻¹ = (YiYi′ + σ²I)⁻¹ = σ⁻²( I − Yi(σ²I + Yi′Yi)⁻¹Yi′ ).



Let M1 and M2 be the m × m matrices

M1 = (σ²Im + Y1′Y1)⁻¹,   M2 = (σ²Im + Y2′Y2)⁻¹,

and let Ỹ1 and Ỹ2 be the matrices

Ỹ1 = Y1 M1^{1/2},   Ỹ2 = Y2 M2^{1/2},

so that Ci⁻¹ = σ⁻²(I_D − ỸiỸi′). From the identity we can compute the following:

C1⁻¹C2 + C2⁻¹C1 = (σ²I_D + Y1Y1′)⁻¹(σ²I_D + Y2Y2′) + (σ²I_D + Y2Y2′)⁻¹(σ²I_D + Y1Y1′)
  = σ⁻²(I_D − Ỹ1Ỹ1′)(σ²I_D + Y2Y2′) + σ⁻²(I_D − Ỹ2Ỹ2′)(σ²I_D + Y1Y1′)
  = 2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ + σ⁻²(Y1Y1′ + Y2Y2′ − Ỹ1Ỹ1′Y2Y2′ − Ỹ2Ỹ2′Y1Y1′),

(C1 + C2)⁻¹ = (2σ²I_D + Y1Y1′ + Y2Y2′)⁻¹
  = (2σ²)⁻¹(I_D + ZZ′)⁻¹ = (2σ²)⁻¹(I_D − Z(I_{2m} + Z′Z)⁻¹Z′),
  where Z = (2σ²)^{−1/2}[Y1 Y2],

C1⁻¹ + C2⁻¹ = σ⁻²(2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′) = 2σ⁻²(I_D − Z̃Z̃′),
  where Z̃ = 2^{−1/2}[Ỹ1 Ỹ2],

(C1⁻¹ + C2⁻¹)⁻¹ = (σ²/2)(I_D − Z̃Z̃′)⁻¹ = (σ²/2)(I_D + Z̃(I_{2m} − Z̃′Z̃)⁻¹Z̃′).
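These shortcuts are easy to check numerically for a small D. The sketch below verifies the inversion-lemma form of Ci⁻¹ against a direct inverse; it is purely illustrative, with dimensions chosen small so that the D × D inverse is still feasible.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, sigma2 = 8, 3, 0.25

Y = rng.standard_normal((D, m))
C = Y @ Y.T + sigma2 * np.eye(D)

# Inversion lemma: C^{-1} = sigma^{-2} (I - Y (sigma^2 I + Y'Y)^{-1} Y').
M = np.linalg.inv(sigma2 * np.eye(m) + Y.T @ Y)
C_inv_lemma = (np.eye(D) - Y @ M @ Y.T) / sigma2
assert np.allclose(C_inv_lemma, np.linalg.inv(C))

# The 'whitened' basis used throughout the derivations: Y_tilde Y_tilde' = Y M Y'.
Y_tilde = Y @ np.linalg.cholesky(M)       # any square root of M works for this check
assert np.allclose(C_inv_lemma, (np.eye(D) - Y_tilde @ Y_tilde.T) / sigma2)
```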

6.2.3 Analysis of KL distance<br />

The KL distance between two Factor Analyzers is as follows:

JKL(p1, p2) = (1/2) tr[ C2⁻¹C1 + C1⁻¹C2 − 2I_D ] + (1/2)(u1 − u2)′(C1⁻¹ + C2⁻¹)(u1 − u2)
  = (1/2) tr( −Ỹ1′Ỹ1 − Ỹ2′Ỹ2 ) + (σ⁻²/2) tr( Y1′Y1 + Y2′Y2 − Ỹ1′Y2Y2′Ỹ1 − Ỹ2′Y1Y1′Ỹ2 )
    + (σ⁻²/2)(u1 − u2)′( 2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ )(u1 − u2).    (6.10)

Furthermore, we can write the distance as

JKL(p1, p2) = (1/2) tr( −Ỹ1′Ỹ1 − Ỹ2′Ỹ2 ) + (σ⁻²/2) tr( Y1′Y1 + Y2′Y2 − Ỹ1′Y2Y2′Ỹ1 − Ỹ2′Y1Y1′Ỹ2 )
    + σ⁻²( u′u − u′Z̃Z̃′u ),

where u = u1 − u2. Note that the computation of the distance involves only products of
the column vectors of Yi and ui, and we need not handle any D × D matrix explicitly.

KL in the limit yields the projection kernel

For ui = 0 and Yi′Yi = Im, we have Ỹi′Ỹi = (σ² + 1)⁻¹Im, and therefore

JKL(p1, p2) = (1/2)( −2m/(σ² + 1) ) + (σ⁻²/2)( 2m − 2(σ² + 1)⁻¹ tr(Y1′Y2Y2′Y1) )
  = ( 1/(2σ²(σ² + 1)) )( 2m − 2 tr(Y1′Y2Y2′Y1) ).

We can ignore the multiplying factors which do not depend on Y1 or Y2, and rewrite the
distance as

JKL(p1, p2) ∝ 2m − 2 tr(Y1′Y2Y2′Y1).



One can immediately realize that this is indeed the definition of the squared Projection<br />

distance d 2 Proj (Y1, Y2) up to multiplicative factors.<br />
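This limiting behaviour can be confirmed numerically: for zero-mean, homogeneous factor analyzers with a small σ, the pairwise symmetric KL distances are, up to one global scale, the squared Projection distances. The snippet below is a hedged check of that statement (my own construction, not part of the original derivation).

```python
import numpy as np

def sym_kl_homogeneous(Y1, Y2, sigma2):
    # Symmetric KL between N(0, Y1Y1' + s^2 I) and N(0, Y2Y2' + s^2 I); D kept small for clarity.
    D = Y1.shape[0]
    C1, C2 = Y1 @ Y1.T + sigma2 * np.eye(D), Y2 @ Y2.T + sigma2 * np.eye(D)
    return 0.5 * np.trace(np.linalg.solve(C1, C2) + np.linalg.solve(C2, C1) - 2 * np.eye(D))

def d2_proj(Y1, Y2):
    m = Y1.shape[1]
    return m - np.linalg.norm(Y1.T @ Y2, 'fro') ** 2       # = sum_i sin^2(theta_i)

rng = np.random.default_rng(0)
D, m, sigma2 = 10, 2, 1e-4
bases = [np.linalg.qr(rng.standard_normal((D, m)))[0] for _ in range(4)]
ratios = [sym_kl_homogeneous(A, B, sigma2) / d2_proj(A, B)
          for A in bases for B in bases if A is not B]
print(np.allclose(ratios, ratios[0], rtol=1e-2))           # all ratios agree: KL is prop. to dProj^2
```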

6.2.4 Analysis of Probability Product Kernel<br />

The Probability Product Kernel for Gaussian distributions is [40]<br />

kProb(p1, p2) = (2π)^{(1−2α)D/2} det(C†)^{1/2} det(C1)^{−α/2} det(C2)^{−α/2}
                × exp( −(1/2)[ α u1′C1⁻¹u1 + α u2′C2⁻¹u2 − (u†)′C†u† ] ),    (6.11)

where C† = α⁻¹(C1⁻¹ + C2⁻¹)⁻¹ and u† = α(C1⁻¹u1 + C2⁻¹u2).

To compute the determinant terms for Factor Analyzers, we use the following identity:<br />

if A and B are D × m matrices, then<br />

det(I_D + AB′) = det(Im + B′A) = Π_{i=1}^{m} (1 + τi(B′A)),    (6.12)

where τi is the i-th singular value of B ′ A. Using the identity we can write the following.<br />

det C1⁻¹ = det( σ⁻²(I_D − Ỹ1Ỹ1′) ) = σ^{−2D} det(Im − Ỹ1′Ỹ1) = σ^{−2D} Π_{i=1}^{m} (1 − τi(Ỹ1′Ỹ1)),

det C2⁻¹ = det( σ⁻²(I_D − Ỹ2Ỹ2′) ) = σ^{−2D} det(Im − Ỹ2′Ỹ2) = σ^{−2D} Π_{i=1}^{m} (1 − τi(Ỹ2′Ỹ2)),

det C† = det( σ²(2α)⁻¹(I_D + Z̃(I_{2m} − Z̃′Z̃)⁻¹Z̃′) )
  = σ^{2D}(2α)^{−D} det( I_{2m} + (I_{2m} − Z̃′Z̃)⁻¹Z̃′Z̃ ) = σ^{2D}(2α)^{−D} det(I_{2m} − Z̃′Z̃)⁻¹
  = σ^{2D}(2α)^{−D} Π_{i=1}^{2m} (1 − τi(Z̃′Z̃))⁻¹.


To compute the exponents in (6.11) we use the following identities (recall C† = α⁻¹(C1⁻¹ + C2⁻¹)⁻¹):

C1⁻¹(C1⁻¹ + C2⁻¹)⁻¹C2⁻¹ = C2⁻¹(C1⁻¹ + C2⁻¹)⁻¹C1⁻¹ = (C1 + C2)⁻¹
  = (2σ²)⁻¹( I_D − Z(I_{2m} + Z′Z)⁻¹Z′ ),

C1⁻¹(C1⁻¹ + C2⁻¹)⁻¹C1⁻¹ = C1⁻¹ − (C1 + C2)⁻¹
  = (σ⁻²/2)( 2I_D − 2Ỹ1Ỹ1′ − I_D + Z(I_{2m} + Z′Z)⁻¹Z′ )
  = (σ⁻²/2)( I_D − 2Ỹ1Ỹ1′ + Z(I_{2m} + Z′Z)⁻¹Z′ ),

C2⁻¹(C1⁻¹ + C2⁻¹)⁻¹C2⁻¹ = C2⁻¹ − (C1 + C2)⁻¹
  = (σ⁻²/2)( 2I_D − 2Ỹ2Ỹ2′ − I_D + Z(I_{2m} + Z′Z)⁻¹Z′ )
  = (σ⁻²/2)( I_D − 2Ỹ2Ỹ2′ + Z(I_{2m} + Z′Z)⁻¹Z′ ).
2 + Z(I2m + Z ′ Z) −1 Z ′ )<br />

Plugging these results back in (6.11) we again can compute the kernel without handling<br />

any D × D matrix. For concreteness I derive the Bhattacharyya kernel as an instance of the<br />

probability product kernel with α = 1/2 as follows:<br />

kBhat(p1, p2) = det(C † ) 1/2 det(C1) −1/4 det(C2) −1/4 exp − 1<br />

4 (u1 − u2) ′ (C1 + C2) −1 (u1 − u2)<br />

= det(I2m − � Z ′ � Z) −1/2 det(Im − � Y ′<br />

1 � Y1) 1/4 det(Im − � Y ′<br />

2 � Y2) 1/4<br />

× exp − σ−2<br />

4 (u1 − u2) ′ (ID − Z(I2m + Z ′ Z) −1 Z ′ )(u1 − u2). (6.13)<br />

108


Probability product kernel in the limit becomes trivial<br />

For ui = 0 and Yi′Yi = Im, we have

kProb(p1, p2) = (2π)^{(1−2α)D/2} det(C†)^{1/2} det(C1)^{−α/2} det(C2)^{−α/2}
  = (2π)^{(1−2α)D/2} σ^D (2α)^{−D/2} det(I_{2m} − Z̃′Z̃)^{−1/2}
    × σ^{−αD} det(Im − Ỹ1′Ỹ1)^{α/2} σ^{−αD} det(Im − Ỹ2′Ỹ2)^{α/2}
  = π^{(1−2α)D/2} 2^{−αD} α^{−D/2} ( σ^{2α(m−D)+D} / (σ² + 1)^{αm} ) det(I_{2m} − Z̃′Z̃)^{−1/2},

and furthermore,

det(I_{2m} − Z̃′Z̃)^{−1/2}
  = det( I_{2m} − (1/2) [ Ỹ1′Ỹ1  Ỹ1′Ỹ2 ; Ỹ2′Ỹ1  Ỹ2′Ỹ2 ] )^{−1/2}
  = det( I_{2m} − (1/(2(σ² + 1))) [ Im  Y1′Y2 ; Y2′Y1  Im ] )^{−1/2}
  = ( 2(σ² + 1)/(2σ² + 1) )^{m} det( Im − (1/(2σ² + 1)²) Y1′Y2Y2′Y1 )^{−1/2}.

Ignoring the terms which are not functions of Y1 or Y2, we have

kProb(Y1, Y2) ∝ det( Im − (1/(2σ² + 1)²) Y1′Y2Y2′Y1 )^{−1/2}.

Suppose the two subspaces span(Y1) and span(Y2) intersect only at the origin, that is,<br />

the singular values of Y1′Y2 are strictly less than 1. In this case kProb has a finite value as



σ → 0 and the inversion is well-defined. In contrast, the diagonal terms of kProb become<br />

kProb(Y1, Y1) = det( (1 − 1/(2σ² + 1)²) Im )^{−1/2} = ( (2σ² + 1)² / (4σ²(σ² + 1)) )^{m/2},    (6.14)

which diverges to infinity as σ → 0. This implies that after the kernel is normalized by the<br />

diagonal terms, it becomes a trivial kernel:<br />

k̃Prob(Yi, Yj) → 1 if span(Yi) = span(Yj), and 0 otherwise,   as σ → 0.    (6.15)

As I claimed earlier, the Probability Product kernel, including the Bhattacharyya kernel,<br />

loses its discriminating power as the Gaussian distributions become flatter.<br />

6.3 Extended Grassmann Kernel<br />

In the previous section I presented the probabilistic interpretation of the Projection kernel.<br />

Based on this analysis, I propose extensions of the Projection kernel and make the kernels<br />

applicable to more general data. In this section I examine the two directions of extension:<br />

from linear to affine subspaces, and from homogeneous to scaled subspaces.<br />

6.3.1 Motivation<br />

The motivations for considering affine subspaces and non-homogeneous subspaces arise<br />

from observing the subspaces computed from real data. Firstly, the sets of images, for
example those from the Yale Face database, have nonzero means. If the mean is significantly

different from set to set, we want to use the mean image as well as the PCA basis images to<br />

represent a set. Secondly, the eigenvalues from PCA almost always have non-homogeneous<br />

values. It is likely that the eigenvector direction corresponding to a larger eigenvalue is<br />

[Figure 6.2 has three panels: A. Linear, B. Affine, C. Scaled.]

Figure 6.2: The Mixture of Factor Analyzers model of the Grassmann manifold is the collection
of linear, homogeneous Factor Analyzers, shown as flat spheres intersecting at the origin
(A). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B), and also to
allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C).

more important than the eigenvector direction corresponding to a smaller eigenvalue. In<br />

which case we want to consider the eigenvalue scales as well as the eigenvectors when<br />

representing the set.<br />

These two extensions are naturally derived from the probabilistic generalization of sub-<br />

spaces. Figure 6.2 illustrates the ideas. Considering the data as a MFA distribution, we<br />

can gradually relax the zero-mean (ui = 0) condition in Figure A to the nonzero-mean<br />

(ui = arbitrary) condition in Figure B, and furthermore relax the homogeneity (Y ′ Y = I)<br />

condition to the non-homogeneous (Y ′ Y = full rank) condition in Figure C.<br />

From this I expect to benefit from both worlds – probabilistic distributions and geo-<br />

metric manifolds. However, simply relaxing the conditions and taking the limit σ → 0 of<br />

the KL distance does not guarantee a metric or a positive definite kernel, as we will shortly

examine. Certain compromises have to be made to turn the KL distance in the limit into<br />

a well-defined and usable kernel function. In the following sections I propose new frame-<br />

works for the extensions and the technical details for making valid kernels.<br />



6.3.2 Extension to affine subspaces<br />

An affine subspace in R D is simply a linear subspace with an offset. In that sense a linear<br />

subspace is an affine subspace with a zero offset.<br />

The affine span is an analog of a linear span. Let Y ∈ R D×m be an orthonormal basis<br />

matrix for a subspace, and u ∈ R D denote the offset of the subspace from the origin. The<br />

affine span can then be defined as<br />

aspan(Y, u) = {x | x = Y v + u, ∀v ∈ R m }. (6.16)<br />

This representation of an affine span is not unique, since different Y ’s can share the same<br />

linear span, and different offsets u’s can imply the same amount of bias. Formally, this can<br />

be expressed as an equivalence relation:<br />

Definition 6.1. aspan(Y1, u1) = aspan(Y2, u2) if and only if

span(Y1) = span(Y2)  and  Y1⊥(Y1⊥)′u1 = Y2⊥(Y2⊥)′u2,

where Y⊥ is any orthonormal basis for the orthogonal complement of span(Y), that is, Y Y′ + Y⊥(Y⊥)′ = ID.

Although the offset u is not unique, one can choose a unique 'standard' offset û by

û = (ID − Y Y′)u = Y⊥(Y⊥)′u,    (6.17)

which has the shortest distance from the origin to the affine span (refer to Figure 6.3).
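The following throwaway NumPy check (variable names are mine) verifies that û depends only on the affine span: rotating the basis and shifting the offset within span(Y) leaves û unchanged.

```python
import numpy as np

def standard_offset(Y, u):
    # u_hat = (I - Y Y') u of Eq. (6.17); Y is assumed to have orthonormal columns.
    return u - Y @ (Y.T @ u)

rng = np.random.default_rng(0)
D, m = 5, 2
Y, _ = np.linalg.qr(rng.standard_normal((D, m)))     # orthonormal basis
u = rng.standard_normal(D)

# Another representative of the same affine span: rotated basis, offset shifted within span(Y).
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
u_alt = u + Y @ rng.standard_normal(m)

print(np.allclose(standard_offset(Y, u), standard_offset(Y @ R, u_alt)))   # True
```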



Figure 6.3: The same affine span can be expressed with different offsets u1, u2, ... However,<br />

one can use the unique ‘standard’ offset û, which has the shortest length from the origin.<br />

Affine Grassmann manifold<br />

I define an affine Grassmann manifold analogous to the linear Grassmann manifold:

Definition 6.2. The affine Grassmann manifold AG(m, D) is the set of m-dimensional affine subspaces of R^D.

The set of all m-dimensional affine subspaces in R^D is a smooth non-compact manifold that can be defined as the following quotient space, similarly to the Grassmann manifold:

AG(m, D) = E(D) / E(m) × O(D − m),    (6.18)

where E(·) denotes the Euclidean group. To see this, let

X = [ Y   Y⊥   u ]
    [ 0   0    1 ]

be the homogeneous space representation of aspan(Y, u) in E(D).


Then the only subgroup of E(D) that leaves aspan(Y, u) unchanged by right-multiplication is the set of matrices of the form

[ Rm    0       v ]
[ 0     RD−m    0 ]  ∈ E(m) × O(D − m),
[ 0     0       1 ]

where Rm and RD−m are any two matrices in O(m) and O(D − m) respectively, and v ∈ R^m is any vector.

To check this, note that

X [ Rm    0       v ]     [ Y   Y⊥   u ] [ Rm    0       v ]     [ Y Rm    Y⊥ RD−m    Y v + u ]
  [ 0     RD−m    0 ]  =  [ 0   0    1 ] [ 0     RD−m    0 ]  =  [ 0       0           1       ]
  [ 0     0       1 ]                    [ 0     0       1 ]

has aspan(Y Rm, Y v + u), which is the same as aspan(Y, u) by Definition 6.1.

Affine Grassmann kernel<br />

Similarly to the definition of Grassmann kernels in Definition 5.1, we can now define the<br />

affine Grassmann kernel as follows.<br />

Definition 6.3. Let k : (R D×m × R D ) × (R D×m × R D ) → R be a real valued symmetric<br />

function k(Y1, u1, Y2, u2) = k(Y2, u2, Y1, u1). The function k is an affine Grassmann kernel<br />

if it is 1) positive definite and 2) invariant to different representations:<br />



k(Y1, u1, Y2, u2) = k(Y3, u3, Y4, u4), ∀Y1, Y2, Y3, Y4, u1, u2, u3, u4<br />

if aspan(Y1, u1) = aspan(Y3, u3) and aspan(Y2, u2) = aspan(Y4, u4).<br />

With this definition we can check if the KL distance in the limit suggests an affine<br />

Grassmann kernel.<br />

KL distance in the limit<br />

The KL distance only with the homogeneity condition Y1′Y1 = Y2′Y2 = Im becomes

JKL(p1, p2) = 1/(2σ²(σ² + 1)) (2m − 2 tr(Y1′Y2Y2′Y1))
            + 1/(2σ²(σ² + 1)) (u1 − u2)′ [ 2(σ² + 1)ID − Y1Y1′ − Y2Y2′ ] (u1 − u2)

            → 1/(2σ²) [ 2m − 2 tr(Y1′Y2Y2′Y1) + (u1 − u2)′ (2ID − Y1Y1′ − Y2Y2′) (u1 − u2) ].

If the multiplicative factor is ignored, the first term is the same as the zero-mean case, which I denote as the 'linear' kernel

kLin(Y1, Y2) = tr(Y1Y1′Y2Y2′) = kProj(Y1, Y2).

The second term

ku(Y1, u1, Y2, u2) = u1′(2ID − Y1Y1′ − Y2Y2′)u2

measures the similarity of the means scaled by the matrix 2I − Y1Y1′ − Y2Y2′. However, this term does not satisfy the affine invariance condition of Definition 6.3. Note that the term



ku can be expressed as<br />

ku(Y1, u1, Y2, u2) = û1 ′ u2 + û2 ′ u1,<br />

with the standard offset notation. From this observation, I propose the following modifica-<br />

tion:<br />

Theorem 6.4. ku(Y1, u1, Y2, u2) = u1′(I − Y1Y1′)(I − Y2Y2′)u2 = û1′û2 is an affine Grassmann kernel.

Proof. 1. Invariance: if aspan(Y1, u1) = aspan(Y3, u3) and aspan(Y2, u2) = aspan(Y4, u4), then Y1Y1′ = Y3Y3′, Y2Y2′ = Y4Y4′, Y1⊥(Y1⊥)′u1 = Y3⊥(Y3⊥)′u3, and Y2⊥(Y2⊥)′u2 = Y4⊥(Y4⊥)′u4, and therefore

k(Y1, u1, Y2, u2) = u1′(I − Y1Y1′)(I − Y2Y2′)u2 = u3′(I − Y3Y3′)(I − Y4Y4′)u4 = k(Y3, u3, Y4, u4).

2. Positive definiteness:

Σ_{i,j} ci cj ui′(I − YiYi′)(I − YjYj′)uj = ‖ Σ_i ci (I − YiYi′)ui ‖² ≥ 0.

Combined with the linear term kLin, the modified term defines the new ‘affine’ kernel:<br />

kAff(Y1, u1, Y2, u2) = tr(Y1Y1′Y2Y2′) + u1′(I − Y1Y1′)(I − Y2Y2′)u2.    (6.19)
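As a sanity check of Theorem 6.4, the invariance of kAff in (6.19) can be verified numerically. This is a minimal sketch with my own variable names; Y1 and Y2 are orthonormal bases:

```python
import numpy as np

def k_aff(Y1, u1, Y2, u2):
    # kAff of Eq. (6.19) for orthonormal bases Y1, Y2.
    D = Y1.shape[0]
    P1 = np.eye(D) - Y1 @ Y1.T
    P2 = np.eye(D) - Y2 @ Y2.T
    return np.trace(Y1 @ Y1.T @ Y2 @ Y2.T) + u1 @ P1 @ P2 @ u2

rng = np.random.default_rng(1)
D, m = 6, 2
Y1, _ = np.linalg.qr(rng.standard_normal((D, m))); u1 = rng.standard_normal(D)
Y2, _ = np.linalg.qr(rng.standard_normal((D, m))); u2 = rng.standard_normal(D)

# Reparameterize each affine span: rotate the basis, shift the offset within the span.
R1, _ = np.linalg.qr(rng.standard_normal((m, m))); Y3, u3 = Y1 @ R1, u1 + Y1 @ rng.standard_normal(m)
R2, _ = np.linalg.qr(rng.standard_normal((m, m))); Y4, u4 = Y2 @ R2, u2 + Y2 @ rng.standard_normal(m)

print(np.isclose(k_aff(Y1, u1, Y2, u2), k_aff(Y3, u3, Y4, u4)))   # True, as the theorem claims
```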



General construction<br />

The limit of the KL distance with nonzero means has two terms: u-related and Y -related.<br />

This suggests a general construction rule for affine kernels. That is, if one has two separate<br />

positive definite kernels for means and for subspaces, one can add or multiply them together to construct new kernels. The various ways of generating new kernels from known kernels in

Theorem 2.7 can be used to create a novel kernel for affine subspaces.<br />

Basri’s embedding<br />

There are alternatives for representing affine spans. One representation proposed by Basri<br />

et al. [3] is to use the pair (Y ⊥ , t) instead of (Y, u) where t is related to u by u = Y ⊥ t.<br />

The authors embed an affine subspace into a Euclidean space of dimension (D + 1)² by the following injective map:

Ψ : aspan(Y, u) → R^((D+1)×(D+1)),    (Y, u) ↦ [ Y⊥(Y⊥)′    Y⊥t ]
                                               [ t′(Y⊥)′    t′t ] .    (6.20)

This embedding is a direct analogue of the isometric embedding of linear subspaces into projection matrices in Theorem 5.2.

The authors did not mention or use kernel methods in the paper. However, their pro-<br />

posed embedding has the natural corresponding kernel:<br />

k(Y1⊥, t1, Y2⊥, t2) = tr [ ( Y1⊥(Y1⊥)′   Y1⊥t1 ) ( Y2⊥(Y2⊥)′   Y2⊥t2 ) ]
                          ( (Y1⊥t1)′     t1′t1 ) ( (Y2⊥t2)′     t2′t2 )

                    = tr [ Y1⊥(Y1⊥)′Y2⊥(Y2⊥)′ + Y1⊥t1(Y2⊥t2)′              ···
                           ···                              (Y1⊥t1)′Y2⊥t2 + t1′t1 t2′t2 ]

                    = tr( Y1⊥(Y1⊥)′Y2⊥(Y2⊥)′ ) + 2 t1′(Y1⊥)′Y2⊥t2 + t1′t1 t2′t2.
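A small NumPy sketch of this embedding and the induced kernel (the helper names are mine; Y⊥ is computed from a complete QR factorization, and t is the coordinate of the offset in the Y⊥ basis, u = Y⊥t):

```python
import numpy as np

def perp_basis(Y):
    # An orthonormal basis of span(Y)^perp from a complete QR factorization.
    D, m = Y.shape
    Q, _ = np.linalg.qr(Y, mode='complete')
    return Q[:, m:]

def basri_embed(Y, u):
    # The (D+1) x (D+1) matrix of Eq. (6.20).
    Yp = perp_basis(Y)
    t = Yp.T @ u
    top = np.hstack([Yp @ Yp.T, (Yp @ t)[:, None]])
    bottom = np.hstack([(Yp @ t)[None, :], np.array([[t @ t]])])
    return np.vstack([top, bottom]), Yp, t

def k_basri(Y1, u1, Y2, u2):
    M1, Yp1, t1 = basri_embed(Y1, u1)
    M2, Yp2, t2 = basri_embed(Y2, u2)
    closed = (np.trace(Yp1 @ Yp1.T @ Yp2 @ Yp2.T)
              + 2 * (t1 @ Yp1.T @ Yp2 @ t2) + (t1 @ t1) * (t2 @ t2))
    assert np.isclose(np.trace(M1 @ M2), closed)   # matches the closed form derived above
    return closed
```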


Figure 6.4: Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as two representations of linear subspaces (Left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (Right).

Although this is another valid kernel, it does not admit a probabilistic interpretation. Furthermore, their representation requires D × (D − m) matrices Yi⊥, which is more costly in storage and computation than the representation used in this thesis, since typically m ≪ D.

6.3.3 Extension to scaled subspaces<br />

So far I have assumed that the subspaces are homogeneous, Y′Y = Im, that is, there is no preferred direction within the subspace. However, even if two subspaces have the same linear or affine span, one can further distinguish the two subspaces by allowing scales or orientations within the subspaces, as illustrated in Figure 6.4.

With the relaxation of Y to any D × m full-rank matrix, the term subspace is no longer applicable in a strict sense. Nevertheless, let's refer to the relaxation as a 'scaled' subspace and the corresponding kernel as the 'scaled' Grassmann kernel, in conformity with the previous usages in this thesis.

A scaled subspace has the same Euclidean representation (Y, u) as before but has a<br />

different equivalence relation. Let Y1 and Y2 be any D × m full-rank matrices, and u1 and<br />

u2 be offsets. The equivalence is then defined by

(Y1, u1) ≅ (Y2, u2) if and only if Y1Y1′ = Y2Y2′ and u1 = u2.    (6.21)

The scaled subspace is in one-to-one correspondence with the Cartesian product MD,m×<br />

R D , where MD,m is the set of D × D symmetric positive semidefinite matrices of rank m,<br />

via the embedding<br />

Ψ : (Y, u) ↦→ [Y Y ′ | u] ∈ R D×(D+1) .<br />

However, the topology and the metric from this embedding do not have a probabilistic motivation, similarly to Basri's embedding (6.20). I instead examine the limit of the KL distance and construct a positive definite kernel with the invariance condition (6.21).

KL distance in the limit<br />

To incorporate these scales into affine subspaces, I allow the product Y Y′ to be a non-identity matrix and make sure that the definition of the kernel is valid and consistent.

Let Yi be full-rank but not necessarily orthonormal. In this case the KL distance becomes

JKL(p1, p2) → 1/(2σ²) tr(Y1Y1′ + Y2Y2′ − Ỹ1Ỹ1′Y2Y2′ − Ỹ2Ỹ2′Y1Y1′)
            + 1/(2σ²) (u1 − u2)′ [ 2ID − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ ] (u1 − u2),

where Ỹi denotes the orthonormalization of Yi,

Ỹi = lim_{σ→0} Ŷi = Yi(Yi′Yi)^(−1/2).


Ignoring the multiplicative factors, we can see that the corresponding form

k = (1/2) tr(Ỹ1Ỹ1′Y2Y2′ + Y1Y1′Ỹ2Ỹ2′) + u1′(2I − Ỹ1Ỹ1′ − Ỹ2Ỹ2′)u2

is again not a well-defined kernel.

The first term

(1/2) tr(Ỹ1Ỹ1′Y2Y2′ + Y1Y1′Ỹ2Ỹ2′)

is not positive definite, and there are several ways to remedy it:

• Fully unnormalized: k(Y1, Y2) = tr(Y1Y1′Y2Y2′),

• Partially normalized: k(Y1, Y2) = tr(Y1Ỹ1′Ỹ2Y2′) = tr(Ỹ1Y1′Ỹ2Y2′) = tr(Ỹ1Y1′Y2Ỹ2′),

• Fully normalized: k(Y1, Y2) = tr(Ỹ1Ỹ1′Ỹ2Ỹ2′).

I use the partially normalized form since it scales in the same way as the original form under a global scaling factor multiplied to the Y's (in preliminary experiments, however, these kernels showed similar results).

The second term

u1′(2I − Ỹ1Ỹ1′ − Ỹ2Ỹ2′)u2

is the same as in the affine case, and we also have several ways to make it well-defined and positive definite:

• Affine invariant: ku(Y1, u1, Y2, u2) = u1′(I − Ỹ1Ỹ1′)(I − Ỹ2Ỹ2′)u2 = û1′û2,

• Direct inner product: ku(Y1, u1, Y2, u2) = u1′u2.

Since the affine invariance condition is irrelevant for scaled subspaces (Figure 6.3), the<br />

direct inner product form is a better choice.<br />

Finally, the sum of the two modified terms is the ‘affine scaled’ kernel I propose:<br />



Theorem 6.5. kAffSc(Y1, u1, Y2, u2) = tr(Y1Ỹ1′Ỹ2Y2′) + u1′u2    (6.22)
is a positive definite kernel for scaled subspaces.

Proof. The term u1′u2 is obviously well-defined and positive definite, so let's look at only the first term.

1. Well-defined: let's show that if Y1Y1′ = Y3Y3′ then Y1Ỹ1′ = Y3Ỹ3′. Take squares of the second equation to see that

(Y1Ỹ1′)² = Y1(Y1′Y1)^(−1/2)Y1′Y1(Y1′Y1)^(−1/2)Y1′ = Y1Y1′ = Y3Y3′ = Y3(Y3′Y3)^(−1/2)Y3′Y3(Y3′Y3)^(−1/2)Y3′ = (Y3Ỹ3′)².

Since Y1Ỹ1′ and Y3Ỹ3′ are both symmetric positive semidefinite matrices, the equality of the squares implies Y1Ỹ1′ = Y3Ỹ3′. By the same argument, if Y2Y2′ = Y4Y4′ then Y2Ỹ2′ = Y4Ỹ4′.

2. Positive definiteness:

Σ_{i,j} ci cj tr(YiỸi′ỸjYj′) = tr( (Σ_i ci YiỸi′)(Σ_j cj ỸjYj′) ) = ‖ Σ_i ci YiỸi′ ‖²_F ≥ 0.


Summary of the extended kernels<br />

The proposed kernels are summarized below. Let Yi be a full-rank D × m matrix, and let Ỹi = Yi(Yi′Yi)^(−1/2) be the orthonormalization of Yi as before.

kProj(Y1, Y2) = kLin(Y1, Y2) = tr(Ỹ1′Ỹ2Ỹ2′Ỹ1)

kAff(Y1, u1, Y2, u2) = tr(Ỹ1′Ỹ2Ỹ2′Ỹ1) + u1′(I − Ỹ1Ỹ1′)(I − Ỹ2Ỹ2′)u2

kAffSc(Y1, u1, Y2, u2) = tr(Y1Ỹ1′Ỹ2Y2′) + u1′u2.    (6.23)

I also spherize the kernels,

k̃(Y1, u1, Y2, u2) = k(Y1, u1, Y2, u2) k(Y1, u1, Y1, u1)^(−1/2) k(Y2, u2, Y2, u2)^(−1/2),

so that k̃(Y1, u1, Y1, u1) = 1 for any Y1 and u1.

There is a caveat in implementing these kernels. Although I use the same notations Y and Ỹ for both linear and affine kernels, they are different in computation. For linear kernels the Y and Ỹ are computed from the data assuming u = 0, whereas for affine kernels the Y and Ỹ are computed after removing the estimated mean u from the data.
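To make this concrete, here is a minimal NumPy sketch of the kernels in (6.23) and the spherization step (function names are my own; the orthonormalization uses an eigendecomposition of Y′Y, and for the affine kernels the inputs Y are assumed to be estimated from mean-removed data, as the caveat above requires):

```python
import numpy as np

def orthonormalize(Y):
    # Ytilde = Y (Y'Y)^(-1/2), the orthonormalization used in Eq. (6.23).
    w, V = np.linalg.eigh(Y.T @ Y)
    return Y @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)

def k_lin(Y1, Y2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Yt1.T @ Yt2 @ Yt2.T @ Yt1)

def k_aff(Y1, u1, Y2, u2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    D = Y1.shape[0]
    P1, P2 = np.eye(D) - Yt1 @ Yt1.T, np.eye(D) - Yt2 @ Yt2.T
    return k_lin(Y1, Y2) + u1 @ P1 @ P2 @ u2

def k_affsc(Y1, u1, Y2, u2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Y1 @ Yt1.T @ Yt2 @ Y2.T) + u1 @ u2

def spherize(k, args1, args2):
    # k_tilde(x1, x2) = k(x1, x2) / sqrt(k(x1, x1) k(x2, x2)), so that k_tilde(x, x) = 1.
    return k(*args1, *args2) / np.sqrt(k(*args1, *args1) * k(*args2, *args2))
```

For example, spherize(k_affsc, (Y1, u1), (Y2, u2)) gives the normalized affine scaled kernel, and spherize(k_lin, (Y1,), (Y2,)) the normalized linear kernel.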

6.3.4 Extension to nonlinear subspaces<br />

A systematic way of extending the Projection kernel from linear/affine subspaces to non-<br />

linear spaces is to use an implicit map via a kernel function as explained in Section 5.2.4,<br />

where the latter kernel is to be distinguished from the Grassmann kernels. Note that the pro-<br />

posed kernels (6.23) can be computed only from the inner products of the column vectors<br />

of Y ’s and u’s including the orthonormalization procedure. This ‘doubly kernel’ approach<br />

has already been proposed for the Binet-Cauchy kernel [90, 46] and for probabilistic dis-<br />

tances in general [96]. We can adopt the trick for the extended Projection kernels as well to<br />



extend the kernels to operate on nonlinear subspaces, which are the preimages corresponding to the linear subspaces in the RKHS via the feature map.
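A rough sketch of this idea for kLin, under the simplifying assumption that the feature vectors of each set themselves span the subspace in the RKHS (in practice one would take a kernel-PCA basis): with Ỹi = Φi(Φi′Φi)^(−1/2), the kernel reduces to Gram matrices only, kLin = ‖G11^(−1/2) G12 G22^(−1/2)‖²_F.

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    # An example base kernel kappa; any positive definite kernel can be substituted.
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-gamma * d2)

def inv_sqrt(G, eps=1e-10):
    w, V = np.linalg.eigh(G)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def k_lin_rkhs(X1, X2, kappa=rbf):
    # kLin between the feature-space spans of the two point sets, from Gram matrices alone.
    G11, G22, G12 = kappa(X1, X1), kappa(X2, X2), kappa(X1, X2)
    A = inv_sqrt(G11) @ G12 @ inv_sqrt(G22)      # = Ytilde1' Ytilde2 in the RKHS
    return np.sum(A * A)                          # tr(A A') = ||A||_F^2
```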

6.4 Experiments with synthetic data<br />

In this section I demonstrate the application of the extended Grassmann kernels to a two-<br />

class classification problem with Support Vector Machines (SVMs). Using synthetic data<br />

generated from an MFA distribution, I will compare the classification performance of lin-

ear/nonlinear SVMs in the original space with the performance of the SVM in the Grass-<br />

mann space.<br />

6.4.1 Synthetic data<br />

The kernels in equation 6.23 are defined under different assumptions of data distribution.<br />

To test the kernels I generate three types of synthetic data corresponding to the assumptions:<br />

(1) linear homogeneous MFA, (2) affine homogeneous MFA, and (3) affine scaled MFA.<br />

Selecting MFA<br />

For each type of data, I generate N = 100 Factor Analyzers in D = 5 dimensional Euclidean space. The i-th Factor Analyzer has the distribution pi(x) = N(ui, Ci), where the covariance is Ci = ỸiΛiỸi′ + σ²ID. The 5 × 2 orthonormal matrices Ỹi are randomly chosen from the uniform distribution on G(m, D). Refer to [1] for the definition of a uniform distribution on G(m, D). The ambient noise is chosen at σ = 0.1.

For type 2 and type 3 datasets, I generate the nonzero mean ui randomly from ui ∼ N(0, r²ID) for each Factor Analyzer. The r controls the spread of the Factor Analyzers. For the type 3 dataset, the covariance is additionally scaled via Ci = ỸiΛiỸi′ + σ²ID, where the elements of Λi = diag(λ1, ..., λm) are chosen i.i.d. from the uniform distribution on [0, 1].

The parameters for the datasets are summarized below:<br />

• Dataset 1: zero-mean (r = 0), homogeneous (λ1 = · · · = λm = 1)<br />

• Dataset 2: nonzero-mean (r = 0.2), homogeneous (λ1 = · · · = λm = 1)<br />

• Dataset 3: nonzero-mean (r = 0.2), scaled (0 ≤ λ ≤ 1)<br />
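A minimal NumPy sketch of this sampling procedure (function and parameter names are mine; the defaults follow the text: N = 100, D = 5, m = 2, σ = 0.1, r = 0.2, and n = 50 points per Factor Analyzer as described in Section 6.4.2):

```python
import numpy as np

def sample_mfa_sets(dataset=1, N=100, D=5, m=2, n=50, sigma=0.1, r=0.2, seed=0):
    # dataset 1: zero-mean homogeneous, 2: nonzero-mean homogeneous, 3: nonzero-mean scaled.
    rng = np.random.default_rng(seed)
    sets, params = [], []
    for _ in range(N):
        # The column space of a Gaussian matrix is uniformly distributed on G(m, D).
        Y, _ = np.linalg.qr(rng.standard_normal((D, m)))
        u = rng.normal(0.0, r, size=D) if dataset >= 2 else np.zeros(D)
        lam = rng.uniform(0.0, 1.0, size=m) if dataset == 3 else np.ones(m)
        C = Y @ np.diag(lam) @ Y.T + sigma**2 * np.eye(D)
        sets.append(rng.multivariate_normal(u, C, size=n))
        params.append((u, C))
    return sets, params
```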

Assigning class labels<br />

So far the distribution is chosen without classes. Since I am treating each distribution as a<br />

point in the space of distributions, the class label is assigned per distribution. The binary<br />

class labels are assigned as follows. I first choose a pair of distributions p+ and p− which

are the farthest apart from each other among all pairs of distributions. These p+ and p−<br />

serve as the two extreme points representing the positive and the negative distributions re-<br />

spectively. The labels of the other distributions are assigned subsequently from comparing<br />

their distances to the two extreme distributions:<br />

yi = 1 if JKL(pi, p+) < JKL(pi, p−), and yi = −1 otherwise.

The distances are measured by the KL distance JKL of the distributions. Empirically the<br />

number of positive distributions and the number of negative distributions were roughly<br />

balanced.<br />
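A sketch of this labeling step, using the closed-form KL divergence between Gaussians to build a symmetrized JKL (the exact normalization used in the thesis may differ, but constant factors do not change the labels; the (u, C) pairs are those returned by the sampling sketch above):

```python
import numpy as np
from itertools import combinations

def kl_gauss(u1, C1, u2, C2):
    # KL( N(u1, C1) || N(u2, C2) )
    D = len(u1)
    C2_inv = np.linalg.inv(C2)
    d = u2 - u1
    logdet = np.linalg.slogdet(C2)[1] - np.linalg.slogdet(C1)[1]
    return 0.5 * (np.trace(C2_inv @ C1) + d @ C2_inv @ d - D + logdet)

def j_kl(p1, p2):
    return kl_gauss(*p1, *p2) + kl_gauss(*p2, *p1)     # symmetrized KL distance

def assign_labels(params):
    # The two farthest-apart distributions serve as the positive/negative extremes ...
    i_plus, i_minus = max(combinations(range(len(params)), 2),
                          key=lambda ij: j_kl(params[ij[0]], params[ij[1]]))
    p_plus, p_minus = params[i_plus], params[i_minus]
    # ... and every distribution is labeled by the closer extreme.
    return np.array([1 if j_kl(p, p_plus) < j_kl(p, p_minus) else -1 for p in params])
```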

6.4.2 Algorithms<br />

I compare the performance of the Euclidean SVM with linear/polynomial/Gaussian RBF<br />

kernels and the performance of SVM with Grassmann kernels, similarly to the comparison<br />



in Section 5.3. To test the original SVMs, I randomly sample n = 50 points from each<br />

Factor Analyzer pi(x).<br />

I evaluate the algorithm with N-fold cross validation by holding out one set and training<br />

with the other N − 1 sets. The polynomial kernel used is k(x1, x2) = (⟨x1, x2⟩ + 1)³, and the RBF kernel used is k(x1, x2) = exp(−‖x1 − x2‖²/(2r²)), where the radius r is chosen to be one-fifth of the diameter of the data: r = 0.2 max_{ij} ‖xi − xj‖. For training the SVMs,

I use the public-domain software SVM-light [42] with default parameters.<br />
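For reference, the two Euclidean kernels and the radius heuristic look as follows (a sketch only; in the experiments SVM-light evaluates its own kernels, and the function names here are mine):

```python
import numpy as np

def poly_kernel(X1, X2):
    return (X1 @ X2.T + 1.0)**3

def rbf_kernel(X1, X2, X_all):
    # r = 0.2 * (diameter of the data), i.e. one-fifth of the largest pairwise distance.
    d2_all = ((X_all[:, None, :] - X_all[None, :, :])**2).sum(-1)
    r = 0.2 * np.sqrt(d2_all.max())
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * r**2))
```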

To test the Grassmann SVM, I first estimate the mean ui and the basis Yi from the same points used for the Euclidean SVM, although I could have improved the results by using the true parameters instead of the estimated ones. The Maximum Likelihood estimates of Yi, ui, and σ are given by the probabilistic PCA model [76] as follows. Let µ and S be the sample mean and covariance of the i-th set,

µ = (1/Ni) Σ_{j=1}^{Ni} xj,    S = (1/(Ni − 1)) Σ_{j=1}^{Ni} (xj − µ)(xj − µ)′.

Let

S = UΛU′

be the eigen-decomposition of the covariance matrix S, where U = [u1 · · · uD] contains the eigenvectors corresponding to the eigenvalues λ1 ≥ ... ≥ λD, and Λ = diag(λ1, ..., λD) is the diagonal matrix of eigenvalues. Then Yi and σ are estimated from

σ² = (1/(D − m)) Σ_{j=m+1}^{D} λj    (6.24)

Yi = Um(Λm − σ²I)^(1/2),    (6.25)

where Um is the first m columns of U and Λm is the m × m principal submatrix of Λ. The σ is estimated individually for each set of data, and I use the averaged value for all sets. An iterative and more accurate method of estimation is to use an EM approach [27], although it is not used here.
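A minimal NumPy sketch of this estimation step for one set X of size Ni × D (the function name and the use of numpy.linalg.eigh are my own choices):

```python
import numpy as np

def ppca_subspace(X, m):
    # ML estimates (u, Y, sigma^2) of the probabilistic PCA model, Eqs. (6.24)-(6.25).
    u = X.mean(axis=0)                              # estimated offset u_i
    S = np.cov(X, rowvar=False)                     # unbiased sample covariance
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                  # sort eigenvalues in decreasing order
    D = X.shape[1]
    sigma2 = lam[m:].mean()                         # Eq. (6.24)
    Y = U[:, :m] * np.sqrt(np.maximum(lam[:m] - sigma2, 0.0))   # Eq. (6.25)
    return u, Y, sigma2
```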

Table 6.1: Classification rates of the Euclidean SVMs and the Grassmann SVMs. The best rate for each dataset is highlighted by boldface.

            Euclidean                Grassmann               Probabilistic
            Linear   Poly    RBF     Lin     Aff     AffSc   BC      Bhat
Dataset 1   52.86    62.38   66.33   87.00   86.80   87.70   82.50   84.30
Dataset 2   62.30    64.45   65.74   76.90   82.00   83.10   70.90   72.50
Dataset 3   62.76    64.73   69.47   69.50   73.70   84.40   65.10   77.30

The σ is used for the Bhattacharyya kernel which requires nonzero ambient noise σ > 0<br />

in the covariance Ci = YiYi′ + σ²I.

The following five kernels are compared:

1. SVM with the original and the extended Projection kernels: kLin, kAff, kAffSc<br />

2. SVM with the Binet-Cauchy kernel: kBC(Y1, Y2) = (det Y1′Y2)² = det Y1′Y2Y2′Y1

3. SVM with the Bhattacharyya kernel: kBhat(p1, p2) = ∫ [p1(x) p2(x)]^(1/2) dx for Factor Analyzers.

I evaluate the algorithms with the leave-one-out test by holding out one subspace and train-<br />

ing with the other N − 1 subspaces. For training the SVMs, I use a Matlab code with a<br />

nonnegative QP solver.<br />
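A sketch of this evaluation loop with a generic precomputed-kernel SVM (the thesis uses SVM-light and a Matlab QP solver; scikit-learn's SVC is substituted here purely for illustration, and kernel_fn is any of the kernels above applied to (Y, u) pairs):

```python
import numpy as np
from sklearn.svm import SVC

def grassmann_svm_loo(kernel_fn, subspaces, labels):
    # subspaces: list of (Y_i, u_i); labels: array of +1/-1.
    N = len(subspaces)
    K = np.array([[kernel_fn(a, b) for b in subspaces] for a in subspaces])
    correct = 0
    for i in range(N):                               # leave-one-out over subspaces
        tr = [j for j in range(N) if j != i]
        clf = SVC(kernel="precomputed").fit(K[np.ix_(tr, tr)], labels[tr])
        correct += int(clf.predict(K[np.ix_([i], tr)])[0] == labels[i])
    return correct / N
```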

6.4.3 Results and discussion<br />

Table 6.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,<br />

averaged over 10 independent trials. The results show that the best rates are obtained from the affine scaled kernel, and the Euclidean kernels lag behind for all types of data. The inferiority of the Euclidean SVMs to the Grassmann SVMs can be ascribed to reasons similar to those discussed in Section 5.3.3.

For dataset 1, which has zero means, the linear SVMs degrade to the chance level (50%). The result agrees with the intuitive picture that any decision hyperplane that passes through the origin will roughly halve the positive and the negative classes. As expected, the linear kernel is inappropriate for dataset 2, which has nonzero offsets, whereas the affine and the affine scaled kernels perform well for both datasets 1 and 2. However, only the affine scaled kernel performs well for dataset 3. The Binet-Cauchy and the Bhattacharyya kernels perform close to the Projection kernels for dataset 1, but underperform for datasets 2 and 3. This result is expected since the Binet-Cauchy kernel does not take offsets or scales into consideration, and since the Bhattacharyya kernel is not adequate for MFA data, as I showed in the previous sections.

I conclude that the extended kernels have advantages over the original kernels and<br />

the Euclidean kernels for subspace-based classification problems when the data consist<br />

of affine and scaled subspaces instead of simple linear subspaces.<br />

6.5 Experiments with real-world data<br />

In this section I demonstrate the application of the extended Grassmann kernels to recog-<br />

nition problems with a kernel FDA. Using real image databases I compare the classification performance of the extended kernels and other previously used kernels.

6.5.1 Algorithms<br />

A baseline algorithm and the kernel FDA with the following kernels are compared:

1. Baseline: Euclidean FDA

2. KFDA with the original and the extended Projection kernels: kLin, kAff, kAffSc

3. KFDA with the Binet-Cauchy kernel

4. KFDA with the Bhattacharyya kernel

The subspace parameters are estimated from the data similarly to the experiments with<br />

synthetic data. I evaluate the algorithms with a leave-one-out test by holding out one sub-

space and training with the other N − 1 subspaces.<br />
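For reference, a minimal two-class kernel FDA in the style of Mika et al. [56], written for a precomputed kernel matrix (the experiments in this chapter use a multi-class variant; the regularizer mu and the helper names are my own choices):

```python
import numpy as np

def kfda_two_class(K, y, mu=1e-3):
    # K: n x n kernel matrix over training subspaces; y: binary labels in {0, 1}.
    n = K.shape[0]
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    m0, m1 = K[:, idx0].mean(axis=1), K[:, idx1].mean(axis=1)
    Nmat = np.zeros((n, n))
    for idx in (idx0, idx1):                       # within-class scatter in the RKHS
        Kc = K[:, idx]
        center = np.eye(len(idx)) - np.ones((len(idx), len(idx))) / len(idx)
        Nmat += Kc @ center @ Kc.T
    alpha = np.linalg.solve(Nmat + mu * np.eye(n), m1 - m0)
    threshold = 0.5 * (alpha @ (m0 + m1))
    # A test example with kernel column k_x is assigned class 1 if alpha @ k_x > threshold.
    return alpha, threshold
```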

6.5.2 Results and discussion<br />

The recognition rates for Yale Face, CMU-PIE, ETH-80, and IXMAS databases are given<br />

in Figures 6.5–6.8. I summarize the results as follows:<br />

1. The original and the extended Grassmann kernels outperform the Binet-Cauchy and<br />

the Bhattacharyya kernels, as well as the baseline method. The superiority of the<br />

Projection kernel to the Binet-Cauchy kernel and the Euclidean method is already<br />

demonstrated in Chapter 5.<br />

2. The Bhattacharyya kernel performs quite poorly, and becomes worse as the subspace<br />

dimension increases. One can verify that the kernel matrix from the data is close to<br />

an identity matrix and therefore carries little information about the data. A similar<br />

observation for the Binet-Cauchy kernel was already made in Chapter 5.<br />

3. In the Yale Face and ETH-80 databases the affine scaled kernel and the affine kernel achieve the best rates, respectively (see Figures 6.5 and 6.7). In the CMU-PIE and IXMAS databases the rates of the affine kernel follow the rates of the linear kernel closely. The rates of the affine scaled kernel fall behind those two, but the differences are small compared to the rates achieved by the rest of the methods. Compared with the experimental result from the synthetic data, the result with the real data does not conclusively show either the advantage or the disadvantage of using the extended kernels. Since the extended kernels generalize the original Projection kernel, the comparable performances can be interpreted as the linear subspace assumption being valid for the real image databases I used.

6.6 Conclusion<br />

In this chapter, I showed the relationship between probabilistic distances and the Projection<br />

kernel using a probabilistic model of subspaces. This analysis provides generalizations of<br />

the Projection kernel to affine and scaled subspaces. The relaxation of the linear subspace assumption allows us to accommodate more complex data structures which diverge from the ideal linear subspace assumption.

As demonstrated with the synthetic data, the mean and the scales within subspaces may carry important statistics of the data and can make a large difference in classification tasks. However, whether this information is useful for the real databases was not conclusive from the experiments, since the difference between the original and the extended kernels was small. Nevertheless, the original and the extended kernels showed consistently better performance than the Euclidean method and the kernel methods with the Binet-Cauchy and the Bhattacharyya kernels.



Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00
Lin     96.77   95.70   98.57   99.28   98.92   98.57   97.85   96.77   97.13
Aff     94.62   96.06   98.92   98.92   98.57   97.85   96.77   97.13   97.49
AffSc   96.77   98.21   99.28   99.28   99.28   99.28   99.28   99.28   99.28
BC      96.77   95.34   96.77   96.42   83.87   72.76   55.20   48.03   44.09
Bhat    97.13   98.21   94.27   85.30   65.23   60.93   53.76   48.39   44.44

Figure 6.5: Yale Face Database: face recognition rates from various kernels. The two highest rates including ties are highlighted with boldface for each subspace dimension m.


Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    60.73   60.73   60.73   60.73   60.73   60.73   60.73   60.73   60.73
Lin     88.27   74.84   89.77   87.21   91.68   92.54   93.82   93.60   95.31
Aff     72.28   86.99   85.50   91.26   91.26   92.32   92.54   94.46   94.67
AffSc   83.16   85.29   85.29   85.29   85.29   85.29   85.29   85.29   85.29
BC      88.27   71.43   82.52   64.82   58.64   47.55   43.07   39.87   36.25
Bhat    83.37   44.78   39.45   36.25   31.98   28.78   26.23   23.88   20.47

Figure 6.6: CMU-PIE Database: face recognition rates from various kernels.


Categorization rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00
Lin     88.75   88.75   93.75   96.25   96.25   95.00   93.75   96.25   96.25
Aff     88.75   92.50   96.25   96.25   95.00   93.75   96.25   96.25   97.50
AffSc   91.25   90.00   91.25   91.25   91.25   91.25   92.50   91.25   91.25
BC      88.75   86.25   90.00   81.25   72.50   63.75   51.25   41.25   48.75
Bhat    91.25   90.00   88.75   87.50   86.25   83.75   81.25   81.25   80.00

Figure 6.7: ETH-80 Database: object categorization rates from various kernels.


Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5
Eucl    54.87   54.87   54.87   54.87   54.87
Lin     69.09   80.30   84.55   84.24   85.15
Aff     73.94   81.52   82.73   82.42   80.91
AffSc   72.73   78.48   80.30   80.30   80.30
BC      69.09   60.00   50.91   36.36   25.15
Bhat    64.24   59.39   51.52   42.12   23.64

Figure 6.8: IXMAS Database: action recognition rates from various kernels.


Chapter 7<br />

CONCLUSION<br />

In this chapter I summarize the work presented in this thesis and discuss the future work<br />

related to the proposed methods.<br />

7.1 Summary<br />

In this thesis I proposed the subspace-based approach for solving novel learning problems<br />

using the Grassmann kernels. Below I summarize the progress that has been made in each<br />

chapter with regard to this goal.<br />

• In Chapter 3, I proposed the paradigm of subspace-based learning which exploits in-<br />

herent linear structures in data to solve novel learning problems. The rationale behind<br />

this approach was explained and exemplified with well-known image databases.<br />

• In Chapter 4, I introduced the Grassmann manifold as a common framework for<br />

subspace-based learning, and reviewed the geometry of the space. Various distances<br />

on the Grassmann manifold were reviewed and analyzed in depth, and were com-<br />

pared to each other by classification tests.<br />



• In Chapter 5, I proposed the Projection kernel and its application to discriminant<br />

analysis for subspaces-based classification problems. In classification tests with the<br />

image database the proposed method showed better performance than other previ-<br />

ously used discrimination methods as well as the Euclidean method.<br />

• In Chapter 6, I presented formal analyses of the relationship between probabilistic<br />

distances and the Grassmann kernels. Based on the analyses, I broadened the domain<br />

of subspace-based learning from linear to affine and scaled subspaces, and presented<br />

the extended kernels. The extended kernels performed competitively with synthetic<br />

and real data and showed potentials for the extended domains.<br />

7.2 Future work<br />

In this section I discuss the research directions to which I plan to expand the present work.<br />

7.2.1 Theory<br />

In this work I utilized geometric properties of the Grassmann manifolds as a framework for<br />

subspace-based learning problems. However, there are other aspects of this manifold that I<br />

did not consider in this thesis. These are briefly reviewed below.<br />

• Riemannian aspect of Grassmann manifolds: As discussed in Chapter 4, the Grass-<br />

mann manifold can be derived as a homogeneous space of orthogonal groups [86,<br />

13]:<br />

G(m, D) = O(D)/O(m) × O(D − m).<br />

This definition also induces a Riemannian geometry of the space which plays an<br />

important role for optimization techniques on the manifold, such as Newton's method or the conjugate gradient method [20, 2]. The geometry of the Grassmann manifold



through positive definite kernels is more general than the strict Riemannian geometry<br />

from the definition above. I plan to explore the applicability of the proposed kernels<br />

for optimization problems as well.<br />

• Probabilistic aspect of Grassmann manifolds: The kernel approach to a learning<br />

problem is basically deterministic. Although I used a probabilistic model of sub-<br />

spaces in Chapter 6, the MFA model is a distribution in the usual Euclidean space.<br />

There are intrinsic definitions of probability distributions on the Grassmann mani-<br />

fold, such as the uniform distribution [1] and the matrix Langevin distribution [16].<br />

Statistical inference and estimation on the Grassmann manifold is a largely unex-<br />

plored topic with the exception of the pioneering work of Chikuse [16].<br />

• Existence of other Grassmann kernels: There is an important technical question that<br />

remains to be answered: are there more Grassmann kernels that are fundamentally<br />

different from the Projection or the Binet-Cauchy kernels? As I remarked in Chap-<br />

ter 6, an inductive approach to the discovery of new kernels is to examine the limits of<br />

other probabilistic distances that were not analyzed in this thesis. On the other hand,<br />

a deductive approach is to adapt the characterization of positive definite kernels on<br />

Hilbert spaces, n-spheres and semigroups [65, 67, 66, 9] to the case of the Stiefel<br />

and the Grassmann manifolds. In relation to the latter approach, a study involving<br />

a functional integration over orthogonal groups is currently under examination, and<br />

will be reported in the future.<br />

7.2.2 Applications<br />

The proposed Grassmann kernels are general building blocks for many applications. Those<br />

kernels can be used for an arbitrary learning task whether it is a supervised or an unsuper-<br />

vised learning problem. They can also be used with any kernel methods among which I<br />



used the SVM and the FDA for demonstrations. Moreover, the applications are not limited<br />

to image data, since a subspace is a familiar notion for any vectorial data.<br />

On the other hand, the proposed kernel methods may not outperform the state-of-the-<br />

art algorithms which are dedicated to specific tasks and utilize domain-specific knowledge.<br />

I remark below on the limitations and possible improvements of the proposed method in<br />

several application-specific aspects.

• Intensity- vs feature-based representation: In this thesis I used the intensity repre-<br />

sentation of an image and relied on the low-dimensional character of pose subspaces<br />

or illumination subspaces to address the invariant recognition problem. However, a<br />

compelling alternative for recognition is to use feature-based representation of faces<br />

and objects, such as SIFT features which are already invariant to pose or illumina-<br />

tion variations [54]. It is unknown whether we can find subspace structures with<br />

the feature-based representation of images. However, the Bhattacharyya kernel was<br />

originally demonstrated with the bag-of-feature representation [40], which suggests<br />

the application of our method to such representations.<br />

• Dynamical models for sequence data: The proposed method used the observability<br />

subspace representation of the video sequences. However, there are multiple steps

involved in processing a video sequence into a dynamical system representation, and<br />

each step relies on heuristic choices. The recognition can be improved solely with a<br />

clever preprocessing of the sequences without the use of dynamical models [89]. In<br />

relation to the dynamical system approach in general, Vishwanathan et al. proposed a variety of other kernels for dynamical systems [84] in addition to the Binet-Cauchy kernel used in the thesis. In fact the authors defined the Binet-Cauchy kernel for Fredholm operators, which are much more general in scope. The subspace model from the

observability matrix is but one approach to characterizing dynamical systems, and I<br />



am currently investigating other subspace models that emphasize different aspects of<br />

dynamical systems.<br />

• Handling unorganized data: The image databases I used in this work have factorized<br />

structures. For example, each image in the Yale Face database is labeled in terms<br />

of (person, pose, illumination). If we do not know the labels other than the person<br />

and have to estimate the subspaces from a clutter of data points, this estimation prob-<br />

lem itself becomes a separate problem that warrants research efforts [27, 37, 81].<br />

However, it is out of the scope of the proposed subspace-based framework.<br />

Furthermore, I mentioned a caveat in Chapter 3 that I assumed that the test data for<br />

image databases are not single images but subspaces. The assumption can potentially<br />

limit the applicability of the subspace-based approach in conventional problem set-<br />

tings. However, this limitation is to be understood as a tradeoff between data struc-<br />

ture flexibility and the strength of methods. If one is to use more powerful kernel<br />

methods, one needs more structured data.<br />



Bibliography<br />

[1] P. Absil, A. Edelman, and P. Koev. On the largest principal angle between random<br />

subspaces. Linear Algebra and its Applications, 414(1):288–294, 2006.<br />

[2] P. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann mani-<br />

folds with a view on algorithmic computation. Acta Applicandae Mathematicae: An<br />

International Survey Journal on Applying Mathematics and Mathematical Applica-<br />

tions, 80(2):199–220, 2004.<br />

[3] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search with<br />

applications to pattern recognition. In Proceedings of the Conference on Computer<br />

Vision and Pattern Analysis. IEEE Computer Society, June 2007.<br />

[4] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans-<br />

actions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.<br />

[5] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach.<br />

Neural Computation, 12(10):2385–2404, 2000.<br />

[6] M. Baumann and U. Helmke. Riemannian subspace tracking algorithms on Grass-<br />

mann manifolds. In Proceedings of the IEEE Conference on Decision and Control,<br />

pages 4731–4736, 12-14 Dec. 2007.<br />



[7] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recogni-<br />

tion using class specific linear projection. In Proceedings of the European Conference<br />

on Computer Vision-Volume I, pages 45–58, London, UK, 1996. Springer-Verlag.<br />

[8] P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object un-<br />

der all possible illumination conditions? International Journal of Computer Vision,<br />

28(3):245–260, 1998.<br />

[9] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups.<br />

Springer, Berlin, 1984.<br />

[10] A. Bhattacharyya. On a measure of divergence between two statistical populations<br />

defined by their probability distributions. Bulletin of Calcutta Mathematical Society,<br />

Vol. 49, pages 214–224, 1943.<br />

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin<br />

classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning<br />

Theory, pages 144–152, New York, NY, USA, 1992. ACM.<br />

[12] M. Bressan and J. Vitrià. Nonparametric discriminant analysis and nearest neighbor<br />

classification. Pattern Recognition Letters, 24:2743–2749, 2003.<br />

[13] R. Carter, G. Segal, and I. Macdonald. Lectures on Lie Groups and Lie Algebras, Lon-<br />

don Mathematical Society Student Texts. Cambridge University Press, Cambridge,<br />

UK, 1995.<br />

[14] J.-M. Chang, J. R. Beveridge, B. A. Draper, M. Kirby, H. Kley, and C. Peterson.<br />

Illumination face spaces are idiosyncratic. In Proceedings of the International Con-<br />

ference on Image Processing, Computer Vision, and Pattern Recognition, pages 390–<br />

396, 2006.<br />



[15] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on<br />

the sum of observations. Annals of Mathematical Statistics, pages 493–507, 1952.<br />

[16] Y. Chikuse. Statistics on special manifolds, Lecture Notes in Statistics, vol. 174.<br />

Springer-Verlag, New York, 2003.<br />

[17] K. D. Cock and B. D. Moor. Subspace angles between ARMA models. Systems<br />

Control Lett. 46 (4), pages 265–270, July 2002.<br />

[18] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines: and<br />

other kernel-based learning methods. Cambridge University Press, New York, NY,<br />

USA, 2000.<br />

[19] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International<br />

Journal of Computer Vision, 51(2):91–109, 2003.<br />

[20] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthog-<br />

onality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–<br />

353, 1999.<br />

[21] R. Epstein, P. Hallinan, and A. Yuille. 5 ± 2 Eigenimages suffice: An empirical<br />

investigation of low-dimensional lighting models. In Proceedings of IEEE Workshop<br />

on Physics-Based Modeling in Computer Vision, pages 108–116, 1995.<br />

[22] B. S. Everitt. An introduction to latent variable models. Chapman and Hall, London,<br />

1984.<br />

[23] A. Faragó, T. Linder, and G. Lugosi. Fast nearest-neighbor search in dissimilarity<br />

spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):957–<br />

962, 1993.<br />



[24] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for<br />

robot vision. In International Symposium of Robotics Research, pages 192–201, 2003.<br />

[25] K. Fukunaga. Introduction to statistical pattern recognition (2nd ed.). Academic<br />

Press Professional, Inc., San Diego, CA, USA, 1990.<br />

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illu-<br />

mination cone models for face recognition under variable lighting and pose. IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.<br />

[27] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyz-<br />

ers. Technical Report CRG-TR-96-1, Department of Computer Science, University<br />

of Toronto, 21 1996.<br />

[28] G. H. Golub and C. F. V. Loan. Matrix computations (3rd ed.). Johns Hopkins<br />

University Press, Baltimore, MD, USA, 1996.<br />

[29] R. Gross, I. Matthews, and S. Baker. Eigen light-fields and face recognition across<br />

pose. In Proceedings of the IEEE International Conference on Automatic Face and<br />

Gesture Recognition, page 3, Washington, DC, USA, 2002. IEEE Computer Society.<br />

[30] R. Gross, I. Matthews, and S. Baker. Fisher light-fields for face recognition across<br />

pose and illumination. In Proceedings of the 24th DAGM Symposium on Pattern<br />

Recognition, pages 481–489, London, UK, 2002. Springer-Verlag.<br />

[31] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005.<br />

[32] P. Hallinan. A low-dimensional representation of human faces for arbitrary light-<br />

ing conditions. In Proceedings of the Conference on Computer Vision and Pattern<br />

Recognition, pages 995–999, 1994.<br />



[33] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on<br />

subspace-based learning. In Proceedings of the International Conference on Machine<br />

Learning, 2008.<br />

[34] J. Hamm and D. D. Lee. Learning a warped subspace model of faces with images<br />

of unknown pose and illumination. In International Conference on Computer Vision<br />

Theory and Applications, pages 219–226, 2008.<br />

[35] M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric<br />

spaces. Journal of Computer and System Sciences, 71(3):333–359, 2005.<br />

[36] O. Henkel. Sphere packing bounds in the Grassmann and Stiefel manifolds. IEEE<br />

Transactions on Information Theory, 51:3445, 2005.<br />

[37] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances<br />

of objects under varying illumination conditions. Proceedings of the Conference on<br />

Computer Vision and Pattern Analysis, 01:11, 2003.<br />

[38] R. A. Horn and C. A. Johnson. Matrix analysis. Cambridge University Press, Cam-<br />

bridge, UK, 1985.<br />

[39] H. Hotelling. Relations between two sets of variates. Biometrika 28, pages 321–372,<br />

1936.<br />

[40] T. Jebara and R. I. Kondor. Bhattacharyya and expected likelihood kernels. In Pro-<br />

ceeding of the Annual Conference on Learning Theory, pages 57–71, 2003.<br />

[41] T. Joachims. Text categorization with Support Vector Machines: Learning with many<br />

relevant features. In Proceedings of the European Conference on Machine Learning,<br />

pages 137 – 142, Berlin, 1998. Springer.<br />



[42] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges,<br />

and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chap-<br />

ter 11, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.<br />

[43] T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image<br />

set classes using canonical correlations. IEEE Transactions on Pattern Analysis and<br />

Machine Intelligence, 29(6):1005–1018, 2007.<br />

[44] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the<br />

characterization of human faces. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 12(1):103–108, 1990.<br />

[45] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures.<br />

In Proceedings of the International Conference on Machine Learning, 2002.<br />

[46] R. I. Kondor and T. Jebara. A kernel between sets of vectors. In Proceedings of the<br />

International Conference on Machine Learning, pages 361–368, 2003.<br />

[47] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathemat-<br />

ical Statistics, 22(1):79–86, 1951.<br />

[48] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object<br />

categorization. In Proceedings of the Conference on Computer Vision and Pattern<br />

Analysis, page 409, Los Alamitos, CA, USA, 2003. IEEE Computer Society.<br />

[49] C. Leslie, E. Eskin, and W. Noble. Mismatch string kernels for SVM protein classi-<br />

fication. In Advances in Neural Information Processing Systems, pages 1441–1448,<br />

2003.<br />



[50] D. Lin, S. Yan, and X. Tang. Pursuing informative projection on Grassmann manifold.<br />

In Proceedings of the Conference on Computer Vision and Pattern Analysis, pages<br />

1727–1734, Washington, DC, USA, 2006. IEEE Computer Society.<br />

[51] X. Liu, A. Srivastava, and K. Gallivan. Optimal linear representations of images for<br />

object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

26(5):662–666, 2004.<br />

[52] L. Ljung. System identification: theory for the user. Prentice-Hall, Inc., Upper Saddle<br />

River, NJ, USA, 1986.<br />

[53] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text clas-<br />

sification using string kernels. Journal of Machine Learning Research, 2:419–444,<br />

2002.<br />

[54] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International<br />

Journal of Computer Vision, 60(2):91–110, 2004.<br />

[55] R. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing,<br />

48(4):1164–1170, Apr 2000.<br />

[56] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant<br />

analysis with kernels. In Y. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural<br />

Networks for Signal Processing IX, pages 41–48. IEEE, 1999.<br />

[57] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K. Müller. Construct-<br />

ing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel<br />

feature spaces. IEEE Transactions on Patterns Analysis and Machine Intelligence,<br />

25(5):623–627, May 2003.<br />



[58] K. Müller, S. Mika, G. Rätsch, S. Tsuda, and B. Schölkopf. An introduction to kernel-<br />

based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–202,<br />

2001.<br />

[59] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels.<br />

In Proceedings of the International Conference on Machine Learning, page 81, New<br />

York, NY, USA, 2004. ACM.<br />

[60] E. Pekalska, P. Paclik, and R. P. W. Duin. A generalized kernel approach to<br />

dissimilarity-based classification. Journal of Machine Learning Research, 2:175–<br />

211, 2002.<br />

[61] R. Ramamoorthi. Analytic PCA construction for theoretical analysis of lighting vari-<br />

ability in images of a Lambertian object. IEEE Transactions on Pattern Analysis and<br />

Machine Intelligence, 24(10):1322–1333, 2002.<br />

[62] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irra-<br />

diance: determining the illumination from images of a convex Lambertian object.<br />

Journal of the Optical Society of America A, 18(10):2448–2459, 2001.<br />

[63] A. Rényi. On measures of information and entropy. In Proceedings of the 4th Berkeley<br />

Symposium on Mathematics, Statistics and Probability, pages 547–561, 1960.<br />

[64] H. Sakano and N. Mukawa. Kernel mutual subspace method for robust facial image<br />

recognition. In Proceedings of the International Conference on Knowledge-Based<br />

Intelligent Engineering Systems and Allied Technologies, volume 1, pages 245–248,<br />

2000.<br />

[65] I. J. Schoenberg. Remarks to Maurice Frechet's article ... Annals of Mathematics,<br />

36(3):724–732, 1935.<br />



[66] I. J. Schoenberg. Metric spaces and completely monotone functions. Annal of Math-<br />

ematics, 39(4):811–841, 1938.<br />

[67] I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the<br />

American Mathematical Society, 44:522–536, 1938.<br />

[68] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel<br />

eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.<br />

[69] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,<br />

Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.<br />

[70] G. Shakhnarovich, I. John W. Fisher, and T. Darrell. Face recognition from long-term<br />

observations. In Proceedings of the European Conference on Computer Vision, pages<br />

851–868, London, UK, 2002.<br />

[71] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge<br />

University Press, New York, NY, USA, 2004.<br />

[72] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression<br />

(PIE) database. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

25(12):1615 – 1618, December 2003.<br />

[73] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of<br />

human faces. Journal of the Optical Society of America A, 4(3):519–524, 1987.<br />

[74] A. Srivastava. A Bayesian approach to geometric subspace estimation. IEEE Trans-<br />

actions on Signal Processing, 48(5):1390–1400, May 2000.<br />

[75] E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. In Proceed-<br />

ings of the Annual Workshop on Computational Learning Theory. ACM, 2002.<br />



[76] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal<br />

Of The Royal Statistical Society Series B, 61(3):611–622, 1999.<br />

[77] F. Topsøe. Some inequalities for information divergence and related measures of<br />

discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.<br />

[78] P. Turaga, A. Veeraraghavan, and R. Chellappa. Statistical analysis on Stiefel and<br />

Grassmann manifolds with applications in computer vision. In Proceedings of the<br />

Conference on Computer Vision and Pattern Analysis, 2008.<br />

[79] M. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-<br />

science, 3(1):71–86, 1991.<br />

[80] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa. Matching shape se-<br />

quences in video with applications in human movement analysis. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 27(12):1896–1909, 2005.<br />

[81] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA).<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945–1959,<br />

2005.<br />

[82] S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. Advances<br />

in Neural Information Processing Systems, 15, 2003.<br />

[83] S. Vishwanathan and A. J. Smola. Binet-Cauchy kernels. In Advances in Neural<br />

Information Processing Systems, 2004.<br />

[84] S. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical<br />

systems and its application to the analysis of dynamic scenes. International Journal<br />

of Computer Vision, 73(1):95–119, 2007.<br />



[85] L. Wang, X. Wang, and J. Feng. Subspace distance analysis with application to adap-<br />

tive Bayesian algorithm for face recognition. Pattern Recognition, 39(3):456–464,<br />

2006.<br />

[86] F. Warner. Foundations of differentiable manifolds and Lie groups. Springer-Verlag,<br />

New York, 1983.<br />

[87] C. Watkins. Kernels from matching operations. Technical Report CSD-TR-98-07,<br />

Department of Computer Science, Royal Holloway College, 1999.<br />

[88] C. Watkins. Dynamic alignment kernels. In A. Smola and P. Bartlett, editors, Ad-<br />

vances in Large Margin Classifiers, chapter 3, pages 39–50. MIT Press, Cambridge,<br />

MA, USA, 2000.<br />

[89] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using<br />

motion history volumes. Computer Vision and Image Understanding, 104(2):249–<br />

257, 2006.<br />

[90] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of<br />

Machine Learning Research, 4:913–931, 2003.<br />

[91] Y.-C. Wong. Differential geometry of Grassmann manifolds. Proceedings of the<br />

National Academy of Science, Vol. 57, pages 589–594, 1967.<br />

[92] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image se-<br />

quence. In Proceedings of the International Conference on Face and Gesture Recog-<br />

nition, page 318, Washington, DC, USA, 1998. IEEE Computer Society.<br />

[93] J. Ye and T. Xiong. Null space versus orthogonal linear discriminant analysis. In<br />

Proceedings of the International Conference on Machine Learning, pages 1073–1080,<br />

New York, NY, USA, 2006. ACM.<br />



[94] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur. Determining generative<br />

models of objects under varying illumination: Shape and albedo from multiple images<br />

using SVD and integrability. International Journal of Computer Vision, 35(3):203–<br />

222, 1999.<br />

[95] S. K. Zhou and R. Chellappa. Illuminating light field: Image-based face recognition<br />

across illuminations and poses. In Proceedings of the IEEE International Conference<br />

on Automatic Face and Gesture Recognition, pages 229–234, 2004.<br />

[96] S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Proba-<br />

bilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on<br />

Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.<br />

