SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS<br />
Jihun Hamm<br />
A DISSERTATION<br />
in<br />
Electrical and Systems Engineering<br />
Presented to the Faculties of the University of Pennsylvania<br />
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy<br />
2008<br />
Supervisor of Dissertation<br />
Graduate Group Chair
COPYRIGHT<br />
Jihun Hamm<br />
2008
Acknowledgements<br />
I deeply thank my advisor Dr. Daniel D. Lee for so many things. Besides providing financial and moral support for my graduate study, Daniel initiated me into the field of machine learning, which I barely knew about before working with him. From him I learned how to tackle problems from the ground up and stay fresh-minded. The greatest influence Daniel had on me was his energy and passion toward the goal. Being near him kept me and my colleagues stimulated and energized, and helped us endure some tough periods during the Ph.D. process.<br />
I thank Dr. Lawrence Saul for inspiring me to follow my interest in manifold learning. During my early years working with him, his knowledge of and intuition for the subject strongly shaped my approach to learning problems.<br />
I am also grateful to the other professors who served on my thesis committee: Dr. Ali Jadbabaie, Dr. Jianbo Shi, Dr. Ben Taskar, and Dr. Ragini Verma. They provided valuable feedback to polish the thesis. Dr. Jean Gallier provided me with guidance on mathematical issues before and during the writing of the thesis.<br />
I appreciate the support from my colleagues, especially my lab members: Dan Huang, Yuanqing Lin, Yung-kyun Noh, and Paul Vernaza. Besides sharing an enthusiasm for research, we shared many fun and sometimes stressful moments of daily life as graduate students. It has always been a pleasure to discuss any problem with Yung-kyun, who was also kind enough to read through the draft of the thesis and give suggestions.<br />
Lastly, I thank my parents and my family for being who they are, and for understanding<br />
my excuses for not talking to them more often. My wife Sophia has always been by my<br />
side, and I cannot thank her enough for that.<br />
ABSTRACT<br />
SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS<br />
Jihun Hamm<br />
Supervisor: Prof. Daniel D. Lee<br />
In this thesis I propose a subspace-based learning paradigm for solving novel problems<br />
in machine learning. We often encounter subspace structures within data that lie inside a<br />
vector space. For example, the set of images of an object or a face under varying lighting conditions is known to lie on a low (4- or 9-)dimensional subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be approximated quite well by low-dimensional subspaces. Treating such subspaces as basic units of learning gives rise to challenges that conventional algorithms cannot handle well.<br />
In this work, I tackle subspace-based learning problems with the unifying framework of the Grassmann manifold, the set of linear subspaces of a Euclidean space. I propose positive definite kernels on this space, which provide easy access to the repository of kernel algorithms. Furthermore, I show that the Grassmann kernels can be extended to the sets of affine and scaled subspaces. This extension allows us to handle larger classes of problems at little additional cost.<br />
The kernels proposed in this thesis can be used with any kernel method. In particular, I demonstrate their potential advantages with Discriminant Analysis techniques and Support Vector Machines on recognition and categorization tasks. Experiments with real image databases show not only the feasibility of the proposed framework but also its improved performance compared with previously known methods.<br />
Contents<br />
1 INTRODUCTION 1<br />
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1<br />
1.2 Contributions and related work . . . . . . . . . . . . . . . . . . . . . . . . 3<br />
1.3 Organization of the paper . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
2 BACKGROUND 7<br />
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />
2.2 Kernel machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />
2.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />
2.2.2 Mercer kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9<br />
2.2.3 Reproducing Kernel Hilbert Space . . . . . . . . . . . . . . . . . . 12<br />
2.2.4 Examples of kernels . . . . . . . . . . . . . . . . . . . . . . . . . 15<br />
2.2.5 Generating new kernels from old kernels . . . . . . . . . . . . . . 16<br />
2.2.6 Distance and conditionally positive definite kernels . . . . . . . . . 17<br />
2.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
2.3.1 Large margin linear classifier . . . . . . . . . . . . . . . . . . . . . 20<br />
2.3.2 Dual problem and support vectors . . . . . . . . . . . . . . . . . . 21<br />
2.3.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
2.3.4 Generalization error and overfitting . . . . . . . . . . . . . . . . . 24<br />
2.4 Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
2.4.1 Fisher Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . 26<br />
2.4.2 Nonparametric Discriminant Analysis . . . . . . . . . . . . . . . . 26<br />
2.4.3 Discriminant analysis in high-dimensional spaces . . . . . . . . . . 27<br />
2.4.4 Extension to nonlinear discriminant analysis . . . . . . . . . . . . 28<br />
3 MOTIVATION: SUBSPACE STRUCTURE IN DATA 30<br />
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30<br />
3.2 Illumination subspaces in multi-lighting images . . . . . . . . . . . . . . . 31<br />
3.3 Pose subspaces in multi-view images . . . . . . . . . . . . . . . . . . . . . 40<br />
3.4 Video sequences of human motions . . . . . . . . . . . . . . . . . . . . . 44<br />
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
4 GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES 54<br />
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
4.2 Stiefel and Grassmann manifolds . . . . . . . . . . . . . . . . . . . . . . . 55<br />
4.2.1 Stiefel manifold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55<br />
4.2.2 Grassmann manifold . . . . . . . . . . . . . . . . . . . . . . . . . 57<br />
4.2.3 Principal angles and canonical correlations . . . . . . . . . . . . . 59<br />
4.3 Grassmann distances for subspaces . . . . . . . . . . . . . . . . . . . . . . 60<br />
4.3.1 Projection distance . . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />
4.3.2 Binet-Cauchy distance . . . . . . . . . . . . . . . . . . . . . . . . 62<br />
4.3.3 Max Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 63<br />
4.3.4 Min Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64<br />
4.3.5 Procrustes distance . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
4.3.6 Comparison of the distances . . . . . . . . . . . . . . . . . . . . . 68<br />
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70<br />
4.4.1 Experimental setting . . . . . . . . . . . . . . . . . . . . . . . . . 70<br />
4.4.2 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72<br />
5 GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS 77<br />
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77<br />
5.2 Kernel functions for subspaces . . . . . . . . . . . . . . . . . . . . . . . . 78<br />
5.2.1 Projection kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 79<br />
5.2.2 Binet-Cauchy kernel . . . . . . . . . . . . . . . . . . . . . . . . . 81<br />
5.2.3 Indefinite kernels from other metrics . . . . . . . . . . . . . . . . . 83<br />
5.2.4 Extension to nonlinear subspaces . . . . . . . . . . . . . . . . . . 83<br />
5.3 Experiments with synthetic data . . . . . . . . . . . . . . . . . . . . . . . 85<br />
5.3.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86<br />
5.3.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87<br />
5.3.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
5.4 Discriminant Analysis of subspace . . . . . . . . . . . . . . . . . . . . . . 90<br />
5.4.1 Grassmann Discriminant Analysis . . . . . . . . . . . . . . . . . . 90<br />
5.4.2 Mutual Subspace Method (MSM) . . . . . . . . . . . . . . . . . . 91<br />
5.4.3 Constrained MSM (cMSM) . . . . . . . . . . . . . . . . . . . . . 92<br />
5.4.4 Discriminant Analysis of Canonical Correlations (DCC) . . . . . . 92<br />
5.5 Experiments with real-world data . . . . . . . . . . . . . . . . . . . . . . . 93<br />
5.5.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93<br />
5.5.2 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94<br />
6 EXTENDED GRASSMANN KERNELS AND PROBABILISTIC DISTANCES 100<br />
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100<br />
6.2 Analysis of probabilistic distances and kernels . . . . . . . . . . . . . . . . 101<br />
6.2.1 Probabilistic distances and kernels . . . . . . . . . . . . . . . . . . 101<br />
6.2.2 Data as Mixture of Factor Analyzers . . . . . . . . . . . . . . . . . 103<br />
6.2.3 Analysis of KL distance . . . . . . . . . . . . . . . . . . . . . . . 105<br />
6.2.4 Analysis of Probability Product Kernel . . . . . . . . . . . . . . . 107<br />
6.3 Extended Grassmann Kernel . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
6.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110<br />
6.3.2 Extension to affine subspaces . . . . . . . . . . . . . . . . . . . . 112<br />
6.3.3 Extension to scaled subspaces . . . . . . . . . . . . . . . . . . . . 118<br />
6.3.4 Extension to nonlinear subspaces . . . . . . . . . . . . . . . . . . 122<br />
6.4 Experiments with synthetic data . . . . . . . . . . . . . . . . . . . . . . . 123<br />
6.4.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />
6.4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />
6.4.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 126<br />
6.5 Experiments with real-world data . . . . . . . . . . . . . . . . . . . . . . . 127<br />
6.5.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127<br />
6.5.2 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />
7 CONCLUSION 134<br />
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134<br />
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135<br />
7.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135<br />
7.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136<br />
Bibliography 139<br />
List of Tables<br />
4.1 Summary of the Grassmann distances. The distances can be defined as<br />
simple functions of both the basis Y and the principal angles θi except for<br />
the arc-length which involves matrix exponentials. . . . . . . . . . . . . . 69<br />
5.1 Classification rates of the Euclidean SVMs and the Grassmannian SVMs.<br />
The best rate for each dataset is highlighted by boldface. . . . . . . . . . . 89<br />
6.1 Classification rates of the Euclidean SVMs and the Grassmann SVMs. The<br />
best rate for each dataset is highlighted by boldface. . . . . . . . . . . . . 126<br />
List of Figures<br />
2.1 Classification in the input space (Left) vs. a feature space (Right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map φ : R^2 → R^3, (x1, x2)′ ↦→ (x1^2, √2 x1x2, x2^2)′, which maps the elliptical decision boundary to a hyperplane. This illustration was taken from Schölkopf’s tutorial slides given at the workshop “The Analysis of Patterns”, Erice, Italy, 2005. . . . 9<br />
2.2 Example of classifying two-class data with a hyperplane 〈w, x〉 + b = 0. In this case the data can be separated without error. This illustration was taken from Schölkopf’s tutorial slides given at the workshop “The Analysis of Patterns”, Erice, Italy, 2005. . . . 21<br />
2.3 The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto the one-dimensional directions denoted by the arrows. Projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas projection in the Fisher direction yields the least overlapping, and therefore most discriminant, one-dimensional distributions. This illustration was taken from [58]. . . . 25<br />
3.1 The figure shows the first five principal components of a face, computed analytically from a 3D model (Top) and a sphere (Bottom). These images match well with the empirical principal components computed from a set of real images. The figure was taken from [61]. . . . 33<br />
3.2 Yale Face Database: the first 10 (out of 38) subjects at all poses under a<br />
fixed illumination condition. . . . . . . . . . . . . . . . . . . . . . . . . . 34<br />
3.3 Yale Face Database: all illumination conditions of a person at a fixed pose<br />
used to compute the corresponding illumination subspace. . . . . . . . . . 37<br />
3.4 Yale Face Database: examples of basis images and (cumulative) singular<br />
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
3.5 CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a<br />
fixed illumination condition. . . . . . . . . . . . . . . . . . . . . . . . . . 38<br />
3.6 CMU-PIE Database: all illumination conditions of a person at a fixed pose<br />
used to compute the corresponding illumination subspace. . . . . . . . . . 39<br />
3.7 CMU-PIE Database: examples of basis images and (cumulative) singular<br />
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39<br />
3.8 ETH-80 Database: all categories and objects at a fixed pose. . . . . . . . . 41<br />
3.9 ETH-80 Database: all poses of an object from a category used to compute<br />
the corresponding pose subspace of the object. . . . . . . . . . . . . . . . 43<br />
3.10 ETH-80 Database: examples of basis images and (cumulative) singular<br />
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43<br />
3.11 IXMAS Database: video sequences of an actor performing 11 different<br />
actions viewed from a fixed camera. . . . . . . . . . . . . . . . . . . . . . 50<br />
3.12 IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in a Cartesian coordinate system and later converted to a cylindrical coordinate system to apply the FFT. . . . 51<br />
3.13 IXMAS Database: the ‘kick’ action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height. . . . 52<br />
3.14 IXMAS Database: cylindrical coordinate representation of the volume V(r, θ, z), and the corresponding 1D FFT feature abs(FFT(V(r, θ, z))), shown at a few values of θ. . . . 53<br />
4.1 Principal angles and Grassmann distances. Let span(Yi) and span(Yj) be<br />
two subspaces in the Euclidean space R D on the left. The distance between<br />
two subspaces span(Yi) and span(Yj) can be measured using the principal<br />
angles θ = [θ1, ... , θm] ′ . In the Grassmann manifold viewpoint, the sub-<br />
spaces span(Yi) and span(Yj) are considered as two points on the manifold<br />
G(m, D), whose Riemannian distance is related to the principal angles by<br />
d(Yi, Yj) = ‖θ‖2. Various distances can be defined based on the principal<br />
angles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60<br />
4.2 Yale Face Database: face recognition rates from 1NN classifier with the<br />
Grassmann distances. The two highest rates including ties are highlighted<br />
with boldface for each subspace dimension m. . . . . . . . . . . . . . . . 73<br />
4.3 CMU-PIE Database: face recognition rates from 1NN classifier with the<br />
Grassmann distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74<br />
4.4 ETH-80 Database: object categorization rates from 1NN classifier with the<br />
Grassmann distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75<br />
4.5 IXMAS Database: action recognition rates from 1NN classifier with the<br />
Grassmann distances. The two highest rates including ties are highlighted<br />
with boldface for each subspace dimension m. . . . . . . . . . . . . . . . 76<br />
5.1 Doubly kernel method. The first kernel implicitly maps the two ‘nonlinear<br />
subspaces’ Xi and Xj to span(Yi) and span(Yj) via the map Φ : X → H1,<br />
where the ‘nonlinear subspace’ means the preimage Xi = Φ −1 (span(Yi))<br />
and Xj = Φ −1 (span(Yj)). The second (=Grassmann) kernel maps the<br />
points Yi and Yj on the Grassmann manifold G(m, D) to the corresponding<br />
points in H2 via the map Ψ : G(m, D) → H2 such as (5.3) or (5.5). . . . . 84<br />
5.2 A two-dimensional subspace is represented by a triangular patch swept by<br />
two basis vectors. The positive and negative classes are color-coded blue and red, respectively. A: The two class centers Y+ and Y− around<br />
which other subspaces are randomly generated. B–D: Examples of ran-<br />
domly selected subspaces for ‘easy’, ‘intermediate’, and ‘difficult’ datasets. 86<br />
5.3 Yale Face Database: face recognition rates from various discriminant anal-<br />
ysis methods. The two highest rates including ties are highlighted with<br />
boldface for each subspace dimension m. . . . . . . . . . . . . . . . . . . 96<br />
5.4 CMU-PIE Database: face recognition rates from various discriminant anal-<br />
ysis methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97<br />
5.5 ETH-80 Database: object categorization rates from various discriminant<br />
analysis methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98<br />
5.6 IXMAS Database: action recognition rates from various discriminant anal-<br />
ysis methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
6.1 Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann<br />
manifold (Left), the set of linear subspaces, can alternatively be modeled as the set of flat (σ → 0) spheres (Yi′Yi = Im) intersecting at the origin (ui = 0). The right figure shows a general Mixture of Factor Analyzers which are not bound by these conditions. . . . 104<br />
6.2 The Mixture of Factor Analyzer model of the Grassmann manifold is the<br />
collection of linear homogeneous Factor Analyzers shown as flat spheres<br />
intersecting at the origin (A). This can be relaxed to allow nonzero offsets<br />
for each Factor Analyzer (B), and also to allow arbitrary eccentricity and<br />
scale for each Factor Analyzer shown as flat ellipsoids (C). . . . . . . . . . 111<br />
6.3 The same affine span can be expressed with different offsets u1, u2, ... How-<br />
ever, one can use the unique ‘standard’ offset û, which has the shortest<br />
length from the origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
6.4 Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as representations of linear subspaces (Left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (Right). . . . 118<br />
6.5 Yale Face Database: face recognition rates from various kernels. The two<br />
highest rates including ties are highlighted with boldface for each subspace<br />
dimension m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130<br />
6.6 CMU-PIE Database: face recognition rates from various kernels. . . . . . 131<br />
6.7 ETH-80 Database: object categorization rates from various kernels. . . . . 132<br />
6.8 IXMAS Database: action recognition rates from various kernels. . . . . . . 133<br />
Chapter 1<br />
INTRODUCTION<br />
1.1 Overview<br />
In machine learning problems the data commonly lie in a vector space, typically a Euclidean space. The Euclidean space is convenient for data representation, storage, and computation, and is geometrically intuitive as well.<br />
There are, however, other kinds of non-Euclidean spaces more suitable for data outside<br />
the conventional Euclidean domain. The data domain I focus on in this thesis is one of<br />
those non-Euclidean spaces where each data sample is a linear subspace of a Euclidean<br />
space.<br />
Researchers often encounter this unconventional domain in computer vision problems. For example, a set of images of an object or a face with varying lighting conditions is known to lie on a low (4- or 9-)dimensional subspace under mild assumptions. Many other types of<br />
variations such as pose changes or facial expressions, can also be empirically approximated<br />
quite well by low-dimensional subspaces. If the data consist of multiple sets of images, they<br />
can consequently be modeled as a collection of low-dimensional subspaces.<br />
What are the potential advantages of having such structures? In the above example of<br />
face images, we can model the illumination variation, which is irrelevant to a recognition task, with subspaces, and focus on learning the relevant variation between those subspaces, such as the variation due to subject identity. This idea applies not only to illumination-varying<br />
faces but also to many other types of data for which we can model out the undesired factors<br />
from the data with subspaces. Furthermore, representing data as a collection of subspaces<br />
is much more economical than keeping all the data samples as unorganized points, since<br />
we only need to store and handle the basis vectors. I refer to this approach of handling data<br />
as the subspace-based learning approach.<br />
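As a sketch of this representation (assuming numpy; the image sizes and counts below are hypothetical), a set of image vectors can be summarized by an orthonormal basis of its best rank-m subspace, computed with a thin SVD, so that only the basis vectors need to be stored:

```python
import numpy as np

def subspace_basis(X, m):
    """Return an orthonormal basis (D x m) of the best rank-m subspace
    spanned by the columns of the data matrix X (D x n), via a thin SVD."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]

# Hypothetical example: 100 vectorized 32x32 images summarized by a
# 4-dimensional subspace, as in the illumination-subspace setting.
rng = np.random.default_rng(0)
X = rng.standard_normal((32 * 32, 100))
Y = subspace_basis(X, 4)
print(Y.shape)                           # (1024, 4)
print(np.allclose(Y.T @ Y, np.eye(4)))   # True: orthonormal columns
```

Storing Y (1024 × 4 numbers) in place of X (1024 × 100 numbers) illustrates the economy of the subspace representation.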
Few researchers have clearly defined and fully utilized the properties of such a space in learning problems. Since a collection of subspaces is non-Euclidean, one can no longer benefit from the conveniences of the Euclidean space. For a learning algorithm to work with the subspace representation of the data, a suitable framework is required that is also convenient for storing and computing with such data. This thesis provides the foundations of subspace-based learning using a novel framework and kernels.<br />
To show the reader the scope and the depth of this work, I raise the following questions<br />
regarding the subject:<br />
Questions<br />
1. What are the examples of the subspace structure in real data?<br />
2. Which non-Euclidean domain suits the subspace-structured data?<br />
3. What dissimilarity measures of subspaces are there, and what are their properties?<br />
4. Can we define kernels for such domain?<br />
5. Are the kernels related to probabilistic distances?<br />
6. Can we extend the framework to subspaces that are not exactly linear?<br />
This thesis gives detailed and definitive answers to all of the questions above.<br />
1.2 Contributions and related work<br />
In this thesis I propose the Grassmann manifold framework for solving subspace-based problems. The Grassmann manifold is the set of fixed-dimensional linear subspaces and is an ideal model of the data under consideration. Grassmann manifolds have previously been used in signal processing and control [74, 36, 6], numerical optimization [20] (and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78].<br />
In particular, there are many approaches that use the subspace concept for problem solving<br />
in computer vision [92, 64, 24, 43, 3]. However, these works neither explicitly nor fully utilize the benefits of the Grassmann approach for subspace-based problems. In contrast, I make full use of the properties of the Grassmann manifold with a unifying framework<br />
that subsumes the previous approaches.<br />
With the proposed framework, a dissimilarity between subspaces can be viewed as a<br />
distance function on the Grassmann manifold. I review several known distances including<br />
the Arc-length, Projection, Binet-Cauchy, Max Corr, Min Corr, and Procrustes distances<br />
[20, 16], and provide analytical and empirical comparisons. Furthermore, I propose the<br />
Projection kernel as a legitimate kernel function on the Grassmann manifold. The Projection kernel also appears in [85], where it serves mainly as a similarity measure between subspaces rather than as a full-fledged kernel function on the Grassmann manifold. Another kernel I use in the thesis is the Binet-Cauchy kernel [90, 83]. I show that, despite the greater attention the Binet-Cauchy kernel has received, it is less useful than the Projection kernel on noisy data.<br />
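The principal angles underlying these distances, and the Projection kernel itself, can be sketched in a few lines (assuming numpy; the Projection kernel here is taken as k(Y1, Y2) = ‖Y1′Y2‖²_F, the sum of squared cosines of the principal angles):

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Principal angles between span(Y1) and span(Y2), obtained from the
    singular values of Y1' Y2 (orthonormal basis matrices assumed)."""
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def projection_kernel(Y1, Y2):
    """Projection kernel k(Y1, Y2) = ||Y1' Y2||_F^2 = sum_i cos^2(theta_i)."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

# Identical subspaces: all principal angles are zero and the kernel
# value equals the subspace dimension m (here m = 3).
Y = np.linalg.qr(np.random.default_rng(1).standard_normal((10, 3)))[0]
print(np.allclose(principal_angles(Y, Y), 0, atol=1e-6))  # True
print(np.isclose(projection_kernel(Y, Y), 3.0))           # True
```

The clipping step guards against singular values drifting slightly above 1 from floating-point error before the arccos.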
Using the two kernels as the representative kernels on the Grassmann manifold, I<br />
demonstrate the advantages of using the Grassmann kernels over the Euclidean kernels by a<br />
classification problem with Support Vector Machines on synthetic datasets. To demonstrate<br />
the potential benefits of the kernels further, I apply the kernels to a discriminant analysis on<br />
the Grassmann manifold and compare the approach with previously suggested algorithms<br />
for subspace-based discriminant analysis [92, 64, 24, 43]. In the previous methods, feature<br />
extraction is performed in the Euclidean space while non-Euclidean subspace distances<br />
are used in the objective. This inconsistency results in a difficult optimization and a weak<br />
guarantee of convergence, whereas the proposed approach with the Grassmann kernels is<br />
simpler and more effective, as evidenced by experiments with real image databases.<br />
In this thesis I also investigate the relationship between probabilistic distances and the<br />
Grassmann kernels. If we assume each set of vectors consists of i.i.d. samples from an arbitrary probability distribution, then it is possible to compare two such distributions of vectors<br />
with probabilistic similarity measures, such as the KL distance [47], the Chernoff distance<br />
[15], or the Bhattacharyya/Hellinger distance [10]. Furthermore, the Bhattacharyya affinity<br />
is in fact a positive definite kernel function on the space of distributions and has nice closed-<br />
form expressions for the exponential family [40]. The probabilistic distances and kernels<br />
are used for recognizing hand-written digits and faces [70, 46, 96]. I provide a link between<br />
the probabilistic and the Grassmann views by modeling the subspace data as a limit of the<br />
Mixture of Factor Analyzers [27] under the zero-mean and homogeneous conditions. The<br />
first result I show is that the KL distance reduces to the Projection kernel under the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL distance, I propose an extension of the Projection kernel, originally confined to the set of linear subspaces, to the sets of affine and scaled subspaces. I demonstrate<br />
the potential benefits of the extended kernels with the Support Vector Machines and the<br />
Kernel Discriminant Analysis, using synthetic and real image databases. The experiments<br />
show the superiority of the extended kernels over the Bhattacharyya and the Binet-Cauchy<br />
kernels, as well as over the Euclidean methods.<br />
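As an illustration of one such probabilistic measure (a hedged sketch assuming numpy; the formula below is the standard closed-form symmetric KL divergence between two Gaussians, not the thesis's reduced Factor-Analyzer form):

```python
import numpy as np

def symmetric_kl_gaussian(m1, S1, m2, S2):
    """Symmetric KL divergence J = KL(p||q) + KL(q||p) between Gaussians
    p = N(m1, S1) and q = N(m2, S2); the log-determinant terms cancel."""
    D = len(m1)
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = m1 - m2
    return 0.5 * (np.trace(S2inv @ S1) + np.trace(S1inv @ S2) - 2 * D
                  + dm @ (S1inv + S2inv) @ dm)

# Sanity check: the divergence of a Gaussian from itself is zero.
m, S = np.zeros(3), np.eye(3)
print(np.isclose(symmetric_kl_gaussian(m, S, m, S), 0.0))  # True
```

Under the zero-mean, homogeneous Factor Analyzer limit studied in Chapter 6, the covariances concentrate on the subspaces and this divergence relates to the principal angles between them.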
There is a related but independent problem of clustering unlabeled data points into mul-<br />
tiple subspaces. Several approaches have been proposed in the literature. A traditional and<br />
inefficient technique is to use an EM-algorithm [27] for a Mixture of Factor Analyzers<br />
(MFA), which models the data distribution as a superposition of the Gaussian distributions.<br />
More recent work on clustering subspace data includes the K-subspaces method [37], which extends the K-means algorithm to the case of subspaces, and Generalized PCA [81], which represents subspaces with polynomials and solves algebraic equations to fit the data. These methods differ from the method proposed in this thesis in that they serve as a preprocessing step that generates subspace labels for the proposed subspace-based learning.<br />
1.3 Organization of the paper<br />
The rest of the paper is organized as follows:<br />
• Chapter 2 provides background materials for the thesis, including kernel theory, large<br />
margin classifiers, and discriminant analyses.<br />
• Chapter 3 discusses theoretical and empirical evidence of inherent subspace structures in image and video databases, and describes procedures for preprocessing the databases.<br />
• Chapter 4 introduces the Grassmann manifold as a common framework for subspace-<br />
based learning. Various distances on the Grassmann manifold are reviewed and ana-<br />
lyzed in depth.<br />
• Chapter 5 defines the Grassmann kernels and proposes the application to discriminant<br />
analysis. Comparisons with previously used algorithms are given.<br />
• Chapter 6 examines the relationship between probabilistic distances and the Grassmann kernels. The chapter contains further discussion of the extension of the domain of subspace-based learning and presents the extended Grassmann kernels.<br />
• Chapter 7 summarizes the contributions of the thesis and discusses future work related to the proposed methods.
• Bibliography contains all the referenced work in this thesis.<br />
The main chapters of the thesis are divided into two parts. Chapters 3 and 4 integrate known facts and set up the framework for the thesis. Chapters 5 and 6 provide the main proposals, analyses, and experimental results.
Chapter 2<br />
BACKGROUND<br />
2.1 Introduction<br />
In this chapter I review three topics: 1) kernel machines, 2) their application to large margin classification, and 3) their application to discriminant analysis. The theory behind kernel machines is helpful and partially necessary for understanding the kernels proposed in this thesis. The large margin classification and discriminant analysis algorithms will be used to test the proposed kernels in Chapters 5 and 6.
I provide a brief tutorial of the three topics based on the well-known texts and papers<br />
such as [18, 69, 71]. Most of the proofs are omitted and can be found in the original texts.<br />
2.2 Kernel machines<br />
2.2.1 Motivation<br />
Often it is neither effective nor convenient to use the original data space to learn patterns in the data. For simplicity, let us assume the data X lie in a Euclidean space. When the patterns have a complex structure in the original data space, we can try to transform the data space nonlinearly into another space so that the learning task becomes easier in the transformed space. The new space is called a feature space, and the map is called a feature map.
Suppose we are trying to classify two-dimensional, two-class data (Figure 2.1.) If the<br />
true class boundary is an ellipse in the input space, a linear classifier cannot classify the<br />
data correctly. However, when the input space is mapped to the feature space by<br />
φ : R² → R³,  (x1, x2)′ ↦→ (x1², √2 x1x2, x2²)′,
the decision boundary becomes a hyperplane in three-dimensional space, and therefore the<br />
two classes can be perfectly separated by a simple linear classifier.<br />
Note that we mapped the data to the feature space of all (ordered) monomials of degree two, (x1², √2 x1x2, x2²), and used a hyperplane in that space. We can use the same idea for a feature space of higher-degree monomials. However, if we map x ∈ Rᴰ to the space of
degree-d monomials, the dimension of the feature space becomes the binomial coefficient

(D + d − 1 choose d),

which can be computationally infeasible even for moderate D and d. This difficulty is easily circumvented by noting that we only need to compute inner products of points in the feature space to define a hyperplane. For the space of degree-2 monomials, the inner product can be computed from the original data by
〈φ(x), φ(y)〉 = x1²y1² + x2²y2² + 2x1x2y1y2 = 〈x, y〉²,

which can be extended to degree-d monomials by 〈x, y〉ᵈ. The inner product in the feature
Figure 2.1: Classification in the input space (left) vs. a feature space (right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map φ : R² → R³, (x1, x2)′ ↦→ (x1², √2 x1x2, x2²)′, which maps the elliptical decision boundary to a hyperplane. This illustration was captured from Schölkopf's tutorial slides given at the workshop “The Analysis of Patterns”, Erice, Italy, 2005.
space, such as k(x, y) = 〈x, y〉ᵈ, is called a kernel function. From a user’s point of view,
a kernel function is simply a nonlinear similarity measure of data that corresponds to a<br />
linear similarity measure in a feature space that the user need not know explicitly. A formal<br />
definition will follow shortly.<br />
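The kernel trick above is easy to check numerically. The following is an illustrative sketch (assuming Python with numpy; not part of the thesis) comparing the explicit degree-2 feature map with the polynomial kernel:

```python
import numpy as np

def phi(v):
    # explicit degree-2 feature map: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, y, d=2):
    # k(x, y) = <x, y>^d, computed without constructing the feature space
    return np.dot(x, y) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# the inner product in the feature space equals the kernel value
lhs = np.dot(phi(x), phi(y))
rhs = poly_kernel(x, y)        # both equal (1*3 + 2*(-1))^2 = 1
```

The agreement of the two quantities is exactly the identity derived above for degree-2 monomials.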
2.2.2 Mercer kernels<br />
In this subsection I introduce Mercer’s theorem, which characterizes the condition under which a kernel function k induces a feature map and a feature space. Let X denote the data space.
In case of finite X<br />
Definition 2.1 (Symmetric Positive Definite Matrix). A real N × N symmetric matrix K is positive definite if

∑_{i,j} ci cj Kij ≥ 0,  for all c1, ..., cN (ci ∈ R).
Consider a finite input space X = {x1, ..., xN} and a symmetric real-valued function k(x, y). Let K be the N × N matrix with entries Kij = k(xi, xj), the function evaluated on X × X. Since K is symmetric it can be diagonalized as K = V ΛV′, where Λ is a diagonal matrix of eigenvalues λ1 ≤ ... ≤ λN and V is an orthonormal matrix whose columns are the corresponding eigenvectors. Let vi denote the i-th row of V:

V = [v1′ ··· vN′]′.

If the matrix K is positive definite, so that the eigenvalues are non-negative, then we can define the following feature map:

φ : X → H = Rᴺ,  xi ↦→ vi Λ^{1/2},  i = 1, ..., N,

where Λ^{1/2} is the diagonal matrix Λ^{1/2} = diag(√λ1, ..., √λN). We now observe that the inner product in the feature space 〈·, ·〉H coincides with the kernel matrix evaluated on the data:

〈φ(xi), φ(xj)〉 = vi Λ vj′ = (V ΛV′)ij = Kij.
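This finite-case construction can be verified numerically. A sketch (assuming numpy; the RBF kernel is used only as a convenient positive definite example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))                    # five 2-D points

# a positive definite kernel matrix (Gaussian RBF)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

lam, V = np.linalg.eigh(K)                         # K = V diag(lam) V'
lam = np.clip(lam, 0.0, None)                      # guard tiny negative round-off
Phi = V * np.sqrt(lam)                             # row i is phi(x_i) = v_i Lam^{1/2}

# inner products of the explicit features recover the kernel matrix
err = np.abs(Phi @ Phi.T - K).max()
```

Here `Phi @ Phi.T` reconstructs K up to numerical precision, mirroring the identity 〈φ(xi), φ(xj)〉 = Kij above.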
In case of compact X<br />
Let us apply the intuition gained from the finite case to an infinite-dimensional case. Although further generalization to a finite measure space (X, µ) is possible, we will deal with compact subsets of Rᴰ as the domain.
Theorem 2.2 (Mercer). Let X be a compact subset of Rᴰ. Suppose k : X × X → R is a continuous symmetric function such that the integral operator Tk : L2(X) → L2(X),

(Tk f)(x) = ∫_X k(x, y) f(y) dy,

has the property

∫_{X²} k(x, y) f(x) f(y) dx dy ≥ 0,

for all f ∈ L2(X). Then we have a uniformly convergent series

k(x, y) = ∑_{i=1}^∞ λi ψi(x) ψi(y)

in terms of the normalized eigenfunctions ψi ∈ L2(X) of Tk (normalized means ‖ψi‖_{L2} = 1).
The condition on Tk is an extension of the positive definite condition for matrices. Let us define a sequence of feature maps from the operator eigenfunctions ψi:

φd : X → H = l2ᵈ,  x ↦→ (√λ1 ψ1(x), ..., √λd ψd(x)),

for d = 1, 2, .... Theorem 2.2 tells us that the sequence of maps φ1, φ2, ... converges to a map φ : X → H such that 〈φ(x), φ(y)〉H = k(x, y). The theorem below formalizes this observation:
Theorem 2.3 (Mercer Kernel Map). If X is a compact subset of Rᴰ and k is a function satisfying the conditions of Theorem 2.2, then there is a feature map φ : X → H into a feature space H in which k becomes an inner product,

〈φ(x), φ(y)〉H = k(x, y),

for almost all x, y ∈ X. Moreover, given any ε > 0, there exists a map φn into an n-dimensional Hilbert space such that

|k(x, y) − 〈φn(x), φn(y)〉| < ε

for almost all x, y ∈ X.

Mercer's theorem thus gives us a construction of a feature space. In the next section we will look at a more general construction via the Reproducing Kernel Hilbert Space.
2.2.3 Reproducing Kernel Hilbert Space<br />
Extending the notion of positive definiteness of matrices and compact operators, we can<br />
define the positive definiteness of a function on an arbitrary set X as follows:<br />
Definition 2.4 (Positive Definite Kernel). Let X be any set, and let k : X × X → R be a symmetric real-valued function, k(xi, xj) = k(xj, xi) for all xi, xj ∈ X. Then k is a positive definite kernel function if

∑_{i,j} ci cj k(xi, xj) ≥ 0,

for all x1, ..., xn (xi ∈ X) and c1, ..., cn (ci ∈ R), for any n ∈ N.
In fact, the necessary and sufficient condition for a kernel to have an associated feature space and feature map is that the kernel be positive definite. Below are the three steps in [69] to construct the feature map φ and the feature space H from a given positive definite kernel k:
1. Define a vector space with k.<br />
2. Endow it with an inner product with a reproducing property.<br />
3. Complete the space to a Hilbert space.<br />
First, we define H as the set of all linear combinations of functions of the form

f(·) = ∑_{i=1}^m αi k(·, xi),

for arbitrary m ∈ N, α1, ..., αm (αi ∈ R), and x1, ..., xm (xi ∈ X). It is not difficult to check that H is a vector space. Let g(·) = ∑_{j=1}^n βj k(·, yj) be another function in the vector space, for some n ∈ N, β1, ..., βn (βj ∈ R), and y1, ..., yn (yj ∈ X). Next, we define the following inner product between f and g:

〈f, g〉 = ∑_{i,j} αi βj k(xi, yj).
It is possible that the coefficients {αi} and {βj} are not unique; that is, a function f (or g) may be represented in multiple ways with different coefficients. To see that the inner product is still well-defined, note that

〈f, g〉 = ∑_{i,j} αi βj k(xi, yj) = ∑_j βj f(yj)

by definition. This shows that 〈·, ·〉 does not depend on the particular expansion coefficients {αi}. Similarly, 〈f, g〉 = ∑_i αi g(xi) shows that the inner product does not depend on {βj} either. The positivity 〈f, f〉 = ∑_{i,j} αi αj k(xi, xj) ≥ 0 follows from the positive definiteness of k. The other axioms are easily checked. One notable property of the defined inner product is as follows. By choosing g(·) = k(·, y) we have 〈f, k(·, y)〉 = f(y) by definition. Furthermore, with f(·) = k(·, x) we have

〈k(·, x), k(·, y)〉 = k(x, y),

which is called the reproducing property.
Finally, the space can be completed to a Hilbert space, which is called the Reproducing<br />
Kernel Hilbert Space. Below is the formal definition of the space:<br />
Definition 2.5 (Reproducing Kernel Hilbert Space). Let X be a nonempty set and H a Hilbert space of functions f : X → R. Then H is called a Reproducing Kernel Hilbert Space (RKHS) endowed with the inner product 〈·, ·〉 if there exists a function k : X × X → R with the following properties:

1. 〈f, k(x, ·)〉 = f(x) for all f ∈ H; in particular, 〈k(x, ·), k(y, ·)〉 = k(x, y).

2. k spans H, that is, H = cl(span{k(x, ·) | x ∈ X}), where cl denotes the completion of the set.
We have seen that a RKHS can be constructed from a positive definite kernel in three<br />
steps. The converse is also true. If a RKHS H is given then a unique positive definite kernel<br />
can be defined as the inner product of the space H.<br />
Finally, we show that Mercer kernels are positive definite in the generalized sense.<br />
Theorem 2.6 (Equivalence of Positive Definiteness). Let X = [a, b] be a compact interval and let k : [a, b] × [a, b] → C be continuous. Then k is a positive definite kernel if and only if

∫_{[a,b]×[a,b]} k(x, y) f(x) f(y) dx dy ≥ 0

for any continuous function f : X → C.

In this regard, every Mercer kernel k has an RKHS as a feature space for which k is the reproducing kernel.
2.2.4 Examples of kernels<br />
There is an ever-expanding number of kernels for various types of data and applications, and we can only glimpse a portion of them. Below is a list of the most often-used kernels for Euclidean data. Let x, y ∈ X ⊂ Rᴰ.

• Homogeneous polynomial kernel: k(x, y) = 〈x, y〉ᵈ.

• Nonhomogeneous polynomial kernel: k(x, y) = (〈x, y〉 + c)ᵈ, c ∈ R.

• Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / 2σ²), σ > 0. The Gaussian RBF kernel has the following characteristics: 1) the points in the feature space lie on a sphere, since ‖φ(x)‖² = 1; 2) the angle between two points in the feature space is at most π/2, since 〈φ(x), φ(y)〉 = k(x, y) > 0; and 3) the feature space is infinite-dimensional.
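As a quick illustration (a sketch assuming numpy, not from the thesis), the three kernels above can be written directly:

```python
import numpy as np

def poly_homogeneous(x, y, d=2):
    # k(x, y) = <x, y>^d
    return np.dot(x, y) ** d

def poly_nonhomogeneous(x, y, d=2, c=1.0):
    # k(x, y) = (<x, y> + c)^d
    return (np.dot(x, y) + c) ** d

def gaussian_rbf(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
k_xx = gaussian_rbf(x, x)   # equals 1: every point maps onto the unit sphere
```

The value k(x, x) = 1 for the RBF kernel reflects characteristic 1) above, ‖φ(x)‖² = 1.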
These kernels were the first to be used with large margin classifiers [11]. They can be evaluated in closed form without having to construct the feature spaces explicitly. Further work has discovered other types of kernels that can be evaluated efficiently by recursion. These include the following two kernels:
• All-subsets kernel [75]: Let I = {1, 2, ..., D} be the indices of the variables xi, i ∈ I. For every subset A of I, define φA(x) = ∏_{i∈A} xi, with φ∅(x) = 1 for A = ∅. If φ(x) is the sequence (φA(x))_{A⊂I}, then the all-subsets kernel is

k(x, y) = 〈φ(x), φ(y)〉 = ∑_{A⊂I} φA(x) φA(y) = ∑_{A⊂I} ∏_{i∈A} xi yi = ∏_{i=1}^D (1 + xi yi).
• ANOVA kernel [87]: it is defined similarly to the all-subsets kernel. Define φ(x) as the sequence (φA(x))_{A⊂I, |A|=d}, where we restrict A to the subsets of cardinality d. Then the kernel is

k(x, y) = 〈φ(x), φ(y)〉 = ∑_{|A|=d} φA(x) φA(y) = ∑_{1≤i1<···<id≤D} (xi1 yi1) ··· (xid yid).
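The efficient recursive evaluation mentioned above can be sketched for the ANOVA kernel (assuming numpy; a standard dynamic-programming recursion, not code from the thesis):

```python
import numpy as np

def anova_kernel(x, y, d):
    # k_d(x, y) = sum over 1 <= i_1 < ... < i_d <= D of prod_j x_{i_j} y_{i_j},
    # evaluated by dynamic programming instead of enumerating all subsets.
    D = len(x)
    # K[m, s]: ANOVA kernel of order s restricted to the first m variables
    K = np.zeros((D + 1, d + 1))
    K[:, 0] = 1.0                                  # empty product
    for m in range(1, D + 1):
        for s in range(1, d + 1):
            # variable m is either excluded, or contributes the factor x_m y_m
            K[m, s] = K[m - 1, s] + x[m - 1] * y[m - 1] * K[m - 1, s - 1]
    return K[D, d]

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 1.0])
k2 = anova_kernel(x, y, 2)     # x1x2 + x1x3 + x2x3 = 2 + 3 + 6 = 11
```

Summing the ANOVA kernels over all orders d = 0, ..., D recovers the all-subsets kernel ∏(1 + xi yi), which ties the two definitions together.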
Theorem 2.7. If k1(x, y) and k2(x, y) are positive definite kernels, then the following kernels are also positive definite:

1. Conic combination: α1k1(x, y) + α2k2(x, y), (α1, α2 > 0).

2. Pointwise product: k1(x, y) k2(x, y).

3. Integration: ∫ k(x, z) k(y, z) dz.

4. Product with a rank-1 kernel: k(x, y) f(x) f(y).

5. Limit: if k1(x, y), k2(x, y), ... are positive definite kernels, then so is lim_{i→∞} ki(x, y).
Proofs can be found in [69, 71].<br />
Corollary 2.8. If k is a positive definite kernel, then so are f(k(x, y)) and exp(k(x, y)), where f : R → R is any polynomial with nonnegative coefficients.
2.2.6 Distance and conditionally positive definite kernels<br />
In this subsection I review the relationship between distances and conditionally positive<br />
definite kernels.<br />
Distance and metric<br />
Throughout the thesis I will use the term distance interchangeably with similarity measure,<br />
to denote an intuitive notion of ‘closeness’ between two patterns in the data. Therefore a<br />
distance d(·, ·) is any assignment of nonnegative values to a pair of points in a set X . A<br />
metric is, however, a distance that satisfies the additional axioms:<br />
Definition 2.9 (Metric). A real-valued function d : X × X → R is called a metric if

1. d(x1, x2) ≥ 0,

2. d(x1, x2) = 0 if and only if x1 = x2,

3. d(x1, x2) = d(x2, x1),

4. d(x1, x3) ≤ d(x1, x2) + d(x2, x3),

for all x1, x2, x3 ∈ X.
Relationship between metric and kernel<br />
The standard metric d(φ(x1), φ(x2)) in the feature space is the norm ‖φ(x1) − φ(x2)‖ induced from the inner product. The metric can be written in terms of the kernel as

d²(φ(x1), φ(x2)) = k(x1, x1) + k(x2, x2) − 2k(x1, x2). (2.1)

Therefore any RKHS is also a metric space (H, d) with the metric given above. Conversely, if a metric is given that is known to be induced from an inner product, then we can recover the inner product from the polarization of the metric:

k̃(x1, x2) = 〈φ(x1), φ(x2)〉 = (1/2)(−‖φ(x1) − φ(x2)‖² + ‖φ(x1)‖² + ‖φ(x2)‖²).
This raises the following question: if we are given a set and a metric (X , d), can we<br />
determine if d is induced from a positive definite kernel? To answer the question we need<br />
the following definition<br />
Definition 2.10 (Conditionally Positive Definite Kernel). Let X be any set, and let k : X × X → R be a symmetric real-valued function, k(xi, xj) = k(xj, xi) for all xi, xj ∈ X. Then k is a conditionally positive definite kernel function if

∑_{i,j} ci cj k(xi, xj) ≥ 0,

for all x1, ..., xn (xi ∈ X) and c1, ..., cn (ci ∈ R) such that ∑_{i=1}^n ci = 0, for any n ∈ N.
The question above is answered by the following theorem [67]:

Theorem 2.11 (Schoenberg). A metric space (X, d) can be embedded isometrically into a Hilbert space if and only if −d²(·, ·) is conditionally positive definite.

As a corollary, we have

Corollary 2.12 ([35]). A metric d is induced from a positive definite kernel if and only if

k̃(x1, x2) = −d²(x1, x2)/2,  x1, x2 ∈ X (2.2)

is conditionally positive definite.
It is known that one can use conditionally positive definite kernels just as positive definite kernels in learning problems that are invariant to the choice of origin [68].
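Corollary 2.12 can be checked numerically on a sample. For the Gaussian RBF kernel, the squared feature-space metric (2.1) is d² = 2 − 2k, and the quadratic form of −d²/2 is nonnegative whenever the coefficients sum to zero (a sketch assuming numpy, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                              # Gaussian RBF kernel matrix

# squared feature-space metric (2.1): d^2 = k(x,x) + k(y,y) - 2 k(x,y)
D2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2.0 * K

# conditional positive definiteness of -d^2/2 (Eq. 2.2): the quadratic form
# must be nonnegative for coefficients c with sum_i c_i = 0
c = rng.standard_normal(20)
c = c - c.mean()                                   # enforce sum_i c_i = 0
quad = c @ (-D2 / 2.0) @ c
```

With ∑ ci = 0 the diagonal terms of D2 drop out of the quadratic form, so it reduces to c′Kc ≥ 0, exactly the mechanism behind Corollary 2.12.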
2.3 Support Vector Machines<br />
A Support Vector Machine (SVM) is a supervised learning method used for classification. Due to its computational efficiency and theoretically well-understood generalization performance, the SVM has received much attention in the last decade and remains one of the main topics in machine learning research. In this section I review the basics of the SVM.
I use the notation D = {(x1, y1), ..., (xN, yN)} to denote N pairs of a training sample<br />
xi ∈ R D and its class label yi ∈ {−1, 1}, i = 1, 2, ..., N.<br />
2.3.1 Large margin linear classifier<br />
Consider the problem of separating the two-class training data D = {(x1, y1), ..., (xN, yN)} with a hyperplane

P : 〈w, x〉 + b = 0.

Let us assume the data are linearly separable, that is, we can separate the data with a hyperplane without error (refer to Figure 2.2). Since the equation 〈c·w, x〉 + c·b = 0 represents the same hyperplane for any nonzero c ∈ R, we choose a canonical representation of the hyperplane by setting mini |〈w, xi〉 + b| = 1. The linear separability can then be expressed as

yi(〈w, xi〉 + b) ≥ 1,  i = 1, ..., N, (2.3)
and the distance of a point x to the hyperplane P is given by

d(x, P) = |〈w, x〉 + b| / ‖w‖.

We define the margin of the hyperplane as the distance between the closest positive and negative training samples measured across the hyperplane,

ρ = min_{yi=1} d(xi, P) + min_{yi=−1} d(xi, P),

which, for the canonical representation, can be shown to be equal to ρ = 2/‖w‖.
If the data are linearly separable, there are typically an infinite number of hyperplanes that separate the classes correctly. The main idea of the SVM is to choose the one that has the maximum margin. Therefore the maximum margin classifier is the solution to the optimization problem

min_{w,b} (1/2)‖w‖²,  subject to yi(〈w, xi〉 + b) ≥ 1, i = 1, ..., N. (2.4)
Figure 2.2: Example of classifying two-class data with a hyperplane 〈w, x〉 + b = 0. In this case the data can be separated without error. This illustration was captured from Schölkopf's tutorial slides given at the workshop “The Analysis of Patterns”, Erice, Italy, 2005.
2.3.2 Dual problem and support vectors<br />
The primal problem (2.4) is a convex optimization problem with linear constraints. By Lagrangian duality, solving the primal problem is equivalent to solving the dual problem:

min_α (1/2) ∑_{i,j} αi αj yi yj 〈xi, xj〉 − ∑_i αi,  subject to αi ≥ 0, i = 1, ..., N, and ∑_i αi yi = 0. (2.5)
The advantages of the dual formulation are two-fold: 1) the dual problem is often easier to solve than the primal problem, and 2) it provides a geometrically meaningful interpretation of the solution.
If α* is the optimal solution of (2.5), then the optimal values of the primal variables are given by w* = ∑_i αi* yi xi and b = −(1/2)〈w*, x+ + x−〉, where x+ and x− are positive- and negative-class samples such that 〈w*, x+〉 + b = 1 and 〈w*, x−〉 + b = −1, respectively.
The resultant classifier for test data is then

f(x) = sgn(〈w*, x〉 + b) = sgn(∑_i αi yi 〈xi, x〉 + b),

where sgn(z) = −1 for z < 0, 0 for z = 0, and 1 for z > 0.
The Kuhn-Tucker condition of the optimization problem requires

αi [yi(〈w, xi〉 + b) − 1] = 0,  i = 1, ..., N.

This implies that only the points xi that satisfy yi(〈w, xi〉 + b) = 1 can have a nonzero dual variable αi. These points are called support vectors, since they are the only points needed to define the decision function in the linearly separable case.
2.3.3 Extensions<br />
Non-separable case: soft-margin SVM<br />
Suppose the data are not linearly separable, and the constraints (2.3) need to be relaxed to

yi(〈w, xi〉 + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., N,

for the problem to be feasible.
A soft-margin SVM is defined by the optimization

min_{w,b,ξ} (1/2)‖w‖² + C ∑_i ξi,  subject to yi(〈w, xi〉 + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., N, (2.6)

where C is a fixed parameter that determines the trade-off between the margin and the classification error in the cost. The primal problem (2.6) also has an equivalent dual problem:

min_α (1/2) ∑_{i,j} αi αj yi yj 〈xi, xj〉 − ∑_i αi,  subject to (2.7)

0 ≤ αi ≤ C, i = 1, ..., N, and ∑_i αi yi = 0.

The regularization parameter C should reflect the prior knowledge of the amount of noise in the data.
Nonlinear separation: kernel SVM<br />
We obtain a nonlinear version of the SVM by mapping the space X to an RKHS via a kernel function k. The kernel SVM is implemented simply by replacing the Euclidean inner product 〈xi, xj〉 with a given kernel function k(xi, xj). After the replacement the soft-margin SVM problem (2.7) becomes

min_α (1/2) ∑_{i,j} αi αj yi yj k(xi, xj) − ∑_i αi,  subject to (2.8)

0 ≤ αi ≤ C, i = 1, ..., N, and ∑_i αi yi = 0,

and the resultant decision function for test data is given in terms of the kernel function:

f(x) = sgn(∑_i αi yi k(xi, x) + b).
Since Kij = k(xi, xj) is a fixed matrix, the optimization in the training phase is no more difficult than solving the linear SVM. The resultant decision function can classify highly nonlinear, complicated data distributions at the same cost as training the simple linear classifier.
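As an illustrative sketch (assuming numpy; not the solver used in this thesis), the dual (2.8) can be solved approximately by projected gradient ascent. For simplicity the bias b is dropped here, which removes the equality constraint ∑_i αi yi = 0 and leaves only the box constraints:

```python
import numpy as np

def train_kernel_svm(K, y, C=10.0, lr=1e-3, iters=10000):
    # Maximize sum_i a_i - (1/2) a'Qa over the box 0 <= a_i <= C,
    # where Q_ij = y_i y_j k(x_i, x_j) (bias-free variant of (2.8)).
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(len(y))
    for _ in range(iters):
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, C)
    return alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1.0, -1.0)   # elliptical classes

K = (X @ X.T + 1.0) ** 2               # nonhomogeneous polynomial kernel, d = 2
alpha = train_kernel_svm(K, y)

f = K @ (alpha * y)                    # f(x_j) = sum_i alpha_i y_i k(x_i, x_j)
train_acc = np.mean(np.sign(f) == y)
n_support = np.sum(alpha > 1e-6)       # only support vectors have alpha_i > 0
```

The elliptical boundary is linear in the degree-2 feature space, so the learned decision function separates the training data almost perfectly, and only a subset of the points end up with nonzero αi.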
2.3.4 Generalization error and overfitting<br />
The success of SVM algorithms in practice can be ascribed to their ability to bound generalization errors. I will not go into this vast topic, but would like to point out the following fact: the maximization of the margin corresponds to the minimization of the capacity (or complexity) of the hyperplane, which helps to avoid overfitting.
2.4 Discriminant Analysis<br />
A discriminant analysis technique is a method to find a low-dimensional subspace of the<br />
input space which preserves the discriminant features of multiclass data. Figure 2.3 illustrates the idea for a two-class toy problem.
I introduce two techniques: Fisher Discriminant Analysis (FDA), also called Linear Discriminant Analysis [25], and Nonparametric Discriminant Analysis (NDA) [12]. Originally these algorithms were developed and used for low-dimensional Euclidean data. I will discuss the challenges and solutions when the techniques are applied to high-dimensional data, and describe their extensions to nonlinear discrimination problems with kernels.
Both FDA and NDA are discriminant analysis techniques which find a subspace that maximizes the ratio of the between-class scatter Sb to the within-class scatter Sw after the data are projected onto the subspace. The objective function for the one-dimensional case is the Rayleigh quotient

J(w) = (w′Sbw) / (w′Sww),  w ∈ Rᴰ,
Figure 2.3: The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto one-dimensional directions denoted by the arrows. The projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas the projection in the Fisher direction yields the least overlapping, and therefore the most discriminant, one-dimensional distributions. This illustration was captured from the paper [58].
where D is the dimension of the data space. For multiclass data there are several options<br />
for the objective function [25]. The most widely used objective is the multiclass Rayleigh<br />
quotient<br />
J(W) = tr[(W′SwW)⁻¹ W′SbW], (2.9)
where W is a D × d matrix, and d < D is the low-dimensional feature dimension. The<br />
quotient measures the class separability in the subspace span(W ) similarly to the one-<br />
dimensional case.<br />
2.4.1 Fisher Discriminant Analysis<br />
Let {x1, ..., xN} be the data vectors and {y1, ..., yN} be the class labels, yi ∈ {1, ..., C}. Without loss of generality we assume the data are ordered according to the class labels: 1 = y1 ≤ y2 ≤ ... ≤ yN = C. Each class c has Nc samples. Let µc = (1/Nc) ∑_{i|yi=c} xi be the mean of class c, and let µ = (1/N) ∑_i xi be the global mean. The between-scatter and within-scatter matrices of FDA are defined as follows:

Sb = (1/N) ∑_{c=1}^C Nc (µc − µ)(µc − µ)′,

Sw = (1/N) ∑_{c=1}^C ∑_{i|yi=c} (xi − µc)(xi − µc)′.
When Sw is nonsingular, which is typically the case for low-dimensional data (D < N), the optimal W is found from the largest eigenvectors of Sw⁻¹Sb. Since Sw⁻¹Sb has rank C − 1, there are C − 1 sequential optima W = {w1, ..., wC−1}. By projecting the data onto span(W), we achieve dimensionality reduction and feature extraction of the data onto the most discriminant directions.
To classify points with the simple k-NN classifier, one can use the distance of the data projected onto span(W), or use the Mahalanobis distance to the projected mean of each class:

arg min_j dj(x) = [W′(x − µj)]′ (W′SwW)⁻¹ [W′(x − µj)]. (2.10)
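The FDA computation above can be sketched compactly (assuming numpy; function and variable names are illustrative, not from the thesis), including the regularization of Eq. (2.11):

```python
import numpy as np

def fda(X, y, d, sigma2=1e-6):
    # Top-d eigenvectors of Sw^{-1} Sb; X is (N, D), y holds integer labels.
    N, D = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
        Sw += (Xc - mc).T @ (Xc - mc)
    Sb /= N
    Sw = Sw / N + sigma2 * np.eye(D)       # regularize ill-conditioned Sw (2.11)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real        # D x d projection matrix W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
               rng.normal([4.0, 0.0, 0.0], 1.0, (50, 3))])
y = np.repeat([0, 1], 50)
W = fda(X, y, d=1)        # recovers (roughly) the x-axis as the Fisher direction
```

On this toy problem the classes differ only along the first coordinate, so the single Fisher direction aligns almost exactly with that axis.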
2.4.2 Nonparametric Discriminant Analysis<br />
The FDA is motivated by the simple scenario in which the class-conditional distribution p(x|ci) is Gaussian, or at least has a peak around its mean µi. However, this assumption is easily violated, for example, by a distribution that has multiple peaks. The NDA tries to relax the parametric Gaussian assumption a little. The between-scatter and within-scatter matrices of NDA are defined as
Sb = (1/N) ∑_{i=1}^N ∑_{j∈Bi} (xi − xj)(xi − xj)′,

Sw = (1/N) ∑_{i=1}^N ∑_{j∈Wi} (xi − xj)(xi − xj)′,
where Bi is the index set of the K nearest neighbors of xi belonging to classes different from that of xi, and Wi is the index set of the K nearest neighbors of xi belonging to the same class as xi. While FDA uses the global class mean µi as the representative of each class, NDA uses the local class mean around the point of interest. This results in tolerance to the non-Gaussianity or multimodality of the classes. As the number of nearest neighbors K increases, NDA behaves more and more like FDA.
For classification tasks one can also use the simple k-NN rule restricted to the span(W )<br />
or the Mahalanobis distance similarly to FDA, although k-NN is more consistent with the<br />
nonparametric assumption of NDA.<br />
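The NDA scatter matrices can be computed directly from the definitions above; a sketch (assuming numpy, with illustrative names):

```python
import numpy as np

def nda_scatters(X, y, K=3):
    # Nonparametric scatter matrices built from the K nearest between-class
    # (Bi) and within-class (Wi) neighbors of each point x_i.
    N, D = X.shape
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(N):
        same = np.flatnonzero((y == y[i]) & (np.arange(N) != i))
        diff = np.flatnonzero(y != y[i])
        Wi = same[np.argsort(dist[i, same])[:K]]   # K nearest, same class
        Bi = diff[np.argsort(dist[i, diff])[:K]]   # K nearest, other classes
        for j in Bi:
            Sb += np.outer(X[i] - X[j], X[i] - X[j])
        for j in Wi:
            Sw += np.outer(X[i] - X[j], X[i] - X[j])
    return Sb / N, Sw / N

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = np.repeat([0, 1, 2], 10)
Sb, Sw = nda_scatters(X, y, K=3)
```

Both matrices are symmetric positive semidefinite sums of outer products, and the subsequent eigen-analysis proceeds exactly as in FDA.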
2.4.3 Discriminant analysis in high-dimensional spaces<br />
In the previous explanation of FDA, we assumed Sw is nonsingular. However, this is not the<br />
case for high-dimensional data. Note that the ranks of Sw and Sb of FDA can be at most N − C and C − 1, respectively. The maxima are achieved when the data are not co-linear, which is very likely for high-dimensional data [93]. Because the number of features d cannot exceed the rank of Sb, FDA can extract at most C − 1 features. For NDA, when K > 1, the rank of Sw can be up to N − C and the rank of Sb can be up to N − 1. However, for small K the ranks of Sw and Sb will also be small. The number of features NDA can extract is likewise bounded by the rank of Sb.
Because Sw spans at most N − C dimensions, there is always a nullspace of Sw inside the span of the data, which is at least N − 1 dimensional. Without regularization, both FDA and NDA always yield directions in this nullspace of Sw as maximizers of the Rayleigh quotient. This is not preferable, because even a small change in the data can cause a big change in the solution. One solution, suggested in [7], is to use Principal Component Analysis to first reduce the dimensionality of the data by projecting them onto the subspace spanned by the N − C largest eigenvectors. In this subspace Sw is likely to be well-conditioned. Another solution is to regularize the ill-conditioned matrix Sw by adding an isotropic noise matrix,

Sw ← Sw + σ²I, (2.11)

where σ determines the amount of regularization. I use the regularization approach in this thesis.
2.4.4 Extension to nonlinear discriminant analysis<br />
From the discussion of kernel machines in the previous section, we know that a linear<br />
algorithm can be extended to nonlinear algorithms by using kernels. The Kernel FDA, also<br />
known as the Generalized Discriminant Analysis, has been fully studied by [5, 56, 57]. To<br />
summarize, Kernel FDA can be formulated as follows.<br />
Let φ : G → H be the feature map, and let Φ = [φ1 ... φN] be the feature matrix of the training points. Assuming the FDA direction w in the feature space is a linear combination of the feature vectors, w = Φα, we can rewrite the Rayleigh quotient in terms of α as

J(α) = (α′Φ′SbΦα) / (α′Φ′SwΦα) = (α′K(V − (1/N)1N1N′)Kα) / (α′(K(IN − V)K + σ²IN)α), (2.12)
where K is the kernel matrix, 1N is the uniform vector [1 ... 1]′ of length N, and V is a block-diagonal matrix whose c-th block is the uniform matrix (1/Nc)1Nc1Nc′:

V = diag( (1/N1)1N11N1′ , ..., (1/NC)1NC1NC′ ).

The term σ²IN is a regularizer that makes the computation stable. Similarly to FDA, the set of optimal α's is computed from the eigenvectors of Kw⁻¹Kb, where Kb and Kw are the quadratic matrices in the numerator and the denominator of (2.12):

Kb = K(V − (1/N)1N1N′)K,

Kw = K(IN − V)K + σ²IN.
The NDA can be similarly kernelized by the assumption w = Φα and is omitted here.<br />
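The Kernel FDA computation above can be sketched as follows (assuming numpy; names are illustrative, not from the thesis):

```python
import numpy as np

def kernel_fda(K, y, d, sigma2=1e-3):
    # Coefficients alpha from Eq. (2.12): top-d eigenvectors of Kw^{-1} Kb.
    N = len(y)
    one = np.ones((N, N)) / N                  # (1/N) 1_N 1_N'
    V = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        V[np.ix_(idx, idx)] = 1.0 / len(idx)   # c-th block: (1/N_c) 1 1'
    Kb = K @ (V - one) @ K
    Kw = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)
    evals, evecs = np.linalg.eig(np.linalg.solve(Kw, Kb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real            # N x d matrix of alpha's

# two classes that are not linearly separable in the input space
rng = np.random.default_rng(0)
inner = 0.3 * rng.standard_normal((20, 2))
theta = rng.uniform(0.0, 2.0 * np.pi, 20)
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]
X = np.vstack([inner, outer])
y = np.repeat([0, 1], 20)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                          # Gaussian RBF kernel matrix
alpha = kernel_fda(K, y, d=1)
z = (K @ alpha)[:, 0]                          # one-dimensional projections
```

A new point x is projected as f(x) = ∑_i αi k(xi, x); on this ring-versus-cluster example the one-dimensional projections separate the two classes, which no linear FDA direction in the input space could do.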
Chapter 3<br />
MOTIVATION: <strong>SUBSPACE</strong><br />
STRUCTURE IN DATA<br />
3.1 Introduction<br />
In this chapter I discuss theoretical and empirical evidence of subspace structures which naturally appear in real-world data.
The most prominent examples of subspaces can be found in the image-based face recognition problem. Face images show large variability due to identity, pose change, illumination conditions, facial expression, and so on. Principal Component Analysis (PCA) has been applied to construct low-dimensional models of faces [73, 44] and used for recognition [79], known as the Eigenfaces. Although the Eigenfaces were originally applied to model image variations across different people, they also explain the illumination variation of a single person exceptionally well [32, 21, 94]. Theoretical explanations of the low dimensionality of the illumination variability have been proposed in [8, 62, 61, 4].
When the data consist of illumination-varying images of multiple people, I can model the data as a collection of the illumination subspaces of each person. In this way, I can absorb the undesired variability of illumination as variability within subspaces, and emphasize the variability of subject identity as variability between the subspaces. This idea applies not only to illumination change but also to other types of data that have known linear substructures. This is the main idea of the subspace-based approach advocated in this thesis.
More recent examples of subspace structure are found in the dynamical system models<br />
of video sequences of, for example, human actions or time-varying textures [19, 80, 78].<br />
When each sequence is modeled by a dynamical system, I can compare those dynamical<br />
systems by comparing the linear spans of the observability matrices of the systems, which<br />
is similar to comparing the image subspaces.<br />
In the rest of this chapter I explain the details of computing subspaces and estimating<br />
dynamical systems from image data. The procedures are demonstrated with the well-known<br />
image databases: the Yale Face, CMU-PIE, ETH-80, and IXMAS databases.<br />
3.2 Illumination subspaces in multi-lighting images<br />
Suppose we have a convex-shaped Lambertian object and a single distant light source<br />
illuminating the object. If we ignore the attached and the cast shadows on the object for<br />
now, then the observed intensity x (irradiance) for a surface patch is linearly related to the<br />
incident intensity of light s (radiance) by the Lambertian reflectance model<br />
x = ρ 〈b, s〉 ,<br />
where ρ is the albedo of the surface, b is the surface normal, and s is the light source vector.<br />
If the whole image x is a collection of D-pixel values, that is, x = [x1, ..., xD] ′ , then<br />
x = Bs,<br />
where B = [ρ1b1, ..., ρDbD]′ is the D × 3 matrix of albedos and surface normals. Thus,<br />
the set of images under all possible illuminations consists of all linear combinations of the<br />
column vectors of B,<br />
X = {Bs | s ∈ R³},<br />
which is at most a three-dimensional subspace. However, this is an unrealistic model, since<br />
it allows a negative light intensity.<br />
We get a more realistic model by removing negative intensity and also by allowing<br />
attached shadows as follows: xi = max(ρi ⟨bi, s⟩, 0), and therefore x = max(Bs, 0),<br />
where the max operation is performed for each row.<br />
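The two models can be contrasted numerically with a small sketch (synthetic B, not from the thesis): without the max, the images Bs span at most a three-dimensional subspace, while attached shadows break this linearity.

```python
import numpy as np

rng = np.random.RandomState(0)
D = 100                      # number of pixels
B = rng.randn(D, 3)          # synthetic albedo-scaled surface normals (D x 3)

def render(B, s):
    # Lambertian image with attached shadows: x_i = max(<b_i, s>, 0)
    return np.maximum(B @ s, 0.0)

# Without the max, x = B s: 50 random images span at most a 3-dim subspace.
X_lin = np.stack([B @ rng.randn(3) for _ in range(50)], axis=1)

# With attached shadows the images no longer lie in a 3-dim subspace.
X_shadow = np.stack([render(B, rng.randn(3)) for _ in range(50)], axis=1)

rank_lin = np.linalg.matrix_rank(X_lin)        # 3
rank_shadow = np.linalg.matrix_rank(X_shadow)  # much larger than 3
```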
An image under multiple distant light sources is the sum of the single-source images,<br />
x = Σk max(Bsk, 0). As can be seen from this equation, the set of such images under<br />
all illuminations forms a convex cone [8]. However, the dimensionality of the subspace the<br />
cone lies in can be as large as the number of pixels D in general, which is inconsistent with<br />
the empirical observations.<br />
Theoretical explanations of the low dimensionality have been offered in [8, 62, 61, 4]<br />
using spherical harmonics. Although the mathematics of the model is rather involved, the<br />
main idea can be summarized as follows. The interaction between a distant light source<br />
and a Lambertian surface is a convolution on the unit sphere. If we adopt a frequency-domain<br />
representation of the light distribution and the reflectance function, then the interaction<br />
of an arbitrary light distribution with the surface can be computed by multiplication of<br />
coefficients w.r.t. the spherical harmonics, analogous to Fourier analysis on the real line.<br />
Since the max operation can be well approximated by convolution with a low-pass<br />
filter, the resultant set of all possible illuminations can be expressed using only a few (4<br />
to 9) harmonic basis images. Figure 3.1 shows the analytically computed PCA from this<br />
model.<br />
Figure 3.1: The figure shows the first five principal components of a face, computed analytically<br />
from a 3D model (top) and a sphere (bottom). These images match well with<br />
the empirical principal components computed from a set of real images. The figure was<br />
captured from [61].<br />
In the following two subsections I introduce two well-known face databases and show<br />
the PCA results from the data to demonstrate the low-dimensionality of illumination-<br />
varying images.<br />
Yale Face Database<br />
It is often possible to acquire multi-view, multi-lighting images simultaneously with a spe-<br />
cial camera rig. The Yale face database and the Extended Yale face database [26] together<br />
consist of pictures of 38 subjects with 9 different poses and 45 different lighting conditions.<br />
The original images are grayscale, 640 × 480 in size, and include redundant background<br />
objects. I crop and align the face regions by manually choosing a few feature points<br />
(centers of the eyes and mouth, and the nose tip) for each image. The cropped images are resized<br />
to 32 × 28 pixels (D = 896) and normalized to have unit variance. Figure 3.2 shows the<br />
first 10 subjects and all 9 poses under a fixed illumination condition.<br />
Subject<br />
Pose<br />
Figure 3.2: Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed<br />
illumination condition.<br />
To compute subspaces, I use all 45 illumination conditions of a person under a fixed<br />
pose, as depicted in Figure 3.3. The m-dimensional orthonormal basis is computed<br />
from the Singular Value Decomposition (SVD) of this set of data, as follows.<br />
Let X = [x1, ..., xN] be the D × N data matrix pertaining to all illuminations of a person<br />
at a fixed pose, and let X = USV′ be the SVD of the data, where U′U = UU′ = ID,<br />
V′V = VV′ = IN, and S is a D × N matrix whose elements are zero except on the<br />
diagonal, diag(S) = [s1, s2, ..., s_min(D,N)]′. If the singular values are ordered as s1 ≥ s2 ≥<br />
... ≥ s_min(D,N), then the m-dimensional basis for this set is the first m columns of U. In<br />
the coming chapters I will use a range of values for m in the experiments. The SVD procedure<br />
above is the same as the PCA procedure except that the mean is not removed from the data.<br />
The role of the mean will be discussed further in Chapter 6. When the mean is ignored, the<br />
PCA eigenvalues are related to the singular values by λ1 = s1², λ2 = s2², and so on.<br />
A few of the orthonormal bases computed from the procedure are shown in Figure 3.4,<br />
along with the spectrum of the singular values.<br />
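The basis computation described above can be sketched as follows; the code is illustrative, with random data standing in for the 45 illumination images of one subject.

```python
import numpy as np

def subspace_basis(X, m):
    """m-dimensional orthonormal basis of the D x N data matrix X, from the
    SVD X = U S V' with s1 >= s2 >= ...; the mean is deliberately not removed,
    as in the text."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :m], s

# Toy stand-in for one subject/pose: D = 896 pixels, N = 45 illuminations.
rng = np.random.RandomState(0)
X = rng.randn(896, 45)
Y, s = subspace_basis(X, m=5)
# Y has orthonormal columns; with the mean ignored, the PCA eigenvalues
# relate to the singular values by lambda_k = s_k^2.
```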
CMU-PIE Database<br />
The CMU-PIE database is another multi-view, multi-lighting face database acquired with a<br />
camera rig. The database [72] consists of images from 68 subjects under 13 different poses<br />
and 43 different lighting conditions. Among the 43 lighting conditions I use 21 lighting<br />
conditions which have full pose variations.<br />
The original images are color, 640 × 480 in size, and include redundant background<br />
objects. I crop and align the face regions by manually choosing a few feature points<br />
(centers of the eyes and mouth, and the nose tip) for each image. Among the 13 poses I choose only<br />
7 poses and discard the 6 poses which are close to a profile view; this is done to facilitate<br />
the cropping process. The cropped images are resized to 24 × 21 pixels (D = 504) and<br />
normalized to have a unit variance. Figure 3.5 shows the first 10 subjects at 7 poses under<br />
a fixed illumination condition.<br />
To compute subspaces, I use all 21 illumination conditions of a person at a fixed pose<br />
(refer to Figure 3.6). The m-dimensional orthonormal basis is computed from the Singular<br />
Value Decomposition (SVD) of this set of data, similarly to the Yale Face database.<br />
A few of the orthonormal bases computed from the database are shown in Figure 3.7,<br />
along with the spectrum of the singular values.<br />
Figure 3.3: Yale Face Database: all illumination conditions of a person at a fixed pose used<br />
to compute the corresponding illumination subspace.<br />
[Plots: cumulative singular value curves for subspaces #1 through #6.]<br />
Figure 3.4: Yale Face Database: examples of basis images and (cumulative) singular values.<br />
Subject<br />
Pose<br />
Figure 3.5: CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed<br />
illumination condition.<br />
Figure 3.6: CMU-PIE Database: all illumination conditions of a person at a fixed pose used<br />
to compute the corresponding illumination subspace.<br />
[Plots: cumulative singular value curves for subspaces #1 through #6.]<br />
Figure 3.7: CMU-PIE Database: examples of basis images and (cumulative) singular values.<br />
3.3 Pose subspaces in multi-view images<br />
We have seen that illumination change can be approximated well by a linear subspace. The<br />
change in the pose of the object and/or the camera, however, is harder to analyze without<br />
knowing the 3D geometry of the object. Since we are often given 2D image data only and<br />
do not know the underlying geometry, it is useful to construct image-based models of an<br />
object or a face under pose changes.<br />
A popular multi-pose representation of images is the light-field representation, which<br />
models the radiance of light as a function of the 5D pose of the observer [29, 30, 95].<br />
Theoretically, the light-field model provides pose-invariant recognition of images taken<br />
with arbitrary camera and pose when the illumination condition is fixed. Zhou et al. ex-<br />
tended the light-field model to a bilinear model which allows simultaneous change of pose<br />
and illumination [95]. An alternative method is proposed in [34] which uses a generative<br />
model of a warped illumination subspace. Image variations due to illumination change are<br />
accounted for by a low-dimensional linear subspace, whereas variations due to pose change<br />
are approximated by a geometric warping of images in the subspace.<br />
The studies above indicate the nonlinearity of pose-varying images in general. However,<br />
the dimensionality of the images as a nonlinear manifold is rather small, since there<br />
are at most 6 degrees of freedom in the pose space (= E(3)). Therefore, when the range<br />
of the pose variation is limited, the nonlinear structure can be contained inside a low-<br />
dimensional subspace, and the nonlinear submanifolds can be distinguished by their enclosing<br />
subspaces. Although a general method of handling nonlinearity is possible and<br />
will be discussed in Section 5.2.4, here I use linear subspaces as the simplest model of the<br />
pose variations. The approximation by a subspace is demonstrated with the ETH-80 object<br />
database in the following subsection.<br />
Category<br />
Object<br />
Figure 3.8: ETH-80 Database: all categories and objects at a fixed pose.<br />
ETH-80 Database<br />
The ETH-80 [48] is an object database designed for object categorization tests under varying<br />
poses. The database consists of pictures of 8 object categories: ‘apple’, ‘pear’, ‘tomato’,<br />
‘cow’, ‘dog’, ‘horse’, ‘cup’, and ‘car’. Each category has 10 object instances that belong to the<br />
category, and each object is recorded under 41 different poses.<br />
There are several versions of the data. The one I use is in color and 256 × 256 in<br />
size. The images are resized to 32 × 32 pixels (D = 1024) and normalized to have a unit<br />
variance. Figure 3.8 shows the 8 categories under 10 poses at a fixed viewpoint.<br />
From the spectrum I can determine how good the m-dimensional approximation is by<br />
looking at the value at m. For example, if the 5-th cumulative singular value is 0.92, it<br />
means that the 5-dimensional subspace captures 92 percent of the total variation of the<br />
data (including the bias of the mean).<br />
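This reading of the spectrum can be sketched as follows; an illustrative convention using squared singular values as variances is assumed, and the exact normalization of the plotted curves may differ.

```python
import numpy as np

def cumulative_energy(s):
    # s: singular values in decreasing order (mean not removed, so the
    # total includes the bias of the mean). Entry m-1 gives the fraction
    # of total variation captured by the first m dimensions.
    e = s ** 2
    return np.cumsum(e) / np.sum(e)

s = np.array([10.0, 4.0, 2.0, 1.0, 0.5])   # hypothetical spectrum
frac = cumulative_energy(s)                # frac[-1] is exactly 1.0
```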
To compute subspaces, I use all 41 different poses of an object from a category as shown<br />
in Figure 3.9. The m-dimensional orthonormal basis is computed from the SVD of this set of<br />
data. A few of the orthonormal bases computed from the database are shown in Figure 3.10<br />
along with the spectrum of the singular values.<br />
Figure 3.9: ETH-80 Database: all poses of an object from a category used to compute the<br />
corresponding pose subspace of the object.<br />
[Plots: cumulative singular value curves for subspaces #1 through #6.]<br />
Figure 3.10: ETH-80 Database: examples of basis images and (cumulative) singular values.<br />
3.4 Video sequences of human motions<br />
Suppose we have a video sequence of a person performing an action. The sequence is more<br />
than just a set of images because of the temporal information contained in the sequence.<br />
To capture both the appearance and the temporal dynamics, we often use linear dynamical<br />
systems in modeling the sequence. In particular, the Auto-Regressive Moving Average<br />
(ARMA) model has been used to model moving human bodies or textures in computer vision<br />
[19, 80]. The ARMA model can be described as follows.<br />
Let y(t) be the D × 1 observation vector, and x(t) be the d × 1 internal state vector, for<br />
t = 1, ..., T . Then the states evolve according to the linear time-invariant dynamics:<br />
x(t + 1) = Ax(t) + v(t), (3.1)<br />
y(t) = Cx(t) + w(t), (3.2)<br />
where v(t) and w(t) are additive noises.<br />
A probabilistic version of the model assumes that the observation, the states, and the<br />
noise are Gaussian distributed with<br />
v(t) ∼ N (0, Q), w(t) ∼ N (0, R).<br />
This allows us to use statistical techniques such as Maximum Likelihood Estimation to<br />
infer the states and to estimate parameters only from the observed data y(1), ..., y(T ). The<br />
estimation problem is known as the system identification problem, and a good textbook on<br />
the topic is [52]. The method I use in this thesis is one of the simplest estimation methods<br />
and is described in the next section.<br />
For now, let us go back to the original question of comparing image sequences using<br />
the ARMA model. If we have the parameters Ai, Ci for each sequence i = 1, ..., N in<br />
the data, the simplest method of comparing two such sequences is to measure the sum of<br />
squared differences<br />
d²ij = ‖Ai − Aj‖²F + ‖Ci − Cj‖²F , (3.3)<br />
ignoring the noise statistics, which are of less importance.<br />
However, it is well-known that the parameters are not unique. If we change the basis<br />
for the state variables to define new state variables by ˆx = Gx, where G is any d × d<br />
nonsingular matrix, then the same system can be described with different coefficients such<br />
that<br />
x̂(t + 1) = Â x̂(t) + v̂(t)<br />
ŷ(t) = Ĉ x̂(t) + ŵ(t),<br />
where  = GAG−1 , Ĉ = CG−1 , ˆv = Gv, and ˆw = w. Unfortunately, the simple<br />
distance (3.3) is not invariant under the coordinate change. I will defer the discussion of<br />
other invariant distances for dynamical systems to the next two chapters, and proceed with<br />
the basic idea in this chapter.<br />
One of the coordinate-independent representations of the system is given by the infinite<br />
observability matrix [17]<br />
OC,A = [C′ (CA)′ (CA²)′ ... ]′, (3.4)<br />
which is the vertical concatenation of the matrices CAⁿ for n = 0, 1, 2, .... Note that after<br />
the coordinate change ˆx = Gx, the new observability matrix becomes<br />
OĈ,Â = [(CG⁻¹)′ (CAG⁻¹)′ (CA²G⁻¹)′ ... ]′ = OC,A G⁻¹, (3.5)<br />
which is the original observability matrix multiplied by G −1 on the right. This suggests<br />
that if we consider the linear span of the column vectors of O instead of the matrix O<br />
itself to represent the dynamical system, the representation is clearly invariant to the choice<br />
of G. This linear structure of a dynamical system is exactly what we are seeking: in the<br />
(infinite-dimensional) space of all possible ARMA models of the same size, each model<br />
of a sequence occupies the subspace spanned by the columns of O. In the next section I<br />
will introduce the IXMAS database and explain how I preprocess the data to compute this<br />
linear structure.<br />
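This invariance is easy to verify numerically. The sketch below (synthetic A, C, and G, not from the thesis) checks that the naive distance (3.3) changes under x̂ = Gx while the column span of the (truncated) observability matrix does not.

```python
import numpy as np

rng = np.random.RandomState(0)
D, d, n = 8, 3, 5
A = 0.5 * rng.randn(d, d)           # synthetic state dynamics
C = rng.randn(D, d)                 # synthetic observation matrix
G = rng.randn(d, d)                 # random change of basis (a.s. nonsingular)
Gi = np.linalg.inv(G)
A_hat, C_hat = G @ A @ Gi, C @ Gi   # same system, different coordinates

# The naive distance (3.3) is nonzero even though the system is the same:
d2 = np.linalg.norm(A - A_hat)**2 + np.linalg.norm(C - C_hat)**2

def observability(C, A, n):
    # Finite truncation of (3.4): stack C, CA, ..., CA^(n-1) vertically.
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

O1 = observability(C, A, n)
O2 = observability(C_hat, A_hat, n)
# O2 = O1 G^{-1}, so the column spans (hence the projectors) coincide:
Q1, _ = np.linalg.qr(O1)
Q2, _ = np.linalg.qr(O2)
```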
IXMAS Database<br />
The INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multiview video database<br />
for view-invariant human action recognition [89]. The database consists of 11 daily-life<br />
motions (‘check watch’, ‘cross arms’, ‘scratch head’, ‘sit down’, ‘get up’, ‘turn around’,<br />
‘walk’, ‘wave hand’, ‘punch’, ‘kick’, ‘pick up’), performed by 11 actors and repeated 3<br />
times. The motions are recorded by 5 calibrated and synchronized cameras at 23 fps at<br />
390 × 291 resolution. Figure 3.11 shows sample sequences of an actor performing the 11<br />
actions at a fixed view.<br />
The authors of [89] propose further processing of the database. The appearances of<br />
the actors such as clothes are irrelevant to actions, and therefore image silhouettes are<br />
computed to extract shapes from each camera. These silhouettes are combined to carve<br />
out the 3D visual hull of the actor represented by 3D occupancy volume data V (x, y, z) as<br />
shown in Figure 3.12.<br />
However, the actions performed by different actors still have a lot of variabilities as<br />
demonstrated in Figure 3.13.<br />
The variabilities irrelevant to action recognition include the following. First, the actors<br />
have different heights and body shapes, and therefore the volumes have to be resized<br />
along each axis. Second, the actors freely choose their position and orientation, and therefore the<br />
volumes have to be centered and reoriented. The resizing and centering can be done by<br />
computing the center of mass and second-order moments of the volumes and then standardizing<br />
the volumes. However, the orientation variability requires further processing.<br />
The authors suggest changing the Cartesian coordinate system V (x, y, z) to the cylindrical<br />
coordinate system V (r, θ, z) and then performing a 1D circular Fourier Transform along the<br />
θ axis to get F F T (V (r, θ, z)). By taking only the magnitude of the transform, the resultant<br />
feature abs(F F T (V (r, θ, z))) becomes rotation-invariant around the z-axis. The resultant<br />
feature of the 3D volume is a D = 16384 = 32³/2-dimensional vector. Note that this FFT<br />
is computed per frame and is not to be confused with a temporal FFT along the frames.<br />
Figure 3.14 shows a sample snapshot of the processed features.<br />
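The rotation invariance of this feature can be illustrated with a toy volume (illustrative code): a rotation about the z-axis is a circular shift along the θ axis, which leaves the FFT magnitude unchanged.

```python
import numpy as np

rng = np.random.RandomState(0)
R, Theta, Z = 8, 16, 8
V = rng.rand(R, Theta, Z)       # toy volume in cylindrical coordinates V(r, theta, z)

def rotation_invariant_feature(V):
    # 1D FFT along the theta axis, magnitude only (per frame, not temporal).
    return np.abs(np.fft.fft(V, axis=1))

# A rotation about the z-axis is a circular shift along theta;
# the FFT magnitude is unchanged by such shifts:
V_rot = np.roll(V, shift=5, axis=1)
f1 = rotation_invariant_feature(V)
f2 = rotation_invariant_feature(V_rot)
```

The thesis keeps only half of the resulting coefficients (hence D = 32³/2), which is consistent with the symmetry of the magnitude spectrum of a real signal.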
ARMA model of data<br />
Once the features are computed for each action, actor and frame, we can proceed to model<br />
the feature sequences using the ARMA model.<br />
I estimate the parameters using a fast approximate method based on the SVD of the<br />
observed data [19]. Let USV ′ = [y(1), ..., y(T )] be the SVD of the data. Then, the parameters<br />
C, A, and the states x(1), ..., x(T ) are sequentially estimated by<br />
C̃ = U, x̃(t) = C̃′y(t),<br />
Ã = arg min_A Σ_{t=1}^{T−1} ‖x̃(t + 1) − Ax̃(t)‖².<br />
I used d = 5 as the dimension of the state space.<br />
The estimated Ai and Ci matrices for each sequence i = 1, ..., N are used to form a<br />
finite observability matrix of size (Dd) × d:<br />
Oi = [Ci′ (CiAi)′ ... (CiAi^{d−1})′]′.<br />
A total of 363 = 11 (actions) × 3 (trials) × 11 (actors) observability matrices are computed as<br />
the final subspace representation of the database.<br />
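The estimation and the construction of Oi can be sketched as follows (illustrative code, with random data standing in for the FFT feature sequences).

```python
import numpy as np

def estimate_arma(Y, d):
    """Approximate ARMA parameter estimation from the SVD of the data [19].
    Y: D x T matrix of observations y(1), ..., y(T); d: state dimension."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                         # C-tilde = first d columns of U
    X = C.T @ Y                          # states x(t) = C' y(t)
    # A-tilde = argmin_A sum_t ||x(t+1) - A x(t)||^2, via least squares:
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return C, A

def observability(C, A, d):
    # Finite observability matrix O = [C' (CA)' ... (CA^(d-1))']' of size (Dd) x d.
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(d)])

rng = np.random.RandomState(0)
Y = rng.randn(50, 30)                    # toy sequence: D = 50 features, T = 30 frames
C, A = estimate_arma(Y, d=5)
O = observability(C, A, 5)
```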
3.5 Conclusion<br />
In this chapter I aimed to provide motivations for subspace representation with examples<br />
from image databases, which range from illumination-varying faces to video sequences<br />
modeled as dynamical systems. The procedures for computing subspaces from these databases<br />
were described.<br />
The goal of the subspace-based learning approach is to use this inherent linear structure<br />
to emphasize the desired information and to de-emphasize the unwanted variations in the<br />
data. This approach translates to 1) illumination-invariant face recognition for the Yale<br />
Face and CMU-PIE databases, 2) pose-invariant object categorization with the ETH-80<br />
database, and 3) video-based action recognition with the IXMAS database. However,<br />
I add a caveat that the invariant recognition problems above are different from the more<br />
general problem of recognizing a single test image, since at least a few test images are<br />
required to reliably compute the subspace.<br />
In the next three chapters, I will use the computed subspaces from the databases to test<br />
various algorithms for subspace-based learning.<br />
Check watch<br />
Cross arms<br />
Scratch head<br />
Sit down<br />
Get up<br />
Turn around<br />
Walk<br />
Wave hand<br />
Punch<br />
Kick<br />
Pick up<br />
T=1, 2, 3, ...<br />
Figure 3.11: IXMAS Database: video sequences of an actor performing 11 different actions<br />
viewed from a fixed camera.<br />
Figure 3.12: IXMAS Database: the 3D occupancy volume of an actor at one time frame.<br />
The volume is initially computed in the Cartesian coordinate system and later converted to<br />
the cylindrical coordinate system to apply the FFT.<br />
Subj 1<br />
Subj 2<br />
Subj 3<br />
Subj 4<br />
Subj 5<br />
Subj 6<br />
Subj 7<br />
Subj 8<br />
Subj 9<br />
Subj 10<br />
T=1, 2, 3, ...<br />
Figure 3.13: IXMAS Database: the ‘kick’ action performed by 11 actors. Each sequence<br />
has a different kick style as well as a different body shape and height.<br />
V(r,θ,z)<br />
abs(FFT(V(r,θ,z)))<br />
Figure 3.14: IXMAS Database: cylindrical coordinate representation of the volume<br />
V (r, θ, z), and the corresponding 1D FFT feature abs(F F T (V (r, θ, z))), shown at a few<br />
values of θ.<br />
Chapter 4<br />
<strong>GRASSMANN</strong> MANIFOLDS AND<br />
<strong>SUBSPACE</strong> DISTANCES<br />
4.1 Introduction<br />
In the previous chapter, I discussed the examples of linear subspace structures found in<br />
the real-world data. In this chapter I introduce the Grassmann manifold as the common<br />
framework of subspace-based learning algorithms. While a subspace is certainly a linear<br />
space, the collection of linear subspaces is a totally different space of its own, which is<br />
known as the Grassmann manifold. The Grassmann manifold, named after the renowned<br />
mathematician Hermann Günther Grassmann (1809–1877), has long been known for its<br />
intriguing mathematical properties, and as an example of homogeneous spaces of Lie groups<br />
[86, 13]. However, its applications in computer science and engineering have appeared<br />
rather recently; in signal processing and control [74, 36, 6], numerical optimization [20]<br />
(and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78].<br />
Moreover, many works have used the subspace concept without explicitly relating their<br />
work to this mathematical object [92, 64, 24, 90, 85, 43, 3]. One of the goals of this thesis<br />
is to provide a unified view of the subspace-based algorithms in the framework of the<br />
Grassmann manifold.<br />
In this chapter I define the Grassmann distance which provides a measure of (dis)similarity<br />
of subspaces, and review the known distances including the Arc-length, Projection, Binet-<br />
Cauchy, Max Corr, Min Corr, and Procrustes distances. Some of these distances have been<br />
studied in [20, 16], and I provide a more thorough analysis and proofs in this chapter.<br />
Furthermore, these distances will be used in conjunction with a k-NN algorithm to demonstrate<br />
their potential in classification tasks using the databases prepared in the previous chapter.<br />
4.2 Stiefel and Grassmann manifolds<br />
In this section I introduce the Stiefel and the Grassmann manifolds by summarizing nec-<br />
essary definitions and properties of these manifolds from [28, 20, 16]. Although these<br />
manifolds are not linear spaces, I introduce these manifolds as subsets of Euclidean spaces<br />
and use matrix representations. This helps to understand the nonlinear spaces intuitively<br />
and also facilitates computations on these spaces.<br />
4.2.1 Stiefel manifold<br />
Let Y be a D × m matrix whose elements are real numbers. In optimization problems with<br />
the matrix variable Y , we often formulate the notion of normality by an orthonormality<br />
condition Y ′Y = Im.¹ This feasible set is neither linear nor convex, and in fact is the Stiefel<br />
manifold defined as follows:<br />
Definition 4.1. An m-frame is a set of m orthonormal vectors in R^D (m ≤ D). The Stiefel<br />
manifold S(m, D) is the set of m-frames in R^D.<br />
¹ Although the term ‘orthogonal’ is the more standard one for this condition, I use the term ‘orthonormal’<br />
to clarify that each column of Y has unit length.<br />
The Stiefel manifold S(m, D) is represented by the set of D × m matrices Y such that<br />
Y ′Y = Im. Therefore we can rewrite it as<br />
S(m, D) = {Y ∈ R^{D×m} | Y ′Y = Im}.<br />
There are D × m variables in Y and m(m + 1)/2 independent conditions in the constraint<br />
Y ′Y = Im. Hence S(m, D) is an analytical manifold of dimension Dm − m(m + 1)/2 =<br />
m(2D − m − 1)/2.<br />
For m = 1, S(1, D) is the familiar unit sphere in R^D, and for m = D, S(D, D)<br />
is the orthogonal group O(D) of D × D orthogonal matrices.<br />
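As a quick numerical aside (not from the thesis), a point on S(m, D) can be drawn by orthonormalizing a random Gaussian matrix with a QR decomposition.

```python
import numpy as np

def random_stiefel(D, m, rng):
    # A point on S(m, D): orthonormalize a random D x m Gaussian matrix by QR.
    Q, _ = np.linalg.qr(rng.randn(D, m))
    return Q

rng = np.random.RandomState(0)
Y = random_stiefel(10, 3, rng)
# Y is an m-frame: its columns are orthonormal, so Y'Y = I_m.
```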
The Stiefel manifold can also be thought of as the quotient space<br />
S(m, D) = O(D)/O(D − m),<br />
under the right-multiplication by orthonormal matrices. To see this point, let X = [Y | Y ⊥ ] ∈<br />
O(D) be a representer of Y ∈ S(m, D), where the first m columns form the m-frame we<br />
care about and Y ⊥ is any D × (D − m) matrix such that Y Y ′ + Y ⊥ (Y ⊥ ) ′ = ID. Then<br />
the only subgroup of O(D) which leaves the m-frame unchanged, is the set of the block-<br />
diagonal matrix diag(Im, RD−m) where RD−m is any matrix in O(D − m). That is, the<br />
m-frame of X after the right multiplication<br />
X diag(Im, RD−m) = [Y | Y⊥] diag(Im, RD−m) = [Y | Y⊥RD−m]<br />
remains the same as the m-frame of X.<br />
4.2.2 Grassmann manifold<br />
The Grassmann manifold is a mathematical object with several similarities to the Stiefel<br />
manifold. In optimization problems with a matrix variable Y , we occasionally have a cost<br />
function which is affected only by span(Y ) – the linear subspace spanned by the column<br />
vectors of Y – and not by the specific values of Y . Such a condition leads to the concept of<br />
the Grassmann manifold defined as follows:<br />
Definition 4.2. The Grassmann manifold G(m, D) is the set of m-dimensional linear sub-<br />
spaces of the R D .<br />
For a Euclidean representation of the manifold, consider the space R^{(0)}_{D,m} of all D × m<br />
matrices Y ∈ R^{D×m} of full rank m, and consider the group of transformations Y → Y L,<br />
where L is any nonsingular m × m matrix. The group defines an equivalence relation<br />
in R^{(0)}_{D,m}: two elements Y1, Y2 ∈ R^{(0)}_{D,m} are equivalent if span(Y1) = span(Y2). Hence<br />
the equivalence classes of R^{(0)}_{D,m} are in one-to-one correspondence with the points of the<br />
Grassmann manifold G(m, D), and G(m, D) is thought of as the quotient space<br />
G(m, D) = R^{(0)}_{D,m} / R^{(0)}_{m,m}.<br />
The G(m, D) is an analytical manifold of dimension Dm − m² = m(D − m), since for<br />
each Y regarded as a point in R^{Dm}, the set of all elements Y L in the equivalence class is a<br />
surface in R^{Dm} of dimension m².<br />
The special case m = 1 is called the real projective space RP D−1 which consists of all<br />
lines through the origin.<br />
The Grassmann manifold can also be thought of as the quotient space<br />
G(m, D) = O(D)/O(m) × O(D − m),<br />
under the right-multiplication by orthonormal matrices. To see this, let X = [Y | Y⊥] ∈<br />
O(D) be a representer of Y ∈ G(m, D), where we only care about the span of the first<br />
m columns and Y⊥ is any D × (D − m) matrix such that Y Y ′ + Y⊥(Y⊥)′ = ID. Then<br />
the only subgroup of O(D) which leaves this span unchanged is the set of the block-diagonal<br />
matrices diag(Rm, RD−m), where Rm and RD−m are any two matrices in O(m)<br />
and O(D − m) respectively. That is, the span of the first m columns of X after the right<br />
multiplication<br />
X diag(Rm, RD−m) = [Y | Y⊥] diag(Rm, RD−m) = [Y Rm | Y⊥RD−m]<br />
is the same as the span of the first m columns of X.<br />
From the quotient space representations, we see that G(m, D) = S(m, D)/O(m). This<br />
is the representation I use throughout the thesis. To summarize, an element of G(m, D) is<br />
represented by an orthonormal matrix Y ∈ R D×m such that Y ′ Y = Im, with the equiva-<br />
lence relation:<br />
Definition 4.3. Y1 ∼ = Y2 if and only if span(Y1) = span(Y2).<br />
We can also write the equivalence relation as<br />
Corollary 4.4. Y1 ∼ = Y2 if and only if Y1 = Y2Rm for some orthonormal matrix Rm ∈<br />
O(m).<br />
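Corollary 4.4 can be checked numerically with a small sketch: right-multiplying an orthonormal basis by any Rm ∈ O(m) changes the matrix but not the point on G(m, D), as seen from the projection matrix Y Y′ (illustrative code).

```python
import numpy as np

rng = np.random.RandomState(0)
D, m = 10, 3
Y1, _ = np.linalg.qr(rng.randn(D, m))   # orthonormal representer of a subspace

# Right-multiplying by any Rm in O(m) changes the basis but not the span:
R, _ = np.linalg.qr(rng.randn(m, m))    # a random orthonormal m x m matrix
Y2 = Y1 @ R
# Y1 and Y2 are different matrices but represent the same point on G(m, D),
# which can be seen from their equal projection matrices Y Y'.
```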
In this thesis I use geometry more general than the Riemannian geometry of the Grassmann<br />
manifold, and do not discuss this subject further. I refer the interested readers to<br />
[86, 13, 20, 16, 2] for further reading.<br />
4.2.3 Principal angles and canonical correlations<br />
A canonical distance between two subspaces is the Riemannian distance, which is the<br />
length of the geodesic path connecting the two corresponding points on the Grassmann<br />
manifold. However, there is a more intuitive and computationally efficient way of defining<br />
distances using the principal angles [28]. I define the principal angles / canonical correla-<br />
tions as follows:<br />
Definition 4.5. Let Y1 and Y2 be two orthonormal matrices of size D by m. The princi-<br />
pal angles 0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2 between two subspaces span(Y1) and span(Y2), are<br />
defined recursively by<br />
cos θk = max_{uk ∈ span(Y1)} max_{vk ∈ span(Y2)} uk′vk, subject to<br />
uk′uk = 1, vk′vk = 1, uk′ui = 0, vk′vi = 0 (i = 1, ..., k − 1).<br />
The first principal angle θ1 is the smallest angle between a pair of unit vectors each<br />
from the two subspaces. The cosine of the principal angle is the first canonical correlation<br />
[39]. The k-th principal angle and canonical correlation are defined recursively. It is known<br />
[91, 20] that the principal angles are related to the geodesic (=arc length) distance as shown<br />
in Figure 4.1 by<br />
d²Arc(Y1, Y2) = Σi θi². (4.1)<br />
To compute the principal angles, we need not directly solve the maximization problem.<br />
Instead, the principal angles can be computed from the Singular Value Decomposition<br />
(SVD) of the product of the two matrices Y1′Y2,<br />
Y1′Y2 = USV′, (4.2)<br />
Figure 4.1: Principal angles and Grassmann distances. Let span(Yi) and span(Yj) be two subspaces in the Euclidean space R^D, shown on the left. The distance between the two subspaces span(Yi) and span(Yj) can be measured using the principal angles θ = [θ1, ..., θm]′. From the Grassmann manifold viewpoint, the subspaces span(Yi) and span(Yj) are considered as two points on the manifold G(m, D), whose Riemannian distance is related to the principal angles by d(Yi, Yj) = ‖θ‖_2. Various distances can be defined based on the principal angles.
where U = [u1 ... um], V = [v1 ... vm], and S is the diagonal matrix S = diag(cos θ1, ..., cos θm). The proof can be found on p. 604 of [28]. The principal angles form a non-decreasing sequence

0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2,

and consequently the canonical correlations form a non-increasing sequence

1 ≥ cos θ1 ≥ · · · ≥ cos θm ≥ 0.

Although the definition of principal angles can be extended to the case where Y1 and Y2 have different numbers of columns, I assume Y1 and Y2 have the same size D by m throughout this thesis.
4.3 Grassmann distances for subspaces

In this section I introduce a few subspace distances that have appeared in the literature, and analyze them in terms of the principal angles.
I use the term distance for any assignment of nonnegative values to each pair of points in the data space. A valid metric, however, is a distance that additionally satisfies the axioms of Definition 2.9. Furthermore, a distance (or a metric) between subspaces has to be invariant under different basis representations. A distance that satisfies this condition is referred to as a Grassmann distance (or metric):

Definition 4.6. Let d : R^{D×m} × R^{D×m} → R be a distance function. The function d is a Grassmann distance if d(Y1, Y2) = d(Y1R1, Y2R2), ∀R1, R2 ∈ O(m).
4.3.1 Projection distance

The Projection distance is defined as

dProj(Y1, Y2) = ( Σ_{i=1}^{m} sin²θi )^{1/2} = ( m − Σ_{i=1}^{m} cos²θi )^{1/2},  (4.3)

which is the 2-norm of the sines of the principal angles [20, 85].
An interesting property of this metric is that it can be computed from the product Y1′Y2 alone, whose importance will be revealed in the next chapter. From the relationship between the principal angles and the SVD of Y1′Y2 in (4.2) we get

dProj²(Y1, Y2) = m − Σ_{i=1}^{m} cos²θi = m − ‖Y1′Y2‖_F² = 2⁻¹ ‖Y1Y1′ − Y2Y2′‖_F²,  (4.4)

where ‖ · ‖_F is the matrix Frobenius norm

‖A‖_F² = Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}²,  A ∈ R^{m×n}.

The Projection distance is a Grassmann distance since it is invariant to different representations, as is easily seen from (4.4). Furthermore, the distance is a metric:
Lemma 4.7. The Projection distance dProj is a Grassmann metric.

Proof. Nonnegativity, symmetry, and the triangle inequality follow naturally from ‖ · ‖_F being a matrix norm. The remaining condition to be shown is the necessary and sufficient condition

‖Y1Y1′ − Y2Y2′‖_F = 0 ⟺ span(Y1) = span(Y2).

Since ‖ · ‖_F is a matrix norm, ‖Y1Y1′ − Y2Y2′‖_F = 0 ⟺ Y1Y1′ = Y2Y2′. The proof of the next step, Y1Y1′ = Y2Y2′ ⟺ span(Y1) = span(Y2), is also simple and is given in the proof of Theorem 5.2.
4.3.2 Binet-Cauchy distance

The Binet-Cauchy distance is defined as

dBC(Y1, Y2) = ( 1 − Π_i cos²θi )^{1/2},  (4.5)

which involves the product of the canonical correlations [90, 83]. This distance can also be computed from the product Y1′Y2 alone. From the relationship between the principal angles and the SVD of Y1′Y2 in (4.2) we get

dBC²(Y1, Y2) = 1 − Π_i cos²θi = 1 − det(Y1′Y2)².  (4.6)

The Binet-Cauchy distance is also invariant under different representations, and furthermore is a metric:

Lemma 4.8. The Binet-Cauchy distance dBC is a Grassmann metric.

The proof of the lemma is trivial once Theorem 5.4 is proved later.
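A similar numerical check applies to (4.6). In this NumPy sketch (again with my own helper names), the determinant of Y1′Y2 equals the product of its singular values up to sign, so its square is the product of squared canonical correlations:

```python
import numpy as np

def d_bc(Y1, Y2):
    """Binet-Cauchy distance via (4.6): d^2 = 1 - det(Y1'Y2)^2."""
    return np.sqrt(max(1.0 - np.linalg.det(Y1.T @ Y2)**2, 0.0))

rng = np.random.default_rng(2)
D, m = 8, 3
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

# det(Y1'Y2)^2 equals the product of the squared canonical correlations,
# since the singular values of Y1'Y2 are cos(theta_i).
s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
assert np.allclose(d_bc(Y1, Y2)**2, 1.0 - np.prod(s**2), atol=1e-8)
assert abs(d_bc(Y1, Y1)) < 1e-8   # zero distance from a subspace to itself
```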
There is an interesting relationship between this distance and the Martin distance in control theory [55]. Martin proposed a metric between two ARMA processes based on the cepstra of the models, which was later shown to have the following form [17]:

dM(O1, O2)² = − log Π_{i=1}^{m} cos²θi,

where O1 and O2 are the infinite observability matrices explained in the previous chapter, formed by stacking the blocks vertically:

O1 = [C1; C1A1; C1A1²; · · ·],  and  O2 = [C2; C2A2; C2A2²; · · ·].

Consequently, the Binet-Cauchy distance is directly related to the Martin distance by the following:

dBC²(span(O1), span(O2)) = 1 − exp(−dM(O1, O2)²).  (4.7)

4.3.3 Max Correlation
The Max Correlation distance is defined as

dMaxCor(Y1, Y2) = (1 − cos²θ1)^{1/2} = sin θ1,  (4.8)

which is based on the largest canonical correlation cos θ1 (or the smallest principal angle θ1). The max correlation is an intuitive measure of closeness between two subspaces and was often used in previous work [92, 64, 24]. It is a Grassmann distance. However, it is not a metric and therefore has some limitations. For example, two distinct subspaces span(Y1) and span(Y2) have a zero distance dMaxCor = 0 whenever they intersect anywhere other than at the origin.
4.3.4 Min Correlation

The Min Correlation distance is defined as

dMinCor(Y1, Y2) = (1 − cos²θm)^{1/2} = sin θm.  (4.9)

The min correlation is conceptually the opposite of the max correlation, in that it is based on the smallest canonical correlation (or the largest principal angle). This distance is also closely related to the Projection distance. Previously I rewrote the Projection distance as dProj = 2^{−1/2} ‖Y1Y1′ − Y2Y2′‖_F. The min correlation can similarly be written as ([20])

dMinCor = ‖Y1Y1′ − Y2Y2′‖_2,  (4.10)

where ‖ · ‖_2 is the matrix 2-norm:

‖A‖_2 = max_{x ≠ 0} ‖Ax‖_2 / ‖x‖_2,  A ∈ R^{m×n}.

The proof can be found on p. 75 of [28]. This distance is also a metric:

Lemma 4.9. The Min Correlation distance dMinCor is a Grassmann metric.

The proof is almost the same as that for the Projection distance, with ‖ · ‖_F replaced by ‖ · ‖_2, and is omitted.
4.3.5 Procrustes distance

The Procrustes distance is defined as

dProc1(Y1, Y2) = 2 ( Σ_{i=1}^{m} sin²(θi/2) )^{1/2},  (4.11)

which is twice the vector 2-norm of [sin(θ1/2), ..., sin(θm/2)]′. There is an alternative definition: the Procrustes distance is the minimum Euclidean distance between different representations of the two subspaces span(Y1) and span(Y2) ([20, 16]):

dProc1(Y1, Y2) = min_{R1,R2 ∈ O(m)} ‖Y1R1 − Y2R2‖_F = ‖Y1U − Y2V‖_F,  (4.12)

where U and V are from (4.2). Let us first check that the second equality holds.
Proof. First note that

min_{R1,R2 ∈ O(m)} ‖Y1R1 − Y2R2‖ = min_{R1,R2 ∈ O(m)} ‖Y1 − Y2R2R1′‖ = min_{Q ∈ O(m)} ‖Y1 − Y2Q‖,  (4.13)

for R1, R2, Q ∈ O(m). This also holds for ‖ · ‖_2. Using this equality, we have

min_{R1,R2 ∈ O(m)} ‖Y1R1 − Y2R2‖_F² = min_{Q ∈ O(m)} ‖Y1 − Y2Q‖_F²
= min_Q tr(Y1′Y1 + Y2′Y2 − Y1′Y2Q − Q′Y2′Y1)
= 2m − 2 max_Q tr(Y1′Y2Q).

However, tr(Y1′Y2Q) = tr(USV′Q) = tr(ST), where T = V′QU is another orthonormal matrix. Since S is diagonal, tr(ST) = Σ_i S_ii T_ii ≤ Σ_i S_ii, and the maximum is achieved for T = Im, or equivalently Q = V U′. Hence

min_{R1,R2 ∈ O(m)} ‖Y1R1 − Y2R2‖_F = ‖Y1 − Y2V U′‖_F = ‖Y1U − Y2V‖_F.
Finally, let us prove the equivalence of the two definitions (4.11) and (4.12).

Lemma 4.10.

‖Y1U − Y2V‖_F = 2 ( Σ_{i=1}^{m} sin²(θi/2) )^{1/2}.

Proof. Let A = Y1U − Y2V. Using U′Y1′Y1U = V′Y2′Y2V = Im and U′Y1′Y2V = S from (4.2), expand the squared norm:

‖A‖_F² = tr(A′A) = tr(Im) + tr(Im) − 2 tr(S) = Σ_{i=1}^{m} 2(1 − cos θi) = Σ_{i=1}^{m} 4 sin²(θi/2),

which gives the desired result.

The Procrustes distance is also called the chordal distance [20]. The author of [20] also suggests another version of the Procrustes distance, using the matrix 2-norm:

dProc2(Y1, Y2) = ‖Y1U − Y2V‖_2 = 2 sin(θm/2).  (4.14)

Let us check the validity of this definition:
Proof. Let A = Y1U − Y2V. Since U′Y1′Y1U = V′Y2′Y2V = Im and U′Y1′Y2V = S, we have A′A = 2Im − 2S. From the definition of the matrix 2-norm,

‖A‖_2² = max_{‖x‖=1} ‖Ax‖_2² = max_{‖x‖=1} x′(2Im − 2S)x = max_{‖x‖=1} Σ_i 4 sin²(θi/2) x_i².

Since sin²(θ1/2) ≤ · · · ≤ sin²(θm/2), the sum is maximized for (x1, ..., xm) = (0, ..., 0, 1), and therefore

‖Y1U − Y2V‖_2 = 2 sin(θm/2).
Note that this version of the Procrustes distance has an immediate relationship with the min correlation distance:

dMinCor²(Y1, Y2) = sin²θm = 1 − (1 − 2 sin²(θm/2))² = 1 − (1 − 2⁻¹ dProc2²(Y1, Y2))².  (4.15)

Since the function f(x) = (1 − (1 − x²/2)²)^{1/2} is a non-decreasing transform of the distance for 0 ≤ x ≤ 2, the two distances are expected to behave similarly, although not in exactly the same manner.
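The identities (4.11), (4.12), (4.14), and (4.15) can all be verified numerically. The following NumPy sketch assumes nothing beyond the SVD relationship (4.2):

```python
import numpy as np

rng = np.random.default_rng(3)
D, m = 8, 3
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

# SVD of Y1'Y2 gives U, V, and the principal angles (in non-decreasing order)
U, s, Vt = np.linalg.svd(Y1.T @ Y2)
theta = np.arccos(np.clip(s, 0, 1))

# dProc1: the Frobenius form (4.12) agrees with the angle form (4.11)
d_proc1 = np.linalg.norm(Y1 @ U - Y2 @ Vt.T, 'fro')
assert np.allclose(d_proc1, 2 * np.linalg.norm(np.sin(theta / 2)), atol=1e-8)

# dProc2: the 2-norm form (4.14) uses only the largest angle theta_m
d_proc2 = np.linalg.norm(Y1 @ U - Y2 @ Vt.T, 2)
assert np.allclose(d_proc2, 2 * np.sin(theta[-1] / 2), atol=1e-8)

# Monotone relation (4.15) to the min correlation distance
d_mincor = np.sin(theta[-1])
assert np.allclose(d_mincor**2, 1 - (1 - d_proc2**2 / 2)**2, atol=1e-8)
```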
By definition, both versions of the Procrustes distance are invariant under different representations, and furthermore are valid metrics:

Lemma 4.11. The Procrustes distances dProc1 and dProc2 are Grassmann metrics.

Proof. Nonnegativity and symmetry are immediate. For the triangle inequality, use the equality (4.13) to show that

dProc(Y1, Y2) + dProc(Y2, Y3) = min_{R1,R2 ∈ O(m)} ‖Y1R1 − Y2R2‖ + min_{R2,R3 ∈ O(m)} ‖Y2R2 − Y3R3‖
= min_{Q1 ∈ O(m)} ‖Y1Q1 − Y2‖ + min_{Q3 ∈ O(m)} ‖Y2 − Y3Q3‖
= min_{Q1,Q3 ∈ O(m)} { ‖Y1Q1 − Y2‖ + ‖Y2 − Y3Q3‖ }
≥ min_{Q1,Q3 ∈ O(m)} ‖Y1Q1 − Y3Q3‖ = dProc(Y1, Y3).

The remaining condition to show is the necessary and sufficient condition

‖Y1U − Y2V‖ = 0 ⟺ span(Y1) = span(Y2).

Since ‖ · ‖ is a matrix norm, ‖Y1U − Y2V‖ = 0 ⟺ Y1U = Y2V. The proof of Y1U = Y2V ⟺ span(Y1) = span(Y2) is similar to the case of the Projection distance and is omitted.
4.3.6 Comparison of the distances

Table 4.1 summarizes the distances introduced so far. When these distances are used for a learning task, the choice of the most appropriate distance depends on several factors.

The first factor is the distribution of the data. Since the distances are defined from particular functions of the principal angles, the best distance depends highly on the probability distribution of the principal angles of the given data. For example, the max correlation dMaxCor uses only the smallest principal angle θ1, and therefore can serve as a robust distance when
Table 4.1: Summary of the Grassmann distances. The distances can be defined as simple functions of both the bases Y and the principal angles θi, except for the arc length, which involves matrix exponentials.

                 Arc Length   Projection                 Binet-Cauchy
d²(Y1, Y2)       ·            2⁻¹‖Y1Y1′ − Y2Y2′‖_F²      1 − det(Y1′Y2)²
In terms of θ    Σ θi²        Σ sin²θi                   1 − Π cos²θi
Is a metric?     Yes          Yes                        Yes

                 Max Corr          Min Corr               Procrustes 1       Procrustes 2
d²(Y1, Y2)       1 − ‖Y1′Y2‖_2²    ‖Y1Y1′ − Y2Y2′‖_2²     ‖Y1U − Y2V‖_F²     ‖Y1U − Y2V‖_2²
In terms of θ    sin²θ1            sin²θm                 4 Σ sin²(θi/2)     4 sin²(θm/2)
Is a metric?     No                Yes                    Yes                Yes
the subspaces are highly scattered and noisy, whereas the min correlation dMinCor uses only the largest principal angle θm, and therefore is not a sensible choice in that case. On the other hand, when the subspaces are concentrated and have nonzero intersections, dMaxCor will be close to zero for most of the data, and dMinCor may be more discriminative. The second Procrustes distance dProc2 is also expected to behave similarly to dMinCor, since it too uses only the largest principal angle; moreover, dMinCor and dProc2 are directly related by (4.15). The arc length dArc, the Projection distance dProj, and the first Procrustes distance dProc1 use all the principal angles. Therefore they have characteristics intermediate between dMaxCor and dMinCor, and will be useful for a wider range of data distributions. The Binet-Cauchy distance dBC also uses all the principal angles, but it behaves similarly to dMinCor for scattered subspaces, since the distance attains its maximum value (= 1) if at least one of the principal angles is π/2, due to the product form of dBC.
The second criterion for choosing a distance is the degree of structure it carries. Without any structure, a distance can be used only with a simple K-Nearest Neighbor (K-NN) algorithm for classification. When a distance has extra structure such as the triangle inequality, for example, we can speed up nearest neighbor searches by estimating lower and upper bounds on unknown distances [23]. From this point of view, the max correlation dMaxCor is not a metric and, unlike the rest of the distances, may not be used with more sophisticated algorithms.
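As a small illustration of such metric-based pruning (a sketch of the general idea, not the specific method of [23]), the triangle inequality yields the lower bound |d(q, p) − d(p, x)| ≤ d(q, x) for a fixed pivot p, which lets a 1NN search skip candidates without evaluating their distance to the query:

```python
import numpy as np

rng = np.random.default_rng(7)
D, m = 8, 2
rand_Y = lambda: np.linalg.qr(rng.standard_normal((D, m)))[0]

def d_proj(Y1, Y2):
    """Projection metric (4.4); any Grassmann metric would work here."""
    return np.sqrt(max(Y1.shape[1] - np.linalg.norm(Y1.T @ Y2, 'fro')**2, 0.0))

def nn_with_pruning(query, points, pivot):
    """1NN search sketch: |d(q, pivot) - d(pivot, x)| <= d(q, x) by the
    triangle inequality, so a candidate x can be discarded whenever this
    lower bound already exceeds the best distance found so far.
    (In practice d(pivot, x) would be precomputed offline.)"""
    d_qp = d_proj(query, pivot)
    best, best_d, evaluated = None, np.inf, 0
    for x in points:
        if abs(d_qp - d_proj(pivot, x)) >= best_d:
            continue                      # pruned by the lower bound
        evaluated += 1
        dx = d_proj(query, x)
        if dx < best_d:
            best, best_d = x, dx
    return best, best_d, evaluated

points = [rand_Y() for _ in range(50)]
query, pivot = rand_Y(), rand_Y()
best, best_d, n_eval = nn_with_pruning(query, points, pivot)

# The pruned search returns the same nearest-neighbor distance as brute force.
brute = min(d_proj(query, x) for x in points)
assert np.allclose(best_d, brute, atol=1e-10)
```

The pruning is safe precisely because dProj satisfies the triangle inequality; with a non-metric such as dMaxCor the bound would not be valid.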
4.4 Experiments

In this section I make empirical comparisons of the Grassmann distances discussed so far, by using them for classification tasks with real image databases.

4.4.1 Experimental setting

In this section I use the subspaces computed from the four databases Yale Face, CMU-PIE, ETH-80, and IXMAS, and compare the performances of simple 1NN classifiers using the Grassmann distances.

The training and test sets are prepared by N-fold cross validation as follows. For the Yale Face and CMU-PIE databases, I hold out the subspaces corresponding to a particular pose from all subjects for testing, and use the remaining subspaces corresponding to the other poses for training. This results in 9-fold and 7-fold cross validation tests for Yale Face and CMU-PIE, respectively. For the ETH-80 database, I hold out the subspaces of 8 objects (one from each category) for testing, and use the remaining subspaces for training, which is a 10-fold cross validation. For the IXMAS database, I hold out all the subspaces corresponding to a particular person for testing, and use the subspaces of the other people for training, which is an 11-fold cross validation test.

As mentioned in the previous chapter, the subspace representations of the databases absorb the variability due to illumination, pose, and the choice of the state space, respectively. The cross validation setting of this thesis tests whether the remaining variability between subspaces is indeed useful for recognizing subjects, objects, or actions, regardless of different poses, object instances, and actors, respectively.
4.4.2 Results and discussion

Figures 4.2–4.5 show the classification rates. I can summarize the results as follows:

1. The best performing distances differ for each database: dMaxCor for Yale Face; dProj and dProc1 for CMU-PIE; dArc, dProj, and dProc1 for ETH-80; and dProj and dProc1 for IXMAS. I interpret this as certain distances being better suited for discriminating the subspaces of a particular database.

2. With the exception of dMaxCor for Yale Face, the three distances dArc, dProj, and dProc1 are consistently better than dBC, dMinCor, and dProc2. This grouping of the distances is theoretically predicted in Section 4.3.6.

3. dMinCor and dProc2 show exactly the same rates, since the former is monotonically related to the latter by (4.15). However, the two distances will show different rates when they are used with more sophisticated algorithms than the K-NN.

4. With the exception of Yale Face, the three distances perform much better than the Euclidean distance does, which demonstrates the potential advantages of the subspace-based approach.

5. For CMU-PIE and IXMAS, the rates increase overall as the subspace dimension m increases. For Yale Face, the rates of dBC and dProc2 drop as m increases, whereas the rates of the other distances remain the same. For ETH-80, the rates seem to peak at a different m for each distance. This means that the choice of the subspace dimensionality m can have significant effects on the recognition rates when the simple K-NN algorithm is employed. However, it will be shown in later chapters that m has less effect on more sophisticated algorithms that are able to adapt to the peculiarities of the data.
4.5 Conclusion

In this chapter I introduced the Grassmann manifold as the framework for subspace-based algorithms, and reviewed several well-known Grassmann distances for measuring the dissimilarity of subspaces. These Grassmann distances were analyzed and compared in terms of how they use the principal angles to define the dissimilarity of subspaces. In the classification task on real image databases with the 1NN algorithm, the best performing distance varied depending on the data used. This suggests that choosing the best distance a priori requires some prior knowledge of the data. However, most of the Grassmann distances performed better than the Euclidean distance in 1NN classification, and behaved in groups as predicted by the analysis. In the next chapter I will present a more important criterion for choosing a distance: whether the distance is associated with a positive definite kernel or not.
          m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
dEucl     85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
dArc      87.81  80.29  83.15  87.81  84.23  84.95  80.65  81.36  82.44
dProj     87.81  80.29  83.87  87.46  86.02  84.23  84.59  85.30  85.66
dBC       87.81  80.29  83.15  87.46  83.51  83.15  76.34  75.63  74.19
dMaxCor   87.81  89.96  92.11  91.04  91.04  91.04  91.76  92.11  91.76
dMinCor   87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84
dProc1    87.81  80.29  83.15  87.46  84.95  84.95  82.80  83.51  82.80
dProc2    87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84

Figure 4.2: Yale Face Database: face recognition rates from the 1NN classifier with the Grassmann distances, as a function of the subspace dimension m. The two highest rates including ties are highlighted with boldface for each subspace dimension m.
          m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
dEucl     61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96
dArc      72.28  52.03  65.46  72.92  76.33  83.16  85.93  88.70  88.49
dProj     72.28  52.03  64.39  75.48  78.46  82.94  84.43  86.57  87.85
dBC       72.28  51.81  65.88  72.28  76.76  81.24  84.01  85.93  82.30
dMaxCor   72.28  66.95  65.03  64.61  64.18  63.97  64.61  64.61  64.39
dMinCor   72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52
dProc1    72.28  52.03  65.25  73.13  77.83  83.37  86.57  88.27  88.27
dProc2    72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52

Figure 4.3: CMU-PIE Database: face recognition rates from the 1NN classifier with the Grassmann distances, as a function of the subspace dimension m.
          m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
dEucl     85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
dArc      80.00  86.25  93.75  86.25  88.75  92.50  90.00  97.50  92.50
dProj     80.00  85.00  95.00  92.50  88.75  92.50  90.00  96.25  95.00
dBC       80.00  86.25  91.25  83.75  87.50  88.75  85.00  95.00  92.50
dMaxCor   80.00  78.75  82.50  81.25  81.25  82.50  83.75  81.25  83.75
dMinCor   80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25
dProc1    80.00  86.25  93.75  86.25  88.75  92.50  90.00  96.25  91.25
dProc2    80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25

Figure 4.4: ETH-80 Database: object categorization rates from the 1NN classifier with the Grassmann distances, as a function of the subspace dimension m.
          m=1    m=2    m=3    m=4    m=5
dEucl     42.61  42.61  42.61  42.61  42.61
dArc      61.82  68.79  76.06  76.67  78.18
dProj     61.82  69.39  74.85  78.48  80.30
dBC       61.82  67.58  71.82  73.03  73.03
dMaxCor   61.82  65.76  63.03  67.58  66.97
dMinCor   61.82  61.82  61.82  62.42  54.24
dProc1    61.82  68.18  76.97  76.97  80.30
dProc2    61.82  61.82  61.82  62.42  54.24

Figure 4.5: IXMAS Database: action recognition rates from the 1NN classifier with the Grassmann distances, as a function of the subspace dimension m. The two highest rates including ties are highlighted with boldface for each subspace dimension m.
Chapter 5

GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS
5.1 Introduction

In the previous chapter I defined subspace distances on the Grassmann manifold. However, with only a distance structure, the possible operations on the data are severely restricted. In this chapter, I show that it is possible to define positive definite kernel functions on the manifold, and thereby to transform the space into a familiar Hilbert space by virtue of the RKHS theory of Section 2.2. In particular, the Projection and Binet-Cauchy distances presented in the previous chapter will be shown to be compatible with the Projection and Binet-Cauchy kernels, defined as follows:

kProj(Y1, Y2) = ‖Y1′Y2‖_F²,  kBC(Y1, Y2) = (det Y1′Y2)².
These kernels are discussed in detail in this chapter. The Binet-Cauchy kernel has been used as a similarity measure for sets [90] and dynamical systems [83].¹ The Projection distance has been used for face recognition [85], but the corresponding Projection kernel has not been explicitly used, and it is the main object of this chapter. I examine both kernels as the representative kernels on the Grassmann manifold. Advantages of the Grassmann kernels over the Euclidean kernels are demonstrated on a classification problem with Support Vector Machines (SVMs) on synthetic datasets.

To demonstrate the potential benefits of the kernels further, I use the kernels in a discriminant analysis of subspaces. The proposed method will be contrasted with previously suggested subspace-based discriminant algorithms [92, 64, 24, 43]. Those previous methods adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean subspace distances are used. This inconsistency results in a difficult optimization and a weak guarantee of convergence. In the approach proposed in this chapter, the feature extraction and the distance measurement are integrated around the Grassmann kernel, resulting in a simpler and more familiar algorithm. Experiments with the image databases also show that the proposed method performs better than the previous methods.

5.2 Kernel functions for subspaces

Among the various distances presented in Chapter 4, only the Projection distance and the Binet-Cauchy distance are induced from positive definite kernels. This means that we can

¹The authors of [83] use the term Binet-Cauchy kernel for a more abstract class of kernels for Fredholm operators. The Binet-Cauchy kernel kBC in this paper is a special case, close to what those authors call the Martin kernel.
define the corresponding kernels kProj and kBC such that the following holds:

d²(Y1, Y2) = k(Y1, Y1) + k(Y2, Y2) − 2 k(Y1, Y2).  (5.1)

To define a kernel on the Grassmann manifold, let us recall the definition of a positive definite kernel in Definition 2.4:

A real symmetric function k is a (resp. conditionally) positive definite kernel function if Σ_{i,j} ci cj k(xi, xj) ≥ 0 for all x1, ..., xn (xi ∈ X) and c1, ..., cn (ci ∈ R) for any n ∈ N (resp. for all c1, ..., cn (ci ∈ R) such that Σ_{i=1}^{n} ci = 0).
Based on the Euclidean coordinates of subspaces, the Grassmann kernel is defined as follows:

Definition 5.1. Let k : R^{D×m} × R^{D×m} → R be a real-valued symmetric function, k(Y1, Y2) = k(Y2, Y1). The function k is a Grassmann kernel if it is 1) positive definite and 2) invariant to different representations:

k(Y1, Y2) = k(Y1R1, Y2R2),  ∀R1, R2 ∈ O(m).

In the following sections I explicitly construct an isometry from (G, dProj or dBC) to a Hilbert space (H, L2), and use the isometry to show that the Projection and Binet-Cauchy kernels are Grassmann kernels.

5.2.1 Projection kernel

The Projection distance dProj can be understood by associating a subspace with a projection matrix via the embedding [16]

Ψ : G(m, D) → R^{D×D},  span(Y) ↦ Y Y′.  (5.2)
The image Ψ(G(m, D)) is the set of rank-m orthogonal projection matrices, hence the name Projection distance.

Theorem 5.2. The map

Ψ : G(m, D) → R^{D×D},  span(Y) ↦ Y Y′  (5.3)

is an embedding. In particular, it is an isometry from (G, dProj) to (R^{D×D}, ‖ · ‖_F).

Proof. 1. Well-defined: if span(Y1) = span(Y2), or equivalently Y1 = Y2R for some R ∈ O(m), then Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2).

2. Injective: suppose Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2). Multiplying by Y1 and Y2 on the right gives the equalities Y1 = Y2(Y2′Y1) and Y2 = Y1(Y1′Y2), respectively. Let R = Y2′Y1; then Y1 = Y2R = Y1(R′R), which shows R ∈ O(m) and therefore span(Y1) = span(Y2).

3. Isometry: ‖Ψ(Y1) − Ψ(Y2)‖_F = ‖Y1Y1′ − Y2Y2′‖_F = 2^{1/2} dProj(Y1, Y2).
Since we have a Euclidean embedding into (R^{D×D}, ‖ · ‖_F), the natural inner product of this space is the trace, tr[(Y1Y1′)(Y2Y2′)] = ‖Y1′Y2‖_F². This provides us with the definition of the Projection kernel:

Theorem 5.3. The Projection kernel

kProj(Y1, Y2) = ‖Y1′Y2‖_F²  (5.4)

is a Grassmann kernel.
Proof. The kernel is well-defined because kProj(Y1, Y2) = kProj(Y1R1, Y2R2) for any R1, R2 ∈ O(m). The positive definiteness follows from the properties of the Frobenius norm: for all Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N, we have

Σ_{ij} ci cj ‖Yi′Yj‖_F² = Σ_{ij} ci cj tr(YiYi′YjYj′) = tr( Σ_i ci YiYi′ )² = ‖ Σ_i ci YiYi′ ‖_F² ≥ 0.

The Projection kernel has a very simple form and requires only O(Dm²) multiplications to evaluate, those needed to form the m × m matrix Y1′Y2. It is the main kernel I propose to use for subspace-based learning.
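The positive definiteness claimed by Theorem 5.3 is easy to check numerically. The sketch below (NumPy, helper names mine) builds a Gram matrix of kProj over random subspaces, inspects its spectrum, and checks the distance induced via (5.1); note that the induced squared distance equals twice dProj² of (4.4), i.e., the kernel reproduces the Projection distance up to a constant factor:

```python
import numpy as np

def k_proj(Y1, Y2):
    """Projection kernel (5.4): k(Y1, Y2) = ||Y1'Y2||_F^2."""
    return np.linalg.norm(Y1.T @ Y2, 'fro')**2

rng = np.random.default_rng(4)
D, m, n = 10, 3, 20
Ys = [np.linalg.qr(rng.standard_normal((D, m)))[0] for _ in range(n)]

# Gram matrix of the kernel over n random subspaces
K = np.array([[k_proj(Yi, Yj) for Yj in Ys] for Yi in Ys])

# Positive definiteness: all eigenvalues of the (symmetric) Gram matrix are >= 0
eigvals = np.linalg.eigvalsh((K + K.T) / 2)
assert eigvals.min() > -1e-8

# Distance induced via (5.1): k(Y1,Y1) + k(Y2,Y2) - 2 k(Y1,Y2)
#   = 2 (m - ||Y1'Y2||_F^2), i.e. twice the squared Projection distance (4.4).
d2 = k_proj(Ys[0], Ys[0]) + k_proj(Ys[1], Ys[1]) - 2 * k_proj(Ys[0], Ys[1])
s = np.linalg.svd(Ys[0].T @ Ys[1], compute_uv=False)
assert np.allclose(d2, 2 * np.sum(1 - s**2), atol=1e-8)
```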
5.2.2 Binet-Cauchy kernel

The Binet-Cauchy distance can also be understood via an embedding. Let s be a subset of {1, ..., D} with m elements, s = {r1, ..., rm}, and let Y^(s) be the m × m matrix whose rows are the r1, ..., rm-th rows of Y. If s1, s2, ..., sn are all such choices of the subset, ordered lexicographically, then the Binet-Cauchy embedding is defined as

Ψ : G(m, D) → R^n,  span(Y) ↦ ( det Y^(s1), ..., det Y^(sn) ),  (5.5)

where n = (D choose m) is the number of ways of choosing m rows out of the D rows. It is also an isometry from (G, dBC) to (R^n, ‖ · ‖_2). The natural inner product in R^n is the dot product of the two vectors,

Σ_{r=1}^{n} det Y1^(sr) det Y2^(sr),

which provides us with the definition of the Binet-Cauchy kernel.
Theorem 5.4. The Binet-Cauchy kernel

kBC(Y1, Y2) = (det Y1′Y2)² = det(Y1′Y2Y2′Y1)  (5.6)

is a Grassmann kernel.

Proof. First, the kernel is well-defined because kBC(Y1, Y2) = kBC(Y1R1, Y2R2) for any R1, R2 ∈ O(m). To show that kBC is positive definite, it suffices to show that k(Y1, Y2) = det Y1′Y2 is positive definite. From the Binet-Cauchy identity [38, 90, 83], we have

det Y1′Y2 = Σ_s det Y1^(s) det Y2^(s).

Therefore, for all Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N,

Σ_{ij} ci cj det Yi′Yj = Σ_{ij} ci cj Σ_s det Yi^(s) det Yj^(s) = Σ_s ( Σ_i ci det Yi^(s) )² ≥ 0.
Some other forms of the Binet-Cauchy kernel have also appeared in the literature. Note that although det Y1′Y2 is itself a Grassmann kernel, we prefer kBC(Y1, Y2) = (det Y1′Y2)². The reason is that the latter is directly related to the principal angles by (det Y1′Y2)² = Π_i cos²θi and therefore admits geometric interpretations, whereas the former cannot be written directly in terms of the principal angles; that is, det Y1′Y2 ≠ Π_i cos θi in general.² Another variant, arcsin kBC(Y1, Y2), is also a positive definite kernel,³ and its induced metric d = (arccos(det Y1′Y2))^{1/2} is a conditionally positive definite metric.

²For example, det Y1′Y2 can be negative, whereas Π_i cos θi, a product of singular values, is nonnegative by definition.

³From Theorems 4.18 and 4.19 of [69].
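The positive definiteness asserted by Theorem 5.4 can likewise be checked numerically (a NumPy sketch with my own helper names):

```python
import numpy as np

def k_bc(Y1, Y2):
    """Binet-Cauchy kernel (5.6): k(Y1, Y2) = det(Y1'Y2)^2."""
    return np.linalg.det(Y1.T @ Y2)**2

rng = np.random.default_rng(5)
D, m, n = 10, 3, 20
Ys = [np.linalg.qr(rng.standard_normal((D, m)))[0] for _ in range(n)]

# The Gram matrix over n random subspaces is positive semidefinite
K = np.array([[k_bc(Yi, Yj) for Yj in Ys] for Yi in Ys])
assert np.linalg.eigvalsh((K + K.T) / 2).min() > -1e-8

# Invariance (Definition 5.1): k is unchanged by a change of basis
R = np.linalg.qr(rng.standard_normal((m, m)))[0]
assert np.allclose(k_bc(Ys[0], Ys[1] @ R), k_bc(Ys[0], Ys[1]), atol=1e-8)
```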
5.2.3 Indefinite kernels from other metrics

Since the Projection distance and the Binet-Cauchy distance are derived from positive definite kernels, we have all the kernel-based algorithms for Hilbert spaces at our disposal. In contrast, the other distances of the previous chapter are not associated with Grassmann kernels and can only be used with less powerful algorithms. Showing directly that a distance is not associated with any kernel is not as easy as showing the opposite, that there is one. However, Theorem 2.12 can be used to make the task easier:

A metric d is induced from a positive definite kernel if and only if

k̂(x1, x2) = −d²(x1, x2)/2,  x1, x2 ∈ X  (5.7)

is conditionally positive definite. The theorem allows us to show a metric's non-positive definiteness by constructing an indefinite kernel matrix from (5.7) as a counterexample.

There have been efforts to use indefinite kernels for learning [59, 31], and several heuristics have been proposed for modifying an indefinite kernel matrix into a positive definite matrix [60]. However, I do not advocate the use of these heuristics, since they change the geometry of the original data.
5.2.4 Extension to nonlinear subspaces<br />
Linear subspaces in the original space can be generalized to ‘nonlinear’ subspaces by con-<br />
sidering linear subspaces in a RKHS, which is a trick that has been used successfully in<br />
kernel PCA [68]. 4 In [90, 85] the trick is shown to be applicable to the computation of the<br />
principal angles, called the kernel principal angles. Wolf and Shashua, in particular, use<br />
the trick to compute the Binet-Cauchy kernel. Note that these two kernels have different<br />
4 A ‘nonlinear’ subspace is an oxymoron. Technically speaking, it is a preimage of a linear subspace in<br />
RHKS.<br />
Figure 5.1: Doubly kernel method. The first kernel implicitly maps the two ‘nonlinear<br />
subspaces’ Xi and Xj to span(Yi) and span(Yj) via the map Φ : X → H1, where the<br />
‘nonlinear subspace’ means the preimage Xi = Φ −1 (span(Yi)) and Xj = Φ −1 (span(Yj)).<br />
The second (=Grassmann) kernel maps the points Yi and Yj on the Grassmann manifold<br />
G(m, D) to the corresponding points in H2 via the map Ψ : G(m, D) → H2 such as (5.3)<br />
or (5.5).<br />
roles and need to be distinguished. An illustration of this ‘doubly kernel’ method is given<br />
in Figure 5.1.<br />
The key point of the trick is that the principal angles between two subspaces in the
RKHS can be computed solely from the inner products of vectors in the original space.5 Furthermore,
the orthonormalization procedure in the feature space also requires only the inner products
of vectors. Below is a summary of the procedure in [90].
1. Let $X_i = \{x^i_1, \ldots, x^i_{N_i}\}$ be the $i$-th set of data and $\Phi_i = [\phi(x^i_1), \ldots, \phi(x^i_{N_i})]$ be the
image matrix of Xi in the feature space implicitly defined by a kernel function k,<br />
5 A similar idea is also used to define probability distributions in the feature space [46, 96], and will be<br />
explained in the next chapter.<br />
e.g., Gaussian RBF kernel.<br />
2. The orthonormal basis Yi of the span(Φi) is then computed from the Gram-Schmidt<br />
process in RKHS: Φi = YiRi.<br />
3. Finally, the product $Y_i'Y_j$ in the feature space, used for example to define the Binet-Cauchy kernel, is computed from the original data by
\[
Y_i'Y_j = (R_i^{-1})' \Phi_i'\Phi_j R_j^{-1} = (R_i^{-1})' \big[k(x^i_k, x^j_l)\big]_{kl} R_j^{-1}.
\]
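The three steps can be sketched using kernel evaluations only; assuming NumPy, the Gram-Schmidt factor $R_i$ is obtained as the upper Cholesky factor of the Gram matrix (the RBF kernel and the toy data below are hypothetical stand-ins for actual image sets):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(A, B, gamma=0.5):
    # Gaussian RBF kernel matrix between the columns of A and B.
    d2 = np.sum(A**2, 0)[:, None] - 2 * A.T @ B + np.sum(B**2, 0)[None, :]
    return np.exp(-gamma * d2)

Xi = rng.standard_normal((3, 5))   # two small data sets (columns = points)
Xj = rng.standard_normal((3, 5))

# Gram-Schmidt in the RKHS: K_ii = Phi_i' Phi_i = R_i' R_i, so R_i is the
# upper Cholesky factor of the Gram matrix and Phi_i = Y_i R_i.
Ri = np.linalg.cholesky(rbf(Xi, Xi)).T
Rj = np.linalg.cholesky(rbf(Xj, Xj)).T

# Y_i' Y_j = (R_i^{-1})' K_ij R_j^{-1}, computed from kernel values alone.
YiYj = np.linalg.inv(Ri).T @ rbf(Xi, Xj) @ np.linalg.inv(Rj)

# Singular values of Y_i' Y_j are the cosines of the kernel principal angles.
cosines = np.linalg.svd(YiYj, compute_uv=False)
assert np.all(cosines >= 0) and np.all(cosines <= 1 + 1e-8)
```

The same $Y_i'Y_j$ feeds directly into either Grassmann kernel.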
Although this extension has been used to improve classification with a few small
databases [90], I will not use it in this thesis, for the following reasons. First, the
databases I use already have theoretical grounds for being modeled as linear subspaces, and I want to
verify the linear subspace models. Second, the advantage of kernel tricks in general is most
pronounced when the ambient space $\mathbb{R}^D$ has a relatively small dimension $D$ compared to
the number of data samples $N$. This is obviously not the case with the data used in this
thesis. Further experiments with the nonlinear extension will be carried out in the future.
5.3 Experiments with synthetic data<br />
In this section I demonstrate the application of the Grassmann kernels to a two-class clas-<br />
sification problem with Support Vector Machines (SVMs). Using synthetic data I will<br />
compare the classification performances of linear/nonlinear SVMs in the original space<br />
with the performances of the SVMs in the Grassmann space. The advantages of the
subspace approach over the conventional Euclidean approach for classification problems will
be discussed.
Figure 5.2: A two-dimensional subspace is represented by a triangular patch swept by two
basis vectors. The positive and negative classes are color-coded blue and red respectively.
A: the two class centers $Y_+$ and $Y_-$ around which the other subspaces are randomly
generated. B–D: examples of randomly selected subspaces for the 'easy', 'intermediate', and
'difficult' datasets.
5.3.1 Synthetic data<br />
I generate three types of datasets: ‘easy’, ‘intermediate’, and ‘difficult’; these datasets differ<br />
in the amount of noise in the data.<br />
For each type of data, I generate $N = 100$ subspaces in a $D = 6$ dimensional Euclidean space, where each subspace is $m = 2$ dimensional. To generate two-class data, I
first define two exemplar subspaces spanned by the following bases $Y_+$ and $Y_-$:
\[
Y_+ = \frac{1}{\sqrt{6}} \Big[ \, [\,1\ 1\ 1\ {-1}\ 1\ 1\,]',\ [\,1\ 1\ {-1}\ 1\ 1\ {-1}\,]' \, \Big], \qquad
Y_- = \frac{1}{\sqrt{6}} \Big[ \, [\,1\ {-1}\ 1\ 1\ {-1}\ 1\,]',\ [\,1\ {-1}\ {-1}\ {-1}\ {-1}\ {-1}\,]' \, \Big].
\]
$Y_+$ and $Y_-$ serve as the positive and the negative class centers respectively. The
corresponding subspaces $\mathrm{span}(Y_+)$ and $\mathrm{span}(Y_-)$ have the principal angles $\theta_1 = \theta_2 = \arccos(1/3)$.
The other subspaces $Y_i$ are generated by adding a Gaussian random matrix $M_i$ to the
bases $Y_+$ or $Y_-$, and then applying the SVD to compute the new perturbed bases:
\[
Y_i = U, \quad \text{where} \quad
U\Sigma V' =
\begin{cases}
Y_+ + M_i, & i = 1, \ldots, N/2 \\
Y_- + M_i, & i = N/2 + 1, \ldots, N
\end{cases}
\]
and the elements of the matrix $M_i$ are independent Gaussian variables $[M_i]_{jk} \sim \mathcal{N}(0, s^2)$.
The standard deviation $s$ controls the amount of noise; $s$ is chosen to be $0.2$, $0.3$,
and $0.4$ for the 'easy', 'intermediate', and 'difficult' datasets respectively. Figure 5.2 shows
examples of the subspaces for the three datasets. Note that the subspaces become more
cluttered and the class boundary becomes more irregular as $s$ increases.
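The generation procedure can be sketched as follows (assuming NumPy; the seed is arbitrary, and the final assertion checks the stated principal angles of the class centers):

```python
import numpy as np

rng = np.random.default_rng(1)
D, m, N = 6, 2, 100
s = 0.3   # noise level: 0.2 / 0.3 / 0.4 for easy / intermediate / difficult

Y_pos = np.array([[1, 1, 1, -1, 1, 1], [1, 1, -1, 1, 1, -1]]).T / np.sqrt(6)
Y_neg = np.array([[1, -1, 1, 1, -1, 1], [1, -1, -1, -1, -1, -1]]).T / np.sqrt(6)

def perturb(Yc):
    # Add Gaussian noise to the class center, then re-orthonormalize
    # via the thin SVD: Y_i = U.
    M = rng.normal(0.0, s, size=(D, m))
    U, _, _ = np.linalg.svd(Yc + M, full_matrices=False)
    return U

subspaces = [perturb(Y_pos) for _ in range(N // 2)] + \
            [perturb(Y_neg) for _ in range(N // 2)]
labels = np.array([+1] * (N // 2) + [-1] * (N // 2))

# The class centers have principal angles theta1 = theta2 = arccos(1/3).
cosines = np.linalg.svd(Y_pos.T @ Y_neg, compute_uv=False)
assert np.allclose(cosines, 1 / 3)
```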
5.3.2 Algorithms<br />
I compare the performance of the Euclidean SVM with linear/polynomial/RBF kernels and<br />
the performance of the SVM with the Grassmann kernels. To test the Euclidean SVMs, I randomly
sample $n = 50$ points on each subspace according to a Gaussian distribution.
There is an immediate handicap with a linear classifier in the original data space. Each<br />
subspace is symmetric with respect to the origin, that is, if x is a point on a subspace,<br />
then −x is also on the subspace. As a result, any hyperplane either 1) contains a subspace<br />
or 2) halves a subspace into two parts and yields 50 percent classification rate, which is<br />
useless. Therefore, if data lie on subspaces without further restrictions, a linear classifier<br />
(with a zero-bias) always fails to classify subspaces. To alleviate the problem with the<br />
Euclidean algorithms, I sample points from the intersection of the subspaces and the half-space $\{(x_1, \ldots, x_6) \in \mathbb{R}^6 \mid x_1 > 0\}$.
To test the Grassmann SVM, I first estimate the basis Yi from the SVD of the same<br />
sampled points used for the Euclidean SVM, and then evaluate the Grassmann kernel func-<br />
tions.<br />
The following five kernels are compared:
1. Euclidean SVM with the linear kernel: $k(x_1, x_2) = \langle x_1, x_2 \rangle$.
2. Euclidean SVM with the polynomial kernel: $k(x_1, x_2) = (\langle x_1, x_2 \rangle + 1)^3$.
3. Euclidean SVM with the Gaussian RBF kernel: $k(x_1, x_2) = \exp(-\frac{1}{2r^2}\|x_1 - x_2\|^2)$. The radius $r$ is chosen to be one-fifth of the diameter of the data: $r = 0.2 \max_{ij} \|x_i - x_j\|$.
4. Grassmannian SVM with the Projection kernel: $k(Y_1, Y_2) = \|Y_1'Y_2\|_F^2$.
5. Grassmannian SVM with the Binet-Cauchy kernel: $k(Y_1, Y_2) = (\det Y_1'Y_2)^2$.
For the Euclidean SVMs, I use the public-domain software SVM-light [42] with default
parameters. For the Grassmann SVMs, I use Matlab code with a nonnegative QP solver.
I evaluate the algorithms with the leave-one-out test by holding out one subspace and<br />
training with the other N − 1 subspaces.<br />
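A minimal end-to-end sketch of the Grassmann approach on the 'easy' setting is given below, assuming NumPy; for brevity, a leave-one-out 1-NN rule in the kernel-induced distance replaces the SVM training used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)
D, m, N, s = 6, 2, 40, 0.2        # s = 0.2 corresponds to the 'easy' dataset

Y_pos = np.array([[1, 1, 1, -1, 1, 1], [1, 1, -1, 1, 1, -1]]).T / np.sqrt(6)
Y_neg = np.array([[1, -1, 1, 1, -1, 1], [1, -1, -1, -1, -1, -1]]).T / np.sqrt(6)

Ys, labels = [], []
for label, Yc in [(+1, Y_pos), (-1, Y_neg)]:
    for _ in range(N // 2):
        U, _, _ = np.linalg.svd(Yc + rng.normal(0, s, (D, m)),
                                full_matrices=False)
        Ys.append(U)
        labels.append(label)
labels = np.array(labels)

def k_proj(Y1, Y2):
    # Projection kernel: ||Y1' Y2||_F^2.
    return np.linalg.norm(Y1.T @ Y2, 'fro') ** 2

K = np.array([[k_proj(Yi, Yj) for Yj in Ys] for Yi in Ys])

# Leave-one-out 1-NN in the kernel-induced distance
# d^2(Yi, Yj) = k(Yi,Yi) - 2 k(Yi,Yj) + k(Yj,Yj),
# a lightweight stand-in for the SVM used in the thesis.
d2 = np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :]
np.fill_diagonal(d2, np.inf)
acc = np.mean(labels[np.argmin(d2, axis=1)] == labels)
assert np.allclose(K, K.T) and acc > 0.5
```

Since $k(Y,Y) = m$ for orthonormal bases, $d^2$ here is exactly the squared Projection distance.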
5.3.3 Results and discussion<br />
Table 5.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,<br />
averaged for 10 independent trials. The results show that the Grassmann SVM with the<br />
Table 5.1: Classification rates of the Euclidean SVMs and the Grassmannian SVMs. The<br />
best rate for each dataset is highlighted by boldface.<br />
Euclidean Grassmann<br />
Lin Poly RBF Proj BC<br />
Easy 88.21 98.41 98.37 100.00 100.00<br />
Intermediate 80.08 92.46 92.72 98.80 98.00<br />
Difficult 72.01 81.14 81.73 91.30 90.60<br />
Projection kernel outperforms the Euclidean SVMs. The Grassmann SVM with the
Binet-Cauchy kernel is a close second. The polynomial and RBF kernels perform about equally,
better than the linear kernel but not as well as the Grassmann kernels. The overall
classification rates decrease as the data become more difficult to separate.
The Grassmann kernels achieve better results for two main reasons. First, when
the data are highly cluttered as shown in Figure 5.2, the geometric prior of the subspace
structure can disambiguate points close to each other that the Euclidean distance cannot
distinguish well. Second, the Grassmann approach implicitly maps the data from the
original $D$-dimensional space to a higher, $m(D-m)$-dimensional space where separating
the subspaces becomes easier.
In addition to having superior classification performance with subspace-structured
data, the Grassmann kernel method has a smaller computational cost. In the experiment
above, for example, the Euclidean approach uses a kernel matrix of size 5000 × 5000,
whereas the Grassmann approach uses a kernel matrix of size 100 × 100, which is $n = 50$
times smaller than the Euclidean kernel matrix in each dimension.
5.4 Discriminant Analysis of subspaces
In this section I introduce a discriminant analysis method on the Grassmann manifold, and<br />
compare this method with other previously known discriminant techniques for subspaces.<br />
Since the image databases in Chapter 3 are highly multiclass 6 and lie in high dimensional<br />
space, I propose to use the discriminant analysis technique to reduce dimensionality and<br />
extract features of subspace data.<br />
5.4.1 Grassmann Discriminant Analysis<br />
It is straightforward to use the Projection and the Binet-Cauchy
kernels with the kernel FDA method introduced earlier. Recall that the cost function
of kernel FDA is
\[
J(\alpha) = \frac{\alpha'\Phi' S_B \Phi\alpha}{\alpha'\Phi' S_W \Phi\alpha}
= \frac{\alpha' K (V - 1_N 1_N'/N) K \alpha}{\alpha' \big(K(I_N - V)K + \sigma^2 I_N\big) \alpha}, \qquad (5.8)
\]
where $K$ is the kernel matrix, $\sigma$ is a regularization term, and the other terms are fixed.
Since the method has already been explained in detail, I only present a summary of the procedure
below.
6 Nc = 38, 68, 8, and 11 for the Yale Face, CMU-PIE, ETH-80, and IXMAS databases respectively
Assume the D by m orthonormal bases {Yi} are already computed and given.<br />
Training:<br />
1. Compute the matrix [Ktrain]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi, Yj in the<br />
training set.<br />
2. Solve maxα J(α) in (5.8) by eigen-decomposition.<br />
3. Compute the (Nc − 1)-dimensional coefficients Ftrain = α ′ Ktrain.<br />
Testing:<br />
1. Compute the matrix [Ktest]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi in training<br />
set and Yj in the test set.<br />
2. Compute the (Nc − 1)-dimensional coefficients Ftest = α ′ Ktest.<br />
3. Perform 1-NN classification from the Euclidean distance between Ftrain and<br />
Ftest.<br />
I call this method the Grassmann Discriminant Analysis to differentiate it from other<br />
discriminant methods for subspaces, which I review in the following sections.<br />
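The training and testing steps above can be sketched as follows, assuming NumPy; a toy linear kernel over Gaussian class clusters stands in for the Grassmann kernel matrix $K_{train}$, and the ridge value is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy kernel matrix over N points in Nc classes (a stand-in for kProj or
# kBC evaluated on orthonormal bases).
N, Nc = 30, 3
y = np.repeat(np.arange(Nc), N // Nc)
X = rng.standard_normal((N, 4)) + 3.0 * rng.standard_normal((Nc, 4))[y]
K = X @ X.T                      # linear kernel, playing the role of Ktrain

# Kernel FDA (5.8): maximize a'K(V - 11'/N)Ka / a'(K(I - V)K + s^2 I)a,
# where [V]_ij = 1/n_c if i and j share class c, else 0.
V = np.zeros((N, N))
for c in range(Nc):
    idx = (y == c)
    V[np.ix_(idx, idx)] = 1.0 / idx.sum()
ones = np.ones((N, N)) / N
SB = K @ (V - ones) @ K
SW = K @ (np.eye(N) - V) @ K + 1e-3 * np.eye(N)   # ridge: arbitrary value

# Generalized eigenproblem solved by whitening with the Cholesky factor of SW.
L = np.linalg.cholesky(SW)
Linv = np.linalg.inv(L)
w, U = np.linalg.eigh(Linv @ SB @ Linv.T)
alpha = Linv.T @ U[:, -(Nc - 1):]      # top Nc - 1 directions
F = alpha.T @ K                        # (Nc - 1)-dimensional coefficients
assert F.shape == (Nc - 1, N)
```

Test points would be embedded with the same $\alpha$ applied to $K_{test}$ and classified by 1-NN.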
5.4.2 Mutual Subspace Method (MSM)<br />
The original MSM [92] performs simple 1-NN classification with dMax, with no feature
extraction. The method can be extended to any distance described in this thesis. Although
there have been attempts to use kernels in MSM [64], the kernel is used only to represent data in
the original space, and the MSM algorithm remains a 1-NN classification.
5.4.3 Constrained MSM (cMSM)<br />
Constrained MSM [24] is a technique that applies dimensionality reduction to the bases
of the subspaces in the original space. Let $G = \sum_i Y_iY_i'$ be the sum of the projection
matrices of the data and $\{v_1, \ldots, v_D\}$ be the eigenvectors corresponding to the eigenvalues
$\{\lambda_1 \le \cdots \le \lambda_D\}$ of $G$. The authors of [24] claim that the first few eigenvectors $v_1, \ldots, v_d$
of $G$ are more discriminative than the later eigenvectors, and suggest projecting the basis
vectors of each subspace $Y_i$ onto $\mathrm{span}(v_1, \ldots, v_d)$, followed by normalization. However,
these procedures lack justification, as well as a clear criterion for choosing the dimension
$d$, on which the result crucially depends in our experience.
5.4.4 Discriminant Analysis of Canonical Correlations (DCC)<br />
The Discriminant Analysis of Canonical Correlations [43] can be understood as a
nonparametric version of linear discriminant analysis using the Procrustes distance (4.11).
The algorithm finds the discriminating direction $w$ which maximizes the ratio $L(w) =
w'S_bw / w'S_ww$, where $S_b$ and $S_w$ are the nonparametric between-class and within-class
'covariance' matrices from Section 2.4.2:
\[
S_b = \sum_i \sum_{j \in B_i} (Y_iU - Y_jV)(Y_iU - Y_jV)', \qquad
S_w = \sum_i \sum_{j \in W_i} (Y_iU - Y_jV)(Y_iU - Y_jV)',
\]
where $U$ and $V$ are from (4.2). Recall that $\mathrm{tr}\big[(Y_iU - Y_jV)(Y_iU - Y_jV)'\big] = \|Y_iU - Y_jV\|_F^2$
is the squared Procrustes distance. However, unlike my method, $S_b$ and $S_w$ do not admit
a geometric interpretation as true covariance matrices, nor can they be kernelized directly.
Another disadvantage of DCC is the difficulty of optimization. The algorithm iterates
between the two stages of 1) maximizing the ratio $L(w)$ and 2) computing $S_b$ and $S_w$, which
results in a computational overhead and weak theoretical support for global convergence.
5.5 Experiments with real-world data<br />
In this section I test the Grassmann Discriminant Analysis with the Yale Face, the CMU-<br />
PIE, the ETH-80 and the IXMAS databases, and compare its performance with those of<br />
other algorithms.<br />
5.5.1 Algorithms<br />
The following is the list of algorithms used in the test.<br />
1. Baseline: Euclidean FDA<br />
2. Grassmann Discriminant Analysis:<br />
• GDA1 (Projection kernel + kernel FDA)<br />
• GDA2 (Binet-Cauchy kernel + kernel FDA)<br />
For GDA1 and GDA2, the optimal values of σ are found by scanning through a range<br />
of values. The results do not seem to vary much as long as σ is small enough.<br />
3. Others<br />
• MSM (max corr)<br />
• cMSM (PCA+max corr)<br />
• DCC (NDA + Procrustes dist): For cMSM and DCC, the optimal dimension d is
found by exhaustive search. For DCC, I used two nearest neighbors
for Bi and Wi in Section 5.4.4; however, increasing the number of nearest
neighbors does not change the results very much, as was observed in [43]. For
DCC the optimization is iterated 5 times in each trial.
I evaluate the algorithms with cross-validation as explained in Section 4.4.1.
5.5.2 Results and discussion<br />
Figures 5.3–5.6 show the classification rates. The results can be summarized as follows:
1. The GDA1 shows significantly better performance than all the other algorithms for<br />
all datasets. However, the difference is less pronounced in the Yale Face database<br />
where the other discriminant algorithms also performed well.<br />
2. The overall rates are roughly in the order of (GDA1 > cMSM > DCC > others ).<br />
These three algorithms consistently outperform the baseline method, whereas GDA2<br />
and MSM occasionally lag behind the baseline.<br />
3. With the exception of the IXMAS database, the rates of GDA1, MSM, cMSM,
and DCC remain relatively constant as the subspace dimension m increases. For
IXMAS, the rates increase gradually with m in the given range.
4. The GDA2 performs poorly in general and degrades quickly as m increases. This can
be ascribed to the properties of the Binet-Cauchy distance explained in Chapter 4.
Due to its product form, the kernel matrix tends toward the identity as the subspace
dimension increases, which was also checked empirically from the data.
5.6 Conclusion<br />
In this chapter I defined the Grassmann kernels for subspace-based learning, and showed<br />
constructions of the Projection kernel and the Binet-Cauchy kernel via isometric embeddings.
Although the embeddings can be used explicitly to represent a subspace as a $D \times D$
projection matrix or a $\binom{D}{m} \times 1$ vector, as in [3], the equivalent kernel representations are
preferred due to their lower storage and computation requirements.
To demonstrate the potential advantages of the Grassmann kernels, I applied the kernel<br />
discriminant analysis algorithm to image databases represented as collections of subspaces.<br />
Despite its surprisingly simple form and usage, the proposed method with the Projection kernel
outperformed the other state-of-the-art discriminant methods on the real data. However,
the Binet-Cauchy kernel, when used in its naive form, is shown to be of limited value for
subspace-based learning problems. There are possibly other Grassmann kernels which are
not derived from the two representative kernels, and discovering them is left as future
work.
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
FDA (Eucl) 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00<br />
GDA (Proj) 96.77 95.70 98.57 99.28 98.92 98.57 97.85 96.77 97.13<br />
GDA (BC) 96.77 95.34 96.77 96.42 83.87 72.76 55.20 48.03 44.09<br />
MSM 87.81 89.96 92.11 91.04 91.04 91.04 91.76 92.11 91.76<br />
cMSM 92.47 96.06 94.98 93.55 94.62 94.98 94.98 96.42 95.70<br />
DCC 54.12 96.06 94.98 95.70 93.91 94.62 96.42 94.98 93.55<br />
Figure 5.3: Yale Face Database: face recognition rates from various discriminant analysis<br />
methods. The two highest rates including ties are highlighted with boldface for each<br />
subspace dimension m.<br />
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
FDA (Eucl) 60.73 60.73 60.73 60.73 60.73 60.73 60.73 60.73 60.73<br />
GDA (Proj) 88.27 74.84 89.77 87.21 91.68 92.54 93.82 93.60 95.31<br />
GDA (BC) 88.27 71.43 82.52 64.82 58.64 47.55 43.07 39.87 36.25<br />
MSM 72.28 66.95 65.03 64.61 64.18 63.97 64.61 64.61 64.39<br />
cMSM 73.13 71.22 67.59 68.23 69.72 69.94 70.15 72.71 72.49<br />
DCC 77.19 78.89 66.52 63.75 64.61 67.59 67.59 67.59 65.03<br />
Figure 5.4: CMU-PIE Database: face recognition rates from various discriminant analysis<br />
methods.<br />
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
FDA (Eucl) 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00<br />
GDA (Proj) 88.75 90.00 96.25 97.50 95.00 95.00 95.00 96.25 96.25<br />
GDA (BC) 88.75 87.50 90.00 81.25 72.50 60.00 51.25 41.25 48.75<br />
MSM 80.00 78.75 82.50 81.25 81.25 82.50 83.75 81.25 83.75<br />
cMSM 88.75 91.25 92.50 93.75 93.75 91.25 92.50 93.75 93.75<br />
DCC 65.00 88.75 83.75 88.75 87.50 87.50 85.00 85.00 85.00<br />
Figure 5.5: ETH-80 Database: object categorization rates from various discriminant analysis<br />
methods.<br />
m=1 m=2 m=3 m=4 m=5<br />
FDA (Eucl) 54.87 54.87 54.87 54.87 54.87<br />
GDA (Proj) 69.09 80.30 84.55 84.24 85.15<br />
GDA (BC) 69.09 60.00 50.91 36.36 25.15<br />
MSM 61.82 65.76 63.03 67.58 66.97<br />
cMSM 63.64 69.39 73.33 77.58 78.48<br />
DCC 38.79 61.82 74.55 76.67 77.58<br />
Figure 5.6: IXMAS Database: action recognition rates from various discriminant analysis<br />
methods.<br />
Chapter 6<br />
EXTENDED GRASSMANN KERNELS
AND PROBABILISTIC DISTANCES
6.1 Introduction<br />
So far I have modeled the data as a set of linear subspaces. To relax this geometric
assumption, let us take a step back and adopt a probabilistic view of the data. Suppose
a set of vectors are i.i.d. samples from an arbitrary probability distribution. Then it is
possible to compare two such distributions of vectors with probabilistic similarity measures,
such as the KL distance1 [47], the Chernoff distance [15], or the Bhattacharyya/Hellinger
distance [10], to name a few [70, 40, 46, 96]. Furthermore, the Bhattacharyya affinity is in
fact a positive definite kernel function on the space of distributions and has nice closed-form
expressions for the exponential family [40].
In this chapter, I investigate the relationship between the Grassmann kernels and the
probabilistic distances. The link is provided by a probabilistic generalization of subspaces,
the Factor Analyzer [22], which is a Gaussian distribution that resembles a pancake.
1 By distance I mean any nonnegative measure of dissimilarity, not necessarily a metric.
The first result I show is that the KL distance is reduced to the Projection kernel under<br />
the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit<br />
and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL<br />
distance, I propose an extension of the Projection kernel which is originally confined to the<br />
set of linear subspaces, to the set of affine as well as scaled subspaces. For this I introduce<br />
the affine Grassmann manifold and kernels.<br />
I demonstrate the extended kernels with Support Vector Machines and Kernel
Discriminant Analysis on synthetic and real image databases. The experiments show the
advantages of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels.
6.2 Analysis of probabilistic distances and kernels<br />
In this section I introduce several well-known probabilistic distances, and establish their<br />
relationships with the Grassmann distances and kernels.<br />
6.2.1 Probabilistic distances and kernels<br />
Various probabilistic distances between distributions have been proposed in the literature.<br />
Some of them yield closed-form expressions for the exponential family and are convenient<br />
for analysis. Below is a short list of those distances.<br />
• KL distance:
\[
J(p_1, p_2) = \int p_1(x) \log\frac{p_1(x)}{p_2(x)}\, dx \qquad (6.1)
\]
The KL distance is probably the most frequently used distance in learning problems.
It is sometimes called the relative entropy and plays a fundamental role in information
theory.
• KL distance (symmetric):
\[
J_{KL}(p_1, p_2) = \int \big[p_1(x) - p_2(x)\big] \log\frac{p_1(x)}{p_2(x)}\, dx \qquad (6.2)
\]
Since the original KL distance is asymmetric, this symmetrized version is often used
instead. I exclusively use the symmetric version in this chapter. This distance is still
not a valid metric.
• Chernoff distance:
\[
J_{Cher}(p_1, p_2) = -\log \int p_1^{\alpha_1}(x)\, p_2^{\alpha_2}(x)\, dx, \quad (\alpha_1 + \alpha_2 = 1,\ \alpha_1, \alpha_2 > 0) \qquad (6.3)
\]
The Chernoff distance is asymmetric. A symmetric version of the distance with
$\alpha_1 = \alpha_2 = 1/2$ is known as the Bhattacharyya distance:
\[
J_{Bhat}(p_1, p_2) = -\log \int \big[p_1(x)\, p_2(x)\big]^{1/2}\, dx \qquad (6.4)
\]
• Hellinger distance:
\[
J_{Hel}(p_1, p_2) = \int \Big(\sqrt{p_1(x)} - \sqrt{p_2(x)}\Big)^2\, dx \qquad (6.5)
\]
The Hellinger distance is directly related to the Bhattacharyya distance by $J_{Hel} =
2(1 - \exp(-J_{Bhat}))$.
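The definitions and the stated Hellinger-Bhattacharyya relation can be checked by simple quadrature on a grid (assuming NumPy; the two univariate Gaussians below are arbitrary examples):

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

def gauss(mu, s):
    # Density of N(mu, s^2) evaluated on the grid.
    return np.exp(-(x - mu) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

p1, p2 = gauss(0.0, 1.0), gauss(2.0, 1.5)
integrate = lambda f: np.sum(f) * dx          # simple quadrature

J_KL   = integrate((p1 - p2) * np.log(p1 / p2))       # symmetric KL (6.2)
k_Bhat = integrate(np.sqrt(p1 * p2))                  # Bhattacharyya affinity
J_Bhat = -np.log(k_Bhat)                              # (6.4)
J_Hel  = integrate((np.sqrt(p1) - np.sqrt(p2)) ** 2)  # (6.5)

# The stated relation between the Hellinger and Bhattacharyya distances.
assert np.isclose(J_Hel, 2 * (1 - np.exp(-J_Bhat)))
assert J_KL > 0 and 0 < k_Bhat < 1
```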
One can also define similarity measures instead of the dissimilarity measures above.
Jebara and Kondor [40] proposed the Probability Product kernel
\[
k_{Prob}(p_1, p_2) = \int p_1^{\alpha}(x)\, p_2^{\alpha}(x)\, dx, \quad (\alpha > 0). \qquad (6.6)
\]
By construction, this kernel is positive definite on the space of normalized probability
distributions [40]. This kernel includes the Bhattacharyya and the Expected Likelihood kernels
as special cases:
• Bhattacharyya kernel ($\alpha = 1/2$):
\[
k_{Bhat}(p_1, p_2) = \int \big[p_1(x)\, p_2(x)\big]^{1/2}\, dx \qquad (6.7)
\]
• Expected Likelihood kernel ($\alpha = 1$):
\[
k_{EL}(p_1, p_2) = \int p_1(x)\, p_2(x)\, dx \qquad (6.8)
\]
The probabilistic distances are closely related to each other. For example, the Hellinger
distance forms a bound on the KL distance [77], and the Bhattacharyya distance and the
KL distance are both instances of the Rényi divergence [63]. However, the behaviors of
the distances are quite different under my data model. I examine the KL distance and the
Probability Product kernel in particular.
6.2.2 Data as Mixture of Factor Analyzers<br />
The probabilistic distances in the previous section are not restricted to specific distributions.
However, I will model the data distribution as a Mixture of Factor Analyzers (MFA) [27].
If there are $i = 1, \ldots, N$ sets in the data, then each set is considered as i.i.d. samples from
the $i$-th factor analyzer
\[
x \sim p_i(x) = \mathcal{N}(u_i, C_i), \qquad C_i = Y_iY_i' + \sigma^2 I_D, \qquad (6.9)
\]
Figure 6.1: Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold
(left), the set of linear subspaces, can alternatively be modeled as the set of flat
($\sigma \to 0$) spheres ($Y_i'Y_i = I_m$) intersecting at the origin ($u_i = 0$). The right figure shows a
general Mixture of Factor Analyzers which is not bound by these conditions.
where $u_i \in \mathbb{R}^D$ is the mean, $Y_i$ is a full-rank $D \times m$ matrix ($D > m$), and $\sigma$ is the ambient
noise level. The factor analyzer model is a practical substitute for a Gaussian distribution
when the dimensionality $D$ of the images is greater than the number of samples $n$ in a set;
otherwise it is impossible to estimate the full covariance $C$, let alone invert it.
More importantly, I use the factor analyzer distribution to provide the link between the
Grassmann manifold and the space of probability distributions. In fact, a linear subspace
can be considered as the 'flattened' ($\sigma \to 0$) limit of a zero-mean ($u_i = 0$), homogeneous
($Y_i'Y_i = I_m$) factor analyzer distribution, as depicted in Figure 6.1.
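A sketch of sampling from a factor analyzer (assuming NumPy; the sizes are arbitrary), which also illustrates why the structured covariance is preferred to the raw sample covariance:

```python
import numpy as np

rng = np.random.default_rng(9)
D, m, s, n = 100, 4, 0.1, 500

u = rng.standard_normal(D)            # mean u_i
Y = rng.standard_normal((D, m))       # factor loadings Y_i

# Sampling x = u + Y z + s * eps with z ~ N(0, I_m), eps ~ N(0, I_D)
# gives x ~ N(u, Y Y' + s^2 I_D) without forming the D x D covariance.
X = u[:, None] + Y @ rng.standard_normal((m, n)) \
    + s * rng.standard_normal((D, n))

# With n = 500 samples in D = 100 dimensions the unconstrained sample
# covariance is a noisy estimate, which is why the structured form
# C = Y Y' + s^2 I_D is fitted instead.
emp_C = np.cov(X)
true_C = Y @ Y.T + s**2 * np.eye(D)
rel_err = np.linalg.norm(emp_C - true_C) / np.linalg.norm(true_C)
assert X.shape == (D, n) and rel_err < 1.0
```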
Some linear algebra<br />
Let us summarize some linear algebraic shortcuts used to analyze the distances. The inversion
lemma will be used several times: for $\sigma > 0$, we have the identity
\[
C_i^{-1} = (Y_iY_i' + \sigma^2 I)^{-1} = \sigma^{-2}\big(I - Y_i(\sigma^2 I + Y_i'Y_i)^{-1}Y_i'\big).
\]
Let $M_1$ and $M_2$ be the $m \times m$ matrices
\[
M_1 = (\sigma^2 I_m + Y_1'Y_1)^{-1}, \qquad M_2 = (\sigma^2 I_m + Y_2'Y_2)^{-1},
\]
and let $\tilde Y_1$ and $\tilde Y_2$ be the matrices
\[
\tilde Y_1 = Y_1 M_1^{1/2}, \qquad \tilde Y_2 = Y_2 M_2^{1/2}.
\]
From the identity we can compute the following:
\[
C_1^{-1}C_2 + C_2^{-1}C_1 = (\sigma^2 I_D + Y_1Y_1')^{-1}(\sigma^2 I_D + Y_2Y_2') + (\sigma^2 I_D + Y_2Y_2')^{-1}(\sigma^2 I_D + Y_1Y_1')
\]
\[
= \sigma^{-2}(I_D - \tilde Y_1\tilde Y_1')(\sigma^2 I_D + Y_2Y_2') + \sigma^{-2}(I_D - \tilde Y_2\tilde Y_2')(\sigma^2 I_D + Y_1Y_1')
\]
\[
= 2I_D - \tilde Y_1\tilde Y_1' - \tilde Y_2\tilde Y_2' + \sigma^{-2}\big(Y_1Y_1' + Y_2Y_2' - \tilde Y_1\tilde Y_1'Y_2Y_2' - \tilde Y_2\tilde Y_2'Y_1Y_1'\big),
\]
\[
(C_1 + C_2)^{-1} = (2\sigma^2 I_D + Y_1Y_1' + Y_2Y_2')^{-1}
= (2\sigma^2)^{-1}(I_D + ZZ')^{-1} = (2\sigma^2)^{-1}\big(I_D - Z(I_{2m} + Z'Z)^{-1}Z'\big),
\]
where $Z = (2\sigma^2)^{-1/2}[Y_1\ Y_2]$,
\[
C_1^{-1} + C_2^{-1} = \sigma^{-2}\big(2I_D - \tilde Y_1\tilde Y_1' - \tilde Y_2\tilde Y_2'\big) = 2\sigma^{-2}(I_D - \tilde Z\tilde Z'),
\]
where $\tilde Z = 2^{-1/2}[\tilde Y_1\ \tilde Y_2]$, and
\[
(C_1^{-1} + C_2^{-1})^{-1} = \frac{\sigma^2}{2}(I_D - \tilde Z\tilde Z')^{-1} = \frac{\sigma^2}{2}\big(I_D + \tilde Z(I_{2m} - \tilde Z'\tilde Z)^{-1}\tilde Z'\big).
\]
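These identities are easy to verify numerically; the following sketch (assuming NumPy; the sizes and noise level are arbitrary) checks the inversion lemma and the two inverse formulas:

```python
import numpy as np

rng = np.random.default_rng(5)
D, m, s2 = 8, 3, 0.1            # s2 plays the role of sigma^2

Y1 = rng.standard_normal((D, m))
Y2 = rng.standard_normal((D, m))
C1 = Y1 @ Y1.T + s2 * np.eye(D)
C2 = Y2 @ Y2.T + s2 * np.eye(D)

# Inversion lemma: C^{-1} = s^{-2}(I - Y (s^2 I + Y'Y)^{-1} Y').
M1 = np.linalg.inv(s2 * np.eye(m) + Y1.T @ Y1)
assert np.allclose(np.linalg.inv(C1), (np.eye(D) - Y1 @ M1 @ Y1.T) / s2)

# (C1 + C2)^{-1} via the 2m-dimensional inverse, Z = (2 s^2)^{-1/2}[Y1 Y2].
Z = np.hstack([Y1, Y2]) / np.sqrt(2 * s2)
rhs = (np.eye(D) - Z @ np.linalg.inv(np.eye(2 * m) + Z.T @ Z) @ Z.T) / (2 * s2)
assert np.allclose(np.linalg.inv(C1 + C2), rhs)

# (C1^{-1} + C2^{-1})^{-1} via Zt = 2^{-1/2}[Y1 M1^{1/2}  Y2 M2^{1/2}].
M2 = np.linalg.inv(s2 * np.eye(m) + Y2.T @ Y2)
def msqrt(M):
    # Symmetric square root of a positive definite matrix.
    w, U = np.linalg.eigh(M)
    return U @ np.diag(np.sqrt(w)) @ U.T
Zt = np.hstack([Y1 @ msqrt(M1), Y2 @ msqrt(M2)]) / np.sqrt(2)
lhs = np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2))
rhs = s2 / 2 * (np.eye(D)
                + Zt @ np.linalg.inv(np.eye(2 * m) - Zt.T @ Zt) @ Zt.T)
assert np.allclose(lhs, rhs)
```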
6.2.3 Analysis of KL distance<br />
The symmetric KL distance between two factor analyzers is as follows:
\[
J_{KL}(p_1, p_2) = \frac{1}{2}\,\mathrm{tr}\big[C_2^{-1}C_1 + C_1^{-1}C_2 - 2I_D\big]
+ \frac{1}{2}(u_1 - u_2)'(C_1^{-1} + C_2^{-1})(u_1 - u_2)
\]
\[
= \frac{1}{2}\,\mathrm{tr}\big({-\tilde Y_1'\tilde Y_1} - \tilde Y_2'\tilde Y_2\big)
+ \frac{\sigma^{-2}}{2}\,\mathrm{tr}\big(Y_1'Y_1 + Y_2'Y_2 - \tilde Y_1'Y_2Y_2'\tilde Y_1 - \tilde Y_2'Y_1Y_1'\tilde Y_2\big)
\]
\[
+ \frac{\sigma^{-2}}{2}(u_1 - u_2)'\big(2I_D - \tilde Y_1\tilde Y_1' - \tilde Y_2\tilde Y_2'\big)(u_1 - u_2). \qquad (6.10)
\]
Furthermore, we can write the distance as
\[
J_{KL}(p_1, p_2) = \frac{1}{2}\,\mathrm{tr}\big({-\tilde Y_1'\tilde Y_1} - \tilde Y_2'\tilde Y_2\big)
+ \frac{\sigma^{-2}}{2}\,\mathrm{tr}\big(Y_1'Y_1 + Y_2'Y_2 - \tilde Y_1'Y_2Y_2'\tilde Y_1 - \tilde Y_2'Y_1Y_1'\tilde Y_2\big)
+ \frac{\sigma^{-2}}{2}\big(2u'u - 2u'\tilde Z\tilde Z'u\big),
\]
where $u = u_1 - u_2$. Note that the computation of the distance involves only products of the
column vectors of $Y_i$ and $u_i$, so we need not handle any $D \times D$ matrix explicitly.
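Formula (6.10) can be checked against the direct evaluation with explicit covariances (assuming NumPy; the explicit $D \times D$ computation below appears only for verification):

```python
import numpy as np

rng = np.random.default_rng(6)
D, m, s2 = 50, 3, 0.5                 # s2 plays the role of sigma^2
Y1, Y2 = rng.standard_normal((2, D, m))
u1, u2 = rng.standard_normal((2, D))
u = u1 - u2

# Direct evaluation with explicit D x D covariances (checking only).
C1 = Y1 @ Y1.T + s2 * np.eye(D)
C2 = Y2 @ Y2.T + s2 * np.eye(D)
iC1, iC2 = np.linalg.inv(C1), np.linalg.inv(C2)
J_direct = 0.5 * np.trace(iC2 @ C1 + iC1 @ C2 - 2 * np.eye(D)) \
         + 0.5 * u @ (iC1 + iC2) @ u

# Evaluation via (6.10): only m x m inverses and inner products of columns.
M1 = np.linalg.inv(s2 * np.eye(m) + Y1.T @ Y1)
M2 = np.linalg.inv(s2 * np.eye(m) + Y2.T @ Y2)
tr = np.trace
J_fast = 0.5 * (-tr(M1 @ Y1.T @ Y1) - tr(M2 @ Y2.T @ Y2)) \
       + 0.5 / s2 * (tr(Y1.T @ Y1) + tr(Y2.T @ Y2)
                     - tr(M1 @ Y1.T @ Y2 @ Y2.T @ Y1)
                     - tr(M2 @ Y2.T @ Y1 @ Y1.T @ Y2)) \
       + 0.5 / s2 * (2 * u @ u - u @ Y1 @ M1 @ (Y1.T @ u)
                                - u @ Y2 @ M2 @ (Y2.T @ u))
assert np.isclose(J_direct, J_fast)
```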
KL in the limit yields the projection kernel<br />
For $u_i = 0$ and $Y_i'Y_i = I_m$, we have $M_i = (\sigma^2 + 1)^{-1} I_m$ and therefore
\[
J_{KL}(p_1, p_2) = \frac{1}{2}\,\mathrm{tr}\big({-\tilde Y_1'\tilde Y_1} - \tilde Y_2'\tilde Y_2\big)
+ \frac{\sigma^{-2}}{2}\,\mathrm{tr}\big(Y_1'Y_1 + Y_2'Y_2 - \tilde Y_1'Y_2Y_2'\tilde Y_1 - \tilde Y_2'Y_1Y_1'\tilde Y_2\big)
\]
\[
= \frac{1}{2}\left(\frac{-2m}{\sigma^2 + 1}\right)
+ \frac{\sigma^{-2}}{2}\left(2m - \frac{2}{\sigma^2 + 1}\,\mathrm{tr}(Y_1'Y_2Y_2'Y_1)\right)
= \frac{1}{2\sigma^2(\sigma^2 + 1)}\big(2m - 2\,\mathrm{tr}(Y_1'Y_2Y_2'Y_1)\big).
\]
We can ignore the multiplicative factor, which does not depend on $Y_1$ or $Y_2$, and rewrite the
distance as
\[
J_{KL}(p_1, p_2) \propto 2m - 2\,\mathrm{tr}(Y_1'Y_2Y_2'Y_1).
\]
One can immediately recognize that this is indeed the definition of the squared Projection
distance $d^2_{Proj}(Y_1, Y_2)$ up to multiplicative factors.
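The identification can be checked numerically; the sketch below (assuming NumPy) evaluates the squared Projection distance in three equivalent forms (note that conventions for $d_{Proj}$ may differ by a constant factor):

```python
import numpy as np

rng = np.random.default_rng(7)
D, m = 6, 2
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

cos_t = np.linalg.svd(Y1.T @ Y2, compute_uv=False)   # cos(theta_i)

# Squared Projection distance in three equivalent forms:
d2_trace  = 2 * m - 2 * np.trace(Y1.T @ Y2 @ Y2.T @ Y1)     # trace form
d2_angles = 2 * np.sum(np.sin(np.arccos(np.clip(cos_t, 0, 1))) ** 2)
d2_frob   = np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, 'fro') ** 2

assert np.isclose(d2_trace, d2_angles)   # = 2 sum_i sin^2(theta_i)
assert np.isclose(d2_trace, d2_frob)     # = ||P1 - P2||_F^2
```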
6.2.4 Analysis of Probability Product Kernel<br />
The Probability Product kernel for Gaussian distributions is [40]
\[
k_{Prob}(p_1, p_2) = (2\pi)^{(1-2\alpha)D/2}\, \det(C^\dagger)^{1/2} \det(C_1)^{-\alpha/2} \det(C_2)^{-\alpha/2}
\times \exp\Big({-\tfrac{1}{2}}\big[\alpha u_1'C_1^{-1}u_1 + \alpha u_2'C_2^{-1}u_2 - (u^\dagger)'C^\dagger u^\dagger\big]\Big), \qquad (6.11)
\]
where $C^\dagger = \alpha^{-1}(C_1^{-1} + C_2^{-1})^{-1}$ and $u^\dagger = \alpha(C_1^{-1}u_1 + C_2^{-1}u_2)$.
To compute the determinant terms for Factor Analyzers, we use the following identity:
if $A$ and $B$ are $D \times m$ matrices, then
\[
\det(I_D + AB') = \det(I_m + B'A) = \prod_{i=1}^{m}\bigl(1 + \tau_i(B'A)\bigr), \tag{6.12}
\]
where $\tau_i$ is the $i$-th singular value of $B'A$. Using the identity we can write the following.
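Identity (6.12) is the standard determinant lemma, and it is what lets the $D \times D$ determinants below collapse to $m \times m$ ones. A quick numerical check (my sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
D, m = 8, 3
A = rng.standard_normal((D, m))
B = rng.standard_normal((D, m))

lhs = np.linalg.det(np.eye(D) + A @ B.T)  # D x D determinant
rhs = np.linalg.det(np.eye(m) + B.T @ A)  # m x m determinant: much cheaper when m << D

assert np.isclose(lhs, rhs)
```

In general the product form of (6.12) holds with the eigenvalues of $B'A$; for the symmetric positive semidefinite case used below (e.g. $B'A = \tilde{Y}'\tilde{Y}$) the eigenvalues coincide with the singular values $\tau_i$.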
\[
\begin{aligned}
\det C_1^{-1} &= \det\bigl(\sigma^{-2}(I_D - \tilde{Y}_1\tilde{Y}_1')\bigr) = \sigma^{-2D}\det(I_m - \tilde{Y}_1'\tilde{Y}_1) = \sigma^{-2D}\prod_{i=1}^{m}\bigl(1 - \tau_i(\tilde{Y}_1'\tilde{Y}_1)\bigr), \\
\det C_2^{-1} &= \det\bigl(\sigma^{-2}(I_D - \tilde{Y}_2\tilde{Y}_2')\bigr) = \sigma^{-2D}\det(I_m - \tilde{Y}_2'\tilde{Y}_2) = \sigma^{-2D}\prod_{i=1}^{m}\bigl(1 - \tau_i(\tilde{Y}_2'\tilde{Y}_2)\bigr), \\
\det C^{\dagger} &= \det\Bigl(\sigma^{2}(2\alpha)^{-1}\bigl(I_D + \tilde{Z}(I_{2m} - \tilde{Z}'\tilde{Z})^{-1}\tilde{Z}'\bigr)\Bigr) \\
&= \sigma^{2D}(2\alpha)^{-D}\det\bigl(I_{2m} + (I_{2m} - \tilde{Z}'\tilde{Z})^{-1}\tilde{Z}'\tilde{Z}\bigr) = \sigma^{2D}(2\alpha)^{-D}\det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1} \\
&= \sigma^{2D}(2\alpha)^{-D}\prod_{i=1}^{2m}\bigl(1 - \tau_i(\tilde{Z}'\tilde{Z})\bigr)^{-1}.
\end{aligned}
\]
To compute the exponents in (6.11) we need the following identities:
\[
\begin{aligned}
C_1^{-1}C^{\dagger}C_2^{-1} &= C_1^{-1}(C_1^{-1} + C_2^{-1})^{-1}C_2^{-1} = C_2^{-1}(C_1^{-1} + C_2^{-1})^{-1}C_1^{-1} \\
&= (C_1 + C_2)^{-1} = (2\sigma^2)^{-1}\bigl(I_D - Z(I_{2m} + Z'Z)^{-1}Z'\bigr), \\
C_1^{-1}C^{\dagger}C_1^{-1} &= C_1^{-1}(C_1^{-1} + C_2^{-1})^{-1}C_1^{-1} = C_1^{-1} - (C_1 + C_2)^{-1} \\
&= \frac{\sigma^{-2}}{2}\bigl(2I_D - 2\tilde{Y}_1\tilde{Y}_1' - I_D + Z(I_{2m} + Z'Z)^{-1}Z'\bigr) \\
&= \frac{\sigma^{-2}}{2}\bigl(I_D - 2\tilde{Y}_1\tilde{Y}_1' + Z(I_{2m} + Z'Z)^{-1}Z'\bigr), \\
C_2^{-1}C^{\dagger}C_2^{-1} &= C_2^{-1}(C_1^{-1} + C_2^{-1})^{-1}C_2^{-1} = C_2^{-1} - (C_1 + C_2)^{-1} \\
&= \frac{\sigma^{-2}}{2}\bigl(2I_D - 2\tilde{Y}_2\tilde{Y}_2' - I_D + Z(I_{2m} + Z'Z)^{-1}Z'\bigr) \\
&= \frac{\sigma^{-2}}{2}\bigl(I_D - 2\tilde{Y}_2\tilde{Y}_2' + Z(I_{2m} + Z'Z)^{-1}Z'\bigr).
\end{aligned}
\]
Plugging these results back into (6.11), we can again compute the kernel without handling
any $D \times D$ matrix. For concreteness I derive the Bhattacharyya kernel as an instance of the
probability product kernel with $\alpha = 1/2$ as follows:
\[
\begin{aligned}
k_{Bhat}(p_1, p_2) &= \det(C^{\dagger})^{1/2}\det(C_1)^{-1/4}\det(C_2)^{-1/4}\exp\left[-\frac{1}{4}(u_1 - u_2)'(C_1 + C_2)^{-1}(u_1 - u_2)\right] \\
&= \det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1/2}\det(I_m - \tilde{Y}_1'\tilde{Y}_1)^{1/4}\det(I_m - \tilde{Y}_2'\tilde{Y}_2)^{1/4} \\
&\quad \times \exp\left[-\frac{\sigma^{-2}}{4}(u_1 - u_2)'\bigl(I_D - Z(I_{2m} + Z'Z)^{-1}Z'\bigr)(u_1 - u_2)\right]. \tag{6.13}
\end{aligned}
\]
Probability product kernel in the limit becomes trivial

For $u_i = 0$ and $Y_i'Y_i = I_m$, we have
\[
\begin{aligned}
k_{Prob}(p_1, p_2) &= (2\pi)^{(1-2\alpha)D/2}\det(C^{\dagger})^{1/2}\det(C_1)^{-\alpha/2}\det(C_2)^{-\alpha/2} \\
&= (2\pi)^{(1-2\alpha)D/2}\sigma^{D}(2\alpha)^{-D/2}\det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1/2} \\
&\quad \times \sigma^{-\alpha D}\det(I_m - \tilde{Y}_1'\tilde{Y}_1)^{\alpha/2}\,\sigma^{-\alpha D}\det(I_m - \tilde{Y}_2'\tilde{Y}_2)^{\alpha/2} \\
&= (2\pi)^{(1-2\alpha)D/2}\sigma^{D}(2\alpha)^{-D/2}\,\frac{\sigma^{2\alpha(m-D)}}{(\sigma^2 + 1)^{\alpha m}}\det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1/2} \\
&= \pi^{(1-2\alpha)D/2}\,2^{-\alpha D}\alpha^{-D/2}\,\frac{\sigma^{2\alpha(m-D)+D}}{(\sigma^2 + 1)^{\alpha m}}\det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1/2},
\end{aligned}
\]
and furthermore,
\[
\begin{aligned}
\det(I_{2m} - \tilde{Z}'\tilde{Z})^{-1/2}
&= \det\left[I_{2m} - \frac{1}{2}\begin{pmatrix} \tilde{Y}_1'\tilde{Y}_1 & \tilde{Y}_1'\tilde{Y}_2 \\ \tilde{Y}_2'\tilde{Y}_1 & \tilde{Y}_2'\tilde{Y}_2 \end{pmatrix}\right]^{-1/2} \\
&= \det\left[I_{2m} - \frac{1}{2(\sigma^2 + 1)}\begin{pmatrix} I_m & Y_1'Y_2 \\ Y_2'Y_1 & I_m \end{pmatrix}\right]^{-1/2} \\
&= \left(\frac{2(\sigma^2 + 1)}{2\sigma^2 + 1}\right)^{m}\det\begin{pmatrix} I_m & -\frac{1}{2\sigma^2 + 1}Y_1'Y_2 \\ -\frac{1}{2\sigma^2 + 1}Y_2'Y_1 & I_m \end{pmatrix}^{-1/2} \\
&= \left(\frac{2(\sigma^2 + 1)}{2\sigma^2 + 1}\right)^{m}\det\left(I_m - \frac{1}{(2\sigma^2 + 1)^2}Y_1'Y_2Y_2'Y_1\right)^{-1/2}.
\end{aligned}
\]
Ignoring the terms which are not functions of $Y_1$ or $Y_2$, we have
\[
k_{Prob}(Y_1, Y_2) \propto \det\left(I_m - \frac{1}{(2\sigma^2 + 1)^2}Y_1'Y_2Y_2'Y_1\right)^{-1/2}.
\]
Suppose the two subspaces span($Y_1$) and span($Y_2$) intersect only at the origin, that is,
the singular values of $Y_1'Y_2$ are strictly less than 1. In this case $k_{Prob}$ has a finite value as
$\sigma \to 0$ and the inversion is well-defined. In contrast, the diagonal terms of $k_{Prob}$ become
\[
k_{Prob}(Y_1, Y_1) = \det\left[\left(1 - \frac{1}{(2\sigma^2 + 1)^2}\right)I_m\right]^{-1/2} = \left(\frac{(2\sigma^2 + 1)^2}{4\sigma^2(\sigma^2 + 1)}\right)^{m/2}, \tag{6.14}
\]
which diverges to infinity as $\sigma \to 0$. This implies that after the kernel is normalized by the
diagonal terms, it becomes a trivial kernel:
\[
\tilde{k}_{Prob}(Y_i, Y_j) \to \begin{cases} 1, & \mathrm{span}(Y_i) = \mathrm{span}(Y_j) \\ 0, & \text{otherwise} \end{cases} \quad \text{as } \sigma \to 0. \tag{6.15}
\]
As I claimed earlier, the Probability Product kernel, including the Bhattacharyya kernel,<br />
loses its discriminating power as the Gaussian distributions become flatter.<br />
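This degeneration can be seen numerically. The sketch below uses the standard closed-form Bhattacharyya coefficient for zero-mean Gaussians (rather than the reduced formula above); its diagonal is exactly 1, so the off-diagonal values vanishing as $\sigma \to 0$ is exactly the triviality in (6.15):

```python
import numpy as np

rng = np.random.default_rng(2)
D, m = 10, 2

Y1 = np.linalg.qr(rng.standard_normal((D, m)))[0][:, :m]
Y2 = np.linalg.qr(rng.standard_normal((D, m)))[0][:, :m]

def bhattacharyya(C1, C2):
    # Closed-form Bhattacharyya coefficient of two zero-mean Gaussians:
    # BC = det(C1)^{1/4} det(C2)^{1/4} / det((C1 + C2)/2)^{1/2}
    _, l1 = np.linalg.slogdet(C1)
    _, l2 = np.linalg.slogdet(C2)
    _, lm = np.linalg.slogdet((C1 + C2) / 2)
    return np.exp(0.25 * l1 + 0.25 * l2 - 0.5 * lm)

k = {}
for sigma in [1.0, 0.1, 0.01]:
    C1 = Y1 @ Y1.T + sigma**2 * np.eye(D)
    C2 = Y2 @ Y2.T + sigma**2 * np.eye(D)
    assert np.isclose(bhattacharyya(C1, C1), 1.0)  # the diagonal is always 1
    k[sigma] = bhattacharyya(C1, C2)

# Off-diagonal values shrink toward 0 as the Gaussians flatten
assert k[0.01] < k[0.1] < k[1.0]
```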
6.3 Extended Grassmann Kernel<br />
In the previous section I presented the probabilistic interpretation of the Projection kernel.<br />
Based on this analysis, I propose extensions of the Projection kernel and make the kernels<br />
applicable to more general data. In this section I examine the two directions of extension:<br />
from linear to affine subspaces, and from homogeneous to scaled subspaces.<br />
6.3.1 Motivation<br />
The motivations for considering affine subspaces and non-homogeneous subspaces arise
from observing the subspaces computed from real data. Firstly, sets of images, for
example from the Yale Face database, have nonzero means. If the mean differs significantly
from set to set, we want to use the mean image as well as the PCA basis images to
represent a set. Secondly, the eigenvalues from PCA are almost always non-homogeneous.
It is likely that the eigenvector direction corresponding to a larger eigenvalue is
Figure 6.2: The Mixture of Factor Analyzers model of the Grassmann manifold is the collection
of linear homogeneous Factor Analyzers, shown as flat spheres intersecting at the origin
(A. Linear). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B. Affine), and also to
allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C. Scaled).
more important than the eigenvector direction corresponding to a smaller eigenvalue. In
that case we want to consider the eigenvalue scales as well as the eigenvectors when
representing the set.
These two extensions are naturally derived from the probabilistic generalization of subspaces.
Figure 6.2 illustrates the ideas. Considering the data as an MFA distribution, we
can gradually relax the zero-mean condition ($u_i = 0$) in Figure 6.2A to the nonzero-mean
condition (arbitrary $u_i$) in Figure 6.2B, and furthermore relax the homogeneity condition ($Y'Y = I$)
to the non-homogeneous condition ($Y'Y$ full rank) in Figure 6.2C.

From this I expect to benefit from both worlds: probabilistic distributions and geometric
manifolds. However, simply relaxing the conditions and taking the limit $\sigma \to 0$ of
the KL distance does not guarantee a metric or a positive definite kernel, as we will shortly
examine. Certain compromises have to be made to turn the limiting KL distance into
a well-defined and usable kernel function. In the following sections I propose new frameworks
for the extensions and the technical details for making valid kernels.
6.3.2 Extension to affine subspaces<br />
An affine subspace in R D is simply a linear subspace with an offset. In that sense a linear<br />
subspace is an affine subspace with a zero offset.<br />
The affine span is an analog of a linear span. Let $Y \in \mathbb{R}^{D \times m}$ be an orthonormal basis
matrix for a subspace, and let $u \in \mathbb{R}^D$ denote the offset of the subspace from the origin. The
affine span can then be defined as
\[
\mathrm{aspan}(Y, u) = \{x \mid x = Yv + u,\ \forall v \in \mathbb{R}^m\}. \tag{6.16}
\]
This representation of an affine span is not unique, since different $Y$'s can share the same
linear span, and different offsets $u$'s can imply the same amount of bias. Formally, this can
be expressed as an equivalence relation:

Definition 6.1. $\mathrm{aspan}(Y_1, u_1) = \mathrm{aspan}(Y_2, u_2)$ if and only if
\[
\mathrm{span}(Y_1) = \mathrm{span}(Y_2) \quad \text{and} \quad Y_1^{\perp}(Y_1^{\perp})'u_1 = Y_2^{\perp}(Y_2^{\perp})'u_2,
\]
where $Y^{\perp}$ is any orthonormal basis for the orthogonal complement of $\mathrm{span}(Y)$, that is,
$YY' + Y^{\perp}(Y^{\perp})' = I_D$.

Although $Y$ is not unique, one can choose a unique 'standard' offset $\hat{u}$ by
\[
\hat{u} = (I_D - \tilde{Y}\tilde{Y}')u = \tilde{Y}^{\perp}(\tilde{Y}^{\perp})'u, \tag{6.17}
\]
which has the shortest distance from the origin to the affine span (refer to Figure 6.3).
Figure 6.3: The same affine span can be expressed with different offsets u1, u2, ... However,<br />
one can use the unique ‘standard’ offset û, which has the shortest length from the origin.<br />
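The standard offset is straightforward to compute, and it is the same for every representation of a given affine span. A small NumPy sketch (names are mine) verifying this:

```python
import numpy as np

rng = np.random.default_rng(3)
D, m = 6, 2

Y = np.linalg.qr(rng.standard_normal((D, m)))[0][:, :m]  # orthonormal basis
u = rng.standard_normal(D)                               # arbitrary offset

def standard_offset(Y, u):
    # u_hat = (I - Y Y') u : the component of u orthogonal to span(Y)
    return u - Y @ (Y.T @ u)

# Another representation of the SAME affine span:
# rotate the basis and shift the offset within the subspace
R = np.linalg.qr(rng.standard_normal((m, m)))[0]  # random orthogonal m x m matrix
v = rng.standard_normal(m)
Y2, u2 = Y @ R, u + Y @ v

assert np.allclose(standard_offset(Y, u), standard_offset(Y2, u2))
```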
Affine Grassmann manifold<br />
I define an affine Grassmann manifold analogous to the linear Grassmann manifold,<br />
Definition 6.2. The affine Grassmann manifold AG(m, D) is the set of m-dimensional<br />
affine subspaces of the R D .<br />
The set of all $m$-dimensional affine subspaces in $\mathbb{R}^D$ is a smooth non-compact manifold that
can be defined as the following quotient space, similarly to the Grassmann manifold:
\[
AG(m, D) = E(D)/E(m) \times O(D - m), \tag{6.18}
\]
where $E$ is the Euclidean group. To see this, let
\[
X = \begin{pmatrix} Y & Y^{\perp} & u \\ 0 & 0 & 1 \end{pmatrix}
\]
be the homogeneous space representation of $\mathrm{aspan}(Y, u)$ in $E(D)$.
Then the only subgroup of $E(D)$ that leaves $\mathrm{aspan}(Y, u)$ unchanged under right-multiplication
is the set of matrices of the form
\[
\begin{pmatrix} R_m & 0 & v \\ 0 & R_{D-m} & 0 \\ 0 & 0 & 1 \end{pmatrix} \in E(m) \times O(D - m),
\]
where $R_m$ and $R_{D-m}$ are any two matrices in $O(m)$ and $O(D - m)$ respectively, and
$v \in \mathbb{R}^m$ is any vector.

To check this, note that
\[
X\begin{pmatrix} R_m & 0 & v \\ 0 & R_{D-m} & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} Y & Y^{\perp} & u \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} R_m & 0 & v \\ 0 & R_{D-m} & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} YR_m & Y^{\perp}R_{D-m} & Yv + u \\ 0 & 0 & 1 \end{pmatrix},
\]
which has $\mathrm{aspan}(YR_m, Yv + u)$, the same as $\mathrm{aspan}(Y, u)$ by Definition 6.1.
Affine Grassmann kernel<br />
Similarly to the definition of Grassmann kernels in Definition 5.1, we can now define the<br />
affine Grassmann kernel as follows.<br />
Definition 6.3. Let $k : (\mathbb{R}^{D \times m} \times \mathbb{R}^{D}) \times (\mathbb{R}^{D \times m} \times \mathbb{R}^{D}) \to \mathbb{R}$ be a real-valued symmetric
function, $k(Y_1, u_1, Y_2, u_2) = k(Y_2, u_2, Y_1, u_1)$. The function $k$ is an affine Grassmann kernel
if it is 1) positive definite and 2) invariant to different representations:
\[
k(Y_1, u_1, Y_2, u_2) = k(Y_3, u_3, Y_4, u_4), \quad \forall Y_1, Y_2, Y_3, Y_4,\ u_1, u_2, u_3, u_4
\]
if $\mathrm{aspan}(Y_1, u_1) = \mathrm{aspan}(Y_3, u_3)$ and $\mathrm{aspan}(Y_2, u_2) = \mathrm{aspan}(Y_4, u_4)$.
With this definition we can check if the KL distance in the limit suggests an affine<br />
Grassmann kernel.<br />
KL distance in the limit

The KL distance with only the homogeneity condition $Y_1'Y_1 = Y_2'Y_2 = I_m$ becomes
\[
\begin{aligned}
J_{KL}(p_1, p_2) ={} & \frac{1}{2\sigma^2(\sigma^2 + 1)}\bigl(2m - 2\operatorname{tr}(Y_1'Y_2Y_2'Y_1)\bigr) \\
& + \frac{1}{2\sigma^2(\sigma^2 + 1)}(u_1 - u_2)'\bigl(2(\sigma^2 + 1)I_D - Y_1Y_1' - Y_2Y_2'\bigr)(u_1 - u_2) \\
\to{} & \frac{1}{2\sigma^2}\bigl[2m - 2\operatorname{tr}(Y_1'Y_2Y_2'Y_1) + (u_1 - u_2)'(2I_D - Y_1Y_1' - Y_2Y_2')(u_1 - u_2)\bigr].
\end{aligned}
\]
If the multiplicative factor is ignored, the first term is the same as in the zero-mean case, which
I denote as the 'linear' kernel
\[
k_{Lin}(Y_1, Y_2) = \operatorname{tr}(Y_1Y_1'Y_2Y_2') = k_{Proj}(Y_1, Y_2).
\]
The second term,
\[
k_u(Y_1, u_1, Y_2, u_2) = u_1'(2I_D - Y_1Y_1' - Y_2Y_2')u_2,
\]
measures the similarity of the means scaled by the matrix $2I - Y_1Y_1' - Y_2Y_2'$. However, this
term does not satisfy the affine invariance condition of Definition 6.3. Note that the term
$k_u$ can be expressed as
\[
k_u(Y_1, u_1, Y_2, u_2) = \hat{u}_1'u_2 + \hat{u}_2'u_1
\]
with the standard offset notation. From this observation, I propose the following modification:

Theorem 6.4.
\[
k_u(Y_1, u_1, Y_2, u_2) = u_1'(I - Y_1Y_1')(I - Y_2Y_2')u_2 = \hat{u}_1'\hat{u}_2
\]
is an affine Grassmann kernel.
Proof. 1. Invariance: if $\mathrm{aspan}(Y_1, u_1) = \mathrm{aspan}(Y_3, u_3)$ and $\mathrm{aspan}(Y_2, u_2) = \mathrm{aspan}(Y_4, u_4)$,
then $Y_1Y_1' = Y_3Y_3'$, $Y_2Y_2' = Y_4Y_4'$, $Y_1^{\perp}(Y_1^{\perp})'u_1 = Y_3^{\perp}(Y_3^{\perp})'u_3$, and
$Y_2^{\perp}(Y_2^{\perp})'u_2 = Y_4^{\perp}(Y_4^{\perp})'u_4$, and therefore
\[
k(Y_1, u_1, Y_2, u_2) = u_1'(I - Y_1Y_1')(I - Y_2Y_2')u_2
= u_3'(I - Y_3Y_3')(I - Y_4Y_4')u_4 = k(Y_3, u_3, Y_4, u_4).
\]
2. Positive definiteness:
\[
\sum_{i,j} c_i c_j\, u_i'(I - Y_iY_i')(I - Y_jY_j')u_j = \Bigl\|\sum_i c_i (I - Y_iY_i')u_i\Bigr\|^2 \ge 0.
\]

Combined with the linear term $k_{Lin}$, the modified term defines the new 'affine' kernel:
\[
k_{Aff}(Y_1, u_1, Y_2, u_2) = \operatorname{tr}(Y_1Y_1'Y_2Y_2') + u_1'(I - Y_1Y_1')(I - Y_2Y_2')u_2. \tag{6.19}
\]
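Both properties of (6.19) are easy to spot-check numerically (my sketch; names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
D, m = 6, 2

def k_aff(Y1, u1, Y2, u2):
    # Affine kernel (6.19): subspace term + standard-offset inner product
    lin = np.trace(Y1 @ Y1.T @ Y2 @ Y2.T)
    uh1 = u1 - Y1 @ (Y1.T @ u1)
    uh2 = u2 - Y2 @ (Y2.T @ u2)
    return lin + uh1 @ uh2

def rand_basis():
    return np.linalg.qr(rng.standard_normal((D, m)))[0][:, :m]

Y1, Y2 = rand_basis(), rand_basis()
u1, u2 = rng.standard_normal(D), rng.standard_normal(D)

# Invariance: re-represent aspan(Y1, u1) with a rotated basis and shifted offset
R = np.linalg.qr(rng.standard_normal((m, m)))[0]
v = rng.standard_normal(m)
assert np.isclose(k_aff(Y1, u1, Y2, u2), k_aff(Y1 @ R, u1 + Y1 @ v, Y2, u2))

# Positive definiteness: a random Gram matrix has nonnegative eigenvalues
data = [(rand_basis(), rng.standard_normal(D)) for _ in range(8)]
K = np.array([[k_aff(Ya, ua, Yb, ub) for (Yb, ub) in data] for (Ya, ua) in data])
assert np.linalg.eigvalsh(K).min() > -1e-9
```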
General construction<br />
The limit of the KL distance with nonzero means has two terms: a $u$-related and a $Y$-related one.
This suggests a general construction rule for affine kernels: if one has two separate
positive definite kernels, one for the means and one for the subspaces, one can add or multiply them together to
construct a new kernel. The various ways of generating new kernels from known kernels in
Theorem 2.7 can be used to create novel kernels for affine subspaces.
Basri’s embedding<br />
There are alternatives for representing affine spans. One representation proposed by Basri
et al. [3] is to use the pair $(Y^{\perp}, t)$ instead of $(Y, u)$, where $t$ is related to $u$ by $u = Y^{\perp}t$.
The authors embed an affine subspace into a Euclidean space of dimension $(D+1)^2$ by the
following injective map:
\[
\Psi : \mathrm{aspan}(Y, u) \to \mathbb{R}^{(D+1)\times(D+1)}, \quad (Y, u) \mapsto
\begin{pmatrix} Y^{\perp}(Y^{\perp})' & Y^{\perp}t \\ t'(Y^{\perp})' & t't \end{pmatrix}. \tag{6.20}
\]
This embedding is a direct analogue of the isometric embedding of linear subspaces to
projection matrices in Theorem 5.2.

The authors did not mention or use kernel methods in the paper. However, their proposed
embedding has a natural corresponding kernel:
\[
\begin{aligned}
k(Y_1^{\perp}, t_1, Y_2^{\perp}, t_2)
&= \operatorname{tr}\left[\begin{pmatrix} Y_1^{\perp}(Y_1^{\perp})' & Y_1^{\perp}t_1 \\ (Y_1^{\perp}t_1)' & t_1't_1 \end{pmatrix}
\begin{pmatrix} Y_2^{\perp}(Y_2^{\perp})' & Y_2^{\perp}t_2 \\ (Y_2^{\perp}t_2)' & t_2't_2 \end{pmatrix}\right] \\
&= \operatorname{tr}\bigl[Y_1^{\perp}(Y_1^{\perp})'Y_2^{\perp}(Y_2^{\perp})'\bigr] + 2t_1'(Y_1^{\perp})'Y_2^{\perp}t_2 + t_1't_1\,t_2't_2.
\end{aligned}
\]
Figure 6.4: Homogeneous vs scaled subspaces. Two 2-dimensional Gaussians that
span almost the same 2-dimensional space and have almost the same means are considered
similar as two representations of linear subspaces (Left). However, the probabilistic distance
between two Gaussians also depends on scale and eccentricity: the distance can be quite
large if the Gaussians are non-homogeneous (Right).
Although this is another valid kernel, it does not admit a probabilistic interpretation. Furthermore,
their representation requires the $D \times (D - m)$ matrices $Y_i^{\perp}$, which is more costly in
storage and computation than the representation used in this thesis, since typically $m \ll D$.
6.3.3 Extension to scaled subspaces<br />
So far I have assumed that the subspaces are homogeneous, $Y'Y = I_m$; that is, there is no
preferred direction within the subspace. However, even if two subspaces have the same
linear or affine span, one can further distinguish them by allowing scales or
orientations within the subspaces, as illustrated in Figure 6.4.

With the relaxation of $Y$ to any $D \times m$ full-rank matrix, the term subspace is no longer
applicable in a strict sense. Nevertheless, let's refer to the relaxation as a 'scaled' subspace
and the corresponding kernel as the 'scaled' Grassmann kernel, in conformity with the previous
usage in this thesis.

A scaled subspace has the same Euclidean representation $(Y, u)$ as before but a
different equivalence relation. Let $Y_1$ and $Y_2$ be any $D \times m$ full-rank matrices, and let $u_1$ and
$u_2$ be offsets. The equivalence is then defined by
\[
(Y_1, u_1) \cong (Y_2, u_2) \ \text{ if and only if } \ Y_1Y_1' = Y_2Y_2' \ \text{ and } \ u_1 = u_2. \tag{6.21}
\]
The scaled subspace is in one-to-one correspondence with the Cartesian product $M_{D,m} \times \mathbb{R}^D$,
where $M_{D,m}$ is the set of $D \times D$ symmetric positive semidefinite matrices of rank $m$,
via the embedding
\[
\Psi : (Y, u) \mapsto [\,YY' \mid u\,] \in \mathbb{R}^{D \times (D+1)}.
\]
However, the topology and the metric from this embedding do not have a probabilistic
motivation, similarly to Basri's embedding (6.20). I instead examine the limit of the KL
distance and construct a positive definite kernel satisfying the invariance condition (6.21).
KL distance in the limit<br />
To incorporate these scales into affine subspaces, I allow the product $Y'Y$ to be a non-identity
matrix and make sure that the definition of the kernel remains valid and consistent.

Let $Y_i$ be full-rank but not necessarily orthonormal. In this case the KL distance becomes
\[
\begin{aligned}
J_{KL}(p_1, p_2) \to{} & \frac{1}{2\sigma^2}\operatorname{tr}\bigl(Y_1Y_1' + Y_2Y_2' - \tilde{Y}_1\tilde{Y}_1'Y_2Y_2' - \tilde{Y}_2\tilde{Y}_2'Y_1Y_1'\bigr) \\
& + \frac{1}{2\sigma^2}(u_1 - u_2)'\bigl(2I_D - \tilde{Y}_1\tilde{Y}_1' - \tilde{Y}_2\tilde{Y}_2'\bigr)(u_1 - u_2),
\end{aligned}
\]
where $\tilde{Y}_i$ denotes the $\sigma \to 0$ limit of the orthonormalization of $Y_i$,
\[
\tilde{Y}_i = Y_i(Y_i'Y_i)^{-1/2}.
\]
Ignoring the multiplicative factors, we can see that the corresponding form
\[
k = \frac{1}{2}\operatorname{tr}\bigl(\tilde{Y}_1\tilde{Y}_1'Y_2Y_2' + Y_1Y_1'\tilde{Y}_2\tilde{Y}_2'\bigr) + u_1'(2I - \tilde{Y}_1\tilde{Y}_1' - \tilde{Y}_2\tilde{Y}_2')u_2
\]
is again not a well-defined kernel.

The first term,
\[
\frac{1}{2}\operatorname{tr}\bigl(\tilde{Y}_1\tilde{Y}_1'Y_2Y_2' + Y_1Y_1'\tilde{Y}_2\tilde{Y}_2'\bigr),
\]
is not positive definite, and there are several ways to remedy it:
• Fully unnormalized: $k(Y_1, Y_2) = \operatorname{tr}(Y_1Y_1'Y_2Y_2')$

• Partially normalized: $k(Y_1, Y_2) = \operatorname{tr}(Y_1\tilde{Y}_1'\tilde{Y}_2Y_2') = \operatorname{tr}(\tilde{Y}_1Y_1'\tilde{Y}_2Y_2') = \operatorname{tr}(\tilde{Y}_1Y_1'Y_2\tilde{Y}_2')$

• Fully normalized: $k(Y_1, Y_2) = \operatorname{tr}(\tilde{Y}_1\tilde{Y}_1'\tilde{Y}_2\tilde{Y}_2')$

I use the partially normalized form² since it scales in the same way as the original form under a global
scaling factor multiplied to the $Y$'s.
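The scaling claim can be checked numerically. In this sketch (names are mine), both the original form from the KL limit and the partially normalized form are homogeneous of degree 2 under a global rescaling $Y \to cY$, whereas the fully unnormalized and fully normalized forms scale with degrees 4 and 0:

```python
import numpy as np

rng = np.random.default_rng(5)
D, m, c = 7, 2, 3.0

Y1 = rng.standard_normal((D, m))
Y2 = rng.standard_normal((D, m))

def orthonormalize(Y):
    # Y_tilde = Y (Y'Y)^{-1/2}, computed stably via the thin SVD
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt

def k_orig(Y1, Y2):
    # The (non-positive-definite) first term of the KL limit
    T1, T2 = orthonormalize(Y1), orthonormalize(Y2)
    return 0.5 * np.trace(T1 @ T1.T @ Y2 @ Y2.T + Y1 @ Y1.T @ T2 @ T2.T)

def k_part(Y1, Y2):
    # The partially normalized variant
    T1, T2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Y1 @ T1.T @ T2 @ Y2.T)

# Both scale by c**2 under the global rescaling Y -> c Y
assert np.isclose(k_orig(c * Y1, c * Y2), c**2 * k_orig(Y1, Y2))
assert np.isclose(k_part(c * Y1, c * Y2), c**2 * k_part(Y1, Y2))
```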
The second term,
\[
u_1'(2I - \tilde{Y}_1\tilde{Y}_1' - \tilde{Y}_2\tilde{Y}_2')u_2,
\]
is the same as in the affine case, and we also have several ways to make it well-defined and
positive definite:

• Affine invariant: $k_u(Y_1, u_1, Y_2, u_2) = u_1'(I - \tilde{Y}_1\tilde{Y}_1')(I - \tilde{Y}_2\tilde{Y}_2')u_2 = \hat{u}_1'\hat{u}_2$

• Direct inner product: $k_u(Y_1, u_1, Y_2, u_2) = u_1'u_2$.

Since the affine invariance condition is irrelevant for scaled subspaces (Figure 6.3), the
direct inner product form is a better choice.
Finally, the sum of the two modified terms is the 'affine scaled' kernel I propose:

Theorem 6.5.
\[
k_{AffSc}(Y_1, u_1, Y_2, u_2) = \operatorname{tr}(Y_1\tilde{Y}_1'\tilde{Y}_2Y_2') + u_1'u_2 \tag{6.22}
\]
is a positive definite kernel for scaled subspaces.

²However, these kernels showed similar results in preliminary experiments.
Proof. The term $u_1'u_2$ is obviously well-defined and positive definite, so let's look at only
the first term.

1. Well-definedness: let's show that if $Y_1Y_1' = Y_3Y_3'$ then $Y_1\tilde{Y}_1' = Y_3\tilde{Y}_3'$. Take squares of
the second equation to see that
\[
(Y_1\tilde{Y}_1')^2 = Y_1(Y_1'Y_1)^{-1/2}Y_1'Y_1(Y_1'Y_1)^{-1/2}Y_1' = Y_1Y_1' = Y_3Y_3'
= Y_3(Y_3'Y_3)^{-1/2}Y_3'Y_3(Y_3'Y_3)^{-1/2}Y_3' = (Y_3\tilde{Y}_3')^2.
\]
Since $Y_1\tilde{Y}_1'$ and $Y_3\tilde{Y}_3'$ are both symmetric positive semidefinite matrices, the equality
of the squares implies $Y_1\tilde{Y}_1' = Y_3\tilde{Y}_3'$. By the same argument, if $Y_2Y_2' = Y_4Y_4'$ then
$Y_2\tilde{Y}_2' = Y_4\tilde{Y}_4'$.

2. Positive definiteness:
\[
\sum_{i,j} c_i c_j \operatorname{tr}(Y_i\tilde{Y}_i'\tilde{Y}_jY_j')
= \operatorname{tr}\Bigl(\sum_i c_i Y_i\tilde{Y}_i' \sum_j c_j \tilde{Y}_jY_j'\Bigr)
= \Bigl\|\sum_i c_i Y_i\tilde{Y}_i'\Bigr\|_F^2 \ge 0.
\]
Summary of the extended kernels<br />
The proposed kernels are summarized below. Let $Y_i$ be a full-rank $D \times m$ matrix, and let
$\tilde{Y}_i = Y_i(Y_i'Y_i)^{-1/2}$ be the orthonormalization of $Y_i$ as before.
\[
\begin{aligned}
k_{Proj}(Y_1, Y_2) = k_{Lin}(Y_1, Y_2) &= \operatorname{tr}(\tilde{Y}_1'\tilde{Y}_2\tilde{Y}_2'\tilde{Y}_1) \\
k_{Aff}(Y_1, u_1, Y_2, u_2) &= \operatorname{tr}(\tilde{Y}_1'\tilde{Y}_2\tilde{Y}_2'\tilde{Y}_1) + u_1'(I - \tilde{Y}_1\tilde{Y}_1')(I - \tilde{Y}_2\tilde{Y}_2')u_2 \\
k_{AffSc}(Y_1, u_1, Y_2, u_2) &= \operatorname{tr}(Y_1\tilde{Y}_1'\tilde{Y}_2Y_2') + u_1'u_2. \tag{6.23}
\end{aligned}
\]
I also spherize the kernels,
\[
\tilde{k}(Y_1, u_1, Y_2, u_2) = k(Y_1, u_1, Y_2, u_2)\,k(Y_1, u_1, Y_1, u_1)^{-1/2}\,k(Y_2, u_2, Y_2, u_2)^{-1/2},
\]
so that $\tilde{k}(Y_1, u_1, Y_1, u_1) = 1$ for any $Y_1$ and $u_1$.

There is a caveat in implementing these kernels. Although I use the same notations
$Y$ and $\tilde{Y}$ for both linear and affine kernels, they are computed differently. For linear
kernels, $Y$ and $\tilde{Y}$ are computed from the data assuming $u = 0$, whereas for affine kernels,
$Y$ and $\tilde{Y}$ are computed after removing the estimated mean $u$ from the data.
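The summary (6.23) translates into code directly. The following NumPy sketch (function names are mine) implements the three kernels and the spherization; note that, as cautioned above, it takes $(Y, u)$ as given and assumes they were estimated appropriately for the linear or affine case:

```python
import numpy as np

def orthonormalize(Y):
    # Y_tilde = Y (Y'Y)^{-1/2}, computed stably via the thin SVD
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt

def k_proj(Y1, u1, Y2, u2):
    # Projection / 'linear' kernel; the offsets are ignored
    T1, T2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(T1.T @ T2 @ T2.T @ T1)

def k_aff(Y1, u1, Y2, u2):
    # Affine kernel: subspace term plus inner product of standard offsets
    T1, T2 = orthonormalize(Y1), orthonormalize(Y2)
    uh1 = u1 - T1 @ (T1.T @ u1)   # (I - Y1~ Y1~') u1
    uh2 = u2 - T2 @ (T2.T @ u2)
    return np.trace(T1.T @ T2 @ T2.T @ T1) + uh1 @ uh2

def k_affsc(Y1, u1, Y2, u2):
    # Affine scaled kernel: partially normalized subspace term plus u1'u2
    T1, T2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Y1 @ T1.T @ T2 @ Y2.T) + u1 @ u2

def spherize(k, Y1, u1, Y2, u2):
    # k~ = k12 / sqrt(k11 * k22), so that k~(x, x) = 1
    return k(Y1, u1, Y2, u2) / np.sqrt(k(Y1, u1, Y1, u1) * k(Y2, u2, Y2, u2))

# quick check on random data
rng = np.random.default_rng(7)
D, m = 8, 3
Y1, Y2 = rng.standard_normal((D, m)), rng.standard_normal((D, m))
u1, u2 = rng.standard_normal(D), rng.standard_normal(D)
assert np.isclose(spherize(k_affsc, Y1, u1, Y1, u1), 1.0)
```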
6.3.4 Extension to nonlinear subspaces<br />
A systematic way of extending the Projection kernel from linear/affine subspaces to nonlinear
spaces is to use an implicit map via a kernel function, as explained in Section 5.2.4,
where the latter kernel is to be distinguished from the Grassmann kernels. Note that the proposed
kernels (6.23) can be computed purely from the inner products of the column vectors
of the $Y$'s and $u$'s, including the orthonormalization procedure. This 'doubly kernel' approach
has already been proposed for the Binet-Cauchy kernel [90, 46] and for probabilistic distances
in general [96]. We can adopt the trick for the extended Projection kernels as well,
extending them to operate on nonlinear subspaces, that is, the preimages in the input space
of linear subspaces in the RKHS under the feature map.
6.4 Experiments with synthetic data<br />
In this section I demonstrate the application of the extended Grassmann kernels to a two-class
classification problem with Support Vector Machines (SVMs). Using synthetic data
generated from MFA distributions, I compare the classification performance of linear/nonlinear
SVMs in the original space with the performance of the SVM in the Grassmann
space.
6.4.1 Synthetic data<br />
The kernels in (6.23) are defined under different assumptions about the data distribution.
To test the kernels, I generate three types of synthetic data corresponding to these assumptions:
(1) linear homogeneous MFA, (2) affine homogeneous MFA, and (3) affine scaled MFA.
Selecting MFA<br />
For each type of data, I generate $N = 100$ Factor Analyzers in $D = 5$ dimensional
Euclidean space. The $i$-th Factor Analyzer has the distribution $p_i(x) = \mathcal{N}(u_i, C_i)$, where
the covariance is $C_i = \tilde{Y}_i\Lambda_i\tilde{Y}_i' + \sigma^2 I_D$. The $5 \times 2$ orthonormal matrices $\tilde{Y}_i$ are randomly
chosen from the uniform distribution on $G(m, D)$; refer to [1] for the definition of a
uniform distribution on $G(m, D)$. The ambient noise is chosen at $\sigma = 0.1$.

For the type 2 and type 3 datasets, I generate the nonzero mean $u_i$ randomly from
$u_i \sim \mathcal{N}(0, r^2 I_D)$ for each Factor Analyzer. The parameter $r$ controls the spread of the Factor Analyzers.
For the type 3 dataset, the covariance is additionally scaled via $C_i = \tilde{Y}_i\Lambda_i\tilde{Y}_i' + \sigma^2 I_D$, where
the elements of $\Lambda_i = \operatorname{diag}(\lambda_1, \dots, \lambda_m)$ are chosen i.i.d. from the uniform distribution on
$[0, 1]$.
The parameters for the datasets are summarized below:<br />
• Dataset 1: zero-mean (r = 0), homogeneous (λ1 = · · · = λm = 1)<br />
• Dataset 2: nonzero-mean (r = 0.2), homogeneous (λ1 = · · · = λm = 1)<br />
• Dataset 3: nonzero-mean (r = 0.2), scaled (0 ≤ λ ≤ 1)<br />
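The generation procedure above can be sketched as follows (my sketch; names and the QR-based sampler are mine — the span of the Q factor of a Gaussian matrix is uniformly distributed on $G(m, D)$):

```python
import numpy as np

rng = np.random.default_rng(8)
N, D, m, sigma = 100, 5, 2, 0.1

def make_dataset(r=0.0, scaled=False):
    # Returns (u_i, C_i) pairs for N Factor Analyzers p_i = N(u_i, C_i)
    params = []
    for _ in range(N):
        # Uniformly random orthonormal basis: QR of a Gaussian matrix
        Yt = np.linalg.qr(rng.standard_normal((D, m)))[0][:, :m]
        lam = rng.uniform(0, 1, m) if scaled else np.ones(m)
        C = Yt @ np.diag(lam) @ Yt.T + sigma**2 * np.eye(D)
        u = rng.normal(0, r, D) if r > 0 else np.zeros(D)
        params.append((u, C))
    return params

data1 = make_dataset(r=0.0)               # Dataset 1: linear homogeneous
data2 = make_dataset(r=0.2)               # Dataset 2: affine homogeneous
data3 = make_dataset(r=0.2, scaled=True)  # Dataset 3: affine scaled
```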
Assigning class labels<br />
So far the distributions have been chosen without classes. Since I am treating each distribution as a
point in the space of distributions, the class label is assigned per distribution. The binary
class labels are assigned as follows. I first choose the pair of distributions $p_+$ and $p_-$ that
are the farthest apart among all pairs of distributions. These $p_+$ and $p_-$
serve as the two extreme points representing the positive and the negative distributions respectively.
The labels of the other distributions are then assigned by comparing
their distances to the two extreme distributions:
\[
y_i = \begin{cases} 1, & \text{if } J_{KL}(p_i, p_+) < J_{KL}(p_i, p_-) \\ -1, & \text{otherwise.} \end{cases}
\]
The distances are measured by the KL distance $J_{KL}$ of the distributions. Empirically, the
numbers of positive and negative distributions were roughly balanced.
6.4.2 Algorithms<br />
I compare the performance of the Euclidean SVM with linear/polynomial/Gaussian RBF
kernels against the performance of the SVM with Grassmann kernels, similarly to the comparison
in Section 5.3. To test the original SVMs, I randomly sample $n = 50$ points from each
Factor Analyzer $p_i(x)$.

I evaluate the algorithms with $N$-fold cross-validation by holding out one set and training
with the other $N - 1$ sets. The polynomial kernel used is $k(x_1, x_2) = (\langle x_1, x_2\rangle + 1)^3$, and
the RBF kernel used is $k(x_1, x_2) = \exp(-\frac{1}{2r^2}\|x_1 - x_2\|^2)$, where the radius $r$ is chosen to
be one-fifth of the diameter of the data: $r = 0.2\max_{ij}\|x_i - x_j\|$. For training the SVMs,
I use the public-domain software SVM-light [42] with default parameters.
To test the Grassmann SVM, I first estimate the mean $u_i$ and the basis $Y_i$ from the same
points used for the Euclidean SVM, although I could have improved the results by using
the true parameters instead of the estimated ones. The Maximum Likelihood estimates of
$Y_i$, $u_i$ and $\sigma$ are given by the probabilistic PCA model [76] as follows. Let $\mu$ and $S$ be
the sample mean and covariance of the $i$-th set,
\[
\mu = \frac{1}{N_i}\sum_{j=1}^{N_i} x_j, \qquad S = \frac{1}{N_i - 1}\sum_{j=1}^{N_i}(x_j - \mu)(x_j - \mu)'.
\]
Let
\[
S = U\Lambda U'
\]
be the eigendecomposition of the covariance matrix $S$, where $U = [u_1 \cdots u_D]$ contains the
eigenvectors corresponding to the eigenvalues $\lambda_1 \ge \dots \ge \lambda_D$, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_D)$ is
the diagonal matrix of eigenvalues. Then $Y_i$ and $\sigma_i$ are estimated from
\[
\sigma^2 = \frac{1}{D - m}\sum_{j=m+1}^{D}\lambda_j, \tag{6.24}
\]
\[
Y_i = U_m(\Lambda_m - \sigma^2 I)^{1/2}, \tag{6.25}
\]
where $U_m$ is the first $m$ columns of $U$ and $\Lambda_m$ is the $m \times m$ principal submatrix of $\Lambda$. The
$\sigma$ is estimated individually for each set of data, and I use the averaged value for all sets. An
Table 6.1: Classification rates of the Euclidean SVMs and the Grassmann SVMs. The best
rate for each dataset is marked with *.

            |       Euclidean       |       Grassmann       |  Probabilistic
            | Linear   Poly    RBF  |  Lin    Aff    AffSc  |   BC     Bhat
  Dataset 1 |  52.86  62.38  66.33  | 87.00  86.80  87.70*  | 82.50   84.30
  Dataset 2 |  62.30  64.45  65.74  | 76.90  82.00  83.10*  | 70.90   72.50
  Dataset 3 |  62.76  64.73  69.47  | 69.50  73.70  84.40*  | 65.10   77.30
iterative and more accurate method of estimation is to use an EM approach [27], although
that is not done here.
The $\sigma$ is used for the Bhattacharyya kernel, which requires nonzero ambient noise $\sigma > 0$
in the covariance $C_i = Y_iY_i' + \sigma^2 I$.
The following five kernels are compared:

1. SVM with the original and the extended Projection kernels: $k_{Lin}$, $k_{Aff}$, $k_{AffSc}$

2. SVM with the Binet-Cauchy kernel: $k_{BC}(Y_1, Y_2) = (\det Y_1'Y_2)^2 = \det Y_1'Y_2Y_2'Y_1$

3. SVM with the Bhattacharyya kernel: $k_{Bhat}(p_1, p_2) = \int [p_1(x)\,p_2(x)]^{1/2}\,dx$ for Factor
Analyzers.
I evaluate the algorithms with the leave-one-out test by holding out one subspace and training
with the other $N - 1$ subspaces. For training the SVMs, I use a Matlab implementation with a
nonnegative QP solver.
6.4.3 Results and discussion<br />
Table 6.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,
averaged over 10 independent trials. The results show that the best rates are obtained from
the affine scaled kernel, and that the Euclidean kernels lag behind for all types of data. The
inferiority of the Euclidean SVMs to the Grassmann SVMs can be ascribed to reasons
similar to those discussed in Section 5.3.3.

For dataset 1, which has zero means, the linear SVMs degrade to chance level
(50%). The result agrees with the intuitive picture that any decision hyperplane that passes
through the origin will roughly halve the positive and the negative classes. As expected, the linear
kernel is inappropriate for dataset 2, which has nonzero offsets, whereas the affine and
the affine scaled kernels perform well for both datasets 1 and 2. However, only the affine
scaled kernel performs well for dataset 3. The Binet-Cauchy and the Bhattacharyya kernels
perform close to the Projection kernels for dataset 1, but underperform for datasets 2
and 3. This result is expected, since the Binet-Cauchy kernel does not take offsets or scales
into account, and since the Bhattacharyya kernel is not adequate for MFA data, as I
showed in the previous sections.

I conclude that the extended kernels have advantages over the original kernels and
the Euclidean kernels for subspace-based classification problems when the data consist
of affine and scaled subspaces instead of simple linear subspaces.
6.5 Experiments with real-world data<br />
In this section I demonstrate the application of the extended Grassmann kernels to recognition
problems with kernel FDA. Using real image databases, I compare the classification
performance of the extended kernels and other previously used kernels.
6.5.1 Algorithms<br />
A baseline algorithm and the kernel FDA with the different kernels below are compared:

1. Baseline: Euclidean FDA

2. KFDA with the original and the extended Projection kernels: $k_{Lin}$, $k_{Aff}$, $k_{AffSc}$

3. KFDA with the Binet-Cauchy kernel

4. KFDA with the Bhattacharyya kernel
The subspace parameters are estimated from the data similarly to the experiments with
synthetic data. I evaluate the algorithms with the leave-one-out test by holding out one subspace
and training with the other $N - 1$ subspaces.
6.5.2 Results and discussion<br />
The recognition rates for Yale Face, CMU-PIE, ETH-80, and IXMAS databases are given<br />
in Figures 6.5–6.8. I summarize the results as follows:<br />
1. The original and the extended Grassmann kernels outperform the Binet-Cauchy and<br />
the Bhattacharyya kernels, as well as the baseline method. The superiority of the<br />
Projection kernel to the Binet-Cauchy kernel and the Euclidean method is already<br />
demonstrated in Chapter 5.<br />
2. The Bhattacharyya kernel performs quite poorly, and becomes worse as the subspace<br />
dimension increases. One can verify that the kernel matrix from the data is close to<br />
an identity matrix and therefore carries little information about the data. A similar<br />
observation for the Binet-Cauchy kernel was already made in Chapter 5.<br />
3. In the Yale Face and ETH-80 databases, the affine scaled kernel and the affine kernel,<br />
respectively, achieve the best rates. In the CMU-PIE and IXMAS databases, the rates of the<br />
affine kernel closely follow those of the linear kernel. The rates of the affine scaled kernel<br />
fall behind those two, but the differences are small compared to the rates achieved by<br />
the rest of the methods. Compared with the experimental results from the synthetic<br />
data, the results with the real data show neither a conclusive advantage nor a<br />
disadvantage of using the extended kernels. Since the extended kernels generalize<br />
the original Projection kernel, the comparable performance can be interpreted as evidence<br />
that the linear subspace assumption is valid for the real image databases I used.<br />
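The near-identity degeneracy noted in point 2 can be checked numerically. The following is a minimal sketch (the normalization and score are illustrative choices, not a procedure from the thesis): normalize the kernel matrix to unit diagonal and measure the mean off-diagonal magnitude, which approaches zero when every pair of samples looks equally dissimilar:<br />

```python
import numpy as np

def kernel_identity_score(K):
    """Mean absolute off-diagonal entry of the diagonally normalized kernel
    matrix. A value near 0 indicates K is close to a scaled identity, i.e.
    the kernel carries little information about the data."""
    K = np.asarray(K, dtype=float)
    d = np.sqrt(np.diag(K))
    Kn = K / np.outer(d, d)                      # unit diagonal
    off = Kn[~np.eye(len(K), dtype=bool)]        # off-diagonal entries
    return np.mean(np.abs(off))
```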
6.6 Conclusion<br />
In this chapter, I showed the relationship between probabilistic distances and the Projection<br />
kernel using a probabilistic model of subspaces. This analysis provides generalizations of<br />
the Projection kernel to affine and scaled subspaces. Relaxing the linear subspace<br />
assumption allows us to accommodate more complex data structures which diverge from<br />
ideal linear subspaces.<br />
As demonstrated with the synthetic data, the mean and the scales within a subspace may<br />
carry important statistics of the data and can make a large difference in classification tasks.<br />
Whether this information is useful for the real databases was not conclusive from<br />
the experiments, since the difference between the original and the extended kernels was<br />
small. However, the original and the extended kernels showed consistently better performance<br />
than the Euclidean method and the kernel methods with the Binet-Cauchy and the<br />
Bhattacharyya kernels.<br />
[Plot: recognition rate (%) versus subspace dimension m for the Eucl, Lin, Aff, AffSc, BC, and Bhat kernels; the values are tabulated below.]<br />
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
Eucl 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00<br />
Lin 96.77 95.70 98.57 99.28 98.92 98.57 97.85 96.77 97.13<br />
Aff 94.62 96.06 98.92 98.92 98.57 97.85 96.77 97.13 97.49<br />
AffSc 96.77 98.21 99.28 99.28 99.28 99.28 99.28 99.28 99.28<br />
BC 96.77 95.34 96.77 96.42 83.87 72.76 55.20 48.03 44.09<br />
Bhat 97.13 98.21 94.27 85.30 65.23 60.93 53.76 48.39 44.44<br />
Figure 6.5: Yale Face Database: face recognition rates from various kernels. The two<br />
highest rates including ties are highlighted with boldface for each subspace dimension m.<br />
[Plot: recognition rate (%) versus subspace dimension m for the Eucl, Lin, Aff, AffSc, BC, and Bhat kernels; the values are tabulated below.]<br />
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
Eucl 60.73 60.73 60.73 60.73 60.73 60.73 60.73 60.73 60.73<br />
Lin 88.27 74.84 89.77 87.21 91.68 92.54 93.82 93.60 95.31<br />
Aff 72.28 86.99 85.50 91.26 91.26 92.32 92.54 94.46 94.67<br />
AffSc 83.16 85.29 85.29 85.29 85.29 85.29 85.29 85.29 85.29<br />
BC 88.27 71.43 82.52 64.82 58.64 47.55 43.07 39.87 36.25<br />
Bhat 83.37 44.78 39.45 36.25 31.98 28.78 26.23 23.88 20.47<br />
Figure 6.6: CMU-PIE Database: face recognition rates from various kernels.<br />
[Plot: recognition rate (%) versus subspace dimension m for the Eucl, Lin, Aff, AffSc, BC, and Bhat kernels; the values are tabulated below.]<br />
m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8 m=9<br />
Eucl 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00 85.00<br />
Lin 88.75 88.75 93.75 96.25 96.25 95.00 93.75 96.25 96.25<br />
Aff 88.75 92.50 96.25 96.25 95.00 93.75 96.25 96.25 97.50<br />
AffSc 91.25 90.00 91.25 91.25 91.25 91.25 92.50 91.25 91.25<br />
BC 88.75 86.25 90.00 81.25 72.50 63.75 51.25 41.25 48.75<br />
Bhat 91.25 90.00 88.75 87.50 86.25 83.75 81.25 81.25 80.00<br />
Figure 6.7: ETH-80 Database: object categorization rates from various kernels.<br />
[Plot: recognition rate (%) versus subspace dimension m for the Eucl, Lin, Aff, AffSc, BC, and Bhat kernels; the values are tabulated below.]<br />
m=1 m=2 m=3 m=4 m=5<br />
Eucl 54.87 54.87 54.87 54.87 54.87<br />
Lin 69.09 80.30 84.55 84.24 85.15<br />
Aff 73.94 81.52 82.73 82.42 80.91<br />
AffSc 72.73 78.48 80.30 80.30 80.30<br />
BC 69.09 60.00 50.91 36.36 25.15<br />
Bhat 64.24 59.39 51.52 42.12 23.64<br />
Figure 6.8: IXMAS Database: action recognition rates from various kernels.<br />
Chapter 7<br />
CONCLUSION<br />
In this chapter I summarize the work presented in this thesis and discuss the future work<br />
related to the proposed methods.<br />
7.1 Summary<br />
In this thesis I proposed the subspace-based approach for solving novel learning problems<br />
using the Grassmann kernels. Below I summarize the progress that has been made in each<br />
chapter with regard to this goal.<br />
• In Chapter 3, I proposed the paradigm of subspace-based learning which exploits<br />
inherent linear structures in data to solve novel learning problems. The rationale behind<br />
this approach was explained and exemplified with well-known image databases.<br />
• In Chapter 4, I introduced the Grassmann manifold as a common framework for<br />
subspace-based learning and reviewed the geometry of the space. Various distances<br />
on the Grassmann manifold were analyzed in depth and compared<br />
to each other by classification tests.<br />
• In Chapter 5, I proposed the Projection kernel and its application to discriminant<br />
analysis for subspace-based classification problems. In classification tests with the<br />
image databases, the proposed method showed better performance than other previously<br />
used discrimination methods as well as the Euclidean method.<br />
• In Chapter 6, I presented formal analyses of the relationship between probabilistic<br />
distances and the Grassmann kernels. Based on these analyses, I broadened the domain<br />
of subspace-based learning from linear to affine and scaled subspaces, and presented<br />
the extended kernels. The extended kernels performed competitively on synthetic<br />
and real data and showed potential in the extended domains.<br />
7.2 Future work<br />
In this section I discuss the research directions in which I plan to extend the present work.<br />
7.2.1 Theory<br />
In this work I utilized geometric properties of the Grassmann manifolds as a framework for<br />
subspace-based learning problems. However, there are other aspects of this manifold that I<br />
did not consider in this thesis. These are briefly reviewed below.<br />
• Riemannian aspect of Grassmann manifolds: As discussed in Chapter 4, the Grassmann<br />
manifold can be derived as a homogeneous space of orthogonal groups [86, 13]:<br />
G(m, D) = O(D) / (O(m) × O(D − m)).<br />
This definition also induces a Riemannian geometry on the space, which plays an<br />
important role in optimization techniques on the manifold such as Newton's<br />
method or the conjugate gradient method [20, 2]. The geometry of the Grassmann manifold<br />
induced by positive definite kernels is more general than the strict Riemannian geometry<br />
from the definition above. I plan to explore the applicability of the proposed kernels<br />
to optimization problems as well.<br />
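For reference, the two Grassmann kernels around which this thesis revolves can be computed directly from basis matrices; a minimal sketch, where Y1 and Y2 are D × m matrices with orthonormal columns spanning the two subspaces:<br />

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """Projection kernel k_P(Y1, Y2) = ||Y1' Y2||_F^2, the sum of the
    squared cosines of the principal angles between the two subspaces."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def binet_cauchy_kernel(Y1, Y2):
    """Binet-Cauchy kernel k_BC(Y1, Y2) = det(Y1' Y2)^2, the product of the
    squared cosines of the principal angles."""
    return np.linalg.det(Y1.T @ Y2) ** 2
```

For identical subspaces the Projection kernel returns m and the Binet-Cauchy kernel returns 1; both vanish when the subspaces are orthogonal.<br />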
• Probabilistic aspect of Grassmann manifolds: The kernel approach to a learning<br />
problem is basically deterministic. Although I used a probabilistic model of subspaces<br />
in Chapter 6, the MFA model is a distribution in the usual Euclidean space.<br />
There are intrinsic definitions of probability distributions on the Grassmann manifold,<br />
such as the uniform distribution [1] and the matrix Langevin distribution [16].<br />
Statistical inference and estimation on the Grassmann manifold remain largely unexplored,<br />
with the exception of the pioneering work of Chikuse [16].<br />
• Existence of other Grassmann kernels: There is an important technical question that<br />
remains to be answered: are there more Grassmann kernels that are fundamentally<br />
different from the Projection or the Binet-Cauchy kernels? As I remarked in Chapter 6,<br />
an inductive approach to the discovery of new kernels is to examine the limits of<br />
other probabilistic distances that were not analyzed in this thesis. On the other hand,<br />
a deductive approach is to adapt the characterization of positive definite kernels on<br />
Hilbert spaces, n-spheres and semigroups [65, 67, 66, 9] to the case of the Stiefel<br />
and the Grassmann manifolds. In relation to the latter approach, a study involving<br />
functional integration over orthogonal groups is currently in progress, and<br />
will be reported in the future.<br />
7.2.2 Applications<br />
The proposed Grassmann kernels are general building blocks for many applications. These<br />
kernels can be used for any learning task, supervised or unsupervised.<br />
They can also be used with any kernel method; among these, I<br />
used the SVM and the FDA for demonstrations. Moreover, the applications are not limited<br />
to image data, since a subspace is a familiar notion for any vectorial data.<br />
On the other hand, the proposed kernel methods may not outperform the state-of-the-art<br />
algorithms which are dedicated to specific tasks and utilize domain-specific knowledge.<br />
I remark below on the limitations and possible improvements of the proposed method in<br />
several application-specific aspects.<br />
• Intensity- vs. feature-based representation: In this thesis I used the intensity representation<br />
of images and relied on the low-dimensional character of pose subspaces<br />
or illumination subspaces to address the invariant recognition problem. However, a<br />
compelling alternative for recognition is to use feature-based representations of faces<br />
and objects, such as SIFT features, which are already invariant to pose and illumination<br />
variations [54]. It is unknown whether we can find subspace structures in<br />
the feature-based representation of images. However, the Bhattacharyya kernel was<br />
originally demonstrated with the bag-of-features representation [40], which suggests<br />
the application of our method to such representations.<br />
• Dynamical models for sequence data: The proposed method used the observability<br />
subspace representation of video sequences. However, there are multiple steps<br />
involved in processing a video sequence into a dynamical system representation, and<br />
each step relies on heuristic choices. Recognition can be improved solely with<br />
clever preprocessing of the sequences, without the use of dynamical models [89]. In<br />
relation to the dynamical system approach in general, Vishwanathan et al. proposed<br />
a variety of other kernels for dynamical systems [84] in addition to the Binet-Cauchy<br />
kernel used in this thesis. In fact, the authors defined the Binet-Cauchy kernel for<br />
Fredholm operators, which is much more general in scope. The subspace model from the<br />
observability matrix is but one approach to characterizing dynamical systems, and I<br />
am currently investigating other subspace models that emphasize different aspects of<br />
dynamical systems.<br />
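As a concrete illustration of the observability-subspace representation discussed above, the following sketch assumes the system matrices A and C of a linear dynamical system x_{t+1} = A x_t, y_t = C x_t have already been estimated (the identification step is omitted), and the truncation depth n_blocks is an arbitrary illustrative choice:<br />

```python
import numpy as np

def observability_subspace(A, C, n_blocks=5):
    """Orthonormal basis for the column span of the truncated observability
    matrix O = [C; CA; CA^2; ...; CA^(n_blocks-1)].  The returned matrix
    represents the system (A, C) as a point on a Grassmann manifold."""
    A = np.asarray(A, dtype=float)
    blocks, M = [], np.asarray(C, dtype=float)
    for _ in range(n_blocks):
        blocks.append(M)
        M = M @ A
    O = np.vstack(blocks)
    # SVD rather than QR, so that a rank-deficient O still yields the
    # dominant orthonormal directions of its column space.
    U, _, _ = np.linalg.svd(O, full_matrices=False)
    return U[:, : A.shape[0]]
```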
• Handling unorganized data: The image databases I used in this work have factorized<br />
structures. For example, each image in the Yale Face database is labeled in terms<br />
of (person, pose, illumination). If we do not know the labels other than the person<br />
and have to estimate the subspaces from a clutter of data points, the estimation<br />
itself becomes a separate problem that warrants research efforts [27, 37, 81].<br />
However, it is outside the scope of the proposed subspace-based framework.<br />
Furthermore, I mentioned a caveat in Chapter 3: I assumed that the test data for the<br />
image databases are not single images but subspaces. This assumption can potentially<br />
limit the applicability of the subspace-based approach in conventional problem settings.<br />
However, this limitation is to be understood as a tradeoff between data structure<br />
flexibility and the strength of the methods: to use more powerful kernel<br />
methods, one needs more structured data.<br />
Bibliography<br />
[1] P. Absil, A. Edelman, and P. Koev. On the largest principal angle between random<br />
subspaces. Linear Algebra and its Applications, 414(1):288–294, 2006.<br />
[2] P. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann mani-<br />
folds with a view on algorithmic computation. Acta Applicandae Mathematicae: An<br />
International Survey Journal on Applying Mathematics and Mathematical Applica-<br />
tions, 80(2):199–220, 2004.<br />
[3] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search with<br />
applications to pattern recognition. In Proceedings of the Conference on Computer<br />
Vision and Pattern Analysis. IEEE Computer Society, June 2007.<br />
[4] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans-<br />
actions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.<br />
[5] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach.<br />
Neural Computation, 12(10):2385–2404, 2000.<br />
[6] M. Baumann and U. Helmke. Riemannian subspace tracking algorithms on Grass-<br />
mann manifolds. In Proceedings of the IEEE Conference on Decision and Control,<br />
pages 4731–4736, 12-14 Dec. 2007.<br />
[7] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition<br />
using class specific linear projection. In Proceedings of the European Conference<br />
on Computer Vision, Volume I, pages 45–58, London, UK, 1996. Springer-Verlag.<br />
[8] P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object un-<br />
der all possible illumination conditions? International Journal of Computer Vision,<br />
28(3):245–260, 1998.<br />
[9] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups.<br />
Springer, Berlin, 1984.<br />
[10] A. Bhattacharyya. On a measure of divergence between two statistical populations<br />
defined by their probability distributions. Bulletin of Calcutta Mathematical Society,<br />
Vol. 49, pages 214–224, 1943.<br />
[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin<br />
classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning<br />
Theory, pages 144–152, New York, NY, USA, 1992. ACM.<br />
[12] M. Bressan and J. Vitrià. Nonparametric discriminant analysis and nearest neighbor<br />
classification. Pattern Recognition Letters, 24:2743–2749, 2003.<br />
[13] R. Carter, G. Segal, and I. Macdonald. Lectures on Lie Groups and Lie Algebras, Lon-<br />
don Mathematical Society Student Texts. Cambridge University Press, Cambridge,<br />
UK, 1995.<br />
[14] J.-M. Chang, J. R. Beveridge, B. A. Draper, M. Kirby, H. Kley, and C. Peterson.<br />
Illumination face spaces are idiosyncratic. In Proceedings of the International Con-<br />
ference on Image Processing, Computer Vision, and Pattern Recognition, pages 390–<br />
396, 2006.<br />
[15] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on<br />
the sum of observations. Annals of Mathematical Statistics, pages 493–507, 1952.<br />
[16] Y. Chikuse. Statistics on special manifolds, Lecture Notes in Statistics, vol. 174.<br />
Springer-Verlag, New York, 2003.<br />
[17] K. D. Cock and B. D. Moor. Subspace angles between ARMA models. Systems &<br />
Control Letters, 46(4):265–270, July 2002.<br />
[18] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines: and<br />
other kernel-based learning methods. Cambridge University Press, New York, NY,<br />
USA, 2000.<br />
[19] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International<br />
Journal of Computer Vision, 51(2):91–109, 2003.<br />
[20] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthog-<br />
onality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–<br />
353, 1999.<br />
[21] R. Epstein, P. Hallinan, and A. Yuille. 5 ± 2 Eigenimages suffice: An empirical<br />
investigation of low-dimensional lighting models. In Proceedings of IEEE Workshop<br />
on Physics-Based Modeling in Computer Vision, pages 108–116, 1995.<br />
[22] B. S. Everitt. An introduction to latent variable models. Chapman and Hall, London,<br />
1984.<br />
[23] A. Faragó, T. Linder, and G. Lugosi. Fast nearest-neighbor search in dissimilarity<br />
spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):957–<br />
962, 1993.<br />
[24] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for<br />
robot vision. In International Symposium of Robotics Research, pages 192–201, 2003.<br />
[25] K. Fukunaga. Introduction to statistical pattern recognition (2nd ed.). Academic<br />
Press Professional, Inc., San Diego, CA, USA, 1990.<br />
[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illu-<br />
mination cone models for face recognition under variable lighting and pose. IEEE<br />
Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.<br />
[27] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers.<br />
Technical Report CRG-TR-96-1, Department of Computer Science, University<br />
of Toronto, 1996.<br />
[28] G. H. Golub and C. F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins<br />
University Press, Baltimore, MD, USA, 1996.<br />
[29] R. Gross, I. Matthews, and S. Baker. Eigen light-fields and face recognition across<br />
pose. In Proceedings of the IEEE International Conference on Automatic Face and<br />
Gesture Recognition, page 3, Washington, DC, USA, 2002. IEEE Computer Society.<br />
[30] R. Gross, I. Matthews, and S. Baker. Fisher light-fields for face recognition across<br />
pose and illumination. In Proceedings of the 24th DAGM Symposium on Pattern<br />
Recognition, pages 481–489, London, UK, 2002. Springer-Verlag.<br />
[31] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE<br />
Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005.<br />
[32] P. Hallinan. A low-dimensional representation of human faces for arbitrary light-<br />
ing conditions. In Proceedings of the Conference on Computer Vision and Pattern<br />
Recognition, pages 995–999, 1994.<br />
[33] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on<br />
subspace-based learning. In Proceedings of the International Conference on Machine<br />
Learning, 2008.<br />
[34] J. Hamm and D. D. Lee. Learning a warped subspace model of faces with images<br />
of unknown pose and illumination. In International Conference on Computer Vision<br />
Theory and Applications, pages 219–226, 2008.<br />
[35] M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric<br />
spaces. Journal of Computer and System Sciences, 71(3):333–359, 2005.<br />
[36] O. Henkel. Sphere packing bounds in the Grassmann and Stiefel manifolds. IEEE<br />
Transactions on Information Theory, 51:3445, 2005.<br />
[37] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances<br />
of objects under varying illumination conditions. Proceedings of the Conference on<br />
Computer Vision and Pattern Analysis, 01:11, 2003.<br />
[38] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, Cambridge,<br />
UK, 1985.<br />
[39] H. Hotelling. Relations between two sets of variates. Biometrika 28, pages 321–372,<br />
1936.<br />
[40] T. Jebara and R. I. Kondor. Bhattacharyya and expected likelihood kernels. In Pro-<br />
ceeding of the Annual Conference on Learning Theory, pages 57–71, 2003.<br />
[41] T. Joachims. Text categorization with Support Vector Machines: Learning with many<br />
relevant features. In Proceedings of the European Conference on Machine Learning,<br />
pages 137 – 142, Berlin, 1998. Springer.<br />
[42] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges,<br />
and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chap-<br />
ter 11, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.<br />
[43] T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image<br />
set classes using canonical correlations. IEEE Transactions on Pattern Analysis and<br />
Machine Intelligence, 29(6):1005–1018, 2007.<br />
[44] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the<br />
characterization of human faces. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 12(1):103–108, 1990.<br />
[45] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures.<br />
In Proceedings of the International Conference on Machine Learning, 2002.<br />
[46] R. I. Kondor and T. Jebara. A kernel between sets of vectors. In Proceedings of the<br />
International Conference on Machine Learning, pages 361–368, 2003.<br />
[47] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathemat-<br />
ical Statistics, 22(1):79–86, 1951.<br />
[48] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object<br />
categorization. In Proceedings of the Conference on Computer Vision and Pattern<br />
Analysis, page 409, Los Alamitos, CA, USA, 2003. IEEE Computer Society.<br />
[49] C. Leslie, E. Eskin, and W. Noble. Mismatch string kernels for SVM protein classi-<br />
fication. In Advances in Neural Information Processing Systems, pages 1441–1448,<br />
2003.<br />
[50] D. Lin, S. Yan, and X. Tang. Pursuing informative projection on Grassmann manifold.<br />
In Proceedings of the Conference on Computer Vision and Pattern Analysis, pages<br />
1727–1734, Washington, DC, USA, 2006. IEEE Computer Society.<br />
[51] X. Liu, A. Srivastava, and K. Gallivan. Optimal linear representations of images for<br />
object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />
26(5):662–666, 2004.<br />
[52] L. Ljung. System identification: theory for the user. Prentice-Hall, Inc., Upper Saddle<br />
River, NJ, USA, 1986.<br />
[53] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text clas-<br />
sification using string kernels. Journal of Machine Learning Research, 2:419–444,<br />
2002.<br />
[54] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International<br />
Journal of Computer Vision, 60(2):91–110, 2004.<br />
[55] R. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing,<br />
48(4):1164–1170, Apr 2000.<br />
[56] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant<br />
analysis with kernels. In Y. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural<br />
Networks for Signal Processing IX, pages 41–48. IEEE, 1999.<br />
[57] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K. Müller. Constructing<br />
descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel<br />
feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />
25(5):623–627, May 2003.<br />
[58] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based<br />
learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–202, 2001.<br />
[59] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels.<br />
In Proceedings of the International Conference on Machine Learning, page 81, New<br />
York, NY, USA, 2004. ACM.<br />
[60] E. Pekalska, P. Paclik, and R. P. W. Duin. A generalized kernel approach to<br />
dissimilarity-based classification. Journal of Machine Learning Research, 2:175–<br />
211, 2002.<br />
[61] R. Ramamoorthi. Analytic PCA construction for theoretical analysis of lighting vari-<br />
ability in images of a Lambertian object. IEEE Transactions on Pattern Analysis and<br />
Machine Intelligence, 24(10):1322–1333, 2002.<br />
[62] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irra-<br />
diance: determining the illumination from images of a convex Lambertian object.<br />
Journal of the Optical Society of America A, 18(10):2448–2459, 2001.<br />
[63] A. Rényi. On measures of information and entropy. In Proceedings of the 4th Berkeley<br />
Symposium on Mathematics, Statistics and Probability, pages 547–561, 1960.<br />
[64] H. Sakano and N. Mukawa. Kernel mutual subspace method for robust facial image<br />
recognition. In Proceedings of the International Conference on Knowledge-Based<br />
Intelligent Engineering Systems and Allied Technologies, volume 1, pages 245–248,<br />
2000.<br />
[65] I. J. Schoenberg. Remarks to Maurice Fréchet's article ... Annals of Mathematics,<br />
36(3):724–732, 1935.<br />
[66] I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics,<br />
39(4):811–841, 1938.<br />
[67] I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the<br />
American Mathematical Society, 44:522–536, 1938.<br />
[68] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel<br />
eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.<br />
[69] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,<br />
Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.<br />
[70] G. Shakhnarovich, J. W. Fisher III, and T. Darrell. Face recognition from long-term<br />
observations. In Proceedings of the European Conference on Computer Vision, pages<br />
851–868, London, UK, 2002.<br />
[71] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge<br />
University Press, New York, NY, USA, 2004.<br />
[72] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression<br />
(PIE) database. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />
25(12):1615 – 1618, December 2003.<br />
[73] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of<br />
human faces. Journal of the Optical Society of America A, 4(3):519–524, 1987.<br />
[74] A. Srivastava. A Bayesian approach to geometric subspace estimation. IEEE Trans-<br />
actions on Signal Processing, 48(5):1390–1400, May 2000.<br />
[75] E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. In Proceed-<br />
ings of the Annual Workshop on Computational Learning Theory. ACM, 2002.<br />
[76] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal<br />
Of The Royal Statistical Society Series B, 61(3):611–622, 1999.<br />
[77] F. Topsøe. Some inequalities for information divergence and related measures of<br />
discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.<br />
[78] P. Turaga, A. Veeraraghavan, and R. Chellappa. Statistical analysis on Stiefel and<br />
Grassmann manifolds with applications in computer vision. In Proceedings of the<br />
Conference on Computer Vision and Pattern Analysis, 2008.<br />
[79] M. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-<br />
science, 3(1):71–86, 1991.<br />
[80] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa. Matching shape se-<br />
quences in video with applications in human movement analysis. IEEE Transactions<br />
on Pattern Analysis and Machine Intelligence, 27(12):1896–1909, 2005.<br />
[81] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA).<br />
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945–1959,<br />
2005.<br />
[82] S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. Advances<br />
in Neural Information Processing Systems, 15, 2003.<br />
[83] S. Vishwanathan and A. J. Smola. Binet-Cauchy kernels. In Advances in Neural<br />
Information Processing Systems, 2004.<br />
[84] S. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical<br />
systems and its application to the analysis of dynamic scenes. International Journal<br />
of Computer Vision, 73(1):95–119, 2007.<br />
[85] L. Wang, X. Wang, and J. Feng. Subspace distance analysis with application to adap-<br />
tive Bayesian algorithm for face recognition. Pattern Recognition, 39(3):456–464,<br />
2006.<br />
[86] F. Warner. Foundations of differentiable manifolds and Lie groups. Springer-Verlag,<br />
New York, 1983.<br />
[87] C. Watkins. Kernels from matching operations. Technical Report CSD-TR-98-07,<br />
Department of Computer Science, Royal Holloway College, 1999.<br />
[88] C. Watkins. Dynamic alignment kernels. In A. Smola and P. Bartlett, editors, Ad-<br />
vances in Large Margin Classifiers, chapter 3, pages 39–50. MIT Press, Cambridge,<br />
MA, USA, 2000.<br />
[89] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using<br />
motion history volumes. Computer Vision and Image Understanding, 104(2):249–<br />
257, 2006.<br />
[90] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of<br />
Machine Learning Research, 4:913–931, 2003.<br />
[91] Y.-C. Wong. Differential geometry of Grassmann manifolds. Proceedings of the<br />
National Academy of Sciences, 57:589–594, 1967.<br />
[92] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image se-<br />
quence. In Proceedings of the International Conference on Face and Gesture Recog-<br />
nition, page 318, Washington, DC, USA, 1998. IEEE Computer Society.<br />
[93] J. Ye and T. Xiong. Null space versus orthogonal linear discriminant analysis. In<br />
Proceedings of the International Conference on Machine Learning, pages 1073–1080,<br />
New York, NY, USA, 2006. ACM.<br />
[94] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur. Determining generative<br />
models of objects under varying illumination: Shape and albedo from multiple images<br />
using SVD and integrability. International Journal of Computer Vision, 35(3):203–<br />
222, 1999.<br />
[95] S. K. Zhou and R. Chellappa. Illuminating light field: Image-based face recognition<br />
across illuminations and poses. In Proceedings of the IEEE International Conference<br />
on Automatic Face and Gesture Recognition, pages 229–234, 2004.<br />
[96] S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Proba-<br />
bilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on<br />
Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.<br />