
SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm

A DISSERTATION

in

Electrical and Systems Engineering

Presented to the Faculties of the University of Pennsylvania
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2008

Supervisor of Dissertation

Graduate Group Chair


COPYRIGHT

Jihun Hamm

2008


Acknowledgements

I deeply thank my advisor Dr. Daniel D. Lee for so many things. Besides providing financial and moral support for my graduate study, Daniel initiated me into the field of machine learning, which I barely knew about before working with him. From him I learned how to tackle problems from the ground up and stay fresh-minded. The greatest influence Daniel had on me was his energy and passion toward the goal. Being near him kept me and my colleagues stimulated and energized, and helped us endure some tough periods during the Ph.D. process.

I thank Dr. Lawrence Saul for inspiring me to follow my interest in manifold learning. During my early years working with him, his knowledge and intuition on the subject strongly shaped my approach to learning problems.

I am also grateful to the other professors who served on my thesis committee: Dr. Ali Jadbabaie, Dr. Jianbo Shi, Dr. Ben Taskar, and Dr. Ragini Verma. They provided valuable feedback to polish the thesis. Dr. Jean Gallier provided me with guidance on mathematical issues before and during the writing of the thesis.

I appreciate the support from my colleagues, especially my lab members: Dan Huang, Yuanqing Lin, Yung-kyun Noh, and Paul Vernaza. Besides sharing the enthusiasm for research, we shared many fun and sometimes stressful moments of daily life as graduate students. Yung-kyun has always been a pleasure to discuss any problem with. He was kind enough to read through the draft of the thesis and give suggestions.

Lastly, I thank my parents and my family for being who they are, and for understanding my excuses for not talking to them more often. My wife Sophia has always been by my side, and I cannot thank her enough for that.


ABSTRACT

SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm

Supervisor: Prof. Daniel D. Lee

In this thesis I propose a subspace-based learning paradigm for solving novel problems in machine learning. We often encounter subspace structures within data that lie inside a vector space. For example, the set of images of an object or a face under varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be approximated quite well with low-dimensional subspaces. Treating such subspaces as basic units of learning gives rise to challenges that conventional algorithms cannot handle well. In this work, I tackle subspace-based learning problems within the unifying framework of the Grassmann manifold, which is the set of linear subspaces of a Euclidean space. I propose positive definite kernels on this space, which provide easy access to the repository of kernel algorithms. Furthermore, I show that the Grassmann kernels can be extended to the set of affine and scaled subspaces. This extension allows us to handle larger classes of problems with little additional cost.

The proposed kernels can be used with any kernel method. In particular, I demonstrate the potential advantages of the proposed kernels with Discriminant Analysis techniques and Support Vector Machines for recognition and categorization tasks. Experiments with real image databases show not only the feasibility of the proposed framework but also the improved performance of the method compared with previously known methods.


Contents

1 INTRODUCTION
  1.1 Overview
  1.2 Contributions and related work
  1.3 Organization of the paper

2 BACKGROUND
  2.1 Introduction
  2.2 Kernel machines
    2.2.1 Motivation
    2.2.2 Mercer kernels
    2.2.3 Reproducing Kernel Hilbert Space
    2.2.4 Examples of kernels
    2.2.5 Generating new kernels from old kernels
    2.2.6 Distance and conditionally positive definite kernels
  2.3 Support Vector Machines
    2.3.1 Large margin linear classifier
    2.3.2 Dual problem and support vectors
    2.3.3 Extensions
    2.3.4 Generalization error and overfitting
  2.4 Discriminant Analysis
    2.4.1 Fisher Discriminant Analysis
    2.4.2 Nonparametric Discriminant Analysis
    2.4.3 Discriminant analysis in high-dimensional spaces
    2.4.4 Extension to nonlinear discriminant analysis

3 MOTIVATION: SUBSPACE STRUCTURE IN DATA
  3.1 Introduction
  3.2 Illumination subspaces in multi-lighting images
  3.3 Pose subspaces in multi-view images
  3.4 Video sequences of human motions
  3.5 Conclusion

4 GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES
  4.1 Introduction
  4.2 Stiefel and Grassmann manifolds
    4.2.1 Stiefel manifold
    4.2.2 Grassmann manifold
    4.2.3 Principal angles and canonical correlations
  4.3 Grassmann distances for subspaces
    4.3.1 Projection distance
    4.3.2 Binet-Cauchy distance
    4.3.3 Max Correlation
    4.3.4 Min Correlation
    4.3.5 Procrustes distance
    4.3.6 Comparison of the distances
  4.4 Experiments
    4.4.1 Experimental setting
    4.4.2 Results and discussion
  4.5 Conclusion

5 GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS
  5.1 Introduction
  5.2 Kernel functions for subspaces
    5.2.1 Projection kernel
    5.2.2 Binet-Cauchy kernel
    5.2.3 Indefinite kernels from other metrics
    5.2.4 Extension to nonlinear subspaces
  5.3 Experiments with synthetic data
    5.3.1 Synthetic data
    5.3.2 Algorithms
    5.3.3 Results and discussion
  5.4 Discriminant Analysis of subspaces
    5.4.1 Grassmann Discriminant Analysis
    5.4.2 Mutual Subspace Method (MSM)
    5.4.3 Constrained MSM (cMSM)
    5.4.4 Discriminant Analysis of Canonical Correlations (DCC)
  5.5 Experiments with real-world data
    5.5.1 Algorithms
    5.5.2 Results and discussion
  5.6 Conclusion

6 EXTENDED GRASSMANN KERNELS AND PROBABILISTIC DISTANCES
  6.1 Introduction
  6.2 Analysis of probabilistic distances and kernels
    6.2.1 Probabilistic distances and kernels
    6.2.2 Data as Mixture of Factor Analyzers
    6.2.3 Analysis of KL distance
    6.2.4 Analysis of Probability Product Kernel
  6.3 Extended Grassmann Kernel
    6.3.1 Motivation
    6.3.2 Extension to affine subspaces
    6.3.3 Extension to scaled subspaces
    6.3.4 Extension to nonlinear subspaces
  6.4 Experiments with synthetic data
    6.4.1 Synthetic data
    6.4.2 Algorithms
    6.4.3 Results and discussion
  6.5 Experiments with real-world data
    6.5.1 Algorithms
    6.5.2 Results and discussion
  6.6 Conclusion

7 CONCLUSION
  7.1 Summary
  7.2 Future work
    7.2.1 Theory
    7.2.2 Applications

Bibliography


List of Tables

4.1 Summary of the Grassmann distances. The distances can be defined as simple functions of both the basis Y and the principal angles θ_i, except for the arc-length, which involves matrix exponentials.

5.1 Classification rates of the Euclidean SVMs and the Grassmannian SVMs. The best rate for each dataset is highlighted in boldface.

6.1 Classification rates of the Euclidean SVMs and the Grassmann SVMs. The best rate for each dataset is highlighted in boldface.


List of Figures

2.1 Classification in the input space (Left) vs. a feature space (Right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map φ : R^2 → R^3, (x_1, x_2)' ↦ (x_1^2, √2 x_1 x_2, x_2^2)', which maps the elliptical decision boundary to a hyperplane. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.

2.2 Example of classifying two-class data with a hyperplane ⟨w, x⟩ + b = 0. In this case the data can be separated without error. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.

2.3 The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto one-dimensional directions denoted by the arrows. Projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas projection in the Fisher direction yields the least overlapping, and therefore most discriminant, one-dimensional distributions. Illustration from [58].

3.1 The first five principal components of a face, computed analytically from a 3D model (Top) and a sphere (Bottom). These images match well with the empirical principal components computed from a set of real images. The figure was taken from [61].

3.2 Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed illumination condition.

3.3 Yale Face Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

3.4 Yale Face Database: examples of basis images and (cumulative) singular values.

3.5 CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed illumination condition.

3.6 CMU-PIE Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

3.7 CMU-PIE Database: examples of basis images and (cumulative) singular values.

3.8 ETH-80 Database: all categories and objects at a fixed pose.

3.9 ETH-80 Database: all poses of an object from a category, used to compute the corresponding pose subspace of the object.

3.10 ETH-80 Database: examples of basis images and (cumulative) singular values.

3.11 IXMAS Database: video sequences of an actor performing 11 different actions viewed from a fixed camera.

3.12 IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in a Cartesian coordinate system and later converted to a cylindrical coordinate system to apply the FFT.

3.13 IXMAS Database: the 'kick' action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.

3.14 IXMAS Database: cylindrical coordinate representation of the volume V(r, θ, z), and the corresponding 1D FFT feature abs(FFT(V(r, θ, z))), shown at a few values of θ.

4.1 Principal angles and Grassmann distances. Let span(Y_i) and span(Y_j) be two subspaces of the Euclidean space R^D on the left. The distance between the two subspaces span(Y_i) and span(Y_j) can be measured using the principal angles θ = [θ_1, ..., θ_m]'. From the Grassmann manifold viewpoint, the subspaces span(Y_i) and span(Y_j) are considered as two points on the manifold G(m, D), whose Riemannian distance is related to the principal angles by d(Y_i, Y_j) = ‖θ‖_2. Various distances can be defined based on the principal angles.

4.2 Yale Face Database: face recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

4.3 CMU-PIE Database: face recognition rates from the 1NN classifier with the Grassmann distances.

4.4 ETH-80 Database: object categorization rates from the 1NN classifier with the Grassmann distances.

4.5 IXMAS Database: action recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

5.1 Doubly kernel method. The first kernel implicitly maps the two 'nonlinear subspaces' X_i and X_j to span(Y_i) and span(Y_j) via the map Φ : X → H_1, where 'nonlinear subspace' means the preimage X_i = Φ^{-1}(span(Y_i)) and X_j = Φ^{-1}(span(Y_j)). The second (Grassmann) kernel maps the points Y_i and Y_j on the Grassmann manifold G(m, D) to the corresponding points in H_2 via the map Ψ : G(m, D) → H_2, such as (5.3) or (5.5).

5.2 A two-dimensional subspace is represented by a triangular patch swept by two basis vectors. The positive and negative classes are color-coded blue and red, respectively. A: The two class centers Y_+ and Y_- around which other subspaces are randomly generated. B–D: Examples of randomly selected subspaces for 'easy', 'intermediate', and 'difficult' datasets.

5.3 Yale Face Database: face recognition rates from various discriminant analysis methods. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

5.4 CMU-PIE Database: face recognition rates from various discriminant analysis methods.

5.5 ETH-80 Database: object categorization rates from various discriminant analysis methods.

5.6 IXMAS Database: action recognition rates from various discriminant analysis methods.

6.1 Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold (Left), the set of linear subspaces, can alternatively be modeled as the set of flat (σ → 0) spheres (Y_i' Y_i = I_m) intersecting at the origin (u_i = 0). The right figure shows a general Mixture of Factor Analyzers not bound by these conditions.

6.2 The Mixture of Factor Analyzers model of the Grassmann manifold is the collection of linear homogeneous Factor Analyzers, shown as flat spheres intersecting at the origin (A). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B), and also to allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C).

6.3 The same affine span can be expressed with different offsets u_1, u_2, .... However, one can use the unique 'standard' offset û, which has the shortest length from the origin.

6.4 Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as two representations of linear subspaces (Left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (Right).

6.5 Yale Face Database: face recognition rates from various kernels. The two highest rates, including ties, are highlighted in boldface for each subspace dimension m.

6.6 CMU-PIE Database: face recognition rates from various kernels.

6.7 ETH-80 Database: object categorization rates from various kernels.

6.8 IXMAS Database: action recognition rates from various kernels.


Chapter 1

INTRODUCTION

1.1 Overview

In machine learning problems the data commonly lie in a vector space, especially a Euclidean space. The Euclidean space is convenient for data representation, storage, and computation, and is geometrically intuitive as well.

There are, however, other kinds of non-Euclidean spaces more suitable for data outside the conventional Euclidean domain. The data domain I focus on in this thesis is one of those non-Euclidean spaces, in which each data sample is a linear subspace of a Euclidean space.

Researchers often encounter this non-conventional domain in computer vision problems. For example, a set of images of an object or a face under varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be empirically approximated quite well by low-dimensional subspaces. If the data consist of multiple sets of images, they can consequently be modeled as a collection of low-dimensional subspaces.

What are the potential advantages of having such structures? In the above example of face images, we can model the illumination variation of the data that is irrelevant to a recognition task by subspaces, and focus on learning the appropriate variation between those subspaces, such as the variation due to subject identity. This idea applies not only to illumination-varying faces but also to many other types of data for which we can model out the undesired factors with subspaces. Furthermore, representing data as a collection of subspaces is much more economical than keeping all the data samples as unorganized points, since we only need to store and handle the basis vectors. I refer to this approach of handling data as the subspace-based learning approach.

Few researchers have clearly defined and fully utilized the properties of such a space in learning problems. Since a collection of subspaces is non-Euclidean, one cannot benefit from the conveniences of the Euclidean space anymore. For a learning algorithm to work with the subspace representation of the data, it requires a suitable framework that is also convenient for the storage and computation of such data. This thesis provides the foundations of subspace-based learning using a novel framework and kernels.

To show the reader the scope and the depth of this work, I raise the following questions regarding the subject:

Questions

1. What are examples of subspace structure in real data?

2. Which non-Euclidean domain suits subspace-structured data?

3. What dissimilarity measures of subspaces are there, and what are their properties?

4. Can we define kernels for such a domain?

5. Are the kernels related to probabilistic distances?

6. Can we extend the framework to subspaces that are not exactly linear?

This thesis gives detailed and definitive answers to all of the questions above.

1.2 Contributions and related work

In this thesis I propose the Grassmann manifold framework for solving subspace-based problems. The Grassmann manifold is the set of fixed-dimensional linear subspaces and is an ideal model of the data under consideration. Grassmann manifolds have previously been used in signal processing and control [74, 36, 6], numerical optimization [20] (and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78]. In particular, there are many approaches that use the subspace concept for problem solving in computer vision [92, 64, 24, 43, 3]. However, these works do not explicitly or fully utilize the benefits of the Grassmann approach for subspace-based problems. In contrast, I make full use of the properties of the Grassmann manifold with a unifying framework that subsumes the previous approaches.

With the proposed framework, a dissimilarity between subspaces can be viewed as a distance function on the Grassmann manifold. I review several known distances, including the Arc-length, Projection, Binet-Cauchy, Max Corr, Min Corr, and Procrustes distances [20, 16], and provide analytical and empirical comparisons. Furthermore, I propose the Projection kernel as a legitimate kernel function on the Grassmann manifold. The Projection kernel is also used in [85], where it mainly serves as a similarity measure of subspaces rather than as a full-fledged kernel function on the Grassmann manifold. Another kernel I use in the thesis is the Binet-Cauchy kernel [90, 83]. I show that, in spite of the greater attention the Binet-Cauchy kernel has received, it is less useful than the Projection kernel with noisy data.

Using the two kernels as the representative kernels on the Grassmann manifold, I demonstrate the advantages of the Grassmann kernels over the Euclidean kernels on a classification problem with Support Vector Machines on synthetic datasets. To demonstrate the potential benefits of the kernels further, I apply the kernels to discriminant analysis on the Grassmann manifold and compare the approach with previously suggested algorithms for subspace-based discriminant analysis [92, 64, 24, 43]. In the previous methods, feature extraction is performed in the Euclidean space while non-Euclidean subspace distances are used in the objective. This inconsistency results in a difficult optimization and a weak guarantee of convergence, whereas the proposed approach with the Grassmann kernels is simpler and more effective, as evidenced by experiments with real image databases.

In this thesis I also investigate the relationship between probabilistic distances and the Grassmann kernels. If we assume the set of vectors is an i.i.d. sample from an arbitrary probability distribution, then it is possible to compare two such distributions of vectors with probabilistic similarity measures, such as the KL distance [47], the Chernoff distance [15], or the Bhattacharyya/Hellinger distance [10]. Furthermore, the Bhattacharyya affinity is in fact a positive definite kernel function on the space of distributions and has nice closed-form expressions for the exponential family [40]. The probabilistic distances and kernels have been used for recognizing hand-written digits and faces [70, 46, 96]. I provide a link between the probabilistic and the Grassmann views by modeling the subspace data as a limit of the Mixture of Factor Analyzers [27] under zero-mean and homogeneous conditions. The first result I show is that the KL distance reduces to the Projection kernel under this Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL distance, I propose an extension of the Projection kernel, which is originally confined to the set of linear subspaces, to the set of affine as well as scaled subspaces. I demonstrate the potential benefits of the extended kernels with Support Vector Machines and Kernel Discriminant Analysis, using synthetic and real image databases. The experiments show the superiority of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels, as well as over the Euclidean methods.

There is a related but independent problem of clustering unlabeled data points into multiple subspaces. Several approaches have been proposed in the literature. A traditional and inefficient technique is to use an EM algorithm [27] for a Mixture of Factor Analyzers (MFA), which models the data distribution as a superposition of Gaussian distributions. More recent work on clustering subspace data includes the K-subspaces method [37], which extends the K-means algorithm to the case of subspaces, and Generalized PCA [81], which represents subspaces with polynomials and solves algebraic equations to fit the data. These methods differ from the proposed method of this thesis in that they serve as a preprocessing step to generate subspace labels for the proposed subspace-based learning.

1.3 Organization of the paper

The rest of the paper is organized as follows:

• Chapter 2 provides background material for the thesis, including kernel theory, large margin classifiers, and discriminant analyses.

• Chapter 3 discusses theoretical and empirical evidence of inherent subspace structures in image and video databases, and describes procedures for preprocessing the databases.

• Chapter 4 introduces the Grassmann manifold as a common framework for subspace-based learning. Various distances on the Grassmann manifold are reviewed and analyzed in depth.

• Chapter 5 defines the Grassmann kernels and proposes their application to discriminant analysis. Comparisons with previously used algorithms are given.

• Chapter 6 examines the relationship between probabilistic distances and the Grassmann kernels. The chapter contains further discussion on extending the domain of subspace-based learning and presents the extended Grassmann kernels.

• Chapter 7 summarizes the contributions of the thesis and discusses future work related to the proposed methods.

• The Bibliography contains all the work referenced in this thesis.

The main chapters of the thesis are also divided into two parts. Chapters 3 and 4 integrate known facts and set up the framework for the thesis. Chapters 5 and 6 provide the main proposals, analyses, and experimental results.


Chapter 2

BACKGROUND

2.1 Introduction

In this chapter I review three topics: 1) kernel machines, 2) their application to large margin classification, and 3) their application to discriminant analysis. The theory behind kernel machines is helpful and partially necessary for understanding the kernels proposed in this thesis. The large margin classification and discriminant analysis algorithms will be used to test the proposed kernels in Chapters 5 and 6.

I provide a brief tutorial of the three topics based on well-known texts and papers such as [18, 69, 71]. Most of the proofs are omitted and can be found in the original texts.

2.2 Kernel machines

2.2.1 Motivation

Oftentimes it is neither very effective nor convenient to use the original data space to learn patterns in the data. For simplicity, let's assume the data $\mathcal{X}$ lie in a Euclidean space. When the patterns have a complex structure in the original data space, we can try to transform the data space nonlinearly into another space so that the learning task becomes easier in the transformed space. The new space is called a feature space, and the map is called a feature map.

Suppose we are trying to classify two-dimensional, two-class data (Figure 2.1). If the true class boundary is an ellipse in the input space, a linear classifier cannot classify the data correctly. However, when the input space is mapped to the feature space by

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2)' \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)',$$

the decision boundary becomes a hyperplane in three-dimensional space, and therefore the two classes can be perfectly separated by a simple linear classifier.

[Figure 2.1: Classification in the input space (Left) vs. a feature space (Right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map $\phi : \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2)' \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)'$, which maps the elliptical decision boundary to a hyperplane. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.]

Note that we mapped the data to the feature space of all (ordered) monomials of degree two, $(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and used a hyperplane in that space. We can use the same idea for a feature space of higher-degree monomials. However, if we map $x \in \mathbb{R}^D$ to the space of degree-$d$ monomials, the dimension of the feature space becomes

$$\binom{D + d - 1}{d},$$

which can be computationally infeasible even for moderate $D$ and $d$. This difficulty is easily circumvented by noting that we only need to compute inner products of points in the feature space to define a hyperplane. For the space of degree-2 monomials, the inner product can be computed from the original data by

$$\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \langle x, y \rangle^2,$$

which can be extended to degree-$d$ monomials by $\langle x, y \rangle^d$. The inner product in the feature space, such as $k(x, y) = \langle x, y \rangle^d$, is called a kernel function. From a user's point of view, a kernel function is simply a nonlinear similarity measure of data that corresponds to a linear similarity measure in a feature space that the user need not know explicitly. A formal definition will follow shortly.
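As a quick sanity check of the identity above, the following small Python sketch, which I add here for illustration (it is not part of the original text), compares the explicit degree-2 feature map with the direct kernel evaluation $\langle x, y \rangle^2$:

    import numpy as np

    def phi(x):
        """Explicit degree-2 monomial feature map R^2 -> R^3."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly2_kernel(x, y):
        """Kernel evaluation that never constructs the feature space."""
        return np.dot(x, y) ** 2

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(2), rng.standard_normal(2)

    # The two quantities agree up to floating-point error.
    print(np.dot(phi(x), phi(y)), poly2_kernel(x, y))

The agreement of the two numbers is exactly the 'kernel trick': the kernel value equals an inner product in a feature space that is never constructed explicitly.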

2.2.2 Mercer kernels

In this subsection I introduce Mercer's theorem, which characterizes the conditions under which a kernel function $k$ induces a feature map and a feature space. Let $\mathcal{X}$ denote the data space.

The case of finite $\mathcal{X}$

Definition 2.1 (Symmetric Positive Definite Matrix). A real $N \times N$ symmetric matrix $K$ is positive definite if

$$\sum_{i,j} c_i c_j K_{ij} \ge 0, \quad \text{for all } c_1, ..., c_N \ (c_i \in \mathbb{R}).$$

Consider a finite input space $\mathcal{X} = \{x_1, ..., x_N\}$ and a symmetric real-valued function $k(x, y)$. Let $K$ be the $N \times N$ matrix of the function evaluated at $\mathcal{X} \times \mathcal{X}$, $K_{ij} = k(x_i, x_j)$. Since $K$ is symmetric it can be diagonalized as $K = V \Lambda V'$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \le ... \le \lambda_N$ and $V$ is an orthonormal matrix whose columns are the corresponding eigenvectors. Let $v_i$ denote the $i$-th row of $V$:

$$V = [v_1' \cdots v_N']'.$$

If the matrix $K$ is positive definite, and therefore the eigenvalues are non-negative, then we can define the following feature map

$$\phi : \mathcal{X} \to \mathcal{H} = \mathbb{R}^N, \quad x_i \mapsto v_i \Lambda^{1/2}, \quad i = 1, ..., N,$$

where $\Lambda^{1/2}$ is the diagonal matrix $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, ..., \sqrt{\lambda_N})$. We now observe that the inner product in the feature space $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ coincides with the kernel matrix of the data:

$$\langle \phi(x_i), \phi(x_j) \rangle = v_i \Lambda v_j' = (V \Lambda V')_{ij} = K_{ij}.$$
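To make the finite-case construction concrete, here is a small numerical sketch (my illustration, not part of the original text): it eigendecomposes a Gram matrix $K = V \Lambda V'$, forms the feature vectors $\phi(x_i) = v_i \Lambda^{1/2}$, and checks that their inner products reproduce $K$.

    import numpy as np

    # Toy data and a polynomial kernel evaluated on a finite input space.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))                # N = 5 points in R^3
    K = (X @ X.T) ** 2                             # K_ij = <x_i, x_j>^2, positive semidefinite

    # Diagonalize K = V Lambda V' and build the feature map phi(x_i) = v_i Lambda^{1/2}.
    lam, V = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)                  # guard against tiny negative round-off
    Phi = V * np.sqrt(lam)                         # i-th row is phi(x_i)

    # The pairwise inner products of the features reproduce the kernel matrix.
    assert np.allclose(Phi @ Phi.T, K)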
The case of compact $\mathcal{X}$

Let's apply the intuition gained from the finite case to the infinite-dimensional case. Although further generalization to a finite measure space $(\mathcal{X}, \mu)$ is possible, we will deal with compact subsets of $\mathbb{R}^D$ as the domain.

Theorem 2.2 (Mercer). Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^D$. Suppose $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous symmetric function such that the integral operator $T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X})$,

$$(T_k f)(x) = \int_{\mathcal{X}} k(x, y) f(y)\, dy,$$

has the property

$$\int_{\mathcal{X}^2} k(x, y) f(x) f(y)\, dx\, dy \ge 0,$$

for all $f \in L_2(\mathcal{X})$. Then we have a uniformly convergent series

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$$

in terms of the normalized eigenfunctions $\psi_i \in L_2(\mathcal{X})$ of $T_k$ (normalized means $\|\psi_i\|_{L_2} = 1$).

The condition on $T_k$ is an extension of the positive definite condition for matrices. Let's define a sequence of feature maps from the operator eigenfunctions $\psi_i$:

$$\phi_d : \mathcal{X} \to \mathcal{H} = l_2^d, \quad x \mapsto \left(\sqrt{\lambda_1}\,\psi_1(x), ..., \sqrt{\lambda_d}\,\psi_d(x)\right),$$

for $d = 1, 2, ....$ Theorem 2.2 tells us that the sequence of maps $\phi_1, \phi_2, ...$ converges to a map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y)$. The theorem below is the formalization of this observation:

Theorem 2.3 (Mercer Kernel Map). If $\mathcal{X}$ is a compact subset of $\mathbb{R}^D$ and $k$ is a function satisfying the conditions of Theorem 2.2, then there is a feature map $\phi : \mathcal{X} \to \mathcal{H}$ into a feature space $\mathcal{H}$ where $k$ becomes an inner product

$$\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y),$$

for almost all $x, y \in \mathcal{X}$. Moreover, given any $\epsilon > 0$, there exists a map $\phi_n$ into an $n$-dimensional Hilbert space such that

$$|k(x, y) - \langle \phi_n(x), \phi_n(y) \rangle| < \epsilon$$

for almost all $x, y \in \mathcal{X}$.

Mercer's theorem gives us a construction of a feature space. In the next section we will look at a more general construction via the Reproducing Kernel Hilbert Space.

2.2.3 Reproducing Kernel Hilbert Space

Extending the notion of positive definiteness of matrices and compact operators, we can define the positive definiteness of a function on an arbitrary set $\mathcal{X}$ as follows:

Definition 2.4 (Positive Definite Kernel). Let $\mathcal{X}$ be any set, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a positive definite kernel function if

$$\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0,$$

for all $x_1, ..., x_n \ (x_i \in \mathcal{X})$ and $c_1, ..., c_n \ (c_i \in \mathbb{R})$, for any $n \in \mathbb{N}$.

In fact, the necessary and sufficient condition for a kernel to have an associated feature space and feature map is that the kernel be positive definite. Below are the three steps in [69] to construct the feature map $\phi$ and the feature space $\mathcal{H}$ from a given positive definite kernel $k$:

1. Define a vector space with $k$.

2. Endow it with an inner product with a reproducing property.

3. Complete the space to a Hilbert space.

First, we define $\mathcal{H}$ as the set of all linear combinations of functions of the form

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i),$$

for arbitrary $m \in \mathbb{N}$, $\alpha_1, ..., \alpha_m \ (\alpha_i \in \mathbb{R})$, and $x_1, ..., x_m \ (x_i \in \mathcal{X})$. It is not difficult to check that $\mathcal{H}$ is a vector space. Let $g(\cdot) = \sum_{j=1}^{n} \beta_j k(\cdot, y_j)$ be another function in the vector space for some $n \in \mathbb{N}$, $\beta_1, ..., \beta_n \ (\beta_j \in \mathbb{R})$, and $y_1, ..., y_n \ (y_j \in \mathcal{X})$. Next, we define the following inner product between $f$ and $g$:

$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j).$$

It is possible that the coefficients $\{\alpha_i\}$ and $\{\beta_j\}$ are not unique. That is, a function $f$ (or $g$) may be represented in multiple ways with different coefficients. To see that the inner product is still well-defined, note that

$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j) = \sum_{j} \beta_j f(y_j)$$

by definition. This shows that $\langle \cdot, \cdot \rangle$ does not depend on the particular expansion coefficients $\{\alpha_i\}$. Similarly, $\langle f, g \rangle = \sum_i \alpha_i g(x_i)$ shows that the inner product does not depend on $\{\beta_j\}$ either. The positivity $\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \ge 0$ follows from the positive definiteness of $k$. The other axioms are easily checked. One notable property of the defined inner product is as follows. By choosing $g(\cdot) = k(\cdot, y)$ we have $\langle f, k(\cdot, y) \rangle = f(y)$ by definition. Furthermore, with $f(\cdot) = k(\cdot, x)$ we have

$$\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y),$$

which is called the reproducing property.

Finally, the space can be completed to a Hilbert space, which is called the Reproducing Kernel Hilbert Space. Below is the formal definition of the space:

Definition 2.5 (Reproducing Kernel Hilbert Space). Let $\mathcal{X}$ be a nonempty set and $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) endowed with the inner product $\langle \cdot, \cdot \rangle$ if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following properties:

1. $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular, $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$.

2. $k$ spans $\mathcal{H}$, that is, $\mathcal{H} = \overline{\mathrm{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where $\overline{X}$ denotes the completion of the set $X$.

We have seen that an RKHS can be constructed from a positive definite kernel in three steps. The converse is also true: if an RKHS $\mathcal{H}$ is given, then a unique positive definite kernel can be defined as the inner product of the space $\mathcal{H}$.

Finally, we show that Mercer kernels are positive definite in the generalized sense.

Theorem 2.6 (Equivalence of Positive Definiteness). Let $\mathcal{X} = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is a positive definite kernel if and only if

$$\int_{[a,b] \times [a,b]} k(x, y) f(x) \overline{f(y)}\, dx\, dy \ge 0$$

for any continuous function $f : \mathcal{X} \to \mathbb{C}$.

In this regard, every Mercer kernel $k$ has an RKHS as a feature space for which $k$ is the reproducing kernel.
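The three-step construction above is easy to mimic numerically. The sketch below (added for illustration; the class name and the choice of the Gaussian RBF kernel are mine, not the thesis's) represents functions $f(\cdot) = \sum_i \alpha_i k(\cdot, x_i)$, implements the inner product $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, y_j)$, and checks the reproducing property $\langle f, k(\cdot, y) \rangle = f(y)$:

    import numpy as np

    def rbf(x, y, sigma=1.0):
        """Gaussian RBF kernel on R^D, a standard positive definite kernel."""
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    class PreHilbertFunction:
        """A function f(.) = sum_i alpha_i k(., x_i) in the span of kernel sections."""
        def __init__(self, kernel, alphas, points):
            self.kernel, self.alphas, self.points = kernel, np.asarray(alphas), points

        def __call__(self, x):
            return sum(a * self.kernel(x, xi) for a, xi in zip(self.alphas, self.points))

        def inner(self, other):
            """<f, g> = sum_{i,j} alpha_i beta_j k(x_i, y_j)."""
            return sum(a * b * self.kernel(xi, yj)
                       for a, xi in zip(self.alphas, self.points)
                       for b, yj in zip(other.alphas, other.points))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 2))
    f = PreHilbertFunction(rbf, [0.5, -1.0, 2.0], list(X))

    # Reproducing property: <f, k(., y)> equals the function value f(y).
    y = rng.standard_normal(2)
    k_y = PreHilbertFunction(rbf, [1.0], [y])
    print(f.inner(k_y), f(y))      # the two numbers agree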

2.2.4 Examples of kernels

There are an ever-expanding number of kernels for various types of data and applications, and we can only glimpse a portion of them here. Below is a list of the most often-used kernels for Euclidean data. Let $x, y \in \mathcal{X} \subset \mathbb{R}^D$.

• Homogeneous polynomial kernel: $k(x, y) = \langle x, y \rangle^d$.

• Nonhomogeneous polynomial kernel: $k(x, y) = (\langle x, y \rangle + c)^d$, $c \in \mathbb{R}$.

• Gaussian RBF kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, $\sigma > 0$. The Gaussian RBF kernel has the following characteristics: 1) the points in the feature space lie on a sphere, since $\|\phi(x)\|^2 = 1$; 2) the angle between two mapped points $\phi(x)$ and $\phi(y)$ is at most $\pi/2$; and 3) the feature space is infinite-dimensional.

These kernels were the first to be used with large margin classifiers [11]. They can be evaluated in closed form without having to construct the feature spaces explicitly. Further work has discovered other types of kernels that can be evaluated efficiently by a recursion. These include the following two kernels:

• All-subsets kernel [75]: Let $I = \{1, 2, ..., D\}$ be the index set for the variables $x_i$, $i \in I$. For every subset $A$ of $I$, let us define $\phi_A(x) = \prod_{i \in A} x_i$. For $A = \emptyset$ we define $\phi_\emptyset(x) = 1$. If $\phi(x)$ is the sequence $(\phi_A(x))_{A \subseteq I}$, then the all-subsets kernel is

$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{A \subseteq I} \phi_A(x) \phi_A(y) = \sum_{A \subseteq I} \prod_{i \in A} x_i y_i = \prod_{i=1}^{D} (1 + x_i y_i).$$

• ANOVA kernel [87]: It is defined similarly to the all-subsets kernel. Define $\phi(x)$ as the sequence $(\phi_A(x))_{A \subseteq I, |A| = d}$, where we restrict $A$ to the subsets of cardinality $d$. Then the kernel is

$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{|A| = d} \phi_A(x) \phi_A(y) = \sum_{1 \le i_1 < \cdots < i_d \le D} x_{i_1} y_{i_1} \cdots x_{i_d} y_{i_d}.$$
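Both of these kernels can indeed be evaluated without enumerating subsets: the all-subsets kernel is the closed-form product above, and the ANOVA kernel satisfies the standard recursion $K_s(m) = K_s(m-1) + x_m y_m K_{s-1}(m-1)$ over the first $m$ variables. The sketch below is my illustration of that recursion (the function names are mine, not from the thesis), checked against a brute-force sum over subsets:

    import numpy as np
    from itertools import combinations

    def all_subsets_kernel(x, y):
        """k(x, y) = prod_i (1 + x_i * y_i): sums phi_A over all subsets A implicitly."""
        return np.prod(1.0 + x * y)

    def anova_kernel(x, y, d):
        """ANOVA kernel of order d via the dynamic-programming recursion
        K_s(m) = K_s(m-1) + x_m * y_m * K_{s-1}(m-1)."""
        D = len(x)
        K = np.zeros((d + 1, D + 1))
        K[0, :] = 1.0                      # the empty product
        for s in range(1, d + 1):
            for m in range(1, D + 1):
                K[s, m] = K[s, m - 1] + x[m - 1] * y[m - 1] * K[s - 1, m - 1]
        return K[d, D]

    def anova_bruteforce(x, y, d):
        """Direct sum over all index subsets of size d (exponential, for checking only)."""
        return sum(np.prod([x[i] * y[i] for i in A])
                   for A in combinations(range(len(x)), d))

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    print(np.isclose(anova_kernel(x, y, 3), anova_bruteforce(x, y, 3)))   # True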


2.2.5 Generating new kernels from old kernels

Theorem 2.7. If $k_1(x, y)$ and $k_2(x, y)$ are positive definite kernels, then the following kernels are also positive definite:

1. Conic combination: $\alpha_1 k_1(x, y) + \alpha_2 k_2(x, y)$, $(\alpha_1, \alpha_2 > 0)$

2. Pointwise product: $k_1(x, y)\, k_2(x, y)$

3. Integration: $\int k(x, z)\, k(y, z)\, dz$

4. Product with a rank-1 kernel: $k(x, y) f(x) f(y)$

5. Limit: if $k_1(x, y), k_2(x, y), ...$ are positive definite kernels, then so is $\lim_{i \to \infty} k_i(x, y)$.

Proofs can be found in [69, 71].

Corollary 2.8. If $k$ is a positive definite kernel, then so are $f(k(x, y))$ and $\exp k(x, y)$, where $f : \mathbb{R} \to \mathbb{R}$ is any polynomial function with nonnegative coefficients.
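As a quick empirical illustration of these closure properties (my addition, not from the original text), one can build Gram matrices for two base kernels on random points and confirm that the combinations above remain positive semidefinite by inspecting their smallest eigenvalues:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 4))

    # Two base Gram matrices: a linear kernel and a Gaussian RBF kernel.
    K1 = X @ X.T
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K2 = np.exp(-sq / 2.0)

    def min_eig(K):
        return np.linalg.eigvalsh(K).min()

    # Conic combination, pointwise product, and exp of a kernel are all (numerically) PSD.
    for K in (2.0 * K1 + 3.0 * K2, K1 * K2, np.exp(K1)):
        print(min_eig(K) >= -1e-8)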

2.2.6 Distance and conditionally positive definite kernels

In this subsection I review the relationship between distances and conditionally positive definite kernels.

Distance and metric

Throughout the thesis I will use the term distance interchangeably with similarity measure, to denote an intuitive notion of 'closeness' between two patterns in the data. Therefore a distance $d(\cdot, \cdot)$ is any assignment of nonnegative values to a pair of points in a set $\mathcal{X}$. A metric is, however, a distance that satisfies additional axioms:

Definition 2.9 (Metric). A real-valued function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a metric if

1. $d(x_1, x_2) \ge 0$,

2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$,

3. $d(x_1, x_2) = d(x_2, x_1)$,

4. $d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)$,

for all $x_1, x_2, x_3 \in \mathcal{X}$.

Relationship between metric and kernel

The standard metric $d(\phi(x_1), \phi(x_2))$ in the feature space is the norm $\|\phi(x_1) - \phi(x_2)\|$ induced from the inner product. The metric can be written in terms of the kernel as

$$d^2(\phi(x_1), \phi(x_2)) = k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2). \qquad (2.1)$$

Therefore any RKHS is also a metric space $(\mathcal{H}, d)$ with the metric given above. Conversely, if a metric is given that is known to be induced from an inner product, then we can recover the inner product from the polarization of the metric:

$$\tilde{k}(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle = \frac{1}{2} \left( -\|\phi(x_1) - \phi(x_2)\|^2 + \|\phi(x_1)\|^2 + \|\phi(x_2)\|^2 \right).$$

This raises the following question: if we are given a set and a metric $(\mathcal{X}, d)$, can we determine whether $d$ is induced from a positive definite kernel? To answer the question we need the following definition.

Definition 2.10 (Conditionally Positive Definite Kernel). Let $\mathcal{X}$ be any set, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a conditionally positive definite kernel function if

$$\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0,$$

for all $x_1, ..., x_n \ (x_i \in \mathcal{X})$ and $c_1, ..., c_n \ (c_i \in \mathbb{R})$ such that $\sum_{i=1}^{n} c_i = 0$, for any $n \in \mathbb{N}$.

The question above is answered by the following theorem [67]:

Theorem 2.11 (Schoenberg). A metric space $(\mathcal{X}, d)$ can be embedded isometrically into a Hilbert space if and only if $-d^2(\cdot, \cdot)$ is conditionally positive definite.

As a corollary, we have

Corollary 2.12 ([35]). A metric $d$ is induced from a positive definite kernel if and only if

$$\tilde{k}(x_1, x_2) = -d^2(x_1, x_2)/2, \quad x_1, x_2 \in \mathcal{X} \qquad (2.2)$$

is conditionally positive definite.

It is known that one can use conditionally positive definite kernels just as positive definite kernels in learning problems that are invariant to the choice of origin [68].
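Equation (2.1) is what allows kernel methods to reason about feature-space geometry without explicit features. A minimal sketch (added for illustration, not part of the original text) evaluates the feature-space distance induced by a Gaussian RBF kernel directly from kernel values:

    import numpy as np

    def rbf(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    def feature_distance(k, x1, x2):
        """d(phi(x1), phi(x2)) from Eq. (2.1): sqrt(k(x1,x1) + k(x2,x2) - 2 k(x1,x2))."""
        return np.sqrt(max(k(x1, x1) + k(x2, x2) - 2.0 * k(x1, x2), 0.0))

    rng = np.random.default_rng(0)
    x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

    # For the RBF kernel, k(x, x) = 1, so every mapped point lies on the unit sphere
    # and the squared feature distance is 2 - 2 k(x1, x2) <= 2.
    print(feature_distance(rbf, x1, x2))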

2.3 Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning method used for classification. Due to its computational efficiency and theoretically well-understood generalization performance, the SVM has received a lot of attention in the last decade and is still one of the main topics in machine learning research. In this section I review the basics of the SVM.

I use the notation $\mathcal{D} = \{(x_1, y_1), ..., (x_N, y_N)\}$ to denote $N$ pairs of a training sample $x_i \in \mathbb{R}^D$ and its class label $y_i \in \{-1, 1\}$, $i = 1, 2, ..., N$.

2.3.1 Large margin linear classifier

Consider the problem of separating the two-class training data $\mathcal{D} = \{(x_1, y_1), ..., (x_N, y_N)\}$ with a hyperplane

$$\mathcal{P} : \langle w, x \rangle + b = 0.$$

Let's assume the data are linearly separable, that is, we can separate the data with a hyperplane without error (refer to Figure 2.2). Since the equation $\langle c \cdot w, x \rangle + c \cdot b = 0$ represents the same hyperplane for any nonzero $c \in \mathbb{R}$, we choose a canonical representation of the hyperplane by setting $\min_i |\langle w, x_i \rangle + b| = 1$. The linear separability can then be expressed as

$$y_i (\langle w, x_i \rangle + b) \ge 1, \quad i = 1, ..., N, \qquad (2.3)$$

and the distance of a point $x$ to the hyperplane $\mathcal{P}$ is given by

$$d(x, \mathcal{P}) = \frac{|\langle w, x \rangle + b|}{\|w\|}.$$

We define the margin of the hyperplane as twice the minimum distance between the training samples and the hyperplane,

$$\rho = 2 \min_i d(x_i, \mathcal{P}),$$

which under the canonical representation can be shown to equal $\rho = \frac{2}{\|w\|}$.

If the data are linearly separable, there are typically an infinite number of hyperplanes that separate the classes correctly. However, the main idea of the SVM is to choose the one that has the maximum margin. Therefore the maximum margin classifier is the solution to the optimization problem

$$\min_{w, b} \; \frac{1}{2} \|w\|^2, \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1, \; i = 1, ..., N. \qquad (2.4)$$

[Figure 2.2: Example of classifying two-class data with a hyperplane $\langle w, x \rangle + b = 0$. In this case the data can be separated without error. Illustration from Schölkopf's tutorial slides at the workshop "The Analysis of Patterns", Erice, Italy, 2005.]

2.3.2 Dual problem and support vectors

The primal problem (2.4) is a convex optimization problem with linear constraints. By Lagrangian duality, solving the primal problem is equivalent to solving the dual problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to} \quad \alpha_i \ge 0, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0. \qquad (2.5)$$

The advantages of the dual formulation are two-fold: 1) the dual problem is often easier to solve than the primal problem, and 2) it provides a geometrically meaningful interpretation of the solution.

If $\alpha^*$ is the optimal solution of (2.5), then the optimal values of the primal variables are given by $w^* = \sum_i \alpha_i^* y_i x_i$ and $b = -\frac{1}{2} \langle w^*, x_+ + x_- \rangle$, where $x_+$ and $x_-$ are positive- and negative-class samples such that $\langle w, x_+ \rangle + b = 1$ and $\langle w, x_- \rangle + b = -1$, respectively.

The resulting classifier for test data is then

$$f(x) = \mathrm{sgn}(\langle w^*, x \rangle + b) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\Big), \quad \text{where} \quad \mathrm{sgn}(z) = \begin{cases} -1, & z < 0 \\ 0, & z = 0 \\ 1, & z > 0. \end{cases}$$

The Kuhn-Tucker conditions of the optimization problem require

$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0, \quad i = 1, ..., N.$$

This implies that only the points $x_i$ that satisfy $y_i (\langle w, x_i \rangle + b) = 1$ will have a nonzero dual variable $\alpha_i$. These points are called support vectors, since they are the only points needed to define the decision function in the linearly separable case.
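To see the sparsity of the dual solution in practice, the brief sketch below (my addition; it assumes the scikit-learn library, which the thesis does not mention) trains a linear SVM on separable toy data and inspects which training points become support vectors:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)

    # Only the points with nonzero alpha_i (the support vectors) define the hyperplane.
    print("number of support vectors:", len(clf.support_))
    print("w* =", clf.coef_.ravel(), " b =", clf.intercept_[0])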

2.3.3 Extensions

Non-separable case: soft-margin SVM

Suppose the data are not linearly separable, so that the constraints (2.3) need to be relaxed to

$$y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, ..., N,$$

for the problem to be feasible. A soft-margin SVM is defined by the optimization

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i, \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, ..., N, \qquad (2.6)$$

where $C$ is a fixed parameter that determines the trade-off between the margin and the classification error in the cost. The primal problem (2.6) also has an equivalent dual problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0. \qquad (2.7)$$

The regularization parameter $C$ should reflect prior knowledge of the amount of noise in the data.

Nonlinear separation: kernel SVM

We obtain a nonlinear version of the SVM by mapping the space $\mathcal{X}$ to an RKHS via a kernel function $k$. The kernel SVM is implemented simply by replacing the Euclidean inner product $\langle x_i, x_j \rangle$ with a given kernel function $k(x_i, x_j)$. After the replacement, the soft-margin SVM problem (2.7) becomes

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_i \alpha_i, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, ..., N, \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0, \qquad (2.8)$$

and the resulting decision function for test data is given in terms of the kernel function:

$$f(x) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i k(x_i, x) + b\Big).$$

Since $K_{ij} = k(x_i, x_j)$ is a fixed matrix, the optimization in the training phase is no more difficult than solving the linear SVM. The resulting decision function can classify highly nonlinear, complicated data distributions at the same cost as training the simple linear classifier.
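In practice this means a standard SVM solver only ever needs the Gram matrix $K_{ij} = k(x_i, x_j)$. The following sketch (my illustration, again assuming scikit-learn) trains a soft-margin SVM from a precomputed degree-2 polynomial kernel, the same kernel used in the motivating example of Section 2.2.1:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 2))
    y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)     # elliptical decision boundary

    def poly2_gram(A, B):
        """Gram matrix of the degree-2 polynomial kernel k(x, y) = <x, y>^2."""
        return (A @ B.T) ** 2

    X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]

    # The solver only needs kernel evaluations, never the explicit feature map.
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(poly2_gram(X_tr, X_tr), y_tr)
    accuracy = (clf.predict(poly2_gram(X_te, X_tr)) == y_te).mean()
    print(accuracy)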

2.3.4 Generalization error and overfitting

The success of SVM algorithms in practice can be ascribed to their ability to bound generalization errors. I will not go into this vast topic, but would like to point out the following fact: the maximization of the margin corresponds to the minimization of the capacity (or the complexity) of the hyperplane, which helps to avoid overfitting.

2.4 Discriminant Analysis<br />

A discriminant analysis technique is a method to find a low-dimensional subspace of the<br />

input space which preserves the discriminant features of multiclass data. Figure 2.3 illus-<br />

trates the idea for a two-class toy problem.<br />

I introduce two techniques: Fisher Discriminant Analysis (FDA) (or Linear Discrim-<br />

inant Analysis) [25] and Nonparametric Discriminant Analysis (NDA) [12]. Originally<br />

these algorithms are developed and used for low-dimensional Euclidean data. I will dis-<br />

cuss the challenges and solutions when the techniques are applied to high-dimensional data,<br />

and describe their extensions to nonlinear discrimination problems with kernels.<br />

Both FDA and NDA are discriminant analysis techniques which find a subspace that<br />

maximizes the ratio of between-class scatter Sb and within-class scatter Sw after the data<br />

are projected onto the subspace. The objective function for one-dimensional case is the<br />

Rayleigh quotient<br />

J(w) = w′ Sbw<br />

w ′ Sww , w ∈ RD ,<br />

24


Figure 2.3: The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto the one-dimensional directions denoted by the arrows. The projection in the direction of the largest variance (the PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas the projection in the Fisher direction yields the least overlapping, and therefore the most discriminant, one-dimensional distributions. This illustration is adapted from [58].

where D is the dimension of the data space. For multiclass data there are several options for the objective function [25]. The most widely used objective is the multiclass Rayleigh quotient

$$J(W) = \mathrm{tr}\left[ (W' S_w W)^{-1} W' S_b W \right] \tag{2.9}$$

where W is a D × d matrix, and d < D is the low-dimensional feature dimension. The quotient measures the class separability in the subspace span(W) similarly to the one-dimensional case.


2.4.1 Fisher Discriminant Analysis<br />

Let $\{x_1, \ldots, x_N\}$ be the data vectors and $\{y_1, \ldots, y_N\}$ be the class labels, $y_i \in \{1, \ldots, C\}$. Without loss of generality we assume the data are ordered according to the class labels: $1 = y_1 \le y_2 \le \ldots \le y_N = C$. Each class c has $N_c$ samples.

Let $\mu_c = \frac{1}{N_c}\sum_{\{i \mid y_i = c\}} x_i$ be the mean of class c, and $\mu = \frac{1}{N}\sum_i x_i$ be the global mean.

The between-scatter and within-scatter matrices of FDA are defined as follows:

$$S_b = \frac{1}{N}\sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)'$$

$$S_w = \frac{1}{N}\sum_{c=1}^{C}\sum_{\{i \mid y_i = c\}} (x_i - \mu_c)(x_i - \mu_c)'$$

When $S_w$ is nonsingular, which is typically the case for low-dimensional data (D < N), the optimal W is found from the leading eigenvectors of $S_w^{-1} S_b$. Since $S_w^{-1} S_b$ has rank C − 1, there are C − 1 sequential optima $W = \{w_1, \ldots, w_{C-1}\}$. By projecting data onto span(W), we achieve dimensionality reduction and feature extraction of the data onto the most discriminant directions.

To classify points with the simple k-NN classifier, one can use the distance of data<br />

projected onto span(W ), or use the Mahalanobis distance to the projected mean of each<br />

class:<br />

$$\arg\min_j\ d_j(x) = [W'(x - \mu_j)]'\,(W' S_w W)^{-1}\,[W'(x - \mu_j)]. \tag{2.10}$$
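As a concrete illustration of the procedure above, here is a minimal NumPy sketch under my own naming (not a reference implementation), with the regularization of (2.11) applied to keep $S_w$ invertible:

```python
import numpy as np

def fda_directions(X, y, sigma2=1e-3):
    """X: (N, D) data matrix, y: (N,) integer class labels. Returns W of shape (D, C-1)."""
    classes = np.unique(y)
    N, D = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
    Sb /= N
    Sw /= N
    Sw += sigma2 * np.eye(D)                             # regularization, cf. (2.11)
    # The discriminant directions are the leading eigenvectors of Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:len(classes) - 1]
    return evecs[:, order].real
```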

2.4.2 Nonparametric Discriminant Analysis<br />

The FDA is motivated by the simple scenario in which the class-conditional distribution

p(x|ci) is Gaussian or at least has a peak around its mean µi. However, this assumption is<br />

easily violated, for example, by a distribution that has multiple peaks. The NDA tries to relax the parametric Gaussian assumption a little. The between-scatter and within-scatter

matrices of NDA are defined as

$$S_b = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \in B_i} (x_i - x_j)(x_i - x_j)'$$

$$S_w = \frac{1}{N}\sum_{i=1}^{N}\sum_{j \in W_i} (x_i - x_j)(x_i - x_j)',$$

where $B_i$ is the index set of the K nearest neighbors of $x_i$ that belong to classes different from that of $x_i$, and $W_i$ is the index set of the K nearest neighbors of $x_i$ that belong to the same class as $x_i$. While FDA uses the global class mean $\mu_i$ as a representative of each class, NDA

uses the local class mean around the point of interest. This results in a tolerance to the<br />

non-Gaussianity or multimodality of the classes. When the number of nearest neighbors K<br />

increases, NDA behaves similarly to FDA.<br />

For classification tasks one can also use the simple k-NN rule restricted to the span(W )<br />

or the Mahalanobis distance similarly to FDA, although k-NN is more consistent with the<br />

nonparametric assumption of NDA.<br />

2.4.3 Discriminant analysis in high-dimensional spaces<br />

In the previous explanation of FDA, we assumed Sw is nonsingular. However, this is not the<br />

case for high-dimensional data. Note the ranks of Sw and Sb of FDA can be at most N − C<br />

and C − 1 respectively. The maxima are achieved when the data are not co-linear which is<br />

very likely for high-dimensional data [93]. Because the number of features d cannot exceed<br />

the rank of Sb, FDA can extract at most C − 1 features. For NDA, when K > 1, the rank<br />

of Sw can be up to N − C and the rank of Sb can be up to N − 1. However, for small K<br />

the ranks of Sw and Sb will also be small. The number of features NDA can extract is also<br />

less than the rank of $S_b$.



Because $S_w$ spans at most N − C dimensions while the span of the data is at least (N − 1)-dimensional, there is always a nullspace of $S_w$ inside the span of the data. Without regularization, both FDA and NDA always return directions in this nullspace of $S_w$ as the maximizers of the Rayleigh quotients.

This is not preferable because even a small change in the data can make a big change in<br />

the solution. One solution suggested in [7] is to use Principal Component Analysis to<br />

first reduce the dimensionality of data by projecting them to a subspace spanned by the<br />

N − C largest eigenvectors. In this subspace $S_w$ is likely to be well-conditioned. Another solution is to regularize the ill-conditioned matrix $S_w$ by adding an isotropic noise matrix,

$$S_w \leftarrow S_w + \sigma^2 I, \tag{2.11}$$

where σ determines the amount of regularization. I use the regularization approach in this<br />

thesis.<br />

2.4.4 Extension to nonlinear discriminant analysis<br />

From the discussion of kernel machines in the previous section, we know that a linear<br />

algorithm can be extended to nonlinear algorithms by using kernels. The Kernel FDA, also<br />

known as the Generalized Discriminant Analysis, has been fully studied by [5, 56, 57]. To<br />

summarize, Kernel FDA can be formulated as follows.<br />

Let φ : G → H be the feature map, and Φ = [φ1 ... φN] be the feature matrix of the<br />

training points. Assuming the FDA direction w in the feature space is a linear combination<br />

of the feature vectors, w = Φα, we can rewrite the Rayleigh quotient in terms of α as<br />

$$J(\alpha) = \frac{\alpha' \Phi' S_B \Phi \alpha}{\alpha' \Phi' S_W \Phi \alpha} = \frac{\alpha' K\big(V - \frac{1}{N}\mathbf{1}_N \mathbf{1}_N'\big) K \alpha}{\alpha' \big(K(I_N - V)K + \sigma^2 I_N\big)\alpha}, \tag{2.12}$$

where K is the kernel matrix, $\mathbf{1}_N$ is the uniform vector $[1 \ \cdots \ 1]'$ of length N, and V is a block-diagonal matrix whose c-th block is the uniform matrix $\frac{1}{N_c}\mathbf{1}_{N_c}\mathbf{1}_{N_c}'$,

$$V = \begin{pmatrix} \frac{1}{N_1}\mathbf{1}_{N_1}\mathbf{1}_{N_1}' & & \\ & \ddots & \\ & & \frac{1}{N_C}\mathbf{1}_{N_C}\mathbf{1}_{N_C}' \end{pmatrix}.$$

The term $\sigma^2 I_N$ is a regularizer that makes the computation stable. Similarly to FDA, the set of optimal $\alpha$'s is computed from the eigenvectors of $K_w^{-1} K_b$, where $K_b$ and $K_w$ are the quadratic matrices in the numerator and the denominator of (2.12):

$$K_b = K\big(V - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N'\big)K$$

$$K_w = K(I_N - V)K + \sigma^2 I_N.$$

The NDA can be similarly kernelized by the assumption w = Φα and is omitted here.<br />
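To make the formulation concrete, the following NumPy sketch (my own illustration, not the implementation of [5, 56, 57]) builds $K_b$ and $K_w$ from a precomputed kernel matrix and a label vector, and returns the optimal α's:

```python
import numpy as np

def kernel_fda(K, y, sigma2=1e-3):
    """K: (N, N) kernel matrix, y: (N,) integer labels. Returns alphas of shape (N, C-1)."""
    N = K.shape[0]
    classes = np.unique(y)
    V = np.zeros((N, N))                      # block-diagonal matrix V
    for c in classes:
        idx = np.where(y == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)
    one = np.ones((N, 1))
    Kb = K @ (V - one @ one.T / N) @ K
    Kw = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)
    # Optimal alphas: leading eigenvectors of Kw^{-1} Kb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Kw, Kb))
    order = np.argsort(-evals.real)[:len(classes) - 1]
    return evecs[:, order].real
```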



Chapter 3<br />

MOTIVATION: SUBSPACE STRUCTURE IN DATA

3.1 Introduction<br />

In this chapter I discuss theoretical and empirical evidence of the subspace structures which naturally appear in real-world data.

The most prominent examples of subspaces can be found in the image-based face recog-<br />

nition problem. Face images show large variability due to identity, pose change, illumina-<br />

tion condition, facial expression, and so on. The Principal Component Analysis (PCA) has<br />

been applied to construct low-dimensional models of the faces by [73, 44] and used for<br />

recognition by [79], known as the Eigenfaces. Although the Eigenfaces were originally ap-<br />

plied to model image variations across different people, they also explain the illumination<br />

variation of a single person exceptionally well [32, 21, 94]. Theoretical explanations to the<br />

low-dimensionality of the illumination variability have been proposed by [8, 62, 61, 4].<br />

When the data consist of illumination-varying images of multiple people, I can model the data as a collection of illumination subspaces, one from each person. In this way,


I can absorb the undesired variability of illumination as variability within subspaces, and<br />

emphasize the variability of subject identity as variability between the subspaces. This<br />

idea not only applies to illumination change but also to other types of data that have known<br />

linear substructures. This is the main idea of the subspace-based approach advocated in<br />

this thesis.<br />

More recent examples of subspace structure are found in the dynamical system models of video sequences of, for example, human actions or time-varying textures [19, 80, 78]. When each sequence is modeled by a dynamical system, I can compare those dynamical systems by comparing the linear spans of the observability matrices of each system, which is similar to comparing the image subspaces.

In the rest of this chapter I explain the details of computing subspaces and estimating<br />

dynamical systems from image data. The procedures are demonstrated with the well-known<br />

image databases: the Yale Face, CMU-PIE, ETH-80, and IXMAS databases.<br />

3.2 Illumination subspaces in multi-lighting images<br />

Suppose we have a convex-shaped Lambertian object and a single distant light source illuminating the object. If we ignore the attached and the cast shadows on the object for now, then the observed intensity x (irradiance) of a surface patch is linearly related to the incident intensity of the light s (radiance) by the Lambertian reflectance model

$$x = \rho\, \langle b, s \rangle,$$

where ρ is the albedo of the surface, b is the surface normal, and s is the light source vector.<br />

If the whole image x is a collection of D pixel values, that is, $x = [x_1, \ldots, x_D]'$, then

$$x = Bs,$$

where $B = [\alpha_1 b_1, \ldots, \alpha_D b_D]'$ is the D × 3 matrix of albedos and surface normals. Thus, the set of images under all possible illuminations consists of linear combinations of the column vectors of B,

$$X = \{Bs \mid s \in \mathbb{R}^3\},$$

which is at most a three-dimensional subspace. However, this is an unrealistic model since it allows negative light intensities.

We get a more realistic model by disallowing negative intensities and allowing attached shadows as follows: $x_i = \max(\alpha_i \langle b_i, s\rangle, 0)$, and therefore $x = \max(Bs, 0)$, where the max operation is performed for each row.

An image under multiple light sources is a combination of single-distant-light images, $x = \sum_k \max(Bs_k, 0)$. As can be seen from the equation, the set of such images under all illuminations forms a convex cone [8]. However, the dimension of the subspace in which the cone lies can be as large as the number of pixels D in general, which is inconsistent with the empirical observations.
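As a quick numerical illustration of this point (a toy example of my own, not part of the cited analyses), one can draw a random Lambertian "object" B, render images $x = \max(Bs_k, 0)$ under many random light directions, and inspect the singular value spectrum; most of the energy is captured by the first few components:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_lights = 500, 200                      # number of pixels and light directions
B = rng.standard_normal((D, 3))             # albedo-scaled surface normals (toy model)
S = rng.standard_normal((3, n_lights))
S /= np.linalg.norm(S, axis=0)              # unit light-source directions
X = np.maximum(B @ S, 0.0)                  # images with attached shadows
s = np.linalg.svd(X, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print("energy captured by the first 5 components:", energy[4])
```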

Theoretical explanations of the low dimensionality have been offered by [8, 62, 61, 4] using spherical harmonics. Although the mathematics of the model is rather involved, the main idea can be summarized as follows. The interaction between a distant light source and a Lambertian surface is a convolution on the unit sphere. If we adopt a frequency-domain

representation of the light distributions and the reflectance function, then the interaction<br />

of an arbitrary light distribution with the surface can be computed by multiplication of<br />

coefficients w.r.t. the spherical harmonics, analogous to the Fourier analysis on real lines.<br />

Since the max operation can be well approximated by convolution with a low-pass<br />

filter, the resultant set of all possible illuminations can be expressed using only a few (4

to 9) harmonic basis images. Figure 3.1 shows the analytically computed PCA from this<br />

model.<br />



Figure 3.1: The figure shows the first five principal components of a face, computed analytically from a 3D model (top) and a sphere (bottom). These images match well with the empirical principal components computed from a set of real images. The figure is adapted from [61].

In the following two subsections I introduce two well-known face databases and show<br />

the PCA results from the data to demonstrate the low-dimensionality of illumination-<br />

varying images.<br />

Yale Face Database<br />

It is often possible to acquire multi-view, multi-lighting images simultaneously with a spe-<br />

cial camera rig. The Yale face database and the Extended Yale face database [26] together<br />

consist of pictures of 38 subjects with 9 different poses and 45 different lighting conditions.<br />

The original image is gray-valued, is 640 × 480 in size, and includes redundant back-<br />

ground objects. I crop and align the face regions by manually choosing a few feature points<br />

(center of eyes and mouth, and nose tip) for each image. The cropped images are resized<br />

to 32 × 28 pixels (D = 896) and normalized to have a unit variance. Figure 3.2 shows the<br />

first 10 subjects and all 9 poses under a fixed illumination condition.<br />




Figure 3.2: Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed<br />

illumination condition.<br />



To compute subspaces, I use all 45 illumination conditions of a person under a fixed pose, as depicted in Figure 3.3. The m-dimensional orthonormal basis is computed from the Singular Value Decomposition (SVD) of this set of data, as follows.

Let $X = [x_1, \ldots, x_N]$ be the D × N data matrix pertaining to all illuminations of a person at a fixed pose, and let $X = USV'$ be the SVD of the data, where $U'U = UU' = I_D$, $V'V = VV' = I_N$, and S is a D × N matrix whose elements are zero except on the diagonal $\mathrm{diag}(S) = [s_1, s_2, \ldots, s_{\min(D,N)}]'$. If the singular values are ordered as $s_1 \ge s_2 \ge \ldots \ge s_{\min(D,N)}$, then the m-dimensional basis for this set is given by the first m columns of U. In the coming chapters I will use a range of values for m in the experiments. The SVD procedure above is the same as the PCA procedure except that the mean is not removed from the data. The role of the mean will be discussed further in Chapter 6. When the mean is ignored, the PCA eigenvalues are related to the singular values by $\lambda_1 = s_1^2$, $\lambda_2 = s_2^2$, and so on.
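The procedure is a few lines of NumPy (a straightforward transcription of the description above; the function name is mine):

```python
import numpy as np

def subspace_basis(X, m):
    """X: (D, N) matrix of N vectorized images. Returns a (D, m) orthonormal basis."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]                          # first m left singular vectors

X = np.random.randn(896, 45)                 # stand-in for 45 images, D = 32 x 28 = 896
Y = subspace_basis(X, m=5)
print(np.allclose(Y.T @ Y, np.eye(5)))       # orthonormality check: True
```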

A few of the orthonormal bases computed from the procedure are shown in Figure 3.4,<br />

along with the spectrum of the singular values.<br />

CMU-PIE Database<br />

The CMU-PIE database is another multi-view, multi-lighting face database acquired with a<br />

camera rig. The database [72] consists of images from 68 subjects under 13 different poses<br />

and 43 different lighting conditions. Among the 43 lighting conditions I use 21 lighting<br />

conditions which have full pose variations.<br />

The original image is color-valued, is 640 × 480 in size, and includes redundant back-<br />

ground objects. I crop and align the face regions by manually choosing a few feature points<br />

(center of eyes and mouth, and nose tip) for each image. Among the 13 poses I use only 7 poses and discard the 6 poses that are close to a profile view. This is done to facilitate

the cropping process. The cropped images are resized to 24 × 21 pixels (D = 504) and<br />

normalized to have a unit variance. Figure 3.5 shows the first 10 subjects at 7 poses under<br />



a fixed illumination condition.<br />

To compute subspaces, I use all 21 illumination conditions of a person at a fixed pose<br />

(refer to Figure 3.6). The m-dimensional orthonormal basis is computed from the Singular

Value Decomposition (SVD) of this set of data similarly to the Yale Face database.<br />

A few of the orthonormal bases computed from the database are shown in Figure 3.7,<br />

along with the spectrum of the singular values.<br />



Figure 3.3: Yale Face Database: all illumination conditions of a person at a fixed pose used<br />

to compute the corresponding illumination subspace.<br />


Figure 3.4: Yale Face Database: examples of basis images and (cumulative) singular values.<br />




Figure 3.5: CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed<br />

illumination condition.<br />



Figure 3.6: CMU-PIE Database: all illumination conditions of a person at a fixed pose used<br />

to compute the corresponding illumination subspace.<br />


Figure 3.7: CMU-PIE Database: examples of basis images and (cumulative) singular values.<br />



3.3 Pose subspaces in multi-view images<br />

We have seen that illumination change can be approximated well by linear subspace. The<br />

change in the pose of the object and/or the camera, however, is harder to analyze without<br />

knowing the 3D geometry of the object. Since we are often given 2D image data only and<br />

do not know the underlying geometry, it is useful to construct image-based models of an<br />

object or a face under pose changes.<br />

A popular multi-pose representation of images is the light-field representation, which models the radiance of light as a function of the 5D pose of the observer [29, 30, 95].

Theoretically, the light-field model provides pose-invariant recognition of images taken<br />

with arbitrary camera and pose when the illumination condition is fixed. Zhou et al. ex-<br />

tended the light-field model to a bilinear model which allows simultaneous change of pose<br />

and illumination [95]. An alternative method is proposed in [34] which uses a generative<br />

model of a warped illumination subspace. Image variations due to illumination change are<br />

accounted for by a low-dimensional linear subspace, whereas variations due to pose change<br />

are approximated by a geometric warping of images in the subspace.<br />

The studies above indicate the nonlinearity of the pose-varying images in general. How-<br />

ever, the dimensionality of the images as a nonlinear manifold is rather small, since there<br />

are at most 6 degrees of freedom for the pose space (=E(3)). Therefore, when the range<br />

of the pose variation is limited, the nonlinear structure can be contained inside a low-<br />

dimensional subspace, and the nonlinear submanifolds can be distinguished by their en-<br />

closing subspaces. Although a general method of adopting nonlinearity is possible and<br />

will be discussed in Section 5.2.4, here I use linear subspaces as the simplest model of the<br />

pose variations. The approximation by a subspace is demonstrated with the ETH-80 object<br />

database in the following subsection.<br />




Figure 3.8: ETH-80 Database: all categories and objects at a fixed pose.<br />

ETH-80 Database<br />

The ETH-80 [48] is an object database designed for object categorization test under varying<br />

poses. The database consists of pictures of 8 object categories: ‘apple’, ‘pear’, ‘tomato’, ‘cow’, ‘dog’, ‘horse’, ‘cup’, and ‘car’. Each category has 10 object instances that belong to the category, and each object is recorded under 41 different poses.

There are several versions of the data. The one I use is color-valued and 256 × 256 in<br />

size. The images are resized to 32 × 32 pixels (D = 1024) and normalized to have a unit<br />

variance. Figure 3.8 shows the 8 categories under 10 poses at a fixed viewpoint.<br />

From the spectrum I can determine how good the m-dimensional approximation is by<br />

looking at the value at m. For example, if the 5-th cumulative singular value is 0.92, it<br />

means that the 5-dimensional subspace captures 92 percent of the total variations of the<br />

data (including the bias of the mean).<br />



To compute subspaces, I use all 41 different poses of an object from a category as shown<br />

in Figure 3.9. The m-dimensional orthonormal basis is computed from the SVD of this set of

data. A few of the orthonormal bases computed from the database are shown in Figure 3.10<br />

along with the spectrum of the singular values.<br />



Figure 3.9: ETH-80 Database: all poses of an object from a category used to compute the<br />

corresponding pose subspace of the object.<br />


Figure 3.10: ETH-80 Database: examples of basis images and (cumulative) singular values.<br />



3.4 Video sequences of human motions<br />

Suppose we have a video sequence of a person performing an action. The sequence is more<br />

than just a set of images because of the temporal information contained in the sequence.<br />

To capture both the appearance and the temporal dynamics, we often use linear dynamical<br />

systems in modeling the sequence. In particular, the Auto-Regressive Moving Average<br />

(ARMA) has been used to model moving human bodies or textures in computer vision<br />

[19, 80]. The ARMA model can be described as follows.<br />

Let y(t) be the D × 1 observation vector and x(t) the d × 1 internal state vector, for t = 1, ..., T. Then the states evolve according to the linear time-invariant dynamics

$$x(t + 1) = Ax(t) + v(t) \tag{3.1}$$

$$y(t) = Cx(t) + w(t), \tag{3.2}$$

where v(t) and w(t) are additive noise terms.

A probabilistic version of the model assumes that the observations, the states, and the noise are Gaussian distributed, with $v(t) \sim \mathcal{N}(0, Q)$ and $w(t) \sim \mathcal{N}(0, R)$.

This allows us to use statistical techniques such as Maximum Likelihood Estimation to<br />

infer the states and to estimate parameters only from the observed data y(1), ..., y(T ). The<br />

estimation problem is known as the system identification problem and a good textbook on<br />

the topic is [52]. The estimation method I use in this thesis is one of the simplest and is described in the next section.

For now, let us go back to the original question of comparing image sequences using the ARMA model. If we have the parameters $A_i, C_i$ for each sequence $i = 1, \ldots, N$ in the data, the simplest method of comparing two such sequences is to measure the sum of squared differences

$$d_{i,j}^2 = \|A_i - A_j\|_F^2 + \|C_i - C_j\|_F^2, \tag{3.3}$$

ignoring the noise statistics, which are of less importance.

However, it is well-known that the parameters are not unique. If we change the basis<br />

for the state variables to define new state variables by ˆx = Gx, where G is any d × d<br />

nonsingular matrix, then the same system can be described with different coefficients such<br />

that<br />

$$\hat{x}(t + 1) = \hat{A}\hat{x}(t) + \hat{v}(t)$$

$$\hat{y}(t) = \hat{C}\hat{x}(t) + \hat{w}(t),$$

where $\hat{A} = GAG^{-1}$, $\hat{C} = CG^{-1}$, $\hat{v} = Gv$, and $\hat{w} = w$. Unfortunately, the simple

distance (3.3) is not invariant under the coordinate change. I will defer the discussion of<br />

other invariant distances for dynamical systems to the next two chapters, and proceed with<br />

the basic idea in this chapter.<br />

One of the coordinate-independent representations of the system is given by the infinite observability matrix [17]

$$O_{C,A} = \begin{pmatrix} C \\ CA \\ CA^2 \\ \vdots \end{pmatrix}, \tag{3.4}$$

which is the row-wise concatenation of the matrices $CA^n$ for $n = 0, 1, 2, \ldots$. Note that after


the coordinate change $\hat{x} = Gx$, the new observability matrix becomes

$$O_{\hat{C},\hat{A}} = \begin{pmatrix} CG^{-1} \\ CAG^{-1} \\ CA^2G^{-1} \\ \vdots \end{pmatrix} = O_{C,A}\, G^{-1}, \tag{3.5}$$

which is the original observability matrix multiplied by $G^{-1}$ on the right. This suggests that if we consider the linear span of the column vectors of O, instead of the matrix O itself, to represent the dynamical system, then the representation is clearly invariant to the choice

(infinite-dimensional) space of all possible ARMA models of the same size, each model<br />

of a sequence occupies the subspace spanned by the columns of O. In the next section I<br />

will introduce the IXMAS database and explain how I preprocess the data to compute this<br />

linear structure.<br />

IXMAS Database<br />

The INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multiview video database<br />

for view-invariant human action recognition [89]. The database consists of 11 daily-life motions (‘check watch’, ‘cross arms’, ‘scratch head’, ‘sit down’, ‘get up’, ‘turn around’, ‘walk’, ‘wave hand’, ‘punch’, ‘kick’, ‘pick up’), performed by 11 actors and repeated 3 times. The motions are recorded by 5 calibrated and synchronized cameras at 23 fps at 390 × 29 resolution. Figure 3.11 shows sample sequences of an actor performing the 11 actions at a fixed view.

The authors of [89] propose further processing of the database. The appearances of<br />

the actors such as clothes are irrelevant to actions, and therefore image silhouettes are<br />



computed to extract shapes from each camera. These silhouettes are combined to carve<br />

out the 3D visual hull of the actor represented by 3D occupancy volume data V (x, y, z) as<br />

shown in Figure 3.12.<br />

However, the actions performed by different actors still have a lot of variabilities as<br />

demonstrated in Figure 3.13.<br />

The variabilities irrelevant to action recognition include the following. Firstly, the actors have different heights and body shapes, and therefore the volumes have to be rescaled along each axis. Secondly, the actors freely choose their position and orientation, and therefore the volumes have to be centered and reoriented. The rescaling and centering can be done by computing the center of mass and second-order moments of the volumes and then standardizing the volumes. However, the orientation variability requires further processing. The authors suggest converting the Cartesian coordinate representation V(x, y, z) to the cylindrical coordinate representation V(r, θ, z) and then performing a 1D circular Fourier Transform along the θ axis to get FFT(V(r, θ, z)). By taking only the magnitude of the transform, the resultant feature |FFT(V(r, θ, z))| becomes rotation-invariant around the z-axis. The resultant feature of the 3D volume is a D = 16384 = 32³/2-dimensional vector. Note this FFT is computed per frame and is not to be confused with a temporal FFT along the frames.

Figure 3.14 shows a sample snapshot of the processed features.<br />
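A minimal sketch of this rotation-invariant feature, assuming the occupancy volume has already been resampled onto an (r, θ, z) grid (the resampling step is omitted, and keeping half of the spectrum is my reading of the stated dimensionality D = 32³/2):

```python
import numpy as np

def rotation_invariant_feature(V_cyl):
    """V_cyl: occupancy volume on an (r, theta, z) grid of shape (R, Theta, Z)."""
    # A rotation about z is a circular shift along theta, which leaves the FFT magnitude unchanged.
    F = np.abs(np.fft.fft(V_cyl, axis=1))      # 1D circular FFT along the theta axis
    F = F[:, :V_cyl.shape[1] // 2, :]          # half the spectrum (conjugate symmetry)
    return F.ravel()

V_cyl = np.random.rand(32, 32, 32)             # toy stand-in for a 32^3 volume
print(rotation_invariant_feature(V_cyl).shape) # (16384,) = 32^3 / 2
```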

ARMA model of data<br />

Once the features are computed for each action, actor and frame, we can proceed to model<br />

the feature sequences using the ARMA model.<br />

I estimate the parameters using a fast approximate method based on the SVD of the observed data [19]. Let $USV' = [y(1), \ldots, y(T)]$ be the SVD of the data. Then the parameters C, A, and the states $x(1), \ldots, x(T)$ are sequentially estimated by

$$\tilde{C} = U, \qquad \tilde{x}(t) = \tilde{C}' y(t)$$

$$\tilde{A} = \arg\min_{A} \sum_{t=1}^{T-1} \|\tilde{x}(t + 1) - A\tilde{x}(t)\|^2.$$

I used d = 5 as the dimension of the state space.

The estimated $A_i$ and $C_i$ matrices for each sequence $i = 1, \ldots, N$ are used to form a finite observability matrix of size (Dd) × d:

$$O_i = [\,C_i' \ \ (C_iA_i)' \ \ \cdots \ \ (C_iA_i^{d-1})'\,]'.$$

A total of 363 = 11 (actions) × 3 (trials) × 11 (actors) observability matrices are computed as the final subspace representation of the database.
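The following NumPy sketch transcribes this estimation procedure (my own code, not that of [19]): it estimates C and A from a feature sequence, builds the finite observability matrix, and orthonormalizes its columns to obtain the subspace representation used later:

```python
import numpy as np

def arma_observability(Y, d=5):
    """Y: (D, T) feature sequence. Returns a (D*d, d) orthonormal basis of span(O_i)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                               # C_tilde = first d left singular vectors
    X = C.T @ Y                                # x_tilde(t) = C' y(t)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # least-squares fit of x(t+1) ~ A x(t)
    O = np.vstack([C @ np.linalg.matrix_power(A, n) for n in range(d)])
    Q, _ = np.linalg.qr(O)                     # orthonormal basis of the column span
    return Q
```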

3.5 Conclusion<br />

In this chapter I aimed to provide motivations for subspace representation with examples<br />

from image databases, which range from illumination-varying faces to video sequences<br />

modeled as dynamical systems. The procedures for computing subspaces from these databases<br />

were described.<br />

The goal of the subspace-based learning approach is to use this inherent linear structure

to emphasize the desired information and to de-emphasize the unwanted variations in the<br />

data. This approach translates to 1) illumination-invariant face recognition for the Yale<br />

Face and CMU-PIE databases, 2) pose-invariant object categorization with the ETH-80<br />

database, and 3) the video-based action recognition with the IXMAS database. However,<br />

I add a caveat that the invariant recognition problems above are different from the more<br />

general problem of recognizing a single test image, since at least a few test images are required to reliably compute the subspace.

In the next three chapters, I will use the computed subspaces from the databases to test<br />

various algorithms for subspace-based learning.<br />




Figure 3.11: IXMAS Database: video sequences of an actor performing 11 different actions<br />

viewed from a fixed camera.<br />



Figure 3.12: IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in the Cartesian coordinate system and later converted to the cylindrical coordinate system to apply the FFT.




Figure 3.13: IXMAS Database: the ‘kick’ action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.




Figure 3.14: IXMAS Database: cylindrical coordinate representation of the volume<br />

V (r, θ, z), and the corresponding 1D FFT feature abs(F F T (V (r, θ, z))), shown at a few<br />

values of θ.<br />



Chapter 4<br />

GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES

4.1 Introduction<br />

In the previous chapter, I discussed the examples of linear subspace structures found in<br />

the real-world data. In this chapter I introduce the Grassmann manifold as the common<br />

framework of subspace-based learning algorithms. While a subspace is certainly a linear<br />

space, the collection of linear subspaces is a totally different space of its own, which is<br />

known as the Grassmann manifold. The Grassmann manifold, named after the renowned<br />

mathematician Hermann Günther Grassmann (1809-1877), has long been known for its in-

triguing mathematical properties, and as an example of homogeneous spaces of Lie groups<br />

[86, 13]. However, its applications in computer science and engineering have appeared<br />

rather recently; in signal processing and control [74, 36, 6], numerical optimization [20]<br />

(and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78].<br />

Moreover, many works have used the subspace concept without explicitly relating their<br />

works to this mathematical object [92, 64, 24, 90, 85, 43, 3]. One of the goals of this thesis is to provide a unified view of the subspace-based algorithms in the framework of the

Grassmann manifold.<br />

In this chapter I define the Grassmann distance which provides a measure of (dis)similarity<br />

of subspaces, and review the known distances including the Arc-length, Projection, Binet-<br />

Cauchy, Max Corr, Min Corr, and Procrustes distances. Some of these distances have been<br />

studied in [20, 16], and I provide a more thorough analysis and proofs in this chapter. Fur-<br />

thermore, these distances will be used in conjunction with a k-NN algorithm to demonstrate<br />

their potential in classification tasks using the databases prepared in the previous chapter.

4.2 Stiefel and Grassmann manifolds<br />

In this section I introduce the Stiefel and the Grassmann manifolds by summarizing nec-<br />

essary definitions and properties of these manifolds from [28, 20, 16]. Although these<br />

manifolds are not linear spaces, I introduce these manifolds as subsets of Euclidean spaces<br />

and use matrix representations. This helps to understand the nonlinear spaces intuitively<br />

and also facilitates computations on these spaces.<br />

4.2.1 Stiefel manifold<br />

Let Y be a D × m matrix whose elements are real numbers. In optimization problems with<br />

the matrix variable Y , we often formulate the notion of normality by an orthonormality<br />

condition $Y'Y = I_m$.¹ This feasible set is neither linear nor convex, and in fact it is the Stiefel

manifold defined as follows:<br />

Definition 4.1. An m-frame is a set of m orthonormal vectors in R D (m ≤ D). The Stiefel<br />

manifold S(m, D) is the set of m-frames in R D .<br />

¹ Although the term ‘orthogonal’ is the more standard one for this condition, I use the term ‘orthonormal’ to clarify that each column of Y has unit length.



The Stiefel manifold S(m, D) is represented by the set of D × m matrices Y such that $Y'Y = I_m$. Therefore we can rewrite it as

$$S(m, D) = \{\, Y \in \mathbb{R}^{D \times m} \mid Y'Y = I_m \,\}.$$

There are D × m variables in Y and $\frac{1}{2}m(m + 1)$ independent conditions in the constraint $Y'Y = I_m$. Hence S(m, D) is an analytical manifold of dimension $Dm - \frac{1}{2}m(m + 1) = \frac{1}{2}m(2D - m - 1)$.

For m = 1, S(m, D) is the familiar unit sphere in $\mathbb{R}^D$, and for m = D, S(m, D) is the orthogonal group O(D) of D × D orthogonal matrices.

The Stiefel manifold can also be thought of as the quotient space<br />

S(m, D) = O(D)/O(D − m),<br />

under the right-multiplication by orthonormal matrices. To see this point, let X = [Y | Y ⊥ ] ∈<br />

O(D) be a representer of Y ∈ S(m, D), where the first m columns form the m-frame we<br />

care about and Y ⊥ is any D × (D − m) matrix such that Y Y ′ + Y ⊥ (Y ⊥ ) ′ = ID. Then<br />

the only subgroup of O(D) which leaves the m-frame unchanged, is the set of the block-<br />

diagonal matrix diag(Im, RD−m) where RD−m is any matrix in O(D − m). That is, the<br />

m-frame of X after the right multiplication

$$X \begin{pmatrix} I_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp}] \begin{pmatrix} I_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp} R_{D-m}]$$

remains the same as the m-frame of X.


4.2.2 Grassmann manifold<br />

The Grassmann manifold is a mathematical object with several similarities to the Stiefel<br />

manifold. In optimization problems with a matrix variable Y , we occasionally have a cost<br />

function which is affected only by span(Y) – the linear subspace spanned by the column

vectors of Y – and not by the specific values of Y . Such a condition leads to the concept of<br />

the Grassmann manifold defined as follows:<br />

Definition 4.2. The Grassmann manifold G(m, D) is the set of m-dimensional linear sub-<br />

spaces of the R D .<br />

For a Euclidean representation of the manifold, consider the space $\mathbb{R}^{(0)}_{D,m}$ of all D × m matrices $Y \in \mathbb{R}^{D \times m}$ of full rank m, and consider the group of transformations $Y \to YL$, where L is any nonsingular m × m matrix. The group defines an equivalence relation in $\mathbb{R}^{(0)}_{D,m}$: two elements $Y_1, Y_2 \in \mathbb{R}^{(0)}_{D,m}$ are the same if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$. Hence the equivalence classes of $\mathbb{R}^{(0)}_{D,m}$ are in one-to-one correspondence with the points of the Grassmann manifold G(m, D), and G(m, D) is thought of as the quotient space

$$G(m, D) = \mathbb{R}^{(0)}_{D,m} \,/\, \mathbb{R}^{(0)}_{m,m}.$$

G(m, D) is an analytical manifold of dimension $Dm - m^2 = m(D - m)$, since for each Y, regarded as a point in $\mathbb{R}^{Dm}$, the set of all elements YL in the equivalence class is a surface in $\mathbb{R}^{Dm}$ of dimension $m^2$.

The special case m = 1 is called the real projective space RP D−1 which consists of all<br />

lines through the origin.<br />

The Grassmann manifold can also be thought of as the quotient space

$$G(m, D) = O(D) \,/\, \big(O(m) \times O(D - m)\big),$$

under the right-multiplication by orthonormal matrices. To see this, let $X = [Y \mid Y^{\perp}] \in O(D)$ be a representer of $Y \in G(m, D)$, where we only care about the span of the first m columns and $Y^{\perp}$ is any D × (D − m) matrix such that $YY' + Y^{\perp}(Y^{\perp})' = I_D$. Then the only subgroup of O(D) which leaves the span of the first m columns unchanged is the set of block-diagonal matrices $\mathrm{diag}(R_m, R_{D-m})$, where $R_m$ and $R_{D-m}$ are any two matrices in O(m) and O(D − m) respectively. That is, the span of the first m columns of X after the right multiplication

$$X \begin{pmatrix} R_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [Y \mid Y^{\perp}] \begin{pmatrix} R_m & 0 \\ 0 & R_{D-m} \end{pmatrix} = [YR_m \mid Y^{\perp} R_{D-m}]$$

is the same as the span of the first m columns of X.

From the quotient space representations, we see that G(m, D) = S(m, D)/O(m). This<br />

is the representation I use throughout the thesis. To summarize, an element of G(m, D) is<br />

represented by an orthonormal matrix Y ∈ R D×m such that Y ′ Y = Im, with the equiva-<br />

lence relation:<br />

Definition 4.3. $Y_1 \cong Y_2$ if and only if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$.

We can also write the equivalence relation as<br />

Corollary 4.4. $Y_1 \cong Y_2$ if and only if $Y_1 = Y_2 R_m$ for some orthonormal matrix $R_m \in O(m)$.

In this thesis I use more general geometry than the Riemannian geometry of the Grass-<br />

mann manifold, and do not discuss this subject further. I refer the interested readers to<br />

[86, 13, 20, 16, 2] for further reading.



4.2.3 Principal angles and canonical correlations<br />

A canonical distance between two subspaces is the Riemannian distance, which is the<br />

length of the geodesic path connecting the two corresponding points on the Grassmann<br />

manifold. However, there is a more intuitive and computationally efficient way of defining<br />

distances using the principal angles [28]. I define the principal angles / canonical correla-<br />

tions as follows:<br />

Definition 4.5. Let Y1 and Y2 be two orthonormal matrices of size D by m. The princi-<br />

pal angles 0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2 between two subspaces span(Y1) and span(Y2), are<br />

defined recursively by<br />

$$\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)}\ \max_{v_k \in \mathrm{span}(Y_2)}\ u_k' v_k, \quad \text{subject to}$$

$$u_k'u_k = 1,\ v_k'v_k = 1, \qquad u_k'u_i = 0,\ v_k'v_i = 0\ \ (i = 1, \ldots, k - 1).$$

The first principal angle θ1 is the smallest angle between a pair of unit vectors each<br />

from the two subspaces. The cosine of the principal angle is the first canonical correlation<br />

[39]. The k-th principal angle and canonical correlation are defined recursively. It is known<br />

[91, 20] that the principal angles are related to the geodesic (=arc length) distance as shown<br />

in Figure 4.1 by<br />

$$d^2_{\mathrm{Arc}}(Y_1, Y_2) = \sum_i \theta_i^2. \tag{4.1}$$

To compute the principal angles, we need not directly solve the maximization problem. Instead, the principal angles can be computed from the Singular Value Decomposition (SVD) of the product of the two matrices $Y_1'Y_2$,

$$Y_1'Y_2 = USV', \tag{4.2}$$




Figure 4.1: Principal angles and Grassmann distances. Let span(Yi) and span(Yj) be two<br />

subspaces in the Euclidean space R D on the left. The distance between two subspaces<br />

span(Yi) and span(Yj) can be measured using the principal angles θ = [θ1, ... , θm] ′ . In the<br />

Grassmann manifold viewpoint, the subspaces span(Yi) and span(Yj) are considered as<br />

two points on the manifold G(m, D), whose Riemannian distance is related to the principal<br />

angles by $d(Y_i, Y_j) = \|\theta\|_2$. Various distances can be defined based on the principal angles.

where U = [u1 ... um], V = [v1 ... vm], and S is the diagonal matrix S = diag(cos θ1 ... cos θm).<br />

The proof can be found in p.604 of [28]. The principal angles form a non-decreasing se-<br />

quence<br />

0 ≤ θ1 ≤ · · · ≤ θm ≤ π/2,<br />

and consequently the canonical correlations form a non-increasing sequence<br />

1 ≥ cos θ1 ≥ · · · ≥ cos θm ≥ 0.<br />

Although the definition of principal angles can be extended to the cases where Y1 and<br />

Y2 have different number of columns, I assume Y1 and Y2 have the same size D by m<br />

throughout this thesis.<br />
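Computationally, the recipe in (4.2) is a one-line SVD; here is a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Y1, Y2: (D, m) orthonormal bases. Returns theta_1 <= ... <= theta_m."""
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)   # canonical correlations cos(theta_i)
    s = np.clip(s, -1.0, 1.0)                        # guard against round-off
    return np.arccos(s)                              # singular values descend, so angles ascend

# Arc-length distance of (4.1): d_Arc = np.linalg.norm(principal_angles(Y1, Y2))
```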

4.3 Grassmann distances for subspaces<br />

In this section I introduce a few subspace distances which appeared in the literature, and<br />

give analyses of the distances in terms of the principal angles.<br />



I use the term distance for any assignment of nonnegative values to each pair of points in<br />

the data space. A valid metric is, however, a distance that satisfies the additional axioms in<br />

Definition 2.9. Furthermore, a distance (or a metric) between subspaces has to be invariant<br />

under different basis representations. A distance that satisfies this condition is referred to<br />

as the Grassmann distance (or metric):<br />

Definition 4.6. Let d : R D×m × R D×m → R be a distance function. The function d is a<br />

Grassmann distance if d(Y1, Y2) = d(Y1R1, Y2R2), ∀R1, R2 ∈ O(m).<br />

4.3.1 Projection distance<br />

The Projection distance is defined as

$$d_{\mathrm{Proj}}(Y_1, Y_2) = \left( \sum_{i=1}^{m} \sin^2\theta_i \right)^{1/2} = \left( m - \sum_{i=1}^{m} \cos^2\theta_i \right)^{1/2}, \tag{4.3}$$

which is the 2-norm of the sines of the principal angles [20, 85].

An interesting property of this metric is that it can be computed from only the product $Y_1'Y_2$, whose importance will be revealed in the next chapter. From the relationship between the principal angles and the SVD of $Y_1'Y_2$ in (4.2) we get

$$d^2_{\mathrm{Proj}}(Y_1, Y_2) = m - \sum_{i=1}^{m} \cos^2\theta_i = m - \|Y_1'Y_2\|_F^2 = 2^{-1}\|Y_1Y_1' - Y_2Y_2'\|_F^2, \tag{4.4}$$

where $\|\cdot\|_F$ is the matrix Frobenius norm

$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2, \quad A \in \mathbb{R}^{m \times n}.$$

The Projection distance is a Grassmann distance since it is invariant to different representations, which can be easily seen from (4.4). Furthermore, the distance is a metric:



Lemma 4.7. The Projection distance dProj is a Grassmann metric.<br />

Proof. The nonnegativity, symmetry, and triangle inequality follow naturally from $\|\cdot\|_F$ being a matrix norm. The remaining condition to be shown is the necessary and sufficient condition

$$\|Y_1Y_1' - Y_2Y_2'\|_F = 0 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2).$$

Since $\|\cdot\|_F$ is a matrix norm, we have $\|Y_1Y_1' - Y_2Y_2'\|_F = 0 \iff Y_1Y_1' = Y_2Y_2'$. The proof of the next step, $Y_1Y_1' = Y_2Y_2' \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, is also simple and is given in the proof of Theorem 5.2.

4.3.2 Binet-Cauchy distance<br />

The Binet-Cauchy distance is defined as<br />

$$d_{\mathrm{BC}}(Y_1, Y_2) = \Big( 1 - \prod_i \cos^2\theta_i \Big)^{1/2}, \tag{4.5}$$

which involves the product of canonical correlations [90, 83]. The distance can also be<br />

computed from only the product $Y_1'Y_2$. From the relationship between the principal angles and the SVD of $Y_1'Y_2$ in (4.2) we get

$$d^2_{\mathrm{BC}}(Y_1, Y_2) = 1 - \prod_i \cos^2\theta_i = 1 - \det(Y_1'Y_2)^2. \tag{4.6}$$

The Binet-Cauchy distance is also invariant under different representations, and further-<br />

more is a metric:<br />

Lemma 4.8. The Binet-Cauchy distance dBC is a Grassmann metric.<br />

The proof of the lemma is trivial after I prove Theorem 5.4 later.<br />



There is an interesting relationship between this distance and the Martin distance in<br />

control theory [55]. Martin proposed a metric between two ARMA processes with the<br />

cepstrum of the models, which was later shown to be of the following form [17]:<br />

$$d_M(O_1, O_2)^2 = -\log \prod_{i=1}^{m} \cos^2\theta_i,$$

where $O_1$ and $O_2$ are the infinite observability matrices explained in the previous chapter:

$$O_1 = \begin{pmatrix} C_1 \\ C_1A_1 \\ C_1A_1^2 \\ \vdots \end{pmatrix}, \quad \text{and} \quad O_2 = \begin{pmatrix} C_2 \\ C_2A_2 \\ C_2A_2^2 \\ \vdots \end{pmatrix}.$$

Consequently, the Binet-Cauchy distance is directly related to the Martin distance by the following:

$$d_{\mathrm{BC}}(\mathrm{span}(O_1), \mathrm{span}(O_2)) = \exp\Big(-\tfrac{1}{2}\, d_M(O_1, O_2)^2\Big). \tag{4.7}$$

4.3.3 Max Correlation

The Max Correlation distance is defined as<br />

$$d_{\mathrm{MaxCor}}(Y_1, Y_2) = \left(1 - \cos^2\theta_1\right)^{1/2} = \sin\theta_1, \tag{4.8}$$

which is based on the largest canonical correlation cos θ1 (or the smallest principal angle<br />

θ1). The max correlation is an intuitive measure between two subspaces which was used<br />

often in previous works [92, 64, 24]. It is a Grassmann distance. However, it is not a metric<br />

and therefore has some limitations. For example, it is possible for two distinct subspaces<br />

span(Y1) and span(Y2) to have a zero distance $d_{\mathrm{MaxCor}} = 0$ if they intersect at a point other than the origin.

4.3.4 Min Correlation<br />

The min correlation distance is defined as<br />

$$d_{\mathrm{MinCor}}(Y_1, Y_2) = \left(1 - \cos^2\theta_m\right)^{1/2} = \sin\theta_m. \tag{4.9}$$

The min correlation is conceptually the opposite of the max correlation, in that it is based<br />

on the smallest canonical correlation (or the largest principal angle). This distance is also<br />

closely related to the definition of the Projection distance. Previously I rewrote the Projection distance as $d_{\mathrm{Proj}} = 2^{-1/2}\|Y_1Y_1' - Y_2Y_2'\|_F$. The min correlation can be similarly written as ([20])

$$d_{\mathrm{MinCor}} = \|Y_1Y_1' - Y_2Y_2'\|_2, \tag{4.10}$$

where $\|\cdot\|_2$ is the matrix 2-norm:

$$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}, \quad A \in \mathbb{R}^{m \times n}.$$

The proof can be found in p. 75 of [28].

This distance is also a metric:

Lemma 4.9. The Min Correlation distance dMinCor is a Grassmann metric.<br />

The proof is almost the same as the proof for the Projection distance, with $\|\cdot\|_F$ replaced by $\|\cdot\|_2$, and is omitted.



4.3.5 Procrustes distance<br />

The Procrustes distance is defined as

$$d_{\mathrm{Proc1}}(Y_1, Y_2) = 2\left(\sum_{i=1}^{m} \sin^2(\theta_i/2)\right)^{1/2}, \tag{4.11}$$

which is (up to the factor of 2) the vector 2-norm of $[\sin(\theta_1/2), \ldots, \sin(\theta_m/2)]$. There is an alternative definition.

The Procrustes distance is the minimum Euclidean distance between different representa-<br />

tions of two subspaces span(Y1) and span(Y2) ([20, 16]):<br />

$$d_{\mathrm{Proc1}}(Y_1, Y_2) = \min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1U - Y_2V\|_F, \tag{4.12}$$

where U and V are from (4.2). Let us first check that the equation above is true.

Proof. First note that

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| = \min_{R_1,R_2 \in O(m)} \|Y_1 - Y_2R_2R_1'\| = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|, \tag{4.13}$$

for $R_1, R_2, Q \in O(m)$. This also holds for $\|\cdot\|_2$. Using this equality, we have

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F^2 = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|_F^2 = \min_{Q} \mathrm{tr}(Y_1'Y_1 + Y_2'Y_2 - Y_1'Y_2Q - Q'Y_2'Y_1) = 2m - 2\max_{Q} \mathrm{tr}(Y_1'Y_2Q).$$

However, $\mathrm{tr}\, Y_1'Y_2Q = \mathrm{tr}\, USV'Q = \mathrm{tr}\, ST$, where $T = V'QU$ is another orthonormal matrix. Since S is diagonal, $\mathrm{tr}\, ST = \sum_i S_{ii}T_{ii} \le \sum_i S_{ii}$, and the maximum is achieved for $T = I_m$, or equivalently $Q = VU'$. Hence

$$\min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1 - Y_2VU'\|_F = \|Y_1U - Y_2V\|_F.$$

Finally, let us prove the equivalence of the two definitions (4.11) and (4.12).

Lemma 4.10.
$$\|Y_1U - Y_2V\|_F = 2\left(\sum_{i=1}^{m} \sin^2(\theta_i/2)\right)^{1/2}$$

Proof. Left-multiplying $Y_1U - Y_2V$ by $U'Y_1'$ gives

$$\|Y_1U - Y_2V\|_F = \|U'Y_1'Y_1U - U'Y_1'Y_2V\|_F = \|I_m - S\|_F,$$

since the norm does not change under the multiplication with the orthonormal matrix $U'Y_1'$. Since

$$\|I_m - S\|_F = \left(\sum_i (1 - \cos\theta_i)^2\right)^{1/2} = \left(\sum_i \big(2\sin^2(\theta_i/2)\big)^2\right)^{1/2},$$

we have the desired result.

The Procrustes distance is also called the chordal distance [20]. The author of [20] also suggests another version of the Procrustes distance using the matrix 2-norm:

$$d_{\mathrm{Proc2}}(Y_1, Y_2) = \|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2). \tag{4.14}$$

Let us check the validity of this definition:

Proof. Left-multiplying $Y_1U - Y_2V$ by $U'Y_1'$ gives

$$\|Y_1U - Y_2V\|_2 = \|U'Y_1'Y_1U - U'Y_1'Y_2V\|_2 = \|I_m - S\|_2.$$


From the definition of the matrix 2-norm, we have

$$\|I_m - S\|_2 = \max_{\|x\|=1} \|(I_m - S)x\|_2 = \max_{\|x\|=1} \left(\sum_i (1 - \cos\theta_i)^2 x_i^2\right)^{1/2} = \max_{\|x\|=1} \left(\sum_i \big(2\sin^2(\theta_i/2)\big)^2 x_i^2\right)^{1/2}.$$

Since $\sin^2(\theta_1/2) \le \ldots \le \sin^2(\theta_m/2)$, the sum is maximized for $(x_1, \ldots, x_m) = (0, \ldots, 0, 1)$, and therefore

$$\|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2).$$

Note that this version of the Procrustes distance has the immediate relationship with the<br />

min correlation distance:<br />

$$d^2_{\mathrm{MinCor}}(Y_1, Y_2) = \sin^2\theta_m = 1 - \big(1 - 2\sin^2(\theta_m/2)\big)^2 = 1 - \Big(1 - \tfrac{1}{2}\, d^2_{\mathrm{Proc2}}(Y_1, Y_2)\Big)^2. \tag{4.15}$$

Since the function $f(x) = \big(1 - (1 - x^2/2)^2\big)^{1/2}$ is a non-decreasing transform of the distance for $0 \le x \le 2$, the two distances are expected to behave similarly although not

exactly in the same manner.<br />

By definition, both versions of the Procrustes distances are invariant under different<br />

representations and furthermore are valid metrics:<br />

Lemma 4.11. The Procrustes distances dProc1 and dProc2 are Grassmann metrics.<br />

Proof. Nonnegativity and symmetry are immediate. For the triangle inequality, let us use the


equality (4.13) to show that

$$d_{\mathrm{Proc}}(Y_1, Y_2) + d_{\mathrm{Proc}}(Y_2, Y_3) = \min_{R_1,R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| + \min_{R_2,R_3 \in O(m)} \|Y_2R_2 - Y_3R_3\|$$
$$= \min_{Q_1 \in O(m)} \|Y_1Q_1 - Y_2\| + \min_{Q_3 \in O(m)} \|Y_2 - Y_3Q_3\|$$
$$= \min_{Q_1,Q_3 \in O(m)} \big\{ \|Y_1Q_1 - Y_2\| + \|Y_2 - Y_3Q_3\| \big\}$$
$$\ge \min_{Q_1,Q_3 \in O(m)} \|Y_1Q_1 - Y_3Q_3\| = d_{\mathrm{Proc}}(Y_1, Y_3).$$

The remaining condition to show is the necessary and sufficient condition

$$\|Y_1U - Y_2V\| = 0 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2).$$

Since $\|\cdot\|$ is a matrix norm, the equality follows:

$$\|Y_1U - Y_2V\| = 0 \iff Y_1U = Y_2V.$$

The proof of $\mathrm{span}(Y_1) = \mathrm{span}(Y_2) \iff Y_1U = Y_2V$ is similar to the case of the Projection distance and is omitted.

4.3.6 Comparison of the distances<br />

Table 4.1 summarizes the distances introduced so far. When these distances are used for a<br />

learning task, the choice of the most appropriate distance for the task depends on several<br />

factors.<br />

Table 4.1: Summary of the Grassmann distances. The distances can be defined as simple functions of both the basis Y and the principal angles θi, except for the Arc-length, which involves matrix exponentials.

                  Arc Length       Projection                     Binet-Cauchy
d²(Y₁, Y₂)        ·                2⁻¹ ‖Y₁Y₁′ − Y₂Y₂′‖²_F          1 − det(Y₁′Y₂)²
In terms of θ     Σᵢ θᵢ²           Σᵢ sin²θᵢ                      1 − Πᵢ cos²θᵢ
Is a metric?      Yes              Yes                            Yes

                  Max Corr          Min Corr               Procrustes 1        Procrustes 2
d²(Y₁, Y₂)        1 − ‖Y₁′Y₂‖₂²     ‖Y₁Y₁′ − Y₂Y₂′‖₂²       ‖Y₁U − Y₂V‖²_F      ‖Y₁U − Y₂V‖₂²
In terms of θ     sin²θ₁            sin²θₘ                 4 Σᵢ sin²(θᵢ/2)     4 sin²(θₘ/2)
Is a metric?      No                Yes                    Yes                 Yes

The first factor is the distribution of the data. Since the distances are defined from particular functions of the principal angles, the best distance depends highly on the probability distribution of the principal angles of the given data. For example, the max correlation dMaxCor uses only the smallest principal angle θ1, and therefore can serve as a robust distance when the subspaces are highly scattered and noisy, whereas the min correlation dMinCor uses only the largest principal angle θm, and therefore is not a sensible choice. On the other hand, when the subspaces are concentrated and have nonzero intersections, dMaxCor will be close to zero for most of the data, and dMinCor may be more discriminative in this case. The second Procrustes distance dProc2 is also expected to behave similarly to dMinCor since it also uses only the largest principal angle. Besides, dMinCor and dProc2 are directly related by (4.15). The Arc-length dArc, the Projection distance dProj, and the first Procrustes distance dProc1 use all the principal angles. Therefore they have intermediate characteristics between dMaxCor and dMinCor, and will be useful for a wider range of data distributions. The Binet-Cauchy distance dBC also uses all the principal angles, but it behaves similarly to dMinCor for scattered subspaces, since the distance attains its maximum value (= 1) if at least one of the principal angles is π/2, due to the product form of dBC.
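For reference, a small self-contained NumPy sketch (my own, mirroring Table 4.1) that computes the squared Grassmann distances of two orthonormal bases from their principal angles:

```python
import numpy as np

def grassmann_distances(Y1, Y2):
    """Squared Grassmann distances of Table 4.1 for two (D, m) orthonormal bases."""
    s = np.clip(np.linalg.svd(Y1.T @ Y2, compute_uv=False), -1.0, 1.0)
    theta = np.arccos(s)                               # ascending principal angles
    c2, s2 = s**2, 1.0 - s**2
    return {
        'arc':    np.sum(theta**2),                    # sum theta_i^2
        'proj':   np.sum(s2),                          # sum sin^2 theta_i
        'bc':     1.0 - np.prod(c2),                   # 1 - prod cos^2 theta_i
        'maxcor': s2[0],                               # sin^2 theta_1
        'mincor': s2[-1],                              # sin^2 theta_m
        'proc1':  4.0 * np.sum(np.sin(theta / 2)**2),  # 4 sum sin^2(theta_i / 2)
        'proc2':  4.0 * np.sin(theta[-1] / 2)**2,      # 4 sin^2(theta_m / 2)
    }
```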

The second criterion for choosing the distance is the degree of structure in the distance.

Without any structure a distance can be used only with a simple K-Nearest Neighbor (K-<br />

NN) algorithm for classification. When a distance has an extra structure such as triangle<br />

inequality, for example, we can speed up the nearest neighbor searches by estimating lower<br />



and upper limits of unknown distances [23]. From this point of view, the max correlation<br />

dMaxCor is not a metric and may not be used with more sophisticated algorithms unlike the<br />

rest of the distances.<br />

4.4 Experiments<br />

In this section I make empirical comparisons of the Grassmann distances discussed so far<br />

by using the distances for classification tasks with real image databases.

4.4.1 Experimental setting<br />

In this section I use the subspaces computed from the four databases Yale Face, CMU-PIE,<br />

ETH-80 and IXMAS, and compare the performances of simple 1NN classifiers using the<br />

Grassmann distances.<br />

The training and the test sets are prepared by N-fold cross validation as follows. For the<br />

Yale Face and the CMU-PIE databases, I keep the subspaces corresponding to a particular<br />

pose from all subjects for testing, and use the remaining subspaces corresponding to other<br />

poses for training. This results in 9-fold and 7-fold cross validation tests for Yale Face and<br />

CMU-PIE respectively. For the ETH-80 database, I keep the subspaces of 8 objects – one<br />

from each category – for testing, and use the remaining subspaces for training, which is a<br />

10-fold cross validation. For the IXMAS database, I keep all the subspaces corresponding<br />

to a particular person for testing, and use the subspaces of other people for training, which<br />

is an 11-fold cross validation test.

As mentioned in the previous chapter, the subspace representation of the databases ab-<br />

sorbs the variability due to illumination, pose, and the choice of the state space respectively.<br />

The cross validation setting of this thesis tests whether the remaining variability between
subspaces is indeed useful for recognizing subjects, objects, or actions, regardless of



different poses, object instances, and actors, respectively.<br />

4.4.2 Results and discussion<br />

Figures 4.2–4.5 show the classification rates. I can summarize the results as follows:<br />

1. The best performing distances are different for each database: dMaxCor for Yale Face,<br />

dProj, dProc1 for CMU-PIE, dArc, dProj, dProc1 for ETH-80, and dProj, dProc1 for IXMAS<br />

databases. I interpret this as certain distances being better suited for discriminating<br />

the subspaces of a particular database.<br />

2. With the exception of dMaxCor for Yale Face, the three distances dArc, dProj, dProc1 are<br />

consistently better than dBC, dMinCor, dProc2. This grouping of the distances is theoretically
predicted in Section 4.3.6.

3. The dMinCor and dProc2 show exactly the same rates, since the former is monotonically<br />

related to the latter by (4.15). However, the two distances will show different rates
when they are used with more sophisticated algorithms than the K-NN.

4. With the exception of Yale Face, the three distances perform much better than the Eu-<br />

clidean distance does, which demonstrates the potential advantages of the subspace-<br />

based approach.<br />

5. For CMU-PIE and IXMAS, the rates increase overall as the subspace dimension m<br />

increases. For Yale Face, the rates of dBC and dProc2 drop as m increases, whereas the

rates of other distances remain the same. For ETH-80, the rates seem to have different<br />

peaks for each distance. This means that the choice of the subspace dimensionality m<br />

can have significant effects on the recognition rates when the simple K-NN algorithm<br />

is employed. However, it will be shown in the later chapters that m has less effect

on more sophisticated algorithms that are able to adapt to the peculiarities of the data.<br />



4.5 Conclusion<br />

In this chapter I introduced the Grassmann manifold as the framework for subspace-based<br />

algorithms, and reviewed several well-known Grassmann distances for measuring the dis-<br />

similarity of subspaces. These Grassmann distances are analyzed and compared in terms of<br />

how they use the principal angles to define dissimilarity of subspaces. In the classification<br />

task of real image databases with the 1NN algorithm, the best performing distance varied
depending on the data used. This suggests that we need some prior knowledge of the data
to choose the best distance a priori. However, most of the Grassmann distances performed

better than the Euclidean distance in 1NN classification, and behaved in groups as predicted<br />

from the analysis. In the next chapter I will present a more important criterion for choosing<br />

a distance: whether a distance is associated with a positive definite kernel or not.<br />

[Figure 4.2 plots recognition rate (%) versus subspace dimension m for the eight distances; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
  dArc     87.81  80.29  83.15  87.81  84.23  84.95  80.65  81.36  82.44
  dProj    87.81  80.29  83.87  87.46  86.02  84.23  84.59  85.30  85.66
  dBC      87.81  80.29  83.15  87.46  83.51  83.15  76.34  75.63  74.19
  dMaxCor  87.81  89.96  92.11  91.04  91.04  91.04  91.76  92.11  91.76
  dMinCor  87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84
  dProc1   87.81  80.29  83.15  87.46  84.95  84.95  82.80  83.51  82.80
  dProc2   87.81  71.68  80.65  84.95  72.76  72.76  62.72  62.72  54.84

Figure 4.2: Yale Face Database: face recognition rates from 1NN classifier with the Grassmann
distances. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


[Figure 4.3 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96  61.96
  dArc     72.28  52.03  65.46  72.92  76.33  83.16  85.93  88.70  88.49
  dProj    72.28  52.03  64.39  75.48  78.46  82.94  84.43  86.57  87.85
  dBC      72.28  51.81  65.88  72.28  76.76  81.24  84.01  85.93  82.30
  dMaxCor  72.28  66.95  65.03  64.61  64.18  63.97  64.61  64.61  64.39
  dMinCor  72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52
  dProc1   72.28  52.03  65.25  73.13  77.83  83.37  86.57  88.27  88.27
  dProc2   72.28  48.83  69.94  64.82  72.28  72.92  72.28  69.08  66.52

Figure 4.3: CMU-PIE Database: face recognition rates from 1NN classifier with the Grassmann
distances.


[Figure 4.4 plots categorization rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  dEucl    85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47  85.47
  dArc     80.00  86.25  93.75  86.25  88.75  92.50  90.00  97.50  92.50
  dProj    80.00  85.00  95.00  92.50  88.75  92.50  90.00  96.25  95.00
  dBC      80.00  86.25  91.25  83.75  87.50  88.75  85.00  95.00  92.50
  dMaxCor  80.00  78.75  82.50  81.25  81.25  82.50  83.75  81.25  83.75
  dMinCor  80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25
  dProc1   80.00  86.25  93.75  86.25  88.75  92.50  90.00  96.25  91.25
  dProc2   80.00  86.25  90.00  82.50  80.00  77.50  82.50  81.25  81.25

Figure 4.4: ETH-80 Database: object categorization rates from 1NN classifier with the
Grassmann distances.


[Figure 4.5 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

            m=1    m=2    m=3    m=4    m=5
  dEucl    42.61  42.61  42.61  42.61  42.61
  dArc     61.82  68.79  76.06  76.67  78.18
  dProj    61.82  69.39  74.85  78.48  80.30
  dBC      61.82  67.58  71.82  73.03  73.03
  dMaxCor  61.82  65.76  63.03  67.58  66.97
  dMinCor  61.82  61.82  61.82  62.42  54.24
  dProc1   61.82  68.18  76.97  76.97  80.30
  dProc2   61.82  61.82  61.82  62.42  54.24

Figure 4.5: IXMAS Database: action recognition rates from 1NN classifier with the Grassmann
distances. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


Chapter 5<br />

<strong>GRASSMANN</strong> <strong>KERNELS</strong> AND<br />

DISCRIMINANT ANALYSIS<br />

5.1 Introduction<br />

In the previous chapter I defined subspace distances on the Grassmann manifold. However,<br />

with a distance structure only, there is a severe restriction on the possible operations with the

data. In this chapter, I show that it is possible to define positive definite kernel functions<br />

on the manifold, and thereby it is possible to transform the space to the familiar Hilbert<br />

space by virtue of the RKHS theory in Section 2.2. In particular, the Projection and the<br />

Binet-Cauchy distances presented in the previous chapter will be shown to be compatible<br />

with the Projection and the Binet-Cauchy kernels defined as follows:<br />

kProj(Y1, Y2) = ‖Y1′Y2‖²_F ,    kBC(Y1, Y2) = (det Y1′Y2)².


These kernels are discussed in detail in this chapter. The Binet-Cauchy kernel has been<br />

used as a similarity measure for sets [90] and dynamical systems [83]. 1 The Projection dis-<br />

tance has been used for face recognition [85], but the corresponding Projection kernel has<br />

not been explicitly used, and it is the main object of this chapter. I examine both kernels as<br />

the representative kernels on the Grassmann manifold. Advantages of the Grassmann ker-<br />

nels over the Euclidean kernels are demonstrated by a classification problem with Support<br />

Vector Machines (SVMs) on synthetic datasets.<br />

To demonstrate the potential benefits of the kernels further, I use the kernels in a dis-<br />

criminant analysis of subspaces. The proposed method will be contrasted with the previ-<br />

ously suggested subspace-based discriminant algorithms [92, 64, 24, 43]. Those previous<br />

methods adopt an inconsistent strategy: feature extraction is performed in the Euclidean<br />

space while non-Euclidean subspace distances are used. This inconsistency results in a<br />

difficult optimization and a weak guarantee of convergence. In the proposed approach of<br />

this chapter, the feature extraction and the distance measurement are integrated around the<br />

Grassmann kernel, resulting in a simpler and more familiar algorithm. Experiments with<br />

the image databases also show that the proposed method performs better than the previous<br />

methods.<br />

5.2 Kernel functions for subspaces<br />

Among the various distances presented in Chapter 4, only the Projection distance and the<br />

Binet-Cauchy distance are induced from positive definite kernels. This means that we can<br />

1 The authors of [83] use the term Binet-Cauchy kernel for a more abstract class of kernels for Fredholm
operators. The Binet-Cauchy kernel kBC in this thesis is a special case which is close to what those authors

call the Martin kernel.<br />



define the corresponding kernels kProj and kBC such that the following is true:<br />

d 2 (Y1, Y2) = k(Y1, Y1) + k(Y2, Y2) − 2k(Y1, Y2). (5.1)<br />

To define a kernel on the Grassmann manifold, let’s recall the definition of a positive<br />

definite kernel in Definition 2.4:<br />

A real symmetric function k is a (resp. conditionally) positive definite kernel function
if Σ_{i,j} ci cj k(xi, xj) ≥ 0 for all x1, ..., xn (xi ∈ X) and c1, ..., cn (ci ∈ R) for any n ∈ N
(resp. for all c1, ..., cn (ci ∈ R) such that Σ_{i=1}^{n} ci = 0).

Based on the Euclidean coordinates of subspaces, the Grassmann kernel is defined as<br />

follows:<br />

Definition 5.1. Let k : R D×m × R D×m → R be a real valued symmetric function<br />

k(Y1, Y2) = k(Y2, Y1). The function k is a Grassmann kernel if it is 1) positive definite and<br />

2) invariant to different representations:<br />

k(Y1, Y2) = k(Y1R1, Y2R2), ∀R1, R2 ∈ O(m).<br />

In the following sections I explicitly construct an isometry from (G, dProj or BC) to a<br />

Hilbert space (H, L2), and use the isometry to show that the Projection and the Binet-<br />

Cauchy kernels are Grassmann kernels.<br />

5.2.1 Projection kernel<br />

The Projection distance dProj can be understood by associating a subspace with a projection<br />

matrix by the following embedding [16]<br />

Ψ : G(m, D) → R D×D , span(Y ) ↦→ Y Y ′ . (5.2)<br />



The image Ψ(G(m, D)) is the set of rank-m orthogonal projection matrices, hence the
name Projection distance.

Theorem 5.2. The map

Ψ : G(m, D) → R^{D×D},  span(Y) ↦ Y Y′    (5.3)

is an embedding. In particular, it is an isometry from (G, dProj) to (R^{D×D}, ‖·‖_F).

Proof. 1. Well-defined: if span(Y1) = span(Y2), or equivalently Y1 = Y2R for some
R ∈ O(m), then Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2).

2. Injective: suppose Ψ(Y1) = Y1Y1′ = Y2Y2′ = Ψ(Y2). After multiplying by Y1 and Y2 on
the right we get the equalities Y1 = Y2(Y2′Y1) and Y2 = Y1(Y1′Y2) respectively.
Let R = Y2′Y1; then Y1 = Y2R = Y1(R′R), which shows R ∈ O(m) and
therefore span(Y1) = span(Y2).

3. Isometry: ‖Ψ(Y1) − Ψ(Y2)‖_F = ‖Y1Y1′ − Y2Y2′‖_F = 2^{1/2} dProj(Y1, Y2).

Since we have a Euclidean embedding into (R^{D×D}, ‖·‖_F), the natural inner product of
this space is the trace tr[(Y1Y1′)(Y2Y2′)] = ‖Y1′Y2‖²_F. This provides us with the definition
of the Projection kernel:

Theorem 5.3. The Projection kernel

kProj(Y1, Y2) = ‖Y1′Y2‖²_F    (5.4)

is a Grassmann kernel.


Proof. The kernel is well-defined because kProj(Y1, Y2) = kProj(Y1R1, Y2R2) for any R1, R2 ∈
O(m). The positive definiteness follows from the properties of the Frobenius norm: for all
Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N, we have

Σ_{ij} ci cj ‖Yi′Yj‖²_F = Σ_{ij} ci cj tr(YiYi′YjYj′) = Σ_{ij} ci cj tr[(YiYi′)(YjYj′)]
  = tr( Σ_i ci YiYi′ )² = ‖ Σ_i ci YiYi′ ‖²_F ≥ 0.

The Projection kernel has a very simple form and requires only O(Dm) multiplications<br />

to evaluate. It is the main kernel I propose to use for subspace-based learning.<br />
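As an illustration of how little is needed to use the kernel in practice, the sketch below evaluates kProj for two orthonormal bases and checks the relation (5.1); the factor of 2 relative to dProj matches the 2^{1/2} in Theorem 5.2. This is a minimal numpy example of my own, not code from the thesis.

```python
import numpy as np

def orth_basis(A):
    # Orthonormal basis of the column span of A (thin QR factorization).
    Q, _ = np.linalg.qr(A)
    return Q

def k_proj(Y1, Y2):
    # Projection kernel: only the small m x m matrix Y1'Y2 is ever formed.
    return np.linalg.norm(Y1.T @ Y2, 'fro') ** 2

rng = np.random.default_rng(0)
D, m = 20, 3
Y1 = orth_basis(rng.standard_normal((D, m)))
Y2 = orth_basis(rng.standard_normal((D, m)))

# (5.1) with kProj recovers the squared embedding distance ||Y1Y1' - Y2Y2'||_F^2,
# which equals 2 * dProj^2 by Theorem 5.2.
lhs = k_proj(Y1, Y1) + k_proj(Y2, Y2) - 2 * k_proj(Y1, Y2)
rhs = np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, 'fro') ** 2
assert np.isclose(lhs, rhs)
```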

5.2.2 Binet-Cauchy kernel<br />

The Binet-Cauchy distance can also be understood by an embedding. Let s be a subset of<br />

{1, ..., D} with m elements s = {r1, ..., rm}, and Y (s) be the m × m matrix whose rows<br />

are the r1, ... , rm-th rows of Y . If s1, s2, ..., sn are all such choices of the subset ordered<br />

lexicographically, then the Binet-Cauchy embedding is defined as<br />

Ψ : G(m, D) → R^n,  span(Y) ↦ ( det Y^{(s1)}, ..., det Y^{(sn)} ),    (5.5)

where n = C(D, m) is the number of ways of choosing m rows out of the D rows. It is also an isometry
from (G, dBC) to (R^n, ‖·‖₂). The natural inner product in R^n is the dot product of the two
vectors,

Σ_{i=1}^{n} det Y1^{(si)} det Y2^{(si)},

which provides us with the definition of the Binet-Cauchy kernel.


Theorem 5.4. The Binet-Cauchy kernel

kBC(Y1, Y2) = (det Y1′Y2)² = det(Y1′Y2Y2′Y1)    (5.6)

is a Grassmann kernel.

Proof. First, the kernel is well-defined because kBC(Y1, Y2) = kBC(Y1R1, Y2R2) for any
R1, R2 ∈ O(m). To show that kBC is positive definite it suffices to show that k(Y1, Y2) =
det Y1′Y2 is positive definite. From the Binet-Cauchy identity [38, 90, 83], we have

det Y1′Y2 = Σ_s det Y1^{(s)} det Y2^{(s)}.

Therefore, for all Y1, ..., Yn (Yi ∈ G) and c1, ..., cn (ci ∈ R) for any n ∈ N,

Σ_{ij} ci cj det Yi′Yj = Σ_{ij} ci cj Σ_s det Yi^{(s)} det Yj^{(s)}
  = Σ_s Σ_{ij} ci cj det Yi^{(s)} det Yj^{(s)} = Σ_s ( Σ_i ci det Yi^{(s)} )² ≥ 0.

Some other forms of the Binet-Cauchy kernel have also appeared in the literature. Note
that although det Y1′Y2 is also a Grassmann kernel, we prefer kBC(Y1, Y2) = (det Y1′Y2)².
The reason is that the latter is directly related to the principal angles by (det Y1′Y2)² =
Π_i cos²θi and therefore admits geometric interpretations, whereas the former cannot be
written directly in terms of the principal angles. That is, det Y1′Y2 ≠ Π_i cos θi in general.²
Another variant arcsin kBC(Y1, Y2) is also a positive definite kernel³, and its induced metric
d = (arccos(det Y1′Y2))^{1/2} is a conditionally positive definite metric.

² For example, det Y1′Y2 can be negative, whereas Π_i cos θi – a product of singular values – is nonnegative
by definition.
³ From Theorems 4.18 and 4.19 of [69].
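The Binet-Cauchy kernel is equally easy to evaluate; the short sketch below (an illustrative example, not code from the thesis) computes it and checks the principal-angle identity used above.

```python
import numpy as np

def k_bc(Y1, Y2):
    # Binet-Cauchy kernel: squared determinant of the m x m matrix Y1'Y2.
    return np.linalg.det(Y1.T @ Y2) ** 2

rng = np.random.default_rng(1)
D, m = 15, 4
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

# (det Y1'Y2)^2 = prod_i cos^2(theta_i), with cos(theta_i) the singular values of Y1'Y2.
cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
assert np.isclose(k_bc(Y1, Y2), np.prod(cosines ** 2))
```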


5.2.3 Indefinite kernels from other metrics<br />

Since the Projection distance and the Binet-Cauchy distance are derived from positive def-<br />

inite kernels, we have all the kernel-based algorithms for Hilbert spaces at our disposal. In<br />

contrast, other distances in the previous chapter are not associated with Grassmann kernels<br />

and can only be used with less powerful algorithms. Showing directly that a distance is not
associated with any kernel is harder than showing that such a kernel exists.

However, Theorem 2.12 can be used to make the task easier:<br />

A metric d is induced from a positive definite kernel if and only if

k̂(x1, x2) = −d²(x1, x2)/2,  x1, x2 ∈ X    (5.7)

is conditionally positive definite. The theorem allows us to show a metric's non-positive
definiteness by constructing an indefinite kernel matrix from (5.7) as a counterexample.

There have been efforts to use indefinite kernels for learning [59, 31], and several<br />

heuristics have been proposed to modify an indefinite kernel matrix to a positive definite<br />

matrix [60]. However, I do not advocate the use of the heuristics since they change the<br />

geometry of the original data.<br />
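The counterexample strategy is easy to carry out numerically. The sketch below, a hypothetical helper of my own rather than a routine from the thesis, builds k̂ = −d²/2 from a matrix of squared distances and inspects the spectrum of its double-centered version; a clearly negative eigenvalue certifies that the distance is not induced by any positive definite kernel. Applying it to a Gram matrix of, say, dArc or dMaxCor on a few dozen random subspaces is one way to produce the counterexample mentioned above.

```python
import numpy as np

def is_cpd(dist2_matrix, tol=1e-9):
    """Test conditional positive definiteness of k_hat = -D2/2, cf. (5.7).

    dist2_matrix is an N x N matrix of squared distances.  k_hat is conditionally
    positive definite iff it is PSD on the subspace {c : sum(c) = 0}, which we
    check by double-centering with J = I - 11'/N and inspecting the spectrum.
    """
    n = dist2_matrix.shape[0]
    k_hat = -0.5 * dist2_matrix
    J = np.eye(n) - np.ones((n, n)) / n      # projector onto zero-sum coefficient vectors
    eigvals = np.linalg.eigvalsh(J @ k_hat @ J)
    return eigvals.min() >= -tol
```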

5.2.4 Extension to nonlinear subspaces<br />

Linear subspaces in the original space can be generalized to ‘nonlinear’ subspaces by con-<br />

sidering linear subspaces in a RKHS, which is a trick that has been used successfully in<br />

kernel PCA [68]. 4 In [90, 85] the trick is shown to be applicable to the computation of the<br />

principal angles, called the kernel principal angles. Wolf and Shashua, in particular, use<br />

the trick to compute the Binet-Cauchy kernel. Note that these two kernels have different<br />

4 A ‘nonlinear’ subspace is an oxymoron. Technically speaking, it is a preimage of a linear subspace in
the RKHS.

[Figure 5.1 is a schematic of the doubly kernel method; only its caption is reproduced here.]

Figure 5.1: Doubly kernel method. The first kernel implicitly maps the two ‘nonlinear
subspaces’ Xi and Xj to span(Yi) and span(Yj) via the map Φ : X → H1, where the
‘nonlinear subspace’ means the preimage Xi = Φ⁻¹(span(Yi)) and Xj = Φ⁻¹(span(Yj)).
The second (= Grassmann) kernel maps the points Yi and Yj on the Grassmann manifold
G(m, D) to the corresponding points in H2 via the map Ψ : G(m, D) → H2, such as (5.3)
or (5.5).

roles and need to be distinguished. An illustration of this ‘doubly kernel’ method is given<br />

in Figure 5.1.<br />

The key point of the trick is that the principal angles between two subspaces in the<br />

RKHS can be derived only from the inner products of vectors in the original space. 5 Fur-<br />

thermore, the orthonormalization procedure in the feature space also requires the inner prod-

uct of vectors only. Below is a summary of the procedures in [90].<br />

1. Let Xi = {x^i_1, ..., x^i_{Ni}} be the i-th set of data and Φi = [φ(x^i_1), ..., φ(x^i_{Ni})] be the
image matrix of Xi in the feature space implicitly defined by a kernel function k,
e.g., the Gaussian RBF kernel.⁵

2. The orthonormal basis Yi of span(Φi) is then computed from the Gram-Schmidt
process in the RKHS: Φi = YiRi.

3. Finally, the product Yi′Yj in the feature space, used to define the Binet-Cauchy kernel
for example, is computed from the original data by

Yi′Yj = (Ri⁻¹)′ Φi′Φj Rj⁻¹ = (Ri⁻¹)′ [k(x^i_k, x^j_l)]_{kl} Rj⁻¹

(see the sketch below).

⁵ A similar idea is also used to define probability distributions in the feature space [46, 96], and will be
explained in the next chapter.
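A compact way to realize step 2 without an explicit Gram-Schmidt is to note that Φi′Φi = Ri′Ri, so Ri can be taken as the Cholesky factor of the i-th Gram matrix. The sketch below follows that route; it is my own illustrative reading of the procedure (the kernel choice and the regularization jitter are assumptions), not the implementation of [90].

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_gram(A, B, gamma=1.0):
    # Gram matrix [k(a_k, b_l)] for the Gaussian RBF kernel.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def feature_space_YtY(Xi, Xj, gamma=1.0, jitter=1e-10):
    """Compute Yi'Yj in the RKHS using only inner products of the original data."""
    Kii = rbf_gram(Xi, Xi, gamma) + jitter * np.eye(len(Xi))   # = Ri'Ri
    Kjj = rbf_gram(Xj, Xj, gamma) + jitter * np.eye(len(Xj))
    Kij = rbf_gram(Xi, Xj, gamma)                              # = Phi_i' Phi_j
    Ri = cholesky(Kii)                      # upper triangular, Kii = Ri' Ri
    Rj = cholesky(Kjj)
    # Yi'Yj = (Ri^{-1})' Kij Rj^{-1}, evaluated with two triangular solves.
    tmp = solve_triangular(Ri, Kij, trans='T', lower=False)
    return solve_triangular(Rj, tmp.T, trans='T', lower=False).T

# Principal angles in the feature space then follow from the SVD of Yi'Yj.
```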

Although this extension has been used to improve classification tasks with a few small<br />

databases [90], I will not use the extension in the thesis for the following reasons. First, the<br />

databases I use already have theoretical grounds for being linear subspaces, and we want to<br />

verify the linear subspace models. Second, the advantage of kernel tricks in general is most<br />

pronounced when the ambient space R D has a relatively small dimension D compared to<br />

the number of data samples N. This is obviously not the case with the data used in the

thesis. Further experiments with the nonlinear extension will be carried out in the future.<br />

5.3 Experiments with synthetic data<br />

In this section I demonstrate the application of the Grassmann kernels to a two-class clas-<br />

sification problem with Support Vector Machines (SVMs). Using synthetic data I will<br />

compare the classification performances of linear/nonlinear SVMs in the original space<br />

with the performances of the SVMs in the Grassmann space. The advantages of the sub-<br />

space approach over the conventional Euclidean approach for classification problems will

be discussed.<br />

[Figure 5.2 has four panels: A. Class centers, B. Easy, C. Intermediate, D. Difficult.]

Figure 5.2: A two-dimensional subspace is represented by a triangular patch swept by two
basis vectors. The positive and negative classes are color-coded by blue and red respectively.
A: The two class centers Y+ and Y− around which other subspaces are randomly
generated. B–D: Examples of randomly selected subspaces for the ‘easy’, ‘intermediate’, and
‘difficult’ datasets.

5.3.1 Synthetic data<br />

I generate three types of datasets: ‘easy’, ‘intermediate’, and ‘difficult’; these datasets differ<br />

in the amount of noise in the data.<br />

For each type of the data, I generate N = 100 subspaces in D = 6 dimensional Eu-<br />

clidean space, where each subspace is m = 2 dimensional. To generate two-class data, I<br />

first define two exemplar subspaces spanned by the following bases Y+ and Y−:

Y+ = (1/√6) [ [ 1  1  1 −1  1  1 ]′ , [ 1  1 −1  1  1 −1 ]′ ],
Y− = (1/√6) [ [ 1 −1  1  1 −1  1 ]′ , [ 1 −1 −1 −1 −1 −1 ]′ ].

The Y+ and Y− serve as the positive and the negative class centers respectively. The<br />

corresponding subspaces span(Y+) and span(Y−) have the principal angles θ1 = θ2 =<br />

arccos(1/3).<br />

The other subspaces Yi’s are generated by adding a Gaussian random matrix M to the<br />

bases Y+ or Y−, and then by applying SVD to compute the new perturbed bases:<br />

Yi = U,  where  UΣV′ = Y+ + Mi  (i = 1, ..., N/2),   UΣV′ = Y− + Mi  (i = N/2 + 1, ..., N),

where the elements of the matrix Mi are independent Gaussian variables [Mi]jk ∼ N (0, s 2 ).<br />

The standard deviation s controls the amount of noise; the s is chosen to be s = 0.2, 0.3<br />

and 0.4 for ‘easy’, ‘intermediate’, and ‘difficult’ datasets respectively. Figure 5.2 shows<br />

examples of the subspaces for the three datasets. Note that the subspaces become more<br />

cluttered and the class boundary becomes more irregular as s increases.<br />
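A compact simulation of this generative process is given below. It is a hedged numpy sketch of the construction just described; the random seed and helper names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, N = 6, 2, 100

Y_pos = np.array([[ 1,  1], [ 1,  1], [ 1, -1], [-1,  1], [ 1,  1], [ 1, -1]]) / np.sqrt(6)
Y_neg = np.array([[ 1,  1], [-1, -1], [ 1, -1], [ 1, -1], [-1, -1], [ 1, -1]]) / np.sqrt(6)

def perturbed_basis(Y_center, s):
    # Add i.i.d. N(0, s^2) noise to the basis and re-orthonormalize via SVD.
    U, _, _ = np.linalg.svd(Y_center + s * rng.standard_normal(Y_center.shape),
                            full_matrices=False)
    return U[:, :m]

def make_dataset(s):
    # First N/2 subspaces scattered around Y+, the remaining N/2 around Y-.
    Ys = [perturbed_basis(Y_pos, s) for _ in range(N // 2)] + \
         [perturbed_basis(Y_neg, s) for _ in range(N // 2)]
    labels = np.array([+1] * (N // 2) + [-1] * (N // 2))
    return Ys, labels

easy, intermediate, difficult = make_dataset(0.2), make_dataset(0.3), make_dataset(0.4)
```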

5.3.2 Algorithms<br />

I compare the performance of the Euclidean SVM with linear/polynomial/RBF kernels and<br />

the performance of SVM with Grassmann kernels. To test the Euclidean SVMs, I randomly<br />

sample n = 50 points from each subspace from a Gaussian distribution.<br />

There is an immediate handicap with a linear classifier in the original data space. Each<br />

subspace is symmetric with respect to the origin, that is, if x is a point on a subspace,<br />



then −x is also on the subspace. As a result, any hyperplane either 1) contains a subspace<br />

or 2) halves a subspace into two parts and yields 50 percent classification rate, which is<br />

useless. Therefore, if data lie on subspaces without further restrictions, a linear classifier<br />

(with a zero-bias) always fails to classify subspaces. To alleviate the problem with the<br />

Euclidean algorithms, I sample points from the intersection of the subspaces and the half-<br />

space {(x1, ..., x6) ∈ R 6 | x1 > 0}.<br />

To test the Grassmann SVM, I first estimate the basis Yi from the SVD of the same<br />

sampled points used for the Euclidean SVM, and then evaluate the Grassmann kernel func-<br />

tions.<br />

The following five kernels are compared:

1. Euclidean SVM with the linear kernel: k(x1, x2) = ⟨x1, x2⟩.

2. Euclidean SVM with the polynomial kernel: k(x1, x2) = (⟨x1, x2⟩ + 1)³.

3. Euclidean SVM with the Gaussian RBF kernel: k(x1, x2) = exp(−‖x1 − x2‖²/(2r²)). The
radius r is chosen to be one-fifth of the diameter of the data: r = 0.2 max_{ij} ‖xi − xj‖.

4. Grassmannian SVM with the Projection kernel: k(Y1, Y2) = ‖Y1′Y2‖²_F.

5. Grassmannian SVM with the Binet-Cauchy kernel: k(Y1, Y2) = (det Y1′Y2)².

For the Euclidean SVMs, I use the public-domain software SVM-light [42] with default<br />

parameters. For the Grassmann SVMs, I use a Matlab code with a nonnegative QP solver.<br />

I evaluate the algorithms with the leave-one-out test by holding out one subspace and<br />

training with the other N − 1 subspaces.<br />
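For the Grassmann SVMs, any solver that accepts a precomputed kernel matrix can play the role of the Matlab QP code; the sketch below uses scikit-learn's SVC as a stand-in (an assumption of mine, not the software used in the thesis) to run the leave-one-out protocol with the Projection kernel.

```python
import numpy as np
from sklearn.svm import SVC

def projection_gram(bases):
    # Gram matrix of the Projection kernel over a list of orthonormal bases.
    n = len(bases)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.linalg.norm(bases[i].T @ bases[j], 'fro') ** 2
    return K

def loo_accuracy(bases, labels, C=1.0):
    # labels: numpy array of +/-1 class labels, one per subspace.
    K, n, correct = projection_gram(bases), len(bases), 0
    for held in range(n):
        train = np.setdiff1d(np.arange(n), [held])
        clf = SVC(C=C, kernel='precomputed')
        clf.fit(K[np.ix_(train, train)], labels[train])
        correct += clf.predict(K[[held]][:, train])[0] == labels[held]
    return correct / n
```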

5.3.3 Results and discussion<br />

Table 5.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,<br />

averaged for 10 independent trials. The results show that the Grassmann SVM with the<br />



Table 5.1: Classification rates of the Euclidean SVMs and the Grassmannian SVMs. The
best rate for each dataset is highlighted by boldface.

                     Euclidean                  Grassmann
                  Lin     Poly     RBF        Proj       BC
  Easy           88.21   98.41   98.37      100.00    100.00
  Intermediate   80.08   92.46   92.72       98.80     98.00
  Difficult      72.01   81.14   81.73       91.30     90.60

Projection kernel outperforms the Euclidean SVMs. The Grassmann SVM with the
Binet-Cauchy kernel is a close second. The polynomial and RBF kernels both perform
better than the linear kernel, but not as well as the Grassmann kernels. The overall classi-

fication rates decrease as the data become more difficult to separate.<br />

The Grassmann kernels achieve better results for the two main reasons. First, when<br />

the data are highly cluttered as shown in Figure 5.2, the geometric prior of the subspace<br />

structures can disambiguate the points close to each other that the Euclidean distance can-<br />

not distinguish well. Second, the Grassmann approach implicitly maps the data from the<br />

original D-dimensional space to a higher-dimensional (m(D−m)) space where separating<br />

the subspaces becomes easier.<br />

In addition to having a superior classification performance with subspace-structured<br />

data, the Grassmann kernel method has a smaller computational cost. In the experiment<br />

above, for example, the Euclidean approach uses a kernel matrix of a size 5000 × 5000,<br />

whereas the Grassmann approach uses a kernel matrix of a size 100 × 100 which is n = 50<br />

times smaller than the Euclidean kernel matrix.<br />



5.4 Discriminant Analysis of subspaces

In this section I introduce a discriminant analysis method on the Grassmann manifold, and<br />

compare this method with other previously known discriminant techniques for subspaces.<br />

Since the image databases in Chapter 3 are highly multiclass 6 and lie in high dimensional<br />

space, I propose to use the discriminant analysis technique to reduce dimensionality and<br />

extract features of subspace data.<br />

5.4.1 Grassmann Discriminant Analysis<br />

It is straightforward to show the procedures of using the Projection and the Binet-Cauchy<br />

kernels with the Kernel FDA method introduced in Section 5.4. Recall that the cost function<br />

of Kernel FDA is as follows:<br />

J(α) = (α′Φ′S_BΦα) / (α′Φ′S_WΦα) = (α′K(V − 1_N 1_N′/N)Kα) / (α′(K(I_N − V)K + σ²I_N)α),    (5.8)
where K is the kernel matrix, σ is a regularization term, and the others are fixed terms.<br />

Since the method is already explained in detail, I only present a summary of the procedure<br />

below.<br />

6 Nc = 38, 68, 8, and 11 for the Yale Face, CMU-PIE, ETH-80, and IXMAS databases respectively.


Assume the D by m orthonormal bases {Yi} are already computed and given.<br />

Training:<br />

1. Compute the matrix [Ktrain]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi, Yj in the<br />

training set.<br />

2. Solve maxα J(α) in (5.8) by eigen-decomposition.<br />

3. Compute the (Nc − 1)-dimensional coefficients Ftrain = α ′ Ktrain.<br />

Testing:<br />

1. Compute the matrix [Ktest]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi in training<br />

set and Yj in the test set.<br />

2. Compute the (Nc − 1)-dim coefficients Ftest = α ′ Ktest.<br />

3. Perform 1-NN classification from the Euclidean distance between Ftrain and<br />

Ftest.<br />
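One way to realize the training stage is to solve the generalized eigenvalue problem underlying (5.8) directly. The sketch below is a simplified reading of that procedure; the construction of the class-indicator matrix V and the choice of eigensolver are standard kernel-FDA ingredients that I assume here rather than copy from the thesis.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_fda(K, y, sigma2=1e-3):
    """GDA coefficients alpha from a Grassmann kernel matrix K = [kProj(Yi, Yj)].

    y holds integer class labels; the returned alpha has Nc-1 columns, and the
    discriminant features are F = alpha.T @ K as in the procedure above.
    """
    N = len(y)
    classes = np.unique(y)
    V = np.zeros((N, N))
    for c in classes:                       # block matrix with 1/N_c inside each class block
        idx = np.where(y == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)
    ones = np.ones((N, N)) / N
    A = K @ (V - ones) @ K                                   # numerator of (5.8)
    B = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)         # denominator + regularization
    w, alpha = eigh(A, B)                                    # generalized symmetric eigenproblem
    return alpha[:, ::-1][:, :len(classes) - 1]              # top Nc-1 directions
```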

I call this method the Grassmann Discriminant Analysis to differentiate it from other<br />

discriminant methods for subspaces, which I review in the following sections.<br />

5.4.2 Mutual Subspace Method (MSM)<br />

The original MSM [92] performs simple 1-NN classification with dMax with no feature<br />

extraction. The method can be extended to any distance described in the thesis. Although<br />

there are attempts to use kernels for MSM [64], the kernel is used only to represent data in<br />

the original space, and the MSM algorithm is still a 1-NN classification.<br />



5.4.3 Constrained MSM (cMSM)<br />

Constrained MSM [24] is a technique that applies dimensionality reduction to the bases<br />

of the subspaces in the original space. Let G = Σ_i YiYi′ be the sum of the projection

matrices of the data and {v1, ..., vD} be the eigenvectors corresponding to the eigenvalues<br />

{λ1 ≤ ... ≤ λD} of G. The authors of [24] claim that the first few eigenvectors v1, ..., vd<br />

of G are more discriminative than the later eigenvectors, and suggest projecting the basis<br />

vectors of each subspace Yi onto span(v1, ..., vd), followed by normalization. However,
these procedures lack justification, as well as a clear criterion for choosing the dimension
d, on which, in our experience, the result crucially depends.

5.4.4 Discriminant Analysis of Canonical Correlations (DCC)<br />

The Discriminant Analysis of Canonical Correlations [43] can be understood as a non-<br />

parametric version of linear discrimination analysis using the Procrustes distance (4.11).<br />

The algorithm finds the discriminating direction w which maximizes the ratio L(w) =<br />

w ′ SBw/w ′ Sww, where Sb and Sw are the nonparametric between-class and within-class<br />

‘covariance’ matrices from Section 2.4.2:<br />

Sb = Σ_i Σ_{j∈Bi} (YiU − YjV)(YiU − YjV)′,
Sw = Σ_i Σ_{j∈Wi} (YiU − YjV)(YiU − YjV)′,

where U and V are from (4.2). Recall that tr(YiU − YjV )(YiU − YjV ) ′ = �YiU − YjV � 2 F<br />

is the Procrustes distance (squared). However, unlike my method, Sb and Sw do not admit<br />

a geometric interpretation as true covariance matrices, nor can they be kernelized directly.<br />

Another disadvantage of the DCC is the difficulty in optimization. The algorithm iterates<br />

the two stages of 1) maximizing the ratio L(w) and of 2) computing Sb and Sw, which<br />

results in a computational overhead and a weak theoretical support for global convergence.

5.5 Experiments with real-world data<br />

In this section I test the Grassmann Discriminant Analysis with the Yale Face, the CMU-<br />

PIE, the ETH-80 and the IXMAS databases, and compare its performance with those of<br />

other algorithms.<br />

5.5.1 Algorithms<br />

The following is the list of algorithms used in the test.<br />

1. Baseline: Euclidean FDA<br />

2. Grassmann Discriminant Analysis:<br />

• GDA1 (Projection kernel + kernel FDA)<br />

• GDA2 (Binet-Cauchy kernel + kernel FDA)<br />

For GDA1 and GDA2, the optimal values of σ are found by scanning through a range<br />

of values. The results do not seem to vary much as long as σ is small enough.<br />

3. Others<br />

• MSM (max corr)<br />

• cMSM (PCA+max corr)<br />

• DCC (NDA + Procrustes dist): For cMSM and DCC, the optimal dimension d is<br />

found by exhaustive searching. For DCC, we have used two nearest-neighbors<br />

for Bi and Wi in Section 5.4.4. However, increasing the number of nearest-<br />

neighbors does not change the results very much as was observed in [43]. In<br />

DCC the optimization is iterated for 5 times each.<br />



I evaluate the algorithms with the cross validation as explained in Section 4.4.1.<br />

5.5.2 Results and discussion<br />

Figures 5.3–5.6 show the classification rates. I can summarize the results as follows:<br />

1. The GDA1 shows significantly better performance than all the other algorithms for<br />

all datasets. However, the difference is less pronounced in the Yale Face database<br />

where the other discriminant algorithms also performed well.<br />

2. The overall rates are roughly in the order of (GDA1 > cMSM > DCC > others ).<br />

These three algorithms consistently outperform the baseline method, whereas GDA2<br />

and MSM occasionally lag behind the baseline.<br />

3. With the exception of the IXMAS database, the rates of the GDA1, MSM, cMSM,<br />

and DCC remain relatively the same as the subspace dimension m increases. For<br />

IXMAS, the rates seem to increase gradually as m increases in the given range.<br />

4. The GDA2 performs poorly in general and degrades fast as m increases. This can<br />

be ascribed to the properties of the Binet-Cauchy distance explained in Chapter 4.<br />

Due to its product form, the kernel matrix tends to be an identity as the subspace<br />

dimension increases, which is also empirically checked from data.<br />

5.6 Conclusion<br />

In this chapter I defined the Grassmann kernels for subspace-based learning, and showed<br />

constructions of the Projection kernel and the Binet-Cauchy kernel via isometric embed-<br />

dings. Although the embeddings can be used explicitly to represent a subspace as a D × D<br />

projection matrix or a DCm × 1 vector, as in [3], the equivalent kernel representations are<br />

preferred due to the storage and computation requirements.<br />



To demonstrate the potential advantages of the Grassmann kernels, I applied the kernel<br />

discriminant analysis algorithm to image databases represented as collections of subspaces.<br />

For its surprisingly simple form and usage, the proposed method with the Projection kernel<br />

outperformed the other state-of-the-art discriminant methods with the real data. However,<br />

the Binet-Cauchy kernel, when used in its naive form, is shown to be of limited value for

subspace-based learning problems. There are possibly other Grassmann kernels which are<br />

not derived from the two representative kernels, and it is left as a future work to discover<br />

them.<br />

[Figure 5.3 plots recognition rate (%) versus subspace dimension m for the six methods; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00
  GDA (Proj)  96.77  95.70  98.57  99.28  98.92  98.57  97.85  96.77  97.13
  GDA (BC)    96.77  95.34  96.77  96.42  83.87  72.76  55.20  48.03  44.09
  MSM         87.81  89.96  92.11  91.04  91.04  91.04  91.76  92.11  91.76
  cMSM        92.47  96.06  94.98  93.55  94.62  94.98  94.98  96.42  95.70
  DCC         54.12  96.06  94.98  95.70  93.91  94.62  96.42  94.98  93.55

Figure 5.3: Yale Face Database: face recognition rates from various discriminant analysis
methods. The two highest rates including ties are highlighted with boldface for each
subspace dimension m.


[Figure 5.4 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  60.73  60.73  60.73  60.73  60.73  60.73  60.73  60.73  60.73
  GDA (Proj)  88.27  74.84  89.77  87.21  91.68  92.54  93.82  93.60  95.31
  GDA (BC)    88.27  71.43  82.52  64.82  58.64  47.55  43.07  39.87  36.25
  MSM         72.28  66.95  65.03  64.61  64.18  63.97  64.61  64.61  64.39
  cMSM        73.13  71.22  67.59  68.23  69.72  69.94  70.15  72.71  72.49
  DCC         77.19  78.89  66.52  63.75  64.61  67.59  67.59  67.59  65.03

Figure 5.4: CMU-PIE Database: face recognition rates from various discriminant analysis
methods.


[Figure 5.5 plots categorization rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8    m=9
  FDA (Eucl)  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00  85.00
  GDA (Proj)  88.75  90.00  96.25  97.50  95.00  95.00  95.00  96.25  96.25
  GDA (BC)    88.75  87.50  90.00  81.25  72.50  60.00  51.25  41.25  48.75
  MSM         80.00  78.75  82.50  81.25  81.25  82.50  83.75  81.25  83.75
  cMSM        88.75  91.25  92.50  93.75  93.75  91.25  92.50  93.75  93.75
  DCC         65.00  88.75  83.75  88.75  87.50  87.50  85.00  85.00  85.00

Figure 5.5: ETH-80 Database: object categorization rates from various discriminant analysis
methods.


[Figure 5.6 plots recognition rate (%) versus subspace dimension m; the plotted values are tabulated below.]

               m=1    m=2    m=3    m=4    m=5
  FDA (Eucl)  54.87  54.87  54.87  54.87  54.87
  GDA (Proj)  69.09  80.30  84.55  84.24  85.15
  GDA (BC)    69.09  60.00  50.91  36.36  25.15
  MSM         61.82  65.76  63.03  67.58  66.97
  cMSM        63.64  69.39  73.33  77.58  78.48
  DCC         38.79  61.82  74.55  76.67  77.58

Figure 5.6: IXMAS Database: action recognition rates from various discriminant analysis
methods.


Chapter 6<br />

EXTENDED <strong>GRASSMANN</strong> <strong>KERNELS</strong><br />

AND PROBABILISTIC DISTANCES<br />

6.1 Introduction<br />

So far I have modeled the data as the set of linear subspaces. To relax this geometric as-<br />

sumption of the data, let’s take a step back from the assumption and take a probabilistic<br />

view of the data. Let’s suppose a set of vectors consists of i.i.d. samples from an arbitrary prob-

ability distribution. Then it is possible to compare two such distributions of vectors with<br />

probabilistic similarity measures, such as the KL distance 1 [47], the Chernoff distance [15],<br />

or the Bhattacharyya/Hellinger distance [10], to name a few [70, 40, 46, 96]. Furthermore,<br />

the Bhattacharyya affinity is in fact a positive definite kernel function on the space of dis-<br />

tributions and has nice closed-form expressions for the exponential family [40].<br />

In this chapter, I investigate the relationship between the Grassmann kernels and the prob-

abilistic distances. The link is provided by the probabilistic generalization of subspaces<br />

with a Factor Analyzer [22], which is a Gaussian distribution that resembles a pancake.<br />

1 By distance I mean any nonnegative measure of similarity and not necessarily a metric.<br />



The first result I show is that the KL distance is reduced to the Projection kernel under<br />

the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit<br />

and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL<br />

distance, I propose an extension of the Projection kernel which is originally confined to the<br />

set of linear subspaces, to the set of affine as well as scaled subspaces. For this I introduce<br />

the affine Grassmann manifold and kernels.<br />

I demonstrate the extended kernels with the Support Vector Machines and the Kernel<br />

Discriminant Analysis using synthetic and real image databases. The experiments show the<br />

advantages of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels.<br />

6.2 Analysis of probabilistic distances and kernels<br />

In this section I introduce several well-known probabilistic distances, and establish their<br />

relationships with the Grassmann distances and kernels.<br />

6.2.1 Probabilistic distances and kernels<br />

Various probabilistic distances between distributions have been proposed in the literature.<br />

Some of them yield closed-form expressions for the exponential family and are convenient<br />

for analysis. Below is a short list of those distances.<br />

• KL distance :<br />

J(p1, p2) = ∫ p1(x) log( p1(x)/p2(x) ) dx    (6.1)

The KL distance is probably the most frequently used distance in learning problems.<br />

It is sometimes called the relative entropy and plays a fundamental role in information

theory.<br />



• KL distance (symmetric) :<br />

JKL(p1, p2) = ∫ [p1(x) − p2(x)] log( p1(x)/p2(x) ) dx    (6.2)

Since the original KL distance is asymmetric, this symmetrized version is often used<br />

instead. I exclusively use the symmetric version in the chapter. This distance is still<br />

not a valid metric.<br />

• Chernoff distance:

JCher(p1, p2) = − log ∫ p1^{α1}(x) p2^{α2}(x) dx,   (α1 + α2 = 1, α1, α2 > 0)    (6.3)

The Chernoff distance is asymmetric. A symmetric version of the distance with
α1 = α2 = 1/2 is known as the Bhattacharyya distance:

JBhat(p1, p2) = − log ∫ [p1(x) p2(x)]^{1/2} dx    (6.4)

• Hellinger distance:

JHel(p1, p2) = ∫ ( √p1(x) − √p2(x) )² dx    (6.5)

The Hellinger distance is directly related to the Bhattacharyya distance by JHel =
2(1 − exp(−JBhat)).

One can also define similarity measures instead of the dissimilarity measures above.<br />

Jebara and Kondor [40] proposed the Probability Product kernel<br />

kProb(p1, p2) = ∫ p1^α(x) p2^α(x) dx,   (α > 0).    (6.6)



By construction, this kernel is positive definite in the space of normalized probability distri-<br />

butions [40]. This kernel includes the Bhattacharrya and the Expected Likelihood kernels<br />

as special cases:<br />

• Bhattacharyya kernel (α = 1/2):

kBhat(p1, p2) = ∫ [p1(x) p2(x)]^{1/2} dx    (6.7)

• Expected Likelihood kernel (α = 1):

kEL(p1, p2) = ∫ p1(x) p2(x) dx    (6.8)

The probabilistic distances are closely related to each other. For example, the Hellinger<br />

distance forms a bound on the KL distance [77], and the Bhattacharyya distance and the<br />

KL distance are both instances of the Rényi divergence [63]. However the behaviors of<br />

the distances are quite different under my data model. I examine the KL distance and the<br />

Probability Product kernel in particular.

6.2.2 Data as Mixture of Factor Analyzers<br />

The probabilistic distances in the previous section are not restricted to specific distributions.<br />

However, I will model the data distribution as the Mixture of Factor Analyzers (MFA) [27].<br />

If we have i = 1, ..., N sets in the data, then each set is considered as i.i.d. samples from<br />

the i-th Factor Analyzer

x ∼ pi(x) = N(ui, Ci),   Ci = YiYi′ + σ²I_D,    (6.9)

[Figure 6.1 compares the two models; only its caption is reproduced here.]

Figure 6.1: Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold
(left), the set of linear subspaces, can alternatively be modeled as the set of flat
(σ → 0) spheres (Yi′Yi = Im) intersecting at the origin (ui = 0). The right figure shows a
general Mixture of Factor Analyzers which are not bound by these conditions.

where ui ∈ R D is the mean, Yi is a full-rank D × m matrix (D > m), and σ is the ambient<br />

noise level. The factor analyzer model is a practical substitute for a Gaussian distribution<br />

when the dimensionality D of the images is greater than the number of samples n in a set,
in which case it is impossible to estimate the full covariance C, let alone invert it.

More importantly, I use the factor analyzer distribution to provide the link between the<br />

Grassmann manifold and the space of probabilistic distributions. In fact a linear subspace<br />

can be considered as the ‘flattened’ (σ → 0) limit of a zero-mean (ui = 0), homogeneous
(Yi′Yi = Im) factor analyzer distribution, as depicted in Figure 6.1.

Some linear algebra<br />

Let’s summarize some linear algebraic shortcuts to analyze the distances. The inversion<br />

lemma will be used several times. For σ > 0, we have the identity:<br />

Ci⁻¹ = (YiYi′ + σ²I)⁻¹ = σ⁻²( I − Yi(σ²I + Yi′Yi)⁻¹Yi′ ).



Let M1 and M2 be the m × m matrices

M1 = (σ²Im + Y1′Y1)⁻¹,   M2 = (σ²Im + Y2′Y2)⁻¹,

and let Ỹ1 and Ỹ2 be the matrices

Ỹ1 = Y1 M1^{1/2},   Ỹ2 = Y2 M2^{1/2},

so that Ci⁻¹ = σ⁻²(I_D − ỸiỸi′). From the identity we can compute the following:

C1⁻¹C2 + C2⁻¹C1 = (σ²I_D + Y1Y1′)⁻¹(σ²I_D + Y2Y2′) + (σ²I_D + Y2Y2′)⁻¹(σ²I_D + Y1Y1′)
  = σ⁻²(I_D − Ỹ1Ỹ1′)(σ²I_D + Y2Y2′) + σ⁻²(I_D − Ỹ2Ỹ2′)(σ²I_D + Y1Y1′)
  = 2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ + σ⁻²(Y1Y1′ + Y2Y2′ − Ỹ1Ỹ1′Y2Y2′ − Ỹ2Ỹ2′Y1Y1′),

(C1 + C2)⁻¹ = (2σ²I_D + Y1Y1′ + Y2Y2′)⁻¹
  = (2σ²)⁻¹(I_D + ZZ′)⁻¹ = (2σ²)⁻¹(I_D − Z(I_{2m} + Z′Z)⁻¹Z′),
  where Z = (2σ²)^{−1/2}[Y1 Y2],

C1⁻¹ + C2⁻¹ = σ⁻²(2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′) = 2σ⁻²(I_D − Z̃Z̃′),
  where Z̃ = 2^{−1/2}[Ỹ1 Ỹ2],

(C1⁻¹ + C2⁻¹)⁻¹ = (σ²/2)(I_D − Z̃Z̃′)⁻¹ = (σ²/2)(I_D + Z̃(I_{2m} − Z̃′Z̃)⁻¹Z̃′).
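These shortcuts are easy to check numerically for a small D. The sketch below verifies the inversion-lemma form of Ci⁻¹ against a direct inverse; it is purely illustrative, with dimensions chosen small so that the D × D inverse is still feasible.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, sigma2 = 8, 3, 0.25

Y = rng.standard_normal((D, m))
C = Y @ Y.T + sigma2 * np.eye(D)

# Inversion lemma: C^{-1} = sigma^{-2} (I - Y (sigma^2 I + Y'Y)^{-1} Y').
M = np.linalg.inv(sigma2 * np.eye(m) + Y.T @ Y)
C_inv_lemma = (np.eye(D) - Y @ M @ Y.T) / sigma2
assert np.allclose(C_inv_lemma, np.linalg.inv(C))

# The 'whitened' basis used throughout the derivations: Y_tilde Y_tilde' = Y M Y'.
Y_tilde = Y @ np.linalg.cholesky(M)       # any square root of M works for this check
assert np.allclose(C_inv_lemma, (np.eye(D) - Y_tilde @ Y_tilde.T) / sigma2)
```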

6.2.3 Analysis of KL distance<br />

The KL distance between two Factor Analyzers is as follows:

JKL(p1, p2) = (1/2) tr[ C2⁻¹C1 + C1⁻¹C2 − 2I_D ] + (1/2)(u1 − u2)′(C1⁻¹ + C2⁻¹)(u1 − u2)
  = (1/2) tr( −Ỹ1′Ỹ1 − Ỹ2′Ỹ2 ) + (σ⁻²/2) tr( Y1′Y1 + Y2′Y2 − Ỹ1′Y2Y2′Ỹ1 − Ỹ2′Y1Y1′Ỹ2 )
    + (σ⁻²/2)(u1 − u2)′( 2I_D − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ )(u1 − u2).    (6.10)

Furthermore, we can write the distance as

JKL(p1, p2) = (1/2) tr( −Ỹ1′Ỹ1 − Ỹ2′Ỹ2 ) + (σ⁻²/2) tr( Y1′Y1 + Y2′Y2 − Ỹ1′Y2Y2′Ỹ1 − Ỹ2′Y1Y1′Ỹ2 )
    + σ⁻²( u′u − u′Z̃Z̃′u ),

where u = u1 − u2. Note that the computation of the distance involves only products of
the column vectors of Yi and ui, and we need not handle any D × D matrix explicitly.

KL in the limit yields the projection kernel

For ui = 0 and Yi′Yi = Im, we have Ỹi′Ỹi = (σ² + 1)⁻¹Im, and therefore

JKL(p1, p2) = (1/2)( −2m/(σ² + 1) ) + (σ⁻²/2)( 2m − 2(σ² + 1)⁻¹ tr(Y1′Y2Y2′Y1) )
  = ( 1/(2σ²(σ² + 1)) )( 2m − 2 tr(Y1′Y2Y2′Y1) ).

We can ignore the multiplying factors which do not depend on Y1 or Y2, and rewrite the
distance as

JKL(p1, p2) ∝ 2m − 2 tr(Y1′Y2Y2′Y1).



One can immediately realize that this is indeed the definition of the squared Projection<br />

distance d 2 Proj (Y1, Y2) up to multiplicative factors.<br />
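This limiting behaviour can be confirmed numerically: for zero-mean, homogeneous factor analyzers with a small σ, the pairwise symmetric KL distances are, up to one global scale, the squared Projection distances. The snippet below is a hedged check of that statement (my own construction, not part of the original derivation).

```python
import numpy as np

def sym_kl_homogeneous(Y1, Y2, sigma2):
    # Symmetric KL between N(0, Y1Y1' + s^2 I) and N(0, Y2Y2' + s^2 I); D kept small for clarity.
    D = Y1.shape[0]
    C1, C2 = Y1 @ Y1.T + sigma2 * np.eye(D), Y2 @ Y2.T + sigma2 * np.eye(D)
    return 0.5 * np.trace(np.linalg.solve(C1, C2) + np.linalg.solve(C2, C1) - 2 * np.eye(D))

def d2_proj(Y1, Y2):
    m = Y1.shape[1]
    return m - np.linalg.norm(Y1.T @ Y2, 'fro') ** 2       # = sum_i sin^2(theta_i)

rng = np.random.default_rng(0)
D, m, sigma2 = 10, 2, 1e-4
bases = [np.linalg.qr(rng.standard_normal((D, m)))[0] for _ in range(4)]
ratios = [sym_kl_homogeneous(A, B, sigma2) / d2_proj(A, B)
          for A in bases for B in bases if A is not B]
print(np.allclose(ratios, ratios[0], rtol=1e-2))           # all ratios agree: KL is prop. to dProj^2
```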

6.2.4 Analysis of Probability Product Kernel<br />

The Probability Product Kernel for Gaussian distributions is [40]<br />

kProb(p1, p2) = (2π)^{(1−2α)D/2} det(C†)^{1/2} det(C1)^{−α/2} det(C2)^{−α/2}
                × exp( −(1/2)[ α u1′C1⁻¹u1 + α u2′C2⁻¹u2 − (u†)′C†u† ] ),    (6.11)

where C† = α⁻¹(C1⁻¹ + C2⁻¹)⁻¹ and u† = α(C1⁻¹u1 + C2⁻¹u2).

To compute the determinant terms for Factor Analyzers, we use the following identity:<br />

if A and B are D × m matrices, then<br />

det(I_D + AB′) = det(Im + B′A) = Π_{i=1}^{m} (1 + τi(B′A)),    (6.12)

where τi is the i-th singular value of B ′ A. Using the identity we can write the following.<br />

det C1⁻¹ = det( σ⁻²(I_D − Ỹ1Ỹ1′) ) = σ^{−2D} det(Im − Ỹ1′Ỹ1) = σ^{−2D} Π_{i=1}^{m} (1 − τi(Ỹ1′Ỹ1)),

det C2⁻¹ = det( σ⁻²(I_D − Ỹ2Ỹ2′) ) = σ^{−2D} det(Im − Ỹ2′Ỹ2) = σ^{−2D} Π_{i=1}^{m} (1 − τi(Ỹ2′Ỹ2)),

det C† = det( σ²(2α)⁻¹(I_D + Z̃(I_{2m} − Z̃′Z̃)⁻¹Z̃′) )
  = σ^{2D}(2α)^{−D} det( I_{2m} + (I_{2m} − Z̃′Z̃)⁻¹Z̃′Z̃ ) = σ^{2D}(2α)^{−D} det(I_{2m} − Z̃′Z̃)⁻¹
  = σ^{2D}(2α)^{−D} Π_{i=1}^{2m} (1 − τi(Z̃′Z̃))⁻¹.


To compute the exponents in (6.11) we use the following identities (recall C† = α⁻¹(C1⁻¹ + C2⁻¹)⁻¹):

C1⁻¹(C1⁻¹ + C2⁻¹)⁻¹C2⁻¹ = C2⁻¹(C1⁻¹ + C2⁻¹)⁻¹C1⁻¹ = (C1 + C2)⁻¹
  = (2σ²)⁻¹( I_D − Z(I_{2m} + Z′Z)⁻¹Z′ ),

C1⁻¹(C1⁻¹ + C2⁻¹)⁻¹C1⁻¹ = C1⁻¹ − (C1 + C2)⁻¹
  = (σ⁻²/2)( 2I_D − 2Ỹ1Ỹ1′ − I_D + Z(I_{2m} + Z′Z)⁻¹Z′ )
  = (σ⁻²/2)( I_D − 2Ỹ1Ỹ1′ + Z(I_{2m} + Z′Z)⁻¹Z′ ),

C2⁻¹(C1⁻¹ + C2⁻¹)⁻¹C2⁻¹ = C2⁻¹ − (C1 + C2)⁻¹
  = (σ⁻²/2)( 2I_D − 2Ỹ2Ỹ2′ − I_D + Z(I_{2m} + Z′Z)⁻¹Z′ )
  = (σ⁻²/2)( I_D − 2Ỹ2Ỹ2′ + Z(I_{2m} + Z′Z)⁻¹Z′ ).
2 + Z(I2m + Z ′ Z) −1 Z ′ )<br />

Plugging these results back in (6.11) we again can compute the kernel without handling<br />

any D × D matrix. For concreteness I derive the Bhattacharyya kernel as an instance of the<br />

probability product kernel with α = 1/2 as follows:<br />

kBhat(p1, p2) = det(C † ) 1/2 det(C1) −1/4 det(C2) −1/4 exp − 1<br />

4 (u1 − u2) ′ (C1 + C2) −1 (u1 − u2)<br />

= det(I2m − � Z ′ � Z) −1/2 det(Im − � Y ′<br />

1 � Y1) 1/4 det(Im − � Y ′<br />

2 � Y2) 1/4<br />

× exp − σ−2<br />

4 (u1 − u2) ′ (ID − Z(I2m + Z ′ Z) −1 Z ′ )(u1 − u2). (6.13)<br />

108


Probability product kernel in the limit becomes trivial<br />

For ui = 0 and Yi′Yi = Im, we have

kProb(p1, p2) = (2π)^{(1−2α)D/2} det(C†)^{1/2} det(C1)^{−α/2} det(C2)^{−α/2}
  = (2π)^{(1−2α)D/2} σ^D (2α)^{−D/2} det(I_{2m} − Z̃′Z̃)^{−1/2}
    × σ^{−αD} det(Im − Ỹ1′Ỹ1)^{α/2} σ^{−αD} det(Im − Ỹ2′Ỹ2)^{α/2}
  = π^{(1−2α)D/2} 2^{−αD} α^{−D/2} ( σ^{2α(m−D)+D} / (σ² + 1)^{αm} ) det(I_{2m} − Z̃′Z̃)^{−1/2},

and furthermore,

det(I_{2m} − Z̃′Z̃)^{−1/2}
  = det( I_{2m} − (1/2) [ Ỹ1′Ỹ1  Ỹ1′Ỹ2 ; Ỹ2′Ỹ1  Ỹ2′Ỹ2 ] )^{−1/2}
  = det( I_{2m} − (1/(2(σ² + 1))) [ Im  Y1′Y2 ; Y2′Y1  Im ] )^{−1/2}
  = ( 2(σ² + 1)/(2σ² + 1) )^{m} det( Im − (1/(2σ² + 1)²) Y1′Y2Y2′Y1 )^{−1/2}.

Ignoring the terms which are not functions of Y1 or Y2, we have

kProb(Y1, Y2) ∝ det( Im − (1/(2σ² + 1)²) Y1′Y2Y2′Y1 )^{−1/2}.

Suppose the two subspaces span(Y1) and span(Y2) intersect only at the origin, that is,<br />

the singular values of Y1′Y2 are strictly less than 1. In this case kProb has a finite value as



σ → 0 and the inversion is well-defined. In contrast, the diagonal terms of kProb become<br />

kProb(Y1, Y1) = det( (1 − 1/(2σ² + 1)²) Im )^{−1/2} = ( (2σ² + 1)² / (4σ²(σ² + 1)) )^{m/2},    (6.14)

which diverges to infinity as σ → 0. This implies that after the kernel is normalized by the<br />

diagonal terms, it becomes a trivial kernel:<br />

k̃Prob(Yi, Yj) → 1 if span(Yi) = span(Yj), and 0 otherwise,   as σ → 0.    (6.15)

As I claimed earlier, the Probability Product kernel, including the Bhattacharyya kernel,<br />

loses its discriminating power as the Gaussian distributions become flatter.<br />

6.3 Extended Grassmann Kernel<br />

In the previous section I presented the probabilistic interpretation of the Projection kernel.<br />

Based on this analysis, I propose extensions of the Projection kernel and make the kernels<br />

applicable to more general data. In this section I examine the two directions of extension:<br />

from linear to affine subspaces, and from homogeneous to scaled subspaces.<br />

6.3.1 Motivation<br />

The motivations for considering affine subspaces and non-homogeneous subspaces arise<br />

from observing the subspaces computed from real data. Firstly, the sets of images, for
example those from the Yale Face database, have nonzero means. If the mean is significantly

different from set to set, we want to use the mean image as well as the PCA basis images to<br />

represent a set. Secondly, the eigenvalues from PCA almost always have non-homogeneous<br />

values. It is likely that the eigenvector direction corresponding to a larger eigenvalue is<br />

[Figure 6.2 has three panels: A. Linear, B. Affine, C. Scaled.]

Figure 6.2: The Mixture of Factor Analyzers model of the Grassmann manifold is the collection
of linear, homogeneous Factor Analyzers, shown as flat spheres intersecting at the origin
(A). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B), and also to
allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C).

more important than the eigenvector direction corresponding to a smaller eigenvalue. In<br />

which case we want to consider the eigenvalue scales as well as the eigenvectors when<br />

representing the set.<br />

These two extensions are naturally derived from the probabilistic generalization of sub-<br />

spaces. Figure 6.2 illustrates the ideas. Considering the data as a MFA distribution, we<br />

can gradually relax the zero-mean (ui = 0) condition in Figure A to the nonzero-mean<br />

(ui = arbitrary) condition in Figure B, and furthermore relax the homogeneity (Y ′ Y = I)<br />

condition to the non-homogeneous (Y ′ Y = full rank) condition in Figure C.<br />

From this I expect to benefit from both worlds – probabilistic distributions and geo-<br />

metric manifolds. However, simply relaxing the conditions and taking the limit σ → 0 of<br />

the KL distance does not guarantee a metric or a positive definite kernel, as we will shortly

examine. Certain compromises have to be made to turn the KL distance in the limit into<br />

a well-defined and usable kernel function. In the following sections I propose new frame-<br />

works for the extensions and the technical details for making valid kernels.<br />



6.3.2 Extension to affine subspaces<br />

An affine subspace in R D is simply a linear subspace with an offset. In that sense a linear<br />

subspace is an affine subspace with a zero offset.<br />

The affine span is an analog of a linear span. Let Y ∈ R D×m be an orthonormal basis<br />

matrix for a subspace, and u ∈ R D denote the offset of the subspace from the origin. The<br />

affine span can then be defined as<br />

aspan(Y, u) = {x | x = Y v + u, ∀v ∈ R m }. (6.16)<br />

This representation of an affine span is not unique, since different Y ’s can share the same<br />

linear span, and different offsets u’s can imply the same amount of bias. Formally, this can<br />

be expressed as an equivalence relation:<br />

Definition 6.1. aspan(Y1, u1) = aspan(Y2, u2) if and only if

span(Y1) = span(Y2)  and  Y1⊥(Y1⊥)′u1 = Y2⊥(Y2⊥)′u2,

where Y⊥ is any orthonormal basis for the orthogonal complement of span(Y), that is, Y Y′ + Y⊥(Y⊥)′ = ID.

Although the offset u is not unique, one can choose a unique 'standard' offset û by

û = (ID − Y Y′)u = Y⊥(Y⊥)′u,    (6.17)

which has the shortest distance from the origin to the affine span (refer to Figure 6.3).
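The following throwaway NumPy check (variable names are mine) verifies that û depends only on the affine span: rotating the basis and shifting the offset within span(Y) leaves û unchanged.

```python
import numpy as np

def standard_offset(Y, u):
    # u_hat = (I - Y Y') u of Eq. (6.17); Y is assumed to have orthonormal columns.
    return u - Y @ (Y.T @ u)

rng = np.random.default_rng(0)
D, m = 5, 2
Y, _ = np.linalg.qr(rng.standard_normal((D, m)))     # orthonormal basis
u = rng.standard_normal(D)

# Another representative of the same affine span: rotated basis, offset shifted within span(Y).
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
u_alt = u + Y @ rng.standard_normal(m)

print(np.allclose(standard_offset(Y, u), standard_offset(Y @ R, u_alt)))   # True
```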



Figure 6.3: The same affine span can be expressed with different offsets u1, u2, ... However,<br />

one can use the unique ‘standard’ offset û, which has the shortest length from the origin.<br />

Affine Grassmann manifold<br />

I define an affine Grassmann manifold analogous to the linear Grassmann manifold:

Definition 6.2. The affine Grassmann manifold AG(m, D) is the set of m-dimensional affine subspaces of R^D.

The set of all m-dimensional affine subspaces in R^D is a smooth non-compact manifold that can be defined as the following quotient space, similarly to the Grassmann manifold:

AG(m, D) = E(D) / E(m) × O(D − m),    (6.18)

where E(·) denotes the Euclidean group. To see this, let

X = [ Y   Y⊥   u ]
    [ 0   0    1 ]

be the homogeneous space representation of aspan(Y, u) in E(D).


Then the only subgroup of E(D) that leaves aspan(Y, u) unchanged by right-multiplication is the set of matrices of the form

[ Rm    0       v ]
[ 0     RD−m    0 ]  ∈ E(m) × O(D − m),
[ 0     0       1 ]

where Rm and RD−m are any two matrices in O(m) and O(D − m) respectively, and v ∈ R^m is any vector.

To check this, note that

X [ Rm    0       v ]     [ Y   Y⊥   u ] [ Rm    0       v ]     [ Y Rm    Y⊥ RD−m    Y v + u ]
  [ 0     RD−m    0 ]  =  [ 0   0    1 ] [ 0     RD−m    0 ]  =  [ 0       0           1       ]
  [ 0     0       1 ]                    [ 0     0       1 ]

has aspan(Y Rm, Y v + u), which is the same as aspan(Y, u) by Definition 6.1.

Affine Grassmann kernel<br />

Similarly to the definition of Grassmann kernels in Definition 5.1, we can now define the<br />

affine Grassmann kernel as follows.<br />

Definition 6.3. Let k : (R D×m × R D ) × (R D×m × R D ) → R be a real valued symmetric<br />

function k(Y1, u1, Y2, u2) = k(Y2, u2, Y1, u1). The function k is an affine Grassmann kernel<br />

if it is 1) positive definite and 2) invariant to different representations:<br />



k(Y1, u1, Y2, u2) = k(Y3, u3, Y4, u4), ∀Y1, Y2, Y3, Y4, u1, u2, u3, u4<br />

if aspan(Y1, u1) = aspan(Y3, u3) and aspan(Y2, u2) = aspan(Y4, u4).<br />

With this definition we can check if the KL distance in the limit suggests an affine<br />

Grassmann kernel.<br />

KL distance in the limit<br />

The KL distance only with the homogeneity condition Y1′Y1 = Y2′Y2 = Im becomes

JKL(p1, p2) = 1/(2σ²(σ² + 1)) (2m − 2 tr(Y1′Y2Y2′Y1))
            + 1/(2σ²(σ² + 1)) (u1 − u2)′ [ 2(σ² + 1)ID − Y1Y1′ − Y2Y2′ ] (u1 − u2)

            → 1/(2σ²) [ 2m − 2 tr(Y1′Y2Y2′Y1) + (u1 − u2)′ (2ID − Y1Y1′ − Y2Y2′) (u1 − u2) ].

If the multiplicative factor is ignored, the first term is the same as the zero-mean case, which I denote as the 'linear' kernel

kLin(Y1, Y2) = tr(Y1Y1′Y2Y2′) = kProj(Y1, Y2).

The second term

ku(Y1, u1, Y2, u2) = u1′(2ID − Y1Y1′ − Y2Y2′)u2

measures the similarity of the means scaled by the matrix 2I − Y1Y1′ − Y2Y2′. However, this term does not satisfy the affine invariance condition of Definition 6.3. Note that the term



ku can be expressed as<br />

ku(Y1, u1, Y2, u2) = û1 ′ u2 + û2 ′ u1,<br />

with the standard offset notation. From this observation, I propose the following modifica-<br />

tion:<br />

Theorem 6.4. ku(Y1, u1, Y2, u2) = u1′(I − Y1Y1′)(I − Y2Y2′)u2 = û1′û2 is an affine Grassmann kernel.

Proof. 1. Invariance: if aspan(Y1, u1) = aspan(Y3, u3) and aspan(Y2, u2) = aspan(Y4, u4), then Y1Y1′ = Y3Y3′, Y2Y2′ = Y4Y4′, Y1⊥(Y1⊥)′u1 = Y3⊥(Y3⊥)′u3, and Y2⊥(Y2⊥)′u2 = Y4⊥(Y4⊥)′u4, and therefore

k(Y1, u1, Y2, u2) = u1′(I − Y1Y1′)(I − Y2Y2′)u2 = u3′(I − Y3Y3′)(I − Y4Y4′)u4 = k(Y3, u3, Y4, u4).

2. Positive definiteness:

Σ_{i,j} ci cj ui′(I − YiYi′)(I − YjYj′)uj = ‖ Σ_i ci (I − YiYi′)ui ‖² ≥ 0.

Combined with the linear term kLin, the modified term defines the new ‘affine’ kernel:<br />

kAff(Y1, u1, Y2, u2) = tr(Y1Y1′Y2Y2′) + u1′(I − Y1Y1′)(I − Y2Y2′)u2.    (6.19)
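As a sanity check of Theorem 6.4, the invariance of kAff in (6.19) can be verified numerically. This is a minimal sketch with my own variable names; Y1 and Y2 are orthonormal bases:

```python
import numpy as np

def k_aff(Y1, u1, Y2, u2):
    # kAff of Eq. (6.19) for orthonormal bases Y1, Y2.
    D = Y1.shape[0]
    P1 = np.eye(D) - Y1 @ Y1.T
    P2 = np.eye(D) - Y2 @ Y2.T
    return np.trace(Y1 @ Y1.T @ Y2 @ Y2.T) + u1 @ P1 @ P2 @ u2

rng = np.random.default_rng(1)
D, m = 6, 2
Y1, _ = np.linalg.qr(rng.standard_normal((D, m))); u1 = rng.standard_normal(D)
Y2, _ = np.linalg.qr(rng.standard_normal((D, m))); u2 = rng.standard_normal(D)

# Reparameterize each affine span: rotate the basis, shift the offset within the span.
R1, _ = np.linalg.qr(rng.standard_normal((m, m))); Y3, u3 = Y1 @ R1, u1 + Y1 @ rng.standard_normal(m)
R2, _ = np.linalg.qr(rng.standard_normal((m, m))); Y4, u4 = Y2 @ R2, u2 + Y2 @ rng.standard_normal(m)

print(np.isclose(k_aff(Y1, u1, Y2, u2), k_aff(Y3, u3, Y4, u4)))   # True, as the theorem claims
```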



General construction<br />

The limit of the KL distance with nonzero means has two terms: u-related and Y -related.<br />

This suggests a general construction rule for affine kernels. That is, if one has two separate<br />

positive definite kernels for means and for subspaces, one can add or multiply them together to construct new kernels. The various ways of generating new kernels from known kernels in

Theorem 2.7 can be used to create a novel kernel for affine subspaces.<br />

Basri’s embedding<br />

There are alternatives for representing affine spans. One representation proposed by Basri<br />

et al. [3] is to use the pair (Y ⊥ , t) instead of (Y, u) where t is related to u by u = Y ⊥ t.<br />

The authors embed an affine subspace into a Euclidean space of dimension (D + 1)² by the following injective map:

Ψ : aspan(Y, u) → R^((D+1)×(D+1)),    (Y, u) ↦ [ Y⊥(Y⊥)′    Y⊥t ]
                                               [ t′(Y⊥)′    t′t ] .    (6.20)

This embedding is a direct analogue of the isometric embedding of linear subspaces into projection matrices in Theorem 5.2.

The authors did not mention or use kernel methods in the paper. However, their pro-<br />

posed embedding has the natural corresponding kernel:<br />

k(Y1⊥, t1, Y2⊥, t2) = tr [ ( Y1⊥(Y1⊥)′   Y1⊥t1 ) ( Y2⊥(Y2⊥)′   Y2⊥t2 ) ]
                          ( (Y1⊥t1)′     t1′t1 ) ( (Y2⊥t2)′     t2′t2 )

                    = tr [ Y1⊥(Y1⊥)′Y2⊥(Y2⊥)′ + Y1⊥t1(Y2⊥t2)′              ···
                           ···                              (Y1⊥t1)′Y2⊥t2 + t1′t1 t2′t2 ]

                    = tr( Y1⊥(Y1⊥)′Y2⊥(Y2⊥)′ ) + 2 t1′(Y1⊥)′Y2⊥t2 + t1′t1 t2′t2.
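A small NumPy sketch of this embedding and the induced kernel (the helper names are mine; Y⊥ is computed from a complete QR factorization, and t is the coordinate of the offset in the Y⊥ basis, u = Y⊥t):

```python
import numpy as np

def perp_basis(Y):
    # An orthonormal basis of span(Y)^perp from a complete QR factorization.
    D, m = Y.shape
    Q, _ = np.linalg.qr(Y, mode='complete')
    return Q[:, m:]

def basri_embed(Y, u):
    # The (D+1) x (D+1) matrix of Eq. (6.20).
    Yp = perp_basis(Y)
    t = Yp.T @ u
    top = np.hstack([Yp @ Yp.T, (Yp @ t)[:, None]])
    bottom = np.hstack([(Yp @ t)[None, :], np.array([[t @ t]])])
    return np.vstack([top, bottom]), Yp, t

def k_basri(Y1, u1, Y2, u2):
    M1, Yp1, t1 = basri_embed(Y1, u1)
    M2, Yp2, t2 = basri_embed(Y2, u2)
    closed = (np.trace(Yp1 @ Yp1.T @ Yp2 @ Yp2.T)
              + 2 * (t1 @ Yp1.T @ Yp2 @ t2) + (t1 @ t1) * (t2 @ t2))
    assert np.isclose(np.trace(M1 @ M2), closed)   # matches the closed form derived above
    return closed
```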


Figure 6.4: Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as two representations of linear subspaces (Left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (Right).

Although this is another valid kernel, it does not admit a probabilistic interpretation. Furthermore, their representation requires D × (D − m) matrices Yi⊥, which is more costly in storage and computation than the representation used in this thesis, since typically m ≪ D.

6.3.3 Extension to scaled subspaces<br />

So far I have assumed that the subspaces are homogeneous, Y′Y = Im, that is, there is no preferred direction within the subspace. However, even if two subspaces have the same linear or affine span, one can further distinguish the two subspaces by allowing scales or orientations within the subspaces, as illustrated in Figure 6.4.

With the relaxation of Y to any D × m full-rank matrix, the term subspace is no longer applicable in a strict sense. Nevertheless, let's refer to the relaxation as a 'scaled' subspace and the corresponding kernel as the 'scaled' Grassmann kernel, in conformity with the previous usages in this thesis.

A scaled subspace has the same Euclidean representation (Y, u) as before but has a<br />

different equivalence relation. Let Y1 and Y2 be any D × m full-rank matrices, and u1 and<br />

u2 be offsets. The equivalence is then defined by

(Y1, u1) ≅ (Y2, u2) if and only if Y1Y1′ = Y2Y2′ and u1 = u2.    (6.21)

The scaled subspace is in one-to-one correspondence with the Cartesian product MD,m×<br />

R D , where MD,m is the set of D × D symmetric positive semidefinite matrices of rank m,<br />

via the embedding<br />

Ψ : (Y, u) ↦→ [Y Y ′ | u] ∈ R D×(D+1) .<br />

However, the topology and the metric from this embedding do not have a probabilistic motivation, similarly to Basri's embedding (6.20). I instead examine the limit of the KL distance and construct a positive definite kernel with the invariance condition (6.21).

KL distance in the limit<br />

To incorporate these scales into affine subspaces, I allow the product Y Y′ to be a non-identity matrix and make sure that the definition of the kernel is valid and consistent.

Let Yi be full-rank but not necessarily orthonormal. In this case the KL distance becomes

JKL(p1, p2) → 1/(2σ²) tr(Y1Y1′ + Y2Y2′ − Ỹ1Ỹ1′Y2Y2′ − Ỹ2Ỹ2′Y1Y1′)
            + 1/(2σ²) (u1 − u2)′ [ 2ID − Ỹ1Ỹ1′ − Ỹ2Ỹ2′ ] (u1 − u2),

where Ỹi denotes the orthonormalization of Yi,

Ỹi = lim_{σ→0} Ŷi = Yi(Yi′Yi)^(−1/2).


Ignoring the multiplicative factors, we can see that the corresponding form

k = (1/2) tr(Ỹ1Ỹ1′Y2Y2′ + Y1Y1′Ỹ2Ỹ2′) + u1′(2I − Ỹ1Ỹ1′ − Ỹ2Ỹ2′)u2

is again not a well-defined kernel.

The first term

(1/2) tr(Ỹ1Ỹ1′Y2Y2′ + Y1Y1′Ỹ2Ỹ2′)

is not positive definite, and there are several ways to remedy it:

• Fully unnormalized: k(Y1, Y2) = tr(Y1Y1′Y2Y2′),

• Partially normalized: k(Y1, Y2) = tr(Y1Ỹ1′Ỹ2Y2′) = tr(Ỹ1Y1′Ỹ2Y2′) = tr(Ỹ1Y1′Y2Ỹ2′),

• Fully normalized: k(Y1, Y2) = tr(Ỹ1Ỹ1′Ỹ2Ỹ2′).

I use the partially normalized form since it scales in the same way as the original form under a global scaling factor multiplied to the Y's (in preliminary experiments, however, these kernels showed similar results).

The second term

u1′(2I − Ỹ1Ỹ1′ − Ỹ2Ỹ2′)u2

is the same as in the affine case, and we also have several ways to make it well-defined and positive definite:

• Affine invariant: ku(Y1, u1, Y2, u2) = u1′(I − Ỹ1Ỹ1′)(I − Ỹ2Ỹ2′)u2 = û1′û2,

• Direct inner product: ku(Y1, u1, Y2, u2) = u1′u2.

Since the affine invariance condition is irrelevant for scaled subspaces (Figure 6.3), the<br />

direct inner product form is a better choice.<br />

Finally, the sum of the two modified terms is the ‘affine scaled’ kernel I propose:<br />



Theorem 6.5. kAffSc(Y1, u1, Y2, u2) = tr(Y1Ỹ1′Ỹ2Y2′) + u1′u2    (6.22)
is a positive definite kernel for scaled subspaces.

Proof. The term u1′u2 is obviously well-defined and positive definite, so let's look at only the first term.

1. Well-defined: let's show that if Y1Y1′ = Y3Y3′ then Y1Ỹ1′ = Y3Ỹ3′. Take squares of the second equation to see that

(Y1Ỹ1′)² = Y1(Y1′Y1)^(−1/2)Y1′Y1(Y1′Y1)^(−1/2)Y1′ = Y1Y1′ = Y3Y3′ = Y3(Y3′Y3)^(−1/2)Y3′Y3(Y3′Y3)^(−1/2)Y3′ = (Y3Ỹ3′)².

Since Y1Ỹ1′ and Y3Ỹ3′ are both symmetric positive semidefinite matrices, the equality of the squares implies Y1Ỹ1′ = Y3Ỹ3′. By the same argument, if Y2Y2′ = Y4Y4′ then Y2Ỹ2′ = Y4Ỹ4′.

2. Positive definiteness:

Σ_{i,j} ci cj tr(YiỸi′ỸjYj′) = tr( (Σ_i ci YiỸi′)(Σ_j cj ỸjYj′) ) = ‖ Σ_i ci YiỸi′ ‖²_F ≥ 0.


Summary of the extended kernels<br />

The proposed kernels are summarized below. Let Yi be a full-rank D × m matrix, and let Ỹi = Yi(Yi′Yi)^(−1/2) be the orthonormalization of Yi as before.

kProj(Y1, Y2) = kLin(Y1, Y2) = tr(Ỹ1′Ỹ2Ỹ2′Ỹ1)

kAff(Y1, u1, Y2, u2) = tr(Ỹ1′Ỹ2Ỹ2′Ỹ1) + u1′(I − Ỹ1Ỹ1′)(I − Ỹ2Ỹ2′)u2

kAffSc(Y1, u1, Y2, u2) = tr(Y1Ỹ1′Ỹ2Y2′) + u1′u2.    (6.23)

I also spherize the kernels,

k̃(Y1, u1, Y2, u2) = k(Y1, u1, Y2, u2) k(Y1, u1, Y1, u1)^(−1/2) k(Y2, u2, Y2, u2)^(−1/2),

so that k̃(Y1, u1, Y1, u1) = 1 for any Y1 and u1.

There is a caveat in implementing these kernels. Although I use the same notations Y and Ỹ for both linear and affine kernels, they are different in computation. For linear kernels the Y and Ỹ are computed from the data assuming u = 0, whereas for affine kernels the Y and Ỹ are computed after removing the estimated mean u from the data.
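To make this concrete, here is a minimal NumPy sketch of the kernels in (6.23) and the spherization step (function names are my own; the orthonormalization uses an eigendecomposition of Y′Y, and for the affine kernels the inputs Y are assumed to be estimated from mean-removed data, as the caveat above requires):

```python
import numpy as np

def orthonormalize(Y):
    # Ytilde = Y (Y'Y)^(-1/2), the orthonormalization used in Eq. (6.23).
    w, V = np.linalg.eigh(Y.T @ Y)
    return Y @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)

def k_lin(Y1, Y2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Yt1.T @ Yt2 @ Yt2.T @ Yt1)

def k_aff(Y1, u1, Y2, u2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    D = Y1.shape[0]
    P1, P2 = np.eye(D) - Yt1 @ Yt1.T, np.eye(D) - Yt2 @ Yt2.T
    return k_lin(Y1, Y2) + u1 @ P1 @ P2 @ u2

def k_affsc(Y1, u1, Y2, u2):
    Yt1, Yt2 = orthonormalize(Y1), orthonormalize(Y2)
    return np.trace(Y1 @ Yt1.T @ Yt2 @ Y2.T) + u1 @ u2

def spherize(k, args1, args2):
    # k_tilde(x1, x2) = k(x1, x2) / sqrt(k(x1, x1) k(x2, x2)), so that k_tilde(x, x) = 1.
    return k(*args1, *args2) / np.sqrt(k(*args1, *args1) * k(*args2, *args2))
```

For example, spherize(k_affsc, (Y1, u1), (Y2, u2)) gives the normalized affine scaled kernel, and spherize(k_lin, (Y1,), (Y2,)) the normalized linear kernel.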

6.3.4 Extension to nonlinear subspaces<br />

A systematic way of extending the Projection kernel from linear/affine subspaces to non-<br />

linear spaces is to use an implicit map via a kernel function as explained in Section 5.2.4,<br />

where the latter kernel is to be distinguished from the Grassmann kernels. Note that the pro-<br />

posed kernels (6.23) can be computed only from the inner products of the column vectors<br />

of Y ’s and u’s including the orthonormalization procedure. This ‘doubly kernel’ approach<br />

has already been proposed for the Binet-Cauchy kernel [90, 46] and for probabilistic dis-<br />

tances in general [96]. We can adopt the trick for the extended Projection kernels as well to<br />



extend the kernels to operate on nonlinear subspaces, which are the preimages corresponding to the linear subspaces in the RKHS via the feature map.
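A rough sketch of this idea for kLin, under the simplifying assumption that the feature vectors of each set themselves span the subspace in the RKHS (in practice one would take a kernel-PCA basis): with Ỹi = Φi(Φi′Φi)^(−1/2), the kernel reduces to Gram matrices only, kLin = ‖G11^(−1/2) G12 G22^(−1/2)‖²_F.

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    # An example base kernel kappa; any positive definite kernel can be substituted.
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-gamma * d2)

def inv_sqrt(G, eps=1e-10):
    w, V = np.linalg.eigh(G)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def k_lin_rkhs(X1, X2, kappa=rbf):
    # kLin between the feature-space spans of the two point sets, from Gram matrices alone.
    G11, G22, G12 = kappa(X1, X1), kappa(X2, X2), kappa(X1, X2)
    A = inv_sqrt(G11) @ G12 @ inv_sqrt(G22)      # = Ytilde1' Ytilde2 in the RKHS
    return np.sum(A * A)                          # tr(A A') = ||A||_F^2
```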

6.4 Experiments with synthetic data<br />

In this section I demonstrate the application of the extended Grassmann kernels to a two-<br />

class classification problem with Support Vector Machines (SVMs). Using synthetic data<br />

generated from an MFA distribution, I will compare the classification performance of lin-

ear/nonlinear SVMs in the original space with the performance of the SVM in the Grass-<br />

mann space.<br />

6.4.1 Synthetic data<br />

The kernels in equation 6.23 are defined under different assumptions of data distribution.<br />

To test the kernels I generate three types of synthetic data corresponding to the assumptions:<br />

(1) linear homogeneous MFA, (2) affine homogeneous MFA, and (3) affine scaled MFA.<br />

Selecting MFA<br />

For each type of data, I generate N = 100 Factor Analyzers in D = 5 dimensional Euclidean space. The i-th Factor Analyzer has the distribution pi(x) = N(ui, Ci), where the covariance is Ci = ỸiΛiỸi′ + σ²ID. The 5 × 2 orthonormal matrices Ỹi are randomly chosen from the uniform distribution on G(m, D). Refer to [1] for the definition of a uniform distribution on G(m, D). The ambient noise is chosen at σ = 0.1.

For type 2 and type 3 datasets, I generate the nonzero mean ui randomly from ui ∼ N(0, r²ID) for each Factor Analyzer. The r controls the spread of the Factor Analyzers. For the type 3 dataset, the covariance is additionally scaled via Ci = ỸiΛiỸi′ + σ²ID, where the elements of Λi = diag(λ1, ..., λm) are chosen i.i.d. from the uniform distribution on [0, 1].

The parameters for the datasets are summarized below:<br />

• Dataset 1: zero-mean (r = 0), homogeneous (λ1 = · · · = λm = 1)<br />

• Dataset 2: nonzero-mean (r = 0.2), homogeneous (λ1 = · · · = λm = 1)<br />

• Dataset 3: nonzero-mean (r = 0.2), scaled (0 ≤ λ ≤ 1)<br />
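A minimal NumPy sketch of this sampling procedure (function and parameter names are mine; the defaults follow the text: N = 100, D = 5, m = 2, σ = 0.1, r = 0.2, and n = 50 points per Factor Analyzer as described in Section 6.4.2):

```python
import numpy as np

def sample_mfa_sets(dataset=1, N=100, D=5, m=2, n=50, sigma=0.1, r=0.2, seed=0):
    # dataset 1: zero-mean homogeneous, 2: nonzero-mean homogeneous, 3: nonzero-mean scaled.
    rng = np.random.default_rng(seed)
    sets, params = [], []
    for _ in range(N):
        # The column space of a Gaussian matrix is uniformly distributed on G(m, D).
        Y, _ = np.linalg.qr(rng.standard_normal((D, m)))
        u = rng.normal(0.0, r, size=D) if dataset >= 2 else np.zeros(D)
        lam = rng.uniform(0.0, 1.0, size=m) if dataset == 3 else np.ones(m)
        C = Y @ np.diag(lam) @ Y.T + sigma**2 * np.eye(D)
        sets.append(rng.multivariate_normal(u, C, size=n))
        params.append((u, C))
    return sets, params
```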

Assigning class labels<br />

So far the distribution is chosen without classes. Since I am treating each distribution as a<br />

point in the space of distributions, the class label is assigned per distribution. The binary<br />

class labels are assigned as follows. I first choose a pair of distributions p+ and p− which

are the farthest apart from each other among all pairs of distributions. These p+ and p−<br />

serve as the two extreme points representing the positive and the negative distributions re-<br />

spectively. The labels of the other distributions are assigned subsequently from comparing<br />

their distances to the two extreme distributions:<br />

yi = 1 if JKL(pi, p+) < JKL(pi, p−), and yi = −1 otherwise.

The distances are measured by the KL distance JKL of the distributions. Empirically the<br />

number of positive distributions and the number of negative distributions were roughly<br />

balanced.<br />
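A sketch of this labeling step, using the closed-form KL divergence between Gaussians to build a symmetrized JKL (the exact normalization used in the thesis may differ, but constant factors do not change the labels; the (u, C) pairs are those returned by the sampling sketch above):

```python
import numpy as np
from itertools import combinations

def kl_gauss(u1, C1, u2, C2):
    # KL( N(u1, C1) || N(u2, C2) )
    D = len(u1)
    C2_inv = np.linalg.inv(C2)
    d = u2 - u1
    logdet = np.linalg.slogdet(C2)[1] - np.linalg.slogdet(C1)[1]
    return 0.5 * (np.trace(C2_inv @ C1) + d @ C2_inv @ d - D + logdet)

def j_kl(p1, p2):
    return kl_gauss(*p1, *p2) + kl_gauss(*p2, *p1)     # symmetrized KL distance

def assign_labels(params):
    # The two farthest-apart distributions serve as the positive/negative extremes ...
    i_plus, i_minus = max(combinations(range(len(params)), 2),
                          key=lambda ij: j_kl(params[ij[0]], params[ij[1]]))
    p_plus, p_minus = params[i_plus], params[i_minus]
    # ... and every distribution is labeled by the closer extreme.
    return np.array([1 if j_kl(p, p_plus) < j_kl(p, p_minus) else -1 for p in params])
```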

6.4.2 Algorithms<br />

I compare the performance of the Euclidean SVM with linear/polynomial/Gaussian RBF<br />

kernels and the performance of SVM with Grassmann kernels, similarly to the comparison<br />



in Section 5.3. To test the original SVMs, I randomly sample n = 50 points from each<br />

Factor Analyzer pi(x).<br />

I evaluate the algorithm with N-fold cross validation by holding out one set and training<br />

with the other N − 1 sets. The polynomial kernel used is k(x1, x2) = (⟨x1, x2⟩ + 1)³, and the RBF kernel used is k(x1, x2) = exp(−‖x1 − x2‖²/(2r²)), where the radius r is chosen to be one-fifth of the diameter of the data: r = 0.2 max_{ij} ‖xi − xj‖. For training the SVMs,

I use the public-domain software SVM-light [42] with default parameters.<br />
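For reference, the two Euclidean kernels and the radius heuristic look as follows (a sketch only; in the experiments SVM-light evaluates its own kernels, and the function names here are mine):

```python
import numpy as np

def poly_kernel(X1, X2):
    return (X1 @ X2.T + 1.0)**3

def rbf_kernel(X1, X2, X_all):
    # r = 0.2 * (diameter of the data), i.e. one-fifth of the largest pairwise distance.
    d2_all = ((X_all[:, None, :] - X_all[None, :, :])**2).sum(-1)
    r = 0.2 * np.sqrt(d2_all.max())
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * r**2))
```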

To test the Grassmann SVM, I first estimate the mean ui and the basis Yi from the same points used for the Euclidean SVM, although I could have improved the results by using the true parameters instead of the estimated ones. The Maximum Likelihood estimates of Yi, ui, and σ are given by the probabilistic PCA model [76] as follows. Let µ and S be the sample mean and covariance of the i-th set,

µ = (1/Ni) Σ_{j=1}^{Ni} xj,    S = (1/(Ni − 1)) Σ_{j=1}^{Ni} (xj − µ)(xj − µ)′.

Let

S = UΛU′

be the eigen-decomposition of the covariance matrix S, where U = [u1 · · · uD] contains the eigenvectors corresponding to the eigenvalues λ1 ≥ ... ≥ λD, and Λ = diag(λ1, ..., λD) is the diagonal matrix of eigenvalues. Then Yi and σ are estimated from

σ² = (1/(D − m)) Σ_{j=m+1}^{D} λj    (6.24)

Yi = Um(Λm − σ²I)^(1/2),    (6.25)

where Um is the first m columns of U and Λm is the m × m principal submatrix of Λ. The σ is estimated individually for each set of data, and I use the averaged value for all sets. An iterative and more accurate method of estimation is to use an EM approach [27], although it is not used here.
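A minimal NumPy sketch of this estimation step for one set X of size Ni × D (the function name and the use of numpy.linalg.eigh are my own choices):

```python
import numpy as np

def ppca_subspace(X, m):
    # ML estimates (u, Y, sigma^2) of the probabilistic PCA model, Eqs. (6.24)-(6.25).
    u = X.mean(axis=0)                              # estimated offset u_i
    S = np.cov(X, rowvar=False)                     # unbiased sample covariance
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                  # sort eigenvalues in decreasing order
    D = X.shape[1]
    sigma2 = lam[m:].mean()                         # Eq. (6.24)
    Y = U[:, :m] * np.sqrt(np.maximum(lam[:m] - sigma2, 0.0))   # Eq. (6.25)
    return u, Y, sigma2
```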

Table 6.1: Classification rates of the Euclidean SVMs and the Grassmann SVMs. The best rate for each dataset is highlighted by boldface.

            Euclidean                Grassmann               Probabilistic
            Linear   Poly    RBF     Lin     Aff     AffSc   BC      Bhat
Dataset 1   52.86    62.38   66.33   87.00   86.80   87.70   82.50   84.30
Dataset 2   62.30    64.45   65.74   76.90   82.00   83.10   70.90   72.50
Dataset 3   62.76    64.73   69.47   69.50   73.70   84.40   65.10   77.30

The σ is used for the Bhattacharyya kernel which requires nonzero ambient noise σ > 0<br />

in the covariance Ci = YiYi′ + σ²I.

The following five kernels are compared:

1. SVM with the original and the extended Projection kernels: kLin, kAff, kAffSc<br />

2. SVM with the Binet-Cauchy kernel: kBC(Y1, Y2) = (det Y1′Y2)² = det Y1′Y2Y2′Y1

3. SVM with the Bhattacharyya kernel: kBhat(p1, p2) = ∫ [p1(x) p2(x)]^(1/2) dx for Factor Analyzers.

I evaluate the algorithms with the leave-one-out test by holding out one subspace and train-<br />

ing with the other N − 1 subspaces. For training the SVMs, I use a Matlab code with a<br />

nonnegative QP solver.<br />
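A sketch of this evaluation loop with a generic precomputed-kernel SVM (the thesis uses SVM-light and a Matlab QP solver; scikit-learn's SVC is substituted here purely for illustration, and kernel_fn is any of the kernels above applied to (Y, u) pairs):

```python
import numpy as np
from sklearn.svm import SVC

def grassmann_svm_loo(kernel_fn, subspaces, labels):
    # subspaces: list of (Y_i, u_i); labels: array of +1/-1.
    N = len(subspaces)
    K = np.array([[kernel_fn(a, b) for b in subspaces] for a in subspaces])
    correct = 0
    for i in range(N):                               # leave-one-out over subspaces
        tr = [j for j in range(N) if j != i]
        clf = SVC(kernel="precomputed").fit(K[np.ix_(tr, tr)], labels[tr])
        correct += int(clf.predict(K[np.ix_([i], tr)])[0] == labels[i])
    return correct / N
```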

6.4.3 Results and discussion<br />

Table 6.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs,<br />

averaged over 10 independent trials. The results show that the best rates are obtained from the affine scaled kernel, and the Euclidean kernels lag behind for all types of data. The inferiority of the Euclidean SVMs to the Grassmann SVMs can be ascribed to reasons similar to those discussed in Section 5.3.3.

For dataset 1, which has zero means, the linear SVMs degrade to the chance level (50%). The result agrees with the intuitive picture that any decision hyperplane that passes through the origin will roughly halve the positive and the negative classes. As expected, the linear kernel is inappropriate for dataset 2, which has nonzero offsets, whereas the affine and the affine scaled kernels perform well for both datasets 1 and 2. However, only the affine scaled kernel performs well for dataset 3. The Binet-Cauchy and the Bhattacharyya kernels perform close to the Projection kernels for dataset 1, but underperform for datasets 2 and 3. This result is expected since the Binet-Cauchy kernel does not take offsets or scales into consideration, and since the Bhattacharyya kernel is not adequate for MFA data, as I showed in the previous sections.

I conclude that the extended kernels have advantages over the original kernels and<br />

the Euclidean kernels for subspace-based classification problems when the data consist<br />

of affine and scaled subspaces instead of simple linear subspaces.<br />

6.5 Experiments with real-world data<br />

In this section I demonstrate the application of the extended Grassmann kernels to recog-<br />

nition problems with a kernel FDA. Using real image databases I compare the classification performance of the extended kernels and other previously used kernels.

6.5.1 Algorithms<br />

A baseline algorithm and the kernel FDA with the following kernels are compared:

1. Baseline: Euclidean FDA

2. KFDA with the original and the extended Projection kernels: kLin, kAff, kAffSc

3. KFDA with the Binet-Cauchy kernel

4. KFDA with the Bhattacharyya kernel

The subspace parameters are estimated from the data similarly to the experiments with<br />

synthetic data. I evaluate the algorithms with a leave-one-out test by holding out one sub-

space and training with the other N − 1 subspaces.<br />
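For reference, a minimal two-class kernel FDA in the style of Mika et al. [56], written for a precomputed kernel matrix (the experiments in this chapter use a multi-class variant; the regularizer mu and the helper names are my own choices):

```python
import numpy as np

def kfda_two_class(K, y, mu=1e-3):
    # K: n x n kernel matrix over training subspaces; y: binary labels in {0, 1}.
    n = K.shape[0]
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    m0, m1 = K[:, idx0].mean(axis=1), K[:, idx1].mean(axis=1)
    Nmat = np.zeros((n, n))
    for idx in (idx0, idx1):                       # within-class scatter in the RKHS
        Kc = K[:, idx]
        center = np.eye(len(idx)) - np.ones((len(idx), len(idx))) / len(idx)
        Nmat += Kc @ center @ Kc.T
    alpha = np.linalg.solve(Nmat + mu * np.eye(n), m1 - m0)
    threshold = 0.5 * (alpha @ (m0 + m1))
    # A test example with kernel column k_x is assigned class 1 if alpha @ k_x > threshold.
    return alpha, threshold
```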

6.5.2 Results and discussion<br />

The recognition rates for Yale Face, CMU-PIE, ETH-80, and IXMAS databases are given<br />

in Figures 6.5–6.8. I summarize the results as follows:<br />

1. The original and the extended Grassmann kernels outperform the Binet-Cauchy and<br />

the Bhattacharyya kernels, as well as the baseline method. The superiority of the<br />

Projection kernel to the Binet-Cauchy kernel and the Euclidean method is already<br />

demonstrated in Chapter 5.<br />

2. The Bhattacharyya kernel performs quite poorly, and becomes worse as the subspace<br />

dimension increases. One can verify that the kernel matrix from the data is close to<br />

an identity matrix and therefore carries little information about the data. A similar<br />

observation for the Binet-Cauchy kernel was already made in Chapter 5.<br />

3. In the Yale Face and ETH-80 databases the affine scaled kernel and the affine kernel achieve the best rates, respectively (see Figures 6.5 and 6.7). In the CMU-PIE and IXMAS databases the rates of the affine kernel follow the rates of the linear kernel closely. The rates of the affine scaled kernel fall behind those two, but the differences are small compared to the rates achieved by the rest of the methods. Compared with the experimental result from the synthetic data, the result with the real data does not conclusively show either the advantage or the disadvantage of using the extended kernels. Since the extended kernels generalize the original Projection kernel, the comparable performances can be interpreted as the linear subspace assumption being valid for the real image databases I used.

6.6 Conclusion<br />

In this chapter, I showed the relationship between probabilistic distances and the Projection<br />

kernel using a probabilistic model of subspaces. This analysis provides generalizations of<br />

the Projection kernel to affine and scaled subspaces. The relaxation of the linear subspace assumption allows us to accommodate more complex data structures which diverge from the ideal linear subspace assumption.

As demonstrated with the synthetic data, the mean and the scales within subspaces may carry important statistics of the data and can make a large difference in classification tasks. However, whether this information is useful for the real databases was not conclusive from the experiments, since the difference between the original and the extended kernels was small. Nevertheless, the original and the extended kernels showed consistently better performance than the Euclidean method and the kernel methods with the Binet-Cauchy and the Bhattacharyya kernels.



Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00
Lin     96.77   95.70   98.57   99.28   98.92   98.57   97.85   96.77   97.13
Aff     94.62   96.06   98.92   98.92   98.57   97.85   96.77   97.13   97.49
AffSc   96.77   98.21   99.28   99.28   99.28   99.28   99.28   99.28   99.28
BC      96.77   95.34   96.77   96.42   83.87   72.76   55.20   48.03   44.09
Bhat    97.13   98.21   94.27   85.30   65.23   60.93   53.76   48.39   44.44

Figure 6.5: Yale Face Database: face recognition rates from various kernels. The two highest rates including ties are highlighted with boldface for each subspace dimension m.


Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    60.73   60.73   60.73   60.73   60.73   60.73   60.73   60.73   60.73
Lin     88.27   74.84   89.77   87.21   91.68   92.54   93.82   93.60   95.31
Aff     72.28   86.99   85.50   91.26   91.26   92.32   92.54   94.46   94.67
AffSc   83.16   85.29   85.29   85.29   85.29   85.29   85.29   85.29   85.29
BC      88.27   71.43   82.52   64.82   58.64   47.55   43.07   39.87   36.25
Bhat    83.37   44.78   39.45   36.25   31.98   28.78   26.23   23.88   20.47

Figure 6.6: CMU-PIE Database: face recognition rates from various kernels.


Categorization rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9
Eucl    85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00   85.00
Lin     88.75   88.75   93.75   96.25   96.25   95.00   93.75   96.25   96.25
Aff     88.75   92.50   96.25   96.25   95.00   93.75   96.25   96.25   97.50
AffSc   91.25   90.00   91.25   91.25   91.25   91.25   92.50   91.25   91.25
BC      88.75   86.25   90.00   81.25   72.50   63.75   51.25   41.25   48.75
Bhat    91.25   90.00   88.75   87.50   86.25   83.75   81.25   81.25   80.00

Figure 6.7: ETH-80 Database: object categorization rates from various kernels.


Recognition rate (%) vs. subspace dimension m:

        m=1     m=2     m=3     m=4     m=5
Eucl    54.87   54.87   54.87   54.87   54.87
Lin     69.09   80.30   84.55   84.24   85.15
Aff     73.94   81.52   82.73   82.42   80.91
AffSc   72.73   78.48   80.30   80.30   80.30
BC      69.09   60.00   50.91   36.36   25.15
Bhat    64.24   59.39   51.52   42.12   23.64

Figure 6.8: IXMAS Database: action recognition rates from various kernels.


Chapter 7<br />

CONCLUSION<br />

In this chapter I summarize the work presented in this thesis and discuss the future work<br />

related to the proposed methods.<br />

7.1 Summary<br />

In this thesis I proposed the subspace-based approach for solving novel learning problems<br />

using the Grassmann kernels. Below I summarize the progress that has been made in each<br />

chapter with regard to this goal.<br />

• In Chapter 3, I proposed the paradigm of subspace-based learning which exploits in-<br />

herent linear structures in data to solve novel learning problems. The rationale behind<br />

this approach was explained and exemplified with well-known image databases.<br />

• In Chapter 4, I introduced the Grassmann manifold as a common framework for<br />

subspace-based learning, and reviewed the geometry of the space. Various distances<br />

on the Grassmann manifold were reviewed and analyzed in depth, and were com-<br />

pared to each other by classification tests.<br />



• In Chapter 5, I proposed the Projection kernel and its application to discriminant<br />

analysis for subspaces-based classification problems. In classification tests with the<br />

image database the proposed method showed better performance than other previ-<br />

ously used discrimination methods as well as the Euclidean method.<br />

• In Chapter 6, I presented formal analyses of the relationship between probabilistic<br />

distances and the Grassmann kernels. Based on the analyses, I broadened the domain<br />

of subspace-based learning from linear to affine and scaled subspaces, and presented<br />

the extended kernels. The extended kernels performed competitively with synthetic<br />

and real data and showed potentials for the extended domains.<br />

7.2 Future work<br />

In this section I discuss the research directions to which I plan to expand the present work.<br />

7.2.1 Theory<br />

In this work I utilized geometric properties of the Grassmann manifolds as a framework for<br />

subspace-based learning problems. However, there are other aspects of this manifold that I<br />

did not consider in this thesis. These are briefly reviewed below.<br />

• Riemannian aspect of Grassmann manifolds: As discussed in Chapter 4, the Grass-<br />

mann manifold can be derived as a homogeneous space of orthogonal groups [86,<br />

13]:<br />

G(m, D) = O(D)/O(m) × O(D − m).<br />

This definition also induces a Riemannian geometry of the space which plays an<br />

important role for optimization techniques on the manifold, such as Newton's method or the conjugate gradient method [20, 2]. The geometry of the Grassmann manifold



through positive definite kernels is more general than the strict Riemannian geometry<br />

from the definition above. I plan to explore the applicability of the proposed kernels<br />

for optimization problems as well.<br />

• Probabilistic aspect of Grassmann manifolds: The kernel approach to a learning<br />

problem is basically deterministic. Although I used a probabilistic model of sub-<br />

spaces in Chapter 6, the MFA model is a distribution in the usual Euclidean space.<br />

There are intrinsic definitions of probability distributions on the Grassmann mani-<br />

fold, such as the uniform distribution [1] and the matrix Langevin distribution [16].<br />

Statistical inference and estimation on the Grassmann manifold is a largely unex-<br />

plored topic with the exception of the pioneering work of Chikuse [16].<br />

• Existence of other Grassmann kernels: There is an important technical question that<br />

remains to be answered: are there more Grassmann kernels that are fundamentally<br />

different from the Projection or the Binet-Cauchy kernels? As I remarked in Chap-<br />

ter 6, an inductive approach to the discovery of new kernels is to examine the limits of<br />

other probabilistic distances that were not analyzed in this thesis. On the other hand,<br />

a deductive approach is to adapt the characterization of positive definite kernels on<br />

Hilbert spaces, n-spheres and semigroups [65, 67, 66, 9] to the case of the Stiefel<br />

and the Grassmann manifolds. In relation to the latter approach, a study involving<br />

a functional integration over orthogonal groups is currently under examination, and<br />

will be reported in the future.<br />

7.2.2 Applications<br />

The proposed Grassmann kernels are general building blocks for many applications. Those<br />

kernels can be used for an arbitrary learning task whether it is a supervised or an unsuper-<br />

vised learning problem. They can also be used with any kernel methods among which I<br />



used the SVM and the FDA for demonstrations. Moreover, the applications are not limited<br />

to image data, since a subspace is a familiar notion for any vectorial data.<br />

On the other hand, the proposed kernel methods may not outperform the state-of-the-<br />

art algorithms which are dedicated to specific tasks and utilize domain-specific knowledge.<br />

I remark below on the limitations and possible improvements of the proposed method in<br />

several application-specific aspects.

• Intensity- vs feature-based representation: In this thesis I used the intensity repre-<br />

sentation of an image and relied on the low-dimensional character of pose subspaces<br />

or illumination subspaces to address the invariant recognition problem. However, a<br />

compelling alternative for recognition is to use feature-based representation of faces<br />

and objects, such as SIFT features which are already invariant to pose or illumina-<br />

tion variations [54]. It is unknown whether we can find subspace structures with<br />

the feature-based representation of images. However, the Bhattacharyya kernel was<br />

originally demonstrated with the bag-of-feature representation [40], which suggests<br />

the application of our method to such representations.<br />

• Dynamical models for sequence data: The proposed method used the observability<br />

subspace representation of the video sequences. However, there are multiple steps

involved in processing a video sequence into a dynamical system representation, and<br />

each step relies on heuristic choices. The recognition can be improved solely with a<br />

clever preprocessing of the sequences without the use of dynamical models [89]. In<br />

relation to the dynamical system approach in general, Vishwanathan et al. proposed a variety of other kernels for dynamical systems [84] in addition to the Binet-Cauchy kernel used in the thesis. In fact the authors defined the Binet-Cauchy kernel for Fredholm operators, which are much more general in scope. The subspace model from the

observability matrix is but one approach to characterizing dynamical systems, and I<br />



am currently investigating other subspace models that emphasize different aspects of<br />

dynamical systems.<br />

• Handling unorganized data: The image databases I used in this work have factorized<br />

structures. For example, each image in the Yale Face database is labeled in terms<br />

of (person, pose, illumination). If we do not know the labels other than the person<br />

and have to estimate the subspaces from a clutter of data points, this estimation prob-<br />

lem itself becomes a separate problem that warrants research efforts [27, 37, 81].<br />

However, it is out of the scope of the proposed subspace-based framework.<br />

Furthermore, I mentioned a caveat in Chapter 3 that I assumed that the test data for<br />

image databases are not single images but subspaces. The assumption can potentially<br />

limit the applicability of the subspace-based approach in conventional problem set-<br />

tings. However, this limitation is to be understood as a tradeoff between data struc-<br />

ture flexibility and the strength of methods. If one is to use more powerful kernel<br />

methods, one needs more structured data.<br />



Bibliography<br />

[1] P. Absil, A. Edelman, and P. Koev. On the largest principal angle between random<br />

subspaces. Linear Algebra and its Applications, 414(1):288–294, 2006.<br />

[2] P. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann mani-<br />

folds with a view on algorithmic computation. Acta Applicandae Mathematicae: An<br />

International Survey Journal on Applying Mathematics and Mathematical Applica-<br />

tions, 80(2):199–220, 2004.<br />

[3] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search with<br />

applications to pattern recognition. In Proceedings of the Conference on Computer<br />

Vision and Pattern Analysis. IEEE Computer Society, June 2007.<br />

[4] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans-<br />

actions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.<br />

[5] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach.<br />

Neural Computation, 12(10):2385–2404, 2000.<br />

[6] M. Baumann and U. Helmke. Riemannian subspace tracking algorithms on Grass-<br />

mann manifolds. In Proceedings of the IEEE Conference on Decision and Control,<br />

pages 4731–4736, 12-14 Dec. 2007.<br />



[7] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recogni-<br />

tion using class specific linear projection. In Proceedings of the European Conference<br />

on Computer Vision-Volume I, pages 45–58, London, UK, 1996. Springer-Verlag.<br />

[8] P. N. Belhumeur and D. J. Kriegman. What is the set of images of an object un-<br />

der all possible illumination conditions? International Journal of Computer Vision,<br />

28(3):245–260, 1998.<br />

[9] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups.<br />

Springer, Berlin, 1984.<br />

[10] A. Bhattacharyya. On a measure of divergence between two statistical populations<br />

defined by their probability distributions. Bulletin of Calcutta Mathematical Society,<br />

Vol. 49, pages 214–224, 1943.<br />

[11] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin<br />

classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning<br />

Theory, pages 144–152, New York, NY, USA, 1992. ACM.<br />

[12] M. Bressan and J. Vitrià. Nonparametric discriminant analysis and nearest neighbor<br />

classification. Pattern Recognition Letters, 24:2743–2749, 2003.<br />

[13] R. Carter, G. Segal, and I. Macdonald. Lectures on Lie Groups and Lie Algebras, Lon-<br />

don Mathematical Society Student Texts. Cambridge University Press, Cambridge,<br />

UK, 1995.<br />

[14] J.-M. Chang, J. R. Beveridge, B. A. Draper, M. Kirby, H. Kley, and C. Peterson.<br />

Illumination face spaces are idiosyncratic. In Proceedings of the International Con-<br />

ference on Image Processing, Computer Vision, and Pattern Recognition, pages 390–<br />

396, 2006.<br />



[15] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on<br />

the sum of observations. Annals of Mathematical Statistics, pages 493–507, 1952.<br />

[16] Y. Chikuse. Statistics on special manifolds, Lecture Notes in Statistics, vol. 174.<br />

Springer-Verlag, New York, 2003.<br />

[17] K. D. Cock and B. D. Moor. Subspace angles between ARMA models. Systems<br />

Control Lett. 46 (4), pages 265–270, July 2002.<br />

[18] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines: and<br />

other kernel-based learning methods. Cambridge University Press, New York, NY,<br />

USA, 2000.<br />

[19] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International<br />

Journal of Computer Vision, 51(2):91–109, 2003.<br />

[20] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthog-<br />

onality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–<br />

353, 1999.<br />

[21] R. Epstein, P. Hallinan, and A. Yuille. 5 ± 2 Eigenimages suffice: An empirical<br />

investigation of low-dimensional lighting models. In Proceedings of IEEE Workshop<br />

on Physics-Based Modeling in Computer Vision, pages 108–116, 1995.<br />

[22] B. S. Everitt. An introduction to latent variable models. Chapman and Hall, London,<br />

1984.<br />

[23] A. Faragó, T. Linder, and G. Lugosi. Fast nearest-neighbor search in dissimilarity<br />

spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):957–<br />

962, 1993.<br />



[24] K. Fukui and O. Yamaguchi. Face recognition using multi-viewpoint patterns for<br />

robot vision. In International Symposium of Robotics Research, pages 192–201, 2003.<br />

[25] K. Fukunaga. Introduction to statistical pattern recognition (2nd ed.). Academic<br />

Press Professional, Inc., San Diego, CA, USA, 1990.<br />

[26] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illu-<br />

mination cone models for face recognition under variable lighting and pose. IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.<br />

[27] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyz-<br />

ers. Technical Report CRG-TR-96-1, Department of Computer Science, University<br />

of Toronto, 21 1996.<br />

[28] G. H. Golub and C. F. V. Loan. Matrix computations (3rd ed.). Johns Hopkins<br />

University Press, Baltimore, MD, USA, 1996.<br />

[29] R. Gross, I. Matthews, and S. Baker. Eigen light-fields and face recognition across<br />

pose. In Proceedings of the IEEE International Conference on Automatic Face and<br />

Gesture Recognition, page 3, Washington, DC, USA, 2002. IEEE Computer Society.<br />

[30] R. Gross, I. Matthews, and S. Baker. Fisher light-fields for face recognition across<br />

pose and illumination. In Proceedings of the 24th DAGM Symposium on Pattern<br />

Recognition, pages 481–489, London, UK, 2002. Springer-Verlag.<br />

[31] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE<br />

Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005.<br />

[32] P. Hallinan. A low-dimensional representation of human faces for arbitrary light-<br />

ing conditions. In Proceedings of the Conference on Computer Vision and Pattern<br />

Recognition, pages 995–999, 1994.<br />



[33] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on<br />

subspace-based learning. In Proceedings of the International Conference on Machine<br />

Learning, 2008.<br />

[34] J. Hamm and D. D. Lee. Learning a warped subspace model of faces with images<br />

of unknown pose and illumination. In International Conference on Computer Vision<br />

Theory and Applications, pages 219–226, 2008.<br />

[35] M. Hein, O. Bousquet, and B. Schölkopf. Maximal margin classification for metric<br />

spaces. Journal of Computer and System Sciences, 71(3):333–359, 2005.<br />

[36] O. Henkel. Sphere packing bounds in the Grassmann and Stiefel manifolds. IEEE<br />

Transactions on Information Theory, 51:3445, 2005.<br />

[37] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances<br />

of objects under varying illumination conditions. Proceedings of the Conference on<br />

Computer Vision and Pattern Analysis, 01:11, 2003.<br />

[38] R. A. Horn and C. A. Johnson. Matrix analysis. Cambridge University Press, Cam-<br />

bridge, UK, 1985.<br />

[39] H. Hotelling. Relations between two sets of variates. Biometrika 28, pages 321–372,<br />

1936.<br />

[40] T. Jebara and R. I. Kondor. Bhattacharyya and expected likelihood kernels. In Pro-<br />

ceeding of the Annual Conference on Learning Theory, pages 57–71, 2003.<br />

[41] T. Joachims. Text categorization with Support Vector Machines: Learning with many<br />

relevant features. In Proceedings of the European Conference on Machine Learning,<br />

pages 137 – 142, Berlin, 1998. Springer.<br />



[42] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges,<br />

and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chap-<br />

ter 11, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.<br />

[43] T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image<br />

set classes using canonical correlations. IEEE Transactions on Pattern Analysis and<br />

Machine Intelligence, 29(6):1005–1018, 2007.<br />

[44] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the<br />

characterization of human faces. IEEE Transactions on Pattern Analysis and Machine<br />

Intelligence, 12(1):103–108, 1990.<br />

[45] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures.<br />

In Proceedings of the International Conference on Machine Learning, 2002.<br />

[46] R. I. Kondor and T. Jebara. A kernel between sets of vectors. In Proceedings of the<br />

International Conference on Machine Learning, pages 361–368, 2003.<br />

[47] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathemat-<br />

ical Statistics, 22(1):79–86, 1951.<br />

[48] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object<br />

categorization. In Proceedings of the Conference on Computer Vision and Pattern<br />

Analysis, page 409, Los Alamitos, CA, USA, 2003. IEEE Computer Society.<br />

[49] C. Leslie, E. Eskin, and W. Noble. Mismatch string kernels for SVM protein classi-<br />

fication. In Advances in Neural Information Processing Systems, pages 1441–1448,<br />

2003.<br />



[50] D. Lin, S. Yan, and X. Tang. Pursuing informative projection on Grassmann manifold.<br />

In Proceedings of the Conference on Computer Vision and Pattern Analysis, pages<br />

1727–1734, Washington, DC, USA, 2006. IEEE Computer Society.<br />

[51] X. Liu, A. Srivastava, and K. Gallivan. Optimal linear representations of images for<br />

object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

26(5):662–666, 2004.<br />

[52] L. Ljung. System identification: theory for the user. Prentice-Hall, Inc., Upper Saddle<br />

River, NJ, USA, 1986.<br />

[53] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text clas-<br />

sification using string kernels. Journal of Machine Learning Research, 2:419–444,<br />

2002.<br />

[54] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International<br />

Journal of Computer Vision, 60(2):91–110, 2004.<br />

[55] R. Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing,<br />

48(4):1164–1170, Apr 2000.<br />

[56] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müller. Fisher discriminant<br />

analysis with kernels. In Y. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural<br />

Networks for Signal Processing IX, pages 41–48. IEEE, 1999.<br />

[57] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K. Müller. Construct-<br />

ing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel<br />

feature spaces. IEEE Transactions on Patterns Analysis and Machine Intelligence,<br />

25(5):623–627, May 2003.<br />



[58] K. Müller, S. Mika, G. Rätsch, S. Tsuda, and B. Schölkopf. An introduction to kernel-<br />

based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–202,<br />

2001.<br />

[59] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels.<br />

In Proceedings of the International Conference on Machine Learning, page 81, New<br />

York, NY, USA, 2004. ACM.<br />

[60] E. Pekalska, P. Paclik, and R. P. W. Duin. A generalized kernel approach to<br />

dissimilarity-based classification. Journal of Machine Learning Research, 2:175–<br />

211, 2002.<br />

[61] R. Ramamoorthi. Analytic PCA construction for theoretical analysis of lighting vari-<br />

ability in images of a Lambertian object. IEEE Transactions on Pattern Analysis and<br />

Machine Intelligence, 24(10):1322–1333, 2002.<br />

[62] R. Ramamoorthi and P. Hanrahan. On the relationship between radiance and irra-<br />

diance: determining the illumination from images of a convex Lambertian object.<br />

Journal of the Optical Society of America A, 18(10):2448–2459, 2001.<br />

[63] A. Rényi. On measures of information and entropy. In Proceedings of the 4th Berkeley<br />

Symposium on Mathematics, Statistics and Probability, pages 547–561, 1960.<br />

[64] H. Sakano and N. Mukawa. Kernel mutual subspace method for robust facial image<br />

recognition. In Proceedings of the International Conference on Knowledge-Based<br />

Intelligent Engineering Systems and Allied Technologies, volume 1, pages 245–248,<br />

2000.<br />

[65] I. J. Schoenberg. Remarks to Maurice Frechet's article ... Annals of Mathematics,<br />

36(3):724–732, 1935.<br />



[66] I. J. Schoenberg. Metric spaces and completely monotone functions. Annal of Math-<br />

ematics, 39(4):811–841, 1938.<br />

[67] I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the<br />

American Mathematical Society, 44:522–536, 1938.<br />

[68] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel<br />

eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.<br />

[69] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,<br />

Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.<br />

[70] G. Shakhnarovich, I. John W. Fisher, and T. Darrell. Face recognition from long-term<br />

observations. In Proceedings of the European Conference on Computer Vision, pages<br />

851–868, London, UK, 2002.<br />

[71] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge<br />

University Press, New York, NY, USA, 2004.<br />

[72] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression<br />

(PIE) database. IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

25(12):1615 – 1618, December 2003.<br />

[73] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of<br />

human faces. Journal of the Optical Society of America A, 4(3):519–524, 1987.<br />

[74] A. Srivastava. A Bayesian approach to geometric subspace estimation. IEEE Trans-<br />

actions on Signal Processing, 48(5):1390–1400, May 2000.<br />

[75] E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. In Proceed-<br />

ings of the Annual Workshop on Computational Learning Theory. ACM, 2002.<br />



[76] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal<br />

Of The Royal Statistical Society Series B, 61(3):611–622, 1999.<br />

[77] F. Topsøe. Some inequalities for information divergence and related measures of<br />

discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.<br />

[78] P. Turaga, A. Veeraraghavan, and R. Chellappa. Statistical analysis on Stiefel and<br />

Grassmann manifolds with applications in computer vision. In Proceedings of the<br />

Conference on Computer Vision and Pattern Analysis, 2008.<br />

[79] M. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-<br />

science, 3(1):71–86, 1991.<br />

[80] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa. Matching shape se-<br />

quences in video with applications in human movement analysis. IEEE Transactions<br />

on Pattern Analysis and Machine Intelligence, 27(12):1896–1909, 2005.<br />

[81] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA).<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945–1959,<br />

2005.<br />

[82] S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. Advances<br />

in Neural Information Processing Systems, 15, 2003.<br />

[83] S. Vishwanathan and A. J. Smola. Binet-Cauchy kernels. In Advances in Neural<br />

Information Processing Systems, 2004.<br />

[84] S. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical<br />

systems and its application to the analysis of dynamic scenes. International Journal<br />

of Computer Vision, 73(1):95–119, 2007.<br />



[85] L. Wang, X. Wang, and J. Feng. Subspace distance analysis with application to adap-<br />

tive Bayesian algorithm for face recognition. Pattern Recognition, 39(3):456–464,<br />

2006.<br />

[86] F. Warner. Foundations of differentiable manifolds and Lie groups. Springer-Verlag,<br />

New York, 1983.<br />

[87] C. Watkins. Kernels from matching operations. Technical Report CSD-TR-98-07,<br />

Department of Computer Science, Royal Holloway College, 1999.<br />

[88] C. Watkins. Dynamic alignment kernels. In A. Smola and P. Bartlett, editors, Ad-<br />

vances in Large Margin Classifiers, chapter 3, pages 39–50. MIT Press, Cambridge,<br />

MA, USA, 2000.<br />

[89] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using<br />

motion history volumes. Computer Vision and Image Understanding, 104(2):249–<br />

257, 2006.<br />

[90] L. Wolf and A. Shashua. Learning over sets using kernel principal angles. Journal of<br />

Machine Learning Research, 4:913–931, 2003.<br />

[91] Y.-C. Wong. Differential geometry of Grassmann manifolds. Proceedings of the<br />

National Academy of Science, Vol. 57, pages 589–594, 1967.<br />

[92] O. Yamaguchi, K. Fukui, and K. Maeda. Face recognition using temporal image se-<br />

quence. In Proceedings of the International Conference on Face and Gesture Recog-<br />

nition, page 318, Washington, DC, USA, 1998. IEEE Computer Society.<br />

[93] J. Ye and T. Xiong. Null space versus orthogonal linear discriminant analysis. In<br />

Proceedings of the International Conference on Machine Learning, pages 1073–1080,<br />

New York, NY, USA, 2006. ACM.<br />



[94] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur. Determining generative<br />

models of objects under varying illumination: Shape and albedo from multiple images<br />

using SVD and integrability. International Journal of Computer Vision, 35(3):203–<br />

222, 1999.<br />

[95] S. K. Zhou and R. Chellappa. Illuminating light field: Image-based face recognition<br />

across illuminations and poses. In Proceedings of the IEEE International Conference<br />

on Automatic Face and Gesture Recognition, pages 229–234, 2004.<br />

[96] S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Proba-<br />

bilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on<br />

Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.<br />

