Shared Gaussian Process Latent Variable Models

Carl Henrik Ek

Submitted in partial fulfilment of the requirements of the award of PhD

Oxford Brookes University

August 2009
Abstract

A fundamental task in machine learning is modeling the relationship between different observation spaces. Dimensionality reduction is the task of reducing the number of dimensions in a parameterization of a data-set. In this thesis we are interested in the intersection of these two tasks: shared dimensionality reduction. Shared dimensionality reduction aims to represent multiple observation spaces within the same model. Previously suggested models have been limited to scenarios where the observations have been generated from the same manifold. In this thesis we present a Gaussian Process Latent Variable Model (GP-LVM) [33] for shared dimensionality reduction that makes no assumptions about the relationship between the observations. Further, we suggest an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The proposed algorithm extends classical CCA to represent the full variance of the data, as opposed to only the correlated variance. We compare the suggested GP-LVM model to existing models and show results on real-world problems exemplifying the advantages of our approach.
Acknowledgements
Contents

1 Introduction 10
1.1 Overview of the thesis 11
1.2 Publications 12
1.3 Notations and Conventions 13

2 Background 14
2.1 Introduction 14
2.1.1 Curse of Dimensionality 15
2.2 Dimensionality Reduction 17
2.3 Linear Algebra 19
2.4 Spectral Dimensionality Reduction 21
2.5 Non-Linear 25
2.5.1 Kernel-Trick 26
2.5.2 Proximity Graph Methods 29
2.5.3 Summary 34
2.6 Generative Dimensionality Reduction 35
2.7 Gaussian Processes 39
2.7.1 Prediction 41
2.7.2 Training 43
2.7.3 Relevance Vector Machine 45
2.8 GP-LVM 47
2.8.1 Latent Constraints 49
2.8.2 Summary 53
2.9 Shared Dimensionality Reduction 54
2.10 Feature Selection 55
2.11 Summary 57
3 Shared GP-LVM 59
3.1 Introduction 59
3.2 Shared GP-LVM 61
3.2.1 Initialization 63
3.2.2 Inference 64
3.2.3 Example 65
3.2.4 Summary 73
3.3 Subspace GP-LVM 74
3.4 Extensions 76
3.5 Applications 76
3.6 Summary 77

4 NCCA 80
4.1 Introduction 80
4.2 Assumptions 81
4.3 Shared 82
4.4 Private 84
4.4.1 Extracting the first orthogonal direction 85
4.4.2 Extracting consecutive directions 86
4.4.3 Example 87
4.5 Extensions 90
4.6 Summary 90
5 Applications 92
5.1 Introduction 92
5.2 Human Pose Estimation 92
5.2.1 Generative 93
5.2.2 Discriminative 94
5.2.3 Problem Statement 94
5.3 Image Features 95
5.3.1 Shapeme Features 95
5.3.2 Histogram of Oriented Gradients 97
5.4 Data-sets 98
5.5 Shared back-constrained GP-LVM 100
5.5.1 Single Image Inference 102
5.5.2 Sequential Inference 103
5.6 Subspace GP-LVM 104
5.6.1 Single Image Inference 107
5.6.2 Sequential Inference 108
5.7 Quantitative Results 109
5.8 Experiments 110
5.9 Comparison 113
5.10 Summary 116

6 Conclusions 119
6.1 Discussion 119
6.2 Review of the Thesis 119
6.3 Future Work 120
List of Figures

2.1 Volume ratio of hyper-cube and hyper-sphere 16
2.2 Swiss roll 34
2.3 Generative latent variable model 35
2.4 GTM model 38
2.5 Samples from GP prior 42
2.6 Samples from GP posterior 43
2.7 Probabilistic CCA 56

3.1 Shared back-constrained GP-LVM 61
3.2 Toy data: generating signals 65
3.3 Toy data: observed data 66
3.4 Toy data: latent embeddings 67
3.5 Toy data 2: generating signals 68
3.6 Toy data 2: observed data 68
3.7 Toy data 2: latent embeddings 69
3.8 Toy data 3: generating signals 70
3.9 Toy data 3: observed data 71
3.10 Toy data 3: latent embeddings 72
3.11 Toy data 3: latent embeddings 72
3.12 Subspace GP-LVM model 74

4.1 NCCA model 84
4.2 Toy data 3: NCCA embedding 88
4.3 Toy data 3: observed data 88
4.4 Toy data 3: Subspace GP-LVM embedding 89

5.1 Kernel response matrix, Poser pose data 102
5.2 Misinterpretation of angle error 109
5.3 Poser single image results, back-constrained GP-LVM 110
5.4 Poser sequence results, back-constrained GP-LVM 111
5.5 Kernel matrix, back-constrained and subspace GP-LVM 114
5.6 Subspace GP-LVM pose inference 115
5.7 Subspace GP-LVM ambiguity modeling 117
List of Tables

3.1 Toy data: Procrustes score 70
5.1 Error on Poser data 111
5.2 Error on HumanEva 112
Chapter 1

Introduction
Developments in information technology have led to a significant expansion in the storage capabilities of digital content. This has meant that in many application areas where observations used to be scarce we now have access to significant amounts of data. This development has led to a transition in many fields from a purely model-driven paradigm to a data-driven approach. In model-driven modeling the aim is to explain a specific phenomenon using a model designed for the task at hand. This is different from data-driven modeling, where one tries to use observations of the phenomenon to learn a model. For most modeling scenarios the available data is represented in a form defined by the device used to capture it. This means that the degrees of freedom of the available data are the degrees of freedom of the capturing device, not of the actual phenomenon. This often leads to the data being represented using a very large number of dimensions, often significantly larger than the number of dimensions, or degrees of freedom, of the underlying phenomenon. This has led to the machine learning field of Dimensionality Reduction. In dimensionality reduction the aim is to find the data's true or intrinsic parameterization from the capturing-device representation.
Many tasks in computer science are associated with data coming from multiple streams or views of the same underlying phenomenon. Often each view provides complementary information about the data. For modeling purposes it is therefore of interest to use information from each view. The task of merging several different views is called Feature Fusion.

The work undertaken in this thesis spans both realms presented above. Given multiple views of the same phenomenon, we create models which are capable of leveraging the advantages of each view in learning a reduced-dimensional representation of the data.
1.1 Overview of the thesis

A brief outline of the dissertation follows.

Chapter 2 This chapter provides the motivation and the background to the machine learning task of dimensionality reduction. The two different approaches to dimensionality reduction, spectral and generative, are introduced and their strengths and weaknesses reviewed. We continue by introducing Gaussian Processes (GPs) and give a brief background to Bayesian modeling. The Gaussian Process Latent Variable Model (GP-LVM) [33, 32], a dimensionality reduction model based on Gaussian Processes, is introduced. Further, we introduce the task of Shared Dimensionality Reduction, which will be the main focus of this thesis.

Chapter 3 This chapter describes the two shared generative dimensionality reduction models developed in this thesis. By motivating the shortcomings of current models we derive two new models in the GP-LVM framework. We first present the shared back-constrained GP-LVM and continue to describe the second suggested model, the subspace GP-LVM.

Chapter 4 We present an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The NCCA algorithm allows us to transform CCA from an algorithm for feature selection to one of shared dimensionality reduction.

Chapter 5 In this chapter the suggested models are applied to the Computer Vision task of human pose estimation. We apply the models to real-world data-sets and experimentally compare the two models.

Chapter 6 Concludes the work undertaken and describes potential directions for future work.
1.2 Publications

This thesis builds on work from the following publications:

1. C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2007), volume LNCS 4892, pages 132–143, Brno, Czech Republic, Jun. 2007. Springer-Verlag.

2. C. H. Ek, P. H. Torr, and N. D. Lawrence. Ambiguity modeling in latent spaces. In 5th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2008), 2008.

3. C. H. Ek, P. Jaeckel, N. Campbell, N. Lawrence, and C. Melhuish. Shared Gaussian Process Latent Variable Models for Handling Ambiguous Facial Expressions. In AIP Conference Proceedings, volume 1107, page 147, 2009.
1.3 Notations and Conventions

In the mathematical notation we use italics $x$ to indicate scalars, bold lowercase $\mathbf{x}$ to indicate vectors, and bold uppercase $\mathbf{X}$ to indicate matrices. Vectors, unless stated otherwise, are column vectors. The transpose of a matrix is indicated by a superscript, $\mathbf{X}^\top$. The identity matrix is represented by $\mathbf{I}$. The vector $\mathbf{e}_i$ represents the unit vector with all dimensions set to 0 except dimension $i$, which is 1.
Chapter 2

Background

2.1 Introduction
Modeling is the task of describing a system using a specific language. In this thesis the focus is on mathematical modeling, which refers to the process of describing a system through the laws of mathematics. The building blocks of mathematical models are variables or parameters which, by interacting through the laws of mathematics, aim to mimic the behavior of a system. There are many reasons why we are interested in creating accurate models of a specific system. A model allows us to analyze and simulate the behavior of the system in a hypothetical scenario without having to jeopardize the actual system.

A fundamental characteristic of a model is its degrees of freedom [22], which refers to the number of parameters or variables that are allowed to vary independently of each other. In many scenarios there is more than one way to describe a system; this can be either because different approximations or assumptions have been made, or due to a lack of knowledge of the system. For example, one set of data can be equally well described by two different models. However, data from the input domain outside the training data might result in different behavior from each model. Similarly, different assumptions often lead to different models. The degrees of freedom of a representation are equal to its number of parameters or dimensions. However, this parameterization need not represent the data in the correct way. When each dimension in the representation describes or models a single degree of freedom in the data, we say that the data is in its intrinsic representation.

We will separate the task of modeling into data driven and model driven. Data-driven modeling is when, given a set of training data, we try to learn a model from the data; this is different from model-based modeling, where we try to fit or match a specific model to the data. This thesis will focus on data-driven models.
2.1.1 Curse of Dimensionality

We spend our lives in a world that is essentially three dimensional. It is in this world we build our understanding of concepts such as distance and volume. In Machine Learning we often deal with data comprising many more dimensions. Many of the concepts we learn to recognize in two and three dimensions cannot easily be extrapolated to higher dimensions. One example is the relationship between the volume of a hyper-sphere of diameter 2 and a cube with side length 2,

$$V_{\text{cube}}(d) = 2^d \qquad (2.1)$$

$$V_{\text{sphere}}(d) = \frac{2\,\pi^{d/2}}{d\,\Gamma\!\left(\frac{d}{2}\right)}. \qquad (2.2)$$
[Figure 2.1: The ratio of a hyper-cube that is contained within a hyper-sphere as a function of dimension.]
Figure 2.1 shows this ratio as a function of dimensionality. In the limit, the ratio of the volume that is contained within the hyper-sphere goes to zero,

$$\lim_{d\to\infty} \frac{V_{\text{sphere}}(d)}{V_{\text{cube}}(d)} = 0. \qquad (2.3)$$

This means that with increasing dimension all the volume of a cube will be contained in its corners. However, our concept of the "corners of a cube" is clearly two and three dimensional in nature, as is our understanding of volume. Therefore care should be taken when working with high dimensional spaces.

The "Curse of Dimensionality" refers to the exponential growth in volume with dimension. It is called a curse due to the fact that many algorithms scale badly with increasing dimension. One way of visualizing this is to think of an algorithm as a mapping from an input-space to an output-space. For the algorithm to work for any type of input it needs to "cover" or "monitor" all of the input space. Therefore, with increasing dimension the algorithm needs to cover an exponentially growing volume.
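The vanishing ratio of Eqs. (2.1)–(2.3) is easy to check numerically. A minimal sketch (the function name is ours, and the unit-radius form of the sphere volume is assumed):

```python
import math

def sphere_cube_ratio(d: int) -> float:
    """Volume of the d-dimensional hyper-sphere of diameter 2 divided by
    the volume of the enclosing hyper-cube of side length 2."""
    v_sphere = 2.0 * math.pi ** (d / 2) / (d * math.gamma(d / 2))
    v_cube = 2.0 ** d
    return v_sphere / v_cube

# The sphere fills the whole interval in one dimension, but almost none
# of the cube in high dimensions: the volume concentrates in the corners.
for d in (1, 2, 3, 10, 30):
    print(d, sphere_cube_ratio(d))
```

In one dimension the "sphere" is the full interval (ratio 1); by $d = 30$ the ratio has dropped far below $10^{-10}$, illustrating Eq. (2.3).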
2.2 Dimensionality Reduction
In most data-sets the number of degrees of freedom of the representation is much higher than that of the intrinsic representation. This is due to the fact that, rather than representing the degrees of freedom of the data, the representation is a reflection of the degrees of freedom of the collection process.

    How complex or simple a structure is depends critically upon the way we describe it. Most of the complex structures found in the world are enormously redundant, and we can use this redundancy to simplify their description. But to use it, to achieve the simplification, we must find the right representation [56].
One example of this is the parameterization of a natural image as a matrix of real values (pixels). The matrix being captured by a camera, each pixel corresponds to a single light sensor which is allowed to vary independently of the other sensors. However, this does not correspond well to natural images, as neighboring pixels are strongly correlated [8]. This implies that natural images have a significantly different intrinsic representation, and degrees of freedom, than the image representation of the camera. The correlation between pixels will, in the vector space spanned by the pixel values, manifest itself as a low-dimensional manifold. Parameterizing this manifold gives the degrees of freedom of natural images. This example is a simplification, as it assumes that the camera is capable of capturing the full variability, or all the degrees of freedom, of a natural image, i.e. that the intrinsic representation can be found as a mapping from the observed representation. A more general view of the problem that avoids this assumption is to view the collection process as a mapping from the data's intrinsic representation $\mathbf{X}$ to its observed representation $\mathbf{Y}$; this mapping is referred to as the generating mapping,

$$\mathbf{y}_i = f(\mathbf{x}_i). \qquad (2.4)$$
As in the example with the camera and natural images, the typical parameterization of data used in machine learning is over-represented with respect to the number of dimensions but under-represented in terms of the data. Geometrically this implies that the data only occupies a sub-volume of the representation space. However, the generating mapping is not necessarily invertible, which means that information from the intrinsic representation may be lost.

Dimensionality reduction is the process of reducing the number of parameters needed by a specific representation. This thesis focuses on data-driven dimensionality reduction. Data-driven modeling is based on the assumption that the observed training data is representative. This means that we make the assumption that the observed data samples the input domain "well". This implies that the degrees of freedom for the training data are assumed to be the same as for any additional samples from the same domain.

There are two main approaches to data-driven dimensionality reduction: (1) generative and (2) spectral. The generative approach aims at modeling the generating mapping $f$. This is, in general, an ill-posed problem, as there are many ways of generating the observed data when neither the mapping nor the intrinsic representation is known. Spectral models avoid this by assuming that a smooth inverse $f^{-1}$ to the generating mapping exists. This implies that the full degrees of freedom in the data are retained in the observed representation. This means that the intrinsic representation $\mathbf{X}$ of the data can be "unraveled" from evidence in the observed representation $\mathbf{Y}$.
2.3 Linear Algebra

A linear mapping $T$ from vector-space $U$ to vector-space $V$, $T : U \to V$, is represented in matrix form as,

$$T(\mathbf{x}) = \mathbf{A}\mathbf{x}. \qquad (2.5)$$

This means that the mapping $T$, represented by the matrix $\mathbf{A}$, carries elements from vector space $U$ to vector space $V$. The image $\operatorname{im}(T)$ of a mapping defines the set of all values the map can take,

$$\operatorname{im}(T) = \{T(\mathbf{x}) : \mathbf{x} \in U\} \subset V. \qquad (2.6)$$

Similarly, the kernel $\operatorname{kern}(T)$ is the set of all values that map to zero,

$$\operatorname{kern}(T) = \{\mathbf{x} \in U : T(\mathbf{x}) = \mathbf{0}\} \subset U. \qquad (2.7)$$

The kernel and the image of a linear mapping are related through the Rank-Nullity Theorem:

$$\dim(U) = \dim(\operatorname{im}(\mathbf{A})) + \dim(\operatorname{kern}(\mathbf{A})). \qquad (2.8)$$

Intuitively this means that the number of dimensions needed to correctly represent the degrees of freedom of the data is given by subtracting the number of dimensions required to represent the null-space of the representation from the number of dimensions of the representation. The number of dimensions needed to represent the image is more commonly referred to as the rank of the matrix, which is the number of linearly independent columns in the matrix.
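As a concrete instance of Eq. (2.8), consider a mapping on a three-dimensional space whose matrix has a linearly dependent column; the matrix below is an illustrative example of ours, not one from the text:

```python
import numpy as np

# A maps R^3 -> R^3, but its third column is the sum of the first two,
# so the columns span only a two-dimensional image.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])

rank = np.linalg.matrix_rank(A)   # dim(im(A))
nullity = A.shape[1] - rank       # dim(kern(A)), by rank-nullity
print(rank, nullity)              # prints "2 1": 3 = 2 + 1
```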
Two square matrices $\mathbf{A}$ and $\mathbf{B}$ are said to be similar if there exists a non-singular matrix $\mathbf{P}$ (a matrix such that the inverse of $\mathbf{P}$ exists) such that the two matrices are related as,

$$\mathbf{A} = \mathbf{P}\mathbf{B}\mathbf{P}^{-1}; \qquad (2.9)$$

this is known as a similarity transform. A similarity transform $\mathbf{P}$ maps from a vector space onto itself and is therefore also referred to as a "change of basis" transformation, mapping between two representations or reference frames. The determinant, trace and invertibility are all invariant under similarity.

A special similarity transform arises when one reference system or basis results in a diagonal matrix,

$$\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1} \qquad (2.10)$$

$$\Lambda_{ij} = \begin{cases} 0 & i \neq j \\ \lambda_i & i = j \end{cases} \qquad (2.11)$$

$$\mathbf{V}\mathbf{V}^\top = \mathbf{I}; \qquad (2.12)$$

the diagonal representation is said to be the spectral decomposition of the matrix. The columns of $\mathbf{V}$ are called the eigenvectors and the non-zero elements of the spectral decomposition the eigenvalues of the matrix. The spectral decomposition means that we can write such a matrix as a linear combination of rank-one matrices,

$$A_{ij} = \sum_{k=1}^{N} V_{ik}\Lambda_{kk}V^\top_{kj} = \sum_{k=1}^{N} (\mathbf{v}_k)_i\,\lambda_k\,(\mathbf{v}_k)_j = \sum_{k=1}^{N} \left(\lambda_k\,\mathbf{v}_k\mathbf{v}_k^\top\right)_{ij}. \qquad (2.13)$$
As $\mathbf{V}$ specifies an orthonormal basis, the relative magnitude of each eigenvalue corresponds to the amount of $\mathbf{A}$ that is explained by the corresponding eigenvector. Therefore, by ordering the eigenvalues in decreasing order, $\lambda_i \geq \lambda_j$ for $i \leq j$, we can refer to,

$$\mathbf{A}_{\to i} = \sum_{k=1}^{i} \lambda_k\,\mathbf{v}_k\mathbf{v}_k^\top, \qquad (2.14)$$

as the best rank-$i$ approximation of matrix $\mathbf{A}$.
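The truncated eigen-expansion of Eq. (2.14) can be sketched as follows, on a hypothetical random symmetric matrix; the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                      # symmetric PSD matrix, so A = V Λ V^T exists

lam, V = np.linalg.eigh(A)       # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]    # reorder so that λ_1 ≥ λ_2 ≥ ...
lam, V = lam[order], V[:, order]

def rank_i_approx(i: int) -> np.ndarray:
    # A_{→i} = Σ_{k=1}^{i} λ_k v_k v_k^T, Eq. (2.14)
    return (V[:, :i] * lam[:i]) @ V[:, :i].T

# Keeping more leading eigen-components monotonically shrinks the error,
# and the full sum reconstructs A exactly.
errors = [np.linalg.norm(A - rank_i_approx(i)) for i in range(1, 6)]
```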
2.4 Spectral Dimensionality Reduction

Spectral dimensionality reduction is based on the assumption that the generating mapping $f$ is invertible. This means that the relationship between the observed representation $\mathbf{Y}$ and the intrinsic representation $\mathbf{X}$ takes the form of a bijection. This implies that the intrinsic structure of the data is fully preserved in the observed representation.

Classical Multi-Dimensional Scaling (MDS) [17, 40] is a method for representing a metric dissimilarity measure as a geometrical configuration. Given a dissimilarity measure $\delta_{ij}$ between points $i$ and $j$, the aim is to find a geometrical configuration of points $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ such that the Euclidean distance $d_{ij} = ||\mathbf{x}_i - \mathbf{x}_j||_2$ approximates the dissimilarity $\delta_{ij}$. Classical MDS is formulated as a minimization of the following energy,

$$\arg\min_{\mathbf{X}} \sum_{ij} (\delta_{ij} - d_{ij})^2 = \arg\min_{\mathbf{X}} ||\boldsymbol{\Delta} - \mathbf{D}(\mathbf{X})||_F. \qquad (2.15)$$
The best $d$-dimensional geometrical representation can be found through a rank-$d$ approximation $\mathbf{D}(\mathbf{X})$ of the dissimilarity matrix $\boldsymbol{\Delta}$, via the spectral decomposition $\boldsymbol{\Delta} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{i=1}^{N} \lambda_i\mathbf{v}_i\mathbf{v}_i^\top$,

$$||\boldsymbol{\Delta} - \mathbf{D}(\mathbf{X})||_F = \left|\left| \sum_{i=1}^{N} \lambda_i\mathbf{v}_i\mathbf{v}_i^\top - \sum_{i=1}^{N} q_i\mathbf{v}_i\mathbf{v}_i^\top \right|\right|_F = \left|\left| \sum_{i=1}^{d} (\lambda_i - q_i)\mathbf{v}_i\mathbf{v}_i^\top + \sum_{i=d+1}^{N} \lambda_i\mathbf{v}_i\mathbf{v}_i^\top \right|\right|_F. \qquad (2.16)$$
Having found a rank-$d$ approximation to $\boldsymbol{\Delta}$, the distance matrix can be converted to a Gram matrix $\mathbf{G} = \mathbf{X}\mathbf{X}^\top$,

$$g_{ij} = \frac{1}{2}\left( \frac{1}{N}\sum_{k=1}^{N} d^2_{ik} + \frac{1}{N}\sum_{k=1}^{N} d^2_{kj} - \frac{1}{N^2}\sum_{k=1}^{N}\sum_{p=1}^{N} d^2_{kp} - d^2_{ij} \right). \qquad (2.17)$$
A geometrical configuration can be found through the eigen-decomposition of the Gram matrix $\mathbf{G}$,

$$\mathbf{G} = \mathbf{X}\mathbf{X}^\top = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top = (\mathbf{V}\boldsymbol{\Lambda}^{\frac{1}{2}})(\mathbf{V}\boldsymbol{\Lambda}^{\frac{1}{2}})^\top \;\Rightarrow\; \mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}^{\frac{1}{2}}, \qquad (2.18)$$

with the dimensionality given by,

$$\operatorname{rank}(\mathbf{X}) = \operatorname{rank}(\mathbf{X}\mathbf{X}^\top) = \operatorname{rank}(\mathbf{G}) = \operatorname{rank}(\mathbf{D}(\mathbf{X})) = d.$$

In practice, for dimensionality reduction, we want to find a low-dimensional representation of a set of data points, i.e. vectorial data. In this case the Gram matrix $\mathbf{G}$ can be constructed directly from the data and a rank-$d$ approximation can be sought, making the conversion step from distance matrix to Gram matrix unnecessary.
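The classical MDS recipe — double centering of squared distances, Eq. (2.17), followed by the eigen-decomposition of Eq. (2.18) — can be sketched as follows, assuming a planar toy configuration (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X_true = rng.standard_normal((20, 2))   # hidden 2-d configuration
# Squared pairwise Euclidean distances, the only input MDS sees.
D2 = ((X_true[:, None, :] - X_true[None, :, :]) ** 2).sum(-1)

# Double centering, the matrix form of Eq. (2.17): G = -1/2 J D² J,
# where J = I - 11ᵀ/N subtracts row, column, and grand means.
N = D2.shape[0]
J = np.eye(N) - np.ones((N, N)) / N
G = -0.5 * J @ D2 @ J

lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# X = V Λ^{1/2}, Eq. (2.18), keeping the d = 2 leading components.
X_mds = V[:, :2] * np.sqrt(np.maximum(lam[:2], 0.0))

# The recovered configuration reproduces the original distances exactly
# (the configuration itself is determined only up to rotation/translation).
D2_rec = ((X_mds[:, None, :] - X_mds[None, :, :]) ** 2).sum(-1)
```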
Principal Component Analysis (PCA) is a technique for embedding vectorial data in a dimensionally reduced representation. Given centered vectorial data $\mathbf{Y}$, the covariance matrix $\mathbf{S} = \frac{1}{N-1}\mathbf{Y}^\top\mathbf{Y}$ has elements on the diagonal representing the variance along each dimension of the data, while the off-diagonal elements measure the linear redundancies between dimensions. The objective of PCA is to find a projection $\mathbf{v}$ of the data $\mathbf{Y}$ such that the variance along each dimension is maximized,

$$\text{Objective:} \quad \arg\max_{\mathbf{v}}\; \operatorname{var}(\mathbf{Y}\mathbf{v}) \qquad (2.19)$$

$$\text{subject to:} \quad \mathbf{v}^\top\mathbf{v} = 1. \qquad (2.20)$$

This implies finding a projection of the data into a representation resulting in a diagonal covariance matrix,

$$\operatorname{var}(\mathbf{Y}\mathbf{v}) = \frac{1}{N-1}(\mathbf{Y}\mathbf{v})^\top\mathbf{Y}\mathbf{v} = \mathbf{v}^\top\mathbf{S}\mathbf{v} \qquad (2.21)$$

$$\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top \qquad (2.22)$$

$$\boldsymbol{\Lambda} = \mathbf{V}^\top\mathbf{S}\mathbf{V} \qquad (2.23)$$

$$\mathbf{X} = \mathbf{Y}\mathbf{V}. \qquad (2.24)$$
As can be seen from above, the solutions to both MDS and PCA are found through a similarity transform where one representation results in a diagonal matrix: in the case of MDS through the diagonalisation of the $N \times N$ Gram matrix $\mathbf{G}$, and for PCA of the $D \times D$ covariance matrix.

Using the spectral decomposition of the Gram matrix,

$$\mathbf{G} = \mathbf{Y}\mathbf{Y}^\top = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top, \qquad (2.25)$$

the similarity implies that,

$$(\mathbf{Y}\mathbf{Y}^\top)\mathbf{v}_i = \lambda_i\mathbf{v}_i. \qquad (2.26)$$

By pre-multiplying we can write,

$$\frac{1}{N-1}\mathbf{Y}^\top(\mathbf{Y}\mathbf{Y}^\top)\mathbf{v}_i = \frac{\lambda_i}{N-1}\mathbf{Y}^\top\mathbf{v}_i \qquad (2.27)$$

$$\mathbf{S}\,\mathbf{Y}^\top\mathbf{v}_i = \frac{\lambda_i}{N-1}\mathbf{Y}^\top\mathbf{v}_i, \qquad (2.28)$$
which also defines a similarity transform, but now in terms of the covariance matrix $\mathbf{S}$. However, we also need to enforce orthogonality of the new basis,

$$(\mathbf{Y}^\top\mathbf{v}_i)^\top(\mathbf{Y}^\top\mathbf{v}_i) = \mathbf{v}_i^\top\mathbf{Y}\mathbf{Y}^\top\mathbf{v}_i = \lambda_i \qquad (2.29)$$

$$\frac{1}{\sqrt{\lambda_i}}(\mathbf{Y}^\top\mathbf{v}_i)^\top(\mathbf{Y}^\top\mathbf{v}_i)\frac{1}{\sqrt{\lambda_i}} = \frac{1}{\lambda_i}\mathbf{v}_i^\top\mathbf{Y}\mathbf{Y}^\top\mathbf{v}_i = 1. \qquad (2.30)$$

This results in the eigen-basis of the covariance matrix, i.e. $\mathbf{v}_i^{\mathrm{PCA}} = \frac{1}{\sqrt{\lambda_i}}\mathbf{Y}^\top\mathbf{v}_i$, which gives the following embedding,

$$\mathbf{x}_i^{\mathrm{PCA}} = \frac{1}{N-1}\mathbf{Y}\mathbf{Y}^\top\mathbf{v}_i\frac{1}{\sqrt{\lambda_i}} = \frac{\sqrt{\lambda_i}}{N-1}\mathbf{v}_i = \frac{1}{N-1}\mathbf{x}_i^{\mathrm{MDS}}, \qquad (2.31)$$

meaning MDS and PCA result in the same solution up to scale.
MDS and PCA assume the generating mapping $f$ to be linear and therefore imply that the intrinsic representation of the data can be found by a change-of-basis transform. This restricts these algorithms to scenarios where the generating mapping is linear.
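The equivalence between the two routes can be checked numerically. In the sketch below (illustrative names, NumPy assumed) the covariance-eigenbasis projection and the Gram-matrix embedding agree up to the arbitrary sign of each eigenvector; the $1/(N-1)$ factor in the derivation above is only a normalization convention:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((50, 4))
Y = Y - Y.mean(axis=0)                 # centre the data, as PCA assumes

# PCA route: eigenvectors of the D x D covariance matrix S.
S = Y.T @ Y / (Y.shape[0] - 1)
lam_S, W = np.linalg.eigh(S)           # ascending eigenvalues
X_pca = Y @ W[:, ::-1]                 # project onto the descending eigenbasis

# MDS route: eigenvectors of the N x N Gram matrix G = Y Yᵀ,
# with embedding X = V Λ^{1/2} as in Eq. (2.18).
G = Y @ Y.T
lam_G, V = np.linalg.eigh(G)
order = np.argsort(lam_G)[::-1][:4]
X_mds = V[:, order] * np.sqrt(np.maximum(lam_G[order], 0.0))

# Identical embeddings, up to the sign ambiguity of each eigenvector.
agree = np.allclose(np.abs(X_pca), np.abs(X_mds), atol=1e-6)
```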
2.5 Non-Linear<br />
Several algorithms have been suggested for the scenario where, rather than assuming the generating mapping f to be linear, the assumption is relaxed to f merely being smooth. MDS finds a geometrical configuration respecting a specific dissimilarity measure. By measuring the dissimilarity between each pair of points as the distance along the manifold, MDS could be used even in scenarios where the generating mapping is non-linear. However, acquiring the distance along the manifold requires the manifold to be unraveled.
The objective of PCA is to find a projection of the data Y where the variance along each dimension is maximized. If the data occupies a subspace in the original representation, PCA will find a rotation of the data such that a reduced dimensional representation can be obtained by removing the dimensions that do not represent any of the variance in the data. Further, for many data-sets most of the variance is captured by the first few principal directions, meaning that an approximate representation of the data can be found by truncating the dimensions that represent an insignificant variance.
One approach to non-linearize PCA was suggested in [51]. The idea is to first map the data Y into a feature space F through a mapping Φ. Rather than, as in standard PCA, finding the spectral decomposition of the covariance matrix S in the original space, the decomposition is now applied to the covariance of the feature space representation of the data. The hope is that the mapping Φ has unraveled the manifold such that applying PCA recovers the intrinsic representation of the data. However, it remains to find a mapping that does just so.
2.5.1 Kernel-Trick<br />
Given a set of data Y = [y_1, …, y_N], where y_i ∈ ℜ^D, the data is represented in the basis [e_1, …, e_D]. A different choice would be to represent each data-point y_i in the basis spanned by the data itself, [y_1, …, y_N], i.e. in an N dimensional space. The covariance matrix S in the space spanned by the data is equal to the Gram matrix of the original representation. This is the fundamental background to the Kernel Trick, which is a way of non-linearizing algorithms that depend only on the inner products between data-points. Even though it is an accepted term, it is not clear where the term was initially suggested. The Kernel Trick is based on the idea that, rather than finding a specific mapping Φ that takes the data to the feature space F, we specify a function k(y_i, y_j), called the kernel function, that parameterizes the inner product between Φ(y_i) and Φ(y_j),
\[
k(\mathbf{y}_i,\mathbf{y}_j) = \Phi(\mathbf{y}_i)^{\top}\Phi(\mathbf{y}_j). \tag{2.32}
\]
Evaluated between each pair of points in the data-set, the kernel function k specifies the kernel matrix K(Y, Y), which specifies the Gram matrix in the feature space F. From Eq. 2.17 we know that the Gram matrix and a distance matrix are interchangeable representations for centered data. Therefore, as long as the kernel function k specifies a valid Gram matrix K, there is an underlying geometrical representation of the data in F. The class of kernel functions that specify geometrically representable feature spaces are known as Mercer kernels [41, 50]. Mercer kernels are positive semidefinite, i.e. in the spectral decomposition of the resulting kernel matrix K all eigenvalues are non-negative. Intuitively this can be understood through Eq. 2.17: if an eigenvalue were negative, then adding basis vectors would reduce the distance between two points, which is not possible in a Euclidean space. When a kernel function is used to represent the data, the feature space F is known as a kernel induced feature space.
One advantage of using a kernel induced feature space is that if we aim to apply an algorithm that is expressed only in terms of inner products between data points, we never need to find the geometrical representation of the data in F. This means that kernels corresponding to potentially infinite dimensional spaces can be used. One such kernel function is the RBF kernel,
\[
k(\mathbf{y}_i,\mathbf{y}_j) = \theta_1 e^{-\frac{\theta_2}{2}\|\mathbf{y}_i-\mathbf{y}_j\|_2^2}, \tag{2.33}
\]
with parameters {θ_1, θ_2}. If the inner product is specified by an RBF kernel, any combination of points y_i and y_j will have a non-zero inner product. For this to be possible the feature space F needs to be infinite dimensional.
PCA works by diagonalizing the covariance matrix of the data through its spectral decomposition. Kernel PCA [51] is formulated by first finding the Gram matrix in the kernel induced feature space. By representing each point in the basis of the data itself, the Gram matrix is equivalent to the covariance matrix. A reduced representation can then be found through the spectral decomposition of the kernel matrix K.
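As an illustration (my own sketch, not code from the thesis), Kernel PCA with the RBF kernel of Eq. 2.33 takes only a few lines of numpy; the parameter values below are arbitrary:

```python
import numpy as np

def rbf_kernel(Y, theta1=1.0, theta2=0.1):
    # k(y_i, y_j) = theta1 * exp(-theta2/2 * ||y_i - y_j||^2), Eq. 2.33
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return theta1 * np.exp(-0.5 * theta2 * sq)

def kernel_pca(Y, q, theta1=1.0, theta2=0.1):
    N = Y.shape[0]
    K = rbf_kernel(Y, theta1, theta2)
    # center the Gram matrix in the feature space
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J
    lam, V = np.linalg.eigh(Kc)
    lam, V = lam[::-1], V[:, ::-1]        # sort eigenvalues descending
    # embedding scaled by sqrt of the eigenvalues, as in MDS
    return V[:, :q] * np.sqrt(np.maximum(lam[:q], 0.0))

Y = np.random.default_rng(1).normal(size=(30, 4))
X = kernel_pca(Y, q=2)
```

Note that the geometrical feature space representation is never formed; only the kernel matrix is needed.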
For many popular kernels, such as the RBF kernel, the kernel represents a feature space of higher dimensionality than the original data-space, meaning that the mapping increases the dimensionality of the data. However, even though the dimensionality of the data might have been increased, the relative ratio of the eigenvalues in the spectral decomposition of the covariance matrix may be such that fewer eigenvectors, compared to the decomposition in the original space, represent a significant variance, meaning that a lower dimensional approximation of the data can be found. So strictly speaking, for many types of kernel functions, Kernel PCA is not a dimensionality reduction technique but rather an algorithm for feature selection, which we will briefly comment on at the end of this chapter.
In the next section we will go through a set of algorithms that use local similarity measures in the data to construct kernels to which a spectral decomposition can be applied to find a geometrical representation of the data.
2.5.2 Proximity Graph Methods<br />
Several dimensionality reduction algorithms have been suggested that are based on local similarity measures in the data. These algorithms are based on a proximity graph [66, 29, 10] extracted from the data. A proximity graph is a graph that represents a specific neighborhood relationship in the data. Each node in the graph corresponds to a data point, and edges connect nodes that are related through the specified relationship, potentially with an associated edge weight. The fundamental idea behind proximity graph based algorithms for dimensionality reduction is that locally the data can be assumed to lie on a linear manifold. This means that locally the distance in the original representation of the data will be a good approximation to the manifold distance. Therefore the neighborhood relationship used for proximity graphs in dimensionality reduction is the inter-point distance in the original representation. Usually the graphs are constructed either by an N nearest neighbor rule, where the N closest points are connected, or by an ε nearest neighbor rule, where all points within a ball of radius ε are connected. Setting either parameter is of significant importance, as only points whose inter-distance can be assumed to approximate the manifold distance should be connected.
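A minimal construction of such a graph (my own sketch; the nearest-neighbor rule with Euclidean edge weights, and `k` neighbors per point) could look as follows:

```python
import numpy as np

def knn_graph(Y, k):
    """Symmetric k-nearest-neighbour proximity graph.

    Returns an N x N matrix of edge weights (Euclidean distances),
    with np.inf where no edge exists."""
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    N = len(Y)
    W = np.full((N, N), np.inf)
    np.fill_diagonal(W, 0.0)
    for i in range(N):
        nn = np.argsort(D[i])[1:k + 1]   # k closest points, excluding i itself
        W[i, nn] = D[i, nn]
    return np.minimum(W, W.T)            # symmetrize the edge set

Y = np.random.default_rng(2).normal(size=(20, 3))
W = knn_graph(Y, k=4)
```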
Isomap<br />
Isomap [60] was presented as a non-linear modification of MDS. The first step of Isomap is to construct a proximity graph of the data, with edge weights corresponding to the Euclidean distance between connected points. MDS finds a geometrical configuration respecting a global dissimilarity measure; in Isomap it is suggested that the manifold distance be approximated by the shortest path through the proximity graph. By computing the shortest path between every pair of data points, a dissimilarity measure is obtained to which MDS can be applied. The shortest path through the proximity graph is not guaranteed to result in a dissimilarity matrix whose Gram matrix corresponds to a valid geometrical configuration, i.e. a Mercer kernel. A modification of the Isomap framework that guarantees a valid Mercer kernel has been suggested in [16]. However, in general, as we are only interested in the largest eigenvalues, this does not cause significant problems.
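The two steps, shortest paths through the proximity graph followed by MDS, can be sketched as follows (my own reading of the method; Floyd-Warshall is used here for the shortest paths and the neighborhood size is an arbitrary choice):

```python
import numpy as np

def isomap(Y, k, q):
    """Isomap sketch: geodesic distances through a k-NN graph, then classical MDS."""
    N = len(Y)
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    Dg = np.full((N, N), np.inf)
    np.fill_diagonal(Dg, 0.0)
    for i in range(N):
        nn = np.argsort(D[i])[1:k + 1]
        Dg[i, nn] = D[i, nn]
        Dg[nn, i] = D[i, nn]
    for m in range(N):                    # Floyd-Warshall all-pairs shortest paths
        Dg = np.minimum(Dg, Dg[:, m:m + 1] + Dg[m:m + 1, :])
    # classical MDS on the geodesic distance matrix
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (Dg ** 2) @ J          # Gram matrix from distances (cf. Eq. 2.17)
    lam, V = np.linalg.eigh(B)
    lam, V = lam[::-1], V[:, ::-1]
    return V[:, :q] * np.sqrt(np.maximum(lam[:q], 0.0))

Y = np.random.default_rng(3).normal(size=(40, 3))
X = isomap(Y, k=6, q=2)
```

Clipping the negative eigenvalues to zero reflects the point above: geodesic distances need not yield a Mercer kernel, but only the largest eigenvalues are used.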
Maximum Variance Unfolding<br />
A different alteration of MDS is Maximum Variance Unfolding (MVU) [78, 80, 79, 81]. It is based on the observation that any "fold" of the manifold decreases the Euclidean distance between two points while the distance along the manifold remains the same; MVU is therefore formulated as a constrained maximization. In the first step of the algorithm a proximity graph based on the Euclidean distances in the observed data is computed, in the same manner as for Isomap. The inter-distance between nodes not connected by an edge in the graph is then maximized under the constraint that the distances between nearest neighbors stay the same. In effect the MVU objective tries to unravel the data by stretching the manifold as much as possible without causing it to tear. Rather than formulating the objective in terms of a distance matrix as in Isomap, the MVU objective is expressed in terms of the Gram matrix, which we know is an interchangeable representation, Eq. 2.17. MVU tries to find a feature space represented by a kernel matrix K. This leads to the following
objective,
\[
\hat{\mathbf{K}} = \operatorname*{argmax}_{\mathbf{K}}\; \operatorname{tr}(\mathbf{K}) \tag{2.34}
\]
\[
\text{subject to: } \mathbf{K} \succeq 0, \qquad \sum_{ij} K_{ij} = 0,
\]
\[
K_{ii} + K_{jj} - K_{ij} - K_{ji} = G_{ii} + G_{jj} - G_{ij} - G_{ji}, \quad i \in N(j),
\]
where G is the Gram matrix of the original representation and N(j) is the index set of points connected to j in the proximity graph. The first constraint forces K̂ to represent a geometrically interpretable feature space, while the second constraint forces the data to be centered. The final constraint ensures that the distances between points that are connected in the proximity graph are preserved. The optimization is an instance of Semi-Definite Programming [11] and can be solved using efficient algorithms. Once a valid kernel matrix K̂ has been found, the resulting embedding is obtained by applying MDS to K̂.
Local Linear Embeddings<br />
Local Linear Embeddings (LLE) [48] is a third method based on the preservation of a proximity graph structure. LLE is based on the assumption that the manifold can locally be well approximated by small linear patches; by rotating and translating each of these patches, the full manifold structure can be modeled. LLE is a two step algorithm. In the first step each point in the data set is described as an expansion in the points it is connected to in the proximity graph,
\[
\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j \in N(i)} W_{ij}\,\mathbf{y}_j \Big\|_2^2 \tag{2.35}
\]
\[
\text{subject to: } \sum_j W_{ij} = 1,
\]
where N(i) is the index set of points connected to i in the proximity graph. The optimal weights Ŵ can be found in closed form [48]. Assuming that the manifold is locally linear, the reconstruction weights summarize the local structure of the data and should therefore be equally valid for reconstructing the manifold representation of the data X. To find this manifold representation a second minimization is formulated,
\[
\hat{\mathbf{X}} = \operatorname*{argmin}_{\mathbf{X}} \sum_i \Big\| \mathbf{x}_i - \sum_{j \in N(i)} W_{ij}\,\mathbf{x}_j \Big\|_2^2. \tag{2.36}
\]
However, Eq. 2.36 has a trivial solution placing every point at the origin, x_i = 0 ∀i; this is removed by enforcing unit variance along each direction. Further, to remove the translational degree of freedom the solution is enforced to be centered, Σ_i x_i = 0. The optimal embedding X̂ can be found through an eigenvalue problem.
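The two minimizations can be sketched as follows (my own sketch; the regularization of the local Gram matrix is a standard practical addition, not part of the description above):

```python
import numpy as np

def lle(Y, k, q):
    """LLE sketch: local reconstruction weights, then an eigen-problem."""
    N = len(Y)
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(D2[i])[1:k + 1]
        Z = Y[nn] - Y[i]                   # neighbours centred on y_i
        # regularised local Gram matrix (assumed regulariser, for stability)
        C = Z @ Z.T + 1e-6 * np.trace(Z @ Z.T) * np.eye(k)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nn] = w / w.sum()             # enforce sum_j W_ij = 1
    # embedding: bottom eigenvectors of (I - W)^T (I - W),
    # discarding the constant eigenvector (the trivial solution)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    lam, V = np.linalg.eigh(M)
    return V[:, 1:q + 1]

Y = np.random.default_rng(4).normal(size=(40, 3))
X = lle(Y, k=6, q=2)
```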
Laplacian Eigenmaps<br />
The proximity graph is also the starting point for Laplacian eigenmaps [5]. Each node in the graph is connected to its neighbors by an edge, with an edge weight representing the locality of the points. Several different measures of locality can be used; in the original paper either a heat kernel,
\[
w_{ij} = e^{-\frac{\|\mathbf{y}_i-\mathbf{y}_j\|_2^2}{t}},
\]
or a constant weight, w_ij = 1, was applied. Once the graph has been constructed, the objective is to find an embedding X of the data such that points that are connected in the graph stay as close together as possible.
For the first dimension,
\[
\hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{x}} \frac{1}{2}\sum_{i,j} (x_i - x_j)^2 W_{ij} = \operatorname*{argmin}_{\mathbf{x}}\; \mathbf{x}^{\top}\mathbf{L}\mathbf{x}, \tag{2.37}
\]
where L is referred to as the Laplacian, defined as L = D − W, and D is a diagonal matrix such that $D_{ii} = \sum_j W_{ji}$. The objective Eq. 2.37 has a trivial zero dimensional solution that represents the embedding using a single point. To remove this solution, the embedding is forced to be orthogonal to the constant vector 1, $\mathbf{x}^{\top}\mathbf{D}\mathbf{1} = 0$. Further, to prevent the embedding from shrinking, a constraint on the scale, $\mathbf{x}^{\top}\mathbf{D}\mathbf{x} = 1$, is appended to the objective. The diagonal matrix D provides a scaling of each point with respect to its locality to other points in the data. For a multi-dimensional embedding of the data this leads
Figure 2.2: Swissroll with added Gaussian noise. Given the left image it is easy to see the global structure. The spectral algorithms are based on local structure in the data, as in the right image, from which it is much harder to infer the global structure.
to the following optimization problem,
\[
\hat{\mathbf{X}} = \operatorname*{argmin}_{\mathbf{X}}\; \operatorname{tr}(\mathbf{X}^{\top}\mathbf{L}\mathbf{X}) \tag{2.38}
\]
\[
\text{subject to: } \mathbf{X}^{\top}\mathbf{D}\mathbf{X} = \mathbf{I},
\]
which can be solved through a generalized eigenvalue problem.
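The construction can be sketched as follows (my own sketch; the generalized eigenproblem is solved here through the symmetric normalization D^{-1/2}LD^{-1/2}, an equivalent reformulation):

```python
import numpy as np

def laplacian_eigenmaps(Y, k, q, t=1.0):
    """Laplacian eigenmaps sketch with a heat-kernel edge weighting."""
    N = len(Y)
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(D2[i])[1:k + 1]
        W[i, nn] = np.exp(-D2[i, nn] / t)  # heat kernel weights
    W = np.maximum(W, W.T)                 # symmetric adjacency
    D = np.diag(W.sum(axis=1))
    L = D - W                              # graph Laplacian
    # generalized problem L x = lam D x via D^{-1/2} normalisation
    Dm = np.diag(1.0 / np.sqrt(np.diag(D)))
    lam, U = np.linalg.eigh(Dm @ L @ Dm)
    return Dm @ U[:, 1:q + 1]              # drop the constant eigenvector

Y = np.random.default_rng(5).normal(size=(40, 3))
X = laplacian_eigenmaps(Y, k=6, q=2)
```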
2.5.3 Summary<br />
Spectral algorithms are attractive as they are associated with a convex objective function leading to a unique solution. However, the proximity graph is based on local distance measures that are more likely to be affected by noise, see Figure 2.2. The spectral algorithms all make the fundamental assumption that the generating mapping f has a smooth inverse, which thereby preserves the locality of the observed representation in the solution of the intrinsic one. However, one needs to be wary of what this assumption actually implies. As an example, take a piece of string which has
Figure 2.3: The generative latent variable model. The observed data Y is modeled as generated from a low-dimensional latent variable X through the generative mapping f specified by parameters W.

been rolled up into a ball. The string is a one dimensional object embedded in a three dimensional space. Locality is preserved through the generating mapping f, i.e. points that are close on the string will remain close in the ball. However, the reverse does not necessarily hold, as neighboring points in the ball might come from different "loops" when the string was rolled together.
Further, even though it is assumed to exist, none of the spectral algorithms model the smooth inverse of the generating mapping; rather, they learn embeddings of the data points. This is fine as long as the focus is on visualization of the data rather than on a model.
2.6 Generative Dimensionality Reduction<br />
Generative approaches to dimensionality reduction aim to model the observed data as a mapping from its intrinsic representation. The underlying representation is often referred to as the latent representation of the data, and the models as latent variable models for dimensionality reduction. The observed data Y have been generated from the latent variables X through a mapping f parameterized by W, Figure 2.3. Assuming the observations are i.i.d. and have been corrupted by spherical Gaussian noise leads to the likelihood of the data,
\[
p(\mathbf{Y}|\mathbf{X},\mathbf{W},\beta^{-1}) = \prod_{i=1}^{N} \mathcal{N}\!\left(\mathbf{y}_i \,|\, f(\mathbf{x}_i,\mathbf{W}),\, \beta^{-1}\mathbf{I}\right), \tag{2.39}
\]
where β^{-1} is the noise variance. Due to the Gaussian noise model we will refer to these models as Gaussian latent variable models.
In the Bayesian formalism both the parameters of the mapping W and the latent locations X are nuisance variables. Seeking the manifold representation of the observed data, we want to formulate the posterior distribution over the parameters given the data, p(X, W|Y). This means inverting the generative mapping through Bayes' Theorem, which implies marginalization over both the latent locations X and the mapping W. Reaching the posterior also means we need to formulate prior distributions over X and W. However, the problem is severely ill-posed, as there is an infinite number of combinations of latent locations and mappings that could have generated the data. To proceed, assumptions about the relationship need to be made.
Neural Networks (NN) [27] are models traditionally used for supervised learning. MacKay [39] suggested a generative dimensionality reduction approach using NN, labeled Density Networks. Traditional supervised learning with NN implies learning a conditional model over the output variables (class based or continuous) given the input variables through a parametric mapping. This relates each location in the input space with a density over the output space. However, in the case of dimensionality reduction only the output space is given. In [39] a model of the generative mapping parameterized as a Multi-Layered-Perceptron (MLP) NN is proposed. By specifying a prior over the locations X in the input space and the parameters W of the generative mapping, the joint probability of the full model can be formulated. Formulating an error function of the model means that the gradients with respect to the unknown latent locations and the parameters of the generative mapping can be derived. However, these gradients involve integrals over X and W which need to be evaluated using sampling based methods such as Monte Carlo sampling. By optimizing the parameters W, a density over the input space can be found.
Tipping and Bishop [65] formulated probabilistic PCA (PPCA) by making the assumption that the observed data is related to the latent locations through a linear mapping, y_i = Wx_i + ε, where ε ∼ N(0, β^{-1}I). Placing a spherical Gaussian prior over the latent locations leads to the marginal likelihood,
\[
p(\mathbf{y}|\mathbf{W},\beta^{-1}) = \int p(\mathbf{y}|\mathbf{x},\mathbf{W},\beta^{-1})\,p(\mathbf{x})\,d\mathbf{x}, \tag{2.40}
\]
\[
p(\mathbf{x}) = \mathcal{N}(\mathbf{0},\mathbf{I}). \tag{2.41}
\]
The parameters of the mapping W can be found by maximum likelihood.
Assuming a linear mapping severely restricts the classes of data-sets that can be modeled, but the prior over the latent locations has to be propagated through the generating mapping to form the marginal likelihood. For the linear mapping, Eq. 2.40 is solvable. However, when considering mappings of more general form it is not clear how to propagate the latent prior through the mapping to make the integral in Eq. 2.40 analytically tractable.
Bishop [9] suggested a specific prior over the latent space making marginalization over more general mappings feasible; the model is referred to as the Generative Topographic Map (GTM). By discretizing the latent space into a regular grid,
Figure 2.4: Schematic representation of the GTM model: a grid of latent points X is mapped through a nonlinear mapping f parametrised by W to a corresponding grid of Gaussian centers embedded in the observed space. Adapted from [36].
the prior is specified as a sum of delta functions,
\[
p(\mathbf{x}) = \frac{1}{K}\sum_{i=1}^{K}\delta(\mathbf{x}-\mathbf{x}_i), \tag{2.42}
\]
where [x_1, …, x_K] are regularly spaced landmark points in the latent space, Figure 2.4. This prior makes the integral in Eq. 2.40 tractable for general parametric mappings. The GTM specifies a density over the observed data space parameterized by a Gaussian mixture, with centers at the locations in the observed space corresponding to the grid points in the latent space.
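Under the delta-function prior of Eq. 2.42 the integral in Eq. 2.40 collapses to a finite Gaussian mixture over the mapped grid points; a small sketch (my own; the mapping f and all parameter values are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(7)
K, D, beta = 16, 2, 4.0
Xgrid = np.linspace(-1, 1, K)[:, None]     # regular grid in the latent space

def f(x):
    # hypothetical smooth generative mapping onto a curve in observed space
    return np.hstack([np.sin(2 * x), np.cos(2 * x)])

centers = f(Xgrid)                          # Gaussian centers in the data space

def gtm_density(y):
    # p(y) = 1/K sum_k N(y | f(x_k), beta^{-1} I)
    d2 = ((y - centers) ** 2).sum(axis=1)
    norm = (beta / (2 * np.pi)) ** (D / 2)
    return norm * np.exp(-0.5 * beta * d2).mean()

p = gtm_density(np.array([0.0, 1.0]))
```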
The Gaussian latent variable models formulate dimensionality reduction as a probabilistic model, which provides a model and an associated likelihood function. Further, modeling the generative mapping removes the reliance on local, noise sensitive measures in the data. This makes the generative models applicable to a larger range of modeling scenarios compared to the spectral algorithms. However, PPCA is a strictly linear model, and the energy function associated with the GTM is non-convex, which means that we cannot be guaranteed to find the global optimum. Further, the GTM suffers from the problems associated with mixture models in high dimensional spaces [59].
2.7 Gaussian Processes
A D dimensional Gaussian distribution is defined by a D × 1 mean vector and a D × D covariance matrix. A Gaussian process (GP) is the infinite dimensional generalization of the distribution, where the mean and covariance are defined not by fixed size matrices but by a mean function µ(x) and a covariance function k(x, x′), defined over an infinite index set x,
\[
\mathcal{GP}(\mu(\mathbf{x}),\, k(\mathbf{x},\mathbf{x}')). \tag{2.43}
\]
Evaluating a GP over a finite index set reduces the process to a distribution with dimensionality equal to the cardinality of the evaluation set. The covariance function needs to specify a valid covariance matrix when evaluated over any finite subset of its domain; this requires the covariance function to come from the same family of functions as Mercer kernels [41, 45].
A GP generalizes the concept of a Gaussian distribution to infinite dimensions; this has been exploited in machine learning by applying GPs to specify distributions over infinite objects. One such application is when we are interested in modeling relationships defined over continuous domains, such as functions. Suppose we are interested in modeling a functional relationship f between an input domain X ∈ ℜ^D and a target domain Y ∈ ℜ,
\[
y_i = f(\mathbf{x}_i). \tag{2.44}
\]
A GP can be used to specify a prior distribution over the relationship, f ∼ GP(µ, k). In Figure 2.5 samples from a GP prior with a covariance function specified by an RBF kernel and a constant zero mean function are shown. As can be seen, the samples are all smooth with respect to the input locations x. However, there is an additional property that the GP needs to fulfill to specify a valid distribution over f: consistency. A function f is consistent in the sense that the relationship between the target and input domains is fixed. For a distribution, this implies that evaluating the distribution over a finite subset X_i ⊂ X does not alter the distribution over any other subset X_j ⊂ X, even if X_i ∩ X_j = ∅. It is clear that a GP satisfies this condition if the covariance function specifies a valid covariance matrix when evaluated over a finite number of points, as
\[
(\mathbf{y}_1, \mathbf{y}_2) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \;\Rightarrow\;
\begin{cases}
\mathbf{y}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) \\
\mathbf{y}_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})
\end{cases} \tag{2.45}
\]
for a Gaussian distribution.
In regression we are interested in modeling the relationship between two domains, X ∈ ℜ^D and Y ∈ ℜ, from a set of observations x_i ∈ X and y_i ∈ Y, where i = 1 … N. Assuming a functional relationship, and that the observations have been corrupted by additive Gaussian noise, we are interested in modeling,
\[
y_i = f(\mathbf{x}_i) + \epsilon, \tag{2.46}
\]
where ε ∼ N(0, β^{-1}). We are interested in encoding our prior knowledge about the relationship in a distribution over f. For regression we usually have a preference for functions varying smoothly over X,
\[
\lim_{\mathbf{x}_i \to \mathbf{x}_j^{+}} |f(\mathbf{x}_i) - f(\mathbf{x}_j)| = \lim_{\mathbf{x}_i \to \mathbf{x}_j^{-}} |f(\mathbf{x}_i) - f(\mathbf{x}_j)| = 0, \quad \forall \mathbf{x}_j \in X.
\]
This assumption can be encoded by the GP through the choice of covariance function k(x, x′). The covariance function encodes how we expect variables to vary together,
\[
k(\mathbf{x},\mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - \mu(\mathbf{x}))(f(\mathbf{x}') - \mu(\mathbf{x}'))\big];
\]
this means that we can encode the smoothness behavior over X by choosing a covariance function which is smooth over the same domain. The mean function µ(x) = E[f(x)] encodes the expected value of f. By translating the observed data to be centered around zero, the mean function can, for simplicity, be chosen as the constant zero function.
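Samples like those in Figure 2.5 can be generated in a few lines (my own sketch; the θ values and the jitter term added for numerical stability are my choices):

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(-3, 3, 100)
theta1, theta2 = 1.0, 2.0

# RBF covariance evaluated over the finite index set x
K = theta1 * np.exp(-0.5 * theta2 * (x[:, None] - x[None, :]) ** 2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))  # jitter for stability

# three draws f ~ GP(0, k), each a smooth function of x
samples = L @ rng.normal(size=(len(x), 3))
```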
2.7.1 Prediction<br />
Having specified a prior distribution encoding our knowledge (and preferences) about the relationship between X and Y, we are interested in inferring the output y∗ corresponding to a previously unobserved point x∗ ∈ X. The joint distribution of the observed data (y, x) and the unobserved point (y∗, x∗) can be
Figure 2.5: Samples from a GP prior using an RBF covariance function and a constant zero mean function. As can be seen, each sample is smooth over the input domain.

written as follows,
\[
\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\;
\begin{bmatrix}
\mathbf{K}(\mathbf{X},\mathbf{X}) + \beta^{-1}\mathbf{I} & \mathbf{K}(\mathbf{X},\mathbf{x}_*) \\
\mathbf{K}(\mathbf{x}_*,\mathbf{X}) & k(\mathbf{x}_*,\mathbf{x}_*) + \beta^{-1}
\end{bmatrix}\right).
\]
Predictions at unobserved locations are made from the posterior distribution, which is formulated by conditioning the joint distribution on the observed data. Conditioning a joint Gaussian results in a Gaussian distribution, defined by its mean and covariance,
\[
\bar{y}(\mathbf{x}_*) = k(\mathbf{x}_*,\mathbf{X})(\mathbf{K} + \beta^{-1}\mathbf{I})^{-1}\mathbf{y}
\]
\[
\sigma^2(\mathbf{x}_*) = \left(k(\mathbf{x}_*,\mathbf{x}_*) + \beta^{-1}\right) - k(\mathbf{x}_*,\mathbf{X})(\mathbf{K} + \beta^{-1}\mathbf{I})^{-1}k(\mathbf{X},\mathbf{x}_*), \tag{2.47}
\]
where K = k(X, X). These equations are the central predictive equations in the GP framework. In Figure 2.6 samples from the posterior distribution of a GP with an RBF covariance function and a constant zero mean function are shown. As can
Figure 2.6: Samples from a GP posterior using an RBF covariance function and a constant zero mean function. Each sample from the posterior distribution passes through the previously observed data, shown as red dots.

be seen, each function drawn from the distribution passes through the training data points.
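Eq. 2.47 translates directly into code; a sketch (my own; zero mean function, RBF covariance, arbitrary hyper-parameter values):

```python
import numpy as np

def gp_predict(X, y, Xs, theta1=1.0, theta2=4.0, beta=100.0):
    """GP predictive mean and variance of Eq. 2.47 (1-D inputs)."""
    def k(A, B):
        return theta1 * np.exp(-0.5 * theta2 * (A[:, None] - B[None, :]) ** 2)
    K = k(X, X) + np.eye(len(X)) / beta        # K + beta^{-1} I
    Ks = k(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = theta1 + 1.0 / beta - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(X)
mean, var = gp_predict(X, y, X)   # predict back at the training inputs
```

With a small noise variance the predictive mean passes close to the training targets, mirroring the posterior samples in Figure 2.6.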
2.7.2 Training<br />
The covariance function specifies the class of functions most prominent in the prior. A commonly used covariance function is the RBF kernel,
\[
k(\mathbf{x}_i,\mathbf{x}_j) = \theta_1 e^{-\frac{\theta_2}{2}\|\mathbf{x}_i-\mathbf{x}_j\|_2^2}.
\]
The free parameters {θ_1, θ_2, …} of the covariance function, together with the noise variance β^{-1}, are the hyper-parameters¹ of the GP, Φ = {θ_1, …, β}. Our knowledge about the relationship is encoded in the prior over f by setting the values of Φ. However, in the presence of data we can learn the hyper-parameters directly from the observations. Assuming that the observations have been corrupted by additive Gaussian noise, Eq. 2.46, we can formulate the likelihood of the data. Combining the likelihood with the prior, we arrive at the marginal likelihood through integration over f,
\[
p(\mathbf{Y}|\mathbf{X},\Phi) = \int p(\mathbf{Y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{X},\Phi)\,d\mathbf{f}. \tag{2.48}
\]

¹ Assuming that the mean function has no free parameters.
From the marginal likelihood we can seek the maximum likelihood solution for the hyper-parameters Φ,
\[
\hat{\Phi} = \operatorname*{argmax}_{\Phi}\; p(\mathbf{Y}|\mathbf{X},\Phi). \tag{2.49}
\]
This is referred to as training in the GP framework. It might seem undesirable to optimize over the hyper-parameters, as the model might over-fit the data². Inspection of the logarithm of Eq. 2.48 for a one-dimensional output y,
\[
\log p(\mathbf{y}|\mathbf{X}) = \underbrace{-\frac{1}{2}\mathbf{y}^{\top}\mathbf{K}^{-1}\mathbf{y}}_{\text{data-fit}} \;\underbrace{-\,\frac{1}{2}\log|\mathbf{K}|}_{\text{complexity}} \;-\; \frac{N}{2}\log 2\pi, \tag{2.50}
\]
shows two "competing terms", the data-fit term and the complexity term. The complexity term measures and penalizes the complexity of the model, while the data-fit term measures how well the model fits the data. This "competition" discourages the GP model from over-fitting the data.

² By setting the noise variance β^{-1} to zero, the function f will pass exactly through the observed data Y.
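The two competing terms of Eq. 2.50 can be evaluated directly; a sketch with an RBF covariance (my own; hyper-parameter values arbitrary):

```python
import numpy as np

def gp_log_marginal(X, y, theta1, theta2, beta):
    """Log marginal likelihood of Eq. 2.50 for an RBF covariance (1-D inputs)."""
    N = len(X)
    K = theta1 * np.exp(-0.5 * theta2 * (X[:, None] - X[None, :]) ** 2)
    K += np.eye(N) / beta                      # noise enters the covariance
    sign, logdet = np.linalg.slogdet(K)
    data_fit = -0.5 * y @ np.linalg.solve(K, y)
    complexity = -0.5 * logdet
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)

X = np.linspace(-2, 2, 20)
y = np.sin(X)
ll = gp_log_marginal(X, y, theta1=1.0, theta2=4.0, beta=100.0)
```

Maximizing this quantity over (θ_1, θ_2, β), e.g. by gradient ascent, is exactly the training step of Eq. 2.49.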
2.7.3 Relevance Vector Machine<br />
In this thesis our main use of Gaussian Processes will be as a tool to model functions. A different regression model is the Relevance Vector Machine (RVM) [63, 64]. In the RVM the mapping y_i = f(x_i) is modeled as a linear combination of the responses of a kernel function to the training data,
\[
f(\mathbf{x}_i) = \sum_{j=1}^{N} w_j\, c(\mathbf{x}_i,\mathbf{x}_j) + w_0, \tag{2.51}
\]
where w = [w_0, …, w_N] are the model weights and c(·, x_j) the kernel basis functions. One approach to finding the weights of the model would be to minimize a reconstruction error over the training data. However, this is likely to lead to severe over-fitting, as we are trying to estimate N + 1 parameters from N data points. Further, predictions would only be point-estimates with no associated uncertainty.
The RVM was suggested as a model to tackle the above issues. The model specifies a likelihood of the data through which the parameters can be found, associating each prediction with an uncertainty. Further, to avoid over-fitting, a prior is specified over the weights w. This prior encourages the model to push as many weights w_i as possible towards 0, making the linear combination in Eq. 2.51 depend on as few basis functions c(·, x_j) as possible.
Assuming additive Gaussian noise, the likelihood of the model is formulated as,
\[
p(\mathbf{y}|\mathbf{w},\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}} e^{-\frac{1}{2\sigma^2}\|\mathbf{y}-\tilde{\mathbf{C}}\mathbf{w}\|^2}, \tag{2.52}
\]
where $\tilde{\mathbf{C}}$ is referred to as the design matrix, with elements,
\[
\tilde{C}_{ij} = c(\mathbf{x}_i,\mathbf{x}_{j-1}), \qquad \tilde{C}_{i1} = 1.
\]
A Gaussian prior is placed over the weights w,
\[
p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i\,|\,0,\alpha_i^{-1}), \tag{2.53}
\]
controlled by the hyper-parameters α = [α_0, …, α_N]. The prior Eq. 2.53 over the model parameters encourages the weights w to be zero.
Through Bayes' rule the posterior over the weights, p(w|y, α, σ²), can be formulated, from which, by integration over the weights w, the marginal likelihood of the model can be found,
\[
p(\mathbf{y}|\boldsymbol{\alpha},\sigma^2) = \left(\frac{1}{2\pi}\right)^{\frac{N}{2}} \big|\mathbf{B} + \tilde{\mathbf{C}}\mathbf{A}^{-1}\tilde{\mathbf{C}}^{\top}\big|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\mathbf{y}^{\top}(\mathbf{B} + \tilde{\mathbf{C}}\mathbf{A}^{-1}\tilde{\mathbf{C}}^{\top})^{-1}\mathbf{y}}, \tag{2.54}
\]
where A = diag(α0, . . .,αN) and B = σ 2 I. The optimal parameters α and σ 2<br />
can be found by optimizing the marginal likelihood. Due to the prior of Eq. 2.53, it is reported in [63] that when optimizing the marginal likelihood many of the hyper-parameters α_i tend towards infinity, meaning that the associated weights are close to zero. The corresponding kernel functions then have little influence on the prediction of Eq. 2.51. A pruning scheme is incorporated in the optimization that removes weights tending to zero from the expansion,
forcing the model to explain the data using few kernel functions, leading to a sparse model.
As noted in [64, 45, 14] the RVM is a special case of a GP with covariance<br />
function,<br />
k(x_i, x_j) = \sum_{l=1}^{N} \frac{1}{\alpha_l}\, c(x_i, x_l)\, c(x_j, x_l), (2.55)
where c is the kernel basis function as in Eq. 2.51. This covariance function is different in form in that it depends on the training data x_l. Further, it corresponds to a degenerate covariance matrix of at most rank N, as it is an expansion around the training data. Training the RVM is the same as optimizing a GP regression model, i.e. finding the hyper-parameters that maximize the marginal likelihood of the model. However, as noted in [45], the covariance
function of the RVM has some undesirable effects. Using a standard RBF<br />
kernel for the GP the predictive variance associated with a point far away<br />
from the training data will be large, i.e. the model will be uncertain in regions<br />
where it has not previously seen data. The opposite is true under the covariance function specified by the RVM, as both terms in the predictive variance Eq. 2.47 will be close to zero, while for a standard RBF kernel the first term will be large.
2.8 GP-LVM<br />
Lawrence [33] suggested an alternative Gaussian latent variable model capable of handling non-linear generative mappings while at the same time avoiding the problems associated with the GTM. Both PPCA and the GTM specify a
prior over the latent locations and seek a maximum likelihood solution for the<br />
parameters of the generating mapping. However, from a Bayesian perspective<br />
both the mapping and the latent locations are nuisance parameters and should<br />
therefore be marginalized. In Lawrence's formulation, the prior is specified over the mapping instead of the latent locations, and the marginal likelihood over the mapping is formulated. Using the GP framework a rich and flexible prior over non-linear mappings can be specified. The algorithm is referred to as the Gaussian process Latent Variable Model (GP-LVM).
By marginalizing over the mapping f the GP-LVM proceeds by seeking the<br />
maximum likelihood solution to the latent locations X and the hyper-parameters<br />
Φ of the GP,<br />
\{\hat{X}, \hat{\Phi}\} = \operatorname{argmax}_{\{X,\Phi\}}\, p(Y|X,\Phi) = \operatorname{argmax}_{\{X,\Phi\}} \int p(Y|X, f, \Phi)\, p(f)\, df, (2.56)
where p(f) = GP(µ(x), k(x,x ′ )). The posterior distribution of the data can be<br />
written as,<br />
p(X,Φ|Y) ∝ p(Y|X,Φ)p(X)p(Φ). (2.57)<br />
In the standard GP-LVM formulation uninformative priors [33] are specified over the latent locations and the hyper-parameters. Learning in the GP-LVM framework consists of minimizing the negative log posterior of the data with respect to the locations of the latent variables X and the hyper-parameters θ of the process. A simple spherical Gaussian prior over the latent locations together with an uninformative
prior over the parameters leads to the following objective,<br />
L = L_r + \sum_i \ln \theta_i + \frac{1}{2} \sum_i \|x_i\|^2. (2.58)
For a covariance function specifying a distribution over linear functions, a closed-form solution to Eq. 2.56 exists [33]. However, for general covariance functions the solution is found through gradient-based optimization.
As previously discussed, infinitely many solutions to the latent variable formulation of dimensionality reduction exist; to proceed, the solution needs to be constrained by prior information. The GP-LVM solution is constrained by the GP marginal likelihood's trade-off between smooth solutions and a good data-fit, Eq. 2.50. By fixing the dimensionality of the latent representation a solution can be found.
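For concreteness, the objective of Eq. 2.58 can be evaluated for a given X as follows. This is our own sketch under an assumed RBF covariance with hyper-parameters (signal variance, length-scale, noise variance); it is the quantity a gradient-based optimizer would minimize, not code from [33].

```python
import numpy as np

def gplvm_objective(X, Y, theta):
    """Negative log posterior of the GP-LVM (cf. Eqs. 2.56-2.58), sketched.
    X: (N,q) latent locations, Y: (N,D) observations,
    theta = (signal_var, lengthscale, noise_var) of an assumed RBF kernel."""
    N, D = Y.shape
    s2, ell, n2 = theta
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = s2 * np.exp(-0.5 * d2 / ell ** 2) + n2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    # negative log marginal likelihood: one zero-mean GP per output dimension
    nll = 0.5 * D * (N * np.log(2 * np.pi) + logdet) \
        + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))
    # spherical Gaussian prior on X plus the ln(theta_i) terms of Eq. 2.58
    return nll + 0.5 * (X ** 2).sum() + np.log(np.asarray(theta)).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
Y = rng.standard_normal((30, 5))
L = gplvm_objective(X, Y, (1.0, 1.0, 0.1))
```

In practice both X and theta are optimized jointly by passing the gradients of this scalar to a conjugate-gradient or quasi-Newton routine.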
2.8.1 Latent Constraints
The GP-LVM objective seeks the locations of the latent coordinates X that max-<br />
imize the marginal likelihood of the data. One advantage of directly optimizing<br />
the latent locations is that additional constraints on X can easily be incorporated<br />
in the GP-LVM framework. In the following section we will review some of the<br />
extensions in terms of latent constraints that have been applied to the GP-LVM.
The fundamental difference between spectral and generative dimensionality<br />
reduction is the assumption made by the spectral algorithms that the latent coordi-<br />
nates can be found as a smooth mapping from the observed data. This means that<br />
we are interested in finding latent locations such that the locality in the observed<br />
data is preserved. Further, this assumption implies that a smooth inverse to the
generative mapping Eq. 2.4 exists. This assumption constrains the
spectral algorithms and makes their objective function convex. Even though in this thesis we argue that the assumption of the existence of a smooth inverse to the generative mapping is a limitation, there are modeling scenarios where we are interested in retaining the locality of the observed data in the latent representation, for example when modeling motion capture data [35]. In [35] a constrained form of the GP-LVM is presented. Each latent location x_i is represented
as a smooth mapping, referred to as a back-constraint, from the observed data,<br />
xi = g(yi,B), (2.59)<br />
with parameters B. Rather than directly optimizing the latent locations, the incorporation of the back-constraints alters the GP-LVM objective to seek the parameters of the back-constraints, B, that maximize the likelihood of the observed data Y.
As discussed, the GP-LVM objective is severely under-constrained in the general case. This means that a good initialization of the latent locations is of essential importance for finding a good solution. However, when learning a back-constrained model, the preservation of locality in the observed space will in practice constrain the solution sufficiently, such that the algorithm becomes less reliant on the initialization of the latent locations, now parameterized by the back-constraint parameters B, Eq. 2.59. This means that for practical purposes we can
reach the solution of the back-constrained GP-LVM with less careful initialization<br />
compared to the standard GP-LVM model.<br />
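A common concrete choice, assumed here purely for illustration, is to take g in Eq. 2.59 to be a kernel regression from the observations, x_i = Σ_j k(y_i, y_j) B_j; the kernel, its width and the names below are ours. With this parameterization, nearby (and in particular identical) observations are necessarily mapped to nearby latent points:

```python
import numpy as np

def back_constraint(Y, B, gamma=1.0):
    """Kernel-regression back-constraint X = K(Y, Y) B (cf. Eq. 2.59):
    the latent points are a smooth function of the observations."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2) @ B       # (N, q) latent locations

rng = np.random.default_rng(0)
Y = rng.standard_normal((40, 10))        # observations
B = rng.standard_normal((40, 2))         # parameters optimized in place of X
X = back_constraint(Y, B)
# duplicating an observation necessarily duplicates its latent point
Y2 = np.vstack([Y, Y[:1]])
B2 = np.vstack([B, np.zeros((1, 2))])
X2 = back_constraint(Y2, B2)
```

The optimizer then moves B instead of X, so the locality constraint is enforced by construction rather than by penalty.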
In the standard formulation of the GP-LVM an uninformative prior is specified
over the latent locations X, Eq. 2.58. Rather than specifying this uninformative prior, in [67] a model incorporating an informative class-based prior distribution over the latent locations is suggested. Incorporating this means that we can learn
a latent representation that can be interpreted in terms of class. One advantage of this, explored in [67], is to learn latent representations as input spaces for classifiers. The objective in [67] is, rather than just efficiently representing the data as in the standard GP-LVM, to find latent representations which are well suited for classification, i.e. where the classes are easily separable. Practically this is achieved by incorporating the class-based objective of Linear Discriminant Analysis (LDA), which aims to minimize within-class variability and maximize between-class separability. This can be encoded in the GP-LVM by replacing the uninformative spherical Gaussian prior over the latent coordinates with,
p(X) = \frac{1}{Z_d} \exp\left(-\frac{1}{\lambda^2} \operatorname{tr}\left(S_B^{-1} S_W\right)\right), (2.60)

where S_B encodes the between-class and S_W the within-class variability, and Z_d is a normalizing constant. The between-class variability is computed as,

S_B = \sum_{i=1}^{L} \frac{N_i}{N} (\mu_i - \mu)(\mu_i - \mu)^T, (2.61)

where \mu_i is the mean of class i and \mu the mean over all classes. The within-class variability is computed as,

S_W = \frac{1}{N} \sum_{i=1}^{L} \sum_{k=1}^{N_i} (x_k^i - \mu_i)(x_k^i - \mu_i)^T. (2.62)
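The two scatter matrices and the resulting log prior can be computed as follows. This is our own sketch of Eqs. 2.60–2.62 (the small ridge on S_B is our addition for numerical stability); note that the prior is larger, i.e. less negative, when the classes are well separated in the latent space:

```python
import numpy as np

def lda_scatter(X, labels):
    """Between-class S_B (Eq. 2.61) and within-class S_W (Eq. 2.62)
    scatter matrices of the latent points X (N x q)."""
    N, q = X.shape
    mu = X.mean(0)
    S_B = np.zeros((q, q))
    S_W = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(0)
        S_B += (len(Xc) / N) * np.outer(mc - mu, mc - mu)
        S_W += (Xc - mc).T @ (Xc - mc) / N
    return S_B, S_W

def log_prior(X, labels, lam=1.0):
    """log p(X) of Eq. 2.60 up to the normalizer Z_d."""
    S_B, S_W = lda_scatter(X, labels)
    S_B = S_B + 1e-8 * np.eye(X.shape[1])   # our ridge for stability
    return -np.trace(np.linalg.solve(S_B, S_W)) / lam ** 2

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_mixed = rng.standard_normal((60, 2))   # classes on top of each other
X_sep = X_mixed + centers[labels]        # classes pulled apart
```

Maximizing the GP-LVM posterior with this prior therefore trades data-fit against class separability in the latent space.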
Similarly to the discriminative GP-LVM a model incorporating constraints on
the latent coordinates was presented in [70]. By constraining the topology of the latent space, representations with interpretable latent dimensions can be found. The model is applied to human motion data to perform non-trivial transitions between styles of motion not present in the training data.
In [77, 76] a model referred to as the Gaussian Process Dynamical Model (GPDM) is presented. The GPDM is a latent variable model derived from the GP-LVM that incorporates time-series information in the data to learn a latent representation that respects the dynamics of the data. This is achieved by incorporating an auto-regressive model on the latent space in addition to the generative mapping,
x_t = h(x_{t-1}) + \epsilon_{dyn}. (2.63)
By specifying a GP prior over the mapping h, the dynamic mapping can, similarly to the generative mapping f, be marginalized from the GP-LVM to form the GPDM objective,

\{\hat{X}, \hat{\theta}, \hat{\theta}_{dyn}\} = \operatorname{argmax}_{\{X,\theta,\theta_{dyn}\}}\, p(Y|X,\theta)\, p(X|\theta_{dyn}). (2.64)
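The role of the dynamics term p(X|θ_dyn) can be illustrated numerically. The sketch below (our own construction, with an assumed RBF covariance on the dynamics GP) scores a smooth latent trajectory against white noise; the auto-regressive prior clearly favours the former:

```python
import numpy as np

def dyn_nll(X, theta=(1.0, 1.0, 0.1)):
    """- log p(X | theta_dyn) for a first-order GP dynamics model
    (Eq. 2.63): x_2..x_T are regressed on x_1..x_{T-1}."""
    s2, ell, n2 = theta
    Xin, Xout = X[:-1], X[1:]
    d2 = ((Xin[:, None, :] - Xin[None, :, :]) ** 2).sum(-1)
    K = s2 * np.exp(-0.5 * d2 / ell ** 2) + n2 * np.eye(len(Xin))
    _, logdet = np.linalg.slogdet(K)
    T, q = Xout.shape
    return 0.5 * q * (T * np.log(2 * np.pi) + logdet) \
        + 0.5 * np.trace(np.linalg.solve(K, Xout @ Xout.T))

t = np.linspace(0, 4 * np.pi, 60)
X_smooth = np.column_stack([np.cos(t), np.sin(t)])  # smooth trajectory
X_rough = np.random.default_rng(0).standard_normal((60, 2))
nll_smooth = dyn_nll(X_smooth)
nll_rough = dyn_nll(X_rough)
```

Adding this term to the GP-LVM objective biases the optimization towards latent trajectories whose successive points are predictable from one another.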
Many types of data are generated from models with a known underlying or<br />
latent structure. This is, for example, true for the human body whose motion can<br />
be decomposed into a tree structure. For modeling this type of data the hierarchi-<br />
cal GP-LVM (HGP-LVM) model was developed [34]. In the HGP-LVM model a<br />
hierarchical tree structure of latent spaces is learned, where each latent space acts as a prior on the latent coordinates of a space further down in the hierarchy.
2.8.2 Summary<br />
In the context of our presentation it might seem out-of-place to present the GP-LVM separately from the generative methods. Our motivation for this is two-fold: first, the framework we are about to present is an extension of the GP-LVM; secondly, as stated above, for general covariance functions the GP-LVM solution is found through gradient-based methods. To avoid the effect of local minima in the log-likelihood, the latent locations X need to be initialized close to the global optimum. The approach suggested in [33] was to initialize with the solution from a different dimensionality reduction algorithm. For the general case, where we cannot assume a linear manifold, the lack of reliable non-linear generative methods means that the initialization is usually taken from the solution of a spectral algorithm. This means that even though the GP-LVM is a purely generative model, for practical applications in the general case it relies on the existence of an analogous spectral algorithm. In this context the GP-LVM is a generative algorithm that sets out to improve the solution given by a spectral algorithm; this motivates our separation of the GP-LVM from the other reviewed generative dimensionality reduction algorithms.
The GP-LVM framework has been shown to be very flexible and has been applied to model a large variety of different data: motion capture [24, 77, 76, 75], tracking [71, 69, 72, 28], human pose estimation [62] and modeling of deformable surfaces [49], to name a small subset.
2.9 Shared Dimensionality Reduction
Many modeling scenarios are characterized by several different types of observa-<br />
tions that share a common underlying structure. This can be the same text written<br />
in two different languages where both representations are different in form but<br />
have the same meaning or underlying concept, or two image-sets which share the same modes of variability, for example pose or lighting. This correspondence can be exploited for dimensionality reduction of the data; this we will refer to as shared dimensionality reduction. In shared dimensionality reduction each ob-
servation space is generated from the same latent variable X. We will focus on<br />
the scenario when we have two observation spaces Y and Z which have been<br />
generated from the latent variable X through generative mappings fY and fZ,<br />
y_i = f_Y(x_i) (2.65)
z_i = f_Z(x_i). (2.66)
Shared spectral dimensionality reduction is built on the assumption that smooth inverses to both generative mappings f_Y and f_Z exist. That both mappings are invertible implies that the observation spaces are related through a bijection, meaning that the location in one observation space is sufficient for determining the corresponding location in the other observation space. By shared spectral dimensionality reduction we thus mean the alignment of several intrinsic representations into a single shared low-dimensional representation. In [26, 25, 82] algorithms for aligning two manifolds through the use of proximity graphs, for either a full or partial correspondence, are presented.
A GP-LVM model learning a shared latent representation was suggested in<br />
[52]. The model learns two GP regressors, one from a shared latent variable to each separate observation space, by maximizing the joint marginal likelihood,
\{\hat{X}, \hat{\Phi}_Y, \hat{\Phi}_Z\} = \operatorname{argmax}_{X,\Phi_Y,\Phi_Z}\, p(Y,Z|X,\Phi_Y,\Phi_Z) = \operatorname{argmax}_{X,\Phi_Y,\Phi_Z}\, p(Y|X,\Phi_Y)\, p(Z|X,\Phi_Z). (2.67)
The model was applied to learn a shared latent structure between the joint angle space of a humanoid robot and a human. Inference between the two different observation spaces was done by learning GP regressors from the observed spaces onto the latent space. The suggested model does not make any direct assumption about the form of the generative mappings at training time. However, as inference within the model is done by training mappings from the observed data back onto the learned latent representation, the inverse mapping is assumed to exist, which is why this model retains the central assumption of shared spectral dimensionality reduction: that the generative mappings are invertible.
2.10 Feature Selection<br />
Feature extraction is the process of reducing the amount of resources needed to represent a data set accurately. This is in contrast to feature selection, where only features with a positive impact in relation to a certain objective are retained in the representation. These can for example be features that are able to discriminate between two different classes. Dimensionality reduction algorithms are instances
Figure 2.7: Graphical model corresponding to the formulation of probabilistic CCA in [4].
of feature extraction. A closely related algorithm for feature selection is Canonical Correlation Analysis (CCA). Given two sets of observations Y and Z with known correspondences, CCA finds directions W_Y and W_Z in the respective observation spaces such that the correlation Eq. 2.68 between YW_Y and ZW_Z is maximized,
\rho = \frac{\operatorname{tr}\left(W_Y^T Y^T Z W_Z\right)}{\left(\operatorname{tr}\left(W_Z^T Z^T Z W_Z\right)\, \operatorname{tr}\left(W_Y^T Y^T Y W_Y\right)\right)^{\frac{1}{2}}}. (2.68)

Finding the first pair of directions w_1^Y and w_1^Z is formulated as the following constrained optimization problem,

\operatorname{argmax}_{w_1^Y, w_1^Z}\, (Y w_1^Y)^T Z w_1^Z (2.69)

subject to: (w_1^Y)^T Y^T Y w_1^Y = (w_1^Z)^T Z^T Z w_1^Z = 1. (2.70)

Further orthogonal directions can be found iteratively. The scaling of each basis is arbitrary; the constraints in Eq. 2.70 fix the variance of the canonical variates Y w_1^Y and Z w_1^Z to 1, which ensures that maximizing Eq. 2.69 equates to maximizing the correlation Eq. 2.68.
In [4] the probabilistic form of CCA is derived through the maximum likelihood solution to the following model,
x \sim \mathcal{N}(0, I) (2.71)
y|x \sim \mathcal{N}(W_Y x, \Psi_Y) (2.72)
z|x \sim \mathcal{N}(W_Z x, \Psi_Z). (2.73)
The model corresponds to generative CCA as long as the within-set, or non-shared, variations can be sufficiently described by a Gaussian noise model and the generative mappings are linear. The graphical model corresponding to the proposed model is shown in Figure 2.7. In [37] the model is extended to allow for non-linear mappings.
2.11 Summary<br />
In this chapter we have outlined some of the background upon which the material in the following chapters will build. Through elementary linear algebra and the introduction of the concept of "The Curse of Dimensionality" we have motivated the field of dimensionality reduction, which encapsulates the work presented in this thesis. We have reviewed algorithms from the two main strands of work composing the field of dimensionality reduction, exemplifying the strengths and weaknesses of each approach. Further, we detailed the basics of probabilistic modeling from the perspective of Gaussian processes, which will be fundamental for the following chapters.
In the next chapter we will proceed to present a novel model for generative dimensionality reduction in the shared modeling scenario, based on Gaussian processes.
Chapter 3<br />
Shared GP-LVM
3.1 Introduction<br />
Dimensionality reduction is the task of reducing the number of dimensions re-<br />
quired to describe a set of data. The previous chapter introduced dimensional-<br />
ity reduction and gave the necessary mathematical background upon which these<br />
techniques are built. We divide dimensionality reduction into two groups of al-<br />
gorithms: generative and spectral. Generative techniques are more versatile and<br />
applicable for a larger range of modeling scenarios compared to the spectral techniques. However, the objective functions of most generative algorithms are in the general case severely under-constrained. The spectral group of algorithms avoids this by constraining the solution to one for which a smooth inverse to the generative mapping exists. In scenarios where we are given observations in multiple different forms, we can exploit correspondences between observations when performing dimensionality reduction. We refer to this as shared dimensionality reduction.
The following chapter will introduce two generative models that exploit correspondences between observations when performing dimensionality reduction.
In many modeling scenarios we have access to multiple different observations of the same underlying phenomenon. Often a significantly different cost, computational or monetary, is associated with acquiring samples from each domain. In such scenarios it is of interest to infer the location of an expensive sample from one which we can more easily acquire. One example, which we will return to in the applications chapter, is image-based human pose estimation [69, 57]. This is the task of estimating the pose of a human from the evidence given in an image. Images can be captured relatively easily, while recording the actual pose of a human is associated with a significant cost, often requiring special rigs and expensive equipment. Therefore inferring the pose from image data is of great interest.
Learning the shared GP-LVM model presented in [52] is a three-stage process. In the first stage PCA is applied separately to each of the two sets of observations. In the second stage the GP-LVM model is trained using the average of the two PCA solutions as initialization. After this stage we have trained GP regressors modeling the generative mappings from the latent space onto the observed spaces. In the third and final stage a second set of GP regressors is trained that maps back from each of the observed spaces onto the latent space. Even though not explicitly stated, this implies that the generative mappings are assumed to have a smooth inverse. We will now proceed to introduce a more general model capable of transferring locations in one observation space to a corresponding space.
Figure 3.1: The left image shows the conditional model where the observed data Y has been generated from Z. The right image shows the shared GP-LVM model which we suggest as a replacement for the conditional model on the left. The back-mapping from the output space that constrains the latent space is represented by the dashed line.
3.2 Shared GP-LVM
Given two sets of corresponding observations Y = [y_1, ..., y_N] and Z = [z_1, ..., z_N], where y_i ∈ ℜ^{D_Y} and z_i ∈ ℜ^{D_Z}, we assume that each observation has been generated from the same low-dimensional manifold, corrupted by additive Gaussian noise,

y_i = f^Y(x_i) + \epsilon_Y, \quad \epsilon_Y \sim \mathcal{N}(0, \beta_Y^{-1} I)
z_i = f^Z(x_i) + \epsilon_Z, \quad \epsilon_Z \sim \mathcal{N}(0, \beta_Z^{-1} I), (3.1)

where the mappings f^Y and f^Z are parameterized by \phi^Y and \phi^Z respectively, and x_i ∈ ℜ^q with q/D_Y < 1 and q/D_Z < 1.
Our objective is to create a model from which the location zi ∈ Z correspond-<br />
ing to a given location yi ∈ Y can be determined. This can be done by modeling<br />
the conditional distribution over the input space Y given the output space Z as<br />
shown in the left image in Figure 3.1. The conditional distribution will associate<br />
each location in the output space with a location in the input space. However, for<br />
many applications the observation spaces are likely to be high-dimensional which
makes modeling this distribution problematic, especially in scenarios with a limited amount of training data. Therefore, rather than modeling p(y_i|z_i) directly, given that the representation of Z is redundant, we can find a reduced-dimensional representation X of the output space. This means that we can model the conditional distribution over this new dimensionality-reduced representation, p(y_i|x_i), which should be significantly easier. However, rather than simply modeling the conditional distribution over a dimensionality-reduced representation of the output space, we will formulate an objective which learns the latent representation together with the mappings.
From the Gaussian noise assumption Eq. 3.1 we can formulate the likelihood of the data,
P(Y,Z|f^Y, f^Z, X, \Phi_Y, \Phi_Z) = \prod_{i=1}^{N} p(y_i|f^Y, x_i, \Phi_Y)\, p(z_i|f^Z, x_i, \Phi_Z). (3.2)

Placing Gaussian process priors over the mappings and integrating them out leads to the marginal likelihood of the shared GP-LVM model,

P(Y,Z|X,\Phi_Y,\Phi_Z) = \prod_{i=1}^{N} p(y_i|x_i,\Phi_Y)\, p(z_i|x_i,\Phi_Z), (3.3)

p(y_i|x_i,\Phi_Y) = \int p(y_i|f^Y, x_i, \Phi_Y)\, p(f^Y)\, df^Y

p(z_i|x_i,\Phi_Z) = \int p(z_i|f^Z, x_i, \Phi_Z)\, p(f^Z)\, df^Z.
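Because the joint marginal likelihood of Eq. 3.3 factorizes into two independent GPs sharing the latent X, its negative logarithm is simply a sum of two standard GP terms. The sketch below is our own parameterization (assumed RBF covariances; names are ours), evaluating that quantity for given X:

```python
import numpy as np

def gp_nll(Y, K):
    """Negative log of prod_d N(y_:,d | 0, K) for Y of shape (N, D)."""
    N, D = Y.shape
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * (N * np.log(2 * np.pi) + logdet) \
        + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

def shared_gplvm_nll(X, Y, Z, thetaY, thetaZ):
    """Negative log joint marginal likelihood of Eq. 3.3: one latent X
    feeding two independent GPs (assumed RBF kernels)."""
    def K_rbf(theta):
        s2, ell, n2 = theta
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return s2 * np.exp(-0.5 * d2 / ell ** 2) + n2 * np.eye(len(X))
    return gp_nll(Y, K_rbf(thetaY)) + gp_nll(Z, K_rbf(thetaZ))

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 2))
Y = rng.standard_normal((25, 20))
Z = rng.standard_normal((25, 20))
L = shared_gplvm_nll(X, Y, Z, (1.0, 1.0, 0.1), (1.0, 1.0, 0.1))
```

Optimizing this scalar with respect to X, Φ_Y and Φ_Z recovers the training procedure described above; the coupling between the two observation spaces enters solely through the shared X.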
We are interested in finding a low-dimensional representation X which can<br />
be used as a complete substitute for Z. This means that the relationship between<br />
these two spaces should take the form of a bijection, i.e. that each point z_i ∈ Z is represented by a unique location x_i ∈ X and vice versa. There is nothing in
the shared model that encourages this asymmetry. However, this can be achieved<br />
by incorporating back-constraints [35] which represent the latent locations as a<br />
smooth parametric mapping from the output space Z,

x_i = h(z_i, W). (3.4)

This leads to the following objective,

\{\hat{W}, \hat{\Phi}_Y, \hat{\Phi}_Z\} = \operatorname{argmax}_{W,\Phi_Y,\Phi_Z} P(Y,Z|W,\Phi_Y,\Phi_Z). (3.5)
The objective can be optimized using gradient-based methods. In Figure 3.1 the
leftmost image shows the conditional model and the rightmost image shows the<br />
back-constrained shared GP-LVM model.<br />
3.2.1 Initialization<br />
There are several different options for how to initialize the latent locations for this model. One could, as with the shared GP-LVM in [52], initialize using the average of the embeddings from a spectral algorithm, or with the solution to one of the shared spectral algorithms. However, we want the latent representation
to be a complete representation of the output space Z. This means that we can<br />
initialize using the solution to a spectral algorithm with the output space Z as<br />
input. Underpinning this model is the assumption that the non-back-constrained<br />
observation spaces are governed by a subset of the degrees of freedom of the back-<br />
constrained observation space. By initializing with the solution of a spectral algorithm applied to the output space, we seek the solution for the latent representation where the intrinsic representation of the input space aligns with that of the output space.
3.2.2 Inference<br />
Once the model has been trained we are interested in inferring the output location corresponding to a point in the input space. As we are not assuming a functional relationship between Y and Z, we cannot simply learn a mapping as in [52]. Given a location y_i in the input space Y we want to infer the corresponding location z_i in the output space Z. The back-constraint from the output space to the latent space encodes the bijective relationship between Z and X. This implies that any multi-modalities in the relationship between Y and Z are contained in the mapping from the latent representation X to the input space Y. This means that to recover the location in the high-dimensional output space we only need to find the corresponding point in the much lower-dimensional latent space,
\hat{x}_i = \operatorname{argmax}_{x_i}\, p(y_i|x_i, X, \Phi_Y). (3.6)
Having found the location in the latent space, the corresponding location in the output space can be found through the mean prediction of the uni-modal posterior distribution,

\hat{z}_i = \operatorname{argmax}_{z_i}\, p(z_i|\hat{x}_i, X, \Phi_Z). (3.7)
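The two-stage inference of Eqs. 3.6–3.7 can be sketched as follows. For simplicity we replace the gradient-based search over the latent space with a crude search over candidate points near the training latents; the kernel choices, noise level and helper names below are our own assumptions:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def infer_z(y_star, X, Y, Z, noise=0.1):
    """Stage 1: pick x maximizing p(y*|x) over candidates (cf. Eq. 3.6).
    Stage 2: mean-predict z at that x (cf. Eq. 3.7)."""
    Kinv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    def neg_log_lik(x):
        kx = rbf(X, x[None, :])[:, 0]
        mean = Y.T @ Kinv @ kx                 # GP mean of y at x
        var = 1.0 + noise - kx @ Kinv @ kx     # predictive variance
        return np.sum((y_star - mean) ** 2) / var + len(y_star) * np.log(var)
    # crude stand-in for gradient-based optimization over the latent space
    cands = X + 0.05 * np.random.default_rng(0).standard_normal(X.shape)
    x_hat = min(cands, key=neg_log_lik)
    kx = rbf(X, x_hat[None, :])[:, 0]
    return Z.T @ Kinv @ kx                     # mean prediction of z

t = np.linspace(-1, 1, 30)[:, None]            # shared 1-D latent
Y = np.hstack([np.sin(2 * t), np.cos(2 * t)])  # input space
Z = np.hstack([t, t ** 3])                     # output space
z_hat = infer_z(Y[10], t, Y, Z)
```

Note that only the first stage has to contend with possible multi-modality; once x̂_i is fixed, the prediction of ẑ_i is a standard uni-modal GP mean.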
Figure 3.2: Each dimension of the underlying signals used to generate the high-dimensional data shown in Figure 3.3.
3.2.3 Example<br />
In the following section we will run through a toy example to illustrate the model presented above. We will generate two sets of high-dimensional data Y and Z, from which we will learn embeddings using the shared GP-LVM models. Both data-sets are generated from a single underlying signal t, which consists of N linearly spaced values between −1 and 1. A set of three signals is generated through non-linear mappings of t,
x_i^1 = \cos(\pi t_i) (3.8)
x_i^2 = \sin(\pi t_i) (3.9)
x_i^3 = \cos(\sqrt{5}\, \pi t_i + 2). (3.10)
We will refer to X = [x^1; x^2; x^3] as the generating signal of the data, shown in Figure 3.2. Through linear mappings of X we generate two sets of 20-dimensional signals Y and Z. To achieve this we draw values at random from a zero-mean unit-variance normal distribution and organize them into matrices P_1 ∈ R^{20×3} and P_2 ∈ R^{20×3}. From these transformation matrices we generate
Figure 3.3: Observed data Y (left) and Z (right) generated from Eqs. 3.11 and 3.12.
the high-dimensional observation signals through the generating signal X,

Y = XP_1^T + \lambda\eta (3.11)
Z = XP_2^T + \lambda\eta, (3.12)
where η are samples from a zero-mean unit-variance normal distribution and λ = 0.05. Figure 3.3 shows each dimension of the two high-dimensional data-sets. We proceed by applying the generated data to the shared GP-LVM model and the shared back-constrained model presented in this chapter. Each of the models is trained using linear kernels only, as we are interested in their ability to unravel the linear transformations P_1 and P_2 applied to the data. Figure 3.4 shows
the embeddings found by the two algorithms. As can be seen, both algorithms unravel the data and find three generating signals underlying it. We do not expect the algorithms to exactly recover the signal X, as there are several different linear transformations that could have generated the observed data. However, we see that both algorithms find two signals of "one period", corresponding to x^1 and x^2, and a higher-frequency signal corresponding to x^3. One way to quantify
Figure 3.4: Each dimension of the latent embeddings of the data in Figure 3.3, from the standard shared model (left) and the shared back-constrained model (right). Each model unravels the signals used to generate the data, shown in Figure 3.2.
the quality of the embeddings is to compute the Procrustes score [23] between the found embeddings and the generating signals. The Procrustes score is a measure of similarity between shapes represented as point sets. In Procrustes analysis the shape of an object is considered as belonging to an equivalence class, and a shape is defined as "all the geometrical information that remains when location, scale and rotational effects have been filtered out from an object" [23]. Two shapes can be compared to each other by removing these effects: by finding the best aligning linear transformation, the two point sets can be compared through the sum of squared distances between the points. Table 3.1 shows that both embeddings have low scores when compared to the generating signals.
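The toy construction of Eqs. 3.8–3.12 and the Procrustes comparison are both easy to reproduce. The sketch below uses our own minimal implementation of the disparity described in [23] (translation, scale and rotation are normalized out via an SVD); the random seed and the check signals are arbitrary:

```python
import numpy as np

def procrustes_score(A, B):
    """Minimal Procrustes disparity between point sets A and B (N x q)."""
    A = A - A.mean(0); B = B - B.mean(0)                 # remove translation
    A = A / np.linalg.norm(A); B = B / np.linalg.norm(B) # remove scale
    s = np.linalg.svd(A.T @ B, compute_uv=False)         # optimal rotation
    return 1.0 - s.sum() ** 2                            # disparity in [0, 1]

rng = np.random.default_rng(0)
N, lam = 100, 0.05
t = np.linspace(-1, 1, N)
X = np.column_stack([np.cos(np.pi * t),                    # Eq. 3.8
                     np.sin(np.pi * t),                    # Eq. 3.9
                     np.cos(np.sqrt(5) * np.pi * t + 2)])  # Eq. 3.10
P1, P2 = rng.standard_normal((20, 3)), rng.standard_normal((20, 3))
Y = X @ P1.T + lam * rng.standard_normal((N, 20))          # Eq. 3.11
Z = X @ P2.T + lam * rng.standard_normal((N, 20))          # Eq. 3.12
# a rotated/scaled/shifted copy of X scores ~0; unrelated signals score ~1
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
good = 2.0 * X @ R + 1.0
bad = rng.standard_normal((N, 3))
```

A low score against the generating signals thus indicates that an embedding has recovered X up to exactly the similarity transformations the score is designed to ignore.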
We will modify the previous example slightly to represent a different model-<br />
ing scenario. Just as in the previous example we generate two high-dimensional<br />
signals Y and Z through randomized linear mappings. However, in this case<br />
the observed data Y is generated from a two-dimensional signal x^Y = [x^1; x^2],
whereas Z is generated from the same signals as in the previous example. This<br />
results in the sets of generating signals shown in Figure 3.5. By drawing values
Figure 3.5: Underlying generating signals of the high-dimensional data Y (left) and Z (right) shown in Figure 3.6.
Figure 3.6: Each dimension of the observed data Y (left) and Z (right) generated from the underlying signals shown in Figure 3.5.
to form the transformation matrices as in the previous example, we can generate the observed signals shown in Figure 3.6. This example is meant to visualize the modeling scenario in which the input space Y does not share the same modes of variability as the output space Z. In this case the input space Y is generated from a subset of the signals generating the output space Z. Figure 3.7 shows the results of applying the two shared models as presented above. As can be seen in the leftmost plot in Figure 3.7, the shared GP-LVM model does not correctly unfold the data to recover the three generating dimensions of X. Rather, the shared model seems to represent the shared signals x_1 and x_2 but avoids representing x_3, which
Figure 3.7: Resulting embedding of applying the shared (left) and the shared back-constrained (right) GP-LVM model to the data shown in Figure 3.6. As can be seen, the shared model fails to unravel the generating signals while the back-constrained model correctly finds the signals underlying the high-dimensional data. The shared model assumes that both observation spaces are generated from the same underlying signals, which is not true for the data shown in Figure 3.6. The shared back-constrained GP-LVM model relaxes this assumption, assuming only that the generating signals of the non-back-constrained observation space are a subset of the generating signals of the back-constrained space.
is private to the output space. However, in the shared back-constrained model the latent space is constrained to be a smooth representation of the output space, and therefore all the degrees of freedom in the observed data will be encoded in the latent representation of the data. For the shared model, by contrast, the latent space is not guaranteed to be a full representation of the output space. This means that determining the latent location will not be sufficient for determining the location in the output space, which implies that we cannot replace the high-dimensional conditional model p(Y|Z) with the low-dimensional p(Y|X) as intended.
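The toy data described above can be generated with a few lines of numpy. The signal frequencies, the output dimensionality of eight and the noise level 0.01 are illustrative assumptions, not the exact values behind the figures.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 100)

# Generating signals: x1 and x2 complete "one period", x3 is higher frequency.
x1, x2, x3 = np.sin(t), np.cos(t), np.sin(3 * t)

X_Y = np.stack([x1, x2], axis=1)      # input space: only the shared signals
X_Z = np.stack([x1, x2, x3], axis=1)  # output space: shared plus private x3

# Random linear maps into high-dimensional observation spaces, plus a
# small amount of observation noise.
Y = X_Y @ rng.standard_normal((2, 8)) + 0.01 * rng.standard_normal((100, 8))
Z = X_Z @ rng.standard_normal((3, 8)) + 0.01 * rng.standard_normal((100, 8))
```

Up to the noise, Y spans a two-dimensional subspace and Z a three-dimensional one, which is exactly the mismatch the shared model struggles with.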
In the following example we will exemplify how the models act in a more general modeling scenario. As before, we will generate two high-dimensional data-sets Y and Z from a set of underlying generating signals. In this example each data-set will be generated from a single shared signal x_1 and one signal that
Example                              Model    Procrustes
Same generating signals              Shared   0.058
Same generating signals              Back     0.004
Two shared, one private for output   Shared   0.614
Two shared, one private for output   Back     0.028

Table 3.1: The Procrustes scores corresponding to the resulting embeddings of applying the different GP-LVM models to the data shown in Figure 3.3. As can be seen, the back-constrained model (referred to as Back) significantly outperforms the standard shared GP-LVM model.
Figure 3.8: Each dimension of the underlying signals generating the high-dimensional data Y (left) and Z (right) shown in Figure 3.9. Each signal is two-dimensional, with one dimension shared and one private to each data-set.
remains private to each observation: x_2 to Y and x_3 to Z. The generating signals are shown in Figure 3.8. Similarly to the previous examples, we generate two high-dimensional signals from random linear transformations of the generating signals. Each dimension of the observed data is shown in Figure 3.9. We apply a standard shared and a back-constrained shared model to the data. In Figure 3.10 the embeddings found by each model are shown. Both models represent the data using two latent dimensions. In the case of the shared model the latent locations seem to correspond to the generating signals of Y, while the back-constrained model seems to unravel the generating signals corresponding to Z. In the case
Figure 3.9: Each dimension of the high-dimensional observed data Y (left) and Z<br />
(right) generated from underlying signals shown in Figure 3.8.<br />
of the back-constrained model this is expected, as the model was specifically designed to learn a latent space which corresponds to Z being the output space of the model. Further, the latent locations are initialized using the solution of a spectral algorithm, in this case PCA, applied to Z, which means that there is a strong incentive for the model to focus on modeling Z rather than Y. For the standard shared model the interpretation of the resulting embedding is less obvious. The model does not favor the generation of either of the two observation spaces; however, the latent locations are initialized from the mean of the solutions of a spectral algorithm, in this case PCA, applied to both observation spaces Y and Z. As both data-sets are generated from two-dimensional signals, PCA will only recover two dimensions, and the mean of the solutions for each space will only be relevant for the shared generating signal, while the recovered private signals will be lost when taking the mean. As we know the dimensionality of the signal generating the observations, we can, rather than determining the latent dimensionality from the initialization, set it to equal the dimension of the generating signal. Figure 3.11 shows the embeddings found using three-dimensional latent spaces. As can be seen, neither model recovers the three generating signals shown in Figure 3.8.

Figure 3.10: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. Neither model succeeds in unraveling the three different underlying generating signals shown in Figure 3.8.

Figure 3.11: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. The latent dimensionality is set to equal the dimensionality of the generating signals. As can be seen, neither model manages to recover the three generating signals shown in Figure 3.8.

In the case of the back-constrained model this is to be expected, as the latent locations are constrained to be a smooth mapping from the observed data Z. As nothing of the signal x_2 private to Y is contained in Z, the model cannot correctly represent this in the latent space. This constraint does not exist for the standard shared model; however, as can be seen in Figure 3.11, neither does this model succeed in recovering the three generating signals [x_1; x_2; x_3]. The model is initialized using the mean of the data projected onto the three most representative principal components. As each data-set is generated from two-dimensional signals, the third principal component will, for each observation space, fit to the noise of the data. This means that the third latent dimension will be initialized to noise, from which the model never manages to recover in order to encode x_3.
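The initialization strategy discussed above can be sketched as follows. This is a minimal version; the function names are ours, and in practice the sign and ordering of principal components are arbitrary, which is exactly why averaging two independently computed embeddings can cancel the private structure.

```python
import numpy as np

def pca_embed(D, q):
    """Project the centred data D (N x dim) onto its q leading
    principal components."""
    Dc = D - D.mean(axis=0)
    U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
    return Dc @ Vt[:q].T

def init_shared_gplvm(Y, Z, q):
    """Initialise shared GP-LVM latent positions as the mean of the
    PCA embeddings of the two observation spaces."""
    return 0.5 * (pca_embed(Y, q) + pca_embed(Z, q))
```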
3.2.4 Summary<br />
The shared and back-constrained GP-LVM model presented above was created for the scenario where the input space can be modeled as a mapping from the output space. What this means is that all the degrees of freedom in the input space are contained in the output space. It does, however, not assume a bijective relationship between the two observation spaces as in [52], as a location in the input space can be associated with several locations in the output space, which was exemplified in Figure 3.7. However, as was shown in the example leading up to the embeddings in Figures 3.10 and 3.11, neither model is capable of modeling the more general scenario where each observed space shares a subset of the generating parameters but also contains private generating parameters. In the next section we will proceed to describe an algorithm designed for this specific
Figure 3.12: Subspace GP-LVM model. The two observation spaces Y and Z are generated from the latent variable X, factorized into the subspaces X^Y and X^Z representing the private information and X^S modeling the shared information. Φ_Y and Φ_Z are the hyper-parameters of the Gaussian Processes modeling the generative mappings f^Y and f^Z.
modeling scenario.<br />
3.3 Subspace GP-LVM<br />
The main limitation of the shared GP-LVM, and shared models in general, is the assumption that each data space is governed by the same degrees of freedom or modes of variability. In the previous section we relaxed this assumption, for the case where we are only interested in inference in one direction, by introducing the back-constraint to the shared model. We will now proceed by introducing a new shared GP-LVM model which further relaxes this assumption.
Similarly to the shared GP-LVM model, we are given two sets of corresponding observations $Y = [y_1, \ldots, y_N]$ and $Z = [z_1, \ldots, z_N]$, where $y_i \in \Re^{D_Y}$ and $z_i \in \Re^{D_Z}$. We assume that each observation has been generated from low-dimensional manifolds corrupted by additive Gaussian noise,

$$y_i = f^Y(u_i^Y) + \epsilon^Y, \qquad \epsilon^Y \sim \mathcal{N}(0, \beta_Y^{-1} I)$$
$$z_i = f^Z(u_i^Z) + \epsilon^Z, \qquad \epsilon^Z \sim \mathcal{N}(0, \beta_Z^{-1} I),$$

where $u_i^Y \in \Re^{q_Y}$ and $u_i^Z \in \Re^{q_Z}$ with $\frac{q_Y}{D_Y} < 1$ and $\frac{q_Z}{D_Z} < 1$. We assume that the two manifolds can be parameterized in such a manner that they share a common non-empty subspace $X^S$,

$$X^S \subset U^Y, \quad X^S \subset U^Z, \quad X^S \neq \emptyset, \qquad (3.13)$$
which is referred to as the shared subspace $X^S$. This assumption implies that it is possible to parameterize each observation space such that a subset of the degrees of freedom of each observation space is shared. Representing each manifold in terms of $X^S$ introduces an additional subspace associated with each manifold,

$$U^Y = [X^S; X^Y], \quad U^Z = [X^S; X^Z], \qquad (3.14)$$

which is referred to as the private (or non-shared) subspace. The full latent representation of both observation spaces, $X$, is the concatenation of the shared and private subspaces, $X = [X^S; X^Y; X^Z]$. Note that the private spaces are subspaces of $X$, with the full latent space representing shared and non-shared variance in a factorized form.
A shared GP-LVM model can be constructed respecting the factorized latent structure. The GP-LVM learns two separate mappings generating each observation space, where the input space for the GP generating $Y$ is $U^Y$ and for $Z$ is $U^Z$, leading to the following objective,

$$\{\hat{X}, \hat{\Phi}_Y, \hat{\Phi}_Z\} = \operatorname*{argmax}_{X, \Phi_Y, \Phi_Z} p(Y, Z \,|\, X, \Phi_Y, \Phi_Z)$$
$$= \operatorname*{argmax}_{X^S, X^Y, X^Z, \Phi_Y, \Phi_Z} p(Y \,|\, X^S, X^Y, \Phi_Y)\, p(Z \,|\, X^S, X^Z, \Phi_Z). \qquad (3.15)$$
The latent structure presented above is capable of separately modeling the shared and the non-shared variance. This is achieved by letting the mappings $f^Y$ and $f^Z$ only be active on the subspaces associated with each observation.
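For fixed latent positions, the objective of Eq. 3.15 can be evaluated as sketched below, with an RBF covariance standing in for the GP kernels. The kernel choice, `gamma` and the noise level are illustrative assumptions, and a real implementation would optimize the latent positions and hyper-parameters by gradient ascent on this quantity.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0, noise=1e-2):
    """RBF covariance matrix over latent positions X, with additive noise."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * gamma * d2) + noise * np.eye(len(X))

def gp_log_likelihood(K, D):
    """log p(D | X): an independent GP prior over each column of D."""
    N, dims = D.shape
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, D)                 # K^{-1} D
    return (-0.5 * dims * logdet
            - 0.5 * np.sum(D * alpha)             # tr(K^{-1} D D^T)
            - 0.5 * N * dims * np.log(2.0 * np.pi))

def joint_objective(X_S, X_Y, X_Z, Y, Z):
    """Eq. 3.15: each generative mapping is only active on the subspace
    associated with its own observation space."""
    K_Y = rbf_kernel(np.hstack([X_S, X_Y]))       # f^Y sees [X^S; X^Y]
    K_Z = rbf_kernel(np.hstack([X_S, X_Z]))       # f^Z sees [X^S; X^Z]
    return gp_log_likelihood(K_Y, Y) + gp_log_likelihood(K_Z, Z)
```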
3.4 Extensions<br />
As with the standard GP-LVM model, additional constraints such as back-constraints or dynamic models can be placed over the latent variables. However, just as with the generating mappings, we can limit the constraints to only be active on subspaces of the latent space. In the applications chapter we will evaluate the performance of shared subspace models with dynamic constraints.
3.5 Applications<br />
Applications in a wide variety of fields are concerned with inferring a specific variable, the output variable, from evidence given by the parameters of a different variable, the input variable. In its simplest form, when the output variable is related to the input variable by a function, this is called a regression problem. However, for many applications this cannot be assumed, as the input variable is not sufficiently discriminative of the output variable. In such cases there might be several different locations in the output space that correspond to a specific location in the input space. For such an estimation task we would ideally like to recover each possible output location corresponding to the given input.
A classic example of an application where the input space is not sufficient for discriminating the output is the computer vision task of image-based human pose estimation. The task is concerned with estimating the pose parameters of a human from an image. In the applications chapter we will demonstrate how the models proposed in this chapter can be applied to this task. In specific, the application of the subspace GP-LVM model is interesting since it finds a factorized representation of the data into shared and private parts. The shared latent variable represents the portion of the variance in each observation space that can be determined from the other observations, i.e. variance that has a functional relationship between the observations. The private space represents variance which cannot be determined from the other space. As we will show, this means that, for the task of pose estimation, the estimation can be reduced to a regression task for estimating the shared location. The location in the private space represents variance in the pose space that cannot be estimated from the image, i.e. poses that are ambiguous given the input image.
3.6 Summary<br />
As was exemplified in the toy example in Figure 3.5, the standard shared model is not capable of unraveling the correct generating signals for data where the two observation spaces have not been generated from the same underlying signals. For the case where we are interested in inferring locations in one specific observed space given locations in the other observation space, we proposed the shared back-constrained model. The proposed model is capable of correctly unraveling the generating signals of the data when the input space has been generated from a subset of the generating parameters of the output space. However, in cases when the data have been generated using parameters "private" to both of the observed spaces, as was shown in the example in Figure 3.8, the shared back-constrained model also fails to model the data, as was shown in Figures 3.10 and 3.11.
The objective of each of the models is to find latent locations and mappings that minimize the reconstruction error of the data. In cases where the data contains observation-space-specific "private" information used to generate a single observation space, the model is left with two choices. The first is to represent this information in the latent space, which will reduce the error when reconstructing the associated observed space. However, for the other space, which does not contain this information, this will pollute the latent space, reducing the capability of the generative mapping to generalize over the data and resulting in a lower likelihood of the associated generative mapping. The other option is to consider the private information as noise. Neither of the above scenarios is ideal, as both will result in a lower likelihood of the model: either by reducing the generative mapping's capability to regenerate the data or by trying to model a structured signal using a noise model, which is not likely to fit particularly well.
For modeling data containing private information we have in this chapter proposed the shared subspace GP-LVM model. This model learns a factorized latent space of the data, separately modeling information that is shared from information that is private. The structure of the latent space means that the private information in the observations will not pollute the reconstruction of the data, nor will the model need to resort to modeling this information as noise. Further, the factorization into shared and private information can be beneficial when we want to do inference in the model, as the shared space contains information that can be determined from either observation space while the private subspace contains the information that is ambiguous knowing the location in the other observed space. In the applications chapter we will exploit this factorization for estimating the pose of a human from image evidence. Each image is ambiguous with respect to a small sub-set of the possible human poses; using the shared subspace model, this sub-set will be modeled using the private latent spaces.
The non-convex nature of the GP-LVM objective (in the non-linear case) means that the algorithm relies on an analogous dimensionality reduction algorithm for initialization. For the standard model there are, as we have seen, several different algorithms that are applicable. When the observation spaces can be assumed to have been generated from the same manifold, the approach of averaging over the observations is applicable. Similarly, in the asymmetric case, initialization from the output space is viable. However, for the subspace model presented in this chapter, no equivalent spectral model exists.
In the next chapter we proceed by presenting a spectral approach to shared dimensionality reduction that can be used to initialize the subspace GP-LVM model presented in this chapter.
Chapter 4<br />
NCCA<br />
4.1 Introduction<br />
In the previous chapter we introduced two new latent variable models based on the GP-LVM. The first algorithm, referred to as the shared back-constrained GP-LVM, models two correlated sets of observations using a single shared multivariate latent variable. It was created for the task of inferring the location in one observation space given the corresponding location in the other observation space. The second algorithm, referred to as the subspace GP-LVM model, was designed for the more general scenario where we want to model several different observation spaces sharing a subspace of their variance.
The solution to both algorithms is found using gradient-based methods. Therefore, to be able to recover a good solution, initialization of the models is of significant importance. The shared back-constrained model assumes that the non-back-constrained observation space has been generated from a subset of the generating parameters of the back-constrained space. This means that we can initialize the latent locations using the output of a spectral dimensionality reduction technique applied to the back-constrained observation space. However, for the subspace GP-LVM model no analogous convex model exists. In this chapter we will present a spectral dimensionality reduction algorithm for finding factorized latent spaces that separately represent the shared and private variance of two corresponding observation spaces. Being associated with a convex objective function, the model can be used to initialize the subspace GP-LVM model.
4.2 Assumptions<br />
Just as with the spectral dimensionality reduction approaches we reviewed in Chapter 2, the model we will present is based on a set of assumptions about the relationship between the observed data and its latent representation. We are given two sets of corresponding observations $Y = [y_1, \ldots, y_N]$ and $Z = [z_1, \ldots, z_N]$, where $y_i \in \Re^{D_Y}$ and $z_i \in \Re^{D_Z}$. We assume that each observation has been generated from low-dimensional manifolds corrupted by additive Gaussian noise,

$$y_i = f^Y(u_i^Y) + \epsilon^Y, \qquad \epsilon^Y \sim \mathcal{N}(0, \beta_Y^{-1} I)$$
$$z_i = f^Z(u_i^Z) + \epsilon^Z, \qquad \epsilon^Z \sim \mathcal{N}(0, \beta_Z^{-1} I). \qquad (4.1)$$
The latent representations $u_i^Y \in U^Y$ and $u_i^Z \in U^Z$ associated with each observation space consist of two components: one shared, $x_i^S$, and one private or non-shared part, $x_i^Y$ and $x_i^Z$, associated with each observation space. Each component represents an orthogonal subspace of the latent representation, as $u_i^Y = [x_i^S; x_i^Y]$ and $u_i^Z = [x_i^S; x_i^Z]$. We will refer to $X^S$ as the shared subspace and $X^Y$ and $X^Z$ as the private subspaces of the model.
The shared subspace $X^S$ of the latent representations $U^Y$ and $U^Z$ is assumed to be related to the observation spaces by smooth mappings $g^Y$ and $g^Z$ as

$$x_i^S = g^Y(y_i) = g^Z(z_i). \qquad (4.2)$$

We further assume that the relationship between the observed data and its corresponding private manifold representation also takes the form of a smooth mapping,

$$x_i^Y = h^Y(y_i) \qquad (4.3)$$
$$x_i^Z = h^Z(z_i). \qquad (4.4)$$
In the following section we will first describe how the shared latent space $X^S$ can be found from the observed data. Once a solution for the shared space $X^S$ is found, we will proceed to find the private subspaces $X^Y$ and $X^Z$ to complete the latent representation of the model.
4.3 Shared
The shared subspace $X^S$ of the latent representation of the data represents variance that is shared between both observation spaces. In Chapter 2 we reviewed Canonical Correlation Analysis (CCA), which is a feature selection algorithm for finding directions in two observation spaces that are maximally correlated. The model we are about to present is a two-stage algorithm; in the first stage we find the shared latent space $X^S$ using CCA.
The objective of CCA is to find two sets of basis vectors $W_Y$ and $W_Z$, one for each observation space, such that the correlation $\rho$ between the projections of the data is maximized,

$$\rho = \frac{\operatorname{tr}\left(W_Y^T Y^T Z W_Z\right)}{\left(\operatorname{tr}\left(W_Z^T Z^T Z W_Z\right) \operatorname{tr}\left(W_Y^T Y^T Y W_Y\right)\right)^{\frac{1}{2}}}, \qquad (4.6)$$

subject to unit variance, $W_Y^T Y^T Y W_Y = I$ and $W_Z^T Z^T Z W_Z = I$, along each direction. The unit variance constraint means that CCA will find maximally correlated directions irrespective of how much of the variance in the observation space is explained. As a way of avoiding low-variance solutions, it is suggested in [31] to first apply PCA separately to each data-set and then apply CCA in the dominant principal subspace of each data-set. This avoids highly correlated directions that explain a non-substantial amount of the variance of the data.
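A linear CCA solver consistent with Eq. 4.6 can be sketched via whitening followed by an SVD of the cross-covariance; the ridge term `reg` is our addition for numerical stability.

```python
import numpy as np

def cca(Y, Z, k, reg=1e-8):
    """First k canonical direction pairs (W_Y, W_Z) maximising the
    correlation of Eq. 4.6, with canonical correlations rho."""
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Cyy = Yc.T @ Yc + reg * np.eye(Y.shape[1])
    Czz = Zc.T @ Zc + reg * np.eye(Z.shape[1])

    def inv_sqrt(C):                      # C^{-1/2} for symmetric PD C
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sy, Sz = inv_sqrt(Cyy), inv_sqrt(Czz)
    U, rho, Vt = np.linalg.svd(Sy @ (Yc.T @ Zc) @ Sz)
    W_Y = Sy @ U[:, :k]                   # satisfies W_Y^T Y^T Y W_Y = I
    W_Z = Sz @ Vt[:k].T
    return W_Y, W_Z, rho[:k]
```

On two spaces driven by a common signal, the first canonical correlation is close to one, as the unit-variance constraint ignores how much observation-space variance each direction explains.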
In the general case both CCA and PCA are applied to linear subspaces of the data. However, both algorithms can be non-linearized through the kernel trick [3] by first mapping the data to kernel-induced feature spaces,

$$\Psi_Y : Y \rightarrow \mathcal{F}^Y \qquad (4.7)$$
$$\Psi_Z : Z \rightarrow \mathcal{F}^Z, \qquad (4.8)$$

represented by kernels $K^Y$ and $K^Z$. In practice we first apply kernel PCA and then
Figure 4.1: Graphical model of the Non-Consolidating-Component-Analysis (NCCA) model. The two observation spaces Y and Z are generated from a latent variable X which is factorized into three different subspaces, X = [X^Y; X^S; X^Z]. The subspace [X^S; X^Y] is the latent representation of Y, while [X^S; X^Z] represents Z. This means that X^S models the portion of the data that is correlated between Y and Z and is found using CCA. The subspaces X^Y and X^Z represent the remaining private portion of each observation space.
look for correlated directions in the dominant kernel-induced principal subspace of the data. Having found the directions explaining the shared components of the data, it remains to find directions explaining the private variance of each observation space.
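The kernel PCA step can be sketched as follows, here with an RBF Gram matrix whose width `gamma` is an illustrative assumption; CCA as above can then be run on the returned coordinates of each space.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """RBF Gram matrix K with K_ij = exp(-gamma ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def kernel_pca(K, q):
    """Coordinates of the data along the q dominant kernel-induced
    principal directions, from the centred Gram matrix."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centring matrix
    w, V = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(w)[::-1][:q]                # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```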
4.4 Private<br />
Given two observation spaces $Y$ and $Z$ together with bases $W_Y$ and $W_Z$ that explain the shared variance, we are interested in finding directions $V_Y$ and $V_Z$ that explain the private, non-shared variance of each observation space. To find such a basis we look for directions of maximum variance in each observation space that are orthogonal to the shared basis. We apply this to each observation space in turn, which is why we drop the superscript identifying the observation space in the following. We seek the first direction,

$$\hat{v} = \operatorname*{argmax}_{v} v^T C v, \qquad (4.9)$$

subject to

$$v^T v = 1 \qquad (4.10)$$
$$v^T W = 0, \qquad (4.11)$$

where $W$ holds the canonical directions and $C$ is the covariance matrix of the feature space.
4.4.1 Extracting the first orthogonal direction<br />
We apply the algorithm in a feature space induced by a kernel $K$. The solution is found by formulating the Lagrangian of the problem,

$$L = v^T C v - \lambda (v^T v - 1) - \sum_{i=1}^{K} \gamma_i v^T w_i. \qquad (4.12)$$

Seeking the stationary point of the Lagrangian leads to the following system of equations,

$$\frac{\delta L}{\delta v} = 2 C v - 2 \lambda v - \sum_{i=1}^{K} \gamma_i w_i = 0 \qquad (4.13)$$
$$\frac{\delta L}{\delta \lambda} = v^T v - 1 = 0 \qquad (4.14)$$
$$\frac{\delta L}{\delta \gamma_i} = v^T w_i = 0, \quad \forall i. \qquad (4.15)$$
By pre-multiplying Eq. 4.13 with $w_j^T$,

$$w_j^T (2 C v) - 2 \lambda\, w_j^T v - w_j^T \sum_{i=1}^{K} \gamma_i w_i = 0, \qquad (4.16)$$

using the orthogonality constraint Eq. 4.15,

$$w_j^T v = 0, \qquad (4.17)$$

and the orthonormality of the canonical directions,

$$w_j^T w_i = \begin{cases} 1 & i = j \\ 0 & i \neq j, \end{cases} \qquad (4.18)$$

we obtain

$$\gamma_j = 2 w_j^T C v. \qquad (4.19)$$

Using Eq. 4.13 and Eq. 4.19,

$$2 C v - 2 \lambda v - 2 \sum_{i=1}^{K} w_i (w_i^T C v) = 0 \qquad (4.20)$$

$$\left( C - \sum_{i=1}^{K} w_i w_i^T C \right) v = \lambda v. \qquad (4.21)$$

Equation 4.21 is an eigenvalue problem and can be solved in closed form through the eigendecomposition of the matrix $\left( C - \sum_{i=1}^{K} w_i w_i^T C \right)$.
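Eq. 4.21 translates directly into numpy; the columns of `W` are assumed orthonormal, as in the derivation above, and the function name is ours.

```python
import numpy as np

def first_private_direction(C, W):
    """Leading eigenvector of (C - W W^T C), Eq. 4.21: the direction of
    maximum variance orthogonal to the canonical directions in W."""
    M = C - W @ (W.T @ C)            # deflate the shared directions
    w, V = np.linalg.eig(M)          # M is not symmetric in general
    v = np.real(V[:, np.argmax(np.real(w))])
    return v / np.linalg.norm(v)
```

Any eigenvector of this deflated matrix with a non-zero eigenvalue is automatically orthogonal to the columns of `W`, so the constraint of Eq. 4.11 is satisfied by construction.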
4.4.2 Extracting consecutive directions

Having found the orthogonal direction explaining the maximum of the remaining variance, further consecutive directions can be found by appending the previously found directions to the orthogonality constraint. For the $M$th direction,

$$v_M^T \left[ W; v_1, \ldots, v_{M-1} \right] = 0. \qquad (4.22)$$

This means that to find the $M$th direction the following eigenvalue problem needs to be solved,

$$\left( C - \left( \sum_{j=1}^{M-1} v_j v_j^T + \sum_{i=1}^{K} w_i w_i^T \right) C \right) v_M = \lambda v_M. \qquad (4.23)$$
We will refer to the directions found as the Non-Consolidating-Components<br />
and the algorithm as Non-Consolidating-Components-Analysis (NCCA).<br />
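The consecutive extraction of Eq. 4.23 is then a deflation loop, restated here so the sketch is self-contained; again `W` is assumed to hold orthonormal canonical directions, and the function name is ours.

```python
import numpy as np

def ncca_private_directions(C, W, M):
    """Extract M non-consolidating components by repeatedly solving the
    deflated eigenvalue problem of Eq. 4.23."""
    B = W.copy()                        # accumulated orthogonality constraints
    dirs = []
    for _ in range(M):
        D = C - B @ (B.T @ C)           # deflate shared + earlier directions
        w, V = np.linalg.eig(D)
        v = np.real(V[:, np.argmax(np.real(w))])
        v = v / np.linalg.norm(v)
        dirs.append(v)
        B = np.column_stack([B, v])     # append to the constraint set
    return np.column_stack(dirs)
```

Each extracted direction is orthogonal both to the canonical directions and to all previously extracted directions, so the returned columns form an orthonormal private basis.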
4.4.3 Example<br />
In the previous chapter we applied the shared and the shared back-constrained GP-LVM models to a toy data-set. Doing so, we exemplified in which scenarios each model works and also in which scenarios each model fails. As was shown, neither model is capable of modeling data which contains both shared and private signals. This in itself was the motivation for creating the subspace GP-LVM model. In this chapter we have presented the NCCA model as an extension to CCA. We will use the solution of the proposed model to initialize the subspace GP-LVM model. To evaluate the performance of the model, we applied the subspace model to the same data-set to which we applied the shared and the shared back-constrained models in the previous chapter. Each observation space Y and Z has been generated from a set of underlying low-dimensional signals shown in Figure 4.2. The generating signals have one dimension shared and one dimension private to each data-set.
Figure 4.2: Each dimension of the signals used to generate the high-dimensional<br />
signals shown in Figure 4.3. The data consist of one shared dimension (blue) and<br />
one dimension private to each data-set (green).<br />
Figure 4.3: Each dimension of the high-dimensional observed data to which the<br />
subspace GP-LVM model is applied.<br />
From these signals two high-dimensional signals are generated through random linear projections, resulting in the data shown in Figure 4.3, which is referred to as the observed data Y and Z. We apply the data shown in Figure 4.3 to two variations of the
subspace GP-LVM model presented in the previous chapter. For each model we<br />
set the latent structure to be composed of a single shared dimension and a private<br />
dimension corresponding to each observation space. This means that in total we are learning a three-dimensional latent space. We initialize one model using the
PCA solution of one of the observation spaces and one model using CCA for the
Figure 4.4: Each dimension of the embeddings obtained by applying the subspace GP-LVM model to the data in Figure 4.3. The leftmost plot corresponds to the results using PCA as initialization, while the right plot corresponds to the embedding using CCA for the shared dimension and NCCA for the private dimensions. The latter model succeeds in recovering the generating signals in Figure 4.2 while the model initialized using PCA fails.
shared dimension and NCCA for the private spaces. In Figure 4.4 the resulting<br />
embeddings are shown. As can be seen in Figure 4.4 the model initialized using<br />
PCA is not able to correctly unravel the generating signals. The observation space<br />
Y is generated from a two-dimensional signal (Figure 4.2). We use the first principal component of this data as initialization for the shared space and the second component to initialize the private space corresponding to Y. We hereby make the assumption that the first principal component will correspond to the shared signal, something which in general cannot be assumed to be true. The private dimension
of Z is initialized to small random values. Using the CCA and NCCA scheme<br />
presented in this chapter the model manages to correctly unravel the generating<br />
signals.
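The toy construction described in this example can be sketched as follows. The exact signals of Figure 4.2 are not reproduced here; the sinusoids, the sample count and the output dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # number of samples, as in Figure 4.2
t = np.linspace(0, 2 * np.pi, N)

# One shared and two private one-dimensional signals (hypothetical shapes).
s_shared = np.sin(t)                      # shared between both spaces
s_priv_y = np.cos(3 * t)                  # private to Y
s_priv_z = np.sin(2 * t + 0.5)            # private to Z

G_y = np.column_stack([s_shared, s_priv_y])   # generating signals for Y
G_z = np.column_stack([s_shared, s_priv_z])   # generating signals for Z

# Random linear projections into higher-dimensional observation spaces.
D_y, D_z = 6, 8
Y = G_y @ rng.standard_normal((2, D_y))       # observed data Y, shape (N, D_y)
Z = G_z @ rng.standard_normal((2, D_z))       # observed data Z, shape (N, D_z)
```

The resulting Y and Z each carry full variance generated from only two underlying dimensions, one of which is common to both spaces.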
4.5 Extensions<br />
We have presented the NCCA algorithm as a way of finding representative<br />
directions in the data that is complementary to the directions of high correla-<br />
tion found by CCA. However, there is nothing in the algorithm that limits the<br />
use of NCCA to only accompany CCA. The algorithm is general and can be<br />
used in addition to any feature selection algorithm where we want to model<br />
the full variance of the data.<br />
Fisher’s Discriminant Analysis (FDA) [20] is a method for finding directions in the data such that the resulting projections maximize the separation between two or more classes. Finding complementary directions using NCCA would allow a factorized latent representation where the subspace corresponding to the FDA directions would discriminate the data into
classes while the complementary subspace associated with the NCCA direc-<br />
tions would represent the most representative directions in the data without<br />
constraints on class discrimination. This would allow FDA to be extended to<br />
a model which represents the full variance of the observed data, not limited to the variance from which the classes can be discriminated.
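A minimal sketch of this suggested extension, with a simple two-class toy set (invented for illustration) and with PCA on the orthogonal complement of the discriminant direction standing in for the full NCCA step:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two-class toy data in three dimensions (invented for illustration).
X0 = rng.standard_normal((50, 3)) + np.array([2.0, 0.0, 0.0])
X1 = rng.standard_normal((50, 3)) - np.array([2.0, 0.0, 0.0])
X = np.vstack([X0, X1])

# Fisher's discriminant direction for two classes: w ∝ S_w^{-1} (m0 − m1).
m0, m1 = X0.mean(0), X1.mean(0)
Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
w = np.linalg.solve(Sw, m0 - m1)
w /= np.linalg.norm(w)

# Complementary directions: principal directions of the data after the
# FDA direction has been projected out (a stand-in for the NCCA step).
Xc = X - X.mean(0)
X_res = Xc - np.outer(Xc @ w, w)
_, _, Vt = np.linalg.svd(X_res, full_matrices=False)
complementary = Vt[:2]        # directions orthogonal to the discriminant
```

The complementary subspace captures the remaining variance without any constraint on class discrimination, as described above.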
4.6 Summary<br />
In this chapter we have presented a new two stage dimensionality reduction<br />
model. The latent space used to represent the data is factorized into two<br />
parts, one constrained part and one complementary part ensuring that the<br />
full variance in the observed data is represented in the latent space. The constrained latent subspace is found from any feature selection algorithm that
results in a set of directions in the data. In this chapter we have used CCA<br />
to find the constrained part. The remaining latent subspace complements the<br />
constrained directions to ensure that the latent variable represents the full<br />
variance of the observed data.<br />
The initial motivation behind the NCCA model was as a spectral analogue to the subspace GP-LVM model. However, as we will show in
Chapter 5, the model can be directly applied to model data without the need<br />
of a GP-LVM model.<br />
When using the NCCA model as an initialization scheme for the subspace<br />
GP-LVM we use the NCCA step to complement the solution found by applying<br />
CCA to the data. The assumptions in Eq. 4.2 and Eq. 4.5 do not completely correspond to
the subspace GP-LVM model. In particular the subspace GP-LVM model makes<br />
no assumption about the relationship between the observed data and the mani-<br />
fold representation. We could design a model which would better correspond to<br />
the CCA and NCCA scheme by incorporating back-constraints to constrain the<br />
mapping from observed data to the latent representation. However, each latent<br />
dimension can only be back-constrained once, which means that we cannot completely encode the assumption in Eq. 4.2. However, as we will show in the next chapter, initializing according to the above scheme results experimentally in good embeddings.
Chapter 5<br />
Applications<br />
5.1 Introduction<br />
In this chapter we will apply the models presented in this thesis to the task of human pose estimation from monocular images. In the next section we will give a brief introduction to human pose estimation and review some of the related work.
We will then proceed to present the shared back-constrained and the subspace<br />
GP-LVM model applied to this task. We will apply the models on two standard<br />
data-sets typically used within the domain of single image human pose estimation.<br />
We will conclude with a brief summary of the results.<br />
5.2 Human Pose Estimation<br />
Single view human pose estimation is the task of estimating the pose of a human<br />
from a monocular image. To get an overview of previous work we will split the task into two different categories, generative and discriminative. Generative human pose estimation algorithms are often referred to as model based algorithms.
They aim to fit a model of a human to evidence given in the image using an associated likelihood or error function. The task is solved by searching the state-space, which defines the pose, for the location which maximizes the likelihood given a specific image. This is different from the discriminative approaches, which aim to find the pose associated with a specific image from evidence extracted from the image.
5.2.1 Generative<br />
Generative human pose estimation aims to fit the parameters of a human body<br />
model onto the image. The key differences between different algorithms are the<br />
parameterization of the human body and how the likelihood model is specified<br />
through which these parameters are found. A large variety of human body models<br />
have been specified, from simpler models composed of two dimensional patches<br />
[30, 43, 15] to more complex three dimensional models based on a wide range of<br />
primitives such as cylinders [13, 53, 12] or cones [74]. The more complex human<br />
models allow for more accurate estimation, however they are usually associated<br />
with a higher number of parameters which implies that they are more expensive<br />
to fit to the image evidence. Similarly to the human body model several different<br />
ways of specifying the likelihood model through which the parameters can be fit<br />
have been suggested. In [74, 19] image edges are used, while [58] used texture information to fit an ellipsoid-parameterized model. In [12] a stick-model of a human was fitted to the image through an image-segmentation-based score. If the estimation is done for a sequence of images, higher-level image information such
as optical flow can be incorporated into the likelihood model as in [58].<br />
5.2.2 Discriminative<br />
Images are very high-dimensional objects, typically in the range of 10^3–10^4 dimensions. To make modeling computationally feasible it is common practice, as
a pre-processing stage, to represent each image using a lower dimensional feature<br />
vector. Due to the high-dimensionality of the images these feature descriptors are<br />
usually based on heuristic assumptions about the correspondence between images<br />
and pose.<br />
There are two main approaches in modeling the relationship between image<br />
features and pose. In the simplest case, where there is no ambiguity between the<br />
image representation and pose, the relationship can be modeled with regression as<br />
was demonstrated in [2, 61, 68, 83]. However, in the general case the image feature representation is not capable of fully disambiguating the pose. One approach to deal with the multi-modalities that arise is to use a multi-modal estimate such as in [47]. Given a multi-modal estimate, additional cues such as temporal consistency can be used to disambiguate between the different modes.
5.2.3 Problem Statement<br />
Given a set of image features y_i ∈ Y with corresponding pose parameters z_i ∈ Z, where i = 1, . . . , N, we wish to train a model through which, given an unseen image feature vector y_*, we can infer the corresponding pose parameters z_*. We will learn models for two different settings, single image estimation and sequential estimation. In the first case we are given a single input feature from which to determine the pose, while in the sequential case we are given a sequence [y_*1, . . . , y_*N] of image features from which we aim to determine the corresponding sequence of pose parameters [z_*1, . . . , z_*N].
5.3 Image Features<br />
Images are very high-dimensional objects, typically residing in 10^3–10^4 dimensional spaces. Due to the problems associated with working with such high-dimensional objects it is often necessary for most applications to find a reduced dimensional representation of each image. The large variability and high dimensionality of most image spaces means that it is often not possible to apply dimensionality reduction techniques such as the ones reviewed in Chapter 2. Further, for most applications we are only interested in a subset of the information contained in the images; often it is of interest to find a representation that introduces certain ambiguities. One example would be an application where we are comparing the shape of objects: we would then ideally like a representation that is ambiguous to color, texture and scaling, and where as much of the variance in the descriptor as possible is related to the shape of the objects. This has led to a large body of work
on heuristic application specific image representations. We will now proceed to<br />
briefly describe the background to the two different image features used for the<br />
experiments in this chapter.<br />
5.3.1 Shapeme Features<br />
Shape context [6, 7] was suggested as a point based descriptor for shape matching.<br />
The shape is assumed to be represented as a discrete set of points from the contour of an image; this could either have been extracted using a segmentation algorithm or as the response to an edge detector. Considering the set of contour points P = {p_1, . . . , p_n}, the descriptor is calculated by taking each point p_i and describing its position relative to all the other points on the contour. This is done by computing the vector v_j^i = (p_i − p_j) for all the other n − 1 points on the image. The set of vectors V^i = {v_j^i}, ∀ j ≠ i, completely describes the configuration of all points relative to the reference point p_i. The shape context descriptor of point p_i is computed as the distribution of these relative point positions by placing them in a coarse spatial log-polar histogram, where each bin represents the radius and the angle of the polar representation of each vector v_j^i. This results in n log-polar histograms describing
the shape.<br />
Each histogram is computed relative to a reference point on the contour, making the histogram invariant to translation. Further, by scaling the radius of the polar representation of each point with the median distance between the points on the contour, each histogram can be made invariant to scaling. The discretization/binning of the vectors into a coarse histogram representation makes the descriptor robust to small affine transformations of the shape.
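The descriptor computation described above can be sketched as follows; the bin counts (here 5 radial and 12 angular bins) and the radial range are illustrative choices:

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape context histograms for a set of contour points.

    Each point p_i is described by a coarse histogram of the vectors
    v_j^i = p_i - p_j to all other points, with radii scaled by the
    median pairwise distance for scale invariance.
    """
    P = np.asarray(points, dtype=float)
    n = len(P)
    diff = P[:, None, :] - P[None, :, :]          # diff[i, j] = p_i - p_j
    dist = np.linalg.norm(diff, axis=-1)
    r = dist / np.median(dist[dist > 0])          # scale normalization
    theta = np.arctan2(diff[..., 1], diff[..., 0])

    # Log-spaced radial bin edges and uniform angular bins.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    hists = np.zeros((n, n_r, n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rb = np.searchsorted(r_edges, r[i, j]) - 1
            tb = int((theta[i, j] + np.pi) / (2 * np.pi) * n_theta) % n_theta
            if 0 <= rb < n_r:                     # points outside are dropped
                hists[i, rb, tb] += 1
    return hists.reshape(n, -1)                   # one histogram per point
```

The coarse binning is what gives the robustness to small affine transformations noted above.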
Given two shapes we are interested in finding a measure of their similarity.<br />
Matching using the shape context representation is a two-stage process. First, a similarity between shape context histograms is needed; this is used to find the point on each shape that best matches a point on the other shape. Secondly, a measure relating the similarity of the full shapes, i.e. all the points, is required. In the original paper [6] the χ² distance was used to compare each histogram; however, in [7] this was changed to the simpler L2-norm without significant degradation of the results. Matching two shapes implies finding the permutation π of the points on one shape such that the sum of the pairwise histogram distances is minimized. This is an instance of bipartite matching, which is in general of cubic complexity.
To reduce the complexity associated with matching two shapes using shape<br />
context descriptors the shapeme feature descriptor [42] was suggested. The shapeme<br />
descriptor is calculated by computing shape context histograms for a large set of<br />
training images. By clustering the space of all shape context histograms a set of<br />
representative shape context histograms can be found, referred to as shapemes. Each image can now be represented by associating each shape context vector with its closest shapeme. The descriptor of an image can now be reduced to a histogram over the shapemes. Matching two shapes represented as shapemes can now be performed using a simple nearest neighbor classifier in the shapeme space.
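The quantization into shapemes might be sketched as below, with a basic k-means standing in for the clustering step and the number of shapemes k being an assumption:

```python
import numpy as np

def shapemes(histograms, k=8, iters=20, seed=0):
    """Cluster shape context histograms into k representative 'shapemes'."""
    rng = np.random.default_rng(seed)
    H = np.asarray(histograms, dtype=float)
    centers = H[rng.choice(len(H), size=k, replace=False)]
    for _ in range(iters):                        # plain Lloyd iterations
        d = np.linalg.norm(H[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = H[assign == c].mean(axis=0)
    return centers

def shapeme_descriptor(histograms, centers):
    """Reduce an image to a normalized histogram over its nearest shapemes."""
    H = np.asarray(histograms, dtype=float)
    d = np.linalg.norm(H[:, None, :] - centers[None, :, :], axis=-1)
    counts = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return counts / counts.sum()
```

Once every image is a single fixed-length shapeme histogram, comparing two shapes reduces to comparing two vectors, avoiding the cubic bipartite matching.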
We use the shapeme descriptor for both the Poser and the HumanEva data-set.<br />
Details of the specific features for the Poser data-set can be found in [1] and for<br />
the HumanEva data in [55].<br />
5.3.2 Histogram of Oriented Gradients<br />
The Histogram of Oriented Gradients (HOG) image descriptor was suggested in<br />
[18] for the task of detecting people in images. The descriptor is based on the dis-<br />
tribution of local texture gradients in the image. Computing the HOG descriptor<br />
for an image is a three stage process. In the first step the response of the image<br />
to a gradient filter is computed, associating each pixel in the image with a direction and a magnitude. The second step involves binning the gradient directions over spatial regions referred to as cells into gradient histograms. Each bin in the
histogram represents a set of gradient directions; the strength of the vote from each pixel in the image is taken as a function of the gradient magnitude. In the original work [18], directly using the gradient magnitude was found to give the
best performance. A good descriptor should be robust towards lighting changes in the image. To achieve this, in the third step of the feature extraction the gradient magnitude of each cell is normalized with respect to a local region in the image referred to as a block. The final feature vector is calculated as the response to a
sliding window over the image where the normalized cell responses are binned<br />
either using rectangular window (R-HOG) or using a circular window (C-HOG).<br />
The HOG descriptor has much in common with the SIFT descriptor [38], which is also based on histograms of local gradient directions. However, whereas the SIFT
descriptor is computed at key-points extracted from the image, and at multiple<br />
scales, the HOG descriptor is a dense descriptor evaluated for each pixel in the<br />
image.<br />
We use the HOG feature evaluated on the HumanEva data-set. Details of the<br />
specific feature can be found in [46].<br />
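A much simplified sketch of the three-stage computation (unsigned orientations, hard binning and a single global normalization instead of overlapping blocks, so the details differ from [18]):

```python
import numpy as np

def hog(image, cell=8, n_bins=9, eps=1e-6):
    """Minimal HOG sketch: gradients -> cell histograms -> normalization."""
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)                     # step 1: gradient filter
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation

    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros((ch, cw, n_bins))
    for i in range(ch):                           # step 2: binning per cell
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            np.add.at(hist[i, j], bins[sl].ravel(), mag[sl].ravel())
    return (hist / (np.linalg.norm(hist) + eps)).ravel()  # step 3: normalize
```

Each pixel votes into its cell's orientation histogram with a weight given by its gradient magnitude, as described above.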
5.4 Data-sets<br />
We apply the two suggested models to two different data-sets. The first data-set, referred to as Poser, was presented in [2]. Poser consists of images generated using the computer graphics package Poser from real motion-capture data from the CMU database¹. Each image is represented by its corresponding silhouette, described using a 100-dimensional shapeme descriptor [42] as presented in [2].
1 The data used in this project was obtained from mocap.cs.cmu.edu. The database was created<br />
with funding from NSF EIA-0196217.
Each pose is represented using 54 joint angles of a full body. In total the data-set consists of 1927 training poses from 8 sequences of varied motions. The data-set is provided with one test sequence of 418 frames describing a circular walk.
The second data-set we are going to apply the models to is a real-world data-<br />
set known as HumanEva. It was first presented in [54]. The full HumanEva<br />
data-set consists of six different motions performed by three different persons, divided into training, test and validation sets. In addition, frames of a fourth person are provided without labeling of the motions. Each frame is captured using several cameras. In our experiments we use only images from a single camera. Due to
restrictions in the amount of data the proposed models can process we have chosen<br />
a limited training data-set. Further limitations are necessary due to errors in the<br />
motion capture data provided. We use two different subsets. The first set contains<br />
the walking sequence for subject one where we use the first cycle in the walk for<br />
training and the second cycle for testing. Each image in this data-set is described using a 300-dimensional shapeme feature as used in [55]. In [46] each image
of the HumanEva data-set is aligned by projecting the associated pose onto the<br />
view plane and using the Procrustes score to align each image into a common<br />
frame. Through this process each image can be flipped around the horizontal axis<br />
to effectively double the size of the data-set. For the second set of images we<br />
append the first with the corresponding flipped image from the image-set created<br />
in [46] and use every second image in the data. Each image in the second data-set is represented using the Histogram of Oriented Gradients (HOG) feature [18] as used in [46]. As a pre-processing stage, to reduce computational time, we represent each descriptor by its projection onto the first 100 principal directions extracted from
the data. The poses in the HumanEva data-set are represented as the locations of a set of 19 joints in 3D-space, resulting in a 57-dimensional pose representation. We remove the global translation by centering the data.
We apply the Poser data-set to the shared back-constrained GP-LVM model and the second HumanEva data-set to the subspace GP-LVM model. The first HumanEva set is used to compare the different models. We will now proceed to
present how each model is defined and how inference is done for each problem<br />
setting.<br />
5.5 Shared back-constrained GP-LVM
In this section we will present the application of the shared back-constrained GP-<br />
LVM model to the task of single image human pose estimation. Given a set of<br />
image features [y1, . . .,yN] with corresponding pose parameters [z1, . . .,zN] we<br />
learn a shared back-constrained GP-LVM model. For the generative mappings<br />
from the latent space to the observed data we use a kernel consisting of the sum<br />
of an RBF, a bias and a white noise kernel,

k_gen(x_i, x_j) = θ_1 exp(−‖x_i − x_j‖₂² / (2θ_2²)) + θ_3 + β⁻¹δ_ij.   (5.1)
Using this kernel means that each generative mapping is specified by the four hyper-parameters Φ = [θ_1, θ_2, θ_3, β].
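The generative kernel of Eq. 5.1 can be sketched as follows; we assume, consistently with Eq. 5.8, that β is a noise precision, so the white noise term contributes β⁻¹ on the diagonal:

```python
import numpy as np

def k_gen(X1, X2, theta1, theta2, theta3, beta):
    """RBF + bias + white noise kernel (Eq. 5.1); beta is assumed to be
    the noise precision and theta2 the RBF width."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    K = theta1 * np.exp(-d2 / (2.0 * theta2 ** 2)) + theta3
    if X1 is X2:                   # white noise acts only on the diagonal
        K = K + (1.0 / beta) * np.eye(len(X1))
    return K
```

Evaluated on the training inputs against themselves, this produces the symmetric kernel matrix used in the marginal likelihood.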
For the back-constraint from the pose space to the latent space we use a simple<br />
regression over a kernel induced feature space. In all our experiments we use an
RBF kernel,

k_back(z_i, z_j) = exp(−‖z_i − z_j‖₂² / (2θ_back)),   (5.2)
where the kernel width θback is set by examining the kernel response to the pose<br />
parameters.<br />
Both the Poser data and the HumanEva data are sequential. We will therefore also learn a dynamical model to predict sequentially over the latent space. The dynamical model is a GP predicting over time. We will use the same type of kernel as for the generative mappings, i.e. a combination of an RBF, a bias and a white noise kernel, resulting in a mapping specified by four hyper-parameters.
In total this means that the shared back-constrained model we apply in this<br />
chapter is specified by 8 hyper-parameters for the generative mappings, one pa-<br />
rameter for the back-constraint and 4 hyper-parameters for the dynamical model.<br />
Further, the dimension of the latent space needs to be fixed; in all our experiments we use a 3-dimensional latent space. As described above, the parameter of the back-constraint is set by inspection: for the Poser data we use θ_back = 1/10⁻³ and for the HumanEva data we use θ_back = 1/(4·10⁻⁶). In Figure 5.1 the kernel response
matrices associated with the back-constraints for the two data-sets are shown. We<br />
initialize the latent locations X of the model by applying PCA to the pose parame-<br />
ters. The hyper-parameters of the GP mappings are learnt together with the latent<br />
location of the data by optimizing the marginal likelihood of the model Eq. 3.4<br />
using a scaled conjugate gradient optimizer. Ideally we would like to run the optimizer until convergence; however, in practice, we limit the number of iterations to 10000, as running for further iterations does not have a significant impact on the results.
Figure 5.1: Kernel response matrices on the training data over which the back-mappings are applied in the back-constrained model. The left image shows the response to the Poser data using an RBF kernel with width θ_back = 1/10⁻³. The right image corresponds to the HumanEva data, also using an RBF kernel with width θ_back = 1/(4·10⁻⁶).
5.5.1 Single Image Inference<br />
For the task of single image pose inference we are given a single feature vector y ∗<br />
for which we want to infer its corresponding pose parameters z ∗ . The latent space<br />
X is back-constrained from the pose space Z which means that we encourage a<br />
one-to-one mapping between the latent space and the pose space. This means that<br />
if we can determine the latent location x ∗ associated with the image feature y ∗ we<br />
can recover the associated pose parameters z ∗ through the mean prediction of the<br />
GP generating the pose space. We can find the latent coordinate associated with a<br />
specific image feature by maximizing the likelihood over the latent space,<br />
x_* = argmax_x p(y_* | x),   (5.3)
through gradient based optimization. However, we expect the image features to<br />
be ambiguous with respect to pose which implies a multi-modal mapping from
image feature to pose. Because we encourage the mapping between the latent<br />
space and the pose space to be uni-modal the multi-modality in the data needs<br />
to be contained in the mapping between the latent space and the image features.<br />
This means that the optimization in Eq. 5.3 is multi-modal and needs to be initialized in a convex region around x_*. In practice we first perform a nearest-neighbor search in the image feature space and initialize the latent coordinates with the latent coordinates associated with the N nearest neighbors in the training data. For the
Poser data we use 6 nearest neighbors; increasing the number does not result in a significant increase in performance, while reducing the number leads to missing
modes. We run each point optimization in Eq. 5.3 until convergence, which usually corresponds to 10–100 iterations. In Figure 5.3 results for the single image inference on the Poser data set are shown.
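The multi-start procedure can be sketched as follows, with `log_lik` standing in for the trained GP-LVM likelihood log p(y_*|x) of Eq. 5.3, and the step size and iteration counts being illustrative:

```python
import numpy as np

def infer_latent(y_star, Y_train, X_train, log_lik,
                 n_neighbors=6, step=0.01, iters=200):
    """Multi-start maximization of log p(y*|x) over the latent space.

    Starts from the latent points of the nearest neighbors of y* in
    feature space and runs plain gradient ascent with a numerical
    gradient; log_lik(x) stands in for the trained model's likelihood.
    """
    d = np.linalg.norm(Y_train - y_star, axis=1)
    starts = X_train[np.argsort(d)[:n_neighbors]]

    def num_grad(x, h=1e-5):
        g = np.zeros_like(x)
        for k in range(len(x)):
            e = np.zeros_like(x)
            e[k] = h
            g[k] = (log_lik(x + e) - log_lik(x - e)) / (2 * h)
        return g

    best_x, best_ll = None, -np.inf
    for x0 in starts:
        x = x0.copy()
        for _ in range(iters):           # local gradient ascent per start
            x += step * num_grad(x)
        if log_lik(x) > best_ll:
            best_x, best_ll = x, log_lik(x)
    return best_x
```

Each start explores one basin of the multi-modal objective; the highest-likelihood local optimum is returned.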
5.5.2 Sequential Inference<br />
We are often interested in inferring the pose Z ∗ associated with a temporal se-<br />
quence of image observations Y ∗ . In such a setting we can exploit temporal con-<br />
sistency of the sequence to disambiguate our multi-modal estimate and recover a<br />
single pose estimate per image.<br />
To find the most likely sequence of latent locations associated with the image<br />
observations we interpret the sequence through a hidden Markov model (HMM)<br />
where the latent states of the model correspond to the latent locations of the training data X. The likelihood of each observation is given by the likelihood of
the GP associated with each latent point,<br />
L_i^obs = p(y_* | x_i).   (5.4)
The transition probabilities are given by the dynamical GP predicting latent loca-<br />
tions over time,<br />
L_ij^trans = p(x_i | x_j).   (5.5)
The most probable path X_init through this lattice can be found using the Viterbi algorithm [73].
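The lattice search can be sketched as follows, with log_obs and log_trans holding the log-likelihoods of Eq. 5.4 and Eq. 5.5 evaluated on the training latent points:

```python
import numpy as np

def viterbi(log_obs, log_trans):
    """Most probable path through the latent-point lattice.

    log_obs[t, i]   = log p(y*_t | x_i)   (Eq. 5.4)
    log_trans[i, j] = log p(x_i | x_j)    (Eq. 5.5)
    Returns the index of a training latent point for every frame.
    """
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_obs[0]
    for t in range(1, T):
        # score of arriving in state i from the best predecessor j
        scores = delta[t - 1][None, :] + log_trans
        back[t] = scores.argmax(axis=1)
        delta[t] = scores.max(axis=1) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):       # backtrack the best path
        path[t] = back[t + 1, path[t + 1]]
    return path
```

The returned index sequence gives X_init, which is then refined by the continuous optimization of Eq. 5.6.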
Having found the most likely sequence through the training data X_init, we use this as an initialization to optimize the sequence,
X_* = argmax_X p(Y_* | X).   (5.6)
In Figure 5.4 results for the sequential estimate of the shared back-constrained GP-LVM model are shown.
5.6 Subspace GP-LVM<br />
Even though not explicitly stated, the shared GP-LVM applied in [44] assumes<br />
that the image features and pose parameters are governed by the same degrees of<br />
freedom. This means that the full variance of each observation space needs to be<br />
fully correlated. However, the image features are based on heuristic assumptions<br />
about this correlation and are very likely to contain a significant amount of non-<br />
pose-correlated variance. This variance, irrelevant for the task of pose estimation,
needs to be “explained away” from the latent representation by the noise term in<br />
the GP generating the image features. The back-constrained GP-LVM model aims to encourage the latent space to be a full re-representation of pose, pushing the model to explain the non-correlated image feature variance as noise. However, for
many types of features the non-pose-correlated variance represents a significant<br />
portion of the variance. Further, the structure of this variance is often not well<br />
represented by our Gaussian noise assumption, which means it will pollute the
structure of the latent representation.<br />
An alternative approach is to apply the subspace GP-LVM which models the<br />
non-correlated variance separately using additional latent spaces. The additional<br />
image feature latent space will explain the non-pose-correlated variance in the im-<br />
age feature space. Compared to the shared GP-LVM this means that this variance<br />
is explained using the full flexibility of a GP instead of a simple Gaussian noise model. Further, the private latent subspace associated with the pose represents variance in the pose space that is not correlated with the image features. This implies that locations over this subspace represent poses orthogonal or ambiguous to the image features.
The shared back-constrained model could be initialized efficiently by apply-<br />
ing PCA to the pose space. However, for the subspace model this scheme cannot<br />
be used. Instead we initialize the shared latent space using CCA and the private<br />
spaces using NCCA. Before applying CCA to find the shared locations we apply PCA to both observation spaces to remove directions in the data representing non-significant variance. For the first HumanEva data-set we apply linear PCA, while for the second HumanEva data-set we apply kernel PCA on an MVU [78] kernel computed using 7 nearest neighbors. In practice we keep 70% of the
variance of the image feature space and 90% of the variance in the pose space.<br />
Having found a reduced representation of the data we apply CCA to find an initialization of the shared latent locations; we keep directions of CCA that have a normalized correlation coefficient of more than 30%. Finally, to complete the latent representation with the private spaces, we apply NCCA to find orthogonal directions to the CCA solution. We find directions until 95% of the variance in the PCA-reduced representation of each observation space is explained. For the first HumanEva data-set this results in a two-dimensional shared space and one-dimensional private spaces associated with each observation. The second HumanEva data-set results in a one-dimensional shared space, a one-dimensional private space for the image features and a two-dimensional private pose space.
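The linear part of this initialization pipeline can be sketched as below on synthetic data (invented for illustration). The SVD-based CCA and the residual PCA standing in for the NCCA step are implementation assumptions, and the variance fraction is raised to 0.99 here so both toy signals survive the reduction (the text keeps 70% and 90% on the real data):

```python
import numpy as np

def pca_keep(X, frac):
    """Project centered data onto the top directions explaining `frac`
    of the total variance."""
    Xc = X - X.mean(0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s ** 2) / (s ** 2).sum(), frac)) + 1
    return Xc @ Vt[:k].T

def cca(Y, Z, min_corr=0.3):
    """CCA via an SVD of the whitened cross-covariance; keeps directions
    whose canonical correlation exceeds `min_corr`."""
    Uy = np.linalg.svd(Y - Y.mean(0), full_matrices=False)[0]
    Uz = np.linalg.svd(Z - Z.mean(0), full_matrices=False)[0]
    A, corr, _ = np.linalg.svd(Uy.T @ Uz)
    keep = corr > min_corr
    return Uy @ A[:, keep], corr[keep]

# Toy data: one shared and one private signal per observation space.
rng = np.random.default_rng(0)
s = rng.standard_normal(200)
Y = np.column_stack([s, rng.standard_normal(200)]) @ rng.standard_normal((2, 10))
Z = np.column_stack([s, rng.standard_normal(200)]) @ rng.standard_normal((2, 12))

Yr, Zr = pca_keep(Y, 0.99), pca_keep(Z, 0.99)     # PCA reduction
shared, corr = cca(Yr, Zr)                        # shared initialization

# NCCA-style private initialization: top principal direction of the
# variance left unexplained by the shared (CCA) directions.
Yres = Yr - shared @ (shared.T @ Yr)
priv_y = Yres @ np.linalg.svd(Yres, full_matrices=False)[2][0]
```

The shared columns initialize the shared latent space, while the residual directions initialize the private space for Y; the private pose space would be found in the same way from Zr.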
Just as with the back-constrained model we use a combination of an RBF, a<br />
bias and a white-noise kernel for the generative mappings Eq. 5.1, meaning that<br />
each generative mapping is specified using 4 hyper-parameters. We will use the<br />
first data-set to find a single pose estimate for each time instance in a sequence<br />
of images. To disambiguate our multi-modal estimate between time instances we will use a dynamic model capable of predicting over the latent space. We use
a GP regressor as a dynamic model, using the same type of kernel combination<br />
as the generative mappings Eq. 5.1. In total this means that for the first data-set<br />
we learn a model having 8 parameters for the generative mappings, 4 parameters<br />
for the dynamic model in addition to the location of the latent points. For the<br />
second data-set we keep the latent locations fixed to exemplify the performance<br />
of the NCCA algorithm, which means that the only parameters for the model are the 8 hyper-parameters of the generative mappings. We learn the parameters of the
models by optimizing the marginal likelihood of the model Eq. 3.15 using a scaled conjugate gradient algorithm. Ideally we would like to run the optimization until the model converges; however, in practice, we limit the number of iterations to 10000. We have found that running the optimization further does not result in increased performance.
5.6.1 Single Image Inference<br />
By initializing the subspace model’s shared space using CCA and the private spaces using NCCA we have assumed that the shared latent space can be represented as a smooth mapping from both the image feature and the pose space. Having trained a model we learn a mapping g_y from the image features Y to the shared latent locations X^S of the trained model,
x^S_i = g_y(y_i). (5.7)
We will in practice use a GP regressor to model gy, however, any regression model<br />
would be applicable. The GP regressor uses a combination of an RBF, a bias and<br />
a white noise kernel Eq. 5.1, specified by four hyper-parameters that are found through gradient-based optimization of the GP marginal likelihood Eq. 2.48 of the data. Having learnt g_y, given a new unseen image feature y_*, the corresponding location on the shared latent space x_* can be determined. The private latent space for the pose, X^Z, is orthogonal to the full latent representation of the image feature. This implies that its location, x^Z_*, has no correlation with the location in the image feature space y_*. However, it is assumed that the problem is “well” represented by the training data, implying that examples of each ambiguity we are likely to see are represented in the training data. This implies that by finding
locations x^Z_{*i} over the private pose space which maximize the likelihood of the pose z_* generated from the corresponding full latent pose location [x^S_*; x^Z_{*i}], each mode will correspond to a pose ambiguity of the feature. Maximizing the likelihood of the pose corresponds to minimizing the predictive variance of the GP generating the pose space,
x̂^Z_* = argmin_{x^Z_*} [ k(x^{S,Z}_*, x^{S,Z}_*) − k(x^{S,Z}_*, X^{S,Z})^T (K + β^{−1}I)^{−1} k(x^{S,Z}_*, X^{S,Z}) ]. (5.8)

5.6.2 Sequential Inference
Finding the modes associated with different locations over the private pose space<br />
associates multiple poses to an ambiguous location in the image feature space. As<br />
the private pose space is orthogonal to the image feature latent representation it is<br />
not possible to disambiguate between the different modes using information from<br />
the feature. However, pose data is sequential, by placing a dynamic GP over the<br />
latent pose space a representation respecting the data’s dynamics can be learned.<br />
When inferring the pose from a sequence of image features the dynamic model<br />
can be used to disambiguate locations over the image feature orthogonal private<br />
pose space. The shared latent locations are determined by the mapping from the<br />
image features as before, but the the locations over the private subspace can, with<br />
the incorporation of the dynamic model, be found such that the full sequence<br />
renders a high likelihood.<br />
X̂^Z_* = argmax_{X^Z_*} p(X^Z_* | Z, X^S, Φ_Z, Φ_dyn). (5.9)
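A sketch of the mode search over the private pose space that underlies the inference described in this and the previous subsection: with the shared coordinate fixed by the mapping from the image features, candidate modes are found by multi-start minimization of the GP predictive variance (cf. Eq. 5.8); the sequential case would additionally weight candidates by the dynamic model. The plain RBF kernel, the multi-start strategy and all names here are illustrative assumptions rather than the exact procedure used in the experiments:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    """Unit-variance RBF kernel between row vectors of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * gamma * sq)

def pose_modes(x_shared, X_train, beta=100.0, n_starts=25, seed=0):
    """Multi-start minimisation of the GP predictive variance over the
    private pose dimensions, with the shared coordinate held fixed;
    distinct minima are returned as candidate pose modes."""
    K = rbf(X_train, X_train) + np.eye(len(X_train)) / beta
    K_inv = np.linalg.inv(K)
    q_priv = X_train.shape[1] - len(x_shared)

    def variance(x_priv):
        x = np.concatenate([x_shared, x_priv])[None, :]
        k_star = rbf(x, X_train)
        return (rbf(x, x) - k_star @ K_inv @ k_star.T).item()

    rng = np.random.RandomState(seed)
    modes = []
    for x0 in rng.randn(n_starts, q_priv):
        res = minimize(variance, x0, method="L-BFGS-B")
        if res.success and not any(np.allclose(res.x, m, atol=1e-2) for m in modes):
            modes.append(res.x)
    return modes
```

Each returned minimum corresponds to one plausible pose for the ambiguous image feature; the dynamic model then selects among them over a sequence.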
Figure 5.2: Angle error: The image on the left is the true pose, the middle image<br />
has an angle error of 1.7 ◦ , the image on the right has an angle error of 4.1 ◦ . An<br />
angle error higher up in the joint-hierarchy will affect the positions of all joints further down. As the errors for the middle image are higher up in the hierarchy, this will affect each limb connected further down the chain from this joint, thereby resulting in significantly different limb positions.
5.7 Quantitative Results<br />
Both the Poser and the HumanEva data-sets come with a provided error measure to quantify the quality of the result. In the case of Poser the mean RMS error is used, defined as follows,
E_poser(z, ẑ) = (1/N) Σ_{i=1}^{N} ||(ẑ_i − z_i) mod 360°||_2, (5.10)
where z is the true pose and ˆz is the estimated pose. To make comparison of<br />
results possible we will follow [2] and use the above error measure. However, E_poser can be misleading as a qualitative measure as it is applied to the joint angle space. A mean square error treats all dimensions of the joint angle space with equal importance and does not reflect the hierarchical structure of the human physiology. This means that joints higher up in the hierarchy affect all joints further down in the hierarchy (Figure 5.2).
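The effect described above can be illustrated with a toy two-dimensional kinematic chain (a hypothetical three-link example, not the Poser skeleton): the same angular error displaces every joint below it in the hierarchy when applied at the root, but only the last joint when applied at the leaf.

```python
import numpy as np

def joint_positions(angles, lengths):
    """2-D forward kinematics of a chain; each angle is relative to its
    parent, so an error high in the hierarchy moves all joints below it."""
    pos, total, pts = np.zeros(2), 0.0, []
    for a, l in zip(angles, lengths):
        total += a                       # angles accumulate down the chain
        pos = pos + l * np.array([np.cos(total), np.sin(total)])
        pts.append(pos)
    return np.array(pts)

lengths = [1.0, 1.0, 1.0]
true_pose = joint_positions([0.1, 0.2, 0.3], lengths)
root_err = joint_positions([0.15, 0.2, 0.3], lengths)   # 0.05 rad error at the root
leaf_err = joint_positions([0.1, 0.2, 0.35], lengths)   # same error at the leaf
```

The root error moves all three joint positions while the leaf error moves only the last one, which is why equal-weight angle errors can correspond to very different visual errors.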
The HumanEva data-set avoids the problems associated with the Poser data by
Figure 5.3: Single Image Pose Estimation: Input silhouette followed by output<br />
poses associated with modes on the latent space ordered according to decreasing<br />
likelihood. As can be seen, the modes correspond to varying limb positions that we expect to be ambiguous given the input silhouette.
representing poses using joint locations rather than joint angles,
E_HumanEva(z, ẑ) = (1/N) Σ_{i=1}^{N} ||x_i − x̂_i||_2. (5.11)
The HumanEva error metric E_HumanEva corresponds better to the visual quality of the pose estimate, being calculated in a joint position space rather than the joint angle space used for E_poser.
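Hedged sketches of the two error measures; reading Eq. 5.10's mod-360° operation as the shortest per-angle distance, and Eq. 5.11 as one row per example, are assumptions about conventions the text leaves implicit:

```python
import numpy as np

def poser_error(z_true, z_est):
    """Mean RMS joint-angle error (cf. Eq. 5.10); the mod-360 operation
    is read here as the shortest angular distance per joint angle."""
    d = np.abs(z_est - z_true) % 360.0
    d = np.minimum(d, 360.0 - d)        # shortest way around the circle
    return float(np.mean(np.linalg.norm(d, axis=1)))

def humaneva_error(x_true, x_est):
    """Mean Euclidean joint-position error in mm (cf. Eq. 5.11),
    with one row per example of stacked joint coordinates."""
    return float(np.mean(np.linalg.norm(x_est - x_true, axis=1)))
```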
5.8 Experiments<br />
We applied the shared back-constrained GP-LVM model to the Poser data-set.<br />
Figure 5.3 shows the different pose estimates associated with a single image feature vector. In Figure 5.4 the estimate for the shared back-constrained GP-LVM model using dynamics to disambiguate the different modes is shown. As can be seen, we are correctly estimating the pose for most frames. In Table 5.1 the angle
error of our suggested model is compared to a set of baseline regression algorithms.

Figure 5.4: Every 20th frame from a circular walk sequence. Top Row: Input Silhouette, Middle Row: Model Pose Estimate, Bottom Row: Ground Truth.

Angle Error (°)
Mean Training Pose 8.3°
Linear Regression 7.7°
RVM 5.9°
GP Regression 5.8°
SBC-GP-LVM Single 6.5°
SBC-GP-LVM Sequence 5.3°
Table 5.1: Mean RMS angle error for the Poser data-set. SBC-GP-LVM refers to the shared back-constrained GP-LVM model. Note that only the SBC-GP-LVM Sequence method uses temporal information.

We can see that both the shared back-constrained GP-LVM methods
perform in line with or better than the regression algorithms. It is clear that learning a latent representation respecting the dynamics of the data is beneficial, as the best results are achieved by the only model incorporating dynamics for inference.
For a silhouette-based image representation such as the shapeme descriptor we expect significant ambiguities with respect to the heading direction, i.e. in or out of the view-plane. However, as can be seen in [2] (Figure 7), the heading angle is nearly perfectly predicted using a regression algorithm. This means that the Poser data-set does not contain a significant amount of feature-to-pose ambiguities. Due
Error (mm)<br />
Mean Training Pose 163<br />
Linear Regression 384<br />
GP Regression 163<br />
SBC-GP-LVM Single 201<br />
SBC-GP-LVM Sequence 73<br />
S-GP-LVM Sequence 70<br />
Table 5.2: Mean Error for the shape context HumanEva data-set. SBC-GP-LVM
refers to the shared back-constrained GP-LVM model and S-GP-LVM refers to the<br />
subspace GP-LVM model.<br />
to the central motivation of the subspace GP-LVM model being specifically to model such ambiguities, we proceed to use the HumanEva data-set.
We apply the suggested models to the first HumanEva data-set using shapeme<br />
features to represent each image. In Table 5.2 the results for our suggested models and a set of baselines are shown. From the poor performance of the regression algorithms applied to the HumanEva data we can see that this data-set contains a significant amount of ambiguities between the image representation and the pose.
As can be seen, the single estimate using the shared back-constrained GP-LVM results in worse performance compared to both the model using dynamical information and the GP regression baseline. This is to be expected as the model will, in cases where the features are ambiguous, predict one of the possible poses, while the regression algorithm in such cases will predict the mean of the ambiguous poses, which for many types of ambiguities results in a smaller error compared to
predicting the “wrong” mode. One advantage of the subspace GP-LVM model is<br />
that we can easily visualize the ambiguities by sampling locations over the pose<br />
private subspace.<br />
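Such a visualization can be sketched by evaluating the GP predictive variance (low variance corresponding to high pose likelihood) on a regular grid over a two-dimensional private pose space; the plain RBF kernel, grid bounds and all names are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Unit-variance RBF kernel between row vectors of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * gamma * sq)

def variance_map(x_shared, X_train, lim=2.0, res=50, beta=100.0):
    """Predictive variance on a regular grid over a 2-D private pose space;
    low-variance regions correspond to the modes, i.e. plausible poses."""
    K = rbf(X_train, X_train) + np.eye(len(X_train)) / beta
    K_inv = np.linalg.inv(K)
    axis = np.linspace(-lim, lim, res)
    gx, gy = np.meshgrid(axis, axis)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    shared = np.repeat(x_shared[None, :], len(grid), 0)
    X_star = np.hstack([shared, grid])
    k_star = rbf(X_star, X_train)
    var = 1.0 - np.sum((k_star @ K_inv) * k_star, axis=1)  # k(x,x)=1 for RBF
    return var.reshape(res, res)
```

Plotting the resulting surface as a heat map makes the discrete or elongated modes of ambiguous features directly visible.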
In the next section we will compare the different proposed models.
5.9 Comparison<br />
In Figure 5.5 the kernel response matrices generating the observed features for the learned kernel hyper-parameters are shown. As can be seen, the width of the kernel for the back-constrained model is much smaller compared to the subspace model. This implies that the back-constrained model is less capable of generalizing the image feature representation compared to the subspace model. This is an effect of trying to learn a shared latent space of two observed spaces which have different latent structures. The back-constraint enforces a latent structure which preserves the local structure of the pose space; however, this structure is different from the structure of the feature space, making the GP generating the image features unable to generalize over the feature representation. In the subspace model the private subspace can be used to model structures in the feature space not contained in the pose parameters, making the model capable of generalizing better over the image
features. We have shown the advantage of the latent factorization in the subspace model applied to the HumanEva data-set, as it allows the model to better generalize its description of the data. For the next experiment we will use the second HumanEva data-set, which in addition to the first data-set also includes each image “flipped”; this significantly increases the amount of ambiguities in the data. Each
image in this data-set is represented using the HOG feature descriptor. Applying<br />
the subspace GP-LVM model initialized using the NCCA model leads to a three<br />
dimensional latent pose representation divided into a single shared dimension and<br />
a two dimensional private pose space.<br />
Figure 5.6 shows the results of sampling the likelihood over the pose-specific private latent subspace X^Z for a sequence of input images, together with the
Figure 5.5: The kernel response matrices used to generate the observed image<br />
features from the learned latent representation. The left figure shows the matrix<br />
for the shared back-constrained GP-LVM model while the right figure shows the<br />
matrix for the subspace GP-LVM model. Comparing the images it can be seen<br />
that the back-constrained model uses a much smaller kernel width compared to the subspace model, implying that it is not capable of generalizing as well as the subspace model.
pose corresponding to the mode closest to the ground truth pose.
We can see how the modes evolve over the sequence: for images representing motion perpendicular to the view-plane (1, 2 and 6) there are two elongated modes, while for motion in the view-plane there is a discrete set of modes. Having fixed the shared latent location reduces the estimation task to choosing the appropriate mode, corresponding to a much smaller sub-set of the possible poses compared to the full data-set.
We compare the subspace GP-LVM model to the standard shared GP-LVM<br />
[52, 44]. For comparison and visualization purposes we will train a shared GP-<br />
LVM model using a two dimensional latent space. Neither model makes use of dynamical information. In Figure 5.7 two different examples of how the models represent the training data are shown. For the first image, where the human’s position is perpendicular to the view-plane, there are two elongated modes
Figure 5.6: Pose inference on a sequence of images from the second HumanEva<br />
data set. Top row: original test set image. Second row: visualisation of the<br />
modes in the non-shared portion of the pose specific latent space. Note how the<br />
modes evolve as the subject moves. When the subject is heading in a direction perpendicular to the view-plane it is not possible to disambiguate the heading direction (images 1, 2 and 6); this is indicated by two elongated modes. In images (3–5) it is not possible to disambiguate the configuration of the arms and legs; this gives rise to a set of discrete modes over the latent space, each associated with a different configuration. Bottom row: the pose coming from the mode closest to the ground truth is shown. The different types of mode are explored further in
Figure 5.7.
over the pose specific private latent space. Sampling poses along the modes shows that each mode corresponds to the heading direction being in or out of the view-plane, while points along the mode correspond to different configurations of the legs. For the second image the motion is parallel to the view-plane, resulting in a discrete set of modes, each corresponding to different limb configurations that we would expect to be ambiguous. As can be seen, both models are capable of modeling the correct modes in the data. However, comparing the two different energies we can see that while the modes are clearly separated for the subspace model, the energy for the shared model is scattered with local minima. As we have shown in this chapter, the existing image features used for pose estimation are often ambiguous to pose. This means that even for a model producing a multi-modal estimate, being able to easily encode additional information to disambiguate between the different modes is essential. Compared with the standard shared GP-LVM, the subspace model should make it significantly easier to use such additional information.
5.10 Summary<br />
In this chapter we have shown the application of the shared back-constrained and the subspace GP-LVM models to the task of image-based human pose estimation. Image-based human pose estimation is an interesting application because of its potential usefulness for real-world problems in areas such as
computer graphics and human-robot interaction. Further, due to the high dimensionality of the input domain, each image is represented using some type of heuristic-based image feature, which introduces ambiguities between each image representation and its corresponding pose.

Subspace GP-LVM
Shared GP-LVM
Figure 5.7: The top row shows two images from the training data. The 2nd and 3rd rows show results from inferring the pose using the subspace model; the first column shows the likelihood sampled over the pose specific latent space constrained by the image features, and the remaining columns show the modes associated with the locations of the white dots over the pose specific latent space. Subspace GP-LVM: In the 2nd row the position of the leg and the heading angle cannot be determined in a robust way from the image features. This is reflected by two elongated modes over the latent space representing the two possible headings. The poses along each mode represent different leg configurations. The top row of the 2nd column shows the poses generated by sampling along the right mode and the bottom row along the left mode. In the 3rd row the position of the leg and the heading angle is still ambiguous to the feature; however, here the ambiguity is between a discrete set of poses, indicated by four clear modes in the likelihood over the pose specific latent space. Shared GP-LVM: The 4th and 5th rows show the results of doing inference using the shared GP-LVM model. Even though the most likely modes found are in good correspondence to the ambiguities in the images, the latent space is cluttered by local minima that the optimization can get stuck in.

These ambiguities imply that the
task of estimating the pose associated with a specific image feature is multi-modal, which makes the problem interesting from a machine learning perspective. By applying the proposed models we have shown two different approaches to handling the multi-modalities that arise, each with associated advantages and disadvantages.
The main goal behind these models was to be applicable to data where a significant amount of ambiguities exists between the different observation spaces. With the image features available, human pose estimation is known to be such a task. However, we would like to point out that our models are general and could be applied in different application areas known to exhibit similar relationships between each view. Further, the inference schemes which we apply to get quantitative results are rudimentary at best; we do not propose a method for pose estimation, we simply use it as an example of modeling ambiguous data. Our main result concerns how the presented models handle multi-modalities, not disambiguation between such modes.
In the following chapter we will conclude the work presented in this thesis and<br />
briefly discuss potential directions of future work.
Chapter 6<br />
Conclusions<br />
6.1 Discussion<br />
In this thesis we have presented two different probabilistic models based on Gaussian Processes for dimensionality reduction in the scenario where we have multiple different observations of a single underlying phenomenon. The models presented are applicable to a multitude of different modeling scenarios: where each observation space is fully correlated, where we are only interested in modeling the correlated data, or where we want to learn a factorized structure and model both the shared correlated information and the non-correlated information private to each observation space.
6.2 Review of the Thesis<br />
The first chapter provided a brief introduction to the work undertaken in this thesis. In chapter 2 the machine learning task of dimensionality reduction was motivated and reviewed. Further, chapter 2 provided an introduction to Gaussian Processes, upon which the work in this thesis is built. Chapter 3 presents two different Gaussian Process Latent Variable models for shared dimensionality reduction, applicable to two different but very common modeling scenarios. In chapter 4 a novel spectral dimensionality reduction algorithm is presented. This model is an essential component of the factorized subspace GP-LVM model presented in chapter 3. Chapters 3 and 4 present the new work done in this thesis. In chapter 5 we apply the presented models to human motion capture data with associated image observations. Further, we apply the learned models to the computer vision task of human pose estimation.
6.3 Future Work<br />
There are multiple different directions of possible future work for the models presented in this thesis. Even though we have briefly shown results on the models applied to the task of human pose estimation, we believe that there are several other applications where the use of the shared GP-LVM models can be advantageous. Such application areas include multi-modal feature fusion and multi-agent modeling. Even though we have in this thesis described the model in terms of two observation spaces, there is nothing in the framework limiting the number of observation spaces. The application areas suggested are often characterized by more than two observations, making them interesting modeling scenarios.
This thesis has focused on creating models applicable to data with a common shared structure. However, we have not focused on performing inference within the suggested models. The inference procedures suggested in the applications chapter are rudimentary at best. We believe that the models represent the data in an efficient way, and with better inference procedures the results presented on human pose estimation could be significantly improved.
The GP-LVM model has in the general case a number of free parameters. Even though these are few compared with many other models for dimensionality reduction, determining the latent dimensionality in particular is not trivial, and its choice has a significant effect on how well the observed data is modeled. Recent work [21] applies a rank-regulariser to learn the dimensionality of the latent space, effectively removing this as a free parameter from the GP-LVM. We believe this could be of significant benefit to the shared GP-LVM models presented in this thesis.
The objective of the standard GP-LVM model is to find a latent structure from<br />
which a GP regression is able to minimize the reconstruction error of the observed<br />
data. For a single observed space minimizing the reconstruction error is a sensible<br />
objective. However, for the shared models, especially in the case of the factor-<br />
ized subspace, this is less obvious. In effect, the success of the presented model<br />
relies on the fact that the initialization of the latent space is in convex region of<br />
the factorized solution we are looking for as nothing outside the initialization en-<br />
courages the shared private factorization of the latent space. It would therefore<br />
be interesting to explore ways of encode this factorization as a part of the GP-<br />
LVM objective. Further, we use CCA to initialize the shared latent locations of<br />
the subspace model. There are several problems associated with the solution of<br />
CCA, some of which we address in this thesis. It would be interesting to explore<br />
different criteria for initializing the shared location. A recent model [55] learns
a shared latent variable model in which a latent space maximizing the mutual information between the observations is learnt. It would further be interesting to see if the model presented in [55] could be extended to incorporate the shared/private factorization suggested in this thesis.
Bibliography<br />
[1] A. Agarwal. Machine learning for image based motion capture. PhD thesis,<br />
Institut national polytechnique de Grenoble, April 2006.<br />
[2] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance<br />
vector regression. In Computer Vision and Pattern Recognition, 2004. CVPR<br />
2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol-<br />
ume 2, 2004.<br />
[3] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of<br />
the potential function method in pattern recognition learning. Automation<br />
and remote control, 25(6):821–837, 1964.<br />
[4] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical cor-<br />
relation analysis. Technical Report 688, Department of Statistics, University<br />
of California, Berkeley, 2005.<br />
[5] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for<br />
embedding and clustering. Advances in Neural Information Processing Sys-
tems, 14:585–591, 2002.<br />
[6] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for<br />
shape matching and object recognition. In NIPS, 2000.<br />
[7] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition<br />
using shape contexts. IEEE Transactions on Pattern Analysis and Machine<br />
Intelligence, 24(4):509–522, 2002.<br />
[8] M. Bethge, T. Wiecki, and F. Wichmann. The independent components of<br />
natural images are perceptually dependent. In B. E. Rogowitz, T. N. Pappas, and S. J. Daly, editors, Human Vision and Electronic Imaging XII, Proceedings of the SPIE, volume 6492, page 64920A, 2007.
[9] C. Bishop, M. Svensén, and C. Williams. GTM: A Principled Alternative to<br />
the Self-Organizing Map. In Artificial Neural Networks: ICANN 96, Bochum, Germany, July 16–19, 1996.
[10] P. Bose, W. Lenhart, and G. Liotta. Characterizing proximity trees. Algo-<br />
rithmica, 16(1):83–110, 1996.<br />
[11] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university<br />
press, 2004.<br />
[12] M. Bray, P. Kohli, and P. Torr. Posecut: Simultaneous segmentation and<br />
3d pose estimation of humans using dynamic graph-cuts. Lecture Notes in<br />
Computer Science, 3952:642, 2006.<br />
[13] C. Bregler and J. Malik. Tracking people with twists and exponential maps.<br />
In 1998 IEEE Computer Society Conference on Computer Vision and Pattern<br />
Recognition, 1998. Proceedings, pages 8–15, 1998.<br />
[14] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935–1959, 2005.
[15] T. Cham and J. Rehg. A multiple hypothesis approach to figure tracking.<br />
In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society<br />
Conference on., volume 2, 1999.<br />
[16] H. Choi and S. Choi. Kernel isomap. Electronics letters, 40(25):1612–1613,<br />
2004.<br />
[17] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall/CRC,<br />
2001.<br />
[18] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, 2005.
[19] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by<br />
annealed particle filtering. In IEEE Conference on Computer Vision and<br />
Pattern Recognition, 2000. Proceedings, volume 2, 2000.<br />
[20] R. Fisher. The use of multiple measurements in taxonomic problems. Ann<br />
of Eugenics, 7:179–188, 1936.<br />
[21] A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear<br />
dimensionality reduction. In CVPR ’09: Proceedings of the 2009 IEEE<br />
Computer Society Conference on Computer Vision and Pattern Recognition,<br />
2009.<br />
[22] I. Good. What are degrees of freedom? The American Statistician,<br />
27(5):227–228, 1973.<br />
[23] J. Gower and G. Dijksterhuis. Procrustes problems. Oxford University Press,
2004.
[24] K. Grochow, S. Martin, A. Hertzmann, and Z. Popović. Style-based inverse<br />
kinematics. ACM Transactions on Graphics (TOG), 23(3):522–531, 2004.<br />
[25] J. Hamm, I. Ahn, and D. Lee. Learning a manifold-constrained map between<br />
image sets: applications to matching and pose estimation. In CVPR ’06:<br />
Proceedings of the 2006 IEEE Computer Society Conference on Computer<br />
Vision and Pattern Recognition, pages 817–824, Washington, DC, USA,<br />
2006. IEEE Computer Society.<br />
[26] J. Hamm, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In<br />
R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth Inter-<br />
national Workshop on Artificial Intelligence and Statistics, pages 120–127,<br />
2005.<br />
[27] S. Haykin. Neural networks: a comprehensive foundation. Prentice Hall,<br />
2008.<br />
[28] S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley. Real-time body<br />
tracking using a Gaussian process latent variable model. In Proceedings
of the 11th IEEE International Conference on Computer Vision (ICCV’07),<br />
2007.<br />
[29] J. Jaromczyk and G. Toussaint. Relative neighborhood graphs and their rel-<br />
atives. Proceedings of the IEEE, 80(9):1502–1517, 1992.<br />
[30] S. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model<br />
of articulated image motion. In Proceedings of the 2nd International Con-<br />
ference on Automatic Face and Gesture Recognition (FG’96), page 38. IEEE<br />
Computer Society Washington, DC, USA, 1996.
[31] M. Kuss and T. Graepel. The geometry of kernel canonical correlation anal-<br />
ysis. Technical Report TR-108, Max Planck Institute for Biological Cyber-<br />
netics, Tübingen, Germany, 2003.<br />
[32] N. D. Lawrence. Gaussian Process Models for Visualisation of High Dimensional Data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16, pages 329–336, Cambridge, MA, 2004.
[33] N. D. Lawrence. Probabilistic non-linear principal component analysis with<br />
Gaussian process latent variable models. The Journal of Machine Learning
Research, 6:1783–1816, November 2005.<br />
[34] N. D. Lawrence and A. Moore. Hierarchical Gaussian process latent vari-
able models. Proceedings of the 24th international conference on Machine<br />
learning, pages 481–488, 2007.<br />
[35] N. D. Lawrence and J. Quiñonero-Candela. Local distance preservation in the GP-LVM through back constraints. In R. Greiner and D. Schuurmans, editors, Proceedings of the International Conference on Machine Learning, pages 513–520, New York, NY, USA, 2006. ACM.
[36] G. Leen. Context assisted information extraction. PhD thesis, University of the West of Scotland, Paisley, Scotland, 2008.
[37] G. Leen and C. Fyfe. A Gaussian process latent variable model formulation of canonical correlation analysis. Bruges (Belgium), 26–28 April 2006.
[38] D. Lowe. Distinctive image features from scale-invariant keypoints. Inter-<br />
national Journal of Computer Vision, 60(2):91–110, 2004.
[39] D. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A, 1995.
[40] K. Mardia, J. Kent, and J. Bibby. Multivariate Analysis. Academic Press, New York, 1979.
[41] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, pages 415–446, 1909.
[42] G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(11):1832–1837, 2005.
[43] D. Morris and J. Rehg. Singularity analysis for articulated object tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 289–296, 1998.
[44] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. The joint manifold model. In IEEE International Conference on Computer Vision (ICCV), 2007.
[45] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
[46] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. Torr. Randomized trees for human pose detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[47] R. Rosales and S. Sclaroff. Learning body pose via specialized maps. In Advances in Neural Information Processing Systems, 2:1263–1270, 2002.
[48] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[49] M. Salzmann, R. Urtasun, and P. Fua. Local deformation models for monocular 3D shape recovery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[50] B. Schölkopf. The kernel trick for distances. Technical Report MSR-TR-2000-51, Microsoft Research, Redmond, WA, 2000. Also in Advances in Neural Information Processing Systems, 2001.
[51] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. Lecture Notes in Computer Science, 1327:583–588, 1997.
[52] A. Shon, K. Grochow, A. Hertzmann, and R. Rao. Learning shared latent structure for image synthesis and robotic imitation. In Advances in Neural Information Processing Systems, pages 1233–1240, 2006.
[53] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. Lecture Notes in Computer Science, 1843:702–718, 2000.
[54] L. Sigal and M. Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical report, Brown University, 2006.
[55] L. Sigal, R. Memisevic, and D. J. Fleet. Shared kernel information embedding for discriminative inference. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[56] H. A. Simon. The Sciences of the Artificial, 3rd edition. The MIT Press, October 1996.
[57] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3D human motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 217–323, 2005.
[58] C. Sminchisescu and B. Triggs. Estimating articulated human motion with covariance scaled sampling. The International Journal of Robotics Research, 22(6):371, 2003.
[59] J. F. M. Svensén. GTM: The Generative Topographic Mapping. PhD thesis, Aston University, April 1998.
[60] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[61] A. Thayananthan, R. Navaratnam, B. Stenger, P. Torr, and R. Cipolla. Pose estimation and tracking using multivariate regression. Pattern Recognition Letters, 29(9):1302–1310, 2008.
[62] T. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth space of feasible solutions. In CVPR Learning Workshop, volume 2, 2005.
[63] M. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems, pages 652–658, 2000.
[64] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
[65] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
[66] G. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12(4):261–268, 1980.
[67] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In Proceedings of the 24th International Conference on Machine Learning, pages 927–934, New York, NY, USA, 2007. ACM.
[68] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[69] R. Urtasun, D. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2006.
[70] R. Urtasun, D. Fleet, A. Geiger, J. Popović, T. Darrell, and N. Lawrence. Topologically-constrained latent variable models. In Proceedings of the 25th International Conference on Machine Learning, pages 1080–1087, New York, NY, USA, 2008. ACM.
[71] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In IEEE International Conference on Computer Vision (ICCV), pages 403–410, 2005.
[72] R. Urtasun, D. J. Fleet, and P. Fua. Temporal motion models for monocular and multiview 3D human body tracking. Computer Vision and Image Understanding, 104(2-3):157–177, 2006.
[73] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[74] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. In Proceedings of the IEEE Nonrigid and Articulated Motion Workshop, pages 2–9, 1997.
[75] J. Wang, D. Fleet, and A. Hertzmann. Multifactor Gaussian process models for style-content separation. In Proceedings of the 24th International Conference on Machine Learning, pages 975–982, New York, NY, USA, 2007. ACM.
[76] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
[77] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In Advances in Neural Information Processing Systems, volume 18, Cambridge, MA, 2006. MIT Press.
[78] K. Weinberger, F. Sha, and L. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[79] K. Q. Weinberger, B. D. Packer, and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, January 2005.
[80] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 988–995, Washington, D.C., 2004.
[81] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.
[82] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In ECML, pages 773–781, 2007.
[83] X. Zhao, H. Ning, Y. Liu, and T. Huang. Discriminative estimation of 3D human pose using Gaussian processes. In 19th International Conference on Pattern Recognition (ICPR), pages 1–4, 2008.