
Shared Gaussian Process Latent Variables Models

Carl Henrik Ek

Submitted in partial fulfilment of the requirements of the award of PhD

Oxford Brookes University

August 2009


Abstract

A fundamental task in machine learning is modeling the relationship between different observation spaces. Dimensionality reduction is the task of reducing the number of dimensions in a parameterization of a data-set. In this thesis we are interested in the crossroads between these two tasks: shared dimensionality reduction. Shared dimensionality reduction aims to represent multiple observation spaces within the same model. Previously suggested models have been limited to scenarios where the observations have been generated from the same manifold. In this thesis we present a Gaussian Process Latent Variable Model (GP-LVM) [33] for shared dimensionality reduction without making assumptions about the relationship between the observations. Further, we suggest an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The proposed algorithm extends classical CCA to represent the full variance of the data, as opposed to only the correlated variance. We compare the suggested GP-LVM model to existing models and show results on real-world problems exemplifying the advantages of our approach.


Acknowledgements


Contents

1 Introduction
1.1 Overview of the thesis
1.2 Publications
1.3 Notations and Conventions

2 Background
2.1 Introduction
2.1.1 Curse of Dimensionality
2.2 Dimensionality Reduction
2.3 Linear Algebra
2.4 Spectral Dimensionality Reduction
2.5 Non-Linear
2.5.1 Kernel-Trick
2.5.2 Proximity Graph Methods
2.5.3 Summary
2.6 Generative Dimensionality Reduction
2.7 Gaussian Processes
2.7.1 Prediction
2.7.2 Training
2.7.3 Relevance Vector Machine
2.8 GP-LVM
2.8.1 Latent Constraints
2.8.2 Summary
2.9 Shared Dimensionality Reduction
2.10 Feature Selection
2.11 Summary

3 Shared GP-LVM
3.1 Introduction
3.2 Shared GP-LVM
3.2.1 Initialization
3.2.2 Inference
3.2.3 Example
3.2.4 Summary
3.3 Subspace GP-LVM
3.4 Extensions
3.5 Applications
3.6 Summary

4 NCCA
4.1 Introduction
4.2 Assumptions
4.3 Shared
4.4 Private
4.4.1 Extracting the first orthogonal direction
4.4.2 Extracting consecutive directions
4.4.3 Example
4.5 Extensions
4.6 Summary

5 Applications
5.1 Introduction
5.2 Human Pose Estimation
5.2.1 Generative
5.2.2 Discriminative
5.2.3 Problem Statement
5.3 Image Features
5.3.1 Shapeme Features
5.3.2 Histogram of Oriented Gradients
5.4 Data-sets
5.5 Shared back-constrained GP-LVM
5.5.1 Single Image Inference
5.5.2 Sequential Inference
5.6 Subspace GP-LVM
5.6.1 Single Image Inference
5.6.2 Sequential Inference
5.7 Quantitative Results
5.8 Experiments
5.9 Comparison
5.10 Summary

6 Conclusions
6.1 Discussion
6.2 Review of the Thesis
6.3 Future Work


List of Figures

2.1 Volume ratio of hyper-cube and hyper-sphere
2.2 Swissroll
2.3 Generative latent variable model
2.4 GTM model
2.5 Samples from GP Prior
2.6 Samples from GP Posterior
2.7 Probabilistic CCA
3.1 Shared back-constrained GP-LVM
3.2 Toy data: generating signals
3.3 Toy data: observed data
3.4 Toy data: latent embeddings
3.5 Toy data 2: generating signals
3.6 Toy data 2: observed data
3.7 Toy data 2: latent embeddings
3.8 Toy data 3: generating signals
3.9 Toy data 3: observed data
3.10 Toy data 3: latent embeddings
3.11 Toy data 3: latent embeddings
3.12 Subspace GP-LVM model
4.1 NCCA model
4.2 Toy data 3: NCCA embedding
4.3 Toy data 3: observed data
4.4 Toy data 3: Subspace GP-LVM embedding
5.1 Kernel response matrix, Poser pose data
5.2 Misinterpretation of angle error
5.3 Poser single image results, back-constrained GP-LVM
5.4 Poser sequence image results, back-constrained GP-LVM
5.5 Kernel matrix, back-constrained and subspace GP-LVM
5.6 Subspace GP-LVM pose inference
5.7 Subspace GP-LVM ambiguity modeling


List of Tables

3.1 Toy data: Procrustes score
5.1 Error on Poser data
5.2 Error on HumanEva


Chapter 1

Introduction

Developments in information technology have led to a significant expansion in the storage capabilities for digital content. This has meant that in many application areas where observations used to be scarce we now have access to significant amounts of data. This development has led to a transition in many fields from the purely model driven paradigm to the data driven approach. In model driven modeling the aim is to explain a specific phenomenon using a model designed for the task at hand. This is different from data-driven modeling, where one tries to use observations of the phenomenon to learn a model. For most modeling scenarios the data available is represented in a form defined by the device used to capture the data. This means that the degrees of freedom of the available data are the degrees of freedom of the capturing device, not the actual phenomenon. This often leads to the data being represented using a very large number of dimensions, often significantly larger than the number of dimensions, or degrees of freedom, of the underlying phenomenon. This has led to the machine learning field of Dimensionality Reduction. In dimensionality reduction the aim is to find the data's true or intrinsic parameterization from the capturing device representation.

Many tasks in computer science are associated with data coming from multiple streams or views of the same underlying phenomenon. Often each view provides complementary information about the data. For modeling purposes it is therefore of interest to use information from each view. The task of merging several different views is called Feature Fusion.

The work undertaken in this thesis spans both realms presented above. Given multiple views of the same phenomenon we create models which are capable of leveraging the advantage of each view in learning a reduced dimensional representation of the data.

1.1 Overview of the thesis

A brief outline of the dissertation follows.

Chapter 2 This chapter provides the motivation and the background to the machine learning task of dimensionality reduction. The two different approaches to dimensionality reduction, spectral and generative, are introduced and their strengths and weaknesses reviewed. We continue by introducing Gaussian Processes (GP) and give a brief background to Bayesian Modeling. The Gaussian Process Latent Variable Model (GP-LVM) [33, 32], a dimensionality reduction model based on Gaussian Processes, is introduced. Further, we introduce the task of Shared Dimensionality Reduction, which will be the main focus of this thesis.

Chapter 3 This chapter describes the two shared generative dimensionality reduction models developed in this thesis. By motivating the shortcomings of current models we derive two new models in the GP-LVM framework. We first present the shared back-constrained GP-LVM and continue to describe the second suggested model, the subspace GP-LVM model.

Chapter 4 We present an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The NCCA algorithm allows us to transform CCA from an algorithm for feature selection to one for shared dimensionality reduction.

Chapter 5 In this chapter the suggested models are applied to the Computer Vision task of human pose estimation. We apply the models to real-world data-sets and experimentally compare the two models.

Chapter 6 Concludes the work undertaken and describes potential directions of future work.

1.2 Publications

This thesis builds on work from the following publications:

1. C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2007), volume LNCS 4892, pages 132–143, Brno, Czech Republic, Jun. 2007. Springer-Verlag.

2. C. H. Ek, P. H. Torr, and N. D. Lawrence. Ambiguity modeling in latent spaces. In 5th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2008), 2008.

3. C. H. Ek, P. Jaeckel, N. Campbell, N. Lawrence, and C. Melhuish. Shared Gaussian Process Latent Variable Models for Handling Ambiguous Facial Expressions. In AIP Conference Proceedings, volume 1107, page 147, 2009.

1.3 Notations and Conventions

In the mathematical notation we use italics x to indicate scalars, bold lowercase x to indicate vectors, and bold uppercase X to indicate matrices. Vectors, unless stated otherwise, are column vectors. The transpose of a matrix is indicated by a superscript, X^T. The identity matrix is represented by I. The vector e_i represents the unit vector with all dimensions set to 0 except dimension i, which is 1.


Chapter 2

Background

2.1 Introduction

Modeling is the task of describing a system using a specific language. In this thesis the focus is on mathematical modeling, which refers to the process of describing a system through the laws of mathematics. The building blocks of mathematical models are variables or parameters which, by interacting through the laws of mathematics, aim to mimic the behavior of a system. There are many reasons why we are interested in creating accurate models of a specific system. A model allows us to analyze and simulate the behavior of the system in a hypothetical scenario without having to jeopardize the actual system.

A fundamental characteristic of a model is its degrees of freedom [22], which refers to the number of parameters or variables that are allowed to vary independently from each other. In many scenarios there is more than one way to describe a system; this can either be because different approximations or assumptions have been made or due to lack of knowledge of the system. For example, one set of data can be equally well described by two different models. However, data from the input domain outside the training data might result in different behavior from each model. Similarly, different assumptions often lead to different models. The degrees of freedom of a representation are equal to the number of parameters or dimensions. However, this parameterization does not need to represent the data in the correct way. When each dimension in the representation describes or models a single degree of freedom in the data we say that the data is in its intrinsic representation.

We will separate the task of modeling into data driven and model driven. Data driven modeling is when, given a set of training data, we try to learn a model from the data; this is different from model based modeling, where we try to fit or match a specific model to the data. This thesis will focus on data driven models.

2.1.1 Curse of Dimensionality

We spend our lives in a world that is essentially three dimensional. It is in this world we build our understanding of concepts such as distance and volume. In Machine Learning we often deal with data comprising many more dimensions. Many of the concepts we learn to recognize in two and three dimensions cannot easily be extrapolated to higher dimensions. One example is the relationship between the volume of a hyper-sphere of diameter 2 and a cube with side length 2,

V_{cube}(d) = 2^d  (2.1)

V_{sphere}(d) = \frac{2 π^{d/2}}{d \, Γ(d/2)}  (2.2)


Figure 2.1: The ratio of a hyper-cube that is contained within a hyper-sphere as a function of dimension.

Figure 2.1 shows this ratio as a function of dimensionality. In the limit the ratio of the volume that is contained within the hyper-sphere goes to zero,

\lim_{d→∞} \frac{V_{sphere}(d)}{V_{cube}(d)} = 0.  (2.3)

This means that with increasing dimension all the volume of a cube will be contained in its corners. However, our concept of the “corners of a cube” is clearly two and three dimensional in nature, as is our understanding of volume. Therefore care should be taken when working with high dimensional spaces.

The “Curse of Dimensionality” refers to the exponential growth in volume with dimension. It is called a curse due to the fact that many algorithms scale badly with increasing dimension. One way of visualizing this is to think of an algorithm as a mapping from an input-space to an output-space. For the algorithm to work for any type of input it needs to “cover” or “monitor” all of the input space. Therefore, with increasing dimension the algorithm needs to cover an exponentially growing volume.
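As a quick numerical check of Eqs. (2.1)-(2.3), the short Python sketch below (illustrative, not part of the thesis) evaluates the sphere-to-cube volume ratio for increasing dimension; the ratio falls rapidly towards zero.

```python
import math

def volume_ratio(d):
    """Ratio of the volume of the unit-radius hyper-sphere (diameter 2)
    to the volume of the enclosing hyper-cube with side length 2."""
    v_cube = 2.0 ** d                                              # Eq. (2.1)
    v_sphere = 2.0 * math.pi ** (d / 2) / (d * math.gamma(d / 2))  # Eq. (2.2)
    return v_sphere / v_cube

for d in (1, 2, 3, 5, 10, 20, 30):
    print(d, volume_ratio(d))
```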


2.2 Dimensionality Reduction

In most data-sets the number of degrees of freedom of the representation is much higher than that of the intrinsic representation. This is due to the fact that, rather than representing the degrees of freedom of the data, the representation is a reflection of the degrees of freedom in the collection process of the data.

How complex or simple a structure is depends critically upon the way we describe it. Most of the complex structures found in the world are enormously redundant, and we can use this redundancy to simplify their description. But to use it, to achieve the simplification, we must find the right representation [56].

One example of this is the parameterization of a natural image as a matrix of real values (pixels). The matrix being captured by a camera, each pixel corresponds to a single light sensor which is allowed to vary independently of the other sensors on the lens. However, this does not correspond well to natural images, as neighboring pixels are strongly correlated [8]. This implies that natural images have a significantly different intrinsic representation, and degrees of freedom, than the image representation of the camera. The correlation between pixels will, in the vector space spanned by the pixel values, manifest itself as a low-dimensional manifold. Parameterizing this manifold are the degrees of freedom for natural images. This example is a simplification as it assumes that the camera is capable of capturing the full variability or all the degrees of freedom in a natural image, i.e. that the intrinsic representation can be found as a mapping from the observed representation. A more general view of the problem that avoids this assumption is to view the collection process as a mapping from the data's intrinsic representation X to its observed representation Y; this mapping is referred to as the generating mapping,

y_i = f(x_i).  (2.4)

As in the example with the camera and natural images, the typical parameterization of data used in machine learning is over-represented with respect to the number of dimensions but under-represented in terms of the data. Geometrically this implies that the data only occupies a sub-volume of the representation space. However, the generating mapping is not necessarily invertible, which means that information from the intrinsic representation is lost.

Dimensionality reduction is the process of reducing the number of parameters needed by a specific representation. This thesis focuses on data driven dimensionality reduction. Data driven modeling is based on the assumption that the observed training data is representative. This means that we make the assumption that the observed data is sampling the input domain “well”. This implies that the degrees of freedom for the training data are assumed to be the same as for any additional samples from the same domain.

There are two main approaches to data-driven dimensionality reduction, (1) generative and (2) spectral. The generative approach aims at modeling the generating mapping f. This is, in general, an ill-posed problem as there are many ways of generating the observed data when neither the mapping nor the intrinsic representation is known. Spectral models avoid this by assuming that a smooth inverse f^{-1} to the generative mapping exists. This implies that the full degrees of freedom in the data are retained in the observed representation. This means that the intrinsic representation X of the data can be “unraveled” from evidence in the observed representation Y.

2.3 Linear Algebra

A linear mapping T from vector-space U to vector-space V, T : U → V, is represented in matrix form as,

T(x) = Ax.  (2.5)

This means that the mapping T, represented by matrix A, carries elements from vector space U to vector space V. The image im(T) of a mapping defines the set of all values the map can take,

im(T) = {T(x) : x ∈ U} ⊂ V.  (2.6)

Similarly the kernel kern(T) is the set of all values that map to zero,

kern(T) = {x ∈ U : T(x) = 0} ⊂ U.  (2.7)

The kernel and the image of a linear mapping are related through the Rank-Nullity Theorem:

dim(U) = dim(im(A)) + dim(kern(A)).  (2.8)

Intuitively this means that the number of dimensions needed to correctly represent the degrees of freedom of the data is given by subtracting the number of dimensions required to represent the null-space of the representation from the number of dimensions of the representation. The number of dimensions needed to represent the image is more commonly referred to as the rank of the matrix, which is the number of linearly independent columns in the matrix.

Two square matrices A and B are said to be similar if there exists a non-singular matrix P (a matrix such that the inverse of P exists) such that the two matrices are related as,

A = PBP^{-1},  (2.9)

this is known as a similarity transform. A similarity transform P maps from a vector space onto itself and is therefore also referred to as a "change of basis" transformation, mapping between two representations or reference frames. The determinant, trace and invertibility are all invariant under similarity.

A special similarity transform is when one reference system or basis results in a diagonal matrix,

A = VΛV^{-1}  (2.10)

Λ_{ij} = 0 for i ≠ j, Λ_{ii} = λ_i  (2.11)

VV^T = I,  (2.12)

the diagonal representation is said to be the spectral decomposition of the matrix. The columns of V are called eigenvectors and the non-zero elements of the spectral decomposition the eigenvalues of the matrix. The spectral decomposition means that we can write any square matrix as a linear combination of rank one matrices,

A_{ij} = \sum_{k=1}^{N} V_{ik} Λ_{kk} V^T_{kj} = \sum_{k=1}^{N} (v_k)_i λ_k (v_k)_j = \sum_{k=1}^{N} (λ_k v_k v_k^T)_{ij}.  (2.13)

As V specifies an orthonormal basis, the relative magnitude of each eigenvalue corresponds to the amount of A that is explained by the corresponding eigenvector. Therefore, by ordering the eigenvalues in decreasing order, λ_i ≥ λ_j for i ≤ j, we can refer to,

A_{→i} = \sum_{k=1}^{i} λ_k v_k v_k^T,  (2.14)

as the best rank i approximation of matrix A.
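The spectral decomposition and the rank-i approximation of Eq. (2.14) can be illustrated with a few lines of numpy (a sketch on a random positive semi-definite matrix, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive semi-definite matrix (e.g. a Gram or covariance matrix); its
# eigendecomposition A = V diag(lambda) V^T is the spectral decomposition
# of Eqs. (2.10)-(2.12).
B = rng.standard_normal((6, 6))
A = B @ B.T

lam, V = np.linalg.eigh(A)          # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]      # reorder so that lambda_1 >= lambda_2 >= ...

def rank_i_approx(i):
    """Best rank-i approximation of A, Eq. (2.14): the sum of the i leading
    rank-one terms lambda_k v_k v_k^T from Eq. (2.13)."""
    return sum(lam[k] * np.outer(V[:, k], V[:, k]) for k in range(i))

for i in (1, 3, 6):
    print(i, np.linalg.norm(A - rank_i_approx(i), "fro"))
```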

2.4 Spectral Dimensionality Reduction

Spectral dimensionality reduction is based on the assumption that the generating mapping f is invertible. This means that the relationship between the observed representation Y and the intrinsic representation X takes the form of a bijection. This implies that the intrinsic structure of the data is fully preserved in the observed representation.

Classic Multi Dimensional Scaling (MDS) [17, 40] is a method for representing a metric dissimilarity measure as a geometrical configuration. Given a dissimilarity measure δ_{ij} between i and j, the aim is to find a geometrical configuration of points X = [x_1, . . . , x_N] such that the Euclidean distance d_{ij} = ||x_i − x_j||_2 approximates the dissimilarity δ_{ij}. Classical MDS is formulated as a minimization of the following energy,

argmin_X \sum_{ij} (δ_{ij} − d_{ij})^2 = argmin_X ||∆ − D(X)||_F^2.  (2.15)

The best d dimensional geometrical representation can be found through a rank d approximation D(X) of the dissimilarity matrix ∆ through the spectral decomposition ∆ = VΛV^T = \sum_{i=1}^{N} λ_i v_i v_i^T,

||∆ − D(X)||_F = ||\sum_{i=1}^{N} λ_i v_i v_i^T − \sum_{i=1}^{N} q_i v_i v_i^T||_F = ||\sum_{i=1}^{d} (λ_i − q_i) v_i v_i^T + \sum_{i=d+1}^{N} λ_i v_i v_i^T||_F.  (2.16)

Having found a rank d approximation to ∆ the distance matrix can be converted to a Gram matrix G = XX^T,

g_{ij} = \frac{1}{2} \left( \frac{1}{N} \sum_{k=1}^{N} d_{ik}^2 + \frac{1}{N} \sum_{k=1}^{N} d_{kj}^2 − \frac{1}{N^2} \sum_{k=1}^{N} \sum_{p=1}^{N} d_{kp}^2 − d_{ij}^2 \right).  (2.17)

A geometrical configuration can be found through eigen-decomposition of the Gram matrix G,

G = XX^T = VΛV^T = (VΛ^{1/2})(VΛ^{1/2})^T ⇒ X = VΛ^{1/2},  (2.18)


with the dimensionality of,

rank(X) = rank(XX^T) = rank(G) = rank(D(X)) = d.

In practice, for dimensionality reduction, we want to find a low dimensional representation of a set of data points, i.e. vectorial data. In this case the Gram matrix G can be constructed directly from the data and a rank d approximation can be sought, making the conversion step from distance matrix to Gram matrix unnecessary.
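A minimal numpy sketch of this procedure is given below (illustrative only, not the thesis' implementation); the double-centering step plays the role of the Gram-matrix conversion in Eq. (2.17), and the embedding follows Eq. (2.18).

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS sketch: distance matrix D (N x N, Euclidean distances)
    -> d-dimensional configuration X."""
    N = D.shape[0]
    # Double centering, equivalent to the Gram-matrix conversion of Eq. (2.17).
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (D ** 2) @ J
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1], V[:, ::-1]            # decreasing eigenvalues
    lam_d = np.clip(lam[:d], 0.0, None)       # guard against small negative values
    return V[:, :d] * np.sqrt(lam_d)          # X = V Lambda^{1/2}, Eq. (2.18)

# Toy usage: recover a 2-D configuration from pairwise distances.
rng = np.random.default_rng(1)
X_true = rng.standard_normal((50, 2))
D = np.linalg.norm(X_true[:, None, :] - X_true[None, :, :], axis=-1)
X_hat = classical_mds(D, 2)
```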

Principal Component Analysis (PCA) is a dimensionality reduction technique for embedding vectorial data in a dimensionally reduced representation. Given centered vectorial data Y, the covariance matrix S = Y^T Y has elements on the diagonal representing the variance along each dimension of the data, while the off-diagonal elements measure the linear redundancies between dimensions. The objective of PCA is to find a projection v of the data Y such that the variance along each dimension is maximized,

Objective: argmax_v var(Yv)  (2.19)
subject to: v^T v = 1.  (2.20)

This implies finding a projection of the data into a representation resulting in a diagonal covariance matrix,

var(Yv) = \frac{1}{N − 1} (Yv)^T Yv = v^T S v  (2.21)

S = VΛV^T  (2.22)

Λ = V^T S V  (2.23)

X = YV.  (2.24)

As can be seen from above, the solutions to both MDS and PCA are found through a similarity transform where one representation results in a diagonal matrix: in the case of MDS through the diagonalisation of the N × N Gram matrix G, and for PCA through the D × D covariance matrix.

Using the spectral decomposition of the Gram matrix,

G = YY^T = VΛV^T,  (2.25)

the similarity implies that,

(YY^T) v_i = λ_i v_i.  (2.26)

By pre-multiplying we can write,

\frac{1}{N − 1} Y^T (YY^T) v_i = \frac{λ_i}{N − 1} Y^T v_i  (2.27)

S Y^T v_i = \frac{λ_i}{N − 1} Y^T v_i,  (2.28)

which also defines a similarity transform but now in terms of the covariance matrix S. However, we also need to enforce orthogonality of the new basis,

(Y^T v_i)^T (Y^T v_i) = v_i^T YY^T v_i = λ_i  (2.29)

\frac{1}{\sqrt{λ_i}} (Y^T v_i)^T (Y^T v_i) \frac{1}{\sqrt{λ_i}} = \frac{1}{λ_i} v_i^T YY^T v_i = 1.  (2.30)

This results in the eigen-basis of the covariance matrix, i.e. v_i^{PCA} = \frac{1}{\sqrt{λ_i}} Y^T v_i, which gives the following embedding,

x_i^{PCA} = \frac{1}{N − 1} YY^T v_i \frac{1}{\sqrt{λ_i}} = \frac{\sqrt{λ_i}}{N − 1} v_i = \frac{1}{N − 1} x_i^{MDS},  (2.31)

meaning MDS and PCA result in the same solution up to scale.

MDS and PCA assume the generating mapping f to be linear and therefore imply that the intrinsic representation of the data can be found by a change-of-basis transform. This restricts these algorithms to only being applicable in scenarios where the generating mapping is linear.
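The MDS/PCA equivalence above is easy to verify numerically. The sketch below (an illustration under the conventions above, not the thesis' code) computes the PCA embedding from the D × D covariance matrix and the MDS embedding from the N × N Gram matrix and checks that the columns agree up to a sign per dimension (and, depending on normalisation, an overall scale).

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 5))
Y = Y - Y.mean(axis=0)                       # centered data, N x D
d = 2

# PCA: eigendecomposition of the covariance matrix (Eqs. 2.21-2.24).
S = Y.T @ Y / (Y.shape[0] - 1)
_, V_S = np.linalg.eigh(S)
V_S = V_S[:, ::-1][:, :d]                    # leading eigenvectors
X_pca = Y @ V_S                              # X = YV

# MDS: eigendecomposition of the Gram matrix (Eqs. 2.25-2.26 and 2.18).
G = Y @ Y.T
lam_G, V_G = np.linalg.eigh(G)
lam_G, V_G = lam_G[::-1][:d], V_G[:, ::-1][:, :d]
X_mds = V_G * np.sqrt(lam_G)                 # X = V Lambda^{1/2}

# Column-wise the two embeddings coincide up to sign.
print(np.allclose(np.abs(X_pca), np.abs(X_mds)))
```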

2.5 Non-Linear

Several algorithms have been suggested for the scenario where, rather than assuming the generating mapping f to be linear, this assumption is relaxed to only require that it is smooth. MDS finds a geometrical configuration respecting a specific dissimilarity measure. Measuring the dissimilarity in terms of the distance along the manifold between each point, it would be possible to use MDS even in scenarios where the generating mapping is non-linear. However, acquiring the distance along the manifold requires the manifold to be unraveled.

The objective of PCA is to find a projection of the data Y where the variance along each dimension is maximized. If the data occupies a subspace in the original representation, PCA will find a rotation of the data such that a reduced dimensional representation can be found by removing dimensions that do not represent any of the variance in the data. Further, for many data-sets most of the variance is represented by the first few principal directions, meaning that an approximate representation of the data can be found by truncating the dimensions that represent a non-significant variance.

One approach to non-linearize PCA was suggested in [51]. The idea is to first map the data Y into a feature space F through a mapping Φ. Rather than, as in standard PCA, finding the spectral decomposition of the covariance matrix S in the original space, the decomposition can now be applied to the covariance in the feature space representation of the data. The hope is that the mapping Φ has unraveled the manifold such that applying PCA would recover the intrinsic representation of the data. However, it remains to find a mapping that does just that.

2.5.1 Kernel-Trick

Given a set of data Y = [y_1, . . . , y_N] where y_i ∈ ℜ^D, the data is represented in the basis [e_1, . . . , e_D]. A different basis would be to represent each data-point y_i by its own basis [y_1, . . . , y_N], i.e. represented in an N dimensional space with representation Ỹ = [e_1, . . . , e_N]. The covariance matrix S in the space spanned by the data is equal to the Gram matrix of the original representation. This is the fundamental background to the Kernel Trick, which is a way of non-linearizing algorithms that depend only on the inner product between data-points. Even though it is an accepted term, it is not clear where the term was initially suggested. The Kernel Trick is based on the idea that, rather than finding a specific mapping Φ that takes the data to the feature space F, we specify a function k(y_i, y_j), called the kernel function, that parameterizes the inner product between Φ(y_i) and Φ(y_j),

k(y_i, y_j) = Φ(y_i)^T Φ(y_j).  (2.32)

Evaluated between each pair of points in the data-set, the kernel function k specifies the kernel matrix K(Y, Y), which specifies the Gram matrix in the feature space F. From Eq. 2.17 we know that the Gram matrix and a distance matrix are interchangeable representations for centered data. Therefore, as long as the kernel function k specifies a valid Gram matrix K, there is an underlying geometrical representation of the data in F. The class of kernel functions that specifies geometrically representable feature spaces are known as Mercer Kernels [41, 50]. Mercer Kernels are positive semidefinite, i.e. in the spectral decomposition of the resulting kernel matrix K all eigenvalues are non-negative. Intuitively this can be understood through Eq. 2.17: if the eigenvalues were to be negative, then by adding basis vectors the distance between two points would be reduced, which is not possible in a Euclidean space. When a kernel function is used to represent the data, the feature space F is known as a kernel induced feature space.

One advantage of using a kernel induced feature space is that if we aim to apply an algorithm to the data which is expressed only in terms of the inner product between data points, we never need to find the geometrical representation of the data in F. This means that kernels resulting in potentially infinite dimensional spaces can be used. One such kernel function is the RBF kernel,

k(y_i, y_j) = θ_1 e^{−\frac{θ_2}{2} ||y_i − y_j||_2^2},  (2.33)

with parameters {θ_1, θ_2}. If the inner product is specified by an RBF kernel, any combination of points y_i and y_j will have a non-zero inner product. For this to be possible the feature space F will need to be infinite dimensional.

PCA works by diagonalizing the covariance matrix of the data through the spectral decomposition. Kernel PCA [51] is formulated by first finding the Gram matrix in the kernel induced feature space. By representing each point using the basis of the data itself, the Gram matrix is equivalent to the covariance matrix. A reduced representation can now be found through the spectral decomposition of the kernel matrix K.

For many popular kernels, such as RBF kernels, the kernel represents feature spaces of higher dimensionality compared to the original data-space, meaning that the mapping increases the dimensionality of the data. However, even though the dimensionality of the data might have been increased, the relative ratio of the eigenvalues of the spectral decomposition of the covariance matrix might result in fewer eigenvectors, compared to the decomposition in the original space, that represent a significant variance, meaning that a lower dimensional approximation of the data can be found. So strictly speaking, for many types of kernel functions, Kernel PCA is not a dimensionality reduction technique but rather an algorithm for feature selection, which we will briefly comment on at the end of this chapter.
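A minimal sketch of Kernel PCA with the RBF kernel of Eq. (2.33) is given below (a numpy illustration, not the implementation used in the thesis); the kernel matrix is centered in feature space before the spectral decomposition.

```python
import numpy as np

def rbf_kernel(Y, theta1=1.0, theta2=1.0):
    """RBF kernel of Eq. (2.33): k(y_i, y_j) = theta1 * exp(-theta2/2 * ||y_i - y_j||^2)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-0.5 * theta2 * sq)

def kernel_pca(Y, d, theta1=1.0, theta2=1.0):
    """Kernel PCA sketch: spectral decomposition of the centered kernel matrix."""
    K = rbf_kernel(Y, theta1, theta2)
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                 # center the data in feature space
    lam, V = np.linalg.eigh(Kc)
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]      # leading eigenpairs
    return V * np.sqrt(np.clip(lam, 0.0, None))    # embedding, as in Eq. (2.18)

rng = np.random.default_rng(3)
Y = rng.standard_normal((80, 4))
X = kernel_pca(Y, 2)
```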

In the next section we will go through a set of algorithms that use local similarity measures in the data to construct kernels to which a spectral decomposition can be applied to find a geometrical representation of the data.

2.5.2 Proximity Graph Methods

Several dimensionality reduction algorithms have been suggested that are based on local similarity measures in the data. These algorithms are based on a proximity graph [66, 29, 10] extracted from the data. A proximity graph is a graph that represents a specific neighborhood relationship in the data. In the graph each node corresponds to a data point, and edges connect nodes that are related through the specified relationship, potentially associated with an edge weight. The fundamental idea behind proximity graph based algorithms for dimensionality reduction is that locally the data can be assumed to lie on a linear manifold. This means that locally the distance in the original representation of the data will be a good approximation to the manifold distance. Therefore the neighborhood relationship used for proximity graphs in dimensionality reduction is the inter-distance between points in the original representation. Usually the graphs are constructed either from an N nearest neighbor algorithm, where the N closest points are connected, or from an ǫ nearest neighbor, where all points within a ball of radius ǫ are connected. Setting either parameter is of significant importance, as only points whose inter-distance can be assumed to approximate the manifold distance should be connected.

Isomap

Isomap [60] was presented as a non-linear modification of MDS. The first step of Isomap is to construct a proximity graph of the data with edge weights corresponding to the Euclidean distance between each pair of points. MDS finds a geometrical configuration from a global dissimilarity measure. In Isomap it is suggested that the manifold distance be approximated by the shortest path through this proximity graph. By computing the shortest path through the proximity graph, a dissimilarity measure can be found between each pair of data points, onto which MDS can be applied. The shortest path through the proximity graph is not certain to result in a dissimilarity matrix whose Gram matrix corresponds to a valid geometrical configuration, i.e. a Mercer kernel. A modification of the Isomap framework that guarantees a valid Mercer kernel has been suggested in [16]. However, in general, as we are only interested in the largest eigenvalues, this does not cause significant problems.
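The Isomap pipeline can be sketched compactly as below (a numpy/scipy illustration, not the thesis' code); it assumes the k-nearest-neighbour graph is connected so that all shortest-path distances are finite, and the neighbourhood size k is a hypothetical choice.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(Y, d, k=7):
    """Isomap sketch: k-NN proximity graph, graph shortest paths as an
    approximation of manifold distance, then classical MDS."""
    N = Y.shape[0]
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # Euclidean distances
    W = np.full((N, N), np.inf)                                  # inf marks a non-edge
    for i in range(N):
        nn = np.argsort(E[i])[1:k + 1]          # k nearest neighbours (skip self)
        W[i, nn] = E[i, nn]
    W = np.minimum(W, W.T)                      # symmetrise the graph
    D = shortest_path(W, method="D", directed=False)   # geodesic approximation
    # Classical MDS on the geodesic distance matrix (Eqs. 2.17-2.18).
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (D ** 2) @ J
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]
    return V * np.sqrt(np.clip(lam, 0.0, None))
```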

Maximum Variance Unfolding

A different alteration of MDS is Maximum Variance Unfolding (MVU) [78, 80, 79, 81]. Based on the observation that any "fold" of a manifold would decrease the Euclidean distance between two points while the distance along the manifold would remain the same, MVU is formulated as a constrained maximization. In the first step of the algorithm the proximity graph based on the Euclidean distance in the observed data is computed in the same manner as for Isomap. The inter-distance between nodes not connected by an edge in the graph is maximized under the constraint that the distance between nearest neighbors stays the same. In effect the MVU objective will try to unravel the data by stretching the manifold as much as possible without causing it to tear. Rather than formulating the objective in terms of a distance matrix as in Isomap, the MVU objective is expressed in terms of the Gram matrix, which we know is an interchangeable representation (Eq. 2.17). MVU tries to find a feature space represented by a kernel matrix K. This leads to the following objective,

K̂ = argmax_K tr(K)  (2.34)
subject to: K ⪰ 0
\sum_{ij} K_{ij} = 0
K_{ii} + K_{jj} − K_{ij} − K_{ji} = G_{ii} + G_{jj} − G_{ij} − G_{ji},  i ∈ N(j),

where G is the Gram matrix for the original representation and N(i) is the index set of points that are connected to i in the proximity graph. The first constraint forces K̂ to represent a geometrically interpretable feature space, while the second constraint forces the data to be centered. The final constraint ensures that the distance between points that are connected in the proximity graph is conserved. The optimization is an instance of Semi-Definite Programming [11] and can be solved using efficient algorithms. Once a valid kernel matrix K̂ has been found, the resulting embedding can be found by applying MDS to K̂.
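The program in Eq. (2.34) can be written down almost directly with a semi-definite programming modelling tool; the sketch below assumes the cvxpy library with an SDP-capable solver is available, and is illustrative rather than the implementation referred to in the thesis.

```python
import numpy as np
import cvxpy as cp

def mvu(Y, k=5):
    """Maximum Variance Unfolding sketch (Eq. 2.34) as a semi-definite program.
    Practical only for small N; k is an illustrative neighbourhood size."""
    N = Y.shape[0]
    G = Y @ Y.T                                          # Gram matrix of the input
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    K = cp.Variable((N, N), PSD=True)                    # K is positive semi-definite
    constraints = [cp.sum(K) == 0]                       # centering constraint
    for i in range(N):
        for j in np.argsort(E[i])[1:k + 1]:              # nearest neighbours of i
            constraints.append(
                K[i, i] + K[j, j] - K[i, j] - K[j, i]
                == G[i, i] + G[j, j] - G[i, j] - G[j, i]  # preserve local distances
            )
    prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
    prob.solve()
    # The embedding then follows by applying MDS (eigendecomposition) to K.value.
    return K.value
```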


Local Linear Embeddings

Local Linear Embeddings (LLE) [48] is a third method based on the preservation of a proximity graph structure. LLE is based on the assumption that the manifold can locally be well approximated using small linear patches. By rotating and translating each of these patches the full manifold structure can be modeled. LLE is a two step algorithm. In the first step each point in the data set is described as an expansion in the points connected to it in the proximity graph,

Ŵ = argmin_W \sum_{i=1}^{N} ||y_i − \sum_{j ∈ N(i)} W_{ij} y_j||_2^2  (2.35)
subject to: \sum_j W_{ij} = 1,

where N(i) is the index set of points that are connected to i in the proximity graph. The optimal weights Ŵ can be solved for in closed form [48]. Assuming that the manifold is locally linear, the reconstruction weights should summarize the local structure of the data and should therefore be equally valid in reconstructing the manifold representation of the data X. To find this manifold representation a second minimization is formulated,

X̂ = argmin_X \sum_i ||x_i − \sum_{j ∈ N(i)} W_{ij} x_j||_2^2.  (2.36)

However, Eq. 2.36 has a trivial solution placing each point at the origin, x_i = 0 ∀i; this is removed by enforcing unit variance along each direction. Further, to remove the translational degree of freedom the solution is enforced to be centered, \sum_i x_i = 0. The optimal embedding X̂ can be found through an eigenvalue problem.
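The two LLE steps translate into a short numpy sketch (illustrative only; a small regularizer is added to the local covariance for numerical stability, which is not part of Eq. (2.35)):

```python
import numpy as np

def lle(Y, d, k=10, reg=1e-3):
    """LLE sketch following Eqs. (2.35)-(2.36): solve for reconstruction
    weights, then for the low-dimensional embedding."""
    N = Y.shape[0]
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(E[i])[1:k + 1]
        Z = Y[nn] - Y[i]                        # neighbours, centred on y_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)      # regularise the local covariance
        w = np.linalg.solve(C, np.ones(k))
        W[i, nn] = w / w.sum()                  # enforce sum_j W_ij = 1 (Eq. 2.35)
    # Eq. (2.36): minimise ||X - WX||^2 subject to centred, unit-variance X.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    lam, V = np.linalg.eigh(M)
    return V[:, 1:d + 1]                        # discard the constant eigenvector
```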

Laplacian Eigenmaps

The proximity graph is also the starting point for Laplacian eigenmaps [5]. Each node in the graph is connected to its neighbors by an edge with a weight representing the locality of the points. Several different measures of locality can be used. In the original paper either a heat kernel,

w_{ij} = e^{−||y_i − y_j||_2^2 / t},

or a constant weight w_{ij} = 1 was applied. Once the graph has been constructed, the objective is to find an embedding X of the data such that points that are connected in the graph stay as close together as possible. For the first dimension,

x̂ = argmin_x \sum_{i,j} (x_i − x_j)^2 W_{ij} = argmin_x x^T L x,  (2.37)

where L is referred to as the Laplacian, defined as L = D − W, and D is a diagonal matrix such that D_{ii} = \sum_j W_{ji}. The objective Eq. 2.37 has a trivial zero dimensional solution, representing the embedding using a single point. To remove this solution, the solution is forced to be orthogonal to the constant vector 1, x^T D 1 = 0. Further, to prevent the embedding from shrinking, a constraint on the scale, x^T D x = 1, is appended to the objective. The diagonal matrix D provides a scaling of each point with respect to its locality to other points in the data.

Figure 2.2: Swissroll with added Gaussian noise. Given the left image it is easy to see the global structure. The spectral algorithms are based on local structure in the data, as in the right image, from which it is a lot harder to infer the global structure.

For a multi-dimensional embedding of the data this leads to the following optimization problem,

X̂ = argmin_X tr(X^T L X)  (2.38)
subject to: X^T D X = I,

which can be solved through a generalized eigenvalue problem.
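A minimal numpy/scipy sketch of this construction is given below (illustrative, not the thesis' code): heat-kernel weights on a k-nearest-neighbour graph, followed by the generalized eigenvalue problem of Eq. (2.38).

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(Y, d, k=10, t=1.0):
    """Laplacian eigenmaps sketch: heat-kernel weights on a k-NN graph,
    then the generalized eigenvalue problem L x = lambda D x."""
    N = Y.shape[0]
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(sq[i])[1:k + 1]
        W[i, nn] = np.exp(-sq[i, nn] / t)       # heat-kernel edge weights
    W = np.maximum(W, W.T)                      # symmetrise the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    lam, V = eigh(L, D)                         # generalized eigenproblem
    return V[:, 1:d + 1]                        # skip the trivial constant solution
```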

2.5.3 Summary

Spectral algorithms are attractive as they are associated with a convex objective function leading to a unique solution. However, the proximity graph is based on local distance measures that are more likely to be affected by noise, see Figure 2.2. The spectral algorithms all have the fundamental assumption that the generating mapping f has a smooth inverse, which thereby preserves locality of the observed representation in the solution of the intrinsic one. However, one needs to be wary of what this assumption actually implies. As an example, take a piece of string which has been rolled up into a ball. The string is a one dimensional object embedded in a three dimensional space. Locality is preserved through the generating mapping f, i.e. points that are close on the string will remain close in the ball. However, the reverse does not necessarily need to be true, as neighboring points on the ball might come from different "loops" when the ball was rolled together.

Further, even though it is assumed to exist, none of the spectral algorithms models the smooth inverse of the generating mapping; rather they learn embeddings of the data points. This is fine as long as the focus is on the visualization of the data rather than a model.

Figure 2.3: The generative latent variable model. The observed data Y is modeled as generated from a low-dimensional latent variable X through the generative mapping f specified by parameters W.

2.6 Generative Dimensionality Reduction

Generative approaches to dimensionality reduction aim to model the observed data as a mapping from its intrinsic representation. The underlying representation is often referred to as the latent representation of the data and the models as latent variable models for dimensionality reduction. The observed data Y have been generated from the latent variables X through a mapping f parameterized by W, Figure 2.3. Assuming the observations are i.i.d. and have been corrupted by spherical Gaussian noise leads to the likelihood of the data,

p(Y|X, W, β^{−1}) = \prod_{i=1}^{N} N(y_i | f(x_i, W), β^{−1} I),  (2.39)

where β^{−1} is the noise variance. The Gaussian noise model means we will refer to these models as Gaussian latent variable models.
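For concreteness, the log of the likelihood in Eq. (2.39) can be evaluated as below (a numpy sketch with an arbitrary, hypothetical mapping f; not code from the thesis):

```python
import numpy as np

def log_likelihood(Y, X, f, beta):
    """Log of Eq. (2.39): i.i.d. observations with spherical Gaussian noise of
    precision beta around the generative mapping f (f is a placeholder here)."""
    N, D = Y.shape
    R = Y - f(X)                                   # residuals y_i - f(x_i)
    return (0.5 * N * D * np.log(beta / (2.0 * np.pi))
            - 0.5 * beta * np.sum(R ** 2))

# Toy usage with a linear mapping.
rng = np.random.default_rng(4)
X, Wmap = rng.standard_normal((20, 2)), rng.standard_normal((2, 5))
Y = X @ Wmap + 0.1 * rng.standard_normal((20, 5))
print(log_likelihood(Y, X, lambda X: X @ Wmap, beta=100.0))
```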

In the Bayesian formalism both the parameters of the mapping W and the latent locations X are nuisance variables. Seeking the manifold representation of the observed data, we want to formulate the posterior distribution over the parameters given the data, p(X, W|Y). This means inverting the generative mapping through Bayes' Theorem, which implies marginalization over both the latent locations X and the mapping W. Reaching the posterior means we need to formulate prior distributions over X and W. However, this is severely ill-posed as there is an infinite number of combinations of latent locations and mappings that could have generated the data. To proceed, assumptions about the relationship need to be made.

Neural Networks (NN) [27] are models traditionally used for supervised learning. MacKay [39] suggested a generative dimensionality reduction approach using NN labeled Density Networks. Traditional supervised learning using NN implies learning a conditional model over the output variables (class based or continuous) given the input variables through a parametric mapping. This relates each location in the input space with a density over the output space. However, in the case of dimensionality reduction only the output space is given. In [39] a model of the generative mapping parameterized as a Multi-Layer Perceptron (MLP) NN is proposed. By specifying a prior over the locations X in the input space and the parameters W of the generative mapping, the joint probability of the full model can be formulated. Formulating an error function for the model means that the gradients with respect to the unknown latent locations and the parameters of the generative mapping can be computed. However, these gradients involve integrals over X and W which need to be evaluated using sampling based methods such as Monte Carlo sampling. By optimizing the parameters W, a density over the input space can be found.

Tipping and Bishop [65] formulated probabilistic PCA (PPCA) by making the assumption that the observed data is related to the latent locations by a linear mapping y_i = W x_i + ǫ, where ǫ ∼ N(0, β^{−1} I). Placing a spherical Gaussian prior over the latent locations leads to the marginal likelihood,

p(y|W, β^{−1}) = \int p(y|x, W, β^{−1}) p(x) dx,  (2.40)

p(x) = N(0, I).  (2.41)

The parameters of the mapping W can be found by maximum likelihood.
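For the linear mapping the integral in Eq. (2.40) has a well-known closed form, p(y|W, β^{−1}) = N(y | 0, WW^T + β^{−1}I); the snippet below evaluates it (a standard PPCA identity used here for illustration, not a result quoted from the thesis):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Under y = Wx + eps with x ~ N(0, I) and eps ~ N(0, beta^{-1} I),
# the marginal of Eq. (2.40) is Gaussian with covariance W W^T + beta^{-1} I.
rng = np.random.default_rng(5)
D, q, beta = 5, 2, 10.0
W = rng.standard_normal((D, q))
C = W @ W.T + np.eye(D) / beta                 # marginal covariance
marginal = multivariate_normal(mean=np.zeros(D), cov=C)
y = marginal.rvs(random_state=0)               # a sample from the marginal
print(marginal.logpdf(y))                      # its log marginal likelihood
```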

Assuming a linear mapping severely restricts the classes of data-sets that can be modeled. But the prior over the latent locations has to be propagated through the generating mapping to form the marginal likelihood. For the linear mapping, Eq. 2.40 is solvable. However, when considering mappings of more general form it is not clear how to propagate the latent prior through the mapping to make the integral in Eq. 2.40 analytically tractable.

Figure 2.4: Schematic representation of the GTM model: a grid of latent points X is mapped through a nonlinear mapping f parametrised by W to a corresponding grid of Gaussian centers embedded in the observed space. Adapted from [36].

Bishop [9] suggested a specific prior over the latent space making marginalization over more general mappings feasible; the model is referred to as The Generative Topographic Map (GTM). By discretizing the latent space into a regular grid, the prior is specified in terms of a sum of delta functions,

p(x) = \frac{1}{K} \sum_{i=1}^{K} δ(x − x_i),  (2.42)

where [x_1, . . . , x_K] are regularly spaced landmark points over the latent space, Figure 2.4. This prior makes the integral in Eq. 2.40 tractable for general parametric mappings. The GTM specifies a density over the observed data space parameterized by a Gaussian mixture with centers at the locations in the observed space corresponding to the grid points in the latent space.
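As an illustration of the resulting density (a sketch with a hypothetical grid, mapping and parameter names; not the GTM training procedure), the mixture obtained by pushing the prior of Eq. (2.42) through a mapping f can be evaluated as:

```python
import numpy as np

def gtm_density(y, Xgrid, f, beta):
    """Density the GTM assigns to an observed point y: a Gaussian mixture with
    one spherical component per latent grid point, 1/K sum_k N(y | f(x_k), beta^{-1} I)."""
    centers = f(Xgrid)                              # grid points mapped to data space
    D = y.shape[0]
    sq = np.sum((centers - y) ** 2, axis=1)
    comp = (beta / (2.0 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq)
    return comp.mean()

# Toy usage: a 2-D latent grid mapped through a fixed nonlinear mapping.
g = np.linspace(-1.0, 1.0, 10)
Xgrid = np.array([(a, b) for a in g for b in g])    # K = 100 regularly spaced points
f = lambda X: np.column_stack([X[:, 0], X[:, 1], np.sin(np.pi * X[:, 0])])
print(gtm_density(np.array([0.2, -0.3, 0.5]), Xgrid, f, beta=5.0))
```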

The Gaussian latent variable model formulates dimensionality reduction as a probabilistic model, which provides a model and an associated likelihood function. Further, modeling the generative mapping removes the reliance on local noise-sensitive measures in the data. This makes the generative models applicable to a larger range of modeling scenarios compared to the spectral algorithms. However, PPCA is a strictly linear model, and the energy function associated with the GTM is non-convex, which means that we cannot be guaranteed to find the global optimum. Further, the GTM suffers from problems associated with mixture models in high dimensional spaces [59].

2.7 Gaussian Processes

A D dimensional Gaussian distribution is defined by a D × 1 mean and a D × D covariance matrix. A Gaussian process (GP) is the infinite dimensional generalization of the distribution, where the mean and covariance are defined not by fixed size matrices but by a mean function µ(x) and a covariance function k(x, x′), defined over an infinite index set x,

GP(µ(x), k(x, x′)).  (2.43)

Evaluating a GP over a finite index set reduces the process to a distribution with dimensionality equal to the cardinality of the evaluation set. The covariance function needs to specify a valid covariance matrix when evaluated for any finite subset of its domain; this requires the covariance function to come from the same family of functions as Mercer kernels [41, 45].

A GP generalizes the concept of a Gaussian distribution to infinite dimensions; this has been exploited in machine learning by applying GPs to specify distributions over infinite objects. One such application is when we are interested in modeling relationships defined over continuous domains, such as functions. Say we are interested in modeling a functional relationship f between an input domain X ∈ ℜ^D and a target domain Y ∈ ℜ,

y_i = f(x_i).  (2.44)

A GP can be used to specify a prior distribution over the relationship, f ∼ GP(µ, k). In Figure 2.5 samples from a GP prior with a covariance function specified by an RBF kernel and a constant zero mean function are shown. As can be seen, the samples are all smooth with respect to the input locations x. However, there is an additional property that the GP needs to fulfill to specify a valid distribution over f: consistency. A function f is consistent in the sense that the relationship between the target and input domain is fixed. For a distribution, this implies that evaluating the distribution over a finite subset X_i ⊂ X does not alter the distribution over any other subset X_j ⊂ X, even if X_i ∩ X_j = ∅. It is clear that a GP satisfies this condition if the covariance function specifies a valid covariance matrix when evaluated over a finite number of points, as for a Gaussian distribution,

(y_1, y_2) ∼ N(µ, Σ) ⇒ y_1 ∼ N(µ_1, Σ_{11}), y_2 ∼ N(µ_2, Σ_{22}).  (2.45)

In regression we are interested in modeling the relationship between two domains X ∈ ℜ^{D} and Y ∈ ℜ from a set of observations x_i ∈ X and y_i ∈ Y, where i = 1 . . . N. Assuming a functional relationship, and that the observations have been corrupted by additive Gaussian noise, we are interested in modeling,

y_i = f(x_i) + ǫ,    (2.46)

where ǫ ∼ N(0, β^{−1}). We are interested in encoding our prior knowledge about the relationship in a distribution over f. For regression we usually have a preference for functions varying smoothly over X,

lim_{x_i → x_j^+} |f(x_i) − f(x_j)| = lim_{x_i → x_j^−} |f(x_i) − f(x_j)| = 0,   ∀x_j ∈ X.

This assumption can be encoded by the GP through the choice of covariance function k(x, x′). The covariance function encodes how we expect variables to vary together,

k(x, x′) = E((f(x) − µ(x))(f(x′) − µ(x′))),

which means that we can encode the smoothness behavior over X by choosing a covariance function which is smooth over the same domain. The mean function µ(x) = E(f(x)) encodes the expected value of f. By translating the observed data to be centered around zero the mean function can, for simplicity, be chosen as the constant zero function.

2.7.1 Prediction

Having specified a prior distribution encoding our knowledge (and preference) about the relationship between X and Y, we are interested in inferring the location y∗ corresponding to a previously unobserved point x∗ ∈ X.



Figure 2.5: Samples from a GP prior using an RBF covariance function and a constant zero mean function. As can be seen, each sample is smooth over the input domain.

The joint distribution of the observed data (y, x) and the unobserved point (y∗, x∗) can be written as follows,

[y; y∗] ∼ N(0, [K(X, X) + β^{−1}I,  K(X, x∗); K(x∗, X),  K(x∗, x∗) + β^{−1}]).

Predictions over the unobserved locations are made from the posterior distribution. The posterior is formulated by conditioning the joint distribution on the observed data. Conditioning two Gaussians results in a Gaussian distribution, defined by its mean and covariance,

ȳ(x∗) = k(x∗, X)(K + β^{−1}I)^{−1}Y
σ^2(x∗) = (k(x∗, x∗) + β^{−1}) − k(x∗, X)(K + β^{−1}I)^{−1}k(X, x∗),    (2.47)

where K = k(X, X). These are the central predictive equations in the GP framework. In Figure 2.6 samples from the posterior distribution of a GP with an RBF covariance function and a constant zero mean function are shown. As can be seen, each function drawn from the distribution passes through the training data points.



Figure 2.6: Samples from a GP posterior using an RBF covariance function and a constant zero mean function. Each sample from the posterior distribution passes through the previously observed data shown as red dots.
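To make the predictive equations concrete, the following is a minimal numpy sketch of Eq. 2.47, assuming an RBF covariance with unit parameters, a noise precision β = 100 and a small made-up 1-D data set; it is illustrative only, not the implementation used in this thesis.

```python
import numpy as np

def rbf(A, B, theta1=1.0, theta2=1.0):
    """RBF covariance k(x, x') = theta1 * exp(-theta2/2 * ||x - x'||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_predict(X, y, Xstar, beta=100.0):
    """Posterior mean and variance at Xstar, Eq. 2.47."""
    K = rbf(X, X) + np.eye(len(X)) / beta            # K(X,X) + beta^{-1} I
    Kinv = np.linalg.inv(K)
    ks = rbf(Xstar, X)                               # k(x*, X)
    mean = ks @ Kinv @ y
    var = (rbf(Xstar, Xstar).diagonal() + 1.0 / beta
           - np.sum(ks @ Kinv * ks, axis=1))         # predictive variance
    return mean, var

# toy 1-D regression data (illustrative only)
X = np.linspace(-3, 3, 10)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.randn(10)
Xs = np.linspace(-3, 3, 50)[:, None]
mu, var = gp_predict(X, y, Xs)
```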

2.7.2 Training

The covariance function specifies the class of functions most prominent in the prior. A commonly used covariance function is the RBF kernel,

k(x_i, x_j) = θ_1 exp(−(θ_2/2) ||x_i − x_j||^2_2).

The free parameters {θ_1, θ_2, . . .} of the covariance function, together with the noise variance β^{−1}, are the hyper-parameters^1 of the GP, Φ = {θ_1, . . . , β}. Our knowledge about the relationship is encoded in the prior over f by setting the values of Φ. However, in the presence of data we can directly learn the hyper-parameters from the observations.

^1 Assuming that the mean function has no free parameters.


Assuming that the observations have been corrupted by additive Gaussian noise, Eq. 2.46, we can formulate the likelihood of the data. Combining the likelihood with the prior, we arrive at the marginal likelihood through integration over f,

p(Y|X, Φ) = ∫ p(Y|f) p(f|X, Φ) df.    (2.48)

From the marginal likelihood we can seek the maximum likelihood solution for the hyper-parameters Φ,

Φ̂ = argmax_Φ p(Y|X, Φ).    (2.49)

This is referred to as training in the GP framework. It might seem undesirable to optimize over the hyper-parameters, as the model might over-fit the data^2. Inspection of the logarithm of equation (2.48) for a one-dimensional output y,

log p(y|X) = −(1/2) y^T K^{−1} y − (1/2) log|K| − (N/2) log 2π,    (2.50)

shows two “competing terms”: the data-fit term, −(1/2) y^T K^{−1} y, and the complexity term, −(1/2) log|K|. The complexity term measures and penalizes the complexity of the model, while the data-fit term measures how well the model fits the data. This “competition” encourages the GP model not to over-fit the data.

^2 By setting the noise variance β^{−1} to zero the function f will pass exactly through the observed data Y.
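A sketch of how the log marginal likelihood of Eq. 2.50 can be evaluated and used for training; the RBF kernel, the toy data and the crude grid search over Φ = {θ_1, θ_2, β} are assumptions for illustration, whereas in practice a gradient based optimizer would be used.

```python
import numpy as np

def rbf(A, B, theta1, theta2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def log_marginal(X, y, theta1, theta2, beta):
    """log p(y|X) = -1/2 y^T K^{-1} y - 1/2 log|K| - N/2 log 2pi, Eq. 2.50."""
    N = len(X)
    K = rbf(X, X, theta1, theta2) + np.eye(N) / beta
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))          # equals -1/2 log|K|
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)

X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)

# naive grid search over Phi = {theta1, theta2, beta}
grid = [(t1, t2, b) for t1 in (0.5, 1, 2) for t2 in (0.5, 1, 2) for b in (10, 100)]
best = max(grid, key=lambda p: log_marginal(X, y, *p))
```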


2.7.3 Relevance Vector Machine

In this thesis our main use of Gaussian processes will be as a tool to model functions. A different regression model is the Relevance Vector Machine (RVM) [63, 64]. In the RVM the mapping y_i = f(x_i) is modeled as a linear combination of the responses to a kernel function of the training data,

f(x_i) = Σ_{j=1}^{N} w_j c(x_i, x_j) + w_0,    (2.51)

where w = [w_0, . . . , w_N] are the model weights and c(·, x_j) the kernel basis functions. One approach to finding the weights of the model would be to minimize a reconstruction error over the training data. However, this is likely to lead to severe over-fitting, as we are trying to estimate N + 1 parameters from N inputs. Further, predictions would only be point estimates with no associated uncertainty.

The RVM was suggested as a model to tackle the above issues. The model specifies a likelihood model of the data through which the parameters can be found, associating each prediction with an uncertainty. Further, to avoid over-fitting the data, a prior is specified over the weights w. This prior encourages the model to push as many weights w_i towards 0 as possible, making the linear combination in Eq. 2.51 depend on as few basis functions c(·, x_j) as possible.

Assuming additive Gaussian noise, the likelihood of the model is formulated as,

p(y|w, σ^2) = (2πσ^2)^{−N/2} exp(−(1/2σ^2) ||y − C̃w||^2),    (2.52)


where C̃ is referred to as the design matrix, with elements,

C̃_{ij} = c(x_i, x_{j−1}),   C̃_{i1} = 1.

A Gaussian prior is placed over the weights w,

p(w|α) = Π_{i=0}^{N} N(w_i | 0, α_i^{−1}),    (2.53)

controlled by the hyper-parameters α = [α_0, . . . , α_N]. The prior Eq. 2.53 over the model parameters w encourages the weights w to be zero.

Through Bayes’ rule the posterior over the weights p(w|y, α, σ^2) can be formulated, from which, by integration over the weights w, the marginal likelihood of the model can be found,

p(y|α, σ^2) = (2π)^{−N/2} |B + C̃A^{−1}C̃^T|^{−1/2} exp(−(1/2) y^T(B + C̃A^{−1}C̃^T)^{−1}y),    (2.54)

where A = diag(α_0, . . . , α_N) and B = σ^2 I. The optimal parameters α and σ^2 can be found by optimizing the marginal likelihood. Due to the prior Eq. 2.53, it is reported in [63] that, when optimizing the marginal likelihood, many of the hyper-parameters α_i tend to approach infinity, meaning that the associated weight is close to zero. This means that the corresponding kernel function has little influence on the prediction Eq. 2.51. A pruning scheme is incorporated in the optimization that removes weights tending to zero from the expansion, forcing the model to explain the data using few kernel functions, leading to a sparse model.

As noted in [64, 45, 14], the RVM is a special case of a GP with covariance function,

k(x_i, x_j) = Σ_{l=1}^{N} (1/α_l) c(x_i, x_l) c(x_j, x_l),    (2.55)

where c is the kernel basis function as in Eq. 2.51. The covariance function is different in form as it depends on the training data x_l. Further, it corresponds to a degenerate covariance matrix having at most rank N, as it is an expansion around the training data. Training the RVM is the same as optimizing a GP regression model, i.e. finding the hyper-parameters that maximize the marginal likelihood of the model. However, as noted in [45], the covariance function of the RVM has some undesirable effects. Using a standard RBF kernel for the GP, the predictive variance associated with a point far away from the training data will be large, i.e. the model will be uncertain in regions where it has not previously seen data. The opposite is true using the covariance function specified by the RVM, as both terms in the predictive variance Eq. 2.47 will be close to zero, while for a standard RBF kernel the first term will be large.
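The following sketch illustrates this effect by building the degenerate covariance of Eq. 2.55 for assumed values of α and an assumed RBF form for the basis function c, and comparing the predictive variance far from the training data with that of a standard RBF covariance; it is illustrative and is not the RVM training procedure.

```python
import numpy as np

def c(A, B, gamma=1.0):
    """Kernel basis function c(x, x') used in the RVM expansion (assumed RBF form)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * gamma * d2)

def rvm_cov(A, B, X, alpha):
    """Degenerate covariance k(x, x') = sum_l (1/alpha_l) c(x, x_l) c(x', x_l), Eq. 2.55."""
    return (c(A, X) / alpha) @ c(B, X).T

X = np.random.randn(15, 1)                     # training inputs
alpha = np.ones(15)                            # assumed hyper-parameters
beta = 100.0

far = np.array([[25.0]])                       # a point far from the data
K = rvm_cov(X, X, X, alpha) + np.eye(15) / beta
ks = rvm_cov(far, X, X, alpha)
var_rvm = rvm_cov(far, far, X, alpha) + 1/beta - ks @ np.linalg.inv(K) @ ks.T
var_rbf = c(far, far) + 1/beta - c(far, X) @ np.linalg.inv(c(X, X) + np.eye(15)/beta) @ c(far, X).T
# var_rvm is close to the noise level, while var_rbf stays large: the RVM is
# (over-)confident in regions where it has seen no data.
```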

2.8 GP-LVM

Lawrence [33] suggested an alternative Gaussian latent variable model capable of handling non-linear generative mappings while at the same time avoiding the problems associated with the GTM. Both PPCA and the GTM specify a prior over the latent locations and seek a maximum likelihood solution for the parameters of the generative mapping. However, from a Bayesian perspective both the mapping and the latent locations are nuisance parameters and should therefore be marginalized. In Lawrence’s formulation, the prior is specified over the mapping instead of the latent locations, and the marginal likelihood is obtained by marginalizing the mapping. Using the GP framework a rich and flexible prior over non-linear mappings can be specified. The algorithm is referred to as the Gaussian Process Latent Variable Model (GP-LVM).

By marginalizing over the mapping f, the GP-LVM proceeds by seeking the maximum likelihood solution for the latent locations X and the hyper-parameters Φ of the GP,

{X̂, Φ̂} = argmax_{X,Φ} p(Y|X, Φ)
        = argmax_{X,Φ} ∫ p(Y|X, f, Φ) p(f) df,    (2.56)

where p(f) = GP(µ(x), k(x, x′)). The posterior distribution of the data can be written as,

p(X, Φ|Y) ∝ p(Y|X, Φ) p(X) p(Φ).    (2.57)

In the standard GP-LVM formulation uninformative priors [33] are specified over the latent locations and the hyper-parameters. Learning in the GP-LVM framework consists of minimizing the negative log posterior of the data with respect to the locations of the latent variables X and the hyper-parameters θ of the process. With a simple spherical Gaussian prior over the latent locations and an uninformative prior over the parameters, this leads to the following objective,

L = L_r + Σ_i ln θ_i + (1/2) Σ_i ||x_i||^2.    (2.58)

For a covariance function specifying a distribution over linear functions a closed form solution to Eq. 2.56 exists [33]. However, for general covariance functions the solution is found through gradient based optimization.

As previously discussed, infinitely many solutions to the latent variable formulation of dimensionality reduction exist; to proceed, the solution needs to be constrained by prior information. The GP-LVM solution is constrained by the GP marginal likelihood’s trade-off between smooth solutions and a good data-fit, Eq. 2.50. By fixing the dimensionality of the latent representation a solution can be found.
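A sketch of the resulting objective, Eq. 2.58, assuming an RBF covariance and treating β as one of the θ_i; the data and the latent initialization are placeholders, and in practice the gradients of this function with respect to X and the hyper-parameters would be passed to a gradient based optimizer.

```python
import numpy as np

def rbf(X, theta1, theta2):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gplvm_objective(X, Y, theta, beta):
    """Negative log posterior of Eq. 2.58 (up to constants): the GP marginal
    likelihood term L_r plus the simple priors on hyper-parameters and latents."""
    N, D = Y.shape
    K = rbf(X, *theta) + np.eye(N) / beta
    sign, logdet = np.linalg.slogdet(K)
    Lr = 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)) \
         + 0.5 * N * D * np.log(2 * np.pi)
    return Lr + np.sum(np.log([*theta, beta])) + 0.5 * np.sum(X**2)

# toy usage: a 2-D latent space for 10-D observations, X would normally be
# initialized from e.g. a PCA embedding of Y
Y = np.random.randn(30, 10)
X = np.random.randn(30, 2) * 0.1
val = gplvm_objective(X, Y, theta=(1.0, 1.0), beta=100.0)
```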

2.8.1 Latent Constraints

The GP-LVM objective seeks the locations of the latent coordinates X that maximize the marginal likelihood of the data. One advantage of directly optimizing the latent locations is that additional constraints on X can easily be incorporated in the GP-LVM framework. In the following section we will review some of the extensions in terms of latent constraints that have been applied to the GP-LVM.

The fundamental difference between spectral and generative dimensionality reduction is the assumption made by the spectral algorithms that the latent coordinates can be found as a smooth mapping from the observed data. This means that we are interested in finding latent locations such that the locality in the observed data is preserved.


Further, this assumption implies that a smooth inverse to the generative mapping, Eq. 2.4, is assumed to exist. This assumption constrains the spectral algorithms and makes their objective function convex. Even though in this thesis we argue that the assumption of the existence of a smooth inverse to the generative mapping is a limitation, there are modeling scenarios where we are interested in retaining the locality of the observed data in the latent representation, such as, for example, when modeling motion capture data [35]. In [35] a constrained form of the GP-LVM is presented. Each latent location x_i is represented as a smooth mapping, referred to as a back-constraint, from the observed data,

x_i = g(y_i, B),    (2.59)

with parameters B. Rather than directly optimizing the latent locations, the incorporation of the back-constraints alters the GP-LVM objective to seek the parameters of the back-constraints, B, that maximize the likelihood of the observed data Y.

As discussed, the GP-LVM objective is severely under-constrained in the general case. This means that a good initialization of the latent locations is of essential importance in order to find a good solution. However, when learning a back-constrained model the preservation of locality in the observed space will in practice constrain the solution sufficiently, such that the algorithm becomes less reliant on the initialization of the latent locations, which are parameterized by the parameters of the back-constraint B, Eq. 2.59. This means that for practical purposes we can reach the solution of the back-constrained GP-LVM with a less careful initialization compared to the standard GP-LVM model.


In the standard formulation of the GP-LVM an uninformative prior is specified over the latent locations X, Eq. 2.58. Rather than specifying this uninformative prior, in [67] a model incorporating an informative, class based prior distribution over the latent locations is suggested. Incorporating this means that we can learn a latent representation that can be interpreted in terms of class. One advantage of this, explored in [67], is to learn latent representations as input spaces for classifiers. The objective in [67] is, rather than just efficiently representing the data as in the standard GP-LVM, to find latent representations which are well suited for classification, i.e. where each class is easily separable. Practically this is achieved by incorporating the class based objective of Linear Discriminant Analysis (LDA), which aims to minimize within class variability and maximize between class separability. This can be encoded in the GP-LVM by replacing the uninformative spherical Gaussian prior over the latent coordinates with,

p(X) = (1/Z_d) exp((1/λ^2) tr(S_W^{−1} S_B)),    (2.60)

where S_B encodes the between class and S_W the within class variability, and Z_d is a normalizing constant. The between class variability is computed as,

S_B = Σ_{i=1}^{L} (N_i/N) (µ_i − µ)(µ_i − µ)^T,    (2.61)

where µ_i is the mean of class i and µ the mean over all classes. The within class variability is computed as follows,

S_W = (1/N) Σ_{i=1}^{L} Σ_k (x_k^i − µ_i)(x_k^i − µ_i)^T.    (2.62)
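A sketch computing the scatter matrices and the (unnormalized) log of the discriminative prior Eq. 2.60; the class labels, the latent points and λ are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class scatter S_W and between-class scatter S_B, Eqs. 2.61-2.62."""
    N, q = X.shape
    mu = X.mean(0)
    S_W = np.zeros((q, q))
    S_B = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c) / N
        S_B += len(Xc) / N * np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

def log_discriminative_prior(X, labels, lam=1.0):
    """log p(X) up to the normalizer Z_d; large values favour separable classes."""
    S_W, S_B = scatter_matrices(X, labels)
    return np.trace(np.linalg.solve(S_W, S_B)) / lam**2

# two well-separated toy classes in a 2-D latent space
X = np.vstack([np.random.randn(20, 2) + [2, 0], np.random.randn(20, 2) - [2, 0]])
labels = np.array([0] * 20 + [1] * 20)
lp = log_discriminative_prior(X, labels)
```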

Similarly to the discriminative GP-LVM, a model incorporating constraints on the latent coordinates was presented in [70]. By constraining the topology of the latent space, representations with interpretable latent dimensions can be found. The model is applied to human motion data to perform non-trivial transitions between styles of motion not present in the training data.

In [77, 76] a model referred to as the Gaussian Process Dynamical Model (GPDM) is presented. The GPDM is a latent variable model derived from the GP-LVM that incorporates time series information in the data to learn a latent representation that respects the dynamics of the data. This is achieved by incorporating an auto-regressive model on the latent space in addition to the generative mapping,

x_t = h(x_{t−1}) + ǫ_dyn.    (2.63)

By specifying a GP prior over the mapping h, the dynamic mapping can, similarly to the generative mapping f, be marginalized from the GP-LVM to form the GPDM objective,

{X̂, θ̂, θ̂_dyn} = argmax_{X,θ,θ_dyn} p(X, θ|Y) p(X|θ_dyn).    (2.64)

Many types of data are generated from models with a known underlying or latent structure. This is, for example, true for the human body, whose motion can be decomposed into a tree structure. For modeling this type of data the hierarchical GP-LVM (HGP-LVM) model was developed [34]. In the HGP-LVM a hierarchical tree structure of latent spaces is learned, where a latent space acts as a prior on the latent coordinates of a space further down in the hierarchy.


2.8.2 Summary

In the context of our presentation it might seem out-of-place to present the GP-LVM separately from the generative methods. Our motivation for this is two-fold: first, the framework we are about to present is an extension of the GP-LVM; secondly, as stated above, for general covariance functions the GP-LVM solution is found through gradient based methods. To avoid the effect of local minima in the log-likelihood the latent locations X need to be initialized close to the global optimum. The approach suggested in [33] was to initialize with the solution from a different dimensionality reduction algorithm. For the general case, where we cannot assume a linear manifold, the lack of reliable non-linear generative methods means that the initialization is usually taken from the solution of a spectral algorithm. This means that even though the GP-LVM is a purely generative model, for practical applications, in the general case, it relies on the existence of an analogous spectral algorithm. In this context the GP-LVM is a generative algorithm that sets out to improve the solution given by a spectral algorithm; this motivates our separation of the GP-LVM from the other reviewed generative dimensionality reduction algorithms.

The GP-LVM framework has been shown to be very flexible and has been applied to model a large variety of different data: motion capture [24, 77, 76, 75], tracking [71, 69, 72, 28], human pose estimation [62], and modeling of deformable surfaces [49], to name a small subset.


2.9 Shared Dimensionality Reduction

Many modeling scenarios are characterized by several different types of observations that share a common underlying structure. This can be the same text written in two different languages, where both representations are different in form but have the same meaning or underlying concept, or two image-sets which share the same modes of variability, for example pose or lighting. This correspondence can be exploited for dimensionality reduction of the data; this we will refer to as shared dimensionality reduction. In shared dimensionality reduction each observation space is generated from the same latent variable X. We will focus on the scenario where we have two observation spaces Y and Z which have been generated from the latent variable X through generative mappings f_Y and f_Z,

y_i = f_Y(x_i)    (2.65)
z_i = f_Z(x_i).    (2.66)

Shared spectral dimensionality reduction is built on the assumption that smooth inverses to both the generative mappings f_Y and f_Z exist. That both mappings are invertible implies that the observation spaces are related through a bijection, meaning that the location in one observation space is sufficient for determining the corresponding location in the other observation space. By shared spectral dimensionality reduction we therefore mean the alignment of several intrinsic representations into a single shared low-dimensional representation. In [26, 25, 82] algorithms for aligning two manifolds through the use of proximity graphs, for either a full or a partial correspondence, are presented.


A GP-LVM model learning a shared latent representation was suggested in [52]. The model learns two GP regressors, one for each separate observation space, from a shared latent variable by maximizing the joint marginal likelihood,

{X̂, Φ̂_Y, Φ̂_Z} = argmax_{X,Φ_Y,Φ_Z} p(Y, Z|X, Φ_Y, Φ_Z)
             = argmax_{X,Φ_Y,Φ_Z} p(Y|X, Φ_Y) p(Z|X, Φ_Z).    (2.67)

The model was applied to learn a shared latent structure between the joint angle space of a humanoid robot and a human. Inference between the two different observation spaces was done by learning GP regressors from the observed spaces onto the latent space. The suggested model does not make any direct assumption about the form of the generative mappings at training time. However, as inference within the model is done by training mappings from the observed data back onto the learned latent representation, the inverse mapping is assumed to exist, which is why this model retains the central assumption from shared spectral dimensionality reduction that the generative mappings are invertible.

2.10 Feature Selection

Feature extraction is the process of simplifying the amount of resources needed to represent a data set accurately. This is in contrast to feature selection, where only features with a positive impact in relation to a certain objective are retained in the representation. This can for example be features that are able to discriminate between two different classes. Dimensionality reduction algorithms are instances of feature extraction.


Figure 2.7: Graphical model corresponding to the formulation of probabilistic CCA in [4].

A closely related algorithm for feature selection is Canonical Correlation Analysis (CCA). Given two sets of observations Y and Z with known correspondences, CCA finds directions W_Y ∈ Y and W_Z ∈ Z such that the correlation, Eq. 2.68, between YW_Y and ZW_Z is maximized,

ρ = tr(W_Y^T Y^T Z W_Z) / (tr(W_Z^T Z^T Z W_Z) tr(W_Y^T Y^T Y W_Y))^{1/2}.    (2.68)

Finding the first set of directions w_1^Y and w_1^Z is formulated as the following constrained optimization problem,

argmax_{w_1^Y, w_1^Z} (Y w_1^Y)^T Z w_1^Z    (2.69)
subject to: (w_1^Y)^T Y^T Y w_1^Y = (w_1^Z)^T Z^T Z w_1^Z = 1.    (2.70)

Further orthogonal directions can be found iteratively. The scaling of each basis is arbitrary. The constraints in Eq. 2.70 fix the variance of the canonical variates Y w_1^Y and Z w_1^Z to 1. This ensures that maximizing Eq. 2.69 equates to maximizing the correlation Eq. 2.68.
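A sketch of classical CCA solved through whitening and an SVD, which satisfies the constraints of Eq. 2.70 while maximizing Eq. 2.68; the small ridge term and the toy data are implementation assumptions, not taken from the text.

```python
import numpy as np

def cca(Y, Z, n_dirs=1, ridge=1e-8):
    """Directions W_Y, W_Z maximizing the correlation of Eq. 2.68
    subject to the unit-variance constraints of Eq. 2.70."""
    Cyy = Y.T @ Y + ridge * np.eye(Y.shape[1])
    Czz = Z.T @ Z + ridge * np.eye(Z.shape[1])
    Cyz = Y.T @ Z
    # whiten each space, then take the SVD of the cross-covariance
    Ly, Lz = np.linalg.cholesky(Cyy), np.linalg.cholesky(Czz)
    M = np.linalg.solve(Ly, Cyz) @ np.linalg.inv(Lz).T
    U, s, Vt = np.linalg.svd(M)
    W_Y = np.linalg.solve(Ly.T, U[:, :n_dirs])
    W_Z = np.linalg.solve(Lz.T, Vt[:n_dirs].T)
    return W_Y, W_Z, s[:n_dirs]            # s holds the canonical correlations

# toy data sharing one signal
t = np.linspace(-1, 1, 100)[:, None]
Y = np.hstack([t, np.random.randn(100, 3)])
Z = np.hstack([-2 * t, np.random.randn(100, 2)])
W_Y, W_Z, corr = cca(Y - Y.mean(0), Z - Z.mean(0), n_dirs=1)
```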

In [4] the probabilistic form of CCA is derived through the maximum likelihood solution to the following model,

x ∼ N(0, I)    (2.71)
y|x ∼ N(W_Y x, Ψ_Y)    (2.72)
z|x ∼ N(W_Z x, Ψ_Z).    (2.73)

The model corresponds to generative CCA as long as the within-set, or non-shared, variations can be sufficiently described by a Gaussian noise model and the generative mappings are linear. The graphical model corresponding to the proposed model is shown in Figure 2.7. In [37] the model is extended to allow for non-linear mappings.
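A short sketch sampling from the generative model of Eqs. 2.71–2.73; the dimensions, mappings and noise covariances are arbitrary illustrative choices.

```python
import numpy as np

# Sample from the probabilistic CCA model of Eqs. 2.71-2.73.
q, D_Y, D_Z, N = 2, 5, 4, 200
W_Y = np.random.randn(D_Y, q)
W_Z = np.random.randn(D_Z, q)
Psi_Y = 0.1 * np.eye(D_Y)                 # within-set (non-shared) variation
Psi_Z = 0.1 * np.eye(D_Z)

X = np.random.randn(N, q)                 # x ~ N(0, I)
Y = X @ W_Y.T + np.random.multivariate_normal(np.zeros(D_Y), Psi_Y, N)
Z = X @ W_Z.T + np.random.multivariate_normal(np.zeros(D_Z), Psi_Z, N)
```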

2.11 Summary

In this chapter we have outlined some of the background upon which the material in the following chapters will build. Through elementary linear algebra and by introducing the concept of “The Curse of Dimensionality” we have motivated the field of dimensionality reduction, which encapsulates the work presented in this thesis. We have reviewed algorithms from the two main strands of work composing the field of dimensionality reduction, exemplifying the strengths and weaknesses of each approach. Further, we detailed the basics of probabilistic modeling from the perspective of Gaussian processes, which will be fundamental for the following chapters.

In the next chapter we will proceed to present a novel model for generative dimensionality reduction in the shared modeling scenario based on Gaussian processes.




Chapter 3

Shared GP-LVM

3.1 Introduction

Dimensionality reduction is the task of reducing the number of dimensions required to describe a set of data. The previous chapter introduced dimensionality reduction and gave the necessary mathematical background upon which these techniques are built. We divide dimensionality reduction into two groups of algorithms: generative and spectral. Generative techniques are more versatile and applicable to a larger range of modeling scenarios compared to the spectral techniques. However, the objective function of most generative algorithms is, in the general case, severely under-constrained. The spectral group of algorithms avoid this by constraining the solution to one where a smooth inverse to the generative mapping exists. In scenarios where we are given observations in multiple different forms we can exploit correspondences between observations when performing dimensionality reduction. This we refer to as shared dimensionality reduction. The following chapter will introduce two generative models that exploit correspondences between observations when performing dimensionality reduction.

In many modeling scenarios we have access to multiple different observations of the same underlying phenomenon. Often a significantly different cost, computational or monetary, is associated with acquiring samples from each domain. In such scenarios it is of interest to infer the location of an expensive sample from one which we can more easily acquire. One example, which we will return to in the applications chapter, is image based human pose estimation [69, 57]. This is the task of estimating the pose of a human from the evidence given in an image. Images can be captured relatively easily, while recording the actual pose of a human is associated with a significant cost, often requiring special rigs and expensive equipment. Therefore inferring the pose from image data is of great interest.

Learning the shared GP-LVM model presented in [52] is a three stage process. In the first stage PCA is applied separately to each of the two sets of observations. In the second stage the GP-LVM model is trained using the average of the two PCA solutions as initialization. This stage means that we have trained GP regressors modeling the generative mappings from the latent space onto the observed spaces. In the third and final stage a second set of GP regressors is trained that maps back from each of the observed spaces onto the latent space. Even though not explicitly stated, this implies that the generative mappings are assumed to have a smooth inverse. We will now proceed to introduce a more general model capable of transferring locations in one observation space to a corresponding space.



Figure 3.1: The left image shows the conditional model where a set of observed data Y have been generated by Z. The image to the right shows the shared GP-LVM model which we suggest as a replacement for the conditional model on the left. The back-mapping from the output space that constrains the latent space is represented by the dashed line.

3.2 Shared GP-LVM

Given two sets of corresponding observations Y = [y_1, . . . , y_N] and Z = [z_1, . . . , z_N], where y_i ∈ ℜ^{D_Y} and z_i ∈ ℜ^{D_Z}, we assume that each observation has been generated from the same low-dimensional manifold, corrupted by additive Gaussian noise,

y_i = f^Y(x_i) + ǫ_Y,   ǫ_Y ∼ N(0, β_Y^{−1} I)
z_i = f^Z(x_i) + ǫ_Z,   ǫ_Z ∼ N(0, β_Z^{−1} I),    (3.1)

where x_i ∈ ℜ^q with q/D_Y < 1 and q/D_Z < 1.

Our objective is to create a model from which the location z_i ∈ Z corresponding to a given location y_i ∈ Y can be determined. This can be done by modeling the conditional distribution over the input space Y given the output space Z, as shown in the left image in Figure 3.1. The conditional distribution will associate each location in the output space with a location in the input space. However, for many applications the observation spaces are likely to be high-dimensional, which makes modeling this distribution problematic, especially in scenarios with a limited amount of training data. Therefore, rather than modeling p(y_i|z_i) directly, given that the representation of Z is redundant, we can find a reduced dimensional representation X of the output space. This means that we can model the conditional distribution over this new, dimensionality reduced representation, p(y_i|x_i), which should be significantly easier. However, rather than simply modeling the conditional distribution over a dimensionality reduced representation of the output space, we will formulate an objective which learns the latent representation together with the mappings.

From the Gaussian noise assumption, Eq. 3.1, we can formulate the likelihood of the data,

P(Y, Z|f^Y, f^Z, X, Φ_Y, Φ_Z) = Π_{i=1}^{N} p(y_i|f^Y, x_i, Φ_Y) p(z_i|f^Z, x_i, Φ_Z).    (3.2)

Placing Gaussian process priors and integrating over the mappings leads to the marginal likelihood of the shared GP-LVM model,

P(Y, Z|X, Φ_Y, Φ_Z) = Π_{i=1}^{N} p(y_i|x_i, Φ_Y) p(z_i|x_i, Φ_Z)    (3.3)
p(y_i|x_i, Φ_Y) = ∫ p(y_i|f^Y, x_i, Φ_Y) p(f^Y) df^Y
p(z_i|x_i, Φ_Z) = ∫ p(z_i|f^Z, x_i, Φ_Z) p(f^Z) df^Z.

We are interested in finding a low-dimensional representation X which can be used as a complete substitute for Z. This means that the relationship between these two spaces should take the form of a bijection, i.e. that each point z_i ∈ Z is represented by a unique location x_i ∈ X and vice versa.


There is nothing in the shared model that encourages this asymmetry. However, this can be achieved by incorporating back-constraints [35], which represent the latent locations as a smooth parametric mapping from the output space Z,

x_i = h(z_i, W).    (3.4)

This leads to the following objective,

{Ŵ, Φ̂_Y, Φ̂_Z} = argmax_{W,Φ_Y,Φ_Z} P(Y, Z|W, Φ_Y, Φ_Z).    (3.5)

The objective can be optimized using gradient based methods. In Figure 3.1 the leftmost image shows the conditional model and the rightmost image shows the back-constrained shared GP-LVM model.
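A sketch of the joint marginal likelihood of Eq. 3.3 as it would be evaluated during optimization, assuming RBF covariances and placeholder data; the back-constrained objective of Eq. 3.5 would evaluate the same quantity with X given by h(Z, W) and optimize W instead of X.

```python
import numpy as np

def rbf(X, theta1=1.0, theta2=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_nll(X, Y, theta, beta):
    """Negative log marginal likelihood of one GP-LVM term, -log p(Y|X, Phi)."""
    N, D = Y.shape
    K = rbf(X, *theta) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)) \
           + 0.5 * N * D * np.log(2 * np.pi)

def shared_gplvm_nll(X, Y, Z, Phi_Y, Phi_Z):
    """-log P(Y, Z | X, Phi_Y, Phi_Z) of Eq. 3.3: the two GPs share the latent X."""
    return gp_nll(X, Y, *Phi_Y) + gp_nll(X, Z, *Phi_Z)

# placeholder data and latent initialization
Y, Z = np.random.randn(40, 10), np.random.randn(40, 8)
X = np.random.randn(40, 3) * 0.1
nll = shared_gplvm_nll(X, Y, Z, Phi_Y=((1.0, 1.0), 100.0), Phi_Z=((1.0, 1.0), 100.0))
```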

3.2.1 Initialization

There are several different options for how to initialize the locations in the latent space for this model. One could, as with the shared GP-LVM in [52], initialize using the average of the embeddings given by a spectral algorithm, or with the solution to one of the shared spectral algorithms. However, we want the latent representation to be a complete representation of the output space Z. This means that we can initialize using the solution of a spectral algorithm with the output space Z as input. Underpinning this model is the assumption that the non-back-constrained observation spaces are governed by a subset of the degrees of freedom of the back-constrained observation space.


By initializing using the solution of a spectral algorithm applied to the output space, we seek the solution to the latent representation where the intrinsic representation of the input space aligns with that of the output space.

3.2.2 Inference

Once the model has been trained we are interested in inferring the output location corresponding to a point in the input space. As we are not assuming a functional relationship between Y and Z, we cannot simply learn a mapping as in [52]. Given the location y_i in the input space Y we want to infer the corresponding location z_i in the output space Z. The back-constraint from the output space to the latent space encodes the bijective relationship between Z and X. This implies that any multi-modalities in the relationship between Y and Z have been contained in the mapping from the latent representation X to the input space Y. This means that to recover the location in the high-dimensional output space we only need to find the corresponding point in the much lower-dimensional latent space,

x̂_i = argmax_{x_i} p(y_i|x_i, X, Φ_Y).    (3.6)

Having found the location in the latent space, the corresponding location in the output space can be found through the mean prediction of the uni-modal posterior distribution as,

ẑ_i = argmax_{z_i} p(z_i|x̂_i, X, Φ_Z).    (3.7)
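A sketch of this two-step inference, Eqs. 3.6–3.7, assuming RBF covariances with fixed hyper-parameters, a shared noise variance across output dimensions and an initialization of the latent point from a training point; these choices are illustrative and not prescribed by the model.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, theta1=1.0, theta2=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def predict(xs, X, T, beta=100.0):
    """GP predictive mean and variance at xs for targets T (Eq. 2.47)."""
    K = rbf(X, X) + np.eye(len(X)) / beta
    ks = rbf(xs, X)
    mean = ks @ np.linalg.solve(K, T)
    var = 1.0 + 1.0 / beta - np.sum(ks * np.linalg.solve(K, ks.T).T, axis=1)
    return mean, var

def infer_z(y_star, X, Y, Z, x0):
    """Eq. 3.6: optimize the latent point under p(y|x); Eq. 3.7: mean-predict z."""
    def nll(x):
        m, v = predict(x[None, :], X, Y)
        return 0.5 * Y.shape[1] * np.log(v[0]) + 0.5 * np.sum((y_star - m[0])**2) / v[0]
    x_hat = minimize(nll, x0).x
    z_hat, _ = predict(x_hat[None, :], X, Z)
    return z_hat[0], x_hat

# illustrative trained quantities (in practice they come from optimizing Eq. 3.5)
X = np.random.randn(30, 2)
Y, Z = np.random.randn(30, 6), np.random.randn(30, 4)
z_hat, x_hat = infer_z(Y[0], X, Y, Z, x0=X[0])
```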



Figure 3.2: Each dimension of the underlying signals used to generate the high-dimensional data shown in Figure 3.3.

3.2.3 Example

In the following section we will run through a toy example to exemplify the model presented above. We will generate two sets of high dimensional data, Y and Z, from which we will learn embeddings using the shared GP-LVM models. Both data-sets are generated from a single underlying signal t, which consists of N linearly spaced values between −1 and 1. A set of three signals is generated through non-linear mappings of t,

x_i^1 = cos(πt_i)    (3.8)
x_i^2 = sin(πt_i)    (3.9)
x_i^3 = cos(√5 πt_i + 2).    (3.10)

We will refer to X = [x^1; x^2; x^3] as the generating signal of the data, shown in Figure 3.2. Through linear mappings of X we generate two sets of 20 dimensional signals Y and Z. To achieve this we draw values at random from a zero mean, unit variance normal distribution and organize them into 20 by 3 dimensional matrices P_1 ∈ R^{20×3} and P_2 ∈ R^{20×3}.



Figure 3.3: Observed data Y (left) and Z (right) generated from Eqs. 3.11 and 3.12.

From these transformation matrices we generate the high-dimensional observation signals through the generating signal X,

Y = X P_1^T + λη    (3.11)
Z = X P_2^T + λη,    (3.12)

where η are samples from a zero mean, unit variance normal distribution and λ = 0.05. Figure 3.3 shows each dimension of the two high-dimensional data-sets.
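For reference, the toy data of Eqs. 3.8–3.12 can be reproduced with the following sketch, assuming N = 100 points as in the figures.

```python
import numpy as np

# Reproduce the toy data of Eqs. 3.8-3.12 (N = 100 points assumed).
N = 100
t = np.linspace(-1, 1, N)
X = np.column_stack([np.cos(np.pi * t),
                     np.sin(np.pi * t),
                     np.cos(np.sqrt(5) * np.pi * t + 2)])   # generating signal

P1 = np.random.randn(20, 3)            # random linear maps to 20 dimensions
P2 = np.random.randn(20, 3)
lam = 0.05
Y = X @ P1.T + lam * np.random.randn(N, 20)
Z = X @ P2.T + lam * np.random.randn(N, 20)
```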

We proceed by applying the above generated data to the shared GP-LVM model and the shared back-constrained model presented in this chapter. Each of the models is trained using linear kernels only, as we are interested in their capability to unravel the linear transformations P_1 and P_2 applied to the data. Figure 3.4 shows the embeddings found by the two algorithms. As can be seen, both algorithms unravel the data and find three generating signals underlying the data. We do not expect the algorithms to exactly unfold the signal X, as there are several different linear transformations that could have generated the observed data. However, we see that both algorithms find two signals of “one period”, corresponding to x^1 and x^2, and a higher-frequency signal corresponding to x^3.



Figure 3.4: Each dimension of the latent embeddings of the data in Figure 3.3 from the standard shared model (left) and the shared back-constrained model (right). Each model unravels the signals used to generate the data, shown in Figure 3.2.

One way to quantify the quality of the embeddings is to compare the Procrustes score [23] between the found embeddings and the generating signals. The Procrustes score is a measure of similarity between shapes that are represented as point sets. In Procrustes analysis the shape of an object is considered as belonging to an equivalence class, and a shape is defined as “all the geometrical information that remains when location, scale and rotational effects have been filtered out from an object” [23]. Two shapes can be compared to each other by removing the above effects. By finding the best aligning linear transformation, the two point sets can be compared through the sum of squared distances between the points. Table 3.1 shows that both embeddings have low scores when compared to the generating signals.
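As an illustration of the score, the following sketch uses scipy's standard Procrustes analysis, whose disparity (the sum of squared differences after removing translation, scale and rotation) plays the role of the Procrustes score used here; the aligned signals are made up.

```python
import numpy as np
from scipy.spatial import procrustes

# Compare a recovered embedding to a generating signal: the disparity is small
# whenever the two point sets match up to translation, scale and rotation.
t = np.linspace(-1, 1, 100)
X_true = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
X_emb = X_true @ np.array([[0.0, -1.0], [1.0, 0.0]]) * 3.0 + 0.5   # rotated, scaled, shifted copy
_, _, disparity = procrustes(X_true, X_emb)
print(disparity)      # close to zero: the shapes agree after alignment
```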

We will modify the previous example slightly to represent a different modeling scenario. Just as in the previous example, we generate two high-dimensional signals Y and Z through randomized linear mappings. However, in this case the observed data Y is generated from a two dimensional signal x^Y = [x^1; x^2], whereas Z is generated from the same signals as in the previous example. This results in the sets of generating signals shown in Figure 3.5.



Figure 3.5: Underlying generating signals of the high-dimensional data Y (left) and Z (right) shown in Figure 3.6.


Figure 3.6: Each dimension of the observed data Y (left) and Z (right) generated from the underlying signals shown in Figure 3.5.

By drawing values to form the transformation matrices as in the previous example we can generate the observed signals shown in Figure 3.6. This example is meant to visualize the modeling scenario where the input space Y does not share the same modes of variability as the output space Z. In this case the input space Y is generated from a subset of the signals generating the output space Z. Figure 3.7 shows the results of applying the two shared models presented above. As can be seen in the leftmost plot in Figure 3.7, the shared GP-LVM model does not correctly unfold the data to recover the three generating dimensions of X. Rather, the shared model seems to represent the shared signals x^1 and x^2 but avoids representing x^3, which is private to the output space.



Figure 3.7: Resulting embeddings of applying the shared (left) and the shared back-constrained (right) GP-LVM model to the data shown in Figure 3.6. As can be seen, the shared model fails to unravel the generating signals while the back-constrained model correctly finds the signals underlying the high-dimensional data. The shared model assumes that both observation spaces are generated from the same underlying signals, which is not true for the data shown in Figure 3.6. The shared back-constrained GP-LVM model relaxes this assumption, only assuming that the non-back-constrained observation space is generated from a subset of the generating signals of the back-constrained space.

However, in the shared back-constrained model the latent space is constrained to be a smooth representation of the output space, and therefore all the degrees of freedom in the observed data will be encoded in the latent representation of the data. This means that for the shared model the latent space is not guaranteed to be a full representation of the output space, which means that determining the latent location will not be sufficient for determining the location in the output space. This implies that we cannot replace the high-dimensional conditional model p(Y|Z) with the low-dimensional p(Y|X) as intended.

In the following example we will try to exemplify how the models act in a more general modeling scenario. As before, we will generate two high-dimensional data-sets Y and Z from a set of underlying generating signals. In this example each data-set will be generated from a single shared signal x^1 and one signal that remains private to each observation: x^2 to Y and x^3 to Z.


Example                               Model    Procrustes
Same generating signals               Shared   0.058
Same generating signals               Back     0.004
Two shared, one private for output    Shared   0.614
Two shared, one private for output    Back     0.028

Table 3.1: The Procrustes score corresponding to the resulting embeddings of applying the different GP-LVM models to the data shown in Figure 3.3 and Figure 3.6. As can be seen, the back-constrained model (referred to as Back) significantly out-performs the standard shared GP-LVM model.


Figure 3.8: Each dimension of the underlying signals generating the high-dimensional data Y (left) and Z (right) shown in Figure 3.9. Each signal is two dimensional, with one dimension shared and one private to each data-set.

In Figure 3.8 the generating signals are shown. Similarly to the previous examples, we generate two high-dimensional signals from random linear transformations of the generating signals. Each dimension of the observed data is shown in Figure 3.9. We apply a standard shared and a back-constrained shared model to the data. In Figure 3.10 the embeddings found by each model are shown. Both models represent the data using two latent dimensions. In the case of the shared model the latent locations seem to correspond to the generating signals of Y, while the back-constrained model seems to unravel the generating signals corresponding to Z.



Figure 3.9: Each dimension of the high-dimensional observed data Y (left) and Z (right) generated from the underlying signals shown in Figure 3.8.

In the case of the back-constrained model this is expected, as the model was specifically designed to learn a latent space which corresponds to Z, the output space of the model. Further, the latent locations are initialized using the solution of a spectral algorithm, in this case PCA, applied to Z, which means that there is a strong incentive for the model to focus on modeling Z rather than Y. For the standard shared model the interpretation of the resulting embedding is less obvious. The model does not favor the generation of either of the two observation spaces; however, the latent locations are initialized from the mean of the solutions of a spectral algorithm, in this case PCA, applied to both observation spaces Y and Z. As both data sets are generated from two dimensional signals, PCA will only recover two dimensions, and the mean of the solutions to each space will only have relevance for the shared generating signal, while the recovered private signals will be lost when taking the mean of the signals. As we know the dimensionality of the signal generating the observations, we can, rather than determining the latent dimensionality from the initialization, set it to equal the dimension of the generating signal. Figure 3.11 shows the embeddings found using three dimensional latent spaces. As can be seen, neither model recovers the three generating signals shown in Figure 3.8.



Figure 3.10: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. Neither model succeeds in unraveling the three different underlying generating signals shown in Figure 3.8.


Figure 3.11: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. The latent dimensionality is set to equal the dimensionality of the generating signals. As can be seen, neither model manages to recover the three generating signals shown in Figure 3.8.


In the case of the back-constrained model this is to be expected, as the latent locations are constrained to be a smooth mapping from the observed data Z. As nothing of the signal private to Y, x^2, is contained in Z, the model cannot correctly represent this in the latent space. This constraint does not exist for the standard shared model; however, as can be seen in Figure 3.11, neither does this model succeed in recovering the three generating signals [x^1; x^2; x^3]. The model is initialized using the mean of the data projected onto the three most representative principal components. As each data-set is generated from two dimensional signals, the third principal component will, for each observation space, fit to the noise of the data. This means that the third latent dimension will be initialized to noise, from which the model never manages to recover to encode x^3.

3.2.4 Summary

The shared and back-constrained GP-LVM model presented above was created to model the scenario where the input space can be modeled as a mapping from the output space. What this means is that all the degrees of freedom in the input space are contained in the output space. It does, however, not assume a bijective relationship between the two observation spaces as in [52], as a location in the input space can be associated with several locations in the output space, which was exemplified in Figure 3.7. However, as was shown in the example leading up to the embeddings in Figures 3.10 and 3.11, neither model was capable of modeling the more general scenario where each observed space shares a subset of the generating parameters but also contains private generating parameters. In the next section we will proceed to describe an algorithm designed for this specific modeling scenario.



Figure 3.12: Subspace GP-LVM model. The two observation spaces Y and Z are generated from the latent variable X, factorized into subspaces X^Y, X^Z representing the private information and X^S modeling the shared information. Φ_Y and Φ_Z are the hyper-parameters of the Gaussian processes modeling the generative mappings f^Y and f^Z.


3.3 Subspace GP-LVM

The main limitation of the shared GP-LVM, and shared models in general, is the assumption that each data space is governed by the same degrees of freedom or modes of variability. We relaxed this assumption, when only interested in inference in one direction, by introducing the back-constraint to the shared model in the previous section. We will now proceed by introducing a new shared GP-LVM model which further relaxes this assumption.

Similarly to the shared GP-LVM model, we are given two sets of corresponding observations Y = [y_1, . . . , y_N] and Z = [z_1, . . . , z_N], where y_i ∈ ℜ^{D_Y} and z_i ∈ ℜ^{D_Z}.


We assume that each observation has been generated from low-dimensional manifolds, corrupted by additive Gaussian noise,

y_i = f^Y(u_i^Y) + ǫ_Y,   ǫ_Y ∼ N(0, β_Y^{−1} I)
z_i = f^Z(u_i^Z) + ǫ_Z,   ǫ_Z ∼ N(0, β_Z^{−1} I),

where u_i^Y ∈ ℜ^{q_Y} and u_i^Z ∈ ℜ^{q_Z} with q_Y/D_Y < 1 and q_Z/D_Z < 1. We assume that the two manifolds can be parameterized in such a manner that they share a common non-empty subspace X^S,

X^S ⊂ U^Y,   X^S ⊂ U^Z,   X^S ≠ ∅,    (3.13)

which is referred to as the shared subspace X^S. This assumption implies that a parameterization of each observation space in which a subset of the degrees of freedom of each observation space is shared is possible. Representing each manifold in terms of X^S introduces an additional subspace associated with the manifold,

U^Y = [X^S; X^Y],   U^Z = [X^S; X^Z],    (3.14)

which is referred to as the private (or non-shared) subspace. The full latent representation of both observation spaces, X, is the concatenation of the shared and private subspaces, X = [X^S; X^Y; X^Z]. Note that the private spaces are subspaces of X, with the full latent space representing shared and non-shared variance in a factorized form.

A shared GP-LVM model can be constructed respecting the factorized latent structure. The GP-LVM learns two separate mappings generating each observation space, where the input space for the GP generating Y is U^Y and for Z is U^Z, leading to the following objective,

{X̂, Φ̂_Y, Φ̂_Z} = argmax_{X,Φ_Y,Φ_Z} p(Y, Z|X, Φ_Y, Φ_Z)
             = argmax_{X^S,X^Y,X^Z,Φ_Y,Φ_Z} p(Y|X^S, X^Y, Φ_Y) p(Z|X^S, X^Z, Φ_Z).    (3.15)

The latent structure presented above is capable of separately modeling the shared and the non-shared variance. This is achieved by letting the mappings f^Y and f^Z only be active on the subspaces associated with each observation.
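A sketch of the factorized objective of Eq. 3.15, where the GP generating Y only sees U^Y = [X^S, X^Y] and the GP generating Z only sees U^Z = [X^S, X^Z]; the kernels, dimensionalities and data are placeholder assumptions.

```python
import numpy as np

def rbf(X, theta1=1.0, theta2=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_nll(U, T, beta=100.0):
    """-log p(T | U) for one GP, as in the shared GP-LVM."""
    N, D = T.shape
    K = rbf(U) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, T @ T.T)) \
           + 0.5 * N * D * np.log(2 * np.pi)

def subspace_gplvm_nll(X_S, X_Y, X_Z, Y, Z):
    """Objective of Eq. 3.15: f^Y sees U^Y = [X^S, X^Y], f^Z sees U^Z = [X^S, X^Z]."""
    U_Y = np.hstack([X_S, X_Y])
    U_Z = np.hstack([X_S, X_Z])
    return gp_nll(U_Y, Y) + gp_nll(U_Z, Z)

N = 50
Y, Z = np.random.randn(N, 10), np.random.randn(N, 8)
X_S, X_Y, X_Z = (0.1 * np.random.randn(N, d) for d in (1, 1, 1))
nll = subspace_gplvm_nll(X_S, X_Y, X_Z, Y, Z)
```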

3.4 Extensions

As with the standard GP-LVM model, additional constraints such as back-constraints or dynamic models can be placed over the latent variables. However, just as with the generative mappings, we can limit the constraints to only be active on subspaces of the latent space. In the applications chapter we will evaluate the performance of shared subspace models with dynamic constraints.

3.5 Applications

Applications in a wide variety of fields are concerned with inferring a specific variable, the output variable, from evidence given by the parameters of a different variable, the input variable. In the simplest case, when the output variable is related to the input variable by a function, this is called a regression problem. However, for many applications this cannot be assumed, as the input variable is not sufficiently discriminative of the output variable.


In such cases there might be several different locations in the output space that correspond to a specific location in the input space. For such an estimation task we would ideally like to recover each possible output location corresponding to the given input.

A classic example of an application where the input space is not sufficient for discriminating the output is the computer vision task of image based human pose estimation. The task is concerned with estimating the pose parameters of a human from an image. In the applications chapter we will demonstrate how the models proposed in this chapter can be applied to this task. In particular, the application of the subspace GP-LVM model is interesting since it finds a factorized representation of the data into shared and private parts. The shared latent variable represents the portion of the variance in each observation space that can be determined from the other observations, i.e. variance that has a functional relationship between the observations. The private latent variable represents variance which cannot be discriminated from the other space. As we will show, this means that, for the task of pose estimation, the estimation can be reduced to a regression task for estimating the shared location. The location in the private space represents variance in the pose space that cannot be estimated from the image, i.e. poses that are ambiguous given the input image.

3.6 Summary

As was exemplified in the toy example in Figure 3.5, the standard shared model is not capable of unraveling the correct generating signals for data where the two observation spaces have not been generated from the same underlying signals. For the case where we are interested in inferring locations in one specific observed space given locations in the other observation space we proposed the shared back-constrained model. The proposed model is capable of correctly unraveling the generating signals of the data when the input space has been generated from a subset of the generating parameters of the output space. However, in cases where the data have been generated using parameters "private" to both of the observed spaces, as was shown in the example in Figure 3.8, the shared back-constrained model also fails to model the data, as was shown in Figure 3.10 and Figure 3.11.

The objective of each of the models is to find latent locations and mappings that minimize the reconstruction error of the data. In cases where the data contains observation space specific "private" information used to generate a single observation space the model is left with two choices. The first is to represent this information in the latent space, which will reduce the error when reconstructing the associated observed space. However, for the other space, which does not contain this information, this will pollute the latent space, reducing the capability of the generative mapping to generalize over the data and resulting in a lower likelihood of the associated generative mapping. The other option is to consider the private information as noise. Neither of these scenarios is ideal, as both will result in a lower likelihood of the model: either by reducing the generative mapping's capability to regenerate the data, or by trying to model a structured signal using a noise model which is not likely to fit particularly well.

For modeling data containing private information we have in this chapter proposed the shared subspace GP-LVM model. This model learns a factorized latent space of the data, separately modeling information that is shared from information that is private. The structure of the latent space means that the private information in the observations will not pollute the reconstruction of the data, nor will the model need to resort to modeling this information as noise. Further, the factorization into shared and private information can be beneficial when we want to do inference in the model, as the shared space contains information that can be determined from either observation space while the private subspace contains the information that is ambiguous knowing the location in the other observed space. In the applications chapter we will exploit this factorization for estimating the pose of a human from image evidence. Each image is ambiguous to a small sub-set of the possible human poses; using the shared subspace model, this sub-set will be modeled using the private latent spaces.

The non-convex nature of the GP-LVM objective (in the non-linear case) means that the algorithm relies on an analogous dimensionality reduction algorithm for initialization. For the standard model there are, as we have seen, several different algorithms that are applicable. When the observation spaces can be assumed to have been generated from the same manifold, the approach of averaging over the observations is applicable. Similarly, in the asymmetric case initialization from the output space is viable. However, for the subspace model presented in this chapter, no equivalent spectral model exists.

In the next chapter we proceed by presenting a spectral approach to shared dimensionality reduction that can be used to initialize the subspace GP-LVM model presented in this chapter.


Chapter 4

NCCA

4.1 Introduction

In the previous chapter we introduced two new latent variable models based on the GP-LVM. The first presented algorithm, referred to as the shared back-constrained GP-LVM, models two correlated sets of observations using a single shared multivariate latent variable. It was created for the task of inferring the location in one observation space given the corresponding location in the other observation space. The second algorithm, referred to as the subspace GP-LVM model, was designed for the more general scenario where we want to model several different observation spaces sharing a subspace of their variance.

The solution to both algorithms is found using gradient based methods. Therefore, to be able to recover a good solution, initialization of the models is of significant importance. The shared back-constrained model assumes that the non-back-constrained observation space has been generated from a subset of the generating parameters of the back-constrained space. This means that we can initialize the latent locations using the output of a spectral dimensionality reduction technique applied to the back-constrained observation space. However, for the subspace GP-LVM model no analogous convex models exist. In this chapter we will present a spectral dimensionality reduction algorithm for finding factorized latent spaces that separately represent the shared and private variance of two corresponding observation spaces. Being associated with a convex objective function, the model can be used to initialize the subspace GP-LVM model.

4.2 Assumptions

Just as with the spectral dimensionality reduction approaches we reviewed in Chapter 2, the model we will present is based on a set of assumptions about the relationship between the observed data and its latent representation. We are given two sets of corresponding observations $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ and $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_N]$, where $\mathbf{y}_i \in \Re^{D_Y}$ and $\mathbf{z}_i \in \Re^{D_Z}$. We assume that each observation has been generated from a low-dimensional manifold corrupted by additive Gaussian noise,

$$\mathbf{y}_i = f^Y(\mathbf{u}_i^Y) + \epsilon^Y, \qquad \epsilon^Y \sim \mathcal{N}(\mathbf{0}, \beta_Y^{-1}\mathbf{I})$$
$$\mathbf{z}_i = f^Z(\mathbf{u}_i^Z) + \epsilon^Z, \qquad \epsilon^Z \sim \mathcal{N}(\mathbf{0}, \beta_Z^{-1}\mathbf{I}). \qquad (4.1)$$

The latent representations $\mathbf{u}_i^Y \in U^Y$ and $\mathbf{u}_i^Z \in U^Z$ associated with each observation space consist of two components: one shared part $\mathbf{x}_i^S$, and one private or non-shared part, $\mathbf{x}_i^Y$ and $\mathbf{x}_i^Z$ respectively, associated with each observation space. Each component represents an orthogonal subspace of the latent representation, as $\mathbf{u}_i^Y = [\mathbf{x}_i^S; \mathbf{x}_i^Y]$ and $\mathbf{u}_i^Z = [\mathbf{x}_i^S; \mathbf{x}_i^Z]$. We will refer to $X^S$ as the shared subspace and $X^Y$ and $X^Z$ as the private subspaces of the model.

The shared subspace $X^S$ of the latent representations $U^Y$ and $U^Z$ is assumed to be related to the observation spaces by smooth mappings $g^Y$ and $g^Z$ as,

$$\mathbf{x}_i^S = g^Y(\mathbf{y}_i) = g^Z(\mathbf{z}_i). \qquad (4.2)$$

We further assume that the relationship between the observed data and its corresponding private manifold representation also takes the form of a smooth mapping,

$$\mathbf{x}_i^Y = h^Y(\mathbf{y}_i) \qquad (4.3)$$
$$\mathbf{x}_i^Z = h^Z(\mathbf{z}_i). \qquad (4.4)$$

In the following sections we will first describe how the shared latent space $X^S$ can be found from the observed data. Once a solution for the shared space $X^S$ is found we will proceed to find the private subspaces $X^Y$ and $X^Z$ to complete the latent representation of the model.

4.3 Shared

The shared subspace $X^S$ of the latent representation of the data represents variance that is shared between both observation spaces. In Chapter 2 we reviewed Canonical Correlation Analysis (CCA), which is a feature selection algorithm for finding directions in two observation spaces that are maximally correlated. The model we are about to present is a two stage algorithm; in the first stage we find the shared latent space $X^S$ using CCA.

The objective of CCA is to find two sets of basis vectors $\mathbf{W}_Y$ and $\mathbf{W}_Z$, one for each observation space, such that the correlation $\rho$ between the projections of the data is maximized,

$$\rho = \frac{\mathrm{tr}\left(\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Z} \mathbf{W}_Z\right)}{\left(\mathrm{tr}\left(\mathbf{W}_Z^T \mathbf{Z}^T \mathbf{Z} \mathbf{W}_Z\right)\, \mathrm{tr}\left(\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Y} \mathbf{W}_Y\right)\right)^{\frac{1}{2}}}, \qquad (4.6)$$

subject to unit variance $\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Y} \mathbf{W}_Y = \mathbf{I}$ and $\mathbf{W}_Z^T \mathbf{Z}^T \mathbf{Z} \mathbf{W}_Z = \mathbf{I}$ along each direction. The unit variance constraint means that CCA will find maximally correlated directions irrespective of how much of the variance in the observation space is explained. As a way of avoiding low variance solutions it is suggested in [31] to first apply PCA separately to each data-set and then apply CCA in the dominant principal subspace of each data-set. This avoids highly correlated directions that explain a non-substantial amount of the variance of the data.
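A minimal numpy sketch of this classical CCA solution, computed via whitening and an SVD; the variable names, the ridge term and the centering convention are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    # Inverse matrix square root via eigendecomposition (small ridge for stability).
    vals, vecs = np.linalg.eigh(S + eps * np.eye(len(S)))
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(Y, Z):
    """Return bases W_Y, W_Z and the canonical correlations rho.
    Columns of Y @ W_Y and Z @ W_Z have unit variance and are maximally correlated."""
    Y = Y - Y.mean(0)
    Z = Z - Z.mean(0)
    N = len(Y)
    Syy, Szz, Syz = Y.T @ Y / N, Z.T @ Z / N, Y.T @ Z / N
    Syy_is, Szz_is = inv_sqrt(Syy), inv_sqrt(Szz)
    U, rho, Vt = np.linalg.svd(Syy_is @ Syz @ Szz_is)
    d = len(rho)
    return (Syy_is @ U)[:, :d], (Szz_is @ Vt.T)[:, :d], rho
```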

In the general case both CCA and PCA recover linear subspaces of the data. However, both algorithms can be non-linearized through the kernel trick [3] by first mapping the data to kernel induced feature spaces,

$$\Psi_Y : \mathbf{Y} \rightarrow \mathcal{F}^Y \qquad (4.7)$$
$$\Psi_Z : \mathbf{Z} \rightarrow \mathcal{F}^Z, \qquad (4.8)$$

represented by kernels $\mathbf{K}^Y$ and $\mathbf{K}^Z$. In practice we first apply kernel PCA and then look for correlated directions in the dominant kernel induced principal subspace of the data. Having found directions explaining the shared components of the data it remains to find directions explaining the private variance of each observation space.

Figure 4.1: Graphical model of the Non-Consolidating-Component-Analysis (NCCA) model. The two observation spaces Y and Z are generated from a latent variable X which is factorized into three different subspaces $X = [X^Y; X^S; X^Z]$. The subspace $[X^S; X^Y]$ is the latent representation of Y while $[X^S; X^Z]$ represents Z. This means that $X^S$ models the portion of the data that is correlated between Y and Z and is found using CCA. The subspaces $X^Y$ and $X^Z$ represent the remaining private portion of each observation space.

4.4 Private

Given two observation spaces Y and Z together with bases $\mathbf{W}^Y$ and $\mathbf{W}^Z$ that explain the shared variance, we are interested in finding directions $\mathbf{V}^Y$ and $\mathbf{V}^Z$ that explain the private, non-shared variance of each observation space. To find such a basis we look for directions of maximum variance in each observation space that are orthogonal to the shared bases. We will apply this procedure to each observation space in turn, and in the following we have therefore dropped the superscript that identifies the observation space. We seek the first direction,

$$\hat{\mathbf{v}} = \operatorname*{argmax}_{\mathbf{v}} \mathbf{v}^T \mathbf{C} \mathbf{v}, \qquad (4.9)$$

subject to,

$$\mathbf{v}^T \mathbf{v} = 1 \qquad (4.10)$$
$$\mathbf{v}^T \mathbf{W} = \mathbf{0}, \qquad (4.11)$$

where $\mathbf{W}$ contains the canonical directions and $\mathbf{C}$ is the covariance matrix of the feature space.

4.4.1 Extracting the first orthogonal direction

We apply the algorithm in a feature space induced by a kernel with $K$ canonical directions to respect. The solution is found through formulating the Lagrangian of the problem,

$$L = \mathbf{v}^T \mathbf{C} \mathbf{v} - \lambda(\mathbf{v}^T \mathbf{v} - 1) - \sum_{i=1}^{K} \gamma_i \mathbf{v}^T \mathbf{w}_i. \qquad (4.12)$$

Seeking the stationary point of the Lagrangian leads to the following system of equations,

$$\frac{\delta L}{\delta \mathbf{v}} = 2\mathbf{C}\mathbf{v} - 2\lambda\mathbf{v} - \sum_{i=1}^{K} \gamma_i \mathbf{w}_i = \mathbf{0} \qquad (4.13)$$
$$\frac{\delta L}{\delta \lambda} = \mathbf{v}^T\mathbf{v} - 1 = 0 \qquad (4.14)$$
$$\frac{\delta L}{\delta \gamma_i} = \mathbf{v}^T \mathbf{w}_i = 0, \quad \forall i. \qquad (4.15)$$


By pre-multiplying Eq. 4.13 with $\sum_{i=1}^{K} \mathbf{w}_i^T$,

$$\sum_{i=1}^{K} \mathbf{w}_i^T (2\mathbf{C}\mathbf{v}) - 2\lambda \sum_{i=1}^{K} \mathbf{w}_i^T \mathbf{v} - \left(\sum_{j=1}^{K} \mathbf{w}_j^T\right)\left(\sum_{i=1}^{K} \gamma_i \mathbf{w}_i\right) = 0, \qquad (4.16)$$

using the orthogonality constraint Eq. 4.15,

$$\sum_{i=1}^{K} \mathbf{w}_i^T \mathbf{v} = 0, \qquad (4.17)$$

and Eq. 4.14,

$$\left(\sum_{j=1}^{K} \mathbf{w}_j^T\right)\left(\sum_{i=1}^{K} \gamma_i \mathbf{w}_i\right) = \begin{cases} \gamma_i & i = j \\ 0 & i \neq j \end{cases}, \qquad (4.18)$$

we obtain,

$$\gamma_i = 2\mathbf{w}_i^T \mathbf{C} \mathbf{v}. \qquad (4.19)$$

Using Eq. 4.13 and Eq. 4.19,

$$2\mathbf{C}\mathbf{v} - 2\lambda\mathbf{v} - 2\sum_{i=1}^{K} \mathbf{w}_i(\mathbf{w}_i^T \mathbf{C} \mathbf{v}) = \mathbf{0} \qquad (4.20)$$

$$\left(\mathbf{C} - \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T \mathbf{C}\right)\mathbf{v} = \lambda\mathbf{v}. \qquad (4.21)$$

Equation 4.21 is an eigenvalue problem and can be solved in closed form through the eigendecomposition of the matrix $\left(\mathbf{C} - \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T \mathbf{C}\right)$.

4.4.2 Extracting consecutive directions

Having found the orthogonal direction explaining the maximum of the remaining variance, further consecutive directions can be found by appending the previously found directions to the orthogonality constraint. For the $M$th direction,

$$\mathbf{v}_M^T [\mathbf{W}; \mathbf{v}_1, \ldots, \mathbf{v}_{M-1}] = \mathbf{0}. \qquad (4.22)$$

This means that to find the $M$th direction the following eigenvalue problem needs to be solved,

$$\left(\mathbf{C} - \left(\sum_{j=1}^{M-1} \mathbf{v}_j\mathbf{v}_j^T + \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T\right)\mathbf{C}\right)\mathbf{v}_M = \lambda\mathbf{v}_M. \qquad (4.23)$$

We will refer to the directions found as the Non-Consolidating-Components and the algorithm as Non-Consolidating-Components-Analysis (NCCA).
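A compact numpy sketch of the NCCA iteration in Eqs. 4.21-4.23, operating on the (kernel) PCA coordinates of one observation space. It assumes the shared directions in the columns of W are orthonormal; the function names, the re-orthogonalization step and the use of a dense eigendecomposition are implementation assumptions rather than details from the thesis.

```python
import numpy as np

def ncca_directions(C, W, n_directions):
    """Extract non-consolidating components from a covariance matrix C,
    orthogonal to the already-found (e.g. canonical) directions in the columns of W."""
    V = []
    basis = [W[:, i] for i in range(W.shape[1])]   # directions to stay orthogonal to
    for _ in range(n_directions):
        # Deflation of Eq. 4.23: remove variance already explained by `basis`.
        P = sum(np.outer(b, b) for b in basis)
        M = C - P @ C
        eigvals, eigvecs = np.linalg.eig(M)        # M is not symmetric in general
        v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        # Re-orthogonalize against the current basis and normalize (numerical safety).
        for b in basis:
            v = v - (v @ b) * b
        v = v / np.linalg.norm(v)
        V.append(v)
        basis.append(v)
    return np.stack(V, axis=1)
```

In the two stage model described above, C would be the covariance of the kernel PCA representation of one observation space and W the CCA directions found in the previous section.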

4.4.3 Example

In the previous chapter we applied the shared and the shared back-constrained GP-LVM models to a toy data-set. Doing so we exemplified in which scenarios each model works and also in which scenarios each model fails. As was shown, neither model is capable of modeling data which contains both shared and private signals. This in itself was the motivation for creating the subspace GP-LVM model. In this chapter we have presented the NCCA model as an extension to CCA. We will use the solution of the proposed model to initialize the subspace GP-LVM model. To evaluate its performance we applied the subspace model to the same data-set to which we applied the shared and the shared back-constrained models in the previous chapter. Each observation space Y and Z has been generated from a set of underlying low-dimensional signals shown in Figure 4.2. The generating signals have one dimension shared and one dimension private to each data-set.
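For concreteness, a toy data-set of this form can be generated as in the following sketch. The exact signals, dimensionalities and noise level used in the thesis are not specified here, so the sinusoids and the numbers below are illustrative assumptions; only the structure (one shared signal, one private signal per space, random linear projections) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
t = np.linspace(0, 4 * np.pi, N)

# One shared 1-D signal and one private 1-D signal per observation space.
s_shared = np.sin(t)
s_priv_y = np.cos(3 * t)
s_priv_z = np.sin(2 * t + 0.5)

U_Y = np.column_stack([s_shared, s_priv_y])   # generating signals for Y
U_Z = np.column_stack([s_shared, s_priv_z])   # generating signals for Z

# Random linear projections into higher-dimensional observation spaces,
# plus a small amount of additive Gaussian noise.
D_Y, D_Z = 6, 8
Y = U_Y @ rng.normal(size=(2, D_Y)) + 0.05 * rng.normal(size=(N, D_Y))
Z = U_Z @ rng.normal(size=(2, D_Z)) + 0.05 * rng.normal(size=(N, D_Z))
```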


Figure 4.2: Each dimension of the signals used to generate the high-dimensional signals shown in Figure 4.3. The data consist of one shared dimension (blue) and one dimension private to each data-set (green).

Figure 4.3: Each dimension of the high-dimensional observed data to which the subspace GP-LVM model is applied.

From these signals two high-dimensional signals are generated through random linear projections, resulting in the data shown in Figure 4.3, which is referred to as the observed data Y and Z. We apply the data shown in Figure 4.3 to two variations of the subspace GP-LVM model presented in the previous chapter. For each model we set the latent structure to be composed of a single shared dimension and a private dimension corresponding to each observation space. This means that in total we are learning a three dimensional latent space. We initialize one model using the PCA solution of one of the observation spaces and one model using CCA for the shared dimension and NCCA for the private spaces. In Figure 4.4 the resulting embeddings are shown. As can be seen in Figure 4.4, the model initialized using PCA is not able to correctly unravel the generating signals. The observation space Y is generated from a two dimensional signal (Figure 4.2). We use the first principal component of this data as initialization for the shared space and the second component to initialize the private space corresponding to Y. We hereby make the assumption that the first principal component will correspond to the shared signals, something which in general cannot be assumed to be true. The private dimension of Z is initialized to small random values. Using the CCA and NCCA scheme presented in this chapter the model manages to correctly unravel the generating signals.

Figure 4.4: Each dimension of the embeddings of the data in Figure 4.3 applied to the subspace GP-LVM model. The left plot corresponds to the results using PCA as initialization while the right plot corresponds to the embedding using CCA for the shared dimension and NCCA for the private dimensions. The latter model succeeds in recovering the generating signals of Figure 4.2 while the model initialized using PCA fails.


4.5 Extensions

We have presented the NCCA algorithm as a way of finding representative directions in the data that are complementary to the directions of high correlation found by CCA. However, there is nothing in the algorithm that limits the use of NCCA to only accompany CCA. The algorithm is general and can be used in addition to any feature selection algorithm where we want to model the full variance of the data.

Fisher's Discriminant Analysis (FDA) [20] is a method for finding directions in the data such that the resulting projections maximize the separation between two or more classes. Finding complementary directions using NCCA would allow a factorized latent representation where the subspace corresponding to the FDA directions would discriminate the data into classes, while the complementary subspace associated with the NCCA directions would represent the most representative directions in the data without constraints on class discrimination. This would allow FDA to be extended to a model which represents the full variance of the observed data, not limited to the variance from which the classes can be discriminated.

4.6 Summary

In this chapter we have presented a new two stage dimensionality reduction model. The latent space used to represent the data is factorized into two parts, one constrained part and one complementary part, ensuring that the full variance of the observed data is represented in the latent space. The constrained latent subspace can be found from any feature selection algorithm that results in a set of directions in the data; in this chapter we have used CCA to find the constrained part. The remaining latent subspace complements the constrained directions to ensure that the latent variable represents the full variance of the observed data.

The initial motivation behind the NCCA model was as a spectral algorithm analogous to the subspace GP-LVM model. However, as we will show in Chapter 5, the model can be directly applied to model data without the need of a GP-LVM model.

When using the NCCA model as an initialization scheme for the subspace GP-LVM we use the NCCA step to complement the solution found by applying CCA to the data. The assumptions in Eqs. 4.2-4.4 do not completely correspond to the subspace GP-LVM model. In particular, the subspace GP-LVM model makes no assumption about the relationship between the observed data and the manifold representation. We could design a model which would better correspond to the CCA and NCCA scheme by incorporating back-constraints to constrain the mapping from observed data to the latent representation. However, each latent dimension can only be back-constrained once, which means that we cannot completely encode the assumption in Eq. 4.2. Nevertheless, as we will show in the next chapter, initializing according to the above scheme experimentally results in good embeddings.


Chapter 5

Applications

5.1 Introduction

In this chapter we will apply the models presented in this thesis to the task of human pose estimation from monocular images. In the next section we will give a brief introduction to human pose estimation and review some of the related work. We will then proceed to present the shared back-constrained and the subspace GP-LVM models applied to this task. We will apply the models to two standard data-sets typically used within the domain of single image human pose estimation. We will conclude with a brief summary of the results.

5.2 Human Pose Estimation

Single view human pose estimation is the task of estimating the pose of a human from a monocular image. To get an overview of previous work we will split the task into two different categories, generative and discriminative. Generative human pose estimation algorithms are often referred to as model based algorithms. They aim to fit a model of a human to evidence given in the image using an associated likelihood or error function. The task is solved by searching the state-space, which defines the pose, for the location which maximizes the likelihood given a specific image. This is different from the discriminative approaches, which aim to find the pose associated with a specific image from evidence extracted from the image.

5.2.1 Generative

Generative human pose estimation aims to fit the parameters of a human body model onto the image. The key differences between algorithms are the parameterization of the human body and how the likelihood model through which these parameters are found is specified. A large variety of human body models have been proposed, from simpler models composed of two dimensional patches [30, 43, 15] to more complex three dimensional models based on a wide range of primitives such as cylinders [13, 53, 12] or cones [74]. The more complex human models allow for more accurate estimation; however, they are usually associated with a higher number of parameters, which implies that they are more expensive to fit to the image evidence. Similarly to the human body model, several different ways of specifying the likelihood model through which the parameters can be fit have been suggested. In [74, 19] image edges are used, while [58] used texture information to fit an ellipsoid parameterized model. In [12] a stick-model of a human was fitted to the image through an image segmentation based score. If the estimation is done for a sequence of images, higher level image information such as optical flow can be incorporated into the likelihood model as in [58].

5.2.2 Discriminative

Images are very high-dimensional objects, typically in the range of $10^3$-$10^4$ dimensions. To make modeling computationally feasible it is common practice, as a pre-processing stage, to represent each image using a lower dimensional feature vector. Due to the high dimensionality of the images these feature descriptors are usually based on heuristic assumptions about the correspondence between images and pose.

There are two main approaches to modeling the relationship between image features and pose. In the simplest case, where there is no ambiguity between the image representation and pose, the relationship can be modeled with regression, as was demonstrated in [2, 61, 68, 83]. However, in the general case the image feature representation is not capable of fully disambiguating the pose. One approach to dealing with the multi-modalities that arise is to use a multi-modal estimate such as in [47]. Given a multi-modal estimate, additional cues such as temporal consistency can be used to disambiguate between the different modes.

5.2.3 Problem Statement

Given a set of image features $\mathbf{y}_i \in \mathbf{Y}$ with corresponding pose parameters $\mathbf{z}_i \in \mathbf{Z}$, where $i = 1, \ldots, N$, we wish to train a model through which, given an unseen image feature vector $\mathbf{y}^*$, we can infer the corresponding pose parameters $\mathbf{z}^*$. We will learn models for two different settings, single image estimation and sequential estimation. In the first case we are given a single input feature from which to determine the pose, while in the sequential case we are given a sequence $[\mathbf{y}_1^*, \ldots, \mathbf{y}_N^*]$ of image features from which we aim to determine the corresponding sequence of pose parameters $[\mathbf{z}_1^*, \ldots, \mathbf{z}_N^*]$.

5.3 Image Features

Images are very high-dimensional objects, typically residing in $10^3$-$10^4$ dimensional spaces. Due to the problems associated with working with such high-dimensional objects it is, for most applications, necessary to find a reduced dimensional representation of each image. The large variability and high dimensionality of most image spaces mean that it is often not possible to apply dimensionality reduction techniques such as the ones reviewed in Chapter 2. Further, for most applications we are only interested in a sub-set of the information contained in the images; often it is of interest to find a representation that introduces certain ambiguities. One example would be an application where we are comparing the shape of objects: we would then ideally like a representation that is ambiguous to color, texture and scaling, and where as much of the variance in the descriptor as possible is related to the shape of the objects. This has led to a large body of work on heuristic, application specific image representations. We will now proceed to briefly describe the background to the two different image features used for the experiments in this chapter.

5.3.1 Shapeme Features

Shape context [6, 7] was suggested as a point based descriptor for shape matching. The shape is assumed to be represented as a discrete set of points from the contour of an image; these could either have been extracted using a segmentation algorithm or as the response to an edge detector. Considering the set of contour points $P = \{\mathbf{p}_1, \ldots, \mathbf{p}_n\}$, the descriptor is calculated by taking each point $\mathbf{p}_i$ and describing its position relative to all the other points on the contour. This is done by computing the vector $\mathbf{v}_j^i = (\mathbf{p}_i - \mathbf{p}_j)$ for all the other $n - 1$ points on the contour. The set of vectors $V^i = \{\mathbf{v}_j^i\}, \forall j \neq i$, completely describes the configuration of all points relative to the reference point $\mathbf{p}_i$. The shape context descriptor of point $\mathbf{p}_i$ is computed as the distribution of these relative point positions by placing them in a coarse spatial log-polar histogram, where each bin represents the radius and the angle of the polar representation of each vector $\mathbf{v}_j^i$. This results in $n$ log-polar histograms describing the shape.
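A small numpy sketch of the log-polar binning just described. The bin counts, radial edges and the handling of points falling outside the outermost radius are illustrative assumptions, not the exact settings of the features used in this thesis.

```python
import numpy as np

def shape_context(points, n_radial=5, n_angular=12):
    """Log-polar shape context histograms for an (n, 2) array of contour points."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]           # v_j^i = p_i - p_j
    dists = np.linalg.norm(diff, axis=-1)
    angles = np.arctan2(diff[..., 1], diff[..., 0])           # in (-pi, pi]
    # Scale invariance: normalize distances by the median inter-point distance.
    median = np.median(dists[dists > 0])
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radial + 1)
    a_edges = np.linspace(-np.pi, np.pi, n_angular + 1)
    descriptors = np.zeros((n, n_radial, n_angular))
    for i in range(n):
        mask = np.arange(n) != i                               # exclude the point itself
        r_bin = np.digitize(dists[i, mask] / median, r_edges) - 1
        a_bin = np.clip(np.digitize(angles[i, mask], a_edges) - 1, 0, n_angular - 1)
        valid = (r_bin >= 0) & (r_bin < n_radial)              # drop points outside the bins
        np.add.at(descriptors[i], (r_bin[valid], a_bin[valid]), 1)
    return descriptors.reshape(n, -1)
```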

Each histogram is computed relative to a reference point on the contour, making the histogram invariant to translation. Further, by scaling the radius of the polar representation of each point with the median distance between the points on the contour, each histogram can be made invariant to scaling. The discretization/binning of the vectors into a coarse histogram representation makes the descriptor robust to small affine transformations of the shape.

Given two shapes we are interested in finding a measure of their similarity. Matching using the shape context representation is a two stage process. First, a similarity measure between shape context histograms is needed; this is used to find the point on each shape that best matches a point on the other shape. Secondly, a measure relating the similarity of the full shapes, i.e. all the points, is required. In the original paper [6] the $\chi^2$ distance was used to compare each histogram; however, in [7] this was changed to the simpler $L_2$-norm without significant degradation of the results. Matching two shapes implies finding the permutation $\pi$ of the points on one shape such that the sum of the histogram similarities is minimized. This is an instance of bipartite matching, which is in general of cubic complexity.

To reduce the complexity associated with matching two shapes using shape context descriptors, the shapeme feature descriptor [42] was suggested. The shapeme descriptor is calculated by computing shape context histograms for a large set of training images. By clustering the space of all shape context histograms a set of representative shape context histograms can be found, referred to as shapemes. Each image can now be represented by associating each shape context vector with its closest shapeme. The descriptor of an image is thereby reduced to a histogram over the shapemes. Matching two shapes represented as shapemes can then be performed using a simple nearest neighbor classifier in the shapeme space.
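The shapeme construction therefore amounts to vector quantization of shape context histograms followed by a histogram over cluster assignments; a minimal sketch using scikit-learn k-means is shown below. The cluster count of 100 is chosen to match the 100-dimensional descriptor used for the Poser data later in this chapter, but the clustering details are otherwise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_shapemes(train_histograms, n_shapemes=100):
    # `train_histograms`: shape context histograms of many training images, stacked row-wise.
    return KMeans(n_clusters=n_shapemes, n_init=10, random_state=0).fit(train_histograms)

def shapeme_descriptor(image_histograms, kmeans):
    # Assign every shape context histogram of one image to its closest shapeme
    # and summarize the image as a normalized histogram over shapeme indices.
    labels = kmeans.predict(image_histograms)
    counts = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return counts / counts.sum()
```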

We use the shapeme descriptor for both the Poser and the HumanEva data-sets. Details of the specific features for the Poser data-set can be found in [1] and for the HumanEva data in [55].

5.3.2 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) image descriptor was suggested in [18] for the task of detecting people in images. The descriptor is based on the distribution of local texture gradients in the image. Computing the HOG descriptor for an image is a three stage process. In the first step the response of the image to a gradient filter is computed, associating each pixel in the image with a direction and a magnitude. The second step involves binning the gradient directions over spatial regions, referred to as cells, into gradient histograms. Each bin in the histogram represents a set of gradient directions, and the strength of the vote from each pixel in the image is taken as a function of the gradient magnitude; in the original work [18], directly using the gradient magnitude was found to give the best performance. A good descriptor should be robust towards lighting changes in the image. To achieve this, in the third step of the feature extraction the gradient magnitude of each cell is normalized with respect to a local region in the image referred to as a block. The final feature vector is calculated as the response to a sliding window over the image where the normalized cell responses are binned either using a rectangular window (R-HOG) or using a circular window (C-HOG).
either using rectangular window (R-HOG) or using a circular window (C-HOG).<br />

The HOG descriptor has much in common with the SIFT descriptor [38], which is also based on histograms of local gradient directions. However, whereas the SIFT descriptor is computed at key-points extracted from the image, and at multiple scales, the HOG descriptor is a dense descriptor evaluated for each pixel in the image.

We use the HOG feature evaluated on the HumanEva data-set. Details of the specific feature can be found in [46].
specific feature can be found in [46].<br />

5.4 Data-sets

We apply the two suggested models to two different data-sets. The first data-set, referred to as Poser, was presented in [2]. Poser consists of images generated using the computer graphics package Poser from real motion-capture data from the CMU database¹. Each image is represented by its corresponding silhouette described using a 100 dimensional shapeme descriptor [42] as presented in [2]. Each pose is represented using 54 joint angles of a full body. In total the data-set consists of 1927 training poses from 8 sequences of varied motions. The data-set is provided with one test sequence of 418 frames describing a circular walk.

¹The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.

The second data-set we are going to apply the models to is a real-world data-set known as HumanEva, first presented in [54]. The full HumanEva data-set consists of six different motions performed by three different persons, divided into training, test and validation sets. In addition, frames of a fourth person are provided without labeling of the motions. Each frame is captured using several cameras; in our experiments we use only images from a single camera. Due to restrictions in the amount of data the proposed models can process we have chosen a limited training data-set. Further limitations are necessary due to errors in the provided motion capture data. We use two different subsets. The first set contains the walking sequence for subject one, where we use the first cycle in the walk for training and the second cycle for testing. Each image in this data-set is described using a 300 dimensional shapeme feature as used in [55]. In [46] each image of the HumanEva data-set is aligned by projecting the associated pose onto the view plane and using the Procrustes score to align each image into a common frame. Through this process each image can be flipped around the horizontal axis to effectively double the size of the data-set. For the second set of images we append the first with the corresponding flipped image from the image-set created in [46] and use every second image in the data. Each image in the second data-set is represented using the Histogram of Oriented Gradients (HOG) feature [18] as used in [46]. As a pre-processing stage, to reduce computational time, we represent each descriptor by its projection onto the first 100 principal directions extracted from the data. The poses in the HumanEva data-set are represented as the locations of a set of 19 joints in 3D-space, resulting in a 57 dimensional pose representation. We remove the global translation by centering the data.

We apply the Poser data-set to the shared back-constrained GP-LVM model and the second HumanEva data-set to the subspace GP-LVM model. The first HumanEva set is used to compare the different models. We will now proceed to present how each model is defined and how inference is done for each problem setting.

5.5 Shared back-constrained GP-LVM

In this section we will present the application of the shared back-constrained GP-LVM model to the task of single image human pose estimation. Given a set of image features $[\mathbf{y}_1, \ldots, \mathbf{y}_N]$ with corresponding pose parameters $[\mathbf{z}_1, \ldots, \mathbf{z}_N]$ we learn a shared back-constrained GP-LVM model. For the generative mappings from the latent space to the observed data we use a kernel consisting of the sum of an RBF, a bias and a white noise kernel,

$$k_{gen}(\mathbf{x}_i, \mathbf{x}_j) = \theta_1 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\theta_2^2}\right) + \theta_3 + \beta\delta_{ij}. \qquad (5.1)$$

Using this kernel means that each generative mapping is specified by the four hyper-parameters $\Phi = [\theta_1, \theta_2, \theta_3, \beta]$.
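A minimal sketch of this compound kernel in numpy. Only the three-part structure (RBF + bias + white noise) is taken from the text; the exact parameterization of the RBF width and of the noise term (β versus β⁻¹) depends on conventions not fixed by Eq. 5.1, so the form below is an assumption.

```python
import numpy as np

def compound_kernel(X1, X2, theta1, theta2, theta3, beta):
    """RBF + bias + white noise kernel in the spirit of Eq. 5.1."""
    sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                - 2 * X1 @ X2.T)
    K = theta1 * np.exp(-sq_dists / (2 * theta2**2)) + theta3
    if X1 is X2:
        # The white-noise term only contributes on the diagonal of the training matrix.
        K = K + beta * np.eye(len(X1))
    return K
```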

For the back-constraint from the pose space to the latent space we use a simple regression over a kernel induced feature space. In all our experiments we use an RBF kernel,

$$k_{back}(\mathbf{z}_i, \mathbf{z}_j) = \exp\left(-\frac{\|\mathbf{z}_i - \mathbf{z}_j\|_2^2}{2\theta_{back}}\right), \qquad (5.2)$$

where the kernel width $\theta_{back}$ is set by examining the kernel response to the pose parameters.

Both the Poser data and the HumanEva data are sequential. We will therefore also learn a dynamical model to predict sequentially over the latent space. The dynamic model is a GP predicting over time. We will use the same type of kernel as for the generative mappings, i.e. a combination of an RBF, a bias and a white noise kernel, resulting in a mapping specified by four hyper-parameters.

In total this means that the shared back-constrained model we apply in this chapter is specified by 8 hyper-parameters for the generative mappings, one parameter for the back-constraint and 4 hyper-parameters for the dynamical model. Further, the dimension of the latent space needs to be fixed; in all our experiments we use a 3 dimensional latent space. As described above, the parameter of the back-constraint is set by inspection: for the Poser data we use $\theta_{back} = \frac{1}{10^{-3}}$ and for the HumanEva data we use $\theta_{back} = \frac{1}{4\cdot 10^{-6}}$. In Figure 5.1 the kernel response matrices associated with the back-constraints for the two data-sets are shown. We initialize the latent locations X of the model by applying PCA to the pose parameters. The hyper-parameters of the GP mappings are learnt together with the latent locations of the data by optimizing the marginal likelihood of the model, Eq. 3.4, using a scaled conjugate gradient optimizer. Ideally we would like to run the optimizer till convergence; however, in practice, we limit the number of iterations to 10000, as running for further iterations does not have a significant impact on the results.


Figure 5.1: Kernel response matrices on the training data over which the back-mapping is applied in the back-constrained model. The left image shows the response to the Poser data using an RBF kernel with width $\theta_{back} = \frac{1}{10^{-3}}$. The right image corresponds to the HumanEva data, also using an RBF kernel, with width $\theta_{back} = \frac{1}{4\cdot 10^{-6}}$.

5.5.1 Single Image Inference

For the task of single image pose inference we are given a single feature vector $\mathbf{y}^*$ for which we want to infer its corresponding pose parameters $\mathbf{z}^*$. The latent space X is back-constrained from the pose space Z, which means that we encourage a one-to-one mapping between the latent space and the pose space. This means that if we can determine the latent location $\mathbf{x}^*$ associated with the image feature $\mathbf{y}^*$ we can recover the associated pose parameters $\mathbf{z}^*$ through the mean prediction of the GP generating the pose space. We can find the latent coordinate associated with a specific image feature by maximizing the likelihood over the latent space,

$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} p(\mathbf{y}^*|\mathbf{x}), \qquad (5.3)$$

through gradient based optimization. However, we expect the image features to be ambiguous with respect to pose, which implies a multi-modal mapping from image feature to pose. Because we encourage the mapping between the latent space and the pose space to be uni-modal, the multi-modality in the data needs to be contained in the mapping between the latent space and the image features. This means that the optimization in Eq. 5.3 is multi-modal and needs to be initialized in a convex region around $\mathbf{x}^*$. In practice we first perform a nearest-neighbor search in the image feature space and initialize the latent coordinates with the latent coordinates associated with the N nearest neighbors in the training data. For the Poser data we use 6 nearest neighbors; increasing the number does not result in a significant increase in performance, while reducing the number leads to missing modes. We run each point optimization of Eq. 5.3 till convergence, which usually corresponds to 10-100 iterations.

In Figure 5.3 results for single image inference on the Poser data-set are shown.
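The inference procedure just described can be sketched schematically as follows: initialize at the latent points of the nearest neighbors in feature space, then refine each by gradient-based maximization of $p(\mathbf{y}^*|\mathbf{x})$. The kernel, its fixed hyper-parameters and the function names are assumptions; a full GP-LVM implementation would reuse the learned kernel and its hyper-parameters.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def neg_log_lik(x, y_star, X, Y, K_inv, noise=1e-2):
    # Negative log p(y* | x) under the GP mapping latent points to image features.
    x = np.atleast_2d(x)
    k_x = rbf(X, x)                                          # (N, 1) cross-covariance
    mu = (k_x.T @ K_inv @ Y).ravel()                         # predictive mean per feature dim
    var = (rbf(x, x) - k_x.T @ K_inv @ k_x)[0, 0] + noise    # shared predictive variance
    D = Y.shape[1]
    return 0.5 * D * np.log(2 * np.pi * var) + 0.5 * np.sum((y_star - mu) ** 2) / var

def infer_latent_modes(y_star, X, Y, n_neighbors=6, noise=1e-2):
    # Nearest-neighbor initialization followed by gradient-based optimization (Eq. 5.3).
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    nn = np.argsort(np.linalg.norm(Y - y_star, axis=1))[:n_neighbors]
    return [minimize(neg_log_lik, X[i], args=(y_star, X, Y, K_inv, noise)).x for i in nn]
```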

5.5.2 Sequential Inference

We are often interested in inferring the poses $\mathbf{Z}^*$ associated with a temporal sequence of image observations $\mathbf{Y}^*$. In such a setting we can exploit the temporal consistency of the sequence to disambiguate our multi-modal estimate and recover a single pose estimate per image.

To find the most likely sequence of latent locations associated with the image observations we interpret the sequence through a hidden Markov model (HMM), where the latent states of the model correspond to the latent locations of the training data X. The likelihood of each observation is given by the likelihood of the GP associated with each latent point,

$$L_i^{obs} = p(\mathbf{y}^*|\mathbf{x}_i). \qquad (5.4)$$

The transition probabilities are given by the dynamical GP predicting latent locations over time,

$$L_{ij}^{trans} = p(\mathbf{x}_i|\mathbf{x}_j). \qquad (5.5)$$

The most probable path $\mathbf{X}_{init}$ through this lattice can be found using the Viterbi algorithm [73].
algorithm [73].<br />

Having found the most likely sequence through the training data Xinit we use<br />

this as an initialization to optimize the sequence,<br />

X ∗ = argmax X p(Y ∗ |X). (5.6)<br />

In Figure 5.4 results for the sequential estimate of the shared back-constrained<br />

GP-LVM model is shown.<br />

5.6 Subspace GP-LVM

Even though it is not explicitly stated, the shared GP-LVM applied in [44] assumes that the image features and pose parameters are governed by the same degrees of freedom. This means that the full variance of each observation space needs to be fully correlated. However, the image features are based on heuristic assumptions about this correlation and are very likely to contain a significant amount of non-pose-correlated variance. This variance, irrelevant for the task of pose estimation, needs to be "explained away" from the latent representation by the noise term in the GP generating the image features. The back-constrained GP-LVM model aims to encourage the latent space to be a full re-representation of pose, encouraging the model to explain the non-correlated image feature variance as noise. However, for many types of features the non-pose-correlated variance represents a significant portion of the variance. Further, the structure of this variance is often not well represented by our Gaussian noise assumption, which means it will pollute the structure of the latent representation.

An alternative approach is to apply the subspace GP-LVM, which models the non-correlated variance separately using additional latent spaces. The additional image feature latent space will explain the non-pose-correlated variance in the image feature space. Compared to the shared GP-LVM this means that this variance is explained using the full flexibility of a GP instead of a simple Gaussian noise model. Further, the private latent subspace associated with the pose represents variance in the pose space that is not correlated with the image features. This implies that locations over this subspace represent poses orthogonal or ambiguous to the image features.

The shared back-constrained model could be initialized efficiently by applying PCA to the pose space. However, for the subspace model this scheme cannot be used. Instead we initialize the shared latent space using CCA and the private spaces using NCCA. Before applying CCA to find the shared locations we apply PCA to both observation spaces to remove directions in the data representing non-significant variance. For the first HumanEva data-set we apply linear PCA, while for the second HumanEva data-set we apply kernel PCA on an MVU [78] kernel computed using 7 nearest neighbors. In practice we keep 70% of the variance of the image feature space and 90% of the variance in the pose space. Having found a reduced representation of the data we apply CCA to find an initialization of the shared latent locations; we keep the CCA directions that have a normalized correlation coefficient of more than 30%. Finally, to complete the latent representation with the private spaces, we apply NCCA to find directions orthogonal to the CCA solution. We find directions until 95% of the variance in the PCA reduced representation of each observation space is explained. For the first HumanEva data-set this results in a two dimensional shared space and one dimensional private spaces associated with each observation. The second HumanEva data-set results in a one dimensional shared space, a one dimensional private space for the image features and a two dimensional private pose space.
for the image features and a two dimensional private pose space.<br />

Just as with the back-constrained model we use a combination of an RBF, a<br />

bias and a white-noise kernel for the generative mappings Eq. 5.1, meaning that<br />

each generative mapping is specified using 4 hyper-parameters. We will use the<br />

first data-set to find a single pose estimate for each time instance in a sequence<br />

of images. To disambiguate our multi-modal estimate between each time instance<br />

we will use a dynamic model capable to predict over the latent space. We use<br />

a GP regressor as a dynamic model, using the same type of kernel combination<br />

as the generative mappings Eq. 5.1. In total this means that for the first data-set<br />

we learn a model having 8 parameters for the generative mappings, 4 parameters<br />

for the dynamic model in addition to the location of the latent points. For the<br />

second data-set we keep the latent locations fixed to exemplify the performance<br />

of the NCCA algorithm which means that the only parameters for the model is the<br />

8 hyper-parameters of the generative mappings. We learn the parameters of the<br />

models by optimizing the marginal likelihood of the model Eq.3.15 using a scale


5.6. SUBSPACE GP-LVM 107<br />

conjugate gradient algorithm. Ideally we would like to run the optimization until<br />

the model converges, however, in practice, we limit the number of iterations to<br />

10000. We have found that running the optimization further does not results in a<br />

increased performance.<br />

5.6.1 Single Image Inference

By initializing the subspace model's shared space using CCA and the private spaces using NCCA we have assumed that the shared latent space can be represented as a smooth mapping from both the image feature and the pose space. Having trained a model we learn a mapping $g_y$ from the image features Y to the shared latent locations $X^S$ of the trained model,

$$\mathbf{x}_i^S = g_y(\mathbf{y}_i). \qquad (5.7)$$

In practice we will use a GP regressor to model $g_y$; however, any regression model would be applicable. The GP regressor uses a combination of an RBF, a bias and a white noise kernel, Eq. 5.1, specified by four hyper-parameters that are found through gradient based optimization of the GP marginal likelihood of the data, Eq. 2.48. Having learnt $g_y$ means that given a new unseen image feature $\mathbf{y}^*$, the corresponding location on the shared latent space $\mathbf{x}^*$ can be determined. The private latent space for the pose, $X^Z$, is orthogonal to the full latent representation of the image feature. This implies that its location, $\mathbf{x}_*^Z$, has no correlation with the location in the image feature space $\mathbf{y}_*$. However, it is assumed that the problem is "well" represented by the training data, implying that examples of each ambiguity we are likely to see are represented in the training data. This implies that by finding locations $\mathbf{x}_{*i}^Z$ over the private pose space which maximize the likelihood of the pose $\mathbf{z}_*$ generated from the corresponding full latent pose location $[\mathbf{x}_*^S; \mathbf{x}_{*i}^Z]$, each mode will correspond to one of the pose ambiguities of the feature. Maximizing the likelihood of the pose corresponds to minimizing the predictive variance of the GP generating the pose space,

$$\hat{\mathbf{x}}_*^Z = \operatorname*{argmin}_{\mathbf{x}_*^Z} \left[ k(\mathbf{x}_*^{S,Z}, \mathbf{x}_*^{S,Z}) - k(\mathbf{x}_*^{S,Z}, \mathbf{X}^{S,Z})^T (\mathbf{K} + \beta^{-1}\mathbf{I})^{-1} k(\mathbf{x}_*^{S,Z}, \mathbf{X}^{S,Z}) \right]. \qquad (5.8)$$

5.6.2 Sequential Inference
∗ ,X S,Z ) . (5.8)<br />

Finding the modes associated with different locations over the private pose space associates multiple poses to an ambiguous location in the image feature space. As the private pose space is orthogonal to the image feature latent representation, it is not possible to disambiguate between the different modes using information from the feature. However, pose data is sequential; by placing a dynamic GP over the latent pose space, a representation respecting the data's dynamics can be learned. When inferring the pose from a sequence of image features the dynamic model can be used to disambiguate locations over the image feature orthogonal private pose space. The shared latent locations are determined by the mapping from the image features as before, but the locations over the private subspace can, with the incorporation of the dynamic model, be found such that the full sequence renders a high likelihood,

$$\hat{\mathbf{X}}_*^Z = \operatorname*{argmax}_{\mathbf{X}_*^Z} p(\mathbf{X}_*^Z|\mathbf{Z}, \mathbf{X}^S, \Phi_Z, \Phi_{dyn}). \qquad (5.9)$$
ˆX Z ∗ = argmax X Z ∗ p(X Z ∗ |Z,X S , ΦZ, Φdyn). (5.9)


Figure 5.2: Angle error: The image on the left is the true pose, the middle image has an angle error of 1.7°, and the image on the right has an angle error of 4.1°. An angle error higher up in the joint hierarchy will affect the positions of all joints further down. As the errors for the middle image are higher up in the hierarchy they affect each limb connected further down the chain from this joint, thereby resulting in significantly different limb positions.

5.7 Quantitative Results

Both the Poser and the HumanEva data-sets come with a provided error measure to quantify the quality of the results. In the case of Poser the mean RMS error is used, defined as follows,

$$E_{poser}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{1}{N}\sum_{i=1}^{N} \|(\hat{\mathbf{z}}_i - \mathbf{z}_i) \bmod 360^{\circ}\|_2, \qquad (5.10)$$

where $\mathbf{z}$ is the true pose and $\hat{\mathbf{z}}$ is the estimated pose. To make comparison of results possible we will follow [2] and use the above error measure. However, $E_{poser}$ can be misleading as a measure of quality, as it is applied in the joint angle space. A mean square error treats all dimensions of the joint angle space with equal importance and does not reflect the hierarchical structure of the human physiology. This means that joints higher up in the hierarchy affect all joints further down in the hierarchy (Figure 5.2).

Figure 5.3: Single Image Pose Estimation: Input silhouette followed by output poses associated with modes on the latent space, ordered according to decreasing likelihood. As can be seen, the modes correspond to varying limb positions we expect to be ambiguous to the input silhouette.

The HumanEva data-set avoids the problems associated with the Poser data by representing poses using joint locations rather than joint angles,

$$E_{HumanEva}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2. \qquad (5.11)$$

The HumanEva error metric $E_{HumanEva}$ has a better correspondence to the visual quality of the pose estimate, being calculated in a joint position space rather than the joint angle space used for $E_{Poser}$.
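The two error measures can be computed as in the sketch below. How the modulo operation in Eq. 5.10 wraps the angle differences, and whether Eq. 5.11 is averaged per joint or taken over the stacked pose vector, are interpretations on my part; the code follows one common reading of each.

```python
import numpy as np

def poser_angle_error(z_true, z_est):
    # Eq. 5.10: per-frame norm of the wrapped angle differences, averaged over frames.
    # Angles in degrees; the symmetric wrap to (-180, 180] is an interpretation of "mod 360".
    diff = (z_est - z_true + 180.0) % 360.0 - 180.0
    return np.mean(np.linalg.norm(diff, axis=1))

def humaneva_joint_error(joints_true, joints_est):
    # Eq. 5.11 read as the mean per-joint 3-D distance (the usual HumanEva convention).
    # Inputs of shape (N, n_joints, 3), in millimetres.
    return np.mean(np.linalg.norm(joints_est - joints_true, axis=-1))
```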

5.8 Experiments

We applied the shared back-constrained GP-LVM model to the Poser data-set. Figure 5.3 shows the different pose estimates associated with a single image feature vector. In Figure 5.4 the estimate for the shared back-constrained GP-LVM model, using dynamics to disambiguate the different modes, is shown. As can be seen, we correctly estimate the pose for most frames. In Table 5.1 the angle error of our suggested model is compared to a set of baseline regression algorithms. We can see that both shared back-constrained GP-LVM methods perform in line with or better than the regression algorithms. It is clear that learning a latent representation respecting the dynamics of the data is beneficial, as the best results are achieved by the only model incorporating dynamics for inference.

Figure 5.4: Every 20th frame from a circular walk sequence. Top Row: Input Silhouette, Middle Row: Model Pose Estimate, Bottom Row: Ground Truth.

                           Angle Error (°)
    Mean Training Pose          8.3
    Linear Regression           7.7
    RVM                         5.9
    GP Regression               5.8
    SBC-GP-LVM Single           6.5
    SBC-GP-LVM Sequence         5.3

Table 5.1: Mean RMS angle error for the Poser data-set. SBC-GP-LVM refers to the shared back-constrained GP-LVM model. Note that only the SBC-GP-LVM Sequence method uses temporal information.

For a silhouette based image representation as the shapeme descriptor we ex-<br />

pect significant ambiguities with respect to the heading direction, i.e. in or out of<br />

the view-plane. However, as can be seen in [2] (Figure 7), the heading angle is<br />

nearly perfectly predicted using a regression algorithm. This means that the Poser<br />

data-set does not contain a significant amount of feature to pose ambiguities. Due


112 CHAPTER 5. APPLICATIONS<br />

                         Error (mm)
Mean Training Pose            163
Linear Regression             384
GP Regression                 163
SBC-GP-LVM Single             201
SBC-GP-LVM Sequence            73
S-GP-LVM Sequence              70

Table 5.2: Mean error for the shape context HumanEva data-set. SBC-GP-LVM refers to the shared back-constrained GP-LVM model and S-GP-LVM refers to the subspace GP-LVM model.

Since the central motivation of the subspace GP-LVM model is specifically to model such ambiguities, we proceed to use the HumanEva data-set.

We apply the suggested models to the first HumanEva data-set, using shapeme features to represent each image. Table 5.2 shows the results for our suggested models and a set of baselines. From the poor performance of the regression algorithms applied to the HumanEva data we can see that this data-set contains a significant amount of ambiguity between the image representation and the pose. As can be seen, the single estimate using the shared back-constrained GP-LVM results in worse performance than both the model using dynamical information and the GP regression baseline. This is to be expected: in cases where the features are ambiguous the model will predict one of the possible poses, while the regression algorithm will predict the mean of the ambiguous poses, which for many types of ambiguity results in a smaller error than predicting the “wrong” mode. One advantage of the subspace GP-LVM model is that we can easily visualize the ambiguities by sampling locations over the pose private subspace.
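Concretely, such a visualization can be produced by grid-sampling the private subspace and keeping the local maxima of the model's likelihood; the sketch below is only an illustration of that idea, and the callable `log_lik_fn` (mapping private latent locations, with the shared coordinates held fixed, to log-likelihood values) is an assumed interface rather than the thesis implementation:

```python
import numpy as np

def find_modes(log_lik_fn, extent=3.0, resolution=60):
    """Grid-sample a 2D private latent subspace and return its local maxima.
    log_lik_fn: callable mapping an (M, 2) array of private latent locations to
                (M,) log-likelihood values (assumed to be provided by the model)."""
    ticks = np.linspace(-extent, extent, resolution)
    gx, gy = np.meshgrid(ticks, ticks)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    ll = log_lik_fn(grid).reshape(resolution, resolution)

    modes = []
    for i in range(1, resolution - 1):
        for j in range(1, resolution - 1):
            patch = ll[i - 1:i + 2, j - 1:j + 2]
            # keep points that dominate their 3x3 neighbourhood
            if ll[i, j] == patch.max() and ll[i, j] > patch.mean():
                modes.append((ticks[j], ticks[i], ll[i, j]))   # (x, y, log-likelihood)
    return sorted(modes, key=lambda m: -m[2])                  # ordered by decreasing likelihood
```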

In the next section we compare the different proposed models.



5.9 Comparison

Figure 5.5 shows the kernel response matrices that generate the observed features for the learned kernel hyper-parameters. As can be seen, the width of the kernel for the back-constrained model is much smaller than for the subspace model. This implies that the back-constrained model is less capable of generalizing over the image feature representation than the subspace model. This is an effect of trying to learn a shared latent space for two observed spaces which have different latent structures. The back-constraint enforces a latent structure which preserves the local structure of the pose space; however, this structure is different from the structure of the image feature space, making the GP that generates the image features unable to generalize over the feature representation. In the subspace model the private subspace can be used to model structure in the feature space that is not contained in the pose parameters, making the model capable of generalizing better over the image features. We have shown the advantage of the latent factorization in the subspace model applied to the HumanEva data-set, as it allows the model to better generalize its description of the data.
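To make the role of the kernel width concrete, the sketch below (with arbitrary toy latent points, not the learned ones) evaluates an RBF kernel on the same inputs at two length-scales; with a very short length-scale the off-diagonal kernel response vanishes, so each feature vector is explained almost entirely by its own latent point and the mapping cannot generalize:

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    """Squared exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

X = np.random.default_rng(1).normal(size=(50, 2))    # toy 2D latent locations
K_narrow = rbf_kernel(X, lengthscale=0.05)            # back-constrained-like: tiny width
K_wide = rbf_kernel(X, lengthscale=1.0)               # subspace-like: larger width

# Fraction of kernel mass off the diagonal: close to zero for the narrow kernel,
# meaning each output is modelled almost independently of the other latent points.
off_diag = lambda K: (K.sum() - np.trace(K)) / K.sum()
print(off_diag(K_narrow), off_diag(K_wide))
```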

For the next experiment we use the second HumanEva data-set, which in addition to the first data-set also includes each image “flipped”; this significantly increases the amount of ambiguity in the data. Each image in this data-set is represented using the HOG feature descriptor. Applying the subspace GP-LVM model initialized using the NCCA model leads to a three dimensional latent pose representation divided into a single shared dimension and a two dimensional private pose space.

Figure 5.6 shows the results of sampling the likelihood over the pose specific private latent subspace X^Z for a sequence of input images, together with the pose coming from the mode closest to the ground truth pose.



Figure 5.5: The kernel response matrices used to generate the observed image features from the learned latent representation. The left figure shows the matrix for the shared back-constrained GP-LVM model while the right figure shows the matrix for the subspace GP-LVM model. Comparing the images, it can be seen that the back-constrained model uses a much smaller kernel width than the subspace model, implying that it is not capable of generalizing as well as the subspace model.

We can see how the modes evolve over the sequence: for images representing motion perpendicular to the view-plane (1, 2 and 6) there are two elongated modes, while for motion in the view-plane there is a discrete set of modes. Having fixed the shared latent location reduces the estimation task to choosing the appropriate mode, which corresponds to a much smaller subset of the possible poses than the full data-set.

We compare the subspace GP-LVM model to the standard shared GP-LVM [52, 44]. For comparison and visualization purposes we train a shared GP-LVM model using a two dimensional latent space. Neither model makes use of dynamical information. Figure 5.7 shows two examples of how the models represent the training data. For the first image, where the human is positioned perpendicular to the view-plane, there are two elongated modes over the pose specific private latent space.



Figure 5.6: Pose inference on a sequence of images from the second HumanEva data-set. Top row: original test set image. Second row: visualisation of the modes in the non-shared portion of the pose specific latent space. Note how the modes evolve as the subject moves. When the subject is heading in a direction perpendicular to the view-plane it is not possible to disambiguate the heading direction (images 1, 2 and 6); this is indicated by two elongated modes. In images 3–5 it is not possible to disambiguate the configuration of the arms and legs; this gives rise to a set of discrete modes over the latent space, each associated with a different configuration. Bottom row: the pose coming from the mode closest to the ground truth. The different types of mode are explored further in Figure 5.7.
Figure 5.7.



Sampling poses along the modes shows that each mode corresponds to the heading direction being into or out of the view-plane, while points along a mode correspond to different configurations of the legs. For the second image the motion is parallel to the view-plane, resulting in a discrete set of modes, each corresponding to a different limb configuration that we would expect to be ambiguous. As can be seen, both models are capable of modeling the correct modes in the data. However, comparing the two energies we can see that while the modes are clearly separated for the subspace model, the energy for the shared model is scattered with local minima. As we have shown in this chapter, the image features commonly used for pose estimation are often ambiguous with respect to pose. This means that, even for a model producing a multi-modal estimate, being able to easily encode additional information to disambiguate between the different modes is essential. Comparing the standard shared GP-LVM with the subspace model, it should be significantly easier to use additional information to disambiguate between the different modes in the subspace model.

5.10 Summary

In this chapter we have shown the application of the shared back-constrained and the subspace GP-LVM models to the task of image based human pose estimation. Image based human pose estimation is an interesting application because of its potential usefulness for real-world problems in areas such as computer graphics and human-robot interaction. Further, due to the high dimensionality of the input domain, each image is represented using some type of heuristic image feature, which introduces ambiguities between each image representation and its corresponding pose.



Subspace GP-LVM
Shared GP-LVM

Figure 5.7: The top row shows two images from the training data. The 2nd and 3rd rows show results from inferring the pose using the subspace model: the first column shows the likelihood sampled over the pose specific latent space constrained by the image features, and the remaining columns show the modes associated with the locations of the white dots over the pose specific latent space. Subspace GP-LVM: in the 2nd row the position of the leg and the heading angle cannot be determined in a robust way from the image features. This is reflected by two elongated modes over the latent space representing the two possible headings. The poses along each mode represent different leg configurations. The top row of the 2nd column shows the poses generated by sampling along the right mode and the bottom row along the left mode. In the 3rd row the position of the leg and the heading angle is still ambiguous given the feature; however, here the ambiguity is between a discrete set of poses, indicated by four clear modes in the likelihood over the pose specific latent space. Shared GP-LVM: the 4th and 5th rows show the results of doing inference using the shared GP-LVM model. Even though the most likely modes found are in good correspondence with the ambiguities in the images, the latent space is cluttered by local minima that the optimization can get stuck in.



These ambiguities imply that the task of estimating the pose associated with a specific image feature is multi-modal, which makes the problem interesting from a machine learning perspective. By applying the proposed models we have shown two different approaches to handling the multi-modalities that arise, each with associated advantages and disadvantages.

The main goal behind these models was to be applicable to data where a significant amount of ambiguity exists between the different observation spaces. With the image features available, human pose estimation is known to be such a task. However, we would like to point out that our models are general and could be applied in other application areas known to exhibit similar relationships between the views. Further, the inference schemes which we apply to obtain quantitative results are rudimentary at best; we do not propose a method for pose estimation, we simply use the task as an example of modeling ambiguous data. Our main result concerns how the presented models handle multi-modalities, not how to disambiguate between such modes.

In the following chapter we conclude the work presented in this thesis and briefly discuss potential directions for future work.


Chapter 6

Conclusions

6.1 Discussion

In this thesis we have presented two different probabilistic models based on Gaussian Processes for dimensionality reduction in the scenario where we have multiple different observations of a single underlying phenomenon. The presented models are applicable to a multitude of modeling scenarios: where each observation space is fully correlated, where we are only interested in modeling the correlated data, or where we want to learn a factorized structure and model both the shared correlated information and the non-correlated information private to each observation space.

6.2 Review of the Thesis

The first chapter provided a brief introduction to the work undertaken in this thesis. In chapter 2 the machine learning task of dimensionality reduction was motivated and reviewed.


Further, chapter 2 provided an introduction to Gaussian Processes, upon which the work in this thesis is built. Chapter 3 presents two different Gaussian Process Latent Variable models for shared dimensionality reduction, applicable to two different but very common modeling scenarios. In chapter 4 a novel spectral dimensionality reduction algorithm is presented. This model is an essential component of the factorized subspace GP-LVM model presented in chapter 3. Chapters 3 and 4 present the new work done in this thesis. In chapter 5 we apply the presented models to human motion capture data with associated image observations. Further, we apply the learned models to the computer vision task of human pose estimation.

6.3 Future Work

There are multiple possible directions for future work on the models presented in this thesis. Even though we have briefly shown results for the models applied to the task of human pose estimation, we believe that there are several other applications where the use of the shared GP-LVM models can be advantageous. Such application areas include multi-modal feature fusion and multi-agent modeling. Even though we have described the model in terms of two observation spaces, there is nothing in the framework limiting the number of observation spaces. The suggested application areas are often characterized by more than two observations, which makes them interesting modeling scenarios.

This thesis has focused on creating models applicable to data with a common shared structure. However, we have not focused on performing inference within the suggested models.



The inference procedures suggested in the applications chapter are rudimentary at best. We believe that the models model the data in an efficient way and that, with better inference procedures, the results presented on human pose estimation could be significantly improved.

The GP-LVM model has, in the general case, a number of free parameters. Even though these are few compared to many other models for dimensionality reduction, determining the latent dimensionality in particular is not trivial, and its choice has a significant effect on how well the observed data is modeled. Recent work [21] applies a rank regulariser to learn the dimensionality of the latent space, effectively removing this as a free parameter from the GP-LVM. We believe this could be of significant benefit to the shared GP-LVM models presented in this thesis.

The objective of the standard GP-LVM model is to find a latent structure from which a GP regression is able to minimize the reconstruction error of the observed data. For a single observed space, minimizing the reconstruction error is a sensible objective. However, for the shared models, especially in the case of the factorized subspace, this is less obvious. In effect, the success of the presented model relies on the initialization of the latent space lying in a convex region of the factorized solution we are looking for, since nothing beyond the initialization encourages the shared/private factorization of the latent space. It would therefore be interesting to explore ways of encoding this factorization as part of the GP-LVM objective. Further, we use CCA to initialize the shared latent locations of the subspace model. There are several problems associated with the solution of CCA, some of which we address in this thesis. It would be interesting to explore different criteria for initializing the shared locations.



A recent model [55] learns a shared latent variable model in which a latent space maximizing the mutual information between the observations is learnt. It would further be interesting to see whether the model presented in [55] could be extended to incorporate the shared/private factorization suggested in this thesis.
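As a concrete, illustrative starting point for such an initialization, the sketch below uses an off-the-shelf CCA implementation and averages the paired projections to obtain shared latent coordinates; this simple choice is an assumption made here for illustration, not the NCCA procedure used in the thesis:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def init_shared_latents(Y_image, Y_pose, shared_dims=1):
    """Illustrative CCA-based initialization of shared latent coordinates.
    Y_image: (N, D_image) image features; Y_pose: (N, D_pose) pose parameters."""
    cca = CCA(n_components=shared_dims)
    z_image, z_pose = cca.fit_transform(Y_image, Y_pose)   # paired canonical projections
    # One common simple choice: average the two projections into a single shared space.
    return 0.5 * (z_image + z_pose)

# toy usage with data sharing a one-dimensional latent signal
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
Y_img = np.hstack([shared, rng.normal(size=(200, 9))])
Y_pos = np.hstack([shared, rng.normal(size=(200, 5))])
X_shared = init_shared_latents(Y_img, Y_pos, shared_dims=1)
```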


Bibliography

[1] A. Agarwal. Machine learning for image based motion capture. PhD thesis, Institut national polytechnique de Grenoble, April 2006.

[2] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, 2004.

[3] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25(6):821–837, 1964.

[4] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.

[5] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585–591, 2002.

[6] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In NIPS, 2000.




[7] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.

[8] M. Bethge, T. Wiecki, and F. Wichmann. The independent components of natural images are perceptually dependent. In B. E. Rogowitz, T. N. Pappas, and S. J. Daly, editors, Human Vision and Electronic Imaging XII, Proceedings of the SPIE, volume 6492, page 64920A, 2007.

[9] C. Bishop, M. Svensén, and C. Williams. GTM: A principled alternative to the self-organizing map. In Artificial Neural Networks: ICANN 96: 1996 International Conference, Bochum, Germany, July 16–19, 1996: Proceedings, 1996.

[10] P. Bose, W. Lenhart, and G. Liotta. Characterizing proximity trees. Algorithmica, 16(1):83–110, 1996.

[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[12] M. Bray, P. Kohli, and P. Torr. PoseCut: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. Lecture Notes in Computer Science, 3952:642, 2006.

[13] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, pages 8–15, 1998.

[14] J. Quiñonero-Candela and C. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935–1959, 2005.



[15] T. Cham and J. Rehg. A multiple hypothesis approach to figure tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 2, 1999.

[16] H. Choi and S. Choi. Kernel Isomap. Electronics Letters, 40(25):1612–1613, 2004.

[17] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2001.

[18] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 1, 2005.

[19] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In IEEE Conference on Computer Vision and Pattern Recognition, 2000. Proceedings, volume 2, 2000.

[20] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[21] A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear dimensionality reduction. In CVPR '09: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[22] I. Good. What are degrees of freedom? The American Statistician, 27(5):227–228, 1973.

[23] J. Gower and G. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.



[24] K. Grochow, S. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics. ACM Transactions on Graphics (TOG), 23(3):522–531, 2004.

[25] J. Hamm, I. Ahn, and D. Lee. Learning a manifold-constrained map between image sets: applications to matching and pose estimation. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 817–824, Washington, DC, USA, 2006. IEEE Computer Society.

[26] J. Hamm, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 120–127, 2005.

[27] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2008.

[28] S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley. Real-time body tracking using a Gaussian process latent variable model. In Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV'07), 2007.

[29] J. Jaromczyk and G. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9):1502–1517, 1992.

[30] S. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition (FG'96), page 38. IEEE Computer Society, Washington, DC, USA, 1996.



[31] M. Kuss and T. Graepel. The geometry of kernel canonical correlation analysis. Technical Report TR-108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2003.

[32] N. D. Lawrence. Gaussian Process Models for Visualisation of High Dimensional Data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Gaussian Process Models for Visualisation of High Dimensional Data, volume 16, pages 329–336, Cambridge, MA, 2004.

[33] N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, November 2005.

[34] N. D. Lawrence and A. Moore. Hierarchical Gaussian process latent variable models. Proceedings of the 24th International Conference on Machine Learning, pages 481–488, 2007.

[35] N. D. Lawrence and J. Quiñonero-Candela. Local distance preservation in the GP-LVM through back constraints. In R. Greiner and D. Schuurmans, editors, Local distance preservation in the GP-LVM through back constraints, volume 21, pages 513–520, New York, NY, USA, 2006. ACM.

[36] G. Leen. Context assisted information extraction. PhD thesis, University of the West of Scotland, High Street, Paisley PA1 2BE, Scotland, 2008.

[37] G. Leen and C. Fyfe. A Gaussian process latent variable model formulation of canonical correlation analysis. Bruges, Belgium, 26–28 April 2006.

[38] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.



[39] D. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A, 1995.

[40] K. Mardia, J. Kent, J. Bibby, et al. Multivariate Analysis. Academic Press, New York, 1979.

[41] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, pages 415–446, 1909.

[42] G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(11):1832–1837, 2005.

[43] D. Morris and J. Rehg. Singularity analysis for articulated object tracking. In 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, pages 289–296, 1998.

[44] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. The joint manifold model. In IEEE International Conference on Computer Vision (ICCV), 2007.

[45] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[46] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. Torr. Randomized trees for human pose detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[47] R. Rosales and S. Sclaroff. Learning body pose via specialized maps. Advances in Neural Information Processing Systems, 2:1263–1270, 2002.



[48] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[49] M. Salzmann, R. Urtasun, and P. Fua. Local deformation models for monocular 3D shape recovery. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[50] B. Schölkopf. The kernel trick for distances. TR MSR 2000-51, Microsoft Research, Redmond, WA, 2000. Advances in Neural Information Processing Systems, 2001.

[51] B. Schölkopf, A. Smola, and K. Müller. Kernel principal component analysis. Lecture Notes in Computer Science, 1327:583–588, 1997.

[52] A. Shon, K. Grochow, A. Hertzmann, and R. Rao. Learning shared latent structure for image synthesis and robotic imitation. Proc. NIPS, pages 1233–1240, 2006.

[53] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. Lecture Notes in Computer Science, 1843:702–718, 2000.

[54] L. Sigal and M. Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Brown University TR, 2006.

[55] L. Sigal, R. Memisevic, and D. J. Fleet. Shared kernel information embedding for discriminative inference. In CVPR '09: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.



[56] H. A. Simon. The Sciences of the Artificial, 3rd Edition. The MIT Press, October 1996.

[57] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3D human motion estimation. Proc. Conf. Computer Vision and Pattern Recognition, pages 217–323, 2005.

[58] C. Sminchisescu and B. Triggs. Estimating articulated human motion with covariance scaled sampling. The International Journal of Robotics Research, 22(6):371, 2003.

[59] J. F. M. Svensén. GTM: The Generative Topographic Mapping. PhD thesis, Aston University, April 1998.

[60] J. Tenenbaum, V. Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[61] A. Thayananthan, R. Navaratnam, B. Stenger, P. Torr, and R. Cipolla. Pose estimation and tracking using multivariate regression. Pattern Recognition Letters, 29(9):1302–1310, 2008.

[62] T. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth space of feasible solutions. In CVPR Learning Workshop, volume 2, 2005.

[63] M. Tipping. The relevance vector machine. In NIPS, pages 652–658, 2000.

[64] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[65] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.



[66] G. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 1980.

[67] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In Discriminative Gaussian process latent variable model for classification, pages 927–934, New York, NY, USA, 2007. ACM.

[68] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[69] R. Urtasun, D. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR, June 2006.

[70] R. Urtasun, D. Fleet, A. Geiger, J. Popović, T. Darrell, and N. Lawrence. Topologically-constrained latent variable models. In Proceedings of the 25th International Conference on Machine Learning, pages 1080–1087. ACM, New York, NY, USA, 2008.

[71] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. IEEE International Conference on Computer Vision (ICCV), pages 403–410, 2005.

[72] R. Urtasun, D. J. Fleet, and P. Fua. Temporal motion models for monocular and multiview 3D human body tracking. Computer Vision and Image Understanding, 104(2-3):157–177, 2006.

[73] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.



[74] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. In IEEE Nonrigid and Articulated Motion Workshop, 1997. Proceedings, pages 2–9, 1997.

[75] J. Wang, D. Fleet, and A. Hertzmann. Multifactor Gaussian process models for style-content separation. In Proceedings of the 24th International Conference on Machine Learning, pages 975–982. ACM, New York, NY, USA, 2007.

[76] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

[77] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. Volume 18, Cambridge, MA, 2006.

[78] K. Weinberger, F. Sha, and L. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. ACM International Conference Proceeding Series, 2004.

[79] K. Q. Weinberger, B. D. Packer, and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, January 2005.

[80] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), volume 2, pages 988–995, Washington D.C., 2004.



[81] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.

[82] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In ECML, pages 773–781, 2007.

[83] X. Zhao, H. Ning, Y. Liu, and T. Huang. Discriminative estimation of 3D human pose using Gaussian processes. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4, 2008.
