
Shared Gaussian Process Latent Variables Models

Carl Henrik Ek

Submitted in partial fulfilment of the requirements of the award of PhD

Oxford Brookes University

August 2009


Abstract

A fundamental task in machine learning is modeling the relationship between different observation spaces. Dimensionality reduction is the task of reducing the number of dimensions in a parameterization of a data-set. In this thesis we are interested in the crossroads between these two tasks: shared dimensionality reduction. Shared dimensionality reduction aims to represent multiple observation spaces within the same model. Previously suggested models have been limited to scenarios where the observations have been generated from the same manifold. In this thesis we present a Gaussian Process Latent Variable Model (GP-LVM) [33] for shared dimensionality reduction without making assumptions about the relationship between the observations. Further, we suggest an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The proposed algorithm extends classical CCA to represent the full variance of the data, as opposed to only the correlated variance. We compare the suggested GP-LVM model to existing models and show results on real-world problems exemplifying the advantages of our approach.


Acknowledgements


Contents

1 Introduction
1.1 Overview of the thesis
1.2 Publications
1.3 Notations and Conventions

2 Background
2.1 Introduction
2.1.1 Curse of Dimensionality
2.2 Dimensionality Reduction
2.3 Linear Algebra
2.4 Spectral Dimensionality Reduction
2.5 Non-Linear
2.5.1 Kernel-Trick
2.5.2 Proximity Graph Methods
2.5.3 Summary
2.6 Generative Dimensionality Reduction
2.7 Gaussian Processes
2.7.1 Prediction
2.7.2 Training
2.7.3 Relevance Vector Machine
2.8 GP-LVM
2.8.1 Latent Constraints
2.8.2 Summary
2.9 Shared Dimensionality Reduction
2.10 Feature Selection
2.11 Summary

3 Shared GP-LVM
3.1 Introduction
3.2 Shared GP-LVM
3.2.1 Initialization
3.2.2 Inference
3.2.3 Example
3.2.4 Summary
3.3 Subspace GP-LVM
3.4 Extensions
3.5 Applications
3.6 Summary

4 NCCA
4.1 Introduction
4.2 Assumptions
4.3 Shared
4.4 Private
4.4.1 Extracting the first orthogonal direction
4.4.2 Extracting consecutive directions
4.4.3 Example
4.5 Extensions
4.6 Summary

5 Applications
5.1 Introduction
5.2 Human Pose Estimation
5.2.1 Generative
5.2.2 Discriminative
5.2.3 Problem Statement
5.3 Image Features
5.3.1 Shapeme Features
5.3.2 Histogram of Oriented Gradients
5.4 Data-sets
5.5 Shared back-constrained GP-LVM
5.5.1 Single Image Inference
5.5.2 Sequential Inference
5.6 Subspace GP-LVM
5.6.1 Single Image Inference
5.6.2 Sequential Inference
5.7 Quantitative Results
5.8 Experiments
5.9 Comparison
5.10 Summary

6 Conclusions
6.1 Discussion
6.2 Review of the Thesis
6.3 Future Work


List of Figures

2.1 Volume ratio of hyper-cube and hyper-sphere
2.2 Swissroll
2.3 Generative latent variable model
2.4 GTM model
2.5 Samples from GP Prior
2.6 Samples from GP Posterior
2.7 Probabilistic CCA
3.1 Shared back-constrained GP-LVM
3.2 Toy data: generating signals
3.3 Toy data: observed data
3.4 Toy data: latent embeddings
3.5 Toy data 2: generating signals
3.6 Toy data 2: observed data
3.7 Toy data 2: latent embeddings
3.8 Toy data 3: generating signals
3.9 Toy data 3: observed data
3.10 Toy data 3: latent embeddings
3.11 Toy data 3: latent embeddings
3.12 Subspace GP-LVM model
4.1 NCCA model
4.2 Toy data 3: NCCA embedding
4.3 Toy data 3: observed data
4.4 Toy data 3: Subspace GP-LVM embedding
5.1 Kernel response matrix, Poser pose data
5.2 Misinterpretation of angle error
5.3 Poser single image results, back-constrained GP-LVM
5.4 Poser sequence image results, back-constrained GP-LVM
5.5 Kernel matrix, back-constrained and subspace GP-LVM
5.6 Subspace GP-LVM pose inference
5.7 Subspace GP-LVM ambiguity modeling


List of Tables

3.1 Toy data: Procrustes score
5.1 Error on Poser data
5.2 Error on HumanEva


Chapter 1

Introduction

Developments in information technology have led to a significant expansion in the storage capabilities for digital content. This has meant that in many application areas where observations used to be scarce we now have access to significant amounts of data. This development has led to a transition in many fields from the purely model driven paradigm to the data driven approach. In model driven modeling the aim is to explain a specific phenomenon using a model designed for the task at hand. This is different from data-driven modeling, where one tries to use observations of the phenomenon to learn a model. For most modeling scenarios the data available is represented in a form defined by the device used to capture the data. This means that the degrees of freedom of the available data are the degrees of freedom of the capturing device, not the actual phenomenon. This often leads to the data being represented using a very large number of dimensions, often significantly larger than the number of dimensions, or degrees of freedom, of the underlying phenomenon. This has led to the machine learning field of Dimensionality Reduction. In dimensionality reduction the aim is to find the data's true or intrinsic parameterization from the capturing device representation.

Many tasks in computer science are associated with data coming from multiple streams or views of the same underlying phenomenon. Often each view provides complementary information about the data. For modeling purposes it is therefore of interest to use information from each view. The task of merging several different views is called Feature Fusion.

The work undertaken in this thesis spans both realms presented above. Given multiple views of the same phenomenon we create models which are capable of leveraging the advantage of each view in learning a reduced dimensional representation of the data.

1.1 Overview of the thesis

A brief outline of the dissertation follows.

Chapter 2 This chapter provides the motivation and the background to the machine learning task of dimensionality reduction. The two different approaches to dimensionality reduction, spectral and generative, are introduced and their strengths and weaknesses reviewed. We continue by introducing Gaussian Processes (GP) and give a brief background to Bayesian Modeling. The Gaussian Process Latent Variable Model (GP-LVM) [33, 32], a dimensionality reduction model based on Gaussian Processes, is introduced. Further, we introduce the task of Shared Dimensionality Reduction, which will be the main focus of this thesis.

Chapter 3 This chapter describes the two shared generative dimensionality reduction models developed in this thesis. By motivating the shortcomings of current models we derive two new models in the GP-LVM framework. We first present the shared back-constrained GP-LVM and continue to describe the second suggested model, the subspace GP-LVM model.

Chapter 4 We present an extension to Canonical Correlation Analysis (CCA) called Non-Consolidating Component Analysis (NCCA). The NCCA algorithm allows us to transform CCA from an algorithm for feature selection to one for shared dimensionality reduction.

Chapter 5 In this chapter the suggested models are applied to the Computer Vision task of human pose estimation. We apply the models to real-world data-sets and experimentally compare the two models.

Chapter 6 Concludes the work undertaken and describes potential directions of future work.

1.2 Publications

This thesis builds on work from the following publications:

1. C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2007), volume LNCS 4892, pages 132–143, Brno, Czech Republic, Jun. 2007. Springer-Verlag.

2. C. H. Ek, P. H. Torr, and N. D. Lawrence. Ambiguity modeling in latent spaces. In 5th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2008), 2008.

3. C. H. Ek, P. Jaeckel, N. Campbell, N. Lawrence, and C. Melhuish. Shared Gaussian Process Latent Variable Models for Handling Ambiguous Facial Expressions. In AIP Conference Proceedings, volume 1107, page 147, 2009.

1.3 Notations and Conventions

In the mathematical notation we use italics x to indicate scalars, bold lowercase x to indicate vectors, and bold uppercase X to indicate matrices. Vectors, unless stated otherwise, are column vectors. The transpose of a matrix is indicated by a superscript, X^T. The identity matrix is represented by I. The vector e_i represents the unit vector with all dimensions set to 0 except dimension i, which is 1.


Chapter 2

Background

2.1 Introduction

Modeling is the task of describing a system using a specific language. In this thesis the focus is on mathematical modeling, which refers to the process of describing a system through the laws of mathematics. The building blocks of mathematical models are variables or parameters which, by interacting through the laws of mathematics, aim to mimic the behavior of a system. There are many reasons why we are interested in creating accurate models of a specific system. A model allows us to analyze and simulate the behavior of the system in a hypothetical scenario without having to jeopardize the actual system.

A fundamental characteristic of a model is its degrees of freedom [22], which refers to the number of parameters or variables that are allowed to vary independently from each other. In many scenarios there is more than one way to describe a system; this can either be because different approximations or assumptions have been made or due to lack of knowledge of the system. For example, one set of data can be equally well described by two different models. However, data from the input domain outside the training data might result in different behavior from each model. Similarly, different assumptions often lead to different models. The degrees of freedom of a representation are equal to the number of parameters or dimensions. However, this parameterization does not need to represent the data in the correct way. When each dimension in the representation describes or models a single degree of freedom in the data we say that the data is in its intrinsic representation.

We will separate the task of modeling into data driven and model driven. Data driven modeling is when, given a set of training data, we try to learn a model from the data; this is different from model based modeling, where we try to fit or match a specific model to the data. This thesis will focus on data driven models.

2.1.1 Curse of Dimensionality

We spend our lives in a world that is essentially three dimensional. It is in this world we build our understanding of concepts such as distance and volume. In Machine Learning we often deal with data comprising many more dimensions. Many of the concepts we learn to recognize in two and three dimensions cannot easily be extrapolated to higher dimensions. One example is the relationship between the volume of a hyper-sphere of diameter 2 and a cube with side length 2,

V_{cube}(d) = 2^d  (2.1)

V_{sphere}(d) = \frac{2 π^{d/2}}{d \, Γ(d/2)}  (2.2)


Figure 2.1: The ratio of a hyper-cube that is contained within a hyper-sphere as a function of dimension.

Figure 2.1 shows this ratio as a function of dimensionality. In the limit the ratio of the volume that is contained within the hyper-sphere goes to zero,

\lim_{d→∞} \frac{V_{sphere}(d)}{V_{cube}(d)} = 0.  (2.3)

This means that with increasing dimension all the volume of a cube will be contained in its corners. However, our concept of the “corners of a cube” is clearly two and three dimensional in nature, as is our understanding of volume. Therefore care should be taken when working with high dimensional spaces.

The “Curse of Dimensionality” refers to the exponential growth in volume with dimension. It is called a curse due to the fact that many algorithms scale badly with increasing dimension. One way of visualizing this is to think of an algorithm as a mapping from an input-space to an output-space. For the algorithm to work for any type of input it needs to “cover” or “monitor” all of the input space. Therefore, with increasing dimension the algorithm needs to cover an exponentially growing volume.
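As a quick numerical check of Eqs. (2.1)-(2.3), the short Python sketch below (illustrative, not part of the thesis) evaluates the sphere-to-cube volume ratio for increasing dimension; the ratio falls rapidly towards zero.

```python
import math

def volume_ratio(d):
    """Ratio of the volume of the unit-radius hyper-sphere (diameter 2)
    to the volume of the enclosing hyper-cube with side length 2."""
    v_cube = 2.0 ** d                                              # Eq. (2.1)
    v_sphere = 2.0 * math.pi ** (d / 2) / (d * math.gamma(d / 2))  # Eq. (2.2)
    return v_sphere / v_cube

for d in (1, 2, 3, 5, 10, 20, 30):
    print(d, volume_ratio(d))
```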


2.2 Dimensionality Reduction

In most data-sets the number of degrees of freedom of the representation is much higher than that of the intrinsic representation. This is due to the fact that, rather than representing the degrees of freedom of the data, the representation is a reflection of the degrees of freedom in the collection process of the data.

How complex or simple a structure is depends critically upon the way we describe it. Most of the complex structures found in the world are enormously redundant, and we can use this redundancy to simplify their description. But to use it, to achieve the simplification, we must find the right representation [56].

One example of this is the parameterization of a natural image as a matrix of real values (pixels). The matrix being captured by a camera, each pixel corresponds to a single light sensor which is allowed to vary independently of the other sensors on the lens. However, this does not correspond well to natural images, as neighboring pixels are strongly correlated [8]. This implies that natural images have a significantly different intrinsic representation, and degrees of freedom, than the image representation of the camera. The correlation between pixels will, in the vector space spanned by the pixel values, manifest itself as a low-dimensional manifold. Parameterizing this manifold are the degrees of freedom for natural images. This example is a simplification as it assumes that the camera is capable of capturing the full variability or all the degrees of freedom in a natural image, i.e. that the intrinsic representation can be found as a mapping from the observed representation. A more general view of the problem that avoids this assumption is to view the collection process as a mapping from the data's intrinsic representation X to its observed representation Y; this mapping is referred to as the generating mapping,

y_i = f(x_i).  (2.4)

As in the example with the camera and natural images, the typical parameterization of data used in machine learning is over-represented with respect to the number of dimensions but under-represented in terms of the data. Geometrically this implies that the data only occupies a sub-volume of the representation space. However, the generating mapping is not necessarily invertible, which means that information from the intrinsic representation is lost.

Dimensionality reduction is the process of reducing the number of parameters needed by a specific representation. This thesis focuses on data driven dimensionality reduction. Data driven modeling is based on the assumption that the observed training data is representative. This means that we make the assumption that the observed data is sampling the input domain “well”. This implies that the degrees of freedom for the training data are assumed to be the same as for any additional samples from the same domain.

There are two main approaches to data-driven dimensionality reduction, (1) generative and (2) spectral. The generative approach aims at modeling the generating mapping f. This is, in general, an ill-posed problem as there are many ways of generating the observed data when neither the mapping nor the intrinsic representation is known. Spectral models avoid this by assuming that a smooth inverse f^{-1} to the generative mapping exists. This implies that the full degrees of freedom in the data are retained in the observed representation. This means that the intrinsic representation X of the data can be “unraveled” from evidence in the observed representation Y.

2.3 Linear Algebra

A linear mapping T from vector-space U to vector-space V, T : U → V, is represented in matrix form as,

T(x) = Ax.  (2.5)

This means that the mapping T, represented by matrix A, carries elements from vector space U to vector space V. The image im(T) of a mapping defines the set of all values the map can take,

im(T) = {T(x) : x ∈ U} ⊂ V.  (2.6)

Similarly the kernel kern(T) is the set of all values that map to zero,

kern(T) = {x ∈ U : T(x) = 0} ⊂ U.  (2.7)

The kernel and the image of a linear mapping are related through the Rank-Nullity Theorem:

dim(U) = dim(im(A)) + dim(kern(A)).  (2.8)

Intuitively this means that the number of dimensions needed to correctly represent the degrees of freedom of the data is given by subtracting the number of dimensions required to represent the null-space of the representation from the number of dimensions of the representation. The number of dimensions needed to represent the image is more commonly referred to as the rank of the matrix, which is the number of linearly independent columns in the matrix.

Two square matrices A and B are said to be similar if there exists a non-singular matrix P (a matrix such that the inverse of P exists) such that the two matrices are related as,

A = PBP^{-1},  (2.9)

this is known as a similarity transform. A similarity transform P maps from a vector space onto itself and is therefore also referred to as a "change of basis" transformation, mapping between two representations or reference frames. The determinant, trace and invertibility are all invariant under similarity.

A special similarity transform is when one reference system or basis results in a diagonal matrix,

A = VΛV^{-1}  (2.10)

Λ_{ij} = 0 for i ≠ j, Λ_{ii} = λ_i  (2.11)

VV^T = I,  (2.12)

the diagonal representation is said to be the spectral decomposition of the matrix. The columns of V are called eigenvectors and the non-zero elements of the spectral decomposition the eigenvalues of the matrix. The spectral decomposition means that we can write any square matrix as a linear combination of rank one matrices,

A_{ij} = \sum_{k=1}^{N} V_{ik} Λ_{kk} V^T_{kj} = \sum_{k=1}^{N} (v_k)_i λ_k (v_k)_j = \sum_{k=1}^{N} (λ_k v_k v_k^T)_{ij}.  (2.13)

As V specifies an orthonormal basis, the relative magnitude of each eigenvalue corresponds to the amount of A that is explained by the corresponding eigenvector. Therefore, by ordering the eigenvalues in decreasing order, λ_i ≥ λ_j for i ≤ j, we can refer to,

A_{→i} = \sum_{k=1}^{i} λ_k v_k v_k^T,  (2.14)

as the best rank i approximation of matrix A.
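The spectral decomposition and the rank-i approximation of Eq. (2.14) can be illustrated with a few lines of numpy (a sketch on a random positive semi-definite matrix, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive semi-definite matrix (e.g. a Gram or covariance matrix); its
# eigendecomposition A = V diag(lambda) V^T is the spectral decomposition
# of Eqs. (2.10)-(2.12).
B = rng.standard_normal((6, 6))
A = B @ B.T

lam, V = np.linalg.eigh(A)          # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]      # reorder so that lambda_1 >= lambda_2 >= ...

def rank_i_approx(i):
    """Best rank-i approximation of A, Eq. (2.14): the sum of the i leading
    rank-one terms lambda_k v_k v_k^T from Eq. (2.13)."""
    return sum(lam[k] * np.outer(V[:, k], V[:, k]) for k in range(i))

for i in (1, 3, 6):
    print(i, np.linalg.norm(A - rank_i_approx(i), "fro"))
```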

2.4 Spectral Dimensionality Reduction

Spectral dimensionality reduction is based on the assumption that the generating mapping f is invertible. This means that the relationship between the observed representation Y and the intrinsic representation X takes the form of a bijection. This implies that the intrinsic structure of the data is fully preserved in the observed representation.

Classic Multi Dimensional Scaling (MDS) [17, 40] is a method for representing a metric dissimilarity measure as a geometrical configuration. Given a dissimilarity measure δ_{ij} between i and j, the aim is to find a geometrical configuration of points X = [x_1, . . . , x_N] such that the Euclidean distance d_{ij} = ||x_i − x_j||_2 approximates the dissimilarity δ_{ij}. Classical MDS is formulated as a minimization of the following energy,

argmin_X \sum_{ij} (δ_{ij} − d_{ij})^2 = argmin_X ||∆ − D(X)||_F^2.  (2.15)

The best d dimensional geometrical representation can be found through a rank d approximation D(X) of the dissimilarity matrix ∆ through the spectral decomposition ∆ = VΛV^T = \sum_{i=1}^{N} λ_i v_i v_i^T,

||∆ − D(X)||_F = ||\sum_{i=1}^{N} λ_i v_i v_i^T − \sum_{i=1}^{N} q_i v_i v_i^T||_F = ||\sum_{i=1}^{d} (λ_i − q_i) v_i v_i^T + \sum_{i=d+1}^{N} λ_i v_i v_i^T||_F.  (2.16)

Having found a rank d approximation to ∆ the distance matrix can be converted to a Gram matrix G = XX^T,

g_{ij} = \frac{1}{2} \left( \frac{1}{N} \sum_{k=1}^{N} d_{ik}^2 + \frac{1}{N} \sum_{k=1}^{N} d_{kj}^2 − \frac{1}{N^2} \sum_{k=1}^{N} \sum_{p=1}^{N} d_{kp}^2 − d_{ij}^2 \right).  (2.17)

A geometrical configuration can be found through eigen-decomposition of the Gram matrix G,

G = XX^T = VΛV^T = (VΛ^{1/2})(VΛ^{1/2})^T ⇒ X = VΛ^{1/2},  (2.18)


with the dimensionality of,

rank(X) = rank(XX^T) = rank(G) = rank(D(X)) = d.

In practice, for dimensionality reduction, we want to find a low dimensional representation of a set of data points, i.e. vectorial data. In this case the Gram matrix G can be constructed directly from the data and a rank d approximation can be sought, making the conversion step from distance matrix to Gram matrix unnecessary.
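A minimal numpy sketch of this procedure is given below (illustrative only, not the thesis' implementation); the double-centering step plays the role of the Gram-matrix conversion in Eq. (2.17), and the embedding follows Eq. (2.18).

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS sketch: distance matrix D (N x N, Euclidean distances)
    -> d-dimensional configuration X."""
    N = D.shape[0]
    # Double centering, equivalent to the Gram-matrix conversion of Eq. (2.17).
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (D ** 2) @ J
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1], V[:, ::-1]            # decreasing eigenvalues
    lam_d = np.clip(lam[:d], 0.0, None)       # guard against small negative values
    return V[:, :d] * np.sqrt(lam_d)          # X = V Lambda^{1/2}, Eq. (2.18)

# Toy usage: recover a 2-D configuration from pairwise distances.
rng = np.random.default_rng(1)
X_true = rng.standard_normal((50, 2))
D = np.linalg.norm(X_true[:, None, :] - X_true[None, :, :], axis=-1)
X_hat = classical_mds(D, 2)
```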

Principal Component Analysis (PCA) is a dimensionality reduction technique for embedding vectorial data in a dimensionally reduced representation. Given centered vectorial data Y, the covariance matrix S = Y^T Y has elements on the diagonal representing the variance along each dimension of the data, while the off-diagonal elements measure the linear redundancies between dimensions. The objective of PCA is to find a projection v of the data Y such that the variance along each dimension is maximized,

Objective: argmax_v var(Yv)  (2.19)
subject to: v^T v = 1.  (2.20)

This implies finding a projection of the data into a representation resulting in a diagonal covariance matrix,

var(Yv) = \frac{1}{N − 1} (Yv)^T Yv = v^T S v  (2.21)

S = VΛV^T  (2.22)

Λ = V^T S V  (2.23)

X = YV.  (2.24)

As can be seen from above, the solutions to both MDS and PCA are found through a similarity transform where one representation results in a diagonal matrix: in the case of MDS through the diagonalisation of the N × N Gram matrix G, and for PCA through the D × D covariance matrix.

Using the spectral decomposition of the Gram matrix,

G = YY^T = VΛV^T,  (2.25)

the similarity implies that,

(YY^T) v_i = λ_i v_i.  (2.26)

By pre-multiplying we can write,

\frac{1}{N − 1} Y^T (YY^T) v_i = \frac{λ_i}{N − 1} Y^T v_i  (2.27)

S Y^T v_i = \frac{λ_i}{N − 1} Y^T v_i,  (2.28)

which also defines a similarity transform but now in terms of the covariance matrix S. However, we also need to enforce orthogonality of the new basis,

(Y^T v_i)^T (Y^T v_i) = v_i^T YY^T v_i = λ_i  (2.29)

\frac{1}{\sqrt{λ_i}} (Y^T v_i)^T (Y^T v_i) \frac{1}{\sqrt{λ_i}} = \frac{1}{λ_i} v_i^T YY^T v_i = 1.  (2.30)

This results in the eigen-basis of the covariance matrix, i.e. v_i^{PCA} = \frac{1}{\sqrt{λ_i}} Y^T v_i, which gives the following embedding,

x_i^{PCA} = \frac{1}{N − 1} YY^T v_i \frac{1}{\sqrt{λ_i}} = \frac{\sqrt{λ_i}}{N − 1} v_i = \frac{1}{N − 1} x_i^{MDS},  (2.31)

meaning MDS and PCA result in the same solution up to scale.

MDS and PCA assume the generating mapping f to be linear and therefore imply that the intrinsic representation of the data can be found by a change-of-basis transform. This restricts these algorithms to only being applicable in scenarios where the generating mapping is linear.
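The MDS/PCA equivalence above is easy to verify numerically. The sketch below (an illustration under the conventions above, not the thesis' code) computes the PCA embedding from the D × D covariance matrix and the MDS embedding from the N × N Gram matrix and checks that the columns agree up to a sign per dimension (and, depending on normalisation, an overall scale).

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 5))
Y = Y - Y.mean(axis=0)                       # centered data, N x D
d = 2

# PCA: eigendecomposition of the covariance matrix (Eqs. 2.21-2.24).
S = Y.T @ Y / (Y.shape[0] - 1)
_, V_S = np.linalg.eigh(S)
V_S = V_S[:, ::-1][:, :d]                    # leading eigenvectors
X_pca = Y @ V_S                              # X = YV

# MDS: eigendecomposition of the Gram matrix (Eqs. 2.25-2.26 and 2.18).
G = Y @ Y.T
lam_G, V_G = np.linalg.eigh(G)
lam_G, V_G = lam_G[::-1][:d], V_G[:, ::-1][:, :d]
X_mds = V_G * np.sqrt(lam_G)                 # X = V Lambda^{1/2}

# Column-wise the two embeddings coincide up to sign.
print(np.allclose(np.abs(X_pca), np.abs(X_mds)))
```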

2.5 Non-Linear

Several algorithms have been suggested for the scenario where, rather than assuming the generating mapping f to be linear, this assumption is relaxed to only require that it is smooth. MDS finds a geometrical configuration respecting a specific dissimilarity measure. Measuring the dissimilarity in terms of the distance along the manifold between each point, it would be possible to use MDS even in scenarios where the generating mapping is non-linear. However, acquiring the distance along the manifold requires the manifold to be unraveled.

The objective of PCA is to find a projection of the data Y where the variance along each dimension is maximized. If the data occupies a subspace in the original representation, PCA will find a rotation of the data such that a reduced dimensional representation can be found by removing dimensions that do not represent any of the variance in the data. Further, for many data-sets most of the variance is represented by the first few principal directions, meaning that an approximate representation of the data can be found by truncating the dimensions that represent a non-significant variance.

One approach to non-linearize PCA was suggested in [51]. The idea is to first map the data Y into a feature space F through a mapping Φ. Rather than, as in standard PCA, finding the spectral decomposition of the covariance matrix S in the original space, the decomposition can now be applied to the covariance in the feature space representation of the data. The hope is that the mapping Φ has unraveled the manifold such that applying PCA would recover the intrinsic representation of the data. However, it remains to find a mapping that does just that.

2.5.1 Kernel-Trick

Given a set of data Y = [y_1, . . . , y_N] where y_i ∈ ℜ^D, the data is represented in the basis [e_1, . . . , e_D]. A different basis would be to represent each data-point y_i by its own basis [y_1, . . . , y_N], i.e. represented in an N dimensional space with representation Ỹ = [e_1, . . . , e_N]. The covariance matrix S in the space spanned by the data is equal to the Gram matrix of the original representation. This is the fundamental background to the Kernel Trick, which is a way of non-linearizing algorithms that depend only on the inner product between data-points. Even though it is an accepted term, it is not clear where the term was initially suggested. The Kernel Trick is based on the idea that, rather than finding a specific mapping Φ that takes the data to the feature space F, we specify a function k(y_i, y_j), called the kernel function, that parameterizes the inner product between Φ(y_i) and Φ(y_j),

k(y_i, y_j) = Φ(y_i)^T Φ(y_j).  (2.32)

Evaluated between each pair of points in the data-set, the kernel function k specifies the kernel matrix K(Y, Y), which specifies the Gram matrix in the feature space F. From Eq. 2.17 we know that the Gram matrix and a distance matrix are interchangeable representations for centered data. Therefore, as long as the kernel function k specifies a valid Gram matrix K, there is an underlying geometrical representation of the data in F. The class of kernel functions that specifies geometrically representable feature spaces are known as Mercer Kernels [41, 50]. Mercer Kernels are positive semidefinite, i.e. in the spectral decomposition of the resulting kernel matrix K all eigenvalues are non-negative. Intuitively this can be understood through Eq. 2.17: if the eigenvalues were to be negative, then by adding basis vectors the distance between two points would be reduced, which is not possible in a Euclidean space. When a kernel function is used to represent the data, the feature space F is known as a kernel induced feature space.

One advantage of using a kernel induced feature space is that if we aim to apply an algorithm to the data which is expressed only in terms of the inner product between data points, we never need to find the geometrical representation of the data in F. This means that kernels resulting in potentially infinite dimensional spaces can be used. One such kernel function is the RBF kernel,

k(y_i, y_j) = θ_1 e^{−\frac{θ_2}{2} ||y_i − y_j||_2^2},  (2.33)

with parameters {θ_1, θ_2}. If the inner product is specified by an RBF kernel, any combination of points y_i and y_j will have a non-zero inner product. For this to be possible the feature space F will need to be infinite dimensional.

PCA works by diagonalizing the covariance matrix of the data through the spectral decomposition. Kernel PCA [51] is formulated by first finding the Gram matrix in the kernel induced feature space. By representing each point using the basis of the data itself, the Gram matrix is equivalent to the covariance matrix. A reduced representation can now be found through the spectral decomposition of the kernel matrix K.

For many popular kernels, such as RBF kernels, the kernel represents feature spaces of higher dimensionality compared to the original data-space, meaning that the mapping increases the dimensionality of the data. However, even though the dimensionality of the data might have been increased, the relative ratio of the eigenvalues of the spectral decomposition of the covariance matrix might result in fewer eigenvectors, compared to the decomposition in the original space, that represent a significant variance, meaning that a lower dimensional approximation of the data can be found. So strictly speaking, for many types of kernel functions, Kernel PCA is not a dimensionality reduction technique but rather an algorithm for feature selection, which we will briefly comment on at the end of this chapter.
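A minimal sketch of Kernel PCA with the RBF kernel of Eq. (2.33) is given below (a numpy illustration, not the implementation used in the thesis); the kernel matrix is centered in feature space before the spectral decomposition.

```python
import numpy as np

def rbf_kernel(Y, theta1=1.0, theta2=1.0):
    """RBF kernel of Eq. (2.33): k(y_i, y_j) = theta1 * exp(-theta2/2 * ||y_i - y_j||^2)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-0.5 * theta2 * sq)

def kernel_pca(Y, d, theta1=1.0, theta2=1.0):
    """Kernel PCA sketch: spectral decomposition of the centered kernel matrix."""
    K = rbf_kernel(Y, theta1, theta2)
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                 # center the data in feature space
    lam, V = np.linalg.eigh(Kc)
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]      # leading eigenpairs
    return V * np.sqrt(np.clip(lam, 0.0, None))    # embedding, as in Eq. (2.18)

rng = np.random.default_rng(3)
Y = rng.standard_normal((80, 4))
X = kernel_pca(Y, 2)
```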

In the next section we will go through a set of algorithms that use local similarity measures in the data to construct kernels to which a spectral decomposition can be applied to find a geometrical representation of the data.

2.5.2 Proximity Graph Methods

Several dimensionality reduction algorithms have been suggested that are based on local similarity measures in the data. These algorithms are based on a proximity graph [66, 29, 10] extracted from the data. A proximity graph is a graph that represents a specific neighborhood relationship in the data. In the graph each node corresponds to a data point, and edges connect nodes that are related through the specified relationship, potentially associated with an edge weight. The fundamental idea behind proximity graph based algorithms for dimensionality reduction is that locally the data can be assumed to lie on a linear manifold. This means that locally the distance in the original representation of the data will be a good approximation to the manifold distance. Therefore the neighborhood relationship used for proximity graphs in dimensionality reduction is the inter-distance between points in the original representation. Usually the graphs are constructed either from an N nearest neighbor algorithm, where the N closest points are connected, or from an ǫ nearest neighbor, where all points within a ball of radius ǫ are connected. Setting either parameter is of significant importance, as only points whose inter-distance can be assumed to approximate the manifold distance should be connected.

Isomap

Isomap [60] was presented as a non-linear modification of MDS. The first step of Isomap is to construct a proximity graph of the data with edge weights corresponding to the Euclidean distance between each pair of points. MDS finds a geometrical configuration from a global dissimilarity measure. In Isomap it is suggested that the manifold distance be approximated by the shortest path through this proximity graph. By computing the shortest path through the proximity graph, a dissimilarity measure can be found between each pair of data points, onto which MDS can be applied. The shortest path through the proximity graph is not certain to result in a dissimilarity matrix whose Gram matrix corresponds to a valid geometrical configuration, i.e. a Mercer kernel. A modification of the Isomap framework that guarantees a valid Mercer kernel has been suggested in [16]. However, in general, as we are only interested in the largest eigenvalues, this does not cause significant problems.
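The Isomap pipeline can be sketched compactly as below (a numpy/scipy illustration, not the thesis' code); it assumes the k-nearest-neighbour graph is connected so that all shortest-path distances are finite, and the neighbourhood size k is a hypothetical choice.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(Y, d, k=7):
    """Isomap sketch: k-NN proximity graph, graph shortest paths as an
    approximation of manifold distance, then classical MDS."""
    N = Y.shape[0]
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # Euclidean distances
    W = np.full((N, N), np.inf)                                  # inf marks a non-edge
    for i in range(N):
        nn = np.argsort(E[i])[1:k + 1]          # k nearest neighbours (skip self)
        W[i, nn] = E[i, nn]
    W = np.minimum(W, W.T)                      # symmetrise the graph
    D = shortest_path(W, method="D", directed=False)   # geodesic approximation
    # Classical MDS on the geodesic distance matrix (Eqs. 2.17-2.18).
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (D ** 2) @ J
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]
    return V * np.sqrt(np.clip(lam, 0.0, None))
```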

Maximum Variance Unfolding

A different alteration of MDS is Maximum Variance Unfolding (MVU) [78, 80, 79, 81]. Based on the observation that any "fold" of a manifold would decrease the Euclidean distance between two points while the distance along the manifold would remain the same, MVU is formulated as a constrained maximization. In the first step of the algorithm the proximity graph based on the Euclidean distance in the observed data is computed in the same manner as for Isomap. The inter-distance between nodes not connected by an edge in the graph is maximized under the constraint that the distance between nearest neighbors stays the same. In effect the MVU objective will try to unravel the data by stretching the manifold as much as possible without causing it to tear. Rather than formulating the objective in terms of a distance matrix as in Isomap, the MVU objective is expressed in terms of the Gram matrix, which we know is an interchangeable representation (Eq. 2.17). MVU tries to find a feature space represented by a kernel matrix K. This leads to the following objective,

K̂ = argmax_K tr(K)  (2.34)
subject to: K ⪰ 0
\sum_{ij} K_{ij} = 0
K_{ii} + K_{jj} − K_{ij} − K_{ji} = G_{ii} + G_{jj} − G_{ij} − G_{ji},  i ∈ N(j),

where G is the Gram matrix for the original representation and N(i) is the index set of points that are connected to i in the proximity graph. The first constraint forces K̂ to represent a geometrically interpretable feature space, while the second constraint forces the data to be centered. The final constraint ensures that the distance between points that are connected in the proximity graph is conserved. The optimization is an instance of Semi-Definite Programming [11] and can be solved using efficient algorithms. Once a valid kernel matrix K̂ has been found, the resulting embedding can be found by applying MDS to K̂.
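The program in Eq. (2.34) can be written down almost directly with a semi-definite programming modelling tool; the sketch below assumes the cvxpy library with an SDP-capable solver is available, and is illustrative rather than the implementation referred to in the thesis.

```python
import numpy as np
import cvxpy as cp

def mvu(Y, k=5):
    """Maximum Variance Unfolding sketch (Eq. 2.34) as a semi-definite program.
    Practical only for small N; k is an illustrative neighbourhood size."""
    N = Y.shape[0]
    G = Y @ Y.T                                          # Gram matrix of the input
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    K = cp.Variable((N, N), PSD=True)                    # K is positive semi-definite
    constraints = [cp.sum(K) == 0]                       # centering constraint
    for i in range(N):
        for j in np.argsort(E[i])[1:k + 1]:              # nearest neighbours of i
            constraints.append(
                K[i, i] + K[j, j] - K[i, j] - K[j, i]
                == G[i, i] + G[j, j] - G[i, j] - G[j, i]  # preserve local distances
            )
    prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
    prob.solve()
    # The embedding then follows by applying MDS (eigendecomposition) to K.value.
    return K.value
```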


Local Linear Embeddings

Local Linear Embeddings (LLE) [48] is a third method based on the preservation of a proximity graph structure. LLE is based on the assumption that the manifold can locally be well approximated using small linear patches. By rotating and translating each of these patches the full manifold structure can be modeled. LLE is a two step algorithm. In the first step each point in the data set is described as an expansion in the points connected to it in the proximity graph,

Ŵ = argmin_W \sum_{i=1}^{N} ||y_i − \sum_{j ∈ N(i)} W_{ij} y_j||_2^2  (2.35)
subject to: \sum_j W_{ij} = 1,

where N(i) is the index set of points that are connected to i in the proximity graph. The optimal weights Ŵ can be solved for in closed form [48]. Assuming that the manifold is locally linear, the reconstruction weights should summarize the local structure of the data and should therefore be equally valid in reconstructing the manifold representation of the data X. To find this manifold representation a second minimization is formulated,

X̂ = argmin_X \sum_i ||x_i − \sum_{j ∈ N(i)} W_{ij} x_j||_2^2.  (2.36)

However, Eq. 2.36 has a trivial solution placing each point at the origin, x_i = 0 ∀i; this is removed by enforcing unit variance along each direction. Further, to remove the translational degree of freedom the solution is enforced to be centered, \sum_i x_i = 0. The optimal embedding X̂ can be found through an eigenvalue problem.
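The two LLE steps translate into a short numpy sketch (illustrative only; a small regularizer is added to the local covariance for numerical stability, which is not part of Eq. (2.35)):

```python
import numpy as np

def lle(Y, d, k=10, reg=1e-3):
    """LLE sketch following Eqs. (2.35)-(2.36): solve for reconstruction
    weights, then for the low-dimensional embedding."""
    N = Y.shape[0]
    E = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(E[i])[1:k + 1]
        Z = Y[nn] - Y[i]                        # neighbours, centred on y_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)      # regularise the local covariance
        w = np.linalg.solve(C, np.ones(k))
        W[i, nn] = w / w.sum()                  # enforce sum_j W_ij = 1 (Eq. 2.35)
    # Eq. (2.36): minimise ||X - WX||^2 subject to centred, unit-variance X.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    lam, V = np.linalg.eigh(M)
    return V[:, 1:d + 1]                        # discard the constant eigenvector
```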

Laplacian Eigenmaps

The proximity graph is also the starting point for Laplacian eigenmaps [5]. Each node in the graph is connected to its neighbors by an edge with a weight representing the locality of the points. Several different measures of locality can be used. In the original paper either a heat kernel,

w_{ij} = e^{−||y_i − y_j||_2^2 / t},

or a constant weight w_{ij} = 1 was applied. Once the graph has been constructed, the objective is to find an embedding X of the data such that points that are connected in the graph stay as close together as possible. For the first dimension,

x̂ = argmin_x \sum_{i,j} (x_i − x_j)^2 W_{ij} = argmin_x x^T L x,  (2.37)

where L is referred to as the Laplacian, defined as L = D − W, and D is a diagonal matrix such that D_{ii} = \sum_j W_{ji}. The objective Eq. 2.37 has a trivial zero dimensional solution, representing the embedding using a single point. To remove this solution, the solution is forced to be orthogonal to the constant vector 1, x^T D 1 = 0. Further, to prevent the embedding from shrinking, a constraint on the scale, x^T D x = 1, is appended to the objective. The diagonal matrix D provides a scaling of each point with respect to its locality to other points in the data.

Figure 2.2: Swissroll with added Gaussian noise. Given the left image it is easy to see the global structure. The spectral algorithms are based on local structure in the data, as in the right image, from which it is a lot harder to infer the global structure.

For a multi-dimensional embedding of the data this leads to the following optimization problem,

X̂ = argmin_X tr(X^T L X)  (2.38)
subject to: X^T D X = I,

which can be solved through a generalized eigenvalue problem.
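A minimal numpy/scipy sketch of this construction is given below (illustrative, not the thesis' code): heat-kernel weights on a k-nearest-neighbour graph, followed by the generalized eigenvalue problem of Eq. (2.38).

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(Y, d, k=10, t=1.0):
    """Laplacian eigenmaps sketch: heat-kernel weights on a k-NN graph,
    then the generalized eigenvalue problem L x = lambda D x."""
    N = Y.shape[0]
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(sq[i])[1:k + 1]
        W[i, nn] = np.exp(-sq[i, nn] / t)       # heat-kernel edge weights
    W = np.maximum(W, W.T)                      # symmetrise the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    lam, V = eigh(L, D)                         # generalized eigenproblem
    return V[:, 1:d + 1]                        # skip the trivial constant solution
```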

2.5.3 Summary

Spectral algorithms are attractive as they are associated with a convex objective function leading to a unique solution. However, the proximity graph is based on local distance measures that are more likely to be affected by noise, see Figure 2.2. The spectral algorithms all have the fundamental assumption that the generating mapping f has a smooth inverse, which thereby preserves locality of the observed representation in the solution of the intrinsic one. However, one needs to be wary of what this assumption actually implies. As an example, take a piece of string which has been rolled up into a ball. The string is a one dimensional object embedded in a three dimensional space. Locality is preserved through the generating mapping f, i.e. points that are close on the string will remain close in the ball. However, the reverse does not necessarily need to be true, as neighboring points on the ball might come from different "loops" when the ball was rolled together.

Further, even though it is assumed to exist, none of the spectral algorithms models the smooth inverse of the generating mapping; rather they learn embeddings of the data points. This is fine as long as the focus is on the visualization of the data rather than a model.

Figure 2.3: The generative latent variable model. The observed data Y is modeled as generated from a low-dimensional latent variable X through the generative mapping f specified by parameters W.

2.6 Generative Dimensionality Reduction

Generative approaches to dimensionality reduction aim to model the observed data as a mapping from its intrinsic representation. The underlying representation is often referred to as the latent representation of the data and the models as latent variable models for dimensionality reduction. The observed data Y have been generated from the latent variables X through a mapping f parameterized by W, Figure 2.3. Assuming the observations are i.i.d. and have been corrupted by spherical Gaussian noise leads to the likelihood of the data,

p(Y|X, W, β^{−1}) = \prod_{i=1}^{N} N(y_i | f(x_i, W), β^{−1} I),  (2.39)

where β^{−1} is the noise variance. The Gaussian noise model means we will refer to these models as Gaussian latent variable models.
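For concreteness, the log of the likelihood in Eq. (2.39) can be evaluated as below (a numpy sketch with an arbitrary, hypothetical mapping f; not code from the thesis):

```python
import numpy as np

def log_likelihood(Y, X, f, beta):
    """Log of Eq. (2.39): i.i.d. observations with spherical Gaussian noise of
    precision beta around the generative mapping f (f is a placeholder here)."""
    N, D = Y.shape
    R = Y - f(X)                                   # residuals y_i - f(x_i)
    return (0.5 * N * D * np.log(beta / (2.0 * np.pi))
            - 0.5 * beta * np.sum(R ** 2))

# Toy usage with a linear mapping.
rng = np.random.default_rng(4)
X, Wmap = rng.standard_normal((20, 2)), rng.standard_normal((2, 5))
Y = X @ Wmap + 0.1 * rng.standard_normal((20, 5))
print(log_likelihood(Y, X, lambda X: X @ Wmap, beta=100.0))
```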

In the Bayesian formalism both the parameters of the mapping W and the latent locations X are nuisance variables. Seeking the manifold representation of the observed data, we want to formulate the posterior distribution over the parameters given the data, p(X, W|Y). This means inverting the generative mapping through Bayes' Theorem, which implies marginalization over both the latent locations X and the mapping W. Reaching the posterior means we need to formulate prior distributions over X and W. However, this is severely ill-posed as there is an infinite number of combinations of latent locations and mappings that could have generated the data. To proceed, assumptions about the relationship need to be made.

Neural Networks (NN) [27] are models traditionally used for supervised learning. MacKay [39] suggested a generative dimensionality reduction approach using NN labeled Density Networks. Traditional supervised learning using NN implies learning a conditional model over the output variables (class based or continuous) given the input variables through a parametric mapping. This relates each location in the input space with a density over the output space. However, in the case of dimensionality reduction only the output space is given. In [39] a model of the generative mapping parameterized as a Multi-Layer Perceptron (MLP) NN is proposed. By specifying a prior over the locations X in the input space and the parameters W of the generative mapping, the joint probability of the full model can be formulated. Formulating an error function for the model means that the gradients with respect to the unknown latent locations and the parameters of the generative mapping can be computed. However, these gradients involve integrals over X and W which need to be evaluated using sampling based methods such as Monte Carlo sampling. By optimizing the parameters W, a density over the input space can be found.

Tipping and Bishop [65] formulated probabilistic PCA (PPCA) by making the assumption that the observed data is related to the latent locations by a linear mapping y_i = W x_i + ǫ, where ǫ ∼ N(0, β^{−1} I). Placing a spherical Gaussian prior over the latent locations leads to the marginal likelihood,

p(y|W, β^{−1}) = \int p(y|x, W, β^{−1}) p(x) dx,  (2.40)

p(x) = N(0, I).  (2.41)

The parameters of the mapping W can be found by maximum likelihood.
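For the linear mapping the integral in Eq. (2.40) has a well-known closed form, p(y|W, β^{−1}) = N(y | 0, WW^T + β^{−1}I); the snippet below evaluates it (a standard PPCA identity used here for illustration, not a result quoted from the thesis):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Under y = Wx + eps with x ~ N(0, I) and eps ~ N(0, beta^{-1} I),
# the marginal of Eq. (2.40) is Gaussian with covariance W W^T + beta^{-1} I.
rng = np.random.default_rng(5)
D, q, beta = 5, 2, 10.0
W = rng.standard_normal((D, q))
C = W @ W.T + np.eye(D) / beta                 # marginal covariance
marginal = multivariate_normal(mean=np.zeros(D), cov=C)
y = marginal.rvs(random_state=0)               # a sample from the marginal
print(marginal.logpdf(y))                      # its log marginal likelihood
```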

Assuming a linear mapping severely restricts the classes of data-sets that can be modeled. But the prior over the latent locations has to be propagated through the generating mapping to form the marginal likelihood. For the linear mapping, Eq. 2.40 is solvable. However, when considering mappings of more general form it is not clear how to propagate the latent prior through the mapping to make the integral in Eq. 2.40 analytically tractable.

Figure 2.4: Schematic representation of the GTM model: a grid of latent points X is mapped through a nonlinear mapping f parametrised by W to a corresponding grid of Gaussian centers embedded in the observed space. Adapted from [36].

Bishop [9] suggested a specific prior over the latent space making marginalization over more general mappings feasible; the model is referred to as The Generative Topographic Map (GTM). By discretizing the latent space into a regular grid, the prior is specified in terms of a sum of delta functions,

p(x) = \frac{1}{K} \sum_{i=1}^{K} δ(x − x_i),  (2.42)

where [x_1, . . . , x_K] are regularly spaced landmark points over the latent space, Figure 2.4. This prior makes the integral in Eq. 2.40 tractable for general parametric mappings. The GTM specifies a density over the observed data space parameterized by a Gaussian mixture with centers at the locations in the observed space corresponding to the grid points in the latent space.
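As an illustration of the resulting density (a sketch with a hypothetical grid, mapping and parameter names; not the GTM training procedure), the mixture obtained by pushing the prior of Eq. (2.42) through a mapping f can be evaluated as:

```python
import numpy as np

def gtm_density(y, Xgrid, f, beta):
    """Density the GTM assigns to an observed point y: a Gaussian mixture with
    one spherical component per latent grid point, 1/K sum_k N(y | f(x_k), beta^{-1} I)."""
    centers = f(Xgrid)                              # grid points mapped to data space
    D = y.shape[0]
    sq = np.sum((centers - y) ** 2, axis=1)
    comp = (beta / (2.0 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * sq)
    return comp.mean()

# Toy usage: a 2-D latent grid mapped through a fixed nonlinear mapping.
g = np.linspace(-1.0, 1.0, 10)
Xgrid = np.array([(a, b) for a in g for b in g])    # K = 100 regularly spaced points
f = lambda X: np.column_stack([X[:, 0], X[:, 1], np.sin(np.pi * X[:, 0])])
print(gtm_density(np.array([0.2, -0.3, 0.5]), Xgrid, f, beta=5.0))
```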

The Gaussian latent variable model formulates dimensionality reduction as a probabilistic model, which provides a model and an associated likelihood function. Further, modeling the generative mapping removes the reliance on local noise-sensitive measures in the data. This makes the generative models applicable to a larger range of modeling scenarios compared to the spectral algorithms. However, PPCA is a strictly linear model, and the energy function associated with the GTM is non-convex, which means that we cannot be guaranteed to find the global optimum. Further, the GTM suffers from problems associated with mixture models in high dimensional spaces [59].

2.7 Gaussian Processes

A D dimensional Gaussian distribution is defined by a D × 1 mean and a D × D covariance matrix. A Gaussian process (GP) is the infinite dimensional generalization of the distribution, where the mean and covariance are defined not by fixed size matrices but by a mean function µ(x) and a covariance function k(x, x′), defined over an infinite index set x,

GP(µ(x), k(x, x′)).  (2.43)

Evaluating a GP over a finite index set reduces the process to a distribution with dimensionality equal to the cardinality of the evaluation set. The covariance function needs to specify a valid covariance matrix when evaluated for any finite subset of its domain; this requires the covariance function to come from the same family of functions as Mercer kernels [41, 45].

A GP generalizes the concept of a Gaussian distribution to infinite dimensions; this has been exploited in machine learning by applying GPs to specify distributions over infinite objects. One such application is when we are interested in modeling relationships defined over continuous domains, such as functions. Say we are interested in modeling a functional relationship f between an input domain X ∈ ℜ^D and a target domain Y ∈ ℜ,

y_i = f(x_i).  (2.44)

A GP can be used to specify a prior distribution over the relationship, f ∼ GP(µ, k). In Figure 2.5 samples from a GP prior with a covariance function specified by an RBF kernel and a constant zero mean function are shown. As can be seen, the samples are all smooth with respect to the input locations x. However, there is an additional property that the GP needs to fulfill to specify a valid distribution over f: consistency. A function f is consistent in the sense that the relationship between the target and input domain is fixed. For a distribution, this implies that evaluating the distribution over a finite subset X_i ⊂ X does not alter the distribution over any other subset X_j ⊂ X, even if X_i ∩ X_j = ∅. It is clear that a GP satisfies this condition if the covariance function specifies a valid covariance matrix when evaluated over a finite number of points, as for a Gaussian distribution,

(y_1, y_2) ∼ N(µ, Σ) ⇒ y_1 ∼ N(µ_1, Σ_{11}), y_2 ∼ N(µ_2, Σ_{22}).  (2.45)

In regression we are interested in modeling the relationship between two domains X ∈ ℜ^{D} and Y ∈ ℜ from a set of observations x_i ∈ X and y_i ∈ Y, where i = 1 . . . N. Assuming a functional relationship, and that the observations have been corrupted by additive Gaussian noise, we are interested in modeling,

y_i = f(x_i) + ǫ,    (2.46)

where ǫ ∼ N(0, β^{−1}). We are interested in encoding our prior knowledge about the relationship in a distribution over f. For regression we usually have a preference for functions varying smoothly over X,

lim_{x_i → x_j^+} |f(x_i) − f(x_j)| = lim_{x_i → x_j^−} |f(x_i) − f(x_j)| = 0,   ∀x_j ∈ X.

This assumption can be encoded by the GP through the choice of covariance function k(x, x′). The covariance function encodes how we expect variables to vary together,

k(x, x′) = E((f(x) − µ(x))(f(x′) − µ(x′))),

which means that we can encode the smoothness behavior over X by choosing a covariance function which is smooth over the same domain. The mean function µ(x) = E(f(x)) encodes the expected value of f. By translating the observed data to be centered around zero the mean function can, for simplicity, be chosen as the constant zero function.

2.7.1 Prediction

Having specified a prior distribution encoding our knowledge (and preference) about the relationship between X and Y, we are interested in inferring the location y∗ corresponding to a previously unobserved point x∗ ∈ X.



Figure 2.5: Samples from a GP prior using an RBF covariance function and a constant zero mean function. As can be seen, each sample is smooth over the input domain.

The joint distribution of the observed data (y, x) and the unobserved point (y∗, x∗) can be written as follows,

[y; y∗] ∼ N(0, [K(X, X) + β^{−1}I,  K(X, x∗); K(x∗, X),  K(x∗, x∗) + β^{−1}]).

Predictions over the unobserved locations are made from the posterior distribution. The posterior is formulated by conditioning the joint distribution on the observed data. Conditioning two Gaussians results in a Gaussian distribution, defined by its mean and covariance,

ȳ(x∗) = k(x∗, X)(K + β^{−1}I)^{−1}Y
σ^2(x∗) = (k(x∗, x∗) + β^{−1}) − k(x∗, X)(K + β^{−1}I)^{−1}k(X, x∗),    (2.47)

where K = k(X, X). These are the central predictive equations in the GP framework. In Figure 2.6 samples from the posterior distribution of a GP with an RBF covariance function and a constant zero mean function are shown. As can be seen, each function drawn from the distribution passes through the training data points.



Figure 2.6: Samples from a GP posterior using an RBF covariance function and a constant zero mean function. Each sample from the posterior distribution passes through the previously observed data shown as red dots.
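To make the predictive equations concrete, the following is a minimal numpy sketch of Eq. 2.47, assuming an RBF covariance with unit parameters, a noise precision β = 100 and a small made-up 1-D data set; it is illustrative only, not the implementation used in this thesis.

```python
import numpy as np

def rbf(A, B, theta1=1.0, theta2=1.0):
    """RBF covariance k(x, x') = theta1 * exp(-theta2/2 * ||x - x'||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_predict(X, y, Xstar, beta=100.0):
    """Posterior mean and variance at Xstar, Eq. 2.47."""
    K = rbf(X, X) + np.eye(len(X)) / beta            # K(X,X) + beta^{-1} I
    Kinv = np.linalg.inv(K)
    ks = rbf(Xstar, X)                               # k(x*, X)
    mean = ks @ Kinv @ y
    var = (rbf(Xstar, Xstar).diagonal() + 1.0 / beta
           - np.sum(ks @ Kinv * ks, axis=1))         # predictive variance
    return mean, var

# toy 1-D regression data (illustrative only)
X = np.linspace(-3, 3, 10)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.randn(10)
Xs = np.linspace(-3, 3, 50)[:, None]
mu, var = gp_predict(X, y, Xs)
```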

2.7.2 Training

The covariance function specifies the class of functions most prominent in the prior. A commonly used covariance function is the RBF kernel,

k(x_i, x_j) = θ_1 exp(−(θ_2/2) ||x_i − x_j||^2_2).

The free parameters {θ_1, θ_2, . . .} of the covariance function, together with the noise variance β^{−1}, are the hyper-parameters^1 of the GP, Φ = {θ_1, . . . , β}. Our knowledge about the relationship is encoded in the prior over f by setting the values of Φ. However, in the presence of data we can directly learn the hyper-parameters from the observations.

^1 Assuming that the mean function has no free parameters.


Assuming that the observations have been corrupted by additive Gaussian noise, Eq. 2.46, we can formulate the likelihood of the data. Combining the likelihood with the prior, we arrive at the marginal likelihood through integration over f,

p(Y|X, Φ) = ∫ p(Y|f) p(f|X, Φ) df.    (2.48)

From the marginal likelihood we can seek the maximum likelihood solution for the hyper-parameters Φ,

Φ̂ = argmax_Φ p(Y|X, Φ).    (2.49)

This is referred to as training in the GP framework. It might seem undesirable to optimize over the hyper-parameters, as the model might over-fit the data^2. Inspection of the logarithm of equation (2.48) for a one-dimensional output y,

log p(y|X) = −(1/2) y^T K^{−1} y − (1/2) log|K| − (N/2) log 2π,    (2.50)

shows two “competing terms”: the data-fit term, −(1/2) y^T K^{−1} y, and the complexity term, −(1/2) log|K|. The complexity term measures and penalizes the complexity of the model, while the data-fit term measures how well the model fits the data. This “competition” encourages the GP model not to over-fit the data.

^2 By setting the noise variance β^{−1} to zero the function f will pass exactly through the observed data Y.
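A sketch of how the log marginal likelihood of Eq. 2.50 can be evaluated and used for training; the RBF kernel, the toy data and the crude grid search over Φ = {θ_1, θ_2, β} are assumptions for illustration, whereas in practice a gradient based optimizer would be used.

```python
import numpy as np

def rbf(A, B, theta1, theta2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def log_marginal(X, y, theta1, theta2, beta):
    """log p(y|X) = -1/2 y^T K^{-1} y - 1/2 log|K| - N/2 log 2pi, Eq. 2.50."""
    N = len(X)
    K = rbf(X, X, theta1, theta2) + np.eye(N) / beta
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))          # equals -1/2 log|K|
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)

X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)

# naive grid search over Phi = {theta1, theta2, beta}
grid = [(t1, t2, b) for t1 in (0.5, 1, 2) for t2 in (0.5, 1, 2) for b in (10, 100)]
best = max(grid, key=lambda p: log_marginal(X, y, *p))
```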


2.7.3 Relevance Vector Machine

In this thesis our main use of Gaussian processes will be as a tool to model functions. A different regression model is the Relevance Vector Machine (RVM) [63, 64]. In the RVM the mapping y_i = f(x_i) is modeled as a linear combination of the responses to a kernel function of the training data,

f(x_i) = Σ_{j=1}^{N} w_j c(x_i, x_j) + w_0,    (2.51)

where w = [w_0, . . . , w_N] are the model weights and c(·, x_j) the kernel basis functions. One approach to finding the weights of the model would be to minimize a reconstruction error over the training data. However, this is likely to lead to severe over-fitting, as we are trying to estimate N + 1 parameters from N inputs. Further, predictions would only be point estimates with no associated uncertainty.

The RVM was suggested as a model to tackle the above issues. The model specifies a likelihood model of the data through which the parameters can be found, associating each prediction with an uncertainty. Further, to avoid over-fitting the data, a prior is specified over the weights w. This prior encourages the model to push as many weights w_i towards 0 as possible, making the linear combination in Eq. 2.51 depend on as few basis functions c(·, x_j) as possible.

Assuming additive Gaussian noise, the likelihood of the model is formulated as,

p(y|w, σ^2) = (2πσ^2)^{−N/2} exp(−(1/2σ^2) ||y − C̃w||^2),    (2.52)


where C̃ is referred to as the design matrix, with elements,

C̃_{ij} = c(x_i, x_{j−1}),   C̃_{i1} = 1.

A Gaussian prior is placed over the weights w,

p(w|α) = Π_{i=0}^{N} N(w_i | 0, α_i^{−1}),    (2.53)

controlled by the hyper-parameters α = [α_0, . . . , α_N]. The prior Eq. 2.53 over the model parameters w encourages the weights w to be zero.

Through Bayes’ rule the posterior over the weights p(w|y, α, σ^2) can be formulated, from which, by integration over the weights w, the marginal likelihood of the model can be found,

p(y|α, σ^2) = (2π)^{−N/2} |B + C̃A^{−1}C̃^T|^{−1/2} exp(−(1/2) y^T(B + C̃A^{−1}C̃^T)^{−1}y),    (2.54)

where A = diag(α_0, . . . , α_N) and B = σ^2 I. The optimal parameters α and σ^2 can be found by optimizing the marginal likelihood. Due to the prior Eq. 2.53, it is reported in [63] that, when optimizing the marginal likelihood, many of the hyper-parameters α_i tend to approach infinity, meaning that the associated weight is close to zero. This means that the corresponding kernel function has little influence on the prediction Eq. 2.51. A pruning scheme is incorporated in the optimization that removes weights tending to zero from the expansion, forcing the model to explain the data using few kernel functions, leading to a sparse model.

As noted in [64, 45, 14], the RVM is a special case of a GP with covariance function,

k(x_i, x_j) = Σ_{l=1}^{N} (1/α_l) c(x_i, x_l) c(x_j, x_l),    (2.55)

where c is the kernel basis function as in Eq. 2.51. The covariance function is different in form as it depends on the training data x_l. Further, it corresponds to a degenerate covariance matrix having at most rank N, as it is an expansion around the training data. Training the RVM is the same as optimizing a GP regression model, i.e. finding the hyper-parameters that maximize the marginal likelihood of the model. However, as noted in [45], the covariance function of the RVM has some undesirable effects. Using a standard RBF kernel for the GP, the predictive variance associated with a point far away from the training data will be large, i.e. the model will be uncertain in regions where it has not previously seen data. The opposite is true using the covariance function specified by the RVM, as both terms in the predictive variance Eq. 2.47 will be close to zero, while for a standard RBF kernel the first term will be large.
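The following sketch illustrates this effect by building the degenerate covariance of Eq. 2.55 for assumed values of α and an assumed RBF form for the basis function c, and comparing the predictive variance far from the training data with that of a standard RBF covariance; it is illustrative and is not the RVM training procedure.

```python
import numpy as np

def c(A, B, gamma=1.0):
    """Kernel basis function c(x, x') used in the RVM expansion (assumed RBF form)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * gamma * d2)

def rvm_cov(A, B, X, alpha):
    """Degenerate covariance k(x, x') = sum_l (1/alpha_l) c(x, x_l) c(x', x_l), Eq. 2.55."""
    return (c(A, X) / alpha) @ c(B, X).T

X = np.random.randn(15, 1)                     # training inputs
alpha = np.ones(15)                            # assumed hyper-parameters
beta = 100.0

far = np.array([[25.0]])                       # a point far from the data
K = rvm_cov(X, X, X, alpha) + np.eye(15) / beta
ks = rvm_cov(far, X, X, alpha)
var_rvm = rvm_cov(far, far, X, alpha) + 1/beta - ks @ np.linalg.inv(K) @ ks.T
var_rbf = c(far, far) + 1/beta - c(far, X) @ np.linalg.inv(c(X, X) + np.eye(15)/beta) @ c(far, X).T
# var_rvm is close to the noise level, while var_rbf stays large: the RVM is
# (over-)confident in regions where it has seen no data.
```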

2.8 GP-LVM

Lawrence [33] suggested an alternative Gaussian latent variable model capable of handling non-linear generative mappings while at the same time avoiding the problems associated with the GTM. Both PPCA and the GTM specify a prior over the latent locations and seek a maximum likelihood solution for the parameters of the generative mapping. However, from a Bayesian perspective both the mapping and the latent locations are nuisance parameters and should therefore be marginalized. In Lawrence’s formulation, the prior is specified over the mapping instead of the latent locations, and the marginal likelihood is obtained by marginalizing the mapping. Using the GP framework a rich and flexible prior over non-linear mappings can be specified. The algorithm is referred to as the Gaussian Process Latent Variable Model (GP-LVM).

By marginalizing over the mapping f, the GP-LVM proceeds by seeking the maximum likelihood solution for the latent locations X and the hyper-parameters Φ of the GP,

{X̂, Φ̂} = argmax_{X,Φ} p(Y|X, Φ)
        = argmax_{X,Φ} ∫ p(Y|X, f, Φ) p(f) df,    (2.56)

where p(f) = GP(µ(x), k(x, x′)). The posterior distribution of the data can be written as,

p(X, Φ|Y) ∝ p(Y|X, Φ) p(X) p(Φ).    (2.57)

In the standard GP-LVM formulation uninformative priors [33] are specified over the latent locations and the hyper-parameters. Learning in the GP-LVM framework consists of minimizing the negative log posterior of the data with respect to the locations of the latent variables X and the hyper-parameters θ of the process. With a simple spherical Gaussian prior over the latent locations and an uninformative prior over the parameters, this leads to the following objective,

L = L_r + Σ_i ln θ_i + (1/2) Σ_i ||x_i||^2.    (2.58)

For a covariance function specifying a distribution over linear functions a closed form solution to Eq. 2.56 exists [33]. However, for general covariance functions the solution is found through gradient based optimization.

As previously discussed, infinitely many solutions to the latent variable formulation of dimensionality reduction exist; to proceed, the solution needs to be constrained by prior information. The GP-LVM solution is constrained by the GP marginal likelihood’s trade-off between smooth solutions and a good data-fit, Eq. 2.50. By fixing the dimensionality of the latent representation a solution can be found.
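A sketch of the resulting objective, Eq. 2.58, assuming an RBF covariance and treating β as one of the θ_i; the data and the latent initialization are placeholders, and in practice the gradients of this function with respect to X and the hyper-parameters would be passed to a gradient based optimizer.

```python
import numpy as np

def rbf(X, theta1, theta2):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gplvm_objective(X, Y, theta, beta):
    """Negative log posterior of Eq. 2.58 (up to constants): the GP marginal
    likelihood term L_r plus the simple priors on hyper-parameters and latents."""
    N, D = Y.shape
    K = rbf(X, *theta) + np.eye(N) / beta
    sign, logdet = np.linalg.slogdet(K)
    Lr = 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)) \
         + 0.5 * N * D * np.log(2 * np.pi)
    return Lr + np.sum(np.log([*theta, beta])) + 0.5 * np.sum(X**2)

# toy usage: a 2-D latent space for 10-D observations, X would normally be
# initialized from e.g. a PCA embedding of Y
Y = np.random.randn(30, 10)
X = np.random.randn(30, 2) * 0.1
val = gplvm_objective(X, Y, theta=(1.0, 1.0), beta=100.0)
```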

2.8.1 Latent Constraints

The GP-LVM objective seeks the locations of the latent coordinates X that maximize the marginal likelihood of the data. One advantage of directly optimizing the latent locations is that additional constraints on X can easily be incorporated in the GP-LVM framework. In the following section we will review some of the extensions in terms of latent constraints that have been applied to the GP-LVM.

The fundamental difference between spectral and generative dimensionality reduction is the assumption made by the spectral algorithms that the latent coordinates can be found as a smooth mapping from the observed data. This means that we are interested in finding latent locations such that the locality in the observed data is preserved.


Further, this assumption implies that a smooth inverse to the generative mapping, Eq. 2.4, is assumed to exist. This assumption constrains the spectral algorithms and makes their objective function convex. Even though in this thesis we argue that the assumption of the existence of a smooth inverse to the generative mapping is a limitation, there are modeling scenarios where we are interested in retaining the locality of the observed data in the latent representation, such as, for example, when modeling motion capture data [35]. In [35] a constrained form of the GP-LVM is presented. Each latent location x_i is represented as a smooth mapping, referred to as a back-constraint, from the observed data,

x_i = g(y_i, B),    (2.59)

with parameters B. Rather than directly optimizing the latent locations, the incorporation of the back-constraints alters the GP-LVM objective to seek the parameters of the back-constraints, B, that maximize the likelihood of the observed data Y.

As discussed, the GP-LVM objective is severely under-constrained in the general case. This means that a good initialization of the latent locations is of essential importance in order to find a good solution. However, when learning a back-constrained model the preservation of locality in the observed space will in practice constrain the solution sufficiently, such that the algorithm becomes less reliant on the initialization of the latent locations, which are parameterized by the parameters of the back-constraint B, Eq. 2.59. This means that for practical purposes we can reach the solution of the back-constrained GP-LVM with a less careful initialization compared to the standard GP-LVM model.


In the standard formulation of the GP-LVM an uninformative prior is specified over the latent locations X, Eq. 2.58. Rather than specifying this uninformative prior, in [67] a model incorporating an informative, class based prior distribution over the latent locations is suggested. Incorporating this means that we can learn a latent representation that can be interpreted in terms of class. One advantage of this, explored in [67], is to learn latent representations as input spaces for classifiers. The objective in [67] is, rather than just efficiently representing the data as in the standard GP-LVM, to find latent representations which are well suited for classification, i.e. where each class is easily separable. Practically this is achieved by incorporating the class based objective of Linear Discriminant Analysis (LDA), which aims to minimize within class variability and maximize between class separability. This can be encoded in the GP-LVM by replacing the uninformative spherical Gaussian prior over the latent coordinates with,

p(X) = (1/Z_d) exp((1/λ^2) tr(S_W^{−1} S_B)),    (2.60)

where S_B encodes the between class and S_W the within class variability, and Z_d is a normalizing constant. The between class variability is computed as,

S_B = Σ_{i=1}^{L} (N_i/N) (µ_i − µ)(µ_i − µ)^T,    (2.61)

where µ_i is the mean of class i and µ the mean over all classes. The within class variability is computed as follows,

S_W = (1/N) Σ_{i=1}^{L} Σ_k (x_k^i − µ_i)(x_k^i − µ_i)^T.    (2.62)
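A sketch computing the scatter matrices and the (unnormalized) log of the discriminative prior Eq. 2.60; the class labels, the latent points and λ are illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class scatter S_W and between-class scatter S_B, Eqs. 2.61-2.62."""
    N, q = X.shape
    mu = X.mean(0)
    S_W = np.zeros((q, q))
    S_B = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c) / N
        S_B += len(Xc) / N * np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

def log_discriminative_prior(X, labels, lam=1.0):
    """log p(X) up to the normalizer Z_d; large values favour separable classes."""
    S_W, S_B = scatter_matrices(X, labels)
    return np.trace(np.linalg.solve(S_W, S_B)) / lam**2

# two well-separated toy classes in a 2-D latent space
X = np.vstack([np.random.randn(20, 2) + [2, 0], np.random.randn(20, 2) - [2, 0]])
labels = np.array([0] * 20 + [1] * 20)
lp = log_discriminative_prior(X, labels)
```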

Similarly to the discriminative GP-LVM, a model incorporating constraints on the latent coordinates was presented in [70]. By constraining the topology of the latent space, representations with interpretable latent dimensions can be found. The model is applied to human motion data to perform non-trivial transitions between styles of motion not present in the training data.

In [77, 76] a model referred to as the Gaussian Process Dynamical Model (GPDM) is presented. The GPDM is a latent variable model derived from the GP-LVM that incorporates time series information in the data to learn a latent representation that respects the dynamics of the data. This is achieved by incorporating an auto-regressive model on the latent space in addition to the generative mapping,

x_t = h(x_{t−1}) + ǫ_dyn.    (2.63)

By specifying a GP prior over the mapping h, the dynamic mapping can, similarly to the generative mapping f, be marginalized from the GP-LVM to form the GPDM objective,

{X̂, θ̂, θ̂_dyn} = argmax_{X,θ,θ_dyn} p(X, θ|Y) p(X|θ_dyn).    (2.64)

Many types of data are generated from models with a known underlying or latent structure. This is, for example, true for the human body, whose motion can be decomposed into a tree structure. For modeling this type of data the hierarchical GP-LVM (HGP-LVM) model was developed [34]. In the HGP-LVM a hierarchical tree structure of latent spaces is learned, where a latent space acts as a prior on the latent coordinates of a space further down in the hierarchy.


2.8.2 Summary

In the context of our presentation it might seem out-of-place to present the GP-LVM separately from the generative methods. Our motivation for this is two-fold: first, the framework we are about to present is an extension of the GP-LVM; secondly, as stated above, for general covariance functions the GP-LVM solution is found through gradient based methods. To avoid the effect of local minima in the log-likelihood the latent locations X need to be initialized close to the global optimum. The approach suggested in [33] was to initialize with the solution from a different dimensionality reduction algorithm. For the general case, where we cannot assume a linear manifold, the lack of reliable non-linear generative methods means that the initialization is usually taken from the solution of a spectral algorithm. This means that even though the GP-LVM is a purely generative model, for practical applications, in the general case, it relies on the existence of an analogous spectral algorithm. In this context the GP-LVM is a generative algorithm that sets out to improve the solution given by a spectral algorithm; this motivates our separation of the GP-LVM from the other reviewed generative dimensionality reduction algorithms.

The GP-LVM framework has been shown to be very flexible and has been applied to model a large variety of different data: motion capture [24, 77, 76, 75], tracking [71, 69, 72, 28], human pose estimation [62], and modeling of deformable surfaces [49], to name a small subset.


2.9 Shared Dimensionality Reduction

Many modeling scenarios are characterized by several different types of observations that share a common underlying structure. This can be the same text written in two different languages, where both representations are different in form but have the same meaning or underlying concept, or two image-sets which share the same modes of variability, for example pose or lighting. This correspondence can be exploited for dimensionality reduction of the data; this we will refer to as shared dimensionality reduction. In shared dimensionality reduction each observation space is generated from the same latent variable X. We will focus on the scenario where we have two observation spaces Y and Z which have been generated from the latent variable X through generative mappings f_Y and f_Z,

y_i = f_Y(x_i)    (2.65)
z_i = f_Z(x_i).    (2.66)

Shared spectral dimensionality reduction is built on the assumption that smooth inverses to both the generative mappings f_Y and f_Z exist. That both mappings are invertible implies that the observation spaces are related through a bijection, meaning that the location in one observation space is sufficient for determining the corresponding location in the other observation space. By shared spectral dimensionality reduction we therefore mean the alignment of several intrinsic representations into a single shared low-dimensional representation. In [26, 25, 82] algorithms for aligning two manifolds through the use of proximity graphs, for either a full or a partial correspondence, are presented.


A GP-LVM model learning a shared latent representation was suggested in [52]. The model learns two GP regressors, one for each separate observation space, from a shared latent variable by maximizing the joint marginal likelihood,

{X̂, Φ̂_Y, Φ̂_Z} = argmax_{X,Φ_Y,Φ_Z} p(Y, Z|X, Φ_Y, Φ_Z)
             = argmax_{X,Φ_Y,Φ_Z} p(Y|X, Φ_Y) p(Z|X, Φ_Z).    (2.67)

The model was applied to learn a shared latent structure between the joint angle space of a humanoid robot and a human. Inference between the two different observation spaces was done by learning GP regressors from the observed spaces onto the latent space. The suggested model does not make any direct assumption about the form of the generative mappings at training time. However, as inference within the model is done by training mappings from the observed data back onto the learned latent representation, the inverse mapping is assumed to exist, which is why this model retains the central assumption from shared spectral dimensionality reduction that the generative mappings are invertible.

2.10 Feature Selection

Feature extraction is the process of simplifying the amount of resources needed to represent a data set accurately. This is in contrast to feature selection, where only features with a positive impact in relation to a certain objective are retained in the representation. This can for example be features that are able to discriminate between two different classes. Dimensionality reduction algorithms are instances of feature extraction.


Figure 2.7: Graphical model corresponding to the formulation of probabilistic CCA in [4].

A closely related algorithm for feature selection is Canonical Correlation Analysis (CCA). Given two sets of observations Y and Z with known correspondences, CCA finds directions W_Y ∈ Y and W_Z ∈ Z such that the correlation, Eq. 2.68, between YW_Y and ZW_Z is maximized,

ρ = tr(W_Y^T Y^T Z W_Z) / (tr(W_Z^T Z^T Z W_Z) tr(W_Y^T Y^T Y W_Y))^{1/2}.    (2.68)

Finding the first set of directions w_1^Y and w_1^Z is formulated as the following constrained optimization problem,

argmax_{w_1^Y, w_1^Z} (Y w_1^Y)^T Z w_1^Z    (2.69)
subject to: (w_1^Y)^T Y^T Y w_1^Y = (w_1^Z)^T Z^T Z w_1^Z = 1.    (2.70)

Further orthogonal directions can be found iteratively. The scaling of each basis is arbitrary. The constraints in Eq. 2.70 fix the variance of the canonical variates Y w_1^Y and Z w_1^Z to 1. This ensures that maximizing Eq. 2.69 equates to maximizing the correlation Eq. 2.68.
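A sketch of classical CCA solved through whitening and an SVD, which satisfies the constraints of Eq. 2.70 while maximizing Eq. 2.68; the small ridge term and the toy data are implementation assumptions, not taken from the text.

```python
import numpy as np

def cca(Y, Z, n_dirs=1, ridge=1e-8):
    """Directions W_Y, W_Z maximizing the correlation of Eq. 2.68
    subject to the unit-variance constraints of Eq. 2.70."""
    Cyy = Y.T @ Y + ridge * np.eye(Y.shape[1])
    Czz = Z.T @ Z + ridge * np.eye(Z.shape[1])
    Cyz = Y.T @ Z
    # whiten each space, then take the SVD of the cross-covariance
    Ly, Lz = np.linalg.cholesky(Cyy), np.linalg.cholesky(Czz)
    M = np.linalg.solve(Ly, Cyz) @ np.linalg.inv(Lz).T
    U, s, Vt = np.linalg.svd(M)
    W_Y = np.linalg.solve(Ly.T, U[:, :n_dirs])
    W_Z = np.linalg.solve(Lz.T, Vt[:n_dirs].T)
    return W_Y, W_Z, s[:n_dirs]            # s holds the canonical correlations

# toy data sharing one signal
t = np.linspace(-1, 1, 100)[:, None]
Y = np.hstack([t, np.random.randn(100, 3)])
Z = np.hstack([-2 * t, np.random.randn(100, 2)])
W_Y, W_Z, corr = cca(Y - Y.mean(0), Z - Z.mean(0), n_dirs=1)
```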

In [4] the probabilistic form of CCA is derived through the maximum likelihood solution to the following model,

x ∼ N(0, I)    (2.71)
y|x ∼ N(W_Y x, Ψ_Y)    (2.72)
z|x ∼ N(W_Z x, Ψ_Z).    (2.73)

The model corresponds to generative CCA as long as the within-set, or non-shared, variations can be sufficiently described by a Gaussian noise model and the generative mappings are linear. The graphical model corresponding to the proposed model is shown in Figure 2.7. In [37] the model is extended to allow for non-linear mappings.
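A short sketch sampling from the generative model of Eqs. 2.71–2.73; the dimensions, mappings and noise covariances are arbitrary illustrative choices.

```python
import numpy as np

# Sample from the probabilistic CCA model of Eqs. 2.71-2.73.
q, D_Y, D_Z, N = 2, 5, 4, 200
W_Y = np.random.randn(D_Y, q)
W_Z = np.random.randn(D_Z, q)
Psi_Y = 0.1 * np.eye(D_Y)                 # within-set (non-shared) variation
Psi_Z = 0.1 * np.eye(D_Z)

X = np.random.randn(N, q)                 # x ~ N(0, I)
Y = X @ W_Y.T + np.random.multivariate_normal(np.zeros(D_Y), Psi_Y, N)
Z = X @ W_Z.T + np.random.multivariate_normal(np.zeros(D_Z), Psi_Z, N)
```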

2.11 Summary

In this chapter we have outlined some of the background upon which the material in the following chapters will build. Through elementary linear algebra and by introducing the concept of “The Curse of Dimensionality” we have motivated the field of dimensionality reduction, which encapsulates the work presented in this thesis. We have reviewed algorithms from the two main strands of work composing the field of dimensionality reduction, exemplifying the strengths and weaknesses of each approach. Further, we detailed the basics of probabilistic modeling from the perspective of Gaussian processes, which will be fundamental for the following chapters.

In the next chapter we will proceed to present a novel model for generative dimensionality reduction in the shared modeling scenario based on Gaussian processes.




Chapter 3

Shared GP-LVM

3.1 Introduction

Dimensionality reduction is the task of reducing the number of dimensions required to describe a set of data. The previous chapter introduced dimensionality reduction and gave the necessary mathematical background upon which these techniques are built. We divide dimensionality reduction into two groups of algorithms: generative and spectral. Generative techniques are more versatile and applicable to a larger range of modeling scenarios compared to the spectral techniques. However, the objective function of most generative algorithms is, in the general case, severely under-constrained. The spectral group of algorithms avoid this by constraining the solution to one where a smooth inverse to the generative mapping exists. In scenarios where we are given observations in multiple different forms we can exploit correspondences between observations when performing dimensionality reduction. This we refer to as shared dimensionality reduction. The following chapter will introduce two generative models that exploit correspondences between observations when performing dimensionality reduction.

In many modeling scenarios we have access to multiple different observations of the same underlying phenomenon. Often a significantly different cost, computational or monetary, is associated with acquiring samples from each domain. In such scenarios it is of interest to infer the location of an expensive sample from one which we can more easily acquire. One example, which we will return to in the applications chapter, is image based human pose estimation [69, 57]. This is the task of estimating the pose of a human from the evidence given in an image. Images can be captured relatively easily, while recording the actual pose of a human is associated with a significant cost, often requiring special rigs and expensive equipment. Therefore inferring the pose from image data is of great interest.

Learning the shared GP-LVM model presented in [52] is a three stage process. In the first stage PCA is applied separately to each of the two sets of observations. In the second stage the GP-LVM model is trained using the average of the two PCA solutions as initialization. This stage means that we have trained GP regressors modeling the generative mappings from the latent space onto the observed spaces. In the third and final stage a second set of GP regressors is trained that maps back from each of the observed spaces onto the latent space. Even though not explicitly stated, this implies that the generative mappings are assumed to have a smooth inverse. We will now proceed to introduce a more general model capable of transferring locations in one observation space to a corresponding space.



Figure 3.1: The left image shows the conditional model where a set of observed data Y have been generated by Z. The image to the right shows the shared GP-LVM model which we suggest as a replacement for the conditional model on the left. The back-mapping from the output space that constrains the latent space is represented by the dashed line.

3.2 Shared GP-LVM

Given two sets of corresponding observations Y = [y_1, . . . , y_N] and Z = [z_1, . . . , z_N], where y_i ∈ ℜ^{D_Y} and z_i ∈ ℜ^{D_Z}, we assume that each observation has been generated from the same low-dimensional manifold, corrupted by additive Gaussian noise,

y_i = f^Y(x_i) + ǫ_Y,   ǫ_Y ∼ N(0, β_Y^{−1} I)
z_i = f^Z(x_i) + ǫ_Z,   ǫ_Z ∼ N(0, β_Z^{−1} I),    (3.1)

where x_i ∈ ℜ^q with q/D_Y < 1 and q/D_Z < 1.

Our objective is to create a model from which the location z_i ∈ Z corresponding to a given location y_i ∈ Y can be determined. This can be done by modeling the conditional distribution over the input space Y given the output space Z, as shown in the left image in Figure 3.1. The conditional distribution will associate each location in the output space with a location in the input space. However, for many applications the observation spaces are likely to be high-dimensional, which makes modeling this distribution problematic, especially in scenarios with a limited amount of training data. Therefore, rather than modeling p(y_i|z_i) directly, given that the representation of Z is redundant, we can find a reduced dimensional representation X of the output space. This means that we can model the conditional distribution over this new, dimensionality reduced representation, p(y_i|x_i), which should be significantly easier. However, rather than simply modeling the conditional distribution over a dimensionality reduced representation of the output space, we will formulate an objective which learns the latent representation together with the mappings.

From the Gaussian noise assumption, Eq. 3.1, we can formulate the likelihood of the data,

P(Y, Z|f^Y, f^Z, X, Φ_Y, Φ_Z) = Π_{i=1}^{N} p(y_i|f^Y, x_i, Φ_Y) p(z_i|f^Z, x_i, Φ_Z).    (3.2)

Placing Gaussian process priors and integrating over the mappings leads to the marginal likelihood of the shared GP-LVM model,

P(Y, Z|X, Φ_Y, Φ_Z) = Π_{i=1}^{N} p(y_i|x_i, Φ_Y) p(z_i|x_i, Φ_Z)    (3.3)
p(y_i|x_i, Φ_Y) = ∫ p(y_i|f^Y, x_i, Φ_Y) p(f^Y) df^Y
p(z_i|x_i, Φ_Z) = ∫ p(z_i|f^Z, x_i, Φ_Z) p(f^Z) df^Z.

We are interested in finding a low-dimensional representation X which can be used as a complete substitute for Z. This means that the relationship between these two spaces should take the form of a bijection, i.e. that each point z_i ∈ Z is represented by a unique location x_i ∈ X and vice versa.


There is nothing in the shared model that encourages this asymmetry. However, this can be achieved by incorporating back-constraints [35], which represent the latent locations as a smooth parametric mapping from the output space Z,

x_i = h(z_i, W).    (3.4)

This leads to the following objective,

{Ŵ, Φ̂_Y, Φ̂_Z} = argmax_{W,Φ_Y,Φ_Z} P(Y, Z|W, Φ_Y, Φ_Z).    (3.5)

The objective can be optimized using gradient based methods. In Figure 3.1 the leftmost image shows the conditional model and the rightmost image shows the back-constrained shared GP-LVM model.
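A sketch of the joint marginal likelihood of Eq. 3.3 as it would be evaluated during optimization, assuming RBF covariances and placeholder data; the back-constrained objective of Eq. 3.5 would evaluate the same quantity with X given by h(Z, W) and optimize W instead of X.

```python
import numpy as np

def rbf(X, theta1=1.0, theta2=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_nll(X, Y, theta, beta):
    """Negative log marginal likelihood of one GP-LVM term, -log p(Y|X, Phi)."""
    N, D = Y.shape
    K = rbf(X, *theta) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)) \
           + 0.5 * N * D * np.log(2 * np.pi)

def shared_gplvm_nll(X, Y, Z, Phi_Y, Phi_Z):
    """-log P(Y, Z | X, Phi_Y, Phi_Z) of Eq. 3.3: the two GPs share the latent X."""
    return gp_nll(X, Y, *Phi_Y) + gp_nll(X, Z, *Phi_Z)

# placeholder data and latent initialization
Y, Z = np.random.randn(40, 10), np.random.randn(40, 8)
X = np.random.randn(40, 3) * 0.1
nll = shared_gplvm_nll(X, Y, Z, Phi_Y=((1.0, 1.0), 100.0), Phi_Z=((1.0, 1.0), 100.0))
```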

3.2.1 Initialization

There are several different options for how to initialize the locations in the latent space for this model. One could, as with the shared GP-LVM in [52], initialize using the average of the embeddings given by a spectral algorithm, or with the solution to one of the shared spectral algorithms. However, we want the latent representation to be a complete representation of the output space Z. This means that we can initialize using the solution of a spectral algorithm with the output space Z as input. Underpinning this model is the assumption that the non-back-constrained observation spaces are governed by a subset of the degrees of freedom of the back-constrained observation space.


By initializing using the solution of a spectral algorithm applied to the output space, we seek the solution to the latent representation where the intrinsic representation of the input space aligns with that of the output space.

3.2.2 Inference

Once the model has been trained we are interested in inferring the output location corresponding to a point in the input space. As we are not assuming a functional relationship between Y and Z, we cannot simply learn a mapping as in [52]. Given the location y_i in the input space Y we want to infer the corresponding location z_i in the output space Z. The back-constraint from the output space to the latent space encodes the bijective relationship between Z and X. This implies that any multi-modalities in the relationship between Y and Z have been contained in the mapping from the latent representation X to the input space Y. This means that to recover the location in the high-dimensional output space we only need to find the corresponding point in the much lower-dimensional latent space,

x̂_i = argmax_{x_i} p(y_i|x_i, X, Φ_Y).    (3.6)

Having found the location in the latent space, the corresponding location in the output space can be found through the mean prediction of the uni-modal posterior distribution as,

ẑ_i = argmax_{z_i} p(z_i|x̂_i, X, Φ_Z).    (3.7)
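A sketch of this two-step inference, Eqs. 3.6–3.7, assuming RBF covariances with fixed hyper-parameters, a shared noise variance across output dimensions and an initialization of the latent point from a training point; these choices are illustrative and not prescribed by the model.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, theta1=1.0, theta2=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def predict(xs, X, T, beta=100.0):
    """GP predictive mean and variance at xs for targets T (Eq. 2.47)."""
    K = rbf(X, X) + np.eye(len(X)) / beta
    ks = rbf(xs, X)
    mean = ks @ np.linalg.solve(K, T)
    var = 1.0 + 1.0 / beta - np.sum(ks * np.linalg.solve(K, ks.T).T, axis=1)
    return mean, var

def infer_z(y_star, X, Y, Z, x0):
    """Eq. 3.6: optimize the latent point under p(y|x); Eq. 3.7: mean-predict z."""
    def nll(x):
        m, v = predict(x[None, :], X, Y)
        return 0.5 * Y.shape[1] * np.log(v[0]) + 0.5 * np.sum((y_star - m[0])**2) / v[0]
    x_hat = minimize(nll, x0).x
    z_hat, _ = predict(x_hat[None, :], X, Z)
    return z_hat[0], x_hat

# illustrative trained quantities (in practice they come from optimizing Eq. 3.5)
X = np.random.randn(30, 2)
Y, Z = np.random.randn(30, 6), np.random.randn(30, 4)
z_hat, x_hat = infer_z(Y[0], X, Y, Z, x0=X[0])
```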



Figure 3.2: Each dimension of the underlying signals used to generate the high-dimensional data shown in Figure 3.3.

3.2.3 Example

In the following section we will run through a toy example to exemplify the model presented above. We will generate two sets of high dimensional data, Y and Z, from which we will learn embeddings using the shared GP-LVM models. Both data-sets are generated from a single underlying signal t, which consists of N linearly spaced values between −1 and 1. A set of three signals is generated through non-linear mappings of t,

x_i^1 = cos(πt_i)    (3.8)
x_i^2 = sin(πt_i)    (3.9)
x_i^3 = cos(√5 πt_i + 2).    (3.10)

We will refer to X = [x^1; x^2; x^3] as the generating signal of the data, shown in Figure 3.2. Through linear mappings of X we generate two sets of 20 dimensional signals Y and Z. To achieve this we draw values at random from a zero mean, unit variance normal distribution and organize them into 20 by 3 dimensional matrices P_1 ∈ R^{20×3} and P_2 ∈ R^{20×3}.



Figure 3.3: Observed data Y (left) and Z (right) generated from Eqs. 3.11 and 3.12.

From these transformation matrices we generate the high-dimensional observation signals through the generating signal X,

Y = X P_1^T + λη    (3.11)
Z = X P_2^T + λη,    (3.12)

where η are samples from a zero mean, unit variance normal distribution and λ = 0.05. Figure 3.3 shows each dimension of the two high-dimensional data-sets.
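For reference, the toy data of Eqs. 3.8–3.12 can be reproduced with the following sketch, assuming N = 100 points as in the figures.

```python
import numpy as np

# Reproduce the toy data of Eqs. 3.8-3.12 (N = 100 points assumed).
N = 100
t = np.linspace(-1, 1, N)
X = np.column_stack([np.cos(np.pi * t),
                     np.sin(np.pi * t),
                     np.cos(np.sqrt(5) * np.pi * t + 2)])   # generating signal

P1 = np.random.randn(20, 3)            # random linear maps to 20 dimensions
P2 = np.random.randn(20, 3)
lam = 0.05
Y = X @ P1.T + lam * np.random.randn(N, 20)
Z = X @ P2.T + lam * np.random.randn(N, 20)
```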

We proceed by applying the above generated data to the shared GP-LVM model and the shared back-constrained model presented in this chapter. Each of the models is trained using linear kernels only, as we are interested in their capability to unravel the linear transformations P_1 and P_2 applied to the data. Figure 3.4 shows the embeddings found by the two algorithms. As can be seen, both algorithms unravel the data and find three generating signals underlying the data. We do not expect the algorithms to exactly unfold the signal X, as there are several different linear transformations that could have generated the observed data. However, we see that both algorithms find two signals of “one period”, corresponding to x^1 and x^2, and a higher-frequency signal corresponding to x^3.



Figure 3.4: Each dimension of the latent embeddings of the data in Figure 3.3 from the standard shared model (left) and the shared back-constrained model (right). Each model unravels the signals used to generate the data, shown in Figure 3.2.

One way to quantify the quality of the embeddings is to compare the Procrustes score [23] between the found embeddings and the generating signals. The Procrustes score is a measure of similarity between shapes that are represented as point sets. In Procrustes analysis the shape of an object is considered as belonging to an equivalence class, and a shape is defined as “all the geometrical information that remains when location, scale and rotational effects have been filtered out from an object” [23]. Two shapes can be compared to each other by removing the above effects. By finding the best aligning linear transformation, the two point sets can be compared through the sum of squared distances between the points. Table 3.1 shows that both embeddings have low scores when compared to the generating signals.
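As an illustration of the score, the following sketch uses scipy's standard Procrustes analysis, whose disparity (the sum of squared differences after removing translation, scale and rotation) plays the role of the Procrustes score used here; the aligned signals are made up.

```python
import numpy as np
from scipy.spatial import procrustes

# Compare a recovered embedding to a generating signal: the disparity is small
# whenever the two point sets match up to translation, scale and rotation.
t = np.linspace(-1, 1, 100)
X_true = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
X_emb = X_true @ np.array([[0.0, -1.0], [1.0, 0.0]]) * 3.0 + 0.5   # rotated, scaled, shifted copy
_, _, disparity = procrustes(X_true, X_emb)
print(disparity)      # close to zero: the shapes agree after alignment
```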

We will modify the previous example slightly to represent a different modeling scenario. Just as in the previous example, we generate two high-dimensional signals Y and Z through randomized linear mappings. However, in this case the observed data Y is generated from a two dimensional signal x^Y = [x^1; x^2], whereas Z is generated from the same signals as in the previous example. This results in the sets of generating signals shown in Figure 3.5.



Figure 3.5: Underlying generating signals of the high-dimensional data Y (left) and Z (right) shown in Figure 3.6.


Figure 3.6: Each dimension of the observed data Y (left) and Z (right) generated from the underlying signals shown in Figure 3.5.

By drawing values to form the transformation matrices as in the previous example we can generate the observed signals shown in Figure 3.6. This example is meant to visualize the modeling scenario where the input space Y does not share the same modes of variability as the output space Z. In this case the input space Y is generated from a subset of the signals generating the output space Z. Figure 3.7 shows the results of applying the two shared models presented above. As can be seen in the leftmost plot in Figure 3.7, the shared GP-LVM model does not correctly unfold the data to recover the three generating dimensions of X. Rather, the shared model seems to represent the shared signals x^1 and x^2 but avoids representing x^3, which is private to the output space.



Figure 3.7: Resulting embeddings of applying the shared (left) and the shared back-constrained (right) GP-LVM model to the data shown in Figure 3.6. As can be seen, the shared model fails to unravel the generating signals while the back-constrained model correctly finds the signals underlying the high-dimensional data. The shared model assumes that both observation spaces are generated from the same underlying signals, which is not true for the data shown in Figure 3.6. The shared back-constrained GP-LVM model relaxes this assumption, only assuming that the non-back-constrained observation space is generated from a subset of the generating signals of the back-constrained space.

However, in the shared back-constrained model the latent space is constrained to be a smooth representation of the output space, and therefore all the degrees of freedom in the observed data will be encoded in the latent representation of the data. This means that for the shared model the latent space is not guaranteed to be a full representation of the output space, which means that determining the latent location will not be sufficient for determining the location in the output space. This implies that we cannot replace the high-dimensional conditional model p(Y|Z) with the low-dimensional p(Y|X) as intended.

In the following example we will try to exemplify how the models act in a more general modeling scenario. As before, we will generate two high-dimensional data-sets Y and Z from a set of underlying generating signals. In this example each data-set will be generated from a single shared signal x^1 and one signal that remains private to each observation: x^2 to Y and x^3 to Z.


Example                               Model    Procrustes
Same generating signals               Shared   0.058
Same generating signals               Back     0.004
Two shared, one private for output    Shared   0.614
Two shared, one private for output    Back     0.028

Table 3.1: The Procrustes score corresponding to the resulting embeddings of applying the different GP-LVM models to the data shown in Figure 3.3 and Figure 3.6. As can be seen, the back-constrained model (referred to as Back) significantly out-performs the standard shared GP-LVM model.


Figure 3.8: Each dimension of the underlying signals generating the high-dimensional data Y (left) and Z (right) shown in Figure 3.9. Each signal is two dimensional, with one dimension shared and one private to each data-set.

In Figure 3.8 the generating signals are shown. Similarly to the previous examples, we generate two high-dimensional signals from random linear transformations of the generating signals. Each dimension of the observed data is shown in Figure 3.9. We apply a standard shared and a back-constrained shared model to the data. In Figure 3.10 the embeddings found by each model are shown. Both models represent the data using two latent dimensions. In the case of the shared model the latent locations seem to correspond to the generating signals of Y, while the back-constrained model seems to unravel the generating signals corresponding to Z.



Figure 3.9: Each dimension of the high-dimensional observed data Y (left) and Z (right) generated from the underlying signals shown in Figure 3.8.

In the case of the back-constrained model this is expected, as the model was specifically designed to learn a latent space which corresponds to Z, the output space of the model. Further, the latent locations are initialized using the solution of a spectral algorithm, in this case PCA, applied to Z, which means that there is a strong incentive for the model to focus on modeling Z rather than Y. For the standard shared model the interpretation of the resulting embedding is less obvious. The model does not favor the generation of either of the two observation spaces; however, the latent locations are initialized from the mean of the solutions of a spectral algorithm, in this case PCA, applied to both observation spaces Y and Z. As both data sets are generated from two dimensional signals, PCA will only recover two dimensions, and the mean of the solutions to each space will only have relevance for the shared generating signal, while the recovered private signals will be lost when taking the mean of the signals. As we know the dimensionality of the signal generating the observations, we can, rather than determining the latent dimensionality from the initialization, set it to equal the dimension of the generating signal. Figure 3.11 shows the embeddings found using three dimensional latent spaces. As can be seen, neither model recovers the three generating signals shown in Figure 3.8.



Figure 3.10: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. Neither model succeeds in unraveling the three different underlying generating signals shown in Figure 3.8.


Figure 3.11: Each dimension of the embeddings found by applying the shared (left) and the shared back-constrained (right) GP-LVM models to the data shown in Figure 3.9. The latent dimensionality is set to equal the dimensionality of the generating signals. As can be seen, neither model manages to recover the three generating signals shown in Figure 3.8.


In the case of the back-constrained model this is to be expected, as the latent locations are constrained to be a smooth mapping from the observed data Z. As nothing of the signal private to Y, x^2, is contained in Z, the model cannot correctly represent this in the latent space. This constraint does not exist for the standard shared model; however, as can be seen in Figure 3.11, neither does this model succeed in recovering the three generating signals [x^1; x^2; x^3]. The model is initialized using the mean of the data projected onto the three most representative principal components. As each data-set is generated from two dimensional signals, the third principal component will, for each observation space, fit to the noise of the data. This means that the third latent dimension will be initialized to noise, from which the model never manages to recover to encode x^3.

3.2.4 Summary

The shared and back-constrained GP-LVM model presented above was created to model the scenario where the input space can be modeled as a mapping from the output space. What this means is that all the degrees of freedom in the input space are contained in the output space. It does, however, not assume a bijective relationship between the two observation spaces as in [52], as a location in the input space can be associated with several locations in the output space, which was exemplified in Figure 3.7. However, as was shown in the example leading up to the embeddings in Figures 3.10 and 3.11, neither model was capable of modeling the more general scenario where each observed space shares a subset of the generating parameters but also contains private generating parameters. In the next section we will proceed to describe an algorithm designed for this specific modeling scenario.



Figure 3.12: Subspace GP-LVM model. The two observation spaces Y and Z are generated from the latent variable X, factorized into subspaces X^Y, X^Z representing the private information and X^S modeling the shared information. Φ_Y and Φ_Z are the hyper-parameters of the Gaussian processes modeling the generative mappings f^Y and f^Z.


3.3 Subspace GP-LVM

The main limitation of the shared GP-LVM, and shared models in general, is the assumption that each data space is governed by the same degrees of freedom or modes of variability. We relaxed this assumption, when only interested in inference in one direction, by introducing the back-constraint to the shared model in the previous section. We will now proceed by introducing a new shared GP-LVM model which further relaxes this assumption.

Similarly to the shared GP-LVM model, we are given two sets of corresponding observations Y = [y_1, . . . , y_N] and Z = [z_1, . . . , z_N], where y_i ∈ ℜ^{D_Y} and z_i ∈ ℜ^{D_Z}.


We assume that each observation has been generated from low-dimensional manifolds, corrupted by additive Gaussian noise,

y_i = f^Y(u_i^Y) + ǫ_Y,   ǫ_Y ∼ N(0, β_Y^{−1} I)
z_i = f^Z(u_i^Z) + ǫ_Z,   ǫ_Z ∼ N(0, β_Z^{−1} I),

where u_i^Y ∈ ℜ^{q_Y} and u_i^Z ∈ ℜ^{q_Z} with q_Y/D_Y < 1 and q_Z/D_Z < 1. We assume that the two manifolds can be parameterized in such a manner that they share a common non-empty subspace X^S,

X^S ⊂ U^Y,   X^S ⊂ U^Z,   X^S ≠ ∅,    (3.13)

which is referred to as the shared subspace X^S. This assumption implies that a parameterization of each observation space in which a subset of the degrees of freedom of each observation space is shared is possible. Representing each manifold in terms of X^S introduces an additional subspace associated with the manifold,

U^Y = [X^S; X^Y],   U^Z = [X^S; X^Z],    (3.14)

which is referred to as the private (or non-shared) subspace. The full latent representation of both observation spaces, X, is the concatenation of the shared and private subspaces, X = [X^S; X^Y; X^Z]. Note that the private spaces are subspaces of X, with the full latent space representing shared and non-shared variance in a factorized form.

A shared GP-LVM model can be constructed respecting the factorized latent structure. The GP-LVM learns two separate mappings generating each observation space, where the input space for the GP generating Y is U^Y and for Z is U^Z, leading to the following objective,

{X̂, Φ̂_Y, Φ̂_Z} = argmax_{X,Φ_Y,Φ_Z} p(Y, Z|X, Φ_Y, Φ_Z)
             = argmax_{X^S,X^Y,X^Z,Φ_Y,Φ_Z} p(Y|X^S, X^Y, Φ_Y) p(Z|X^S, X^Z, Φ_Z).    (3.15)

The latent structure presented above is capable of separately modeling the shared and the non-shared variance. This is achieved by letting the mappings f^Y and f^Z only be active on the subspaces associated with each observation.
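A sketch of the factorized objective of Eq. 3.15, where the GP generating Y only sees U^Y = [X^S, X^Y] and the GP generating Z only sees U^Z = [X^S, X^Z]; the kernels, dimensionalities and data are placeholder assumptions.

```python
import numpy as np

def rbf(X, theta1=1.0, theta2=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return theta1 * np.exp(-0.5 * theta2 * d2)

def gp_nll(U, T, beta=100.0):
    """-log p(T | U) for one GP, as in the shared GP-LVM."""
    N, D = T.shape
    K = rbf(U) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * D * logdet + 0.5 * np.trace(np.linalg.solve(K, T @ T.T)) \
           + 0.5 * N * D * np.log(2 * np.pi)

def subspace_gplvm_nll(X_S, X_Y, X_Z, Y, Z):
    """Objective of Eq. 3.15: f^Y sees U^Y = [X^S, X^Y], f^Z sees U^Z = [X^S, X^Z]."""
    U_Y = np.hstack([X_S, X_Y])
    U_Z = np.hstack([X_S, X_Z])
    return gp_nll(U_Y, Y) + gp_nll(U_Z, Z)

N = 50
Y, Z = np.random.randn(N, 10), np.random.randn(N, 8)
X_S, X_Y, X_Z = (0.1 * np.random.randn(N, d) for d in (1, 1, 1))
nll = subspace_gplvm_nll(X_S, X_Y, X_Z, Y, Z)
```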

3.4 Extensions

As with the standard GP-LVM model, additional constraints such as back-constraints or dynamic models can be placed over the latent variables. However, just as with the generative mappings, we can limit the constraints to only be active on subspaces of the latent space. In the applications chapter we will evaluate the performance of shared subspace models with dynamic constraints.

3.5 Applications

Applications in a wide variety of fields are concerned with inferring a specific variable, the output variable, from evidence given by the parameters of a different variable, the input variable. In the simplest case, when the output variable is related to the input variable by a function, this is called a regression problem. However, for many applications this cannot be assumed, as the input variable is not sufficiently discriminative of the output variable.


In such cases there might be several different locations in the output space that correspond to a specific location in the input space. For such an estimation task we would ideally like to recover each possible output location corresponding to the given input.

A classic example of an application where the input space is not sufficient for discriminating the output is the computer vision task of image based human pose estimation. The task is concerned with estimating the pose parameters of a human from an image. In the applications chapter we will demonstrate how the models proposed in this chapter can be applied to this task. In particular, the application of the subspace GP-LVM model is interesting since it finds a factorized representation of the data into shared and private parts. The shared latent variable represents the portion of the variance in each observation space that can be determined from the other observations, i.e. variance that has a functional relationship between the observations. The private latent variable represents variance which cannot be discriminated from the other space. As we will show, this means that, for the task of pose estimation, the estimation can be reduced to a regression task for estimating the shared location. The location in the private space represents variance in the pose space that cannot be estimated from the image, i.e. poses that are ambiguous given the input image.

3.6 Summary

As was exemplified in the toy example in Figure 3.5, the standard shared model is not capable of unraveling the correct generating signals for data where the two observation spaces have not been generated from the same underlying signals. For the case where we are interested in inferring locations in one specific observed space given locations in the other observation space we proposed the shared back-constrained model. The proposed model is capable of correctly unraveling the generating signals of the data when the input space has been generated from a subset of the generating parameters of the output space. However, in cases where the data have been generated using parameters "private" to both of the observed spaces, as was shown in the example in Figure 3.8, the shared back-constrained model also fails to model the data, as was shown in Figure 3.10 and Figure 3.11.

The objective of each of the models is to find latent locations and mappings that minimize the reconstruction error of the data. In cases where the data contains observation space specific "private" information used to generate a single observation space the model is left with two choices. The first is to represent this information in the latent space, which will reduce the error when reconstructing the associated observed space. However, for the other space, which does not contain this information, this will pollute the latent space, reducing the capability of the generative mapping to generalize over the data and resulting in a lower likelihood of the associated generative mapping. The other option is to consider the private information as noise. Neither of these scenarios is ideal, as both will result in a lower likelihood of the model: either by reducing the generative mapping's capability to regenerate the data, or by trying to model a structured signal using a noise model which is not likely to fit particularly well.

For modeling data containing private information we have in this chapter proposed the shared subspace GP-LVM model. This model learns a factorized latent space of the data, separately modeling information that is shared from information that is private. The structure of the latent space means that the private information in the observations will not pollute the reconstruction of the data, nor will the model need to resort to modeling this information as noise. Further, the factorization into shared and private information can be beneficial when we want to do inference in the model, as the shared space contains information that can be determined from either observation space while the private subspace contains the information that is ambiguous knowing the location in the other observed space. In the applications chapter we will exploit this factorization for estimating the pose of a human from image evidence. Each image is ambiguous to a small sub-set of the possible human poses; using the shared subspace model, this sub-set will be modeled using the private latent spaces.

The non-convex nature of the GP-LVM objective (in the non-linear case) means that the algorithm relies on an analogous dimensionality reduction algorithm for initialization. For the standard model there are, as we have seen, several different algorithms that are applicable. When the observation spaces can be assumed to have been generated from the same manifold, the approach of averaging over the observations is applicable. Similarly, in the asymmetric case initialization from the output space is viable. However, for the subspace model presented in this chapter, no equivalent spectral model exists.

In the next chapter we proceed by presenting a spectral approach to shared dimensionality reduction that can be used to initialize the subspace GP-LVM model presented in this chapter.


Chapter 4

NCCA

4.1 Introduction

In the previous chapter we introduced two new latent variable models based on the GP-LVM. The first presented algorithm, referred to as the shared back-constrained GP-LVM, models two correlated sets of observations using a single shared multivariate latent variable. It was created for the task of inferring the location in one observation space given the corresponding location in the other observation space. The second algorithm, referred to as the subspace GP-LVM model, was designed for the more general scenario where we want to model several different observation spaces sharing a subspace of their variance.

The solution to both algorithms is found using gradient based methods. Therefore, to be able to recover a good solution, initialization of the models is of significant importance. The shared back-constrained model assumes that the non-back-constrained observation space has been generated from a subset of the generating parameters of the back-constrained space. This means that we can initialize the latent locations using the output of a spectral dimensionality reduction technique applied to the back-constrained observation space. However, for the subspace GP-LVM model no analogous convex models exist. In this chapter we will present a spectral dimensionality reduction algorithm for finding factorized latent spaces that separately represent the shared and private variance of two corresponding observation spaces. Being associated with a convex objective function, the model can be used to initialize the subspace GP-LVM model.

4.2 Assumptions

Just as with the spectral dimensionality reduction approaches we reviewed in Chapter 2, the model we will present is based on a set of assumptions about the relationship between the observed data and its latent representation. We are given two sets of corresponding observations $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ and $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_N]$, where $\mathbf{y}_i \in \Re^{D_Y}$ and $\mathbf{z}_i \in \Re^{D_Z}$. We assume that each observation has been generated from a low-dimensional manifold corrupted by additive Gaussian noise,

$$\mathbf{y}_i = f^Y(\mathbf{u}_i^Y) + \epsilon^Y, \qquad \epsilon^Y \sim \mathcal{N}(\mathbf{0}, \beta_Y^{-1}\mathbf{I})$$
$$\mathbf{z}_i = f^Z(\mathbf{u}_i^Z) + \epsilon^Z, \qquad \epsilon^Z \sim \mathcal{N}(\mathbf{0}, \beta_Z^{-1}\mathbf{I}). \qquad (4.1)$$

The latent representations $\mathbf{u}_i^Y \in U^Y$ and $\mathbf{u}_i^Z \in U^Z$ associated with each observation space consist of two components: one shared part $\mathbf{x}_i^S$, and one private or non-shared part, $\mathbf{x}_i^Y$ and $\mathbf{x}_i^Z$ respectively, associated with each observation space. Each component represents an orthogonal subspace of the latent representation, as $\mathbf{u}_i^Y = [\mathbf{x}_i^S; \mathbf{x}_i^Y]$ and $\mathbf{u}_i^Z = [\mathbf{x}_i^S; \mathbf{x}_i^Z]$. We will refer to $X^S$ as the shared subspace and $X^Y$ and $X^Z$ as the private subspaces of the model.

The shared subspace $X^S$ of the latent representations $U^Y$ and $U^Z$ is assumed to be related to the observation spaces by smooth mappings $g^Y$ and $g^Z$ as,

$$\mathbf{x}_i^S = g^Y(\mathbf{y}_i) = g^Z(\mathbf{z}_i). \qquad (4.2)$$

We further assume that the relationship between the observed data and its corresponding private manifold representation also takes the form of a smooth mapping,

$$\mathbf{x}_i^Y = h^Y(\mathbf{y}_i) \qquad (4.3)$$
$$\mathbf{x}_i^Z = h^Z(\mathbf{z}_i). \qquad (4.4)$$

In the following sections we will first describe how the shared latent space $X^S$ can be found from the observed data. Once a solution for the shared space $X^S$ is found we will proceed to find the private subspaces $X^Y$ and $X^Z$ to complete the latent representation of the model.

4.3 Shared

The shared subspace $X^S$ of the latent representation of the data represents variance that is shared between both observation spaces. In Chapter 2 we reviewed Canonical Correlation Analysis (CCA), which is a feature selection algorithm for finding directions in two observation spaces that are maximally correlated. The model we are about to present is a two stage algorithm; in the first stage we find the shared latent space $X^S$ using CCA.

The objective of CCA is to find two sets of basis vectors $\mathbf{W}_Y$ and $\mathbf{W}_Z$, one for each observation space, such that the correlation $\rho$ between the projections of the data is maximized,

$$\rho = \frac{\mathrm{tr}\left(\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Z} \mathbf{W}_Z\right)}{\left(\mathrm{tr}\left(\mathbf{W}_Z^T \mathbf{Z}^T \mathbf{Z} \mathbf{W}_Z\right)\, \mathrm{tr}\left(\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Y} \mathbf{W}_Y\right)\right)^{\frac{1}{2}}}, \qquad (4.6)$$

subject to unit variance $\mathbf{W}_Y^T \mathbf{Y}^T \mathbf{Y} \mathbf{W}_Y = \mathbf{I}$ and $\mathbf{W}_Z^T \mathbf{Z}^T \mathbf{Z} \mathbf{W}_Z = \mathbf{I}$ along each direction. The unit variance constraint means that CCA will find maximally correlated directions irrespective of how much of the variance in the observation space is explained. As a way of avoiding low variance solutions it is suggested in [31] to first apply PCA separately to each data-set and then apply CCA in the dominant principal subspace of each data-set. This avoids highly correlated directions that explain a non-substantial amount of the variance of the data.
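A minimal numpy sketch of this classical CCA solution, computed via whitening and an SVD; the variable names, the ridge term and the centering convention are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    # Inverse matrix square root via eigendecomposition (small ridge for stability).
    vals, vecs = np.linalg.eigh(S + eps * np.eye(len(S)))
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(Y, Z):
    """Return bases W_Y, W_Z and the canonical correlations rho.
    Columns of Y @ W_Y and Z @ W_Z have unit variance and are maximally correlated."""
    Y = Y - Y.mean(0)
    Z = Z - Z.mean(0)
    N = len(Y)
    Syy, Szz, Syz = Y.T @ Y / N, Z.T @ Z / N, Y.T @ Z / N
    Syy_is, Szz_is = inv_sqrt(Syy), inv_sqrt(Szz)
    U, rho, Vt = np.linalg.svd(Syy_is @ Syz @ Szz_is)
    d = len(rho)
    return (Syy_is @ U)[:, :d], (Szz_is @ Vt.T)[:, :d], rho
```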

In the general case both CCA and PCA recover linear subspaces of the data. However, both algorithms can be non-linearized through the kernel trick [3] by first mapping the data to kernel induced feature spaces,

$$\Psi_Y : \mathbf{Y} \rightarrow \mathcal{F}^Y \qquad (4.7)$$
$$\Psi_Z : \mathbf{Z} \rightarrow \mathcal{F}^Z, \qquad (4.8)$$

represented by kernels $\mathbf{K}^Y$ and $\mathbf{K}^Z$. In practice we first apply kernel PCA and then look for correlated directions in the dominant kernel induced principal subspace of the data. Having found directions explaining the shared components of the data it remains to find directions explaining the private variance of each observation space.

Figure 4.1: Graphical model of the Non-Consolidating-Component-Analysis (NCCA) model. The two observation spaces Y and Z are generated from a latent variable X which is factorized into three different subspaces $X = [X^Y; X^S; X^Z]$. The subspace $[X^S; X^Y]$ is the latent representation of Y while $[X^S; X^Z]$ represents Z. This means that $X^S$ models the portion of the data that is correlated between Y and Z and is found using CCA. The subspaces $X^Y$ and $X^Z$ represent the remaining private portion of each observation space.

4.4 Private

Given two observation spaces Y and Z together with bases $\mathbf{W}^Y$ and $\mathbf{W}^Z$ that explain the shared variance, we are interested in finding directions $\mathbf{V}^Y$ and $\mathbf{V}^Z$ that explain the private, non-shared variance of each observation space. To find such a basis we look for directions of maximum variance in each observation space that are orthogonal to the shared bases. We will apply this procedure to each observation space in turn, and in the following we have therefore dropped the superscript that identifies the observation space. We seek the first direction,

$$\hat{\mathbf{v}} = \operatorname*{argmax}_{\mathbf{v}} \mathbf{v}^T \mathbf{C} \mathbf{v}, \qquad (4.9)$$

subject to,

$$\mathbf{v}^T \mathbf{v} = 1 \qquad (4.10)$$
$$\mathbf{v}^T \mathbf{W} = \mathbf{0}, \qquad (4.11)$$

where $\mathbf{W}$ contains the canonical directions and $\mathbf{C}$ is the covariance matrix of the feature space.

4.4.1 Extracting the first orthogonal direction

We apply the algorithm in a feature space induced by a kernel with $K$ canonical directions to respect. The solution is found through formulating the Lagrangian of the problem,

$$L = \mathbf{v}^T \mathbf{C} \mathbf{v} - \lambda(\mathbf{v}^T \mathbf{v} - 1) - \sum_{i=1}^{K} \gamma_i \mathbf{v}^T \mathbf{w}_i. \qquad (4.12)$$

Seeking the stationary point of the Lagrangian leads to the following system of equations,

$$\frac{\delta L}{\delta \mathbf{v}} = 2\mathbf{C}\mathbf{v} - 2\lambda\mathbf{v} - \sum_{i=1}^{K} \gamma_i \mathbf{w}_i = \mathbf{0} \qquad (4.13)$$
$$\frac{\delta L}{\delta \lambda} = \mathbf{v}^T\mathbf{v} - 1 = 0 \qquad (4.14)$$
$$\frac{\delta L}{\delta \gamma_i} = \mathbf{v}^T \mathbf{w}_i = 0, \quad \forall i. \qquad (4.15)$$


By pre-multiplying Eq. 4.13 with $\sum_{i=1}^{K} \mathbf{w}_i^T$,

$$\sum_{i=1}^{K} \mathbf{w}_i^T (2\mathbf{C}\mathbf{v}) - 2\lambda \sum_{i=1}^{K} \mathbf{w}_i^T \mathbf{v} - \left(\sum_{j=1}^{K} \mathbf{w}_j^T\right)\left(\sum_{i=1}^{K} \gamma_i \mathbf{w}_i\right) = 0, \qquad (4.16)$$

using the orthogonality constraint Eq. 4.15,

$$\sum_{i=1}^{K} \mathbf{w}_i^T \mathbf{v} = 0, \qquad (4.17)$$

and Eq. 4.14,

$$\left(\sum_{j=1}^{K} \mathbf{w}_j^T\right)\left(\sum_{i=1}^{K} \gamma_i \mathbf{w}_i\right) = \begin{cases} \gamma_i & i = j \\ 0 & i \neq j \end{cases}, \qquad (4.18)$$

we obtain,

$$\gamma_i = 2\mathbf{w}_i^T \mathbf{C} \mathbf{v}. \qquad (4.19)$$

Using Eq. 4.13 and Eq. 4.19,

$$2\mathbf{C}\mathbf{v} - 2\lambda\mathbf{v} - 2\sum_{i=1}^{K} \mathbf{w}_i(\mathbf{w}_i^T \mathbf{C} \mathbf{v}) = \mathbf{0} \qquad (4.20)$$

$$\left(\mathbf{C} - \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T \mathbf{C}\right)\mathbf{v} = \lambda\mathbf{v}. \qquad (4.21)$$

Equation 4.21 is an eigenvalue problem and can be solved in closed form through the eigendecomposition of the matrix $\left(\mathbf{C} - \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T \mathbf{C}\right)$.

4.4.2 Extracting consecutive directions

Having found the orthogonal direction explaining the maximum of the remaining variance, further consecutive directions can be found by appending the previously found directions to the orthogonality constraint. For the $M$th direction,

$$\mathbf{v}_M^T [\mathbf{W}; \mathbf{v}_1, \ldots, \mathbf{v}_{M-1}] = \mathbf{0}. \qquad (4.22)$$

This means that to find the $M$th direction the following eigenvalue problem needs to be solved,

$$\left(\mathbf{C} - \left(\sum_{j=1}^{M-1} \mathbf{v}_j\mathbf{v}_j^T + \sum_{i=1}^{K} \mathbf{w}_i\mathbf{w}_i^T\right)\mathbf{C}\right)\mathbf{v}_M = \lambda\mathbf{v}_M. \qquad (4.23)$$

We will refer to the directions found as the Non-Consolidating-Components and the algorithm as Non-Consolidating-Components-Analysis (NCCA).
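A compact numpy sketch of the NCCA iteration in Eqs. 4.21-4.23, operating on the (kernel) PCA coordinates of one observation space. It assumes the shared directions in the columns of W are orthonormal; the function names, the re-orthogonalization step and the use of a dense eigendecomposition are implementation assumptions rather than details from the thesis.

```python
import numpy as np

def ncca_directions(C, W, n_directions):
    """Extract non-consolidating components from a covariance matrix C,
    orthogonal to the already-found (e.g. canonical) directions in the columns of W."""
    V = []
    basis = [W[:, i] for i in range(W.shape[1])]   # directions to stay orthogonal to
    for _ in range(n_directions):
        # Deflation of Eq. 4.23: remove variance already explained by `basis`.
        P = sum(np.outer(b, b) for b in basis)
        M = C - P @ C
        eigvals, eigvecs = np.linalg.eig(M)        # M is not symmetric in general
        v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        # Re-orthogonalize against the current basis and normalize (numerical safety).
        for b in basis:
            v = v - (v @ b) * b
        v = v / np.linalg.norm(v)
        V.append(v)
        basis.append(v)
    return np.stack(V, axis=1)
```

In the two stage model described above, C would be the covariance of the kernel PCA representation of one observation space and W the CCA directions found in the previous section.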

4.4.3 Example

In the previous chapter we applied the shared and the shared back-constrained GP-LVM models to a toy data-set. Doing so we exemplified in which scenarios each model works and also in which scenarios each model fails. As was shown, neither model is capable of modeling data which contains both shared and private signals. This in itself was the motivation for creating the subspace GP-LVM model. In this chapter we have presented the NCCA model as an extension to CCA. We will use the solution of the proposed model to initialize the subspace GP-LVM model. To evaluate its performance we applied the subspace model to the same data-set to which we applied the shared and the shared back-constrained models in the previous chapter. Each observation space Y and Z has been generated from a set of underlying low-dimensional signals shown in Figure 4.2. The generating signals have one dimension shared and one dimension private to each data-set.
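For concreteness, a toy data-set of this form can be generated as in the following sketch. The exact signals, dimensionalities and noise level used in the thesis are not specified here, so the sinusoids and the numbers below are illustrative assumptions; only the structure (one shared signal, one private signal per space, random linear projections) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
t = np.linspace(0, 4 * np.pi, N)

# One shared 1-D signal and one private 1-D signal per observation space.
s_shared = np.sin(t)
s_priv_y = np.cos(3 * t)
s_priv_z = np.sin(2 * t + 0.5)

U_Y = np.column_stack([s_shared, s_priv_y])   # generating signals for Y
U_Z = np.column_stack([s_shared, s_priv_z])   # generating signals for Z

# Random linear projections into higher-dimensional observation spaces,
# plus a small amount of additive Gaussian noise.
D_Y, D_Z = 6, 8
Y = U_Y @ rng.normal(size=(2, D_Y)) + 0.05 * rng.normal(size=(N, D_Y))
Z = U_Z @ rng.normal(size=(2, D_Z)) + 0.05 * rng.normal(size=(N, D_Z))
```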


Figure 4.2: Each dimension of the signals used to generate the high-dimensional signals shown in Figure 4.3. The data consist of one shared dimension (blue) and one dimension private to each data-set (green).

Figure 4.3: Each dimension of the high-dimensional observed data to which the subspace GP-LVM model is applied.

From these signals two high-dimensional signals are generated through random linear projections, resulting in the data shown in Figure 4.3, which is referred to as the observed data Y and Z. We apply the data shown in Figure 4.3 to two variations of the subspace GP-LVM model presented in the previous chapter. For each model we set the latent structure to be composed of a single shared dimension and a private dimension corresponding to each observation space. This means that in total we are learning a three dimensional latent space. We initialize one model using the PCA solution of one of the observation spaces and one model using CCA for the shared dimension and NCCA for the private spaces. In Figure 4.4 the resulting embeddings are shown. As can be seen in Figure 4.4, the model initialized using PCA is not able to correctly unravel the generating signals. The observation space Y is generated from a two dimensional signal (Figure 4.2). We use the first principal component of this data as initialization for the shared space and the second component to initialize the private space corresponding to Y. We hereby make the assumption that the first principal component will correspond to the shared signals, something which in general cannot be assumed to be true. The private dimension of Z is initialized to small random values. Using the CCA and NCCA scheme presented in this chapter the model manages to correctly unravel the generating signals.

Figure 4.4: Each dimension of the embeddings of the data in Figure 4.3 applied to the subspace GP-LVM model. The left plot corresponds to the results using PCA as initialization while the right plot corresponds to the embedding using CCA for the shared dimension and NCCA for the private dimensions. The latter model succeeds in recovering the generating signals of Figure 4.2 while the model initialized using PCA fails.


4.5 Extensions

We have presented the NCCA algorithm as a way of finding representative directions in the data that are complementary to the directions of high correlation found by CCA. However, there is nothing in the algorithm that limits the use of NCCA to only accompany CCA. The algorithm is general and can be used in addition to any feature selection algorithm where we want to model the full variance of the data.

Fisher's Discriminant Analysis (FDA) [20] is a method for finding directions in the data such that the resulting projections maximize the separation between two or more classes. Finding complementary directions using NCCA would allow a factorized latent representation where the subspace corresponding to the FDA directions would discriminate the data into classes, while the complementary subspace associated with the NCCA directions would represent the most representative directions in the data without constraints on class discrimination. This would allow FDA to be extended to a model which represents the full variance of the observed data, not limited to the variance from which the classes can be discriminated.

4.6 Summary

In this chapter we have presented a new two stage dimensionality reduction model. The latent space used to represent the data is factorized into two parts, one constrained part and one complementary part, ensuring that the full variance of the observed data is represented in the latent space. The constrained latent subspace can be found from any feature selection algorithm that results in a set of directions in the data; in this chapter we have used CCA to find the constrained part. The remaining latent subspace complements the constrained directions to ensure that the latent variable represents the full variance of the observed data.

The initial motivation behind the NCCA model was as a spectral algorithm analogous to the subspace GP-LVM model. However, as we will show in Chapter 5, the model can be directly applied to model data without the need of a GP-LVM model.

When using the NCCA model as an initialization scheme for the subspace GP-LVM we use the NCCA step to complement the solution found by applying CCA to the data. The assumptions in Eqs. 4.2-4.4 do not completely correspond to the subspace GP-LVM model. In particular, the subspace GP-LVM model makes no assumption about the relationship between the observed data and the manifold representation. We could design a model which would better correspond to the CCA and NCCA scheme by incorporating back-constraints to constrain the mapping from observed data to the latent representation. However, each latent dimension can only be back-constrained once, which means that we cannot completely encode the assumption in Eq. 4.2. Nevertheless, as we will show in the next chapter, initializing according to the above scheme experimentally results in good embeddings.


Chapter 5

Applications

5.1 Introduction

In this chapter we will apply the models presented in this thesis to the task of human pose estimation from monocular images. In the next section we will give a brief introduction to human pose estimation and review some of the related work. We will then proceed to present the shared back-constrained and the subspace GP-LVM models applied to this task. We will apply the models to two standard data-sets typically used within the domain of single image human pose estimation. We will conclude with a brief summary of the results.

5.2 Human Pose Estimation

Single view human pose estimation is the task of estimating the pose of a human from a monocular image. To get an overview of previous work we will split the task into two different categories, generative and discriminative. Generative human pose estimation algorithms are often referred to as model based algorithms. They aim to fit a model of a human to evidence given in the image using an associated likelihood or error function. The task is solved by searching the state-space, which defines the pose, for the location which maximizes the likelihood given a specific image. This is different from the discriminative approaches, which aim to find the pose associated with a specific image from evidence extracted from the image.

5.2.1 Generative

Generative human pose estimation aims to fit the parameters of a human body model onto the image. The key differences between algorithms are the parameterization of the human body and how the likelihood model through which these parameters are found is specified. A large variety of human body models have been proposed, from simpler models composed of two dimensional patches [30, 43, 15] to more complex three dimensional models based on a wide range of primitives such as cylinders [13, 53, 12] or cones [74]. The more complex human models allow for more accurate estimation; however, they are usually associated with a higher number of parameters, which implies that they are more expensive to fit to the image evidence. Similarly to the human body model, several different ways of specifying the likelihood model through which the parameters can be fit have been suggested. In [74, 19] image edges are used, while [58] used texture information to fit an ellipsoid parameterized model. In [12] a stick-model of a human was fitted to the image through an image segmentation based score. If the estimation is done for a sequence of images, higher level image information such as optical flow can be incorporated into the likelihood model as in [58].

5.2.2 Discriminative

Images are very high-dimensional objects, typically in the range of $10^3$-$10^4$ dimensions. To make modeling computationally feasible it is common practice, as a pre-processing stage, to represent each image using a lower dimensional feature vector. Due to the high dimensionality of the images these feature descriptors are usually based on heuristic assumptions about the correspondence between images and pose.

There are two main approaches to modeling the relationship between image features and pose. In the simplest case, where there is no ambiguity between the image representation and pose, the relationship can be modeled with regression, as was demonstrated in [2, 61, 68, 83]. However, in the general case the image feature representation is not capable of fully disambiguating the pose. One approach to dealing with the multi-modalities that arise is to use a multi-modal estimate such as in [47]. Given a multi-modal estimate, additional cues such as temporal consistency can be used to disambiguate between the different modes.

5.2.3 Problem Statement

Given a set of image features $\mathbf{y}_i \in \mathbf{Y}$ with corresponding pose parameters $\mathbf{z}_i \in \mathbf{Z}$, where $i = 1, \ldots, N$, we wish to train a model through which, given an unseen image feature vector $\mathbf{y}^*$, we can infer the corresponding pose parameters $\mathbf{z}^*$. We will learn models for two different settings, single image estimation and sequential estimation. In the first case we are given a single input feature from which to determine the pose, while in the sequential case we are given a sequence $[\mathbf{y}_1^*, \ldots, \mathbf{y}_N^*]$ of image features from which we aim to determine the corresponding sequence of pose parameters $[\mathbf{z}_1^*, \ldots, \mathbf{z}_N^*]$.

5.3 Image Features

Images are very high-dimensional objects, typically residing in $10^3$-$10^4$ dimensional spaces. Due to the problems associated with working with such high-dimensional objects it is, for most applications, necessary to find a reduced dimensional representation of each image. The large variability and high dimensionality of most image spaces mean that it is often not possible to apply dimensionality reduction techniques such as the ones reviewed in Chapter 2. Further, for most applications we are only interested in a sub-set of the information contained in the images; often it is of interest to find a representation that introduces certain ambiguities. One example would be an application where we are comparing the shape of objects: we would then ideally like a representation that is ambiguous to color, texture and scaling, and where as much of the variance in the descriptor as possible is related to the shape of the objects. This has led to a large body of work on heuristic, application specific image representations. We will now proceed to briefly describe the background to the two different image features used for the experiments in this chapter.

5.3.1 Shapeme Features

Shape context [6, 7] was suggested as a point based descriptor for shape matching. The shape is assumed to be represented as a discrete set of points from the contour of an image; these could either have been extracted using a segmentation algorithm or as the response to an edge detector. Considering the set of contour points $P = \{\mathbf{p}_1, \ldots, \mathbf{p}_n\}$, the descriptor is calculated by taking each point $\mathbf{p}_i$ and describing its position relative to all the other points on the contour. This is done by computing the vector $\mathbf{v}_j^i = (\mathbf{p}_i - \mathbf{p}_j)$ for all the other $n - 1$ points on the contour. The set of vectors $V^i = \{\mathbf{v}_j^i\}, \forall j \neq i$, completely describes the configuration of all points relative to the reference point $\mathbf{p}_i$. The shape context descriptor of point $\mathbf{p}_i$ is computed as the distribution of these relative point positions by placing them in a coarse spatial log-polar histogram, where each bin represents the radius and the angle of the polar representation of each vector $\mathbf{v}_j^i$. This results in $n$ log-polar histograms describing the shape.
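A small numpy sketch of the log-polar binning just described. The bin counts, radial edges and the handling of points falling outside the outermost radius are illustrative assumptions, not the exact settings of the features used in this thesis.

```python
import numpy as np

def shape_context(points, n_radial=5, n_angular=12):
    """Log-polar shape context histograms for an (n, 2) array of contour points."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]           # v_j^i = p_i - p_j
    dists = np.linalg.norm(diff, axis=-1)
    angles = np.arctan2(diff[..., 1], diff[..., 0])           # in (-pi, pi]
    # Scale invariance: normalize distances by the median inter-point distance.
    median = np.median(dists[dists > 0])
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radial + 1)
    a_edges = np.linspace(-np.pi, np.pi, n_angular + 1)
    descriptors = np.zeros((n, n_radial, n_angular))
    for i in range(n):
        mask = np.arange(n) != i                               # exclude the point itself
        r_bin = np.digitize(dists[i, mask] / median, r_edges) - 1
        a_bin = np.clip(np.digitize(angles[i, mask], a_edges) - 1, 0, n_angular - 1)
        valid = (r_bin >= 0) & (r_bin < n_radial)              # drop points outside the bins
        np.add.at(descriptors[i], (r_bin[valid], a_bin[valid]), 1)
    return descriptors.reshape(n, -1)
```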

Each histogram is computed relative to a reference point on the contour, making the histogram invariant to translation. Further, by scaling the radius of the polar representation of each point with the median distance between the points on the contour, each histogram can be made invariant to scaling. The discretization/binning of the vectors into a coarse histogram representation makes the descriptor robust to small affine transformations of the shape.

Given two shapes we are interested in finding a measure of their similarity. Matching using the shape context representation is a two stage process. First, a similarity measure between shape context histograms is needed; this is used to find the point on each shape that best matches a point on the other shape. Secondly, a measure relating the similarity of the full shapes, i.e. all the points, is required. In the original paper [6] the $\chi^2$ distance was used to compare each histogram; however, in [7] this was changed to the simpler $L_2$-norm without significant degradation of the results. Matching two shapes implies finding the permutation $\pi$ of the points on one shape such that the sum of the histogram similarities is minimized. This is an instance of bipartite matching, which is in general of cubic complexity.

To reduce the complexity associated with matching two shapes using shape context descriptors, the shapeme feature descriptor [42] was suggested. The shapeme descriptor is calculated by computing shape context histograms for a large set of training images. By clustering the space of all shape context histograms a set of representative shape context histograms can be found, referred to as shapemes. Each image can now be represented by associating each shape context vector with its closest shapeme. The descriptor of an image is thereby reduced to a histogram over the shapemes. Matching two shapes represented as shapemes can then be performed using a simple nearest neighbor classifier in the shapeme space.
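The shapeme construction therefore amounts to vector quantization of shape context histograms followed by a histogram over cluster assignments; a minimal sketch using scikit-learn k-means is shown below. The cluster count of 100 is chosen to match the 100-dimensional descriptor used for the Poser data later in this chapter, but the clustering details are otherwise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_shapemes(train_histograms, n_shapemes=100):
    # `train_histograms`: shape context histograms of many training images, stacked row-wise.
    return KMeans(n_clusters=n_shapemes, n_init=10, random_state=0).fit(train_histograms)

def shapeme_descriptor(image_histograms, kmeans):
    # Assign every shape context histogram of one image to its closest shapeme
    # and summarize the image as a normalized histogram over shapeme indices.
    labels = kmeans.predict(image_histograms)
    counts = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return counts / counts.sum()
```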

We use the shapeme descriptor for both the Poser and the HumanEva data-sets. Details of the specific features for the Poser data-set can be found in [1] and for the HumanEva data in [55].

5.3.2 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) image descriptor was suggested in [18] for the task of detecting people in images. The descriptor is based on the distribution of local texture gradients in the image. Computing the HOG descriptor for an image is a three stage process. In the first step the response of the image to a gradient filter is computed, associating each pixel in the image with a direction and a magnitude. The second step involves binning the gradient directions over spatial regions, referred to as cells, into gradient histograms. Each bin in the histogram represents a set of gradient directions, and the strength of the vote from each pixel in the image is taken as a function of the gradient magnitude; in the original work [18], directly using the gradient magnitude was found to give the best performance. A good descriptor should be robust towards lighting changes in the image. To achieve this, in the third step of the feature extraction the gradient magnitude of each cell is normalized with respect to a local region in the image referred to as a block. The final feature vector is calculated as the response to a sliding window over the image where the normalized cell responses are binned either using a rectangular window (R-HOG) or using a circular window (C-HOG).
either using rectangular window (R-HOG) or using a circular window (C-HOG).<br />

The HOG descriptor has much in common with the SIFT descriptor [38], which is also based on histograms of local gradient directions. However, whereas the SIFT descriptor is computed at key-points extracted from the image, and at multiple scales, the HOG descriptor is a dense descriptor evaluated for each pixel in the image.

We use the HOG feature evaluated on the HumanEva data-set. Details of the specific feature can be found in [46].
specific feature can be found in [46].<br />

5.4 Data-sets

We apply the two suggested models to two different data-sets. The first data-set, referred to as Poser, was presented in [2]. Poser consists of images generated using the computer graphics package Poser from real motion-capture data from the CMU database¹. Each image is represented by its corresponding silhouette described using a 100 dimensional shapeme descriptor [42] as presented in [2]. Each pose is represented using 54 joint angles of a full body. In total the data-set consists of 1927 training poses from 8 sequences of varied motions. The data-set is provided with one test sequence of 418 frames describing a circular walk.

¹The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.

The second data-set we are going to apply the models to is a real-world data-set known as HumanEva, first presented in [54]. The full HumanEva data-set consists of six different motions performed by three different persons, divided into training, test and validation sets. In addition, frames of a fourth person are provided without labeling of the motions. Each frame is captured using several cameras; in our experiments we use only images from a single camera. Due to restrictions in the amount of data the proposed models can process we have chosen a limited training data-set. Further limitations are necessary due to errors in the provided motion capture data. We use two different subsets. The first set contains the walking sequence for subject one, where we use the first cycle in the walk for training and the second cycle for testing. Each image in this data-set is described using a 300 dimensional shapeme feature as used in [55]. In [46] each image of the HumanEva data-set is aligned by projecting the associated pose onto the view plane and using the Procrustes score to align each image into a common frame. Through this process each image can be flipped around the horizontal axis to effectively double the size of the data-set. For the second set of images we append the first with the corresponding flipped image from the image-set created in [46] and use every second image in the data. Each image in the second data-set is represented using the Histogram of Oriented Gradients (HOG) feature [18] as used in [46]. As a pre-processing stage, to reduce computational time, we represent each descriptor by its projection onto the first 100 principal directions extracted from the data. The poses in the HumanEva data-set are represented as the locations of a set of 19 joints in 3D-space, resulting in a 57 dimensional pose representation. We remove the global translation by centering the data.

We apply the Poser data-set to the shared back-constrained GP-LVM model and the second HumanEva data-set to the subspace GP-LVM model. The first HumanEva set is used to compare the different models. We will now proceed to present how each model is defined and how inference is done for each problem setting.

5.5 Shared back-constrained GP-LVM

In this section we will present the application of the shared back-constrained GP-LVM model to the task of single image human pose estimation. Given a set of image features $[\mathbf{y}_1, \ldots, \mathbf{y}_N]$ with corresponding pose parameters $[\mathbf{z}_1, \ldots, \mathbf{z}_N]$ we learn a shared back-constrained GP-LVM model. For the generative mappings from the latent space to the observed data we use a kernel consisting of the sum of an RBF, a bias and a white noise kernel,

$$k_{gen}(\mathbf{x}_i, \mathbf{x}_j) = \theta_1 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\theta_2^2}\right) + \theta_3 + \beta\delta_{ij}. \qquad (5.1)$$

Using this kernel means that each generative mapping is specified by the four hyper-parameters $\Phi = [\theta_1, \theta_2, \theta_3, \beta]$.
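A minimal sketch of this compound kernel in numpy. Only the three-part structure (RBF + bias + white noise) is taken from the text; the exact parameterization of the RBF width and of the noise term (β versus β⁻¹) depends on conventions not fixed by Eq. 5.1, so the form below is an assumption.

```python
import numpy as np

def compound_kernel(X1, X2, theta1, theta2, theta3, beta):
    """RBF + bias + white noise kernel in the spirit of Eq. 5.1."""
    sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                - 2 * X1 @ X2.T)
    K = theta1 * np.exp(-sq_dists / (2 * theta2**2)) + theta3
    if X1 is X2:
        # The white-noise term only contributes on the diagonal of the training matrix.
        K = K + beta * np.eye(len(X1))
    return K
```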

For the back-constraint from the pose space to the latent space we use a simple regression over a kernel induced feature space. In all our experiments we use an RBF kernel,

$$k_{back}(\mathbf{z}_i, \mathbf{z}_j) = \exp\left(-\frac{\|\mathbf{z}_i - \mathbf{z}_j\|_2^2}{2\theta_{back}}\right), \qquad (5.2)$$

where the kernel width $\theta_{back}$ is set by examining the kernel response to the pose parameters.

Both the Poser data and the HumanEva data are sequential. We will therefore also learn a dynamical model to predict sequentially over the latent space. The dynamic model is a GP predicting over time. We will use the same type of kernel as for the generative mappings, i.e. a combination of an RBF, a bias and a white noise kernel, resulting in a mapping specified by four hyper-parameters.

In total this means that the shared back-constrained model we apply in this chapter is specified by 8 hyper-parameters for the generative mappings, one parameter for the back-constraint and 4 hyper-parameters for the dynamical model. Further, the dimension of the latent space needs to be fixed; in all our experiments we use a 3 dimensional latent space. As described above, the parameter of the back-constraint is set by inspection: for the Poser data we use $\theta_{back} = \frac{1}{10^{-3}}$ and for the HumanEva data we use $\theta_{back} = \frac{1}{4\cdot 10^{-6}}$. In Figure 5.1 the kernel response matrices associated with the back-constraints for the two data-sets are shown. We initialize the latent locations X of the model by applying PCA to the pose parameters. The hyper-parameters of the GP mappings are learnt together with the latent locations of the data by optimizing the marginal likelihood of the model, Eq. 3.4, using a scaled conjugate gradient optimizer. Ideally we would like to run the optimizer till convergence; however, in practice, we limit the number of iterations to 10000, as running for further iterations does not have a significant impact on the results.


Figure 5.1: Kernel response matrices on the training data over which the back-mapping is applied in the back-constrained model. The left image shows the response to the Poser data using an RBF kernel with width $\theta_{back} = \frac{1}{10^{-3}}$. The right image corresponds to the HumanEva data, also using an RBF kernel, with width $\theta_{back} = \frac{1}{4\cdot 10^{-6}}$.

5.5.1 Single Image Inference

For the task of single image pose inference we are given a single feature vector $\mathbf{y}^*$ for which we want to infer its corresponding pose parameters $\mathbf{z}^*$. The latent space X is back-constrained from the pose space Z, which means that we encourage a one-to-one mapping between the latent space and the pose space. This means that if we can determine the latent location $\mathbf{x}^*$ associated with the image feature $\mathbf{y}^*$ we can recover the associated pose parameters $\mathbf{z}^*$ through the mean prediction of the GP generating the pose space. We can find the latent coordinate associated with a specific image feature by maximizing the likelihood over the latent space,

$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} p(\mathbf{y}^*|\mathbf{x}), \qquad (5.3)$$

through gradient based optimization. However, we expect the image features to be ambiguous with respect to pose, which implies a multi-modal mapping from image feature to pose. Because we encourage the mapping between the latent space and the pose space to be uni-modal, the multi-modality in the data needs to be contained in the mapping between the latent space and the image features. This means that the optimization in Eq. 5.3 is multi-modal and needs to be initialized in a convex region around $\mathbf{x}^*$. In practice we first perform a nearest-neighbor search in the image feature space and initialize the latent coordinates with the latent coordinates associated with the N nearest neighbors in the training data. For the Poser data we use 6 nearest neighbors; increasing the number does not result in a significant increase in performance, while reducing the number leads to missing modes. We run each point optimization of Eq. 5.3 till convergence, which usually corresponds to 10-100 iterations.

In Figure 5.3 results for single image inference on the Poser data-set are shown.
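The inference procedure just described can be sketched schematically as follows: initialize at the latent points of the nearest neighbors in feature space, then refine each by gradient-based maximization of $p(\mathbf{y}^*|\mathbf{x})$. The kernel, its fixed hyper-parameters and the function names are assumptions; a full GP-LVM implementation would reuse the learned kernel and its hyper-parameters.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def neg_log_lik(x, y_star, X, Y, K_inv, noise=1e-2):
    # Negative log p(y* | x) under the GP mapping latent points to image features.
    x = np.atleast_2d(x)
    k_x = rbf(X, x)                                          # (N, 1) cross-covariance
    mu = (k_x.T @ K_inv @ Y).ravel()                         # predictive mean per feature dim
    var = (rbf(x, x) - k_x.T @ K_inv @ k_x)[0, 0] + noise    # shared predictive variance
    D = Y.shape[1]
    return 0.5 * D * np.log(2 * np.pi * var) + 0.5 * np.sum((y_star - mu) ** 2) / var

def infer_latent_modes(y_star, X, Y, n_neighbors=6, noise=1e-2):
    # Nearest-neighbor initialization followed by gradient-based optimization (Eq. 5.3).
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    nn = np.argsort(np.linalg.norm(Y - y_star, axis=1))[:n_neighbors]
    return [minimize(neg_log_lik, X[i], args=(y_star, X, Y, K_inv, noise)).x for i in nn]
```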

5.5.2 Sequential Inference

We are often interested in inferring the poses $\mathbf{Z}^*$ associated with a temporal sequence of image observations $\mathbf{Y}^*$. In such a setting we can exploit the temporal consistency of the sequence to disambiguate our multi-modal estimate and recover a single pose estimate per image.

To find the most likely sequence of latent locations associated with the image observations we interpret the sequence through a hidden Markov model (HMM), where the latent states of the model correspond to the latent locations of the training data X. The likelihood of each observation is given by the likelihood of the GP associated with each latent point,

$$L_i^{obs} = p(\mathbf{y}^*|\mathbf{x}_i). \qquad (5.4)$$

The transition probabilities are given by the dynamical GP predicting latent locations over time,

$$L_{ij}^{trans} = p(\mathbf{x}_i|\mathbf{x}_j). \qquad (5.5)$$

The most probable path $\mathbf{X}_{init}$ through this lattice can be found using the Viterbi algorithm [73].
algorithm [73].<br />

Having found the most likely sequence through the training data Xinit we use<br />

this as an initialization to optimize the sequence,<br />

X ∗ = argmax X p(Y ∗ |X). (5.6)<br />

In Figure 5.4 results for the sequential estimate of the shared back-constrained<br />

GP-LVM model is shown.<br />

5.6 Subspace GP-LVM

Even though it is not explicitly stated, the shared GP-LVM applied in [44] assumes that the image features and pose parameters are governed by the same degrees of freedom. This means that the full variance of each observation space needs to be fully correlated. However, the image features are based on heuristic assumptions about this correlation and are very likely to contain a significant amount of non-pose-correlated variance. This variance, irrelevant for the task of pose estimation, needs to be "explained away" from the latent representation by the noise term in the GP generating the image features. The back-constrained GP-LVM model aims to encourage the latent space to be a full re-representation of pose, encouraging the model to explain the non-correlated image feature variance as noise. However, for many types of features the non-pose-correlated variance represents a significant portion of the variance. Further, the structure of this variance is often not well represented by our Gaussian noise assumption, which means it will pollute the structure of the latent representation.

An alternative approach is to apply the subspace GP-LVM, which models the non-correlated variance separately using additional latent spaces. The additional image feature latent space will explain the non-pose-correlated variance in the image feature space. Compared to the shared GP-LVM this means that this variance is explained using the full flexibility of a GP instead of a simple Gaussian noise model. Further, the private latent subspace associated with the pose represents variance in the pose space that is not correlated with the image features. This implies that locations over this subspace represent poses orthogonal or ambiguous to the image features.

The shared back-constrained model could be initialized efficiently by applying PCA to the pose space. However, for the subspace model this scheme cannot be used. Instead we initialize the shared latent space using CCA and the private spaces using NCCA. Before applying CCA to find the shared locations we apply PCA to both observation spaces to remove directions in the data representing non-significant variance. For the first HumanEva data-set we apply linear PCA, while for the second HumanEva data-set we apply kernel PCA on an MVU [78] kernel computed using 7 nearest neighbors. In practice we keep 70% of the variance of the image feature space and 90% of the variance in the pose space. Having found a reduced representation of the data we apply CCA to find an initialization of the shared latent locations; we keep the CCA directions that have a normalized correlation coefficient of more than 30%. Finally, to complete the latent representation with the private spaces, we apply NCCA to find directions orthogonal to the CCA solution. We find directions until 95% of the variance in the PCA reduced representation of each observation space is explained. For the first HumanEva data-set this results in a two dimensional shared space and one dimensional private spaces associated with each observation. The second HumanEva data-set results in a one dimensional shared space, a one dimensional private space for the image features and a two dimensional private pose space.
for the image features and a two dimensional private pose space.<br />

Just as with the back-constrained model we use a combination of an RBF, a<br />

bias and a white-noise kernel for the generative mappings Eq. 5.1, meaning that<br />

each generative mapping is specified using 4 hyper-parameters. We will use the<br />

first data-set to find a single pose estimate for each time instance in a sequence<br />

of images. To disambiguate our multi-modal estimate between each time instance<br />

we will use a dynamic model capable to predict over the latent space. We use<br />

a GP regressor as a dynamic model, using the same type of kernel combination<br />

as the generative mappings Eq. 5.1. In total this means that for the first data-set<br />

we learn a model having 8 parameters for the generative mappings, 4 parameters<br />

for the dynamic model in addition to the location of the latent points. For the<br />

second data-set we keep the latent locations fixed to exemplify the performance<br />

of the NCCA algorithm which means that the only parameters for the model is the<br />

8 hyper-parameters of the generative mappings. We learn the parameters of the<br />

models by optimizing the marginal likelihood of the model Eq.3.15 using a scale


5.6. SUBSPACE GP-LVM 107<br />

conjugate gradient algorithm. Ideally we would like to run the optimization until<br />

the model converges, however, in practice, we limit the number of iterations to<br />

10000. We have found that running the optimization further does not results in a<br />

increased performance.<br />

5.6.1 Single Image Inference

By initializing the subspace model's shared space using CCA and the private spaces using NCCA we have assumed that the shared latent space can be represented as a smooth mapping from both the image feature and the pose space. Having trained a model we learn a mapping $g_y$ from the image features Y to the shared latent locations $X^S$ of the trained model,

$$\mathbf{x}_i^S = g_y(\mathbf{y}_i). \qquad (5.7)$$

In practice we will use a GP regressor to model $g_y$; however, any regression model would be applicable. The GP regressor uses a combination of an RBF, a bias and a white noise kernel, Eq. 5.1, specified by four hyper-parameters that are found through gradient based optimization of the GP marginal likelihood of the data, Eq. 2.48. Having learnt $g_y$ means that given a new unseen image feature $\mathbf{y}^*$, the corresponding location on the shared latent space $\mathbf{x}^*$ can be determined. The private latent space for the pose, $X^Z$, is orthogonal to the full latent representation of the image feature. This implies that its location, $\mathbf{x}_*^Z$, has no correlation with the location in the image feature space $\mathbf{y}_*$. However, it is assumed that the problem is "well" represented by the training data, implying that examples of each ambiguity we are likely to see are represented in the training data. This implies that by finding locations $\mathbf{x}_{*i}^Z$ over the private pose space which maximize the likelihood of the pose $\mathbf{z}_*$ generated from the corresponding full latent pose location $[\mathbf{x}_*^S; \mathbf{x}_{*i}^Z]$, each mode will correspond to one of the pose ambiguities of the feature. Maximizing the likelihood of the pose corresponds to minimizing the predictive variance of the GP generating the pose space,

$$\hat{\mathbf{x}}_*^Z = \operatorname*{argmin}_{\mathbf{x}_*^Z} \left[ k(\mathbf{x}_*^{S,Z}, \mathbf{x}_*^{S,Z}) - k(\mathbf{x}_*^{S,Z}, \mathbf{X}^{S,Z})^T (\mathbf{K} + \beta^{-1}\mathbf{I})^{-1} k(\mathbf{x}_*^{S,Z}, \mathbf{X}^{S,Z}) \right]. \qquad (5.8)$$

5.6.2 Sequential Inference
∗ ,X S,Z ) . (5.8)<br />

Finding the modes associated with different locations over the private pose space associates multiple poses to an ambiguous location in the image feature space. As the private pose space is orthogonal to the image feature latent representation, it is not possible to disambiguate between the different modes using information from the feature. However, pose data is sequential; by placing a dynamic GP over the latent pose space, a representation respecting the data's dynamics can be learned. When inferring the pose from a sequence of image features the dynamic model can be used to disambiguate locations over the image feature orthogonal private pose space. The shared latent locations are determined by the mapping from the image features as before, but the locations over the private subspace can, with the incorporation of the dynamic model, be found such that the full sequence renders a high likelihood,

$$\hat{\mathbf{X}}_*^Z = \operatorname*{argmax}_{\mathbf{X}_*^Z} p(\mathbf{X}_*^Z|\mathbf{Z}, \mathbf{X}^S, \Phi_Z, \Phi_{dyn}). \qquad (5.9)$$
ˆX Z ∗ = argmax X Z ∗ p(X Z ∗ |Z,X S , ΦZ, Φdyn). (5.9)


Figure 5.2: Angle error: The image on the left is the true pose, the middle image has an angle error of 1.7°, and the image on the right has an angle error of 4.1°. An angle error higher up in the joint hierarchy will affect the positions of all joints further down. As the errors for the middle image are higher up in the hierarchy they affect each limb connected further down the chain from this joint, thereby resulting in significantly different limb positions.

5.7 Quantitative Results

Both the Poser and the HumanEva data-sets come with a provided error measure to quantify the quality of the results. In the case of Poser the mean RMS error is used, defined as follows,

$$E_{poser}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{1}{N}\sum_{i=1}^{N} \|(\hat{\mathbf{z}}_i - \mathbf{z}_i) \bmod 360^{\circ}\|_2, \qquad (5.10)$$

where $\mathbf{z}$ is the true pose and $\hat{\mathbf{z}}$ is the estimated pose. To make comparison of results possible we will follow [2] and use the above error measure. However, $E_{poser}$ can be misleading as a measure of quality, as it is applied in the joint angle space. A mean square error treats all dimensions of the joint angle space with equal importance and does not reflect the hierarchical structure of the human physiology. This means that joints higher up in the hierarchy affect all joints further down in the hierarchy (Figure 5.2).

Figure 5.3: Single Image Pose Estimation: Input silhouette followed by output poses associated with modes on the latent space, ordered according to decreasing likelihood. As can be seen, the modes correspond to varying limb positions we expect to be ambiguous to the input silhouette.

The HumanEva data-set avoids the problems associated with the Poser data by representing poses using joint locations rather than joint angles,

$$E_{HumanEva}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2. \qquad (5.11)$$

The HumanEva error metric $E_{HumanEva}$ has a better correspondence to the visual quality of the pose estimate, being calculated in a joint position space rather than the joint angle space used for $E_{Poser}$.
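The two error measures can be computed as in the sketch below. How the modulo operation in Eq. 5.10 wraps the angle differences, and whether Eq. 5.11 is averaged per joint or taken over the stacked pose vector, are interpretations on my part; the code follows one common reading of each.

```python
import numpy as np

def poser_angle_error(z_true, z_est):
    # Eq. 5.10: per-frame norm of the wrapped angle differences, averaged over frames.
    # Angles in degrees; the symmetric wrap to (-180, 180] is an interpretation of "mod 360".
    diff = (z_est - z_true + 180.0) % 360.0 - 180.0
    return np.mean(np.linalg.norm(diff, axis=1))

def humaneva_joint_error(joints_true, joints_est):
    # Eq. 5.11 read as the mean per-joint 3-D distance (the usual HumanEva convention).
    # Inputs of shape (N, n_joints, 3), in millimetres.
    return np.mean(np.linalg.norm(joints_est - joints_true, axis=-1))
```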

5.8 Experiments

We applied the shared back-constrained GP-LVM model to the Poser data-set. Figure 5.3 shows the different pose estimates associated with a single image feature vector. In Figure 5.4 the estimate for the shared back-constrained GP-LVM model, using dynamics to disambiguate the different modes, is shown. As can be seen, we correctly estimate the pose for most frames. In Table 5.1 the angle error of our suggested model is compared to a set of baseline regression algorithms. We can see that both shared back-constrained GP-LVM methods perform in line with or better than the regression algorithms. It is clear that learning a latent representation respecting the dynamics of the data is beneficial, as the best results are achieved by the only model incorporating dynamics for inference.

Figure 5.4: Every 20th frame from a circular walk sequence. Top Row: Input Silhouette, Middle Row: Model Pose Estimate, Bottom Row: Ground Truth.

                           Angle Error (°)
    Mean Training Pose          8.3
    Linear Regression           7.7
    RVM                         5.9
    GP Regression               5.8
    SBC-GP-LVM Single           6.5
    SBC-GP-LVM Sequence         5.3

Table 5.1: Mean RMS angle error for the Poser data-set. SBC-GP-LVM refers to the shared back-constrained GP-LVM model. Note that only the SBC-GP-LVM Sequence method uses temporal information.

For a silhouette based image representation as the shapeme descriptor we ex-<br />

pect significant ambiguities with respect to the heading direction, i.e. in or out of<br />

the view-plane. However, as can be seen in [2] (Figure 7), the heading angle is<br />

nearly perfectly predicted using a regression algorithm. This means that the Poser<br />

data-set does not contain a significant amount of feature to pose ambiguities. Due


112 CHAPTER 5. APPLICATIONS<br />

                         Error (mm)
Mean Training Pose            163
Linear Regression             384
GP Regression                 163
SBC-GP-LVM Single             201
SBC-GP-LVM Sequence            73
S-GP-LVM Sequence              70

Table 5.2: Mean error for the shape context HumanEva data-set. SBC-GP-LVM refers to the shared back-constrained GP-LVM model and S-GP-LVM refers to the subspace GP-LVM model.

Since the central motivation of the subspace GP-LVM model is specifically to model such ambiguities, we proceed to use the HumanEva data-set.

We apply the suggested models to the first HumanEva data-set, using shapeme features to represent each image. Table 5.2 shows the results for our suggested models and a set of baselines. From the poor performance of the regression algorithms applied to the HumanEva data we can see that this data-set contains a significant amount of ambiguity between the image representation and the pose. As can be seen, the single estimate using the shared back-constrained GP-LVM results in worse performance than both the model using dynamical information and the GP regression baseline. This is to be expected: in cases where the features are ambiguous the model will predict one of the possible poses, while the regression algorithm will predict the mean of the ambiguous poses, which for many types of ambiguity results in a smaller error than predicting the “wrong” mode. One advantage of the subspace GP-LVM model is that we can easily visualize the ambiguities by sampling locations over the pose private subspace.
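Concretely, such a visualization can be produced by grid-sampling the private subspace and keeping the local maxima of the model's likelihood; the sketch below is only an illustration of that idea, and the callable `log_lik_fn` (mapping private latent locations, with the shared coordinates held fixed, to log-likelihood values) is an assumed interface rather than the thesis implementation:

```python
import numpy as np

def find_modes(log_lik_fn, extent=3.0, resolution=60):
    """Grid-sample a 2D private latent subspace and return its local maxima.
    log_lik_fn: callable mapping an (M, 2) array of private latent locations to
                (M,) log-likelihood values (assumed to be provided by the model)."""
    ticks = np.linspace(-extent, extent, resolution)
    gx, gy = np.meshgrid(ticks, ticks)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    ll = log_lik_fn(grid).reshape(resolution, resolution)

    modes = []
    for i in range(1, resolution - 1):
        for j in range(1, resolution - 1):
            patch = ll[i - 1:i + 2, j - 1:j + 2]
            # keep points that dominate their 3x3 neighbourhood
            if ll[i, j] == patch.max() and ll[i, j] > patch.mean():
                modes.append((ticks[j], ticks[i], ll[i, j]))   # (x, y, log-likelihood)
    return sorted(modes, key=lambda m: -m[2])                  # ordered by decreasing likelihood
```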

In the next section we compare the different proposed models.



5.9 Comparison

Figure 5.5 shows the kernel response matrices that generate the observed features for the learned kernel hyper-parameters. As can be seen, the width of the kernel for the back-constrained model is much smaller than for the subspace model. This implies that the back-constrained model is less capable of generalizing over the image feature representation than the subspace model. This is an effect of trying to learn a shared latent space for two observed spaces which have different latent structures. The back-constraint enforces a latent structure which preserves the local structure of the pose space; however, this structure is different from the structure of the image feature space, making the GP that generates the image features unable to generalize over the feature representation. In the subspace model the private subspace can be used to model structure in the feature space that is not contained in the pose parameters, making the model capable of generalizing better over the image features. We have shown the advantage of the latent factorization in the subspace model applied to the HumanEva data-set, as it allows the model to better generalize its description of the data.
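To make the role of the kernel width concrete, the sketch below (with arbitrary toy latent points, not the learned ones) evaluates an RBF kernel on the same inputs at two length-scales; with a very short length-scale the off-diagonal kernel response vanishes, so each feature vector is explained almost entirely by its own latent point and the mapping cannot generalize:

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    """Squared exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

X = np.random.default_rng(1).normal(size=(50, 2))    # toy 2D latent locations
K_narrow = rbf_kernel(X, lengthscale=0.05)            # back-constrained-like: tiny width
K_wide = rbf_kernel(X, lengthscale=1.0)               # subspace-like: larger width

# Fraction of kernel mass off the diagonal: close to zero for the narrow kernel,
# meaning each output is modelled almost independently of the other latent points.
off_diag = lambda K: (K.sum() - np.trace(K)) / K.sum()
print(off_diag(K_narrow), off_diag(K_wide))
```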

For the next experiment we use the second HumanEva data-set, which in addition to the first data-set also includes each image “flipped”; this significantly increases the amount of ambiguity in the data. Each image in this data-set is represented using the HOG feature descriptor. Applying the subspace GP-LVM model initialized using the NCCA model leads to a three dimensional latent pose representation divided into a single shared dimension and a two dimensional private pose space.

Figure 5.6 shows the results of sampling the likelihood over the pose specific private latent subspace X^Z for a sequence of input images, together with the pose coming from the mode closest to the ground truth pose.



Figure 5.5: The kernel response matrices used to generate the observed image features from the learned latent representation. The left figure shows the matrix for the shared back-constrained GP-LVM model while the right figure shows the matrix for the subspace GP-LVM model. Comparing the images, it can be seen that the back-constrained model uses a much smaller kernel width than the subspace model, implying that it is not capable of generalizing as well as the subspace model.

We can see how the modes evolve over the sequence: for images representing motion perpendicular to the view-plane (1, 2 and 6) there are two elongated modes, while for motion in the view-plane there is a discrete set of modes. Having fixed the shared latent location reduces the estimation task to choosing the appropriate mode, which corresponds to a much smaller subset of the possible poses than the full data-set.

We compare the subspace GP-LVM model to the standard shared GP-LVM [52, 44]. For comparison and visualization purposes we train a shared GP-LVM model using a two dimensional latent space. Neither model makes use of dynamical information. Figure 5.7 shows two examples of how the models represent the training data. For the first image, where the human is positioned perpendicular to the view-plane, there are two elongated modes over the pose specific private latent space.



Figure 5.6: Pose inference on a sequence of images from the second HumanEva data-set. Top row: original test set image. Second row: visualisation of the modes in the non-shared portion of the pose specific latent space. Note how the modes evolve as the subject moves. When the subject is heading in a direction perpendicular to the view-plane it is not possible to disambiguate the heading direction (images 1, 2 and 6); this is indicated by two elongated modes. In images 3–5 it is not possible to disambiguate the configuration of the arms and legs; this gives rise to a set of discrete modes over the latent space, each associated with a different configuration. Bottom row: the pose coming from the mode closest to the ground truth. The different types of mode are explored further in Figure 5.7.
Figure 5.7.



Sampling poses along the modes shows that each mode corresponds to the heading direction being into or out of the view-plane, while points along a mode correspond to different configurations of the legs. For the second image the motion is parallel to the view-plane, resulting in a discrete set of modes, each corresponding to a different limb configuration that we would expect to be ambiguous. As can be seen, both models are capable of modeling the correct modes in the data. However, comparing the two energies we can see that while the modes are clearly separated for the subspace model, the energy for the shared model is scattered with local minima. As we have shown in this chapter, the image features commonly used for pose estimation are often ambiguous with respect to pose. This means that, even for a model producing a multi-modal estimate, being able to easily encode additional information to disambiguate between the different modes is essential. Comparing the standard shared GP-LVM with the subspace model, it should be significantly easier to use additional information to disambiguate between the different modes in the subspace model.

5.10 Summary

In this chapter we have shown the application of the shared back-constrained and the subspace GP-LVM models to the task of image based human pose estimation. Image based human pose estimation is an interesting application because of its potential usefulness for real-world problems in areas such as computer graphics and human-robot interaction. Further, due to the high dimensionality of the input domain, each image is represented using some type of heuristic image feature, which introduces ambiguities between each image representation and its corresponding pose.



Subspace GP-LVM
Shared GP-LVM

Figure 5.7: The top row shows two images from the training data. The 2nd and 3rd rows show results from inferring the pose using the subspace model: the first column shows the likelihood sampled over the pose specific latent space constrained by the image features, and the remaining columns show the modes associated with the locations of the white dots over the pose specific latent space. Subspace GP-LVM: in the 2nd row the position of the leg and the heading angle cannot be determined in a robust way from the image features. This is reflected by two elongated modes over the latent space representing the two possible headings. The poses along each mode represent different leg configurations. The top row of the 2nd column shows the poses generated by sampling along the right mode and the bottom row along the left mode. In the 3rd row the position of the leg and the heading angle is still ambiguous given the feature; however, here the ambiguity is between a discrete set of poses, indicated by four clear modes in the likelihood over the pose specific latent space. Shared GP-LVM: the 4th and 5th rows show the results of doing inference using the shared GP-LVM model. Even though the most likely modes found are in good correspondence with the ambiguities in the images, the latent space is cluttered by local minima that the optimization can get stuck in.



These ambiguities imply that the task of estimating the pose associated with a specific image feature is multi-modal, which makes the problem interesting from a machine learning perspective. By applying the proposed models we have shown two different approaches to handling the multi-modalities that arise, each with associated advantages and disadvantages.

The main goal behind these models was to be applicable to data where a significant amount of ambiguity exists between the different observation spaces. With the image features available, human pose estimation is known to be such a task. However, we would like to point out that our models are general and could be applied in other application areas known to exhibit similar relationships between the views. Further, the inference schemes which we apply to obtain quantitative results are rudimentary at best; we do not propose a method for pose estimation, we simply use the task as an example of modeling ambiguous data. Our main result concerns how the presented models handle multi-modalities, not how to disambiguate between such modes.

In the following chapter we conclude the work presented in this thesis and briefly discuss potential directions for future work.


Chapter 6

Conclusions

6.1 Discussion

In this thesis we have presented two different probabilistic models based on Gaussian Processes for dimensionality reduction in the scenario where we have multiple different observations of a single underlying phenomenon. The presented models are applicable to a multitude of modeling scenarios: where each observation space is fully correlated, where we are only interested in modeling the correlated data, or where we want to learn a factorized structure and model both the shared correlated information and the non-correlated information private to each observation space.

6.2 Review of the Thesis

The first chapter provided a brief introduction to the work undertaken in this thesis. In chapter 2 the machine learning task of dimensionality reduction was motivated and reviewed.


Further, chapter 2 provided an introduction to Gaussian Processes, upon which the work in this thesis is built. Chapter 3 presents two different Gaussian Process Latent Variable models for shared dimensionality reduction, applicable to two different but very common modeling scenarios. In chapter 4 a novel spectral dimensionality reduction algorithm is presented. This model is an essential component of the factorized subspace GP-LVM model presented in chapter 3. Chapters 3 and 4 present the new work done in this thesis. In chapter 5 we apply the presented models to human motion capture data with associated image observations. Further, we apply the learned models to the computer vision task of human pose estimation.

6.3 Future Work

There are multiple possible directions for future work on the models presented in this thesis. Even though we have briefly shown results for the models applied to the task of human pose estimation, we believe that there are several other applications where the use of the shared GP-LVM models can be advantageous. Such application areas include multi-modal feature fusion and multi-agent modeling. Even though we have described the model in terms of two observation spaces, there is nothing in the framework limiting the number of observation spaces. The suggested application areas are often characterized by more than two observations, which makes them interesting modeling scenarios.

This thesis has focused on creating models applicable to data with a common shared structure. However, we have not focused on performing inference within the suggested models.



The inference procedures suggested in the applications chapter are rudimentary at best. We believe that the models model the data in an efficient way and that, with better inference procedures, the results presented on human pose estimation could be significantly improved.

The GP-LVM model has, in the general case, a number of free parameters. Even though these are few compared to many other models for dimensionality reduction, determining the latent dimensionality in particular is not trivial, and its choice has a significant effect on how well the observed data is modeled. Recent work [21] applies a rank regulariser to learn the dimensionality of the latent space, effectively removing this as a free parameter from the GP-LVM. We believe this could be of significant benefit to the shared GP-LVM models presented in this thesis.

The objective of the standard GP-LVM model is to find a latent structure from which a GP regression is able to minimize the reconstruction error of the observed data. For a single observed space, minimizing the reconstruction error is a sensible objective. However, for the shared models, especially in the case of the factorized subspace, this is less obvious. In effect, the success of the presented model relies on the initialization of the latent space lying in a convex region of the factorized solution we are looking for, since nothing beyond the initialization encourages the shared/private factorization of the latent space. It would therefore be interesting to explore ways of encoding this factorization as part of the GP-LVM objective. Further, we use CCA to initialize the shared latent locations of the subspace model. There are several problems associated with the solution of CCA, some of which we address in this thesis. It would be interesting to explore different criteria for initializing the shared locations.



A recent model [55] learns a shared latent variable model in which a latent space maximizing the mutual information between the observations is learnt. It would further be interesting to see whether the model presented in [55] could be extended to incorporate the shared/private factorization suggested in this thesis.
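As a concrete, illustrative starting point for such an initialization, the sketch below uses an off-the-shelf CCA implementation and averages the paired projections to obtain shared latent coordinates; this simple choice is an assumption made here for illustration, not the NCCA procedure used in the thesis:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def init_shared_latents(Y_image, Y_pose, shared_dims=1):
    """Illustrative CCA-based initialization of shared latent coordinates.
    Y_image: (N, D_image) image features; Y_pose: (N, D_pose) pose parameters."""
    cca = CCA(n_components=shared_dims)
    z_image, z_pose = cca.fit_transform(Y_image, Y_pose)   # paired canonical projections
    # One common simple choice: average the two projections into a single shared space.
    return 0.5 * (z_image + z_pose)

# toy usage with data sharing a one-dimensional latent signal
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
Y_img = np.hstack([shared, rng.normal(size=(200, 9))])
Y_pos = np.hstack([shared, rng.normal(size=(200, 5))])
X_shared = init_shared_latents(Y_img, Y_pos, shared_dims=1)
```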


Bibliography

[1] A. Agarwal. Machine learning for image based motion capture. PhD thesis, Institut national polytechnique de Grenoble, April 2006.

[2] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, 2004.

[3] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25(6):821–837, 1964.

[4] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.

[5] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585–591, 2002.

[6] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In NIPS, 2000.




[7] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.

[8] M. Bethge, T. Wiecki, and F. Wichmann. The independent components of natural images are perceptually dependent. In B. E. Rogowitz, T. N. Pappas, and S. J. Daly, editors, Human Vision and Electronic Imaging XII, Proceedings of the SPIE, volume 6492, page 64920A, 2007.

[9] C. Bishop, M. Svensén, and C. Williams. GTM: A principled alternative to the self-organizing map. In Artificial Neural Networks: ICANN 96: 1996 International Conference, Bochum, Germany, July 16–19, 1996: Proceedings, 1996.

[10] P. Bose, W. Lenhart, and G. Liotta. Characterizing proximity trees. Algorithmica, 16(1):83–110, 1996.

[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[12] M. Bray, P. Kohli, and P. Torr. PoseCut: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. Lecture Notes in Computer Science, 3952:642, 2006.

[13] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, pages 8–15, 1998.

[14] J. Quiñonero-Candela and C. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935–1959, 2005.



[15] T. Cham and J. Rehg. A multiple hypothesis approach to figure tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 2, 1999.

[16] H. Choi and S. Choi. Kernel Isomap. Electronics Letters, 40(25):1612–1613, 2004.

[17] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2001.

[18] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 1, 2005.

[19] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In IEEE Conference on Computer Vision and Pattern Recognition, 2000. Proceedings, volume 2, 2000.

[20] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[21] A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear dimensionality reduction. In CVPR '09: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[22] I. Good. What are degrees of freedom? The American Statistician, 27(5):227–228, 1973.

[23] J. Gower and G. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.



[24] K. Grochow, S. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics. ACM Transactions on Graphics (TOG), 23(3):522–531, 2004.

[25] J. Hamm, I. Ahn, and D. Lee. Learning a manifold-constrained map between image sets: applications to matching and pose estimation. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 817–824, Washington, DC, USA, 2006. IEEE Computer Society.

[26] J. Hamm, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 120–127, 2005.

[27] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2008.

[28] S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley. Real-time body tracking using a Gaussian process latent variable model. In Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV'07), 2007.

[29] J. Jaromczyk and G. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9):1502–1517, 1992.

[30] S. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition (FG'96), page 38. IEEE Computer Society, Washington, DC, USA, 1996.



[31] M. Kuss and T. Graepel. The geometry of kernel canonical correlation analysis. Technical Report TR-108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2003.

[32] N. D. Lawrence. Gaussian Process Models for Visualisation of High Dimensional Data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Gaussian Process Models for Visualisation of High Dimensional Data, volume 16, pages 329–336, Cambridge, MA, 2004.

[33] N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, November 2005.

[34] N. D. Lawrence and A. Moore. Hierarchical Gaussian process latent variable models. Proceedings of the 24th International Conference on Machine Learning, pages 481–488, 2007.

[35] N. D. Lawrence and J. Quiñonero-Candela. Local distance preservation in the GP-LVM through back constraints. In R. Greiner and D. Schuurmans, editors, Local distance preservation in the GP-LVM through back constraints, volume 21, pages 513–520, New York, NY, USA, 2006. ACM.

[36] G. Leen. Context assisted information extraction. PhD thesis, University of the West of Scotland, High Street, Paisley PA1 2BE, Scotland, 2008.

[37] G. Leen and C. Fyfe. A Gaussian process latent variable model formulation of canonical correlation analysis. Bruges, Belgium, 26–28 April 2006.

[38] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.



[39] D. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A, 1995.

[40] K. Mardia, J. Kent, J. Bibby, et al. Multivariate Analysis. Academic Press, New York, 1979.

[41] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, pages 415–446, 1909.

[42] G. Mori, S. Belongie, and J. Malik. Efficient shape matching using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(11):1832–1837, 2005.

[43] D. Morris and J. Rehg. Singularity analysis for articulated object tracking. In 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Proceedings, pages 289–296, 1998.

[44] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. The joint manifold model. In IEEE International Conference on Computer Vision (ICCV), 2007.

[45] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[46] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. Torr. Randomized trees for human pose detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[47] R. Rosales and S. Sclaroff. Learning body pose via specialized maps. Advances in Neural Information Processing Systems, 2:1263–1270, 2002.



[48] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[49] M. Salzmann, R. Urtasun, and P. Fua. Local deformation models for monocular 3D shape recovery. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[50] B. Schölkopf. The kernel trick for distances. TR MSR 2000-51, Microsoft Research, Redmond, WA, 2000. Advances in Neural Information Processing Systems, 2001.

[51] B. Schölkopf, A. Smola, and K. Müller. Kernel principal component analysis. Lecture Notes in Computer Science, 1327:583–588, 1997.

[52] A. Shon, K. Grochow, A. Hertzmann, and R. Rao. Learning shared latent structure for image synthesis and robotic imitation. Proc. NIPS, pages 1233–1240, 2006.

[53] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. Lecture Notes in Computer Science, 1843:702–718, 2000.

[54] L. Sigal and M. Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Brown University TR, 2006.

[55] L. Sigal, R. Memisevic, and D. J. Fleet. Shared kernel information embedding for discriminative inference. In CVPR '09: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.



[56] H. A. Simon. The Sciences of the Artificial, 3rd Edition. The MIT Press, October 1996.

[57] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative density propagation for 3D human motion estimation. Proc. Conf. Computer Vision and Pattern Recognition, pages 217–323, 2005.

[58] C. Sminchisescu and B. Triggs. Estimating articulated human motion with covariance scaled sampling. The International Journal of Robotics Research, 22(6):371, 2003.

[59] J. F. M. Svensén. GTM: The Generative Topographic Mapping. PhD thesis, Aston University, April 1998.

[60] J. Tenenbaum, V. Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[61] A. Thayananthan, R. Navaratnam, B. Stenger, P. Torr, and R. Cipolla. Pose estimation and tracking using multivariate regression. Pattern Recognition Letters, 29(9):1302–1310, 2008.

[62] T. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth space of feasible solutions. In CVPR Learning Workshop, volume 2, 2005.

[63] M. Tipping. The relevance vector machine. In NIPS, pages 652–658, 2000.

[64] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[65] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.



[66] G. Toussaint. The relative neighbourhood graph of a finite planar set. Pattern Recognition, 1980.

[67] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In Discriminative Gaussian process latent variable model for classification, pages 927–934, New York, NY, USA, 2007. ACM.

[68] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pages 1–8, 2008.

[69] R. Urtasun, D. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR, June 2006.

[70] R. Urtasun, D. Fleet, A. Geiger, J. Popović, T. Darrell, and N. Lawrence. Topologically-constrained latent variable models. In Proceedings of the 25th International Conference on Machine Learning, pages 1080–1087. ACM, New York, NY, USA, 2008.

[71] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. IEEE International Conference on Computer Vision (ICCV), pages 403–410, 2005.

[72] R. Urtasun, D. J. Fleet, and P. Fua. Temporal motion models for monocular and multiview 3D human body tracking. Computer Vision and Image Understanding, 104(2-3):157–177, 2006.

[73] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.



[74] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. In IEEE Nonrigid and Articulated Motion Workshop, 1997. Proceedings, pages 2–9, 1997.

[75] J. Wang, D. Fleet, and A. Hertzmann. Multifactor Gaussian process models for style-content separation. In Proceedings of the 24th International Conference on Machine Learning, pages 975–982. ACM, New York, NY, USA, 2007.

[76] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

[77] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. Volume 18, Cambridge, MA, 2006.

[78] K. Weinberger, F. Sha, and L. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. ACM International Conference Proceeding Series, 2004.

[79] K. Q. Weinberger, B. D. Packer, and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, January 2005.

[80] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), volume 2, pages 988–995, Washington D.C., 2004.



[81] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.

[82] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In ECML, pages 773–781, 2007.

[83] X. Zhao, H. Ning, Y. Liu, and T. Huang. Discriminative estimation of 3D human pose using Gaussian processes. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4, 2008.
