Residual Component Analysis: Generalising PCA for more flexible inference in linear-Gaussian models

Alfredo A. Kalaitzis 

Neil D. Lawrence 

Department of Computer Science 

Sheffield Institute for Translational Neuroscience 

University of Sheffield 

March 23, 2012

Outline 

Probabilistic Principal Component Analysis (PPCA) 

Residual Component Analysis (RCA) 

Optimising a Low-rank + Sparse-Inverse Covariance Model 

Experimental Results

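Probabilistic PCA (sensible PCA)

(Hotelling, 33) (Roweis, 98) (Tipping & Bishop, 99)

◮ PPCA seeks a low-dimensional representation of a data set in the presence of independent spherical Gaussian noise $\sigma^2 I$.

$$Y \in \mathbb{R}^{N \times p}, \qquad X \in \mathbb{R}^{N \times q}$$

$$p(x) = \mathcal{N}(0, I), \qquad p(y \mid x) = \mathcal{N}(Wx, \sigma^2 I)$$

◮ Marginalising the latent variable X gives

$$\int p(Y, X \mid W)\, dX = p(Y \mid W) = \prod_{i=1}^{n} \mathcal{N}(y_{i,:} \mid 0, WW^\top + \sigma^2 I)$$

$$W_{ML} = U_q L_q R^\top, \qquad \tfrac{1}{N} Y^\top Y = U \Lambda U^\top, \qquad L_q = (\Lambda_q - \sigma^2 I)^{1/2}$$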
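The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes centred data, takes the arbitrary rotation R to be the identity, and sets σ² to the mean of the discarded eigenvalues (the standard Tipping & Bishop estimate, which is not stated on the slide). The function name `ppca_ml` and the toy data are hypothetical.

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form ML fit of PPCA: returns W (p x q) and sigma^2."""
    N, p = Y.shape
    C = Y.T @ Y / N                          # (1/N) Y^T Y, Y assumed centred
    evals, U = np.linalg.eigh(C)             # eigenvalues in ascending order
    evals, U = evals[::-1], U[:, ::-1]       # reorder to descending
    sigma2 = evals[q:].mean()                # assumption: ML noise variance
    L_q = np.sqrt(evals[:q] - sigma2)        # L_q = (Lambda_q - sigma^2 I)^(1/2)
    W = U[:, :q] * L_q                       # U_q L_q, taking R = I
    return W, sigma2

# Toy data: a 3-dimensional latent signal embedded in 8 dimensions.
rng = np.random.default_rng(0)
N, p, q = 500, 8, 3
Y = rng.standard_normal((N, q)) @ rng.standard_normal((q, p))
Y += 0.1 * rng.standard_normal((N, p))
Y -= Y.mean(axis=0)                          # centre, as the model has zero mean

W_ml, sigma2_ml = ppca_ml(Y, q)
C_model = W_ml @ W_ml.T + sigma2_ml * np.eye(p)   # fitted marginal covariance
```

The fitted covariance `C_model` shares its q leading eigenvalues with the sample covariance, and the remaining directions are flattened to σ².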

Dual PPCA (principal coordinate analysis)

(Lawrence, 06)

◮ Instead of the latent variable X, we can place a prior distribution on the mapping W. Marginalising W instead of X, the PPCA dual solution can be obtained for likelihoods of the form

$$\int p(Y, W \mid X)\, dW = p(Y \mid X) = \prod_{j=1}^{p} \mathcal{N}(y_{:,j} \mid 0, XX^\top + \sigma^2 I)$$

◮ Note the conditional independence is across features. Now, dual PPCA solves for the latent coordinates,

$$X_{ML} = U'_q L_q R^\top$$
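A hedged sketch of the dual solution, assuming that U′ holds the eigenvectors of the N × N matrix (1/p) YY^⊤ (consistent with the dual generalised eigenvalue problem stated later in the deck), that σ² is known, and that R = I. The function name `dual_ppca_ml` and the toy data are hypothetical.

```python
import numpy as np

def dual_ppca_ml(Y, q, sigma2):
    """Dual PPCA: latent coordinates X (N x q) from the N x N matrix (1/p) Y Y^T."""
    N, p = Y.shape
    K = Y @ Y.T / p                                  # inner products between data points
    evals, U = np.linalg.eigh(K)                     # ascending order
    evals, U = evals[::-1][:q], U[:, ::-1][:, :q]    # keep the q largest
    L_q = np.sqrt(np.maximum(evals - sigma2, 0.0))   # guard against negative values
    return U * L_q                                   # U'_q L_q, taking R = I

rng = np.random.default_rng(1)
N, p, q = 40, 200, 2
X_true = rng.standard_normal((N, q))
Y = X_true @ rng.standard_normal((q, p)) + 0.1 * rng.standard_normal((N, p))

X_ml = dual_ppca_ml(Y, q, sigma2=0.01)

# The latent coordinates are identified only up to a rotation, so compare
# subspaces: project X_true onto the column space of X_ml.
Q, _ = np.linalg.qr(X_ml)
residual = X_true - Q @ (Q.T @ X_true)
```

Because the eigenproblem is N × N rather than p × p, this form is the cheaper one when there are many more features than data points.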

From dual-PPCA to the GPLVM

(Lawrence, 06) (Titsias & Lawrence, 10) (Damianou et al., 11)

◮ Note the covariance is an inner-product term (of latent variables) plus spherical Gaussian noise.

$$p(Y \mid X) = \prod_{j=1}^{p} \mathcal{N}(y_{:,j} \mid 0, XX^\top + \sigma^2 I)$$

◮ This likelihood is a product of independent Gaussian processes with linear kernels (or covariance functions).

◮ A generalisation of the above to a non-linear kernel (but still an inner product in some Hilbert space) is known as the Gaussian Process Latent Variable Model (GPLVM).
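To make the kernel view concrete, the sketch below evaluates the log of the likelihood above for an arbitrary kernel matrix K in place of XX^⊤. With the linear kernel it is the dual PPCA likelihood; swapping in a non-linear kernel gives a GPLVM-style objective. The exponentiated-quadratic kernel and its `lengthscale` parameter are illustrative assumptions, not something the slides specify.

```python
import numpy as np

def linear_kernel(X):
    """K = X X^T, the inner-product covariance of dual PPCA."""
    return X @ X.T

def rbf_kernel(X, lengthscale=1.0):
    """Exponentiated-quadratic kernel (an assumed example of a non-linear kernel)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def gp_log_likelihood(Y, K, sigma2):
    """log prod_j N(y_{:,j} | 0, K + sigma2 I), one GP per feature column."""
    N, p = Y.shape
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))
    alpha = np.linalg.solve(L, Y)                 # L^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log |K + sigma2 I|
    return -0.5 * (p * N * np.log(2 * np.pi) + p * logdet + np.sum(alpha**2))

rng = np.random.default_rng(2)
N, p, q = 30, 5, 2
X = rng.standard_normal((N, q))
Y = rng.standard_normal((N, p))
sigma2 = 0.5

ll_linear = gp_log_likelihood(Y, linear_kernel(X), sigma2)   # dual PPCA
ll_rbf = gp_log_likelihood(Y, rbf_kernel(X), sigma2)         # GPLVM-style
```

In a full GPLVM the latent X (and kernel parameters) would be optimised against this quantity; only its evaluation is shown here.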

Residual Component Analysis

◮ Discussed: low-rank + spherical noise term.

◮ Next:

$$\underbrace{XX^\top}_{\text{low-rank}} + \underbrace{\Sigma}_{\text{arbitrary positive definite}}$$

◮ Instances of Σ:

  ◮ $\Sigma_{PCA} = \sigma^2 I$

  ◮ $\Sigma_{FA} = \mathrm{diag}(a)$

  ◮ $\Sigma_{CCA} = \begin{pmatrix} Y_1 Y_1^\top & 0 \\ 0 & Y_2 Y_2^\top \end{pmatrix} + \sigma^2 I$, for $Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$

  ◮ $\Sigma_{LR} = ZZ^\top + \sigma^2 I$

  ◮ $\Sigma_{GP} = K + \sigma^2 I$ s.t. $K_{ij} = k(z_i, z_j)$

  ◮ $\Sigma_{Glasso} = \Lambda^{-1}$ s.t. $\sum_{i,j} (\Lambda_{ij} \neq 0) \ll p^2$

◮ Given Σ, can we solve for X?

◮ If so, more importantly, which forms of Σ are useful models for important problems in machine learning?

(Kalaitzis & Lawrence, 11)

◮ RCA theorem (dual version). The maximum likelihood estimate of X, for positive-definite Σ, is

$$X_{ML} = \Sigma S (D - I)^{1/2}, \qquad (1)$$

where S is the solution to the generalised eigenvalue problem (GEP)

$$\tfrac{1}{p} YY^\top S = \Sigma S D.$$

◮ The theorem extends to the primal version of RCA,

$$W_{ML} = \Sigma S (D - I)^{1/2}, \qquad \tfrac{1}{n} Y^\top Y S = \Sigma S D.$$

◮ The objective function is easier to see after re-expressing the GEP as the equivalent regular symmetric eigenvalue problem.
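One way to carry out the re-expression mentioned in the last bullet is to whiten with a Cholesky factor of Σ. The sketch below does this for the primal problem using NumPy only. It assumes the normalisation S^⊤ΣS = I, keeps the q largest generalised eigenvalues, and clips D − I at zero; the function name `rca_primal` and the test matrices are hypothetical.

```python
import numpy as np

def rca_primal(Y, Sigma, q):
    """Solve (1/n) Y^T Y S = Sigma S D and return W = Sigma S (D - I)^(1/2)."""
    n, p = Y.shape
    A = Y.T @ Y / n
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    B = np.linalg.solve(L, A)                  # L^{-1} A
    M = np.linalg.solve(L, B.T)                # L^{-1} A L^{-T}, symmetric
    d, V = np.linalg.eigh((M + M.T) / 2)       # regular symmetric eigenproblem
    d, V = d[::-1][:q], V[:, ::-1][:, :q]      # q largest generalised eigenvalues
    S = np.linalg.solve(L.T, V)                # S = L^{-T} V, so S^T Sigma S = I
    W = Sigma @ S * np.sqrt(np.maximum(d - 1.0, 0.0))
    return W, S, d

rng = np.random.default_rng(3)
n, p, q = 300, 6, 2
Y = 3.0 * rng.standard_normal((n, q)) @ rng.standard_normal((q, p))
Y += rng.standard_normal((n, p))
A = Y.T @ Y / n

# An arbitrary positive-definite Sigma (hypothetical example).
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p + np.eye(p)
W, S, d = rca_primal(Y, Sigma, q)
gep_residual = A @ S - Sigma @ S * d           # A S - Sigma S D, should vanish

# Special case Sigma = sigma^2 I, which should reduce to PPCA.
sigma2 = 0.5
W_iso, _, _ = rca_primal(Y, sigma2 * np.eye(p), q)
lam, U = np.linalg.eigh(A)
lam, U = lam[::-1][:q], U[:, ::-1][:, :q]
W_ppca = U * np.sqrt(lam - sigma2)             # U_q (Lambda_q - sigma^2 I)^(1/2)
```

With Σ = σ²I the products `W_iso @ W_iso.T` and `W_ppca @ W_ppca.T` coincide, which is the sense in which RCA generalises PPCA.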

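Low-rank + Sparse-Inverse Covariance

$$p(\Lambda) \propto \exp(-\lambda \|\Lambda\|_1)$$

$$p(z \mid \Lambda) = \mathcal{N}(0, \Lambda^{-1})$$

$$p(x) = \mathcal{N}(0, I)$$

$$p(y \mid x, z) = \mathcal{N}(Wx + z, \sigma^2 I)$$

◮ Marginalising x and z from the joint log-density gives the objective

$$\log\{p(Y \mid \Lambda)\, p(\Lambda)\} = \sum_{i=1}^{n} \log \mathcal{N}(y_{i,:} \mid 0, WW^\top + \Sigma_{Glasso}) + \log p(\Lambda)$$

which is bounded below by

$$\log\{p(Y \mid \Lambda)\, p(\Lambda)\} \geq \int q(Z) \log \frac{p(Y, Z, \Lambda)}{q(Z)}\, dZ$$

◮ The lower bound is maximised by a hybrid EM/RCA algorithm.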
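As a rough illustration, the marginal objective can be evaluated directly for given W, Λ and σ². The sketch assumes the marginal covariance WW^⊤ + Λ⁻¹ + σ²I implied by the generative model on this slide, uses the entrywise L1 norm including the diagonal, and drops the normalising constant of the prior. The function name `penalised_objective` and the parameter name `lam` (for λ) are hypothetical.

```python
import numpy as np

def penalised_objective(Y, W, Lam, sigma2, lam):
    """sum_i log N(y_i | 0, W W^T + Lam^{-1} + sigma2 I) - lam * ||Lam||_1."""
    n, p = Y.shape
    C = W @ W.T + np.linalg.inv(Lam) + sigma2 * np.eye(p)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(Y * np.linalg.solve(C, Y.T).T)     # sum_i y_i^T C^{-1} y_i
    loglik = -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + quad)
    return loglik - lam * np.abs(Lam).sum()          # log-prior up to a constant

# One-dimensional case that can be checked by hand: C = 1 + 1/2 + 1/2 = 2.
Y = np.array([[1.0], [-1.0]])
obj = penalised_objective(Y, W=np.array([[1.0]]), Lam=np.array([[2.0]]),
                          sigma2=0.5, lam=0.1)
```

In the one-dimensional case the value works out to −log(4π) − 0.7, which gives a quick sanity check on the implementation.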

Optimising the LR + SI Covariance Model via EM/RCA

◮ Initialise $\sigma^2$, W and Λ.

◮ REPEAT

  ◮ E-step: Update posterior density p(Z|Y) given current Λ.

  ◮ M-step: Update Λ by GLASSO optimisation on the expected complete-data log-density (expectation wrt current posterior p(Z|Y)).

  ◮ RCA-step: Update W via RCA for $\Sigma_{Glasso} = \Lambda^{-1} + \sigma^2 I$:

$$W = \Sigma_{Glasso} S (D - I)^{1/2}, \qquad \tfrac{1}{n} Y^\top Y S = \Sigma_{Glasso} S D$$

◮ UNTIL the lower bound converges.
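The E-step can be sketched with standard Gaussian conditioning. This is derived from the generative model rather than taken from the slides: with x marginalised, y | z ~ N(z, WW^⊤ + σ²I) and z ~ N(0, Λ⁻¹), so the posterior over each z_i is Gaussian with a shared covariance. The function name `e_step` and the returned statistic `S_z` are hypothetical. Under this reading, the M-step would hand `S_z` to a graphical lasso solver in place of a sample covariance; that solver is not shown.

```python
import numpy as np

def e_step(Y, W, Lam, sigma2):
    """Posterior p(z_i | y_i) = N(m_i, V) after marginalising x."""
    n, p = Y.shape
    B = W @ W.T + sigma2 * np.eye(p)          # covariance of y given z
    B_inv = np.linalg.inv(B)
    V = np.linalg.inv(Lam + B_inv)            # shared posterior covariance
    M = Y @ B_inv @ V                         # rows are the posterior means m_i
    S_z = V + M.T @ M / n                     # (1/n) sum_i E[z_i z_i^T]
    return M, V, S_z

# Simple case: W = 0, Lam = I, sigma2 = 1 gives V = I/2 and m_i = y_i / 2.
rng = np.random.default_rng(4)
n, p, q = 20, 4, 2
Y = rng.standard_normal((n, p))
M, V, S_z = e_step(Y, W=np.zeros((p, q)), Lam=np.eye(p), sigma2=1.0)
```

The simple case at the end has a closed form, which makes it a convenient check before plugging the step into the full loop.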

Simulations 

[Figure: precision-recall curves (precision against recall, both from 0 to 1) for EM-RCA, Glasso and Glasso (no confounders).]

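$$Y = XW^\top + Z + E$$

$$Y \in \mathbb{R}^{100 \times 50}, \qquad W \in \mathbb{R}^{50 \times 3}, \qquad X \in \mathbb{R}^{100 \times 3}$$

$$x_{i,:} \overset{iid}{\sim} \mathcal{N}(0, I_3), \qquad z_{i,:} \overset{iid}{\sim} \mathcal{N}(0, \Lambda^{-1})$$

Λ sparsity 1%, with non-zero entries sampled from $\mathcal{N}(1, 2)$.

Confounders and latent variables explain equal variance.

Signal-to-noise ratio is 10.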
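A possible way to generate data of this shape is sketched below. Several details are not given on the slide and are assumptions here: N(1, 2) is read as mean 1 and variance 2, the 1% sparsity is applied to off-diagonal pairs, the diagonal of Λ is shifted to make it positive definite, "equal variance" is enforced on empirical variances of XW^⊤ and Z, and the signal-to-noise ratio is taken as the variance of XW^⊤ + Z over the variance of E.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 50, 3

# Sparse symmetric precision matrix: about 1% of off-diagonal pairs non-zero.
rows, cols = np.triu_indices(p, k=1)
keep = rng.random(rows.size) < 0.01
Lam = np.zeros((p, p))
Lam[rows[keep], cols[keep]] = rng.normal(1.0, np.sqrt(2.0), keep.sum())
Lam = Lam + Lam.T
# Assumption: shift the diagonal so that Lam is positive definite.
Lam += (abs(np.linalg.eigvalsh(Lam).min()) + 1.0) * np.eye(p)

# z_i ~ N(0, Lam^{-1}) via a Cholesky factor of the covariance.
Z = rng.standard_normal((n, p)) @ np.linalg.cholesky(np.linalg.inv(Lam)).T

# Low-rank term, rescaled so that X W^T and Z have equal empirical variance.
X = rng.standard_normal((n, q))
W = rng.standard_normal((p, q))
W *= np.sqrt(Z.var() / (X @ W.T).var())

# Noise at one tenth of the variance of the noise-free signal (assumed SNR definition).
signal = X @ W.T + Z
E = np.sqrt(signal.var() / 10.0) * rng.standard_normal((n, p))
Y = signal + E
```

The non-zero pattern of `Lam` is the ground truth against which recovered edges would be scored.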

Reconstructing the Human Form 

◮ 3-D point data from the CMU motion-capture database.

◮ Uniform mixture of frames from walking, jumping, running and dancing motions.

◮ Similarity matrix computed from inter-point squared distances; can be interpreted as a stiffness matrix of a physical spring system.

◮ Conditional independence across frames and across features (sensor coordinates), meaning each row y contains only one of the x, y or z coordinates of the 31 sensors in one frame.

◮ ∼7000 frames × |{x, y, z}|, giving

$$Y \in \mathbb{R}^{21000 \times 31}$$


[Figure: EMRCA-inferred stickman and GLASSO-inferred stickman (both at recall: 0.76667), plotted in x, y, z coordinates.]


[Figure: EMRCA-inferred stickman and GLASSO-inferred stickman (both at recall: 1), plotted in x, y, z coordinates.]


Confounding regularities in the motions


[Figure: precision-recall curves (precision against recall, both from 0 to 1) for Glasso, inv.cov and EM/RCA.]

Reconstruction of a protein-signalling network from heterogeneous data

(Protein-signalling data from Sachs et al., 08)

(Kronecker-Glasso from Stegle et al., 11)

[Figure: precision-recall curves (precision against recall, both from 0 to 1) for EM-RCA, Kronecker-Glasso (reported) and Glasso (reported).]

266 measurements of 11 protein signals under 3 different perturbations (uniform mix).

$$Y \in \mathbb{R}^{266 \times 11}$$
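The precision-recall curves in these experiments score recovered network edges against a reference network. A minimal sketch of one such score follows, assuming an edge is a non-zero off-diagonal entry of a precision matrix and that each undirected edge is counted once. The function name `edge_precision_recall`, the tolerance `tol` and the 3 × 3 matrices are hypothetical.

```python
import numpy as np

def edge_precision_recall(Lam_est, Lam_true, tol=1e-8):
    """Precision and recall of the off-diagonal support of Lam_est."""
    iu = np.triu_indices(Lam_true.shape[0], k=1)     # each undirected edge once
    est = np.abs(Lam_est[iu]) > tol
    true = np.abs(Lam_true[iu]) > tol
    tp = np.sum(est & true)                          # correctly recovered edges
    precision = tp / est.sum() if est.sum() > 0 else 1.0
    recall = tp / true.sum() if true.sum() > 0 else 1.0
    return precision, recall

# Reference edges: (0,1) and (1,2). Estimated edges: (0,1) and (0,2).
Lam_true = np.array([[2.0, 0.5, 0.0],
                     [0.5, 2.0, 0.3],
                     [0.0, 0.3, 2.0]])
Lam_est = np.array([[2.0, 0.4, 0.2],
                    [0.4, 2.0, 0.0],
                    [0.2, 0.0, 2.0]])
precision, recall = edge_precision_recall(Lam_est, Lam_true)
```

Sweeping the sparsity penalty and recording one (recall, precision) pair per setting would trace out a curve of the kind plotted on these slides.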
