
Chapter 14. Proc. ICASSP 2006

Fig. 1. Constrained gradient t on the p-sphere, given by the projection of the unconstrained gradient ∇‖· − x‖^2_2 onto the tangent space that is orthogonal to ∇‖·‖^p_p, see equation (6).

2.2. Projection onto a p-sphere

Let S^{n−1}_p := {x ∈ R^n | ‖x‖_p = 1} denote the (n−1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS^{n−1}_p := {x ∈ R^n | ‖x‖_p = c}. The spheres are smooth C^1-submanifolds of R^n for p ≥ 2; for p < 2 they have singular points on the coordinate hyperplanes, but the Euclidean projection π_{S^{n−1}_p} is considered for arbitrary p > 0.

Now, in the case p = 2, the projection is simply given by

    π_{S^{n−1}_2}(x) = x / ‖x‖_2 .        (2)

In the case p = 1, the sphere consists of a union of hyperplanes orthogonal to (±1, . . . , ±1). Considering only the first quadrant (i.e. x_i > 0), this means that π_{S^{n−1}_1}(x) is given by the projection onto the hyperplane H := {x ∈ R^n | ⟨x, e⟩ = n^{−1/2}} and setting resulting negative coordinates to 0; here e := n^{−1/2}(1, . . . , 1). So with x_+ := x if x ≥ 0 and 0 otherwise, we get

    π_{S^{n−1}_1}(x) = ( x + (n^{−1/2} − ⟨x, e⟩) e )_+ .        (3)
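These two closed-form cases translate directly into code. The following NumPy sketch is not part of the original paper; the function names are ours, and the p = 1 routine implements equation (3) only under the first-quadrant assumption (all x_i > 0) made above.

    import numpy as np

    def project_sphere_l2(x):
        # Equation (2): Euclidean projection onto the unit 2-sphere.
        return x / np.linalg.norm(x, 2)

    def project_sphere_l1_positive(x):
        # Equation (3), first-quadrant case (all x_i > 0): project onto the
        # hyperplane <y, e> = n^(-1/2) with e = n^(-1/2)*(1, ..., 1), then set
        # resulting negative coordinates to 0.
        n = x.size
        e = np.full(n, n ** -0.5)
        y = x + (n ** -0.5 - x @ e) * e
        return np.maximum(y, 0.0)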

In the case of arbitrary p > 0, the projection is given by the unique solution of

    π_{S^{n−1}_p}(x) = argmin_{y ∈ S^{n−1}_p} ‖x − y‖^2_2 .        (4)

Unfortunately, no closed-form solution exists in the general case, so we have to numerically determine the solution. We have experimented with a) explicit Lagrange-multiplier calculation and minimization, b) constrained gradient descent and c) a constrained fixed-point algorithm (best). Ignoring the singular points at the coordinate hyperplanes, let us first assume that all x_i > 0. Then at a regular solution y of equation (4), the gradient of the function to be minimized is parallel to the gradient of the constraint, i.e. y − x = λ ∇‖·‖^p_p |_y for some Lagrange multiplier λ ∈ R, which can be calculated from the additional constraint equation ‖y‖^p_p = 1. Using the notation y^{⊙p} := (y_1^p, . . . , y_n^p)^⊤ for the componentwise exponentiation, we therefore get

    y − x = λ p y^{⊙(p−1)}   and   Σ_i y_i^p = 1 .        (5)

Algorithm 1: Projection onto S^{n−1}_p by constrained gradient descent. Commonly, the iteration is stopped after the update stepsize lies below some given threshold.

Input: vector x ∈ R^n, learning rate η(i)
Output: Euclidean projection y = π_{S^{n−1}_p}(x)
1   Initialize y ∈ S^{n−1}_p randomly.
    for i ← 1, 2, . . . do
2       df ← y − x,   dg ← p sgn(y)|y|^{⊙(p−1)}
3       t ← df − (df^⊤ dg) dg / (dg^⊤ dg)
4       y ← y − η(i) t
5       y ← y / ‖y‖_p
    end

For p ∉ {1, 2}, these equations cannot be solved in closed form, hence we propose an alternative approach to solving the constrained minimization (4). The goal is to minimize f(y) := ‖y − x‖^2_2 under the constraint g(y) := ‖y‖^p_p = 1. This can for example be achieved by gradient-descent methods, taking into account that the gradient has to be calculated on the submanifold given by the S^{n−1}_p-constraint, see figure 1 for an illustration. The projection of the gradient ∇f onto the tangent space of S^{n−1}_p at y can be easily calculated as

    t = ∇f − ⟨∇f, ∇g⟩ ∇g / ‖∇g‖^2_2 .        (6)

Here, the explicit gradients are given by ∇f(y) = y − x and ∇g(y) = p sgn(y)|y|^{⊙(p−1)}, where sgn(y) denotes the vector of the componentwise signs of y, and |y| := sgn(y)y the componentwise absolute value. The projection algorithm is summarized in algorithm 1. Iteratively, after calculating the constrained gradient (lines 2 and 3), it performs a gradient-descent update step (line 4) followed by a projection onto S^{n−1}_p (line 5) to guarantee that the algorithm stays on the submanifold.
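As an illustration of algorithm 1, the constrained gradient descent can be sketched in NumPy as follows. This is not the paper's code: the constant learning rate, random initialization, and stopping rule are illustrative choices, and the function name is ours.

    import numpy as np

    def project_sphere_p_gd(x, p, lr=0.1, max_iter=1000, tol=1e-8):
        # Algorithm 1: projection onto S^{n-1}_p by constrained gradient descent.
        rng = np.random.default_rng(0)
        y = rng.standard_normal(x.size)
        y /= np.sum(np.abs(y) ** p) ** (1.0 / p)        # line 1: initialize y on S^{n-1}_p
        for _ in range(max_iter):
            df = y - x                                   # line 2: gradient of f
            dg = p * np.sign(y) * np.abs(y) ** (p - 1)   # line 2: gradient of g
            t = df - (df @ dg) * dg / (dg @ dg)          # line 3: tangent-space projection, eq. (6)
            y = y - lr * t                               # line 4: gradient-descent step
            y /= np.sum(np.abs(y) ** p) ** (1.0 / p)     # line 5: renormalize onto S^{n-1}_p
            if lr * np.linalg.norm(t) < tol:             # stop once the update stepsize is small
                break
        return y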

The method performs well; however, like most gradient-descent-based algorithms, without further optimization it takes quite a few iterations to achieve acceptable convergence, and the choice of an optimal learning rate η(i) is non-trivial. We therefore propose a second projection method employing a fixed-point optimization strategy. Its idea is based on the fact that at local minima y of f(y) on S^{n−1}_p, the gradient ∇f(y) is orthogonal to S^{n−1}_p, so ∇f(y) ∝ ∇g(y). Ignoring signs for illustrative purposes, this means that y − x ∝ p y^{⊙(p−1)}, so y can be calculated from the fixed-point iteration y ← λ p y^{⊙(p−1)} + x with additional normalization. Indeed, this can be equivalently derived from the previous Lagrange equations (5), which also yield equations for the proportionality factor λ: we can simply determine it from one component of equation (5), or, to increase numerical robustness, as the mean over all components. Taking into account the signs of the gradient (which we ignored in equation (5)), this yields the estimate

    λ̂ := (1/n) Σ_{i=1}^{n} (y_i − x_i) / (p sgn(y_i)|y_i|^{p−1}) .

Altogether, we get the fixed-point algorithm 2, which in experiments turns out to have a considerably higher convergence rate than algorithm 1.
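The box for algorithm 2 is not reproduced in this excerpt, but the fixed-point iteration just described can be sketched as follows; again this is an illustrative implementation rather than the paper's code, and components with y_i = 0 would need the special handling of the singular points mentioned above.

    import numpy as np

    def project_sphere_p_fixedpoint(x, p, max_iter=1000, tol=1e-10):
        # Fixed-point projection onto S^{n-1}_p: estimate lambda from the Lagrange
        # conditions (5), update y <- lambda * p * sgn(y)|y|^(p-1) + x, renormalize.
        y = x / np.sum(np.abs(x) ** p) ** (1.0 / p)      # start from the rescaled input
        for _ in range(max_iter):
            dg = p * np.sign(y) * np.abs(y) ** (p - 1)   # gradient of the constraint g
            lam = np.mean((y - x) / dg)                  # lambda-hat, averaged over all components
            y_new = lam * dg + x                         # fixed-point update
            y_new /= np.sum(np.abs(y_new) ** p) ** (1.0 / p)   # renormalize onto S^{n-1}_p
            if np.linalg.norm(y_new - y) < tol:          # stop once the iterate no longer changes
                break
            y = y_new
        return y

A quick sanity check for either routine is that np.sum(np.abs(y) ** p) of the returned y is close to 1, and that for p = 2 the result coincides with x/‖x‖_2.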

In table 1, we compare algorithms 1 and 2 with respect to the number of iterations they need to achieve convergence below some given threshold. As expected, the fixed-point algorithm outperforms gradient descent in all cases except for higher dimensions and p > 2 (the non-sparse case). In the following we will therefore use the fixed-point algorithm for projection onto S^{n−1}_1.

2.3. Projection onto convex sets

If M is a convex set, then the Euclidean projection π_M(x) for any x ∈ R^n is already unique, so X(M) = ∅ and the operator π_M is called
