Mathematics in Independent Component Analysis

Chapter 12. LNCS 3195:718-725, 2004

Fabian J. Theis and Shun-ichi Amari

for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0. By theorem 3, g (and also c) is uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation 3, we get c = y′(0)/z′(0), so c can be calculated directly from the known curves y and z. In the following section, we propose to solve this problem numerically, given samples y(t_1), z(t_1), ..., y(t_T), z(t_T) of the curves. Note that it is assumed here that the samples of the curves y and z are given at the same time instants t_i ∈ (−1, 1). In practice this is usually not the case, so the values of z at the sample points of y (and vice versa) will first have to be estimated, for example by spline interpolation.
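As an illustration of this preprocessing step, the following sketch resamples z at the sample points of y with cubic splines and estimates c = y′(0)/z′(0) from the spline derivatives at 0. It assumes NumPy/SciPy; the helper name and interface are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def align_and_estimate_c(t_y, y, t_z, z):
    """Hypothetical helper (not from the paper): interpolate z onto the
    sample points of y and estimate c = y'(0) / z'(0).

    t_y, y -- strictly increasing sample times in (-1, 1) and values of y
    t_z, z -- strictly increasing sample times in (-1, 1) and values of z
    """
    y_spline = CubicSpline(t_y, y)
    z_spline = CubicSpline(t_z, z)

    # values of z at the sample points of y, as needed for the fit
    # in the next section (and vice versa if required)
    z_on_ty = z_spline(t_y)

    # c = y'(0)/z'(0), taken from the first derivatives of the splines at 0
    c = y_spline(0.0, 1) / z_spline(0.0, 1)
    return z_on_ty, c
```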

3.3 MLP-based postnonlinearity approximation

We want to find an approximation g̃ (in some parametrization) of g with g̃(y(t_i)) = c g̃(z(t_i)) for i = 1, ..., T, so in the most general sense we want to find

\[
\tilde g = \operatorname*{argmin}_g E(g) := \operatorname*{argmin}_g \frac{1}{2T} \sum_{i=1}^{T} \bigl( g(y(t_i)) - c\, g(z(t_i)) \bigr)^2 . \tag{4}
\]

In order to minimize this energy function E(g), a single-input single-output multilayer perceptron (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximation g̃ can be written as

\[
\tilde g(t) = w^{(2)\top} \bar\sigma\bigl( w^{(1)} t + b^{(1)} \bigr) + b^{(2)}
\]

with weight vectors w^{(1)}, w^{(2)} ∈ R^d and biases b^{(1)} ∈ R^d, b^{(2)} ∈ R. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e^{−t})^{−1}, and we set σ̄ := σ × ⋯ × σ (d times), i.e. σ applied componentwise. The MLP weights are restricted such that g̃(0) = 0 and g̃′(0) = 1. This implies $b^{(2)} = -w^{(2)\top} \bar\sigma(b^{(1)})$ and $\sum_{i=1}^{d} w^{(1)}_i w^{(2)}_i \sigma'(b^{(1)}_i) = 1$.
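As a concrete reading of this parametrization, here is a minimal NumPy sketch of the constrained MLP; the helper names are illustrative, and the first restriction is enforced by computing b^{(2)} from the other weights.

```python
import numpy as np

def sigma(t):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-t))

def sigma_prime(t):
    """Derivative of the logistic sigmoid: sigma'(t) = sigma(t)(1 - sigma(t))."""
    s = sigma(t)
    return s * (1.0 - s)

def g_tilde(t, w1, w2, b1):
    """Evaluate g~(t) = w2^T sigma(w1 t + b1) + b2 for scalar or array t.

    w1, w2, b1 are length-d vectors; b2 = -w2^T sigma(b1) enforces g~(0) = 0.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]   # shape (T, 1)
    hidden = sigma(t * w1[None, :] + b1[None, :])             # shape (T, d)
    b2 = -w2 @ sigma(b1)                                      # first restriction
    return hidden @ w2 + b2

def slope_at_zero(w1, w2, b1):
    """g~'(0) = sum_i w1_i w2_i sigma'(b1_i); the second restriction asks for 1."""
    return np.sum(w1 * w2 * sigma_prime(b1))
```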

The second normalization in particular is very important for the learning step, since otherwise the weights could all converge to the (valid) zero solution. The outer bias is therefore not trained by the network; we could fix a second weight in order to guarantee the second condition, but this would result in an unstable quotient calculation. Instead, it is preferable to perform network training on a submanifold of the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation 4:

\[
\bar E(\tilde g) := \frac{1}{2T} \sum_{j=1}^{T} \bigl( \tilde g(y(t_j)) - c\, \tilde g(z(t_j)) \bigr)^2 + \lambda \left( \sum_{i=1}^{d} w^{(1)}_i w^{(2)}_i \, \sigma'\bigl(b^{(1)}_i\bigr) - 1 \right)^{2} \tag{5}
\]

with suitably chosen λ > 0.

Learning of the weights is performed via backpropagation on this energy function. The gradient of Ē(g̃) with respect to the weight matrix can be easily calculated.
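To make the training step concrete, here is a rough sketch of minimizing Ē from equation (5) by gradient descent. It reuses g_tilde and sigma_prime from the sketch above, and for brevity it uses a finite-difference gradient instead of the explicit backpropagation formulas; all hyperparameters and names are illustrative only.

```python
import numpy as np

def energy_bar(params, y_vals, z_vals, c, lam, d):
    """E-bar(g~) from equation (5): data term plus the penalty
    lambda * (sum_i w1_i w2_i sigma'(b1_i) - 1)^2."""
    w1, w2, b1 = params[:d], params[d:2 * d], params[2 * d:]
    residual = g_tilde(y_vals, w1, w2, b1) - c * g_tilde(z_vals, w1, w2, b1)
    data_term = np.sum(residual ** 2) / (2.0 * len(y_vals))
    penalty = lam * (np.sum(w1 * w2 * sigma_prime(b1)) - 1.0) ** 2
    return data_term + penalty

def fit_g_tilde(y_vals, z_vals, c, d=10, lam=1.0, lr=0.05, steps=2000, eps=1e-6):
    """Plain gradient descent with a forward-difference gradient (sketch only).

    y_vals, z_vals -- samples y(t_j), z(t_j) on a common time grid
    Returns the fitted weight vectors w1, w2 and bias vector b1.
    """
    rng = np.random.default_rng(0)
    params = rng.normal(scale=0.5, size=3 * d)
    for _ in range(steps):
        base = energy_bar(params, y_vals, z_vals, c, lam, d)
        grad = np.empty_like(params)
        for k in range(params.size):
            shifted = params.copy()
            shifted[k] += eps
            grad[k] = (energy_bar(shifted, y_vals, z_vals, c, lam, d) - base) / eps
        params -= lr * grad
    return params[:d], params[d:2 * d], params[2 * d:]   # w1, w2, b1
```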
