Mathematics in Independent Component Analysis
Chapter 12. LNCS 3195:718-725, 2004
Fabian J. Theis and Shun-ichi Amari
for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → ℝ with y(0) = z(0) = 0. By theorem 3, g (and also c) are uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation 3, we get c = y′(0)/z′(0), so c can be directly calculated from the known curves y and z.
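In practice the curves are only available as noisy samples, so the derivative quotient c = y′(0)/z′(0) has to be estimated numerically. A minimal sketch (the helper name and the least-squares slope estimate are illustrative choices, not from the paper):

```python
import numpy as np

def estimate_c(t, y, z):
    """Estimate c = y'(0)/z'(0) from samples of both curves.

    Fits least-squares slopes through the origin on a small window
    around t = 0 (where y(0) = z(0) = 0 and both curves are close to
    linear); this is more robust to noise than a single finite
    difference. Illustrative helper, not from the paper.
    """
    mask = np.abs(t) < 0.2
    ts, ys, zs = t[mask], y[mask], z[mask]
    # slope through the origin: argmin_a sum_i (x_i - a t_i)^2 = <t,x>/<t,t>
    y_slope = ts @ ys / (ts @ ts)
    z_slope = ts @ zs / (ts @ ts)
    return y_slope / z_slope
```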
In the following section, we propose to solve this problem numerically, given samples y(t₁), z(t₁), …, y(t_T), z(t_T) of the curves. Note that here it is assumed that the samples of the curves y and z are given at the same time instants t_i ∈ (−1, 1). In practice this is usually not the case, so values of z at the sample points of y and vice versa will first have to be estimated, for example by using spline interpolation.
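This resampling step can be sketched as follows; the paper suggests spline interpolation, but plain linear interpolation (`np.interp`) is used here to keep the sketch dependency-free (the helper name is illustrative):

```python
import numpy as np

def resample_to(t_target, t_src, x_src):
    """Interpolate curve samples x_src (taken at times t_src) onto
    the time instants t_target.

    Linear interpolation stands in for the spline interpolation
    suggested in the text; np.interp requires sorted abscissae,
    so the source samples are sorted first.
    """
    order = np.argsort(t_src)
    return np.interp(t_target, t_src[order], x_src[order])
```

With this, z can be evaluated at the sample points of y via `resample_to(t_y, t_z, z_samples)`, and vice versa.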
3.3 MLP-based postnonlinearity approximation
We want to find an approximation g̃ (in some parametrization) of g with g̃(y(t_i)) = c·g̃(z(t_i)) for i = 1, …, T, so in the most general sense we want to find

$$\tilde g = \operatorname*{argmin}_g E(g) := \operatorname*{argmin}_g \frac{1}{2T}\sum_{i=1}^{T} \bigl(g(y(t_i)) - c\,g(z(t_i))\bigr)^2. \qquad (4)$$
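For a given candidate g, the empirical energy of equation (4) is straightforward to evaluate; a minimal sketch (function name chosen for illustration):

```python
import numpy as np

def energy(g, c, y_samples, z_samples):
    """Empirical energy E(g) from equation (4):
    (1/2T) * sum_i (g(y(t_i)) - c * g(z(t_i)))^2,
    for a vectorized candidate nonlinearity g.
    """
    T = len(y_samples)
    r = g(y_samples) - c * g(z_samples)
    return (r @ r) / (2 * T)
```

If g already satisfies g(y(t_i)) = c·g(z(t_i)) exactly, the energy is zero, which is the target of the minimization.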
In order to minimize this energy function E(g), a single-input single-output multilayer perceptron (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximation g̃ can be written as
$$\tilde g(t) = w^{(2)\top}\,\bar\sigma\bigl(w^{(1)} t + b^{(1)}\bigr) + b^{(2)}$$
with weight vectors w⁽¹⁾, w⁽²⁾ ∈ ℝᵈ and biases b⁽¹⁾ ∈ ℝᵈ, b⁽²⁾ ∈ ℝ. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e⁻ᵗ)⁻¹, and we set σ̄ := σ × … × σ, d times. The MLP weights are restricted in the sense that g̃(0) = 0 and g̃′(0) = 1. This implies

$$b^{(2)} = -w^{(2)\top}\bar\sigma\bigl(b^{(1)}\bigr) \quad\text{and}\quad \sum_{i=1}^{d} w_i^{(1)} w_i^{(2)}\,\sigma'\bigl(b_i^{(1)}\bigr) = 1.$$
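A minimal forward-pass sketch of this constrained parametrization (helper names are illustrative; the rescaling of w⁽²⁾ to satisfy g̃′(0) = 1 is one possible way to realize the restriction, not necessarily the paper's):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_g(t, w1, w2, b1):
    """Evaluate the constrained one-hidden-layer MLP g~.

    The outer bias b2 is not a free parameter: it is set to
    -w2 . sigmoid(b1) so that g~(0) = 0, and w2 is rescaled so that
    g~'(0) = sum_i w1_i w2_i sigmoid'(b1_i) = 1.
    Illustrative realization of the restrictions in the text.
    """
    s = sigmoid(b1)
    deriv0 = np.sum(w1 * w2 * s * (1.0 - s))  # sigmoid'(x) = s(1-s)
    w2 = w2 / deriv0                          # enforce g~'(0) = 1
    b2 = -w2 @ s                              # enforce g~(0) = 0
    t = np.atleast_1d(t)
    return sigmoid(np.outer(t, w1) + b1) @ w2 + b2
```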
The second normalization in particular is important for the learning step; otherwise the weights could all converge to the (valid) zero solution. So the outer bias is not trained by the network. We could fix a second weight in order to guarantee the second condition, but this would result in an unstable quotient calculation. Instead it is preferable to perform network training on a submanifold in the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation 4:
$$\bar E(\tilde g) := \frac{1}{2T}\sum_{j=1}^{T} \bigl(\tilde g(y(t_j)) - c\,\tilde g(z(t_j))\bigr)^2 + \lambda\Bigl(\sum_{i=1}^{d} w_i^{(1)} w_i^{(2)}\,\sigma'\bigl(b_i^{(1)}\bigr) - 1\Bigr)^2 \qquad (5)$$

with suitably chosen λ > 0.
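The penalized energy and its minimization can be sketched as follows. All names, the choice λ = 10, d = 5, and the central-difference numerical gradient (a brevity stand-in for the backpropagation used in the paper) are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def energy_bar(params, c, y, z, lam=10.0, d=5):
    """Penalized energy from equation (5) for a 1-d-1 MLP.

    params = [w1, w2, b1] flattened; b2 = -w2 . sigmoid(b1) enforces
    g~(0) = 0 exactly, while the lambda term softly enforces
    sum_i w1_i w2_i sigmoid'(b1_i) = 1.  Sketch, not the paper's code.
    """
    w1, w2, b1 = params[:d], params[d:2 * d], params[2 * d:]
    b2 = -w2 @ sigmoid(b1)
    g = lambda t: sigmoid(np.outer(t, w1) + b1) @ w2 + b2
    r = g(y) - c * g(z)
    s = sigmoid(b1)
    constraint = np.sum(w1 * w2 * s * (1.0 - s)) - 1.0
    return (r @ r) / (2 * len(y)) + lam * constraint**2

def train(params, c, y, z, lr=0.01, steps=200, eps=1e-5):
    """Minimize the penalized energy by plain gradient descent,
    using a central-difference numerical gradient as a stand-in
    for backpropagation."""
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(len(params)):
            e = np.zeros_like(params)
            e[i] = eps
            grad[i] = (energy_bar(params + e, c, y, z)
                       - energy_bar(params - e, c, y, z)) / (2 * eps)
        params = params - lr * grad
    return params
```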
Learning of the weights is performed via backpropagation on this energy function. The gradient of Ē(g̃) with respect to the weight matrix can be easily