
Statistical machine learning of biomedical data

Habilitation at the NWF II - Physik, Universität Regensburg

submitted by Fabian J. Theis, 2007


The habilitation thesis was submitted on 7 August 2007.

Mentoring committee (Fachmentorat):

Prof. Dr. Elmar W. Lang (Universität Regensburg)
Prof. Dr. Klaus Richter (Universität Regensburg)
Prof. Dr. Klaus-Robert Müller (FhG FIRST & TU Berlin)


to Jakob


Preface

In this manuscript, I present twenty papers that I have coauthored during the past four years. They range from theoretical contributions such as model analyses and convergence proofs to algorithmic implementations and biomedical applications. The common denominator is the framework of (mostly unsupervised) data analysis based on statistical machine learning methods. For convenience, I summarize and review these papers in chapter 1.

The papers themselves are self-contained. The summary chapter consists of an introduction part and then sections on independent and dependent component analysis, sparseness, preprocessing and applications. These sections are again mostly self-contained except for the dependent component analysis part, which relies on the preceding section, as well as the applications part, which depends on the methods described before.

I want to thank everyone who helped me during the past four years. First of all my former boss Elmar Lang; I got both support and freedom, and always help when needed, even nowadays! Then of course all retired and active members of my group: Peter Gruber, Ingo Keck, Harold Gutch, Christian Guggenberger and more recently Florian Blöchl, Elias Rentzeperis, Martin Rhoden and Dominik Lutter. But I also have to mention and thank my colleagues at the MPI for Dynamics and Self-Organization at Göttingen, in particular the director Theo Geisel and my collaborators Dirk Brockmann and Fred Wolf. I also want to thank my mentors, Elmar Lang, Klaus Richter and Klaus-Robert Müller, for their manifold help and advice during my habilitation.

I thank the Bernstein Center for Computational Neuroscience Göttingen as well as the graduate college ‘Nonlinearity and Nonequilibrium in condensed matter’ at Regensburg for their generous support. Additional funding by the BMBF within the project ModKog and within the projects BaCaTeC and PPP with Granada and London is gratefully acknowledged.

My deepest thanks go to my family and friends, above all my wife Michaela and my son Jakob. You're the real deal.

Göttingen, June 2007
Fabian J. Theis





Contents

Preface

I Summary

1 Statistical machine learning of biomedical data
  1.1 Introduction
  1.2 Uniqueness issues in independent component analysis
    1.2.1 Linear case
    1.2.2 Nonlinear ICA
  1.3 Dependent component analysis
    1.3.1 Algebraic BSS and multidimensional generalizations
    1.3.2 Spatiotemporal BSS
    1.3.3 Independent subspace analysis
  1.4 Sparseness
    1.4.1 Sparse component analysis
    1.4.2 Sparse non-negative matrix factorization
  1.5 Machine learning for data preprocessing
    1.5.1 Denoising
    1.5.2 Dimension reduction
    1.5.3 Clustering
  1.6 Applications to biomedical data analysis
    1.6.1 Functional MRI
    1.6.2 Image segmentation and cell counting
    1.6.3 Surface electromyograms
  1.7 Outlook

II Papers

2 Neural Computation 16:1827-1850, 2004
3 Signal Processing 84(5):951-956, 2004
4 Neurocomputing 64:223-234, 2005
5 IEICE TF E87-A(9):2355-2363, 2004
6 LNCS 3195:726-733, 2004
7 Proc. ISCAS 2005, pages 5878-5881
8 Proc. NIPS 2006
9 Neurocomputing (in press), 2007
10 IEEE TNN 16(4):992-996, 2005
11 EURASIP JASP, 2007
12 LNCS 3195:718-725, 2004
13 Proc. EUSIPCO 2005
14 Proc. ICASSP 2006
15 Neurocomputing 69:1485-1501, 2006
16 Proc. ICA 2006, pages 917-925
17 IEEE SPL 13(2):96-99, 2006
18 Proc. EUSIPCO 2006
19 LNCS 3195:977-984, 2004
20 Signal Processing 86(3):603-623, 2006
21 Proc. BIOMED 2005, pages 209-212

Bibliography


Part I

Summary


Chapter 1

Statistical machine learning of biomedical data

Biostatistics deals with the analysis of high-dimensional data sets originating from biological or biomedical problems. An important challenge in this analysis is to identify underlying statistical patterns that facilitate the interpretation of the data set using techniques from machine learning. A possible approach is to learn a more meaningful representation of the data set, which maximizes certain statistical features. Such representations, which are often linear, have several potential applications, including the decomposition of objects into ‘natural’ components (Lee and Seung, 1999), redundancy and dimensionality reduction (Friedman and Tukey, 1975), biomedical data analysis, microarray data mining or enhancement, feature extraction of images in nuclear medicine, etc. (Alpaydin, 2004, Bishop, 2007, Cichocki and Amari, 2002, Hyvärinen et al., 2001c, MacKay, 2003, Mitchell, 1997).

In the following, we will review some statistical representation models and discuss identifiability conditions. The resulting separation algorithms will be applied to various biomedical problems in the last part of this summary.

1.1 Introduction

Assume the data is given by a multivariate time series x(t) ∈ R^m, where t indexes time, space or some other quantity. Data analysis can be defined as finding a meaningful representation of x(t), i.e. as x(t) = f(s(t)) with unknown features s(t) ∈ R^n and mixing mapping f. Often, f is assumed to be linear, so we are dealing with the situation

x(t) = As(t)    (1.1)

with a mixing matrix A ∈ R^{m×n}. Often, white noise n(t) is added to the model, yielding x(t) = As(t) + n(t); this can be included in s(t) by increasing its dimension. In the situation (1.1), the analysis problem is reformulated as the search for a (possibly overcomplete) basis in which the feature signal s(t) allows more insight into the data than x(t) itself. This of course has to be specified within a statistical framework.

There are two general approaches to finding data representations or models as in (1.1):




[Figure 1.1: panels (a) sources s(t), (b) mixtures x(t), (c) recoveries, (d) WA.]

Figure 1.1: Two-dimensional example of ICA-based source separation. The observed mixture signal (b) is composed of two unknown source signals (a) using a linear mapping. Application of ICA (here: HessianICA) yields the recovered sources (c), which coincide with the original sources up to permutation and scaling: ŝ1(t) ≈ 1.5 s2(t) and ŝ2(t) ≈ −1.5 s1(t). Indeed, the composition of mixing matrix A and separating matrix W equals a unit matrix (d) up to the unavoidable indeterminacies of scaling and permutation.

• supervised analysis: additional information is available, for example in the form of input-output pairs (x(t1), s(t1)), ..., (x(tT), s(tT)). These training samples can be used for interpolation and learning of the map f or basis A (regression). If the sources s are discrete, this leads to a classification problem. The resulting map f can then be used for prediction.

• unsupervised models: instead of samples, weak statistical assumptions are made on either s(t) or f/A. A common assumption for example is that the source components si(t) are mutually independent.

Here, we will mostly focus on the second situation. The unsupervised analysis is often denoted as blind source separation (BSS), since neither features or ‘sources’ s(t) nor mixing mapping f are assumed to be known. The field of BSS has been rather intensively studied by the community for more than a decade. Since the first introduction of a neural-network-based BSS solution by Hérault and Jutten (1986), various algorithms have been proposed to solve the blind source separation problem (Bell and Sejnowski, 1995, Cardoso and Souloumiac, 1993, Comon, 1994, Hyvärinen and Oja, 1997, Theis et al., 2003). Good textbook-level introductions to the topic are given by Hyvärinen et al. (2001c) and Cichocki and Amari (2002). Recent research centers on generalizations and applications. The first part of this manuscript deals with such extended models and algorithms; some applications will be presented later.

A common model for BSS is realized by the independent component analysis (ICA) model (Comon, 1994), where the underlying signals s(t) are assumed to be statistically independent. Let us first concentrate on the linear case, i.e. f = A linear. Then we search for a decomposition x(t) = As(t) of the observed data set x(t) = (x1(t), ..., xn(t))^⊤ into independent signals s(t) = (s1(t), ..., sn(t))^⊤. For example, consider figure 1.1. The goal is to decompose the two time series (b) into two source signals (a). Visually, this is a simple task: obviously the data is composed of two sinusoids with different frequencies; but how to do this algorithmically? And how to formulate a feasible model?
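To make the setup concrete, the following minimal sketch generates a toy example of this kind: two sinusoidal sources are mixed with an assumed mixing matrix as in (1.1) and then separated again. The separator used here is the third-party FastICA implementation from scikit-learn, serving only as a stand-in; it is not the HessianICA algorithm used for figure 1.1, and all concrete numbers are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA  # stand-in separator, not HessianICA

t = np.linspace(0, 1, 500)
s = np.vstack([np.sin(2 * np.pi * 5 * t),      # source 1: 5 Hz sinusoid
               np.sin(2 * np.pi * 13 * t)])    # source 2: 13 Hz sinusoid
A = np.array([[0.8, 0.3],                      # assumed (unknown) mixing matrix, cf. (1.1)
              [0.4, 0.7]])
x = A @ s                                      # observed mixtures x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T               # recovered sources (one per row)
W = ica.components_                            # estimated unmixing matrix
print(np.round(W @ A, 2))                      # should be close to a scaled permutation matrix
```

The product WA printed at the end is the quantity visualized in figure 1.1(d).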




[Figure 1.2: panels (a) cocktail party problem, (b) linear mixing model, (c) neural cocktail party.]

Figure 1.2: Cocktail party problem: (a) a linear superposition of the speakers is recorded at each microphone. This can be written as the mixing model x(t) = As(t) (1.1) with speaker voices s(t) and activity x(t) at the microphones (b). Possible applications lie in neuroscience: given multiple activity recordings of the human brain, the goal is to identify the underlying hidden sources that make up the total activity (c).

A typical application of BSS lies in the cocktail party problem: at a cocktail party, a set of microphones records the conversations of the guests. Each microphone records a linear superposition of the conversations, and at each microphone a slightly different superposition is recorded, depending on its position, see figure 1.2. In the following we will see that, given some rather weak assumptions on the conversations themselves, such as independence of the various talks, it is possible to recover the original sources and the mixing matrix (which encodes the positions of the speakers) using only the signals recorded at the microphones. Note that in real-world situations the nice linear mixing situation deteriorates due to noise, convolutions and nonlinearities.




1.2 Uniqueness issues in independent component analysis

Application of ICA to BSS tacitly assumes that the data follow the model (1.1), i.e. x(t) admits a decomposition into independent sources, and we want to find this decomposition. But neither the mixing function f nor the source signals s(t) are known, so we should expect to find many solutions to this problem. Indeed, the order of the sources cannot be recovered (the speakers at the cocktail party do not have numbers), so there is always an inherent permutation indeterminacy. Moreover, the strength of each source cannot be extracted from this model alone, because f and s(t) can interchange so-called scaling factors. In other words, not knowing the power of each speaker at the cocktail party, we can only extract his speech but not the volume; he could also be standing further away from the microphones, but shouting instead of speaking.

One of the key questions in ICA-based source separation is: are there any other indeterminacies? Without fully answering this question, ICA algorithms cannot be applied to BSS, simply because we would not have any clue how to relate the resulting sources to the original ones. But apparently the set of indeterminacies cannot be very large; after all, at a cocktail party we ourselves are able to distinguish between the various speakers.

1.2.1 Linear case

In 1994, Comon was able to answer this question (Comon, 1994) in the linear case where f = A by reducing it to the Darmois-Skitovitch theorem (Darmois, 1953, Skitovitch, 1953, 1954). Essentially, he showed that if the sources contain at most one Gaussian component, the indeterminacies of the above model are only scaling and permutation. This positive answer more or less made the field popular; from then on, the number of papers published in this field each year increased considerably. However, it may be argued that Comon's proof lacked two points. First, by relying on the rather difficult-to-prove theorem of the two statisticians, the central idea of why there are no further indeterminacies is not at all obvious; hence not many attempts have been made to extend it to more general situations. Furthermore, no algorithm can be extracted from the proof, because it is non-constructive.

In (Theis, 2004a), we took a somewhat different approach. Instead of using Comon's idea of minimal mutual information, we reformulated the condition of source independence in a different way: in simple terms, a two-dimensional source vector s is independent if its density p_s factorizes into two one-component densities p_s1 and p_s2. But this is the case only if ln p_s is a sum of one-dimensional functions, each depending on a different variable. Hence taking the mixed partial derivative with respect to s1 and then s2 always yields zero. In other words, the Hessian H_{ln p_s} of the logarithmic source density is diagonal; this is what we denoted by p_s being a ‘separated function’ in Theis (2004a), see chapter 2. Using only this property, we were able to prove Comon's uniqueness theorem (Theis, 2004a, theorem 2) without having to resort to the Darmois-Skitovitch theorem. Here Gl(n) denotes the group of invertible (n × n)-matrices.

Theorem 1.2.1 (Separability of linear BSS). Let A ∈ Gl(n; R) and s be an independent random vector. Assume that s has at most one Gaussian component and that the covariance of s exists. Then As is independent if and only if A is the product of a scaling and a permutation matrix.



Instead of a multivariate random process s(t), the theorem is formulated for a random vector s, which is equivalent to assuming an i.i.d. process. Moreover, the assumption of equal source (n) and mixture (m) dimension is made, although relaxation to the undercomplete case (1 < n < m) is straightforward, and to the overcomplete case (n > m > 1) is possible (Eriksson and Koivunen, 2003). The assumption of at most one Gaussian component is crucial, since independence of white, multivariate Gaussians is invariant under orthogonal transformations, so theorem 1.2.1 cannot hold in this case.
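The Gaussian exception can be checked numerically: white Gaussian sources remain white, and hence independent, under any orthogonal mixing, so no particular rotation can be singled out. A small illustrative check (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 100000))              # two independent white Gaussian sources
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal "mixing" matrix
x = Q @ s
print(np.round(np.cov(x), 2))                     # ≈ identity: Qs is again white, hence independent
```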

An algorithm for separation: Hessian ICA

The proof of theorem 1.2.1 is constructive, and the exception of the Gaussians comes into play naturally as zeros of a certain differential equation. The idea of why separation is possible now becomes quite clear. Furthermore, an algorithm can be extracted from the pattern used in the proof: after decorrelation, we can assume that the mixing matrix A is orthogonal. By using the transformation properties of the Hessian matrix, we can employ the linear relationship x = As to get

H_{ln p_x} = A^⊤ H_{ln p_s} A    (1.2)

for the Hessian of the mixtures. The key idea, as we have seen in the previous section, is that due to statistical independence, the source Hessian H_{ln p_s} is diagonal everywhere. Therefore equation (1.2) represents a diagonalization of the mixture Hessian, and the diagonalizer equals the mixing matrix A. Such a diagonalization is unique if the eigenspaces of the Hessian are one-dimensional at some point, and this is precisely the case if x(t) contains at most one Gaussian component (Theis, 2004a, lemma 5). Hence, the mixing matrix and the sources can be extracted algorithmically by simply diagonalizing the mixture Hessian evaluated at some point. The Hessian ICA algorithm consists of local Hessian diagonalization of the logarithmic density (or, equivalently, of the easier-to-estimate characteristic function). In order to improve robustness, multiple matrices are jointly diagonalized. Applying this algorithm to the mixtures of our toy example from figure 1.1 yields the very well recovered sources shown in figure 1.1(c).
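A minimal numerical sketch of this idea follows; it is not the published HessianICA implementation. It whitens the mixtures, estimates the mixture log-density with a Gaussian kernel density estimate, approximates its Hessian at a single point by finite differences (the published algorithm instead differentiates the kernel estimate analytically and jointly diagonalizes Hessians at several points), and uses the eigenvectors of that Hessian as the estimated rotation. The evaluation point and the step size are arbitrary choices of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_density_hessian(kde, x0, h=0.2):
    # Finite-difference Hessian of log p(x) at the point x0.
    d = len(x0)
    f = lambda v: np.log(kde(np.asarray(v).reshape(d, 1))[0])
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h * h)
    return 0.5 * (H + H.T)

def hessian_ica_sketch(x, x0=None):
    # Whiten the mixtures, then diagonalize the log-density Hessian at one point;
    # its eigenvectors estimate the remaining orthogonal mixing matrix.
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    z = (E / np.sqrt(d)) @ E.T @ x                        # whitened mixtures
    kde = gaussian_kde(z)
    x0 = np.full(z.shape[0], 0.5) if x0 is None else x0   # arbitrary evaluation point
    _, V = np.linalg.eigh(log_density_hessian(kde, x0))
    return V.T @ z                                        # estimated sources, up to scaling/permutation
```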

A similar algorithm had already been proposed by Lin (1998), but without considering the assumptions necessary for its successful application. In Theis (2004a, theorem 3), we gave precise conditions for when to apply this algorithm and showed that points satisfying these conditions can indeed be found if the sources contain at most one Gaussian component. Lin used a discrete approximation of the derivative operator to approximate the Hessian; we suggested using kernel-based density estimation, which can be differentiated directly. A similar algorithm based on Hessian diagonalization had been proposed by Yeredor (2000) using the character of a random vector. However, the character is complex-valued, and additional care has to be taken when applying a complex logarithm; basically, this is only well-defined locally at non-zeros. In algorithmic terms, the character can easily be approximated from samples. Yeredor suggested joint diagonalization of the Hessian of the logarithmic character evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we proposed to use a combined energy function based on the previously defined separator. This also takes global information into account, but does not have the drawback of being singular at zeros of the density or character, respectively.



Complex generalization

Comon (1994) showed separability of linear real BSS using the Darmois-Skitovitch theorem, see theorem 1.2.1. He noted that his proof for the real case can also be extended to the complex setting. However, a complex version of the Darmois-Skitovitch theorem is needed, which, to the knowledge of the author, had not yet been shown in the literature. In Theis (2004b), see chapter 3, we derived such a theorem as a corollary of a multivariate extension of the Darmois-Skitovitch theorem, first noted by Skitovitch (1954) and shown by Ghurye and Olkin (1962):

Theorem 1.2.2 (complex S-D theorem). Let s1 = Σ_{i=1}^n α_i x_i and s2 = Σ_{i=1}^n β_i x_i with x1, ..., xn independent complex random variables and α_j, β_j ∈ C for j = 1, ..., n. If s1 and s2 are independent, then all x_j with α_j β_j ≠ 0 are Gaussian.

We then used this theorem to prove separability of complex BSS; moreover, a generalization to dependent subspaces was discussed, see section 1.3.3. Note that a simple complex-valued uniqueness proof (Theis, 2004c), which does not need the Darmois-Skitovitch theorem, can be derived similarly to the case of real-valued random variables from above. Recently, additional relaxations of complex identifiability have been described by Eriksson and Koivunen (2006).

1.2.2 Nonlinear ICA

With the growth of the field of ICA, interest in nonlinear model extensions has been increasing. It is easy to see, however, that without any restrictions the class of nonlinear ICA solutions is too large to be of any practical use (Hyvärinen and Pajunen, 1998). Hence, special nonlinearities are usually considered. Here we discuss two specific nonlinear models.

Postnonlinear ICA

A good trade-off between model generality and identifiability is given by the so-called postnonlinear BSS model

x_i = f_i( Σ_{j=1}^n a_ij s_j ).    (1.3)

This explicit, nonlinear model implies that in addition to the linear mixing situation, each sensor xi contains an unknown nonlinearity fi that can further distort the observations, modeling a MIMO system with nonlinear receiver characteristics. It can be interpreted as a single-layered neural network (perceptron) with nonlinear, unknown activation functions, in contrast to the linear perceptron in the case of linear ICA. The model, first proposed by Taleb and Jutten (1999), has applications in telecommunication and biomedical data analysis. Algorithms for reconstructing postnonlinearly mixed sources include (Achard et al., 2003, Babaie-Zadeh et al., 2002, Taleb and Jutten, 1999, Theis and Lang, 2004a, Ziehe et al., 2003a).
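As a small illustration of the generative side of (1.3), the sketch below produces postnonlinearly mixed observations; the chosen sources, mixing matrix and sensor nonlinearities are arbitrary examples and are not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(2, 1000))              # independent, bounded sources
A = np.array([[0.9, 0.5],
              [0.3, 0.8]])                           # linear mixing matrix
f = (np.tanh, lambda u: u + 0.3 * u ** 3)            # sensor-wise nonlinearities f_1, f_2 (illustrative)
x = np.vstack([f[i](A[i] @ s) for i in range(2)])    # x_i = f_i(sum_j a_ij s_j), cf. (1.3)
```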

Identifiability of postnonlinear mixtures was first discussed in a limited context by Taleb and Jutten (1999). In Theis and Gruber (2005), see chapter 4, we continued this study. We thereby generalized ideas already presented by Babaie-Zadeh et al. (2002), where the focus was put on the development of an actual identification algorithm. Babaie-Zadeh was the first to use the method of analyzing bounded random vectors in the context of postnonlinear mixtures



[Figure 1.3 shows the componentwise nonlinearities f1, f2, the mixtures f(As) and the recoveries A^{-1} f(As).]

Figure 1.3: Example of a non-trivial postnonlinear transformation using an absolutely degenerate matrix A and sources s uniform in [0, 1]^2. Both the sources s and the recovered sources A^{-1} f(As) have support in [0, 1]^2, but A is not a permutation and scaling.

(Babaie-Zadeh, 2002). There, he already discussed identifiability issues, albeit explicitly only in the two-dimensional analytic case.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear ICA: A can only be reconstructed up to scaling and permutation. Here, of course, additional indeterminacies come into play because of translation: fi can only be recovered up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(As) = (f ◦ L)((L^{-1}A)s), so f and A can interchange scaling factors in each component. Another indeterminacy could occur if A is not mixing, i.e. at least one observation xi contains only one source; in this case fi can obviously not be recovered. For example, if A equals the unit matrix I, then f(s) is already independent again, because independence is invariant under component-wise nonlinear transformations; so f cannot be found using this method.

A not so obvious indeterminacy occurs if A is absolutely degenerate, which essentially means that the normalized columns differ only by the signs of their entries (Theis and Gruber, 2005, definition 6). Then only the matrix A, but not the nonlinearities, can be recovered by considering the edges of the support of the fully-bounded random vector, as illustrated in figure 1.3. Nonetheless this is no indeterminacy of the model itself, since A^{-1}f(As) is obviously not independent. So by looking at the boundary alone, we sometimes cannot detect independence if the whole system is highly symmetric.

If A is mixing and not absolutely degenerate, then for all fully-bounded sources s no more indeterminacies than in the affine linear case exist, except for the scaling interchange between f and A:

Theorem 1.2.3 (Separability of bounded postnonlinear BSS). Let A, W ∈ Gl(n), one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function with h_i' ≠ 0, and let s be a fully-bounded independent random vector. If W(h(As)) is also independent, then h is affine linear.




So let f ◦ A be the mixing model and W ◦ g the separating model. Putting the two together, we get the above mixing-separating model with h := g ◦ f. The theorem shows that if the mixing-separating model preserves independence then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to scaling). As usual, the model is assumed to be invertible, hence identifiability and uniqueness of the model follow from the separability. Note that if f is only assumed to be continuously differentiable, then additional indeterminacies come into play.

As in the linear case, this separability result can be transformed into a postnonlinear ICA algorithm: in Theis and Lang (2004b) we proposed a linearization identification framework for estimating postnonlinearities in such settings.

Finally, we want to note that we derived a slightly less general uniqueness theorem from the linear separability proof in Theis (2004a), see chapter 2:

Theorem 1.2.4 (Separability of bijective postnonlinear BSS). Let A, W ∈ Gl(n) be mixing, let h : R^n → R^n be a diagonal bijective analytic function with h_i' ≠ 0, and let S be an independent random vector with at most one Gaussian component and existing covariance. If W(h(AS)) is also independent, then h is affine linear.

Similar uniqueness results were further studied by Achard and Jutten (2005).

Quadratic ICA

There are a few different ways of extending linear BSS models. In the previous section, we focused on the postnonlinear mixing model (1.3), which adds an unknown nonlinearity at the end of a linear mixing situation. A different approach is to take a known nonlinearity and add another mixing matrix, as in the step from perceptrons to multilayer perceptrons. Another way of putting this is to take only the first few terms of the Taylor expansion of the mixing or unmixing mapping, and to try to learn such regularized systems. In Theis and Nakamura (2004), see chapter 5, we treated polynomial nonlinearities, specifically second-order monomials or quadratic forms. These represent a relatively simple class of nonlinearities, which can be investigated in detail. In contrast to the postnonlinear model formulation, we defined the nonlinearity for the unmixing situation and derived the corresponding mixing model under some assumptions.

We considered the quadratic unmixing model

y_i := x^⊤ G^(i) x    (1.4)

for symmetric matrices G^(i), i = 1, ..., n, and estimated sources y. We restricted ourselves to a special case of this model by assuming that each G^(i) has the same set of eigenvectors. Then G^(i) = E^⊤ Λ^(i) E with a shared orthogonal eigenvector matrix E and diagonal Λ^(i) with coefficients Λ^(i)_kk on the diagonal. Setting

Λ := ( Λ^(1)_11 ... Λ^(1)_nn ; ... ; Λ^(n)_11 ... Λ^(n)_nn ),

i.e. the matrix whose i-th row collects the diagonal entries of Λ^(i), therefore yields a two-layered unmixing model y = Λ ◦ (.)² ◦ E ◦ x, where (.)² denotes the componentwise square of each element. This can be interpreted as a two-layered feed-forward neural network, and may easily be inverted explicitly, see figure 1.4.
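The equivalence between the quadratic forms (1.4) with shared eigenvectors and this two-layer form can be verified directly; the sketch below uses arbitrary illustrative matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
E, _ = np.linalg.qr(rng.standard_normal((n, n)))     # shared orthogonal eigenvector matrix
Lam = rng.standard_normal((n, n))                    # row i holds the eigenvalues of G^(i)
G = [E.T @ np.diag(Lam[i]) @ E for i in range(n)]    # G^(i) = E^T Lambda^(i) E

x = rng.standard_normal(n)
y_quadratic = np.array([x @ G[i] @ x for i in range(n)])  # y_i = x^T G^(i) x, cf. (1.4)
y_layered = Lam @ (E @ x) ** 2                            # two-layer form y = Lambda∘(.)²∘E∘x
print(np.allclose(y_quadratic, y_layered))                # True
```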



Figure 1.4: Simplified quadratic unmixing model y = Λ ◦ (.)² ◦ E ◦ x. If Ex only takes values in one quadrant, then the model is invertible and its inverse model again follows a multilayer perceptron structure x = E^⊤ ◦ √ ◦ Λ^{-1} ◦ y.

Several studies have employed quadratic forms as a generative process of data. Abed-Meraim et al. (1996) suggested analyzing mixtures by second-order polynomials using a linearization of the mixtures, which however in general destroys the assumption of independence. Leshem (1999) proposed a whitening scheme based on quadratic forms in order to enhance linear separation of time-signals in algorithms such as SOBI (Belouchrani et al., 1997). Similar quadratic mixing models are also considered in Georgiev (2001) and Hosseini and Deville (2003). These are studies in which the mixing model is assumed to be quadratic, in contrast to the quadratic unmixing model (1.4). For demixing into independent components by quadratic forms, Bartsch and Obermayer (2003) suggested applying linear ICA to second-order terms of the data, and Hashimoto (2003) proposed an algorithm based on minimization of the Kullback-Leibler divergence. However, no identifiability was studied; instead, these works focused on the application to natural images.

In (Theis and Nakamura, 2004), we defined the above quadratic unmixing process and derived a generative model. We then reduced the quadratic model to an overdetermined linear model, in which more observations than sources are given, by embedding y into R^{m(m+1)/2}; this can be done by taking the monomials x_i x_j as new variables. Using some linear algebra, we then derived the following identifiability theorem for overdetermined ICA:

Theorem 1.2.5 (Uniqueness of overdetermined ICA). Let x = As with an independent n-dimensional random vector s and a full-rank (m × n)-matrix A with m ≥ n, and let the (n × m)-matrix W be chosen such that Wx is independent. Furthermore assume that s has at most one Gaussian component and that the variances of s exist. Then there exist a permutation matrix P and an invertible scaling matrix L with W = LP(A^⊤A)^{-1}A^⊤ + C and CA = 0.



Figure 1.5: Quadratic ICA of natural images. 3·10^5 sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvector matrices of the quadratic-form coefficient matrices (top figure). For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2·64). Above each image the corresponding eigenvalue (multiplied by 10^3) is printed. In most filters only one or two eigenvalues are dominant.

This allowed us to derive identifiability for the quadratic unmixing model (1.4). Moreover, we were then able to construct an explicit ICA algorithm from this uniqueness result.

Finally, we studied the algorithm in the context of natural image filters, similar to Bartsch and Obermayer (2003) and Hashimoto (2003). We applied quadratic ICA to a set of small patches. Most obtained quadratic forms had one or two dominant linear filters, and these linear filters were selective for local bar stimuli, see figure 1.5. Hence, the values of the quadratic forms corresponded to squared simple-cell outputs.



1.3 Dependent component analysis

In this section, we will discuss the relaxation of the BSS model by taking into account additional structures in the data and dependencies between components. Many researchers have taken an interest in this generalization, which is crucial for applications in real-world settings where such situations are to be expected.

Here, we will consider model indeterminacies as well as actual separation algorithms. For the latter, we will employ a technique that has been the basis of one of the first ICA algorithms (Cardoso and Souloumiac, 1993), namely joint diagonalization (JD). It has since become an important tool in ICA-based BSS and in BSS relying on second-order time-decorrelation (Belouchrani et al., 1997). Its task is, given a set of commuting symmetric n × n matrices Ci, to find an orthogonal matrix A such that A^⊤ Ci A is diagonal for all i. This generalizes the eigenvalue decomposition (i = 1) and the generalized eigenvalue problem (i = 2), in which perfect factorization is always possible.

Other extensions of the standard BSS model, such as including singular matrices (Georgiev and Theis, 2004), will be omitted from the discussion in the following.

1.3.1 Algebraic BSS and multidimensional generalizations

Considering the BSS model from equation (1.1) (or a more general, noisy version x(t) = As(t) + n(t)), the data can only be separated if we put additional conditions on the sources, such as:

• they are stochastically independent: p_s(s1, ..., sn) = p_s1(s1) ··· p_sn(sn),

• each source is sparse, i.e. it contains a certain number of zeros or has a low p-norm for small p and fixed 2-norm,

• s(t) is stationary and has diagonal autocovariances E(s(t + τ) s(t)^⊤) for all τ; here zero-mean s(t) are assumed.

In the following, we will review BSS algorithms based on eigenvalue decomposition, JD and generalizations. Thereby, one of the above conditions is referred to by the term source condition, because we do not want to specialize to a single model. The additive noise n(t) is modeled by a stationary, temporally and spatially white zero-mean process with variance σ². Moreover, we will not deal with the more complicated underdetermined case, so we assume that at most as many sources as sensors are to be extracted, i.e. n ≤ m.

The signals x(t) are observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A†x(t), which is optimal in the maximum-likelihood sense. Here † denotes the pseudo-inverse of A, which equals the inverse in the case m = n. So the BSS task reduces to the estimation of the mixing matrix A; hence the additive noise n is often neglected (after whitening). Note that in the following we will assume that all signals are real-valued. Extensions to the complex case are straightforward.

Approximate joint diagonalization

Many BSS algorithms employ joint diagonalization (JD) techniques on some source condition matrices to identify the mixing matrix. Given a set of symmetric matrices C := {C1, ..., CK},



JD amounts to minimizing the squared sum of the off-diagonal elements of Â^⊤ Ci Â, i.e. minimizing

f(Â) := Σ_{i=1}^K ‖Â^⊤ Ci Â − diag(Â^⊤ Ci Â)‖²_F    (1.5)

with respect to the orthogonal matrix Â, where diag(C) produces a matrix in which all off-diagonal elements of C have been set to zero, and ‖C‖²_F := tr(CC^⊤) denotes the squared Frobenius norm. A global minimum A of f is called a joint diagonalizer of C. Such a joint diagonalizer exists if and only if all elements of C commute.

Algorithms for performing joint diagonalization include, among others, gradient descent on f(Â), Jacobi-like iterative construction of A by Givens rotations in two coordinates (Cardoso and Souloumiac, 1995), an extension minimizing a logarithmic version of (1.5) (Pham, 2001), an alternating optimization scheme switching between column and diagonal optimization (Yeredor, 2002), and more recently a linear least-squares algorithm for diagonalization (Ziehe et al., 2003b), where the latter three algorithms can also search for non-orthogonal matrices A. Note that in practice minimization of the off-diagonal sums only yields an approximate joint diagonalizer: in the case of finite samples, the source condition matrices are estimates, hence they only approximately share the same eigenstructure and do not fully commute, so f(Â) from equation (1.5) cannot be rendered precisely zero but only approximately.
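A compact sketch of such a Jacobi-type orthogonal joint diagonalizer is given below, following the classical Givens-rotation scheme of Cardoso and Souloumiac; the function name, tolerance and sweep count are illustrative choices of this sketch, not a reference implementation.

```python
import numpy as np

def joint_diagonalize(Cs, eps=1e-8, max_sweeps=100):
    """Approximate orthogonal joint diagonalization of symmetric matrices Cs
    (array of shape K x n x n) by sweeps of Givens rotations. Returns an
    orthogonal V such that V.T @ C @ V is (nearly) diagonal for every C."""
    Cs = np.array(Cs, dtype=float)
    K, n, _ = Cs.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # closed-form Givens angle for the (p, q) plane
                ton = Cs[:, p, p] - Cs[:, q, q]
                toff = Cs[:, p, q] + Cs[:, q, p]
                g11, g22, g12 = ton @ ton, toff @ toff, ton @ toff
                theta = 0.5 * np.arctan2(2 * g12, g11 - g22 + np.hypot(g11 - g22, 2 * g12))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > eps:
                    rotated = True
                    # rotate columns and rows p, q of every condition matrix
                    Cp, Cq = Cs[:, :, p].copy(), Cs[:, :, q].copy()
                    Cs[:, :, p], Cs[:, :, q] = c * Cp + s * Cq, c * Cq - s * Cp
                    Rp, Rq = Cs[:, p, :].copy(), Cs[:, q, :].copy()
                    Cs[:, p, :], Cs[:, q, :] = c * Rp + s * Rq, c * Rq - s * Rp
                    # accumulate the rotation
                    Vp, Vq = V[:, p].copy(), V[:, q].copy()
                    V[:, p], V[:, q] = c * Vp + s * Vq, c * Vq - s * Vp
        if not rotated:
            break
    return V
```

In the hard-whitening pipeline described below, the Cs would be condition matrices estimated from whitened mixtures, and the returned rotation V yields the estimate of the remaining orthogonal mixing matrix.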

Source conditions

In order to get a well-defined source separation model, assumptions on the sources, such as stochastic independence, have to be formulated. In practice, the conditions are preferably given in terms of roots of some cost function that can easily be estimated. Here, we summarize some of the source conditions used in the literature; they are defined by a criterion specifying the diagonality of a set of matrices C(.) := {C1(.), ..., CK(.)}, which can be estimated from the data. We require only that

Ci(Wx) = W Ci(x) W^⊤    (1.6)

for some matrix W. Note that using the substitution C̄i(x) := Ci(x) + Ci(x)^⊤, we can assume Ci(x) to be symmetric. The actual source model is then defined by requiring the condition matrices Ci(s) of the sources to be diagonal for all i = 1, ..., K. In table 1.1, we review some commonly used source conditions for an m-dimensional centered random vector x or a multivariate random process x(t), respectively.

Searching for sources s := Wx fulfilling the source model means finding matrices W such that Ci(Wx) is diagonal for all i. Depending on the algorithm, whitening by PCA is performed as preprocessing to allow for a reduced search on the orthogonal group W ∈ O(n). This is equivalent to setting all source second-order statistics to I, and then searching only for rotations. In the case of K = 1, the search can be performed by an eigenvalue decomposition of the source condition matrix C1(x̃) of the whitened mixtures x̃; this is equivalent to solving the generalized eigenvalue decomposition (GEVD) problem for the matrix pencil (E(xx^⊤), C1(x̃)). Usually, using more than one condition matrix increases the robustness of the proposed algorithm, and in these cases the algorithm performs orthogonal JD of C := {Ci(x̃)}, e.g. by a Jacobi-type algorithm (Cardoso and Souloumiac, 1995).
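The K = 1 case can be written down in a few lines. The sketch below follows the hard-whitening route (PCA, then an EVD of a single symmetrized autocovariance matrix of the whitened mixtures, as in AMUSE-type algorithms); the function name and the lag are illustrative choices of this sketch.

```python
import numpy as np

def amuse_like(x, tau=1):
    """Separation sketch for the K = 1 case: PCA whitening followed by an EVD
    of one symmetrized autocovariance matrix of the whitened mixtures.
    x has shape (m, T); returns estimated sources and the unmixing matrix."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    W_white = (E / np.sqrt(d)) @ E.T                       # PCA whitening matrix
    z = W_white @ x                                        # whitened mixtures
    C_tau = z[:, tau:] @ z[:, :-tau].T / (z.shape[1] - tau)
    C_tau = 0.5 * (C_tau + C_tau.T)                        # symmetrized condition matrix C1
    _, V = np.linalg.eigh(C_tau)                           # rotation from the EVD
    W = V.T @ W_white
    return W @ x, W
```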



Table 1.1: BSS algorithms based on joint diagonalization (centered sources are assumed)

algorithm | source model | condition matrices | optimization algorithm
FOBI (Cardoso and Souloumiac, 1990) | independent i.i.d. sources | contracted quadricovariance matrix with E_ij = I | EVD after PCA (GEVD)
JADE (Cardoso and Souloumiac, 1993) | independent i.i.d. sources | contracted quadricovariance matrices | orthogonal JD after PCA
eJADE (Moreau, 2001) | independent i.i.d. sources | arbitrary-order cumulant matrices | orthogonal JD after PCA
HessianICA (Theis, 2004a, Yeredor, 2000) | independent i.i.d. sources | multiple Hessians H_{log x̂}(x^(i)) or H_{log p_x}(x^(i)) | orthogonal JD after PCA
AMUSE (Molgedey and Schuster, 1994, Tong et al., 1991) | wide-sense stationary s(t) with diagonal autocovariances | single autocovariance matrix E(x(t + τ)x(t)^⊤) | EVD after PCA (GEVD)
SOBI (Belouchrani et al., 1997), TDSEP (Ziehe and Mueller, 1998) | wide-sense stationary s(t) with diagonal autocovariances | multiple autocovariance matrices | orthogonal JD after PCA
mdAMUSE (Theis et al., 2004e) | s(t1, ..., tM) with diagonal autocovariances | single multidimensional autocovariance matrix (1.7) | EVD after PCA (GEVD)
mdSOBI (Schießl et al., 2000, Theis et al., 2004e) | s(t1, ..., tM) with diagonal autocovariances | multidimensional autocovariance matrices (1.7) | orthogonal JD after PCA
JADE_TD (Müller et al., 1999) | independent s(t) with diagonal autocovariances | cumulant and autocovariance matrices | orthogonal JD after PCA
SONS (Choi and Cichocki, 2000) | non-stationary s(t) with diagonal (auto-)covariances | (auto-)covariance matrices of windowed signals | orthogonal JD after PCA
ACDC (Yeredor, 2002), LSDIAG (Ziehe et al., 2003b) | independent or auto-decorrelated s(t) | covariance matrices and cumulant/autocovariance matrices | non-orthogonal JD
block-Gaussian likelihood (Pham and Cardoso, 2001) | block-Gaussian non-stationary s(t) | (auto-)covariance matrices of windowed signals | non-orthogonal JD
TFS (Belouchrani and Amin, 1998) | s(t) from Cohen's time-frequency distributions (Cohen, 1995) | spatial time-frequency distribution matrices | orthogonal JD after PCA
FRT-based BSS (Karako-Eilon et al., 2003) | non-stationary s(t) with diagonal block-spectra | autocovariances of FRT-transformed windowed signals | (non-)orthogonal JD
ACMA (van der Veen and Paulraj, 1996) | s(t) is of constant modulus (CM) | independent vectors in ker P̂ of the model matrix P̂ | generalized Schur (QZ) decomposition
stBSS (Theis et al., 2005a) | spatiotemporal sources S := s(r, t) | any of the above conditions for both X and X^⊤ | non-orthogonal JD
group BSS (Theis, 2005a) | group-dependent sources s(t) | any of the above conditions | block orthogonal JD after PCA
after PCA



In contrast to this so-called hard-whitening technique, soft-whitening tries to avoid a bias towards second-order statistics: a non-orthogonal joint diagonalization algorithm (Pham, 2001, Yeredor, 2002, Ziehe et al., 2003b) jointly diagonalizes the source conditions Ci(x) together with the mixture covariance matrix E(xx⊤). Possible estimation errors in the second-order part then do not influence the total error disproportionately.

Depending on the source conditions, various algorithms have been proposed in the literature. Table 1.1 gives an overview of these algorithms together with the references, the source model, the condition matrices and the optimization algorithm. For more details and references, we refer to Theis and Inouye (2006).
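Most of the methods in table 1.1 end in the same computational step: an orthogonal approximate joint diagonalization of the condition matrices after whitening. The following NumPy sketch illustrates that step with a simple Jacobi-type sweep over Givens rotations. It is a didactic variant, not the exact closed-form update of Cardoso and Souloumiac (1995): each rotation angle is obtained from a small 2×2 eigenproblem that minimizes the off-diagonal mass of the affected entries, and real symmetric condition matrices are assumed.

```python
import numpy as np

def joint_diagonalize(Cs, sweeps=100, eps=1e-8):
    """Orthogonal approximate joint diagonalization of real symmetric matrices Cs
    by sweeps of Givens rotations (didactic sketch of the Jacobi-type approach)."""
    Cs = [np.array(C, dtype=float, copy=True) for C in Cs]
    n = Cs[0].shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # after a rotation by theta, the (p, q) entry of each C becomes
                # d*cos(2t) + e*sin(2t); choose (cos 2t, sin 2t) as the eigenvector
                # belonging to the smallest eigenvalue of the 2x2 quadratic form
                d = np.array([C[p, q] for C in Cs])
                e = np.array([(C[q, q] - C[p, p]) / 2.0 for C in Cs])
                G2 = np.array([[d @ d, d @ e], [d @ e, e @ e]])
                _, U = np.linalg.eigh(G2)
                u, v = U[:, 0]
                if u < 0:                      # fix the sign -> small 'inner' rotation
                    u, v = -u, -v
                theta = 0.5 * np.arctan2(v, u)
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) < eps:
                    continue
                rotated = True
                G = np.eye(n)
                G[p, p] = G[q, q] = c
                G[p, q], G[q, p] = -s, s
                V = V @ G
                Cs = [G.T @ C @ G for C in Cs]
        if not rotated:
            break
    return V, Cs    # V.T @ C_i @ V is approximately diagonal
```

In the hard-whitening setting, the Ci would be condition matrices of the whitened mixtures (e.g. autocovariances) and V⊤ is the estimated rotation; for soft-whitening one would additionally include E(xx⊤) in the set and switch to a non-orthogonal JD algorithm, which this sketch does not cover.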

Multidimensional autodecorrelation

In Theis et al. (2004e), see chapter 6, we considered BSS algorithms based on time-decorrelation and the resulting source condition. Corresponding JD-based algorithms include AMUSE (Tong et al., 1991) and extensions such as SOBI (Belouchrani et al., 1997) and TDSEP (Ziehe and Mueller, 1998). They rely on the fact that the data sets have non-trivial autocorrelations. We extended them to data sets having more than one direction in the parametrization, such as images. For this, we replaced one-dimensional autocovariances by multidimensional autocovariances defined by

C_{τ1,...,τM}(s) := E( s(z1 + τ1, . . . , zM + τM) s(z1, . . . , zM)⊤ )   (1.7)

where s is centered and the expectation is taken over (z1, . . . , zM). Given equidistant samples, C_{τ1,...,τM}(s) can be estimated as usual by replacing random variables with sample values and expectations with sums.
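As a concrete illustration, the following NumPy sketch estimates the multidimensional autocovariance of equation (1.7) from gridded samples; it assumes already centered sources and non-negative shifts, and simply discards the boundary samples for which s(z + τ) is not available.

```python
import numpy as np

def md_autocov(S, tau):
    """Sample estimate of the multidimensional autocovariance C_tau(s) of eq. (1.7).

    S   : array of shape (n, d1, ..., dM) -- n centered components sampled on an
          equidistant M-dimensional grid (e.g. n images of size h x w for M = 2).
    tau : tuple of M non-negative integer shifts (tau_1, ..., tau_M).
    """
    n = S.shape[0]
    shifted = tuple(slice(t, d) for d, t in zip(S.shape[1:], tau))     # samples s(z + tau)
    base = tuple(slice(0, d - t) for d, t in zip(S.shape[1:], tau))    # samples s(z)
    A = S[(slice(None),) + shifted].reshape(n, -1)
    B = S[(slice(None),) + base].reshape(n, -1)
    return A @ B.T / A.shape[1]    # expectation over the grid replaced by the sample mean
```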

A typical example for non-trivial multidimensional autocovariances is a source data set in which each component si represents an image of size h × w. Then the data is of dimension M = 2, and samples of s are given at indices z1 = 1, . . . , h, z2 = 1, . . . , w. Classically, s(z1, z2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples (vectorization). If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure-based algorithms such as AMUSE and SOBI, the results can vary greatly depending on the choice of this mapping.

The advantage of using multidimensional autocovariances lies in the fact that the multidimensional structure of the data set can now be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance. Figure 1.6 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ1, τ2) after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional covariance. Only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. does it capture image structure.

For details as well as extended simulations and examples, we refer to Theis et al. (2004e) and related work by Schießl et al. (2000), Schöner et al. (2000).
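To make the use of such a condition matrix concrete, here is a compact, self-contained sketch of an AMUSE-style separation of image mixtures with a single two-dimensional autocovariance. The helper name md_amuse_2d is hypothetical and only captures the spirit of mdAMUSE, not the published implementation; it assumes noiseless, full-rank mixtures X_img of shape (m, h, w).

```python
import numpy as np

def md_amuse_2d(X_img, tau=(1, 1)):
    """AMUSE-style separation with a single 2d autocovariance (illustrative sketch)."""
    m, h, w = X_img.shape
    X = X_img.reshape(m, -1)
    X = X - X.mean(axis=1, keepdims=True)
    # hard-whitening / PCA step
    d, E = np.linalg.eigh(np.cov(X))
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    Z = (W @ X).reshape(m, h, w)
    # symmetrized 2d autocovariance of the whitened data at shift tau
    t1, t2 = tau
    A = Z[:, t1:, t2:].reshape(m, -1)
    B = Z[:, :h - t1, :w - t2].reshape(m, -1)
    C = A @ B.T / A.shape[1]
    C = (C + C.T) / 2
    # EVD of the whitened autocovariance gives the remaining rotation (GEVD idea)
    _, V = np.linalg.eigh(C)
    S_hat = V.T @ W @ X
    return S_hat.reshape(m, h, w), V.T @ W    # sources (up to scaling/permutation), unmixing matrix
```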



Figure 1.6: Example of one- and two-dimensional autocovariance coefficients (b) of the gray-scale 128×128 Lena image (a) after normalization to variance 1. Using the local structure in both directions (2d-autocov) guarantees that for small τ higher autocorrelation values are present than after rearranging the data into a vector (1d-autocov), which loses the information about the second dimension.

1.3.2 Spatiotemporal BSS

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. In Theis et al. (2007b), see chapter 9, we propose an algorithm that includes such spatiotemporal information in the analysis and reduces the problem to the joint approximate diagonalization of a set of autocorrelation matrices.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. (2002), it is a promising method with potential applications in areas where the data contains an inherent spatiotemporal structure, such as biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm by Bell and Sejnowski (1995), which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization.


Figure 1.7: Temporal, spatial and spatiotemporal BSS models: (a) temporal BSS X = ᵗA ᵗS, (b) spatial BSS X⊤ = ˢA ˢS, (c) spatiotemporal BSS X = ˢS⊤ ᵗS. The lines in the matrices ∗S indicate the sample direction; source conditions apply between adjacent such lines.

This has the advantage of greatly reducing the number of parameters in the system, and it leads to more stable optimization algorithms. In Theis et al. (2007b), we extended Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structure of the data.

For this, we considered data sets x(r, t) depending on two indices r and t, where r ∈ Rⁿ can be any multidimensional (spatial) index and t indexes the time axis. In order to be able to use matrix notation, we contracted the spatial multidimensional index r into a one-dimensional index r by row concatenation. Then the data set x(r, t) =: x_rt can be represented by a data matrix X of dimension ˢm × ᵗm, where the superscripts ˢ(·) and ᵗ(·) denote spatial and temporal variables, respectively.
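For illustration, a minimal NumPy snippet constructing such a data matrix from a toy recording (the grid size and number of time points are made up for the example):

```python
import numpy as np

# Toy recording x(r, t) on an 8 x 10 spatial grid observed at 50 time points.
x = np.random.randn(8, 10, 50)

sm, tm = 8 * 10, 50              # spatial dimension and temporal dimension
X = x.reshape(sm, tm)            # contract the spatial index by row concatenation

# Temporal BSS factorizes X   (rows indexed by voxels, columns by time samples);
# spatial  BSS factorizes X.T (rows indexed by time points, columns by voxels).
```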

Temporal BSS implies the matrix factorization X = ᵗA ᵗS, whereas spatial BSS implies the factorization X⊤ = ˢA ˢS or, equivalently, X = ˢS⊤ ˢA⊤. Hence X = ᵗA ᵗS = ˢS⊤ ˢA⊤. So both source separation models can be interpreted as matrix factorization problems; in the temporal case restrictions such as diagonal autocorrelations are placed on the second factor, in the spatial case on the first one. In order to achieve a spatiotemporal model, we required these conditions from both factors at the same time. Therefore the spatiotemporal BSS model can be derived from the above as the factorization problem

X = ˢS⊤ ᵗS   (1.8)

with spatial source matrix ˢS and temporal source matrix ᵗS, both of which have (multidimensional) autocorrelations that are as diagonal as possible. The three models are illustrated in figure 1.7.

Concerning conditions for the sources, we interpreted Ci(X) := Ci(ᵗx(t)) as the i-th temporal autocovariance matrix, whereas Ci(X⊤) := Ci(ˢx(r)) denoted the corresponding spatial autocovariance matrix. Application of the spatiotemporal mixing model from equation (1.8) together with the transformation properties (1.6) of the source conditions yields

Ci(ᵗS) = ˢS†⊤ Ci(X) ˢS†   and   Ci(ˢS) = ᵗS†⊤ Ci(X⊤) ᵗS†   (1.9)

because ∗m ≥ n and hence ∗S ∗S† = I. By assumption the matrices Ci(∗S) are as diagonal as possible. In order to separate the data, we had to find diagonalizers for both Ci(X) and Ci(X⊤) such that they satisfy the spatiotemporal model (1.8). As the matrices derived from X had to be diagonalized in terms of both columns and rows, we denoted this as double-sided approximate joint diagonalization.

In Theis et al. (2007b, 2005a) we showed how to reduce this process to joint diagonalization. In order to get robust estimates of the source conditions, dimension reduction was essential. For this we considered the singular value decomposition of X and formulated the algorithm in terms of the pseudo-orthogonal components of X. Of course, instead of autocovariance matrices, other source conditions Ci(·) from table 1.1 can be employed in order to adapt to the separation problem at hand.

We present an application of the spatiotemporal BSS algorithm to fMRI data using multidimensional autocovariances in section 1.6.1.

1.3.3 Independent subspace analysis

Another extension of the simple source separation model lies in extracting groups of sources that are independent of each other but not within the group. Multidimensional independent component analysis or independent subspace analysis (ISA) thus denotes the task of transforming a multivariate observed sensor signal such that groups of the transformed signal components are mutually independent, while dependencies within the groups are still allowed. This weakens the sometimes too strict assumption of independence in ICA, and has potential applications in various fields such as ECG and fMRI analysis or convolutive ICA.

Recently we were able to calculate the indeterminacies of group ICA for known and unknown group structure, which finally enabled us to guarantee successful application of group ICA to BSS problems. Here, we shortly review the identifiability result as well as the resulting algorithm for separating signals into groups of dependent signals. As before, the algorithm is based on joint (block) diagonalization of sets of matrices generated using one or multiple source conditions.

Generalizations of the ICA model that include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA was first introduced by Cardoso (1998) using geometrical motivations. His model, as well as the related but independently proposed factorization of multivariate function classes (Lin, 1998), is quite general; however, no identifiability results were presented, and applicability to an arbitrary random vector was unclear. Later, in the special case of equal group sizes k, in the following denoted as k-ISA, uniqueness results have been extended from the ICA theory (Theis, 2004b). Algorithmic enhancements in this setting have recently been studied by Poczos and Lörincz (2005). Similar to Cardoso (1998), Akaho et al. (1999) also proposed to employ a multidimensional-component maximum likelihood algorithm, however in the slightly different context of multimodal component analysis. Moreover, if the observations contain additional



structure, such as spatial or temporal correlations, this may be used for the multidimensional separation (Ilin, 2006, Vollgraf and Obermayer, 2001).

Hyvärinen and Hoyer (2000) presented a special case of k-ISA by combining it with invariant feature subspace analysis. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA (Hyvärinen et al., 2001a), where dependencies between all components are assumed and modeled along a topographic structure (e.g. a 2-dimensional grid). However, these two approaches are not completely blind anymore. Bach and Jordan (2003b) formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree-dependent component analysis (Bach and Jordan, 2003a). Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, we review two contributions, Theis (2004b) and Theis (2007), where no additional structures were necessary for the separation.

Fixed group structure—k-ISA

A random vector y is called an independent component of the random vector x if there exists an invertible matrix A and a decomposition x = A(y, z) such that y and z are stochastically independent. We note that this is a more general notion of independent components than in ICA, since we do not require them to be one-dimensional.

The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector x into independent components. If x is to be decomposed into one-dimensional components, this coincides with ordinary ICA. Similarly, if the independent components are required to be of the same dimension k, then this is denoted as multidimensional ICA of fixed group size k or simply k-ISA.

As we have seen before, an important structural aspect in the search for decompositions is the knowledge of the number of solutions, i.e. the indeterminacies of the problem. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) again give an ISA of x. This is of course known for 1-ISA, i.e. ICA, see section 1.2.1.

In Theis (2004b), see chapter 3, we were able to extend this result to k-ISA, given some additional restrictions on the model: we denoted A as k-admissible if for each r, s = 1, . . . , n/k the (r, s) sub-k-matrix of A is either invertible or zero. Then the following theorem can be derived from the multivariate Darmois-Skitovitch theorem (section 1.2.1) or using our previously discussed approach via differential equations (Theis, 2005c).

Theorem 1.3.1 (Separability of k-ISA). Let A ∈ Gl(n; R) be k-admissible, and let S be a k-independent n-dimensional random vector having no Gaussian k-dimensional component. If AS is again k-independent, then A is the product of a k-block-scaling and a permutation matrix.

This shows that k-ISA solutions are unique except for trivial transformations if the model has no Gaussians and is admissible; the result can now be turned into a separation algorithm.



ISA with known group structure via joint block diagonalization

In order to solve ISA with fixed block size k, or at least known block structure, we use a generalization of joint diagonalization that searches for block structures instead of diagonality. We are not interested in the order of the blocks, so the block structure is uniquely specified by fixing a partition n = m1 + . . . + mr of n and setting m := (m1, . . . , mr) ∈ Nʳ. An n × n matrix is said to be m-block diagonal if it is of the form
said to be m-block diagonal if it is of the form<br />

⎛<br />

M1<br />

⎜<br />

⎝ .<br />

· · ·<br />

. ..<br />

0<br />

.<br />

⎞<br />

⎟<br />

⎠<br />

0 · · · Mr<br />

with arbitrary mi × mi matrices Mi.

As a generalization of JD to the case of known block structure, the joint m-block diagonalization (m-JBD) problem is defined as the minimization of

f^m(Â) := Σ_{i=1}^{K} ‖Â⊤CiÂ − diag^m(Â⊤CiÂ)‖²_F   (1.10)

with respect to the orthogonal matrix Â, where diag^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.
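As a small illustration of the objects involved, the following NumPy sketch implements the block-diagonal projection diag^m and evaluates the JBD cost f^m of equation (1.10) for a given orthogonal candidate; it is only meant to make the notation concrete, not to optimize the cost.

```python
import numpy as np

def diag_m(M, m):
    """Project M onto m-block diagonal form: keep the diagonal blocks of sizes
    m = (m_1, ..., m_r), set everything else to zero."""
    out = np.zeros_like(M)
    start = 0
    for size in m:
        out[start:start + size, start:start + size] = M[start:start + size, start:start + size]
        start += size
    return out

def jbd_cost(A_hat, Cs, m):
    """JBD cost f^m(A_hat) of eq. (1.10) for condition matrices Cs."""
    total = 0.0
    for C in Cs:
        T = A_hat.T @ C @ A_hat
        total += np.linalg.norm(T - diag_m(T, m), 'fro') ** 2
    return total
```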

A few algorithms to actually perform JBD have been proposed, see Abed-Meraim and Belouchrani (2004), Févotte and Theis (2007a). In the following we simply perform joint diagonalization and then permute the columns of Â to achieve block-diagonality; in experiments this turns out to be an efficient solution to JBD, although other, more sophisticated pivot selection strategies for JBD are of interest (Févotte and Theis, 2007b). The fact that JD induces JBD had been conjectured by Abed-Meraim and Belouchrani (2004), and we were able to give a partial answer with the following theorem:

Theorem 1.3.2 (JBD via JD). Any block-optimal JBD of the Ci's (i.e. a zero of f^m) is a local minimum of the JD cost function f from equation (1.5).

Clearly not every JBD minimizes f, only those such that in each block of size mk, f(Â) restricted to the block is maximal over Â ∈ O(mk); we denote such solutions as block-optimal. The proof is given in Theis (2007), see chapter 8.

In the case of k-ISA, where m = (k, . . . , k), we used this result to propose an explicit algorithm (Theis, 2005a, see chapter 7). Consider the BSS model from equation (1.1). As usual, by preprocessing we may assume whitened observations x, so A is orthogonal. For the density ps of the sources we therefore get ps(s0) = px(As0). Its Hessian transforms like a 2-tensor, which locally at s0 (see section 1.2.1) guarantees

H_{ln ps}(s0) = H_{ln px∘A}(s0) = A⊤ H_{ln px}(As0) A.   (1.11)



Figure 1.8: Applying ICA to a random vector x = As that does not fulfill the ICA model; here s is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are statistics over 100 runs of the Amari error (crosstalking error) between the random original and the reconstructed mixing matrix using the three ICA algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. Interestingly, however, the latter two algorithms do indeed find an ISA up to permutation, which can be explained by theorem 1.3.2.

The sources s(t) are assumed to be k-independent, so ps factorizes into r groups depending on k separate variables each. Thus ln ps is a sum of functions depending on k separate variables, hence H_{ln ps}(s0) is k-block-diagonal. Hessian ISA now simply uses the block-diagonality structure from equation (1.11) and performs JBD of estimates of a set of Hessians H_{ln px}(x^(i)) evaluated at different sampling points x^(i). This corresponds to using the HessianICA source condition from table 1.1. Other source conditions such as contracted quadricovariance matrices (Cardoso and Souloumiac, 1993) can also be used in this extended framework (Theis, 2007).
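To make the block-diagonality argument tangible, the following self-contained NumPy sketch checks the transformation property (1.11) and the k-block structure of the source log-density Hessian numerically; the 2-independent density on R⁴, the orthogonal matrix and the evaluation point are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((4, 4)))      # orthogonal 'mixing' matrix

def ln_ps(s):
    # two independent 2d groups, each with a non-factorizing log-density
    g1 = -np.log(1.0 + s[0] ** 2 + s[1] ** 2 + (s[0] * s[1]) ** 2)
    g2 = -np.cosh(s[2] + s[3]) - 0.5 * (s[2] - s[3]) ** 2
    return g1 + g2

def ln_px(x):                                          # p_x(x) = p_s(A^T x) for orthogonal A
    return ln_ps(A.T @ x)

def num_hessian(f, x0, h=1e-4):
    n = len(x0)
    H, I = np.zeros((n, n)), np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x0 + h * I[i] + h * I[j]) - f(x0 + h * I[i] - h * I[j])
                       - f(x0 - h * I[i] + h * I[j]) + f(x0 - h * I[i] - h * I[j])) / (4 * h * h)
    return H

s0 = rng.standard_normal(4)
Hs = num_hessian(ln_ps, s0)                 # approximately 2-block-diagonal
Hx = num_hessian(ln_px, A @ s0)
print(np.round(Hs, 3))
print(np.allclose(Hs, A.T @ Hx @ A, atol=1e-4))        # transformation property (1.11)
```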

Unknown group structure—general ISA

A serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds according to theorem 1.3.1. However, if k-ISA is applied to an arbitrary random vector, a decomposition into groups that are only 'as independent as possible' cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition only as well as possible; however, care has to be taken: the strong uniqueness result is no longer valid, and the results may depend on the algorithm, as illustrated in figure 1.8.

In contrast to ICA and k-ISA, we do not want to fix the size of the groups Si in advance.



Figure 1.9: Linear factorization models for a random vector x = As and the resulting indeterminacies for (a) ICA, (b) ISA with fixed group size and (c) general ISA, where L denotes a one- or higher-dimensional invertible matrix (scaling) and P a permutation, applied only along the horizontal lines as indicated in the figures. The small horizontal gaps denote statistical independence. One of the key differences between the models is that general ISA may always be applied to any random vector x, whereas ICA and its generalization, fixed-size ISA, yield unique results only if x follows the corresponding model.

Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. The key idea in Theis (2007), see chapter 8, is to allow only irreducible components, defined as random vectors without lower-dimensional independent components.

The advantage of this formulation is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of x are scalings, i.e. invertible transformations within each si, and permutations of those si that have the same dimension. That these are already all indeterminacies is shown by the following theorem:

Theorem 1.3.3 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Here, no Gaussians had to be excluded from S as in the previous uniqueness theorems, because the dimension reduction result from section 1.5.2 has been used. For details we refer to Theis (2007) and Gutch and Theis (2007). The connection between the various factorization models and the corresponding uniqueness results is illustrated in figure 1.9.

Again, we turned this uniqueness result into a separation algorithm, this time by considering the JADE source condition based on fourth-order cumulants. The key idea was to translate irreducibility into maximal block-diagonality of the source condition matrices Ci(s). Algorithmically, JBD was performed by first using JD according to theorem 1.3.2, followed by permutation and block-size identification (Theis, 2007, algorithm 1). So far, we did not implement a sophisticated clustering step but only a straightforward thresholding method for block-size determination. First results using more elaborate clustering techniques are promising.
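A minimal sketch of such a thresholding step is given below; it groups components into blocks via the connected components of a thresholded coupling graph built from the residual off-diagonal entries after JD. The function name, the normalization and the threshold value are ad hoc choices for illustration and do not reproduce algorithm 1 of Theis (2007).

```python
import numpy as np

def identify_blocks(Cs_hat, threshold=0.05):
    """Group components into blocks from the matrices A_hat^T C_i A_hat obtained
    after joint diagonalization (simple thresholding sketch)."""
    n = Cs_hat[0].shape[0]
    coupling = sum(abs(C) for C in Cs_hat)          # accumulated off-diagonal coupling
    np.fill_diagonal(coupling, 0)
    if coupling.max() > 0:
        coupling = coupling / coupling.max()
    linked = coupling > threshold
    # connected components of the thresholded coupling graph = blocks
    blocks, unseen = [], set(range(n))
    while unseen:
        stack, comp = [unseen.pop()], set()
        while stack:
            i = stack.pop()
            comp.add(i)
            for j in list(unseen):
                if linked[i, j]:
                    unseen.remove(j)
                    stack.append(j)
        blocks.append(sorted(comp))
    return blocks    # e.g. [[0, 2], [1]]: permute the columns of A_hat accordingly
```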




Figure 1.10: Independent subspace analysis with known block structure m = (2, 1) is applied to fetal ECG. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). (b) gives the extracted sources using ISA with the Hessian source condition from table 1.1 with 500 Hessian matrices. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) and the fetal source (component 3) onto the mixture space (a) are plotted.

Finally, we report the example from Theis (2005a) on how to apply the Hessian ISA algorithm to a real-world data set. Following Cardoso (1998), we show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). Our goal is to extract an MECG and an FECG component; however, we cannot expect to find a one-dimensional MECG, because what is measured are projections of a three-dimensional (electric) vector field. Hence it makes sense to model the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component). Application of ISA extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the estimated mixing matrix A and sources s(t) as plotted in figure 1.10(b). A decomposition of the observed ECG data x(t) can be achieved by composing the extracted sources using only the relevant mixing columns. For the MECG part, for example, this means applying the projection ΠM := (a1, a2, 0)A⁻¹ to the observations. The results are plotted in figures 1.10(c) and (d). The fetal ECG is most active at sensor 1 (as visual inspection of the observations confirms). When comparing the projection matrices with the results from Cardoso (1998), we find quite high similarity with the ICA-based results, and a modest difference to the projections of the time-based algorithm.
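The back-projection step can be written in two lines of NumPy; the matrices below are random placeholders standing in for the estimated mixing matrix and sources of the ECG example.

```python
import numpy as np

A_hat = np.random.randn(3, 3)            # placeholder for the estimated mixing matrix
S_hat = np.random.randn(3, 500)          # placeholder for the estimated sources s(t)

P_mecg = A_hat[:, [0, 1]] @ np.linalg.inv(A_hat)[[0, 1], :]   # = (a1, a2, 0) A^{-1}
X_mecg = P_mecg @ (A_hat @ S_hat)        # MECG contribution to the observations
X_fecg = A_hat[:, [2]] @ S_hat[[2], :]   # FECG contribution (component 3)
```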



1.4 Sparseness

One of the fundamental questions in signal processing, data mining and neuroscience is how to represent a large data set X, given in the form of an (m × T)-matrix, in different ways. A simple approach is a linear matrix factorization

X = AS,   (1.12)

which is equivalent to model (1.1) after gathering the samples into corresponding data matrices X := (x(1), . . . , x(T)) ∈ R^{m×T} and S := (s(1), . . . , s(T)) ∈ R^{n×T}. We speak of a complete, overcomplete or undercomplete factorization if m = n, m < n or m > n, respectively. The unknown matrices A and S are assumed to have some specific properties, for instance:

(i) the rows si of S are assumed to be samples of a mutually independent random vector, see section 1.2;

(ii) each sample s(t) contains as many zeros as possible—this is the sparse representation or sparse component analysis (SCA) problem;

(iii) the elements of X, A and S are nonnegative, which results in nonnegative matrix factorization (NMF).

There is a large body of work devoted to ICA problems, but mostly for the (under)complete case m ≥ n. We refer to Lee et al. (1999), Theis et al. (2004d), Zibulevsky and Pearlmutter (2001) and references therein for work on overcomplete ICA. Here, we will discuss constraints (ii) and (iii).

1.4.1 Sparse component analysis

We consider the blind matrix factorization problem (1.12) in the more challenging overcomplete case, where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not be sparse in the time domain: it suffices to find a linear transformation (e.g. wavelet packets) in which the sources are sufficiently sparse. Applications of the model include biomedical data analysis, where sparsely active sources are often assumed (McKeown et al., 1998), and audio source separation (Araki et al., 2007).

In Georgiev et al. (2005c), see chapter 10, we introduced a novel measure for sparsity and showed that, based on sparsity alone, we were still able to identify both the mixing matrix and the sources uniquely except for trivial indeterminacies. Here, a vector v ∈ Rⁿ is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal X as in equation (1.12) such that S is k-sparse. In our work, we always assume that the sparsity level equals k = n − m + 1, which means that at any time instant fewer sources than given observations are active.

The following theorem shows that the SCA model is essentially unique if fewer sources than mixtures are active, i.e. if the sources are (n − m + 1)-sparse.


Figure 1.11: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R³. Due to the source sparsity, the mixtures are generated by only two matrix columns ai, aj at a time and are hence contained in a union of hyperplanes span{ai, aj}: (a) three hyperplanes for 1 ≤ i < j ≤ 3 in the 3 × 3 case, (b) the hyperplanes from (a) visualized by intersection with the sphere, and (c) six hyperplanes for 1 ≤ i < j ≤ 4 in the 3 × 4 case. Identification of the hyperplanes gives mixing matrix and sources.

Theorem 1.4.1 (SCA matrix identifiability). Assume that in the SCA model every m × m submatrix of A is invertible and that S is sufficiently rich represented. Then A is uniquely determined by X except for left-multiplication with permutation and scaling matrices.

Here S is said to be sufficiently rich represented if for any index set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of S such that each of them has zero elements in the places indexed by I and each m − 1 of them are linearly independent. The next theorem shows that in this case also the sources can be found uniquely:

Theorem 1.4.2 (SCA source identifiability). Let H be the set of all x ∈ Rᵐ such that the linear system As = x has an (n − m + 1)-sparse solution, i.e. one with at least n − m + 1 zero components. If A fulfills the condition from theorem 1.4.1, then for almost all x ∈ H this system has no other solution with this property.

The proofs were given in Georgiev et al. (2005c), see chapter 10. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be recovered uniquely from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both theorems, as well as of the corresponding algorithms, is illustrated in figure 1.11: by assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determine both mixing matrix and sources.

It is not clear a priori whether any given data matrix X can be factorized into a sparse representation. A necessary and sufficient condition is given in the following theorem from Georgiev et al. (2005c):



Theorem 1.4.3 (SCA conditions). Assume that m ≤ n ≤ T and that the matrix X ∈ R^{m×T} satisfies the following conditions:

(i) the columns of X lie in the union H of \binom{n}{m-1} different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;

(ii) for each i ∈ {1, ..., n} there exist p = \binom{n-1}{m-2} different hyperplanes {Hi,j}, j = 1, . . . , p, in H such that their intersection Li = ∩_{j=1}^{p} Hi,j is a one-dimensional subspace;

(iii) any m different Li span the whole Rᵐ.

Then the matrix X is uniquely representable (up to permutation and scaling) as an SCA satisfying the conditions of theorem 1.4.1.

Algorithms for SCA

In Georgiev et al. (2004, 2005c), we also proposed an algorithm based on random sampling for reconstructing the mixing matrix and the sources; however, it could not easily be applied in noisy settings and high dimensions due to the involved combinatorial searches. Therefore, we derived a novel, robust algorithm for SCA in Theis et al. (2007a), see chapter 11. The key idea is that if the sources are of sufficiently high sparsity, the mixtures cluster along hyperplanes in the mixture space. Based on this condition, the mixing matrix can be reconstructed; furthermore, this property turned out to be robust against noise and outliers.

The proposed algorithm employs a generalization of the Hough transform in order to detect the hyperplanes in the mixture space, see figure 1.12. This leads to an algorithmically robust matrix and source identification. The Hough-based hyperplane estimation does not depend on the source dimension n, only on the mixture dimension m. With respect to applications, this implies that n can be quite large and hyperplanes will still be found, provided the grid resolution used in the Hough transform is sufficiently high. Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n.
results <strong>in</strong> <strong>in</strong>creased accuracy also for higher source dimensions n.<br />

For applications of the proposed SCA algorithms <strong>in</strong> signal process<strong>in</strong>g and biomedical data<br />

analysis, we refer to section 1.6.3 and Georgiev et al. (2006, 2005a,b), Theis et al. (2007a). More<br />

elaborate source reconstruction methods, after know<strong>in</strong>g the mix<strong>in</strong>g matrix A were discussed <strong>in</strong><br />

Theis et al. (2004a).<br />

Postnonlinear generalization

In Theis and Amari (2004), see chapter 12, we considered the generalization of SCA to postnonlinear mixtures, see section 1.2.2. As before, the data x(t) = f(As(t)) is assumed to be linearly mixed followed by a componentwise nonlinearity, see equation (1.3). However, now the (m × n)-matrix A is allowed to be 'wide', i.e. the more complicated overcomplete situation with m < n is treated. By using sparseness of s(t), we were still able to recover the system:



Figure 1.12: Illustration of the 'hyperplane-detecting' Hough transform in three dimensions: a point (x1, x2, x3) in the data space (left) is mapped onto the curve {(ϕ, θ) | θ = arctan((x1 cos ϕ + x2 sin ϕ)/x3) + π/2} in the parameter space [0, π)² (right). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) of the parameter space, and the data points lie on the plane with normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).

Theorem 1.4.4 (Identifiability of postnonlinear SCA). Let S ∈ R^{n×T} be a matrix with (n − m + 1)-sparse columns s(t), and let X ∈ R^{m×T} consist of columns x(t) = f(As(t)) following the postnonlinear mixture model (1.3). Furthermore assume that

(i) S is fully (n − m + 1)-sparse in the sense that asymptotically for T → ∞ its image equals the union of all (m − 1)-dimensional coordinate spaces (in which it is contained by the sparsity assumption),

(ii) A is mixing and not absolutely degenerate,

(iii) every m × m submatrix of A is invertible.

If X = f̂(ÂŜ) is another representation of X satisfying the same conditions, then there exists an invertible scaling L with f = f̂ ∘ L, and invertible scaling and permutation matrices L′, P′ with A = LÂL′P′.

The proof relied on the fact that when s(t) is sparse as formulated in theorem 1.4.4(i), its image includes all (m − 1)-dimensional coordinate subspaces and hence the intersections of (m − 1) such subspaces, which give the n coordinate axes. These are transformed into n curves in the x-space, passing through the origin. By identification of these curves, we showed that each nonlinearity fi is in fact linear. The proof used the following lemma, which generalizes the analytic case presented by Babaie-Zadeh et al. (2002).

Lemma 1.4.5. Let a, b ∈ R \ {−1, 0, 1}, a > 0 and f : [0, ε) → R differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.
f is l<strong>in</strong>ear.


Figure 1.13: Illustration of the proof of theorem 1.4.4 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (top left) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (right). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (bottom left) according to the algorithms from Georgiev et al. (2004). The idea of the proof was that two lines spanned by coordinate vectors (thick lines) are mapped onto two lines spanned by two columns of A. If the composition g ∘ f maps these lines onto some different lines (as sets), then we showed that (given 'general position' of the two lines) the components of g ∘ f satisfy the conditions of lemma 1.4.5 and hence are already linear.

Theorem 1.4.4 shows that f and A are uniquely determined by x(t) except for scaling and permutation ambiguities. Note that then obviously also s(t) is identifiable by applying theorem 1.4.2 to the linearized mixtures y(t) = f⁻¹(x(t)) = As(t), given the additional assumptions on s(t) from the theorem.

Again, we derived an algorithm from this identifiability result. The separation is done in a two-stage procedure: in the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to the one used in the identifiability proof of theorem 1.4.4, see also figure 1.13. In the second stage, the mixing matrix A and then the sources s are reconstructed by applying linear SCA to the linearized mixtures f⁻¹(x(t)). For details we refer to Theis and Amari (2004), see chapter 12.



1.4.2 Sparse non-negative matrix factorization

In Theis et al. (2005c), see chapter 13, we studied the factorization problem (1.12) using condition (iii) of non-negativity. Non-negative matrix factorization (NMF) strictly requires both matrices A and S to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition (Lee and Seung, 1999).

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer (2004) proposed a modification of the NMF model to include sparseness: he minimized the deviation ‖X − AS‖ of (1.12) under the constraint of fixed sparseness of both A and S. Here, using a ratio of the 1- and 2-norms of x ∈ Rⁿ \ {0}, sparseness is measured by σ(x) := (√n − ‖x‖₁/‖x‖₂)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
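A direct NumPy transcription of this measure, with a few example values:

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure sigma(x) = (sqrt(n) - ||x||_1/||x||_2) / (sqrt(n) - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

print(sparseness([0, 0, 0, 5]))      # 1.0  (n - 1 zeros)
print(sparseness([1, 1, 1, 1]))      # 0.0  (all magnitudes equal)
print(sparseness([3, 1, 0.5, 0]))    # somewhere in between
```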

We restricted ourselves to the asymptotic case of perfect factorization, and therefore defined sparse NMF (Hoyer, 2004) as the task of finding the matrices A and S in the decomposition X = AS subject to

A, S ≥ 0,   σ(A∗i) = σA,   σ(Si∗) = σS   (1.13)

Here σA, σS ∈ [0, 1] denote fixed constants describing the sparseness of the columns of A and of the rows of S, respectively.

Uniqueness of sparse NMF

Obvious indeterminacies of the sparse NMF model (1.13) are permutation and positive scaling of the columns of A (and correspondingly of the rows of S). Another, less obvious indeterminacy comes into play due to the sparseness assumption: S is said to be degenerate if each column of S is a multiple of some vector v ∈ Rⁿ. Then the factorization is not unique, because the row-sparseness of S does not depend on v, so transformations that still guarantee non-negativity are possible.

Now assume that two solutions (A, S) and (Ã, S̃) of the sparse NMF model (1.13) are given with A and Ã of full rank; then AS = ÃS̃ and σ(S) = σ(S̃). Let hi = S⊤ᵢ∗ respectively h̃i = S̃⊤ᵢ∗ denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume ‖hi‖₂ = ‖h̃i‖₂ = 1 for all i. Hence the sparseness of the rows is already fully determined by their 1-norms, and ‖hi‖₁ = ‖h̃i‖₁.

The following theorem from Theis et al. (2005c), see chapter 13, shows uniqueness of sparse NMF in some special cases. Note that in more general settings some additional indeterminacies (specific to n > 3) come into play; however, to our present knowledge they are thin, i.e. of measure zero, and hence of no practical importance.

Theorem 1.4.6 (Uniqueness of sparse NMF). Given two solutions (A, S) and (Ã, S̃) of the sNMF model as above, assume that S is non-degenerate and that either Ã = I and A ≥ 0, or n = 2. Then A = ÃP with a permutation matrix P.



Sparse projection

Algorithmically, we followed Hoyer's approach and solved the sparse NMF problem by alternately updating A and S using gradient descent on the residual error ‖X − AS‖². After each update, the columns of A and the rows of S are projected onto

M := {s | ‖s‖₁ = σ} ∩ {s | ‖s‖₂ = 1} ∩ {s ≥ 0}   (1.14)

in order to satisfy the sparseness conditions of (1.13). For this, points x ∈ Rⁿ have to be projected onto adjacent points in M, where p ∈ M is called adjacent to x if ‖x − p‖₂ ≤ ‖x − q‖₂ for all q ∈ M; we denote this by p ⊳ x.

A priori it is not clear whether such a p exists and, moreover, whether it is unique, see figure 1.14. We answered this question by proving the following theorem:

Theorem 1.4.7 (Existence and uniqueness of the Euclidean projection).<br />

M X (M)<br />

(i) If M is closed and nonempty, then for every x ∈ R n there<br />

exists a p ∈ M with p ⊳ x.<br />

(ii) If X (M) := {x ∈ R n |#{p ∈ M|p ⊳ x} > 1} denotes the<br />

exception or non-uniqueness set of M, then vol(X (M)) = 0.<br />


The above is obvious if M is convex. However here, with M from equation (1.14), this is not the case, and the above theorem is needed. We then denote the (almost everywhere unique) projection by πM(x) := p. In addition, in (Theis et al., 2005c), we proved convergence of Hoyer's projection algorithm.
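For concreteness, the following is a minimal sketch of such a projection step in the spirit of Hoyer (2004): it maps a vector to a nonnegative point with prescribed 1- and 2-norms. The loop structure and the handling of negative entries are standard choices for this construction and are not taken verbatim from the thesis or the cited paper; for a desired sparseness σ and unit 2-norm, the 1-norm target is l1 = √n − σ(√n − 1).

```python
import numpy as np

def project_sparse(x, l1, l2=1.0):
    """Euclidean projection of x onto M = {s >= 0, ||s||_1 = l1, ||s||_2 = l2}.

    Minimal sketch in the spirit of Hoyer (2004), for generic inputs.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x + (l1 - x.sum()) / n                    # project onto the hyperplane sum(s) = l1
    zeroed = np.zeros(n, dtype=bool)
    while True:
        m = np.where(zeroed, 0.0, l1 / (n - zeroed.sum()))
        d = s - m
        # alpha >= 0 with ||m + alpha*d||_2 = l2: positive root of a quadratic in alpha
        a, b, c = d @ d, 2 * (m @ d), m @ m - l2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4 * a * c, 0.0))) / (2 * a)
        s = m + alpha * d
        if (s >= 0).all():
            return s
        # clamp negative entries to zero and re-project the rest onto the hyperplane
        zeroed |= s < 0
        s[zeroed] = 0.0
        free = ~zeroed
        s[free] += (l1 - s.sum()) / free.sum()
```

For example, project_sparse(np.random.rand(10), l1=2.0) returns a nonnegative vector with unit 2-norm and 1-norm 2, i.e. sparseness (√10 − 2)/(√10 − 1) ≈ 0.54.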

Figure 1.14: Two exception (non-uniqueness) sets: (a) the exception set of two points, (b) the exception set of a sector.

Iterative projection onto spheres

In Theis and Tanaka (2006), see chapter 14, our goal was to generalize the notion of sparseness. After all, we naturally interpret sparseness of some signal x(t) as x(t) having many zero entries. This can be measured by the 0-pseudo-norm, and it is common to approximate it by p-norms for p → 0. Hence replacing the 1-norm in (1.14) by some p-norm is desirable.

A p-sparse NMF algorithm can then be readily derived. However, we observed that the sparse projection cannot be solved in closed form anymore, and little attention has been paid to finding projections in the case p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖·‖₀. Hence, our goal in (Theis and Tanaka, 2006) was to explore this more general notion of sparseness and to construct an algorithm to project a vector onto its closest vector of a given sparseness. The resulting algorithm is a non-convex extension of the 'projection onto convex sets' (POCS) algorithm (Combettes, 1993, Youla and Webb, 1982).

Let S^{n−1}_p := {x ∈ R^n | ‖x‖p = 1} denote the (n − 1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS^{n−1}_p := {x ∈ R^n | ‖x‖p = c}.


(a) POSH for n = 2, p = 0.5   (b) POSH for n = 3, p = 1

Figure 1.15: Starting from x0 (◦), we alternately project onto cSp and S2. POSH performance is illustrated for p = 0.5 in dimension n = 2 (a), and for p = 1 and n = 3 (b), where a projection via PCA is displayed; no information is lost, hence the sequence of points lies in a plane.

We were looking for the Euclidean projection y = πM(x) onto M := S^{n−1}_2 ∩ cS^{n−1}_p. Note that due to theorem 1.4.7, this p-sparse projection exists if M ≠ ∅ and is almost always unique. The algorithmic construction of this projection now is a direct generalization of POCS: we alternately project first onto S^{n−1}_2 and then onto the scaled sphere cS^{n−1}_p, using the Euclidean projection operator from above. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denoted this projection algorithm by projection onto spheres (POSH). Interestingly, the algorithm still converges, which we could prove for p = 1:

Theorem 1.4.8 (Convergence of POSH). Let n ≥ 2 and x ∈ R^n \ X(M). If y^1 := π_{S^{n−1}_2}(x) and iteratively y^i := π_{S^{n−1}_2}(π_{cS^{n−1}_1}(y^{i−1})), then y^i converges to πM(x).

In figure 1.15, we show the application of POSH for p ∈ {0.5, 1}; we visualize the performance in 3 dimensions by projecting the data via PCA, which incidentally throws away virtually no information (confirmed by experiment), indicating the validity of theorem 1.4.8 also in higher dimensions.
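The following minimal sketch illustrates the POSH iteration for p = 1, the case covered by the theorem. The closed-form Euclidean projection onto the 1-sphere used here is a standard construction (soft thresholding when the 1-norm must shrink, a uniform outward shift otherwise) and assumes a generic input without zero entries; it is an illustration, not the implementation from the cited paper.

```python
import numpy as np

def project_l2_sphere(x):
    """Euclidean projection onto the unit 2-sphere."""
    return x / np.linalg.norm(x)

def project_l1_sphere(x, c):
    """Euclidean projection onto {s : ||s||_1 = c} for generic x without zero entries."""
    a = np.abs(x)
    if a.sum() <= c:
        # move every coordinate outward by the same amount to reach 1-norm c
        return np.sign(x) * (a + (c - a.sum()) / x.size)
    # otherwise soft-threshold: find t with sum(max(a - t, 0)) = c
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u - (css - c) / (np.arange(x.size) + 1) > 0)[0][-1]
    t = (css[k] - c) / (k + 1)
    return np.sign(x) * np.maximum(a - t, 0.0)

def posh(x, c, iters=100):
    """Alternate projections onto the unit 2-sphere and the 1-sphere of radius c."""
    y = project_l2_sphere(x)
    for _ in range(iters):
        y = project_l2_sphere(project_l1_sphere(y, c))
    return y
```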

We want to finish this section by remarking that the strict framework of sparse NMF (1.13) is somewhat problematic, since the sparseness values σA and σS are parameters of the algorithm and hence difficult to choose. In Stadlthanner et al. (2005b), we proposed an alternative factorization to (1.13), where we maximize the sparseness values of A and S in addition to minimizing the distance to X. However, the optimization gets more intricate, and we have studied enhanced search methods via genetic algorithms in Stadlthanner et al. (2005a).



1.5 Machine learning for data preprocessing

Machine learning denotes the task of computationally finding structures in data. Here we describe some preprocessing techniques that rely on machine learning for denoising, dimension reduction and data grouping (clustering).

1.5.1 Denoising

In many fields of signal processing the examined signals bear considerable noise, which is usually assumed to be additive and decorrelated. For example, in exploratory data analysis of medical data using statistical methods like ICA, the prevalent noise greatly degrades the reliability of the algorithms, and the underlying processes cannot be identified.

In Gruber et al. (2006), see chapter 15, we considered the situation where a one-dimensional signal s(t) ∈ R given at discrete timesteps t = 1, . . . , T is distorted as follows:

    sN(t) = s(t) + N(t),   (1.15)

where N(t) are i.i.d. samples of a Gaussian random variable, i.e. sN equals s up to additive stationary white noise. Many denoising algorithms have been proposed for recovering s(t) from its noisy observation sN(t), see e.g. Effern et al. (2000), Hyvärinen et al. (2001b), Ma et al. (2000) to name but a few. Vetter et al. (2002) suggested an algorithm based on local linear projective noise reduction. The idea was to observe the data in a high-dimensional space of delayed coordinates

    s̃N(t) := (sN(t), . . . , sN(t + m − 1)) ∈ R^m

and to denoise the data locally through a projection onto the lower-dimensional subspace of the deterministic signals.

We followed this approach and localized the problem by selecting k clusters of the delayed time series {s̃N(t) | t = 1, . . . , n}. This can for example be done by a k-means clustering algorithm, see section 1.5.3, which is appropriate for noise selection schemes based on the strength or the kurtosis of the signal, since these statistical properties do not depend on the signal structure.

After the local linearization, we may assume that the time-embedded signal can be linearly decomposed into noise and a lower-dimensional signal subspace. We therefore analyzed these k m-dimensional signals using PCA or ICA in order to determine the 'meaningful' components. The unknown number of signal components in the high-dimensional noisy signal was determined either by using Vetter's MDL estimator or by a 2-means clustering of the eigenvectors of the covariance matrix (Liavas and Regalia, 2001). The latter gave a good estimate of the number of signal components if the noise variances are not clustered well enough together but are nevertheless separated from the signal by a large gap.

To reconstruct the noise-reduced signal, we unclustered the data to get a signal s̃e : {1, . . . , n} → R^m and then averaged over the candidates in the delayed data:

    se(t) := (1/m) Σ_{i=0}^{m−1} [s̃e(t − i)]_i .   (1.16)

This idea is illustrated in figure 1.16.
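A compact sketch of this pipeline (delay embedding, clustering, local projection, un-embedding by averaging) is given below. It uses PCA with a fixed signal-subspace dimension and a fixed number of clusters, whereas the thesis estimates these quantities via MDL or eigenvalue clustering and may use ICA for the local projection; function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def local_projective_denoise(x, m=20, k=8, q=2):
    """Sketch of delay-embedding denoising.

    x: 1d noisy signal; m: embedding dimension; k: number of clusters;
    q: assumed signal-subspace dimension (fixed here for simplicity).
    """
    T = len(x)
    n = T - m + 1
    # delay embedding: rows are (x(t), ..., x(t+m-1))
    X = np.stack([x[t:t + m] for t in range(n)])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    Xd = np.empty_like(X)
    for c in range(k):
        idx = labels == c
        pca = PCA(n_components=q).fit(X[idx])
        # project each cluster onto its local low-dimensional signal subspace
        Xd[idx] = pca.inverse_transform(pca.transform(X[idx]))
    # undo the embedding by averaging all reconstructions of each sample
    y = np.zeros(T)
    cnt = np.zeros(T)
    for t in range(n):
        y[t:t + m] += Xd[t]
        cnt[t:t + m] += 1
    return y / cnt
```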



(a) embedded time series   (b) locally linear approximation   (c) local projection

Figure 1.16: Denoising by local projective subspace projections. The time series is embedded in time-delayed coordinates (a), where the signal subspace is indicated by a solid line. Clustering in the feature space allows for a locally linear approximation (b). A local projection onto the signal subspace by ICA is performed in (c). Delay and signal subspace dimensions and the number of clusters are estimated using an MDL criterion.

Moreover, in Gruber et al. (2006), we compared the above algorithm with a denoising method based on generalized eigenvalue decomposition called delayed AMUSE (Tomé et al., 2005), and with established kernel PCA denoising (Mika et al., 1999, Schölkopf et al., 1998), where solving the inverse problem for recovering the data turned out to be non-trivial. Finally, we showed applications to water-artefact removal of proton NMR spectra, which are an indispensable contribution to this structure determination process but are hampered by the presence of the very intense water proton signal (Stadlthanner et al., 2003, 2006b).

1.5.2 Dimension reduction

An important open problem in signal processing is the task of efficient dimension reduction, that is, the search for meaningful signals within a higher-dimensional data set. Classical techniques such as principal component analysis hereby define 'meaningful' using second-order statistics (maximal variance), which may often be inadequate for signal detection, e.g. in the presence of strong noise. This contrasts with higher-order models including projection pursuit (Friedman and Tukey, 1975, Hyvärinen and Oja, 1997, Kruskal, 1969) or non-Gaussian component analysis, for short NGCA (Blanchard et al., 2006, Kawanabe, 2005, Kawanabe and Theis, 2007). While the former classically extracts a single non-Gaussian independent component from the data set, the latter tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made.

The goal of linear dimension reduction can be defined as the search for a projection W ∈ Mat(n × d) of a d-dimensional data set X, here modeled by a random vector, with n < d and WX still bearing as much information of X as possible. Of course the latter condition has



to be specified in detail in terms of some distance, index or source model, and many different such indices have been studied in the setting of projection pursuit (Friedman, 1987, Huber, 1985, Hyvärinen, 1999), among others. This problem describes a special case of the larger field of model selection (Friedman and Tukey, 1975), an important tool for preprocessing and dimension reduction, used in a wide range of applications.

In Theis and Kawanabe (2006), see chapter 16, we studied non-Gaussian component analysis as proposed by Blanchard et al. (2006). The idea was to follow the classical projection pursuit idea and to choose non-Gaussianity as the measure of information content of the projection. The remainder X − W⊤WX after the projection was required to be Gaussian and independent of WX. For the theoretical analysis, we did not need further restrictions, for instance by specifying an estimator, which would of course be necessary for algorithmic purposes. In that respect, we provided a uniqueness result for projection pursuit in general. Our goal was to describe necessary and sufficient conditions for such projections to exist and to be unique.

Consider for example the three-dimensional random vector X = (X1, X2, X3) with X1 non-Gaussian, say uniform, but X2 and X3 Gaussian, such that X is mutually independent. Then if we were looking for two-dimensional projections of X, we would obviously find multiple, different projections such as

    ( 1 0 0 )                ( 1 0 0 )
    ( 0 1 0 ),   but also    ( 0 0 1 ).

In both cases the remainder of the projection (X3 or X2) is Gaussian and independent of the projected vectors, but the projection still contains a Gaussian component. If we are to look for one-dimensional projections of X instead, only the projection onto the first coordinate yields a Gaussian remainder, as desired. So in this example, uniqueness follows if the Gaussian subspace is of maximal dimension, or correspondingly the non-Gaussian subspace of minimal dimension. And precisely this is the sufficient and necessary condition for uniqueness.

Uniqueness of non-Gaussian component analysis

A factorization X = AS as in (1.1) with A ∈ Gl(d), random vector S = (SN, SG) and SN ∈ L2(Ω, R^n) is called an n-dimensional non-Gaussian component analysis of X if SN and SG are stochastically independent and SG is Gaussian. In the corresponding decomposition A = (AN, AG), the n-dimensional subvectorspace spanned by the columns of AN is called the non-Gaussian subspace, and the subspace spanned by AG the Gaussian subspace of the decomposition.

The basic idea of NGCA versus simple principal component analysis (PCA) is illustrated in figure 1.17. Dimension reduction essentially deals with the question of removing a noise subspace. Classically, a signal is differentiated from noise by having a higher variance, and algorithms such as PCA in the linear case remove the low-variance components, see figure 1.17(a). Second-order techniques however fail to capture signals that are deteriorated by noise of similar or stronger power, so higher-order statistics are necessary to remove the noise, see figure 1.17(b).

The following theorem connects uniqueness of the dimension reduction model with minimality, and gives a simple characterization for it.



(a) PCA dimension reduction model   (b) non-Gaussian subspace analysis model with directional histograms illustrating (non-)Gaussianity

Figure 1.17: Illustration of dimension reduction by NGCA versus PCA. The signal subspace in (a) is given by a linear direction of high variance, whereas in (b), noise is defined by Gaussianity. The signal subspace is correspondingly given by directions of non-Gaussianity, which allows for the extraction of low-power signals.

Theorem 1.5.1 (Uniqueness of NGCA). Let n < d. Given an n-dimensional NGCA ANSN + AGSG of the random vector X ∈ L2(Ω, R^d), the following is equivalent:

(i) The decomposition is minimal, i.e. n is minimal.

(ii) There exists no basis M ∈ Gl(n) such that (MSN)(1) is Gaussian and independent of (MSN)(2 : n).

(iii) The subspaces of the decomposition are unique, i.e. another n-decomposition has the same non-Gaussian and Gaussian subspaces.

Condition (ii) means that there exists no Gaussian independent component in the non-Gaussian part of the decomposition. The theorem proves that this is equivalent to the decomposition being minimal. Note that in (ii), it is not enough to require only that there exists no Gaussian component, i.e. a v ∈ R^n such that v⊤SN is Gaussian. A simple counterexample is given by a two-dimensional random vector S with density c exp(−s₁² − (s₁² + s₂)²) with c being a normalizing constant. Then indeed S(1) = S1 is Gaussian because ∫_R c exp(−s₁² − (s₁² + s₂)²) ds₂ = c′ exp(−s₁²), but clearly no m ∈ R² can be chosen such that S(1) and m⊤S are independent. And indeed, this dependent Gaussian S(1) within S should not be removed by dimension reduction, as it may contain interesting information, not being independent of the other components.

The proof of the theorem was sketched in Theis and Kawanabe (2006), see chapter 16, where we also performed some simulations to validate the uniqueness result. A practical algorithm



for NGCA essentially using the idea of separated characteristic functions from the proof was proposed in (Kawanabe and Theis, 2007).

Finally, in (Theis and Kawanabe, 2007), we presented a modification of NGCA that evaluates the time structure of the multivariate observations instead of their higher-order statistics. We differentiated the signal subspace from noise by searching for a subspace of non-trivially autocorrelated data. In contrast to blind source separation approaches, however, we did not require the existence of sources, so the model is applicable to any wide-sense stationary time series without restrictions. Moreover, since the method is based on second-order time structure, it could be efficiently implemented even for large dimensions, which we illustrated with an application to dimension reduction of functional MRI recordings.
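To give a rough idea of a second-order, time-structure-based subspace search (a simplified sketch only, not the algorithm of Theis and Kawanabe (2007)): whiten the data, form a symmetrized lagged covariance, and keep the directions whose autocorrelations differ most from zero. The single lag, the fixed subspace dimension and the full-rank assumption are simplifications made for the illustration.

```python
import numpy as np

def autocorrelation_subspace(X, tau=1, n_signal=2):
    """Sketch: estimate a subspace of non-trivially autocorrelated directions.

    X: array (d, T) of observations, assumed to have a full-rank covariance.
    Returns a projection matrix P of shape (n_signal, d).
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    # symmetric whitening via the eigen-decomposition of the covariance
    C0 = Xc @ Xc.T / Xc.shape[1]
    d, E = np.linalg.eigh(C0)
    W = E @ np.diag(d ** -0.5) @ E.T
    Z = W @ Xc
    # symmetrized lag-tau covariance of the whitened data
    Ct = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)
    Ct = (Ct + Ct.T) / 2
    lam, V = np.linalg.eigh(Ct)
    order = np.argsort(-np.abs(lam))           # large |autocorrelation| = signal
    P = V[:, order[:n_signal]].T @ W           # projection onto the signal subspace
    return P
```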

1.5.3 Clustering

Clustering methods are an important tool in high-dimensional explorative data mining. They aim at identifying samples or regions of similar characteristics, and often code them by a single codebook vector or centroid. In this section, we review clustering algorithms and employ these methods to solve the blind matrix factorization problem (1.12) from above under various source assumptions.

Clustering for solving overcomplete BSS problems

In Theis et al. (2006), see chapter 17, we discussed the blind source separation problem (1.1) in the difficult case of overcomplete BSS, where fewer mixtures than sources are observed (m < n). We focused on the usually more elaborate matrix-recovery part. Assuming statistically independent sources with existing variance and at most one Gaussian component, it is well known that A is determined uniquely by the mixtures x(t) (Eriksson and Koivunen, 2003). However, how to do this algorithmically is far from obvious, and although some algorithms have been proposed recently (Bofill and Zibulevsky, 2001, Lee et al., 1999, O'Grady and Pearlmutter, 2004), performance is yet limited.

The most commonly used overcomplete algorithms rely on sparse sources (after possible sparsification by preprocessing), which can be identified by clustering, usually by k-means or some extension (Bofill and Zibulevsky, 2001, O'Grady and Pearlmutter, 2004). However, apart from the fact that theoretical justifications have not been found, mean-based clustering only identifies the correct A if the data density approaches a delta distribution. In figure 1.18, we illustrate the deficiency of mean-based clustering; we get an error of up to 5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover, the figure indicates that median-based clustering performs much better. Indeed, mean-based clustering does not possess any equivariance property (i.e. performance independent of the choice of A).

We proposed a novel overcomplete, median-based clustering method in (Theis et al., 2006), and proved its equivariance and convergence. Simply put, we first pick 2n normalized starting vectors w1, w′1, . . . , wn, w′n, and iterate the following steps until an appropriate abort condition has been met:
Choose a sample x(t) ∈ R^m and normalize it, y(t) := π(x(t)) = x(t)/|x(t)|. Let i(t) ∈ [1 : n] be such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to x(t). Then set w_{i(t)}(t + 1) := π(w_{i(t)}(t) + η(t) π(σ y(t) − w_{i(t)}(t))) and w′_{i(t)}(t + 1) := −w_{i(t)}(t + 1), where σ := 1 if w_{i(t)}(t) is closest to y(t), and σ := −1 otherwise.

(a) circle histogram for α = 0.4   (b) comparison of mean and median

Figure 1.18: Mean- versus median-based clustering. We consider the mixture x(t) of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2 − α), respectively. (a) shows the mixture density for α = 0.4 after projection onto the circle. For α ∈ [0, π/4), (b) compares the error when estimating A by the mean and the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.

We showed that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median. For this we had to study the end points of geometric matrix-recovery, so we assumed that the algorithm had already converged. The idea then was to formulate a condition which the end points have to satisfy and to show that the solutions are among them.

The mixing angles γ1, . . . , γn ∈ [0, π) are said to satisfy the overcomplete geometric convergence condition (GCC) if they are the medians of y restricted to their receptive fields, i.e. if γi is the median of p_{y|F(γi)}. Moreover, a constant random vector ω̂ ∈ R^n is called a fixed point if E(ζ(y_e − ω̂)) = 0. We showed that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1.18. If we set ξ(ω) := (cos ω, sin ω)⊤, then θ(ξ(ω)) = ω and the following holds:

Theorem 1.5.2 (Convergence of overcomplete median-based clustering). The set Φ of fixed points of geometric matrix-recovery contains an element (ω̂1, . . . , ω̂n) such that the resulting matrix (ξ(ω̂1) . . . ξ(ω̂n)) solves the overcomplete BSS problem.

The stable fixed points in the above set Φ can be found by the geometric matrix-recovery algorithm.
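A minimal sketch of the online update rule described above is given below; the initialization, learning-rate schedule and stopping rule are simplified assumptions and do not reproduce the exact setup of Theis et al. (2006).

```python
import numpy as np

def geometric_matrix_recovery(X, n, eta0=0.1, sweeps=10, rng=None):
    """Sketch of online, median-like geometric matrix recovery.

    X: (m, T) mixtures; n: number of sources. Returns an (m, n) matrix whose
    columns estimate the mixing directions up to sign and permutation.
    """
    rng = np.random.default_rng(rng)
    m, T = X.shape
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # w_1, ..., w_n; w'_i := -w_i
    step = 0
    for _ in range(sweeps):
        for t in rng.permutation(T):
            x = X[:, t]
            if np.linalg.norm(x) == 0:
                continue
            y = x / np.linalg.norm(x)
            scores = W @ y
            i = np.argmax(np.abs(scores))           # closest neuron among w_i and -w_i
            sigma = 1.0 if scores[i] >= 0 else -1.0
            eta = eta0 / (1 + 0.01 * step)          # simple decaying learning rate
            step += 1
            direction = sigma * y - W[i]
            w = W[i] + eta * direction / np.linalg.norm(direction)
            W[i] = w / np.linalg.norm(w)            # re-normalize (the outer projection)
    return W.T
```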



(a) setup   (b) division   (c) update   (d) after one iteration

Figure 1.19: Illustration of the batch k-means algorithm.

One of the most commonly used partitional clustering techniques is the k-means algorithm, which in its batch form partitions the data set into k disjoint clusters by simply iterating between cluster assignments and cluster updates (Bishop, 1995). In general, its goal can be described as follows: Given a set A of points in some metric space (M, d), find a partition of A into disjoint nonempty subsets Bi, ⋃i Bi = A, together with centroids ci ∈ M, so as to minimize the sum of the squares of the distances of each point of A to the centroid ci of the cluster Bi containing it.

A common approach to minimizing such energy functions is partial optimization with respect to the division matrix and the centroids. The batch k-means algorithm employs precisely this strategy, see figure 1.19. After an initial, random choice of centroids c1, . . . , ck, it iterates between the following two steps until convergence, measured by a suitable stopping criterion:

• cluster assignment: for each sample x(t) determine an index i(t) = argmin_i d(x(t), ci)

• cluster update: within each cluster Bi := {x(t) | i(t) = i} determine the centroid ci by

    ci := argmin_c Σ_{a ∈ Bi} d(a, c)²   (1.17)

A sketch of this batch iteration in the Euclidean case follows below.
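The following minimal sketch spells out the batch iteration for the Euclidean case (M = R^d with the Euclidean metric), where the centroid update (1.17) is simply the cluster mean; the initialization and iteration count are simple illustrative choices.

```python
import numpy as np

def batch_kmeans(X, k, iters=50, rng=None):
    """Minimal batch k-means for samples X of shape (T, d)."""
    rng = np.random.default_rng(rng)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # cluster assignment: nearest centroid for every sample
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # cluster update: Euclidean centroid = mean of the assigned samples
        newC = np.array([X[labels == i].mean(axis=0) if (labels == i).any() else C[i]
                         for i in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels
```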

Solving (1.17) is straightforward in the Euclidean case, however nontrivial in other metric spaces. In Gruber and Theis (2006), see chapter 18, we generalized the concept of k-means by applying it not to the standard Euclidean space but to the manifold of subvectorspaces of R^n of a fixed dimension p, also known as the Grassmann manifold Gn,p. Important examples include projective space, i.e. the manifold of lines, and the space of all hyperplanes.

We represented an element of Gn,p by p orthonormal vectors (v1, . . . , vp). Concatenating these into an (n × p)-matrix V, this matrix is unique except for right multiplication by an orthogonal matrix. We therefore wrote [V] ∈ Gn,p for the subspace. This allowed us to define a distance d([V], [W]) := 2^{−1/2} ‖VV⊤ − WW⊤‖_F on the Grassmannian, known as the projection F-norm.
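As a small illustration of this representation (not code from the cited paper), the projection F-norm can be computed directly from orthonormal bases; it is invariant under right multiplication by an orthogonal matrix and hence well defined on Gn,p.

```python
import numpy as np

def grassmann_distance(V, W):
    """Projection F-norm distance between the subspaces [V] and [W].

    V, W: (n, p) matrices with orthonormal columns representing points of G_{n,p}.
    """
    return np.linalg.norm(V @ V.T - W @ W.T, 'fro') / np.sqrt(2)

# the distance only depends on the subspaces, not on the chosen bases
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((5, 2)))
W, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # orthogonal basis change
assert np.isclose(grassmann_distance(V, W), grassmann_distance(V @ Q, W))
```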



Other metrics may be defined on Gn,p, and they result in different Riemannian geometries on the manifold. Optimization on non-Euclidean geometries is non-trivial and has been studied for a long time, see for example Edelman et al. (1999) and references therein. For instance, in the context of ICA, Amari's seminal paper (Amari, 1998) on taking into account the geometry of the search space Gl(n) yielded a considerable increase in performance and accuracy. Learning in these matrix manifolds has been reviewed in Theis (2005b) and extended in Squartini and Theis (2006).

In order to apply batch k-means to (Gn,p, d), we only had to solve the cluster update equation (1.17). It turned out that for this, no elaborate optimization was necessary; instead, a closed-form solution that only needs an eigenvalue decomposition could be found. We state this with the following theorem, proved in Gruber and Theis (2006):

Theorem 1.5.3 (Grassmann centroids). The centroid [C] ∈ Gn,p of a set of points [V1], . . . , [Vl] ∈ Gn,p according to (1.17) is spanned by p independent eigenvectors corresponding to the smallest eigenvalues of the generalized cluster correlation l^{−1} Σ_{i=1}^{l} ViVi⊤.

Application to nonnegative matrix factorization

Detecting clusters in multiple samples drawn from a Grassmannian is a problem arising in various applications. In Gruber and Theis (2006), we applied this to NMF in order to illustrate the feasibility of the proposed algorithm.

Consider the matrix factorization problem (1.12) with the additional non-negativity constraints S, A ≥ 0. If we assume that S spans the whole first quadrant, then the data X = AS fill a conic hull with cone edges spanned by the columns of A. After projection onto the standard simplex, the conic hull reduces to the convex hull, and the projected, known mixture data set X lies within a convex polytope of the order given by the number of rows of S. Hence we face the problem of identifying n edges of a sampled polytope in R^{m−1}.

In two dimensions (after reduction of m = 3), this implies the task of finding the k edges of a polytope where only samples in the inside are known. We used the Quickhull algorithm (Barber et al., 1993) to construct the convex hull, thus identifying the possible edges of the polytope. However, due to the finite number of samples, the identified polytope has far too many edges. Therefore, we applied affine Grassmann n-means clustering (with samples weighted according to their volume) to these edges in order to identify the n bounding edges, see the example in figure 1.20.
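A small sketch of the geometric preprocessing for the m = 3 case is shown below: project the nonnegative mixtures onto the standard simplex and extract the (over-fine) convex hull with Quickhull via scipy. The subsequent affine Grassmann clustering of the hull edges, which is the step that actually identifies the n bounding edges, is not reproduced here; the function name and the simplex parametrization are illustrative choices.

```python
import numpy as np
from scipy.spatial import ConvexHull

def polytope_vertices_from_mixtures(X):
    """Sketch for m = 3 nonnegative mixtures X = AS of shape (3, T).

    Returns the ordered vertices of the convex hull of the simplex-projected data;
    clustering the edges of this hull would then estimate A up to column scaling.
    """
    P = (X / X.sum(axis=0))[:2].T      # simplex coordinates: first two of x / sum(x)
    hull = ConvexHull(P)               # Quickhull
    return P[hull.vertices]
```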

Biomedical applications of other matrix factorization methods are discussed in the next chapter. We only briefly want to mention Meyer-Bäse et al. (2005), where we applied NMF and related unsupervised clustering techniques for the self-organized segmentation of biomedical image time-series data, describing groups of pixels exhibiting similar properties of local signal dynamics.



(a) samples   (b) QHull   (c) result of Grassmann clustering

Figure 1.20: Grassmann clustering (hyperplanes, so p = n − 1) identifies the contour of samples (a) following the NMF model. Quickhull was used to find the outer edges, which were then clustered into 4 clusters (b). The dashed lines in (c) show the convex hull spanned by the mixing matrix columns.



1.6 Applications to biomedical data analysis

The above models are known to have many applications in data mining. Here we focus on biomedical data analysis. For this, we review some recent applications to functional MRI, microscopy of labeled brain sections and surface electromyograms.

1.6.1 Functional MRI

Functional magnetic resonance imaging (fMRI) has been shown to be an effective imaging technique in human brain research (Ogawa et al., 1990). Through the blood oxygen level dependent (BOLD) contrast, local changes in the magnetic field are coupled to activity in brain areas. These magnetic changes are measured using MRI. The high spatial and temporal resolution of fMRI combined with its non-invasive nature makes it an important tool for discovering functional areas in the human brain and their interactions. However, its low signal-to-noise ratio and the high number of activities in the passive brain require sophisticated analysis methods. These are either (i) based on models and regression, but require prior knowledge of the time course of the activations, or (ii) employ model-free approaches such as BSS, separating the recorded activation into different classes according to statistical specifications without prior knowledge of the activation.

The blind approach (ii) was first studied by McKeown et al. (1998): According to the principle of functional organization of the brain, they suggested that the multifocal brain areas activated by performance of a visual task should be unrelated to the brain areas whose signals are affected by artifacts of physiological nature, head movements, or scanner noise related to fMRI experiments. Every single process can be described by one or more spatially independent components, each associated with a single time course of a voxel and a component map. It is assumed that the component maps, each described by a spatial distribution of fixed values, represent overlapping, multifocal brain areas of statistically independent fMRI signals. This aspect is visualized in figure 1.21.

In addition, they considered that the distributions of the component maps are spatially independent and in this sense uniquely specified, see section 1.2.1. McKeown et al. (1998) showed that these maps are independent if the active voxels in the maps are sparse and mostly nonoverlapping. Additionally, they assumed that the observed fMRI signals are the superposition of the individual component processes at each voxel. Based on these assumptions, ICA can be applied to fMRI time-series to spatially localize and temporally characterize the sources of BOLD activation, and considerable research has been devoted to this area since then.

Model-based versus model-free analysis

However, the use of blind signal processing techniques for the effective analysis of fMRI data has often been questioned, and in many applications, neurologists and psychologists prefer to use the computationally simpler regression models.

In Keck et al. (2004), see chapter 19, we compared the two approaches on a sufficiently complex task of combined word perception and motor activity. The event-based experiment was part of a study to investigate the network of neurons involved in the perception of speech and the decoding of auditory speech stimuli.



Figure 1.21: Visualization of the spatial fMRI separation model. The n-dimensional source vector is represented as component maps, which in turn are interpreted to contribute linearly in different concentrations to the fMRI observations at the various time points t ∈ {1, . . . , m}.

One- and two-syllable words were divided into several frequency bands and then rearranged randomly to obtain a set of auditory stimuli. Only a single band was perceivable as words. During the functional imaging session these stimuli were presented pseudo-randomized to 5 subjects, according to the rules of a stochastic event-related paradigm. The task of the subjects was to press a button as soon as they were sure that they had just recognized a word in the sound presented. It was expected that in case of the single perceptive frequency band, these four types of stimuli activate different areas of the auditory system as well as the superior temporal sulcus in the left hemisphere (Specht and Reul, 2003).

The regression-based analysis using a general linear model was performed using SPM2. This was compared with components extracted using ICA, namely fastICA (Hyvärinen and Oja, 1997). The results are illustrated in figure 1.22, see Keck et al. (2004). Indeed, one independent component represented a network of three simultaneously active areas in the inferior frontal gyrus, which was previously proposed to be a center for the perception of speech (Specht and Reul, 2003). Altogether, we were able to show that ICA detects hidden or suspected links and activity in the brain that cannot be found using the classical, model-based approach.

(a) general linear model analysis   (b) one independent component

Figure 1.22: Comparison of model-based and model-free analysis of a word-perception fMRI experiment. (a) illustrates the result of a regression-based analysis, which shows activity mostly in the auditory cortex. (b) is a single component extracted by ICA, which corresponds to a word-detection network.

Spatial and spatiotemporal separation

As a short example of spatial and spatiotemporal BSS, we present the analysis of an experiment using visual stimuli. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans each were acquired with 5 periods of rest and 5 photic stimulation periods,

with a resolution of 3 × 3 × 4 mm. A single 2d-slice is analyzed, which is oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point and a dark background.

At first, we show an example result using spatial ICA. We performed a dimension reduction using PCA to n = 6 dimensions, which still contained 99.77% of the eigenvalues. Then, we applied HessianICA with K = 100 Hessians evaluated at randomly chosen samples, see section 1.2.1 and Theis (2004a). The resulting 6-dimensional sources are interpreted as the 6 component maps that encode the data set. The columns of the mixing matrix contain the relative contribution of each component map to the mixtures at the given time point, so they represent the components' time courses. The maps together with the corresponding time courses are shown in figure 1.23. A single highly task-related component (#4) is found, which after a shift of 4 s has a high crosscorrelation with the block-based stimulus (cc = 0.89). Other component maps encode artifacts, e.g. in the interstitial brain region, and other background activity.
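A schematic version of such a spatial ICA analysis is sketched below, with FastICA used as a stand-in for HessianICA and the slice sequence assumed to be available as a (time, x, y) array; it illustrates the spatial decomposition model only and is not the analysis code used in the thesis. Dimension reduction to n components happens inside FastICA's whitening step.

```python
import numpy as np
from sklearn.decomposition import FastICA

def spatial_ica_fmri(data, n_components=6):
    """Sketch of spatial ICA of an fMRI slice sequence.

    data: array (T, nx, ny), one slice per time point.
    Returns component maps (n_components, nx, ny) and time courses (T, n_components).
    """
    T, nx, ny = data.shape
    X = data.reshape(T, nx * ny)              # rows = time points, columns = voxels
    X = X - X.mean(axis=0)
    ica = FastICA(n_components=n_components, random_state=0)
    # spatial ICA: the voxels are treated as the samples, time points as features
    maps = ica.fit_transform(X.T).T           # (n_components, voxels)
    time_courses = ica.mixing_                # (T, n_components) relative contributions
    return maps.reshape(n_components, nx, ny), time_courses
```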

We then tested the usefulness of taking into account additional information contained in the data set, such as the spatiotemporal dependencies. For this, we analyzed the data using spatiotemporal BSS as described in section 1.3.2 and Theis et al. (2005b), Theis et al. (2007b), see chapter 9. In order to make things more challenging, only 4 components were to be extracted from the data, with preprocessing either by PCA only or by the slightly more general singular value decomposition, a necessary preprocessing for spatiotemporal BSS. We based the algorithms on joint diagonalization, for which K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation, weighted equally (α = 0.5). Although the data were reduced to only 4 components, stSOBI was able to extract the stimulus component very well, with an equally high crosscorrelation of cc = 0.89. We compared this result with some established algorithms for blind fMRI analysis by discussing the single component that is maximally autocorrelated with the known stimulus task, see figure 1.24.



(a) recovered component maps   (b) time courses, with stimulus crosscorrelations cc = −0.14, −0.13, −0.22, 0.89, 0.12, 0.09 for components 1 to 6

Figure 1.23: Extracted ICA components of fMRI recordings. (a) shows the spatial and (b) the corresponding temporal activation patterns, where in (b) the grey bars indicate stimulus activity. Component 4 contains the (independent) visual task, active in the visual cortex (white points in (a)). It correlates well with the stimulus activity, see (b), component 4.

The absolute corresponding autocorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA). The observation that neither Stone's spatiotemporal ICA algorithm (Stone et al., 2002) nor the popular fastICA algorithm (Hyvärinen and Oja, 1997) could recover the sources showed that spatiotemporal models can use the additional data structure efficiently, in contrast to spatial-only models, and that the parameter-free joint-diagonalization-based algorithms are robust against convergence issues.

Other analysis models<br />

Before cont<strong>in</strong>u<strong>in</strong>g to other biomedical applications, we shortly want to review other recent work<br />

of the author <strong>in</strong> this field.<br />

In Karvanen and Theis (2004), we proposed the concept of w<strong>in</strong>dow ICA for the analysis of<br />

fMRI data. The basic idea was to apply, spatial ICA <strong>in</strong> slid<strong>in</strong>g time w<strong>in</strong>dows; this approach<br />

avoided the problems related to the high number of signals and the result<strong>in</strong>g issues with dimension<br />

reduction methods, and moreover gave some <strong>in</strong>sight <strong>in</strong>to small changes happen<strong>in</strong>g dur<strong>in</strong>g<br />

the experiment, which are otherwise not encoded <strong>in</strong> changes <strong>in</strong> the component maps. We demonstrated<br />

the usefulness of the proposed approach <strong>in</strong> an experiment where a subject listened to<br />

auditory stimuli consist<strong>in</strong>g of s<strong>in</strong>usoidal sounds (beeps) and words <strong>in</strong> vary<strong>in</strong>g proportions. Here,<br />

the w<strong>in</strong>dow ICA algorithm was able to f<strong>in</strong>d different auditory activations patterns related to the<br />

beeps and the words, respectively.
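A minimal sketch of this sliding-window idea, assuming a hypothetical data matrix of shape (scans × voxels) and using scikit-learn's FastICA as a stand-in for the spatial ICA algorithm actually employed (window length and step size are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import FastICA

def window_ica(data, n_components=5, win=40, step=10, seed=0):
    """Spatial ICA in sliding time windows.
    data: (time, voxels) array; returns (n_windows, n_components, voxels) maps."""
    maps = []
    for start in range(0, data.shape[0] - win + 1, step):
        segment = data[start:start + win]                     # (win, voxels)
        ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
        # spatial ICA: voxels are the samples, time points within the window the mixtures
        spatial = ica.fit_transform(segment.T)                # (voxels, n_components)
        maps.append(spatial.T)
    return np.stack(maps)

# e.g. maps = window_ica(fmri_matrix)   # fmri_matrix: hypothetical (scans x voxels) array
```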

An <strong>in</strong>terest<strong>in</strong>g model for activity maps <strong>in</strong> the bra<strong>in</strong> is given by sparse cod<strong>in</strong>g; after all, the<br />

component maps are always implicitly assumed to show only strongly focused regions of activation.<br />

Hence we asked the question whether the sparse models proposed <strong>in</strong> section 1.4.1 could be


applied to fMRI data. We showed a successful application to the above visual-stimulus experiment
in Georgiev et al. (2005a). Again, we were able to show that with only five components, the
stimulus-related activity in the visual cortex could be nicely reconstructed.

Figure 1.24: Comparison of the recovered component that is maximally autocrosscorrelated with
the stimulus task (top) for various BSS algorithms (stNSS, stSOBI (1D), stICA after stSOBI,
stICA and fastICA), after dimension reduction to 4 components.

A similar question of model generalization was posed in Theis and Tanaka (2005). There we
proposed to study the postnonlinear mixing model (1.3) in the context of fMRI data. We derived
an algorithm for blindly estimating the sensor characteristics of such a multi-sensor network.
From the observed sensor outputs, the nonlinearities are recovered using a well-known
Gaussianization procedure. The underlying sources are then reconstructed using spatial
decorrelation as proposed by Ziehe et al. (2003a). Application of this robust algorithm to data
sets acquired by fMRI led to the detection of a distinctive bump of the BOLD effect at larger
activations, which may be interpreted as an inherent BOLD-related nonlinearity.

In Meyer-Bäse et al. (2004a,b), we discussed the concept of dependent component analysis,
see section 1.3, in the context of fMRI data analysis. We detected dependencies by finding
clusters of dependent components; algorithmically, we compared two of the first such algorithms,
namely tree-dependent ICA (Bach and Jordan, 2003a) and topographic ICA (Hyvärinen et al.,
2001a). For the fMRI data, a comparative quantitative evaluation between the two methods,
tree-dependent and topographic ICA, was performed. We observed that topographic ICA
outperforms other ordinary ICA methods and tree-dependent ICA when extracting only few
independent components. This resulted in a postprocessing algorithm based on clustering of ICA
components resulting from different source component dimensions in Keck et al. (2005).

The above algorithms have been included in our MFBOX (Model-free Toolbox) package


(Gruber et al., 2007), a Matlab toolbox for data-driven analysis of biomedical data, which may
also be used as an SPM plug-in. Its main focus lies on the analysis of functional Magnetic
Resonance Imaging (fMRI) data sets with various model-free or data-driven techniques. The
toolbox includes BSS algorithms based on various source models including ICA, spatiotemporal
ICA, autodecorrelation and NMF. They can all be easily combined with higher-level analysis
methods such as reliability analysis using projective clustering of the components, sliding time
window analysis or hierarchical decomposition.

Figure 1.25: Directional neural networks: as training data set, a few labeled A's from a test
image have been used (a). (b) then shows five rotated A's together with their image patches
normalized using the main principal component direction below. This small-scale classifier
reproduces the A-locations in a test image sufficiently well (c), even though the training data
set was small and different fonts, noise etc. had been added.

1.6.2 Image segmentation and cell count<strong>in</strong>g<br />

A supervised interpretation of the initial data analysis model from section 1.1 leads to a
classification problem: given a set of input-output samples, find a map that interpolates these
samples, hopefully generalizing well to new input samples. Such a map thus serves as a classifier
if the

output consists of discrete labels. Classification based on support vector mach<strong>in</strong>es (Boser et al.,<br />

1992, Burges, 1998, Schölkopf and Smola, 2002) or neural networks (Hayk<strong>in</strong>, 1994) has prom<strong>in</strong>ent<br />

applications <strong>in</strong> biomedical data analysis. Here we review an application to biomedical<br />

image process<strong>in</strong>g from Theis et al. (2004c), see chapter 21.<br />

While many different tissues of the mammalian organism are capable of renewing themselves
after damage, it was long believed that the nervous system is not able to regenerate at all.
Nevertheless, the first data showing that new nerve cells can be generated in the adult brain
were presented in the 1960s (Altman and Das, 1965), demonstrating new neurons in the brain of
adult rats. In order to quantify neurogenesis in animals, newborn cells are labeled with specific
markers such as BrdU; in brain sections these can later be analyzed and counted through the use
of a confocal microscope. So far, however, this counting process had been performed manually.

In Theis et al. (2004b,c), we proposed an algorithm called ZANE to automatically identify cell



components <strong>in</strong> digitized section images. First, a so-called cell classifier was tra<strong>in</strong>ed with cell and<br />

non-cell patches us<strong>in</strong>g s<strong>in</strong>gle- and multi-layer perceptrons as well as unsupervised <strong>in</strong>dependent<br />

component analysis with correlation comparison. In order to account for a larger variety of cell
shapes, a directional normalization approach was proposed. The cell classifier can then be applied
to an arbitrary number of sections by scanning each section and choosing maxima of this classifier

as cell center locations. This is illustrated us<strong>in</strong>g a toy example <strong>in</strong> figure 1.25. A flow-chart with<br />

the basic segmentation setup is shown <strong>in</strong> figure 1.26.<br />
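The scanning step can be sketched as follows (a simplified illustration, not the ZANE code itself): a trained patch scorer is evaluated at every pixel position, and local maxima of the resulting score map above a threshold are taken as candidate cell centres. Here `score_patch` stands for any trained classifier output, and the directional normalization is approximated by rotating each patch along its main principal axis.

```python
import numpy as np
from scipy.ndimage import maximum_filter, rotate

def normalize_direction(patch):
    """Rotate a patch so that the main principal axis of its bright pixels
    points in a fixed direction (crude directional normalization)."""
    ys, xs = np.nonzero(patch > patch.mean())
    if ys.size < 2:
        return patch
    pts = np.stack([ys - ys.mean(), xs - xs.mean()])
    _, vecs = np.linalg.eigh(np.cov(pts))
    angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
    return rotate(patch, -angle, reshape=False)

def detect_cells(image, score_patch, patch=15, threshold=0.5):
    """Scan `image` with a trained patch scorer; return candidate cell centres."""
    h, w = image.shape
    half = patch // 2
    scores = np.zeros((h, w))
    for i in range(half, h - half):
        for j in range(half, w - half):
            window = image[i - half:i + half + 1, j - half:j + half + 1]
            scores[i, j] = score_patch(normalize_direction(window))  # e.g. perceptron output
    peaks = (scores == maximum_filter(scores, size=patch)) & (scores > threshold)
    return np.argwhere(peaks), scores
```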

ZANE was successfully applied to measure neurogenesis <strong>in</strong> adult rodent bra<strong>in</strong> sections, where<br />

we showed that the proliferation of neurons is substantially stronger (340%) <strong>in</strong> the dentate gyrus<br />

of an epileptic mouse than <strong>in</strong> a control group. When compar<strong>in</strong>g the count<strong>in</strong>g result with manual<br />

counts, the mean ZANE classification rate is 90% of all (manually detected) cells; this was with<strong>in</strong><br />

the error bounds of a perfect count, s<strong>in</strong>ce manual counts by different experts varied by roughly<br />

10% themselves (Theis et al., 2004b).<br />

1.6.3 Surface electromyograms<br />

In sections 1.2 and 1.4, we presented bl<strong>in</strong>d<br />

data factorization models based on statistical<br />

<strong>in</strong>dependence, explicit sparseness and nonnegativity.<br />

It is known that all three approaches<br />

tend to <strong>in</strong>duce a more mean<strong>in</strong>gful,<br />

often more sparse representation of the multivariate<br />

data set. However, develop<strong>in</strong>g explicit<br />

applications and more so perform<strong>in</strong>g mean<strong>in</strong>gful<br />

comparisons of such methods is still of considerable<br />

<strong>in</strong>terest.<br />

In Theis and García (2006), see chapter<br />

20, we analyzed and compared the above<br />

models, not from a theoretical point of
view but rather based on a real-world example,

namely the analysis of surface electromyogram<br />

(sEMG) data sets. An electromyogram<br />

(EMG) denotes the electric signal generated<br />

by a contract<strong>in</strong>g muscle (Basmajian and Luca,<br />

1985). In general, EMG measurements make<br />

use of <strong>in</strong>vasive, pa<strong>in</strong>ful needle electrodes. An<br />

alternative is to use sEMG, which is measured<br />

us<strong>in</strong>g non-<strong>in</strong>vasive, pa<strong>in</strong>less surface electrodes.<br />

However, in this case the signals are rather more difficult to interpret due to noise and the
overlap of several source signals. Direct application of the ICA model to real-world noisy sEMG
turns out to be problematic (García et al., 2004), and it is yet unknown if the assumption of
independent sources holds well in the setting of sEMG.

Figure 1.26: ZANE image segmentation.


Figure 1.27: Recovered sources after unmixing the sEMG data; (a–f) show the results obtained
using the different methods (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF* and (f) SCA,
and (g) shows the original most-active sensor signal (component amplitude [a.u.] versus
time [ms]).

Our approach was therefore to apply and validate sparse BSS methods based on various<br />

model assumptions to sEMG signals. When applied to artificial signals we found noticeable<br />

differences in algorithm performance depending on the source assumptions. In particular, sparse
nonnegative matrix factorization outperforms the other methods with respect to increasing

additive noise. However, <strong>in</strong> the case of real sEMG signals we showed that despite the fundamental<br />

differences <strong>in</strong> the various models, the methods yield rather similar results and can successfully<br />

separate the source signal, see figure 1.27. This was due to the fact that the different sparseness<br />

assumptions are only approximately fulfilled thus apparently forc<strong>in</strong>g the algorithms to reach<br />

similar results, but from different <strong>in</strong>itial conditions us<strong>in</strong>g different optimization criteria.<br />
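The comparison can be set up along the following lines; this is only a toy sketch with scikit-learn stand-ins (FastICA instead of JADE, an l1-penalized NMF instead of the sNMF/sNMF* variants, and no SCA), applied to synthetic sparse nonnegative bursts rather than real sEMG:

```python
import numpy as np
from sklearn.decomposition import FastICA, NMF

rng = np.random.default_rng(1)
T, n = 2000, 3
# sparse, nonnegative burst-like toy sources (stand-ins for motor unit activity)
S = np.where(rng.random((n, T)) < 0.05, rng.random((n, T)), 0.0)
X = rng.random((8, n)) @ S + 0.01 * rng.random((8, T))   # 8 nonnegative "electrodes"

def best_corr(est, true):
    """Best absolute correlation of each true source with any estimated component."""
    C = np.abs(np.corrcoef(np.vstack([true, est]))[:len(true), len(true):])
    return C.max(axis=1)

S_ica = FastICA(n_components=n, random_state=0, max_iter=1000).fit_transform(X.T).T
S_nmf = NMF(n_components=n, init="nndsvda", max_iter=1000, random_state=0).fit(X).components_
S_snmf = NMF(n_components=n, init="nndsvda", max_iter=1000, random_state=0,
             alpha_H=0.05, l1_ratio=1.0).fit(X).components_  # l1 penalty ~ sparser time courses

for name, est in [("ICA", S_ica), ("NMF", S_nmf), ("sparse NMF", S_snmf)]:
    print(name, np.round(best_corr(est, S), 2))
```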

The result<strong>in</strong>g sparse signal components can now be used for further analysis and for artifact<br />

removal. A similar analysis using spectral correlations has been employed in (Böhm et al.,

2006, Stadlthanner et al., 2006b) to remove the water artifact from multidimensional proton<br />

NMR spectra of biomolecules dissolved <strong>in</strong> aqueous solutions.



1.7 Outlook<br />

We considered the (mostly l<strong>in</strong>ear) factorization problem<br />

x(t) = As(t) + n(t). (1.18)<br />

Often, the noise n(t) was not explicitly modeled but <strong>in</strong>cluded <strong>in</strong> s(t). We assumed that x(t)<br />

is known as well as some additional <strong>in</strong>formation about the system itself. Depend<strong>in</strong>g on the<br />

assumptions, different problems and algorithmic solutions can be derived to solve such <strong>in</strong>verse<br />

problems:<br />

• statistically <strong>in</strong>dependent s(t): we proved that <strong>in</strong> this case ICA could solve (1.18) uniquely,<br />

see section 1.2; this holds even <strong>in</strong> some nonl<strong>in</strong>ear generalizations.<br />

• approximate spatiotemporal <strong>in</strong>dependence <strong>in</strong> A and s(t): we provided a very robust,<br />

simple algorithm for the spatiotemporal separation based on jo<strong>in</strong>t diagonalization, see<br />

section 1.3.2.<br />

• statistical <strong>in</strong>dependence between groups of sources s(t): aga<strong>in</strong>, uniqueness except for transformation<br />

with<strong>in</strong> the blocks can be proven, and the constructive proof results <strong>in</strong> a simple<br />

update algorithm as <strong>in</strong> the l<strong>in</strong>ear case, see section 1.3.3.<br />

• sparseness of the sources s(t): <strong>in</strong> the context of SCA, we relaxed the assumptions of s<strong>in</strong>gle<br />

source sparseness to multi-dimensional sparseness constra<strong>in</strong>ts, for which we were still able<br />

to prove uniqueness and to derive an algorithm, see section 1.4.1.<br />

• non-negativity of A and s(t): a sparse extension of the pla<strong>in</strong> NMF model was analyzed<br />

in terms of existence and uniqueness, and a generalization for lp-sparse sources has been

proposed <strong>in</strong> order to better approximate comb<strong>in</strong>atorial sparseness <strong>in</strong> the l0-sense, see<br />

section 1.4.2.<br />

• Gaussian or noisy components <strong>in</strong> s(t): we proposed various denois<strong>in</strong>g and dimension reduction<br />

schemes, and proved uniqueness of a non-Gaussian signal subspace <strong>in</strong> section 1.5.<br />

F<strong>in</strong>ally, <strong>in</strong> section 1.6, we applied some of the above methods to biomedical data sets recorded<br />

by functional MRI, surface EMG and optical microscopy.<br />

Other work<br />

In this summary, data analysis was discussed from the viewpo<strong>in</strong>t of data factorization models<br />

such as (1.18). Before conclud<strong>in</strong>g, some other works of the author <strong>in</strong> related areas should be<br />

mentioned:<br />

Long discussed primarily in mathematics, optimization on Lie groups has become an important
topic in the field of machine learning and neural networks, since many cost functions are

def<strong>in</strong>ed on parameter spaces that more naturally obey a non-Euclidean geometry. Consider for<br />

example Amari’s natural gradient (Amari, 1998): simply by tak<strong>in</strong>g <strong>in</strong>to account the geometry



of the search space Gl(n) of all invertible (n × n)-matrices, Amari was able to considerably
improve search performance and accuracy, thus providing an equivariant ICA algorithm.

Figure 1.28: Future analysis framework (schematic: the data are decomposed into noise and a
signal part, which is further analyzed by independent and structured models).

In Theis

(2005b), we gave an overview of various gradient calculations on Gl(n) and presented generalizations<br />

to over- and undercomplete cases, realized by a semidirect product. These ideas were used<br />

<strong>in</strong> Squart<strong>in</strong>i and Theis (2006), where we def<strong>in</strong>ed some alternative Riemannian metrics on the<br />

parameter space of non-square matrices, correspond<strong>in</strong>g to various translations def<strong>in</strong>ed there<strong>in</strong>.<br />

Such metrics allowed us to derive novel, efficient learn<strong>in</strong>g rules for two ICA based algorithms<br />

for overdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation.<br />
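For reference, the square, fully determined case of such a learning rule is Amari's natural-gradient update; the following toy NumPy sketch illustrates it (it is only an illustration and does not cover the over- or undercomplete generalizations discussed above):

```python
import numpy as np

def natural_gradient_ica(X, lr=0.1, epochs=200, seed=0):
    """Square ICA with the natural (relative) gradient update on Gl(n):
    W <- W + lr * (I - E[tanh(y) y^T]) W,  y = W x  (tanh for super-gaussian sources)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(epochs):
        Y = W @ X
        W = W + lr * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
    return W

# toy demo: two super-gaussian (Laplacian) sources, square mixture
rng = np.random.default_rng(3)
S = rng.laplace(size=(2, 10000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = natural_gradient_ica(A @ S)
print(np.round(W @ A, 2))   # roughly a scaled permutation matrix if separation succeeded
```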

In Meyer-Baese et al. (2006), we studied optimization and statistical learn<strong>in</strong>g on a neural<br />

network that self-organized to solve a BSS problem. The result<strong>in</strong>g onl<strong>in</strong>e learn<strong>in</strong>g solution used<br />

the nonstationarity of the sources to achieve the separation. For this, we divided the problem
into two learning problems, one of which is solved by an anti-Hebbian and the other by a Hebbian
learning process. The stability of related networks is discussed in Meyer-Bäse et al.

(2006).<br />

A major application of unsupervised learn<strong>in</strong>g <strong>in</strong> neuroscience lies <strong>in</strong> the analysis of functional<br />

MRI data sets. A problem we faced dur<strong>in</strong>g the course of our analyses was the efficient storage<br />

and retrieval of many, large-dimensional spatiotemporal data sets. Although some dimension<br />

reduction or region-of-<strong>in</strong>terest selection may be performed beforehand to reduce the sample<br />

size (Keck et al., 2006), we wanted to compress the data as well as possible without losing

<strong>in</strong>formation. For this, we proposed a novel lossless compression method named FTTcoder (Theis<br />

and Tanaka, 2005) for the compression of images and 3d sequences collected dur<strong>in</strong>g a typical<br />

fMRI experiment. The large data sets <strong>in</strong>volved <strong>in</strong> this popular medical application necessitated<br />

novel compression algorithms to take <strong>in</strong>to account the structure of the recorded data as well as<br />

the experimental conditions, which <strong>in</strong>clude the 4d record<strong>in</strong>gs, the used stimulus protocol and<br />

marked regions of <strong>in</strong>terest (ROI). For this, we used simple temporal transformations and entropy<br />

cod<strong>in</strong>g with context model<strong>in</strong>g to encode the 4d scans after preprocess<strong>in</strong>g with the ROI mask<strong>in</strong>g.<br />

The compression algorithm as well as the fMRI toolbox and the algorithms for spatiotemporal<br />

and subspace BSS are all available onl<strong>in</strong>e at http://fabian.theis.name.
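The benefit of the temporal transformation can be illustrated with a small, self-contained sketch (this is not FTTcoder itself, which additionally uses context modeling and a real entropy coder; the array below is a hypothetical stand-in for a 4d recording):

```python
import numpy as np

def empirical_entropy(values):
    """Zeroth-order entropy in bits per sample of an integer-valued array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(4)
# hypothetical 4d scan (x, y, z, time): slowly varying signal plus noise
scan = (1000 + 50 * np.sin(np.linspace(0, 8, 120))
        + rng.normal(0, 3, size=(16, 16, 8, 120))).astype(np.int16)
roi = np.ones(scan.shape[:3], dtype=bool)          # region-of-interest mask (here: all voxels)

voxels = scan[roi]                                 # (n_voxels, time) after ROI masking
residual = np.diff(voxels, axis=1)                 # simple temporal decorrelation transform

print("raw scan          :", round(empirical_entropy(voxels), 2), "bits/sample")
print("temporal residuals:", round(empirical_entropy(residual), 2), "bits/sample")
```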



Future work<br />

In future work, the goal is to employ the above algorithms as preprocess<strong>in</strong>g step <strong>in</strong> model<strong>in</strong>g of<br />

biological processes, for example in quantitative systems biology. The idea is straightforward,

see figure 1.28. Given multivariate data that encode for example biological parameters of multiple,<br />

dependent experiments, we want to f<strong>in</strong>d regularized simple models that can expla<strong>in</strong> the<br />

data as well as predict future quantitative experiments. In a first step, denois<strong>in</strong>g is to be performed,<br />

especially removal of Gaussian subspaces that do not conta<strong>in</strong> mean<strong>in</strong>gful data apart<br />

from their covariance structure. The signal subspace then is to be further processed us<strong>in</strong>g techniques<br />

from dependent component analysis and <strong>in</strong>dependent subspace analysis. The result<strong>in</strong>g<br />

subspaces themselves are analyzed with network analysis techniques, for example based on Gaussian<br />

graphical models and correspond<strong>in</strong>g high-dimensional regularizations (Dobra et al., 2004,<br />

Schäfer and Strimmer, 2005). The derived structures are then expected to serve as models,<br />

which may provide quantitative descriptions of the underly<strong>in</strong>g processes. F<strong>in</strong>ally, we hope to<br />

drive further experiments by the model predictions.<br />

This well-known coupl<strong>in</strong>g of experimentalists and theoreticians is characteristic of systems<br />

biology <strong>in</strong> particular, and modern <strong>in</strong>terdiscipl<strong>in</strong>ary research <strong>in</strong> general. With the past work,<br />

we hope to have made some steps in this direction, and recent work in the field of microarray

analysis employ<strong>in</strong>g such techniques is already promis<strong>in</strong>g (Lutter et al., 2006, Schachtner et al.,<br />

2007, Stadlthanner et al., 2006a). In theoretical terms, the long-term goal is to step from<br />

multivariate analysis to network analysis, just as we have observed the field of signal process<strong>in</strong>g<br />

expand<strong>in</strong>g from univariate to multivariate models <strong>in</strong> the past few decades.


Part II<br />

Papers<br />



Chapter 2<br />

Neural Computation 16:1827-1850,<br />

2004<br />

Paper F.J. Theis. A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation.<br />

Neural Computation, 16:1827-1850, 2004<br />

Reference (Theis, 2004a)<br />

Summary <strong>in</strong> section 1.2.1<br />




LETTER Communicated by Aapo Hyvärinen

A New Concept for Separability Problems <strong>in</strong> Bl<strong>in</strong>d Source<br />

Separation<br />

Fabian J. Theis<br />

fabian@theis.name<br />

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

The goal of bl<strong>in</strong>d source separation (BSS) lies <strong>in</strong> recover<strong>in</strong>g the orig<strong>in</strong>al<br />

<strong>in</strong>dependent sources of a mixed random vector without know<strong>in</strong>g the<br />

mix<strong>in</strong>g structure. A key <strong>in</strong>gredient for perform<strong>in</strong>g BSS successfully is to<br />

know the <strong>in</strong>determ<strong>in</strong>acies of the problem—that is, to know how the separat<strong>in</strong>g<br />

model relates to the orig<strong>in</strong>al mix<strong>in</strong>g model (separability). For l<strong>in</strong>ear<br />

BSS, Comon (1994) showed us<strong>in</strong>g the Darmois-Skitovitch theorem that<br />

the l<strong>in</strong>ear mix<strong>in</strong>g matrix can be found except for permutation and scal<strong>in</strong>g.<br />

In this work, a much simpler, direct proof for l<strong>in</strong>ear separability is given.<br />

The idea is based on the fact that a random vector is <strong>in</strong>dependent if and<br />

only if the Hessian of its logarithmic density (resp. characteristic function)<br />

is diagonal everywhere. This property is then exploited to propose a new<br />

algorithm for perform<strong>in</strong>g BSS. Furthermore, first ideas of how to generalize<br />

separability results based on Hessian diagonalization to more complicated<br />

nonl<strong>in</strong>ear models are studied <strong>in</strong> the sett<strong>in</strong>g of postnonl<strong>in</strong>ear BSS.<br />

1 Introduction<br />

In <strong>in</strong>dependent component analysis (ICA), one tries to f<strong>in</strong>d statistically<br />

<strong>in</strong>dependent data with<strong>in</strong> a given random vector. An application of ICA<br />

lies <strong>in</strong> bl<strong>in</strong>d source separation (BSS), where it is furthermore assumed that<br />

the given vector has been mixed us<strong>in</strong>g a fixed set of <strong>in</strong>dependent sources.<br />

The advantage of apply<strong>in</strong>g ICA algorithms to BSS problems <strong>in</strong> contrast to<br />

correlation-based algorithms is that ICA tries to make the output signals as<br />

<strong>in</strong>dependent as possible by also <strong>in</strong>clud<strong>in</strong>g higher-order statistics.<br />

S<strong>in</strong>ce the <strong>in</strong>troduction of <strong>in</strong>dependent component analysis by Hérault<br />

and Jutten (1986), various algorithms have been proposed to solve the BSS<br />

problem (Comon, 1994; Bell & Sejnowski, 1995; Hyvär<strong>in</strong>en & Oja, 1997;<br />

Theis, Jung, Puntonet, & Lang, 2002). Good textbook-level <strong>in</strong>troductions to<br />

ICA are given <strong>in</strong> Hyvär<strong>in</strong>en, Karhunen, and Oja (2001) and Cichocki and<br />

Amari (2002).<br />

Separability of l<strong>in</strong>ear BSS states that under weak conditions to the sources,<br />

the mix<strong>in</strong>g matrix is determ<strong>in</strong>ed uniquely by the mixtures except for permutation<br />

and scal<strong>in</strong>g, as showed by Comon (1994) us<strong>in</strong>g the Darmois-<br />

Skitovitch theorem. We propose a direct proof based on the concept of<br />




separated functions, that is, functions that can be split <strong>in</strong>to a product of<br />

one-dimensional functions (see def<strong>in</strong>ition 1). If the function is positive, this<br />

is equivalent to the fact that its logarithm has a diagonal Hessian everywhere<br />

(see lemma 1 and theorem 1). A similar lemma has been shown by L<strong>in</strong> (1998)<br />

for what he calls block diagonal Hessians. However, he omits discussion of<br />

the separatedness of densities with zeros, which plays a m<strong>in</strong>or role for the<br />

separation algorithm he is <strong>in</strong>terested <strong>in</strong> but is important for deriv<strong>in</strong>g separability.<br />

Us<strong>in</strong>g separatedness of the density, respectively, the characteristic<br />

function (Fourier transformation), of the random vector, we can then show<br />

separability directly (<strong>in</strong> two slightly different sett<strong>in</strong>gs, for which we provide<br />

a common framework). Based on this result, we propose an algorithm<br />

for l<strong>in</strong>ear BSS by diagonaliz<strong>in</strong>g the logarithmic density of the Hessian. We<br />

recently found that this algorithm has already been proposed (L<strong>in</strong>, 1998),<br />

but without consider<strong>in</strong>g the necessary assumptions for successful algorithm<br />

application. Here we give precise conditions for when to apply this algorithm<br />

(see theorem 3) and show that po<strong>in</strong>ts satisfy<strong>in</strong>g these conditions can<br />

<strong>in</strong>deed be found if the sources conta<strong>in</strong> at most one gaussian component<br />

(see lemma 5). L<strong>in</strong> uses a discrete approximation of the derivative operator<br />

to approximate the Hessian. We suggest us<strong>in</strong>g kernel-based density<br />

estimation, which can be directly differentiated. A similar algorithm based<br />

on Hessian diagonalization has been proposed by Yeredor (2000) us<strong>in</strong>g the<br />

characteristic function of a random vector. However, the characteristic function<br />

is complex valued, and additional care has to be taken when apply<strong>in</strong>g<br />

a complex logarithm. Basically, this is well def<strong>in</strong>ed locally only at nonzeros.<br />

In algorithmic terms, the characteristic function can be easily approximated<br />

by samples (which is equivalent to our kernel-based density approximation<br />

us<strong>in</strong>g gaussians before Fourier transformation). Yeredor suggests jo<strong>in</strong>t diagonalization<br />

of the Hessian of the logarithmic characteristic function (which<br />

is problematic because of the nonuniqueness of the complex logarithm)<br />

evaluated at several po<strong>in</strong>ts <strong>in</strong> order to avoid the locality of the algorithm.<br />

Instead of jo<strong>in</strong>t diagonalization, we use a comb<strong>in</strong>ed energy function based<br />

on the previously def<strong>in</strong>ed separator, which also takes <strong>in</strong>to account global<br />

<strong>in</strong>formation but does not have the drawback of be<strong>in</strong>g s<strong>in</strong>gular at zeros of the<br />

density, respectively, characteristic function. Thus, the algorithmic part of<br />

this article can be seen as a general framework for the algorithms proposed<br />

by L<strong>in</strong> (1998) and Yeredor (2000).<br />

Section 2 <strong>in</strong>troduces separated functions, giv<strong>in</strong>g local characterizations<br />

of the densities of <strong>in</strong>dependent random vectors. Section 3 then <strong>in</strong>troduces<br />

the l<strong>in</strong>ear BSS model and states the well-known separability result. After<br />

giv<strong>in</strong>g an easy and short proof <strong>in</strong> two dimensions with positive densities,<br />

we provide a characterization of gaussians <strong>in</strong> terms of a differential equation<br />

and provide the general proof. The BSS algorithm based on f<strong>in</strong>d<strong>in</strong>g<br />

separated densities is proposed and studied <strong>in</strong> section 4. We f<strong>in</strong>ish with<br />

a generalization of the separability for the postnonl<strong>in</strong>ear mixture case <strong>in</strong><br />

section 5.



2 Separated and L<strong>in</strong>early Separated Functions<br />

Def<strong>in</strong>ition 1. A function f : R n → C is said to be separated, respectively,<br />

l<strong>in</strong>early separated, if there exist one-dimensional functions g1,...,gn : R → C<br />

such that f (x) = g1(x1) ···gn(xn) respectively f (x) = g1(x1) +···+gn(xn) for<br />

all x ∈ R n .<br />

Note that the functions gi are uniquely determ<strong>in</strong>ed by f up to a scalar<br />

factor, respectively, an additive constant. If f is l<strong>in</strong>early separated, then exp f<br />

is separated. Obviously the density function of an <strong>in</strong>dependent random<br />

vector is separated. For brevity, we often use the tensor product and write<br />

f ≡ g1 ⊗···⊗gn for separated f , where for any functions h, k def<strong>in</strong>ed on a<br />

set U, h ≡ k if h(x) = k(x) for all x ∈ U.<br />

Separatedness can also be def<strong>in</strong>ed on any open parallelepiped (a1, b1) ×<br />

···×(an, bn) ⊂ R n <strong>in</strong> the obvious way. We say that f is locally separated<br />

at x ∈ R n if there exists an open parallelepiped U such that x ∈ U and<br />

f |U is separated. If f is separated, then f is obviously everywhere locally<br />

separated. The converse, however, does not necessarily hold, as shown <strong>in</strong><br />

Figure 1.<br />

Figure 1: Density of a random vector S with a locally but not globally separated density. Here,
pS := c χ_{[−2,2]×[−2,0] ∪ [0,2]×[1,3]}, where χ_U denotes the function that is 1 on U and 0 everywhere
else. Obviously, pS is not separated globally, but is separated if restricted to squares of length < 1.
Plotted is a smoothed version of pS.



The function f is said to be positive if f is real and f (x) >0 for all x ∈ R n ,<br />

and nonnegative if f is real and f (x) ≥ 0 for all x ∈ R n . A positive function<br />

f is separated if and only if ln f is l<strong>in</strong>early separated.<br />

Let C^m(U, V) be the ring of all m-times continuously differentiable functions from U ⊂ R^n to
V ⊂ C, U open. For a C^m-function f, we write ∂_{i1} ··· ∂_{im} f := ∂^m f / ∂x_{i1} ··· ∂x_{im} for the m-fold
partial derivatives. If f ∈ C^2(R^n, C), denote with the symmetric (n × n)-matrix
H_f(x) := (∂_i ∂_j f(x))_{i,j=1}^n the Hessian of f at x ∈ R^n.

L<strong>in</strong>early separated functions can be classified us<strong>in</strong>g their Hessian (if it<br />

exists):<br />

Lemma 1. A function f ∈ C 2 (R n , C) is l<strong>in</strong>early separated if and only if Hf (x)<br />

is diagonal for all x ∈ R n .<br />

A similar lemma for block diagonal Hessians has been shown by L<strong>in</strong><br />

(1998).<br />

Proof. If f is l<strong>in</strong>early separated, its Hessian is obviously diagonal everywhere<br />

by def<strong>in</strong>ition.<br />

Assume the converse. We prove that f is separated by <strong>in</strong>duction over<br />

the dimension n. For n = 1, the claim is trivial. Now assume that we have<br />

shown the lemma for n − 1. By <strong>in</strong>duction assumption, f (x1,...,xn−1, 0) is<br />

l<strong>in</strong>early separated, so<br />

f (x1,...,xn−1, 0) = g1(x1) +···+gn−1(xn−1)<br />

for all xi ∈ R and some functions gi on R. Note that gi ∈ C 2 (R, C).<br />

Def<strong>in</strong>e a function h : R → C by h(y) := ∂n f (x1,...,xn−1, y), y ∈ R,<br />

for fixed x1,...,xn−1 ∈ R. Note that h is <strong>in</strong>dependent of the choice of the<br />

xi, because ∂n∂i f ≡ ∂i∂n f is zero everywhere, so xi ↦→ ∂n f (x1,...,xn−1, y)<br />

is constant for fixed x_j, y ∈ R, j ≠ i. By definition, h ∈ C^1(R, C), so h is integrable on compact
intervals. Define k : R → C by k(y) := ∫_0^y h. Then

f (x1,...,xn) = g1(x1) +···+gn−1(xn−1) + k(xn) + c,<br />

where c ∈ C is a constant, because both functions have the same derivative<br />

and R n is connected. If we set gn := k + c, the claim follows.<br />

This lemma also holds for functions def<strong>in</strong>ed on any open parallelepiped<br />

(a1, b1) ×···×(an, bn) ⊂ R n . Hence, an arbitrary real-valued C 2 -function f<br />

is locally separated at x with f(x) ≠ 0 if and only if the Hessian of ln |f| is

locally diagonal.<br />

For a positive function f , the Hessian of its logarithm is diagonal everywhere<br />

if it is separated, and it is easy to see that for positive f , the converse



also holds globally (see theorem 1(ii)). In this case, we have for i ≠ j,

0 ≡ ∂_i ∂_j ln f ≡ ( f ∂_i ∂_j f − (∂_i f)(∂_j f) ) / f^2,

so f is separated if and only if

f ∂_i ∂_j f ≡ (∂_i f)(∂_j f)

for i ≠ j or even i < j. This motivates the following definition:

Definition 2. For i ≠ j, the operator

R_ij : C^2(R^n, C) → C^0(R^n, C),   f ↦ R_ij[f] := f ∂_i ∂_j f − (∂_i f)(∂_j f)

is called the ij-separator.

Theorem 1. Let f ∈ C^2(R^n, C).

i. If f is separated, then R_ij[f] ≡ 0 for i ≠ j or, equivalently,

f ∂_i ∂_j f ≡ (∂_i f)(∂_j f)   (2.1)

holds for i ≠ j.

ii. If f is positive and R_ij[f] ≡ 0 holds for all i ≠ j, then f is separated.

If f is assumed to be only nonnegative, then f is locally separated but<br />

not necessarily globally separated (if the support of f has more than one<br />

component). See Figure 1 for an example of a nonseparated density with<br />

R12[ f ] ≡ 0.<br />

Proof of Theorem 1. i. If f is separated, then f(x) = g_1(x_1) ··· g_n(x_n) or short f ≡ g_1 ⊗···⊗ g_n, so

∂_i f ≡ g_1 ⊗···⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗···⊗ g_n

and

∂_i ∂_j f ≡ g_1 ⊗···⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗···⊗ g_{j−1} ⊗ g_j' ⊗ g_{j+1} ⊗···⊗ g_n

for i < j. Hence equation 2.1 holds.

ii. Now assume the converse and let f be positive. Then according to the remarks after lemma 1,
H_{ln f}(x) is everywhere diagonal, so lemma 1 shows that ln f is linearly separated; hence, f is
separated.
that ln f is l<strong>in</strong>early separated; hence, f is separated.



Some trivial properties of the separator R_ij are listed in the next lemma:

Lemma 2. Let f, g ∈ C^2(R^n, C), i ≠ j and α ∈ C. Then

R_ij[αf] = α^2 R_ij[f]

and

R_ij[f + g] = R_ij[f] + R_ij[g] + f ∂_i ∂_j g + g ∂_i ∂_j f − (∂_i f)(∂_j g) − (∂_i g)(∂_j f).

3 Separability of L<strong>in</strong>ear BSS<br />

Consider the noiseless l<strong>in</strong>ear <strong>in</strong>stantaneous BSS model with as many sources<br />

as sensors:<br />

X = AS, (3.1)<br />

with an <strong>in</strong>dependent n-dimensional random vector S and A ∈ Gl(n). Here,<br />

Gl(n) denotes the general l<strong>in</strong>ear group of R n , that is, the group of all <strong>in</strong>vertible<br />

(n × n)-matrices.<br />

The task of l<strong>in</strong>ear BSS is to f<strong>in</strong>d A and S given only X. An obvious <strong>in</strong>determ<strong>in</strong>acy<br />

of this problem is that A can be found only up to scal<strong>in</strong>g and<br />

permutation because for scal<strong>in</strong>g L and permutation matrix P,<br />

X = ALPP −1 L −1 S,<br />

and P −1 L −1 S is also <strong>in</strong>dependent. Here, an <strong>in</strong>vertible matrix L ∈ Gl(n)<br />

is said to be a scal<strong>in</strong>g matrix if it is diagonal. We say two matrices B, C<br />

are equivalent, B ∼ C, if C can be written as C = BPL with a scaling

matrix L ∈ Gl(n) and an <strong>in</strong>vertible matrix with unit vectors <strong>in</strong> each row<br />

(permutation matrix) P ∈ Gl(n). Note that PL = L ′ P for some scal<strong>in</strong>g matrix<br />

L ′ ∈ Gl(n), so the order of the permutation and the scal<strong>in</strong>g matrix does not<br />

play a role for equivalence. Furthermore, if B ∈ Gl(n) with B ∼ I, then also<br />

B −1 ∼ I, and, more generally if BC ∼ A, then C ∼ B −1 A. Accord<strong>in</strong>g to the<br />

above, solutions of l<strong>in</strong>ear BSS are equivalent. We will show that under mild<br />

assumptions to S, there are no further <strong>in</strong>determ<strong>in</strong>acies of l<strong>in</strong>ear BSS.<br />

S is said to have a gaussian component if one of the Si is a one-dimensional<br />

gaussian, that is, pSi (x) = d exp(−ax2 + bx + c) with a, b, c, d ∈ R, a > 0, and<br />

S has a determ<strong>in</strong>istic component if one Si is determ<strong>in</strong>istic, that is, constant.<br />

Theorem 2 (Separability of l<strong>in</strong>ear BSS). Let A ∈ Gl(n) and S be an <strong>in</strong>dependent<br />

random vector. Assume one of the follow<strong>in</strong>g:<br />

i. S has at most one gaussian or determ<strong>in</strong>istic component, and the covariance<br />

of S exists.



ii. S has no gaussian component, and its density pS exists and is twice cont<strong>in</strong>uously<br />

differentiable.<br />

Then if AS is aga<strong>in</strong> <strong>in</strong>dependent, A is equivalent to the identity.<br />

So A is the product of a scal<strong>in</strong>g and a permutation matrix. The important<br />

part of this theorem is assumption i, which has been used to show separability<br />

by Comon (1994) and extended by Eriksson and Koivunen (2003) based<br />

on the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953). Us<strong>in</strong>g<br />

this theorem, the second part can be easily shown without C 2 -densities.<br />

Theorem 2 <strong>in</strong>deed proves separability of the l<strong>in</strong>ear BSS model, because<br />

if X = AS and W is a demix<strong>in</strong>g matrix such that WX is <strong>in</strong>dependent, then<br />

WA ∼ I,soW −1 ∼ A as desired.<br />

We will give a much easier proof without hav<strong>in</strong>g to use the Darmois-<br />

Skitovitch theorem <strong>in</strong> the follow<strong>in</strong>g sections.<br />

3.1 Two-Dimensional Positive Density Case. For illustrative purposes<br />

we will first prove separability for a two-dimensional random vector S with<br />

positive density pS ∈ C 2 (R 2 , R). Let A ∈ Gl(2). It is enough to show that if<br />

S and AS are <strong>in</strong>dependent, then either A ∼ I or S is gaussian.<br />

S is assumed to be <strong>in</strong>dependent, so its density factorizes:<br />

pS(s) = g1(s1)g2(s2),<br />

for s ∈ R^2. First, note that the density of AS is given by

p_AS(x) = |det A|^{−1} p_S(A^{−1} x) = c g_1(b_11 x_1 + b_12 x_2) g_2(b_21 x_1 + b_22 x_2)

for x ∈ R^2, c ≠ 0 fixed. Here, B = (b_ij) = A^{−1}. AS is also assumed to be independent, so
p_AS(x) is separated.

p_S was assumed to be positive; then so is p_AS. Hence, ln p_AS(x) is linearly separated, so

∂_1 ∂_2 ln p_AS(x) = b_11 b_12 h_1''(b_11 x_1 + b_12 x_2) + b_21 b_22 h_2''(b_21 x_1 + b_22 x_2) = 0

for all x ∈ R^2, where h_i := ln g_i ∈ C^2(R, R). By setting y := Bx, we therefore have

b_11 b_12 h_1''(y_1) + b_21 b_22 h_2''(y_2) = 0   (3.2)

for all y ∈ R^2, because B is invertible.

Now, if A (and therefore also B) is equivalent to the identity, then equation<br />

3.2 holds. If not, then A, and hence also B, have at least three nonzero<br />

entries. By equation 3.2 the fourth entry has to be nonzero, because the



h_i'' are not zero (otherwise g_i(y_i) = exp(a y_i + b), which is not integrable). Furthermore,

b_11 b_12 h_1''(y_1) = −b_21 b_22 h_2''(y_2)

for all y ∈ R^2, so the h_i'' are constant, say, h_i'' ≡ c_i, and c_i ≠ 0, as noted above. Therefore, the
h_i are polynomials of degree 2, and the g_i = exp h_i are gaussians (c_i < 0 because of the
integrability of the g_i).

3.2 Characterization of Gaussians. In this section, we show that among<br />

all densities, respectively, characteristic functions, the gaussians satisfy a<br />

special differential equation.<br />

Lemma 3. Let f ∈ C^2(R, C) and a ∈ C with

a f^2 − f f'' + f'^2 ≡ 0.   (3.3)

Then either f ≡ 0 or f(x) = exp( (a/2) x^2 + bx + c ), x ∈ R, with constants b, c ∈ C.

Proof. Assume f ≢ 0. Let x_0 ∈ R with f(x_0) ≠ 0. Then there exists a nonempty interval
U := (r, s) containing x_0 such that a complex logarithm log is defined on f(U). Set g := log f|_U.
Substituting exp g for f in equation 3.3 yields

a exp(2g) − exp(g)(g'' + g'^2) exp(g) + g'^2 exp(2g) ≡ 0,

and therefore g'' ≡ a. Hence, g is a polynomial of degree ≤ 2 with leading coefficient a/2.
Furthermore,

lim_{x→r+} f(x) ≠ 0   and   lim_{x→s−} f(x) ≠ 0,

so f has no zeros at all because of continuity. The argument above with U = R shows the claim.

If, furthermore, f is real nonnegative and <strong>in</strong>tegrable with <strong>in</strong>tegral 1 (e.g.,<br />

if f is the density of a random variable), then f has to be the exponential of<br />

a real-valued polynomial of degree precisely 2; otherwise, it would not be<br />

<strong>in</strong>tegrable. So we have the follow<strong>in</strong>g corollary:



Corollary 1. Let X be a random variable with twice cont<strong>in</strong>uously differentiable<br />

density pX satisfy<strong>in</strong>g equation 3.3. Then X is gaussian.<br />

If we do not want to assume that the random variable has a density, we<br />

can use its characteristic function (Bauer, 1996) <strong>in</strong>stead to show an equivalent<br />

result:<br />

Corollary 2. Let X be a random variable with twice continuously differentiable characteristic
function X̂(x) := E_X(exp ixX) satisfying equation 3.3. Then X is gaussian or deterministic.

Proof. Using X̂(0) = 1, lemma 3 shows that X̂(x) = exp( (a/2) x^2 + bx ). Moreover, from
X̂(−x) = \overline{X̂(x)}, we get a ∈ R and b = ib' with real b'. And |X̂| ≤ 1 shows that a ≤ 0.
So if a = 0, then X is deterministic (at b'), and if a ≠ 0, then X has a gaussian distribution with
mean b' and variance −a^{−1}.

3.3 Proof of Theorem 2. We will now prove l<strong>in</strong>ear separability; for this,<br />

we will use separatedness to show that some source components have to be<br />

gaussian (us<strong>in</strong>g the results from above) if the mix<strong>in</strong>g matrix is not trivial.<br />

The ma<strong>in</strong> argument is given <strong>in</strong> the follow<strong>in</strong>g lemma:<br />

Lemma 4. Let g_i ∈ C^2(R, C) and B ∈ Gl(n) such that f(x) := g_1 ⊗···⊗ g_n(Bx) is separated.
Then for all indices l and i ≠ j with b_li b_lj ≠ 0, g_l satisfies the differential equation 3.3 with
some constant a.

Proof. f is separated, so by theorem 1i,

R_ij[f] ≡ f ∂_i ∂_j f − (∂_i f)(∂_j f) ≡ 0   (3.4)

holds for i < j. The ingredients of this equation can be calculated for i < j as follows:

∂_i f(x) = Σ_k b_ki (g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)(Bx)

(∂_i f)(∂_j f)(x) = Σ_{k,l} b_ki b_lj ((g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)(g_1 ⊗···⊗ g_l' ⊗···⊗ g_n))(Bx)

∂_i ∂_j f(x) = Σ_k b_ki ( b_kj g_1 ⊗···⊗ g_k'' ⊗···⊗ g_n + Σ_{l≠k} b_lj g_1 ⊗···⊗ g_k' ⊗···⊗ g_l' ⊗···⊗ g_n )(Bx).

Putting this in equation 3.4 yields

0 = ( f ∂_i ∂_j f − (∂_i f)(∂_j f) )(x)
  = Σ_k b_ki b_kj ( (g_1 ⊗···⊗ g_n)(g_1 ⊗···⊗ g_k'' ⊗···⊗ g_n) − (g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)^2 )(Bx)
  = Σ_k b_ki b_kj ( g_1^2 ⊗···⊗ g_{k−1}^2 ⊗ (g_k g_k'' − g_k'^2) ⊗ g_{k+1}^2 ⊗···⊗ g_n^2 )(Bx)

for x ∈ R^n. B is invertible, so the whole function is zero:

Σ_k b_ki b_kj g_1^2 ⊗···⊗ g_{k−1}^2 ⊗ (g_k g_k'' − g_k'^2) ⊗ g_{k+1}^2 ⊗···⊗ g_n^2 ≡ 0.   (3.5)

Choose x ∈ R^n with g_k(x_k) ≠ 0 for k = 1,...,n. Evaluating equation 3.5 at
(x_1,...,x_{l−1}, y, x_{l+1},...,x_n) for variable y ∈ R and dividing the resulting one-dimensional
equation by the constant g_1^2(x_1) ··· g_{l−1}^2(x_{l−1}) g_{l+1}^2(x_{l+1}) ··· g_n^2(x_n) shows

b_li b_lj ( g_l g_l'' − g_l'^2 )(y) = −( Σ_{k≠l} b_ki b_kj ((g_k g_k'' − g_k'^2) / g_k^2)(x_k) ) g_l^2(y)   (3.6)

for y ∈ R. So for indices l and i ≠ j with b_li b_lj ≠ 0, it follows from equation 3.6 that there exists
a ∈ C such that g_l satisfies the differential equation a g_l^2 − g_l g_l'' + g_l'^2 ≡ 0, that is,
equation 3.3.

Proof of Theorem 2. i. S is assumed to have at most one gaussian or determ<strong>in</strong>istic<br />

component and exist<strong>in</strong>g covariance. Set X := AS.<br />

We first show us<strong>in</strong>g whiten<strong>in</strong>g that A can be assumed to be orthogonal.<br />

For this, we can assume S and X to have no determ<strong>in</strong>istic component at all<br />

(because arbitrary choice of the matrix coefficients of the determ<strong>in</strong>istic components<br />

does not change the covariance). Hence, by assumption, Cov(X)<br />

is diagonal and positive definite, so let D_1 be diagonal invertible with Cov(X) = D_1^2. Similarly,
let D_2 be diagonal invertible with Cov(S) = D_2^2. Set Y := D_1^{−1} X and T := D_2^{−1} S, that is,
normalize X and S to covariance I. Then

Y = D_1^{−1} X = D_1^{−1} A S = D_1^{−1} A D_2 T


and T, D_1^{−1} A D_2 and Y satisfy the assumption, and D_1^{−1} A D_2 is orthogonal because

I = Cov(Y) = E(Y Y^⊤) = E(D_1^{−1} A D_2 T T^⊤ D_2 A^⊤ D_1^{−1}) = (D_1^{−1} A D_2)(D_1^{−1} A D_2)^⊤.
So without loss of generality, let A be orthogonal.<br />

Now let Ŝ(s) := E_S(exp is^⊤ S) be the characteristic function of S. By assumption, the covariance
(and hence the mean) of S exists, so Ŝ ∈ C^2(R^n, C) (Bauer, 1996). Furthermore, since S is
assumed to be independent, its characteristic function is separated: Ŝ ≡ g_1 ⊗···⊗ g_n, where
g_i ≡ Ŝ_i. The characteristic function of AS can easily be calculated as

ÂS(x) = E_S(exp ix^⊤ AS) = Ŝ(A^⊤ x) = g_1 ⊗···⊗ g_n(A^⊤ x)

for x ∈ R^n. Let B := (b_ij) = A^⊤. Since AS is also assumed to be independent,
f(x) := ÂS(x) = g_1 ⊗···⊗ g_n(Bx) is separated.

Now assume that A ≁ I. Using orthogonality of B = A^⊤, there exist indices k ≠ l and i ≠ j with
b_ki b_kj ≠ 0 and b_li b_lj ≠ 0. Then according to lemma 4, g_k and g_l satisfy the differential
equation 3.3. Together with corollary 2, this shows that both S_k and S_l are gaussian, which is a
contradiction to the assumption.

ii. Let S be an n-dimensional independent random vector with density p_S ∈ C^2(R^n, R) and no
gaussian component, and let A ∈ Gl(n). S is assumed to be independent, so its density factorizes
p_S ≡ g_1 ⊗···⊗ g_n. The density of AS is given by

p_AS(x) = |det A|^{−1} p_S(A^{−1} x) = |det A|^{−1} g_1 ⊗···⊗ g_n(A^{−1} x)

for x ∈ R^n. Let B := (b_ij) = A^{−1}. AS is also assumed to be independent, so

f(x) := |det A| p_AS(x) = g_1 ⊗···⊗ g_n(Bx)

is separated.

Assume A ≁ I. Then also B = A^{−1} ≁ I, so there exist indices l and i ≠ j with b_li b_lj ≠ 0. Hence,
it follows from lemma 4 that g_l satisfies the differential equation 3.3. But g_l is a density, so
according to corollary 1 the l-th component of S is gaussian, which is a contradiction.



4 BSS by Hessian Diagonalization<br />

In this section, we use the theory already set out to propose an algorithm for<br />

l<strong>in</strong>ear BSS, which can be easily extended to nonl<strong>in</strong>ear sett<strong>in</strong>gs as well. For<br />

this, we restrict ourselves to us<strong>in</strong>g C 2 -densities. A similar idea has already<br />

been proposed <strong>in</strong> L<strong>in</strong> (1998), but without deal<strong>in</strong>g with possibly degenerated<br />

eigenspaces <strong>in</strong> the Hessian. Equivalently, we could also use characteristic<br />

functions <strong>in</strong>stead of densities, which leads to a related algorithm (Yeredor,<br />

2000).<br />

If we assume that Cov(S) exists, we can use whiten<strong>in</strong>g as seen <strong>in</strong> the proof<br />

of theorem 2i (<strong>in</strong> this context, also called pr<strong>in</strong>cipal component analysis) to<br />

reduce the general BSS model, equation 3.2, to<br />

X = AS (4.1)<br />

with an <strong>in</strong>dependent n-dimensional random vector S with exist<strong>in</strong>g covariance<br />

I and an orthogonal matrix A. Then Cov(X) = I. We assume that S<br />

admits a C^2-density p_S. The density of X is then given by

p_X(x) = p_S(A^⊤ x)

for x ∈ R^n, because of the orthogonality of A. Hence,

p_S ≡ p_X ∘ A.

Note that the Hessian of the composition of a function f ∈ C^2(R^n, R) with an (n × n)-matrix A
can be calculated using the Hessian of f as follows:

H_{f∘A}(x) = A H_f(Ax) A^⊤.

Let s ∈ R^n with p_S(s) > 0. Then locally at s, we have

H_{ln p_S}(s) = H_{ln p_X ∘ A}(s) = A H_{ln p_X}(As) A^⊤.   (4.2)

p_S is assumed to be separated, so H_{ln p_S}(s) is diagonal, as seen in section 2.

Lemma 5. Let X := AS with an orthogonal matrix A and S, an <strong>in</strong>dependent<br />

random vector with C 2 -density, and at most one gaussian component. Then there<br />

exists an open set U ⊂ R n such that for all x ∈ U, pX(x) �= 0 and Hln p X (x) has n<br />

different eigenvalues.<br />

Proof. Assume not. Then there exists no x ∈ R n at all with pX(x) �= 0 and<br />

Hln p X (x) hav<strong>in</strong>g n different eigenvalues because otherwise, due to cont<strong>in</strong>uity,<br />

these conditions would also hold <strong>in</strong> an open neighborhood of x.



Using equation 4.2, the logarithmic Hessian of $p_S$ has, at every $s \in \mathbb{R}^n$ with $p_S(s) > 0$, at least two equal eigenvalues, say $\lambda(s) \in \mathbb{R}$. Since S is independent, $H_{\ln p_S}(s)$ is diagonal, so locally

$$\bigl(\ln p_{S_i}\bigr)''(s_i) = \bigl(\ln p_{S_j}\bigr)''(s_j) = \lambda(s)$$

for two indices $i \neq j$. Here we have used continuity of $s \mapsto H_{\ln p_S}(s)$, showing that the two eigenvalues locally lie in the same two dimensions i and j. This proves that $\lambda(s)$ is locally constant in directions i and j. So locally at points s with $p_S(s) > 0$, $S_i$ and $S_j$ are of the type $\exp P$, with P a polynomial of degree $\leq 2$. The same argument as in the proof of lemma 3 then shows that $p_{S_i}$ and $p_{S_j}$ have no zeros at all. Using the connectedness of $\mathbb{R}$ proves that $S_i$ and $S_j$ are globally of the type $\exp P$, hence gaussian (because $\int_{\mathbb{R}} p_{S_k} = 1$), which is a contradiction.

Hence, we can assume that we have found $x^{(0)} \in \mathbb{R}^n$ with $H_{\ln p_X}(x^{(0)})$ having n different eigenvalues (which is equivalent to saying that every eigenvalue has multiplicity one), because due to lemma 5 this is an open condition, which can be found algorithmically. In fact, most densities in practice turn out to have logarithmic Hessians with n different eigenvalues almost everywhere. In theory, however, U in lemma 5 cannot be assumed to be, for example, dense, nor can $\mathbb{R}^n \setminus U$ be assumed to have measure zero: if we choose $p_{S_1}$ to be a normalized gaussian and $p_{S_2}$ to be a normalized gaussian with a very localized small perturbation at zero only, then U cannot be larger than $(-\varepsilon, \varepsilon) \times \mathbb{R}$.

By diagonalization of $H_{\ln p_X}(x^{(0)})$ using eigenvalue decomposition (principal axis transformation), we can find the (orthogonal) mixing matrix A. Note that the eigenvalue decomposition is unique except for permutation and sign scaling, because every eigenspace (in which A is only unique up to orthogonal transformation) has dimension one. Arbitrary scaling indeterminacy does not occur because we have forced S and X to have unit variances. Using uniqueness of the eigenvalue decomposition and theorem 2, we have shown the following theorem:

Theorem 3 (BSS by Hessian calculation). Let $X = AS$ with an independent random vector S and an orthogonal matrix A. Let $x \in \mathbb{R}^n$ such that locally at x, X admits a $C^2$-density $p_X$ with $p_X(x) \neq 0$. Assume that $H_{\ln p_X}(x)$ has n different eigenvalues (see lemma 5). If

$$E\, H_{\ln p_X}(x)\, E^\top = D$$

is an eigenvalue decomposition of the Hessian of the logarithm of $p_X$ at x, that is, E orthogonal and D diagonal, then $E \sim A$, so $E^\top X$ is independent.



Furthermore, it follows from this theorem that linear BSS is a local problem, as proven already in Theis, Puntonet, and Lang (2003) using the restriction of a random vector.
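To make the procedure behind theorem 3 concrete, the following is a minimal NumPy/SciPy sketch of the local Hessian-diagonalization idea: whiten the data, estimate the Hessian of the logarithmic density at one suitable point, and use its eigendecomposition as the orthogonal demixing transform. The kernel density estimate, the bandwidth, the finite-difference Hessian, and the choice of evaluation point are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def hessian_diag_bss(X, eps=1e-3):
    """Sketch of BSS by local Hessian diagonalization (cf. theorem 3).

    X: (n, T) data matrix, one mixture per row.
    Returns estimated sources and a matrix E with E ~ A (for the whitened data).
    """
    # 1. Whitening (principal component analysis), so that Cov = I.
    Xc = X - X.mean(axis=1, keepdims=True)
    d, V = np.linalg.eigh(np.cov(Xc))
    Z = (V @ np.diag(d ** -0.5) @ V.T) @ Xc

    # 2. Kernel estimate of the (whitened) mixture density -- illustrative choice.
    kde = gaussian_kde(Z)
    log_p = lambda x: np.log(kde(x[:, None])[0])

    # 3. Numerical Hessian of log p_X at one point with p_X > 0 and
    #    (hopefully) n distinct eigenvalues; here simply the origin.
    n = Z.shape[0]
    x0 = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (log_p(x0 + ei + ej) - log_p(x0 + ei - ej)
                       - log_p(x0 - ei + ej) + log_p(x0 - ei - ej)) / (4 * eps ** 2)

    # 4. Eigenvalue decomposition E H E^T = D; then E ~ A and E^T Z is independent.
    _, Vh = np.linalg.eigh(H)
    E = Vh.T
    return E.T @ Z, E
```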

4.1 Example for Hessian Diagonalization BSS. In order to illustrate the algorithm of local Hessian diagonalization, we give a two-dimensional example. Let S be a random vector with densities

$$p_{S_1}(s_1) = \tfrac{1}{2}\,\chi_{[-1,1]}(s_1), \qquad p_{S_2}(s_2) = \tfrac{1}{\sqrt{2\pi}}\,\exp\!\bigl(-\tfrac{1}{2}s_2^2\bigr),$$

where $\chi_{[-1,1]}$ is one on $[-1,1]$ and zero everywhere else. The orthogonal mixing matrix A is chosen to be

$$A = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1\\ -1 & 1\end{pmatrix}.$$

The mixture density $p_X$ of $X := AS$ then is ($\det A = 1$)

$$p_X(x) = \frac{1}{2\sqrt{2\pi}}\,\chi_{[-1,1]}\!\Bigl(\tfrac{1}{\sqrt 2}(x_1 - x_2)\Bigr)\exp\!\Bigl(-\tfrac{1}{4}(x_1 + x_2)^2\Bigr)$$

for $x \in \mathbb{R}^2$. $p_X$ is positive and $C^2$ in a neighborhood around 0. Then

$$\partial_1 \ln p_X(x) = \partial_2 \ln p_X(x) = -\tfrac{1}{2}(x_1 + x_2),$$
$$\partial_1^2 \ln p_X(x) = \partial_2^2 \ln p_X(x) = \partial_1\partial_2 \ln p_X(x) = -\tfrac{1}{2}$$

for x with $|x| < \tfrac{1}{2}$, and the Hessian of the logarithmic density is

$$H_{\ln p_X}(x) = -\frac{1}{2}\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix},$$

independent of x in a neighborhood around 0. Diagonalization of $H_{\ln p_X}(0)$ yields

$$\begin{pmatrix}-1 & 0\\ 0 & 0\end{pmatrix},$$

and this equals $A H_{\ln p_X}(0) A^\top$, as stated in theorem 3.
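As a quick sanity check of this example, the snippet below evaluates $A H_{\ln p_X}(0) A^\top$ numerically for the matrices given above; it only verifies the arithmetic and is not part of the original paper.

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)   # orthogonal mixing matrix
H = -0.5 * np.ones((2, 2))                              # Hessian of ln p_X near 0

D = A @ H @ A.T
print(np.round(D, 10))   # diag(-1, 0), matching the diagonalization above
```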



4.2 Global Hessian Diagonalization Using Kernel-Based Density Approximation. In practice, it is usually not possible to approximate the density locally with sufficiently high accuracy, so a better approximation using the typically global information of X has to be found. In the following, we suggest using kernel-based density estimation to get an energy function with minima at the BSS solutions, together with a global Hessian diagonalization. The idea is to construct a measure for separatedness of the densities (hence independence) based on theorem 1. A possible measure could be the norm of the summed-up separators $\sum_{i<j} R_{ij}[\hat p_X]$.



Figure 2: Independent Laplacian density $p_S(s) = \tfrac{1}{2}\exp(-|x_1| - |x_2|)$: theoretic (left) and approximated (right) densities. For the approximation, 1000 samples and gaussian kernel approximation (see equation 4.3) with standard deviation 0.37 were used.

$R_{ij}[\hat p_X]$ can be calculated using lemma 2 (here $R_{ij}[\varphi(x - x^{(k)})] \equiv 0$) and equation 4.4:

$$R_{ij}[\hat p_X](x) = \frac{1}{\nu^2}\,R_{ij}\Bigl[\sum_{k=1}^{\nu}\varphi(x - x^{(k)})\Bigr](x)$$
$$= \frac{1}{\nu^2}\sum_{k\neq l}\Bigl(\varphi(x - x^{(k)})\,\partial_i\partial_j\varphi(x - x^{(l)}) - \partial_i\varphi(x - x^{(k)})\,\partial_j\varphi(x - x^{(l)})\Bigr)$$
$$= \frac{4\kappa^2}{\nu^2}\sum_{k\neq l}\varphi(x - x^{(k)})\,\varphi(x - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j - x_j^{(l)})$$
$$= \frac{4\kappa^2}{\nu^2}\sum_{k<l}\varphi(x - x^{(k)})\,\varphi(x - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j^{(k)} - x_j^{(l)});$$



hence,

$$E = (\sigma^2\nu)^{-4}\sum_m\sum_{i<j}\Bigl(\sum_{k<l}\varphi(x^{(m)} - x^{(k)})\,\varphi(x^{(m)} - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j^{(k)} - x_j^{(l)})\Bigr)^{2}.$$



Note that E represents a new approximate measure of independence. Therefore, the linear BSS algorithm can now be readily generalized to nonlinear situations by finding an appropriate parameterization of the possibly nonlinear separating model.
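As a rough illustration of how such a kernel-based separator energy can be evaluated in practice, the sketch below sums squared separator values over the sample points and minimizes over candidate rotations of whitened two-dimensional data. The kernel width, the restriction to a single rotation angle, the omitted constant factors, and the grid search are illustrative assumptions rather than the paper's concrete implementation.

```python
import numpy as np

def separator_energy(Z, sigma=0.4):
    """Sum of squared separators R_ij (i=0, j=1) of a Gaussian-kernel density
    estimate of the 2-d data Z (shape (2, nu)), evaluated at the samples."""
    n, nu = Z.shape
    diffs = Z[:, None, :] - Z[:, :, None]                         # x^(b) - x^(a), per coordinate
    phi = np.exp(-(diffs ** 2).sum(axis=0) / (2 * sigma ** 2))    # unnormalized kernel values
    d = Z[:, :, None] - Z[:, None, :]                             # x^(k) - x^(l), per coordinate
    energy = 0.0
    for m in range(nu):
        w = phi[m][:, None] * phi[m][None, :]                     # phi(x^(m)-x^(k)) phi(x^(m)-x^(l))
        r = np.sum(np.triu(w * d[0] * d[1], k=1))                 # sum over pairs k < l
        energy += r ** 2
    return energy

def demix_by_rotation(X, angles=np.linspace(0, np.pi / 2, 90)):
    """Grid search over rotations of whitened 2-d mixtures, minimizing the energy."""
    best = min(angles, key=lambda a: separator_energy(
        np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]).T @ X))
    R = np.array([[np.cos(best), -np.sin(best)], [np.sin(best), np.cos(best)]])
    return R.T @ X, R
```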

The proposed algorithm basically performs a global diagonalization of the logarithmic Hessian after prewhitening. Interestingly, this is similar to traditional BSS algorithms based on joint diagonalization, such as JADE (Cardoso & Souloumiac, 1993) using cumulant matrices, or AMUSE (Tong, Liu, Soon, & Huang, 1991) and SOBI (Belouchrani, Meraim, Cardoso, & Moulines, 1997) employing time decorrelation. Instead of using a global energy function as proposed above, we could therefore also jointly diagonalize a given set of Hessians (respectively, separator matrices, as above; see also Yeredor, 2000). Another relation to previously proposed ICA algorithms lies in the kernel approximation technique. Gaussian or generalized gaussian kernels have already been used in the field of independent component analysis to model the source densities (Lee & Lewicki, 2000; Habl, Bauer, Puntonet, Rodriguez-Alvarez, & Lang, 2001), thus giving an estimate of the score function used in Bell-Sejnowski-type semiparametric algorithms (Bell & Sejnowski, 1995) or enabling direct separation using a maximum likelihood parameter estimation. Our algorithm also uses density approximation, but employs it for the mixture density, which can be problematic in higher dimensions. A different approach not involving density approximation is a direct sample-based Hessian estimation similar to Lin (1998).

5 Separability of Postnonlinear BSS

In this section, we show how to use the idea of Hessian diagonalization in order to give separability proofs in nonlinear situations, more precisely in the setting of postnonlinear BSS. After stating the postnonlinear BSS model and the general (to the knowledge of the author, not yet proven) separability theorem, we will prove postnonlinear separability in the case of random vectors with distributions that are somewhere locally constant and nonzero (e.g., uniform distributions). A possible proof of postnonlinear separability has been suggested by Taleb and Jutten (1999); however, that proof applies only to densities with at least one zero and furthermore contains an error rendering it applicable only to restricted situations.

Definition 3. A function $f : \mathbb{R}^n \to \mathbb{R}^n$ is called diagonal if each component $f_i(x)$ of $f(x)$ depends only on the variable $x_i$.

In this case, we often omit the other variables and write $f(x_1, \ldots, x_n) = (f_1(x_1), \ldots, f_n(x_n))$; so $f \equiv f_1 \times \cdots \times f_n$, where $\times$ denotes the Cartesian product.



Consider now the postnonlinear BSS model

$$X = f(AS), \qquad (5.1)$$

where again S is an independent random vector, $A \in \mathrm{Gl}(n)$, and f is a diagonal nonlinearity. We assume the components of f to be injective analytical functions with invertible Jacobian at every point (locally diffeomorphic).
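To fix ideas, here is a small sketch generating data from the postnonlinear model in equation 5.1; the particular sources, mixing matrix, and componentwise nonlinearities (chosen strictly increasing, hence injective with invertible Jacobian) are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# independent non-gaussian sources S (n = 2, T samples)
T = 1000
S = np.vstack([rng.uniform(-1, 1, T),          # uniform source
               rng.laplace(0, 1, T)])          # Laplacian source

# invertible mixing matrix A and diagonal nonlinearity f = f1 x f2
A = np.array([[1.0, 0.6],
              [-0.4, 1.0]])
f = [np.tanh,                                   # f1: injective, f1' > 0
     lambda t: t + 0.2 * t ** 3]                # f2: injective, f2' > 0

# postnonlinear mixtures X = f(AS), applied componentwise
AS = A @ S
X = np.vstack([f[i](AS[i]) for i in range(2)])
```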

Definition 4. An invertible matrix $A \in \mathrm{Gl}(n)$ is said to be mixing if A has at least two nonzero entries in each row.

Note that if A is mixing, then $A'$, $A^{-1}$, and $ALP$ for a scaling matrix L and a permutation matrix P are also mixing.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can be reconstructed only up to scaling and permutation. In the linear case, affine linear transformation is ignored. Here, of course, additional indeterminacies come into play because of translation: $f_i$ can be recovered only up to a constant. Also, if $L \in \mathrm{Gl}(n)$ is a scaling matrix, then

$$f(AS) = (f \circ L)\bigl((L^{-1}A)S\bigr),$$

so f and A can interchange scaling factors in each component. Another obvious indeterminacy could occur if A is not general enough. If, for example, $A = I$, then $f(S)$ is already independent, because independence is invariant under diagonal nonlinear transformation; so f cannot be found in this case. If we assume, however, that A is mixing, then we will show that except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist.

Theorem 4 (separability of postnonlinear BSS). Let $A, W \in \mathrm{Gl}(n)$ be mixing, $h : \mathbb{R}^n \to \mathbb{R}^n$ be a diagonal bijective function with analytical locally diffeomorphic components, and S be an independent random vector with at most one gaussian component and existing covariance. If $W(h(AS))$ is independent, then there exist a scaling matrix $L \in \mathrm{Gl}(n)$ and $p \in \mathbb{R}^n$ with $LA \sim W^{-1}$ and $h \equiv L + p$.

If analyticity of the components of h is not assumed, then $h \equiv L + p$ can only hold on $\{As \mid p_S(s) \neq 0\}$.

If $f \circ A$ is the mixing model, $W \circ g$ is the separating model. Putting the two together, we get the above mixing-separating model. Since A has to be assumed to be mixing, we can assume W to be mixing as well, because the inverse of a mixing matrix is again mixing. Furthermore, the mixing-separating model is assumed to be bijective (hence A and W invertible and h bijective), because otherwise trivial solutions such as $h \equiv c$ for a constant $c \in \mathbb{R}$ would also be solutions.



We will show the theorem in the case of S and X with components having somewhere locally constant nonzero $C^2$-densities. An alternative geometric idea of how to prove theorem 4 for bounded sources in two dimensions is mentioned in Babaie-Zadeh, Jutten, and Nayebi (2002) and extended in Theis and Gruber (forthcoming). Note that in our case, as well as in the above restrictive cases, the assumption that S has at most one gaussian component holds trivially.

Proof of Theorem 4 (with locally constant nonzero $C^2$-densities). Let $h = h_1 \times \cdots \times h_n$ with bijective $C^\infty$-functions $h_i : \mathbb{R} \to \mathbb{R}$. We only have to show that the $h_i'$ are constant. Then h is affine linear, say $h \equiv L + p$, with a diagonal matrix $L \in \mathrm{Gl}(n)$ and a vector $p \in \mathbb{R}^n$. Hence $W(h(AS)) = WLAS + Wp$, and then $WLAS$ is independent, so using linear separability, theorem 2i, $WLA \sim I$, therefore $LA \sim W^{-1}$.

Let $X := W(h(AS))$. The density of this transformed random vector is easily calculated from S:

$$p_X\bigl(Wh(As)\bigr) = |\det W|^{-1}\,|h_1'((As)_1)|^{-1}\cdots|h_n'((As)_n)|^{-1}\,|\det A|^{-1}\,p_S(s)$$

for $s \in \mathbb{R}^n$. By assumption, h has an invertible Jacobian at every point, so the $h_i'$ are either positive or negative; without loss of generality, $h_i' > 0$. Furthermore, $p_X$ is independent, so we can write

$$p_X \equiv g_1 \otimes \cdots \otimes g_n.$$

For fixed $s^0 \in \mathbb{R}^n$ with $p_S(s^0) > 0$, there exists an open neighborhood $U \subset \mathbb{R}^n$ of $s^0$ with $p_S|_U > 0$ and $p_S|_U \in C^2(U, \mathbb{R})$. If we define $f(s) := \ln\bigl(|\det W|^{-1}|\det A|^{-1}\,p_S(s)\bigr)$ for $s \in U$, then

$$f(s) = \ln\bigl(h_1'((As)_1)\cdots h_n'((As)_n)\,g_1((Wh(As))_1)\cdots g_n((Wh(As))_n)\bigr) = \sum_{k=1}^{n}\Bigl[\ln h_k'((As)_k) + \zeta_k\bigl((Wh(As))_k\bigr)\Bigr]$$

for $s \in U$, where $\zeta_k := \ln g_k$ locally around $s^0$. $p_S$ is separated, so

$$\partial_i \partial_j f \equiv 0 \qquad (5.2)$$

for $i < j$. Denote $A =: (a_{ij})$ and $W =: (w_{ij})$. The first derivative and then the nondiagonal entries in the Hessian of f can be calculated as follows ($i < j$):

$$\partial_i f(s) = \sum_{k=1}^{n}\left[ a_{ki}\,\frac{h_k''}{h_k'}\bigl((As)_k\bigr) + \zeta_k'\bigl((Wh(As))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'\bigl((As)_l\bigr)\right]$$

$$\partial_i\partial_j f(s) = \sum_{k=1}^{n}\left[ a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}\bigl((As)_k\bigr) + \zeta_k''\bigl((Wh(As))_k\bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'\bigl((As)_l\bigr)\Bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{lj}\,h_l'\bigl((As)_l\bigr)\Bigr) + \zeta_k'\bigl((Wh(As))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}a_{lj}\,h_l''\bigl((As)_l\bigr)\right].$$

Substituting $y := As$ and using equation 5.2, we finally get the following differential equation for the $h_k$:

$$0 = \sum_{k=1}^{n}\left[ a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}(y_k) + \zeta_k''\bigl((Wh(y))_k\bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'(y_l)\Bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{lj}\,h_l'(y_l)\Bigr) + \zeta_k'\bigl((Wh(y))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}a_{lj}\,h_l''(y_l)\right] \qquad (5.3)$$

for $y \in V := A(U)$.

We will restrict ourselves to the simple case mentioned above in order to solve this equation. We assume that the $h_k$ are analytic and that there exists $x^0 \in \mathbb{R}^n$ where the demixed densities $g_k$ are locally constant and nonzero. Consider the above calculation around $s^0 = A^{-1}(h^{-1}(W^{-1}x^0))$. Choose the open set V such that the $g_k$ are locally constant and nonzero on $W(h(V))$. Then so are the $\zeta_k = \ln g_k$, and therefore

$$0 = \sum_{k=1}^{n} a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}(y_k)$$

for $y \in V$. Hence, there exist open intervals $I_k \subset \mathbb{R}$ and constants $b_k \in \mathbb{R}$ with

$$a_{ki}a_{kj}\bigl(h_k'h_k''' - h_k''^2\bigr) \equiv b_k\, h_k'^2$$

on $I_k$ (here, $b_k = -\sum_{l\neq k} a_{li}a_{lj}\,\frac{h_l'h_l''' - h_l''^2}{h_l'^2}(y_l)$ for some, and then any, $y \in V$). By assumption, A is mixing. Hence, for fixed k, there exist $i \neq j$ with $a_{ki}a_{kj} \neq 0$. If we set $c_k := b_k/(a_{ki}a_{kj})$, then

$$c_k h_k'^2 - h_k'h_k''' + h_k''^2 \equiv 0 \qquad (5.4)$$



on $I_k$. $h_k$ was chosen to be analytic, and equation 5.4 holds on the open set $I_k$, so it holds on all of $\mathbb{R}$. Applying lemma 3 then shows that either $h_k' \equiv 0$ or

$$h_k'(x) = \pm\exp\Bigl(\frac{c_k}{2}x^2 + d_k x + e_k\Bigr), \qquad x \in \mathbb{R}, \qquad (5.5)$$

with constants $d_k, e_k \in \mathbb{R}$. By assumption, $h_k$ is bijective, so $h_k' \not\equiv 0$.

Applying the same arguments as above to the inverse system

$$S = A^{-1}\bigl(h^{-1}(W^{-1}X)\bigr)$$

and using the fact that $p_S$ is also somewhere locally constant and nonzero shows that equation 5.5 also holds for $(h_k^{-1})'$ with other constants. But if both the derivatives of $h_k$ and $h_k^{-1}$ are of this exponential type, then $c_k = d_k = 0$, and therefore $h_k$ is affine linear for all $k = 1, \ldots, n$, which completes the proof of postnonlinear separability in this special case.

Note that in the above proof, local positiveness of the densities was assumed in order to use the equivalence of local separability with the diagonality of the Hessian of the logarithmic density. Hence, these results can be generalized using theorem 1 in a similar fashion as we did in the linear case with theorem 2. In particular, we have proven postnonlinear separability also for uniformly distributed sources.

6 Conclusion

We have shown how to derive the separability of linear BSS using diagonalization of the Hessian of the logarithmic density, respectively characteristic function. This induces separated, that is, independent, sources. The idea of Hessian diagonalization is put into a new algorithm for performing linear independent component analysis, which is shown to be a local problem. In practice, however, due to the fact that the densities cannot be approximated locally very well, we also propose a diagonalization algorithm that takes the global structure into account. In order to show the use of this framework of separated functions, we finish with a proof of postnonlinear separability in a special case.

In future work, more general separability results for postnonlinear BSS could be constructed by finding more general solutions of the differential equation 5.3. Algorithmic improvements could be made by using other density approximation methods such as mixtures of gaussians, or by approximating the Hessian itself using the cumulative density and discrete approximations of the differential. Finally, the diagonalization algorithm can easily be extended to nonlinear situations by finding appropriate model parameterizations; instead of minimizing the mutual information, we minimize the absolute value of the off-diagonal terms of the logarithmic Hessian.



The algorithm has been specified using only an energy function; gradient and fixed-point algorithms can be derived in the usual manner.

Separability in nonlinear situations has turned out to be a hard problem, ill-posed in the most general case (Hyvärinen & Pajunen, 1999), and not many nontrivial results exist for restricted models (Hyvärinen & Pajunen, 1999; Babaie-Zadeh et al., 2002), all only two-dimensional. We believe that this is due to the fact that the rather nontrivial proof of the Darmois-Skitovitch theorem is not at all easily generalized to more general settings (Kagan, 1986). By introducing separated functions, we are able to give a much easier proof for linear separability and also provide new results in nonlinear settings. We hope that these ideas will be used to show separability in other situations as well.

Acknowledgments

I thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. I also thank Peter Gruber, Wolfgang Hackenbroch, and Michaela Theis for suggestions and remarks on various aspects of the separability proof. The work described here was supported by the DFG in the grant "Nonlinearity and Nonequilibrium in Condensed Matter" and the BMBF in the ModKog project.

References

Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. of EUSIPCO '02 (Vol. 2, pp. 11-14). Toulouse, France.

Bauer, H. (1996). Probability theory. Berlin: Walter de Gruyter.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434-444.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non gaussian signals. IEE Proceedings-F, 140(6), 362-370.

Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314.

Darmois, G. (1953). Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21, 2-8.

Eriksson, J., & Koivunen, V. (2003). Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003 (pp. 23-27). Nara, Japan.

Habl, M., Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2001). Analyzing biomedical signals with probabilistic ICA and kernel-based source density estimation. In M. Sebaaly (Ed.), Information science innovations (Proc. ISI'2001) (pp. 219-225). Alberta, Canada: ICSC Academic Press.

Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206-211). New York: American Institute of Physics.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483-1492.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429-439.

Kagan, A. (1986). New classes of dependent random variables and a generalization of the Darmois-Skitovitch theorem to several forms. Theory Probab. Appl., 33(2), 286-295.

Lee, T., & Lewicki, M. (2000). The generalized gaussian mixture model using ICA. In Proc. of ICA 2000 (pp. 239-244). Helsinki, Finland.

Lin, J. (1998). Factorizing multivariate function classes. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 563-569). Cambridge, MA: MIT Press.

Skitovitch, V. (1953). On a property of the normal distribution. DAN SSSR, 89, 217-219.

Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Trans. on Signal Processing, 47, 2807-2820.

Theis, F., & Gruber, P. (forthcoming). Separability of analytic postnonlinear blind source separation with bounded sources. In Proc. of ESANN 2004. Evere, Belgium: d-side.

Theis, F., Jung, A., Puntonet, C., & Lang, E. (2002). Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 15, 1-21.

Theis, F., Puntonet, C., & Lang, E. (2003). Nonlinear geometric ICA. In Proc. of ICA 2003 (pp. 275-280). Nara, Japan.

Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499-509.

Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80(5), 897-902.

Received June 27, 2003; accepted March 8, 2004.




Chapter 3

Signal Processing 84(5):951-956, 2004

Paper F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951-956, 2004

Reference (Theis, 2004b)

Summary in section 1.2.1



Signal Processing 84 (2004) 951-956
www.elsevier.com/locate/sigpro

Fast communication

Uniqueness of complex and multidimensional independent component analysis

F.J. Theis

Institute of Biophysics, University of Regensburg, Universitaetsstr. 31, D-93040 Regensburg, Germany
E-mail: fabian.theis@mathematik.uni-regensburg.de, fabian@theis.name

Received 25 September 2003

Abstract

A complex version of the Darmois-Skitovitch theorem is proved using a multivariate extension of the latter by Ghurye and Olkin. This makes it possible to calculate the indeterminacies of independent component analysis (ICA) with complex variables and coefficients. Furthermore, the multivariate Darmois-Skitovitch theorem is used to show uniqueness of multidimensional ICA, where only groups of sources are mutually independent.

© 2004 Elsevier B.V. All rights reserved.

PACS: 84.40.Ua; 89.70.+c; 07.05.Kf

Keywords: Complex ICA; Multidimensional ICA; Separability

1. Introduction

The task of independent component analysis (ICA) is to transform a given random vector into a statistically independent one. ICA can be applied to blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. Good textbook-level introductions to ICA are given in [4,11].

BSS is said to be separable if the mixing structure can be blindly recovered except for obvious indeterminacies. In [5], Comon shows separability of linear real BSS using the Skitovitch-Darmois theorem. He notes that his proof for the real case can also be extended to the complex setting. However, a complex version of the Skitovitch-Darmois theorem is needed, which, to the knowledge of the author, has not been shown in the literature yet. In this work we will provide such a theorem, which is then used to prove separability of complex BSS.

Separability and uniqueness of BSS are already included in the definition of what is commonly called a 'contrast' [5]. Hence they have been widely studied, but in the setting of complex BSS, to the knowledge of the author, separability has only been shown under the additional assumption of non-zero cumulants of the sources [5,13].

The paper is organized as follows: In the next section, basic terms and notations are introduced. Section 3 states the well-known Skitovitch-Darmois theorem and a multivariate extension thereof; furthermore, a complex version of it is derived. The following Section 4 then introduces the complex linear blind source separation model and shows its separability. Section 5 finally deals with separability of multidimensional ICA (group ICA).

2. Notation

Let $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$ be either the real or the complex numbers. For $m, n \in \mathbb{N}$, let $\mathrm{Mat}(m \times n; \mathbb{K})$ be the $\mathbb{K}$-vector space of real, respectively complex, $m \times n$ matrices, and $\mathrm{Gl}(n; \mathbb{K}) := \{W \in \mathrm{Mat}(n \times n; \mathbb{K}) \mid \det(W) \neq 0\}$ the general linear group of $\mathbb{K}^n$. $I \in \mathrm{Gl}(n; \mathbb{K})$ denotes the unit matrix. For a complex number $z \in \mathbb{C}$ we write $\mathrm{Re}(z)$ for its real and $\mathrm{Im}(z)$ for its imaginary part.

An invertible matrix $L \in \mathrm{Gl}(n; \mathbb{K})$ is said to be a scaling matrix if it is diagonal. We say two matrices $B, C \in \mathrm{Mat}(m \times n; \mathbb{K})$ are ($\mathbb{K}$-)equivalent, $B \sim C$, if C can be written as $C = BPL$ with a scaling matrix $L \in \mathrm{Gl}(n; \mathbb{K})$ and an invertible matrix with unit vectors in each row (permutation matrix) $P \in \mathrm{Gl}(n; \mathbb{K})$. Note that $PL = L'P$ for some scaling matrix $L' \in \mathrm{Gl}(n; \mathbb{K})$, so the order of the permutation and the scaling matrix does not play a role for equivalence. Furthermore, if $B \in \mathrm{Gl}(n; \mathbb{K})$ with $B \sim I$, then also $B^{-1} \sim I$, and more generally, if $BC \sim A$, then $C \sim B^{-1}A$. So two matrices are equivalent if and only if they differ by right-multiplication by a matrix with exactly one non-zero entry in each row and each column. If $\mathbb{K} = \mathbb{R}$, the two matrices are the same except for permutation, sign and scaling; if $\mathbb{K} = \mathbb{C}$, they are the same except for permutation, sign, scaling and phase-shift.
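The equivalence relation above is easy to test numerically: $B \sim C$ holds exactly when $B^{-1}C$ has exactly one non-zero entry in each row and in each column. The following small check is an illustrative utility only, not part of the paper.

```python
import numpy as np

def equivalent(B, C, tol=1e-10):
    """Test whether B ~ C, i.e. C = B P L for a permutation P and a scaling L."""
    M = np.linalg.solve(B, C)                    # M = B^{-1} C
    nonzero = np.abs(M) > tol
    return (nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all()
```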

3. A multivariate version of the Skitovitch-Darmois theorem

The original Skitovitch-Darmois theorem shows a non-trivial connection between Gaussian distributions and stochastic independence. More precisely, it states that if two linear combinations of non-Gaussian independent random variables are again independent, then each original random variable can appear in only one of the two linear combinations. It has been proved independently by Darmois [6] and Skitovitch [14]; in a more accessible form, the proof can be found in [12]. Separability of linear BSS as shown by Comon [5] is a corollary of this theorem, although recently separability has also been shown without it [17].

Theorem 3.1 (Skitovitch-Darmois theorem). Let $L_1 = \sum_{i=1}^{n} \alpha_i X_i$ and $L_2 = \sum_{i=1}^{n} \beta_i X_i$ with $X_1, \ldots, X_n$ independent real random variables and $\alpha_j, \beta_j \in \mathbb{R}$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are independent, then all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian.

The converse is true if we assume that $\sum_{j=1}^{n} \alpha_j\beta_j = 0$: If all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian and $\sum_{j=1}^{n} \alpha_j\beta_j = 0$, then $L_1$ and $L_2$ are independent. This follows because then $L_1$ and $L_2$ are uncorrelated, and with all common variables being normal, they are then also independent.

Theorem 3.2 (Multivariate S-D theorem). Let $L_1 = \sum_{i=1}^{n} A_i X_i$ and $L_2 = \sum_{i=1}^{n} B_i X_i$ with mutually independent k-dimensional random vectors $X_j$ and invertible matrices $A_j, B_j \in \mathrm{Gl}(k; \mathbb{R})$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are mutually independent, then all $X_j$ are Gaussian.

Here Gaussian (or jointly Gaussian) means that each component of the random vector is a Gaussian. Obviously, those Gaussians can have non-trivial correlations. This extension of Theorem 3.1 to random vectors has first been noted by Skitovitch [15] and shown by Ghurye and Olkin [8]. Zinger gave a different proof for it in his Ph.D. thesis [18].

We need the following corollary:

Corollary 3.3. Let $L_1 = \sum_{i=1}^{n} A_i X_i$ and $L_2 = \sum_{i=1}^{n} B_i X_i$ with mutually independent k-dimensional random vectors $X_j$ and matrices $A_j, B_j$ either zero or in $\mathrm{Gl}(k; \mathbb{R})$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are mutually independent, then all $X_j$ with $A_jB_j \neq 0$ are Gaussian.

Proof. We want to throw out all $X_j$ with $A_jB_j = 0$; then Theorem 3.2 can be applied. Let j be given with $A_jB_j = 0$. Without loss of generality assume that $B_j = 0$. If also $A_j = 0$, then we can simply leave out $X_j$, since it appears in neither $L_1$ nor $L_2$. So assume $A_j \neq 0$. By assumption, $X_j$ and $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$ are mutually independent, and then so are $X_j$ and $L_2$, because $B_j = 0$. Hence both $-A_jX_j$, $L_2$ and $L_1$, $L_2$ are mutually independent, so also the two linear combinations $L_1 - A_jX_j$ and $L_2$ of the $n-1$ variables $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$ are mutually independent. After successive application of this recursion, we can assume that each $A_j$ and $B_j$ is invertible. Applying Theorem 3.2 shows the corollary.

From this, a complex version of the Skitovitch-Darmois theorem can easily be derived:

Corollary 3.4 (Complex S-D theorem). Let $L_1 = \sum_{i=1}^{n} \alpha_i X_i$ and $L_2 = \sum_{i=1}^{n} \beta_i X_i$ with $X_1, \ldots, X_n$ independent complex random variables and $\alpha_j, \beta_j \in \mathbb{C}$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are independent, then all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian.

Here, a complex random variable is said to be Gaussian if both its real and imaginary part are Gaussians.

Proof. We can interpret the n independent complex random variables $X_i$ as n two-dimensional real random vectors that are mutually independent. Multiplication by the complex number $\alpha_j$ is either ($\alpha_j \neq 0$) a multiplication by the real invertible matrix

$$\begin{pmatrix} \mathrm{Re}(\alpha_j) & -\mathrm{Im}(\alpha_j) \\ \mathrm{Im}(\alpha_j) & \mathrm{Re}(\alpha_j) \end{pmatrix}$$

or ($\alpha_j = 0$) a multiplication by the zero matrix, and similarly for $\beta_j$. Applying Corollary 3.3 finishes the proof.
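The real 2x2 embedding used in this proof is the standard matrix representation of complex multiplication; the helper below, an illustrative utility only, builds it and checks that it reproduces the complex product.

```python
import numpy as np

def real_embedding(alpha: complex) -> np.ndarray:
    """2x2 real matrix representing multiplication by the complex number alpha."""
    return np.array([[alpha.real, -alpha.imag],
                     [alpha.imag,  alpha.real]])

# quick consistency check: the embedding acts on (Re x, Im x) like alpha * x
alpha, x = 2.0 - 1.0j, 0.5 + 3.0j
v = real_embedding(alpha) @ np.array([x.real, x.imag])
assert np.allclose(v, [(alpha * x).real, (alpha * x).imag])
```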

4. Indeterminacies of complex ICA

Given a complex n-dimensional random vector X, a matrix $W \in \mathrm{Gl}(n; \mathbb{C})$ is called a (complex) ICA of X if WX is independent (as a complex random vector). We will show that W and V are complex ICAs of X if and only if $W^{-1} \sim V^{-1}$, that is, if they differ by right multiplication by a complex scaling and permutation matrix. This is equivalent to calculating the indeterminacies of the complex BSS model:

Consider the noiseless complex linear instantaneous blind source separation (BSS) model with as many sources as sensors,

$$X = AS. \qquad (1)$$

Here S is an independent complex-valued n-dimensional random vector and $A \in \mathrm{Gl}(n; \mathbb{C})$ an invertible complex matrix.

The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to equivalence, because for a scaling matrix L and a permutation matrix P,

$$X = ALP\,P^{-1}L^{-1}S$$

and $P^{-1}L^{-1}S$ is also independent. We will show that under mild assumptions on S there are no further indeterminacies of complex BSS.

Various algorithms for solving the complex BSS problem have been proposed [1,2,7,13,16]. We want to note that many cases where complex BSS is applied can in fact be reduced to using real BSS algorithms. This is the case if either the sources or the mixing matrix are real. The latter, for example, occurs after Fourier transformation of signals with time structure.

If the sources are real, then the above complex model can be split up into two separate real BSS problems:

$$\mathrm{Re}(X) = \mathrm{Re}(A)\,S, \qquad \mathrm{Im}(X) = \mathrm{Im}(A)\,S.$$

Solving both of these real BSS equations yields $A = \mathrm{Re}(A) + i\,\mathrm{Im}(A)$. Of course, $\mathrm{Re}(A)$ and $\mathrm{Im}(A)$ can only be found up to scaling and permutation. By comparing the two recovered source random vectors (using, for example, the mutual information of one component of each vector), we can however assume that the permutation and then also the scaling indeterminacy of both recovered matrices is the same, which allows the algorithm to correctly put A back together. Similarly, separability of this special complex ICA problem can also be derived from the well-known separability results in the real case.

If the mixing matrix is known to be real, then again splitting up Eq. (1) into real and imaginary parts yields

$$\mathrm{Re}(X) = A\,\mathrm{Re}(S), \qquad \mathrm{Im}(X) = A\,\mathrm{Im}(S).$$

A can be found from either equation. If both real and imaginary samples are to be used in order to increase precision, they can simply be concatenated in order to generate a twice as large sample set mixed by the same mixing matrix A. In terms of random vectors, this means working in two disjoint copies of the original probability space. Again separability follows.
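The two reductions above are straightforward to implement on top of any real-valued ICA routine. The sketch below illustrates the real-mixing-matrix case by concatenating real and imaginary samples; FastICA from scikit-learn stands in for the real BSS algorithm and is an illustrative choice, not the paper's.

```python
import numpy as np
from sklearn.decomposition import FastICA

def complex_bss_real_mixing(X):
    """Complex BSS when the mixing matrix A is real: X is an (n, T) complex array.

    Real and imaginary samples are mixed by the same real A, so they are
    concatenated and an ordinary real ICA is run on the enlarged sample set.
    """
    n, T = X.shape
    X_big = np.hstack([X.real, X.imag])                  # (n, 2T) real samples
    ica = FastICA(n_components=n, random_state=0)
    S_big = ica.fit_transform(X_big.T).T                 # recovered real sources, (n, 2T)
    A_est = ica.mixing_                                  # estimated real mixing matrix
    # reassemble complex sources from the two halves
    S_est = S_big[:, :T] + 1j * S_big[:, T:]
    return S_est, A_est
```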



Theorem 4.1 (Separability of complex linear BSS). Let $A \in \mathrm{Gl}(n; \mathbb{C})$ and S a complex independent random vector. Assume one of the following:

i. S has at most one Gaussian component and the (complex) covariance of S exists.
ii. S has no Gaussian component.

If AS is again independent,¹ then A is equivalent to the identity.

¹ Indeed, we only need the components of AS to be pairwise independent.

Here, the complex covariance of S is defined by

$$\mathrm{Cov}(S) = \mathrm{E}\bigl((S - \mathrm{E}(S))(S - \mathrm{E}(S))^*\bigr),$$

where the asterisk denotes the transposed and complex-conjugated vector.

Comon has shown this for the real case [5]; for the complex case a complex version of the Darmois-Skitovitch theorem is needed, as provided in Section 3. Theorem 4.1 indeed proves separability of the complex linear BSS model, because if $X = AS$ and W is a demixing matrix such that WX is independent, then $WA \sim I$, so $W^{-1} \sim A$ as desired. And it also calculates the indeterminacies of complex ICA, because if W and V are ICAs of X, then both $VX$ and $WV^{-1}VX$ are independent, so $WV^{-1} \sim I$ and hence $W \sim V$.

Proof. Denote $X := AS$.

First assume case ii: S has no Gaussian component at all. Then $A = (a_{ij})$ is equivalent to the identity, because if not, there exist $i_1 \neq i_2$ and j with $a_{i_1 j}a_{i_2 j} \neq 0$. Applying Corollary 3.4 to $X_{i_1}$ and $X_{i_2}$ then shows that $S_j$ is Gaussian, which is a contradiction to assumption ii.

Now assume that the covariance exists and that S has at most one Gaussian component. First we will show, using complex decorrelation, that we can assume A to be unitary. Without loss of generality assume that all random vectors are centered. By assumption Cov(X) is diagonal, so let $D_1$ be diagonal invertible with $\mathrm{Cov}(X) = D_1^2$. Note that $D_1$ is real. Similarly let $D_2$ be diagonal invertible with $\mathrm{Cov}(S) = D_2^2$. Set $Y := D_1^{-1}X$ and $T := D_2^{-1}S$, that is, normalize X and S to covariance I. Then

$$Y = D_1^{-1}X = D_1^{-1}AS = D_1^{-1}AD_2\,T,$$

so T, $D_1^{-1}AD_2$ and Y satisfy the assumption, and $D_1^{-1}AD_2$ is unitary because

$$I = \mathrm{Cov}(Y) = \mathrm{E}(YY^*) = \mathrm{E}\bigl(D_1^{-1}AD_2\,TT^*\,D_2A^*D_1^{-1}\bigr) = \bigl(D_1^{-1}AD_2\bigr)\bigl(D_1^{-1}AD_2\bigr)^*.$$

If we assume $A \not\sim I$, then, using the fact that A is unitary, there exist indices $i_1 \neq i_2$ and $j_1 \neq j_2$ with $a_{i_*j_*} \neq 0$. By assumption

$$X_{i_1} = a_{i_1 j_1}S_{j_1} + a_{i_1 j_2}S_{j_2} + \cdots, \qquad X_{i_2} = a_{i_2 j_1}S_{j_1} + a_{i_2 j_2}S_{j_2} + \cdots$$

are independent, and in both $X_{i_1}$ and $X_{i_2}$ the variables $S_{j_1}$ and $S_{j_2}$ appear non-trivially, so by the complex Skitovitch-Darmois Theorem 3.4, $S_{j_1}$ and $S_{j_2}$ are Gaussian, which is a contradiction to the fact that at most one source is Gaussian.

5. Indeterminacies of multidimensional ICA

In this section, we want to analyze the indeterminacies of so-called multidimensional independent component analysis. The idea of this generalization of ICA is that we do not require full independence of the transform Y, but only mutual independence of certain tuples $Y_{i_1}, \ldots, Y_{i_k}$. If the size of all tuples is restricted to one, this reduces to original ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length (which then necessarily has to divide the total dimension).

Multidimensional ICA has first been introduced by Cardoso [3] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA, which they called independent subspace analysis [9]; there, the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation. A different extension of ICA is given by topographic ICA [10], where dependencies between all components are assumed. A special case of multidimensional ICA is complex ICA as presented in the preceding section; here dependence is allowed between real-valued couples of random variables.

Let $k, n \in \mathbb{N}$ such that k divides n. We call an n-dimensional random vector Y k-independent if the k-dimensional random vectors

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_k \end{pmatrix}, \; \ldots, \; \begin{pmatrix} Y_{n-k+1} \\ \vdots \\ Y_n \end{pmatrix}$$

are mutually independent. A matrix $W \in \mathrm{Gl}(n; \mathbb{R})$ is called a k-multidimensional ICA of an n-dimensional random vector X if WX is k-independent. If $k = 1$, this is the same as ordinary ICA.

Obvious indeterminacies are, similar to ordinary ICA, invertible transforms in $\mathrm{Gl}(k; \mathbb{R})$ within each tuple, as well as the fact that the order of the independent k-tuples is not fixed. So, define for $r, s = 1, \ldots, n/k$ the (r, s) sub-k-matrix of $W = (w_{ij})$ to be the $k \times k$ submatrix

$$(w_{ij})_{\,i = rk, \ldots, rk+k-1;\; j = sk, \ldots, sk+k-1},$$

that is, the $k \times k$ submatrix of W starting at position $(rk, sk)$. A matrix $L \in \mathrm{Gl}(n; \mathbb{R})$ is said to be a k-scaling and permutation matrix if for each $r = 1, \ldots, n/k$ there exists precisely one s such that the (r, s) sub-k-matrix of L is nonzero and lies in $\mathrm{Gl}(k; \mathbb{R})$, and if for each $s = 1, \ldots, n/k$ there exists only one r with the (r, s) sub-k-matrix satisfying the same condition. Hence, if Y is k-independent, then LY is also k-independent.

Two matrices A and B are said to be k-equivalent, $A \sim_k B$, if there exists such a k-scaling and permutation matrix L with $A = BL$. As stated above, given two matrices W and V with $W^{-1} \sim_k V^{-1}$ such that one of them is a k-multidimensional ICA of a given random vector, then so is the other. We will show that there are no more indeterminacies of multidimensional ICA.

As usual, multidimensional ICA can solve the multidimensional BSS problem

$$X = AS,$$

where $A \in \mathrm{Gl}(n; \mathbb{R})$ and S is a k-independent n-dimensional random vector. Finding the indeterminacies of multidimensional ICA then shows that A can be found except for k-equivalence (separability), because if $X = AS$ and W is a demixing matrix such that WX is k-independent, then $WA \sim_k I$, so $W^{-1} \sim_k A$ as desired.

However, for the proof we need one more condition on A: We call A k-admissible if for each $r, s = 1, \ldots, n/k$ the (r, s) sub-k-matrix of A is either invertible or zero. Note that this is not a strong restriction: if we randomly choose A with coefficients drawn from a continuous distribution, then with probability one we get a k-admissible matrix, because the non-k-admissible matrices lie in a submanifold of $\mathbb{R}^{n^2}$ of dimension smaller than $n^2$.
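For concreteness, here is a small utility (an illustrative sketch, not from the paper) that extracts the (r, s) sub-k-matrices of a matrix, indexed from zero, and tests k-admissibility.

```python
import numpy as np

def sub_k_matrix(W, r, s, k):
    """(r, s) sub-k-matrix of W (0-based block indices)."""
    return W[r * k:(r + 1) * k, s * k:(s + 1) * k]

def is_k_admissible(A, k, tol=1e-10):
    """A is k-admissible if every k x k block is either invertible or zero."""
    n = A.shape[0]
    for r in range(n // k):
        for s in range(n // k):
            block = sub_k_matrix(A, r, s, k)
            if np.abs(block).max() > tol and abs(np.linalg.det(block)) < tol:
                return False        # nonzero but singular block
    return True
```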

Theorem 5.1 (Separability of multidimensional BSS). Let $A \in \mathrm{Gl}(n; \mathbb{R})$ and S a k-independent n-dimensional random vector having no Gaussian k-tuple $(S_{rk}, \ldots, S_{rk+k-1})^\top$. Assume that A is k-admissible. If AS is again k-independent, then A is k-equivalent to the identity.

For the case $k = 1$ this is linear BSS separability, because every matrix is 1-admissible.

Proof. Denote $X := AS$. Assume that $A \not\sim_k I$. Then there exist indices $r_1, r_2$ and s such that the $(r_1, s)$ and the $(r_2, s)$ sub-k-matrices of A are non-zero (hence in $\mathrm{Gl}(k; \mathbb{R})$ by k-admissibility). Applying Corollary 3.3 to the two random vectors $(X_{r_1k}, \ldots, X_{r_1k+k-1})^\top$ and $(X_{r_2k}, \ldots, X_{r_2k+k-1})^\top$ then shows that $(S_{sk}, \ldots, S_{sk+k-1})^\top$ is Gaussian, which is a contradiction.

Note that we could have used whitening to assume that A is orthogonal; however, there does not seem to be a direct way to exploit this in order to allow one fully Gaussian k-tuple, contrary to the complex ICA case (see Theorem 4.1).

6. Conclusion

Uniqueness and separability results play a central role in solving BSS problems, since they allow algorithms to apply ICA in order to uniquely (except for scaling and permutation) find the original mixing matrices. We have used a multidimensional version of the Skitovitch-Darmois theorem in order to calculate the indeterminacies of complex and of multidimensional ICA. In the multidimensional ICA case, an additional restriction was needed in the proof, which could be relaxed if Corollary 3.3 can be extended to allow matrices of arbitrary rank.

Acknowledgements

This research was supported by grants from the DFG (graduate college 'Nonlinear Dynamics') and the BMBF (project 'ModKog').

References

[1] A. Back, A. Tsoi, Blind deconvolution of signals using a complex recurrent network, in: Neural Networks for Signal Processing 4, Proceedings of the 1994 IEEE Workshop, 1994, pp. 565–574.
[2] E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for independent component analysis of complex-valued signals, Internat. J. Neural Systems 10 (1) (2000) 1–8.
[3] J. Cardoso, Multidimensional independent component analysis, in: Proceedings of ICASSP '98, Seattle, WA, May 12–15, 1998.
[4] A. Cichocki, S. Amari, Adaptive Blind Signal and Image Processing, Wiley, New York, 2002.
[5] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[6] G. Darmois, Analyse générale des liaisons stochastiques, Rev. Inst. Internat. Statist. 21 (1953) 2–8.
[7] S. Fiori, Blind separation of circularly distributed sources by neural extended APEX algorithm, Neurocomputing 34 (2000) 239–252.
[8] S. Ghurye, I. Olkin, A characterization of the multivariate normal distribution, Ann. Math. Statist. 33 (1962) 533–541.
[9] A. Hyvärinen, P. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neural Computation 12 (7) (2000) 1705–1720.
[10] A. Hyvärinen, P. Hoyer, M. Inki, Topographic independent component analysis, Neural Computation 13 (7) (2001) 1525–1558.
[11] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Wiley, New York, 2001.
[12] A. Kagan, Y. Linnik, C. Rao, Characterization Problems in Mathematical Statistics, Wiley, New York, 1973.
[13] E. Moreau, O. Macchi, Higher order contrasts for self-adaptive source separation, Internat. J. Adaptive Control Signal Process. 10 (1) (1996) 19–46.
[14] V. Skitovitch, On a property of the normal distribution, DAN SSSR 89 (1953) 217–219.
[15] V. Skitovitch, Linear forms in independent random variables and the normal distribution law, Izvestiia AN SSSR, Ser. Matem. 18 (1954) 185–200.
[16] P. Smaragdis, Blind separation of convolved mixtures in the frequency domain, Neurocomputing 22 (1998) 21–34.
[17] F. Theis, A new concept for separability problems in blind source separation, 2003, submitted for publication; preprint at http://homepages.uni-regensburg.de/~thf11669/publications/preprints/theis03linuniqueness.pdf
[18] A. Zinger, Investigations into analytical statistics and their application to limit theorems of probability theory, Ph.D. Thesis, Leningrad University, 1969.




Chapter 4

Neurocomputing 64:223-234, 2005

Paper F.J. Theis and P. Gruber. On model identifiability in analytic postnonlinear ICA. Neurocomputing, 64:223-234, 2005

Reference (Theis and Gruber, 2005)

Summary in section 1.2.2

89



Abstract

On model identifiability in analytic postnonlinear ICA

F.J. Theis*, P. Gruber

Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany

An important aspect of successfully analyzing data with blind source separation is to know the indeterminacies of the problem, that is, how the separating model is related to the original mixing model. If linear independent component analysis (ICA) is used, it is well known that the mixing matrix can be found in principle, but for more general settings not many results exist. In this work, considering only random variables with bounded densities, we prove identifiability of the postnonlinear mixing model with analytic nonlinearities and calculate its indeterminacies. A simulation confirms these theoretical findings.

Key words: postnonlinear independent component analysis, postnonlinear blind source separation, identifiability, separability, bounded random vectors

1 Introduction

Independent component analysis (ICA) finds statistically independent data within a given random vector. It is often applied to blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources.

In linear ICA, the mixing model can be written as

X_i = \sum_{j=1}^{n} a_{ij} S_j

* Corresponding author.
Email addresses: fabian@theis.name (F.J. Theis), petergruber@gmx.net (P. Gruber).

Preprint submitted to Elsevier Science, 13 October 2004



with independent sources S^T = (S_1, ..., S_n) and mixing matrix A = (a_{ij}). X is known, and the goal is to determine A and S. Traditionally, this model was only assumed to have decorrelated sources S, which leads to Principal Component Analysis (PCA). Hérault and Jutten [1] were the first to extend this model to the ICA case by proposing a neural algorithm based on nonlinear decorrelation. Since then, the field of ICA has become increasingly popular and many algorithms have been studied, see [2–6] to name but a few. Good textbook-level introductions to ICA are given in [7] and [8].

With the growth of the field, interest in nonlinear model extensions has increased. However, if the model is chosen to be too general, it cannot be identified uniquely. A good trade-off between model generalization and identifiability is given by the so-called postnonlinear BSS model realized by

X_i = f_i\left( \sum_{j=1}^{n} a_{ij} S_j \right).

This explicit nonlinear model implies that in addition to the linear mixing situation, each sensor X_i contains an unknown nonlinearity f_i that can further distort the observation. This model, first proposed by Taleb and Jutten [9], has applications in telecommunication and biomedical data analysis. Algorithms for reconstructing postnonlinearly mixed sources include [9–13].

One major problem of ICA-based BSS lies in the question of model identifiability and separability, that is, whether the model respectively the sources are uniquely determined by the observations X alone (except for trivial indeterminacies such as permutation and scaling). This problem is of key importance for any ICA algorithm, because if such an algorithm indeed finds a possible mixing model for X, without identifiability the so-recovered sources would not have to coincide at all with the original sources. For linear ICA, real-valued model identifiability has been shown by Comon [3], given that X contains at most one Gaussian. The proof uses the rather nontrivial Darmois-Skitovitch theorem; however, a more direct elementary proof is possible as well [14]. A generalization to complex-valued random variables is given in [15]. Postnonlinear identifiability has been considered in [9]; however, in that formulation the proof contains an inaccuracy rendering it applicable only to quite restricted situations.

In this work, we will analyze separability of postnonlinear mixtures. We thereby generalize ideas already presented by Babaie-Zadeh et al. [10], where the focus was put on the development of an actual identification algorithm. Babaie-Zadeh was the first to use the method of analyzing bounded random vectors in the context of postnonlinear mixtures [16]¹. There, he already discussed

¹ His PhD thesis is available online at http://www.lis.inpg.fr/stages dea theses/theses/manuscript/babaie-zadeh.pdf



identifiability issues, albeit explicitly only in the two-dimensional analytic case. Extending his ideas, we are able to find a new necessary condition, which we name 'absolutely degenerate' (see definition 6), for identifying the mixing structure using only the boundary. This, together with the generalization to arbitrary dimensions, is our main contribution here, stated in theorem 7.

The paper is arranged as follows: Section 2 presents a simple result about homogeneous functions and shortly discusses linear identifiability in the case of bounded random variables. Section 3 states the postnonlinear separability problem, which is then proved in the following section for real-valued random vectors. In section 5, a simulation confirming the main separability theorem is presented.

Postnonlinear separability is important for any postnonlinear ICA algorithm, so we focus only on this question. We do not propose an explicit postnonlinear identification algorithm but instead refer to [9–13] for both algorithms and simulations.

2 Basics

For n ∈ N let Gl(n) be the general linear group of R^n, i.e. the group of invertible real (n × n)-matrices. An invertible matrix L ∈ Gl(n) is said to be a scaling matrix if it is diagonal. We say two (m × n)-matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n).
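A minimal numerical sketch, not part of the paper, of how one might test the equivalence B ∼ C for invertible B: since C = BPL, the matrix B^{-1}C must be a permutation times a scaling, i.e. have exactly one non-negligible entry in every row and column. Function names and the tolerance are illustrative assumptions.

```python
import numpy as np

def is_scaled_permutation(M: np.ndarray, tol: float = 1e-8) -> bool:
    """True if M = PL: exactly one non-negligible entry per row and per column."""
    mask = np.abs(M) > tol
    return bool(mask.sum(axis=0).max() == 1 and mask.sum(axis=1).max() == 1
                and mask.sum() == M.shape[0])

def equivalent(B: np.ndarray, C: np.ndarray, tol: float = 1e-8) -> bool:
    """Check B ~ C, i.e. C = B P L, by testing whether B^{-1} C is a scaled permutation."""
    return is_scaled_permutation(np.linalg.solve(B, C), tol)

B = np.array([[1.0, 2.0], [3.0, 4.0]])
P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation matrix
L = np.diag([2.0, -0.5])                 # scaling matrix
print(equivalent(B, B @ P @ L))          # True
print(equivalent(B, B + 1.0))            # False (generically)
```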

Definition 1 Given a function f : U → R, assume there exist a, b ∈ R such that at least one of them is not of absolute value 0 or 1. If f(ax) = bf(x) for all x ∈ U with ax ∈ U, then f is said to be (a, b)-homogeneous or simply homogeneous.

The following lemma characterizing homogeneous functions is from [10]. However, we added the correction excluding the cases |a| or |b| ∈ {0, 1}, because in these cases homogeneity does not induce such strong results. This lemma can be generalized to continuously differentiable functions, so the strong assumption of analyticity is not needed, but it shortens the proof.

Lemma 2 [10] Let f : U → R be an analytic function that is (a, b)-homogeneous on [0, ε) with ε > 0. Then there exist c ∈ R and n ∈ N ∪ {0} such that f(x) = cx^n for all x ∈ U.

PROOF. If |a| ∈ {0, 1} or b = 0, then obviously f ≡ 0. If b = −1 then f ≡ 0: since |a| ∉ {0, 1}, f(a^2 x) = f(x) and f is continuous, f is constant, but f(ax) = −f(x) then implies f ≡ 0. In the case b = 1, again f is constant since f(ax) = f(x) and a^0 = 1 = b.

By differentiating the homogeneity equation m times we get b f^{(m)}(x) = a^m f^{(m)}(ax), where f^{(m)} denotes the m-th derivative of f. Evaluating this at 0 yields b f^{(m)}(0) = a^m f^{(m)}(0). Since f is assumed to be analytic, f is determined uniquely by its derivatives at 0. Now either there exists an n ≥ 0 with b = a^n, hence f^{(m)}(0) = 0 for all m ≠ n and therefore f(x) = cx^n, or else f ≡ 0. ✷
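A small numerical illustration of lemma 2, not from the paper (the concrete values of a, n, c are arbitrary assumptions): a monomial cx^n satisfies the (a, a^n)-homogeneity relation exactly, while an analytic non-monomial does not.

```python
import numpy as np

a, n, c = 0.5, 3, 2.0          # hypothetical choices; the lemma forces b = a**n
b = a ** n
x = np.linspace(0.0, 1.0, 11)

f_mono = lambda x: c * x ** n          # a monomial: exactly (a, b)-homogeneous
f_other = lambda x: c * x ** n + x     # analytic but not a monomial

print(np.allclose(f_mono(a * x), b * f_mono(x)))    # True
print(np.allclose(f_other(a * x), b * f_other(x)))  # False
```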

Definition 3 [10] We call a random vector X with density p_X bounded if its density p_X is bounded. Denote by supp p_X := {x | p_X(x) ≠ 0} the support of p_X, i.e. the closure of the non-zero points of p_X.

We further call an independent random vector X fully bounded if supp p_{X_i} is an interval for all i. So we get supp p_X = [a_1, b_1] × ... × [a_n, b_n].

Since a connected component of supp p_X induces a restricted, fully bounded random vector, without loss of generality we will in the following assume to have fully bounded densities. In the case of linear instantaneous BSS the following separability result is well known and can be derived from a more general version of this theorem for non-bounded densities [3]. But in the context of fully bounded random vectors, this follows already from the fact that in this case independence is equivalent to having support within a cube with sides parallel to the coordinate planes, and only matrices equivalent to the identity leave this property invariant:

Theorem 4 (Separability of bounded linear BSS) Let M ∈ Gl(n) be an invertible matrix and S a fully bounded independent random vector. If MS is again independent, then M is equivalent to the identity.

This theorem indeed proves separability of the linear ICA model, because if X = AS and W is a demixing matrix such that WX is independent, then M := WA ∼ I, so W^{-1} ∼ A as desired. As the model is invertible and the indeterminacies are trivial, identifiability and uniqueness follow directly.
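The geometric picture behind theorem 4 can be sampled numerically. The following sketch is not from the paper and only illustrative (it assumes scipy is available for the convex hull): a permutation-plus-scaling of fully bounded uniform sources keeps the support an axis-parallel box, while a genuinely mixing matrix tilts it, so independence is lost.

```python
import numpy as np
from scipy.spatial import ConvexHull

def fills_bounding_box(Y: np.ndarray, tol: float = 0.05) -> bool:
    """Heuristic check: does the sample support (almost) fill its axis-parallel box?"""
    hull_area = ConvexHull(Y).volume                  # 'volume' is the area in 2D
    box_area = np.prod(Y.max(axis=0) - Y.min(axis=0))
    return hull_area >= (1.0 - tol) * box_area

rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))            # fully bounded independent sources

PL = np.array([[0.0, 2.0], [-0.5, 0.0]])               # permutation times scaling
A = np.array([[1.0, 1.0], [2.0, -2.0]])                # genuinely mixing matrix

print(fills_bounding_box(S @ PL.T))   # True: support stays a box, MS can be independent
print(fills_bounding_box(S @ A.T))    # False: tilted parallelogram, MS not independent
```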

3 Separability of postnonlinear BSS

In this section we introduce the postnonlinear BSS model and further discuss its identifiability.

Definition 5 [9] A function f : R^n → R^n is called diagonal or component-wise if each component f_i(x) of f(x) depends only on the variable x_i.


In this case we often omit the other variables and write f(x_1, ..., x_n) = (f_1(x_1), ..., f_n(x_n)) or f = f_1 × ... × f_n.

Consider now the postnonlinear blind source separation model:

X = f(AS)

where again S is an independent random vector, A ∈ Gl(n) and f is a diagonal nonlinearity. We assume the components f_i of f to be injective analytic functions with non-vanishing derivatives. Then the f_i^{-1} are also analytic.

Definition 6 Let A ∈ Gl(n) be an invertible matrix. Then A is said to be mixing if A has at least two nonzero entries in each row². And A = (a_{ij})_{i,j=1...n} is said to be absolutely degenerate if there are two columns l ≠ m such that a_{il}^2 = λ a_{im}^2 for all i with some λ ≠ 0, i.e. the normalized columns differ only by the signs of the entries.
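The two conditions of definition 6 are easy to check numerically. This is a minimal sketch, not part of the paper; function names and tolerances are illustrative assumptions.

```python
import numpy as np

def is_mixing(A: np.ndarray, tol: float = 1e-12) -> bool:
    """At least two nonzero entries in every row (definition 6)."""
    return bool(((np.abs(A) > tol).sum(axis=1) >= 2).all())

def is_absolutely_degenerate(A: np.ndarray, tol: float = 1e-10) -> bool:
    """Some pair of columns l != m whose squared entries are proportional."""
    n = A.shape[1]
    for l in range(n):
        for m in range(l + 1, n):
            u, v = A[:, l] ** 2, A[:, m] ** 2
            # proportional squared columns <=> the stacked 2-column matrix has rank 1
            if np.linalg.matrix_rank(np.column_stack([u, v]), tol=tol) == 1:
                return True
    return False

A_bad = np.array([[1.0, 1.0], [2.0, -2.0]])   # the matrix used in the example below
print(is_mixing(A_bad), is_absolutely_degenerate(A_bad))   # True True
```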

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can only be reconstructed up to scaling and permutation. Here of course additional indeterminacies come into play because of translation: f_i can only be recovered up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(AS) = (f ∘ L)((L^{-1}A)S), so f and A can interchange scaling factors in each component. Another indeterminacy could occur if A is not mixing, i.e. at least one observation x_i contains only one source; in this case f_i can obviously not be recovered. For example, if A = I, then f(S) is already again independent, because independence is invariant under component-wise nonlinear transformation; so f cannot be found using this method.

A not so obvious indeterminacy occurs if A is absolutely degenerate. Then only the matrix A but not the nonlinearities can be recovered by looking at the edges of the support of the fully-bounded random vector. For example, consider the case n = 2,

A = \begin{pmatrix} 1 & 1 \\ 2 & -2 \end{pmatrix}

and the analytic function

f(x_1, x_2) = \left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1),\; x_2 + \tfrac{1}{\pi}\sin\!\left(\tfrac{\pi x_2}{2}\right) \right).

Then A^{-1} ∘ f ∘ A maps [0, 1]^2 onto [0, 1]^2. Since both components of f are injective, we can verify this by looking at the edges:

² A slightly more general definition of 'mixing' can be given that still guarantees identifiability of the sources; it is however omitted for the sake of simplicity.

[Figure 1: panels show the graphs of f_1, f_2 and the supports of f(AS) and A^{-1}f(AS).]

Fig. 1. Example of a postnonlinear transformation using an absolutely degenerate matrix A and uniform sources S in [0, 1]².

f ∘ A(x_1, 0) = \left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1),\; 2x_1 + \tfrac{1}{\pi}\sin(\pi x_1) \right) = (1, 2)\left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1) \right)

f ∘ A(0, x_2) = (1, -2)\left( x_2 + \tfrac{1}{2\pi}\sin(\pi x_2) \right)

f ∘ A(x_1, 1) = (1, -2) + (1, 2)\left( x_1 - \tfrac{1}{2\pi}\sin(\pi x_1) \right)

f ∘ A(1, x_2) = (1, 2) + (1, -2)\left( x_2 - \tfrac{1}{2\pi}\sin(\pi x_2) \right)

So we have constructed a situation in which two uniform sources are mixed by f ∘ A, see figure 1. They can be separated either by A^{-1} ∘ f^{-1} or by A^{-1} alone. We have shown that the latter also preserves the boundary, although it contains a different postnonlinearity (namely the identity) in contrast to f^{-1} in the former model. Nonetheless this is no indeterminacy of the model itself, since A^{-1}f(AS) is obviously not independent. So by looking at the boundary alone, we sometimes cannot detect independence if the whole system is highly symmetric. This is the case if A is absolutely degenerate. In our example f was chosen such that the non-trivial postnonlinear mixture looks linear (at the boundary), and this was possible due to the inherent symmetry in A.

If we however assume that A is mixing and not absolutely degenerate, then we will show for all fully-bounded sources S that, except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist. Note that if f is only assumed to be continuously differentiable, then additional indeterminacies come into play.


4 Separability of bounded postnonlinear BSS

In this section we prove separability of postnonlinear BSS; in the proof we will see how the two conditions from definition 6 turn out to be necessary.

Theorem 7 (Separability of bounded postnonlinear BSS) Let A, W ∈ Gl(n), one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function with h_i' ≠ 0, and let S be a fully bounded independent random vector. If W(h(AS)) is independent, then there exist a scaling L ∈ Gl(n) and v ∈ R^n with LA ∼ W^{-1} and h(x) = Lx + v.

So let f ∘ A be the mixing model and W ∘ g the separating model. Putting the two together we get the above mixing-separating model with h := g ∘ f. The theorem shows that if the mixing-separating model preserves independence then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to scaling). As usual, the model is assumed to be invertible, hence identifiability and uniqueness of the model follow from the separability.

Definition 8 A subset P ⊂ R^n is called a parallelepiped if it is the linear image of a cube, that is

P = A([a_1, b_1] × ... × [a_n, b_n])

for a_i < b_i, i = 1, ..., n and A ∈ Gl(n). A parallelepiped P is said to be tilted if A is mixing and no 2 × 2-minor is absolutely degenerate. Let i ≠ j ∈ {1, ..., n} and c ∈ {a_1, b_1} × ... × {a_n, b_n}; then

A( {c_1} × ... × [a_i, b_i] × ... × [a_j, b_j] × ... × {c_n} )

is called a 2-face of P and A(c) is called a corner of P. If n = 2 the parallelepipeds are called parallelograms.

Lemma 9 Let f_1, ..., f_n be n one-dimensional analytic injective functions with f_i' ≠ 0, and let f := f_1 × ... × f_n be the induced injective mapping on R^n. Let P, Q ⊂ R^n be two parallelepipeds, one of them tilted. If f(P) = Q (or equivalently for the boundaries f(∂P) = ∂Q), then f is affine linear diagonal.

Here ∂P denotes the boundary of the parallelepiped P, i.e. the set of points of P not lying in its interior (which coincides with the union of its faces).

In the proof we see that the requirement for P or Q being tilted can be weakened slightly. It would suffice that enough 2-minors are not absolutely degenerate. Nevertheless the set of mixing matrices having no absolutely degenerate 2 × 2-minors is very large in the sense that its complement has measure zero in Gl(n).


Note that the tiltedness is essential: for example let P = \begin{pmatrix} 1 & 2 \\ 1 & 0 \end{pmatrix} [0, 1]^2 and take any f_1 with

f_1(x) = \tfrac{3}{2}x for x < 1,   f_1(x) = \tfrac{3}{2}x - 1 for x > 2,

and f_2(x) := x. Then Q is a parallelogram and its corners are (0, 0), (3/2, 1), (2, 0) and (7/2, 1), which is not a scaled version of P.

Note that the lemma extends the lemma proved in [10] by adding the condition of absolute degeneracy; this is in fact a necessary condition, as shown in figure 1.

PROOF. [Lemma 9 for n = 2] Obviously, images of non-tilted parallelograms under diagonal mappings are again non-tilted. f is invertible, so we can assume that both P and Q are tilted. Without loss of generality, using the scaling and translation invariance of our problem, we may assume that

∂P = \begin{pmatrix} 1 & 1 \\ a_1 & a_2 \end{pmatrix} ∂([0, 1] × [0, c]),   ∂Q = \begin{pmatrix} 1 & 1 \\ b_1 & b_2 \end{pmatrix} ∂([0, 1] × [0, d]),

with a_i, b_i ∈ R \ {0}, a_1^2 ≠ a_2^2, b_1^2 ≠ b_2^2, ca_2, db_2 > 0 and c ≤ 1, and

f(0) = 0,  f(1, a_1) = (1, b_1),  f(c, ca_2) = (d, db_2)

(i.e. the vertices of P are mapped onto the vertices of Q in the specified order). Note that the vertices of P have to be mapped onto vertices of Q because f is at least continuously differentiable. Since the f_i are monotonic we also have d ≤ 1, and a_1 < 0 implies b_1 < 0.

It follows that f maps the four separate edges of ∂P onto the corresponding edges of ∂Q: f(t, a_1 t) = (g_1(t), b_1 g_1(t)) and f(ct, ca_2 t) = (d g_2(t), d b_2 g_2(t)) for t ∈ [0, 1]. Here g_i : [0, 1] → [0, 1] is a strictly monotonically increasing parametrization of the respective edge. It follows that g_1(t) = f_1(t) and d g_2(t) = f_1(ct), and therefore f_2(a_1 t) = b_1 f_1(t) and f_2(c a_2 t) = b_2 f_1(ct) for t ∈ [0, 1]. Therefore we get an equation for both components of f, e.g. for the second:

f_2\!\left( \tfrac{a_1}{a_2} t \right) = \tfrac{b_1}{b_2} f_2(t)   for t ∈ [0, c a_2].

So f_2 is (a_1/a_2, b_1/b_2)-homogeneous with coefficients not in {-1, 0, 1} by assumption; according to lemma 2, f_2 and then also f_1 are homogeneous polynomials (everywhere, due to analyticity). By assumption f_i'(0) ≠ 0, hence the f_i are even linear.

We have used the translation invariance above, so in general f is an affine linear scaling. ✷


PROOF. [Lemma 9 for arbitrary n] Again note that since diagonal maps preserve non-tiltedness we can assume that P and Q are tilted. Let π_{ij} : R^n → R^2 be the projection onto the i, j-coordinates. Note that for any corner c and i ≠ j there is a 2-face P_{ijc} of P containing c such that π_{ij}(P_{ijc}) is a parallelogram. In fact, since P is tilted, π_{ij}(P_{ijc}) is also tilted. Since f is smooth, π_{ij}(f(P_{ijc})) is also (the projection of) a 2-face of Q and again tilted.

For each corner c of P and i ≠ j ∈ {1, ..., n} we can apply the n = 2 version of this lemma to π_{ij}(P_{ijc}) and π_{ij}(f(P_{ijc})). Therefore f_i and f_j are affine linear on π_i(P_{ijc}) and π_j(P_{ijc}). Now π_i(P) ⊂ ∪_{c,j} π_i(P_{ijc}), and hence f_i is affine linear on π_i(P), which proves that f is affine linear diagonal. ✷

Now we are able to show the separability theorem:

PROOF. [Theorem 7] S is bounded and W ∘ h ∘ A is continuous, so T := W(h(AS)) is bounded as well. Furthermore, since S is fully bounded, T is also fully bounded. Then, as seen in section 2, supp S and supp T are rectangles with boundaries parallel to the coordinate axes. Hence P := A(supp S) and Q := W^{-1}(supp T) are parallelograms. One of them is tilted, because otherwise A and W^{-1} would not be mixing.

As W ∘ h ∘ A maps supp S onto supp T, h maps the set A supp S onto W^{-1} supp T, i.e. h(P) = Q. Then by lemma 9, h is affine linear diagonal, say h(x) = Lx + v for x ∈ P with L ∈ Gl(n) scaling and v ∈ R^n.

So W(h(AS)) = WLAS + Wv is independent, and therefore also WLAS. By theorem 4, WLA ∼ I, so there exist a scaling L' and a permutation P' with WLA = L'P', as had to be shown. ✷

5 Simulation

In order to demonstrate the validity of theorem 7, we carry out a simple simulation in this section. We mix two independent random variables using a known mixing model f and A. However, f = f_{(p_0,q_0)} is taken from a parameterized family f_{(p,q)} of nonlinearities, which enables us to test numerically whether in the separation system really only f^{-1}_{(p_0,q_0)} can fully separate the data. So we unmix the data using inverses f^{-1}_{(p,q)} of members of this family and A^{-1}. The following simulation will show that the mutual information of the recoveries is minimal at (p_0, q_0), i.e. that f is determined uniquely by X (within this family at least), as stated by theorem 7.

[Figure 2: left panel 'Mutual information of recovered sources', a contour map of -log MI(A^{-1} ∘ (f_p^{-1} × g_q^{-1}) ∘ (f_{0.5} × g_{0.5}) ∘ A (S)) over the recovery parameters (p, q), with a zoom around p = q = 0.5; right panel, the nonlinearities f_p and g_q for p, q = 0.1 and 1.0.]

Fig. 2. Simulation of the separability result using two families of nonlinearities with f_p(x) = \frac{1}{10p}\log\frac{x + \sqrt{x^2 + 4e^{-20p}}}{2e^{-10p}} and g_q(y) = \frac{y}{4}\left|\frac{y}{4}\right|^{3q-0.5}. The left plot displays a color plot together with overlayed contours of a separation measure depending on the parameters p, q used for recovery. The separation quality is measured using the negative logarithm of the mutual information of the recovered sources. The region around the separation point p = q = 0.5 is also displayed in more detail.

The components of the postnonlinearity will be taken from two families of functions described by

f_p(x) = \frac{1}{10p}\log\frac{x + \sqrt{x^2 + 4e^{-20p}}}{2e^{-10p}}   and   g_q(y) = \frac{y}{4}\left|\frac{y}{4}\right|^{3q-0.5}

with p, q varying between 0 and 1. The first component of the nonlinearity, f_p, models a sensor which saturates with varying strength, and the second component g_q describes a polynomial activation of the sensor with varying degree, see figure 2, right hand side.
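For reference, a minimal sketch of these two families and their inverses; the closed-form inverses are derived here (not given explicitly in the paper) and the round-trip check is only an illustrative assumption about how one might implement the recovery step.

```python
import numpy as np

def f_p(x, p):
    """Saturating sensor family from the text."""
    return np.log((x + np.sqrt(x**2 + 4*np.exp(-20*p))) / (2*np.exp(-10*p))) / (10*p)

def g_q(y, q):
    """Polynomial activation family from the text."""
    return (y/4) * np.abs(y/4)**(3*q - 0.5)

def f_p_inv(u, p):
    # f_p(x) = arcsinh(x / (2 e^{-10p})) / (10p), hence the inverse below
    return 2*np.exp(-10*p) * np.sinh(10*p*u)

def g_q_inv(z, q):
    return 4 * np.sign(z) * np.abs(z)**(1/(3*q + 0.5))

x = np.linspace(-3, 3, 7)
p, q = 0.5, 0.5                               # the parameters of the mixing model below
print(np.allclose(f_p_inv(f_p(x, p), p), x))  # True
print(np.allclose(g_q_inv(g_q(x, q), q), x))  # True
```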

In the simulation, an independent uniformly distributed random vector (3000 samples in [-1, 1]^2) is mixed postnonlinearly by the matrix A = \begin{pmatrix} 2.6 & 1.4 \\ 0.7 & 3.3 \end{pmatrix} and the diagonal nonlinearity

f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \frac{1}{5}\log\!\left(\frac{e^5}{2}x + \frac{1}{2}\sqrt{e^{10}x^2 + 4}\right) \\ \frac{1}{16}\, y\,|y| \end{pmatrix} = \begin{pmatrix} f_{0.5}(x) \\ g_{0.5}(y) \end{pmatrix}.

To recover the sources, the family (f_p, g_q)^{-1} of diagonal nonlinearities is used together with the inverse of A.

[Figure 3: panels 'Relative volume error' (-log vol(∆_{p,q}) over (p, q)), 'Original', and 'Unmixing with p = 0.65, q = 0.5', 'p = 0.65, q = 0.65', 'p = 0.5, q = 0.65'.]

Fig. 3. The top-left image graphs the separation error ∆_{p,q} by measuring the negative logarithm of its volume. The error ∆_{p,q} denotes the difference set of points from either the support of the recovered source distribution or a quadrangle with the same vertices. The other plots show the distributions of the recovered sources at some combinations of the parameters p, q of the nonlinearities. At p = 0.5, q = 0.5 this is the original source distribution. Here light grey areas represent ∆_{p,q} and dark grey areas the intersection of the support and the quadrangle.

A simple estimator for mutual information based on histogram estimation of the entropy (with 10 bins in each dimension) is used to check the independence of the recovered sources. A more elaborate histogram-based estimator by Moddemeijer [17] yields similar results. As shown in figure 2, the mutual information of the recovered sources is minimal at the parameters p = q = 0.5, which correspond to the mixing model. It can also be noticed that the minimum is much less distinct in the second component. This indicates that in numerical application it should be easier to detect nonlinear functions which are bounded.
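A minimal sketch of the kind of histogram-based plug-in estimator described above; the 10-bin choice follows the text, while the function name, seed and test data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mutual_information_hist(x, y, bins=10):
    """Plug-in estimate MI = sum p(x,y) log( p(x,y) / (p(x)p(y)) ) from a 2D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(3000, 2))
print(mutual_information_hist(s[:, 0], s[:, 1]))                  # close to 0 (independent)
print(mutual_information_hist(s[:, 0], s[:, 0] + 0.1 * s[:, 1]))  # clearly positive
```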

The second graph (figure 3) further illustrates that the criterion for the borders<br />


to be of quadrangular shape is sufficient. This gives the idea of a possible postnonlinear ICA algorithm which minimizes the non-quadrangularity of the support of the estimated sources. As pictured in the graph, this can for example be achieved by minimizing the volume of the mutual differences (i.e. the points which are in the union but not in the intersection). It can easily be seen that this minimization yields the same solution as minimizing the mutual information. For more details on such an algorithm we refer to [10, 13, 16].

6 Conclusion

We have presented a new separability result for postnonlinear bounded mixtures that is based on the analysis of the borders of the mixture density. We hereby formalize and extend ideas already presented in [10]. We introduce the notion of absolutely degenerate mixing matrices. Using this we identify the restrictions of separability and also of algorithms that only use border analysis for postnonlinearity detection. This also represents a drawback of the algorithms proposed in [10] and [13], to which we want to refer for experimental results using border detection in postnonlinear settings.

In future works we will show how to relax the condition of analytic postnonlinearities to only continuously differentiable functions; also, preliminary results indicate how to generalize these results to complex-valued random vectors and mixtures. We further plan to extend this model to the case of group ICA [18], where independence is only assumed among groups of sources. In the linear case, this has been done in [15]; however, the extension to postnonlinearly mixed sources is yet unclear.

Acknowledgements

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. FT further would like to thank Christian Jutten for the helpful discussions during the preparation of this paper. Financial support by the BMBF in the project 'ModKog' is gratefully acknowledged.

References

[1] J. Hérault, C. Jutten, Space or time adaptive signal processing by neural network models, in: J. Denker (Ed.), Neural Networks for Computing. Proceedings of the AIP Conference, American Institute of Physics, New York, 1986, pp. 206–211.
[2] J.-F. Cardoso, A. Souloumiac, Blind beamforming for non-Gaussian signals, IEEE Proceedings - Part F 140 (1993) 362–370.
[3] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[4] A. Bell, T. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Computation 7 (1995) 1129–1159.
[5] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[6] F. Theis, A. Jung, C. Puntonet, E. Lang, Linear geometric ICA: Fundamentals and algorithms, Neural Computation 15 (2003) 419–439.
[7] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[8] A. Cichocki, S. Amari, Adaptive blind signal and image processing, John Wiley & Sons, 2002.
[9] A. Taleb, C. Jutten, Indeterminacy and identifiability of blind identification, IEEE Transactions on Signal Processing 47 (10) (1999) 2807–2820.
[10] M. Babaie-Zadeh, C. Jutten, K. Nayebi, A geometric approach for separating post non-linear mixtures, in: Proc. of EUSIPCO '02, Vol. II, Toulouse, France, 2002, pp. 11–14.
[11] S. Achard, D.-T. Pham, C. Jutten, Quadratic dependence measure for nonlinear blind sources separation, in: Proc. of ICA 2003, Nara, Japan, 2003, pp. 263–268.
[12] A. Ziehe, M. Kawanabe, S. Harmeling, K.-R. Müller, Blind separation of post-nonlinear mixtures using linearizing transformations and temporal decorrelation, Journal of Machine Learning Research 4 (2003) 1319–1338.
[13] F. Theis, E. Lang, Linearization identification and an application to BSS using a SOM, in: Proc. ESANN 2004, d-side, Evere, Belgium, Bruges, Belgium, 2004, pp. 205–210.
[14] F. Theis, A new concept for separability problems in blind source separation, Neural Computation 16 (2004) 1827–1850.
[15] F. Theis, Uniqueness of complex and multidimensional independent component analysis, Signal Processing 84 (5) (2004) 951–956.
[16] M. Babaie-Zadeh, On blind source separation in convolutive and nonlinear mixtures, Ph.D. thesis, Institut National Polytechnique de Grenoble (2002).
[17] R. Moddemeijer, On estimation of entropy and mutual information of continuous distributions, Signal Processing 16 (3) (1989) 233–246.
[18] J. Cardoso, Multidimensional independent component analysis, in: Proc. of ICASSP '98, Seattle, 1998.



Chapter 5

IEICE TF E87-A(9):2355-2363, 2004

Paper F.J. Theis and W. Nakamura. Quadratic independent component analysis. IEICE Trans. Fundamentals, E87-A(9):2355-2363, 2004

Reference (Theis and Nakamura, 2004)

Summary in section 1.2.2

103



IEICE TRANS. FUNDAMENTALS, VOL.E87-A, NO.9 SEPTEMBER 2004

PAPER: Special Section on Nonlinear Theory and its Applications

Quadratic independent component analysis

SUMMARY The transformation of a data set using a second-order polynomial mapping to find statistically independent components is considered (quadratic independent component analysis or ICA). Based on overdetermined linear ICA, an algorithm together with separability conditions are given via linearization reduction. The linearization is achieved using a higher-dimensional embedding defined by the linear parametrization of the monomials, which can also be applied for higher-order polynomials. The paper finishes with simulations for artificial data and natural images.

key words: nonlinear independent component analysis, quadratic forms, nonlinear blind source separation, overdetermined blind source separation, natural images

1. Introduction<br />

The task of transform<strong>in</strong>g a random vector <strong>in</strong>to an <strong>in</strong>dependent<br />

one is called <strong>in</strong>dependent component analysis<br />

(ICA). ICA has been well-studied <strong>in</strong> the case of l<strong>in</strong>ear<br />

transformations [3, 12].<br />

Nonl<strong>in</strong>ear demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components<br />

is an important extension of l<strong>in</strong>ear ICA and we still do<br />

not have sufficient knowledge of this problem. It is easy<br />

to see that without any restrictions the class of ICA solutions<br />

is too large to be of any practical use [13]. Hence<br />

<strong>in</strong> nonl<strong>in</strong>ear ICA, special nonl<strong>in</strong>earities are usually considered.<br />

In this paper, we treat polynomial nonl<strong>in</strong>earities,<br />

especially second-order monomials or quadratic<br />

forms. These represent a relatively simple class of nonl<strong>in</strong>earities,<br />

which can be <strong>in</strong>vestigated <strong>in</strong> detail.<br />

Several studies have employed quadratic forms as a<br />

generative process of data. Abed-Meraim et al. [1] suggested<br />

analyz<strong>in</strong>g mixtures by second-order polynomials<br />

us<strong>in</strong>g a l<strong>in</strong>earization <strong>in</strong> a similar way as we <strong>in</strong>troduce<br />

<strong>in</strong> section 3.2, but for the mixtures, which <strong>in</strong> general<br />

destroys the assumption of <strong>in</strong>dependence. Leshem [15]<br />

proposed a whiten<strong>in</strong>g scheme based on quadratic forms<br />

<strong>in</strong> order to enhance l<strong>in</strong>ear separation of time-signals<br />

<strong>in</strong> algorithms such as SOBI. Similar quadratic mix<strong>in</strong>g<br />

models are also considered <strong>in</strong> [8] and [10]. These are<br />

Manuscript received December 21, 2003.<br />

Manuscript revised April 2, 2004.<br />

F<strong>in</strong>al manuscript received May 21, 2004.<br />

† The authors are with the Lab. for Advanced Bra<strong>in</strong> Signal<br />

Process<strong>in</strong>g, Bra<strong>in</strong> Science Institute, RIKEN, Wako-shi,<br />

Saitama 351-0198 Japan.<br />

∗ On leave from the Institute of Biophysics, University<br />

of Regensburg, 93040 Regensburg, Germany.<br />

a) E-mail: fabian@theis.name<br />

b) E-mail: wakakoh@bra<strong>in</strong>.riken.jp<br />

Fabian J. THEIS †∗a) and Wakako NAKAMURA †b) , Nonmembers<br />

studies <strong>in</strong> which the mix<strong>in</strong>g model is assumed to be<br />

quadratic <strong>in</strong> contrast to the quadratic unmix<strong>in</strong>g model<br />

used <strong>in</strong> this paper.<br />

For demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components by<br />

quadratic forms, Bartsch and Obermayer [2] suggested<br />

apply<strong>in</strong>g l<strong>in</strong>ear ICA to second-order terms of data.<br />

Hashimoto [9] suggested an algorithm based on m<strong>in</strong>imization<br />

of Kullback-Leibler divergence. However, <strong>in</strong><br />

these studies, the generative model of data was not def<strong>in</strong>ed<br />

and the <strong>in</strong>terpretation of signals obta<strong>in</strong>ed by the<br />

separation was not given clearly; the focus was on the<br />

application to natural images. In this paper, we exam<strong>in</strong>e<br />

this quadratic demix<strong>in</strong>g model. We def<strong>in</strong>e both<br />

generative model and demix<strong>in</strong>g process of data explicitly<br />

to assume a one-to-one correspondence of the <strong>in</strong>dependent<br />

components with data. Us<strong>in</strong>g the analysis<br />

of overdeterm<strong>in</strong>ed l<strong>in</strong>ear ICA, we discuss identifiability<br />

of this quadratic demix<strong>in</strong>g model. We confirm that the<br />

algorithm proposed by Bartsch and Obermayer [2] can<br />

estimate the mix<strong>in</strong>g process and retrieve the <strong>in</strong>dependent<br />

components correctly by simulation with artificial<br />

data. We also apply the quadratic demix<strong>in</strong>g to natural<br />

image data.<br />

The paper is organized as follows: <strong>in</strong> the next section<br />

results about overdeterm<strong>in</strong>ed ICA that is ICA of<br />

more sensors than sources are recalled and extended.<br />

Section 3 then <strong>in</strong>troduces the quadratic ICA model and<br />

provides a separability result and an algorithm. The algorithms<br />

are then applied for artificial and natural data<br />

sets <strong>in</strong> section 4.<br />
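The linearization reduction mentioned in the summary is not spelled out in this excerpt; the following minimal sketch is an assumption about that construction (using all first- and second-order monomials as the embedding) and only illustrates how a quadratic demixing problem can be recast as an overdetermined linear one.

```python
import numpy as np

def monomial_embedding(X: np.ndarray) -> np.ndarray:
    """Map each sample x in R^n to (x_i) and (x_i * x_j, i <= j): any quadratic
    form in x becomes a linear function of this higher-dimensional embedding."""
    n = X.shape[1]
    idx = [(i, j) for i in range(n) for j in range(i, n)]
    quad = np.stack([X[:, i] * X[:, j] for i, j in idx], axis=1)
    return np.hstack([X, quad])

X = np.random.default_rng(0).standard_normal((1000, 3))
Z = monomial_embedding(X)    # shape (1000, 3 + 6) = (1000, 9)
# a quadratic function y = x^T Q x + b^T x is linear in Z, so overdetermined
# linear ICA techniques can be applied to Z instead of X
print(Z.shape)
```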

2. Overdeterm<strong>in</strong>ed ICA<br />

Before def<strong>in</strong><strong>in</strong>g the polynomial model, we have to study<br />

<strong>in</strong>determ<strong>in</strong>acies and algorithms of overdeterm<strong>in</strong>ed <strong>in</strong>dependent<br />

component analysis. Its goal lies <strong>in</strong> the transformation<br />

of a given random vector x to an <strong>in</strong>dependent<br />

one with lower dimension. Overdeterm<strong>in</strong>ed ICA is usually<br />

applied to solve the overdeterm<strong>in</strong>ed bl<strong>in</strong>d source<br />

separation (overdeterm<strong>in</strong>ed BSS) problem, where x is<br />

known to be a mixture of a lower number of <strong>in</strong>dependent<br />

source signals s. Overdeterm<strong>in</strong>ed ICA <strong>in</strong> the context<br />

of BSS is well known and understood [5, 14], but, to the knowledge of the authors, the indeterminacies of the overdetermined ICA problem in terms of the unmixing matrix have not yet been analyzed.



2.1 Model

The overdetermined ICA model can be formulated as follows: Let x be a given m-dimensional random vector. An n × m-matrix W with m > n ≥ 2 is called an overdetermined ICA of x if

y = Wx    (1)

is independent. In order to distinguish between overdetermined and ordinary ICA, in the case m = n we call W a square ICA of x.

Here W is not assumed to have full rank as usual. Theorem 2.1 shows that under reasonable assumptions this automatically holds.

Often overdetermined ICA is used to solve the overdetermined BSS problem given by

x = As    (2)

where s is an n-dimensional independent random vector and A an m × n matrix with m > n ≥ 2. Note that A can be assumed to have full rank (rank A = n); otherwise the system could be reduced to the case n - 1: if A = (a_1, ..., a_n) with columns a_i and rank A < n, we can without loss of generality assume that a_n = \sum_{i=1}^{n-1} λ_i a_i. Then

x = As = \sum_{j=1}^{n} a_j s_j = \sum_{j=1}^{n-1} a_j (s_j + λ_j s_n),

so source s_n does not appear in the mixture in this case, and thus the model can be reduced to the case n - 1. Overdetermined ICAs of x are usually considered solutions to this BSS problem.

Often, overdetermined BSS is stated in the noisy case,

x = As + ν    (3)

where ν is a decorrelated Gaussian 'noise' random vector, independent of s. Without additional noise, the sources can be found by solving for example the square ICA problem, which is constructed from equation 1 after leaving away the last m - n observations, given a non-degenerate projected mixture matrix. In the presence of noise, usually a projection by principal component analysis (PCA) is chosen in order to reduce this problem to the square case [14]. In the next section, however, the indeterminacies of the noise-free models represented by equations 1 and 2 are studied, because overdetermined ICA will only be needed later after reduction of the bilinear model, and in this model we do not allow any noise. However, the overdetermined noisy model from equation 3 can easily be reduced to the noise-free model by including ν as additional sources:

x = (A \; I) \begin{pmatrix} s \\ ν \end{pmatrix}.


In this case n increases and we could possibly deal with underdetermined ICA (where extra care has to be taken with the now increased number of Gaussians in the sources). Uniqueness and separability results for this case are given in [7], which shows that the following theorems also hold in this noisy ICA model.

2.2 Indeterminacies

The following theorem presents the indeterminacy of the unmixing matrix in the case of overdetermined mixtures, with the slight generalization that this unmixing matrix does not necessarily have to be of full rank. Later in this section we show that it is necessary to assume that the observed data set x is indeed a mixture.

Theorem 2.1 (Indeterminacies of overdetermined ICA). Let m ≥ n ≥ 2, let x = As as in the model of equation 2, and let the n × m matrix W be an overdetermined or square ICA of x such that at most one component of Wx is deterministic. Furthermore assume one of the following:

i. s has at most one Gaussian component and the variances of s exist.

ii. s has no Gaussian component.

Then there exist a permutation matrix P and an invertible scaling matrix L with

W = LP(A⊤A)⁻¹A⊤ + C    (4)

where C is an n × m matrix with rows lying in the kernel of A⊤, that is, with CA = 0. The converse also holds, i.e. if W fulfills equation 4, then Wx is independent.

A less general form of this indeterminacy has been given by Joho et al. [14]. In the square case, the above theorem shows that it is not necessary to assume that the mixing and especially the demixing matrix have full rank, provided that the transformation has at most one deterministic component. Obviously, this assumption is not needed if W is required to have full rank.

Since rank A = n, the pseudo-inverse (Moore-Penrose inverse) A⁺ of A has the explicit form A⁺ = (A⊤A)⁻¹A⊤. Note that A⁺A = I, so from equation 4 we get WA = LP. As a corollary to the above theorem we remark that overdetermined ICA is separable, which means that the sources are uniquely reconstructible (except for scaling and permutation), because for the approximated sources y we get y = Wx = WAs = LPs. This is of course well known, because overdetermined BSS can be reduced to square BSS (m = n) by projection; yet the indeterminacies of the demixing matrix, which are simple permutation and scaling in the square case, are not so obvious for overdetermined BSS.
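To make the structure of equation 4 concrete, the following small numpy sketch (illustrative only, not part of the original paper) builds a random full-rank A, its pseudo-inverse A⁺, and a matrix C with CA = 0, and checks that any W = A⁺ + C (taking L = P = I for simplicity) maps the mixtures back to the sources:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 3

    A = rng.uniform(-1, 1, size=(m, n))        # full-rank mixing matrix, m > n
    A_pinv = np.linalg.pinv(A)                 # A+ = (A^T A)^{-1} A^T, shape n x m

    # Build C with C A = 0 by projecting random rows onto the orthogonal
    # complement of the column space of A.
    P_col = A @ A_pinv                         # projector onto range(A)
    C = rng.uniform(-1, 1, size=(n, m)) @ (np.eye(m) - P_col)

    W = A_pinv + C                             # any such W is an ICA of x = A s
    s = rng.uniform(-1, 1, size=(n, 10000))    # independent sources
    x = A @ s

    print(np.allclose(C @ A, 0))               # True: C A = 0
    print(np.allclose(W @ A, np.eye(n)))       # True: W A = I, hence W x = s
    print(np.max(np.abs(W @ x - s)))           # numerically zero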

Proof of theorem 2.1. Consider B := WA and y := Bs. Let b_1, ..., b_n ∈ Rⁿ denote the (transposed) rows of B. Then

y_i = b_i⊤ s.    (5)

We will show that we can assume B to be invertible. If B is not invertible, without loss of generality let b_n = Σ_{i=1}^{n−1} λ_i b_i with coefficients λ_i ∈ R. Then at least one λ_i ≠ 0, otherwise y_n = 0 would be deterministic. Without loss of generality let λ_1 = 1. From equation 5 we then get y_n = Σ_{i=1}^{n−1} λ_i y_i = y_1 + u with u := Σ_{i=2}^{n−1} λ_i y_i independent of y_1. Application of the Darmois-Skitovitch theorem [6, 16] to the two equations

y_1 = y_1
y_n = y_1 + u

shows that y_1 is Gaussian or deterministic. Hence all y_i with λ_i ≠ 0 are Gaussian or deterministic, so we may assume that y_1, y_n and u are square integrable and, without loss of generality, centered. The cross-covariance of (y_1, y_n) can then be calculated as follows (note that y_1 is independent of both y_n and u):

0 = E(y_1 y_n⊤) = E(y_1 y_1⊤) + E(y_1 u⊤) = var y_1,

so y_1 is deterministic. Hence all y_i with λ_i ≠ 0 are deterministic, and then so is y_n. Thus at least two components of y = Wx are deterministic, in contradiction to the assumption. Therefore B is invertible.

Using assumptions i) or ii) and the well-known uniqueness result of square linear BSS (a corollary [5, 7] of the Darmois-Skitovitch theorem; see [17] for a proof that does not use this theorem), there exist a permutation matrix P and an invertible scaling matrix L with WA = LP. The properties of the pseudo-inverse then show that the equation

(LP)⁻¹ WA = I

in W has solutions W = LP(A⁺ + C′) with C′A = 0, or W = LPA⁺ + C, again with CA = 0.

The pseudo-inverse is the unique solution of WA = I with minimum Frobenius norm. In this case C = 0; so if W is an ICA of x with minimal Frobenius norm, then W already equals A⁺ except for scaling and permutation. This can be used for normalization. Using the singular value decomposition of A, this has been shown and extended to the noisy case in [14].

Theorem 2.1 states that in the case where x is the mixture of n sources (schematically s →A→ x, with two ICAs W and W′ of x yielding y and y′), ignoring the always present scaling and permutation (so that y = y′ = s), the ICAs of x form an n(m − n)-dimensional affine vector space, i.e. (W − W′)A = 0. In the case of an arbitrary x, however, the ICAs of x can be quite unrelated: there does not always exist a matrix B making the diagram x →W→ y, x →W′→ y′ commute (W′x = BWx), as the example of m ≥ 2n with W the projection onto the first n coordinates and W′ the projection onto the last n coordinates shows. In this case the square ICA argument from the proof of theorem 2.1 cannot be applied, and W and W′ need not have any relation. This is of course not true in the case m = n, where uniqueness (but not existence) can also be shown by inverting W, without explicitly knowing that x is a mixture. This could be extended to the case n ≤ m < 2n in order to construct 2n − m equations relating y and y′.

2.3 Algorithm

The usual algorithm for finding an overdetermined ICA is to first project x along its main principal components onto an n-dimensional random vector and to then apply square linear ICA [14, 19]. In [19], the question of where to place this projection stage (before or after application of ICA) is posed and answered somewhat heuristically. Here, a simple proof is given that in the overdetermined BSS case any possible ICA matrix factorizes over (almost) any projection, so projecting first and then applying ICA to recover the sources is always possible.

Theorem 2.2. Let x = As as in the model of equation 2 such that s satisfies one of the conditions of theorem 2.1, and let W be an overdetermined ICA of x. Then for almost all (in the measure sense) n × m matrices V there exists a square ICA B of Vx such that Wx = BVx.

Proof. Let V be an n × m matrix such that VA is invertible; this is an open condition, so almost all matrices are of this type. Then there exists a square ICA, say B′, of y := Vx. So B′V is an overdetermined ICA of x. Applying separability of overdetermined ICA (corollary of theorem 2.1) then proves that Wx = LPB′Vx for a permutation P and a scaling L. Setting B := LPB′ shows the claim.

In diagram form, this means s →A→ x, with y := Vx and y′ := Wx, and a square ICA B of y such that y′ = By.

In applications, V is usually chosen to be the projection along the first principal components in order to reduce noise [14]. In this case it is easy to see that VA is indeed invertible, as required in the theorem.
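Theorem 2.2 also suggests a direct numerical check (a sketch assuming numpy and scikit-learn are available; the dimensions and the random projection V are arbitrary choices, not taken from the paper): project the overdetermined mixtures with a random V and apply square ICA to Vx.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(1)
    n, m, T = 3, 10, 20000

    s = rng.uniform(-1, 1, size=(T, n))        # independent non-Gaussian sources
    A = rng.uniform(-1, 1, size=(m, n))        # overdetermined mixing matrix
    x = s @ A.T                                # observations, shape (T, m)

    V = rng.uniform(-1, 1, size=(n, m))        # almost any projection works
    y_proj = x @ V.T                           # projected data, now square (T, n)

    ica = FastICA(n_components=n, random_state=0)
    y = ica.fit_transform(y_proj)              # square ICA of V x

    # Compare recovered components with the true sources via correlations;
    # up to permutation and scaling each source should be matched almost perfectly.
    corr = np.corrcoef(y.T, s.T)[:n, n:]
    print(np.sort(np.abs(corr).max(axis=1)))   # values close to 1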

3. Quadratic ICA

In this section, the model of quadratic ICA is introduced, and separability and algorithms are studied in this context.

3.1 Model

Let x be an m-dimensional random vector. Consider the following quadratic or bilinear unmixing model

y := g(x, x)    (6)

for a fixed bilinear mapping g : Rᵐ × Rᵐ → Rⁿ. The components of the bilinear mapping are quadratic forms, which can be parameterized by symmetric matrices. So the above is equivalent to

y_i := x⊤ G^{(i)} x    (7)

for symmetric matrices G^{(i)} and i = 1, ..., n. If G^{(i)}_{kl} denote the coefficients of G^{(i)}, this means

y_i = Σ_{k=1}^{m} Σ_{l=1}^{m} G^{(i)}_{kl} x_k x_l

for i = 1, ..., n.

A special case of this model can be constructed as follows: decompose the symmetric coefficient matrices into

G^{(i)} = E^{(i)⊤} Λ^{(i)} E^{(i)},

where E^{(i)} is orthogonal and Λ^{(i)} diagonal. In order to be able to invert the above model explicitly (after restriction to a subset on which it is invertible), we now assume that these coordinate changes are all the same, i.e. E = E^{(i)} for i = 1, ..., n. Then

y_i = (Ex)⊤ Λ^{(i)} (Ex) = Σ_{k=1}^{m} Λ^{(i)}_{kk} (Ex)_k²    (8)

where Λ^{(i)}_{kk} are the diagonal coefficients of Λ^{(i)}. Defining Λ as the n × m matrix with entries Λ_{ik} := Λ^{(i)}_{kk} yields a two-layered unmixing model

y = Λ ∘ (.)² ∘ E ∘ x,    (9)

where (.)² is to be read as the componentwise square. This unmixing model can be interpreted as a two-layered feed-forward neural network, see figure 1.

[Fig. 1: Simplified quadratic unmixing model y = Λ ∘ (.)² ∘ E ∘ x, drawn as a two-layered feed-forward network mapping x_1, ..., x_m through E, a componentwise square and Λ to y_1, ..., y_n.]

[Fig. 2: Simplified square-root mixing model x = E⁻¹ ∘ √ ∘ Λ⁻¹ ∘ s, mapping the sources s_1, ..., s_n through Λ⁻¹, a componentwise square root and E⁻¹ to the mixtures x_1, ..., x_m.]

The advantage of the restricted model from equation 8 is that it can easily be inverted explicitly. We assume that Λ is invertible and that Ex only takes values in a single quadrant; otherwise the model cannot be inverted. Without loss of generality let this be the first quadrant, that is, assume (Ex)_i ≥ 0 for i = 1, ..., m. Then model 9 is invertible, and the corresponding inverse model (mixing model) can easily be expressed as

x = E⁻¹ ∘ √ ∘ Λ⁻¹ ∘ s    (10)

with E⁻¹ = E⊤. Here we write s for the domain of the model in order to distinguish the sources from the recoveries y given by the unmixing model; ideally, the two are the same. The inverse model is shown in figure 2.
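To illustrate equations 9 and 10, the following minimal numpy sketch (hypothetical dimensions and matrices, not the paper's experiment) draws an orthogonal E and a Λ whose inverse has positive entries, applies the mixing model to positive sources, and verifies that the quadratic unmixing model recovers them exactly:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3                                           # here m = n

    E, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal E, so E^{-1} = E^T
    Lam_inv = rng.uniform(0.1, 1.0, size=(n, n))    # Λ^{-1} chosen with positive entries
    Lam = np.linalg.inv(Lam_inv)                    # n x n matrix of eigenvalues Λ^{(i)}_{kk}

    s = rng.uniform(0.1, 1.0, size=(n, 10000))      # positive sources, so Λ^{-1} s >= 0
    x = E.T @ np.sqrt(Lam_inv @ s)                  # mixing model (10): x = E^{-1} ∘ sqrt ∘ Λ^{-1} ∘ s
    y = Lam @ (E @ x) ** 2                          # unmixing model (9): y = Λ ∘ (.)^2 ∘ E ∘ x

    print(np.max(np.abs(y - s)))                    # numerically zero: the two models invert each other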

3.2 Separability

Constructing a new random vector by arranging the monomials in model 6 in the lexicographical order of the index pairs (i, j), i ≤ j, lets us reduce the quadratic ICA problem to a (usually) overdetermined linear ICA problem. This idea will be used in the following to analyze the indeterminacies of quadratic ICA.

Consider the map

ζ : Rᵐ → R^d,  x ↦ (x_1², 2x_1x_2, ..., 2x_1x_m, x_2², 2x_2x_3, ..., x_m²)

with d = m(m+1)/2. With ζ we can rewrite the quadratic mixing model from equation 6 in the form of a linear model (linearization)

y = W_g ζ(x)    (11)

where the matrix W_g is constructed from the coefficient matrices of the quadratic form g as follows:

W_g = \begin{pmatrix} G^{(1)}_{11} & G^{(1)}_{12} & \cdots & G^{(1)}_{1m} & G^{(1)}_{22} & \cdots & G^{(1)}_{mm} \\ \vdots & & & & & & \vdots \\ G^{(n)}_{11} & G^{(n)}_{12} & \cdots & G^{(n)}_{1m} & G^{(n)}_{22} & \cdots & G^{(n)}_{mm} \end{pmatrix},

i.e. the i-th row of W_g lists the coefficients of G^{(i)} in the same order as the monomials in ζ. This transforms the nonlinear ICA problem into a higher-dimensional linear one.

Assuming that x has been mixed by the inverse of a restriction of a bilinear mapping, we can apply theorem 2.1 to show that W_g is unique except for scaling, permutation and the addition of rows whose transposes lie in the kernel of A⊤. Hence, the coefficient matrices G^{(i)} of g (corresponding to the rows of W_g) are uniquely determined except for these indeterminacies.
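The embedding ζ and the matrix W_g can be written down directly; the following numpy sketch (the helper names zeta and wg_from_forms are chosen here for illustration) checks that the linearization y = W_g ζ(x) reproduces the quadratic forms x⊤G^{(i)}x:

    import numpy as np

    def zeta(X):
        """Map samples x (rows of X, shape (T, m)) to the monomial vector
        (x_1^2, 2 x_1 x_2, ..., 2 x_1 x_m, x_2^2, ..., x_m^2) of dimension m(m+1)/2."""
        T, m = X.shape
        cols = []
        for i in range(m):
            for j in range(i, m):
                factor = 1.0 if i == j else 2.0
                cols.append(factor * X[:, i] * X[:, j])
        return np.stack(cols, axis=1)

    def wg_from_forms(Gs):
        """Stack the upper-triangular coefficients of the symmetric matrices G^{(i)}
        into the rows of W_g, in the same order as zeta."""
        m = Gs[0].shape[0]
        idx = [(i, j) for i in range(m) for j in range(i, m)]
        return np.array([[G[i, j] for (i, j) in idx] for G in Gs])

    rng = np.random.default_rng(3)
    m, n, T = 4, 2, 1000
    Gs = [(lambda M: (M + M.T) / 2)(rng.normal(size=(m, m))) for _ in range(n)]
    X = rng.normal(size=(T, m))

    y_direct = np.stack([np.einsum("ti,ij,tj->t", X, G, X) for G in Gs], axis=1)
    y_linear = zeta(X) @ wg_from_forms(Gs).T
    print(np.allclose(y_direct, y_linear))   # True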

3.3 Algorithm

For now assume m = n ≥ 2. In this case d > n, so the linearized problem from equation 11 consists in finding an overdetermined ICA of ζ(x); see section 2.3 for comments on such an algorithm.

In the more general case m ≠ n, situations with d = n and d < n are also possible. For example, in the case d = n (such as a quadratic mixture of 3 sources into 2 observations), the separating quadratic form is unique except for scaling and permutation. In simulations, however, such settings turned out to be numerically very unstable, so in the following we only consider the case of an equal number of sensors and sources.

3.4 Semi-online dimension reduction

Here we give an algorithm for performing memory-conservative overdetermined ICA. This is important when embedding the data using ζ. For example, in the simulations of section 4.3, image data with 8 × 8 image patches is considered, so n = m = 64. In this case d = 2080, and hence even an ordinary 'MATLAB-style' covariance calculation can become a memory problem, because the vector ζ(x) is too large to keep in memory.

For overdetermined ICA, a projection from the typically high-dimensional mixture space to the lower-dimensional feature space has to be constructed, usually using PCA. Once the high-dimensional covariance matrix Cov(ζ(x)) is known, eigenvalue decomposition and sample-wise projection give the desired low-dimensional feature data.

One way to calculate Cov(ζ(x)) and project is to use an online PCA algorithm such as [4], where the samples of ζ(x) are also calculated online from the known samples of x. Alternatively, batch estimation of the coefficients of Cov(ζ(x)) using these online samples is possible. In practice, these methods suffer from computational problems such as slow MATLAB performance in iterations. We therefore employed a block-wise standard covariance calculation, which improved performance by a factor of roughly 20.

The idea is to split the large covariance matrix Cov(ζ(x)) into smaller blocks, for which the corresponding components of ζ(x) still fit into memory. Denote by C := Cov(ζ(x)) the large d × d covariance matrix, and let 1 ≤ r ≪ d be fixed. C can be decomposed into q := ⌊d/r⌋ + 1 blocks of (maximal) size r simply as

C = \begin{pmatrix} C^{(1,1)} & \cdots & C^{(1,q)} \\ \vdots & \ddots & \vdots \\ C^{(q,1)} & \cdots & C^{(q,q)} \end{pmatrix}

where C^{(k,k′)} is the cross-covariance between the r-dimensional vectors (ζ_i(x))_{i=(k−1)r}^{kr−1} and (ζ_i(x))_{i=(k′−1)r}^{k′r−1} (possibly truncated if k or k′ equals q). These two vectors, and hence C^{(k,k′)}, can now easily be calculated using the definition of ζ. The advantage lies in the fact that these two vectors are of much smaller dimension than d and therefore fit into memory for sufficiently small r. Of course the computational cost increases, as ζ(x) has to be calculated multiple times.
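A block-wise estimate of Cov(ζ(x)) along these lines might look as follows (an illustrative numpy sketch rather than the original MATLAB implementation); only the columns of ζ(x) belonging to the two current blocks are ever held in memory, at the price of recomputing them:

    import numpy as np

    def zeta_columns(X, idx):
        """Compute only the requested columns of zeta(x) for samples X (shape (T, m)).
        idx contains flat indices into the monomial ordering (1,1), (1,2), ..., (m,m)."""
        T, m = X.shape
        pairs = [(i, j) for i in range(m) for j in range(i, m)]
        cols = []
        for k in idx:
            i, j = pairs[k]
            factor = 1.0 if i == j else 2.0
            cols.append(factor * X[:, i] * X[:, j])
        return np.stack(cols, axis=1)

    def blockwise_cov(X, r):
        """Estimate Cov(zeta(x)) block by block, holding at most 2r columns in memory."""
        T, m = X.shape
        d = m * (m + 1) // 2
        starts = list(range(0, d, r))
        means = np.concatenate([zeta_columns(X, range(a, min(a + r, d))).mean(axis=0)
                                for a in starts])
        C = np.empty((d, d))
        for a in starts:
            Za = zeta_columns(X, range(a, min(a + r, d))) - means[a:min(a + r, d)]
            for b in starts:
                Zb = zeta_columns(X, range(b, min(b + r, d))) - means[b:min(b + r, d)]
                C[a:a + Za.shape[1], b:b + Zb.shape[1]] = Za.T @ Zb / (T - 1)
        return C

    # Sanity check against the direct computation on a small example
    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 5))
    Z_full = zeta_columns(X, range(5 * 6 // 2))
    print(np.allclose(blockwise_cov(X, r=4), np.cov(Z_full, rowvar=False)))  # True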

In order to decompose Cov(ζ(x)) into blocks of equal size, a mapping between the lexicographical position in ζ(x) and the index pairs of x is needed, which we state here for convenience (writing m for the data dimension): the monomial x_i x_j with i ≤ j corresponds to the position

k(i, j) = (i − 1)m − i(i − 1)/2 + j

in ζ(x). Vice versa, the k-th entry of ζ(x) corresponds to the monomial with multi-index (i, j), where

i(k) = ⌈(2m + 1 − √((2m + 1)² − 8k)) / 2⌉   and   j(k) = k − (i(k) − 1)m + i(k)(i(k) − 1)/2.
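Since the index algebra is easy to get wrong, the following small sketch (hypothetical helper functions, using 1-based indices as in the text) checks the forward and inverse mapping against an explicit enumeration of the monomials:

    import math

    def k_of(i, j, m):
        # position of the monomial x_i x_j (1 <= i <= j <= m) in zeta(x)
        return (i - 1) * m - i * (i - 1) // 2 + j

    def ij_of(k, m):
        # inverse mapping: monomial indices of the k-th entry of zeta(x)
        i = math.ceil((2 * m + 1 - math.sqrt((2 * m + 1) ** 2 - 8 * k)) / 2)
        j = k - (i - 1) * m + i * (i - 1) // 2
        return i, j

    m = 8
    pairs = [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]
    assert all(k_of(i, j, m) == k for k, (i, j) in enumerate(pairs, start=1))
    assert all(ij_of(k, m) == pairs[k - 1] for k in range(1, len(pairs) + 1))
    print("index mapping consistent for m =", m)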

3.5 Extension to polynomial ICA

Instead of using only second-order monomials, we can of course allow any degree in the monomials in order to approximate more complex nonlinearities. In addition, including the first-order monomials guarantees that the linear ICA case is contained in the model. A suitable higher-dimensional embedding ζ using more monomials generalizes the results of the quadratic case to the polynomial case.

[Fig. 3: Mean SNR and standard deviation between the recovered and the original sources when applying overdetermined ICA to As after random projection using B, for varying source dimension n (legend: n = 2, 3, 5, 10) and mixture dimension m (horizontal axis: m − n). Here s is a random vector uniform in [−1, 1]ⁿ, and A and B are m × n respectively n × m matrices with coefficients uniformly distributed in [−1, 1]. The mean was taken over 1000 runs.]

4. Simulation results

In this section, computer simulations are performed to show the feasibility of the presented algorithms.

4.1 Overdetermined BSS

In order to confirm the theoretical results from section 2.2, we perform batch runs of overdetermined ICA applied to randomly generated data after random projection to the known source dimension. Square ICA was performed using the FastICA algorithm [11]. As parameters to the algorithm we used g(s) := tanh(s) as the nonlinearity in the approximation of the negentropy estimator (respectively its derivative), and stabilization was turned on, meaning that the step size was not kept fixed but could be adapted (halved if the algorithm got stuck between two points). The simulation (figure 3) confirms, not surprisingly, that the ICA algorithm performs well independently of the chosen projection and the mixture dimension. In the presented noise-free case, projecting along the directions of largest variance using PCA instead of a random projection will not improve performance (in accordance with theorem 2.2). In the case of white noise, however, PCA will provide better recoveries for larger m [14].

4.2 Quadratic ICA – artificially generated data

In our first example we consider the simplified mixing-unmixing model from equation 9 in the case m = n = 2 with the randomly generated matrices

E = \begin{pmatrix} 0.91 & 1.2 \\ 1.8 & -0.72 \end{pmatrix}   and   Λ = \begin{pmatrix} 8 & -7.7 \\ -5.1 & 6.7 \end{pmatrix}.

[Fig. 4: Example 1: the two square-root mixing functions x_1 and x_2, plotted as surfaces over the source domain.]

Λ was chosen such that Λ⁻¹ has only positive coefficients; together with the fact that the sources are positive, this ensures the invertibility of the nonlinear transformation (it is equivalent to (Ex)_i ≥ 0). Also note that we did not require E to be orthogonal in this example.
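Using the matrices above, the data generation of Example 1 can be reproduced in a few lines (a numpy sketch under the stated assumptions; note that it uses the known E and Λ to invert the model, whereas the blind algorithm has to estimate the quadratic forms from x alone):

    import numpy as np

    rng = np.random.default_rng(5)
    E = np.array([[0.91, 1.2], [1.8, -0.72]])
    Lam = np.array([[8.0, -7.7], [-5.1, 6.7]])
    Lam_inv = np.linalg.inv(Lam)                   # has only positive coefficients

    s = rng.uniform(0.0, 2.0, size=(2, 10000))     # positive independent sources
    x = np.linalg.inv(E) @ np.sqrt(Lam_inv @ s)    # mixing model (10), E not orthogonal here
    y = Lam @ (E @ x) ** 2                         # quadratic unmixing model (9)

    print(np.max(np.abs(y - s)))                   # the restricted model is exactly invertible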

The two-dimensional sources s are shown in figure 5 together with a scatter plot, i.e. a plot of the samples that depicts the density. The mixtures x := E⁻¹ √(Λ⁻¹ s) are also plotted in the same figure; the nonlinearity is quite visible. Figure 4 gives a plot of the two nonlinear mixing functions.

Application of the described algorithm yields the quadratic forms

y_1 = 29x_1² − 57x_1x_2 − 21x_2²
y_2 = −28x_1² + 40x_1x_2 + 17x_2².

The recovered signals y are given in the right column of figure 5; a cross scatter plot with the sources is shown in figure 6 for comparison. The signal-to-noise ratios between the two are 44 and 43 dB after normalization to zero mean and unit variance and possible sign multiplication, which confirms the high separation quality. In order to demonstrate the nonlinearity of the problem, figure 7 shows that linear ICA does not perform well when applied to the mixtures.

In order to obtain quantitative results beyond a single experiment, we apply the algorithm to mixtures with a varying number of samples and dimensions. We consider the cases of equal source and mixture dimension m = n = 2, 3, 4. Figure 8 shows the algorithm performance for an increasing number of samples. In the mean, quadratic ICA always outperforms linear ICA, but has a higher standard deviation. Problems when recovering the sources were noticed to occur when the condition number of Λ⁻¹ was very high; by leaving out these cases, the performance of quadratic ICA rises noticeably in comparison to linear ICA.


[Fig. 5: Example 1: A two-dimensional nonlinear mixture using mixing model 10 (see figure 2) is separated. The left column shows the two independent source signals together with a signal scatter plot (only every 5th sample plotted), which depicts the source probability density. The middle column shows the two clearly nonlinearly mixed signals and their scatter plot. The right column depicts the two signals separated by quadratic ICA; their scatter plot again confirms their independence.]

[Fig. 6: Example 1: Comparison scatter plots of source and recovered source signals (figure 5); if y denotes the recovered sources, the top left panel is a scatter plot of (s_1, y_1), the top right of (s_2, y_1) and the bottom right of (s_2, y_2). The SNRs between the first respectively second signals are 44 and 43 dB after normalization to zero mean and unit variance and possible sign multiplication.]

[Fig. 7: Example 1, linear ICA: Applying linear ICA to the mixtures from figure 5 yields the recovered signals shown at the top; the scatter plot at the bottom confirms that the recovery was poor, and not surprisingly no independence could be achieved. Comparison with the original sources shows that the recovery does not correspond well to the sources: the maximal SNRs were 7.5 and 7.7 after normalization.]
[Fig. 8: Quadratic ICA performance versus sample set size in dimensions 2 (top), 3 (middle) and 4 (bottom), comparing bilinear (quadratic) ICA with linear ICA. The mean was taken over 1000 runs with Λ⁻¹ and E⁻¹ having uniformly distributed coefficients in [0, 1]ⁿ respectively [−1, 1]ⁿ. E⁻¹ was orthogonalized (an orthogonal basis was constructed out of the columns of E⁻¹), because in experiments the algorithm turned out to be quite sensitive to a high condition number of E⁻¹ in higher dimensions; so E⁻¹ = E⊤, in accordance with the model from equation 9. The sources were uniformly distributed in [0, 1]ⁿ.]

4.3 Quadratic ICA – image data

Finally we deal with a real-world example, namely the analysis of natural images. We applied quadratic ICA to a set of small patches collected from the database of natural images made by van Hateren [18]. We show the two dominant linear filters and the corresponding eigenvalues of each obtained quadratic form in figure 9. Most of the obtained quadratic forms have one or two dominant linear filters, and these linear filters are selective for local bar stimuli. The other eigenvalues are all substantially smaller and lie centered around zero in a roughly Gaussian distribution; see the eigenvalue distribution in figure 9. If a quadratic form has two dominant filters, the position, orientation and spatial-frequency selectivity of these two filters are very similar, and the two corresponding eigenvalues have different signs. In summary, the values of all quadratic forms correspond to squared simple-cell outputs. This result is qualitatively similar to the result obtained in [2], with simple-cell-like features being more prominent in our results. Also, in section 4 of [9], the obtained quadratic forms correspond to squared simple-cell outputs.

5. Conclusion

This paper treats nonlinear ICA by reducing it to the case of linear ICA. After formally stating separability results for overdetermined ICA, these results are applied to quadratic and polynomial ICA. The presented algorithm consists of linearization followed by application of linear ICA. In order to apply the PCA projection also in high dimensions, a block calculation of the covariance matrix is suggested. The algorithms are then applied to artificial data sets, where they show good performance for a sufficiently high number of samples. In the application to natural image data, the characteristics of the obtained quadratic forms are qualitatively the same as the results of Bartsch and Obermayer [2], and they correspond to squared simple-cell outputs. In future work, we want to further investigate a possible enhancement of the separability result as well as a more detailed extension to ICA with polynomial and maybe analytic mappings. Furthermore, bilinear ICA with additional noise (for example multiplicative noise) could be considered. Also, issues such as model and parameter selection will have to be treated if higher-order models are allowed.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. FT was partially supported by the DFG in the grant 'Nonlinearity and Nonequilibrium in Condensed Matter' and by the BMBF in the project 'ModKog'.


[Fig. 9: Quadratic ICA of natural images. 3·10⁵ sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvector matrices of the quadratic-form coefficient matrices (top). For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2 · 64); above each image the corresponding eigenvalue (multiplied by 10³) is printed. In the bottom figure, the absolute values of the 10 largest eigenvalues of each filter are shown. Clearly, in most filters one or two eigenvalues are dominant.]

References

[1] K. Abed-Meraim, A. Belouchrani, and Y. Hua. Blind identification of a linear-quadratic mixture of independent components based on joint diagonalization procedure. In Proc. of ICASSP 1996, volume 5, pages 2718-2721, Atlanta, USA, 1996.
[2] H. Bartsch and K. Obermayer. Second order statistics of natural images. Neurocomputing, 52-54:467-472, 2003.
[3] A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing. John Wiley & Sons, 2002.
[4] A. Cichocki and R. Unbehauen. Robust estimation of principal components in real time. Electronics Letters, 29(21):1869-1870, 1993.
[5] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287-314, 1994.
[6] G. Darmois. Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21:2-8, 1953.
[7] J. Eriksson and V. Koivunen. Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003, pages 23-27, 2003.
[8] P.G. Georgiev. Blind source separation of bilinearly mixed signals. In Proc. of ICA 2001, pages 328-330, San Diego, USA, 2001.
[9] W. Hashimoto. Quadratic forms in natural images. Network: Computation in Neural Systems, 14:765-788, 2003.
[10] S. Hosseini and Y. Deville. Blind separation of linear-quadratic mixtures of real sources using a recurrent structure. Lecture Notes in Computer Science, 2687:241-248, 2003.
[11] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
[12] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[13] A. Hyvärinen and P. Pajunen. On existence and uniqueness of solutions in nonlinear independent component analysis. In Proceedings of the 1998 IEEE International Joint Conference on Neural Networks (IJCNN'98), volume 2, pages 1350-1355, 1998.
[14] M. Joho, H. Mathis, and R.H. Lambert. Overdetermined blind source separation: using more sensors than source signals in a noisy mixture. In Proc. of ICA 2000, pages 81-86, Helsinki, Finland, 2000.
[15] A. Leshem. Source separation using bilinear forms. In Proc. of the 8th Int. Conference on Higher-Order Statistical Signal Processing, 1999.
[16] V.P. Skitovitch. On a property of the normal distribution. DAN SSSR, 89:217-219, 1953.
[17] F.J. Theis. A new concept for separability problems in blind source separation. Neural Computation, accepted, 2004.
[18] J.H. van Hateren and D.L. Ruderman. Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:2315-2320, 1998.
[19] S. Winter, H. Sawada, and S. Makino. Geometrical interpretation of the PCA subspace method for overdetermined blind source separation. In Proc. of ICA 2003, pages 775-780, Nara, Japan, 2003.


Chapter 6

LNCS 3195:726-733, 2004

Paper: F.J. Theis, A. Meyer-Bäse, and E.W. Lang. Second-order blind source separation based on multi-dimensional autocovariances. In Proc. ICA 2004, volume 3195 of LNCS, pages 726-733, Granada, Spain, 2004. Springer.

Reference: (Theis et al., 2004e)

Summary in section 1.3.1


Second-order blind source separation based on multi-dimensional autocovariances

Fabian J. Theis¹,², Anke Meyer-Baese², and Elmar W. Lang¹

¹ Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
² Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA
fabian@theis.name

Abstract. SOBI is a blind source separation algorithm based on time decorrelation. It uses multiple time autocovariance matrices and performs joint diagonalization, thus being more robust than previous time-decorrelation algorithms such as AMUSE. We propose an extension called mdSOBI, which uses multidimensional autocovariances; these can be calculated for data sets with multidimensional parameterizations such as images or fMRI scans. mdSOBI has the advantage of using the spatial data in all directions, whereas SOBI only uses a single direction. These findings are confirmed by simulations and by an application to fMRI analysis, where mdSOBI outperforms SOBI considerably.

Blind source separation (BSS) describes the task of recovering the unknown mixing process and the underlying sources of an observed data set. Currently, many BSS algorithms assume independence of the sources (ICA); see for instance [1, 2] and references therein. In this work, we consider BSS algorithms based on time decorrelation. Such algorithms include AMUSE [3] and extensions such as SOBI [4] and the similar TDSEP [5]. These algorithms rely on the fact that the data sets have non-trivial autocorrelations. We give an extension to data sets that have more than one direction in the parametrization, such as images, by replacing one-dimensional autocovariances with multi-dimensional autocovariances.

The paper is organized as follows: in section 1 we introduce the linear mixture model; section 2 recalls results on time-decorrelation BSS algorithms. We then define multidimensional autocovariances and use them to propose mdSOBI in section 3. The paper finishes with both artificial and real-world results in section 4.

1 Linear BSS

We consider the following blind source separation (BSS) problem: let x(t) be an (observed) stationary m-dimensional real stochastic process (with not necessarily discrete time t) and A an invertible real matrix such that

x(t) = As(t) + n(t)    (1)

where the source signals s(t) have diagonal autocovariances

R_s(τ) := E[(s(t + τ) − E(s(t)))(s(t) − E(s(t)))⊤]

for all τ, and the additive noise n(t) is modelled by a stationary, temporally and spatially white zero-mean process with variance σ². x(t) is observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A⁻¹x(t), which is optimal in the maximum-likelihood sense (if the density of n(t) is maximal at 0, which is the case for usual noise models such as Gaussian or Laplacian noise). So the BSS task reduces to the estimation of the mixing matrix A. Extensions of the above model include for example the complex case [4] or allowing different dimensions for s(t) and x(t), where the case of larger mixing dimension can easily be reduced to the presented complete case by dimension reduction, resulting in a lower noise level [6].

By centering the processes, we can assume that x(t) and hence s(t) have zero mean. The autocovariances then have the following structure:

R_x(τ) = E[x(t + τ)x(t)⊤] = A R_s(0) A⊤ + σ²I  for τ = 0,  and  R_x(τ) = A R_s(τ) A⊤  for τ ≠ 0.    (2)

Clearly, A (and hence s(t)) can be determined from equation 1 only up to permutation and scaling of columns. Since we assume existing variances of x(t) and hence s(t), the scaling indeterminacy can be eliminated by the convention R_s(0) = I. In order to guarantee identifiability of A except for permutation from the above model, we additionally have to assume that there exists a delay τ such that R_s(τ) has pairwise different eigenvalues (for a generalization see [4], theorem 2). Using the spectral theorem it is then easy to see from equation 2 that A is determined uniquely by x(t) except for permutation.

2 AMUSE and SOBI

Equation 2 also gives an indication of how to perform BSS, i.e. how to recover A from x(t). The usual first step consists of whitening the noise-free term x̃(t) := As(t) of the observed mixtures x(t) using an invertible matrix V such that Vx̃(t) has unit covariance. V can simply be estimated from x(t) by diagonalization of the symmetric matrix R_x̃(0) = R_x(0) − σ²I, provided that the noise variance σ² is known. If more signals than sources are observed, dimension reduction can be performed in this step, and the noise level can be reduced [6].

In the following, we will therefore assume without loss of generality that x̃(t) = As(t) has unit covariance for each t. By assumption, s(t) also has unit covariance, hence I = E[As(t)s(t)⊤A⊤] = A R_s(0) A⊤ = AA⊤, so A is orthogonal. Now define the symmetrized autocovariance of x(t) as R̄_x(τ) := ½ (R_x(τ) + (R_x(τ))⊤). Equation 2 shows that the symmetrized autocovariance of x(t) also factors, and we get

R̄_x(τ) = A R̄_s(τ) A⊤    (3)

for τ ≠ 0.


By assumption R̄_s(τ) is diagonal, so equation 3 is an eigenvalue decomposition of the symmetric matrix R̄_x(τ). If we furthermore assume that R̄_x(τ), or equivalently R̄_s(τ), has n different eigenvalues, then the above decomposition, i.e. A, is uniquely determined by R̄_x(τ) except for orthogonal transformations within each eigenspace and permutation; since the eigenspaces are one-dimensional, this means that A is uniquely determined by equation 3 except for permutation. In addition to this separability result, A can be recovered algorithmically by simply calculating the eigenvalue decomposition of R̄_x(τ) (AMUSE, [3]).

In practice, if the eigenvalue decomposition is problematic, a different choice of τ often resolves the problem. Nonetheless, there are sources in which some components have equal autocovariances. Also, due to the fact that the autocovariance matrices are only estimated from a finite number of samples, and due to possible colored noise, the autocovariance at τ could be badly estimated. A more general BSS algorithm called SOBI (second-order blind identification), based on time decorrelation, was therefore proposed by Belouchrani et al. [4]. Instead of diagonalizing only a single autocovariance matrix, it takes a whole set of autocovariance matrices of x(t) with varying time lags τ and jointly diagonalizes the whole set. It has been shown that increasing the size of this set improves SOBI performance in noisy settings [1].

Algorithms for performing joint diagonalization of a set of symmetric commuting matrices include gradient descent on the sum of the off-diagonal terms, iterative construction of A by Givens rotations in two coordinates [7] (used in the simulations in section 4), an iterative two-step recovery of A [8], and more recently a linear least-squares algorithm for diagonalization [9], where the latter two algorithms can also search for non-orthogonal matrices A. Joint diagonalization has been used in BSS based on cumulant matrices [10] or time autocovariances [4, 5].
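As a concrete reference point for the following discussion, a minimal AMUSE-style implementation could look as follows (an illustrative numpy sketch for the noise-free case; it is not the exact implementation used in the experiments):

    import numpy as np

    def amuse(X, tau=1):
        """AMUSE-style separation of mixtures X (shape (n, T)): whiten, then
        eigendecompose the symmetrized autocovariance at lag tau."""
        n, T = X.shape
        X = X - X.mean(axis=1, keepdims=True)

        # Whitening: V X has (approximately) unit covariance
        d, E = np.linalg.eigh(np.cov(X))
        V = E @ np.diag(d ** -0.5) @ E.T
        Z = V @ X

        # Symmetrized autocovariance at lag tau
        R = Z[:, tau:] @ Z[:, :-tau].T / (T - tau)
        R_sym = (R + R.T) / 2

        # Its eigenvectors give the orthogonal unmixing of the whitened data
        _, U = np.linalg.eigh(R_sym)
        W = U.T @ V
        return W, W @ X

    # Toy example: two independent AR(1)-like sources with different autocovariances
    rng = np.random.default_rng(6)
    T = 5000
    s = np.zeros((2, T))
    for t in range(1, T):
        s[:, t] = np.array([0.9, 0.3]) * s[:, t - 1] + rng.normal(size=2)
    A = rng.normal(size=(2, 2))
    W, y = amuse(A @ s, tau=1)
    print(np.abs(np.corrcoef(y, s)[:2, 2:]).round(2))   # close to a permutation matrix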

3 Multidimensional SOBI

The goal of this work is to improve SOBI performance for random processes with a higher-dimensional parametrization, i.e. for data sets where the random processes s and x do not depend on a single variable t, but on multiple variables (z_1, ..., z_M). A typical example is a source data set in which each component s_i represents an image of size h × w. Then M = 2 and samples of s are given at z_1 = 1, ..., h, z_2 = 1, ..., w. Classically, s(z_1, z_2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples. If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure based algorithms such as AMUSE and SOBI, the results can vary greatly depending on the choice of this mapping, see figure 2.

Without loss of generality we again assume centered random vectors. We then define the multidimensional autocovariance as

R_s(τ_1, ..., τ_M) := E[s(z_1 + τ_1, ..., z_M + τ_M) s(z_1, ..., z_M)⊤],

where the expectation is taken over (z_1, ..., z_M).


Given equidistant samples, R_s(τ_1, ..., τ_M) can be estimated as usual by replacing random variables with sample values and expectations with sums.

[Fig. 1: Example of the one- and two-dimensional autocovariance coefficients of the grayscale 128 × 128 Lena image after normalization to variance 1, plotted against τ respectively |(τ_1, τ_2)| (rescaled to N).]
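Estimating such a multidimensional autocovariance from image-valued components is straightforward; the following numpy sketch (illustrative code, with M = 2 and an arbitrary toy data set) replaces the expectation over (z_1, z_2) by an average over all valid pixel positions:

    import numpy as np

    def md_autocov(X_img, tau1, tau2):
        """Multidimensional autocovariance R_x(tau1, tau2) of an image-valued
        random vector. X_img has shape (n, h, w): n components, each an h x w image."""
        n, h, w = X_img.shape
        X = X_img - X_img.mean(axis=(1, 2), keepdims=True)     # center each component
        A = X[:, tau1:, tau2:].reshape(n, -1)                   # s(z1 + tau1, z2 + tau2)
        B = X[:, :h - tau1, :w - tau2].reshape(n, -1)           # s(z1, z2)
        return A @ B.T / A.shape[1]

    # Toy example: two smooth random 'images' with spatial correlations
    rng = np.random.default_rng(7)
    imgs = rng.normal(size=(2, 64, 64))
    imgs = np.cumsum(np.cumsum(imgs, axis=1), axis=2)
    print(md_autocov(imgs, tau1=1, tau2=2).shape)               # (2, 2)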

The advantage of using multidimensional autocovariances lies in the fact that the multidimensional structure of the data set can now be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance (see the examples in figure 2 and section 4). Figure 1 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ_1, τ_2) after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional autocovariance; only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. captures image structure.

Our contribution consists of using multidimensional autocovariances for joint diagonalization. We replace the BSS assumption of diagonal one-dimensional autocovariances by diagonal multi-dimensional autocovariances of the sources. Note that the multidimensional autocovariance also satisfies equation 2. Again we assume whitened x(z_1, ..., z_M). Given an autocovariance matrix R̄_x(τ_1^(1), ..., τ_M^(1)) with n different eigenvalues, multidimensional AMUSE (mdAMUSE) detects the orthogonal unmixing mapping W by diagonalization of this matrix.

In section 2, we discussed the advantages of SOBI over AMUSE; this of course also holds in the generalized case. Hence, the multidimensional SOBI algorithm (mdSOBI) consists of the joint diagonalization of a set of symmetrized multidimensional autocovariances

{ R̄_x(τ_1^(1), ..., τ_M^(1)), ..., R̄_x(τ_1^(K), ..., τ_M^(K)) }

after whitening of x(z_1, ..., z_M).


118 Chapter 6. LNCS 3195:726-733, 2004<br />

PSfrag replacements<br />

(a) source images<br />

crosstalk<strong>in</strong>g error E1( Â, I)<br />

4<br />

3.5<br />

3<br />

2.5<br />

2<br />

1.5<br />

1<br />

0.5<br />

SOBI based on multi-dimensional autocovariances 5<br />

SOBI<br />

SOBI transposed images<br />

mdSOBI<br />

mdSOBI transposed images<br />

0<br />

0 10 20 30 40 50 60 70<br />

K<br />

(b) performance comparison<br />

Fig. 2. Comparison of SOBI and mdSOBI when applied to (unmixed) images from (a).<br />

The plot (b) plots the number K of time lags versus the crosstalk<strong>in</strong>g error E1 of the<br />

recovered matrix  and the unit matrix I; here  has been recovered by bot SOBI<br />

and mdSOBI given the images from (a) respectively the transposed images.<br />

after whitening of x(z1, . . . , zK). The joint diagonalizer then equals A except for permutation, given the generalized identifiability conditions from [4], theorem 2. Therefore, the identifiability result does not change either, see [4]. In practice, we choose the $(\tau_1^{(k)}, \ldots, \tau_M^{(k)})$ with increasing modulus for increasing k, but with the restriction $\tau_1^{(k)} > 0$ in order to avoid using the same autocovariances on the diagonal of the matrix twice.
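A minimal sketch of this lag-selection rule (our own illustration, assuming two-dimensional data; not code from the paper): enumerate lags by increasing modulus, keep only those with positive first component, and take the first K.

```python
import numpy as np

def mdsobi_lags(K, max_shift=20):
    """Pick K two-dimensional lags (tau1, tau2), ordered by increasing
    modulus, with the restriction tau1 > 0 so that no symmetrized
    autocovariance matrix is used twice."""
    candidates = [(t1, t2)
                  for t1 in range(1, max_shift + 1)
                  for t2 in range(-max_shift, max_shift + 1)]
    candidates.sort(key=lambda t: np.hypot(t[0], t[1]))   # stable sort by modulus
    return candidates[:K]

print(mdsobi_lags(5))   # e.g. [(1, 0), (1, -1), (1, 1), (2, 0), (1, -2)]
```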

Often, data sets do not have any substantial long-distance autocorrelations, but quite high multidimensional close-distance correlations (see figure 1). When performing joint diagonalization, SOBI weights each matrix equally strongly, which can deteriorate the performance for large K, see the simulation in section 4.

Figure 2(a) shows an example in which the images have considerable vertical structure, but rather random horizontal structure. Each of the two images consists of a concatenation of stripes of two images. For visual purposes, we chose the width of the stripes to be rather large with 16 pixels. According to the previous discussion we expect one-dimensional algorithms such as AMUSE and SOBI to perform well on the images, but badly (for a number of time lags ≫ 16) on the transposed images. If we apply AMUSE with τ = 20 to the images, we get excellent performance with a low crosstalking error of 0.084 with respect to the unit matrix; if we however apply AMUSE to the transposed images, the error is high at 1.1. This result is further confirmed by the comparison plot in figure 2(b); mdSOBI performs equally well on the images and the transposed



Fig. 3. SOBI and mdSOBI performance (for K = 32 and K = 128) in dependence on the noise level σ. Plotted is the crosstalking error E1 of the recovered matrix Â with respect to the real mixing matrix A. See text for more details.

images, whereas the performance of SOBI strongly depends on whether column or row concatenation was used to construct a one-dimensional random process out of each image. The SOBI breakpoint of around K = 52 can be decreased by choosing smaller stripes. In future work we want to provide an analytical discussion of the performance increase of mdSOBI over SOBI, similar to the performance evaluation in [4].

4 Results

Artificial mixtures. We consider the linear mixture of three images (baboon, black-haired lady and Lena) with a randomly chosen 3 × 3 matrix A. Figure 3 shows how SOBI and mdSOBI perform depending on the noise level σ. For small K, both SOBI and mdSOBI perform equally well in the low-noise case, but mdSOBI performs better in the case of stronger noise. For larger K mdSOBI substantially outperforms SOBI, which is due to the fact that natural images do not have any substantial long-distance autocorrelations (see figure 1), whereas mdSOBI uses the non-trivial two-dimensional autocorrelations.

fMRI analysis. We analyze the performance of mdSOBI when applied to fMRI measurements. fMRI data were recorded from six subjects (3 female, 3 male, age 20–37) performing a visual task. In five subjects, five slices with 100 images (TR/TE = 3000/60 msec) were acquired with five periods of rest and five


[Fig. 4 panels: (a) component maps 1–8; (b) their time courses with stimulus crosscorrelations cc = −0.08, 0.19, −0.11, −0.21, −0.43, −0.21, −0.16, −0.86.]

Fig. 4. mdSOBI fMRI analysis. The data was reduced to the first 8 principal components. (a) shows the recovered component maps (white points indicate values stronger than 3 standard deviations), and (b) their time courses. mdSOBI was performed with K = 32. Component 5 represents the inner ventricles, component 6 the frontal eye fields. Component 8 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes), with a crosscorrelation of cc = −0.86 and a delay of roughly 2 seconds induced by the BOLD effect.

photic stimulation periods with rest. Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. The resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point and a dark background with a central fixation point during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment (AIR, [11]).

BSS, mainly based on ICA, is nowadays quite a common tool in fMRI analysis (see for example [12]). Here, we analyze the fMRI data set using spatial decorrelation as the separation criterion. Figure 4 shows the performance of mdSOBI; see the figure caption for an interpretation. Using only the first 8 principal components, mdSOBI could recover the stimulus component as well as detect additional components. When applying SOBI to the data set, it could not properly detect the stimulus component but found two components with crosscorrelations of cc = −0.81 and −0.84 with the stimulus time course.

5 Conclusion

We have proposed an extension of SOBI, called mdSOBI, for data sets with multidimensional parametrizations, such as images. Our main contribution lies in replacing the one-dimensional autocovariances by multidimensional autocovariances. In both simulations and real-world applications mdSOBI outperforms SOBI for these multidimensional structures.

In future work, we will show how to perform spatiotemporal BSS by jointly diagonalizing both spatial and temporal autocovariance matrices. We plan to apply these results to fMRI analysis, where we also want to use three-dimensional autocovariances for 3D scans of the whole brain.

Acknowledgements

The authors would like to thank Dr. Dorothee Auer from the Max Planck Institute of Psychiatry in Munich, Germany, for providing the fMRI data, and Oliver Lange from the Department of Clinical Radiology, Ludwig-Maximilian University, Munich, Germany, for data preprocessing and visualization. FT and EL acknowledge partial financial support by the BMBF within the project 'ModKog'.

References

1. Cichocki, A., Amari, S.: Adaptive blind signal and image processing. John Wiley & Sons (2002)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons (2001)
3. Tong, L., Liu, R.W., Soon, V., Huang, Y.F.: Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems 38 (1991) 499–509
4. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moulines, E.: A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing 45 (1997) 434–444
5. Ziehe, A., Mueller, K.R.: TDSEP – an efficient algorithm for blind separation using time structure. In Niklasson, L., Bodén, M., Ziemke, T., eds.: Proc. of ICANN'98, Skövde, Sweden, Springer Verlag, Berlin (1998) 675–680
6. Joho, M., Mathis, H., Lambert, R.: Overdetermined blind source separation: using more sensors than source signals in a noisy mixture. In: Proc. of ICA 2000, Helsinki, Finland (2000) 81–86
7. Cardoso, J.F., Souloumiac, A.: Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl. 17 (1995) 161–164
8. Yeredor, A.: Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Trans. Signal Processing 50 (2002) 1545–1553
9. Ziehe, A., Laskov, P., Mueller, K.R., Nolte, G.: A linear least-squares algorithm for joint diagonalization. In: Proc. of ICA 2003, Nara, Japan (2003) 469–474
10. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings - F 140 (1993) 362–370
11. Woods, R., Cherry, S., Mazziotta, J.: Rapid automated algorithm for aligning and reslicing PET images. Journal of Computer Assisted Tomography 16 (1992) 620–633
12. McKeown, M., Jung, T., Makeig, S., Brown, G., Kindermann, S., Bell, A., Sejnowski, T.: Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping 6 (1998) 160–188




Chapter 7

Proc. ISCAS 2005, pages 5878-5881

Paper F.J. Theis. Blind signal separation into groups of dependent signals using joint block diagonalization. In Proc. ISCAS 2005, pages 5878-5881, Kobe, Japan, 2005

Reference (Theis, 2005a)

Summary in section 1.3.3



Blind signal separation into groups of dependent signals using joint block diagonalization

Abstract— Multidimensional or group independent component analysis describes the task of transforming a multivariate observed sensor signal such that groups of the transformed signal components are mutually independent; however, dependencies within the groups are still allowed. This generalization of independent component analysis (ICA) allows for weakening the sometimes too strict assumption of independence in ICA. It has potential applications in various fields such as ECG, fMRI analysis or convolutive ICA. Recently we could calculate the indeterminacies of group ICA, which finally enables us, also theoretically, to apply group ICA to solve blind source separation (BSS) problems. In this paper we introduce and discuss various algorithms for separating signals into groups of dependent signals. The algorithms are based on joint block diagonalization of sets of matrices generated using several signal structures.

Fabian J. Theis<br />

Institute of Biophysics, University of Regensburg<br />

93040 Regensburg, Germany, Email: fabian@theis.name<br />

I. INTRODUCTION

In this work, we discuss multidimensional blind source separation (MBSS), i.e. the recovery of underlying sources s from an observed mixture x. As usual, s has to fulfill additional properties such as independence or diagonality of the autocovariances (if s possesses time structure). However, in contrast to ordinary BSS, MBSS is more general as some source signals are allowed to possess common statistics. One possible solution for MBSS is multidimensional independent component analysis (MICA); in section IV we will discuss other such conditions. The idea of MICA is that we do not require full independence of the transform y := Wx but only mutual independence of certain tuples y_{i1}, . . . , y_{i2}. If the size of all tuples is restricted to one, this reduces to ordinary ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length k.

Multidimensional ICA has first been introduced by Cardoso [1] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA which they called independent subspace analysis [2]; there the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation.

II. JOINT BLOCK DIAGONALIZATION

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE [3]) or in BSS relying on second-order time-decorrelation (for example in SOBI [4]). The task of (real) joint diagonalization is, given a set of commuting symmetric n × n matrices M_i, to find an orthogonal matrix E such that E^⊤ M_i E is diagonal for all i. In the following we will use a generalization of this technique as an algorithm to solve MBSS problems. Instead of fully diagonalizing M_i, in joint block diagonalization (JBD) we want to determine E such that E^⊤ M_i E is block-diagonal (after fixing the block structure).

Introducing some notation, let us define for r, s = 1, . . . , n the (r, s) sub-k-matrix of W = (w_{ij}), denoted by W^{(k)}_{rs}, to be the k × k submatrix of W ending at position (rk, sk). Denote by Gl(n) the group of invertible n × n matrices. A matrix W ∈ Gl(nk) is said to be a k-scaling matrix if W^{(k)}_{rs} = 0 for r ≠ s, and W is called a k-permutation matrix if for each r = 1, . . . , n there exists precisely one s such that W^{(k)}_{rs} equals the k × k unit matrix.

Hence, fixing the block size to k, JBD tries to find E such that E^⊤ M_i E is a k-scaling matrix. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure on non-block-diagonality. Various algorithms to actually perform JBD have been proposed, see [5] and references therein. In the following we will simply perform joint diagonalization (using for example the Jacobi-like algorithm from [6]) and then permute the columns of E to achieve block-diagonality; in experiments this turns out to be an efficient solution to JBD [5].

III. MULTIDIMENSIONAL ICA (MICA)

Let k, n ∈ N. We call an nk-dimensional random vector y k-independent if the k-dimensional random vectors (y_1, . . . , y_k)^⊤, . . . , (y_{nk−k+1}, . . . , y_{nk})^⊤ are mutually independent. A matrix W ∈ Gl(nk) is called a k-multidimensional ICA of an nk-dimensional random vector x if Wx is k-independent. If k = 1, this is the same as ordinary ICA.

Using MICA we want to solve the (noiseless) linear MBSS problem x = As, where the nk-dimensional random vector x is given, and A ∈ Gl(nk) and s are unknown. In the case of MICA, s is assumed to be k-independent.
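To make the block notation of Section II concrete, here is a small numpy sketch (our own illustration, not code from the paper) that extracts the (r, s) sub-k-matrix of a matrix W and tests whether W is a k-scaling matrix:

```python
import numpy as np

def sub_k_matrix(W, k, r, s):
    """Return the (r, s) sub-k-matrix of W, i.e. the k x k block
    ending at position (r*k, s*k); r and s are counted from 1."""
    return W[(r - 1) * k : r * k, (s - 1) * k : s * k]

def is_k_scaling(W, k, tol=1e-10):
    """W is a k-scaling matrix if all off-diagonal k x k blocks vanish."""
    n = W.shape[0] // k
    return all(np.allclose(sub_k_matrix(W, k, r, s), 0, atol=tol)
               for r in range(1, n + 1) for s in range(1, n + 1) if r != s)

# example: a block-diagonal matrix with two 2x2 blocks is a 2-scaling matrix
B = np.block([[3.0 * np.eye(2), np.zeros((2, 2))],
              [np.zeros((2, 2)), np.arange(4.0).reshape(2, 2)]])
print(is_k_scaling(B, 2))   # True
```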

A. Indeterminacies

Obvious indeterminacies are, similar to ordinary ICA, invertible transforms in Gl(k) in each tuple as well as the fact that the order of the independent k-tuples is not fixed. Indeed, if A is an MBSS solution, then so is ALP with a k-scaling matrix L and a k-permutation matrix P, because independence is


invariant under these transformations. In [7] we show that these are the only indeterminacies, given some additional weak restrictions on the model, namely that A has to be k-admissible and that s is not allowed to contain a Gaussian k-component.

As usual, by preprocessing the observations x by whitening we may also assume that Cov(x) = I. Then I = Cov(x) = A Cov(s) A^⊤ = A A^⊤, so A is orthogonal.

B. MICA using Hessian diagonalization (MHICA)

We assume that s admits a C²-density p_s. Using the orthogonality of A we get p_s(s_0) = p_x(As_0) for s_0 ∈ R^{nk}. Let H_f(x_0) denote the Hessian of f evaluated at x_0. It transforms like a 2-tensor, so locally at s_0 with p_s(s_0) > 0 we get
$$H_{\ln p_s}(s_0) = H_{\ln p_x \circ A}(s_0) = A^\top H_{\ln p_x}(A s_0)\, A. \qquad (1)$$
The key idea now lies in the fact that s is assumed to be k-independent, so p_s factorizes into n groups depending only on k separate variables each. So ln p_s is a sum of functions depending on k separate variables, hence H_{ln p_s}(s_0) is block-diagonal, i.e. a k-scaling.

The algorithm, multidimensional Hessian ICA (MHICA), now simply uses the block-diagonality structure from equation (1) and performs JBD of estimates of a set of Hessians H_{ln p_s}(s_i) evaluated at different points s_i ∈ R^{nk}. Given slight restrictions on the eigenvalues, the resulting block diagonalizer then equals A^⊤ except for k-scaling and permutation. The Hessians are estimated using kernel-density approximation with a sufficiently smooth kernel, but other methods such as approximation using finite differences are possible, too. Density approximation is problematic, but in this setting, since we can use many Hessians, we only need rough estimates. For more details on the kernel approximation we refer to the one-dimensional Hessian ICA algorithm from [8].

MHICA generalizes one-dimensional ideas proposed in [8], [9]. More generally, we could have also used characteristic functions instead of densities, which leads to a related algorithm, see [10] for the single-dimensional ICA case.
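As a rough, self-contained illustration of one such Hessian estimate (our own sketch, not the paper's implementation; all names and parameter values are ours), the following computes a finite-difference Hessian of the log-density at one data point, with the density approximated by a plain Gaussian kernel density estimate. MHICA would collect many such matrices at different points and jointly block diagonalize them.

```python
import numpy as np

def log_kde(x, data, bandwidth=0.5):
    """Log of an (unnormalized) Gaussian kernel density estimate at x; data: N x d."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.log(np.mean(np.exp(-d2 / (2 * bandwidth ** 2))) + 1e-300)

def hessian_log_density(x, data, h=0.1, bandwidth=0.5):
    """Central finite-difference Hessian of the log-density at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (log_kde(x + e_i + e_j, data, bandwidth)
                       - log_kde(x + e_i - e_j, data, bandwidth)
                       - log_kde(x - e_i + e_j, data, bandwidth)
                       + log_kde(x - e_i - e_j, data, bandwidth)) / (4 * h * h)
    return H

# toy data: two independent, internally dependent 2D groups -> the cross blocks
# of the estimated Hessian should be comparatively small (up to estimation error)
rng = np.random.default_rng(1)
g1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
g2 = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], size=2000) ** 3
s = np.column_stack([g1, g2])
s = (s - s.mean(0)) / s.std(0)
print(np.round(hessian_log_density(s[0], s), 2))
```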

IV. MULTIDIMENSIONAL TIME DECORRELATION

Instead of assuming k-independence of the sources in the MBSS problem, in this section we assume that s is a multivariate centered discrete WSS random process such that its symmetrized autocovariances
$$\bar{R}_s(\tau) := \tfrac{1}{2}\left( E\bigl[s(t+\tau)s(t)^\top\bigr] + E\bigl[s(t)s(t+\tau)^\top\bigr] \right) \qquad (2)$$
are k-scalings for all τ. This models the fact that the sources are supposed to be block-decorrelated in the time domain for all time shifts τ.

A. Indeterminacies

Again A can only be found up to k-scaling and k-permutation because condition (2) is invariant under this transformation. One sufficient condition for identifiability is to have pairwise different eigenvalues of at least one R_s(τ); however, generalizations are possible, see [4] for the case k = 1. Using whitening, we can again assume an orthogonal A.


Fig. 1. Histogram and box plot of the multidimensional performance index E^{(k)}(C) evaluated for k = 2 and n = 2. The statistics were calculated over 10^5 independent experiments using 4 × 4 matrices C with coefficients uniformly drawn out of [−1, 1].

B. Multidimensional SOBI (MSOBI)

The idea of what we call multidimensional second-order blind identification (MSOBI) is now a direct extension of the usual SOBI algorithm [4]. Symmetrized autocovariances of x can easily be estimated from the data, and they transform as follows: R̄_s(τ) = A^⊤ R̄_x(τ) A. But R̄_s(τ) is a k-scaling by assumption, so JBD of a set of such symmetrized autocovariance matrices yields A as diagonalizer (except for k-scaling and permutation).

Other researchers have worked on this problem in the setting of convolutive BSS; due to lack of space we want to refer to [11] and references therein.
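A minimal sketch of the matrices MSOBI feeds into the joint block diagonalizer (our own naming; the JBD step itself, e.g. Jacobi joint diagonalization followed by a column permutation as described in Section II, is assumed to be available separately):

```python
import numpy as np

def symmetrized_autocov(x, tau):
    """Symmetrized lagged covariance matrix, cf. eq. (2); x has shape (T, d), tau > 0."""
    x = x - x.mean(axis=0)
    R = x[:-tau].T @ x[tau:] / (len(x) - tau)
    return 0.5 * (R + R.T)

def msobi_matrices(x, taus=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)):
    """Whiten x and collect the symmetrized autocovariances to be
    jointly block diagonalized by MSOBI."""
    x = x - x.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(x.T))          # eigendecomposition of the covariance
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # symmetric whitening matrix
    xw = x @ W.T
    return W, [symmetrized_autocov(xw, t) for t in taus]
```

Joint block diagonalization of these matrices with block size k then recovers the orthogonal mixing matrix of the whitened data up to k-scaling and k-permutation.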

V. EXPERIMENTAL RESULTS

In this section we demonstrate the validity of the proposed algorithms by applying them to both toy and real-world data.

A. Multidimensional Amari-index

In order to analyze algorithm performance, we consider the index E^{(k)}(C), defined for fixed n, k and C ∈ Gl(nk) as
$$E^{(k)}(C) = \sum_{r=1}^{n}\left(\sum_{s=1}^{n}\frac{\|C^{(k)}_{rs}\|}{\max_i \|C^{(k)}_{ri}\|} - 1\right) + \sum_{s=1}^{n}\left(\sum_{r=1}^{n}\frac{\|C^{(k)}_{rs}\|}{\max_i \|C^{(k)}_{is}\|} - 1\right).$$
Here ‖·‖ can be any matrix norm; we choose the operator norm ‖A‖ := max_{|x|=1} |Ax|. This multidimensional performance index of an nk × nk matrix C generalizes the one-dimensional performance index introduced by Amari et al. [12] to block-diagonal matrices. It measures how much C differs from a permutation and scaling matrix in the sense of k-blocks, so it can be used to analyze algorithm performance:

Lemma 5.1: Let C ∈ Gl(nk). E^{(k)}(C) = 0 if and only if C is the product of a k-scaling and a k-permutation matrix.

Corollary 5.2: Consider the MBSS problem x = As from section III respectively IV. An estimate Â of the mixing matrix solves the MBSS problem if and only if E^{(k)}(Â^{-1}A) = 0.
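A direct numpy transcription of this index (our own helper, using the spectral norm as the operator norm) might look as follows:

```python
import numpy as np

def block(C, k, r, s):
    """(r, s) sub-k-matrix of C; r and s are counted from 1."""
    return C[(r - 1) * k : r * k, (s - 1) * k : s * k]

def multidim_amari_index(C, k):
    """Multidimensional performance index E^(k)(C); it is 0 iff C is the
    product of a k-scaling and a k-permutation matrix."""
    n = C.shape[0] // k
    norms = np.array([[np.linalg.norm(block(C, k, r, s), 2)
                       for s in range(1, n + 1)] for r in range(1, n + 1)])
    rows = (norms / norms.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (norms / norms.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return float(rows.sum() + cols.sum())

# sanity check: a 2-scaling (block-diagonal) matrix has index 0
L = np.block([[np.array([[1., 2.], [0., 1.]]), np.zeros((2, 2))],
              [np.zeros((2, 2)), np.array([[0., 1.], [3., 0.]])]])
print(multidim_amari_index(L, 2))   # 0.0
```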

In order to be able to determine the scale of this index, figure 1 gives statistics of E^{(k)} over randomly chosen matrices in the case k = n = 2. The mean is 3.05 and the median 3.10.





Fig. 2. Simulation, 4-dimensional 2-independent sources. Clearly the first and the second as well as the third and the fourth signal are dependent.

B. Simulations

We will discuss algorithm performance when applied to a 4-dimensional 2-independent toy signal. In order to see the performance of both MSOBI and MHICA we generate 2-independent sources with non-trivial autocorrelations. For this we use two independent generating signals, a sinusoid and a sawtooth, given by
$$z(t) := \bigl(\sin(0.1\,t),\ 2\,(0.007\,t - \lfloor 0.007\,t + 0.5\rfloor)\bigr)^\top$$
for discrete time steps t = 1, 2, . . . , 1000. We thus generated sources
$$s(t) := \bigl(z_1(t),\ \exp(z_1(t)),\ z_2(t),\ (z_2(t) + 0.5)^2\bigr)^\top,$$
which are plotted in figure 2. Their covariance is
$$\mathrm{Cov}(s) = \begin{pmatrix} 0.50 & 0.57 & 0.01 & 0.01\\ 0.57 & 0.68 & 0.01 & 0.01\\ 0.01 & 0.01 & 0.33 & 0.33\\ 0.01 & 0.01 & 0.33 & 0.42 \end{pmatrix},$$
so indeed s is not fully independent.

s is mixed using a 4 × 4 matrix A with entries uniformly drawn out of [−1, 1], and comparisons are made over 100 Monte-Carlo runs. We compare the two algorithms MSOBI (with 10 autocorrelation matrices) and MHICA (using 50 Hessians) with the ICA algorithms JADE and fastICA, where for the latter both the deflation and the symmetric approach were used. For each run we calculate the performance index E^{(2)}(Â^{-1}A) of the product of the mixing and the estimated separating matrix. Since the one-dimensional ICA algorithms are unable to use the group structure, for these we take the minimum of the index calculated over all row permutations of Â^{-1}A.

Figure 3 displays the result of the comparison. Clearly MHICA and MSOBI perform very well on this data, and MSOBI furthermore gives very robust estimates with the same error and negligibly small variance. JADE cannot separate the data at all; it performs not much better than a random choice of matrix, see figure 1; this is due to the fact that the cumulants of k-independent sources are not block-diagonal. FastICA only converges in 12% (deflation approach) respectively 89% (symmetric approach) of all cases. However, in the cases where it converges it gives results comparable with the multidimensional algorithms. Apparently, especially the symmetric method seems to be able to use the weakened statistics to still find directions in the data.

Fig. 3. Simulation, algorithm results. This notched box plot displays the performance index E^{(2)} of the mixing-separating matrix Â^{-1}A of each algorithm (MHICA, MSOBI, JADE, fastICA deflation, fastICA symmetric), sampled over 100 Monte-Carlo runs. The middle line of each column gives the mean, the boxes the 25th and 75th percentiles. The deflationary fastICA algorithm only converged in 12% of all runs, the symmetric-approach based fastICA in 89% of all cases; the statistics are only given over successful runs. All other algorithms converged in all runs.
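For reference, the toy sources above can be generated with a few lines of numpy (our own sketch; the sawtooth term follows our reconstruction of the garbled formula, chosen to match the quoted covariance):

```python
import numpy as np

t = np.arange(1, 1001)                                  # discrete time steps t = 1, ..., 1000
z1 = np.sin(0.1 * t)                                    # sinusoid
z2 = 2 * (0.007 * t - np.floor(0.007 * t + 0.5))        # sawtooth in [-1, 1)
s = np.vstack([z1, np.exp(z1), z2, (z2 + 0.5) ** 2])    # 2-independent 4-dim sources

print(np.round(np.cov(s), 2))   # close to the covariance matrix quoted above
```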

C. Application to ECG data

Finally we illustrate how to apply the proposed algorithms to a real-world data set. Following [1], we will show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). The data set [13] consists of eight recorded signals with 2500 observations; the sampling frequency is misleadingly specified as 500 Hz (which would mean around 168 mother heartbeats per minute), it should be closer to around 250 Hz. We select the first three sensors, cutaneously recorded on the abdomen of the mother. In order to save space and to compare the results with [1] we plot only the first 1000 samples, see figure 4(a).



Fig. 4. Fetal ECG example. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). Figure (b) gives the extracted sources using MHICA with k = 2 and 500 Hessians. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) respectively the fetal source (component 3) onto the mixture space (a) are plotted.

Our goal is to extract an MECG and an FECG component; however, it cannot be expected to find only a one-dimensional MECG, due to the fact that projections of a three-dimensional (electric) vector field are measured. Hence modelling the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component) makes sense. Application of MHICA (with 500 Hessians) and MSOBI (with 50 autocorrelation matrices) extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the following estimated mixing matrices (A using MHICA and A′ using MSOBI):
$$A = \begin{pmatrix} 0.37 & 0.42 & -0.81\\ -0.75 & 0.89 & -0.16\\ 0.55 & -0.16 & 0.57 \end{pmatrix}, \qquad A' = \begin{pmatrix} 0.22 & 0.91 & -0.40\\ -0.84 & 0.23 & -0.33\\ 0.50 & 0.34 & 0.85 \end{pmatrix}.$$

The thus estimated sources using MHICA are plotted in figure 4(b). In order to compare the two mixing matrices, calculation of
$$A^{-1}A' = \begin{pmatrix} 0.85 & 1.02 & 0.64\\ -0.23 & 1.11 & 0.35\\ -0.01 & -0.08 & 0.98 \end{pmatrix}$$
yields a somewhat visible block structure; the performance index is E^{(2)}(A^{-1}A') = 1.12. The block structure is not very dominant, which indicates that the two models (block independence versus time-block-decorrelation) are not fully equivalent.

A (scaling invariant) decomposition of the observed ECG data can be achieved by composing the extracted sources using only the relevant mixing columns. For example, for the MECG part this means applying the projection Π_M := (a_1, a_2, 0) A^{-1} to the observations. This yields the projection matrices
$$\Pi_M = \begin{pmatrix} 0.52 & 0.38 & 0.84\\ -0.10 & 1.08 & 0.17\\ 0.34 & -0.27 & 0.41 \end{pmatrix}, \qquad \Pi_F = \begin{pmatrix} 0.48 & -0.38 & -0.84\\ 0.10 & -0.08 & -0.17\\ -0.34 & 0.27 & 0.59 \end{pmatrix}$$
onto the mother respectively the fetal ECG using MHICA, and
$$\Pi'_M = \begin{pmatrix} 0.78 & 0.21 & 0.45\\ -0.18 & 1.17 & 0.36\\ 0.47 & -0.44 & 0.05 \end{pmatrix}, \qquad \Pi'_F = \begin{pmatrix} 0.22 & -0.21 & -0.45\\ 0.18 & -0.17 & -0.36\\ -0.47 & 0.44 & 0.95 \end{pmatrix}$$
using MSOBI. The results of the first algorithm are plotted in figures 4(c) and (d). The fetal ECG is most active at sensor 1 (as visual inspection of the observation confirms). When comparing the projection matrices with the results from [1], we get quite high similarity of the ICA-based results, and a modest difference with the projections of the time-based algorithm. Other one-dimensional ICA-based results on this data set are reported for example in [14].

VI. CONCLUSION

We have shown how the idea of joint block diagonalization, as an extension of joint diagonalization, helps us to generalize ICA and time-structure based algorithms such as HICA and SOBI to the multidimensional ICA case. The thus defined algorithms are able to robustly decompose signals into groups of independent signals. In future work, besides more extensive experiments and tests with noise and outliers, we want to extend this result to a version of JADE using moments instead of cumulants, which preserve the block structure.

REFERENCES

[1] J. Cardoso, "Multidimensional independent component analysis," in Proc. of ICASSP '98, Seattle, 1998.
[2] A. Hyvärinen and P. Hoyer, "Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces," Neural Computation, vol. 12, no. 7, pp. 1705–1720, 2000.
[3] J.-F. Cardoso and A. Souloumiac, "Blind beamforming for non Gaussian signals," IEE Proceedings - F, vol. 140, no. 6, pp. 362–370, 1993.
[4] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique based on second order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[5] K. Abed-Meraim and A. Belouchrani, "Algorithms for joint block diagonalization," in Proc. EUSIPCO 2004, Vienna, Austria, 2004, pp. 209–212.
[6] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalization," SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161–164, Jan. 1995.
[7] F. Theis, "Uniqueness of complex and multidimensional independent component analysis," Signal Processing, vol. 84, no. 5, pp. 951–956, 2004.
[8] ——, "A new concept for separability problems in blind source separation," Neural Computation, vol. 16, pp. 1827–1850, 2004.
[9] J. Lin, "Factorizing multivariate function classes," in Advances in Neural Information Processing Systems, vol. 10, 1998, pp. 563–569.
[10] A. Yeredor, "Blind source separation via the second characteristic function," Signal Processing, vol. 80, no. 5, pp. 897–902, 2000.
[11] C. Févotte and C. Doncarli, "A unified presentation of blind separation methods for convolutive mixtures using block-diagonalization," in Proc. ICA 2003, Nara, Japan, 2003, pp. 349–354.
[12] S. Amari, A. Cichocki, and H. Yang, "A new learning algorithm for blind signal separation," Advances in Neural Information Processing Systems, vol. 8, pp. 757–763, 1996.
[13] B. De Moor (ed.), "DaISy: database for the identification of systems," Department of Electrical Engineering, ESAT/SISTA, K.U.Leuven, Belgium, Oct 2004. [Online]. Available: http://www.esat.kuleuven.ac.be/sista/daisy/
[14] L. De Lathauwer, B. De Moor, and J. Vandewalle, "Fetal electrocardiogram extraction by source subspace separation," in Proc. IEEE SP / ATHOS Workshop on HOS, Girona, Spain, 1995, pp. 134–138.




Chapter 8

Proc. NIPS 2006

Paper F.J. Theis. Towards a general independent subspace analysis. Proc. NIPS 2006, 2007

Reference (Theis, 2007)

Summary in section 1.3.3



Towards a general independent subspace analysis

Fabian J. Theis
Max Planck Institute for Dynamics and Self-Organisation &
Bernstein Center for Computational Neuroscience
Bunsenstr. 10, 37073 Göttingen, Germany
fabian@theis.name

Abstract

The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithm-independent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm.

1 Independent subspace analysis

A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k or simply k-ISA. So 1-ISA is equivalent to ICA.

1.1 Why extend ICA?

An important structural aspect in the search for decompositions is the knowledge of the number of solutions, i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions on the model.

However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector. Indeed,




Figure 1: Applying ICA to a random vector X = AS that does not fulfill the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which will be explained in section 3.

theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however, if k-ISA is applied to any random vector, a decomposition into groups that are only 'as independent as possible' cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in figure 1.

This work aims at finding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm.

1.2 Previous approaches to ISA for dependent component analysis

Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has first been introduced by Cardoso [4] using geometrical motivations. His model, as well as the related but independently proposed factorization of multivariate function classes [9], is quite general; however, no identifiability results were presented, and applicability to an arbitrary random vector was unclear. Later, in the special case of equal group sizes (k-ISA), uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have recently been studied by [10]. Moreover, if the observations contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13].

Hyvärinen and Hoyer presented a special case of k-ISA by combining it with invariant feature subspace analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree-dependent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation.



1.3 General ISA

Definition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S_1, . . . , S_k) with pairwise independent, irreducible random vectors S_i.

Note that in this case, the S_i are independent components of X. The idea behind this definition is that in contrast to ICA and k-ISA, we do not fix the size of the groups S_i in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each S_i and permutation of S_i of the same dimension¹. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made.

Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation.

Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from

Lemma 1.3. Let S = (S_1, . . . , S_k) be a square-integrable decomposition of S into irreducible independent components S_i. If X is an irreducible component of S, then X ∼ S_i for some i.

Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simplification of lemma 1.3 by ordering the components S_i according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4.

Lemma 1.4. Let S and X be defined as in lemma 1.3. In addition assume that dim S_i = dim X for i ≤ l and dim S_i < dim X for i > l. Then X ∼ S_i for some i ≤ l.

Lemma 1.5. Let S and X be defined as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S_1.

In order to prove lemma 1.5 (and hence the theorem), it is sufficient to show the following lemma:

Lemma 1.6. Let S = (S_1, S_2) with S_1 irreducible and m := dim S_1 > dim S_2 =: n. If X = AS is again irreducible for some m × (m + n) matrix A, then (i) the left m × m submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n submatrix of A vanishes.

(i) follows after some linear algebra, and is necessary to show the more difficult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S.

1.4 Dealing with Gaussians

In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated, hence independent, Gaussians is again independent, so clearly such a strong identification result would not be possible.

Recently, a general decomposition model dealing with Gaussians was proposed in the form of the so-called non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (S_N, S_G) and S_N a square-integrable m-dimensional random vector is called an m-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be m-decomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, S_N and S_G are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]:

¹ Note that scaling here implies a basis change in the component S_i, so for example in the case of a two-dimensional source component, this might be rotation and shearing. In the example later in figure 3, these indeterminacies can easily be seen by comparing true and estimated sources.



Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces.

Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2 and to get a general decomposition theorem, which characterizes solutions of ISA.

Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Proof. Existence is obvious. Uniqueness follows after first applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part.

2 Joint block diagonalization with unknown block-sizes

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1, . . . , MK} is to find an orthogonal matrix E such that E⊤MkE is diagonal for all k = 1, . . . , K, i.e. to minimize

f(Ê) := ∑_{k=1}^{K} ‖Ê⊤MkÊ − diagM(Ê⊤MkÊ)‖²_F

with respect to the orthogonal matrix Ê, where diagM(M) produces a matrix in which all off-diagonal elements of M have been set to zero, and ‖M‖²_F := tr(MM⊤) denotes the squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal matrix, so minimizing f is equivalent to maximizing

g(Ê) := ∑_{k=1}^{K} ‖diag(Ê⊤MkÊ)‖²,

where now diag(M) := (mii)_i denotes the diagonal of M. For the actual minimization of f, respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative application of Givens rotations in two coordinates [5].
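To make the two criteria concrete, the following minimal NumPy sketch (ours, not part of the paper; function names are illustrative) evaluates f and g for a candidate orthogonal matrix. The check at the end illustrates why minimizing f and maximizing g are equivalent: their sum is constant over orthogonal matrices.

    import numpy as np

    def offdiag_cost(E, Ms):
        """JD cost f(E): squared Frobenius norm of the off-diagonal part of E^T M_k E."""
        total = 0.0
        for M in Ms:
            T = E.T @ M @ E
            total += np.sum(T**2) - np.sum(np.diag(T)**2)
        return total

    def diag_gain(E, Ms):
        """Dual criterion g(E): squared norm of the diagonals of E^T M_k E."""
        return sum(np.sum(np.diag(E.T @ M @ E)**2) for M in Ms)

    # Because the Frobenius norm is invariant under orthogonal conjugation,
    # f(E) + g(E) = sum_k ||M_k||_F^2 is constant over orthogonal E.
    rng = np.random.default_rng(0)
    Ms = [np.cov(rng.standard_normal((4, 50))) for _ in range(3)]
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal matrix
    const = sum(np.sum(M**2) for M in Ms)
    assert np.allclose(offdiag_cost(Q, Ms) + diag_gain(Q, Ms), const)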

2.1 Generalization to blocks

In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤MkE is block-diagonal. Depending on the application, we fix the block-structure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition of n, i.e. a way of writing n as a sum of positive integers, where the order of the addends is not significant. So let² n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1, . . . , mr) ∈ N^r. An n × n matrix is said to be m-block diagonal if it is of the form

    ⎛ D1  ···  0  ⎞
    ⎜  ⋮   ⋱   ⋮  ⎟
    ⎝ 0   ···  Dr ⎠

with arbitrary mi × mi matrices Di.

As a generalization of JD in the case of a known block structure, we can formulate the joint m-block diagonalization (m-JBD) problem as the minimization of

f^m(Ê) := ∑_{k=1}^{K} ‖Ê⊤MkÊ − diagM^m(Ê⊤MkÊ)‖²_F

with respect to the orthogonal matrix Ê, where diagM^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure of non-block-diagonality. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.

Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set.

2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting.



For this it is necessary to require a block structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmax_{m : ∃E with f^m(E)=0} |m|. In practice, due to errors, a true JBD would always result in the trivial decomposition m = (n), so we define an approximate general JBD by requiring f^m(E) < ε for some fixed constant ε > 0 instead of f^m(E) = 0.

2.2 JBD by JD

A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [1]. This idea has been formulated as a conjecture [1], essentially claiming that a minimum of the JD cost function f already is a JBD, i.e. a minimum of the function f^m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially:

We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f, then there exists a permutation P such that f^m(EP) = 0 (given existence of a JBD of M). But of course f(EP) = f(E), so we will show why (certain) JBD solutions are minima of f. However, JD might

have additional minima. First note that clearly not every JBD minimizes f, only those such that in each block of size mk, the diagonality criterion g(E) restricted to the block is maximal over E ∈ O(mk). We will call such a JBD block-optimal in the following.

Theorem 2.1. Any block-optimal JBD of M (zero of f^m) is a local minimum of f.

Proof. Let E ∈ O(n) be block-optimal with f^m(E) = 0. We have to show that E is a local minimum of f, or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤MkE, we may already assume that each Mk is m-block diagonal, so we have to show that E = I is a local maximum of g.

Consider the elementary Givens rotation Gij(ε), defined for i < j and ε ∈ (−1, 1) as the orthogonal matrix in which all diagonal elements are 1 except for the two elements √(1 − ε²) in rows i and j, and all off-diagonal elements are 0 except for the two elements ε and −ε at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ε12, ε13, . . . , ε_{n−1,n}) := ∏_{i<j} Gij(εij) … > 0, and therefore h is negative definite in the direction εij. Altogether we get a negative definite h at 0 except for ‘trivial directions’, and hence a local maximum at 0.

2.3 Recovering the permutation

In order to perform JBD, we therefore only have to find a JD E of M. What is left according to the above theorem is to find a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially find P by some combinatorial optimization.


[Figure 2 shows three 40 × 40 matrix images: (a) the (unknown) block diagonal M1, (b) Ê⊤E without recovered permutation, and (c) Ê⊤E.]

Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) block-partition 40 = 1+2+2+3+3+5+6+6+6+6 in the presence of noise with an SNR of 5 dB. The product Ê⊤E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes, as claimed in section 2.2.

In the case of unknown block-size, we propose to use the following simple permutation-recovery algorithm: consider the mean diagonalized matrix D := K⁻¹ ∑_{k=1}^{K} E⊤MkE. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤MkE and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial.

We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from the possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n.

Algorithm 1: Block-diagonality permutation finder
Input: (n × n)-matrix D
Output: block-diagonal matrix P(D) := D′ such that D′ = PDP⊤ for a permutation matrix P
D′ ← D
for i ← 1 to n do
    repeat
        if (j0 ← min{j | j ≥ i and d′_{ij} = 0 and d′_{ji} = 0}) exists then
            if (k0 ← min{k | k > j0 and (d′_{ik} ≠ 0 or d′_{ki} ≠ 0)}) exists then
                swap column j0 of D′ with column k0
                swap row j0 of D′ with row k0
    until no swap has occurred
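The following NumPy rendering of algorithm 1 (a sketch of ours; the function name and the theta parameter are illustrative) applies the optional thresholding and then performs the column/row swaps described above, returning the compacted matrix together with the permutation it used.

    import numpy as np

    def permutation_finder(D, theta=0.0):
        """Permute rows/columns of (a possibly thresholded) D so that its
        non-zeros are clustered along the diagonal, as in algorithm 1."""
        Dp = np.where(np.abs(D) > theta, D, 0.0).copy()   # threshold small entries
        n = Dp.shape[0]
        perm = np.arange(n)
        for i in range(n):
            swapped = True
            while swapped:
                swapped = False
                # first position j >= i where both the row and the column entry vanish
                zeros = [j for j in range(i, n) if Dp[i, j] == 0 and Dp[j, i] == 0]
                if zeros:
                    j0 = zeros[0]
                    # first later position carrying a non-zero in row or column i
                    nz = [k for k in range(j0 + 1, n) if Dp[i, k] != 0 or Dp[k, i] != 0]
                    if nz:
                        k0 = nz[0]
                        Dp[:, [j0, k0]] = Dp[:, [k0, j0]]   # swap columns
                        Dp[[j0, k0], :] = Dp[[k0, j0], :]   # swap rows
                        perm[[j0, k0]] = perm[[k0, j0]]
                        swapped = True
        return Dp, perm

The block sizes, and hence the partition m, can then be read off from the zero pattern of the returned matrix.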

We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefficients chosen uniformly at random from [−1, 1], and symmetrized by Dk ← (Dk + Dk⊤)/2. After that, they have been mixed by a random orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDkE⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5 dB. Application of the JBD algorithm from above to {M1, . . . , MK} with threshold θ = 0.1 correctly recovers the block sizes, and the estimated block diagonalizer Ê equals E up to m-scaling and permutation, as illustrated in figure 2.
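A minimal sketch (ours) of how such a test set could be generated; the exact noise scaling for the 5 dB SNR is not specified in the text, so the amplitude-ratio convention used below is an assumption.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(0)
    m = [1, 2, 2, 3, 3, 5, 6, 6, 6, 6]          # hidden block sizes, sum = 40
    n, K, snr_db = sum(m), 100, 5.0

    E = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthogonal mixing matrix
    Ms = []
    for _ in range(K):
        D = block_diag(*[rng.uniform(-1, 1, (s, s)) for s in m])
        D = (D + D.T) / 2                               # symmetrize the blocks
        M = E @ D @ E.T
        noise = rng.standard_normal((n, n))
        # one way to realize a 5 dB signal-to-noise ratio (assumption)
        noise *= np.linalg.norm(M) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        Ms.append(M + noise)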

3 SJADE — a simple algorithm for general ISA

As usual, by preprocessing the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let


[Figure 3 consists of scatter/density plots: (a) S2, (b) S3, (c) S4, (d) S5, (e) Â⁻¹A, and the recovered sources (f) (Ŝ1, Ŝ2), (g) histogram of Ŝ3, (h) Ŝ4, (i) Ŝ5, (j) Ŝ6.]

Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the scatter plots, i.e. densities, of the source components and the mixing-separating map Â⁻¹A.

also Cov(S) = I. Then I = Cov(X) = A Cov(S) A⊤ = AA⊤, so A is orthogonal. Due to the ISA assumptions, the fourth-order cross-cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to find transformations of the mixtures fulfilling this property, we follow the idea of the JADE algorithm, but now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices defined by Cij(X) := E[X⊤EijX · XX⊤] − Eij − Eij⊤ − tr(Eij) I. Here RX := Cov(X) and Eij is a set of eigen-matrices of Cij, 1 ≤ i, j ≤ n. One simple choice is to use n² matrices Eij with zeros everywhere except a 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE), not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii's, but also automatically finds the subspace partition m using the general JBD algorithm from section 2.3.
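As a concrete illustration, the following NumPy sketch (ours, not from the paper) estimates the n² contracted quadricovariance matrices for whitened data with the simple single-entry choice of Eij; these are the matrices that SJADE block-diagonalizes. The crude whitening in the usage lines is only for demonstration.

    import numpy as np

    def quadricovariance_matrices(X):
        """X: whitened data of shape (n, T) with Cov(X) ~ I.
        Returns the n*n matrices C_ij = E[x^T E_ij x * x x^T] - E_ij - E_ij^T - tr(E_ij) I,
        where E_ij is the single-entry matrix with a 1 at (i, j)."""
        n, T = X.shape
        I = np.eye(n)
        Cs = []
        for i in range(n):
            for j in range(n):
                w = X[i] * X[j]                     # x^T E_ij x = x_i * x_j
                M = (w * X) @ X.T / T               # E[(x_i x_j) x x^T]
                Eij = np.zeros((n, n)); Eij[i, j] = 1.0
                Cs.append(M - Eij - Eij.T - np.trace(Eij) * I)
        return Cs

    # usage: whiten some data, then jointly block-diagonalize the returned set
    rng = np.random.default_rng(1)
    S = rng.standard_normal((4, 2000))
    X = S - S.mean(axis=1, keepdims=True)
    X = np.linalg.cholesky(np.linalg.inv(np.cov(X))).T @ X   # crude whitening
    mats = quadricovariance_matrices(X)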

4 Experimental results

In a first example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d letters or 3d geometrical shapes. The chosen source densities are shown in figure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 10⁴ samples were used. The sources S were mixed by a mixing matrix A with coefficients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The mixing matrix Â was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). Figure 3(e) shows the composed mixing-separating system Â⁻¹A; clearly the matrices are equal except for block permutation and scaling, which experimentally confirms theorem 1.8. The algorithm found a partition m̂ = (1, 1, 1, 2, 2, 3), so one 2d source was misinterpreted as two 1d sources, but using prior knowledge, combination of the correct two 1d sources yields the original 2d source. The resulting recovered sources Ŝ := Â⁻¹X, figures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in figure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is



performed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting, as observed in [4].

In a second example, we illustrate how the algorithm deals with Gaussian sources, i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 10⁵ samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix A, and apply SJADE with θ = 0.01. The recovered mixing matrix Â is compared with A by taking the ad-hoc measure ι(P) := ∑_{i=1}^{3} ∑_{j=1}^{2} (p²_{ij} + p²_{ji}) for P := Â⁻¹A. Indeed, we get nearly perfect recovery in 99 out of 100 runs; the median of ι(P) is very low at 0.0083. A single run diverges with ι(P) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The first two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has a kurtosis of around −1.2, which equals the kurtosis of a uniform density. This confirms the applicability of the algorithm in the general, noisy ISA setting.
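The kurtosis values quoted above are excess kurtoses; a quick numerical check of the −1.2 reference value for a uniform density (a sketch of ours):

    import numpy as np

    u = np.random.default_rng(2).uniform(-1, 1, 100_000)
    excess_kurtosis = np.mean(u**4) / np.var(u)**2 - 3
    print(excess_kurtosis)   # close to -6/5 = -1.2, the excess kurtosis of a uniform density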

5 Conclusion

Previous approaches for independent subspace analysis were restricted either to fixed group sizes or to semi-parametric models. In neither case was general applicability to any kind of mixture data set guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identifiability result for this general, parameter-free model, together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but includes the noisy case via NGCA). However, in practice, in the finite-sample case the general joint block diagonality holds only approximately due to estimation errors. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future work.

References

[1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004.
[2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003.
[3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. JMLR, 7:247–282, 2006.
[4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP '98, Seattle, 1998.
[5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995.
[6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[7] A. Hyvärinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[8] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525–1558, 2001.
[9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998.
[10] B. Poczos and A. Lörincz. Independent subspace analysis using k-nearest neighborhood distances. In Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer.
[11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004.
[12] F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006.
[13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.




Chapter 9

Neurocomputing (in press), 2007

Paper F.J. Theis, P. Gruber, I.R. Keck, and E.W. Lang. A robust model for spatiotemporal dependencies. Neurocomputing (in press), 2007

Reference (Theis et al., 2007b)

Summary in section 1.3.3



A Robust Model for Spatiotemporal Dependencies

Fabian J. Theis a,b,∗, Peter Gruber b, Ingo R. Keck b, Elmar W. Lang b

a Bernstein Center for Computational Neuroscience, Max-Planck-Institute for Dynamics and Self-Organisation, Göttingen, Germany
b Institute of Biophysics, University of Regensburg, Regensburg, Germany

Abstract

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. Here, we propose an algorithm including such spatiotemporal information into the analysis, and reduce the problem to the joint approximate diagonalization of a set of autocorrelation matrices. We demonstrate the feasibility of the algorithm by applying it to functional MRI analysis, where previous approaches are outperformed considerably.

Key words: blind source separation, independent component analysis, functional magnetic resonance imaging, autodecorrelation
PACS: 07.05.Kf, 87.61.–c, 05.40.–a, 05.45.Tp

1 Introduction

Blind source separation (BSS) describes the task of recovering an unknown mixing process and underlying sources of an observed data set. It has numerous applications in fields ranging from signal and image processing to the separation of speech and radar signals to financial data analysis. Many BSS algorithms assume either independence (independent component analysis, ICA) or diagonal autocorrelations of the sources [1,2]. Here we extend BSS algorithms based on time-decorrelation [3–8]. They rely on the fact that the data

∗ corresponding author
Email address: fabian@theis.name (Fabian J. Theis).
URL: http://fabian.theis.name (Fabian J. Theis).

Preprint submitted to Elsevier 15 May 2007



sets have non-trivial autocorrelations so that the unknown mixing matrix can be recovered by generalized eigenvalue decomposition.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. [9], it is a promising method with potential applications in areas where data contains an inherent spatiotemporal structure, such as data from biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm [10], which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are being performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization. This has the advantage of greatly reducing the number of parameters in the system, and leads to more stable optimization algorithms. We focus on so-called algebraic BSS algorithms [3,5,6,11], reviewed for example in [12], which employ generalized eigenvalue decomposition and joint diagonalization for the factorization. The corresponding learning rules are essentially parameter-free and are known to be robust and efficient [13].

In this contribution, we extend Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structures of the data. In the experiments presented, we observe good performance of the proposed algorithm when applied to noisy, high-dimensional data sets acquired from functional magnetic resonance imaging (fMRI). We concentrate on fMRI as it is well suited for spatiotemporal decomposition because spatial activation networks are mixed with functional and structural temporal components.

2 Blind source separation

We consider the following temporal BSS problem: Let x(t) be a second-order stationary, zero-mean, m-dimensional stochastic process and A a full-rank matrix such that x(t) = As(t) + n(t). The n-dimensional source signals s(t) are assumed to have diagonal autocorrelations Rτ(s) := ⟨s(t + τ)s(t)⊤⟩ for all τ, and the additive noise n(t) is modeled by a stationary, temporally and spatially white zero-mean process with variance σ². x(t) is observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A†x(t), which is optimal in the maximum-likelihood sense, where A† denotes the pseudo-inverse of A and m ≥ n. So the BSS task reduces to the estimation of the mixing matrix A.




3 Separation based on time-delayed decorrelation

For τ ≠ 0, the mixture autocorrelations factorize¹,

Rτ(x) = A Rτ(s) A⊤.   (1)

This gives an indication of how to recover A from x(t). The correlation of the signal part x̃(t) := As(t) of the mixtures x(t) may be calculated as R0(x̃) = R0(x) − σ²I, provided that the noise variance σ² is known. After whitening of x̃(t), i.e. joint diagonalization of R0(x̃), we can assume that x̃(t) has unit correlation and that m = n, so A is orthogonal². If more signals than sources are observed, dimension reduction can be performed in this step, thus reducing noise [14]. The symmetrized autocorrelation of x(t), R̄τ(x) := (1/2)(Rτ(x) + (Rτ(x))⊤), factorizes as well, R̄τ(x) = A R̄τ(s) A⊤, and by assumption R̄τ(s) is diagonal. Hence this factorization represents an eigenvalue decomposition of the symmetric matrix R̄τ(x). If we furthermore assume that R̄τ(x), or equivalently R̄τ(s), has n distinct eigenvalues, then A is already uniquely determined by R̄τ(x) except for column permutation. In addition to this separability result, a BSS algorithm, namely time-delayed decorrelation [3,4], is obtained by the diagonalization of R̄τ(x) after whitening — the diagonalizer yields the desired separating matrix.

However, this decorrelation approach decisively depends on the choice of τ — if an eigenvalue of R̄τ(x) is degenerate, the algorithm fails. Moreover, we face misestimates of R̄τ(x) due to finite-sample effects, so using additional statistics is desirable. Therefore, Belouchrani et al. [5], see also [6], proposed a more robust BSS algorithm, called second-order blind identification (SOBI), jointly diagonalizing a whole set of autocorrelation matrices R̄k(x) with varying time lags, for simplicity indexed by k = 1, . . . , K. They showed that increasing K improves SOBI performance in noisy settings [2]. Algorithm speed decreases linearly with K, so in practice K ranges from 10 to 100. Various numerical techniques for joint diagonalization exist, essentially minimizing ∑_{k=1}^{K} off(A⊤ R̄k(x) A) with respect to A, where off denotes the square sum of the off-diagonal terms. A global minimum of this function is called an (approximate) joint diagonalizer³, and it can be determined algorithmically for example by iterative Givens rotations [13].

1 s(t) and n(t) can be decorrelated, so Rτ(x) = ⟨As(t + τ)s(t)⊤A⊤⟩ + ⟨n(t + τ)n(t)⊤⟩ = A Rτ(s) A⊤, where the last equality follows because τ ≠ 0 and n(t) is white.
2 By assumption, R0(s) = I, hence I = ⟨As(t)s(t)⊤A⊤⟩ = A R0(s) A⊤ = AA⊤, so A is orthogonal.
3 The case of perfect diagonalization, i.e. of a zero-valued minimum, occurs if and only if all matrices that are to be diagonalized commute, which is equivalent to the matrices sharing the same system of eigenvectors.
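To make the single-lag variant concrete, here is a minimal NumPy sketch (ours; names and the toy data are illustrative) of time-delayed decorrelation: whiten the observations, then eigendecompose one symmetrized lagged autocorrelation; the eigenvector matrix is the estimated orthogonal mixing matrix of the whitened data.

    import numpy as np

    def sym_autocorr(X, tau):
        """Symmetrized lagged autocorrelation of a zero-mean (n, T) data matrix."""
        T = X.shape[1]
        R = X[:, tau:] @ X[:, :T - tau].T / (T - tau)
        return (R + R.T) / 2

    def time_delayed_decorrelation(X, tau=1):
        """Whitening followed by diagonalization of one lagged autocorrelation."""
        X = X - X.mean(axis=1, keepdims=True)
        d, U = np.linalg.eigh(np.cov(X))
        V = U @ np.diag(1.0 / np.sqrt(d)) @ U.T        # whitening: V R0 V^T = I
        Z = V @ X
        _, W = np.linalg.eigh(sym_autocorr(Z, tau))    # orthogonal diagonalizer
        return W.T @ Z, W.T @ V                        # estimated sources, unmixing matrix

    # toy usage: two AR(1) sources with different autocorrelations, mixed linearly
    rng = np.random.default_rng(0)
    T = 5000
    s = np.zeros((2, T))
    for t in range(1, T):
        s[:, t] = np.array([0.9, 0.2]) * s[:, t - 1] + rng.standard_normal(2)
    x = rng.standard_normal((2, 2)) @ s
    s_hat, W_unmix = time_delayed_decorrelation(x, tau=1)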




4 Spatiotemporal structures

Real-world data sets often possess structure in addition to the simple factorization models treated above. For example, fMRI measurements contain both temporal and spatial indices, so a data entry x = x(r1, r2, r3, t) can depend on position r := (r1, r2, r3) as well as time t. More generally, we want to consider data sets x(r, t) depending on two indices r and t, where r ∈ R^n can be any multidimensional (spatial) index and t indexes the time axis. In practice this generalized random process is realized by a finite number of samples. For example, in the case of fMRI scans we could assume t ∈ [1 : T] := {1, 2, . . . , T} and r ∈ [1 : h] × [1 : w] × [1 : d], where T is the number of scans of size h × w × d. So the number of spatial observations is ˢm := hwd and the number of temporal observations is ᵗm := T.

4.1 Temporal and spatial separation

For such multi-structured data, two methods of source separation exist. In temporal BSS, we interpret the data to contain a measured time series x_r(t) := x(r, t) for each spatial location r. Then our goal is to apply BSS to the temporal observation vector ᵗx(t) := (x_{r111}(t), . . . , x_{rhwd}(t))⊤ containing ˢm entries, i.e. consisting of ˢm spatial observations. In other words we want to find a decomposition ᵗx(t) = ᵗA ᵗs(t) with temporal mixing matrix ᵗA and temporal sources ᵗs(t), possibly of lower dimension. This contrasts with so-called spatial BSS, where the data is considered to be composed of T spatial patterns x_t(r) := x(r, t). Spatial BSS tries to decompose the spatial observation vector ˢx(r) := (x_{t1}(r), . . . , x_{tT}(r))⊤ ∈ R^{ᵗm} into ˢx(r) = ˢA ˢs(r) with a spatial mixing matrix ˢA and spatial sources ˢs(r), possibly of lower dimension. In this case, using multidimensional autocorrelations considerably enhances the separation [7,8]. In order to be able to use matrix notation, we contract the spatial multidimensional index r into a one-dimensional index r by row concatenation; the full multidimensional structure will only be needed later in the calculation of the multidimensional autocorrelation. Then the data set x(r, t) =: x_rt can be represented by a data matrix X of dimension ˢm × ᵗm, and our goal is to determine a source matrix S, either spatially or temporally.
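For concreteness, the bookkeeping from volume scans to the ˢm × ᵗm data matrix X amounts to a single reshape; a small sketch of ours, where the array sizes and layout are purely illustrative:

    import numpy as np

    # hypothetical fMRI volume series: T scans of size h x w x d
    h, w, d, T = 4, 5, 3, 100
    scans = np.random.default_rng(0).standard_normal((h, w, d, T))

    # contract the spatial index by row concatenation: X has shape (sm, tm) = (h*w*d, T)
    X = scans.reshape(h * w * d, T)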

4.2 Spatiotemporal matrix factorization

Temporal BSS implies the matrix factorization X = ᵗA ᵗS, whereas spatial BSS implies the factorization X⊤ = ˢA ˢS, or equivalently X = ˢS⊤ ˢA⊤. Hence

X = ᵗA ᵗS = ˢS⊤ ˢA⊤.   (2)




So both source separation models can be interpreted as matrix factorization problems; in the temporal case, restrictions such as diagonal autocorrelations are determined by the second factor, in the spatial case by the first one.

In order to achieve a spatiotemporal model, we require these conditions from both factors at the same time. In other words, instead of recovering a single source data set which fulfills the source conditions spatiotemporally, we try to find two source matrices, a spatial and a temporal source matrix, and the conditions are put onto the matrices separately. So the spatiotemporal BSS model can be derived from equation (2) as the factorization problem

X = ˢS⊤ ᵗS   (3)

with spatial source matrix ˢS and temporal source matrix ᵗS, which both have (multidimensional) autocorrelations that are as diagonal as possible. Diagonality of the autocorrelations is invariant under scaling and permutation, so the above model contains these indeterminacies — indeed the spatial and temporal sources can interchange scaling (L) and permutation (P) matrices, ˢS⊤ ᵗS = (L⁻¹P⁻¹ ˢS)⊤(LP ᵗS), and the model assumptions still hold. The spatiotemporal BSS problem as defined in equation (3) has been implicitly proposed in [9], equation (5), in combination with a dimension reduction scheme. Here, we first operate on the general model and derive the cost function based on autodecorrelation, and only later combine this with a dimension reduction method.

5 Algorithmic spatiotemporal BSS

Stone et al. [9] first proposed the model from equation (3), where a joint energy function is employed based on mutual entropy and Infomax. Apart from the many parameters used in the algorithm, the involved gradient descent optimization is susceptible to noise, local minima and inappropriate initializations, so we propose a novel, more robust algebraic approach in the following. It is based on the joint diagonalization of source conditions posed not only temporally but also spatially at the same time.

5.1 Spatiotemporal BSS using joint diagonalization

Shifting to matrix notation, we interpret R̄k(X) := R̄k(ᵗx(t)) as a symmetrized temporal autocorrelation matrix, whereas R̄k(X⊤) := R̄k(ˢx(r)) denotes the corresponding spatial, possibly multidimensional, symmetrized autocorrelation matrix. Here k indexes the one- or multidimensional lags τ.




Application of the spatiotemporal mixing model from equation (3) together with the transformation properties of the Rk's yields

Rk(X) = Rk(ˢS⊤ ᵗS) = ˢS⊤ Rk(ᵗS) ˢS
Rk(X⊤) = Rk(ᵗS⊤ ˢS) = ᵗS⊤ Rk(ˢS) ᵗS,   (4)

so

R̄k(ᵗS) = ˢS†⊤ R̄k(X) ˢS†
R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS†,   (5)

because ∗m ≥ n and hence ∗S ∗S† = I, where ∗ denotes either s or t. By assumption the matrices R̄k(∗S) are as diagonal as possible. Hence we can find one of the two source sets by jointly diagonalizing either R̄k(X) or R̄k(X⊤) for all k. The other source matrix can then be calculated by equation (3). However, we would then only be using either temporal or spatial properties, so this corresponds to only temporal or spatial BSS.

In order to include the full spatiotemporal information, we have to find diagonalizers for both R̄k(X) and R̄k(X⊤) such that they satisfy the spatiotemporal model (3). For now, let us assume the (unrealistic) case of ˢm = ᵗm = n — we will deal with the general problem using dimension reduction later. Then all matrices can be assumed to be invertible, and by model (3) we get ˢS⊤ = X ᵗS⁻¹. Applying this to equations (5) together with an inversion of the second equation yields

R̄k(ᵗS) = ᵗS X† R̄k(X) X†⊤ ᵗS⊤
R̄k(ˢS)⁻¹ = ᵗS R̄k(X⊤)⁻¹ ᵗS⊤.   (6)

So we can separate the data spatiotemporally by jointly diagonalizing the set of matrices {X† R̄k(X) X†⊤, R̄k(X⊤)⁻¹ | k = 1, . . . , K}.

Hence the goal of achieving spatiotemporal BSS ‘as much as possible’ means minimizing the joint error term of the above joint diagonalization criterion. Moreover, either spatial or temporal separation can be favored by introducing a weighting factor α ∈ [0, 1]. The set for approximate joint diagonalization is then defined by

{α X† R̄k(X) X†⊤, (1 − α) R̄k(X⊤)⁻¹ | k = 1, . . . , K}.   (7)

If A is a diagonalizer of (7), then the sources can be estimated by ᵗŜ = A⁻¹ and ˢŜ = A⊤X⊤. Joint diagonalization is usually performed by optimizing the off-diagonal criterion from above, so different scale factors in the matrices indeed yield different optima if the diagonalization cannot be achieved fully.



According to equations (6), the higher α, the more temporal separation is stressed. In the limit case α = 1 only the temporal criterion is optimized, so temporal BSS is performed, whereas for α = 0 a spatial BSS is calculated, although we want to remark that, in contrast to the temporal case, the cost function for α = 0 does not equal the spatial SOBI cost function due to the additional inversion. In practice, in order to be able to weight the matrix sets using α appropriately, a normalization by multiplication with a constant separately within the two sets seems appropriate to guarantee equal scales of the two matrix sets.

5.2 Dimension reduction

In principle, we may now use diagonalization of the matrix set from (7) to perform spatiotemporal BSS — but only in the case of equal dimensions. Furthermore, apart from computational issues involving the high dimensionality, the BSS estimate would be poor, simply because in the estimation of either R̄k(X) or R̄k(X⊤) equal or fewer samples than signals are available. Hence dimension reduction is essential.

Our goal is to extract only n ≪ min{ˢm, ᵗm} sources. A common approach to do so is to approximate X by the reduced singular value decomposition X ≈ UDV⊤, where only the n largest values of the diagonal matrix D and the corresponding columns of the pseudo-orthogonal matrices U and V are used. Plugging this approximation of X into (5) shows after some calculation⁴ that the set of matrices from equation (7) can be rewritten as

{α R̄k(D^{1/2}V⊤), (1 − α) R̄k(D^{1/2}U⊤)⁻¹ | k = 1, . . . , K}.   (8)

If A is a joint diagonalizer of this set, we may estimate the sources by ᵗŜ = A⊤D^{1/2}V⊤ and ˢŜ = A⁻¹D^{1/2}U⊤. We call the resulting algorithm spatiotemporal second-order blind identification or stSOBI, generalizing the temporal SOBI algorithm.

4 Using the approximation X ≈ (UD^{1/2})(VD^{1/2})⊤ together with the spatiotemporal BSS model (3) yields (UD^{−1/2})⊤ ˢS⊤ ᵗS (VD^{−1/2}) = I. Hence W := ᵗS V D^{−1/2} is an invertible n × n matrix. The first equation of (6) still holds in the more general case and we get R̄k(ᵗS) = ᵗS X̂† R̄k(X̂) X̂†⊤ ᵗS⊤ = W R̄k(D^{1/2}V⊤) W⊤. The second equation of (6) cannot hold for n < ∗m, but we can derive a similar result from (5), where we use W⁻¹ = D^{−1/2}V⊤ ᵗS†: R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS† = W⁻⊤ R̄k(D^{1/2}U⊤) W⁻¹, which we can now invert to get R̄k(ˢS)⁻¹ = W R̄k(D^{1/2}U⊤)⁻¹ W⊤.
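The following NumPy sketch (ours, not the authors' package code) builds the weighted matrix set of equation (8) from the reduced SVD. For simplicity it uses one-dimensional lags only (the paper also allows multidimensional spatial autocorrelations), assumes the spatial autocorrelation matrices are invertible, and omits the normalization of the two sub-sets mentioned above; joint diagonalization of the returned list then yields the stSOBI estimates.

    import numpy as np

    def sym_autocorr(Y, tau):
        """Symmetrized lagged autocorrelation of the rows of Y."""
        T = Y.shape[1]
        R = Y[:, tau:] @ Y[:, :T - tau].T / (T - tau)
        return (R + R.T) / 2

    def stsobi_matrix_set(X, n=4, K=10, alpha=0.5):
        """Weighted spatiotemporal matrix set of equation (8), from X ~ U D V^T."""
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        U, d, Vt = U[:, :n], d[:n], Vt[:n, :]
        Dt = np.diag(np.sqrt(d)) @ Vt     # D^(1/2) V^T : temporal factor
        Ds = np.diag(np.sqrt(d)) @ U.T    # D^(1/2) U^T : spatial factor
        mats  = [alpha * sym_autocorr(Dt, k) for k in range(1, K + 1)]
        # spatial autocorrelations are assumed well-conditioned here
        mats += [(1 - alpha) * np.linalg.inv(sym_autocorr(Ds, k)) for k in range(1, K + 1)]
        return mats, Dt, Ds

Given a joint diagonalizer A of the returned set, the source estimates follow the formulas above, ᵗŜ = A⊤Dt and ˢŜ = A⁻¹Ds.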



[Figure 1 shows (a) the four recovered component maps ˢS and (b) the corresponding time courses ᵗS with stimulus crosscorrelations cc = 0.05, 0.17, 0.14 and 0.89.]

Fig. 1. fMRI analysis using stSOBI with temporal and two-dimensional spatial autocorrelations. The data was reduced to the 4 largest components. (a) shows the recovered component maps (brain background is given using a structural scan; overlaid white points indicate activation values stronger than 3 standard deviations), and (b) their time courses. Component 3 partially contains the frontal eye fields. Component 4 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes) — their crosscorrelation lies at cc = 0.89 — with a delay of roughly 6 seconds induced by the BOLD effect.

5.3 Implementation

In the experiments we use stSOBI with both one-dimensional and multidimensional autocovariances. Our software package⁵ implements all the details of mdSOBI and its extension stSOBI in Matlab. In addition to Cardoso's joint diagonalization algorithm based on iterative Givens rotations, the package contains all the files needed to reproduce the results described in this paper, with the exception of the fMRI data set.
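For readers without the package, here is a compact sketch (ours) of an orthogonal joint diagonalizer of the Jacobi/Givens-rotation type, following the rotation-angle update of Cardoso and Souloumiac; it is a re-implementation for illustration, not the package code.

    import numpy as np

    def jacobi_joint_diag(Ms, sweeps=100, tol=1e-12):
        """Approximate orthogonal joint diagonalization of symmetric matrices
        by iterative Givens rotations (Cardoso/Souloumiac-style sketch)."""
        Ms = [M.copy() for M in Ms]
        n = Ms[0].shape[0]
        V = np.eye(n)
        for _ in range(sweeps):
            changed = False
            for p in range(n - 1):
                for q in range(p + 1, n):
                    # 2x2 subproblem accumulated over all matrices
                    h = np.array([[M[p, p] - M[q, q], M[p, q] + M[q, p]] for M in Ms])
                    G = h.T @ h
                    ton, toff = G[0, 0] - G[1, 1], G[0, 1] + G[1, 0]
                    theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                    c, s = np.cos(theta), np.sin(theta)
                    if abs(s) > tol:
                        changed = True
                        for M in Ms:
                            Mp, Mq = M[:, p].copy(), M[:, q].copy()
                            M[:, p], M[:, q] = c * Mp + s * Mq, -s * Mp + c * Mq
                            Mp, Mq = M[p, :].copy(), M[q, :].copy()
                            M[p, :], M[q, :] = c * Mp + s * Mq, -s * Mp + c * Mq
                        Vp, Vq = V[:, p].copy(), V[:, q].copy()
                        V[:, p], V[:, q] = c * Vp + s * Vq, -s * Vp + c * Vq
            if not changed:
                break
        return V, Ms   # V^T M_k V is approximately diagonal for each input matrix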

6 Results

BSS, mainly based on ICA, is nowadays a quite common tool in fMRI analysis [15,16]. For this work, we analyzed the performance of stSOBI when applied to fMRI measurements. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans (TR/TE = 3000/60 ms) with 5 slices each were acquired with 5 periods of rest and 5 photic stimulation periods.

5 available online at http://fabian.theis.name/




Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. Resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point, and a dark background with a central fixation point during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment [17]. For visualization, we only considered a single slice (non-brain areas were masked out), and chose to reduce the data set to n = 4 components by singular value decomposition.

6.1 Single subject analysis

In the joint diagonalization, K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation. Figure 1 shows the performance of the algorithm for equal spatiotemporal weighting α = 0.5. Although the data was reduced to only 4 components, stSOBI was able to extract the stimulus component (#4) very well; the crosscorrelation of the identified task component with the time-delayed stimulus is high (cc = 0.89). Some additional brain components are detected, although a higher n would allow for more elaborate decompositions.

In order to compare spatial and temporal models, we applied stSOBI with varying spatiotemporal weighting factors α ∈ {0, 0.1, . . . , 1}. The task component was always extracted, although with different quality. In figure 2, we plotted the maximal crosscorrelation of the time courses with the stimulus versus α. If only spatial separation is performed, the identified stimulus component is considerably worse (cc = 0.8) than in the case of temporal recovery (cc = 0.9); the component maps coincide rather well. The enhanced extraction confirms the advantages of spatiotemporal separation in contrast to the commonly used spatial-only separation. Temporal separation alone, although preferable in the presented example, often faces the problem of high dimensions and low sample number, so an adjustable weighting α as proposed here allows for the highest flexibility.

6.2 Algorithm comparison

We then compared our analysis with some established algorithms for fMRI analysis. In order to numerically perform the comparisons, we determined the single component that is maximally autocorrelated with the known stimulus task. These components are shown in figure 3.



[Figure 2 plots the crosscorrelation with the stimulus against α, together with the recovered component maps ˢS for α = 1 and α = 0.]

Fig. 2. Performance of stSOBI for varying α. Low α favors spatial separation, high α temporal separation. Two recovered component maps are plotted for the extremal cases of spatial (α = 0) and temporal (α = 1) separation.

After some difficulties due to the many possible parameters, Stone's stICA algorithm [9] was applied to the data. However, the task component could not be recovered very well — it showed some activity in the visual cortex, but with a rather low temporal crosscorrelation of 0.53 with the stimulus component, which is much lower than the 0.9 of the multi-dimensional stSOBI and the 0.86 of stSOBI with one-dimensional autocovariances. We believe that this is due to convergence problems of the employed Infomax rule, and to non-trivial tuning of the many parameters involved in the algorithm. In order to test for convergence issues, we combined stSOBI and stICA by applying Stone's local stICA algorithm to the stSOBI separation results. Due to this careful initialization, the stICA result improved (crosscorrelation of 0.58) but was still considerably lower than the stSOBI result.

Similar results were achieved by the well-known FastICA algorithm [18], which we applied in order to identify spatially independent components. The algorithm could not recover the stimulus component (maximal crosscorrelation of 0.51, and no activity in the visual cortex). This poor result is due to the dimension reduction to only 4 components, and coincides with the decreased performance of stSOBI in the spatial case α = 0. In this respect, the spatiotemporal model is obviously much more flexible, as spatiotemporal dimension reduction is able to capture the structure better than only spatial reduction.

Finally, we tested the robustness of the spatiotemporal framework by modifying the cost function. It is well known that sources with varying source properties can be separated by modifying the source condition matrices.



[Figure 3 shows the stimulus time course (top) and, below it, the maximally stimulus-correlated component time courses recovered by stNSS, stSOBI (1D), stICA after stSOBI, stICA and fastICA.]

Fig. 3. Comparison of the recovered component that is maximally autocrosscorrelated with the stimulus task (top) for various BSS algorithms, after dimension reduction to 4 components. The absolute corresponding autocorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA).

Instead of calculating autocovariance matrices, other statistics of the spatial and temporal sources can be used, as long as they satisfy the factorization from equation (1). This results in ‘algebraic BSS’ algorithms such as AMUSE [3], JADE [11], SOBI and TDSEP [5,6], reviewed for instance in [12]. Instead of performing autodecorrelation, we used the idea of the NSS-SD algorithm (‘non-stationary sources with simultaneous diagonalization’) [19], cf. [20]: the sources were assumed to be spatiotemporal realizations of non-stationary random processes ∗si(t) with ∗ ∈ {t, s} determining the temporal or spatial direction. If we assume that the resulting covariance matrices C(∗s(t)) vary sufficiently with time, the factorization of equation (1) also holds for these covariance matrices. Hence, joint diagonalization of

{C(ᵗx(1)), C(ˢx(1)), C(ᵗx(2)), C(ˢx(2)), . . .}

allows for the calculation of the mixing matrix. The covariance matrices are commonly estimated in separate non-overlapping temporal or spatial windows. Replacing the autocovariances in (8) by the windowed covariances results in the spatiotemporal NSS-SD or stNSS algorithm.
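A small sketch (ours; the function name and window count are illustrative) of the windowed covariance estimation used as the stNSS source condition: the data is split into non-overlapping blocks along one axis and one covariance matrix is estimated per block.

    import numpy as np

    def windowed_covariances(X, n_windows=12):
        """Covariance matrices of the rows of X estimated in non-overlapping
        windows along the second axis (NSS-SD-style source condition)."""
        blocks = np.array_split(np.arange(X.shape[1]), n_windows)
        return [np.cov(X[:, b]) for b in blocks]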

[Figure 4 shows boxplots: (a) separation performance (autocorrelation with the stimulus) and computational effort for the 10 subjects, (b) autocorrelation with the stimulus after spatial subsampling, and (c) computation time after subsampling, the latter two plotted against the subsampling percentage (1 to 200%).]

Fig. 4. Multiple subject comparison. (a) shows the algorithm performance in terms of separation quality (autocorrelation with stimulus) and computation time when compared over 100 runs and 10 subjects. (b) and (c) compare these indices after subsampling the data spatially with varying percentages.

In the fMRI example, we applied stNSS using one-dimensional covariance matrices and 12 windows (both temporally and spatially). Although the data exhibited only weak non-stationarities (the mean masked voxel values vary from 983 to 1000 over the 98 time steps, with a standard deviation varying from 228 to 234), the task component could be extracted rather well, with a crosscorrelation of 0.80, see figure 3. Similarly, by replacing the autocorrelations with other source conditions [12], we can easily construct alternative separation algorithms.

6.3 Multiple subject analysis<br />

We f<strong>in</strong>ish by analyz<strong>in</strong>g the performance of the stSOBI algorithm for multiple<br />

subjects. As before, we applied stSOBI with dimension reduction to only n = 4<br />


sources. Here, K = 12; for simplicity, one-dimensional autocovariance matrices were used, both spatially and temporally. We masked the data using a fixed

common threshold. In order to quantify algorithm performance, as before we<br />

determ<strong>in</strong>ed the spatiotemporal source that had a time course with maximal<br />

autocorrelation with the stimulus protocol, and compared this autocorrelation.<br />

In figure 4(a), we show a boxplot of the autocorrelations together with the<br />

needed computational effort. The median autocorrelation was very high at 0.89. The separation was fast, with a mean computation time of 0.25 s on a 1.7 GHz Intel Dual Core laptop running Matlab. In order to confirm this

robustness of the algorithm, we analyzed the sample-size dependence of the<br />

method by running stSOBI on subsampled data sets. The bootstrapping was performed spatially with repetition, but with reordering of the samples in order to maintain spatial dependencies. Figure 4(b-c) shows the algorithm performance

when vary<strong>in</strong>g the subsampl<strong>in</strong>g percentage from 1 to 200 percent, where<br />

the statistics were done over 100 runs and over the 10 subjects. Even when<br />

us<strong>in</strong>g only 1 percent of the samples, we achieved a median autocorrelation<br />

of 0.66, which <strong>in</strong>creased at a subsampl<strong>in</strong>g percentage of 10% to an already<br />

acceptable value of 0.85. This confirms the robustness and efficiency of the<br />

proposed method, which of course comes from the underly<strong>in</strong>g robust optimization<br />

method of jo<strong>in</strong>t diagonalization.<br />
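The subsampling experiment can be paraphrased as follows; the sketch below only illustrates the evaluation logic, with `separate` standing in for any BSS routine (stSOBI in the paper) that returns component time courses, and with X, protocol and stsobi in the usage comment being placeholders.

    import numpy as np

    def best_stimulus_correlation(time_courses, stimulus):
        """Largest absolute correlation between any component time course and the stimulus."""
        s = (stimulus - stimulus.mean()) / stimulus.std()
        best = 0.0
        for tc in time_courses:
            t = (tc - tc.mean()) / tc.std()
            best = max(best, abs(np.mean(t * s)))
        return best

    def subsampled_run(X, stimulus, separate, percentage, rng):
        """Spatial bootstrap: draw voxels with repetition at the given percentage.

        X is (time steps x voxels); the drawn indices are sorted so that the
        spatial order, and hence spatial dependencies, are preserved.
        """
        n_vox = X.shape[1]
        k = max(1, int(round(n_vox * percentage / 100.0)))
        idx = np.sort(rng.integers(0, n_vox, size=k))
        time_courses = separate(X[:, idx])
        return best_stimulus_correlation(time_courses, stimulus)

    # example usage over the percentages and runs reported in the text:
    # corrs = [subsampled_run(X, protocol, stsobi, p, np.random.default_rng(i))
    #          for p in (1, 2, 5, 10, 20, 50, 100, 200) for i in range(100)]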

7 Conclusion<br />

We have proposed a novel spatiotemporal BSS algorithm named stSOBI. It<br />

is based on the jo<strong>in</strong>t diagonalization of both spatial and temporal autocorrelations.<br />

Shar<strong>in</strong>g the properties of all algebraic algorithms, stSOBI is easy to<br />

use, robust (with only a s<strong>in</strong>gle parameter) and fast (<strong>in</strong> contrast to the onl<strong>in</strong>e<br />

algorithm proposed by Stone). The employed dimension reduction allows<br />

for the spatiotemporal decomposition of high-dimensional data sets such as<br />

fMRI record<strong>in</strong>gs. The presented results for such data sets show that stSOBI<br />

clearly outperforms spatial-only recovery and Stone’s spatiotemporal algorithm.<br />

Moreover, the proposed algorithm is not limited to second-order statistics,<br />

but can easily be extended to spatiotemporal ICA for example by jo<strong>in</strong>tly<br />

diagonaliz<strong>in</strong>g both spatial and temporal cumulant matrices.<br />

Acknowledgments<br />

The authors gratefully acknowledge partial f<strong>in</strong>ancial support by the DFG<br />

(GRK 638) and the BMBF (project ‘ModKog’). They would like to thank<br />

D. Auer from the MPI of Psychiatry <strong>in</strong> Munich, Germany, for provid<strong>in</strong>g the<br />


fMRI data, and A. Meyer-Bäse from the Department of Electrical and Computer

Eng<strong>in</strong>eer<strong>in</strong>g, FSU, Tallahassee, USA for discussions concern<strong>in</strong>g the fMRI<br />

analysis. The authors thank the anonymous reviewers for their helpful comments<br />

dur<strong>in</strong>g preparation of this manuscript.<br />

References<br />

[1] A. Hyvärinen, J. Karhunen, E. Oja, Independent component analysis, John Wiley & Sons, 2001.
[2] A. Cichocki, S. Amari, Adaptive blind signal and image processing, John Wiley & Sons, 2002.
[3] L. Tong, R.-W. Liu, V. Soon, Y.-F. Huang, Indeterminacy and identifiability of blind identification, IEEE Transactions on Circuits and Systems 38 (1991) 499–509.
[4] L. Molgedey, H. Schuster, Separation of a mixture of independent signals using time-delayed correlations, Physical Review Letters 72 (23) (1994) 3634–3637.
[5] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique based on second order statistics, IEEE Transactions on Signal Processing 45 (2) (1997) 434–444.
[6] A. Ziehe, K.-R. Müller, TDSEP – an efficient algorithm for blind separation using time structure, in: L. Niklasson, M. Bodén, T. Ziemke (Eds.), Proc. of ICANN'98, Springer Verlag, Berlin, Skövde, Sweden, 1998, pp. 675–680.
[7] H. Schöner, M. Stetter, I. Schießl, J. Mayhew, J. Lund, N. McLoughlin, K. Obermayer, Application of blind separation of sources to optical recording of brain activity, in: Proc. NIPS 1999, Vol. 12, MIT Press, 2000, pp. 949–955.
[8] F. Theis, A. Meyer-Bäse, E. Lang, Second-order blind source separation based on multi-dimensional autocovariances, in: Proc. ICA 2004, Vol. 3195 of LNCS, Springer, Granada, Spain, 2004, pp. 726–733.
[9] J. Stone, J. Porrill, N. Porter, I. Wilkinson, Spatiotemporal independent component analysis of event-related fMRI data using skewed probability density functions, NeuroImage 15 (2) (2002) 407–421.
[10] A. Bell, T. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Computation 7 (1995) 1129–1159.
[11] J. Cardoso, A. Souloumiac, Blind beamforming for non Gaussian signals, IEE Proceedings - F 140 (6) (1993) 362–370.
[12] F. Theis, Y. Inouye, On the use of joint diagonalization in blind signal processing, in: Proc. ISCAS 2006, Kos, Greece, 2006.
[13] J. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization, SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.
[14] M. Joho, H. Mathis, R. Lambert, Overdetermined blind source separation: using more sensors than source signals in a noisy mixture, in: Proc. of ICA 2000, Helsinki, Finland, 2000, pp. 81–86.
[15] M. McKeown, T. Jung, S. Makeig, G. Brown, S. Kindermann, A. Bell, T. Sejnowski, Analysis of fMRI data by blind separation into independent spatial components, Human Brain Mapping 6 (1998) 160–188.
[16] I. Keck, F. Theis, P. Gruber, E. Lang, K. Specht, C. Puntonet, 3D spatial analysis of fMRI data on a word perception task, in: Proc. ICA 2004, Vol. 3195 of LNCS, Springer, Granada, Spain, 2004, pp. 977–984.
[17] R. Woods, S. Cherry, J. Mazziotta, Rapid automated algorithm for aligning and reslicing PET images, Journal of Computer Assisted Tomography 16 (1992) 620–633.
[18] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[19] S. Choi, A. Cichocki, Blind separation of nonstationary sources in noisy mixtures, Electronics Letters 36 (2000) 848–849.
[20] D.-T. Pham, J. Cardoso, Blind separation of instantaneous mixtures of nonstationary sources, IEEE Transactions on Signal Processing 49 (9) (2001) 1837–1848.



Chapter 10<br />

IEEE TNN 16(4):992-996, 2005<br />

Paper P. Georgiev, F.J. Theis, and A. Cichocki. Sparse component analysis and<br />

bl<strong>in</strong>d source separation of underdeterm<strong>in</strong>ed mixtures. IEEE Transactions on<br />

Neural Networks, 16(4):992-996, 2005<br />

Reference (Georgiev et al., 2005c)<br />

Summary <strong>in</strong> section 1.4.1<br />




Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures

Pando Georgiev, Fabian Theis, and Andrzej Cichocki

Abstract—In this letter, we solve the problem of identifying matrices S and A knowing only their product X = AS, under some conditions, expressed either in terms of A and the sparsity of S (identifiability conditions), or in terms of X (sparse component analysis (SCA) conditions). We present algorithms for such identification and illustrate them by examples.

Index Terms—Blind source separation (BSS), sparse component analysis (SCA), underdetermined mixtures.

I. INTRODUCTION

One of the fundamental questions in data analysis, signal processing, data mining, neuroscience, etc. is how to represent a large data set X (given in the form of an (m × N)-matrix) in different ways. A simple approach is a linear matrix factorization

X = AS,  A ∈ R^(m×n), S ∈ R^(n×N)  (1)

where the unknown matrices A (dictionary) and S (source signals) have some specific properties, for instance:

1) the rows of S are (discrete) random variables, which are statistically independent as much as possible—this is the independent component analysis (ICA) problem; 2) S contains as many zeros as possible—this is the sparse representation or sparse component analysis (SCA) problem; 3) the elements of X, A, and S are nonnegative—this is nonnegative matrix factorization (NMF) [8].

There is a large amount of papers devoted to ICA problems [2], [5], but mostly for the case m ≥ n. We refer to [1], [6], [7], and [9]–[11] for some recent papers on SCA and underdetermined ICA (m < n).

A related problem is the so-called blind source separation (BSS) problem, in which we know a priori that a representation such as in (1) exists and the task is to recover the sources (and the mixing matrix) as accurately as possible. A fundamental property of the complete BSS problem is that such a recovery (under the assumptions in 1) and non-Gaussianity of the sources) is possible up to permutation and scaling of the sources, which makes the BSS problem so attractive.

In this letter, we consider SCA and BSS problems in the underdetermined case (m < n, i.e., more sources than sensors, which is a more challenging problem), where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not necessarily be sparse in the time domain. It would be sufficient to find a linear transformation (e.g., wavelet packets) in which the sources are sufficiently sparse.

In the sequel, we present new algorithms for solving the BSS problem: a matrix identification algorithm and a source recovery algorithm, under the condition that the source matrix S has at most m − 1 nonzero elements in each column and the identifiability conditions are satisfied (see Theorem 1).

Manuscript received November 20, 2003; revised July 25, 2004.
P. Georgiev is with the ECECS Department, University of Cincinnati, Cincinnati, OH 45221 USA (e-mail: pgeorgie@ececs.uc.edu).
F. Theis is with the Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany (e-mail: fabian@theis.name).
A. Cichocki is with the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, The Institute of Physical and Chemical Research (RIKEN), Saitama 351-0198, Japan (e-mail: cia@bsp.brain.riken.jp).
Digital Object Identifier 10.1109/TNN.2005.849840



When the sources are locally very sparse (see condition i) of Theorem 2), the matrix identification algorithm is much simpler. We used this simpler form for separation of mixtures of images. After a sparsification transformation (which is in fact an appropriate wavelet transformation) the algorithm works perfectly in the complete case. We demonstrate the effectiveness of our general matrix identification algorithm and of the source recovery algorithm in the underdetermined case for 7 artificially created sparse source signals, such that the source matrix S has at most 2 nonzero elements in each column, mixed with a randomly generated (3 × 7) matrix. For a comparison, we present a recovery using l1-norm minimization [3], [4], which gives signals that are far from the original ones. This implies that the conditions which ensure equivalence of l1-norm and l0-norm minimization [4, Theorem 7] are generally not satisfied for randomly generated matrices. Note that l1-norm minimization gives solutions which have at most m nonzeros [3], [4]. Another connection with [4] is the fact that our algorithm for source recovery works "with probability one," i.e., for almost all data vectors x (in the measure sense) such that the system x = As has a sparse solution with fewer than m nonzero elements, this solution is unique, while in [4] the authors proved that for all data vectors x such that the system x = As has a sparse solution with fewer than Spark(A)/2 nonzero elements, this solution is unique. Note that Spark(A) ≤ m + 1, where Spark(A) is the smallest number of linearly dependent columns of A.
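Since the spark enters here only through its definition, a brute-force illustration may help; this is merely a toy check for small matrices (the computation is combinatorial) and is not part of the letter.

    import numpy as np
    from itertools import combinations

    def spark(A, tol=1e-10):
        """Smallest number of linearly dependent columns of A (brute force; small n only)."""
        m, n = A.shape
        for k in range(1, n + 1):
            for cols in combinations(range(n), k):
                if np.linalg.matrix_rank(A[:, cols], tol=tol) < k:
                    return k
        return n + 1  # all columns independent (only possible when n <= m)

    # A random 3 x 7 matrix has spark 4 with probability one: any 3 of its columns
    # are independent, while any 4 vectors in R^3 are linearly dependent.
    print(spark(np.random.default_rng(0).standard_normal((3, 7))))  # -> 4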

II. BLIND SOURCE SEPARATION

In this section, we develop a method for completely solving the BSS problem if the following assumptions are satisfied:
A1) the mixing matrix A ∈ R^(m×n) has the property that any square m × m submatrix of it is nonsingular;
A2) each column of the source matrix S has at most m − 1 nonzero elements;
A3) the sources are sufficiently richly represented in the following sense: for any index set of n − m + 1 elements I = {i_1, ..., i_(n−m+1)} ⊂ {1, ..., n} there exist at least m column vectors of the matrix S such that each of them has zero elements in the places with indexes in I and each m − 1 of them are linearly independent.

A. Matrix Identification

We describe conditions in the sparse BSS problem under which we can identify the mixing matrix uniquely up to permutation and scaling of the columns. We give two types of such conditions. The first one corresponds to the least sparse case in which such identification is possible. Further, we consider the most sparse case (for a small number of samples), as in this case the algorithm is much simpler.

1) General Case—Full Identifiability:

Theorem 1 (Identifiability Conditions—General Case): Assume that in the representation X = AS the matrix A satisfies condition A1), the matrix S satisfies conditions A2) and A3), and only the matrix X is known. Then the mixing matrix A is identifiable uniquely up to permutation and scaling of the columns.

Proof: It is clear that any column a_i of the mixing matrix lies in the intersection of all C(n−1, m−2) hyperplanes generated by those columns of A in which a_i participates (writing C(p, q) for the binomial coefficient "p choose q").

We will show that these hyperplanes can be obtained from the columns of the data X under the conditions of the theorem. Let T be the family of all subsets of {1, ..., n} containing m − 1 elements and let J ∈ T. Note that T consists of C(n, m−1) elements. We will show that the hyperplane (denoted by H_J) generated by the columns of A with indexes from J can be obtained from some columns of X. By A2) and A3), there exist m indexes t_1, ..., t_m ∈ {1, ..., N} such that any m − 1 of the vector columns S(:, t_1), ..., S(:, t_m) form a basis of the (m − 1)-dimensional coordinate subspace of R^n with zero coordinates given by {1, ..., n} \ J. Because of the mixing model, the vectors of the form

x_k = Σ_{j ∈ J} S(j, t_k) a_j,  k = 1, ..., m,

belong to the data matrix X. Now, by condition A1) it follows that any m − 1 of the vectors x_1, ..., x_m are linearly independent, which implies that they span the same hyperplane H_J. By A1) and the above, it follows that we can cluster the columns of X uniquely in C(n, m−1) groups H_k, k = 1, ..., C(n, m−1), such that each group H_k contains at least m elements and they span one hyperplane H_J for some J_k ∈ T. Now we cluster the hyperplanes obtained in such a way in the smallest number of groups such that the intersection of all hyperplanes in each group gives a single one-dimensional (1-D) subspace. It is clear that such a 1-D subspace will contain one column of the mixing matrix, the number of these groups is n, and each group consists of C(n−1, m−2) hyperplanes.

The proof of this theorem gives the idea for the matrix identification algorithm.

Algorithm for Identification of the Mixing Matrix:

1) Cluster the columns of X in C(n, m−1) groups H_k, k = 1, ..., C(n, m−1), such that the span of the elements of each group H_k produces one hyperplane and these hyperplanes are different.
2) Cluster the normal vectors to these hyperplanes in the smallest number of groups G_j, j = 1, ..., n (which gives the number of sources n), such that the normal vectors to the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
3) Calculate the normal vectors â_j to each hyperplane Ĥ_j, j = 1, ..., n. Note that the 1-D subspace spanned by â_j is the intersection of all hyperplanes in G_j. The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).

2) Degenerate Case—Sparse Instances:

Theorem 2 (Identifiability Conditions—Locally Very Sparse Representation): Assume that the number of sources is unknown and the following hold:
i) for each index i = 1, ..., n there are at least two columns of S, S(:, j_1) and S(:, j_2), which have nonzero elements only in position i (so each source is uniquely present at least twice);
ii) X(:, j) ≠ cX(:, k) for any c ∈ R, any j = 1, ..., N and any k = 1, ..., N, k ≠ j, for which S(:, k) has more than one nonzero element.
Then the number of sources and the matrix A are identifiable uniquely up to permutation and scaling.

Proof: We cluster in groups all nonzero normalized column vectors of X such that within each group the vectors differ only by sign. From conditions i) and ii), it follows that the number of groups containing more than one element is precisely the number of sources n, and that each such group represents a normalized column of A (up to sign).

In the following, we include an algorithm for identification of the mixing matrix based on Theorem 2.

Algorithm for Identification of the Mixing Matrix in the Very Sparse Case:

1) Remove all zero columns of X (if any) and obtain a matrix X_1 ∈ R^(m×N_1).



2) Normalize the columns x_j, j = 1, ..., N_1, of X_1: x_j ← x_j/||x_j||, and set ε > 0. Multiply each column x_j by −1 if its first element is negative.
3) Cluster the x_j, j = 1, ..., N_1, in n + 1 groups G_1, ..., G_(n+1) such that for any group, ||x_i − x_j|| < ε for all x_i, x_j in that group, and ||x_i − x_j|| ≥ ε for any x_i, x_j belonging to different groups.
4) Choose any x_j ∈ G_k and put â_k = x_j. The matrix Â with columns {â_k} is an estimate of the mixing matrix, up to permutation and scaling.
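A direct transcription of these four steps might look as follows; the greedy ε-clustering and the sign fix via the first entry are implementation choices made here for brevity, not prescribed by the letter.

    import numpy as np

    def identify_mixing_very_sparse(X, eps=1e-6, zero_tol=1e-12):
        """Estimate columns of A when each source appears alone in at least two columns of X."""
        # 1) remove (near-)zero columns
        X1 = X[:, np.linalg.norm(X, axis=0) > zero_tol]
        # 2) normalize and flip signs so that the first entry is nonnegative
        U = X1 / np.linalg.norm(X1, axis=0)
        U = U * np.where(U[0] < 0, -1.0, 1.0)
        # 3) greedy clustering of directions up to the threshold eps
        centers, counts = [], []
        for u in U.T:
            for i, c in enumerate(centers):
                if np.linalg.norm(u - c) < eps:
                    counts[i] += 1
                    break
            else:
                centers.append(u)
                counts.append(1)
        # 4) groups with more than one member correspond to columns of A (Theorem 2, i))
        return np.array([c for c, k in zip(centers, counts) if k > 1]).T

Columns of X generated by a single source land in the same group, while generic mixed columns stay isolated by condition ii), so only the multi-member groups are kept.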

We should mention that the very sparse case in different settings has already been considered in the literature, but in a more restrictive sense. In [6], the authors suppose that the supports of the Fourier transforms of any two source signals are disjoint sets—a much more restrictive condition than our condition. In [1], the authors suppose that for any source there exists a time-frequency window where only this source is nonzero, and that the time-frequency transform of each source is not constant on any time-frequency window. We would like to mention that their condition should also include the case where the time-frequency transforms of any two sources are not proportional in any time-frequency window. Such a quantitative condition (without frequency representation) is presented in our Theorem 2, condition ii).

Fig. 1. Original images.

Fig. 2. Mixed (observed) images.

Fig. 3. Estimated normalized images using the estimated matrix. The signal-to-noise ratios with the sources from Fig. 1 are 232, 239, and 228 dB, respectively.

B. Identification of Sources

Theorem 3 (Uniqueness of Sparse Representation): Let H be the set of all x ∈ R^m such that the linear system As = x has a solution with at least n − m + 1 zero components. If A fulfills A1), then there exists a subset H_0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H_0 this system has no other solution with this property.

Proof: Obviously H is the union of all C(n, m−1) = n!/((m−1)!(n−m+1)!) hyperplanes produced by taking the linear hull of every subset of the columns of A with m − 1 elements. Let H_0 be the union of all intersections of any two such subspaces. Then H_0 has measure zero in H and satisfies the conclusion of the theorem. Indeed, assume that x ∈ H \ H_0 and As = As' = x, where s and s' have at least n − m + 1 zeros. Since x ∉ H_0, x belongs to only one hyperplane produced as the linear hull of some m − 1 columns a_{i_1}, ..., a_{i_(m−1)} of A. It means that the vectors s and s' have n − m + 1 zeros in the places with indexes in {1, ..., n} \ {i_1, ..., i_(m−1)}. Now from the equation A(s − s') = 0 it follows that the m − 1 vector columns a_{i_1}, ..., a_{i_(m−1)} of A are linearly dependent, which is a contradiction with A1).

From Theorem 3 it follows that the sources are identifiable generically, i.e., up to a set with measure zero, if they have a level of sparseness greater than or equal to n − m + 1 and the mixing matrix is known. In the following, we present an algorithm based on the observation in Theorem 3.

Source Recovery Algorithm:

1) Identify the set of hyperplanes produced by taking the linear hull of every subset of the columns of A with m − 1 elements.
2) Repeat for j = 1 to N:
2.1) Identify the hyperplane containing x_j := X(:, j), or, in a practical situation with presence of noise, identify the one to which the distance from x_j is minimal and project x_j onto it, obtaining x̃_j.
2.2) If this hyperplane is produced by the linear hull of the column vectors a_{i_1}, ..., a_{i_(m−1)}, then find coefficients λ_{k,j} such that x̃_j = Σ_{k=1}^{m−1} λ_{k,j} a_{i_k}. These coefficients are uniquely determined if x̃_j does not belong to the set H_0 with measure zero with respect to H (see Theorem 3).
2.3) Construct the solution s_j = S(:, j): it contains λ_{k,j} in the place i_k for k = 1, ..., m − 1; its other components are zero.

III. SCA

In this section, we develop a method for the complete solution of the SCA problem. Now the conditions are formulated only in terms of the data matrix X.

Theorem 4 (SCA Conditions): Assume that m ≤ N and the matrix X ∈ R^(m×N) satisfies the following conditions:
i) the columns of X lie in the union of C(n, m−1) different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;
ii) for each i ∈ {1, ..., n} there exist p = C(n−1, m−2) different hyperplanes H_{i,1}, ..., H_{i,p} among them such that their intersection L_i = ∩_{j=1}^p H_{i,j} is a 1-D subspace;
iii) any m different L_i span the whole R^m.
Then the matrix X is representable uniquely (up to permutation and scaling of the columns of A and rows of S) in the form X = AS, where the matrices A ∈ R^(m×n) and S ∈ R^(n×N) satisfy the conditions A1) and A2), A3), respectively.

Proof: Let L_i be spanned by â_i and collect these vectors in the set {â_1, ..., â_n}. Condition iii) implies that any of the hyperplanes contains at most m − 1 of these vectors. By i) and ii), it follows that these vectors are exactly m − 1: only in this case does the count of all hyperplanes obtained from ii) give the number in i), namely n·C(n−1, m−2)/(m − 1) = C(n, m−1). Let A be a matrix whose column vectors are the vectors â_1, ..., â_n (taken in an arbitrary order). Since every column vector x of X lies in only one of the hyperplanes, the linear system As = x has a unique solution, which has at least n − m + 1 zeros (see the proof of Theorem 3). Let x_1, ..., x_m be m column vectors from X which span one of the hyperplanes, such that each m − 1 of them are linearly independent (such vectors exist by i)). Then we have As_k = x_k for some uniquely determined vectors s_k, which have at least n − m + 1 zeros in the same coordinates, and m − 1 of them are linearly independent. In such a way, we can write X = AS for some uniquely determined matrix S, which satisfies A2) and A3).
vectors ��Y� aIYFFFY�0 I, which are l<strong>in</strong>early <strong>in</strong>dependent and have



Fig. 4. (a) Mixed signals and (b) normalized scatter plot (density) of the mixtures together with the 21 data set hyperplanes, visualized by their intersection with the unit sphere in R^3.

Fig. 5. (a) Original source signals. (b) Recovered source signals—the signal-to-noise ratio between the original sources and the recoveries is very high (above 278 dB after permutation and normalization). Note that only 200 samples are enough for excellent separation. (c) Recovered source signals using l1-norm minimization and known mixing matrix. Simple comparison confirms that the recovered signals are far from the original ones, and the signal-to-noise ratio is only around 4 dB.

We should mention that our algorithms are robust with respect to small additive noise and big outliers, since the algorithms cluster the data on hyperplanes approximately, up to a threshold ε > 0, which could accumulate noise with amplitude less than ε. The big outliers will not be clustered to any hyperplane.

IV. COMPUTER SIMULATION EXAMPLES

A. Complete Case

In this example for the complete case (m = n) of instantaneous mixtures, we demonstrate the effectiveness of our algorithm for identification of the mixing matrix in the special case considered in Theorem 2. We mixed three images of landscapes (shown in Fig. 1) with a three-dimensional (3-D) Hilbert matrix A and transformed them by a two-dimensional (2-D) discrete Haar wavelet transform. As a result, since this transformation is linear, the high-frequency components of the source signals become very sparse and they satisfy the conditions of Theorem 2. We use only one row (320 points) from the diagonal coefficients of the wavelet-transformed mixture, which is enough to recover very precisely the ill-conditioned mixing matrix A. Fig. 3 shows the recovered mixtures.

B. Underdetermined Case

We consider a mixture of seven artificially created sources (see Fig. 5)—sparsified randomly generated signals with at least 5 zeros in each column—with a randomly generated mixing matrix of dimension 3 × 7. Fig. 4 gives the mixed signals together with a normalized scatter plot of the mixtures—the data lie in 21 = C(7, 2) hyperplanes. Applying the underdetermined matrix recovery algorithm to the mixtures gives the recovered mixing matrix perfectly well, up to permutation and scaling (not shown because of lack of space). Applying the source recovery algorithm, we recover the source signals up to permutation and scaling (see Fig. 5). This figure also shows that the recovery by l1-norm minimization does not perform well, even if the mixing matrix is perfectly known.
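For reference, the l1-norm recovery used for this comparison can be written as a standard linear program (basis pursuit with an equality constraint). The sketch below is not the code from the experiments; it assumes a recent SciPy with the "highs" LP solver.

    import numpy as np
    from scipy.optimize import linprog

    def l1_recover_column(A, x):
        """min ||s||_1 subject to A s = x, via the split s = u - v with u, v >= 0."""
        m, n = A.shape
        c = np.ones(2 * n)              # objective sum(u) + sum(v) equals ||s||_1
        A_eq = np.hstack([A, -A])       # A u - A v = x
        res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
        u, v = res.x[:n], res.x[n:]
        return u - v

An optimal basic solution of this program has at most m nonzero variables, so the recovered column has at most m nonzeros, which matches the remark in the introduction; when the equivalence conditions of [4] fail, it need not coincide with the sparsest solution.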

V. CONCLUSION

We rigorously defined the SCA and BSS problems for sparse signals and presented sufficient conditions for their solution. We developed three algorithms: for identification of the mixing matrix (two types: for the sparse and the very sparse cases) and for source recovery. We presented two experiments: the first one concerns separation of a mixture of images after wavelet sparsification (producing very sparse sources), which performs very well in the complete case. The second one shows



the excellent performance of the other two algorithms in the underdetermined BSS problem, for separation of artificially created signals with a sufficient level of sparseness.

REFERENCES

[1] F. Abrard, Y. Deville, and P. White, "From blind source separation to blind source cancellation in the underdetermined case: A new approach based on time-frequency analysis," in Proc. 3rd Int. Conf. Independent Component Analysis and Signal Separation (ICA'2001), San Diego, CA, Dec. 9–13, 2001, pp. 734–739.
[2] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. New York: Wiley, 2002.
[3] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[4] D. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.
[5] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001.
[6] A. Jourjine, S. Rickard, and O. Yilmaz, "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures," in Proc. 2000 IEEE Conf. Acoustics, Speech, and Signal Processing (ICASSP'00), vol. 5, Istanbul, Turkey, Jun. 2000, pp. 2985–2988.
[7] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Process. Lett., vol. 6, no. 4, pp. 87–90, 1999.
[8] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[9] F. J. Theis, E. W. Lang, and C. G. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomput., vol. 56, pp. 381–398, 2004.
[10] K. Waheed and F. Salem, "Algebraic overcomplete independent component analysis," in Proc. Int. Conf. Independent Component Analysis (ICA'03), Nara, Japan, pp. 1077–1082.
[11] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Comput., vol. 13, no. 4, pp. 863–882, 2001.


Chapter 11<br />

EURASIP JASP, 2007<br />

Paper F.J. Theis, P. Georgiev, and A. Cichocki. Robust sparse component analysis<br />

based on a generalized Hough transform. EURASIP Journal on Applied

Signal Process<strong>in</strong>g, 2007<br />

Reference (Theis et al., 2007a)<br />

Summary <strong>in</strong> section 1.4.1<br />




H<strong>in</strong>dawi Publish<strong>in</strong>g Corporation<br />

EURASIP Journal on Advances <strong>in</strong> Signal Process<strong>in</strong>g<br />

Volume 2007, Article ID 52105, 13 pages<br />

doi:10.1155/2007/52105<br />

Research Article<br />

Robust Sparse <strong>Component</strong> <strong>Analysis</strong> Based on<br />

a Generalized Hough Transform<br />

Fabian J. Theis, 1 Pando Georgiev, 2 and Andrzej Cichocki3, 4<br />

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

2 ECECS Department and Department of Mathematical Sciences, University of C<strong>in</strong>c<strong>in</strong>nati, C<strong>in</strong>c<strong>in</strong>nati, OH 45221, USA<br />

3 BSI RIKEN, Laboratory for Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g, 2-1, Hirosawa, Wako, Saitama 351-0198, Japan<br />

4 Faculty of Electrical Eng<strong>in</strong>eer<strong>in</strong>g, Warsaw University of Technology, Pl. Politechniki 1, 00-661 Warsaw, Poland<br />

Received 21 October 2005; Revised 11 April 2006; Accepted 11 June 2006<br />

Recommended for Publication by Frank Ehlers<br />

An algorithm called Hough SCA is presented for recovering the matrix A in x(t) = As(t), where x(t) is a multivariate observed signal, possibly of lower dimension than the unknown sources s(t). They are assumed to be sparse in the sense that at every

time <strong>in</strong>stant t, s(t) has fewer nonzero elements than the dimension of x(t). The presented algorithm performs a global search for<br />

hyperplane clusters with<strong>in</strong> the mixture space by gather<strong>in</strong>g possible hyperplane parameters with<strong>in</strong> a Hough accumulator tensor.<br />

This renders the algorithm immune to the many local m<strong>in</strong>ima typically exhibited by the correspond<strong>in</strong>g cost function. In contrast<br />

to previous approaches, Hough SCA is l<strong>in</strong>ear <strong>in</strong> the sample number and <strong>in</strong>dependent of the source dimension as well as robust<br />

aga<strong>in</strong>st noise and outliers. Experiments demonstrate the flexibility of the proposed algorithm.<br />

Copyright © 2007 Fabian J. Theis et al. This is an open access article distributed under the Creative Commons Attribution License,<br />

which permits unrestricted use, distribution, and reproduction <strong>in</strong> any medium, provided the orig<strong>in</strong>al work is properly cited.<br />

1. INTRODUCTION<br />

One goal of multichannel signal analysis lies <strong>in</strong> the detection<br />

of underly<strong>in</strong>g sources with<strong>in</strong> some given set of observations.<br />

If both the mixture process and the sources are unknown,<br />

this is denoted as bl<strong>in</strong>d source separation (BSS). BSS<br />

can be applied <strong>in</strong> many different fields such as medical and<br />

biological data analysis, broadcast<strong>in</strong>g systems, and audio and<br />

image process<strong>in</strong>g. In order to decompose the data set, different<br />

assumptions on the sources have to be made. The<br />

most common assumption currently used is statistical <strong>in</strong>dependence<br />

of the sources, which leads to the task of <strong>in</strong>dependent<br />

component analysis (ICA); see, for <strong>in</strong>stance, [1, 2]<br />

and references there<strong>in</strong>. ICA very successfully separates data<br />

<strong>in</strong> the l<strong>in</strong>ear complete case, when as many signals as underly<strong>in</strong>g<br />

sources are observed, and <strong>in</strong> this case the mix<strong>in</strong>g<br />

matrix and the sources are identifiable except for permutation<br />

and scal<strong>in</strong>g [3, 4]. In the overcomplete or underdeterm<strong>in</strong>ed<br />

case, fewer observations than sources are given.<br />

It can be shown that the mix<strong>in</strong>g matrix can still be recovered<br />

[5], but source identifiability does not hold. In order<br />

to approximately detect the sources, additional requirements<br />

have to be made, usually sparsity of the sources [6–<br />

8].<br />

Recently, we have <strong>in</strong>troduced a novel measure for sparsity<br />

and shown [9] that based on sparsity alone, we can still<br />

detect both mix<strong>in</strong>g matrix and sources uniquely except for<br />

trivial <strong>in</strong>determ<strong>in</strong>acies (sparse component analysis (SCA)). In<br />

that paper, we have also proposed an algorithm based on random<br />

sampl<strong>in</strong>g for reconstruct<strong>in</strong>g the mix<strong>in</strong>g matrix and the<br />

sources, but the focus of the paper was on the model, and the<br />

matrix estimation algorithm turned out to be not very robust<br />

aga<strong>in</strong>st noise and outliers, and could therefore not easily<br />

be applied <strong>in</strong> high dimensions due to the <strong>in</strong>volved comb<strong>in</strong>atorial<br />

searches. In the present manuscript, a new algorithm<br />

is proposed for SCA, that is, for decompos<strong>in</strong>g a data<br />

set x(1), . . . , x(T) ∈ R m modeled by an (m × T)-matrix X<br />

l<strong>in</strong>early <strong>in</strong>to X = AS, where the n-dimensional sources S =<br />

(s(1), . . . , s(T)) are assumed to be sparse at every time <strong>in</strong>stant.<br />

If the sources are of sufficiently high sparsity, the mixtures<br />

are clustered along hyperplanes <strong>in</strong> the mixture space.<br />

Based on this condition, the mix<strong>in</strong>g matrix can be reconstructed;<br />

furthermore, this property is robust aga<strong>in</strong>st noise<br />

and outliers, which will be used here. The proposed algorithm<br />

denoted by Hough SCA employs a generalization of the<br />

Hough transform <strong>in</strong> order to detect the hyperplanes <strong>in</strong> the<br />

mixture space, which then leads to matrix and source identification.



The Hough transform [10] is a standard tool <strong>in</strong> image<br />

analysis that allows recognition of global patterns <strong>in</strong> an image<br />

space by recogniz<strong>in</strong>g local patterns, ideally a po<strong>in</strong>t, <strong>in</strong> a transformed<br />

parameter space. It is particularly useful when the<br />

patterns <strong>in</strong> question are sparsely digitized, conta<strong>in</strong> “holes,”<br />

or have been tak<strong>in</strong>g <strong>in</strong> noisy environments. The basic idea<br />

of this technique is to map parameterized objects such as<br />

straight l<strong>in</strong>es, polynomials, or circles to a suitable parameter<br />

space. The ma<strong>in</strong> application of the Hough transform lies<br />

<strong>in</strong> the field of image process<strong>in</strong>g <strong>in</strong> order to f<strong>in</strong>d straight l<strong>in</strong>es,<br />

centers of circles with a fixed radius, parabolas, and so forth<br />

<strong>in</strong> images.<br />
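As a reminder of the classical construction, the sketch below accumulates votes for lines in Hesse normal form, ρ = x cos θ + y sin θ, over a discretized (θ, ρ) grid. This is the textbook two-dimensional transform, not the hyperplane generalization developed later in the paper.

    import numpy as np

    def hough_lines(points, n_theta=180, n_rho=200):
        """Vote for lines rho = x*cos(theta) + y*sin(theta) through the given 2-D points."""
        pts = np.asarray(points, dtype=float)
        rho_max = np.abs(pts).max() * np.sqrt(2) + 1e-9
        thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
        acc = np.zeros((n_theta, n_rho), dtype=int)
        for x, y in pts:
            rho = x * np.cos(thetas) + y * np.sin(thetas)          # one rho per theta
            bins = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
            acc[np.arange(n_theta), bins] += 1
        i, j = np.unravel_index(acc.argmax(), acc.shape)           # strongest line
        return acc, thetas[i], j / (n_rho - 1) * 2 * rho_max - rho_max

Points lying on a common line vote for the same accumulator cell, so a global line shows up as a local maximum in parameter space even when the line is sparsely sampled or noisy.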

The Hough transform has been used <strong>in</strong> a somewhat<br />

ad hoc way <strong>in</strong> the field of <strong>in</strong>dependent component analysis<br />

for identify<strong>in</strong>g two-dimensional sources <strong>in</strong> the mixture<br />

plot <strong>in</strong> the complete [11] and overcomplete [12] cases,<br />

which without additional restrictions can be shown to have<br />

some theoretical issues [13]; moreover, the proposed algorithms<br />

were restricted to two dimensions and did not provide<br />

any reliable source identification method. An application<br />

of a time-frequency Hough transform to direction f<strong>in</strong>d<strong>in</strong>g<br />

with<strong>in</strong> nonstationary signals has been studied <strong>in</strong> [14]; the<br />

idea is based on the Hough transform of the Wigner-Ville<br />

distribution [15], essentially employ<strong>in</strong>g a generalized Hough<br />

transform [16] to f<strong>in</strong>d straight l<strong>in</strong>es <strong>in</strong> the time-frequency<br />

plane. The results <strong>in</strong> [14] aga<strong>in</strong> only concentrate on the twodimensional<br />

mixture case. In the literature, overcomplete<br />

BSS and the correspond<strong>in</strong>g basis estimation problems have<br />

ga<strong>in</strong>ed considerable <strong>in</strong>terest <strong>in</strong> the past decade [8, 17–19],<br />

but the sparse priors are always used <strong>in</strong> connection with the<br />

assumption of <strong>in</strong>dependent sources. This allows for probabilistic<br />

sparsity conditions, but cannot guarantee source<br />

identifiability as <strong>in</strong> our case.<br />

The paper is organized as follows. In Section 2, we introduce the overcomplete SCA model and summarize the known identifiability results and algorithms [9]. The following section then reviews the classical Hough transform in two dimensions and generalizes it in order to detect hyperplanes in any dimension. This method is used in Section 4 to develop an SCA algorithm, which turns out to be highly robust against noise and outliers. We confirm this by experiments in Section 5. Some results of this paper have already been presented at the conference "ESANN 2004" [20].

2. OVERCOMPLETE SCA

We introduce a strict notion of sparsity and present identifiability results when applying this measure to BSS.

A vector v ∈ R^n is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. Note that if v is k-sparse, then it is also k′-sparse for k′ ≤ k. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal x(t), t = 1, . . . , T, into

x(t) = As(t)   (1)

with a real m × n mixing matrix A and n-dimensional k-sparse sources s(t). The samples are gathered into corresponding data matrices X := (x(1), . . . , x(T)) ∈ R^{m×T} and S := (s(1), . . . , s(T)) ∈ R^{n×T}, so the model is X = AS. We speak of complete, overcomplete, or undercomplete k-SCA if m = n, m < n, or m > n, respectively. In the following, we will always assume that the sparsity level equals k = n − m + 1, which means that at any time instant, fewer sources than given observations are active. In the algorithm, we will also consider additive white Gaussian noise; however, the model identification results are presented only in the noiseless case from (1).
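The following minimal sketch (not part of the original paper) generates k-sparse sources and mixes them according to the noiseless model (1); the Laplacian source amplitudes and the uniform mixing coefficients are illustrative assumptions chosen to resemble the later experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_k_sparse_sources(n, T, k):
    """Draw T columns of n-dimensional sources with at least k zero entries each."""
    S = rng.laplace(size=(n, T))            # illustrative amplitude distribution
    for t in range(T):
        zero_idx = rng.choice(n, size=k, replace=False)
        S[zero_idx, t] = 0.0                # enforce k-sparsity of every column
    return S

m, n = 3, 4                                 # overcomplete case: fewer mixtures than sources
k = n - m + 1                               # sparsity level assumed throughout
S = generate_k_sparse_sources(n, T=1000, k=k)
A = rng.uniform(-1, 1, size=(m, n))         # mixing matrix with coefficients in [-1, 1]
X = A @ S                                   # noiseless k-SCA model X = AS from (1)
assert np.all((S == 0).sum(axis=0) >= k)    # every sample is k-sparse
```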

Note that in contrast to the ICA model, the above problem is not translation invariant. However, it is easy to see that if instead of A we choose an affine linear transformation, the translation constant can be determined from X only, as long as the sources are nondeterministic. Put differently, this means that instead of assuming k-sparsity of the sources we could also assume that at any fixed time t, only n − k source components are allowed to vary from a previously fixed constant (which can be different for each source). In the following we will, without loss of generality, assume m ≤ n: the easier undercomplete (overdetermined) case m > n can be reduced to the complete case by projection in the mixture space.

The following theorem shows that essentially the mixing model (1) is unique if fewer sources than mixtures are active, that is, if the sources are (n − m + 1)-sparse.

Theorem 1 (matrix identifiability). Consider the k-SCA problem from (1) for k := n − m + 1 and assume that every m × m submatrix of A is invertible. Furthermore, let S be sufficiently richly represented in the sense that for any index set of n − m + 1 elements I ⊂ {1, . . . , n} there exist at least m samples of S such that each of them has zero elements in places with indexes in I and each m − 1 of them are linearly independent. Then A is uniquely determined by X except for left multiplication with permutation and scaling matrices.

So if AS = ÂŜ, then A = ÂPL with a permutation matrix P and a nonsingular scaling matrix L. This means that we can recover the mixing matrix from the mixtures. The next theorem shows that in this case also the sources can be found uniquely.

Theorem 2 (source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (n − m + 1)-sparse solution, that is, one with at least n − m + 1 zero components. If A fulfills the condition from Theorem 1, then there exists a subset H0 ⊂ H with measure zero with respect to H such that for every x ∈ H \ H0 this system has no other solution with this property.

For proofs of these theorems we refer to [9]. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be recovered uniquely from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both theorems, as well as a possible algorithm, is illustrated in Figure 1.



[Figure 1 — panels: (a) three hyperplanes span{a_i, a_j} for 1 ≤ i < j ≤ 3 in the 3 × 3 case; (b) the hyperplanes from (a) visualized by intersection with the sphere; (c) six hyperplanes span{a_i, a_j} for 1 ≤ i < j ≤ 4 in the 3 × 4 case. Caption: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R^3. Due to the source sparsity, the mixtures are generated by only two matrix columns a_i, a_j, and are hence contained in a union of hyperplanes. Identification of the hyperplanes gives mixing matrix and sources.]

Data: samples x(1), . . . , x(T)
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Cluster the samples x(t) into \binom{n}{m-1} groups such that the span of the elements of each group produces one distinct hyperplane H_i.
Matrix identification.
(2) Cluster the normal vectors of these hyperplanes into the smallest number of groups G_j, j = 1, . . . , n (which gives the number of sources n) such that the normal vectors of the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
(3) Calculate the normal vector â_j of each hyperplane Ĥ_j, j = 1, . . . , n.
(4) The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).
Algorithm 1: SCA matrix identification algorithm.

By assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determines both mixing matrix and sources.

The matrix and source identification algorithms from [9] are recalled in Algorithms 1 and 2. We will present a modification of the matrix identification part; the same source identification algorithm (Algorithm 2) will be used in the experiments. The "difficult" part of the matrix identification algorithm lies in the hyperplane detection; in Algorithm 1, a random sampling and clustering technique is used. Another, more efficient algorithm for finding the hyperplanes containing the data has been developed by Bradley and Mangasarian [21], essentially by extending k-means batch clustering. Their so-called k-plane clustering algorithm, in the special case of hyperplanes containing 0, is shown in Algorithm 3.

Data: samples x(1), . . . , x(T) and estimated mixing matrix Â
Result: estimated sources ŝ(1), . . . , ŝ(T)
(1) Identify the set H of hyperplanes produced by taking the linear hull of every subset of m − 1 columns of Â.
for t ← 1, . . . , T do
(2) Identify the hyperplane H ∈ H containing x(t) or, in the presence of noise, the one to which the distance from x(t) is minimal, and project x(t) onto H to obtain x̂.
(3) If H is produced by the linear hull of the column vectors â_{i(1)}, . . . , â_{i(m−1)}, find coefficients λ_{i(j)} such that x̂ = Σ_{j=1}^{m−1} λ_{i(j)} â_{i(j)}.
(4) Construct the solution ŝ(t): it contains λ_{i(j)} at index i(j) for j = 1, . . . , m − 1; the other components are zero.
end
Algorithm 2: SCA source identification algorithm.
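A minimal numpy sketch of this source identification step is given below; it is an illustrative reimplementation under the stated conventions, not the authors' code, and assumes the mixtures X and an estimate Â of the mixing matrix with hyperplanes through the origin.

```python
import numpy as np
from itertools import combinations

def sca_source_recovery(X, A_hat):
    """Project each sample onto the closest hyperplane spanned by m-1 columns of A_hat,
    solve for the corresponding m-1 coefficients, and set all other components to zero."""
    m, n = A_hat.shape
    T = X.shape[1]
    S_hat = np.zeros((n, T))
    subsets = [list(c) for c in combinations(range(n), m - 1)]
    # orthonormal basis of every hyperplane span{a_i : i in subset}
    bases = [np.linalg.qr(A_hat[:, cols])[0] for cols in subsets]
    for t in range(T):
        x = X[:, t]
        # distance to each hyperplane = norm of the residual after projection
        residuals = [np.linalg.norm(x - Q @ (Q.T @ x)) for Q in bases]
        j = int(np.argmin(residuals))
        cols, Q = subsets[j], bases[j]
        x_proj = Q @ (Q.T @ x)                              # projection onto the chosen hyperplane
        coeffs, *_ = np.linalg.lstsq(A_hat[:, cols], x_proj, rcond=None)
        S_hat[cols, t] = coeffs
    return S_hat
```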

Finite termination of the k-plane clustering algorithm is proven in [21, Theorem 3.7]. We will later compare the proposed Hough algorithm with the k-hyperplane algorithm. The k-hyperplane algorithm has also been extended to a more general, orthogonal k-subspace clustering method [22, 23], thus allowing a search not only for hyperplanes but also for lower-dimensional subspaces.

3. HOUGH TRANSFORM

The Hough transform is a classical method for locating shapes in images, widely used in the field of image processing; see [10, 24]. It is robust to noise and occlusions and is used for extracting lines, circles, or other shapes from images. In addition to these nonlinear extensions, it can also be made more robust to noise using antialiasing techniques.



Data: samples x(1), . . . , x(T)
Result: estimated k hyperplanes H_i given by their normal vectors u_i
(1) Initialize u_i randomly with |u_i| = 1 for i = 1, . . . , k.
do
Cluster assignment.
for t ← 1, . . . , T do
(2) Add x(t) to cluster Y^(i), where i is chosen to minimize |u_i^⊤ x(t)| (the distance to hyperplane H_i).
end
(3) Exit if the mean distance to the hyperplanes is smaller than some preset value.
Cluster update.
for i ← 1, . . . , k do
(4) Calculate the i-th cluster correlation C := Y^(i) Y^(i)⊤.
(5) Choose an eigenvector v of C corresponding to a minimal eigenvalue.
(6) Set u_i ← v/|v|.
end
end
Algorithm 3: k-hyperplane clustering algorithm.
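The following short numpy sketch implements this alternating scheme for hyperplanes through 0; it is an illustration of the technique described above (random initialization, tolerance, and empty-cluster handling are assumptions, not details from [21]).

```python
import numpy as np

def k_hyperplane_clustering(X, k, n_iter=100, tol=1e-6, seed=0):
    """Alternate between assigning samples to the closest hyperplane and re-estimating
    each normal as the eigenvector of the cluster correlation with smallest eigenvalue."""
    rng = np.random.default_rng(seed)
    m, T = X.shape
    U = rng.normal(size=(k, m))
    U /= np.linalg.norm(U, axis=1, keepdims=True)        # random unit normals
    for _ in range(n_iter):
        dist = np.abs(U @ X)                             # |u_i^T x(t)|: distance to hyperplane i
        labels = np.argmin(dist, axis=0)                 # cluster assignment
        if dist[labels, np.arange(T)].mean() < tol:
            break
        for i in range(k):                               # cluster update
            Y = X[:, labels == i]
            if Y.shape[1] == 0:
                continue                                 # keep the old normal for empty clusters
            C = Y @ Y.T                                  # cluster correlation matrix
            eigval, eigvec = np.linalg.eigh(C)           # eigenvalues in ascending order
            U[i] = eigvec[:, 0]                          # eigenvector of the minimal eigenvalue
    return U
```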

3.1. Definition

Its main idea can be described as follows: consider a parameterized object

M_a := {x ∈ R^n | f(x, a) = 0}   (2)

for a fixed parameter set a ∈ U ⊂ R^p; here U ⊂ R^p is the parameter space, and the parameter function f : R^n × U → R^m is a set of m equations describing our types of objects (manifolds) M_a for different parameters a. We assume that the equations given by f are separating in the sense that if M_a ⊂ M_{a′}, then already a = a′. A simple example is the set of unit circles in R^2; then f(x, a) = |x − a| − 1. For a given a ∈ R^2, M_a is the circle of radius 1 centered at a, and f is obviously separating. Other object manifolds will be discussed later. A nonseparating object function is, for example, f(x, a) := 1 − 1_{[0,a]}(x) for (x, a) ∈ R × [0, ∞), where the characteristic function 1_{[0,a]}(x) equals 1 if and only if x ∈ [0, a] and 0 otherwise. Then M_1 = [0, 1] ⊂ [0, 2] = M_2, but the parameters are different.

Given a separating parameter function f(x, a), its Hough transform is defined as

η[f] : R^n → P(U),  x ↦ {a ∈ U | f(x, a) = 0},   (3)

where P(U) denotes the set of all subsets of U. So η[f] maps a point x onto the set of all parameters describing objects containing x. But an object M_a as a set is mapped onto a single point {a}, that is,

⋂_{x ∈ M_a} η[f](x) = {a}.   (4)

This follows because if a′ ∈ ⋂_{x ∈ M_a} η[f](x), then f(x, a′) = 0 for all x ∈ M_a, which means that M_a ⊂ M_{a′}; the parameter function f is assumed to be separating, so a = a′. Hence, objects M_a in a data set X = {x(1), . . . , x(T)} can be detected by analyzing clusters in η[f](X).

We will illustrate this concept for line detection in the following section before applying it to the hyperplane identification needed for our SCA problem.

3.2. Classical Hough transform

The (classical) Hough transform detects lines in a given two-dimensional data space as follows: an affine, nonvertical line in R^2 can be described by the equation x_2 = a_1 x_1 + a_2 for fixed a = (a_1, a_2) ∈ R^2. If we define

f_L(x, a) := a_1 x_1 + a_2 − x_2,   (5)

then the above line equals the set M_a from (2) for the unique parameter a, and f_L is clearly separating. Figures 2(a) and 2(b) illustrate this idea.

In practice, polar coordinates are used to describe the line in Hessian normal form; this also allows detecting vertical lines (θ = π/2) in the data set and, moreover, guarantees an isotropic error in contrast to the parametrization (5). This leads to a parameter function

f_P(x, θ, ρ) = x_1 cos(θ) + x_2 sin(θ) − ρ = 0   (6)

for parameters (θ, ρ) ∈ U := [0, π) × R. Then points in data space are mapped to sine curves given by f_P; see Figure 2(c).
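As an illustration of the polar parametrization (6), the following toy sketch (an assumption-laden reimplementation, not code from the paper) lets every data point vote over a discretized (θ, ρ) grid and returns the strongest line; the grid sizes and the bound on ρ are arbitrary choices.

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Classical Hough transform: each point (x1, x2) votes for all (theta, rho)
    with rho = x1*cos(theta) + x2*sin(theta); maxima of the accumulator are lines."""
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    rho_max = max(np.abs(points).sum(axis=1).max(), 1e-9)     # crude bound on |rho|
    acc = np.zeros((n_theta, n_rho))
    for x1, x2 in points:                                     # points: array of shape (N, 2)
        rhos = x1 * np.cos(thetas) + x2 * np.sin(thetas)
        bins = np.clip(((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int),
                       0, n_rho - 1)
        acc[np.arange(n_theta), bins] += 1                    # one vote per theta bin
    i, j = np.unravel_index(acc.argmax(), acc.shape)          # strongest line
    return thetas[i], j / (n_rho - 1) * 2 * rho_max - rho_max, acc
```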

3.3. Generalization

The mixing matrix A in the case of (n − m + 1)-sparse SCA can be recovered by finding all 1-codimensional linear subspaces in the mixture data set. The algorithm presented here uses a generalized version of the Hough transform in order to determine hyperplanes through 0 as follows.

Vectors x ∈ R^m lying on such a hyperplane H can be described by the equation

f_h(x, n) := n^⊤ x = 0,   (7)

where n is a nonzero vector orthogonal to H. After normalization |n| = 1, the normal vector n is uniquely determined by H if we additionally require n to lie on one hemisphere of the unit sphere S^{m−1} := {x ∈ R^m | |x| = 1}. This means that the parametrization f_h is separating. In terms of spherical coordinates on S^{m−1}, n can be expressed as

n = ( cos ϕ sin θ_1 sin θ_2 · · · sin θ_{m−2},
      sin ϕ sin θ_1 sin θ_2 · · · sin θ_{m−2},
      cos θ_1 sin θ_2 · · · sin θ_{m−2},
      . . . ,
      cos θ_{m−2} )^⊤   (8)

with (ϕ, θ_1, . . . , θ_{m−2}) ∈ [0, 2π) × [0, π)^{m−2}; uniqueness of n can be achieved by requiring ϕ ∈ [0, π). Plugging n in spherical coordinates into (7) gives

cot θ_{m−2} = − Σ_{i=1}^{m−1} ν_i(ϕ, θ_1, . . . , θ_{m−3}) x_i / x_m   (9)



for x ∈ R^m with x_m ≠ 0 and

ν_i := cos ϕ ∏_{j=1}^{m−3} sin θ_j   if i = 1,
       sin ϕ ∏_{j=1}^{m−3} sin θ_j   if i = 2,
       cos θ_{i−2} ∏_{j=i−1}^{m−3} sin θ_j   if i > 2.   (10)

With cot(θ + π/2) = − tan(θ) we finally get θ_{m−2} = arctan( Σ_{i=1}^{m−1} ν_i x_i / x_m ) + π/2. Note that continuity is achieved if we set θ_{m−2} := 0 for x_m = 0.
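The following short numpy sketch (an illustration under the index conventions above, not code from the paper) evaluates the coefficients ν_i from (10) and the resulting angle θ_{m−2}; for m = 3 it reduces to θ_1 = arctan((x_1 cos ϕ + x_2 sin ϕ)/x_3) + π/2, the curve shown in Figure 3.

```python
import numpy as np

def nu(phi, thetas):
    """Coefficients nu_i(phi, theta_1, ..., theta_{m-3}) from (10), for i = 1, ..., m-1."""
    m = len(thetas) + 3                       # thetas = (theta_1, ..., theta_{m-3})
    v = np.empty(m - 1)
    v[0] = np.cos(phi) * np.prod(np.sin(thetas))
    v[1] = np.sin(phi) * np.prod(np.sin(thetas))
    for i in range(3, m):                     # the i > 2 branch of (10)
        v[i - 1] = np.cos(thetas[i - 3]) * np.prod(np.sin(thetas[i - 2:]))
    return v

def theta_last(x, phi, thetas):
    """Angle theta_{m-2} of the hyperplanes through x with the remaining angles fixed."""
    if x[-1] == 0:
        return 0.0                            # continuity convention for x_m = 0
    return np.arctan(np.dot(nu(phi, thetas), x[:-1]) / x[-1]) + np.pi / 2
```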

We can then define the generalized "hyperplane detecting" Hough transform as

η[f_h] : R^m → P([0, π)^{m−1}),
x ↦ { (ϕ, θ_1, . . . , θ_{m−2}) ∈ [0, π)^{m−1} | θ_{m−2} = arctan( Σ_{i=1}^{m−1} ν_i x_i / x_m ) + π/2 }.   (11)

The parametrization f_h is separating, so points lying on the same hyperplane are mapped to surfaces that intersect in precisely one point in [0, π)^{m−1}. This is demonstrated for the case m = 3 in Figure 3. The hyperplane structure of a data set X = {x(1), . . . , x(T)} can hence be analyzed by finding clusters in η[f_h](X).

Let RP^{m−1} denote the (m − 1)-dimensional real projective space, that is, the manifold of all 1-dimensional subspaces of R^m. There is a canonical diffeomorphism between RP^{m−1} and the Grassmannian manifold of all (m − 1)-dimensional subspaces of R^m, induced by the scalar product. Using this diffeomorphism, we can reformulate our aim of identifying hyperplanes as finding elements of RP^{m−1}. So the Hough transform η[f_h] maps x onto a subset of RP^{m−1}, which is topologically equivalent to the upper hemisphere in R^m with identifications along the boundary. In fact, in (11) we have simply constructed a coordinate map of RP^{m−1} using spherical coordinates.

4. HOUGH SCA ALGORITHM

The SCA matrix detection algorithm (Algorithm 1) consists of two steps. In the first step, d := \binom{n}{m-1} hyperplanes given by their normal vectors n^(1), . . . , n^(d) are constructed such that the mixture data lies in the union of these hyperplanes; in the case of noise this will hold only approximately. In the second step, mixture matrix columns a_i are identified as generators of the n lines lying at the intersections of \binom{n-1}{m-2} hyperplanes. We replace the first step by the following Hough SCA algorithm.

[Figure 2 — panels: (a) Data space; (b) Linear Hough space; (c) Polar Hough space. Caption: Illustration of the "classical" Hough transform: a point (x_1, x_2) in the data space (a) is mapped (b) onto the line {(a_1, a_2) | a_2 = −a_1 x_1 + x_2} in the linear parameter space R^2 or (c) onto a translated sine curve {(θ, ρ) | ρ = x_1 cos θ + x_2 sin θ} in the polar parameter space [0, π) × R_0^+. The Hough curves of points belonging to one line in data space intersect in precisely one point a in the Hough space, and the data points lie on the line given by the parameter a.]

4.1. Definition

The idea is to first gather the Hough curves η[f_h](x(t)) corresponding to the samples x(t) in a discretized parameter space, in this context often called the Hough accumulator. Plotting these curves in the accumulator is sometimes denoted as voting for each bin, similar to histogram generation. According to the previous section, all points x from some hyperplane H given by a normal vector with angles (ϕ, θ) are mapped onto a parameterized object that contains (ϕ, θ), for all possible x ∈ H. Hence, the corresponding angle bin will contain votes from all samples x(t) lying in H, whereas other bins receive far fewer votes. Therefore, maxima analysis of the accumulator gives the hyperplanes in the parameter space. This idea corresponds to clustering, for all t, all possible normal vectors of planes through x(t) on RP^{m−1}. The resulting Hough SCA algorithm is described in Algorithm 4. We see that only the hyperplane identification step differs from Algorithm 1; the matrix identification is the same.

[Figure 3 — panels: (a) Data space; (b) Spherical Hough space. Caption: Illustration of the "hyperplane detecting" Hough transform in three dimensions: a point (x_1, x_2, x_3) in the data space (a) is mapped onto the curve {(ϕ, θ) | θ = arctan((x_1 cos ϕ + x_2 sin ϕ)/x_3) + π/2} in the parameter space [0, π)^2 (b). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) in the Hough space, and the points lie on the plane given by the normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).]

The number β of bins is also called the grid resolution. As in histogram-based density estimation, the choice of β can seriously affect the algorithm performance: if chosen too small, possible maxima cannot be resolved, and if chosen too large, the sensitivity of the algorithm increases and the computational burden in terms of speed and memory grows considerably; see the next section. Note that Hough SCA performs a global search; hence it is expected to be much slower than local update algorithms such as Algorithm 3, but also much more robust. In the following, its properties will be discussed; applications are given in the example in Section 5.

Data: samples x(1), . . . , x(T) of the random vector X
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Fix the number β of bins (can be separate for each angle).
(2) Initialize the β × · · · × β (m − 1 factors) array α ∈ R^{β^{m−1}} with zeros (accumulator).
for t ← 1, . . . , T do
for ϕ, θ_1, . . . , θ_{m−3} ← 0, π/β, . . . , (β − 1)π/β do
(3) θ_{m−2} ← arctan( Σ_{i=1}^{m−1} ν_i(ϕ, . . . , θ_{m−3}) x_i(t)/x_m(t) ) + π/2
(4) Increase (vote for) the accumulator value of α in the bin corresponding to (ϕ, θ_1, . . . , θ_{m−2}) by one.
end
end
(5) The d := \binom{n}{m-1} largest local maxima of α correspond to the d hyperplanes present in the data set.
(6) Back transformation as in (8) gives the corresponding normal vectors n^(1), . . . , n^(d) of these hyperplanes.
Matrix identification.
(7) Clustering of the hyperplanes generated by (m − 1)-tuples in {n^(1), . . . , n^(d)} gives n separate hyperplanes.
(8) Their normal vectors are the n columns of the estimated mixing matrix Â.
Algorithm 4: Hough SCA algorithm for mixing matrix identification.

4.2. Complexity

We will only discuss the complexity of the hyperplane estimation, because the matrix identification is performed on a data set of size d, which is typically much smaller than the sample size T.

The angle θ_{m−2} has to be calculated Tβ^{m−2} times. Since only discrete values of the angles are of interest, the trigonometric functions as well as the ν_i can be precalculated and stored in exchange for speed. Then each calculation of θ_{m−2} involves 2m − 1 operations (sums and products/divisions). The voting (without taking "lookup" costs in the accumulator into account) costs an additional operation. Altogether, the accumulator can be filled with 2Tβ^{m−2}m operations. This means that the algorithm depends linearly on the sample size and is polynomial in the grid resolution and exponential in the mixture dimension. The maxima search involves O(β^{m−1}) operations, which for small to medium dimensions can be ignored in comparison to the accumulator generation because usually β ≪ T.

So the main part of the algorithm does not depend on the source dimension n but only on the mixture dimension m. This means for applications that n can be quite large, but hyperplanes will still be found if the grid resolution is high enough. Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n. The memory requirement of the algorithm is dominated by the accumulator size, which is β^{m−1}. This can limit the grid resolution.
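The following unoptimized numpy sketch illustrates the hyperplane identification step (1)-(4) of Algorithm 4; it is not the authors' implementation, it assumes m ≥ 3, and it reuses the hypothetical helper theta_last() from the sketch in Section 3.3. The maxima search and back transformation of steps (5)-(6) are omitted.

```python
import numpy as np
from itertools import product

def hough_sca_accumulator(X, beta):
    """Fill the (m-1)-dimensional Hough accumulator: each sample votes, for every grid
    point of the free angles, for the bin of the remaining angle theta_{m-2} from (9)-(11)."""
    m, T = X.shape
    grid = np.arange(beta) * np.pi / beta                   # angle values in [0, pi)
    acc = np.zeros((beta,) * (m - 1), dtype=int)
    for t in range(T):
        x = X[:, t]
        for angles in product(range(beta), repeat=m - 2):   # indices of phi, theta_1, ..., theta_{m-3}
            phi = grid[angles[0]]
            thetas = grid[list(angles[1:])]
            theta = theta_last(x, phi, thetas) % np.pi       # theta_{m-2} for this sample
            b = min(int(theta / np.pi * beta), beta - 1)     # its bin index
            acc[angles + (b,)] += 1                          # vote
    return acc
```

Note that this loops over Tβ^{m−2} angle combinations, matching the complexity estimate above; precomputing the trigonometric terms, as suggested in the text, would speed this up considerably.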

4.3. Resolution error

The choice of the grid resolution β in the algorithm induces a systematic resolution error in the estimation of A (as a tradeoff for robustness and speed). This error is calculated in this section.

Let A be the unknown mixing matrix and Â its estimate, constructed by the Hough SCA algorithm (Algorithm 4) with grid resolution β. Let n^(1), . . . , n^(d) be the normal vectors of the hyperplanes generated by (m − 1)-tuples of columns of A, and let n̂^(1), . . . , n̂^(d) be their corresponding estimates. Ignoring permutations, it is sufficient to describe only how n̂^(i) differs from n^(i).

Assume that the maxima of the accumulator are correctly estimated, but that, due to the discrete grid resolution, an average error of π/2β is made when estimating the precise maximum position, because the size of one bin is π/β. How is this error propagated into n̂^(i)? By assumption, each estimate ϕ̂, θ̂_1, . . . , θ̂_{m−2} differs from ϕ, θ_1, . . . , θ_{m−2} by at most π/2β. As we are only interested in an upper bound, we simply calculate the deviation of each component of n̂^(i) from n^(i). Using the fact that sine and cosine are bounded by one, (8) then gives the estimates |n̂^(i)_j − n^(i)_j| ≤ (m − 1)π/(2β) for each coordinate j, so altogether

‖n̂^(i) − n^(i)‖ ≤ (m − 1)√m π / (2β).   (12)

This estimate may be improved by using the Jacobian of the spherical coordinate transformation and its determinant, but for our purpose this bound is sufficient. In summary, we have shown that the grid resolution contributes a β^{−1}-perturbation to the estimation of A.
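For a concrete feeling of the size of this bound (a worked example, not stated in the paper), take the values used in the later experiments, m = 3 and β = 360:

\[
\|\hat n^{(i)} - n^{(i)}\| \;\le\; \frac{(m-1)\sqrt{m}\,\pi}{2\beta} \;=\; \frac{2\sqrt{3}\,\pi}{720} \;\approx\; 0.015,
\]

that is, the discretization alone already perturbs the unit normal vectors by roughly 1.5%.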

4.4. Robustness

Robustness with regard to additive noise as well as outliers is important for any algorithm to be used in the real world. Here an outlier is roughly defined to be a sample far away from other observations, and indeed some researchers define outliers to be samples further away from the mean than, say, 5 standard deviations. However, such definitions necessarily depend on the underlying random variable to be estimated, so most books only give examples of outliers, and indeed no consistent, context-free, precise definition of outliers exists [25]. In the following, given samples of a fixed random variable of interest, we denote a sample as an outlier if it is drawn from another, sufficiently different distribution.

Fitting only one hyperplane to the data set can be achieved by linear regression, namely by minimizing the squared distance to such a possible hyperplane. These least squares fitting algorithms are well known to be sensitive to outliers, and various extensions of the LS method such as least median of squares and reweighted least squares [26] have been developed to overcome this problem. The breakdown point of the latter is 0.5, which means that the fit parameters are only stably estimated for data sets with less than 50% outliers. The other techniques typically have much lower breakdown points, usually below 0.3. The classical Hough transform, albeit not a regression method, is comparable in terms of breakdown with robust fitting algorithms such as the reweighted least squares algorithm [27]. In the experiments we will observe similar results for the generalized method presented above. Namely, we achieve breakdown levels of up to 0.8 in the low-noise case, which considerably decrease with increasing noise.

From a mathematical point of view, the "classical" Hough transform has been studied quite extensively, both as an estimator (and extension of linear regression) and with regard to algorithmic and implementational aspects; see, for example, [28] and references therein. Most of the theoretical results presented for the two-dimensional case could be extended to the more general objective presented here, but this is not within the scope of this manuscript. Simulations giving experimental evidence that the robustness also holds in our case are shown in Section 5.

4.5. Extensions

The following possible extensions to the Hough SCA algorithm can be employed to increase its performance.

If the noise level is known, smoothing of the accumulator (antialiasing) will help to give more robust results in terms of noise. For smoothing (usually with a Gaussian), the smoothing radius must be set according to the noise level. If the noise level is not known, smoothing can still be applied by gradually increasing the radius until the number of clearly detectable maxima equals d.

Furthermore, an additional fine-tuning step is possible: the estimated plane normals are slightly deteriorated by the systematic resolution error shown previously. However, after application of Hough SCA, the data space can be clustered into data points lying close to the corresponding hyperplanes. Within each cluster, linear regression (or some more robust version of it; see Section 4.4) can then be applied to improve the hyperplane estimate; this is actually the idea used locally in the k-hyperplane clustering algorithm (Algorithm 3). Such a method requires additional computational power, but makes the algorithm less dependent on the grid resolution, which is then only needed for the hyperplane clustering step. However, it is expected that this additional fine-tuning step may decrease robustness, especially against biased noise and outliers.

5. SIMULATIONS

We give a simulation example as well as batch runs to analyze the performance of the proposed algorithm.



[Figure 4 — panels: (a) Source signals; (b) Mixture signals; (c) Normalized mixture scatter plot; (d) Hough accumulator with labeled maxima. Caption: Example: (a) shows the 2-sparse, sufficiently richly represented, 4-dimensional source signals, and (b) the randomly mixed, 3-dimensional mixtures. The normalized mixture scatter plot {x(t)/|x(t)| | t = 1, . . . , T} is given in (c), and the generated Hough accumulator in (d); note that the color scale in (d) was chosen to be nonlinear (γ_new := (1 − γ/max)^10) in order to visualize structure in addition to the strong maxima.]

5.1. Explicit example

In the first experiment, we consider the case of source dimension n = 4 and mixture dimension m = 3. The 4-dimensional sources have been generated from i.i.d. samples (two Laplacian and two Gaussian sequences), followed by setting some entries to zero in order to fulfill the sparsity constraints; see Figure 4(a). They are 2-sparse and consist of 1000 samples. Obviously, all combinations (i, j), i < j, of active sources are present in the data set; this condition is needed by the matrix recovery step. The sources were mixed using a mixing matrix with randomly (uniformly in [−1, 1]) chosen coefficients to give the mixtures shown in Figure 4(b). The mixture density clearly lies in 6 disjoint hyperplanes, spanned by pairs (a_i, a_j), i < j, of mixing matrix columns, as indicated by the normalized scatter plot in Figure 4(c), similar to the illustration from Figure 1(c).

In order to detect the planes in the data space, we apply the generalized Hough transform as explained in Section 3.3. Figure 4(d) shows the Hough image with β = 360. Each sample results in a curve, and clearly 6 intersection points are visible, which correspond to the 6 hyperplanes in question. Maxima analysis retrieves these points (in Hough space) as shown in the same figure. After transforming these points back into R^3 with the inverse Hough transform, we get 6 normalized vectors corresponding to the 6 planes. Considering intersections of the hyperplanes, we notice that only 4 intersection lines are contained in precisely 3 of the planes, and these 4 lines are spanned by the matrix columns a_i. For practical reasons, we recover these combinatorially from the plane normal vectors; see Algorithm 4. The deviation of the recovered mixing matrix Â from the original mixing matrix A in the overcomplete case can be measured by the generalized crosstalking error [8], defined as E(A, Â) := min_{M∈Π} ‖A − ÂM‖, where the minimum is taken over the group Π of all invertible real




n × n matrices in which only one entry per column differs from 0; ‖·‖ denotes a fixed matrix norm. In our case the generalized crosstalking error is very low with E(A, Â) = 0.040. This essentially means that the two matrices, after permutation, differ only by 0.04 with respect to the chosen matrix norm, in our case the (squared) Frobenius norm. Then, the sources are recovered using the source recovery algorithm (Algorithm 2) with the approximated mixing matrix Â. The normalized signal-to-noise ratios (SNRs) of the recovered sources with respect to the original ones are high at 36, 38, 36, and 37 dB, respectively.
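A brute-force numpy sketch of this error measure is given below; it is an illustration assuming the Frobenius norm and nonzero columns of Â, and the factorial search over column permutations is only feasible for small n (it is not the authors' implementation).

```python
import numpy as np
from itertools import permutations

def crosstalking_error(A, A_hat):
    """E(A, A_hat) = min over scaled permutations M of ||A - A_hat M||_F:
    for every column permutation, the optimal per-column scale is a 1D least squares fit."""
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        B = A_hat[:, list(perm)]                              # permuted estimate
        scales = np.array([(B[:, j] @ A[:, j]) / (B[:, j] @ B[:, j]) for j in range(n)])
        best = min(best, np.linalg.norm(A - B * scales))      # Frobenius norm of the difference
    return best
```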

As a modification of the previous example, we now also consider additive noise. We use the sources S (which have unit covariance) and the mixing matrix A from above, but add 1% random white noise to the mixtures, X = AS + 0.01N, where N is a normal random vector. This corresponds to a still high mean SNR of 38 dB. When considering the normalized scatter plot, again the 6 planes are visible, but the additive noise deteriorates the clear separation of the planes. We apply the generalized Hough transform to the mixture data; however, because of the noise we choose a coarser discretization (β = 180 bins). Curves in Hough space corresponding to a single plane no longer intersect in precisely one point due to the noise; a low-resolution Hough space, however, fuses these intersections into one point, so that our simple maxima detection still achieves good results. We recover the mixing matrix as above and get a low generalized crosstalking error of E(A, Â) = 0.12. The sources are recovered well with mean SNRs of 20 dB, which is quite satisfactory considering the noisy, overcomplete mixture situation.

The following example demonstrates the good performance in higher source dimensions. Consider 6-dimensional 2-sparse sources that are mixed again by a matrix A with coefficients drawn uniformly from [−1, 1]. Application of the generalized Hough transform to the mixtures retrieves the plane normal vectors. The recovered mixing matrix has a low generalized crosstalking error of E(A, Â) = 0.047. However, if the noise level increases, the performance drops considerably, because many maxima, in this case 15, have to be located in the accumulator. After recovering the sources with this approximated matrix Â, we get SNRs of only 11, 8, 6, 10, 12, and 11 dB. The rather high source recovery error is most probably due to the sensitivity of the source recovery to slight perturbations in the approximated mixing matrix.

5.2. Outliers

We will now perform experiments systematically analyzing the robustness of the proposed algorithm with respect to outliers in the sense of model-violating samples.

In the first explicit example we consider the sources from Figure 4(a), but 80% of the samples have been replaced by outliers (drawn from a 4-dimensional normal distribution). Due to the high percentage of outliers, the mixtures, mixed by the same random 3 × 4 matrix A as before, do not obviously exhibit any clear hyperplane structure. As discussed in Section 4.4, the Hough SCA algorithm is very robust against outliers. Indeed, in addition to a noisy background within the Hough accumulator, the intersection maxima are still noticeable, and local maxima detection finds the correct hyperplanes (cf. Figure 4(d)), although 80% of the data is corrupted. The recovered mixing matrix has an excellent generalized crosstalking error of E(A, Â) = 0.040. Of course, the sparse source recovery from above cannot recover the outlying samples. Applying the corresponding algorithms, we get SNRs of around 4 dB between the corrupted sources and the recovered ones; source recovery with the pseudoinverse of Â, corresponding to maximum-likelihood recovery with a Gaussian prior, gives somewhat better SNRs of around 6 dB. But the sparse recovery method has the advantage that it can detect outliers by measuring the distance from the hyperplanes, so outlier rejection is possible. Note that we get similar results when the outliers are not added in the source space but only in the mixture space, that is, only after the mixing process.

We now perform a numerical comparison of the number of outliers versus the algorithm performance for varying noise levels; see Figure 5. The rationale behind this is that already small noise levels in addition to the outliers might be enough to destroy maxima in the accumulator, thus deteriorating the SCA performance. The same (uncorrupted) sources and mixing matrix from above are used. Numerically, we get breakdown points of 0.8 for the no-noise case, and values of 0.5, 0.3, and 0.1 with increasing noise levels of 0.1% (58 dB), 0.5% (44 dB), and 1% (38 dB). Better performance at higher noise levels could be achieved by applying antialiasing techniques before maxima detection, as described in Section 4.5.

5.3. Grid resolution

In this section we present numerical examples to confirm the linear dependence of the algorithm performance on the inverse grid resolution β^{−1}. We consider 4-dimensional sources S with 1000 samples, in which for each sample two source components were drawn from a distribution uniform on [−1, 1] and the other two were set to zero, so S is 2-sparse. For each grid resolution β we perform 50 runs, and in each run a new set of sources is generated as above. These are then mixed using a 3 × 4 mixing matrix A with random coefficients drawn uniformly from [−1, 1]. Application of the Hough SCA algorithm gives an estimated matrix Â. In Figure 6 we plot the mean generalized crosstalking error E(A, Â) for each grid resolution. With increasing β the accuracy increases; a logarithmic plot indeed confirms the linear dependence on β^{−1}, as stated in Section 4.3. Furthermore, we see that, for example, for β = 360, among all S and A as above we get a mean crosstalking error of 0.23 ± 0.5.

5.4. Batch runs and comparison with hyperplane k-means

In the last example, we consider the case of m = n = 4 and compare the proposed algorithm (now with a three-dimensional accumulator) with the k-hyperplane clustering algorithm (Algorithm 3).



[Figure 5 — panels: (a) Noiseless breakdown analysis with respect to outliers; (b) Breakdown analysis for varying noise level (noise = 0%, 0.1%, 0.5%, 1%); both plot the crosstalking error against the percentage of outliers. Caption: Performance of Hough SCA with an increasing number of outliers. Plotted is the percentage of outliers in the source data versus the matrix recovery performance (measured by the generalized crosstalking error). For each 1%-step one calculation was performed; in (b) the plots have been smoothed by taking the average over ten 1%-steps. In the no-noise case 360 bins were used, 180 bins in all other cases.]

[Figure 6 — panels: (a) Mean performance versus grid resolution; (b) Fit of the logarithmic mean performance (ln E with a least squares line fit against the grid resolution). Caption: Dependence of the Hough SCA performance (a) on the grid resolution β; the mean has been taken over 50 runs. With a logarithmic y-axis (b), a least squares line fit confirms the linear dependence of performance and β^{−1}.]

For this, random 1-sparse sources S with T = 10^5 samples are generated by drawing uniformly from [−1, 1] and randomly setting a single coordinate of each sample to zero. In 100 batch runs, a random 4 × 4 mixing matrix A with coefficients uniformly drawn from [−1, 1], but with columns normalized to 1, is constructed. The resulting mixtures X := AS are then separated both by the proposed Hough SCA algorithm and by the Bradley-Mangasarian k-hyperplane clustering algorithm (with 100 iterations and without restarts). The resulting median crosstalking error E(A, Â) of the Hough algorithm is 3.3 ± 2.3 and hence considerably lower than the k-hyperplane clustering result of 5.5 ± 1.9. This confirms the well-known fact that k-means and its extensions exhibit local convergence only and are therefore susceptible to local minima, as seems to be the case in our example. A possible solution would be to use many restarts, but global convergence cannot be guaranteed. For practical applications, we therefore suggest using a rather rough (low grid resolution β) global search by Hough SCA followed by a finer local search using k-hyperplane clustering; see Section 4.5.



[Figure 7 — panels: (a) Source signals; (b) Hough accumulator with three labeled maxima; (c) Recovered sources; (d) Recovered sources after outlier removal. Caption: Application to speech signals: (a) shows the original speech sources ("peace and love," "hello, how are you," and "to be or not to be"), and (b) the Hough accumulator when trained on mixtures of (a) with 20% outliers. A nonlinear gray scale γ_new := (1 − γ/max)^10 was chosen for better visualization. (c) and (d) present the recovered sources, without and with outlier removal. They coincide with (a) up to permutation (reversed order) and scaling.]

5.5. Application to the separation of speech signals

In order to illustrate that the SCA assumptions are also valid for real data sets, we briefly present an application to audio source separation, namely the instantaneous, robust BSS of speech signals, a problem of importance in the field of audio signal processing. In the next section, we then refer to other works applying the model to biomedical data sets.

We consider three speech signals S of length 2.2 s, sampled at 22000 Hz; see Figure 7(a). They are spoken by the same person, but may still be assumed to be independent. The signals are mixed by a randomly chosen mixing matrix A (coefficients uniform in [−1, 1]) to yield mixtures X = AS, but 20% outliers are introduced by replacing 20% of the samples of X by i.i.d. Gaussian samples. Without the outliers, more classical BSS algorithms such as ICA would have been able to separate the mixtures perfectly; in this noisy setting, however, ICA performs very poorly: application of the popular fastICA algorithm [29] yields only a poor estimate Â_f of the mixing matrix A, with a high crosstalking error of E(A, Â_f) = 3.73.

Instead, we apply the complete-case Hough SCA algorithm to this model with β = 360 bins; the sparseness assumption now means that we are searching for sources

3500<br />

3000<br />

2500<br />

2000<br />

1500<br />

1000<br />

which have samples with at least one zero (quiet) source<br />

component. The Hough accumulator exhibits very nicely<br />

three strong maxima; see Figure 7(b). And <strong>in</strong>deed, the<br />

crosstalk<strong>in</strong>g error of the correspond<strong>in</strong>g estimated mix<strong>in</strong>g<br />

matrix �A with the orig<strong>in</strong>al one is very low at E(A, �A) = 0.020.<br />

This experimentally confirms that speech signals obey an<br />

(m−1)-sparse signal model, at least if m = n. An explanation<br />

for this fact is that <strong>in</strong> typical speech data sets, considerable<br />

pauses are common, so with high probability we may f<strong>in</strong>d<br />

samples <strong>in</strong> which at least one source vanishes, and all such<br />

permutations occur—which is necessary for identify<strong>in</strong>g the<br />

mix<strong>in</strong>g matrix accord<strong>in</strong>g to Theorem 1. We are deal<strong>in</strong>g with<br />

a complete-case problem, so <strong>in</strong>vert<strong>in</strong>g �A directly yields recovered<br />

sources �S. But of course due to the outliers, the SNR<br />

of �S with the orig<strong>in</strong>al sources is low with only −1.35 dB. We<br />

therefore apply a simple outlier removal scheme by scann<strong>in</strong>g<br />

each estimated source us<strong>in</strong>g a w<strong>in</strong>dow of size w = 10 samples.<br />

An adjacent sample to the w<strong>in</strong>dow is identified as outlier<br />

if its absolute value is larger than 20% of the maximal signal<br />

amplitude, but the w<strong>in</strong>dow sample variance is lower than<br />

half of the variance when <strong>in</strong>clud<strong>in</strong>g the sample. The outliers<br />

are then replaced by the w<strong>in</strong>dow average. This rough outlierdetection<br />

algorithm works satisfactorily well, see Figure 7(d);<br />

500



the perceptual audio quality increased considerably, see also the differences between Figures 7(c) and 7(d), although the nominal SNR increase is only roughly 4.1 dB. Altogether, this example illustrates the applicability of the Hough SCA algorithm and its corresponding SCA model to audio data sets also in noisy settings, where ICA algorithms perform very poorly.
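A minimal sketch of the windowed outlier-removal scheme described above (window size w = 10, the 20% amplitude threshold and the factor-two variance test) could look as follows; parameter names and the exact handling of the signal borders are our own choices.

```python
import numpy as np

def remove_outliers(s, w=10, amp_frac=0.2, var_ratio=0.5):
    """Rough windowed outlier removal, a sketch of the scheme above.

    A sample adjacent to a length-w window counts as an outlier if its
    magnitude exceeds amp_frac of the maximal signal amplitude while the
    window variance is less than var_ratio times the variance including
    the sample; outliers are replaced by the window mean.
    """
    s = np.asarray(s, dtype=float).copy()
    max_amp = np.max(np.abs(s))
    for t in range(w, len(s)):
        window = s[t - w:t]
        candidate = s[t]
        var_without = np.var(window)
        var_with = np.var(np.append(window, candidate))
        if abs(candidate) > amp_frac * max_amp and var_without < var_ratio * var_with:
            s[t] = window.mean()
    return s

# usage: s_clean = remove_outliers(s_estimated)
```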

5.6. Other applications

We are currently studying several biomedical applications of the proposed model and algorithm, including the separation of functional magnetic resonance imaging data sets as well as surface electromyograms. For results on the former data set, we refer to the detailed book chapters [22, 23].

The results of the k-SCA algorithm applied to the latter signals are shortly summarized in the following. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle; its study is relevant to the diagnosis of motoneuron diseases as well as to neurophysiological research. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use surface EMGs, which are measured using noninvasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and overlap of several source signals. When applying the k-SCA model to real recordings, Hough-based separation outperforms classical approaches based on filtering and ICA in terms of a greater reduction of the zero-crossings, a common measure to analyze the unknown extracted sources. The relative sEMG enhancement was 24.6 ± 21.4%, where the mean was taken over a group of 9 subjects. For a detailed analysis, comparing various sparse factorization models both on toy and on real data, we refer to [30].

6. CONCLUSION

We have presented an algorithm for performing a global search for overcomplete SCA representations, and experiments confirm that Hough SCA is robust against noise and outliers, with breakdown points up to 0.8. The algorithm employs hyperplane detection using a generalized Hough transform. Currently, we are working on applying the SCA algorithm to high-dimensional biomedical data sets to see how the different assumption of high sparsity contributes to the signal separation.

ACKNOWLEDGMENTS

The authors gratefully thank W. Nakamura for her suggestion of using the Hough transform when detecting hyperplanes, and the anonymous reviewers for their comments, which significantly improved the manuscript. The first author acknowledges partial financial support by the JSPS (PE 05543).

REFERENCES

[1] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, John Wiley & Sons, New York, NY, USA, 2002.
[2] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[3] P. Comon, "Independent component analysis. A new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[4] F. J. Theis, "A new concept for separability problems in blind source separation," Neural Computation, vol. 16, no. 9, pp. 1827–1850, 2004.
[5] J. Eriksson and V. Koivunen, "Identifiability and separability of linear ICA models revisited," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA '03), pp. 23–27, Nara, Japan, April 2003.
[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal of Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.
[7] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[8] F. J. Theis, E. W. Lang, and C. G. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomputing, vol. 56, no. 1–4, pp. 381–398, 2004.
[9] P. Georgiev, F. J. Theis, and A. Cichocki, "Sparse component analysis and blind source separation of underdetermined mixtures," IEEE Transactions on Neural Networks, vol. 16, no. 4, pp. 992–996, 2005.
[10] P. V. C. Hough, "Machine analysis of bubble chamber pictures," in International Conference on High Energy Accelerators and Instrumentation, pp. 554–556, CERN, Geneva, Switzerland, 1959.
[11] J. K. Lin, D. G. Grier, and J. D. Cowan, "Feature extraction approach to blind source separation," in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (NNSP '97), pp. 398–405, Amelia Island, Fla, USA, September 1997.
[12] H. Shindo and Y. Hirai, "An approach to overcomplete-blind source separation using geometric structure," in Proceedings of the Annual Conference of the Japanese Neural Network Society (JNNS '01), pp. 95–96, Naramachi Center, Nara, Japan, 2001.
[13] F. J. Theis, C. G. Puntonet, and E. W. Lang, "Median-based clustering for underdetermined blind signal processing," IEEE Signal Processing Letters, vol. 13, no. 2, pp. 96–99, 2006.
[14] L. Cirillo, A. Zoubir, and M. Amin, "Direction finding of nonstationary signals using a time-frequency Hough transform," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. 2718–2721, Philadelphia, Pa, USA, March 2005.
[15] S. Barbarossa, "Analysis of multicomponent LFM signals by a combined Wigner-Hough transform," IEEE Transactions on Signal Processing, vol. 43, no. 6, pp. 1511–1515, 1995.
[16] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, no. 2, pp. 111–122, 1981.
[17] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 87–90, 1999.
[18] K. Waheed and F. Salem, "Algebraic overcomplete independent component analysis," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA '03), pp. 1077–1082, Nara, Japan, April 2003.



[19] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.
[20] F. J. Theis, P. Georgiev, and A. Cichocki, "Robust overcomplete matrix recovery for sparse sources using a generalized Hough transform," in Proceedings of the 12th European Symposium on Artificial Neural Networks (ESANN '04), pp. 343–348, Bruges, Belgium, April 2004, d-side, Evere, Belgium.
[21] P. S. Bradley and O. L. Mangasarian, "k-plane clustering," Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
[22] P. Georgiev, P. Pardalos, F. J. Theis, A. Cichocki, and H. Bakardjian, "Sparse component analysis: a new tool for data mining," in Data Mining in Biomedicine, Springer, New York, NY, USA, 2005, in print.
[23] P. Georgiev, F. J. Theis, and A. Cichocki, "Optimization algorithms for sparse representations and applications," in Multiscale Optimization Methods, P. Pardalos, Ed., Springer, New York, NY, USA, 2005.
[24] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Communications of the ACM, vol. 15, no. 1, pp. 204–208, 1972.
[25] R. Dudley, Department of Mathematics, MIT, course 18.465, 2005.
[26] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, New York, NY, USA, 1987.
[27] P. Ballester, "Applications of the Hough transform," in Astronomical Data Analysis Software and Systems III, J. Barnes, D. R. Crabtree, and R. J. Hanisch, Eds., vol. 61 of ASP Conference Series, 1994.
[28] A. Goldenshluger and A. Zeevi, "The Hough transform estimator," Annals of Statistics, vol. 32, no. 5, pp. 1908–1932, 2004.
[29] A. Hyvärinen and E. Oja, "A fast fixed-point algorithm for independent component analysis," Neural Computation, vol. 9, no. 7, pp. 1483–1492, 1997.
[30] F. J. Theis and G. A. García, "On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms," Signal Processing, vol. 86, no. 3, pp. 603–623, 2006.

Fabian J. Theis obtained his M.S. degree in mathematics and physics from the University of Regensburg, Germany, in 2000. He also received the Ph.D. degree in physics from the same university in 2002 and the Ph.D. degree in computer science from the University of Granada in 2003. He worked as a Visiting Researcher at the Department of Architecture and Computer Technology (University of Granada, Spain), at the RIKEN Brain Science Institute (Wako, Japan), at FAMU-FSU (Florida State University, USA), and at TUAT's Laboratory for Signal and Image Processing (Tokyo, Japan). Currently, he is heading the Signal Processing & Information Theory Group at the Institute of Biophysics at the University of Regensburg and is working on his habilitation. He serves as an Associate Editor of "Computational Intelligence and Neuroscience," and is a Member of IEEE, EURASIP, and ENNS. His research interests include statistical signal processing, machine learning, blind source separation, and biomedical data analysis.

Pando Georgiev received his M.S., Ph.D., and "Doctor of Mathematical Sciences" degrees in mathematics (operations research) from Sofia University "St. Kl. Ohridski," Bulgaria, in 1982, 1987, and 2001, respectively. He has been with the Department of Probability, Operations Research, and Statistics at the Faculty of Mathematics and Informatics, Sofia University "St. Kl. Ohridski," Bulgaria, as an Assistant Professor (1989–1994) and, since 1994, as an Associate Professor. He was a Visiting Professor at the University of Rome II, Italy (CNR grants, several one-month visits), the International Center for Theoretical Physics, Trieste, Italy (ICTP grant, six months), the University of Pau, France (NATO grant, three months), Hirosaki University, Japan (JSPS grant, nine months), and so forth. He worked for four years (2000–2004) as a research scientist at the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, the Institute of Physical and Chemical Research (RIKEN), Wako, Japan. Currently he is a Visiting Scholar in the ECECS Department, University of Cincinnati, USA. His interests include machine learning and computational intelligence, independent and sparse component analysis, blind signal separation, statistics and inverse problems, signal and image processing, optimization, and variational analysis. He is a Member of AMS, IEEE, and UBM.

Andrzej Cichocki was born in Poland. He received the M.S. (with honors), Ph.D., and Habilitate Doctorate (Dr.Sc.) degrees, all in electrical engineering, from the Warsaw University of Technology (Poland) in 1972, 1975, and 1982, respectively. He is the coauthor of three international and successful books (two of them translated to Chinese): Adaptive Blind Signal and Image Processing (John Wiley, 2002), MOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer, 1989), and Neural Networks for Optimization and Signal Processing (J. Wiley and Teubner Verlag, 1993/1994), and the author or coauthor of more than three hundred papers. He is the Editor-in-Chief of the journal Computational Intelligence and Neuroscience and an Associate Editor of IEEE Transactions on Neural Networks. Since 1997, he has been the Head of the Laboratory for Advanced Brain Signal Processing in the RIKEN Brain Science Institute, Japan.


Chapter 12

LNCS 3195:718-725, 2004

Paper F.J. Theis and S. Amari. Postnonlinear overcomplete blind source separation using sparse sources. In Proc. ICA 2004, volume 3195 of LNCS, pages 718–725, Granada, Spain, 2004

Reference (Theis and Amari, 2004)

Summary in section 1.4.1


Postnonlinear overcomplete blind source separation using sparse sources

Fabian J. Theis 1,2 and Shun-ichi Amari 1

1 Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, Japan
2 Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
fabian@theis.name, amari@brain.riken.go.jp

Abstract. We present an approach for blindly decomposing an observed random vector x into f(As), where f is a diagonal function, i.e. f = f1 × ... × fm with one-dimensional functions fi, and A an m × n matrix. This postnonlinear model is allowed to be overcomplete, which means that fewer observations than sources (m < n) are given. In contrast to Independent Component Analysis (ICA), we do not assume the sources s to be independent but to be sparse in the sense that at each time instant they have at most m − 1 non-zero components (Sparse Component Analysis or SCA). Identifiability of the model is shown, and an algorithm for model and source recovery is proposed. It first detects the postnonlinearities in each component, and then identifies the now linearized model using previous results.

Blind source separation (BSS) based on ICA is a rapidly growing field (see for instance [1,2] and references therein), but most algorithms deal only with the case of at least as many observations as sources. However, there is an increasing interest in (linear) overcomplete ICA [3–5], where matrix identifiability is known [6], but source identifiability does not hold. In order to approximately detect the sources [7], additional requirements have to be made, usually sparsity of the sources.

Recently, we have proposed a model based only upon the sparsity assumption (summarized in section 1) [8]. In this case identifiability of both matrix and sources can be shown, given sufficiently high sparsity. Here, we extend these results to postnonlinear mixtures (section 2); they describe a model often occurring in real situations, when the mixture is in principle linear, but the sensors introduce an additional nonlinearity during the recording [9]. Section 3 presents an algorithm for identifying such models, and section 4 finishes with an illustrative simulation.

1 Linear overcomplete SCA

Definition 1. A vector v ∈ R^n is said to be k-sparse if v has at most k non-zero entries.



If an n-dimensional vector is (n − 1)-sparse, that is, it includes at least one zero component, it is simply said to be sparse. The goal of Sparse Component Analysis of level k (k-SCA) is to decompose a given m-dimensional random vector x into

x = As    (1)

with a real m × n matrix A and an n-dimensional k-sparse random vector s. Here s is called the source vector, x the mixtures and A the mixing matrix. We speak of complete, overcomplete or undercomplete k-SCA if m = n, m < n or m > n, respectively. In the following, without loss of generality, we will assume m ≤ n because the undercomplete case can easily be reduced to the complete case by projection of x.

Theorem 1 (Matrix identifiability). Consider the k-SCA problem from equation (1) for k := m − 1 and assume that every m × m submatrix of A is invertible. Furthermore let s be sufficiently richly represented in the sense that for any index set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of s such that each of them has zero elements in places with indices in I and each m − 1 of them are linearly independent. Then A is uniquely determined by x except for left-multiplication with permutation and scaling matrices.

Theorem 2 (Source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (m − 1)-sparse solution s. If A fulfills the condition from theorem 1, then there exists a subset H0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H0 this system has no other solution with this property.

The above two theorems show that in the case of overcomplete BSS using (m−1)-SCA, both the mixing matrix and the sources can be uniquely recovered from x except for the omnipresent permutation and scaling indeterminacy. We refer to [8] for proofs of these theorems and algorithms based upon them. We also want to note that the present source recovery algorithm is quite different from the usual sparse source recovery using l1-norm minimization [7] and linear programming. In the case of sources with sparsity as above, the latter will not be able to detect the sources.

2 Postnonlinear overcomplete SCA

2.1 Model

Consider n-dimensional k-sparse sources s with k < m. The postnonlinear mixing model [9] is defined to be

x = f(As)    (2)

with a diagonal invertible function f with f(0) = 0 and a real m × n matrix A. Here a function f is said to be diagonal if each component fi only depends on xi. In abuse of notation we will in this case interpret the components fi of f as



functions with domain R and write f = f1 × ... × fm. The goal of overcomplete postnonlinear k-SCA is to determine the mixing functions f and A and the sources s given only x.

Without loss of generality we consider only the complete and the overcomplete case (i.e. m ≤ n). In the following we will assume that the sources are sparse of level k := m − 1 and that the components fi of f are continuously differentiable with f′i(t) ≠ 0. This is equivalent to saying that the fi are continuously differentiable with continuously differentiable inverse functions (diffeomorphisms).

2.2 Identifiability

Definition 2. Let A be an m × n matrix. Then A is said to be mixing if A has at least two nonzero entries in each row. And A = (aij), i = 1...m, j = 1...n, is said to be absolutely degenerate if there are two columns k ≠ l such that a²ik = λ a²il for all i and fixed λ ≠ 0, i.e. the normalized columns differ only by the signs of their entries.

Postnonlinear overcomplete SCA is a generalization of linear overcomplete SCA, so the indeterminacies of postnonlinear SCA contain at least the indeterminacies of linear overcomplete SCA: A can only be reconstructed up to scaling and permutation. Also, if L is an invertible scaling matrix, then

f(As) = (f ◦ L)((L⁻¹A)s),

so f and A can interchange scaling factors in each component.

Two further indeterminacies occur if A is either not mixing or absolutely degenerate. In the first case, this means that fi cannot be identified if the i-th row of A contains only one non-zero element. In the case of an absolutely degenerate mixing matrix, sparseness alone cannot detect the nonlinearity, as the counterexample

A = [ 1  1 ; 1  −1 ]

with arbitrary f1 ≡ f2 shows.

If s is an n-dimensional random vector, its image (or the support of its density) is denoted as im s := {s(t)}.

Theorem 3 (Identifiability). Let s be an n-dimensional k-sparse random vector (k < m), and x an m-dimensional random vector constructed from s as in equation (2). Furthermore assume that

(i) s is fully k-sparse in the sense that im s equals the union of all k-dimensional coordinate spaces (in which it is contained by the sparsity assumption),
(ii) A is mixing and not absolutely degenerate,
(iii) every m × m submatrix of A is invertible.

If x = f̂(Âŝ) is another representation of x as in equation (2) with ŝ satisfying the same conditions as s, then there exists an invertible scaling L with f = f̂ ◦ L, and invertible scaling and permutation matrices L′, P′ with A = LÂL′P′.



Fig. 1. Illustration of the proof of theorem 3 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (leftmost figure) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (middle figure). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (right figure) according to the algorithms from [8]. The idea of the proof now is that two lines spanned by coordinate vectors (thick lines, leftmost figure) are mapped onto two lines spanned by two columns of A. If the composition g ◦ f maps these lines onto some different lines (as sets), then we show that (given 'general position' of the two lines) the components of g ◦ f satisfy the conditions from lemma 1 and hence are already linear.

The proof relies on the fact that when s is fully k-sparse as formulated in 3(i), it includes all the k-dimensional coordinate subspaces and hence intersections of k such subspaces, which give the n coordinate axes. They are transformed into n curves in the x-space, passing through the origin. By identification of these curves, we show that each nonlinearity is homogeneous and hence linear according to the previous section. The proof is omitted due to lack of space. Figure 1 gives an illustration of the proof in the case n = 3 and m = 2. It uses the following lemma (a generalization of the analytic case presented in [10]).

Lemma 1. Let a, b ∈ R \ {−1, 0, 1}, a > 0, and f : [0, ε) → R differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.

Theorem 3 shows that f and A are uniquely determined by x except for scaling and permutation ambiguities. Note that then obviously also s is identifiable by applying theorem 2 to the linearized mixtures y = f⁻¹(x) = As, given the additional assumptions on s from the theorem.

For brevity, the theorem assumes in (i) that im s is the whole union of the k-dimensional coordinate spaces — this condition can be relaxed (the proof is local in nature), but then the nonlinearities can only be found on intervals where the corresponding marginal densities of As are non-zero (however, in addition the proof needs that they are nonzero locally at 0). Furthermore, in practice the assumption about the image of s will have to be replaced by assuming the same with non-zero probability. Also note that almost any A ∈ R^(m×n) in the measure sense fulfills conditions (ii) and (iii).
sense fulfills the conditions (ii) and (iii).<br />

3 Algorithm for postnonlinear (over)complete SCA

The separation is done in a two-stage procedure: In the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to



the one used in the identifiability proof of theorem 3, see also figure 1. In the second stage, the mixing matrix A and then the sources s are reconstructed by applying the linear algorithms from [8], section 1, to the linearized mixtures f⁻¹(x). So in the following it is enough to reconstruct f.

3.1 Geometrical preprocessing

Let x(1), ..., x(T) ∈ R^m be i.i.d. samples of the random vector x. The goal of geometrical preprocessing is to construct vectors y(1), ..., y(T) and z(1), ..., z(T) ∈ R^m using clustering or interpolation on the samples x(t) such that f⁻¹(y(t)) and f⁻¹(z(t)) lie in two linearly independent lines of R^m. In figure 1 they are to span the two thick lines which already determine the postnonlinearities.

Algorithmically, y and z can be constructed in the case m = 2 by first choosing far away samples (on different 'non-opposite' curves) as initial starting points and then advancing towards the known data set center by always choosing the closest sample of x with smaller modulus. Such an algorithm can also be implemented for larger m, but only for sources with at most one non-zero coefficient at each time instant; it can however be generalized to sources of sparseness m − 1 using more elaborate clustering.
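For the case m = 2, the procedure just described can be sketched as follows; this is a heuristic illustration only, and the choice of the two starting samples (far out, on non-opposite curves) is left to the caller.

```python
import numpy as np

def trace_curve(X, start_idx):
    """Trace one curve of the mixture scatterplot towards the origin.

    Starting from a far-out sample, repeatedly pick, among the samples of
    smaller modulus, the one closest to the current point.  X is a (2, T)
    matrix of mixture samples; the returned indices are ordered from the
    outside towards the data-set centre.
    """
    norms = np.linalg.norm(X, axis=0)
    path = [start_idx]
    current = start_idx
    while True:
        closer = np.where(norms < norms[current])[0]
        if closer.size == 0:          # reached the centre of the data set
            break
        dists = np.linalg.norm(X[:, closer] - X[:, [current]], axis=0)
        current = int(closer[np.argmin(dists)])
        path.append(current)
    return np.array(path)

# e.g. y_idx = trace_curve(X, int(np.argmax(np.linalg.norm(X, axis=0))))
```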

3.2 Postnonlinearity estimation

Given the subspace vectors y(t) and z(t) from the previous section, the goal is to find C¹-diffeomorphisms gi : R → R such that g1 × ... × gm maps the vectors y(t) and z(t) onto two different linear subspaces.

In abuse of notation, we now assume that two curves (injective infinitely differentiable mappings) y, z : (−1, 1) → R^m are given with y(0) = z(0) = 0. These can for example be constructed from the discrete sample points y(t) and z(t) from the previous section by polynomial or spline interpolation. If the two curves are mapped onto lines by g1 × ... × gm (and if these are in sufficiently general position), then gi = λi fi⁻¹ for some λi ≠ 0 according to theorem 3. By requiring this condition only for the discrete sample points from the previous section, we get an approximation of the unmixing nonlinearities gi. Let i ≠ j be fixed. It is then easy to see that by projecting x, y and z onto the i-th and j-th coordinates, the problem of finding the nonlinearities can be reduced to the case m = 2, in which g2 is to be reconstructed; we will assume this in the following.

A is chosen to be mixing, so we can assume that the indices i, j were chosen such that the two lines f⁻¹ ◦ y, f⁻¹ ◦ z : (−1, 1) → R² do not coincide with the coordinate axes. Reparametrization (ȳ := y ◦ y1⁻¹) of the curves lets us further assume that y1 = z1 = id. Then after some algebraic manipulation, the condition that the separating nonlinearities g = g1 × g2 must map y and z onto lines can be written as g2 ◦ y2 = a g1 = (a/b) g2 ◦ z2 with constants a, b ∈ R \ {0}, a ≠ ±b.

So the goal of geometrical postnonlinearity detection is to find a C¹-diffeomorphism g on subsets of R with

g ◦ y = c g ◦ z    (3)



for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0. By theorem 3, g (and also c) are uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation (3), we get c = y′(0)/z′(0), so c can be directly calculated from the known curves y and z.

In the following section, we propose to solve this problem numerically, given samples y(t1), z(t1), ..., y(tT), z(tT) of the curves. Note that here it is assumed that the samples of the curves y and z are given at the same time instants ti ∈ (−1, 1). In practice, this is usually not the case, so values of z at the sample points of y and vice versa will first have to be estimated, for example by using spline interpolation.
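The resampling step and the estimate c = y′(0)/z′(0) can be sketched with standard spline interpolation; the toy curves below are ours and only serve to make the snippet self-contained.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# discrete curve samples (t_y, y) and (t_z, z), e.g. from the geometrical
# preprocessing; here two toy curves through the origin
t_y = np.linspace(-0.9, 0.9, 50)
y = np.tanh(2.0 * t_y)
t_z = np.linspace(-0.9, 0.9, 60)
z = 0.5 * t_z + 0.1 * t_z ** 3

# resample z at the sample points of y by spline interpolation
z_spline = CubicSpline(t_z, z)
z_at_y = z_spline(t_y)

# c follows from the derivatives of both curves at the origin (equation (3))
y_spline = CubicSpline(t_y, y)
c = y_spline.derivative()(0.0) / z_spline.derivative()(0.0)
print("estimated c =", c)
```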

3.3 MLP-based postnonlinearity approximation

We want to find an approximation g̃ (in some parametrization) of g with g̃(y(ti)) = c g̃(z(ti)) for i = 1, ..., T, so in the most general sense we want to find

g̃ = argmin_g E(g) := argmin_g (1/(2T)) Σ_{i=1}^T (g(y(ti)) − c g(z(ti)))²    (4)

In order to minimize this energy function E(g), a single-input single-output multilayered neural network (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximated g̃ can be written as

g̃(t) = w(2)ᵀ σ̄(w(1) t + b(1)) + b(2)

with weight vectors w(1), w(2) ∈ R^d and biases b(1) ∈ R^d, b(2) ∈ R. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e^(−t))^(−1), and we set σ̄ := σ × ... × σ, d times. The MLP weights are restricted in the sense that g̃(0) = 0 and g̃′(0) = 1. This implies b(2) = −w(2)ᵀ σ̄(b(1)) and Σ_{i=1}^d w(1)_i w(2)_i σ′(b(1)_i) = 1.

Especially the second normalization is very important for the learning step, otherwise the weights could all converge to the (valid) zero solution. So the outer bias is not trained by the network; we could fix a second weight in order to guarantee the second condition — this however would result in an unstable quotient calculation. Instead it is preferable to perform network training on a submanifold in the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation (4):

Ē(g̃) := (1/(2T)) Σ_{j=1}^T (g̃(y(tj)) − c g̃(z(tj)))² + λ (Σ_{i=1}^d w(1)_i w(2)_i σ′(b(1)_i) − 1)²    (5)

with suitably chosen λ > 0.

Learning of the weights is performed via backpropagation on this energy function. The gradient of Ē(g̃) with respect to the weight matrix can be easily



calculated from the Euclidean gradient of g. For the learn<strong>in</strong>g process, we further<br />

note that all weights w (j)<br />

i should be kept nonnegative <strong>in</strong> order to ensure<br />

<strong>in</strong>vertibility of ˜g.<br />

In order to <strong>in</strong>crease convergence speed, the Euclidean gradient of g should<br />

be replaced by the natural gradient [11], which <strong>in</strong> experiments enhances the<br />

algorithm performance <strong>in</strong> terms of speed by a factor of roughly 10.<br />

4 Experiment

The postnonlinear mixture of three sources to two mixtures is considered. 10^5 samples of artificially generated sources with one non-zero coefficient (drawn uniformly from [−0.5, 0.5]) are used. We refer to figure 2 for a plot of the sources, mixtures and recoveries. The sources were mixed using the postnonlinear mixing model x = f1 × f2(As) with mixing matrix

A = [ 4.3  7.8  0.59 ; 9  6.2  10 ]

and postnonlinearities f1(x) = tanh(x) + 0.1x and f2(x) = x. For easier algorithm visualization and evaluation we chose f2 to be linear and did not add any noise.

The MLP-based postnonlinearity detection algorithm from section 3.3 with natural gradient-descent learning, 9 hidden neurons, a learning rate of η = 0.01 and 10^5 iterations gives a good approximation of the unmixing nonlinearities gi. Linear overcomplete SCA is then applied to g1 × g2(x): for practical reasons (due to approximation errors, the data is not fully linearized), instead of the matrix recovery algorithm from [8] we use a modification of the geometric ICA algorithm [4], which is known to work well in the very sparse one-dimensional case, to get the recovered mixing matrix

Â = [ −0.46  −0.81  −0.069 ; −0.89  −0.58  −1.0 ],

which except for scaling and permutation coincides well with A. Source recovery then yields the estimated sources; the (normalized) signal-to-noise ratios (SNRs) of these with the original sources are high at 26, 71 and 46 dB, respectively.
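For reference, the data generation of this experiment can be reproduced with a few lines (the mixing matrix and postnonlinearities are the ones stated above; the random seed is of course arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10**5

# 1-sparse sources: exactly one active component per sample, uniform in [-0.5, 0.5]
S = np.zeros((3, T))
active = rng.integers(0, 3, T)
S[active, np.arange(T)] = rng.uniform(-0.5, 0.5, T)

A = np.array([[4.3, 7.8, 0.59],
              [9.0, 6.2, 10.0]])
f1 = lambda u: np.tanh(u) + 0.1 * u
f2 = lambda u: u

Z = A @ S                              # linear overcomplete mixtures
X = np.vstack([f1(Z[0]), f2(Z[1])])    # postnonlinear mixtures x = f(As)
```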

References

1. Cichocki, A., Amari, S.: Adaptive blind signal and image processing. John Wiley & Sons (2002)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons (2001)
3. Lee, T., Lewicki, M., Girolami, M., Sejnowski, T.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters 6 (1999) 87–90
4. Theis, F., Lang, E., Puntonet, C.: A geometric algorithm for overcomplete linear ICA. Neurocomputing 56 (2004) 381–398
5. Zibulevsky, M., Pearlmutter, B.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13 (2001) 863–882
6. Eriksson, J., Koivunen, V.: Identifiability and separability of linear ICA models revisited. In: Proc. of ICA 2003. (2003) 23–27
revisited. In: Proc. of ICA 2003. (2003) 23–27



Fig. 2. Example: (a) shows the 1-sparse source signals, and (b) the postnonlinear overcomplete mixtures. The original source directions can be clearly seen in the structure of the mixture scatterplot (c). The crosses and stars indicate the found interpolation points used for approximating the separating nonlinearities, generated by geometrical preprocessing. Now, according to theorem 3, the sources can be recovered uniquely, figure (d), except for permutation and scaling.

7. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 (1998) 33–61
8. Georgiev, P., Theis, F., Cichocki, A.: Blind source separation and sparse component analysis of overcomplete mixtures. In: Proc. of ICASSP 2004, Montreal, Canada (2004)
9. Taleb, A., Jutten, C.: Indeterminacy and identifiability of blind identification. IEEE Transactions on Signal Processing 47 (1999) 2807–2820
10. Babaie-Zadeh, M., Jutten, C., Nayebi, K.: A geometric approach for separating post non-linear mixtures. In: Proc. of EUSIPCO '02. Volume II., Toulouse, France (2002) 11–14
11. Amari, S., Park, H., Fukumizu, K.: Adaptive method of realizing gradient learning for multilayer perceptrons. Neural Computation 12 (2000) 1399–1409




Chapter 13

Proc. EUSIPCO 2005

Paper F.J. Theis, K. Stadlthanner, and T. Tanaka. First results on uniqueness of sparse non-negative matrix factorization. In Proc. EUSIPCO 2005, Antalya, Turkey, 2005

Reference (Theis et al., 2005c)

Summary in section 1.4.2


FIRST RESULTS ON UNIQUENESS OF SPARSE NON-NEGATIVE MATRIX FACTORIZATION

Fabian J. Theis, Kurt Stadlthanner and Toshihisa Tanaka*

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
phone: +49 941 943 2924, fax: +49 941 943 2479, email: fabian@theis.name

* Department of Electrical and Electronic Engineering, Tokyo University of Agriculture and Technology (TUAT), 2-24-16 Nakacho, Koganei-shi, Tokyo 184-8588, Japan, and ABSP Laboratory, BSI, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan

ABSTRACT

Sparse non-negative matrix factorization (sNMF) allows for the decomposition of a given data set into a mixing matrix and a feature data set, which are both non-negative and fulfill certain sparsity conditions. In this paper it is shown that the employed projection step proposed by Hoyer has a unique solution, and that it indeed finds this solution. Then indeterminacies of the sNMF model are identified and first uniqueness results are presented, both theoretically and experimentally.

1. INTRODUCTION

Non-negative matrix factorization (NMF) describes a promising new technique for decomposing non-negative data sets into a product of two smaller matrices, thus capturing the underlying structure [3]. In applications it turns out that additional constraints like, for example, sparsity enhance the recoveries; one promising variant of such a sparse NMF algorithm has recently been proposed by Hoyer [2]. It consists of the common NMF update steps, but at each step a sparsity constraint is imposed. If factorization algorithms are to produce reliable results, their indeterminacies have to be known and uniqueness (except for the indeterminacies) has to be shown — so far only restricted and quite disappointing results for NMF [1] and none for sNMF are known.

In this paper we first present a novel uniqueness result showing that the projection step of sparse NMF always possesses a unique solution (except for a set of measure zero), theorems 2.2 and 2.6. We then prove that Hoyer's algorithm indeed detects these solutions, theorem 2.8. In section 3, after shortly repeating Hoyer's sNMF algorithm, we analyze its indeterminacies and show uniqueness in some restricted cases, theorem 3.3. The result is both new and astonishing, because the set of indeterminacies is much smaller than the one of NMF, namely of measure zero.

2. SPARSE PROJECTION

The sparse NMF algorithm enforces sparseness by using a projection step as follows: Given x ∈ R^n and fixed λ1, λ2 > 0, find s such that

s = argmin_{‖s‖1=λ1, ‖s‖2=λ2, s≥0} ‖x − s‖2    (1)

Here ‖s‖p := (Σ_{i=1}^n |si|^p)^(1/p) denotes the p-norm; in the following we often omit the index in the case p = 2. Furthermore s ≥ 0 is defined as si ≥ 0 for all i = 1, ..., n, so s is to be non-negative. Our goal is to show that such a projection always exists and is unique for almost all x. This problem can be generalized by replacing the 1-norm by an arbitrary p-norm; however, the (Euclidean) 2-norm has to be used, as can be seen in the proof later. Other possible generalizations include projections in infinite-dimensional Hilbert spaces.

First note that the two norms are equivalent, i.e. induce the same topology; indeed ‖s‖2 ≤ ‖s‖1 ≤ √n ‖s‖2 for all s ∈ R^n, as can easily be shown. So a necessary condition for any s to satisfy equation (1) is λ2 ≤ λ1 ≤ √n λ2.

We want to solve problem (1) by projecting x onto

M := {s | ‖s‖1 = λ1} ∩ {s | ‖s‖2 = λ2} ∩ {s ≥ 0}    (2)

In order to solve equation (1), x has to be projected onto a point adjacent to it in M:

Definition 2.1. A point p ∈ M ⊂ R^n is called adjacent to x ∈ R^n in M, in symbols p ⊳_M x or shorter p ⊳ x, if ‖x − p‖2 ≤ ‖x − q‖2 for all q ∈ M.

In the following we will study in which cases this is possible, and which conditions are needed to guarantee that this projection is even unique.
even unique.<br />

2.1 Existence<br />

Assume that x lies <strong>in</strong> the closure of M, but not <strong>in</strong> M. Obviously<br />

there exists no p⊳x as x ‘touches’ M without be<strong>in</strong>g an element of<br />

it. In order to avoid these exceptions, it is enough to assume that M<br />

is closed:<br />

Theorem 2.2 (Existence). If M is closed and nonempty, then for<br />

every x ∈ R n there exists a p ∈ M with p⊳x.<br />

Proof. Let x ∈ R n be fixed. Without loss of generality (by tak<strong>in</strong>g<br />

<strong>in</strong>tersections with a large enough ball) we can assume that M is<br />

compact. Then f : M → R,p ↦→ �x −p� is cont<strong>in</strong>uous and has<br />

therefore a m<strong>in</strong>imum p0, so p0 ⊳x.<br />

2.2 Uniqueness

Definition 2.3. Let X(M) := {x ∈ R^n | there exists more than one point adjacent to x in M} = {x ∈ R^n | #{p ∈ M | p ⊳ x} > 1} denote the exception set of M.

In other words, the exception set contains the set of points from which we cannot uniquely project. Our goal is to show that this set vanishes or is at least very small. Figure 1 shows the exception sets of two different sets.

Note that if x ∈ M then x ⊳ x, and x is the only point with that property. So M ∩ X(M) = ∅. Obviously the exception set of an affine linear hyperspace is empty. Indeed, we can prove more generally:

Lemma 2.4. Let M ⊂ R^n be convex. Then X(M) = ∅.

For the proof we need the following simple lemma, which only works for the 2-norm as it uses the scalar product.

Lemma 2.5. Let a, b ∈ R^n such that ‖a + b‖2 = ‖a‖2 + ‖b‖2. Then a and b are collinear.

Proof. By taking squares we get ‖a + b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖², so

‖a‖² + 2⟨a, b⟩ + ‖b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖²


Figure 1: Two examples of exception sets: (a) the exception set of two points, (b) the exception set of a sector.

if ⟨·,·⟩ denotes the (symmetric) scalar product. Hence ⟨a, b⟩ = ‖a‖‖b‖, and a and b are collinear according to the Schwarz inequality.

Proof of lemma 2.4. Assume X(M) ≠ ∅. Then let x ∈ X(M) and p1 ≠ p2 ∈ M such that pi ⊳ x. By assumption q := ½(p1 + p2) ∈ M. But

‖x − p1‖ ≤ ‖x − q‖ ≤ ½‖x − p1‖ + ½‖x − p2‖ = ‖x − p1‖

because both pi are adjacent to x. Therefore ‖x − q‖ = ‖½(x − p1)‖ + ‖½(x − p2)‖, and application of lemma 2.5 shows that x − p1 = α(x − p2). Taking norms (and using the fact that q ≠ x) shows that α = 1 and hence p1 = p2, which is a contradiction.

In a similar manner, it is easy to show for example that the exception set of the sphere consists only of its center, or to calculate the exception sets of the sets M from figure 1. Another property of the exception set is that it behaves nicely under non-degenerate affine linear transformations.

Hence in general, we cannot expect X(M) to vanish altogether. However, we can show that in practical applications we can easily neglect it:

Theorem 2.6 (Uniqueness). vol(X(M)) = 0.

This means that the Lebesgue measure of the exception set is zero; in particular it does not contain any open ball. In other words, if x is drawn from a continuous probability distribution on ℝ^n, then x ∈ X(M) with probability 0. We simplify the proof by introducing the following lemma:

Lemma 2.7. Let x ∈ X(M) with p ⊳ x, p′ ⊳ x and p ≠ p′. Assume y lies on the line between x and p. Then y ∉ X(M).

Proof. So y = αx + (1 − α)p with α ∈ (0, 1). Note that then also p ⊳ y; otherwise we would have another q ⊳ y with ‖q − y‖ < ‖p − y‖. But then ‖q − x‖ ≤ ‖q − y‖ + ‖y − x‖ < ‖p − y‖ + ‖y − x‖ = ‖p − x‖, which contradicts the assumption.

Now assume that y ∈ X(M). Then there exists p″ ⊳ y with p″ ≠ p. But ‖p″ − x‖ ≤ ‖p″ − y‖ + ‖y − x‖ = ‖p − y‖ + ‖y − x‖ = ‖p − x‖. Then p ⊳ x induces ‖p″ − x‖ = ‖p − x‖. So

‖p″ − x‖ = ‖p″ − y‖ + ‖y − x‖.

Application of lemma 2.5 then yields p″ − y = α(y − x), and hence p″ − y = β(p − y). Taking norms (and using p ⊳ x) shows that β = 1 and hence p = p″, which is a contradiction.

Proof of theorem 2.6. Assume there exists an open set U ⊂ X(M), and let x ∈ U. Then choose p ≠ p′ ∈ M with p ⊳ x, p′ ⊳ x. But

{αx + (1 − α)p | α ∈ (0, 1)} ∩ U ≠ ∅,

which contradicts lemma 2.7.

2.3 Algorithm<br />

From here on, let M be defined by equation (2). In [2], Hoyer proposes algorithm 1 to project a given vector x onto p ∈ M such that p ⊳ x (we added a slight simplification by not setting all negative values of s to zero but only a single one in each step). The algorithm iteratively detects p by first satisfying the 1-norm condition (line 1) and then the 2-norm condition (line 3). The algorithm terminates if the constructed vector is already non-negative; otherwise a negative coordinate is selected, set to zero (line 4), and the search is continued in ℝ^(n−1).

Algorithm 1: Sparse projection
Input: vector x ∈ ℝ^n, norm conditions λ_1 and λ_2
Output: closest non-negative s with ‖s‖_i = λ_i
1 Set r ← x + ((λ_1 − ‖x‖_1)/n) e with e = (1, . . . , 1)^T ∈ ℝ^n.
2 Set m ← (λ_1/n) e.
3 Set s ← m + α(r − m) with α > 0 such that ‖s‖_2 = λ_2.
if there exists j with s_j < 0 then
4   Fix s_j ← 0.
5   Remove the j-th coordinate of x.
6   Decrease the dimension n ← n − 1.
7   goto 1.
end
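For illustration, a minimal NumPy sketch of algorithm 1 follows; the function name sparse_projection and the parameter names are ours rather than from the paper, and a non-negative input x together with compatible norm conditions (λ_1/√n ≤ λ_2 ≤ λ_1) is assumed.

```python
import numpy as np

def sparse_projection(x, lam1, lam2):
    """Sketch of algorithm 1: project x onto {s >= 0 : ||s||_1 = lam1, ||s||_2 = lam2}.

    Assumes x >= 0 (so x.sum() equals its 1-norm) and feasible lam1, lam2."""
    x = np.asarray(x, dtype=float).copy()
    s_full = np.zeros(x.size)
    active = np.arange(x.size)            # coordinates not yet fixed to zero
    for _ in range(x.size):               # at most n - 1 dimension reductions
        n = x.size
        r = x + (lam1 - x.sum()) / n      # line 1: enforce the 1-norm condition
        m = np.full(n, lam1 / n)          # line 2: centre of the hyperplane sum = lam1
        d = r - m
        # line 3: choose alpha > 0 with ||m + alpha*d||_2 = lam2 (positive root)
        a, b, c = d @ d, 2 * m @ d, m @ m - lam2 ** 2
        alpha = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
        s = m + alpha * d
        if (s >= -1e-12).all():           # already non-negative: done
            s_full[active] = np.clip(s, 0.0, None)
            return s_full
        j = int(np.argmin(s))             # lines 4-7: fix one negative coordinate to zero
        active = np.delete(active, j)     # and restart in dimension n - 1
        x = np.delete(x, j)
    return s_full
```

Note that the restart keeps λ_1 and λ_2 unchanged, since the coordinate fixed to zero contributes to neither norm.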

The projection algorithm terminates after at most n − 1 iterations. However, it is not obvious that it indeed detects p. In the following we will prove this given that x ∉ X(M); of course we have to exclude non-uniqueness points. The idea of the proof is to show that in each step the new estimate has p as its closest point in M.

Theorem 2.8 (Sparse projection). Given x ≥ 0 such that x ∉ X(M). Let p ∈ M with p ⊳ x. Furthermore assume that r and s are constructed by lines 1 and 3 of algorithm 1. Then:

(i) ∑ r_i = λ_1, p ⊳ r and r ∉ X(M).
(ii) ∑ s_i = λ_1, ‖s‖_2 = λ_2, p ⊳ s and s ∉ X(M).
(iii) If s_j < 0 then p_j = 0.
(iv) Define u := s but set u_j = 0. Then p ⊳ u and u ∉ X(M).

This theorem shows that if s ≥ 0 then already s ∈ M and p ⊳ s by (ii), so s = p. If s_j < 0 then it is enough to set s_j := 0 (because p_j = 0 by (iii)) and continue the search in one dimension lower (iv).

Proof. Let H := {x ∈ ℝ^n | ∑ x_i = λ_1} denote the affine hyperplane given by the 1-norm. Note that M ⊂ H.

(i) By construction r ∈ H. Furthermore e ⊥ H, so r is the orthogonal projection of x onto H. Let q ∈ M be arbitrary. We then get ‖q − x‖^2 = ‖q − r‖^2 + ‖r − x‖^2. By definition ‖p − x‖ ≤ ‖q − x‖, so ‖p − r‖^2 = ‖p − x‖^2 − ‖r − x‖^2 ≤ ‖q − x‖^2 − ‖r − x‖^2 = ‖q − r‖^2, and therefore p ⊳ r. Furthermore r ∉ X(M), because if q ∈ ℝ^n with q ⊳ r, then ‖q − r‖ = ‖p − r‖. Then by the above also ‖q − x‖ = ‖p − x‖, hence q = p (because x ∉ X(M)).

(ii) First note that s is a linear combination of m and r, and both lie in H, so also s ∈ H, i.e. ∑ s_i = λ_1. Furthermore by construction ‖s‖ = λ_2. Now let q ∈ M. For p ⊳ s to hold, we have to show that ‖p − s‖ ≤ ‖q − s‖. This follows (see (i)) if we can show

‖q − r‖^2 = ‖s − r‖^2 + (1/α_0)‖q − s‖^2. (3)

We can prove this equation as follows: by definition λ_2^2 = ‖q − m‖^2 = ‖q − s‖^2 + ‖s − m‖^2 + 2⟨q − s, s − m⟩, hence ‖q − s‖^2 = −2⟨q − s, s − m⟩ = −2 (α_0/(α_0 − 1)) ⟨q − s, s − r⟩, where we have used s − m = α_0(r − m), i.e. m = (s − α_0 r)/(1 − α_0), so s − m = (α_0/(α_0 − 1))(s − r).



Using the above, we can now calculate

‖q − r‖^2 = ‖q − s‖^2 + ‖s − r‖^2 + 2⟨q − s, s − r⟩
          = ‖q − s‖^2 + ‖s − r‖^2 + ((1 − α_0)/α_0)‖q − s‖^2
          = ‖s − r‖^2 + (1/α_0)‖q − s‖^2.

Similarly, from formula (3), we get s ∉ X(M), because if there exists q ∈ ℝ^n with ‖q − s‖ = ‖p − s‖, then also ‖q − r‖ = ‖p − r‖, hence q = p.

(iii) Assume s_j < 0. First note that m does not lie on the line βs + (1 − β)p (in other words m ≠ (p + s)/2), because otherwise due to symmetry there would be at least two points in M closest to s, but s ∉ X(M). Now assume the claim is wrong; then p_j > 0 (because p ≥ 0). Define g : [0, 1] → H by g(β) := m + α_β(βs + (1 − β)p − m), where α_β > 0 has been chosen such that ‖g(β)‖ = λ_2. The curve g describes the shortest arc in H ∩ {‖q‖ = λ_2} connecting p to s. We notice that p_j > 0, s_j < 0 and g is continuous. Hence determine the (unique) β_0 such that q := g(β_0) has the property q_j = 0. By construction q ∈ M, but q lies closer to s than p (because ‖g(β) − r‖^2 = 2⟨g(β) − m, m − r⟩ + 2λ_2^2 is decreasing with increasing β). But this is a contradiction to p ⊳ s.

(iv) The vector u is defined by u_i = s_i if i ≠ j and u_j = 0, i.e. u is the orthogonal projection of s onto the coordinate hyperplane given by x_j = 0. So we calculate ‖p − s‖^2 = ‖p − u‖^2 + ‖u − s‖^2, and the claim follows directly as in (i).

3. MATRIX FACTORIZATION<br />

Matrix factorization models have already been used successfully in many applications when it comes to finding suitable data representations. Basically, a given m × T data matrix X is factorized into an m × n matrix W and an n × T matrix H,

X = WH, (4)

where m ≤ n.

3.1 Sparse non-negative matrix factorization<br />

In contrast to other matrix factorization models such as principal or independent component analysis, non-negative matrix factorization (NMF) strictly requires both matrices W and H to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [3].

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [2] proposed a modification of the NMF model to include sparseness: he minimizes the deviation from (4) under the constraint of fixed sparseness of both W and H. Here, using a ratio of 1- and 2-norms of x ∈ ℝ^n \ {0}, the sparseness is measured by σ(x) := (√n − ‖x‖_1/‖x‖_2)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
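As a small illustration of this measure (the function name is ours, not from the paper):

```python
import numpy as np

def hoyer_sparseness(x):
    """sigma(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1), in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

print(hoyer_sparseness([0, 0, 0, 1]))   # 1.0: n - 1 entries are zero
print(hoyer_sparseness([1, 1, 1, 1]))   # 0.0: all entries have equal magnitude
```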

Formally, sparse NMF (sNMF) [2] can be defined as the task of finding

X, W, H ≥ 0, X = WH subject to σ(W_{*i}) = σ_W, σ(H_{i*}) = σ_H. (5)

Here σ_W, σ_H ∈ [0, 1] denote fixed constants describing the sparseness of the columns of W and the rows of H, respectively. Usually, the linear model in NMF is assumed to hold only approximately, hence the above formulation of sNMF represents the limit case of perfect factorization. sNMF is summarized by algorithm 2, which applies algorithm 1 separately to each column and row, respectively, for the sparse projection.

Algorithm 2: Sparse non-negative matrix factorization
Input: observation data matrix X
Output: decomposition WH of X fulfilling given sparseness constraints σ_H and σ_W
1 Initialize W and H to random non-negative matrices.
2 Project the rows of H and the columns of W such that they meet the sparseness constraints σ_H and σ_W, respectively.
repeat
3   Set H ← H − µ_H W^T (WH − X).
4   Project the rows of H such that they meet the sparseness constraint σ_H.
5   Set W ← W − µ_W (WH − X) H^T.
6   Project the columns of W such that they meet the sparseness constraint σ_W.
until convergence;
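A minimal NumPy sketch of one pass of algorithm 2 may clarify the interplay of gradient steps and projections. It reuses the sparse_projection and sparseness sketches above; the helper project_to_sparseness, which converts a sparseness target into a 1-norm target at the current 2-norm, is our own construction and not taken from the paper.

```python
import numpy as np

def project_to_sparseness(v, sigma):
    """Closest non-negative vector with Hoyer sparseness sigma and the 2-norm of v
    (uses sparse_projection from the sketch above)."""
    n = v.size
    lam2 = np.linalg.norm(v)
    lam1 = lam2 * (np.sqrt(n) - sigma * (np.sqrt(n) - 1))
    return sparse_projection(v, lam1, lam2)

def snmf_step(X, W, H, sigma_W, sigma_H, mu_W=1e-3, mu_H=1e-3):
    """One pass of lines 3-6 of algorithm 2 (projected gradient descent sketch)."""
    H = H - mu_H * W.T @ (W @ H - X)                                 # line 3
    H = np.apply_along_axis(project_to_sparseness, 1, H, sigma_H)    # line 4: rows of H
    W = W - mu_W * (W @ H - X) @ H.T                                 # line 5
    W = np.apply_along_axis(project_to_sparseness, 0, W, sigma_W)    # line 6: columns of W
    return W, H
```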

3.2 Indeterm<strong>in</strong>acies<br />

Obvious indeterminacies of model (5) are permutation and positive scaling of the columns of W (and correspondingly of the rows of H), because if P denotes a permutation matrix and L a positive scaling matrix, then X = WH = (WP^{-1}L^{-1})(LPH), and the conditions of positivity and sparseness are invariant under scaling by a positive number. Another, maybe not as obvious, indeterminacy comes into play due to the sparseness assumption.

Definition 3.1. The n × T matrix H is said to be degenerate if there exist v ∈ ℝ^n, v > 0, and λ_t ≥ 0 such that H_{*t} = λ_t v for all t.

Note that in this case all rows h_i^T := H_{i*} of H have the same sparseness σ(h_i) = (√n − ‖λ‖_1/‖λ‖_2)/(√n − 1) independent of v, where λ := (λ_1, . . . , λ_T)^T. Furthermore, if W is any matrix with positive entries, then Wv > 0 and WH_{*t} = λ_t(Wv), so the signals H and their transformations WH have rows of equal sparseness.

Hence if the sources are degenerate we get an indeterminacy of sNMF: let W, W̃ be non-negative such that W̃^{-1}Wv > 0 (for example W > 0 arbitrary and W̃ := I), and let H be degenerate. Then H̃ := W̃^{-1}WH is of the same sparseness as H and WH = W̃H̃, but the mixing matrices W and W̃ do not coincide up to permutation and scaling.

3.3 Uniqueness<br />

In this section we will discuss the uniqueness of sNMF solutions, i.e. we will formulate conditions under which the set of solutions is satisfactorily small. We will see that in the perfect factorization case, it is enough to put the sparseness condition either onto W or onto H; we chose H in the following to match the picture of sources with a given sparseness.

Assume that two solutions (W, H) and (W̃, H̃) of the sNMF model (5) are given with W and W̃ of full rank; then

WH = W̃H̃, (6)

and σ(H) = σ(H̃). As before let h_i = H_{i*}^T and h̃_i = H̃_{i*}^T denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume

‖h_i‖_2 = ‖h̃_i‖_2 = 1 (7)

for all i. Hence, the sparseness of the rows is already fully determined by their 1-norms, and

‖h_i‖_1 = ‖h̃_i‖_1. (8)

We can then show the following lemma (even without positive mixing matrices).



Lemma 3.2. Let W, W̃ ∈ ℝ^{m×n} and H, H̃ ∈ ℝ^{n×T}, H, H̃ ≥ 0, such that equations (6–8) hold. Then for all i ∈ {1, . . . , m}

(i) ∑_j w_{ij} = ∑_j w̃_{ij}
(ii) ∑_j ...




Chapter 14<br />

Proc. ICASSP 2006<br />

Paper F.J. Theis and T. Tanaka. Sparseness by iterative projections onto spheres.<br />

In Proc. ICASSP 2006, Toulouse, France, 2006<br />

Reference (Theis and Tanaka, 2006)<br />

Summary <strong>in</strong> section 1.4.2<br />

191



SPARSENESS BY ITERATIVE PROJECTIONS ONTO SPHERES<br />

Fabian J. Theis ∗<br />

Inst. of Biophysics, Univ. of Regensburg<br />

93040 Regensburg, Germany<br />

ABSTRACT

Many interesting signals share the property of being sparsely active. The search for such sparse components within a data set commonly involves a linear or nonlinear projection step in order to fulfill the sparseness constraints. In addition to the proximity measure used for the projection, the result of course is also intimately connected with the actual definition of the sparseness criterion. In this work, we introduce a novel sparseness measure and apply it to the problem of finding a sparse projection of a given signal. Here, sparseness is defined as the fixed ratio of p- over 2-norm, and existence and uniqueness of the projection holds. This framework extends previous work by Hoyer in the case of p = 1, where it is easy to give a deterministic, more or less closed-form solution. This is not possible for p ≠ 1, so we introduce an algorithm based on alternating projections onto spheres (POSH), which is similar to the projection onto convex sets (POCS). Although the assumption of convexity does not hold in our setting, we observe not only convergence of the algorithm, but also convergence to the correct minimal-distance solution. Indications for a proof of this surprising property are given. Simulations confirm these results.

1. INTRODUCTION<br />

Sparseness is an important property of many natural signals, and various definitions exist. Intuitively, a signal x ∈ ℝ^n increases in sparseness with the increasing number of zeros; this is often measured by the 0-(pseudo-)norm ‖x‖_0 := |{i | x_i ≠ 0}|, counting the number of non-zero entries of x. It is a pseudo-norm because ‖αx‖_0 = |α|‖x‖_0 does not necessarily hold; indeed ‖αx‖_0 = ‖x‖_0 if α ≠ 0, so ‖·‖_0 is scale-invariant.

A typical problem in the field is the search for sparse instances or representations of a data set. Using the above 0-pseudo-norm as sparseness measure quickly turns out to be both theoretically and algorithmically unfeasible. The former simply follows because ‖·‖_0 is discrete, so the indeterminacies of the problem can be expected to be very high, and the latter because optimization of such a discrete function is a combinatorial problem and indeed turns out to be NP-complete. Hence, this sparseness measure is commonly approximated by some continuous measure, e.g. by replacing it by the p-norm ‖x‖_p := (∑_{i=1}^n |x_i|^p)^{1/p} for p ∈ ℝ_+. As lim_{p→0+} ‖x‖_p^p = ‖x‖_0, this can be interpreted as a possible approximation. This together with extensions to noisy situations can be used for measuring sparseness, and the connection with ‖·‖_0, especially in the case of p = 1, has been intensively studied [1].

Often, we are not <strong>in</strong>terested <strong>in</strong> the scale of the signals, so ideally<br />

the sparseness measure should be <strong>in</strong>dependent of the scale — which<br />

∗ Partial f<strong>in</strong>ancial support by the JSPS (PE 05543) and the DFG (GRK<br />

638) is acknowledged.<br />

Toshihisa Tanaka ∗<br />

Dept. EEE, Tokyo Univ. of Agri. and Tech.<br />

Tokyo 184-8588, Japan<br />

is the case for the 0-pseudo-norm, but not for the p-norms. In order to guarantee scaling invariance, some normalization has to be applied in the latter case, and a possible solution is the measure

σ_p(x) := ‖x‖_p / ‖x‖_2 (1)

for x ∈ ℝ^n \ {0} and p > 0. Then σ_p(αx) = σ_p(x) for α ≠ 0; moreover the sparser x, the smaller σ_p(x). Indeed, it can still be interpreted as an approximation of the 0-pseudo-norm in the sense that it is scale-invariant and that lim_{p→0+} σ_p(x)^p = ‖x‖_0. Altogether we infer that by minimizing σ_p(x) under some constraint, we can find a sparse representation of x. Hoyer [2] noticed this in the important case of p = 1; he defined a normalized sparseness measure by (√n − σ_1(x))/(√n − 1), which lies in [0, 1] and is maximal if x contains n − 1 zeros and minimal if the absolute values of all coefficients of x coincide.

Little attention has been paid to finding projections in the case of p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖·‖_0. Hence, the goal of this manuscript is to explore the general notion of sparseness in the sense of equation (1) and to construct algorithms to project a vector onto its closest vector of a given sparseness.

2. EUCLIDEAN PROJECTION

Let M ⊂ ℝ^n be an arbitrary, non-empty set. A vector y ∈ M is called Euclidean projection of x ∈ ℝ^n in M, in symbols y ⊳_M x, if ‖x − y‖_2 ≤ ‖x − z‖_2 for all z ∈ M.

2.1. Existence and uniqueness

We review conditions [3] for existence and uniqueness of the Euclidean projection. For this, we need the following notion: let X(M) := {x ∈ ℝ^n | there exists more than one point adjacent to x in M} = {x ∈ ℝ^n | #{y ∈ M | y ⊳_M x} > 1} denote the exception set of M.

Theorem 2.1 (Euclidean projection).
i. If M is closed and nonempty, then the Euclidean projection onto M exists, that is, for every x ∈ ℝ^n there exists a y ∈ M with y ⊳_M x.
ii. The Euclidean projection onto M is unique from almost all points in ℝ^n, i.e. vol(X(M)) = 0.

Proof. See [3], theorems 2.2 and 2.6.

So we can always project a vector x ∈ ℝ^n onto a closed set M, and this projection will be unique almost everywhere. In this case, we denote the projection by π_M(x), or π(x) for short. Indeed, in the case of the p-spheres S_p^{n−1}, the exception set consists of a single point, X(S_p^{n−1}) = {0} if p ≥ 2, hence π_{S_p^{n−1}} is well-defined on ℝ^n \ {0}. If p < 2, ...



[Fig. 1. Constrained gradient t on the p-sphere, given by the projection of the unconstrained gradient ∇‖· − x‖_2^2 onto the tangent space that is orthogonal to ∇‖·‖_p^p; see equation (6).]

2.2. Projection onto a p-sphere

Let S_p^{n−1} := {x ∈ ℝ^n | ‖x‖_p = 1} denote the (n−1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS_p^{n−1} := {x ∈ ℝ^n | ‖x‖_p = c}. The spheres are smooth C^1-submanifolds of ℝ^n for p ≥ 2; for p < 2 they are still well-defined for all p > 0, but have singular points at the coordinate hyperplanes.

Now, in the case p = 2, the projection is simply given by

π_{S_2^{n−1}}(x) = x/‖x‖_2. (2)

In the case p = 1, the sphere consists of a union of hyperplanes orthogonal to (±1, . . . , ±1). Considering only the first quadrant (i.e. x_i > 0), this means that π_{S_1^{n−1}}(x) is given by the projection onto the hyperplane H := {x ∈ ℝ^n | ⟨x, e⟩ = n^{−1/2}}, followed by setting resulting negative coordinates to 0; here e := n^{−1/2}(1, . . . , 1). So with x_+ := x if x ≥ 0 and 0 otherwise, we get

π_{S_1^{n−1}}(x) = ( x + (n^{−1/2} − ⟨x, e⟩)e )_+. (3)
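The two closed-form projections (2) and (3) translate directly into NumPy; this sketch (function names ours, not from the paper) assumes x in the first quadrant for the 1-sphere case, as in the text.

```python
import numpy as np

def project_sphere_2(x):
    """Equation (2): Euclidean projection onto the unit 2-sphere."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def project_sphere_1(x):
    """Equation (3): project onto the hyperplane <x, e> = n**-0.5 and clip negatives,
    valid for x in the first quadrant."""
    x = np.asarray(x, dtype=float)
    n = x.size
    e = np.full(n, n ** -0.5)
    return np.clip(x + (n ** -0.5 - x @ e) * e, 0.0, None)
```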

In the case of arbitrary p > 0, the projection is given by the unique solution of

π_{S_p^{n−1}}(x) = argmin_{y ∈ S_p^{n−1}} ‖x − y‖_2^2. (4)

Unfortunately, no closed-form solution exists in the general case, so we have to determine the solution numerically. We have experimented with a) explicit Lagrange multiplier calculation and minimization, b) constrained gradient descent and c) a constrained fixed-point algorithm (best). Ignoring the singular points at the coordinate hyperplanes, let us first assume that all x_i > 0. Then at a regular solution y of equation (4), the gradient of the function to be minimized is parallel to the gradient of the constraint, i.e. y − x = λ ∇(‖·‖_p^p)(y) for some Lagrange multiplier λ ∈ ℝ, which can be calculated from the additional constraint equation ‖y‖_p^p = 1. Using the notation y^{⊙p} := (y_1^p, . . . , y_n^p)^T for the componentwise exponentiation, we therefore get

y − x = λ p y^{⊙(p−1)} and ∑_i y_i^p = 1. (5)

Algorithm 1: Projection onto S_p^{n−1} by constrained gradient descent. Commonly, the iteration is stopped after the update stepsize lies below some given threshold.
Input: vector x ∈ ℝ^n, learning rate η(i)
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1 Initialize y ∈ S_p^{n−1} randomly.
for i ← 1, 2, . . . do
2   df ← y − x, dg ← p sgn(y)|y|^{⊙(p−1)}
3   t ← df − (df^T dg) dg/(dg^T dg)
4   y ← y − η(i) t
5   y ← y/‖y‖_p
end

For p ∉ {1, 2}, these equations cannot be solved in closed form, hence we propose an alternative approach to solving the constrained minimization (4). The goal is to minimize f(y) := ‖y − x‖_2^2 under the constraint g(y) := ‖y‖_p^p = 1. This can for example be achieved by gradient-descent methods, taking into account that the gradient has to be calculated on the submanifold given by the S_p^{n−1}-constraint, see figure 1 for an illustration. The projection of the gradient ∇f onto the tangent space of S_p^{n−1} at y can easily be calculated as

t = ∇f − ⟨∇f, ∇g⟩ ∇g/‖∇g‖_2^2. (6)

Here, the explicit gradients are given by ∇f(y) = y − x and ∇g(y) = p sgn(y)|y|^{⊙(p−1)}, where sgn(y) denotes the vector of the componentwise signs of y, and |y| := sgn(y) y the componentwise absolute value. The projection algorithm is summarized in algorithm 1. Iteratively, after calculating the constrained gradient (lines 2 and 3), it performs a gradient-descent update step (line 4) followed by a projection onto S_p^{n−1} (line 5) to guarantee that the algorithm stays on the submanifold.
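A compact NumPy transcription of algorithm 1 (our function and parameter names; a fixed learning rate and a random start away from the coordinate hyperplanes are assumed):

```python
import numpy as np

def project_p_sphere_gd(x, p, eta=0.1, n_iter=2000, tol=1e-8, seed=0):
    """Constrained gradient descent for argmin_{||y||_p = 1} ||y - x||_2^2 (algorithm 1)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    y = rng.random(x.size) + 0.1                    # line 1: random positive start,
    y /= np.linalg.norm(y, p)                       # normalized onto the p-sphere
    for _ in range(n_iter):
        df = y - x                                  # line 2: gradient of f
        dg = p * np.sign(y) * np.abs(y) ** (p - 1)  #         gradient of the constraint g
        t = df - (df @ dg) / (dg @ dg) * dg         # line 3: tangential part, equation (6)
        y_new = y - eta * t                         # line 4: descent step
        y_new /= np.linalg.norm(y_new, p)           # line 5: back onto the p-sphere
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```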

The method performs well; however, as most gradient-descent-based algorithms, without further optimization it takes quite a few iterations to achieve acceptable convergence, and the choice of an optimal learning rate η(i) is non-trivial. We therefore propose a second projection method employing a fixed-point optimization strategy. Its idea is based on the fact that at local minima y of f(y) on S_p^{n−1}, the gradient ∇f(y) is orthogonal to S_p^{n−1}, so ∇f(y) ∝ ∇g(y). Ignoring signs for illustrative purposes, this means that y − x ∝ p y^{⊙(p−1)}, so y can be calculated from the fixed-point iteration y ← λ p y^{⊙(p−1)} + x with additional normalization. Indeed, this can be equivalently derived from the previous Lagrange equations (5), also yielding equations for the proportionality factor λ: we can simply determine it from one component of equation (5), or, to increase numerical robustness, as the mean over the total set. Taking into account the signs of the gradient (which we ignored in equation (5)), this yields an estimate λ̂ := (1/n) ∑_{i=1}^n (y_i − x_i)/(p sgn(y_i)|y_i|^{p−1}). Altogether, we get the fixed-point algorithm 2, which in experiments turns out to have a considerably higher convergence rate than algorithm 1.

In table 1, we compare algorithms 1 and 2, namely with respect to the number of iterations they need to achieve convergence below some given threshold. As expected, the fixed-point algorithm outperforms gradient descent always except for the case of higher dimensions and p > 2 (non-sparse case). In the following we will therefore use the fixed-point algorithm for projection onto S_1^{n−1}.

2.3. Projection onto convex sets

If M is a convex set, then the Euclidean projection π_M(x) for any x ∈ ℝ^n is already unique, so X(M) = ∅, and the operator π_M is called



Algorithm 2: Projection onto S_p^{n−1} via fixed-point iteration. Again, the iteration is stopped after only sufficiently small updates are taken.
Input: vector x ∈ ℝ^n
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1 Initialize y ∈ S_p^{n−1} randomly.
for i ← 1, 2, . . . do
2   λ ← ∑_{i=1}^n (y_i − x_i)/(n sgn(y_i)|y_i|^{p−1})
3   y ← x + λ sgn(y)|y|^{⊙(p−1)}
4   y ← y/‖y‖_p
end
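The corresponding fixed-point iteration is equally short in NumPy; again the function name is ours, and zero coordinates of the iterate are assumed not to occur (as encouraged by the random positive start).

```python
import numpy as np

def project_p_sphere_fp(x, p, n_iter=200, tol=1e-8, seed=0):
    """Fixed-point projection onto the p-sphere (algorithm 2)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    y = rng.random(x.size) + 0.1
    y /= np.linalg.norm(y, p)                       # line 1: random start on the p-sphere
    n = x.size
    for _ in range(n_iter):
        grad = np.sign(y) * np.abs(y) ** (p - 1)
        lam = np.sum((y - x) / (n * grad))          # line 2: averaged multiplier estimate
        y_new = x + lam * grad                      # line 3: fixed-point step
        y_new /= np.linalg.norm(y_new, p)           # line 4: renormalize
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```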

Table 1. Comparison of the gradient- and fixed-point-based projection algorithms 1 and 2 for finding the Euclidean projection onto cS_p^{n−1} for varying parameters; the mean was taken over 100 iterations with x ∈ [−1, 1]^n sampled uniformly. Here #its_gd and #its_fp denote the numbers of iterations the algorithms took to achieve update steps of size smaller than ε = 0.0001, and ‖y_gd − y_fp‖ equals the norm of the difference of the found minima.

n    p    c     #its_gd       #its_fp       ‖y_gd − y_fp‖
2   0.9   1.2   6.7 ± 4.7     3.7 ± 1.0     0.0 ± 0.0
2   0.9   2.0   10.9 ± 6.9    4.1 ± 1.0     0.0 ± 0.0
2   2.2   0.9   13.0 ± 21.0   5.5 ± 4.2     0.0 ± 0.0
3   0.9   3     13.7 ± 6.9    4.4 ± 1.0     0.0 ± 0.0
3   2.2   0.9   9.6 ± 16.6    7.2 ± 10.2    0.0 ± 0.0
4   0.9   3     9.8 ± 6.8     4.4 ± 1.1     0.0 ± 0.0
4   2.2   0.9   6.0 ± 5.0     9.2 ± 8.1     0.0 ± 0.0

convex projector, see e.g. [3], lemma 2.4, and [4]. The theory of projection onto convex sets (POCS) [4, 5] is a well-known technique in signal processing; given N convex sets M_1, . . . , M_N ⊂ ℝ^n and an operator defined by π = π_{M_N} ··· π_{M_1}, POCS can be formulated as the recursion defined by y_{i+1} = π(y_i). It always approaches the intersection of the convex sets, that is, y_i → M* = ∩_{i=1}^N M_i.

Note that POCS only finds an arbitrary point in ∩_{i=1}^N M_i, which does not necessarily coincide with the Euclidean projection: for example, if M_1 := {x ∈ ℝ^n | ‖x‖_2 ≤ 1} is the unit disc and M_2 := {x ∈ ℝ^n | x_1 ≤ 0} the half-space of non-positive first coordinate, then the Euclidean projection of x := (1, 1, 0, . . . , 0) onto M_1 ∩ M_2 equals π_{M_1∩M_2}(x) = (0, 1, 0, . . . , 0), but application of POCS yields (0, 1/√2, 0, . . . , 0).

3. SPARSE PROJECTION

In this section, we combine the notions from the previous sections to search for sparse projections. Given a signal x ∈ ℝ^n, our goal is to find the closest signal y ∈ ℝ^n of fixed sparseness σ_p(y) = c. Hence, we search for y ∈ ℝ^n with

y = argmin_{σ_p(y)=c} ‖x − y‖_2. (7)

Due to the scale-invariance of σ_p, the problem (7) is equivalent to finding

y = argmin_{‖y‖_2=1, ‖y‖_p=c} ‖x − y‖_2. (8)

In other words, we are looking for the Euclidean projection y = π_M(x) onto M := S_2^{n−1} ∩ cS_p^{n−1}. Note that due to theorem 2.1, this solution to (8) exists if M ≠ ∅ and is almost always unique.

Algorithm 3: Projection onto spheres (POSH). In practice, some abort criterion has to be implemented. Often q = 2.
Input: vector x ∈ ℝ^n \ X(S_p^{n−1} ∩ S_q^{n−1}) and p, q > 0
Output: y = π_{S_p^{n−1} ∩ S_q^{n−1}}(x)
1 Set y ← x.
while y ∉ S_p^{n−1} ∩ S_q^{n−1} do
2   y ← π_{S_q^{n−1}}(π_{S_p^{n−1}}(y))
end
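A sketch of the POSH loop (our names; project_p_sphere stands for any of the p-sphere projections sketched above, e.g. the fixed-point algorithm 2, and q = 2 is assumed):

```python
import numpy as np

def posh(x, p, c, n_iter=100, tol=1e-8):
    """Alternating projections onto c*S_p and S_2 (algorithm 3 with q = 2)."""
    y = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        y = c * project_p_sphere(y / c, p)   # Euclidean projection onto the scaled p-sphere
        y = y / np.linalg.norm(y)            # Euclidean projection onto S_2, equation (2)
        if abs(np.linalg.norm(y, p) - c) < tol:
            break                            # y now lies (numerically) on both spheres
    return y
```

Projecting onto cS_p is done by rescaling: the Euclidean projection of y onto cS_p equals c times the projection of y/c onto S_p.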

3.1. Projection onto spheres (POSH)

In the special case of p = 1 and nonnegative x, Hoyer has proposed an efficient algorithm for finding the projection [2], simply by using the explicit formulas for the p-sphere projection; such formulas do not exist for p ≠ 1, 2, so a more general algorithm for this situation is proposed in the following.

Its idea is a direct generalization of POCS: we alternately project onto S_2^{n−1} and S_p^{n−1}, using the Euclidean projection operators from section 2.2. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denote this projection algorithm by projection onto spheres (POSH), see algorithm 3.

First note that POSH obviously has the desired solution as a fixed point. In experiments, we observe that POSH indeed converges, and moreover it converges to the closest solution, i.e. to the Euclidean projection (which does not hold for POCS in general)! Finally we see that in higher dimensions, all update vectors together with the starting point x lie in a single two-dimensional plane, so theoretically we can reduce proofs to two-dimensional cases as well as build algorithms using this fact.

In the following section, we will prove the above claims for the case of p = 1, where an explicit projection formula (3) is known. In the case of arbitrary p, so far we are only able to give experimental validation of the astonishing facts of convergence and convergence to the Euclidean projection.

3.2. Convergence

The proof needs the following simple convergence lemma, which somewhat extends a special case treated by the more general Banach fixed-point theorem.

Lemma 3.1. Let f : ℝ → ℝ be continuously differentiable with f(0) = 0, f′(0) > 1, and let f′ be positive and strictly decreasing. If f′(x) < 1 for some x > 0, then there exists a single positive fixed point x̂ of f, and f^i(x) converges to x̂ for i → ∞ and any x > 0.

Theorem 3.2. Let n ≥ 2, p > 0 and x ∈ ℝ^n \ X(M). If y^1 := π_{S_2^{n−1}}(x) and iteratively y^i := π_{S_2^{n−1}}(π_{S_1^{n−1}}(y^{i−1})) according to the POSH algorithm, then y^i converges to some y^∞ ∈ S_2^{n−1}, and y^∞ = π_M(x).

Using lemma 3.1, we can prove the convergence theorem in the case of p = 1, but omit the proof due to lack of space.

3.3. Simulations<br />

At first, we confirm the convergence results from theorem 3.2 for p = 1 by applying POSH with 100 iterations in 1000 runs to vectors x ∈ ℝ^6 sampled uniformly from [0, 1]^6; c was chosen to be sufficiently large (c = 2.4). We always get convergence. We also calculate the correct projection (using Hoyer's projection algorithm



[Fig. 2. Starting from x_0, we alternately project onto cS_1 and S_2. POSH performance is illustrated for p = 1 in dimensions 2 (a) and 3 (b), where a projection via PCA is displayed; no information is lost, hence the sequence of points lies in a plane, as shown in the proof of theorem 3.2. Panel (c) shows an application of POSH for n = 2 and p = 0.5.]

Table 2. Performance of the POSH algorithm 3 for varying parameters. See text for details.

n    p     c     ‖y_POSH − y_scan‖_2
2   0.8   1.2    0.005 ± 0.0008
2   4     0.9    0.02 ± 0.005
3   0.8   1.2    0.02 ± 0.009
3   4     0.9    0.04 ± 0.03

[2]). The distance between his and our solution was calculated to give a mean value of 5·10^{−13} ± 5·10^{−12}, i.e. we virtually always get the same solution.

In figures 2(a) and (b), we show the application for p = 1; we visualize the performance in 3 dimensions by projecting the data via PCA, which by the way throws away virtually no information (confirmed by experiment), indicating the validity of theorem 3.2 also in higher dimensions. In figure 2(c) a projection for p = 0.5 is shown.

Now, we perform batch simulations for varying p. For this, we uniformly sample the starting vector x ∈ [0, 1]^n in 100 runs, and compare the POSH algorithm result with the true projection. POSH is performed starting with the p-norm projection using algorithm 1 and 100 iterations. As the true projection π_M(x) cannot be determined in closed form, we scan [0, 1]^{n−1} using the stepsize ε = 0.01 to give the first (n−1) coordinates of our estimate y of π_M(x); its n-th coordinate is then constructed to guarantee y ∈ S_p^{n−1} (for p < 1 and p > 1, respectively). Using a Taylor approximation of (y + ε)^p, it can easily be shown that two adjacent grid points have maximal difference |‖(y_1+ε, . . . , y_n+ε)‖_p^p − ‖y‖_p^p| ≤ pnε + O(ε^2) if y ∈ [0, 1]^n and p ≥ 1. Hence by taking only vectors y as approximation of π_M(x) with |‖y‖_2^2 − 1| ...




Chapter 15<br />

Neurocomput<strong>in</strong>g, 69:1485-1501, 2006<br />

Paper P.Gruber, K.Stadlthanner, M.Böhm, F.J. Theis, E.W. Lang, A.M. Tomé,<br />

A.R. Teixeira, C.G. Puntonet, and J.M.Górriz Saéz. Denois<strong>in</strong>g us<strong>in</strong>g local<br />

projective subspace methods. Neurocomput<strong>in</strong>g, 69:1485-1501, 2006<br />

Reference (Gruber et al., 2006)<br />

Summary <strong>in</strong> section 1.5.2<br />

197



Denois<strong>in</strong>g Us<strong>in</strong>g Local Projective Subspace<br />

Methods<br />

P. Gruber, K. Stadlthanner, M. Böhm, F. J. Theis, E. W. Lang<br />


Institute of Biophysics, Neuro-and Bio<strong>in</strong>formatics Group<br />

University of Regensburg, 93040 Regensburg, Germany<br />

email: elmar.lang@biologie.uni-regensburg.de<br />

A. M. Tomé, A. R. Teixeira<br />

Dept. de Electrónica e Telecomunicações/IEETA<br />

Universidade de Aveiro, 3810 Aveiro, Portugal<br />

email: ana@ieeta.pt<br />

C. G. Puntonet, J. M. Gorriz Saéz<br />

Dep. Arqitectura y Técnologia de Computadores<br />

Universidad de Granada, 18371 Granada, Spa<strong>in</strong><br />

email: carlos@atc.ugr.es<br />

Abstract

In this paper we present denoising algorithms for enhancing noisy signals based on Local ICA (LICA), Delayed AMUSE (dAMUSE) and Kernel PCA (KPCA). The algorithm LICA relies on applying ICA locally to clusters of signals embedded in a high-dimensional feature space of delayed coordinates. The components resembling the signals can be detected by various criteria like estimators of kurtosis or the variance of autocorrelations, depending on the statistical nature of the signal. The algorithm proposed can be applied favorably to the problem of denoising multidimensional data. Another projective subspace denoising method using delayed coordinates has been proposed recently with the algorithm dAMUSE. It combines the solution of blind source separation problems with denoising efforts in an elegant way and proves to be very efficient and fast. Finally, KPCA represents a non-linear projective subspace method that is well suited for denoising also. Besides illustrative applications to toy examples and images, we provide an application of all algorithms considered to the analysis of protein NMR spectra.

Prepr<strong>in</strong>t submitted to Elsevier Science 4 February 2005



1 Introduction<br />

The <strong>in</strong>terpretation of recorded signals is often hampered by the presence of<br />

noise. This is especially true with biomedical signals which are buried <strong>in</strong> a<br />

large noise background most often. Statistical analysis tools like Pr<strong>in</strong>cipal<br />

<strong>Component</strong> <strong>Analysis</strong> (PCA), s<strong>in</strong>gular spectral analysis (SSA), <strong>Independent</strong><br />

<strong>Component</strong> <strong>Analysis</strong> (ICA) etc. quickly degrade if the signals exhibit a low<br />

Signal to Noise Ratio (SNR). Furthermore, due to their statistical nature, the application of such analysis tools can also lead to extracted signals with an even lower SNR than the original ones, as we will discuss below in the case of Nuclear Magnetic Resonance (NMR) spectra.

Hence <strong>in</strong> the signal process<strong>in</strong>g community many denois<strong>in</strong>g algorithms have<br />

been proposed [5,12,18,38] <strong>in</strong>clud<strong>in</strong>g algorithms based on local l<strong>in</strong>ear projective<br />

noise reduction. The idea is to project noisy signals <strong>in</strong> a high-dimensional<br />

space of delayed coord<strong>in</strong>ates, called feature space henceforth. A similar strategy<br />

is used <strong>in</strong> SSA [20], [9] where a matrix composed of the data and their<br />

delayed versions is considered. Then, a S<strong>in</strong>gular Value Decomposition (SVD)<br />

of the data matrix or a PCA of the related correlation matrix is computed.<br />

Noise contributions to the signals are then removed locally by project<strong>in</strong>g the<br />

data onto a subset of pr<strong>in</strong>cipal directions of the eigenvectors of the SVD or<br />

PCA analysis related with the determ<strong>in</strong>istic signals.<br />

Modern multi-dimensional NMR spectroscopy is a very versatile tool for the<br />

determ<strong>in</strong>ation of the native 3D structure of biomolecules <strong>in</strong> their natural aqueous<br />

environment [7,10]. Proton NMR is an <strong>in</strong>dispensable contribution to this<br />

structure determ<strong>in</strong>ation process but is hampered by the presence of the very<br />

<strong>in</strong>tense water (H2O) proton signal. The latter causes severe basel<strong>in</strong>e distortions<br />

and obscures weak signals ly<strong>in</strong>g under its skirts. It has been shown [26,29]<br />

that Bl<strong>in</strong>d Source Separation (BSS) techniques like ICA can contribute to the<br />

removal of the water artifact <strong>in</strong> proton NMR spectra.<br />

ICA techniques extract a set of signals out of a set of measured signals without<br />

know<strong>in</strong>g how the mix<strong>in</strong>g process is carried out [2, 13]. Consider<strong>in</strong>g that the<br />

set of measured spectra X is a l<strong>in</strong>ear comb<strong>in</strong>ation of a set of <strong>Independent</strong><br />

<strong>Component</strong>s (ICs) S, i.e. X = AS, the goal is to estimate the <strong>in</strong>verse of the<br />

mix<strong>in</strong>g matrix A, us<strong>in</strong>g only the measured spectra, and then compute the ICs.<br />

Then the spectra are reconstructed us<strong>in</strong>g the mix<strong>in</strong>g matrix A and those ICs<br />

conta<strong>in</strong>ed <strong>in</strong> S which are not related with the water artifact. Unfortunately<br />

the statistical separation process <strong>in</strong> practice <strong>in</strong>troduces additional noise not<br />

present <strong>in</strong> the orig<strong>in</strong>al spectra. Hence denois<strong>in</strong>g as a post-process<strong>in</strong>g of the<br />

artifact-free spectra is necessary to achieve the highest possible SNR of the<br />

reconstructed spectra. It is important that the denois<strong>in</strong>g does not change the<br />

spectral characteristics like <strong>in</strong>tegral peak <strong>in</strong>tensities as the deduction of the<br />


3D structure of the prote<strong>in</strong>s heavily relies on the latter.<br />

We propose two new approaches to this denois<strong>in</strong>g problem and compare the<br />

results to the established Kernel PCA (KPCA) denois<strong>in</strong>g [19, 25].<br />

The first approach (Local ICA (LICA)) concerns a local projective denois<strong>in</strong>g<br />

algorithm us<strong>in</strong>g ICA. Here it is assumed that the noise can, at least locally,<br />

be represented by a stationary Gaussian white noise. Signals usually come<br />

from a determ<strong>in</strong>istic or at least predictable source and can be described as<br />

a smooth function evaluated at discrete time steps small enough to capture<br />

the characteristics of the function. That implies, us<strong>in</strong>g a dynamical model for<br />

the data, that the signal embedded <strong>in</strong> delayed coord<strong>in</strong>ates resides with<strong>in</strong> a<br />

sub-manifold of the feature space spanned by these delayed coord<strong>in</strong>ates. With<br />

local projective denois<strong>in</strong>g techniques, the task is to detect this signal manifold.<br />

We will use LICA to detect the statistically most <strong>in</strong>terest<strong>in</strong>g submanifold. In<br />

the follow<strong>in</strong>g we call this manifold the signal+noise subspace s<strong>in</strong>ce it conta<strong>in</strong>s<br />

all of the signal plus that part of the noise components which lie <strong>in</strong> the same<br />

subspace. Parameter selection with<strong>in</strong> LICA will be effected with a M<strong>in</strong>imum<br />

Description Length (MDL) criterion [40], [6] which selects optimal parameters<br />

based on the data themselves.<br />

For the second approach we comb<strong>in</strong>e the ideas of solv<strong>in</strong>g BSS problems algebraically<br />

us<strong>in</strong>g a Generalized Eigenvector Decomposition (GEVD) [28] with<br />

local projective denois<strong>in</strong>g techniques. We propose, like <strong>in</strong> the Algorithm for<br />

Multiple Unknown Signals Extraction (AMUSE) [37], a GEVD of two correlation<br />

matrices i.e, the simultaneous diagonalization of a matrix pencil formed<br />

with a correlation matrix and a matrix of delayed correlations. These algorithms<br />

are exact and fast but sensitive to noise. There are several proposals<br />

to improve efficiency and robustness of these algorithms when noise is<br />

present [2, 8]. They mostly rely on an approximative jo<strong>in</strong>t diagonalization of<br />

a set of correlation or cumulant matrices like the algorithm Second Order<br />

Bl<strong>in</strong>d Identification (SOBI) [1]. The algorithm we propose, called Delayed<br />

AMUSE (dAMUSE) [33], computes a GEVD of the congruent matrix pencil<br />

<strong>in</strong> a high-dimensional feature space of delayed coord<strong>in</strong>ates. We show that the<br />

estimated signal components correspond to filtered versions of the underly<strong>in</strong>g<br />

uncorrelated source signals. We also present an algorithm to compute the<br />

eigenvector matrix of the pencil which <strong>in</strong>volves a two step procedure based on<br />

the standard Eigenvector Decomposition (EVD) approach. The advantage of<br />

this two step procedure is related with a dimension reduction between the two<br />

steps accord<strong>in</strong>g to a threshold criterion. Thereby estimated signal components<br />

related with noise only can be neglected thus perform<strong>in</strong>g a denois<strong>in</strong>g of the<br />

reconstructed signals.<br />

As a third denois<strong>in</strong>g method we consider KPCA based denois<strong>in</strong>g techniques<br />

[19,25] which have been shown to be very efficient outperform<strong>in</strong>g l<strong>in</strong>ear PCA.<br />


KPCA actually generalizes l<strong>in</strong>ear PCA which hitherto has been used for denois<strong>in</strong>g.<br />

PCA denois<strong>in</strong>g follows the idea that reta<strong>in</strong><strong>in</strong>g only the pr<strong>in</strong>cipal components<br />

with highest variance to reconstruct the decomposed signal, noise<br />

contributions which should correspond to the low variance components can<br />

deliberately be omitted hence reduc<strong>in</strong>g the noise contribution to the observed<br />

signal. KPCA extends this idea to non-l<strong>in</strong>ear signal decompositions. The idea<br />

is to project observed data non-l<strong>in</strong>early <strong>in</strong>to a high-dimensional feature space<br />

and then to perform l<strong>in</strong>ear PCA <strong>in</strong> feature space. The trick is that the whole<br />

formalism can be cast <strong>in</strong>to dot product form hence the latter can be replaced<br />

by suitable kernel functions to be evaluated <strong>in</strong> the lower dimensional <strong>in</strong>put<br />

space <strong>in</strong>stead of the high-dimensional feature space. Denois<strong>in</strong>g then amounts<br />

to estimat<strong>in</strong>g appropriate pre-images <strong>in</strong> <strong>in</strong>put space of the nonl<strong>in</strong>early transformed<br />

signals.<br />

The paper is organized as follows: Section 1 presents an <strong>in</strong>troduction and discusses<br />

some related work. In section 2 some general aspects about embedd<strong>in</strong>g<br />

and cluster<strong>in</strong>g are discussed, before <strong>in</strong> section 3 the new denois<strong>in</strong>g algorithms<br />

are discussed <strong>in</strong> detail. Section 4 presents some applications to toy as well as<br />

to real world examples and section 5 draws some conclusions.<br />

2 Feature Space Embedd<strong>in</strong>g<br />

In this section we <strong>in</strong>troduce new denois<strong>in</strong>g techniques and propose algorithms<br />

us<strong>in</strong>g them. At first we present the signal process<strong>in</strong>g tools we will use later<br />

on.<br />

2.1 Embedd<strong>in</strong>g us<strong>in</strong>g delayed coord<strong>in</strong>ates<br />

A common theme of all three algorithms presented is to embed the data <strong>in</strong>to<br />

a high dimensional feature space and try to solve the noise separation problem<br />

there. With the LICA and the dAMUSE we embed signals <strong>in</strong> delayed<br />

coord<strong>in</strong>ates and do all computations directly <strong>in</strong> the space of delayed coord<strong>in</strong>ates.<br />

The KPCA algorithm considers a non-l<strong>in</strong>ear projection of the signals<br />

to a feature space but performs all calculations <strong>in</strong> <strong>in</strong>put space us<strong>in</strong>g the kernel<br />

trick. It uses the space of delayed coord<strong>in</strong>ates only implicitly as <strong>in</strong>termediate<br />

step <strong>in</strong> the nonl<strong>in</strong>ear transformation s<strong>in</strong>ce for that transformation the signal<br />

at different time steps is used.<br />

Delayed coord<strong>in</strong>ates are an ideal tool for represent<strong>in</strong>g the signal <strong>in</strong>formation.<br />

For example <strong>in</strong> the context of chaotic dynamical systems, embedd<strong>in</strong>g an observable<br />

<strong>in</strong> delayed coord<strong>in</strong>ates of sufficient dimension already captures the<br />


full dynamical system [30]. There also exists a similar result <strong>in</strong> statistics for<br />

signals with a f<strong>in</strong>ite decay<strong>in</strong>g memory [24].<br />

Given a group of N sensor signals, x[l] = �<br />

x0[l], . . . , xN−1[l] � T sampled at<br />

time steps l = 0, . . . L − 1, a very convenient representation of the signals<br />

embedded <strong>in</strong> delayed coord<strong>in</strong>ates is to arrange them componentwise <strong>in</strong>to component<br />

trajectory matrices Xi, i = 0, . . . , N − 1 [20]. Hence embedd<strong>in</strong>g<br />

can be regarded as a mapp<strong>in</strong>g that transforms a one-dimensional time series<br />

xi = (xi[0], xi[1], . . . , xi[L − 1]) <strong>in</strong>to a multi-dimensional sequence of lagged<br />

vectors. Let M be an <strong>in</strong>teger (w<strong>in</strong>dow length) with M < L. The embedd<strong>in</strong>g<br />

procedure then forms L − M + 1 lagged vectors which constitute the columns<br />

of the component trajectory matrix. Hence given sensor signals x[l], registered<br />

for a set of L samples, their related component trajectory matrices are given<br />

by

Xi = ( xi[M−1]  xi[M]  . . .  xi[L−1] )

   = ⎡ xi[M−1]   xi[M]     · · ·   xi[L−1]  ⎤
     ⎢ xi[M−2]   xi[M−1]   · · ·   xi[L−2]  ⎥
     ⎢    ⋮         ⋮        ⋱        ⋮     ⎥
     ⎣ xi[0]     xi[1]     · · ·   xi[L−M]  ⎦          (1)

and encompass M delayed versions of each signal component xi[l − m], m =<br />

0, . . . , M −1 collected at time steps l = M −1, . . . , L−1. Note that a trajectory<br />

matrix has identical entries along each diagonal. The total trajectory matrix<br />

of the set X will be a concatenation of the component trajectory matrices Xi<br />

computed for each sensor, i.e.

X = ( X1, X2, . . . , XN )^T          (2)

Note that the embedded sensor signal is also formed by a concatenation of embedded component vectors, i.e. x[l] = [x0[l], . . . , xN−1[l]]. Also note that with

LICA we deal with s<strong>in</strong>gle column vectors of the trajectory matrix only, while<br />

with dAMUSE we consider the total trajectory matrix.<br />
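For concreteness, a small NumPy sketch of the embedding (our function names; X is assumed to be an N × L array of sensor signals):

```python
import numpy as np

def component_trajectory_matrix(xi, M):
    """Equation (1): M x (L - M + 1) trajectory matrix of one sensor signal xi;
    column l holds the lagged vector (xi[l], xi[l-1], ..., xi[l-M+1])."""
    L = xi.shape[0]
    return np.array([xi[M - 1 - m : L - m] for m in range(M)])

def total_trajectory_matrix(X, M):
    """Equation (2): stack the component trajectory matrices of all N sensors."""
    return np.concatenate([component_trajectory_matrix(xi, M) for xi in X], axis=0)

# tiny example: L = 6 samples, window M = 3 gives a 3 x 4 component trajectory matrix
x = np.arange(6.0)
print(component_trajectory_matrix(x, 3))
```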

2.2 Cluster<strong>in</strong>g<br />

In our context cluster<strong>in</strong>g of signals means rearrang<strong>in</strong>g the signal vectors, sampled<br />

at different time steps, by similarity. Hence for signals embedded <strong>in</strong> delayed<br />

coord<strong>in</strong>ates, the idea is to look for K disjo<strong>in</strong>t sub-trajectory matrices to<br />




group together similar column vectors of the trajectory matrix X.<br />

A cluster<strong>in</strong>g algorithm like k-means [15] is appropriate for problems where the<br />

time structure of the signal is irrelevant. If, however, time or spatial correlations<br />

matter, cluster<strong>in</strong>g should be based on f<strong>in</strong>d<strong>in</strong>g an appropriate partition<strong>in</strong>g<br />

of {M − 1, . . . , L − 1} <strong>in</strong>to K successive segments, s<strong>in</strong>ce this preserves the <strong>in</strong>herent<br />

correlation structure of the signals. In any case the number of columns<br />

<strong>in</strong> each sub-trajectory matrix X (j) amounts to Lj such that the follow<strong>in</strong>g<br />

completeness relation holds:<br />

K�<br />

Lj = L − M + 1 (3)<br />

j=1<br />

The mean vector mj <strong>in</strong> each cluster can be considered a prototype vector and<br />

is given by<br />

mj = 1<br />

Xcj = 1<br />

X (j) [1, . . . , 1] T , j = 1, . . . , K (4)<br />

Lj<br />

Lj<br />

where cj is a vector with Lj entries equal to one which characterizes the<br />

cluster<strong>in</strong>g. Note that after the cluster<strong>in</strong>g the set {k = 0, . . . , L − M − 1} of<br />

<strong>in</strong>dices of the columns of X is split <strong>in</strong> K disjo<strong>in</strong>t subsets Kj. Each trajectory<br />

sub-matrix X (j) is formed with those columns of the matrix X, the <strong>in</strong>dices of<br />

which belong to the subset Kj of <strong>in</strong>dices.<br />
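The two clustering variants and the cluster means of equation (4) admit a compact sketch; this is our own illustration (the helper names are hypothetical), using scikit-learn's k-means for the similarity based variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_columns_kmeans(X, K, seed=0):
    """Cluster the columns of the trajectory matrix X by similarity (k-means)."""
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X.T)

def cluster_columns_segments(X, K):
    """Partition the column indices into K successive segments, preserving time structure."""
    n_cols = X.shape[1]
    return np.minimum((np.arange(n_cols) * K) // n_cols, K - 1)

def cluster_means(X, labels, K):
    """Prototype vector m_j of every cluster, cf. equation (4)."""
    return np.column_stack([X[:, labels == j].mean(axis=1) for j in range(K)])

# toy usage with a random trajectory matrix (6 delayed channels, 98 lagged vectors)
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 98))
labels_km = cluster_columns_kmeans(X, K=4)
labels_seg = cluster_columns_segments(X, K=4)
print(cluster_means(X, labels_seg, K=4).shape)   # (6, 4)
```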

2.3 Principal Component Analysis and Independent Component Analysis

PCA [23] is one of the most common multivariate data analysis tools. It linearly transforms given data into uncorrelated data (feature space). Thus in PCA [4] a data vector is represented in an orthogonal basis system such that the projected data have maximal variance. PCA can be performed by an eigenvalue decomposition of the data covariance matrix: the orthogonal transformation is obtained by diagonalizing the centered covariance matrix of the data set.

In ICA, given a random vector, the goal is to find its statistically independent components (ICs). In contrast to correlation-based transformations like PCA, ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten and Hérault [11], while the term ICA was later coined by Comon [3]. With LICA we will use the popular FastICA algorithm by Hyvärinen and Oja [14], which performs ICA by maximizing the non-Gaussianity of the signal components.
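For concreteness, PCA via the eigendecomposition of the covariance matrix and ICA via scikit-learn's FastICA could be sketched as follows; this is our illustration under the assumption that the FastICA implementation of scikit-learn is an acceptable stand-in for the algorithm of [14], and the toy mixture is invented for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

def pca(X):
    """PCA of data X (variables in rows, samples in columns) via the covariance EVD."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = Xc @ Xc.T / Xc.shape[1]                 # covariance matrix
    evals, evecs = np.linalg.eigh(C)            # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]             # sort by decreasing variance
    return evals[order], evecs[:, order], evecs[:, order].T @ Xc

rng = np.random.default_rng(2)
S = np.vstack([np.sin(np.linspace(0, 20, 500)),
               np.sign(np.sin(np.linspace(0, 33, 500)))])          # toy sources
X = np.array([[1.0, 0.5], [0.4, 1.0]]) @ S + 0.05 * rng.standard_normal((2, 500))

variances, axes, scores = pca(X)                # uncorrelated projections
ica = FastICA(n_components=2, random_state=0)
components = ica.fit_transform(X.T).T           # rows = estimated independent components
print(variances, components.shape)
```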


3 Denoising Algorithms

3.1 Local ICA denoising

The LICA algorithm we present is based on a local projective denoising technique using an MDL criterion for parameter selection. The idea is to achieve denoising by locally projecting the embedded noisy signal into a lower dimensional subspace which contains the characteristics of the noise free signal. Finally the signal has to be reconstructed using the various candidates generated by the embedding.

Consider the situation where we have a signal x_i^0[l] at discrete time steps l = 0, . . . , L − 1 but only its noise corrupted version x_i[l] is measured,

x_i[l] = x_i^0[l] + e_i[l]            (5)

where the e_i[l] are samples of a random variable with Gaussian distribution, i.e. x_i equals x_i^0 up to additive stationary white noise.

3.1.1 Embedding and clustering

First the noisy signal x_i[l] is transformed into a high-dimensional signal x_i[l] in the M-dimensional space of delayed coordinates according to

x_i[l] := ( x_i[l], . . . , x_i[l − M + 1 mod L] )^T            (6)

which corresponds to a column of the trajectory matrix in equation 1.

To simplify implementation, we want to ensure that the delayed signal (trajectory matrix), like the original signal, is given at L time steps instead of L − M + 1. This can be achieved by using the samples in a round robin manner, i.e. by closing the end and the beginning of each delayed signal and cutting out exactly L components in accord with the delay. If the signal contains a trend or its statistical nature is significantly different at the end compared to the beginning, this leads to compatibility problems between the beginning and the end of the signal. We can easily resolve this misfit by replacing the signal with a version to which the signal is appended in reverse order, hence avoiding any sudden change in signal amplitude which would otherwise be smoothed out by the algorithm.
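The circular ("round robin") embedding of equation (6) and the mirror extension described above could be realized along the following lines; this is a sketch under our own conventions, not code from the paper.

```python
import numpy as np

def circular_delay_embedding(x, M):
    """Return an M x L matrix whose column l is (x[l], x[l-1 mod L], ..., x[l-M+1 mod L])^T,
    i.e. the delayed coordinates of equation (6) with wrap-around indexing."""
    L = len(x)
    idx = (np.arange(L)[None, :] - np.arange(M)[:, None]) % L
    return x[idx]

def mirror_extend(x):
    """Append the reversed signal so that the wrap-around join contains no amplitude jump."""
    return np.concatenate([x, x[::-1]])

x = np.sin(np.linspace(0, 3, 200)) + 0.1        # a signal with unequal start and end values
emb = circular_delay_embedding(mirror_extend(x), M=10)
print(emb.shape)                                 # (10, 400)
```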

The problem can now be localized by selecting K clusters in the feature space of delayed coordinates of the signal {x_i[l] | l = 0, . . . , L − 1}. Clustering can be achieved by a k-means cluster algorithm as explained in section 2.2. But k-means clustering is only appropriate if the variance or the kurtosis of a signal does not depend on the inherent signal structure. For other noise selection schemes, like choosing the noise components based on the variance of the autocorrelation, it is usually better to find an appropriate partitioning of the set of time steps {0, . . . , L − 1} into K successive segments, since this preserves the inherent time structure of the signals.


Note that the clustering does not change the data but only its time sequence, i.e. it permutes and regroups the columns of the trajectory matrix and separates it into K sub-matrices.

3.1.2 Decomposition and denoising

After centering, i.e. removing the mean in each cluster, we can analyze the M-dimensional signals in these K clusters using PCA or ICA. The PCA case (Local PCA (LPCA)) is studied in [39], so in the following we will propose an ICA based denoising.

Using ICA, we extract M ICs from each delayed signal. Like in all projection based denoising algorithms, noise reduction is achieved by projecting the signal into a lower dimensional subspace. We used two different criteria to estimate the number p of signal+noise components, i.e. the dimension of the signal subspace onto which we project after applying ICA.

• One criterion is a consistent MDL estimator p_MDL for the data model in equation 5 ([39]):

p_MDL = argmin_{p=0,...,M−1} MDL(M, L, p, (λ_j), γ)            (7)

       = argmin_{p=0,...,M−1} { −(M − p) L ln[ (∏_{j=p+1}^{M} λ_j)^{1/(M−p)} / ( (1/(M−p)) ∑_{j=p+1}^{M} λ_j ) ]
                                 + (pM − p²/2 + p/2 + 1)(1/2 + ln γ)
                                 + (pM − p²/2 + p/2 + 1)/2 · ln L + (1/2) ( ∑_{j=1}^{p} ln λ_j − ∑_{j=1}^{M−1} ln λ_j ) }

where the λ_j denote the variances of the signal components in feature space, i.e. after applying the de-mixing matrix which we estimate with the ICA algorithm. To retain the relative strength of the components in the mixture, we normalize the rows of the de-mixing matrix to unit norm. The variances are ordered such that the smallest eigenvalues λ_j correspond to directions in feature space most likely to be associated with noise components only.

The first term in the MDL estimator represents the likelihood of the M − p Gaussian white noise components. The third term stems from the estimation of the description length of the signal part (the first p components) of the mixture based on their variances. The second term acts as a penalty term to favor parsimonious representations of the data for short time series; it becomes insignificant in the limit L → ∞ since it does not depend on L while the other two terms grow without bounds. The parameter γ controls this behavior and is a parameter of the MDL estimator, hence of the final denoising algorithm. By experience, good values for γ seem to be 32 or 64.

• Based on the observations reported in [17] and our own observation that, in some situations, the MDL estimator tends to significantly underestimate the number of noise components, we also used another approach: we clustered the variances of the signal components into two clusters using k-means clustering and defined p_cl as the number of elements in the cluster which contains the largest eigenvalue. This yields a good estimate of the number of signal components if the noise variances are not clustered well enough together but, nevertheless, are separated from the signal by a large gap. More details and simulations corroborating our observations can be found in section 4.1.1.
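This second, cluster based criterion admits a compact sketch (our own formulation): split the feature-space variances into two groups with k-means and count the members of the group containing the largest variance.

```python
import numpy as np
from sklearn.cluster import KMeans

def p_cluster(variances):
    """Estimate the signal subspace dimension p_cl by 2-means clustering of the variances."""
    lam = np.sort(np.asarray(variances))[::-1]                 # descending order
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lam.reshape(-1, 1))
    signal_label = labels[0]                                   # cluster containing the largest variance
    return int(np.sum(labels == signal_label))

# toy example: three strong "signal" variances well separated from the noise floor
print(p_cluster([9.5, 8.1, 6.7, 0.4, 0.35, 0.3, 0.28, 0.25]))   # typically 3
```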

Depending on the statistical nature of the data, the ordering of the components in the MDL estimator can be achieved using different methods. For data with a non-Gaussian distribution, we select the noise component as the component with the smallest value of the kurtosis, as Gaussian noise corresponds to a vanishing kurtosis. For non-stationary data with stationary noise, we identify the noise by the smallest variance of its autocorrelation.

3.1.3 Reconstruction

In each cluster the centering is reversed by adding back the cluster mean. To reconstruct the noise reduced signal, we first have to reverse the clustering of the data to yield the signal x_i^e[l] ∈ R^M by concatenating the trajectory sub-matrices and reversing the permutation done during clustering. The resulting trajectory matrix does not possess identical entries in each diagonal. Hence we average over the candidates in the delayed data, i.e. over all entries in each diagonal:

x_i^e[l] := (1/M) ∑_{j=0}^{M−1} x_i^e[l + j mod L]_j            (8)

where x_i^e[l]_j stands for the j-th component of the enhanced vector x_i^e at time step l. Note that the summation is done over the diagonals of the trajectory matrix, so it would yield x_i if performed on the original delayed signal x_i.
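Equation (8) amounts to averaging, for every time step, over all delayed copies that refer to it; a minimal sketch (ours, using the same circular indexing as in equation (6)) could read:

```python
import numpy as np

def diagonal_average(emb):
    """Undo a circular delay embedding by averaging all candidates of each sample (equation (8)).

    emb has shape (M, L); entry emb[j, l] is the j-th delayed copy, an estimate of x[(l - j) mod L].
    """
    M, L = emb.shape
    est = np.zeros(L)
    for j in range(M):
        est += np.roll(emb[j], -j)      # shift row j so that it aligns with the undelayed signal
    return est / M

# sanity check: embedding followed by diagonal averaging reproduces the signal exactly
x = np.sin(np.linspace(0, 6, 100))
idx = (np.arange(100)[None, :] - np.arange(8)[:, None]) % 100
print(np.allclose(diagonal_average(x[idx]), x))    # True
```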


3.1.4 Parameter estimation

We still have to find optimal values for the global parameters M and K. Their selection can again be based on an MDL criterion for the detected noise e := x − x^e. Accordingly we apply the LICA algorithm for different M and K, embed each of the error signals e(M, K) in delayed coordinates of a fixed, large enough dimension M̂, and choose the parameters M_0 and K_0 such that the MDL criterion estimating the noisiness of the error signal is minimal. The MDL criterion is evaluated with respect to the eigenvalues λ_j(M, K) of the correlation matrix of e(M, K) such that

(M_0, K_0) = argmin_{M,K} MDL( M̂, L, 0, (λ_j(M, K)), γ )            (9)
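The grid search over (M, K) can be sketched as follows. This is only an illustration: `moving_average` is a placeholder for the actual LICA denoiser, and instead of the full MDL expression of equation (7) we score the residual with the log ratio of the arithmetic to the geometric mean of the eigenvalues of its embedded correlation matrix, which is the dominant, L-dependent ingredient of the criterion; this simplification is our assumption, not the paper's exact choice.

```python
import numpy as np

def whiteness_score(e, M_hat=20):
    """Noisiness proxy for the residual e: embed in M_hat delayed coordinates and
    compare arithmetic and geometric means of the correlation eigenvalues
    (white noise gives equal eigenvalues, hence a score close to 0)."""
    L = len(e)
    idx = (np.arange(L)[None, :] - np.arange(M_hat)[:, None]) % L
    E = e[idx]
    lam = np.clip(np.linalg.eigvalsh(E @ E.T / L), 1e-12, None)
    return np.log(np.mean(lam)) - np.mean(np.log(lam))

def moving_average(x, M):        # placeholder denoiser, NOT the LICA algorithm itself
    return np.convolve(x, np.ones(M) / M, mode="same")

def select_parameters(x, Ms, Ks, denoise):
    """Pick (M0, K0) whose residual x - x_e looks most like white noise, cf. equation (9)."""
    best = None
    for M in Ms:
        for K in Ks:
            score = whiteness_score(x - denoise(x, M, K))
            if best is None or score < best[0]:
                best = (score, M, K)
    return best[1], best[2]

x = np.sin(np.linspace(0, 20, 1000)) + 0.3 * np.random.default_rng(3).standard_normal(1000)
print(select_parameters(x, Ms=[5, 10, 20], Ks=[1], denoise=lambda s, M, K: moving_average(s, M)))
```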

3.2 Denoising using Delayed AMUSE

Signals with an inherent correlation structure like time series data can as well be analyzed using second-order blind source separation techniques only [22,34]. A GEVD of a matrix pencil [36,37] or a joint approximative diagonalization of a set of correlation matrices [1] is then usually considered. Recently we proposed an algorithm based on a generalized eigenvalue decomposition in a feature space of delayed coordinates [34]. It provides means for BSS and denoising simultaneously.

3.2.1 Embedding

Assuming that each sensor signal is a linear combination X = A S of N underlying but unknown source signals s_i, a source signal trajectory matrix S can be written in analogy to equation 1 and equation 2. Then the mixing matrix A is a block matrix with a diagonal matrix in each block:

A = [ a_11 I_{M×M}   a_12 I_{M×M}   · · ·   a_1N I_{M×M}
      a_21 I_{M×M}   a_22 I_{M×M}   · · ·   a_2N I_{M×M}
         ...             ...         ...        ...
      a_N1 I_{M×M}   a_N2 I_{M×M}   · · ·   a_NN I_{M×M} ]            (10)

The matrix I_{M×M} represents the identity matrix, and in accord with an instantaneous mixing model the mixing coefficient a_ij relates the sensor signal x_i with the source signal s_j.


3.2.2 Generalized Eigenvector Decomposition

The delayed correlation matrices of the matrix pencil are computed with one matrix X_r, obtained by eliminating the first k_i columns of X, and another matrix, X_l, obtained by eliminating the last k_i columns. Then the delayed correlation matrix R_x(k_i) = X_r X_l^T will be an NM × NM matrix. Each of these two matrices can be related with a corresponding matrix in the source signal domain:

R_x(k_i) = A R_s(k_i) A^T = A S_r S_l^T A^T            (11)

Then the two pairs of matrices (R_x(k_1), R_x(k_2)) and (R_s(k_1), R_s(k_2)) represent congruent pencils [32] with the following properties:

• Their eigenvalues are the same, i.e., the eigenvalue matrices of both pencils are identical: D_x = D_s.
• If the eigenvalues are non-degenerate (distinct values in the diagonal of the matrix D_x = D_s), the corresponding eigenvectors are related by the transformation E_s = A^T E_x.

Assuming that all sources are uncorrelated, the matrices R_s(k_i) are block diagonal, having block matrices R_mm(k_i) = S_ri S_li^T along the diagonal. The eigenvector matrix of the GEVD of the pencil (R_s(k_1), R_s(k_2)) again forms a block diagonal matrix with block matrices E_mm forming the M × M eigenvector matrices of the GEVD of the pencils (R_mm(k_1), R_mm(k_2)). The uncorrelated components can then be estimated from linearly transformed sensor signals via

Y = E_x^T X = E_x^T A S = E_s^T S            (12)

hence they turn out to be filtered versions of the underlying source signals. As the eigenvector matrix E_s is a block diagonal matrix, there are M signals in each column of Y which are a linear combination of one of the source signals and its delayed versions. The columns of the matrix E_mm then represent impulse responses of Finite Impulse Response (FIR) filters. Considering that all columns of E_mm are different, their frequency responses might provide different spectral densities of the source signal spectra. Then the NM output signals y encompass M filtered versions of each of the N estimated source signals.

3.2.3 Implementation of the GEVD

There are several ways to compute the generalized eigenvalue decomposition. We resume a procedure valid if one of the matrices of the pencil is symmetric positive definite. Thus, we consider the pencil (R_x(0), R_x(k_2)) and perform the following steps:

Step 1: Compute a standard eigenvalue decomposition of R_x(0) = V Λ V^T, i.e. compute the eigenvectors v_i and eigenvalues λ_i. As the matrix is symmetric positive definite, the eigenvalues can be arranged in descending order (λ_1 > λ_2 > · · · > λ_NM). This procedure corresponds to the usual whitening step in many ICA algorithms. It can be used to estimate the number of sources, but it can also be considered a strategy to reduce noise much like with PCA denoising. Dropping small eigenvalues amounts to a projection from a high-dimensional feature space onto a lower dimensional manifold representing the signal+noise subspace. Thereby it is tacitly assumed that small eigenvalues are related with noise components only. Here we consider a variance criterion to choose the most significant eigenvalues, those related with the embedded deterministic signal, according to

( λ_1 + λ_2 + . . . + λ_l ) / ( λ_1 + λ_2 + . . . + λ_NM ) ≥ TH            (13)

If we are interested in the eigenvectors corresponding to directions of high variance of the signals, the threshold TH should be chosen such that their maximum energy is preserved. Similar to the whitening phase in many BSS algorithms, the data matrix X can be transformed using

Q = Λ^{−1/2} V^T            (14)

to calculate a transformed matrix of delayed correlations C(k_2) to be used in the next step. The transformation matrix can be computed using either the l most significant eigenvalues, in which case denoising is achieved, or all eigenvalues and respective eigenvectors. Also note that Q represents an l × NM matrix if denoising is considered.

Step 2: Use the transformed delayed correlation matrix C(k_2) = Q R_x(k_2) Q^T and its standard eigenvalue decomposition: the eigenvector matrix U and eigenvalue matrix D_x.

The eigenvectors of the pencil (R_x(0), R_x(k_2)), which are not normalized, form the columns of the eigenvector matrix E_x = Q^T U = V Λ^{−1/2} U. The ICs of the delayed sensor signals can then be estimated via the transformation given below, yielding l (or NM) signals, one signal per row of Y:

Y = E_x^T X = U^T Q X = U^T Λ^{−1/2} V^T X            (15)

The first step of this algorithm is therefore equivalent to a PCA in a high-dimensional feature space [9,38], where a matrix similar to Q is used to project the data onto the signal manifold.
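The two steps can be condensed into a few lines of NumPy; the following is our sketch of equations (13) to (15), with symmetrization of the delayed correlation matrix added as a numerical convenience and arbitrary toy data.

```python
import numpy as np

def damuse_gevd(X, k2=1, TH=0.95):
    """Two-step GEVD of the pencil (R_x(0), R_x(k2)) for an embedded data matrix X (rows = delayed channels).

    Step 1: whiten with the EVD of R_x(0), keeping enough eigenvalues to reach the variance threshold TH.
    Step 2: EVD of the transformed delayed correlation matrix C(k2); return Y = U^T Q X.
    """
    R0 = X @ X.T                                    # zero-lag correlation matrix R_x(0)
    Xr, Xl = X[:, k2:], X[:, :-k2]                  # drop the first / last k2 columns
    Rk = Xr @ Xl.T                                  # delayed correlation matrix R_x(k2)
    Rk = 0.5 * (Rk + Rk.T)                          # symmetrize (our choice, for a clean EVD)

    lam, V = np.linalg.eigh(R0)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    l = int(np.searchsorted(np.cumsum(lam) / lam.sum(), TH) + 1)   # threshold criterion (13)

    Q = np.diag(lam[:l] ** -0.5) @ V[:, :l].T       # whitening / projection matrix, equation (14)
    C = Q @ Rk @ Q.T
    _, U = np.linalg.eigh(0.5 * (C + C.T))
    return U.T @ Q @ X                              # output signals Y, equation (15)

rng = np.random.default_rng(4)
signal = np.vstack([np.sin(np.linspace(0, 50, 500)),
                    np.cos(np.linspace(0, 20, 500)),
                    np.sign(np.sin(np.linspace(0, 8, 500)))])
X = signal + 0.05 * rng.standard_normal(signal.shape)
print(damuse_gevd(X, k2=2, TH=0.9).shape)
```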


3.3 Kernel PCA based denoising

Kernel PCA has been developed by [19], hence we give here only a short summary for convenience. PCA only extracts linear features, though with suitable nonlinear features more information could be extracted. It has been shown [19] that KPCA is well suited to extract interesting nonlinear features in the data. KPCA first maps the data x_i into some high-dimensional feature space Ω through a nonlinear mapping Φ : R^n → R^m, m > n, and then performs linear PCA on the mapped data in the feature space Ω. Assuming centered data in feature space, i.e. ∑_{k=1}^{l} Φ(x_k) = 0, performing PCA in the space Ω amounts to finding the eigenvalues λ > 0 and eigenvectors ω ∈ Ω of the correlation matrix R̄ = (1/l) ∑_{j=1}^{l} Φ(x_j) Φ(x_j)^T.

Note that all ω with λ ≠ 0 lie in the subspace spanned by the vectors Φ(x_1), . . . , Φ(x_l). Hence the eigenvectors can be represented via

ω = ∑_{i=1}^{l} α_i Φ(x_i)            (16)

Multiplying the eigenequation with Φ(x_k) from the left, the following modified eigenequation is obtained

K α = l λ α            (17)

with λ > 0. The eigenequation is now cast in the form of dot products occurring in feature space through the l × l matrix K with elements K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j), which are represented by kernel functions k(x_i, x_j) to be evaluated in the input space. For feature extraction any suitable kernel can be used, and knowledge of the nonlinear function Φ(x) is not needed. Note that the latter can always be reconstructed from the principal components obtained. The image of a data vector under the map Φ can be reconstructed from its projections β_k via

P̂_n Φ(x) = ∑_{k=1}^{n} β_k ω_k = ∑_{k=1}^{n} (ω_k · Φ(x)) ω_k            (18)

which defines the projection operator P̂_n. In denoising applications, n is deliberately chosen such that the squared reconstruction error

e²_rec = ∑_{i=1}^{l} ‖ P̂_n Φ(x_i) − Φ(x_i) ‖²            (19)

is minimized. To find a corresponding approximate representation of the data in input space, the so called pre-image, it is necessary to estimate a vector z ∈ R^N in input space such that

ρ(z) = ‖ P̂_n Φ(x) − Φ(z) ‖² = k(z, z) − 2 ∑_{k=1}^{n} β_k ∑_{i=1}^{l} α_i^k k(x_i, z)  (up to terms independent of z)            (20)

is minimized. Note that an analytic solution to the pre-image problem has been given recently in the case of invertible kernels [16]. In denoising applications it is hoped that the deliberately neglected dimensions of minor variance mostly contain noise and that z represents a denoised version of x. Equation (20) can be minimized via gradient descent techniques.
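A minimal sketch of KPCA denoising with a Gaussian kernel and a gradient descent pre-image, written by us for illustration, follows; the centering of the data in feature space assumed in the text is omitted for brevity, the step size and the toy data are arbitrary, and the kernel width follows the pairwise-distance heuristic used later in section 4.3.2.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma2))

def kpca_denoise(X, x, n_comp, sigma2, steps=200, eta=0.1):
    """Project x onto the first n_comp kernel principal axes of the sample set X
    and minimize rho(z) of equation (20) by gradient descent."""
    K = gaussian_kernel_matrix(X, sigma2)
    lam, A = np.linalg.eigh(K)                                   # K alpha = l*lambda*alpha, cf. (17)
    order = np.argsort(lam)[::-1][:n_comp]
    A = A[:, order] / np.sqrt(np.clip(lam[order], 1e-12, None))  # normalize so that ||omega_k|| = 1

    sq = np.sum(X**2, axis=1)
    kx = np.exp(-(sq - 2 * X @ x + x @ x) / (2 * sigma2))        # k(x_i, x)
    beta = A.T @ kx                                              # projections beta_k
    gamma = A @ beta                                             # gamma_i = sum_k beta_k alpha_i^k

    z = x.copy()                                                 # start the descent at the noisy point
    for _ in range(steps):
        kz = np.exp(-(sq - 2 * X @ z + z @ z) / (2 * sigma2))
        grad = -2 * ((gamma * kz) @ (X - z)) / sigma2            # gradient of rho(z), Gaussian kernel
        z = z - eta * grad
    return z

rng = np.random.default_rng(5)
t = np.linspace(0, 2 * np.pi, 50)
clean = np.array([np.sin(3 * t + phi) for phi in rng.uniform(0, 2 * np.pi, 200)])
noisy = clean[0] + 0.3 * rng.standard_normal(50)
sigma2 = np.mean(np.sum((clean[:50, None, :] - clean[None, :50, :])**2, axis=-1)) / 2
print(np.linalg.norm(kpca_denoise(clean, noisy, n_comp=8, sigma2=sigma2) - clean[0]))
```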

4 Applications and simulations

In this section we will first present results and a concomitant interpretation of some experiments with toy data using different variations of the LICA denoising algorithm. Next we also present some test simulations of the algorithm dAMUSE using toy data. Finally we will discuss the results of applying the three different denoising algorithms presented above to a real world problem, i.e. to enhance protein NMR spectra contaminated with a huge water artifact.

4.1 Denoising with Local ICA applied to toy examples

We will present some sample experimental results using artificially generated signals and random noise. As the latter is characterized by a vanishing kurtosis, the LICA based denoising algorithm uses the component kurtosis for noise selection.

4.1.1 Discussion of an MDL based subspace selection

In the LICA denoising algorithm the MDL criterion is also used to select the number of noise components in each cluster. This works without prior knowledge of the noise strength. Since the estimation is based solely on statistical properties, however, it produces suboptimal results in some cases. In figure 1 we compare, for an artificial signal with known additive white Gaussian noise, the denoising achieved with the MDL based estimation of the subspace dimension versus an estimation based on the noise level. The latter is done using a threshold on the variances of the components in feature space such that only the signal part is conserved. Figure 1 shows that the threshold criterion works slightly better in this case, though the MDL based selection can obtain a comparable level of denoising. However, the smaller SNR indicates that the MDL criterion favors some over-modelling of the signal subspace, i.e. it tends to underestimate the number of noise components in the registered signals. In [17] the conditions which lead to a strong over-modelling, such as the noise not being completely white, are identified. Over-modelling also happens frequently if the eigenvalues of the covariance matrix related with noise components are not sufficiently close together and are not separated from the signal components by a gap. In those cases a clustering criterion for the eigenvalues seems to yield better results, but it is not as generic as the MDL criterion.

Fig. 1. Comparison between MDL based and threshold based denoising of an artificial signal with known SNR = 0 dB (panels: original signal, noisy signal, MDL based local ICA, threshold based local ICA). The feature space dimension was M = 40 and the number of clusters was K = 35. The MDL criterion achieved an SNR of 8.9 dB and the threshold criterion an SNR of 10.5 dB.

4.1.2 Comparisons between LICA and LPCA

Consider the artificial signal shown in figure 1 with varying additive Gaussian white noise. We apply the LICA denoising algorithm using either an MDL criterion or a threshold criterion for parameter selection. The results are depicted in figure 2.

The first and second diagram of figure 2 compare the performance, here the enhancement of the SNR and the mean square error, of LPCA and LICA depending on the input SNR. Note that a source SNR of 0 dB describes a case where signal and noise have the same strength, while negative values indicate situations where the signal is buried in the noise. The third graph shows the difference in kurtosis between the recovered signal and the source signal in dependence on the input SNR. All three diagrams were generated with the same data set, i.e. the same signal and, for a given input SNR, the same additive noise.

These results suggest that a LICA approach is more effective when the signal is infested with a large amount of noise, whereas LPCA seems better suited for signals with high SNRs. This might be due to the nature of our selection of subspaces based on the kurtosis or on the variance of the autocorrelation, as the comparison of higher statistical moments of the restored data, like the kurtosis, indicates that noise reduction can be enhanced by using a LICA approach.


Fig. 2. Comparison between LPCA and LICA based denoising (panels from left to right: SNR enhancement versus source SNR, mean square error between original and recovered signal versus the error between original and noisy signal, and kurtosis error |kurt(s^e) − kurt(s)| versus source SNR; each panel shows both the LICA and the LPCA result). Here the mean square error of two signals x, y with L samples is (1/L) ∑_i ‖x_i − y_i‖². For all noise levels a complete parameter estimation was done in the sets {10, 15, . . . , 60} for M and {20, 30, . . . , 80} for K.

4.1.3 LICA denoising with multi-dimensional data sets

A generalization of the LICA algorithm to multidimensional data sets, like images where pixel intensities depend on two coordinates, is desirable. A simple generalization would be to look at delayed coordinates of vectors instead of scalars. However, this appears impractical due to the prohibitive computational effort. More importantly, this direct approach reduces the number of available samples significantly. This leads to far less accurate estimators of important aspects like the MDL estimation of the dimension of the signal subspace or the estimation of the kurtosis criterion in the LICA case.

Another approach could be to convert the data to a 1D string by choosing some path through the data and concatenating the pixel intensities accordingly. But this can easily create unwanted artifacts along the chosen path. Further, local correlations are broken up, hence not all the available information is used.

But a more sophisticated and, depending on the nature of the signal, very effective alternative approach can be envisaged. Instead of converting the multidimensional data into 1D data strings prior to applying the LICA algorithm, we can use a modified delay transformation using shifts along all available dimensions. This concept is similar to the multidimensional auto-covariances used in the Multi Dimensional SOBI (mdSOBI) algorithm introduced in [31]. In the 2D case, for example, consider an n × n image represented by a matrix P = (a_ij), i, j = 1, . . . , n. Then the transformed data set consists of copies of P which are shifted either along columns or rows or both. For instance, a translation

a_ij → a_{i−1,j+1}, (i, j = 1, . . . , n), yields the following transformed image:

P^{−1,1} = [ a_{n,2}     . . .   a_{n,n}     a_{n,1}
             a_{1,2}     . . .   a_{1,n}     a_{1,1}
               ...                  ...         ...
             a_{n−1,2}   . . .   a_{n−1,n}   a_{n−1,1} ]            (21)

Then instead of choosing a single delay dimension, we choose a delay radius M and use all P^ν with ‖ν‖ < M as delayed versions of the original signal. The remainder of the LICA based denoising algorithm works exactly as in the case of a 1D time series.

Fig. 3. Comparison of LPCA and LICA based denoising of an image infested with Gaussian noise (panels: noisy image, Local PCA, Local ICA, Local ICA + PCA). Note the improvement in denoising power if both are applied consecutively (Local PCA SNR = 8.8 dB, LICA SNR = 10.6 dB, LPCA and LICA consecutively SNR = 12.6 dB). All images were denoised using a fixed number of clusters K = 20 and a delay radius of M = 4, which results in a 49-dimensional feature space.

In figure 3 we compare this approach, using the MDL criterion to select the number of components, for LPCA and LICA. In addition we see that the algorithm also works favorably if applied multiple times.
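The shifted-copy construction of equation (21) for a delay radius M can be sketched as follows; this illustration is ours, and it reads the radius condition as a maximum-norm bound, which reproduces the 49-dimensional feature space quoted for M = 4 in the caption of figure 3.

```python
import numpy as np

def shifted_copies(P, M):
    """Return all cyclic shifts P^nu of the image P with max(|nu_1|, |nu_2|) < M,
    stacked as rows of a feature matrix (one feature vector per shift)."""
    shifts = [(dr, dc) for dr in range(-M + 1, M) for dc in range(-M + 1, M)]
    return np.array([np.roll(P, shift=(dr, dc), axis=(0, 1)).ravel() for dr, dc in shifts])

P = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = shifted_copies(P, M=2)                      # (2*2 - 1)^2 = 9 shifted versions
print(F.shape)                                  # (9, 16)
```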


4.2 Denoising with dAMUSE applied to toy examples

A group of three artificial source signals with different frequency contents was chosen: one member of the group represents a narrow-band signal, a sinusoid; the second signal encompasses a wide frequency range; and the last one represents a sawtooth wave whose spectral density is concentrated in the low frequency band (see figure 4).

Fig. 4. Artificial signals (left column, plotted over the sample index n) and their frequency contents (right column, frequency axis in units of π).

The simulations were designed to illustrate the method and to study the influence of the threshold parameter TH on the performance when noise is added at different levels. Concerning noise, we also try to find out whether there is any advantage in using a GEVD instead of a PCA analysis. Hence the signals at the output of the first step of the algorithm (using the matrix Q to project the data) are also compared with the output signals. Results are collected in table 1.

Random noise was added to the sensor signals yielding an SNR in the range of [0, 20] dB. The parameters M = 4 and TH = 0.95 were kept fixed. As the noise level increases, the number of significant eigenvalues also increases. Hence at the output of the first step more signals need to be considered. Thus as the noise energy increases, the number (l) of signals, or the dimension of the matrix C, also increases after the application of the first step (last column of table 1). As the noise increases, an increasing number of ICs will be available at the output of the two steps. Computing, in the frequency domain, the correlation coefficients between the output signals of each step of the algorithm and the noise or source signals, we confirm that some are related with the sources and others with noise. Table 1 (columns 3-6) shows that the maximal correlation coefficients are distributed between noise and source signals to a varying degree. We can see that the number of signals correlated with noise is always higher in the first level. Results show that for low noise levels the first step (which is mainly a principal component analysis in a space of dimension NM) already achieves good solutions. However, we can also see (for narrow-band signals and/or low M) that the time domain characteristics of the signals resemble the original source signals only after a GEVD, i.e. at the output of the second step, rather than with a PCA, i.e. at the output of the first step. Figure 5 shows examples of signals that have been obtained in the two steps of the algorithm for SNR = 10 dB. At the output of the first level the 3 signals with the highest frequency correlation were chosen among the 8 output signals. Using a similar criterion to choose 3 signals at the output of the 2nd step (last column of figure 5), we can see that their time course is more similar to the source signals than after the first step (middle column of figure 5).

Table 1
Number of output signals correlated with noise or source signals after step 1 and step 2 of the algorithm dAMUSE.

                  1st step             2nd step
SNR      NM   Sources   Noise      Sources   Noise      Total
20 dB    12   6         0          6         0          6
15 dB    12   5         2          6         1          7
10 dB    12   6         2          7         1          8
 5 dB    12   6         3          7         2          9
 0 dB    12   7         4          8         3          11

Fig. 5. Comparison of output signals resulting after the first step (second column) and the second step (last column) of dAMUSE; the first column shows the source signals.

4.3 Denoising of protein NMR spectra

In biophysics the determination of the 3D structure of biomolecules like proteins is of utmost importance. Nuclear magnetic resonance techniques provide indispensable tools to reach this goal. As hydrogen nuclei are the most abundant and most sensitive nuclei in proteins, mostly proton NMR spectra of proteins dissolved in water are recorded. Since the concentration of the solvent is larger than the protein concentration by orders of magnitude, there is always a large proton signal of the water solvent contaminating the protein spectrum. This water artifact cannot be suppressed completely with technical means, hence it would be interesting to remove it during the analysis of the spectra.

BSS techniques have been shown to solve this separation problem [27,28]. BSS algorithms are based on an ICA [2] which extracts a set of underlying independent source signals out of a set of measured signals without knowing how the mixing process is carried out. We have used an algebraic algorithm [35,36] based on second order statistics, using the time structure of the signals, to separate this and related artifacts from the remaining protein spectrum. Unfortunately, due to the statistical nature of the algorithm, unwanted noise is introduced into the reconstructed spectrum, as can be seen in figure 6. The water artifact removal is effected by a decomposition of a series of NMR spectra into their uncorrelated spectral components applying a generalized eigendecomposition of a congruent matrix pencil [37]. The latter is formed with a correlation matrix of the signals and a correlation matrix with delayed or filtered signals [32]. Then we can detect and remove the components which contain only a signal generated by the water and reconstruct the remaining protein spectrum from its ICs. But the latter now contains additional noise introduced by the statistical analysis procedure, hence denoising was deemed necessary.

The algorithms discussed above have been applied to an experimental 2D Nuclear Overhauser Effect Spectroscopy (NOESY) proton NMR spectrum of the polypeptide P11 dissolved in water. The synthetic peptide P11 consists of only 24 amino acids and represents the helix H11 of the human Glutathion reductase [21]. A simple pre-saturation of the water resonance was applied to prevent saturation of the dynamic range of the Analog Digital Converter (ADC). Every data set comprises 512 Free Induction Decays (FIDs) S(t_1, t_2) ≡ x_n[l], or their corresponding spectra Ŝ(ω_1, ω_2) ≡ x̂_n[l], with L = 2048 samples each, which correspond to N = 128 evolution periods t_1 ≡ [n]. To each evolution period belong four FIDs with different phase modulations, hence only FIDs with equal phase modulations have been considered for analysis. A BSS analysis, using both the algorithm GEVD using Matrix Pencil (GEVD-MP) [28] and the algorithm dAMUSE [33], was applied to all data sets. Note that the matrix pencil within GEVD-MP was conveniently computed in the frequency domain, while in the algorithm dAMUSE the matrix pencil was computed in the time domain, in spite of the filtering operation being performed in the frequency domain. The GEVD is performed in dAMUSE as described above to achieve a dimension reduction and concomitant denoising.

4.3.1 Local ICA denoising

For denoising we first used the LICA denoising algorithm proposed above to enhance the reconstructed protein signal without the water artifact. We applied the denoising only to those components which were identified as water components. Then we removed the denoised versions of these water artifact components from the total spectrum. As a result, the additional noise is at least halved, as can also be seen from figure 7. On the part of the spectrum away from the center, i.e. not containing any water artifacts, we could estimate the increase of the SNR with the original spectrum as reference. We calculated an SNR of 17.3 dB for the noisy spectrum and an SNR of 21.6 dB after applying the denoising algorithm.

We compare the result of our denoising algorithm, i.e. the reconstructed artifact-free protein spectrum, to the result of a KPCA based denoising algorithm using a Gaussian kernel in figure 8. The figure depicts the differences between the denoised spectra and the original spectrum in the regions where the water signal is not very dominant. As can be seen, the LICA denoising algorithm reduces the noise but does not change the content of the signal, whereas the KPCA algorithm seems to influence the peak amplitudes of the protein resonances as well. Further experiments are under way in our laboratory to investigate these differences in more detail and to establish an automatic artifact removal algorithm for multidimensional NMR spectra.

Fig. 6. The graph shows a 1D slice of a proton 2D NOESY NMR spectrum of the polypeptide P11 before (top panel: original NMR spectrum of the P11 protein) and after (bottom panel: spectrum after the water removal algorithm) removing the water artifact with the GEVD-MP algorithm; the abscissa is the chemical shift δ [ppm], the ordinate the signal in arbitrary units. The 1D spectrum corresponds to the shortest evolution period t1.

4.3.2 Kernel PCA denoising

As the removal of the water artifact led to additional noise in the spectra (compare figure 9(a) and figure 9(b)), KPCA based denoising was applied. First (almost) noise free samples had to be created in order to determine the principal axes in feature space. For that purpose, the first 400 data points of the real and the imaginary part of each of the 512 original spectra were used to form a 400 × 1024 sample matrix X^(1). Likewise five further sample matrices X^(m), m = 2, . . . , 6, were created, which consisted of the data points 401 to 800, 601 to 1000, 1101 to 1500, 1249 to 1648 and 1649 to 2048, respectively. Note that the region (1000 - 1101) of data points comprising the main part of the water resonance was nulled deliberately as it is of no use for the KPCA.

Fig. 7. The figure shows the corresponding artifact free P11 spectra after the denoising algorithms have been applied: (a) the LICA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP, and (b) the KPCA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP. The LICA algorithm was applied to all water components with M and K chosen with the MDL estimator (γ = 32) between 20 and 60 and between 20 and 80, respectively. The second graph shows the denoised spectrum obtained with a KPCA based algorithm using a Gaussian kernel.

For each of the sample matrices X^(m) the corresponding kernel matrix K was determined by

K_ij = k(x_i, x_j),   i, j = 1, . . . , 400            (22)

where x_i denotes the i-th column of X^(m). For the kernel function a Gaussian kernel was chosen,

k(x_i, x_j) = exp( − ‖x_i − x_j‖² / (2σ²) )            (23)


where the width parameter σ was determined from the data according to

2σ² = ( 1 / (400 · 399) ) ∑_{i,j=1}^{400} ‖x_i − x_j‖²            (24)

Fig. 8. The graph uncovers the differences of the LICA and KPCA denoising algorithms. As a reference the corresponding 1D slice of the original P11 spectrum is displayed on top. From top to bottom the three curves show: the difference of the original and the spectrum with the GEVD-MP algorithm applied, the difference between the original and the LICA denoised spectrum, and the difference between the original and the KPCA denoised spectrum. To compare the graphs in one diagram, the three curves are translated vertically by 2, 4 and 6, respectively.

Finally the kernel matrix K was expressed in terms of its EVD (equation 17), which leads to the expansion parameters α necessary to determine the principal axes of the corresponding feature space Ω^(m):

ω = ∑_{i=1}^{400} α_i Φ(x_i)            (25)

Similar to the original data, the noisy data of the reconstructed spectra were used to form six 400 × 1024 dimensional pattern matrices P^(m), m = 1, . . . , 6.

Fig. 9. 1D slice of a 2D NOESY spectrum of the polypeptide P11 in aqueous solution corresponding to the shortest evolution period t1: (a) original (noisy) spectrum, (b) reconstructed spectrum with the water artifact removed with the matrix pencil algorithm, and (c) result of the KPCA denoising of the reconstructed spectrum. The chemical shift ranges roughly from −1 ppm to 10 ppm. The insert shows the region of the spectrum between 10 and 9 ppm roughly; the upper trace corresponds to the denoised baseline and the lower trace shows the baseline of the original spectrum.

Then the principal components β_k of each column of P^(m) were calculated in the corresponding feature space Ω^(m). In order to denoise the patterns, only projections onto the first n = 112 principal axes were considered. This leads to

β_k = ∑_{i=1}^{400} α_i^k k(x_i, x),   k = 1, . . . , 112            (26)

where x is a column of P^(m).

After reconstructing the image P̂_n Φ(x) of the sample vector under the map Φ (equation 18), its approximate pre-image was determined by minimizing the cost function

ρ(z) = −2 ∑_{k=1}^{112} β_k ∑_{i=1}^{400} α_i^k k(x_i, z)            (27)

(the term k(z, z) of equation (20) is constant for the Gaussian kernel and can therefore be omitted).

Note that the method described above fails to denoise the region where the water resonance appears (data points 1001 to 1101), because there the samples formed from the original data differ too much from the noisy data. This is not a major drawback, as protein peaks totally hidden under the water artifact cannot be uncovered by the presented blind source separation method anyway. Figure 9(c) shows the resulting denoised protein spectrum on an identical vertical scale as figure 9(a) and figure 9(b). The insert compares the noise in a region of the spectrum between 10 and 9 ppm roughly, where no protein peaks are found. The upper trace shows the baseline of the denoised reconstructed protein spectrum and the lower trace the corresponding baseline of the original experimental spectrum before the water artifact has been separated out.

4.3.3 Denoising using Delayed AMUSE

LICA denoising of reconstructed protein spectra necessitates solving the BSS problem beforehand using any ICA algorithm. A much more elegant solution is provided by the recently proposed algorithm dAMUSE, which achieves BSS and denoising simultaneously. To test the performance of the algorithm, it was also applied to the 2D NOESY NMR spectra of the polypeptide P11. A 1D slice of the 2D NOESY spectrum of P11 corresponding to the shortest evolution period t1 is presented in figure 9(a), which shows a huge water artifact despite some pre-saturation on the water resonance. Figure 10 shows the reconstructed spectra obtained with the algorithms GEVD-MP and dAMUSE, respectively. The algorithm GEVD-MP yielded almost artifact-free spectra but with clear changes in the peak intensities in some areas of the spectra. In contrast, the reconstructed spectra obtained with the algorithm dAMUSE still contain some remnants of the water artifact, but the protein peak intensities remained unchanged and all baseline distortions have been cured. All


parameters of the algorithms are collected in table 2.

[Figure 10 panels:
(a) 1D slice of the NOESY spectrum of the protein P11 reconstructed with the algorithm GEVD-MP
(b) Corresponding protein spectrum reconstructed with the algorithm dAMUSE]

Fig. 10. Comparison of denoising of the P11 protein spectrum.

5 Conclusions

We proposed two new denoising techniques and also considered KPCA denoising, which are all based on the concept of embedding signals in delayed coordinates. We presented a detailed discussion of their properties and also discussed results obtained by applying them to illustrative toy examples. Furthermore, we compared all three algorithms by applying them to the real-world


problem of removing the water artifact from NMR spectra and denoising the resulting reconstructed spectra. Although all three algorithms achieved good results concerning the final SNR, in the case of the NMR spectra it turned out that KPCA seems to alter the spectral shapes while LICA and dAMUSE do not. At least with protein NMR spectra it is crucial that denoising algorithms do not alter integrated peak intensities in the spectra, as the latter form the basis for the structure elucidation process.

In the future we will have to further investigate the dependence of the proposed algorithms on the situation at hand. Thereby it will be crucial to identify data models for which each one of the proposed denoising techniques works best and to find good measures of how well such models suit the given data.

Table 2
Parameter values for the embedding dimension of the feature space of dAMUSE (M_dAMUSE), the number (K) of sampling intervals used per delay in the trajectory matrix, the number N_pc of principal components retained after the first step of the GEVD, and the half-width (σ) of the Gaussian filter used in the algorithms GEVD-MP and dAMUSE.

Parameter   N_IC   M_dAMUSE   N_pc   N_w(GEVD)
P11         256    3          148    49

Parameter   N_w(dAMUSE)   σ     SNR_GEVD-MP   SNR_dAMUSE
P11         46            0.3   18.6 dB       22.9 dB

6 Acknowledgements<br />

This research has been supported by the BMBF (project ModKog) and the<br />

DFG (GRK 638: Nonl<strong>in</strong>earity and Nonequilibrium <strong>in</strong> Condensed Matter). We<br />

are grateful to W. Gronwald and H. R. Kalbitzer for provid<strong>in</strong>g the NMR<br />

spectrum of P11 and helpful discussions.<br />

References<br />

[1] Adel Belouchrani, Karim Abed-Meraim, Jean-François Cardoso, and Eric<br />

Moulines. A blind source separation technique using second-order statistics.

IEEE Transactions on Signal Process<strong>in</strong>g, 45(2):434–444, 1997.<br />

[2] Andrzej Cichocki and Shun-Ichi Amari. Adaptive Bl<strong>in</strong>d Signal and Image<br />

Process<strong>in</strong>g. Wiley, 2002.<br />


[3] P. Comon. Independent component analysis - a new concept? Signal Processing,

36:287–314, 1994.<br />

[4] K. I. Diamantaras and S. Y. Kung. Pr<strong>in</strong>cipal <strong>Component</strong> Neural Networks,<br />

Theory and Applications. Wiley, 1996.<br />

[5] A. Effern, K. Lehnertz, T. Schreiber, T. Grunwald, P. David, and C. E.<br />

Elger. Nonl<strong>in</strong>ear denois<strong>in</strong>g of transient signals with application to event-related<br />

potentials. Physica D, 140:257–266, 2000.<br />

[6] E. Fishler and H. Messer. On the use of order statistics for improved detection<br />

of signals by the MDL criterion. IEEE Transactions on Signal Process<strong>in</strong>g,<br />

48:2242–2247, 2000.<br />

[7] R. Freeman. Sp<strong>in</strong> Choreography. Spektrum Academic Publishers, Oxford, 1997.<br />

[8] R. R. Gharieb and A. Cichocki. Second-order statistics based bl<strong>in</strong>d source<br />

separation us<strong>in</strong>g a bank of subband filters. Digital Signal Process<strong>in</strong>g, 13:252–<br />

274, 2003.<br />

[9] M. Ghil, M. R. Allen, M. D. Dett<strong>in</strong>ger, and K. Ide. Advanced spectral methods<br />

for climatic time series. Reviews of Geophysics, 40(1):1–41, 2002.<br />

[10] K. H. Hausser and H.-R. Kalbitzer. NMR <strong>in</strong> Medic<strong>in</strong>e and Biology. Berl<strong>in</strong>,<br />

1991.<br />

[11] J. Hérault and C. Jutten. Space or time adaptive signal process<strong>in</strong>g by neural<br />

network models. In J. S. Denker, editor, Neural Networks for Comput<strong>in</strong>g.<br />

Proceed<strong>in</strong>gs of the AIP Conference, pages 206–211, New York, 1986. American<br />

Institute of Physics.<br />

[12] Aapo Hyvär<strong>in</strong>en, Patrik Hoyer, and Erkki Oja. Intelligent Signal Process<strong>in</strong>g.<br />

IEEE Press, 2001.<br />

[13] A. Hyvär<strong>in</strong>en, J. Karhunen, and E. Oja. <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong>.<br />

2001.<br />

[14] A. Hyvär<strong>in</strong>en and E. Oja. A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent<br />

component analysis. Neural Computation, 9:1483–1492, 1997.<br />

[15] A. K. Ja<strong>in</strong> and R. C. Dubes. Algorithms for Cluster<strong>in</strong>g Data. Prentice Hall:<br />

New Jersey, 1988.<br />

[16] J. T. Kwok and I. W. Tsang. The pre-image problem <strong>in</strong> kernel methods. In<br />

Proceed. Int. Conf. Mach<strong>in</strong>e Learn<strong>in</strong>g (ICML03), 2003.<br />

[17] A. P. Liavas and P. A. Regalia. On the behavior of <strong>in</strong>formation theoretic criteria<br />

for model order selection. IEEE Transactions on Signal Process<strong>in</strong>g, 49:1689–<br />

1695, 2001.<br />

[18] Chor T<strong>in</strong> Ma, Zhi D<strong>in</strong>g, and Sze Fong Yau. A two-stage algorithm for MIMO<br />

bl<strong>in</strong>d deconvolution of nonstationary colored noise. IEEE Transactions on<br />

Signal Process<strong>in</strong>g, 48:1187–1192, 2000.<br />


[19] S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, and G. Rätsch. Kernel<br />

PCA and denois<strong>in</strong>g <strong>in</strong> feature spaces. Adv. Neural Information Process<strong>in</strong>g<br />

Systems, NIPS11, 11, 1999.<br />

[20] V. Moskv<strong>in</strong>a and K. M. Schmidt. Approximate projectors <strong>in</strong> s<strong>in</strong>gular spectrum<br />

analysis. SIAM Journal Mat. Anal. Appl., 24(4):932–942, 2003.<br />

[21] A. Nordhoff, Ch. Tziatzios, J. A. V. Broek, M. Schott, H.-R. Kalbitzer,<br />

K. Becker, D. Schubert, and R. H. Schirme. Denaturation and reactivation of<br />

dimeric human glutathione reductase. Eur. J. Biochem, pages 273–282, 1997.<br />

[22] L. Parra and P. Sajda. Blind source separation via generalized eigenvalue

decomposition. Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research, 4:1261–1269, 2003.<br />

[23] K. Pearson. On l<strong>in</strong>es and planes of closest fit to systems of po<strong>in</strong>ts <strong>in</strong> space.<br />

Philosophical Magaz<strong>in</strong>e, 2:559–572, 1901.<br />

[24] I. W. Sandberg and L. Xu. Uniform approximation of multidimensional myopic

maps. Transactions on Circuits and Systems, 44:477–485, 1997.<br />

[25] B. Schoelkopf, A. Smola, and K.-R. Mueller. Nonl<strong>in</strong>ear component analysis as<br />

a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.<br />

[26] K. Stadlthanner, E. W. Lang, A. M. Tomé, A. R. Teixeira, and C. G. Puntonet.<br />

Kernel-PCA denoising of artifact-free protein NMR spectra. Proc. IJCNN'2004,
Budapest, Hungary, 2004.

[27] K. Stadlthanner, F. J. Theis, E. W. Lang, A. M. Tomé, W. Gronwald, and H.-R.<br />

Kalbitzer. A matrix pencil approach to the bl<strong>in</strong>d source separation of artifacts<br />

<strong>in</strong> 2D NMR spectra. Neural Information Process<strong>in</strong>g - Letters and Reviews,<br />

1:103–110, 2003.<br />

[28] K. Stadlthanner, F. Theis, E. W. Lang, A. M. Tomé, A. R. Teixeira,<br />

W. Gronwald, and H.-R. Kalbitzer. GEVD-MP. Neurocomput<strong>in</strong>g accepted,<br />

2005.<br />

[29] K. Stadlthanner, A. M. Tomé, F. J. Theis, W. Gronwald, H.-R. Kalbitzer, and<br />

E. W. Lang. Bl<strong>in</strong>d source separation of water artifacts <strong>in</strong> NMR spectra us<strong>in</strong>g a<br />

matrix pencil. In Fourth International Symposium On <strong>Independent</strong> <strong>Component</strong><br />

<strong>Analysis</strong> and Bl<strong>in</strong>d Source Separation, ICA’2003, pages 167–172, Nara, Japan,<br />

2003.<br />

[30] F. Takens. On the numerical determ<strong>in</strong>ation of the dimension of an attractor.<br />

Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 898:366–

381, 1981.<br />

[31] F. J. Theis, A. Meyer-Bäse, and E. W. Lang. Second-order bl<strong>in</strong>d source<br />

separation based on multi-dimensional autocovariances. In Proc. ICA 2004,<br />

volume 3195 of Lecture Notes <strong>in</strong> Computer Science, pages 726–733, Granada,<br />

Spa<strong>in</strong>, 2004.<br />

[32] Ana Maria Tomé and Nuno Ferreira. On-l<strong>in</strong>e source separation of temporally<br />

correlated signals. In European Signal Process<strong>in</strong>g Conference, EUSIPCO2002,<br />

Toulouse, France, 2002.<br />


[33] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

A. P. Rocha, and R. Almeida. dAMUSE - A new tool for denoising and BSS.

Digital Signal Process<strong>in</strong>g, 2005.<br />

[34] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

and A. P. Rocha. Bl<strong>in</strong>d source separation us<strong>in</strong>g time-delayed signals. In<br />

International Jo<strong>in</strong>t Conference on Neural Networks, IJCNN’2004, volume CD,<br />

Budapest, Hungary, 2004.<br />

[35] Ana Maria Tomé. Bl<strong>in</strong>d source separation us<strong>in</strong>g a matrix pencil. In Int. Jo<strong>in</strong>t<br />

Conf. on Neural Networks, IJCNN’2000, Como, Italy, 2000.<br />

[36] Ana Maria Tomé. An iterative eigendecomposition approach to bl<strong>in</strong>d source<br />

separation. In 3rd Intern. Conf. on <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong> and Signal<br />

Separation, ICA'2001, pages 424–428, San Diego, USA, 2001.

[37] Lang Tong, Ruey wen Liu, Victor C. Soon, and Yih-Fang Huang. Indeterm<strong>in</strong>acy<br />

and identifiability of bl<strong>in</strong>d identification. IEEE Transactions on Circuits and<br />

Systems, 38(5):499–509, 1991.<br />

[38] Rolf Vetter, J. M. Ves<strong>in</strong>, Patrick Celka, Philippe Renevey, and Jens Krauss.<br />

Automatic nonl<strong>in</strong>ear noise reduction us<strong>in</strong>g local pr<strong>in</strong>cipal component analysis<br />

and MDL parameter selection. Proceed<strong>in</strong>gs of the IASTED International<br />

Conference on Signal Process<strong>in</strong>g Pattern Recognition and Applications (SPPRA<br />

02) Crete, pages 290–294, 2002.<br />

[39] Rolf Vetter. Extraction of efficient and characteristic features of<br />

multidimensional time series. PhD thesis, EPFL, Lausanne, 1999.<br />

[40] P. Vitányi and M. Li. Minimum description length induction, Bayesianism, and
Kolmogorov complexity. IEEE Transactions on Information Theory, 46:446–

464, 2000.<br />



Chapter 16<br />

Proc. ICA 2006, pages 917-925<br />

Paper F.J. Theis and M. Kawanabe. Uniqueness of non-gaussian subspace analysis.<br />

In Proc. ICA 2006, pages 917-925, Charleston, USA, 2006<br />

Reference (Theis and Kawanabe, 2006)<br />

Summary in section 1.5.2


Uniqueness of Non-Gaussian Subspace Analysis

Fabian J. Theis 1 and Motoaki Kawanabe 2

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
2 Fraunhofer FIRST.IDA, Kekuléstraße 7, 12439 Berlin, Germany
fabian@theis.name and nabe@first.fhg.de

Abstract. Dimension reduction provides an important tool for preprocessing large scale data sets. A possible model for dimension reduction is realized by projecting onto the non-Gaussian part of a given multivariate recording. We prove that the subspaces of such a projection are unique given that the Gaussian subspace is of maximal dimension. This result therefore guarantees that projection algorithms uniquely recover the underlying lower dimensional data signals.

An important open problem in signal processing is the task of efficient dimension reduction, i.e. the search for meaningful signals within a higher dimensional data set. Classical techniques such as principal component analysis hereby define ‘meaningful’ using second-order statistics (maximal variance), which may often be inadequate for signal detection, i.e. in the presence of strong noise. This contrasts with higher-order models including projection pursuit [1,2] or non-Gaussian subspace analysis (NGSA) [3,4]. While the former extracts a single non-Gaussian independent component from the data set, the latter tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made.

The goal of linear dimension reduction can be defined as the search of a projection W ∈ Mat(n×d) of a d-dimensional random vector X with n < d.



An intuitive notion of how to choose the reduced dimension n is to require that W_G X is maximally Gaussian, and hence W_N X non-Gaussian.

The dimension reduction problem itself can of course also be formulated within a generative model, which leads to the following linear mixing model

\[ X = A_N S_N + A_G S_G \qquad (1) \]

such that S_N and S_G are independent, and S_G Gaussian. Then (A_N, A_G)^{-1} = (W_N^⊤, W_G^⊤)^⊤. This model includes the general noisy ICA model X = A_N S_N + G, where G is Gaussian and S_N is also assumed to be mutually independent; the dimension reduction then means projection onto the signal subspace, which might be deteriorated by the noise G along the subspace — the components of G orthogonal to the subspace will be removed. However, (1) is more general in the sense that it does not assume mutual independence of S_N, only independence of S_N and S_G.

The paper is organized as follows: In the next section, we first discuss obvious<br />

<strong>in</strong>determ<strong>in</strong>acies of NGSA and possible regularizations. We then present our ma<strong>in</strong> result,<br />

theorem 1, and give an explicit proof <strong>in</strong> a special case. The general proof is divided<br />

up <strong>in</strong>to a series of lemmas, the proofs of which are omitted due to lack of space. In<br />

section 2, some simulations are performed to validate the uniqueness result. A practical<br />

algorithm for perform<strong>in</strong>g NGSA essentially us<strong>in</strong>g the idea of separated characteristic<br />

functions from the proof is presented <strong>in</strong> the co-paper [6].<br />

1 Uniqueness of NGSA-based dimension reduction<br />

This contribution aims at provid<strong>in</strong>g conditions such that the decomposition (1) is unique.<br />

More precisely, we will show under which conditions the non-Gaussian as well as the<br />

Gaussian subspace is unique.<br />

1.1 Indeterm<strong>in</strong>acies<br />

Clearly, the matrices A_N and A_G in the decomposition (1) cannot be unique — multiplication from the right using any invertible matrix leaves the model invariant: X = A_N S_N + A_G S_G = (A_N B_N)(B_N^{-1} S_N) + (A_G B_G)(B_G^{-1} S_G) with B_N ∈ Gl(n), B_G ∈ Gl(d−n), because B_N^{-1} S_N and B_G^{-1} S_G are again independent, and B_G^{-1} S_G Gaussian.

An additional indeterminacy comes into play due to the fact that we do not want to fix the reduced dimension in advance. Given a realization of the model (1) with d



1.2 Uniqueness theorem

Definition 1. X = AS with A ∈ Gl(d), S = (S_N, S_G) and S_N ∈ L²(Ω, R^n) is called an n-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be n-decomposable.

Hence an n-decomposition of X corresponds to the NGSA problem. If as before A = (A_N, A_G), then the n-dimensional subvectorspace im(A_N) ⊂ R^d is called the non-Gaussian subspace, and im(A_G) the Gaussian subspace of the decomposition; here im(A) denotes the image of the linear map A.

Definition 2. X is said to be minimally n-decomposable if X is not (n−1)-decomposable. Then dim_e(X) := n is called the essential dimension of X.

For example, the essential dimension dim_e(X) is zero if and only if X is Gaussian, whereas the essential dimension of a d-dimensional mutually independent Laplacian is d. The following theorem is the main theoretical contribution of this work. It essentially connects uniqueness of the dimension reduction model with minimality, and gives a simple characterization for it.

Theorem 1 (Uniqueness of NGSA). Let n



1.3 Proof of theorem 1

First note that the theorem holds trivially for n = 0, because in this case X is Gaussian. So in the following let 0 < n < d.



for all x_N ∈ R, because A is invertible.

Now a_NN ≠ 0, otherwise X_N = a_NG S_G, which contradicts (ii) for X. If also a_NG ≠ 0, then by equation (3), h″_N is constant and therefore S_N Gaussian, which again contradicts (ii), now for S. Hence a_NG = 0. By (3), a_GN a_GG = 0, and again a_GG ≠ 0, otherwise X_G = a_GN S_N, contradicting (ii) for S. Hence also a_GN = 0, as was to show.

General proof. In order to give an idea of the main proof without getting lost in details, we have divided it up into a sequence of lemmas; these will not be proven due to lack of space. The characteristic function of the random vector X is defined by X̂(x) := E(exp(i x^⊤ X)), and since X is assumed to have existing covariance, X̂ is twice continuously differentiable. Moreover, by definition \widehat{AS}(x) = Ŝ(A^⊤ x), and the characteristic function of an independent random vector factorizes into the component characteristic functions. So instead of using p_X as in the 2-dimensional example, we use X̂, having similar properties except for the fact that the range is now complex and that the differentiability condition can be considerably relaxed.

We will need the following lemma, which has essentially been shown in [9]; here ∇f denotes the gradient of f and H_f its Hessian.

Lemma 1. Let X ∈ L²(Ω, R^m) be a random vector. Then X is Gaussian with covariance 2C if and only if it satisfies X̂ H_{X̂} − ∇X̂ (∇X̂)^⊤ + C X̂² ≡ 0.

Note that we may assume that the covariance of S (and hence also of X) is positive definite — otherwise, while still keeping the model, we can simply remove the subspace of deterministic components (i.e. components of variance 0), which have to be mapped onto each other by A. Hence we may even assume Cov(S_G) = I, after whitening as described in section 1.1. This uses the fact that the basis within the Gaussian subspace is not unique. The same holds also for the non-Gaussian subspace, so we may choose any B_N ∈ Gl(n) and B_G ∈ O(d−n) to get

\[ X = \begin{pmatrix} A_{NN} B_N \\ A_{GN} B_N \end{pmatrix} (B_N^{-1} S_N) + \begin{pmatrix} A_{NG} B_G \\ A_{GG} B_G \end{pmatrix} (B_G^{\top} S_G). \qquad (4) \]

Here only orthogonal matrices B_G are allowed in order for B_G^⊤ S_G to stay decorrelated, with S_G being decorrelated.

The next lemma uses the dimension reduction model for X and S to derive an explicit differential equation for Ŝ_N. The Gaussian part Ŝ_G in the following lemma vanishes after application of lemma 1.

Lemma 2. For any basis B_N ∈ Gl(n), the non-Gaussian source characteristic function Ŝ_N ∈ C²(R^n, C) fulfills

\[ A_{NN} B_N \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) B_N^{\top} A_{GN}^{\top} + 2 A_{NG} A_{GG}^{\top} \hat{S}_N^2 \equiv 0. \qquad (5) \]

Lemma 3. Let (A_NN, A_NG) ∈ Mat(n×(n+(d−n))) be an arbitrary full-rank matrix. If rank A_NN < n, then we may choose coordinates B_N ∈ Gl(n), B_G ∈ O(d−n) and M ∈ Gl(n) such that for arbitrary matrices ∗ ∈ Mat((n−1)×(n−1)), ∗′ ∈ Mat((n−1)×(d−n−1)):

\[ M A_{NN} B_N = \begin{pmatrix} 0 & 0 \\ 0 & * \end{pmatrix} \quad \text{and} \quad M A_{NG} B_G = \begin{pmatrix} 1 & 0 \\ 0 & *' \end{pmatrix}. \]



The basis choice from lemma 3 together with assumption (ii) can be used to prove the following fact:

Lemma 4. The non-Gaussian transformation is invertible, i.e. A_NN ∈ Gl(n).

The next lemma can be seen as a modification of lemma 1, and indeed it can be shown similarly.

Lemma 5. If Ŝ_N fulfills ( Ŝ_N H_{Ŝ_N} − ∇Ŝ_N (∇Ŝ_N)^⊤ ) e_1 + Ŝ_N² c ≡ 0 for some constant vector c ∈ R^n, then the source component (S_N)_1 is Gaussian and independent of (S_N)(2:n).

Here more generally e_i ∈ R^n denotes the i-th unit vector. Putting these lemmas together, we can finally prove theorem 1: According to lemma 4, A_NN is invertible, so multiplying equation (5) from lemma 2 by B_N^{-1} A_NN^{-1} from the left yields

\[ \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) B_N^{\top} A_{GN}^{\top} + C \hat{S}_N^2 \equiv 0 \qquad (6) \]

for any B_N ∈ Gl(n) and some fixed, real matrix C ∈ Mat(n×(d−n)).

We claim that A_GN = 0. If not, then there exists v ∈ R^{d−n} with ‖A_GN^⊤ v‖ = 1. Choose B_N from (4) such that B_N^{-1} S_N is decorrelated. This is invariant under left-multiplication by an orthogonal matrix, so we may moreover assume that B_N^⊤ A_GN^⊤ v = e_1. Multiplying equation (6) in turn by v from the right therefore shows that the vector function

\[ \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) e_1 + c \hat{S}_N^2 \equiv 0 \qquad (7) \]

is zero; here c := Cv ∈ R^n. This means that Ŝ_N fulfills the condition of lemma 5, which implies that (S_N)_1 is Gaussian and independent of the rest. But this contradicts (ii) for S, hence A_GN = 0. Plugging this result into equation (5), evaluation at s_N = 0 shows that A_NG A_GG^⊤ = 0. Since A_GN = 0 and A ∈ Gl(d), necessarily A_GG ∈ Gl(d−n), so A_NG = 0, as was to prove.

2 Simulations<br />

In this section, we will provide experimental validation of the uniqueness result of<br />

corollary 1. In order to stay unbiased and not test a s<strong>in</strong>gle algorithm, we have to uniformly<br />

search the parameter space for possibly equivalent model representations. The<br />

model assumptions (1) will not be perfectly fulfilled, so we <strong>in</strong>troduce a measure of<br />

model deviation based on 4-th order cumulants <strong>in</strong> the follow<strong>in</strong>g.<br />

Let the non-Gaussian dimension n and the total dimension d be fixed. Given a random vector X = (X_N, X_G), we can without loss of generality assume that Cov(X) = I. Any possible model deviation consists of (i) a deviation from the independence of X_N and X_G and (ii) a deviation from the Gaussianity of X_G. In the case of non-vanishing kurtoses, the former can be approximated for example by

\[ \delta_I(X) := \frac{1}{n(d-n)d^2} \sum_{i=1}^{n} \sum_{j=n+1}^{d} \sum_{k=1}^{d} \sum_{l=1}^{d} \mathrm{cum}^2(X_i, X_j, X_k, X_l), \]

where the fourth-order cumulant tensor is defined as cum(X_i, X_j, X_k, X_l) := E(X_i X_j X_k X_l) − E(X_i X_j)E(X_k X_l) − E(X_i X_k)E(X_j X_l) − E(X_i X_l)E(X_j X_k). The deviation (ii) from Gaussianity of X_G can simply be measured by kurtosis, which in the case of white X means

\[ \delta_G(X) := \frac{1}{d-n} \sum_{j=n+1}^{d} \left| E(X_j^4) - 3 \right|. \]

Altogether, we can therefore define a total model deviation as the weighted sum of the above indices; the weight in the following was chosen experimentally to approximately yield even contributions of the two measures:

\[ \delta(X) = 10\, n(d-n)\, \delta_I(X) + \delta_G(X). \]
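As a concrete illustration of these measures, the following numpy sketch estimates δ_I, δ_G and the total deviation δ from whitened samples; the brute-force cumulant estimator and the variable names are illustrative assumptions rather than the code used for the experiments.

```python
import numpy as np

def cum4(X, i, j, k, l):
    """Sample estimate of the fourth-order cumulant cum(X_i, X_j, X_k, X_l)
    for (approximately) zero-mean data X of shape (T, d)."""
    xi, xj, xk, xl = X[:, i], X[:, j], X[:, k], X[:, l]
    E = lambda a: a.mean()
    return (E(xi * xj * xk * xl) - E(xi * xj) * E(xk * xl)
            - E(xi * xk) * E(xj * xl) - E(xi * xl) * E(xj * xk))

def model_deviation(X, n):
    """Total model deviation delta(X) = 10 n (d-n) delta_I(X) + delta_G(X)
    for whitened samples X of shape (T, d) and non-Gaussian dimension n."""
    T, d = X.shape
    X = X - X.mean(axis=0)
    delta_I = np.mean([cum4(X, i, j, k, l) ** 2
                       for i in range(n) for j in range(n, d)
                       for k in range(d) for l in range(d)])
    delta_G = np.mean([abs((X[:, j] ** 4).mean() - 3.0) for j in range(n, d)])
    return 10 * n * (d - n) * delta_I + delta_G
```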

For numerical tests, we generate two different non-Gaussian source data sets, see figure 1(d) and also [4], figure 1. The first source set (I) is an n-dimensional dependent sub-Gaussian random vector given by an isotropic uniform density within the unit disc, and source set (II) a 2-dimensional dependent super- and sub-Gaussian, given by p(s₁, s₂) ∝ exp(−|s₁|) 1_{[c(s₁), c(s₁)+1]}, where c(s₁) = 0 if |s₁| ≤ ln 2 and c(s₁) = −1 otherwise. Normalization was chosen to guarantee Cov(S_N) = I in advance.

In order to test for model violations, we have to find two representations X = AS and X = A′S′ of the same mixtures. After multiplication by A^{-1} we may as before assume that a single representation X = AS is given with X and S both fulfilling the dimension reduction model (1), and we have to show that A_NG = A_GN = 0 if the decomposition is minimal (corollary 1). The latter can be tested numerically by using the so-called normalized crosserror

\[ E(A) := \frac{1}{2n(d-n)} \left( \|A_{NG}\|_F^2 + \|A_{GN}\|_F^2 \right), \]

where ‖·‖_F is some matrix norm, in our case the Frobenius norm.

In order to reduce the d²-dimensional search space, after whitening we may assume that A ∈ O(d), so only d(d−1)/2 dimensions have to be searched. O(d) can be uniformly sampled for example by choosing B with Gaussian i.i.d. coefficients and orthogonalizing A := (BB^⊤)^{-1/2} B. We perform 10^4 Monte-Carlo runs with random A ∈ O(d). Sources have been generated with T = 10^4 samples, n-dimensional non-Gaussian part (I) and (II) from above, and (d−n)-dimensional i.i.d. Gaussians. We measure the model deviation δ(AS) and compare it with the deviation E(A) from block-diagonality.
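A minimal sketch of one Monte-Carlo step is given below: it draws A uniformly from O(d) via the symmetric orthogonalization A = (BB^⊤)^{-1/2}B and evaluates the crosserror E(A); it reuses model_deviation from the previous sketch, and the helper names are assumptions for illustration.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Uniform sample from O(d): A = (B B^T)^(-1/2) B with Gaussian i.i.d. B."""
    B = rng.standard_normal((d, d))
    w, V = np.linalg.eigh(B @ B.T)                  # B B^T is symmetric positive definite
    return (V @ np.diag(1.0 / np.sqrt(w)) @ V.T) @ B

def crosserror(A, n):
    """Normalized crosserror E(A): deviation of A from block-diagonal structure."""
    d = A.shape[0]
    A_NG, A_GN = A[:n, n:], A[n:, :n]
    return (np.linalg.norm(A_NG, 'fro') ** 2
            + np.linalg.norm(A_GN, 'fro') ** 2) / (2 * n * (d - n))

# one Monte-Carlo run: S has shape (T, d), non-Gaussian part in the first n columns
# A = random_orthogonal(d, np.random.default_rng(0))
# point = (crosserror(A, n), model_deviation(S @ A.T, n))
```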

The results for vary<strong>in</strong>g parameters are given <strong>in</strong> figure 1(a-c). In all three cases we<br />

observe that the smaller the model deviation, the smaller also the crosserror. This gives<br />

an asymptotic confirmation of corollary 1, <strong>in</strong>dicat<strong>in</strong>g that by random sampl<strong>in</strong>g no nonuniqueness<br />

realizations have been found.<br />

3 Conclusion

By minimality of the decomposition (1), we gave a necessary condition for the uniqueness of non-Gaussian subspace analysis. Together with the assumption of existing covariance, this was already sufficient to guarantee model uniqueness. Our result allows NGSA algorithms to find the unknown, unique signal space within a noisy high-dimensional data set [6].



[Figure 1 panels (a)–(c): scatter plots of total model deviation δ(AS) versus crosserror E(A); (a) n=2, d=5, source (I); (b) n=3, d=5, source (I); (c) n=2, d=4, source (II). Panel (d): Laplacian & uniform source.]

Fig. 1. (a–c): total model deviation δ(AS) of the transformed sources versus crosserror E(A) of the mixing matrix for 10^4 Monte-Carlo runs. The circle ◦ indicates the actual source model deviation (non-zero due to finite sample sizes). (d): 2-dimensional dependent sub-Gaussian source (II).

References<br />

1. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE<br />

Trans. on Computers 23 (1975) 881–890<br />

2. Hyvär<strong>in</strong>en, A., Karhunen, J., Oja, E.: <strong>Independent</strong> component analysis. John Wiley & Sons<br />

(2001)<br />

3. Blanchard, G., Kawanabe, M., Sugiyama, M., Spoko<strong>in</strong>y, V., Müller, K.R.: In search of nongaussian<br />

components of a high-dimensional distribution. JMLR (2005) In revision. The<br />

prepr<strong>in</strong>t is available at http://www.cs.titech.ac.jp/ tr/reports/2005/TR05-0003.pdf.<br />

4. Kawanabe, M.: L<strong>in</strong>ear dimension reduction based on the fourth-order cumulant tensor. In:<br />

Proc. ICANN 2005. Volume 3697 of LNCS., Warsaw, Poland, Spr<strong>in</strong>ger (2005) 151–156<br />

5. Theis, F.: Uniqueness of complex and multidimensional <strong>in</strong>dependent component analysis.<br />

Signal Process<strong>in</strong>g 84 (2004) 951–956<br />

6. Kawanabe, M., Theis, F.: Extract<strong>in</strong>g non-gaussian subspaces by characteristic functions. In:<br />

submitted to ICA 2006. (2006)<br />

7. Comon, P.: <strong>Independent</strong> component analysis - a new concept? Signal Process<strong>in</strong>g 36 (1994)<br />

287–314<br />

8. Theis, F.: A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation. Neural Computation<br />

16 (2004) 1827–1850<br />

9. Theis, F.: Multidimensional <strong>in</strong>dependent component analysis us<strong>in</strong>g characteristic functions.<br />

In: Proc. EUSIPCO 2005, Antalya, Turkey (2005)




Chapter 17<br />

IEEE SPL 13(2):96-99, 2006<br />

Paper F.J. Theis, C.G. Puntonet, and E.W. Lang. Median-based cluster<strong>in</strong>g for<br />

underdeterm<strong>in</strong>ed bl<strong>in</strong>d signal process<strong>in</strong>g. IEEE Signal Process<strong>in</strong>g Letters,<br />

13(2):96-99, 2006<br />

Reference (Theis et al., 2006)<br />

Summary in section 1.5.3


Median-based clustering for underdetermined blind signal processing

Fabian J. Theis, Member, IEEE, Carlos G. Puntonet, Member, IEEE, Elmar W. Lang

Abstract— In underdetermined blind source separation, more sources are to be extracted from fewer observed mixtures without knowing both the sources and the mixing matrix. k-means-style clustering algorithms are commonly used to do this algorithmically given sufficiently sparse sources, but in any case other than deterministic sources this lacks theoretical justification. After establishing that mean-based algorithms converge to wrong solutions in practice, we propose a median-based clustering scheme. Theoretical justification as well as algorithmic realizations (both online and batch) are given and illustrated by some examples.

[Figure 1 panels: (a) mixture density in the (x1, x2)-plane with mixing angle α and receptive field F(α); (b) estimation error Δ(α) of the mean-based versus the median-based estimate as a function of α.]

EDICS Category: SAS-ICAB<br />

BLIND source separation (BSS), ma<strong>in</strong>ly based on the assumption<br />

of <strong>in</strong>dependent sources, is currently the topic of<br />

many researchers [1], [2]. Given an observed m-dimensional<br />

mixture random vector x, which allows an unknown decomposition<br />

x = As, the goal is to identify the mix<strong>in</strong>g matrix<br />

A and the unknown n-dimensional source random vector s.<br />

Commonly, first A is identified, and only then are the sources<br />

recovered. We will therefore denote the former task by bl<strong>in</strong>d<br />

mix<strong>in</strong>g model recovery (BMMR), and the latter (with known<br />

A) by bl<strong>in</strong>d source recovery (BSR).<br />

In the difficult case of underdeterm<strong>in</strong>ed or overcomplete<br />

BSS, where less mixtures than sources are observed (m < n),<br />

BSR is non-trivial, see section II. However, our ma<strong>in</strong> focus<br />

lies on the usually more elaborate matrix recovery. Assum<strong>in</strong>g<br />

statistically <strong>in</strong>dependent sources with exist<strong>in</strong>g variance and<br />

at most one Gaussian component, it is well-known that A<br />

is determ<strong>in</strong>ed uniquely by x [3]. However, how to do this<br />

algorithmically is far from obvious, and although quite a few<br />

algorithms have been proposed recently [4]–[6], performance<br />

is yet limited. The most commonly used overcomplete algorithms<br />

rely on sparse sources (after possible sparsification by<br />

preprocess<strong>in</strong>g), which can be identified by cluster<strong>in</strong>g, usually<br />

by k-means or some extension [5], [6]. But apart from the fact<br />

that theoretical justifications have not been found, mean-based<br />

cluster<strong>in</strong>g only identifies the correct A if the data density<br />

approaches a delta distribution. In figure 1, we illustrate the<br />

deficiency of mean-based cluster<strong>in</strong>g; we get an error of up to<br />

5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover the figure indicates that median-based clustering performs much better.

(a) circle histogram for α = 0.4   (b) comparison of mean and median

Fig. 1. Mean- versus median-based clustering. We consider the mixture x of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2−α) respectively. (a) shows the mixture density for α = 0.4 after projection onto the circle. For α ∈ [0, π/4), (b) compares the error when estimating A by the mean and the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.

Manuscript received xxx; revised xxx. Some preliminary results were reported at the conferences ESANN 2002, SIP 2002 and ICA 2003. FT and EL are with the Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany (phone: +49 941 943 2924, fax: +49 941 943 2479, e-mail: fabian@theis.name), and CP is with the Dep. Arquitectura y Tecnología de Computadores, Universidad de Granada, 18071 Granada, Spain.

Indeed, mean-based clustering does not

possess any equivariance property (performance <strong>in</strong>dependent<br />

of A). In the follow<strong>in</strong>g we propose a novel median-based<br />

cluster<strong>in</strong>g method and prove its equivariance (lemma 1.2) and<br />

convergence. For brevity, the proofs are given for the case of<br />

arbitrary n, but m = 2, although they can be readily extended<br />

to higher sensor signal dimensions. Correspond<strong>in</strong>g algorithms<br />

are proposed and experimentally validated.<br />

I. GEOMETRIC MATRIX RECOVERY<br />

Without loss of generality we assume that A has pairwise<br />

l<strong>in</strong>early <strong>in</strong>dependent columns, and m ≤ n. BMMR tries to<br />

identify A <strong>in</strong> x = As given x, where s is assumed to be<br />

statistically <strong>in</strong>dependent. Obviously, this can only be done up<br />

to equivalence [3], where B is said to be equivalent to A,<br />

B ∼ A, if B can be written as B = APL with an <strong>in</strong>vertible<br />

diagonal matrix L (scal<strong>in</strong>g matrix) and an <strong>in</strong>vertible matrix P<br />

with unit vectors <strong>in</strong> each row (permutation matrix). Hence we<br />

may assume the columns ai of A to have unit norm.<br />

For geometric matrix-recovery, we use a generalization [7]<br />

of the geometric ICA algorithm [8]. Let s be an <strong>in</strong>dependent<br />

n-dimensional, Lebesgue-cont<strong>in</strong>uous, random vector with<br />

density ps describ<strong>in</strong>g the sources. As s is <strong>in</strong>dependent, ps<br />

factorizes <strong>in</strong>to ps(s1, . . . , sn) = ps1(s1) · · · psn(sn) with the<br />

one-dimensional marg<strong>in</strong>al source density functions psi. We<br />

assume symmetric sources, i.e. psi(s) = psi(−s) for s ∈ R<br />

and i ∈ [1 : n] := {1, . . .,n}, <strong>in</strong> particular E(s) = 0.<br />

The geometric BMMR algorithm for symmetric distributions<br />

goes as follows [7]: Pick 2n start<strong>in</strong>g vectors<br />

w1,w ′ 1, . . . ,wn,w ′ n on the unit sphere Sm−1 ⊂ Rm such<br />

that the wi are pairwise l<strong>in</strong>early <strong>in</strong>dependent and wi = −w ′ i .<br />

Often, these wi are called neurons because they resemble the<br />

neurons used <strong>in</strong> cluster<strong>in</strong>g algorithms and <strong>in</strong> Kohonen’s selforganiz<strong>in</strong>g<br />

maps. Furthermore fix a learn<strong>in</strong>g rate η : N → R.<br />




The usual hypothesis in competitive learning is η(t) > 0, Σ_{t∈N} η(t) = ∞ and Σ_{t∈N} η(t)² < ∞. Then iterate the following steps until an appropriate abort condition has been met: Choose a sample x(t) ∈ R^m according to the distribution of x. If x(t) = 0 pick a new one — note that this case happens with probability zero. Project x(t) onto the unit sphere and get y(t) := π(x(t)), where π(x) := x/|x| ∈ S^{m−1}. Let i(t) ∈ [1 : n] such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to y(t). Then set w_{i(t)}(t+1) := π(w_{i(t)}(t) + η(t) π(σ y(t) − w_{i(t)}(t))) and w′_{i(t)}(t+1) := −w_{i(t)}(t+1), where σ := 1 if w_{i(t)}(t) is closest to y(t), σ := −1 otherwise. All other neurons are not moved in this iteration. This update rule equals online k-means on S^{m−1} except for the fact that the winner neuron is not moved proportionally to the sample but only in its direction due to the normalization. We will see that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median.
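A compact numpy sketch of this online update rule is given below; each antipodal neuron pair is represented by a single vector, and the initialization and the learning-rate schedule are illustrative assumptions.

```python
import numpy as np

def geometric_bmmr_online(X, n, n_iter=100000, seed=0):
    """Online geometric matrix recovery for mixtures X of shape (T, m): n
    antipodal neuron pairs on the unit sphere are moved by a fixed-length
    step towards the winning sample, so each neuron converges to the median
    (not the mean) of the projected density in its receptive field."""
    rng = np.random.default_rng(seed)
    T, m = X.shape
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # neurons w_i; w'_i = -w_i implicitly
    for t in range(n_iter):
        x = X[rng.integers(T)]
        if np.allclose(x, 0):
            continue
        y = x / np.linalg.norm(x)                   # projection onto the unit sphere
        dots = W @ y
        i = int(np.argmax(np.abs(dots)))            # winning pair
        sigma = 1.0 if dots[i] >= 0 else -1.0       # whether w_i or -w_i is closer
        eta = 1.0 / (1.0 + 0.01 * t)                # sum eta = inf, sum eta^2 < inf
        step = sigma * y - W[i]
        norm = np.linalg.norm(step)
        if norm > 0:
            W[i] = W[i] + eta * step / norm
            W[i] /= np.linalg.norm(W[i])            # renormalize onto the sphere
    return W.T                                      # columns estimate A up to sign and scale
```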

A. Model verification<br />

In this section we first calculate the densities of the random<br />

variables of our cluster<strong>in</strong>g problem. Then we will prove an<br />

asymptotic convergence result. For the theoretical analysis, we<br />

will restrict ourselves to m = 2 mixtures for simplicity. As<br />

above, let x denote the sensor signal vector and A the mix<strong>in</strong>g<br />

matrix such that x = As. We may assume that A has columns<br />

ai = (cosαi, s<strong>in</strong>αi) ⊤ with 0 ≤ α1 < . . . < αn < π.<br />

1) Neural update rule on the sphere: Due to the symmetry of s we can identify the two neurons w_i and w′_i. For this let ι(ϕ) := (ϕ mod π) ∈ [0, π) identify all angles modulo π, and set θ(v) := ι(arctan(v₂/v₁)); then θ(w_i) = θ(w′_i) and θ(a_i) = α_i. We are interested in the essentially one-dimensional projected sensor signal random vector π(x), so using θ we may approximate y := θ(π(x)) ∈ [0, π), measuring the argument of x. Note that the density p_y of y can be calculated from p_x by

\[ p_y(\varphi) = \int_{-\infty}^{\infty} p_x(r\cos\varphi,\, r\sin\varphi)\, |r| \, dr. \]

Now let the (n × n)-matrix B be defined by

\[ B := \begin{pmatrix} A \\ 0 \;\; I_{n-2} \end{pmatrix}, \]

where I_{n−2} is the (n−2)-dimensional identity matrix; so B is invertible. The random vector Bs has the density p_{Bs} = |det B|^{-1} p_s ∘ B^{-1}. A equals B followed by the projection from R^n onto the first two coordinates, hence

\[ p_y(\varphi) = \frac{2}{|\det B|} \int_0^{\infty} r \, dr \int_{\mathbb{R}^{n-2}} dx \; p_s\!\left(B^{-1}(r\cos\varphi,\, r\sin\varphi,\, x)^{\top}\right) \qquad (1) \]

for any ϕ ∈ [0, π), where we have used the symmetry of p_s.

The geometric learning algorithm induces the following n-dimensional Markov chain ω(t), defined recursively by a starting point ω(0) ∈ R^n and the iteration rule ω(t+1) = ι^n(ω(t) + η(t) ζ(y(t)e − ω(t))), where e = (1, . . . , 1)^⊤ ∈ R^n and ζ(x₁, . . . , x_n) ∈ R^n such that

\[ \zeta_i(x_1, \ldots, x_n) = \begin{cases} \operatorname{sgn}(x_i) & \text{if } |x_i| \le |x_j| \text{ for all } j, \\ 0 & \text{otherwise}, \end{cases} \]

and y(0), y(1), . . . is a sequence of i.i.d. random variables with density p_y representing the samples in each online iteration. Note that the ‘modulo π’ map ι is only needed to guarantee that each component of ω(t+1) lies in [0, π).

Furthermore, we can assume that after enough iterations<br />

there is one po<strong>in</strong>t v ∈ S 1 that will not be traversed any<br />

more, and without loss of generality we assume θ(v) = 0<br />

so that the above algorithm simplifies to the planar case with<br />

the recursion rule ω(t + 1) = ω(t) + η(t)ζ(y(t)e − ω(t)).<br />

This is k-means-type learn<strong>in</strong>g with an additional sign function.<br />

Without the sign function and the additional condition that<br />

py is log-concave, it has been shown [9] that the process<br />

ω(t) converges to a unique constant process ω(∞) ∈ R n<br />

such that ωi(∞) = E(py|[β(i), β ′ (i)]), where F(ωi(∞)) :=<br />

{ϕ ∈ [0, π) | ι(|ϕ − ωi(∞)|) ≤ ι(|ϕ − ωj(∞)|) for all j �= i}<br />

denotes the receptive field of the neuron ωi(∞) and β(i), β ′ (i)<br />

designate the receptive field borders. This is precisely the kmeans<br />

convergence condition illustrated <strong>in</strong> figure 1.<br />

2) Limit po<strong>in</strong>ts analysis: We now want to study the limit<br />

po<strong>in</strong>ts of geometric matrix-recovery, so we assume that the<br />

algorithm has already converged. The idea, generaliz<strong>in</strong>g our<br />

analysis <strong>in</strong> the complete case [7] then is to formulate a<br />

condition which the limit po<strong>in</strong>ts will have to satisfy and to<br />

show that the BMMR solutions are among them.<br />

The angles γ1, . . .,γn ∈ [0, π) are said to satisfy the geometric<br />

convergence condition (GCC) if they are the medians<br />

of y restricted to their receptive fields i.e. if γi is the median<br />

of py|F(γi). Moreover, a constant random vector ˆω ∈ R n is<br />

called fixed po<strong>in</strong>t if E(ζ(ye− ˆω)) = 0. Hence, the expectation<br />

of a Markov process ω(t) start<strong>in</strong>g at a fixed po<strong>in</strong>t will<br />

<strong>in</strong>deed not be changed by the geometric update rule because<br />

E(ω(t+1)) = E(ω(t))+η(t)E(ζ(y(t)e−ω(t))) = E(ω(t)).<br />

Lemma 1.1: Assume that the geometric algorithm converges<br />

to a constant random vector ω(∞). Then ω(∞) is a<br />

fixed po<strong>in</strong>t if and only if the ωi(∞) satisfy the GCC.<br />

Proof: Assume ω(∞) is a fixed point of geometric ICA in the expectation. Without loss of generality, let [β, β′] be the receptive field of ω_i(∞) such that β, β′ ∈ [0, π). Since ω(∞) is a fixed point of geometric ICA in the expectation, E( χ_{[β,β′]}(y(t)) sgn(y(t) − ω_i(∞)) ) = 0, where χ_{[β,β′]} denotes the characteristic function of that interval. But this means ∫_β^{ω_i(∞)} (−1) p_y(ϕ) dϕ + ∫_{ω_i(∞)}^{β′} 1 · p_y(ϕ) dϕ = 0, so ω_i(∞) satisfies GCC. The other direction follows similarly.

Lemma 1.2: The angles α_i = θ(a_i) satisfy the GCC.

Proof: It is enough to show that α₁ satisfies GCC. Let β := (α₁ + α_n − π)/2 and β′ := (α₁ + α₂)/2. Then the receptive field of α₁ can be written (modulo π) as F(α₁) = [β, β′]. Therefore, we have to show that α₁ is the median of p_y|F(α₁), which means ∫_β^{α₁} p_y(ϕ) dϕ = ∫_{α₁}^{β′} p_y(ϕ) dϕ. Using equation (1), the left-hand side can be expanded as ∫_β^{α₁} p_y(ϕ) dϕ = 2 |det B|^{-1} ∫_{K′} dx p_s(B^{-1} x), where K := θ^{-1}[β, α₁] denotes the cone of opening angle α₁ − β starting from angle β, and K′ = K × R^{n−2}. This implies ∫_β^{α₁} p_y(ϕ) dϕ = 2 ∫_{B^{-1}(K′)} ds p_s(s). Now note that the transformed extended cone B^{-1}(K′) is a cone ending at the s₁-axis of opening angle π/4, times R^{n−2}, because A is linear. Hence ∫_β^{α₁} p_y(ϕ) dϕ = 2 ∫_0^∞ ds₁ ∫_0^{s₁} ds₂ ∫_{R^{n−2}} ds₃ · · · ds_n p_s(s) = ∫_{α₁}^{β′} p_y(ϕ) dϕ, where we have used the same calculation for [α₁, β′] as for [β, α₁] at the last step.



[Figure 2: histogram over the mixture angle in degrees (0–180) with counts up to 4000; the peaks lie at the mixing angles 2°, 74°, 104° and 135°.]

Fig. 2. Approximated probability density function of a mixture of four speech signals using the mixing angles α_i = 2°, 74°, 104°, 135°. Plotted is the approximated density using a histogram with 180 bins; the thick line indicates the density after smoothing with a 5-degree-radius polynomial kernel.

In the proof we show that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1. Set ξ(ω) := (cos ω, sin ω)^⊤; then θ(ξ(ω)) = ω. Combining both lemmata, we have therefore shown:

Theorem 1.3: The set Φ of fixed po<strong>in</strong>ts of geometric<br />

matrix-recovery conta<strong>in</strong>s an element (ˆω1, . . . , ˆωn) such that<br />

the matrix (ξ(ˆω1)...ξ(ˆωn)) solves the overcomplete BMMR.<br />

The stable fixed po<strong>in</strong>ts <strong>in</strong> the above set Φ can be found by<br />

the geometric matrix-recovery algorithm. Furthermore, experiments<br />

confirm that <strong>in</strong> the special case of unimodal, symmetric<br />

and non-Gaussian signals, the set Φ consists of only two<br />

elements: a stable and an unstable fixed po<strong>in</strong>t, where the stable<br />

fixed po<strong>in</strong>t will be found by the algorithm. Then, depend<strong>in</strong>g<br />

on the kurtosis of the sources, either the stable (supergaussian<br />

case) or the <strong>in</strong>stable (subgaussian case) fixed po<strong>in</strong>t represents<br />

the image of the unit vectors. We have partially shown this <strong>in</strong><br />

the complete case, see [7], theorem 5:<br />

Theorem 1.4: If n = 2 and ps1 = ps2, then Φ conta<strong>in</strong>s only<br />

two elements given that py|[0, π) has exactly 4 local extrema.<br />

More elaborate studies of Φ are necessary to show full<br />

convergence, however the mathematics can be expected to be<br />

difficult. This can already be seen from the complicated and<br />

<strong>in</strong> higher dimensions yet unknown proofs of convergence of<br />

the related self-organiz<strong>in</strong>g-map algorithm [9].<br />

B. Turning the online algorithm into a batch algorithm

The above theory can be used to derive a batch-type learning algorithm, by testing all different receptive fields for the overcomplete GCC after histogram-based estimation of y. For simplicity let us assume that the cumulative distribution P_y of y is invertible. For ϕ = (ϕ₁, . . . , ϕ_{n−1}), define a function µ(ϕ) := ((γ₁(ϕ) + γ₂(ϕ))/2 − ϕ₁, . . . , (γ_n(ϕ) + γ₁(ϕ))/2 − ϕ_n), where γ_i(ϕ) := P_y^{-1}((P_y(ϕ_i) + P_y(ϕ_{i+1}))/2) is the median of y|[ϕ_i, ϕ_{i+1}] in [ϕ_i, ϕ_{i+1}] for i ∈ [1 : n] and ϕ_n := (ϕ_{n−1} + ϕ₁ + π)/2, ϕ_{n+1} := ϕ₁.

Lemma 1.5: If µ(ϕ) = 0 then the γ_i(ϕ) satisfy the GCC.

Proof: By definition, the receptive field of γ_i(ϕ) is given by [(γ_{i−1}(ϕ) + γ_i(ϕ))/2, (γ_i(ϕ) + γ_{i+1}(ϕ))/2]. Since µ(ϕ) = 0 implies (γ_i(ϕ) + γ_{i+1}(ϕ))/2 = ϕ_i, the receptive field of γ_i(ϕ) is precisely [ϕ_i, ϕ_{i+1}], and by construction γ_i(ϕ) is the median of y restricted to the above interval.

Algorithm (overcomplete FastGeo): Find the zeros of µ(ϕ).

Algorithmically, we may simply estimate y using a histogram and search for the zeros exhaustively or by discrete gradient descent. Note that for m = 2 this is precisely the complete FastGeo algorithm [7]. Again µ always has at least two zeros, representing the stable and the unstable fixed point of the neural algorithm, so for supergaussian sources we extract the stable fixed point by maximizing ∏_{i=1}^{n} p_y(γ_i(ϕ)). Similar to the complete case, histogram-based density approximation results in a quite ‘ragged’ distribution. Hence, zeros of µ are split up into multiple close-by zeros. This can be improved by smoothing the distribution using a kernel with sufficiently small radius. Too large kernel radii result in lower accuracy because the calculation of the median is only roughly independent of the kernel radius. In figure 2 we use a polynomial kernel of radius 5 degrees (zero everywhere else); one can see that indeed this smoothes the distribution nicely.
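The histogram-based machinery behind FastGeo can be sketched as follows for m = 2 mixtures: estimate the angular density of the projected mixtures, smooth it with a small polynomial kernel, and compute medians of y restricted to candidate receptive fields from the empirical cumulative distribution. The bin count, kernel shape and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def angle_density(X, n_bins=180, radius=5):
    """Smoothed histogram estimate of the density of y = theta(pi(x)) in [0, pi)."""
    theta = np.arctan2(X[:, 1], X[:, 0]) % np.pi
    hist, edges = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
    k = 1.0 - (np.arange(-radius, radius + 1) / (radius + 1.0)) ** 2   # polynomial kernel
    k /= k.sum()
    padded = np.r_[hist[-radius:], hist, hist[:radius]]                # circular padding
    smooth = np.convolve(padded, k, mode='same')[radius:-radius]
    return smooth / (smooth.sum() * (edges[1] - edges[0])), edges

def interval_median(density, edges, a, b):
    """Median of y restricted to [a, b], i.e. gamma_i, via the inverse empirical CDF."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = (centers >= a) & (centers < b)
    if not mask.any():
        return 0.5 * (a + b)
    cdf = np.cumsum(density[mask])
    return centers[mask][np.searchsorted(cdf, 0.5 * cdf[-1])]
```

The zeros of µ can then be searched over candidate boundary vectors ϕ by repeatedly evaluating interval_median on the receptive fields they define.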

C. Cluster<strong>in</strong>g <strong>in</strong> higher mixture dimensions<br />

We now extend any BSS algorithm work<strong>in</strong>g <strong>in</strong> lower mix<strong>in</strong>g<br />

dimension m ′ < m to dimension m us<strong>in</strong>g the simple idea of<br />

project<strong>in</strong>g the mixtures x onto different subspaces and then<br />

estimat<strong>in</strong>g A from the recovered projected matrices. We elim<strong>in</strong>ate<br />

scal<strong>in</strong>g <strong>in</strong>determ<strong>in</strong>acies by normalization and permutation<br />

<strong>in</strong>determ<strong>in</strong>acies by compar<strong>in</strong>g the source correlation matrices.<br />

1) Equivalence after projections: Let m ′ ∈ N with 1 <<br />

m ′ < m, and let M denote the set of all subsets of [1 : m]<br />

of size m ′ . For an element τ ∈ M let τ = {τ1, . . . , τm ′}<br />

such that 1 ≤ τ1 < . . . < τm ′ ≤ m and let πτ denote the<br />

ordered projection from R m onto those coord<strong>in</strong>ates. Consider<br />

the projected mix<strong>in</strong>g matrix Aτ := πτA. We will study<br />

how scal<strong>in</strong>g-equivalence behaves under projection, where two<br />

matrices A,B are said to be scal<strong>in</strong>g equivalent, A ∼s B if<br />

there exists an <strong>in</strong>vertible diagonal matrix L with A = BL.<br />

Lemma 1.6: Let τ^1, . . . , τ^k ∈ M such that ∪_i τ^i = [1 : m] and j ∈ ∩_i τ^i. If A_{τ^i} ∼s B_{τ^i} for all i and a_{jl} ≠ 0 for all l, then A ∼s B.
Proof: Fix a column l ∈ [1 : n], and let a := a_l, b := b_l. By assumption, for each i ∈ [1 : k] there exists λ_i ≠ 0 such that b_{τ^i} = λ_i a_{τ^i}. Index j occurs in all projections, so b_j = λ_i a_j for all i. Hence all λ_i =: λ coincide and b = λa.

This lemma gives the general idea how to comb<strong>in</strong>e matrices;<br />

now we will construct specific projections. Assume that the<br />

first row of A does not conta<strong>in</strong> any zeros. This is a very mild<br />

assumption because A was assumed to be full-rank, and the<br />

set of A’s with first row without zeros is dense <strong>in</strong> the set of<br />

full-rank matrices. As usual let ⌈λ⌉ denote the smallest integer larger or equal to λ ∈ R. Then let k := ⌈(m − 1)/(m′ − 1)⌉, and define τ^i := {1, 2 + (m′ − 1)(i − 1), . . . , 2 + (m′ − 1)i − 1} for i < k and τ^k := {1, m − m′ + 2, . . . , m}. Then ∪_i τ^i = [1 : m] and 1 ∈ ∩_i τ^i. Given (m′ × n)-matrices B^1, . . . , B^k with entries B^i = (b^i_{r,j}), define A^{B^1,...,B^k} to be composed of the columns (j ∈ [1 : n])

a_j := (1, b^1_{2,j}/b^1_{1,j}, . . . , b^1_{m′,j}/b^1_{1,j}, b^2_{2,j}/b^2_{1,j}, . . . , b^{k−1}_{m′,j}/b^{k−1}_{1,j}, b^k_{3+k(m′−1)−m,j}/b^k_{1,j}, . . . , b^k_{m′,j}/b^k_{1,j})^⊤.



Lemma 1.7: Let B^1, . . . , B^k be (m′ × n)-matrices such that A_{τ^i} ∼s B^i for i ∈ [1 : k]. Then A^{B^1,...,B^k} ∼s A.
Proof: By assumption there exist λ^i_l ∈ R \ {0} such that b^i_{j,l} = λ^i_l (A_{τ^i})_{j,l} for all i ∈ [1 : k], j ∈ [1 : m′] and l ∈ [1 : n], hence b^i_{j,l}/b^i_{1,l} = (A_{τ^i})_{j,l}/(A_{τ^i})_{1,l}. One can check that due to the choice of the τ^i's we then have (A^{B^1,...,B^k})_{j,l} = a_{j,l}/a_{1,l} for all j ∈ [1 : m] and therefore A^{B^1,...,B^k} ∼s A.

2) Reduction algorithm: The dimension reduction algorithm<br />

now is very simple. Pick k and τ1 , . . . , τk as <strong>in</strong> the<br />

previous section. Perform overcomplete BMMR with the projected<br />

mixtures πτ i(x) for i ∈ [1 : k] and get estimated mix<strong>in</strong>g<br />

matrices Bi . If this recovery has been carried out without any<br />

error, then every Bi is equivalent to Aτ i. Due to permutations,<br />

they might however not be scal<strong>in</strong>g-equivalent. Therefore do<br />

the follow<strong>in</strong>g iteratively for each i ∈ [1 : k]: Apply the overcomplete<br />

source-recovery, see next section, to πτ i(x) us<strong>in</strong>g<br />

Bi and get recovered sources si . For all j < i, consider the<br />

absolute crosscorrelation matrices (|Cor(s^i_r, s^j_s)|)_{r,s}. The row

positions of the maxima of this matrix are pairwise different<br />

because the orig<strong>in</strong>al sources were chosen to be <strong>in</strong>dependent.<br />

Thereby we get a permutation matrix Pi <strong>in</strong>dicat<strong>in</strong>g how to<br />

permute B^i, C^i := B^i P^i, so that the new source correlation matrices are diagonal. Finally, we have constructed matrices C^i such that there exists a permutation P independent of i with C^i ∼s A_{τ^i} P for all i ∈ [1 : k]. Now we can apply lemma 1.7 and get a matrix A^{C^1,...,C^k} with A^{C^1,...,C^k} ∼s AP and therefore A^{C^1,...,C^k} ∼ A as desired.
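The column-wise construction of A^{B^1,...,B^k} is mechanical once the permutations have been matched; the following NumPy sketch (our own illustration, with hypothetical names) normalises every column of each projected estimate by its first-row entry and merges the rows according to the index sets τ^i, exactly as in the formula above.

```python
import numpy as np

def combine_projections(Bs, taus, m):
    """Assemble an (m x n) estimate of A from projected estimates B^i.

    Bs   : list of (m' x n) matrices, each assumed scaling-equivalent to the
           projection of A onto the 0-based coordinate set taus[i]
    taus : list of index tuples, all containing coordinate 0, covering range(m)
    Dividing every column by its first entry removes the unknown per-column
    scalings before the rows are merged (cf. lemma 1.7).
    """
    n = Bs[0].shape[1]
    A_hat = np.empty((m, n))
    A_hat[0] = 1.0                           # first row of A assumed nonzero
    for B, tau in zip(Bs, taus):
        Bn = B / B[0]                        # column j divided by b^i_{1,j}
        for row, coord in enumerate(tau):
            A_hat[coord] = Bn[row]
    return A_hat

# toy check: project a random 4 x 3 matrix onto {0,1,2} and {0,2,3}, rescale the
# columns arbitrarily, and recover A up to the common column scaling 1/a_{1,j}
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, (4, 3))
taus = [(0, 1, 2), (0, 2, 3)]
Bs = [A[list(t)] * rng.uniform(0.5, 2.0, 3) for t in taus]
print(np.allclose(combine_projections(Bs, taus, 4), A / A[0]))   # True
```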

II. BLIND SOURCE RECOVERY<br />

Us<strong>in</strong>g the results from the BMMR step, we can assume<br />

that an estimate of A has been found. In order to solve the<br />

overcomplete BSS problem, we are therefore left with the task<br />

of reconstruct<strong>in</strong>g the sources us<strong>in</strong>g the mixtures x and the<br />

estimated matrix (BSR). S<strong>in</strong>ce A has full rank, the equation<br />

x(t) = As(t) yields the (n − m)-dimensional aff<strong>in</strong>e vector<br />

space A−1 {x(t)} as solution space for s(t). Hence, if n ><br />

m the source-recovery problem is ill-posed without further<br />

assumptions. Us<strong>in</strong>g a maximum likelihood approach [4], [5]<br />

an appropriate assumption can be derived:<br />

Given a prior probability p^0_s on the sources, it can be seen quickly [4], [10] that the most likely source sample is recovered by s = argmax_{x=As} p^0_s. Depending on the assumptions on the prior p^0_s of s, we get different optimization criteria. In the experiments we will assume a simple prior p^0_s ∝ exp(−|s|_p) with any p-norm |·|_p. Then s = argmin_{x=As} |s|_p, which can be solved linearly in the Gaussian case p = 2 and by linear programming or a shortest-path decomposition in the sparse, Laplacian case p = 1, see [5], [10].
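As an illustration of the two priors mentioned above, the following sketch (our own, using SciPy's generic LP solver rather than the shortest-path decomposition of [5]) recovers sources by minimum 2-norm and by minimum 1-norm subject to x = As.

```python
import numpy as np
from scipy.optimize import linprog

def recover_sources(A, x, p=1):
    """Maximum-likelihood source recovery s = argmin_{x = A s} |s|_p.

    p = 2: minimum-norm solution via the pseudo-inverse (Gaussian prior).
    p = 1: sparse solution via linear programming (Laplacian prior), using
           the standard split s = u - v with u, v >= 0.
    x may be a single mixture vector of length m or an (m x T) matrix.
    """
    x = np.atleast_2d(x.T).T                         # ensure shape (m, T)
    m, n = A.shape
    if p == 2:
        return np.linalg.pinv(A) @ x
    S = np.zeros((n, x.shape[1]))
    c = np.ones(2 * n)                               # minimise sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                        # A (u - v) = x(t)
    for t in range(x.shape[1]):
        res = linprog(c, A_eq=A_eq, b_eq=x[:, t], bounds=(0, None), method="highs")
        S[:, t] = res.x[:n] - res.x[n:]
    return S
```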

III. EXPERIMENTAL RESULTS<br />

In order to compare the mixture matrix A with the recovered matrix B from the BMMR step, we calculate the generalized crosstalking error E(A, B) of A and B defined by E(A, B) := min_{M∈Π} ‖A − BM‖, where the minimum is taken over the group Π of all invertible matrices having only one non-zero entry per column and ‖·‖ denotes some matrix norm. It vanishes if and only if A and B are equivalent [10].
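For small n the generalized crosstalking error can be evaluated by brute force; the sketch below (ours, choosing the Frobenius norm for the otherwise unspecified matrix norm) enumerates all column permutations and solves for the optimal per-column scalings in closed form.

```python
import numpy as np
from itertools import permutations

def crosstalking_error(A, B):
    """Generalised crosstalking error E(A, B) = min_M ||A - B M||_F.

    Brute-force illustration: M ranges over products of a permutation and an
    invertible diagonal matrix; for every permutation the optimal column
    scalings follow from least squares.  Feasible only for small n.
    """
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        Bp = B[:, perm]
        # optimal scaling of column j of Bp onto column j of A
        lam = np.sum(A * Bp, axis=0) / np.maximum(np.sum(Bp * Bp, axis=0), 1e-12)
        best = min(best, np.linalg.norm(A - Bp * lam))
    return best
```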

TABLE I: PERFORMANCE OF BMMR ALGORITHMS (n = 3, m = 2, 100 RUNS)

algorithm                              mean E(A, Â)   deviation σ
FastGeo (kernel r = 5, approx. 0.1)        0.60           0.60
FastGeo (kernel r = 0, approx. 0.5)        0.40           0.46
FastGeo (kernel r = 5, approx. 0.5)        0.29           0.42
Soft-LOST (p = 0.01)                       0.68           0.57

The overcomplete FastGeo algorithm is applied to 4 speech<br />

signals s, mixed by a (2 × 4)-mix<strong>in</strong>g matrix A with coefficients<br />

uniformly drawn from [−1, 1], see figure 2 for their<br />

mixture density. The algorithm estimates the matrix well with<br />

E(A, Â) = 0.68, and BSR by 1-norm m<strong>in</strong>imization yields<br />

recovered sources with a mean SNR of only 2.6dB when<br />

compared with the orig<strong>in</strong>al sources; as noted before [5], [10],<br />

without sparsification for <strong>in</strong>stance by FFT, source-recovery<br />

is difficult. To analyze the overcomplete FastGeo algorithm<br />

more generally, we perform 100 Monte-Carlo runs using high-kurtotic gamma-distributed three-dimensional sources with 10^4 samples, mixed by a (2 × 3)-mixing matrix with weights

uniformly chosen from [−1, 1]. In table I, the mean of the<br />

performance <strong>in</strong>dex depend<strong>in</strong>g on various parameters is presented.<br />

Noting that the mean error when using random (2 × 3)-matrices

with coefficients uniformly taken from [−1, 1] is<br />

E = 1.9 ± 0.73, we observe good performance, especially<br />

for a larger kernel radius and higher approximation parameter<br />

(E = 0.29), also compared with Soft-LOST’s E = 0.68 [6].<br />

As an example <strong>in</strong> higher mixture dimension three speech<br />

signals are mixed by a column-normalized (3 × 3) mix<strong>in</strong>g<br />

matrix A. For n = m = 3, m ′ = 2, the projection<br />

framework simplifies to k = 2 with projections π {1,2} and<br />

π_{1,3}. Overcomplete geometric ICA is performed with 5·10^4 sweeps. The recoveries of the projected matrices π_{1,2}A and π_{1,3}A are quite good with E(π_{1,2}A, B^1) = 0.084 and E(π_{1,3}A, B^2) = 0.10. Taking out the permutations as described before, we get a recovered mixing matrix A^{B^1,B^2} with low generalized crosstalking error of E(A, A^{B^1,B^2}) = 0.15 (compared with a mean random error of E = 3.2 ± 0.7).

REFERENCES<br />

[1] A. Hyvär<strong>in</strong>en, J. Karhunen, and E. Oja, “<strong>Independent</strong> component<br />

analysis,” John Wiley & Sons, 2001.<br />

[2] A. Cichocki and S. Amari, Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g.<br />

John Wiley & Sons, 2002.<br />

[3] J. Eriksson and V. Koivunen, “Identifiability, separability and uniqueness<br />

of l<strong>in</strong>ear ICA models,” IEEE Signal Process<strong>in</strong>g Letters, vol. 11, no. 7,<br />

pp. 601–604, 2004.<br />

[4] T. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, “Bl<strong>in</strong>d source<br />

separation of more sources than mixtures us<strong>in</strong>g overcomplete representations,”<br />

IEEE Signal Process<strong>in</strong>g Letters, vol. 6, no. 4, pp. 87–90, 1999.<br />

[5] P. Bofill and M. Zibulevsky, “Underdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation<br />

us<strong>in</strong>g sparse representations,” Signal Process<strong>in</strong>g, vol. 81, pp. 2353–2362,<br />

2001.<br />

[6] P. O’Grady and B. Pearlmutter, “Soft-LOST: EM on a mixture of<br />

oriented l<strong>in</strong>es,” <strong>in</strong> Proc. ICA 2004, ser. Lecture Notes <strong>in</strong> Computer<br />

Science, vol. 3195, Granada, Spa<strong>in</strong>, 2004, pp. 430–436.<br />

[7] F. Theis, A. Jung, C. Puntonet, and E. Lang, “L<strong>in</strong>ear geometric ICA:<br />

Fundamentals and algorithms,” Neural Computation, vol. 15, pp. 419–<br />

439, 2003.<br />

[8] C. Puntonet and A. Prieto, “An adaptive geometrical procedure for bl<strong>in</strong>d<br />

separation of sources,” Neural Process<strong>in</strong>g Letters, vol. 2, 1995.<br />

[9] M. Benaim, J.-C. Fort, and G. Pagés, “Convergence of the onedimensional<br />

Kohonen algorithm,” Adv. Appl. Prob., vol. 30, pp. 850–<br />

869, 1998.<br />

[10] F. Theis, E. Lang, and C. Puntonet, “A geometric algorithm for overcomplete<br />

l<strong>in</strong>ear ICA,” Neurocomput<strong>in</strong>g, vol. 56, pp. 381–398, 2004.




Chapter 18<br />

Proc. EUSIPCO 2006<br />

Paper P. Gruber and F.J. Theis. Grassmann cluster<strong>in</strong>g. In Proc. EUSIPCO 2006,<br />

Florence, Italy, 2006<br />

Reference (Gruber and Theis, 2006)<br />

Summary <strong>in</strong> section 1.5.3<br />




ABSTRACT<br />

An important tool <strong>in</strong> high-dimensional, explorative data m<strong>in</strong><strong>in</strong>g<br />

is given by cluster<strong>in</strong>g methods. They aim at identify<strong>in</strong>g<br />

samples or regions of similar characteristics, and often code<br />

them by a s<strong>in</strong>gle codebook vector or centroid. One of the<br />

most commonly used partitional cluster<strong>in</strong>g techniques is the<br />

k-means algorithm, which <strong>in</strong> its batch form partitions the data<br />

set <strong>in</strong>to k disjo<strong>in</strong>t clusters by simply iterat<strong>in</strong>g between cluster<br />

assignments and cluster updates. The latter step implies<br />

calculat<strong>in</strong>g a new centroid with<strong>in</strong> each cluster. We generalize<br />

the concept of k-means by apply<strong>in</strong>g it not to the standard<br />

Euclidean space but to the manifold of subvectorspaces<br />

of a fixed dimension, also known as the Grassmann manifold.<br />

Important examples <strong>in</strong>clude projective space i.e. the<br />

manifold of l<strong>in</strong>es and the space of all hyperplanes. Detect<strong>in</strong>g<br />

clusters <strong>in</strong> multiple samples drawn from a Grassmannian<br />

is a problem aris<strong>in</strong>g <strong>in</strong> various applications. In this manuscript,<br />

we provide correspond<strong>in</strong>g metrics for a Grassmann<br />

k-means algorithm, and solve the centroid calculation problem<br />

explicitly <strong>in</strong> closed form. An application to nonnegative<br />

matrix factorization illustrates the feasibility of the proposed<br />

algorithm.<br />

1. PARTITIONAL CLUSTERING<br />

Many algorithms for cluster<strong>in</strong>g i.e. the detection of common<br />

features with<strong>in</strong> a data set are discussed <strong>in</strong> the literature. In<br />

the follow<strong>in</strong>g, we will study cluster<strong>in</strong>g with<strong>in</strong> the framework<br />

of k-means [2].<br />

In general, its goal can be described as follows: Given a<br />

set A of po<strong>in</strong>ts <strong>in</strong> some metric space (M,d), f<strong>in</strong>d a partition of<br />

A <strong>in</strong>to disjo<strong>in</strong>t non-empty subsets Bi, �<br />

i Bi = A, together with<br />

centroids ci ∈ M so as to m<strong>in</strong>imize the sum of the squares<br />

of the distances of each po<strong>in</strong>t of A to the centroid ci of the<br />

cluster Bi conta<strong>in</strong><strong>in</strong>g it. In other words, m<strong>in</strong>imize<br />

E(B_1, c_1, . . . , B_k, c_k) := ∑_{i=1}^{k} ∑_{a∈B_i} d(a, c_i)^2 .   (1)

If the set A contains only finitely many elements a_1, . . . , a_T, then this can be easily re-formulated as a constrained non-linear optimization problem: minimize

E(W, C) := ∑_{i=1}^{k} ∑_{t=1}^{T} w_{it} d(a_t, c_i)^2   (2)

subject to

w_{it} ∈ {0, 1},   ∑_{i=1}^{k} w_{it} = 1   for 1 ≤ i ≤ k, 1 ≤ t ≤ T.   (3)

Here C := {c1,...,ck} are the centroid locations, and W :=<br />

(wit) is the partition matrix correspond<strong>in</strong>g to the partition Bi<br />

of A.<br />

A common approach to m<strong>in</strong>imiz<strong>in</strong>g (2) subject to (3) is<br />

partial optimization for W and C, i.e. alternat<strong>in</strong>g m<strong>in</strong>imization<br />

of either W and C while keep<strong>in</strong>g the other one fixed.<br />

The batch k-means algorithm employs precisely this strategy:<br />

After an <strong>in</strong>itial, random choice of centroids c1,...,ck,<br />

it iterates between the follow<strong>in</strong>g two steps until convergence<br />

measured by a suitable stopp<strong>in</strong>g criterion:<br />

• cluster assignment: for each a_t determine an index i(t) such that

  i(t) = argmin_i d(a_t, c_i)   (4)

• cluster update: within each cluster B_i := {a_t | i(t) = i} determine the centroid c_i by minimizing

  c_i := argmin_c ∑_{a∈B_i} d(a, c)^2   (5)

The cluster assignment step corresponds to m<strong>in</strong>imiz<strong>in</strong>g<br />

(2) for fixed C, which means choos<strong>in</strong>g the partition W such<br />

that each element of A is assigned to the i-th cluster if ci is<br />

the closest centroid. In the cluster update step, (2) is m<strong>in</strong>imized<br />

for fixed partition W, imply<strong>in</strong>g that ci is constructed<br />

as centroid with<strong>in</strong> the i-th cluster; this <strong>in</strong>deed corresponds to<br />

m<strong>in</strong>imiz<strong>in</strong>g E(W,C) for fixed W because <strong>in</strong> this case the<br />

cost function is a sum of functions depending on different parameters,

so we can m<strong>in</strong>imize them separately lead<strong>in</strong>g to the<br />

centroid equation (5). This general update rule converges to<br />

a local m<strong>in</strong>imum under rather weak conditions [3, 7].<br />
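The alternating scheme is easy to state generically; the following sketch (our own notation, not code from the paper) implements batch k-means on an arbitrary metric space, taking the distance and the centroid computation as callables. The Euclidean special case discussed next simply uses the cluster mean as the centroid.

```python
import numpy as np

def batch_kmeans(points, k, dist, centroid, n_iter=100, seed=0):
    """Generic batch k-means on a metric space (M, dist).

    points   : list of elements of M
    dist     : callable dist(a, c) returning a distance
    centroid : callable mapping a non-empty list of elements to their centroid,
               i.e. the minimiser of the summed squared distances in a cluster
    """
    rng = np.random.default_rng(seed)
    centroids = [points[i] for i in rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # cluster assignment: nearest centroid for every point
        labels = [int(np.argmin([dist(a, c) for c in centroids])) for a in points]
        # cluster update: recompute the centroid within each non-empty cluster
        new_centroids = []
        for i in range(k):
            cluster = [a for a, lab in zip(points, labels) if lab == i]
            new_centroids.append(centroid(cluster) if cluster else centroids[i])
        centroids = new_centroids
    return centroids, labels

# Euclidean special case: the centroid is simply the cluster mean
euclidean = lambda a, c: np.linalg.norm(a - c)
mean_centroid = lambda cluster: np.mean(cluster, axis=0)
```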

An important special case is given by M := R^n and the Euclidean distance d(x, y) := ‖x − y‖. The centroids from equation (5) can then be calculated in closed form, and each centroid is simply given by the cluster mean c_i := (1/|B_i|) ∑_{a∈B_i} a; this follows directly from

∑_{a∈B_i} ‖a − c_i‖^2 = ∑_{a∈B_i} ∑_{j=1}^{n} (a_j − c_{ij})^2 = ∑_{j=1}^{n} ∑_{a∈B_i} (a_j^2 − 2 a_j c_{ij} + c_{ij}^2),

which can be minimized separately for each coordinate j and is minimal with respect to c_{ij} if the derivative of the quadratic function is zero, so if |B_i| c_{ij} = ∑_{a∈B_i} a_j.

In the follow<strong>in</strong>g, we are <strong>in</strong>terested <strong>in</strong> more complex metric<br />

spaces. Typically, k-means can be implemented efficiently,<br />

if the cluster centroids can be calculated quickly. In<br />

the example of R^n, we saw that it was crucial to minimize

the square distances and to use the Euclidean distance.<br />

Hence we will study metrics which also allow a closed-form<br />

centroid solution.



The data space of <strong>in</strong>terest will consist of subspaces of<br />

R n , and the goal is to f<strong>in</strong>d subspace clusters. We will only be<br />

deal<strong>in</strong>g with sub-vector-spaces; extensions to the aff<strong>in</strong>e case<br />

are discussed <strong>in</strong> section 3.3.<br />

A somewhat related method is the so-called k-plane cluster<strong>in</strong>g<br />

algorithm [4], which does not cluster subspaces but<br />

solves the problem of fitt<strong>in</strong>g hyperplanes <strong>in</strong> R n to a given<br />

po<strong>in</strong>t set A ⊂ R n . A hyperplane H ⊂ R n can be described by<br />

H = {x|c ⊤ x = 0} = c ⊥ for some normal vector c, typically<br />

chosen such that �c� = 1. Bradley and Mangasarian [4] essentially<br />

choose the pseudo-metric d(a,b) := |a ⊤ b| on the<br />

sphere S^{n−1} := {x ∈ R^n | ‖x‖ = 1} — the data can be assumed

to lie on the sphere after normalization, which does<br />

not change cluster conta<strong>in</strong>ment. They show that the centroid<br />

equation (5) is solved by any eigenvector of the cluster correlation<br />

BiBi ⊤ correspond<strong>in</strong>g to the m<strong>in</strong>imal eigenvalue, if by<br />

abuse of notation Bi is to <strong>in</strong>dicate the (n × |Bi|)-matrix conta<strong>in</strong><strong>in</strong>g<br />

the elements of the set Bi <strong>in</strong> its columns. Alternative<br />

approaches to this subspace cluster<strong>in</strong>g problem are reviewed<br />

<strong>in</strong> [6].<br />

2. PROJECTIVE CLUSTERING<br />

A first step towards general subspace clustering is to consider one-dimensional subspaces, i.e. lines. Let RP^n denote the space of one-dimensional real vector subspaces of R^{n+1}. It is equivalent to S^n after identifying antipodal points, so it has the quotient representation RP^n = S^n/{−1, 1}. We will represent lines by their equivalence class [x] := {λx | λ ∈ R} for x ≠ 0. A metric can be defined by

d_0([x], [y]) := √( 1 − ( x^⊤y / (‖x‖ ‖y‖) )^2 ) .   (6)

Clearly d_0 is symmetric, and positive definite according to the Cauchy–Schwarz inequality.

Conveniently, the cluster centroid of cluster Bi is given by<br />

any eigenvector of the cluster correlation BiBi ⊤ correspond<strong>in</strong>g<br />

to the largest eigenvalue. In section 3.1, we will show that<br />

projective cluster<strong>in</strong>g is a special case of a more general cluster<strong>in</strong>g<br />

and hence the derivation of the correspond<strong>in</strong>g centroid<br />

cluster<strong>in</strong>g algorithm will be postponed until later.<br />

Figure 1 shows an example application of the projective<br />

k-means algorithm. Note that the projective k-means can be<br />

directly applied to the dual problem of cluster<strong>in</strong>g hyperplanes<br />

by us<strong>in</strong>g the description via their normal ‘l<strong>in</strong>es’.<br />
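In code, the projective distance (6) and its centroid (the dominant eigenvector of the cluster correlation) take only a few lines; this sketch is our own and plugs directly into the generic batch k-means sketch given in the previous section.

```python
import numpy as np

def projective_distance(x, y):
    """d_0([x],[y]) = sqrt(1 - (x^T y / (|x| |y|))^2) between two lines."""
    c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 1.0 - c * c))

def projective_centroid(cluster):
    """Centroid of a set of lines: the eigenvector of sum_i v_i v_i^T with the
    largest eigenvalue (each line represented by any nonzero generator v_i)."""
    V = np.array([v / np.linalg.norm(v) for v in cluster])   # rows: unit vectors
    _, E = np.linalg.eigh(V.T @ V)                           # ascending eigenvalues
    return E[:, -1]
```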

3. GRASSMANN CLUSTERING<br />

More <strong>in</strong>terest<strong>in</strong>gly, we would like to perform cluster<strong>in</strong>g <strong>in</strong><br />

the Grassmann manifold Gn,p of p-dimensional vector subspaces<br />

of R n for 0 ≤ p ≤ n. If Vn,p denotes the Stiefel<br />

manifold consist<strong>in</strong>g of orthonormal matrices for n ≥ p, then<br />

Gn,p has the natural quotient representation Gn,p = Vn,p/Op,<br />

where Op := Vp,p denotes the orthogonal group. This representation<br />

simply means that any p-dimensional subspace<br />

of R n is given by p orthonormal vectors, i.e. by a basis<br />

V ∈ Vn,p, which is unique except for right multiplication by<br />

an orthogonal matrix. We will also write [V] for the subspace.<br />

The geometric properties of optimization algorithms on<br />

Gn,p are nicely discussed by Edelman et al. [5]. They<br />

also summarize various metrics on the Grassmann manifold,<br />


Figure 1: Illustration of projective k-means cluster<strong>in</strong>g <strong>in</strong><br />

three dimensions. 10^5 samples from a 4-dimensional

strongly supergaussian distribution are projected onto three<br />

dimensions and serve as the generators of the l<strong>in</strong>es. These<br />

were nicely clustered <strong>in</strong>to k = 4 centroids, located at the density<br />

axes.<br />

which can all be naturally derived from the geodesic metric (arc length) induced by the natural Riemannian structure of Gn,p. Some equivalence relations between the metrics are known, but for computational purposes, we choose the very easy to calculate so-called projection F-norm given by

d([V], [W]) := 2^{−1/2} ‖VV^⊤ − WW^⊤‖_F ,   (7)

where ‖V‖_F := √(tr(VV^⊤)) denotes the Frobenius norm of a matrix. Note that the projection F-norm is indeed well-defined, as (7) does not depend on the choice of class representatives.

In order to perform k-means clustering on (Gn,p, d), we have to solve the centroid problem (5). One of our main results is that the centroid [Ci] of subspaces of some cluster Bi is spanned by the p eigenvectors corresponding to the largest eigenvalues of the generalized cluster covariance (1/|Bi|) ∑_{[V]∈Bi} VV^⊤ (as the derivation in section 3.1 shows). This generalizes the projective and the hyperplane k-means algorithms from above.
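Concretely, the projection F-norm distance and this centroid rule can be sketched as follows (our own illustration; subspaces are represented by orthonormal n × p bases, and the two callables can be passed to the generic batch k-means sketch from the partitional-clustering section above).

```python
import numpy as np

def grassmann_distance(V, W):
    """Projection F-norm distance between subspaces [V], [W] (orthonormal bases)."""
    return np.linalg.norm(V @ V.T - W @ W.T) / np.sqrt(2.0)

def grassmann_centroid(cluster):
    """Centroid of a cluster of p-dimensional subspaces: an orthonormal basis
    of the span of the p dominant eigenvectors of sum_i V_i V_i^T."""
    p = cluster[0].shape[1]
    M = sum(V @ V.T for V in cluster)
    _, E = np.linalg.eigh(M)          # eigenvalues in ascending order
    return E[:, -p:]                  # p eigenvectors of the largest eigenvalues
```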

3.1 Calculat<strong>in</strong>g the optimal centroids<br />

For the cluster update step of the batch k-means algorithm<br />

we need to find [C] such that

f(C) := ∑_{i=1}^{l} d([Vi], [C])^2

for l subspaces [Vi] represented by Vi ∈ V(n, p) is minimal, subject to g(C) := C^⊤C = Ip (pseudo orthogonality). We may also assume that the Vi are pseudo-orthonormal, Vi^⊤Vi = Ip.
It is easy to see that

f(C) = 2^{−1/2} ( tr(∑_i Vi Vi^⊤) + tr(l CC^⊤ − 2 CC^⊤ ∑_i Vi Vi^⊤) )
     = 2^{−1/2} ( tr D + tr((l In − 2V) CC^⊤) ),

where V := ∑_i Vi Vi^⊤ and E D E^⊤ = V denotes the eigenvalue decomposition of V with E orthonormal and D diagonal.



Figure 2: Illustration of the convex subsets on which the equation ∑_{i=1}^{n} d_ii x_i for given D is optimized. Here n = 4 and the surfaces for p = 1, . . . , 3 are depicted (normalized onto the standard simplex); the labelled vertices are (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (.5, 0, .5, 0), (0, .5, .5, 0) and (.3, .3, 0, .3).

This means that

f(C) = 2^{−1/2} ( ∑_{i=1}^{n} d_ii + l p − 2 tr(D E^⊤ C C^⊤ E) )
     = 2^{−1/2} ( ∑_{i=1}^{n} d_ii + l p − 2 ∑_{i=1}^{n} d_ii x_ii ),

where d_ij are the matrix elements of D, and x_ij of X = E^⊤ C C^⊤ E.

Here tr X = tr(CC^⊤) = p for pseudo orthogonal C (p eigenvectors C with eigenvalue 1) and all 0 ≤ x_ii ≤ 1 (again pseudo orthogonality). Hence this is a linear optimization problem on a convex set (see also figure 2) and therefore any optimum is located at the corners of the convex set, which in our case are {x ∈ {0,1}^n | ∑_{i=1}^{n} x_i = p}. If we assume that the d_ii are ordered in descending order, then a minimum of f is given by

CC^⊤ = E X E^⊤ = E [ Ip 0 ; 0 0 ] E^⊤ ,

which corresponds to

C = E [ Ip ; 0 ].

In this calculation we can also see the <strong>in</strong>determ<strong>in</strong>acies of<br />

the optimization:<br />

1. If two or more eigenvalues of V are equal, any po<strong>in</strong>t on<br />

the correspond<strong>in</strong>g edge of the convex set is optimal and<br />

hence the centroid can vary along the subspace generated<br />

by the correspond<strong>in</strong>g eigenvectors E<br />

2. If some eigenvalues of V are zero, a similar <strong>in</strong>determ<strong>in</strong>acy<br />

occurs.<br />

An example <strong>in</strong> RP 2 is demonstrated <strong>in</strong> figure 3.<br />

Figure 3: Let Vi be two samples which are orthogonal (w.l.o.g. we can assume Vi = ei, represented by the unit vectors; the figure annotates the distances d_0(e_1, x) and d_0(e_2, x) of a line x = (x_1, x_2) to the two axes). Hence V = ∑_i Vi Vi^⊤ has degenerate eigenstructure. Then the quantisation error is given by d(e_1, x)^2 + d(e_2, x)^2, which is here (1/2)(2 + 2 − 2 tr(I XX^⊤)) = 2 − x_1^2 − x_2^2 = 1 for X represented by x = (x_1, x_2). Hence any X is a centroid in the sense of the batch k-means algorithm.

3.2 Relationship to projective cluster<strong>in</strong>g<br />

The distance d_0 on RP^n from above (equation (6)) was defined as

d_0(V, W) = √( 1 − ( V^⊤W / (‖V‖ ‖W‖) )^2 ) ,

if according to our previous notation [V], [W] ∈ Gn,1 = RP^n. Note that if the two vectors represent time series, then this is the same as the correlation between the two.
It is now easy to see that this distance coincides with the definition of d on the general Grassmannian from above. Let V, W ∈ V(n, 1) be two vectors. We may assume that V^⊤V = W^⊤W = 1. Then

2 d(V, W)^2 = tr(VV^⊤VV^⊤ + WW^⊤WW^⊤ − VV^⊤WW^⊤ − WW^⊤VV^⊤)
            = tr(VV^⊤) + tr(WW^⊤) − 2 tr(V(V^⊤W)W^⊤).

All matrices have rank 1 and hence the trace is the sole nonzero eigenvalue. Since VV^⊤V = V it is 1 for the first matrix, similar for the second and W^⊤V for the third, because VW^⊤V = (W^⊤V)V. Hence

2 d(V, W)^2 = 2 − 2 (W^⊤V)^2 = 2 d_0(V, W)^2 .

3.3 Deal<strong>in</strong>g with aff<strong>in</strong>e spaces<br />

So far we only have dealt with the special case of cluster<strong>in</strong>g<br />

subspaces, i.e. l<strong>in</strong>ear subsets which conta<strong>in</strong> the orig<strong>in</strong>. But<br />

<strong>in</strong> practice the problem of cluster<strong>in</strong>g aff<strong>in</strong>e subspaces arises,<br />

see for example 5. This can be dealt with quite easily.<br />

Let F be a p dimensional aff<strong>in</strong>e l<strong>in</strong>ear subset of R n . Then<br />

F can be characterized by p + 1 po<strong>in</strong>ts v0,...,vp such that<br />




v_1 − v_0, . . . , v_p − v_0 are linearly independent. Consider the

follow<strong>in</strong>g embedd<strong>in</strong>g<br />

R^n → R^{n+1} : (x_1, . . . , x_n) ↦ (x_1, . . . , x_n, 1).

We may therefore identify the p dimensional aff<strong>in</strong>e subspaces<br />

with the (p + 1)-dimensional linear subspaces in R^{n+1} by embedding

the generators and tak<strong>in</strong>g the l<strong>in</strong>ear closure. In fact it<br />

is easy to see that we obta<strong>in</strong> a 1-to-1 mapp<strong>in</strong>g between the<br />

p-dimensional affine subspaces of R^n and the (p + 1)-dimensional linear subspaces in R^{n+1}, which intersect the orthogonal

complement of (0,...,0,1) only at the orig<strong>in</strong>.<br />

Hence we can reduce the aff<strong>in</strong>e case to calculations for<br />

l<strong>in</strong>ear subsets only. Note that s<strong>in</strong>ce only eigenvectors of sums<br />

of projections onto the subsets Vi can become centroids <strong>in</strong><br />

the batch version of the k-means algorithm, any centroid is<br />

also <strong>in</strong> the image of the above embedd<strong>in</strong>g and can be identified<br />

uniquely with an affine subspace of the original problem.
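A small sketch of this embedding (ours; the generator points are assumed to be in general position):

```python
import numpy as np

def embed_affine(points):
    """Orthonormal basis of the (p+1)-dimensional linear subspace of R^(n+1)
    spanned by the embedded generators (x, 1) of a p-dimensional affine
    subspace; `points` is a list of p+1 vectors in R^n."""
    X = np.vstack([np.append(x, 1.0) for x in points]).T    # (n+1) x (p+1)
    Q, _ = np.linalg.qr(X)
    return Q

# example: two points in R^2 span a line; its embedding is a plane in R^3
V = embed_affine([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(V.shape)   # (3, 2): directly usable with the Grassmann k-means sketch
```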

4. EXPERIMENTAL RESULTS<br />

We f<strong>in</strong>ish by illustrat<strong>in</strong>g the algorithm <strong>in</strong> a few examples.<br />

4.1 Toy example<br />

As a toy example, let us first consider 10^4 samples of

G4,2, namely uniformly randomly chosen from the 6 possible<br />

2-dimensional coord<strong>in</strong>ate planes. In order to avoid<br />

any bias with<strong>in</strong> the algorithm, the non-zero coefficients from<br />

the plane-represent<strong>in</strong>g matrices have been chosen uniformly<br />

from O2. The samples have been deteriorated by Gaussian<br />

noise with a signal-to-noise ratio of 10dB. Application of<br />

the Grassmann k-means algorithm with k = 6 yields convergence<br />

after only 6 epochs with the result<strong>in</strong>g 6 clusters<br />

with centroids [V i ]. The distance measure µ(V) := (|vi1 +<br />

vi2| + |vi1 − vi2|)i should be large only <strong>in</strong> two coord<strong>in</strong>ates if<br />

[V] is close to the correspond<strong>in</strong>g 2-dimensional coord<strong>in</strong>ate<br />

plane. And <strong>in</strong>deed, the found centroids have distance measures<br />

µ(V^i) = (0.02, 0, 1.9, 1.9)^⊤, (1.7, 0.01, 0.01, 1.7)^⊤, (1.7, 0.01, 1.7, 0.02)^⊤, (0.01, 1.5, 1.5, 0)^⊤, (2.0, 2.0, 0, 0.01)^⊤, (0.01, 2.0, 0.01, 2.0)^⊤.

Hence, the algorithm correctly chose all 6 coord<strong>in</strong>ate planes<br />

as cluster centroids.<br />

4.2 Polytope identification<br />

As an example application of the Grassmann cluster<strong>in</strong>g algorithm,<br />

we want to solve the follow<strong>in</strong>g approximation problem<br />

from computational geometry: given a set of po<strong>in</strong>ts, identify<br />

the smallest convex polytope with a fixed number of faces<br />

k, conta<strong>in</strong><strong>in</strong>g the po<strong>in</strong>ts. In two dimensions, this implies the<br />

task of f<strong>in</strong>d<strong>in</strong>g the k edges of a polytope where only samples<br />

in the inside are known. We use the QHull algorithm [1] to

construct the convex hull thus identify<strong>in</strong>g the possible edges<br />

of the polytope. Then, we apply aff<strong>in</strong>e Grassmann k-means<br />

cluster<strong>in</strong>g to these edges <strong>in</strong> order to identify the k bound<strong>in</strong>g<br />

edges. Figure 4 shows an example. Generalization to arbitrary dimensions is straightforward.

Figure 4: An example of using hyperplane clustering (p = n − 1) to identify the contour of a sampled figure; panels: (a) samples, (b) QHull contour, (c) Grassmann clustering, (d) result. QHull was used to find the outer edges, which were then clustered into 4 clusters. The broken lines show the boundaries used to generate the 300 samples.
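The two-dimensional pipeline can be sketched as follows (our own illustration; it relies on the embed_affine, grassmann_distance, grassmann_centroid and batch_kmeans sketches introduced earlier and on SciPy's convex-hull wrapper around QHull).

```python
import numpy as np
from scipy.spatial import ConvexHull

def polytope_edges(samples, k):
    """Cluster the convex-hull edges of 2-d samples into k bounding edges.

    samples : (N, 2) array of points from the inside of the polytope
    Each hull edge (an affine line) is embedded as a 2-dimensional linear
    subspace of R^3, and the embedded edges are clustered with Grassmann
    k-means; the returned centroids represent the k estimated bounding edges.
    """
    hull = ConvexHull(samples)
    edges = [embed_affine([samples[i], samples[j]]) for i, j in hull.simplices]
    centroids, _ = batch_kmeans(edges, k, grassmann_distance, grassmann_centroid)
    return centroids
```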



4.3 Nonnegative Matrix Factorization<br />

(Overcomplete) Nonnegative Matrix Factorization (NMF)<br />

deals with the problem of f<strong>in</strong>d<strong>in</strong>g a nonnegative decomposition<br />

X = AS + N of a nonnegative matrix X, where N<br />

denotes unknown Gaussian noise. S is often pictured as a<br />

source data set conta<strong>in</strong><strong>in</strong>g samples along its columns. If we<br />

assume that S spans the whole first quadrant, then X is a<br />

conic hull with cone l<strong>in</strong>es given by the columns of A. After<br />

projection to the standard simplex, the conic hull reduces<br />

to the convex hull, and the projected, known mixture data<br />

set X lies with<strong>in</strong> a convex polytope of the order given by the<br />

number of rows of S. Hence we face the problem of identify<strong>in</strong>g<br />

edges of a sampled polytope, and, even <strong>in</strong> the overcomplete<br />

case, we may tackle this problem by the Grassmann<br />

cluster<strong>in</strong>g-based identification algorithm from the previous<br />

section.<br />
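The only NMF-specific preprocessing step is the projection onto the standard simplex; a minimal sketch (ours) is given below, after which the polytope identification of section 4.2 applies as just described.

```python
import numpy as np

def project_to_simplex(X):
    """Scale every nonnegative column of X = A S + N onto the standard simplex
    (divide by the sum of its entries); the conic hull of the columns of A then
    becomes the convex hull of their projections."""
    s = X.sum(axis=0)
    return X[:, s > 0] / s[s > 0]
```

The projected samples can then be fed to the polytope identification sketch of section 4.2 (restricted to the simplex plane) to estimate the column directions of A, even in the overcomplete case.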

As an example, see figure 5, we choose a random mixing matrix

A = ( 0.76   0.39   0.14
      0.033  0.06   0.43
      0.20   0.56   0.43 )

and sources S given by i.i.d. samples from a squared Gaussian. 10^5 samples were drawn, and subsets of 10 to 10^5 samples were used for the comparison. We refer to the figure caption for further details.

5. CONCLUSION<br />

We have studied k-means-style cluster<strong>in</strong>g problems on the<br />

non-Euclidean Grassmann manifold. In an adequate metric,<br />

we were able to reduce the aris<strong>in</strong>g centroid calculation<br />

problem to the calculation of eigenvectors of the cluster covariance,<br />

for which we gave a proof based on convex optimization.<br />

The algorithm was illustrated by applications to<br />

polytope fitt<strong>in</strong>g and to perform<strong>in</strong>g overcomplete nonnegative<br />

factorizations similar to NMF. In future work, besides<br />

extend<strong>in</strong>g the framework to other cluster<strong>in</strong>g algorithms and<br />

matrix manifolds together with prov<strong>in</strong>g convergence of the<br />

result<strong>in</strong>g algorithms, we plan on apply<strong>in</strong>g the algorithm for<br />

the stability analysis of multidimensional <strong>in</strong>dependent component<br />

analysis.<br />

Acknowledgements<br />

Partial f<strong>in</strong>ancial support by the DFG (GRK 638) and the<br />

BMBF (project ‘ModKog’) is gratefully acknowledged.<br />

REFERENCES<br />

[1] C.B. Barber, D.P. Dobk<strong>in</strong>, and H. Huhdanpaa. The<br />

quickhull algorithm for convex hull. Technical Report<br />

GCG53, The Geometry Center, University of M<strong>in</strong>nesota,<br />

M<strong>in</strong>neapolis, 1993.<br />

[2] C.M. Bishop. Neural Networks for Pattern Recognition.<br />

Oxford University Press, 1995.<br />

[3] L. Bottou and Y. Bengio. Convergence properties of the<br />

k-means algorithms. In Proc. NIPS 1994, pages 585–<br />

592. MIT Press, 1995.<br />

[4] P.S. Bradley and O.L. Mangasarian. k-plane cluster<strong>in</strong>g.<br />

Journal of Global Optimization, 16(1):23–32, 2000.<br />

Figure 5: Grassmann clustering can be used to solve the NMF problem. The mixed signal to be analyzed is a 3-dimensional toy signal with a positive 3 × 3 matrix. The resulting mixture was analyzed with a mean-square-error implementation of NMF and compared to Grassmann clustering. In the clustering algorithm the data is first projected onto the standard simplex. This translates the task to the polytope identification discussed in section 4.2. (a) Comparison of NMF (mean square error) and Grassmann clustering for NMF, plotted as crosserror versus number of samples (10 to 10^5, averaged over 4 tries). (b) Illustration of the NMF algorithm: projection onto the standard simplex (samples, QHull contour, Grassmann clustering with 100 samples, NMF mixing matrix with 100000 samples, and the true mixing matrix).

[5] A. Edelman, T.A. Arias, and S.T. Smith. The geometry<br />

of algorithms with orthogonality constra<strong>in</strong>ts. SIAM Journal<br />

on Matrix <strong>Analysis</strong> and Applications, 20(2):303–<br />

353, 1999.<br />

[6] L. Parsons, E. Haque, and H. Liu. Subspace cluster<strong>in</strong>g<br />

for high dimensional data: a review. SIGKDD Explor.<br />

Newsl., 6(1):90–105, 2004.<br />

[7] S.Z. Selim and M.A. Ismail. K-means-type algorithms: a<br />

generalized convergence theorem and characterization of<br />

local optimality. IEEE Transactions on Pattern <strong>Analysis</strong><br />

and Mach<strong>in</strong>e Intelligence, 6:81–87, 1984.


Chapter 19<br />

LNCS 3195:977-984, 2004<br />

Paper I.R. Keck, F.J. Theis, P. Gruber, E.W. Lang, K. Specht, and C.G. Puntonet.<br />

3D spatial analysis of fMRI data on a word perception task. In Proc. ICA<br />

2004, volume 3195 of LNCS, pages 977-984, Granada, Spa<strong>in</strong>, 2004. Spr<strong>in</strong>ger<br />

Reference (Keck et al., 2004)<br />

Summary <strong>in</strong> section 1.6.1<br />




3D Spatial <strong>Analysis</strong> of fMRI Data<br />

on a Word Perception Task<br />

Ingo R. Keck 1 , Fabian J. Theis 1 , Peter Gruber 1 , Elmar W. Lang 1 ,

Karsten Specht 2 , and Carlos G. Puntonet 3<br />

1 Institute of Biophysics, Neuro- and Bio<strong>in</strong>formatics Group<br />

University of Regensburg, D-93040 Regensburg, Germany<br />

{Ingo.Keck,elmar.lang}@biologie.uni-regensburg.de<br />

2 Institute of Medic<strong>in</strong>e, Research Center Jülich, D-52425 Jülich, Germany<br />

k.specht@fz-juelich.de<br />

3 Departamento de Arquitectura y Tecnologia de Computadores<br />

Universidad de Granada/ESII, E-1807 Granada, Spa<strong>in</strong><br />

carlos@atc.ugr.es<br />

Abstract. We discuss a 3D spatial analysis of fMRI data taken dur<strong>in</strong>g<br />

a comb<strong>in</strong>ed word perception and motor task. The event - based experiment<br />

was part of a study to <strong>in</strong>vestigate the network of neurons <strong>in</strong>volved<br />

<strong>in</strong> the perception of speech and the decod<strong>in</strong>g of auditory speech stimuli.<br />

We show that a classical general l<strong>in</strong>ear model analysis us<strong>in</strong>g SPM does<br />

not yield reasonable results. With bl<strong>in</strong>d source separation (BSS) techniques<br />

us<strong>in</strong>g the FastICA algorithm it is possible to identify different<br />

<strong>in</strong>dependent components (IC) <strong>in</strong> the auditory cortex correspond<strong>in</strong>g to<br />

four different stimuli. Most <strong>in</strong>terest<strong>in</strong>g, we could detect an IC represent<strong>in</strong>g<br />

a network of simultaneously active areas <strong>in</strong> the <strong>in</strong>ferior frontal gyrus<br />

responsible for word perception.<br />

1 Introduction<br />

S<strong>in</strong>ce the early 90s [1, 2], functional magnetic resonance imag<strong>in</strong>g (fMRI) based<br />

on the blood oxygen level dependent contrast (BOLD) developed <strong>in</strong>to one of<br />

the ma<strong>in</strong> technologies <strong>in</strong> human bra<strong>in</strong> research. Its high spatial and temporal<br />

resolution combined with its non-invasive nature makes it an important tool to discover functional areas in the human brain and their interactions.

However, its low signal to noise ratio (SNR) and the high number of activities<br />

<strong>in</strong> the passive bra<strong>in</strong> require sophisticated analysis methods which can be divided<br />

<strong>in</strong>to two classes:<br />

– model based approaches like the general l<strong>in</strong>ear model which require prior<br />

knowledge of the time course of the activations,<br />

– model free approaches like bl<strong>in</strong>d source separation (BSS) which try to separate<br />

the recorded activation <strong>in</strong>to different classes accord<strong>in</strong>g to statistical<br />

specifications without prior knowledge of the activation.<br />

C.G. Puntonet and A. Prieto (Eds.): ICA 2004, LNCS 3195, pp. 977–984, 2004.<br />

c○ Spr<strong>in</strong>ger-Verlag Berl<strong>in</strong> Heidelberg 2004



In this text we compare these analysis techniques <strong>in</strong> a study of an auditory<br />

task. We show an example where traditional model based methods do not yield<br />

reasonable results. Rather bl<strong>in</strong>d source separation techniques have to be used<br />

to get mean<strong>in</strong>gful and <strong>in</strong>terest<strong>in</strong>g results concern<strong>in</strong>g the networks of activations<br />

related to a comb<strong>in</strong>ed word recognition and motor task.<br />

1.1 Model Based Approach: General L<strong>in</strong>ear Model<br />

The general l<strong>in</strong>ear model as a k<strong>in</strong>d of regression analysis has been the classic<br />

way to analyze fMRI data <strong>in</strong> the past [3]. Basically it uses second order statistics<br />

to f<strong>in</strong>d the voxels whose activations correlate best to given time courses. The<br />

measured signal for each voxel <strong>in</strong> time y =(y(t1), ..., y(tn)) T is written as a<br />

l<strong>in</strong>ear comb<strong>in</strong>ation of <strong>in</strong>dependent variables y = Xb + e, with the vector b of<br />

regression coefficients and the matrix X of the <strong>in</strong>dependent variables which <strong>in</strong><br />

case of an fMRI-analysis consist of the assumed time courses <strong>in</strong> the data and<br />

additional filters to account for the serial correlation of fMRI data. The residual<br />

error e ought to be m<strong>in</strong>imized. The normal equation X T Xb = X T y of the<br />

problem is solved by b = (X^T X)^{−1} X^T y and has a unique solution if X^T X has

full rank. F<strong>in</strong>ally a significance test us<strong>in</strong>g e is applied to estimate the statistical<br />

significance of the found correlation.<br />

As the model X must be known <strong>in</strong> advance to calculate b, this method is<br />

called “model-based”. It can be used to test the accuracy of a given model, but<br />

cannot by itself f<strong>in</strong>d a better suited model even if one exists.<br />
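As a minimal sketch of this voxel-wise estimation (our own illustration; the serial-correlation filters and the significance test mentioned above are omitted):

```python
import numpy as np

def glm_fit(y, X):
    """Ordinary least-squares fit of the general linear model y = X b + e for a
    single voxel time course y (length n) and design matrix X of shape (n, q).
    Returns the regression coefficients b and the residual e."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the normal equations
    e = y - X @ b
    return b, e
```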

1.2 Model Free Approach:<br />

BSS Us<strong>in</strong>g <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong><br />

In case of fMRI data bl<strong>in</strong>d source separation refers to the problem of separat<strong>in</strong>g<br />

a given sensor signal, i.e. the fMRI data at time t,

x(t) = A [s(t) + s_noise(t)] = ∑_{i=1}^{n} a_i s_i(t) + ∑_{i=1}^{n} a_i s_noise,i(t),

into its underlying n source signals s, with a_i being its contribution to the

sensor signal, hence its mix<strong>in</strong>g coefficient. A and s are unique except for permutation<br />

and scal<strong>in</strong>g. The functional segregation of the bra<strong>in</strong> [3] closely matches<br />

the requirement of spatially <strong>in</strong>dependent sources as assumed <strong>in</strong> spatial ICA.<br />

The term snoise(t) is the time dependent noise. Unfortunately, <strong>in</strong> fMRI the noise<br />

level is of the same order of magnitude as the signal, so it has to be taken <strong>in</strong>to<br />

account. As the noise term will depend on time, it can be <strong>in</strong>cluded as additional<br />

components <strong>in</strong>to the problem. This problem is called “under-determ<strong>in</strong>ed”<br />

or “over-complete” as the number of <strong>in</strong>dependent sources will always exceed the<br />

number of measured sensor signals x(t).<br />

Various algorithms utiliz<strong>in</strong>g higher order statistics have been proposed to solve<br />

the BSS problem. In fMRI analysis, mostly the extended Infomax (based on<br />

entropy maximisation [4, 5]) and FastICA (based on negentropy us<strong>in</strong>g fix-po<strong>in</strong>t



iteration [6]) algorithm have been used so far. While the extended Infomax algorithm<br />

is expected to perform slightly better on real data due to its adaptive<br />

nature, FastICA does not depend on educated guesses about the probability<br />

density distribution of the unknown source signals. In this paper we choose to<br />

utilize FastICA because of its low demands on computational power.<br />

2 Results<br />

First, we will present the implementation of the algorithm we used. Then we will<br />

discuss an example of an event-designed experiment and its BSS based analysis<br />

where we were able to identify a network of bra<strong>in</strong> areas which could not be<br />

detected us<strong>in</strong>g classic regression methods.<br />

2.1 Method<br />

To implement spatial ICA for fMRI data, every three-dimensional fMRI image<br />

is considered as a s<strong>in</strong>gle mixture of underly<strong>in</strong>g <strong>in</strong>dependent components. The<br />

rows of every image matrix have to be concatenated to a s<strong>in</strong>gle row-vector and<br />

with these image-vectors the mixture matrix X is constructed.<br />

For FastICA the second order correlation <strong>in</strong> the data has to be elim<strong>in</strong>ated by<br />

a “whiten<strong>in</strong>g” preprocess<strong>in</strong>g. This is done us<strong>in</strong>g a pr<strong>in</strong>cipal component analysis<br />

(PCA) step prior to the FastICA algorithm. In this step a data reduction can be<br />

applied by omitt<strong>in</strong>g pr<strong>in</strong>cipal components (PC) with a low variance <strong>in</strong> the signal<br />

reconstruction process. However, this should be handled with care as valuable<br />

high order statistical <strong>in</strong>formation can be conta<strong>in</strong>ed <strong>in</strong> these low variance PCs.<br />

The maximal variations in the time trends of the supposed word-detection ICs in our example account for only 0.7% of the measured fMRI signal.

The FastICA algorithm calculates the de-mix<strong>in</strong>g matrix W = A −1 . Then the<br />

underly<strong>in</strong>g sources S can be reconstructed as well as the orig<strong>in</strong>al mix<strong>in</strong>g matrix<br />

A. The columns of A represent the time-courses of the underlying sources which

are conta<strong>in</strong>ed <strong>in</strong> the rows of S. To display the ICs the rows of S have to be<br />

converted back to three-dimensional image matrices.

As noted before because of the high noise present <strong>in</strong> fMRI data the ICA<br />

problem will always be under-determ<strong>in</strong>ed or over-complete. As FastICA cannot<br />

separate more components than the number of mixtures available, the result<strong>in</strong>g<br />

IC will always be composed of a noise part and the “real” IC superimposed on<br />

that noise. This can be compensated by <strong>in</strong>dividually de-nois<strong>in</strong>g the IC. As a rule<br />

of thumb we decided that to be considered a noise signal the value has to be<br />

below 10 times the mean variance <strong>in</strong> the IC which corresponds to a standard<br />

deviation of about 3.<br />
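A compact sketch of this spatial-ICA pipeline, using scikit-learn's FastICA in place of the original FastICA package (our own illustration; "logcosh" is the contrast whose derivative is the tanh nonlinearity used below, and the per-component de-noising threshold is omitted):

```python
import numpy as np
from sklearn.decomposition import FastICA

def spatial_ica(volumes, n_components):
    """Spatial ICA of fMRI scans: X = A S with one flattened scan per row of X.

    volumes : array of shape (T, nx, ny, nz), one 3-d image per time point
    Returns the spatial component maps and the mixing matrix A, whose columns
    are the component time courses.  PCA whitening (and the optional dimension
    reduction to n_components) happens inside FastICA.
    """
    T = volumes.shape[0]
    X = volumes.reshape(T, -1)                           # one scan per row
    ica = FastICA(n_components=n_components, fun="logcosh")
    S = ica.fit_transform(X.T).T                         # rows: spatial maps
    A = ica.mixing_                                      # shape (T, n_components)
    maps = S.reshape(n_components, *volumes.shape[1:])   # back to 3-d maps
    return maps, A
```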

2.2 Example: <strong>Analysis</strong> of an Event-Based Experiment<br />

This experiment was part of a study to <strong>in</strong>vestigate the network <strong>in</strong>volved <strong>in</strong><br />

the perception of speech and the decoding of auditory speech stimuli. Therefore



one- and two-syllable words were divided <strong>in</strong>to several frequency-bands and then<br />

rearranged randomly to obta<strong>in</strong> a set of auditory stimuli. The set consisted of four<br />

different types of stimuli, conta<strong>in</strong><strong>in</strong>g 1, 2, 3 or 4 frequency bands (FB1–FB4)<br />

respectively. Only FB4 was perceivable as words.<br />

Dur<strong>in</strong>g the functional imag<strong>in</strong>g session these stimuli were presented pseudorandomized<br />

to 5 subjects, accord<strong>in</strong>g to the rules of a stochastic event-related<br />

paradigm. The task of the subjects was to press a button as soon as they were<br />

sure that they had just recognized a word in the sound presented. It was expected that these four types of stimuli activate different areas of the auditory system as well as, in the case of FB4, the superior temporal sulcus in the left hemisphere [8].

Prior to the statistical analysis the fMRI data were pre-processed with the<br />

SPM2 toolbox [9]. A slice-tim<strong>in</strong>g procedure was performed, movements corrected,<br />

the result<strong>in</strong>g images were normalized <strong>in</strong>to a stereotactical standard space<br />

(def<strong>in</strong>ed by a template from the Montreal Neurological Institute) and smoothed<br />

with a gaussian kernel to <strong>in</strong>crease the signal-to-noise ratio.<br />

Classical Fixed-Effect <strong>Analysis</strong>. First, a classic regression analysis with<br />

SPM2 was applied. No substantial differences <strong>in</strong> the activation of the auditory<br />

cortex apart from an overall <strong>in</strong>crease of activity with ascend<strong>in</strong>g number of frequency<br />

bands were found in three subjects. One subject showed no correlated

activity at all, two only had marg<strong>in</strong>al activity located <strong>in</strong> the auditory cortex<br />

(figure 1 (c)). Only one subject showed obvious differences between FB1 and<br />

FB4: an activation of the left supplementary motor area, the c<strong>in</strong>gulate gyrus<br />

and an <strong>in</strong>creased size of active area <strong>in</strong> the left auditory cortex for FB4 (figure 1<br />

(a),(b)).<br />

Spatial ICA with FastICA. For the sICA with FastICA [6] up to 351 three-dimensional

images of the fMRI sessions were <strong>in</strong>terpreted as separate mixtures<br />

of the unknown spatial <strong>in</strong>dependent activity signals. Because of the high computational<br />

demand each subject was analyzed <strong>in</strong>dividually <strong>in</strong>stead of a whole group<br />

ICA as proposed <strong>in</strong> [10]. A pr<strong>in</strong>cipal component analysis (PCA) was applied to<br />

whiten the data. 340 components of this PCA were reta<strong>in</strong>ed that correspond to<br />

more than 99.999% of the orig<strong>in</strong>al signals. This is still 100 times greater than<br />

the share of ICs like that shown <strong>in</strong> figure 3 on the fMRI signal. In one case only<br />

317 fMRI images were measured and all result<strong>in</strong>g 317 PCA components were<br />

reta<strong>in</strong>ed.<br />

Then the stabilized version of the FastICA algorithm was applied us<strong>in</strong>g tanh<br />

as non-l<strong>in</strong>earity. The result<strong>in</strong>g 340 (resp. 317) spatially <strong>in</strong>dependent components<br />

(IC) were sorted <strong>in</strong>to different classes depend<strong>in</strong>g on their structural localization<br />

with<strong>in</strong> the bra<strong>in</strong>. Various ICs <strong>in</strong> the region of the auditory cortex could be<br />

identified <strong>in</strong> all subjects, figure 2 show<strong>in</strong>g one example. Note that all bra<strong>in</strong> images<br />

<strong>in</strong> this article are flipped, i.e. the left hemisphere appears on the right side of<br />

the picture. To calculate the contribution of the displayed ICs to the observed<br />

fMRI data the value of its voxels has to be multiplied with the time course of<br />

its activation for each scan (lower subplot to the right of each IC plot). Also




Fig. 1. Fixed-effect analysis of the experimental data. No substantial differences between<br />

the activation <strong>in</strong> the auditory cortex correlated to (a) FB1 and (b) FB4 can be<br />

seen. (c) shows the analysis for FB4 of a different subject.<br />

Fig. 2. <strong>Independent</strong> component located <strong>in</strong> the auditory cortex and its time course.<br />

a component located at the position of the supplementary motor area (SMA)<br />

could be found <strong>in</strong> all subjects.



Fig. 3. <strong>Independent</strong> component which correspond to a proposed subsystem for word<br />

detection.<br />

Fig. 4. <strong>Independent</strong> component with activation <strong>in</strong> Broca’s area (speech motor area).<br />

The most interesting finding was an IC which represents a network of three simultaneously active areas in the inferior frontal gyrus (figure 3) in one subject. This network was suggested to be a center for the perception of speech in [8]. Figure 4 shows an IC (of the same subject) that we assume to be a network for the decision to press the button. All other subjects except one had ICs that correspond to these networks, although often separated into different components. The time course of both components matches visually very well (figure 5) while their correlation coefficient remains rather low (k_corr = 0.36), apparently due to temporary time- and baseline-shifts.



Comparison of the Regression Analysis Versus ICA. To compare the results of the fixed-effect analysis with the results of the ICA, the correlation coefficients between the expected time-trends of the fixed-effect analysis and the time-trends of the ICs were calculated. No substantial correlation was found: 87% of all these coefficients were in the range of −0.1 to 0.1, the highest coefficient found being 0.36 for an IC within the auditory cortex (figure 2). The correlation coefficients for the proposed word detection network (figure 3) were 0.14, 0.08, 0.19 and 0.18 for FB1–FB4. Therefore it is quite obvious that this network of areas in the inferior frontal gyrus cannot be detected with a classic fixed-effect regression analysis.
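The comparison itself is a plain correlation computation; a small sketch (ours, with hypothetical regressors and IC time courses standing in for the real data):

    import numpy as np

    rng = np.random.default_rng(1)
    regressors = rng.standard_normal((4, 200))    # expected time-trends for FB1-FB4
    ic_courses = rng.standard_normal((340, 200))  # activation time courses of the ICs

    # correlation of every IC time course with every expected time-trend
    corr = np.corrcoef(np.vstack([regressors, ic_courses]))[:4, 4:]
    fraction_small = np.mean(np.abs(corr) < 0.1)  # cf. the 87% reported above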

While the reasons for the differences between the activation-trends of the ICs and the assumed time-trends are still subject to on-going research, it can be expected that the results of this ICA will help to gain further information about the work flow of the brain concerning the task of word detection.

Fig. 5. The activation of the ICs shown in figure 3 (dotted) and 4 (solid), plotted for scan no. 25–75. While these time-trends obviously appear to be correlated, their correlation coefficient remains very low due to temporary baseline- and time-shifts in the trends.

3 Conclusions

We have shown that ICA can be a valuable tool to detect hidden or suspected links and activity in the brain that cannot be found using the classical approach of a model-based analysis like the general linear model. While clearly ICA cannot be used to validate a model (being in itself model-free), it can give useful hints to understand the internal organization of the brain and help to develop new models and study designs which then can be validated using a classic regression analysis.



Acknowledgment

This work was supported by the BMBF (project ModKog).

References

1. K. K. Kwong, J. W. Belliveau, D. A. Chester, I. E. Goldberg, R. M. Weisskoff, B. P. Poncelet, D. N. Kennedy, B. E. Hoppel, M. S. Cohen, R. Turner, H.-M. Cheng, T. J. Brady, B. R. Rosen, “Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation”, Proc. Natl. Acad. Sci. USA 89, 5675–5679 (1992).
2. S. Ogawa, T. M. Lee, A. R. Kay, D. W. Tank, “Brain magnetic-resonance-imaging with contrast dependent on blood oxygenation”, Proc. Natl. Acad. Sci. USA 87, 9868–9872 (1990).
3. R. S. J. Frackowiak, K. J. Friston, Ch. D. Frith, R. J. Dolan, J. C. Mazziotta, “Human Brain Function”, Academic Press, San Diego, USA, 1997.
4. A. J. Bell, T. J. Sejnowski, “An information-maximisation approach to blind separation and blind deconvolution”, Neural Computation 7(6), 1129–1159 (1995).
5. M. J. McKeown, T. J. Sejnowski, “Independent Component Analysis of FMRI Data: Examining the Assumptions”, Human Brain Mapping 6, 368–372 (1998).
6. A. Hyvärinen, “Fast and Robust Fixed-Point Algorithms for Independent Component Analysis”, IEEE Transactions on Neural Networks 10(3), 626–634 (1999).
7. F. Esposito, E. Formisano, E. Seifritz, R. Goebel, R. Morrone, G. Tedeschi, F. Di Salle, “Spatial Independent Component Analysis of Functional MRI Time-Series: To What Extent Do Results Depend on the Algorithm Used?”, Human Brain Mapping 16, 146–157 (2002).
8. K. Specht, J. Reul, “Functional segregation of the temporal lobes into highly differentiated subsystems for auditory perception: an auditory rapid event-related fMRI-task”, NeuroImage 20, 1944–1954 (2003).
9. SPM2: http://www.fil.ion.ucl.ac.uk/spm/spm2.html, July 2003.
10. V. D. Calhoun, T. Adali, G. D. Pearlson, J. J. Pekar, “A Method for Making Group Inferences from Functional MRI Data Using Independent Component Analysis”, Human Brain Mapping 14, 140–151 (2001).




Chapter 20

Signal Processing 86(3):603-623, 2006

Paper F.J. Theis and G.A. García. On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms. Signal Processing, 86(3):603-623, 2006

Reference (Theis and García, 2006)

Summary in section 1.6.3



On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms

Fabian J. Theis a,∗, Gonzalo A. García b

a Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
b Department of Bioinformatic Engineering, Osaka University, Osaka, Japan

∗ Corresponding author. Email addresses: fabian@theis.name (Fabian J. Theis), garciaga@ieee.org (Gonzalo A. García).

Abstract

The decomposition of surface electromyogram data sets (s-EMG) is studied using blind source separation techniques based on sparseness; namely independent component analysis, sparse nonnegative matrix factorization, and sparse component analysis. When applied to artificial signals we find noticeable differences of algorithm performance depending on the source assumptions. In particular, sparse nonnegative matrix factorization outperforms the other methods with regards to increasing additive noise. However, in the case of real s-EMG signals we show that despite the fundamental differences in the various models, the methods yield rather similar results and can successfully separate the source signal. This can be explained by the fact that the different sparseness assumptions (super-Gaussianity, positivity together with minimal 1-norm and fixed number of zeros respectively) are all only approximately fulfilled, thus apparently forcing the algorithms to reach similar results, but from different initial conditions.

Key words: surface EMG, blind source separation, sparse component analysis, independent component analysis, sparse nonnegative matrix factorization

1 Introduction

A basic question in data analysis, signal processing, data mining as well as neuroscience is how to represent a large data set X (observed as an (m × N)-matrix) in different ways.
One of the simplest approaches lies in a linear decomposition, X = AS, with A an (m × n)-matrix (mixing matrix) and S an (n × N)-matrix storing the sources. Both A and S are unknown, hence this problem is often described as blind source separation (BSS). To get a well-defined problem, A and S have to satisfy additional properties such as:

• the source components Si (rows of S) are assumed to be realizations of stochastically independent random variables — this method is called independent component analysis (ICA) [1,2],
• the sources S are required to contain as many zeros as possible — we then speak of sparse component analysis (SCA) [3,4],
• A and S are assumed to be nonnegative, which is denoted by nonnegative matrix factorization (NMF) [5].

The above-mentioned models as well as their interplay have recently been in the focus of many researchers, for instance concerning the question of how ICA and sparseness are related to each other and how they can be integrated into BSS algorithms [6–10], how to deal with nonnegativity in the ICA case [11,12] or how to extend NMF in order to include sparseness [13,14]. Much work has already been devoted to these subjects, and their applications to various fields are currently emerging. Indeed, linear representations such as the above have several potential applications including decomposition of objects into ‘natural’ components [5], redundancy and dimensionality reduction [2], biomedical data analysis, micro-array data mining or enhancement, feature extraction of images in nuclear medicine, etc. [1,2].

In this study, we will analyze and compare the above models, not from a theoretical point of view but rather from a concrete, real-world example, namely the analysis of surface electromyogram (s-EMG) data sets. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle [15]; its study is relevant to the diagnosis of motoneuron diseases [16] as well as neurophysiological research [17]. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use s-EMG, which is measured using non-invasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and overlap of several source signals as shown in Fig. 1(a). We have already applied ICA in order to solve the s-EMG decomposition problem [18], however performance in real-world noisy s-EMG is still problematic, and it is yet unknown if the assumption of independent sources holds very well in the setting of s-EMG.

In the present work, we apply sparse BSS methods based on various model assumptions to s-EMG signals. We first outline each of those methods and the corresponding performance indices used for their comparison. We then present the decompositions obtained with each method, and finally discuss these results in section 4.


Fig. 1. Electromyogram; (a) Example of a signal obtained from a surface electrode showing the superposition of the activity of several motor units. (b) α-motoneurons convey the commands from the central nervous system to the muscles. A motor unit consists of an α-motoneuron and the muscle fibres it innervates.

2 Method

2.1 Signal origin and acquisition

The central nervous system conveys commands to the muscles by trains of electric impulses (firings) via α-motoneurons, whose bodies are located in the spinal cord. The terminal axons of an α-motoneuron innervate a group of muscle fibres. A motor unit (MU) consists of an α-motoneuron and the muscle fibres it innervates, Fig. 1(b). When a MU is active, it produces a train of electric impulses called motor unit action potential train (MUAPT). The s-EMG signal is composed of the weighted summation of several MUAPTs; an example is given in Fig. 1(a).

2.1.1 Artificial signal generator

We developed a multi-channel s-EMG generator based on Disselhorst-Klug et al.’s model [19], employing Andreassen and Rosenfalck’s conductivity parameters [20], and the firing rate statistical characteristics described by Clamann [21].

The volume conductor between a muscle fibre and the skin surface where the electrodes are placed acts as a spatial low-pass filter; hence, a source signal is attenuated in direct proportion to its distance from the electrode. Using Griep’s tripole model [22], we can calculate the spatial potential distribution generated on the skin surface by the excitation of a single muscle fibre.


Fig. 2. Artificially created signals. (a) Five synthetic motor unit action potential trains (source signals, MUAPTs) generated by the model. (b) Artificial surface electromyogram (s-EMG); location of the sources (motor units, encircled asterisks) with respect to the electrode locations marked by an x. The y-axis represents the depth inside the upper limb. (c) Artificially created mixture signal — only the first 5000 samples (out of 15000) are shown. The eight-channel signal is generated from the five source signals (a), using the mixing matrix illustrated in (d).

The potential distribution generated by the firing of a single motor unit can be calculated as the linear superposition of the contributions of all its constitutive muscle fibres, which are not located in one plane. The aforementioned model can be considered as a linear instantaneous mixing process. In reality, however, it is well known that the model could exhibit slightly convolutive behavior, which is due to the fact that the media crossed by the s-EMG sources is anisotropic, provoking delayed mixtures of the same signals (convolutions) rather than an instantaneous mixture of them. However, in the muscle model used, the distances between the different muscle fibres have been increased by the anisotropy factor A, calculated as the ratio between the radial conductivity (0.063 S/m) and the axial conductivity (0.328 S/m), see Tab. 1 for parameter details. Afterwards, the potential distribution can be calculated as if the volume conductor were isotropic. The channels composing the synthetic s-EMG contain the same MUAP without any time delay among themselves, hence producing an instantaneous mixture with the other source signals.

    name           meaning                                  value
    num            number of MUs                            5
    siz            size of each MU, in number of fibres     20-30 (uniformly distr.)
    cfr            central firing rate                      20 Hz
    ED             electrode inter-pole distance            2.54 mm
    CH             number of generated channels             8
    meanIPI        mean inter-pulse interval                1/cfr · 1000 ms
    IPIstd         Clamann formula for IPI deviation        9.1 · 10^-4 · meanIPI^2 + 4
    SY             radial conductivity                      0.063 S/m
    SZ             axial conductivity                       0.328 S/m
    A              anisotropy ratio                         SZ/SY = 5.2
    SI             intracellular conductivity               1.010 S/m
    D              fibre diameter                           50 · 10^-6 m
    CV             conduction velocity                      4.0 m/s
    FatHeight      fat-skin layer thickness                 3 mm
    MUterritory    XY-plane radius of each MU               1.0 mm
    DistEndplate   endplate-to-electrode distance           20 mm
    FibZsegment    length of the endplate in the Z axis     0.4 mm
    CondVolCond    conductivity of the volume conductor     0.08 S/m

Table 1. Parameters used for artificial s-EMG generation.

We have generated 100 synthetic, eight-channel s-EMG signals of 1.5-second duration. Each s-EMG was composed of five MUs randomly located under a 3 mm fat and skin layer in the detection area of the electrode-array (up to ±2 mm of the sides of the electrode array and up to a depth of 6 mm). Each MU in turn consisted of 20 to 30 fibres (uniform distribution) of infinite length located randomly in a 1 mm circle in the XY plane. The conduction velocity of the fibres was 4.0 m/s. Note that an average biceps brachii MU is generally composed of at least 200-300 muscle fibres [23]; however, for the sake of computational speed, that number was reduced to one tenth, which only resulted in the reduction of the amplitude of the final MUAPs. This does not have any influence in our case, as the algorithms are not affected by the general amplitude of the signals, only by their respective proportions, if at all.

The simulated electrode-array used to generate the mixed recordings had the same dimensions and characteristics as the real one and was virtually located 20 mm apart from the endplate region, which had a thickness of 0.4 mm. The X axis corresponded to the length of the electrode, the Y axis to the depth, and the Z axis to the fibre direction. The mean instantaneous firing rate of each MU was 20 firings/second (mean inter-pulse interval meanIPI of 50 ms), with a standard deviation calculated following Clamann’s model as 9.1 · 10^-4 · meanIPI^2 + 4 [21].

The s-EMG signals were generated as follows: each muscle fibre composing a firing MU generates an action potential following the moving tripole model ([24] and references therein); the final MUAP appearing in each channel is the summation of those action potentials after applying the anisotropy coefficient A and the fading produced by the media crossed by the signal ([19] and references therein). Note that the model generally holds for any skeletal muscle. However, some of the parameters (such as the fat layer and firing rate) were chosen following the average, normal values of the biceps brachii [25].

A single such data set is visualized in Fig. 2. The eight-dimensional s-EMG observations that have been generated from the five underlying MUAPTs are shown in Fig. 2(a). Their locations are depicted in Fig. 2(b) and the mixing matrix used to generate the synthetic data set (see Fig. 2(c)) is shown in Fig. 2(d). We will use several (100) of these signals with randomly chosen source locations in batch runs for statistical comparisons.

2.1.2 Real s-EMG signal acquisition

Fig. 3 shows the experimental setting for the acquisition of real s-EMG recordings. The subject (after giving informed consent) sat on a chair and was strapped to it with nylon belts as shown in Fig. 3(a). The subject’s forearm was put in a cast fixed to the table, the shoulder angle being approximately 30°. To measure the torque exerted by the subject, a strain gauge was placed between the forearm cast and the table fixation. An electrode array was placed on the biceps short head muscle belly, out of the innervation zone, transverse to the direction of muscle fibers, after cleaning the skin with alcohol and SkinPure (Nihon Kohden Corp, Tokyo, Japan). The electrode array consists of 16 stainless steel poles of 1 mm in diameter, 3 mm in height, and 2.54 mm apart in both plane directions that are attached to a rigid acrylic plate, see Fig. 3(b). The poles of each bipolar electrode were placed at the shortest possible distance to obtain sharper MUAPs.


Fig. 3. Experimental setting; (a) Eight-channel s-EMG was recorded during an isometric, constant force at 30% of the subject’s MVC. (b) Detail of the s-EMG bipolar electrode array used for the recordings.

Eight-channel EMG bipolar signals were measured (input impedance above 10^12 Ω, CMRR of 110 dB), filtered by a first-order, band-pass Butterworth filter (cut-off frequencies of 70 Hz and 1 kHz), and then amplified (gain of 80 dB). The target torque and the measured torque were displayed on an oscilloscope as a thick and as a thin line respectively, which the subject was asked to match. The eight-channel s-EMG was sampled at a frequency of 10 kHz, 12 bits per sample, by a PCI-MIO-16E-1 A/D converter card (National Instruments, Austin, Texas, USA) installed in a PC, which was also equipped with a LabView (National Instruments, Austin, Texas, USA) program in charge of controlling the experiments. The signals were recorded during a 5 s period of constant-force isometric contractions at 30% maximum voluntary contraction (MVC).

Before applying any BSS method, the signal was preprocessed using a modified dead-zone filter: a nonlinear filter realized by an algorithm developed to retain the MUAPs belonging to the target MUAPT and to remove both the noise and the low-amplitude MUAPs that did not reach the given thresholds. In former works, these thresholds were initially set at a level three times above the noise level (recording at 0% MVC) and then adjusted with the aid of an incorporated graphical user interface [18]. In the present work, the thresholds were set to 10% for all the signals so that we may compare all obtained results easily. The algorithm looks for zero-crossings on the signal and defines a waveform as the samples of the considered signal comprised between two consecutive zero-crossings. If the maximum (respectively, minimum in the case of a negative waveform) is above (respectively, below) the threshold, that waveform is kept in the signal given as output. Otherwise, the waveform is substituted by a zero-voltage signal on that period.
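A compact sketch of such a zero-crossing-based dead-zone filter (our reconstruction of the idea, not the original implementation; the threshold is passed in directly):

    import numpy as np

    def dead_zone_filter(x, threshold):
        """Keep every waveform (the samples between two consecutive zero-crossings)
        whose peak amplitude reaches the threshold; replace all others by zeros."""
        x = np.asarray(x, dtype=float)
        y = np.zeros_like(x)
        sign = np.signbit(x)
        crossings = np.where(sign[1:] != sign[:-1])[0] + 1   # waveform boundaries
        bounds = np.concatenate(([0], crossings, [x.size]))
        for a, b in zip(bounds[:-1], bounds[1:]):
            segment = x[a:b]
            if segment.size and np.max(np.abs(segment)) >= threshold:
                y[a:b] = segment                              # waveform is kept
        return y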

It is quite important to use such a filter, because although BSS algorithms are generally able to separate the noise as a component, they are usually unable to separate more source signals than available channels. Generally in our experiments there were more MUs than available channels even at low levels of contraction, but only a few of them had an amplitude above the noise level. In order to eliminate the activity of some MUs we delete the MUAPTs belonging to distant MUs, whose MUAPs are less powerful. This is enhanced by PCA preprocessing and dimension reduction as discussed later in section 3.2.

2.2 Blind source separation

Linear blind source separation (BSS) describes the task of blindly recovering A and S in the equation

$$X = AS + N \qquad (1)$$

where X consists of m signals with N observations each, put together into an (m × N)-matrix. A is a full-rank (m × n)-matrix, and we typically assume that m ≥ n (complete and under-complete case). Moreover N models additive noise, which is commonly assumed to be white Gaussian. Depending on the assumptions on A and S we get different models. Our goal is to estimate the mixing matrix A. Then the sources can be recovered by applying the pseudoinverse A⁺ of A to the mixtures X (which is optimal in the maximum-likelihood sense when Gaussian noise is assumed).

In the case of s-EMG data, the original signals are the MUAPTs generated by the motor units active during a sustained contraction. In this setting, A quantifies how each source contributes to each observation. Of course additional requirements will have to be applied to the model to guarantee a satisfactorily small space of solutions, and depending on the assumptions on A and S we will obtain different models.
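As a small illustration of this recovery step (a sketch under the stated model; the mixing-matrix estimate is a stand-in for the output of any of the BSS methods below):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, N = 8, 3, 15000
    A = rng.random((m, n))                           # nonnegative mixing matrix
    S = rng.laplace(size=(n, N))                     # sparse (super-Gaussian) sources
    X = A @ S + 0.01 * rng.standard_normal((m, N))   # noisy mixtures, cf. eq. (1)

    A_est = A                                        # stand-in for a BSS estimate of A
    S_est = np.linalg.pinv(A_est) @ X                # source recovery via the pseudoinverse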


2.2.1 Independent component analysis

Independent component analysis (ICA) describes the task of (here: linearly) transforming a given multivariate random vector such that its transform is stochastically independent. In our setting the random vector is given by N realizations, and ICA is applied to solve the BSS problem (1), where S is assumed to be independent. As in all BSS problems, a key issue lies in the question of identifiability of the model, and it can be shown that A (and hence S) is already uniquely — except for column permutation and scaling — determined by X if S contains at most one Gaussian and is square integrable [26,27]. This enables us to apply ICA to the BSS problem and to recover the original sources.

The idea of ICA was first expressed by Jutten and Hérault [28], while the term ICA was later coined by Comon [26]. In contrast to principal component analysis (PCA), ICA uses higher-order statistics to fully separate the data. Typical algorithms are based on contrasts such as minimum mutual information, maximum entropy, diagonal cumulants or non-Gaussianity. For more details on ICA we refer to the two available excellent textbooks [1,2].

In the following we will use the so-called JADE (joint approximate diagonalization of eigenmatrices) algorithm, which identifies the sources using the fact that, due to independence, the cross-cumulants of the sources vanish, i.e. their fourth-order cumulant tensor is diagonal. Furthermore, fixing two indices of the 4th-order cumulants, it is easy to see that such cumulant matrices of the mixtures are diagonalized by A.

After pre-processing by PCA, we can assume that A is orthogonal. Then diagonalization of one mixture cumulant matrix already yields A, given that its eigenvalues are pairwise different. However, in practice this is not always the case; furthermore, estimation errors could result in a bad estimate of the cumulant matrix and hence of A. Therefore, joint diagonalization of a whole set of cumulant matrices yields an improved estimate of A. Algorithms for actually performing joint diagonalization include gradient descent on the sum of off-diagonal terms, iterative construction of A by Givens rotations in two coordinates [29], an iterative two-step recovery of A [30] or — more recently — a linear least-squares algorithm for diagonalization [31], where the latter two algorithms can also search for non-orthogonal matrices A.
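The following numerical sketch (ours, for illustration only; it is not the JADE implementation used later) shows the underlying idea for whitened mixtures: a contracted fourth-order cumulant matrix of the whitened data is diagonalized by the remaining orthogonal mixing matrix, so its eigenvectors already estimate it, and JADE improves on this by jointly diagonalizing a whole set of such matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 50000
    S = rng.laplace(size=(n, N))                  # independent super-Gaussian sources
    A = rng.random((n, n))                        # mixing matrix
    X = A @ S

    # whitening: afterwards the remaining mixing matrix is orthogonal
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    W = E @ np.diag(d ** -0.5) @ E.T
    Z = W @ X

    # cumulant matrix with two indices contracted against a symmetric M:
    # for whitened z, Q(M) = E[z z^T (z^T M z)] - tr(M) I - 2 M
    M = rng.standard_normal((n, n))
    M = M + M.T
    q = ((M @ Z) * Z).sum(axis=0)                 # z(t)^T M z(t) for every sample
    Q = (Z * q) @ Z.T / N - np.trace(M) * np.eye(n) - 2 * M

    # Q is (approximately) diagonalized by the orthogonal mixing matrix, so its
    # eigenvectors recover it; undoing the whitening yields an estimate of A,
    # unique up to column scaling, permutation and sign
    _, V = np.linalg.eigh(Q)
    A_est = np.linalg.inv(W) @ V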

One method we use for analyzing the s-EMG signals is JADE-based ICA. Confirming results from [32], we show that ICA can indeed extract the underlying sources. In the case of s-EMG signals, all sources are strongly super-Gaussian and can therefore safely be assumed to be non-Gaussian, so identifiability holds. However, due to the nonnegativity of A, the scaling indeterminacy is reduced to multiplication with a positive scalar in each column. If we additionally use the common assumption of unit variance of the sources, this already eliminates the scaling indeterminacy. In order to use an ordinary ICA algorithm, we simply have to add a ‘postprocessing’ stage: to guarantee a nonnegative matrix, column signs are flipped to have only (or as many as possible) nonnegative column entries. Also note that statistical independence, meaning that the multivariate probability densities factorize, is not related to the synchrony in the firing of MUs [33] — otherwise overlapping MUs could not be separated.

We finally want to remark that ICA can also be interpreted as a sparse signal decomposition method in the case of super-Gaussian sources. This follows from the fact that a good and often-used contrast for ICA is given by maximization of non-Gaussianity [34] — this can be approximately derived from the fact that, due to the central limit theorem, a mixture of independent sources tends to be more Gaussian than the sources, so the process can be inverted by maximizing non-Gaussianity. In our setting the sources are very sparse, hence strongly non-Gaussian. An ICA decomposition is therefore closely related to a decomposition into parts of maximal sparseness — at least if sparseness is measured using kurtosis.

2.2.2 (Sparse) nonnegative matrix factorization

In contrast to other matrix factorization models such as PCA, ICA or SCA, nonnegative matrix factorization (NMF) strictly requires both matrices A and S to have nonnegative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [5]. If additionally some sparseness constraints are put on A and S, we speak of sparse NMF, see [14] for more details.

Typically, NMF is performed using a least-squares (Euclidean) contrast

$$E(A, S) = \|X - AS\|^2, \qquad (2)$$

which is to be minimized. This optimization problem, albeit convex in each variable separately, is not convex in both at the same time and hence direct estimation is not possible. Paatero [35] minimizes (2) using a gradient algorithm, whereas Lee and Seung [36] develop a multiplicative update rule increasing algorithm performance considerably.
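A minimal sketch of these multiplicative updates for the Euclidean contrast (2) (our illustration with assumed shapes, not the implementation used in this paper):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, N = 8, 3, 1000
    X = rng.random((m, N))                  # nonnegative data matrix

    A = rng.random((m, n)) + 1e-3           # nonnegative initializations
    S = rng.random((n, N)) + 1e-3
    eps = 1e-9                              # guards against division by zero

    for _ in range(500):
        # multiplicative updates keep A and S nonnegative and do not increase ||X - AS||^2
        S *= (A.T @ X) / (A.T @ A @ S + eps)
        A *= (X @ S.T) / (A @ S @ S.T + eps)

    residual = np.linalg.norm(X - A @ S)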

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [14] proposes a modification of the NMF model to include sparseness. However, a simple modification of the cost function (2) could yield undesirable local minima, so instead he chooses to minimize (2) under the constraint of fixed sparseness of both A and S. Here, sparseness is measured by combining the Euclidean norm $\|\cdot\|_2$ and the 1-norm $\|x\|_1 := \sum_i |x_i|$ as follows:

$$\mathrm{sparseness}(x) := \frac{\sqrt{n} - \|x\|_1/\|x\|_2}{\sqrt{n} - 1} \qquad (3)$$

if $x \in \mathbb{R}^n \setminus \{0\}$. So sparseness(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
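The measure (3) translates directly into code; a small sketch (ours):

    import numpy as np

    def hoyer_sparseness(x):
        """Sparseness from equation (3): 1 for a vector with a single non-zero entry,
        0 when all entries share the same absolute value."""
        x = np.asarray(x, dtype=float)
        n = x.size
        return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

    hoyer_sparseness([0.0, 0.0, 5.0])   # -> 1.0 (maximally sparse)
    hoyer_sparseness([1.0, -1.0, 1.0])  # -> 0.0 (not sparse at all)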

The devised algorithm is based on the iterative application of a gradient descent step and a projection step, thus restricting the search to the subspace of sparse solutions. We perform the factorization using the publicly available Matlab library nmfpack (http://www.cs.helsinki.fi/u/phoyer/), which is used in [14].

So NMF decomposes X into nonnegative A and nonnegative S. The assumption that A has nonnegative coefficients is very well fulfilled by s-EMG recordings; however, as seen before, the sources also have negative entries. In order to be able to apply the algorithms, we therefore preprocess the data using the function

$$\kappa(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases} \qquad (4)$$

to cut off negative values; this yields the new random vector (sample matrix) $X_+ := (\kappa(X_1), \ldots, \kappa(X_n))^\top$. For comparison, we also construct a new sample set by simply leaving out samples that have at least one negative value. Here we model this by the random vector X∗.
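In code, the two preprocessing variants amount to the following (a sketch; X stands for the matrix of mixtures):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 1000))      # stand-in for the s-EMG mixtures

    X_plus = np.clip(X, 0.0, None)          # kappa of eq. (4) applied entrywise
    X_star = X[:, (X >= 0).all(axis=0)]     # keep only fully nonnegative samples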

2.2.3 Sparse component analysis

Sparse component analysis (SCA) [3,4] requires strong sparseness in the sources only — this is then sufficient to decompose the observations. In order to define the SCA model, a vector x ∈ R^n is said to be k-sparse if x has at most k non-zero entries. This k-sparseness implies k0-sparseness for k0 ≥ k. If an n-dimensional vector is k-sparse for k = n − 1, it is simply said to be sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose X into X = AS as above such that each sample (i.e. column) of S is k-sparse. In the following we will assume k = n − 1.

Note that, in contrast to the ICA model, the above model is not translation invariant. However, it is easy to see that if instead of A we allow an affine linear transformation, the translation constant can be determined from X only as long as the sources are non-deterministic. In other words, instead of assuming k-sparseness of the sources we could also assume that at any time instant only k source components are allowed to vary from a previously fixed constant (which can be different for each source).

In [3] we showed that under slight conditions k-sparseness already guarantees identifiability of the model, even in the case of less observations than sources. In the setting of s-EMG however we are in the comfortable situation of having more observations than sources, so as in the ICA case we preprocess our data using PCA projection — this dimension reduction algorithm can be applied even to our case of non-decorrelated sources as (given low or no noise) the first three principal components will span the source signal subspace, see comment in section 3.2. The above uniqueness result is based on the fact that due to the assumed sparseness the data clusters into a fixed number of hyperplanes. This fact can also be used in an algorithm to actually reconstruct A by identifying the set of hyperplanes. From the hyperplanes, A can be recovered by simply taking intersections.

However, multiple hyperplane identification is non-trivial, and the involved cost function

$$\sigma(A) = \frac{1}{N} \sum_{t=1}^{N} \min_{i=1,\ldots,n} \frac{|a_i^\top X(t)|}{\|X(t)\|}, \qquad (5)$$

where $a_i$ denote the columns of A, is highly non-convex. In order to improve the robustness of the proposed, stochastic identifier, we developed an identification algorithm using a generalized Hough transform [37]. Alternatively a generalization of k-means clustering can be used, which iteratively clusters the data into groups belonging to the different hyperplanes, and then identifies a hyperplane within the cluster by regression [38].
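For reference, the cost (5) is straightforward to evaluate for a candidate matrix (a sketch, ours):

    import numpy as np

    def sca_cost(A, X, eps=1e-12):
        """Evaluate sigma(A) from equation (5): the average over all samples X(t) of
        min_i |a_i^T X(t)| / ||X(t)||, the a_i being the columns of A."""
        proj = np.abs(A.T @ X)                   # |a_i^T X(t)| for all columns i and samples t
        norms = np.linalg.norm(X, axis=0) + eps  # ||X(t)||
        return float(np.mean(proj.min(axis=0) / norms))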

In this paper, we assume sparseness of the sources S in the sense that at least one coefficient of S at a certain time instant has to be zero. In the case of s-EMG, the maximum natural firing rate of a motor unit is about 30 pulses/second, each pulse lasting less than 15 ms [39]. Therefore, a motor unit is active less than 450 ms per second; that is, at least 55% of the time each source signal is zero. In addition, the firings of different motor units are not synchronized (only their respective firing rates show the tendency to change together, the so-called common drive [40]). For these reasons, the probability of all n sources firing at a given time instant is bounded by 0.45^n, which quickly approaches zero for increasing n. Even in the case of only n = 3 sources, at most 9% of the samples are fully active. Hence the source conditions for SCA should be rather well fulfilled, and we can find isolated MUAPs inside an s-EMG signal using SCA with high probability.


2.3 Measures used for comparison

In order to compare the recovered signals with the artificial sources, we simply compare the mixing matrices. For this we employ Amari’s separation performance index [41], which is given by the equation

$$E_1(P) = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=1}^{n} \left( \sum_{i=1}^{n} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)$$

where $P = (p_{ij}) = \hat{A}^+ A$, A being the real mixing matrix and $\hat{A}^+$ the pseudoinverse of its estimation $\hat{A}$. Note that $E_1(P) \leq 2n(n-1)$. For both the artificial and the real signals, we calculate $E_1(\hat{A}_1^+ \hat{A}_2)$, where $\hat{A}_i$ are the two recovered mixing matrices.

Furthermore, we also want to study the recovered signals. So in order to be able to compare between different methods, to each pair of components obtained with each method, as well as to the source signals and the synthetic s-EMG channels, we apply the following equivalence measures: Principe’s quadratic mutual information (QMI) [42], Kullback-Leibler information distance (K-LD) [43], Renyi’s entropy measure [43], mutual information measure (MuIn) [42], Rosenblatt’s squared distance functional (RSD) [43], Skaug and Tjøstheim’s weighted difference (STW) [43] and cross-correlation (Xcor). All the measures are normalized with respect to the maximum value obtained when applied to each component with itself; that is, we divide by the maximum of each comparison matrix diagonal (maximum mutual information).

For the calculation of the above-mentioned indices it is necessary to estimate both the joint and the marginal probability density functions of the signals. We have decided to use the data-driven Kn-nearest-neighbour (KnNN) method [44] (for details, refer to [32]). Furthermore, in order to compare separation performance in the presence of noise, we measure the strength of a one-dimensional signal S versus additive noise $\tilde{S}$ using the signal-to-noise ratio (SNR) defined by

$$\mathrm{SNR}(S, \tilde{S}) := 20 \log_{10} \frac{\|S\|}{\|S - \tilde{S}\|}.$$
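Both comparison measures are simple to compute; a sketch (ours), following the conventions above:

    import numpy as np

    def amari_index(P):
        """Amari separation performance index E1(P): 0 for a perfect recovery
        (P a scaled permutation matrix), at most 2n(n-1) in general."""
        P = np.abs(np.asarray(P, dtype=float))
        rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
        cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
        return float(rows.sum() + cols.sum())

    def snr(S, S_noisy):
        """Signal-to-noise ratio in dB between a signal and its corrupted version."""
        return 20 * np.log10(np.linalg.norm(S) / np.linalg.norm(S - S_noisy))

    # e.g. two estimated mixing matrices A1, A2 are compared via
    # amari_index(np.linalg.pinv(A1) @ A2)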

3 Results

In this section we compare the various models for source separation when applied to both toy and real s-EMG recordings.


Fig. 4. Mixing matrices recovered for the synthetic s-EMG using different methods; (a) ICA using joint approximate diagonalization of eigenmatrices (JADE), (b, c) nonnegative matrix factorization with different preprocessing (NMF, NMF∗), (d, e) sparse NMF (sNMF, sNMF∗) with the same two preprocessing methods as NMF, and (f) sparse component analysis (SCA).

3.1 Artificial signals

In the first example, we compare performance in the well-known setting of artificially created toy-signals.

3.1.1 Single s-EMG

For visualization, we will first analyze a single artificial s-EMG recording, and only later present batch-runs over multiple separate realizations to test for statistical robustness. As data set, we use toy signals as in section 3.1 but with only three source components for easier visualization. The ICA result is produced using JADE after PCA to 3 components. Please note that here and in the following we perform dimension reduction because in the small sensor volumes in question not many MUs are present. This is confirmed by considering the eigenvalue structure of the covariance matrix: taking the mean over 10 real s-EMG data sets — further discussed in section 3.2 — the ratio of the third to the first largest eigenvalue is only 0.11, and the ratio of the fourth to the first only 0.04. Taking sums, in the mean we lose only 5.7% of raw data by dimension reduction to the first three eigenvalues, which lies easily in the range of the noise level of typical data sets.

Fig. 5. One of the three recovered sources after applying the pseudoinverse A⁺ of the estimated mixing matrices to the synthetic s-EMG mixtures X; (a-f) results obtained using the different methods, see Fig. 4; (g) for comparison, also the original source signal is shown.

In order to reduce the ever-present permutation and scaling ambiguity, the columns of the recovered mixing matrix A_JADE are normalized to unit length; furthermore, the column sign is chosen to give a positive sum of coefficients (because A is assumed to have only positive coefficients), and the columns are permuted such that the index of the maximum is increasing, Fig. 4(a). The three components are mostly active in channels 3, 4 or 5 respectively, which coincides with our construction (the real sources lie closest to sensors 4, 3 and 5 respectively). Fig. 5(a) shows one of the recovered sources, and for comparison, Fig. 5(g) shows the original source signal.
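In code, this postprocessing of a recovered mixing matrix reads (a sketch, ours):

    import numpy as np

    def normalize_mixing(A):
        """Reduce the permutation and scaling ambiguity: unit-length columns, signs
        chosen for positive column sums, columns ordered by the row index of their maximum."""
        A = A / np.linalg.norm(A, axis=0, keepdims=True)
        A = A * np.sign(A.sum(axis=0, keepdims=True))
        return A[:, np.argsort(A.argmax(axis=0))]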

We subsequently apply NMF and sparse NMF to both X+ and X∗ (denoted by NMF, NMF∗, sNMF and sNMF∗ respectively); in all cases we achieve fast convergence. The mixing matrices obtained are shown in Fig. 4(b-e). In all four cases we observe a good recovery of the source locations. Cross-multiplication of these matrices with their pseudoinverses shows a high similarity, and the recovered source signals are similar and match well the original source, see Fig. 5(b-e).


Fig. 6. Inter-component performance index comparisons. (a) Comparison of the Amari index of matrices acquired from two different methods each in the case of synthetic s-EMG decomposition. As the comparison matrix is symmetric, half of its values have been omitted for clarity. (b) Mean of the inter-components mutual information for each method estimated by different measures. For comparison, the measures corresponding to the source signals (minimal indices) and to the channels of the mixed s-EMG (maximal indices) are also shown.

Finally, we perform SCA using a generalized Hough transform [37]. Note that there are also other algorithms possible for such extraction, and model generalizations are possible, see for example [45]. After whitening and dimension reduction, we perform Hough hyperplane identification with bin-size 180 and manually identify the maxima in the obtained Hough accumulator. The recovered mixing matrix is again visualized in Fig. 4(f). Similar to the previous results, the three components are roughly most active at locations 2 to 3, 4 and 5 respectively. Multiplication with the pseudoinverse of the recovered matrices from the above experiments shows that the result coincides quite well with the matrices from NMF, but differs slightly more from the ICA recovery.

We calculate and compare the Amari index for each method and, as shown in Fig. 6(a), no major differences can be detected. A comparison of the recovered sources using the indices from section 2.3 is given in Fig. 6(b), where also the indices corresponding to the original source signals (minimum mutual information values) and to the channels of the s-EMG (maximum values) have been added. All the methods separate the mixtures (improvement in terms of indices) but the methods yield somewhat different results, with JADE giving quite different sources than the rest of the methods, and NMF and NMF∗ performing rather similarly. In terms of source independence, of course JADE scores best ratings in the indices, as can be confirmed by calculating the Amari index of the recovered matrix with the original source matrix:


                 JADE    NMF     NMF∗    sNMF    sNMF∗   SCA     sources
mean kurtosis    10.80   8.48    8.80    8.85    9.64    7.14    11.4
sparseness       0.286   0.268   0.295   0.270   0.302   0.201   0.252
σ(A✷)            2.91    2.02    2.11    1.99    2.36    0.96    3.24

Table 2. Sparseness measures of the recovered sources using the various methods. The first row gives the mean kurtosis (the higher, the more 'spiky' the signal), the second row the sparseness index from equation (3) (the higher, the sparser the signal), and the third row the cost function (5) employed in SCA (the lower, the better the SCA criterion is fulfilled). Among the methods, the optimal values are the kurtosis of JADE, the sparseness index of sNMF∗ and the σ value of SCA.

              A_JADE   A_NMF   A_NMF∗   A_sNMF   A_sNMF∗   A_SCA
E1(A✷⁺ A)      1.39     3.27    2.96     3.41     2.56      3.86

JADE clearly outperforms the other methods — this will be explained in the next paragraph. However, we have to add that the signal generation additionally involves a slightly nonlinear filter, so we can only estimate the real mixing matrix A from the sources and the mixtures, which yields a non-negligible error. Hence this result indicates that JADE could best approximate the linearized system.
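To make this comparison concrete, the following is a minimal sketch of the widely used Amari performance index [41], computed from the pseudoinverse of a recovered mixing matrix and the true one. The exact normalization of E1 used in this paper is not reproduced in this excerpt, so the sketch is an illustration rather than the paper's implementation.

    import numpy as np

    def amari_index(A_est, A_true):
        """Amari performance index of P = pinv(A_est) @ A_true.

        The index vanishes exactly when P is a scaled permutation matrix,
        i.e. when the mixing matrix is recovered up to the usual BSS
        indeterminacies (scaling and ordering of the sources).
        """
        P = np.abs(np.linalg.pinv(A_est) @ A_true)
        rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
        cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
        return float(rows.sum() + cols.sum())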

One key question in this work is how the different models induce sparseness. Clearly, sparseness is a rather ambiguous term, so we calculate three indices (kurtosis, 'sparseness' from equation (3) and σ from (5)) for the real as well as the recovered sources of the proposed methods, see Tab. 2. As expected, ICA gives the highest kurtosis among all methods, whereas sparse NMF yields the highest values of the sparseness criterion, and SCA has the lowest, i.e. best, value of the k-sparseness cost measured by σ(A). We further see that the kurtosis and the sparseness criterion seem to be somewhat related on this data set, as high values in both indices are achieved by both JADE and sNMF. The k-sparseness criterion, which fixes only the zero-(semi)norm, i.e. requires a fixed number of zeros in the data without additional requirements on the other values, does not induce as high a sparseness when measured by kurtosis or the sparseness index, and vice versa. Finally, by looking at the sparseness indices of the real sources, we can now understand why JADE outperformed the other methods in this toy data setting — the kurtosis of the sources is indeed high and their mutual information low, see Fig. 6(b). In terms of sparseness, and especially σ, however, the sources are not as sparse as expected. Hence the (sparse) NMF and mainly the SCA algorithm could not perform as well as JADE when compared to the original sources, as noted above. However, we will see that in the case of real s-EMG signals this distinction breaks down; furthermore, sparse NMF turns out to be more robust against noise than JADE, as is shown in the following.
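Equations (3) and (5) are not reproduced in this excerpt; as a rough illustration of the first two rows of Tab. 2, the sketch below computes plain excess kurtosis and a Hoyer-style sparseness index (the measure introduced in [14]), which we assume is close in spirit to equation (3).

    import numpy as np

    def excess_kurtosis(s):
        """Excess kurtosis of a 1-d signal; large values indicate 'spiky' sources."""
        s = s - s.mean()
        return float(np.mean(s**4) / np.mean(s**2) ** 2 - 3.0)

    def hoyer_sparseness(s):
        """Hoyer-style sparseness in [0, 1]: 0 for a flat signal, 1 for a 1-sparse one."""
        n = s.size
        l1, l2 = np.abs(s).sum(), np.linalg.norm(s)
        return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))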


[Figure 7: box plots over JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA; (a) Amari index, (b) SNR.]

Fig. 7. Boxplot of the separation performance when identifying 5 sources in 8-dimensional observed artificial s-EMG recordings. (a) shows the mean Amari index of the product of the identified separation matrix and the real mixing matrix, and (b) depicts the SNR of the recovered sources versus the real sources. Mean and variance were taken over 100 runs.

3.1.2 Multiple s-EMG experiments

We now show the performance of the above algorithms when applied to 100 different realizations of artificial s-EMG data sets. The data consist of 8 channels with 5 underlying source activities; more details about data generation are given in section 2.1.1. The algorithm parameters are the same as above, with the exception that automated SCA is performed using Mangasarian-style clustering [38] after PCA dimension reduction to 5 dimensions.

The ICA and sparse BSS algorithms from above are applied to these data sets, and the Amari index as well as the SNR of the recovered sources versus the original sources are stored. In Fig. 7, the means of these two indices as well as their deviations are shown in a box plot, separately for each algorithm. As in the single s-EMG experiment, these statistics confirm that JADE performs best, both in terms of matrix and of source recovery (which is more or less the same because we are still dealing with the noiseless case). The NMF algorithms identify the mixing matrix with acceptable performance; however, (sparse) NMF taking only positive samples (sNMF∗) tends to separate the data slightly better than sample preprocessing using κ from equation (4). SCA cannot detect the mixing matrix as well as the other BSS algorithms — again the SCA conditions seem to be somewhat violated — but it performs adequately well at recovering the sources, because some sources are recovered very well, resulting in a higher SNR than for the NMF algorithms. For practical purposes, it is important to check to what extent the signal-to-interference ratio (SIR) between the channels is improved after applying the BSS algorithms. For each run, we monitor the SIR of the original sources and of the recoveries by taking the mean over all channels. The two SIR means are


divided to give a mean improvement ratio. Taking again the mean over the 100 runs yields the following improvements:

                         JADE   NMF    NMF∗   sNMF    sNMF∗   SCA
mean SIR improvement     4.15   2.54   2.87   0.701   1.90    3.08

This confirms our results; JADE works very well for preprocessing, but so do NMF∗ and, interestingly, SCA.

In order to test the robustness of the methods against noise, we recorded s-EMG at 0% MVC, that is, when the muscle was totally relaxed; in this way we obtained a recording of noise only. As expected, the resulting recordings have nearly the same means and variances, and are close to Gaussian. This is confirmed by a Jarque-Bera test, which asymptotically tests the goodness-of-fit of an observed signal to a normal distribution. We recorded two different noise signals, and the test was positive in 11 out of 16 cases (at significance level α = 5%); the 5 exceptions had p-values not lower than 0.001. Furthermore, the noise is independent as expected: it has a close-to-diagonal covariance matrix and an Amari index of 2.1, which is quite low for 8-dimensional signals. The noise is not fully i.i.d. but exhibits slight non-stationarity. Nonetheless, we take these findings as a confirmation to assume additive Gaussian noise in the following. We will show mean algorithm performance over 50 runs for varying noise levels.
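As a small illustration of the Gaussianity check mentioned above (not the authors' evaluation script), the per-channel test can be run as follows, assuming the recording is stored as an array of shape (channels, samples):

    import numpy as np
    from scipy.stats import jarque_bera

    def gaussian_channels(noise, alpha=0.05):
        """Return, per channel, whether normality is *not* rejected at level alpha."""
        verdicts = []
        for channel in noise:                 # noise: array of shape (channels, samples)
            stat, p = jarque_bera(channel)
            verdicts.append(p > alpha)        # large p-value: consistent with a Gaussian
        return verdicts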

Note that due to the Gaussian noise, the models (especially the NMF model, which already uses such a Gaussian error term) hold well under the additive noise. We multiplied this noise signal progressively by 0, 0.01, 0.05, 0.1, 0.5, 1 and 5 (which corresponds to mean source SNRs of ∞, 36, 22, 16, 2.1, -3.9 and -18 dB) and then added each of the obtained signals to a randomly generated synthetic s-EMG containing 5 sources as above. The Amari index was calculated for each method and for each noise level. We thus obtained the comparative graph shown in Fig. 8. Interestingly, sparse NMF∗ outperforms JADE in all cases, which indicates that the sNMF model (which already includes noise) works best in cases of slight to stronger additive noise — which makes it very well suited to real applications. Again SCA performs somewhat problematically; however, it separates the data well at the noise level of -3.9 dB. We believe that this is due to the thresholding parameter involved in SCA hyperplane detection; apparently it would be necessary to implement an adaptive choice of this parameter in order to improve separation as in the case of an SNR of -3.9 dB.
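The mean source SNR for a given noise scale can be computed as in the following sketch; the array shapes and the SNR definition (signal power over noise power, in dB) are assumptions made for illustration.

    import numpy as np

    def snr_db(signal, noise):
        """SNR in dB of a clean signal against an additive noise term."""
        return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

    def mean_source_snr(sources, noise, scale):
        """Mean SNR over sources when the recorded noise is multiplied by 'scale'.

        sources, noise: arrays of shape (n_sources, T); scale = 0 gives infinite SNR.
        """
        if scale == 0:
            return float("inf")
        return float(np.mean([snr_db(s, scale * n) for s, n in zip(sources, noise)]))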


[Figure 8: median Amari index versus mean SNR (dB), one curve each for JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA.]

Fig. 8. Result of the experiment testing the robustness of the different methods against noise. Plotted is the mean over 50 runs.

[Figure 9: real s-EMG signal [V], channels 1-8, plotted over 100-1000 ms.]

Fig. 9. Real s-EMG obtained from a healthy subject performing a sustained contraction at 30% MVC.


[Figure 10: recovered mixing matrices as bar plots over channel (1-8) and signal (1-3); panels (a) A JADE, (b) A NMF, (c) A NMF*, (d) A sNMF, (e) A sNMF*, (f) A SCA.]

Fig. 10. Mixing matrices recovered for the real s-EMG using different methods; (a) JADE, (b, c) NMF with different preprocessing, (d, e) sparse NMF with the same two preprocessing methods as NMF, and (f) SCA.

3.2 Real s-EMG signals

In this section we analyze real s-EMG recordings obtained from ten healthy subjects. At first we will again study a single s-EMG and plot it for visual inspection, and then show statistics over multiple subjects. The first data set has been obtained from a single subject performing a sustained contraction at 30% MVC, see Fig. 9. The signal acquisition and preprocessing are described in section 2.1.2.

We initially use JADE as the ICA algorithm of choice. The estimated mixing matrix is visualized in Fig. 10(a). As in the case of synthetic signals, we then apply NMF and sparse NMF to both X+ and X∗ and obtain the recovered mixing matrices visualized in Fig. 10(b-e).

We face the following problem when recovering the source signals by SCA. If we use PCA to reduce to n = 3 dimensions, we cannot achieve convergence; a generalized Hough plot [37] does not reveal such structure either. Hence we choose dimension reduction to n = 2. In the two-dimensional projected mixtures, the data clearly cluster along two lines, so the assumption of sparseness holds in 2 dimensions. We use Mangasarian-style clustering / SCA (similar to k-means) [38] to recover these directions.


[Figure 11: (a) matrix comparison by Amari index between JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA; (b) comparison of the recovered sources — measure values [a.u.] of KLD, MuIn, QMI, RSD, Renyi, STW and Xcor for each method, the source signals and the channels.]

Fig. 11. Inter-component performance index comparisons; (a) comparison of the Amari index; (b) mean of various source dependence measures.

                 JADE    NMF     NMF∗    sNMF    sNMF∗   SCA
mean kurtosis    4.97    4.74    4.80    4.81    4.80    4.82
sparseness       0.387   0.424   0.413   0.408   0.407   0.405
σ(A✷)            0.76    0.50    0.55    0.60    0.62    0.70

Table 3. Comparison of sparseness of the recovered sources (real s-EMG data).

The mixing matrix thus recovered in two dimensions is plotted in Fig. 10(f). Note that the SCA matrix columns match two columns of the mixing matrices found by the other methods.
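The two-dimensional clustering step can be illustrated by a simple alternating scheme in the spirit of k-plane clustering [38]; the sketch below is a stand-in (k-means-like assignment to lines through the origin), not the exact algorithm or parameter settings used for Fig. 10(f).

    import numpy as np

    def line_directions_2d(x, k=2, n_iter=50):
        """Cluster 2-d (whitened) mixtures x of shape (2, T) along k lines.

        Each point is assigned to the closest line through the origin; each line
        direction is then re-estimated as the dominant eigenvector of its cluster.
        Returns a (2, k) matrix of unit columns, i.e. estimated mixing columns.
        """
        rng = np.random.default_rng(0)
        d = x[:, rng.choice(x.shape[1], size=k, replace=False)]
        d /= np.linalg.norm(d, axis=0, keepdims=True)
        for _ in range(n_iter):
            proj = d.T @ x                                    # (k, T) projections
            res = x[:, None, :] - d[:, :, None] * proj[None, :, :]
            labels = np.argmin(np.linalg.norm(res, axis=0), axis=0)
            for j in range(k):
                pts = x[:, labels == j]
                if pts.shape[1] > 1:
                    _, v = np.linalg.eigh(pts @ pts.T)        # 2x2 scatter matrix
                    d[:, j] = v[:, -1]                        # dominant eigenvector
        return d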

Similar to the toy data set, we compare the recoveries obtained by the different methods using the various indices from section 2.3 for the mixing matrices and the recovered sources, Fig. 11. In contrast to the artificial signals, here all methods yield rather similar performance; the mean Amari indices are roughly half the value of the indices in the toy data setting. This confirms that the methods recover rather similar sources.

As in the case of artificial signals, we again compare the sparseness of the recovered sources, see Tab. 3. Due to the noise present in the real data, the signals are clearly less sparse than the artificial data. Furthermore, a comparison between the various methods yields noticeably fewer differences than in the case of artificial signals. At first glance it seems unclear why SCA performs worse in terms of k-sparseness (measured by σ(A)) than the NMF methods, but better than JADE. This can be explained by the fact that the PCA dimension reduction reduces the number of parameters, so SCA cannot search the whole space and therefore performs worse than NMF in this respect. However, it outperforms JADE, which also uses PCA preprocessing.


[Figure 12: component given by each method [a.u.] over 50-500 ms; panels (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF*, (f) SCA, (g) s-EMG.]

Fig. 12. One of the three recovered sources after applying A⁺ to an s-EMG recording at 30% MVC; (a-f) results obtained using the different methods; (g) original s-EMG signal.

In order to be able to draw statistically more relevant conclusions, we compare the various BSS algorithms on s-EMGs of nine subjects, recorded at 30% MVC. Fig. 12 plots a single extracted source for each BSS algorithm; Fig. 12(g) shows the s-EMG channel in which the dominant source signal is chosen for comparison. One problem with performing batch comparisons, however, lies in the fact that separation performance is commonly evaluated by visual inspection or at most by comparison with other separation methods — because, of course, the original sources are unknown. We cannot produce plots similar to Fig. 12 for each subject, so in order to provide a more objective measure, we consider the main application of BSS for real s-EMG analysis — preprocessing in order to achieve 'cleaner' data for template matching.

A common measure for this is to count the number of zero-crossings of the sources (after combining them into a full eight-dimensional observation vector by taking only the maximally active source in each channel). Note that this zero-crossing count is already output by the MDZ filter. It is directly related to the number of MUAPs present in an s-EMG signal [46], and by comparing this index before and after the BSS algorithms, we can analyze whether and to what extent they actually enhance the signal.


subject ID         JADE     NMF      NMF∗     sNMF     sNMF∗    SCA
b                   9%       7%       9%      10%      11%      10%
f                   5%       3%       5%       4%       5%       4%
g                  37%      35%      35%      35%      36%      30%
k                  35%      35%      35%      35%      34%      38%
m                  16%      16%      17%      16%      16%      13%
og                  9%       8%       8%       8%       8%      10%
ok                  3%       7%       9%       9%      10%       7%
s                  41%      42%      42%      41%      41%      37%
y                  71%      71%      73%      71%      73%      74%
means              25.0%    25.1%    25.7%    25.4%    25.8%    24.6%
std. deviations    21.4%    21.4%    21.3%    20.9%    21.0%    21.4%

Table 4. Negative zero-crossing mean ratios, i.e. relative enhancements, for each subject, together with the mean performance of each algorithm.

With the aid of the MDZ filter, we count the number of waves (the excursions of the signal between two consecutive zero crossings) and take the mean over all channels before and after applying each of the BSS algorithms. We then subtract the latter from the former and divide by the initial number of waves in order to obtain an index that can be compared between different signals. Tab. 4 shows the resulting ratios for the nine subjects. All BSS algorithms result in a reduction of zero-crossings, and the best results per run are achieved by NMF∗, sNMF∗ and SCA. In the mean, all algorithms perform somewhat similarly, with sNMF∗ being best and SCA worst. The best algorithm for this data set, sNMF∗, thus achieved a mean reduction in the number of waves of 25.8%, which means that after applying the algorithm each channel contains, on average, about one quarter fewer waves than before, making the template-matching technique more easily applicable.
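As a rough illustration of this wave-counting index (the MDZ filter itself, which performs the count after dead-zone thresholding, is not reproduced here), a plain sign-change count per channel already yields the same kind of ratio:

    import numpy as np

    def zero_crossings(x):
        """Number of sign changes in a 1-d signal (each change ends one 'wave')."""
        s = np.sign(x)
        s = s[s != 0]                      # ignore exact zeros
        return int(np.sum(s[:-1] != s[1:]))

    def relative_wave_reduction(before, after):
        """(waves before - waves after) / waves before, averaged over channels."""
        nb = np.mean([zero_crossings(c) for c in before])
        na = np.mean([zero_crossings(c) for c in after])
        return float((nb - na) / nb)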

4 Discussion

The main focus of this work lies in the application of three different sparse BSS models — source independence, (sparse) nonnegativity and k-sparseness — to the analysis of s-EMG signals. This application is motivated by the fact that the underlying MUAPTs exhibit properties (mainly sparseness) that fit quite well to these three in principle different models. Furthermore, we take an interest in how well these models behave in the case of slightly perturbed initial conditions — ICA, for instance, is known to be quite robust against


small errors in the independence assumption — and how they can deal with additional noise.

In the first example of artificially created mixtures, we were able to demonstrate that a decomposition analysis using the three models is possible, and we gave comparisons over larger data sets. Although the recovered sources are all rather alike (Fig. 5 and Fig. 7), we found that ICA outperformed the other methods in terms of distance to the real solution, mainly because the artificial sources — but not the real signals, see below — fitted the ICA model best, Tab. 2. However, when considering additional noise of increasing power, sparse NMF turned out to be more robust than the ICA model, Fig. 8.

We then applied the BSS methods to real s-EMG data sets, and the three different models yielded surprisingly similar results, although in theory these models do not fully overlap. We speculate that this similarity is due to the fact that — as most probably in all applications — the models do not fully hold. This allows the various algorithms only to approximate the model solutions, and hence to arrive at similar solutions, but from different directions. Furthermore, this indicates that the three models look for different properties of the sources, and that these properties are fulfilled to varying extents, see Tab. 3 for numerical details. Comparisons over s-EMG data sets from multiple subjects again confirmed this similarity of performance, where again sparse NMF slightly outperformed the other algorithms in the mean (in nice correspondence with the noise result from above).

Note that the aim of the present work is not the full recovery of a target MUAPT (source signal) in its original form. In fact, as shown previously [18], it would be sufficient to increase the amplitude of a target MUAPT so that on average it lies above the noise and above the level of the other MUAPTs. Then we are able to cut the interfering signals with a modified dead-zone filter and thus isolate the target MUAPT. Indeed, all of the employed sparse BSS methods fulfill this requirement, which is confirmed by the decrease in the number of zero-crossings of the separated signals, see Tab. 4.
number of zero-cross<strong>in</strong>gs of the separated signals, see Tab. 4.<br />

5 Conclusion

We have compared the effectiveness of various sparse BSS methods for signal decomposition, namely ICA, NMF, sparse NMF and SCA, when applied to s-EMG signals. Surface EMG signals represent an attractive test signal, as they approximately fulfill all the requirements of these methods and are of major importance in medical diagnosis and basic neurophysiological research.


All methods, in spite of being based on very different approaches, gave similar results in the case of real data and decomposed them sufficiently well. We therefore suggest using sparse BSS as an important preprocessing tool before applying the common template-matching technique. In terms of algorithm comparisons, ICA performed better than the other algorithms in the noiseless case, but sparse NMF∗ outperformed the other methods when noise was added, and slightly so in the case of multiple real s-EMG recordings. In later work on methods of s-EMG decomposition, we therefore want to focus on properties and possible improvements of sparse NMF regarding parameter choice (which level of sparseness to choose) and signal preprocessing (in order to deal with positive signals). Preprocessing to improve sparseness is currently being studied. In order to better compare methods, we also plan to apply the methods to artificial sources generated using other available s-EMG generators [47]. Finally, extensions to convolutive mixing situations will have to be analyzed.

Acknowledgements

We would like to thank Dr. S. Rainieri for helpful discussions and K. Maekawa for the first version of the s-EMG generator. This work was partly supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan (Grant-in-Aid for Scientific Research). G.A.G. is supported by a grant from the same Ministry. F.T. gratefully acknowledges partial financial support by the DFG (GRK 638) and the BMBF (project 'ModKog').

References

[1] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[2] A. Cichocki, S. Amari, Adaptive Blind Signal and Image Processing, John Wiley & Sons, 2002.
[3] P. Georgiev, F. Theis, A. Cichocki, Blind source separation and sparse component analysis of overcomplete mixtures, in: Proc. ICASSP 2004, Vol. 5, Montreal, Canada, 2004, pp. 493–496.
[4] P. Georgiev, F. Theis, A. Cichocki, Sparse component analysis and blind source separation of underdetermined mixtures, IEEE Trans. Neural Networks, in press.
[5] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[6] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1998) 33–61.
[7] F. Theis, E. Lang, C. Puntonet, A geometric algorithm for overcomplete linear ICA, Neurocomputing 56 (2004) 381–398.
[8] M. Zibulevsky, B. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Computation 13 (4) (2001) 863–882.
[9] M. Lewicki, T. Sejnowski, Learning overcomplete representations, Neural Computation 12 (2) (2000) 337–365.
[10] B. Olshausen, D. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (6583) (1996) 607–609.
[11] M. D. Plumbley, E. Oja, A 'non-negative PCA' algorithm for independent component analysis, IEEE Transactions on Neural Networks 15 (1) (2004) 66–76.
[12] E. Oja, M. D. Plumbley, Blind separation of positive sources using non-negative PCA, in: Proc. ICA 2003, Nara, Japan, 2003, pp. 11–16.
[13] W. Liu, N. Zheng, X. Lu, Non-negative matrix factorization for visual coding, in: Proc. ICASSP 2003, Vol. III, 2003, pp. 293–296.
[14] P. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research 5 (2004) 1457–1469.
[15] J. Basmajian, C. De Luca, Muscles Alive: Their Functions Revealed by Electromyography, 5th Edition, Williams & Wilkins, Baltimore, 1985.
[16] A. Halliday, S. Butler, R. Paul, A Textbook of Clinical Neurophysiology, John Wiley & Sons, New York, 1987.
[17] D. Farina, R. Merletti, R. Enoka, The extraction of neural strategies from the surface EMG, Journal of Applied Physiology 96 (2004) 1486–1495.
[18] G. García, R. Okuno, K. Akazawa, Decomposition algorithm for surface electrode-array electromyogram in voluntary isometric contraction, IEEE Engineering in Medicine and Biology Magazine, in press.
[19] C. Disselhorst-Klug, J. Silny, G. Rau, Estimation of the relationship between the noninvasively detected activity of single motor units and their characteristic pathological changes by modelling, Journal of Electromyography and Kinesiology 8 (1998) 323–335.
[20] S. Andreassen, A. Rosenfalck, Relationship of intracellular and extracellular action potentials of skeletal muscle fibers, CRC Critical Reviews in Bioengineering 6 (4) (1981) 267–306.
[21] H. Clamann, Statistical analysis of motor unit firing patterns in a human skeletal muscle, Biophysical Journal 9 (1969) 1233–1251.
[22] P. Griep, F. Gielen, H. Boom, K. Boon, L. Hoogstraten, C. Pool, W. Wallinga-De-Jonge, Calculation and registration of the same motor unit action potential, Electroencephalography and Clinical Neurophysiology 53 (1973) 388–404.
[23] E. Stålberg, J. Trontelj, Single Fiber Electromyography, The Mirvalle Press, Old Woking (UK), 1979.
[24] S. Maekawa, T. Arimoto, M. Kotani, Y. Fujiwara, Motor unit decomposition of surface EMG using multichannel blind deconvolution, in: Proc. ISEK 2002, Vienna, Austria, 2002, pp. 38–39.
[25] K. McGill, K. Cummins, L. Dorfman, Automatic decomposition of the clinical electromyogram, IEEE Transactions on Biomedical Engineering 32 (7) (1985) 470–477.
[26] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[27] F. Theis, A new concept for separability problems in blind source separation, Neural Computation 16 (2004) 1827–1850.
[28] J. Hérault, C. Jutten, Space or time adaptive signal processing by neural network models, in: J. Denker (Ed.), Neural Networks for Computing. Proceedings of the AIP Conference, American Institute of Physics, New York, 1986, pp. 206–211.
[29] J.-F. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization, SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.
[30] A. Yeredor, Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation, IEEE Trans. Signal Processing 50 (7) (2002) 1545–1553.
[31] A. Ziehe, P. Laskov, K.-R. Müller, G. Nolte, A linear least-squares algorithm for joint diagonalization, in: Proc. ICA 2003, Nara, Japan, 2003, pp. 469–474.
[32] G. García, K. Maekawa, K. Akazawa, Decomposition of synthetic multi-channel surface-electromyogram using independent component analysis, in: Proc. ICA 2004, Vol. 3195 of Lecture Notes in Computer Science, Granada, Spain, 2004, pp. 985–991.
[33] S. Takahashi, Y. Sakurai, M. Tsukada, Y. Anzai, Classification of neural activities from tetrode recordings using independent component analysis, Neurocomputing 49 (2002) 289–298.
[34] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[35] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126.
[36] D. Lee, H. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems (Proc. NIPS 2000), Vol. 13, MIT Press, 2000, pp. 556–562.
[37] F. Theis, P. Georgiev, A. Cichocki, Robust overcomplete matrix recovery for sparse sources using a generalized Hough transform, in: Proc. ESANN 2004, d-side, Evere, Belgium, Bruges, Belgium, 2004, pp. 343–348.
[38] P. Bradley, O. Mangasarian, k-plane clustering, Journal of Global Optimization 16 (1) (2000) 23–32.
[39] C. De Luca, Physiology and mathematics of myoelectric signals, IEEE Transactions on Biomedical Engineering 26 (6) (1979) 313–325.
[40] C. De Luca, R. LeFever, M. McCue, A. Xenakis, Control scheme governing concurrently active human motor units during voluntary contractions, Journal of Physiology 329 (1982) 129–142.
[41] S. Amari, A. Cichocki, H. Yang, A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems 8 (1996) 757–763.
[42] D. Xu, J. Principe, J. F. III, H.-C. Wu, A novel measure for independent component analysis (ICA), in: Proc. ICASSP 1998, Vol. 2, Seattle, 1998, pp. 1161–1164.
[43] D. Tjøstheim, Measures of dependence and tests of independence, Statistics 28 (1996) 249–282.
[44] R. Duda, P. Hart, D. Stork, Pattern Classification, 2nd Edition, Wiley, New York, 2001.
[45] F. Theis, S. Amari, Postnonlinear overcomplete blind source separation using sparse sources, in: Proc. ICA 2004, Vol. 3195 of Lecture Notes in Computer Science, Granada, Spain, 2004, pp. 718–725.
[46] P. Zhou, Z. Erim, W. Rymer, Motor unit action potential counts in surface electrode array EMG, in: Proc. IEEE EMBS 2003, Cancun, Mexico, 2003, pp. 2067–2070.
[47] B. Freriks, H. Hermens, European Recommendations for Surface Electromyography: Results of the SENIAM Project, Roessingh Research and Development b.v. (CD-ROM), 1999.



Chapter 21

Proc. BIOMED 2005, pages 209-212

Paper F.J. Theis, Z. Kohl, H.G. Kuhn, H.G. Stockmeier, and E.W. Lang. Automated counting of labelled cells in rodent brain section images. In Proc. BioMED 2004, pages 209-212, Innsbruck, Austria, 2004. ACTA Press, Canada

Reference (Theis et al., 2004c)

Summary in section 1.6.2


Automated counting of labelled cells in rodent brain section images

F.J. Theis 1, Z. Kohl 2, H.G. Kuhn 2, H.G. Stockmeier 1 and E.W. Lang 1

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
2 Department of Neurology, University of Regensburg, 93053 Regensburg, Germany

email: fabian@theis.name

ABSTRACT

The genesis of new cells, especially of neurons, in the adult human brain is currently of great scientific interest. In order to measure neurogenesis in animals, newborn cells are labelled with specific markers such as BrdU; in brain sections these can later be analyzed and counted under the microscope. So far, this image analysis has been performed by hand. In this work, we present an algorithm to automatically segment the digital brain section picture into cell and non-cell components, giving a count of the number of cells in the section. This is done by first training a so-called cell classifier with cell and non-cell patches in a supervised manner. This cell classifier can later be used on an arbitrary number of sections by scanning each section and choosing maxima of the classifier response as cell center locations. For training, single- and multi-layer perceptrons were used. In preliminary experiments, we obtain good performance of the classifier.

KEY WORDS

Cell counting, image segmentation, cell classification, neurogenesis, BrdU

1 Biological background

1.1 New neurons in the adult brain

During the last decades, the fact that new neurons are continuously generated in the adult mammalian brain (a phenomenon termed adult neurogenesis) has come more and more into the focus of neuroscience research [1][2][7]. Under physiological conditions, neuroscientists found that adult neurogenesis seems to be restricted to two brain regions: the wall of the lateral ventricle and the granular cell layer of the hippocampus.

A large variety of factors including environmental signals, trophic factors, hormones and neurotransmitters have recently been identified to regulate the generation of new neurons in the adult brain. These studies were typically performed using a combination of different histological techniques, such as non-radioactive labeling of newly generated cells, stereological counting and confocal microscope analysis, in order to quantitatively analyze adult neurogenesis (review in [8]). However, this procedure is time consuming, since histological analysis currently depends on the assessment of positive signals in histological sections by individual investigators through manual or semiautomatic counting.

1.2 Method used

Bromodeoxyuridine (BrdU), a thymidine analog, is given systemically and is integrated into the replicating DNA during cell division [3]. Using a specific antibody against BrdU, labelled cells can be detected by an immunohistochemical staining procedure. The nuclei of labelled cells on 40 µm thick brain sections appear in a dark brown or black dense color. To determine the amount of BrdU-positive cells in the granular cell layer of the hippocampus, they were counted on a light microscope (Olympus IX 70; Hamburg, Germany) with a 20× objective. Digital images with a resolution of 1600 × 1200 pixels were taken by a color video camera using the analySIS software system (Soft Imaging System, Münster, Germany).

2 Automated counting

Figure 1 shows a section image in which the cells are to be counted.

Classical approaches such as thresholding and erosion after image normalization were not successful, mainly because cell clusters in the image cannot be detected properly and counted using these methods.

We decided to adapt a method proposed by Nattkemper et al. [9] to evaluate fluorescence micrographs of lymphocytes invading human tissue. The main idea is to build in a first step a function mapping an image patch to a confidence value in [0, 1], indicating how probable it is that a cell lies in this patch — we call this function the cell classifier. In the second step this function is used as a local filter on the whole image; its application gives a probability distribution over the whole image with local maxima at cell positions. Nattkemper et al. call this distribution the confidence map. Maxima analysis of the confidence map reveals the number and the positions of the cells (image segmentation).

3 Cell classifier

In this section, we will explain how to generate a cell classifier, that is, a function mapping image patches to cell confidence values.



Figure 1. Typical section image and, below, the manually bounded but automatically labelled image. In order to indicate the image size, a black scale bar of length 50 µm is given at the bottom of the top image. Here, the number of counted cells within the boundary (region of interest) is 84; the number of cells in the whole image is 116.

For this we will generate a sample set of cells and non-cells, and then train an artificial neural network on this sample set.

3.1 Sample set

After fixing the patch size — in the following we will use 20 by 20 pixel grey-level image patches — a training set of cell and non-cell patches has to be generated manually by the expert. These image patches are then to be classified by a neural network. Figure 2 shows some cell and non-cell patches.

Interpreting each 20 by 20 image patch as a 400-dimensional vector, we thus get a set of L training vectors

T := {(x_1, t_1), . . . , (x_L, t_L)}

with x_i ∈ ℝ^n — here n = 20² — representing the image patch, and t_i ∈ {0, 1} either 0 or 1 depending on whether x_i is a non-cell or a cell. The goal is to find a mapping correctly classifying this data set, that is a mapping ζ : ℝ^n → [0, 1] with ζ(x_i) = t_i for i = 1, . . . , L. We call such a mapping a cell classifier. Of course ζ is not uniquely defined by the above property, so some regularization has to be introduced. Any interpolation technique such as Fourier or Taylor approximation could be used to find ζ; we will use single- and multilayer perceptrons as explained in the following section.

Figure 2. Part of the training set: the first row consists of 20×20-pixel image patches that contain cells, the lower row consists of non-cell image patches.

3.2 Preprocessing

Before we apply neural network learning, we preprocess the data as follows. After mean removal, we apply principal component analysis (PCA) in order to reduce the dimension of the data set as well as to decorrelate the data in a first separation step. This is achieved by diagonalizing the data set covariance and projecting along the eigenvectors with the largest eigenvalues.

By taking only the first 5 eigenvalues of the training set, projection along these first 5 principal axes still retains 95% of the data variance. Thus, the 400-dimensional data space was reduced to a whitened 5-dimensional data set.
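A minimal sketch of this preprocessing step (mean removal, PCA projection and whitening), assuming the patches are stored as rows of an (L, 400) array:

    import numpy as np

    def pca_whiten(patches, dim=5):
        """Mean removal, projection onto the 'dim' leading principal axes, and whitening."""
        mean = patches.mean(axis=0)
        centered = patches - mean
        cov = np.cov(centered, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
        order = np.argsort(evals)[::-1][:dim]       # indices of the largest eigenvalues
        proj = evecs[:, order]
        whitened = (centered @ proj) / np.sqrt(evals[order])
        return whitened, proj, mean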

A visualization of the 120-sample data set is given in figure 3, here after projection to 3 dimensions. One can easily see that the cell and non-cell components can be linearly separated — thus a perceptron, see next section, can indeed already learn the cell classifier. Furthermore, a k-means clustering algorithm has been applied with k = 2 in order to find the two data clusters. Those directly correspond to the cell/non-cell components, see figure.

3.3 Neural network learning

Supervised learning algorithms try to approximate a given function f : ℝ^n → A ⊂ ℝ^m by using a number of given sample-observation pairs (x_λ, f(x_λ)) ∈ ℝ^n × A. We restrict ourselves to feed-forward layered neural networks [4]; furthermore, we found that simple single-layered neural networks (perceptrons), in comparison to multi-layered networks (MLPs), already sufficed to learn the data set well — and they have the advantage of easier rule extraction and interpretation.
and <strong>in</strong>terpretation.




Figure 3. Data set with 120 samples after 3-dimensional PCA projection (91% of the data variance was retained). The dots mark the 60 samples representing cells, the x's mark the 60 non-cell data points. The two circles indicate the clusters found by a k-means application searching for two clusters. Obviously, k-means nicely differentiates between the cell and the non-cell components.

A perceptron with output dimension 1 consists of a single neuron only, so the output function y can be written as

y(x) = θ(wᵀx + w_0)

with weight w ∈ ℝ^n (n the input dimension), bias w_0 ∈ ℝ, and the Heaviside function θ as activation function (θ(x) = 0 for x < 0 and θ(x) = 1 for x ≥ 0). Often, the bias w_0 is added as an additional weight to w with fixed input 1.

Learning in a perceptron means minimizing an error energy function, here the squared difference between network output and target. This can be performed for example by gradient descent with respect to w and w_0, which induces the well-known delta rule for the weight update

Δw = η (t − y(x)) x,

where η denotes a chosen learning rate parameter, y(x) the output of the neural network at sample x and t the target observation for input x. It is easy to see that a perceptron separates the data linearly, with the boundary hyperplane given by {x ∈ ℝ^n | wᵀx + w_0 = 0}.
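A minimal sketch of this training procedure for the linear-activation unit used below; the learning rate and epoch count are illustrative, and the sign convention follows gradient descent on the squared error.

    import numpy as np

    def train_linear_unit(x, t, eta=0.1, epochs=55):
        """Delta-rule training of a single linear unit y(x) = w^T x + w0.

        x: (L, n) preprocessed patches (e.g. the 5-d PCA features),
        t: (L,) targets in {0, 1}.  The bias w0 is absorbed as an extra weight.
        """
        x = np.hstack([x, np.ones((x.shape[0], 1))])
        w = np.zeros(x.shape[1])
        for _ in range(epochs):
            for xi, ti in zip(x, t):
                y = w @ xi                       # linear activation
                w += eta * (ti - y) * xi         # delta rule: Δw = η (t - y) x
        return w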

For the cell classifier, we use a single-unit perceptron with a linear activation function in order to obtain a measure of the certainty of the cell/non-cell classification. Application of delta learning to the 5-dimensional data set from above gives excellent performance after only 4 epochs of batch learning. The final performance error (the variance of the perceptron's estimation error on the training set) after 55 epochs was 0.0038, which confirms the good performance as well as the linearity of the classification problem. This was further confirmed when we used a two-layered network with 5 hidden neurons in order to test for nonlinearities in the data set. After only 10 epochs, the error was already very small, and it could finally be diminished to 3·10⁻¹⁹. Still, the performance of the perceptron is more than sufficient for classification.

4 Confidence map

4.1 Generation

The cell classifier from above only has to be trained once. Given such a cell classifier, section pictures can now be analyzed as follows. A pixelwise scan of the image gives an image patch with center location at the scan point; to this image patch the cell classifier is applied to give a probability of whether a cell is at the given location or not. Altogether (after image extension at the boundaries) this yields a probability distribution over the whole image which is called the confidence map. Each point of the confidence map is a value in [0, 1] stating how probable it is that a cell is depicted at the specified location.

In practice a pixelwise scan is too expensive in terms of calculation time, so instead a grid value, say 5 for 20 × 20 patches, is introduced, and the picture is scanned only every 5th pixel. This yields a rasterization of the original confidence map, which is still good enough to detect cells. Figure 4 shows the rasterized confidence map of a section part. The maxima of the confidence map correspond to the cell locations; small but non-zero values in the confidence map typically indicate misclassifications that can be avoided by thresholding.
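As an illustration of this scanning step (image boundaries are simply skipped here rather than extended, and the classifier is assumed to be any function mapping a flattened patch to [0, 1]):

    import numpy as np

    def confidence_map(image, classifier, patch=20, grid=5):
        """Scan a grey-level image on a coarse grid and collect classifier outputs."""
        h, w = image.shape
        rows = range(0, h - patch + 1, grid)
        cols = range(0, w - patch + 1, grid)
        conf = np.zeros((len(rows), len(cols)))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                conf[i, j] = classifier(image[r:r + patch, c:c + patch].ravel())
        return conf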

4.2 Evaluation

After the confidence map has been generated, it can be evaluated by simple maxima analysis. However, as seen in figure 4, maxima do not always correspond to cell positions, so thresholding of the confidence map is applied first. Values of 0.5 to 0.8 yield good results in experiments. Furthermore, the cell classifier may have responded to one and the same cell when applied to image patches with large overlap. Therefore, after a maximum has been detected, adjacent points in the confidence map within a given radius are also set to zero (15 to 18 were good values for 20 × 20 image patches). Iterative application of this algorithm then gives the final cell positions and hence the image segmentation.
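A sketch of this iterative maxima analysis with thresholding and a suppression radius (given here in pixels and converted to grid cells; parameters as in the experiments described below are only illustrative defaults):

    import numpy as np

    def cell_positions(conf, threshold=0.8, radius=18, grid=5):
        """Pick confidence-map maxima above 'threshold', zeroing a neighbourhood each time."""
        conf = conf.copy()
        r = max(1, radius // grid)                  # suppression radius in grid cells
        positions = []
        while True:
            i, j = np.unravel_index(np.argmax(conf), conf.shape)
            if conf[i, j] < threshold:
                break
            positions.append((i * grid, j * grid))  # approximate pixel coordinates
            conf[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1] = 0.0
        return positions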

5 Results

In practice we used perceptron learning after preprocessing with PCA and also ICA [5][6], in order to provide the learning algorithm with linearly separable data.
of 0.8 was applied <strong>in</strong> the confidence map and the



Figure 4. The plot shows the confidence map with grid value 5 of the image part shown above.

In figure 1 an automatically segmented picture is shown. We observe good performance of the counting algorithm. So far we have only compared the cell numbers per section counted by the algorithm with those counted by an expert; the resulting counting errors are about 5%. In further experiments, we also want to compare the cell positions detected by the algorithm with those marked by an expert.

6 Conclusion

We have presented a framework for brain section image segmentation and analysis. The feature detector, here a cell classifier, was first trained on a given sample set using a neural network. This detector was then applied by scanning over the image to obtain a confidence map, and maxima analysis yields the cell locations. Experiments showed good performance of the classifier; however, larger tests remain to be performed.

In future work, various problems will have to be addressed. First of all, the scanning performance should be increased in order to allow smaller grid values, which could significantly increase the classification rate. This could be done, for example, by using some kind of hierarchical neural network such as a cellular neural network, see [9]. In typical brain section images, some cells that do not lie directly in the focus plane appear blurred. In order to count those without counting them twice in two section images with different focus planes, a three-dimensional cell classifier could be trained for fixed focus plane distances. A different approach to accounting for non-focused cells would be to simply allow 'overcounting' and then to remove duplicates in the segmented images according to their location; this seems suitable given that cells do not vary greatly in size. Finally, sections typically span more than one microscope image. In order to count the cells of a whole section, some way of not counting cells twice in overlapping image parts has to be devised, for example by using techniques from image fusion. Furthermore, often not the whole image but only parts of the section are to be counted; so far this choice of the 'region of interest' is made manually. We hope to automate this in the future by finding separating features of these regions.

7 Acknowledgements

F.J.T. and E.W.L. gratefully acknowledge financial support by the DFG 1 and the BMBF 2.

References

[1] J. Altman and G. Das. Autoradiographic and histological evidence of postnatal hippocampal neurogenesis in rats. J. Comp. Neurol., 124(3):319–335, 1965.
[2] H. Cameron, C. Woolley, B. McEwen, and E. Gould. Differentiation of newly born neurons and glia in the dentate gyrus of the adult rat. Neuroscience, 56(2):336–344, 1993.
[3] F. Dolbeare. Bromodeoxyuridine: a diagnostic tool in biology and medicine, part I: historical perspectives, histochemical methods and cell kinetics. Histochem. J., 27(5):339–369, 1995.
[4] S. Haykin. Neural Networks. Macmillan College Publishing Company, 1994.
[5] J. Hérault and C. Jutten. Space or time adaptive signal processing by neural network models. In J. S. Denker, editor, Neural Networks for Computing. Proceedings of the AIP Conference, pages 206–211, New York, 1986. American Institute of Physics.
[6] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[7] H. G. Kuhn, H. Dickinson-Anson, and F. Gage. Neurogenesis in the dentate gyrus of the adult rat: age-related decrease of neuronal progenitor proliferation. J. Neurosci., 16(6):2027–2033, 1996.
[8] H. G. Kuhn, T. Palmer, and E. Fuchs. Adult neurogenesis: a compensatory mechanism for neuronal damage. Eur. Arch. Psychiatry Clin. Neurosci., 251(4):152–158, 2001.
[9] T. W. Nattkemper, H. Ritter, and W. Schubert. A neural classifier enabling high-throughput topological analysis of lymphocytes in tissue sections. IEEE Trans. ITB, 5:138–149, 2001.

1 graduate college ‘nonlinear dynamics’
2 project ‘ModKog’




