
Statistical machine learning of biomedical data

Habilitation at the NWF II - Physik, Universität Regensburg

submitted by Fabian J. Theis, 2007


The habilitation thesis was submitted on 7 August 2007.

Mentoring committee (Fachmentorat):

Prof. Dr. Elmar W. Lang (Universität Regensburg)
Prof. Dr. Klaus Richter (Universität Regensburg)
Prof. Dr. Klaus-Robert Müller (FhG FIRST & TU Berlin)


to Jakob


Preface

In this manuscript, I present twenty papers that I have coauthored during the past four years. They range from theoretical contributions such as model analyses and convergence proofs to algorithmic implementations and biomedical applications. The common denominator is the framework of (mostly unsupervised) data analysis based on statistical machine learning methods. For convenience, I summarize and review these papers in chapter 1.

The papers themselves are self-contained. The summary chapter consists of an introduction part and then sections on independent and dependent component analysis, sparseness, preprocessing and applications. These sections are again mostly self-contained except for the dependent component analysis part, which relies on the preceding section, as well as the applications part, which depends on the methods described before.

I want to thank everyone who helped me during the past four years. First of all my former boss Elmar Lang; I got both support and freedom, and always help when needed, even nowadays! Then of course all retired and active members of my group: Peter Gruber, Ingo Keck, Harold Gutch, Christian Guggenberger and more recently Florian Blöchl, Elias Rentzeperis, Martin Rhoden and Dominik Lutter. But I also have to mention and thank my colleagues at the MPI for Dynamics and Self-Organization at Göttingen, in particular the director Theo Geisel and my collaborators Dirk Brockmann and Fred Wolf. I also want to thank my mentors, Elmar Lang, Klaus Richter and Klaus-Robert Müller, for their manifold help and advice during my habilitation.

I thank the Bernstein Center for Computational Neuroscience Göttingen as well as the graduate college ‘Nonlinearity and Nonequilibrium in condensed matter’ at Regensburg for their generous support. Additional funding by the BMBF within the project ModKog and within the projects BaCaTeC and PPP with Granada and London is gratefully acknowledged.

My deepest thanks go to my family and friends, above all my wife Michaela and my son Jakob. You're the real deal.

Göttingen, June 2007
Fabian J. Theis





Contents

Preface

I Summary

1 Statistical machine learning of biomedical data
  1.1 Introduction
  1.2 Uniqueness issues in independent component analysis
    1.2.1 Linear case
    1.2.2 Nonlinear ICA
  1.3 Dependent component analysis
    1.3.1 Algebraic BSS and multidimensional generalizations
    1.3.2 Spatiotemporal BSS
    1.3.3 Independent subspace analysis
  1.4 Sparseness
    1.4.1 Sparse component analysis
    1.4.2 Sparse non-negative matrix factorization
  1.5 Machine learning for data preprocessing
    1.5.1 Denoising
    1.5.2 Dimension reduction
    1.5.3 Clustering
  1.6 Applications to biomedical data analysis
    1.6.1 Functional MRI
    1.6.2 Image segmentation and cell counting
    1.6.3 Surface electromyograms
  1.7 Outlook

II Papers

2 Neural Computation 16:1827-1850, 2004
3 Signal Processing 84(5):951-956, 2004
4 Neurocomputing 64:223-234, 2005
5 IEICE TF E87-A(9):2355-2363, 2004
6 LNCS 3195:726-733, 2004
7 Proc. ISCAS 2005, pages 5878-5881
8 Proc. NIPS 2006
9 Neurocomputing (in press), 2007
10 IEEE TNN 16(4):992-996, 2005
11 EURASIP JASP, 2007
12 LNCS 3195:718-725, 2004
13 Proc. EUSIPCO 2005
14 Proc. ICASSP 2006
15 Neurocomputing 69:1485-1501, 2006
16 Proc. ICA 2006, pages 917-925
17 IEEE SPL 13(2):96-99, 2006
18 Proc. EUSIPCO 2006
19 LNCS 3195:977-984, 2004
20 Signal Processing 86(3):603-623, 2006
21 Proc. BIOMED 2005, pages 209-212

Bibliography


Part I

Summary


Chapter 1

Statistical machine learning of biomedical data

Biostatistics deals with the analysis of high-dimensional data sets originating from biological or biomedical problems. An important challenge in this analysis is to identify underlying statistical patterns that facilitate the interpretation of the data set using techniques from machine learning. A possible approach is to learn a more meaningful representation of the data set, which maximizes certain statistical features. Such representations, which are often linear, have several potential applications, including the decomposition of objects into ‘natural’ components (Lee and Seung, 1999), redundancy and dimensionality reduction (Friedman and Tukey, 1975), biomedical data analysis, microarray data mining or enhancement, feature extraction of images in nuclear medicine, etc. (Alpaydin, 2004, Bishop, 2007, Cichocki and Amari, 2002, Hyvärinen et al., 2001c, MacKay, 2003, Mitchell, 1997).

In the following, we will review some statistical representation models and discuss identifiability conditions. The resulting separation algorithms will be applied to various biomedical problems in the last part of this summary.

1.1 Introduction

Assume the data is given by a multivariate time series x(t) ∈ R^m, where t indexes time, space or some other quantity. Data analysis can be defined as finding a meaningful representation of x(t), i.e. as x(t) = f(s(t)) with unknown features s(t) ∈ R^n and mixing mapping f. Often, f is assumed to be linear, so we are dealing with the situation

x(t) = As(t)    (1.1)

with a mixing matrix A ∈ R^{m×n}. Often, white noise n(t) is added to the model, yielding x(t) = As(t) + n(t); this can be included in s(t) by increasing its dimension. In the situation (1.1), the analysis problem is reformulated as the search for a (possibly overcomplete) basis in which the feature signal s(t) allows more insight into the data than x(t) itself. This of course has to be specified within a statistical framework.

There are two general approaches to finding data representations or models as in (1.1):




[Figure 1.1: panels (a) sources s(t), (b) mixtures x(t), (c) recoveries, (d) WA.]

Figure 1.1: Two-dimensional example of ICA-based source separation. The observed mixture signal (b) is composed of two unknown source signals (a) using a linear mapping. Application of ICA (here: HessianICA) yields the recovered sources (c), which coincide with the original sources up to permutation and scaling: ŝ1(t) ≈ 1.5 s2(t) and ŝ2(t) ≈ −1.5 s1(t). Indeed, the composition of mixing matrix A and separating matrix W equals a unit matrix (d) up to the unavoidable indeterminacies of scaling and permutation.

• supervised analysis: additional information is available, for example in the form of input-output pairs (x(t1), s(t1)), ..., (x(tT), s(tT)). These training samples can be used for interpolation and learning of the map f or basis A (regression). If the sources s are discrete, this leads to a classification problem. The resulting map f can then be used for prediction.

• unsupervised models: instead of samples, weak statistical assumptions are made on either s(t) or f/A. A common assumption for example is that the source components si(t) are mutually independent.

Here, we will mostly focus on the second situation. The unsupervised analysis is often denoted as blind source separation (BSS), since neither features or ‘sources’ s(t) nor mixing mapping f are assumed to be known. The field of BSS has been rather intensively studied by the community for more than a decade. Since the first introduction of a neural-network-based BSS solution by Hérault and Jutten (1986), various algorithms have been proposed to solve the blind source separation problem (Bell and Sejnowski, 1995, Cardoso and Souloumiac, 1993, Comon, 1994, Hyvärinen and Oja, 1997, Theis et al., 2003). Good textbook-level introductions to the topic are given by Hyvärinen et al. (2001c) and Cichocki and Amari (2002). Recent research centers on generalizations and applications. The first part of this manuscript deals with such extended models and algorithms; some applications will be presented later.

A common model for BSS is realized by the independent component analysis (ICA) model (Comon, 1994), where the underlying signals s(t) are assumed to be statistically independent. Let us first concentrate on the linear case, i.e. f = A linear. Then we search for a decomposition x(t) = As(t) of the observed data set x(t) = (x1(t), ..., xn(t))^⊤ into independent signals s(t) = (s1(t), ..., sn(t))^⊤. For example, consider figure 1.1. The goal is to decompose the two time series (b) into two source signals (a). Visually, this is a simple task: obviously the data is composed of two sinusoids with different frequencies; but how to do this algorithmically? And how to formulate a feasible model?
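To make the setup concrete, the following minimal sketch generates a toy example of this kind: two sinusoidal sources are mixed with an assumed mixing matrix as in (1.1) and then separated again. The separator used here is the third-party FastICA implementation from scikit-learn, serving only as a stand-in; it is not the HessianICA algorithm used for figure 1.1, and all concrete numbers are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA  # stand-in separator, not HessianICA

t = np.linspace(0, 1, 500)
s = np.vstack([np.sin(2 * np.pi * 5 * t),      # source 1: 5 Hz sinusoid
               np.sin(2 * np.pi * 13 * t)])    # source 2: 13 Hz sinusoid
A = np.array([[0.8, 0.3],                      # assumed (unknown) mixing matrix, cf. (1.1)
              [0.4, 0.7]])
x = A @ s                                      # observed mixtures x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T               # recovered sources (one per row)
W = ica.components_                            # estimated unmixing matrix
print(np.round(W @ A, 2))                      # should be close to a scaled permutation matrix
```

The product WA printed at the end is the quantity visualized in figure 1.1(d).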




[Figure 1.2: panels (a) cocktail party problem, (b) linear mixing model, (c) neural cocktail party.]

Figure 1.2: Cocktail party problem: (a) a linear superposition of the speakers is recorded at each microphone. This can be written as the mixing model x(t) = As(t) (1.1) with speaker voices s(t) and activity x(t) at the microphones (b). Possible applications lie in neuroscience: given multiple activity recordings of the human brain, the goal is to identify the underlying hidden sources that make up the total activity (c).

A typical application of BSS lies in the cocktail party problem: at a cocktail party, a set of microphones records the conversations of the guests. Each microphone records a linear superposition of the conversations, and at each microphone a slightly different superposition is recorded, depending on its position, see figure 1.2. In the following we will see that, given some rather weak assumptions on the conversations themselves, such as independence of the various talks, it is possible to recover the original sources and the mixing matrix (which encodes the positions of the speakers) using only the signals recorded at the microphones. Note that in real-world situations the nice linear mixing situation deteriorates due to noise, convolutions and nonlinearities.




1.2 Uniqueness issues in independent component analysis

Application of ICA to BSS tacitly assumes that the data follow the model (1.1), i.e. x(t) admits a decomposition into independent sources, and we want to find this decomposition. But neither the mixing function f nor the source signals s(t) are known, so we should expect to find many solutions to this problem. Indeed, the order of the sources cannot be recovered (the speakers at the cocktail party do not have numbers), so there is always an inherent permutation indeterminacy. Moreover, the strength of each source cannot be extracted from this model alone, because f and s(t) can interchange so-called scaling factors. In other words, not knowing the power of each speaker at the cocktail party, we can only extract his speech but not the volume; he could also be standing further away from the microphones, but shouting instead of speaking.

One of the key questions in ICA-based source separation is: are there any other indeterminacies? Without fully answering this question, ICA algorithms cannot be applied to BSS, simply because we would not have any clue how to relate the resulting sources to the original ones. But apparently the set of indeterminacies cannot be very large; after all, at a cocktail party we ourselves are able to distinguish between the various speakers.

1.2.1 Linear case

In 1994, Comon was able to answer this question (Comon, 1994) in the linear case where f = A by reducing it to the Darmois-Skitovitch theorem (Darmois, 1953, Skitovitch, 1953, 1954). Essentially, he showed that if the sources contain at most one Gaussian component, the indeterminacies of the above model are only scaling and permutation. This positive answer more or less made the field popular; from then on, the number of papers published in this field each year increased considerably. However, it may be argued that Comon's proof lacked two points. First, by relying on the rather difficult-to-prove theorem of the two statisticians, the central idea of why there are no further indeterminacies is not at all obvious; hence not many attempts have been made to extend it to more general situations. Furthermore, no algorithm can be extracted from the proof, because it is non-constructive.

In (Theis, 2004a), we took a somewhat different approach. Instead of using Comon's idea of minimal mutual information, we reformulated the condition of source independence in a different way: in simple terms, a two-dimensional source vector s is independent if its density p_s factorizes into two one-component densities p_s1 and p_s2. But this is the case only if ln p_s is a sum of one-dimensional functions, each depending on a different variable. Hence taking the mixed partial derivative with respect to s1 and then s2 always yields zero. In other words, the Hessian H_{ln p_s} of the logarithmic source density is diagonal; this is what we denoted by p_s being a ‘separated function’ in Theis (2004a), see chapter 2. Using only this property, we were able to prove Comon's uniqueness theorem (Theis, 2004a, theorem 2) without having to resort to the Darmois-Skitovitch theorem. Here Gl(n) denotes the group of invertible (n × n)-matrices.

Theorem 1.2.1 (Separability of linear BSS). Let A ∈ Gl(n; R) and s be an independent random vector. Assume that s has at most one Gaussian component and that the covariance of s exists. Then As is independent if and only if A is the product of a scaling and a permutation matrix.



Instead of a multivariate random process s(t), the theorem is formulated for a random vector s, which is equivalent to assuming an i.i.d. process. Moreover, the assumption of equal source (n) and mixture (m) dimension is made, although relaxation to the undercomplete case (1 < n < m) is straightforward, and to the overcomplete case (n > m > 1) is possible (Eriksson and Koivunen, 2003). The assumption of at most one Gaussian component is crucial, since independence of white, multivariate Gaussians is invariant under orthogonal transformations, so theorem 1.2.1 cannot hold in this case.
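The Gaussian exception can be checked numerically: white Gaussian sources remain white, and hence independent, under any orthogonal mixing, so no particular rotation can be singled out. A small illustrative check (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 100000))              # two independent white Gaussian sources
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal "mixing" matrix
x = Q @ s
print(np.round(np.cov(x), 2))                     # ≈ identity: Qs is again white, hence independent
```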

An algorithm for separation: Hessian ICA

The proof of theorem 1.2.1 is constructive, and the exception of the Gaussians comes into play naturally as zeros of a certain differential equation. The idea of why separation is possible now becomes quite clear. Furthermore, an algorithm can be extracted from the pattern used in the proof: after decorrelation, we can assume that the mixing matrix A is orthogonal. By using the transformation properties of the Hessian matrix, we can employ the linear relationship x = As to get

H_{ln p_x} = A^⊤ H_{ln p_s} A    (1.2)

for the Hessian of the mixtures. The key idea, as we have seen in the previous section, is that due to statistical independence, the source Hessian H_{ln p_s} is diagonal everywhere. Therefore equation (1.2) represents a diagonalization of the mixture Hessian, and the diagonalizer equals the mixing matrix A. Such a diagonalization is unique if the eigenspaces of the Hessian are one-dimensional at some point, and this is precisely the case if x(t) contains at most one Gaussian component (Theis, 2004a, lemma 5). Hence, the mixing matrix and the sources can be extracted algorithmically by simply diagonalizing the mixture Hessian evaluated at some point. The Hessian ICA algorithm consists of local Hessian diagonalization of the logarithmic density (or, equivalently, of the easier-to-estimate characteristic function). In order to improve robustness, multiple matrices are jointly diagonalized. Applying this algorithm to the mixtures of our toy example from figure 1.1 yields the very well recovered sources shown in figure 1.1(c).
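A minimal numerical sketch of this idea follows; it is not the published HessianICA implementation. It whitens the mixtures, estimates the mixture log-density with a Gaussian kernel density estimate, approximates its Hessian at a single point by finite differences (the published algorithm instead differentiates the kernel estimate analytically and jointly diagonalizes Hessians at several points), and uses the eigenvectors of that Hessian as the estimated rotation. The evaluation point and the step size are arbitrary choices of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_density_hessian(kde, x0, h=0.2):
    # Finite-difference Hessian of log p(x) at the point x0.
    d = len(x0)
    f = lambda v: np.log(kde(np.asarray(v).reshape(d, 1))[0])
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h * h)
    return 0.5 * (H + H.T)

def hessian_ica_sketch(x, x0=None):
    # Whiten the mixtures, then diagonalize the log-density Hessian at one point;
    # its eigenvectors estimate the remaining orthogonal mixing matrix.
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    z = (E / np.sqrt(d)) @ E.T @ x                        # whitened mixtures
    kde = gaussian_kde(z)
    x0 = np.full(z.shape[0], 0.5) if x0 is None else x0   # arbitrary evaluation point
    _, V = np.linalg.eigh(log_density_hessian(kde, x0))
    return V.T @ z                                        # estimated sources, up to scaling/permutation
```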

A similar algorithm had already been proposed by Lin (1998), but without considering the assumptions necessary for its successful application. In Theis (2004a, theorem 3), we gave precise conditions for when to apply this algorithm and showed that points satisfying these conditions can indeed be found if the sources contain at most one Gaussian component. Lin used a discrete approximation of the derivative operator to approximate the Hessian; we suggested using kernel-based density estimation, which can be differentiated directly. A similar algorithm based on Hessian diagonalization had been proposed by Yeredor (2000) using the character of a random vector. However, the character is complex-valued, and additional care has to be taken when applying a complex logarithm; basically, this is only well-defined locally at non-zeros. In algorithmic terms, the character can easily be approximated from samples. Yeredor suggested joint diagonalization of the Hessian of the logarithmic character evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we proposed to use a combined energy function based on the previously defined separator. This also takes global information into account, but does not have the drawback of being singular at zeros of the density or character, respectively.



Complex generalization

Comon (1994) showed separability of linear real BSS using the Darmois-Skitovitch theorem, see theorem 1.2.1. He noted that his proof for the real case can also be extended to the complex setting. However, a complex version of the Darmois-Skitovitch theorem is needed, which, to the knowledge of the author, had not yet been shown in the literature. In Theis (2004b), see chapter 3, we derived such a theorem as a corollary of a multivariate extension of the Darmois-Skitovitch theorem, first noted by Skitovitch (1954) and shown by Ghurye and Olkin (1962):

Theorem 1.2.2 (complex S-D theorem). Let s1 = Σ_{i=1}^n α_i x_i and s2 = Σ_{i=1}^n β_i x_i with x1, ..., xn independent complex random variables and α_j, β_j ∈ C for j = 1, ..., n. If s1 and s2 are independent, then all x_j with α_j β_j ≠ 0 are Gaussian.

We then used this theorem to prove separability of complex BSS; moreover, a generalization to dependent subspaces was discussed, see section 1.3.3. Note that a simple complex-valued uniqueness proof (Theis, 2004c), which does not need the Darmois-Skitovitch theorem, can be derived similarly to the case of real-valued random variables from above. Recently, additional relaxations of complex identifiability have been described by Eriksson and Koivunen (2006).

1.2.2 Nonlinear ICA

With the growth of the field of ICA, interest in nonlinear model extensions has been increasing. It is easy to see, however, that without any restrictions the class of nonlinear ICA solutions is too large to be of any practical use (Hyvärinen and Pajunen, 1998). Hence, special nonlinearities are usually considered. Here we discuss two specific nonlinear models.

Postnonlinear ICA

A good trade-off between model generality and identifiability is given by the so-called postnonlinear BSS model

x_i = f_i( Σ_{j=1}^n a_ij s_j ).    (1.3)

This explicit, nonlinear model implies that in addition to the linear mixing situation, each sensor xi contains an unknown nonlinearity fi that can further distort the observations, modeling a MIMO system with nonlinear receiver characteristics. It can be interpreted as a single-layered neural network (perceptron) with nonlinear, unknown activation functions, in contrast to the linear perceptron in the case of linear ICA. The model, first proposed by Taleb and Jutten (1999), has applications in telecommunication and biomedical data analysis. Algorithms for reconstructing postnonlinearly mixed sources include (Achard et al., 2003, Babaie-Zadeh et al., 2002, Taleb and Jutten, 1999, Theis and Lang, 2004a, Ziehe et al., 2003a).
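As a small illustration of the generative side of (1.3), the sketch below produces postnonlinearly mixed observations; the chosen sources, mixing matrix and sensor nonlinearities are arbitrary examples and are not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(2, 1000))              # independent, bounded sources
A = np.array([[0.9, 0.5],
              [0.3, 0.8]])                           # linear mixing matrix
f = (np.tanh, lambda u: u + 0.3 * u ** 3)            # sensor-wise nonlinearities f_1, f_2 (illustrative)
x = np.vstack([f[i](A[i] @ s) for i in range(2)])    # x_i = f_i(sum_j a_ij s_j), cf. (1.3)
```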

Identifiability of postnonlinear mixtures was first discussed in a limited context by Taleb and Jutten (1999). In Theis and Gruber (2005), see chapter 4, we continued this study. We thereby generalized ideas already presented by Babaie-Zadeh et al. (2002), where the focus was put on the development of an actual identification algorithm. Babaie-Zadeh was the first to use the method of analyzing bounded random vectors in the context of postnonlinear mixtures



[Figure 1.3 shows the componentwise nonlinearities f1, f2, the mixtures f(As) and the recoveries A^{-1} f(As).]

Figure 1.3: Example of a non-trivial postnonlinear transformation using an absolutely degenerate matrix A and sources s uniform in [0, 1]^2. Both the sources s and the recovered sources A^{-1} f(As) have support in [0, 1]^2, but A is not a permutation and scaling.

(Babaie-Zadeh, 2002). There, he already discussed identifiability issues, albeit explicitly only in the two-dimensional analytic case.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear ICA: A can only be reconstructed up to scaling and permutation. Here, of course, additional indeterminacies come into play because of translation: fi can only be recovered up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(As) = (f ◦ L)((L^{-1}A)s), so f and A can interchange scaling factors in each component. Another indeterminacy could occur if A is not mixing, i.e. at least one observation xi contains only one source; in this case fi can obviously not be recovered. For example, if A equals the unit matrix I, then f(s) is already independent again, because independence is invariant under component-wise nonlinear transformations; so f cannot be found using this method.

A not so obvious indeterminacy occurs if A is absolutely degenerate, which essentially means that the normalized columns differ only by the signs of their entries (Theis and Gruber, 2005, definition 6). Then only the matrix A, but not the nonlinearities, can be recovered by considering the edges of the support of the fully-bounded random vector, as illustrated in figure 1.3. Nonetheless this is no indeterminacy of the model itself, since A^{-1}f(As) is obviously not independent. So by looking at the boundary alone, we sometimes cannot detect independence if the whole system is highly symmetric.

If A is mixing and not absolutely degenerate, then for all fully-bounded sources s no more indeterminacies than in the affine linear case exist, except for the scaling interchange between f and A:

Theorem 1.2.3 (Separability of bounded postnonlinear BSS). Let A, W ∈ Gl(n), one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function with h_i' ≠ 0, and let s be a fully-bounded independent random vector. If W(h(As)) is also independent, then h is affine linear.




So let f ◦ A be the mixing model and W ◦ g the separating model. Putting the two together, we get the above mixing-separating model with h := g ◦ f. The theorem shows that if the mixing-separating model preserves independence then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to scaling). As usual, the model is assumed to be invertible, hence identifiability and uniqueness of the model follow from the separability. Note that if f is only assumed to be continuously differentiable, then additional indeterminacies come into play.

As in the linear case, this separability result can be transformed into a postnonlinear ICA algorithm: in Theis and Lang (2004b) we proposed a linearization identification framework for estimating postnonlinearities in such settings.

Finally, we want to note that we derived a slightly less general uniqueness theorem from the linear separability proof in Theis (2004a), see chapter 2:

Theorem 1.2.4 (Separability of bijective postnonlinear BSS). Let A, W ∈ Gl(n) be mixing, let h : R^n → R^n be a diagonal bijective analytic function with h_i' ≠ 0, and let S be an independent random vector with at most one Gaussian component and existing covariance. If W(h(AS)) is also independent, then h is affine linear.

Similar uniqueness results were further studied by Achard and Jutten (2005).

Quadratic ICA

There are a few different ways of extending linear BSS models. In the previous section, we focused on the postnonlinear mixing model (1.3), which adds an unknown nonlinearity at the end of a linear mixing situation. A different approach is to take a known nonlinearity and add another mixing matrix, as in the step from perceptrons to multilayer perceptrons. Another way of putting this is to take only the first few terms of the Taylor expansion of the mixing or unmixing mapping, and to try to learn such regularized systems. In Theis and Nakamura (2004), see chapter 5, we treated polynomial nonlinearities, specifically second-order monomials or quadratic forms. These represent a relatively simple class of nonlinearities, which can be investigated in detail. In contrast to the postnonlinear model formulation, we defined the nonlinearity for the unmixing situation and derived the corresponding mixing model under some assumptions.

We considered the quadratic unmixing model

y_i := x^⊤ G^(i) x    (1.4)

for symmetric matrices G^(i), i = 1, ..., n, and estimated sources y. We restricted ourselves to a special case of this model by assuming that each G^(i) has the same set of eigenvectors. Then G^(i) = E^⊤ Λ^(i) E with a shared orthogonal eigenvector matrix E and diagonal Λ^(i) with coefficients Λ^(i)_kk on the diagonal. Setting

Λ := ( Λ^(1)_11 ... Λ^(1)_nn ; ... ; Λ^(n)_11 ... Λ^(n)_nn ),

i.e. the matrix whose i-th row collects the diagonal entries of Λ^(i), therefore yields a two-layered unmixing model y = Λ ◦ (.)² ◦ E ◦ x, where (.)² denotes the componentwise square of each element. This can be interpreted as a two-layered feed-forward neural network, and may easily be inverted explicitly, see figure 1.4.
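The equivalence between the quadratic forms (1.4) with shared eigenvectors and this two-layer form can be verified directly; the sketch below uses arbitrary illustrative matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
E, _ = np.linalg.qr(rng.standard_normal((n, n)))     # shared orthogonal eigenvector matrix
Lam = rng.standard_normal((n, n))                    # row i holds the eigenvalues of G^(i)
G = [E.T @ np.diag(Lam[i]) @ E for i in range(n)]    # G^(i) = E^T Lambda^(i) E

x = rng.standard_normal(n)
y_quadratic = np.array([x @ G[i] @ x for i in range(n)])  # y_i = x^T G^(i) x, cf. (1.4)
y_layered = Lam @ (E @ x) ** 2                            # two-layer form y = Lambda∘(.)²∘E∘x
print(np.allclose(y_quadratic, y_layered))                # True
```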



Figure 1.4: Simplified quadratic unmixing model y = Λ ◦ (.)² ◦ E ◦ x. If Ex only takes values in one quadrant, then the model is invertible and its inverse model again follows a multilayer perceptron structure x = E^⊤ ◦ √ ◦ Λ^{-1} ◦ y.

Several studies have employed quadratic forms as a generative process of data. Abed-Meraim et al. (1996) suggested analyzing mixtures by second-order polynomials using a linearization of the mixtures, which however in general destroys the assumption of independence. Leshem (1999) proposed a whitening scheme based on quadratic forms in order to enhance linear separation of time-signals in algorithms such as SOBI (Belouchrani et al., 1997). Similar quadratic mixing models are also considered in Georgiev (2001) and Hosseini and Deville (2003). These are studies in which the mixing model is assumed to be quadratic, in contrast to the quadratic unmixing model (1.4). For demixing into independent components by quadratic forms, Bartsch and Obermayer (2003) suggested applying linear ICA to second-order terms of the data, and Hashimoto (2003) proposed an algorithm based on minimization of the Kullback-Leibler divergence. However, no identifiability was studied; instead, these works focused on the application to natural images.

In (Theis and Nakamura, 2004), we defined the above quadratic unmixing process and derived a generative model. We then reduced the quadratic model to an overdetermined linear model, in which more observations than sources are given, by embedding y into R^{m(m+1)/2}; this can be done by taking the monomials x_i x_j as new variables. Using some linear algebra, we then derived the following identifiability theorem for overdetermined ICA:

Theorem 1.2.5 (Uniqueness of overdetermined ICA). Let x = As with an independent n-dimensional random vector s and a full-rank (m × n)-matrix A with m ≥ n, and let the (n × m)-matrix W be chosen such that Wx is independent. Furthermore assume that s has at most one Gaussian component and that the variances of s exist. Then there exist a permutation matrix P and an invertible scaling matrix L with W = LP(A^⊤A)^{-1}A^⊤ + C and CA = 0.



Figure 1.5: Quadratic ICA of natural images. 3·10^5 sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvector matrices of the quadratic-form coefficient matrices (top figure). For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2·64). Above each image the corresponding eigenvalue (multiplied by 10^3) is printed. In most filters only one or two eigenvalues are dominant.

This allowed us to derive identifiability for the quadratic unmixing model (1.4). Moreover, we were then able to construct an explicit ICA algorithm from this uniqueness result.

Finally, we studied the algorithm in the context of natural image filters, similar to Bartsch and Obermayer (2003) and Hashimoto (2003). We applied quadratic ICA to a set of small patches. Most obtained quadratic forms had one or two dominant linear filters, and these linear filters were selective for local bar stimuli, see figure 1.5. Hence, the values of the quadratic forms corresponded to squared simple-cell outputs.



1.3 Dependent component analysis

In this section, we will discuss the relaxation of the BSS model by taking into account additional structures in the data and dependencies between components. Many researchers have taken an interest in this generalization, which is crucial for applications in real-world settings where such situations are to be expected.

Here, we will consider model indeterminacies as well as actual separation algorithms. For the latter, we will employ a technique that has been the basis of one of the first ICA algorithms (Cardoso and Souloumiac, 1993), namely joint diagonalization (JD). It has since become an important tool in ICA-based BSS and in BSS relying on second-order time-decorrelation (Belouchrani et al., 1997). Its task is, given a set of commuting symmetric n × n matrices Ci, to find an orthogonal matrix A such that A^⊤ Ci A is diagonal for all i. This generalizes the eigenvalue decomposition (i = 1) and the generalized eigenvalue problem (i = 2), in which perfect factorization is always possible.

Other extensions of the standard BSS model, such as including singular matrices (Georgiev and Theis, 2004), will be omitted from the discussion in the following.

1.3.1 Algebraic BSS and multidimensional generalizations

Considering the BSS model from equation (1.1) (or a more general, noisy version x(t) = As(t) + n(t)), the data can only be separated if we put additional conditions on the sources, such as:

• they are stochastically independent: p_s(s1, ..., sn) = p_s1(s1) ··· p_sn(sn),

• each source is sparse, i.e. it contains a certain number of zeros or has a low p-norm for small p and fixed 2-norm,

• s(t) is stationary and has diagonal autocovariances E(s(t + τ) s(t)^⊤) for all τ; here zero-mean s(t) are assumed.

In the following, we will review BSS algorithms based on eigenvalue decomposition, JD and generalizations. Thereby, one of the above conditions is referred to by the term source condition, because we do not want to specialize to a single model. The additive noise n(t) is modeled by a stationary, temporally and spatially white zero-mean process with variance σ². Moreover, we will not deal with the more complicated underdetermined case, so we assume that at most as many sources as sensors are to be extracted, i.e. n ≤ m.

The signals x(t) are observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A†x(t), which is optimal in the maximum-likelihood sense. Here † denotes the pseudo-inverse of A, which equals the inverse in the case m = n. So the BSS task reduces to the estimation of the mixing matrix A; hence the additive noise n is often neglected (after whitening). Note that in the following we will assume that all signals are real-valued. Extensions to the complex case are straightforward.

Approximate joint diagonalization

Many BSS algorithms employ joint diagonalization (JD) techniques on some source condition matrices to identify the mixing matrix. Given a set of symmetric matrices C := {C1, ..., CK},



JD amounts to minimizing the squared sum of the off-diagonal elements of Â^⊤ Ci Â, i.e. minimizing

f(Â) := Σ_{i=1}^K ‖Â^⊤ Ci Â − diag(Â^⊤ Ci Â)‖²_F    (1.5)

with respect to the orthogonal matrix Â, where diag(C) produces a matrix in which all off-diagonal elements of C have been set to zero, and ‖C‖²_F := tr(CC^⊤) denotes the squared Frobenius norm. A global minimum A of f is called a joint diagonalizer of C. Such a joint diagonalizer exists if and only if all elements of C commute.

Algorithms for performing joint diagonalization include, among others, gradient descent on f(Â), Jacobi-like iterative construction of A by Givens rotations in two coordinates (Cardoso and Souloumiac, 1995), an extension minimizing a logarithmic version of (1.5) (Pham, 2001), an alternating optimization scheme switching between column and diagonal optimization (Yeredor, 2002), and more recently a linear least-squares algorithm for diagonalization (Ziehe et al., 2003b), where the latter three algorithms can also search for non-orthogonal matrices A. Note that in practice minimization of the off-diagonal sums only yields an approximate joint diagonalizer: in the case of finite samples, the source condition matrices are estimates, hence they only approximately share the same eigenstructure and do not fully commute, so f(Â) from equation (1.5) cannot be rendered precisely zero but only approximately.
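A compact sketch of such a Jacobi-type orthogonal joint diagonalizer is given below, following the classical Givens-rotation scheme of Cardoso and Souloumiac; the function name, tolerance and sweep count are illustrative choices of this sketch, not a reference implementation.

```python
import numpy as np

def joint_diagonalize(Cs, eps=1e-8, max_sweeps=100):
    """Approximate orthogonal joint diagonalization of symmetric matrices Cs
    (array of shape K x n x n) by sweeps of Givens rotations. Returns an
    orthogonal V such that V.T @ C @ V is (nearly) diagonal for every C."""
    Cs = np.array(Cs, dtype=float)
    K, n, _ = Cs.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # closed-form Givens angle for the (p, q) plane
                ton = Cs[:, p, p] - Cs[:, q, q]
                toff = Cs[:, p, q] + Cs[:, q, p]
                g11, g22, g12 = ton @ ton, toff @ toff, ton @ toff
                theta = 0.5 * np.arctan2(2 * g12, g11 - g22 + np.hypot(g11 - g22, 2 * g12))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > eps:
                    rotated = True
                    # rotate columns and rows p, q of every condition matrix
                    Cp, Cq = Cs[:, :, p].copy(), Cs[:, :, q].copy()
                    Cs[:, :, p], Cs[:, :, q] = c * Cp + s * Cq, c * Cq - s * Cp
                    Rp, Rq = Cs[:, p, :].copy(), Cs[:, q, :].copy()
                    Cs[:, p, :], Cs[:, q, :] = c * Rp + s * Rq, c * Rq - s * Rp
                    # accumulate the rotation
                    Vp, Vq = V[:, p].copy(), V[:, q].copy()
                    V[:, p], V[:, q] = c * Vp + s * Vq, c * Vq - s * Vp
        if not rotated:
            break
    return V
```

In the hard-whitening pipeline described below, the Cs would be condition matrices estimated from whitened mixtures, and the returned rotation V yields the estimate of the remaining orthogonal mixing matrix.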

Source conditions

In order to get a well-defined source separation model, assumptions on the sources, such as stochastic independence, have to be formulated. In practice, the conditions are preferably given in terms of roots of some cost function that can easily be estimated. Here, we summarize some of the source conditions used in the literature; they are defined by a criterion specifying the diagonality of a set of matrices C(.) := {C1(.), ..., CK(.)}, which can be estimated from the data. We require only that

Ci(Wx) = W Ci(x) W^⊤    (1.6)

for some matrix W. Note that using the substitution C̄i(x) := Ci(x) + Ci(x)^⊤, we can assume Ci(x) to be symmetric. The actual source model is then defined by requiring the condition matrices Ci(s) of the sources to be diagonal for all i = 1, ..., K. In table 1.1, we review some commonly used source conditions for an m-dimensional centered random vector x or a multivariate random process x(t), respectively.

Searching for sources s := Wx fulfilling the source model means finding matrices W such that Ci(Wx) is diagonal for all i. Depending on the algorithm, whitening by PCA is performed as preprocessing to allow for a reduced search on the orthogonal group W ∈ O(n). This is equivalent to setting all source second-order statistics to I, and then searching only for rotations. In the case of K = 1, the search can be performed by an eigenvalue decomposition of the source condition matrix C1(x̃) of the whitened mixtures x̃; this is equivalent to solving the generalized eigenvalue decomposition (GEVD) problem for the matrix pencil (E(xx^⊤), C1(x̃)). Usually, using more than one condition matrix increases the robustness of the proposed algorithm, and in these cases the algorithm performs orthogonal JD of C := {Ci(x̃)}, e.g. by a Jacobi-type algorithm (Cardoso and Souloumiac, 1995).
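The K = 1 case can be written down in a few lines. The sketch below follows the hard-whitening route (PCA, then an EVD of a single symmetrized autocovariance matrix of the whitened mixtures, as in AMUSE-type algorithms); the function name and the lag are illustrative choices of this sketch.

```python
import numpy as np

def amuse_like(x, tau=1):
    """Separation sketch for the K = 1 case: PCA whitening followed by an EVD
    of one symmetrized autocovariance matrix of the whitened mixtures.
    x has shape (m, T); returns estimated sources and the unmixing matrix."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    W_white = (E / np.sqrt(d)) @ E.T                       # PCA whitening matrix
    z = W_white @ x                                        # whitened mixtures
    C_tau = z[:, tau:] @ z[:, :-tau].T / (z.shape[1] - tau)
    C_tau = 0.5 * (C_tau + C_tau.T)                        # symmetrized condition matrix C1
    _, V = np.linalg.eigh(C_tau)                           # rotation from the EVD
    W = V.T @ W_white
    return W @ x, W
```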



Table 1.1: BSS algorithms based on joint diagonalization (centered sources are assumed)

algorithm | source model | condition matrices | optimization algorithm
FOBI (Cardoso and Souloumiac, 1990) | independent i.i.d. sources | contracted quadricovariance matrix with E_ij = I | EVD after PCA (GEVD)
JADE (Cardoso and Souloumiac, 1993) | independent i.i.d. sources | contracted quadricovariance matrices | orthogonal JD after PCA
eJADE (Moreau, 2001) | independent i.i.d. sources | arbitrary-order cumulant matrices | orthogonal JD after PCA
HessianICA (Theis, 2004a, Yeredor, 2000) | independent i.i.d. sources | multiple Hessians H_{log x̂}(x^(i)) or H_{log p_x}(x^(i)) | orthogonal JD after PCA
AMUSE (Molgedey and Schuster, 1994, Tong et al., 1991) | wide-sense stationary s(t) with diagonal autocovariances | single autocovariance matrix E(x(t + τ)x(t)^⊤) | EVD after PCA (GEVD)
SOBI (Belouchrani et al., 1997), TDSEP (Ziehe and Mueller, 1998) | wide-sense stationary s(t) with diagonal autocovariances | multiple autocovariance matrices | orthogonal JD after PCA
mdAMUSE (Theis et al., 2004e) | s(t1, ..., tM) with diagonal autocovariances | single multidimensional autocovariance matrix (1.7) | EVD after PCA (GEVD)
mdSOBI (Schießl et al., 2000, Theis et al., 2004e) | s(t1, ..., tM) with diagonal autocovariances | multidimensional autocovariance matrices (1.7) | orthogonal JD after PCA
JADE_TD (Müller et al., 1999) | independent s(t) with diagonal autocovariances | cumulant and autocovariance matrices | orthogonal JD after PCA
SONS (Choi and Cichocki, 2000) | non-stationary s(t) with diagonal (auto-)covariances | (auto-)covariance matrices of windowed signals | orthogonal JD after PCA
ACDC (Yeredor, 2002), LSDIAG (Ziehe et al., 2003b) | independent or auto-decorrelated s(t) | covariance matrices and cumulant/autocovariance matrices | non-orthogonal JD
block-Gaussian likelihood (Pham and Cardoso, 2001) | block-Gaussian non-stationary s(t) | (auto-)covariance matrices of windowed signals | non-orthogonal JD
TFS (Belouchrani and Amin, 1998) | s(t) from Cohen's time-frequency distributions (Cohen, 1995) | spatial time-frequency distribution matrices | orthogonal JD after PCA
FRT-based BSS (Karako-Eilon et al., 2003) | non-stationary s(t) with diagonal block-spectra | autocovariances of FRT-transformed windowed signals | (non-)orthogonal JD
ACMA (van der Veen and Paulraj, 1996) | s(t) is of constant modulus (CM) | independent vectors in ker P̂ of the model matrix P̂ | generalized Schur (QZ) decomposition
stBSS (Theis et al., 2005a) | spatiotemporal sources S := s(r, t) | any of the above conditions for both X and X^⊤ | non-orthogonal JD
group BSS (Theis, 2005a) | group-dependent sources s(t) | any of the above conditions | block orthogonal JD after PCA
after PCA



In contrast to this so-called hard-whitening technique, soft-whitening tries to avoid a bias towards second-order statistics: a non-orthogonal joint diagonalization algorithm (Pham, 2001, Yeredor, 2002, Ziehe et al., 2003b) jointly diagonalizes the source conditions Ci(x) together with the mixture covariance matrix E(xx⊤). Possible estimation errors in the second-order part then do not influence the total error disproportionately.

Depending on the source conditions, various algorithms have been proposed in the literature. Table 1.1 gives an overview of these algorithms together with the references, the source model, the condition matrices and the optimization algorithm. For more details and references, we refer to Theis and Inouye (2006).
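Most of the methods in table 1.1 end in the same computational step: an orthogonal approximate joint diagonalization of the condition matrices after whitening. The following NumPy sketch illustrates that step with a simple Jacobi-type sweep over Givens rotations. It is a didactic variant, not the exact closed-form update of Cardoso and Souloumiac (1995): each rotation angle is obtained from a small 2×2 eigenproblem that minimizes the off-diagonal mass of the affected entries, and real symmetric condition matrices are assumed.

```python
import numpy as np

def joint_diagonalize(Cs, sweeps=100, eps=1e-8):
    """Orthogonal approximate joint diagonalization of real symmetric matrices Cs
    by sweeps of Givens rotations (didactic sketch of the Jacobi-type approach)."""
    Cs = [np.array(C, dtype=float, copy=True) for C in Cs]
    n = Cs[0].shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # after a rotation by theta, the (p, q) entry of each C becomes
                # d*cos(2t) + e*sin(2t); choose (cos 2t, sin 2t) as the eigenvector
                # belonging to the smallest eigenvalue of the 2x2 quadratic form
                d = np.array([C[p, q] for C in Cs])
                e = np.array([(C[q, q] - C[p, p]) / 2.0 for C in Cs])
                G2 = np.array([[d @ d, d @ e], [d @ e, e @ e]])
                _, U = np.linalg.eigh(G2)
                u, v = U[:, 0]
                if u < 0:                      # fix the sign -> small 'inner' rotation
                    u, v = -u, -v
                theta = 0.5 * np.arctan2(v, u)
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) < eps:
                    continue
                rotated = True
                G = np.eye(n)
                G[p, p] = G[q, q] = c
                G[p, q], G[q, p] = -s, s
                V = V @ G
                Cs = [G.T @ C @ G for C in Cs]
        if not rotated:
            break
    return V, Cs    # V.T @ C_i @ V is approximately diagonal
```

In the hard-whitening setting, the Ci would be condition matrices of the whitened mixtures (e.g. autocovariances) and V⊤ is the estimated rotation; for soft-whitening one would additionally include E(xx⊤) in the set and switch to a non-orthogonal JD algorithm, which this sketch does not cover.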

Multidimensional autodecorrelation

In Theis et al. (2004e), see chapter 6, we considered BSS algorithms based on time-decorrelation and the resulting source condition. Corresponding JD-based algorithms include AMUSE (Tong et al., 1991) and extensions such as SOBI (Belouchrani et al., 1997) and TDSEP (Ziehe and Mueller, 1998). They rely on the fact that the data sets have non-trivial autocorrelations. We extended them to data sets having more than one direction in the parametrization, such as images. For this, we replaced one-dimensional autocovariances by multidimensional autocovariances defined by

C_{τ1,...,τM}(s) := E( s(z1 + τ1, . . . , zM + τM) s(z1, . . . , zM)⊤ )   (1.7)

where s is centered and the expectation is taken over (z1, . . . , zM). Given equidistant samples, C_{τ1,...,τM}(s) can be estimated as usual by replacing random variables with sample values and expectations with sums.
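As a concrete illustration, the following NumPy sketch estimates the multidimensional autocovariance of equation (1.7) from gridded samples; it assumes already centered sources and non-negative shifts, and simply discards the boundary samples for which s(z + τ) is not available.

```python
import numpy as np

def md_autocov(S, tau):
    """Sample estimate of the multidimensional autocovariance C_tau(s) of eq. (1.7).

    S   : array of shape (n, d1, ..., dM) -- n centered components sampled on an
          equidistant M-dimensional grid (e.g. n images of size h x w for M = 2).
    tau : tuple of M non-negative integer shifts (tau_1, ..., tau_M).
    """
    n = S.shape[0]
    shifted = tuple(slice(t, d) for d, t in zip(S.shape[1:], tau))     # samples s(z + tau)
    base = tuple(slice(0, d - t) for d, t in zip(S.shape[1:], tau))    # samples s(z)
    A = S[(slice(None),) + shifted].reshape(n, -1)
    B = S[(slice(None),) + base].reshape(n, -1)
    return A @ B.T / A.shape[1]    # expectation over the grid replaced by the sample mean
```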

A typical example for non-trivial multidimensional autocovariances is a source data set in which each component si represents an image of size h × w. Then the data is of dimension M = 2, and samples of s are given at indices z1 = 1, . . . , h, z2 = 1, . . . , w. Classically, s(z1, z2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples (vectorization). If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure-based algorithms such as AMUSE and SOBI, the results can vary greatly depending on the choice of this mapping.

The advantage of using multidimensional autocovariances lies in the fact that the multidimensional structure of the data set can now be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance. Figure 1.6 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ1, τ2) after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional covariance. Only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. does it capture image structure.

For details as well as extended simulations and examples, we refer to Theis et al. (2004e) and related work by Schießl et al. (2000), Schöner et al. (2000).
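To make the use of such a condition matrix concrete, here is a compact, self-contained sketch of an AMUSE-style separation of image mixtures with a single two-dimensional autocovariance. The helper name md_amuse_2d is hypothetical and only captures the spirit of mdAMUSE, not the published implementation; it assumes noiseless, full-rank mixtures X_img of shape (m, h, w).

```python
import numpy as np

def md_amuse_2d(X_img, tau=(1, 1)):
    """AMUSE-style separation with a single 2d autocovariance (illustrative sketch)."""
    m, h, w = X_img.shape
    X = X_img.reshape(m, -1)
    X = X - X.mean(axis=1, keepdims=True)
    # hard-whitening / PCA step
    d, E = np.linalg.eigh(np.cov(X))
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    Z = (W @ X).reshape(m, h, w)
    # symmetrized 2d autocovariance of the whitened data at shift tau
    t1, t2 = tau
    A = Z[:, t1:, t2:].reshape(m, -1)
    B = Z[:, :h - t1, :w - t2].reshape(m, -1)
    C = A @ B.T / A.shape[1]
    C = (C + C.T) / 2
    # EVD of the whitened autocovariance gives the remaining rotation (GEVD idea)
    _, V = np.linalg.eigh(C)
    S_hat = V.T @ W @ X
    return S_hat.reshape(m, h, w), V.T @ W    # sources (up to scaling/permutation), unmixing matrix
```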



Figure 1.6: Example of one- and two-dimensional autocovariance coefficients (b) of the gray-scale 128×128 Lena image (a) after normalization to variance 1. Using the local structure in both directions (2d-autocov) guarantees that for small τ higher autocorrelation values are present than after rearranging the data into a vector (1d-autocov), which loses the information about the second dimension.

1.3.2 Spatiotemporal BSS

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. In Theis et al. (2007b), see chapter 9, we propose an algorithm that includes such spatiotemporal information in the analysis and reduces the problem to the joint approximate diagonalization of a set of autocorrelation matrices.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. (2002), it is a promising method with potential applications in areas where the data contains an inherent spatiotemporal structure, such as biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm by Bell and Sejnowski (1995), which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization.


Figure 1.7: Temporal, spatial and spatiotemporal BSS models: (a) temporal BSS X = ᵗA ᵗS, (b) spatial BSS X⊤ = ˢA ˢS, (c) spatiotemporal BSS X = ˢS⊤ ᵗS. The lines in the matrices ∗S indicate the sample direction; source conditions apply between adjacent such lines.

This has the advantage of greatly reducing the number of parameters in the system, and it leads to more stable optimization algorithms. In Theis et al. (2007b), we extended Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structure of the data.

For this, we considered data sets x(r, t) depending on two indices r and t, where r ∈ Rⁿ can be any multidimensional (spatial) index and t indexes the time axis. In order to be able to use matrix notation, we contracted the spatial multidimensional index r into a one-dimensional index r by row concatenation. Then the data set x(r, t) =: x_rt can be represented by a data matrix X of dimension ˢm × ᵗm, where the superscripts ˢ(·) and ᵗ(·) denote spatial and temporal variables, respectively.
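For illustration, a minimal NumPy snippet constructing such a data matrix from a toy recording (the grid size and number of time points are made up for the example):

```python
import numpy as np

# Toy recording x(r, t) on an 8 x 10 spatial grid observed at 50 time points.
x = np.random.randn(8, 10, 50)

sm, tm = 8 * 10, 50              # spatial dimension and temporal dimension
X = x.reshape(sm, tm)            # contract the spatial index by row concatenation

# Temporal BSS factorizes X   (rows indexed by voxels, columns by time samples);
# spatial  BSS factorizes X.T (rows indexed by time points, columns by voxels).
```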

Temporal BSS implies the matrix factorization X = ᵗA ᵗS, whereas spatial BSS implies the factorization X⊤ = ˢA ˢS or, equivalently, X = ˢS⊤ ˢA⊤. Hence X = ᵗA ᵗS = ˢS⊤ ˢA⊤. So both source separation models can be interpreted as matrix factorization problems; in the temporal case restrictions such as diagonal autocorrelations are placed on the second factor, in the spatial case on the first one. In order to achieve a spatiotemporal model, we required these conditions from both factors at the same time. Therefore the spatiotemporal BSS model can be derived from the above as the factorization problem

X = ˢS⊤ ᵗS   (1.8)

with spatial source matrix ˢS and temporal source matrix ᵗS, both of which have (multidimensional) autocorrelations that are as diagonal as possible. The three models are illustrated in figure 1.7.

Concerning conditions for the sources, we interpreted Ci(X) := Ci(ᵗx(t)) as the i-th temporal autocovariance matrix, whereas Ci(X⊤) := Ci(ˢx(r)) denoted the corresponding spatial autocovariance matrix. Application of the spatiotemporal mixing model from equation (1.8) together with the transformation properties (1.6) of the source conditions yields

Ci(ᵗS) = ˢS†⊤ Ci(X) ˢS†   and   Ci(ˢS) = ᵗS†⊤ Ci(X⊤) ᵗS†   (1.9)

because ∗m ≥ n and hence ∗S ∗S† = I. By assumption the matrices Ci(∗S) are as diagonal as possible. In order to separate the data, we had to find diagonalizers for both Ci(X) and Ci(X⊤) such that they satisfy the spatiotemporal model (1.8). As the matrices derived from X had to be diagonalized in terms of both columns and rows, we denoted this as double-sided approximate joint diagonalization.

In Theis et al. (2007b, 2005a) we showed how to reduce this process to joint diagonalization. In order to get robust estimates of the source conditions, dimension reduction was essential. For this we considered the singular value decomposition of X and formulated the algorithm in terms of the pseudo-orthogonal components of X. Of course, instead of autocovariance matrices, other source conditions Ci(·) from table 1.1 can be employed in order to adapt to the separation problem at hand.

We present an application of the spatiotemporal BSS algorithm to fMRI data using multidimensional autocovariances in section 1.6.1.

1.3.3 Independent subspace analysis

Another extension of the simple source separation model lies in extracting groups of sources that are independent of each other but not within the group. Multidimensional independent component analysis or independent subspace analysis (ISA) thus denotes the task of transforming a multivariate observed sensor signal such that groups of the transformed signal components are mutually independent, while dependencies within the groups are still allowed. This weakens the sometimes too strict assumption of independence in ICA, and has potential applications in various fields such as ECG and fMRI analysis or convolutive ICA.

Recently we were able to calculate the indeterminacies of group ICA for known and unknown group structure, which finally enabled us to guarantee successful application of group ICA to BSS problems. Here, we shortly review the identifiability result as well as the resulting algorithm for separating signals into groups of dependent signals. As before, the algorithm is based on joint (block) diagonalization of sets of matrices generated using one or multiple source conditions.

Generalizations of the ICA model that include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA was first introduced by Cardoso (1998) using geometrical motivations. His model, as well as the related but independently proposed factorization of multivariate function classes (Lin, 1998), is quite general; however, no identifiability results were presented, and applicability to an arbitrary random vector was unclear. Later, in the special case of equal group sizes k, in the following denoted as k-ISA, uniqueness results have been extended from the ICA theory (Theis, 2004b). Algorithmic enhancements in this setting have recently been studied by Poczos and Lörincz (2005). Similar to Cardoso (1998), Akaho et al. (1999) also proposed to employ a multidimensional-component maximum likelihood algorithm, however in the slightly different context of multimodal component analysis. Moreover, if the observations contain additional



structure, such as spatial or temporal correlations, this may be used for the multidimensional separation (Ilin, 2006, Vollgraf and Obermayer, 2001).

Hyvärinen and Hoyer (2000) presented a special case of k-ISA by combining it with invariant feature subspace analysis. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA (Hyvärinen et al., 2001a), where dependencies between all components are assumed and modeled along a topographic structure (e.g. a 2-dimensional grid). However, these two approaches are not completely blind anymore. Bach and Jordan (2003b) formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree-dependent component analysis (Bach and Jordan, 2003a). Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, we review two contributions, Theis (2004b) and Theis (2007), where no additional structures were necessary for the separation.

Fixed group structure—k-ISA

A random vector y is called an independent component of the random vector x if there exists an invertible matrix A and a decomposition x = A(y, z) such that y and z are stochastically independent. We note that this is a more general notion of independent components than in ICA, since we do not require them to be one-dimensional.

The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector x into independent components. If x is to be decomposed into one-dimensional components, this coincides with ordinary ICA. Similarly, if the independent components are required to be of the same dimension k, then this is denoted as multidimensional ICA of fixed group size k or simply k-ISA.

As we have seen before, an important structural aspect in the search for decompositions is the knowledge of the number of solutions, i.e. the indeterminacies of the problem. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) again give an ISA of x. This is of course known for 1-ISA, i.e. ICA, see section 1.2.1.

In Theis (2004b), see chapter 3, we were able to extend this result to k-ISA, given some additional restrictions on the model: we denoted A as k-admissible if for each r, s = 1, . . . , n/k the (r, s) sub-k-matrix of A is either invertible or zero. Then the following theorem can be derived from the multivariate Darmois-Skitovitch theorem (section 1.2.1) or using our previously discussed approach via differential equations (Theis, 2005c).

Theorem 1.3.1 (Separability of k-ISA). Let A ∈ Gl(n; R) be k-admissible, and let S be a k-independent n-dimensional random vector having no Gaussian k-dimensional component. If AS is again k-independent, then A is the product of a k-block-scaling and a permutation matrix.

This shows that k-ISA solutions are unique except for trivial transformations if the model has no Gaussians and is admissible; the result can now be turned into a separation algorithm.



ISA with known group structure via joint block diagonalization

In order to solve ISA with fixed block size k, or at least known block structure, we use a generalization of joint diagonalization that searches for block structures instead of diagonality. We are not interested in the order of the blocks, so the block structure is uniquely specified by fixing a partition n = m1 + . . . + mr of n and setting m := (m1, . . . , mr) ∈ Nʳ. An n × n matrix is said to be m-block diagonal if it is of the form
said to be m-block diagonal if it is of the form<br />

⎛<br />

M1<br />

⎜<br />

⎝ .<br />

· · ·<br />

. ..<br />

0<br />

.<br />

⎞<br />

⎟<br />

⎠<br />

0 · · · Mr<br />

with arbitrary mi × mi matrices Mi.

As a generalization of JD to the case of known block structure, the joint m-block diagonalization (m-JBD) problem is defined as the minimization of

f^m(Â) := Σ_{i=1}^{K} ‖Â⊤CiÂ − diag^m(Â⊤CiÂ)‖²_F   (1.10)

with respect to the orthogonal matrix Â, where diag^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.
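As a small illustration of the objects involved, the following NumPy sketch implements the block-diagonal projection diag^m and evaluates the JBD cost f^m of equation (1.10) for a given orthogonal candidate; it is only meant to make the notation concrete, not to optimize the cost.

```python
import numpy as np

def diag_m(M, m):
    """Project M onto m-block diagonal form: keep the diagonal blocks of sizes
    m = (m_1, ..., m_r), set everything else to zero."""
    out = np.zeros_like(M)
    start = 0
    for size in m:
        out[start:start + size, start:start + size] = M[start:start + size, start:start + size]
        start += size
    return out

def jbd_cost(A_hat, Cs, m):
    """JBD cost f^m(A_hat) of eq. (1.10) for condition matrices Cs."""
    total = 0.0
    for C in Cs:
        T = A_hat.T @ C @ A_hat
        total += np.linalg.norm(T - diag_m(T, m), 'fro') ** 2
    return total
```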

A few algorithms to actually perform JBD have been proposed, see Abed-Meraim and Belouchrani (2004), Févotte and Theis (2007a). In the following we simply perform joint diagonalization and then permute the columns of Â to achieve block-diagonality; in experiments this turns out to be an efficient solution to JBD, although other, more sophisticated pivot selection strategies for JBD are of interest (Févotte and Theis, 2007b). The fact that JD induces JBD had been conjectured by Abed-Meraim and Belouchrani (2004), and we were able to give a partial answer with the following theorem:

Theorem 1.3.2 (JBD via JD). Any block-optimal JBD of the Ci's (i.e. a zero of f^m) is a local minimum of the JD cost function f from equation (1.5).

Clearly not every JBD minimizes f, only those such that in each block of size mk, f(Â) restricted to the block is maximal over Â ∈ O(mk); we denote such solutions as block-optimal. The proof is given in Theis (2007), see chapter 8.

In the case of k-ISA, where m = (k, . . . , k), we used this result to propose an explicit algorithm (Theis, 2005a, see chapter 7). Consider the BSS model from equation (1.1). As usual, by preprocessing we may assume whitened observations x, so A is orthogonal. For the density ps of the sources we therefore get ps(s0) = px(As0). Its Hessian transforms like a 2-tensor, which locally at s0 (see section 1.2.1) guarantees

H_{ln ps}(s0) = H_{ln px∘A}(s0) = A⊤ H_{ln px}(As0) A.   (1.11)



Figure 1.8: Applying ICA to a random vector x = As that does not fulfill the ICA model; here s is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are statistics over 100 runs of the Amari error (crosstalking error) between the random original and the reconstructed mixing matrix using the three ICA algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. Interestingly, however, the latter two algorithms do indeed find an ISA up to permutation, which can be explained by theorem 1.3.2.

The sources s(t) are assumed to be k-independent, so ps factorizes into r groups depending on k separate variables each. Thus ln ps is a sum of functions depending on k separate variables, hence H_{ln ps}(s0) is k-block-diagonal. Hessian ISA now simply uses the block-diagonality structure from equation (1.11) and performs JBD of estimates of a set of Hessians H_{ln px}(x^(i)) evaluated at different sampling points x^(i). This corresponds to using the HessianICA source condition from table 1.1. Other source conditions such as contracted quadricovariance matrices (Cardoso and Souloumiac, 1993) can also be used in this extended framework (Theis, 2007).
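To make the block-diagonality argument tangible, the following self-contained NumPy sketch checks the transformation property (1.11) and the k-block structure of the source log-density Hessian numerically; the 2-independent density on R⁴, the orthogonal matrix and the evaluation point are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((4, 4)))      # orthogonal 'mixing' matrix

def ln_ps(s):
    # two independent 2d groups, each with a non-factorizing log-density
    g1 = -np.log(1.0 + s[0] ** 2 + s[1] ** 2 + (s[0] * s[1]) ** 2)
    g2 = -np.cosh(s[2] + s[3]) - 0.5 * (s[2] - s[3]) ** 2
    return g1 + g2

def ln_px(x):                                          # p_x(x) = p_s(A^T x) for orthogonal A
    return ln_ps(A.T @ x)

def num_hessian(f, x0, h=1e-4):
    n = len(x0)
    H, I = np.zeros((n, n)), np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x0 + h * I[i] + h * I[j]) - f(x0 + h * I[i] - h * I[j])
                       - f(x0 - h * I[i] + h * I[j]) + f(x0 - h * I[i] - h * I[j])) / (4 * h * h)
    return H

s0 = rng.standard_normal(4)
Hs = num_hessian(ln_ps, s0)                 # approximately 2-block-diagonal
Hx = num_hessian(ln_px, A @ s0)
print(np.round(Hs, 3))
print(np.allclose(Hs, A.T @ Hx @ A, atol=1e-4))        # transformation property (1.11)
```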

Unknown group structure—general ISA

A serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds according to theorem 1.3.1. However, if k-ISA is applied to an arbitrary random vector, a decomposition into groups that are only 'as independent as possible' cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition only as well as possible; however, care has to be taken: the strong uniqueness result is no longer valid, and the results may depend on the algorithm, as illustrated in figure 1.8.

In contrast to ICA and k-ISA, we do not want to fix the size of the groups Si in advance.



Figure 1.9: Linear factorization models for a random vector x = As and the resulting indeterminacies for (a) ICA, (b) ISA with fixed group size and (c) general ISA, where L denotes a one- or higher-dimensional invertible matrix (scaling) and P a permutation, applied only along the horizontal lines as indicated in the figures. The small horizontal gaps denote statistical independence. One of the key differences between the models is that general ISA may always be applied to any random vector x, whereas ICA and its generalization, fixed-size ISA, yield unique results only if x follows the corresponding model.

Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. The key idea in Theis (2007), see chapter 8, is to allow only irreducible components, defined as random vectors without lower-dimensional independent components.

The advantage of this formulation is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of x are scalings, i.e. invertible transformations within each si, and permutations of those si that have the same dimension. That these are already all indeterminacies is shown by the following theorem:

Theorem 1.3.3 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Here, no Gaussians had to be excluded from S as in the previous uniqueness theorems, because the dimension reduction result from section 1.5.2 has been used. For details we refer to Theis (2007) and Gutch and Theis (2007). The connection between the various factorization models and the corresponding uniqueness results is illustrated in figure 1.9.

Again, we turned this uniqueness result into a separation algorithm, this time by considering the JADE source condition based on fourth-order cumulants. The key idea was to translate irreducibility into maximal block-diagonality of the source condition matrices Ci(s). Algorithmically, JBD was performed by first using JD according to theorem 1.3.2, followed by permutation and block-size identification (Theis, 2007, algorithm 1). So far, we did not implement a sophisticated clustering step but only a straightforward thresholding method for block-size determination. First results using more elaborate clustering techniques are promising.
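A minimal sketch of such a thresholding step is given below; it groups components into blocks via the connected components of a thresholded coupling graph built from the residual off-diagonal entries after JD. The function name, the normalization and the threshold value are ad hoc choices for illustration and do not reproduce algorithm 1 of Theis (2007).

```python
import numpy as np

def identify_blocks(Cs_hat, threshold=0.05):
    """Group components into blocks from the matrices A_hat^T C_i A_hat obtained
    after joint diagonalization (simple thresholding sketch)."""
    n = Cs_hat[0].shape[0]
    coupling = sum(abs(C) for C in Cs_hat)          # accumulated off-diagonal coupling
    np.fill_diagonal(coupling, 0)
    if coupling.max() > 0:
        coupling = coupling / coupling.max()
    linked = coupling > threshold
    # connected components of the thresholded coupling graph = blocks
    blocks, unseen = [], set(range(n))
    while unseen:
        stack, comp = [unseen.pop()], set()
        while stack:
            i = stack.pop()
            comp.add(i)
            for j in list(unseen):
                if linked[i, j]:
                    unseen.remove(j)
                    stack.append(j)
        blocks.append(sorted(comp))
    return blocks    # e.g. [[0, 2], [1]]: permute the columns of A_hat accordingly
```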




Figure 1.10: Independent subspace analysis with known block structure m = (2, 1) is applied to fetal ECG. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). (b) gives the extracted sources using ISA with the Hessian source condition from table 1.1 with 500 Hessian matrices. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) and the fetal source (component 3) onto the mixture space (a) are plotted.

Finally, we report the example from Theis (2005a) on how to apply the Hessian ISA algorithm to a real-world data set. Following Cardoso (1998), we show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). Our goal is to extract an MECG and an FECG component; however, we cannot expect to find a one-dimensional MECG, because what is measured are projections of a three-dimensional (electric) vector field. Hence it makes sense to model the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component). Application of ISA extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the estimated mixing matrix A and sources s(t) as plotted in figure 1.10(b). A decomposition of the observed ECG data x(t) can be achieved by composing the extracted sources using only the relevant mixing columns. For the MECG part, for example, this means applying the projection ΠM := (a1, a2, 0)A⁻¹ to the observations. The results are plotted in figures 1.10(c) and (d). The fetal ECG is most active at sensor 1 (as visual inspection of the observations confirms). When comparing the projection matrices with the results from Cardoso (1998), we find quite high similarity with the ICA-based results, and a modest difference to the projections of the time-based algorithm.
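The back-projection step can be written in two lines of NumPy; the matrices below are random placeholders standing in for the estimated mixing matrix and sources of the ECG example.

```python
import numpy as np

A_hat = np.random.randn(3, 3)            # placeholder for the estimated mixing matrix
S_hat = np.random.randn(3, 500)          # placeholder for the estimated sources s(t)

P_mecg = A_hat[:, [0, 1]] @ np.linalg.inv(A_hat)[[0, 1], :]   # = (a1, a2, 0) A^{-1}
X_mecg = P_mecg @ (A_hat @ S_hat)        # MECG contribution to the observations
X_fecg = A_hat[:, [2]] @ S_hat[[2], :]   # FECG contribution (component 3)
```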



1.4 Sparseness

One of the fundamental questions in signal processing, data mining and neuroscience is how to represent a large data set X, given in the form of an (m × T)-matrix, in different ways. A simple approach is a linear matrix factorization

X = AS,   (1.12)

which is equivalent to model (1.1) after gathering the samples into corresponding data matrices X := (x(1), . . . , x(T)) ∈ R^{m×T} and S := (s(1), . . . , s(T)) ∈ R^{n×T}. We speak of a complete, overcomplete or undercomplete factorization if m = n, m < n or m > n, respectively. The unknown matrices A and S are assumed to have some specific properties, for instance:

(i) the rows si of S are assumed to be samples of a mutually independent random vector, see section 1.2;

(ii) each sample s(t) contains as many zeros as possible—this is the sparse representation or sparse component analysis (SCA) problem;

(iii) the elements of X, A and S are nonnegative, which results in nonnegative matrix factorization (NMF).

There is a large body of work devoted to ICA problems, but mostly for the (under)complete case m ≥ n. We refer to Lee et al. (1999), Theis et al. (2004d), Zibulevsky and Pearlmutter (2001) and references therein for work on overcomplete ICA. Here, we will discuss constraints (ii) and (iii).

1.4.1 Sparse component analysis

We consider the blind matrix factorization problem (1.12) in the more challenging overcomplete case, where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not be sparse in the time domain: it suffices to find a linear transformation (e.g. wavelet packets) in which the sources are sufficiently sparse. Applications of the model include biomedical data analysis, where sparsely active sources are often assumed (McKeown et al., 1998), and audio source separation (Araki et al., 2007).

In Georgiev et al. (2005c), see chapter 10, we introduced a novel measure for sparsity and showed that, based on sparsity alone, we were still able to identify both the mixing matrix and the sources uniquely except for trivial indeterminacies. Here, a vector v ∈ Rⁿ is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal X as in equation (1.12) such that S is k-sparse. In our work, we always assume that the sparsity level equals k = n − m + 1, which means that at any time instant fewer sources than given observations are active.

The following theorem shows that the SCA model is essentially unique if fewer sources than mixtures are active, i.e. if the sources are (n − m + 1)-sparse.


Figure 1.11: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R³. Due to the source sparsity, the mixtures are generated by only two matrix columns ai, aj at a time and are hence contained in a union of hyperplanes span{ai, aj}: (a) three hyperplanes for 1 ≤ i < j ≤ 3 in the 3 × 3 case, (b) the hyperplanes from (a) visualized by intersection with the sphere, and (c) six hyperplanes for 1 ≤ i < j ≤ 4 in the 3 × 4 case. Identification of the hyperplanes gives mixing matrix and sources.

Theorem 1.4.1 (SCA matrix identifiability). Assume that in the SCA model every m × m submatrix of A is invertible and that S is sufficiently rich represented. Then A is uniquely determined by X except for left-multiplication with permutation and scaling matrices.

Here S is said to be sufficiently rich represented if for any index set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of S such that each of them has zero elements in the places indexed by I and each m − 1 of them are linearly independent. The next theorem shows that in this case also the sources can be found uniquely:

Theorem 1.4.2 (SCA source identifiability). Let H be the set of all x ∈ Rᵐ such that the linear system As = x has an (n − m + 1)-sparse solution, i.e. one with at least n − m + 1 zero components. If A fulfills the condition from theorem 1.4.1, then for almost all x ∈ H this system has no other solution with this property.

The proofs were given in Georgiev et al. (2005c), see chapter 10. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be recovered uniquely from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both theorems, as well as of the corresponding algorithms, is illustrated in figure 1.11: by assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determine both mixing matrix and sources.

It is not clear a priori whether any given data matrix X can be factorized into a sparse representation. A necessary and sufficient condition is given in the following theorem from Georgiev et al. (2005c):



Theorem 1.4.3 (SCA conditions). Assume that m ≤ n ≤ T and that the matrix X ∈ R^{m×T} satisfies the following conditions:

(i) the columns of X lie in the union H of \binom{n}{m-1} different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;

(ii) for each i ∈ {1, ..., n} there exist p = \binom{n-1}{m-2} different hyperplanes {Hi,j}, j = 1, . . . , p, in H such that their intersection Li = ∩_{j=1}^{p} Hi,j is a one-dimensional subspace;

(iii) any m different Li span the whole Rᵐ.

Then the matrix X is uniquely representable (up to permutation and scaling) as an SCA satisfying the conditions of theorem 1.4.1.

Algorithms for SCA

In Georgiev et al. (2004, 2005c), we also proposed an algorithm based on random sampling for reconstructing the mixing matrix and the sources; however, it could not easily be applied in noisy settings and high dimensions due to the involved combinatorial searches. Therefore, we derived a novel, robust algorithm for SCA in Theis et al. (2007a), see chapter 11. The key idea is that if the sources are of sufficiently high sparsity, the mixtures cluster along hyperplanes in the mixture space. Based on this condition, the mixing matrix can be reconstructed; furthermore, this property turned out to be robust against noise and outliers.

The proposed algorithm employs a generalization of the Hough transform in order to detect the hyperplanes in the mixture space, see figure 1.12. This leads to an algorithmically robust matrix and source identification. The Hough-based hyperplane estimation does not depend on the source dimension n, only on the mixture dimension m. With respect to applications, this implies that n can be quite large and hyperplanes will still be found, provided the grid resolution used in the Hough transform is sufficiently high. Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n.
results <strong>in</strong> <strong>in</strong>creased accuracy also for higher source dimensions n.<br />

For applications of the proposed SCA algorithms <strong>in</strong> signal process<strong>in</strong>g and biomedical data<br />

analysis, we refer to section 1.6.3 and Georgiev et al. (2006, 2005a,b), Theis et al. (2007a). More<br />

elaborate source reconstruction methods, after know<strong>in</strong>g the mix<strong>in</strong>g matrix A were discussed <strong>in</strong><br />

Theis et al. (2004a).<br />

Postnonlinear generalization

In Theis and Amari (2004), see chapter 12, we considered the generalization of SCA to postnonlinear mixtures, see section 1.2.2. As before, the data x(t) = f(As(t)) is assumed to be linearly mixed followed by a componentwise nonlinearity, see equation (1.3). However, now the (m × n)-matrix A is allowed to be 'wide', i.e. the more complicated overcomplete situation with m < n is treated. By using sparseness of s(t), we were still able to recover the system:



Figure 1.12: Illustration of the 'hyperplane-detecting' Hough transform in three dimensions: a point (x1, x2, x3) in the data space (left) is mapped onto the curve {(ϕ, θ) | θ = arctan((x1 cos ϕ + x2 sin ϕ)/x3) + π/2} in the parameter space [0, π)² (right). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) of the parameter space, and the data points lie on the plane with normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).

Theorem 1.4.4 (Identifiability of postnonlinear SCA). Let S ∈ R^{n×T} be a matrix with (n − m + 1)-sparse columns s(t), and let X ∈ R^{m×T} consist of columns x(t) = f(As(t)) following the postnonlinear mixture model (1.3). Furthermore assume that

(i) S is fully (n − m + 1)-sparse in the sense that asymptotically for T → ∞ its image equals the union of all (m − 1)-dimensional coordinate spaces (in which it is contained by the sparsity assumption),

(ii) A is mixing and not absolutely degenerate,

(iii) every m × m submatrix of A is invertible.

If X = f̂(ÂŜ) is another representation of X satisfying the same conditions, then there exists an invertible scaling L with f = f̂ ∘ L, and invertible scaling and permutation matrices L′, P′ with A = LÂL′P′.

The proof relied on the fact that when s(t) is sparse as formulated in theorem 1.4.4(i), its image includes all (m − 1)-dimensional coordinate subspaces and hence the intersections of (m − 1) such subspaces, which give the n coordinate axes. These are transformed into n curves in the x-space, passing through the origin. By identification of these curves, we showed that each nonlinearity fi is in fact linear. The proof used the following lemma, which generalizes the analytic case presented by Babaie-Zadeh et al. (2002).

Lemma 1.4.5. Let a, b ∈ R \ {−1, 0, 1}, a > 0 and f : [0, ε) → R differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.
f is l<strong>in</strong>ear.


Figure 1.13: Illustration of the proof of theorem 1.4.4 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (top left) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (right). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (bottom left) according to the algorithms from Georgiev et al. (2004). The idea of the proof was that two lines spanned by coordinate vectors (thick lines) are mapped onto two lines spanned by two columns of A. If the composition g ∘ f maps these lines onto some different lines (as sets), then we showed that (given 'general position' of the two lines) the components of g ∘ f satisfy the conditions of lemma 1.4.5 and hence are already linear.

Theorem 1.4.4 shows that f and A are uniquely determined by x(t) except for scaling and permutation ambiguities. Note that then obviously also s(t) is identifiable by applying theorem 1.4.2 to the linearized mixtures y(t) = f⁻¹(x(t)) = As(t), given the additional assumptions on s(t) from the theorem.

Again, we derived an algorithm from this identifiability result. The separation is done in a two-stage procedure: in the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to the one used in the identifiability proof of theorem 1.4.4, see also figure 1.13. In the second stage, the mixing matrix A and then the sources s are reconstructed by applying linear SCA to the linearized mixtures f⁻¹(x(t)). For details we refer to Theis and Amari (2004), see chapter 12.



1.4.2 Sparse non-negative matrix factorization

In Theis et al. (2005c), see chapter 13, we studied the factorization problem (1.12) using condition (iii) of non-negativity. Non-negative matrix factorization (NMF) strictly requires both matrices A and S to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition (Lee and Seung, 1999).

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer (2004) proposed a modification of the NMF model to include sparseness: he minimized the deviation ‖X − AS‖ of (1.12) under the constraint of fixed sparseness of both A and S. Here, using a ratio of the 1- and 2-norms of x ∈ Rⁿ \ {0}, sparseness is measured by σ(x) := (√n − ‖x‖₁/‖x‖₂)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
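A direct NumPy transcription of this measure, with a few example values:

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure sigma(x) = (sqrt(n) - ||x||_1/||x||_2) / (sqrt(n) - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

print(sparseness([0, 0, 0, 5]))      # 1.0  (n - 1 zeros)
print(sparseness([1, 1, 1, 1]))      # 0.0  (all magnitudes equal)
print(sparseness([3, 1, 0.5, 0]))    # somewhere in between
```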

We restricted ourselves to the asymptotic case of perfect factorization, and therefore defined sparse NMF (Hoyer, 2004) as the task of finding the matrices A and S in the decomposition X = AS subject to

A, S ≥ 0,   σ(A∗i) = σA,   σ(Si∗) = σS   (1.13)

Here σA, σS ∈ [0, 1] denote fixed constants describing the sparseness of the columns of A and of the rows of S, respectively.

Uniqueness of sparse NMF

Obvious indeterminacies of the sparse NMF model (1.13) are permutation and positive scaling of the columns of A (and correspondingly of the rows of S). Another, less obvious indeterminacy comes into play due to the sparseness assumption: S is said to be degenerate if each column of S is a multiple of some vector v ∈ Rⁿ. Then the factorization is not unique, because the row-sparseness of S does not depend on v, so transformations that still guarantee non-negativity are possible.

Now assume that two solutions (A, S) and (Ã, S̃) of the sparse NMF model (1.13) are given with A and Ã of full rank; then AS = ÃS̃ and σ(S) = σ(S̃). Let hi = S⊤ᵢ∗ respectively h̃i = S̃⊤ᵢ∗ denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume ‖hi‖₂ = ‖h̃i‖₂ = 1 for all i. Hence the sparseness of the rows is already fully determined by their 1-norms, and ‖hi‖₁ = ‖h̃i‖₁.

The following theorem from Theis et al. (2005c), see chapter 13, shows uniqueness of sparse NMF in some special cases. Note that in more general settings some additional indeterminacies (specific to n > 3) come into play; however, to our present knowledge they are thin, i.e. of measure zero, and hence of no practical importance.

Theorem 1.4.6 (Uniqueness of sparse NMF). Given two solutions (A, S) and (Ã, S̃) of the sNMF model as above, assume that S is non-degenerate and that either Ã = I and A ≥ 0, or n = 2. Then A = ÃP with a permutation matrix P.



Sparse projection

Algorithmically, we followed Hoyer's approach and solved the sparse NMF problem by alternately updating A and S using gradient descent on the residual error ‖X − AS‖². After each update, the columns of A and the rows of S are projected onto

M := {s | ‖s‖₁ = σ} ∩ {s | ‖s‖₂ = 1} ∩ {s ≥ 0}   (1.14)

in order to satisfy the sparseness conditions of (1.13). For this, points x ∈ Rⁿ have to be projected onto adjacent points in M, where p ∈ M is called adjacent to x if ‖x − p‖₂ ≤ ‖x − q‖₂ for all q ∈ M; we denote this by p ⊳ x.

A priori it is not clear whether such a p exists and, moreover, whether it is unique, see figure 1.14. We answered this question by proving the following theorem:

Theorem 1.4.7 (Existence and uniqueness of the Euclidean projection).<br />

M X (M)<br />

(i) If M is closed and nonempty, then for every x ∈ R n there<br />

exists a p ∈ M with p ⊳ x.<br />

(ii) If X (M) := {x ∈ R n |#{p ∈ M|p ⊳ x} > 1} denotes the<br />

exception or non-uniqueness set of M, then vol(X (M)) = 0.<br />


The above is obvious if M is convex. However here, with M from equation (1.14), this is not the case, and the above theorem is needed. We then denote the (almost everywhere unique) projection by πM(x) := p. In addition, in (Theis et al., 2005c), we proved convergence of Hoyer's projection algorithm.
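For concreteness, the following is a minimal sketch of such a projection step in the spirit of Hoyer (2004): it maps a vector to a nonnegative point with prescribed 1- and 2-norms. The loop structure and the handling of negative entries are standard choices for this construction and are not taken verbatim from the thesis or the cited paper; for a desired sparseness σ and unit 2-norm, the 1-norm target is l1 = √n − σ(√n − 1).

```python
import numpy as np

def project_sparse(x, l1, l2=1.0):
    """Euclidean projection of x onto M = {s >= 0, ||s||_1 = l1, ||s||_2 = l2}.

    Minimal sketch in the spirit of Hoyer (2004), for generic inputs.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x + (l1 - x.sum()) / n                    # project onto the hyperplane sum(s) = l1
    zeroed = np.zeros(n, dtype=bool)
    while True:
        m = np.where(zeroed, 0.0, l1 / (n - zeroed.sum()))
        d = s - m
        # alpha >= 0 with ||m + alpha*d||_2 = l2: positive root of a quadratic in alpha
        a, b, c = d @ d, 2 * (m @ d), m @ m - l2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4 * a * c, 0.0))) / (2 * a)
        s = m + alpha * d
        if (s >= 0).all():
            return s
        # clamp negative entries to zero and re-project the rest onto the hyperplane
        zeroed |= s < 0
        s[zeroed] = 0.0
        free = ~zeroed
        s[free] += (l1 - s.sum()) / free.sum()
```

For example, project_sparse(np.random.rand(10), l1=2.0) returns a nonnegative vector with unit 2-norm and 1-norm 2, i.e. sparseness (√10 − 2)/(√10 − 1) ≈ 0.54.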

Figure 1.14: Two exception (non-uniqueness) sets: (a) the exception set of two points, (b) the exception set of a sector.

Iterative projection onto spheres

In Theis and Tanaka (2006), see chapter 14, our goal was to generalize the notion of sparseness. After all, we naturally interpret sparseness of some signal x(t) as x(t) having many zero entries. This can be measured by the 0-pseudo-norm, and it is common to approximate it by p-norms for p → 0. Hence replacing the 1-norm in (1.14) by some p-norm is desirable.

A p-sparse NMF algorithm can then be readily derived. However, we observed that the sparse projection cannot be solved in closed form anymore, and little attention has been paid to finding projections in the case p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖·‖₀. Hence, our goal in (Theis and Tanaka, 2006) was to explore this more general notion of sparseness and to construct an algorithm to project a vector onto its closest vector of a given sparseness. The resulting algorithm is a non-convex extension of the 'projection onto convex sets' (POCS) algorithm (Combettes, 1993, Youla and Webb, 1982).

Let S^{n−1}_p := {x ∈ R^n | ‖x‖p = 1} denote the (n − 1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS^{n−1}_p := {x ∈ R^n | ‖x‖p = c}.


(a) POSH for n = 2, p = 0.5   (b) POSH for n = 3, p = 1

Figure 1.15: Starting from x0 (◦), we alternately project onto cSp and S2. POSH performance is illustrated for p = 0.5 in dimension n = 2 (a), and for p = 1 and n = 3 (b), where a projection via PCA is displayed; no information is lost, hence the sequence of points lies in a plane.

We were looking for the Euclidean projection y = πM(x) onto M := S^{n−1}_2 ∩ cS^{n−1}_p. Note that due to theorem 1.4.7, this p-sparse projection exists if M ≠ ∅ and is almost always unique. The algorithmic construction of this projection now is a direct generalization of POCS: we alternately project first onto S^{n−1}_2 and then onto the scaled sphere cS^{n−1}_p, using the Euclidean projection operator from above. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denoted this projection algorithm by projection onto spheres (POSH). Interestingly, the algorithm still converges, which we could prove for p = 1:

Theorem 1.4.8 (Convergence of POSH). Let n ≥ 2 and x ∈ R^n \ X(M). If y^1 := π_{S^{n−1}_2}(x) and iteratively y^i := π_{S^{n−1}_2}(π_{cS^{n−1}_1}(y^{i−1})), then y^i converges to πM(x).

In figure 1.15, we show the application of POSH for p ∈ {0.5, 1}; we visualize the performance in 3 dimensions by projecting the data via PCA, which incidentally throws away virtually no information (confirmed by experiment), indicating the validity of theorem 1.4.8 also in higher dimensions.
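The following minimal sketch illustrates the POSH iteration for p = 1, the case covered by the theorem. The closed-form Euclidean projection onto the 1-sphere used here is a standard construction (soft thresholding when the 1-norm must shrink, a uniform outward shift otherwise) and assumes a generic input without zero entries; it is an illustration, not the implementation from the cited paper.

```python
import numpy as np

def project_l2_sphere(x):
    """Euclidean projection onto the unit 2-sphere."""
    return x / np.linalg.norm(x)

def project_l1_sphere(x, c):
    """Euclidean projection onto {s : ||s||_1 = c} for generic x without zero entries."""
    a = np.abs(x)
    if a.sum() <= c:
        # move every coordinate outward by the same amount to reach 1-norm c
        return np.sign(x) * (a + (c - a.sum()) / x.size)
    # otherwise soft-threshold: find t with sum(max(a - t, 0)) = c
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u - (css - c) / (np.arange(x.size) + 1) > 0)[0][-1]
    t = (css[k] - c) / (k + 1)
    return np.sign(x) * np.maximum(a - t, 0.0)

def posh(x, c, iters=100):
    """Alternate projections onto the unit 2-sphere and the 1-sphere of radius c."""
    y = project_l2_sphere(x)
    for _ in range(iters):
        y = project_l2_sphere(project_l1_sphere(y, c))
    return y
```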

We want to finish this section by remarking that the strict framework of sparse NMF (1.13) is somewhat problematic, since the sparseness values σA and σS are parameters of the algorithm and hence difficult to choose. In Stadlthanner et al. (2005b), we proposed an alternative factorization to (1.13), where we maximize the sparseness values of A and S in addition to minimizing the distance to X. However, the optimization gets more intricate, and we have studied enhanced search methods via genetic algorithms in Stadlthanner et al. (2005a).



1.5 Machine learning for data preprocessing

Machine learning denotes the task of computationally finding structures in data. Here we describe some preprocessing techniques that rely on machine learning for denoising, dimension reduction and data grouping (clustering).

1.5.1 Denoising

In many fields of signal processing the examined signals bear considerable noise, which is usually assumed to be additive and decorrelated. For example, in exploratory data analysis of medical data using statistical methods like ICA, the prevalent noise greatly degrades the reliability of the algorithms, and the underlying processes cannot be identified.

In Gruber et al. (2006), see chapter 15, we considered the situation where a one-dimensional signal s(t) ∈ R given at discrete timesteps t = 1, . . . , T is distorted as follows:

    sN(t) = s(t) + N(t),   (1.15)

where N(t) are i.i.d. samples of a Gaussian random variable, i.e. sN equals s up to additive stationary white noise. Many denoising algorithms have been proposed for recovering s(t) from its noisy observation sN(t), see e.g. Effern et al. (2000), Hyvärinen et al. (2001b), Ma et al. (2000) to name but a few. Vetter et al. (2002) suggested an algorithm based on local linear projective noise reduction. The idea was to observe the data in a high-dimensional space of delayed coordinates

    s̃N(t) := (sN(t), . . . , sN(t + m − 1)) ∈ R^m

and to denoise the data locally through a projection onto the lower-dimensional subspace of the deterministic signals.

We followed this approach and localized the problem by selecting k clusters of the delayed time series {s̃N(t) | t = 1, . . . , n}. This can for example be done by a k-means clustering algorithm, see section 1.5.3, which is appropriate for noise selection schemes based on the strength or the kurtosis of the signal, since these statistical properties do not depend on the signal structure.

After the local linearization, we may assume that the time-embedded signal can be linearly decomposed into noise and a lower-dimensional signal subspace. We therefore analyzed these k m-dimensional signals using PCA or ICA in order to determine the 'meaningful' components. The unknown number of signal components in the high-dimensional noisy signal was determined either by using Vetter's MDL estimator or by a 2-means clustering of the eigenvectors of the covariance matrix (Liavas and Regalia, 2001). The latter gave a good estimate of the number of signal components if the noise variances are not clustered well enough together but are nevertheless separated from the signal by a large gap.

To reconstruct the noise-reduced signal, we unclustered the data to get a signal s̃e : {1, . . . , n} → R^m and then averaged over the candidates in the delayed data:

    se(t) := (1/m) Σ_{i=0}^{m−1} [s̃e(t − i)]_i .   (1.16)

This idea is illustrated in figure 1.16.
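A compact sketch of this pipeline (delay embedding, clustering, local projection, un-embedding by averaging) is given below. It uses PCA with a fixed signal-subspace dimension and a fixed number of clusters, whereas the thesis estimates these quantities via MDL or eigenvalue clustering and may use ICA for the local projection; function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def local_projective_denoise(x, m=20, k=8, q=2):
    """Sketch of delay-embedding denoising.

    x: 1d noisy signal; m: embedding dimension; k: number of clusters;
    q: assumed signal-subspace dimension (fixed here for simplicity).
    """
    T = len(x)
    n = T - m + 1
    # delay embedding: rows are (x(t), ..., x(t+m-1))
    X = np.stack([x[t:t + m] for t in range(n)])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    Xd = np.empty_like(X)
    for c in range(k):
        idx = labels == c
        pca = PCA(n_components=q).fit(X[idx])
        # project each cluster onto its local low-dimensional signal subspace
        Xd[idx] = pca.inverse_transform(pca.transform(X[idx]))
    # undo the embedding by averaging all reconstructions of each sample
    y = np.zeros(T)
    cnt = np.zeros(T)
    for t in range(n):
        y[t:t + m] += Xd[t]
        cnt[t:t + m] += 1
    return y / cnt
```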



(a) embedded time series   (b) locally linear approximation   (c) local projection

Figure 1.16: Denoising by local projective subspace projections. The time series is embedded in time-delayed coordinates (a), where the signal subspace is indicated by a solid line. Clustering in the feature space allows for a locally linear approximation (b). A local projection onto the signal subspace by ICA is performed in (c). Delay and signal subspace dimensions and the number of clusters are estimated using an MDL criterion.

Moreover, in Gruber et al. (2006), we compared the above algorithm with a denoising method based on generalized eigenvalue decomposition called delayed AMUSE (Tomé et al., 2005), and with established kernel PCA denoising (Mika et al., 1999, Schölkopf et al., 1998), where solving the inverse problem for recovering the data turned out to be non-trivial. Finally, we showed applications to water-artefact removal of proton NMR spectra, which are an indispensable contribution to this structure determination process but are hampered by the presence of the very intense water proton signal (Stadlthanner et al., 2003, 2006b).

1.5.2 Dimension reduction

An important open problem in signal processing is the task of efficient dimension reduction, that is, the search for meaningful signals within a higher-dimensional data set. Classical techniques such as principal component analysis hereby define 'meaningful' using second-order statistics (maximal variance), which may often be inadequate for signal detection, e.g. in the presence of strong noise. This contrasts with higher-order models including projection pursuit (Friedman and Tukey, 1975, Hyvärinen and Oja, 1997, Kruskal, 1969) or non-Gaussian component analysis, for short NGCA (Blanchard et al., 2006, Kawanabe, 2005, Kawanabe and Theis, 2007). While the former classically extracts a single non-Gaussian independent component from the data set, the latter tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made.

The goal of linear dimension reduction can be defined as the search for a projection W ∈ Mat(n × d) of a d-dimensional data set X, here modeled by a random vector, with n < d and WX still bearing as much information of X as possible. Of course the latter condition has



to be specified in detail in terms of some distance, index or source model, and many different such indices have been studied in the setting of projection pursuit (Friedman, 1987, Huber, 1985, Hyvärinen, 1999), among others. This problem describes a special case of the larger field of model selection (Friedman and Tukey, 1975), an important tool for preprocessing and dimension reduction, used in a wide range of applications.

In Theis and Kawanabe (2006), see chapter 16, we studied non-Gaussian component analysis as proposed by Blanchard et al. (2006). The idea was to follow the classical projection pursuit idea and to choose non-Gaussianity as the measure of information content of the projection. The remainder X − W⊤WX after the projection was required to be Gaussian and independent of WX. For the theoretical analysis, we did not need further restrictions, for instance by specifying an estimator, which would of course be necessary for algorithmic purposes. In that respect, we provided a uniqueness result for projection pursuit in general. Our goal was to describe necessary and sufficient conditions for such projections to exist and to be unique.

Consider for example the three-dimensional random vector X = (X1, X2, X3) with X1 non-Gaussian, say uniform, but X2 and X3 Gaussian, such that X is mutually independent. Then if we were looking for two-dimensional projections of X, we would obviously find multiple, different projections such as

    ( 1 0 0 )                ( 1 0 0 )
    ( 0 1 0 ),   but also    ( 0 0 1 ).

In both cases the remainder of the projection (X3 or X2) is Gaussian and independent of the projected vectors, but the projection still contains a Gaussian component. If we are to look for one-dimensional projections of X instead, only the projection onto the first coordinate yields a Gaussian remainder, as desired. So in this example, uniqueness follows if the Gaussian subspace is of maximal dimension, or correspondingly the non-Gaussian subspace of minimal dimension. And precisely this is the sufficient and necessary condition for uniqueness.

Uniqueness of non-Gaussian component analysis

A factorization X = AS as in (1.1) with A ∈ Gl(d), random vector S = (SN, SG) and SN ∈ L2(Ω, R^n) is called an n-dimensional non-Gaussian component analysis of X if SN and SG are stochastically independent and SG is Gaussian. In the corresponding decomposition A = (AN, AG), the n-dimensional subvectorspace spanned by the columns of AN is called the non-Gaussian subspace, and the subspace spanned by AG the Gaussian subspace of the decomposition.

The basic idea of NGCA versus simple principal component analysis (PCA) is illustrated in figure 1.17. Dimension reduction essentially deals with the question of removing a noise subspace. Classically, a signal is differentiated from noise by having a higher variance, and algorithms such as PCA in the linear case remove the low-variance components, see figure 1.17(a). Second-order techniques however fail to capture signals that are deteriorated by noise of similar or stronger power, so higher-order statistics are necessary to remove the noise, see figure 1.17(b).

The following theorem connects uniqueness of the dimension reduction model with minimality, and gives a simple characterization for it.



(a) PCA dimension reduction model   (b) non-Gaussian subspace analysis model with directional histograms illustrating (non-)Gaussianity

Figure 1.17: Illustration of dimension reduction by NGCA versus PCA. The signal subspace in (a) is given by a linear direction of high variance, whereas in (b), noise is defined by Gaussianity. The signal subspace is correspondingly given by directions of non-Gaussianity, which allows for the extraction of low-power signals.

Theorem 1.5.1 (Uniqueness of NGCA). Let n < d. Given an n-dimensional NGCA ANSN + AGSG of the random vector X ∈ L2(Ω, R^d), the following is equivalent:

(i) The decomposition is minimal, i.e. n is minimal.

(ii) There exists no basis M ∈ Gl(n) such that (MSN)(1) is Gaussian and independent of (MSN)(2 : n).

(iii) The subspaces of the decomposition are unique, i.e. another n-decomposition has the same non-Gaussian and Gaussian subspaces.

Condition (ii) means that there exists no Gaussian independent component in the non-Gaussian part of the decomposition. The theorem proves that this is equivalent to the decomposition being minimal. Note that in (ii), it is not enough to require only that there exists no Gaussian component, i.e. a v ∈ R^n such that v⊤SN is Gaussian. A simple counterexample is given by a two-dimensional random vector S with density c exp(−s₁² − (s₁² + s₂)²) with c being a normalizing constant. Then indeed S(1) = S1 is Gaussian because ∫_R c exp(−s₁² − (s₁² + s₂)²) ds₂ = c′ exp(−s₁²), but clearly no m ∈ R² can be chosen such that S(1) and m⊤S are independent. And indeed, this dependent Gaussian S(1) within S should not be removed by dimension reduction, as it may contain interesting information, not being independent of the other components.

The proof of the theorem was sketched in Theis and Kawanabe (2006), see chapter 16, where we also performed some simulations to validate the uniqueness result. A practical algorithm



for NGCA essentially using the idea of separated characteristic functions from the proof was proposed in (Kawanabe and Theis, 2007).

Finally, in (Theis and Kawanabe, 2007), we presented a modification of NGCA that evaluates the time structure of the multivariate observations instead of their higher-order statistics. We differentiated the signal subspace from noise by searching for a subspace of non-trivially autocorrelated data. In contrast to blind source separation approaches, however, we did not require the existence of sources, so the model is applicable to any wide-sense stationary time series without restrictions. Moreover, since the method is based on second-order time structure, it could be efficiently implemented even for large dimensions, which we illustrated with an application to dimension reduction of functional MRI recordings.
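To give a rough idea of a second-order, time-structure-based subspace search (a simplified sketch only, not the algorithm of Theis and Kawanabe (2007)): whiten the data, form a symmetrized lagged covariance, and keep the directions whose autocorrelations differ most from zero. The single lag, the fixed subspace dimension and the full-rank assumption are simplifications made for the illustration.

```python
import numpy as np

def autocorrelation_subspace(X, tau=1, n_signal=2):
    """Sketch: estimate a subspace of non-trivially autocorrelated directions.

    X: array (d, T) of observations, assumed to have a full-rank covariance.
    Returns a projection matrix P of shape (n_signal, d).
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    # symmetric whitening via the eigen-decomposition of the covariance
    C0 = Xc @ Xc.T / Xc.shape[1]
    d, E = np.linalg.eigh(C0)
    W = E @ np.diag(d ** -0.5) @ E.T
    Z = W @ Xc
    # symmetrized lag-tau covariance of the whitened data
    Ct = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)
    Ct = (Ct + Ct.T) / 2
    lam, V = np.linalg.eigh(Ct)
    order = np.argsort(-np.abs(lam))           # large |autocorrelation| = signal
    P = V[:, order[:n_signal]].T @ W           # projection onto the signal subspace
    return P
```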

1.5.3 Clustering

Clustering methods are an important tool in high-dimensional explorative data mining. They aim at identifying samples or regions of similar characteristics, and often code them by a single codebook vector or centroid. In this section, we review clustering algorithms and employ these methods to solve the blind matrix factorization problem (1.12) from above under various source assumptions.

Clustering for solving overcomplete BSS problems

In Theis et al. (2006), see chapter 17, we discussed the blind source separation problem (1.1) in the difficult case of overcomplete BSS, where fewer mixtures than sources are observed (m < n). We focused on the usually more elaborate matrix-recovery part. Assuming statistically independent sources with existing variance and at most one Gaussian component, it is well known that A is determined uniquely by the mixtures x(t) (Eriksson and Koivunen, 2003). However, how to do this algorithmically is far from obvious, and although some algorithms have been proposed recently (Bofill and Zibulevsky, 2001, Lee et al., 1999, O'Grady and Pearlmutter, 2004), performance is yet limited.

The most commonly used overcomplete algorithms rely on sparse sources (after possible sparsification by preprocessing), which can be identified by clustering, usually by k-means or some extension (Bofill and Zibulevsky, 2001, O'Grady and Pearlmutter, 2004). However, apart from the fact that theoretical justifications have not been found, mean-based clustering only identifies the correct A if the data density approaches a delta distribution. In figure 1.18, we illustrate the deficiency of mean-based clustering; we get an error of up to 5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover, the figure indicates that median-based clustering performs much better. Indeed, mean-based clustering does not possess any equivariance property (i.e. performance independent of the choice of A).

We proposed a novel overcomplete, median-based clustering method in (Theis et al., 2006), and proved its equivariance and convergence. Simply put, we first pick 2n normalized starting vectors w1, w′1, . . . , wn, w′n, and iterate the following steps until an appropriate abort condition has been met:
Choose a sample x(t) ∈ R^m and normalize it, y(t) := π(x(t)) = x(t)/|x(t)|. Let i(t) ∈ [1 : n] be such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to x(t). Then set w_{i(t)}(t + 1) := π(w_{i(t)}(t) + η(t) π(σ y(t) − w_{i(t)}(t))) and w′_{i(t)}(t + 1) := −w_{i(t)}(t + 1), where σ := 1 if w_{i(t)}(t) is closest to y(t), and σ := −1 otherwise.

(a) circle histogram for α = 0.4   (b) comparison of mean and median

Figure 1.18: Mean- versus median-based clustering. We consider the mixture x(t) of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2 − α), respectively. (a) shows the mixture density for α = 0.4 after projection onto the circle. For α ∈ [0, π/4), (b) compares the error when estimating A by the mean and the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.

We showed that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median. For this we had to study the end points of geometric matrix-recovery, so we assumed that the algorithm had already converged. The idea then was to formulate a condition which the end points have to satisfy and to show that the solutions are among them.

The mixing angles γ1, . . . , γn ∈ [0, π) are said to satisfy the overcomplete geometric convergence condition (GCC) if they are the medians of y restricted to their receptive fields, i.e. if γi is the median of p_{y|F(γi)}. Moreover, a constant random vector ω̂ ∈ R^n is called a fixed point if E(ζ(y_e − ω̂)) = 0. We showed that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1.18. If we set ξ(ω) := (cos ω, sin ω)⊤, then θ(ξ(ω)) = ω and the following holds:

Theorem 1.5.2 (Convergence of overcomplete median-based clustering). The set Φ of fixed points of geometric matrix-recovery contains an element (ω̂1, . . . , ω̂n) such that the resulting matrix (ξ(ω̂1) . . . ξ(ω̂n)) solves the overcomplete BSS problem.

The stable fixed points in the above set Φ can be found by the geometric matrix-recovery algorithm.
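A minimal sketch of the online update rule described above is given below; the initialization, learning-rate schedule and stopping rule are simplified assumptions and do not reproduce the exact setup of Theis et al. (2006).

```python
import numpy as np

def geometric_matrix_recovery(X, n, eta0=0.1, sweeps=10, rng=None):
    """Sketch of online, median-like geometric matrix recovery.

    X: (m, T) mixtures; n: number of sources. Returns an (m, n) matrix whose
    columns estimate the mixing directions up to sign and permutation.
    """
    rng = np.random.default_rng(rng)
    m, T = X.shape
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # w_1, ..., w_n; w'_i := -w_i
    step = 0
    for _ in range(sweeps):
        for t in rng.permutation(T):
            x = X[:, t]
            if np.linalg.norm(x) == 0:
                continue
            y = x / np.linalg.norm(x)
            scores = W @ y
            i = np.argmax(np.abs(scores))           # closest neuron among w_i and -w_i
            sigma = 1.0 if scores[i] >= 0 else -1.0
            eta = eta0 / (1 + 0.01 * step)          # simple decaying learning rate
            step += 1
            direction = sigma * y - W[i]
            w = W[i] + eta * direction / np.linalg.norm(direction)
            W[i] = w / np.linalg.norm(w)            # re-normalize (the outer projection)
    return W.T
```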



(a) setup   (b) division   (c) update   (d) after one iteration

Figure 1.19: Illustration of the batch k-means algorithm.

One of the most commonly used partitional clustering techniques is the k-means algorithm, which in its batch form partitions the data set into k disjoint clusters by simply iterating between cluster assignments and cluster updates (Bishop, 1995). In general, its goal can be described as follows: Given a set A of points in some metric space (M, d), find a partition of A into disjoint nonempty subsets Bi, ⋃i Bi = A, together with centroids ci ∈ M, so as to minimize the sum of the squares of the distances of each point of A to the centroid ci of the cluster Bi containing it.

A common approach to minimizing such energy functions is partial optimization with respect to the division matrix and the centroids. The batch k-means algorithm employs precisely this strategy, see figure 1.19. After an initial, random choice of centroids c1, . . . , ck, it iterates between the following two steps until convergence, measured by a suitable stopping criterion:

• cluster assignment: for each sample x(t) determine an index i(t) = argmin_i d(x(t), ci)

• cluster update: within each cluster Bi := {x(t) | i(t) = i} determine the centroid ci by

    ci := argmin_c Σ_{a ∈ Bi} d(a, c)²   (1.17)

A sketch of this batch iteration in the Euclidean case follows below.
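The following minimal sketch spells out the batch iteration for the Euclidean case (M = R^d with the Euclidean metric), where the centroid update (1.17) is simply the cluster mean; the initialization and iteration count are simple illustrative choices.

```python
import numpy as np

def batch_kmeans(X, k, iters=50, rng=None):
    """Minimal batch k-means for samples X of shape (T, d)."""
    rng = np.random.default_rng(rng)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # cluster assignment: nearest centroid for every sample
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # cluster update: Euclidean centroid = mean of the assigned samples
        newC = np.array([X[labels == i].mean(axis=0) if (labels == i).any() else C[i]
                         for i in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels
```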

Solving (1.17) is straightforward in the Euclidean case, however nontrivial in other metric spaces. In Gruber and Theis (2006), see chapter 18, we generalized the concept of k-means by applying it not to the standard Euclidean space but to the manifold of subvectorspaces of R^n of a fixed dimension p, also known as the Grassmann manifold Gn,p. Important examples include projective space, i.e. the manifold of lines, and the space of all hyperplanes.

We represented an element of Gn,p by p orthonormal vectors (v1, . . . , vp). Concatenating these into an (n × p)-matrix V, this matrix is unique except for right multiplication by an orthogonal matrix. We therefore wrote [V] ∈ Gn,p for the subspace. This allowed us to define a distance d([V], [W]) := 2^{−1/2} ‖VV⊤ − WW⊤‖_F on the Grassmannian, known as the projection F-norm.
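As a small illustration of this representation (not code from the cited paper), the projection F-norm can be computed directly from orthonormal bases; it is invariant under right multiplication by an orthogonal matrix and hence well defined on Gn,p.

```python
import numpy as np

def grassmann_distance(V, W):
    """Projection F-norm distance between the subspaces [V] and [W].

    V, W: (n, p) matrices with orthonormal columns representing points of G_{n,p}.
    """
    return np.linalg.norm(V @ V.T - W @ W.T, 'fro') / np.sqrt(2)

# the distance only depends on the subspaces, not on the chosen bases
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((5, 2)))
W, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # orthogonal basis change
assert np.isclose(grassmann_distance(V, W), grassmann_distance(V @ Q, W))
```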



Other metrics may be defined on Gn,p, and they result in different Riemannian geometries on the manifold. Optimization on non-Euclidean geometries is non-trivial and has been studied for a long time, see for example Edelman et al. (1999) and references therein. For instance, in the context of ICA, Amari's seminal paper (Amari, 1998) on taking into account the geometry of the search space Gl(n) yielded a considerable increase in performance and accuracy. Learning in these matrix manifolds has been reviewed in Theis (2005b) and extended in Squartini and Theis (2006).

In order to apply batch k-means to (Gn,p, d), we only had to solve the cluster update equation (1.17). It turned out that for this, no elaborate optimization was necessary; instead, a closed-form solution that only needs an eigenvalue decomposition could be found. We state this with the following theorem, proved in Gruber and Theis (2006):

Theorem 1.5.3 (Grassmann centroids). The centroid [C] ∈ Gn,p of a set of points [V1], . . . , [Vl] ∈ Gn,p according to (1.17) is spanned by p independent eigenvectors corresponding to the smallest eigenvalues of the generalized cluster correlation l^{−1} Σ_{i=1}^{l} ViVi⊤.

Application to nonnegative matrix factorization

Detecting clusters in multiple samples drawn from a Grassmannian is a problem arising in various applications. In Gruber and Theis (2006), we applied this to NMF in order to illustrate the feasibility of the proposed algorithm.

Consider the matrix factorization problem (1.12) with the additional non-negativity constraints S, A ≥ 0. If we assume that S spans the whole first quadrant, then the data X = AS fill a conic hull with cone edges spanned by the columns of A. After projection onto the standard simplex, the conic hull reduces to the convex hull, and the projected, known mixture data set X lies within a convex polytope of the order given by the number of rows of S. Hence we face the problem of identifying n edges of a sampled polytope in R^{m−1}.

In two dimensions (after reduction of m = 3), this implies the task of finding the k edges of a polytope where only samples in the inside are known. We used the Quickhull algorithm (Barber et al., 1993) to construct the convex hull, thus identifying the possible edges of the polytope. However, due to the finite number of samples, the identified polytope has far too many edges. Therefore, we applied affine Grassmann n-means clustering (with samples weighted according to their volume) to these edges in order to identify the n bounding edges, see the example in figure 1.20.
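A small sketch of the geometric preprocessing for the m = 3 case is shown below: project the nonnegative mixtures onto the standard simplex and extract the (over-fine) convex hull with Quickhull via scipy. The subsequent affine Grassmann clustering of the hull edges, which is the step that actually identifies the n bounding edges, is not reproduced here; the function name and the simplex parametrization are illustrative choices.

```python
import numpy as np
from scipy.spatial import ConvexHull

def polytope_vertices_from_mixtures(X):
    """Sketch for m = 3 nonnegative mixtures X = AS of shape (3, T).

    Returns the ordered vertices of the convex hull of the simplex-projected data;
    clustering the edges of this hull would then estimate A up to column scaling.
    """
    P = (X / X.sum(axis=0))[:2].T      # simplex coordinates: first two of x / sum(x)
    hull = ConvexHull(P)               # Quickhull
    return P[hull.vertices]
```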

Biomedical applications of other matrix factorization methods are discussed in the next chapter. We only briefly want to mention Meyer-Bäse et al. (2005), where we applied NMF and related unsupervised clustering techniques for the self-organized segmentation of biomedical image time-series data, describing groups of pixels exhibiting similar properties of local signal dynamics.



(a) samples   (b) QHull   (c) result of Grassmann clustering

Figure 1.20: Grassmann clustering (hyperplanes, so p = n − 1) identifies the contour of samples (a) following the NMF model. Quickhull was used to find the outer edges, which were then clustered into 4 clusters (b). The dashed lines in (c) show the convex hull spanned by the mixing matrix columns.



1.6 Applications to biomedical data analysis

The above models are known to have many applications in data mining. Here we focus on biomedical data analysis. For this, we review some recent applications to functional MRI, microscopy of labeled brain sections and surface electromyograms.

1.6.1 Functional MRI

Functional magnetic resonance imaging (fMRI) has been shown to be an effective imaging technique in human brain research (Ogawa et al., 1990). Through the blood oxygen level dependent (BOLD) contrast, local changes in the magnetic field are coupled to activity in brain areas. These magnetic changes are measured using MRI. The high spatial and temporal resolution of fMRI combined with its non-invasive nature makes it an important tool for discovering functional areas in the human brain and their interactions. However, its low signal-to-noise ratio and the high number of activities in the passive brain require sophisticated analysis methods. These are either (i) based on models and regression, but require prior knowledge of the time course of the activations, or (ii) employ model-free approaches such as BSS, separating the recorded activation into different classes according to statistical specifications without prior knowledge of the activation.

The blind approach (ii) was first studied by McKeown et al. (1998): According to the principle of functional organization of the brain, they suggested that the multifocal brain areas activated by performance of a visual task should be unrelated to the brain areas whose signals are affected by artifacts of physiological nature, head movements, or scanner noise related to fMRI experiments. Every single process can be described by one or more spatially independent components, each associated with a single time course of a voxel and a component map. It is assumed that the component maps, each described by a spatial distribution of fixed values, represent overlapping, multifocal brain areas of statistically independent fMRI signals. This aspect is visualized in figure 1.21.

In addition, they considered that the distributions of the component maps are spatially independent and in this sense uniquely specified, see section 1.2.1. McKeown et al. (1998) showed that these maps are independent if the active voxels in the maps are sparse and mostly nonoverlapping. Additionally, they assumed that the observed fMRI signals are the superposition of the individual component processes at each voxel. Based on these assumptions, ICA can be applied to fMRI time-series to spatially localize and temporally characterize the sources of BOLD activation, and considerable research has been devoted to this area since then.

Model-based versus model-free analysis

However, the use of blind signal processing techniques for the effective analysis of fMRI data has often been questioned, and in many applications, neurologists and psychologists prefer to use the computationally simpler regression models.

In Keck et al. (2004), see chapter 19, we compared the two approaches on a sufficiently complex task of combined word perception and motor activity. The event-based experiment was part of a study to investigate the network of neurons involved in the perception of speech and the decoding of auditory speech stimuli.



Figure 1.21: Visualization of the spatial fMRI separation model. The n-dimensional source vector is represented as component maps, which in turn are interpreted to contribute linearly in different concentrations to the fMRI observations at the various time points t ∈ {1, . . . , m}.

One- and two-syllable words were divided into several frequency bands and then rearranged randomly to obtain a set of auditory stimuli. Only a single band was perceivable as words. During the functional imaging session these stimuli were presented pseudo-randomized to 5 subjects, according to the rules of a stochastic event-related paradigm. The task of the subjects was to press a button as soon as they were sure that they had just recognized a word in the sound presented. It was expected that in case of the single perceptive frequency band, these four types of stimuli activate different areas of the auditory system as well as the superior temporal sulcus in the left hemisphere (Specht and Reul, 2003).

The regression-based analysis using a general linear model was performed using SPM2. This was compared with components extracted using ICA, namely fastICA (Hyvärinen and Oja, 1997). The results are illustrated in figure 1.22, see Keck et al. (2004). Indeed, one independent component represented a network of three simultaneously active areas in the inferior frontal gyrus, which was previously proposed to be a center for the perception of speech (Specht and Reul, 2003). Altogether, we were able to show that ICA detects hidden or suspected links and activity in the brain that cannot be found using the classical, model-based approach.

(a) general linear model analysis   (b) one independent component

Figure 1.22: Comparison of model-based and model-free analysis of a word-perception fMRI experiment. (a) illustrates the result of a regression-based analysis, which shows activity mostly in the auditory cortex. (b) is a single component extracted by ICA, which corresponds to a word-detection network.

Spatial and spatiotemporal separation

As a short example of spatial and spatiotemporal BSS, we present the analysis of an experiment using visual stimuli. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans each were acquired with 5 periods of rest and 5 photic stimulation periods,

with a resolution of 3 × 3 × 4 mm. A single 2d-slice is analyzed, which is oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point and a dark background.

At first, we show an example result using spatial ICA. We performed a dimension reduction using PCA to n = 6 dimensions, which still contained 99.77% of the eigenvalues. Then, we applied HessianICA with K = 100 Hessians evaluated at randomly chosen samples, see section 1.2.1 and Theis (2004a). The resulting 6-dimensional sources are interpreted as the 6 component maps that encode the data set. The columns of the mixing matrix contain the relative contribution of each component map to the mixtures at the given time point, so they represent the components' time courses. The maps together with the corresponding time courses are shown in figure 1.23. A single highly task-related component (#4) is found, which after a shift of 4 s has a high crosscorrelation with the block-based stimulus (cc = 0.89). Other component maps encode artifacts, e.g. in the interstitial brain region, and other background activity.
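A schematic version of such a spatial ICA analysis is sketched below, with FastICA used as a stand-in for HessianICA and the slice sequence assumed to be available as a (time, x, y) array; it illustrates the spatial decomposition model only and is not the analysis code used in the thesis. Dimension reduction to n components happens inside FastICA's whitening step.

```python
import numpy as np
from sklearn.decomposition import FastICA

def spatial_ica_fmri(data, n_components=6):
    """Sketch of spatial ICA of an fMRI slice sequence.

    data: array (T, nx, ny), one slice per time point.
    Returns component maps (n_components, nx, ny) and time courses (T, n_components).
    """
    T, nx, ny = data.shape
    X = data.reshape(T, nx * ny)              # rows = time points, columns = voxels
    X = X - X.mean(axis=0)
    ica = FastICA(n_components=n_components, random_state=0)
    # spatial ICA: the voxels are treated as the samples, time points as features
    maps = ica.fit_transform(X.T).T           # (n_components, voxels)
    time_courses = ica.mixing_                # (T, n_components) relative contributions
    return maps.reshape(n_components, nx, ny), time_courses
```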

We then tested the usefulness of taking into account additional information contained in the data set, such as the spatiotemporal dependencies. For this, we analyzed the data using spatiotemporal BSS as described in section 1.3.2 and Theis et al. (2005b), Theis et al. (2007b), see chapter 9. In order to make things more challenging, only 4 components were to be extracted from the data, with preprocessing either by PCA only or by the slightly more general singular value decomposition, a necessary preprocessing for spatiotemporal BSS. We based the algorithms on joint diagonalization, for which K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation, weighted equally (α = 0.5). Although the data were reduced to only 4 components, stSOBI was able to extract the stimulus component very well, with an equally high crosscorrelation of cc = 0.89. We compared this result with some established algorithms for blind fMRI analysis by discussing the single component that is maximally autocorrelated with the known stimulus task, see figure 1.24.



(a) recovered component maps   (b) time courses, with stimulus crosscorrelations cc = −0.14, −0.13, −0.22, 0.89, 0.12, 0.09 for components 1 to 6

Figure 1.23: Extracted ICA components of fMRI recordings. (a) shows the spatial and (b) the corresponding temporal activation patterns, where in (b) the grey bars indicate stimulus activity. Component 4 contains the (independent) visual task, active in the visual cortex (white points in (a)). It correlates well with the stimulus activity, see (b), component 4.

The absolute corresponding autocorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA). The observation that neither Stone's spatiotemporal ICA algorithm (Stone et al., 2002) nor the popular fastICA algorithm (Hyvärinen and Oja, 1997) could recover the sources showed that spatiotemporal models can use the additional data structure efficiently, in contrast to spatial-only models, and that the parameter-free joint-diagonalization-based algorithms are robust against convergence issues.

Other analysis models<br />

Before cont<strong>in</strong>u<strong>in</strong>g to other biomedical applications, we shortly want to review other recent work<br />

of the author <strong>in</strong> this field.<br />

In Karvanen and Theis (2004), we proposed the concept of w<strong>in</strong>dow ICA for the analysis of<br />

fMRI data. The basic idea was to apply, spatial ICA <strong>in</strong> slid<strong>in</strong>g time w<strong>in</strong>dows; this approach<br />

avoided the problems related to the high number of signals and the result<strong>in</strong>g issues with dimension<br />

reduction methods, and moreover gave some <strong>in</strong>sight <strong>in</strong>to small changes happen<strong>in</strong>g dur<strong>in</strong>g<br />

the experiment, which are otherwise not encoded <strong>in</strong> changes <strong>in</strong> the component maps. We demonstrated<br />

the usefulness of the proposed approach <strong>in</strong> an experiment where a subject listened to<br />

auditory stimuli consist<strong>in</strong>g of s<strong>in</strong>usoidal sounds (beeps) and words <strong>in</strong> vary<strong>in</strong>g proportions. Here,<br />

the w<strong>in</strong>dow ICA algorithm was able to f<strong>in</strong>d different auditory activations patterns related to the<br />

beeps and the words, respectively.
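A minimal sketch of this sliding-window idea, assuming a hypothetical data matrix of shape (scans × voxels) and using scikit-learn's FastICA as a stand-in for the spatial ICA algorithm actually employed (window length and step size are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import FastICA

def window_ica(data, n_components=5, win=40, step=10, seed=0):
    """Spatial ICA in sliding time windows.
    data: (time, voxels) array; returns (n_windows, n_components, voxels) maps."""
    maps = []
    for start in range(0, data.shape[0] - win + 1, step):
        segment = data[start:start + win]                     # (win, voxels)
        ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
        # spatial ICA: voxels are the samples, time points within the window the mixtures
        spatial = ica.fit_transform(segment.T)                # (voxels, n_components)
        maps.append(spatial.T)
    return np.stack(maps)

# e.g. maps = window_ica(fmri_matrix)   # fmri_matrix: hypothetical (scans x voxels) array
```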

An <strong>in</strong>terest<strong>in</strong>g model for activity maps <strong>in</strong> the bra<strong>in</strong> is given by sparse cod<strong>in</strong>g; after all, the<br />

component maps are always implicitly assumed to show only strongly focused regions of activation.<br />

Hence we asked the question whether the sparse models proposed <strong>in</strong> section 1.4.1 could be


applied to fMRI data. We showed a successful application to the above visual-stimulus experiment
in Georgiev et al. (2005a). Again, we were able to show that with only five components, the
stimulus-related activity in the visual cortex could be nicely reconstructed.

Figure 1.24: Comparison of the recovered component that is maximally autocrosscorrelated with
the stimulus task (top) for various BSS algorithms (stNSS, stSOBI (1D), stICA after stSOBI,
stICA and fastICA), after dimension reduction to 4 components.

A similar question of model generalization was posed in Theis and Tanaka (2005). There we
proposed to study the postnonlinear mixing model (1.3) in the context of fMRI data. We derived
an algorithm for blindly estimating the sensor characteristics of such a multi-sensor network.
From the observed sensor outputs, the nonlinearities are recovered using a well-known
Gaussianization procedure. The underlying sources are then reconstructed using spatial
decorrelation as proposed by Ziehe et al. (2003a). Application of this robust algorithm to data
sets acquired by fMRI led to the detection of a distinctive bump of the BOLD effect at larger
activations, which may be interpreted as an inherent BOLD-related nonlinearity.

In Meyer-Bäse et al. (2004a,b), we discussed the concept of dependent component analysis,
see section 1.3, in the context of fMRI data analysis. We detected dependencies by finding
clusters of dependent components; algorithmically, we compared two of the first such algorithms,
namely tree-dependent ICA (Bach and Jordan, 2003a) and topographic ICA (Hyvärinen et al.,
2001a). For the fMRI data, a comparative quantitative evaluation between the two methods,
tree-dependent and topographic ICA, was performed. We observed that topographic ICA
outperforms other ordinary ICA methods and tree-dependent ICA when extracting only few
independent components. This resulted in a postprocessing algorithm based on clustering of ICA
components resulting from different source component dimensions in Keck et al. (2005).

The above algorithms have been included in our MFBOX (Model-free Toolbox) package


(Gruber et al., 2007), a Matlab toolbox for data-driven analysis of biomedical data, which may
also be used as an SPM plug-in. Its main focus lies on the analysis of functional Magnetic
Resonance Imaging (fMRI) data sets with various model-free or data-driven techniques. The
toolbox includes BSS algorithms based on various source models including ICA, spatiotemporal
ICA, autodecorrelation and NMF. They can all be easily combined with higher-level analysis
methods such as reliability analysis using projective clustering of the components, sliding time
window analysis or hierarchical decomposition.

Figure 1.25: Directional neural networks: as training data set, a few labeled A's from a test
image have been used (a). (b) then shows five rotated A's together with their image patches
normalized using the main principal component direction below. This small-scale classifier
reproduces the A-locations in a test image sufficiently well (c), even though the training data
set was small and different fonts, noise etc. had been added.

1.6.2 Image segmentation and cell count<strong>in</strong>g<br />

A supervised interpretation of the initial data analysis model from section 1.1 leads to a
classification problem: given a set of input-output samples, find a map that interpolates these
samples, hopefully generalizing well to new input samples. Such a map thus serves as a classifier
if the

output consists of discrete labels. Classification based on support vector mach<strong>in</strong>es (Boser et al.,<br />

1992, Burges, 1998, Schölkopf and Smola, 2002) or neural networks (Hayk<strong>in</strong>, 1994) has prom<strong>in</strong>ent<br />

applications <strong>in</strong> biomedical data analysis. Here we review an application to biomedical<br />

image process<strong>in</strong>g from Theis et al. (2004c), see chapter 21.<br />

While many different tissues of the mammalian organism are capable of renewing themselves
after damage, it was long believed that the nervous system is not able to regenerate at all.
Nevertheless, the first data showing that new nerve cells can be generated in the adult brain
were presented in the 1960s (Altman and Das, 1965), demonstrating new neurons in the brain of
adult rats. In order to quantify neurogenesis in animals, newborn cells are labeled with specific
markers such as BrdU; in brain sections these can later be analyzed and counted through the use
of a confocal microscope. So far, however, this counting process had been performed manually.

In Theis et al. (2004b,c), we proposed an algorithm called ZANE to automatically identify cell



components <strong>in</strong> digitized section images. First, a so-called cell classifier was tra<strong>in</strong>ed with cell and<br />

non-cell patches us<strong>in</strong>g s<strong>in</strong>gle- and multi-layer perceptrons as well as unsupervised <strong>in</strong>dependent<br />

component analysis with correlation comparison. In order to account for a larger variety of cell
shapes, a directional normalization approach was proposed. The cell classifier can then be applied
to an arbitrary number of sections by scanning each section and choosing maxima of this classifier

as cell center locations. This is illustrated us<strong>in</strong>g a toy example <strong>in</strong> figure 1.25. A flow-chart with<br />

the basic segmentation setup is shown <strong>in</strong> figure 1.26.<br />
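The scanning step can be sketched as follows (a simplified illustration, not the ZANE code itself): a trained patch scorer is evaluated at every pixel position, and local maxima of the resulting score map above a threshold are taken as candidate cell centres. Here `score_patch` stands for any trained classifier output, and the directional normalization is approximated by rotating each patch along its main principal axis.

```python
import numpy as np
from scipy.ndimage import maximum_filter, rotate

def normalize_direction(patch):
    """Rotate a patch so that the main principal axis of its bright pixels
    points in a fixed direction (crude directional normalization)."""
    ys, xs = np.nonzero(patch > patch.mean())
    if ys.size < 2:
        return patch
    pts = np.stack([ys - ys.mean(), xs - xs.mean()])
    _, vecs = np.linalg.eigh(np.cov(pts))
    angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
    return rotate(patch, -angle, reshape=False)

def detect_cells(image, score_patch, patch=15, threshold=0.5):
    """Scan `image` with a trained patch scorer; return candidate cell centres."""
    h, w = image.shape
    half = patch // 2
    scores = np.zeros((h, w))
    for i in range(half, h - half):
        for j in range(half, w - half):
            window = image[i - half:i + half + 1, j - half:j + half + 1]
            scores[i, j] = score_patch(normalize_direction(window))  # e.g. perceptron output
    peaks = (scores == maximum_filter(scores, size=patch)) & (scores > threshold)
    return np.argwhere(peaks), scores
```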

ZANE was successfully applied to measure neurogenesis <strong>in</strong> adult rodent bra<strong>in</strong> sections, where<br />

we showed that the proliferation of neurons is substantially stronger (340%) <strong>in</strong> the dentate gyrus<br />

of an epileptic mouse than <strong>in</strong> a control group. When compar<strong>in</strong>g the count<strong>in</strong>g result with manual<br />

counts, the mean ZANE classification rate is 90% of all (manually detected) cells; this was with<strong>in</strong><br />

the error bounds of a perfect count, s<strong>in</strong>ce manual counts by different experts varied by roughly<br />

10% themselves (Theis et al., 2004b).<br />

1.6.3 Surface electromyograms<br />

In sections 1.2 and 1.4, we presented bl<strong>in</strong>d<br />

data factorization models based on statistical<br />

<strong>in</strong>dependence, explicit sparseness and nonnegativity.<br />

It is known that all three approaches<br />

tend to <strong>in</strong>duce a more mean<strong>in</strong>gful,<br />

often more sparse representation of the multivariate<br />

data set. However, develop<strong>in</strong>g explicit<br />

applications and more so perform<strong>in</strong>g mean<strong>in</strong>gful<br />

comparisons of such methods is still of considerable<br />

<strong>in</strong>terest.<br />

In Theis and García (2006), see chapter<br />

20, we analyzed and compared the above<br />

models, not from a theoretical point of
view but rather based on a real-world example,

namely the analysis of surface electromyogram<br />

(sEMG) data sets. An electromyogram<br />

(EMG) denotes the electric signal generated<br />

by a contract<strong>in</strong>g muscle (Basmajian and Luca,<br />

1985). In general, EMG measurements make<br />

use of <strong>in</strong>vasive, pa<strong>in</strong>ful needle electrodes. An<br />

alternative is to use sEMG, which is measured<br />

us<strong>in</strong>g non-<strong>in</strong>vasive, pa<strong>in</strong>less surface electrodes.<br />

However, in this case the signals are rather more difficult to interpret due to noise and the
overlap of several source signals. Direct application of the ICA model to real-world noisy sEMG
turns out to be problematic (García et al., 2004), and it is yet unknown if the assumption of
independent sources holds well in the setting of sEMG.

Figure 1.26: ZANE image segmentation.


Figure 1.27: Recovered sources after unmixing the sEMG data; (a–f) show the results obtained
using the different methods (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF* and (f) SCA,
and (g) shows the original most-active sensor signal (component amplitude [a.u.] versus
time [ms]).

Our approach was therefore to apply and validate sparse BSS methods based on various<br />

model assumptions to sEMG signals. When applied to artificial signals we found noticeable<br />

differences in algorithm performance depending on the source assumptions. In particular, sparse
nonnegative matrix factorization outperforms the other methods with respect to increasing

additive noise. However, <strong>in</strong> the case of real sEMG signals we showed that despite the fundamental<br />

differences <strong>in</strong> the various models, the methods yield rather similar results and can successfully<br />

separate the source signal, see figure 1.27. This was due to the fact that the different sparseness<br />

assumptions are only approximately fulfilled thus apparently forc<strong>in</strong>g the algorithms to reach<br />

similar results, but from different <strong>in</strong>itial conditions us<strong>in</strong>g different optimization criteria.<br />
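The comparison can be set up along the following lines; this is only a toy sketch with scikit-learn stand-ins (FastICA instead of JADE, an l1-penalized NMF instead of the sNMF/sNMF* variants, and no SCA), applied to synthetic sparse nonnegative bursts rather than real sEMG:

```python
import numpy as np
from sklearn.decomposition import FastICA, NMF

rng = np.random.default_rng(1)
T, n = 2000, 3
# sparse, nonnegative burst-like toy sources (stand-ins for motor unit activity)
S = np.where(rng.random((n, T)) < 0.05, rng.random((n, T)), 0.0)
X = rng.random((8, n)) @ S + 0.01 * rng.random((8, T))   # 8 nonnegative "electrodes"

def best_corr(est, true):
    """Best absolute correlation of each true source with any estimated component."""
    C = np.abs(np.corrcoef(np.vstack([true, est]))[:len(true), len(true):])
    return C.max(axis=1)

S_ica = FastICA(n_components=n, random_state=0, max_iter=1000).fit_transform(X.T).T
S_nmf = NMF(n_components=n, init="nndsvda", max_iter=1000, random_state=0).fit(X).components_
S_snmf = NMF(n_components=n, init="nndsvda", max_iter=1000, random_state=0,
             alpha_H=0.05, l1_ratio=1.0).fit(X).components_  # l1 penalty ~ sparser time courses

for name, est in [("ICA", S_ica), ("NMF", S_nmf), ("sparse NMF", S_snmf)]:
    print(name, np.round(best_corr(est, S), 2))
```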

The result<strong>in</strong>g sparse signal components can now be used for further analysis and for artifact<br />

removal. A similar analysis using spectral correlations has been employed in (Böhm et al.,

2006, Stadlthanner et al., 2006b) to remove the water artifact from multidimensional proton<br />

NMR spectra of biomolecules dissolved <strong>in</strong> aqueous solutions.



1.7 Outlook<br />

We considered the (mostly l<strong>in</strong>ear) factorization problem<br />

x(t) = As(t) + n(t). (1.18)<br />

Often, the noise n(t) was not explicitly modeled but <strong>in</strong>cluded <strong>in</strong> s(t). We assumed that x(t)<br />

is known as well as some additional <strong>in</strong>formation about the system itself. Depend<strong>in</strong>g on the<br />

assumptions, different problems and algorithmic solutions can be derived to solve such <strong>in</strong>verse<br />

problems:<br />

• statistically <strong>in</strong>dependent s(t): we proved that <strong>in</strong> this case ICA could solve (1.18) uniquely,<br />

see section 1.2; this holds even <strong>in</strong> some nonl<strong>in</strong>ear generalizations.<br />

• approximate spatiotemporal <strong>in</strong>dependence <strong>in</strong> A and s(t): we provided a very robust,<br />

simple algorithm for the spatiotemporal separation based on jo<strong>in</strong>t diagonalization, see<br />

section 1.3.2.<br />

• statistical <strong>in</strong>dependence between groups of sources s(t): aga<strong>in</strong>, uniqueness except for transformation<br />

with<strong>in</strong> the blocks can be proven, and the constructive proof results <strong>in</strong> a simple<br />

update algorithm as <strong>in</strong> the l<strong>in</strong>ear case, see section 1.3.3.<br />

• sparseness of the sources s(t): <strong>in</strong> the context of SCA, we relaxed the assumptions of s<strong>in</strong>gle<br />

source sparseness to multi-dimensional sparseness constra<strong>in</strong>ts, for which we were still able<br />

to prove uniqueness and to derive an algorithm, see section 1.4.1.<br />

• non-negativity of A and s(t): a sparse extension of the pla<strong>in</strong> NMF model was analyzed<br />

in terms of existence and uniqueness, and a generalization for lp-sparse sources has been

proposed <strong>in</strong> order to better approximate comb<strong>in</strong>atorial sparseness <strong>in</strong> the l0-sense, see<br />

section 1.4.2.<br />

• Gaussian or noisy components <strong>in</strong> s(t): we proposed various denois<strong>in</strong>g and dimension reduction<br />

schemes, and proved uniqueness of a non-Gaussian signal subspace <strong>in</strong> section 1.5.<br />

F<strong>in</strong>ally, <strong>in</strong> section 1.6, we applied some of the above methods to biomedical data sets recorded<br />

by functional MRI, surface EMG and optical microscopy.<br />

Other work<br />

In this summary, data analysis was discussed from the viewpo<strong>in</strong>t of data factorization models<br />

such as (1.18). Before conclud<strong>in</strong>g, some other works of the author <strong>in</strong> related areas should be<br />

mentioned:<br />

Long discussed primarily in mathematics, optimization on Lie groups has become an important
topic in the field of machine learning and neural networks, since many cost functions are

def<strong>in</strong>ed on parameter spaces that more naturally obey a non-Euclidean geometry. Consider for<br />

example Amari’s natural gradient (Amari, 1998): simply by tak<strong>in</strong>g <strong>in</strong>to account the geometry



of the search space Gl(n) of all invertible (n × n)-matrices, Amari was able to considerably
improve search performance and accuracy, thus providing an equivariant ICA algorithm.

Figure 1.28: Future analysis framework (schematic: the data are decomposed into noise and a
signal part, which is further analyzed by independent and structured models).

In Theis

(2005b), we gave an overview of various gradient calculations on Gl(n) and presented generalizations<br />

to over- and undercomplete cases, realized by a semidirect product. These ideas were used<br />

<strong>in</strong> Squart<strong>in</strong>i and Theis (2006), where we def<strong>in</strong>ed some alternative Riemannian metrics on the<br />

parameter space of non-square matrices, correspond<strong>in</strong>g to various translations def<strong>in</strong>ed there<strong>in</strong>.<br />

Such metrics allowed us to derive novel, efficient learn<strong>in</strong>g rules for two ICA based algorithms<br />

for overdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation.<br />
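For reference, the square, fully determined case of such a learning rule is Amari's natural-gradient update; the following toy NumPy sketch illustrates it (it is only an illustration and does not cover the over- or undercomplete generalizations discussed above):

```python
import numpy as np

def natural_gradient_ica(X, lr=0.1, epochs=200, seed=0):
    """Square ICA with the natural (relative) gradient update on Gl(n):
    W <- W + lr * (I - E[tanh(y) y^T]) W,  y = W x  (tanh for super-gaussian sources)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(epochs):
        Y = W @ X
        W = W + lr * (np.eye(n) - np.tanh(Y) @ Y.T / T) @ W
    return W

# toy demo: two super-gaussian (Laplacian) sources, square mixture
rng = np.random.default_rng(3)
S = rng.laplace(size=(2, 10000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = natural_gradient_ica(A @ S)
print(np.round(W @ A, 2))   # roughly a scaled permutation matrix if separation succeeded
```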

In Meyer-Baese et al. (2006), we studied optimization and statistical learn<strong>in</strong>g on a neural<br />

network that self-organized to solve a BSS problem. The result<strong>in</strong>g onl<strong>in</strong>e learn<strong>in</strong>g solution used<br />

the nonstationarity of the sources to achieve the separation. For this, we divided the problem
into two learning problems, one of which is solved by an anti-Hebbian and the other by a Hebbian
learning process. The stability of related networks is discussed in Meyer-Bäse et al.

(2006).<br />

A major application of unsupervised learn<strong>in</strong>g <strong>in</strong> neuroscience lies <strong>in</strong> the analysis of functional<br />

MRI data sets. A problem we faced dur<strong>in</strong>g the course of our analyses was the efficient storage<br />

and retrieval of many, large-dimensional spatiotemporal data sets. Although some dimension<br />

reduction or region-of-<strong>in</strong>terest selection may be performed beforehand to reduce the sample<br />

size (Keck et al., 2006), we wanted to compress the data as well as possible without losing

<strong>in</strong>formation. For this, we proposed a novel lossless compression method named FTTcoder (Theis<br />

and Tanaka, 2005) for the compression of images and 3d sequences collected dur<strong>in</strong>g a typical<br />

fMRI experiment. The large data sets <strong>in</strong>volved <strong>in</strong> this popular medical application necessitated<br />

novel compression algorithms to take <strong>in</strong>to account the structure of the recorded data as well as<br />

the experimental conditions, which <strong>in</strong>clude the 4d record<strong>in</strong>gs, the used stimulus protocol and<br />

marked regions of <strong>in</strong>terest (ROI). For this, we used simple temporal transformations and entropy<br />

cod<strong>in</strong>g with context model<strong>in</strong>g to encode the 4d scans after preprocess<strong>in</strong>g with the ROI mask<strong>in</strong>g.<br />

The compression algorithm as well as the fMRI toolbox and the algorithms for spatiotemporal<br />

and subspace BSS are all available onl<strong>in</strong>e at http://fabian.theis.name.
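The benefit of the temporal transformation can be illustrated with a small, self-contained sketch (this is not FTTcoder itself, which additionally uses context modeling and a real entropy coder; the array below is a hypothetical stand-in for a 4d recording):

```python
import numpy as np

def empirical_entropy(values):
    """Zeroth-order entropy in bits per sample of an integer-valued array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(4)
# hypothetical 4d scan (x, y, z, time): slowly varying signal plus noise
scan = (1000 + 50 * np.sin(np.linspace(0, 8, 120))
        + rng.normal(0, 3, size=(16, 16, 8, 120))).astype(np.int16)
roi = np.ones(scan.shape[:3], dtype=bool)          # region-of-interest mask (here: all voxels)

voxels = scan[roi]                                 # (n_voxels, time) after ROI masking
residual = np.diff(voxels, axis=1)                 # simple temporal decorrelation transform

print("raw scan          :", round(empirical_entropy(voxels), 2), "bits/sample")
print("temporal residuals:", round(empirical_entropy(residual), 2), "bits/sample")
```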



Future work<br />

In future work, the goal is to employ the above algorithms as preprocess<strong>in</strong>g step <strong>in</strong> model<strong>in</strong>g of<br />

biological processes, for example in quantitative systems biology. The idea is straightforward,

see figure 1.28. Given multivariate data that encode for example biological parameters of multiple,<br />

dependent experiments, we want to f<strong>in</strong>d regularized simple models that can expla<strong>in</strong> the<br />

data as well as predict future quantitative experiments. In a first step, denois<strong>in</strong>g is to be performed,<br />

especially removal of Gaussian subspaces that do not conta<strong>in</strong> mean<strong>in</strong>gful data apart<br />

from their covariance structure. The signal subspace then is to be further processed us<strong>in</strong>g techniques<br />

from dependent component analysis and <strong>in</strong>dependent subspace analysis. The result<strong>in</strong>g<br />

subspaces themselves are analyzed with network analysis techniques, for example based on Gaussian<br />

graphical models and correspond<strong>in</strong>g high-dimensional regularizations (Dobra et al., 2004,<br />

Schäfer and Strimmer, 2005). The derived structures are then expected to serve as models,<br />

which may provide quantitative descriptions of the underly<strong>in</strong>g processes. F<strong>in</strong>ally, we hope to<br />

drive further experiments by the model predictions.<br />

This well-known coupl<strong>in</strong>g of experimentalists and theoreticians is characteristic of systems<br />

biology <strong>in</strong> particular, and modern <strong>in</strong>terdiscipl<strong>in</strong>ary research <strong>in</strong> general. With the past work,<br />

we hope to have made some steps in this direction, and recent work in the field of microarray

analysis employ<strong>in</strong>g such techniques is already promis<strong>in</strong>g (Lutter et al., 2006, Schachtner et al.,<br />

2007, Stadlthanner et al., 2006a). In theoretical terms, the long-term goal is to step from<br />

multivariate analysis to network analysis, just as we have observed the field of signal process<strong>in</strong>g<br />

expand<strong>in</strong>g from univariate to multivariate models <strong>in</strong> the past few decades.


Part II<br />

Papers<br />



Chapter 2<br />

Neural Computation 16:1827-1850,<br />

2004<br />

Paper F.J. Theis. A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation.<br />

Neural Computation, 16:1827-1850, 2004<br />

Reference (Theis, 2004a)<br />

Summary <strong>in</strong> section 1.2.1<br />




LETTER Communicated by Aapo Hyvärinen

A New Concept for Separability Problems <strong>in</strong> Bl<strong>in</strong>d Source<br />

Separation<br />

Fabian J. Theis<br />

fabian@theis.name<br />

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

The goal of bl<strong>in</strong>d source separation (BSS) lies <strong>in</strong> recover<strong>in</strong>g the orig<strong>in</strong>al<br />

<strong>in</strong>dependent sources of a mixed random vector without know<strong>in</strong>g the<br />

mix<strong>in</strong>g structure. A key <strong>in</strong>gredient for perform<strong>in</strong>g BSS successfully is to<br />

know the <strong>in</strong>determ<strong>in</strong>acies of the problem—that is, to know how the separat<strong>in</strong>g<br />

model relates to the orig<strong>in</strong>al mix<strong>in</strong>g model (separability). For l<strong>in</strong>ear<br />

BSS, Comon (1994) showed us<strong>in</strong>g the Darmois-Skitovitch theorem that<br />

the l<strong>in</strong>ear mix<strong>in</strong>g matrix can be found except for permutation and scal<strong>in</strong>g.<br />

In this work, a much simpler, direct proof for l<strong>in</strong>ear separability is given.<br />

The idea is based on the fact that a random vector is <strong>in</strong>dependent if and<br />

only if the Hessian of its logarithmic density (resp. characteristic function)<br />

is diagonal everywhere. This property is then exploited to propose a new<br />

algorithm for perform<strong>in</strong>g BSS. Furthermore, first ideas of how to generalize<br />

separability results based on Hessian diagonalization to more complicated<br />

nonl<strong>in</strong>ear models are studied <strong>in</strong> the sett<strong>in</strong>g of postnonl<strong>in</strong>ear BSS.<br />

1 Introduction<br />

In <strong>in</strong>dependent component analysis (ICA), one tries to f<strong>in</strong>d statistically<br />

<strong>in</strong>dependent data with<strong>in</strong> a given random vector. An application of ICA<br />

lies <strong>in</strong> bl<strong>in</strong>d source separation (BSS), where it is furthermore assumed that<br />

the given vector has been mixed us<strong>in</strong>g a fixed set of <strong>in</strong>dependent sources.<br />

The advantage of apply<strong>in</strong>g ICA algorithms to BSS problems <strong>in</strong> contrast to<br />

correlation-based algorithms is that ICA tries to make the output signals as<br />

<strong>in</strong>dependent as possible by also <strong>in</strong>clud<strong>in</strong>g higher-order statistics.<br />

S<strong>in</strong>ce the <strong>in</strong>troduction of <strong>in</strong>dependent component analysis by Hérault<br />

and Jutten (1986), various algorithms have been proposed to solve the BSS<br />

problem (Comon, 1994; Bell & Sejnowski, 1995; Hyvär<strong>in</strong>en & Oja, 1997;<br />

Theis, Jung, Puntonet, & Lang, 2002). Good textbook-level <strong>in</strong>troductions to<br />

ICA are given <strong>in</strong> Hyvär<strong>in</strong>en, Karhunen, and Oja (2001) and Cichocki and<br />

Amari (2002).<br />

Separability of l<strong>in</strong>ear BSS states that under weak conditions to the sources,<br />

the mix<strong>in</strong>g matrix is determ<strong>in</strong>ed uniquely by the mixtures except for permutation<br />

and scal<strong>in</strong>g, as showed by Comon (1994) us<strong>in</strong>g the Darmois-<br />

Skitovitch theorem. We propose a direct proof based on the concept of<br />




separated functions, that is, functions that can be split <strong>in</strong>to a product of<br />

one-dimensional functions (see def<strong>in</strong>ition 1). If the function is positive, this<br />

is equivalent to the fact that its logarithm has a diagonal Hessian everywhere<br />

(see lemma 1 and theorem 1). A similar lemma has been shown by L<strong>in</strong> (1998)<br />

for what he calls block diagonal Hessians. However, he omits discussion of<br />

the separatedness of densities with zeros, which plays a m<strong>in</strong>or role for the<br />

separation algorithm he is <strong>in</strong>terested <strong>in</strong> but is important for deriv<strong>in</strong>g separability.<br />

Us<strong>in</strong>g separatedness of the density, respectively, the characteristic<br />

function (Fourier transformation), of the random vector, we can then show<br />

separability directly (<strong>in</strong> two slightly different sett<strong>in</strong>gs, for which we provide<br />

a common framework). Based on this result, we propose an algorithm<br />

for l<strong>in</strong>ear BSS by diagonaliz<strong>in</strong>g the logarithmic density of the Hessian. We<br />

recently found that this algorithm has already been proposed (L<strong>in</strong>, 1998),<br />

but without consider<strong>in</strong>g the necessary assumptions for successful algorithm<br />

application. Here we give precise conditions for when to apply this algorithm<br />

(see theorem 3) and show that po<strong>in</strong>ts satisfy<strong>in</strong>g these conditions can<br />

<strong>in</strong>deed be found if the sources conta<strong>in</strong> at most one gaussian component<br />

(see lemma 5). L<strong>in</strong> uses a discrete approximation of the derivative operator<br />

to approximate the Hessian. We suggest us<strong>in</strong>g kernel-based density<br />

estimation, which can be directly differentiated. A similar algorithm based<br />

on Hessian diagonalization has been proposed by Yeredor (2000) us<strong>in</strong>g the<br />

characteristic function of a random vector. However, the characteristic function<br />

is complex valued, and additional care has to be taken when apply<strong>in</strong>g<br />

a complex logarithm. Basically, this is well def<strong>in</strong>ed locally only at nonzeros.<br />

In algorithmic terms, the characteristic function can be easily approximated<br />

by samples (which is equivalent to our kernel-based density approximation<br />

us<strong>in</strong>g gaussians before Fourier transformation). Yeredor suggests jo<strong>in</strong>t diagonalization<br />

of the Hessian of the logarithmic characteristic function (which<br />

is problematic because of the nonuniqueness of the complex logarithm)<br />

evaluated at several po<strong>in</strong>ts <strong>in</strong> order to avoid the locality of the algorithm.<br />

Instead of jo<strong>in</strong>t diagonalization, we use a comb<strong>in</strong>ed energy function based<br />

on the previously def<strong>in</strong>ed separator, which also takes <strong>in</strong>to account global<br />

<strong>in</strong>formation but does not have the drawback of be<strong>in</strong>g s<strong>in</strong>gular at zeros of the<br />

density, respectively, characteristic function. Thus, the algorithmic part of<br />

this article can be seen as a general framework for the algorithms proposed<br />

by L<strong>in</strong> (1998) and Yeredor (2000).<br />

Section 2 <strong>in</strong>troduces separated functions, giv<strong>in</strong>g local characterizations<br />

of the densities of <strong>in</strong>dependent random vectors. Section 3 then <strong>in</strong>troduces<br />

the l<strong>in</strong>ear BSS model and states the well-known separability result. After<br />

giv<strong>in</strong>g an easy and short proof <strong>in</strong> two dimensions with positive densities,<br />

we provide a characterization of gaussians <strong>in</strong> terms of a differential equation<br />

and provide the general proof. The BSS algorithm based on f<strong>in</strong>d<strong>in</strong>g<br />

separated densities is proposed and studied <strong>in</strong> section 4. We f<strong>in</strong>ish with<br />

a generalization of the separability for the postnonl<strong>in</strong>ear mixture case <strong>in</strong><br />

section 5.



2 Separated and L<strong>in</strong>early Separated Functions<br />

Def<strong>in</strong>ition 1. A function f : R n → C is said to be separated, respectively,<br />

l<strong>in</strong>early separated, if there exist one-dimensional functions g1,...,gn : R → C<br />

such that f (x) = g1(x1) ···gn(xn) respectively f (x) = g1(x1) +···+gn(xn) for<br />

all x ∈ R n .<br />

Note that the functions gi are uniquely determ<strong>in</strong>ed by f up to a scalar<br />

factor, respectively, an additive constant. If f is l<strong>in</strong>early separated, then exp f<br />

is separated. Obviously the density function of an <strong>in</strong>dependent random<br />

vector is separated. For brevity, we often use the tensor product and write<br />

f ≡ g1 ⊗···⊗gn for separated f , where for any functions h, k def<strong>in</strong>ed on a<br />

set U, h ≡ k if h(x) = k(x) for all x ∈ U.<br />

Separatedness can also be def<strong>in</strong>ed on any open parallelepiped (a1, b1) ×<br />

···×(an, bn) ⊂ R n <strong>in</strong> the obvious way. We say that f is locally separated<br />

at x ∈ R n if there exists an open parallelepiped U such that x ∈ U and<br />

f |U is separated. If f is separated, then f is obviously everywhere locally<br />

separated. The converse, however, does not necessarily hold, as shown <strong>in</strong><br />

Figure 1.<br />

Figure 1: Density of a random vector S with a locally but not globally separated density. Here,
pS := c χ_{[−2,2]×[−2,0] ∪ [0,2]×[1,3]}, where χ_U denotes the function that is 1 on U and 0 everywhere
else. Obviously, pS is not separated globally, but is separated if restricted to squares of length < 1.
Plotted is a smoothed version of pS.



The function f is said to be positive if f is real and f (x) >0 for all x ∈ R n ,<br />

and nonnegative if f is real and f (x) ≥ 0 for all x ∈ R n . A positive function<br />

f is separated if and only if ln f is l<strong>in</strong>early separated.<br />

Let C^m(U, V) be the ring of all m-times continuously differentiable functions from U ⊂ R^n to
V ⊂ C, U open. For a C^m-function f, we write ∂_{i1} ··· ∂_{im} f := ∂^m f / ∂x_{i1} ··· ∂x_{im} for the m-fold
partial derivatives. If f ∈ C^2(R^n, C), denote with the symmetric (n × n)-matrix
H_f(x) := (∂_i ∂_j f(x))_{i,j=1}^n the Hessian of f at x ∈ R^n.

L<strong>in</strong>early separated functions can be classified us<strong>in</strong>g their Hessian (if it<br />

exists):<br />

Lemma 1. A function f ∈ C 2 (R n , C) is l<strong>in</strong>early separated if and only if Hf (x)<br />

is diagonal for all x ∈ R n .<br />

A similar lemma for block diagonal Hessians has been shown by L<strong>in</strong><br />

(1998).<br />

Proof. If f is l<strong>in</strong>early separated, its Hessian is obviously diagonal everywhere<br />

by def<strong>in</strong>ition.<br />

Assume the converse. We prove that f is separated by <strong>in</strong>duction over<br />

the dimension n. For n = 1, the claim is trivial. Now assume that we have<br />

shown the lemma for n − 1. By <strong>in</strong>duction assumption, f (x1,...,xn−1, 0) is<br />

l<strong>in</strong>early separated, so<br />

f (x1,...,xn−1, 0) = g1(x1) +···+gn−1(xn−1)<br />

for all xi ∈ R and some functions gi on R. Note that gi ∈ C 2 (R, C).<br />

Def<strong>in</strong>e a function h : R → C by h(y) := ∂n f (x1,...,xn−1, y), y ∈ R,<br />

for fixed x1,...,xn−1 ∈ R. Note that h is <strong>in</strong>dependent of the choice of the<br />

xi, because ∂n∂i f ≡ ∂i∂n f is zero everywhere, so xi ↦→ ∂n f (x1,...,xn−1, y)<br />

is constant for fixed x_j, y ∈ R, j ≠ i. By definition, h ∈ C^1(R, C), so h is integrable on compact
intervals. Define k : R → C by k(y) := ∫_0^y h. Then

f (x1,...,xn) = g1(x1) +···+gn−1(xn−1) + k(xn) + c,<br />

where c ∈ C is a constant, because both functions have the same derivative<br />

and R n is connected. If we set gn := k + c, the claim follows.<br />

This lemma also holds for functions def<strong>in</strong>ed on any open parallelepiped<br />

(a1, b1) ×···×(an, bn) ⊂ R n . Hence, an arbitrary real-valued C 2 -function f<br />

is locally separated at x with f(x) ≠ 0 if and only if the Hessian of ln |f| is

locally diagonal.<br />

For a positive function f , the Hessian of its logarithm is diagonal everywhere<br />

if it is separated, and it is easy to see that for positive f , the converse



also holds globally (see theorem 1(ii)). In this case, we have for i ≠ j,

0 ≡ ∂_i ∂_j ln f ≡ ( f ∂_i ∂_j f − (∂_i f)(∂_j f) ) / f^2,

so f is separated if and only if

f ∂_i ∂_j f ≡ (∂_i f)(∂_j f)

for i ≠ j or even i < j. This motivates the following definition:

Definition 2. For i ≠ j, the operator

R_ij : C^2(R^n, C) → C^0(R^n, C),   f ↦ R_ij[f] := f ∂_i ∂_j f − (∂_i f)(∂_j f)

is called the ij-separator.

Theorem 1. Let f ∈ C^2(R^n, C).

i. If f is separated, then R_ij[f] ≡ 0 for i ≠ j or, equivalently,

f ∂_i ∂_j f ≡ (∂_i f)(∂_j f)   (2.1)

holds for i ≠ j.

ii. If f is positive and R_ij[f] ≡ 0 holds for all i ≠ j, then f is separated.

If f is assumed to be only nonnegative, then f is locally separated but<br />

not necessarily globally separated (if the support of f has more than one<br />

component). See Figure 1 for an example of a nonseparated density with<br />

R12[ f ] ≡ 0.<br />

Proof of Theorem 1. i. If f is separated, then f(x) = g_1(x_1) ··· g_n(x_n) or short f ≡ g_1 ⊗···⊗ g_n, so

∂_i f ≡ g_1 ⊗···⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗···⊗ g_n

and

∂_i ∂_j f ≡ g_1 ⊗···⊗ g_{i−1} ⊗ g_i' ⊗ g_{i+1} ⊗···⊗ g_{j−1} ⊗ g_j' ⊗ g_{j+1} ⊗···⊗ g_n

for i < j. Hence equation 2.1 holds.

ii. Now assume the converse and let f be positive. Then according to the remarks after lemma 1,
H_{ln f}(x) is everywhere diagonal, so lemma 1 shows that ln f is linearly separated; hence, f is
separated.
that ln f is l<strong>in</strong>early separated; hence, f is separated.



Some trivial properties of the separator R_ij are listed in the next lemma:

Lemma 2. Let f, g ∈ C^2(R^n, C), i ≠ j and α ∈ C. Then

R_ij[αf] = α^2 R_ij[f]

and

R_ij[f + g] = R_ij[f] + R_ij[g] + f ∂_i ∂_j g + g ∂_i ∂_j f − (∂_i f)(∂_j g) − (∂_i g)(∂_j f).

3 Separability of L<strong>in</strong>ear BSS<br />

Consider the noiseless l<strong>in</strong>ear <strong>in</strong>stantaneous BSS model with as many sources<br />

as sensors:<br />

X = AS, (3.1)<br />

with an <strong>in</strong>dependent n-dimensional random vector S and A ∈ Gl(n). Here,<br />

Gl(n) denotes the general l<strong>in</strong>ear group of R n , that is, the group of all <strong>in</strong>vertible<br />

(n × n)-matrices.<br />

The task of l<strong>in</strong>ear BSS is to f<strong>in</strong>d A and S given only X. An obvious <strong>in</strong>determ<strong>in</strong>acy<br />

of this problem is that A can be found only up to scal<strong>in</strong>g and<br />

permutation because for scal<strong>in</strong>g L and permutation matrix P,<br />

X = ALPP −1 L −1 S,<br />

and P −1 L −1 S is also <strong>in</strong>dependent. Here, an <strong>in</strong>vertible matrix L ∈ Gl(n)<br />

is said to be a scal<strong>in</strong>g matrix if it is diagonal. We say two matrices B, C<br />

are equivalent, B ∼ C, if C can be written as C = BPL with a scaling

matrix L ∈ Gl(n) and an <strong>in</strong>vertible matrix with unit vectors <strong>in</strong> each row<br />

(permutation matrix) P ∈ Gl(n). Note that PL = L ′ P for some scal<strong>in</strong>g matrix<br />

L ′ ∈ Gl(n), so the order of the permutation and the scal<strong>in</strong>g matrix does not<br />

play a role for equivalence. Furthermore, if B ∈ Gl(n) with B ∼ I, then also<br />

B −1 ∼ I, and, more generally if BC ∼ A, then C ∼ B −1 A. Accord<strong>in</strong>g to the<br />

above, solutions of l<strong>in</strong>ear BSS are equivalent. We will show that under mild<br />

assumptions to S, there are no further <strong>in</strong>determ<strong>in</strong>acies of l<strong>in</strong>ear BSS.<br />

S is said to have a gaussian component if one of the Si is a one-dimensional<br />

gaussian, that is, pSi (x) = d exp(−ax2 + bx + c) with a, b, c, d ∈ R, a > 0, and<br />

S has a determ<strong>in</strong>istic component if one Si is determ<strong>in</strong>istic, that is, constant.<br />

Theorem 2 (Separability of l<strong>in</strong>ear BSS). Let A ∈ Gl(n) and S be an <strong>in</strong>dependent<br />

random vector. Assume one of the follow<strong>in</strong>g:<br />

i. S has at most one gaussian or determ<strong>in</strong>istic component, and the covariance<br />

of S exists.



ii. S has no gaussian component, and its density pS exists and is twice cont<strong>in</strong>uously<br />

differentiable.<br />

Then if AS is aga<strong>in</strong> <strong>in</strong>dependent, A is equivalent to the identity.<br />

So A is the product of a scal<strong>in</strong>g and a permutation matrix. The important<br />

part of this theorem is assumption i, which has been used to show separability<br />

by Comon (1994) and extended by Eriksson and Koivunen (2003) based<br />

on the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953). Us<strong>in</strong>g<br />

this theorem, the second part can be easily shown without C 2 -densities.<br />

Theorem 2 <strong>in</strong>deed proves separability of the l<strong>in</strong>ear BSS model, because<br />

if X = AS and W is a demix<strong>in</strong>g matrix such that WX is <strong>in</strong>dependent, then<br />

WA ∼ I,soW −1 ∼ A as desired.<br />

We will give a much easier proof without hav<strong>in</strong>g to use the Darmois-<br />

Skitovitch theorem <strong>in</strong> the follow<strong>in</strong>g sections.<br />

3.1 Two-Dimensional Positive Density Case. For illustrative purposes<br />

we will first prove separability for a two-dimensional random vector S with<br />

positive density pS ∈ C 2 (R 2 , R). Let A ∈ Gl(2). It is enough to show that if<br />

S and AS are <strong>in</strong>dependent, then either A ∼ I or S is gaussian.<br />

S is assumed to be <strong>in</strong>dependent, so its density factorizes:<br />

pS(s) = g1(s1)g2(s2),<br />

for s ∈ R^2. First, note that the density of AS is given by

p_AS(x) = |det A|^{−1} p_S(A^{−1} x) = c g_1(b_11 x_1 + b_12 x_2) g_2(b_21 x_1 + b_22 x_2)

for x ∈ R^2, c ≠ 0 fixed. Here, B = (b_ij) = A^{−1}. AS is also assumed to be independent, so
p_AS(x) is separated.

p_S was assumed to be positive; then so is p_AS. Hence, ln p_AS(x) is linearly separated, so

∂_1 ∂_2 ln p_AS(x) = b_11 b_12 h_1''(b_11 x_1 + b_12 x_2) + b_21 b_22 h_2''(b_21 x_1 + b_22 x_2) = 0

for all x ∈ R^2, where h_i := ln g_i ∈ C^2(R, R). By setting y := Bx, we therefore have

b_11 b_12 h_1''(y_1) + b_21 b_22 h_2''(y_2) = 0   (3.2)

for all y ∈ R^2, because B is invertible.

Now, if A (and therefore also B) is equivalent to the identity, then equation<br />

3.2 holds. If not, then A, and hence also B, have at least three nonzero<br />

entries. By equation 3.2 the fourth entry has to be nonzero, because the



h_i'' are not zero (otherwise g_i(y_i) = exp(a y_i + b), which is not integrable). Furthermore,

b_11 b_12 h_1''(y_1) = −b_21 b_22 h_2''(y_2)

for all y ∈ R^2, so the h_i'' are constant, say, h_i'' ≡ c_i, and c_i ≠ 0, as noted above. Therefore, the
h_i are polynomials of degree 2, and the g_i = exp h_i are gaussians (c_i < 0 because of the
integrability of the g_i).

3.2 Characterization of Gaussians. In this section, we show that among<br />

all densities, respectively, characteristic functions, the gaussians satisfy a<br />

special differential equation.<br />

Lemma 3. Let f ∈ C^2(R, C) and a ∈ C with

a f^2 − f f'' + f'^2 ≡ 0.   (3.3)

Then either f ≡ 0 or f(x) = exp( (a/2) x^2 + bx + c ), x ∈ R, with constants b, c ∈ C.

Proof. Assume f ≢ 0. Let x_0 ∈ R with f(x_0) ≠ 0. Then there exists a nonempty interval
U := (r, s) containing x_0 such that a complex logarithm log is defined on f(U). Set g := log f|_U.
Substituting exp g for f in equation 3.3 yields

a exp(2g) − exp(g)(g'' + g'^2) exp(g) + g'^2 exp(2g) ≡ 0,

and therefore g'' ≡ a. Hence, g is a polynomial of degree ≤ 2 with leading coefficient a/2.
Furthermore,

lim_{x→r+} f(x) ≠ 0   and   lim_{x→s−} f(x) ≠ 0,

so f has no zeros at all because of continuity. The argument above with U = R shows the claim.

If, furthermore, f is real nonnegative and <strong>in</strong>tegrable with <strong>in</strong>tegral 1 (e.g.,<br />

if f is the density of a random variable), then f has to be the exponential of<br />

a real-valued polynomial of degree precisely 2; otherwise, it would not be<br />

<strong>in</strong>tegrable. So we have the follow<strong>in</strong>g corollary:



Corollary 1. Let X be a random variable with twice cont<strong>in</strong>uously differentiable<br />

density pX satisfy<strong>in</strong>g equation 3.3. Then X is gaussian.<br />

If we do not want to assume that the random variable has a density, we<br />

can use its characteristic function (Bauer, 1996) <strong>in</strong>stead to show an equivalent<br />

result:<br />

Corollary 2. Let X be a random variable with twice continuously differentiable characteristic
function X̂(x) := E_X(exp ixX) satisfying equation 3.3. Then X is gaussian or deterministic.

Proof. Using X̂(0) = 1, lemma 3 shows that X̂(x) = exp( (a/2) x^2 + bx ). Moreover, from
X̂(−x) = \overline{X̂(x)}, we get a ∈ R and b = ib' with real b'. And |X̂| ≤ 1 shows that a ≤ 0.
So if a = 0, then X is deterministic (at b'), and if a ≠ 0, then X has a gaussian distribution with
mean b' and variance −a^{−1}.

3.3 Proof of Theorem 2. We will now prove l<strong>in</strong>ear separability; for this,<br />

we will use separatedness to show that some source components have to be<br />

gaussian (us<strong>in</strong>g the results from above) if the mix<strong>in</strong>g matrix is not trivial.<br />

The ma<strong>in</strong> argument is given <strong>in</strong> the follow<strong>in</strong>g lemma:<br />

Lemma 4. Let g_i ∈ C^2(R, C) and B ∈ Gl(n) such that f(x) := g_1 ⊗···⊗ g_n(Bx) is separated.
Then for all indices l and i ≠ j with b_li b_lj ≠ 0, g_l satisfies the differential equation 3.3 with
some constant a.

Proof. f is separated, so by theorem 1i,

R_ij[f] ≡ f ∂_i ∂_j f − (∂_i f)(∂_j f) ≡ 0   (3.4)

holds for i < j. The ingredients of this equation can be calculated for i < j as follows:

∂_i f(x) = Σ_k b_ki (g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)(Bx)

(∂_i f)(∂_j f)(x) = Σ_{k,l} b_ki b_lj ((g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)(g_1 ⊗···⊗ g_l' ⊗···⊗ g_n))(Bx)

∂_i ∂_j f(x) = Σ_k b_ki ( b_kj g_1 ⊗···⊗ g_k'' ⊗···⊗ g_n + Σ_{l≠k} b_lj g_1 ⊗···⊗ g_k' ⊗···⊗ g_l' ⊗···⊗ g_n )(Bx).

Putting this in equation 3.4 yields

0 = ( f ∂_i ∂_j f − (∂_i f)(∂_j f) )(x)
  = Σ_k b_ki b_kj ( (g_1 ⊗···⊗ g_n)(g_1 ⊗···⊗ g_k'' ⊗···⊗ g_n) − (g_1 ⊗···⊗ g_k' ⊗···⊗ g_n)^2 )(Bx)
  = Σ_k b_ki b_kj ( g_1^2 ⊗···⊗ g_{k−1}^2 ⊗ (g_k g_k'' − g_k'^2) ⊗ g_{k+1}^2 ⊗···⊗ g_n^2 )(Bx)

for x ∈ R^n. B is invertible, so the whole function is zero:

Σ_k b_ki b_kj g_1^2 ⊗···⊗ g_{k−1}^2 ⊗ (g_k g_k'' − g_k'^2) ⊗ g_{k+1}^2 ⊗···⊗ g_n^2 ≡ 0.   (3.5)

Choose x ∈ R^n with g_k(x_k) ≠ 0 for k = 1,...,n. Evaluating equation 3.5 at
(x_1,...,x_{l−1}, y, x_{l+1},...,x_n) for variable y ∈ R and dividing the resulting one-dimensional
equation by the constant g_1^2(x_1) ··· g_{l−1}^2(x_{l−1}) g_{l+1}^2(x_{l+1}) ··· g_n^2(x_n) shows

b_li b_lj ( g_l g_l'' − g_l'^2 )(y) = −( Σ_{k≠l} b_ki b_kj ((g_k g_k'' − g_k'^2) / g_k^2)(x_k) ) g_l^2(y)   (3.6)

for y ∈ R. So for indices l and i ≠ j with b_li b_lj ≠ 0, it follows from equation 3.6 that there exists
a ∈ C such that g_l satisfies the differential equation a g_l^2 − g_l g_l'' + g_l'^2 ≡ 0, that is,
equation 3.3.

Proof of Theorem 2. i. S is assumed to have at most one gaussian or determ<strong>in</strong>istic<br />

component and exist<strong>in</strong>g covariance. Set X := AS.<br />

We first show us<strong>in</strong>g whiten<strong>in</strong>g that A can be assumed to be orthogonal.<br />

For this, we can assume S and X to have no determ<strong>in</strong>istic component at all<br />

(because arbitrary choice of the matrix coefficients of the determ<strong>in</strong>istic components<br />

does not change the covariance). Hence, by assumption, Cov(X)<br />

is diagonal and positive definite, so let D_1 be diagonal invertible with Cov(X) = D_1^2. Similarly,
let D_2 be diagonal invertible with Cov(S) = D_2^2. Set Y := D_1^{−1} X and T := D_2^{−1} S, that is,
normalize X and S to covariance I. Then

Y = D_1^{−1} X = D_1^{−1} A S = D_1^{−1} A D_2 T


and T, D_1^{−1} A D_2 and Y satisfy the assumption, and D_1^{−1} A D_2 is orthogonal because

I = Cov(Y) = E(Y Y^⊤) = E(D_1^{−1} A D_2 T T^⊤ D_2 A^⊤ D_1^{−1}) = (D_1^{−1} A D_2)(D_1^{−1} A D_2)^⊤.
So without loss of generality, let A be orthogonal.<br />

Now let Ŝ(s) := E_S(exp is^⊤ S) be the characteristic function of S. By assumption, the covariance
(and hence the mean) of S exists, so Ŝ ∈ C^2(R^n, C) (Bauer, 1996). Furthermore, since S is
assumed to be independent, its characteristic function is separated: Ŝ ≡ g_1 ⊗···⊗ g_n, where
g_i ≡ Ŝ_i. The characteristic function of AS can easily be calculated as

ÂS(x) = E_S(exp ix^⊤ AS) = Ŝ(A^⊤ x) = g_1 ⊗···⊗ g_n(A^⊤ x)

for x ∈ R^n. Let B := (b_ij) = A^⊤. Since AS is also assumed to be independent,
f(x) := ÂS(x) = g_1 ⊗···⊗ g_n(Bx) is separated.

Now assume that A ≁ I. Using orthogonality of B = A^⊤, there exist indices k ≠ l and i ≠ j with
b_ki b_kj ≠ 0 and b_li b_lj ≠ 0. Then according to lemma 4, g_k and g_l satisfy the differential
equation 3.3. Together with corollary 2, this shows that both S_k and S_l are gaussian, which is a
contradiction to the assumption.

ii. Let S be an n-dimensional independent random vector with density p_S ∈ C^2(R^n, R) and no
gaussian component, and let A ∈ Gl(n). S is assumed to be independent, so its density factorizes
p_S ≡ g_1 ⊗···⊗ g_n. The density of AS is given by

p_AS(x) = |det A|^{−1} p_S(A^{−1} x) = |det A|^{−1} g_1 ⊗···⊗ g_n(A^{−1} x)

for x ∈ R^n. Let B := (b_ij) = A^{−1}. AS is also assumed to be independent, so

f(x) := |det A| p_AS(x) = g_1 ⊗···⊗ g_n(Bx)

is separated.

Assume A ≁ I. Then also B = A^{−1} ≁ I, so there exist indices l and i ≠ j with b_li b_lj ≠ 0. Hence,
it follows from lemma 4 that g_l satisfies the differential equation 3.3. But g_l is a density, so
according to corollary 1 the l-th component of S is gaussian, which is a contradiction.



4 BSS by Hessian Diagonalization<br />

In this section, we use the theory already set out to propose an algorithm for<br />

l<strong>in</strong>ear BSS, which can be easily extended to nonl<strong>in</strong>ear sett<strong>in</strong>gs as well. For<br />

this, we restrict ourselves to us<strong>in</strong>g C 2 -densities. A similar idea has already<br />

been proposed <strong>in</strong> L<strong>in</strong> (1998), but without deal<strong>in</strong>g with possibly degenerated<br />

eigenspaces <strong>in</strong> the Hessian. Equivalently, we could also use characteristic<br />

functions <strong>in</strong>stead of densities, which leads to a related algorithm (Yeredor,<br />

2000).<br />

If we assume that Cov(S) exists, we can use whiten<strong>in</strong>g as seen <strong>in</strong> the proof<br />

of theorem 2i (<strong>in</strong> this context, also called pr<strong>in</strong>cipal component analysis) to<br />

reduce the general BSS model, equation 3.2, to<br />

X = AS (4.1)<br />

with an <strong>in</strong>dependent n-dimensional random vector S with exist<strong>in</strong>g covariance<br />

I and an orthogonal matrix A. Then Cov(X) = I. We assume that S<br />

admits a C^2-density p_S. The density of X is then given by

p_X(x) = p_S(A^⊤ x)

for x ∈ R^n, because of the orthogonality of A. Hence,

p_S ≡ p_X ∘ A.

Note that the Hessian of the composition of a function f ∈ C^2(R^n, R) with an (n × n)-matrix A
can be calculated using the Hessian of f as follows:

H_{f∘A}(x) = A H_f(Ax) A^⊤.

Let s ∈ R^n with p_S(s) > 0. Then locally at s, we have

H_{ln p_S}(s) = H_{ln p_X ∘ A}(s) = A H_{ln p_X}(As) A^⊤.   (4.2)

p_S is assumed to be separated, so H_{ln p_S}(s) is diagonal, as seen in section 2.

Lemma 5. Let X := AS with an orthogonal matrix A and S, an <strong>in</strong>dependent<br />

random vector with C 2 -density, and at most one gaussian component. Then there<br />

exists an open set U ⊂ R n such that for all x ∈ U, pX(x) �= 0 and Hln p X (x) has n<br />

different eigenvalues.<br />

Proof. Assume not. Then there exists no x ∈ R n at all with pX(x) �= 0 and<br />

Hln p X (x) hav<strong>in</strong>g n different eigenvalues because otherwise, due to cont<strong>in</strong>uity,<br />

these conditions would also hold <strong>in</strong> an open neighborhood of x.



Using equation 4.2, the logarithmic Hessian of $p_S$ has, at every $s \in \mathbb{R}^n$ with $p_S(s) > 0$, at least two equal eigenvalues, say $\lambda(s) \in \mathbb{R}$. Since S is independent, $H_{\ln p_S}(s)$ is diagonal, so locally

$$\bigl(\ln p_{S_i}\bigr)''(s_i) = \bigl(\ln p_{S_j}\bigr)''(s_j) = \lambda(s)$$

for two indices $i \neq j$. Here we have used continuity of $s \mapsto H_{\ln p_S}(s)$, showing that the two eigenvalues locally lie in the same two dimensions i and j. This proves that $\lambda(s)$ is locally constant in directions i and j. So locally at points s with $p_S(s) > 0$, $S_i$ and $S_j$ are of the type $\exp P$, with P a polynomial of degree $\leq 2$. The same argument as in the proof of lemma 3 then shows that $p_{S_i}$ and $p_{S_j}$ have no zeros at all. Using the connectedness of $\mathbb{R}$ proves that $S_i$ and $S_j$ are globally of the type $\exp P$, hence gaussian (because $\int_{\mathbb{R}} p_{S_k} = 1$), which is a contradiction.

Hence, we can assume that we have found $x^{(0)} \in \mathbb{R}^n$ with $H_{\ln p_X}(x^{(0)})$ having n different eigenvalues (which is equivalent to saying that every eigenvalue has multiplicity one), because due to lemma 5 this is an open condition, which can be found algorithmically. In fact, most densities in practice turn out to have logarithmic Hessians with n different eigenvalues almost everywhere. In theory, however, U in lemma 5 cannot be assumed to be, for example, dense, nor can $\mathbb{R}^n \setminus U$ be assumed to have measure zero: if we choose $p_{S_1}$ to be a normalized gaussian and $p_{S_2}$ to be a normalized gaussian with a very localized small perturbation at zero only, then U cannot be larger than $(-\varepsilon, \varepsilon) \times \mathbb{R}$.

By diagonalization of $H_{\ln p_X}(x^{(0)})$ using eigenvalue decomposition (principal axis transformation), we can find the (orthogonal) mixing matrix A. Note that the eigenvalue decomposition is unique except for permutation and sign scaling, because every eigenspace (in which A is only unique up to orthogonal transformation) has dimension one. Arbitrary scaling indeterminacy does not occur because we have forced S and X to have unit variances. Using uniqueness of the eigenvalue decomposition and theorem 2, we have shown the following theorem:

Theorem 3 (BSS by Hessian calculation). Let $X = AS$ with an independent random vector S and an orthogonal matrix A. Let $x \in \mathbb{R}^n$ such that locally at x, X admits a $C^2$-density $p_X$ with $p_X(x) \neq 0$. Assume that $H_{\ln p_X}(x)$ has n different eigenvalues (see lemma 5). If

$$E\, H_{\ln p_X}(x)\, E^\top = D$$

is an eigenvalue decomposition of the Hessian of the logarithm of $p_X$ at x, that is, E orthogonal and D diagonal, then $E \sim A$, so $E^\top X$ is independent.



Furthermore, it follows from this theorem that linear BSS is a local problem, as proven already in Theis, Puntonet, and Lang (2003) using the restriction of a random vector.
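To make the procedure behind theorem 3 concrete, the following is a minimal NumPy/SciPy sketch of the local Hessian-diagonalization idea: whiten the data, estimate the Hessian of the logarithmic density at one suitable point, and use its eigendecomposition as the orthogonal demixing transform. The kernel density estimate, the bandwidth, the finite-difference Hessian, and the choice of evaluation point are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def hessian_diag_bss(X, eps=1e-3):
    """Sketch of BSS by local Hessian diagonalization (cf. theorem 3).

    X: (n, T) data matrix, one mixture per row.
    Returns estimated sources and a matrix E with E ~ A (for the whitened data).
    """
    # 1. Whitening (principal component analysis), so that Cov = I.
    Xc = X - X.mean(axis=1, keepdims=True)
    d, V = np.linalg.eigh(np.cov(Xc))
    Z = (V @ np.diag(d ** -0.5) @ V.T) @ Xc

    # 2. Kernel estimate of the (whitened) mixture density -- illustrative choice.
    kde = gaussian_kde(Z)
    log_p = lambda x: np.log(kde(x[:, None])[0])

    # 3. Numerical Hessian of log p_X at one point with p_X > 0 and
    #    (hopefully) n distinct eigenvalues; here simply the origin.
    n = Z.shape[0]
    x0 = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (log_p(x0 + ei + ej) - log_p(x0 + ei - ej)
                       - log_p(x0 - ei + ej) + log_p(x0 - ei - ej)) / (4 * eps ** 2)

    # 4. Eigenvalue decomposition E H E^T = D; then E ~ A and E^T Z is independent.
    _, Vh = np.linalg.eigh(H)
    E = Vh.T
    return E.T @ Z, E
```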

4.1 Example for Hessian Diagonalization BSS. In order to illustrate the algorithm of local Hessian diagonalization, we give a two-dimensional example. Let S be a random vector with densities

$$p_{S_1}(s_1) = \tfrac{1}{2}\,\chi_{[-1,1]}(s_1), \qquad p_{S_2}(s_2) = \tfrac{1}{\sqrt{2\pi}}\,\exp\!\bigl(-\tfrac{1}{2}s_2^2\bigr),$$

where $\chi_{[-1,1]}$ is one on $[-1,1]$ and zero everywhere else. The orthogonal mixing matrix A is chosen to be

$$A = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1\\ -1 & 1\end{pmatrix}.$$

The mixture density $p_X$ of $X := AS$ then is ($\det A = 1$)

$$p_X(x) = \frac{1}{2\sqrt{2\pi}}\,\chi_{[-1,1]}\!\Bigl(\tfrac{1}{\sqrt 2}(x_1 - x_2)\Bigr)\exp\!\Bigl(-\tfrac{1}{4}(x_1 + x_2)^2\Bigr)$$

for $x \in \mathbb{R}^2$. $p_X$ is positive and $C^2$ in a neighborhood around 0. Then

$$\partial_1 \ln p_X(x) = \partial_2 \ln p_X(x) = -\tfrac{1}{2}(x_1 + x_2),$$
$$\partial_1^2 \ln p_X(x) = \partial_2^2 \ln p_X(x) = \partial_1\partial_2 \ln p_X(x) = -\tfrac{1}{2}$$

for x with $|x| < \tfrac{1}{2}$, and the Hessian of the logarithmic density is

$$H_{\ln p_X}(x) = -\frac{1}{2}\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix},$$

independent of x in a neighborhood around 0. Diagonalization of $H_{\ln p_X}(0)$ yields

$$\begin{pmatrix}-1 & 0\\ 0 & 0\end{pmatrix},$$

and this equals $A H_{\ln p_X}(0) A^\top$, as stated in theorem 3.
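As a quick sanity check of this example, the snippet below evaluates $A H_{\ln p_X}(0) A^\top$ numerically for the matrices given above; it only verifies the arithmetic and is not part of the original paper.

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)   # orthogonal mixing matrix
H = -0.5 * np.ones((2, 2))                              # Hessian of ln p_X near 0

D = A @ H @ A.T
print(np.round(D, 10))   # diag(-1, 0), matching the diagonalization above
```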



4.2 Global Hessian Diagonalization Using Kernel-Based Density Approximation. In practice, it is usually not possible to approximate the density locally with sufficiently high accuracy, so a better approximation using the typically global information of X has to be found. In the following, we suggest using kernel-based density estimation to get an energy function with minima at the BSS solutions, together with a global Hessian diagonalization. The idea is to construct a measure for separatedness of the densities (hence independence) based on theorem 1. A possible measure could be the norm of the summed-up separators $\sum_{i<j} R_{ij}[\hat p_X]$.



Figure 2: Independent Laplacian density $p_S(s) = \tfrac{1}{2}\exp(-|x_1| - |x_2|)$: theoretic (left) and approximated (right) densities. For the approximation, 1000 samples and gaussian kernel approximation (see equation 4.3) with standard deviation 0.37 were used.

$R_{ij}[\hat p_X]$ can be calculated using lemma 2 (here $R_{ij}[\varphi(x - x^{(k)})] \equiv 0$) and equation 4.4:

$$R_{ij}[\hat p_X](x) = \frac{1}{\nu^2}\,R_{ij}\Bigl[\sum_{k=1}^{\nu}\varphi(x - x^{(k)})\Bigr](x)$$
$$= \frac{1}{\nu^2}\sum_{k\neq l}\Bigl(\varphi(x - x^{(k)})\,\partial_i\partial_j\varphi(x - x^{(l)}) - \partial_i\varphi(x - x^{(k)})\,\partial_j\varphi(x - x^{(l)})\Bigr)$$
$$= \frac{4\kappa^2}{\nu^2}\sum_{k\neq l}\varphi(x - x^{(k)})\,\varphi(x - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j - x_j^{(l)})$$
$$= \frac{4\kappa^2}{\nu^2}\sum_{k<l}\varphi(x - x^{(k)})\,\varphi(x - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j^{(k)} - x_j^{(l)});$$



hence,

$$E = (\sigma^2\nu)^{-4}\sum_m\sum_{i<j}\Bigl(\sum_{k<l}\varphi(x^{(m)} - x^{(k)})\,\varphi(x^{(m)} - x^{(l)})\,(x_i^{(k)} - x_i^{(l)})\,(x_j^{(k)} - x_j^{(l)})\Bigr)^{2}.$$



Note that E represents a new approximate measure of independence. Therefore, the linear BSS algorithm can now be readily generalized to nonlinear situations by finding an appropriate parameterization of the possibly nonlinear separating model.
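As a rough illustration of how such a kernel-based separator energy can be evaluated in practice, the sketch below sums squared separator values over the sample points and minimizes over candidate rotations of whitened two-dimensional data. The kernel width, the restriction to a single rotation angle, the omitted constant factors, and the grid search are illustrative assumptions rather than the paper's concrete implementation.

```python
import numpy as np

def separator_energy(Z, sigma=0.4):
    """Sum of squared separators R_ij (i=0, j=1) of a Gaussian-kernel density
    estimate of the 2-d data Z (shape (2, nu)), evaluated at the samples."""
    n, nu = Z.shape
    diffs = Z[:, None, :] - Z[:, :, None]                         # x^(b) - x^(a), per coordinate
    phi = np.exp(-(diffs ** 2).sum(axis=0) / (2 * sigma ** 2))    # unnormalized kernel values
    d = Z[:, :, None] - Z[:, None, :]                             # x^(k) - x^(l), per coordinate
    energy = 0.0
    for m in range(nu):
        w = phi[m][:, None] * phi[m][None, :]                     # phi(x^(m)-x^(k)) phi(x^(m)-x^(l))
        r = np.sum(np.triu(w * d[0] * d[1], k=1))                 # sum over pairs k < l
        energy += r ** 2
    return energy

def demix_by_rotation(X, angles=np.linspace(0, np.pi / 2, 90)):
    """Grid search over rotations of whitened 2-d mixtures, minimizing the energy."""
    best = min(angles, key=lambda a: separator_energy(
        np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]).T @ X))
    R = np.array([[np.cos(best), -np.sin(best)], [np.sin(best), np.cos(best)]])
    return R.T @ X, R
```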

The proposed algorithm basically performs a global diagonalization of the logarithmic Hessian after prewhitening. Interestingly, this is similar to traditional BSS algorithms based on joint diagonalization, such as JADE (Cardoso & Souloumiac, 1993) using cumulant matrices, or AMUSE (Tong, Liu, Soon, & Huang, 1991) and SOBI (Belouchrani, Meraim, Cardoso, & Moulines, 1997) employing time decorrelation. Instead of using a global energy function as proposed above, we could therefore also jointly diagonalize a given set of Hessians (respectively, separator matrices, as above; see also Yeredor, 2000). Another relation to previously proposed ICA algorithms lies in the kernel approximation technique. Gaussian or generalized gaussian kernels have already been used in the field of independent component analysis to model the source densities (Lee & Lewicki, 2000; Habl, Bauer, Puntonet, Rodriguez-Alvarez, & Lang, 2001), thus giving an estimate of the score function used in Bell-Sejnowski-type semiparametric algorithms (Bell & Sejnowski, 1995) or enabling direct separation using a maximum likelihood parameter estimation. Our algorithm also uses density approximation, but employs it for the mixture density, which can be problematic in higher dimensions. A different approach not involving density approximation is a direct sample-based Hessian estimation similar to Lin (1998).

5 Separability of Postnonlinear BSS

In this section, we show how to use the idea of Hessian diagonalization in order to give separability proofs in nonlinear situations, more precisely in the setting of postnonlinear BSS. After stating the postnonlinear BSS model and the general (to the knowledge of the author, not yet proven) separability theorem, we will prove postnonlinear separability in the case of random vectors with distributions that are somewhere locally constant and nonzero (e.g., uniform distributions). A possible proof of postnonlinear separability has been suggested by Taleb and Jutten (1999); however, that proof applies only to densities with at least one zero and furthermore contains an error rendering it applicable only to restricted situations.

Definition 3. A function $f : \mathbb{R}^n \to \mathbb{R}^n$ is called diagonal if each component $f_i(x)$ of $f(x)$ depends only on the variable $x_i$.

In this case, we often omit the other variables and write $f(x_1, \ldots, x_n) = (f_1(x_1), \ldots, f_n(x_n))$; so $f \equiv f_1 \times \cdots \times f_n$, where $\times$ denotes the Cartesian product.



Consider now the postnonlinear BSS model

$$X = f(AS), \qquad (5.1)$$

where again S is an independent random vector, $A \in \mathrm{Gl}(n)$, and f is a diagonal nonlinearity. We assume the components of f to be injective analytical functions with invertible Jacobian at every point (locally diffeomorphic).
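To fix ideas, here is a small sketch generating data from the postnonlinear model in equation 5.1; the particular sources, mixing matrix, and componentwise nonlinearities (chosen strictly increasing, hence injective with invertible Jacobian) are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# independent non-gaussian sources S (n = 2, T samples)
T = 1000
S = np.vstack([rng.uniform(-1, 1, T),          # uniform source
               rng.laplace(0, 1, T)])          # Laplacian source

# invertible mixing matrix A and diagonal nonlinearity f = f1 x f2
A = np.array([[1.0, 0.6],
              [-0.4, 1.0]])
f = [np.tanh,                                   # f1: injective, f1' > 0
     lambda t: t + 0.2 * t ** 3]                # f2: injective, f2' > 0

# postnonlinear mixtures X = f(AS), applied componentwise
AS = A @ S
X = np.vstack([f[i](AS[i]) for i in range(2)])
```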

Definition 4. An invertible matrix $A \in \mathrm{Gl}(n)$ is said to be mixing if A has at least two nonzero entries in each row.

Note that if A is mixing, then $A'$, $A^{-1}$, and $ALP$ for a scaling matrix L and a permutation matrix P are also mixing.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can be reconstructed only up to scaling and permutation. In the linear case, affine linear transformation is ignored. Here, of course, additional indeterminacies come into play because of translation: $f_i$ can be recovered only up to a constant. Also, if $L \in \mathrm{Gl}(n)$ is a scaling matrix, then

$$f(AS) = (f \circ L)\bigl((L^{-1}A)S\bigr),$$

so f and A can interchange scaling factors in each component. Another obvious indeterminacy could occur if A is not general enough. If, for example, $A = I$, then $f(S)$ is already independent, because independence is invariant under diagonal nonlinear transformation; so f cannot be found in this case. If we assume, however, that A is mixing, then we will show that except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist.

Theorem 4 (separability of postnonlinear BSS). Let $A, W \in \mathrm{Gl}(n)$ be mixing, $h : \mathbb{R}^n \to \mathbb{R}^n$ be a diagonal bijective function with analytical locally diffeomorphic components, and S be an independent random vector with at most one gaussian component and existing covariance. If $W(h(AS))$ is independent, then there exist a scaling matrix $L \in \mathrm{Gl}(n)$ and $p \in \mathbb{R}^n$ with $LA \sim W^{-1}$ and $h \equiv L + p$.

If analyticity of the components of h is not assumed, then $h \equiv L + p$ can only hold on $\{As \mid p_S(s) \neq 0\}$.

If $f \circ A$ is the mixing model, $W \circ g$ is the separating model. Putting the two together, we get the above mixing-separating model. Since A has to be assumed to be mixing, we can assume W to be mixing as well, because the inverse of a mixing matrix is again mixing. Furthermore, the mixing-separating model is assumed to be bijective (hence A and W invertible and h bijective), because otherwise trivial solutions such as $h \equiv c$ for a constant $c \in \mathbb{R}$ would also be solutions.



We will show the theorem in the case of S and X with components having somewhere locally constant nonzero $C^2$-densities. An alternative geometric idea of how to prove theorem 4 for bounded sources in two dimensions is mentioned in Babaie-Zadeh, Jutten, and Nayebi (2002) and extended in Theis and Gruber (forthcoming). Note that in our case, as well as in the above restrictive cases, the assumption that S has at most one gaussian component holds trivially.

Proof of Theorem 4 (with locally constant nonzero $C^2$-densities). Let $h = h_1 \times \cdots \times h_n$ with bijective $C^\infty$-functions $h_i : \mathbb{R} \to \mathbb{R}$. We only have to show that the $h_i'$ are constant. Then h is affine linear, say $h \equiv L + p$, with a diagonal matrix $L \in \mathrm{Gl}(n)$ and a vector $p \in \mathbb{R}^n$. Hence $W(h(AS)) = WLAS + Wp$, and then $WLAS$ is independent, so using linear separability, theorem 2i, $WLA \sim I$, therefore $LA \sim W^{-1}$.

Let $X := W(h(AS))$. The density of this transformed random vector is easily calculated from S:

$$p_X\bigl(Wh(As)\bigr) = |\det W|^{-1}\,|h_1'((As)_1)|^{-1}\cdots|h_n'((As)_n)|^{-1}\,|\det A|^{-1}\,p_S(s)$$

for $s \in \mathbb{R}^n$. By assumption, h has an invertible Jacobian at every point, so the $h_i'$ are either positive or negative; without loss of generality, $h_i' > 0$. Furthermore, $p_X$ is independent, so we can write

$$p_X \equiv g_1 \otimes \cdots \otimes g_n.$$

For fixed $s^0 \in \mathbb{R}^n$ with $p_S(s^0) > 0$, there exists an open neighborhood $U \subset \mathbb{R}^n$ of $s^0$ with $p_S|_U > 0$ and $p_S|_U \in C^2(U, \mathbb{R})$. If we define $f(s) := \ln\bigl(|\det W|^{-1}|\det A|^{-1}\,p_S(s)\bigr)$ for $s \in U$, then

$$f(s) = \ln\bigl(h_1'((As)_1)\cdots h_n'((As)_n)\,g_1((Wh(As))_1)\cdots g_n((Wh(As))_n)\bigr) = \sum_{k=1}^{n}\Bigl[\ln h_k'((As)_k) + \zeta_k\bigl((Wh(As))_k\bigr)\Bigr]$$

for $s \in U$, where $\zeta_k := \ln g_k$ locally around $s^0$. $p_S$ is separated, so

$$\partial_i \partial_j f \equiv 0 \qquad (5.2)$$

for $i < j$. Denote $A =: (a_{ij})$ and $W =: (w_{ij})$. The first derivative and then the nondiagonal entries in the Hessian of f can be calculated as follows ($i < j$):

$$\partial_i f(s) = \sum_{k=1}^{n}\left[ a_{ki}\,\frac{h_k''}{h_k'}\bigl((As)_k\bigr) + \zeta_k'\bigl((Wh(As))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'\bigl((As)_l\bigr)\right]$$

$$\partial_i\partial_j f(s) = \sum_{k=1}^{n}\left[ a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}\bigl((As)_k\bigr) + \zeta_k''\bigl((Wh(As))_k\bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'\bigl((As)_l\bigr)\Bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{lj}\,h_l'\bigl((As)_l\bigr)\Bigr) + \zeta_k'\bigl((Wh(As))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}a_{lj}\,h_l''\bigl((As)_l\bigr)\right].$$

Substituting $y := As$ and using equation 5.2, we finally get the following differential equation for the $h_k$:

$$0 = \sum_{k=1}^{n}\left[ a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}(y_k) + \zeta_k''\bigl((Wh(y))_k\bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{li}\,h_l'(y_l)\Bigr)\Bigl(\sum_{l=1}^{n} w_{kl}a_{lj}\,h_l'(y_l)\Bigr) + \zeta_k'\bigl((Wh(y))_k\bigr)\sum_{l=1}^{n} w_{kl}a_{li}a_{lj}\,h_l''(y_l)\right] \qquad (5.3)$$

for $y \in V := A(U)$.

We will restrict ourselves to the simple case mentioned above in order to solve this equation. We assume that the $h_k$ are analytic and that there exists $x^0 \in \mathbb{R}^n$ where the demixed densities $g_k$ are locally constant and nonzero. Consider the above calculation around $s^0 = A^{-1}(h^{-1}(W^{-1}x^0))$. Choose the open set V such that the $g_k$ are locally constant and nonzero on $W(h(V))$. Then so are the $\zeta_k = \ln g_k$, and therefore

$$0 = \sum_{k=1}^{n} a_{ki}a_{kj}\,\frac{h_k'h_k''' - h_k''^2}{h_k'^2}(y_k)$$

for $y \in V$. Hence, there exist open intervals $I_k \subset \mathbb{R}$ and constants $b_k \in \mathbb{R}$ with

$$a_{ki}a_{kj}\bigl(h_k'h_k''' - h_k''^2\bigr) \equiv b_k\, h_k'^2$$

on $I_k$ (here, $b_k = -\sum_{l\neq k} a_{li}a_{lj}\,\frac{h_l'h_l''' - h_l''^2}{h_l'^2}(y_l)$ for some, and then any, $y \in V$). By assumption, A is mixing. Hence, for fixed k, there exist $i \neq j$ with $a_{ki}a_{kj} \neq 0$. If we set $c_k := b_k/(a_{ki}a_{kj})$, then

$$c_k h_k'^2 - h_k'h_k''' + h_k''^2 \equiv 0 \qquad (5.4)$$



on $I_k$. $h_k$ was chosen to be analytic, and equation 5.4 holds on the open set $I_k$, so it holds on all of $\mathbb{R}$. Applying lemma 3 then shows that either $h_k' \equiv 0$ or

$$h_k'(x) = \pm\exp\Bigl(\frac{c_k}{2}x^2 + d_k x + e_k\Bigr), \qquad x \in \mathbb{R}, \qquad (5.5)$$

with constants $d_k, e_k \in \mathbb{R}$. By assumption, $h_k$ is bijective, so $h_k' \not\equiv 0$.

Applying the same arguments as above to the inverse system

$$S = A^{-1}\bigl(h^{-1}(W^{-1}X)\bigr)$$

and using the fact that $p_S$ is also somewhere locally constant and nonzero shows that equation 5.5 also holds for $(h_k^{-1})'$ with other constants. But if both the derivatives of $h_k$ and $h_k^{-1}$ are of this exponential type, then $c_k = d_k = 0$, and therefore $h_k$ is affine linear for all $k = 1, \ldots, n$, which completes the proof of postnonlinear separability in this special case.

Note that in the above proof, local positiveness of the densities was assumed in order to use the equivalence of local separability with the diagonality of the Hessian of the logarithmic density. Hence, these results can be generalized using theorem 1 in a similar fashion as we did in the linear case with theorem 2. In particular, we have proven postnonlinear separability also for uniformly distributed sources.

6 Conclusion

We have shown how to derive the separability of linear BSS using diagonalization of the Hessian of the logarithmic density, respectively characteristic function. This induces separated, that is, independent, sources. The idea of Hessian diagonalization is put into a new algorithm for performing linear independent component analysis, which is shown to be a local problem. In practice, however, due to the fact that the densities cannot be approximated locally very well, we also propose a diagonalization algorithm that takes the global structure into account. In order to show the use of this framework of separated functions, we finish with a proof of postnonlinear separability in a special case.

In future work, more general separability results for postnonlinear BSS could be constructed by finding more general solutions of the differential equation 5.3. Algorithmic improvements could be made by using other density approximation methods such as mixtures of gaussians, or by approximating the Hessian itself using the cumulative density and discrete approximations of the differential. Finally, the diagonalization algorithm can easily be extended to nonlinear situations by finding appropriate model parameterizations; instead of minimizing the mutual information, we minimize the absolute value of the off-diagonal terms of the logarithmic Hessian.



The algorithm has been specified using only an energy function; gradient and fixed-point algorithms can be derived in the usual manner.

Separability in nonlinear situations has turned out to be a hard problem, ill-posed in the most general case (Hyvärinen & Pajunen, 1999), and not many nontrivial results exist for restricted models (Hyvärinen & Pajunen, 1999; Babaie-Zadeh et al., 2002), all only two-dimensional. We believe that this is due to the fact that the rather nontrivial proof of the Darmois-Skitovitch theorem is not at all easily generalized to more general settings (Kagan, 1986). By introducing separated functions, we are able to give a much easier proof for linear separability and also provide new results in nonlinear settings. We hope that these ideas will be used to show separability in other situations as well.

Acknowledgments

I thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. I also thank Peter Gruber, Wolfgang Hackenbroch, and Michaela Theis for suggestions and remarks on various aspects of the separability proof. The work described here was supported by the DFG in the grant "Nonlinearity and Nonequilibrium in Condensed Matter" and the BMBF in the ModKog project.

References

Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. of EUSIPCO '02 (Vol. 2, pp. 11-14). Toulouse, France.

Bauer, H. (1996). Probability theory. Berlin: Walter de Gruyter.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434-444.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non gaussian signals. IEE Proceedings-F, 140(6), 362-370.

Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314.

Darmois, G. (1953). Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21, 2-8.

Eriksson, J., & Koivunen, V. (2003). Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003 (pp. 23-27). Nara, Japan.

Habl, M., Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2001). Analyzing biomedical signals with probabilistic ICA and kernel-based source density estimation. In M. Sebaaly (Ed.), Information science innovations (Proc. ISI'2001) (pp. 219-225). Alberta, Canada: ICSC Academic Press.

Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206-211). New York: American Institute of Physics.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483-1492.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429-439.

Kagan, A. (1986). New classes of dependent random variables and a generalization of the Darmois-Skitovitch theorem to several forms. Theory Probab. Appl., 33(2), 286-295.

Lee, T., & Lewicki, M. (2000). The generalized gaussian mixture model using ICA. In Proc. of ICA 2000 (pp. 239-244). Helsinki, Finland.

Lin, J. (1998). Factorizing multivariate function classes. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 563-569). Cambridge, MA: MIT Press.

Skitovitch, V. (1953). On a property of the normal distribution. DAN SSSR, 89, 217-219.

Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Trans. on Signal Processing, 47, 2807-2820.

Theis, F., & Gruber, P. (forthcoming). Separability of analytic postnonlinear blind source separation with bounded sources. In Proc. of ESANN 2004. Evere, Belgium: d-side.

Theis, F., Jung, A., Puntonet, C., & Lang, E. (2002). Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 15, 1-21.

Theis, F., Puntonet, C., & Lang, E. (2003). Nonlinear geometric ICA. In Proc. of ICA 2003 (pp. 275-280). Nara, Japan.

Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499-509.

Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80(5), 897-902.

Received June 27, 2003; accepted March 8, 2004.




Chapter 3

Signal Processing 84(5):951-956, 2004

Paper F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951-956, 2004

Reference (Theis, 2004b)

Summary in section 1.2.1



Signal Processing 84 (2004) 951-956
www.elsevier.com/locate/sigpro

Fast communication

Uniqueness of complex and multidimensional independent component analysis

F.J. Theis

Institute of Biophysics, University of Regensburg, Universitaetsstr. 31, D-93040 Regensburg, Germany
E-mail: fabian.theis@mathematik.uni-regensburg.de, fabian@theis.name

Received 25 September 2003

Abstract

A complex version of the Darmois-Skitovitch theorem is proved using a multivariate extension of the latter by Ghurye and Olkin. This makes it possible to calculate the indeterminacies of independent component analysis (ICA) with complex variables and coefficients. Furthermore, the multivariate Darmois-Skitovitch theorem is used to show uniqueness of multidimensional ICA, where only groups of sources are mutually independent.

© 2004 Elsevier B.V. All rights reserved.

PACS: 84.40.Ua; 89.70.+c; 07.05.Kf

Keywords: Complex ICA; Multidimensional ICA; Separability

1. Introduction

The task of independent component analysis (ICA) is to transform a given random vector into a statistically independent one. ICA can be applied to blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. Good textbook-level introductions to ICA are given in [4,11].

BSS is said to be separable if the mixing structure can be blindly recovered except for obvious indeterminacies. In [5], Comon shows separability of linear real BSS using the Skitovitch-Darmois theorem. He notes that his proof for the real case can also be extended to the complex setting. However, a complex version of the Skitovitch-Darmois theorem is needed, which, to the knowledge of the author, has not been shown in the literature yet. In this work we will provide such a theorem, which is then used to prove separability of complex BSS.

Separability and uniqueness of BSS are already included in the definition of what is commonly called a 'contrast' [5]. Hence they have been widely studied, but in the setting of complex BSS, to the knowledge of the author, separability has only been shown under the additional assumption of non-zero cumulants of the sources [5,13].

The paper is organized as follows: In the next section, basic terms and notations are introduced. Section 3 states the well-known Skitovitch-Darmois theorem and a multivariate extension thereof; furthermore, a complex version of it is derived. The following Section 4 then introduces the complex linear blind source separation model and shows its separability. Section 5 finally deals with separability of multidimensional ICA (group ICA).

2. Notation

Let $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$ be either the real or the complex numbers. For $m, n \in \mathbb{N}$, let $\mathrm{Mat}(m \times n; \mathbb{K})$ be the $\mathbb{K}$-vector space of real, respectively complex, $m \times n$ matrices, and $\mathrm{Gl}(n; \mathbb{K}) := \{W \in \mathrm{Mat}(n \times n; \mathbb{K}) \mid \det(W) \neq 0\}$ the general linear group of $\mathbb{K}^n$. $I \in \mathrm{Gl}(n; \mathbb{K})$ denotes the unit matrix. For a complex number $z \in \mathbb{C}$ we write $\mathrm{Re}(z)$ for its real and $\mathrm{Im}(z)$ for its imaginary part.

An invertible matrix $L \in \mathrm{Gl}(n; \mathbb{K})$ is said to be a scaling matrix if it is diagonal. We say two matrices $B, C \in \mathrm{Mat}(m \times n; \mathbb{K})$ are ($\mathbb{K}$-)equivalent, $B \sim C$, if C can be written as $C = BPL$ with a scaling matrix $L \in \mathrm{Gl}(n; \mathbb{K})$ and an invertible matrix with unit vectors in each row (permutation matrix) $P \in \mathrm{Gl}(n; \mathbb{K})$. Note that $PL = L'P$ for some scaling matrix $L' \in \mathrm{Gl}(n; \mathbb{K})$, so the order of the permutation and the scaling matrix does not play a role for equivalence. Furthermore, if $B \in \mathrm{Gl}(n; \mathbb{K})$ with $B \sim I$, then also $B^{-1} \sim I$, and more generally, if $BC \sim A$, then $C \sim B^{-1}A$. So two matrices are equivalent if and only if they differ by right-multiplication by a matrix with exactly one non-zero entry in each row and each column. If $\mathbb{K} = \mathbb{R}$, the two matrices are the same except for permutation, sign and scaling; if $\mathbb{K} = \mathbb{C}$, they are the same except for permutation, sign, scaling and phase-shift.
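The equivalence relation above is easy to test numerically: $B \sim C$ holds exactly when $B^{-1}C$ has exactly one non-zero entry in each row and in each column. The following small check is an illustrative utility only, not part of the paper.

```python
import numpy as np

def equivalent(B, C, tol=1e-10):
    """Test whether B ~ C, i.e. C = B P L for a permutation P and a scaling L."""
    M = np.linalg.solve(B, C)                    # M = B^{-1} C
    nonzero = np.abs(M) > tol
    return (nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all()
```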

3. A multivariate version of the Skitovitch-Darmois theorem

The original Skitovitch-Darmois theorem shows a non-trivial connection between Gaussian distributions and stochastic independence. More precisely, it states that if two linear combinations of non-Gaussian independent random variables are again independent, then each original random variable can appear in only one of the two linear combinations. It has been proved independently by Darmois [6] and Skitovitch [14]; in a more accessible form, the proof can be found in [12]. Separability of linear BSS as shown by Comon [5] is a corollary of this theorem, although recently separability has also been shown without it [17].

Theorem 3.1 (Skitovitch-Darmois theorem). Let $L_1 = \sum_{i=1}^{n} \alpha_i X_i$ and $L_2 = \sum_{i=1}^{n} \beta_i X_i$ with $X_1, \ldots, X_n$ independent real random variables and $\alpha_j, \beta_j \in \mathbb{R}$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are independent, then all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian.

The converse is true if we assume that $\sum_{j=1}^{n} \alpha_j\beta_j = 0$: If all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian and $\sum_{j=1}^{n} \alpha_j\beta_j = 0$, then $L_1$ and $L_2$ are independent. This follows because then $L_1$ and $L_2$ are uncorrelated, and with all common variables being normal, they are then also independent.

Theorem 3.2 (Multivariate S-D theorem). Let $L_1 = \sum_{i=1}^{n} A_i X_i$ and $L_2 = \sum_{i=1}^{n} B_i X_i$ with mutually independent k-dimensional random vectors $X_j$ and invertible matrices $A_j, B_j \in \mathrm{Gl}(k; \mathbb{R})$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are mutually independent, then all $X_j$ are Gaussian.

Here Gaussian (or jointly Gaussian) means that each component of the random vector is a Gaussian. Obviously, those Gaussians can have non-trivial correlations. This extension of Theorem 3.1 to random vectors has first been noted by Skitovitch [15] and shown by Ghurye and Olkin [8]. Zinger gave a different proof for it in his Ph.D. thesis [18].

We need the following corollary:

Corollary 3.3. Let $L_1 = \sum_{i=1}^{n} A_i X_i$ and $L_2 = \sum_{i=1}^{n} B_i X_i$ with mutually independent k-dimensional random vectors $X_j$ and matrices $A_j, B_j$ either zero or in $\mathrm{Gl}(k; \mathbb{R})$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are mutually independent, then all $X_j$ with $A_jB_j \neq 0$ are Gaussian.

Proof. We want to throw out all $X_j$ with $A_jB_j = 0$; then Theorem 3.2 can be applied. Let j be given with $A_jB_j = 0$. Without loss of generality assume that $B_j = 0$. If also $A_j = 0$, then we can simply leave out $X_j$, since it appears in neither $L_1$ nor $L_2$. So assume $A_j \neq 0$. By assumption, $X_j$ and $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$ are mutually independent, and then so are $X_j$ and $L_2$, because $B_j = 0$. Hence both $-A_jX_j$, $L_2$ and $L_1$, $L_2$ are mutually independent, so also the two linear combinations $L_1 - A_jX_j$ and $L_2$ of the $n-1$ variables $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$ are mutually independent. After successive application of this recursion, we can assume that each $A_j$ and $B_j$ is invertible. Applying Theorem 3.2 shows the corollary.

From this, a complex version of the Skitovitch-Darmois theorem can easily be derived:

Corollary 3.4 (Complex S-D theorem). Let $L_1 = \sum_{i=1}^{n} \alpha_i X_i$ and $L_2 = \sum_{i=1}^{n} \beta_i X_i$ with $X_1, \ldots, X_n$ independent complex random variables and $\alpha_j, \beta_j \in \mathbb{C}$ for $j = 1, \ldots, n$. If $L_1$ and $L_2$ are independent, then all $X_j$ with $\alpha_j\beta_j \neq 0$ are Gaussian.

Here, a complex random variable is said to be Gaussian if both its real and imaginary part are Gaussians.

Proof. We can interpret the n independent complex random variables $X_i$ as n two-dimensional real random vectors that are mutually independent. Multiplication by the complex number $\alpha_j$ is either ($\alpha_j \neq 0$) a multiplication by the real invertible matrix

$$\begin{pmatrix} \mathrm{Re}(\alpha_j) & -\mathrm{Im}(\alpha_j) \\ \mathrm{Im}(\alpha_j) & \mathrm{Re}(\alpha_j) \end{pmatrix}$$

or ($\alpha_j = 0$) a multiplication by the zero matrix, and similarly for $\beta_j$. Applying Corollary 3.3 finishes the proof.
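The real 2x2 embedding used in this proof is the standard matrix representation of complex multiplication; the helper below, an illustrative utility only, builds it and checks that it reproduces the complex product.

```python
import numpy as np

def real_embedding(alpha: complex) -> np.ndarray:
    """2x2 real matrix representing multiplication by the complex number alpha."""
    return np.array([[alpha.real, -alpha.imag],
                     [alpha.imag,  alpha.real]])

# quick consistency check: the embedding acts on (Re x, Im x) like alpha * x
alpha, x = 2.0 - 1.0j, 0.5 + 3.0j
v = real_embedding(alpha) @ np.array([x.real, x.imag])
assert np.allclose(v, [(alpha * x).real, (alpha * x).imag])
```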

4. Indeterminacies of complex ICA

Given a complex n-dimensional random vector X, a matrix $W \in \mathrm{Gl}(n; \mathbb{C})$ is called a (complex) ICA of X if WX is independent (as a complex random vector). We will show that W and V are complex ICAs of X if and only if $W^{-1} \sim V^{-1}$, that is, if they differ by right multiplication by a complex scaling and permutation matrix. This is equivalent to calculating the indeterminacies of the complex BSS model:

Consider the noiseless complex linear instantaneous blind source separation (BSS) model with as many sources as sensors,

$$X = AS. \qquad (1)$$

Here S is an independent complex-valued n-dimensional random vector and $A \in \mathrm{Gl}(n; \mathbb{C})$ an invertible complex matrix.

The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to equivalence, because for a scaling matrix L and a permutation matrix P,

$$X = ALP\,P^{-1}L^{-1}S$$

and $P^{-1}L^{-1}S$ is also independent. We will show that under mild assumptions on S there are no further indeterminacies of complex BSS.

Various algorithms for solving the complex BSS problem have been proposed [1,2,7,13,16]. We want to note that many cases where complex BSS is applied can in fact be reduced to using real BSS algorithms. This is the case if either the sources or the mixing matrix are real. The latter, for example, occurs after Fourier transformation of signals with time structure.

If the sources are real, then the above complex model can be split up into two separate real BSS problems:

$$\mathrm{Re}(X) = \mathrm{Re}(A)\,S, \qquad \mathrm{Im}(X) = \mathrm{Im}(A)\,S.$$

Solving both of these real BSS equations yields $A = \mathrm{Re}(A) + i\,\mathrm{Im}(A)$. Of course, $\mathrm{Re}(A)$ and $\mathrm{Im}(A)$ can only be found up to scaling and permutation. By comparing the two recovered source random vectors (using, for example, the mutual information of one component of each vector), we can however assume that the permutation and then also the scaling indeterminacy of both recovered matrices is the same, which allows the algorithm to correctly put A back together. Similarly, separability of this special complex ICA problem can also be derived from the well-known separability results in the real case.

If the mixing matrix is known to be real, then again splitting up Eq. (1) into real and imaginary parts yields

$$\mathrm{Re}(X) = A\,\mathrm{Re}(S), \qquad \mathrm{Im}(X) = A\,\mathrm{Im}(S).$$

A can be found from either equation. If both real and imaginary samples are to be used in order to increase precision, they can simply be concatenated in order to generate a twice as large sample set mixed by the same mixing matrix A. In terms of random vectors, this means working in two disjoint copies of the original probability space. Again separability follows.
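The two reductions above are straightforward to implement on top of any real-valued ICA routine. The sketch below illustrates the real-mixing-matrix case by concatenating real and imaginary samples; FastICA from scikit-learn stands in for the real BSS algorithm and is an illustrative choice, not the paper's.

```python
import numpy as np
from sklearn.decomposition import FastICA

def complex_bss_real_mixing(X):
    """Complex BSS when the mixing matrix A is real: X is an (n, T) complex array.

    Real and imaginary samples are mixed by the same real A, so they are
    concatenated and an ordinary real ICA is run on the enlarged sample set.
    """
    n, T = X.shape
    X_big = np.hstack([X.real, X.imag])                  # (n, 2T) real samples
    ica = FastICA(n_components=n, random_state=0)
    S_big = ica.fit_transform(X_big.T).T                 # recovered real sources, (n, 2T)
    A_est = ica.mixing_                                  # estimated real mixing matrix
    # reassemble complex sources from the two halves
    S_est = S_big[:, :T] + 1j * S_big[:, T:]
    return S_est, A_est
```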



Theorem 4.1 (Separability of complex linear BSS). Let $A \in \mathrm{Gl}(n; \mathbb{C})$ and S a complex independent random vector. Assume one of the following:

i. S has at most one Gaussian component and the (complex) covariance of S exists.
ii. S has no Gaussian component.

If AS is again independent,¹ then A is equivalent to the identity.

¹ Indeed, we only need the components of AS to be pairwise independent.

Here, the complex covariance of S is defined by

$$\mathrm{Cov}(S) = \mathrm{E}\bigl((S - \mathrm{E}(S))(S - \mathrm{E}(S))^*\bigr),$$

where the asterisk denotes the transposed and complex-conjugated vector.

Comon has shown this for the real case [5]; for the complex case a complex version of the Darmois-Skitovitch theorem is needed, as provided in Section 3. Theorem 4.1 indeed proves separability of the complex linear BSS model, because if $X = AS$ and W is a demixing matrix such that WX is independent, then $WA \sim I$, so $W^{-1} \sim A$ as desired. And it also calculates the indeterminacies of complex ICA, because if W and V are ICAs of X, then both $VX$ and $WV^{-1}VX$ are independent, so $WV^{-1} \sim I$ and hence $W \sim V$.

Proof. Denote $X := AS$.

First assume case ii: S has no Gaussian component at all. Then $A = (a_{ij})$ is equivalent to the identity, because if not, there exist $i_1 \neq i_2$ and j with $a_{i_1 j}a_{i_2 j} \neq 0$. Applying Corollary 3.4 to $X_{i_1}$ and $X_{i_2}$ then shows that $S_j$ is Gaussian, which is a contradiction to assumption ii.

Now assume that the covariance exists and that S has at most one Gaussian component. First we will show, using complex decorrelation, that we can assume A to be unitary. Without loss of generality assume that all random vectors are centered. By assumption Cov(X) is diagonal, so let $D_1$ be diagonal invertible with $\mathrm{Cov}(X) = D_1^2$. Note that $D_1$ is real. Similarly let $D_2$ be diagonal invertible with $\mathrm{Cov}(S) = D_2^2$. Set $Y := D_1^{-1}X$ and $T := D_2^{-1}S$, that is, normalize X and S to covariance I. Then

$$Y = D_1^{-1}X = D_1^{-1}AS = D_1^{-1}AD_2\,T,$$

so T, $D_1^{-1}AD_2$ and Y satisfy the assumption, and $D_1^{-1}AD_2$ is unitary because

$$I = \mathrm{Cov}(Y) = \mathrm{E}(YY^*) = \mathrm{E}\bigl(D_1^{-1}AD_2\,TT^*\,D_2A^*D_1^{-1}\bigr) = \bigl(D_1^{-1}AD_2\bigr)\bigl(D_1^{-1}AD_2\bigr)^*.$$

If we assume $A \not\sim I$, then, using the fact that A is unitary, there exist indices $i_1 \neq i_2$ and $j_1 \neq j_2$ with $a_{i_*j_*} \neq 0$. By assumption

$$X_{i_1} = a_{i_1 j_1}S_{j_1} + a_{i_1 j_2}S_{j_2} + \cdots, \qquad X_{i_2} = a_{i_2 j_1}S_{j_1} + a_{i_2 j_2}S_{j_2} + \cdots$$

are independent, and in both $X_{i_1}$ and $X_{i_2}$ the variables $S_{j_1}$ and $S_{j_2}$ appear non-trivially, so by the complex Skitovitch-Darmois Theorem 3.4, $S_{j_1}$ and $S_{j_2}$ are Gaussian, which is a contradiction to the fact that at most one source is Gaussian.

5. Indeterminacies of multidimensional ICA

In this section, we want to analyze the indeterminacies of so-called multidimensional independent component analysis. The idea of this generalization of ICA is that we do not require full independence of the transform Y, but only mutual independence of certain tuples $Y_{i_1}, \ldots, Y_{i_k}$. If the size of all tuples is restricted to one, this reduces to original ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length (which then necessarily has to divide the total dimension).

Multidimensional ICA has first been introduced by Cardoso [3] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA, which they called independent subspace analysis [9]; there, the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation. A different extension of ICA is given by topographic ICA [10], where dependencies between all components are assumed. A special case of multidimensional ICA is complex ICA as presented in the preceding section; here dependence is allowed between real-valued couples of random variables.

Let $k, n \in \mathbb{N}$ such that k divides n. We call an n-dimensional random vector Y k-independent if the k-dimensional random vectors

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_k \end{pmatrix}, \; \ldots, \; \begin{pmatrix} Y_{n-k+1} \\ \vdots \\ Y_n \end{pmatrix}$$

are mutually independent. A matrix $W \in \mathrm{Gl}(n; \mathbb{R})$ is called a k-multidimensional ICA of an n-dimensional random vector X if WX is k-independent. If $k = 1$, this is the same as ordinary ICA.

Obvious indeterminacies are, similar to ordinary ICA, invertible transforms in $\mathrm{Gl}(k; \mathbb{R})$ within each tuple, as well as the fact that the order of the independent k-tuples is not fixed. So, define for $r, s = 1, \ldots, n/k$ the (r, s) sub-k-matrix of $W = (w_{ij})$ to be the $k \times k$ submatrix

$$(w_{ij})_{\,i = rk, \ldots, rk+k-1;\; j = sk, \ldots, sk+k-1},$$

that is, the $k \times k$ submatrix of W starting at position $(rk, sk)$. A matrix $L \in \mathrm{Gl}(n; \mathbb{R})$ is said to be a k-scaling and permutation matrix if for each $r = 1, \ldots, n/k$ there exists precisely one s such that the (r, s) sub-k-matrix of L is nonzero and lies in $\mathrm{Gl}(k; \mathbb{R})$, and if for each $s = 1, \ldots, n/k$ there exists only one r with the (r, s) sub-k-matrix satisfying the same condition. Hence, if Y is k-independent, then LY is also k-independent.

Two matrices A and B are said to be k-equivalent, $A \sim_k B$, if there exists such a k-scaling and permutation matrix L with $A = BL$. As stated above, given two matrices W and V with $W^{-1} \sim_k V^{-1}$ such that one of them is a k-multidimensional ICA of a given random vector, then so is the other. We will show that there are no more indeterminacies of multidimensional ICA.

As usual, multidimensional ICA can solve the multidimensional BSS problem

$$X = AS,$$

where $A \in \mathrm{Gl}(n; \mathbb{R})$ and S is a k-independent n-dimensional random vector. Finding the indeterminacies of multidimensional ICA then shows that A can be found except for k-equivalence (separability), because if $X = AS$ and W is a demixing matrix such that WX is k-independent, then $WA \sim_k I$, so $W^{-1} \sim_k A$ as desired.

However, for the proof we need one more condition on A: We call A k-admissible if for each $r, s = 1, \ldots, n/k$ the (r, s) sub-k-matrix of A is either invertible or zero. Note that this is not a strong restriction: if we randomly choose A with coefficients drawn from a continuous distribution, then with probability one we get a k-admissible matrix, because the non-k-admissible matrices lie in a submanifold of $\mathbb{R}^{n^2}$ of dimension smaller than $n^2$.
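For concreteness, here is a small utility (an illustrative sketch, not from the paper) that extracts the (r, s) sub-k-matrices of a matrix, indexed from zero, and tests k-admissibility.

```python
import numpy as np

def sub_k_matrix(W, r, s, k):
    """(r, s) sub-k-matrix of W (0-based block indices)."""
    return W[r * k:(r + 1) * k, s * k:(s + 1) * k]

def is_k_admissible(A, k, tol=1e-10):
    """A is k-admissible if every k x k block is either invertible or zero."""
    n = A.shape[0]
    for r in range(n // k):
        for s in range(n // k):
            block = sub_k_matrix(A, r, s, k)
            if np.abs(block).max() > tol and abs(np.linalg.det(block)) < tol:
                return False        # nonzero but singular block
    return True
```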

Theorem 5.1 (Separability of multidimensional BSS). Let $A \in \mathrm{Gl}(n; \mathbb{R})$ and S a k-independent n-dimensional random vector having no Gaussian k-tuple $(S_{rk}, \ldots, S_{rk+k-1})^\top$. Assume that A is k-admissible. If AS is again k-independent, then A is k-equivalent to the identity.

For the case $k = 1$ this is linear BSS separability, because every matrix is 1-admissible.

Proof. Denote $X := AS$. Assume that $A \not\sim_k I$. Then there exist indices $r_1, r_2$ and s such that the $(r_1, s)$ and the $(r_2, s)$ sub-k-matrices of A are non-zero (hence in $\mathrm{Gl}(k; \mathbb{R})$ by k-admissibility). Applying Corollary 3.3 to the two random vectors $(X_{r_1k}, \ldots, X_{r_1k+k-1})^\top$ and $(X_{r_2k}, \ldots, X_{r_2k+k-1})^\top$ then shows that $(S_{sk}, \ldots, S_{sk+k-1})^\top$ is Gaussian, which is a contradiction.

Note that we could have used whitening to assume that A is orthogonal; however, there does not seem to be a direct way to exploit this in order to allow one fully Gaussian k-tuple, contrary to the complex ICA case (see Theorem 4.1).

6. Conclusion

Uniqueness and separability results play a central role in solving BSS problems, since they allow algorithms to apply ICA in order to uniquely (except for scaling and permutation) find the original mixing matrices. We have used a multidimensional version of the Skitovitch-Darmois theorem in order to calculate the indeterminacies of complex and of multidimensional ICA. In the multidimensional ICA case, an additional restriction was needed in the proof, which could be relaxed if Corollary 3.3 can be extended to allow matrices of arbitrary rank.

Acknowledgements

This research was supported by grants from the DFG (graduate college 'Nonlinear Dynamics') and the BMBF (project 'ModKog').

References

[1] A. Back, A. Tsoi, Blind deconvolution of signals using a complex recurrent network, in: Neural Networks for Signal Processing 4, Proceedings of the 1994 IEEE Workshop, 1994, pp. 565–574.
[2] E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for independent component analysis of complex-valued signals, Internat. J. Neural Systems 10 (1) (2000) 1–8.
[3] J. Cardoso, Multidimensional independent component analysis, in: Proceedings of ICASSP '98, Seattle, WA, May 12–15, 1998.
[4] A. Cichocki, S. Amari, Adaptive Blind Signal and Image Processing, Wiley, New York, 2002.
[5] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[6] G. Darmois, Analyse générale des liaisons stochastiques, Rev. Inst. Internat. Statist. 21 (1953) 2–8.
[7] S. Fiori, Blind separation of circularly distributed sources by neural extended APEX algorithm, Neurocomputing 34 (2000) 239–252.
[8] S. Ghurye, I. Olkin, A characterization of the multivariate normal distribution, Ann. Math. Statist. 33 (1962) 533–541.
[9] A. Hyvärinen, P. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neural Computation 12 (7) (2000) 1705–1720.
[10] A. Hyvärinen, P. Hoyer, M. Inki, Topographic independent component analysis, Neural Computation 13 (7) (2001) 1525–1558.
[11] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Wiley, New York, 2001.
[12] A. Kagan, Y. Linnik, C. Rao, Characterization Problems in Mathematical Statistics, Wiley, New York, 1973.
[13] E. Moreau, O. Macchi, Higher order contrasts for self-adaptive source separation, Internat. J. Adaptive Control Signal Process. 10 (1) (1996) 19–46.
[14] V. Skitovitch, On a property of the normal distribution, DAN SSSR 89 (1953) 217–219.
[15] V. Skitovitch, Linear forms in independent random variables and the normal distribution law, Izvestiia AN SSSR, Ser. Matem. 18 (1954) 185–200.
[16] P. Smaragdis, Blind separation of convolved mixtures in the frequency domain, Neurocomputing 22 (1998) 21–34.
[17] F. Theis, A new concept for separability problems in blind source separation, 2003, submitted for publication; preprint at http://homepages.uni-regensburg.de/~thf11669/publications/preprints/theis03linuniqueness.pdf
[18] A. Zinger, Investigations into analytical statistics and their application to limit theorems of probability theory, Ph.D. Thesis, Leningrad University, 1969.




Chapter 4

Neurocomputing 64:223-234, 2005

Paper F.J. Theis and P. Gruber. On model identifiability in analytic postnonlinear ICA. Neurocomputing, 64:223-234, 2005

Reference (Theis and Gruber, 2005)

Summary in section 1.2.2

89



Abstract

On model identifiability in analytic postnonlinear ICA

F.J. Theis*, P. Gruber

Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany

An important aspect of successfully analyzing data with blind source separation is to know the indeterminacies of the problem, that is, how the separating model is related to the original mixing model. If linear independent component analysis (ICA) is used, it is well known that the mixing matrix can be found in principle, but for more general settings not many results exist. In this work, considering only random variables with bounded densities, we prove identifiability of the postnonlinear mixing model with analytic nonlinearities and calculate its indeterminacies. A simulation confirms these theoretical findings.

Key words: postnonlinear independent component analysis, postnonlinear blind source separation, identifiability, separability, bounded random vectors

1 Introduction

Independent component analysis (ICA) finds statistically independent data within a given random vector. It is often applied to blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources.

In linear ICA, the mixing model can be written as

X_i = \sum_{j=1}^{n} a_{ij} S_j

* Corresponding author.
Email addresses: fabian@theis.name (F.J. Theis), petergruber@gmx.net (P. Gruber).

Preprint submitted to Elsevier Science, 13 October 2004



with independent sources S^T = (S_1, ..., S_n) and mixing matrix A = (a_{ij}). X is known, and the goal is to determine A and S. Traditionally, this model was only assumed to have decorrelated sources S, which leads to Principal Component Analysis (PCA). Hérault and Jutten [1] were the first to extend this model to the ICA case by proposing a neural algorithm based on nonlinear decorrelation. Since then, the field of ICA has become increasingly popular and many algorithms have been studied, see [2–6] to name but a few. Good textbook-level introductions to ICA are given in [7] and [8].

With the growth of the field, interest in nonlinear model extensions has increased. However, if the model is chosen to be too general, it cannot be identified uniquely. A good trade-off between model generalization and identifiability is given by the so-called postnonlinear BSS model realized by

X_i = f_i\left( \sum_{j=1}^{n} a_{ij} S_j \right).

This explicit nonlinear model implies that in addition to the linear mixing situation, each sensor X_i contains an unknown nonlinearity f_i that can further distort the observation. This model, first proposed by Taleb and Jutten [9], has applications in telecommunication and biomedical data analysis. Algorithms for reconstructing postnonlinearly mixed sources include [9–13].

One major problem of ICA-based BSS lies in the question of model identifiability and separability, that is, whether the model respectively the sources are uniquely determined by the observations X alone (except for trivial indeterminacies such as permutation and scaling). This problem is of key importance for any ICA algorithm, because if such an algorithm indeed finds a possible mixing model for X, without identifiability the so-recovered sources would not have to coincide at all with the original sources. For linear ICA, real-valued model identifiability has been shown by Comon [3], given that X contains at most one Gaussian. The proof uses the rather nontrivial Darmois-Skitovitch theorem; however, a more direct elementary proof is possible as well [14]. A generalization to complex-valued random variables is given in [15]. Postnonlinear identifiability has been considered in [9]; however, in that formulation the proof contains an inaccuracy rendering it applicable only to quite restricted situations.

In this work, we will analyze separability of postnonlinear mixtures. We thereby generalize ideas already presented by Babaie-Zadeh et al. [10], where the focus was put on the development of an actual identification algorithm. Babaie-Zadeh was the first to use the method of analyzing bounded random vectors in the context of postnonlinear mixtures [16]¹. There, he already discussed

¹ His PhD thesis is available online at http://www.lis.inpg.fr/stages dea theses/theses/manuscript/babaie-zadeh.pdf



identifiability issues, albeit explicitly only in the two-dimensional analytic case. Extending his ideas, we are able to find a new necessary condition, which we name 'absolutely degenerate' (see definition 6), for identifying the mixing structure using only the boundary. This, together with the generalization to arbitrary dimensions, is our main contribution here, stated in theorem 7.

The paper is arranged as follows: Section 2 presents a simple result about homogeneous functions and shortly discusses linear identifiability in the case of bounded random variables. Section 3 states the postnonlinear separability problem, which is then proved in the following section for real-valued random vectors. In section 5, a simulation confirming the main separability theorem is presented.

Postnonlinear separability is important for any postnonlinear ICA algorithm, so we focus only on this question. We do not propose an explicit postnonlinear identification algorithm but instead refer to [9–13] for both algorithms and simulations.

2 Basics

For n ∈ N let Gl(n) be the general linear group of R^n, i.e. the group of invertible real (n × n)-matrices. An invertible matrix L ∈ Gl(n) is said to be a scaling matrix if it is diagonal. We say two (m × n)-matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n).
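A minimal numerical sketch, not part of the paper, of how one might test the equivalence B ∼ C for invertible B: since C = BPL, the matrix B^{-1}C must be a permutation times a scaling, i.e. have exactly one non-negligible entry in every row and column. Function names and the tolerance are illustrative assumptions.

```python
import numpy as np

def is_scaled_permutation(M: np.ndarray, tol: float = 1e-8) -> bool:
    """True if M = PL: exactly one non-negligible entry per row and per column."""
    mask = np.abs(M) > tol
    return bool(mask.sum(axis=0).max() == 1 and mask.sum(axis=1).max() == 1
                and mask.sum() == M.shape[0])

def equivalent(B: np.ndarray, C: np.ndarray, tol: float = 1e-8) -> bool:
    """Check B ~ C, i.e. C = B P L, by testing whether B^{-1} C is a scaled permutation."""
    return is_scaled_permutation(np.linalg.solve(B, C), tol)

B = np.array([[1.0, 2.0], [3.0, 4.0]])
P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation matrix
L = np.diag([2.0, -0.5])                 # scaling matrix
print(equivalent(B, B @ P @ L))          # True
print(equivalent(B, B + 1.0))            # False (generically)
```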

Definition 1 Given a function f : U → R, assume there exist a, b ∈ R such that at least one of them is not of absolute value 0 or 1. If f(ax) = bf(x) for all x ∈ U with ax ∈ U, then f is said to be (a, b)-homogeneous or simply homogeneous.

The following lemma characterizing homogeneous functions is from [10]. However, we added the correction excluding the cases |a| or |b| ∈ {0, 1}, because in these cases homogeneity does not induce such strong results. This lemma can be generalized to continuously differentiable functions, so the strong assumption of analyticity is not needed, but it shortens the proof.

Lemma 2 [10] Let f : U → R be an analytic function that is (a, b)-homogeneous on [0, ε) with ε > 0. Then there exist c ∈ R and n ∈ N ∪ {0} such that f(x) = cx^n for all x ∈ U.

PROOF. If |a| ∈ {0, 1} or b = 0, then obviously f ≡ 0. If b = −1 then f ≡ 0: since |a| ∉ {0, 1}, f(a^2 x) = f(x) and f is continuous, f is constant, but f(ax) = −f(x) then implies f ≡ 0. In the case b = 1, again f is constant since f(ax) = f(x) and a^0 = 1 = b.

By differentiating the homogeneity equation m times we get b f^{(m)}(x) = a^m f^{(m)}(ax), where f^{(m)} denotes the m-th derivative of f. Evaluating this at 0 yields b f^{(m)}(0) = a^m f^{(m)}(0). Since f is assumed to be analytic, f is determined uniquely by its derivatives at 0. Now either there exists an n ≥ 0 with b = a^n, hence f^{(m)}(0) = 0 for all m ≠ n and therefore f(x) = cx^n, or else f ≡ 0. ✷
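A small numerical illustration of lemma 2, not from the paper (the concrete values of a, n, c are arbitrary assumptions): a monomial cx^n satisfies the (a, a^n)-homogeneity relation exactly, while an analytic non-monomial does not.

```python
import numpy as np

a, n, c = 0.5, 3, 2.0          # hypothetical choices; the lemma forces b = a**n
b = a ** n
x = np.linspace(0.0, 1.0, 11)

f_mono = lambda x: c * x ** n          # a monomial: exactly (a, b)-homogeneous
f_other = lambda x: c * x ** n + x     # analytic but not a monomial

print(np.allclose(f_mono(a * x), b * f_mono(x)))    # True
print(np.allclose(f_other(a * x), b * f_other(x)))  # False
```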

Definition 3 [10] We call a random vector X with density p_X bounded if its density p_X is bounded. Denote by supp p_X := {x | p_X(x) ≠ 0} the support of p_X, i.e. the closure of the non-zero points of p_X.

We further call an independent random vector X fully bounded if supp p_{X_i} is an interval for all i. So we get supp p_X = [a_1, b_1] × ... × [a_n, b_n].

Since a connected component of supp p_X induces a restricted, fully bounded random vector, without loss of generality we will in the following assume to have fully bounded densities. In the case of linear instantaneous BSS the following separability result is well known and can be derived from a more general version of this theorem for non-bounded densities [3]. But in the context of fully bounded random vectors, this follows already from the fact that in this case independence is equivalent to having support within a cube with sides parallel to the coordinate planes, and only matrices equivalent to the identity leave this property invariant:

Theorem 4 (Separability of bounded linear BSS) Let M ∈ Gl(n) be an invertible matrix and S a fully bounded independent random vector. If MS is again independent, then M is equivalent to the identity.

This theorem indeed proves separability of the linear ICA model, because if X = AS and W is a demixing matrix such that WX is independent, then M := WA ∼ I, so W^{-1} ∼ A as desired. As the model is invertible and the indeterminacies are trivial, identifiability and uniqueness follow directly.
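The geometric picture behind theorem 4 can be sampled numerically. The following sketch is not from the paper and only illustrative (it assumes scipy is available for the convex hull): a permutation-plus-scaling of fully bounded uniform sources keeps the support an axis-parallel box, while a genuinely mixing matrix tilts it, so independence is lost.

```python
import numpy as np
from scipy.spatial import ConvexHull

def fills_bounding_box(Y: np.ndarray, tol: float = 0.05) -> bool:
    """Heuristic check: does the sample support (almost) fill its axis-parallel box?"""
    hull_area = ConvexHull(Y).volume                  # 'volume' is the area in 2D
    box_area = np.prod(Y.max(axis=0) - Y.min(axis=0))
    return hull_area >= (1.0 - tol) * box_area

rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))            # fully bounded independent sources

PL = np.array([[0.0, 2.0], [-0.5, 0.0]])               # permutation times scaling
A = np.array([[1.0, 1.0], [2.0, -2.0]])                # genuinely mixing matrix

print(fills_bounding_box(S @ PL.T))   # True: support stays a box, MS can be independent
print(fills_bounding_box(S @ A.T))    # False: tilted parallelogram, MS not independent
```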

3 Separability of postnonlinear BSS

In this section we introduce the postnonlinear BSS model and further discuss its identifiability.

Definition 5 [9] A function f : R^n → R^n is called diagonal or component-wise if each component f_i(x) of f(x) depends only on the variable x_i.


In this case we often omit the other variables and write f(x_1, ..., x_n) = (f_1(x_1), ..., f_n(x_n)) or f = f_1 × ... × f_n.

Consider now the postnonlinear blind source separation model:

X = f(AS)

where again S is an independent random vector, A ∈ Gl(n) and f is a diagonal nonlinearity. We assume the components f_i of f to be injective analytic functions with non-vanishing derivatives. Then the f_i^{-1} are also analytic.

Definition 6 Let A ∈ Gl(n) be an invertible matrix. Then A is said to be mixing if A has at least two nonzero entries in each row². And A = (a_{ij})_{i,j=1...n} is said to be absolutely degenerate if there are two columns l ≠ m such that a_{il}^2 = λ a_{im}^2 for all i with some λ ≠ 0, i.e. the normalized columns differ only by the signs of the entries.
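The two conditions of definition 6 are easy to check numerically. This is a minimal sketch, not part of the paper; function names and tolerances are illustrative assumptions.

```python
import numpy as np

def is_mixing(A: np.ndarray, tol: float = 1e-12) -> bool:
    """At least two nonzero entries in every row (definition 6)."""
    return bool(((np.abs(A) > tol).sum(axis=1) >= 2).all())

def is_absolutely_degenerate(A: np.ndarray, tol: float = 1e-10) -> bool:
    """Some pair of columns l != m whose squared entries are proportional."""
    n = A.shape[1]
    for l in range(n):
        for m in range(l + 1, n):
            u, v = A[:, l] ** 2, A[:, m] ** 2
            # proportional squared columns <=> the stacked 2-column matrix has rank 1
            if np.linalg.matrix_rank(np.column_stack([u, v]), tol=tol) == 1:
                return True
    return False

A_bad = np.array([[1.0, 1.0], [2.0, -2.0]])   # the matrix used in the example below
print(is_mixing(A_bad), is_absolutely_degenerate(A_bad))   # True True
```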

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can only be reconstructed up to scaling and permutation. Here of course additional indeterminacies come into play because of translation: f_i can only be recovered up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(AS) = (f ∘ L)((L^{-1}A)S), so f and A can interchange scaling factors in each component. Another indeterminacy could occur if A is not mixing, i.e. at least one observation x_i contains only one source; in this case f_i can obviously not be recovered. For example, if A = I, then f(S) is already again independent, because independence is invariant under component-wise nonlinear transformation; so f cannot be found using this method.

A not so obvious indeterminacy occurs if A is absolutely degenerate. Then only the matrix A but not the nonlinearities can be recovered by looking at the edges of the support of the fully-bounded random vector. For example, consider the case n = 2,

A = \begin{pmatrix} 1 & 1 \\ 2 & -2 \end{pmatrix}

and the analytic function

f(x_1, x_2) = \left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1),\; x_2 + \tfrac{1}{\pi}\sin\!\left(\tfrac{\pi x_2}{2}\right) \right).

Then A^{-1} ∘ f ∘ A maps [0, 1]^2 onto [0, 1]^2. Since both components of f are injective, we can verify this by looking at the edges:

² A slightly more general definition of 'mixing' can be given that still guarantees identifiability of the sources; it is however omitted for the sake of simplicity.

[Figure 1: panels show the graphs of f_1, f_2 and the supports of f(AS) and A^{-1}f(AS).]

Fig. 1. Example of a postnonlinear transformation using an absolutely degenerate matrix A and uniform sources S in [0, 1]².

f ∘ A(x_1, 0) = \left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1),\; 2x_1 + \tfrac{1}{\pi}\sin(\pi x_1) \right) = (1, 2)\left( x_1 + \tfrac{1}{2\pi}\sin(\pi x_1) \right)

f ∘ A(0, x_2) = (1, -2)\left( x_2 + \tfrac{1}{2\pi}\sin(\pi x_2) \right)

f ∘ A(x_1, 1) = (1, -2) + (1, 2)\left( x_1 - \tfrac{1}{2\pi}\sin(\pi x_1) \right)

f ∘ A(1, x_2) = (1, 2) + (1, -2)\left( x_2 - \tfrac{1}{2\pi}\sin(\pi x_2) \right)

So we have constructed a situation in which two uniform sources are mixed by f ∘ A, see figure 1. They can be separated either by A^{-1} ∘ f^{-1} or by A^{-1} alone. We have shown that the latter also preserves the boundary, although it contains a different postnonlinearity (namely the identity) in contrast to f^{-1} in the former model. Nonetheless this is no indeterminacy of the model itself, since A^{-1}f(AS) is obviously not independent. So by looking at the boundary alone, we sometimes cannot detect independence if the whole system is highly symmetric. This is the case if A is absolutely degenerate. In our example f was chosen such that the non-trivial postnonlinear mixture looks linear (at the boundary), and this was possible due to the inherent symmetry in A.

If we however assume that A is mixing and not absolutely degenerate, then we will show for all fully-bounded sources S that, except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist. Note that if f is only assumed to be continuously differentiable, then additional indeterminacies come into play.


4 Separability of bounded postnonlinear BSS

In this section we prove separability of postnonlinear BSS; in the proof we will see how the two conditions from definition 6 turn out to be necessary.

Theorem 7 (Separability of bounded postnonlinear BSS) Let A, W ∈ Gl(n), one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function with h_i' ≠ 0, and let S be a fully bounded independent random vector. If W(h(AS)) is independent, then there exist a scaling L ∈ Gl(n) and v ∈ R^n with LA ∼ W^{-1} and h(x) = Lx + v.

So let f ∘ A be the mixing model and W ∘ g the separating model. Putting the two together we get the above mixing-separating model with h := g ∘ f. The theorem shows that if the mixing-separating model preserves independence then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to scaling). As usual, the model is assumed to be invertible, hence identifiability and uniqueness of the model follow from the separability.

Definition 8 A subset P ⊂ R^n is called a parallelepiped if it is the linear image of a cube, that is

P = A([a_1, b_1] × ... × [a_n, b_n])

for a_i < b_i, i = 1, ..., n and A ∈ Gl(n). A parallelepiped P is said to be tilted if A is mixing and no 2 × 2-minor is absolutely degenerate. Let i ≠ j ∈ {1, ..., n} and c ∈ {a_1, b_1} × ... × {a_n, b_n}; then

A( {c_1} × ... × [a_i, b_i] × ... × [a_j, b_j] × ... × {c_n} )

is called a 2-face of P and A(c) is called a corner of P. If n = 2 the parallelepipeds are called parallelograms.

Lemma 9 Let f_1, ..., f_n be n one-dimensional analytic injective functions with f_i' ≠ 0, and let f := f_1 × ... × f_n be the induced injective mapping on R^n. Let P, Q ⊂ R^n be two parallelepipeds, one of them tilted. If f(P) = Q (or equivalently for the boundaries f(∂P) = ∂Q), then f is affine linear diagonal.

Here ∂P denotes the boundary of the parallelepiped P, i.e. the set of points of P not lying in its interior (which coincides with the union of its faces).

In the proof we see that the requirement for P or Q being tilted can be weakened slightly. It would suffice that enough 2-minors are not absolutely degenerate. Nevertheless the set of mixing matrices having no absolutely degenerate 2 × 2-minors is very large in the sense that its complement has measure zero in Gl(n).


Note that the tiltedness is essential: for example let P = \begin{pmatrix} 1 & 2 \\ 1 & 0 \end{pmatrix} [0, 1]^2 and take any f_1 with

f_1(x) = \tfrac{3}{2}x for x < 1,   f_1(x) = \tfrac{3}{2}x - 1 for x > 2,

and f_2(x) := x. Then Q is a parallelogram and its corners are (0, 0), (3/2, 1), (2, 0) and (7/2, 1), which is not a scaled version of P.

Note that the lemma extends the lemma proved in [10] by adding the condition of absolute degeneracy; this is in fact a necessary condition, as shown in figure 1.

PROOF. [Lemma 9 for n = 2] Obviously, images of non-tilted parallelograms under diagonal mappings are again non-tilted. f is invertible, so we can assume that both P and Q are tilted. Without loss of generality, using the scaling and translation invariance of our problem, we may assume that

∂P = \begin{pmatrix} 1 & 1 \\ a_1 & a_2 \end{pmatrix} ∂([0, 1] × [0, c]),   ∂Q = \begin{pmatrix} 1 & 1 \\ b_1 & b_2 \end{pmatrix} ∂([0, 1] × [0, d]),

with a_i, b_i ∈ R \ {0}, a_1^2 ≠ a_2^2, b_1^2 ≠ b_2^2, ca_2, db_2 > 0 and c ≤ 1, and

f(0) = 0,  f(1, a_1) = (1, b_1),  f(c, ca_2) = (d, db_2)

(i.e. the vertices of P are mapped onto the vertices of Q in the specified order). Note that the vertices of P have to be mapped onto vertices of Q because f is at least continuously differentiable. Since the f_i are monotonic we also have d ≤ 1, and a_1 < 0 implies b_1 < 0.

It follows that f maps the four separate edges of ∂P onto the corresponding edges of ∂Q: f(t, a_1 t) = (g_1(t), b_1 g_1(t)) and f(ct, ca_2 t) = (d g_2(t), d b_2 g_2(t)) for t ∈ [0, 1]. Here g_i : [0, 1] → [0, 1] is a strictly monotonically increasing parametrization of the respective edge. It follows that g_1(t) = f_1(t) and d g_2(t) = f_1(ct), and therefore f_2(a_1 t) = b_1 f_1(t) and f_2(c a_2 t) = b_2 f_1(ct) for t ∈ [0, 1]. Therefore we get an equation for both components of f, e.g. for the second:

f_2\!\left( \tfrac{a_1}{a_2} t \right) = \tfrac{b_1}{b_2} f_2(t)   for t ∈ [0, c a_2].

So f_2 is (a_1/a_2, b_1/b_2)-homogeneous with coefficients not in {-1, 0, 1} by assumption; according to lemma 2, f_2 and then also f_1 are homogeneous polynomials (everywhere, due to analyticity). By assumption f_i'(0) ≠ 0, hence the f_i are even linear.

We have used the translation invariance above, so in general f is an affine linear scaling. ✷


PROOF. [Lemma 9 for arbitrary n] Again note that since diagonal maps preserve non-tiltedness we can assume that P and Q are tilted. Let π_{ij} : R^n → R^2 be the projection onto the i, j-coordinates. Note that for any corner c and i ≠ j there is a 2-face P_{ijc} of P containing c such that π_{ij}(P_{ijc}) is a parallelogram. In fact, since P is tilted, π_{ij}(P_{ijc}) is also tilted. Since f is smooth, π_{ij}(f(P_{ijc})) is also (the projection of) a 2-face of Q and again tilted.

For each corner c of P and i ≠ j ∈ {1, ..., n} we can apply the n = 2 version of this lemma to π_{ij}(P_{ijc}) and π_{ij}(f(P_{ijc})). Therefore f_i and f_j are affine linear on π_i(P_{ijc}) and π_j(P_{ijc}). Now π_i(P) ⊂ ∪_{c,j} π_i(P_{ijc}), and hence f_i is affine linear on π_i(P), which proves that f is affine linear diagonal. ✷

Now we are able to show the separability theorem:

PROOF. [Theorem 7] S is bounded and W ∘ h ∘ A is continuous, so T := W(h(AS)) is bounded as well. Furthermore, since S is fully bounded, T is also fully bounded. Then, as seen in section 2, supp S and supp T are rectangles with boundaries parallel to the coordinate axes. Hence P := A(supp S) and Q := W^{-1}(supp T) are parallelograms. One of them is tilted, because otherwise A and W^{-1} would not be mixing.

As W ∘ h ∘ A maps supp S onto supp T, h maps the set A supp S onto W^{-1} supp T, i.e. h(P) = Q. Then by lemma 9, h is affine linear diagonal, say h(x) = Lx + v for x ∈ P with L ∈ Gl(n) scaling and v ∈ R^n.

So W(h(AS)) = WLAS + Wv is independent, and therefore also WLAS. By theorem 4, WLA ∼ I, so there exist a scaling L' and a permutation P' with WLA = L'P', as had to be shown. ✷

5 Simulation

In order to demonstrate the validity of theorem 7, we carry out a simple simulation in this section. We mix two independent random variables using a known mixing model f and A. However, f = f_{(p_0,q_0)} is taken from a parameterized family f_{(p,q)} of nonlinearities, which enables us to test numerically whether in the separation system really only f^{-1}_{(p_0,q_0)} can fully separate the data. So we unmix the data using inverses f^{-1}_{(p,q)} of members of this family and A^{-1}. The following simulation will show that the mutual information of the recoveries is minimal at (p_0, q_0), i.e. that f is determined uniquely by X (within this family at least), as stated by theorem 7.

[Figure 2: left panel 'Mutual information of recovered sources', a contour map of -log MI(A^{-1} ∘ (f_p^{-1} × g_q^{-1}) ∘ (f_{0.5} × g_{0.5}) ∘ A (S)) over the recovery parameters (p, q), with a zoom around p = q = 0.5; right panel, the nonlinearities f_p and g_q for p, q = 0.1 and 1.0.]

Fig. 2. Simulation of the separability result using two families of nonlinearities with f_p(x) = \frac{1}{10p}\log\frac{x + \sqrt{x^2 + 4e^{-20p}}}{2e^{-10p}} and g_q(y) = \frac{y}{4}\left|\frac{y}{4}\right|^{3q-0.5}. The left plot displays a color plot together with overlayed contours of a separation measure depending on the parameters p, q used for recovery. The separation quality is measured using the negative logarithm of the mutual information of the recovered sources. The region around the separation point p = q = 0.5 is also displayed in more detail.

The components of the postnonlinearity will be taken from two families of functions described by

f_p(x) = \frac{1}{10p}\log\frac{x + \sqrt{x^2 + 4e^{-20p}}}{2e^{-10p}}   and   g_q(y) = \frac{y}{4}\left|\frac{y}{4}\right|^{3q-0.5}

with p, q varying between 0 and 1. The first component of the nonlinearity, f_p, models a sensor which saturates with varying strength, and the second component g_q describes a polynomial activation of the sensor with varying degree, see figure 2, right hand side.
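For reference, a minimal sketch of these two families and their inverses; the closed-form inverses are derived here (not given explicitly in the paper) and the round-trip check is only an illustrative assumption about how one might implement the recovery step.

```python
import numpy as np

def f_p(x, p):
    """Saturating sensor family from the text."""
    return np.log((x + np.sqrt(x**2 + 4*np.exp(-20*p))) / (2*np.exp(-10*p))) / (10*p)

def g_q(y, q):
    """Polynomial activation family from the text."""
    return (y/4) * np.abs(y/4)**(3*q - 0.5)

def f_p_inv(u, p):
    # f_p(x) = arcsinh(x / (2 e^{-10p})) / (10p), hence the inverse below
    return 2*np.exp(-10*p) * np.sinh(10*p*u)

def g_q_inv(z, q):
    return 4 * np.sign(z) * np.abs(z)**(1/(3*q + 0.5))

x = np.linspace(-3, 3, 7)
p, q = 0.5, 0.5                               # the parameters of the mixing model below
print(np.allclose(f_p_inv(f_p(x, p), p), x))  # True
print(np.allclose(g_q_inv(g_q(x, q), q), x))  # True
```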

In the simulation, an independent uniformly distributed random vector (3000 samples in [-1, 1]^2) is mixed postnonlinearly by the matrix A = \begin{pmatrix} 2.6 & 1.4 \\ 0.7 & 3.3 \end{pmatrix} and the diagonal nonlinearity

f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \frac{1}{5}\log\!\left(\frac{e^5}{2}x + \frac{1}{2}\sqrt{e^{10}x^2 + 4}\right) \\ \frac{1}{16}\, y\,|y| \end{pmatrix} = \begin{pmatrix} f_{0.5}(x) \\ g_{0.5}(y) \end{pmatrix}.

To recover the sources, the family (f_p, g_q)^{-1} of diagonal nonlinearities is used together with the inverse of A.

[Figure 3: panels 'Relative volume error' (-log vol(∆_{p,q}) over (p, q)), 'Original', and 'Unmixing with p = 0.65, q = 0.5', 'p = 0.65, q = 0.65', 'p = 0.5, q = 0.65'.]

Fig. 3. The top-left image graphs the separation error ∆_{p,q} by measuring the negative logarithm of its volume. The error ∆_{p,q} denotes the difference set of points from either the support of the recovered source distribution or a quadrangle with the same vertices. The other plots show the distributions of the recovered sources at some combinations of the parameters p, q of the nonlinearities. At p = 0.5, q = 0.5 this is the original source distribution. Here light grey areas represent ∆_{p,q} and dark grey areas the intersection of the support and the quadrangle.

A simple estimator for mutual information based on histogram estimation of the entropy (with 10 bins in each dimension) is used to check the independence of the recovered sources. A more elaborate histogram-based estimator by Moddemeijer [17] yields similar results. As shown in figure 2, the mutual information of the recovered sources is minimal at the parameters p = q = 0.5, which correspond to the mixing model. It can also be noticed that the minimum is much less distinct in the second component. This indicates that in numerical application it should be easier to detect nonlinear functions which are bounded.
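A minimal sketch of the kind of histogram-based plug-in estimator described above; the 10-bin choice follows the text, while the function name, seed and test data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mutual_information_hist(x, y, bins=10):
    """Plug-in estimate MI = sum p(x,y) log( p(x,y) / (p(x)p(y)) ) from a 2D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(3000, 2))
print(mutual_information_hist(s[:, 0], s[:, 1]))                  # close to 0 (independent)
print(mutual_information_hist(s[:, 0], s[:, 0] + 0.1 * s[:, 1]))  # clearly positive
```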

The second graph (figure 3) further illustrates that the criterion for the borders<br />


to be of quadrangular shape is sufficient. This gives the idea of a possible postnonlinear ICA algorithm which minimizes the non-quadrangularity of the support of the estimated sources. As pictured in the graph, this can for example be achieved by minimizing the volume of the mutual differences (i.e. the points which are in the union but not in the intersection). It can easily be seen that this minimization yields the same solution as minimizing the mutual information. For more details on such an algorithm we refer to [10, 13, 16].

6 Conclusion

We have presented a new separability result for postnonlinear bounded mixtures that is based on the analysis of the borders of the mixture density. We hereby formalize and extend ideas already presented in [10]. We introduce the notion of absolutely degenerate mixing matrices. Using this we identify the restrictions of separability and also of algorithms that only use border analysis for postnonlinearity detection. This also represents a drawback of the algorithms proposed in [10] and [13], to which we want to refer for experimental results using border detection in postnonlinear settings.

In future works we will show how to relax the condition of analytic postnonlinearities to only continuously differentiable functions; also, preliminary results indicate how to generalize these results to complex-valued random vectors and mixtures. We further plan to extend this model to the case of group ICA [18], where independence is only assumed among groups of sources. In the linear case, this has been done in [15]; however, the extension to postnonlinearly mixed sources is yet unclear.

Acknowledgements

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. FT further would like to thank Christian Jutten for the helpful discussions during the preparation of this paper. Financial support by the BMBF in the project 'ModKog' is gratefully acknowledged.

References

[1] J. Hérault, C. Jutten, Space or time adaptive signal processing by neural network models, in: J. Denker (Ed.), Neural Networks for Computing. Proceedings of the AIP Conference, American Institute of Physics, New York, 1986, pp. 206–211.
[2] J.-F. Cardoso, A. Souloumiac, Blind beamforming for non-Gaussian signals, IEEE Proceedings - Part F 140 (1993) 362–370.
[3] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[4] A. Bell, T. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Computation 7 (1995) 1129–1159.
[5] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[6] F. Theis, A. Jung, C. Puntonet, E. Lang, Linear geometric ICA: Fundamentals and algorithms, Neural Computation 15 (2003) 419–439.
[7] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[8] A. Cichocki, S. Amari, Adaptive blind signal and image processing, John Wiley & Sons, 2002.
[9] A. Taleb, C. Jutten, Indeterminacy and identifiability of blind identification, IEEE Transactions on Signal Processing 47 (10) (1999) 2807–2820.
[10] M. Babaie-Zadeh, C. Jutten, K. Nayebi, A geometric approach for separating post non-linear mixtures, in: Proc. of EUSIPCO '02, Vol. II, Toulouse, France, 2002, pp. 11–14.
[11] S. Achard, D.-T. Pham, C. Jutten, Quadratic dependence measure for nonlinear blind sources separation, in: Proc. of ICA 2003, Nara, Japan, 2003, pp. 263–268.
[12] A. Ziehe, M. Kawanabe, S. Harmeling, K.-R. Müller, Blind separation of post-nonlinear mixtures using linearizing transformations and temporal decorrelation, Journal of Machine Learning Research 4 (2003) 1319–1338.
[13] F. Theis, E. Lang, Linearization identification and an application to BSS using a SOM, in: Proc. ESANN 2004, d-side, Evere, Belgium, Bruges, Belgium, 2004, pp. 205–210.
[14] F. Theis, A new concept for separability problems in blind source separation, Neural Computation 16 (2004) 1827–1850.
[15] F. Theis, Uniqueness of complex and multidimensional independent component analysis, Signal Processing 84 (5) (2004) 951–956.
[16] M. Babaie-Zadeh, On blind source separation in convolutive and nonlinear mixtures, Ph.D. thesis, Institut National Polytechnique de Grenoble (2002).
[17] R. Moddemeijer, On estimation of entropy and mutual information of continuous distributions, Signal Processing 16 (3) (1989) 233–246.
[18] J. Cardoso, Multidimensional independent component analysis, in: Proc. of ICASSP '98, Seattle, 1998.



Chapter 5

IEICE TF E87-A(9):2355-2363, 2004

Paper F.J. Theis and W. Nakamura. Quadratic independent component analysis. IEICE Trans. Fundamentals, E87-A(9):2355-2363, 2004

Reference (Theis and Nakamura, 2004)

Summary in section 1.2.2

103



IEICE TRANS. FUNDAMENTALS, VOL.E87-A, NO.9 SEPTEMBER 2004

PAPER: Special Section on Nonlinear Theory and its Applications

Quadratic independent component analysis

SUMMARY The transformation of a data set using a second-order polynomial mapping to find statistically independent components is considered (quadratic independent component analysis or ICA). Based on overdetermined linear ICA, an algorithm together with separability conditions are given via linearization reduction. The linearization is achieved using a higher-dimensional embedding defined by the linear parametrization of the monomials, which can also be applied for higher-order polynomials. The paper finishes with simulations for artificial data and natural images.

key words: nonlinear independent component analysis, quadratic forms, nonlinear blind source separation, overdetermined blind source separation, natural images

1. Introduction<br />

The task of transform<strong>in</strong>g a random vector <strong>in</strong>to an <strong>in</strong>dependent<br />

one is called <strong>in</strong>dependent component analysis<br />

(ICA). ICA has been well-studied <strong>in</strong> the case of l<strong>in</strong>ear<br />

transformations [3, 12].<br />

Nonl<strong>in</strong>ear demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components<br />

is an important extension of l<strong>in</strong>ear ICA and we still do<br />

not have sufficient knowledge of this problem. It is easy<br />

to see that without any restrictions the class of ICA solutions<br />

is too large to be of any practical use [13]. Hence<br />

<strong>in</strong> nonl<strong>in</strong>ear ICA, special nonl<strong>in</strong>earities are usually considered.<br />

In this paper, we treat polynomial nonl<strong>in</strong>earities,<br />

especially second-order monomials or quadratic<br />

forms. These represent a relatively simple class of nonl<strong>in</strong>earities,<br />

which can be <strong>in</strong>vestigated <strong>in</strong> detail.<br />

Several studies have employed quadratic forms as a<br />

generative process of data. Abed-Meraim et al. [1] suggested<br />

analyz<strong>in</strong>g mixtures by second-order polynomials<br />

us<strong>in</strong>g a l<strong>in</strong>earization <strong>in</strong> a similar way as we <strong>in</strong>troduce<br />

<strong>in</strong> section 3.2, but for the mixtures, which <strong>in</strong> general<br />

destroys the assumption of <strong>in</strong>dependence. Leshem [15]<br />

proposed a whiten<strong>in</strong>g scheme based on quadratic forms<br />

<strong>in</strong> order to enhance l<strong>in</strong>ear separation of time-signals<br />

<strong>in</strong> algorithms such as SOBI. Similar quadratic mix<strong>in</strong>g<br />

models are also considered <strong>in</strong> [8] and [10]. These are<br />

Manuscript received December 21, 2003.<br />

Manuscript revised April 2, 2004.<br />

F<strong>in</strong>al manuscript received May 21, 2004.<br />

† The authors are with the Lab. for Advanced Bra<strong>in</strong> Signal<br />

Process<strong>in</strong>g, Bra<strong>in</strong> Science Institute, RIKEN, Wako-shi,<br />

Saitama 351-0198 Japan.<br />

∗ On leave from the Institute of Biophysics, University<br />

of Regensburg, 93040 Regensburg, Germany.<br />

a) E-mail: fabian@theis.name<br />

b) E-mail: wakakoh@bra<strong>in</strong>.riken.jp<br />

Fabian J. THEIS †∗a) and Wakako NAKAMURA †b) , Nonmembers<br />

studies <strong>in</strong> which the mix<strong>in</strong>g model is assumed to be<br />

quadratic <strong>in</strong> contrast to the quadratic unmix<strong>in</strong>g model<br />

used <strong>in</strong> this paper.<br />

For demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components by<br />

quadratic forms, Bartsch and Obermayer [2] suggested<br />

apply<strong>in</strong>g l<strong>in</strong>ear ICA to second-order terms of data.<br />

Hashimoto [9] suggested an algorithm based on m<strong>in</strong>imization<br />

of Kullback-Leibler divergence. However, <strong>in</strong><br />

these studies, the generative model of data was not def<strong>in</strong>ed<br />

and the <strong>in</strong>terpretation of signals obta<strong>in</strong>ed by the<br />

separation was not given clearly; the focus was on the<br />

application to natural images. In this paper, we exam<strong>in</strong>e<br />

this quadratic demix<strong>in</strong>g model. We def<strong>in</strong>e both<br />

generative model and demix<strong>in</strong>g process of data explicitly<br />

to assume a one-to-one correspondence of the <strong>in</strong>dependent<br />

components with data. Us<strong>in</strong>g the analysis<br />

of overdeterm<strong>in</strong>ed l<strong>in</strong>ear ICA, we discuss identifiability<br />

of this quadratic demix<strong>in</strong>g model. We confirm that the<br />

algorithm proposed by Bartsch and Obermayer [2] can<br />

estimate the mix<strong>in</strong>g process and retrieve the <strong>in</strong>dependent<br />

components correctly by simulation with artificial<br />

data. We also apply the quadratic demix<strong>in</strong>g to natural<br />

image data.<br />

The paper is organized as follows: <strong>in</strong> the next section<br />

results about overdeterm<strong>in</strong>ed ICA that is ICA of<br />

more sensors than sources are recalled and extended.<br />

Section 3 then <strong>in</strong>troduces the quadratic ICA model and<br />

provides a separability result and an algorithm. The algorithms<br />

are then applied for artificial and natural data<br />

sets <strong>in</strong> section 4.<br />
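The linearization reduction mentioned in the summary is not spelled out in this excerpt; the following minimal sketch is an assumption about that construction (using all first- and second-order monomials as the embedding) and only illustrates how a quadratic demixing problem can be recast as an overdetermined linear one.

```python
import numpy as np

def monomial_embedding(X: np.ndarray) -> np.ndarray:
    """Map each sample x in R^n to (x_i) and (x_i * x_j, i <= j): any quadratic
    form in x becomes a linear function of this higher-dimensional embedding."""
    n = X.shape[1]
    idx = [(i, j) for i in range(n) for j in range(i, n)]
    quad = np.stack([X[:, i] * X[:, j] for i, j in idx], axis=1)
    return np.hstack([X, quad])

X = np.random.default_rng(0).standard_normal((1000, 3))
Z = monomial_embedding(X)    # shape (1000, 3 + 6) = (1000, 9)
# a quadratic function y = x^T Q x + b^T x is linear in Z, so overdetermined
# linear ICA techniques can be applied to Z instead of X
print(Z.shape)
```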

2. Overdeterm<strong>in</strong>ed ICA<br />

Before def<strong>in</strong><strong>in</strong>g the polynomial model, we have to study<br />

<strong>in</strong>determ<strong>in</strong>acies and algorithms of overdeterm<strong>in</strong>ed <strong>in</strong>dependent<br />

component analysis. Its goal lies <strong>in</strong> the transformation<br />

of a given random vector x to an <strong>in</strong>dependent<br />

one with lower dimension. Overdeterm<strong>in</strong>ed ICA is usually<br />

applied to solve the overdeterm<strong>in</strong>ed bl<strong>in</strong>d source<br />

separation (overdeterm<strong>in</strong>ed BSS) problem, where x is<br />

known to be a mixture of a lower number of <strong>in</strong>dependent<br />

source signals s. Overdeterm<strong>in</strong>ed ICA <strong>in</strong> the context<br />

of BSS is well known and understood [5, 14], but, to the knowledge of the authors, the indeterminacies of the overdetermined ICA problem in terms of the unmixing matrix have not yet been analyzed.



2.1 Model

The overdetermined ICA model can be formulated as follows: Let x be a given m-dimensional random vector. An n × m-matrix W with m > n ≥ 2 is called an overdetermined ICA of x if

y = Wx    (1)

is independent. In order to distinguish between overdetermined and ordinary ICA, in the case m = n we call W a square ICA of x.

Here W is not assumed to have full rank as usual. Theorem 2.1 shows that under reasonable assumptions this automatically holds.

Often overdetermined ICA is used to solve the overdetermined BSS problem given by

x = As    (2)

where s is an n-dimensional independent random vector and A an m × n matrix with m > n ≥ 2. Note that A can be assumed to have full rank (rank A = n); otherwise the system could be reduced to the case n - 1: if A = (a_1, ..., a_n) with columns a_i and rank A < n, we can without loss of generality assume that a_n = \sum_{i=1}^{n-1} λ_i a_i. Then

x = As = \sum_{j=1}^{n} a_j s_j = \sum_{j=1}^{n-1} a_j (s_j + λ_j s_n),

so source s_n does not appear in the mixture in this case, and thus the model can be reduced to the case n - 1. Overdetermined ICAs of x are usually considered solutions to this BSS problem.

Often, overdetermined BSS is stated in the noisy case,

x = As + ν    (3)

where ν is a decorrelated Gaussian 'noise' random vector, independent of s. Without additional noise, the sources can be found by solving for example the square ICA problem, which is constructed from equation 1 after leaving away the last m - n observations, given a non-degenerate projected mixture matrix. In the presence of noise, usually a projection by principal component analysis (PCA) is chosen in order to reduce this problem to the square case [14]. In the next section, however, the indeterminacies of the noise-free models represented by equations 1 and 2 are studied, because overdetermined ICA will only be needed later after reduction of the bilinear model, and in this model we do not allow any noise. However, the overdetermined noisy model from equation 3 can easily be reduced to the noise-free model by including ν as additional sources:

x = (A \; I) \begin{pmatrix} s \\ ν \end{pmatrix}.


In this case n increases and we could possibly deal with underdetermined ICA (where extra care has to be taken with the now increased number of Gaussians in the sources). Uniqueness and separability results for this case are given in [7], which shows that the following theorems also hold in this noisy ICA model.

2.2 Indeterminacies

The following theorem presents the indeterminacy of the unmixing matrix in the case of overdetermined mixtures, with the slight generalization that this unmixing matrix does not necessarily have to be of full rank. Later in this section we show that it is necessary to assume that the observed data set x is indeed a mixture.

Theorem 2.1 (Indeterminacies of overdetermined ICA). Let m ≥ n ≥ 2, let x = As as in the model of equation 2, and let the n × m matrix W be an overdetermined or square ICA of x such that at most one component of Wx is deterministic. Furthermore assume one of the following:

i. s has at most one Gaussian component and the variances of s exist.

ii. s has no Gaussian component.

Then there exist a permutation matrix P and an invertible scaling matrix L with

W = LP(A⊤A)⁻¹A⊤ + C    (4)

where C is an n × m matrix with rows lying in the kernel of A⊤, that is, with CA = 0. The converse also holds, i.e. if W fulfills equation 4, then Wx is independent.

A less general form of this indeterminacy has been given by Joho et al. [14]. In the square case, the above theorem shows that it is not necessary to assume that the mixing and especially the demixing matrix have full rank, provided that the transformation has at most one deterministic component. Obviously, this assumption is not needed if W is required to have full rank.

Since rank A = n, the pseudo-inverse (Moore-Penrose inverse) A⁺ of A has the explicit form A⁺ = (A⊤A)⁻¹A⊤. Note that A⁺A = I, so from equation 4 we get WA = LP. As a corollary to the above theorem we remark that overdetermined ICA is separable, which means that the sources are uniquely reconstructible (except for scaling and permutation), because for the approximated sources y we get y = Wx = WAs = LPs. This is of course well known, because overdetermined BSS can be reduced to square BSS (m = n) by projection; yet the indeterminacies of the demixing matrix, which are simple permutation and scaling in the square case, are not so obvious for overdetermined BSS.
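To make the structure of equation 4 concrete, the following small numpy sketch (illustrative only, not part of the original paper) builds a random full-rank A, its pseudo-inverse A⁺, and a matrix C with CA = 0, and checks that any W = A⁺ + C (taking L = P = I for simplicity) maps the mixtures back to the sources:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 3

    A = rng.uniform(-1, 1, size=(m, n))        # full-rank mixing matrix, m > n
    A_pinv = np.linalg.pinv(A)                 # A+ = (A^T A)^{-1} A^T, shape n x m

    # Build C with C A = 0 by projecting random rows onto the orthogonal
    # complement of the column space of A.
    P_col = A @ A_pinv                         # projector onto range(A)
    C = rng.uniform(-1, 1, size=(n, m)) @ (np.eye(m) - P_col)

    W = A_pinv + C                             # any such W is an ICA of x = A s
    s = rng.uniform(-1, 1, size=(n, 10000))    # independent sources
    x = A @ s

    print(np.allclose(C @ A, 0))               # True: C A = 0
    print(np.allclose(W @ A, np.eye(n)))       # True: W A = I, hence W x = s
    print(np.max(np.abs(W @ x - s)))           # numerically zero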

Proof of theorem 2.1. Consider B := WA and y := Bs. Let b_1, ..., b_n ∈ Rⁿ denote the (transposed) rows of B. Then

y_i = b_i⊤ s.    (5)

We will show that we can assume B to be invertible. If B is not invertible, without loss of generality let b_n = Σ_{i=1}^{n−1} λ_i b_i with coefficients λ_i ∈ R. Then at least one λ_i ≠ 0, otherwise y_n = 0 would be deterministic. Without loss of generality let λ_1 = 1. From equation 5 we then get y_n = Σ_{i=1}^{n−1} λ_i y_i = y_1 + u with u := Σ_{i=2}^{n−1} λ_i y_i independent of y_1. Application of the Darmois-Skitovitch theorem [6, 16] to the two equations

y_1 = y_1
y_n = y_1 + u

shows that y_1 is Gaussian or deterministic. Hence all y_i with λ_i ≠ 0 are Gaussian or deterministic, so we may assume that y_1, y_n and u are square integrable and, without loss of generality, centered. The cross-covariance of (y_1, y_n) can then be calculated as follows (note that y_1 is independent of both y_n and u):

0 = E(y_1 y_n⊤) = E(y_1 y_1⊤) + E(y_1 u⊤) = var y_1,

so y_1 is deterministic. Hence all y_i with λ_i ≠ 0 are deterministic, and then so is y_n. Thus at least two components of y = Wx are deterministic, in contradiction to the assumption. Therefore B is invertible.

Using assumptions i) or ii) and the well-known uniqueness result of square linear BSS (a corollary [5, 7] of the Darmois-Skitovitch theorem; see [17] for a proof that does not use this theorem), there exist a permutation matrix P and an invertible scaling matrix L with WA = LP. The properties of the pseudo-inverse then show that the equation

(LP)⁻¹ WA = I

in W has solutions W = LP(A⁺ + C′) with C′A = 0, or W = LPA⁺ + C, again with CA = 0.

The pseudo-inverse is the unique solution of WA = I with minimum Frobenius norm. In this case C = 0; so if W is an ICA of x with minimal Frobenius norm, then W already equals A⁺ except for scaling and permutation. This can be used for normalization. Using the singular value decomposition of A, this has been shown and extended to the noisy case in [14].

Theorem 2.1 states that in the case where x is the mixture of n sources (schematically s →A→ x, with two ICAs W and W′ of x yielding y and y′), ignoring the always present scaling and permutation (so that y = y′ = s), the ICAs of x form an n(m − n)-dimensional affine vector space, i.e. (W − W′)A = 0. In the case of an arbitrary x, however, the ICAs of x can be quite unrelated: there does not always exist a matrix B making the diagram x →W→ y, x →W′→ y′ commute (W′x = BWx), as the example of m ≥ 2n with W the projection onto the first n coordinates and W′ the projection onto the last n coordinates shows. In this case the square ICA argument from the proof of theorem 2.1 cannot be applied, and W and W′ need not have any relation. This is of course not true in the case m = n, where uniqueness (but not existence) can also be shown by inverting W, without explicitly knowing that x is a mixture. This could be extended to the case n ≤ m < 2n in order to construct 2n − m equations relating y and y′.

2.3 Algorithm

The usual algorithm for finding an overdetermined ICA is to first project x along its main principal components onto an n-dimensional random vector and to then apply square linear ICA [14, 19]. In [19], the question of where to place this projection stage (before or after application of ICA) is posed and answered somewhat heuristically. Here, a simple proof is given that in the overdetermined BSS case any possible ICA matrix factorizes over (almost) any projection, so projecting first and then applying ICA to recover the sources is always possible.

Theorem 2.2. Let x = As as in the model of equation 2 such that s satisfies one of the conditions of theorem 2.1, and let W be an overdetermined ICA of x. Then for almost all (in the measure sense) n × m matrices V there exists a square ICA B of Vx such that Wx = BVx.

Proof. Let V be an n × m matrix such that VA is invertible; this is an open condition, so almost all matrices are of this type. Then there exists a square ICA, say B′, of y := Vx. So B′V is an overdetermined ICA of x. Applying separability of overdetermined ICA (corollary of theorem 2.1) then proves that Wx = LPB′Vx for a permutation P and a scaling L. Setting B := LPB′ shows the claim.

In diagram form, this means s →A→ x, with y := Vx and y′ := Wx, and a square ICA B of y such that y′ = By.

In applications, V is usually chosen to be the projection along the first principal components in order to reduce noise [14]. In this case it is easy to see that VA is indeed invertible, as required in the theorem.
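Theorem 2.2 also suggests a direct numerical check (a sketch assuming numpy and scikit-learn are available; the dimensions and the random projection V are arbitrary choices, not taken from the paper): project the overdetermined mixtures with a random V and apply square ICA to Vx.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(1)
    n, m, T = 3, 10, 20000

    s = rng.uniform(-1, 1, size=(T, n))        # independent non-Gaussian sources
    A = rng.uniform(-1, 1, size=(m, n))        # overdetermined mixing matrix
    x = s @ A.T                                # observations, shape (T, m)

    V = rng.uniform(-1, 1, size=(n, m))        # almost any projection works
    y_proj = x @ V.T                           # projected data, now square (T, n)

    ica = FastICA(n_components=n, random_state=0)
    y = ica.fit_transform(y_proj)              # square ICA of V x

    # Compare recovered components with the true sources via correlations;
    # up to permutation and scaling each source should be matched almost perfectly.
    corr = np.corrcoef(y.T, s.T)[:n, n:]
    print(np.sort(np.abs(corr).max(axis=1)))   # values close to 1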

3. Quadratic ICA

In this section, the model of quadratic ICA is introduced, and separability and algorithms are studied in this context.

3.1 Model

Let x be an m-dimensional random vector. Consider the following quadratic or bilinear unmixing model

y := g(x, x)    (6)

for a fixed bilinear mapping g : Rᵐ × Rᵐ → Rⁿ. The components of the bilinear mapping are quadratic forms, which can be parameterized by symmetric matrices. So the above is equivalent to

y_i := x⊤ G^{(i)} x    (7)

for symmetric matrices G^{(i)} and i = 1, ..., n. If G^{(i)}_{kl} denote the coefficients of G^{(i)}, this means

y_i = Σ_{k=1}^{m} Σ_{l=1}^{m} G^{(i)}_{kl} x_k x_l

for i = 1, ..., n.

A special case of this model can be constructed as follows: decompose the symmetric coefficient matrices into

G^{(i)} = E^{(i)⊤} Λ^{(i)} E^{(i)},

where E^{(i)} is orthogonal and Λ^{(i)} diagonal. In order to be able to invert the above model explicitly (after restriction to a subset on which it is invertible), we now assume that these coordinate changes are all the same, i.e. E = E^{(i)} for i = 1, ..., n. Then

y_i = (Ex)⊤ Λ^{(i)} (Ex) = Σ_{k=1}^{m} Λ^{(i)}_{kk} (Ex)_k²    (8)

where Λ^{(i)}_{kk} are the diagonal coefficients of Λ^{(i)}. Defining Λ as the n × m matrix with entries Λ_{ik} := Λ^{(i)}_{kk} yields a two-layered unmixing model

y = Λ ∘ (.)² ∘ E ∘ x,    (9)

where (.)² is to be read as the componentwise square. This unmixing model can be interpreted as a two-layered feed-forward neural network, see figure 1.

[Fig. 1: Simplified quadratic unmixing model y = Λ ∘ (.)² ∘ E ∘ x, drawn as a two-layered feed-forward network mapping x_1, ..., x_m through E, a componentwise square and Λ to y_1, ..., y_n.]

[Fig. 2: Simplified square-root mixing model x = E⁻¹ ∘ √ ∘ Λ⁻¹ ∘ s, mapping the sources s_1, ..., s_n through Λ⁻¹, a componentwise square root and E⁻¹ to the mixtures x_1, ..., x_m.]

The advantage of the restricted model from equation 8 is that it can easily be inverted explicitly. We assume that Λ is invertible and that Ex only takes values in a single quadrant; otherwise the model cannot be inverted. Without loss of generality let this be the first quadrant, that is, assume (Ex)_i ≥ 0 for i = 1, ..., m. Then model 9 is invertible, and the corresponding inverse model (mixing model) can easily be expressed as

x = E⁻¹ ∘ √ ∘ Λ⁻¹ ∘ s    (10)

with E⁻¹ = E⊤. Here we write s for the domain of the model in order to distinguish the sources from the recoveries y given by the unmixing model; ideally, the two are the same. The inverse model is shown in figure 2.
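To illustrate equations 9 and 10, the following minimal numpy sketch (hypothetical dimensions and matrices, not the paper's experiment) draws an orthogonal E and a Λ whose inverse has positive entries, applies the mixing model to positive sources, and verifies that the quadratic unmixing model recovers them exactly:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3                                           # here m = n

    E, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal E, so E^{-1} = E^T
    Lam_inv = rng.uniform(0.1, 1.0, size=(n, n))    # Λ^{-1} chosen with positive entries
    Lam = np.linalg.inv(Lam_inv)                    # n x n matrix of eigenvalues Λ^{(i)}_{kk}

    s = rng.uniform(0.1, 1.0, size=(n, 10000))      # positive sources, so Λ^{-1} s >= 0
    x = E.T @ np.sqrt(Lam_inv @ s)                  # mixing model (10): x = E^{-1} ∘ sqrt ∘ Λ^{-1} ∘ s
    y = Lam @ (E @ x) ** 2                          # unmixing model (9): y = Λ ∘ (.)^2 ∘ E ∘ x

    print(np.max(np.abs(y - s)))                    # numerically zero: the two models invert each other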

3.2 Separability

Constructing a new random vector by arranging the monomials in model 6 in the lexicographical order of the index pairs (i, j), i ≤ j, lets us reduce the quadratic ICA problem to a (usually) overdetermined linear ICA problem. This idea will be used in the following to analyze the indeterminacies of quadratic ICA.

Consider the map

ζ : Rᵐ → R^d,  x ↦ (x_1², 2x_1x_2, ..., 2x_1x_m, x_2², 2x_2x_3, ..., x_m²)

with d = m(m+1)/2. With ζ we can rewrite the quadratic mixing model from equation 6 in the form of a linear model (linearization)

y = W_g ζ(x)    (11)

where the matrix W_g is constructed from the coefficient matrices of the quadratic form g as follows:

W_g = \begin{pmatrix} G^{(1)}_{11} & G^{(1)}_{12} & \cdots & G^{(1)}_{1m} & G^{(1)}_{22} & \cdots & G^{(1)}_{mm} \\ \vdots & & & & & & \vdots \\ G^{(n)}_{11} & G^{(n)}_{12} & \cdots & G^{(n)}_{1m} & G^{(n)}_{22} & \cdots & G^{(n)}_{mm} \end{pmatrix},

i.e. the i-th row of W_g lists the coefficients of G^{(i)} in the same order as the monomials in ζ. This transforms the nonlinear ICA problem into a higher-dimensional linear one.

Assuming that x has been mixed by the inverse of a restriction of a bilinear mapping, we can apply theorem 2.1 to show that W_g is unique except for scaling, permutation and the addition of rows whose transposes lie in the kernel of A⊤. Hence, the coefficient matrices G^{(i)} of g (corresponding to the rows of W_g) are uniquely determined except for these indeterminacies.
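The embedding ζ and the matrix W_g can be written down directly; the following numpy sketch (the helper names zeta and wg_from_forms are chosen here for illustration) checks that the linearization y = W_g ζ(x) reproduces the quadratic forms x⊤G^{(i)}x:

    import numpy as np

    def zeta(X):
        """Map samples x (rows of X, shape (T, m)) to the monomial vector
        (x_1^2, 2 x_1 x_2, ..., 2 x_1 x_m, x_2^2, ..., x_m^2) of dimension m(m+1)/2."""
        T, m = X.shape
        cols = []
        for i in range(m):
            for j in range(i, m):
                factor = 1.0 if i == j else 2.0
                cols.append(factor * X[:, i] * X[:, j])
        return np.stack(cols, axis=1)

    def wg_from_forms(Gs):
        """Stack the upper-triangular coefficients of the symmetric matrices G^{(i)}
        into the rows of W_g, in the same order as zeta."""
        m = Gs[0].shape[0]
        idx = [(i, j) for i in range(m) for j in range(i, m)]
        return np.array([[G[i, j] for (i, j) in idx] for G in Gs])

    rng = np.random.default_rng(3)
    m, n, T = 4, 2, 1000
    Gs = [(lambda M: (M + M.T) / 2)(rng.normal(size=(m, m))) for _ in range(n)]
    X = rng.normal(size=(T, m))

    y_direct = np.stack([np.einsum("ti,ij,tj->t", X, G, X) for G in Gs], axis=1)
    y_linear = zeta(X) @ wg_from_forms(Gs).T
    print(np.allclose(y_direct, y_linear))   # True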

3.3 Algorithm

For now assume m = n ≥ 2. In this case d > n, so the linearized problem from equation 11 consists in finding an overdetermined ICA of ζ(x); see section 2.3 for comments on such an algorithm.

In the more general case m ≠ n, situations with d = n and d < n are also possible. For example, in the case d = n (such as a quadratic mixture of 3 sources into 2 observations), the separating quadratic form is unique except for scaling and permutation. In simulations, however, such settings turned out to be numerically very unstable, so in the following we only consider the case of an equal number of sensors and sources.

3.4 Semi-online dimension reduction

Here we give an algorithm for performing memory-conservative overdetermined ICA. This is important when embedding the data using ζ. For example, in the simulations of section 4.3, image data with 8 × 8 image patches is considered, so n = m = 64. In this case d = 2080, and hence even an ordinary 'MATLAB-style' covariance calculation can become a memory problem, because the vector ζ(x) is too large to keep in memory.

For overdetermined ICA, a projection from the typically high-dimensional mixture space to the lower-dimensional feature space has to be constructed, usually using PCA. Once the high-dimensional covariance matrix Cov(ζ(x)) is known, eigenvalue decomposition and sample-wise projection give the desired low-dimensional feature data.

One way to calculate Cov(ζ(x)) and project is to use an online PCA algorithm such as [4], where the samples of ζ(x) are also calculated online from the known samples of x. Alternatively, batch estimation of the coefficients of Cov(ζ(x)) using these online samples is possible. In practice, these methods suffer from computational problems such as slow MATLAB performance in iterations. We therefore employed a block-wise standard covariance calculation, which improved performance by a factor of roughly 20.

The idea is to split the large covariance matrix Cov(ζ(x)) into smaller blocks, for which the corresponding components of ζ(x) still fit into memory. Denote by C := Cov(ζ(x)) the large d × d covariance matrix, and let 1 ≤ r ≪ d be fixed. C can be decomposed into q := ⌊d/r⌋ + 1 blocks of (maximal) size r simply as

C = \begin{pmatrix} C^{(1,1)} & \cdots & C^{(1,q)} \\ \vdots & \ddots & \vdots \\ C^{(q,1)} & \cdots & C^{(q,q)} \end{pmatrix}

where C^{(k,k′)} is the cross-covariance between the r-dimensional vectors (ζ_i(x))_{i=(k−1)r}^{kr−1} and (ζ_i(x))_{i=(k′−1)r}^{k′r−1} (possibly truncated if k or k′ equals q). These two vectors, and hence C^{(k,k′)}, can now easily be calculated using the definition of ζ. The advantage lies in the fact that these two vectors are of much smaller dimension than d and therefore fit into memory for sufficiently small r. Of course the computational cost increases, as ζ(x) has to be calculated multiple times.
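A block-wise estimate of Cov(ζ(x)) along these lines might look as follows (an illustrative numpy sketch rather than the original MATLAB implementation); only the columns of ζ(x) belonging to the two current blocks are ever held in memory, at the price of recomputing them:

    import numpy as np

    def zeta_columns(X, idx):
        """Compute only the requested columns of zeta(x) for samples X (shape (T, m)).
        idx contains flat indices into the monomial ordering (1,1), (1,2), ..., (m,m)."""
        T, m = X.shape
        pairs = [(i, j) for i in range(m) for j in range(i, m)]
        cols = []
        for k in idx:
            i, j = pairs[k]
            factor = 1.0 if i == j else 2.0
            cols.append(factor * X[:, i] * X[:, j])
        return np.stack(cols, axis=1)

    def blockwise_cov(X, r):
        """Estimate Cov(zeta(x)) block by block, holding at most 2r columns in memory."""
        T, m = X.shape
        d = m * (m + 1) // 2
        starts = list(range(0, d, r))
        means = np.concatenate([zeta_columns(X, range(a, min(a + r, d))).mean(axis=0)
                                for a in starts])
        C = np.empty((d, d))
        for a in starts:
            Za = zeta_columns(X, range(a, min(a + r, d))) - means[a:min(a + r, d)]
            for b in starts:
                Zb = zeta_columns(X, range(b, min(b + r, d))) - means[b:min(b + r, d)]
                C[a:a + Za.shape[1], b:b + Zb.shape[1]] = Za.T @ Zb / (T - 1)
        return C

    # Sanity check against the direct computation on a small example
    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 5))
    Z_full = zeta_columns(X, range(5 * 6 // 2))
    print(np.allclose(blockwise_cov(X, r=4), np.cov(Z_full, rowvar=False)))  # True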

In order to decompose Cov(ζ(x)) into blocks of equal size, a mapping between the lexicographical position in ζ(x) and the index pairs of x is needed, which we state here for convenience (writing m for the data dimension): the monomial x_i x_j with i ≤ j corresponds to the position

k(i, j) = (i − 1)m − i(i − 1)/2 + j

in ζ(x). Vice versa, the k-th entry of ζ(x) corresponds to the monomial with multi-index (i, j), where

i(k) = ⌈(2m + 1 − √((2m + 1)² − 8k)) / 2⌉   and   j(k) = k − (i(k) − 1)m + i(k)(i(k) − 1)/2.
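Since the index algebra is easy to get wrong, the following small sketch (hypothetical helper functions, using 1-based indices as in the text) checks the forward and inverse mapping against an explicit enumeration of the monomials:

    import math

    def k_of(i, j, m):
        # position of the monomial x_i x_j (1 <= i <= j <= m) in zeta(x)
        return (i - 1) * m - i * (i - 1) // 2 + j

    def ij_of(k, m):
        # inverse mapping: monomial indices of the k-th entry of zeta(x)
        i = math.ceil((2 * m + 1 - math.sqrt((2 * m + 1) ** 2 - 8 * k)) / 2)
        j = k - (i - 1) * m + i * (i - 1) // 2
        return i, j

    m = 8
    pairs = [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]
    assert all(k_of(i, j, m) == k for k, (i, j) in enumerate(pairs, start=1))
    assert all(ij_of(k, m) == pairs[k - 1] for k in range(1, len(pairs) + 1))
    print("index mapping consistent for m =", m)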

3.5 Extension to polynomial ICA

Instead of using only second-order monomials, we can of course allow any degree in the monomials in order to approximate more complex nonlinearities. In addition, including the first-order monomials guarantees that the linear ICA case is contained in the model. A suitable higher-dimensional embedding ζ using more monomials generalizes the results of the quadratic case to the polynomial case.

[Fig. 3: Mean SNR and standard deviation between the recovered and the original sources when applying overdetermined ICA to As after random projection using B, for varying source dimension n (legend: n = 2, 3, 5, 10) and mixture dimension m (horizontal axis: m − n). Here s is a random vector uniform in [−1, 1]ⁿ, and A and B are m × n respectively n × m matrices with coefficients uniformly distributed in [−1, 1]. The mean was taken over 1000 runs.]

4. Simulation results

In this section, computer simulations are performed to show the feasibility of the presented algorithms.

4.1 Overdetermined BSS

In order to confirm the theoretical results from section 2.2, we perform batch runs of overdetermined ICA applied to randomly generated data after random projection to the known source dimension. Square ICA was performed using the FastICA algorithm [11]. As parameters to the algorithm we used g(s) := tanh(s) as the nonlinearity in the approximation of the negentropy estimator (respectively its derivative), and stabilization was turned on, meaning that the step size was not kept fixed but could be adapted (halved if the algorithm got stuck between two points). The simulation (figure 3) confirms, not surprisingly, that the ICA algorithm performs well independently of the chosen projection and the mixture dimension. In the presented noise-free case, projecting along the directions of largest variance using PCA instead of a random projection will not improve performance (in accordance with theorem 2.2). In the case of white noise, however, PCA will provide better recoveries for larger m [14].

4.2 Quadratic ICA – artificially generated data

In our first example we consider the simplified mixing-unmixing model from equation 9 in the case m = n = 2 with the randomly generated matrices

E = \begin{pmatrix} 0.91 & 1.2 \\ 1.8 & -0.72 \end{pmatrix}   and   Λ = \begin{pmatrix} 8 & -7.7 \\ -5.1 & 6.7 \end{pmatrix}.

[Fig. 4: Example 1: the two square-root mixing functions x_1 and x_2, plotted as surfaces over the source domain.]

Λ was chosen such that Λ⁻¹ has only positive coefficients; together with the fact that the sources are positive, this ensures the invertibility of the nonlinear transformation (it is equivalent to (Ex)_i ≥ 0). Also note that we did not require E to be orthogonal in this example.
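Using the matrices above, the data generation of Example 1 can be reproduced in a few lines (a numpy sketch under the stated assumptions; note that it uses the known E and Λ to invert the model, whereas the blind algorithm has to estimate the quadratic forms from x alone):

    import numpy as np

    rng = np.random.default_rng(5)
    E = np.array([[0.91, 1.2], [1.8, -0.72]])
    Lam = np.array([[8.0, -7.7], [-5.1, 6.7]])
    Lam_inv = np.linalg.inv(Lam)                   # has only positive coefficients

    s = rng.uniform(0.0, 2.0, size=(2, 10000))     # positive independent sources
    x = np.linalg.inv(E) @ np.sqrt(Lam_inv @ s)    # mixing model (10), E not orthogonal here
    y = Lam @ (E @ x) ** 2                         # quadratic unmixing model (9)

    print(np.max(np.abs(y - s)))                   # the restricted model is exactly invertible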

The two-dimensional sources s are shown in figure 5 together with a scatter plot, i.e. a plot of the samples that depicts the density. The mixtures x := E⁻¹ √(Λ⁻¹ s) are also plotted in the same figure; the nonlinearity is quite visible. Figure 4 gives a plot of the two nonlinear mixing functions.

Application of the described algorithm yields the quadratic forms

y_1 = 29x_1² − 57x_1x_2 − 21x_2²
y_2 = −28x_1² + 40x_1x_2 + 17x_2².

The recovered signals y are given in the right column of figure 5; a cross scatter plot with the sources is shown in figure 6 for comparison. The signal-to-noise ratios between the two are 44 and 43 dB after normalization to zero mean and unit variance and possible sign multiplication, which confirms the high separation quality. In order to demonstrate the nonlinearity of the problem, figure 7 shows that linear ICA does not perform well when applied to the mixtures.

In order to obtain quantitative results beyond a single experiment, we apply the algorithm to mixtures with a varying number of samples and dimensions. We consider the cases of equal source and mixture dimension m = n = 2, 3, 4. Figure 8 shows the algorithm performance for an increasing number of samples. In the mean, quadratic ICA always outperforms linear ICA, but has a higher standard deviation. Problems when recovering the sources were noticed to occur when the condition number of Λ⁻¹ was very high; by leaving out these cases, the performance of quadratic ICA rises noticeably in comparison to linear ICA.


[Fig. 5: Example 1: A two-dimensional nonlinear mixture using mixing model 10 (see figure 2) is separated. The left column shows the two independent source signals together with a signal scatter plot (only every 5th sample plotted), which depicts the source probability density. The middle column shows the two clearly nonlinearly mixed signals and their scatter plot. The right column depicts the two signals separated by quadratic ICA; their scatter plot again confirms their independence.]

[Fig. 6: Example 1: Comparison scatter plots of source and recovered source signals (figure 5); if y denotes the recovered sources, the top left panel is a scatter plot of (s_1, y_1), the top right of (s_2, y_1) and the bottom right of (s_2, y_2). The SNRs between the first respectively second signals are 44 and 43 dB after normalization to zero mean and unit variance and possible sign multiplication.]

[Fig. 7: Example 1, linear ICA: Applying linear ICA to the mixtures from figure 5 yields the recovered signals shown at the top; the scatter plot at the bottom confirms that the recovery was poor, and not surprisingly no independence could be achieved. Comparison with the original sources shows that the recovery does not correspond well to the sources: the maximal SNRs were 7.5 and 7.7 after normalization.]
[Fig. 8: Quadratic ICA performance versus sample set size in dimensions 2 (top), 3 (middle) and 4 (bottom), comparing bilinear (quadratic) ICA with linear ICA. The mean was taken over 1000 runs with Λ⁻¹ and E⁻¹ having uniformly distributed coefficients in [0, 1]ⁿ respectively [−1, 1]ⁿ. E⁻¹ was orthogonalized (an orthogonal basis was constructed out of the columns of E⁻¹), because in experiments the algorithm turned out to be quite sensitive to a high condition number of E⁻¹ in higher dimensions; so E⁻¹ = E⊤, in accordance with the model from equation 9. The sources were uniformly distributed in [0, 1]ⁿ.]

4.3 Quadratic ICA – image data

Finally we deal with a real-world example, namely the analysis of natural images. We applied quadratic ICA to a set of small patches collected from the database of natural images made by van Hateren [18]. We show the two dominant linear filters and the corresponding eigenvalues of each obtained quadratic form in figure 9. Most of the obtained quadratic forms have one or two dominant linear filters, and these linear filters are selective for local bar stimuli. The other eigenvalues are all substantially smaller and lie centered around zero in a roughly Gaussian distribution; see the eigenvalue distribution in figure 9. If a quadratic form has two dominant filters, the position, orientation and spatial-frequency selectivity of these two filters are very similar, and the two corresponding eigenvalues have different signs. In summary, the values of all quadratic forms correspond to squared simple-cell outputs. This result is qualitatively similar to the result obtained in [2], with simple-cell-like features being more prominent in our results. Also, in section 4 of [9], the obtained quadratic forms correspond to squared simple-cell outputs.

5. Conclusion

This paper treats nonlinear ICA by reducing it to the case of linear ICA. After formally stating separability results for overdetermined ICA, these results are applied to quadratic and polynomial ICA. The presented algorithm consists of linearization followed by application of linear ICA. In order to apply the PCA projection also in high dimensions, a block calculation of the covariance matrix is suggested. The algorithms are then applied to artificial data sets, where they show good performance for a sufficiently high number of samples. In the application to natural image data, the characteristics of the obtained quadratic forms are qualitatively the same as the results of Bartsch and Obermayer [2], and they correspond to squared simple-cell outputs. In future work, we want to further investigate a possible enhancement of the separability result as well as a more detailed extension to ICA with polynomial and maybe analytic mappings. Furthermore, bilinear ICA with additional noise (for example multiplicative noise) could be considered. Also, issues such as model and parameter selection will have to be treated if higher-order models are allowed.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. FT was partially supported by the DFG in the grant 'Nonlinearity and Nonequilibrium in Condensed Matter' and by the BMBF in the project 'ModKog'.


[Fig. 9: Quadratic ICA of natural images. 3·10⁵ sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvector matrices of the quadratic-form coefficient matrices (top). For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2 · 64); above each image the corresponding eigenvalue (multiplied by 10³) is printed. In the bottom figure, the absolute values of the 10 largest eigenvalues of each filter are shown. Clearly, in most filters one or two eigenvalues are dominant.]

References

[1] K. Abed-Meraim, A. Belouchrani, and Y. Hua. Blind identification of a linear-quadratic mixture of independent components based on joint diagonalization procedure. In Proc. of ICASSP 1996, volume 5, pages 2718-2721, Atlanta, USA, 1996.
[2] H. Bartsch and K. Obermayer. Second order statistics of natural images. Neurocomputing, 52-54:467-472, 2003.
[3] A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing. John Wiley & Sons, 2002.
[4] A. Cichocki and R. Unbehauen. Robust estimation of principal components in real time. Electronics Letters, 29(21):1869-1870, 1993.
[5] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287-314, 1994.
[6] G. Darmois. Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21:2-8, 1953.
[7] J. Eriksson and V. Koivunen. Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003, pages 23-27, 2003.
[8] P.G. Georgiev. Blind source separation of bilinearly mixed signals. In Proc. of ICA 2001, pages 328-330, San Diego, USA, 2001.
[9] W. Hashimoto. Quadratic forms in natural images. Network: Computation in Neural Systems, 14:765-788, 2003.
[10] S. Hosseini and Y. Deville. Blind separation of linear-quadratic mixtures of real sources using a recurrent structure. Lecture Notes in Computer Science, 2687:241-248, 2003.
[11] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
[12] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[13] A. Hyvärinen and P. Pajunen. On existence and uniqueness of solutions in nonlinear independent component analysis. In Proceedings of the 1998 IEEE International Joint Conference on Neural Networks (IJCNN'98), volume 2, pages 1350-1355, 1998.
[14] M. Joho, H. Mathis, and R.H. Lambert. Overdetermined blind source separation: using more sensors than source signals in a noisy mixture. In Proc. of ICA 2000, pages 81-86, Helsinki, Finland, 2000.
[15] A. Leshem. Source separation using bilinear forms. In Proc. of the 8th Int. Conference on Higher-Order Statistical Signal Processing, 1999.
[16] V.P. Skitovitch. On a property of the normal distribution. DAN SSSR, 89:217-219, 1953.
[17] F.J. Theis. A new concept for separability problems in blind source separation. Neural Computation, accepted, 2004.
[18] J.H. van Hateren and D.L. Ruderman. Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:2315-2320, 1998.
[19] S. Winter, H. Sawada, and S. Makino. Geometrical interpretation of the PCA subspace method for overdetermined blind source separation. In Proc. of ICA 2003, pages 775-780, Nara, Japan, 2003.


Chapter 6

LNCS 3195:726-733, 2004

Paper: F.J. Theis, A. Meyer-Bäse, and E.W. Lang. Second-order blind source separation based on multi-dimensional autocovariances. In Proc. ICA 2004, volume 3195 of LNCS, pages 726-733, Granada, Spain, 2004. Springer.

Reference: (Theis et al., 2004e)

Summary in section 1.3.1


Second-order blind source separation based on multi-dimensional autocovariances

Fabian J. Theis¹,², Anke Meyer-Baese², and Elmar W. Lang¹

¹ Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
² Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA
fabian@theis.name

Abstract. SOBI is a blind source separation algorithm based on time decorrelation. It uses multiple time autocovariance matrices and performs joint diagonalization, thus being more robust than previous time-decorrelation algorithms such as AMUSE. We propose an extension called mdSOBI, which uses multidimensional autocovariances; these can be calculated for data sets with multidimensional parameterizations such as images or fMRI scans. mdSOBI has the advantage of using the spatial data in all directions, whereas SOBI only uses a single direction. These findings are confirmed by simulations and by an application to fMRI analysis, where mdSOBI outperforms SOBI considerably.

Blind source separation (BSS) describes the task of recovering the unknown mixing process and the underlying sources of an observed data set. Currently, many BSS algorithms assume independence of the sources (ICA); see for instance [1, 2] and references therein. In this work, we consider BSS algorithms based on time decorrelation. Such algorithms include AMUSE [3] and extensions such as SOBI [4] and the similar TDSEP [5]. These algorithms rely on the fact that the data sets have non-trivial autocorrelations. We give an extension to data sets that have more than one direction in the parametrization, such as images, by replacing one-dimensional autocovariances with multi-dimensional autocovariances.

The paper is organized as follows: in section 1 we introduce the linear mixture model; section 2 recalls results on time-decorrelation BSS algorithms. We then define multidimensional autocovariances and use them to propose mdSOBI in section 3. The paper finishes with both artificial and real-world results in section 4.

1 Linear BSS

We consider the following blind source separation (BSS) problem: let x(t) be an (observed) stationary m-dimensional real stochastic process (with not necessarily discrete time t) and A an invertible real matrix such that

x(t) = As(t) + n(t)    (1)

where the source signals s(t) have diagonal autocovariances

R_s(τ) := E[(s(t + τ) − E(s(t)))(s(t) − E(s(t)))⊤]

for all τ, and the additive noise n(t) is modelled by a stationary, temporally and spatially white zero-mean process with variance σ². x(t) is observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A⁻¹x(t), which is optimal in the maximum-likelihood sense (if the density of n(t) is maximal at 0, which is the case for usual noise models such as Gaussian or Laplacian noise). So the BSS task reduces to the estimation of the mixing matrix A. Extensions of the above model include for example the complex case [4] or allowing different dimensions for s(t) and x(t), where the case of larger mixing dimension can easily be reduced to the presented complete case by dimension reduction, resulting in a lower noise level [6].

By centering the processes, we can assume that x(t) and hence s(t) have zero mean. The autocovariances then have the following structure:

R_x(τ) = E[x(t + τ)x(t)⊤] = A R_s(0) A⊤ + σ²I  for τ = 0,  and  R_x(τ) = A R_s(τ) A⊤  for τ ≠ 0.    (2)

Clearly, A (and hence s(t)) can be determined from equation 1 only up to permutation and scaling of columns. Since we assume existing variances of x(t) and hence s(t), the scaling indeterminacy can be eliminated by the convention R_s(0) = I. In order to guarantee identifiability of A except for permutation from the above model, we additionally have to assume that there exists a delay τ such that R_s(τ) has pairwise different eigenvalues (for a generalization see [4], theorem 2). Using the spectral theorem it is then easy to see from equation 2 that A is determined uniquely by x(t) except for permutation.

2 AMUSE and SOBI

Equation 2 also gives an indication of how to perform BSS, i.e. how to recover A from x(t). The usual first step consists of whitening the noise-free term x̃(t) := As(t) of the observed mixtures x(t) using an invertible matrix V such that Vx̃(t) has unit covariance. V can simply be estimated from x(t) by diagonalization of the symmetric matrix R_x̃(0) = R_x(0) − σ²I, provided that the noise variance σ² is known. If more signals than sources are observed, dimension reduction can be performed in this step, and the noise level can be reduced [6].

In the following, we will therefore assume without loss of generality that x̃(t) = As(t) has unit covariance for each t. By assumption, s(t) also has unit covariance, hence I = E[As(t)s(t)⊤A⊤] = A R_s(0) A⊤ = AA⊤, so A is orthogonal. Now define the symmetrized autocovariance of x(t) as R̄_x(τ) := ½ (R_x(τ) + (R_x(τ))⊤). Equation 2 shows that the symmetrized autocovariance of x(t) also factors, and we get

R̄_x(τ) = A R̄_s(τ) A⊤    (3)

for τ ≠ 0.


By assumption R̄_s(τ) is diagonal, so equation 3 is an eigenvalue decomposition of the symmetric matrix R̄_x(τ). If we furthermore assume that R̄_x(τ), or equivalently R̄_s(τ), has n different eigenvalues, then the above decomposition, i.e. A, is uniquely determined by R̄_x(τ) except for orthogonal transformations within each eigenspace and permutation; since the eigenspaces are one-dimensional, this means that A is uniquely determined by equation 3 except for permutation. In addition to this separability result, A can be recovered algorithmically by simply calculating the eigenvalue decomposition of R̄_x(τ) (AMUSE, [3]).

In practice, if the eigenvalue decomposition is problematic, a different choice of τ often resolves the problem. Nonetheless, there are sources in which some components have equal autocovariances. Also, due to the fact that the autocovariance matrices are only estimated from a finite number of samples, and due to possible colored noise, the autocovariance at τ could be badly estimated. A more general BSS algorithm called SOBI (second-order blind identification), based on time decorrelation, was therefore proposed by Belouchrani et al. [4]. Instead of diagonalizing only a single autocovariance matrix, it takes a whole set of autocovariance matrices of x(t) with varying time lags τ and jointly diagonalizes the whole set. It has been shown that increasing the size of this set improves SOBI performance in noisy settings [1].

Algorithms for performing joint diagonalization of a set of symmetric commuting matrices include gradient descent on the sum of the off-diagonal terms, iterative construction of A by Givens rotations in two coordinates [7] (used in the simulations in section 4), an iterative two-step recovery of A [8], and more recently a linear least-squares algorithm for diagonalization [9], where the latter two algorithms can also search for non-orthogonal matrices A. Joint diagonalization has been used in BSS based on cumulant matrices [10] or time autocovariances [4, 5].
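As a concrete reference point for the following discussion, a minimal AMUSE-style implementation could look as follows (an illustrative numpy sketch for the noise-free case; it is not the exact implementation used in the experiments):

    import numpy as np

    def amuse(X, tau=1):
        """AMUSE-style separation of mixtures X (shape (n, T)): whiten, then
        eigendecompose the symmetrized autocovariance at lag tau."""
        n, T = X.shape
        X = X - X.mean(axis=1, keepdims=True)

        # Whitening: V X has (approximately) unit covariance
        d, E = np.linalg.eigh(np.cov(X))
        V = E @ np.diag(d ** -0.5) @ E.T
        Z = V @ X

        # Symmetrized autocovariance at lag tau
        R = Z[:, tau:] @ Z[:, :-tau].T / (T - tau)
        R_sym = (R + R.T) / 2

        # Its eigenvectors give the orthogonal unmixing of the whitened data
        _, U = np.linalg.eigh(R_sym)
        W = U.T @ V
        return W, W @ X

    # Toy example: two independent AR(1)-like sources with different autocovariances
    rng = np.random.default_rng(6)
    T = 5000
    s = np.zeros((2, T))
    for t in range(1, T):
        s[:, t] = np.array([0.9, 0.3]) * s[:, t - 1] + rng.normal(size=2)
    A = rng.normal(size=(2, 2))
    W, y = amuse(A @ s, tau=1)
    print(np.abs(np.corrcoef(y, s)[:2, 2:]).round(2))   # close to a permutation matrix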

3 Multidimensional SOBI

The goal of this work is to improve SOBI performance for random processes with a higher-dimensional parametrization, i.e. for data sets where the random processes s and x do not depend on a single variable t, but on multiple variables (z_1, ..., z_M). A typical example is a source data set in which each component s_i represents an image of size h × w. Then M = 2 and samples of s are given at z_1 = 1, ..., h, z_2 = 1, ..., w. Classically, s(z_1, z_2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples. If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure based algorithms such as AMUSE and SOBI, the results can vary greatly depending on the choice of this mapping, see figure 2.

Without loss of generality we again assume centered random vectors. We then define the multidimensional autocovariance as

R_s(τ_1, ..., τ_M) := E[s(z_1 + τ_1, ..., z_M + τ_M) s(z_1, ..., z_M)⊤],

where the expectation is taken over (z_1, ..., z_M).


Given equidistant samples, R_s(τ_1, ..., τ_M) can be estimated as usual by replacing random variables with sample values and expectations with sums.

[Fig. 1: Example of the one- and two-dimensional autocovariance coefficients of the grayscale 128 × 128 Lena image after normalization to variance 1, plotted against τ respectively |(τ_1, τ_2)| (rescaled to N).]
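Estimating such a multidimensional autocovariance from image-valued components is straightforward; the following numpy sketch (illustrative code, with M = 2 and an arbitrary toy data set) replaces the expectation over (z_1, z_2) by an average over all valid pixel positions:

    import numpy as np

    def md_autocov(X_img, tau1, tau2):
        """Multidimensional autocovariance R_x(tau1, tau2) of an image-valued
        random vector. X_img has shape (n, h, w): n components, each an h x w image."""
        n, h, w = X_img.shape
        X = X_img - X_img.mean(axis=(1, 2), keepdims=True)     # center each component
        A = X[:, tau1:, tau2:].reshape(n, -1)                   # s(z1 + tau1, z2 + tau2)
        B = X[:, :h - tau1, :w - tau2].reshape(n, -1)           # s(z1, z2)
        return A @ B.T / A.shape[1]

    # Toy example: two smooth random 'images' with spatial correlations
    rng = np.random.default_rng(7)
    imgs = rng.normal(size=(2, 64, 64))
    imgs = np.cumsum(np.cumsum(imgs, axis=1), axis=2)
    print(md_autocov(imgs, tau1=1, tau2=2).shape)               # (2, 2)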

The advantage of using multidimensional autocovariances lies in the fact that the multidimensional structure of the data set can now be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance (see the examples in figure 2 and section 4). Figure 1 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ_1, τ_2) after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional autocovariance; only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. captures image structure.

Our contribution consists of using multidimensional autocovariances for joint diagonalization. We replace the BSS assumption of diagonal one-dimensional autocovariances by diagonal multi-dimensional autocovariances of the sources. Note that the multidimensional autocovariance also satisfies equation 2. Again we assume whitened x(z_1, ..., z_M). Given an autocovariance matrix R̄_x(τ_1^(1), ..., τ_M^(1)) with n different eigenvalues, multidimensional AMUSE (mdAMUSE) detects the orthogonal unmixing mapping W by diagonalization of this matrix.

In section 2, we discussed the advantages of SOBI over AMUSE; this of course also holds in the generalized case. Hence, the multidimensional SOBI algorithm (mdSOBI) consists of the joint diagonalization of a set of symmetrized multidimensional autocovariances

{ R̄_x(τ_1^(1), ..., τ_M^(1)), ..., R̄_x(τ_1^(K), ..., τ_M^(K)) }

after whitening of x(z_1, ..., z_M).


118 Chapter 6. LNCS 3195:726-733, 2004<br />

PSfrag replacements<br />

(a) source images<br />

crosstalk<strong>in</strong>g error E1( Â, I)<br />

4<br />

3.5<br />

3<br />

2.5<br />

2<br />

1.5<br />

1<br />

0.5<br />

SOBI based on multi-dimensional autocovariances 5<br />

SOBI<br />

SOBI transposed images<br />

mdSOBI<br />

mdSOBI transposed images<br />

0<br />

0 10 20 30 40 50 60 70<br />

K<br />

(b) performance comparison<br />

Fig. 2. Comparison of SOBI and mdSOBI when applied to (unmixed) images from (a).<br />

The plot (b) plots the number K of time lags versus the crosstalk<strong>in</strong>g error E1 of the<br />

recovered matrix  and the unit matrix I; here  has been recovered by bot SOBI<br />

and mdSOBI given the images from (a) respectively the transposed images.<br />

after whitening of x(z1, . . . , zK). The joint diagonalizer then equals A except for permutation, given the generalized identifiability conditions from [4], theorem 2. Therefore, the identifiability result does not change either, see [4]. In practice, we choose the $(\tau_1^{(k)}, \ldots, \tau_M^{(k)})$ with increasing modulus for increasing k, but with the restriction $\tau_1^{(k)} > 0$ in order to avoid using the same autocovariances on the diagonal of the matrix twice.
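A minimal sketch of this lag-selection rule (our own illustration, assuming two-dimensional data; not code from the paper): enumerate lags by increasing modulus, keep only those with positive first component, and take the first K.

```python
import numpy as np

def mdsobi_lags(K, max_shift=20):
    """Pick K two-dimensional lags (tau1, tau2), ordered by increasing
    modulus, with the restriction tau1 > 0 so that no symmetrized
    autocovariance matrix is used twice."""
    candidates = [(t1, t2)
                  for t1 in range(1, max_shift + 1)
                  for t2 in range(-max_shift, max_shift + 1)]
    candidates.sort(key=lambda t: np.hypot(t[0], t[1]))   # stable sort by modulus
    return candidates[:K]

print(mdsobi_lags(5))   # e.g. [(1, 0), (1, -1), (1, 1), (2, 0), (1, -2)]
```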

Often, data sets do not have any substantial long-distance autocorrelations, but quite high multidimensional close-distance correlations (see figure 1). When performing joint diagonalization, SOBI weights each matrix equally strongly, which can deteriorate the performance for large K, see the simulation in section 4.

Figure 2(a) shows an example in which the images have considerable vertical structure, but rather random horizontal structure. Each of the two images consists of a concatenation of stripes of two images. For visual purposes, we chose the width of the stripes to be rather large with 16 pixels. According to the previous discussion we expect one-dimensional algorithms such as AMUSE and SOBI to perform well on the images, but badly (for a number of time lags ≫ 16) on the transposed images. If we apply AMUSE with τ = 20 to the images, we get excellent performance with a low crosstalking error of 0.084 with respect to the unit matrix; if we however apply AMUSE to the transposed images, the error is high at 1.1. This result is further confirmed by the comparison plot in figure 2(b); mdSOBI performs equally well on the images and the transposed



Fig. 3. SOBI and mdSOBI performance (for K = 32 and K = 128) in dependence on the noise level σ. Plotted is the crosstalking error E1 of the recovered matrix Â with respect to the real mixing matrix A. See text for more details.

images, whereas the performance of SOBI strongly depends on whether column or row concatenation was used to construct a one-dimensional random process out of each image. The SOBI breakpoint of around K = 52 can be decreased by choosing smaller stripes. In future work we want to provide an analytical discussion of the performance increase of mdSOBI over SOBI, similar to the performance evaluation in [4].

4 Results

Artificial mixtures. We consider the linear mixture of three images (baboon, black-haired lady and Lena) with a randomly chosen 3 × 3 matrix A. Figure 3 shows how SOBI and mdSOBI perform depending on the noise level σ. For small K, both SOBI and mdSOBI perform equally well in the low-noise case, but mdSOBI performs better in the case of stronger noise. For larger K mdSOBI substantially outperforms SOBI, which is due to the fact that natural images do not have any substantial long-distance autocorrelations (see figure 1), whereas mdSOBI uses the non-trivial two-dimensional autocorrelations.

fMRI analysis. We analyze the performance of mdSOBI when applied to fMRI measurements. fMRI data were recorded from six subjects (3 female, 3 male, age 20–37) performing a visual task. In five subjects, five slices with 100 images (TR/TE = 3000/60 msec) were acquired with five periods of rest and five


[Fig. 4 panels: (a) component maps 1–8; (b) their time courses with stimulus crosscorrelations cc = −0.08, 0.19, −0.11, −0.21, −0.43, −0.21, −0.16, −0.86.]

Fig. 4. mdSOBI fMRI analysis. The data was reduced to the first 8 principal components. (a) shows the recovered component maps (white points indicate values stronger than 3 standard deviations), and (b) their time courses. mdSOBI was performed with K = 32. Component 5 represents the inner ventricles, component 6 the frontal eye fields. Component 8 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes), with a crosscorrelation of cc = −0.86 and a delay of roughly 2 seconds induced by the BOLD effect.

photic stimulation periods with rest. Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. The resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point and a dark background with a central fixation point during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment (AIR, [11]).

BSS, mainly based on ICA, is nowadays quite a common tool in fMRI analysis (see for example [12]). Here, we analyze the fMRI data set using spatial decorrelation as the separation criterion. Figure 4 shows the performance of mdSOBI; see the figure caption for an interpretation. Using only the first 8 principal components, mdSOBI could recover the stimulus component as well as detect additional components. When applying SOBI to the data set, it could not properly detect the stimulus component but found two components with crosscorrelations of cc = −0.81 and −0.84 with the stimulus time course.

5 Conclusion

We have proposed an extension of SOBI, called mdSOBI, for data sets with multidimensional parametrizations, such as images. Our main contribution lies in replacing the one-dimensional autocovariances by multidimensional autocovariances. In both simulations and real-world applications mdSOBI outperforms SOBI for these multidimensional structures.

In future work, we will show how to perform spatiotemporal BSS by jointly diagonalizing both spatial and temporal autocovariance matrices. We plan to apply these results to fMRI analysis, where we also want to use three-dimensional autocovariances for 3D scans of the whole brain.

Acknowledgements

The authors would like to thank Dr. Dorothee Auer from the Max Planck Institute of Psychiatry in Munich, Germany, for providing the fMRI data, and Oliver Lange from the Department of Clinical Radiology, Ludwig-Maximilian University, Munich, Germany, for data preprocessing and visualization. FT and EL acknowledge partial financial support by the BMBF within the project 'ModKog'.

References

1. Cichocki, A., Amari, S.: Adaptive blind signal and image processing. John Wiley & Sons (2002)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons (2001)
3. Tong, L., Liu, R.W., Soon, V., Huang, Y.F.: Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems 38 (1991) 499–509
4. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moulines, E.: A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing 45 (1997) 434–444
5. Ziehe, A., Mueller, K.R.: TDSEP – an efficient algorithm for blind separation using time structure. In Niklasson, L., Bodén, M., Ziemke, T., eds.: Proc. of ICANN'98, Skövde, Sweden, Springer Verlag, Berlin (1998) 675–680
6. Joho, M., Mathis, H., Lambert, R.: Overdetermined blind source separation: using more sensors than source signals in a noisy mixture. In: Proc. of ICA 2000, Helsinki, Finland (2000) 81–86
7. Cardoso, J.F., Souloumiac, A.: Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl. 17 (1995) 161–164
8. Yeredor, A.: Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Trans. Signal Processing 50 (2002) 1545–1553
9. Ziehe, A., Laskov, P., Mueller, K.R., Nolte, G.: A linear least-squares algorithm for joint diagonalization. In: Proc. of ICA 2003, Nara, Japan (2003) 469–474
10. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings - F 140 (1993) 362–370
11. Woods, R., Cherry, S., Mazziotta, J.: Rapid automated algorithm for aligning and reslicing PET images. Journal of Computer Assisted Tomography 16 (1992) 620–633
12. McKeown, M., Jung, T., Makeig, S., Brown, G., Kindermann, S., Bell, A., Sejnowski, T.: Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping 6 (1998) 160–188




Chapter 7

Proc. ISCAS 2005, pages 5878-5881

Paper F.J. Theis. Blind signal separation into groups of dependent signals using joint block diagonalization. In Proc. ISCAS 2005, pages 5878-5881, Kobe, Japan, 2005

Reference (Theis, 2005a)

Summary in section 1.3.3



Blind signal separation into groups of dependent signals using joint block diagonalization

Abstract— Multidimensional or group independent component analysis describes the task of transforming a multivariate observed sensor signal such that groups of the transformed signal components are mutually independent; however, dependencies within the groups are still allowed. This generalization of independent component analysis (ICA) allows for weakening the sometimes too strict assumption of independence in ICA. It has potential applications in various fields such as ECG, fMRI analysis or convolutive ICA. Recently we could calculate the indeterminacies of group ICA, which finally enables us, also theoretically, to apply group ICA to solve blind source separation (BSS) problems. In this paper we introduce and discuss various algorithms for separating signals into groups of dependent signals. The algorithms are based on joint block diagonalization of sets of matrices generated using several signal structures.

Fabian J. Theis<br />

Institute of Biophysics, University of Regensburg<br />

93040 Regensburg, Germany, Email: fabian@theis.name<br />

I. INTRODUCTION

In this work, we discuss multidimensional blind source separation (MBSS), i.e. the recovery of underlying sources s from an observed mixture x. As usual, s has to fulfill additional properties such as independence or diagonality of the autocovariances (if s possesses time structure). However, in contrast to ordinary BSS, MBSS is more general as some source signals are allowed to possess common statistics. One possible solution for MBSS is multidimensional independent component analysis (MICA); in section IV we will discuss other such conditions. The idea of MICA is that we do not require full independence of the transform y := Wx but only mutual independence of certain tuples y_{i1}, . . . , y_{i2}. If the size of all tuples is restricted to one, this reduces to ordinary ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length k.

Multidimensional ICA has first been introduced by Cardoso [1] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA which they called independent subspace analysis [2]; there the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation.

II. JOINT BLOCK DIAGONALIZATION

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE [3]) or in BSS relying on second-order time-decorrelation (for example in SOBI [4]). The task of (real) joint diagonalization is, given a set of commuting symmetric n × n matrices M_i, to find an orthogonal matrix E such that E^⊤ M_i E is diagonal for all i. In the following we will use a generalization of this technique as an algorithm to solve MBSS problems. Instead of fully diagonalizing M_i, in joint block diagonalization (JBD) we want to determine E such that E^⊤ M_i E is block-diagonal (after fixing the block structure).

Introducing some notation, let us define for r, s = 1, . . . , n the (r, s) sub-k-matrix of W = (w_{ij}), denoted by W^{(k)}_{rs}, to be the k × k submatrix of W ending at position (rk, sk). Denote by Gl(n) the group of invertible n × n matrices. A matrix W ∈ Gl(nk) is said to be a k-scaling matrix if W^{(k)}_{rs} = 0 for r ≠ s, and W is called a k-permutation matrix if for each r = 1, . . . , n there exists precisely one s such that W^{(k)}_{rs} equals the k × k unit matrix.

Hence, fixing the block size to k, JBD tries to find E such that E^⊤ M_i E is a k-scaling matrix. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure on non-block-diagonality. Various algorithms to actually perform JBD have been proposed, see [5] and references therein. In the following we will simply perform joint diagonalization (using for example the Jacobi-like algorithm from [6]) and then permute the columns of E to achieve block-diagonality; in experiments this turns out to be an efficient solution to JBD [5].

III. MULTIDIMENSIONAL ICA (MICA)

Let k, n ∈ N. We call an nk-dimensional random vector y k-independent if the k-dimensional random vectors (y_1, . . . , y_k)^⊤, . . . , (y_{nk−k+1}, . . . , y_{nk})^⊤ are mutually independent. A matrix W ∈ Gl(nk) is called a k-multidimensional ICA of an nk-dimensional random vector x if Wx is k-independent. If k = 1, this is the same as ordinary ICA.

Using MICA we want to solve the (noiseless) linear MBSS problem x = As, where the nk-dimensional random vector x is given, and A ∈ Gl(nk) and s are unknown. In the case of MICA, s is assumed to be k-independent.
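To make the block notation of Section II concrete, here is a small numpy sketch (our own illustration, not code from the paper) that extracts the (r, s) sub-k-matrix of a matrix W and tests whether W is a k-scaling matrix:

```python
import numpy as np

def sub_k_matrix(W, k, r, s):
    """Return the (r, s) sub-k-matrix of W, i.e. the k x k block
    ending at position (r*k, s*k); r and s are counted from 1."""
    return W[(r - 1) * k : r * k, (s - 1) * k : s * k]

def is_k_scaling(W, k, tol=1e-10):
    """W is a k-scaling matrix if all off-diagonal k x k blocks vanish."""
    n = W.shape[0] // k
    return all(np.allclose(sub_k_matrix(W, k, r, s), 0, atol=tol)
               for r in range(1, n + 1) for s in range(1, n + 1) if r != s)

# example: a block-diagonal matrix with two 2x2 blocks is a 2-scaling matrix
B = np.block([[3.0 * np.eye(2), np.zeros((2, 2))],
              [np.zeros((2, 2)), np.arange(4.0).reshape(2, 2)]])
print(is_k_scaling(B, 2))   # True
```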

A. Indeterminacies

Obvious indeterminacies are, similar to ordinary ICA, invertible transforms in Gl(k) in each tuple as well as the fact that the order of the independent k-tuples is not fixed. Indeed, if A is an MBSS solution, then so is ALP with a k-scaling matrix L and a k-permutation matrix P, because independence is


invariant under these transformations. In [7] we show that these are the only indeterminacies, given some additional weak restrictions on the model, namely that A has to be k-admissible and that s is not allowed to contain a Gaussian k-component.

As usual, by preprocessing the observations x by whitening we may also assume that Cov(x) = I. Then I = Cov(x) = A Cov(s) A^⊤ = A A^⊤, so A is orthogonal.

B. MICA using Hessian diagonalization (MHICA)

We assume that s admits a C²-density p_s. Using the orthogonality of A we get p_s(s_0) = p_x(As_0) for s_0 ∈ R^{nk}. Let H_f(x_0) denote the Hessian of f evaluated at x_0. It transforms like a 2-tensor, so locally at s_0 with p_s(s_0) > 0 we get
$$H_{\ln p_s}(s_0) = H_{\ln p_x \circ A}(s_0) = A^\top H_{\ln p_x}(A s_0)\, A. \qquad (1)$$
The key idea now lies in the fact that s is assumed to be k-independent, so p_s factorizes into n groups depending only on k separate variables each. So ln p_s is a sum of functions depending on k separate variables, hence H_{ln p_s}(s_0) is block-diagonal, i.e. a k-scaling.

The algorithm, multidimensional Hessian ICA (MHICA), now simply uses the block-diagonality structure from equation (1) and performs JBD of estimates of a set of Hessians H_{ln p_s}(s_i) evaluated at different points s_i ∈ R^{nk}. Given slight restrictions on the eigenvalues, the resulting block diagonalizer then equals A^⊤ except for k-scaling and permutation. The Hessians are estimated using kernel-density approximation with a sufficiently smooth kernel, but other methods such as approximation using finite differences are possible, too. Density approximation is problematic, but in this setting, since we can use many Hessians, we only need rough estimates. For more details on the kernel approximation we refer to the one-dimensional Hessian ICA algorithm from [8].

MHICA generalizes one-dimensional ideas proposed in [8], [9]. More generally, we could have also used characteristic functions instead of densities, which leads to a related algorithm, see [10] for the single-dimensional ICA case.
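As a rough, self-contained illustration of one such Hessian estimate (our own sketch, not the paper's implementation; all names and parameter values are ours), the following computes a finite-difference Hessian of the log-density at one data point, with the density approximated by a plain Gaussian kernel density estimate. MHICA would collect many such matrices at different points and jointly block diagonalize them.

```python
import numpy as np

def log_kde(x, data, bandwidth=0.5):
    """Log of an (unnormalized) Gaussian kernel density estimate at x; data: N x d."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.log(np.mean(np.exp(-d2 / (2 * bandwidth ** 2))) + 1e-300)

def hessian_log_density(x, data, h=0.1, bandwidth=0.5):
    """Central finite-difference Hessian of the log-density at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (log_kde(x + e_i + e_j, data, bandwidth)
                       - log_kde(x + e_i - e_j, data, bandwidth)
                       - log_kde(x - e_i + e_j, data, bandwidth)
                       + log_kde(x - e_i - e_j, data, bandwidth)) / (4 * h * h)
    return H

# toy data: two independent, internally dependent 2D groups -> the cross blocks
# of the estimated Hessian should be comparatively small (up to estimation error)
rng = np.random.default_rng(1)
g1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
g2 = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], size=2000) ** 3
s = np.column_stack([g1, g2])
s = (s - s.mean(0)) / s.std(0)
print(np.round(hessian_log_density(s[0], s), 2))
```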

IV. MULTIDIMENSIONAL TIME DECORRELATION

Instead of assuming k-independence of the sources in the MBSS problem, in this section we assume that s is a multivariate centered discrete WSS random process such that its symmetrized autocovariances
$$\bar{R}_s(\tau) := \tfrac{1}{2}\left( E\bigl[s(t+\tau)s(t)^\top\bigr] + E\bigl[s(t)s(t+\tau)^\top\bigr] \right) \qquad (2)$$
are k-scalings for all τ. This models the fact that the sources are supposed to be block-decorrelated in the time domain for all time shifts τ.

A. Indeterminacies

Again A can only be found up to k-scaling and k-permutation because condition (2) is invariant under this transformation. One sufficient condition for identifiability is to have pairwise different eigenvalues of at least one R_s(τ); however, generalizations are possible, see [4] for the case k = 1. Using whitening, we can again assume an orthogonal A.


Fig. 1. Histogram and box plot of the multidimensional performance index E^{(k)}(C) evaluated for k = 2 and n = 2. The statistics were calculated over 10^5 independent experiments using 4 × 4 matrices C with coefficients uniformly drawn out of [−1, 1].

B. Multidimensional SOBI (MSOBI)

The idea of what we call multidimensional second-order blind identification (MSOBI) is now a direct extension of the usual SOBI algorithm [4]. Symmetrized autocovariances of x can easily be estimated from the data, and they transform as follows: R̄_s(τ) = A^⊤ R̄_x(τ) A. But R̄_s(τ) is a k-scaling by assumption, so JBD of a set of such symmetrized autocovariance matrices yields A as diagonalizer (except for k-scaling and permutation).

Other researchers have worked on this problem in the setting of convolutive BSS; due to lack of space we want to refer to [11] and references therein.
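A minimal sketch of the matrices MSOBI feeds into the joint block diagonalizer (our own naming; the JBD step itself, e.g. Jacobi joint diagonalization followed by a column permutation as described in Section II, is assumed to be available separately):

```python
import numpy as np

def symmetrized_autocov(x, tau):
    """Symmetrized lagged covariance matrix, cf. eq. (2); x has shape (T, d), tau > 0."""
    x = x - x.mean(axis=0)
    R = x[:-tau].T @ x[tau:] / (len(x) - tau)
    return 0.5 * (R + R.T)

def msobi_matrices(x, taus=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)):
    """Whiten x and collect the symmetrized autocovariances to be
    jointly block diagonalized by MSOBI."""
    x = x - x.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(x.T))          # eigendecomposition of the covariance
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # symmetric whitening matrix
    xw = x @ W.T
    return W, [symmetrized_autocov(xw, t) for t in taus]
```

Joint block diagonalization of these matrices with block size k then recovers the orthogonal mixing matrix of the whitened data up to k-scaling and k-permutation.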

V. EXPERIMENTAL RESULTS

In this section we demonstrate the validity of the proposed algorithms by applying them to both toy and real-world data.

A. Multidimensional Amari-index

In order to analyze algorithm performance, we consider the index E^{(k)}(C), defined for fixed n, k and C ∈ Gl(nk) as
$$E^{(k)}(C) = \sum_{r=1}^{n}\left(\sum_{s=1}^{n}\frac{\|C^{(k)}_{rs}\|}{\max_i \|C^{(k)}_{ri}\|} - 1\right) + \sum_{s=1}^{n}\left(\sum_{r=1}^{n}\frac{\|C^{(k)}_{rs}\|}{\max_i \|C^{(k)}_{is}\|} - 1\right).$$
Here ‖·‖ can be any matrix norm; we choose the operator norm ‖A‖ := max_{|x|=1} |Ax|. This multidimensional performance index of an nk × nk matrix C generalizes the one-dimensional performance index introduced by Amari et al. [12] to block-diagonal matrices. It measures how much C differs from a permutation and scaling matrix in the sense of k-blocks, so it can be used to analyze algorithm performance:

Lemma 5.1: Let C ∈ Gl(nk). E^{(k)}(C) = 0 if and only if C is the product of a k-scaling and a k-permutation matrix.

Corollary 5.2: Consider the MBSS problem x = As from section III respectively IV. An estimate Â of the mixing matrix solves the MBSS problem if and only if E^{(k)}(Â^{-1}A) = 0.
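A direct numpy transcription of this index (our own helper, using the spectral norm as the operator norm) might look as follows:

```python
import numpy as np

def block(C, k, r, s):
    """(r, s) sub-k-matrix of C; r and s are counted from 1."""
    return C[(r - 1) * k : r * k, (s - 1) * k : s * k]

def multidim_amari_index(C, k):
    """Multidimensional performance index E^(k)(C); it is 0 iff C is the
    product of a k-scaling and a k-permutation matrix."""
    n = C.shape[0] // k
    norms = np.array([[np.linalg.norm(block(C, k, r, s), 2)
                       for s in range(1, n + 1)] for r in range(1, n + 1)])
    rows = (norms / norms.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (norms / norms.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return float(rows.sum() + cols.sum())

# sanity check: a 2-scaling (block-diagonal) matrix has index 0
L = np.block([[np.array([[1., 2.], [0., 1.]]), np.zeros((2, 2))],
              [np.zeros((2, 2)), np.array([[0., 1.], [3., 0.]])]])
print(multidim_amari_index(L, 2))   # 0.0
```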

In order to be able to determine the scale of this index, figure 1 gives statistics of E^{(k)} over randomly chosen matrices in the case k = n = 2. The mean is 3.05 and the median 3.10.





Fig. 2. Simulation, 4-dimensional 2-independent sources. Clearly the first and the second as well as the third and the fourth signal are dependent.

B. Simulations

We will discuss algorithm performance when applied to a 4-dimensional 2-independent toy signal. In order to see the performance of both MSOBI and MHICA we generate 2-independent sources with non-trivial autocorrelations. For this we use two independent generating signals, a sinusoid and a sawtooth, given by
$$z(t) := \bigl(\sin(0.1\,t),\ 2\,(0.007\,t - \lfloor 0.007\,t + 0.5\rfloor)\bigr)^\top$$
for discrete time steps t = 1, 2, . . . , 1000. We thus generated sources
$$s(t) := \bigl(z_1(t),\ \exp(z_1(t)),\ z_2(t),\ (z_2(t) + 0.5)^2\bigr)^\top,$$
which are plotted in figure 2. Their covariance is
$$\mathrm{Cov}(s) = \begin{pmatrix} 0.50 & 0.57 & 0.01 & 0.01\\ 0.57 & 0.68 & 0.01 & 0.01\\ 0.01 & 0.01 & 0.33 & 0.33\\ 0.01 & 0.01 & 0.33 & 0.42 \end{pmatrix},$$
so indeed s is not fully independent.

s is mixed using a 4 × 4 matrix A with entries uniformly drawn out of [−1, 1], and comparisons are made over 100 Monte-Carlo runs. We compare the two algorithms MSOBI (with 10 autocorrelation matrices) and MHICA (using 50 Hessians) with the ICA algorithms JADE and fastICA, where for the latter both the deflation and the symmetric approach were used. For each run we calculate the performance index E^{(2)}(Â^{-1}A) of the product of the mixing and the estimated separating matrix. Since the one-dimensional ICA algorithms are unable to use the group structure, for these we take the minimum of the index calculated over all row permutations of Â^{-1}A.

Figure 3 displays the result of the comparison. Clearly MHICA and MSOBI perform very well on this data, and MSOBI furthermore gives very robust estimates with the same error and negligibly small variance. JADE cannot separate the data at all; it performs not much better than a random choice of matrix, see figure 1; this is due to the fact that the cumulants of k-independent sources are not block-diagonal. FastICA only converges in 12% (deflation approach) respectively 89% (symmetric approach) of all cases. However, in the cases where it converges it gives results comparable with the multidimensional algorithms. Apparently, especially the symmetric method seems to be able to use the weakened statistics to still find directions in the data.

Fig. 3. Simulation, algorithm results. This notched box plot displays the performance index E^{(2)} of the mixing-separating matrix Â^{-1}A of each algorithm (MHICA, MSOBI, JADE, fastICA deflation, fastICA symmetric), sampled over 100 Monte-Carlo runs. The middle line of each column gives the mean, the boxes the 25th and 75th percentiles. The deflationary fastICA algorithm only converged in 12% of all runs, the symmetric-approach based fastICA in 89% of all cases; the statistics are only given over successful runs. All other algorithms converged in all runs.
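For reference, the toy sources above can be generated with a few lines of numpy (our own sketch; the sawtooth term follows our reconstruction of the garbled formula, chosen to match the quoted covariance):

```python
import numpy as np

t = np.arange(1, 1001)                                  # discrete time steps t = 1, ..., 1000
z1 = np.sin(0.1 * t)                                    # sinusoid
z2 = 2 * (0.007 * t - np.floor(0.007 * t + 0.5))        # sawtooth in [-1, 1)
s = np.vstack([z1, np.exp(z1), z2, (z2 + 0.5) ** 2])    # 2-independent 4-dim sources

print(np.round(np.cov(s), 2))   # close to the covariance matrix quoted above
```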

C. Application to ECG data

Finally we illustrate how to apply the proposed algorithms to a real-world data set. Following [1], we will show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). The data set [13] consists of eight recorded signals with 2500 observations; the sampling frequency is misleadingly specified as 500 Hz (which would mean around 168 mother heartbeats per minute), it should be closer to around 250 Hz. We select the first three sensors, cutaneously recorded on the abdomen of the mother. In order to save space and to compare the results with [1] we plot only the first 1000 samples, see figure 4(a).



Fig. 4. Fetal ECG example. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). Figure (b) gives the extracted sources using MHICA with k = 2 and 500 Hessians. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) respectively the fetal source (component 3) onto the mixture space (a) are plotted.

Our goal is to extract an MECG and an FECG component; however, it cannot be expected to find only a one-dimensional MECG, due to the fact that projections of a three-dimensional (electric) vector field are measured. Hence modelling the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component) makes sense. Application of MHICA (with 500 Hessians) and MSOBI (with 50 autocorrelation matrices) extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the following estimated mixing matrices (A using MHICA and A′ using MSOBI):
$$A = \begin{pmatrix} 0.37 & 0.42 & -0.81\\ -0.75 & 0.89 & -0.16\\ 0.55 & -0.16 & 0.57 \end{pmatrix}, \qquad A' = \begin{pmatrix} 0.22 & 0.91 & -0.40\\ -0.84 & 0.23 & -0.33\\ 0.50 & 0.34 & 0.85 \end{pmatrix}.$$

The thus estimated sources using MHICA are plotted in figure 4(b). In order to compare the two mixing matrices, calculation of
$$A^{-1}A' = \begin{pmatrix} 0.85 & 1.02 & 0.64\\ -0.23 & 1.11 & 0.35\\ -0.01 & -0.08 & 0.98 \end{pmatrix}$$
yields a somewhat visible block structure; the performance index is E^{(2)}(A^{-1}A') = 1.12. The block structure is not very dominant, which indicates that the two models (block independence versus time-block-decorrelation) are not fully equivalent.

A (scaling invariant) decomposition of the observed ECG data can be achieved by composing the extracted sources using only the relevant mixing columns. For example, for the MECG part this means applying the projection Π_M := (a_1, a_2, 0) A^{-1} to the observations. This yields the projection matrices
$$\Pi_M = \begin{pmatrix} 0.52 & 0.38 & 0.84\\ -0.10 & 1.08 & 0.17\\ 0.34 & -0.27 & 0.41 \end{pmatrix}, \qquad \Pi_F = \begin{pmatrix} 0.48 & -0.38 & -0.84\\ 0.10 & -0.08 & -0.17\\ -0.34 & 0.27 & 0.59 \end{pmatrix}$$
onto the mother respectively the fetal ECG using MHICA, and
$$\Pi'_M = \begin{pmatrix} 0.78 & 0.21 & 0.45\\ -0.18 & 1.17 & 0.36\\ 0.47 & -0.44 & 0.05 \end{pmatrix}, \qquad \Pi'_F = \begin{pmatrix} 0.22 & -0.21 & -0.45\\ 0.18 & -0.17 & -0.36\\ -0.47 & 0.44 & 0.95 \end{pmatrix}$$
using MSOBI. The results of the first algorithm are plotted in figures 4(c) and (d). The fetal ECG is most active at sensor 1 (as visual inspection of the observation confirms). When comparing the projection matrices with the results from [1], we get quite high similarity of the ICA-based results, and a modest difference with the projections of the time-based algorithm. Other one-dimensional ICA-based results on this data set are reported for example in [14].

VI. CONCLUSION

We have shown how the idea of joint block diagonalization, as an extension of joint diagonalization, helps us to generalize ICA and time-structure based algorithms such as HICA and SOBI to the multidimensional ICA case. The thus defined algorithms are able to robustly decompose signals into groups of independent signals. In future work, besides more extensive experiments and tests with noise and outliers, we want to extend this result to a version of JADE using moments instead of cumulants, which preserve the block structure.

REFERENCES

[1] J. Cardoso, "Multidimensional independent component analysis," in Proc. of ICASSP '98, Seattle, 1998.
[2] A. Hyvärinen and P. Hoyer, "Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces," Neural Computation, vol. 12, no. 7, pp. 1705–1720, 2000.
[3] J.-F. Cardoso and A. Souloumiac, "Blind beamforming for non Gaussian signals," IEE Proceedings - F, vol. 140, no. 6, pp. 362–370, 1993.
[4] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique based on second order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[5] K. Abed-Meraim and A. Belouchrani, "Algorithms for joint block diagonalization," in Proc. EUSIPCO 2004, Vienna, Austria, 2004, pp. 209–212.
[6] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalization," SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161–164, Jan. 1995.
[7] F. Theis, "Uniqueness of complex and multidimensional independent component analysis," Signal Processing, vol. 84, no. 5, pp. 951–956, 2004.
[8] ——, "A new concept for separability problems in blind source separation," Neural Computation, vol. 16, pp. 1827–1850, 2004.
[9] J. Lin, "Factorizing multivariate function classes," in Advances in Neural Information Processing Systems, vol. 10, 1998, pp. 563–569.
[10] A. Yeredor, "Blind source separation via the second characteristic function," Signal Processing, vol. 80, no. 5, pp. 897–902, 2000.
[11] C. Févotte and C. Doncarli, "A unified presentation of blind separation methods for convolutive mixtures using block-diagonalization," in Proc. ICA 2003, Nara, Japan, 2003, pp. 349–354.
[12] S. Amari, A. Cichocki, and H. Yang, "A new learning algorithm for blind signal separation," Advances in Neural Information Processing Systems, vol. 8, pp. 757–763, 1996.
[13] B. De Moor (ed.), "DaISy: database for the identification of systems," Department of Electrical Engineering, ESAT/SISTA, K.U.Leuven, Belgium, Oct 2004. [Online]. Available: http://www.esat.kuleuven.ac.be/sista/daisy/
[14] L. De Lathauwer, B. De Moor, and J. Vandewalle, "Fetal electrocardiogram extraction by source subspace separation," in Proc. IEEE SP / ATHOS Workshop on HOS, Girona, Spain, 1995, pp. 134–138.




Chapter 8

Proc. NIPS 2006

Paper F.J. Theis. Towards a general independent subspace analysis. Proc. NIPS 2006, 2007

Reference (Theis, 2007)

Summary in section 1.3.3



Towards a general independent subspace analysis

Fabian J. Theis
Max Planck Institute for Dynamics and Self-Organisation &
Bernstein Center for Computational Neuroscience
Bunsenstr. 10, 37073 Göttingen, Germany
fabian@theis.name

Abstract

The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithm-independent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm.

1 Independent subspace analysis

A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k or simply k-ISA. So 1-ISA is equivalent to ICA.

1.1 Why extend ICA?

An important structural aspect in the search for decompositions is the knowledge of the number of solutions, i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions on the model.

However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector. Indeed,




Figure 1: Applying ICA to a random vector X = AS that does not fulfill the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which will be explained in section 3.

theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however, if k-ISA is applied to any random vector, a decomposition into groups that are only 'as independent as possible' cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in figure 1.

This work aims at finding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm.

1.2 Previous approaches to ISA for dependent component analysis

Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has first been introduced by Cardoso [4] using geometrical motivations. His model, as well as the related but independently proposed factorization of multivariate function classes [9], is quite general; however, no identifiability results were presented, and applicability to an arbitrary random vector was unclear. Later, in the special case of equal group sizes (k-ISA), uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have recently been studied by [10]. Moreover, if the observations contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13].

Hyvärinen and Hoyer presented a special case of k-ISA by combining it with invariant feature subspace analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree-dependent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation.



1.3 General ISA

Definition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S_1, . . . , S_k) with pairwise independent, irreducible random vectors S_i.

Note that in this case, the S_i are independent components of X. The idea behind this definition is that in contrast to ICA and k-ISA, we do not fix the size of the groups S_i in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each S_i and permutation of S_i of the same dimension¹. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made.

Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation.

Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from

Lemma 1.3. Let S = (S_1, . . . , S_k) be a square-integrable decomposition of S into irreducible independent components S_i. If X is an irreducible component of S, then X ∼ S_i for some i.

Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simplification of lemma 1.3 by ordering the components S_i according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4.

Lemma 1.4. Let S and X be defined as in lemma 1.3. In addition assume that dim S_i = dim X for i ≤ l and dim S_i < dim X for i > l. Then X ∼ S_i for some i ≤ l.

Lemma 1.5. Let S and X be defined as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S_1.

In order to prove lemma 1.5 (and hence the theorem), it is sufficient to show the following lemma:

Lemma 1.6. Let S = (S_1, S_2) with S_1 irreducible and m := dim S_1 > dim S_2 =: n. If X = AS is again irreducible for some m × (m + n) matrix A, then (i) the left m × m submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n submatrix of A vanishes.

(i) follows after some linear algebra, and is necessary to show the more difficult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S.

1.4 Dealing with Gaussians

In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated, hence independent, Gaussians is again independent, so clearly such a strong identification result would not be possible.

Recently, a general decomposition model dealing with Gaussians was proposed in the form of the so-called non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (S_N, S_G) and S_N a square-integrable m-dimensional random vector is called an m-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be m-decomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, S_N and S_G are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]:

¹ Note that scaling here implies a basis change in the component S_i, so for example in the case of a two-dimensional source component, this might be rotation and shearing. In the example later in figure 3, these indeterminacies can easily be seen by comparing true and estimated sources.



Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces.

Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2 and to get a general decomposition theorem, which characterizes solutions of ISA.

Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Proof. Existence is obvious. Uniqueness follows after first applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part.

2 Joint block diagonalization with unknown block-sizes

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1, . . . , MK} is to find an orthogonal matrix E such that E⊤MkE is diagonal for all k = 1, . . . , K, i.e. to minimize

f(Ê) := ∑_{k=1}^{K} ‖Ê⊤MkÊ − diagM(Ê⊤MkÊ)‖²_F

with respect to the orthogonal matrix Ê, where diagM(M) produces a matrix in which all off-diagonal elements of M have been set to zero, and ‖M‖²_F := tr(MM⊤) denotes the squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal matrix, so minimizing f is equivalent to maximizing

g(Ê) := ∑_{k=1}^{K} ‖diag(Ê⊤MkÊ)‖²,

where now diag(M) := (mii)_i denotes the diagonal of M. For the actual minimization of f, respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative application of Givens rotations in two coordinates [5].
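To make the two criteria concrete, the following minimal NumPy sketch (ours, not part of the paper; function names are illustrative) evaluates f and g for a candidate orthogonal matrix. The check at the end illustrates why minimizing f and maximizing g are equivalent: their sum is constant over orthogonal matrices.

    import numpy as np

    def offdiag_cost(E, Ms):
        """JD cost f(E): squared Frobenius norm of the off-diagonal part of E^T M_k E."""
        total = 0.0
        for M in Ms:
            T = E.T @ M @ E
            total += np.sum(T**2) - np.sum(np.diag(T)**2)
        return total

    def diag_gain(E, Ms):
        """Dual criterion g(E): squared norm of the diagonals of E^T M_k E."""
        return sum(np.sum(np.diag(E.T @ M @ E)**2) for M in Ms)

    # Because the Frobenius norm is invariant under orthogonal conjugation,
    # f(E) + g(E) = sum_k ||M_k||_F^2 is constant over orthogonal E.
    rng = np.random.default_rng(0)
    Ms = [np.cov(rng.standard_normal((4, 50))) for _ in range(3)]
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal matrix
    const = sum(np.sum(M**2) for M in Ms)
    assert np.allclose(offdiag_cost(Q, Ms) + diag_gain(Q, Ms), const)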

2.1 Generalization to blocks

In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤MkE is block-diagonal. Depending on the application, we fix the block-structure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition of n, i.e. a way of writing n as a sum of positive integers, where the order of the addends is not significant. So let² n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1, . . . , mr) ∈ N^r. An n × n matrix is said to be m-block diagonal if it is of the form

    ⎛ D1  ···  0  ⎞
    ⎜  ⋮   ⋱   ⋮  ⎟
    ⎝ 0   ···  Dr ⎠

with arbitrary mi × mi matrices Di.

As a generalization of JD in the case of a known block structure, we can formulate the joint m-block diagonalization (m-JBD) problem as the minimization of

f^m(Ê) := ∑_{k=1}^{K} ‖Ê⊤MkÊ − diagM^m(Ê⊤MkÊ)‖²_F

with respect to the orthogonal matrix Ê, where diagM^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure of non-block-diagonality. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.

Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set.

2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting.



For this it is necessary to require a block structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmax_{m : ∃E with f^m(E)=0} |m|. In practice, due to errors, a true JBD would always result in the trivial decomposition m = (n), so we define an approximate general JBD by requiring f^m(E) < ε for some fixed constant ε > 0 instead of f^m(E) = 0.

2.2 JBD by JD

A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [1]. This idea has been formulated as a conjecture [1], essentially claiming that a minimum of the JD cost function f already is a JBD, i.e. a minimum of the function f^m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially:

We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f, then there exists a permutation P such that f^m(EP) = 0 (given existence of a JBD of M). But of course f(EP) = f(E), so we will show why (certain) JBD solutions are minima of f. However, JD might

have additional minima. First note that clearly not every JBD minimizes f, only those such that in each block of size mk, the diagonality criterion g(E) restricted to the block is maximal over E ∈ O(mk). We will call such a JBD block-optimal in the following.

Theorem 2.1. Any block-optimal JBD of M (zero of f^m) is a local minimum of f.

Proof. Let E ∈ O(n) be block-optimal with f^m(E) = 0. We have to show that E is a local minimum of f, or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤MkE, we may already assume that each Mk is m-block diagonal, so we have to show that E = I is a local maximum of g.

Consider the elementary Givens rotation Gij(ε), defined for i < j and ε ∈ (−1, 1) as the orthogonal matrix in which all diagonal elements are 1 except for the two elements √(1 − ε²) in rows i and j, and all off-diagonal elements are 0 except for the two elements ε and −ε at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ε12, ε13, . . . , ε_{n−1,n}) := ∏_{i<j} Gij(εij) … > 0, and therefore h is negative definite in the direction εij. Altogether we get a negative definite h at 0 except for ‘trivial directions’, and hence a local maximum at 0.

2.3 Recovering the permutation

In order to perform JBD, we therefore only have to find a JD E of M. What is left according to the above theorem is to find a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially find P by some combinatorial optimization.


[Figure 2 shows three 40 × 40 matrix images: (a) the (unknown) block diagonal M1, (b) Ê⊤E without recovered permutation, and (c) Ê⊤E.]

Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) block-partition 40 = 1+2+2+3+3+5+6+6+6+6 in the presence of noise with an SNR of 5 dB. The product Ê⊤E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes, as claimed in section 2.2.

In the case of unknown block-size, we propose to use the following simple permutation-recovery algorithm: consider the mean diagonalized matrix D := K⁻¹ ∑_{k=1}^{K} E⊤MkE. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤MkE and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial.

We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from the possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n.

Algorithm 1: Block-diagonality permutation finder
Input: (n × n)-matrix D
Output: block-diagonal matrix P(D) := D′ such that D′ = PDP⊤ for a permutation matrix P
D′ ← D
for i ← 1 to n do
    repeat
        if (j0 ← min{j | j ≥ i and d′_{ij} = 0 and d′_{ji} = 0}) exists then
            if (k0 ← min{k | k > j0 and (d′_{ik} ≠ 0 or d′_{ki} ≠ 0)}) exists then
                swap column j0 of D′ with column k0
                swap row j0 of D′ with row k0
    until no swap has occurred
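The following NumPy rendering of algorithm 1 (a sketch of ours; the function name and the theta parameter are illustrative) applies the optional thresholding and then performs the column/row swaps described above, returning the compacted matrix together with the permutation it used.

    import numpy as np

    def permutation_finder(D, theta=0.0):
        """Permute rows/columns of (a possibly thresholded) D so that its
        non-zeros are clustered along the diagonal, as in algorithm 1."""
        Dp = np.where(np.abs(D) > theta, D, 0.0).copy()   # threshold small entries
        n = Dp.shape[0]
        perm = np.arange(n)
        for i in range(n):
            swapped = True
            while swapped:
                swapped = False
                # first position j >= i where both the row and the column entry vanish
                zeros = [j for j in range(i, n) if Dp[i, j] == 0 and Dp[j, i] == 0]
                if zeros:
                    j0 = zeros[0]
                    # first later position carrying a non-zero in row or column i
                    nz = [k for k in range(j0 + 1, n) if Dp[i, k] != 0 or Dp[k, i] != 0]
                    if nz:
                        k0 = nz[0]
                        Dp[:, [j0, k0]] = Dp[:, [k0, j0]]   # swap columns
                        Dp[[j0, k0], :] = Dp[[k0, j0], :]   # swap rows
                        perm[[j0, k0]] = perm[[k0, j0]]
                        swapped = True
        return Dp, perm

The block sizes, and hence the partition m, can then be read off from the zero pattern of the returned matrix.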

We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefficients chosen uniformly at random from [−1, 1], and symmetrized by Dk ← (Dk + Dk⊤)/2. After that, they have been mixed by a random orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDkE⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5 dB. Application of the JBD algorithm from above to {M1, . . . , MK} with threshold θ = 0.1 correctly recovers the block sizes, and the estimated block diagonalizer Ê equals E up to m-scaling and permutation, as illustrated in figure 2.
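A minimal sketch (ours) of how such a test set could be generated; the exact noise scaling for the 5 dB SNR is not specified in the text, so the amplitude-ratio convention used below is an assumption.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(0)
    m = [1, 2, 2, 3, 3, 5, 6, 6, 6, 6]          # hidden block sizes, sum = 40
    n, K, snr_db = sum(m), 100, 5.0

    E = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthogonal mixing matrix
    Ms = []
    for _ in range(K):
        D = block_diag(*[rng.uniform(-1, 1, (s, s)) for s in m])
        D = (D + D.T) / 2                               # symmetrize the blocks
        M = E @ D @ E.T
        noise = rng.standard_normal((n, n))
        # one way to realize a 5 dB signal-to-noise ratio (assumption)
        noise *= np.linalg.norm(M) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
        Ms.append(M + noise)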

3 SJADE — a simple algorithm for general ISA

As usual, by preprocessing the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let


[Figure 3 consists of scatter/density plots: (a) S2, (b) S3, (c) S4, (d) S5, (e) Â⁻¹A, and the recovered sources (f) (Ŝ1, Ŝ2), (g) histogram of Ŝ3, (h) Ŝ4, (i) Ŝ5, (j) Ŝ6.]

Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the scatter plots, i.e. densities, of the source components and the mixing-separating map Â⁻¹A.

also Cov(S) = I. Then I = Cov(X) = A Cov(S) A⊤ = AA⊤, so A is orthogonal. Due to the ISA assumptions, the fourth-order cross-cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to find transformations of the mixtures fulfilling this property, we follow the idea of the JADE algorithm, but now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices defined by Cij(X) := E[X⊤EijX · XX⊤] − Eij − Eij⊤ − tr(Eij) I. Here RX := Cov(X) and Eij is a set of eigen-matrices of Cij, 1 ≤ i, j ≤ n. One simple choice is to use n² matrices Eij with zeros everywhere except a 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE), not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii's, but also automatically finds the subspace partition m using the general JBD algorithm from section 2.3.
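As a concrete illustration, the following NumPy sketch (ours, not from the paper) estimates the n² contracted quadricovariance matrices for whitened data with the simple single-entry choice of Eij; these are the matrices that SJADE block-diagonalizes. The crude whitening in the usage lines is only for demonstration.

    import numpy as np

    def quadricovariance_matrices(X):
        """X: whitened data of shape (n, T) with Cov(X) ~ I.
        Returns the n*n matrices C_ij = E[x^T E_ij x * x x^T] - E_ij - E_ij^T - tr(E_ij) I,
        where E_ij is the single-entry matrix with a 1 at (i, j)."""
        n, T = X.shape
        I = np.eye(n)
        Cs = []
        for i in range(n):
            for j in range(n):
                w = X[i] * X[j]                     # x^T E_ij x = x_i * x_j
                M = (w * X) @ X.T / T               # E[(x_i x_j) x x^T]
                Eij = np.zeros((n, n)); Eij[i, j] = 1.0
                Cs.append(M - Eij - Eij.T - np.trace(Eij) * I)
        return Cs

    # usage: whiten some data, then jointly block-diagonalize the returned set
    rng = np.random.default_rng(1)
    S = rng.standard_normal((4, 2000))
    X = S - S.mean(axis=1, keepdims=True)
    X = np.linalg.cholesky(np.linalg.inv(np.cov(X))).T @ X   # crude whitening
    mats = quadricovariance_matrices(X)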

4 Experimental results

In a first example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d letters or 3d geometrical shapes. The chosen source densities are shown in figure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 10⁴ samples were used. The sources S were mixed by a mixing matrix A with coefficients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The mixing matrix Â was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). Figure 3(e) shows the composed mixing-separating system Â⁻¹A; clearly the matrices are equal except for block permutation and scaling, which experimentally confirms theorem 1.8. The algorithm found a partition m̂ = (1, 1, 1, 2, 2, 3), so one 2d source was misinterpreted as two 1d sources, but using prior knowledge, combination of the correct two 1d sources yields the original 2d source. The resulting recovered sources Ŝ := Â⁻¹X, figures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in figure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is



performed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting, as observed in [4].

In a second example, we illustrate how the algorithm deals with Gaussian sources, i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 10⁵ samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix A, and apply SJADE with θ = 0.01. The recovered mixing matrix Â is compared with A by taking the ad-hoc measure ι(P) := ∑_{i=1}^{3} ∑_{j=1}^{2} (p²_{ij} + p²_{ji}) for P := Â⁻¹A. Indeed, we get nearly perfect recovery in 99 out of 100 runs; the median of ι(P) is very low at 0.0083. A single run diverges with ι(P) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The first two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has a kurtosis of around −1.2, which equals the kurtosis of a uniform density. This confirms the applicability of the algorithm in the general, noisy ISA setting.
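The kurtosis values quoted above are excess kurtoses; a quick numerical check of the −1.2 reference value for a uniform density (a sketch of ours):

    import numpy as np

    u = np.random.default_rng(2).uniform(-1, 1, 100_000)
    excess_kurtosis = np.mean(u**4) / np.var(u)**2 - 3
    print(excess_kurtosis)   # close to -6/5 = -1.2, the excess kurtosis of a uniform density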

5 Conclusion

Previous approaches for independent subspace analysis were restricted either to fixed group sizes or to semi-parametric models. In neither case was general applicability to any kind of mixture data set guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identifiability result for this general, parameter-free model, together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but includes the noisy case via NGCA). However, in practice, in the finite-sample case the general joint block diagonality holds only approximately due to estimation errors. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future work.

References

[1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004.
[2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003.
[3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. JMLR, 7:247–282, 2006.
[4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP '98, Seattle, 1998.
[5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995.
[6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[7] A. Hyvärinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[8] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525–1558, 2001.
[9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998.
[10] B. Poczos and A. Lörincz. Independent subspace analysis using k-nearest neighborhood distances. In Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer.
[11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004.
[12] F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006.
[13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.




Chapter 9

Neurocomputing (in press), 2007

Paper F.J. Theis, P. Gruber, I.R. Keck, and E.W. Lang. A robust model for spatiotemporal dependencies. Neurocomputing (in press), 2007

Reference (Theis et al., 2007b)

Summary in section 1.3.3



A Robust Model for Spatiotemporal Dependencies

Fabian J. Theis a,b,∗, Peter Gruber b, Ingo R. Keck b, Elmar W. Lang b

a Bernstein Center for Computational Neuroscience, Max-Planck-Institute for Dynamics and Self-Organisation, Göttingen, Germany
b Institute of Biophysics, University of Regensburg, Regensburg, Germany

Abstract

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. Here, we propose an algorithm including such spatiotemporal information into the analysis, and reduce the problem to the joint approximate diagonalization of a set of autocorrelation matrices. We demonstrate the feasibility of the algorithm by applying it to functional MRI analysis, where previous approaches are outperformed considerably.

Key words: blind source separation, independent component analysis, functional magnetic resonance imaging, autodecorrelation
PACS: 07.05.Kf, 87.61.–c, 05.40.–a, 05.45.Tp

1 Introduction

Blind source separation (BSS) describes the task of recovering an unknown mixing process and underlying sources of an observed data set. It has numerous applications in fields ranging from signal and image processing to the separation of speech and radar signals to financial data analysis. Many BSS algorithms assume either independence (independent component analysis, ICA) or diagonal autocorrelations of the sources [1,2]. Here we extend BSS algorithms based on time-decorrelation [3–8]. They rely on the fact that the data

∗ corresponding author
Email address: fabian@theis.name (Fabian J. Theis).
URL: http://fabian.theis.name (Fabian J. Theis).

Preprint submitted to Elsevier 15 May 2007



sets have non-trivial autocorrelations so that the unknown mixing matrix can be recovered by generalized eigenvalue decomposition.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. [9], it is a promising method with potential applications in areas where data contains an inherent spatiotemporal structure, such as data from biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm [10], which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are being performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization. This has the advantage of greatly reducing the number of parameters in the system, and leads to more stable optimization algorithms. We focus on so-called algebraic BSS algorithms [3,5,6,11], reviewed for example in [12], which employ generalized eigenvalue decomposition and joint diagonalization for the factorization. The corresponding learning rules are essentially parameter-free and are known to be robust and efficient [13].

In this contribution, we extend Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structures of the data. In the experiments presented, we observe good performance of the proposed algorithm when applied to noisy, high-dimensional data sets acquired from functional magnetic resonance imaging (fMRI). We concentrate on fMRI as it is well suited for spatiotemporal decomposition because spatial activation networks are mixed with functional and structural temporal components.

2 Blind source separation

We consider the following temporal BSS problem: Let x(t) be a second-order stationary, zero-mean, m-dimensional stochastic process and A a full-rank matrix such that x(t) = As(t) + n(t). The n-dimensional source signals s(t) are assumed to have diagonal autocorrelations Rτ(s) := ⟨s(t + τ)s(t)⊤⟩ for all τ, and the additive noise n(t) is modeled by a stationary, temporally and spatially white zero-mean process with variance σ². x(t) is observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A†x(t), which is optimal in the maximum-likelihood sense, where A† denotes the pseudo-inverse of A and m ≥ n. So the BSS task reduces to the estimation of the mixing matrix A.




3 Separation based on time-delayed decorrelation

For τ ≠ 0, the mixture autocorrelations factorize¹,

Rτ(x) = A Rτ(s) A⊤.   (1)

This gives an indication of how to recover A from x(t). The correlation of the signal part x̃(t) := As(t) of the mixtures x(t) may be calculated as R0(x̃) = R0(x) − σ²I, provided that the noise variance σ² is known. After whitening of x̃(t), i.e. joint diagonalization of R0(x̃), we can assume that x̃(t) has unit correlation and that m = n, so A is orthogonal². If more signals than sources are observed, dimension reduction can be performed in this step, thus reducing noise [14]. The symmetrized autocorrelation of x(t), R̄τ(x) := (1/2)(Rτ(x) + (Rτ(x))⊤), factorizes as well, R̄τ(x) = A R̄τ(s) A⊤, and by assumption R̄τ(s) is diagonal. Hence this factorization represents an eigenvalue decomposition of the symmetric matrix R̄τ(x). If we furthermore assume that R̄τ(x), or equivalently R̄τ(s), has n distinct eigenvalues, then A is already uniquely determined by R̄τ(x) except for column permutation. In addition to this separability result, a BSS algorithm, namely time-delayed decorrelation [3,4], is obtained by the diagonalization of R̄τ(x) after whitening — the diagonalizer yields the desired separating matrix.

However, this decorrelation approach decisively depends on the choice of τ — if an eigenvalue of R̄τ(x) is degenerate, the algorithm fails. Moreover, we face misestimates of R̄τ(x) due to finite-sample effects, so using additional statistics is desirable. Therefore, Belouchrani et al. [5], see also [6], proposed a more robust BSS algorithm, called second-order blind identification (SOBI), jointly diagonalizing a whole set of autocorrelation matrices R̄k(x) with varying time lags, for simplicity indexed by k = 1, . . . , K. They showed that increasing K improves SOBI performance in noisy settings [2]. Algorithm speed decreases linearly with K, so in practice K ranges from 10 to 100. Various numerical techniques for joint diagonalization exist, essentially minimizing ∑_{k=1}^{K} off(A⊤ R̄k(x) A) with respect to A, where off denotes the square sum of the off-diagonal terms. A global minimum of this function is called an (approximate) joint diagonalizer³, and it can be determined algorithmically for example by iterative Givens rotations [13].

1 s(t) and n(t) can be decorrelated, so Rτ(x) = ⟨As(t + τ)s(t)⊤A⊤⟩ + ⟨n(t + τ)n(t)⊤⟩ = A Rτ(s) A⊤, where the last equality follows because τ ≠ 0 and n(t) is white.
2 By assumption, R0(s) = I, hence I = ⟨As(t)s(t)⊤A⊤⟩ = A R0(s) A⊤ = AA⊤, so A is orthogonal.
3 The case of perfect diagonalization, i.e. of a zero-valued minimum, occurs if and only if all matrices that are to be diagonalized commute, which is equivalent to the matrices sharing the same system of eigenvectors.
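To make the single-lag variant concrete, here is a minimal NumPy sketch (ours; names and the toy data are illustrative) of time-delayed decorrelation: whiten the observations, then eigendecompose one symmetrized lagged autocorrelation; the eigenvector matrix is the estimated orthogonal mixing matrix of the whitened data.

    import numpy as np

    def sym_autocorr(X, tau):
        """Symmetrized lagged autocorrelation of a zero-mean (n, T) data matrix."""
        T = X.shape[1]
        R = X[:, tau:] @ X[:, :T - tau].T / (T - tau)
        return (R + R.T) / 2

    def time_delayed_decorrelation(X, tau=1):
        """Whitening followed by diagonalization of one lagged autocorrelation."""
        X = X - X.mean(axis=1, keepdims=True)
        d, U = np.linalg.eigh(np.cov(X))
        V = U @ np.diag(1.0 / np.sqrt(d)) @ U.T        # whitening: V R0 V^T = I
        Z = V @ X
        _, W = np.linalg.eigh(sym_autocorr(Z, tau))    # orthogonal diagonalizer
        return W.T @ Z, W.T @ V                        # estimated sources, unmixing matrix

    # toy usage: two AR(1) sources with different autocorrelations, mixed linearly
    rng = np.random.default_rng(0)
    T = 5000
    s = np.zeros((2, T))
    for t in range(1, T):
        s[:, t] = np.array([0.9, 0.2]) * s[:, t - 1] + rng.standard_normal(2)
    x = rng.standard_normal((2, 2)) @ s
    s_hat, W_unmix = time_delayed_decorrelation(x, tau=1)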




4 Spatiotemporal structures

Real-world data sets often possess structure in addition to the simple factorization models treated above. For example, fMRI measurements contain both temporal and spatial indices, so a data entry x = x(r1, r2, r3, t) can depend on position r := (r1, r2, r3) as well as time t. More generally, we want to consider data sets x(r, t) depending on two indices r and t, where r ∈ R^n can be any multidimensional (spatial) index and t indexes the time axis. In practice this generalized random process is realized by a finite number of samples. For example, in the case of fMRI scans we could assume t ∈ [1 : T] := {1, 2, . . . , T} and r ∈ [1 : h] × [1 : w] × [1 : d], where T is the number of scans of size h × w × d. So the number of spatial observations is ˢm := hwd and the number of temporal observations is ᵗm := T.

4.1 Temporal and spatial separation

For such multi-structured data, two methods of source separation exist. In temporal BSS, we interpret the data to contain a measured time series x_r(t) := x(r, t) for each spatial location r. Then our goal is to apply BSS to the temporal observation vector ᵗx(t) := (x_{r111}(t), . . . , x_{rhwd}(t))⊤ containing ˢm entries, i.e. consisting of ˢm spatial observations. In other words we want to find a decomposition ᵗx(t) = ᵗA ᵗs(t) with temporal mixing matrix ᵗA and temporal sources ᵗs(t), possibly of lower dimension. This contrasts with so-called spatial BSS, where the data is considered to be composed of T spatial patterns x_t(r) := x(r, t). Spatial BSS tries to decompose the spatial observation vector ˢx(r) := (x_{t1}(r), . . . , x_{tT}(r))⊤ ∈ R^{ᵗm} into ˢx(r) = ˢA ˢs(r) with a spatial mixing matrix ˢA and spatial sources ˢs(r), possibly of lower dimension. In this case, using multidimensional autocorrelations considerably enhances the separation [7,8]. In order to be able to use matrix notation, we contract the spatial multidimensional index r into a one-dimensional index r by row concatenation; the full multidimensional structure will only be needed later in the calculation of the multidimensional autocorrelation. Then the data set x(r, t) =: x_rt can be represented by a data matrix X of dimension ˢm × ᵗm, and our goal is to determine a source matrix S, either spatially or temporally.
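For concreteness, the bookkeeping from volume scans to the ˢm × ᵗm data matrix X amounts to a single reshape; a small sketch of ours, where the array sizes and layout are purely illustrative:

    import numpy as np

    # hypothetical fMRI volume series: T scans of size h x w x d
    h, w, d, T = 4, 5, 3, 100
    scans = np.random.default_rng(0).standard_normal((h, w, d, T))

    # contract the spatial index by row concatenation: X has shape (sm, tm) = (h*w*d, T)
    X = scans.reshape(h * w * d, T)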

4.2 Spatiotemporal matrix factorization

Temporal BSS implies the matrix factorization X = ᵗA ᵗS, whereas spatial BSS implies the factorization X⊤ = ˢA ˢS, or equivalently X = ˢS⊤ ˢA⊤. Hence

X = ᵗA ᵗS = ˢS⊤ ˢA⊤.   (2)




So both source separation models can be interpreted as matrix factorization problems; in the temporal case, restrictions such as diagonal autocorrelations are determined by the second factor, in the spatial case by the first one.

In order to achieve a spatiotemporal model, we require these conditions from both factors at the same time. In other words, instead of recovering a single source data set which fulfills the source conditions spatiotemporally, we try to find two source matrices, a spatial and a temporal source matrix, and the conditions are put onto the matrices separately. So the spatiotemporal BSS model can be derived from equation (2) as the factorization problem

X = ˢS⊤ ᵗS   (3)

with spatial source matrix ˢS and temporal source matrix ᵗS, which both have (multidimensional) autocorrelations that are as diagonal as possible. Diagonality of the autocorrelations is invariant under scaling and permutation, so the above model contains these indeterminacies — indeed the spatial and temporal sources can interchange scaling (L) and permutation (P) matrices, ˢS⊤ ᵗS = (L⁻¹P⁻¹ ˢS)⊤(LP ᵗS), and the model assumptions still hold. The spatiotemporal BSS problem as defined in equation (3) has been implicitly proposed in [9], equation (5), in combination with a dimension reduction scheme. Here, we first operate on the general model and derive the cost function based on autodecorrelation, and only later combine this with a dimension reduction method.

5 Algorithmic spatiotemporal BSS

Stone et al. [9] first proposed the model from equation (3), where a joint energy function is employed based on mutual entropy and Infomax. Apart from the many parameters used in the algorithm, the involved gradient descent optimization is susceptible to noise, local minima and inappropriate initializations, so we propose a novel, more robust algebraic approach in the following. It is based on the joint diagonalization of source conditions posed not only temporally but also spatially at the same time.

5.1 Spatiotemporal BSS using joint diagonalization

Shifting to matrix notation, we interpret R̄k(X) := R̄k(ᵗx(t)) as a symmetrized temporal autocorrelation matrix, whereas R̄k(X⊤) := R̄k(ˢx(r)) denotes the corresponding spatial, possibly multidimensional, symmetrized autocorrelation matrix. Here k indexes the one- or multidimensional lags τ.




Application of the spatiotemporal mixing model from equation (3) together with the transformation properties of the Rk's yields

Rk(X) = Rk(ˢS⊤ ᵗS) = ˢS⊤ Rk(ᵗS) ˢS
Rk(X⊤) = Rk(ᵗS⊤ ˢS) = ᵗS⊤ Rk(ˢS) ᵗS,   (4)

so

R̄k(ᵗS) = ˢS†⊤ R̄k(X) ˢS†
R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS†,   (5)

because ∗m ≥ n and hence ∗S ∗S† = I, where ∗ denotes either s or t. By assumption the matrices R̄k(∗S) are as diagonal as possible. Hence we can find one of the two source sets by jointly diagonalizing either R̄k(X) or R̄k(X⊤) for all k. The other source matrix can then be calculated by equation (3). However, we would then only be using either temporal or spatial properties, so this corresponds to only temporal or spatial BSS.

In order to include the full spatiotemporal information, we have to find diagonalizers for both R̄k(X) and R̄k(X⊤) such that they satisfy the spatiotemporal model (3). For now, let us assume the (unrealistic) case of ˢm = ᵗm = n — we will deal with the general problem using dimension reduction later. Then all matrices can be assumed to be invertible, and by model (3) we get ˢS⊤ = X ᵗS⁻¹. Applying this to equations (5) together with an inversion of the second equation yields

R̄k(ᵗS) = ᵗS X† R̄k(X) X†⊤ ᵗS⊤
R̄k(ˢS)⁻¹ = ᵗS R̄k(X⊤)⁻¹ ᵗS⊤.   (6)

So we can separate the data spatiotemporally by jointly diagonalizing the set of matrices {X† R̄k(X) X†⊤, R̄k(X⊤)⁻¹ | k = 1, . . . , K}.

Hence the goal of achieving spatiotemporal BSS ‘as much as possible’ means minimizing the joint error term of the above joint diagonalization criterion. Moreover, either spatial or temporal separation can be favored by introducing a weighting factor α ∈ [0, 1]. The set for approximate joint diagonalization is then defined by

{α X† R̄k(X) X†⊤, (1 − α) R̄k(X⊤)⁻¹ | k = 1, . . . , K}.   (7)

If A is a diagonalizer of (7), then the sources can be estimated by ᵗŜ = A⁻¹ and ˢŜ = A⊤X⊤. Joint diagonalization is usually performed by optimizing the off-diagonal criterion from above, so different scale factors in the matrices indeed yield different optima if the diagonalization cannot be achieved fully.



According to equations (6), the higher α, the more temporal separation is stressed. In the limit case α = 1 only the temporal criterion is optimized, so temporal BSS is performed, whereas for α = 0 a spatial BSS is calculated, although we want to remark that, in contrast to the temporal case, the cost function for α = 0 does not equal the spatial SOBI cost function due to the additional inversion. In practice, in order to be able to weight the matrix sets using α appropriately, a normalization by multiplication with a constant separately within the two sets seems appropriate to guarantee equal scales of the two matrix sets.

5.2 Dimension reduction

In principle, we may now use diagonalization of the matrix set from (7) to perform spatiotemporal BSS — but only in the case of equal dimensions. Furthermore, apart from computational issues involving the high dimensionality, the BSS estimate would be poor, simply because in the estimation of either R̄k(X) or R̄k(X⊤) equal or fewer samples than signals are available. Hence dimension reduction is essential.

Our goal is to extract only n ≪ min{ˢm, ᵗm} sources. A common approach to do so is to approximate X by the reduced singular value decomposition X ≈ UDV⊤, where only the n largest values of the diagonal matrix D and the corresponding columns of the pseudo-orthogonal matrices U and V are used. Plugging this approximation of X into (5) shows after some calculation⁴ that the set of matrices from equation (7) can be rewritten as

{α R̄k(D^{1/2}V⊤), (1 − α) R̄k(D^{1/2}U⊤)⁻¹ | k = 1, . . . , K}.   (8)

If A is a joint diagonalizer of this set, we may estimate the sources by ᵗŜ = A⊤D^{1/2}V⊤ and ˢŜ = A⁻¹D^{1/2}U⊤. We call the resulting algorithm spatiotemporal second-order blind identification or stSOBI, generalizing the temporal SOBI algorithm.

4 Using the approximation X ≈ (UD^{1/2})(VD^{1/2})⊤ together with the spatiotemporal BSS model (3) yields (UD^{−1/2})⊤ ˢS⊤ ᵗS (VD^{−1/2}) = I. Hence W := ᵗS V D^{−1/2} is an invertible n × n matrix. The first equation of (6) still holds in the more general case and we get R̄k(ᵗS) = ᵗS X̂† R̄k(X̂) X̂†⊤ ᵗS⊤ = W R̄k(D^{1/2}V⊤) W⊤. The second equation of (6) cannot hold for n < ∗m, but we can derive a similar result from (5), where we use W⁻¹ = D^{−1/2}V⊤ ᵗS†: R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS† = W⁻⊤ R̄k(D^{1/2}U⊤) W⁻¹, which we can now invert to get R̄k(ˢS)⁻¹ = W R̄k(D^{1/2}U⊤)⁻¹ W⊤.
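The following NumPy sketch (ours, not the authors' package code) builds the weighted matrix set of equation (8) from the reduced SVD. For simplicity it uses one-dimensional lags only (the paper also allows multidimensional spatial autocorrelations), assumes the spatial autocorrelation matrices are invertible, and omits the normalization of the two sub-sets mentioned above; joint diagonalization of the returned list then yields the stSOBI estimates.

    import numpy as np

    def sym_autocorr(Y, tau):
        """Symmetrized lagged autocorrelation of the rows of Y."""
        T = Y.shape[1]
        R = Y[:, tau:] @ Y[:, :T - tau].T / (T - tau)
        return (R + R.T) / 2

    def stsobi_matrix_set(X, n=4, K=10, alpha=0.5):
        """Weighted spatiotemporal matrix set of equation (8), from X ~ U D V^T."""
        U, d, Vt = np.linalg.svd(X, full_matrices=False)
        U, d, Vt = U[:, :n], d[:n], Vt[:n, :]
        Dt = np.diag(np.sqrt(d)) @ Vt     # D^(1/2) V^T : temporal factor
        Ds = np.diag(np.sqrt(d)) @ U.T    # D^(1/2) U^T : spatial factor
        mats  = [alpha * sym_autocorr(Dt, k) for k in range(1, K + 1)]
        # spatial autocorrelations are assumed well-conditioned here
        mats += [(1 - alpha) * np.linalg.inv(sym_autocorr(Ds, k)) for k in range(1, K + 1)]
        return mats, Dt, Ds

Given a joint diagonalizer A of the returned set, the source estimates follow the formulas above, ᵗŜ = A⊤Dt and ˢŜ = A⁻¹Ds.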



[Figure 1 shows (a) the four recovered component maps ˢS and (b) the corresponding time courses ᵗS with stimulus crosscorrelations cc = 0.05, 0.17, 0.14 and 0.89.]

Fig. 1. fMRI analysis using stSOBI with temporal and two-dimensional spatial autocorrelations. The data was reduced to the 4 largest components. (a) shows the recovered component maps (brain background is given using a structural scan; overlaid white points indicate activation values stronger than 3 standard deviations), and (b) their time courses. Component 3 partially contains the frontal eye fields. Component 4 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes) — their crosscorrelation lies at cc = 0.89 — with a delay of roughly 6 seconds induced by the BOLD effect.

5.3 Implementation

In the experiments we use stSOBI with both one-dimensional and multidimensional autocovariances. Our software package⁵ implements all the details of mdSOBI and its extension stSOBI in Matlab. In addition to Cardoso's joint diagonalization algorithm based on iterative Givens rotations, the package contains all the files needed to reproduce the results described in this paper, with the exception of the fMRI data set.
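For readers without the package, here is a compact sketch (ours) of an orthogonal joint diagonalizer of the Jacobi/Givens-rotation type, following the rotation-angle update of Cardoso and Souloumiac; it is a re-implementation for illustration, not the package code.

    import numpy as np

    def jacobi_joint_diag(Ms, sweeps=100, tol=1e-12):
        """Approximate orthogonal joint diagonalization of symmetric matrices
        by iterative Givens rotations (Cardoso/Souloumiac-style sketch)."""
        Ms = [M.copy() for M in Ms]
        n = Ms[0].shape[0]
        V = np.eye(n)
        for _ in range(sweeps):
            changed = False
            for p in range(n - 1):
                for q in range(p + 1, n):
                    # 2x2 subproblem accumulated over all matrices
                    h = np.array([[M[p, p] - M[q, q], M[p, q] + M[q, p]] for M in Ms])
                    G = h.T @ h
                    ton, toff = G[0, 0] - G[1, 1], G[0, 1] + G[1, 0]
                    theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                    c, s = np.cos(theta), np.sin(theta)
                    if abs(s) > tol:
                        changed = True
                        for M in Ms:
                            Mp, Mq = M[:, p].copy(), M[:, q].copy()
                            M[:, p], M[:, q] = c * Mp + s * Mq, -s * Mp + c * Mq
                            Mp, Mq = M[p, :].copy(), M[q, :].copy()
                            M[p, :], M[q, :] = c * Mp + s * Mq, -s * Mp + c * Mq
                        Vp, Vq = V[:, p].copy(), V[:, q].copy()
                        V[:, p], V[:, q] = c * Vp + s * Vq, -s * Vp + c * Vq
            if not changed:
                break
        return V, Ms   # V^T M_k V is approximately diagonal for each input matrix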

6 Results

BSS, mainly based on ICA, is nowadays a quite common tool in fMRI analysis [15,16]. For this work, we analyzed the performance of stSOBI when applied to fMRI measurements. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans (TR/TE = 3000/60 ms) with 5 slices each were acquired with 5 periods of rest and 5 photic stimulation periods.

5 available online at http://fabian.theis.name/




Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. Resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point, and a dark background with a central fixation point during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment [17]. For visualization, we only considered a single slice (non-brain areas were masked out), and chose to reduce the data set to n = 4 components by singular value decomposition.

6.1 Single subject analysis

In the joint diagonalization, K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation. Figure 1 shows the performance of the algorithm for equal spatiotemporal weighting α = 0.5. Although the data was reduced to only 4 components, stSOBI was able to extract the stimulus component (#4) very well; the crosscorrelation of the identified task component with the time-delayed stimulus is high (cc = 0.89). Some additional brain components are detected, although a higher n would allow for more elaborate decompositions.

In order to compare spatial and temporal models, we applied stSOBI with varying spatiotemporal weighting factors α ∈ {0, 0.1, . . . , 1}. The task component was always extracted, although with different quality. In figure 2, we plotted the maximal crosscorrelation of the time courses with the stimulus versus α. If only spatial separation is performed, the identified stimulus component is considerably worse (cc = 0.8) than in the case of temporal recovery (cc = 0.9); the component maps coincide rather well. The enhanced extraction confirms the advantages of spatiotemporal separation in contrast to the commonly used spatial-only separation. Temporal separation alone, although preferable in the presented example, often faces the problem of high dimensions and low sample number, so an adjustable weighting α as proposed here allows for the highest flexibility.

6.2 Algorithm comparison

We then compared our analysis with some established algorithms for fMRI analysis. In order to numerically perform the comparisons, we determined the single component that is maximally autocorrelated with the known stimulus task. These components are shown in figure 3.



[Figure 2 plots the crosscorrelation with the stimulus against α, together with the recovered component maps ˢS for α = 1 and α = 0.]

Fig. 2. Performance of stSOBI for varying α. Low α favors spatial separation, high α temporal separation. Two recovered component maps are plotted for the extremal cases of spatial (α = 0) and temporal (α = 1) separation.

After some difficulties due to the many possible parameters, Stone's stICA algorithm [9] was applied to the data. However, the task component could not be recovered very well — it showed some activity in the visual cortex, but with a rather low temporal crosscorrelation of 0.53 with the stimulus component, which is much lower than the 0.9 of the multi-dimensional stSOBI and the 0.86 of stSOBI with one-dimensional autocovariances. We believe that this is due to convergence problems of the employed Infomax rule, and to non-trivial tuning of the many parameters involved in the algorithm. In order to test for convergence issues, we combined stSOBI and stICA by applying Stone's local stICA algorithm to the stSOBI separation results. Due to this careful initialization, the stICA result improved (crosscorrelation of 0.58) but was still considerably lower than the stSOBI result.

Similar results were achieved by the well-known FastICA algorithm [18], which we applied in order to identify spatially independent components. The algorithm could not recover the stimulus component (maximal crosscorrelation of 0.51, and no activity in the visual cortex). This poor result is due to the dimension reduction to only 4 components, and coincides with the decreased performance of stSOBI in the spatial case α = 0. In this respect, the spatiotemporal model is obviously much more flexible, as spatiotemporal dimension reduction is able to capture the structure better than only spatial reduction.

Finally, we tested the robustness of the spatiotemporal framework by modifying the cost function. It is well known that sources with varying source properties can be separated by modifying the source condition matrices.



[Figure 3 shows the stimulus time course (top) and, below it, the maximally stimulus-correlated component time courses recovered by stNSS, stSOBI (1D), stICA after stSOBI, stICA and fastICA.]

Fig. 3. Comparison of the recovered component that is maximally autocrosscorrelated with the stimulus task (top) for various BSS algorithms, after dimension reduction to 4 components. The absolute corresponding autocorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA).

Instead of calculating autocovariance matrices, other statistics of the spatial and temporal sources can be used, as long as they satisfy the factorization from equation (1). This results in ‘algebraic BSS’ algorithms such as AMUSE [3], JADE [11], SOBI and TDSEP [5,6], reviewed for instance in [12]. Instead of performing autodecorrelation, we used the idea of the NSS-SD algorithm (‘non-stationary sources with simultaneous diagonalization’) [19], cf. [20]: the sources were assumed to be spatiotemporal realizations of non-stationary random processes ∗si(t) with ∗ ∈ {t, s} determining the temporal or spatial direction. If we assume that the resulting covariance matrices C(∗s(t)) vary sufficiently with time, the factorization of equation (1) also holds for these covariance matrices. Hence, joint diagonalization of

{C(ᵗx(1)), C(ˢx(1)), C(ᵗx(2)), C(ˢx(2)), . . .}

allows for the calculation of the mixing matrix. The covariance matrices are commonly estimated in separate non-overlapping temporal or spatial windows. Replacing the autocovariances in (8) by the windowed covariances results in the spatiotemporal NSS-SD or stNSS algorithm.
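A small sketch (ours; the function name and window count are illustrative) of the windowed covariance estimation used as the stNSS source condition: the data is split into non-overlapping blocks along one axis and one covariance matrix is estimated per block.

    import numpy as np

    def windowed_covariances(X, n_windows=12):
        """Covariance matrices of the rows of X estimated in non-overlapping
        windows along the second axis (NSS-SD-style source condition)."""
        blocks = np.array_split(np.arange(X.shape[1]), n_windows)
        return [np.cov(X[:, b]) for b in blocks]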

[Figure 4 shows boxplots: (a) separation performance (autocorrelation with the stimulus) and computational effort for the 10 subjects, (b) autocorrelation with the stimulus after spatial subsampling, and (c) computation time after subsampling, the latter two plotted against the subsampling percentage (1 to 200%).]

Fig. 4. Multiple subject comparison. (a) shows the algorithm performance in terms of separation quality (autocorrelation with stimulus) and computation time when compared over 100 runs and 10 subjects. (b) and (c) compare these indices after subsampling the data spatially with varying percentages.

In the fMRI example, we applied stNSS using one-dimensional covariance matrices and 12 windows (both temporally and spatially). Although the data exhibited only weak non-stationarities (the mean masked voxel values vary from 983 to 1000 over the 98 time steps, with a standard deviation varying from 228 to 234), the task component could be extracted rather well, with a crosscorrelation of 0.80, see figure 3. Similarly, by replacing the autocorrelations with other source conditions [12], we can easily construct alternative separation algorithms.

6.3 Multiple subject analysis<br />

We f<strong>in</strong>ish by analyz<strong>in</strong>g the performance of the stSOBI algorithm for multiple<br />

subjects. As before, we applied stSOBI with dimension reduction to only n = 4<br />


sources. Here, K = 12; for simplicity, one-dimensional autocovariance matrices were used, both spatially and temporally. We masked the data using a fixed

common threshold. In order to quantify algorithm performance, as before we<br />

determ<strong>in</strong>ed the spatiotemporal source that had a time course with maximal<br />

autocorrelation with the stimulus protocol, and compared this autocorrelation.<br />

In figure 4(a), we show a boxplot of the autocorrelations together with the<br />

needed computational effort. The median autocorrelation was very high at 0.89. The separation was fast, with a mean computation time of 0.25 s on a 1.7 GHz Intel Dual Core laptop running Matlab. In order to confirm this

robustness of the algorithm, we analyzed the sample-size dependence of the<br />

method by running stSOBI on subsampled data sets. The bootstrapping was performed spatially with repetition, but with reordering of the samples in order to maintain spatial dependencies. Figure 4(b-c) shows the algorithm performance

when vary<strong>in</strong>g the subsampl<strong>in</strong>g percentage from 1 to 200 percent, where<br />

the statistics were done over 100 runs and over the 10 subjects. Even when<br />

us<strong>in</strong>g only 1 percent of the samples, we achieved a median autocorrelation<br />

of 0.66, which <strong>in</strong>creased at a subsampl<strong>in</strong>g percentage of 10% to an already<br />

acceptable value of 0.85. This confirms the robustness and efficiency of the<br />

proposed method, which of course comes from the underly<strong>in</strong>g robust optimization<br />

method of jo<strong>in</strong>t diagonalization.<br />
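The subsampling experiment can be paraphrased as follows; the sketch below only illustrates the evaluation logic, with `separate` standing in for any BSS routine (stSOBI in the paper) that returns component time courses, and with X, protocol and stsobi in the usage comment being placeholders.

    import numpy as np

    def best_stimulus_correlation(time_courses, stimulus):
        """Largest absolute correlation between any component time course and the stimulus."""
        s = (stimulus - stimulus.mean()) / stimulus.std()
        best = 0.0
        for tc in time_courses:
            t = (tc - tc.mean()) / tc.std()
            best = max(best, abs(np.mean(t * s)))
        return best

    def subsampled_run(X, stimulus, separate, percentage, rng):
        """Spatial bootstrap: draw voxels with repetition at the given percentage.

        X is (time steps x voxels); the drawn indices are sorted so that the
        spatial order, and hence spatial dependencies, are preserved.
        """
        n_vox = X.shape[1]
        k = max(1, int(round(n_vox * percentage / 100.0)))
        idx = np.sort(rng.integers(0, n_vox, size=k))
        time_courses = separate(X[:, idx])
        return best_stimulus_correlation(time_courses, stimulus)

    # example usage over the percentages and runs reported in the text:
    # corrs = [subsampled_run(X, protocol, stsobi, p, np.random.default_rng(i))
    #          for p in (1, 2, 5, 10, 20, 50, 100, 200) for i in range(100)]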

7 Conclusion<br />

We have proposed a novel spatiotemporal BSS algorithm named stSOBI. It<br />

is based on the jo<strong>in</strong>t diagonalization of both spatial and temporal autocorrelations.<br />

Shar<strong>in</strong>g the properties of all algebraic algorithms, stSOBI is easy to<br />

use, robust (with only a s<strong>in</strong>gle parameter) and fast (<strong>in</strong> contrast to the onl<strong>in</strong>e<br />

algorithm proposed by Stone). The employed dimension reduction allows<br />

for the spatiotemporal decomposition of high-dimensional data sets such as<br />

fMRI record<strong>in</strong>gs. The presented results for such data sets show that stSOBI<br />

clearly outperforms spatial-only recovery and Stone’s spatiotemporal algorithm.<br />

Moreover, the proposed algorithm is not limited to second-order statistics,<br />

but can easily be extended to spatiotemporal ICA for example by jo<strong>in</strong>tly<br />

diagonaliz<strong>in</strong>g both spatial and temporal cumulant matrices.<br />

Acknowledgments<br />

The authors gratefully acknowledge partial f<strong>in</strong>ancial support by the DFG<br />

(GRK 638) and the BMBF (project ‘ModKog’). They would like to thank<br />

D. Auer from the MPI of Psychiatry <strong>in</strong> Munich, Germany, for provid<strong>in</strong>g the<br />


fMRI data, and A. Meyer-Bäse from the Department of Electrical and Computer

Eng<strong>in</strong>eer<strong>in</strong>g, FSU, Tallahassee, USA for discussions concern<strong>in</strong>g the fMRI<br />

analysis. The authors thank the anonymous reviewers for their helpful comments<br />

dur<strong>in</strong>g preparation of this manuscript.<br />

References<br />

[1] A. Hyvärinen, J. Karhunen, E. Oja, Independent component analysis, John Wiley & Sons, 2001.
[2] A. Cichocki, S. Amari, Adaptive blind signal and image processing, John Wiley & Sons, 2002.
[3] L. Tong, R.-W. Liu, V. Soon, Y.-F. Huang, Indeterminacy and identifiability of blind identification, IEEE Transactions on Circuits and Systems 38 (1991) 499–509.
[4] L. Molgedey, H. Schuster, Separation of a mixture of independent signals using time-delayed correlations, Physical Review Letters 72 (23) (1994) 3634–3637.
[5] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique based on second order statistics, IEEE Transactions on Signal Processing 45 (2) (1997) 434–444.
[6] A. Ziehe, K.-R. Müller, TDSEP – an efficient algorithm for blind separation using time structure, in: L. Niklasson, M. Bodén, T. Ziemke (Eds.), Proc. of ICANN'98, Springer Verlag, Berlin, Skövde, Sweden, 1998, pp. 675–680.
[7] H. Schöner, M. Stetter, I. Schießl, J. Mayhew, J. Lund, N. McLoughlin, K. Obermayer, Application of blind separation of sources to optical recording of brain activity, in: Proc. NIPS 1999, Vol. 12, MIT Press, 2000, pp. 949–955.
[8] F. Theis, A. Meyer-Bäse, E. Lang, Second-order blind source separation based on multi-dimensional autocovariances, in: Proc. ICA 2004, Vol. 3195 of LNCS, Springer, Granada, Spain, 2004, pp. 726–733.
[9] J. Stone, J. Porrill, N. Porter, I. Wilkinson, Spatiotemporal independent component analysis of event-related fMRI data using skewed probability density functions, NeuroImage 15 (2) (2002) 407–421.
[10] A. Bell, T. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Computation 7 (1995) 1129–1159.
[11] J. Cardoso, A. Souloumiac, Blind beamforming for non Gaussian signals, IEE Proceedings - F 140 (6) (1993) 362–370.
[12] F. Theis, Y. Inouye, On the use of joint diagonalization in blind signal processing, in: Proc. ISCAS 2006, Kos, Greece, 2006.
[13] J. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization, SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.
[14] M. Joho, H. Mathis, R. Lambert, Overdetermined blind source separation: using more sensors than source signals in a noisy mixture, in: Proc. of ICA 2000, Helsinki, Finland, 2000, pp. 81–86.
[15] M. McKeown, T. Jung, S. Makeig, G. Brown, S. Kindermann, A. Bell, T. Sejnowski, Analysis of fMRI data by blind separation into independent spatial components, Human Brain Mapping 6 (1998) 160–188.
[16] I. Keck, F. Theis, P. Gruber, E. Lang, K. Specht, C. Puntonet, 3D spatial analysis of fMRI data on a word perception task, in: Proc. ICA 2004, Vol. 3195 of LNCS, Springer, Granada, Spain, 2004, pp. 977–984.
[17] R. Woods, S. Cherry, J. Mazziotta, Rapid automated algorithm for aligning and reslicing PET images, Journal of Computer Assisted Tomography 16 (1992) 620–633.
[18] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[19] S. Choi, A. Cichocki, Blind separation of nonstationary sources in noisy mixtures, Electronics Letters 36 (2000) 848–849.
[20] D.-T. Pham, J. Cardoso, Blind separation of instantaneous mixtures of nonstationary sources, IEEE Transactions on Signal Processing 49 (9) (2001) 1837–1848.



Chapter 10<br />

IEEE TNN 16(4):992-996, 2005<br />

Paper P. Georgiev, F.J. Theis, and A. Cichocki. Sparse component analysis and<br />

bl<strong>in</strong>d source separation of underdeterm<strong>in</strong>ed mixtures. IEEE Transactions on<br />

Neural Networks, 16(4):992-996, 2005<br />

Reference (Georgiev et al., 2005c)<br />

Summary <strong>in</strong> section 1.4.1<br />




Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures

Pando Georgiev, Fabian Theis, and Andrzej Cichocki

Abstract—In this letter, we solve the problem of identifying matrices S and A knowing only their product X = AS, under some conditions, expressed either in terms of A and the sparsity of S (identifiability conditions), or in terms of X (sparse component analysis (SCA) conditions). We present algorithms for such identification and illustrate them by examples.

Index Terms—Blind source separation (BSS), sparse component analysis (SCA), underdetermined mixtures.

I. INTRODUCTION

One of the fundamental questions in data analysis, signal processing, data mining, neuroscience, etc. is how to represent a large data set X (given in the form of an (m × N)-matrix) in different ways. A simple approach is a linear matrix factorization

X = AS,  A ∈ R^(m×n), S ∈ R^(n×N)  (1)

where the unknown matrices A (dictionary) and S (source signals) have some specific properties, for instance:

1) the rows of S are (discrete) random variables, which are statistically independent as much as possible—this is the independent component analysis (ICA) problem; 2) S contains as many zeros as possible—this is the sparse representation or sparse component analysis (SCA) problem; 3) the elements of X, A, and S are nonnegative—this is nonnegative matrix factorization (NMF) [8].

There is a large amount of papers devoted to ICA problems [2], [5], but mostly for the case m ≥ n. We refer to [1], [6], [7], and [9]–[11] for some recent papers on SCA and underdetermined ICA (m < n).

A related problem is the so-called blind source separation (BSS) problem, in which we know a priori that a representation such as in (1) exists and the task is to recover the sources (and the mixing matrix) as accurately as possible. A fundamental property of the complete BSS problem is that such a recovery (under the assumptions in 1) and non-Gaussianity of the sources) is possible up to permutation and scaling of the sources, which makes the BSS problem so attractive.

In this letter, we consider SCA and BSS problems in the underdetermined case (m < n, i.e., more sources than sensors, which is a more challenging problem), where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not necessarily be sparse in the time domain. It would be sufficient to find a linear transformation (e.g., wavelet packets) in which the sources are sufficiently sparse.

In the sequel, we present new algorithms for solving the BSS problem: a matrix identification algorithm and a source recovery algorithm, under the condition that the source matrix S has at most m − 1 nonzero elements in each column and the identifiability conditions are satisfied (see Theorem 1).

Manuscript received November 20, 2003; revised July 25, 2004.
P. Georgiev is with the ECECS Department, University of Cincinnati, Cincinnati, OH 45221 USA (e-mail: pgeorgie@ececs.uc.edu).
F. Theis is with the Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany (e-mail: fabian@theis.name).
A. Cichocki is with the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, The Institute of Physical and Chemical Research (RIKEN), Saitama 351-0198, Japan (e-mail: cia@bsp.brain.riken.jp).
Digital Object Identifier 10.1109/TNN.2005.849840



When the sources are locally very sparse (see condition i) of Theorem 2), the matrix identification algorithm is much simpler. We used this simpler form for separation of mixtures of images. After a sparsification transformation (which is in fact an appropriate wavelet transformation) the algorithm works perfectly in the complete case. We demonstrate the effectiveness of our general matrix identification algorithm and of the source recovery algorithm in the underdetermined case for 7 artificially created sparse source signals, such that the source matrix S has at most 2 nonzero elements in each column, mixed with a randomly generated (3 × 7) matrix. For a comparison, we present a recovery using l1-norm minimization [3], [4], which gives signals that are far from the original ones. This implies that the conditions which ensure equivalence of l1-norm and l0-norm minimization [4, Theorem 7] are generally not satisfied for randomly generated matrices. Note that l1-norm minimization gives solutions which have at most m nonzeros [3], [4]. Another connection with [4] is the fact that our algorithm for source recovery works "with probability one," i.e., for almost all data vectors x (in the measure sense) such that the system x = As has a sparse solution with fewer than m nonzero elements, this solution is unique, while in [4] the authors proved that for all data vectors x such that the system x = As has a sparse solution with fewer than Spark(A)/2 nonzero elements, this solution is unique. Note that Spark(A) ≤ m + 1, where Spark(A) is the smallest number of linearly dependent columns of A.
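Since the spark enters here only through its definition, a brute-force illustration may help; this is merely a toy check for small matrices (the computation is combinatorial) and is not part of the letter.

    import numpy as np
    from itertools import combinations

    def spark(A, tol=1e-10):
        """Smallest number of linearly dependent columns of A (brute force; small n only)."""
        m, n = A.shape
        for k in range(1, n + 1):
            for cols in combinations(range(n), k):
                if np.linalg.matrix_rank(A[:, cols], tol=tol) < k:
                    return k
        return n + 1  # all columns independent (only possible when n <= m)

    # A random 3 x 7 matrix has spark 4 with probability one: any 3 of its columns
    # are independent, while any 4 vectors in R^3 are linearly dependent.
    print(spark(np.random.default_rng(0).standard_normal((3, 7))))  # -> 4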

II. BLIND SOURCE SEPARATION

In this section, we develop a method for completely solving the BSS problem if the following assumptions are satisfied:
A1) the mixing matrix A ∈ R^(m×n) has the property that any square m × m submatrix of it is nonsingular;
A2) each column of the source matrix S has at most m − 1 nonzero elements;
A3) the sources are sufficiently richly represented in the following sense: for any index set of n − m + 1 elements I = {i_1, ..., i_(n−m+1)} ⊂ {1, ..., n} there exist at least m column vectors of the matrix S such that each of them has zero elements in the places with indexes in I and each m − 1 of them are linearly independent.

A. Matrix Identification

We describe conditions in the sparse BSS problem under which we can identify the mixing matrix uniquely up to permutation and scaling of the columns. We give two types of such conditions. The first one corresponds to the least sparse case in which such identification is possible. Further, we consider the most sparse case (for a small number of samples), as in this case the algorithm is much simpler.

1) General Case—Full Identifiability:

Theorem 1 (Identifiability Conditions—General Case): Assume that in the representation X = AS the matrix A satisfies condition A1), the matrix S satisfies conditions A2) and A3), and only the matrix X is known. Then the mixing matrix A is identifiable uniquely up to permutation and scaling of the columns.

Proof: It is clear that any column a_i of the mixing matrix lies in the intersection of all C(n−1, m−2) hyperplanes generated by those columns of A in which a_i participates (writing C(p, q) for the binomial coefficient "p choose q").

We will show that these hyperplanes can be obtained from the columns of the data X under the conditions of the theorem. Let T be the family of all subsets of {1, ..., n} containing m − 1 elements and let J ∈ T. Note that T consists of C(n, m−1) elements. We will show that the hyperplane (denoted by H_J) generated by the columns of A with indexes from J can be obtained from some columns of X. By A2) and A3), there exist m indexes t_1, ..., t_m ∈ {1, ..., N} such that any m − 1 of the vector columns S(:, t_1), ..., S(:, t_m) form a basis of the (m − 1)-dimensional coordinate subspace of R^n with zero coordinates given by {1, ..., n} \ J. Because of the mixing model, the vectors of the form

x_k = Σ_{j ∈ J} S(j, t_k) a_j,  k = 1, ..., m,

belong to the data matrix X. Now, by condition A1) it follows that any m − 1 of the vectors x_1, ..., x_m are linearly independent, which implies that they span the same hyperplane H_J. By A1) and the above, it follows that we can cluster the columns of X uniquely in C(n, m−1) groups H_k, k = 1, ..., C(n, m−1), such that each group H_k contains at least m elements and they span one hyperplane H_J for some J_k ∈ T. Now we cluster the hyperplanes obtained in such a way in the smallest number of groups such that the intersection of all hyperplanes in each group gives a single one-dimensional (1-D) subspace. It is clear that such a 1-D subspace will contain one column of the mixing matrix, the number of these groups is n, and each group consists of C(n−1, m−2) hyperplanes.

The proof of this theorem gives the idea for the matrix identification algorithm.

Algorithm for Identification of the Mixing Matrix:

1) Cluster the columns of X in C(n, m−1) groups H_k, k = 1, ..., C(n, m−1), such that the span of the elements of each group H_k produces one hyperplane and these hyperplanes are different.
2) Cluster the normal vectors to these hyperplanes in the smallest number of groups G_j, j = 1, ..., n (which gives the number of sources n), such that the normal vectors to the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
3) Calculate the normal vectors â_j to each hyperplane Ĥ_j, j = 1, ..., n. Note that the 1-D subspace spanned by â_j is the intersection of all hyperplanes in G_j. The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).

2) Degenerate Case—Sparse Instances:

Theorem 2 (Identifiability Conditions—Locally Very Sparse Representation): Assume that the number of sources is unknown and the following hold:
i) for each index i = 1, ..., n there are at least two columns of S, S(:, j_1) and S(:, j_2), which have nonzero elements only in position i (so each source is uniquely present at least twice);
ii) X(:, j) ≠ cX(:, k) for any c ∈ R, any j = 1, ..., N and any k = 1, ..., N, k ≠ j, for which S(:, k) has more than one nonzero element.
Then the number of sources and the matrix A are identifiable uniquely up to permutation and scaling.

Proof: We cluster in groups all nonzero normalized column vectors of X such that within each group the vectors differ only by sign. From conditions i) and ii), it follows that the number of groups containing more than one element is precisely the number of sources n, and that each such group represents a normalized column of A (up to sign).

In the following, we include an algorithm for identification of the mixing matrix based on Theorem 2.

Algorithm for Identification of the Mixing Matrix in the Very Sparse Case:

1) Remove all zero columns of X (if any) and obtain a matrix X_1 ∈ R^(m×N_1).



2) Normalize the columns x_j, j = 1, ..., N_1, of X_1: x_j ← x_j/||x_j||, and set ε > 0. Multiply each column x_j by −1 if its first element is negative.
3) Cluster the x_j, j = 1, ..., N_1, in n + 1 groups G_1, ..., G_(n+1) such that for any group, ||x_i − x_j|| < ε for all x_i, x_j in that group, and ||x_i − x_j|| ≥ ε for any x_i, x_j belonging to different groups.
4) Choose any x_j ∈ G_k and put â_k = x_j. The matrix Â with columns {â_k} is an estimate of the mixing matrix, up to permutation and scaling.
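A direct transcription of these four steps might look as follows; the greedy ε-clustering and the sign fix via the first entry are implementation choices made here for brevity, not prescribed by the letter.

    import numpy as np

    def identify_mixing_very_sparse(X, eps=1e-6, zero_tol=1e-12):
        """Estimate columns of A when each source appears alone in at least two columns of X."""
        # 1) remove (near-)zero columns
        X1 = X[:, np.linalg.norm(X, axis=0) > zero_tol]
        # 2) normalize and flip signs so that the first entry is nonnegative
        U = X1 / np.linalg.norm(X1, axis=0)
        U = U * np.where(U[0] < 0, -1.0, 1.0)
        # 3) greedy clustering of directions up to the threshold eps
        centers, counts = [], []
        for u in U.T:
            for i, c in enumerate(centers):
                if np.linalg.norm(u - c) < eps:
                    counts[i] += 1
                    break
            else:
                centers.append(u)
                counts.append(1)
        # 4) groups with more than one member correspond to columns of A (Theorem 2, i))
        return np.array([c for c, k in zip(centers, counts) if k > 1]).T

Columns of X generated by a single source land in the same group, while generic mixed columns stay isolated by condition ii), so only the multi-member groups are kept.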

We should mention that the very sparse case in different settings has already been considered in the literature, but in a more restrictive sense. In [6], the authors suppose that the supports of the Fourier transforms of any two source signals are disjoint sets—a much more restrictive condition than our condition. In [1], the authors suppose that for any source there exists a time-frequency window where only this source is nonzero, and that the time-frequency transform of each source is not constant on any time-frequency window. We would like to mention that their condition should also include the case where the time-frequency transforms of any two sources are not proportional in any time-frequency window. Such a quantitative condition (without frequency representation) is presented in our Theorem 2, condition ii).

Fig. 1. Original images.

Fig. 2. Mixed (observed) images.

Fig. 3. Estimated normalized images using the estimated matrix. The signal-to-noise ratios with the sources from Fig. 1 are 232, 239, and 228 dB, respectively.

B. Identification of Sources

Theorem 3 (Uniqueness of Sparse Representation): Let H be the set of all x ∈ R^m such that the linear system As = x has a solution with at least n − m + 1 zero components. If A fulfills A1), then there exists a subset H_0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H_0 this system has no other solution with this property.

Proof: Obviously H is the union of all C(n, m−1) = n!/((m−1)!(n−m+1)!) hyperplanes produced by taking the linear hull of every subset of the columns of A with m − 1 elements. Let H_0 be the union of all intersections of any two such subspaces. Then H_0 has measure zero in H and satisfies the conclusion of the theorem. Indeed, assume that x ∈ H \ H_0 and As = As' = x, where s and s' have at least n − m + 1 zeros. Since x ∉ H_0, x belongs to only one hyperplane produced as the linear hull of some m − 1 columns a_{i_1}, ..., a_{i_(m−1)} of A. It means that the vectors s and s' have n − m + 1 zeros in the places with indexes in {1, ..., n} \ {i_1, ..., i_(m−1)}. Now from the equation A(s − s') = 0 it follows that the m − 1 vector columns a_{i_1}, ..., a_{i_(m−1)} of A are linearly dependent, which is a contradiction with A1).

From Theorem 3 it follows that the sources are identifiable generically, i.e., up to a set with measure zero, if they have a level of sparseness greater than or equal to n − m + 1 and the mixing matrix is known. In the following, we present an algorithm based on the observation in Theorem 3.

Source Recovery Algorithm:

1) Identify the set of hyperplanes produced by taking the linear hull of every subset of the columns of A with m − 1 elements.
2) Repeat for j = 1 to N:
2.1) Identify the hyperplane containing x_j := X(:, j), or, in a practical situation with presence of noise, identify the one to which the distance from x_j is minimal and project x_j onto it, obtaining x̃_j.
2.2) If this hyperplane is produced by the linear hull of the column vectors a_{i_1}, ..., a_{i_(m−1)}, then find coefficients λ_{k,j} such that x̃_j = Σ_{k=1}^{m−1} λ_{k,j} a_{i_k}. These coefficients are uniquely determined if x̃_j does not belong to the set H_0 with measure zero with respect to H (see Theorem 3).
2.3) Construct the solution s_j = S(:, j): it contains λ_{k,j} in the place i_k for k = 1, ..., m − 1; its other components are zero.

III. SCA

In this section, we develop a method for the complete solution of the SCA problem. Now the conditions are formulated only in terms of the data matrix X.

Theorem 4 (SCA Conditions): Assume that m ≤ N and the matrix X ∈ R^(m×N) satisfies the following conditions:
i) the columns of X lie in the union of C(n, m−1) different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;
ii) for each i ∈ {1, ..., n} there exist p = C(n−1, m−2) different hyperplanes H_{i,1}, ..., H_{i,p} among them such that their intersection L_i = ∩_{j=1}^p H_{i,j} is a 1-D subspace;
iii) any m different L_i span the whole R^m.
Then the matrix X is representable uniquely (up to permutation and scaling of the columns of A and rows of S) in the form X = AS, where the matrices A ∈ R^(m×n) and S ∈ R^(n×N) satisfy the conditions A1) and A2), A3), respectively.

Proof: Let L_i be spanned by â_i and collect these vectors in the set {â_1, ..., â_n}. Condition iii) implies that any of the hyperplanes contains at most m − 1 of these vectors. By i) and ii), it follows that these vectors are exactly m − 1: only in this case does the count of all hyperplanes obtained from ii) give the number in i), namely n·C(n−1, m−2)/(m − 1) = C(n, m−1). Let A be a matrix whose column vectors are the vectors â_1, ..., â_n (taken in an arbitrary order). Since every column vector x of X lies in only one of the hyperplanes, the linear system As = x has a unique solution, which has at least n − m + 1 zeros (see the proof of Theorem 3). Let x_1, ..., x_m be m column vectors from X which span one of the hyperplanes, such that each m − 1 of them are linearly independent (such vectors exist by i)). Then we have As_k = x_k for some uniquely determined vectors s_k, which have at least n − m + 1 zeros in the same coordinates, and m − 1 of them are linearly independent. In such a way, we can write X = AS for some uniquely determined matrix S, which satisfies A2) and A3).
vectors ��Y� aIYFFFY�0 I, which are l<strong>in</strong>early <strong>in</strong>dependent and have



Fig. 4. (a) Mixed signals and (b) normalized scatter plot (density) of the mixtures together with the 21 data set hyperplanes, visualized by their intersection with the unit sphere in R^3.

Fig. 5. (a) Original source signals. (b) Recovered source signals—the signal-to-noise ratio between the original sources and the recoveries is very high (above 278 dB after permutation and normalization). Note that only 200 samples are enough for excellent separation. (c) Recovered source signals using l1-norm minimization and known mixing matrix. Simple comparison confirms that the recovered signals are far from the original ones, and the signal-to-noise ratio is only around 4 dB.

We should mention that our algorithms are robust with respect to small additive noise and big outliers, since the algorithms cluster the data on hyperplanes approximately, up to a threshold ε > 0, which could accumulate noise with amplitude less than ε. The big outliers will not be clustered to any hyperplane.

IV. COMPUTER SIMULATION EXAMPLES

A. Complete Case

In this example for the complete case (m = n) of instantaneous mixtures, we demonstrate the effectiveness of our algorithm for identification of the mixing matrix in the special case considered in Theorem 2. We mixed three images of landscapes (shown in Fig. 1) with a three-dimensional (3-D) Hilbert matrix A and transformed them by a two-dimensional (2-D) discrete Haar wavelet transform. As a result, since this transformation is linear, the high-frequency components of the source signals become very sparse and they satisfy the conditions of Theorem 2. We use only one row (320 points) from the diagonal coefficients of the wavelet-transformed mixture, which is enough to recover very precisely the ill-conditioned mixing matrix A. Fig. 3 shows the recovered mixtures.

B. Underdetermined Case

We consider a mixture of seven artificially created sources (see Fig. 5)—sparsified randomly generated signals with at least 5 zeros in each column—with a randomly generated mixing matrix of dimension 3 × 7. Fig. 4 gives the mixed signals together with a normalized scatter plot of the mixtures—the data lie in 21 = C(7, 2) hyperplanes. Applying the underdetermined matrix recovery algorithm to the mixtures gives the recovered mixing matrix perfectly well, up to permutation and scaling (not shown because of lack of space). Applying the source recovery algorithm, we recover the source signals up to permutation and scaling (see Fig. 5). This figure also shows that the recovery by l1-norm minimization does not perform well, even if the mixing matrix is perfectly known.
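For reference, the l1-norm recovery used for this comparison can be written as a standard linear program (basis pursuit with an equality constraint). The sketch below is not the code from the experiments; it assumes a recent SciPy with the "highs" LP solver.

    import numpy as np
    from scipy.optimize import linprog

    def l1_recover_column(A, x):
        """min ||s||_1 subject to A s = x, via the split s = u - v with u, v >= 0."""
        m, n = A.shape
        c = np.ones(2 * n)              # objective sum(u) + sum(v) equals ||s||_1
        A_eq = np.hstack([A, -A])       # A u - A v = x
        res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
        u, v = res.x[:n], res.x[n:]
        return u - v

An optimal basic solution of this program has at most m nonzero variables, so the recovered column has at most m nonzeros, which matches the remark in the introduction; when the equivalence conditions of [4] fail, it need not coincide with the sparsest solution.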

V. CONCLUSION

We rigorously defined the SCA and BSS problems for sparse signals and presented sufficient conditions for their solution. We developed three algorithms: for identification of the mixing matrix (two types: for the sparse and the very sparse cases) and for source recovery. We presented two experiments: the first one concerns separation of a mixture of images after wavelet sparsification (producing very sparse sources), which performs very well in the complete case. The second one shows



the excellent performance of the other two algorithms in the underdetermined BSS problem, for separation of artificially created signals with a sufficient level of sparseness.

REFERENCES

[1] F. Abrard, Y. Deville, and P. White, "From blind source separation to blind source cancellation in the underdetermined case: A new approach based on time-frequency analysis," in Proc. 3rd Int. Conf. Independent Component Analysis and Signal Separation (ICA'2001), San Diego, CA, Dec. 9–13, 2001, pp. 734–739.
[2] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. New York: Wiley, 2002.
[3] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[4] D. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.
[5] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001.
[6] A. Jourjine, S. Rickard, and O. Yilmaz, "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures," in Proc. 2000 IEEE Conf. Acoustics, Speech, and Signal Processing (ICASSP'00), vol. 5, Istanbul, Turkey, Jun. 2000, pp. 2985–2988.
[7] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Process. Lett., vol. 6, no. 4, pp. 87–90, 1999.
[8] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[9] F. J. Theis, E. W. Lang, and C. G. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomput., vol. 56, pp. 381–398, 2004.
[10] K. Waheed and F. Salem, "Algebraic overcomplete independent component analysis," in Proc. Int. Conf. Independent Component Analysis (ICA'03), Nara, Japan, pp. 1077–1082.
[11] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Comput., vol. 13, no. 4, pp. 863–882, 2001.


Chapter 11<br />

EURASIP JASP, 2007<br />

Paper F.J. Theis, P. Georgiev, and A. Cichocki. Robust sparse component analysis<br />

based on a generalized Hough transform. EURASIP Journal on Applied

Signal Process<strong>in</strong>g, 2007<br />

Reference (Theis et al., 2007a)<br />

Summary <strong>in</strong> section 1.4.1<br />




H<strong>in</strong>dawi Publish<strong>in</strong>g Corporation<br />

EURASIP Journal on Advances <strong>in</strong> Signal Process<strong>in</strong>g<br />

Volume 2007, Article ID 52105, 13 pages<br />

doi:10.1155/2007/52105<br />

Research Article<br />

Robust Sparse <strong>Component</strong> <strong>Analysis</strong> Based on<br />

a Generalized Hough Transform<br />

Fabian J. Theis, 1 Pando Georgiev, 2 and Andrzej Cichocki3, 4<br />

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

2 ECECS Department and Department of Mathematical Sciences, University of C<strong>in</strong>c<strong>in</strong>nati, C<strong>in</strong>c<strong>in</strong>nati, OH 45221, USA<br />

3 BSI RIKEN, Laboratory for Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g, 2-1, Hirosawa, Wako, Saitama 351-0198, Japan<br />

4 Faculty of Electrical Eng<strong>in</strong>eer<strong>in</strong>g, Warsaw University of Technology, Pl. Politechniki 1, 00-661 Warsaw, Poland<br />

Received 21 October 2005; Revised 11 April 2006; Accepted 11 June 2006<br />

Recommended for Publication by Frank Ehlers<br />

An algorithm called Hough SCA is presented for recovering the matrix A in x(t) = As(t), where x(t) is a multivariate observed signal, possibly of lower dimension than the unknown sources s(t). They are assumed to be sparse in the sense that at every

time <strong>in</strong>stant t, s(t) has fewer nonzero elements than the dimension of x(t). The presented algorithm performs a global search for<br />

hyperplane clusters with<strong>in</strong> the mixture space by gather<strong>in</strong>g possible hyperplane parameters with<strong>in</strong> a Hough accumulator tensor.<br />

This renders the algorithm immune to the many local m<strong>in</strong>ima typically exhibited by the correspond<strong>in</strong>g cost function. In contrast<br />

to previous approaches, Hough SCA is l<strong>in</strong>ear <strong>in</strong> the sample number and <strong>in</strong>dependent of the source dimension as well as robust<br />

aga<strong>in</strong>st noise and outliers. Experiments demonstrate the flexibility of the proposed algorithm.<br />

Copyright © 2007 Fabian J. Theis et al. This is an open access article distributed under the Creative Commons Attribution License,<br />

which permits unrestricted use, distribution, and reproduction <strong>in</strong> any medium, provided the orig<strong>in</strong>al work is properly cited.<br />

1. INTRODUCTION<br />

One goal of multichannel signal analysis lies <strong>in</strong> the detection<br />

of underly<strong>in</strong>g sources with<strong>in</strong> some given set of observations.<br />

If both the mixture process and the sources are unknown,<br />

this is denoted as bl<strong>in</strong>d source separation (BSS). BSS<br />

can be applied <strong>in</strong> many different fields such as medical and<br />

biological data analysis, broadcast<strong>in</strong>g systems, and audio and<br />

image process<strong>in</strong>g. In order to decompose the data set, different<br />

assumptions on the sources have to be made. The<br />

most common assumption currently used is statistical <strong>in</strong>dependence<br />

of the sources, which leads to the task of <strong>in</strong>dependent<br />

component analysis (ICA); see, for <strong>in</strong>stance, [1, 2]<br />

and references there<strong>in</strong>. ICA very successfully separates data<br />

<strong>in</strong> the l<strong>in</strong>ear complete case, when as many signals as underly<strong>in</strong>g<br />

sources are observed, and <strong>in</strong> this case the mix<strong>in</strong>g<br />

matrix and the sources are identifiable except for permutation<br />

and scal<strong>in</strong>g [3, 4]. In the overcomplete or underdeterm<strong>in</strong>ed<br />

case, fewer observations than sources are given.<br />

It can be shown that the mix<strong>in</strong>g matrix can still be recovered<br />

[5], but source identifiability does not hold. In order<br />

to approximately detect the sources, additional requirements<br />

have to be made, usually sparsity of the sources [6–<br />

8].<br />

Recently, we have <strong>in</strong>troduced a novel measure for sparsity<br />

and shown [9] that based on sparsity alone, we can still<br />

detect both mix<strong>in</strong>g matrix and sources uniquely except for<br />

trivial <strong>in</strong>determ<strong>in</strong>acies (sparse component analysis (SCA)). In<br />

that paper, we have also proposed an algorithm based on random<br />

sampl<strong>in</strong>g for reconstruct<strong>in</strong>g the mix<strong>in</strong>g matrix and the<br />

sources, but the focus of the paper was on the model, and the<br />

matrix estimation algorithm turned out to be not very robust<br />

aga<strong>in</strong>st noise and outliers, and could therefore not easily<br />

be applied <strong>in</strong> high dimensions due to the <strong>in</strong>volved comb<strong>in</strong>atorial<br />

searches. In the present manuscript, a new algorithm<br />

is proposed for SCA, that is, for decompos<strong>in</strong>g a data<br />

set x(1), . . . , x(T) ∈ R m modeled by an (m × T)-matrix X<br />

l<strong>in</strong>early <strong>in</strong>to X = AS, where the n-dimensional sources S =<br />

(s(1), . . . , s(T)) are assumed to be sparse at every time <strong>in</strong>stant.<br />

If the sources are of sufficiently high sparsity, the mixtures<br />

are clustered along hyperplanes <strong>in</strong> the mixture space.<br />

Based on this condition, the mix<strong>in</strong>g matrix can be reconstructed;<br />

furthermore, this property is robust aga<strong>in</strong>st noise<br />

and outliers, which will be used here. The proposed algorithm<br />

denoted by Hough SCA employs a generalization of the<br />

Hough transform <strong>in</strong> order to detect the hyperplanes <strong>in</strong> the<br />

mixture space, which then leads to matrix and source identification.



The Hough transform [10] is a standard tool <strong>in</strong> image<br />

analysis that allows recognition of global patterns <strong>in</strong> an image<br />

space by recogniz<strong>in</strong>g local patterns, ideally a po<strong>in</strong>t, <strong>in</strong> a transformed<br />

parameter space. It is particularly useful when the<br />

patterns <strong>in</strong> question are sparsely digitized, conta<strong>in</strong> “holes,”<br />

or have been tak<strong>in</strong>g <strong>in</strong> noisy environments. The basic idea<br />

of this technique is to map parameterized objects such as<br />

straight l<strong>in</strong>es, polynomials, or circles to a suitable parameter<br />

space. The ma<strong>in</strong> application of the Hough transform lies<br />

<strong>in</strong> the field of image process<strong>in</strong>g <strong>in</strong> order to f<strong>in</strong>d straight l<strong>in</strong>es,<br />

centers of circles with a fixed radius, parabolas, and so forth<br />

<strong>in</strong> images.<br />
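As a reminder of the classical construction, the sketch below accumulates votes for lines in Hesse normal form, ρ = x cos θ + y sin θ, over a discretized (θ, ρ) grid. This is the textbook two-dimensional transform, not the hyperplane generalization developed later in the paper.

    import numpy as np

    def hough_lines(points, n_theta=180, n_rho=200):
        """Vote for lines rho = x*cos(theta) + y*sin(theta) through the given 2-D points."""
        pts = np.asarray(points, dtype=float)
        rho_max = np.abs(pts).max() * np.sqrt(2) + 1e-9
        thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
        acc = np.zeros((n_theta, n_rho), dtype=int)
        for x, y in pts:
            rho = x * np.cos(thetas) + y * np.sin(thetas)          # one rho per theta
            bins = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
            acc[np.arange(n_theta), bins] += 1
        i, j = np.unravel_index(acc.argmax(), acc.shape)           # strongest line
        return acc, thetas[i], j / (n_rho - 1) * 2 * rho_max - rho_max

Points lying on a common line vote for the same accumulator cell, so a global line shows up as a local maximum in parameter space even when the line is sparsely sampled or noisy.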

The Hough transform has been used <strong>in</strong> a somewhat<br />

ad hoc way <strong>in</strong> the field of <strong>in</strong>dependent component analysis<br />

for identify<strong>in</strong>g two-dimensional sources <strong>in</strong> the mixture<br />

plot <strong>in</strong> the complete [11] and overcomplete [12] cases,<br />

which without additional restrictions can be shown to have<br />

some theoretical issues [13]; moreover, the proposed algorithms<br />

were restricted to two dimensions and did not provide<br />

any reliable source identification method. An application<br />

of a time-frequency Hough transform to direction f<strong>in</strong>d<strong>in</strong>g<br />

with<strong>in</strong> nonstationary signals has been studied <strong>in</strong> [14]; the<br />

idea is based on the Hough transform of the Wigner-Ville<br />

distribution [15], essentially employ<strong>in</strong>g a generalized Hough<br />

transform [16] to f<strong>in</strong>d straight l<strong>in</strong>es <strong>in</strong> the time-frequency<br />

plane. The results <strong>in</strong> [14] aga<strong>in</strong> only concentrate on the twodimensional<br />

mixture case. In the literature, overcomplete<br />

BSS and the correspond<strong>in</strong>g basis estimation problems have<br />

ga<strong>in</strong>ed considerable <strong>in</strong>terest <strong>in</strong> the past decade [8, 17–19],<br />

but the sparse priors are always used <strong>in</strong> connection with the<br />

assumption of <strong>in</strong>dependent sources. This allows for probabilistic<br />

sparsity conditions, but cannot guarantee source<br />

identifiability as <strong>in</strong> our case.<br />

The paper is organized as follows. In Section 2, we introduce the overcomplete SCA model and summarize the known identifiability results and algorithms [9]. The following section then reviews the classical Hough transform in two dimensions and generalizes it in order to detect hyperplanes in any dimension. This method is used in Section 4 to develop an SCA algorithm, which turns out to be highly robust against noise and outliers. We confirm this by experiments in Section 5. Some results of this paper have already been presented at the conference "ESANN 2004" [20].

2. OVERCOMPLETE SCA

We introduce a strict notion of sparsity and present identifiability results when applying this measure to BSS.

A vector v ∈ R^n is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. Note that if v is k-sparse, then it is also k′-sparse for k′ ≤ k. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal x(t), t = 1, . . . , T, into

x(t) = As(t)   (1)

with a real m × n mixing matrix A and n-dimensional k-sparse sources s(t). The samples are gathered into corresponding data matrices X := (x(1), . . . , x(T)) ∈ R^{m×T} and S := (s(1), . . . , s(T)) ∈ R^{n×T}, so the model is X = AS. We speak of complete, overcomplete, or undercomplete k-SCA if m = n, m < n, or m > n, respectively. In the following, we will always assume that the sparsity level equals k = n − m + 1, which means that at any time instant, fewer sources than given observations are active. In the algorithm, we will also consider additive white Gaussian noise; however, the model identification results are presented only in the noiseless case from (1).
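The following minimal sketch (not part of the original paper) generates k-sparse sources and mixes them according to the noiseless model (1); the Laplacian source amplitudes and the uniform mixing coefficients are illustrative assumptions chosen to resemble the later experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_k_sparse_sources(n, T, k):
    """Draw T columns of n-dimensional sources with at least k zero entries each."""
    S = rng.laplace(size=(n, T))            # illustrative amplitude distribution
    for t in range(T):
        zero_idx = rng.choice(n, size=k, replace=False)
        S[zero_idx, t] = 0.0                # enforce k-sparsity of every column
    return S

m, n = 3, 4                                 # overcomplete case: fewer mixtures than sources
k = n - m + 1                               # sparsity level assumed throughout
S = generate_k_sparse_sources(n, T=1000, k=k)
A = rng.uniform(-1, 1, size=(m, n))         # mixing matrix with coefficients in [-1, 1]
X = A @ S                                   # noiseless k-SCA model X = AS from (1)
assert np.all((S == 0).sum(axis=0) >= k)    # every sample is k-sparse
```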

Note that in contrast to the ICA model, the above problem is not translation invariant. However, it is easy to see that if instead of A we choose an affine linear transformation, the translation constant can be determined from X only, as long as the sources are nondeterministic. Put differently, this means that instead of assuming k-sparsity of the sources we could also assume that at any fixed time t, only n − k source components are allowed to vary from a previously fixed constant (which can be different for each source). In the following we will, without loss of generality, assume m ≤ n: the easier undercomplete (overdetermined) case m > n can be reduced to the complete case by projection in the mixture space.

The following theorem shows that essentially the mixing model (1) is unique if fewer sources than mixtures are active, that is, if the sources are (n − m + 1)-sparse.

Theorem 1 (matrix identifiability). Consider the k-SCA problem from (1) for k := n − m + 1 and assume that every m × m submatrix of A is invertible. Furthermore, let S be sufficiently richly represented in the sense that for any index set of n − m + 1 elements I ⊂ {1, . . . , n} there exist at least m samples of S such that each of them has zero elements in places with indexes in I and each m − 1 of them are linearly independent. Then A is uniquely determined by X except for left multiplication with permutation and scaling matrices.

So if AS = ÂŜ, then A = ÂPL with a permutation matrix P and a nonsingular scaling matrix L. This means that we can recover the mixing matrix from the mixtures. The next theorem shows that in this case also the sources can be found uniquely.

Theorem 2 (source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (n − m + 1)-sparse solution, that is, one with at least n − m + 1 zero components. If A fulfills the condition from Theorem 1, then there exists a subset H0 ⊂ H with measure zero with respect to H such that for every x ∈ H \ H0 this system has no other solution with this property.

For proofs of these theorems we refer to [9]. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be recovered uniquely from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both theorems, as well as a possible algorithm, is illustrated in Figure 1.



[Figure 1 — panels: (a) three hyperplanes span{a_i, a_j} for 1 ≤ i < j ≤ 3 in the 3 × 3 case; (b) the hyperplanes from (a) visualized by intersection with the sphere; (c) six hyperplanes span{a_i, a_j} for 1 ≤ i < j ≤ 4 in the 3 × 4 case. Caption: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R^3. Due to the source sparsity, the mixtures are generated by only two matrix columns a_i, a_j, and are hence contained in a union of hyperplanes. Identification of the hyperplanes gives mixing matrix and sources.]

Data: samples x(1), . . . , x(T)
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Cluster the samples x(t) into \binom{n}{m-1} groups such that the span of the elements of each group produces one distinct hyperplane H_i.
Matrix identification.
(2) Cluster the normal vectors of these hyperplanes into the smallest number of groups G_j, j = 1, . . . , n (which gives the number of sources n) such that the normal vectors of the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
(3) Calculate the normal vector â_j of each hyperplane Ĥ_j, j = 1, . . . , n.
(4) The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).
Algorithm 1: SCA matrix identification algorithm.

By assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determines both mixing matrix and sources.

The matrix and source identification algorithms from [9] are recalled in Algorithms 1 and 2. We will present a modification of the matrix identification part; the same source identification algorithm (Algorithm 2) will be used in the experiments. The "difficult" part of the matrix identification algorithm lies in the hyperplane detection; in Algorithm 1, a random sampling and clustering technique is used. Another, more efficient algorithm for finding the hyperplanes containing the data has been developed by Bradley and Mangasarian [21], essentially by extending k-means batch clustering. Their so-called k-plane clustering algorithm, in the special case of hyperplanes containing 0, is shown in Algorithm 3.

Data: samples x(1), . . . , x(T) and estimated mixing matrix Â
Result: estimated sources ŝ(1), . . . , ŝ(T)
(1) Identify the set H of hyperplanes produced by taking the linear hull of every subset of m − 1 columns of Â.
for t ← 1, . . . , T do
(2) Identify the hyperplane H ∈ H containing x(t) or, in the presence of noise, the one to which the distance from x(t) is minimal, and project x(t) onto H to obtain x̂.
(3) If H is produced by the linear hull of the column vectors â_{i(1)}, . . . , â_{i(m−1)}, find coefficients λ_{i(j)} such that x̂ = Σ_{j=1}^{m−1} λ_{i(j)} â_{i(j)}.
(4) Construct the solution ŝ(t): it contains λ_{i(j)} at index i(j) for j = 1, . . . , m − 1; the other components are zero.
end
Algorithm 2: SCA source identification algorithm.
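A minimal numpy sketch of this source identification step is given below; it is an illustrative reimplementation under the stated conventions, not the authors' code, and assumes the mixtures X and an estimate Â of the mixing matrix with hyperplanes through the origin.

```python
import numpy as np
from itertools import combinations

def sca_source_recovery(X, A_hat):
    """Project each sample onto the closest hyperplane spanned by m-1 columns of A_hat,
    solve for the corresponding m-1 coefficients, and set all other components to zero."""
    m, n = A_hat.shape
    T = X.shape[1]
    S_hat = np.zeros((n, T))
    subsets = [list(c) for c in combinations(range(n), m - 1)]
    # orthonormal basis of every hyperplane span{a_i : i in subset}
    bases = [np.linalg.qr(A_hat[:, cols])[0] for cols in subsets]
    for t in range(T):
        x = X[:, t]
        # distance to each hyperplane = norm of the residual after projection
        residuals = [np.linalg.norm(x - Q @ (Q.T @ x)) for Q in bases]
        j = int(np.argmin(residuals))
        cols, Q = subsets[j], bases[j]
        x_proj = Q @ (Q.T @ x)                              # projection onto the chosen hyperplane
        coeffs, *_ = np.linalg.lstsq(A_hat[:, cols], x_proj, rcond=None)
        S_hat[cols, t] = coeffs
    return S_hat
```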

Finite termination of the k-plane clustering algorithm is proven in [21, Theorem 3.7]. We will later compare the proposed Hough algorithm with the k-hyperplane algorithm. The k-hyperplane algorithm has also been extended to a more general, orthogonal k-subspace clustering method [22, 23], thus allowing a search not only for hyperplanes but also for lower-dimensional subspaces.

3. HOUGH TRANSFORM

The Hough transform is a classical method for locating shapes in images, widely used in the field of image processing; see [10, 24]. It is robust to noise and occlusions and is used for extracting lines, circles, or other shapes from images. In addition to these nonlinear extensions, it can also be made more robust to noise using antialiasing techniques.



Data: samples x(1), . . . , x(T)
Result: estimated k hyperplanes H_i given by their normal vectors u_i
(1) Initialize u_i randomly with |u_i| = 1 for i = 1, . . . , k.
do
Cluster assignment.
for t ← 1, . . . , T do
(2) Add x(t) to cluster Y^(i), where i is chosen to minimize |u_i^⊤ x(t)| (the distance to hyperplane H_i).
end
(3) Exit if the mean distance to the hyperplanes is smaller than some preset value.
Cluster update.
for i ← 1, . . . , k do
(4) Calculate the i-th cluster correlation C := Y^(i) Y^(i)⊤.
(5) Choose an eigenvector v of C corresponding to a minimal eigenvalue.
(6) Set u_i ← v/|v|.
end
end
Algorithm 3: k-hyperplane clustering algorithm.
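The following short numpy sketch implements this alternating scheme for hyperplanes through 0; it is an illustration of the technique described above (random initialization, tolerance, and empty-cluster handling are assumptions, not details from [21]).

```python
import numpy as np

def k_hyperplane_clustering(X, k, n_iter=100, tol=1e-6, seed=0):
    """Alternate between assigning samples to the closest hyperplane and re-estimating
    each normal as the eigenvector of the cluster correlation with smallest eigenvalue."""
    rng = np.random.default_rng(seed)
    m, T = X.shape
    U = rng.normal(size=(k, m))
    U /= np.linalg.norm(U, axis=1, keepdims=True)        # random unit normals
    for _ in range(n_iter):
        dist = np.abs(U @ X)                             # |u_i^T x(t)|: distance to hyperplane i
        labels = np.argmin(dist, axis=0)                 # cluster assignment
        if dist[labels, np.arange(T)].mean() < tol:
            break
        for i in range(k):                               # cluster update
            Y = X[:, labels == i]
            if Y.shape[1] == 0:
                continue                                 # keep the old normal for empty clusters
            C = Y @ Y.T                                  # cluster correlation matrix
            eigval, eigvec = np.linalg.eigh(C)           # eigenvalues in ascending order
            U[i] = eigvec[:, 0]                          # eigenvector of the minimal eigenvalue
    return U
```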

3.1. Definition

Its main idea can be described as follows: consider a parameterized object

M_a := {x ∈ R^n | f(x, a) = 0}   (2)

for a fixed parameter set a ∈ U ⊂ R^p; here U ⊂ R^p is the parameter space, and the parameter function f : R^n × U → R^m is a set of m equations describing our types of objects (manifolds) M_a for different parameters a. We assume that the equations given by f are separating in the sense that if M_a ⊂ M_{a′}, then already a = a′. A simple example is the set of unit circles in R^2; then f(x, a) = |x − a| − 1. For a given a ∈ R^2, M_a is the circle of radius 1 centered at a, and f is obviously separating. Other object manifolds will be discussed later. A nonseparating object function is, for example, f(x, a) := 1 − 1_{[0,a]}(x) for (x, a) ∈ R × [0, ∞), where the characteristic function 1_{[0,a]}(x) equals 1 if and only if x ∈ [0, a] and 0 otherwise. Then M_1 = [0, 1] ⊂ [0, 2] = M_2, but the parameters are different.

Given a separating parameter function f(x, a), its Hough transform is defined as

η[f] : R^n → P(U),  x ↦ {a ∈ U | f(x, a) = 0},   (3)

where P(U) denotes the set of all subsets of U. So η[f] maps a point x onto the set of all parameters describing objects containing x. But an object M_a as a set is mapped onto a single point {a}, that is,

⋂_{x ∈ M_a} η[f](x) = {a}.   (4)

This follows because if a′ ∈ ⋂_{x ∈ M_a} η[f](x), then f(x, a′) = 0 for all x ∈ M_a, which means that M_a ⊂ M_{a′}; the parameter function f is assumed to be separating, so a = a′. Hence, objects M_a in a data set X = {x(1), . . . , x(T)} can be detected by analyzing clusters in η[f](X).

We will illustrate this concept for line detection in the following section before applying it to the hyperplane identification needed for our SCA problem.

3.2. Classical Hough transform

The (classical) Hough transform detects lines in a given two-dimensional data space as follows: an affine, nonvertical line in R^2 can be described by the equation x_2 = a_1 x_1 + a_2 for fixed a = (a_1, a_2) ∈ R^2. If we define

f_L(x, a) := a_1 x_1 + a_2 − x_2,   (5)

then the above line equals the set M_a from (2) for the unique parameter a, and f_L is clearly separating. Figures 2(a) and 2(b) illustrate this idea.

In practice, polar coordinates are used to describe the line in Hessian normal form; this also allows detecting vertical lines (θ = π/2) in the data set and, moreover, guarantees an isotropic error in contrast to the parametrization (5). This leads to a parameter function

f_P(x, θ, ρ) = x_1 cos(θ) + x_2 sin(θ) − ρ = 0   (6)

for parameters (θ, ρ) ∈ U := [0, π) × R. Then points in data space are mapped to sine curves given by f_P; see Figure 2(c).
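As an illustration of the polar parametrization (6), the following toy sketch (an assumption-laden reimplementation, not code from the paper) lets every data point vote over a discretized (θ, ρ) grid and returns the strongest line; the grid sizes and the bound on ρ are arbitrary choices.

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Classical Hough transform: each point (x1, x2) votes for all (theta, rho)
    with rho = x1*cos(theta) + x2*sin(theta); maxima of the accumulator are lines."""
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    rho_max = max(np.abs(points).sum(axis=1).max(), 1e-9)     # crude bound on |rho|
    acc = np.zeros((n_theta, n_rho))
    for x1, x2 in points:                                     # points: array of shape (N, 2)
        rhos = x1 * np.cos(thetas) + x2 * np.sin(thetas)
        bins = np.clip(((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int),
                       0, n_rho - 1)
        acc[np.arange(n_theta), bins] += 1                    # one vote per theta bin
    i, j = np.unravel_index(acc.argmax(), acc.shape)          # strongest line
    return thetas[i], j / (n_rho - 1) * 2 * rho_max - rho_max, acc
```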

3.3. Generalization

The mixing matrix A in the case of (n − m + 1)-sparse SCA can be recovered by finding all 1-codimensional linear subspaces in the mixture data set. The algorithm presented here uses a generalized version of the Hough transform in order to determine hyperplanes through 0 as follows.

Vectors x ∈ R^m lying on such a hyperplane H can be described by the equation

f_h(x, n) := n^⊤ x = 0,   (7)

where n is a nonzero vector orthogonal to H. After normalization |n| = 1, the normal vector n is uniquely determined by H if we additionally require n to lie on one hemisphere of the unit sphere S^{m−1} := {x ∈ R^m | |x| = 1}. This means that the parametrization f_h is separating. In terms of spherical coordinates on S^{m−1}, n can be expressed as

n = ( cos ϕ sin θ_1 sin θ_2 · · · sin θ_{m−2},
      sin ϕ sin θ_1 sin θ_2 · · · sin θ_{m−2},
      cos θ_1 sin θ_2 · · · sin θ_{m−2},
      . . . ,
      cos θ_{m−2} )^⊤   (8)

with (ϕ, θ_1, . . . , θ_{m−2}) ∈ [0, 2π) × [0, π)^{m−2}; uniqueness of n can be achieved by requiring ϕ ∈ [0, π). Plugging n in spherical coordinates into (7) gives

cot θ_{m−2} = − Σ_{i=1}^{m−1} ν_i(ϕ, θ_1, . . . , θ_{m−3}) x_i / x_m   (9)



for x ∈ R^m with x_m ≠ 0 and

ν_i := cos ϕ ∏_{j=1}^{m−3} sin θ_j   if i = 1,
       sin ϕ ∏_{j=1}^{m−3} sin θ_j   if i = 2,
       cos θ_{i−2} ∏_{j=i−1}^{m−3} sin θ_j   if i > 2.   (10)

With cot(θ + π/2) = − tan(θ) we finally get θ_{m−2} = arctan( Σ_{i=1}^{m−1} ν_i x_i / x_m ) + π/2. Note that continuity is achieved if we set θ_{m−2} := 0 for x_m = 0.
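The following short numpy sketch (an illustration under the index conventions above, not code from the paper) evaluates the coefficients ν_i from (10) and the resulting angle θ_{m−2}; for m = 3 it reduces to θ_1 = arctan((x_1 cos ϕ + x_2 sin ϕ)/x_3) + π/2, the curve shown in Figure 3.

```python
import numpy as np

def nu(phi, thetas):
    """Coefficients nu_i(phi, theta_1, ..., theta_{m-3}) from (10), for i = 1, ..., m-1."""
    m = len(thetas) + 3                       # thetas = (theta_1, ..., theta_{m-3})
    v = np.empty(m - 1)
    v[0] = np.cos(phi) * np.prod(np.sin(thetas))
    v[1] = np.sin(phi) * np.prod(np.sin(thetas))
    for i in range(3, m):                     # the i > 2 branch of (10)
        v[i - 1] = np.cos(thetas[i - 3]) * np.prod(np.sin(thetas[i - 2:]))
    return v

def theta_last(x, phi, thetas):
    """Angle theta_{m-2} of the hyperplanes through x with the remaining angles fixed."""
    if x[-1] == 0:
        return 0.0                            # continuity convention for x_m = 0
    return np.arctan(np.dot(nu(phi, thetas), x[:-1]) / x[-1]) + np.pi / 2
```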

We can then define the generalized "hyperplane detecting" Hough transform as

η[f_h] : R^m → P([0, π)^{m−1}),
x ↦ { (ϕ, θ_1, . . . , θ_{m−2}) ∈ [0, π)^{m−1} | θ_{m−2} = arctan( Σ_{i=1}^{m−1} ν_i x_i / x_m ) + π/2 }.   (11)

The parametrization f_h is separating, so points lying on the same hyperplane are mapped to surfaces that intersect in precisely one point in [0, π)^{m−1}. This is demonstrated for the case m = 3 in Figure 3. The hyperplane structure of a data set X = {x(1), . . . , x(T)} can hence be analyzed by finding clusters in η[f_h](X).

Let RP^{m−1} denote the (m − 1)-dimensional real projective space, that is, the manifold of all 1-dimensional subspaces of R^m. There is a canonical diffeomorphism between RP^{m−1} and the Grassmannian manifold of all (m − 1)-dimensional subspaces of R^m, induced by the scalar product. Using this diffeomorphism, we can reformulate our aim of identifying hyperplanes as finding elements of RP^{m−1}. So the Hough transform η[f_h] maps x onto a subset of RP^{m−1}, which is topologically equivalent to the upper hemisphere in R^m with identifications along the boundary. In fact, in (11) we have simply constructed a coordinate map of RP^{m−1} using spherical coordinates.

4. HOUGH SCA ALGORITHM

The SCA matrix detection algorithm (Algorithm 1) consists of two steps. In the first step, d := \binom{n}{m-1} hyperplanes given by their normal vectors n^(1), . . . , n^(d) are constructed such that the mixture data lies in the union of these hyperplanes; in the case of noise this will hold only approximately. In the second step, mixture matrix columns a_i are identified as generators of the n lines lying at the intersections of \binom{n-1}{m-2} hyperplanes. We replace the first step by the following Hough SCA algorithm.

[Figure 2 — panels: (a) Data space; (b) Linear Hough space; (c) Polar Hough space. Caption: Illustration of the "classical" Hough transform: a point (x_1, x_2) in the data space (a) is mapped (b) onto the line {(a_1, a_2) | a_2 = −a_1 x_1 + x_2} in the linear parameter space R^2 or (c) onto a translated sine curve {(θ, ρ) | ρ = x_1 cos θ + x_2 sin θ} in the polar parameter space [0, π) × R_0^+. The Hough curves of points belonging to one line in data space intersect in precisely one point a in the Hough space, and the data points lie on the line given by the parameter a.]

4.1. Definition

The idea is to first gather the Hough curves η[f_h](x(t)) corresponding to the samples x(t) in a discretized parameter space, in this context often called the Hough accumulator. Plotting these curves in the accumulator is sometimes denoted as voting for each bin, similar to histogram generation. According to the previous section, all points x from some hyperplane H given by a normal vector with angles (ϕ, θ) are mapped onto a parameterized object that contains (ϕ, θ), for all possible x ∈ H. Hence, the corresponding angle bin will contain votes from all samples x(t) lying in H, whereas other bins receive far fewer votes. Therefore, maxima analysis of the accumulator gives the hyperplanes in the parameter space. This idea corresponds to clustering, for all t, all possible normal vectors of planes through x(t) on RP^{m−1}. The resulting Hough SCA algorithm is described in Algorithm 4. We see that only the hyperplane identification step differs from Algorithm 1; the matrix identification is the same.

[Figure 3 — panels: (a) Data space; (b) Spherical Hough space. Caption: Illustration of the "hyperplane detecting" Hough transform in three dimensions: a point (x_1, x_2, x_3) in the data space (a) is mapped onto the curve {(ϕ, θ) | θ = arctan((x_1 cos ϕ + x_2 sin ϕ)/x_3) + π/2} in the parameter space [0, π)^2 (b). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) in the Hough space, and the points lie on the plane given by the normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).]

The number β of bins is also called the grid resolution. As in histogram-based density estimation, the choice of β can seriously affect the algorithm performance: if chosen too small, possible maxima cannot be resolved, and if chosen too large, the sensitivity of the algorithm increases and the computational burden in terms of speed and memory grows considerably; see the next section. Note that Hough SCA performs a global search; hence it is expected to be much slower than local update algorithms such as Algorithm 3, but also much more robust. In the following, its properties will be discussed; applications are given in the example in Section 5.

Data: samples x(1), . . . , x(T) of the random vector X
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Fix the number β of bins (can be separate for each angle).
(2) Initialize the β × · · · × β (m − 1 factors) array α ∈ R^{β^{m−1}} with zeros (accumulator).
for t ← 1, . . . , T do
for ϕ, θ_1, . . . , θ_{m−3} ← 0, π/β, . . . , (β − 1)π/β do
(3) θ_{m−2} ← arctan( Σ_{i=1}^{m−1} ν_i(ϕ, . . . , θ_{m−3}) x_i(t)/x_m(t) ) + π/2
(4) Increase (vote for) the accumulator value of α in the bin corresponding to (ϕ, θ_1, . . . , θ_{m−2}) by one.
end
end
(5) The d := \binom{n}{m-1} largest local maxima of α correspond to the d hyperplanes present in the data set.
(6) Back transformation as in (8) gives the corresponding normal vectors n^(1), . . . , n^(d) of these hyperplanes.
Matrix identification.
(7) Clustering of the hyperplanes generated by (m − 1)-tuples in {n^(1), . . . , n^(d)} gives n separate hyperplanes.
(8) Their normal vectors are the n columns of the estimated mixing matrix Â.
Algorithm 4: Hough SCA algorithm for mixing matrix identification.

4.2. Complexity

We will only discuss the complexity of the hyperplane estimation, because the matrix identification is performed on a data set of size d, which is typically much smaller than the sample size T.

The angle θ_{m−2} has to be calculated Tβ^{m−2} times. Since only discrete values of the angles are of interest, the trigonometric functions as well as the ν_i can be precalculated and stored in exchange for speed. Then each calculation of θ_{m−2} involves 2m − 1 operations (sums and products/divisions). The voting (without taking "lookup" costs in the accumulator into account) costs an additional operation. Altogether, the accumulator can be filled with 2Tβ^{m−2}m operations. This means that the algorithm depends linearly on the sample size and is polynomial in the grid resolution and exponential in the mixture dimension. The maxima search involves O(β^{m−1}) operations, which for small to medium dimensions can be ignored in comparison to the accumulator generation because usually β ≪ T.

So the main part of the algorithm does not depend on the source dimension n but only on the mixture dimension m. This means for applications that n can be quite large, but hyperplanes will still be found if the grid resolution is high enough. Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n. The memory requirement of the algorithm is dominated by the accumulator size, which is β^{m−1}. This can limit the grid resolution.
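The following unoptimized numpy sketch illustrates the hyperplane identification step (1)-(4) of Algorithm 4; it is not the authors' implementation, it assumes m ≥ 3, and it reuses the hypothetical helper theta_last() from the sketch in Section 3.3. The maxima search and back transformation of steps (5)-(6) are omitted.

```python
import numpy as np
from itertools import product

def hough_sca_accumulator(X, beta):
    """Fill the (m-1)-dimensional Hough accumulator: each sample votes, for every grid
    point of the free angles, for the bin of the remaining angle theta_{m-2} from (9)-(11)."""
    m, T = X.shape
    grid = np.arange(beta) * np.pi / beta                   # angle values in [0, pi)
    acc = np.zeros((beta,) * (m - 1), dtype=int)
    for t in range(T):
        x = X[:, t]
        for angles in product(range(beta), repeat=m - 2):   # indices of phi, theta_1, ..., theta_{m-3}
            phi = grid[angles[0]]
            thetas = grid[list(angles[1:])]
            theta = theta_last(x, phi, thetas) % np.pi       # theta_{m-2} for this sample
            b = min(int(theta / np.pi * beta), beta - 1)     # its bin index
            acc[angles + (b,)] += 1                          # vote
    return acc
```

Note that this loops over Tβ^{m−2} angle combinations, matching the complexity estimate above; precomputing the trigonometric terms, as suggested in the text, would speed this up considerably.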

4.3. Resolution error

The choice of the grid resolution β in the algorithm induces a systematic resolution error in the estimation of A (as a tradeoff for robustness and speed). This error is calculated in this section.

Let A be the unknown mixing matrix and Â its estimate, constructed by the Hough SCA algorithm (Algorithm 4) with grid resolution β. Let n^(1), . . . , n^(d) be the normal vectors of the hyperplanes generated by (m − 1)-tuples of columns of A, and let n̂^(1), . . . , n̂^(d) be their corresponding estimates. Ignoring permutations, it is sufficient to describe only how n̂^(i) differs from n^(i).

Assume that the maxima of the accumulator are correctly estimated, but that, due to the discrete grid resolution, an average error of π/2β is made when estimating the precise maximum position, because the size of one bin is π/β. How is this error propagated into n̂^(i)? By assumption, each estimate ϕ̂, θ̂_1, . . . , θ̂_{m−2} differs from ϕ, θ_1, . . . , θ_{m−2} by at most π/2β. As we are only interested in an upper bound, we simply calculate the deviation of each component of n̂^(i) from n^(i). Using the fact that sine and cosine are bounded by one, (8) then gives the estimates |n̂^(i)_j − n^(i)_j| ≤ (m − 1)π/(2β) for each coordinate j, so altogether

‖n̂^(i) − n^(i)‖ ≤ (m − 1)√m π / (2β).   (12)

This estimate may be improved by using the Jacobian of the spherical coordinate transformation and its determinant, but for our purpose this bound is sufficient. In summary, we have shown that the grid resolution contributes a β^{−1}-perturbation to the estimation of A.
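For a concrete feeling of the size of this bound (a worked example, not stated in the paper), take the values used in the later experiments, m = 3 and β = 360:

\[
\|\hat n^{(i)} - n^{(i)}\| \;\le\; \frac{(m-1)\sqrt{m}\,\pi}{2\beta} \;=\; \frac{2\sqrt{3}\,\pi}{720} \;\approx\; 0.015,
\]

that is, the discretization alone already perturbs the unit normal vectors by roughly 1.5%.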

4.4. Robustness

Robustness with regard to additive noise as well as outliers is important for any algorithm to be used in the real world. Here an outlier is roughly defined to be a sample far away from other observations, and indeed some researchers define outliers to be samples further away from the mean than, say, 5 standard deviations. However, such definitions necessarily depend on the underlying random variable to be estimated, so most books only give examples of outliers, and indeed no consistent, context-free, precise definition of outliers exists [25]. In the following, given samples of a fixed random variable of interest, we denote a sample as an outlier if it is drawn from another, sufficiently different distribution.

Fitting only one hyperplane to the data set can be achieved by linear regression, namely by minimizing the squared distance to such a possible hyperplane. These least squares fitting algorithms are well known to be sensitive to outliers, and various extensions of the LS method such as least median of squares and reweighted least squares [26] have been developed to overcome this problem. The breakdown point of the latter is 0.5, which means that the fit parameters are only stably estimated for data sets with less than 50% outliers. The other techniques typically have much lower breakdown points, usually below 0.3. The classical Hough transform, albeit not a regression method, is comparable in terms of breakdown with robust fitting algorithms such as the reweighted least squares algorithm [27]. In the experiments we will observe similar results for the generalized method presented above. Namely, we achieve breakdown levels of up to 0.8 in the low-noise case, which considerably decrease with increasing noise.

From a mathematical point of view, the "classical" Hough transform has been studied quite extensively, both as an estimator (and extension of linear regression) and with regard to algorithmic and implementational aspects; see, for example, [28] and references therein. Most of the theoretical results presented for the two-dimensional case could be extended to the more general objective presented here, but this is not within the scope of this manuscript. Simulations giving experimental evidence that the robustness also holds in our case are shown in Section 5.

4.5. Extensions

The following possible extensions to the Hough SCA algorithm can be employed to increase its performance.

If the noise level is known, smoothing of the accumulator (antialiasing) will help to give more robust results in terms of noise. For smoothing (usually with a Gaussian), the smoothing radius must be set according to the noise level. If the noise level is not known, smoothing can still be applied by gradually increasing the radius until the number of clearly detectable maxima equals d.

Furthermore, an additional fine-tuning step is possible: the estimated plane normals are slightly deteriorated by the systematic resolution error shown previously. However, after application of Hough SCA, the data space can be clustered into data points lying close to the corresponding hyperplanes. Within each cluster, linear regression (or some more robust version of it; see Section 4.4) can then be applied to improve the hyperplane estimate; this is actually the idea used locally in the k-hyperplane clustering algorithm (Algorithm 3). Such a method requires additional computational power, but makes the algorithm less dependent on the grid resolution, which is then only needed for the hyperplane clustering step. However, it is expected that this additional fine-tuning step may decrease robustness, especially against biased noise and outliers.

5. SIMULATIONS

We give a simulation example as well as batch runs to analyze the performance of the proposed algorithm.



[Figure 4 — panels: (a) Source signals; (b) Mixture signals; (c) Normalized mixture scatter plot; (d) Hough accumulator with labeled maxima. Caption: Example: (a) shows the 2-sparse, sufficiently richly represented, 4-dimensional source signals, and (b) the randomly mixed, 3-dimensional mixtures. The normalized mixture scatter plot {x(t)/|x(t)| | t = 1, . . . , T} is given in (c), and the generated Hough accumulator in (d); note that the color scale in (d) was chosen to be nonlinear (γ_new := (1 − γ/max)^10) in order to visualize structure in addition to the strong maxima.]

5.1. Explicit example

In the first experiment, we consider the case of source dimension n = 4 and mixture dimension m = 3. The 4-dimensional sources have been generated from i.i.d. samples (two Laplacian and two Gaussian sequences), followed by setting some entries to zero in order to fulfill the sparsity constraints; see Figure 4(a). They are 2-sparse and consist of 1000 samples. Obviously, all combinations (i, j), i < j, of active sources are present in the data set; this condition is needed by the matrix recovery step. The sources were mixed using a mixing matrix with randomly (uniformly in [−1, 1]) chosen coefficients to give the mixtures shown in Figure 4(b). The mixture density clearly lies in 6 disjoint hyperplanes, spanned by pairs (a_i, a_j), i < j, of mixing matrix columns, as indicated by the normalized scatter plot in Figure 4(c), similar to the illustration from Figure 1(c).

In order to detect the planes in the data space, we apply the generalized Hough transform as explained in Section 3.3. Figure 4(d) shows the Hough image with β = 360. Each sample results in a curve, and clearly 6 intersection points are visible, which correspond to the 6 hyperplanes in question. Maxima analysis retrieves these points (in Hough space) as shown in the same figure. After transforming these points back into R^3 with the inverse Hough transform, we get 6 normalized vectors corresponding to the 6 planes. Considering intersections of the hyperplanes, we notice that only 4 intersection lines are contained in precisely 3 of the planes, and these 4 lines are spanned by the matrix columns a_i. For practical reasons, we recover these combinatorially from the plane normal vectors; see Algorithm 4. The deviation of the recovered mixing matrix Â from the original mixing matrix A in the overcomplete case can be measured by the generalized crosstalking error [8], defined as E(A, Â) := min_{M∈Π} ‖A − ÂM‖, where the minimum is taken over the group Π of all invertible real




n × n matrices in which only one entry per column differs from 0; ‖·‖ denotes a fixed matrix norm. In our case the generalized crosstalking error is very low with E(A, Â) = 0.040. This essentially means that the two matrices, after permutation, differ only by 0.04 with respect to the chosen matrix norm, in our case the (squared) Frobenius norm. Then, the sources are recovered using the source recovery algorithm (Algorithm 2) with the approximated mixing matrix Â. The normalized signal-to-noise ratios (SNRs) of the recovered sources with respect to the original ones are high at 36, 38, 36, and 37 dB, respectively.
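A brute-force numpy sketch of this error measure is given below; it is an illustration assuming the Frobenius norm and nonzero columns of Â, and the factorial search over column permutations is only feasible for small n (it is not the authors' implementation).

```python
import numpy as np
from itertools import permutations

def crosstalking_error(A, A_hat):
    """E(A, A_hat) = min over scaled permutations M of ||A - A_hat M||_F:
    for every column permutation, the optimal per-column scale is a 1D least squares fit."""
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        B = A_hat[:, list(perm)]                              # permuted estimate
        scales = np.array([(B[:, j] @ A[:, j]) / (B[:, j] @ B[:, j]) for j in range(n)])
        best = min(best, np.linalg.norm(A - B * scales))      # Frobenius norm of the difference
    return best
```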

As a modification of the previous example, we now also consider additive noise. We use the sources S (which have unit covariance) and the mixing matrix A from above, but add 1% random white noise to the mixtures, X = AS + 0.01N, where N is a normal random vector. This corresponds to a still high mean SNR of 38 dB. When considering the normalized scatter plot, again the 6 planes are visible, but the additive noise deteriorates the clear separation of the planes. We apply the generalized Hough transform to the mixture data; however, because of the noise we choose a coarser discretization (β = 180 bins). Curves in Hough space corresponding to a single plane no longer intersect in precisely one point due to the noise; a low-resolution Hough space, however, fuses these intersections into one point, so that our simple maxima detection still achieves good results. We recover the mixing matrix as above and get a low generalized crosstalking error of E(A, Â) = 0.12. The sources are recovered well with mean SNRs of 20 dB, which is quite satisfactory considering the noisy, overcomplete mixture situation.

The following example demonstrates the good performance in higher source dimensions. Consider 6-dimensional 2-sparse sources that are mixed again by a matrix A with coefficients drawn uniformly from [−1, 1]. Application of the generalized Hough transform to the mixtures retrieves the plane normal vectors. The recovered mixing matrix has a low generalized crosstalking error of E(A, Â) = 0.047. However, if the noise level increases, the performance drops considerably, because many maxima, in this case 15, have to be located in the accumulator. After recovering the sources with this approximated matrix Â, we get SNRs of only 11, 8, 6, 10, 12, and 11 dB. The rather high source recovery error is most probably due to the sensitivity of the source recovery to slight perturbations in the approximated mixing matrix.

5.2. Outliers

We will now perform experiments systematically analyzing the robustness of the proposed algorithm with respect to outliers in the sense of model-violating samples.

In the first explicit example we consider the sources from Figure 4(a), but 80% of the samples have been replaced by outliers (drawn from a 4-dimensional normal distribution). Due to the high percentage of outliers, the mixtures, mixed by the same random 3 × 4 matrix A as before, do not obviously exhibit any clear hyperplane structure. As discussed in Section 4.4, the Hough SCA algorithm is very robust against outliers. Indeed, in addition to a noisy background within the Hough accumulator, the intersection maxima are still noticeable, and local maxima detection finds the correct hyperplanes (cf. Figure 4(d)), although 80% of the data is corrupted. The recovered mixing matrix has an excellent generalized crosstalking error of E(A, Â) = 0.040. Of course, the sparse source recovery from above cannot recover the outlying samples. Applying the corresponding algorithms, we get SNRs of around 4 dB between the corrupted sources and the recovered ones; source recovery with the pseudoinverse of Â, corresponding to maximum-likelihood recovery with a Gaussian prior, gives somewhat better SNRs of around 6 dB. But the sparse recovery method has the advantage that it can detect outliers by measuring the distance from the hyperplanes, so outlier rejection is possible. Note that we get similar results when the outliers are not added in the source space but only in the mixture space, that is, only after the mixing process.

We now perform a numerical comparison of the number of outliers versus the algorithm performance for varying noise levels; see Figure 5. The rationale behind this is that already small noise levels in addition to the outliers might be enough to destroy maxima in the accumulator, thus deteriorating the SCA performance. The same (uncorrupted) sources and mixing matrix from above are used. Numerically, we get breakdown points of 0.8 for the no-noise case, and values of 0.5, 0.3, and 0.1 with increasing noise levels of 0.1% (58 dB), 0.5% (44 dB), and 1% (38 dB). Better performance at higher noise levels could be achieved by applying antialiasing techniques before maxima detection, as described in Section 4.5.

5.3. Grid resolution

In this section we present numerical examples to confirm the linear dependence of the algorithm performance on the inverse grid resolution β^{−1}. We consider 4-dimensional sources S with 1000 samples, in which for each sample two source components were drawn from a distribution uniform on [−1, 1] and the other two were set to zero, so S is 2-sparse. For each grid resolution β we perform 50 runs, and in each run a new set of sources is generated as above. These are then mixed using a 3 × 4 mixing matrix A with random coefficients drawn uniformly from [−1, 1]. Application of the Hough SCA algorithm gives an estimated matrix Â. In Figure 6 we plot the mean generalized crosstalking error E(A, Â) for each grid resolution. With increasing β the accuracy increases; a logarithmic plot indeed confirms the linear dependence on β^{−1}, as stated in Section 4.3. Furthermore, we see that, for example, for β = 360, among all S and A as above we get a mean crosstalking error of 0.23 ± 0.5.

5.4. Batch runs and comparison with hyperplane k-means

In the last example, we consider the case of m = n = 4 and compare the proposed algorithm (now with a three-dimensional accumulator) with the k-hyperplane clustering algorithm (Algorithm 3).



[Figure 5 — panels: (a) Noiseless breakdown analysis with respect to outliers; (b) Breakdown analysis for varying noise level (noise = 0%, 0.1%, 0.5%, 1%); both plot the crosstalking error against the percentage of outliers. Caption: Performance of Hough SCA with an increasing number of outliers. Plotted is the percentage of outliers in the source data versus the matrix recovery performance (measured by the generalized crosstalking error). For each 1%-step one calculation was performed; in (b) the plots have been smoothed by taking the average over ten 1%-steps. In the no-noise case 360 bins were used, 180 bins in all other cases.]

[Figure 6 — panels: (a) Mean performance versus grid resolution; (b) Fit of the logarithmic mean performance (ln E with a least squares line fit against the grid resolution). Caption: Dependence of the Hough SCA performance (a) on the grid resolution β; the mean has been taken over 50 runs. With a logarithmic y-axis (b), a least squares line fit confirms the linear dependence of performance and β^{−1}.]

For this, random 1-sparse sources S with T = 10^5 samples are generated by drawing uniformly from [−1, 1] and randomly setting a single coordinate of each sample to zero. In 100 batch runs, a random 4 × 4 mixing matrix A with coefficients uniformly drawn from [−1, 1], but with columns normalized to 1, is constructed. The resulting mixtures X := AS are then separated both by the proposed Hough SCA algorithm and by the Bradley-Mangasarian k-hyperplane clustering algorithm (with 100 iterations and without restarts). The resulting median crosstalking error E(A, Â) of the Hough algorithm is 3.3 ± 2.3 and hence considerably lower than the k-hyperplane clustering result of 5.5 ± 1.9. This confirms the well-known fact that k-means and its extensions exhibit local convergence only and are therefore susceptible to local minima, as seems to be the case in our example. A possible solution would be to use many restarts, but global convergence cannot be guaranteed. For practical applications, we therefore suggest using a rather rough (low grid resolution β) global search by Hough SCA followed by a finer local search using k-hyperplane clustering; see Section 4.5.



[Figure 7 — panels: (a) Source signals; (b) Hough accumulator with three labeled maxima; (c) Recovered sources; (d) Recovered sources after outlier removal. Caption: Application to speech signals: (a) shows the original speech sources ("peace and love," "hello, how are you," and "to be or not to be"), and (b) the Hough accumulator when trained on mixtures of (a) with 20% outliers. A nonlinear gray scale γ_new := (1 − γ/max)^10 was chosen for better visualization. (c) and (d) present the recovered sources, without and with outlier removal. They coincide with (a) up to permutation (reversed order) and scaling.]

5.5. Application to the separation of speech signals

In order to illustrate that the SCA assumptions are also valid for real data sets, we briefly present an application to audio source separation, namely the instantaneous, robust BSS of speech signals, a problem of importance in the field of audio signal processing. In the next section, we then refer to other works applying the model to biomedical data sets.

We consider three speech signals S of length 2.2 s, sampled at 22000 Hz; see Figure 7(a). They are spoken by the same person, but may still be assumed to be independent. The signals are mixed by a randomly chosen mixing matrix A (coefficients uniform in [−1, 1]) to yield mixtures X = AS, but 20% outliers are introduced by replacing 20% of the samples of X by i.i.d. Gaussian samples. Without the outliers, more classical BSS algorithms such as ICA would have been able to separate the mixtures perfectly; in this noisy setting, however, ICA performs very poorly: application of the popular fastICA algorithm [29] yields only a poor estimate Â_f of the mixing matrix A, with a high crosstalking error of E(A, Â_f) = 3.73.

Instead, we apply the complete-case Hough SCA algorithm to this model with β = 360 bins; the sparseness assumption now means that we are searching for sources

3500<br />

3000<br />

2500<br />

2000<br />

1500<br />

1000<br />

which have samples with at least one zero (quiet) source<br />

component. The Hough accumulator exhibits very nicely<br />

three strong maxima; see Figure 7(b). And <strong>in</strong>deed, the<br />

crosstalk<strong>in</strong>g error of the correspond<strong>in</strong>g estimated mix<strong>in</strong>g<br />

matrix �A with the orig<strong>in</strong>al one is very low at E(A, �A) = 0.020.<br />

This experimentally confirms that speech signals obey an<br />

(m−1)-sparse signal model, at least if m = n. An explanation<br />

for this fact is that <strong>in</strong> typical speech data sets, considerable<br />

pauses are common, so with high probability we may f<strong>in</strong>d<br />

samples <strong>in</strong> which at least one source vanishes, and all such<br />

permutations occur—which is necessary for identify<strong>in</strong>g the<br />

mix<strong>in</strong>g matrix accord<strong>in</strong>g to Theorem 1. We are deal<strong>in</strong>g with<br />

a complete-case problem, so <strong>in</strong>vert<strong>in</strong>g �A directly yields recovered<br />

sources �S. But of course due to the outliers, the SNR<br />

of �S with the orig<strong>in</strong>al sources is low with only −1.35 dB. We<br />

therefore apply a simple outlier removal scheme by scann<strong>in</strong>g<br />

each estimated source us<strong>in</strong>g a w<strong>in</strong>dow of size w = 10 samples.<br />

An adjacent sample to the w<strong>in</strong>dow is identified as outlier<br />

if its absolute value is larger than 20% of the maximal signal<br />

amplitude, but the w<strong>in</strong>dow sample variance is lower than<br />

half of the variance when <strong>in</strong>clud<strong>in</strong>g the sample. The outliers<br />

are then replaced by the w<strong>in</strong>dow average. This rough outlierdetection<br />

algorithm works satisfactorily well, see Figure 7(d);<br />

500



the perceptual audio quality increased considerably, see also the differences between Figures 7(c) and 7(d), although the nominal SNR increase is only roughly 4.1 dB. Altogether, this example illustrates the applicability of the Hough SCA algorithm and its corresponding SCA model to audio data sets also in noisy settings, where ICA algorithms perform very poorly.
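A minimal sketch of the windowed outlier-removal scheme described above (window size w = 10, the 20% amplitude threshold and the factor-two variance test) could look as follows; parameter names and the exact handling of the signal borders are our own choices.

```python
import numpy as np

def remove_outliers(s, w=10, amp_frac=0.2, var_ratio=0.5):
    """Rough windowed outlier removal, a sketch of the scheme above.

    A sample adjacent to a length-w window counts as an outlier if its
    magnitude exceeds amp_frac of the maximal signal amplitude while the
    window variance is less than var_ratio times the variance including
    the sample; outliers are replaced by the window mean.
    """
    s = np.asarray(s, dtype=float).copy()
    max_amp = np.max(np.abs(s))
    for t in range(w, len(s)):
        window = s[t - w:t]
        candidate = s[t]
        var_without = np.var(window)
        var_with = np.var(np.append(window, candidate))
        if abs(candidate) > amp_frac * max_amp and var_without < var_ratio * var_with:
            s[t] = window.mean()
    return s

# usage: s_clean = remove_outliers(s_estimated)
```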

5.6. Other applications

We are currently studying several biomedical applications of the proposed model and algorithm, including the separation of functional magnetic resonance imaging data sets as well as surface electromyograms. For results on the former data set, we refer to the detailed book chapters [22, 23].

The results of the k-SCA algorithm applied to the latter signals are shortly summarized in the following. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle; its study is relevant to the diagnosis of motoneuron diseases as well as to neurophysiological research. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use surface EMGs, which are measured using noninvasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and overlap of several source signals. When applying the k-SCA model to real recordings, Hough-based separation outperforms classical approaches based on filtering and ICA in terms of a greater reduction of the zero-crossings, a common measure to analyze the unknown extracted sources. The relative sEMG enhancement was 24.6 ± 21.4%, where the mean was taken over a group of 9 subjects. For a detailed analysis, comparing various sparse factorization models both on toy and on real data, we refer to [30].

6. CONCLUSION

We have presented an algorithm for performing a global search for overcomplete SCA representations, and experiments confirm that Hough SCA is robust against noise and outliers, with breakdown points up to 0.8. The algorithm employs hyperplane detection using a generalized Hough transform. Currently, we are working on applying the SCA algorithm to high-dimensional biomedical data sets to see how the different assumption of high sparsity contributes to the signal separation.

ACKNOWLEDGMENTS

The authors gratefully thank W. Nakamura for her suggestion of using the Hough transform when detecting hyperplanes, and the anonymous reviewers for their comments, which significantly improved the manuscript. The first author acknowledges partial financial support by the JSPS (PE 05543).

REFERENCES

[1] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, John Wiley & Sons, New York, NY, USA, 2002.
[2] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[3] P. Comon, "Independent component analysis. A new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[4] F. J. Theis, "A new concept for separability problems in blind source separation," Neural Computation, vol. 16, no. 9, pp. 1827–1850, 2004.
[5] J. Eriksson and V. Koivunen, "Identifiability and separability of linear ICA models revisited," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA '03), pp. 23–27, Nara, Japan, April 2003.
[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal of Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.
[7] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 5, pp. 2197–2202, 2003.
[8] F. J. Theis, E. W. Lang, and C. G. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomputing, vol. 56, no. 1–4, pp. 381–398, 2004.
[9] P. Georgiev, F. J. Theis, and A. Cichocki, "Sparse component analysis and blind source separation of underdetermined mixtures," IEEE Transactions on Neural Networks, vol. 16, no. 4, pp. 992–996, 2005.
[10] P. V. C. Hough, "Machine analysis of bubble chamber pictures," in International Conference on High Energy Accelerators and Instrumentation, pp. 554–556, CERN, Geneva, Switzerland, 1959.
[11] J. K. Lin, D. G. Grier, and J. D. Cowan, "Feature extraction approach to blind source separation," in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (NNSP '97), pp. 398–405, Amelia Island, Fla, USA, September 1997.
[12] H. Shindo and Y. Hirai, "An approach to overcomplete-blind source separation using geometric structure," in Proceedings of the Annual Conference of the Japanese Neural Network Society (JNNS '01), pp. 95–96, Naramachi Center, Nara, Japan, 2001.
[13] F. J. Theis, C. G. Puntonet, and E. W. Lang, "Median-based clustering for underdetermined blind signal processing," IEEE Signal Processing Letters, vol. 13, no. 2, pp. 96–99, 2006.
[14] L. Cirillo, A. Zoubir, and M. Amin, "Direction finding of nonstationary signals using a time-frequency Hough transform," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. 2718–2721, Philadelphia, Pa, USA, March 2005.
[15] S. Barbarossa, "Analysis of multicomponent LFM signals by a combined Wigner-Hough transform," IEEE Transactions on Signal Processing, vol. 43, no. 6, pp. 1511–1515, 1995.
[16] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, no. 2, pp. 111–122, 1981.
[17] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 87–90, 1999.
[18] K. Waheed and F. Salem, "Algebraic overcomplete independent component analysis," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA '03), pp. 1077–1082, Nara, Japan, April 2003.



[19] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.
[20] F. J. Theis, P. Georgiev, and A. Cichocki, "Robust overcomplete matrix recovery for sparse sources using a generalized Hough transform," in Proceedings of the 12th European Symposium on Artificial Neural Networks (ESANN '04), pp. 343–348, Bruges, Belgium, April 2004, d-side, Evere, Belgium.
[21] P. S. Bradley and O. L. Mangasarian, "k-plane clustering," Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
[22] P. Georgiev, P. Pardalos, F. J. Theis, A. Cichocki, and H. Bakardjian, "Sparse component analysis: a new tool for data mining," in Data Mining in Biomedicine, Springer, New York, NY, USA, 2005, in print.
[23] P. Georgiev, F. J. Theis, and A. Cichocki, "Optimization algorithms for sparse representations and applications," in Multiscale Optimization Methods, P. Pardalos, Ed., Springer, New York, NY, USA, 2005.
[24] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Communications of the ACM, vol. 15, no. 1, pp. 204–208, 1972.
[25] R. Dudley, Department of Mathematics, MIT, course 18.465, 2005.
[26] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, New York, NY, USA, 1987.
[27] P. Ballester, "Applications of the Hough transform," in Astronomical Data Analysis Software and Systems III, J. Barnes, D. R. Crabtree, and R. J. Hanisch, Eds., vol. 61 of ASP Conference Series, 1994.
[28] A. Goldenshluger and A. Zeevi, "The Hough transform estimator," Annals of Statistics, vol. 32, no. 5, pp. 1908–1932, 2004.
[29] A. Hyvärinen and E. Oja, "A fast fixed-point algorithm for independent component analysis," Neural Computation, vol. 9, no. 7, pp. 1483–1492, 1997.
[30] F. J. Theis and G. A. García, "On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms," Signal Processing, vol. 86, no. 3, pp. 603–623, 2006.

Fabian J. Theis obtained his M.S. degree in mathematics and physics from the University of Regensburg, Germany, in 2000. He also received the Ph.D. degree in physics from the same university in 2002 and the Ph.D. degree in computer science from the University of Granada in 2003. He worked as a Visiting Researcher at the Department of Architecture and Computer Technology (University of Granada, Spain), at the RIKEN Brain Science Institute (Wako, Japan), at FAMU-FSU (Florida State University, USA), and at TUAT's Laboratory for Signal and Image Processing (Tokyo, Japan). Currently, he is heading the Signal Processing & Information Theory Group at the Institute of Biophysics at the University of Regensburg and is working on his habilitation. He serves as an Associate Editor of "Computational Intelligence and Neuroscience," and is a Member of IEEE, EURASIP, and ENNS. His research interests include statistical signal processing, machine learning, blind source separation, and biomedical data analysis.

Pando Georgiev received his M.S., Ph.D., and "Doctor of Mathematical Sciences" degrees in mathematics (operations research) from Sofia University "St. Kl. Ohridski," Bulgaria, in 1982, 1987, and 2001, respectively. He has been with the Department of Probability, Operations Research, and Statistics at the Faculty of Mathematics and Informatics, Sofia University "St. Kl. Ohridski," Bulgaria, as an Assistant Professor (1989–1994) and, since 1994, as an Associate Professor. He was a Visiting Professor at the University of Rome II, Italy (CNR grants, several one-month visits), the International Center for Theoretical Physics, Trieste, Italy (ICTP grant, six months), the University of Pau, France (NATO grant, three months), Hirosaki University, Japan (JSPS grant, nine months), and so forth. He worked for four years (2000–2004) as a research scientist at the Laboratory for Advanced Brain Signal Processing, Brain Science Institute, the Institute of Physical and Chemical Research (RIKEN), Wako, Japan. Currently he is a Visiting Scholar in the ECECS Department, University of Cincinnati, USA. His interests include machine learning and computational intelligence, independent and sparse component analysis, blind signal separation, statistics and inverse problems, signal and image processing, optimization, and variational analysis. He is a Member of AMS, IEEE, and UBM.

Andrzej Cichocki was born in Poland. He received the M.S. (with honors), Ph.D., and Habilitate Doctorate (Dr.Sc.) degrees, all in electrical engineering, from the Warsaw University of Technology (Poland) in 1972, 1975, and 1982, respectively. He is the coauthor of three international and successful books (two of them translated to Chinese): Adaptive Blind Signal and Image Processing (John Wiley, 2002), MOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer, 1989), and Neural Networks for Optimization and Signal Processing (J. Wiley and Teubner Verlag, 1993/1994), and the author or coauthor of more than three hundred papers. He is the Editor-in-Chief of the journal Computational Intelligence and Neuroscience and an Associate Editor of IEEE Transactions on Neural Networks. Since 1997, he has been the Head of the Laboratory for Advanced Brain Signal Processing in the RIKEN Brain Science Institute, Japan.


Chapter 12

LNCS 3195:718-725, 2004

Paper F.J. Theis and S. Amari. Postnonlinear overcomplete blind source separation using sparse sources. In Proc. ICA 2004, volume 3195 of LNCS, pages 718–725, Granada, Spain, 2004

Reference (Theis and Amari, 2004)

Summary in section 1.4.1


Postnonlinear overcomplete blind source separation using sparse sources

Fabian J. Theis 1,2 and Shun-ichi Amari 1

1 Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, Japan
2 Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
fabian@theis.name, amari@brain.riken.go.jp

Abstract. We present an approach for blindly decomposing an observed random vector x into f(As), where f is a diagonal function, i.e. f = f1 × ... × fm with one-dimensional functions fi, and A an m × n matrix. This postnonlinear model is allowed to be overcomplete, which means that fewer observations than sources (m < n) are given. In contrast to Independent Component Analysis (ICA), we do not assume the sources s to be independent but to be sparse in the sense that at each time instant they have at most m − 1 non-zero components (Sparse Component Analysis or SCA). Identifiability of the model is shown, and an algorithm for model and source recovery is proposed. It first detects the postnonlinearities in each component, and then identifies the now linearized model using previous results.

Blind source separation (BSS) based on ICA is a rapidly growing field (see for instance [1,2] and references therein), but most algorithms deal only with the case of at least as many observations as sources. However, there is an increasing interest in (linear) overcomplete ICA [3–5], where matrix identifiability is known [6], but source identifiability does not hold. In order to approximately detect the sources [7], additional requirements have to be made, usually sparsity of the sources.

Recently, we have proposed a model based only upon the sparsity assumption (summarized in section 1) [8]. In this case identifiability of both matrix and sources can be shown, given sufficiently high sparsity. Here, we extend these results to postnonlinear mixtures (section 2); they describe a model often occurring in real situations, when the mixture is in principle linear, but the sensors introduce an additional nonlinearity during the recording [9]. Section 3 presents an algorithm for identifying such models, and section 4 finishes with an illustrative simulation.

1 Linear overcomplete SCA

Definition 1. A vector v ∈ R^n is said to be k-sparse if v has at most k non-zero entries.



If an n-dimensional vector is (n − 1)-sparse, that is, it includes at least one zero component, it is simply said to be sparse. The goal of Sparse Component Analysis of level k (k-SCA) is to decompose a given m-dimensional random vector x into

x = As    (1)

with a real m × n matrix A and an n-dimensional k-sparse random vector s. Here s is called the source vector, x the mixtures and A the mixing matrix. We speak of complete, overcomplete or undercomplete k-SCA if m = n, m < n or m > n, respectively. In the following, without loss of generality, we will assume m ≤ n because the undercomplete case can easily be reduced to the complete case by projection of x.

Theorem 1 (Matrix identifiability). Consider the k-SCA problem from equation (1) for k := m − 1 and assume that every m × m submatrix of A is invertible. Furthermore let s be sufficiently richly represented in the sense that for any index set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of s such that each of them has zero elements in places with indices in I and each m − 1 of them are linearly independent. Then A is uniquely determined by x except for left-multiplication with permutation and scaling matrices.

Theorem 2 (Source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (m − 1)-sparse solution s. If A fulfills the condition from theorem 1, then there exists a subset H0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H0 this system has no other solution with this property.

The above two theorems show that in the case of overcomplete BSS using (m−1)-SCA, both the mixing matrix and the sources can be uniquely recovered from x except for the omnipresent permutation and scaling indeterminacy. We refer to [8] for proofs of these theorems and algorithms based upon them. We also want to note that the present source recovery algorithm is quite different from the usual sparse source recovery using l1-norm minimization [7] and linear programming. In the case of sources with sparsity as above, the latter will not be able to detect the sources.

2 Postnonlinear overcomplete SCA

2.1 Model

Consider n-dimensional k-sparse sources s with k < m. The postnonlinear mixing model [9] is defined to be

x = f(As)    (2)

with a diagonal invertible function f with f(0) = 0 and a real m × n matrix A. Here a function f is said to be diagonal if each component fi only depends on xi. In abuse of notation we will in this case interpret the components fi of f as



functions with domain R and write f = f1 × ... × fm. The goal of overcomplete postnonlinear k-SCA is to determine the mixing functions f and A and the sources s given only x.

Without loss of generality we consider only the complete and the overcomplete case (i.e. m ≤ n). In the following we will assume that the sources are sparse of level k := m − 1 and that the components fi of f are continuously differentiable with f′i(t) ≠ 0. This is equivalent to saying that the fi are continuously differentiable with continuously differentiable inverse functions (diffeomorphisms).

2.2 Identifiability

Definition 2. Let A be an m × n matrix. Then A is said to be mixing if A has at least two nonzero entries in each row. And A = (aij), i = 1...m, j = 1...n, is said to be absolutely degenerate if there are two columns k ≠ l such that a²ik = λ a²il for all i and fixed λ ≠ 0, i.e. the normalized columns differ only by the signs of their entries.

Postnonlinear overcomplete SCA is a generalization of linear overcomplete SCA, so the indeterminacies of postnonlinear SCA contain at least the indeterminacies of linear overcomplete SCA: A can only be reconstructed up to scaling and permutation. Also, if L is an invertible scaling matrix, then

f(As) = (f ◦ L)((L⁻¹A)s),

so f and A can interchange scaling factors in each component.

Two further indeterminacies occur if A is either not mixing or absolutely degenerate. In the first case, this means that fi cannot be identified if the i-th row of A contains only one non-zero element. In the case of an absolutely degenerate mixing matrix, sparseness alone cannot detect the nonlinearity, as the counterexample

A = [ 1  1 ; 1  −1 ]

with arbitrary f1 ≡ f2 shows.

If s is an n-dimensional random vector, its image (or the support of its density) is denoted as im s := {s(t)}.

Theorem 3 (Identifiability). Let s be an n-dimensional k-sparse random vector (k < m), and x an m-dimensional random vector constructed from s as in equation (2). Furthermore assume that

(i) s is fully k-sparse in the sense that im s equals the union of all k-dimensional coordinate spaces (in which it is contained by the sparsity assumption),
(ii) A is mixing and not absolutely degenerate,
(iii) every m × m submatrix of A is invertible.

If x = f̂(Âŝ) is another representation of x as in equation (2) with ŝ satisfying the same conditions as s, then there exists an invertible scaling L with f = f̂ ◦ L, and invertible scaling and permutation matrices L′, P′ with A = LÂL′P′.



Fig. 1. Illustration of the proof of theorem 3 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (leftmost figure) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (middle figure). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (right figure) according to the algorithms from [8]. The idea of the proof now is that two lines spanned by coordinate vectors (thick lines, leftmost figure) are mapped onto two lines spanned by two columns of A. If the composition g ◦ f maps these lines onto some different lines (as sets), then we show that (given 'general position' of the two lines) the components of g ◦ f satisfy the conditions from lemma 1 and hence are already linear.

The proof relies on the fact that when s is fully k-sparse as formulated in 3(i), it includes all the k-dimensional coordinate subspaces and hence intersections of k such subspaces, which give the n coordinate axes. They are transformed into n curves in the x-space, passing through the origin. By identification of these curves, we show that each nonlinearity is homogeneous and hence linear according to the previous section. The proof is omitted due to lack of space. Figure 1 gives an illustration of the proof in the case n = 3 and m = 2. It uses the following lemma (a generalization of the analytic case presented in [10]).

Lemma 1. Let a, b ∈ R \ {−1, 0, 1}, a > 0, and f : [0, ε) → R differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.

Theorem 3 shows that f and A are uniquely determined by x except for scaling and permutation ambiguities. Note that then obviously also s is identifiable by applying theorem 2 to the linearized mixtures y = f⁻¹(x) = As, given the additional assumptions on s from the theorem.

For brevity, the theorem assumes in (i) that im s is the whole union of the k-dimensional coordinate spaces — this condition can be relaxed (the proof is local in nature), but then the nonlinearities can only be found on intervals where the corresponding marginal densities of As are non-zero (however, in addition the proof needs that they are nonzero locally at 0). Furthermore, in practice the assumption about the image of s will have to be replaced by assuming the same with non-zero probability. Also note that almost any A ∈ R^(m×n) in the measure sense fulfills conditions (ii) and (iii).
sense fulfills the conditions (ii) and (iii).<br />

3 Algorithm for postnonlinear (over)complete SCA

The separation is done in a two-stage procedure: In the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to



the one used in the identifiability proof of theorem 3, see also figure 1. In the second stage, the mixing matrix A and then the sources s are reconstructed by applying the linear algorithms from [8], section 1, to the linearized mixtures f⁻¹(x). So in the following it is enough to reconstruct f.

3.1 Geometrical preprocessing

Let x(1), ..., x(T) ∈ R^m be i.i.d. samples of the random vector x. The goal of geometrical preprocessing is to construct vectors y(1), ..., y(T) and z(1), ..., z(T) ∈ R^m using clustering or interpolation on the samples x(t) such that f⁻¹(y(t)) and f⁻¹(z(t)) lie in two linearly independent lines of R^m. In figure 1 they are to span the two thick lines which already determine the postnonlinearities.

Algorithmically, y and z can be constructed in the case m = 2 by first choosing far away samples (on different 'non-opposite' curves) as initial starting points and then advancing towards the known data set center by always choosing the closest sample of x with smaller modulus. Such an algorithm can also be implemented for larger m, but only for sources with at most one non-zero coefficient at each time instant; it can however be generalized to sources of sparseness m − 1 using more elaborate clustering.
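For the case m = 2, the procedure just described can be sketched as follows; this is a heuristic illustration only, and the choice of the two starting samples (far out, on non-opposite curves) is left to the caller.

```python
import numpy as np

def trace_curve(X, start_idx):
    """Trace one curve of the mixture scatterplot towards the origin.

    Starting from a far-out sample, repeatedly pick, among the samples of
    smaller modulus, the one closest to the current point.  X is a (2, T)
    matrix of mixture samples; the returned indices are ordered from the
    outside towards the data-set centre.
    """
    norms = np.linalg.norm(X, axis=0)
    path = [start_idx]
    current = start_idx
    while True:
        closer = np.where(norms < norms[current])[0]
        if closer.size == 0:          # reached the centre of the data set
            break
        dists = np.linalg.norm(X[:, closer] - X[:, [current]], axis=0)
        current = int(closer[np.argmin(dists)])
        path.append(current)
    return np.array(path)

# e.g. y_idx = trace_curve(X, int(np.argmax(np.linalg.norm(X, axis=0))))
```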

3.2 Postnonlinearity estimation

Given the subspace vectors y(t) and z(t) from the previous section, the goal is to find C¹-diffeomorphisms gi : R → R such that g1 × ... × gm maps the vectors y(t) and z(t) onto two different linear subspaces.

In abuse of notation, we now assume that two curves (injective infinitely differentiable mappings) y, z : (−1, 1) → R^m are given with y(0) = z(0) = 0. These can for example be constructed from the discrete sample points y(t) and z(t) from the previous section by polynomial or spline interpolation. If the two curves are mapped onto lines by g1 × ... × gm (and if these are in sufficiently general position), then gi = λi fi⁻¹ for some λi ≠ 0 according to theorem 3. By requiring this condition only for the discrete sample points from the previous section, we get an approximation of the unmixing nonlinearities gi. Let i ≠ j be fixed. It is then easy to see that by projecting x, y and z onto the i-th and j-th coordinates, the problem of finding the nonlinearities can be reduced to the case m = 2, in which g2 is to be reconstructed; we will assume this in the following.

A is chosen to be mixing, so we can assume that the indices i, j were chosen such that the two lines f⁻¹ ◦ y, f⁻¹ ◦ z : (−1, 1) → R² do not coincide with the coordinate axes. Reparametrization (ȳ := y ◦ y1⁻¹) of the curves lets us further assume that y1 = z1 = id. Then after some algebraic manipulation, the condition that the separating nonlinearities g = g1 × g2 must map y and z onto lines can be written as g2 ◦ y2 = a g1 = (a/b) g2 ◦ z2 with constants a, b ∈ R \ {0}, a ≠ ±b.

So the goal of geometrical postnonlinearity detection is to find a C¹-diffeomorphism g on subsets of R with

g ◦ y = c g ◦ z    (3)



for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0. By theorem 3, g (and also c) are uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation (3), we get c = y′(0)/z′(0), so c can be directly calculated from the known curves y and z.

In the following section, we propose to solve this problem numerically, given samples y(t1), z(t1), ..., y(tT), z(tT) of the curves. Note that here it is assumed that the samples of the curves y and z are given at the same time instants ti ∈ (−1, 1). In practice, this is usually not the case, so values of z at the sample points of y and vice versa will first have to be estimated, for example by using spline interpolation.
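The resampling step and the estimate c = y′(0)/z′(0) can be sketched with standard spline interpolation; the toy curves below are ours and only serve to make the snippet self-contained.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# discrete curve samples (t_y, y) and (t_z, z), e.g. from the geometrical
# preprocessing; here two toy curves through the origin
t_y = np.linspace(-0.9, 0.9, 50)
y = np.tanh(2.0 * t_y)
t_z = np.linspace(-0.9, 0.9, 60)
z = 0.5 * t_z + 0.1 * t_z ** 3

# resample z at the sample points of y by spline interpolation
z_spline = CubicSpline(t_z, z)
z_at_y = z_spline(t_y)

# c follows from the derivatives of both curves at the origin (equation (3))
y_spline = CubicSpline(t_y, y)
c = y_spline.derivative()(0.0) / z_spline.derivative()(0.0)
print("estimated c =", c)
```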

3.3 MLP-based postnonlinearity approximation

We want to find an approximation g̃ (in some parametrization) of g with g̃(y(ti)) = c g̃(z(ti)) for i = 1, ..., T, so in the most general sense we want to find

g̃ = argmin_g E(g) := argmin_g (1/(2T)) Σ_{i=1}^T (g(y(ti)) − c g(z(ti)))²    (4)

In order to minimize this energy function E(g), a single-input single-output multilayered neural network (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximated g̃ can be written as

g̃(t) = w(2)ᵀ σ̄(w(1) t + b(1)) + b(2)

with weight vectors w(1), w(2) ∈ R^d and biases b(1) ∈ R^d, b(2) ∈ R. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e^(−t))^(−1), and we set σ̄ := σ × ... × σ, d times. The MLP weights are restricted in the sense that g̃(0) = 0 and g̃′(0) = 1. This implies b(2) = −w(2)ᵀ σ̄(b(1)) and Σ_{i=1}^d w(1)_i w(2)_i σ′(b(1)_i) = 1.

Especially the second normalization is very important for the learning step, otherwise the weights could all converge to the (valid) zero solution. So the outer bias is not trained by the network; we could fix a second weight in order to guarantee the second condition — this however would result in an unstable quotient calculation. Instead it is preferable to perform network training on a submanifold in the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation (4):

Ē(g̃) := (1/(2T)) Σ_{j=1}^T (g̃(y(tj)) − c g̃(z(tj)))² + λ (Σ_{i=1}^d w(1)_i w(2)_i σ′(b(1)_i) − 1)²    (5)

with suitably chosen λ > 0.

Learning of the weights is performed via backpropagation on this energy function. The gradient of Ē(g̃) with respect to the weight matrix can be easily



calculated from the Euclidean gradient of g. For the learn<strong>in</strong>g process, we further<br />

note that all weights w (j)<br />

i should be kept nonnegative <strong>in</strong> order to ensure<br />

<strong>in</strong>vertibility of ˜g.<br />

In order to <strong>in</strong>crease convergence speed, the Euclidean gradient of g should<br />

be replaced by the natural gradient [11], which <strong>in</strong> experiments enhances the<br />

algorithm performance <strong>in</strong> terms of speed by a factor of roughly 10.<br />

4 Experiment

The postnonlinear mixture of three sources to two mixtures is considered. 10^5 samples of artificially generated sources with one non-zero coefficient (drawn uniformly from [−0.5, 0.5]) are used. We refer to figure 2 for a plot of the sources, mixtures and recoveries. The sources were mixed using the postnonlinear mixing model x = f1 × f2(As) with mixing matrix

A = [ 4.3  7.8  0.59 ; 9  6.2  10 ]

and postnonlinearities f1(x) = tanh(x) + 0.1x and f2(x) = x. For easier algorithm visualization and evaluation we chose f2 to be linear and did not add any noise.

The MLP-based postnonlinearity detection algorithm from section 3.3 with natural gradient-descent learning, 9 hidden neurons, a learning rate of η = 0.01 and 10^5 iterations gives a good approximation of the unmixing nonlinearities gi. Linear overcomplete SCA is then applied to g1 × g2(x): for practical reasons (due to approximation errors, the data is not fully linearized), instead of the matrix recovery algorithm from [8] we use a modification of the geometric ICA algorithm [4], which is known to work well in the very sparse one-dimensional case, to get the recovered mixing matrix

Â = [ −0.46  −0.81  −0.069 ; −0.89  −0.58  −1.0 ],

which except for scaling and permutation coincides well with A. Source recovery then yields the estimated sources; the (normalized) signal-to-noise ratios (SNRs) of these with the original sources are high at 26, 71 and 46 dB, respectively.
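For reference, the data generation of this experiment can be reproduced with a few lines (the mixing matrix and postnonlinearities are the ones stated above; the random seed is of course arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10**5

# 1-sparse sources: exactly one active component per sample, uniform in [-0.5, 0.5]
S = np.zeros((3, T))
active = rng.integers(0, 3, T)
S[active, np.arange(T)] = rng.uniform(-0.5, 0.5, T)

A = np.array([[4.3, 7.8, 0.59],
              [9.0, 6.2, 10.0]])
f1 = lambda u: np.tanh(u) + 0.1 * u
f2 = lambda u: u

Z = A @ S                              # linear overcomplete mixtures
X = np.vstack([f1(Z[0]), f2(Z[1])])    # postnonlinear mixtures x = f(As)
```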

References

1. Cichocki, A., Amari, S.: Adaptive blind signal and image processing. John Wiley & Sons (2002)
2. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis. John Wiley & Sons (2001)
3. Lee, T., Lewicki, M., Girolami, M., Sejnowski, T.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters 6 (1999) 87–90
4. Theis, F., Lang, E., Puntonet, C.: A geometric algorithm for overcomplete linear ICA. Neurocomputing 56 (2004) 381–398
5. Zibulevsky, M., Pearlmutter, B.: Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 13 (2001) 863–882
6. Eriksson, J., Koivunen, V.: Identifiability and separability of linear ICA models revisited. In: Proc. of ICA 2003. (2003) 23–27
revisited. In: Proc. of ICA 2003. (2003) 23–27



Fig. 2. Example: (a) shows the 1-sparse source signals, and (b) the postnonlinear overcomplete mixtures. The original source directions can be clearly seen in the structure of the mixture scatterplot (c). The crosses and stars indicate the found interpolation points used for approximating the separating nonlinearities, generated by geometrical preprocessing. Now, according to theorem 3, the sources can be recovered uniquely, figure (d), except for permutation and scaling.

7. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 (1998) 33–61
8. Georgiev, P., Theis, F., Cichocki, A.: Blind source separation and sparse component analysis of overcomplete mixtures. In: Proc. of ICASSP 2004, Montreal, Canada (2004)
9. Taleb, A., Jutten, C.: Indeterminacy and identifiability of blind identification. IEEE Transactions on Signal Processing 47 (1999) 2807–2820
10. Babaie-Zadeh, M., Jutten, C., Nayebi, K.: A geometric approach for separating post non-linear mixtures. In: Proc. of EUSIPCO '02. Volume II., Toulouse, France (2002) 11–14
11. Amari, S., Park, H., Fukumizu, K.: Adaptive method of realizing gradient learning for multilayer perceptrons. Neural Computation 12 (2000) 1399–1409




Chapter 13

Proc. EUSIPCO 2005

Paper F.J. Theis, K. Stadlthanner, and T. Tanaka. First results on uniqueness of sparse non-negative matrix factorization. In Proc. EUSIPCO 2005, Antalya, Turkey, 2005

Reference (Theis et al., 2005c)

Summary in section 1.4.2


FIRST RESULTS ON UNIQUENESS OF SPARSE NON-NEGATIVE MATRIX FACTORIZATION

Fabian J. Theis, Kurt Stadlthanner and Toshihisa Tanaka*

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
phone: +49 941 943 2924, fax: +49 941 943 2479, email: fabian@theis.name

* Department of Electrical and Electronic Engineering, Tokyo University of Agriculture and Technology (TUAT), 2-24-16 Nakacho, Koganei-shi, Tokyo 184-8588, Japan, and ABSP Laboratory, BSI, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan

ABSTRACT

Sparse non-negative matrix factorization (sNMF) allows for the decomposition of a given data set into a mixing matrix and a feature data set, which are both non-negative and fulfill certain sparsity conditions. In this paper it is shown that the employed projection step proposed by Hoyer has a unique solution, and that it indeed finds this solution. Then indeterminacies of the sNMF model are identified and first uniqueness results are presented, both theoretically and experimentally.

1. INTRODUCTION

Non-negative matrix factorization (NMF) describes a promising new technique for decomposing non-negative data sets into a product of two smaller matrices, thus capturing the underlying structure [3]. In applications it turns out that additional constraints like, for example, sparsity enhance the recoveries; one promising variant of such a sparse NMF algorithm has recently been proposed by Hoyer [2]. It consists of the common NMF update steps, but at each step a sparsity constraint is imposed. If factorization algorithms are to produce reliable results, their indeterminacies have to be known and uniqueness (except for the indeterminacies) has to be shown — so far only restricted and quite disappointing results for NMF [1] and none for sNMF are known.

In this paper we first present a novel uniqueness result showing that the projection step of sparse NMF always possesses a unique solution (except for a set of measure zero), theorems 2.2 and 2.6. We then prove that Hoyer's algorithm indeed detects these solutions, theorem 2.8. In section 3, after shortly repeating Hoyer's sNMF algorithm, we analyze its indeterminacies and show uniqueness in some restricted cases, theorem 3.3. The result is both new and astonishing, because the set of indeterminacies is much smaller than the one of NMF, namely of measure zero.

2. SPARSE PROJECTION

The sparse NMF algorithm enforces sparseness by using a projection step as follows: Given x ∈ R^n and fixed λ1, λ2 > 0, find s such that

s = argmin_{‖s‖1=λ1, ‖s‖2=λ2, s≥0} ‖x − s‖2    (1)

Here ‖s‖p := (Σ_{i=1}^n |si|^p)^(1/p) denotes the p-norm; in the following we often omit the index in the case p = 2. Furthermore s ≥ 0 is defined as si ≥ 0 for all i = 1, ..., n, so s is to be non-negative. Our goal is to show that such a projection always exists and is unique for almost all x. This problem can be generalized by replacing the 1-norm by an arbitrary p-norm; however, the (Euclidean) 2-norm has to be used, as can be seen in the proof later. Other possible generalizations include projections in infinite-dimensional Hilbert spaces.

First note that the two norms are equivalent, i.e. induce the same topology; indeed ‖s‖2 ≤ ‖s‖1 ≤ √n ‖s‖2 for all s ∈ R^n, as can easily be shown. So a necessary condition for any s to satisfy equation (1) is λ2 ≤ λ1 ≤ √n λ2.

We want to solve problem (1) by projecting x onto

M := {s | ‖s‖1 = λ1} ∩ {s | ‖s‖2 = λ2} ∩ {s ≥ 0}    (2)

In order to solve equation (1), x has to be projected onto a point adjacent to it in M:

Definition 2.1. A point p ∈ M ⊂ R^n is called adjacent to x ∈ R^n in M, in symbols p ⊳_M x or shorter p ⊳ x, if ‖x − p‖2 ≤ ‖x − q‖2 for all q ∈ M.

In the following we will study in which cases this is possible, and which conditions are needed to guarantee that this projection is even unique.
even unique.<br />

2.1 Existence<br />

Assume that x lies <strong>in</strong> the closure of M, but not <strong>in</strong> M. Obviously<br />

there exists no p⊳x as x ‘touches’ M without be<strong>in</strong>g an element of<br />

it. In order to avoid these exceptions, it is enough to assume that M<br />

is closed:<br />

Theorem 2.2 (Existence). If M is closed and nonempty, then for<br />

every x ∈ R n there exists a p ∈ M with p⊳x.<br />

Proof. Let x ∈ R n be fixed. Without loss of generality (by tak<strong>in</strong>g<br />

<strong>in</strong>tersections with a large enough ball) we can assume that M is<br />

compact. Then f : M → R,p ↦→ �x −p� is cont<strong>in</strong>uous and has<br />

therefore a m<strong>in</strong>imum p0, so p0 ⊳x.<br />

2.2 Uniqueness

Definition 2.3. Let X(M) := {x ∈ R^n | there exists more than one point adjacent to x in M} = {x ∈ R^n | #{p ∈ M | p ⊳ x} > 1} denote the exception set of M.

In other words, the exception set contains the set of points from which we cannot uniquely project. Our goal is to show that this set vanishes or is at least very small. Figure 1 shows the exception sets of two different sets.

Note that if x ∈ M then x ⊳ x, and x is the only point with that property. So M ∩ X(M) = ∅. Obviously the exception set of an affine linear hyperspace is empty. Indeed, we can prove more generally:

Lemma 2.4. Let M ⊂ R^n be convex. Then X(M) = ∅.

For the proof we need the following simple lemma, which only works for the 2-norm as it uses the scalar product.

Lemma 2.5. Let a, b ∈ R^n such that ‖a + b‖2 = ‖a‖2 + ‖b‖2. Then a and b are collinear.

Proof. By taking squares we get ‖a + b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖², so

‖a‖² + 2⟨a, b⟩ + ‖b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖²


Figure 1: Two examples of exception sets: (a) the exception set of two points, (b) the exception set of a sector.

if ⟨·,·⟩ denotes the (symmetric) scalar product. Hence ⟨a, b⟩ = ‖a‖‖b‖, and a and b are collinear according to the Schwarz inequality.

Proof of lemma 2.4. Assume X(M) ≠ ∅. Then let x ∈ X(M) and p1 ≠ p2 ∈ M such that pi ⊳ x. By assumption q := ½(p1 + p2) ∈ M. But

‖x − p1‖ ≤ ‖x − q‖ ≤ ½‖x − p1‖ + ½‖x − p2‖ = ‖x − p1‖

because both pi are adjacent to x. Therefore ‖x − q‖ = ‖½(x − p1)‖ + ‖½(x − p2)‖, and application of lemma 2.5 shows that x − p1 = α(x − p2). Taking norms (and using the fact that q ≠ x) shows that α = 1 and hence p1 = p2, which is a contradiction.

In a similar manner, it is easy to show for example that the exception set of the sphere consists only of its center, or to calculate the exception sets of the sets M from figure 1. Another property of the exception set is that it behaves nicely under non-degenerate affine linear transformations.

Hence in general, we cannot expect X(M) to vanish altogether. However, we can show that in practical applications we can easily neglect it:

Theorem 2.6 (Uniqueness). vol(X(M)) = 0.

This means that the Lebesgue measure of the exception set is zero; in particular it does not contain any open ball. In other words, if x is drawn from a continuous probability distribution on ℝ^n, then x ∈ X(M) with probability 0. We simplify the proof by introducing the following lemma:

Lemma 2.7. Let x ∈ X(M) with p ⊳ x, p′ ⊳ x and p ≠ p′. Assume y lies on the line between x and p. Then y ∉ X(M).

Proof. So y = αx + (1 − α)p with α ∈ (0, 1). Note that then also p ⊳ y; otherwise we would have another q ⊳ y with ‖q − y‖ < ‖p − y‖. But then ‖q − x‖ ≤ ‖q − y‖ + ‖y − x‖ < ‖p − y‖ + ‖y − x‖ = ‖p − x‖, which contradicts the assumption.

Now assume that y ∈ X(M). Then there exists p″ ⊳ y with p″ ≠ p. But ‖p″ − x‖ ≤ ‖p″ − y‖ + ‖y − x‖ = ‖p − y‖ + ‖y − x‖ = ‖p − x‖. Then p ⊳ x induces ‖p″ − x‖ = ‖p − x‖. So

‖p″ − x‖ = ‖p″ − y‖ + ‖y − x‖.

Application of lemma 2.5 then yields p″ − y = α(y − x), and hence p″ − y = β(p − y). Taking norms (and using p ⊳ x) shows that β = 1 and hence p = p″, which is a contradiction.

Proof of theorem 2.6. Assume there exists an open set U ⊂ X(M), and let x ∈ U. Then choose p ≠ p′ ∈ M with p ⊳ x, p′ ⊳ x. But

{αx + (1 − α)p | α ∈ (0, 1)} ∩ U ≠ ∅,

which contradicts lemma 2.7.

2.3 Algorithm<br />

From here on, let M be defined by equation (2). In [2], Hoyer proposes algorithm 1 to project a given vector x onto p ∈ M such that p ⊳ x (we added a slight simplification by not setting all negative values of s to zero but only a single one in each step). The algorithm iteratively detects p by first satisfying the 1-norm condition (line 1) and then the 2-norm condition (line 3). The algorithm terminates if the constructed vector is already non-negative; otherwise a negative coordinate is selected, set to zero (line 4), and the search is continued in ℝ^(n−1).

Algorithm 1: Sparse projection
Input: vector x ∈ ℝ^n, norm conditions λ_1 and λ_2
Output: closest non-negative s with ‖s‖_i = λ_i
1 Set r ← x + ((λ_1 − ‖x‖_1)/n) e with e = (1, . . . , 1)^T ∈ ℝ^n.
2 Set m ← (λ_1/n) e.
3 Set s ← m + α(r − m) with α > 0 such that ‖s‖_2 = λ_2.
if there exists j with s_j < 0 then
4   Fix s_j ← 0.
5   Remove the j-th coordinate of x.
6   Decrease the dimension n ← n − 1.
7   goto 1.
end
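For illustration, a minimal NumPy sketch of algorithm 1 follows; the function name sparse_projection and the parameter names are ours rather than from the paper, and a non-negative input x together with compatible norm conditions (λ_1/√n ≤ λ_2 ≤ λ_1) is assumed.

```python
import numpy as np

def sparse_projection(x, lam1, lam2):
    """Sketch of algorithm 1: project x onto {s >= 0 : ||s||_1 = lam1, ||s||_2 = lam2}.

    Assumes x >= 0 (so x.sum() equals its 1-norm) and feasible lam1, lam2."""
    x = np.asarray(x, dtype=float).copy()
    s_full = np.zeros(x.size)
    active = np.arange(x.size)            # coordinates not yet fixed to zero
    for _ in range(x.size):               # at most n - 1 dimension reductions
        n = x.size
        r = x + (lam1 - x.sum()) / n      # line 1: enforce the 1-norm condition
        m = np.full(n, lam1 / n)          # line 2: centre of the hyperplane sum = lam1
        d = r - m
        # line 3: choose alpha > 0 with ||m + alpha*d||_2 = lam2 (positive root)
        a, b, c = d @ d, 2 * m @ d, m @ m - lam2 ** 2
        alpha = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
        s = m + alpha * d
        if (s >= -1e-12).all():           # already non-negative: done
            s_full[active] = np.clip(s, 0.0, None)
            return s_full
        j = int(np.argmin(s))             # lines 4-7: fix one negative coordinate to zero
        active = np.delete(active, j)     # and restart in dimension n - 1
        x = np.delete(x, j)
    return s_full
```

Note that the restart keeps λ_1 and λ_2 unchanged, since the coordinate fixed to zero contributes to neither norm.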

The projection algorithm terminates after at most n − 1 iterations. However, it is not obvious that it indeed detects p. In the following we will prove this given that x ∉ X(M); of course we have to exclude non-uniqueness points. The idea of the proof is to show that in each step the new estimate has p as its closest point in M.

Theorem 2.8 (Sparse projection). Given x ≥ 0 such that x ∉ X(M). Let p ∈ M with p ⊳ x. Furthermore assume that r and s are constructed by lines 1 and 3 of algorithm 1. Then:

(i) ∑ r_i = λ_1, p ⊳ r and r ∉ X(M).
(ii) ∑ s_i = λ_1, ‖s‖_2 = λ_2, p ⊳ s and s ∉ X(M).
(iii) If s_j < 0 then p_j = 0.
(iv) Define u := s but set u_j = 0. Then p ⊳ u and u ∉ X(M).

This theorem shows that if s ≥ 0 then already s ∈ M and p ⊳ s by (ii), so s = p. If s_j < 0 then it is enough to set s_j := 0 (because p_j = 0 by (iii)) and continue the search in one dimension lower (iv).

Proof. Let H := {x ∈ ℝ^n | ∑ x_i = λ_1} denote the affine hyperplane given by the 1-norm. Note that M ⊂ H.

(i) By construction r ∈ H. Furthermore e ⊥ H, so r is the orthogonal projection of x onto H. Let q ∈ M be arbitrary. We then get ‖q − x‖^2 = ‖q − r‖^2 + ‖r − x‖^2. By definition ‖p − x‖ ≤ ‖q − x‖, so ‖p − r‖^2 = ‖p − x‖^2 − ‖r − x‖^2 ≤ ‖q − x‖^2 − ‖r − x‖^2 = ‖q − r‖^2, and therefore p ⊳ r. Furthermore r ∉ X(M), because if q ∈ ℝ^n with q ⊳ r, then ‖q − r‖ = ‖p − r‖. Then by the above also ‖q − x‖ = ‖p − x‖, hence q = p (because x ∉ X(M)).

(ii) First note that s is a linear combination of m and r, and both lie in H, so also s ∈ H, i.e. ∑ s_i = λ_1. Furthermore by construction ‖s‖ = λ_2. Now let q ∈ M. For p ⊳ s to hold, we have to show that ‖p − s‖ ≤ ‖q − s‖. This follows (see (i)) if we can show

‖q − r‖^2 = ‖s − r‖^2 + (1/α_0)‖q − s‖^2. (3)

We can prove this equation as follows: by definition λ_2^2 = ‖q − m‖^2 = ‖q − s‖^2 + ‖s − m‖^2 + 2⟨q − s, s − m⟩, hence ‖q − s‖^2 = −2⟨q − s, s − m⟩ = −2 (α_0/(α_0 − 1)) ⟨q − s, s − r⟩, where we have used s − m = α_0(r − m), i.e. m = (s − α_0 r)/(1 − α_0), so s − m = (α_0/(α_0 − 1))(s − r).



Using the above, we can now calculate

‖q − r‖^2 = ‖q − s‖^2 + ‖s − r‖^2 + 2⟨q − s, s − r⟩
          = ‖q − s‖^2 + ‖s − r‖^2 + ((1 − α_0)/α_0)‖q − s‖^2
          = ‖s − r‖^2 + (1/α_0)‖q − s‖^2.

Similarly, from formula (3), we get s ∉ X(M), because if there exists q ∈ ℝ^n with ‖q − s‖ = ‖p − s‖, then also ‖q − r‖ = ‖p − r‖, hence q = p.

(iii) Assume s_j < 0. First note that m does not lie on the line βs + (1 − β)p (in other words m ≠ (p + s)/2), because otherwise due to symmetry there would be at least two points in M closest to s, but s ∉ X(M). Now assume the claim is wrong; then p_j > 0 (because p ≥ 0). Define g : [0, 1] → H by g(β) := m + α_β(βs + (1 − β)p − m), where α_β > 0 has been chosen such that ‖g(β)‖ = λ_2. The curve g describes the shortest arc in H ∩ {‖q‖ = λ_2} connecting p to s. We notice that p_j > 0, s_j < 0 and g is continuous. Hence determine the (unique) β_0 such that q := g(β_0) has the property q_j = 0. By construction q ∈ M, but q lies closer to s than p (because ‖g(β) − r‖^2 = 2⟨g(β) − m, m − r⟩ + 2λ_2^2 is decreasing with increasing β). But this is a contradiction to p ⊳ s.

(iv) The vector u is defined by u_i = s_i if i ≠ j and u_j = 0, i.e. u is the orthogonal projection of s onto the coordinate hyperplane given by x_j = 0. So we calculate ‖p − s‖^2 = ‖p − u‖^2 + ‖u − s‖^2, and the claim follows directly as in (i).

3. MATRIX FACTORIZATION<br />

Matrix factorization models have already been used successfully in many applications when it comes to finding suitable data representations. Basically, a given m × T data matrix X is factorized into an m × n matrix W and an n × T matrix H,

X = WH, (4)

where m ≤ n.

3.1 Sparse non-negative matrix factorization<br />

In contrast to other matrix factorization models such as principal or independent component analysis, non-negative matrix factorization (NMF) strictly requires both matrices W and H to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [3].

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [2] proposed a modification of the NMF model to include sparseness: he minimizes the deviation from (4) under the constraint of fixed sparseness of both W and H. Here, using a ratio of 1- and 2-norms of x ∈ ℝ^n \ {0}, the sparseness is measured by σ(x) := (√n − ‖x‖_1/‖x‖_2)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
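As a small illustration of this measure (the function name is ours, not from the paper):

```python
import numpy as np

def hoyer_sparseness(x):
    """sigma(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1), in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

print(hoyer_sparseness([0, 0, 0, 1]))   # 1.0: n - 1 entries are zero
print(hoyer_sparseness([1, 1, 1, 1]))   # 0.0: all entries have equal magnitude
```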

Formally, sparse NMF (sNMF) [2] can be defined as the task of finding

X, W, H ≥ 0, X = WH subject to σ(W_{*i}) = σ_W, σ(H_{i*}) = σ_H. (5)

Here σ_W, σ_H ∈ [0, 1] denote fixed constants describing the sparseness of the columns of W and the rows of H, respectively. Usually, the linear model in NMF is assumed to hold only approximately, hence the above formulation of sNMF represents the limit case of perfect factorization. sNMF is summarized by algorithm 2, which applies algorithm 1 separately to each column and row, respectively, for the sparse projection.

Algorithm 2: Sparse non-negative matrix factorization
Input: observation data matrix X
Output: decomposition WH of X fulfilling given sparseness constraints σ_H and σ_W
1 Initialize W and H to random non-negative matrices.
2 Project the rows of H and the columns of W such that they meet the sparseness constraints σ_H and σ_W, respectively.
repeat
3   Set H ← H − µ_H W^T (WH − X).
4   Project the rows of H such that they meet the sparseness constraint σ_H.
5   Set W ← W − µ_W (WH − X) H^T.
6   Project the columns of W such that they meet the sparseness constraint σ_W.
until convergence;
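A minimal NumPy sketch of one pass of algorithm 2 may clarify the interplay of gradient steps and projections. It reuses the sparse_projection and sparseness sketches above; the helper project_to_sparseness, which converts a sparseness target into a 1-norm target at the current 2-norm, is our own construction and not taken from the paper.

```python
import numpy as np

def project_to_sparseness(v, sigma):
    """Closest non-negative vector with Hoyer sparseness sigma and the 2-norm of v
    (uses sparse_projection from the sketch above)."""
    n = v.size
    lam2 = np.linalg.norm(v)
    lam1 = lam2 * (np.sqrt(n) - sigma * (np.sqrt(n) - 1))
    return sparse_projection(v, lam1, lam2)

def snmf_step(X, W, H, sigma_W, sigma_H, mu_W=1e-3, mu_H=1e-3):
    """One pass of lines 3-6 of algorithm 2 (projected gradient descent sketch)."""
    H = H - mu_H * W.T @ (W @ H - X)                                 # line 3
    H = np.apply_along_axis(project_to_sparseness, 1, H, sigma_H)    # line 4: rows of H
    W = W - mu_W * (W @ H - X) @ H.T                                 # line 5
    W = np.apply_along_axis(project_to_sparseness, 0, W, sigma_W)    # line 6: columns of W
    return W, H
```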

3.2 Indeterm<strong>in</strong>acies<br />

Obvious indeterminacies of model (5) are permutation and positive scaling of the columns of W (and correspondingly of the rows of H), because if P denotes a permutation matrix and L a positive scaling matrix, then X = WH = (WP^{-1}L^{-1})(LPH), and the conditions of positivity and sparseness are invariant under scaling by a positive number. Another, maybe not as obvious, indeterminacy comes into play due to the sparseness assumption.

Definition 3.1. The n × T matrix H is said to be degenerate if there exist v ∈ ℝ^n, v > 0, and λ_t ≥ 0 such that H_{*t} = λ_t v for all t.

Note that in this case all rows h_i^T := H_{i*} of H have the same sparseness σ(h_i) = (√n − ‖λ‖_1/‖λ‖_2)/(√n − 1) independent of v, where λ := (λ_1, . . . , λ_T)^T. Furthermore, if W is any matrix with positive entries, then Wv > 0 and WH_{*t} = λ_t(Wv), so the signals H and their transformations WH have rows of equal sparseness.

Hence if the sources are degenerate we get an indeterminacy of sNMF: let W, W̃ be non-negative such that W̃^{-1}Wv > 0 (for example W > 0 arbitrary and W̃ := I), and let H be degenerate. Then H̃ := W̃^{-1}WH is of the same sparseness as H and WH = W̃H̃, but the mixing matrices W and W̃ do not coincide up to permutation and scaling.

3.3 Uniqueness<br />

In this section we will discuss the uniqueness of sNMF solutions, i.e. we will formulate conditions under which the set of solutions is satisfactorily small. We will see that in the perfect factorization case, it is enough to put the sparseness condition either onto W or onto H; we chose H in the following to match the picture of sources with a given sparseness.

Assume that two solutions (W, H) and (W̃, H̃) of the sNMF model (5) are given with W and W̃ of full rank; then

WH = W̃H̃, (6)

and σ(H) = σ(H̃). As before let h_i = H_{i*}^T and h̃_i = H̃_{i*}^T denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume

‖h_i‖_2 = ‖h̃_i‖_2 = 1 (7)

for all i. Hence, the sparseness of the rows is already fully determined by their 1-norms, and

‖h_i‖_1 = ‖h̃_i‖_1. (8)

We can then show the following lemma (even without positive mixing matrices).



Lemma 3.2. Let W, W̃ ∈ ℝ^{m×n} and H, H̃ ∈ ℝ^{n×T}, H, H̃ ≥ 0, such that equations (6–8) hold. Then for all i ∈ {1, . . . , m}

(i) ∑_j w_{ij} = ∑_j w̃_{ij}
(ii) ∑_j ...




Chapter 14<br />

Proc. ICASSP 2006<br />

Paper F.J. Theis and T. Tanaka. Sparseness by iterative projections onto spheres.<br />

In Proc. ICASSP 2006, Toulouse, France, 2006<br />

Reference (Theis and Tanaka, 2006)<br />

Summary <strong>in</strong> section 1.4.2<br />

191



SPARSENESS BY ITERATIVE PROJECTIONS ONTO SPHERES<br />

Fabian J. Theis ∗<br />

Inst. of Biophysics, Univ. of Regensburg<br />

93040 Regensburg, Germany<br />

ABSTRACT

Many interesting signals share the property of being sparsely active. The search for such sparse components within a data set commonly involves a linear or nonlinear projection step in order to fulfill the sparseness constraints. In addition to the proximity measure used for the projection, the result of course is also intimately connected with the actual definition of the sparseness criterion. In this work, we introduce a novel sparseness measure and apply it to the problem of finding a sparse projection of a given signal. Here, sparseness is defined as the fixed ratio of p- over 2-norm, and existence and uniqueness of the projection holds. This framework extends previous work by Hoyer in the case of p = 1, where it is easy to give a deterministic, more or less closed-form solution. This is not possible for p ≠ 1, so we introduce an algorithm based on alternating projections onto spheres (POSH), which is similar to the projection onto convex sets (POCS). Although the assumption of convexity does not hold in our setting, we observe not only convergence of the algorithm, but also convergence to the correct minimal-distance solution. Indications for a proof of this surprising property are given. Simulations confirm these results.

1. INTRODUCTION<br />

Sparseness is an important property of many natural signals, and various definitions exist. Intuitively, a signal x ∈ ℝ^n increases in sparseness with the increasing number of zeros; this is often measured by the 0-(pseudo-)norm ‖x‖_0 := |{i | x_i ≠ 0}|, counting the number of non-zero entries of x. It is a pseudo-norm because ‖αx‖_0 = |α|‖x‖_0 does not necessarily hold; indeed ‖αx‖_0 = ‖x‖_0 if α ≠ 0, so ‖·‖_0 is scale-invariant.

A typical problem in the field is the search for sparse instances or representations of a data set. Using the above 0-pseudo-norm as sparseness measure quickly turns out to be both theoretically and algorithmically unfeasible. The former simply follows because ‖·‖_0 is discrete, so the indeterminacies of the problem can be expected to be very high, and the latter because optimization of such a discrete function is a combinatorial problem and indeed turns out to be NP-complete. Hence, this sparseness measure is commonly approximated by some continuous measure, e.g. by replacing it by the p-norm ‖x‖_p := (∑_{i=1}^n |x_i|^p)^{1/p} for p ∈ ℝ_+. As lim_{p→0+} ‖x‖_p^p = ‖x‖_0, this can be interpreted as a possible approximation. This together with extensions to noisy situations can be used for measuring sparseness, and the connection with ‖·‖_0, especially in the case of p = 1, has been intensively studied [1].

Often, we are not <strong>in</strong>terested <strong>in</strong> the scale of the signals, so ideally<br />

the sparseness measure should be <strong>in</strong>dependent of the scale — which<br />

∗ Partial f<strong>in</strong>ancial support by the JSPS (PE 05543) and the DFG (GRK<br />

638) is acknowledged.<br />

Toshihisa Tanaka ∗<br />

Dept. EEE, Tokyo Univ. of Agri. and Tech.<br />

Tokyo 184-8588, Japan<br />

is the case for the 0-pseudo-norm, but not for the p-norms. In order to guarantee scaling invariance, some normalization has to be applied in the latter case, and a possible solution is the measure

σ_p(x) := ‖x‖_p / ‖x‖_2 (1)

for x ∈ ℝ^n \ {0} and p > 0. Then σ_p(αx) = σ_p(x) for α ≠ 0; moreover the sparser x, the smaller σ_p(x). Indeed, it can still be interpreted as an approximation of the 0-pseudo-norm in the sense that it is scale-invariant and that lim_{p→0+} σ_p(x)^p = ‖x‖_0. Altogether we infer that by minimizing σ_p(x) under some constraint, we can find a sparse representation of x. Hoyer [2] noticed this in the important case of p = 1; he defined a normalized sparseness measure by (√n − σ_1(x))/(√n − 1), which lies in [0, 1] and is maximal if x contains n − 1 zeros and minimal if the absolute values of all coefficients of x coincide.

Little attention has been paid to finding projections in the case of p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖·‖_0. Hence, the goal of this manuscript is to explore the general notion of sparseness in the sense of equation (1) and to construct algorithms to project a vector onto its closest vector of a given sparseness.

2. EUCLIDEAN PROJECTION

Let M ⊂ ℝ^n be an arbitrary, non-empty set. A vector y ∈ M is called Euclidean projection of x ∈ ℝ^n in M, in symbols y ⊳_M x, if ‖x − y‖_2 ≤ ‖x − z‖_2 for all z ∈ M.

2.1. Existence and uniqueness

We review conditions [3] for existence and uniqueness of the Euclidean projection. For this, we need the following notion: let X(M) := {x ∈ ℝ^n | there exists more than one point adjacent to x in M} = {x ∈ ℝ^n | #{y ∈ M | y ⊳_M x} > 1} denote the exception set of M.

Theorem 2.1 (Euclidean projection).
i. If M is closed and nonempty, then the Euclidean projection onto M exists, that is, for every x ∈ ℝ^n there exists a y ∈ M with y ⊳_M x.
ii. The Euclidean projection onto M is unique from almost all points in ℝ^n, i.e. vol(X(M)) = 0.

Proof. See [3], theorems 2.2 and 2.6.

So we can always project a vector x ∈ ℝ^n onto a closed set M, and this projection will be unique almost everywhere. In this case, we denote the projection by π_M(x), or π(x) for short. Indeed, in the case of the p-spheres S_p^{n−1}, the exception set consists of a single point, X(S_p^{n−1}) = {0} if p ≥ 2, hence π_{S_p^{n−1}} is well-defined on ℝ^n \ {0}. If p < 2, ...



[Fig. 1. Constrained gradient t on the p-sphere, given by the projection of the unconstrained gradient ∇‖· − x‖_2^2 onto the tangent space that is orthogonal to ∇‖·‖_p^p; see equation (6).]

2.2. Projection onto a p-sphere

Let S_p^{n−1} := {x ∈ ℝ^n | ‖x‖_p = 1} denote the (n−1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS_p^{n−1} := {x ∈ ℝ^n | ‖x‖_p = c}. The spheres are smooth C^1-submanifolds of ℝ^n for p ≥ 2; for p < 2 they are still well-defined for all p > 0, but have singular points at the coordinate hyperplanes.

Now, in the case p = 2, the projection is simply given by

π_{S_2^{n−1}}(x) = x/‖x‖_2. (2)

In the case p = 1, the sphere consists of a union of hyperplanes orthogonal to (±1, . . . , ±1). Considering only the first quadrant (i.e. x_i > 0), this means that π_{S_1^{n−1}}(x) is given by the projection onto the hyperplane H := {x ∈ ℝ^n | ⟨x, e⟩ = n^{−1/2}}, followed by setting resulting negative coordinates to 0; here e := n^{−1/2}(1, . . . , 1). So with x_+ := x if x ≥ 0 and 0 otherwise, we get

π_{S_1^{n−1}}(x) = ( x + (n^{−1/2} − ⟨x, e⟩)e )_+. (3)
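The two closed-form projections (2) and (3) translate directly into NumPy; this sketch (function names ours, not from the paper) assumes x in the first quadrant for the 1-sphere case, as in the text.

```python
import numpy as np

def project_sphere_2(x):
    """Equation (2): Euclidean projection onto the unit 2-sphere."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def project_sphere_1(x):
    """Equation (3): project onto the hyperplane <x, e> = n**-0.5 and clip negatives,
    valid for x in the first quadrant."""
    x = np.asarray(x, dtype=float)
    n = x.size
    e = np.full(n, n ** -0.5)
    return np.clip(x + (n ** -0.5 - x @ e) * e, 0.0, None)
```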

In the case of arbitrary p > 0, the projection is given by the unique solution of

π_{S_p^{n−1}}(x) = argmin_{y ∈ S_p^{n−1}} ‖x − y‖_2^2. (4)

Unfortunately, no closed-form solution exists in the general case, so we have to determine the solution numerically. We have experimented with a) explicit Lagrange multiplier calculation and minimization, b) constrained gradient descent and c) a constrained fixed-point algorithm (best). Ignoring the singular points at the coordinate hyperplanes, let us first assume that all x_i > 0. Then at a regular solution y of equation (4), the gradient of the function to be minimized is parallel to the gradient of the constraint, i.e. y − x = λ ∇(‖·‖_p^p)(y) for some Lagrange multiplier λ ∈ ℝ, which can be calculated from the additional constraint equation ‖y‖_p^p = 1. Using the notation y^{⊙p} := (y_1^p, . . . , y_n^p)^T for the componentwise exponentiation, we therefore get

y − x = λ p y^{⊙(p−1)} and ∑_i y_i^p = 1. (5)

Algorithm 1: Projection onto S_p^{n−1} by constrained gradient descent. Commonly, the iteration is stopped after the update stepsize lies below some given threshold.
Input: vector x ∈ ℝ^n, learning rate η(i)
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1 Initialize y ∈ S_p^{n−1} randomly.
for i ← 1, 2, . . . do
2   df ← y − x, dg ← p sgn(y)|y|^{⊙(p−1)}
3   t ← df − (df^T dg) dg/(dg^T dg)
4   y ← y − η(i) t
5   y ← y/‖y‖_p
end

For p ∉ {1, 2}, these equations cannot be solved in closed form, hence we propose an alternative approach to solving the constrained minimization (4). The goal is to minimize f(y) := ‖y − x‖_2^2 under the constraint g(y) := ‖y‖_p^p = 1. This can for example be achieved by gradient-descent methods, taking into account that the gradient has to be calculated on the submanifold given by the S_p^{n−1}-constraint, see figure 1 for an illustration. The projection of the gradient ∇f onto the tangent space of S_p^{n−1} at y can easily be calculated as

t = ∇f − ⟨∇f, ∇g⟩ ∇g/‖∇g‖_2^2. (6)

Here, the explicit gradients are given by ∇f(y) = y − x and ∇g(y) = p sgn(y)|y|^{⊙(p−1)}, where sgn(y) denotes the vector of the componentwise signs of y, and |y| := sgn(y) y the componentwise absolute value. The projection algorithm is summarized in algorithm 1. Iteratively, after calculating the constrained gradient (lines 2 and 3), it performs a gradient-descent update step (line 4) followed by a projection onto S_p^{n−1} (line 5) to guarantee that the algorithm stays on the submanifold.
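A compact NumPy transcription of algorithm 1 (our function and parameter names; a fixed learning rate and a random start away from the coordinate hyperplanes are assumed):

```python
import numpy as np

def project_p_sphere_gd(x, p, eta=0.1, n_iter=2000, tol=1e-8, seed=0):
    """Constrained gradient descent for argmin_{||y||_p = 1} ||y - x||_2^2 (algorithm 1)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    y = rng.random(x.size) + 0.1                    # line 1: random positive start,
    y /= np.linalg.norm(y, p)                       # normalized onto the p-sphere
    for _ in range(n_iter):
        df = y - x                                  # line 2: gradient of f
        dg = p * np.sign(y) * np.abs(y) ** (p - 1)  #         gradient of the constraint g
        t = df - (df @ dg) / (dg @ dg) * dg         # line 3: tangential part, equation (6)
        y_new = y - eta * t                         # line 4: descent step
        y_new /= np.linalg.norm(y_new, p)           # line 5: back onto the p-sphere
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```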

The method performs well; however, as most gradient-descent-based algorithms, without further optimization it takes quite a few iterations to achieve acceptable convergence, and the choice of an optimal learning rate η(i) is non-trivial. We therefore propose a second projection method employing a fixed-point optimization strategy. Its idea is based on the fact that at local minima y of f(y) on S_p^{n−1}, the gradient ∇f(y) is orthogonal to S_p^{n−1}, so ∇f(y) ∝ ∇g(y). Ignoring signs for illustrative purposes, this means that y − x ∝ p y^{⊙(p−1)}, so y can be calculated from the fixed-point iteration y ← λ p y^{⊙(p−1)} + x with additional normalization. Indeed, this can be equivalently derived from the previous Lagrange equations (5), also yielding equations for the proportionality factor λ: we can simply determine it from one component of equation (5), or, to increase numerical robustness, as the mean over the total set. Taking into account the signs of the gradient (which we ignored in equation (5)), this yields an estimate λ̂ := (1/n) ∑_{i=1}^n (y_i − x_i)/(p sgn(y_i)|y_i|^{p−1}). Altogether, we get the fixed-point algorithm 2, which in experiments turns out to have a considerably higher convergence rate than algorithm 1.

In table 1, we compare algorithms 1 and 2, namely with respect to the number of iterations they need to achieve convergence below some given threshold. As expected, the fixed-point algorithm outperforms gradient descent always except for the case of higher dimensions and p > 2 (non-sparse case). In the following we will therefore use the fixed-point algorithm for projection onto S_1^{n−1}.

2.3. Projection onto convex sets

If M is a convex set, then the Euclidean projection π_M(x) for any x ∈ ℝ^n is already unique, so X(M) = ∅, and the operator π_M is called



Algorithm 2: Projection onto S_p^{n−1} via fixed-point iteration. Again, the iteration is stopped after only sufficiently small updates are taken.
Input: vector x ∈ ℝ^n
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1 Initialize y ∈ S_p^{n−1} randomly.
for i ← 1, 2, . . . do
2   λ ← ∑_{i=1}^n (y_i − x_i)/(n sgn(y_i)|y_i|^{p−1})
3   y ← x + λ sgn(y)|y|^{⊙(p−1)}
4   y ← y/‖y‖_p
end
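The corresponding fixed-point iteration is equally short in NumPy; again the function name is ours, and zero coordinates of the iterate are assumed not to occur (as encouraged by the random positive start).

```python
import numpy as np

def project_p_sphere_fp(x, p, n_iter=200, tol=1e-8, seed=0):
    """Fixed-point projection onto the p-sphere (algorithm 2)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    y = rng.random(x.size) + 0.1
    y /= np.linalg.norm(y, p)                       # line 1: random start on the p-sphere
    n = x.size
    for _ in range(n_iter):
        grad = np.sign(y) * np.abs(y) ** (p - 1)
        lam = np.sum((y - x) / (n * grad))          # line 2: averaged multiplier estimate
        y_new = x + lam * grad                      # line 3: fixed-point step
        y_new /= np.linalg.norm(y_new, p)           # line 4: renormalize
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```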

Table 1. Comparison of the gradient- and fixed-point-based projection algorithms 1 and 2 for finding the Euclidean projection onto cS_p^{n−1} for varying parameters; the mean was taken over 100 iterations with x ∈ [−1, 1]^n sampled uniformly. Here #its_gd and #its_fp denote the numbers of iterations the algorithms took to achieve update steps of size smaller than ε = 0.0001, and ‖y_gd − y_fp‖ equals the norm of the difference of the found minima.

n    p    c     #its_gd       #its_fp       ‖y_gd − y_fp‖
2   0.9   1.2   6.7 ± 4.7     3.7 ± 1.0     0.0 ± 0.0
2   0.9   2.0   10.9 ± 6.9    4.1 ± 1.0     0.0 ± 0.0
2   2.2   0.9   13.0 ± 21.0   5.5 ± 4.2     0.0 ± 0.0
3   0.9   3     13.7 ± 6.9    4.4 ± 1.0     0.0 ± 0.0
3   2.2   0.9   9.6 ± 16.6    7.2 ± 10.2    0.0 ± 0.0
4   0.9   3     9.8 ± 6.8     4.4 ± 1.1     0.0 ± 0.0
4   2.2   0.9   6.0 ± 5.0     9.2 ± 8.1     0.0 ± 0.0

convex projector, see e.g. [3], lemma 2.4, and [4]. The theory of projection onto convex sets (POCS) [4, 5] is a well-known technique in signal processing; given N convex sets M_1, . . . , M_N ⊂ ℝ^n and an operator defined by π = π_{M_N} ··· π_{M_1}, POCS can be formulated as the recursion defined by y_{i+1} = π(y_i). It always approaches the intersection of the convex sets, that is, y_i → M* = ∩_{i=1}^N M_i.

Note that POCS only finds an arbitrary point in ∩_{i=1}^N M_i, which does not necessarily coincide with the Euclidean projection: for example, if M_1 := {x ∈ ℝ^n | ‖x‖_2 ≤ 1} is the unit disc and M_2 := {x ∈ ℝ^n | x_1 ≤ 0} the half-space of non-positive first coordinate, then the Euclidean projection of x := (1, 1, 0, . . . , 0) onto M_1 ∩ M_2 equals π_{M_1∩M_2}(x) = (0, 1, 0, . . . , 0), but application of POCS yields (0, 1/√2, 0, . . . , 0).

3. SPARSE PROJECTION

In this section, we combine the notions from the previous sections to search for sparse projections. Given a signal x ∈ ℝ^n, our goal is to find the closest signal y ∈ ℝ^n of fixed sparseness σ_p(y) = c. Hence, we search for y ∈ ℝ^n with

y = argmin_{σ_p(y)=c} ‖x − y‖_2. (7)

Due to the scale-invariance of σ_p, the problem (7) is equivalent to finding

y = argmin_{‖y‖_2=1, ‖y‖_p=c} ‖x − y‖_2. (8)

In other words, we are looking for the Euclidean projection y = π_M(x) onto M := S_2^{n−1} ∩ cS_p^{n−1}. Note that due to theorem 2.1, this solution to (8) exists if M ≠ ∅ and is almost always unique.

Algorithm 3: Projection onto spheres (POSH). In practice, some abort criterion has to be implemented. Often q = 2.
Input: vector x ∈ ℝ^n \ X(S_p^{n−1} ∩ S_q^{n−1}) and p, q > 0
Output: y = π_{S_p^{n−1} ∩ S_q^{n−1}}(x)
1 Set y ← x.
while y ∉ S_p^{n−1} ∩ S_q^{n−1} do
2   y ← π_{S_q^{n−1}}(π_{S_p^{n−1}}(y))
end
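A sketch of the POSH loop (our names; project_p_sphere stands for any of the p-sphere projections sketched above, e.g. the fixed-point algorithm 2, and q = 2 is assumed):

```python
import numpy as np

def posh(x, p, c, n_iter=100, tol=1e-8):
    """Alternating projections onto c*S_p and S_2 (algorithm 3 with q = 2)."""
    y = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        y = c * project_p_sphere(y / c, p)   # Euclidean projection onto the scaled p-sphere
        y = y / np.linalg.norm(y)            # Euclidean projection onto S_2, equation (2)
        if abs(np.linalg.norm(y, p) - c) < tol:
            break                            # y now lies (numerically) on both spheres
    return y
```

Projecting onto cS_p is done by rescaling: the Euclidean projection of y onto cS_p equals c times the projection of y/c onto S_p.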

3.1. Projection onto spheres (POSH)

In the special case of p = 1 and nonnegative x, Hoyer has proposed an efficient algorithm for finding the projection [2], simply by using the explicit formulas for the p-sphere projection; such formulas do not exist for p ≠ 1, 2, so a more general algorithm for this situation is proposed in the following.

Its idea is a direct generalization of POCS: we alternately project onto S_2^{n−1} and S_p^{n−1}, using the Euclidean projection operators from section 2.2. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denote this projection algorithm by projection onto spheres (POSH), see algorithm 3.

First note that POSH obviously has the desired solution as a fixed point. In experiments, we observe that POSH indeed converges, and moreover it converges to the closest solution, i.e. to the Euclidean projection (which does not hold for POCS in general)! Finally we see that in higher dimensions, all update vectors together with the starting point x lie in a single two-dimensional plane, so theoretically we can reduce proofs to two-dimensional cases as well as build algorithms using this fact.

In the following section, we will prove the above claims for the case of p = 1, where an explicit projection formula (3) is known. In the case of arbitrary p, so far we are only able to give experimental validation of the astonishing facts of convergence and convergence to the Euclidean projection.

3.2. Convergence

The proof needs the following simple convergence lemma, which somewhat extends a special case treated by the more general Banach fixed-point theorem.

Lemma 3.1. Let f : ℝ → ℝ be continuously differentiable with f(0) = 0, f′(0) > 1, and let f′ be positive and strictly decreasing. If f′(x) < 1 for some x > 0, then there exists a single positive fixed point x̂ of f, and f^i(x) converges to x̂ for i → ∞ and any x > 0.

Theorem 3.2. Let n ≥ 2, p > 0 and x ∈ ℝ^n \ X(M). If y^1 := π_{S_2^{n−1}}(x) and iteratively y^i := π_{S_2^{n−1}}(π_{S_1^{n−1}}(y^{i−1})) according to the POSH algorithm, then y^i converges to some y^∞ ∈ S_2^{n−1}, and y^∞ = π_M(x).

Using lemma 3.1, we can prove the convergence theorem in the case of p = 1, but omit the proof due to lack of space.

3.3. Simulations<br />

At first, we confirm the convergence results from theorem 3.2 for p = 1 by applying POSH with 100 iterations in 1000 runs to vectors x ∈ ℝ^6 sampled uniformly from [0, 1]^6; c was chosen to be sufficiently large (c = 2.4). We always get convergence. We also calculate the correct projection (using Hoyer's projection algorithm



[Fig. 2. Starting from x_0, we alternately project onto cS_1 and S_2. POSH performance is illustrated for p = 1 in dimensions 2 (a) and 3 (b), where a projection via PCA is displayed; no information is lost, hence the sequence of points lies in a plane, as shown in the proof of theorem 3.2. Panel (c) shows an application of POSH for n = 2 and p = 0.5.]

Table 2. Performance of the POSH algorithm 3 for varying parameters. See text for details.

n    p     c     ‖y_POSH − y_scan‖_2
2   0.8   1.2    0.005 ± 0.0008
2   4     0.9    0.02 ± 0.005
3   0.8   1.2    0.02 ± 0.009
3   4     0.9    0.04 ± 0.03

[2]). The distance between his and our solution was calculated to give a mean value of 5·10^{−13} ± 5·10^{−12}, i.e. we virtually always get the same solution.

In figures 2(a) and (b), we show the application for p = 1; we visualize the performance in 3 dimensions by projecting the data via PCA, which by the way throws away virtually no information (confirmed by experiment), indicating the validity of theorem 3.2 also in higher dimensions. In figure 2(c) a projection for p = 0.5 is shown.

Now, we perform batch simulations for varying p. For this, we uniformly sample the starting vector x ∈ [0, 1]^n in 100 runs, and compare the POSH algorithm result with the true projection. POSH is performed starting with the p-norm projection using algorithm 1 and 100 iterations. As the true projection π_M(x) cannot be determined in closed form, we scan [0, 1]^{n−1} using the stepsize ε = 0.01 to give the first (n−1) coordinates of our estimate y of π_M(x); its n-th coordinate is then constructed to guarantee y ∈ S_p^{n−1} (for p < 1 and p > 1, respectively). Using a Taylor approximation of (y + ε)^p, it can easily be shown that two adjacent grid points have maximal difference |‖(y_1+ε, . . . , y_n+ε)‖_p^p − ‖y‖_p^p| ≤ pnε + O(ε^2) if y ∈ [0, 1]^n and p ≥ 1. Hence by taking only vectors y as approximation of π_M(x) with |‖y‖_2^2 − 1| ...




Chapter 15<br />

Neurocomput<strong>in</strong>g, 69:1485-1501, 2006<br />

Paper P.Gruber, K.Stadlthanner, M.Böhm, F.J. Theis, E.W. Lang, A.M. Tomé,<br />

A.R. Teixeira, C.G. Puntonet, and J.M.Górriz Saéz. Denois<strong>in</strong>g us<strong>in</strong>g local<br />

projective subspace methods. Neurocomput<strong>in</strong>g, 69:1485-1501, 2006<br />

Reference (Gruber et al., 2006)<br />

Summary <strong>in</strong> section 1.5.2<br />

197



Denois<strong>in</strong>g Us<strong>in</strong>g Local Projective Subspace<br />

Methods<br />

P. Gruber, K. Stadlthanner, M. Böhm, F. J. Theis, E. W. Lang<br />


Institute of Biophysics, Neuro-and Bio<strong>in</strong>formatics Group<br />

University of Regensburg, 93040 Regensburg, Germany<br />

email: elmar.lang@biologie.uni-regensburg.de<br />

A. M. Tomé, A. R. Teixeira<br />

Dept. de Electrónica e Telecomunicações/IEETA<br />

Universidade de Aveiro, 3810 Aveiro, Portugal<br />

email: ana@ieeta.pt<br />

C. G. Puntonet, J. M. Gorriz Saéz<br />

Dep. Arqitectura y Técnologia de Computadores<br />

Universidad de Granada, 18371 Granada, Spa<strong>in</strong><br />

email: carlos@atc.ugr.es<br />

Abstract

In this paper we present denoising algorithms for enhancing noisy signals based on Local ICA (LICA), Delayed AMUSE (dAMUSE) and Kernel PCA (KPCA). The algorithm LICA relies on applying ICA locally to clusters of signals embedded in a high-dimensional feature space of delayed coordinates. The components resembling the signals can be detected by various criteria like estimators of kurtosis or the variance of autocorrelations, depending on the statistical nature of the signal. The algorithm proposed can be applied favorably to the problem of denoising multidimensional data. Another projective subspace denoising method using delayed coordinates has been proposed recently with the algorithm dAMUSE. It combines the solution of blind source separation problems with denoising efforts in an elegant way and proves to be very efficient and fast. Finally, KPCA represents a non-linear projective subspace method that is well suited for denoising also. Besides illustrative applications to toy examples and images, we provide an application of all algorithms considered to the analysis of protein NMR spectra.

Prepr<strong>in</strong>t submitted to Elsevier Science 4 February 2005



1 Introduction<br />

The <strong>in</strong>terpretation of recorded signals is often hampered by the presence of<br />

noise. This is especially true with biomedical signals which are buried <strong>in</strong> a<br />

large noise background most often. Statistical analysis tools like Pr<strong>in</strong>cipal<br />

<strong>Component</strong> <strong>Analysis</strong> (PCA), s<strong>in</strong>gular spectral analysis (SSA), <strong>Independent</strong><br />

<strong>Component</strong> <strong>Analysis</strong> (ICA) etc. quickly degrade if the signals exhibit a low<br />

Signal to Noise Ratio (SNR). Furthermore, due to their statistical nature, the application of such analysis tools can also lead to extracted signals with an even lower SNR than the original ones, as we will discuss below in the case of Nuclear Magnetic Resonance (NMR) spectra.

Hence <strong>in</strong> the signal process<strong>in</strong>g community many denois<strong>in</strong>g algorithms have<br />

been proposed [5,12,18,38] <strong>in</strong>clud<strong>in</strong>g algorithms based on local l<strong>in</strong>ear projective<br />

noise reduction. The idea is to project noisy signals <strong>in</strong> a high-dimensional<br />

space of delayed coord<strong>in</strong>ates, called feature space henceforth. A similar strategy<br />

is used <strong>in</strong> SSA [20], [9] where a matrix composed of the data and their<br />

delayed versions is considered. Then, a S<strong>in</strong>gular Value Decomposition (SVD)<br />

of the data matrix or a PCA of the related correlation matrix is computed.<br />

Noise contributions to the signals are then removed locally by project<strong>in</strong>g the<br />

data onto a subset of pr<strong>in</strong>cipal directions of the eigenvectors of the SVD or<br />

PCA analysis related with the determ<strong>in</strong>istic signals.<br />

Modern multi-dimensional NMR spectroscopy is a very versatile tool for the<br />

determ<strong>in</strong>ation of the native 3D structure of biomolecules <strong>in</strong> their natural aqueous<br />

environment [7,10]. Proton NMR is an <strong>in</strong>dispensable contribution to this<br />

structure determ<strong>in</strong>ation process but is hampered by the presence of the very<br />

<strong>in</strong>tense water (H2O) proton signal. The latter causes severe basel<strong>in</strong>e distortions<br />

and obscures weak signals ly<strong>in</strong>g under its skirts. It has been shown [26,29]<br />

that Bl<strong>in</strong>d Source Separation (BSS) techniques like ICA can contribute to the<br />

removal of the water artifact <strong>in</strong> proton NMR spectra.<br />

ICA techniques extract a set of signals out of a set of measured signals without<br />

know<strong>in</strong>g how the mix<strong>in</strong>g process is carried out [2, 13]. Consider<strong>in</strong>g that the<br />

set of measured spectra X is a l<strong>in</strong>ear comb<strong>in</strong>ation of a set of <strong>Independent</strong><br />

<strong>Component</strong>s (ICs) S, i.e. X = AS, the goal is to estimate the <strong>in</strong>verse of the<br />

mix<strong>in</strong>g matrix A, us<strong>in</strong>g only the measured spectra, and then compute the ICs.<br />

Then the spectra are reconstructed us<strong>in</strong>g the mix<strong>in</strong>g matrix A and those ICs<br />

conta<strong>in</strong>ed <strong>in</strong> S which are not related with the water artifact. Unfortunately<br />

the statistical separation process <strong>in</strong> practice <strong>in</strong>troduces additional noise not<br />

present <strong>in</strong> the orig<strong>in</strong>al spectra. Hence denois<strong>in</strong>g as a post-process<strong>in</strong>g of the<br />

artifact-free spectra is necessary to achieve the highest possible SNR of the<br />

reconstructed spectra. It is important that the denois<strong>in</strong>g does not change the<br />

spectral characteristics like <strong>in</strong>tegral peak <strong>in</strong>tensities as the deduction of the<br />


3D structure of the prote<strong>in</strong>s heavily relies on the latter.<br />

We propose two new approaches to this denois<strong>in</strong>g problem and compare the<br />

results to the established Kernel PCA (KPCA) denois<strong>in</strong>g [19, 25].<br />

The first approach (Local ICA (LICA)) concerns a local projective denois<strong>in</strong>g<br />

algorithm us<strong>in</strong>g ICA. Here it is assumed that the noise can, at least locally,<br />

be represented by a stationary Gaussian white noise. Signals usually come<br />

from a determ<strong>in</strong>istic or at least predictable source and can be described as<br />

a smooth function evaluated at discrete time steps small enough to capture<br />

the characteristics of the function. That implies, us<strong>in</strong>g a dynamical model for<br />

the data, that the signal embedded <strong>in</strong> delayed coord<strong>in</strong>ates resides with<strong>in</strong> a<br />

sub-manifold of the feature space spanned by these delayed coord<strong>in</strong>ates. With<br />

local projective denois<strong>in</strong>g techniques, the task is to detect this signal manifold.<br />

We will use LICA to detect the statistically most <strong>in</strong>terest<strong>in</strong>g submanifold. In<br />

the follow<strong>in</strong>g we call this manifold the signal+noise subspace s<strong>in</strong>ce it conta<strong>in</strong>s<br />

all of the signal plus that part of the noise components which lie <strong>in</strong> the same<br />

subspace. Parameter selection with<strong>in</strong> LICA will be effected with a M<strong>in</strong>imum<br />

Description Length (MDL) criterion [40], [6] which selects optimal parameters<br />

based on the data themselves.<br />

For the second approach we comb<strong>in</strong>e the ideas of solv<strong>in</strong>g BSS problems algebraically<br />

us<strong>in</strong>g a Generalized Eigenvector Decomposition (GEVD) [28] with<br />

local projective denois<strong>in</strong>g techniques. We propose, like <strong>in</strong> the Algorithm for<br />

Multiple Unknown Signals Extraction (AMUSE) [37], a GEVD of two correlation<br />

matrices i.e, the simultaneous diagonalization of a matrix pencil formed<br />

with a correlation matrix and a matrix of delayed correlations. These algorithms<br />

are exact and fast but sensitive to noise. There are several proposals<br />

to improve efficiency and robustness of these algorithms when noise is<br />

present [2, 8]. They mostly rely on an approximative jo<strong>in</strong>t diagonalization of<br />

a set of correlation or cumulant matrices like the algorithm Second Order<br />

Bl<strong>in</strong>d Identification (SOBI) [1]. The algorithm we propose, called Delayed<br />

AMUSE (dAMUSE) [33], computes a GEVD of the congruent matrix pencil<br />

<strong>in</strong> a high-dimensional feature space of delayed coord<strong>in</strong>ates. We show that the<br />

estimated signal components correspond to filtered versions of the underly<strong>in</strong>g<br />

uncorrelated source signals. We also present an algorithm to compute the<br />

eigenvector matrix of the pencil which <strong>in</strong>volves a two step procedure based on<br />

the standard Eigenvector Decomposition (EVD) approach. The advantage of<br />

this two step procedure is related with a dimension reduction between the two<br />

steps accord<strong>in</strong>g to a threshold criterion. Thereby estimated signal components<br />

related with noise only can be neglected thus perform<strong>in</strong>g a denois<strong>in</strong>g of the<br />

reconstructed signals.<br />

As a third denois<strong>in</strong>g method we consider KPCA based denois<strong>in</strong>g techniques<br />

[19,25] which have been shown to be very efficient outperform<strong>in</strong>g l<strong>in</strong>ear PCA.<br />


KPCA actually generalizes l<strong>in</strong>ear PCA which hitherto has been used for denois<strong>in</strong>g.<br />

PCA denois<strong>in</strong>g follows the idea that reta<strong>in</strong><strong>in</strong>g only the pr<strong>in</strong>cipal components<br />

with highest variance to reconstruct the decomposed signal, noise<br />

contributions which should correspond to the low variance components can<br />

deliberately be omitted hence reduc<strong>in</strong>g the noise contribution to the observed<br />

signal. KPCA extends this idea to non-l<strong>in</strong>ear signal decompositions. The idea<br />

is to project observed data non-l<strong>in</strong>early <strong>in</strong>to a high-dimensional feature space<br />

and then to perform l<strong>in</strong>ear PCA <strong>in</strong> feature space. The trick is that the whole<br />

formalism can be cast <strong>in</strong>to dot product form hence the latter can be replaced<br />

by suitable kernel functions to be evaluated <strong>in</strong> the lower dimensional <strong>in</strong>put<br />

space <strong>in</strong>stead of the high-dimensional feature space. Denois<strong>in</strong>g then amounts<br />

to estimat<strong>in</strong>g appropriate pre-images <strong>in</strong> <strong>in</strong>put space of the nonl<strong>in</strong>early transformed<br />

signals.<br />

The paper is organized as follows: Section 1 presents an <strong>in</strong>troduction and discusses<br />

some related work. In section 2 some general aspects about embedd<strong>in</strong>g<br />

and cluster<strong>in</strong>g are discussed, before <strong>in</strong> section 3 the new denois<strong>in</strong>g algorithms<br />

are discussed <strong>in</strong> detail. Section 4 presents some applications to toy as well as<br />

to real world examples and section 5 draws some conclusions.<br />

2 Feature Space Embedd<strong>in</strong>g<br />

In this section we <strong>in</strong>troduce new denois<strong>in</strong>g techniques and propose algorithms<br />

us<strong>in</strong>g them. At first we present the signal process<strong>in</strong>g tools we will use later<br />

on.<br />

2.1 Embedd<strong>in</strong>g us<strong>in</strong>g delayed coord<strong>in</strong>ates<br />

A common theme of all three algorithms presented is to embed the data <strong>in</strong>to<br />

a high dimensional feature space and try to solve the noise separation problem<br />

there. With the LICA and the dAMUSE we embed signals <strong>in</strong> delayed<br />

coord<strong>in</strong>ates and do all computations directly <strong>in</strong> the space of delayed coord<strong>in</strong>ates.<br />

The KPCA algorithm considers a non-l<strong>in</strong>ear projection of the signals<br />

to a feature space but performs all calculations <strong>in</strong> <strong>in</strong>put space us<strong>in</strong>g the kernel<br />

trick. It uses the space of delayed coord<strong>in</strong>ates only implicitly as <strong>in</strong>termediate<br />

step <strong>in</strong> the nonl<strong>in</strong>ear transformation s<strong>in</strong>ce for that transformation the signal<br />

at different time steps is used.<br />

Delayed coord<strong>in</strong>ates are an ideal tool for represent<strong>in</strong>g the signal <strong>in</strong>formation.<br />

For example <strong>in</strong> the context of chaotic dynamical systems, embedd<strong>in</strong>g an observable<br />

<strong>in</strong> delayed coord<strong>in</strong>ates of sufficient dimension already captures the<br />


full dynamical system [30]. There also exists a similar result <strong>in</strong> statistics for<br />

signals with a f<strong>in</strong>ite decay<strong>in</strong>g memory [24].<br />

Given a group of N sensor signals, x[l] = �<br />

x0[l], . . . , xN−1[l] � T sampled at<br />

time steps l = 0, . . . L − 1, a very convenient representation of the signals<br />

embedded <strong>in</strong> delayed coord<strong>in</strong>ates is to arrange them componentwise <strong>in</strong>to component<br />

trajectory matrices Xi, i = 0, . . . , N − 1 [20]. Hence embedd<strong>in</strong>g<br />

can be regarded as a mapp<strong>in</strong>g that transforms a one-dimensional time series<br />

xi = (xi[0], xi[1], . . . , xi[L − 1]) <strong>in</strong>to a multi-dimensional sequence of lagged<br />

vectors. Let M be an <strong>in</strong>teger (w<strong>in</strong>dow length) with M < L. The embedd<strong>in</strong>g<br />

procedure then forms L − M + 1 lagged vectors which constitute the columns<br />

of the component trajectory matrix. Hence given sensor signals x[l], registered<br />

for a set of L samples, their related component trajectory matrices are given<br />

by

Xi = ( xi[M−1]  xi[M]  . . .  xi[L−1] )

   = ⎡ xi[M−1]   xi[M]     · · ·   xi[L−1]  ⎤
     ⎢ xi[M−2]   xi[M−1]   · · ·   xi[L−2]  ⎥
     ⎢    ⋮         ⋮        ⋱        ⋮     ⎥
     ⎣ xi[0]     xi[1]     · · ·   xi[L−M]  ⎦          (1)

and encompass M delayed versions of each signal component xi[l − m], m =<br />

0, . . . , M −1 collected at time steps l = M −1, . . . , L−1. Note that a trajectory<br />

matrix has identical entries along each diagonal. The total trajectory matrix<br />

of the set X will be a concatenation of the component trajectory matrices Xi<br />

computed for each sensor, i.e.

X = ( X1, X2, . . . , XN )^T          (2)

Note that the embedded sensor signal is also formed by a concatenation of embedded component vectors, i.e. x[l] = [x0[l], . . . , xN−1[l]]. Also note that with

LICA we deal with s<strong>in</strong>gle column vectors of the trajectory matrix only, while<br />

with dAMUSE we consider the total trajectory matrix.<br />
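For concreteness, a small NumPy sketch of the embedding (our function names; X is assumed to be an N × L array of sensor signals):

```python
import numpy as np

def component_trajectory_matrix(xi, M):
    """Equation (1): M x (L - M + 1) trajectory matrix of one sensor signal xi;
    column l holds the lagged vector (xi[l], xi[l-1], ..., xi[l-M+1])."""
    L = xi.shape[0]
    return np.array([xi[M - 1 - m : L - m] for m in range(M)])

def total_trajectory_matrix(X, M):
    """Equation (2): stack the component trajectory matrices of all N sensors."""
    return np.concatenate([component_trajectory_matrix(xi, M) for xi in X], axis=0)

# tiny example: L = 6 samples, window M = 3 gives a 3 x 4 component trajectory matrix
x = np.arange(6.0)
print(component_trajectory_matrix(x, 3))
```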

2.2 Cluster<strong>in</strong>g<br />

In our context cluster<strong>in</strong>g of signals means rearrang<strong>in</strong>g the signal vectors, sampled<br />

at different time steps, by similarity. Hence for signals embedded <strong>in</strong> delayed<br />

coord<strong>in</strong>ates, the idea is to look for K disjo<strong>in</strong>t sub-trajectory matrices to<br />




group together similar column vectors of the trajectory matrix X.<br />

A cluster<strong>in</strong>g algorithm like k-means [15] is appropriate for problems where the<br />

time structure of the signal is irrelevant. If, however, time or spatial correlations<br />

matter, cluster<strong>in</strong>g should be based on f<strong>in</strong>d<strong>in</strong>g an appropriate partition<strong>in</strong>g<br />

of {M − 1, . . . , L − 1} <strong>in</strong>to K successive segments, s<strong>in</strong>ce this preserves the <strong>in</strong>herent<br />

correlation structure of the signals. In any case the number of columns<br />

<strong>in</strong> each sub-trajectory matrix X (j) amounts to Lj such that the follow<strong>in</strong>g<br />

completeness relation holds:<br />

K�<br />

Lj = L − M + 1 (3)<br />

j=1<br />

The mean vector mj <strong>in</strong> each cluster can be considered a prototype vector and<br />

is given by<br />

mj = 1<br />

Xcj = 1<br />

X (j) [1, . . . , 1] T , j = 1, . . . , K (4)<br />

Lj<br />

Lj<br />

where cj is a vector with Lj entries equal to one which characterizes the<br />

cluster<strong>in</strong>g. Note that after the cluster<strong>in</strong>g the set {k = 0, . . . , L − M − 1} of<br />

<strong>in</strong>dices of the columns of X is split <strong>in</strong> K disjo<strong>in</strong>t subsets Kj. Each trajectory<br />

sub-matrix X (j) is formed with those columns of the matrix X, the <strong>in</strong>dices of<br />

which belong to the subset Kj of <strong>in</strong>dices.<br />
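The two clustering variants and the cluster means of equation (4) admit a compact sketch; this is our own illustration (the helper names are hypothetical), using scikit-learn's k-means for the similarity based variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_columns_kmeans(X, K, seed=0):
    """Cluster the columns of the trajectory matrix X by similarity (k-means)."""
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X.T)

def cluster_columns_segments(X, K):
    """Partition the column indices into K successive segments, preserving time structure."""
    n_cols = X.shape[1]
    return np.minimum((np.arange(n_cols) * K) // n_cols, K - 1)

def cluster_means(X, labels, K):
    """Prototype vector m_j of every cluster, cf. equation (4)."""
    return np.column_stack([X[:, labels == j].mean(axis=1) for j in range(K)])

# toy usage with a random trajectory matrix (6 delayed channels, 98 lagged vectors)
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 98))
labels_km = cluster_columns_kmeans(X, K=4)
labels_seg = cluster_columns_segments(X, K=4)
print(cluster_means(X, labels_seg, K=4).shape)   # (6, 4)
```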

2.3 Principal Component Analysis and Independent Component Analysis

PCA [23] is one of the most common multivariate data analysis tools. It linearly transforms given data into uncorrelated data (feature space). Thus in PCA [4] a data vector is represented in an orthogonal basis system such that the projected data have maximal variance. PCA can be performed by an eigenvalue decomposition of the data covariance matrix: the orthogonal transformation is obtained by diagonalizing the centered covariance matrix of the data set.

In ICA, given a random vector, the goal is to find its statistically independent components (ICs). In contrast to correlation-based transformations like PCA, ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten and Hérault [11], while the term ICA was later coined by Comon [3]. With LICA we will use the popular FastICA algorithm by Hyvärinen and Oja [14], which performs ICA by maximizing the non-Gaussianity of the signal components.
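For concreteness, PCA via the eigendecomposition of the covariance matrix and ICA via scikit-learn's FastICA could be sketched as follows; this is our illustration under the assumption that the FastICA implementation of scikit-learn is an acceptable stand-in for the algorithm of [14], and the toy mixture is invented for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

def pca(X):
    """PCA of data X (variables in rows, samples in columns) via the covariance EVD."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = Xc @ Xc.T / Xc.shape[1]                 # covariance matrix
    evals, evecs = np.linalg.eigh(C)            # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]             # sort by decreasing variance
    return evals[order], evecs[:, order], evecs[:, order].T @ Xc

rng = np.random.default_rng(2)
S = np.vstack([np.sin(np.linspace(0, 20, 500)),
               np.sign(np.sin(np.linspace(0, 33, 500)))])          # toy sources
X = np.array([[1.0, 0.5], [0.4, 1.0]]) @ S + 0.05 * rng.standard_normal((2, 500))

variances, axes, scores = pca(X)                # uncorrelated projections
ica = FastICA(n_components=2, random_state=0)
components = ica.fit_transform(X.T).T           # rows = estimated independent components
print(variances, components.shape)
```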


3 Denoising Algorithms

3.1 Local ICA denoising

The LICA algorithm we present is based on a local projective denoising technique using an MDL criterion for parameter selection. The idea is to achieve denoising by locally projecting the embedded noisy signal into a lower dimensional subspace which contains the characteristics of the noise free signal. Finally the signal has to be reconstructed using the various candidates generated by the embedding.

Consider the situation where we have a signal x_i^0[l] at discrete time steps l = 0, . . . , L − 1 but only its noise corrupted version x_i[l] is measured,

x_i[l] = x_i^0[l] + e_i[l]            (5)

where the e_i[l] are samples of a random variable with Gaussian distribution, i.e. x_i equals x_i^0 up to additive stationary white noise.

3.1.1 Embedding and clustering

First the noisy signal x_i[l] is transformed into a high-dimensional signal x_i[l] in the M-dimensional space of delayed coordinates according to

x_i[l] := ( x_i[l], . . . , x_i[l − M + 1 mod L] )^T            (6)

which corresponds to a column of the trajectory matrix in equation 1.

To simplify implementation, we want to ensure that the delayed signal (trajectory matrix), like the original signal, is given at L time steps instead of L − M + 1. This can be achieved by using the samples in a round robin manner, i.e. by closing the end and the beginning of each delayed signal and cutting out exactly L components in accord with the delay. If the signal contains a trend or its statistical nature is significantly different at the end compared to the beginning, this leads to compatibility problems between the beginning and the end of the signal. We can easily resolve this misfit by replacing the signal with a version to which the signal is appended in reverse order, hence avoiding any sudden change in signal amplitude which would otherwise be smoothed out by the algorithm.
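The circular ("round robin") embedding of equation (6) and the mirror extension described above could be realized along the following lines; this is a sketch under our own conventions, not code from the paper.

```python
import numpy as np

def circular_delay_embedding(x, M):
    """Return an M x L matrix whose column l is (x[l], x[l-1 mod L], ..., x[l-M+1 mod L])^T,
    i.e. the delayed coordinates of equation (6) with wrap-around indexing."""
    L = len(x)
    idx = (np.arange(L)[None, :] - np.arange(M)[:, None]) % L
    return x[idx]

def mirror_extend(x):
    """Append the reversed signal so that the wrap-around join contains no amplitude jump."""
    return np.concatenate([x, x[::-1]])

x = np.sin(np.linspace(0, 3, 200)) + 0.1        # a signal with unequal start and end values
emb = circular_delay_embedding(mirror_extend(x), M=10)
print(emb.shape)                                 # (10, 400)
```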

The problem can now be localized by selecting K clusters in the feature space of delayed coordinates of the signal {x_i[l] | l = 0, . . . , L − 1}. Clustering can be achieved by a k-means cluster algorithm as explained in section 2.2. But k-means clustering is only appropriate if the variance or the kurtosis of a signal does not depend on the inherent signal structure. For other noise selection schemes, like choosing the noise components based on the variance of the autocorrelation, it is usually better to find an appropriate partitioning of the set of time steps {0, . . . , L − 1} into K successive segments, since this preserves the inherent time structure of the signals.


Note that the clustering does not change the data but only its time sequence, i.e. it permutes and regroups the columns of the trajectory matrix and separates it into K sub-matrices.

3.1.2 Decomposition and denoising

After centering, i.e. removing the mean in each cluster, we can analyze the M-dimensional signals in these K clusters using PCA or ICA. The PCA case (Local PCA (LPCA)) is studied in [39], so in the following we will propose an ICA based denoising.

Using ICA, we extract M ICs from each delayed signal. Like in all projection based denoising algorithms, noise reduction is achieved by projecting the signal into a lower dimensional subspace. We used two different criteria to estimate the number p of signal+noise components, i.e. the dimension of the signal subspace onto which we project after applying ICA.

• One criterion is a consistent MDL estimator p_MDL for the data model in equation 5 ([39]):

p_MDL = argmin_{p=0,...,M−1} MDL(M, L, p, (λ_j), γ)            (7)

       = argmin_{p=0,...,M−1} { −(M − p) L ln[ (∏_{j=p+1}^{M} λ_j)^{1/(M−p)} / ( (1/(M−p)) ∑_{j=p+1}^{M} λ_j ) ]
                                 + (pM − p²/2 + p/2 + 1)(1/2 + ln γ)
                                 + (pM − p²/2 + p/2 + 1)/2 · ln L + (1/2) ( ∑_{j=1}^{p} ln λ_j − ∑_{j=1}^{M−1} ln λ_j ) }

where the λ_j denote the variances of the signal components in feature space, i.e. after applying the de-mixing matrix which we estimate with the ICA algorithm. To retain the relative strength of the components in the mixture, we normalize the rows of the de-mixing matrix to unit norm. The variances are ordered such that the smallest eigenvalues λ_j correspond to directions in feature space most likely to be associated with noise components only.

The first term in the MDL estimator represents the likelihood of the M − p Gaussian white noise components. The third term stems from the estimation of the description length of the signal part (the first p components) of the mixture based on their variances. The second term acts as a penalty term to favor parsimonious representations of the data for short time series; it becomes insignificant in the limit L → ∞ since it does not depend on L while the other two terms grow without bounds. The parameter γ controls this behavior and is a parameter of the MDL estimator, hence of the final denoising algorithm. By experience, good values for γ seem to be 32 or 64.

• Based on the observations reported in [17] and our own observation that, in some situations, the MDL estimator tends to significantly underestimate the number of noise components, we also used another approach: we clustered the variances of the signal components into two clusters using k-means clustering and defined p_cl as the number of elements in the cluster which contains the largest eigenvalue. This yields a good estimate of the number of signal components if the noise variances are not clustered well enough together but, nevertheless, are separated from the signal by a large gap. More details and simulations corroborating our observations can be found in section 4.1.1.
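This second, cluster based criterion admits a compact sketch (our own formulation): split the feature-space variances into two groups with k-means and count the members of the group containing the largest variance.

```python
import numpy as np
from sklearn.cluster import KMeans

def p_cluster(variances):
    """Estimate the signal subspace dimension p_cl by 2-means clustering of the variances."""
    lam = np.sort(np.asarray(variances))[::-1]                 # descending order
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lam.reshape(-1, 1))
    signal_label = labels[0]                                   # cluster containing the largest variance
    return int(np.sum(labels == signal_label))

# toy example: three strong "signal" variances well separated from the noise floor
print(p_cluster([9.5, 8.1, 6.7, 0.4, 0.35, 0.3, 0.28, 0.25]))   # typically 3
```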

Depending on the statistical nature of the data, the ordering of the components in the MDL estimator can be achieved using different methods. For data with a non-Gaussian distribution, we select the noise component as the component with the smallest value of the kurtosis, as Gaussian noise corresponds to a vanishing kurtosis. For non-stationary data with stationary noise, we identify the noise by the smallest variance of its autocorrelation.

3.1.3 Reconstruction

In each cluster the centering is reversed by adding back the cluster mean. To reconstruct the noise reduced signal, we first have to reverse the clustering of the data to yield the signal x_i^e[l] ∈ R^M by concatenating the trajectory sub-matrices and reversing the permutation done during clustering. The resulting trajectory matrix does not possess identical entries in each diagonal. Hence we average over the candidates in the delayed data, i.e. over all entries in each diagonal:

x_i^e[l] := (1/M) ∑_{j=0}^{M−1} x_i^e[l + j mod L]_j            (8)

where x_i^e[l]_j stands for the j-th component of the enhanced vector x_i^e at time step l. Note that the summation is done over the diagonals of the trajectory matrix, so it would yield x_i if performed on the original delayed signal x_i.
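Equation (8) amounts to averaging, for every time step, over all delayed copies that refer to it; a minimal sketch (ours, using the same circular indexing as in equation (6)) could read:

```python
import numpy as np

def diagonal_average(emb):
    """Undo a circular delay embedding by averaging all candidates of each sample (equation (8)).

    emb has shape (M, L); entry emb[j, l] is the j-th delayed copy, an estimate of x[(l - j) mod L].
    """
    M, L = emb.shape
    est = np.zeros(L)
    for j in range(M):
        est += np.roll(emb[j], -j)      # shift row j so that it aligns with the undelayed signal
    return est / M

# sanity check: embedding followed by diagonal averaging reproduces the signal exactly
x = np.sin(np.linspace(0, 6, 100))
idx = (np.arange(100)[None, :] - np.arange(8)[:, None]) % 100
print(np.allclose(diagonal_average(x[idx]), x))    # True
```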


3.1.4 Parameter estimation

We still have to find optimal values for the global parameters M and K. Their selection can again be based on an MDL criterion for the detected noise e := x − x^e. Accordingly we apply the LICA algorithm for different M and K, embed each of the error signals e(M, K) in delayed coordinates of a fixed, large enough dimension M̂, and choose the parameters M_0 and K_0 such that the MDL criterion estimating the noisiness of the error signal is minimal. The MDL criterion is evaluated with respect to the eigenvalues λ_j(M, K) of the correlation matrix of e(M, K) such that

(M_0, K_0) = argmin_{M,K} MDL( M̂, L, 0, (λ_j(M, K)), γ )            (9)
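The grid search over (M, K) can be sketched as follows. This is only an illustration: `moving_average` is a placeholder for the actual LICA denoiser, and instead of the full MDL expression of equation (7) we score the residual with the log ratio of the arithmetic to the geometric mean of the eigenvalues of its embedded correlation matrix, which is the dominant, L-dependent ingredient of the criterion; this simplification is our assumption, not the paper's exact choice.

```python
import numpy as np

def whiteness_score(e, M_hat=20):
    """Noisiness proxy for the residual e: embed in M_hat delayed coordinates and
    compare arithmetic and geometric means of the correlation eigenvalues
    (white noise gives equal eigenvalues, hence a score close to 0)."""
    L = len(e)
    idx = (np.arange(L)[None, :] - np.arange(M_hat)[:, None]) % L
    E = e[idx]
    lam = np.clip(np.linalg.eigvalsh(E @ E.T / L), 1e-12, None)
    return np.log(np.mean(lam)) - np.mean(np.log(lam))

def moving_average(x, M):        # placeholder denoiser, NOT the LICA algorithm itself
    return np.convolve(x, np.ones(M) / M, mode="same")

def select_parameters(x, Ms, Ks, denoise):
    """Pick (M0, K0) whose residual x - x_e looks most like white noise, cf. equation (9)."""
    best = None
    for M in Ms:
        for K in Ks:
            score = whiteness_score(x - denoise(x, M, K))
            if best is None or score < best[0]:
                best = (score, M, K)
    return best[1], best[2]

x = np.sin(np.linspace(0, 20, 1000)) + 0.3 * np.random.default_rng(3).standard_normal(1000)
print(select_parameters(x, Ms=[5, 10, 20], Ks=[1], denoise=lambda s, M, K: moving_average(s, M)))
```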

3.2 Denoising using Delayed AMUSE

Signals with an inherent correlation structure like time series data can as well be analyzed using second-order blind source separation techniques only [22,34]. A GEVD of a matrix pencil [36,37] or a joint approximative diagonalization of a set of correlation matrices [1] is then usually considered. Recently we proposed an algorithm based on a generalized eigenvalue decomposition in a feature space of delayed coordinates [34]. It provides means for BSS and denoising simultaneously.

3.2.1 Embedding

Assuming that each sensor signal is a linear combination X = A S of N underlying but unknown source signals s_i, a source signal trajectory matrix S can be written in analogy to equation 1 and equation 2. Then the mixing matrix A is a block matrix with a diagonal matrix in each block:

A = [ a_11 I_{M×M}   a_12 I_{M×M}   · · ·   a_1N I_{M×M}
      a_21 I_{M×M}   a_22 I_{M×M}   · · ·   a_2N I_{M×M}
         ...             ...         ...        ...
      a_N1 I_{M×M}   a_N2 I_{M×M}   · · ·   a_NN I_{M×M} ]            (10)

The matrix I_{M×M} represents the identity matrix, and in accord with an instantaneous mixing model the mixing coefficient a_ij relates the sensor signal x_i with the source signal s_j.


3.2.2 Generalized Eigenvector Decomposition

The delayed correlation matrices of the matrix pencil are computed with one matrix X_r, obtained by eliminating the first k_i columns of X, and another matrix, X_l, obtained by eliminating the last k_i columns. Then the delayed correlation matrix R_x(k_i) = X_r X_l^T will be an NM × NM matrix. Each of these two matrices can be related with a corresponding matrix in the source signal domain:

R_x(k_i) = A R_s(k_i) A^T = A S_r S_l^T A^T            (11)

Then the two pairs of matrices (R_x(k_1), R_x(k_2)) and (R_s(k_1), R_s(k_2)) represent congruent pencils [32] with the following properties:

• Their eigenvalues are the same, i.e., the eigenvalue matrices of both pencils are identical: D_x = D_s.
• If the eigenvalues are non-degenerate (distinct values in the diagonal of the matrix D_x = D_s), the corresponding eigenvectors are related by the transformation E_s = A^T E_x.

Assuming that all sources are uncorrelated, the matrices R_s(k_i) are block diagonal, having block matrices R_mm(k_i) = S_ri S_li^T along the diagonal. The eigenvector matrix of the GEVD of the pencil (R_s(k_1), R_s(k_2)) again forms a block diagonal matrix with block matrices E_mm forming the M × M eigenvector matrices of the GEVD of the pencils (R_mm(k_1), R_mm(k_2)). The uncorrelated components can then be estimated from linearly transformed sensor signals via

Y = E_x^T X = E_x^T A S = E_s^T S            (12)

hence they turn out to be filtered versions of the underlying source signals. As the eigenvector matrix E_s is a block diagonal matrix, there are M signals in each column of Y which are a linear combination of one of the source signals and its delayed versions. The columns of the matrix E_mm then represent impulse responses of Finite Impulse Response (FIR) filters. Considering that all columns of E_mm are different, their frequency responses might provide different spectral densities of the source signal spectra. Then the NM output signals y encompass M filtered versions of each of the N estimated source signals.

3.2.3 Implementation of the GEVD

There are several ways to compute the generalized eigenvalue decomposition. We resume a procedure valid if one of the matrices of the pencil is symmetric positive definite. Thus, we consider the pencil (R_x(0), R_x(k_2)) and perform the following steps:

Step 1: Compute a standard eigenvalue decomposition of R_x(0) = V Λ V^T, i.e. compute the eigenvectors v_i and eigenvalues λ_i. As the matrix is symmetric positive definite, the eigenvalues can be arranged in descending order (λ_1 > λ_2 > · · · > λ_NM). This procedure corresponds to the usual whitening step in many ICA algorithms. It can be used to estimate the number of sources, but it can also be considered a strategy to reduce noise much like with PCA denoising. Dropping small eigenvalues amounts to a projection from a high-dimensional feature space onto a lower dimensional manifold representing the signal+noise subspace. Thereby it is tacitly assumed that small eigenvalues are related with noise components only. Here we consider a variance criterion to choose the most significant eigenvalues, those related with the embedded deterministic signal, according to

( λ_1 + λ_2 + . . . + λ_l ) / ( λ_1 + λ_2 + . . . + λ_NM ) ≥ TH            (13)

If we are interested in the eigenvectors corresponding to directions of high variance of the signals, the threshold TH should be chosen such that their maximum energy is preserved. Similar to the whitening phase in many BSS algorithms, the data matrix X can be transformed using

Q = Λ^{−1/2} V^T            (14)

to calculate a transformed matrix of delayed correlations C(k_2) to be used in the next step. The transformation matrix can be computed using either the l most significant eigenvalues, in which case denoising is achieved, or all eigenvalues and respective eigenvectors. Also note that Q represents an l × NM matrix if denoising is considered.

Step 2: Use the transformed delayed correlation matrix C(k_2) = Q R_x(k_2) Q^T and its standard eigenvalue decomposition: the eigenvector matrix U and eigenvalue matrix D_x.

The eigenvectors of the pencil (R_x(0), R_x(k_2)), which are not normalized, form the columns of the eigenvector matrix E_x = Q^T U = V Λ^{−1/2} U. The ICs of the delayed sensor signals can then be estimated via the transformation given below, yielding l (or NM) signals, one signal per row of Y:

Y = E_x^T X = U^T Q X = U^T Λ^{−1/2} V^T X            (15)

The first step of this algorithm is therefore equivalent to a PCA in a high-dimensional feature space [9,38], where a matrix similar to Q is used to project the data onto the signal manifold.
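The two steps can be condensed into a few lines of NumPy; the following is our sketch of equations (13) to (15), with symmetrization of the delayed correlation matrix added as a numerical convenience and arbitrary toy data.

```python
import numpy as np

def damuse_gevd(X, k2=1, TH=0.95):
    """Two-step GEVD of the pencil (R_x(0), R_x(k2)) for an embedded data matrix X (rows = delayed channels).

    Step 1: whiten with the EVD of R_x(0), keeping enough eigenvalues to reach the variance threshold TH.
    Step 2: EVD of the transformed delayed correlation matrix C(k2); return Y = U^T Q X.
    """
    R0 = X @ X.T                                    # zero-lag correlation matrix R_x(0)
    Xr, Xl = X[:, k2:], X[:, :-k2]                  # drop the first / last k2 columns
    Rk = Xr @ Xl.T                                  # delayed correlation matrix R_x(k2)
    Rk = 0.5 * (Rk + Rk.T)                          # symmetrize (our choice, for a clean EVD)

    lam, V = np.linalg.eigh(R0)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    l = int(np.searchsorted(np.cumsum(lam) / lam.sum(), TH) + 1)   # threshold criterion (13)

    Q = np.diag(lam[:l] ** -0.5) @ V[:, :l].T       # whitening / projection matrix, equation (14)
    C = Q @ Rk @ Q.T
    _, U = np.linalg.eigh(0.5 * (C + C.T))
    return U.T @ Q @ X                              # output signals Y, equation (15)

rng = np.random.default_rng(4)
signal = np.vstack([np.sin(np.linspace(0, 50, 500)),
                    np.cos(np.linspace(0, 20, 500)),
                    np.sign(np.sin(np.linspace(0, 8, 500)))])
X = signal + 0.05 * rng.standard_normal(signal.shape)
print(damuse_gevd(X, k2=2, TH=0.9).shape)
```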


3.3 Kernel PCA based denoising

Kernel PCA has been developed by [19], hence we give here only a short summary for convenience. PCA only extracts linear features, though with suitable nonlinear features more information could be extracted. It has been shown [19] that KPCA is well suited to extract interesting nonlinear features in the data. KPCA first maps the data x_i into some high-dimensional feature space Ω through a nonlinear mapping Φ : R^n → R^m, m > n, and then performs linear PCA on the mapped data in the feature space Ω. Assuming centered data in feature space, i.e. ∑_{k=1}^{l} Φ(x_k) = 0, performing PCA in the space Ω amounts to finding the eigenvalues λ > 0 and eigenvectors ω ∈ Ω of the correlation matrix R̄ = (1/l) ∑_{j=1}^{l} Φ(x_j) Φ(x_j)^T.

Note that all ω with λ ≠ 0 lie in the subspace spanned by the vectors Φ(x_1), . . . , Φ(x_l). Hence the eigenvectors can be represented via

ω = ∑_{i=1}^{l} α_i Φ(x_i)            (16)

Multiplying the eigenequation with Φ(x_k) from the left, the following modified eigenequation is obtained

K α = l λ α            (17)

with λ > 0. The eigenequation is now cast in the form of dot products occurring in feature space through the l × l matrix K with elements K_ij = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j), which are represented by kernel functions k(x_i, x_j) to be evaluated in the input space. For feature extraction any suitable kernel can be used, and knowledge of the nonlinear function Φ(x) is not needed. Note that the latter can always be reconstructed from the principal components obtained. The image of a data vector under the map Φ can be reconstructed from its projections β_k via

P̂_n Φ(x) = ∑_{k=1}^{n} β_k ω_k = ∑_{k=1}^{n} (ω_k · Φ(x)) ω_k            (18)

which defines the projection operator P̂_n. In denoising applications, n is deliberately chosen such that the squared reconstruction error

e²_rec = ∑_{i=1}^{l} ‖ P̂_n Φ(x_i) − Φ(x_i) ‖²            (19)

is minimized. To find a corresponding approximate representation of the data in input space, the so called pre-image, it is necessary to estimate a vector z ∈ R^N in input space such that

ρ(z) = ‖ P̂_n Φ(x) − Φ(z) ‖² = k(z, z) − 2 ∑_{k=1}^{n} β_k ∑_{i=1}^{l} α_i^k k(x_i, z)  (up to terms independent of z)            (20)

is minimized. Note that an analytic solution to the pre-image problem has been given recently in the case of invertible kernels [16]. In denoising applications it is hoped that the deliberately neglected dimensions of minor variance mostly contain noise and that z represents a denoised version of x. Equation (20) can be minimized via gradient descent techniques.
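A minimal sketch of KPCA denoising with a Gaussian kernel and a gradient descent pre-image, written by us for illustration, follows; the centering of the data in feature space assumed in the text is omitted for brevity, the step size and the toy data are arbitrary, and the kernel width follows the pairwise-distance heuristic used later in section 4.3.2.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma2))

def kpca_denoise(X, x, n_comp, sigma2, steps=200, eta=0.1):
    """Project x onto the first n_comp kernel principal axes of the sample set X
    and minimize rho(z) of equation (20) by gradient descent."""
    K = gaussian_kernel_matrix(X, sigma2)
    lam, A = np.linalg.eigh(K)                                   # K alpha = l*lambda*alpha, cf. (17)
    order = np.argsort(lam)[::-1][:n_comp]
    A = A[:, order] / np.sqrt(np.clip(lam[order], 1e-12, None))  # normalize so that ||omega_k|| = 1

    sq = np.sum(X**2, axis=1)
    kx = np.exp(-(sq - 2 * X @ x + x @ x) / (2 * sigma2))        # k(x_i, x)
    beta = A.T @ kx                                              # projections beta_k
    gamma = A @ beta                                             # gamma_i = sum_k beta_k alpha_i^k

    z = x.copy()                                                 # start the descent at the noisy point
    for _ in range(steps):
        kz = np.exp(-(sq - 2 * X @ z + z @ z) / (2 * sigma2))
        grad = -2 * ((gamma * kz) @ (X - z)) / sigma2            # gradient of rho(z), Gaussian kernel
        z = z - eta * grad
    return z

rng = np.random.default_rng(5)
t = np.linspace(0, 2 * np.pi, 50)
clean = np.array([np.sin(3 * t + phi) for phi in rng.uniform(0, 2 * np.pi, 200)])
noisy = clean[0] + 0.3 * rng.standard_normal(50)
sigma2 = np.mean(np.sum((clean[:50, None, :] - clean[None, :50, :])**2, axis=-1)) / 2
print(np.linalg.norm(kpca_denoise(clean, noisy, n_comp=8, sigma2=sigma2) - clean[0]))
```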

4 Applications and simulations

In this section we will first present results and a concomitant interpretation of some experiments with toy data using different variations of the LICA denoising algorithm. Next we also present some test simulations of the algorithm dAMUSE using toy data. Finally we will discuss the results of applying the three different denoising algorithms presented above to a real world problem, i.e. to enhance protein NMR spectra contaminated with a huge water artifact.

4.1 Denoising with Local ICA applied to toy examples

We will present some sample experimental results using artificially generated signals and random noise. As the latter is characterized by a vanishing kurtosis, the LICA based denoising algorithm uses the component kurtosis for noise selection.

4.1.1 Discussion of an MDL based subspace selection

In the LICA denoising algorithm the MDL criterion is also used to select the number of noise components in each cluster. This works without prior knowledge of the noise strength. Since the estimation is based solely on statistical properties, however, it produces suboptimal results in some cases. In figure 1 we compare, for an artificial signal with known additive white Gaussian noise, the denoising achieved with the MDL based estimation of the subspace dimension versus an estimation based on the noise level. The latter is done using a threshold on the variances of the components in feature space such that only the signal part is conserved. Figure 1 shows that the threshold criterion works slightly better in this case, though the MDL based selection can obtain a comparable level of denoising. However, the smaller SNR indicates that the MDL criterion favors some over-modelling of the signal subspace, i.e. it tends to underestimate the number of noise components in the registered signals. In [17] the conditions which lead to a strong over-modelling, such as the noise not being completely white, are identified. Over-modelling also happens frequently if the eigenvalues of the covariance matrix related with noise components are not sufficiently close together and are not separated from the signal components by a gap. In those cases a clustering criterion for the eigenvalues seems to yield better results, but it is not as generic as the MDL criterion.

Fig. 1. Comparison between MDL based and threshold based denoising of an artificial signal with known SNR = 0 dB (panels: original signal, noisy signal, MDL based local ICA, threshold based local ICA). The feature space dimension was M = 40 and the number of clusters was K = 35. The MDL criterion achieved an SNR of 8.9 dB and the threshold criterion an SNR of 10.5 dB.

4.1.2 Comparisons between LICA and LPCA

Consider the artificial signal shown in figure 1 with varying additive Gaussian white noise. We apply the LICA denoising algorithm using either an MDL criterion or a threshold criterion for parameter selection. The results are depicted in figure 2.

The first and second diagram of figure 2 compare the performance, here the enhancement of the SNR and the mean square error, of LPCA and LICA depending on the input SNR. Note that a source SNR of 0 dB describes a case where signal and noise have the same strength, while negative values indicate situations where the signal is buried in the noise. The third graph shows the difference in kurtosis between the recovered signal and the source signal in dependence on the input SNR. All three diagrams were generated with the same data set, i.e. the same signal and, for a given input SNR, the same additive noise.

These results suggest that a LICA approach is more effective when the signal is infested with a large amount of noise, whereas LPCA seems better suited for signals with high SNRs. This might be due to the nature of our selection of subspaces based on the kurtosis or on the variance of the autocorrelation, as the comparison of higher statistical moments of the restored data, like the kurtosis, indicates that noise reduction can be enhanced by using a LICA approach.


Fig. 2. Comparison between LPCA and LICA based denoising (panels from left to right: SNR enhancement versus source SNR, mean square error between original and recovered signal versus the error between original and noisy signal, and kurtosis error |kurt(s^e) − kurt(s)| versus source SNR; each panel shows both the LICA and the LPCA result). Here the mean square error of two signals x, y with L samples is (1/L) ∑_i ‖x_i − y_i‖². For all noise levels a complete parameter estimation was done in the sets {10, 15, . . . , 60} for M and {20, 30, . . . , 80} for K.

4.1.3 LICA denoising with multi-dimensional data sets

A generalization of the LICA algorithm to multidimensional data sets, like images where pixel intensities depend on two coordinates, is desirable. A simple generalization would be to look at delayed coordinates of vectors instead of scalars. However, this appears impractical due to the prohibitive computational effort. More importantly, this direct approach reduces the number of available samples significantly. This leads to far less accurate estimators of important aspects like the MDL estimation of the dimension of the signal subspace or the estimation of the kurtosis criterion in the LICA case.

Another approach could be to convert the data to a 1D string by choosing some path through the data and concatenating the pixel intensities accordingly. But this can easily create unwanted artifacts along the chosen path. Further, local correlations are broken up, hence not all the available information is used.

But a more sophisticated and, depending on the nature of the signal, very effective alternative approach can be envisaged. Instead of converting the multidimensional data into 1D data strings prior to applying the LICA algorithm, we can use a modified delay transformation using shifts along all available dimensions. This concept is similar to the multidimensional auto-covariances used in the Multi Dimensional SOBI (mdSOBI) algorithm introduced in [31]. In the 2D case, for example, consider an n × n image represented by a matrix P = (a_ij), i, j = 1, . . . , n. Then the transformed data set consists of copies of P which are shifted either along columns or rows or both. For instance, a translation

a_ij → a_{i−1,j+1}, (i, j = 1, . . . , n), yields the following transformed image:

P^{−1,1} = [ a_{n,2}     . . .   a_{n,n}     a_{n,1}
             a_{1,2}     . . .   a_{1,n}     a_{1,1}
               ...                  ...         ...
             a_{n−1,2}   . . .   a_{n−1,n}   a_{n−1,1} ]            (21)

Then instead of choosing a single delay dimension, we choose a delay radius M and use all P^ν with ‖ν‖ < M as delayed versions of the original signal. The remainder of the LICA based denoising algorithm works exactly as in the case of a 1D time series.

Fig. 3. Comparison of LPCA and LICA based denoising of an image infested with Gaussian noise (panels: noisy image, Local PCA, Local ICA, Local ICA + PCA). Note the improvement in denoising power if both are applied consecutively (Local PCA SNR = 8.8 dB, LICA SNR = 10.6 dB, LPCA and LICA consecutively SNR = 12.6 dB). All images were denoised using a fixed number of clusters K = 20 and a delay radius of M = 4, which results in a 49-dimensional feature space.

In figure 3 we compare this approach, using the MDL criterion to select the number of components, for LPCA and LICA. In addition we see that the algorithm also works favorably if applied multiple times.
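The shifted-copy construction of equation (21) for a delay radius M can be sketched as follows; this illustration is ours, and it reads the radius condition as a maximum-norm bound, which reproduces the 49-dimensional feature space quoted for M = 4 in the caption of figure 3.

```python
import numpy as np

def shifted_copies(P, M):
    """Return all cyclic shifts P^nu of the image P with max(|nu_1|, |nu_2|) < M,
    stacked as rows of a feature matrix (one feature vector per shift)."""
    shifts = [(dr, dc) for dr in range(-M + 1, M) for dc in range(-M + 1, M)]
    return np.array([np.roll(P, shift=(dr, dc), axis=(0, 1)).ravel() for dr, dc in shifts])

P = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = shifted_copies(P, M=2)                      # (2*2 - 1)^2 = 9 shifted versions
print(F.shape)                                  # (9, 16)
```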


4.2 Denoising with dAMUSE applied to toy examples

A group of three artificial source signals with different frequency contents was chosen: one member of the group represents a narrow-band signal, a sinusoid; the second signal encompasses a wide frequency range; and the last one represents a sawtooth wave whose spectral density is concentrated in the low frequency band (see figure 4).

Fig. 4. Artificial signals (left column, plotted over the sample index n) and their frequency contents (right column, frequency axis in units of π).

The simulations were designed to illustrate the method and to study the influence of the threshold parameter TH on the performance when noise is added at different levels. Concerning noise, we also try to find out whether there is any advantage in using a GEVD instead of a PCA analysis. Hence the signals at the output of the first step of the algorithm (using the matrix Q to project the data) are also compared with the output signals. Results are collected in table 1.

Random noise was added to the sensor signals yielding an SNR in the range of [0, 20] dB. The parameters M = 4 and TH = 0.95 were kept fixed. As the noise level increases, the number of significant eigenvalues also increases. Hence at the output of the first step more signals need to be considered. Thus as the noise energy increases, the number (l) of signals, or the dimension of the matrix C, also increases after the application of the first step (last column of table 1). As the noise increases, an increasing number of ICs will be available at the output of the two steps. Computing, in the frequency domain, the correlation coefficients between the output signals of each step of the algorithm and the noise or source signals, we confirm that some are related with the sources and others with noise. Table 1 (columns 3-6) shows that the maximal correlation coefficients are distributed between noise and source signals to a varying degree. We can see that the number of signals correlated with noise is always higher in the first level. Results show that for low noise levels the first step (which is mainly a principal component analysis in a space of dimension NM) already achieves good solutions. However, we can also see (for narrow-band signals and/or low M) that the time domain characteristics of the signals resemble the original source signals only after a GEVD, i.e. at the output of the second step, rather than with a PCA, i.e. at the output of the first step. Figure 5 shows examples of signals that have been obtained in the two steps of the algorithm for SNR = 10 dB. At the output of the first level the 3 signals with the highest frequency correlation were chosen among the 8 output signals. Using a similar criterion to choose 3 signals at the output of the 2nd step (last column of figure 5), we can see that their time course is more similar to the source signals than after the first step (middle column of figure 5).

Table 1
Number of output signals correlated with noise or source signals after step 1 and step 2 of the algorithm dAMUSE.

                  1st step             2nd step
SNR      NM   Sources   Noise      Sources   Noise      Total
20 dB    12   6         0          6         0          6
15 dB    12   5         2          6         1          7
10 dB    12   6         2          7         1          8
 5 dB    12   6         3          7         2          9
 0 dB    12   7         4          8         3          11

Fig. 5. Comparison of output signals resulting after the first step (second column) and the second step (last column) of dAMUSE; the first column shows the source signals.

4.3 Denoising of protein NMR spectra

In biophysics the determination of the 3D structure of biomolecules like proteins is of utmost importance. Nuclear magnetic resonance techniques provide indispensable tools to reach this goal. As hydrogen nuclei are the most abundant and most sensitive nuclei in proteins, mostly proton NMR spectra of proteins dissolved in water are recorded. Since the concentration of the solvent is larger than the protein concentration by orders of magnitude, there is always a large proton signal of the water solvent contaminating the protein spectrum. This water artifact cannot be suppressed completely with technical means, hence it would be interesting to remove it during the analysis of the spectra.

BSS techniques have been shown to solve this separation problem [27,28]. BSS algorithms are based on an ICA [2] which extracts a set of underlying independent source signals out of a set of measured signals without knowing how the mixing process is carried out. We have used an algebraic algorithm [35,36] based on second order statistics, using the time structure of the signals, to separate this and related artifacts from the remaining protein spectrum. Unfortunately, due to the statistical nature of the algorithm, unwanted noise is introduced into the reconstructed spectrum, as can be seen in figure 6. The water artifact removal is effected by a decomposition of a series of NMR spectra into their uncorrelated spectral components applying a generalized eigendecomposition of a congruent matrix pencil [37]. The latter is formed with a correlation matrix of the signals and a correlation matrix with delayed or filtered signals [32]. Then we can detect and remove the components which contain only a signal generated by the water and reconstruct the remaining protein spectrum from its ICs. But the latter now contains additional noise introduced by the statistical analysis procedure, hence denoising was deemed necessary.

The algorithms discussed above have been applied to an experimental 2D Nuclear Overhauser Effect Spectroscopy (NOESY) proton NMR spectrum of the polypeptide P11 dissolved in water. The synthetic peptide P11 consists of only 24 amino acids and represents the helix H11 of the human Glutathion reductase [21]. A simple pre-saturation of the water resonance was applied to prevent saturation of the dynamic range of the Analog Digital Converter (ADC). Every data set comprises 512 Free Induction Decays (FIDs) S(t_1, t_2) ≡ x_n[l], or their corresponding spectra Ŝ(ω_1, ω_2) ≡ x̂_n[l], with L = 2048 samples each, which correspond to N = 128 evolution periods t_1 ≡ [n]. To each evolution period belong four FIDs with different phase modulations, hence only FIDs with equal phase modulations have been considered for analysis. A BSS analysis, using both the algorithm GEVD using Matrix Pencil (GEVD-MP) [28] and the algorithm dAMUSE [33], was applied to all data sets. Note that the matrix pencil within GEVD-MP was conveniently computed in the frequency domain, while in the algorithm dAMUSE the matrix pencil was computed in the time domain, in spite of the filtering operation being performed in the frequency domain. The GEVD is performed in dAMUSE as described above to achieve a dimension reduction and concomitant denoising.

4.3.1 Local ICA denoising

For denoising we first used the LICA denoising algorithm proposed above to enhance the reconstructed protein signal without the water artifact. We applied the denoising only to those components which were identified as water components. Then we removed the denoised versions of these water artifact components from the total spectrum. As a result, the additional noise is at least halved, as can also be seen from figure 7. On the part of the spectrum away from the center, i.e. not containing any water artifacts, we could estimate the increase of the SNR with the original spectrum as reference. We calculated an SNR of 17.3 dB for the noisy spectrum and an SNR of 21.6 dB after applying the denoising algorithm.

We compare the result of our denoising algorithm, i.e. the reconstructed artifact-free protein spectrum, to the result of a KPCA based denoising algorithm using a Gaussian kernel in figure 8. The figure depicts the differences between the denoised spectra and the original spectrum in the regions where the water signal is not very dominant. As can be seen, the LICA denoising algorithm reduces the noise but does not change the content of the signal, whereas the KPCA algorithm seems to influence the peak amplitudes of the protein resonances as well. Further experiments are under way in our laboratory to investigate these differences in more detail and to establish an automatic artifact removal algorithm for multidimensional NMR spectra.

Fig. 6. The graph shows a 1D slice of a proton 2D NOESY NMR spectrum of the polypeptide P11 before (top panel: original NMR spectrum of the P11 protein) and after (bottom panel: spectrum after the water removal algorithm) removing the water artifact with the GEVD-MP algorithm; the abscissa is the chemical shift δ [ppm], the ordinate the signal in arbitrary units. The 1D spectrum corresponds to the shortest evolution period t1.

4.3.2 Kernel PCA denoising

As the removal of the water artifact led to additional noise in the spectra (compare figure 9(a) and figure 9(b)), KPCA based denoising was applied. First (almost) noise free samples had to be created in order to determine the principal axes in feature space. For that purpose, the first 400 data points of the real and the imaginary part of each of the 512 original spectra were used to form a 400 × 1024 sample matrix X^(1). Likewise five further sample matrices X^(m), m = 2, . . . , 6, were created, which consisted of the data points 401 to 800, 601 to 1000, 1101 to 1500, 1249 to 1648 and 1649 to 2048, respectively. Note that the region (1000 - 1101) of data points comprising the main part of the water resonance was nulled deliberately as it is of no use for the KPCA.

Fig. 7. The figure shows the corresponding artifact free P11 spectra after the denoising algorithms have been applied: (a) the LICA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP, and (b) the KPCA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP. The LICA algorithm was applied to all water components with M and K chosen with the MDL estimator (γ = 32) between 20 and 60 and between 20 and 80, respectively. The second graph shows the denoised spectrum obtained with a KPCA based algorithm using a Gaussian kernel.

For each of the sample matrices X^(m) the corresponding kernel matrix K was determined by

K_ij = k(x_i, x_j),   i, j = 1, . . . , 400            (22)

where x_i denotes the i-th column of X^(m). For the kernel function a Gaussian kernel was chosen,

k(x_i, x_j) = exp( − ‖x_i − x_j‖² / (2σ²) )            (23)


where the width parameter σ was determined from the data according to

2σ² = ( 1 / (400 · 399) ) ∑_{i,j=1}^{400} ‖x_i − x_j‖²            (24)

Fig. 8. The graph uncovers the differences of the LICA and KPCA denoising algorithms. As a reference the corresponding 1D slice of the original P11 spectrum is displayed on top. From top to bottom the three curves show: the difference of the original and the spectrum with the GEVD-MP algorithm applied, the difference between the original and the LICA denoised spectrum, and the difference between the original and the KPCA denoised spectrum. To compare the graphs in one diagram, the three curves are translated vertically by 2, 4 and 6, respectively.

Finally the kernel matrix K was expressed in terms of its EVD (equation 17), which leads to the expansion parameters α necessary to determine the principal axes of the corresponding feature space Ω^(m):

ω = ∑_{i=1}^{400} α_i Φ(x_i)            (25)

Similar to the original data, the noisy data of the reconstructed spectra were used to form six 400 × 1024 dimensional pattern matrices P^(m), m = 1, . . . , 6.

Fig. 9. 1D slice of a 2D NOESY spectrum of the polypeptide P11 in aqueous solution corresponding to the shortest evolution period t1: (a) original (noisy) spectrum, (b) reconstructed spectrum with the water artifact removed with the matrix pencil algorithm, and (c) result of the KPCA denoising of the reconstructed spectrum. The chemical shift ranges roughly from −1 ppm to 10 ppm. The insert shows the region of the spectrum between 10 and 9 ppm roughly; the upper trace corresponds to the denoised baseline and the lower trace shows the baseline of the original spectrum.

Then the principal components β_k of each column of P^(m) were calculated in the corresponding feature space Ω^(m). In order to denoise the patterns, only projections onto the first n = 112 principal axes were considered. This leads to

β_k = ∑_{i=1}^{400} α_i^k k(x_i, x),   k = 1, . . . , 112            (26)

where x is a column of P^(m).

After reconstructing the image P̂_n Φ(x) of the sample vector under the map Φ (equation 18), its approximate pre-image was determined by minimizing the cost function

ρ(z) = −2 ∑_{k=1}^{112} β_k ∑_{i=1}^{400} α_i^k k(x_i, z)            (27)

(the term k(z, z) of equation (20) is constant for the Gaussian kernel and can therefore be omitted).

Note that the method described above fails to denoise the region where the water resonance appears (data points 1001 to 1101), because there the samples formed from the original data differ too much from the noisy data. This is not a major drawback, as protein peaks totally hidden under the water artifact cannot be uncovered by the presented blind source separation method anyway. Figure 9(c) shows the resulting denoised protein spectrum on an identical vertical scale as figure 9(a) and figure 9(b). The insert compares the noise in a region of the spectrum between 10 and 9 ppm roughly, where no protein peaks are found. The upper trace shows the baseline of the denoised reconstructed protein spectrum and the lower trace the corresponding baseline of the original experimental spectrum before the water artifact has been separated out.

4.3.3 Denoising using Delayed AMUSE

LICA denoising of reconstructed protein spectra necessitates solving the BSS problem beforehand using any ICA algorithm. A much more elegant solution is provided by the recently proposed algorithm dAMUSE, which achieves BSS and denoising simultaneously. To test the performance of the algorithm, it was also applied to the 2D NOESY NMR spectra of the polypeptide P11. A 1D slice of the 2D NOESY spectrum of P11 corresponding to the shortest evolution period t1 is presented in figure 9(a), which shows a huge water artifact despite some pre-saturation on the water resonance. Figure 10 shows the reconstructed spectra obtained with the algorithms GEVD-MP and dAMUSE, respectively. The algorithm GEVD-MP yielded almost artifact-free spectra but with clear changes in the peak intensities in some areas of the spectra. In contrast, the reconstructed spectra obtained with the algorithm dAMUSE still contain some remnants of the water artifact, but the protein peak intensities remained unchanged and all baseline distortions have been cured. All


parameters of the algorithms are collected in table 2.

[Figure 10 panels:
(a) 1D slice of the NOESY spectrum of the protein P11 reconstructed with the algorithm GEVD-MP
(b) Corresponding protein spectrum reconstructed with the algorithm dAMUSE]

Fig. 10. Comparison of denoising of the P11 protein spectrum.

5 Conclusions

We proposed two new denoising techniques and also considered KPCA denoising, which are all based on the concept of embedding signals in delayed coordinates. We presented a detailed discussion of their properties and also discussed results obtained by applying them to illustrative toy examples. Furthermore, we compared all three algorithms by applying them to the real-world


problem of removing the water artifact from NMR spectra and denoising the resulting reconstructed spectra. Although all three algorithms achieved good results concerning the final SNR, in the case of the NMR spectra it turned out that KPCA seems to alter the spectral shapes while LICA and dAMUSE do not. At least with protein NMR spectra it is crucial that denoising algorithms do not alter integrated peak intensities in the spectra, as the latter form the basis for the structure elucidation process.

In the future we will have to further investigate the dependence of the proposed algorithms on the situation at hand. Thereby it will be crucial to identify data models for which each one of the proposed denoising techniques works best and to find good measures of how well such models suit the given data.

Table 2
Parameter values for the embedding dimension of the feature space of dAMUSE (M_dAMUSE), the number (K) of sampling intervals used per delay in the trajectory matrix, the number N_pc of principal components retained after the first step of the GEVD, and the half-width (σ) of the Gaussian filter used in the algorithms GEVD-MP and dAMUSE.

Parameter   N_IC   M_dAMUSE   N_pc   N_w(GEVD)
P11         256    3          148    49

Parameter   N_w(dAMUSE)   σ     SNR_GEVD-MP   SNR_dAMUSE
P11         46            0.3   18.6 dB       22.9 dB

6 Acknowledgements<br />

This research has been supported by the BMBF (project ModKog) and the<br />

DFG (GRK 638: Nonl<strong>in</strong>earity and Nonequilibrium <strong>in</strong> Condensed Matter). We<br />

are grateful to W. Gronwald and H. R. Kalbitzer for provid<strong>in</strong>g the NMR<br />

spectrum of P11 and helpful discussions.<br />

References<br />

[1] Adel Belouchrani, Karim Abed-Meraim, Jean-François Cardoso, and Eric<br />

Moulines. A blind source separation technique using second-order statistics.

IEEE Transactions on Signal Process<strong>in</strong>g, 45(2):434–444, 1997.<br />

[2] Andrzej Cichocki and Shun-Ichi Amari. Adaptive Bl<strong>in</strong>d Signal and Image<br />

Process<strong>in</strong>g. Wiley, 2002.<br />


[3] P. Comon. Independent component analysis - a new concept? Signal Processing,

36:287–314, 1994.<br />

[4] K. I. Diamantaras and S. Y. Kung. Pr<strong>in</strong>cipal <strong>Component</strong> Neural Networks,<br />

Theory and Applications. Wiley, 1996.<br />

[5] A. Effern, K. Lehnertz, T. Schreiber, T. Grunwald, P. David, and C. E.<br />

Elger. Nonl<strong>in</strong>ear denois<strong>in</strong>g of transient signals with application to event-related<br />

potentials. Physica D, 140:257–266, 2000.<br />

[6] E. Fishler and H. Messer. On the use of order statistics for improved detection<br />

of signals by the MDL criterion. IEEE Transactions on Signal Process<strong>in</strong>g,<br />

48:2242–2247, 2000.<br />

[7] R. Freeman. Sp<strong>in</strong> Choreography. Spektrum Academic Publishers, Oxford, 1997.<br />

[8] R. R. Gharieb and A. Cichocki. Second-order statistics based bl<strong>in</strong>d source<br />

separation us<strong>in</strong>g a bank of subband filters. Digital Signal Process<strong>in</strong>g, 13:252–<br />

274, 2003.<br />

[9] M. Ghil, M. R. Allen, M. D. Dett<strong>in</strong>ger, and K. Ide. Advanced spectral methods<br />

for climatic time series. Reviews of Geophysics, 40(1):1–41, 2002.<br />

[10] K. H. Hausser and H.-R. Kalbitzer. NMR <strong>in</strong> Medic<strong>in</strong>e and Biology. Berl<strong>in</strong>,<br />

1991.<br />

[11] J. Hérault and C. Jutten. Space or time adaptive signal process<strong>in</strong>g by neural<br />

network models. In J. S. Denker, editor, Neural Networks for Comput<strong>in</strong>g.<br />

Proceed<strong>in</strong>gs of the AIP Conference, pages 206–211, New York, 1986. American<br />

Institute of Physics.<br />

[12] Aapo Hyvär<strong>in</strong>en, Patrik Hoyer, and Erkki Oja. Intelligent Signal Process<strong>in</strong>g.<br />

IEEE Press, 2001.<br />

[13] A. Hyvär<strong>in</strong>en, J. Karhunen, and E. Oja. <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong>.<br />

2001.<br />

[14] A. Hyvär<strong>in</strong>en and E. Oja. A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent<br />

component analysis. Neural Computation, 9:1483–1492, 1997.<br />

[15] A. K. Ja<strong>in</strong> and R. C. Dubes. Algorithms for Cluster<strong>in</strong>g Data. Prentice Hall:<br />

New Jersey, 1988.<br />

[16] J. T. Kwok and I. W. Tsang. The pre-image problem <strong>in</strong> kernel methods. In<br />

Proceed. Int. Conf. Mach<strong>in</strong>e Learn<strong>in</strong>g (ICML03), 2003.<br />

[17] A. P. Liavas and P. A. Regalia. On the behavior of <strong>in</strong>formation theoretic criteria<br />

for model order selection. IEEE Transactions on Signal Process<strong>in</strong>g, 49:1689–<br />

1695, 2001.<br />

[18] Chor T<strong>in</strong> Ma, Zhi D<strong>in</strong>g, and Sze Fong Yau. A two-stage algorithm for MIMO<br />

bl<strong>in</strong>d deconvolution of nonstationary colored noise. IEEE Transactions on<br />

Signal Process<strong>in</strong>g, 48:1187–1192, 2000.<br />


[19] S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, and G. Rätsch. Kernel<br />

PCA and denois<strong>in</strong>g <strong>in</strong> feature spaces. Adv. Neural Information Process<strong>in</strong>g<br />

Systems, NIPS11, 11, 1999.<br />

[20] V. Moskv<strong>in</strong>a and K. M. Schmidt. Approximate projectors <strong>in</strong> s<strong>in</strong>gular spectrum<br />

analysis. SIAM Journal Mat. Anal. Appl., 24(4):932–942, 2003.<br />

[21] A. Nordhoff, Ch. Tziatzios, J. A. V. Broek, M. Schott, H.-R. Kalbitzer,<br />

K. Becker, D. Schubert, and R. H. Schirme. Denaturation and reactivation of<br />

dimeric human glutathione reductase. Eur. J. Biochem, pages 273–282, 1997.<br />

[22] L. Parra and P. Sajda. Blind source separation via generalized eigenvalue

decomposition. Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research, 4:1261–1269, 2003.<br />

[23] K. Pearson. On l<strong>in</strong>es and planes of closest fit to systems of po<strong>in</strong>ts <strong>in</strong> space.<br />

Philosophical Magaz<strong>in</strong>e, 2:559–572, 1901.<br />

[24] I. W. Sandberg and L. Xu. Uniform approximation of multidimensional myopic

maps. Transactions on Circuits and Systems, 44:477–485, 1997.<br />

[25] B. Schoelkopf, A. Smola, and K.-R. Mueller. Nonl<strong>in</strong>ear component analysis as<br />

a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.<br />

[26] K. Stadlthanner, E. W. Lang, A. M. Tomé, A. R. Teixeira, and C. G. Puntonet.<br />

Kernel-PCA denoising of artifact-free protein NMR spectra. Proc. IJCNN'2004,
Budapest, Hungary, 2004.

[27] K. Stadlthanner, F. J. Theis, E. W. Lang, A. M. Tomé, W. Gronwald, and H.-R.<br />

Kalbitzer. A matrix pencil approach to the bl<strong>in</strong>d source separation of artifacts<br />

<strong>in</strong> 2D NMR spectra. Neural Information Process<strong>in</strong>g - Letters and Reviews,<br />

1:103–110, 2003.<br />

[28] K. Stadlthanner, F. Theis, E. W. Lang, A. M. Tomé, A. R. Teixeira,<br />

W. Gronwald, and H.-R. Kalbitzer. GEVD-MP. Neurocomput<strong>in</strong>g accepted,<br />

2005.<br />

[29] K. Stadlthanner, A. M. Tomé, F. J. Theis, W. Gronwald, H.-R. Kalbitzer, and<br />

E. W. Lang. Bl<strong>in</strong>d source separation of water artifacts <strong>in</strong> NMR spectra us<strong>in</strong>g a<br />

matrix pencil. In Fourth International Symposium On <strong>Independent</strong> <strong>Component</strong><br />

<strong>Analysis</strong> and Bl<strong>in</strong>d Source Separation, ICA’2003, pages 167–172, Nara, Japan,<br />

2003.<br />

[30] F. Takens. On the numerical determ<strong>in</strong>ation of the dimension of an attractor.<br />

Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 898:366–

381, 1981.<br />

[31] F. J. Theis, A. Meyer-Bäse, and E. W. Lang. Second-order bl<strong>in</strong>d source<br />

separation based on multi-dimensional autocovariances. In Proc. ICA 2004,<br />

volume 3195 of Lecture Notes <strong>in</strong> Computer Science, pages 726–733, Granada,<br />

Spa<strong>in</strong>, 2004.<br />

[32] Ana Maria Tomé and Nuno Ferreira. On-l<strong>in</strong>e source separation of temporally<br />

correlated signals. In European Signal Process<strong>in</strong>g Conference, EUSIPCO2002,<br />

Toulouse, France, 2002.<br />


[33] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

A. P. Rocha, and R. Almeida. dAMUSE - A new tool for denoising and BSS.

Digital Signal Process<strong>in</strong>g, 2005.<br />

[34] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

and A. P. Rocha. Bl<strong>in</strong>d source separation us<strong>in</strong>g time-delayed signals. In<br />

International Jo<strong>in</strong>t Conference on Neural Networks, IJCNN’2004, volume CD,<br />

Budapest, Hungary, 2004.<br />

[35] Ana Maria Tomé. Bl<strong>in</strong>d source separation us<strong>in</strong>g a matrix pencil. In Int. Jo<strong>in</strong>t<br />

Conf. on Neural Networks, IJCNN’2000, Como, Italy, 2000.<br />

[36] Ana Maria Tomé. An iterative eigendecomposition approach to bl<strong>in</strong>d source<br />

separation. In 3rd Intern. Conf. on <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong> and Signal<br />

Separation, ICA'2001, pages 424–428, San Diego, USA, 2001.

[37] Lang Tong, Ruey wen Liu, Victor C. Soon, and Yih-Fang Huang. Indeterm<strong>in</strong>acy<br />

and identifiability of bl<strong>in</strong>d identification. IEEE Transactions on Circuits and<br />

Systems, 38(5):499–509, 1991.<br />

[38] Rolf Vetter, J. M. Ves<strong>in</strong>, Patrick Celka, Philippe Renevey, and Jens Krauss.<br />

Automatic nonl<strong>in</strong>ear noise reduction us<strong>in</strong>g local pr<strong>in</strong>cipal component analysis<br />

and MDL parameter selection. Proceed<strong>in</strong>gs of the IASTED International<br />

Conference on Signal Process<strong>in</strong>g Pattern Recognition and Applications (SPPRA<br />

02) Crete, pages 290–294, 2002.<br />

[39] Rolf Vetter. Extraction of efficient and characteristic features of<br />

multidimensional time series. PhD thesis, EPFL, Lausanne, 1999.<br />

[40] P. Vitányi and M. Li. Minimum description length induction, Bayesianism, and
Kolmogorov complexity. IEEE Transactions on Information Theory, 46:446–

464, 2000.<br />



Chapter 16<br />

Proc. ICA 2006, pages 917-925<br />

Paper F.J. Theis and M. Kawanabe. Uniqueness of non-gaussian subspace analysis.<br />

In Proc. ICA 2006, pages 917-925, Charleston, USA, 2006<br />

Reference (Theis and Kawanabe, 2006)<br />

Summary in section 1.5.2


Uniqueness of Non-Gaussian Subspace Analysis

Fabian J. Theis 1 and Motoaki Kawanabe 2

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
2 Fraunhofer FIRST.IDA, Kekuléstraße 7, 12439 Berlin, Germany
fabian@theis.name and nabe@first.fhg.de

Abstract. Dimension reduction provides an important tool for preprocessing large scale data sets. A possible model for dimension reduction is realized by projecting onto the non-Gaussian part of a given multivariate recording. We prove that the subspaces of such a projection are unique given that the Gaussian subspace is of maximal dimension. This result therefore guarantees that projection algorithms uniquely recover the underlying lower dimensional data signals.

An important open problem in signal processing is the task of efficient dimension reduction, i.e. the search for meaningful signals within a higher dimensional data set. Classical techniques such as principal component analysis hereby define ‘meaningful’ using second-order statistics (maximal variance), which may often be inadequate for signal detection, i.e. in the presence of strong noise. This contrasts with higher-order models including projection pursuit [1,2] or non-Gaussian subspace analysis (NGSA) [3,4]. While the former extracts a single non-Gaussian independent component from the data set, the latter tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made.

The goal of linear dimension reduction can be defined as the search of a projection W ∈ Mat(n×d) of a d-dimensional random vector X with n < d.



An intuitive notion of how to choose the reduced dimension n is to require that W_G X is maximally Gaussian, and hence W_N X non-Gaussian.

The dimension reduction problem itself can of course also be formulated within a generative model, which leads to the following linear mixing model

\[ X = A_N S_N + A_G S_G \qquad (1) \]

such that S_N and S_G are independent, and S_G Gaussian. Then (A_N, A_G)^{-1} = (W_N^⊤, W_G^⊤)^⊤. This model includes the general noisy ICA model X = A_N S_N + G, where G is Gaussian and S_N is also assumed to be mutually independent; the dimension reduction then means projection onto the signal subspace, which might be deteriorated by the noise G along the subspace — the components of G orthogonal to the subspace will be removed. However, (1) is more general in the sense that it does not assume mutual independence of S_N, only independence of S_N and S_G.

The paper is organized as follows: In the next section, we first discuss obvious<br />

<strong>in</strong>determ<strong>in</strong>acies of NGSA and possible regularizations. We then present our ma<strong>in</strong> result,<br />

theorem 1, and give an explicit proof <strong>in</strong> a special case. The general proof is divided<br />

up <strong>in</strong>to a series of lemmas, the proofs of which are omitted due to lack of space. In<br />

section 2, some simulations are performed to validate the uniqueness result. A practical<br />

algorithm for perform<strong>in</strong>g NGSA essentially us<strong>in</strong>g the idea of separated characteristic<br />

functions from the proof is presented <strong>in</strong> the co-paper [6].<br />

1 Uniqueness of NGSA-based dimension reduction<br />

This contribution aims at provid<strong>in</strong>g conditions such that the decomposition (1) is unique.<br />

More precisely, we will show under which conditions the non-Gaussian as well as the<br />

Gaussian subspace is unique.<br />

1.1 Indeterm<strong>in</strong>acies<br />

Clearly, the matrices A_N and A_G in the decomposition (1) cannot be unique — multiplication from the right using any invertible matrix leaves the model invariant: X = A_N S_N + A_G S_G = (A_N B_N)(B_N^{-1} S_N) + (A_G B_G)(B_G^{-1} S_G) with B_N ∈ Gl(n), B_G ∈ Gl(d−n), because B_N^{-1} S_N and B_G^{-1} S_G are again independent, and B_G^{-1} S_G Gaussian.

An additional indeterminacy comes into play due to the fact that we do not want to fix the reduced dimension in advance. Given a realization of the model (1) with d



1.2 Uniqueness theorem

Definition 1. X = AS with A ∈ Gl(d), S = (S_N, S_G) and S_N ∈ L²(Ω, R^n) is called an n-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be n-decomposable.

Hence an n-decomposition of X corresponds to the NGSA problem. If as before A = (A_N, A_G), then the n-dimensional subvectorspace im(A_N) ⊂ R^d is called the non-Gaussian subspace, and im(A_G) the Gaussian subspace of the decomposition; here im(A) denotes the image of the linear map A.

Definition 2. X is said to be minimally n-decomposable if X is not (n−1)-decomposable. Then dim_e(X) := n is called the essential dimension of X.

For example, the essential dimension dim_e(X) is zero if and only if X is Gaussian, whereas the essential dimension of a d-dimensional mutually independent Laplacian is d. The following theorem is the main theoretical contribution of this work. It essentially connects uniqueness of the dimension reduction model with minimality, and gives a simple characterization for it.

Theorem 1 (Uniqueness of NGSA). Let n



1.3 Proof of theorem 1

First note that the theorem holds trivially for n = 0, because in this case X is Gaussian. So in the following let 0 < n < d.



for all x_N ∈ R, because A is invertible.

Now a_NN ≠ 0, otherwise X_N = a_NG S_G, which contradicts (ii) for X. If also a_NG ≠ 0, then by equation (3), h″_N is constant and therefore S_N Gaussian, which again contradicts (ii), now for S. Hence a_NG = 0. By (3), a_GN a_GG = 0, and again a_GG ≠ 0, otherwise X_G = a_GN S_N, contradicting (ii) for S. Hence also a_GN = 0, as was to show.

General proof. In order to give an idea of the main proof without getting lost in details, we have divided it up into a sequence of lemmas; these will not be proven due to lack of space. The characteristic function of the random vector X is defined by X̂(x) := E(exp(i x^⊤ X)), and since X is assumed to have existing covariance, X̂ is twice continuously differentiable. Moreover, by definition \widehat{AS}(x) = Ŝ(A^⊤ x), and the characteristic function of an independent random vector factorizes into the component characteristic functions. So instead of using p_X as in the 2-dimensional example, we use X̂, having similar properties except for the fact that the range is now complex and that the differentiability condition can be considerably relaxed.

We will need the following lemma, which has essentially been shown in [9]; here ∇f denotes the gradient of f and H_f its Hessian.

Lemma 1. Let X ∈ L²(Ω, R^m) be a random vector. Then X is Gaussian with covariance 2C if and only if it satisfies X̂ H_{X̂} − ∇X̂ (∇X̂)^⊤ + C X̂² ≡ 0.

Note that we may assume that the covariance of S (and hence also of X) is positive definite — otherwise, while still keeping the model, we can simply remove the subspace of deterministic components (i.e. components of variance 0), which have to be mapped onto each other by A. Hence we may even assume Cov(S_G) = I, after whitening as described in section 1.1. This uses the fact that the basis within the Gaussian subspace is not unique. The same holds also for the non-Gaussian subspace, so we may choose any B_N ∈ Gl(n) and B_G ∈ O(d−n) to get

\[ X = \begin{pmatrix} A_{NN} B_N \\ A_{GN} B_N \end{pmatrix} (B_N^{-1} S_N) + \begin{pmatrix} A_{NG} B_G \\ A_{GG} B_G \end{pmatrix} (B_G^{\top} S_G). \qquad (4) \]

Here only orthogonal matrices B_G are allowed in order for B_G^⊤ S_G to stay decorrelated, with S_G being decorrelated.

The next lemma uses the dimension reduction model for X and S to derive an explicit differential equation for Ŝ_N. The Gaussian part Ŝ_G in the following lemma vanishes after application of lemma 1.

Lemma 2. For any basis B_N ∈ Gl(n), the non-Gaussian source characteristic function Ŝ_N ∈ C²(R^n, C) fulfills

\[ A_{NN} B_N \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) B_N^{\top} A_{GN}^{\top} + 2 A_{NG} A_{GG}^{\top} \hat{S}_N^2 \equiv 0. \qquad (5) \]

Lemma 3. Let (A_NN, A_NG) ∈ Mat(n×(n+(d−n))) be an arbitrary full-rank matrix. If rank A_NN < n, then we may choose coordinates B_N ∈ Gl(n), B_G ∈ O(d−n) and M ∈ Gl(n) such that for arbitrary matrices ∗ ∈ Mat((n−1)×(n−1)), ∗′ ∈ Mat((n−1)×(d−n−1)):

\[ M A_{NN} B_N = \begin{pmatrix} 0 & 0 \\ 0 & * \end{pmatrix} \quad \text{and} \quad M A_{NG} B_G = \begin{pmatrix} 1 & 0 \\ 0 & *' \end{pmatrix}. \]



The basis choice from lemma 3 together with assumption (ii) can be used to prove the following fact:

Lemma 4. The non-Gaussian transformation is invertible, i.e. A_NN ∈ Gl(n).

The next lemma can be seen as a modification of lemma 1, and indeed it can be shown similarly.

Lemma 5. If Ŝ_N fulfills ( Ŝ_N H_{Ŝ_N} − ∇Ŝ_N (∇Ŝ_N)^⊤ ) e_1 + Ŝ_N² c ≡ 0 for some constant vector c ∈ R^n, then the source component (S_N)_1 is Gaussian and independent of (S_N)(2:n).

Here more generally e_i ∈ R^n denotes the i-th unit vector. Putting these lemmas together, we can finally prove theorem 1: According to lemma 4, A_NN is invertible, so multiplying equation (5) from lemma 2 by B_N^{-1} A_NN^{-1} from the left yields

\[ \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) B_N^{\top} A_{GN}^{\top} + C \hat{S}_N^2 \equiv 0 \qquad (6) \]

for any B_N ∈ Gl(n) and some fixed, real matrix C ∈ Mat(n×(d−n)).

We claim that A_GN = 0. If not, then there exists v ∈ R^{d−n} with ‖A_GN^⊤ v‖ = 1. Choose B_N from (4) such that B_N^{-1} S_N is decorrelated. This is invariant under left-multiplication by an orthogonal matrix, so we may moreover assume that B_N^⊤ A_GN^⊤ v = e_1. Multiplying equation (6) in turn by v from the right therefore shows that the vector function

\[ \left( \hat{S}_N H_{\hat{S}_N} - \nabla \hat{S}_N (\nabla \hat{S}_N)^{\top} \right) e_1 + c \hat{S}_N^2 \equiv 0 \qquad (7) \]

is zero; here c := Cv ∈ R^n. This means that Ŝ_N fulfills the condition of lemma 5, which implies that (S_N)_1 is Gaussian and independent of the rest. But this contradicts (ii) for S, hence A_GN = 0. Plugging this result into equation (5), evaluation at s_N = 0 shows that A_NG A_GG^⊤ = 0. Since A_GN = 0 and A ∈ Gl(d), necessarily A_GG ∈ Gl(d−n), so A_NG = 0, as was to prove.

2 Simulations<br />

In this section, we will provide experimental validation of the uniqueness result of<br />

corollary 1. In order to stay unbiased and not test a s<strong>in</strong>gle algorithm, we have to uniformly<br />

search the parameter space for possibly equivalent model representations. The<br />

model assumptions (1) will not be perfectly fulfilled, so we <strong>in</strong>troduce a measure of<br />

model deviation based on 4-th order cumulants <strong>in</strong> the follow<strong>in</strong>g.<br />

Let the non-Gaussian dimension n and the total dimension d be fixed. Given a random vector X = (X_N, X_G), we can without loss of generality assume that Cov(X) = I. Any possible model deviation consists of (i) a deviation from the independence of X_N and X_G and (ii) a deviation from the Gaussianity of X_G. In the case of non-vanishing kurtoses, the former can be approximated for example by

\[ \delta_I(X) := \frac{1}{n(d-n)d^2} \sum_{i=1}^{n} \sum_{j=n+1}^{d} \sum_{k=1}^{d} \sum_{l=1}^{d} \mathrm{cum}^2(X_i, X_j, X_k, X_l), \]

where the fourth-order cumulant tensor is defined as cum(X_i, X_j, X_k, X_l) := E(X_i X_j X_k X_l) − E(X_i X_j)E(X_k X_l) − E(X_i X_k)E(X_j X_l) − E(X_i X_l)E(X_j X_k). The deviation (ii) from Gaussianity of X_G can simply be measured by kurtosis, which in the case of white X means

\[ \delta_G(X) := \frac{1}{d-n} \sum_{j=n+1}^{d} \left| E(X_j^4) - 3 \right|. \]

Altogether, we can therefore define a total model deviation as the weighted sum of the above indices; the weight in the following was chosen experimentally to approximately yield even contributions of the two measures:

\[ \delta(X) = 10\, n(d-n)\, \delta_I(X) + \delta_G(X). \]
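As a concrete illustration of these measures, the following numpy sketch estimates δ_I, δ_G and the total deviation δ from whitened samples; the brute-force cumulant estimator and the variable names are illustrative assumptions rather than the code used for the experiments.

```python
import numpy as np

def cum4(X, i, j, k, l):
    """Sample estimate of the fourth-order cumulant cum(X_i, X_j, X_k, X_l)
    for (approximately) zero-mean data X of shape (T, d)."""
    xi, xj, xk, xl = X[:, i], X[:, j], X[:, k], X[:, l]
    E = lambda a: a.mean()
    return (E(xi * xj * xk * xl) - E(xi * xj) * E(xk * xl)
            - E(xi * xk) * E(xj * xl) - E(xi * xl) * E(xj * xk))

def model_deviation(X, n):
    """Total model deviation delta(X) = 10 n (d-n) delta_I(X) + delta_G(X)
    for whitened samples X of shape (T, d) and non-Gaussian dimension n."""
    T, d = X.shape
    X = X - X.mean(axis=0)
    delta_I = np.mean([cum4(X, i, j, k, l) ** 2
                       for i in range(n) for j in range(n, d)
                       for k in range(d) for l in range(d)])
    delta_G = np.mean([abs((X[:, j] ** 4).mean() - 3.0) for j in range(n, d)])
    return 10 * n * (d - n) * delta_I + delta_G
```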

For numerical tests, we generate two different non-Gaussian source data sets, see figure 1(d) and also [4], figure 1. The first source set (I) is an n-dimensional dependent sub-Gaussian random vector given by an isotropic uniform density within the unit disc, and source set (II) a 2-dimensional dependent super- and sub-Gaussian, given by p(s₁, s₂) ∝ exp(−|s₁|) 1_{[c(s₁), c(s₁)+1]}, where c(s₁) = 0 if |s₁| ≤ ln 2 and c(s₁) = −1 otherwise. Normalization was chosen to guarantee Cov(S_N) = I in advance.

In order to test for model violations, we have to find two representations X = AS and X = A′S′ of the same mixtures. After multiplication by A^{-1} we may as before assume that a single representation X = AS is given with X and S both fulfilling the dimension reduction model (1), and we have to show that A_NG = A_GN = 0 if the decomposition is minimal (corollary 1). The latter can be tested numerically by using the so-called normalized crosserror

\[ E(A) := \frac{1}{2n(d-n)} \left( \|A_{NG}\|_F^2 + \|A_{GN}\|_F^2 \right), \]

where ‖·‖_F is some matrix norm, in our case the Frobenius norm.

In order to reduce the d²-dimensional search space, after whitening we may assume that A ∈ O(d), so only d(d−1)/2 dimensions have to be searched. O(d) can be uniformly sampled for example by choosing B with Gaussian i.i.d. coefficients and orthogonalizing A := (BB^⊤)^{-1/2} B. We perform 10^4 Monte-Carlo runs with random A ∈ O(d). Sources have been generated with T = 10^4 samples, n-dimensional non-Gaussian part (I) and (II) from above, and (d−n)-dimensional i.i.d. Gaussians. We measure the model deviation δ(AS) and compare it with the deviation E(A) from block-diagonality.
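A minimal sketch of one Monte-Carlo step is given below: it draws A uniformly from O(d) via the symmetric orthogonalization A = (BB^⊤)^{-1/2}B and evaluates the crosserror E(A); it reuses model_deviation from the previous sketch, and the helper names are assumptions for illustration.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Uniform sample from O(d): A = (B B^T)^(-1/2) B with Gaussian i.i.d. B."""
    B = rng.standard_normal((d, d))
    w, V = np.linalg.eigh(B @ B.T)                  # B B^T is symmetric positive definite
    return (V @ np.diag(1.0 / np.sqrt(w)) @ V.T) @ B

def crosserror(A, n):
    """Normalized crosserror E(A): deviation of A from block-diagonal structure."""
    d = A.shape[0]
    A_NG, A_GN = A[:n, n:], A[n:, :n]
    return (np.linalg.norm(A_NG, 'fro') ** 2
            + np.linalg.norm(A_GN, 'fro') ** 2) / (2 * n * (d - n))

# one Monte-Carlo run: S has shape (T, d), non-Gaussian part in the first n columns
# A = random_orthogonal(d, np.random.default_rng(0))
# point = (crosserror(A, n), model_deviation(S @ A.T, n))
```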

The results for vary<strong>in</strong>g parameters are given <strong>in</strong> figure 1(a-c). In all three cases we<br />

observe that the smaller the model deviation, the smaller also the crosserror. This gives<br />

an asymptotic confirmation of corollary 1, <strong>in</strong>dicat<strong>in</strong>g that by random sampl<strong>in</strong>g no nonuniqueness<br />

realizations have been found.<br />

3 Conclusion

By minimality of the decomposition (1), we gave a necessary condition for the uniqueness of non-Gaussian subspace analysis. Together with the assumption of existing covariance, this was already sufficient to guarantee model uniqueness. Our result allows NGSA algorithms to find the unknown, unique signal space within a noisy high-dimensional data set [6].



[Figure 1 panels (a)–(c): scatter plots of total model deviation δ(AS) versus crosserror E(A); (a) n=2, d=5, source (I); (b) n=3, d=5, source (I); (c) n=2, d=4, source (II). Panel (d): Laplacian & uniform source.]

Fig. 1. (a–c): total model deviation δ(AS) of the transformed sources versus crosserror E(A) of the mixing matrix for 10^4 Monte-Carlo runs. The circle ◦ indicates the actual source model deviation (non-zero due to finite sample sizes). (d): 2-dimensional dependent sub-Gaussian source (II).

References<br />

1. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE<br />

Trans. on Computers 23 (1975) 881–890<br />

2. Hyvär<strong>in</strong>en, A., Karhunen, J., Oja, E.: <strong>Independent</strong> component analysis. John Wiley & Sons<br />

(2001)<br />

3. Blanchard, G., Kawanabe, M., Sugiyama, M., Spoko<strong>in</strong>y, V., Müller, K.R.: In search of nongaussian<br />

components of a high-dimensional distribution. JMLR (2005) In revision. The<br />

prepr<strong>in</strong>t is available at http://www.cs.titech.ac.jp/ tr/reports/2005/TR05-0003.pdf.<br />

4. Kawanabe, M.: L<strong>in</strong>ear dimension reduction based on the fourth-order cumulant tensor. In:<br />

Proc. ICANN 2005. Volume 3697 of LNCS., Warsaw, Poland, Spr<strong>in</strong>ger (2005) 151–156<br />

5. Theis, F.: Uniqueness of complex and multidimensional <strong>in</strong>dependent component analysis.<br />

Signal Process<strong>in</strong>g 84 (2004) 951–956<br />

6. Kawanabe, M., Theis, F.: Extract<strong>in</strong>g non-gaussian subspaces by characteristic functions. In:<br />

submitted to ICA 2006. (2006)<br />

7. Comon, P.: <strong>Independent</strong> component analysis - a new concept? Signal Process<strong>in</strong>g 36 (1994)<br />

287–314<br />

8. Theis, F.: A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation. Neural Computation<br />

16 (2004) 1827–1850<br />

9. Theis, F.: Multidimensional <strong>in</strong>dependent component analysis us<strong>in</strong>g characteristic functions.<br />

In: Proc. EUSIPCO 2005, Antalya, Turkey (2005)




Chapter 17<br />

IEEE SPL 13(2):96-99, 2006<br />

Paper F.J. Theis, C.G. Puntonet, and E.W. Lang. Median-based cluster<strong>in</strong>g for<br />

underdeterm<strong>in</strong>ed bl<strong>in</strong>d signal process<strong>in</strong>g. IEEE Signal Process<strong>in</strong>g Letters,<br />

13(2):96-99, 2006<br />

Reference (Theis et al., 2006)<br />

Summary in section 1.5.3


Median-based clustering for underdetermined blind signal processing

Fabian J. Theis, Member, IEEE, Carlos G. Puntonet, Member, IEEE, Elmar W. Lang

Abstract— In underdetermined blind source separation, more sources are to be extracted from fewer observed mixtures without knowing both the sources and the mixing matrix. k-means-style clustering algorithms are commonly used to do this algorithmically given sufficiently sparse sources, but in any case other than deterministic sources this lacks theoretical justification. After establishing that mean-based algorithms converge to wrong solutions in practice, we propose a median-based clustering scheme. Theoretical justification as well as algorithmic realizations (both online and batch) are given and illustrated by some examples.

[Figure 1 panels: (a) mixture density in the (x1, x2)-plane with mixing angle α and receptive field F(α); (b) estimation error Δ(α) of the mean-based versus the median-based estimate as a function of α.]

EDICS Category: SAS-ICAB<br />

BLIND source separation (BSS), ma<strong>in</strong>ly based on the assumption<br />

of <strong>in</strong>dependent sources, is currently the topic of<br />

many researchers [1], [2]. Given an observed m-dimensional<br />

mixture random vector x, which allows an unknown decomposition<br />

x = As, the goal is to identify the mix<strong>in</strong>g matrix<br />

A and the unknown n-dimensional source random vector s.<br />

Commonly, first A is identified, and only then are the sources<br />

recovered. We will therefore denote the former task by bl<strong>in</strong>d<br />

mix<strong>in</strong>g model recovery (BMMR), and the latter (with known<br />

A) by bl<strong>in</strong>d source recovery (BSR).<br />

In the difficult case of underdeterm<strong>in</strong>ed or overcomplete<br />

BSS, where less mixtures than sources are observed (m < n),<br />

BSR is non-trivial, see section II. However, our ma<strong>in</strong> focus<br />

lies on the usually more elaborate matrix recovery. Assum<strong>in</strong>g<br />

statistically <strong>in</strong>dependent sources with exist<strong>in</strong>g variance and<br />

at most one Gaussian component, it is well-known that A<br />

is determ<strong>in</strong>ed uniquely by x [3]. However, how to do this<br />

algorithmically is far from obvious, and although quite a few<br />

algorithms have been proposed recently [4]–[6], performance<br />

is yet limited. The most commonly used overcomplete algorithms<br />

rely on sparse sources (after possible sparsification by<br />

preprocess<strong>in</strong>g), which can be identified by cluster<strong>in</strong>g, usually<br />

by k-means or some extension [5], [6]. But apart from the fact<br />

that theoretical justifications have not been found, mean-based<br />

cluster<strong>in</strong>g only identifies the correct A if the data density<br />

approaches a delta distribution. In figure 1, we illustrate the<br />

deficiency of mean-based cluster<strong>in</strong>g; we get an error of up to<br />

5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover the figure indicates that median-based clustering performs much better.

(a) circle histogram for α = 0.4   (b) comparison of mean and median

Fig. 1. Mean- versus median-based clustering. We consider the mixture x of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2−α) respectively. (a) shows the mixture density for α = 0.4 after projection onto the circle. For α ∈ [0, π/4), (b) compares the error when estimating A by the mean and the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.

Manuscript received xxx; revised xxx. Some preliminary results were reported at the conferences ESANN 2002, SIP 2002 and ICA 2003. FT and EL are with the Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany (phone: +49 941 943 2924, fax: +49 941 943 2479, e-mail: fabian@theis.name), and CP is with the Dep. Arquitectura y Tecnología de Computadores, Universidad de Granada, 18071 Granada, Spain.

Indeed, mean-based clustering does not

possess any equivariance property (performance <strong>in</strong>dependent<br />

of A). In the follow<strong>in</strong>g we propose a novel median-based<br />

cluster<strong>in</strong>g method and prove its equivariance (lemma 1.2) and<br />

convergence. For brevity, the proofs are given for the case of<br />

arbitrary n, but m = 2, although they can be readily extended<br />

to higher sensor signal dimensions. Correspond<strong>in</strong>g algorithms<br />

are proposed and experimentally validated.<br />

I. GEOMETRIC MATRIX RECOVERY<br />

Without loss of generality we assume that A has pairwise<br />

l<strong>in</strong>early <strong>in</strong>dependent columns, and m ≤ n. BMMR tries to<br />

identify A <strong>in</strong> x = As given x, where s is assumed to be<br />

statistically <strong>in</strong>dependent. Obviously, this can only be done up<br />

to equivalence [3], where B is said to be equivalent to A,<br />

B ∼ A, if B can be written as B = APL with an <strong>in</strong>vertible<br />

diagonal matrix L (scal<strong>in</strong>g matrix) and an <strong>in</strong>vertible matrix P<br />

with unit vectors <strong>in</strong> each row (permutation matrix). Hence we<br />

may assume the columns ai of A to have unit norm.<br />

For geometric matrix-recovery, we use a generalization [7]<br />

of the geometric ICA algorithm [8]. Let s be an <strong>in</strong>dependent<br />

n-dimensional, Lebesgue-cont<strong>in</strong>uous, random vector with<br />

density ps describ<strong>in</strong>g the sources. As s is <strong>in</strong>dependent, ps<br />

factorizes <strong>in</strong>to ps(s1, . . . , sn) = ps1(s1) · · · psn(sn) with the<br />

one-dimensional marg<strong>in</strong>al source density functions psi. We<br />

assume symmetric sources, i.e. psi(s) = psi(−s) for s ∈ R<br />

and i ∈ [1 : n] := {1, . . .,n}, <strong>in</strong> particular E(s) = 0.<br />

The geometric BMMR algorithm for symmetric distributions<br />

goes as follows [7]: Pick 2n start<strong>in</strong>g vectors<br />

w1,w ′ 1, . . . ,wn,w ′ n on the unit sphere Sm−1 ⊂ Rm such<br />

that the wi are pairwise l<strong>in</strong>early <strong>in</strong>dependent and wi = −w ′ i .<br />

Often, these wi are called neurons because they resemble the<br />

neurons used <strong>in</strong> cluster<strong>in</strong>g algorithms and <strong>in</strong> Kohonen’s selforganiz<strong>in</strong>g<br />

maps. Furthermore fix a learn<strong>in</strong>g rate η : N → R.<br />




The usual hypothesis in competitive learning is η(t) > 0, Σ_{t∈N} η(t) = ∞ and Σ_{t∈N} η(t)² < ∞. Then iterate the following steps until an appropriate abort condition has been met: Choose a sample x(t) ∈ R^m according to the distribution of x. If x(t) = 0 pick a new one — note that this case happens with probability zero. Project x(t) onto the unit sphere and get y(t) := π(x(t)), where π(x) := x/|x| ∈ S^{m−1}. Let i(t) ∈ [1 : n] such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to y(t). Then set w_{i(t)}(t+1) := π(w_{i(t)}(t) + η(t) π(σ y(t) − w_{i(t)}(t))) and w′_{i(t)}(t+1) := −w_{i(t)}(t+1), where σ := 1 if w_{i(t)}(t) is closest to y(t), σ := −1 otherwise. All other neurons are not moved in this iteration. This update rule equals online k-means on S^{m−1} except for the fact that the winner neuron is not moved proportionally to the sample but only in its direction due to the normalization. We will see that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median.
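A compact numpy sketch of this online update rule is given below; each antipodal neuron pair is represented by a single vector, and the initialization and the learning-rate schedule are illustrative assumptions.

```python
import numpy as np

def geometric_bmmr_online(X, n, n_iter=100000, seed=0):
    """Online geometric matrix recovery for mixtures X of shape (T, m): n
    antipodal neuron pairs on the unit sphere are moved by a fixed-length
    step towards the winning sample, so each neuron converges to the median
    (not the mean) of the projected density in its receptive field."""
    rng = np.random.default_rng(seed)
    T, m = X.shape
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # neurons w_i; w'_i = -w_i implicitly
    for t in range(n_iter):
        x = X[rng.integers(T)]
        if np.allclose(x, 0):
            continue
        y = x / np.linalg.norm(x)                   # projection onto the unit sphere
        dots = W @ y
        i = int(np.argmax(np.abs(dots)))            # winning pair
        sigma = 1.0 if dots[i] >= 0 else -1.0       # whether w_i or -w_i is closer
        eta = 1.0 / (1.0 + 0.01 * t)                # sum eta = inf, sum eta^2 < inf
        step = sigma * y - W[i]
        norm = np.linalg.norm(step)
        if norm > 0:
            W[i] = W[i] + eta * step / norm
            W[i] /= np.linalg.norm(W[i])            # renormalize onto the sphere
    return W.T                                      # columns estimate A up to sign and scale
```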

A. Model verification<br />

In this section we first calculate the densities of the random<br />

variables of our cluster<strong>in</strong>g problem. Then we will prove an<br />

asymptotic convergence result. For the theoretical analysis, we<br />

will restrict ourselves to m = 2 mixtures for simplicity. As<br />

above, let x denote the sensor signal vector and A the mix<strong>in</strong>g<br />

matrix such that x = As. We may assume that A has columns<br />

ai = (cosαi, s<strong>in</strong>αi) ⊤ with 0 ≤ α1 < . . . < αn < π.<br />

1) Neural update rule on the sphere: Due to the symmetry of s we can identify the two neurons w_i and w′_i. For this let ι(ϕ) := (ϕ mod π) ∈ [0, π) identify all angles modulo π, and set θ(v) := ι(arctan(v₂/v₁)); then θ(w_i) = θ(w′_i) and θ(a_i) = α_i. We are interested in the essentially one-dimensional projected sensor signal random vector π(x), so using θ we may approximate y := θ(π(x)) ∈ [0, π), measuring the argument of x. Note that the density p_y of y can be calculated from p_x by

\[ p_y(\varphi) = \int_{-\infty}^{\infty} p_x(r\cos\varphi,\, r\sin\varphi)\, |r| \, dr. \]

Now let the (n × n)-matrix B be defined by

\[ B := \begin{pmatrix} A \\ 0 \;\; I_{n-2} \end{pmatrix}, \]

where I_{n−2} is the (n−2)-dimensional identity matrix; so B is invertible. The random vector Bs has the density p_{Bs} = |det B|^{-1} p_s ∘ B^{-1}. A equals B followed by the projection from R^n onto the first two coordinates, hence

\[ p_y(\varphi) = \frac{2}{|\det B|} \int_0^{\infty} r \, dr \int_{\mathbb{R}^{n-2}} dx \; p_s\!\left(B^{-1}(r\cos\varphi,\, r\sin\varphi,\, x)^{\top}\right) \qquad (1) \]

for any ϕ ∈ [0, π), where we have used the symmetry of p_s.

The geometric learning algorithm induces the following n-dimensional Markov chain ω(t), defined recursively by a starting point ω(0) ∈ R^n and the iteration rule ω(t+1) = ι^n(ω(t) + η(t) ζ(y(t)e − ω(t))), where e = (1, . . . , 1)^⊤ ∈ R^n and ζ(x₁, . . . , x_n) ∈ R^n such that

\[ \zeta_i(x_1, \ldots, x_n) = \begin{cases} \operatorname{sgn}(x_i) & \text{if } |x_i| \le |x_j| \text{ for all } j, \\ 0 & \text{otherwise}, \end{cases} \]

and y(0), y(1), . . . is a sequence of i.i.d. random variables with density p_y representing the samples in each online iteration. Note that the ‘modulo π’ map ι is only needed to guarantee that each component of ω(t+1) lies in [0, π).

Furthermore, we can assume that after enough iterations<br />

there is one po<strong>in</strong>t v ∈ S 1 that will not be traversed any<br />

more, and without loss of generality we assume θ(v) = 0<br />

so that the above algorithm simplifies to the planar case with<br />

the recursion rule ω(t + 1) = ω(t) + η(t)ζ(y(t)e − ω(t)).<br />

This is k-means-type learn<strong>in</strong>g with an additional sign function.<br />

Without the sign function and the additional condition that<br />

py is log-concave, it has been shown [9] that the process<br />

ω(t) converges to a unique constant process ω(∞) ∈ R n<br />

such that ωi(∞) = E(py|[β(i), β ′ (i)]), where F(ωi(∞)) :=<br />

{ϕ ∈ [0, π) | ι(|ϕ − ωi(∞)|) ≤ ι(|ϕ − ωj(∞)|) for all j �= i}<br />

denotes the receptive field of the neuron ωi(∞) and β(i), β ′ (i)<br />

designate the receptive field borders. This is precisely the kmeans<br />

convergence condition illustrated <strong>in</strong> figure 1.<br />

2) Limit po<strong>in</strong>ts analysis: We now want to study the limit<br />

po<strong>in</strong>ts of geometric matrix-recovery, so we assume that the<br />

algorithm has already converged. The idea, generaliz<strong>in</strong>g our<br />

analysis <strong>in</strong> the complete case [7] then is to formulate a<br />

condition which the limit po<strong>in</strong>ts will have to satisfy and to<br />

show that the BMMR solutions are among them.<br />

The angles γ1, . . .,γn ∈ [0, π) are said to satisfy the geometric<br />

convergence condition (GCC) if they are the medians<br />

of y restricted to their receptive fields i.e. if γi is the median<br />

of py|F(γi). Moreover, a constant random vector ˆω ∈ R n is<br />

called fixed po<strong>in</strong>t if E(ζ(ye− ˆω)) = 0. Hence, the expectation<br />

of a Markov process ω(t) start<strong>in</strong>g at a fixed po<strong>in</strong>t will<br />

<strong>in</strong>deed not be changed by the geometric update rule because<br />

E(ω(t+1)) = E(ω(t))+η(t)E(ζ(y(t)e−ω(t))) = E(ω(t)).<br />

Lemma 1.1: Assume that the geometric algorithm converges<br />

to a constant random vector ω(∞). Then ω(∞) is a<br />

fixed po<strong>in</strong>t if and only if the ωi(∞) satisfy the GCC.<br />

Proof: Assume ω(∞) is a fixed point of geometric ICA in the expectation. Without loss of generality, let [β, β′] be the receptive field of ω_i(∞) such that β, β′ ∈ [0, π). Since ω(∞) is a fixed point of geometric ICA in the expectation, E( χ_{[β,β′]}(y(t)) sgn(y(t) − ω_i(∞)) ) = 0, where χ_{[β,β′]} denotes the characteristic function of that interval. But this means ∫_β^{ω_i(∞)} (−1) p_y(ϕ) dϕ + ∫_{ω_i(∞)}^{β′} 1 · p_y(ϕ) dϕ = 0, so ω_i(∞) satisfies GCC. The other direction follows similarly.

Lemma 1.2: The angles α_i = θ(a_i) satisfy the GCC.

Proof: It is enough to show that α₁ satisfies GCC. Let β := (α₁ + α_n − π)/2 and β′ := (α₁ + α₂)/2. Then the receptive field of α₁ can be written (modulo π) as F(α₁) = [β, β′]. Therefore, we have to show that α₁ is the median of p_y|F(α₁), which means ∫_β^{α₁} p_y(ϕ) dϕ = ∫_{α₁}^{β′} p_y(ϕ) dϕ. Using equation (1), the left-hand side can be expanded as ∫_β^{α₁} p_y(ϕ) dϕ = 2 |det B|^{-1} ∫_{K′} dx p_s(B^{-1} x), where K := θ^{-1}[β, α₁] denotes the cone of opening angle α₁ − β starting from angle β, and K′ = K × R^{n−2}. This implies ∫_β^{α₁} p_y(ϕ) dϕ = 2 ∫_{B^{-1}(K′)} ds p_s(s). Now note that the transformed extended cone B^{-1}(K′) is a cone ending at the s₁-axis of opening angle π/4, times R^{n−2}, because A is linear. Hence ∫_β^{α₁} p_y(ϕ) dϕ = 2 ∫_0^∞ ds₁ ∫_0^{s₁} ds₂ ∫_{R^{n−2}} ds₃ · · · ds_n p_s(s) = ∫_{α₁}^{β′} p_y(ϕ) dϕ, where we have used the same calculation for [α₁, β′] as for [β, α₁] at the last step.



[Figure 2: histogram over the mixture angle in degrees (0–180) with counts up to 4000; the peaks lie at the mixing angles 2°, 74°, 104° and 135°.]

Fig. 2. Approximated probability density function of a mixture of four speech signals using the mixing angles α_i = 2°, 74°, 104°, 135°. Plotted is the approximated density using a histogram with 180 bins; the thick line indicates the density after smoothing with a 5-degree-radius polynomial kernel.

In the proof we show that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1. Set ξ(ω) := (cos ω, sin ω)^⊤; then θ(ξ(ω)) = ω. Combining both lemmata, we have therefore shown:

Theorem 1.3: The set Φ of fixed po<strong>in</strong>ts of geometric<br />

matrix-recovery conta<strong>in</strong>s an element (ˆω1, . . . , ˆωn) such that<br />

the matrix (ξ(ˆω1)...ξ(ˆωn)) solves the overcomplete BMMR.<br />

The stable fixed po<strong>in</strong>ts <strong>in</strong> the above set Φ can be found by<br />

the geometric matrix-recovery algorithm. Furthermore, experiments<br />

confirm that <strong>in</strong> the special case of unimodal, symmetric<br />

and non-Gaussian signals, the set Φ consists of only two<br />

elements: a stable and an unstable fixed po<strong>in</strong>t, where the stable<br />

fixed po<strong>in</strong>t will be found by the algorithm. Then, depend<strong>in</strong>g<br />

on the kurtosis of the sources, either the stable (supergaussian<br />

case) or the <strong>in</strong>stable (subgaussian case) fixed po<strong>in</strong>t represents<br />

the image of the unit vectors. We have partially shown this <strong>in</strong><br />

the complete case, see [7], theorem 5:<br />

Theorem 1.4: If n = 2 and ps1 = ps2, then Φ conta<strong>in</strong>s only<br />

two elements given that py|[0, π) has exactly 4 local extrema.<br />

More elaborate studies of Φ are necessary to show full<br />

convergence, however the mathematics can be expected to be<br />

difficult. This can already be seen from the complicated and<br />

<strong>in</strong> higher dimensions yet unknown proofs of convergence of<br />

the related self-organiz<strong>in</strong>g-map algorithm [9].<br />

B. Turning the online algorithm into a batch algorithm

The above theory can be used to derive a batch-type learning algorithm, by testing all different receptive fields for the overcomplete GCC after histogram-based estimation of y. For simplicity let us assume that the cumulative distribution P_y of y is invertible. For ϕ = (ϕ₁, . . . , ϕ_{n−1}), define a function µ(ϕ) := ((γ₁(ϕ) + γ₂(ϕ))/2 − ϕ₁, . . . , (γ_n(ϕ) + γ₁(ϕ))/2 − ϕ_n), where γ_i(ϕ) := P_y^{-1}((P_y(ϕ_i) + P_y(ϕ_{i+1}))/2) is the median of y|[ϕ_i, ϕ_{i+1}] in [ϕ_i, ϕ_{i+1}] for i ∈ [1 : n] and ϕ_n := (ϕ_{n−1} + ϕ₁ + π)/2, ϕ_{n+1} := ϕ₁.

Lemma 1.5: If µ(ϕ) = 0 then the γ_i(ϕ) satisfy the GCC.

Proof: By definition, the receptive field of γ_i(ϕ) is given by [(γ_{i−1}(ϕ) + γ_i(ϕ))/2, (γ_i(ϕ) + γ_{i+1}(ϕ))/2]. Since µ(ϕ) = 0 implies (γ_i(ϕ) + γ_{i+1}(ϕ))/2 = ϕ_i, the receptive field of γ_i(ϕ) is precisely [ϕ_i, ϕ_{i+1}], and by construction γ_i(ϕ) is the median of y restricted to the above interval.

Algorithm (overcomplete FastGeo): Find the zeros of µ(ϕ).

Algorithmically, we may simply estimate y using a histogram and search for the zeros exhaustively or by discrete gradient descent. Note that for m = 2 this is precisely the complete FastGeo algorithm [7]. Again µ always has at least two zeros, representing the stable and the unstable fixed point of the neural algorithm, so for supergaussian sources we extract the stable fixed point by maximizing ∏_{i=1}^{n} p_y(γ_i(ϕ)). Similar to the complete case, histogram-based density approximation results in a quite ‘ragged’ distribution. Hence, zeros of µ are split up into multiple close-by zeros. This can be improved by smoothing the distribution using a kernel with sufficiently small radius. Too large kernel radii result in lower accuracy because the calculation of the median is only roughly independent of the kernel radius. In figure 2 we use a polynomial kernel of radius 5 degrees (zero everywhere else); one can see that indeed this smoothes the distribution nicely.
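The histogram-based machinery behind FastGeo can be sketched as follows for m = 2 mixtures: estimate the angular density of the projected mixtures, smooth it with a small polynomial kernel, and compute medians of y restricted to candidate receptive fields from the empirical cumulative distribution. The bin count, kernel shape and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def angle_density(X, n_bins=180, radius=5):
    """Smoothed histogram estimate of the density of y = theta(pi(x)) in [0, pi)."""
    theta = np.arctan2(X[:, 1], X[:, 0]) % np.pi
    hist, edges = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
    k = 1.0 - (np.arange(-radius, radius + 1) / (radius + 1.0)) ** 2   # polynomial kernel
    k /= k.sum()
    padded = np.r_[hist[-radius:], hist, hist[:radius]]                # circular padding
    smooth = np.convolve(padded, k, mode='same')[radius:-radius]
    return smooth / (smooth.sum() * (edges[1] - edges[0])), edges

def interval_median(density, edges, a, b):
    """Median of y restricted to [a, b], i.e. gamma_i, via the inverse empirical CDF."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = (centers >= a) & (centers < b)
    if not mask.any():
        return 0.5 * (a + b)
    cdf = np.cumsum(density[mask])
    return centers[mask][np.searchsorted(cdf, 0.5 * cdf[-1])]
```

The zeros of µ can then be searched over candidate boundary vectors ϕ by repeatedly evaluating interval_median on the receptive fields they define.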

C. Cluster<strong>in</strong>g <strong>in</strong> higher mixture dimensions<br />

We now extend any BSS algorithm work<strong>in</strong>g <strong>in</strong> lower mix<strong>in</strong>g<br />

dimension m ′ < m to dimension m us<strong>in</strong>g the simple idea of<br />

project<strong>in</strong>g the mixtures x onto different subspaces and then<br />

estimat<strong>in</strong>g A from the recovered projected matrices. We elim<strong>in</strong>ate<br />

scal<strong>in</strong>g <strong>in</strong>determ<strong>in</strong>acies by normalization and permutation<br />

<strong>in</strong>determ<strong>in</strong>acies by compar<strong>in</strong>g the source correlation matrices.<br />

1) Equivalence after projections: Let m ′ ∈ N with 1 <<br />

m ′ < m, and let M denote the set of all subsets of [1 : m]<br />

of size m ′ . For an element τ ∈ M let τ = {τ1, . . . , τm ′}<br />

such that 1 ≤ τ1 < . . . < τm ′ ≤ m and let πτ denote the<br />

ordered projection from R m onto those coord<strong>in</strong>ates. Consider<br />

the projected mix<strong>in</strong>g matrix Aτ := πτA. We will study<br />

how scal<strong>in</strong>g-equivalence behaves under projection, where two<br />

matrices A,B are said to be scal<strong>in</strong>g equivalent, A ∼s B if<br />

there exists an <strong>in</strong>vertible diagonal matrix L with A = BL.<br />

Lemma 1.6: Let τ^1, . . . , τ^k ∈ M such that ∪_i τ^i = [1 : m] and j ∈ ∩_i τ^i. If A_{τ^i} ∼s B_{τ^i} for all i and a_{jl} ≠ 0 for all l, then A ∼s B.
Proof: Fix a column l ∈ [1 : n], and let a := a_l, b := b_l. By assumption, for each i ∈ [1 : k] there exists λ_i ≠ 0 such that b_{τ^i} = λ_i a_{τ^i}. Index j occurs in all projections, so b_j = λ_i a_j for all i. Hence all λ_i =: λ coincide and b = λa.

This lemma gives the general idea how to comb<strong>in</strong>e matrices;<br />

now we will construct specific projections. Assume that the<br />

first row of A does not conta<strong>in</strong> any zeros. This is a very mild<br />

assumption because A was assumed to be full-rank, and the<br />

set of A’s with first row without zeros is dense <strong>in</strong> the set of<br />

full-rank matrices. As usual let ⌈λ⌉ denote the smallest integer larger or equal to λ ∈ R. Then let k := ⌈(m − 1)/(m′ − 1)⌉, and define τ^i := {1, 2 + (m′ − 1)(i − 1), . . . , 2 + (m′ − 1)i − 1} for i < k and τ^k := {1, m − m′ + 2, . . . , m}. Then ∪_i τ^i = [1 : m] and 1 ∈ ∩_i τ^i. Given (m′ × n)-matrices B^1, . . . , B^k with entries B^i = (b^i_{r,j}), define A^{B^1,...,B^k} to be composed of the columns (j ∈ [1 : n])

a_j := (1, b^1_{2,j}/b^1_{1,j}, . . . , b^1_{m′,j}/b^1_{1,j}, b^2_{2,j}/b^2_{1,j}, . . . , b^{k−1}_{m′,j}/b^{k−1}_{1,j}, b^k_{3+k(m′−1)−m,j}/b^k_{1,j}, . . . , b^k_{m′,j}/b^k_{1,j})^⊤.



Lemma 1.7: Let B^1, . . . , B^k be (m′ × n)-matrices such that A_{τ^i} ∼s B^i for i ∈ [1 : k]. Then A^{B^1,...,B^k} ∼s A.
Proof: By assumption there exist λ^i_l ∈ R \ {0} such that b^i_{j,l} = λ^i_l (A_{τ^i})_{j,l} for all i ∈ [1 : k], j ∈ [1 : m′] and l ∈ [1 : n], hence b^i_{j,l}/b^i_{1,l} = (A_{τ^i})_{j,l}/(A_{τ^i})_{1,l}. One can check that due to the choice of the τ^i's we then have (A^{B^1,...,B^k})_{j,l} = a_{j,l}/a_{1,l} for all j ∈ [1 : m] and therefore A^{B^1,...,B^k} ∼s A.

2) Reduction algorithm: The dimension reduction algorithm<br />

now is very simple. Pick k and τ1 , . . . , τk as <strong>in</strong> the<br />

previous section. Perform overcomplete BMMR with the projected<br />

mixtures πτ i(x) for i ∈ [1 : k] and get estimated mix<strong>in</strong>g<br />

matrices Bi . If this recovery has been carried out without any<br />

error, then every Bi is equivalent to Aτ i. Due to permutations,<br />

they might however not be scal<strong>in</strong>g-equivalent. Therefore do<br />

the follow<strong>in</strong>g iteratively for each i ∈ [1 : k]: Apply the overcomplete<br />

source-recovery, see next section, to πτ i(x) us<strong>in</strong>g<br />

Bi and get recovered sources si . For all j < i, consider the<br />

absolute crosscorrelation matrices (|Cor(s^i_r, s^j_s)|)_{r,s}. The row

positions of the maxima of this matrix are pairwise different<br />

because the orig<strong>in</strong>al sources were chosen to be <strong>in</strong>dependent.<br />

Thereby we get a permutation matrix Pi <strong>in</strong>dicat<strong>in</strong>g how to<br />

permute B^i, C^i := B^i P^i, so that the new source correlation matrices are diagonal. Finally, we have constructed matrices C^i such that there exists a permutation P independent of i with C^i ∼s A_{τ^i} P for all i ∈ [1 : k]. Now we can apply lemma 1.7 and get a matrix A^{C^1,...,C^k} with A^{C^1,...,C^k} ∼s AP and therefore A^{C^1,...,C^k} ∼ A as desired.
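The column-wise construction of A^{B^1,...,B^k} is mechanical once the permutations have been matched; the following NumPy sketch (our own illustration, with hypothetical names) normalises every column of each projected estimate by its first-row entry and merges the rows according to the index sets τ^i, exactly as in the formula above.

```python
import numpy as np

def combine_projections(Bs, taus, m):
    """Assemble an (m x n) estimate of A from projected estimates B^i.

    Bs   : list of (m' x n) matrices, each assumed scaling-equivalent to the
           projection of A onto the 0-based coordinate set taus[i]
    taus : list of index tuples, all containing coordinate 0, covering range(m)
    Dividing every column by its first entry removes the unknown per-column
    scalings before the rows are merged (cf. lemma 1.7).
    """
    n = Bs[0].shape[1]
    A_hat = np.empty((m, n))
    A_hat[0] = 1.0                           # first row of A assumed nonzero
    for B, tau in zip(Bs, taus):
        Bn = B / B[0]                        # column j divided by b^i_{1,j}
        for row, coord in enumerate(tau):
            A_hat[coord] = Bn[row]
    return A_hat

# toy check: project a random 4 x 3 matrix onto {0,1,2} and {0,2,3}, rescale the
# columns arbitrarily, and recover A up to the common column scaling 1/a_{1,j}
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, (4, 3))
taus = [(0, 1, 2), (0, 2, 3)]
Bs = [A[list(t)] * rng.uniform(0.5, 2.0, 3) for t in taus]
print(np.allclose(combine_projections(Bs, taus, 4), A / A[0]))   # True
```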

II. BLIND SOURCE RECOVERY<br />

Us<strong>in</strong>g the results from the BMMR step, we can assume<br />

that an estimate of A has been found. In order to solve the<br />

overcomplete BSS problem, we are therefore left with the task<br />

of reconstruct<strong>in</strong>g the sources us<strong>in</strong>g the mixtures x and the<br />

estimated matrix (BSR). S<strong>in</strong>ce A has full rank, the equation<br />

x(t) = As(t) yields the (n − m)-dimensional aff<strong>in</strong>e vector<br />

space A−1 {x(t)} as solution space for s(t). Hence, if n ><br />

m the source-recovery problem is ill-posed without further<br />

assumptions. Us<strong>in</strong>g a maximum likelihood approach [4], [5]<br />

an appropriate assumption can be derived:<br />

Given a prior probability p^0_s on the sources, it can be seen quickly [4], [10] that the most likely source sample is recovered by s = argmax_{x=As} p^0_s. Depending on the assumptions on the prior p^0_s of s, we get different optimization criteria. In the experiments we will assume a simple prior p^0_s ∝ exp(−|s|_p) with any p-norm |·|_p. Then s = argmin_{x=As} |s|_p, which can be solved linearly in the Gaussian case p = 2 and by linear programming or a shortest-path decomposition in the sparse, Laplacian case p = 1, see [5], [10].
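As an illustration of the two priors mentioned above, the following sketch (our own, using SciPy's generic LP solver rather than the shortest-path decomposition of [5]) recovers sources by minimum 2-norm and by minimum 1-norm subject to x = As.

```python
import numpy as np
from scipy.optimize import linprog

def recover_sources(A, x, p=1):
    """Maximum-likelihood source recovery s = argmin_{x = A s} |s|_p.

    p = 2: minimum-norm solution via the pseudo-inverse (Gaussian prior).
    p = 1: sparse solution via linear programming (Laplacian prior), using
           the standard split s = u - v with u, v >= 0.
    x may be a single mixture vector of length m or an (m x T) matrix.
    """
    x = np.atleast_2d(x.T).T                         # ensure shape (m, T)
    m, n = A.shape
    if p == 2:
        return np.linalg.pinv(A) @ x
    S = np.zeros((n, x.shape[1]))
    c = np.ones(2 * n)                               # minimise sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                        # A (u - v) = x(t)
    for t in range(x.shape[1]):
        res = linprog(c, A_eq=A_eq, b_eq=x[:, t], bounds=(0, None), method="highs")
        S[:, t] = res.x[:n] - res.x[n:]
    return S
```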

III. EXPERIMENTAL RESULTS<br />

In order to compare the mixture matrix A with the recovered matrix B from the BMMR step, we calculate the generalized crosstalking error E(A, B) of A and B defined by E(A, B) := min_{M∈Π} ‖A − BM‖, where the minimum is taken over the group Π of all invertible matrices having only one non-zero entry per column and ‖·‖ denotes some matrix norm. It vanishes if and only if A and B are equivalent [10].
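For small n the generalized crosstalking error can be evaluated by brute force; the sketch below (ours, choosing the Frobenius norm for the otherwise unspecified matrix norm) enumerates all column permutations and solves for the optimal per-column scalings in closed form.

```python
import numpy as np
from itertools import permutations

def crosstalking_error(A, B):
    """Generalised crosstalking error E(A, B) = min_M ||A - B M||_F.

    Brute-force illustration: M ranges over products of a permutation and an
    invertible diagonal matrix; for every permutation the optimal column
    scalings follow from least squares.  Feasible only for small n.
    """
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        Bp = B[:, perm]
        # optimal scaling of column j of Bp onto column j of A
        lam = np.sum(A * Bp, axis=0) / np.maximum(np.sum(Bp * Bp, axis=0), 1e-12)
        best = min(best, np.linalg.norm(A - Bp * lam))
    return best
```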

TABLE I: PERFORMANCE OF BMMR ALGORITHMS (n = 3, m = 2, 100 RUNS)

algorithm                              mean E(A, Â)   deviation σ
FastGeo (kernel r = 5, approx. 0.1)        0.60           0.60
FastGeo (kernel r = 0, approx. 0.5)        0.40           0.46
FastGeo (kernel r = 5, approx. 0.5)        0.29           0.42
Soft-LOST (p = 0.01)                       0.68           0.57

The overcomplete FastGeo algorithm is applied to 4 speech<br />

signals s, mixed by a (2 × 4)-mix<strong>in</strong>g matrix A with coefficients<br />

uniformly drawn from [−1, 1], see figure 2 for their<br />

mixture density. The algorithm estimates the matrix well with<br />

E(A, Â) = 0.68, and BSR by 1-norm m<strong>in</strong>imization yields<br />

recovered sources with a mean SNR of only 2.6dB when<br />

compared with the orig<strong>in</strong>al sources; as noted before [5], [10],<br />

without sparsification for <strong>in</strong>stance by FFT, source-recovery<br />

is difficult. To analyze the overcomplete FastGeo algorithm<br />

more generally, we perform 100 Monte-Carlo runs using high-kurtotic gamma-distributed three-dimensional sources with 10^4 samples, mixed by a (2 × 3)-mixing matrix with weights

uniformly chosen from [−1, 1]. In table I, the mean of the<br />

performance <strong>in</strong>dex depend<strong>in</strong>g on various parameters is presented.<br />

Noting that the mean error when using random (2 × 3)-matrices

with coefficients uniformly taken from [−1, 1] is<br />

E = 1.9 ± 0.73, we observe good performance, especially<br />

for a larger kernel radius and higher approximation parameter<br />

(E = 0.29), also compared with Soft-LOST’s E = 0.68 [6].<br />

As an example <strong>in</strong> higher mixture dimension three speech<br />

signals are mixed by a column-normalized (3 × 3) mix<strong>in</strong>g<br />

matrix A. For n = m = 3, m ′ = 2, the projection<br />

framework simplifies to k = 2 with projections π {1,2} and<br />

π_{1,3}. Overcomplete geometric ICA is performed with 5·10^4 sweeps. The recoveries of the projected matrices π_{1,2}A and π_{1,3}A are quite good with E(π_{1,2}A, B^1) = 0.084 and E(π_{1,3}A, B^2) = 0.10. Taking out the permutations as described before, we get a recovered mixing matrix A^{B^1,B^2} with low generalized crosstalking error of E(A, A^{B^1,B^2}) = 0.15 (compared with a mean random error of E = 3.2 ± 0.7).

REFERENCES<br />

[1] A. Hyvär<strong>in</strong>en, J. Karhunen, and E. Oja, “<strong>Independent</strong> component<br />

analysis,” John Wiley & Sons, 2001.<br />

[2] A. Cichocki and S. Amari, Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g.<br />

John Wiley & Sons, 2002.<br />

[3] J. Eriksson and V. Koivunen, “Identifiability, separability and uniqueness<br />

of l<strong>in</strong>ear ICA models,” IEEE Signal Process<strong>in</strong>g Letters, vol. 11, no. 7,<br />

pp. 601–604, 2004.<br />

[4] T. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, “Bl<strong>in</strong>d source<br />

separation of more sources than mixtures us<strong>in</strong>g overcomplete representations,”<br />

IEEE Signal Process<strong>in</strong>g Letters, vol. 6, no. 4, pp. 87–90, 1999.<br />

[5] P. Bofill and M. Zibulevsky, “Underdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation<br />

us<strong>in</strong>g sparse representations,” Signal Process<strong>in</strong>g, vol. 81, pp. 2353–2362,<br />

2001.<br />

[6] P. O’Grady and B. Pearlmutter, “Soft-LOST: EM on a mixture of<br />

oriented l<strong>in</strong>es,” <strong>in</strong> Proc. ICA 2004, ser. Lecture Notes <strong>in</strong> Computer<br />

Science, vol. 3195, Granada, Spa<strong>in</strong>, 2004, pp. 430–436.<br />

[7] F. Theis, A. Jung, C. Puntonet, and E. Lang, “L<strong>in</strong>ear geometric ICA:<br />

Fundamentals and algorithms,” Neural Computation, vol. 15, pp. 419–<br />

439, 2003.<br />

[8] C. Puntonet and A. Prieto, “An adaptive geometrical procedure for bl<strong>in</strong>d<br />

separation of sources,” Neural Process<strong>in</strong>g Letters, vol. 2, 1995.<br />

[9] M. Benaim, J.-C. Fort, and G. Pagés, “Convergence of the onedimensional<br />

Kohonen algorithm,” Adv. Appl. Prob., vol. 30, pp. 850–<br />

869, 1998.<br />

[10] F. Theis, E. Lang, and C. Puntonet, “A geometric algorithm for overcomplete<br />

l<strong>in</strong>ear ICA,” Neurocomput<strong>in</strong>g, vol. 56, pp. 381–398, 2004.




Chapter 18<br />

Proc. EUSIPCO 2006<br />

Paper P. Gruber and F.J. Theis. Grassmann cluster<strong>in</strong>g. In Proc. EUSIPCO 2006,<br />

Florence, Italy, 2006<br />

Reference (Gruber and Theis, 2006)<br />

Summary <strong>in</strong> section 1.5.3<br />




ABSTRACT<br />

An important tool <strong>in</strong> high-dimensional, explorative data m<strong>in</strong><strong>in</strong>g<br />

is given by cluster<strong>in</strong>g methods. They aim at identify<strong>in</strong>g<br />

samples or regions of similar characteristics, and often code<br />

them by a s<strong>in</strong>gle codebook vector or centroid. One of the<br />

most commonly used partitional cluster<strong>in</strong>g techniques is the<br />

k-means algorithm, which <strong>in</strong> its batch form partitions the data<br />

set <strong>in</strong>to k disjo<strong>in</strong>t clusters by simply iterat<strong>in</strong>g between cluster<br />

assignments and cluster updates. The latter step implies<br />

calculat<strong>in</strong>g a new centroid with<strong>in</strong> each cluster. We generalize<br />

the concept of k-means by apply<strong>in</strong>g it not to the standard<br />

Euclidean space but to the manifold of subvectorspaces<br />

of a fixed dimension, also known as the Grassmann manifold.<br />

Important examples <strong>in</strong>clude projective space i.e. the<br />

manifold of l<strong>in</strong>es and the space of all hyperplanes. Detect<strong>in</strong>g<br />

clusters <strong>in</strong> multiple samples drawn from a Grassmannian<br />

is a problem aris<strong>in</strong>g <strong>in</strong> various applications. In this manuscript,<br />

we provide correspond<strong>in</strong>g metrics for a Grassmann<br />

k-means algorithm, and solve the centroid calculation problem<br />

explicitly <strong>in</strong> closed form. An application to nonnegative<br />

matrix factorization illustrates the feasibility of the proposed<br />

algorithm.<br />

1. PARTITIONAL CLUSTERING<br />

Many algorithms for cluster<strong>in</strong>g i.e. the detection of common<br />

features with<strong>in</strong> a data set are discussed <strong>in</strong> the literature. In<br />

the follow<strong>in</strong>g, we will study cluster<strong>in</strong>g with<strong>in</strong> the framework<br />

of k-means [2].<br />

In general, its goal can be described as follows: Given a<br />

set A of po<strong>in</strong>ts <strong>in</strong> some metric space (M,d), f<strong>in</strong>d a partition of<br />

A <strong>in</strong>to disjo<strong>in</strong>t non-empty subsets Bi, �<br />

i Bi = A, together with<br />

centroids ci ∈ M so as to m<strong>in</strong>imize the sum of the squares<br />

of the distances of each po<strong>in</strong>t of A to the centroid ci of the<br />

cluster Bi conta<strong>in</strong><strong>in</strong>g it. In other words, m<strong>in</strong>imize<br />

E(B_1, c_1, . . . , B_k, c_k) := ∑_{i=1}^{k} ∑_{a∈B_i} d(a, c_i)^2 .   (1)

If the set A contains only finitely many elements a_1, . . . , a_T, then this can be easily re-formulated as a constrained non-linear optimization problem: minimize

E(W, C) := ∑_{i=1}^{k} ∑_{t=1}^{T} w_{it} d(a_t, c_i)^2   (2)

subject to

w_{it} ∈ {0, 1},   ∑_{i=1}^{k} w_{it} = 1   for 1 ≤ i ≤ k, 1 ≤ t ≤ T.   (3)

Here C := {c1,...,ck} are the centroid locations, and W :=<br />

(wit) is the partition matrix correspond<strong>in</strong>g to the partition Bi<br />

of A.<br />

A common approach to m<strong>in</strong>imiz<strong>in</strong>g (2) subject to (3) is<br />

partial optimization for W and C, i.e. alternat<strong>in</strong>g m<strong>in</strong>imization<br />

of either W and C while keep<strong>in</strong>g the other one fixed.<br />

The batch k-means algorithm employs precisely this strategy:<br />

After an <strong>in</strong>itial, random choice of centroids c1,...,ck,<br />

it iterates between the follow<strong>in</strong>g two steps until convergence<br />

measured by a suitable stopp<strong>in</strong>g criterion:<br />

• cluster assignment: for each a_t determine an index i(t) such that

  i(t) = argmin_i d(a_t, c_i)   (4)

• cluster update: within each cluster B_i := {a_t | i(t) = i} determine the centroid c_i by minimizing

  c_i := argmin_c ∑_{a∈B_i} d(a, c)^2   (5)

The cluster assignment step corresponds to m<strong>in</strong>imiz<strong>in</strong>g<br />

(2) for fixed C, which means choos<strong>in</strong>g the partition W such<br />

that each element of A is assigned to the i-th cluster if ci is<br />

the closest centroid. In the cluster update step, (2) is m<strong>in</strong>imized<br />

for fixed partition W, imply<strong>in</strong>g that ci is constructed<br />

as centroid with<strong>in</strong> the i-th cluster; this <strong>in</strong>deed corresponds to<br />

m<strong>in</strong>imiz<strong>in</strong>g E(W,C) for fixed W because <strong>in</strong> this case the<br />

cost function is a sum of functions depending on different parameters,

so we can m<strong>in</strong>imize them separately lead<strong>in</strong>g to the<br />

centroid equation (5). This general update rule converges to<br />

a local m<strong>in</strong>imum under rather weak conditions [3, 7].<br />
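The alternating scheme is easy to state generically; the following sketch (our own notation, not code from the paper) implements batch k-means on an arbitrary metric space, taking the distance and the centroid computation as callables. The Euclidean special case discussed next simply uses the cluster mean as the centroid.

```python
import numpy as np

def batch_kmeans(points, k, dist, centroid, n_iter=100, seed=0):
    """Generic batch k-means on a metric space (M, dist).

    points   : list of elements of M
    dist     : callable dist(a, c) returning a distance
    centroid : callable mapping a non-empty list of elements to their centroid,
               i.e. the minimiser of the summed squared distances in a cluster
    """
    rng = np.random.default_rng(seed)
    centroids = [points[i] for i in rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # cluster assignment: nearest centroid for every point
        labels = [int(np.argmin([dist(a, c) for c in centroids])) for a in points]
        # cluster update: recompute the centroid within each non-empty cluster
        new_centroids = []
        for i in range(k):
            cluster = [a for a, lab in zip(points, labels) if lab == i]
            new_centroids.append(centroid(cluster) if cluster else centroids[i])
        centroids = new_centroids
    return centroids, labels

# Euclidean special case: the centroid is simply the cluster mean
euclidean = lambda a, c: np.linalg.norm(a - c)
mean_centroid = lambda cluster: np.mean(cluster, axis=0)
```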

An important special case is given by M := R^n and the Euclidean distance d(x, y) := ‖x − y‖. The centroids from equation (5) can then be calculated in closed form, and each centroid is simply given by the cluster mean c_i := (1/|B_i|) ∑_{a∈B_i} a; this follows directly from

∑_{a∈B_i} ‖a − c_i‖^2 = ∑_{a∈B_i} ∑_{j=1}^{n} (a_j − c_{ij})^2 = ∑_{j=1}^{n} ∑_{a∈B_i} (a_j^2 − 2 a_j c_{ij} + c_{ij}^2),

which can be minimized separately for each coordinate j and is minimal with respect to c_{ij} if the derivative of the quadratic function is zero, so if |B_i| c_{ij} = ∑_{a∈B_i} a_j.

In the follow<strong>in</strong>g, we are <strong>in</strong>terested <strong>in</strong> more complex metric<br />

spaces. Typically, k-means can be implemented efficiently,<br />

if the cluster centroids can be calculated quickly. In<br />

the example of R^n, we saw that it was crucial to minimize

the square distances and to use the Euclidean distance.<br />

Hence we will study metrics which also allow a closed-form<br />

centroid solution.



The data space of <strong>in</strong>terest will consist of subspaces of<br />

R n , and the goal is to f<strong>in</strong>d subspace clusters. We will only be<br />

deal<strong>in</strong>g with sub-vector-spaces; extensions to the aff<strong>in</strong>e case<br />

are discussed <strong>in</strong> section 3.3.<br />

A somewhat related method is the so-called k-plane cluster<strong>in</strong>g<br />

algorithm [4], which does not cluster subspaces but<br />

solves the problem of fitt<strong>in</strong>g hyperplanes <strong>in</strong> R n to a given<br />

po<strong>in</strong>t set A ⊂ R n . A hyperplane H ⊂ R n can be described by<br />

H = {x|c ⊤ x = 0} = c ⊥ for some normal vector c, typically<br />

chosen such that �c� = 1. Bradley and Mangasarian [4] essentially<br />

choose the pseudo-metric d(a,b) := |a ⊤ b| on the<br />

sphere S^{n−1} := {x ∈ R^n | ‖x‖ = 1} — the data can be assumed

to lie on the sphere after normalization, which does<br />

not change cluster conta<strong>in</strong>ment. They show that the centroid<br />

equation (5) is solved by any eigenvector of the cluster correlation<br />

BiBi ⊤ correspond<strong>in</strong>g to the m<strong>in</strong>imal eigenvalue, if by<br />

abuse of notation Bi is to <strong>in</strong>dicate the (n × |Bi|)-matrix conta<strong>in</strong><strong>in</strong>g<br />

the elements of the set Bi <strong>in</strong> its columns. Alternative<br />

approaches to this subspace cluster<strong>in</strong>g problem are reviewed<br />

<strong>in</strong> [6].<br />

2. PROJECTIVE CLUSTERING<br />

A first step towards general subspace clustering is to consider one-dimensional subspaces, i.e. lines. Let RP^n denote the space of one-dimensional real vector subspaces of R^{n+1}. It is equivalent to S^n after identifying antipodal points, so it has the quotient representation RP^n = S^n/{−1, 1}. We will represent lines by their equivalence class [x] := {λx | λ ∈ R} for x ≠ 0. A metric can be defined by

d_0([x], [y]) := √( 1 − ( x^⊤y / (‖x‖ ‖y‖) )^2 ) .   (6)

Clearly d_0 is symmetric, and positive definite according to the Cauchy–Schwarz inequality.

Conveniently, the cluster centroid of cluster Bi is given by<br />

any eigenvector of the cluster correlation BiBi ⊤ correspond<strong>in</strong>g<br />

to the largest eigenvalue. In section 3.1, we will show that<br />

projective cluster<strong>in</strong>g is a special case of a more general cluster<strong>in</strong>g<br />

and hence the derivation of the correspond<strong>in</strong>g centroid<br />

cluster<strong>in</strong>g algorithm will be postponed until later.<br />

Figure 1 shows an example application of the projective<br />

k-means algorithm. Note that the projective k-means can be<br />

directly applied to the dual problem of cluster<strong>in</strong>g hyperplanes<br />

by us<strong>in</strong>g the description via their normal ‘l<strong>in</strong>es’.<br />
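In code, the projective distance (6) and its centroid (the dominant eigenvector of the cluster correlation) take only a few lines; this sketch is our own and plugs directly into the generic batch k-means sketch given in the previous section.

```python
import numpy as np

def projective_distance(x, y):
    """d_0([x],[y]) = sqrt(1 - (x^T y / (|x| |y|))^2) between two lines."""
    c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 1.0 - c * c))

def projective_centroid(cluster):
    """Centroid of a set of lines: the eigenvector of sum_i v_i v_i^T with the
    largest eigenvalue (each line represented by any nonzero generator v_i)."""
    V = np.array([v / np.linalg.norm(v) for v in cluster])   # rows: unit vectors
    _, E = np.linalg.eigh(V.T @ V)                           # ascending eigenvalues
    return E[:, -1]
```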

3. GRASSMANN CLUSTERING<br />

More <strong>in</strong>terest<strong>in</strong>gly, we would like to perform cluster<strong>in</strong>g <strong>in</strong><br />

the Grassmann manifold Gn,p of p-dimensional vector subspaces<br />

of R n for 0 ≤ p ≤ n. If Vn,p denotes the Stiefel<br />

manifold consist<strong>in</strong>g of orthonormal matrices for n ≥ p, then<br />

Gn,p has the natural quotient representation Gn,p = Vn,p/Op,<br />

where Op := Vp,p denotes the orthogonal group. This representation<br />

simply means that any p-dimensional subspace<br />

of R n is given by p orthonormal vectors, i.e. by a basis<br />

V ∈ Vn,p, which is unique except for right multiplication by<br />

an orthogonal matrix. We will also write [V] for the subspace.<br />

The geometric properties of optimization algorithms on<br />

Gn,p are nicely discussed by Edelman et al. [5]. They<br />

also summarize various metrics on the Grassmann manifold,<br />


Figure 1: Illustration of projective k-means cluster<strong>in</strong>g <strong>in</strong><br />

three dimensions. 10^5 samples from a 4-dimensional

strongly supergaussian distribution are projected onto three<br />

dimensions and serve as the generators of the l<strong>in</strong>es. These<br />

were nicely clustered <strong>in</strong>to k = 4 centroids, located at the density<br />

axes.<br />

which can all be naturally derived from the geodesic metric (arc length) induced by the natural Riemannian structure of Gn,p. Some equivalence relations between the metrics are known, but for computational purposes, we choose the very easy to calculate so-called projection F-norm given by

d([V], [W]) := 2^{−1/2} ‖VV^⊤ − WW^⊤‖_F ,   (7)

where ‖V‖_F := √(tr(VV^⊤)) denotes the Frobenius norm of a matrix. Note that the projection F-norm is indeed well-defined, as (7) does not depend on the choice of class representatives.

In order to perform k-means clustering on (Gn,p, d), we have to solve the centroid problem (5). One of our main results is that the centroid [Ci] of subspaces of some cluster Bi is spanned by the p eigenvectors corresponding to the largest eigenvalues of the generalized cluster covariance (1/|Bi|) ∑_{[V]∈Bi} VV^⊤ (as the derivation in section 3.1 shows). This generalizes the projective and the hyperplane k-means algorithms from above.
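Concretely, the projection F-norm distance and this centroid rule can be sketched as follows (our own illustration; subspaces are represented by orthonormal n × p bases, and the two callables can be passed to the generic batch k-means sketch from the partitional-clustering section above).

```python
import numpy as np

def grassmann_distance(V, W):
    """Projection F-norm distance between subspaces [V], [W] (orthonormal bases)."""
    return np.linalg.norm(V @ V.T - W @ W.T) / np.sqrt(2.0)

def grassmann_centroid(cluster):
    """Centroid of a cluster of p-dimensional subspaces: an orthonormal basis
    of the span of the p dominant eigenvectors of sum_i V_i V_i^T."""
    p = cluster[0].shape[1]
    M = sum(V @ V.T for V in cluster)
    _, E = np.linalg.eigh(M)          # eigenvalues in ascending order
    return E[:, -p:]                  # p eigenvectors of the largest eigenvalues
```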

3.1 Calculat<strong>in</strong>g the optimal centroids<br />

For the cluster update step of the batch k-means algorithm<br />

we need to find [C] such that

f(C) := ∑_{i=1}^{l} d([Vi], [C])^2

for l subspaces [Vi] represented by Vi ∈ V(n, p) is minimal, subject to g(C) := C^⊤C = Ip (pseudo orthogonality). We may also assume that the Vi are pseudo-orthonormal, Vi^⊤Vi = Ip.
It is easy to see that

f(C) = 2^{−1/2} ( tr(∑_i Vi Vi^⊤) + tr(l CC^⊤ − 2 CC^⊤ ∑_i Vi Vi^⊤) )
     = 2^{−1/2} ( tr D + tr((l In − 2V) CC^⊤) ),

where V := ∑_i Vi Vi^⊤ and E D E^⊤ = V denotes the eigenvalue decomposition of V with E orthonormal and D diagonal.



Figure 2: Illustration of the convex subsets on which the equation ∑_{i=1}^{n} d_ii x_i for given D is optimized. Here n = 4 and the surfaces for p = 1, . . . , 3 are depicted (normalized onto the standard simplex); the labelled vertices are (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (.5, 0, .5, 0), (0, .5, .5, 0) and (.3, .3, 0, .3).

This means that

f(C) = 2^{−1/2} ( ∑_{i=1}^{n} d_ii + l p − 2 tr(D E^⊤ C C^⊤ E) )
     = 2^{−1/2} ( ∑_{i=1}^{n} d_ii + l p − 2 ∑_{i=1}^{n} d_ii x_ii ),

where d_ij are the matrix elements of D, and x_ij of X = E^⊤ C C^⊤ E.

Here tr X = tr(CC^⊤) = p for pseudo orthogonal C (p eigenvectors C with eigenvalue 1) and all 0 ≤ x_ii ≤ 1 (again pseudo orthogonality). Hence this is a linear optimization problem on a convex set (see also figure 2) and therefore any optimum is located at the corners of the convex set, which in our case are {x ∈ {0,1}^n | ∑_{i=1}^{n} x_i = p}. If we assume that the d_ii are ordered in descending order, then a minimum of f is given by

CC^⊤ = E X E^⊤ = E [ Ip 0 ; 0 0 ] E^⊤ ,

which corresponds to

C = E [ Ip ; 0 ].

In this calculation we can also see the <strong>in</strong>determ<strong>in</strong>acies of<br />

the optimization:<br />

1. If two or more eigenvalues of V are equal, any po<strong>in</strong>t on<br />

the correspond<strong>in</strong>g edge of the convex set is optimal and<br />

hence the centroid can vary along the subspace generated<br />

by the correspond<strong>in</strong>g eigenvectors E<br />

2. If some eigenvalues of V are zero, a similar <strong>in</strong>determ<strong>in</strong>acy<br />

occurs.<br />

An example <strong>in</strong> RP 2 is demonstrated <strong>in</strong> figure 3.<br />

Figure 3: Let Vi be two samples which are orthogonal (w.l.o.g. we can assume Vi = ei, represented by the unit vectors; the figure annotates the distances d_0(e_1, x) and d_0(e_2, x) of a line x = (x_1, x_2) to the two axes). Hence V = ∑_i Vi Vi^⊤ has degenerate eigenstructure. Then the quantisation error is given by d(e_1, x)^2 + d(e_2, x)^2, which is here (1/2)(2 + 2 − 2 tr(I XX^⊤)) = 2 − x_1^2 − x_2^2 = 1 for X represented by x = (x_1, x_2). Hence any X is a centroid in the sense of the batch k-means algorithm.

3.2 Relationship to projective cluster<strong>in</strong>g<br />

The distance d_0 on RP^n from above (equation (6)) was defined as

d_0(V, W) = √( 1 − ( V^⊤W / (‖V‖ ‖W‖) )^2 ) ,

if according to our previous notation [V], [W] ∈ Gn,1 = RP^n. Note that if the two vectors represent time series, then this is the same as the correlation between the two.
It is now easy to see that this distance coincides with the definition of d on the general Grassmannian from above. Let V, W ∈ V(n, 1) be two vectors. We may assume that V^⊤V = W^⊤W = 1. Then

2 d(V, W)^2 = tr(VV^⊤VV^⊤ + WW^⊤WW^⊤ − VV^⊤WW^⊤ − WW^⊤VV^⊤)
            = tr(VV^⊤) + tr(WW^⊤) − 2 tr(V(V^⊤W)W^⊤).

All matrices have rank 1 and hence the trace is the sole nonzero eigenvalue. Since VV^⊤V = V it is 1 for the first matrix, similar for the second and W^⊤V for the third, because VW^⊤V = (W^⊤V)V. Hence

2 d(V, W)^2 = 2 − 2 (W^⊤V)^2 = 2 d_0(V, W)^2 .

3.3 Deal<strong>in</strong>g with aff<strong>in</strong>e spaces<br />

So far we only have dealt with the special case of cluster<strong>in</strong>g<br />

subspaces, i.e. l<strong>in</strong>ear subsets which conta<strong>in</strong> the orig<strong>in</strong>. But<br />

<strong>in</strong> practice the problem of cluster<strong>in</strong>g aff<strong>in</strong>e subspaces arises,<br />

see for example 5. This can be dealt with quite easily.<br />

Let F be a p dimensional aff<strong>in</strong>e l<strong>in</strong>ear subset of R n . Then<br />

F can be characterized by p + 1 po<strong>in</strong>ts v0,...,vp such that<br />




v_1 − v_0, . . . , v_p − v_0 are linearly independent. Consider the

follow<strong>in</strong>g embedd<strong>in</strong>g<br />

R^n → R^{n+1} : (x_1, . . . , x_n) ↦ (x_1, . . . , x_n, 1).

We may therefore identify the p dimensional aff<strong>in</strong>e subspaces<br />

with the (p + 1)-dimensional linear subspaces in R^{n+1} by embedding

the generators and tak<strong>in</strong>g the l<strong>in</strong>ear closure. In fact it<br />

is easy to see that we obta<strong>in</strong> a 1-to-1 mapp<strong>in</strong>g between the<br />

p-dimensional affine subspaces of R^n and the (p + 1)-dimensional linear subspaces in R^{n+1}, which intersect the orthogonal

complement of (0,...,0,1) only at the orig<strong>in</strong>.<br />

Hence we can reduce the aff<strong>in</strong>e case to calculations for<br />

l<strong>in</strong>ear subsets only. Note that s<strong>in</strong>ce only eigenvectors of sums<br />

of projections onto the subsets Vi can become centroids <strong>in</strong><br />

the batch version of the k-means algorithm, any centroid is<br />

also <strong>in</strong> the image of the above embedd<strong>in</strong>g and can be identified<br />

uniquely with an affine subspace of the original problem.
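A small sketch of this embedding (ours; the generator points are assumed to be in general position):

```python
import numpy as np

def embed_affine(points):
    """Orthonormal basis of the (p+1)-dimensional linear subspace of R^(n+1)
    spanned by the embedded generators (x, 1) of a p-dimensional affine
    subspace; `points` is a list of p+1 vectors in R^n."""
    X = np.vstack([np.append(x, 1.0) for x in points]).T    # (n+1) x (p+1)
    Q, _ = np.linalg.qr(X)
    return Q

# example: two points in R^2 span a line; its embedding is a plane in R^3
V = embed_affine([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(V.shape)   # (3, 2): directly usable with the Grassmann k-means sketch
```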

4. EXPERIMENTAL RESULTS<br />

We f<strong>in</strong>ish by illustrat<strong>in</strong>g the algorithm <strong>in</strong> a few examples.<br />

4.1 Toy example<br />

As a toy example, let us first consider 10^4 samples of

G4,2, namely uniformly randomly chosen from the 6 possible<br />

2-dimensional coord<strong>in</strong>ate planes. In order to avoid<br />

any bias with<strong>in</strong> the algorithm, the non-zero coefficients from<br />

the plane-represent<strong>in</strong>g matrices have been chosen uniformly<br />

from O2. The samples have been deteriorated by Gaussian<br />

noise with a signal-to-noise ratio of 10dB. Application of<br />

the Grassmann k-means algorithm with k = 6 yields convergence<br />

after only 6 epochs with the result<strong>in</strong>g 6 clusters<br />

with centroids [V i ]. The distance measure µ(V) := (|vi1 +<br />

vi2| + |vi1 − vi2|)i should be large only <strong>in</strong> two coord<strong>in</strong>ates if<br />

[V] is close to the correspond<strong>in</strong>g 2-dimensional coord<strong>in</strong>ate<br />

plane. And <strong>in</strong>deed, the found centroids have distance measures<br />

µ(V^i) = (0.02, 0, 1.9, 1.9)^⊤, (1.7, 0.01, 0.01, 1.7)^⊤, (1.7, 0.01, 1.7, 0.02)^⊤, (0.01, 1.5, 1.5, 0)^⊤, (2.0, 2.0, 0, 0.01)^⊤, (0.01, 2.0, 0.01, 2.0)^⊤.

Hence, the algorithm correctly chose all 6 coord<strong>in</strong>ate planes<br />

as cluster centroids.<br />

4.2 Polytope identification<br />

As an example application of the Grassmann cluster<strong>in</strong>g algorithm,<br />

we want to solve the follow<strong>in</strong>g approximation problem<br />

from computational geometry: given a set of po<strong>in</strong>ts, identify<br />

the smallest convex polytope with a fixed number of faces<br />

k, conta<strong>in</strong><strong>in</strong>g the po<strong>in</strong>ts. In two dimensions, this implies the<br />

task of f<strong>in</strong>d<strong>in</strong>g the k edges of a polytope where only samples<br />

in the inside are known. We use the QHull algorithm [1] to

construct the convex hull thus identify<strong>in</strong>g the possible edges<br />

of the polytope. Then, we apply aff<strong>in</strong>e Grassmann k-means<br />

cluster<strong>in</strong>g to these edges <strong>in</strong> order to identify the k bound<strong>in</strong>g<br />

edges. Figure 4 shows an example. Generalization to arbitrary dimensions is straightforward.

Figure 4: An example of using hyperplane clustering (p = n − 1) to identify the contour of a sampled figure; panels: (a) samples, (b) QHull contour, (c) Grassmann clustering, (d) result. QHull was used to find the outer edges, which were then clustered into 4 clusters. The broken lines show the boundaries used to generate the 300 samples.
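The two-dimensional pipeline can be sketched as follows (our own illustration; it relies on the embed_affine, grassmann_distance, grassmann_centroid and batch_kmeans sketches introduced earlier and on SciPy's convex-hull wrapper around QHull).

```python
import numpy as np
from scipy.spatial import ConvexHull

def polytope_edges(samples, k):
    """Cluster the convex-hull edges of 2-d samples into k bounding edges.

    samples : (N, 2) array of points from the inside of the polytope
    Each hull edge (an affine line) is embedded as a 2-dimensional linear
    subspace of R^3, and the embedded edges are clustered with Grassmann
    k-means; the returned centroids represent the k estimated bounding edges.
    """
    hull = ConvexHull(samples)
    edges = [embed_affine([samples[i], samples[j]]) for i, j in hull.simplices]
    centroids, _ = batch_kmeans(edges, k, grassmann_distance, grassmann_centroid)
    return centroids
```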



4.3 Nonnegative Matrix Factorization<br />

(Overcomplete) Nonnegative Matrix Factorization (NMF)<br />

deals with the problem of f<strong>in</strong>d<strong>in</strong>g a nonnegative decomposition<br />

X = AS + N of a nonnegative matrix X, where N<br />

denotes unknown Gaussian noise. S is often pictured as a<br />

source data set conta<strong>in</strong><strong>in</strong>g samples along its columns. If we<br />

assume that S spans the whole first quadrant, then X is a<br />

conic hull with cone l<strong>in</strong>es given by the columns of A. After<br />

projection to the standard simplex, the conic hull reduces<br />

to the convex hull, and the projected, known mixture data<br />

set X lies with<strong>in</strong> a convex polytope of the order given by the<br />

number of rows of S. Hence we face the problem of identify<strong>in</strong>g<br />

edges of a sampled polytope, and, even <strong>in</strong> the overcomplete<br />

case, we may tackle this problem by the Grassmann<br />

cluster<strong>in</strong>g-based identification algorithm from the previous<br />

section.<br />
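The only NMF-specific preprocessing step is the projection onto the standard simplex; a minimal sketch (ours) is given below, after which the polytope identification of section 4.2 applies as just described.

```python
import numpy as np

def project_to_simplex(X):
    """Scale every nonnegative column of X = A S + N onto the standard simplex
    (divide by the sum of its entries); the conic hull of the columns of A then
    becomes the convex hull of their projections."""
    s = X.sum(axis=0)
    return X[:, s > 0] / s[s > 0]
```

The projected samples can then be fed to the polytope identification sketch of section 4.2 (restricted to the simplex plane) to estimate the column directions of A, even in the overcomplete case.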

As an example, see figure 5, we choose a random mixing matrix

A = ( 0.76   0.39   0.14
      0.033  0.06   0.43
      0.20   0.56   0.43 )

and sources S given by i.i.d. samples from a squared Gaussian. 10^5 samples were drawn, and subsets of 10 to 10^5 samples were used for the comparison. We refer to the figure caption for further details.

5. CONCLUSION<br />

We have studied k-means-style cluster<strong>in</strong>g problems on the<br />

non-Euclidean Grassmann manifold. In an adequate metric,<br />

we were able to reduce the aris<strong>in</strong>g centroid calculation<br />

problem to the calculation of eigenvectors of the cluster covariance,<br />

for which we gave a proof based on convex optimization.<br />

The algorithm was illustrated by applications to<br />

polytope fitt<strong>in</strong>g and to perform<strong>in</strong>g overcomplete nonnegative<br />

factorizations similar to NMF. In future work, besides<br />

extend<strong>in</strong>g the framework to other cluster<strong>in</strong>g algorithms and<br />

matrix manifolds together with prov<strong>in</strong>g convergence of the<br />

result<strong>in</strong>g algorithms, we plan on apply<strong>in</strong>g the algorithm for<br />

the stability analysis of multidimensional <strong>in</strong>dependent component<br />

analysis.<br />

Acknowledgements<br />

Partial f<strong>in</strong>ancial support by the DFG (GRK 638) and the<br />

BMBF (project ‘ModKog’) is gratefully acknowledged.<br />

REFERENCES<br />

[1] C.B. Barber, D.P. Dobk<strong>in</strong>, and H. Huhdanpaa. The<br />

quickhull algorithm for convex hull. Technical Report<br />

GCG53, The Geometry Center, University of M<strong>in</strong>nesota,<br />

M<strong>in</strong>neapolis, 1993.<br />

[2] C.M. Bishop. Neural Networks for Pattern Recognition.<br />

Oxford University Press, 1995.<br />

[3] L. Bottou and Y. Bengio. Convergence properties of the<br />

k-means algorithms. In Proc. NIPS 1994, pages 585–<br />

592. MIT Press, 1995.<br />

[4] P.S. Bradley and O.L. Mangasarian. k-plane cluster<strong>in</strong>g.<br />

Journal of Global Optimization, 16(1):23–32, 2000.<br />

Figure 5: Grassmann clustering can be used to solve the NMF problem. The mixed signal to be analyzed is a 3-dimensional toy signal with a positive 3 × 3 matrix. The resulting mixture was analyzed with a mean-square-error implementation of NMF and compared to Grassmann clustering. In the clustering algorithm the data is first projected onto the standard simplex. This translates the task to the polytope identification discussed in section 4.2. (a) Comparison of NMF (mean square error) and Grassmann clustering for NMF, plotted as crosserror versus number of samples (10 to 10^5, averaged over 4 tries). (b) Illustration of the NMF algorithm: projection onto the standard simplex (samples, QHull contour, Grassmann clustering with 100 samples, NMF mixing matrix with 100000 samples, and the true mixing matrix).

[5] A. Edelman, T.A. Arias, and S.T. Smith. The geometry<br />

of algorithms with orthogonality constra<strong>in</strong>ts. SIAM Journal<br />

on Matrix <strong>Analysis</strong> and Applications, 20(2):303–<br />

353, 1999.<br />

[6] L. Parsons, E. Haque, and H. Liu. Subspace cluster<strong>in</strong>g<br />

for high dimensional data: a review. SIGKDD Explor.<br />

Newsl., 6(1):90–105, 2004.<br />

[7] S.Z. Selim and M.A. Ismail. K-means-type algorithms: a<br />

generalized convergence theorem and characterization of<br />

local optimality. IEEE Transactions on Pattern <strong>Analysis</strong><br />

and Mach<strong>in</strong>e Intelligence, 6:81–87, 1984.


Chapter 19<br />

LNCS 3195:977-984, 2004<br />

Paper I.R. Keck, F.J. Theis, P. Gruber, E.W. Lang, K. Specht, and C.G. Puntonet.<br />

3D spatial analysis of fMRI data on a word perception task. In Proc. ICA<br />

2004, volume 3195 of LNCS, pages 977-984, Granada, Spa<strong>in</strong>, 2004. Spr<strong>in</strong>ger<br />

Reference (Keck et al., 2004)<br />

Summary <strong>in</strong> section 1.6.1<br />




3D Spatial <strong>Analysis</strong> of fMRI Data<br />

on a Word Perception Task<br />

Ingo R. Keck 1 , Fabian J. Theis 1 , Peter Gruber 1 , Elmar W. Lang 1 ,

Karsten Specht 2 , and Carlos G. Puntonet 3<br />

1 Institute of Biophysics, Neuro- and Bio<strong>in</strong>formatics Group<br />

University of Regensburg, D-93040 Regensburg, Germany<br />

{Ingo.Keck,elmar.lang}@biologie.uni-regensburg.de<br />

2 Institute of Medic<strong>in</strong>e, Research Center Jülich, D-52425 Jülich, Germany<br />

k.specht@fz-juelich.de<br />

3 Departamento de Arquitectura y Tecnologia de Computadores<br />

Universidad de Granada/ESII, E-1807 Granada, Spa<strong>in</strong><br />

carlos@atc.ugr.es<br />

Abstract. We discuss a 3D spatial analysis of fMRI data taken dur<strong>in</strong>g<br />

a comb<strong>in</strong>ed word perception and motor task. The event - based experiment<br />

was part of a study to <strong>in</strong>vestigate the network of neurons <strong>in</strong>volved<br />

<strong>in</strong> the perception of speech and the decod<strong>in</strong>g of auditory speech stimuli.<br />

We show that a classical general l<strong>in</strong>ear model analysis us<strong>in</strong>g SPM does<br />

not yield reasonable results. With bl<strong>in</strong>d source separation (BSS) techniques<br />

us<strong>in</strong>g the FastICA algorithm it is possible to identify different<br />

<strong>in</strong>dependent components (IC) <strong>in</strong> the auditory cortex correspond<strong>in</strong>g to<br />

four different stimuli. Most <strong>in</strong>terest<strong>in</strong>g, we could detect an IC represent<strong>in</strong>g<br />

a network of simultaneously active areas <strong>in</strong> the <strong>in</strong>ferior frontal gyrus<br />

responsible for word perception.<br />

1 Introduction<br />

S<strong>in</strong>ce the early 90s [1, 2], functional magnetic resonance imag<strong>in</strong>g (fMRI) based<br />

on the blood oxygen level dependent contrast (BOLD) developed <strong>in</strong>to one of<br />

the ma<strong>in</strong> technologies <strong>in</strong> human bra<strong>in</strong> research. Its high spatial and temporal<br />

resolution combined with its non-invasive nature makes it an important tool to discover functional areas in the human brain and their interactions.

However, its low signal to noise ratio (SNR) and the high number of activities<br />

<strong>in</strong> the passive bra<strong>in</strong> require sophisticated analysis methods which can be divided<br />

<strong>in</strong>to two classes:<br />

– model based approaches like the general l<strong>in</strong>ear model which require prior<br />

knowledge of the time course of the activations,<br />

– model free approaches like bl<strong>in</strong>d source separation (BSS) which try to separate<br />

the recorded activation <strong>in</strong>to different classes accord<strong>in</strong>g to statistical<br />

specifications without prior knowledge of the activation.<br />

C.G. Puntonet and A. Prieto (Eds.): ICA 2004, LNCS 3195, pp. 977–984, 2004.<br />

c○ Spr<strong>in</strong>ger-Verlag Berl<strong>in</strong> Heidelberg 2004



In this text we compare these analysis techniques <strong>in</strong> a study of an auditory<br />

task. We show an example where traditional model based methods do not yield<br />

reasonable results. Rather bl<strong>in</strong>d source separation techniques have to be used<br />

to get mean<strong>in</strong>gful and <strong>in</strong>terest<strong>in</strong>g results concern<strong>in</strong>g the networks of activations<br />

related to a comb<strong>in</strong>ed word recognition and motor task.<br />

1.1 Model Based Approach: General L<strong>in</strong>ear Model<br />

The general l<strong>in</strong>ear model as a k<strong>in</strong>d of regression analysis has been the classic<br />

way to analyze fMRI data <strong>in</strong> the past [3]. Basically it uses second order statistics<br />

to f<strong>in</strong>d the voxels whose activations correlate best to given time courses. The<br />

measured signal for each voxel <strong>in</strong> time y =(y(t1), ..., y(tn)) T is written as a<br />

l<strong>in</strong>ear comb<strong>in</strong>ation of <strong>in</strong>dependent variables y = Xb + e, with the vector b of<br />

regression coefficients and the matrix X of the <strong>in</strong>dependent variables which <strong>in</strong><br />

case of an fMRI-analysis consist of the assumed time courses <strong>in</strong> the data and<br />

additional filters to account for the serial correlation of fMRI data. The residual<br />

error e ought to be m<strong>in</strong>imized. The normal equation X T Xb = X T y of the<br />

problem is solved by b = (X^T X)^{−1} X^T y and has a unique solution if X^T X has

full rank. F<strong>in</strong>ally a significance test us<strong>in</strong>g e is applied to estimate the statistical<br />

significance of the found correlation.<br />

As the model X must be known <strong>in</strong> advance to calculate b, this method is<br />

called “model-based”. It can be used to test the accuracy of a given model, but<br />

cannot by itself f<strong>in</strong>d a better suited model even if one exists.<br />
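As a minimal sketch of this voxel-wise estimation (our own illustration; the serial-correlation filters and the significance test mentioned above are omitted):

```python
import numpy as np

def glm_fit(y, X):
    """Ordinary least-squares fit of the general linear model y = X b + e for a
    single voxel time course y (length n) and design matrix X of shape (n, q).
    Returns the regression coefficients b and the residual e."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the normal equations
    e = y - X @ b
    return b, e
```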

1.2 Model Free Approach:<br />

BSS Us<strong>in</strong>g <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong><br />

In case of fMRI data bl<strong>in</strong>d source separation refers to the problem of separat<strong>in</strong>g<br />

a given sensor signal, i.e. the fMRI data at time t,

x(t) = A [s(t) + s_noise(t)] = ∑_{i=1}^{n} a_i s_i(t) + ∑_{i=1}^{n} a_i s_noise,i(t),

into its underlying n source signals s, with a_i being its contribution to the

sensor signal, hence its mix<strong>in</strong>g coefficient. A and s are unique except for permutation<br />

and scal<strong>in</strong>g. The functional segregation of the bra<strong>in</strong> [3] closely matches<br />

the requirement of spatially <strong>in</strong>dependent sources as assumed <strong>in</strong> spatial ICA.<br />

The term snoise(t) is the time dependent noise. Unfortunately, <strong>in</strong> fMRI the noise<br />

level is of the same order of magnitude as the signal, so it has to be taken <strong>in</strong>to<br />

account. As the noise term will depend on time, it can be <strong>in</strong>cluded as additional<br />

components <strong>in</strong>to the problem. This problem is called “under-determ<strong>in</strong>ed”<br />

or “over-complete” as the number of <strong>in</strong>dependent sources will always exceed the<br />

number of measured sensor signals x(t).<br />

Various algorithms utiliz<strong>in</strong>g higher order statistics have been proposed to solve<br />

the BSS problem. In fMRI analysis, mostly the extended Infomax (based on<br />

entropy maximisation [4, 5]) and FastICA (based on negentropy us<strong>in</strong>g fix-po<strong>in</strong>t



iteration [6]) algorithm have been used so far. While the extended Infomax algorithm<br />

is expected to perform slightly better on real data due to its adaptive<br />

nature, FastICA does not depend on educated guesses about the probability<br />

density distribution of the unknown source signals. In this paper we choose to<br />

utilize FastICA because of its low demands on computational power.<br />

2 Results<br />

First, we will present the implementation of the algorithm we used. Then we will<br />

discuss an example of an event-designed experiment and its BSS based analysis<br />

where we were able to identify a network of bra<strong>in</strong> areas which could not be<br />

detected us<strong>in</strong>g classic regression methods.<br />

2.1 Method<br />

To implement spatial ICA for fMRI data, every three-dimensional fMRI image<br />

is considered as a s<strong>in</strong>gle mixture of underly<strong>in</strong>g <strong>in</strong>dependent components. The<br />

rows of every image matrix have to be concatenated to a s<strong>in</strong>gle row-vector and<br />

with these image-vectors the mixture matrix X is constructed.<br />

For FastICA the second order correlation <strong>in</strong> the data has to be elim<strong>in</strong>ated by<br />

a “whiten<strong>in</strong>g” preprocess<strong>in</strong>g. This is done us<strong>in</strong>g a pr<strong>in</strong>cipal component analysis<br />

(PCA) step prior to the FastICA algorithm. In this step a data reduction can be<br />

applied by omitt<strong>in</strong>g pr<strong>in</strong>cipal components (PC) with a low variance <strong>in</strong> the signal<br />

reconstruction process. However, this should be handled with care as valuable<br />

high order statistical <strong>in</strong>formation can be conta<strong>in</strong>ed <strong>in</strong> these low variance PCs.<br />

The maximal variations in the time trends of the supposed word-detection ICs in our example account for only 0.7% of the measured fMRI signal.

The FastICA algorithm calculates the de-mix<strong>in</strong>g matrix W = A −1 . Then the<br />

underly<strong>in</strong>g sources S can be reconstructed as well as the orig<strong>in</strong>al mix<strong>in</strong>g matrix<br />

A. The columns of A represent the time-courses of the underlying sources which

are conta<strong>in</strong>ed <strong>in</strong> the rows of S. To display the ICs the rows of S have to be<br />

converted back to three-dimensional image matrices.

As noted before because of the high noise present <strong>in</strong> fMRI data the ICA<br />

problem will always be under-determ<strong>in</strong>ed or over-complete. As FastICA cannot<br />

separate more components than the number of mixtures available, the result<strong>in</strong>g<br />

IC will always be composed of a noise part and the “real” IC superimposed on<br />

that noise. This can be compensated by <strong>in</strong>dividually de-nois<strong>in</strong>g the IC. As a rule<br />

of thumb we decided that to be considered a noise signal the value has to be<br />

below 10 times the mean variance <strong>in</strong> the IC which corresponds to a standard<br />

deviation of about 3.<br />
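A compact sketch of this spatial-ICA pipeline, using scikit-learn's FastICA in place of the original FastICA package (our own illustration; "logcosh" is the contrast whose derivative is the tanh nonlinearity used below, and the per-component de-noising threshold is omitted):

```python
import numpy as np
from sklearn.decomposition import FastICA

def spatial_ica(volumes, n_components):
    """Spatial ICA of fMRI scans: X = A S with one flattened scan per row of X.

    volumes : array of shape (T, nx, ny, nz), one 3-d image per time point
    Returns the spatial component maps and the mixing matrix A, whose columns
    are the component time courses.  PCA whitening (and the optional dimension
    reduction to n_components) happens inside FastICA.
    """
    T = volumes.shape[0]
    X = volumes.reshape(T, -1)                           # one scan per row
    ica = FastICA(n_components=n_components, fun="logcosh")
    S = ica.fit_transform(X.T).T                         # rows: spatial maps
    A = ica.mixing_                                      # shape (T, n_components)
    maps = S.reshape(n_components, *volumes.shape[1:])   # back to 3-d maps
    return maps, A
```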

2.2 Example: <strong>Analysis</strong> of an Event-Based Experiment<br />

This experiment was part of a study to <strong>in</strong>vestigate the network <strong>in</strong>volved <strong>in</strong><br />

the perception of speech and the decoding of auditory speech stimuli. Therefore



one- and two-syllable words were divided <strong>in</strong>to several frequency-bands and then<br />

rearranged randomly to obta<strong>in</strong> a set of auditory stimuli. The set consisted of four<br />

different types of stimuli, conta<strong>in</strong><strong>in</strong>g 1, 2, 3 or 4 frequency bands (FB1–FB4)<br />

respectively. Only FB4 was perceivable as words.<br />

Dur<strong>in</strong>g the functional imag<strong>in</strong>g session these stimuli were presented pseudorandomized<br />

to 5 subjects, accord<strong>in</strong>g to the rules of a stochastic event-related<br />

paradigm. The task of the subjects was to press a button as soon as they were<br />

sure that they had just recognized a word in the sound presented. It was expected that these four types of stimuli activate different areas of the auditory system as well as, in the case of FB4, the superior temporal sulcus in the left hemisphere [8].

Prior to the statistical analysis the fMRI data were pre-processed with the<br />

SPM2 toolbox [9]. A slice-tim<strong>in</strong>g procedure was performed, movements corrected,<br />

the result<strong>in</strong>g images were normalized <strong>in</strong>to a stereotactical standard space<br />

(def<strong>in</strong>ed by a template from the Montreal Neurological Institute) and smoothed<br />

with a gaussian kernel to <strong>in</strong>crease the signal-to-noise ratio.<br />

Classical Fixed-Effect <strong>Analysis</strong>. First, a classic regression analysis with<br />

SPM2 was applied. No substantial differences <strong>in</strong> the activation of the auditory<br />

cortex apart from an overall <strong>in</strong>crease of activity with ascend<strong>in</strong>g number of frequency<br />

bands were found in three subjects. One subject showed no correlated

activity at all, two only had marg<strong>in</strong>al activity located <strong>in</strong> the auditory cortex<br />

(figure 1 (c)). Only one subject showed obvious differences between FB1 and<br />

FB4: an activation of the left supplementary motor area, the c<strong>in</strong>gulate gyrus<br />

and an <strong>in</strong>creased size of active area <strong>in</strong> the left auditory cortex for FB4 (figure 1<br />

(a),(b)).<br />

Spatial ICA with FastICA. For the sICA with FastICA [6] up to 351 three-dimensional

images of the fMRI sessions were <strong>in</strong>terpreted as separate mixtures<br />

of the unknown spatial <strong>in</strong>dependent activity signals. Because of the high computational<br />

demand each subject was analyzed <strong>in</strong>dividually <strong>in</strong>stead of a whole group<br />

ICA as proposed <strong>in</strong> [10]. A pr<strong>in</strong>cipal component analysis (PCA) was applied to<br />

whiten the data. 340 components of this PCA were reta<strong>in</strong>ed that correspond to<br />

more than 99.999% of the orig<strong>in</strong>al signals. This is still 100 times greater than<br />

the share of ICs like that shown <strong>in</strong> figure 3 on the fMRI signal. In one case only<br />

317 fMRI images were measured and all result<strong>in</strong>g 317 PCA components were<br />

reta<strong>in</strong>ed.<br />

Then the stabilized version of the FastICA algorithm was applied us<strong>in</strong>g tanh<br />

as non-l<strong>in</strong>earity. The result<strong>in</strong>g 340 (resp. 317) spatially <strong>in</strong>dependent components<br />

(IC) were sorted <strong>in</strong>to different classes depend<strong>in</strong>g on their structural localization<br />

with<strong>in</strong> the bra<strong>in</strong>. Various ICs <strong>in</strong> the region of the auditory cortex could be<br />

identified <strong>in</strong> all subjects, figure 2 show<strong>in</strong>g one example. Note that all bra<strong>in</strong> images<br />

<strong>in</strong> this article are flipped, i.e. the left hemisphere appears on the right side of<br />

the picture. To calculate the contribution of the displayed ICs to the observed<br />

fMRI data the value of its voxels has to be multiplied with the time course of<br />

its activation for each scan (lower subplot to the right of each IC plot). Also




Fig. 1. Fixed-effect analysis of the experimental data. No substantial differences between<br />

the activation <strong>in</strong> the auditory cortex correlated to (a) FB1 and (b) FB4 can be<br />

seen. (c) shows the analysis for FB4 of a different subject.<br />

Fig. 2. <strong>Independent</strong> component located <strong>in</strong> the auditory cortex and its time course.<br />

a component located at the position of the supplementary motor area (SMA)<br />

could be found <strong>in</strong> all subjects.



Fig. 3. <strong>Independent</strong> component which correspond to a proposed subsystem for word<br />

detection.<br />

Fig. 4. <strong>Independent</strong> component with activation <strong>in</strong> Broca’s area (speech motor area).<br />

The most interesting finding was an IC which represents a network of three simultaneously active areas in the inferior frontal gyrus (figure 3) in one subject. This network was suggested to be a center for the perception of speech in [8]. Figure 4 shows an IC (of the same subject) that we assume to be a network for the decision to press the button. All other subjects except one had ICs that correspond to these networks, although often separated into different components. The time course of both components matches visually very well (figure 5) while their correlation coefficient remains rather low (k_corr = 0.36), apparently due to temporary time- and baseline-shifts.



Comparison of the Regression Analysis Versus ICA. To compare the results of the fixed-effect analysis with the results of the ICA, the correlation coefficients between the expected time-trends of the fixed-effect analysis and the time-trends of the ICs were calculated. No substantial correlation was found: 87% of all these coefficients were in the range of −0.1 to 0.1, the highest coefficient found being 0.36 for an IC within the auditory cortex (figure 2). The correlation coefficients for the proposed word detection network (figure 3) were 0.14, 0.08, 0.19 and 0.18 for FB1–FB4. Therefore it is quite obvious that this network of areas in the inferior frontal gyrus cannot be detected with a classic fixed-effect regression analysis.
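The comparison itself is a plain correlation computation; a small sketch (ours, with hypothetical regressors and IC time courses standing in for the real data):

    import numpy as np

    rng = np.random.default_rng(1)
    regressors = rng.standard_normal((4, 200))    # expected time-trends for FB1-FB4
    ic_courses = rng.standard_normal((340, 200))  # activation time courses of the ICs

    # correlation of every IC time course with every expected time-trend
    corr = np.corrcoef(np.vstack([regressors, ic_courses]))[:4, 4:]
    fraction_small = np.mean(np.abs(corr) < 0.1)  # cf. the 87% reported above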

While the reasons for the differences between the activation-trends of the ICs and the assumed time-trends are still subject to on-going research, it can be expected that the results of this ICA will help to gain further information about the work flow of the brain concerning the task of word detection.

Fig. 5. The activation of the ICs shown in figure 3 (dotted) and 4 (solid), plotted for scan no. 25–75. While these time-trends obviously appear to be correlated, their correlation coefficient remains very low due to temporary baseline- and time-shifts in the trends.

3 Conclusions

We have shown that ICA can be a valuable tool to detect hidden or suspected links and activity in the brain that cannot be found using the classical approach of a model-based analysis like the general linear model. While clearly ICA cannot be used to validate a model (being in itself model-free), it can give useful hints to understand the internal organization of the brain and help to develop new models and study designs which then can be validated using a classic regression analysis.



Acknowledgment

This work was supported by the BMBF (project ModKog).

References

1. K. K. Kwong, J. W. Belliveau, D. A. Chester, I. E. Goldberg, R. M. Weisskoff, B. P. Poncelet, D. N. Kennedy, B. E. Hoppel, M. S. Cohen, R. Turner, H.-M. Cheng, T. J. Brady, B. R. Rosen, “Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation”, Proc. Natl. Acad. Sci. USA 89, 5675–5679 (1992).
2. S. Ogawa, T. M. Lee, A. R. Kay, D. W. Tank, “Brain magnetic-resonance-imaging with contrast dependent on blood oxygenation”, Proc. Natl. Acad. Sci. USA 87, 9868–9872 (1990).
3. R. S. J. Frackowiak, K. J. Friston, Ch. D. Frith, R. J. Dolan, J. C. Mazziotta, “Human Brain Function”, Academic Press, San Diego, USA, 1997.
4. A. J. Bell, T. J. Sejnowski, “An information-maximisation approach to blind separation and blind deconvolution”, Neural Computation 7(6), 1129–1159 (1995).
5. M. J. McKeown, T. J. Sejnowski, “Independent Component Analysis of FMRI Data: Examining the Assumptions”, Human Brain Mapping 6, 368–372 (1998).
6. A. Hyvärinen, “Fast and Robust Fixed-Point Algorithms for Independent Component Analysis”, IEEE Transactions on Neural Networks 10(3), 626–634 (1999).
7. F. Esposito, E. Formisano, E. Seifritz, R. Goebel, R. Morrone, G. Tedeschi, F. Di Salle, “Spatial Independent Component Analysis of Functional MRI Time-Series: To What Extent Do Results Depend on the Algorithm Used?”, Human Brain Mapping 16, 146–157 (2002).
8. K. Specht, J. Reul, “Functional segregation of the temporal lobes into highly differentiated subsystems for auditory perception: an auditory rapid event-related fMRI-task”, NeuroImage 20, 1944–1954 (2003).
9. SPM2: http://www.fil.ion.ucl.ac.uk/spm/spm2.html, July 2003.
10. V. D. Calhoun, T. Adali, G. D. Pearlson, J. J. Pekar, “A Method for Making Group Inferences from Functional MRI Data Using Independent Component Analysis”, Human Brain Mapping 14, 140–151 (2001).




Chapter 20

Signal Processing 86(3):603-623, 2006

Paper F.J. Theis and G.A. García. On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms. Signal Processing, 86(3):603-623, 2006

Reference (Theis and García, 2006)

Summary in section 1.6.3



On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms

Fabian J. Theis a,∗, Gonzalo A. García b

a Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
b Department of Bioinformatic Engineering, Osaka University, Osaka, Japan

∗ Corresponding author. Email addresses: fabian@theis.name (Fabian J. Theis), garciaga@ieee.org (Gonzalo A. García).

Abstract

The decomposition of surface electromyogram data sets (s-EMG) is studied using blind source separation techniques based on sparseness; namely independent component analysis, sparse nonnegative matrix factorization, and sparse component analysis. When applied to artificial signals we find noticeable differences of algorithm performance depending on the source assumptions. In particular, sparse nonnegative matrix factorization outperforms the other methods with regards to increasing additive noise. However, in the case of real s-EMG signals we show that despite the fundamental differences in the various models, the methods yield rather similar results and can successfully separate the source signal. This can be explained by the fact that the different sparseness assumptions (super-Gaussianity, positivity together with minimal 1-norm and fixed number of zeros respectively) are all only approximately fulfilled, thus apparently forcing the algorithms to reach similar results, but from different initial conditions.

Key words: surface EMG, blind source separation, sparse component analysis, independent component analysis, sparse nonnegative matrix factorization

1 Introduction

A basic question in data analysis, signal processing, data mining as well as neuroscience is how to represent a large data set X (observed as an (m × N)-matrix) in different ways.
One of the simplest approaches lies in a linear decomposition, X = AS, with A an (m × n)-matrix (mixing matrix) and S an (n × N)-matrix storing the sources. Both A and S are unknown, hence this problem is often described as blind source separation (BSS). To get a well-defined problem, A and S have to satisfy additional properties such as:

• the source components Si (rows of S) are assumed to be realizations of stochastically independent random variables — this method is called independent component analysis (ICA) [1,2],
• the sources S are required to contain as many zeros as possible — we then speak of sparse component analysis (SCA) [3,4],
• A and S are assumed to be nonnegative, which is denoted by nonnegative matrix factorization (NMF) [5].

The above-mentioned models as well as their interplay have recently been in the focus of many researchers, for instance concerning the question of how ICA and sparseness are related to each other and how they can be integrated into BSS algorithms [6–10], how to deal with nonnegativity in the ICA case [11,12] or how to extend NMF in order to include sparseness [13,14]. Much work has already been devoted to these subjects, and their applications to various fields are currently emerging. Indeed, linear representations such as the above have several potential applications including decomposition of objects into ‘natural’ components [5], redundancy and dimensionality reduction [2], biomedical data analysis, micro-array data mining or enhancement, feature extraction of images in nuclear medicine, etc. [1,2].

In this study, we will analyze and compare the above models, not from a theoretical point of view but rather from a concrete, real-world example, namely the analysis of surface electromyogram (s-EMG) data sets. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle [15]; its study is relevant to the diagnosis of motoneuron diseases [16] as well as neurophysiological research [17]. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use s-EMG, which is measured using non-invasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and overlap of several source signals as shown in Fig. 1(a). We have already applied ICA in order to solve the s-EMG decomposition problem [18], however performance in real-world noisy s-EMG is still problematic, and it is yet unknown if the assumption of independent sources holds very well in the setting of s-EMG.

In the present work, we apply sparse BSS methods based on various model assumptions to s-EMG signals. We first outline each of those methods and the corresponding performance indices used for their comparison. We then present the decompositions obtained with each method, and finally discuss these results in section 4.


Fig. 1. Electromyogram; (a) Example of a signal obtained from a surface electrode showing the superposition of the activity of several motor units. (b) α-motoneurons convey the commands from the central nervous system to the muscles. A motor unit consists of an α-motoneuron and the muscle fibres it innervates.

2 Method

2.1 Signal origin and acquisition

The central nervous system conveys commands to the muscles by trains of electric impulses (firings) via α-motoneurons, whose bodies are located in the spinal cord. The terminal axons of an α-motoneuron innervate a group of muscle fibres. A motor unit (MU) consists of an α-motoneuron and the muscle fibres it innervates, Fig. 1(b). When a MU is active, it produces a train of electric impulses called motor unit action potential train (MUAPT). The s-EMG signal is composed of the weighted summation of several MUAPTs; an example is given in Fig. 1(a).

2.1.1 Artificial signal generator

We developed a multi-channel s-EMG generator based on Disselhorst-Klug et al.’s model [19], employing Andreassen and Rosenfalck’s conductivity parameters [20], and the firing rate statistical characteristics described by Clamann [21].

The volume conductor between a muscle fibre and the skin surface where the electrodes are placed acts as a spatial low-pass filter; hence, a source signal is attenuated in direct proportion to its distance from the electrode. Using Griep’s tripole model [22], we can calculate the spatial potential distribution generated on the skin surface by the excitation of a single muscle fibre.


Fig. 2. Artificially created signals. (a) Five synthetic motor unit action potential trains (source signals, MUAPTs) generated by the model. (b) Artificial surface electromyogram (s-EMG); location of the sources (motor units, encircled asterisks) with respect to the electrode locations marked by an x. The y-axis represents the depth inside the upper limb. (c) Artificially created mixture signal — only the first 5000 samples (out of 15000) are shown. The eight-channel signal is generated from the five source signals (a), using the mixing matrix illustrated in (d).

The potential distribution generated by the firing of a single motor unit can be calculated as the linear superposition of the contributions of all its constitutive muscle fibres, which are not located in one plane. The aforementioned model can be considered as a linear instantaneous mixing process. In reality, however, it is well known that the model could exhibit slightly convolutive behavior, which is due to the fact that the media crossed by the s-EMG sources is anisotropic, provoking delayed mixtures of the same signals (convolutions) rather than an instantaneous mixture of them. However, in the muscle model used, the distances between the different muscle fibres have been increased by the anisotropy factor A, calculated as the ratio between the radial conductivity (0.063 S/m) and the axial conductivity (0.328 S/m), see Tab. 1 for parameter details. Afterwards, the potential distribution can be calculated as if the volume conductor were isotropic. The channels composing the synthetic s-EMG contain the same MUAP without any time delay among themselves, hence producing an instantaneous mixture with the other source signals.

    name           meaning                                  value
    num            number of MUs                            5
    siz            size of each MU, in number of fibres     20-30 (uniformly distr.)
    cfr            central firing rate                      20 Hz
    ED             electrode inter-pole distance            2.54 mm
    CH             number of generated channels             8
    meanIPI        mean inter-pulse interval                1/cfr · 1000 ms
    IPIstd         Clamann formula for IPI deviation        9.1 · 10^-4 · meanIPI^2 + 4
    SY             radial conductivity                      0.063 S/m
    SZ             axial conductivity                       0.328 S/m
    A              anisotropy ratio                         SZ/SY = 5.2
    SI             intracellular conductivity               1.010 S/m
    D              fibre diameter                           50 · 10^-6 m
    CV             conduction velocity                      4.0 m/s
    FatHeight      fat-skin layer thickness                 3 mm
    MUterritory    XY-plane radius of each MU               1.0 mm
    DistEndplate   endplate-to-electrode distance           20 mm
    FibZsegment    length of the endplate in the Z axis     0.4 mm
    CondVolCond    conductivity of the volume conductor     0.08 S/m

Table 1. Parameters used for artificial s-EMG generation.

We have generated 100 synthetic, eight-channel s-EMG signals of 1.5-second duration. Each s-EMG was composed of five MUs randomly located under a 3 mm fat and skin layer in the detection area of the electrode-array (up to ±2 mm of the sides of the electrode array and up to a depth of 6 mm). Each MU in turn consisted of 20 to 30 fibres (uniform distribution) of infinite length located randomly in a 1 mm circle in the XY plane. The conduction velocity of the fibres was 4.0 m/s. Note that an average biceps brachii MU is generally composed of at least 200-300 muscle fibres [23]; however, for the sake of computational speed, that number was reduced to one tenth, which only resulted in the reduction of the amplitude of the final MUAPs. This does not have any influence in our case, as the algorithms are not affected by the general amplitude of the signals, only by their respective proportions, if at all.

The simulated electrode-array used to generate the mixed recordings had the same dimensions and characteristics as the real one and was virtually located 20 mm apart from the endplate region, which had a thickness of 0.4 mm. The X axis corresponded to the length of the electrode, the Y axis to the depth, and the Z axis to the fibre direction. The mean instantaneous firing rate of each MU was 20 firings/second (mean inter-pulse interval meanIPI of 50 ms), with a standard deviation calculated following Clamann’s model as 9.1 · 10^-4 · meanIPI^2 + 4 [21].

The s-EMG signals were generated as follows: each muscle fibre composing a firing MU generates an action potential following the moving tripole model ([24] and references therein); the final MUAP appearing in each channel is the summation of those action potentials after applying the anisotropy coefficient A and the fading produced by the media crossed by the signal ([19] and references therein). Note that the model generally holds for any skeletal muscle. However, some of the parameters (such as the fat layer and firing rate) were chosen following the average, normal values of the biceps brachii [25].

A single such data set is visualized in Fig. 2. The eight-dimensional s-EMG observations that have been generated from the five underlying MUAPTs are shown in Fig. 2(a). Their locations are depicted in Fig. 2(b) and the mixing matrix used to generate the synthetic data set (see Fig. 2(c)) is shown in Fig. 2(d). We will use several (100) of these signals with randomly chosen source locations in batch runs for statistical comparisons.

2.1.2 Real s-EMG signal acquisition

Fig. 3 shows the experimental setting for the acquisition of real s-EMG recordings. The subject (after giving informed consent) sat on a chair and was strapped to it with nylon belts as shown in Fig. 3(a). The subject’s forearm was put in a cast fixed to the table, the shoulder angle being approximately 30°. To measure the torque exerted by the subject, a strain gauge was placed between the forearm cast and the table fixation. An electrode array was placed on the biceps short head muscle belly, out of the innervation zone, transverse to the direction of muscle fibers, after cleaning the skin with alcohol and SkinPure (Nihon Kohden Corp, Tokyo, Japan). The electrode array consists of 16 stainless steel poles of 1 mm in diameter, 3 mm in height, and 2.54 mm apart in both plane directions that are attached to a rigid acrylic plate, see Fig. 3(b). The poles of each bipolar electrode were placed at the shortest possible distance to obtain sharper MUAPs.


Fig. 3. Experimental setting; (a) Eight-channel s-EMG was recorded during an isometric, constant force at 30% of the subject’s MVC. (b) Detail of the s-EMG bipolar electrode array used for the recordings.

Eight-channel EMG bipolar signals were measured (input impedance above 10^12 Ω, CMRR of 110 dB), filtered by a first-order, band-pass Butterworth filter (cut-off frequencies of 70 Hz and 1 kHz), and then amplified (gain of 80 dB). The target torque and the measured torque were displayed on an oscilloscope as a thick and as a thin line respectively, which the subject was asked to match. The eight-channel s-EMG was sampled at a frequency of 10 kHz, 12 bits per sample, by a PCI-MIO-16E-1 A/D converter card (National Instruments, Austin, Texas, USA) installed in a PC, which was also equipped with a LabView (National Instruments, Austin, Texas, USA) program in charge of controlling the experiments. The signals were recorded during a 5 s period of constant-force isometric contractions at 30% maximum voluntary contraction (MVC).

Before applying any BSS method, the signal was preprocessed using a modified dead-zone filter: a nonlinear filter realized by an algorithm developed to retain the MUAPs belonging to the target MUAPT and to remove both the noise and the low-amplitude MUAPs that did not reach the given thresholds. In former works, these thresholds were initially set at a level three times above the noise level (recording at 0% MVC) and then adjusted with the aid of an incorporated graphical user interface [18]. In the present work, the thresholds were set to 10% for all the signals so that we may compare all obtained results easily. The algorithm looks for zero-crossings on the signal and defines a waveform as the samples of the considered signal comprised between two consecutive zero-crossings. If the maximum (respectively, minimum in the case of a negative waveform) is above (respectively, below) the threshold, that waveform is kept in the signal given as output. Otherwise, the waveform is substituted by a zero-voltage signal on that period.
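A compact sketch of such a zero-crossing-based dead-zone filter (our reconstruction of the idea, not the original implementation; the threshold is passed in directly):

    import numpy as np

    def dead_zone_filter(x, threshold):
        """Keep every waveform (the samples between two consecutive zero-crossings)
        whose peak amplitude reaches the threshold; replace all others by zeros."""
        x = np.asarray(x, dtype=float)
        y = np.zeros_like(x)
        sign = np.signbit(x)
        crossings = np.where(sign[1:] != sign[:-1])[0] + 1   # waveform boundaries
        bounds = np.concatenate(([0], crossings, [x.size]))
        for a, b in zip(bounds[:-1], bounds[1:]):
            segment = x[a:b]
            if segment.size and np.max(np.abs(segment)) >= threshold:
                y[a:b] = segment                              # waveform is kept
        return y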

It is quite important to use such a filter, because although BSS algorithms are generally able to separate the noise as a component, they are usually unable to separate more source signals than available channels. Generally in our experiments there were more MUs than available channels even at low levels of contraction, but only a few of them had an amplitude above the noise level. In order to eliminate the activity of some MUs we delete the MUAPTs belonging to distant MUs, whose MUAPs are less powerful. This is enhanced by PCA preprocessing and dimension reduction as discussed later in section 3.2.

2.2 Blind source separation

Linear blind source separation (BSS) describes the task of blindly recovering A and S in the equation

$$X = AS + N \qquad (1)$$

where X consists of m signals with N observations each, put together into an (m × N)-matrix. A is a full-rank (m × n)-matrix, and we typically assume that m ≥ n (complete and under-complete case). Moreover N models additive noise, which is commonly assumed to be white Gaussian. Depending on the assumptions on A and S we get different models. Our goal is to estimate the mixing matrix A. Then the sources can be recovered by applying the pseudoinverse A⁺ of A to the mixtures X (which is optimal in the maximum-likelihood sense when Gaussian noise is assumed).

In the case of s-EMG data, the original signals are the MUAPTs generated by the motor units active during a sustained contraction. In this setting, A quantifies how each source contributes to each observation. Of course additional requirements will have to be applied to the model to guarantee a satisfactorily small space of solutions, and depending on the assumptions on A and S we will obtain different models.
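As a small illustration of this recovery step (a sketch under the stated model; the mixing-matrix estimate is a stand-in for the output of any of the BSS methods below):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, N = 8, 3, 15000
    A = rng.random((m, n))                           # nonnegative mixing matrix
    S = rng.laplace(size=(n, N))                     # sparse (super-Gaussian) sources
    X = A @ S + 0.01 * rng.standard_normal((m, N))   # noisy mixtures, cf. eq. (1)

    A_est = A                                        # stand-in for a BSS estimate of A
    S_est = np.linalg.pinv(A_est) @ X                # source recovery via the pseudoinverse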


2.2.1 Independent component analysis

Independent component analysis (ICA) describes the task of (here: linearly) transforming a given multivariate random vector such that its transform is stochastically independent. In our setting the random vector is given by N realizations, and ICA is applied to solve the BSS problem (1), where S is assumed to be independent. As in all BSS problems, a key issue lies in the question of identifiability of the model, and it can be shown that A (and hence S) is already uniquely — except for column permutation and scaling — determined by X if S contains at most one Gaussian and is square integrable [26,27]. This enables us to apply ICA to the BSS problem and to recover the original sources.

The idea of ICA was first expressed by Jutten and Hérault [28], while the term ICA was later coined by Comon [26]. In contrast to principal component analysis (PCA), ICA uses higher-order statistics to fully separate the data. Typical algorithms are based on contrasts such as minimum mutual information, maximum entropy, diagonal cumulants or non-Gaussianity. For more details on ICA we refer to the two available excellent textbooks [1,2].

In the following we will use the so-called JADE (joint approximate diagonalization of eigenmatrices) algorithm, which identifies the sources using the fact that, due to independence, the cross-cumulants of the sources vanish, i.e. their fourth-order cumulant tensor is diagonal. Furthermore, fixing two indices of the 4th-order cumulants, it is easy to see that such cumulant matrices of the mixtures are diagonalized by A.

After pre-processing by PCA, we can assume that A is orthogonal. Then diagonalization of one mixture cumulant matrix already yields A, given that its eigenvalues are pairwise different. However, in practice this is not always the case; furthermore, estimation errors could result in a bad estimate of the cumulant matrix and hence of A. Therefore, joint diagonalization of a whole set of cumulant matrices yields an improved estimate of A. Algorithms for actually performing joint diagonalization include gradient descent on the sum of off-diagonal terms, iterative construction of A by Givens rotations in two coordinates [29], an iterative two-step recovery of A [30] or — more recently — a linear least-squares algorithm for diagonalization [31], where the latter two algorithms can also search for non-orthogonal matrices A.
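The following numerical sketch (ours, for illustration only; it is not the JADE implementation used later) shows the underlying idea for whitened mixtures: a contracted fourth-order cumulant matrix of the whitened data is diagonalized by the remaining orthogonal mixing matrix, so its eigenvectors already estimate it, and JADE improves on this by jointly diagonalizing a whole set of such matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 50000
    S = rng.laplace(size=(n, N))                  # independent super-Gaussian sources
    A = rng.random((n, n))                        # mixing matrix
    X = A @ S

    # whitening: afterwards the remaining mixing matrix is orthogonal
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    W = E @ np.diag(d ** -0.5) @ E.T
    Z = W @ X

    # cumulant matrix with two indices contracted against a symmetric M:
    # for whitened z, Q(M) = E[z z^T (z^T M z)] - tr(M) I - 2 M
    M = rng.standard_normal((n, n))
    M = M + M.T
    q = ((M @ Z) * Z).sum(axis=0)                 # z(t)^T M z(t) for every sample
    Q = (Z * q) @ Z.T / N - np.trace(M) * np.eye(n) - 2 * M

    # Q is (approximately) diagonalized by the orthogonal mixing matrix, so its
    # eigenvectors recover it; undoing the whitening yields an estimate of A,
    # unique up to column scaling, permutation and sign
    _, V = np.linalg.eigh(Q)
    A_est = np.linalg.inv(W) @ V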

One method we use for analyzing the s-EMG signals is JADE-based ICA. Confirming results from [32], we show that ICA can indeed extract the underlying sources. In the case of s-EMG signals, all sources are strongly super-Gaussian and can therefore safely be assumed to be non-Gaussian, so identifiability holds. However, due to the nonnegativity of A, the scaling indeterminacy is reduced to multiplication with a positive scalar in each column. If we additionally use the common assumption of unit variance of the sources, this already eliminates the scaling indeterminacy. In order to use an ordinary ICA algorithm, we simply have to add a ‘postprocessing’ stage: to guarantee a nonnegative matrix, column signs are flipped to have only (or as many as possible) nonnegative column entries. Also note that statistical independence, meaning that the multivariate probability densities factorize, is not related to the synchrony in the firing of MUs [33] — otherwise overlapping MUs could not be separated.

We finally want to remark that ICA can also be interpreted as a sparse signal decomposition method in the case of super-Gaussian sources. This follows from the fact that a good and often-used contrast for ICA is given by maximization of non-Gaussianity [34] — this can be approximately derived from the fact that, due to the central limit theorem, a mixture of independent sources tends to be more Gaussian than the sources, so the process can be inverted by maximizing non-Gaussianity. In our setting the sources are very sparse, hence strongly non-Gaussian. An ICA decomposition is therefore closely related to a decomposition into parts of maximal sparseness — at least if sparseness is measured using kurtosis.

2.2.2 (Sparse) nonnegative matrix factorization

In contrast to other matrix factorization models such as PCA, ICA or SCA, nonnegative matrix factorization (NMF) strictly requires both matrices A and S to have nonnegative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [5]. If additionally some sparseness constraints are put on A and S, we speak of sparse NMF, see [14] for more details.

Typically, NMF is performed using a least-squares (Euclidean) contrast

$$E(A, S) = \|X - AS\|^2, \qquad (2)$$

which is to be minimized. This optimization problem, albeit convex in each variable separately, is not convex in both at the same time and hence direct estimation is not possible. Paatero [35] minimizes (2) using a gradient algorithm, whereas Lee and Seung [36] develop a multiplicative update rule increasing algorithm performance considerably.
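A minimal sketch of these multiplicative updates for the Euclidean contrast (2) (our illustration with assumed shapes, not the implementation used in this paper):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, N = 8, 3, 1000
    X = rng.random((m, N))                  # nonnegative data matrix

    A = rng.random((m, n)) + 1e-3           # nonnegative initializations
    S = rng.random((n, N)) + 1e-3
    eps = 1e-9                              # guards against division by zero

    for _ in range(500):
        # multiplicative updates keep A and S nonnegative and do not increase ||X - AS||^2
        S *= (A.T @ X) / (A.T @ A @ S + eps)
        A *= (X @ S.T) / (A @ S @ S.T + eps)

    residual = np.linalg.norm(X - A @ S)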

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [14] proposes a modification of the NMF model to include sparseness. However, a simple modification of the cost function (2) could yield undesirable local minima, so instead he chooses to minimize (2) under the constraint of fixed sparseness of both A and S. Here, sparseness is measured by combining the Euclidean norm $\|\cdot\|_2$ and the 1-norm $\|x\|_1 := \sum_i |x_i|$ as follows:

$$\mathrm{sparseness}(x) := \frac{\sqrt{n} - \|x\|_1/\|x\|_2}{\sqrt{n} - 1} \qquad (3)$$

if $x \in \mathbb{R}^n \setminus \{0\}$. So sparseness(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
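The measure (3) translates directly into code; a small sketch (ours):

    import numpy as np

    def hoyer_sparseness(x):
        """Sparseness from equation (3): 1 for a vector with a single non-zero entry,
        0 when all entries share the same absolute value."""
        x = np.asarray(x, dtype=float)
        n = x.size
        return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

    hoyer_sparseness([0.0, 0.0, 5.0])   # -> 1.0 (maximally sparse)
    hoyer_sparseness([1.0, -1.0, 1.0])  # -> 0.0 (not sparse at all)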

The devised algorithm is based on the iterative application of a gradient descent step and a projection step, thus restricting the search to the subspace of sparse solutions. We perform the factorization using the publicly available Matlab library nmfpack (http://www.cs.helsinki.fi/u/phoyer/), which is used in [14].

So NMF decomposes X into nonnegative A and nonnegative S. The assumption that A has nonnegative coefficients is very well fulfilled by s-EMG recordings; however, as seen before, the sources also have negative entries. In order to be able to apply the algorithms, we therefore preprocess the data using the function

$$\kappa(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases} \qquad (4)$$

to cut off negative values; this yields the new random vector (sample matrix) $X_+ := (\kappa(X_1), \ldots, \kappa(X_n))^\top$. For comparison, we also construct a new sample set by simply leaving out samples that have at least one negative value. Here we model this by the random vector X∗.
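In code, the two preprocessing variants amount to the following (a sketch; X stands for the matrix of mixtures):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 1000))      # stand-in for the s-EMG mixtures

    X_plus = np.clip(X, 0.0, None)          # kappa of eq. (4) applied entrywise
    X_star = X[:, (X >= 0).all(axis=0)]     # keep only fully nonnegative samples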

2.2.3 Sparse component analysis

Sparse component analysis (SCA) [3,4] requires strong sparseness in the sources only — this is then sufficient to decompose the observations. In order to define the SCA model, a vector x ∈ R^n is said to be k-sparse if x has at most k non-zero entries. This k-sparseness implies k0-sparseness for k0 ≥ k. If an n-dimensional vector is k-sparse for k = n − 1, it is simply said to be sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose X into X = AS as above such that each sample (i.e. column) of S is k-sparse. In the following we will assume k = n − 1.

Note that, in contrast to the ICA model, the above model is not translation invariant. However, it is easy to see that if instead of A we allow an affine linear transformation, the translation constant can be determined from X only as long as the sources are non-deterministic. In other words, instead of assuming k-sparseness of the sources we could also assume that at any time instant only k source components are allowed to vary from a previously fixed constant (which can be different for each source).

In [3] we showed that under slight conditions k-sparseness already guarantees identifiability of the model, even in the case of less observations than sources. In the setting of s-EMG however we are in the comfortable situation of having more observations than sources, so as in the ICA case we preprocess our data using PCA projection — this dimension reduction algorithm can be applied even to our case of non-decorrelated sources as (given low or no noise) the first three principal components will span the source signal subspace, see comment in section 3.2. The above uniqueness result is based on the fact that due to the assumed sparseness the data clusters into a fixed number of hyperplanes. This fact can also be used in an algorithm to actually reconstruct A by identifying the set of hyperplanes. From the hyperplanes, A can be recovered by simply taking intersections.

However, multiple hyperplane identification is non-trivial, and the involved cost function

$$\sigma(A) = \frac{1}{N} \sum_{t=1}^{N} \min_{i=1,\ldots,n} \frac{|a_i^\top X(t)|}{\|X(t)\|}, \qquad (5)$$

where $a_i$ denote the columns of A, is highly non-convex. In order to improve the robustness of the proposed, stochastic identifier, we developed an identification algorithm using a generalized Hough transform [37]. Alternatively a generalization of k-means clustering can be used, which iteratively clusters the data into groups belonging to the different hyperplanes, and then identifies a hyperplane within the cluster by regression [38].
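For reference, the cost (5) is straightforward to evaluate for a candidate matrix (a sketch, ours):

    import numpy as np

    def sca_cost(A, X, eps=1e-12):
        """Evaluate sigma(A) from equation (5): the average over all samples X(t) of
        min_i |a_i^T X(t)| / ||X(t)||, the a_i being the columns of A."""
        proj = np.abs(A.T @ X)                   # |a_i^T X(t)| for all columns i and samples t
        norms = np.linalg.norm(X, axis=0) + eps  # ||X(t)||
        return float(np.mean(proj.min(axis=0) / norms))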

In this paper, we assume sparseness of the sources S in the sense that at least one coefficient of S at a certain time instant has to be zero. In the case of s-EMG, the maximum natural firing rate of a motor unit is about 30 pulses/second, each pulse lasting less than 15 ms [39]. Therefore, a motor unit is active less than 450 ms per second; that is, at least 55% of the time each source signal is zero. In addition, the firings of different motor units are not synchronized (only their respective firing rates show the tendency to change together, the so-called common drive [40]). For these reasons, the probability of all n sources firing at a given time instant is bounded by 0.45^n, which quickly approaches zero for increasing n. Even in the case of only n = 3 sources, at most 9% of the samples are fully active. Hence the source conditions for SCA should be rather well fulfilled, and we can find isolated MUAPs inside an s-EMG signal using SCA with high probability.


2.3 Measures used for comparison

In order to compare the recovered signals with the artificial sources, we simply compare the mixing matrices. For this we employ Amari’s separation performance index [41], which is given by the equation

$$E_1(P) = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=1}^{n} \left( \sum_{i=1}^{n} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)$$

where $P = (p_{ij}) = \hat{A}^+ A$, A being the real mixing matrix and $\hat{A}^+$ the pseudoinverse of its estimation $\hat{A}$. Note that $E_1(P) \leq 2n(n-1)$. For both the artificial and the real signals, we calculate $E_1(\hat{A}_1^+ \hat{A}_2)$, where $\hat{A}_i$ are the two recovered mixing matrices.

Furthermore, we also want to study the recovered signals. So in order to be able to compare between different methods, to each pair of components obtained with each method, as well as to the source signals and the synthetic s-EMG channels, we apply the following equivalence measures: Principe’s quadratic mutual information (QMI) [42], Kullback-Leibler information distance (K-LD) [43], Renyi’s entropy measure [43], mutual information measure (MuIn) [42], Rosenblatt’s squared distance functional (RSD) [43], Skaug and Tjøstheim’s weighted difference (STW) [43] and cross-correlation (Xcor). All the measures are normalized with respect to the maximum value obtained when applied to each component with itself; that is, we divide by the maximum of each comparison matrix diagonal (maximum mutual information).

For the calculation of the above-mentioned indices it is necessary to estimate both the joint and the marginal probability density functions of the signals. We have decided to use the data-driven Kn-nearest-neighbour (KnNN) method [44] (for details, refer to [32]). Furthermore, in order to compare separation performance in the presence of noise, we measure the strength of a one-dimensional signal S versus additive noise $\tilde{S}$ using the signal-to-noise ratio (SNR) defined by

$$\mathrm{SNR}(S, \tilde{S}) := 20 \log_{10} \frac{\|S\|}{\|S - \tilde{S}\|}.$$
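Both comparison measures are simple to compute; a sketch (ours), following the conventions above:

    import numpy as np

    def amari_index(P):
        """Amari separation performance index E1(P): 0 for a perfect recovery
        (P a scaled permutation matrix), at most 2n(n-1) in general."""
        P = np.abs(np.asarray(P, dtype=float))
        rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
        cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
        return float(rows.sum() + cols.sum())

    def snr(S, S_noisy):
        """Signal-to-noise ratio in dB between a signal and its corrupted version."""
        return 20 * np.log10(np.linalg.norm(S) / np.linalg.norm(S - S_noisy))

    # e.g. two estimated mixing matrices A1, A2 are compared via
    # amari_index(np.linalg.pinv(A1) @ A2)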

3 Results

In this section we compare the various models for source separation when applied to both toy and real s-EMG recordings.


Fig. 4. Mixing matrices recovered for the synthetic s-EMG using different methods; (a) ICA using joint approximate diagonalization of eigenmatrices (JADE), (b, c) nonnegative matrix factorization with different preprocessing (NMF, NMF∗), (d, e) sparse NMF (sNMF, sNMF∗) with the same two preprocessing methods as NMF, and (f) sparse component analysis (SCA).

3.1 Artificial signals

In the first example, we compare performance in the well-known setting of artificially created toy-signals.

3.1.1 Single s-EMG

For visualization, we will first analyze a single artificial s-EMG recording, and only later present batch-runs over multiple separate realizations to test for statistical robustness. As data set, we use toy signals as in section 3.1 but with only three source components for easier visualization. The ICA result is produced using JADE after PCA to 3 components. Please note that here and in the following we perform dimension reduction because in the small sensor volumes in question not many MUs are present. This is confirmed by considering the eigenvalue structure of the covariance matrix: taking the mean over 10 real s-EMG data sets — further discussed in section 3.2 — the ratio of the third to the first largest eigenvalue is only 0.11, and the ratio of the fourth to the first only 0.04. Taking sums, in the mean we lose only 5.7% of raw data by dimension reduction to the first three eigenvalues, which lies easily in the range of the noise level of typical data sets.

Fig. 5. One of the three recovered sources after applying the pseudoinverse A⁺ of the estimated mixing matrices to the synthetic s-EMG mixtures X; (a-f) results obtained using the different methods, see Fig. 4; (g) for comparison, also the original source signal is shown.

In order to reduce the ever-present permutation and scaling ambiguity, the columns of the recovered mixing matrix A_JADE are normalized to unit length; furthermore, the column sign is chosen to give a positive sum of coefficients (because A is assumed to have only positive coefficients), and the columns are permuted such that the index of the maximum is increasing, Fig. 4(a). The three components are mostly active in channels 3, 4 or 5 respectively, which coincides with our construction (the real sources lie closest to sensors 4, 3 and 5 respectively). Fig. 5(a) shows one of the recovered sources, and for comparison, Fig. 5(g) shows the original source signal.
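In code, this postprocessing of a recovered mixing matrix reads (a sketch, ours):

    import numpy as np

    def normalize_mixing(A):
        """Reduce the permutation and scaling ambiguity: unit-length columns, signs
        chosen for positive column sums, columns ordered by the row index of their maximum."""
        A = A / np.linalg.norm(A, axis=0, keepdims=True)
        A = A * np.sign(A.sum(axis=0, keepdims=True))
        return A[:, np.argsort(A.argmax(axis=0))]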

We subsequently apply NMF and sparse NMF to both X+ and X∗ (denoted by NMF, NMF∗, sNMF and sNMF∗ respectively); in all cases we achieve fast convergence. The mixing matrices obtained are shown in Fig. 4(b-e). In all four cases we observe a good recovery of the source locations. Cross-multiplication of these matrices with their pseudoinverses shows a high similarity, and the recovered source signals are similar and match well the original source, see Fig. 5(b-e).


Fig. 6. Inter-component performance index comparisons. (a) Comparison of the Amari index of matrices acquired from two different methods each in the case of synthetic s-EMG decomposition. As the comparison matrix is symmetric, half of its values have been omitted for clarity. (b) Mean of the inter-components mutual information for each method estimated by different measures. For comparison, the measures corresponding to the source signals (minimal indices) and to the channels of the mixed s-EMG (maximal indices) are also shown.

Finally, we perform SCA using a generalized Hough transform [37]. Note that there are also other algorithms possible for such extraction, and model generalizations are possible, see for example [45]. After whitening and dimension reduction, we perform Hough hyperplane identification with bin-size 180 and manually identify the maxima in the obtained Hough accumulator. The recovered mixing matrix is again visualized in Fig. 4(f). Similar to the previous results, the three components are roughly most active at locations 2 to 3, 4 and 5 respectively. Multiplication with the pseudoinverse of the recovered matrices from the above experiments shows that the result coincides quite well with the matrices from NMF, but differs slightly more from the ICA recovery.

We calculate and compare the Amari index for each method and, as shown in Fig. 6(a), no major differences can be detected. A comparison of the recovered sources using the indices from section 2.3 is given in Fig. 6(b), where also the indices corresponding to the original source signals (minimum mutual information values) and to the channels of the s-EMG (maximum values) have been added. All the methods separate the mixtures (improvement in terms of indices) but the methods yield somewhat different results, with JADE giving quite different sources than the rest of the methods, and NMF and NMF∗ performing rather similarly. In terms of source independence, of course JADE scores best ratings in the indices, as can be confirmed by calculating the Amari index of the recovered matrix with the original source matrix:


                 JADE    NMF     NMF∗    sNMF    sNMF∗   SCA     sources
mean kurtosis    10.80   8.48    8.80    8.85    9.64    7.14    11.4
sparseness       0.286   0.268   0.295   0.270   0.302   0.201   0.252
σ(A✷)            2.91    2.02    2.11    1.99    2.36    0.96    3.24

Table 2. Sparseness measures of the recovered sources using the various methods. The first row gives the mean kurtosis (the higher, the more 'spiky' the signal), the second row the sparseness index from equation (3) (the higher, the sparser the signal), and the third row the cost function (5) employed in SCA (the lower, the better the SCA criterion is fulfilled). Among the methods, the optimal values are the kurtosis of JADE, the sparseness index of sNMF∗ and the σ value of SCA.

              A_JADE   A_NMF   A_NMF∗   A_sNMF   A_sNMF∗   A_SCA
E1(A✷⁺ A)      1.39     3.27    2.96     3.41     2.56      3.86

JADE clearly outperforms the other methods — this will be explained in the next paragraph. However, we have to add that the signal generation additionally involves a slightly nonlinear filter, so we can only estimate the real mixing matrix A from the sources and the mixtures, which yields a non-negligible error. Hence this result indicates that JADE could best approximate the linearized system.
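To make this comparison concrete, the following is a minimal sketch of the widely used Amari performance index [41], computed from the pseudoinverse of a recovered mixing matrix and the true one. The exact normalization of E1 used in this paper is not reproduced in this excerpt, so the sketch is an illustration rather than the paper's implementation.

    import numpy as np

    def amari_index(A_est, A_true):
        """Amari performance index of P = pinv(A_est) @ A_true.

        The index vanishes exactly when P is a scaled permutation matrix,
        i.e. when the mixing matrix is recovered up to the usual BSS
        indeterminacies (scaling and ordering of the sources).
        """
        P = np.abs(np.linalg.pinv(A_est) @ A_true)
        rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
        cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
        return float(rows.sum() + cols.sum())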

One key question in this work is how the different models induce sparseness. Clearly, sparseness is a rather ambiguous term, so we calculate three indices (kurtosis, 'sparseness' from equation (3) and σ from (5)) for the real as well as the recovered sources of the proposed methods, see Tab. 2. As expected, ICA gives the highest kurtosis among all methods, whereas sparse NMF yields the highest values of the sparseness criterion, and SCA has the lowest, i.e. best, value of the k-sparseness cost measured by σ(A). We further see that the kurtosis and the sparseness criterion seem to be somewhat related on this data set, as high values in both indices are achieved by both JADE and sNMF. The k-sparseness criterion, which fixes only the zero-(semi)norm, i.e. requires a fixed number of zeros in the data without additional requirements on the other values, does not induce as high a sparseness when measured by kurtosis or the sparseness index, and vice versa. Finally, by looking at the sparseness indices of the real sources, we can now understand why JADE outperformed the other methods in this toy data setting — the kurtosis of the sources is indeed high and their mutual information low, see Fig. 6(b). In terms of sparseness, and especially σ, however, the sources are not as sparse as expected. Hence the (sparse) NMF and mainly the SCA algorithm could not perform as well as JADE when compared to the original sources, as noted above. However, we will see that in the case of real s-EMG signals this distinction breaks down; furthermore, sparse NMF turns out to be more robust against noise than JADE, as is shown in the following.
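Equations (3) and (5) are not reproduced in this excerpt; as a rough illustration of the first two rows of Tab. 2, the sketch below computes plain excess kurtosis and a Hoyer-style sparseness index (the measure introduced in [14]), which we assume is close in spirit to equation (3).

    import numpy as np

    def excess_kurtosis(s):
        """Excess kurtosis of a 1-d signal; large values indicate 'spiky' sources."""
        s = s - s.mean()
        return float(np.mean(s**4) / np.mean(s**2) ** 2 - 3.0)

    def hoyer_sparseness(s):
        """Hoyer-style sparseness in [0, 1]: 0 for a flat signal, 1 for a 1-sparse one."""
        n = s.size
        l1, l2 = np.abs(s).sum(), np.linalg.norm(s)
        return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))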


[Figure 7: box plots over JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA; (a) Amari index, (b) SNR.]

Fig. 7. Boxplot of the separation performance when identifying 5 sources in 8-dimensional observed artificial s-EMG recordings. (a) shows the mean Amari index of the product of the identified separation matrix and the real mixing matrix, and (b) depicts the SNR of the recovered sources versus the real sources. Mean and variance were taken over 100 runs.

3.1.2 Multiple s-EMG experiments

We now show the performance of the above algorithms when applied to 100 different realizations of artificial s-EMG data sets. The data consist of 8 channels with 5 underlying source activities; more details about data generation are given in section 2.1.1. The algorithm parameters are the same as above, with the exception that automated SCA is performed using Mangasarian-style clustering [38] after PCA dimension reduction to 5 dimensions.

The ICA and sparse BSS algorithms from above are applied to these data sets, and the Amari index as well as the SNR of the recovered sources versus the original sources are stored. In Fig. 7, the means of these two indices as well as their deviations are shown in a box plot, separately for each algorithm. As in the single s-EMG experiment, these statistics confirm that JADE performs best, both in terms of matrix and of source recovery (which is more or less the same because we are still dealing with the noiseless case). The NMF algorithms identify the mixing matrix with acceptable performance; however, (sparse) NMF taking only positive samples (sNMF∗) tends to separate the data slightly better than sample preprocessing using κ from equation (4). SCA cannot detect the mixing matrix as well as the other BSS algorithms — again the SCA conditions seem to be somewhat violated — but it performs adequately well at recovering the sources, because some sources are recovered very well, resulting in a higher SNR than for the NMF algorithms. For practical purposes, it is important to check to what extent the signal-to-interference ratio (SIR) between the channels is improved after applying the BSS algorithms. For each run, we monitor the SIR of the original sources and of the recoveries by taking the mean over all channels. The two SIR means are


divided to give a mean improvement ratio. Taking again the mean over the 100 runs yields the following improvements:

                         JADE   NMF    NMF∗   sNMF    sNMF∗   SCA
mean SIR improvement     4.15   2.54   2.87   0.701   1.90    3.08

This confirms our results; JADE works very well for preprocessing, but so do NMF∗ and, interestingly, SCA.

In order to test the robustness of the methods against noise, we recorded s-EMG at 0% MVC, that is, when the muscle was totally relaxed; in this way we obtained a recording of noise only. As expected, the resulting recordings have nearly the same means and variances, and are close to Gaussian. This is confirmed by a Jarque-Bera test, which asymptotically tests the goodness-of-fit of an observed signal to a normal distribution. We recorded two different noise signals, and the test was positive in 11 out of 16 cases (at significance level α = 5%); the 5 exceptions had p-values not lower than 0.001. Furthermore, the noise is independent as expected: it has a close-to-diagonal covariance matrix and an Amari index of 2.1, which is quite low for 8-dimensional signals. The noise is not fully i.i.d. but exhibits slight non-stationarity. Nonetheless, we take these findings as a confirmation to assume additive Gaussian noise in the following. We will show mean algorithm performance over 50 runs for varying noise levels.
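As a small illustration of the Gaussianity check mentioned above (not the authors' evaluation script), the per-channel test can be run as follows, assuming the recording is stored as an array of shape (channels, samples):

    import numpy as np
    from scipy.stats import jarque_bera

    def gaussian_channels(noise, alpha=0.05):
        """Return, per channel, whether normality is *not* rejected at level alpha."""
        verdicts = []
        for channel in noise:                 # noise: array of shape (channels, samples)
            stat, p = jarque_bera(channel)
            verdicts.append(p > alpha)        # large p-value: consistent with a Gaussian
        return verdicts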

Note that due to the Gaussian noise, the models (especially the NMF model, which already uses such a Gaussian error term) hold well under the additive noise. We multiplied this noise signal progressively by 0, 0.01, 0.05, 0.1, 0.5, 1 and 5 (which corresponds to mean source SNRs of ∞, 36, 22, 16, 2.1, -3.9 and -18 dB) and then added each of the obtained signals to a randomly generated synthetic s-EMG containing 5 sources as above. The Amari index was calculated for each method and for each noise level. We thus obtained the comparative graph shown in Fig. 8. Interestingly, sparse NMF∗ outperforms JADE in all cases, which indicates that the sNMF model (which already includes noise) works best in cases of slight to stronger additive noise — which makes it very well suited to real applications. Again SCA performs somewhat problematically; however, it separates the data well at the noise level of -3.9 dB. We believe that this is due to the thresholding parameter involved in SCA hyperplane detection; apparently it would be necessary to implement an adaptive choice of this parameter in order to improve separation as in the case of an SNR of -3.9 dB.
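The mean source SNR for a given noise scale can be computed as in the following sketch; the array shapes and the SNR definition (signal power over noise power, in dB) are assumptions made for illustration.

    import numpy as np

    def snr_db(signal, noise):
        """SNR in dB of a clean signal against an additive noise term."""
        return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

    def mean_source_snr(sources, noise, scale):
        """Mean SNR over sources when the recorded noise is multiplied by 'scale'.

        sources, noise: arrays of shape (n_sources, T); scale = 0 gives infinite SNR.
        """
        if scale == 0:
            return float("inf")
        return float(np.mean([snr_db(s, scale * n) for s, n in zip(sources, noise)]))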


[Figure 8: median Amari index versus mean SNR (dB), one curve each for JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA.]

Fig. 8. Result of the experiment testing the robustness of the different methods against noise. Plotted is the mean over 50 runs.

[Figure 9: real s-EMG signal [V], channels 1-8, plotted over 100-1000 ms.]

Fig. 9. Real s-EMG obtained from a healthy subject performing a sustained contraction at 30% MVC.


[Figure 10: recovered mixing matrices as bar plots over channel (1-8) and signal (1-3); panels (a) A JADE, (b) A NMF, (c) A NMF*, (d) A sNMF, (e) A sNMF*, (f) A SCA.]

Fig. 10. Mixing matrices recovered for the real s-EMG using different methods; (a) JADE, (b, c) NMF with different preprocessing, (d, e) sparse NMF with the same two preprocessing methods as NMF, and (f) SCA.

3.2 Real s-EMG signals

In this section we analyze real s-EMG recordings obtained from ten healthy subjects. At first we will again study a single s-EMG and plot it for visual inspection, and then show statistics over multiple subjects. The first data set has been obtained from a single subject performing a sustained contraction at 30% MVC, see Fig. 9. The signal acquisition and preprocessing are described in section 2.1.2.

We initially use JADE as the ICA algorithm of choice. The estimated mixing matrix is visualized in Fig. 10(a). As in the case of synthetic signals, we then apply NMF and sparse NMF to both X+ and X∗ and obtain the recovered mixing matrices visualized in Fig. 10(b-e).

We face the following problem when recovering the source signals by SCA. If we use PCA to reduce to n = 3 dimensions, we cannot achieve convergence; a generalized Hough plot [37] does not reveal such structure either. Hence we choose dimension reduction to n = 2. In the two-dimensional projected mixtures, the data clearly cluster along two lines, so the assumption of sparseness holds in 2 dimensions. We use Mangasarian-style clustering / SCA (similar to k-means) [38] to recover these directions.


[Figure 11: (a) matrix comparison by Amari index between JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA; (b) comparison of the recovered sources — measure values [a.u.] of KLD, MuIn, QMI, RSD, Renyi, STW and Xcor for each method, the source signals and the channels.]

Fig. 11. Inter-component performance index comparisons; (a) comparison of the Amari index; (b) mean of various source dependence measures.

                 JADE    NMF     NMF∗    sNMF    sNMF∗   SCA
mean kurtosis    4.97    4.74    4.80    4.81    4.80    4.82
sparseness       0.387   0.424   0.413   0.408   0.407   0.405
σ(A✷)            0.76    0.50    0.55    0.60    0.62    0.70

Table 3. Comparison of sparseness of the recovered sources (real s-EMG data).

The mixing matrix thus recovered in two dimensions is plotted in Fig. 10(f). Note that the SCA matrix columns match two columns of the mixing matrices found by the other methods.
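The two-dimensional clustering step can be illustrated by a simple alternating scheme in the spirit of k-plane clustering [38]; the sketch below is a stand-in (k-means-like assignment to lines through the origin), not the exact algorithm or parameter settings used for Fig. 10(f).

    import numpy as np

    def line_directions_2d(x, k=2, n_iter=50):
        """Cluster 2-d (whitened) mixtures x of shape (2, T) along k lines.

        Each point is assigned to the closest line through the origin; each line
        direction is then re-estimated as the dominant eigenvector of its cluster.
        Returns a (2, k) matrix of unit columns, i.e. estimated mixing columns.
        """
        rng = np.random.default_rng(0)
        d = x[:, rng.choice(x.shape[1], size=k, replace=False)]
        d /= np.linalg.norm(d, axis=0, keepdims=True)
        for _ in range(n_iter):
            proj = d.T @ x                                    # (k, T) projections
            res = x[:, None, :] - d[:, :, None] * proj[None, :, :]
            labels = np.argmin(np.linalg.norm(res, axis=0), axis=0)
            for j in range(k):
                pts = x[:, labels == j]
                if pts.shape[1] > 1:
                    _, v = np.linalg.eigh(pts @ pts.T)        # 2x2 scatter matrix
                    d[:, j] = v[:, -1]                        # dominant eigenvector
        return d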

Similar to the toy data set, we compare the recoveries obtained by the different methods using the various indices from section 2.3 for the mixing matrices and the recovered sources, Fig. 11. In contrast to the artificial signals, here all methods yield rather similar performance; the mean Amari indices are roughly half the value of the indices in the toy data setting. This confirms that the methods recover rather similar sources.

As in the case of artificial signals, we again compare the sparseness of the recovered sources, see Tab. 3. Due to the noise present in the real data, the signals are clearly less sparse than the artificial data. Furthermore, a comparison between the various methods yields noticeably fewer differences than in the case of artificial signals. At first glance it seems unclear why SCA performs worse in terms of k-sparseness (measured by σ(A)) than the NMF methods, but better than JADE. This can be explained by the fact that the PCA dimension reduction reduces the number of parameters, so SCA cannot search the whole space and therefore performs worse than NMF in this respect. However, it outperforms JADE, which also uses PCA preprocessing.


[Figure 12: component given by each method [a.u.] over 50-500 ms; panels (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF*, (f) SCA, (g) s-EMG.]

Fig. 12. One of the three recovered sources after applying A⁺ to an s-EMG recording at 30% MVC; (a-f) results obtained using the different methods; (g) original s-EMG signal.

In order to be able to draw statistically more relevant conclusions, we compare the various BSS algorithms on s-EMGs of nine subjects, recorded at 30% MVC. Fig. 12 plots a single extracted source for each BSS algorithm; Fig. 12(g) shows the s-EMG channel in which the dominant source signal is chosen for comparison. One problem with performing batch comparisons, however, lies in the fact that separation performance is commonly evaluated by visual inspection or at most by comparison with other separation methods — because, of course, the original sources are unknown. We cannot produce plots similar to Fig. 12 for each subject, so in order to provide a more objective measure, we consider the main application of BSS for real s-EMG analysis — preprocessing in order to achieve 'cleaner' data for template matching.

A common measure for this is to count the number of zero-crossings of the sources (after combining them into a full eight-dimensional observation vector by taking only the maximally active source in each channel). Note that this zero-crossing count is already output by the MDZ filter. It is directly related to the number of MUAPs present in an s-EMG signal [46], and by comparing this index before and after the BSS algorithms, we can analyze whether and to what extent they actually enhance the signal.


subject ID         JADE     NMF      NMF∗     sNMF     sNMF∗    SCA
b                   9%       7%       9%      10%      11%      10%
f                   5%       3%       5%       4%       5%       4%
g                  37%      35%      35%      35%      36%      30%
k                  35%      35%      35%      35%      34%      38%
m                  16%      16%      17%      16%      16%      13%
og                  9%       8%       8%       8%       8%      10%
ok                  3%       7%       9%       9%      10%       7%
s                  41%      42%      42%      41%      41%      37%
y                  71%      71%      73%      71%      73%      74%
means              25.0%    25.1%    25.7%    25.4%    25.8%    24.6%
std. deviations    21.4%    21.4%    21.3%    20.9%    21.0%    21.4%

Table 4. Negative zero-crossing mean ratios, i.e. relative enhancements, for each subject, together with the mean performance of each algorithm.

With the aid of the MDZ filter, we count the number of waves (the excursions of the signal between two consecutive zero crossings) and take the mean over all channels before and after applying each of the BSS algorithms. We then subtract the latter from the former and divide by the initial number of waves in order to obtain an index that can be compared between different signals. Tab. 4 shows the resulting ratios for the nine subjects. All BSS algorithms result in a reduction of zero-crossings, and the best results per run are achieved by NMF∗, sNMF∗ and SCA. In the mean, all algorithms perform somewhat similarly, with sNMF∗ being best and SCA worst. The best algorithm for this data set, sNMF∗, thus achieved a mean reduction in the number of waves of 25.8%, which means that after applying the algorithm each channel contains, on average, about one quarter fewer waves than before, making the template-matching technique more easily applicable.
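As a rough illustration of this wave-counting index (the MDZ filter itself, which performs the count after dead-zone thresholding, is not reproduced here), a plain sign-change count per channel already yields the same kind of ratio:

    import numpy as np

    def zero_crossings(x):
        """Number of sign changes in a 1-d signal (each change ends one 'wave')."""
        s = np.sign(x)
        s = s[s != 0]                      # ignore exact zeros
        return int(np.sum(s[:-1] != s[1:]))

    def relative_wave_reduction(before, after):
        """(waves before - waves after) / waves before, averaged over channels."""
        nb = np.mean([zero_crossings(c) for c in before])
        na = np.mean([zero_crossings(c) for c in after])
        return float((nb - na) / nb)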

4 Discussion

The main focus of this work lies in the application of three different sparse BSS models — source independence, (sparse) nonnegativity and k-sparseness — to the analysis of s-EMG signals. This application is motivated by the fact that the underlying MUAPTs exhibit properties (mainly sparseness) that fit quite well to these three in principle different models. Furthermore, we take an interest in how well these models behave in the case of slightly perturbed initial conditions — ICA, for instance, is known to be quite robust against


small errors in the independence assumption — and how they can deal with additional noise.

In the first example of artificially created mixtures, we were able to demonstrate that a decomposition analysis using the three models is possible, and we gave comparisons over larger data sets. Although the recovered sources are all rather alike (Fig. 5 and Fig. 7), we found that ICA outperformed the other methods in terms of distance to the real solution, mainly because the artificial sources — but not the real signals, see below — fitted the ICA model best, Tab. 2. However, when considering additional noise of increasing power, sparse NMF turned out to be more robust than the ICA model, Fig. 8.

We then applied the BSS methods to real s-EMG data sets, and the three different models yielded surprisingly similar results, although in theory these models do not fully overlap. We speculate that this similarity is due to the fact that — as most probably in all applications — the models do not fully hold. This allows the various algorithms only to approximate the model solutions, and hence to arrive at similar solutions, but from different directions. Furthermore, this indicates that the three models look for different properties of the sources, and that these properties are fulfilled to varying extents, see Tab. 3 for numerical details. Comparisons over s-EMG data sets from multiple subjects again confirmed this similarity of performance, where again sparse NMF slightly outperformed the other algorithms in the mean (in nice correspondence with the noise result from above).

Note that the aim of the present work is not the full recovery of a target MUAPT (source signal) in its original form. In fact, as shown previously [18], it would be sufficient to increase the amplitude of a target MUAPT so that on average it lies above the noise and above the level of the other MUAPTs. Then we are able to cut the interfering signals with a modified dead-zone filter and thus isolate the target MUAPT. Indeed, all of the employed sparse BSS methods fulfill this requirement, which is confirmed by the decrease in the number of zero-crossings of the separated signals, see Tab. 4.
number of zero-cross<strong>in</strong>gs of the separated signals, see Tab. 4.<br />

5 Conclusion

We have compared the effectiveness of various sparse BSS methods for signal decomposition, namely ICA, NMF, sparse NMF and SCA, when applied to s-EMG signals. Surface EMG signals represent an attractive test signal, as they approximately fulfill all the requirements of these methods and are of major importance in medical diagnosis and basic neurophysiological research.


All methods, in spite of being based on very different approaches, gave similar results in the case of real data and decomposed them sufficiently well. We therefore suggest using sparse BSS as an important preprocessing tool before applying the common template-matching technique. In terms of algorithm comparisons, ICA performed better than the other algorithms in the noiseless case, but sparse NMF∗ outperformed the other methods when noise was added, and slightly so in the case of multiple real s-EMG recordings. In later work on methods of s-EMG decomposition, we therefore want to focus on properties and possible improvements of sparse NMF regarding parameter choice (which level of sparseness to choose) and signal preprocessing (in order to deal with positive signals). Preprocessing to improve sparseness is currently being studied. In order to better compare methods, we also plan to apply the methods to artificial sources generated using other available s-EMG generators [47]. Finally, extensions to convolutive mixing situations will have to be analyzed.

Acknowledgements

We would like to thank Dr. S. Rainieri for helpful discussions and K. Maekawa for the first version of the s-EMG generator. This work was partly supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan (Grant-in-Aid for Scientific Research). G.A.G. is supported by a grant from the same Ministry. F.T. gratefully acknowledges partial financial support by the DFG (GRK 638) and the BMBF (project 'ModKog').

References

[1] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[2] A. Cichocki, S. Amari, Adaptive Blind Signal and Image Processing, John Wiley & Sons, 2002.
[3] P. Georgiev, F. Theis, A. Cichocki, Blind source separation and sparse component analysis of overcomplete mixtures, in: Proc. ICASSP 2004, Vol. 5, Montreal, Canada, 2004, pp. 493–496.
[4] P. Georgiev, F. Theis, A. Cichocki, Sparse component analysis and blind source separation of underdetermined mixtures, IEEE Trans. Neural Networks, in press.
[5] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[6] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1998) 33–61.
[7] F. Theis, E. Lang, C. Puntonet, A geometric algorithm for overcomplete linear ICA, Neurocomputing 56 (2004) 381–398.
[8] M. Zibulevsky, B. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Computation 13 (4) (2001) 863–882.
[9] M. Lewicki, T. Sejnowski, Learning overcomplete representations, Neural Computation 12 (2) (2000) 337–365.
[10] B. Olshausen, D. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (6583) (1996) 607–609.
[11] M. D. Plumbley, E. Oja, A 'non-negative PCA' algorithm for independent component analysis, IEEE Transactions on Neural Networks 15 (1) (2004) 66–76.
[12] E. Oja, M. D. Plumbley, Blind separation of positive sources using non-negative PCA, in: Proc. ICA 2003, Nara, Japan, 2003, pp. 11–16.
[13] W. Liu, N. Zheng, X. Lu, Non-negative matrix factorization for visual coding, in: Proc. ICASSP 2003, Vol. III, 2003, pp. 293–296.
[14] P. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research 5 (2004) 1457–1469.
[15] J. Basmajian, C. De Luca, Muscles Alive: Their Functions Revealed by Electromyography, 5th Edition, Williams & Wilkins, Baltimore, 1985.
[16] A. Halliday, S. Butler, R. Paul, A Textbook of Clinical Neurophysiology, John Wiley & Sons, New York, 1987.
[17] D. Farina, R. Merletti, R. Enoka, The extraction of neural strategies from the surface EMG, Journal of Applied Physiology 96 (2004) 1486–1495.
[18] G. García, R. Okuno, K. Akazawa, Decomposition algorithm for surface electrode-array electromyogram in voluntary isometric contraction, IEEE Engineering in Medicine and Biology Magazine, in press.
[19] C. Disselhorst-Klug, J. Silny, G. Rau, Estimation of the relationship between the noninvasively detected activity of single motor units and their characteristic pathological changes by modelling, Journal of Electromyography and Kinesiology 8 (1998) 323–335.
[20] S. Andreassen, A. Rosenfalck, Relationship of intracellular and extracellular action potentials of skeletal muscle fibers, CRC Critical Reviews in Bioengineering 6 (4) (1981) 267–306.
[21] H. Clamann, Statistical analysis of motor unit firing patterns in a human skeletal muscle, Biophysical Journal 9 (1969) 1233–1251.
[22] P. Griep, F. Gielen, H. Boom, K. Boon, L. Hoogstraten, C. Pool, W. Wallinga-De-Jonge, Calculation and registration of the same motor unit action potential, Electroencephalography and Clinical Neurophysiology 53 (1973) 388–404.
[23] E. Stålberg, J. Trontelj, Single Fiber Electromyography, The Mirvalle Press, Old Woking (UK), 1979.
[24] S. Maekawa, T. Arimoto, M. Kotani, Y. Fujiwara, Motor unit decomposition of surface EMG using multichannel blind deconvolution, in: Proc. ISEK 2002, Vienna, Austria, 2002, pp. 38–39.
[25] K. McGill, K. Cummins, L. Dorfman, Automatic decomposition of the clinical electromyogram, IEEE Transactions on Biomedical Engineering 32 (7) (1985) 470–477.
[26] P. Comon, Independent component analysis - a new concept?, Signal Processing 36 (1994) 287–314.
[27] F. Theis, A new concept for separability problems in blind source separation, Neural Computation 16 (2004) 1827–1850.
[28] J. Hérault, C. Jutten, Space or time adaptive signal processing by neural network models, in: J. Denker (Ed.), Neural Networks for Computing. Proceedings of the AIP Conference, American Institute of Physics, New York, 1986, pp. 206–211.
[29] J.-F. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization, SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.
[30] A. Yeredor, Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation, IEEE Trans. Signal Processing 50 (7) (2002) 1545–1553.
[31] A. Ziehe, P. Laskov, K.-R. Müller, G. Nolte, A linear least-squares algorithm for joint diagonalization, in: Proc. ICA 2003, Nara, Japan, 2003, pp. 469–474.
[32] G. García, K. Maekawa, K. Akazawa, Decomposition of synthetic multi-channel surface-electromyogram using independent component analysis, in: Proc. ICA 2004, Vol. 3195 of Lecture Notes in Computer Science, Granada, Spain, 2004, pp. 985–991.
[33] S. Takahashi, Y. Sakurai, M. Tsukada, Y. Anzai, Classification of neural activities from tetrode recordings using independent component analysis, Neurocomputing 49 (2002) 289–298.
[34] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997) 1483–1492.
[35] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126.
[36] D. Lee, H. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems (Proc. NIPS 2000), Vol. 13, MIT Press, 2000, pp. 556–562.
[37] F. Theis, P. Georgiev, A. Cichocki, Robust overcomplete matrix recovery for sparse sources using a generalized Hough transform, in: Proc. ESANN 2004, d-side, Evere, Belgium, Bruges, Belgium, 2004, pp. 343–348.
[38] P. Bradley, O. Mangasarian, k-plane clustering, Journal of Global Optimization 16 (1) (2000) 23–32.
[39] C. De Luca, Physiology and mathematics of myoelectric signals, IEEE Transactions on Biomedical Engineering 26 (6) (1979) 313–325.
[40] C. De Luca, R. LeFever, M. McCue, A. Xenakis, Control scheme governing concurrently active human motor units during voluntary contractions, Journal of Physiology 329 (1982) 129–142.
[41] S. Amari, A. Cichocki, H. Yang, A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems 8 (1996) 757–763.
[42] D. Xu, J. Principe, J. F. III, H.-C. Wu, A novel measure for independent component analysis (ICA), in: Proc. ICASSP 1998, Vol. 2, Seattle, 1998, pp. 1161–1164.
[43] D. Tjøstheim, Measures of dependence and tests of independence, Statistics 28 (1996) 249–282.
[44] R. Duda, P. Hart, D. Stork, Pattern Classification, 2nd Edition, Wiley, New York, 2001.
[45] F. Theis, S. Amari, Postnonlinear overcomplete blind source separation using sparse sources, in: Proc. ICA 2004, Vol. 3195 of Lecture Notes in Computer Science, Granada, Spain, 2004, pp. 718–725.
[46] P. Zhou, Z. Erim, W. Rymer, Motor unit action potential counts in surface electrode array EMG, in: Proc. IEEE EMBS 2003, Cancun, Mexico, 2003, pp. 2067–2070.
[47] B. Freriks, H. Hermens, European Recommendations for Surface Electromyography: Results of the SENIAM Project, Roessingh Research and Development b.v. (CD-ROM), 1999.



Chapter 21

Proc. BIOMED 2005, pages 209-212

Paper F.J. Theis, Z. Kohl, H.G. Kuhn, H.G. Stockmeier, and E.W. Lang. Automated counting of labelled cells in rodent brain section images. In Proc. BioMED 2004, pages 209-212, Innsbruck, Austria, 2004. ACTA Press, Canada

Reference (Theis et al., 2004c)

Summary in section 1.6.2


Automated counting of labelled cells in rodent brain section images

F.J. Theis 1, Z. Kohl 2, H.G. Kuhn 2, H.G. Stockmeier 1 and E.W. Lang 1

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
2 Department of Neurology, University of Regensburg, 93053 Regensburg, Germany

email: fabian@theis.name

ABSTRACT

The genesis of new cells, especially of neurons, in the adult human brain is currently of great scientific interest. In order to measure neurogenesis in animals, newborn cells are labelled with specific markers such as BrdU; in brain sections these can later be analyzed and counted under the microscope. So far, this image analysis has been performed by hand. In this work, we present an algorithm to automatically segment the digital brain section picture into cell and non-cell components, giving a count of the number of cells in the section. This is done by first training a so-called cell classifier with cell and non-cell patches in a supervised manner. This cell classifier can later be used on an arbitrary number of sections by scanning each section and choosing maxima of the classifier response as cell center locations. For training, single- and multi-layer perceptrons were used. In preliminary experiments, we obtain good performance of the classifier.

KEY WORDS

Cell counting, image segmentation, cell classification, neurogenesis, BrdU

1 Biological background

1.1 New neurons in the adult brain

During the last decades, the fact that new neurons are continuously generated in the adult mammalian brain (a phenomenon termed adult neurogenesis) has come more and more into the focus of neuroscience research [1][2][7]. Under physiological conditions, neuroscientists found that adult neurogenesis seems to be restricted to two brain regions: the wall of the lateral ventricle and the granular cell layer of the hippocampus.

A large variety of factors including environmental signals, trophic factors, hormones and neurotransmitters have recently been identified to regulate the generation of new neurons in the adult brain. These studies were typically performed using a combination of different histological techniques, such as non-radioactive labeling of newly generated cells, stereological counting and confocal microscope analysis, in order to quantitatively analyze adult neurogenesis (review in [8]). However, this procedure is time consuming, since histological analysis currently depends on the assessment of positive signals in histological sections by individual investigators through manual or semiautomatic counting.

1.2 Method used

Bromodeoxyuridine (BrdU), a thymidine analog, is given systemically and is integrated into the replicating DNA during cell division [3]. Using a specific antibody against BrdU, labelled cells can be detected by an immunohistochemical staining procedure. The nuclei of labelled cells on 40 µm thick brain sections appear in a dark brown or black dense color. To determine the amount of BrdU-positive cells in the granular cell layer of the hippocampus, they were counted on a light microscope (Olympus IX 70; Hamburg, Germany) with a 20× objective. Digital images with a resolution of 1600 × 1200 pixels were taken by a color video camera using the analySIS software system (Soft Imaging System, Münster, Germany).

2 Automated counting

Figure 1 shows a section image in which the cells are to be counted.

Classical approaches such as thresholding and erosion after image normalization were not successful, mainly because cell clusters in the image cannot be detected properly and counted using these methods.

We decided to adapt a method proposed by Nattkemper et al. [9] to evaluate fluorescence micrographs of lymphocytes invading human tissue. The main idea is to build in a first step a function mapping an image patch to a confidence value in [0, 1], indicating how probable it is that a cell lies in this patch — we call this function the cell classifier. In the second step this function is used as a local filter on the whole image; its application gives a probability distribution over the whole image with local maxima at cell positions. Nattkemper et al. call this distribution the confidence map. Maxima analysis of the confidence map reveals the number and the positions of the cells (image segmentation).

3 Cell classifier

In this section, we will explain how to generate a cell classifier, that is, a function mapping image patches to cell confidence values.



Figure 1. Typical section image and, below, the manually bounded but automatically labelled image. In order to indicate the image size, a black scale bar of length 50 µm is given at the bottom of the top image. Here, the number of counted cells within the boundary (region of interest) is 84; the number of cells in the whole image is 116.

For this we will generate a sample set of cells and non-cells, and then train an artificial neural network on this sample set.

3.1 Sample set

After fixing the patch size — in the following we will use 20 by 20 pixel grey-level image patches — a training set of cell and non-cell patches has to be generated manually by the expert. These image patches are then to be classified by a neural network. Figure 2 shows some cell and non-cell patches.

Interpreting each 20 by 20 image patch as a 400-dimensional vector, we thus get a set of L training vectors

T := {(x_1, t_1), . . . , (x_L, t_L)}

with x_i ∈ ℝ^n — here n = 20² — representing the image patch, and t_i ∈ {0, 1} either 0 or 1 depending on whether x_i is a non-cell or a cell. The goal is to find a mapping correctly classifying this data set, that is a mapping ζ : ℝ^n → [0, 1] with ζ(x_i) = t_i for i = 1, . . . , L. We call such a mapping a cell classifier. Of course ζ is not uniquely defined by the above property, so some regularization has to be introduced. Any interpolation technique such as Fourier or Taylor approximation could be used to find ζ; we will use single- and multilayer perceptrons as explained in the following section.

Figure 2. Part of the training set: the first row consists of 20×20-pixel image patches that contain cells, the lower row consists of non-cell image patches.

3.2 Preprocessing

Before we apply neural network learning, we preprocess the data as follows. After mean removal, we apply principal component analysis (PCA) in order to reduce the dimension of the data set as well as to decorrelate the data in a first separation step. This is achieved by diagonalizing the data set covariance and projecting along the eigenvectors with the largest eigenvalues.

By taking only the first 5 eigenvalues of the training set, projection along these first 5 principal axes still retains 95% of the data variance. Thus, the 400-dimensional data space was reduced to a whitened 5-dimensional data set.
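A minimal sketch of this preprocessing step (mean removal, PCA projection and whitening), assuming the patches are stored as rows of an (L, 400) array:

    import numpy as np

    def pca_whiten(patches, dim=5):
        """Mean removal, projection onto the 'dim' leading principal axes, and whitening."""
        mean = patches.mean(axis=0)
        centered = patches - mean
        cov = np.cov(centered, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
        order = np.argsort(evals)[::-1][:dim]       # indices of the largest eigenvalues
        proj = evecs[:, order]
        whitened = (centered @ proj) / np.sqrt(evals[order])
        return whitened, proj, mean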

A visualization of the 120-sample data set is given in figure 3, here after projection to 3 dimensions. One can easily see that the cell and non-cell components can be linearly separated — thus a perceptron, see next section, can indeed already learn the cell classifier. Furthermore, a k-means clustering algorithm has been applied with k = 2 in order to find the two data clusters. Those directly correspond to the cell/non-cell components, see figure.

3.3 Neural network learning

Supervised learning algorithms try to approximate a given function f : ℝ^n → A ⊂ ℝ^m by using a number of given sample-observation pairs (x_λ, f(x_λ)) ∈ ℝ^n × A. We restrict ourselves to feed-forward layered neural networks [4]; furthermore, we found that simple single-layered neural networks (perceptrons), in comparison to multi-layered networks (MLPs), already sufficed to learn the data set well — and they have the advantage of easier rule extraction and interpretation.
and <strong>in</strong>terpretation.




Figure 3. Data set with 120 samples after 3-dimensional PCA projection (91% of the data variance was retained). The dots mark the 60 samples representing cells, the x's mark the 60 non-cell data points. The two circles indicate the clusters found by a k-means application searching for two clusters. Obviously, k-means nicely differentiates between the cell and the non-cell components.

A perceptron with output dimension 1 consists of a single neuron only, so the output function y can be written as

y(x) = θ(wᵀx + w_0)

with weight w ∈ ℝ^n (n the input dimension), bias w_0 ∈ ℝ, and the Heaviside function θ as activation function (θ(x) = 0 for x < 0 and θ(x) = 1 for x ≥ 0). Often, the bias w_0 is added as an additional weight to w with fixed input 1.

Learning in a perceptron means minimizing an error energy function, here the squared difference between network output and target. This can be performed for example by gradient descent with respect to w and w_0, which induces the well-known delta rule for the weight update

Δw = η (t − y(x)) x,

where η denotes a chosen learning rate parameter, y(x) the output of the neural network at sample x and t the target observation for input x. It is easy to see that a perceptron separates the data linearly, with the boundary hyperplane given by {x ∈ ℝ^n | wᵀx + w_0 = 0}.
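A minimal sketch of this training procedure for the linear-activation unit used below; the learning rate and epoch count are illustrative, and the sign convention follows gradient descent on the squared error.

    import numpy as np

    def train_linear_unit(x, t, eta=0.1, epochs=55):
        """Delta-rule training of a single linear unit y(x) = w^T x + w0.

        x: (L, n) preprocessed patches (e.g. the 5-d PCA features),
        t: (L,) targets in {0, 1}.  The bias w0 is absorbed as an extra weight.
        """
        x = np.hstack([x, np.ones((x.shape[0], 1))])
        w = np.zeros(x.shape[1])
        for _ in range(epochs):
            for xi, ti in zip(x, t):
                y = w @ xi                       # linear activation
                w += eta * (ti - y) * xi         # delta rule: Δw = η (t - y) x
        return w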

For the cell classifier, we use a single-unit perceptron with a linear activation function in order to obtain a measure of the certainty of the cell/non-cell classification. Application of delta learning to the 5-dimensional data set from above gives excellent performance after only 4 epochs of batch learning. The final performance error (the variance of the perceptron's estimation error on the training set) after 55 epochs was 0.0038, which confirms the good performance as well as the linearity of the classification problem. This was further confirmed when we used a two-layered network with 5 hidden neurons in order to test for nonlinearities in the data set. After only 10 epochs, the error was already very small, and it could finally be diminished to 3·10⁻¹⁹. Still, the performance of the perceptron is more than sufficient for classification.

4 Confidence map

4.1 Generation

The cell classifier from above only has to be trained once. Given such a cell classifier, section pictures can now be analyzed as follows. A pixelwise scan of the image gives an image patch with center location at the scan point; to this image patch the cell classifier is applied to give a probability of whether a cell is at the given location or not. Altogether (after image extension at the boundaries) this yields a probability distribution over the whole image which is called the confidence map. Each point of the confidence map is a value in [0, 1] stating how probable it is that a cell is depicted at the specified location.

In practice a pixelwise scan is too expensive in terms of calculation time, so instead a grid value, say 5 for 20 × 20 patches, is introduced, and the picture is scanned only every 5th pixel. This yields a rasterization of the original confidence map, which is still good enough to detect cells. Figure 4 shows the rasterized confidence map of a section part. The maxima of the confidence map correspond to the cell locations; small but non-zero values in the confidence map typically indicate misclassifications that can be avoided by thresholding.
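As an illustration of this scanning step (image boundaries are simply skipped here rather than extended, and the classifier is assumed to be any function mapping a flattened patch to [0, 1]):

    import numpy as np

    def confidence_map(image, classifier, patch=20, grid=5):
        """Scan a grey-level image on a coarse grid and collect classifier outputs."""
        h, w = image.shape
        rows = range(0, h - patch + 1, grid)
        cols = range(0, w - patch + 1, grid)
        conf = np.zeros((len(rows), len(cols)))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                conf[i, j] = classifier(image[r:r + patch, c:c + patch].ravel())
        return conf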

4.2 Evaluation

After the confidence map has been generated, it can be evaluated by simple maxima analysis. However, as seen in figure 4, maxima do not always correspond to cell positions, so thresholding of the confidence map is applied first. Values of 0.5 to 0.8 yield good results in experiments. Furthermore, the cell classifier may have responded to one and the same cell when applied to image patches with large overlap. Therefore, after a maximum has been detected, adjacent points in the confidence map within a given radius are also set to zero (15 to 18 were good values for 20 × 20 image patches). Iterative application of this algorithm then gives the final cell positions and hence the image segmentation.
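A sketch of this iterative maxima analysis with thresholding and a suppression radius (given here in pixels and converted to grid cells; parameters as in the experiments described below are only illustrative defaults):

    import numpy as np

    def cell_positions(conf, threshold=0.8, radius=18, grid=5):
        """Pick confidence-map maxima above 'threshold', zeroing a neighbourhood each time."""
        conf = conf.copy()
        r = max(1, radius // grid)                  # suppression radius in grid cells
        positions = []
        while True:
            i, j = np.unravel_index(np.argmax(conf), conf.shape)
            if conf[i, j] < threshold:
                break
            positions.append((i * grid, j * grid))  # approximate pixel coordinates
            conf[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1] = 0.0
        return positions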

5 Results

In practice we used perceptron learning after preprocessing with PCA and also ICA [5][6], in order to provide the learning algorithm with linearly separable data.
of 0.8 was applied <strong>in</strong> the confidence map and the



Figure 4. The plot shows the confidence map with grid value 5 of the image part shown above.

In figure 1 an automatically segmented picture is shown. We observe good performance of the counting algorithm. So far we have only compared the cell numbers per section counted by the algorithm with those counted by an expert; the resulting counting errors are about 5%. In further experiments, we also want to compare the cell positions detected by the algorithm with those marked by an expert.

6 Conclusion

We have presented a framework for brain section image segmentation and analysis. The feature detector, here a cell classifier, was first trained on a given sample set using a neural network. This detector was then applied by scanning over the image to obtain a confidence map, and maxima analysis yields the cell locations. Experiments showed good performance of the classifier; however, larger tests remain to be performed.

In future work, various problems will have to be addressed. First of all, the scanning performance should be increased in order to allow smaller grid values, which could significantly increase the classification rate. This could be done, for example, by using some kind of hierarchical neural network such as a cellular neural network, see [9]. In typical brain section images, some cells that do not lie directly in the focus plane appear blurred. In order to count those without counting them twice in two section images with different focus planes, a three-dimensional cell classifier could be trained for fixed focus plane distances. A different approach to accounting for non-focused cells would be to simply allow 'overcounting' and then to remove duplicates in the segmented images according to their location; this seems suitable given that cells do not vary greatly in size. Finally, sections typically span more than one microscope image. In order to count the cells of a whole section, some way of not counting cells twice in overlapping image parts has to be devised, for example by using techniques from image fusion. Furthermore, often not the whole image but only parts of the section are to be counted; so far this choice of the 'region of interest' is made manually. We hope to automate this in the future by finding separating features of these regions.

7 Acknowledgements

F.J.T. and E.W.L. gratefully acknowledge financial support by the DFG 1 and the BMBF 2.

References

[1] J. Altman and G. Das. Autoradiographic and histological evidence of postnatal hippocampal neurogenesis in rats. J. Comp. Neurol., 124(3):319–335, 1965.
[2] H. Cameron, C. Woolley, B. McEwen, and E. Gould. Differentiation of newly born neurons and glia in the dentate gyrus of the adult rat. Neuroscience, 56(2):336–344, 1993.
[3] F. Dolbeare. Bromodeoxyuridine: a diagnostic tool in biology and medicine, part I: historical perspectives, histochemical methods and cell kinetics. Histochem. J., 27(5):339–369, 1995.
[4] S. Haykin. Neural Networks. Macmillan College Publishing Company, 1994.
[5] J. Hérault and C. Jutten. Space or time adaptive signal processing by neural network models. In J. S. Denker, editor, Neural Networks for Computing. Proceedings of the AIP Conference, pages 206–211, New York, 1986. American Institute of Physics.
[6] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[7] H. G. Kuhn, H. Dickinson-Anson, and F. Gage. Neurogenesis in the dentate gyrus of the adult rat: age-related decrease of neuronal progenitor proliferation. J. Neurosci., 16(6):2027–2033, 1996.
[8] H. G. Kuhn, T. Palmer, and E. Fuchs. Adult neurogenesis: a compensatory mechanism for neuronal damage. Eur. Arch. Psychiatry Clin. Neurosci., 251(4):152–158, 2001.
[9] T. W. Nattkemper, H. Ritter, and W. Schubert. A neural classifier enabling high-throughput topological analysis of lymphocytes in tissue sections. IEEE Trans. ITB, 5:138–149, 2001.

1 graduate college ‘nonlinear dynamics’
2 project ‘ModKog’




