
Statistical machine learning of biomedical data

Habilitation at the NWF II - Physik, Universität Regensburg

submitted by Fabian J. Theis, 2007


The habilitation thesis was submitted on 7 August 2007.

Mentoring committee (Fachmentorat):

Prof. Dr. Elmar W. Lang (Universität Regensburg)

Prof. Dr. Klaus Richter (Universität Regensburg)

Prof. Dr. Klaus-Robert Müller (FhG FIRST & TU Berlin)


to Jakob


Preface

In this manuscript, I present twenty papers that I have coauthored during the past four years. They range from theoretical contributions such as model analyses and convergence proofs to algorithmic implementations and biomedical applications. The common denominator is the framework of (mostly unsupervised) data analysis based on statistical machine learning methods.

For convenience, I summarize and review these papers in chapter 1. Obviously, the papers themselves are self-contained. The summary chapter consists of an introductory part followed by sections on independent and dependent component analysis, sparseness, preprocessing and applications. These sections are again mostly self-contained, except for the dependent component analysis part, which relies on the preceding section, and the applications part, which depends on the methods described before.

I want to thank everyone who helped me during the past four years. First of all my former boss Elmar Lang: he gave me both support and freedom, and help whenever needed, even nowadays! Then of course all retired and active members of my group: Peter Gruber, Ingo Keck, Harold Gutch, Christian Guggenberger and, more recently, Florian Blöchl, Elias Rentzeperis, Martin Rhoden and Dominik Lutter. I also have to mention and thank my colleagues at the MPI for Dynamics and Self-Organization in Göttingen, in particular the director Theo Geisel and my collaborators Dirk Brockmann and Fred Wolf. Furthermore, I want to thank my mentors, Elmar Lang, Klaus Richter and Klaus-Robert Müller, for their manifold help and advice during my habilitation.

I thank the Bernstein Center for Computational Neuroscience Göttingen as well as the graduate college 'Nonlinearity and Nonequilibrium in Condensed Matter' at Regensburg for their generous support. Additional funding by the BMBF within the project ModKog and within the projects BaCaTeC and PPP with Granada and London is gratefully acknowledged.

My deepest thanks go to my family and friends, above all my wife Michaela and my son Jakob. You're the real deal.

Göttingen, June 2007                                                Fabian J. Theis



Contents

Preface

I Summary

1 Statistical machine learning of biomedical data
  1.1 Introduction
  1.2 Uniqueness issues in independent component analysis
    1.2.1 Linear case
    1.2.2 Nonlinear ICA
  1.3 Dependent component analysis
    1.3.1 Algebraic BSS and multidimensional generalizations
    1.3.2 Spatiotemporal BSS
    1.3.3 Independent subspace analysis
  1.4 Sparseness
    1.4.1 Sparse component analysis
    1.4.2 Sparse non-negative matrix factorization
  1.5 Machine learning for data preprocessing
    1.5.1 Denoising
    1.5.2 Dimension reduction
    1.5.3 Clustering
  1.6 Applications to biomedical data analysis
    1.6.1 Functional MRI
    1.6.2 Image segmentation and cell counting
    1.6.3 Surface electromyograms
  1.7 Outlook

II Papers

2 Neural Computation 16:1827-1850, 2004
3 Signal Processing 84(5):951-956, 2004
4 Neurocomputing 64:223-234, 2005
5 IEICE TF E87-A(9):2355-2363, 2004
6 LNCS 3195:726-733, 2004
7 Proc. ISCAS 2005, pages 5878-5881
8 Proc. NIPS 2006
9 Neurocomputing (in press), 2007
10 IEEE TNN 16(4):992-996, 2005
11 EURASIP JASP, 2007
12 LNCS 3195:718-725, 2004
13 Proc. EUSIPCO 2005
14 Proc. ICASSP 2006
15 Neurocomputing, 69:1485-1501, 2006
16 Proc. ICA 2006, pages 917-925
17 IEEE SPL 13(2):96-99, 2006
18 Proc. EUSIPCO 2006
19 LNCS 3195:977-984, 2004
20 Signal Processing 86(3):603-623, 2006
21 Proc. BIOMED 2005, pages 209-212

Bibliography


Part I

Summary


Chapter 1

Statistical machine learning of biomedical data

Biostatistics deals with the analysis of high-dimensional data sets originating from biological or biomedical problems. An important challenge in this analysis is to identify underlying statistical patterns that facilitate the interpretation of the data set using techniques from machine learning. A possible approach is to learn a more meaningful representation of the data set, one that maximizes certain statistical features. Such representations, which are often linear, have several potential applications, including the decomposition of objects into 'natural' components (Lee and Seung, 1999), redundancy and dimensionality reduction (Friedman and Tukey, 1975), biomedical data analysis, microarray data mining or enhancement, feature extraction of images in nuclear medicine, etc. (Alpaydin, 2004, Bishop, 2007, Cichocki and Amari, 2002, Hyvärinen et al., 2001c, MacKay, 2003, Mitchell, 1997).

In the following, we will review some statistical representation models and discuss identifiability conditions. The resulting separation algorithms will be applied to various biomedical problems in the last part of this summary.

1.1 Introduction

Assume the data is given by a multivariate time series x(t) ∈ R^m, where t indexes time, space or some other quantity. Data analysis can be defined as finding a meaningful representation of x(t), i.e. as x(t) = f(s(t)) with unknown features s(t) ∈ R^n and mixing mapping f. Often, f is assumed to be linear, so we are dealing with the situation

x(t) = As(t)                                                         (1.1)

with a mixing matrix A ∈ R^{m×n}. Often, white noise n(t) is added to the model, yielding x(t) = As(t) + n(t); this noise can be included in s(t) by increasing its dimension. In the situation (1.1), the analysis problem is reformulated as the search for a (possibly overcomplete) basis in which the feature signal s(t) allows more insight into the data than x(t) itself. This of course has to be specified within a statistical framework.

There are two general approaches to finding data representations or models as in (1.1):

Figure 1.1: Two-dimensional example of ICA-based source separation. The observed mixture signal (b) is composed of two unknown source signals (a) using a linear mapping. Application of ICA (here: HessianICA) yields the recovered sources (c), which coincide with the original sources up to permutation and scaling: ŝ1(t) ≈ 1.5 s2(t) and ŝ2(t) ≈ −1.5 s1(t). Indeed, the composition of mixing matrix A and separating matrix W equals a unit matrix (d) up to the unavoidable indeterminacies of scaling and permutation.

• supervised analysis: additional information is available, for example in the form of input-output pairs (x(t1), s(t1)), . . . , (x(tT), s(tT)). These training samples can be used for interpolation and learning of the map f or basis A (regression). If the sources s are discrete, this leads to a classification problem. The resulting map f can then be used for prediction.

• unsupervised models: instead of labeled samples, weak statistical assumptions are made on either s(t) or f/A. A common assumption, for example, is that the source components si(t) are mutually independent.

Here, we will mostly focus on the second situation. The unsupervised analysis is often denoted as blind source separation (BSS), since neither the features or 'sources' s(t) nor the mixing mapping f are assumed to be known. The field of BSS has been studied rather intensively by the community for more than a decade. Since the first introduction of a neural-network-based BSS solution by Hérault and Jutten (1986), various algorithms have been proposed to solve the blind source separation problem (Bell and Sejnowski, 1995, Cardoso and Souloumiac, 1993, Comon, 1994, Hyvärinen and Oja, 1997, Theis et al., 2003). Good textbook-level introductions to the topic are given by Hyvärinen et al. (2001c) and Cichocki and Amari (2002). Recent research centers on generalizations and applications. The first part of this manuscript deals with such extended models and algorithms; some applications will be presented later.

A common model for BSS is realized by the independent component analysis (ICA) model (Comon, 1994), where the underlying signals s(t) are assumed to be statistically independent. Let us first concentrate on the linear case, i.e. f = A linear. Then we search for a decomposition x(t) = As(t) of the observed data set x(t) = (x1(t), . . . , xn(t))⊤ into independent signals s(t) = (s1(t), . . . , sn(t))⊤. For example, consider figure 1.1. The goal is to decompose the two time series (b) into the two source signals (a). Visually, this is a simple task: obviously the data is composed of two sinusoids with different frequencies. But how can this be done algorithmically? And how can we formulate a feasible model?
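The following minimal sketch makes the toy setting of figure 1.1 concrete: it mixes two sinusoids with a linear map and recovers them blindly. It assumes NumPy and scikit-learn are available and uses FastICA as a stand-in separation algorithm (not the HessianICA method discussed in section 1.2); the matrix A below is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import FastICA

# two sinusoidal sources with different frequencies, t = 0, ..., 99
t = np.arange(100)
s = np.vstack([np.sin(2 * np.pi * 0.05 * t),
               np.sin(2 * np.pi * 0.12 * t)])          # sources s(t), shape (2, 100)

A = np.array([[0.8, 0.4],                              # unknown mixing matrix
              [0.3, 0.9]])
x = A @ s                                              # observed mixtures x(t) = A s(t), cf. (1.1)

# blind recovery: estimate an unmixing matrix W from the mixtures alone
ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T                       # recovered sources (up to scaling/permutation)
W = ica.components_                                    # estimated unmixing matrix

# up to the unavoidable indeterminacies, W A should be close to a scaled permutation matrix
print(np.round(W @ A, 2))
```

If separation succeeds, the printout shows a matrix with one dominant entry per row and column, i.e. a scaled permutation, as in figure 1.1(d).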



Figure 1.2: Cocktail party problem: (a) a linear superposition of the speakers is recorded at each microphone. This can be written as the mixing model x(t) = As(t) (1.1) with speaker voices s(t) and activity x(t) at the microphones (b). Possible applications lie in neuroscience: given multiple activity recordings of the human brain, the goal is to identify the underlying hidden sources that make up the total activity (c).

A typical application of BSS lies in the cocktail party problem: at a cocktail party, a set of microphones records the conversations of the guests. Each microphone records a linear superposition of the conversations, and at each microphone a slightly different superposition is recorded depending on its position, see figure 1.2. In the following we will see that, given some rather weak assumptions on the conversations themselves, such as independence of the various talks, it is possible to recover the original sources and the mixing matrix (which encodes the positions of the speakers) using only the signals recorded at the microphones. Note that in real-world situations the nice linear mixing situation deteriorates due to noise, convolutions and nonlinearities.



1.2 Uniqueness issues in independent component analysis

Application of ICA to BSS tacitly assumes that the data follow the model (1.1), i.e. x(t) admits a decomposition into independent sources, and we want to find this decomposition. But neither the mixing function f nor the source signals s(t) are known, so we should expect to find many solutions to this problem. Indeed, the order of the sources cannot be recovered (the speakers at the cocktail party do not have numbers), so there is always an inherent permutation indeterminacy. Moreover, the strength of each source cannot be extracted from the model alone either, because f and s(t) can interchange so-called scaling factors. In other words, not knowing the power of each speaker at the cocktail party, we can only extract his speech, but not its volume: he could also be standing further away from the microphones, but shouting instead of speaking.

One of the key questions in ICA-based source separation is: are there any other indeterminacies? Without fully answering this question, ICA algorithms cannot be applied to BSS, simply because we would not have any clue how to relate the resulting sources to the original ones. But apparently the set of indeterminacies cannot be very large; after all, at a cocktail party we ourselves are able to distinguish between the various speakers.

1.2.1 Linear case

In 1994, Comon was able to answer this question (Comon, 1994) in the linear case where f = A by reducing it to the Darmois-Skitovitch theorem (Darmois, 1953, Skitovitch, 1953, 1954). Essentially, he showed that if the sources contain at most one Gaussian component, the indeterminacies of the above model are only scaling and permutation. This positive answer more or less made the field popular; from then on, the number of papers published in this field each year increased considerably. However, it may be argued that Comon's proof lacked two points: because it relies on the rather difficult-to-prove old theorem by the two statisticians, the central reason why there are no further indeterminacies is not at all obvious, and hence few attempts have been made to extend it to more general situations. Furthermore, no algorithm can be extracted from the proof, because it is non-constructive.

In (Theis, 2004a), we took a somewhat different approach. Instead of using Comon's idea of minimal mutual information, we reformulated the condition of source independence in a different way: in simple terms, a two-dimensional source vector s is independent if its density ps factorizes into two one-component densities ps1 and ps2. But this is the case only if ln ps is a sum of one-dimensional functions, each depending on a different variable. Hence taking the partial derivative with respect to s1 and then with respect to s2 always yields zero. In other words, the Hessian Hln ps of the logarithmic density of the sources is diagonal; this is what we denoted by ps being a 'separated function' in Theis (2004a), see chapter 2. Using only this property, we were able to prove Comon's uniqueness theorem (Theis, 2004a, theorem 2) without having to resort to the Darmois-Skitovitch theorem. Here Gl(n) denotes the group of invertible (n × n)-matrices.

Theorem 1.2.1 (Separability of linear BSS). Let A ∈ Gl(n; R) and s be an independent random vector. Assume that s has at most one Gaussian component and that the covariance of s exists. Then As is independent if and only if A is the product of a scaling and a permutation matrix.



Instead of a multivariate random process s(t), the theorem is formulated for a random vector s, which is equivalent to assuming an i.i.d. process. Moreover, the source dimension (n) and mixture dimension (m) are assumed to be equal, although relaxation to the undercomplete case (1 < n < m) is straightforward, and extension to the overcomplete case (n > m > 1) is possible (Eriksson and Koivunen, 2003). The assumption of at most one Gaussian component is crucial, since independence of white multivariate Gaussians is invariant under orthogonal transformations, so theorem 1.2.1 cannot hold in this case.

An algorithm for separation: Hessian ICA

The proof of theorem 1.2.1 is constructive, and the exception of the Gaussians comes into play naturally as zeros of a certain differential equation. The idea of why separation is possible now becomes quite clear. Furthermore, an algorithm can be extracted from the pattern used in the proof: after decorrelation, we can assume that the mixing matrix A is orthogonal. By using the transformation properties of the Hessian matrix, we can employ the linear relationship x = As to get

Hln px = A⊤ Hln ps A                                                 (1.2)

for the Hessian of the mixtures. The key idea, as we have seen in the previous section, is that due to statistical independence the source Hessian Hln ps is diagonal everywhere. Therefore equation (1.2) represents a diagonalization of the mixture Hessian, and the diagonalizer equals the mixing matrix A. Such a diagonalization is unique if the eigenspaces of the Hessian are one-dimensional at some point, and this is precisely the case if x(t) contains at most one Gaussian component (Theis, 2004a, lemma 5). Hence, the mixing matrix and the sources can be extracted algorithmically by simply diagonalizing the mixture Hessian evaluated at some point. The Hessian ICA algorithm consists of local Hessian diagonalization of the logarithmic density (or, equivalently, of the easier-to-estimate characteristic function). In order to improve robustness, multiple matrices are jointly diagonalized. Applying this algorithm to the mixtures of our toy example from figure 1.1 yields the very well recovered sources in figure 1.1(c).
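The core step can be sketched numerically as follows, assuming NumPy and SciPy. This is a deliberately simplified illustration of the idea behind HessianICA, not the algorithm from Theis (2004a) itself: it whitens the mixtures, estimates the log-density with a Gaussian kernel density estimate, approximates its Hessian at a single point by finite differences, and takes the eigenvectors of that Hessian as the estimate of the (orthogonal) mixing matrix; the full algorithm additionally combines several evaluation points for robustness.

```python
import numpy as np
from scipy.stats import gaussian_kde

def hessian_ica_sketch(x, point, eps=0.1):
    """Estimate an orthogonal mixing matrix by diagonalizing the Hessian of the
    log-density of the whitened mixtures at one point (cf. equation (1.2))."""
    # whiten the mixtures, so that the remaining mixing matrix is orthogonal
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T            # whitening matrix
    z = V @ x

    kde = gaussian_kde(z)                              # kernel density estimate of the mixtures
    logp = lambda p: np.log(kde(p[:, None])[0] + 1e-12)

    # finite-difference Hessian of log p at the chosen point
    n = z.shape[0]
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
            H[i, j] = (logp(point + ei + ej) - logp(point + ei - ej)
                       - logp(point - ei + ej) + logp(point - ei - ej)) / (4 * eps ** 2)

    # the eigenvectors of the mixture Hessian diagonalize it and estimate the rotation
    _, A_hat = np.linalg.eigh(H)
    return A_hat, A_hat.T @ z                          # mixing estimate and recovered sources
```

The evaluation point should be chosen where the density is well estimated and the Hessian has distinct eigenvalues; otherwise the eigenvector basis, and hence the separation, is not unique.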

A similar algorithm had already been proposed by Lin (1998), but without consideration of the assumptions necessary for its successful application. In Theis (2004a, theorem 3), we gave precise conditions for when this algorithm may be applied and showed that points satisfying these conditions can indeed be found if the sources contain at most one Gaussian component. Lin used a discrete approximation of the derivative operator to approximate the Hessian; we suggested kernel-based density estimation instead, which can be differentiated directly. A similar algorithm based on Hessian diagonalization had been proposed by Yeredor (2000) using the character of a random vector. However, the character is complex-valued, and additional care has to be taken when applying a complex logarithm; basically, this is only well-defined locally at non-zeros. In algorithmic terms, the character can easily be approximated from samples. Yeredor suggested joint diagonalization of the Hessian of the logarithmic character evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we proposed to use a combined energy function based on the previously defined separator. This also takes global information into account, but does not have the drawback of being singular at zeros of the density or character, respectively.



Complex generalization

Comon (1994) showed separability of linear real BSS using the Darmois-Skitovitch theorem, see theorem 1.2.1. He noted that his proof for the real case can also be extended to the complex setting. However, a complex version of the Darmois-Skitovitch theorem is needed, which, to the knowledge of the author, had not yet been shown in the literature. In Theis (2004b), see chapter 3, we derived such a theorem as a corollary of a multivariate extension of the Darmois-Skitovitch theorem, first noted by Skitovitch (1954) and shown by Ghurye and Olkin (1962):

Theorem 1.2.2 (complex S-D theorem). Let s1 = Σ_{i=1}^n αi xi and s2 = Σ_{i=1}^n βi xi with x1, . . . , xn independent complex random variables and αj, βj ∈ C for j = 1, . . . , n. If s1 and s2 are independent, then all xj with αj βj ≠ 0 are Gaussian.

We then used this theorem to prove separability of complex BSS; moreover, a generalization to dependent subspaces was discussed, see section 1.3.3. Note that a simple complex-valued uniqueness proof (Theis, 2004c), which does not need the Darmois-Skitovitch theorem, can be derived similarly to the case of real-valued random variables from above. Recently, additional relaxations of complex identifiability have been described by Eriksson and Koivunen (2006).
1.2.2 Nonl<strong>in</strong>ear ICA<br />

With the growth of the field of ICA, <strong>in</strong>terest <strong>in</strong> nonl<strong>in</strong>ear model extensions has been <strong>in</strong>creas<strong>in</strong>g.<br />

It is easy to see however that without any restrictions the class of nonl<strong>in</strong>ear ICA solutions is too<br />

large to be of any practical use (Hyvär<strong>in</strong>en and Pajunen, 1998). Hence, special nonl<strong>in</strong>earities<br />

are usually considered. Here we discuss two specific nonl<strong>in</strong>ear models.<br />

Postnonlinear ICA

A good trade-off between model generality and identifiability is given by the so-called postnonlinear BSS model, realized by

xi = fi( Σ_{j=1}^n aij sj ).                                         (1.3)

This explicit, nonlinear model implies that in addition to the linear mixing situation, each sensor xi contains an unknown nonlinearity fi that can further distort the observations, modeling a MIMO system with nonlinear receiver characteristics. It can be interpreted as a single-layered neural network (perceptron) with nonlinear, unknown activation functions, in contrast to the linear perceptron in the case of linear ICA. The model, first proposed by Taleb and Jutten (1999), has applications in telecommunication and biomedical data analysis. Algorithms for reconstructing postnonlinearly mixed sources include (Achard et al., 2003, Babaie-Zadeh et al., 2002, Taleb and Jutten, 1999, Theis and Lang, 2004a, Ziehe et al., 2003a).
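To make the data-generating side of (1.3) concrete, here is a short sketch assuming NumPy; the nonlinearities fi below are arbitrary invertible example choices, not taken from any of the cited algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0.0, 1.0, size=(2, 1000))          # independent sources, uniform in [0, 1]^2

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # linear mixing part

f = [np.tanh, lambda u: u + 0.2 * u ** 3]           # componentwise sensor nonlinearities f_i

lin = A @ s                                         # inner linear mixtures, sum_j a_ij s_j
x = np.vstack([f[i](lin[i]) for i in range(2)])     # postnonlinear observations x_i = f_i(...), cf. (1.3)
```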

Identifiability of postnonlinear mixtures was first discussed in a limited context by Taleb and Jutten (1999). In Theis and Gruber (2005), see chapter 4, we continued this study. We thereby generalized ideas already presented by Babaie-Zadeh et al. (2002), where the focus was put on the development of an actual identification algorithm. Babaie-Zadeh was the first to use the method of analyzing bounded random vectors in the context of postnonlinear mixtures (Babaie-Zadeh, 2002). There, he already discussed identifiability issues, albeit explicitly only in the two-dimensional analytic case.

Figure 1.3: Example of a non-trivial postnonlinear transformation using an absolutely degenerate matrix A and sources s uniform in [0, 1]². Both the sources s and the recovered sources A⁻¹f(As) have support in [0, 1]², but A is not a permutation and scaling.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear ICA: A can only be reconstructed up to scaling and permutation. Additional indeterminacies come into play here because of translation: fi can only be recovered up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then f(As) = (f ◦ L)((L⁻¹A)s), so f and A can interchange scaling factors in each component. Another indeterminacy can occur if A is not mixing, i.e. if at least one observation xi contains only one source; in this case fi can obviously not be recovered. For example, if A equals the unit matrix I, then f(s) is already independent again, because independence is invariant under component-wise nonlinear transformations, so f cannot be found using this method.

A less obvious indeterminacy occurs if A is absolutely degenerate, which essentially means that the normalized columns differ only by the signs of their entries (Theis and Gruber, 2005, definition 6). Then only the matrix A, but not the nonlinearities, can be recovered by considering the edges of the support of the fully-bounded random vector, as illustrated in figure 1.3. Nonetheless, this is not an indeterminacy of the model itself, since A⁻¹f(As) is obviously not independent. So by looking at the boundary alone, we sometimes cannot detect independence if the whole system is highly symmetric.

If A is mixing and not absolutely degenerate, then for all fully-bounded sources s no more indeterminacies than in the affine linear case exist, except for the scaling interchange between f and A:

Theorem 1.2.3 (Separability of bounded postnonlinear BSS). Let A, W ∈ Gl(n), one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function with h′i ≠ 0, and let s be a fully-bounded independent random vector. If W(h(As)) is also independent, then h is affine linear.



So let f ◦ A be the mixing model and W ◦ g the separating model. Putting the two together, we get the above mixing-separating model with h := g ◦ f. The theorem shows that if the mixing-separating model preserves independence, then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to scaling). As usual, the model is assumed to be invertible, hence identifiability and uniqueness of the model follow from the separability. Note that if f is only assumed to be continuously differentiable, then additional indeterminacies come into play.

As in the linear case, this separability result can be transformed into a postnonlinear ICA algorithm: in Theis and Lang (2004b) we proposed a linearization identification framework for estimating the postnonlinearities in such settings.

Finally, we want to note that we derived a slightly less general uniqueness theorem from the linear separability proof in Theis (2004a), see chapter 2:

Theorem 1.2.4 (Separability of bijective postnonlinear BSS). Let A, W ∈ Gl(n) be mixing, let h : R^n → R^n be a diagonal bijective analytic function with h′i ≠ 0, and let S be an independent random vector with at most one Gaussian component and existing covariance. If W(h(AS)) is also independent, then h is affine linear.

Similar uniqueness results were further studied by Achard and Jutten (2005).

Quadratic ICA

There are a few different ways of extending linear BSS models. In the previous section, we focused on the postnonlinear mixing model (1.3), which adds an unknown nonlinearity at the end of a linear mixing situation. A different approach is to take a known nonlinearity and add another mixing matrix, as in the step from perceptrons to multilayer perceptrons. Another way of putting this is to take only the first few terms of the Taylor expansion of the mixing or unmixing mapping, and to try to learn such regularized systems. In Theis and Nakamura (2004), see chapter 5, we treated polynomial nonlinearities, specifically second-order monomials or quadratic forms. These represent a relatively simple class of nonlinearities, which can be investigated in detail. In contrast to the postnonlinear model formulation, we defined the nonlinearity for the unmixing situation and derived the corresponding mixing model after some assumptions.

We considered the quadratic unmixing model

yi := x⊤ G(i) x                                                      (1.4)

for symmetric matrices G(i), i = 1, . . . , n, and estimated sources y. We restricted ourselves to a special case of this model by assuming that each G(i) has the same set of eigenvectors. Then G(i) = E⊤Λ(i)E with a common orthogonal eigenvector matrix E and a diagonal matrix Λ(i) with coefficients Λ(i)kk on the diagonal. Setting Λ to be the n × n matrix whose i-th row collects these diagonal coefficients, i.e. Λik := Λ(i)kk, therefore yields a two-layered unmixing model y = Λ ◦ (.)² ◦ E ◦ x, where (.)² denotes the componentwise square. This can be interpreted as a two-layered feed-forward neural network, and may easily be inverted explicitly, see figure 1.4.


Figure 1.4: Simplified quadratic unmixing model y = Λ ◦ (.)² ◦ E ◦ x. If Ex only takes values in one quadrant, then the model is invertible and its inverse model again follows a multilayer perceptron structure x = E⊤ ◦ √ ◦ Λ⁻¹ ◦ y.
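The layered structure is easy to check numerically; the following sketch (assuming NumPy) builds quadratic forms G(i) = E⊤Λ(i)E with a shared eigenbasis and verifies that yi = x⊤G(i)x coincides with the two-layer expression Λ(Ex)².

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

E, _ = np.linalg.qr(rng.standard_normal((n, n)))   # shared orthogonal eigenvector matrix
Lam = rng.standard_normal((n, n))                  # Lam[i, k] = Lambda^(i)_kk

# quadratic form matrices G^(i) = E^T diag(Lam[i]) E
G = [E.T @ np.diag(Lam[i]) @ E for i in range(n)]

x = rng.standard_normal(n)
y_quadratic = np.array([x @ G[i] @ x for i in range(n)])   # y_i = x^T G^(i) x, cf. (1.4)
y_layered = Lam @ (E @ x) ** 2                             # y = Lam o (.)^2 o E o x

print(np.allclose(y_quadratic, y_layered))                 # True
```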

Several studies have employed quadratic forms as a generative process of data. Abed-Meraim et al. (1996) suggested analyzing mixtures by second-order polynomials using a linearization of the mixtures, which however in general destroys the assumption of independence. Leshem (1999) proposed a whitening scheme based on quadratic forms in order to enhance linear separation of time-signals in algorithms such as SOBI (Belouchrani et al., 1997). Similar quadratic mixing models are also considered in Georgiev (2001) and Hosseini and Deville (2003). These are studies in which the mixing model is assumed to be quadratic, in contrast to the quadratic unmixing model (1.4). For demixing into independent components by quadratic forms, Bartsch and Obermayer (2003) suggested applying linear ICA to second-order terms of the data, and Hashimoto (2003) proposed an algorithm based on minimization of the Kullback-Leibler divergence. However, identifiability was not studied; instead, these works focused on the application to natural images.

In (Theis and Nakamura, 2004), we defined the above quadratic unmixing process and derived a generative model. We then reduced the quadratic model to an overdetermined linear model, in which more observations than sources are given, by embedding y into R^{m(m+1)/2}; this can be done by taking the monomials xixj as new variables. Using some linear algebra, we then derived the following identifiability theorem for overdetermined ICA:

Theorem 1.2.5 (Uniqueness of overdetermined ICA). Let x = As with an independent n-dimensional random vector s and a full-rank m × n matrix A with m ≥ n, and let the n × m matrix W be chosen such that Wx is independent. Furthermore, assume that s has at most one Gaussian component and that the variances of s exist. Then there exist a permutation matrix P, an invertible scaling matrix L and a matrix C with CA = 0 such that W = LP(A⊤A)⁻¹A⊤ + C.



Figure 1.5: Quadratic ICA of natural images. 3·10⁵ sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvalue matrices of the quadratic form coefficient matrices. For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2 · 64). Above each image the corresponding eigenvalue (multiplied by 10³) is printed. In most filters only one or two eigenvalues are dominant.

This allowed us to derive identifiability for the quadratic unmixing model (1.4). Moreover, we were then able to construct an explicit ICA algorithm from this uniqueness result.

Finally, we studied the algorithm in the context of natural image filters, similar to Bartsch and Obermayer (2003) and Hashimoto (2003). We applied quadratic ICA to a set of small patches. Most of the obtained quadratic forms had one or two dominant linear filters, and these linear filters were selective for local bar stimuli, see figure 1.5. Hence, the values of all quadratic forms corresponded to squared simple-cell outputs.
corresponded to squared simple cell outputs.



1.3 Dependent component analysis

In this section, we will discuss relaxations of the BSS model that take into account additional structure in the data and dependencies between components. Many researchers have taken an interest in this generalization, which is crucial for applications in real-world settings where such situations are to be expected.

Here, we will consider model indeterminacies as well as actual separation algorithms. For the latter, we will employ a technique that has been the basis of one of the first ICA algorithms (Cardoso and Souloumiac, 1993), namely joint diagonalization (JD). It has since become an important tool in ICA-based BSS and in BSS relying on second-order time-decorrelation (Belouchrani et al., 1997). Its task is, given a set of commuting symmetric n × n matrices Ci, to find an orthogonal matrix A such that A⊤CiA is diagonal for all i. This generalizes the eigenvalue decomposition (i = 1) and the generalized eigenvalue problem (i = 2), in which perfect factorization is always possible.

Other extensions of the standard BSS model, such as the inclusion of singular matrices (Georgiev and Theis, 2004), will be omitted from the following discussion.

1.3.1 Algebraic BSS and multidimensional generalizations

Considering the BSS model from equation (1.1), or a more general noisy version x(t) = As(t) + n(t), the data can only be separated if we put additional conditions on the sources, such as:

• they are stochastically independent: ps(s1, . . . , sn) = ps1(s1) · · · psn(sn),

• each source is sparse, i.e. it contains a certain number of zeros or has a low p-norm for small p and fixed 2-norm,

• s(t) is stationary and, for all τ, it has diagonal autocovariances E(s(t + τ) s(t)⊤); here zero-mean s(t) is assumed.

In the following, we will review BSS algorithms based on eigenvalue decomposition, JD and generalizations thereof. One of the above conditions is thereby denoted by the term source condition, because we do not want to specialize to a single model. The additive noise n(t) is modeled by a stationary, temporally and spatially white zero-mean process with variance σ². Moreover, we will not deal with the more complicated underdetermined case, so we assume that at most as many sources as sensors are to be extracted, i.e. n ≤ m.

The signals x(t) are observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A†x(t), which is optimal in the maximum-likelihood sense. Here † denotes the pseudo-inverse of A, which equals the inverse in the case m = n. So the BSS task reduces to the estimation of the mixing matrix A; hence the additive noise n is often neglected (after whitening). Note that in the following we will assume that all signals are real-valued. Extensions to the complex case are straightforward.
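The pseudo-inverse estimation step mentioned above is a one-liner; as a small self-contained check (assuming NumPy), with a known tall mixing matrix and noise-free data the projection A†x recovers the sources exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 2, 4, 500
s = rng.laplace(size=(n, T))                 # some sources (here Laplacian, just as an example)
A = rng.standard_normal((m, n))              # tall full-rank mixing matrix, m >= n
x = A @ s                                    # overdetermined, noise-free mixtures

s_hat = np.linalg.pinv(A) @ x                # A^dagger x; equals A^{-1} x when m == n
print(np.allclose(s_hat, s))                 # True (up to numerical precision)
```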

Approximate joint diagonalization

Many BSS algorithms employ joint diagonalization (JD) techniques on some source condition matrices to identify the mixing matrix. Given a set of symmetric matrices C := {C1, . . . , CK}, JD amounts to minimizing the squared sum of the off-diagonal elements of Â⊤CiÂ, i.e. minimizing

f(Â) := Σ_{i=1}^K ‖Â⊤CiÂ − diag(Â⊤CiÂ)‖²_F                          (1.5)

with respect to the orthogonal matrix Â, where diag(C) produces a matrix in which all off-diagonal elements of C have been set to zero, and ‖C‖²_F := tr(CC⊤) denotes the squared Frobenius norm. A global minimum A of f is called a joint diagonalizer of C. An exact joint diagonalizer, i.e. one rendering f zero, exists if and only if all elements of C commute.

Algorithms for performing joint diagonalization include, among others, gradient descent on f(Â), Jacobi-like iterative construction of A by Givens rotations in two coordinates (Cardoso and Souloumiac, 1995), an extension minimizing a logarithmic version of (1.5) (Pham, 2001), an alternating optimization scheme switching between column and diagonal optimization (Yeredor, 2002) and, more recently, a linear least-squares algorithm for diagonalization (Ziehe et al., 2003b); the latter three algorithms can also search for non-orthogonal matrices A. Note that in practice, minimization of the off-diagonal sums only yields an approximate joint diagonalizer: with finite samples, the source condition matrices are estimates, hence they only approximately share the same eigenstructure and do not fully commute, so f(Â) from equation (1.5) cannot be rendered precisely zero, but only approximately.
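A compact sketch of such an orthogonal joint diagonalizer is given below (assuming NumPy); it follows the Jacobi-angle idea of Cardoso and Souloumiac, sweeping over index pairs and solving a 2 × 2 rotation subproblem for each pair. It is meant as an illustration of the cost (1.5) being driven down, not as a reference implementation.

```python
import numpy as np

def joint_diagonalize(Cs, sweeps=50, tol=1e-12):
    """Approximate orthogonal joint diagonalization of symmetric matrices Cs
    by Jacobi (Givens) rotations, in the spirit of Cardoso and Souloumiac (1995)."""
    Cs = np.array([np.asarray(C, dtype=float).copy() for C in Cs])
    n = Cs.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        changed = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # optimal Givens angle for the (p, q) pair from a 2x2 eigenproblem
                h = np.column_stack([Cs[:, p, p] - Cs[:, q, q],
                                     Cs[:, p, q] + Cs[:, q, p]])
                G = h.T @ h
                if G[0, 0] + G[1, 1] < tol:            # already diagonal in this pair
                    continue
                _, evecs = np.linalg.eigh(G)
                xv, yv = evecs[:, -1]                  # dominant (unit-norm) eigenvector
                if xv < 0:
                    xv, yv = -xv, -yv
                c = np.sqrt(0.5 + xv / 2)
                s = 0.5 * yv / c
                if abs(s) > tol:
                    changed = True
                    R = np.eye(n)
                    R[p, p] = R[q, q] = c
                    R[p, q], R[q, p] = -s, s
                    Cs = np.einsum('ji,kjl,lm->kim', R, Cs, R)   # Ci <- R^T Ci R
                    V = V @ R
        if not changed:
            break
    return V, Cs        # V^T Ci V is (approximately) diagonal for all i
```

In a BSS setting, the Ci would be the estimated source-condition matrices of the whitened mixtures; the returned V then plays the role of the orthogonal mixing matrix, and V⊤x̃ gives the source estimates.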

Source conditions

In order to obtain a well-defined source separation model, assumptions on the sources, such as stochastic independence, have to be formulated. In practice, the conditions are preferably given in terms of roots of some cost function that can easily be estimated. Here, we summarize some of the source conditions used in the literature; they are defined by a criterion specifying the diagonality of a set of matrices C(.) := {C1(.), . . . , CK(.)}, which can be estimated from the data. We require only the equivariance property

Ci(Wx) = W Ci(x) W⊤                                                  (1.6)

for any matrix W. Note that using the substitution C̄i(x) := Ci(x) + Ci(x)⊤, we can assume Ci(x) to be symmetric. The actual source model is then defined by requiring the sources to fulfill the diagonality criterion, i.e. Ci(s) is diagonal for all i = 1, . . . , K. In table 1.1, we review some commonly used source conditions for an m-dimensional centered random vector x or a multivariate random process x(t), respectively.

Searching for sources s := Wx fulfilling the source model means finding matrices W such that Ci(Wx) is diagonal for all i. Depending on the algorithm, whitening by PCA is performed as preprocessing to allow for a reduced search over the orthogonal group W ∈ O(n). This is equivalent to setting all source second-order statistics to I and then searching only for rotations. In the case K = 1, the search can be performed by an eigenvalue decomposition of the source condition C1(x̃) of the whitened mixtures x̃; this is equivalent to solving the generalized eigenvalue decomposition (GEVD) problem for the matrix pencil (E(xx⊤), C1(x̃)). Using more than one condition matrix usually increases the robustness of the resulting algorithm; in these cases the algorithm performs orthogonal JD of C := {Ci(x̃)}, e.g. by a Jacobi-type algorithm (Cardoso and Souloumiac, 1995).
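For the single-matrix case K = 1, the whole pipeline fits into a few lines. The following sketch (assuming NumPy) is an AMUSE-style separation: PCA whitening, one symmetrized autocovariance matrix of the whitened data as the source condition, and an eigenvalue decomposition to obtain the remaining rotation.

```python
import numpy as np

def amuse(x, tau=1):
    """AMUSE-style BSS sketch: EVD of one symmetrized lag-tau autocovariance
    of the PCA-whitened mixtures x (array of shape (n, T), real-valued)."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))                    # PCA whitening
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    z = V @ x

    C = z[:, tau:] @ z[:, :-tau].T / (z.shape[1] - tau) # lag-tau autocovariance (condition matrix)
    C = (C + C.T) / 2                                   # symmetrize, cf. the substitution above

    _, R = np.linalg.eigh(C)                            # rotation from the EVD
    W = R.T @ V                                         # unmixing matrix
    return W @ x, W
```

Separation requires the eigenvalues of the lag-τ source autocovariance to be pairwise distinct; otherwise several rotations diagonalize C equally well, which is exactly why using more than one condition matrix (as in SOBI or mdSOBI) improves robustness.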


Table 1.1: BSS algorithms based on joint diagonalization (centered sources are assumed).
Columns: algorithm | source model | condition matrices | optimization algorithm

FOBI (Cardoso and Souloumiac, 1990) | independent i.i.d. sources | contracted quadricovariance matrix with Eij = I | EVD after PCA (GEVD)
JADE (Cardoso and Souloumiac, 1993) | independent i.i.d. sources | contracted quadricovariance matrices | orthogonal JD after PCA
eJADE (Moreau, 2001) | independent i.i.d. sources | arbitrary-order cumulant matrices | orthogonal JD after PCA
HessianICA (Theis, 2004a, Yeredor, 2000) | independent i.i.d. sources | multiple Hessians Hlog x̂(x^(i)) or Hlog px(x^(i)) | orthogonal JD after PCA
AMUSE (Molgedey and Schuster, 1994, Tong et al., 1991) | wide-sense stationary s(t) with diagonal autocovariances | single autocovariance matrix E(x(t + τ)x(t)⊤) | EVD after PCA (GEVD)
SOBI (Belouchrani et al., 1997), TDSEP (Ziehe and Mueller, 1998) | wide-sense stationary s(t) with diagonal autocovariances | multiple autocovariance matrices | orthogonal JD after PCA
mdAMUSE (Theis et al., 2004e) | s(t1, . . . , tM) with diagonal autocovariances | single multidimensional autocovariance matrix (1.7) | EVD after PCA (GEVD)
mdSOBI (Schießl et al., 2000, Theis et al., 2004e) | s(t1, . . . , tM) with diagonal autocovariances | multidimensional autocovariance matrices (1.7) | orthogonal JD after PCA
JADE_TD (Müller et al., 1999) | independent s(t) with diagonal autocovariances | cumulant and autocovariance matrices | orthogonal JD after PCA
SONS (Choi and Cichocki, 2000) | non-stationary s(t) with diagonal (auto-)covariances | (auto-)covariance matrices of windowed signals | orthogonal JD after PCA
ACDC (Yeredor, 2002), LSDIAG (Ziehe et al., 2003b) | independent or auto-decorrelated s(t) | covariance matrices and cumulant/autocovariance matrices | non-orthogonal JD
block-Gaussian likelihood (Pham and Cardoso, 2001) | block-Gaussian non-stationary s(t) | (auto-)covariance matrices of windowed signals | non-orthogonal JD
TFS (Belouchrani and Amin, 1998) | s(t) from Cohen's time-frequency distributions (Cohen, 1995) | spatial time-frequency distribution matrices | orthogonal JD after PCA
FRT-based BSS (Karako-Eilon et al., 2003) | non-stationary s(t) with diagonal block-spectra | autocovariance of FRT-transformed windowed signal | (non-)orthogonal JD
ACMA (van der Veen and Paulraj, 1996) | s(t) of constant modulus (CM) | independent vectors in ker P̂ of the model matrix P̂ | generalized Schur (QZ) decomposition
stBSS (Theis et al., 2005a) | spatiotemporal sources S := s(r, t) | any of the above conditions for both X and X⊤ | non-orthogonal JD
group BSS (Theis, 2005a) | group-dependent sources s(t) | any of the above conditions | block orthogonal JD after PCA
after PCA


In contrast to this technique, sometimes called hard-whitening, soft-whitening tries to avoid a bias towards second-order statistics and uses a non-orthogonal joint diagonalization algorithm (Pham, 2001, Yeredor, 2002, Ziehe et al., 2003b), jointly diagonalizing the source conditions Ci(x) together with the mixture covariance matrix E(xx⊤). Then possible estimation errors in the second-order part do not have a disproportionately large influence on the total error.

Depending on the source conditions, various algorithms have been proposed in the literature. Table 1.1 gives an overview of these algorithms together with the references, the source model, the condition matrices and the optimization algorithm. For more details and references, we refer to Theis and Inouye (2006).
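Most of the methods in table 1.1 share the same computational core: an (approximate) orthogonal joint diagonalization of a small set of symmetric condition matrices. As an illustration only, the following is a minimal sketch of such a Jacobi-type sweep in the spirit of Cardoso and Souloumiac (1995); it is not the implementation used in the cited papers, and the function name, stopping rule and toy example are chosen here purely for illustration:

import numpy as np

def jacobi_joint_diag(Cs, eps=1e-8, max_sweeps=100):
    # Approximate orthogonal joint diagonalization of real symmetric matrices Cs.
    # Returns an orthogonal V such that V.T @ C @ V is as diagonal as possible for all C.
    Cs = [C.copy() for C in Cs]
    n = Cs[0].shape[0]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # closed-form Givens angle for the (p, q) subproblem over all matrices
                g = np.array([[C[p, p] - C[q, q], C[p, q] + C[q, p]] for C in Cs])
                G = g.T @ g
                _, E = np.linalg.eigh(G)
                x, y = E[:, -1]                    # principal eigenvector of G
                if x < 0:
                    x, y = -x, -y
                r = np.hypot(x, y)
                if r < eps:
                    continue
                c = np.sqrt((x + r) / (2 * r))
                s = y / (2 * c * r)
                if abs(s) > eps:
                    rotated = True
                    R = np.array([[c, -s], [s, c]])
                    for C in Cs:
                        C[:, [p, q]] = C[:, [p, q]] @ R
                        C[[p, q], :] = R.T @ C[[p, q], :]
                    V[:, [p, q]] = V[:, [p, q]] @ R
        if not rotated:
            break
    return V

# toy check: three randomly rotated diagonal matrices are re-diagonalized by V
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Cs = [Q @ np.diag(rng.standard_normal(4)) @ Q.T for _ in range(3)]
V = jacobi_joint_diag(Cs)
print(np.round(V.T @ Cs[0] @ V, 3))   # approximately diagonal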

Multidimensional autodecorrelation

In Theis et al. (2004e), see chapter 6, we considered BSS algorithms based on time-decorrelation and the resulting source condition. Corresponding JD-based algorithms include AMUSE (Tong et al., 1991) and extensions such as SOBI (Belouchrani et al., 1997) and TDSEP (Ziehe and Mueller, 1998). They rely on the fact that the data sets have non-trivial autocorrelations. We extended them to data sets having more than one direction in the parametrization, such as images. For this, we replaced one-dimensional autocovariances by multidimensional autocovariances defined by

C_{τ1,...,τM}(s) := E( s(z1 + τ1, ..., zM + τM) s(z1, ..., zM)⊤ )    (1.7)

where s is centered and the expectation is taken over (z1, ..., zM). C_{τ1,...,τM}(s) can be estimated from equidistant samples by replacing random variables by sample values and expectations by sums, as usual.
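As a small numerical illustration of this estimator (not taken from the cited papers; the array layout, names and normalization are chosen here for illustration), the two-dimensional autocovariance matrix of a set of image sources can be computed from shifted products of the centered images:

import numpy as np

def md_autocov(S_imgs, tau):
    # Multidimensional autocovariance (1.7) for image sources (M = 2).
    # S_imgs: array of shape (n, h, w), one image per source component.
    # tau:    non-negative shift (tau1, tau2); returns the n x n matrix E[s(z+tau) s(z)^T].
    n, h, w = S_imgs.shape
    t1, t2 = tau
    S = S_imgs - S_imgs.mean(axis=(1, 2), keepdims=True)   # center each component
    A = S[:, t1:, t2:].reshape(n, -1)                       # s(z + tau) over all valid z
    B = S[:, :h - t1, :w - t2].reshape(n, -1)               # s(z)
    return A @ B.T / A.shape[1]

Stacking such matrices for several shifts (τ1, τ2) gives the multidimensional autocovariance condition matrices of mdAMUSE and mdSOBI listed in table 1.1.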

A typical example for non-trivial multidimensional autocovariances is a source data set in which each component si represents an image of size h × w. Then the data is of dimension M = 2, and samples of s are given at indices z1 = 1, ..., h, z2 = 1, ..., w. Classically, s(z1, z2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples (vectorization). If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure based algorithms such as AMUSE and SOBI, results can vary greatly depending on the choice of this mapping.

The advantage of using multidimensional autocovariances lies in the fact that the multidimensional structure of the data set can now be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance. Figure 1.6 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ1, τ2), after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional covariance. Only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. captures image structure.

For details as well as extended simulations and examples, we refer to Theis et al. (2004e) and related work by Schießl et al. (2000), Schöner et al. (2000).



Figure 1.6: Example of the one- and two-dimensional autocovariance coefficients (b) of the gray-scale 128 × 128 Lena image (a) after normalization to variance 1. Clearly, using the local structure in both directions (2d-autocov) guarantees that for small τ larger autocorrelation values are present than after rearranging the data into a vector (1d-autocov), which loses the information about the second dimension.

1.3.2 Spatiotemporal BSS

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. In Theis et al. (2007b), see chapter 9, we propose an algorithm including such spatiotemporal information into the analysis, and reduce the problem to the joint approximate diagonalization of a set of autocorrelation matrices.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. (2002), it is a promising method which has potential applications in areas where the data contains an inherent spatiotemporal structure, such as data from biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm by Bell and Sejnowski (1995), which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are being performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization. This has the advantage of greatly reducing the number of parameters in the system, and leads to more stable optimization algorithms. In Theis et al. (2007b), we extended Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structures of the data.

Figure 1.7: Temporal, spatial and spatiotemporal BSS models: (a) temporal BSS, X = tA tS; (b) spatial BSS, X⊤ = sA sS; (c) spatiotemporal BSS, X = sS⊤ tS. The lines in the matrices ∗S indicate the sample direction. Source conditions apply between adjacent such lines.

For this, we considered data sets x(r, t) depending on two indices r and t, where r ∈ R^n can be any multidimensional (spatial) index and t indexes the time axis. In order to be able to use matrix notation, we contracted the spatial multidimensional index r into a one-dimensional index r by row concatenation. Then the data set x(r, t) =: xrt can be represented by a data matrix X of dimension sm × tm, where the superscripts s(·) and t(·) denote spatial and temporal variables, respectively.

Temporal BSS implies the matrix factorization X = tA tS, whereas spatial BSS implies the factorization X⊤ = sA sS, or equivalently X = sS⊤ sA⊤. Hence X = tA tS = sS⊤ sA⊤. So both source separation models can be interpreted as matrix factorization problems; in the temporal case, restrictions such as diagonal autocorrelations are determined by the second factor, in the spatial case by the first one. In order to achieve a spatiotemporal model, we required these conditions from both factors at the same time. Therefore the spatiotemporal BSS model can be derived from the above as the factorization problem

X = sS⊤ tS    (1.8)

with spatial source matrix sS and temporal source matrix tS, which both have (multidimensional) autocorrelations being as diagonal as possible. The three models are illustrated in figure 1.7.

Concerning conditions for the sources, we interpreted Ci(X) := Ci(tx(t)) as the i-th temporal autocovariance matrix, whereas Ci(X⊤) := Ci(sx(r)) denoted the corresponding spatial autocovariance matrix. Application of the spatiotemporal mixing model from equation (1.8) together with the transformation properties (1.6) of the source conditions yields

Ci(tS) = sS†⊤ Ci(X) sS†  and  Ci(sS) = tS†⊤ Ci(X⊤) tS†    (1.9)

because ∗m ≥ n and hence ∗S ∗S† = I. By assumption the matrices Ci(∗S) are as diagonal as possible. In order to separate the data, we had to find diagonalizers for both Ci(X) and Ci(X⊤) such that they satisfy the spatiotemporal model (1.8). As the matrices derived from X had to be diagonalized in terms of both columns and rows, we denoted this by double-sided approximate joint diagonalization.

In Theis et al. (2007b, 2005a) we showed how to reduce this process to joint diagonalization. In order to get robust estimates of the source conditions, dimension reduction was essential. For this we considered the singular value decomposition of X, and formulated the algorithm in terms of the pseudo-orthogonal components of X. Of course, instead of using autocovariance matrices, other source conditions Ci(·) from table 1.1 can be employed in order to adapt to the separation problem at hand.

We present an application of the spatiotemporal BSS algorithm to fMRI data using multidimensional autocovariances in section 1.6.1.
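To make the double-sided condition more concrete, the following rough sketch (a simplification for illustration, not the algorithm of Theis et al. (2007b)) reduces X by a truncated SVD and collects lagged autocovariance matrices along both the temporal and the spatial direction of the reduced representations; these could then be handed to a single joint diagonalizer such as the Jacobi sweep sketched after table 1.1. For image-like spatial indices one would use the multidimensional autocovariances (1.7) instead of one-dimensional lags.

import numpy as np

def lagged_cov(Y, lag):
    # symmetrized lagged autocovariance of the rows of Y (components x samples)
    Y = Y - Y.mean(axis=1, keepdims=True)
    C = Y[:, lag:] @ Y[:, :Y.shape[1] - lag].T / (Y.shape[1] - lag)
    return (C + C.T) / 2

def spatiotemporal_conditions(X, n_sources, lags=(1, 2, 3)):
    # reduce X (space x time) to n_sources dimensions via a truncated SVD and
    # collect temporal and spatial condition matrices of the reduced data
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    U, sv, Vt = U[:, :n_sources], sv[:n_sources], Vt[:n_sources]
    D = np.diag(np.sqrt(sv))
    temporal = D @ Vt            # reduced temporal representation
    spatial = D @ U.T            # reduced spatial representation
    return [lagged_cov(temporal, L) for L in lags] + [lagged_cov(spatial, L) for L in lags]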

1.3.3 Independent subspace analysis

Another extension of the simple source separation model lies in extracting groups of sources that are independent of each other, but not within the group. So, multidimensional independent component analysis or independent subspace analysis (ISA) denotes the task of transforming a multivariate observed sensor signal such that groups of the transformed signal components are mutually independent, while dependencies within the groups are still allowed. This allows for weakening the sometimes too strict assumption of independence in ICA, and has potential applications in various fields such as ECG, fMRI analysis or convolutive ICA.

Recently we were able to calculate the indeterminacies of group ICA for known and unknown group structure, which finally enabled us to guarantee successful application of group ICA to BSS problems. Here, we will shortly review the identifiability result as well as the resulting algorithm for separating signals into groups of dependent signals. As before, the algorithm is based on joint (block) diagonalization of sets of matrices generated using one or multiple source conditions.

Generalizations of the ICA model that include dependencies between multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA was first introduced by Cardoso (1998) using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes (Lin, 1998) are quite general; however, no identifiability results were presented, and applicability to an arbitrary random vector was unclear. Later, in the special case of equal group sizes k, in the following denoted as k-ISA, uniqueness results have been extended from the ICA theory (Theis, 2004b). Algorithmic enhancements in this setting have recently been studied by Poczos and Lörincz (2005). Similar to Cardoso (1998), Akaho et al. (1999) also proposed to employ a multidimensional-component maximum likelihood algorithm, however in the slightly different context of multimodal component analysis. Moreover, if the observations contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation (Ilin, 2006, Vollgraf and Obermayer, 2001).

Hyvärinen and Hoyer (2000) presented a special case of k-ISA by combining it with invariant feature subspace analysis. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA (Hyvärinen et al., 2001a), where dependencies between all components are assumed and modeled along a topographic structure (e.g. a 2-dimensional grid). However, these two approaches are not completely blind anymore. Bach and Jordan (2003b) formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree dependent component analysis (Bach and Jordan, 2003a). Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, we will review the two contributions Theis (2004b) and Theis (2007), where no additional structures were necessary for the separation.

Fixed group structure: k-ISA

A random vector y is called an independent component of the random vector x if there exists an invertible matrix A and a decomposition x = A(y, z) such that y and z are stochastically independent. We note that this is a more general notion of independent components than in ICA, since we do not require them to be one-dimensional.

The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector x into independent components. If x is to be decomposed into one-dimensional components, this coincides with ordinary ICA. Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k, or simply k-ISA.

As we have seen before, an important structural aspect in the search for decompositions is the knowledge of the number of solutions, i.e. the indeterminacies of the problem. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of x. This is of course known for 1-ISA, i.e. ICA, see section 1.2.1.

In Theis (2004b), see chapter 3, we were able to extend this result to k-ISA, given some additional restrictions to the model: we denoted A as k-admissible if for each r, s = 1, ..., n/k the (r, s) sub-k-matrix of A is either invertible or zero. Then the following theorem can be derived from the multivariate Darmois-Skitovitch theorem (section 1.2.1) or using our previously discussed approach via differential equations (Theis, 2005c).

Theorem 1.3.1 (Separability of k-ISA). Let A ∈ Gl(n; R) be k-admissible, and let S be a k-independent n-dimensional random vector having no Gaussian k-dimensional component. If AS is again k-independent, then A is the product of a k-block-scaling and a permutation matrix.

This shows that k-ISA solutions are unique except for trivial transformations if the model has no Gaussians and is admissible; this result can now be turned into a separation algorithm.


ISA with known group structure via joint block diagonalization

In order to solve ISA with fixed block-size k, or at least known block structure, we will use a generalization of joint diagonalization which searches for block-structures instead of diagonality. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition n = m1 + ... + mr of n and setting m := (m1, ..., mr) ∈ N^r. An n × n matrix is said to be m-block diagonal if it is of the form

⎛ M1  · · ·   0  ⎞
⎜  ⋮     ⋱     ⋮  ⎟
⎝  0   · · ·  Mr ⎠

with arbitrary mi × mi matrices Mi.

As a generalization of JD to the case of known block structure, the joint m-block diagonalization (m-JBD) problem is defined as the minimization of

f^m(Â) := Σ_{i=1}^K ‖Â⊤CiÂ − diag^m(Â⊤CiÂ)‖²_F    (1.10)

with respect to the orthogonal matrix Â, where diag^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.
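For concreteness, the block-diagonal part operator diag^m and the cost (1.10) translate directly into code (a sketch; the function names are chosen here for illustration):

import numpy as np

def block_diag_part(M, m):
    # zero everything outside the diagonal blocks of sizes m = (m1, ..., mr)
    D = np.zeros_like(M)
    start = 0
    for size in m:
        D[start:start + size, start:start + size] = M[start:start + size, start:start + size]
        start += size
    return D

def jbd_cost(A_hat, Cs, m):
    # f^m(A_hat) from (1.10): residual off-block mass after the transform
    total = 0.0
    for C in Cs:
        T = A_hat.T @ C @ A_hat
        R = T - block_diag_part(T, m)
        total += np.sum(R ** 2)        # squared Frobenius norm
    return total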

A few algorithms to actually perform JBD have been proposed, see Abed-Meraim and Belouchrani (2004), Févotte and Theis (2007a). In the following we will simply perform joint diagonalization and then permute the columns of Â to achieve block-diagonality; in experiments this turns out to be an efficient solution to JBD, although other, more sophisticated pivot selection strategies for JBD are of interest (Févotte and Theis, 2007b). The fact that JD induces JBD has been conjectured by Abed-Meraim and Belouchrani (2004), and we were able to give a partial answer with the following theorem:

Theorem 1.3.2 (JBD via JD). Any block-optimal JBD of the Ci's (i.e. a zero of f^m) is a local minimum of the JD cost function f from equation (1.5).

Clearly not every JBD minimizes f, only those such that in each block of size mk, f(Â) restricted to the block is maximal over Â ∈ O(mk); such JBDs we denote as block-optimal. The proof is given in Theis (2007), see chapter 8.

In the case of k-ISA, where m = (k, ..., k), we used this result to propose an explicit algorithm (Theis, 2005a, see chapter 7). Consider the BSS model from equation (1.1). As usual, by preprocessing we may assume whitened observations x, so A is orthogonal. For the density ps of the sources we therefore get ps(s0) = px(As0). Its Hessian transforms like a 2-tensor, which locally at s0 (see section 1.2.1) guarantees

Hln ps(s0) = Hln px∘A(s0) = A Hln px(As0) A⊤.    (1.11)



Figure 1.8: Applying ICA to a random vector x = As that does not fulfill the ICA model; here s is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error between the random original and the reconstructed mixing matrix using the three ICA algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which can be explained by theorem 1.3.2.

The sources s(t) are assumed to be k-independent, so ps factorizes into r groups depending on k separate variables each. Thus ln ps is a sum of functions depending on k separate variables, and hence Hln ps(s0) is k-block-diagonal. Hessian ISA now simply uses the block-diagonality structure from equation (1.11) and performs JBD of estimates of a set of Hessians Hln ps(si) evaluated at different sampling points si. This corresponds to using the HessianICA source condition from table 1.1. Other source conditions, such as contracted quadricovariance matrices (Cardoso and Souloumiac, 1993), can also be used in this extended framework (Theis, 2007).

Unknown group structure: general ISA

A serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds according to theorem 1.3.1. However, if k-ISA is applied to an arbitrary random vector, a decomposition into groups that are only 'as independent as possible' cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition only as well as possible; however, care has to be taken: the strong uniqueness result is not valid any more, and the results may depend on the algorithm, as illustrated in figure 1.8.

In contrast to ICA and k-ISA, we do not want to fix the size of the groups Si in advance.



Figure 1.9: Linear factorization models for a random vector x = As and the resulting indeterminacies, for (a) ICA, (b) ISA with fixed group size and (c) general ISA. L denotes a one- or higher-dimensional invertible matrix (scaling) and P a permutation, to be applied only along the horizontal line as indicated in the figures. The small horizontal gaps denote statistical independence. One of the key differences between the models is that general ISA may always be applied to any random vector x, whereas ICA and its generalization, fixed-size ISA, yield unique results only if x follows the corresponding model.

Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. The key idea in Theis (2007), see chapter 8, is to allow only irreducible components, defined as random vectors without lower-dimensional independent components.

The advantage of this formulation is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of x are scalings, i.e. invertible transformations within each si, and permutations of si of the same dimension. These are already all indeterminacies, as shown by the following theorem:

Theorem 1.3.3 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part.

Here, no Gaussians had to be excluded from S as in the previous uniqueness theorems, because the dimension reduction result from section 1.5.2 has been used. For details we refer to Theis (2007) and Gutch and Theis (2007). The connections between the various factorization models and the corresponding uniqueness results are illustrated in figure 1.9.

Again, we turned this uniqueness result into a separation algorithm, this time by considering the JADE source condition based on fourth-order cumulants. The key idea was to translate irreducibility into maximal block-diagonality of the source condition matrices Ci(s). Algorithmically, JBD was performed by first using JD according to theorem 1.3.2, followed by permutation and block-size identification (Theis, 2007, algorithm 1). So far, we did not implement a sophisticated clustering step but only a straightforward thresholding method for block-size determination. First results when using more elaborate clustering techniques are promising.
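As an illustration of this thresholding idea (a minimal sketch; the threshold, the coupling measure and the clustering are not those of Theis (2007, algorithm 1)), one can accumulate the residual coupling between the recovered components after joint diagonalization and read off the blocks as connected components of the thresholded coupling graph:

import numpy as np

def identify_blocks(A_hat, Cs, threshold=0.1):
    # group components into blocks from the residual off-diagonal coupling
    n = A_hat.shape[0]
    coupling = np.zeros((n, n))
    for C in Cs:
        coupling += np.abs(A_hat.T @ C @ A_hat)
    coupling /= np.max(coupling)
    adj = coupling > threshold            # strongly coupled pairs stay in one block
    np.fill_diagonal(adj, True)
    # connected components via a simple graph traversal
    blocks, unseen = [], set(range(n))
    while unseen:
        stack, comp = [unseen.pop()], []
        while stack:
            i = stack.pop()
            comp.append(i)
            new = {j for j in unseen if adj[i, j] or adj[j, i]}
            unseen -= new
            stack.extend(new)
        blocks.append(sorted(comp))
    return blocks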



Figure 1.10: Independent subspace analysis with known block structure m = (2, 1) applied to fetal ECG. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). Figure (b) gives the extracted sources using ISA with the Hessian source condition from table 1.1 with 500 Hessian matrices. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) and of the fetal source (component 3) onto the mixture space (a) are plotted.

Finally, we report the example from Theis (2005a) on how to apply the Hessian ISA algorithm to a real-world data set. Following Cardoso (1998), we show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). Our goal is to extract an MECG and an FECG component; however, the MECG cannot be expected to be only one-dimensional, because projections of a three-dimensional (electric) vector field are measured. Hence modeling the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component) makes sense. Application of ISA extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the estimated mixing matrix A and sources s(t) as plotted in figure 1.10(b). A decomposition of the observed ECG data x(t) can be achieved by composing the extracted sources using only the relevant mixing columns. For example, for the MECG part this means applying the projection ΠM := (a1, a2, 0)A⁻¹ to the observations. The results are plotted in figures 1.10(c) and (d). The fetal ECG is most active at sensor 1 (as visual inspection of the observations confirms). When comparing the projection matrices with the results from Cardoso (1998), we find quite high similarity with the ICA-based results, and a modest difference to the projections of the time-based algorithm.
time-based algorithm.


1.4 Sparseness

One of the fundamental questions in signal processing, data mining and neuroscience is how to represent a large data set X, given in form of an (m × T)-matrix, in different ways. A simple approach is a linear matrix factorization

X = AS,    (1.12)

which is equivalent to model (1.1) after gathering the samples into corresponding data matrices X := (x(1), ..., x(T)) ∈ R^{m×T} and S := (s(1), ..., s(T)) ∈ R^{n×T}. We speak of a complete, overcomplete or undercomplete factorization if m = n, m < n or m > n, respectively. The unknown matrices A and S are assumed to have some specific properties, for instance:

(i) the rows si of S are assumed to be samples of a mutually independent random vector, see section 1.2;

(ii) each sample s(t) contains as many zeros as possible; this is the sparse representation or sparse component analysis (SCA) problem;

(iii) the elements of X, A and S are nonnegative, which results in nonnegative matrix factorization (NMF).

There is a large number of papers devoted to ICA problems, but mostly for the (under)complete case m ≥ n. We refer to Lee et al. (1999), Theis et al. (2004d), Zibulevsky and Pearlmutter (2001) and references therein for work on overcomplete ICA. Here, we will discuss constraints (ii) and (iii).

1.4.1 Sparse component analysis

We consider the blind matrix factorization problem (1.12) in the more challenging overcomplete case, where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not be sparse in the time domain itself: it is sufficient to find a linear transformation (e.g. wavelet packets) in which the sources are sufficiently sparse. Applications of the model include biomedical data analysis, where sparsely active sources are often assumed (McKeown et al., 1998), and audio source separation (Araki et al., 2007).

In Georgiev et al. (2005c), see chapter 10, we introduced a novel measure for sparsity and showed that, based on sparsity alone, we were still able to identify both the mixing matrix and the sources uniquely except for trivial indeterminacies. Here, a vector v ∈ R^n is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal X as in equation (1.12) such that S is k-sparse. In our work, we always assume that the sparsity level equals k = n − m + 1, which means that at any time instant, fewer sources than given observations are active.
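As a toy illustration of this source model (all numbers chosen here for illustration), the following generates (n − m + 1)-sparse sources, mixes them with a random 3 × 4 matrix, and checks that every mixture sample lies in the span of at most m − 1 = 2 mixing columns, i.e. on one of the hyperplanes discussed below:

import numpy as np

rng = np.random.default_rng(1)
m, n, T = 3, 4, 1000
k = n - m + 1                      # required zeros per sample: at most m-1 active sources

S = rng.standard_normal((n, T))
for t in range(T):                 # zero out k randomly chosen entries per column
    S[rng.choice(n, size=k, replace=False), t] = 0.0

A = rng.standard_normal((m, n))    # overcomplete mixing matrix
X = A @ S                          # every column of X lies in span{a_i, a_j} for some i < j

# sanity check: each mixture sample is (numerically) in the span of the active columns
for t in range(5):
    active = np.flatnonzero(S[:, t])
    coeff = np.linalg.lstsq(A[:, active], X[:, t], rcond=None)[0]
    print(np.allclose(X[:, t], A[:, active] @ coeff))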

The following theorem shows that essentially the SCA model is unique if fewer sources than mixtures are active, i.e. if the sources are (n − m + 1)-sparse.


Figure 1.11: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R³: (a) the three hyperplanes span{ai, aj} for 1 ≤ i < j ≤ 3 in the 3 × 3 case; (b) the hyperplanes from (a) visualized by their intersection with the sphere; (c) the six hyperplanes span{ai, aj} for 1 ≤ i < j ≤ 4 in the 3 × 4 case. Due to the source sparsity, the mixtures are generated by only two matrix columns ai, aj and are hence contained in a union of hyperplanes: three hyperplanes in the case of three sources (a, b) and six hyperplanes in the case of four sources (c). Identification of the hyperplanes yields both the mixing matrix and the sources.

Theorem 1.4.1 (SCA matrix identifiability). Assume that in the SCA model every m × m submatrix of A is invertible and that S is sufficiently rich represented. Then A is uniquely determined by X except for left-multiplication with permutation and scaling matrices.

Here, S is said to be sufficiently rich represented if for any index set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of S such that each of them has zero elements in the places with indices in I and each m − 1 of them are linearly independent. The next theorem shows that in this case also the sources can be found uniquely:

Theorem 1.4.2 (SCA source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (n − m + 1)-sparse solution, i.e. one with at least n − m + 1 zero components. If A fulfills the condition from theorem 1.4.1, then for almost all x ∈ H this system has no other solution with this property.

The proofs were given in Georgiev et al. (2005c), see chapter 10. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be uniquely recovered from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both the theorems and the corresponding algorithms is illustrated in figure 1.11: by assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determine both mixing matrix and sources.

It is not clear a priori whether any given data matrix X can be factorized into a sparse representation. A necessary and sufficient condition is given in the following theorem from Georgiev et al. (2005c):
(Georgiev et al., 2005c):


Theorem 1.4.3 (SCA conditions). Assume that m ≤ n ≤ T and the matrix X ∈ R^{m×T} satisfies the following conditions:

(i) the columns of X lie in the union H of \binom{n}{m-1} different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;

(ii) for each i ∈ {1, ..., n} there exist p = \binom{n-1}{m-2} different hyperplanes {Hi,j}, j = 1, ..., p, in H such that their intersection Li = ∩_{j=1}^p Hi,j is a one-dimensional subspace;

(iii) any m different Li span the whole R^m.

Then the matrix X is uniquely representable (up to permutation and scaling) as an SCA, satisfying the conditions of theorem 1.4.1.

Algorithms for SCA

In Georgiev et al. (2004, 2005c), we also proposed an algorithm based on random sampling for reconstructing the mixing matrix and the sources; however, it could not easily be applied in noisy settings and high dimensions due to the involved combinatorial searches. Therefore, we derived a novel, robust algorithm for SCA in Theis et al. (2007a), see chapter 11. The key idea was that if the sources are of sufficiently high sparsity, the mixtures are clustered along hyperplanes in the mixture space. Based on this condition, the mixing matrix could be reconstructed; furthermore, this property turned out to be robust against noise and outliers.

The proposed algorithm employed a generalization of the Hough transform in order to detect the hyperplanes in the mixture space, see figure 1.12. This leads to an algorithmically robust matrix and source identification. The Hough-based hyperplane estimation does not depend on the source dimension n, only on the mixture dimension m. With respect to applications, this implies that n can be quite large and hyperplanes will still be found if the grid resolution used in the Hough transform is sufficiently high. Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n.

For applications of the proposed SCA algorithms in signal processing and biomedical data analysis, we refer to section 1.6.3 and to Georgiev et al. (2006, 2005a,b), Theis et al. (2007a). More elaborate source reconstruction methods, applicable once the mixing matrix A is known, were discussed in Theis et al. (2004a).
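A bare-bones version of the hyperplane-detecting Hough transform for m = 3 mixtures (using the parametrization of figure 1.12 below) might look as follows; the grid resolution, the greedy peak picking and all names are chosen here for illustration only, and the algorithm of Theis et al. (2007a) adds further normalization and refinement steps:

import numpy as np

def hough_hyperplanes(X, n_bins=180, n_planes=3):
    # detect planes through the origin in 3D data X (3 x T) via a Hough accumulator
    # over the normal-vector angles (phi, theta)
    phis = np.linspace(0, np.pi, n_bins, endpoint=False)
    acc = np.zeros((n_bins, n_bins))
    for x1, x2, x3 in X.T:
        # theta(phi) such that (cos phi sin theta, sin phi sin theta, cos theta) . x = 0
        thetas = np.mod(np.arctan2(-x3, x1 * np.cos(phis) + x2 * np.sin(phis)), np.pi)
        acc[np.arange(n_bins), (thetas / np.pi * n_bins).astype(int) % n_bins] += 1
    normals = []
    for _ in range(n_planes):                        # greedy peak picking
        i, j = np.unravel_index(np.argmax(acc), acc.shape)
        phi, theta = phis[i], j * np.pi / n_bins
        normals.append(np.array([np.cos(phi) * np.sin(theta),
                                 np.sin(phi) * np.sin(theta),
                                 np.cos(theta)]))
        acc[max(i - 2, 0):i + 3, max(j - 2, 0):j + 3] = 0   # suppress the found peak
    return normals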

Postnonlinear generalization

In Theis and Amari (2004), see chapter 12, we considered the generalization of SCA to postnonlinear mixtures, see section 1.2.2. As before, the data x(t) = f(As(t)) is assumed to be linearly mixed, followed by a componentwise nonlinearity, see equation (1.3). However, now the (m × n)-matrix A is allowed to be 'wide', i.e. the more complicated overcomplete situation with m < n is treated. By using sparseness of s(t), we were still able to recover the system:


Figure 1.12: Illustration of the 'hyperplane detecting' Hough transform in three dimensions. A point (x1, x2, x3) in the data space (left) is mapped onto the curve {(ϕ, θ) | θ = arctan((x1 cos ϕ + x2 sin ϕ)/x3) + π/2} in the parameter space [0, π)² (right). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) of the parameter space, and the data points lie on the plane given by the normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).

Theorem 1.4.4 (Identifiability of postnonlinear SCA). Let S ∈ R^{n×T} be a matrix with (n − m + 1)-sparse columns s(t), and let X ∈ R^{m×T} consist of the columns x(t) = f(As(t)) following the postnonlinear mixture model (1.3). Furthermore assume that

(i) S is fully (n − m + 1)-sparse in the sense that asymptotically for T → ∞ its image equals the union of all (m − 1)-dimensional coordinate subspaces (in which it is contained by the sparsity assumption),

(ii) A is mixing and not absolutely degenerate,

(iii) every m × m submatrix of A is invertible.

If X = f̂(ÂŜ) is another such representation of X, then there exists an invertible scaling L with f = f̂ ∘ L, and invertible scaling and permutation matrices L′, P′ with A = L Â L′ P′.

The proof relied on the fact that when s(t) is sparse as formulated in theorem 1.4.4(i), its image includes all the (m − 1)-dimensional coordinate subspaces and hence the intersections of (m − 1) such subspaces, which give the n coordinate axes. They are transformed into n curves in the x-space, passing through the origin. By identification of these curves, we showed that each nonlinearity fi is in fact linear. The proof used the following lemma, which generalizes the analytic case presented by Babaie-Zadeh et al. (2002).

Lemma 1.4.5. Let a, b ∈ R \ {−1, 0, 1}, a > 0, and let f : [0, ε) → R be differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.
f is l<strong>in</strong>ear.


Figure 1.13: Illustration of the proof of theorem 1.4.4 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (leftmost top) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (right). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (left bottom) according to the algorithms from Georgiev et al. (2004). The idea of the proof was that two lines spanned by coordinate vectors (thick lines) are mapped onto two lines spanned by two columns of A. If the composition g ∘ f maps these lines onto some different lines (as sets), then we showed that (given 'general position' of the two lines) the components of g ∘ f satisfy the conditions from lemma 1.4.5 and hence are already linear.

Theorem 1.4.4 shows that f and A are uniquely determined by x(t) except for scaling and permutation ambiguities. Note that then obviously also s(t) is identifiable, by applying theorem 1.4.2 to the linearized mixtures y(t) = f⁻¹x(t) = As(t), given the additional assumptions on s(t) from the theorem.

Again, we derived an algorithm from this identifiability result. The separation is done in a two-stage procedure: in the first step, after geometrical preprocessing, the postnonlinearities are estimated using an idea similar to the one used in the identifiability proof of theorem 1.4.4, see also figure 1.13. In the second stage, the mixing matrix A and then the sources s are reconstructed by applying linear SCA to the linearized mixtures f⁻¹x(t). For details we refer to Theis and Amari (2004), see chapter 12.


1.4.2 Sparse non-negative matrix factorization

In Theis et al. (2005c), see chapter 13, we studied the factorization problem (1.12) using condition (iii) of non-negativity. Non-negative matrix factorization (NMF) strictly requires both matrices A and S to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition (Lee and Seung, 1999).

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer (2004) proposed a modification of the NMF model to include sparseness: he minimized the deviation ‖X − AS‖ of (1.12) under the constraint of fixed sparseness of both A and S. Here, using a ratio of the 1- and 2-norms of x ∈ R^n \ {0}, the sparseness is measured by σ(x) := (√n − ‖x‖₁/‖x‖₂)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
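In code, this sparseness measure is essentially a one-liner (the function name is chosen here for illustration):

import numpy as np

def sparseness(x):
    # Hoyer's sparseness sigma(x) in [0, 1] for a nonzero vector x
    n = x.size
    return (np.sqrt(n) - np.linalg.norm(x, 1) / np.linalg.norm(x, 2)) / (np.sqrt(n) - 1)

# sparseness(np.array([0., 0., 3.])) == 1.0;  sparseness(np.ones(4)) == 0.0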

We restricted ourselves to the asymptotic case of perfect factorization, and therefore defined sparse NMF (Hoyer, 2004) as the task of finding the matrices A and S in the decomposition X = AS subject to

A, S ≥ 0,   σ(A∗i) = σA,   σ(Si∗) = σS    (1.13)

Here σA, σS ∈ [0, 1] denote fixed constants describing the sparseness of the columns of A respectively of the rows of S.

Uniqueness of sparse NMF

Obvious indeterminacies of the sparse NMF model (1.13) are permutation and positive scaling of the columns of A (and correspondingly of the rows of S). Another, less obvious indeterminacy comes into play due to the sparseness assumption: S is said to be degenerate if each column of S is a multiple of some vector v ∈ R^n. Then the factorization is not unique, because the row-sparseness of S does not depend on v, so transformations that still guarantee non-negativity are possible.

Now assume that two solutions (A, S) and (Ã, S̃) of the sparse NMF model (1.13) are given with A and Ã of full rank; then AS = ÃS̃, and σ(S) = σ(S̃). Let hi = (Si∗)⊤ respectively h̃i = (S̃i∗)⊤ denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume ‖hi‖₂ = ‖h̃i‖₂ = 1 for all i. Hence, the sparseness of the rows is already fully determined by their 1-norms, and ‖hi‖₁ = ‖h̃i‖₁.

The following theorem from Theis et al. (2005c), see chapter 13, showed uniqueness of sparse NMF in some special cases; note that in more general settings some additional indeterminacies (specific to n > 3) come into play, however to our present knowledge they are thin, i.e. of measure zero, and hence of no practical importance.

Theorem 1.4.6 (Uniqueness of sparse NMF). Given two solutions (A, S) and (Ã, S̃) of the sNMF model as above, assume that S is non-degenerate and that either Ã = I and A ≥ 0, or n = 2. Then A = ÃP with a permutation matrix P.


1.4. Sparseness 31<br />

Sparse projection

Algorithmically, we followed Hoyer's approach and solved the sparse NMF problem by alternately updating A and S using gradient descent on the residual error ‖X − AS‖². After each update, the columns of A and the rows of S are projected onto

M := {s | ‖s‖_1 = σ} ∩ {s | ‖s‖_2 = 1} ∩ {s ≥ 0}    (1.14)

in order to satisfy the sparseness conditions of (1.13). For this, points x ∈ R^n have to be projected onto adjacent points in M; here p ∈ M is called adjacent to x, denoted p ⊳ x, if ‖x − p‖_2 ≤ ‖x − q‖_2 for all q ∈ M.

A priori it is not clear when such a p exists and, even more so, when it is unique, see figure 1.14. We answered this question by proving the following theorem:

Theorem 1.4.7 (Existence and uniqueness of the Euclidean projection).
(i) If M is closed and nonempty, then for every x ∈ R^n there exists a p ∈ M with p ⊳ x.
(ii) If X(M) := {x ∈ R^n | #{p ∈ M | p ⊳ x} > 1} denotes the exception or non-uniqueness set of M, then vol(X(M)) = 0.


The above is obvious if M is convex. However here, with M from equation (1.14), this is not the case, and the above theorem is needed. We then denote the (almost everywhere unique) projection by πM(x) := p. In addition, in (Theis et al., 2005c), we proved convergence of Hoyer's projection algorithm.

Figure 1.14: Two exception (non-uniqueness) sets: (a) the exception set of two points, (b) the exception set of a sector.

Iterative projection onto spheres

In Theis and Tanaka (2006), see chapter 14, our goal was to generalize the notion of sparseness. After all, we naturally interpret sparseness of some signal x(t) as x(t) having many zero entries. This can be measured by the 0-pseudo-norm, and it is common to approximate it by p-norms for p → 0. Hence replacing the 1-norm in (1.14) by some p-norm is desirable.

A p-sparse NMF algorithm can then be readily derived. However, we observed that the sparse projection cannot be solved in closed form anymore, and little attention has been paid to finding projections in the case p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖.‖_0. Hence, our goal in (Theis and Tanaka, 2006) was to explore this more general notion of sparseness and to construct an algorithm that projects a vector onto its closest vector of a given sparseness. The resulting algorithm is a non-convex extension of the 'projection onto convex sets' (POCS) algorithm (Combettes, 1993, Youla and Webb, 1982).

Let S^{n−1}_p := {x ∈ R^n | ‖x‖_p = 1} denote the (n − 1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS^{n−1}_p := {x ∈ R^n | ‖x‖_p = c}.




(a) POSH for n = 2, p = 0.5 (b) POSH for n = 3, p = 1

Figure 1.15: Starting from x0 (◦), we alternately project onto cSp and S2. POSH performance is illustrated for p = 0.5 in dimension n = 2 (a), and for p = 1 and n = 3 (b), where a projection via PCA is displayed; no information is lost, hence the sequence of points lies in a plane.

We were looking for the Euclidean projection y = πM(x) onto M := S^{n−1}_2 ∩ cS^{n−1}_p. Note that due to theorem 1.4.7, this p-sparse projection exists if M ≠ ∅ and is almost always unique.

The algorithmic construction of this projection is a direct generalization of POCS: we alternately project first onto S^{n−1}_2 and then onto cS^{n−1}_p, using the Euclidean projection operator from above. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denoted this projection algorithm by projection onto spheres (POSH). Interestingly, the algorithm still converges, which we could prove for p = 1:

Theorem 1.4.8 (Convergence of POSH). Let n ≥ 2 and x ∈ R^n \ X(M). If y^1 := π_{S^{n−1}_2}(x) and iteratively y^i := π_{S^{n−1}_2}(π_{cS^{n−1}_1}(y^{i−1})), then y^i converges to πM(x).

In figure 1.15, we show the application of POSH for p ∈ {0.5, 1}; we visualize the performance in 3 dimensions by projecting the data via PCA, which throws away virtually no information (as confirmed by experiment), indicating the validity of theorem 1.4.8 also in higher dimensions.
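To make the alternating scheme concrete, the following sketch (my own illustration, not the implementation of Theis and Tanaka (2006)) performs the two projections for the case p = 1 with an additional non-negativity constraint, as in the sparse NMF set M of (1.14); on the non-negative orthant the scaled 1-sphere coincides with the scaled standard simplex, whose Euclidean projection can be computed by the standard sorting-based method:

import numpy as np

def project_scaled_simplex(x, c):
    # Euclidean projection onto {y >= 0, sum(y) = c}, i.e. the non-negative part of c*S^{n-1}_1
    u = np.sort(x)[::-1]
    css = np.cumsum(u) - c
    idx = np.arange(1, x.size + 1)
    rho = idx[u - css / idx > 0][-1]
    return np.maximum(x - css[rho - 1] / rho, 0.0)

def posh_p1(x, c, iterations=100):
    # Alternating projections onto S^{n-1}_2 and c*S^{n-1}_1 (non-negative part),
    # in the spirit of POSH for p = 1; the intersection is nonempty for 1 <= c <= sqrt(len(x)).
    y = x / np.linalg.norm(x)
    for _ in range(iterations):
        y = project_scaled_simplex(y, c)
        y = y / np.linalg.norm(y)
    return y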

We want to finish this section by remarking that the strict framework of sparse NMF (1.13) is somewhat problematic, since the sparseness values σA and σS are parameters of the algorithm and hence difficult to choose. In Stadlthanner et al. (2005b), we proposed an alternative factorization to (1.13), where we maximize the sparseness values of A and S in addition to minimizing the distance to X. However, the optimization becomes more intricate, and we have studied enhanced search methods via genetic algorithms in Stadlthanner et al. (2005a).



1.5 Machine learning for data preprocessing

Machine learning denotes the task of computationally finding structures in data. Here we describe some preprocessing techniques that rely on machine learning for denoising, dimension reduction and data grouping (clustering).

1.5.1 Denoising

In many fields of signal processing the examined signals bear considerable noise, which is usually assumed to be additive and decorrelated. For example, in exploratory data analysis of medical data using statistical methods like ICA, the prevalent noise greatly degrades the reliability of the algorithms, and the underlying processes cannot be identified.

In Gruber et al. (2006), see chapter 15, we considered the situation where a one-dimensional signal s(t) ∈ R, given at discrete time steps t = 1, . . . , T, is distorted as follows:

s_N(t) = s(t) + N(t),    (1.15)

where N(t) are i.i.d. samples of a Gaussian random variable, i.e. s_N equals s up to additive stationary white noise. Many denoising algorithms have been proposed for recovering s(t) from its noisy observation s_N(t), see e.g. Effern et al. (2000), Hyvärinen et al. (2001b), Ma et al. (2000) to name but a few. Vetter et al. (2002) suggested an algorithm based on local linear projective noise reduction. The idea was to observe the data in a high-dimensional space of delayed coordinates

s̃_N(t) := (s_N(t), . . . , s_N(t + m − 1)) ∈ R^m

and to denoise the data locally through a projection onto the lower-dimensional subspace of the deterministic signals.

We followed this approach and localized the problem by selecting k clusters of the delayed time series {s̃_N(t) | t = 1, . . . , n}. This can for example be done by a k-means clustering algorithm, see section 1.5.3, which is appropriate for noise selection schemes based on the strength or the kurtosis of the signal, since these statistical properties do not depend on the signal structure. After this local linearization, we may assume that the time-embedded signal can be linearly decomposed into noise and a lower-dimensional signal subspace. We therefore analyzed these k m-dimensional signals using PCA or ICA in order to determine the 'meaningful' components. The unknown number of signal components in the high-dimensional noisy signal was determined either by using Vetter's MDL estimator or by a 2-means clustering of the eigenvectors of the covariance matrix (Liavas and Regalia, 2001). The latter gave a good estimate of the number of signal components if the noise variances are not clustered well enough together but are nevertheless separated from the signal by a large gap.

To reconstruct the noise-reduced signal, we unclustered the data to get a signal s̃_e : {1, . . . , n} → R^m and then averaged over the candidates in the delayed data:

s_e(t) := (1/m) Σ_{i=0}^{m−1} [s̃_e(t − i)]_i.    (1.16)

This idea is illustrated in figure 1.16.
i=0



(a) embedded time series (b) locally linear approximation (c) local projection

Figure 1.16: Denoising by local projective subspace projections. The time series is embedded in time-delayed coordinates (a), where the signal subspace is indicated by a solid line. Clustering in the feature space allows for a locally linear approximation (b). A local projection onto the signal subspace by ICA is performed in (c). Delay and signal subspace dimensions and the number of clusters are estimated using an MDL criterion.

Moreover, in Gruber et al. (2006), we compared the above algorithm with a denoising method based on generalized eigenvalue decomposition called delayed AMUSE (Tomé et al., 2005), and with established kernel PCA denoising (Mika et al., 1999, Schölkopf et al., 1998), where solving the inverse problem for recovering the data turned out to be non-trivial. Finally, we showed applications to water-artefact removal in proton NMR spectra; such spectra are an indispensable contribution to the underlying structure determination process but are hampered by the presence of the very intense water proton signal (Stadlthanner et al., 2003, 2006b).

1.5.2 Dimension reduction

An important open problem in signal processing is the task of efficient dimension reduction, that is, the search for meaningful signals within a higher-dimensional data set. Classical techniques such as principal component analysis hereby define 'meaningful' using second-order statistics (maximal variance), which may often be inadequate for signal detection, e.g. in the presence of strong noise. This contrasts with higher-order models including projection pursuit (Friedman and Tukey, 1975, Hyvärinen and Oja, 1997, Kruskal, 1969) or non-Gaussian component analysis, for short NGCA (Blanchard et al., 2006, Kawanabe, 2005, Kawanabe and Theis, 2007). While the former classically extracts a single non-Gaussian independent component from the data set, the latter tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made.

The goal of linear dimension reduction can be defined as the search for a projection W ∈ Mat(n × d) of a d-dimensional data set X, here modeled by a random vector, with n < d and WX still bearing as much information about X as possible. Of course the latter condition has



to be specified in detail in terms of some distance, index or source model, and many different such indices have been studied in the setting of projection pursuit (Friedman, 1987, Huber, 1985, Hyvärinen, 1999), among others. This problem describes a special case of the larger field of model selection (Friedman and Tukey, 1975), an important tool for preprocessing and dimension reduction, used in a wide range of applications.

In Theis and Kawanabe (2006), see chapter 16, we studied non-Gaussian component analysis as proposed by Blanchard et al. (2006). The idea was to follow the classical projection pursuit idea and to choose non-Gaussianity as the measure of information content of the projection. The remainder X − W⊤WX after the projection was required to be Gaussian and independent of WX. For the theoretical analysis, we did not need further restrictions, for instance by specifying an estimator, which would of course be necessary for algorithmic purposes. In that respect, we provided a uniqueness result for projection pursuit in general. Our goal was to describe necessary and sufficient conditions for such projections to exist and to be unique.

Consider for example the three-dimensional random vector X = (X1, X2, X3) with X1 non-Gaussian, say uniform, but X2 and X3 Gaussian, such that X is mutually independent. Then, if we were looking for two-dimensional projections of X, we would obviously find multiple, different projections such as

( 1 0 0 )               ( 1 0 0 )
( 0 1 0 ) ,  but also   ( 0 0 1 ) .

In both cases the remainder of the projection (X3 or X2, respectively) is Gaussian and independent of the projected vectors, but the projection still contains a Gaussian component. If we are to look for one-dimensional projections of X instead, only the projection onto the first coordinate yields a Gaussian remainder, as desired. So in this example, uniqueness follows if the Gaussian subspace is of maximal dimension, or correspondingly the non-Gaussian subspace of minimal dimension. And precisely this is the sufficient and necessary condition for uniqueness.
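As a toy numerical illustration of this example (my own sketch, not taken from the cited papers), one can generate such an X and rank the coordinate projections by a simple non-Gaussianity index such as the excess kurtosis; only the first coordinate is clearly non-Gaussian:

import numpy as np

rng = np.random.default_rng(0)
T = 100_000
# X1 uniform (non-Gaussian), X2 and X3 Gaussian, all mutually independent
X = np.column_stack([rng.uniform(-1, 1, T), rng.normal(size=T), rng.normal(size=T)])

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3.0

for i in range(3):
    print(f"coordinate {i + 1}: excess kurtosis = {excess_kurtosis(X[:, i]):+.2f}")
# coordinate 1 is markedly non-Gaussian (about -1.2 for the uniform law),
# coordinates 2 and 3 are close to 0.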

Uniqueness of non-Gaussian component analysis

A factorization X = AS as in (1.1) with A ∈ Gl(d), random vector S = (S_N, S_G) and S_N ∈ L2(Ω, R^n) is called an n-dimensional non-Gaussian component analysis of X if S_N and S_G are stochastically independent and S_G is Gaussian. In the corresponding decomposition A = (A_N, A_G), the n-dimensional subvectorspace spanned by the columns of A_N is called the non-Gaussian subspace, and the subspace spanned by A_G the Gaussian subspace of the decomposition.

The basic idea of NGCA versus simple principal component analysis (PCA) is illustrated in figure 1.17. Dimension reduction essentially deals with the question of removing a noise subspace. Classically, a signal is differentiated from noise by having a higher variance, and algorithms such as PCA in the linear case remove the low-variance components, see figure 1.17(a). Second-order techniques, however, fail to capture signals that are deteriorated by noise of similar or stronger power, so higher-order statistics are necessary to remove the noise, see figure 1.17(b).

The following theorem connects uniqueness of the dimension reduction model with minimality, and gives a simple characterization of it.




(a) PCA dimension reduction model (b) non-Gaussian subspace analysis model with directional histograms illustrating (non-)Gaussianity

Figure 1.17: Illustration of dimension reduction by NGCA versus PCA. The signal subspace in (a) is given by a linear direction of high variance, whereas in (b), noise is defined by Gaussianity. The signal subspace is correspondingly given by directions of non-Gaussianity, which allows for the extraction of low-power signals.

Theorem 1.5.1 (Uniqueness of NGCA). Let n < d. Given an n-dimensional NGCA A_N S_N + A_G S_G of the random vector X ∈ L2(Ω, R^d), the following statements are equivalent:
(i) The decomposition is minimal, i.e. n is minimal.
(ii) There exists no basis M ∈ Gl(n) such that (MS_N)(1) is Gaussian and independent of (MS_N)(2 : n).
(iii) The subspaces of the decomposition are unique, i.e. any other n-decomposition has the same non-Gaussian and Gaussian subspaces.

Condition (ii) means that there exists no Gaussian independent component in the non-Gaussian part of the decomposition. The theorem proves that this is equivalent to the decomposition being minimal. Note that in (ii), it is not enough to require only that there exists no Gaussian component, i.e. no v ∈ R^n such that v⊤S_N is Gaussian. A simple counterexample is given by a two-dimensional random vector S with density c exp(−s1² − (s1² + s2)²), with c a normalizing constant. Then indeed S(1) = S1 is Gaussian because ∫_R c exp(−s1² − (s1² + s2)²) ds2 = c′ exp(−s1²) (substituting u = s1² + s2 turns the inner integral into a Gaussian integral that does not depend on s1), but clearly no m ∈ R² can be chosen such that S(1) and m⊤S are independent. And indeed, this dependent Gaussian S(1) within S should not be removed by dimension reduction, as it may contain interesting information, not being independent of the other components.

The proof of the theorem was sketched in Theis and Kawanabe (2006), see chapter 16, where we also performed some simulations to validate the uniqueness result. A practical algorithm for NGCA, essentially using the idea of separated characteristic functions from the proof, was proposed in (Kawanabe and Theis, 2007).

Finally, in (Theis and Kawanabe, 2007), we presented a modification of NGCA that evaluates the time structure of the multivariate observations instead of their higher-order statistics. We differentiated the signal subspace from noise by searching for a subspace of non-trivially autocorrelated data. In contrast to blind source separation approaches, however, we did not require the existence of sources, so the model is applicable to any wide-sense stationary time series without restrictions. Moreover, since the method is based on second-order time structure, it could be efficiently implemented even for large dimensions, which we illustrated with an application to dimension reduction of functional MRI recordings.
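The second-order idea can be illustrated compactly; the sketch below is a simplified, AMUSE-like variant of my own that uses a single time lag after whitening, not the full method of Theis and Kawanabe (2007):

import numpy as np

def autocorrelation_subspace(X, lag=1, k=2):
    # X: (T, d) wide-sense stationary multivariate time series.
    # Returns d x k extraction directions with the largest absolute lagged
    # autocorrelation after whitening (simplified sketch).
    Xc = X - X.mean(axis=0)
    C0 = np.cov(Xc, rowvar=False)
    evals0, E = np.linalg.eigh(C0)
    W = E @ np.diag(1.0 / np.sqrt(evals0)) @ E.T      # whitening matrix
    Z = Xc @ W
    Ctau = Z[:-lag].T @ Z[lag:] / (len(Z) - lag)      # lagged covariance of whitened data
    Ctau = 0.5 * (Ctau + Ctau.T)                      # symmetrize
    evals, evecs = np.linalg.eigh(Ctau)
    order = np.argsort(-np.abs(evals))                # largest |autocorrelation| first
    return W @ evecs[:, order[:k]]                    # extraction filters in original coordinates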

1.5.3 Clustering

Clustering methods are an important tool in high-dimensional explorative data mining. They aim at identifying samples or regions of similar characteristics, and often code them by a single codebook vector or centroid. In this section, we review clustering algorithms and employ these methods to solve the blind matrix factorization problem (1.12) from above under various source assumptions.

Clustering for solving overcomplete BSS problems

In Theis et al. (2006), see chapter 17, we discussed the blind source separation problem (1.1) in the difficult case of overcomplete BSS, where fewer mixtures than sources are observed (m < n). We focused on the usually more elaborate matrix-recovery part. Assuming statistically independent sources with existing variance and at most one Gaussian component, it is well known that A is determined uniquely by the mixtures x(t) (Eriksson and Koivunen, 2003). However, how to do this algorithmically is far from obvious, and although some algorithms have been proposed recently (Bofill and Zibulevsky, 2001, Lee et al., 1999, O'Grady and Pearlmutter, 2004), performance is still limited.

The most commonly used overcomplete algorithms rely on sparse sources (after possible sparsification by preprocessing), which can be identified by clustering, usually by k-means or some extension (Bofill and Zibulevsky, 2001, O'Grady and Pearlmutter, 2004). However, apart from the fact that theoretical justifications have not been found, mean-based clustering only identifies the correct A if the data density approaches a delta distribution. In figure 1.18, we illustrate the deficiency of mean-based clustering; we get an error of up to 5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover, the figure indicates that median-based clustering performs much better. Indeed, mean-based clustering does not possess any equivariance property, which would imply performance independent of the choice of A.

We proposed a novel overcomplete, median-based clustering method in (Theis et al., 2006), and proved its equivariance and convergence. Simply put, we first pick 2n normalized starting vectors w_1, w′_1, . . . , w_n, w′_n, and iterate the following steps until an appropriate abort condition has been met: choose a sample x(t) ∈ R^m and normalize it, y(t) := π(x(t)) = x(t)/|x(t)|; let i(t) ∈ [1 : n] be such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to x(t); then set w_{i(t)}(t + 1) := π(w_{i(t)}(t) + η(t)π(σy(t) − w_{i(t)}(t))) and w′_{i(t)}(t + 1) := −w_{i(t)}(t + 1), where σ := 1 if w_{i(t)}(t) is closest to y(t), and σ := −1 otherwise.
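Read literally, this online update can be prototyped as follows (a minimal sketch of my own reading; the learning rate schedule η(t), the initialization and the abort condition are placeholder choices, not those of Theis et al. (2006)):

import numpy as np

def geometric_matrix_recovery(X, n, eta=0.05, sweeps=50, seed=0):
    # X: (T, m) mixtures; n: number of sources.
    # Returns an m x n estimate of A whose columns are the converged unit-norm neurons w_i.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = rng.normal(size=(n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)        # w_1, ..., w_n (w'_i = -w_i implicitly)
    normalize = lambda v: v / np.linalg.norm(v)
    for _ in range(sweeps):
        for x in X:
            if np.linalg.norm(x) == 0:
                continue
            y = normalize(x)
            sims = W @ y                                   # compare with w_i and w'_i = -w_i
            i = np.argmax(np.abs(sims))
            sigma = 1.0 if sims[i] >= 0 else -1.0          # sigma = 1 if w_i itself is closest to y
            W[i] = normalize(W[i] + eta * normalize(sigma * y - W[i]))
    return W.T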



(a) circle histogram for α = 0.4 (b) comparison of mean and median

Figure 1.18: Mean- versus median-based clustering. We consider the mixture x(t) of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2 − α) respectively. (a) shows the mixture density for α = 0.4 after projection onto the circle. For α ∈ [0, π/4), (b) compares the error when estimating A by the mean and the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.


We showed that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median. For this we had to study the end points of geometric matrix-recovery, so we assumed that the algorithm had already converged. The idea then was to formulate a condition which the end points have to satisfy and to show that the solutions are among them.

The mixing angles γ1, . . . , γn ∈ [0, π) are said to satisfy the overcomplete geometric convergence condition (GCC) if they are the medians of y restricted to their receptive fields, i.e. if γi is the median of p_y|F(γi). Moreover, a constant random vector ω̂ ∈ R^n is called a fixed point if E(ζ(y_e − ω̂)) = 0. We showed that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1.18. If we set ξ(ω) := (cos ω, sin ω)⊤, then θ(ξ(ω)) = ω and the following holds:

Theorem 1.5.2 (Convergence of overcomplete median-based clustering). The set Φ of fixed points of geometric matrix-recovery contains an element (ω̂1, . . . , ω̂n) such that the resulting matrix (ξ(ω̂1), . . . , ξ(ω̂n)) solves the overcomplete BSS problem.

The stable fixed points in the above set Φ can be found by the geometric matrix-recovery algorithm.



(a) setup (b) division (c) update (d) after one iteration

Figure 1.19: Illustration of the batch k-means algorithm.

One of the most commonly used partitional clustering techniques is the k-means algorithm, which in its batch form partitions the data set into k disjoint clusters by simply iterating between cluster assignments and cluster updates (Bishop, 1995). In general, its goal can be described as follows: given a set A of points in some metric space (M, d), find a partition of A into disjoint nonempty subsets Bi with ∪i Bi = A, together with centroids ci ∈ M, so as to minimize the sum of the squares of the distances of each point of A to the centroid ci of the cluster Bi containing it.

A common approach to minimizing such energy functions is partial optimization with respect to the division matrix and the centroids. The batch k-means algorithm employs precisely this strategy, see figure 1.19. After an initial, random choice of centroids c1, . . . , ck, it iterates between the following two steps until convergence, measured by a suitable stopping criterion (a small code sketch of these two steps follows below):

• cluster assignment: for each sample x(t) determine an index i(t) = argmin_i d(x(t), ci)
• cluster update: within each cluster Bi := {a(t) | i(t) = i} determine the centroid ci by

ci := argmin_c Σ_{a∈Bi} d(a, c)²    (1.17)
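A minimal Euclidean sketch of these two steps (illustration only; initialization and the stopping criterion are kept deliberately simple):

import numpy as np

def batch_kmeans(X, k, iterations=100, seed=0):
    # X: (T, d) samples. Returns (centroids, labels).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # cluster assignment: nearest centroid for every sample
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # cluster update: the Euclidean minimizer of the summed squared distances is the mean
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels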

Solving (1.17) is straightforward in the Euclidean case, but nontrivial in other metric spaces. In Gruber and Theis (2006), see chapter 18, we generalized the concept of k-means by applying it not to the standard Euclidean space but to the manifold of subvectorspaces of R^n of a fixed dimension p, also known as the Grassmann manifold Gn,p. Important examples include projective space, i.e. the manifold of lines, and the space of all hyperplanes.

We represented an element of Gn,p by p orthonormal vectors (v1, . . . , vp). Concatenating these into an (n × p)-matrix V, this matrix is unique except for right multiplication by an orthogonal matrix. We therefore wrote [V] ∈ Gn,p for the subspace. This allowed us to define a distance d([V], [W]) := 2^{−1/2} ‖VV⊤ − WW⊤‖_F on the Grassmannian, known as the projection F-norm.
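For concreteness, the projection F-norm can be evaluated directly from orthonormal bases (a small sketch; V and W are n × p matrices with orthonormal columns representing [V] and [W]):

import numpy as np

def grassmann_distance(V, W):
    # projection F-norm: d([V],[W]) = 2**-0.5 * ||V V^T - W W^T||_F
    P, Q = V @ V.T, W @ W.T
    return np.linalg.norm(P - Q, 'fro') / np.sqrt(2.0)

# Example: two lines (p = 1) in R^2
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
print(grassmann_distance(e1, e1))  # 0.0
print(grassmann_distance(e1, e2))  # 1.0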



Other metrics may be defined on Gn,p, and they result in different Riemannian geometries on the manifold. Optimization in non-Euclidean geometries is non-trivial and has been studied for a long time, see for example Edelman et al. (1999) and references therein. For instance, in the context of ICA, Amari's seminal paper (Amari, 1998) on taking into account the geometry of the search space Gl(n) yielded a considerable increase in performance and accuracy. Learning in these matrix manifolds has been reviewed in Theis (2005b) and extended in Squartini and Theis (2006).

In order to apply batch k-means to (Gn,p, d), we only had to solve the cluster update equation (1.17). It turned out that for this, no elaborate optimization was necessary; instead, a closed-form solution that only needs an eigenvalue decomposition could be found. We state this with the following theorem, proved in Gruber and Theis (2006):

Theorem 1.5.3 (Grassmann centroids). The centroid [C] ∈ Gn,p of a set of points [V1], . . . , [Vl] ∈ Gn,p according to (1.17) is spanned by p independent eigenvectors corresponding to the smallest eigenvalues of the generalized cluster correlation l^{−1} Σ_{i=1}^{l} Vi Vi⊤.

Application to nonnegative matrix factorization

Detecting clusters in multiple samples drawn from a Grassmannian is a problem arising in various applications. In Gruber and Theis (2006), we applied this to NMF in order to illustrate the feasibility of the proposed algorithm.

Consider the matrix factorization problem (1.12) with the additional non-negativity constraints S, A ≥ 0. If we assume that S spans the whole first quadrant, then X = AS is a conic hull with cone edges spanned by the columns of A. After projection onto the standard simplex, the conic hull reduces to the convex hull, and the projected, known mixture data set X lies within a convex polytope of the order given by the number of rows of S. Hence we face the problem of identifying n edges of a sampled polytope in R^{m−1}.

In two dimensions (after reduction of m = 3), this implies the task of finding the k edges of a polytope where only samples in the inside are known. We used the Quickhull algorithm (Barber et al., 1993) to construct the convex hull, thus identifying the possible edges of the polytope. However, due to finite samples, the identified polytope has far too many edges. Therefore, we applied affine Grassmann n-means clustering, with samples weighted according to their volume, to these edges in order to identify the n bounding edges; see the example in figure 1.20.

Biomedical applications of other matrix factorization methods are discussed in the next chapter. We only briefly mention Meyer-Bäse et al. (2005), where we applied NMF and related unsupervised clustering techniques for the self-organized segmentation of biomedical image time-series data, describing groups of pixels exhibiting similar properties of local signal dynamics.



(a) samples (b) QHull (c) result of Grassmann clustering

Figure 1.20: Grassmann clustering (hyperplanes, so p = n − 1) identifies the contour of samples (a) following the NMF model. Quickhull was used to find the outer edges, which are then clustered into 4 clusters (b). The dashed lines in (c) show the convex hull spanned by the mixing matrix columns.



1.6 Applications to biomedical data analysis

The above models are known to have many applications in data mining. Here we focus on biomedical data analysis. For this we review some recent applications to functional MRI, microscopy of labeled brain sections and surface electromyograms.

1.6.1 Functional MRI

Functional magnetic resonance imaging (fMRI) has been shown to be an effective imaging technique in human brain research (Ogawa et al., 1990). By the blood oxygen level dependent (BOLD) contrast, local changes in the magnetic field are coupled to activity in brain areas. These magnetic changes are measured using MRI. The high spatial and temporal resolution of fMRI, combined with its non-invasive nature, makes it an important tool for discovering functional areas in the human brain and their interactions. However, its low signal-to-noise ratio and the high number of activities in the passive brain require sophisticated analysis methods. These either (i) are based on models and regression, but require prior knowledge of the time course of the activations, or (ii) employ model-free approaches such as BSS by separating the recorded activation into different classes according to statistical specifications without prior knowledge of the activation.

The blind approach (ii) was first studied by McKeown et al. (1998): according to the principle of functional organization of the brain, they suggested that the multifocal brain areas activated by performance of a visual task should be unrelated to the brain areas whose signals are affected by artifacts of physiological nature, head movements, or scanner noise related to fMRI experiments. Every single process can be described by one or more spatially independent components, each associated with a single time course of a voxel and a component map. It is assumed that the component maps, each described by a spatial distribution of fixed values, represent overlapping, multifocal brain areas of statistically independent fMRI signals. This aspect is visualized in Figure 1.21.

In addition, they considered that the distributions of the component maps are spatially independent and in this sense uniquely specified, see section 1.2.1. McKeown et al. (1998) showed that these maps are independent if the active voxels in the maps are sparse and mostly non-overlapping. Additionally they assumed that the observed fMRI signals are the superposition of the individual component processes at each voxel. Based on these assumptions, ICA can be applied to fMRI time series to spatially localize and temporally characterize the sources of BOLD activation, and considerable research has been devoted to this area since then.

Figure 1.21: Visualization of the spatial fMRI separation model. The n-dimensional source vector is represented as component maps, which in turn are interpreted to contribute linearly, in different concentrations, to the fMRI observations at the various time points t ∈ {1, . . . , m}.

Model-based versus model-free analysis

However, the use of blind signal processing techniques for the effective analysis of fMRI data has often been questioned, and in many applications, neurologists and psychologists prefer to use the computationally simpler regression models.

In Keck et al. (2004), see chapter 19, we compared the two approaches on a sufficiently complex task of combined word perception and motor activity. The event-based experiment was part of a study to investigate the network of neurons involved in the perception of speech




and the decoding of auditory speech stimuli. One- and two-syllable words were divided into several frequency bands and then rearranged randomly to obtain a set of auditory stimuli. Only a single band was perceivable as words. During the functional imaging session these stimuli were presented pseudo-randomized to 5 subjects, according to the rules of a stochastic event-related paradigm. The task of the subjects was to press a button as soon as they were sure that they had just recognized a word in the sound presented. It was expected that in case of the single perceptive frequency band, these four types of stimuli activate different areas of the auditory system as well as the superior temporal sulcus in the left hemisphere (Specht and Reul, 2003).

The regression-based analysis using a general linear model was performed using SPM2. This was compared with components extracted using ICA, namely fastICA (Hyvärinen and Oja, 1997). The results are illustrated in figure 1.22, see Keck et al. (2004). Indeed, one independent component represented a network of three simultaneously active areas in the inferior frontal gyrus, which was previously proposed to be a center for the perception of speech (Specht and Reul, 2003). Altogether, we were able to show that ICA detects hidden or suspected links and activity in the brain that cannot be found using the classical, model-based approach.

(a) general linear model analysis (b) one independent component

Figure 1.22: Comparison of model-based and model-free analysis of a word-perception fMRI experiment. (a) illustrates the result of a regression-based analysis, which shows activity mostly in the auditory cortex. (b) is a single component extracted by ICA, which corresponds to a word-detection network.

Spatial and spatiotemporal separation

As a short example of spatial and spatiotemporal BSS, we present the analysis of an experiment using visual stimuli. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans each were acquired with 5 periods of rest and 5 photic stimulation periods,




with a resolution of 3 × 3 × 4 mm. A single 2D slice is analyzed, which is oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point and a dark background.

At first, we show an example result using spatial ICA. We performed a dimension reduction using PCA to n = 6 dimensions, which still retained 99.77% of the eigenvalue sum. Then, we applied HessianICA with K = 100 Hessians evaluated at randomly chosen samples, see section 1.2.1 and Theis (2004a). The resulting 6-dimensional sources are interpreted as the 6 component maps that encode the data set. The columns of the mixing matrix contain the relative contribution of each component map to the mixtures at the given time point, so they represent the components' time courses. The maps together with the corresponding time courses are shown in figure 1.23. A single highly task-related component (#4) is found, which after a shift of 4 s has a high crosscorrelation with the block-based stimulus (cc = 0.89). Other component maps encode artifacts, e.g. in the interstitial brain region, and other background activity.
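Task-relatedness of a component can be quantified exactly as described; the following sketch (illustration only, with made-up block lengths and shift rather than the actual acquisition parameters) correlates a component time course with a shifted boxcar stimulus reference:

import numpy as np

def stimulus_correlation(time_course, stimulus, shift=0):
    # Pearson correlation between a component time course and a stimulus reference
    # circularly shifted by 'shift' scans (both 1-d arrays of equal length).
    s = np.roll(stimulus, shift)
    tc = (time_course - time_course.mean()) / time_course.std()
    s = (s - s.mean()) / s.std()
    return np.mean(tc * s)

# Example reference: 100 scans, alternating blocks of 10 scans rest / 10 scans stimulation;
# a shift of e.g. 2 scans could model a ~4 s hemodynamic delay if the repetition time were 2 s (assumed).
stimulus = np.tile(np.concatenate([np.zeros(10), np.ones(10)]), 5)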

We then tested the usefulness of taking into account additional information contained in the data set, such as the spatiotemporal dependencies. For this, we analyzed the data using spatiotemporal BSS as described in section 1.3.2 and Theis et al. (2005b), Theis et al. (2007b), see chapter 9. In order to make things more challenging, only 4 components were to be extracted from the data, with preprocessing either by PCA only or by the slightly more general singular value decomposition, a necessary preprocessing step for spatiotemporal BSS. We based the algorithms on joint diagonalization, for which K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation, weighted equally (α = 0.5). Although the data was reduced to only 4 components, stSOBI was able to extract the stimulus component very well, with an equally high crosscorrelation of cc = 0.89. We compared this result with some established algorithms for blind fMRI analysis by discussing the single component that is maximally autocrosscorrelated with the known stimulus task, see figure 1.24.



The corresponding absolute crosscorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA). The observation that neither Stone's spatiotemporal ICA algorithm (Stone et al., 2002) nor the popular fastICA algorithm (Hyvärinen and Oja, 1997) could recover the sources showed that spatiotemporal models can use the additional data structure efficiently, in contrast to spatial-only models, and that the parameter-free joint-diagonalization-based algorithms are robust against convergence issues.

(a) recovered component maps (b) time courses, with crosscorrelations cc = −0.14, −0.13, −0.22, 0.89, 0.12, 0.09 for components 1 to 6

Figure 1.23: Extracted ICA components of fMRI recordings. (a) shows the spatial and (b) the corresponding temporal activation patterns, where in (b) the grey bars indicate stimulus activity. Component 4 contains the (independent) visual task, active in the visual cortex (white points in (a)). It correlates well with the stimulus activity, see (b).

Figure 1.24: Comparison of the recovered component that is maximally autocrosscorrelated with the stimulus task (top) for various BSS algorithms (from top to bottom: stimulus, stNSS, stSOBI (1D), stICA after stSOBI, stICA, fastICA), after dimension reduction to 4 components.

Other analysis models

Before continuing to other biomedical applications, we briefly review other recent work of the author in this field.

In Karvanen and Theis (2004), we proposed the concept of window ICA for the analysis of fMRI data. The basic idea was to apply spatial ICA in sliding time windows; this approach avoided the problems related to the high number of signals and the resulting issues with dimension reduction methods, and moreover gave some insight into small changes happening during the experiment, which are otherwise not encoded in changes of the component maps. We demonstrated the usefulness of the proposed approach in an experiment where a subject listened to auditory stimuli consisting of sinusoidal sounds (beeps) and words in varying proportions. Here, the window ICA algorithm was able to find the different auditory activation patterns related to the beeps and to the words, respectively.
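A minimal sketch of the sliding-window idea follows. It only illustrates the windowing scaffold; the inner spatial ICA is delegated to scikit-learn's FastICA as a stand-in, which is an assumption of this sketch and not the decomposition used in Karvanen and Theis (2004).

```python
import numpy as np
from sklearn.decomposition import FastICA  # stand-in ICA, an assumption of this sketch

def window_ica(data, window_len, step, n_components):
    """Spatial ICA in sliding time windows.

    data: (n_timepoints, n_voxels) fMRI matrix.  Each window of consecutive
    scans is decomposed separately, so slow changes over the experiment show
    up as changes between the per-window component maps.
    """
    results = []
    for start in range(0, data.shape[0] - window_len + 1, step):
        X = data[start:start + window_len]            # (window_len, n_voxels)
        ica = FastICA(n_components=n_components, max_iter=500)
        maps = ica.fit_transform(X.T)                  # voxels act as samples
        results.append((maps.T, ica.mixing_))          # spatial maps, time courses
    return results
```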

An interesting model for activity maps in the brain is given by sparse coding; after all, the component maps are always implicitly assumed to show only strongly focused regions of activation. Hence, we asked whether the sparse models proposed in section 1.4.1 could be applied to fMRI data.



[Figure 1.24: the rows show, from top to bottom, the stimulus time course and the maximally stimulus-correlated component recovered by stNSS, stSOBI (1D), stICA after stSOBI, stICA and fastICA.]

Figure 1.24: Comparison of the recovered component that is maximally autocrosscorrelated with the stimulus task (top) for various BSS algorithms, after dimension reduction to 4 components.


We showed a successful application to the above visual-stimulus experiment in Georgiev et al. (2005a). Again, we were able to show that with only five components, the stimulus-related activity in the visual cortex could be nicely reconstructed.

A similar question of model generalization was posed in Theis and Tanaka (2005). There we proposed to study the postnonlinear mixing model (1.3) in the context of fMRI data. We derived an algorithm for blindly estimating the sensor characteristics of such a multi-sensor network. From the observed sensor outputs, the nonlinearities are recovered using a well-known Gaussianization procedure. The underlying sources are then reconstructed using spatial decorrelation as proposed by Ziehe et al. (2003a). Application of this robust algorithm to data sets acquired by fMRI led to the detection of a distinctive bump of the BOLD effect at larger activations, which may be interpreted as an inherent BOLD-related nonlinearity.
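The Gaussianization step can be sketched as a rank-based transform of each observed sensor signal to a standard-normal marginal; this removes any invertible, monotone sensor nonlinearity up to an affine map. The snippet below is only an illustration of that idea, not the exact procedure of Theis and Tanaka (2005) or Ziehe et al. (2003a).

```python
import numpy as np
from scipy.stats import norm

def gaussianize(x):
    """Map a 1-d signal to a standard-normal marginal via its empirical ranks."""
    ranks = np.argsort(np.argsort(x))
    u = (ranks + 1.0) / (len(x) + 1.0)     # empirical CDF values in (0, 1)
    return norm.ppf(u)

# Applied row-wise to the postnonlinear mixtures X (sensors x samples); the
# linearized data can then be passed to a second-order separation step, e.g.
# X_lin = np.vstack([gaussianize(row) for row in X])
```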

In Meyer-Bäse et al. (2004a,b), we discussed the concept of dependent component analysis, see section 1.3, in the context of fMRI data analysis. We detected dependencies by finding clusters of dependent components; algorithmically, we compared two of the first such algorithms, namely tree-dependent ICA (Bach and Jordan, 2003a) and topographic ICA (Hyvärinen et al., 2001a). For the fMRI data, a comparative quantitative evaluation between the two methods was performed. We observed that topographic ICA outperforms other ordinary ICA methods and tree-dependent ICA when extracting only few independent components. This resulted in a postprocessing algorithm based on clustering of ICA components resulting from different source component dimensions in Keck et al. (2005).

The above algorithms have been included in our MFBOX (Model-free Toolbox) package



Figure 1.25: Directional neural networks, panels (a) training data set, (b) directional normalization and (c) classification result. As training data set, a few labeled A's from a test image were used (a). (b) shows five rotated A's together with their image patches normalized using the main principal component direction below. This small-scale classifier reproduces the A-locations in a test image sufficiently well (c), even though the training data set was small and different fonts, noise etc. had been added.

(Gruber et al., 2007), a Matlab toolbox for data-driven analysis of biomedical data, which may also be used as an SPM plugin. Its main focus lies on the analysis of functional magnetic resonance imaging (fMRI) data sets with various model-free or data-driven techniques. The toolbox includes BSS algorithms based on various source models including ICA, spatiotemporal ICA, autodecorrelation and NMF. They can all be easily combined with higher-level analysis methods such as reliability analysis using projective clustering of the components, sliding time window analysis or hierarchical decomposition.

1.6.2 Image segmentation and cell counting

A supervised interpretation of the initial data analysis model from section 1.1 leads to a classification problem: given a set of input-output samples, find a map that interpolates these samples and hopefully generalizes well to new input samples. Such a map serves as a classifier if the output consists of discrete labels. Classification based on support vector machines (Boser et al., 1992, Burges, 1998, Schölkopf and Smola, 2002) or neural networks (Haykin, 1994) has prominent applications in biomedical data analysis. Here we review an application to biomedical image processing from Theis et al. (2004c), see chapter 21.

While many different tissues of the mammalian organism are capable of renewing themselves after damage, it was long believed that the nervous system is not able to regenerate at all. Nevertheless, the first data showing that new nerve cells can be generated in the adult brain were presented in the 1960s (Altman and Das, 1965), demonstrating new neurons in the brain of adult rats. In order to quantify neurogenesis in animals, newborn cells are labeled with specific markers such as BrdU; in brain sections these can later be analyzed and counted using a confocal microscope. However, so far this counting process had been performed manually.

In Theis et al. (2004b,c), we proposed an algorithm called ZANE to automatically identify cell components in digitized section images. First, a so-called cell classifier was trained with cell and non-cell patches using single- and multi-layer perceptrons as well as unsupervised independent component analysis with correlation comparison. In order to account for a larger variety of cell shapes, a directional normalization approach was proposed. The cell classifier can then be used in an arbitrary number of sections by scanning each section and choosing maxima of the classifier output as cell center locations. This is illustrated using a toy example in figure 1.25. A flow-chart with the basic segmentation setup is shown in figure 1.26.
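Schematically, the counting stage slides the trained cell classifier over the section and keeps local maxima of its output as cell centres. The sketch below uses a placeholder classify_patch callable; it is a hypothetical stand-in for the trained perceptron/ICA classifier and is not the ZANE code itself.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_cell_centers(image, classify_patch, patch=15, threshold=0.5):
    """Scan an image with a patch classifier; local maxima of the resulting
    score map above the threshold are returned as candidate cell centres."""
    h, w = image.shape
    r = patch // 2
    score = np.zeros((h, w))
    for i in range(r, h - r):
        for j in range(r, w - r):
            score[i, j] = classify_patch(image[i - r:i + r + 1, j - r:j + r + 1])
    peaks = (score == maximum_filter(score, size=patch)) & (score > threshold)
    return np.argwhere(peaks)              # (row, col) coordinates of detected cells
```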

ZANE was successfully applied to measure neurogenesis in adult rodent brain sections, where we showed that the proliferation of neurons is substantially stronger (340%) in the dentate gyrus of an epileptic mouse than in a control group. When comparing the counting result with manual counts, the mean ZANE classification rate was 90% of all (manually detected) cells; this lies within the error bounds of a perfect count, since manual counts by different experts varied by roughly 10% themselves (Theis et al., 2004b).

1.6.3 Surface electromyograms

In sections 1.2 and 1.4, we presented blind data factorization models based on statistical independence, explicit sparseness and nonnegativity. It is known that all three approaches tend to induce a more meaningful, often sparser representation of the multivariate data set. However, developing explicit applications, and even more so performing meaningful comparisons of such methods, is still of considerable interest.

Figure 1.26: ZANE image segmentation.

In Theis and García (2006), see chapter 20, we analyzed and compared the above models, not from a theoretical point of view but rather based on a real-world example, namely the analysis of surface electromyogram (sEMG) data sets. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle (Basmajian and Luca, 1985). In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use sEMG, which is measured using non-invasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and the overlap of several source signals. Direct application of the ICA model to real-world noisy sEMG turns out to be problematic (García et al., 2004), and it is not yet known whether the assumption of independent sources holds well in the setting of sEMG.



[Figure 1.27: panels (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF*, (f) SCA and (g) s-EMG, each showing the component given by each method [a.u.] over time (0-500 ms).]

Figure 1.27: Recovered sources after unmixing the sEMG data; (a-f) show the results obtained using the different methods and (g) the original most-active sensor signal.

Our approach was therefore to apply and validate sparse BSS methods based on various model assumptions on sEMG signals. When applied to artificial signals, we found noticeable differences in algorithm performance depending on the source assumptions. In particular, sparse nonnegative matrix factorization outperforms the other methods with regard to increasing additive noise. However, in the case of real sEMG signals we showed that, despite the fundamental differences in the various models, the methods yield rather similar results and can successfully separate the source signals, see figure 1.27. This was due to the fact that the different sparseness assumptions are only approximately fulfilled, apparently forcing the algorithms to reach similar results, but from different initial conditions and using different optimization criteria.
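As an illustration of one of the compared model classes, the following is a minimal sparse NMF with multiplicative updates and an l1 penalty on the sources. It assumes a nonnegative data matrix and is not the specific sNMF variant evaluated in Theis and García (2006).

```python
import numpy as np

def sparse_nmf(V, k, n_iter=500, lam=0.1, eps=1e-9):
    """Sparse NMF sketch: V ~ W @ H with W, H >= 0 plus a penalty lam * sum(H).

    Multiplicative updates for the Euclidean cost; V must be nonnegative.
    """
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)   # l1 term enters the denominator
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        W /= W.sum(axis=0, keepdims=True) + eps      # fix scaling so sparseness acts on H
    return W, H
```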

The resulting sparse signal components can now be used for further analysis and for artifact removal. A similar analysis using spectral correlations has been employed in (Böhm et al., 2006, Stadlthanner et al., 2006b) to remove the water artifact from multidimensional proton NMR spectra of biomolecules dissolved in aqueous solutions.



1.7 Outlook

We considered the (mostly linear) factorization problem

x(t) = As(t) + n(t). (1.18)

Often, the noise n(t) was not explicitly modeled but included in s(t). We assumed that x(t) is known, as well as some additional information about the system itself. Depending on the assumptions, different problems and algorithmic solutions can be derived to solve such inverse problems:

• statistically independent s(t): we proved that in this case ICA can solve (1.18) uniquely, see section 1.2; this holds even in some nonlinear generalizations.

• approximate spatiotemporal independence in A and s(t): we provided a very robust, simple algorithm for spatiotemporal separation based on joint diagonalization, see section 1.3.2.

• statistical independence between groups of sources s(t): again, uniqueness except for transformations within the blocks can be proven, and the constructive proof results in a simple update algorithm as in the linear case, see section 1.3.3.

• sparseness of the sources s(t): in the context of SCA, we relaxed the assumption of single-source sparseness to multi-dimensional sparseness constraints, for which we were still able to prove uniqueness and to derive an algorithm, see section 1.4.1.

• non-negativity of A and s(t): a sparse extension of the plain NMF model was analyzed in terms of existence and uniqueness, and a generalization for lp-sparse sources has been proposed in order to better approximate combinatorial sparseness in the l0-sense, see section 1.4.2.

• Gaussian or noisy components in s(t): we proposed various denoising and dimension reduction schemes, and proved uniqueness of a non-Gaussian signal subspace in section 1.5.

Finally, in section 1.6, we applied some of the above methods to biomedical data sets recorded by functional MRI, surface EMG and optical microscopy.

Other work

In this summary, data analysis was discussed from the viewpoint of data factorization models such as (1.18). Before concluding, some other works of the author in related areas should be mentioned.

Previously discussed primarily in mathematics, optimization on Lie groups has become an important topic in the field of machine learning and neural networks, since many cost functions are defined on parameter spaces that more naturally obey a non-Euclidean geometry. Consider for example Amari's natural gradient (Amari, 1998): simply by taking into account the geometry



[Figure 1.28: schematic splitting the data into noise and a signal part, with the signal further decomposed into independent structured models.]

Figure 1.28: Future analysis framework.

of the search space Gl(n) of all invertible (n × n)-matrices, Amari was able to considerably improve search performance and accuracy, thus providing an equivariant ICA algorithm. In Theis (2005b), we gave an overview of various gradient calculations on Gl(n) and presented generalizations to over- and undercomplete cases, realized by a semidirect product. These ideas were used in Squartini and Theis (2006), where we defined some alternative Riemannian metrics on the parameter space of non-square matrices, corresponding to various translations defined therein. Such metrics allowed us to derive novel, efficient learning rules for two ICA-based algorithms for overdetermined blind source separation.
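For reference, in its simplest square, maximum-likelihood ICA form the natural-gradient idea reads ΔW ∝ (I − φ(y)yᵀ)W, i.e. the ordinary gradient right-multiplied by WᵀW. The sketch below uses a tanh score (a super-Gaussian source assumption) and does not cover the over- and undercomplete generalizations of Theis (2005b) or Squartini and Theis (2006).

```python
import numpy as np

def natural_gradient_ica(X, n_iter=200, lr=0.05):
    """Equivariant ICA on Gl(n) via Amari's natural gradient.

    X: (n, T) zero-mean mixtures; the tanh score assumes super-Gaussian sources.
    """
    n, T = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)
        W += lr * (np.eye(n) - phi @ Y.T / T) @ W   # natural-gradient update
    return W
```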

In Meyer-Baese et al. (2006), we studied optimization and statistical learning on a neural network that self-organizes to solve a BSS problem. The resulting online learning solution used the nonstationarity of the sources to achieve the separation. For this, we divided the problem into two learning problems, one of which is solved by an anti-Hebbian and the other by a Hebbian learning process. The stability of related networks is discussed in Meyer-Bäse et al. (2006).
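Purely to illustrate the two ingredients, the following generic online step combines a Hebbian update of the feedforward weights with an anti-Hebbian update of lateral decorrelating weights; it is a textbook-style sketch, not the network architecture or learning rules of Meyer-Baese et al. (2006).

```python
import numpy as np

def hebb_anti_hebb_step(x, W, M, lr_w=1e-3, lr_m=1e-3):
    """One generic online step: Hebbian feedforward W, anti-Hebbian lateral M.

    x: input sample (n,); W: (k, n); M: (k, k) lateral weights, zero diagonal.
    """
    y = W @ x
    for _ in range(5):                       # settle the recurrent lateral dynamics
        y = W @ x - M @ y
    W += lr_w * (np.outer(y, x) - (y ** 2)[:, None] * W)   # Hebbian with Oja-style decay
    M += lr_m * np.outer(y, y)                              # anti-Hebbian decorrelation
    np.fill_diagonal(M, 0.0)
    return y, W, M
```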

A major application of unsupervised learning in neuroscience lies in the analysis of functional MRI data sets. A problem we faced during the course of our analyses was the efficient storage and retrieval of many large, high-dimensional spatiotemporal data sets. Although some dimension reduction or region-of-interest selection may be performed beforehand to reduce the sample size (Keck et al., 2006), we wanted to compress the data as well as possible without losing information. For this, we proposed a novel lossless compression method named FTTcoder (Theis and Tanaka, 2005) for the compression of images and 3d sequences collected during a typical fMRI experiment. The large data sets involved in this popular medical application necessitate novel compression algorithms that take into account the structure of the recorded data as well as the experimental conditions, which include the 4d recordings, the stimulus protocol used and marked regions of interest (ROI). For this, we used simple temporal transformations and entropy coding with context modeling to encode the 4d scans after preprocessing with the ROI masking. The compression algorithm as well as the fMRI toolbox and the algorithms for spatiotemporal and subspace BSS are all available online at http://fabian.theis.name.



Future work

In future work, the goal is to employ the above algorithms as a preprocessing step in the modeling of biological processes, for example in quantitative systems biology. The idea is straightforward, see figure 1.28. Given multivariate data that encode, for example, biological parameters of multiple, dependent experiments, we want to find regularized simple models that can explain the data as well as predict future quantitative experiments. In a first step, denoising is to be performed, especially the removal of Gaussian subspaces that do not contain meaningful data apart from their covariance structure. The signal subspace is then to be further processed using techniques from dependent component analysis and independent subspace analysis. The resulting subspaces themselves are analyzed with network analysis techniques, for example based on Gaussian graphical models and corresponding high-dimensional regularizations (Dobra et al., 2004, Schäfer and Strimmer, 2005). The derived structures are then expected to serve as models, which may provide quantitative descriptions of the underlying processes. Finally, we hope to drive further experiments by the model predictions.

This well-known coupling of experimentalists and theoreticians is characteristic of systems biology in particular, and of modern interdisciplinary research in general. With the past work, we hope to have made some steps in this direction, and recent work in the field of microarray analysis employing such techniques is already promising (Lutter et al., 2006, Schachtner et al., 2007, Stadlthanner et al., 2006a). In theoretical terms, the long-term goal is to step from multivariate analysis to network analysis, just as we have observed the field of signal processing expand from univariate to multivariate models in the past few decades.


Part II

Papers


Chapter 2

Neural Computation 16:1827-1850, 2004

Paper: F.J. Theis. A new concept for separability problems in blind source separation. Neural Computation, 16:1827-1850, 2004

Reference: (Theis, 2004a)

Summary in section 1.2.1



LETTER Communicated by Aapo Hyvärinen

A New Concept for Separability Problems in Blind Source Separation

Fabian J. Theis
fabian@theis.name
Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany

The goal of blind source separation (BSS) lies in recovering the original independent sources of a mixed random vector without knowing the mixing structure. A key ingredient for performing BSS successfully is to know the indeterminacies of the problem, that is, to know how the separating model relates to the original mixing model (separability). For linear BSS, Comon (1994) showed using the Darmois-Skitovitch theorem that the linear mixing matrix can be found except for permutation and scaling. In this work, a much simpler, direct proof for linear separability is given. The idea is based on the fact that a random vector is independent if and only if the Hessian of its logarithmic density (resp. characteristic function) is diagonal everywhere. This property is then exploited to propose a new algorithm for performing BSS. Furthermore, first ideas of how to generalize separability results based on Hessian diagonalization to more complicated nonlinear models are studied in the setting of postnonlinear BSS.

1 Introduction

In independent component analysis (ICA), one tries to find statistically independent data within a given random vector. An application of ICA lies in blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. The advantage of applying ICA algorithms to BSS problems in contrast to correlation-based algorithms is that ICA tries to make the output signals as independent as possible by also including higher-order statistics.

Since the introduction of independent component analysis by Hérault and Jutten (1986), various algorithms have been proposed to solve the BSS problem (Comon, 1994; Bell & Sejnowski, 1995; Hyvärinen & Oja, 1997; Theis, Jung, Puntonet, & Lang, 2002). Good textbook-level introductions to ICA are given in Hyvärinen, Karhunen, and Oja (2001) and Cichocki and Amari (2002).

Separability of linear BSS states that under weak conditions on the sources, the mixing matrix is determined uniquely by the mixtures except for permutation and scaling, as shown by Comon (1994) using the Darmois-Skitovitch theorem. We propose a direct proof based on the concept of

Neural Computation 16, 1827-1850 (2004) © 2004 Massachusetts Institute of Technology



separated functions, that is, functions that can be split into a product of one-dimensional functions (see definition 1). If the function is positive, this is equivalent to the fact that its logarithm has a diagonal Hessian everywhere (see lemma 1 and theorem 1). A similar lemma has been shown by Lin (1998) for what he calls block diagonal Hessians. However, he omits discussion of the separatedness of densities with zeros, which plays a minor role for the separation algorithm he is interested in but is important for deriving separability. Using separatedness of the density, respectively, the characteristic function (Fourier transformation), of the random vector, we can then show separability directly (in two slightly different settings, for which we provide a common framework). Based on this result, we propose an algorithm for linear BSS by diagonalizing the Hessian of the logarithmic density. We recently found that this algorithm has already been proposed (Lin, 1998), but without considering the necessary assumptions for successful algorithm application. Here we give precise conditions for when to apply this algorithm (see theorem 3) and show that points satisfying these conditions can indeed be found if the sources contain at most one gaussian component (see lemma 5). Lin uses a discrete approximation of the derivative operator to approximate the Hessian. We suggest using kernel-based density estimation, which can be directly differentiated. A similar algorithm based on Hessian diagonalization has been proposed by Yeredor (2000) using the characteristic function of a random vector. However, the characteristic function is complex valued, and additional care has to be taken when applying a complex logarithm. Basically, this is well defined locally only at nonzeros. In algorithmic terms, the characteristic function can be easily approximated by samples (which is equivalent to our kernel-based density approximation using gaussians before Fourier transformation). Yeredor suggests joint diagonalization of the Hessian of the logarithmic characteristic function (which is problematic because of the nonuniqueness of the complex logarithm) evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we use a combined energy function based on the previously defined separator, which also takes into account global information but does not have the drawback of being singular at zeros of the density, respectively, characteristic function. Thus, the algorithmic part of this article can be seen as a general framework for the algorithms proposed by Lin (1998) and Yeredor (2000).

Section 2 introduces separated functions, giving local characterizations of the densities of independent random vectors. Section 3 then introduces the linear BSS model and states the well-known separability result. After giving an easy and short proof in two dimensions with positive densities, we provide a characterization of gaussians in terms of a differential equation and provide the general proof. The BSS algorithm based on finding separated densities is proposed and studied in section 4. We finish with a generalization of the separability to the postnonlinear mixture case in section 5.



2 Separated and Linearly Separated Functions

Definition 1. A function f: R^n → C is said to be separated, respectively, linearly separated, if there exist one-dimensional functions g1, ..., gn: R → C such that f(x) = g1(x1) ··· gn(xn), respectively f(x) = g1(x1) + ··· + gn(xn), for all x ∈ R^n.

Note that the functions gi are uniquely determined by f up to a scalar factor, respectively, an additive constant. If f is linearly separated, then exp f is separated. Obviously the density function of an independent random vector is separated. For brevity, we often use the tensor product and write f ≡ g1 ⊗ ··· ⊗ gn for separated f, where for any functions h, k defined on a set U, h ≡ k if h(x) = k(x) for all x ∈ U.

Separatedness can also be defined on any open parallelepiped (a1, b1) × ··· × (an, bn) ⊂ R^n in the obvious way. We say that f is locally separated at x ∈ R^n if there exists an open parallelepiped U such that x ∈ U and f|U is separated. If f is separated, then f is obviously everywhere locally separated. The converse, however, does not necessarily hold, as shown in Figure 1.

[Figure 1: surface plot over roughly [−3, 4] × [−3, 4] of a smoothed version of pS.]

Figure 1: Density of a random vector S with a locally but not globally separated density. Here, pS := c χ_{[−2,2]×[−2,0] ∪ [0,2]×[1,3]}, where χU denotes the function that is 1 on U and 0 everywhere else. Obviously, pS is not separated globally, but it is separated if restricted to squares of side length < 1. Plotted is a smoothed version of pS.



The function f is said to be positive if f is real and f(x) > 0 for all x ∈ R^n, and nonnegative if f is real and f(x) ≥ 0 for all x ∈ R^n. A positive function f is separated if and only if ln f is linearly separated.

Let C^m(U, V) be the ring of all m-times continuously differentiable functions from U ⊂ R^n to V ⊂ C, U open. For a C^m-function f, we write ∂i1 ··· ∂im f := ∂^m f / ∂xi1 ··· ∂xim for the m-fold partial derivatives. If f ∈ C^2(R^n, C), denote by Hf(x) := (∂i∂j f(x))i,j=1,...,n the symmetric (n × n) Hessian matrix of f at x ∈ R^n.

Linearly separated functions can be classified using their Hessian (if it exists):

Lemma 1. A function f ∈ C^2(R^n, C) is linearly separated if and only if Hf(x) is diagonal for all x ∈ R^n.

A similar lemma for block diagonal Hessians has been shown by Lin (1998).

Proof. If f is linearly separated, its Hessian is obviously diagonal everywhere by definition.

Assume the converse. We prove that f is separated by induction over the dimension n. For n = 1, the claim is trivial. Now assume that we have shown the lemma for n − 1. By the induction assumption, f(x1, ..., xn−1, 0) is linearly separated, so

f(x1, ..., xn−1, 0) = g1(x1) + ··· + gn−1(xn−1)

for all xi ∈ R and some functions gi on R. Note that gi ∈ C^2(R, C).

Define a function h: R → C by h(y) := ∂n f(x1, ..., xn−1, y), y ∈ R, for fixed x1, ..., xn−1 ∈ R. Note that h is independent of the choice of the xi, because ∂n∂i f ≡ ∂i∂n f is zero everywhere, so xi ↦ ∂n f(x1, ..., xn−1, y) is constant for fixed xj, y ∈ R, j ≠ i. By definition, h ∈ C^1(R, C), so h is integrable on compact intervals. Define k: R → C by k(y) := ∫_0^y h. Then

f(x1, ..., xn) = g1(x1) + ··· + gn−1(xn−1) + k(xn) + c,

where c ∈ C is a constant, because both functions have the same derivative and R^n is connected. If we set gn := k + c, the claim follows.

This lemma also holds for functions defined on any open parallelepiped (a1, b1) × ··· × (an, bn) ⊂ R^n. Hence, an arbitrary real-valued C^2-function f is locally separated at x with f(x) ≠ 0 if and only if the Hessian of ln |f| is locally diagonal.

For a positive function f, the Hessian of its logarithm is diagonal everywhere if it is separated, and it is easy to see that for positive f, the converse



also holds globally (see theorem 1(ii)). In this case, we have for i ≠ j,

0 ≡ ∂i∂j ln f ≡ ( f ∂i∂j f − (∂i f)(∂j f) ) / f²,

so f is separated if and only if

f ∂i∂j f ≡ (∂i f)(∂j f)

for i ≠ j or even i < j. This motivates the following definition:

Definition 2. For i ≠ j, the operator

Rij : C^2(R^n, C) → C^0(R^n, C),   f ↦ Rij[f] := f ∂i∂j f − (∂i f)(∂j f)

is called the ij-separator.

Theorem 1. Let f ∈ C^2(R^n, C).

i. If f is separated, then Rij[f] ≡ 0 for i ≠ j or, equivalently,

f ∂i∂j f ≡ (∂i f)(∂j f) (2.1)

holds for i ≠ j.

ii. If f is positive and Rij[f] ≡ 0 holds for all i ≠ j, then f is separated.

If f is assumed to be only nonnegative, then f is locally separated but not necessarily globally separated (if the support of f has more than one component). See Figure 1 for an example of a nonseparated density with R12[f] ≡ 0.

Proof of Theorem 1. i. If f is separated, then f(x) = g1(x1) ··· gn(xn), or for short f ≡ g1 ⊗ ··· ⊗ gn, so

∂i f ≡ g1 ⊗ ··· ⊗ gi−1 ⊗ gi′ ⊗ gi+1 ⊗ ··· ⊗ gn

and

∂i∂j f ≡ g1 ⊗ ··· ⊗ gi−1 ⊗ gi′ ⊗ gi+1 ⊗ ··· ⊗ gj−1 ⊗ gj′ ⊗ gj+1 ⊗ ··· ⊗ gn

for i < j. Hence equation 2.1 holds.

ii. Now assume the converse and let f be positive. Then according to the remarks after lemma 1, Hln f(x) is diagonal everywhere, so lemma 1 shows that ln f is linearly separated; hence, f is separated.



Some trivial properties of the separator Rij are listed in the next lemma:

Lemma 2. Let f, g ∈ C^2(R^n, C), i ≠ j and α ∈ C. Then

Rij[αf] = α² Rij[f]

and

Rij[f + g] = Rij[f] + Rij[g] + f ∂i∂j g + g ∂i∂j f − (∂i f)(∂j g) − (∂i g)(∂j f).
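As a quick numerical illustration of the separator (a sketch added alongside the text, not part of the original letter), one can check with finite differences that Rij vanishes for a separated density and is nonzero for a correlated Gaussian:

```python
import numpy as np

def separator_R01(f, x, h=1e-4):
    """Numerically evaluate R_01[f] = f * d0 d1 f - (d0 f)(d1 f) at the point x."""
    def d(g, i, p):
        e = np.zeros_like(p); e[i] = h
        return (g(p + e) - g(p - e)) / (2 * h)
    return f(x) * d(lambda p: d(f, 1, p), 0, x) - d(f, 0, x) * d(f, 1, x)

sep = lambda p: np.exp(-0.5 * (p[0]**2 + p[1]**2))                   # separated density
dep = lambda p: np.exp(-0.5 * (p[0]**2 + p[1]**2) - 0.8*p[0]*p[1])   # correlated Gaussian
x0 = np.array([0.3, -0.7])
print(separator_R01(sep, x0))   # ~0 up to discretization error
print(separator_R01(dep, x0))   # clearly nonzero
```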

3 Separability of Linear BSS

Consider the noiseless linear instantaneous BSS model with as many sources as sensors:

X = AS, (3.1)

with an independent n-dimensional random vector S and A ∈ Gl(n). Here, Gl(n) denotes the general linear group of R^n, that is, the group of all invertible (n × n)-matrices.

The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to scaling and permutation, because for a scaling matrix L and permutation matrix P,

X = ALPP⁻¹L⁻¹S,

and P⁻¹L⁻¹S is also independent. Here, an invertible matrix L ∈ Gl(n) is said to be a scaling matrix if it is diagonal. We say two matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n). Note that PL = L′P for some scaling matrix L′ ∈ Gl(n), so the order of the permutation and the scaling matrix does not play a role for equivalence. Furthermore, if B ∈ Gl(n) with B ∼ I, then also B⁻¹ ∼ I, and, more generally, if BC ∼ A, then C ∼ B⁻¹A. According to the above, solutions of linear BSS are equivalent. We will show that under mild assumptions on S, there are no further indeterminacies of linear BSS.

S is said to have a gaussian component if one of the Si is a one-dimensional gaussian, that is, pSi(x) = d exp(−ax² + bx + c) with a, b, c, d ∈ R, a > 0, and S has a deterministic component if one Si is deterministic, that is, constant.

Theorem 2 (Separability of linear BSS). Let A ∈ Gl(n) and S be an independent random vector. Assume one of the following:

i. S has at most one gaussian or deterministic component, and the covariance of S exists.



ii. S has no gaussian component, and its density pS exists and is twice continuously differentiable.

Then if AS is again independent, A is equivalent to the identity.

So A is the product of a scaling and a permutation matrix. The important part of this theorem is assumption i, which has been used to show separability by Comon (1994) and extended by Eriksson and Koivunen (2003) based on the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953). Using this theorem, the second part can be easily shown without C²-densities.

Theorem 2 indeed proves separability of the linear BSS model, because if X = AS and W is a demixing matrix such that WX is independent, then WA ∼ I, so W⁻¹ ∼ A as desired.

We will give a much easier proof without having to use the Darmois-Skitovitch theorem in the following sections.

3.1 Two-Dimensional Positive Density Case. For illustrative purposes we will first prove separability for a two-dimensional random vector S with positive density pS ∈ C²(R², R). Let A ∈ Gl(2). It is enough to show that if S and AS are independent, then either A ∼ I or S is gaussian.

S is assumed to be independent, so its density factorizes:

pS(s) = g1(s1) g2(s2)

for s ∈ R². First, note that the density of AS is given by

pAS(x) = |det A|⁻¹ pS(A⁻¹x) = c g1(b11x1 + b12x2) g2(b21x1 + b22x2)

for x ∈ R², c ≠ 0 fixed. Here, B = (bij) = A⁻¹. AS is also assumed to be independent, so pAS(x) is separated.

pS was assumed to be positive; then so is pAS. Hence, ln pAS(x) is linearly separated, so

∂1∂2 ln pAS(x) = b11 b12 h1″(b11x1 + b12x2) + b21 b22 h2″(b21x1 + b22x2) = 0

for all x ∈ R², where hi := ln gi ∈ C²(R, R). By setting y := Bx, we therefore have

b11 b12 h1″(y1) + b21 b22 h2″(y2) = 0 (3.2)

for all y ∈ R², because B is invertible.

Now, if A (and therefore also B) is equivalent to the identity, then equation 3.2 holds. If not, then A, and hence also B, have at least three nonzero entries. By equation 3.2 the fourth entry has to be nonzero, because the



hi″ are not zero (otherwise gi(yi) = exp(a yi + b), which is not integrable). Furthermore,

b11 b12 h1″(y1) = −b21 b22 h2″(y2)

for all y ∈ R², so the hi″ are constant, say hi″ ≡ ci, and ci ≠ 0, as noted above. Therefore, the hi are polynomials of degree 2, and the gi = exp hi are gaussians (ci < 0 because of the integrability of the gi).

3.2 Characterization of Gaussians. In this section, we show that among all densities, respectively, characteristic functions, the gaussians satisfy a special differential equation.

Lemma 3. Let f ∈ C²(R, C) and a ∈ C with

a f² − f f″ + (f′)² ≡ 0. (3.3)

Then either f ≡ 0 or f(x) = exp( (a/2) x² + bx + c ), x ∈ R, with constants b, c ∈ C.

Proof. Assume f ≢ 0. Let x0 ∈ R with f(x0) ≠ 0. Then there exists a nonempty interval U := (r, s) containing x0 such that a complex logarithm log is defined on f(U). Set g := log f|U. Substituting exp g for f in equation 3.3 yields

a exp(2g) − exp(g)(g″ + (g′)²) exp(g) + (g′)² exp(2g) ≡ 0,

and therefore g″ ≡ a. Hence, g is a polynomial of degree ≤ 2 with leading coefficient a/2.

Furthermore,

lim_{x→r+} f(x) ≠ 0 and lim_{x→s−} f(x) ≠ 0,

so f has no zeros at all because of continuity. The argument above with U = R shows the claim.

If, furthermore, f is real nonnegative and integrable with integral 1 (e.g., if f is the density of a random variable), then f has to be the exponential of a real-valued polynomial of degree precisely 2; otherwise, it would not be integrable. So we have the following corollary:



Corollary 1. Let X be a random variable with twice continuously differentiable density pX satisfying equation 3.3. Then X is gaussian.

If we do not want to assume that the random variable has a density, we can use its characteristic function (Bauer, 1996) instead to show an equivalent result:

Corollary 2. Let X be a random variable with twice continuously differentiable characteristic function X̂(x) := E_X(exp(ixX)) satisfying equation 3.3. Then X is gaussian or deterministic.

Proof. Using X̂(0) = 1, lemma 3 shows that X̂(x) = exp((a/2) x² + bx). Moreover, since X̂(−x) is the complex conjugate of X̂(x), we get a ∈ R and b = ib′ with b′ real. And |X̂| ≤ 1 shows that a ≤ 0. So if a = 0, then X is deterministic (at b′), and if a ≠ 0, then X has a gaussian distribution with mean b′ and variance −a⁻¹.

3.3 Proof of Theorem 2. We will now prove linear separability; for this, we will use separatedness to show that some source components have to be gaussian (using the results from above) if the mixing matrix is not trivial. The main argument is given in the following lemma:

Lemma 4. Let gi ∈ C²(R, C) and B ∈ Gl(n) such that f(x) := g1 ⊗ ··· ⊗ gn(Bx) is separated. Then for all indices l and i ≠ j with bli blj ≠ 0, gl satisfies the differential equation 3.3 with some constant a.

Proof. f is separated, so by theorem 1i,

Rij[f] ≡ f ∂i∂j f − (∂i f)(∂j f) ≡ 0 (3.4)

holds for i < j. The ingredients of this equation can be calculated for i < j as follows:

∂i f(x) = Σk bki (g1 ⊗ ··· ⊗ gk′ ⊗ ··· ⊗ gn)(Bx),

(∂i f)(∂j f)(x) = Σk,l bki blj ( (g1 ⊗ ··· ⊗ gk′ ⊗ ··· ⊗ gn)(g1 ⊗ ··· ⊗ gl′ ⊗ ··· ⊗ gn) )(Bx),

∂i∂j f(x) = Σk bki ( bkj g1 ⊗ ··· ⊗ gk″ ⊗ ··· ⊗ gn + Σl≠k blj g1 ⊗ ··· ⊗ gk′ ⊗ ··· ⊗ gl′ ⊗ ··· ⊗ gn )(Bx).



Putting this in equation 3.4 yields

0 = ( f ∂i∂j f − (∂i f)(∂j f) )(x)
  = Σk bki bkj ( (g1 ⊗ ··· ⊗ gn)(g1 ⊗ ··· ⊗ gk″ ⊗ ··· ⊗ gn) − (g1 ⊗ ··· ⊗ gk′ ⊗ ··· ⊗ gn)² )(Bx)
  = Σk bki bkj ( g1² ⊗ ··· ⊗ gk−1² ⊗ (gk gk″ − (gk′)²) ⊗ gk+1² ⊗ ··· ⊗ gn² )(Bx)

for x ∈ R^n. B is invertible, so the whole function is zero:

Σk bki bkj g1² ⊗ ··· ⊗ gk−1² ⊗ (gk gk″ − (gk′)²) ⊗ gk+1² ⊗ ··· ⊗ gn² ≡ 0. (3.5)

Choose x ∈ R^n with gk(xk) ≠ 0 for k = 1, ..., n. Evaluating equation 3.5 at (x1, ..., xl−1, y, xl+1, ..., xn) for variable y ∈ R and dividing the resulting one-dimensional equation by the constant g1²(x1) ··· gl−1²(xl−1) gl+1²(xl+1) ··· gn²(xn) shows

bli blj (gl gl″ − (gl′)²)(y) = −( Σk≠l bki bkj ((gk gk″ − (gk′)²) / gk²)(xk) ) gl²(y) (3.6)

for y ∈ R. So for indices l and i ≠ j with bli blj ≠ 0, it follows from equation 3.6 that there exists a ∈ C such that gl satisfies the differential equation a gl² − gl gl″ + (gl′)² ≡ 0, that is, equation 3.3.

Proof of Theorem 2. i. S is assumed to have at most one gaussian or deterministic component and existing covariance. Set X := AS.

We first show using whitening that A can be assumed to be orthogonal. For this, we can assume S and X to have no deterministic component at all (because arbitrary choice of the matrix coefficients of the deterministic components does not change the covariance). Hence, by assumption, Cov(X) is diagonal and positive definite, so let D1 be diagonal invertible with Cov(X) = D1². Similarly, let D2 be diagonal invertible with Cov(S) = D2². Set Y := D1⁻¹X and T := D2⁻¹S, that is, normalize X and S to covariance I. Then

Y = D1⁻¹X = D1⁻¹AS = D1⁻¹AD2T



and T, D1⁻¹AD2 and Y satisfy the assumption, and D1⁻¹AD2 is orthogonal because

I = Cov(Y) = E(YYᵀ) = E(D1⁻¹AD2 T Tᵀ D2 Aᵀ D1⁻¹) = (D1⁻¹AD2)(D1⁻¹AD2)ᵀ.

So without loss of generality, let A be orthogonal.

Now let Ŝ(s) := E_S(exp(i sᵀS)) be the characteristic function of S. By assumption, the covariance (and hence the mean) of S exists, so Ŝ ∈ C²(R^n, C) (Bauer, 1996). Furthermore, since S is assumed to be independent, its characteristic function is separated: Ŝ ≡ g1 ⊗ ··· ⊗ gn, where gi ≡ Ŝi. The characteristic function of AS can easily be calculated as

E_S(exp(i xᵀAS)) = Ŝ(Aᵀx) = g1 ⊗ ··· ⊗ gn(Aᵀx)

for x ∈ R^n. Let B := (bij) = Aᵀ. Since AS is also assumed to be independent, f(x) := g1 ⊗ ··· ⊗ gn(Bx), the characteristic function of AS, is separated.

Now assume that A ≁ I. Using orthogonality of B = Aᵀ, there exist indices k ≠ l and i ≠ j with bki bkj ≠ 0 and bli blj ≠ 0. Then according to lemma 4, gk and gl satisfy the differential equation 3.3. Together with corollary 2, this shows that both Sk and Sl are gaussian, which is a contradiction to the assumption.

ii. Let S be an n-dimensional independent random vector with density pS ∈ C²(R^n, R) and no gaussian component, and let A ∈ Gl(n). S is assumed to be independent, so its density factorizes, pS ≡ g1 ⊗ ··· ⊗ gn. The density of AS is given by

pAS(x) = |det A|⁻¹ pS(A⁻¹x) = |det A|⁻¹ g1 ⊗ ··· ⊗ gn(A⁻¹x)

for x ∈ R^n. Let B := (bij) = A⁻¹. AS is also assumed to be independent, so

f(x) := |det A| pAS(x) = g1 ⊗ ··· ⊗ gn(Bx)

is separated.

Assume A ≁ I. Then also B = A⁻¹ ≁ I, so there exist indices l and i ≠ j with bli blj ≠ 0. Hence, it follows from lemma 4 that gl satisfies the differential equation 3.3. But gl is a density, so according to corollary 1 the l-th component of S is gaussian, which is a contradiction.



4 BSS by Hessian Diagonalization

In this section, we use the theory already set out to propose an algorithm for linear BSS, which can be easily extended to nonlinear settings as well. For this, we restrict ourselves to using C²-densities. A similar idea has already been proposed in Lin (1998), but without dealing with possibly degenerate eigenspaces in the Hessian. Equivalently, we could also use characteristic functions instead of densities, which leads to a related algorithm (Yeredor, 2000).

If we assume that Cov(S) exists, we can use whitening as seen in the proof of theorem 2i (in this context, also called principal component analysis) to reduce the general BSS model, equation 3.1, to

X = AS (4.1)

with an independent n-dimensional random vector S with existing covariance I and an orthogonal matrix A. Then Cov(X) = I. We assume that S admits a C²-density pS. The density of X is then given by

pX(x) = pS(Aᵀx)

for x ∈ R^n, because of the orthogonality of A. Hence,

pS ≡ pX ∘ A.

Note that the Hessian of the composition of a function f ∈ C²(R^n, R) with an (n × n)-matrix A can be calculated using the Hessian of f as follows:

Hf∘A(x) = Aᵀ Hf(Ax) A.

Let s ∈ R^n with pS(s) > 0. Then locally at s, we have

Hln pS(s) = Hln pX∘A(s) = Aᵀ Hln pX(As) A. (4.2)

pS is assumed to be separated, so Hln pS(s) is diagonal, as seen in section 2.

Lemma 5. Let X := AS with an orthogonal matrix A and S, an <strong>in</strong>dependent<br />

random vector with C 2 -density, and at most one gaussian component. Then there<br />

exists an open set U ⊂ R n such that for all x ∈ U, pX(x) �= 0 and Hln p X (x) has n<br />

different eigenvalues.<br />

Proof. Assume not. Then there exists no x ∈ R n at all with pX(x) �= 0 and<br />

Hln p X (x) hav<strong>in</strong>g n different eigenvalues because otherwise, due to cont<strong>in</strong>uity,<br />

these conditions would also hold <strong>in</strong> an open neighborhood of x.



Using equation 4.2, the logarithmic Hessian of p_S then has, at every s ∈ R^n with p_S(s) > 0, at least two equal eigenvalues, say λ(s) ∈ R. Hence, since S is independent, H_{ln p_S}(s) is diagonal, so locally

    (ln p_{S_i})''(s_i) = (ln p_{S_j})''(s_j) = λ(s)

for two indices i ≠ j. Here we have used continuity of s ↦ H_{ln p_S}(s), showing that the two equal eigenvalues locally lie in the same two dimensions i and j. This proves that λ(s) is locally constant in the directions i and j. So locally at points s with p_S(s) > 0, S_i and S_j are of the type exp P, with P a polynomial of degree ≤ 2. The same argument as in the proof of lemma 3 then shows that p_{S_i} and p_{S_j} have no zeros at all. Using the connectedness of R proves that S_i and S_j are globally of the type exp P, hence gaussian (because of ∫_R p_{S_k} = 1), which is a contradiction.

Hence, we can assume that we have found x^(0) ∈ R^n such that H_{ln p_X}(x^(0)) has n different eigenvalues (which is equivalent to saying that every eigenvalue has multiplicity one), because due to lemma 5 this is an open condition, which can be found algorithmically. In fact, most densities in practice turn out to have logarithmic Hessians with n different eigenvalues almost everywhere. In theory, however, U in lemma 5 cannot be assumed to be, for example, dense, nor can R^n \ U be assumed to have measure zero: if we choose p_{S_1} to be a normalized gaussian and p_{S_2} to be a normalized gaussian with a very localized small perturbation at zero only, then U cannot be larger than (−ε, ε) × R.

By diagonalizing H_{ln p_X}(x^(0)) using an eigenvalue decomposition (principal axis transformation), we can find the (orthogonal) mixing matrix A. Note that the eigenvalue decomposition is unique except for permutation and sign scaling, because every eigenspace (in which A is only unique up to orthogonal transformation) has dimension one. An arbitrary scaling indeterminacy does not occur because we have forced S and X to have unit variances. Using the uniqueness of the eigenvalue decomposition and theorem 2, we have shown the following theorem:

Theorem 3 (BSS by Hessian calculation). Let X = AS with an independent random vector S and an orthogonal matrix A. Let x ∈ R^n such that locally at x, X admits a C^2-density p_X with p_X(x) ≠ 0. Assume that H_{ln p_X}(x) has n different eigenvalues (see lemma 5). If

    E H_{ln p_X}(x) E^⊤ = D

is an eigenvalue decomposition of the Hessian of the logarithm of p_X at x, that is, E orthogonal and D diagonal, then E ∼ A, so E^⊤ X is independent.



Furthermore, it follows from this theorem that linear BSS is a local problem, as proven already in Theis, Puntonet, and Lang (2003) using the restriction of a random vector.

4.1 Example for Hessian Diagonalization BSS. In order to illustrate the algorithm of local Hessian diagonalization, we give a two-dimensional example. Let S be a random vector with densities

    p_{S_1}(s_1) = (1/2) χ_{[−1,1]}(s_1)
    p_{S_2}(s_2) = (1/√(2π)) exp(−s_2²/2),

where χ_{[−1,1]} is one on [−1, 1] and zero everywhere else. The orthogonal mixing matrix A is chosen to be

    A = (1/√2) [ 1  1 ; −1  1 ].

The mixture density p_X of X := AS then is (det A = 1)

    p_X(x) = (1/(2√(2π))) χ_{[−1,1]}((x_1 − x_2)/√2) exp(−(x_1 + x_2)²/4)

for x ∈ R². p_X is positive and C² in a neighborhood of 0. Then

    ∂_1 ln p_X(x) = ∂_2 ln p_X(x) = −(x_1 + x_2)/2
    ∂_1² ln p_X(x) = ∂_2² ln p_X(x) = ∂_1∂_2 ln p_X(x) = −1/2

for x with |x| < 1/2, and the Hessian of the logarithmic density is

    H_{ln p_X}(x) = −(1/2) [ 1  1 ; 1  1 ],

independent of x in a neighborhood of 0. Diagonalization of H_{ln p_X}(0) yields

    [ −1  0 ; 0  0 ],

and this equals A H_{ln p_X}(0) A^⊤, as stated in theorem 3.
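To make the procedure concrete, the following small numerical sketch (not part of the original paper; sample size, kernel width, and finite-difference step are arbitrary illustrative choices) estimates the mixture density of the above example with a gaussian kernel density estimate, approximates the Hessian of its logarithm at 0 by finite differences, and reads the mixing directions off the eigenvectors, in the spirit of theorem 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sources of the example: S1 uniform on [-1, 1], S2 standard gaussian.
S = np.vstack([rng.uniform(-1.0, 1.0, 5000),
               rng.standard_normal(5000)])
A = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2.0)  # orthogonal mixing matrix
X = A @ S                                               # mixtures, shape (2, N)

def log_kde(x, samples, sigma=0.3):
    """Log of a gaussian kernel density estimate at point x."""
    d2 = np.sum((samples - x[:, None]) ** 2, axis=0)
    return np.log(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))

def hessian_log_density(x, samples, h=0.05):
    """Central finite-difference Hessian of the log-density estimate at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (log_kde(x + e_i + e_j, samples)
                       - log_kde(x + e_i - e_j, samples)
                       - log_kde(x - e_i + e_j, samples)
                       + log_kde(x - e_i - e_j, samples)) / (4.0 * h ** 2)
    return H

H0 = hessian_log_density(np.zeros(2), X)
eigvals, E = np.linalg.eigh(H0)   # columns of E are eigenvectors
print("estimated Hessian of ln p_X at 0:\n", H0)
print("eigenvalues:", eigvals)
# Up to permutation and sign, the eigenvector matrix should roughly match A,
# since the kernel estimate only approximates the true (smoothed) density.
print("eigenvectors (columns):\n", E)
print("mixing matrix A:\n", A)
```

Because the density is only estimated, the recovered directions are approximate; with the symmetric Hessian of this example they nevertheless align with the columns of A up to permutation and sign.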



4.2 Global Hessian Diagonalization Using Kernel-Based Density Approximation. In practice, it is usually not possible to approximate the density locally with sufficiently high accuracy, so a better approximation using the typically global information of X has to be found. In the following, we suggest using kernel-based density estimation to obtain an energy function with minima at the BSS solutions, together with a global Hessian diagonalization. The idea is to construct a measure for the separatedness of the densities (hence independence) based on theorem 1. A possible measure could be the norm of the summed-up separators over all index pairs i < j.



Figure 2: Independent Laplacian density p_S(s) = (1/2) exp(−|s_1| − |s_2|): theoretic (left) and approximated (right) densities. For the approximation, 1000 samples and a gaussian kernel approximation (see equation 4.3) with standard deviation 0.37 were used.

R_ij[p̂_X] can be calculated using lemma 2 (here R_ij[φ(x − x^(k))] ≡ 0) and equation 4.4:

    R_ij[p̂_X](x) = (1/ν²) R_ij[ Σ_{k=1}^{ν} φ(x − x^(k)) ]
                 = (1/ν²) Σ_{k≠l} ( φ(x − x^(k)) ∂_i∂_j φ(x − x^(l)) − ∂_i φ(x − x^(k)) ∂_j φ(x − x^(l)) )
                 = (4κ²/ν²) Σ_{k<l} φ(x − x^(k)) φ(x − x^(l)) (x_i^(k) − x_i^(l)) (x_j^(k) − x_j^(l));



hence,

    E = (σ²ν)^{-4} Σ_m Σ_{i<j} ( Σ_{k<l} φ(x^(m) − x^(k)) φ(x^(m) − x^(l)) (x_i^(k) − x_i^(l)) (x_j^(k) − x_j^(l)) )².



Note that E represents a new approximate measure of independence. Therefore, the linear BSS algorithm can now be readily generalized to nonlinear situations by finding an appropriate parameterization of the possibly nonlinear separating model.
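The pairwise form of R_ij derived above translates directly into a sample-based routine. The following sketch is illustrative only: it assumes an unnormalized gaussian kernel φ(x) = exp(−κ|x|²) with κ = 1/(2σ²), evaluates Σ_m R_12[p̂_Y](y^(m))² for candidate demixings Y = WX parameterized by a rotation angle, and uses arbitrary sample sizes and kernel widths; constants and normalization may differ from the paper's equations 4.3 and 4.4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent Laplacian sources as in figure 2, mixed by a rotation.
S = rng.laplace(size=(2, 400))
theta_true = 0.6
A = np.array([[np.cos(theta_true), -np.sin(theta_true)],
              [np.sin(theta_true),  np.cos(theta_true)]])
X = A @ S

def separator_energy(Y, sigma=0.5):
    """Sum over sample points of R_12[p_hat_Y]^2, using the pairwise kernel formula."""
    kappa = 1.0 / (2.0 * sigma ** 2)
    n, nu = Y.shape
    diff = Y[:, :, None] - Y[:, None, :]              # shape (n, nu, nu)
    phi = np.exp(-kappa * np.sum(diff ** 2, axis=0))  # kernel values phi(y^(m) - y^(k))
    d1 = Y[0][:, None] - Y[0][None, :]                # y_1^(k) - y_1^(l)
    d2 = Y[1][:, None] - Y[1][None, :]
    R = np.zeros(nu)
    for m in range(nu):
        w = np.outer(phi[m], phi[m])                  # phi_mk * phi_ml
        R[m] = 0.5 * np.sum(w * d1 * d2)              # 0.5 turns sum over k!=l into k<l
    R *= 4.0 * kappa ** 2 / nu ** 2
    return np.sum(R ** 2)

# The energy should be smallest near theta = -theta_true
# (and 90 degrees away, where the sources are merely permuted).
for theta in np.linspace(-1.2, 1.2, 7):
    W = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    print(f"theta = {theta:+.2f}  energy = {separator_energy(W @ X):.3e}")
```

Any gradient or grid search over the rotation parameters can then be used to minimize this energy, which is the global-diagonalization idea described next.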

The proposed algorithm basically performs a global diagonalization of the logarithmic Hessian after prewhitening. Interestingly, this is similar to traditional BSS algorithms based on joint diagonalization such as JADE (Cardoso & Souloumiac, 1993) using cumulant matrices, or AMUSE (Tong, Liu, Soon, & Huang, 1991) and SOBI (Belouchrani, Meraim, Cardoso, & Moulines, 1997) employing time decorrelation. Instead of using a global energy function as proposed above, we could therefore also jointly diagonalize a given set of Hessians (respectively, separator matrices, as above; see also Yeredor, 2000). Another relation to previously proposed ICA algorithms lies in the kernel approximation technique. Gaussian or generalized gaussian kernels have already been used in the field of independent component analysis to model the source densities (Lee & Lewicki, 2000; Habl, Bauer, Puntonet, Rodriguez-Alvarez, & Lang, 2001), thus giving an estimate of the score function used in Bell-Sejnowski-type semiparametric algorithms (Bell & Sejnowski, 1995) or enabling direct separation using maximum likelihood parameter estimation. Our algorithm also uses density approximation, but employs it for the mixture density, which can be problematic in higher dimensions. A different approach not involving density approximation is a direct sample-based Hessian estimation similar to Lin (1998).

5 Separability of Postnonlinear BSS

In this section, we show how to use the idea of Hessian diagonalization in order to give separability proofs in nonlinear situations, more precisely in the setting of postnonlinear BSS. After stating the postnonlinear BSS model and the general (to the knowledge of the author, not yet proven) separability theorem, we will prove postnonlinear separability in the case of random vectors with distributions that are somewhere locally constant and nonzero (e.g., uniform distributions). A possible proof of postnonlinear separability has been suggested by Taleb and Jutten (1999); however, that proof applies only to densities with at least one zero and furthermore contains an error rendering it applicable only to restricted situations.

Definition 3. A function f : R^n → R^n is called diagonal if each component f_i(x) of f(x) depends only on the variable x_i.

In this case, we often omit the other variables and write f(x_1, ..., x_n) = (f_1(x_1), ..., f_n(x_n)); so f ≡ f_1 × ··· × f_n, where × denotes the Cartesian product.



Consider now the postnonlinear BSS model,

    X = f(AS),    (5.1)

where again S is an independent random vector, A ∈ Gl(n), and f is a diagonal nonlinearity. We assume the components of f to be injective analytical functions with invertible Jacobian at every point (locally diffeomorphic).

Definition 4. An invertible matrix A ∈ Gl(n) is said to be mixing if A has at least two nonzero entries in each row.

Note that if A is mixing, then A′, A^{-1}, and ALP for a scaling matrix L and a permutation matrix P are also mixing.

Postnonlinear BSS is a generalization of linear BSS, so the indeterminacies of postnonlinear ICA contain at least the indeterminacies of linear BSS: A can be reconstructed only up to scaling and permutation. In the linear case, affine linear transformation is ignored. Here, of course, additional indeterminacies come into play because of translation: f_i can be recovered only up to a constant. Also, if L ∈ Gl(n) is a scaling matrix, then

    f(AS) = (f ∘ L)((L^{-1}A)S),

so f and A can interchange scaling factors in each component. Another obvious indeterminacy could occur if A is not general enough. If, for example, A = I, then f(S) is already independent, because independence is invariant under diagonal nonlinear transformation; so f cannot be found in this case. If we assume, however, that A is mixing, then we will show that except for scaling interchange between f and A, no more indeterminacies than in the affine linear case exist.
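For illustration (a sketch, not taken from the paper), the following generates data from model 5.1 with a rotation A, which is mixing in the sense of definition 4, and the componentwise nonlinearity f_i(y) = y + y³, which is bijective, analytic, and has strictly positive derivative; the simple correlation-of-squares check at the end only illustrates the degenerate indeterminacy for A = I discussed above, not a separation algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent, non-gaussian sources (uniform on [-1, 1]).
S = rng.uniform(-1.0, 1.0, size=(2, 5000))

# Mixing matrix A: a rotation by theta, mixing in the sense of definition 4.
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Diagonal nonlinearity f = f_1 x f_2 with f_i(y) = y + y^3:
# bijective on R, analytic, derivative 1 + 3y^2 > 0 everywhere.
def f(y):
    return y + y ** 3

X = f(A @ S)   # postnonlinear mixtures, equation 5.1

# Higher-order dependence check (correlation of squared components):
# for the genuine mixture it is typically clearly nonzero, while in the
# degenerate case A = I the componentwise transform f(S) stays independent.
def square_corr(Y):
    return np.corrcoef(Y[0] ** 2, Y[1] ** 2)[0, 1]

print("mixture X = f(AS):   ", square_corr(X))
print("degenerate case f(S):", square_corr(f(S)))
```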

Theorem 4 (separability of postnonlinear BSS). Let A, W ∈ Gl(n) be mixing, h : R^n → R^n be a diagonal bijective function with analytical, locally diffeomorphic components, and S be an independent random vector with at most one gaussian component and existing covariance. If W(h(AS)) is independent, then there exist a scaling matrix L ∈ Gl(n) and p ∈ R^n with LA ∼ W^{-1} and h ≡ L + p.

If analyticity of the components of h is not assumed, then h ≡ L + p can only hold on {As | p_S(s) ≠ 0}.

If f ∘ A is the mixing model, W ∘ g is the separating model. Putting the two together, we get the above mixing-separating model. Since A has to be assumed to be mixing, we can assume W to be mixing as well, because the inverse of a mixing matrix is again mixing. Furthermore, the mixing-separating model is assumed to be bijective (hence A and W invertible and h bijective), because otherwise trivial solutions, such as h ≡ c for a constant c ∈ R, would also be solutions.



We will show the theorem in the case of S and X with components having somewhere locally constant nonzero C²-densities. An alternative geometric idea of how to prove theorem 4 for bounded sources in two dimensions is mentioned in Babaie-Zadeh, Jutten, and Nayebi (2002) and extended in Theis and Gruber (forthcoming). Note that in our case, as well as in the above restrictive cases, the assumption that S has at most one gaussian component holds trivially.

Proof of Theorem 4 (with Locally Constant Nonzero C²-Densities). Let h = h_1 × ··· × h_n with bijective C^∞-functions h_i : R → R. We only have to show that the h'_i are constant. Then h is affine linear, say h ≡ L + p, with a diagonal matrix L ∈ Gl(n) and a vector p ∈ R^n. Hence W(h(AS)) = WLAS + Wp is independent, and then WLAS is independent, so using linear separability (theorem 2i), WLA ∼ I; therefore LA ∼ W^{-1}.

Let X := W(h(AS)). The density of this transformed random vector is easily calculated from S:

    p_X(Wh(As)) = |det W|^{-1} |h'_1((As)_1)|^{-1} ··· |h'_n((As)_n)|^{-1} |det A|^{-1} p_S(s)

for s ∈ R^n. By assumption, h has an invertible Jacobian at every point, so the h'_i are either positive or negative; without loss of generality, h'_i > 0. Furthermore, p_X is independent, so we can write

    p_X ≡ g_1 ⊗ ··· ⊗ g_n.

For fixed s^0 ∈ R^n with p_S(s^0) > 0, there exists an open neighborhood U ⊂ R^n of s^0 with p_S|_U > 0 and p_S|_U ∈ C²(U, R). If we define f(s) := ln(|det W|^{-1} |det A|^{-1} p_S(s)) for s ∈ U, then

    f(s) = ln( h'_1((As)_1) ··· h'_n((As)_n) g_1((Wh(As))_1) ··· g_n((Wh(As))_n) )
         = Σ_{k=1}^n [ ln h'_k((As)_k) + ζ_k((Wh(As))_k) ],

where ζ_k := ln g_k is defined locally around (Wh(As^0))_k. p_S is separated, so

    ∂_i ∂_j f ≡ 0    (5.2)

for i < j. Denote A =: (a_ij) and W =: (w_ij). The first derivative and then the nondiagonal entries of the Hessian of f can be calculated as follows (i < j):

    ∂_i f(s) = Σ_{k=1}^n [ a_ki (h''_k / h'_k)((As)_k) + ζ'_k((Wh(As))_k) Σ_{l=1}^n w_kl a_li h'_l((As)_l) ]

    ∂_i ∂_j f(s) = Σ_{k=1}^n [ a_ki a_kj ((h'_k h'''_k − h''_k²) / h'_k²)((As)_k)
                 + ζ''_k((Wh(As))_k) (Σ_{l=1}^n w_kl a_li h'_l((As)_l)) (Σ_{l=1}^n w_kl a_lj h'_l((As)_l))
                 + ζ'_k((Wh(As))_k) Σ_{l=1}^n w_kl a_li a_lj h''_l((As)_l) ].

Substituting y := As and using equation 5.2, we finally get the following differential equation for the h_k:

    0 = Σ_{k=1}^n [ a_ki a_kj ((h'_k h'''_k − h''_k²) / h'_k²)(y_k)
      + ζ''_k((Wh(y))_k) (Σ_{l=1}^n w_kl a_li h'_l(y_l)) (Σ_{l=1}^n w_kl a_lj h'_l(y_l))
      + ζ'_k((Wh(y))_k) Σ_{l=1}^n w_kl a_li a_lj h''_l(y_l) ]    (5.3)

for y ∈ V := A(U).

We will restrict ourselves to the simple case mentioned above in order to solve this equation. We assume that the h_k are analytic and that there exists x^0 ∈ R^n where the demixed densities g_k are locally constant and nonzero. Consider the above calculation around s^0 = A^{-1}(h^{-1}(W^{-1}x^0)). Choose the open set V such that the g_k are locally constant and nonzero on W(h(V)). Then so are the ζ_k = ln g_k; in particular ζ'_k ≡ ζ''_k ≡ 0 there, and therefore

    0 = Σ_{k=1}^n a_ki a_kj ((h'_k h'''_k − h''_k²) / h'_k²)(y_k)

for y ∈ V. Hence, there exist open intervals I_k ⊂ R and constants b_k ∈ R with

    a_ki a_kj (h'_k h'''_k − h''_k²) ≡ b_k h'_k²

on I_k (here, b_k = −Σ_{l≠k} a_li a_lj ((h'_l h'''_l − h''_l²)/h'_l²)(y_l) for some (and then any) y ∈ V).

By assumption, A is mixing. Hence, for fixed k, there exist i ≠ j with a_ki a_kj ≠ 0. If we set c_k := b_k / (a_ki a_kj), then

    c_k h'_k² − h'_k h'''_k + h''_k² ≡ 0    (5.4)



on I_k. h_k was chosen to be analytic, and equation 5.4 holds on the open set I_k, so it holds on all of R. Applying lemma 3 then shows that either h'_k ≡ 0 or

    h'_k(x) = ± exp( (c_k/2) x² + d_k x + e_k ),  x ∈ R    (5.5)

with constants d_k, e_k ∈ R. By assumption, h_k is bijective, so h'_k ≢ 0.

Applying the same arguments as above to the inverse system

    S = A^{-1}(h^{-1}(W^{-1}X))

and using the fact that p_S is also somewhere locally constant and nonzero shows that equation 5.5 also holds for (h_k^{-1})' with other constants. But if both the derivatives of h_k and h_k^{-1} are of this exponential type, then c_k = d_k = 0, and therefore h_k is affine linear for all k = 1, ..., n, which completes the proof of postnonlinear separability in this special case.

Note that in the above proof, local positiveness of the densities was assumed in order to use the equivalence of local separability with the diagonality of the Hessian of the logarithm. Hence, these results can be generalized using theorem 1 in a similar fashion as we did in the linear case with theorem 2. In particular, we have proven postnonlinear separability also for uniformly distributed sources.

6 Conclusion

We have shown how to derive the separability of linear BSS using diagonalization of the Hessian of the logarithmic density or, respectively, of the characteristic function. This induces separated, that is, independent, sources. The idea of Hessian diagonalization is put into a new algorithm for performing linear independent component analysis, which is shown to be a local problem. In practice, however, due to the fact that the densities cannot be approximated locally very well, we also propose a diagonalization algorithm that takes the global structure into account. In order to show the use of this framework of separated functions, we finish with a proof of postnonlinear separability in a special case.

In future work, more general separability results for postnonlinear BSS could be constructed by finding more general solutions of the differential equation 5.3. Algorithmic improvements could be made by using other density approximation methods such as mixtures of gaussians, or by approximating the Hessian itself using the cumulative density and discrete approximations of the differential. Finally, the diagonalization algorithm can easily be extended to nonlinear situations by finding appropriate model parameterizations; instead of minimizing the mutual information, we minimize the absolute values of the off-diagonal terms of the logarithmic Hessian.



The algorithm has been specified using only an energy function; gradient and fixed-point algorithms can be derived in the usual manner.

Separability in nonlinear situations has turned out to be a hard problem, ill-posed in the most general case (Hyvärinen & Pajunen, 1999), and not many nontrivial results exist for restricted models (Hyvärinen & Pajunen, 1999; Babaie-Zadeh et al., 2002), all only two-dimensional. We believe that this is due to the fact that the rather nontrivial proof of the Darmois-Skitovitch theorem is not at all easily generalized to more general settings (Kagan, 1986). By introducing separated functions, we are able to give a much easier proof for linear separability and also provide new results in nonlinear settings. We hope that these ideas will be used to show separability in other situations as well.

Acknowledgments

I thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. I also thank Peter Gruber, Wolfgang Hackenbroch, and Michaela Theis for suggestions and remarks on various aspects of the separability proof. The work described here was supported by the DFG in the grant "Nonlinearity and Nonequilibrium in Condensed Matter" and the BMBF in the ModKog project.

References

Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. of EUSIPCO '02 (Vol. 2, pp. 11-14). Toulouse, France.
Bauer, H. (1996). Probability theory. Berlin: Walter de Gruyter.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434-444.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non gaussian signals. IEE Proceedings F, 140(6), 362-370.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314.
Darmois, G. (1953). Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21, 2-8.
Eriksson, J., & Koivunen, V. (2003). Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003 (pp. 23-27). Nara, Japan.
Habl, M., Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2001). Analyzing biomedical signals with probabilistic ICA and kernel-based source density estimation. In M. Sebaaly (Ed.), Information science innovations (Proc. ISI'2001) (pp. 219-225). Alberta, Canada: ICSC Academic Press.
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206-211). New York: American Institute of Physics.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483-1492.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429-439.
Kagan, A. (1986). New classes of dependent random variables and a generalization of the Darmois-Skitovitch theorem to several forms. Theory Probab. Appl., 33(2), 286-295.
Lee, T., & Lewicki, M. (2000). The generalized gaussian mixture model using ICA. In Proc. of ICA 2000 (pp. 239-244). Helsinki, Finland.
Lin, J. (1998). Factorizing multivariate function classes. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 563-569). Cambridge, MA: MIT Press.
Skitovitch, V. (1953). On a property of the normal distribution. DAN SSSR, 89, 217-219.
Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Trans. on Signal Processing, 47, 2807-2820.
Theis, F., & Gruber, P. (forthcoming). Separability of analytic postnonlinear blind source separation with bounded sources. In Proc. of ESANN 2004. Evere, Belgium: d-side.
Theis, F., Jung, A., Puntonet, C., & Lang, E. (2002). Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 15, 1-21.
Theis, F., Puntonet, C., & Lang, E. (2003). Nonlinear geometric ICA. In Proc. of ICA 2003 (pp. 275-280). Nara, Japan.
Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499-509.
Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80(5), 897-902.

Received June 27, 2003; accepted March 8, 2004.




Chapter 3

Signal Processing 84(5):951-956, 2004

Paper F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951-956, 2004

Reference (Theis, 2004b)

Summary in section 1.2.1



Abstract

Uniqueness of complex and multidimensional independent component analysis

F.J. Theis
Institute of Biophysics, University of Regensburg, Universitaetsstr. 31, D-93040 Regensburg, Germany
Received 25 September 2003

A complex version of the Darmois-Skitovitch theorem is proved using a multivariate extension of the latter by Ghurye and Olkin. This makes it possible to calculate the indeterminacies of independent component analysis (ICA) with complex variables and coefficients. Furthermore, the multivariate Darmois-Skitovitch theorem is used to show uniqueness of multidimensional ICA, where only groups of sources are mutually independent.

Keywords: Complex ICA; Multidimensional ICA; Separability

1. Introduction

The task of independent component analysis (ICA) is to transform a given random vector into a statistically independent one. ICA can be applied to blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. Good textbook-level introductions to ICA are given in [4,11].

BSS is said to be separable if the mixing structure can be blindly recovered except for obvious indeterminacies. In [5], Comon shows separability of linear real BSS using the Skitovitch-Darmois theorem. He notes that his proof for the real case can also be extended to the complex setting. However, a complex version of the Skitovitch-Darmois theorem is needed, which, to the knowledge of the author, has not yet been shown in the literature. In this work we will provide such a theorem, which is then used to prove separability of complex BSS.

Separability and uniqueness of BSS are already included in the definition of what is commonly called a 'contrast' [5]. Hence they have been widely studied, but in the setting of complex BSS, to the knowledge of the author, separability has only been shown under the additional assumption of non-zero cumulants of the sources [5,13].

The paper is organized as follows: In the next section, basic terms and notations are introduced. Section 3 states the well-known Skitovitch-Darmois theorem and a multivariate extension thereof; furthermore, a complex version of it is derived. The following Section 4 then introduces the complex linear blind source separation model and shows its separability.



Section 5 finally deals with separability of multidimensional ICA (group ICA).

2. Notation

Let K ∈ {R, C} be either the real or the complex numbers. For m, n ∈ N let Mat(m × n; K) be the K-vector space of real, respectively complex, m × n matrices, and Gl(n; K) := {W ∈ Mat(n × n; K) | det(W) ≠ 0} the general linear group of K^n. I ∈ Gl(n; K) denotes the unit matrix. For α ∈ C we write Re(α) for its real and Im(α) for its imaginary part.

An invertible matrix L ∈ Gl(n; K) is said to be a scaling matrix if it is diagonal. We say two matrices B, C ∈ Mat(m × n; K) are (K-)equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n; K) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n; K). Note that PL = L′P for some scaling matrix L′ ∈ Gl(n; K), so the order of the permutation and the scaling matrix does not matter for equivalence. Furthermore, if B ∈ Gl(n; K) with B ∼ I, then also B^{-1} ∼ I, and more generally, if BC ∼ A, then C ∼ B^{-1}A. So two matrices are equivalent if and only if they differ by right-multiplication with a matrix having exactly one non-zero entry in each row and each column. If K = R, the two matrices are the same except for permutation, sign, and scaling; if K = C, they are the same except for permutation, sign, scaling, and phase shift.

3. A multivariate version of the Skitovitch-Darmois theorem

The original Skitovitch-Darmois theorem shows a non-trivial connection between Gaussian distributions and stochastic independence. More precisely, it states that if two linear combinations of non-Gaussian independent random variables are again independent, then each original random variable can appear in only one of the two linear combinations. It has been proved independently by Darmois [6] and Skitovitch [14]; in a more accessible form, the proof can be found in [12]. Separability of linear BSS as shown by Comon [5] is a corollary of this theorem, although recently separability has also been shown without it [17].

Theorem 3.1 (Skitovitch-Darmois theorem). Let L_1 = Σ_{i=1}^n α_i X_i and L_2 = Σ_{i=1}^n β_i X_i with X_1, ..., X_n independent real random variables and α_j, β_j ∈ R for j = 1, ..., n. If L_1 and L_2 are independent, then all X_j with α_j β_j ≠ 0 are Gaussian.

The converse is true if we assume that Σ_{j=1}^n α_j β_j = 0: If all X_j with α_j β_j ≠ 0 are Gaussian and Σ_{j=1}^n α_j β_j = 0, then L_1 and L_2 are independent. This follows because then L_1 and L_2 are uncorrelated, and with all common variables being normal, they are then also independent.

Theorem 3.2 (Multivariate S-D theorem). Let L_1 = Σ_{i=1}^n A_i X_i and L_2 = Σ_{i=1}^n B_i X_i with mutually independent k-dimensional random vectors X_j and invertible matrices A_j, B_j ∈ Gl(k; R) for j = 1, ..., n. If L_1 and L_2 are mutually independent, then all X_j are Gaussian.

Here Gaussian (or jointly Gaussian) means that each component of the random vector is a Gaussian. Obviously, those Gaussians can have non-trivial correlations. This extension of Theorem 3.1 to random vectors has first been noted by Skitovitch [15] and shown by Ghurye and Olkin [8]. Zinger gave a different proof for it in his Ph.D. thesis [18].

We need the following corollary:

Corollary 3.3. Let L_1 = Σ_{i=1}^n A_i X_i and L_2 = Σ_{i=1}^n B_i X_i with mutually independent k-dimensional random vectors X_j and matrices A_j, B_j either zero or in Gl(k; R) for j = 1, ..., n. If L_1 and L_2 are mutually independent, then all X_j with A_j B_j ≠ 0 are Gaussian.

Proof. We want to throw out all X_j with A_j B_j = 0; then Theorem 3.2 can be applied. Let j be given with A_j B_j = 0. Without loss of generality assume that B_j = 0. If also A_j = 0, then we can simply leave out X_j, since it appears in neither L_1 nor L_2. Assume A_j ≠ 0. By assumption X_j and X_1, ..., X_{j-1}, X_{j+1}, ..., X_n are mutually independent, and then so are X_j and L_2, because B_j = 0. Hence both −A_j X_j, L_2 and L_1, L_2 are mutually independent, so also the two linear combinations L_1 − A_j X_j and L_2 of the n − 1 variables X_1, ..., X_{j-1}, X_{j+1}, ..., X_n are mutually independent. After successive application of this recursion we can assume that each A_j and B_j is invertible. Applying Theorem 3.2 shows the corollary.

From this, a complex version of the Skitovitch-Darmois theorem can easily be derived:

Corollary 3.4 (Complex S-D theorem). Let L_1 = Σ_{i=1}^n α_i X_i and L_2 = Σ_{i=1}^n β_i X_i with X_1, ..., X_n independent complex random variables and α_j, β_j ∈ C for j = 1, ..., n. If L_1 and L_2 are independent, then all X_j with α_j β_j ≠ 0 are Gaussian.

Here, a complex random variable is said to be Gaussian if both its real and imaginary parts are Gaussian.

Proof. We can interpret the n independent complex random variables X_i as n two-dimensional real random vectors that are mutually independent. Multiplication by the complex number α_j is either (α_j ≠ 0) multiplication by the real invertible matrix

    [ Re(α_j)  −Im(α_j) ; Im(α_j)  Re(α_j) ]

or (α_j = 0) multiplication by the zero matrix; similarly for β_j. Applying Corollary 3.3 finishes the proof.
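The identification of complex scalars with real 2 × 2 matrices used in this proof is easy to verify numerically; the following minimal sketch (illustrative only) checks that complex multiplication corresponds to the matrix above acting on the stacked real and imaginary parts.

```python
import numpy as np

def real_matrix(alpha: complex) -> np.ndarray:
    """Real 2x2 representation of multiplication by the complex scalar alpha."""
    return np.array([[alpha.real, -alpha.imag],
                     [alpha.imag,  alpha.real]])

alpha = 0.8 - 1.3j
x = 2.0 + 0.5j

# Complex multiplication ...
y = alpha * x
# ... equals the 2x2 matrix acting on (Re x, Im x).
y_vec = real_matrix(alpha) @ np.array([x.real, x.imag])

print(y, y_vec)   # same numbers: (Re y, Im y) == y_vec
assert np.allclose([y.real, y.imag], y_vec)
```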

4. Indeterminacies of complex ICA

Given a complex n-dimensional random vector X, a matrix W ∈ Gl(n; C) is called a (complex) ICA of X if WX is independent (as a complex random vector). We will show that W and V are complex ICAs of X if and only if W^{-1} ∼ V^{-1}, that is, if they differ by right multiplication with a complex scaling and permutation matrix. This is equivalent to calculating the indeterminacies of the complex BSS model:

Consider the noiseless complex linear instantaneous blind source separation (BSS) model with as many sources as sensors,

    X = AS.    (1)

Here S is an independent complex-valued n-dimensional random vector and A ∈ Gl(n; C) an invertible complex matrix.

The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to equivalence, because for a scaling matrix L and a permutation matrix P,

    X = ALPP^{-1}L^{-1}S

and P^{-1}L^{-1}S is also independent. We will show that under mild assumptions on S there are no further indeterminacies of complex BSS.

Various algorithms for solving the complex BSS problem have been proposed [1,2,7,13,16]. We want to note that many cases where complex BSS is applied can in fact be reduced to using real BSS algorithms. This is the case if either the sources or the mixing matrix are real. The latter, for example, occurs after Fourier transformation of signals with time structure.

If the sources are real, then the above complex model can be split up into two separate real BSS problems:

    Re(X) = Re(A) S,
    Im(X) = Im(A) S.

Solving both of these real BSS equations yields A = Re(A) + i Im(A). Of course, Re(A) and Im(A) can only be found up to scaling and permutation. By comparing the two recovered source random vectors (using, for example, the mutual information of one component of each vector), we can however assume that the permutation and then also the scaling indeterminacy of both recovered matrices is the same, which allows the algorithm to correctly put A back together. Similarly, separability of this special complex ICA problem can also be derived from the well-known separability results in the real case.

If the mixing matrix is known to be real, then again splitting up Eq. (1) into real and imaginary parts yields

    Re(X) = A Re(S),
    Im(X) = A Im(S).

A can be found from either equation. If both real and imaginary samples are to be used in order to increase precision, they can simply be concatenated in order to generate a twice as large sample set mixed by the same mixing matrix A. In terms of random vectors this means working in two disjoint copies of the original probability space. Again separability follows.
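As an illustration of this reduction (a sketch under the stated assumption of a real mixing matrix; the Laplacian sources and the use of scikit-learn's FastICA as the real BSS routine are arbitrary illustrative choices), one can stack the real and imaginary parts of the observations into a doubled real sample set and hand it to any real BSS algorithm:

```python
import numpy as np
from sklearn.decomposition import FastICA  # any real BSS/ICA routine would do

rng = np.random.default_rng(3)

# Complex-valued independent sources mixed by a REAL matrix A.
n, N = 2, 2000
S = rng.laplace(size=(n, N)) + 1j * rng.laplace(size=(n, N))
A = np.array([[1.0, 0.6],
              [-0.4, 1.0]])              # real mixing matrix (illustrative)
X = A @ S

# Concatenate real and imaginary parts: a twice-as-large REAL sample set
# mixed by the same matrix A.
X_cat = np.hstack([X.real, X.imag])      # shape (n, 2N)

ica = FastICA(n_components=n, random_state=0)
ica.fit(X_cat.T)                         # samples as rows
A_est = ica.mixing_                      # estimate of A up to scaling and permutation
print(np.round(A_est, 2))
print(np.round(A, 2))
```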



Theorem 4.1 (Separability of complex linear BSS). Let A ∈ Gl(n; C) and S a complex independent random vector. Assume one of the following:

i. S has at most one Gaussian component and the (complex) covariance of S exists.
ii. S has no Gaussian component.

If AS is again independent (in fact, pairwise independence of the components of AS suffices), then A is equivalent to the identity.

Here, the complex covariance of S is defined by

    Cov(S) = E((S − E(S))(S − E(S))*),

where the asterisk denotes the transposed and complex-conjugated vector.

Comon has shown this for the real case [5]; for the complex case a complex version of the Darmois-Skitovitch theorem is needed, as provided in Section 3. Theorem 4.1 indeed proves separability of the complex linear BSS model, because if X = AS and W is a demixing matrix such that WX is independent, then WA ∼ I, so W^{-1} ∼ A as desired. And it also calculates the indeterminacies of complex ICA, because if W and V are ICAs of X, then both VX and WV^{-1}VX are independent, so WV^{-1} ∼ I and hence W ∼ V.

Proof. Denote X := AS.

First assume case ii: S has no Gaussian component at all. Then A = (a_ij) is equivalent to the identity, because if not, there exist i_1 ≠ i_2 and j with a_{i_1 j} a_{i_2 j} ≠ 0. Applying Corollary 3.4 to X_{i_1} and X_{i_2} then shows that S_j is Gaussian, which is a contradiction to assumption ii.

Now assume that the covariance exists and that S has at most one Gaussian component. First we show, using complex decorrelation, that we can assume A to be unitary. Without loss of generality assume that all random vectors are centered. By assumption Cov(X) is diagonal, so let D_1 be diagonal invertible with Cov(X) = D_1². Note that D_1 is real. Similarly, let D_2 be diagonal invertible with Cov(S) = D_2². Set Y := D_1^{-1} X and T := D_2^{-1} S, that is, normalize X and S to covariance I. Then

    Y = D_1^{-1} X = D_1^{-1} A S = (D_1^{-1} A D_2) T,

and T, D_1^{-1} A D_2, and Y satisfy the assumption, and D_1^{-1} A D_2 is unitary because

    I = Cov(Y) = E(YY*) = E(D_1^{-1} A D_2 T T* D_2 A* D_1^{-1}) = (D_1^{-1} A D_2)(D_1^{-1} A D_2)*.

If we assume A ≁ I, then using the fact that A is unitary there exist indices i_1 ≠ i_2 and j_1 ≠ j_2 with a_{i_κ j_λ} ≠ 0 for κ, λ = 1, 2. By assumption

    X_{i_1} = a_{i_1 j_1} S_{j_1} + a_{i_1 j_2} S_{j_2} + ···
    X_{i_2} = a_{i_2 j_1} S_{j_1} + a_{i_2 j_2} S_{j_2} + ···

are independent, and in both X_{i_1} and X_{i_2} the variables S_{j_1} and S_{j_2} appear non-trivially, so by the complex Skitovitch-Darmois theorem (Corollary 3.4) S_{j_1} and S_{j_2} are Gaussian, which is a contradiction to the fact that at most one source is Gaussian.

5. Indeterminacies of multidimensional ICA

In this section, we want to analyze the indeterminacies of so-called multidimensional independent component analysis. The idea of this generalization of ICA is that we do not require full independence of the transform Y but only mutual independence of certain tuples Y_{i_1}, ..., Y_{i_2}. If the size of all tuples is restricted to one, this reduces to original ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length (which then necessarily has to divide the total dimension).

Multidimensional ICA has first been introduced by Cardoso [3] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA which they called independent subspace analysis [9]; there, the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation. A different extension of ICA is given by topographic ICA [10], where dependencies between all components are assumed. A special case of multidimensional ICA is complex ICA as presented in the preceding section; here dependence is allowed between real-valued couples of random variables.

Let k, n ∈ N such that k divides n. We call an n-dimensional random vector Y k-independent if the k-dimensional random vectors

    (Y_1, ..., Y_k)^⊤, ..., (Y_{n−k+1}, ..., Y_n)^⊤

are mutually independent. A matrix W ∈ Gl(n; R) is called a k-multidimensional ICA of an n-dimensional random vector X if WX is k-independent. If k = 1, this is the same as ordinary ICA.
this is the same as ord<strong>in</strong>ary ICA.<br />

Obvious <strong>in</strong>determ<strong>in</strong>acies are, similar to ord<strong>in</strong>ary<br />

ICA, <strong>in</strong>vertible transforms <strong>in</strong> Gl(k; R) <strong>in</strong> each tuple<br />

as well as the fact that the order of the <strong>in</strong>dependent<br />

k-tuples is not xed. So, de ne for r; s =1;:::;n=k<br />

the (r; s) sub-k-matrix of W =(wij) tobethek × k<br />

submatrix<br />

(wij) i=rk; :::; rk+k−1<br />

j=sk; :::; sk+k−1<br />

that is the k × k submatrix of W start<strong>in</strong>g at position<br />

(rk; sk). A matrix L ∈ Gl(n; R) is said to be a k-scal<strong>in</strong>g<br />

and permutation matrix if for each r=1;:::;n=kthere<br />

exists precisely one s with the (r; s) sub-k-matrix of<br />

L to be nonzero, and such that this submatrix is <strong>in</strong><br />

Gl(k; R), and if for each s=1;:::;n=kthere exists only<br />

one r with the (r; s) sub-k-matrix satisfy<strong>in</strong>g the same<br />

condition. Hence, if Y is k-<strong>in</strong>dependent, also LY is<br />

k-<strong>in</strong>dependent.<br />

Two matrices A and B are said to be k-equivalent, A ∼_k B, if there exists such a k-scaling and permutation matrix L with A = BL. As stated above, given two matrices W and V with W^{-1} ∼_k V^{-1} such that one of them is a k-multidimensional ICA of a given random vector, then so is the other. We will show that there are no more indeterminacies of multidimensional ICA.

As usual, multidimensional ICA can solve the multidimensional BSS problem

    X = AS,

where A ∈ Gl(n; R) and S is a k-independent n-dimensional random vector. Finding the indeterminacies of multidimensional ICA then shows that A can be found except for k-equivalence (separability), because if X = AS and W is a demixing matrix such that WX is k-independent, then WA ∼_k I, so W^{-1} ∼_k A as desired.

However, for the proof we need one more condition on A: We call A k-admissible if for each r, s = 1, ..., n/k the (r, s) sub-k-matrix of A is either invertible or zero. Note that this is not a strong restriction: if we randomly choose A with coefficients from a continuous distribution, then with probability one we obtain a k-admissible matrix, because the non-k-admissible matrices lie in a submanifold of R^{n²} of dimension smaller than n².
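The block-wise notions above translate directly into code. The following sketch (illustrative; blocks are indexed from 0 in code while the text counts from 1) extracts the (r, s) sub-k-matrices of a matrix and checks k-admissibility by testing each block for being zero or invertible.

```python
import numpy as np

def sub_k_matrix(W: np.ndarray, r: int, s: int, k: int) -> np.ndarray:
    """The (r, s) sub-k-matrix: the k x k block of W at block row r, block column s."""
    return W[r * k:(r + 1) * k, s * k:(s + 1) * k]

def is_k_admissible(A: np.ndarray, k: int, tol: float = 1e-10) -> bool:
    """A is k-admissible if every k x k block is either (numerically) zero or invertible."""
    n = A.shape[0]
    assert A.shape == (n, n) and n % k == 0
    for r in range(n // k):
        for s in range(n // k):
            block = sub_k_matrix(A, r, s, k)
            zero = np.all(np.abs(block) < tol)
            invertible = np.abs(np.linalg.det(block)) > tol
            if not (zero or invertible):
                return False
    return True

# A random matrix is k-admissible with probability one ...
A = np.random.default_rng(4).standard_normal((6, 6))
print(is_k_admissible(A, k=2))   # True (almost surely)

# ... while a matrix containing a nonzero but singular block is not.
B = A.copy()
B[0:2, 0:2] = [[1.0, 2.0], [2.0, 4.0]]   # rank-1 block
print(is_k_admissible(B, k=2))   # False
```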

Theorem 5.1 (Separability of multidimensional BSS). Let A ∈ Gl(n; R) and S a k-independent n-dimensional random vector having no Gaussian k-tuple (S_{rk}, ..., S_{rk+k−1})^⊤. Assume that A is k-admissible.

If AS is again k-independent, then A is k-equivalent to the identity.

For the case k = 1 this is linear BSS separability, because every matrix is 1-admissible.

Proof. Denote X := AS. Assume that A ≁_k I. Then there exist indices r_1 ≠ r_2 and s such that the (r_1, s) and the (r_2, s) sub-k-matrices of A are non-zero (hence in Gl(k; R) by k-admissibility). Applying Corollary 3.3 to the two random vectors (X_{r_1 k}, ..., X_{r_1 k+k−1})^⊤ and (X_{r_2 k}, ..., X_{r_2 k+k−1})^⊤ then shows that (S_{sk}, ..., S_{sk+k−1})^⊤ is Gaussian, which is a contradiction.

Note that we could have used whitening to assume that A is orthogonal; however, there does not seem to be a direct way to exploit this in order to allow one fully Gaussian k-tuple, contrary to the complex ICA case, see Theorem 4.1.

6. Conclusion

Uniqueness and separability results play a central role in solving BSS problems, since they allow algorithms to apply ICA in order to uniquely (except for scaling and permutation) find the original mixing matrices. We have used a multidimensional version of the Skitovitch-Darmois theorem in order to calculate the indeterminacies of complex and of multidimensional ICA. In the multidimensional ICA case an additional restriction was needed in the proof, which could be relaxed if Corollary 3.3 can be extended to allow matrices of arbitrary rank.

Acknowledgements

This research was supported by grants from the DFG (graduate college 'nonlinear dynamics') and the BMBF (project 'ModKog').

References<br />

[1] A. Back, A. Tsoi, Bl<strong>in</strong>d deconvolution of signals us<strong>in</strong>g a<br />

complex recurrent network, <strong>in</strong>: Neural Networks for Signal<br />

Process<strong>in</strong>g 4, Proceed<strong>in</strong>gs of the 1994 IEEE Workshop, 1994,<br />

pp. 565–574.<br />

[2] E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for

<strong>in</strong>dependent component analysis of complex-valued signals,<br />

Internat. J. Neural Systems 10 (1) (2000) 1–8.<br />

[3] J. Cardoso, Multidimensional <strong>in</strong>dependent component<br />

analysis, <strong>in</strong>: Proceed<strong>in</strong>gs of ICASSP ’98, Seattle, WA, May<br />

12–15, 1998.<br />

[4] A. Cichocki, S. Amari, Adaptive Bl<strong>in</strong>d Signal and Image<br />

Process<strong>in</strong>g, Wiley, New York, 2002.<br />

[5] P. Comon, <strong>Independent</strong> component analysis—a new concept?<br />

Signal Process<strong>in</strong>g 36 (1994) 287–314.<br />

[6] G. Darmois, Analyse generale des liaisons stochastiques, Rev.<br />

Inst. Internat. Statist. 21 (1953) 2–8.<br />

2 Graduate college ‘nonl<strong>in</strong>ear dynamics’.<br />

3 Project ‘ModKog’.<br />

[7] S. Fiori, Bl<strong>in</strong>d separation of circularly distributed sources by<br />

neural extended apex algorithm, Neurocomput<strong>in</strong>g 34 (2000)<br />

239–252.<br />

[8] S. Ghurye, I. Olk<strong>in</strong>, A characterization of the multivariate<br />

normal distribution, Ann. Math. Statist. 33 (1962) 533–541.<br />

[9] A. Hyvar<strong>in</strong>en, P. Hoyer, Emergence of phase and shift<br />

<strong>in</strong>variant features by decomposition of natural images <strong>in</strong>to<br />

<strong>in</strong>dependent feature subspaces, Neural Computation 12 (7)<br />

(2000) 1705–1720.<br />

[10] A. Hyvar<strong>in</strong>en, P. Hoyer, M. Inki, Topographic <strong>in</strong>dependent<br />

component analysis, Neural Computation 13 (7) (2001)<br />

1525–1558.<br />

[11] A. Hyvar<strong>in</strong>en, J. Karhunen, E. Oja, <strong>Independent</strong> <strong>Component</strong><br />

<strong>Analysis</strong>, Wiley, New York, 2001.<br />

[12] A. Kagan, Y. L<strong>in</strong>nik, C. Rao, Characterization Problems <strong>in</strong><br />

Mathematical Statistics, Wiley, New York, 1973.<br />

[13] E. Moreau, O. Macchi, Higher order contrasts for<br />

self-adaptive source separation, Internat. J. Adaptive Control<br />

Signal Process. 10 (1) (1996) 19–46.<br />

[14] V. Skitovitch, On a property of the normal distribution, DAN<br />

SSSR 89 (1953) 217–219.<br />

[15] V. Skitovitch, L<strong>in</strong>ear forms <strong>in</strong> <strong>in</strong>dependent random variables<br />

and the normal distribution law, Izvestiia AN SSSR, Ser.<br />

Matem. 18 (1954) 185–200.<br />

[16] P. Smaragdis, Bl<strong>in</strong>d separation of convolved mixtures <strong>in</strong> the<br />

frequency doma<strong>in</strong>, Neurocomput<strong>in</strong>g 22 (1998) 21–34.<br />

[17] F. Theis, A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d<br />

source separation, 2003, submitted for publication; prepr<strong>in</strong>t at<br />

http://homepages.uni-regensburg.de/ ∼ thf11669/publications/<br />

prepr<strong>in</strong>ts/theis03l<strong>in</strong>uniqueness.pdf<br />

[18] A. Z<strong>in</strong>ger, Investigations <strong>in</strong>to analytical statistics and their<br />

application to limit theorems of probability theory, Ph.D.<br />

Thesis, Len<strong>in</strong>grad University, 1969.




Chapter 4<br />

Neurocomput<strong>in</strong>g 64:223-234, 2005<br />

Paper F.J. Theis and P. Gruber. On model identifiability <strong>in</strong> analytic postnonl<strong>in</strong>ear<br />

ICA. Neurocomput<strong>in</strong>g, 64:223-234, 2005<br />

Reference (Theis and Gruber, 2005)<br />

Summary <strong>in</strong> section 1.2.2<br />




Abstract<br />

On model identifiability <strong>in</strong> analytic<br />

postnonl<strong>in</strong>ear ICA<br />

F.J. Theis ∗ , P. Gruber<br />

Institute of Biophysics, University of Regensburg,<br />

D-93040 Regensburg, Germany<br />

An important aspect of successfully analyz<strong>in</strong>g data with bl<strong>in</strong>d source separation is<br />

to know the <strong>in</strong>determ<strong>in</strong>acies of the problem, that is how the separat<strong>in</strong>g model is<br />

related to the orig<strong>in</strong>al mix<strong>in</strong>g model. If l<strong>in</strong>ear <strong>in</strong>dependent component analysis (ICA)<br />

is used, it is well known that the mix<strong>in</strong>g matrix can be found <strong>in</strong> pr<strong>in</strong>ciple, but for<br />

more general sett<strong>in</strong>gs not many results exist. In this work, only consider<strong>in</strong>g random<br />

variables with bounded densities, we prove identifiability of the postnonl<strong>in</strong>ear mix<strong>in</strong>g<br />

model with analytic nonl<strong>in</strong>earities and calculate its <strong>in</strong>determ<strong>in</strong>acies. A simulation<br />

confirms these theoretical f<strong>in</strong>d<strong>in</strong>gs.<br />

Key words: postnonl<strong>in</strong>ear <strong>in</strong>dependent component analysis, postnonl<strong>in</strong>ear bl<strong>in</strong>d<br />

source separation, identifiability, separability, bounded random vectors<br />

1 Introduction<br />

<strong>Independent</strong> component analysis (ICA) f<strong>in</strong>ds statistically <strong>in</strong>dependent data<br />

with<strong>in</strong> a given random vector. It is often applied to bl<strong>in</strong>d source separation<br />

(BSS), where it is furthermore assumed that the given vector has been mixed<br />

us<strong>in</strong>g a fixed set of <strong>in</strong>dependent sources.<br />

In linear ICA, the mixing model can be written as

Xi = Σ_{j=1}^{n} aij Sj

∗ Correspond<strong>in</strong>g author.<br />

Email addresses: fabian@theis.name (F.J. Theis), petergruber@gmx.net (P.<br />

Gruber).<br />

Prepr<strong>in</strong>t submitted to Elsevier Science 13 October 2004



with <strong>in</strong>dependent sources S ⊤ = (S1, . . . , Sn) and mix<strong>in</strong>g matrix A = (aij).<br />

X is known, and the goal is to determ<strong>in</strong>e A and S. Traditionally, this model<br />

was only assumed to have decorrelated sources S, which leads to Principal

<strong>Component</strong> <strong>Analysis</strong> (PCA). Hérault and Jutten [1] were the first to extend<br />

this model to the ICA case by propos<strong>in</strong>g a neural algorithm based on nonl<strong>in</strong>ear<br />

decorrelation. S<strong>in</strong>ce then, the field of ICA has become <strong>in</strong>creas<strong>in</strong>gly popular<br />

and many algorithms have been studied, see [2–6] to name but a few. Good<br />

textbook-level <strong>in</strong>troductions to ICA are given <strong>in</strong> [7] and [8].<br />

With the growth of the field, <strong>in</strong>terest <strong>in</strong> nonl<strong>in</strong>ear model extensions has <strong>in</strong>creased.<br />

However, if the model were chosen to be too general, it could not be identified uniquely. A good trade-off between model generalization

and identifiability is given <strong>in</strong> the so called postnonl<strong>in</strong>ear BSS model realized<br />

by

Xi = fi( Σ_{j=1}^{n} aij Sj ).

This explicit nonl<strong>in</strong>ear model implies that <strong>in</strong> addition to the l<strong>in</strong>ear mix<strong>in</strong>g<br />

situation, each sensor Xi conta<strong>in</strong>s an unknown nonl<strong>in</strong>earity fi that can further<br />

distort the observation. This model, first proposed by Taleb and Jutten [9], has<br />

applications <strong>in</strong> telecommunication and biomedical data analysis. Algorithms<br />

for reconstruct<strong>in</strong>g postnonl<strong>in</strong>early mixed sources <strong>in</strong>clude [9–13].<br />

One major problem of ICA-based BSS lies <strong>in</strong> the question of model identifiability<br />

and separability. This is the question whether the model, respectively the sources, are uniquely determined by the observations X alone (except

for trivial <strong>in</strong>determ<strong>in</strong>acies such as permutation and scal<strong>in</strong>g). This problem is<br />

of key importance for any ICA algorithm, because if such an algorithm <strong>in</strong>deed<br />

f<strong>in</strong>ds a possible mix<strong>in</strong>g model for X, without identifiability the so-recovered<br />

sources would not have to co<strong>in</strong>cide at all with the orig<strong>in</strong>al sources. For l<strong>in</strong>ear<br />

ICA, real-valued model identifiability has been shown by Comon [3], given<br />

that X conta<strong>in</strong>s at most one gaussian. The proof uses the rather nontrivial<br />

Darmois-Skitovitch theorem, however a more direct elementary proof is<br />

possible as well [14]. A generalization to complex-valued random variables is<br />

given <strong>in</strong> [15]. Postnonl<strong>in</strong>ear identifiability has been considered <strong>in</strong> [9]; however<br />

<strong>in</strong> the formulation, the proof conta<strong>in</strong>s an <strong>in</strong>accuracy render<strong>in</strong>g the proof only<br />

applicable to quite restricted situations.<br />

In this work, we will analyze separability of postnonl<strong>in</strong>ear mixtures. We thereby<br />

generalize ideas already presented by Babaie-Zadeh et al [10], where the focus<br />

was put on the development of an actual identification algorithm. Babaie-<br />

Zadeh was the first to use the method of analyz<strong>in</strong>g bounded random vectors<br />

<strong>in</strong> the context of postnonl<strong>in</strong>ear mixtures [16] 1 . There, he already discussed<br />

1 His PhD thesis is available onl<strong>in</strong>e at<br />

http://www.lis.<strong>in</strong>pg.fr/stages dea theses/theses/manuscript/babaie-zadeh.pdf<br />


identifiability issues, albeit explicitly only <strong>in</strong> the two-dimensional analytic case.<br />

Extend<strong>in</strong>g his ideas, we are able to f<strong>in</strong>d a new necessary condition — which<br />

we name ’absolutely degenerate’, see def<strong>in</strong>ition 6 — for identify<strong>in</strong>g mix<strong>in</strong>g<br />

structure us<strong>in</strong>g only the boundary. This together with the generalization to<br />

arbitrary dimensions is our ma<strong>in</strong> contribution here, stated <strong>in</strong> theorem 7.<br />

The paper is arranged as follows: Section 2 presents a simple result about<br />

homogeneous functions and shortly discusses l<strong>in</strong>ear identifiability <strong>in</strong> the case<br />

of bounded random variables. Section 3 states the postnonl<strong>in</strong>ear separability<br />

problem, which is then proved <strong>in</strong> the follow<strong>in</strong>g section for real-valued random<br />

vectors. In section 5, a simulation confirm<strong>in</strong>g the ma<strong>in</strong> separability theorem<br />

is presented.<br />

Postnonl<strong>in</strong>ear separability is important for any postnonl<strong>in</strong>ear ICA algorithm,<br />

so we focus only on this question. We do not propose an explicit postnonl<strong>in</strong>ear<br />

identification algorithm but <strong>in</strong>stead refer to [9–13] for both algorithms and<br />

simulations.<br />

2 Basics<br />

For n ∈ N let Gl(n) be the general linear group of R^n, i.e. the group of invertible real (n × n)-matrices. An invertible matrix L ∈ Gl(n) is said to be a scaling

matrix, if it is diagonal. We say two (m × n)-matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n).

Def<strong>in</strong>ition 1 Given a function f : U → R assume there exist a, b ∈ R such<br />

that at least one is not of absolute value 0 or 1. If f(ax) = bf(x) for all x ∈ U<br />

with ax ∈ U, then f is said to be (a, b)-homogeneous or simply homogeneous.<br />

The follow<strong>in</strong>g lemma characteriz<strong>in</strong>g homogeneous functions is from [10]. However<br />

we added the correction to exclude the cases |a| or |b| ∈ {0, 1}, because<br />

<strong>in</strong> these cases homogeneity does not <strong>in</strong>duce such strong results. This lemma<br />

can be generalized to cont<strong>in</strong>uously differentiable functions, so the strong assumption<br />

of analyticity is not needed, but shortens the proof.<br />

Lemma 2 [10] Let f : U → R be an analytic function that is (a, b)-homogeneous on [0, ε) with ε > 0. Then there exist c ∈ R and n ∈ N ∪ {0} (possibly 0) such that f(x) = c x^n for all x ∈ U.

PROOF. If |a| is in {0, 1} or b = 0 then obviously f ≡ 0. If b = −1 then f ≡ 0: since |a| ∉ {0, 1}, f(a^2 x) = f(x) and f is continuous, f is constant, but


f(ax) = −f(x) implies that f ≡ 0. In the case that b = 1 again f is constant since f(ax) = f(x) and a^0 = 1 = b.

By m-times differentiating the homogeneity equation we finally get b f^(m)(x) = a^m f^(m)(ax), where f^(m) denotes the m-th derivative of f. Evaluating this at 0 yields b f^(m)(0) = a^m f^(m)(0). Since f is assumed to be analytic, f is determined uniquely by its derivatives at 0. Now either there exists an n ≥ 0 with b = a^n, hence f^(m)(0) = 0 for all m ≠ n and therefore f(x) = c x^n, or else f ≡ 0. ✷
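As a quick numeric illustration of this Taylor-coefficient argument (a sketch, not part of the original paper): for a = 2 and b = 8 the condition b·c_m = a^m·c_m kills every coefficient except m = 3, so f(x) = c x^3.

# Sketch (not from the paper): Taylor-coefficient argument behind Lemma 2.
# If f is analytic with f(a*x) = b*f(x), then b*c_m = a**m * c_m for every
# Taylor coefficient c_m, so c_m = 0 unless b == a**m.
a, b = 2.0, 8.0
allowed = [m for m in range(10) if abs(b - a**m) < 1e-12]
print("non-vanishing Taylor orders:", allowed)   # -> [3], i.e. f(x) = c*x**3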

Def<strong>in</strong>ition 3 [10] We call a random vector X with density pX bounded, if<br />

its density pX is bounded. Denote supp pX := {x | pX(x) ≠ 0} the support of

pX i.e. the closure of the non-zero po<strong>in</strong>ts of pX.<br />

We further call an <strong>in</strong>dependent random vector X fully bounded, if supp pXi is<br />

an <strong>in</strong>terval for all i. So we get supp pX = [a1, b1] × . . . × [an, bn].<br />

S<strong>in</strong>ce a connected component of supp pX <strong>in</strong>duces a restricted, fully bounded<br />

random vector, without loss of generality we will <strong>in</strong> the follow<strong>in</strong>g assume to<br />

have fully bounded densities. In the case of l<strong>in</strong>ear <strong>in</strong>stantaneous BSS the follow<strong>in</strong>g<br />

separability result is well known and can be derived from a more general<br />

version of this theorem for non-bounded densities [3]. But <strong>in</strong> the context of<br />

fully bounded random vectors, this follows already from the fact that <strong>in</strong> this<br />

case <strong>in</strong>dependence is equivalent to hav<strong>in</strong>g support with<strong>in</strong> a cube with sides<br />

parallel to the coord<strong>in</strong>ate planes, and only matrices equivalent to the identity<br />

leave this property <strong>in</strong>variant:<br />

Theorem 4 (Separability of bounded l<strong>in</strong>ear BSS) Let M ∈ Gl(n) be<br />

an <strong>in</strong>vertible matrix and S a fully bounded <strong>in</strong>dependent random vector. If MS<br />

is aga<strong>in</strong> <strong>in</strong>dependent, then M is equivalent to the identity.<br />

This theorem <strong>in</strong>deed proves separability of the l<strong>in</strong>ear ICA model, because if<br />

X = AS and W is a demix<strong>in</strong>g matrix such that WX is <strong>in</strong>dependent, then<br />

M := WA ∼ I, so W −1 ∼ A as desired. As the model is <strong>in</strong>vertible and the<br />

<strong>in</strong>determ<strong>in</strong>acies are trivial, identifiability and uniqueness follow directly.<br />
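A small numerical illustration of this statement (a sketch under the stated assumptions, not part of the paper): for uniform, fully bounded sources the support of MS fills an axis-aligned box only if M is equivalent to the identity; a generic M tilts the support, which can be detected by comparing the bounding-box area with the true support area |det M|.

import numpy as np

# Sketch: the support of a mixed bounded source is an axis-aligned box only for
# trivial mixing matrices. For uniform sources on the unit square the true
# support area equals |det M|, so any slack reveals a tilted support.
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, size=(2, 10000))          # fully bounded independent sources

def bounding_box_area(X):
    return np.prod(X.max(axis=1) - X.min(axis=1))

M_trivial = np.diag([2.0, -0.5])                # scaling, equivalent to identity
M_generic = np.array([[1.0, 0.5], [0.2, 1.0]])  # generic invertible matrix

for M in (M_trivial, M_generic):
    X = M @ S
    slack = bounding_box_area(X) - abs(np.linalg.det(M))
    print("axis-aligned box support?", slack < 1e-2, "(slack =", round(slack, 3), ")")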

3 Separability of postnonl<strong>in</strong>ear BSS<br />

In this section we <strong>in</strong>troduce the postnonl<strong>in</strong>ear BSS model and further discuss<br />

its identifiability.<br />

Def<strong>in</strong>ition 5 [9] A function f : R n → R n is called diagonal or componentwise<br />

if each component fi(x) of f(x) depends only on the variable xi.<br />


In this case we often omit the other variables and write f(x1, . . . , xn) = (f1(x1), . . . , fn(xn)) or f = f1 × · · · × fn.

Consider now the postnonl<strong>in</strong>ear Bl<strong>in</strong>d Source Separation model:<br />

X = f(AS)<br />

where aga<strong>in</strong> S is an <strong>in</strong>dependent random vector, A ∈ Gl(n) and f is a diagonal<br />

nonlinearity. We assume the components fi of f to be injective analytic functions with non-vanishing derivatives. Then the inverses fi^(-1) are also analytic.
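To make the model concrete, the following minimal sketch (not from the paper; the matrix and the componentwise nonlinearities are illustrative choices) generates data from the postnonlinear model X = f(AS):

import numpy as np

# Minimal postnonlinear BSS data generator, X = f(A S).
# A and the componentwise nonlinearities f_i are illustrative choices.
rng = np.random.default_rng(1)
n, T = 2, 5000
S = rng.uniform(-1, 1, size=(n, T))            # independent, fully bounded sources

A = np.array([[2.6, 1.4],
              [0.7, 3.3]])                     # invertible mixing matrix

f = [np.tanh,                                  # injective, analytic, f' != 0
     lambda y: y + 0.1 * y**3]                 # injective, analytic, f' != 0

Z = A @ S                                      # linear mixing
X = np.vstack([f[i](Z[i]) for i in range(n)])  # componentwise distortion
print(X.shape)                                 # (2, 5000) observed mixtures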

Def<strong>in</strong>ition 6 Let A ∈ Gl(n) be an <strong>in</strong>vertible matrix. Then A is said to<br />

be mixing if A has at least two nonzero entries in each row (see footnote 2). And A = (aij), i, j = 1 . . . n, is said to be absolutely degenerate if there are two columns l ≠ m such that ail^2 = λ aim^2 for a λ ≠ 0, i.e. the normalized columns differ only by the signs of the entries.

Postnonl<strong>in</strong>ear BSS is a generalization of l<strong>in</strong>ear BSS, so the <strong>in</strong>determ<strong>in</strong>acies<br />

of postnonl<strong>in</strong>ear ICA conta<strong>in</strong> at least the <strong>in</strong>determ<strong>in</strong>acies of l<strong>in</strong>ear BSS: A<br />

can only be reconstructed up to scal<strong>in</strong>g and permutation. Here of course additional<br />

<strong>in</strong>determ<strong>in</strong>acies come <strong>in</strong>to play because of translation: fi can only<br />

be recovered up to a constant. Also, if L ∈ Gl(n) is a scal<strong>in</strong>g matrix, then<br />

f(AS) = (f ◦ L)((L^−1 A)S), so f and A can interchange scaling factors in each

component. Another <strong>in</strong>determ<strong>in</strong>acy could occur if A is not mix<strong>in</strong>g, i.e. at least<br />

one observation xi conta<strong>in</strong>s only one source; <strong>in</strong> this case fi can obviously not<br />

be recovered. For example if A = I, then f(S) is already aga<strong>in</strong> <strong>in</strong>dependent,<br />

because <strong>in</strong>dependence is <strong>in</strong>variant under component-wise nonl<strong>in</strong>ear transformation;<br />

so f cannot be found us<strong>in</strong>g this method.<br />

A not so obvious <strong>in</strong>determ<strong>in</strong>acy occurs if A is absolutely degenerate. Then<br />

only the matrix A but not the nonl<strong>in</strong>earities can be recovered by look<strong>in</strong>g at<br />

the edges of the support of the fully-bounded random vector. For example consider the case n = 2, A = ( 1 1 ; 2 −2 ) and the analytic function f(x1, x2) = ( x1 + (1/(2π)) sin(πx1), x2 + (1/π) sin(πx2/2) ). Then A^−1 ◦ f ◦ A maps [0, 1]^2 onto [0, 1]^2.

S<strong>in</strong>ce both components of f are <strong>in</strong>jective, we can verify this by look<strong>in</strong>g at the<br />

edges:<br />

2 A slightly more general def<strong>in</strong>ition of ’mix<strong>in</strong>g’ can be def<strong>in</strong>ed still guarantee<strong>in</strong>g<br />

identifiability of the sources; it is however omitted for the sake of simplicity.<br />


[Figure 1: three panels showing the component nonlinearities f1, f2, the mixture f(AS) and the back-projection A^−1 f(AS); axis ticks omitted.]

Fig. 1. Example of a postnonlinear transformation using an absolutely degenerate matrix A and uniform sources S in [0, 1]^2.

f ◦ A(x1, 0) = ( x1 + (1/(2π)) sin(πx1), 2x1 + (1/π) sin(πx1) ) = (1, 2) ( x1 + (1/(2π)) sin(πx1) )
f ◦ A(0, x2) = (1, −2) ( x2 + (1/(2π)) sin(πx2) )
f ◦ A(x1, 1) = (1, −2) + (1, 2) ( x1 − (1/(2π)) sin(πx1) )
f ◦ A(1, x2) = (1, 2) + (1, −2) ( x2 − (1/(2π)) sin(πx2) )

So we have constructed a situation <strong>in</strong> which two uniform sources are mixed<br />

by f ◦ A, see figure 1. They can be separated either by A −1 ◦ f −1 or by A −1<br />

alone. We have shown that the latter also preserves the boundary, although<br />

it conta<strong>in</strong>s a different postnonl<strong>in</strong>earity (namely identity) <strong>in</strong> contrast to f −1 <strong>in</strong><br />

the former model. Nonetheless this is no <strong>in</strong>determ<strong>in</strong>acy of the model itself,<br />

s<strong>in</strong>ce A −1 f(AS) is obviously not <strong>in</strong>dependent. So by look<strong>in</strong>g at the boundary<br />

alone, we sometimes cannot detect <strong>in</strong>dependence if the whole system is highly<br />

symmetric. This is the case if A is absolutely degenerate. In our example f<br />

was chosen such that the non-trivial postnonl<strong>in</strong>ear mixture looks l<strong>in</strong>ear (at<br />

the boundary), and this was possible due to the <strong>in</strong>herent symmetry <strong>in</strong> A.<br />

If we however assume that A is mix<strong>in</strong>g and not absolutely degenerate, then we<br />

will show for all fully-bounded sources S that except for scal<strong>in</strong>g <strong>in</strong>terchange<br />

between f and A no more <strong>in</strong>determ<strong>in</strong>acies than <strong>in</strong> the aff<strong>in</strong>e l<strong>in</strong>ear case exist.<br />

Note that if f is only assumed to be cont<strong>in</strong>uously differentiable, then additional<br />

<strong>in</strong>determ<strong>in</strong>acies come <strong>in</strong>to play.<br />
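The following short sketch (not from the paper) checks the example above numerically: it samples the boundary of [0, 1]^2, pushes it through A^−1 ◦ f ◦ A with the absolutely degenerate A from above, and confirms that the image stays inside [0, 1]^2 even though A^−1 f(AS) is not independent.

import numpy as np

# Numerical check of the absolutely degenerate example (illustrative sketch).
A = np.array([[1.0, 1.0],
              [2.0, -2.0]])                      # absolutely degenerate matrix

def f(z):                                        # componentwise analytic map from the example
    return np.vstack([z[0] + np.sin(np.pi * z[0]) / (2 * np.pi),
                      z[1] + np.sin(np.pi * z[1] / 2) / np.pi])

t = np.linspace(0.0, 1.0, 1000)
edges = np.hstack([np.vstack([t, 0*t]), np.vstack([t, 0*t + 1]),
                   np.vstack([0*t, t]), np.vstack([0*t + 1, t])])   # boundary of [0,1]^2

image = np.linalg.inv(A) @ f(A @ edges)          # A^-1 o f o A applied to the boundary
print(image.min(), image.max())                  # about 0 and 1: the unit square is preserved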


4 Separability of bounded postnonl<strong>in</strong>ear BSS<br />

In this section we prove separability of postnonl<strong>in</strong>ear BSS; <strong>in</strong> the proof we will<br />

see how the two conditions from def<strong>in</strong>ition 6 turn out to be necessary.<br />

Theorem 7 (Separability of bounded postnonl<strong>in</strong>ear BSS) Let A, W ∈<br />

Gl(n) and one of them mixing and not absolutely degenerate, let h : R^n → R^n be a diagonal injective analytic function such that hi' ≠ 0, and let S be a fully bounded independent random vector. If W(h(AS)) is independent, then there exists a scaling L ∈ Gl(n) and v ∈ R^n with LA ∼ W^−1 and h(x) = Lx + v.

So let f ◦ A be the mix<strong>in</strong>g model and W ◦ g the separat<strong>in</strong>g model. Putt<strong>in</strong>g the<br />

two together we get the above mix<strong>in</strong>g-separat<strong>in</strong>g model with h := g ◦ f. The<br />

theorem shows that if the mix<strong>in</strong>g-separat<strong>in</strong>g model preserves <strong>in</strong>dependence<br />

then it is essentially trivial, i.e. h is affine linear and the matrices are equivalent (up to

scal<strong>in</strong>g). As usual, the model is assumed to be <strong>in</strong>vertible, hence identifiability<br />

and uniqueness of the model follow from the separability.<br />

Definition 8 A subset P ⊂ R^n is called a parallelepiped, if it is the linear image of a square, that is

P = A([a1, b1] × . . . × [an, bn])

for ai < bi, i = 1, . . . , n and A ∈ Gl(n). A parallelepiped P is said to be tilted, if A is mixing and no 2 × 2-minor is absolutely degenerate. Let i ≠ j ∈ {1, . . . , n} and c ∈ {a1, b1} × . . . × {an, bn}, then

A( {c1} × . . . × [ai, bi] × . . . × [aj, bj] × . . . × {cn} )

is called a 2-face of P and A(c) is called a corner of P. If n = 2 the parallelepipeds are called parallelograms.

Lemma 9 Let f1, . . . , fn be n one-dimensional analytic <strong>in</strong>jective functions<br />

with fi' ≠ 0, and let f := f1 × · · · × fn be the induced injective mapping on

R n . Let P, Q ⊂ R n be two parallelepipeds, one of them tilted. If f(P ) = Q (or<br />

equivalently for the boundaries f(∂P ) = ∂Q), then f is aff<strong>in</strong>e l<strong>in</strong>ear diagonal.<br />

Here ∂P denotes the boundary of the parallelepiped P i.e. the set of po<strong>in</strong>ts of<br />

P not ly<strong>in</strong>g <strong>in</strong> its <strong>in</strong>terior (which co<strong>in</strong>cides with the union of its faces).<br />

In the proof we see that the requirement for P or Q be<strong>in</strong>g tilted can be weakened<br />

slightly. It would suffice that enough 2-m<strong>in</strong>ors are not absolutely degenerate.<br />

Nevertheless the set of mix<strong>in</strong>g matrices hav<strong>in</strong>g no absolutely degenerate<br />

2 × 2-m<strong>in</strong>ors is very large <strong>in</strong> the sense that its complement has measure zero<br />

<strong>in</strong> Gl(n).<br />


Note that the tiltedness is essential: for example let P = ( 1 2 ; 1 0 ) [0, 1]^2 and take any f1 with

f1(x) = 3x/2 for x < 1,   f1(x) = 3x/2 − 1 for x > 2,

and f2(x) := x. Then Q is a parallelogram and its corners are (0, 0), (3/2, 1), (2, 0) and (7/2, 1), which is not a scaled version of P.

Note that the lemma extends the lemma proved in [10] by adding the condition on absolute degeneracy; this is in fact a necessary condition, as shown in figure 1.

PROOF. [Lemma 9 for n = 2] Obviously images of non-tilted parallelograms<br />

under diagonal mapp<strong>in</strong>gs are aga<strong>in</strong> non-tilted. f is <strong>in</strong>vertible, so we can assume<br />

that both P and Q are tilted. Without loss of generality us<strong>in</strong>g the scal<strong>in</strong>g and<br />

translation <strong>in</strong>variance of our problem, we may assume that<br />

∂P = ( 1 1 ; a1 a2 ) ∂([0, 1] × [0, c]),   ∂Q = ( 1 1 ; b1 b2 ) ∂([0, 1] × [0, d]),

with ai, bi ∈ R \ {0}, a1^2 ≠ a2^2, b1^2 ≠ b2^2, ca2, db2 > 0 and c ≤ 1, and

f(0) = 0,   f(1, a1) = (1, b1),   f(c, ca2) = (d, db2)

(i.e. the vertices of P are mapped onto the vertices of Q <strong>in</strong> the specified order).<br />

Note that the vertices of P have to be mapped onto vertices of Q because f<br />

is at least continuously differentiable. Since the fi are monotone we also have

d ≤ 1 and that a1 < 0 implies b1 < 0.<br />

It follows that f maps the four separate edges of ∂P onto the corresponding edges of ∂Q: f(t, a1 t) = ( g1(t), b1 g1(t) ) and f(ct, ca2 t) = ( d g2(t), d b2 g2(t) ) for t ∈ [0, 1]. Here gi : [0, 1] → [0, 1] is a strictly monotonically increasing parametrization of the respective edge. It follows that g1(t) = f1(t) and d g2(t) = f1(ct), and therefore f2(a1 t) = b1 f1(t) and f2(ca2 t) = b2 f1(ct) for t ∈ [0, 1]. Therefore we get an equation for both components of f, e.g. for the second: f2( (a1/a2) t ) = (b1/b2) f2(t) for t ∈ [0, ca2].

So f2 is (a1/a2, b1/b2)-homogeneous with coefficients not in {−1, 0, 1} by assumption; according to lemma 2, f2 and then also f1 are homogeneous polynomials (everywhere due to analyticity). By assumption fi'(0) ≠ 0, hence the fi are even linear.

We have used the translation invariance above, so in general f is an affine linear scaling. ✷


PROOF. [Lemma 9 for arbitrary n] Again note that since diagonal maps preserve non-tiltedness we can assume that P and Q are tilted. Let πij : R^n → R^2 be the projection onto the i, j-coordinates. Note that for any corner c and i ≠ j there is a 2-face Pijc of P containing c such that πij(Pijc) is a parallelogram. In fact, since P is tilted, πij(Pijc) is also tilted. Since f is smooth, f(Pijc) is also a 2-face of Q and πij( f(Pijc) ) is again tilted.

For each corner c of P and i ≠ j ∈ {1, . . . , n} we can apply the n = 2 version of this lemma to πij(Pijc) and πij( f(Pijc) ). Therefore fi and fj are affine linear on πi(Pijc) and πj(Pijc). Now πi(P) ⊂ ∪_{c,j} πi(Pijc) and hence fi is affine linear on πi(P), which proves that f is affine linear diagonal. ✷

Now we are able to show the separability theorem:<br />

PROOF. [Theorem 7] S is bounded, and W ◦ h ◦ A is continuous, so T := W(h(AS)) is bounded as well. Furthermore, since S is fully bounded, T is also fully bounded. Then, as seen in section 2, supp S and supp T are rectangles with boundaries parallel to the coordinate axes. Hence P := A(supp S) and Q := W^−1(supp T) are parallelepipeds. One of them is tilted because otherwise A and W^−1 would not be mixing.

As W ◦ h ◦ A maps supp S onto supp T, h maps the set A supp S onto W^−1 supp T, i.e. h(P) = Q. Then by lemma 9 h is affine linear diagonal, say h(x) = Lx + v for x ∈ P with L ∈ Gl(n) a scaling and v ∈ R^n.

So W(h(AS)) = WLAS + Wv is independent, and therefore also WLAS. By theorem 4, WLA ∼ I, so there exists a scaling L′ and a permutation P′ with WLA = L′P′, as had to be shown. ✷

5 Simulation<br />

In order to demonstrate the validity of theorem 7, we carry out a simple simulation<br />

<strong>in</strong> this section. We mix two <strong>in</strong>dependent random variables us<strong>in</strong>g a known<br />

mix<strong>in</strong>g model f and A. However, f = f(p0,q0) is taken from a parameterized<br />

family f(p,q) of nonlinearities, which enables us to test numerically whether in the separation system really only f(p0,q0)^−1 can fully separate the data. So we unmix the data using inverses f(p,q)^−1 of members of this family and A^−1. The

follow<strong>in</strong>g simulation will show that the mutual <strong>in</strong>formation of the recoveries is<br />

m<strong>in</strong>imal at (p0, q0), i.e. that f is determ<strong>in</strong>ed uniquely by X (with<strong>in</strong> this family<br />

at least) as stated by theorem 7.<br />


[Figure 2: left, a color plot with overlaid contours of −log MI( A^−1 ◦ (fp^−1 × gq^−1) ◦ (f0.5 × g0.5) ◦ A (S) ) over the recovery parameters p and q, including a zoom around p = q = 0.5; right, the nonlinearity families fp and gq plotted for p, q ∈ {0.1, 1.0}.]

Fig. 2. Simulation of the separability result using two families of nonlinearities with fp(x) = (1/(10p)) log( (x + sqrt(x^2 + 4e^(−20p))) / (2e^(−10p)) ) and gq(y) = (y/4) |y/4|^(3q−0.5). The left plot displays a color plot together with overlayed contours of a separation measure depending on the parameters p, q used for recovery. The separation quality is measured using the negative logarithm of the mutual information of the recovered sources. The region around the separation point p = q = 0.5 is also displayed in more detail.

The components of the postnonl<strong>in</strong>earity will be taken from two families of<br />

functions described by<br />

fp(x) = (1/(10p)) log( (x + sqrt(x^2 + 4e^(−20p))) / (2e^(−10p)) )   and   gq(y) = (y/4) |y/4|^(3q−0.5)

with p, q vary<strong>in</strong>g between 0 and 1. The first component of the nonl<strong>in</strong>earity,<br />

fp, models a sensor which saturates with varying strength and the second

component gq describes a polynomial activation of the sensor with vary<strong>in</strong>g<br />

degree, see figure 2, right hand side.<br />

In the simulation, an independent uniformly distributed random vector (3000 samples in [−1, 1]^2) is mixed postnonlinearly by the matrix A = ( 2.6 1.4 ; 0.7 3.3 ) and the diagonal nonlinearity

f(x, y) = ( (1/5) log( (e^5/2) x + (1/2) sqrt(e^10 x^2 + 4) ),  (1/16) y |y| ) = ( f0.5(x), g0.5(y) ).

To recover the sources the family (fp, gq)^−1 of diagonal nonlinearities is used

together with the <strong>in</strong>verse of A.<br />
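A compact re-implementation sketch of this experiment (illustrative, not the authors' original code; the closed-form inverses of fp and gq are derived here from the formulas above, they are not given in the paper):

import numpy as np

# Sketch of the simulation's mixing/unmixing pipeline.
def f_p(x, p):
    # equals (1/(10p)) log((x + sqrt(x^2 + 4e^(-20p))) / (2e^(-10p)))
    return np.arcsinh(x / (2 * np.exp(-10 * p))) / (10 * p)

def f_p_inv(z, p):
    return 2 * np.exp(-10 * p) * np.sinh(10 * p * z)

def g_q(y, q):
    return (y / 4) * np.abs(y / 4) ** (3 * q - 0.5)

def g_q_inv(z, q):
    return 4 * np.sign(z) * np.abs(z) ** (1 / (3 * q + 0.5))

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(2, 3000))           # 3000 samples in [-1, 1]^2
A = np.array([[2.6, 1.4], [0.7, 3.3]])

Z = A @ S
X = np.vstack([f_p(Z[0], 0.5), g_q(Z[1], 0.5)])  # postnonlinear mixtures

def unmix(X, p, q):
    return np.linalg.inv(A) @ np.vstack([f_p_inv(X[0], p), g_q_inv(X[1], q)])

Y = unmix(X, 0.5, 0.5)                           # unmix at the true parameters
print(np.max(np.abs(Y - S)))                     # numerically zero: exact recovery

Scanning other parameter pairs (p, q) in unmix and scoring the outputs with a mutual-information estimate reproduces the kind of parameter scan summarized in figures 2 and 3.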


[Figure 3: top left, the relative volume error −log vol(∆p,q) over the recovery parameters p and q; the remaining panels show the recovered source distributions labelled 'Original', 'Unmixing with p = 0.65, q = 0.5', 'Unmixing with p = 0.65, q = 0.65' and 'Unmixing with p = 0.5, q = 0.65'.]

Fig. 3. The top-left image graphs the separation error ∆p,q by measur<strong>in</strong>g the negative<br />

logarithm of its volume. The error ∆p,q denotes the symmetric difference between the support of the recovered source distribution and a quadrangle with the same vertices, i.e. the set of points lying in exactly one of the two. The other plots show the distributions of the recovered sources at

some comb<strong>in</strong>ations of the parameters p, q of the nonl<strong>in</strong>earities. At p = 0.5, q = 0.5<br />

this is the orig<strong>in</strong>al source distribution. Here light grey areas represent ∆p,q and dark<br />

grey areas the <strong>in</strong>tersection of the support and the quadrangle.<br />

A simple estimator for mutual <strong>in</strong>formation based on histogram estimation of<br />

the entropy (with 10 b<strong>in</strong>s <strong>in</strong> each dimension) is used to check the <strong>in</strong>dependence<br />

of the recovered sources. A more elaborate histogram-based estimator<br />

by Moddemeijer [17] yields similar results. As shown <strong>in</strong> figure 2 the mutual<br />

<strong>in</strong>formation of the recovered sources is m<strong>in</strong>imal at the parameters p = q = 0.5,<br />

which correspond to the mix<strong>in</strong>g model. It can also be noticed that the m<strong>in</strong>imum<br />

is much less dist<strong>in</strong>ct <strong>in</strong> the second component. This <strong>in</strong>dicates that <strong>in</strong><br />

numerical application it should be easier to detect nonl<strong>in</strong>ear functions which<br />

are bounded.<br />
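A minimal version of such a histogram-based mutual information estimator (a sketch assuming two-dimensional data and 10 bins per dimension; not the exact implementation used for the experiments):

import numpy as np

# Histogram estimate of mutual information MI(Y1, Y2) = H(Y1) + H(Y2) - H(Y1, Y2).
def entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(Y, bins=10):
    joint, _, _ = np.histogram2d(Y[0], Y[1], bins=bins)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Example: MI is close to zero for independent samples, clearly larger otherwise.
rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(2, 3000))
print(mutual_information(U))                                   # ~0 up to estimation bias
print(mutual_information(np.vstack([U[0], U[0] + 0.1*U[1]])))  # clearly > 0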

The second graph (figure 3) further illustrates that the criterion for the borders<br />


to be of quadrangular shape is sufficient. This gives the idea of a possible<br />

postnonl<strong>in</strong>ear ICA algorithm which m<strong>in</strong>imizes the non-quadrangularity of the<br />

support of the estimated source. As pictured <strong>in</strong> the graph this can for example<br />

be achieved by m<strong>in</strong>imiz<strong>in</strong>g the volume of the mutual differences (i.e. the po<strong>in</strong>ts<br />

which are in the union but not in the intersection). It can easily be seen that this

m<strong>in</strong>imization yields the same solution as m<strong>in</strong>imiz<strong>in</strong>g the mutual <strong>in</strong>formation.<br />

For more details on such an algorithm we refer to [10, 13, 16].<br />
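A rough sketch of how such a border criterion could be computed (illustrative only; the choice of quadrangle vertices and the grid-based volume estimate are simplifying assumptions, not the procedure of [10, 13, 16]):

import numpy as np
from matplotlib.path import Path

# Border-based separation measure (sketch). Assumptions not from the paper:
# the quadrangle is spanned by the four extreme points of the sample cloud,
# and the symmetric-difference volume is estimated on an occupancy grid.
def quadrangularity_error(Y, grid=100):
    dirs = np.array([[1.0, 1], [1, -1], [-1, -1], [-1, 1]])
    corners = Y[:, np.argmax(dirs @ Y, axis=1)].T          # 4 extreme sample points
    quad = Path(corners)

    counts, xe, ye = np.histogram2d(Y[0], Y[1], bins=grid)
    cx, cy = (xe[:-1] + xe[1:]) / 2, (ye[:-1] + ye[1:]) / 2
    centers = np.array([[x, y] for x in cx for y in cy])

    occupied = counts.ravel() > 0                          # crude support estimate
    inside = quad.contains_points(centers)
    cell_area = (xe[1] - xe[0]) * (ye[1] - ye[0])
    return np.sum(occupied ^ inside) * cell_area           # volume of the difference set

Minimizing this error over the unmixing parameters would give a border-based alternative to the mutual-information criterion above.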

6 Conclusion<br />

We have presented a new separability result for postnonl<strong>in</strong>ear bounded mixtures<br />

that is based on the analysis of the borders of the mixture density. We<br />

hereby formalize and extend ideas already presented <strong>in</strong> [10]. We <strong>in</strong>troduce the<br />

notion of absolutely degenerate mix<strong>in</strong>g matrices. Us<strong>in</strong>g this we identify the restrictions<br />

of separability and also of algorithms that only use border analysis<br />

for postnonl<strong>in</strong>earity detection. This also represents a drawback of the algorithms<br />

proposed <strong>in</strong> [10] and [13], to which we want to refer for experimental<br />

results us<strong>in</strong>g border detection <strong>in</strong> postnonl<strong>in</strong>ear sett<strong>in</strong>gs.<br />

In future work we will show how to relax the condition of analytic postnonlinearities

to only cont<strong>in</strong>uously differentiable functions; also prelim<strong>in</strong>ary results<br />

<strong>in</strong>dicate how to generalize these results to complex-valued random vectors and<br />

mixtures. We further plan to extend this model to the case of group ICA [18],<br />

where <strong>in</strong>dependence is only assumed among groups of sources. In the l<strong>in</strong>ear<br />

case, this has been done <strong>in</strong> [15] — however the extension to postnonl<strong>in</strong>early<br />

mixed sources is yet unclear.<br />

Acknowledgements<br />

We thank the anonymous reviewers for their valuable suggestions, which improved<br />

the orig<strong>in</strong>al manuscript. FT further would like to thank Christian Jutten<br />

for the helpful discussions dur<strong>in</strong>g the preparation of this paper. F<strong>in</strong>ancial<br />

support by the BMBF <strong>in</strong> the project ’ModKog’ is gratefully acknowledged.<br />

References<br />

[1] J. Hérault, C. Jutten, Space or time adaptive signal process<strong>in</strong>g by neural<br />

network models, <strong>in</strong>: J. Denker (Ed.), Neural Networks for Comput<strong>in</strong>g.<br />


Proceed<strong>in</strong>gs of the AIP Conference, American Institute of Physics, New York,<br />

1986, pp. 206–211.<br />

[2] J.-F. Cardoso, A. Souloumiac, Bl<strong>in</strong>d beamform<strong>in</strong>g for non-gaussian signals,<br />

IEEE Proceed<strong>in</strong>gs - Part F 140 (1993) 362–370.<br />

[3] P. Comon, <strong>Independent</strong> component analysis - a new concept?, Signal Process<strong>in</strong>g<br />

36 (1994) 287–314.<br />

[4] A. Bell, T. Sejnowski, An <strong>in</strong>formation-maximisation approach to bl<strong>in</strong>d<br />

separation and bl<strong>in</strong>d deconvolution, Neural Computation 7 (1995) 1129–1159.<br />

[5] A. Hyvär<strong>in</strong>en, E. Oja, A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent component<br />

analysis, Neural Computation 9 (1997) 1483–1492.<br />

[6] F. Theis, A. Jung, C. Puntonet, E. Lang, L<strong>in</strong>ear geometric ICA: Fundamentals<br />

and algorithms, Neural Computation 15 (2003) 419–439.<br />

[7] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John

Wiley & Sons, 2001.<br />

[8] A. Cichocki, S. Amari, Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g, John Wiley<br />

& Sons, 2002.<br />

[9] A. Taleb, C. Jutten, Indeterm<strong>in</strong>acy and identifiability of bl<strong>in</strong>d identification,<br />

IEEE Transactions on Signal Process<strong>in</strong>g 47 (10) (1999) 2807–2820.<br />

[10] M. Babaie-Zadeh, C. Jutten, K. Nayebi, A geometric approach for separat<strong>in</strong>g<br />

post non-l<strong>in</strong>ear mixtures, <strong>in</strong>: Proc. of EUSIPCO ’02, Vol. II, Toulouse, France,<br />

2002, pp. 11–14.<br />

[11] S. Achard, D.-T. Pham, C. Jutten, Quadratic dependence measure for nonl<strong>in</strong>ear<br />

bl<strong>in</strong>d sources separation, <strong>in</strong>: Proc. of ICA 2003, Nara, Japan, 2003, pp. 263–268.<br />

[12] A. Ziehe, M. Kawanabe, S. Harmel<strong>in</strong>g, K.-R. Müller, Bl<strong>in</strong>d separation<br />

of post-nonl<strong>in</strong>ear mixtures us<strong>in</strong>g l<strong>in</strong>eariz<strong>in</strong>g transformations and temporal<br />

decorrelation, Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research 4 (2003) 1319–1338.<br />

[13] F. Theis, E. Lang, L<strong>in</strong>earization identification and an application to BSS us<strong>in</strong>g<br />

a SOM, <strong>in</strong>: Proc. ESANN 2004, d-side, Evere, Belgium, Bruges, Belgium, 2004,<br />

pp. 205–210.<br />

[14] F. Theis, A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation,<br />

Neural Computation 16 (2004) 1827–1850.<br />

[15] F. Theis, Uniqueness of complex and multidimensional <strong>in</strong>dependent component<br />

analysis, Signal Process<strong>in</strong>g 84 (5) (2004) 951–956.<br />

[16] M. Babaie-Zadeh, On bl<strong>in</strong>d source separation <strong>in</strong> convolutive and nonl<strong>in</strong>ear<br />

mixtures, Ph.D. thesis, Institut National Polytechnique de Grenoble (2002).<br />

[17] R. Moddemeijer, On estimation of entropy and mutual <strong>in</strong>formation of<br />

cont<strong>in</strong>uous distributions, Signal Process<strong>in</strong>g 16 (3) (1989) 233–246.<br />

[18] J. Cardoso, Multidimensional <strong>in</strong>dependent component analysis, <strong>in</strong>: Proc. of<br />

ICASSP ’98, Seattle, 1998.<br />



Chapter 5<br />

IEICE TF E87-A(9):2355-2363, 2004<br />

Paper F.J. Theis and W. Nakamura. Quadratic <strong>in</strong>dependent component analysis.<br />

IEICE Trans. Fundamentals, E87-A(9):2355-2363, 2004<br />

Reference (Theis and Nakamura, 2004)<br />

Summary <strong>in</strong> section 1.2.2<br />




IEICE TRANS. FUNDAMENTALS, VOL.E87–A, NO.9 SEPTEMBER 2004<br />

PAPER Special Section on Nonl<strong>in</strong>ear Theory and its Applications<br />

Quadratic <strong>in</strong>dependent component analysis<br />

SUMMARY The transformation of a data set us<strong>in</strong>g a<br />

second-order polynomial mapp<strong>in</strong>g to f<strong>in</strong>d statistically <strong>in</strong>dependent<br />

components is considered (quadratic <strong>in</strong>dependent component<br />

analysis or ICA). Based on overdeterm<strong>in</strong>ed l<strong>in</strong>ear ICA, an<br />

algorithm together with separability conditions are given via l<strong>in</strong>earization<br />

reduction. The l<strong>in</strong>earization is achieved us<strong>in</strong>g a higher<br />

dimensional embedd<strong>in</strong>g def<strong>in</strong>ed by the l<strong>in</strong>ear parametrization of<br />

the monomials, which can also be applied for higher-order polynomials.<br />

The paper f<strong>in</strong>ishes with simulations for artificial data<br />

and natural images.<br />

key words: nonl<strong>in</strong>ear <strong>in</strong>dependent component analysis,<br />

quadratic forms, nonl<strong>in</strong>ear bl<strong>in</strong>d source separation, overdeterm<strong>in</strong>ed<br />

bl<strong>in</strong>d source separation, natural images<br />

1. Introduction<br />

The task of transform<strong>in</strong>g a random vector <strong>in</strong>to an <strong>in</strong>dependent<br />

one is called <strong>in</strong>dependent component analysis<br />

(ICA). ICA has been well-studied <strong>in</strong> the case of l<strong>in</strong>ear<br />

transformations [3, 12].<br />

Nonl<strong>in</strong>ear demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components<br />

is an important extension of l<strong>in</strong>ear ICA and we still do<br />

not have sufficient knowledge of this problem. It is easy<br />

to see that without any restrictions the class of ICA solutions<br />

is too large to be of any practical use [13]. Hence<br />

<strong>in</strong> nonl<strong>in</strong>ear ICA, special nonl<strong>in</strong>earities are usually considered.<br />

In this paper, we treat polynomial nonl<strong>in</strong>earities,<br />

especially second-order monomials or quadratic<br />

forms. These represent a relatively simple class of nonl<strong>in</strong>earities,<br />

which can be <strong>in</strong>vestigated <strong>in</strong> detail.<br />

Several studies have employed quadratic forms as a<br />

generative process of data. Abed-Meraim et al. [1] suggested<br />

analyz<strong>in</strong>g mixtures by second-order polynomials<br />

us<strong>in</strong>g a l<strong>in</strong>earization <strong>in</strong> a similar way as we <strong>in</strong>troduce<br />

<strong>in</strong> section 3.2, but for the mixtures, which <strong>in</strong> general<br />

destroys the assumption of <strong>in</strong>dependence. Leshem [15]<br />

proposed a whiten<strong>in</strong>g scheme based on quadratic forms<br />

<strong>in</strong> order to enhance l<strong>in</strong>ear separation of time-signals<br />

<strong>in</strong> algorithms such as SOBI. Similar quadratic mix<strong>in</strong>g<br />

models are also considered <strong>in</strong> [8] and [10]. These are<br />

Manuscript received December 21, 2003.<br />

Manuscript revised April 2, 2004.<br />

F<strong>in</strong>al manuscript received May 21, 2004.<br />

† The authors are with the Lab. for Advanced Bra<strong>in</strong> Signal<br />

Process<strong>in</strong>g, Bra<strong>in</strong> Science Institute, RIKEN, Wako-shi,<br />

Saitama 351-0198 Japan.<br />

∗ On leave from the Institute of Biophysics, University<br />

of Regensburg, 93040 Regensburg, Germany.<br />

a) E-mail: fabian@theis.name<br />

b) E-mail: wakakoh@bra<strong>in</strong>.riken.jp<br />

Fabian J. THEIS †∗a) and Wakako NAKAMURA †b) , Nonmembers<br />

studies <strong>in</strong> which the mix<strong>in</strong>g model is assumed to be<br />

quadratic <strong>in</strong> contrast to the quadratic unmix<strong>in</strong>g model<br />

used <strong>in</strong> this paper.<br />

For demix<strong>in</strong>g <strong>in</strong>to <strong>in</strong>dependent components by<br />

quadratic forms, Bartsch and Obermayer [2] suggested<br />

apply<strong>in</strong>g l<strong>in</strong>ear ICA to second-order terms of data.<br />

Hashimoto [9] suggested an algorithm based on m<strong>in</strong>imization<br />

of Kullback-Leibler divergence. However, <strong>in</strong><br />

these studies, the generative model of data was not def<strong>in</strong>ed<br />

and the <strong>in</strong>terpretation of signals obta<strong>in</strong>ed by the<br />

separation was not given clearly; the focus was on the<br />

application to natural images. In this paper, we exam<strong>in</strong>e<br />

this quadratic demix<strong>in</strong>g model. We def<strong>in</strong>e both<br />

generative model and demix<strong>in</strong>g process of data explicitly<br />

to assume a one-to-one correspondence of the <strong>in</strong>dependent<br />

components with data. Us<strong>in</strong>g the analysis<br />

of overdeterm<strong>in</strong>ed l<strong>in</strong>ear ICA, we discuss identifiability<br />

of this quadratic demix<strong>in</strong>g model. We confirm that the<br />

algorithm proposed by Bartsch and Obermayer [2] can<br />

estimate the mix<strong>in</strong>g process and retrieve the <strong>in</strong>dependent<br />

components correctly by simulation with artificial<br />

data. We also apply the quadratic demix<strong>in</strong>g to natural<br />

image data.<br />

The paper is organized as follows: <strong>in</strong> the next section<br />

results about overdetermined ICA, that is ICA with more sensors than sources, are recalled and extended.

Section 3 then <strong>in</strong>troduces the quadratic ICA model and<br />

provides a separability result and an algorithm. The algorithms<br />

are then applied to artificial and natural data

sets <strong>in</strong> section 4.<br />

2. Overdeterm<strong>in</strong>ed ICA<br />

Before def<strong>in</strong><strong>in</strong>g the polynomial model, we have to study<br />

<strong>in</strong>determ<strong>in</strong>acies and algorithms of overdeterm<strong>in</strong>ed <strong>in</strong>dependent<br />

component analysis. Its goal lies <strong>in</strong> the transformation<br />

of a given random vector x to an <strong>in</strong>dependent<br />

one with lower dimension. Overdeterm<strong>in</strong>ed ICA is usually<br />

applied to solve the overdeterm<strong>in</strong>ed bl<strong>in</strong>d source<br />

separation (overdeterm<strong>in</strong>ed BSS) problem, where x is<br />

known to be a mixture of a lower number of <strong>in</strong>dependent<br />

source signals s. Overdeterm<strong>in</strong>ed ICA <strong>in</strong> the context<br />

of BSS is well-known and understood [5, 14], but<br />

the indeterminacies of the overdetermined ICA problem in terms of the unmixing matrix have, to the knowledge of the authors, not yet been analyzed.


2.1 Model<br />

The overdeterm<strong>in</strong>ed ICA model can be formulated as<br />

follows: Let x be a given m-dimensional random vector.<br />

An n × m-matrix W with m > n ≥ 2 is called<br />

overdeterm<strong>in</strong>ed ICA of x if<br />

y = Wx (1)<br />

is <strong>in</strong>dependent. In order to dist<strong>in</strong>guish between overdeterm<strong>in</strong>ed<br />

and ord<strong>in</strong>ary ICA, <strong>in</strong> the case m = n we call<br />

W a square ICA of x.<br />

Here W is not assumed to have full rank as usual.<br />

Theorem 2.1 shows that under reasonable assumptions<br />

this automatically holds.<br />

Often overdeterm<strong>in</strong>ed ICA is used to solve the<br />

overdeterm<strong>in</strong>ed BSS problem given by<br />

x = As (2)<br />

where s is an n-dimensional <strong>in</strong>dependent random vector<br />

and A a m × n matrix with m > n ≥ 2. Note that<br />

A can be assumed to have full rank (rank A = n),<br />

otherwise the system could be reduced to the case n−1:<br />

If A = (a1, . . . , an) with columns ai, <strong>in</strong> the case of<br />

rank A < n we can without loss of generality assume<br />

that an = Σ_{i=1}^{n−1} λi ai. Then

x = As = Σ_{j=1}^{n} aj sj = Σ_{j=1}^{n−1} aj (sj + λj sn),

so sn no longer appears as a separate source in the mixture in this

case, thus the model can be reduced to the case n −<br />

1. Overdeterm<strong>in</strong>ed ICAs of x are usually considered<br />

solutions to this BSS problem.<br />

Often, overdeterm<strong>in</strong>ed BSS is stated <strong>in</strong> the noisy<br />

case,<br />

x = As + ν (3)<br />

where ν is a decorrelated Gaussian ’noise’ random vector,<br />

<strong>in</strong>dependent of s. Without additional noise, the<br />

sources can be found by solv<strong>in</strong>g for example the square<br />

ICA problem, which is constructed from equation 1<br />

after dropping the last m − n observations, provided the projected mixing matrix is non-degenerate. In the presence

of noise usually projection by pr<strong>in</strong>cipal component<br />

analysis (PCA) is chosen <strong>in</strong> order to reduce this problem<br />

to the square case [14]. In the next section however,<br />

the <strong>in</strong>determ<strong>in</strong>acies of the noise-free models represented<br />

by equations 1 and 2 are studied because overdeterm<strong>in</strong>ed<br />

ICA will only be needed later after reduction of<br />

the bilinear model, and in this model we do not

allow any noise. However, the overdeterm<strong>in</strong>ed noisy<br />

model from equation 3 can easily be reduced to the<br />

noise-free model by <strong>in</strong>clud<strong>in</strong>g ν as additional sources:<br />

x = (A I) (s; ν), where (s; ν) denotes the vector s stacked on top of ν.


In this case n <strong>in</strong>creases and we could possibly deal with<br />

underdeterm<strong>in</strong>ed ICA (where extra care has to be taken<br />

with the now <strong>in</strong>creased number of Gaussians <strong>in</strong> the<br />

sources). Uniqueness and separability results for this<br />

case are given <strong>in</strong> [7], which shows that the follow<strong>in</strong>g<br />

theorems also hold <strong>in</strong> this noisy ICA model.<br />

2.2 Indeterminacies

The following theorem presents the indeterminacy of the unmixing matrix in the case of overdetermined mixtures, with the slight generalization that this unmixing matrix does not necessarily have to be of full rank. Later in this section we show that it is necessary to assume that the observed data set x is indeed a mixture.

Theorem 2.1 (Indeterminacies of overdetermined ICA). Let m ≥ n ≥ 2. Let x = As as in the model of equation 2, and let the n × m matrix W be an overdetermined or square ICA of x such that at most one component of Wx is deterministic. Furthermore assume one of the following:

i. s has at most one Gaussian component and the variances of s exist.

ii. s has no Gaussian component.

Then there exist a permutation matrix P and an invertible scaling matrix L with

W = LP(A^⊤A)^{-1}A^⊤ + C    (4)

where C is an n × m matrix with rows lying in the kernel of A^⊤, that is, with CA = 0. The converse also holds, i.e. if W fulfills equation 4 then Wx is independent.

A less general form of this indeterminacy has been given by Joho et al. [14]. In the square case, the above theorem shows that it is not necessary to assume that the mixing and especially the demixing matrix have full rank if it is assumed that the transformation also has at most one deterministic component. Obviously, this assumption is not necessary if W is required to have full rank.

Since rank A = n, the pseudo inverse (Moore-Penrose inverse) A^+ of A has the explicit form A^+ = (A^⊤A)^{-1}A^⊤. Note that A^+A = I. So from equation 4 we get WA = LP. We remark as a corollary to the above theorem that overdetermined ICA is separable, which means that the sources are uniquely (except for scaling and permutation) reconstructible, because for the approximated sources y we get y = Wx = WAs = LPs. This of course is well known, because overdetermined BSS can be reduced to square BSS (m = n) by projection; yet the indeterminacies of the demixing matrix, which are simple permutation and scaling in the square case, are not so obvious for overdetermined BSS.
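To make the indeterminacy in equation 4 concrete, the following short numerical check (a Python/NumPy sketch of ours, not part of the original experiments) draws a random full-rank A, builds W = A^+ + C with the rows of C taken from the kernel of A^⊤, and verifies that WA = I, so that Wx = s for every mixture x = As:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                        # more observations than sources

A = rng.standard_normal((m, n))    # almost surely of full rank n
A_pinv = np.linalg.pinv(A)         # Moore-Penrose inverse (A^T A)^{-1} A^T

# Build C with rows in the kernel of A^T, i.e. C A = 0: project random rows
# onto the orthogonal complement of the column space of A.
P_perp = np.eye(m) - A @ A_pinv    # projector onto ker(A^T)
C = rng.standard_normal((n, m)) @ P_perp

W = A_pinv + C                     # any such W is an overdetermined ICA of x = As
print(np.allclose(W @ A, np.eye(n)))   # True: W recovers the sources exactly
```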

Proof of theorem 2.1. Consider B := WA and y := Bs. Let b_1, . . . , b_n ∈ R^n denote the (transposed) rows of B. Then



y_i = b_i^⊤ s.    (5)

We will show that we can assume B to be invertible. If B is not invertible, without loss of generality let b_n = \sum_{i=1}^{n-1} \lambda_i b_i with coefficients \lambda_i ∈ R. Then at least one \lambda_i ≠ 0, otherwise y_n = 0 would be deterministic. Without loss of generality let \lambda_1 = 1. From equation 5 we then get y_n = \sum_{i=1}^{n-1} \lambda_i y_i = y_1 + u with u := \sum_{i=2}^{n-1} \lambda_i y_i independent of y_1. Application of the Darmois-Skitovitch theorem [6, 16] to the two equations

y_1 = y_1
y_n = y_1 + u

shows that y_1 is Gaussian or deterministic. Hence all y_i with \lambda_i ≠ 0 are Gaussian or deterministic. So we may assume that y_1, y_n and u are square integrable. Without loss of generality let these random variables be centered. Then, since y_1 and y_n are independent and u is independent of y_1, the cross-covariance of (y_1, y_n) can be calculated as follows:

0 = E(y_1 y_n^⊤) = E(y_1 y_1^⊤) + E(y_1 u^⊤) = var y_1,

so y_1 is deterministic. Hence all y_i with \lambda_i ≠ 0 are deterministic, and then so is y_n. So at least two components of y = Wx are deterministic, in contradiction to the assumption. Therefore B is invertible.

Using assumptions i) or ii) and the well-known uniqueness result of square linear BSS, a corollary [5, 7] of the Darmois-Skitovitch theorem (see [17] for a proof without using this theorem), there exist a permutation matrix P and an invertible scaling matrix L with WA = LP. The properties of the pseudo inverse then show that the equation

(LP)^{-1} WA = I

in W has the solutions W = LP(A^+ + C') with C'A = 0, or W = LPA^+ + C again with CA = 0.

The pseudo inverse is the unique solution of WA = I with minimum Frobenius norm. In this case C = 0; so if W is an ICA of x with minimal Frobenius norm, then W already equals A^+ except for scaling and permutation. This can be used for normalization. Using the singular value decomposition of A, this has been shown and extended to the noisy case in [14].

Theorem 2.1 states that in the case where x is the mixture of n sources,

(Diagram: s is mapped by A to x, and x is mapped by two unmixings W and W' to y and y'; the composition W'A maps s to y'.)

ignoring the always present scaling and permutation (then y = y' = s), we have an n(m − n)-dimensional affine vector space of ICAs of x, i.e. (W − W')A = 0.

However, in the case of arbitrary x, the ICAs of x can be quite unrelated. Indeed, in the diagram

(Diagram: x is mapped by W to y and by W' to y'; does there exist a B with y' = By?)

there does not always exist a B that makes this diagram commute (W'x = BWx), as for example the case m ≥ 2n with W the projection along the first n coordinates and W' the projection along the last n coordinates shows. In this case the square ICA argument in the proof of theorem 2.1 cannot be applied, and W and W' do not necessarily have any relation. This of course is not true in the case m = n, where uniqueness (but not existence) can also be shown without explicitly knowing that x is a mixture, simply by inverting W. This could be extended to the case n ≤ m < 2n to construct 2n − m equations relating y and y'.

2.3 Algorithm

The usual algorithm for finding an overdetermined ICA is to first project x along its main principal components onto an n-dimensional random vector and to then apply square linear ICA [14, 19]. In [19], the question of where to place this projection stage (before or after application of ICA) is posed and answered somewhat heuristically. Here, a simple proof is given that in the overdetermined BSS case, any possible ICA matrix factorizes over (almost) any projection, so projecting first and then applying ICA to recover the sources is always possible.

Theorem 2.2. Let x = As as in the model of equation 2 such that s satisfies one of the conditions in theorem 2.1. Furthermore let W be an overdetermined ICA of x. Then for almost all (in the measure sense) n × m matrices V there exists a square ICA B of Vx such that Wx = BVx.

Proof. Let V be an n × m matrix such that VA is invertible; this condition is open and its complement is the zero set of the polynomial V ↦ det(VA), so almost all matrices are of this type. Then there exists a square ICA, say B', of y := Vx. So B'V is an overdetermined ICA of x. Applying separability of overdetermined ICA (corollary of theorem 2.1) then proves that Wx = LPB'Vx for a permutation P and a scaling L. Setting B := LPB' shows the claim.

In diagram form, this means

(Diagram: s is mapped by A to x; x is mapped by V to y and by W to y'; the square ICA B maps y to y'.)

where y' := Wx and y := Vx.

In applications, V is usually chosen to be the projection along the first principal components in order to reduce noise [14]. In this case it is easy to see that VA is indeed invertible, as needed in the theorem.
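As an illustration of theorem 2.2 and of the algorithm above, the following sketch (Python/NumPy with scikit-learn's FastICA; these concrete routines are our choice and are not prescribed by the text) first projects the m-dimensional mixtures onto n dimensions by PCA and then applies square ICA. By the theorem, a random projection would serve equally well in the noise-free case.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(1)
n, m, T = 3, 8, 5000                        # sources, mixtures, samples

S = rng.uniform(-1, 1, size=(T, n))         # independent non-Gaussian sources
A = rng.uniform(-1, 1, size=(m, n))         # overdetermined mixing matrix
X = S @ A.T                                 # observed mixtures, shape (T, m)

Z = PCA(n_components=n).fit_transform(X)    # project onto the n main principal components
Y = FastICA(n_components=n, random_state=0).fit_transform(Z)   # square ICA of the projection
# Y recovers S up to permutation, scaling and sign.
```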

3. Quadratic ICA

The model of quadratic ICA is introduced, and separability and algorithms are studied in this context.

3.1 Model

Let x be an m-dimensional random vector. Consider the following quadratic or bilinear unmixing model

y := g(x, x)    (6)

for a fixed bilinear mapping g : R^m × R^m → R^n. The components of the bilinear mapping are quadratic forms, which can be parameterized by symmetric matrices. So the above is equivalent to

y_i := x^⊤ G^{(i)} x    (7)

for symmetric matrices G^{(i)} and i = 1, . . . , n. If G^{(i)}_{kl} are the coefficients of G^{(i)}, this means

y_i = \sum_{k=1}^{m} \sum_{l=1}^{m} G^{(i)}_{kl} x_k x_l

for i = 1, . . . , n.

A special case of this model can be constructed as follows: Decompose the symmetric coefficient matrices into

G^{(i)} = E^{(i)⊤} Λ^{(i)} E^{(i)},

where E^{(i)} is orthogonal and Λ^{(i)} diagonal. In order to explicitly invert the above model (after restriction to a subset for invertibility), we now assume that these coordinate changes E^{(i)} are all the same, i.e. E = E^{(i)} for i = 1, . . . , n. Then

y_i = (Ex)^⊤ Λ^{(i)} (Ex) = \sum_{k=1}^{m} Λ^{(i)}_{kk} (Ex)_k^2    (8)

where Λ^{(i)}_{kk} are the coefficients on the diagonal of Λ^{(i)}.

Setting Λ := (Λ^{(i)}_{kk})_{i=1,...,n; k=1,...,m}, i.e. the matrix whose i-th row collects the diagonal entries of Λ^{(i)}, yields the two-layered unmixing model

y = Λ ∘ (.)² ∘ E ∘ x,    (9)

where (.)² is to be read as the componentwise square of each element. This can be interpreted as a two-layered feed-forward neural network, see figure 1.

Fig. 1 Simplified quadratic unmixing model y = Λ ∘ (.)² ∘ E ∘ x.

The advantage of the restricted model from equation 8 is that it can now easily be inverted explicitly. We assume that Λ is invertible and that Ex only takes values in one quadrant; otherwise the model cannot be inverted. Without loss of generality let this be the first quadrant, that is, assume (Ex)_i ≥ 0 for i = 1, . . . , m. Then model 9 is invertible, and the corresponding inverse model (mixing model) can easily be expressed as

x = E^{-1} ∘ √ ∘ Λ^{-1} ∘ s    (10)

with E^{-1} = E^⊤. Here we write s for the domain of the model in order to distinguish it from the recoveries y given by the unmixing model; ideally, the two coincide. The inverse model is shown in figure 2.

Fig. 2 Simplified square root mixing model x = E^{-1} ∘ √ ∘ Λ^{-1} ∘ s.
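For concreteness, here is a minimal sketch (Python/NumPy, our own illustration rather than code from the paper) that draws nonnegative sources, generates mixtures according to the square root mixing model of equation 10, and checks that the unmixing model of equation 9 returns the sources:

```python
import numpy as np

rng = np.random.default_rng(2)
n = m = 2
T = 10000

E, _ = np.linalg.qr(rng.standard_normal((m, m)))    # orthogonal coordinate change, E^{-1} = E^T
Lam_inv = rng.uniform(0.1, 1.0, size=(n, n))        # Lambda^{-1} with positive entries
Lam = np.linalg.inv(Lam_inv)                        # hence Lambda is invertible

S = rng.uniform(0.0, 1.0, size=(T, n))              # nonnegative independent sources

# Mixing model (10): x = E^{-1} sqrt( Lambda^{-1} s ), applied sample-wise (rows are samples)
X = np.sqrt(S @ Lam_inv.T) @ E                      # right-multiplying by E applies E^{-1} = E^T

# Unmixing model (9): y = Lambda ( (E x)^2 ) recovers the sources
Y = ((X @ E.T) ** 2) @ Lam.T
print(np.allclose(Y, S))                            # True up to numerical precision
```

The positivity of Λ^{-1} and of the sources guarantees (Ex)_i ≥ 0, mirroring the invertibility condition above.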

3.2 Separability

Constructing a new random vector by arranging the monomials in model 6 in the lexicographical order of (i, j), i ≤ j, lets us reduce the quadratic ICA problem to a (usually) overdetermined linear ICA problem. This idea will be used to analyze the indeterminacies of quadratic ICA in the following.

Consider the map

ζ : R^m → R^d,  x ↦ (x_1^2, 2x_1x_2, . . . , 2x_1x_m, x_2^2, 2x_2x_3, . . . , x_m^2)

with d = m(m+1)/2. With ζ we can rewrite the quadratic mixing model from equation 6 in the form of a linear model (linearization)

y = W_g ζ(x)    (11)

where the matrix W_g is constructed using the coefficient matrices of the quadratic form g as follows:

W_g = \begin{pmatrix}
G^{(1)}_{11} & G^{(1)}_{12} & \cdots & G^{(1)}_{1m} & G^{(1)}_{22} & \cdots & G^{(1)}_{mm} \\
G^{(2)}_{11} & G^{(2)}_{12} & \cdots & G^{(2)}_{1m} & G^{(2)}_{22} & \cdots & G^{(2)}_{mm} \\
\vdots & & & & & \ddots & \vdots \\
G^{(n)}_{11} & G^{(n)}_{12} & \cdots & G^{(n)}_{1m} & G^{(n)}_{22} & \cdots & G^{(n)}_{mm}
\end{pmatrix}.

This transforms the nonlinear ICA problem into a higher-dimensional linear one.
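The embedding ζ is straightforward to implement; the following sketch (Python/NumPy, an illustration under the conventions above) builds ζ(x) sample-wise in the lexicographical order (1,1), (1,2), . . . , (1,m), (2,2), . . . , (m,m), with the off-diagonal monomials doubled:

```python
import numpy as np

def zeta(X):
    """Quadratic embedding of samples X with shape (T, m) into R^d, d = m(m+1)/2."""
    T, m = X.shape
    cols = []
    for i in range(m):
        cols.append(X[:, i] ** 2)                  # monomial x_i^2
        for j in range(i + 1, m):
            cols.append(2 * X[:, i] * X[:, j])     # monomial 2 x_i x_j
    return np.column_stack(cols)                   # shape (T, m*(m+1)//2)
```

With this embedding, y_i = x^⊤ G^{(i)} x is exactly the inner product of ζ(x) with the i-th row of W_g.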

Assuming that x has been mixed by the inverse of a restriction of a bilinear mapping, we can apply theorem 2.1 to show that W_g is unique except for scaling, permutation and the addition of rows with transpose in ker A^⊤. Hence, the coefficient matrices G^{(i)} of g (corresponding to the rows of W_g) are uniquely determined except for the above.

3.3 Algorithm

For now assume m = n ≥ 2. In this case d > n, so the linearized problem from equation 11 consists in finding an overdetermined ICA of ζ(x). See section 2.3 for comments on such an algorithm.

In the more general case m ≠ n, situations with d = n and d < n are also possible. For example in the case d = n (for instance in a quadratic mixture of 3 sources to 2 observations), the separating quadratic form is unique except for scaling and permutation. However, in simulations such settings were numerically very unstable, so in the following we will only consider the case of an equal number of sensors and sources.
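Putting the pieces together, a quadratic ICA of x can be sketched as follows (Python; this reuses the zeta helper from section 3.2 and takes scikit-learn's PCA and FastICA as one possible choice of projection and square ICA consistent with sections 2.3 and 4.1, not as the paper's exact implementation):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def quadratic_ica(X, n):
    """Sketch of quadratic ICA: linearize with zeta, project, apply square ICA.

    X: mixtures of shape (T, m); n: number of sources.
    Returns recovered signals Y (T, n) and quadratic-form matrices G (n, m, m).
    Assumes the zeta() helper defined in the sketch of section 3.2."""
    T, m = X.shape
    Z = zeta(X)                                    # (T, d) with d = m(m+1)/2
    pca = PCA(n_components=n).fit(Z)
    ica = FastICA(n_components=n, random_state=0).fit(pca.transform(Z))

    # Compose the two linear maps acting on zeta(x): W_g = W_ica * V_pca
    W = ica.components_ @ pca.components_          # (n, d), rows parameterize quadratic forms

    # Rebuild symmetric coefficient matrices G^(i) from each row of W_g
    G = np.zeros((n, m, m))
    for r in range(n):
        idx = 0
        for i in range(m):
            G[r, i, i] = W[r, idx]; idx += 1
            for j in range(i + 1, m):
                G[r, i, j] = G[r, j, i] = W[r, idx]; idx += 1
    # y_i = x^T G^(i) x; because PCA/FastICA center the data internally, these values
    # match the fitted ICA outputs only up to an additive constant per component.
    Y = np.einsum('ti,rij,tj->tr', X, G, X)
    return Y, G
```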

3.4 Semi-online dimension reduction

Here, we will give an algorithm for performing memory-conserving overdetermined ICA. This is important when embedding the data using ζ. For example, in the simulation section 4.3, image data is considered with 8 × 8 image patches, so n = m = 64. In this case d = 2080, hence even an ordinary 'MATLAB-style' covariance calculation can become a memory problem, because the samples of ζ(x) are too large to keep in memory.

For overdetermined ICA, a projection from the typically high-dimensional mixture space to the lower-dimensional feature space has to be constructed, usually using PCA. Once the high-dimensional covariance matrix Cov(ζ(x)) is known, an eigenvalue decomposition and sample-wise projection give the desired low-dimensional feature data.

One way to calculate Cov(ζ(x)) and project is to use an online PCA algorithm such as [4], where the samples of ζ(x) are also calculated online from the known samples of x. Alternatively, batch estimation of the coefficients of Cov(ζ(x)) using these online samples is possible. In practice, these methods suffer from computational problems such as slow MATLAB performance in iterations. We therefore employed a block-wise standard covariance calculation, which improved performance by a factor of roughly 20.

The idea is to split up the large covariance matrix Cov(ζ(x)) into smaller blocks, for which the corresponding components of ζ(x) still fit into memory. Denote by C := Cov(ζ(x)) the large d × d covariance matrix. Let 1 ≤ r ≪ d be fixed. C can be decomposed into q := ⌊d/r⌋ + 1 blocks of (maximal) size r per row and column simply as follows:

C = \begin{pmatrix} C^{(1,1)} & \cdots & C^{(1,q)} \\ \vdots & \ddots & \vdots \\ C^{(q,1)} & \cdots & C^{(q,q)} \end{pmatrix}

where C^{(k,k')} is the cross-correlation between the r-dimensional vectors (ζ_i(x))_{i=(k-1)r}^{kr-1} and (ζ_i(x))_{i=(k'-1)r}^{k'r-1} (possibly truncated if k or k' equals q). These two vectors, and hence C^{(k,k')}, can now easily be calculated using the definition of ζ. The advantage lies in the fact that these two vectors are of much smaller dimension than d and therefore fit into memory for sufficiently small r. Of course the computational cost increases, as ζ(x) has to be calculated multiple times.
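A block-wise computation along these lines might look as follows (Python/NumPy sketch under our conventions; the original implementation was in MATLAB and is not reproduced here). It fills C = Cov(ζ(x)) block by block, recomputing only the needed columns of the embedding for each block pair:

```python
import numpy as np

def zeta_block(X, idx):
    """Selected columns of zeta(X) (0-based indices in the lexicographical order)."""
    T, m = X.shape
    pairs = [(i, j) for i in range(m) for j in range(i, m)]   # (1,1),(1,2),...,(m,m)
    cols = []
    for k in idx:
        i, j = pairs[k]
        cols.append(X[:, i] ** 2 if i == j else 2 * X[:, i] * X[:, j])
    return np.column_stack(cols)

def blockwise_cov(X, r=200):
    """Covariance of zeta(X), computed in r x r blocks to limit memory use."""
    T, m = X.shape
    d = m * (m + 1) // 2
    C = np.zeros((d, d))
    blocks = [np.arange(a, min(a + r, d)) for a in range(0, d, r)]
    for bi in blocks:
        Zi = zeta_block(X, bi)
        Zi -= Zi.mean(axis=0)
        for bj in blocks:
            Zj = zeta_block(X, bj)
            Zj -= Zj.mean(axis=0)
            C[np.ix_(bi, bj)] = Zi.T @ Zj / (T - 1)
    return C   # only the d x d covariance is built, never the full (T, d) sample matrix at once
```

By symmetry of C, roughly half of the block computations could additionally be skipped.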

In order to decompose Cov(ζ(x)) into blocks of equal size, a mapping between the lexicographical order in ζ(x) and the elements of x is needed, which we state for convenience: the index (i, j) with i ≤ j, that is, the monomial x_i x_j, corresponds to the position

k(i, j) = -\tfrac{1}{2} i^2 + (n + \tfrac{1}{2}) i + j - n

in ζ(x). Vice versa, the k-th entry of ζ(x) corresponds to the monomial with multi-index (i, j) given by

i(k) = \lfloor n + \tfrac{1}{2} (3 - \sqrt{4n^2 + 4n + 9 - 8k}) \rfloor  and  j(k) = k + \tfrac{1}{2} i(k)^2 - (n + \tfrac{1}{2}) i(k) + n.
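Since such closed-form index formulas are easy to get wrong, a small check like the following (Python, our own sanity test of the formulas as stated above) can be used to verify them against the explicit lexicographical enumeration for a given n:

```python
import math

def k_of(i, j, n):
    """Position (1-based) of the monomial x_i x_j, i <= j, in zeta(x)."""
    return int(-0.5 * i * i + (n + 0.5) * i + j - n)

def ij_of(k, n):
    """Inverse mapping: multi-index (i, j) of the k-th entry of zeta(x)."""
    i = math.floor(n + 0.5 * (3 - math.sqrt(4 * n * n + 4 * n + 9 - 8 * k)))
    j = k + (i * i - (2 * n + 1) * i) // 2 + n
    return i, j

n = 5
pairs = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
assert all(k_of(i, j, n) == k for k, (i, j) in enumerate(pairs, start=1))
assert all(ij_of(k, n) == pairs[k - 1] for k in range(1, len(pairs) + 1))
```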

3.5 Extension to polynomial ICA

Instead of using only second order monomials, we can of course allow any degree in the monomials in order to approximate more complex nonlinearities. In addition, first order monomials guarantee that the linear ICA case is also included in the model. A suitable higher-dimensional embedding ζ using more monomials generalizes the results of the quadratic case to the polynomial case.

Fig. 3 Mean SNR and standard deviation of the recovered sources with respect to the original ones when applying overdetermined ICA to As after random projection using B, for varying source dimension n and mixture dimension m. Here s is a uniform random vector in [−1, 1]^n, and A and B are (m × n)- respectively (n × m)-matrices with uniformly distributed coefficients in [−1, 1]. The mean was taken over 1000 runs.

4. Simulation results

In this section, computer simulations are performed to show the feasibility of the presented algorithms.

4.1 Overdetermined BSS

In order to confirm our theoretical results from section 2.2, we perform batch runs of overdetermined ICA applied to randomly generated data after random projection to the known source dimension. Square ICA was performed using the FastICA algorithm [11]. As parameters to the algorithm we used g(s) := tanh(s) as the nonlinearity in the approximation of the negentropy estimator (respectively its derivative), and stabilization was turned on, meaning that the step size was not kept fixed but could be changed adaptively (halved if the algorithm got stuck between two points). The simulation, figure 3, not surprisingly confirms that the ICA algorithm performs well independently of the chosen projection and the mixture dimension. In the presented no-noise case, projecting along directions of largest variance using PCA instead of the random projections will not improve performance (in accordance with theorem 2.2). However, in the case of white noise, PCA will provide better recoveries for larger m [14].
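A compact way to reproduce this kind of experiment (Python/NumPy with scikit-learn's FastICA standing in for the MATLAB implementation used here; the SNR helper and parameter choices are ours) is sketched below: mix uniform sources with a random m × n matrix, project with a random n × m matrix, unmix, and measure the SNR after normalization and sign matching.

```python
import numpy as np
from sklearn.decomposition import FastICA

def snr_db(s, y):
    """SNR between a source s and a recovery y after normalization and sign matching."""
    s = (s - s.mean()) / s.std()
    y = (y - y.mean()) / y.std()
    y = y if np.dot(s, y) >= 0 else -y
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - y) ** 2))

rng = np.random.default_rng(3)
n, m, T = 3, 20, 10000
S = rng.uniform(-1, 1, (T, n))
A = rng.uniform(-1, 1, (m, n))
B = rng.uniform(-1, 1, (n, m))                  # random projection back to n dimensions
X = S @ A.T                                     # overdetermined mixtures

# 'logcosh' corresponds to the tanh nonlinearity g(s) = tanh(s)
Y = FastICA(n_components=n, fun='logcosh', random_state=0).fit_transform(X @ B.T)

# match each source with its best-fitting recovered component
print([max(snr_db(S[:, i], Y[:, j]) for j in range(n)) for i in range(n)])
```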

4.2 Quadratic ICA — artificially generated data

In our first example we consider the simplified mixing-unmixing model from equation 9 in the case m = n = 2 with the randomly generated matrices

E = \begin{pmatrix} 0.91 & 1.2 \\ 1.8 & -0.72 \end{pmatrix}  and  Λ = \begin{pmatrix} 8 & -7.7 \\ -5.1 & 6.7 \end{pmatrix}.

Fig. 4 Example 1: The two square-root mixing functions x_1 and x_2 are plotted.

Λ was chosen such that Λ^{-1} has only positive coefficients; this and the fact that the sources are positive ensured the invertibility of the nonlinear transformation (this is equivalent to (Ex)_i ≥ 0). Also note that we did not require E to be orthogonal in this example. The two-dimensional sources s are shown in figure 5 together with a scatter plot, i.e. a plot of the samples, in order to show the density. The mixtures x := E^{-1}√(Λ^{-1}s) are also plotted in the same figure; the nonlinearity is quite visible. Figure 4 gives a plot of the two nonlinear mixing functions.

Application of the described algorithm yields the quadratic forms

y_1 = 29x_1^2 − 57x_1x_2 − 21x_2^2
y_2 = −28x_1^2 + 40x_1x_2 + 17x_2^2.

The recovered signals y are given in the right column of figure 5; a cross scatter plot with the sources is shown in figure 6 for comparison. The signal-to-noise ratios between the two are 44 and 43 dB after normalization to zero mean and unit variance and possible sign multiplication, which confirms the high separation quality. In order to demonstrate the nonlinearity of the problem, figure 7 shows that linear ICA does not perform well when applied to the mixtures.

In order to obtain quantitative results in more than only one experiment, we apply the algorithm to mixtures with varying numbers of samples and dimensions. We consider the cases of equal source and mixture dimension m = n = 2, 3, 4. Figure 8 shows the algorithm performance for an increasing number of samples. In the mean, quadratic ICA always outperforms linear ICA, but has a higher standard deviation. Problems in recovering the sources were noticed to occur when the condition number of Λ^{-1} was very high. By leaving out these cases, the performance of quadratic ICA rises noticeably in comparison to linear ICA.


Fig. 5 Example 1: A two-dimensional nonlinear mixture using mixing model 10 (see figure 2) is separated. The left column shows the two independent source signals together with a signal scatter plot (with only every 5th sample plotted), which depicts the source probability density. The middle column shows the two clearly nonlinearly mixed signals and their scatter plot. In the right column the two signals separated by quadratic ICA are depicted. The scatter plot again confirms their independence.

Fig. 6 Example 1: A comparison scatter plot of the source and recovered source signals (figure 5) is given, i.e. if y denotes the recovered sources, then the top left figure is a scatter plot of (s_1, y_1), the top right a plot of (s_2, y_1) and the bottom right of (s_2, y_2). The SNRs are 44 and 43 dB between the first respectively second signals after normalization to zero mean and unit variance and possible sign multiplication.

Fig. 7 Example 1, linear ICA: Applying linear ICA to the mixtures from figure 5 yields the recovered signals shown above (top). The scatter plot (bottom) clearly confirms that the recovery was bad; not surprisingly, no independence could be achieved. Comparison with the original sources shows that the recovery itself does not correspond well to the sources: the maximal SNRs were 7.5 and 7.7 dB after normalization.

Fig. 8 Quadratic ICA performance versus sample set size in dimensions 2 (top), 3 (middle) and 4 (bottom); each panel compares bilinear (quadratic) ICA with linear ICA. The mean was taken over 1000 runs with Λ^{-1} and E^{-1} having uniformly distributed coefficients in [0, 1] respectively [−1, 1]. E^{-1} was orthogonalized (an orthogonal basis was constructed out of the columns of E^{-1}), because in experiments the algorithm in higher dimensions turned out to be quite sensitive to a high condition number of E^{-1}; so E^{-1} = E^⊤ in accordance with the model from equation 9. The sources were uniformly distributed in [0, 1]^n.

4.3 Quadratic ICA — image data

Finally we deal with a real-world example, namely the analysis of natural images. We applied quadratic ICA to a set of small patches collected from the database of natural images made by van Hateren [18]. We show the two dominant linear filters and the corresponding eigenvalues of each obtained quadratic form in figure 9. Most obtained quadratic forms have one or two dominant linear filters, and these linear filters are selective for local bar stimuli. All other eigenvalues are substantially smaller and lie centered around zero in a roughly Gaussian distribution, see the eigenvalue distribution in figure 9. If a quadratic form has two dominant filters, the position, orientation and spatial frequency selectivity of these two filters are very similar, and the two corresponding eigenvalues have different signs. In summary, the values of all quadratic forms correspond to squared simple cell outputs. This result is qualitatively similar to the result obtained in [2]; simple-cell-like features are more prominent in our results. Also, in section 4 of [9], the obtained quadratic forms correspond to squared simple cell outputs.

5. Conclusion

This paper treats nonlinear ICA by reducing it to the case of linear ICA. After formally stating separability results for overdetermined ICA, these results are applied to quadratic and polynomial ICA. The presented algorithm consists of linearization and the application of linear ICA. In order to apply the PCA projection also in high dimensions, a block calculation of the covariance matrix is suggested. The algorithms are then applied to artificial data sets, where they show good performance for a sufficiently high number of samples. In the application to natural image data, the characteristics of the obtained quadratic forms are qualitatively the same as the results of Bartsch and Obermayer [2], and they correspond to squared simple cell outputs. In future work, we want to further investigate a possible enhancement of the separability result as well as a more detailed extension to ICA with polynomial and maybe analytic mappings. Furthermore, bilinear ICA with additional noise (for example multiplicative noise) could be considered. Also, issues such as model and parameter selection will have to be treated if higher order models are allowed.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. FT was partially supported by the DFG in the grant 'Nonlinearity and Nonequilibrium in Condensed Matter' and the BMBF in the project 'ModKog'.


Fig. 9 Quadratic ICA of natural images. 3 · 10^5 sample pictures of size 8 × 8 were used. Plotted are the recovered maximal filters, i.e. rows of the eigenvector matrices of the quadratic form coefficient matrices (top figure). For each component the two largest filters (with respect to the absolute eigenvalues) are shown (altogether 2 · 64). Above each image the corresponding eigenvalue (multiplied by 10^3) is printed. In the bottom figure, the absolute values of the 10 largest eigenvalues of each filter are shown. Clearly, in most filters one or two eigenvalues are dominant.

References

[1] K. Abed-Meraim, A. Belouchrani, and Y. Hua. Blind identification of a linear-quadratic mixture of independent components based on joint diagonalization procedure. In Proc. of ICASSP 1996, volume 5, pages 2718–2721, Atlanta, USA, 1996.
[2] H. Bartsch and K. Obermayer. Second order statistics of natural images. Neurocomputing, 52-54:467–472, 2003.
[3] A. Cichocki and S. Amari. Adaptive blind signal and image processing. John Wiley & Sons, 2002.
[4] A. Cichocki and R. Unbehauen. Robust estimation of principal components in real time. Electronics Letters, 29(21):1869–1870, 1993.
[5] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[6] G. Darmois. Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21:2–8, 1953.
[7] J. Eriksson and V. Koivunen. Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003, pages 23–27, 2003.
[8] P.G. Georgiev. Blind source separation of bilinearly mixed signals. In Proc. of ICA 2001, pages 328–330, San Diego, USA, 2001.
[9] W. Hashimoto. Quadratic forms in natural images. Network: Computation in Neural Systems, 14:765–788, 2003.
[10] S. Hosseini and Y. Deville. Blind separation of linear-quadratic mixtures of real sources using a recurrent structure. Lecture Notes in Computer Science, 2687:241–248, 2003.
[11] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[12] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[13] A. Hyvärinen and P. Pajunen. On existence and uniqueness of solutions in nonlinear independent component analysis. Proceedings of the 1998 IEEE International Joint Conference on Neural Networks (IJCNN'98), 2:1350–1355, 1998.
[14] M. Joho, H. Mathis, and R.H. Lambert. Overdetermined blind source separation: using more sensors than source signals in a noisy mixture. In Proc. of ICA 2000, pages 81–86, Helsinki, Finland, 2000.
[15] A. Leshem. Source separation using bilinear forms. In Proc. of the 8th Int. Conference on Higher-Order Statistical Signal Processing, 1999.
[16] V.P. Skitovitch. On a property of the normal distribution. DAN SSSR, 89:217–219, 1953.
[17] F.J. Theis. A new concept for separability problems in blind source separation. Neural Computation, accepted, 2004.
[18] J.H. van Hateren and D.L. Ruderman. Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:2315–2320, 1998.
[19] S. Winter, H. Sawada, and S. Makino. Geometrical interpretation of the PCA subspace method for overdetermined blind source separation. In Proc. of ICA 2003, pages 775–780, Nara, Japan, 2003.



Chapter 6

LNCS 3195:726-733, 2004

Paper F.J. Theis, A. Meyer-Bäse, and E.W. Lang. Second-order blind source separation based on multi-dimensional autocovariances. In Proc. ICA 2004, volume 3195 of LNCS, pages 726-733, Granada, Spain, 2004. Springer

Reference (Theis et al., 2004e)

Summary in section 1.3.1



Second-order blind source separation based on multi-dimensional autocovariances

Fabian J. Theis^{1,2}, Anke Meyer-Baese^2, and Elmar W. Lang^1

^1 Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
^2 Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA
fabian@theis.name

Abstract. SOBI is a blind source separation algorithm based on time decorrelation. It uses multiple time autocovariance matrices and performs joint diagonalization, thus being more robust than previous time decorrelation algorithms such as AMUSE. We propose an extension called mdSOBI by using multidimensional autocovariances, which can be calculated for data sets with multidimensional parameterizations such as images or fMRI scans. mdSOBI has the advantage of using the spatial data in all directions, whereas SOBI only uses a single direction. These findings are confirmed by simulations and an application to fMRI analysis, where mdSOBI outperforms SOBI considerably.

Blind source separation (BSS) describes the task of recovering the unknown mixing process and the underlying sources of an observed data set. Currently, many BSS algorithms assume independence of the sources (ICA), see for instance [1, 2] and references therein. In this work, we consider BSS algorithms based on time decorrelation. Such algorithms include AMUSE [3] and extensions such as SOBI [4] and the similar TDSEP [5]. These algorithms rely on the fact that the data sets have non-trivial autocorrelations. We give an extension thereof to data sets which have more than one direction in the parametrization, such as images, by replacing one-dimensional autocovariances by multi-dimensional autocovariances.

The paper is organized as follows: In section 1 we introduce the linear mixture model; the next section 2 recalls results on time decorrelation BSS algorithms. We then define multidimensional autocovariances and use them to propose mdSOBI in section 3. The paper finishes with both artificial and real-world results in section 4.

1 Linear BSS

We consider the following blind source separation (BSS) problem: Let x(t) be an (observed) stationary m-dimensional real stochastic process (with not necessarily discrete time t) and A an invertible real matrix such that

x(t) = As(t) + n(t)    (1)



where the source signals s(t) have diagonal autocovariances

R_s(τ) := E((s(t + τ) − E(s(t)))(s(t) − E(s(t)))^⊤)

for all τ, and the additive noise n(t) is modelled by a stationary, temporally and spatially white zero-mean process with variance σ². x(t) is observed, and the goal is to recover A and s(t). Having found A, s(t) can be estimated by A^{-1}x(t), which is optimal in the maximum-likelihood sense (if the density of n(t) is maximal at 0, which is the case for usual noise models such as Gaussian or Laplacian noise). So the BSS task reduces to the estimation of the mixing matrix A. Extensions of the above model include for example the complex case [4] or the allowance of different dimensions for s(t) and x(t), where the case of larger mixing dimension can easily be reduced to the presented complete case by dimension reduction, resulting in a lower noise level [6].

By centering the processes, we can assume that x(t) and hence s(t) have zero mean. The autocovariances then have the following structure:

R_x(τ) = E(x(t + τ)x(t)^⊤) = \begin{cases} A R_s(0) A^⊤ + σ² I & τ = 0 \\ A R_s(τ) A^⊤ & τ ≠ 0 \end{cases}    (2)

Clearly, A (and hence s(t)) can be determined from equation 1 only up to permutation and scaling of columns. Since we assume existing variances of x(t) and hence s(t), the scaling indeterminacy can be eliminated by the convention R_s(0) = I. In order to guarantee identifiability of A except for permutation from the above model, we have to additionally assume that there exists a delay τ such that R_s(τ) has pairwise different eigenvalues (for a generalization see [4], theorem 2). Then, using the spectral theorem, it is easy to see from equation 2 that A is determined uniquely by x(t) except for permutation.

2 AMUSE and SOBI

Equation 2 also gives an indication of how to perform BSS, i.e. how to recover A from x(t). The usual first step consists of whitening the no-noise term x̃(t) := As(t) of the observed mixtures x(t) using an invertible matrix V such that Vx̃(t) has unit covariance. V can simply be estimated from x(t) by diagonalization of the symmetric matrix R_x̃(0) = R_x(0) − σ²I, provided that the noise variance σ² is known. If more signals than sources are observed, dimension reduction can be performed in this step, and the noise level can be reduced [6].

In the following, we will therefore assume without loss of generality that x̃(t) = As(t) has unit covariance for each t. By assumption, s(t) also has unit covariance, hence I = E(As(t)s(t)^⊤A^⊤) = A R_s(0) A^⊤ = AA^⊤, so A is orthogonal. Now define the symmetrized autocovariance of x(t) as R̄_x(τ) := ½(R_x(τ) + (R_x(τ))^⊤). Equation 2 shows that the symmetrized autocovariance of x(t) also factors, and we get

R̄_x(τ) = A R̄_s(τ) A^⊤    (3)



for τ ≠ 0. By assumption R̄_s(τ) is diagonal, so equation 3 is an eigenvalue decomposition of the symmetric matrix R̄_x(τ). If we furthermore assume that R̄_x(τ), or equivalently R̄_s(τ), has n different eigenvalues, then the above decomposition, i.e. A, is uniquely determined by R̄_x(τ) except for orthogonal transformations within each eigenspace and permutation; since the eigenspaces are one-dimensional, this means A is uniquely determined by equation 3 except for permutation. In addition to this separability result, A can be recovered algorithmically by simply calculating the eigenvalue decomposition of R̄_x(τ) (AMUSE, [3]).
In practice, if the eigenvalue decomposition is problematic, a different choice<br />

of τ often resolves this problem. Nontheless, there are sources <strong>in</strong> which some<br />

components have equal autocovariances. Also, due to the fact that the autocovariance<br />

matrices are only estimated by a f<strong>in</strong>ite amount of samples, and due to<br />

possible colored noise, the autocovariance at τ could be badly estimated. A more<br />

general BSS algorithm called SOBI (second-order bl<strong>in</strong>d identification) based on<br />

time decorrelation was therefore proposed by Belouchrani et al. [4]. In addition<br />

to only diagonaliz<strong>in</strong>g a s<strong>in</strong>gle autocovariance matrix, it takes a whole set of autocovariance<br />

matrices of x(t) with vary<strong>in</strong>g time lags τ and jo<strong>in</strong>tly diagonalizes<br />

the whole set. It has been shown that <strong>in</strong>creas<strong>in</strong>g the size of this set improves<br />

SOBI performance <strong>in</strong> noisy sett<strong>in</strong>gs [1].<br />

Algorithms for perform<strong>in</strong>g jo<strong>in</strong>t diagonalization of a set of symmetric commut<strong>in</strong>g<br />

matrices <strong>in</strong>clude gradient descent on the sum of the off-diagonal terms,<br />

iterative construction of A by Givens rotation <strong>in</strong> two coord<strong>in</strong>ates [7] (used <strong>in</strong> the<br />

simulations <strong>in</strong> section 4), an iterative two-step recovery of A [8] or more recently<br />

a l<strong>in</strong>ear least-squares algorithm for diagonalization [9], where the latter two algorithms<br />

can also search for non-orthogonal matrices A. Jo<strong>in</strong>t diagonalization<br />

has been used <strong>in</strong> BSS us<strong>in</strong>g cumulant matrices [10] or time autocovariances [4,5].<br />

3 Multidimensional SOBI

The goal of this work is to improve SOBI performance for random processes with a higher-dimensional parametrization, i.e. for data sets where the random processes s and x do not depend on a single variable t, but on multiple variables (z_1, . . . , z_M). A typical example is a source data set in which each component s_i represents an image of size h × w. Then M = 2 and samples of s are given at z_1 = 1, . . . , h, z_2 = 1, . . . , w. Classically, s(z_1, z_2) is transformed to s(t) by fixing a mapping from the two-dimensional parameter set to the one-dimensional time parametrization of s(t), for example by concatenating columns or rows in the case of a finite number of samples. If the time structure of s(t) is not used, as in all classical ICA algorithms in which i.i.d. samples are assumed, this choice does not influence the result. However, in time-structure based algorithms such as AMUSE and SOBI, results can vary greatly depending on the choice of this mapping, see figure 2.

Without loss of generality we again assume centered random vectors. Then define the multidimensional covariance to be

R_s(τ_1, . . . , τ_M) := E(s(z_1 + τ_1, . . . , z_M + τ_M) s(z_1, . . . , z_M)^⊤)


where the expectation is taken over (z_1, . . . , z_M). R_s(τ_1, . . . , τ_M) can be estimated given equidistant samples by replacing random variables by sample values and expectations by sums as usual.

Fig. 1. Example of the one- and two-dimensional autocovariance coefficient of the grayscale 128 × 128 Lena image after normalization to variance 1, plotted against τ respectively |(τ_1, τ_2)| (rescaled to N).
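Such a multidimensional autocovariance can be estimated directly from image-valued data; the following sketch (Python/NumPy, our own illustration with conventions of our choosing) computes R_s(τ_1, τ_2) for a stack of images by shifting and averaging over the overlapping region:

```python
import numpy as np

def md_autocov(S, tau1, tau2):
    """Multidimensional autocovariance R_s(tau1, tau2) for nonnegative lags.

    S: array of shape (n, h, w), one image per component."""
    n, h, w = S.shape
    S = S - S.mean(axis=(1, 2), keepdims=True)      # center each component
    A = S[:, tau1:, tau2:]                          # s(z1 + tau1, z2 + tau2)
    B = S[:, :h - tau1, :w - tau2]                  # s(z1, z2), overlapping part
    A = A.reshape(n, -1)
    B = B.reshape(n, -1)
    return A @ B.T / A.shape[1]                     # (n, n) matrix of autocovariances
```

For tau1 = tau2 = 0 this reduces to the ordinary covariance of the components.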

The advantage of using multidimensional autocovariances lies in the fact that now the multidimensional structure of the data set can be used more explicitly. For example, if row concatenation is used to construct s(t) from the images, horizontal lines in the image will only give trivial contributions to the autocovariance (see the examples in figure 2 and section 4). Figure 1 shows the one- and two-dimensional autocovariance of the Lena image for varying τ respectively (τ_1, τ_2) after normalization of the image to variance 1. Clearly, the two-dimensional autocovariance does not decay as quickly with increasing radius as the one-dimensional covariance. Only at multiples of the image height is the one-dimensional autocovariance significantly high, i.e. does it capture image structure.

Our contribution consists of us<strong>in</strong>g multidimensional autocovariances for jo<strong>in</strong>t<br />

diagonalization. We replace the BSS assumption of diagonal one-dimensional autocovariances<br />

by diagonal multi-dimensional autocovariances of the sources. Note<br />

that also the multidimensional covariance satisfies the equation 2. Aga<strong>in</strong> we as-<br />

sume whitened x(z1, . . . , zK). Given a autocovariance matrix ¯ Rx<br />

�<br />

τ (1)<br />

1<br />

, . . . , τ (1)<br />

M<br />

with n different eigenvalues, multidimensional AMUSE (mdAMUSE) detects the<br />

orthogonal unmix<strong>in</strong>g mapp<strong>in</strong>g W by diagonalization of this matrix.<br />

In section 2, we discussed the advantages of us<strong>in</strong>g SOBI over AMUSE. This<br />

of course also holds <strong>in</strong> this generalized case. Hence, the multidimensional SOBI<br />

algorithm (mdSOBI ) consists of the jo<strong>in</strong>t diagonalization of a set of symmetrized<br />

multidimensional autocovariances<br />

� �<br />

¯Rx τ (1)<br />

�<br />

(1)<br />

1 , . . . , τ M , . . . , ¯ �<br />

Rx<br />

τ (K)<br />

1<br />

, . . . , τ (K)<br />

��<br />

M<br />


118 Chapter 6. LNCS 3195:726-733, 2004<br />

PSfrag replacements<br />

(a) source images<br />

crosstalk<strong>in</strong>g error E1( Â, I)<br />

4<br />

3.5<br />

3<br />

2.5<br />

2<br />

1.5<br />

1<br />

0.5<br />

SOBI based on multi-dimensional autocovariances 5<br />

SOBI<br />

SOBI transposed images<br />

mdSOBI<br />

mdSOBI transposed images<br />

0<br />

0 10 20 30 40 50 60 70<br />

K<br />

(b) performance comparison<br />

Fig. 2. Comparison of SOBI and mdSOBI when applied to (unmixed) images from (a).<br />

The plot (b) plots the number K of time lags versus the crosstalk<strong>in</strong>g error E1 of the<br />

recovered matrix  and the unit matrix I; here  has been recovered by bot SOBI<br />

and mdSOBI given the images from (a) respectively the transposed images.<br />

The joint diagonalizer then equals A except for permutation, given the generalized identifiability conditions from [4], theorem 2. Therefore, the identifiability result also does not change, see [4]. In practice, we choose the (τ1^(k), . . . , τM^(k)) with increasing modulus for increasing k, but with the restriction τ1^(k) > 0 in order to avoid using the same autocovariances on the diagonal of the matrix twice.
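For instance, one simple way to generate such lag tuples is sketched below (illustrative Python, not the authors' code; the search bound max_radius is an arbitrary assumption):

```python
import itertools
import numpy as np

def choose_lags(K, M=2, max_radius=20):
    """Pick K lag tuples (tau_1, ..., tau_M) with increasing modulus and
    tau_1 > 0, so that no symmetrized autocovariance enters the set twice."""
    candidates = [t for t in itertools.product(range(-max_radius, max_radius + 1), repeat=M)
                  if t[0] > 0]
    candidates.sort(key=lambda t: np.linalg.norm(t))   # increasing modulus
    return candidates[:K]

print(choose_lags(5))   # e.g. [(1, 0), (1, -1), (1, 1), (2, 0), (1, -2)]
```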

Often, data sets do not have any substantial long-distance autocorrelations, but quite high multi-dimensional close-distance correlations (see figure 1). When performing joint diagonalization, SOBI weighs each matrix equally strongly, which can deteriorate the performance for large K; see the simulation in section 4.

Figure 2(a) shows an example in which the images have considerable vertical structure but rather random horizontal structure. Each of the two images consists of a concatenation of stripes of two images. For visual purposes, we chose the width of the stripes to be rather large at 16 pixels. According to the previous discussion, we expect one-dimensional algorithms such as AMUSE and SOBI to perform well on the images, but badly (for a number of time lags ≫ 16) on the transposed images. If we apply AMUSE with τ = 20 to the images, we get excellent performance with a low crosstalking error of 0.084 with respect to the unit matrix; if we apply AMUSE to the transposed images, however, the error is high at 1.1. This result is further confirmed by the comparison plot in figure 2(b); mdSOBI performs equally well on the images and on the transposed images.



[Figure 3: crosstalking error E1(Â, A) versus the noise level σ (0–0.5) for SOBI and mdSOBI with K = 32 and K = 128.]

Fig. 3. SOBI and mdSOBI performance dependence on noise level σ. Plotted is the crosstalking error E1 of the recovered matrix Â with the real mixing matrix A. See text for more details.

The performance of SOBI, in contrast, strongly depends on whether column or row concatenation was used to construct a one-dimensional random process out of each image. The SOBI breakpoint of around K = 52 can be decreased by choosing smaller stripes. In future work we want to provide an analytical discussion of the performance increase when comparing SOBI and mdSOBI, similar to the performance evaluation in [4].

4 Results

Artificial mixtures. We consider the linear mixture of three images (baboon, black-haired lady and Lena) with a randomly chosen 3 × 3 matrix A. Figure 3 shows how SOBI and mdSOBI perform depending on the noise level σ. For small K, both SOBI and mdSOBI perform equally well in the low-noise case, but mdSOBI performs better in the case of stronger noise. For larger K, mdSOBI substantially outperforms SOBI, which is due to the fact that natural images do not have any substantial long-distance autocorrelations (see figure 1), whereas mdSOBI uses the non-trivial two-dimensional autocorrelations.

fMRI analysis. We analyze the performance of mdSOBI when applied to fMRI measurements. fMRI data were recorded from six subjects (3 female, 3 male, age 20–37) performing a visual task. In five subjects, five slices with 100 images (TR/TE = 3000/60 msec) were acquired with five periods of rest and five photic stimulation periods with rest.



[Figure 4: (a) the eight recovered component maps; (b) their time courses, with stimulus crosscorrelations cc = −0.08, 0.19, −0.11, −0.21, −0.43, −0.21, −0.16 and −0.86 for components 1–8.]

Fig. 4. mdSOBI fMRI analysis. The data was reduced to the first 8 principal components. (a) shows the recovered component maps (white points indicate values stronger than 3 standard deviations), and (b) their time courses. mdSOBI was performed with K = 32. Component 5 represents the inner ventricles, component 6 the frontal eye fields. Component 8 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes) — their crosscorrelation lies at cc = −0.86 — with a delay of roughly 2 seconds induced by the BOLD effect.

Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. Resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point, and a dark background with a central fixation point during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment (AIR, [11]).

BSS, ma<strong>in</strong>ly based on ICA, nowadays is a quite common tool <strong>in</strong> fMRI analysis<br />

(see for example [12]). Here, we analyze the fMRI data set us<strong>in</strong>g spatial decorrelation<br />

as separation criterion. Figure 4 shows the performance of mdSOBI; see figure<br />

text for <strong>in</strong>terpretation. Us<strong>in</strong>g only the first 8 pr<strong>in</strong>cipal components, mdSOBI<br />

could recover the stimulus component as well as detect additional components.<br />

When apply<strong>in</strong>g SOBI to the data set, it could not properly detect the stimulus<br />

component but found two components with crosscorrelations cc = −0.81 and<br />

−0.84 with the stimulus time course.<br />

5 Conclusion<br />

We have proposed an extension called mdSOBI of SOBI for data sets with multidimensional<br />

parametrizations, such as images. Our ma<strong>in</strong> contribution lies <strong>in</strong>



replac<strong>in</strong>g the one-dimensional autocovariances by multi-dimensional autocovariances.<br />

In both simulations and real-world applications mdSOBI outperforms<br />

SOBI for these multidimensional structures.<br />

In future work, we will show how to perform spatiotemporal BSS by jo<strong>in</strong>tly<br />

diagonaliz<strong>in</strong>g both spatial and time autocovariance matrices. We plan on apply<strong>in</strong>g<br />

these results to fMRI analysis, where we also want to use three-dimensional<br />

autocovariances for 3d-scans of the whole bra<strong>in</strong>.<br />

Acknowledgements<br />

The authors would like to thank Dr. Dorothee Auer from the Max Planck Institute<br />

of Psychiatry <strong>in</strong> Munich, Germany, for provid<strong>in</strong>g the fMRI data, and Oliver<br />

Lange from the Department of Cl<strong>in</strong>ical Radiology, Ludwig-Maximilian University,<br />

Munich, Germany, for data preprocess<strong>in</strong>g and visualization. FT and EL<br />

acknowledge partial f<strong>in</strong>ancial support by the BMBF <strong>in</strong> the project ’ModKog’.<br />

References<br />

1. Cichocki, A., Amari, S.: Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g. John Wiley<br />

& Sons (2002)<br />

2. Hyvär<strong>in</strong>en, A., Karhunen, J., Oja, E.: <strong>Independent</strong> component analysis. John<br />

Wiley & Sons (2001)<br />

3. Tong, L., Liu, R.W., Soon, V., Huang, Y.F.: Indeterm<strong>in</strong>acy and identifiability of<br />

bl<strong>in</strong>d identification. IEEE Transactions on Circuits and Systems 38 (1991) 499–509<br />

4. Belouchrani, A., Meraim, K.A., Cardoso, J.F., Moul<strong>in</strong>es, E.: A bl<strong>in</strong>d source separation<br />

technique based on second order statistics. IEEE Transactions on Signal<br />

Process<strong>in</strong>g 45 (1997) 434–444<br />

5. Ziehe, A., Mueller, K.R.: TDSEP – an efficient algorithm for bl<strong>in</strong>d separation us<strong>in</strong>g<br />

time structure. In Niklasson, L., Bodén, M., Ziemke, T., eds.: Proc. of ICANN’98,<br />

Skövde, Sweden, Spr<strong>in</strong>ger Verlag, Berl<strong>in</strong> (1998) 675–680<br />

6. Joho, M., Mathis, H., Lamber, R.: Overdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation: us<strong>in</strong>g<br />

more sensors than source signals <strong>in</strong> a noisy mixture. In: Proc. of ICA 2000, Hels<strong>in</strong>ki,<br />

F<strong>in</strong>land (2000) 81–86<br />

7. Cardoso, J.F., Souloumiac, A.: Jacobi angles for simultaneous diagonalization.<br />

SIAM J. Mat. Anal. Appl. 17 (1995) 161–164<br />

8. Yeredor, A.: Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Trans. Signal Processing 50 (2002) 1545–1553

9. Ziehe, A., Laskov, P., Mueller, K.R., Nolte, G.: A l<strong>in</strong>ear least-squares algorithm<br />

for jo<strong>in</strong>t diagonalization. In: Proc. of ICA 2003, Nara, Japan (2003) 469–474<br />

10. Cardoso, J.F., Souloumiac, A.: Bl<strong>in</strong>d beamform<strong>in</strong>g for non gaussian signals. IEE<br />

Proceed<strong>in</strong>gs - F 140 (1993) 362–370<br />

11. Woods, R., Cherry, S., Mazziotta, J.: Rapid automated algorithm for align<strong>in</strong>g<br />

and reslic<strong>in</strong>g pet images. Journal of Computer Assisted Tomography 16 (1992)<br />

620–633<br />

12. McKeown, M., Jung, T., Makeig, S., Brown, G., Kindermann, S., Bell, A., Sejnowski, T.: Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping 6 (1998) 160–188




Chapter 7<br />

Proc. ISCAS 2005, pages 5878-5881<br />

Paper F.J. Theis. Bl<strong>in</strong>d signal separation <strong>in</strong>to groups of dependent signals us<strong>in</strong>g<br />

jo<strong>in</strong>t block diagonalization. In Proc. ISCAS 2005, pages 5878-5881, Kobe,<br />

Japan, 2005<br />

Reference (Theis, 2005a)<br />

Summary <strong>in</strong> section 1.3.3<br />




Bl<strong>in</strong>d signal separation <strong>in</strong>to groups of dependent<br />

signals us<strong>in</strong>g jo<strong>in</strong>t block diagonalization<br />

Abstract— Multidimensional or group <strong>in</strong>dependent component<br />

analysis describes the task of transform<strong>in</strong>g a multivariate observed<br />

sensor signal such that groups of the transformed signal<br />

components are mutually <strong>in</strong>dependent - however dependencies<br />

with<strong>in</strong> the groups are still allowed. This generalization of<br />

<strong>in</strong>dependent component analysis (ICA) allows for weaken<strong>in</strong>g<br />

the sometimes too strict assumption of <strong>in</strong>dependence <strong>in</strong> ICA.<br />

It has potential applications <strong>in</strong> various fields such as ECG,<br />

fMRI analysis or convolutive ICA. Recently we could calculate<br />

the <strong>in</strong>determ<strong>in</strong>acies of group ICA, which f<strong>in</strong>ally enables us,<br />

also theoretically, to apply group ICA to solve bl<strong>in</strong>d source<br />

separation (BSS) problems. In this paper we <strong>in</strong>troduce and<br />

discuss various algorithms for separat<strong>in</strong>g signals <strong>in</strong>to groups<br />

of dependent signals. The algorithms are based on jo<strong>in</strong>t block<br />

diagonalization of sets of matrices generated us<strong>in</strong>g several signal<br />

structures.<br />

Fabian J. Theis<br />

Institute of Biophysics, University of Regensburg<br />

93040 Regensburg, Germany, Email: fabian@theis.name<br />

I. INTRODUCTION

In this work, we discuss multidimensional blind source separation (MBSS), i.e. the recovery of underlying sources s from an observed mixture x. As usual, s has to fulfill additional properties such as independence or diagonality of the autocovariances (if s possesses time structure). However, in contrast to ordinary BSS, MBSS is more general as some source signals are allowed to possess common statistics. One possible solution for MBSS is multidimensional independent component analysis (MICA) — in section IV we will discuss other such conditions. The idea of MICA is that we do not require full independence of the transform y := Wx but only mutual independence of certain tuples y_{i1}, . . . , y_{i2}. If the size of all tuples is restricted to one, this reduces to ordinary ICA. In general, of course, the tuples could have different sizes, but for the sake of simplicity we assume that they all have the same length k.

Multidimensional ICA has first been introduced by Cardoso [1] using geometrical motivations. Hyvärinen and Hoyer then presented a special case of multidimensional ICA which they called independent subspace analysis [2]; there the dependence within a k-tuple is explicitly modelled, enabling the authors to propose better algorithms without having to resort to the problematic multidimensional density estimation.

II. JOINT BLOCK DIAGONALIZATION

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE [3]) or in BSS relying on second-order time-decorrelation (for example in SOBI [4]). The task of (real) joint diagonalization is, given a set of commuting symmetric n × n matrices Mi, to find an orthogonal matrix E such that E⊤MiE is diagonal for all i. In the following we will use a generalization of this technique as an algorithm to solve MBSS problems. Instead of fully diagonalizing the Mi, in joint block diagonalization (JBD) we want to determine E such that E⊤MiE is block-diagonal (after fixing the block-structure).

Introducing some notation, let us define for r, s = 1, . . . , n the (r, s) sub-k-matrix of W = (w_{ij}), denoted by W^(k)_{rs}, to be the k × k submatrix of W ending at position (rk, sk). Denote by Gl(n) the group of invertible n × n matrices. A matrix W ∈ Gl(nk) is said to be a k-scaling matrix if W^(k)_{rs} = 0 for r ≠ s, and W is called a k-permutation matrix if for each r = 1, . . . , n there exists precisely one s such that W^(k)_{rs} equals the k × k unit matrix.
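These definitions translate directly into code; the following NumPy sketch (our own naming, added for illustration) extracts sub-k-matrices and checks the two properties:

```python
import numpy as np

def sub_k(W, r, s, k):
    """(r, s) sub-k-matrix of W: the k x k block ending at (r*k, s*k), with
    1-based block indices r, s as in the text."""
    return W[(r - 1) * k: r * k, (s - 1) * k: s * k]

def is_k_scaling(W, k, tol=1e-12):
    """True if all off-diagonal k x k blocks of W vanish."""
    n = W.shape[0] // k
    return all(np.allclose(sub_k(W, r, s, k), 0, atol=tol)
               for r in range(1, n + 1) for s in range(1, n + 1) if r != s)

def is_k_permutation(W, k, tol=1e-12):
    """True if for each block row r there is precisely one s with
    W^(k)_rs equal to the k x k identity (the definition above)."""
    n = W.shape[0] // k
    return all(sum(np.allclose(sub_k(W, r, s, k), np.eye(k), atol=tol)
                   for s in range(1, n + 1)) == 1
               for r in range(1, n + 1))
```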

Hence, fixing the block-size to k, JBD tries to find E such that E⊤MiE is a k-scaling matrix. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure of non-block-diagonality.

Various algorithms to actually perform JBD have been proposed, see [5] and references therein. In the following we will simply perform joint diagonalization (using for example the Jacobi-like algorithm from [6]) and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [5].
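One possible realization of this diagonalize-then-permute strategy is sketched below. This is an assumed greedy heuristic for illustration only (not necessarily the procedure used in the experiments), and it assumes n is a multiple of k; it groups the columns of the joint diagonalizer E by their residual interaction in the set E⊤MiE:

```python
import numpy as np

def block_permutation(E, Ms, k):
    """Permute the columns of a joint diagonalizer E so that the residual
    off-diagonal mass of sum_i |E^T M_i E| becomes approximately k-block
    diagonal, by greedily grouping the most strongly coupled columns."""
    n = E.shape[1]                           # assumed to be a multiple of k
    C = sum(np.abs(E.T @ M @ E) for M in Ms)
    np.fill_diagonal(C, 0.0)
    C = C + C.T                              # symmetric coupling between columns
    remaining, order = list(range(n)), []
    while remaining:
        block = [remaining.pop(0)]           # seed a new block
        for _ in range(k - 1):               # add the k-1 most coupled columns
            j = max(remaining, key=lambda c: C[np.ix_(block, [c])].sum())
            remaining.remove(j)
            block.append(j)
        order.extend(block)
    return E[:, order]
```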

III. MULTIDIMENSIONAL ICA (MICA)

Let k, n ∈ N. We call an nk-dimensional random vector y k-independent if the k-dimensional random vectors (y1, . . . , yk)⊤, . . . , (y_{nk−k+1}, . . . , y_{nk})⊤ are mutually independent. A matrix W ∈ Gl(nk) is called a k-multidimensional ICA of an nk-dimensional random vector x if Wx is k-independent. If k = 1, this is the same as ordinary ICA.

Using MICA we want to solve the (noiseless) linear MBSS problem x = As, where the nk-dimensional random vector x is given, and A ∈ Gl(nk) and s are unknown. In the case of MICA, s is assumed to be k-independent.

A. Indeterminacies

Obvious indeterminacies are, similar to ordinary ICA, invertible transforms in Gl(k) in each tuple as well as the fact that the order of the independent k-tuples is not fixed. Indeed, if A is an MBSS solution, then so is ALP with a k-scaling matrix L and a k-permutation P, because independence is invariant under these transformations. In [7] we show that these are the only indeterminacies, given some additional weak restrictions to the model, namely that A has to be k-admissible and that s is not allowed to contain a Gaussian k-component.

As usual, by preprocessing of the observations x by whitening we may also assume that Cov(x) = I. Then I = Cov(x) = A Cov(s) A⊤ = AA⊤, so A is orthogonal.

B. MICA using Hessian diagonalization (MHICA)

We assume that s admits a C²-density ps. Using orthogonality of A we get ps(s0) = px(As0) for s0 ∈ R^nk. Let Hf(x0) denote the Hessian of f evaluated at x0. It transforms like a 2-tensor, so locally at s0 with ps(s0) > 0 we get

H_{ln ps}(s0) = H_{ln px∘A}(s0) = A⊤ H_{ln px}(As0) A.     (1)

The key idea now lies in the fact that s is assumed to be k-independent, so ps factorizes into n groups depending only on k separate variables each. So ln ps is a sum of functions depending on k separate variables, hence H_{ln ps}(s0) is block-diagonal, i.e. a k-scaling.

The algorithm, multidimensional Hessian ICA (MHICA),<br />

now simply uses the block-diagonality structure from equation<br />

1 and performs JBD of estimates of a set of Hessians<br />

Hln ps (si) evaluated at different po<strong>in</strong>ts si ∈ Rnk . Given slight<br />

restrictions on the eigenvalues, the result<strong>in</strong>g block diagonalizer<br />

then equals A⊤ except for k-scal<strong>in</strong>g and permutation. The<br />

Hessians are estimated us<strong>in</strong>g kernel-density approximation<br />

with a sufficiently smooth kernel, but other methods such<br />

as approximation us<strong>in</strong>g f<strong>in</strong>ite differences are possible, too.<br />

Density approximation is problematic, but <strong>in</strong> this sett<strong>in</strong>g due to<br />

the fact that we can use many Hessians we only need rough<br />

estimates. For more details on the kernel approximation we<br />

refer to the one-dimensional Hessian ICA algorithm from [8].<br />
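As an illustration of the finite-difference option mentioned above, the following Python sketch (using SciPy's Gaussian kernel density estimator; all names and parameter values are our own assumptions) produces rough Hessians of the log-density at a few sample points, which would then be handed to a joint block diagonalizer:

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_density_hessian(logp, x0, h=1e-2):
    """Central finite-difference Hessian of a log-density at x0; a rough
    estimate suffices here, since many Hessians are jointly diagonalized."""
    d = len(x0)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (logp(x0 + ei + ej) - logp(x0 + ei - ej)
                       - logp(x0 - ei + ej) + logp(x0 - ei - ej)) / (4 * h * h)
    return H

# Hessians of the log kernel-density estimate at a few random sample points
X = np.random.randn(4, 1000)                 # whitened observations, shape (dim, samples)
kde = gaussian_kde(X)
logp = lambda x: np.log(kde(x.reshape(-1, 1))[0] + 1e-12)
hessians = [log_density_hessian(logp, X[:, t])
            for t in np.random.choice(1000, 5, replace=False)]
```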

MHICA generalizes one-dimensional ideas proposed <strong>in</strong> [8],<br />

[9]. More generally, we could have also used characteristic<br />

functions <strong>in</strong>stead of densities, which leads to a related algorithm,<br />

see [10] for the s<strong>in</strong>gle-dimensional ICA case.<br />

IV. MULTIDIMENSIONAL TIME DECORRELATION

Instead of assuming k-independence of the sources in the MBSS problem, in this section we assume that s is a multivariate centered discrete WSS random process such that its symmetrized autocovariances

R̄s(τ) := (1/2) ( E[s(t + τ) s(t)⊤] + E[s(t) s(t + τ)⊤] )     (2)

are k-scalings for all τ. This models the fact that the sources are supposed to be block-decorrelated in the time domain for all time-shifts τ.

A. Indeterminacies

Again A can only be found up to k-scaling and k-permutation because condition (2) is invariant under this transformation. One sufficient condition for identifiability is to have pairwise different eigenvalues of at least one Rs(τ); however, generalizations are possible, see [4] for the case k = 1. Using whitening, we can again assume orthogonal A.



Fig. 1. Histogram and box plot of the multidimensional performance index E^(k)(C) evaluated for k = 2 and n = 2. The statistics were calculated over 10^5 independent experiments using 4 × 4 matrices C with coefficients uniformly drawn out of [−1, 1].

B. Multidimensional SOBI (MSOBI)

The idea of what we call multidimensional second-order blind identification (MSOBI) is now a direct extension of the usual SOBI algorithm [4]. Symmetrized autocovariances of x can easily be estimated from the data, and they transform as follows: R̄s(τ) = A⊤ R̄x(τ) A. But R̄s(τ) is a k-scaling by assumption, so JBD of a set of such symmetrized autocovariance matrices yields A as diagonalizer (except for k-scaling and permutation).
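A minimal sketch of the matrix set entering MSOBI (illustrative Python with our own naming; the joint block diagonalization step itself is omitted and would be performed as described in section II):

```python
import numpy as np

def symmetrized_autocov(x, tau):
    """Estimate the symmetrized autocovariance R̄_x(tau) of equation (2)
    from samples x of shape (dim, T), for an integer lag tau > 0."""
    x = x - x.mean(axis=1, keepdims=True)
    T = x.shape[1]
    R = x[:, tau:] @ x[:, :T - tau].T / (T - tau)
    return 0.5 * (R + R.T)

def msobi_matrices(x, taus=range(1, 11)):
    """Whiten x and return the whitening matrix together with the set of
    symmetrized autocovariances to be jointly block diagonalized."""
    d, E = np.linalg.eigh(np.cov(x))
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    xw = V @ (x - x.mean(axis=1, keepdims=True))
    return V, [symmetrized_autocov(xw, t) for t in taus]
```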

Other researchers have worked on this problem <strong>in</strong> the sett<strong>in</strong>g<br />

of convolutive BSS — due to lack of space we want to refer<br />

to [11] and references there<strong>in</strong>.<br />

V. EXPERIMENTAL RESULTS<br />

In this section we demonstrate the validity of the proposed<br />

algorithms by apply<strong>in</strong>g them to both toy and real world data.<br />

A. Multidimensional Amari-<strong>in</strong>dex<br />

In order to analyze algorithm performance, we consider the index E^(k)(C), defined for fixed n, k and C ∈ Gl(nk) as

E^(k)(C) = Σ_{r=1}^{n} ( Σ_{s=1}^{n} ‖C^(k)_{rs}‖ / max_i ‖C^(k)_{ri}‖ − 1 ) + Σ_{s=1}^{n} ( Σ_{r=1}^{n} ‖C^(k)_{rs}‖ / max_i ‖C^(k)_{is}‖ − 1 ).

Here ‖·‖ can be any matrix norm — we choose the operator norm ‖A‖ := max_{|x|=1} |Ax|. This multidimensional performance index of an nk × nk matrix C generalizes the one-dimensional performance index introduced by Amari et al. [12] to block-diagonal matrices. It measures how much C differs from a permutation and scaling matrix in the sense of k-blocks, so it can be used to analyze algorithm performance:

Lemma 5.1: Let C ∈ Gl(nk). Then E^(k)(C) = 0 if and only if C is the product of a k-scaling and a k-permutation matrix.

Corollary 5.2: Consider the MBSS problem x = As from section III respectively IV. An estimate Â of the mixing matrix solves the MBSS problem if and only if E^(k)(Â⁻¹A) = 0.

In order to be able to determine the scale of this index, figure 1 gives statistics of E^(k) over randomly chosen matrices in the case k = n = 2. The mean is 3.05 and the median 3.10.
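For reference, the index can be computed as follows (our Python sketch, using the spectral norm as the chosen operator norm for the k × k blocks):

```python
import numpy as np

def E_k(C, k):
    """Multidimensional performance index E^(k)(C) as defined above."""
    n = C.shape[0] // k
    B = np.array([[np.linalg.norm(C[r*k:(r+1)*k, s*k:(s+1)*k], 2)
                   for s in range(n)] for r in range(n)])
    rows = (B / B.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (B / B.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return float(rows.sum() + cols.sum())

# E^(k) vanishes exactly for products of k-scaling and k-permutation matrices:
P = np.zeros((4, 4))
P[:2, 2:] = np.random.randn(2, 2)
P[2:, :2] = np.random.randn(2, 2)
print(E_k(P, 2))        # 0.0
```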





Fig. 2. Simulation, 4-dimensional 2-independent sources. Clearly the first and the second respectively the third and the fourth signal are dependent.

B. Simulations<br />

We will discuss algorithm performance when applied to a 4-dimensional 2-independent toy signal. In order to see the performance of both MSOBI and MHICA we generate 2-independent sources with non-trivial autocorrelations. For this we use two independent generating signals, a sinusoid and a sawtooth given by

z(t) := ( sin(0.1 t), 2(0.007 t + 0.5 − ⌊0.007 t + 0.5⌋) − 1 )⊤

for discrete time steps t = 1, 2, . . . , 1000. We thus generated the sources

s(t) := ( z1(t), exp(z1(t)), z2(t), (z2(t) + 0.5)² )⊤,

which are plotted in figure 2. Their covariance is

Cov(s) = ( 0.50  0.57  0.01  0.01
           0.57  0.68  0.01  0.01
           0.01  0.01  0.33  0.33
           0.01  0.01  0.33  0.42 ),

so indeed s is not fully independent.
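The toy sources can be reproduced with a few lines of NumPy. Note that we read the generating sawtooth as the unit-amplitude sawtooth 2·frac(0.007 t + 0.5) − 1, an assumption on our part that is consistent with the printed covariance values:

```python
import numpy as np

t = np.arange(1, 1001)
z1 = np.sin(0.1 * t)
frac = (0.007 * t + 0.5) % 1.0           # fractional part
z2 = 2.0 * frac - 1.0                    # sawtooth with values in [-1, 1)
s = np.vstack([z1, np.exp(z1), z2, (z2 + 0.5) ** 2])
print(np.round(np.cov(s), 2))            # approximately reproduces Cov(s) above
```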

s is mixed using a 4 × 4 matrix A with entries uniformly drawn out of [−1, 1], and comparisons are made over 100 Monte-Carlo runs. We compare the two algorithms MSOBI (with 10 autocorrelation matrices) and MHICA (using 50 Hessians) with the ICA algorithms JADE and fastICA, where for the latter both the deflation and the symmetric approach were used. For each run we calculate the performance index E^(2)(Â⁻¹A) of the product of the mixing and the estimated separating matrix. Since the one-dimensional ICA algorithms are unable to use the group structure, for these we take the minimum of the index calculated over all row permutations of Â⁻¹A.

Figure 3 displays the result of the comparison. Clearly MHICA and MSOBI perform very well on this data, and MSOBI furthermore gives very robust estimates with the same error and negligibly small variance.


Fig. 3. Simulation, algorithm results. This notched box plot displays the performance index E^(2) of the mixing-separating matrix Â⁻¹A of each algorithm, sampled over 100 Monte-Carlo runs. The middle line of each column gives the mean, the boxes the 25th and 75th percentiles. The deflationary fastICA algorithm only converged in 12% of all runs, the symmetric-approach based fastICA in 89% of all cases; the statistics are only given over successful runs. All other algorithms converged in all runs.

JADE cannot separate the data at all — it performs not much better than a random choice of matrix, see figure 1; this is due to the fact that the cumulants of k-independent sources are not block-diagonal. FastICA only converges in 12% (deflation approach) respectively 89% (symmetric approach) of all cases. However, in the cases where it converges it gives results comparable with the multidimensional algorithms. Apparently, especially the symmetric method seems to be able to use the weakened statistics to still find directions in the data.

C. Application to ECG data

Finally we illustrate how to apply the proposed algorithms to a real-world data set. Following [1], we will show how to separate fetal ECG (FECG) recordings from the mother's ECG (MECG). The data set [13] consists of eight recorded signals with 2500 observations; the sampling frequency is misleadingly specified as 500 Hz (which would mean around 168 mother heartbeats per minute), it should be closer to around 250 Hz. We select the first three sensors, cutaneously recorded on the abdomen of the mother. In order to save space and to compare the results with [1] we plot only the first 1000 samples, see figure 4(a).


[Figure 4: (a) ECG recordings, (b) extracted sources, (c) MECG part, (d) FECG part; 500 samples shown per channel.]

Fig. 4. Fetal ECG example. (a) shows the ECG recordings. The underlying FECG (4 heartbeats) is partially visible in the dominating MECG (3 heartbeats). Figure (b) gives the extracted sources using MHICA with k = 2 and 500 Hessians. In (c) and (d) the projections of the mother sources (components 1 and 2 in (b)) respectively the fetal sources (component 3) onto the mixture space (a) are plotted.

Our goal is to extract an MECG and an FECG component; however, we cannot expect to find a one-dimensional MECG only, due to the fact that projections of a three-dimensional vector (electric) field are measured. Hence modelling the data by a multidimensional BSS problem with k = 2 (but allowing for an additional one-dimensional component) makes sense.

Application of MHICA (with 500 Hessians) and MSOBI (with 50 autocorrelation matrices) extracts a two-dimensional MECG component and a one-dimensional FECG component. After block-permutation we get the following estimated mixing matrices (A using MHICA and A′ using MSOBI):

A = (  0.37   0.42  −0.81
      −0.75   0.89  −0.16
       0.55  −0.16   0.57 ),

A′ = (  0.22   0.91  −0.40
       −0.84   0.23  −0.33
        0.50   0.34   0.85 ).

The thus estimated sources using MHICA are plotted in figure 4(b). In order to compare the two mixing matrices, calculation of

A⁻¹A′ = (  0.85   1.02   0.64
          −0.23   1.11   0.35
          −0.01  −0.08   0.98 )

yields a somewhat visible block structure; the performance index is E^(2)(A⁻¹A′) = 1.12. The block structure is not very dominant, which indicates that the two models — block independence versus time-block-decorrelation — are not fully equivalent.

A (scal<strong>in</strong>g <strong>in</strong>variant) decomposition of the observed ECG<br />

data can be achieved by compos<strong>in</strong>g the extracted sources us<strong>in</strong>g<br />

only the relevant mix<strong>in</strong>g columns. For example for the MECG<br />

part this means apply<strong>in</strong>g the projection ΠM := (a1, a2, 0)A−1 to the observations. This yields the projection matrices<br />

�<br />

�<br />

0.52 0.38 0.84<br />

ΠM = −0.10 1.08 0.17<br />

0.34 −0.27 0.41<br />

ΠF =<br />

� 0.48 −0.38 −0.84<br />

0.10 −0.08 −0.17<br />

−0.34 0.27 0.59<br />

onto the mother respectively the fetal ECG us<strong>in</strong>g MHICA and<br />

Π ′ � �<br />

0.78 0.21 0.45<br />

M = −0.18 1.17 0.36 Π<br />

0.47 −0.44 0.05<br />

′ � �<br />

0.22 −0.21 −0.45<br />

F = 0.18 −0.17 −0.36 .<br />

−0.47 0.44 0.95<br />

us<strong>in</strong>g MSOBI. The results of the first algorithm are plotted <strong>in</strong><br />

figures 4 (c) and (d). The fetal ECG is most active at sensor<br />

1 (as visual <strong>in</strong>spection of the observation confirms). When<br />

compar<strong>in</strong>g the projection matrices with the results from [1], we<br />

get quite high similarity of the ICA-based results, and a modest<br />

difference with the projections of the time-based algorithm.<br />

Other one-dimensional ICA-based results on this data set are<br />

reported for example <strong>in</strong> [14].<br />

�<br />
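The projections can be formed directly from an estimated mixing matrix; a small NumPy sketch (our own illustration, using the MHICA estimate A quoted above, so rounded output may deviate slightly from the printed ΠM):

```python
import numpy as np

# Projection onto the MECG subspace as described above: keep only the mixing
# columns of the mother component and map back, Pi_M = (a1, a2, 0) A^{-1}.
A = np.array([[ 0.37,  0.42, -0.81],
              [-0.75,  0.89, -0.16],
              [ 0.55, -0.16,  0.57]])     # MHICA estimate from the text
A_keep = A.copy()
A_keep[:, 2] = 0.0                        # zero out the fetal (third) column
Pi_M = A_keep @ np.linalg.inv(A)
Pi_F = np.eye(3) - Pi_M                   # complementary projection onto the FECG part
print(np.round(Pi_M, 2))
```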

VI. CONCLUSION

We have shown how the idea of joint block diagonalization, as an extension of joint diagonalization, helps us to generalize ICA and time-structure based algorithms such as HICA and SOBI to the multidimensional ICA case. The thus defined algorithms are able to robustly decompose signals into groups of independent signals. In future work, besides more extensive experiments and tests with noise and outliers, we want to extend this result to a version of JADE using moments instead of cumulants, which preserve the block structure.

REFERENCES<br />

[1] J. Cardoso, “Multidimensional <strong>in</strong>dependent component analysis,” <strong>in</strong><br />

Proc. of ICASSP ’98, Seattle, 1998.<br />

[2] A. Hyvär<strong>in</strong>en and P. Hoyer, “Emergence of phase and shift <strong>in</strong>variant<br />

features by decomposition of natural images <strong>in</strong>to <strong>in</strong>dependent feature<br />

subspaces,” Neural Computation, vol. 12, no. 7, pp. 1705–1720, 2000.<br />

[3] J.-F. Cardoso and A. Souloumiac, “Bl<strong>in</strong>d beamform<strong>in</strong>g for non gaussian<br />

signals,” IEE Proceed<strong>in</strong>gs - F, vol. 140, no. 6, pp. 362–370, 1993.<br />

[4] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moul<strong>in</strong>es, “A<br />

bl<strong>in</strong>d source separation technique based on second order statistics,” IEEE<br />

Transactions on Signal Process<strong>in</strong>g, vol. 45, no. 2, pp. 434–444, 1997.<br />

[5] K. Abed-Meraim and A. Belouchrani, “Algorithms for jo<strong>in</strong>t block<br />

diagonalization,” <strong>in</strong> Proc. EUSIPCO 2004, Vienna, Austria, 2004, pp.<br />

209–212.<br />

[6] J.-F. Cardoso and A. Souloumiac, “Jacobi angles for simultaneous<br />

diagonalization,” SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161–<br />

164, Jan. 1995.<br />

[7] F. Theis, “Uniqueness of complex and multidimensional <strong>in</strong>dependent<br />

component analysis,” Signal Process<strong>in</strong>g, vol. 84, no. 5, pp. 951–956,<br />

2004.<br />

[8] ——, “A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation,”<br />

Neural Computation, vol. 16, pp. 1827–1850, 2004.<br />

[9] J. L<strong>in</strong>, “Factoriz<strong>in</strong>g multivariate function classes,” <strong>in</strong> Advances <strong>in</strong> Neural<br />

Information Process<strong>in</strong>g Systems, vol. 10, 1998, pp. 563–569.<br />

[10] A. Yeredor, “Bl<strong>in</strong>d source separation via the second characteristic<br />

function,” Signal Process<strong>in</strong>g, vol. 80, no. 5, pp. 897–902, 2000.<br />

[11] C. Févotte and C. Doncarli, “A unified presentation of bl<strong>in</strong>d separation<br />

methods for convolutive mixtures us<strong>in</strong>g block-diagonalization,” <strong>in</strong> Proc.<br />

ICA 2003, Nara, Japan, 2003, pp. 349–354.<br />

[12] S. Amari, A. Cichocki, and H. Yang, “A new learn<strong>in</strong>g algorithm for bl<strong>in</strong>d<br />

signal separation,” Advances <strong>in</strong> Neural Information Process<strong>in</strong>g Systems,<br />

vol. 8, pp. 757–763, 1996.<br />

[13] B. D. M. (ed.), “DaISy: database for the identification<br />

of systems,” Department of Electrical Eng<strong>in</strong>eer<strong>in</strong>g,<br />

ESAT/SISTA, K.U.Leuven, Belgium, Oct 2004. [Onl<strong>in</strong>e]. Available:<br />

http://www.esat.kuleuven.ac.be/sista/daisy/<br />

[14] L. D. Lathauwer, B. D. Moor, and J. Vandewalle, “Fetal electrocardiogram<br />

extraction by source subspace separation,” <strong>in</strong> Proc. IEEE SP /<br />

ATHOS Workshop on HOS, Girona, Spa<strong>in</strong>, 1995, pp. 134–138.




Chapter 8<br />

Proc. NIPS 2006<br />

Paper F.J. Theis. Towards a general <strong>in</strong>dependent subspace analysis. Proc. NIPS<br />

2006, 2007<br />

Reference (Theis, 2007)<br />

Summary <strong>in</strong> section 1.3.3<br />




Towards a general <strong>in</strong>dependent subspace analysis<br />

Fabian J. Theis<br />

Max Planck Institute for Dynamics and Self-Organisation &<br />

Bernste<strong>in</strong> Center for Computational Neuroscience<br />

Bunsenstr. 10, 37073 Gött<strong>in</strong>gen, Germany<br />

fabian@theis.name<br />

Abstract

The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithm-independent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however they are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm.

1 <strong>Independent</strong> subspace analysis<br />

A random vector Y is called an <strong>in</strong>dependent component of the random vector X, if there exists<br />

an <strong>in</strong>vertible matrix A and a decomposition X = A(Y,Z) such that Y and Z are stochastically<br />

<strong>in</strong>dependent. The goal of a general <strong>in</strong>dependent subspace analysis (ISA) or multidimensional <strong>in</strong>dependent<br />

component analysis is the decomposition of an arbitrary random vector X <strong>in</strong>to <strong>in</strong>dependent<br />

components. If X is to be decomposed <strong>in</strong>to one-dimensional components, this co<strong>in</strong>cides with ord<strong>in</strong>ary<br />

<strong>in</strong>dependent component analysis (ICA). Similarly, if the <strong>in</strong>dependent components are required<br />

to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k<br />

or simply k-ISA. So 1-ISA is equivalent to ICA.<br />

1.1 Why extend ICA?<br />

An important structural aspect <strong>in</strong> the search for decompositions is the knowledge of the number of<br />

solutions i.e. the <strong>in</strong>determ<strong>in</strong>acies of the problem. Without it, the result of any ICA or ISA algorithm<br />

cannot be compared with other solutions, so for <strong>in</strong>stance bl<strong>in</strong>d source separation (BSS) would be<br />

impossible. Clearly, given an ISA solution, <strong>in</strong>vertible transforms <strong>in</strong> each component (scal<strong>in</strong>g matrices<br />

L) as well as permutations of components of the same dimension (permutation matrices P) give<br />

aga<strong>in</strong> an ISA of X. And <strong>in</strong>deed, <strong>in</strong> the special case of ICA, scal<strong>in</strong>g and permutation are already all<br />

<strong>in</strong>determ<strong>in</strong>acies given that at most one Gaussian is conta<strong>in</strong>ed <strong>in</strong> X [6]. This is one of the key theoretical<br />

results <strong>in</strong> ICA, allow<strong>in</strong>g the usage of ICA for solv<strong>in</strong>g BSS problems and hence stimulat<strong>in</strong>g<br />

many applications. It has been shown that also for k-ISA, scal<strong>in</strong>gs and permutations as above are<br />

the only <strong>in</strong>determ<strong>in</strong>acies [11], given some additional rather weak restrictions to the model.<br />

However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement of a fixed group size k does not allow us to apply this analysis to an arbitrary random vector.



Figure 1: Applying ICA to a random vector X = AS that does not fulfill the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which will be explained in section 3.

Indeed, theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however, if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition as well as possible; however, care has to be taken: the strong uniqueness result is not valid any more, and the results may depend on the algorithm, as illustrated in figure 1.

This work aims at f<strong>in</strong>d<strong>in</strong>g an ISA model that allows applicability to any random vector. After review<strong>in</strong>g<br />

previous approaches, we will provide such a model together with a correspond<strong>in</strong>g uniqueness<br />

result and a prelim<strong>in</strong>ary algorithm.<br />

1.2 Previous approaches to ISA for dependent component analysis

Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has first been introduced by Cardoso [4] using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes [9] is quite general, however no identifiability results were presented, and applicability to an arbitrary random vector was unclear; later, in the special case of equal group sizes (k-ISA), uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have recently been studied by [10]. Moreover, if the observations contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13].

Hyvärinen and Hoyer presented a special case of k-ISA by combining it with invariant feature subspace analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree structure as employed by their tree-dependent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest, i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation.



1.3 General ISA<br />

Def<strong>in</strong>ition 1.1. A random vector S is said to be irreducible if it conta<strong>in</strong>s no lower-dimensional<br />

<strong>in</strong>dependent component. An <strong>in</strong>vertible matrix W is called a (general) <strong>in</strong>dependent subspace analysis<br />

of X if WX = (S1, . . . ,Sk) with pairwise <strong>in</strong>dependent, irreducible random vectors Si.<br />

Note that <strong>in</strong> this case, the Si are <strong>in</strong>dependent components of X. The idea beh<strong>in</strong>d this def<strong>in</strong>ition is<br />

that <strong>in</strong> contrast to ICA and k-ISA, we do not fix the size of the groups Si <strong>in</strong> advance. Of course,<br />

some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction<br />

is realized by allow<strong>in</strong>g only irreducible components. The advantage of this formulation now is that<br />

it can clearly be applied to any random vector, although of course a trivial decomposition might be<br />

the result <strong>in</strong> the case of an irreducible random vector. Obvious <strong>in</strong>determ<strong>in</strong>acies of an ISA of X are,<br />

as mentioned above, scal<strong>in</strong>gs i.e. <strong>in</strong>vertible transformations with<strong>in</strong> each Si and permutation of Si<br />

of the same dimension 1 . These are already all <strong>in</strong>determ<strong>in</strong>acies as shown by the follow<strong>in</strong>g theorem,<br />

which extends previous results <strong>in</strong> the case of ICA [6] and k-ISA [11], where also the additional<br />

slight assumptions on square-<strong>in</strong>tegrability i.e. on exist<strong>in</strong>g covariance have been made.<br />

Theorem 1.2. Given a random vector X with exist<strong>in</strong>g covariance and no Gaussian <strong>in</strong>dependent<br />

component, then an ISA of X exists and is unique except for scal<strong>in</strong>g and permutation.<br />

Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give<br />

a short sketch of the proof <strong>in</strong> the follow<strong>in</strong>g. The uniqueness result can easily be formulated as a<br />

subspace extraction problem, and theorem 1.2 follows readily from<br />

Lemma 1.3. Let S = (S1, . . . ,Sk) be a square-<strong>in</strong>tegrable decomposition of S <strong>in</strong>to irreducible<br />

<strong>in</strong>dependent components Si. If X is an irreducible component of S, then X ∼ Si for some i.<br />

Here the equivalence relation ∼ denotes equality except for an <strong>in</strong>vertible transformation. The follow<strong>in</strong>g<br />

two lemmata each give a simplification of lemma 1.3 by order<strong>in</strong>g the components Si accord<strong>in</strong>g<br />

to their dimensions. Some care has to be taken when show<strong>in</strong>g that lemma 1.5 implies lemma 1.4.<br />

Lemma 1.4. Let S and X be def<strong>in</strong>ed as <strong>in</strong> lemma 1.3. In addition assume that dimSi = dimX for<br />

i ≤ l and dimSi < dimX for i > l. Then X ∼ Si for some i ≤ l.<br />

Lemma 1.5. Let S and X be def<strong>in</strong>ed as <strong>in</strong> lemma 1.4, and let l = 1 and k = 2. Then X ∼ S1.<br />

In order to prove lemma 1.5 (and hence the theorem), it is sufficient to show the follow<strong>in</strong>g lemma:<br />

Lemma 1.6. Let S = (S1,S2) with S1 irreducible and m := dimS1 > dimS2 =: n. If X = AS<br />

is aga<strong>in</strong> irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is<br />

<strong>in</strong>vertible, and (ii) if X is an <strong>in</strong>dependent component of S, the right m×n-submatrix of A vanishes.<br />

(i) follows after some l<strong>in</strong>ear algebra, and is necessary to show the more difficult part (ii). For this,<br />

we follow the ideas presented <strong>in</strong> [12] us<strong>in</strong>g factorization of the jo<strong>in</strong>t characteristic function of S.<br />

1.4 Deal<strong>in</strong>g with Gaussians<br />

In the previous section, Gaussians had to be excluded (or at most one was allowed) <strong>in</strong> order to<br />

avoid additional <strong>in</strong>determ<strong>in</strong>acies. Indeed, any orthogonal transformation of two decorrelated hence<br />

<strong>in</strong>dependent Gaussians is aga<strong>in</strong> <strong>in</strong>dependent, so clearly such a strong identification result would not<br />

be possible.<br />

Recently, a general decomposition model deal<strong>in</strong>g with Gaussians was proposed <strong>in</strong> the form of the socalled<br />

non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace<br />

with<strong>in</strong> the data, and no assumption of <strong>in</strong>dependence with<strong>in</strong> the subspace is made. More precisely,<br />

given a random vector X, a factorization X = AS with an <strong>in</strong>vertible matrix A, S = (SN,SG)<br />

and SN a square-<strong>in</strong>tegrable m-dimensional random vector is called an m-decomposition of X if<br />

SN and SG are stochastically <strong>in</strong>dependent and SG is Gaussian. In this case, X is said to be mdecomposable.<br />

X is denoted to be m<strong>in</strong>imally n-decomposable if X is not (n − 1)-decomposable.<br />

Accord<strong>in</strong>g to our previous notation, SN and SG are <strong>in</strong>dependent components of X. It has been<br />

shown that the subspaces of such decompositions are unique [12]:<br />

1 Note that scal<strong>in</strong>g here implies a basis change <strong>in</strong> the component Si, so for example <strong>in</strong> the case of a twodimensional<br />

source component, this might be rotation and sheer<strong>in</strong>g. In the example later <strong>in</strong> figure 3, these<br />

<strong>in</strong>determ<strong>in</strong>acies can easily be seen by compar<strong>in</strong>g true and estimated sources.



Theorem 1.7 (Uniqueness of NGSA). The mix<strong>in</strong>g matrix A of a m<strong>in</strong>imal decomposition is unique<br />

except for transformations <strong>in</strong> each of the two subspaces.<br />

Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2 and to get a general decomposition theorem, which characterizes the solutions of ISA.

Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with exist<strong>in</strong>g covariance,<br />

an ISA of X exists and is unique except for permutation of components of the same dimension<br />

and <strong>in</strong>vertible transformations with<strong>in</strong> each <strong>in</strong>dependent component and with<strong>in</strong> the Gaussian part.<br />

Proof. Existence is obvious. Uniqueness follows after first apply<strong>in</strong>g theorem 1.7 to X and then<br />

theorem 1.2 to the non-Gaussian part.<br />

2 Joint block diagonalization with unknown block-sizes

Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n × n matrices M := {M1, . . . , MK} is to find an orthogonal matrix E such that E⊤MkE is diagonal for all k = 1, . . . , K, i.e. to minimize f(Ê) := Σ_{k=1}^{K} ‖Ê⊤MkÊ − diagM(Ê⊤MkÊ)‖²_F with respect to the orthogonal matrix Ê, where diagM(M) produces a matrix in which all off-diagonal elements of M have been set to zero, and ‖M‖²_F := tr(MM⊤) denotes the squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal matrix, so minimizing f is equivalent to maximizing g(Ê) := Σ_{k=1}^{K} ‖diag(Ê⊤MkÊ)‖², where now diag(M) := (mii)i denotes the diagonal of M. For the actual minimization of f, respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative applications of Givens rotations in two coordinates [5].
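As an illustration of the two equivalent criteria, the following minimal NumPy sketch (not part of the original paper; it assumes the Mk are passed as a list of symmetric arrays and that Ê is orthogonal) evaluates f and g; their sum is the constant Σk ‖Mk‖²_F, which is exactly the invariance argument used above.

```python
import numpy as np

def jd_costs(E, Ms):
    """Off-diagonal cost f and diagonal gain g for a candidate orthogonal E.

    f(E) = sum_k ||E^T M_k E - diagM(E^T M_k E)||_F^2
    g(E) = sum_k ||diag(E^T M_k E)||^2
    Since the Frobenius norm is invariant under conjugation by an orthogonal
    matrix, f(E) + g(E) = sum_k ||M_k||_F^2 is constant, so minimizing f
    is the same as maximizing g.
    """
    f = g = 0.0
    for M in Ms:
        T = E.T @ M @ E
        d = np.diag(T)
        g += np.sum(d ** 2)
        f += np.sum(T ** 2) - np.sum(d ** 2)
    return f, g
```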

2.1 Generalization to blocks

In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤MkE is block-diagonal. Depending on the application, we fix the block-structure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition of n, i.e. a way of writing n as a sum of positive integers, where the order of the addends is not significant. So let² n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1, . . . , mr) ∈ Nʳ. An n × n matrix is said to be m-block diagonal if it is of the form

⎛ D1  ⋯  0  ⎞
⎜  ⋮   ⋱   ⋮  ⎟
⎝ 0   ⋯  Dr ⎠

with arbitrary mi × mi matrices Di.

As a generalization of JD in the case of the known block structure, we can formulate the joint m-block diagonalization (m-JBD) problem as the minimization of fᵐ(Ê) := Σ_{k=1}^{K} ‖Ê⊤MkÊ − diagMᵐ(Ê⊤MkÊ)‖²_F with respect to the orthogonal matrix Ê, where diagMᵐ(M) produces an m-block diagonal matrix by setting all other elements of M to zero. In practice, due to estimation errors, such an E will not exist, so we speak of approximate JBD and imply minimizing some error measure on non-block-diagonality. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size.

Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. For this it is necessary to require a block structure of maximal length, otherwise trivial solutions or 'in-between' solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmax_{m | ∃E: fᵐ(E)=0} |m|. In practice, due to errors, a true JBD would always result in the trivial decomposition m = (n), so we define an approximate general JBD by requiring fᵐ(E) < ε for some fixed constant ε > 0 instead of fᵐ(E) = 0.

² We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting.
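To make the block-diagonality criterion concrete, here is a small sketch (illustrative only; the helper names are hypothetical) of the m-block mask corresponding to diagMᵐ and the resulting cost fᵐ for a given partition m:

```python
import numpy as np

def block_mask(m):
    """Boolean mask of the m-block diagonal pattern for a partition m = (m1, ..., mr)."""
    n = sum(m)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in m:
        mask[start:start + size, start:start + size] = True
        start += size
    return mask

def jbd_cost(E, Ms, m):
    """f^m(E): squared Frobenius norm of everything outside the m-blocks."""
    mask = block_mask(m)
    return sum(np.sum((E.T @ M @ E)[~mask] ** 2) for M in Ms)
```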

2.2 JBD by JD

A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [1]. This idea has been formulated as a conjecture [1], essentially claiming that a minimum of the JD cost function f already is a JBD, i.e. a minimum of the function fᵐ up to a permutation matrix. Indeed, the conjecture requires the use of the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially:

We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f, then there exists a permutation P such that fᵐ(EP) = 0 (given existence of a JBD of M). But of course f(EP) = f(E), so we will show why (certain) JBD solutions are minima of f. However, JD might have additional minima. First note that clearly not every JBD minimizes f, only those such that in each block of size mk the diagonal sum g(E), restricted to the block, is maximal over E ∈ O(mk). We will call such a JBD block-optimal in the following.

Theorem 2.1. Any block-optimal JBD of M (zero of fᵐ) is a local minimum of f.

Proof. Let E ∈ O(n) be block-optimal with fᵐ(E) = 0. We have to show that E is a local minimum of f, or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤MkE, we may already assume that each Mk is m-block diagonal, so we have to show that E = I is a local maximum of g.

Consider the elementary Givens rotation Gij(ε), defined for i < j and ε ∈ (−1, 1) as the orthogonal matrix in which all diagonal elements are 1 except for the two elements √(1 − ε²) in rows i and j, and in which all off-diagonal elements equal 0 except for the two elements ε and −ε at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ε12, ε13, . . . , ε_{n−1,n}) := ∏_{i<j} Gij(εij). Writing h := g ∘ ι for g in these coordinates, an explicit computation of the second derivative of h at 0 in a coordinate direction εij whose indices i and j belong to two different blocks yields a strictly negative value, and therefore h is negative definite in the direction εij. Altogether we get a negative definite h at 0 except for 'trivial directions' within the blocks, and hence a local maximum at 0.
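The parametrization used in the proof can be written down directly; the following sketch (an illustration under the stated definition, using 0-based indices and hypothetical helper names) constructs Gij(ε) and the local-coordinates map ι as the ordered product of Givens rotations:

```python
import numpy as np
from itertools import combinations

def givens(n, i, j, eps):
    """Elementary Givens rotation G_ij(eps): identity except for
    sqrt(1 - eps^2) at (i, i) and (j, j), eps at (i, j), -eps at (j, i)."""
    G = np.eye(n)
    c = np.sqrt(1.0 - eps ** 2)
    G[i, i] = G[j, j] = c
    G[i, j] = eps
    G[j, i] = -eps
    return G

def iota(n, eps_dict):
    """Local coordinates of O(n) at the identity: the ordered product of
    G_ij(eps_ij) over all pairs i < j (missing pairs default to 0)."""
    E = np.eye(n)
    for i, j in combinations(range(n), 2):
        E = E @ givens(n, i, j, eps_dict.get((i, j), 0.0))
    return E
```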

2.3 Recovering the permutation

In order to perform JBD, we therefore only have to find a JD E of M. What is left, according to the above theorem, is to find a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially find P by some combinatorial optimization.


(a) (unknown) block diagonal M1  (b) Ê⊤E w/o recovered permutation  (c) Ê⊤E

Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) block-partition 40 = 1+2+2+3+3+5+6+6+6+6 in the presence of noise with an SNR of 5 dB. The product Ê⊤E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes, as claimed in section 2.2.

In the case of unknown block-size, we propose to use the following simple permutation-recovery algorithm: consider the mean diagonalized matrix D := K⁻¹ Σ_{k=1}^{K} E⊤MkE. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤MkE and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial.

We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from the possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n.

Algorithm 1: Block-diagonality permutation finder
Input: (n × n)-matrix D
Output: block-diagonal matrix P(D) := D′ such that D′ = PDP⊤ for a permutation matrix P
D′ ← D
for i ← 1 to n do
    repeat
        if (j0 ← min{j | j ≥ i and d′_ij = 0 and d′_ji = 0}) exists then
            if (k0 ← min{k | k > j0 and (d′_ik ≠ 0 or d′_ki ≠ 0)}) exists then
                swap column j0 of D′ with column k0
                swap row j0 of D′ with row k0
    until no swap has occurred
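For concreteness, a direct NumPy transcription of Algorithm 1 (a sketch only, assuming D has already been thresholded with θ and using 0-based indices; the function name is hypothetical) could look as follows:

```python
import numpy as np

def block_permutation(D):
    """Algorithm 1: iteratively swap rows/columns so that the non-zeros of D
    cluster along the diagonal; returns the permuted matrix and the permutation."""
    D = D.copy()
    n = D.shape[0]
    perm = np.arange(n)
    for i in range(n):
        swapped = True
        while swapped:
            swapped = False
            # smallest j >= i with both (i, j) and (j, i) equal to zero
            zeros = [j for j in range(i, n) if D[i, j] == 0 and D[j, i] == 0]
            if zeros:
                j0 = zeros[0]
                # smallest k > j0 with (i, k) or (k, i) non-zero
                nz = [k for k in range(j0 + 1, n) if D[i, k] != 0 or D[k, i] != 0]
                if nz:
                    k0 = nz[0]
                    D[:, [j0, k0]] = D[:, [k0, j0]]   # swap columns j0 and k0
                    D[[j0, k0], :] = D[[k0, j0], :]   # swap rows j0 and k0
                    perm[[j0, k0]] = perm[[k0, j0]]
                    swapped = True
    return D, perm
```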

We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They are generated in blocks of size mi with coefficients drawn uniformly at random from [−1, 1] and symmetrized by Dk ← (Dk + Dk⊤)/2. After that, they are mixed by a random orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDkE⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5 dB. Application of the JBD algorithm from above to {M1, . . . , MK} with threshold θ = 0.1 correctly recovers the block sizes, and the estimated block diagonalizer Ê equals E up to m-scaling and permutation, as illustrated in figure 2.
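A sketch of how such a test set can be generated (with illustrative assumptions: the elementwise signal power is used to set the noise scale for the quoted SNR, and the random orthogonal matrix is drawn via a QR decomposition; the paper does not specify these details):

```python
import numpy as np

rng = np.random.default_rng(0)
m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6)           # hidden block partition, sum = 40
n, K, snr_db = sum(m), 100, 5.0

def random_block_diag(m):
    """Symmetric block-diagonal matrix with uniform [-1, 1] blocks."""
    D = np.zeros((sum(m), sum(m)))
    start = 0
    for s in m:
        D[start:start + s, start:start + s] = rng.uniform(-1, 1, (s, s))
        start += s
    return (D + D.T) / 2                      # symmetrize

E, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal mixing
Ds = [random_block_diag(m) for _ in range(K)]
signal_power = np.mean([np.mean(D ** 2) for D in Ds])
sigma = np.sqrt(signal_power / 10 ** (snr_db / 10))   # assumed SNR convention
Ms = [E @ D @ E.T + sigma * rng.standard_normal((n, n)) for D in Ds]
```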

3 SJADE — a simple algorithm for general ISA

As usual, by preprocessing the observations X by whitening we may assume that Cov(X) = I.


(a) S2  (b) S3  (c) S4  (d) S5  (e) Â⁻¹A  (f) (Ŝ1, Ŝ2)  (g) histogram of Ŝ3  (h) Ŝ4  (i) Ŝ5  (j) Ŝ6

Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the scatter plots, i.e. densities, of the source components and the mixing-separating map Â⁻¹A.

The indeterminacies allow scaling transformations in the sources, so without loss of generality let also Cov(S) = I. Then I = Cov(X) = A Cov(S) A⊤ = AA⊤, so A is orthogonal. Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to find transformations of the mixtures fulfilling this property, we follow the idea of the JADE algorithm, but now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices defined by Cij(X) := E[(X⊤EijX) XX⊤] − Eij − Eij⊤ − tr(Eij)I. Here RX := Cov(X), and Eij, 1 ≤ i, j ≤ n, is a set of eigen-matrices of Cij. One simple choice is to use n² matrices Eij with zeros everywhere except for a 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE), not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii's, but also automatically finds the subspace partition m using the general JBD algorithm from section 2.3.
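A minimal sketch of these matrices for whitened data, using the simple eigen-matrix choice Eij = e_i e_j⊤ (illustrative only; the function name and the sample-based expectation are assumptions, not the paper's implementation):

```python
import numpy as np

def quadricovariance_matrices(X):
    """Contracted quadricovariance matrices C_ij for whitened data.

    X: (n, T) array with Cov(X) ~ I.  With the simple choice E_ij = e_i e_j^T,
        C_ij = E[(x^T E_ij x) x x^T] - E_ij - E_ij^T - tr(E_ij) I
             = E[x_i x_j x x^T]      - E_ij - E_ij^T - delta_ij I.
    Returns the n^2 matrices C_ij as a list.
    """
    n, T = X.shape
    mats = []
    for i in range(n):
        for j in range(n):
            w = X[i] * X[j]                 # x_i * x_j for each sample
            C = (X * w) @ X.T / T           # sample estimate of E[x_i x_j x x^T]
            C[i, j] -= 1.0                  # subtract E_ij
            C[j, i] -= 1.0                  # subtract E_ij^T
            if i == j:
                C -= np.eye(n)              # subtract tr(E_ij) I
            mats.append(C)
    return mats
```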

4 Experimental results

In a first example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d letters or 3d geometrical shapes. The chosen source densities are shown in figure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 10⁴ samples were used. The sources S were mixed by a mixing matrix A with coefficients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The mixing matrix Â was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). Figure 3(e) shows the composed mixing-separating system Â⁻¹A; clearly the matrices are equal except for block permutation and scaling, which experimentally confirms theorem 1.8. The algorithm found a partition m̂ = (1, 1, 1, 2, 2, 3), so one 2d source was misinterpreted as two 1d sources, but by using prior knowledge a combination of the correct two 1d sources yields the original 2d source. The resulting recovered sources Ŝ := Â⁻¹X, figures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as a rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in figure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is performed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting, as observed in [4].

In a second example, we illustrate how the algorithm deals with Gaussian sources, i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 10⁵ samples were drawn. We perform 100 Monte-Carlo simulations with a random mixing matrix A, and apply SJADE with θ = 0.01. The recovered mixing matrix Â is compared with A by taking the ad-hoc measure ι(P) := Σ_{i=1}^{3} Σ_{j=1}^{2} (p²_ij + p²_ji) for P := Â⁻¹A. Indeed, we get nearly perfect recovery in 99 out of 100 runs; the median of ι(P) is very low with 0.0083. A single run diverges with ι(P) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The first two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has a kurtosis of around −1.2, which equals the kurtosis of a uniform density. This confirms the applicability of the algorithm in the general, noisy ISA setting.

5 Conclusion

Previous approaches for independent subspace analysis were restricted either to fixed group sizes or to semi-parametric models. In neither case was general applicability to arbitrary mixture data sets guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identifiability result for this general, parameter-free model, together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but it includes the noisy case via NGCA). In practice, in the finite-sample case, the general joint block diagonality holds only approximately due to estimation errors. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future work.

References

[1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004.
[2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003.
[3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. JMLR, 7:247–282, 2006.
[4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. ICASSP '98, Seattle, 1998.
[5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995.
[6] P. Comon. Independent component analysis — a new concept? Signal Processing, 36:287–314, 1994.
[7] A. Hyvärinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[8] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525–1558, 2001.
[9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998.
[10] B. Poczos and A. Lörincz. Independent subspace analysis using k-nearest neighborhood distances. In Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer.
[11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004.
[12] F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006.
[13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.



Chapter 9

Neurocomputing (in press), 2007

Paper F.J. Theis, P. Gruber, I.R. Keck, and E.W. Lang. A robust model for spatiotemporal dependencies. Neurocomputing (in press), 2007

Reference (Theis et al., 2007b)

Summary in section 1.3.3


A Robust Model for Spatiotemporal Dependencies

Fabian J. Theis a,b,∗, Peter Gruber b, Ingo R. Keck b, Elmar W. Lang b

a Bernstein Center for Computational Neuroscience, Max-Planck-Institute for Dynamics and Self-Organisation, Göttingen, Germany
b Institute of Biophysics, University of Regensburg, Regensburg, Germany

Abstract

Real-world data sets such as recordings from functional magnetic resonance imaging often possess both spatial and temporal structure. Here, we propose an algorithm including such spatiotemporal information into the analysis, and reduce the problem to the joint approximate diagonalization of a set of autocorrelation matrices. We demonstrate the feasibility of the algorithm by applying it to functional MRI analysis, where previous approaches are outperformed considerably.

Key words: blind source separation, independent component analysis, functional magnetic resonance imaging, autodecorrelation
PACS: 07.05.Kf, 87.61.–c, 05.40.–a, 05.45.Tp

1 Introduction

Blind source separation (BSS) describes the task of recovering an unknown mixing process and underlying sources of an observed data set. It has numerous applications in fields ranging from signal and image processing to the separation of speech and radar signals to financial data analysis. Many BSS algorithms assume either independence (independent component analysis, ICA) or diagonal autocorrelations of the sources [1,2]. Here we extend BSS algorithms based on time-decorrelation [3–8].

∗ corresponding author
Email address: fabian@theis.name (Fabian J. Theis).
URL: http://fabian.theis.name (Fabian J. Theis).

Preprint submitted to Elsevier, 15 May 2007


They rely on the fact that the data sets have non-trivial autocorrelations, so that the unknown mixing matrix can be recovered by generalized eigenvalue decomposition.

Spatiotemporal BSS, in contrast to the more common spatial or temporal BSS, tries to achieve both spatial and temporal separation by optimizing a joint energy function. First proposed by Stone et al. [9], it is a promising method which has potential applications in areas where data contain an inherent spatiotemporal structure, such as data from biomedicine or geophysics including oceanography and climate dynamics. Stone's algorithm is based on the Infomax ICA algorithm [10], which due to its online nature involves some rather intricate choices of parameters, specifically in the spatiotemporal version, where online updates are performed both in space and time. Commonly, the spatiotemporal data sets are recorded in advance, so we can easily replace spatiotemporal online learning by batch optimization. This has the advantage of greatly reducing the number of parameters in the system, and leads to more stable optimization algorithms. We focus on so-called algebraic BSS algorithms [3,5,6,11], reviewed for example in [12], which employ generalized eigenvalue decomposition and joint diagonalization for the factorization. The corresponding learning rules are essentially parameter-free and are known to be robust and efficient [13].

In this contribution, we extend Stone's approach by generalizing the time-decorrelation algorithms to the spatiotemporal case, thereby allowing us to use the inherent spatiotemporal structures of the data. In the experiments presented, we observe good performance of the proposed algorithm when applied to noisy, high-dimensional data sets acquired from functional magnetic resonance imaging (fMRI). We concentrate on fMRI as it is well suited for spatiotemporal decomposition, since spatial activation networks are mixed with functional and structural temporal components.
2 Bl<strong>in</strong>d source separation<br />

We consider the follow<strong>in</strong>g temporal BSS problem: Let x(t) be a second-order<br />

stationary, zero-mean, m-dimensional stochastic process and A a full rank<br />

matrix such that x(t) = As(t) + n(t). The n-dimensional source signals s(t)<br />

are assumed to have diagonal autocorrelations Rτ(s) := 〈s(t + τ)s(t) ⊤ 〉 for<br />

all τ, and the additive noise n(t) is modeled by a stationary, temporally and<br />

spatially white zero-mean process with variance σ 2 . x(t) is observed, and the<br />

goal is to recover A and s(t). Hav<strong>in</strong>g found A, s(t) can be estimated by<br />

A † x(t), which is optimal <strong>in</strong> the maximum-likelihood sense, where A † denotes<br />

the pseudo-<strong>in</strong>verse of A and m ≥ n. So the BSS task reduces to the estimation<br />

of the mix<strong>in</strong>g matrix A.<br />

2


3 Separation based on time-delayed decorrelation

For τ ≠ 0, the mixture autocorrelations factorize¹,

Rτ(x) = A Rτ(s) A⊤.   (1)

This gives an indication of how to recover A from x(t). The correlation of the signal part x̃(t) := As(t) of the mixtures x(t) may be calculated as R0(x̃) = R0(x) − σ²I, provided that the noise variance σ² is known. After whitening of x̃(t), i.e. joint diagonalization of R0(x̃), we can assume that x̃(t) has unit correlation and that m = n, so A is orthogonal². If more signals than sources are observed, dimension reduction can be performed in this step, thus reducing noise [14]. The symmetrized autocorrelation of x(t), R̄τ(x) := (1/2)(Rτ(x) + Rτ(x)⊤), factorizes as well, R̄τ(x) = A R̄τ(s) A⊤, and by assumption R̄τ(s) is diagonal. Hence this factorization represents an eigenvalue decomposition of the symmetric matrix R̄τ(x). If we furthermore assume that R̄τ(x), or equivalently R̄τ(s), has n distinct eigenvalues, then A is already uniquely determined by R̄τ(x) except for column permutation. In addition to this separability result, a BSS algorithm, namely time-delayed decorrelation [3,4], is obtained by the diagonalization of R̄τ(x) after whitening — the diagonalizer yields the desired separating matrix.

However, this decorrelation approach decisively depends on the choice of τ — if an eigenvalue of R̄τ(x) is degenerate, the algorithm fails. Moreover, we face misestimates of R̄τ(x) due to finite-sample effects, so using additional statistics is desirable. Therefore, Belouchrani et al. [5], see also [6], proposed a more robust BSS algorithm, called second-order blind identification (SOBI), jointly diagonalizing a whole set of autocorrelation matrices R̄k(x) with varying time lags, for simplicity indexed by k = 1, . . . , K. They showed that increasing K improves SOBI performance in noisy settings [2]. Algorithm speed decreases linearly with K, so in practice K ranges from 10 to 100. Various numerical techniques for joint diagonalization exist, essentially minimizing Σ_{k=1}^{K} off(A⊤ R̄k(x) A) with respect to A, where off denotes the sum of squares of the off-diagonal terms. A global minimum of this function is called an (approximate) joint diagonalizer³, and it can be determined algorithmically, for example by iterative Givens rotations [13].

¹ s(t) and n(t) can be decorrelated, so Rτ(x) = 〈As(t + τ)s(t)⊤A⊤〉 + 〈n(t + τ)n(t)⊤〉 = ARτ(s)A⊤, where the last equality follows because τ ≠ 0 and n(t) is white.
² By assumption R0(s) = I, hence I = 〈As(t)s(t)⊤A⊤〉 = AR0(s)A⊤ = AA⊤, so A is orthogonal.
³ The case of perfect diagonalization, i.e. the case of a zero-valued minimum, occurs if and only if all matrices that are to be diagonalized commute, which is equivalent to the matrices sharing the same system of eigenvectors.
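To make the procedure concrete, the following sketch (an illustrative re-implementation, not the authors' Matlab package) computes symmetrized autocorrelation matrices of centered, whitened data and jointly diagonalizes them with a Jacobi-type scheme of Givens rotations, in the spirit of [13]:

```python
import numpy as np

def sym_autocorr(X, tau):
    """Symmetrized lag-tau autocorrelation of the rows of X (n x T), X centered."""
    T = X.shape[1]
    R = X[:, tau:] @ X[:, :T - tau].T / (T - tau)
    return (R + R.T) / 2

def joint_diagonalize(Ms, eps=1e-8, max_sweeps=100):
    """Jacobi-type approximate joint diagonalization of symmetric matrices
    by iterative Givens rotations in two coordinates."""
    A = np.stack(Ms).astype(float)         # (K, n, n), rotated in place
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                g1 = A[:, p, p] - A[:, q, q]
                g2 = A[:, p, q] + A[:, q, p]
                ton = g1 @ g1 - g2 @ g2
                toff = 2 * (g1 @ g2)
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > eps:
                    rotated = True
                    Ap, Aq = A[:, :, p].copy(), A[:, :, q].copy()
                    A[:, :, p], A[:, :, q] = c * Ap + s * Aq, c * Aq - s * Ap
                    Ap, Aq = A[:, p, :].copy(), A[:, q, :].copy()
                    A[:, p, :], A[:, q, :] = c * Ap + s * Aq, c * Aq - s * Ap
                    Vp, Vq = V[:, p].copy(), V[:, q].copy()
                    V[:, p], V[:, q] = c * Vp + s * Vq, c * Vq - s * Vp
        if not rotated:
            break
    return V                                # V.T @ M_k @ V is approximately diagonal

def sobi(X, lags=range(1, 11)):
    """SOBI on centered, whitened data X: jointly diagonalize R̄_k for several lags."""
    W = joint_diagonalize([sym_autocorr(X, k) for k in lags])
    return W.T @ X, W                       # estimated sources, orthogonal mixing
```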



4 Spatiotemporal structures

Real-world data sets often possess structure in addition to the simple factorization models treated above. For example, fMRI measurements contain both temporal and spatial indices, so a data entry x = x(r1, r2, r3, t) can depend on the position r := (r1, r2, r3) as well as on the time t. More generally, we want to consider data sets x(r, t) depending on two indices r and t, where r ∈ ℝⁿ can be any multidimensional (spatial) index and t indexes the time axis. In practice this generalized random process is realized by a finite number of samples. For example, in the case of fMRI scans we could assume t ∈ [1 : T] := {1, 2, . . . , T} and r ∈ [1 : h] × [1 : w] × [1 : d], where T is the number of scans of size h × w × d. So the number of spatial observations is ˢm := hwd and the number of temporal observations is ᵗm := T.

4.1 Temporal and spatial separation

For such multi-structured data, two methods of source separation exist. In temporal BSS, we interpret the data to contain a measured time series xr(t) := x(r, t) for each spatial location r. Then our goal is to apply BSS to the temporal observation vector ᵗx(t) := (x_{r111}(t), . . . , x_{r_hwd}(t))⊤ containing ˢm entries, i.e. consisting of ˢm spatial observations. In other words, we want to find a decomposition ᵗx(t) = ᵗA ᵗs(t) with temporal mixing matrix ᵗA and temporal sources ᵗs(t), possibly of lower dimension. This contrasts with so-called spatial BSS, where the data are considered to be composed of T spatial patterns x_t(r) := x(r, t). Spatial BSS tries to decompose the spatial observation vector ˢx(r) := (x_{t1}(r), . . . , x_{tT}(r))⊤ ∈ ℝ^{ᵗm} into ˢx(r) = ˢA ˢs(r) with a spatial mixing matrix ˢA and spatial sources ˢs(r), possibly of lower dimension. In this case, using multidimensional autocorrelations considerably enhances the separation [7,8]. In order to be able to use matrix notation, we contract the spatial multidimensional index r into a one-dimensional index r by row concatenation; the full multidimensional structure will only be needed later in the calculation of the multidimensional autocorrelation. Then the data set x(r, t) =: x_{rt} can be represented by a data matrix X of dimension ˢm × ᵗm, and our goal is to determine a source matrix S, either spatially or temporally.
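For instance, a 4-D fMRI array can be flattened into this ˢm × ᵗm data matrix as follows (a minimal sketch; the array shape, random placeholder values and helper name are assumptions for illustration):

```python
import numpy as np

def to_data_matrix(scans):
    """Flatten a 4-D fMRI array of shape (h, w, d, T) into the data matrix X
    of size (h*w*d) x T: one row per voxel (spatial index r, row-concatenated),
    one column per scan (time index t)."""
    h, w, d, T = scans.shape
    return scans.reshape(h * w * d, T)

# example: 100 scans of size 64 x 64 x 5 (values are placeholders)
X = to_data_matrix(np.random.randn(64, 64, 5, 100))   # X has shape (20480, 100)
```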

4.2 Spatiotemporal matrix factorization

Temporal BSS implies the matrix factorization X = ᵗA ᵗS, whereas spatial BSS implies the factorization X⊤ = ˢA ˢS or equivalently X = ˢS⊤ ˢA⊤. Hence

X = ᵗA ᵗS = ˢS⊤ ˢA⊤.   (2)


So both source separation models can be interpreted as matrix factorization problems; in the temporal case, restrictions such as diagonal autocorrelations are determined by the second factor, in the spatial case by the first one.

In order to achieve a spatiotemporal model, we require these conditions from both factors at the same time. In other words, instead of recovering a single source data set which fulfills the source conditions spatiotemporally, we try to find two source matrices, a spatial and a temporal source matrix, and the conditions are put onto the matrices separately. So the spatiotemporal BSS model can be derived from equation (2) as the factorization problem

X = ˢS⊤ ᵗS   (3)

with spatial source matrix ˢS and temporal source matrix ᵗS, which both have (multidimensional) autocorrelations that are as diagonal as possible. Diagonality of the autocorrelations is invariant under scaling and permutation, so the above model contains these indeterminacies — indeed the spatial and temporal sources can interchange scaling (L) and permutation (P) matrices, ˢS⊤ ᵗS = (L⁻¹P⁻¹ ˢS)⊤(LP ᵗS), and the model assumptions still hold. The spatiotemporal BSS problem as defined in equation (3) has been implicitly proposed in [9], equation (5), in combination with a dimension reduction scheme. Here, we first operate on the general model and derive the cost function based on autodecorrelation, and only later combine this with a dimension reduction method.
5 Algorithmic spatiotemporal BSS<br />

Stone et al. [9] first proposed the model from equation (3), where a jo<strong>in</strong>t energy<br />

function is employed based on mutual entropy and Infomax. Apart from<br />

the many parameters used <strong>in</strong> the algorithm, the <strong>in</strong>volved gradient descent<br />

optimization is susceptible to noise, local m<strong>in</strong>ima and <strong>in</strong>appropriate <strong>in</strong>itializations,<br />

so we propose a novel, more robust algebraic approach <strong>in</strong> the follow<strong>in</strong>g.<br />

It is based on the jo<strong>in</strong>t diagonalization of source conditions posed not only<br />

temporally but also spatially at the same time.<br />

5.1 Spatiotemporal BSS us<strong>in</strong>g jo<strong>in</strong>t diagonalization<br />

Shift<strong>in</strong>g to matrix notation, we <strong>in</strong>terpret ¯ Rk(X) := ¯ Rk( t x(t)) as a symmetrized<br />

temporal autocorrelation matrix, whereas ¯ Rk(X ⊤ ) := ¯ Rk( s x(r)) denotes<br />

the correspond<strong>in</strong>g spatial possibly multidimensional symmetrized autocorrelation<br />

matrix. Here k <strong>in</strong>dexes the one- or multidimensional lags τ. Ap-<br />

5


Application of the spatiotemporal mixing model from equation (3) together with the transformation properties of the Rk's yields

Rk(X) = Rk(ˢS⊤ ᵗS) = ˢS⊤ Rk(ᵗS) ˢS,
Rk(X⊤) = Rk(ᵗS⊤ ˢS) = ᵗS⊤ Rk(ˢS) ᵗS,   (4)

so

R̄k(ᵗS) = ˢS†⊤ R̄k(X) ˢS†,
R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS†   (5)

because *m ≥ n and hence *S *S† = I, where * denotes either s or t. By assumption the matrices R̄k(*S) are as diagonal as possible. Hence we can find one of the two source sets by jointly diagonalizing either R̄k(X) or R̄k(X⊤) for all k. The other source matrix can then be calculated by equation (3). However, we would only be using either temporal or spatial properties, so this corresponds to only temporal or spatial BSS.

In order to include the full spatiotemporal information, we have to find diagonalizers for both R̄k(X) and R̄k(X⊤) such that they satisfy the spatiotemporal model (3). For now, let us assume the (unrealistic) case of ˢm = ᵗm = n — we will deal with the general problem using dimension reduction later. Then all matrices can be assumed to be invertible, and by model (3) we get ˢS⊤ = X ᵗS⁻¹. Applying this to equations (5), together with an inversion of the second equation, yields

R̄k(ᵗS) = ᵗS X† R̄k(X) X†⊤ ᵗS⊤,
R̄k(ˢS)⁻¹ = ᵗS R̄k(X⊤)⁻¹ ᵗS⊤.   (6)

So we can separate the data spatiotemporally by jointly diagonalizing the set of matrices {X† R̄k(X) X†⊤, R̄k(X⊤)⁻¹ | k = 1, . . . , K}.

Hence the goal of achieving spatiotemporal BSS 'as much as possible' means minimizing the joint error term of the above joint diagonalization criterion. Moreover, either spatial or temporal separation can be favored by introducing a weighting factor α ∈ [0, 1]. The set for approximate joint diagonalization is then defined by

{α X† R̄k(X) X†⊤, (1 − α) R̄k(X⊤)⁻¹ | k = 1, . . . , K}.   (7)

If A is a diagonalizer of (7), then the sources can be estimated by ᵗŜ = A⁻¹ and ˢŜ = A⊤X⊤. Joint diagonalization is usually performed by optimizing the off-diagonal criterion from above, so different scale factors in the matrices indeed yield different optima if the diagonalization cannot be achieved fully.


According to equations (6), the higher α, the more temporal separation is stressed. In the limit case α = 1 only the temporal criterion is optimized, so temporal BSS is performed, whereas for α = 0 a spatial BSS is calculated, although we want to remark that, in contrast to the temporal case, the cost function for α = 0 does not equal the spatial SOBI cost function due to the additional inversion. In practice, in order to be able to weight the matrix sets using α appropriately, a normalization by multiplication with a constant separately within the two sets seems appropriate to guarantee equal scales of the two matrix sets.

5.2 Dimension reduction

In principle, we may now use diagonalization of the matrix set from (7) to perform spatiotemporal BSS — but only in the case of equal dimensions. Furthermore, apart from computational issues involving the high dimensionality, the BSS estimate would be poor, simply because in the estimation of either R̄k(X) or R̄k(X⊤) equally many or fewer samples than signals are available. Hence dimension reduction is essential.

Our goal is to extract only n ≪ min{ˢm, ᵗm} sources. A common approach to do so is to approximate X by the reduced singular value decomposition X ≈ UDV⊤ of X, where only the n largest values of the diagonal matrix D and the corresponding columns of the pseudo-orthogonal matrices U and V are used. Plugging this approximation of X into (5) shows after some calculation⁴ that the set of matrices from equation (7) can be rewritten as

{α R̄k(D^{1/2}V⊤), (1 − α) R̄k(D^{1/2}U⊤)⁻¹ | k = 1, . . . , K}.   (8)

If A is a joint diagonalizer of this set, we may estimate the sources by ᵗŜ = A⊤D^{1/2}V⊤ and ˢŜ = A⁻¹D^{1/2}U⊤. We call the resulting algorithm spatiotemporal second-order blind identification or stSOBI, generalizing the temporal SOBI algorithm.

⁴ Using the approximation X ≈ (UD^{1/2})(VD^{1/2})⊤ together with the spatiotemporal BSS model (3) yields (UD^{−1/2})⊤ ˢS⊤ ᵗS (VD^{−1/2}) = I. Hence W := ᵗS V D^{−1/2} is an invertible n × n matrix. The first equation of (6) still holds in the more general case, and we get R̄k(ᵗS) = ᵗS X̂† R̄k(X̂) X̂†⊤ ᵗS⊤ = W R̄k(D^{1/2}V⊤) W⊤. The second equation of (6) cannot hold for n < *m, but we can derive a similar result from (5), where we use W⁻¹ = D^{−1/2}V⊤ ᵗS†: R̄k(ˢS) = ᵗS†⊤ R̄k(X⊤) ᵗS† = W^{−⊤} R̄k(D^{1/2}U⊤) W⁻¹, which we can now invert to get R̄k(ˢS)⁻¹ = W R̄k(D^{1/2}U⊤)⁻¹ W⊤.
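A condensed sketch of the stSOBI construction of equation (8) (illustrative assumptions: the data matrix is centered, one-dimensional spatial lags are used instead of multidimensional autocorrelations, and the normalization of the two matrix sets mentioned above is omitted; the joint diagonalizer A is assumed to be computed by any JD routine, e.g. iterative Givens rotations, and is passed in from outside):

```python
import numpy as np

def sym_autocorr(X, tau):
    """Symmetrized lag-tau autocorrelation of the rows of X (assumed centered)."""
    T = X.shape[1]
    R = X[:, tau:] @ X[:, :T - tau].T / (T - tau)
    return (R + R.T) / 2

def stsobi_matrix_set(X, n, lags=range(1, 11), alpha=0.5):
    """Dimension-reduced stSOBI matrix set of equation (8).

    X: (sm x tm) data matrix.  Returns the weighted matrices to be jointly
    diagonalized, plus the factors D^(1/2) V^T and D^(1/2) U^T."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U, d, Vt = U[:, :n], d[:n], Vt[:n, :]          # reduced SVD, n components
    Dt = np.diag(np.sqrt(d)) @ Vt                  # D^(1/2) V^T  (temporal part)
    Ds = np.diag(np.sqrt(d)) @ U.T                 # D^(1/2) U^T  (spatial part)
    temporal = [alpha * sym_autocorr(Dt, k) for k in lags]
    spatial = [(1 - alpha) * np.linalg.inv(sym_autocorr(Ds, k)) for k in lags]
    return temporal + spatial, Dt, Ds

def stsobi_sources(A, Dt, Ds):
    """Given a joint diagonalizer A of the matrix set, estimate the sources
    as in the text: tS = A^T D^(1/2) V^T and sS = A^(-1) D^(1/2) U^T."""
    return A.T @ Dt, np.linalg.inv(A) @ Ds
```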

(a) component maps ˢS (components 1–4)  (b) time courses ᵗS (crosscorrelations with the stimulus: 1: cc = 0.05, 2: cc = 0.17, 3: cc = 0.14, 4: cc = 0.89)

Fig. 1. fMRI analysis using stSOBI with temporal and two-dimensional spatial autocorrelations. The data was reduced to the 4 largest components. (a) shows the recovered component maps (the brain background is given using a structural scan; overlayed white points indicate activation values stronger than 3 standard deviations), and (b) their time courses. Component 3 partially contains the frontal eye fields. Component 4 is the desired stimulus component, which is mainly active in the visual cortex; its time course closely follows the on-off stimulus (indicated by the gray boxes) — their crosscorrelation lies at cc = 0.89 — with a delay of roughly 6 seconds induced by the BOLD effect.

5.3 Implementation

In the experiments we use stSOBI with both one-dimensional and multidimensional autocovariances. Our software package⁵ implements all the details of mdSOBI and its extension stSOBI in Matlab. In addition to Cardoso's joint diagonalization algorithm based on iterative Givens rotations, the package contains all the files needed to reproduce the results described in this paper, with the exception of the fMRI data set.

6 Results

BSS, mainly based on ICA, is nowadays a quite common tool in fMRI analysis [15,16]. For this work, we analyzed the performance of stSOBI when applied to fMRI measurements. fMRI data were recorded from 10 healthy subjects performing a visual task. 100 scans (TR/TE = 3000/60 ms) with 5 slices each were acquired with 5 periods of rest and 5 photic stimulation periods.

⁵ available online at http://fabian.theis.name/


Stimulation and rest periods comprised 10 repetitions each, i.e. 30 s. The resolution was 3 × 3 × 4 mm. The slices were oriented parallel to the calcarine fissure. Photic stimulation was performed using an 8 Hz alternating checkerboard stimulus with a central fixation point; a dark background with a central fixation point was shown during the control periods. The first scans were discarded for remaining saturation effects. Motion artifacts were compensated by automatic image alignment [17]. For visualization, we only considered a single slice (non-brain areas were masked out), and chose to reduce the data set to n = 4 components by singular value decomposition.

6.1 Single subject analysis

In the joint diagonalization, K = 10 autocorrelation matrices were used, both for spatial and temporal decorrelation. Figure 1 shows the performance of the algorithm for equal spatiotemporal weighting α = 0.5. Although the data was reduced to only 4 components, stSOBI was able to extract the stimulus component (#4) very well; the crosscorrelation of the identified task component with the time-delayed stimulus is high (cc = 0.89). Some additional brain components are detected, although a higher n would allow for more elaborate decompositions.

In order to compare spatial and temporal models, we applied stSOBI with varying spatiotemporal weighting factors α ∈ {0, 0.1, . . . , 1}. The task component was always extracted, although with different quality. In figure 2, we plot the maximal crosscorrelation of the time courses with the stimulus versus α. If only spatial separation is performed, the identified stimulus component is considerably worse (cc = 0.8) than in the case of temporal recovery (cc = 0.9); the component maps coincide rather well. The enhanced extraction confirms the advantages of spatiotemporal separation in contrast to the commonly used spatial-only separation. Temporal separation alone, although preferable in the presented example, often faces the problem of high dimensions and low sample numbers, so an adjustable weighting α as proposed here allows for the highest flexibility.

6.2 Algorithm comparison

We then compared our analysis with some established algorithms for fMRI analysis. In order to numerically perform the comparisons, we determined the single component that is maximally autocorrelated with the known stimulus task. These components are shown in figure 3.


(axes: crosscorrelation with stimulus versus α; insets: ˢS for α = 1 and ˢS for α = 0)

Fig. 2. Performance of stSOBI for varying α. Low α favors spatial separation, high α temporal separation. Two recovered component maps are plotted for the extremal cases of spatial (α = 0) and temporal (α = 1) separation.

After some difficulties due to the many possible parameters, Stone's stICA algorithm [9] was applied to the data. However, the task component could not be recovered very well — it showed some activity in the visual cortex, but with a rather low temporal crosscorrelation of 0.53 with the stimulus component, which is much lower than the 0.9 of the multidimensional stSOBI and the 0.86 of stSOBI with one-dimensional autocovariances. We believe that this is due to convergence problems of the employed Infomax rule, and to the non-trivial tuning of the many parameters involved in the algorithm. In order to test for convergence issues, we combined stSOBI and stICA by applying Stone's local stICA algorithm to the stSOBI separation results. Due to this careful initialization, the stICA result improved (crosscorrelation of 0.58) but was still considerably lower than the stSOBI result.

Similar results were achieved by the well-known FastICA algorithm [18], which we applied in order to identify spatially independent components. The algorithm could not recover the stimulus component (maximal crosscorrelation of 0.51, and no activity in the visual cortex). This poor result is due to the dimension reduction to only 4 components, and coincides with the decreased performance of stSOBI in the spatial case α = 0. In this respect, the spatiotemporal model is obviously much more flexible, as spatiotemporal dimension reduction is able to capture the structure better than only spatial reduction.

Finally, we tested the robustness of the spatiotemporal framework by modifying the cost function. It is well known that sources with varying source properties can be separated by modifying the source condition matrices.


(rows, top to bottom: stimulus, stNSS, stSOBI (1D), stICA after stSOBI, stICA, fastICA)

Fig. 3. Comparison of the recovered component that is maximally autocrosscorrelated with the stimulus task (top) for various BSS algorithms, after dimension reduction to 4 components. The absolute corresponding autocorrelations are 0.84 (stNSS), 0.91 (stSOBI with one-dimensional autocorrelations), 0.58 (stICA applied to the separation provided by stSOBI), 0.53 (stICA) and 0.51 (fastICA).

Instead of calculating autocovariance matrices, other statistics of the spatial and temporal sources can be used, as long as they satisfy the factorization from equation (1). This results in 'algebraic BSS' algorithms such as AMUSE [3], JADE [11], SOBI and TDSEP [5,6], reviewed for instance in [12]. Instead of performing autodecorrelation, we used the idea of the NSS-SD algorithm ('non-stationary sources with simultaneous diagonalization') [19], cf. [20]: the sources were assumed to be spatiotemporal realizations of non-stationary random processes *si(t), with * ∈ {t, s} determining the temporal or spatial direction. If we assume that the resulting covariance matrices C(*s(t)) vary sufficiently with time, the factorization of equation (1) also holds for these covariance matrices. Hence, joint diagonalization of

{C(ᵗx(1)), C(ˢx(1)), C(ᵗx(2)), C(ˢx(2)), . . .}

allows for the calculation of the mixing matrix. The covariance matrices are commonly estimated in separate non-overlapping temporal or spatial windows. Replacing the autocovariances in (8) by the windowed covariances, this results in the spatiotemporal NSS-SD or stNSS algorithm.
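A small sketch of such windowed covariance estimates (illustrative only; the default number of windows and the helper name are assumptions), which can be plugged into (8) in place of the lagged autocovariances:

```python
import numpy as np

def windowed_covariances(X, n_windows=12):
    """Covariance matrices of the rows of X estimated in non-overlapping
    windows along the columns (the source condition used by NSS-SD); the same
    routine can be applied to the spatial factor with windows along the
    flattened spatial index."""
    T = X.shape[1]
    edges = np.linspace(0, T, n_windows + 1).astype(int)
    covs = []
    for a, b in zip(edges[:-1], edges[1:]):
        W = X[:, a:b]
        W = W - W.mean(axis=1, keepdims=True)   # center within the window
        covs.append(W @ W.T / (b - a))
    return covs
```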

(a) comparison of separation performance and computational effort for 10 subjects  (b) separation after subsampling  (c) computation time after subsampling

Fig. 4. Multiple subject comparison. (a) shows the algorithm performance in terms of separation quality (autocorrelation with stimulus) and computation time when compared over 100 runs and 10 subjects. (b) and (c) compare these indices after subsampling the data spatially with varying percentages.

and 12 windows (both temporally and spatially). Although the data exhibited only weak non-stationarities (the mean masked voxel values vary from 983 to 1000 over the 98 time steps, with a standard deviation varying from 228 to 234), the task component could be extracted rather well with a crosscorrelation of 0.80, see figure 3. Similarly, by replacing the autocorrelations with other source conditions [12], we can easily construct alternative separation algorithms.
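To make the construction concrete, the following short sketch (an editorial illustration added here, not the authors' Matlab implementation; all names such as windowed_covariances are made up) builds the stack of windowed covariance matrices of the dimension-reduced data in both the temporal and the spatial direction. A Jacobi-type joint diagonalizer, e.g. the one of [13], would then be applied to this stack.

```python
import numpy as np

def windowed_covariances(Z, n_windows):
    # Z: (n_components x n_samples); one covariance matrix of the rows of Z
    # per non-overlapping window along the sample axis
    covs = []
    for block in np.array_split(Z, n_windows, axis=1):
        block = block - block.mean(axis=1, keepdims=True)
        covs.append(block @ block.T / block.shape[1])
    return covs

# toy stand-in for a (time points x voxels) fMRI data matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((98, 5000))

n, K = 4, 12                                   # components and windows
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Zt = U[:, :n].T                                # n temporal signals (length 98)
Zs = Vt[:n, :]                                 # n spatial signals (length 5000)

matrices = windowed_covariances(Zt, K) + windowed_covariances(Zs, K)
# joint diagonalization of `matrices` (e.g. with Cardoso-Souloumiac Jacobi
# angles) yields the demixing transform in the reduced space
```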

6.3 Multiple subject analysis<br />

We f<strong>in</strong>ish by analyz<strong>in</strong>g the performance of the stSOBI algorithm for multiple<br />

subjects. As before, we applied stSOBI with dimension reduction to only n = 4 sources. Here K = 12; for simplicity, one-dimensional autocovariance matrices were used, both spatially and temporally. We masked the data using a fixed common threshold. In order to quantify algorithm performance, as before we determined the spatiotemporal source that had a time course with maximal autocorrelation with the stimulus protocol, and compared this autocorrelation. In figure 4(a), we show a boxplot of the autocorrelations together with the needed computational effort. The median autocorrelation was very high at 0.89. The separation was fast, with a mean computation time of 0.25 s on a 1.7 GHz Intel Dual Core laptop running Matlab. In order to confirm the robustness of the algorithm, we analyzed the sample-size dependence of the method by running stSOBI on subsampled data sets. The bootstrapping was performed spatially with repetition, but without reordering of the samples, in order to maintain spatial dependencies. Figure 4(b-c) shows the algorithm performance when varying the subsampling percentage from 1 to 200 percent, where the statistics were done over 100 runs and over the 10 subjects. Even when using only 1 percent of the samples, we achieved a median autocorrelation of 0.66, which increased at a subsampling percentage of 10% to an already acceptable value of 0.85. This confirms the robustness and efficiency of the proposed method, which of course stems from the underlying robust optimization method of joint diagonalization.
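As an illustration of how such a subsampling experiment can be set up (a minimal sketch only, not the authors' code; the data matrix and function names are placeholders), the following draws spatial samples with replacement at a given percentage of the original sample size while keeping the drawn indices sorted, so that the spatial ordering is preserved:

```python
import numpy as np

def spatial_subsample(X, percentage, rng):
    # X: (time points x voxels); draw voxel columns with replacement at the
    # given percentage of the original count, keeping the indices sorted so
    # that the spatial ordering (and hence spatial dependencies) is preserved
    n_vox = X.shape[1]
    n_draw = max(1, int(round(n_vox * percentage / 100.0)))
    idx = np.sort(rng.integers(0, n_vox, size=n_draw))
    return X[:, idx]

rng = np.random.default_rng(1)
X = rng.standard_normal((98, 5000))      # placeholder for masked fMRI data
for pct in (1, 2, 5, 10, 20, 50, 100, 200):
    Xsub = spatial_subsample(X, pct, rng)
    # run stSOBI on Xsub and record the maximal autocorrelation of the
    # recovered time courses with the stimulus; repeating this 100 times per
    # subject gives the statistics shown in figure 4(b-c)
```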

7 Conclusion<br />

We have proposed a novel spatiotemporal BSS algorithm named stSOBI. It is based on the joint diagonalization of both spatial and temporal autocorrelations. Sharing the properties of all algebraic algorithms, stSOBI is easy to use, robust (with only a single parameter) and fast (in contrast to the online algorithm proposed by Stone). The employed dimension reduction allows for the spatiotemporal decomposition of high-dimensional data sets such as fMRI recordings. The presented results for such data sets show that stSOBI clearly outperforms spatial-only recovery and Stone's spatiotemporal algorithm. Moreover, the proposed algorithm is not limited to second-order statistics, but can easily be extended to spatiotemporal ICA, for example by jointly diagonalizing both spatial and temporal cumulant matrices.

Acknowledgments<br />

The authors gratefully acknowledge partial financial support by the DFG (GRK 638) and the BMBF (project 'ModKog'). They would like to thank D. Auer from the MPI of Psychiatry in Munich, Germany, for providing the fMRI data, and A. Meyer-Bäse from the Department of Electrical and Computer Engineering, FSU, Tallahassee, USA, for discussions concerning the fMRI analysis. The authors thank the anonymous reviewers for their helpful comments during preparation of this manuscript.

References<br />

[1] A. Hyvärinen, J. Karhunen, E. Oja, Independent component analysis, John Wiley & Sons, 2001.

[2] A. Cichocki, S. Amari, Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g, John Wiley<br />

& Sons, 2002.<br />

[3] L. Tong, R.-W. Liu, V. Soon, Y.-F. Huang, Indeterm<strong>in</strong>acy and identifiability<br />

of bl<strong>in</strong>d identification, IEEE Transactions on Circuits and Systems 38 (1991)<br />

499–509.<br />

[4] L. Molgedey, H. Schuster, Separation of a mixture of <strong>in</strong>dependent signals us<strong>in</strong>g<br />

time-delayed correlations, Physical Review Letters 72 (23) (1994) 3634–3637.<br />

[5] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, E. Moul<strong>in</strong>es, A bl<strong>in</strong>d source<br />

separation technique based on second order statistics, IEEE Transactions on<br />

Signal Process<strong>in</strong>g 45 (2) (1997) 434–444.<br />

[6] A. Ziehe, K.-R. Mueller, TDSEP – an efficient algorithm for bl<strong>in</strong>d separation<br />

us<strong>in</strong>g time structure, <strong>in</strong>: L. Niklasson, M. Bodén, T. Ziemke (Eds.), Proc. of<br />

ICANN’98, Spr<strong>in</strong>ger Verlag, Berl<strong>in</strong>, Skövde, Sweden, 1998, pp. 675–680.<br />

[7] H. Schöner, M. Stetter, I. Schießl, J. Mayhew, J. Lund, N. McLoughl<strong>in</strong>,<br />

K. Obermayer, Application of bl<strong>in</strong>d separation of sources to optical record<strong>in</strong>g<br />

of bra<strong>in</strong> activity, <strong>in</strong>: Proc. NIPS 1999, Vol. 12, MIT Press, 2000, pp. 949–955.<br />

[8] F. Theis, A. Meyer-Bäse, E. Lang, Second-order bl<strong>in</strong>d source separation based<br />

on multi-dimensional autocovariances, <strong>in</strong>: Proc. ICA 2004, Vol. 3195 of LNCS,<br />

Spr<strong>in</strong>ger, Granada, Spa<strong>in</strong>, 2004, pp. 726–733.<br />

[9] J. Stone, J. Porrill, N. Porter, I. Wilk<strong>in</strong>son, Spatiotemporal <strong>in</strong>dependent<br />

component analysis of event-related fMRI data using skewed probability density

functions, NeuroImage 15 (2) (2002) 407–421.<br />

[10] A. Bell, T. Sejnowski, An <strong>in</strong>formation-maximisation approach to bl<strong>in</strong>d<br />

separation and bl<strong>in</strong>d deconvolution, Neural Computation 7 (1995) 1129–1159.<br />

[11] J. Cardoso, A. Souloumiac, Bl<strong>in</strong>d beamform<strong>in</strong>g for non gaussian signals, IEE<br />

Proceed<strong>in</strong>gs - F 140 (6) (1993) 362–370.<br />

[12] F. Theis, Y. Inouye, On the use of jo<strong>in</strong>t diagonalization <strong>in</strong> bl<strong>in</strong>d signal<br />

process<strong>in</strong>g, <strong>in</strong>: Proc. ISCAS 2006, Kos, Greece, 2006.<br />




[13] J. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization,<br />

SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.<br />

[14] M. Joho, H. Mathis, R. Lamber, Overdeterm<strong>in</strong>ed bl<strong>in</strong>d source separation: us<strong>in</strong>g<br />

more sensors than source signals <strong>in</strong> a noisy mixture, <strong>in</strong>: Proc. of ICA 2000,<br />

Hels<strong>in</strong>ki, F<strong>in</strong>land, 2000, pp. 81–86.<br />

[15] M. McKeown, T. Jung, S. Makeig, G. Brown, S. K<strong>in</strong>dermann, A. Bell,<br />

T. Sejnowski, <strong>Analysis</strong> of fMRI data by bl<strong>in</strong>d separation <strong>in</strong>to <strong>in</strong>dependent<br />

spatial components, Human Bra<strong>in</strong> Mapp<strong>in</strong>g 6 (1998) 160–188.<br />

[16] I. Keck, F. Theis, P. Gruber, E. Lang, K. Specht, C. Puntonet, 3D spatial<br />

analysis of fMRI data on a word perception task, <strong>in</strong>: Proc. ICA 2004, Vol. 3195<br />

of LNCS, Spr<strong>in</strong>ger, Granada, Spa<strong>in</strong>, 2004, pp. 977–984.<br />

[17] R. Woods, S. Cherry, J. Mazziotta, Rapid automated algorithm for align<strong>in</strong>g<br />

and reslicing PET images, Journal of Computer Assisted Tomography 16 (1992)

620–633.<br />

[18] A. Hyvär<strong>in</strong>en, E. Oja, A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent component<br />

analysis, Neural Computation 9 (1997) 1483–1492.<br />

[19] S. Choi, A. Cichocki, Blind separation of nonstationary sources in noisy mixtures, Electronics Letters 36 (9) (2000) 848–849.

[20] D.-T. Pham, J. Cardoso, Bl<strong>in</strong>d separation of <strong>in</strong>stantaneous mixtures of<br />

nonstationary sources, IEEE Transactions on Signal Process<strong>in</strong>g 49 (9) (2001)<br />

1837–1848.<br />



Chapter 10<br />

IEEE TNN 16(4):992-996, 2005<br />

Paper P. Georgiev, F.J. Theis, and A. Cichocki. Sparse component analysis and<br />

bl<strong>in</strong>d source separation of underdeterm<strong>in</strong>ed mixtures. IEEE Transactions on<br />

Neural Networks, 16(4):992-996, 2005<br />

Reference (Georgiev et al., 2005c)<br />

Summary <strong>in</strong> section 1.4.1<br />






Sparse <strong>Component</strong> <strong>Analysis</strong> and Bl<strong>in</strong>d Source Separation<br />

of Underdeterm<strong>in</strong>ed Mixtures<br />

Pando Georgiev, Fabian Theis, and Andrzej Cichocki<br />

Abstract—In this letter, we solve the problem of identifying matrices S and A knowing only their multiplication X = AS, under some conditions, expressed either in terms of A and sparsity of S (identifiability conditions), or in terms of X (sparse component analysis (SCA) conditions). We present algorithms for such identification and illustrate them by examples.

Index Terms—Blind source separation (BSS), sparse component analysis (SCA), underdetermined mixtures.

I. INTRODUCTION

One of the fundamental questions in data analysis, signal processing, data mining, neuroscience, etc. is how to represent a large data set X (given in form of an (m × N)-matrix) in different ways. A simple approach is a linear matrix factorization

X = AS,   A ∈ R^(m×n), S ∈ R^(n×N),   (1)

where the unknown matrices A (dictionary) and S (source signals) have some specific properties, for instance:
1) the rows of S are (discrete) random variables, which are statistically independent as much as possible—this is the independent component analysis (ICA) problem;
2) S contains as many zeros as possible—this is the sparse representation or sparse component analysis (SCA) problem;
3) the elements of X, A, and S are nonnegative—this is nonnegative matrix factorization (NMF) [8].
There is a large amount of papers devoted to ICA problems [2], [5], but mostly for the case m ≥ n. We refer to [1], [6], [7], and [9]–[11] for some recent papers on SCA and underdetermined ICA (m < n).
A related problem is the so-called blind source separation (BSS) problem, in which we know a priori that a representation such as in (1) exists and the task is to recover the sources (and the mixing matrix) as accurately as possible. A fundamental property of the complete BSS problem is that such a recovery (under the assumptions in 1) and non-Gaussianity of the sources) is possible up to permutation and scaling of the sources, which makes the BSS problem so attractive.
In this letter, we consider SCA and BSS problems in the underdetermined case (m < n, i.e., more sources than sensors, which is a more challenging problem), where the additional information compensating for the limited number of sensors is the sparseness of the sources. It should be noted that this problem is quite general and fundamental, since the sources need not be sparse in the time domain. It would be sufficient to find a linear transformation (e.g., wavelet packets) in which the sources are sufficiently sparse.
In the sequel, we present new algorithms for solving the BSS problem: a matrix identification algorithm and a source recovery algorithm, under the condition that the source matrix S has at most m − 1 nonzero elements in each column and that the identifiability conditions

Manuscript received November 20, 2003; revised July 25, 2004.<br />

P. Georgiev is with ECECS Department, University of C<strong>in</strong>c<strong>in</strong>nati, C<strong>in</strong>c<strong>in</strong>nati,<br />

OH 45221 USA (e-mail: pgeorgie@ececs.uc.edu).<br />

F. Theis is with the Institute of Biophysics, University of Regensburg,<br />

D-93040 Regensburg, Germany (e-mail: fabian@theis.name).<br />

A. Cichocki is with the Laboratory for Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g,<br />

Bra<strong>in</strong> Science Institute, The Institute of Physical and Chemical Research<br />

(RIKEN), Saitama 351-0198, Japan (e-mail: cia@bsp.bra<strong>in</strong>.riken.jp).<br />

Digital Object Identifier 10.1109/TNN.2005.849840<br />





are satisfied (see Theorem 1). When the sources are locally very sparse (see condition i) of Theorem 2), the matrix identification algorithm is much simpler. We used this simpler form for separation of mixtures of images. After a sparsification transformation (which is in fact an appropriate wavelet transformation), the algorithm works perfectly in the complete case. We demonstrate the effectiveness of our general matrix identification algorithm and the source recovery algorithm in the underdetermined case for 7 artificially created sparse source signals, such that the source matrix S has at most 2 nonzero elements in each column, mixed with a randomly generated (3 × 7) matrix. For a comparison, we present a recovery using ℓ1-norm minimization [3], [4], which gives signals that are far from the original ones. This implies that the conditions which ensure equivalence of ℓ1-norm and ℓ0-norm minimization [4, Theorem 7] are generally not satisfied for randomly generated matrices. Note that ℓ1-norm minimization gives solutions which have at most m nonzeros [3], [4]. Another connection with [4] is the fact that our algorithm for source recovery works "with probability one," i.e., for almost all data vectors x (in the measure sense) such that the system x = As has a sparse solution with less than m nonzero elements, this solution is unique, while in [4] the authors proved that for all data vectors x such that the system x = As has a sparse solution with less than spark(A)/2 nonzero elements, this solution is unique. Note that spark(A) ≤ m + 1, where spark(A) is the smallest number of linearly dependent columns of A.
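For readers who want to reproduce the ℓ1-comparison, here is a minimal sketch of ℓ1-norm minimization with a known mixing matrix (an editorial illustration using SciPy's linear-programming routine, not the code used in the letter): s is split into nonnegative parts u and v so that minimizing ||s||_1 subject to As = x becomes a linear program.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, x):
    # minimize ||s||_1 subject to A s = x, via s = u - v with u, v >= 0
    m, n = A.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v

# toy underdetermined example: 3 mixtures of 7 sources, 2 active sources
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 7))
s_true = np.zeros(7)
s_true[[1, 4]] = rng.standard_normal(2)
x = A @ s_true
s_l1 = l1_recover(A, x)
print(np.round(s_l1, 3))   # the l1 solution has at most m nonzeros and need
                           # not coincide with the sparsest solution s_true
```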

II. BLIND SOURCE SEPARATION

In this section, we develop a method for completely solving the BSS problem if the following assumptions are satisfied:
A1) the mixing matrix A ∈ R^(m×n) has the property that any square m × m submatrix of it is nonsingular;
A2) each column of the source matrix S has at most m − 1 nonzero elements;
A3) the sources are sufficiently rich represented in the following sense: for any index set of n − m + 1 elements I = {i1, ..., i_(n−m+1)} ⊂ {1, ..., n} there exist at least m column vectors of the matrix S such that each of them has zero elements in places with indexes in I and each m − 1 of them are linearly independent.

A. Matrix Identification

We describe conditions in the sparse BSS problem under which we can identify the mixing matrix uniquely up to permutation and scaling of the columns. We give two types of such conditions. The first one corresponds to the least sparse case in which such identification is possible. Further, we consider the most sparse case (for a small number of samples), as in this case the algorithm is much simpler.

1) General Case—Full Identifiability:

Theorem 1 (Identifiability Conditions—General Case): Assume that in the representation X = AS the matrix A satisfies condition A1), the matrix S satisfies conditions A2) and A3), and only the matrix X is known. Then the mixing matrix A is identifiable uniquely up to permutation and scaling of the columns.

Proof: It is clear that any column a_j of the mixing matrix lies in the intersection of all (n−1 choose m−2) hyperplanes generated by those columns of A in which a_j participates. We will show that these hyperplanes can be obtained by the columns of the data X under the conditions of the theorem. Let J be the set of all subsets of {1, ..., n} containing m − 1 elements and let J ∈ J; note that J consists of (n choose m−1) elements. We will show that the hyperplane (denoted by H_J) generated by the columns of A with indexes from J can be obtained by some columns of X. By A2) and A3), there exist m indexes {t_k}, k = 1, ..., m, in {1, ..., N} such that any m − 1 vector columns of {S(:, t_k)} form a basis of the (m−1)-dimensional coordinate subspace of R^n with zero coordinates given by {1, ..., n} \ J. Because of the mixing model, the vectors x_k = AS(:, t_k), k = 1, ..., m, belong to the data matrix X. Now, by condition A1) it follows that any m − 1 of the vectors {x_k} are linearly independent, which implies that they span the same hyperplane H_J. By A1) and the above, it follows that we can cluster the columns of X in (n choose m−1) groups H_j, j = 1, ..., (n choose m−1), uniquely such that each group H_j contains at least m elements and they span one hyperplane H_J for some J ∈ J. Now we cluster the hyperplanes obtained in such a way in the smallest number of groups such that the intersection of all hyperplanes in each group gives a single one-dimensional (1-D) subspace. It is clear that such a 1-D subspace will contain one column of the mixing matrix; the number of these groups is n, and each group consists of (n−1 choose m−2) hyperplanes.

The proof of this theorem gives the idea for the matrix identification algorithm.

Algorithm for Identification of the Mixing Matrix:
1) Cluster the columns of X in (n choose m−1) groups H_j, j = 1, ..., (n choose m−1), such that the span of the elements of each group H_j produces one hyperplane and these hyperplanes are different.
2) Cluster the normal vectors to these hyperplanes in the smallest number of groups G_i, i = 1, ..., n (which gives the number of sources n), such that the normal vectors to the hyperplanes in each group G_i lie in a new hyperplane Ĥ_i.
3) Calculate the normal vector â_i to each hyperplane Ĥ_i, i = 1, ..., n. Note that the 1-D subspace spanned by â_i is the intersection of all hyperplanes in G_i. The matrix Â with columns â_i is an estimation of the mixing matrix (up to permutation and scaling of the columns).

2) Degenerate Case—Sparse Instances:

Theorem 2 (Identifiability Conditions—Locally Very Sparse Representation): Assume that the number of sources is unknown and the following hold:
i) for each index i = 1, ..., n there are at least two columns of S, S(:, j1) and S(:, j2), which have nonzero elements only in position i (so each source is uniquely present at least twice);
ii) X(:, j) ≠ cX(:, k) for any c ∈ R, any j = 1, ..., N and any k = 1, ..., N, k ≠ j, for which S(:, k) has more than one nonzero element.
Then the number of sources and the matrix A are identifiable uniquely up to permutation and scaling.

Proof: We cluster in groups all nonzero normalized column vectors of X such that each group consists of vectors which differ only by sign. From conditions i) and ii), it follows that the number of the groups containing more than one element is precisely the number of sources n, and that each such group will represent a normalized column of A (up to sign).

In the following, we include an algorithm for identification of the mixing matrix based on Theorem 2.

Algorithm for Identification of the Mixing Matrix in the Very Sparse Case:
1) Remove all zero columns of X (if any) and obtain a matrix X1 ∈ R^(m×N1).
2) Normalize the columns x_j, j = 1, ..., N1, of X1: x_j := x_j/||x_j||, and set ε > 0. Multiply each column x_j by −1 if its first element is negative.
3) Cluster the columns x_j, j = 1, ..., N1, into groups G_1, G_2, ... such that ||x_j − x_k|| < ε for all x_j, x_k belonging to the same group, and ||x_j − x_k|| ≥ ε for any x_j, x_k belonging to different groups.
4) Choose any x_j ∈ G_i and put â_i = x_j. The matrix Â with columns {â_i} is an estimation of the mixing matrix, up to permutation and scaling.
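A compact sketch of this very sparse identification step is given below (our own illustration, not the authors' implementation; the tolerance, data sizes and function name are made up). It sign-normalizes the data columns, groups columns that agree up to a tolerance, and keeps one representative per group that occurs at least twice:

```python
import numpy as np

def identify_mixing_very_sparse(X, eps=1e-6):
    # keep nonzero columns, normalize them and fix the sign of the first entry
    cols = [x for x in X.T if np.linalg.norm(x) > eps]
    cols = [x / np.linalg.norm(x) for x in cols]
    cols = [-x if x[0] < 0 else x for x in cols]
    # greedily group columns that agree up to eps; groups with more than one
    # member correspond (by Theorem 2) to columns of the mixing matrix
    groups = []
    for x in cols:
        for g in groups:
            if np.linalg.norm(x - g[0]) < eps:
                g.append(x)
                break
        else:
            groups.append([x])
    return np.array([g[0] for g in groups if len(g) > 1]).T

# toy example: 3 mixtures of 4 sources with many 1-sparse columns
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
S = np.zeros((4, 200))
S[rng.integers(0, 4, 200), np.arange(200)] = rng.standard_normal(200)
A_hat = identify_mixing_very_sparse(A @ S)
print(A_hat.shape)   # columns estimate those of A up to permutation and scale
```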

We should mention that the very sparse case in different settings has already been considered in the literature, but in a more restrictive sense. In [6], the authors suppose that the supports of the Fourier transforms of any two source signals are disjoint sets—a much more restrictive condition than ours. In [1], the authors suppose that for any source there exists a time-frequency window where only this source is nonzero and that the time-frequency transform of each source is not constant on any time-frequency window. We would like to mention that their condition should also include the case when the time-frequency transforms of any two sources are not proportional in any time-frequency window. Such a quantitative condition (without frequency representation) is presented in our Theorem 2, condition ii).

Fig. 1. Original images.
Fig. 2. Mixed (observed) images.
Fig. 3. Estimated normalized images using the estimated matrix. The signal-to-noise ratios with the sources from Fig. 1 are 232, 239, and 228 dB, respectively.

B. Identification of Sources<br />

Theorem 3: (Uniqueness of Sparse Representation): Let r bethe<br />

set of all � P � such that the l<strong>in</strong>ear system e� a � has a solution<br />

with at least � 0 � CIzero components. If e fulfills A1), then there<br />

exists a subset rH & rwith measure zero with respect to r, such<br />

that for every � Pr�rHthis system has no other solution with this<br />

property.<br />

�<br />

Proof: Obviously r is theunion of all<br />

� 0 I a@�3Aa@@�0<br />

These coefficients are uniquely determ<strong>in</strong>ed if ��� does not<br />

belong to the set rH with measure zero with respect to r<br />

2.3)<br />

(see Theorem 3);<br />

Constructthesolution �� a ƒ@XY�A:itconta<strong>in</strong>s!�Y� <strong>in</strong>theplace<br />

�� for � aIYFFFY�0 I, the other its components are zero.<br />

III. SCA<br />

IA3@�0�CIA3 hyperplanes, produced by tak<strong>in</strong>g the l<strong>in</strong>ear hull of every<br />

subsets of the columns of e with � 0 I elements. Let rH betheunion<br />

of all <strong>in</strong>tersections of any two such subspaces. Then rH has a measure<br />

zero <strong>in</strong> r and satisfies the conclusion of the theorem. Indeed, assume<br />

that � P r�rHand e� a e"� a �, where � and "� haveat least<br />

� 0 � CIzeros. S<strong>in</strong>ce � TP rHY� belongs to only one hyperplane<br />

produced as a l<strong>in</strong>ear hull of some � 0 I columns —� Y FFFY —� of<br />

e. It means that the vectors � and "� have � 0 � CIzeros <strong>in</strong> places<br />

with <strong>in</strong>dexes <strong>in</strong> �IY FFFY�����IY FFFY��0I�. Now from theequation<br />

e@�0"�A aHit follows that the �0I vector columns —� Y FFFY —�<br />

of e are l<strong>in</strong>early dependent, which is a contradiction with A1).<br />

From Theorem 3 it follows that the sources are identifiable generically,<br />

i.e., up to a set with a measure zero, if they have level of sparse-<br />

In this section, we develop a method for the complete solution of the<br />

SCA problem. Now the conditions are formulated only <strong>in</strong> terms of the<br />

data matrix ˆ.<br />

Theorem 4: (SCA Conditions): Assumethat � � x and the<br />

matrix ˆ P<br />

ness grater than or equal to � 0�CI, and themix<strong>in</strong>g matrix is known.<br />

In the follow<strong>in</strong>g, we present an algorithm, based on the observation <strong>in</strong><br />

Theorem 3.<br />

Source Recovery Algorithm:<br />

1) Identify thetheset of �-codimensional subspaces r produced<br />

by tak<strong>in</strong>g the l<strong>in</strong>ear hull of every subsets of the columns of e<br />

with � 0 I elements.<br />

2) Repeat for � a 1toxX 2.1) Identify the space r Prconta<strong>in</strong><strong>in</strong>g �� Xa ˆ@XY�A, or,<strong>in</strong><br />

practical situation with presence of noise, identify the one to<br />

which thedistancefrom �� is m<strong>in</strong>imal and project �� onto<br />

r to ���.<br />

2.2) if r is produced by the l<strong>in</strong>ear hull of column vectors<br />

—� Y FFFY —� , then f<strong>in</strong>d coefficients !�Y� such that<br />

�0I<br />

��� a !�Y�—� X<br />

�2x satisfies the follow<strong>in</strong>g conditions:<br />

i) �<br />

thecolumns of ˆ lie<strong>in</strong> theunion r of different hy-<br />

� 0 I<br />

perplanes, each column lies <strong>in</strong> only one such hyperplane, each<br />

hyperplane conta<strong>in</strong>s at least � columns of ˆ such that each<br />

� 0 I of them are l<strong>in</strong>early <strong>in</strong>dependent;<br />

ii) � 0 I<br />

for each � P �IYFFFY�� there exist � a different<br />

� 0 P<br />

hyperplanes �r�Y�� �<br />

�aI <strong>in</strong> r such that their <strong>in</strong>tersection v� a<br />

’ �<br />

�aIr�Y� is 1-D subspace;<br />

iii) any � different v� span thewhole � .<br />

Then the matrix ˆ is representable uniquely (up to permutation and<br />

scal<strong>in</strong>g of thecolumns of e and rows of ƒ) <strong>in</strong> theform ˆ a eƒ,<br />

where the matrices e P �2� and ƒ P �2x satisfy theconditions<br />

A1) and A2), A3), respectively.<br />

Proof: Let v� bespanned by —� and set e a �—�� � �aI. Condition<br />

iii) implies that any hyperplane from r conta<strong>in</strong>s at most � 0 I vectors<br />

from e. By i) and ii), it follows that these vectors are exactly � 0 I:<br />

only <strong>in</strong> this case the calculation of the number of all hyperplanes by<br />

� 0 I<br />

�<br />

ii) will givethenumber <strong>in</strong> i): � a@� 0 IA a<br />

� 0 P � 0 I .Let<br />

e be a matrix whose column vectors are all vectors from e (taken <strong>in</strong><br />

an arbitrary order). S<strong>in</strong>ce every column vector � of ˆ lies only <strong>in</strong> one<br />

hyperplane from r, the l<strong>in</strong>ear system e� a � has uniquesolution,<br />

which has at least � 0 � CIzeros (see the Proof of Theorem 3). Let<br />

���� � �aI be � column vectors from ˆ, which span onehyperplane<br />

from r, and � 0 I of them are l<strong>in</strong>early <strong>in</strong>dependent (such vectors<br />

exist by i)). Then we have: e�� a ��, for some uniquely determ<strong>in</strong>ed<br />

vectors ��Y� aIYFFFY�0 I, which are l<strong>in</strong>early <strong>in</strong>dependent and have




Fig. 4. (a) Mixed signals and (b) normalized scatter plot (density) of the mixtures together with the 21 data set hyperplanes, visualized by their intersection with the unit sphere in R^3.
Fig. 5. (a) Original source signals. (b) Recovered source signals—the signal-to-noise ratio between the original sources and the recoveries is very high (above 278 dB after permutation and normalization). Note that only 200 samples are enough for excellent separation. (c) Recovered source signals using ℓ1-norm minimization and the known mixing matrix. Simple comparison confirms that the recovered signals are far from the original ones, and the signal-to-noise ratio is only around 4 dB.

at least � 0 � CIzeros <strong>in</strong> the same coord<strong>in</strong>ates. In such a way, we B. Underdeterm<strong>in</strong>ed Case<br />

can write: ˆ a eƒ for some uniquely determ<strong>in</strong>ed matrix ƒ, which<br />

satisfies A2) and A3).<br />

We should mention that our algorithms are robust with respect to<br />

small additive noise and big outliers, s<strong>in</strong>ce the algorithms cluster the<br />

data on hyperplanes approximately, up to a threshold 4 b 0, which<br />

could accumulatea noisewith amplitudeless than 4. Thebig outliers<br />

will not be clustered to any hyperplane.<br />

We consider a mixture of seven artificially created sources (see<br />

Fig. 5)—sparsified randomly generated signals with at least 5 zeros<br />

<strong>in</strong> each column—with a randomly generated mix<strong>in</strong>g matrix with<br />

dimension 3 2 7. Fig. 4 gives the mixed signals together with a<br />

normalized scatterplot of the mixtures—the data lies <strong>in</strong> PI a<br />

IV. COMPUTER SIMULATION EXAMPLES<br />

A. Complete Case<br />

In this example for the complete case @� a �A of <strong>in</strong>stantaneous<br />

mixtures, we demonstrate the effectiveness of our algorithm for identification<br />

of the mix<strong>in</strong>g matrix <strong>in</strong> the special case considered <strong>in</strong> Theorem<br />

2. We mixed three images of landscapes (shown <strong>in</strong> Fig. 1) with<br />

a three-dimensional (3-D) Hilbert matrix e and transformed them by<br />

a two-dimensional (2-D) discrete Haar wavelet transform. As a result,<br />

s<strong>in</strong>ce this transformation is l<strong>in</strong>ear, the high frequency components of<br />

the source signals become very sparse and they satisfy the conditions<br />

of Theorem 2. We use only one row (320 po<strong>in</strong>ts) from the diagonal<br />

coefficients of the wavelet transformed mixture, which is enough to recover<br />

very precisely the ill conditioned mix<strong>in</strong>g matrix e. Fig. 3 shows<br />

the recovered mixtures.<br />

U<br />

P<br />

hyperplanes. Apply<strong>in</strong>g the underdeterm<strong>in</strong>ed matrix recovery algorithm<br />

to the mixtures gives the recovered mix<strong>in</strong>g matrix perfectly well, up<br />

to permutation and scal<strong>in</strong>g (not shown because of lack of space).<br />

Apply<strong>in</strong>g the source recovery algorithm, we recover the source signals<br />

up to permutation and scal<strong>in</strong>g (see Fig. 5). This figure also shows that<br />

the recovery by �I-norm m<strong>in</strong>imization does not perform well, even if<br />

the mix<strong>in</strong>g matrix is perfectly known.<br />

V. CONCLUSION<br />

We defined rigorously the SCA and BSS problems of sparse signals and presented sufficient conditions for solving them. We developed

three algorithms: for identification of the mix<strong>in</strong>g matrix (two types: for<br />

the sparse and the very sparse cases) and for source recovery. We presented<br />

two experiments: the first one concerns separation of a mixture<br />

of images, after wavelet sparsification (produc<strong>in</strong>g very sparse sources),<br />

which performs very well <strong>in</strong> the complete case. The second one shows




the excellent performance of the other two of our algorithms in the underdetermined BSS problem, for the separation of artificially created signals with a sufficient level of sparseness.

REFERENCES

[1] F. Abrard, Y. Deville, and P. White, "From blind source separation to blind source cancellation in the underdetermined case: A new approach based on time-frequency analysis," in Proc. 3rd Int. Conf. Independent Component Analysis and Signal Separation (ICA'2001), San Diego, CA, Dec. 9–13, 2001, pp. 734–739.
[2] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. New York: Wiley, 2002.
[3] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[4] D. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.
[5] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001.
[6] A. Jourjine, S. Rickard, and O. Yilmaz, "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures," in Proc. 2000 IEEE Conf. Acoustics, Speech, and Signal Processing (ICASSP'00), vol. 5, Istanbul, Turkey, Jun. 2000, pp. 2985–2988.
[7] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Process. Lett., vol. 6, no. 4, pp. 87–90, 1999.
[8] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[9] F. J. Theis, E. W. Lang, and C. G. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomput., vol. 56, pp. 381–398, 2004.
[10] K. Waheed and F. Salem, "Algebraic overcomplete independent component analysis," in Proc. Int. Conf. Independent Component Analysis (ICA'03), Nara, Japan, 2003, pp. 1077–1082.
[11] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Comput., vol. 13, no. 4, pp. 863–882, 2001.


Chapter 11<br />

EURASIP JASP, 2007<br />

Paper F.J. Theis, P. Georgiev, and A. Cichocki. Robust sparse component analysis<br />

based on a generalized hough transform. EURASIP Journal on Applied<br />

Signal Process<strong>in</strong>g, 2007<br />

Reference (Theis et al., 2007a)<br />

Summary <strong>in</strong> section 1.4.1<br />




H<strong>in</strong>dawi Publish<strong>in</strong>g Corporation<br />

EURASIP Journal on Advances <strong>in</strong> Signal Process<strong>in</strong>g<br />

Volume 2007, Article ID 52105, 13 pages<br />

doi:10.1155/2007/52105<br />

Research Article<br />

Robust Sparse <strong>Component</strong> <strong>Analysis</strong> Based on<br />

a Generalized Hough Transform<br />

Fabian J. Theis, 1 Pando Georgiev, 2 and Andrzej Cichocki3, 4<br />

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

2 ECECS Department and Department of Mathematical Sciences, University of C<strong>in</strong>c<strong>in</strong>nati, C<strong>in</strong>c<strong>in</strong>nati, OH 45221, USA<br />

3 BSI RIKEN, Laboratory for Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g, 2-1, Hirosawa, Wako, Saitama 351-0198, Japan<br />

4 Faculty of Electrical Eng<strong>in</strong>eer<strong>in</strong>g, Warsaw University of Technology, Pl. Politechniki 1, 00-661 Warsaw, Poland<br />

Received 21 October 2005; Revised 11 April 2006; Accepted 11 June 2006<br />

Recommended for Publication by Frank Ehlers<br />

An algorithm called Hough SCA is presented for recover<strong>in</strong>g the matrix A <strong>in</strong> x(t) = As(t), where x(t) is a multivariate observed<br />

signal, possibly of lower dimension than the unknown sources s(t). They are assumed to be sparse in the sense that at every

time <strong>in</strong>stant t, s(t) has fewer nonzero elements than the dimension of x(t). The presented algorithm performs a global search for<br />

hyperplane clusters with<strong>in</strong> the mixture space by gather<strong>in</strong>g possible hyperplane parameters with<strong>in</strong> a Hough accumulator tensor.<br />

This renders the algorithm immune to the many local m<strong>in</strong>ima typically exhibited by the correspond<strong>in</strong>g cost function. In contrast<br />

to previous approaches, Hough SCA is l<strong>in</strong>ear <strong>in</strong> the sample number and <strong>in</strong>dependent of the source dimension as well as robust<br />

aga<strong>in</strong>st noise and outliers. Experiments demonstrate the flexibility of the proposed algorithm.<br />

Copyright © 2007 Fabian J. Theis et al. This is an open access article distributed under the Creative Commons Attribution License,<br />

which permits unrestricted use, distribution, and reproduction <strong>in</strong> any medium, provided the orig<strong>in</strong>al work is properly cited.<br />

1. INTRODUCTION<br />

One goal of multichannel signal analysis lies <strong>in</strong> the detection<br />

of underly<strong>in</strong>g sources with<strong>in</strong> some given set of observations.<br />

If both the mixture process and the sources are unknown,<br />

this is denoted as bl<strong>in</strong>d source separation (BSS). BSS<br />

can be applied <strong>in</strong> many different fields such as medical and<br />

biological data analysis, broadcast<strong>in</strong>g systems, and audio and<br />

image process<strong>in</strong>g. In order to decompose the data set, different<br />

assumptions on the sources have to be made. The<br />

most common assumption currently used is statistical <strong>in</strong>dependence<br />

of the sources, which leads to the task of <strong>in</strong>dependent<br />

component analysis (ICA); see, for <strong>in</strong>stance, [1, 2]<br />

and references there<strong>in</strong>. ICA very successfully separates data<br />

<strong>in</strong> the l<strong>in</strong>ear complete case, when as many signals as underly<strong>in</strong>g<br />

sources are observed, and <strong>in</strong> this case the mix<strong>in</strong>g<br />

matrix and the sources are identifiable except for permutation<br />

and scal<strong>in</strong>g [3, 4]. In the overcomplete or underdeterm<strong>in</strong>ed<br />

case, fewer observations than sources are given.<br />

It can be shown that the mix<strong>in</strong>g matrix can still be recovered<br />

[5], but source identifiability does not hold. In order<br />

to approximately detect the sources, additional requirements<br />

have to be made, usually sparsity of the sources [6–<br />

8].<br />

Recently, we have <strong>in</strong>troduced a novel measure for sparsity<br />

and shown [9] that based on sparsity alone, we can still<br />

detect both mix<strong>in</strong>g matrix and sources uniquely except for<br />

trivial <strong>in</strong>determ<strong>in</strong>acies (sparse component analysis (SCA)). In<br />

that paper, we have also proposed an algorithm based on random<br />

sampl<strong>in</strong>g for reconstruct<strong>in</strong>g the mix<strong>in</strong>g matrix and the<br />

sources, but the focus of the paper was on the model, and the<br />

matrix estimation algorithm turned out to be not very robust<br />

aga<strong>in</strong>st noise and outliers, and could therefore not easily<br />

be applied <strong>in</strong> high dimensions due to the <strong>in</strong>volved comb<strong>in</strong>atorial<br />

searches. In the present manuscript, a new algorithm<br />

is proposed for SCA, that is, for decompos<strong>in</strong>g a data<br />

set x(1), . . . , x(T) ∈ R m modeled by an (m × T)-matrix X<br />

l<strong>in</strong>early <strong>in</strong>to X = AS, where the n-dimensional sources S =<br />

(s(1), . . . , s(T)) are assumed to be sparse at every time <strong>in</strong>stant.<br />

If the sources are of sufficiently high sparsity, the mixtures<br />

are clustered along hyperplanes <strong>in</strong> the mixture space.<br />

Based on this condition, the mix<strong>in</strong>g matrix can be reconstructed;<br />

furthermore, this property is robust aga<strong>in</strong>st noise<br />

and outliers, which will be used here. The proposed algorithm<br />

denoted by Hough SCA employs a generalization of the<br />

Hough transform <strong>in</strong> order to detect the hyperplanes <strong>in</strong> the<br />

mixture space, which then leads to matrix and source identification.




The Hough transform [10] is a standard tool <strong>in</strong> image<br />

analysis that allows recognition of global patterns <strong>in</strong> an image<br />

space by recogniz<strong>in</strong>g local patterns, ideally a po<strong>in</strong>t, <strong>in</strong> a transformed<br />

parameter space. It is particularly useful when the<br />

patterns <strong>in</strong> question are sparsely digitized, conta<strong>in</strong> “holes,”<br />

or have been taken in noisy environments. The basic idea

of this technique is to map parameterized objects such as<br />

straight l<strong>in</strong>es, polynomials, or circles to a suitable parameter<br />

space. The ma<strong>in</strong> application of the Hough transform lies<br />

<strong>in</strong> the field of image process<strong>in</strong>g <strong>in</strong> order to f<strong>in</strong>d straight l<strong>in</strong>es,<br />

centers of circles with a fixed radius, parabolas, and so forth<br />

<strong>in</strong> images.<br />

The Hough transform has been used <strong>in</strong> a somewhat<br />

ad hoc way <strong>in</strong> the field of <strong>in</strong>dependent component analysis<br />

for identify<strong>in</strong>g two-dimensional sources <strong>in</strong> the mixture<br />

plot <strong>in</strong> the complete [11] and overcomplete [12] cases,<br />

which without additional restrictions can be shown to have<br />

some theoretical issues [13]; moreover, the proposed algorithms<br />

were restricted to two dimensions and did not provide<br />

any reliable source identification method. An application<br />

of a time-frequency Hough transform to direction f<strong>in</strong>d<strong>in</strong>g<br />

with<strong>in</strong> nonstationary signals has been studied <strong>in</strong> [14]; the<br />

idea is based on the Hough transform of the Wigner-Ville<br />

distribution [15], essentially employ<strong>in</strong>g a generalized Hough<br />

transform [16] to f<strong>in</strong>d straight l<strong>in</strong>es <strong>in</strong> the time-frequency<br />

plane. The results <strong>in</strong> [14] aga<strong>in</strong> only concentrate on the twodimensional<br />

mixture case. In the literature, overcomplete<br />

BSS and the correspond<strong>in</strong>g basis estimation problems have<br />

ga<strong>in</strong>ed considerable <strong>in</strong>terest <strong>in</strong> the past decade [8, 17–19],<br />

but the sparse priors are always used <strong>in</strong> connection with the<br />

assumption of <strong>in</strong>dependent sources. This allows for probabilistic<br />

sparsity conditions, but cannot guarantee source<br />

identifiability as <strong>in</strong> our case.<br />

The paper is organized as follows. In Section 2, we introduce the overcomplete SCA model and summarize the known identifiability results and algorithms [9]. The following section then reviews the classical Hough transform in two dimensions and generalizes it in order to detect hyperplanes in any dimension. This method is used in Section 4 to develop an SCA algorithm, which turns out to be highly robust against noise and outliers. We confirm this by experiments in Section 5. Some results of this paper have already been presented at the conference "ESANN 2004" [20].

2. OVERCOMPLETE SCA

We introduce a strict notion of sparsity and present identifiability results when applying the measure to BSS.

A vector v ∈ R^n is said to be k-sparse if v has at least k zero entries. An n × T data matrix is said to be k-sparse if each of its columns is k-sparse. Note that if v is k-sparse, then it is also k′-sparse for k′ ≤ k. The goal of sparse component analysis of level k (k-SCA) is to decompose a given m-dimensional observed signal x(t), t = 1, . . . , T, into

x(t) = As(t)    (1)

with a real m × n mixing matrix A and n-dimensional k-sparse sources s(t). The samples are gathered into the corresponding data matrices X := (x(1), . . . , x(T)) ∈ R^{m×T} and S := (s(1), . . . , s(T)) ∈ R^{n×T}, so the model is X = AS. We speak of complete, overcomplete, or undercomplete k-SCA if m = n, m < n, or m > n, respectively. In the following, we will always assume that the sparsity level equals k = n − m + 1, which means that at any time instant, fewer sources than given observations are active. In the algorithm, we will also consider additive white Gaussian noise; however, the model identification results are presented only in the noiseless case from (1).
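To make the model concrete, the following Python sketch (our illustration, not part of the original paper; the dimensions and sample size are arbitrary) generates n-dimensional, k-sparse sources with k = n − m + 1 and mixes them according to X = AS:

import numpy as np

rng = np.random.default_rng(0)

m, n, T = 3, 4, 1000          # mixtures, sources, samples (illustrative values)
k = n - m + 1                 # sparsity level: at least k zeros per column of S

# Draw dense sources, then zero out k randomly chosen entries per sample,
# so that at most m - 1 sources are active at any time instant.
S = rng.laplace(size=(n, T))
for t in range(T):
    zero_idx = rng.choice(n, size=k, replace=False)
    S[zero_idx, t] = 0.0

A = rng.uniform(-1.0, 1.0, size=(m, n))   # random mixing matrix
X = A @ S                                 # noiseless k-SCA model X = AS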

Note that in contrast to the ICA model, the above problem is not translation invariant. However, it is easy to see that if instead of A we choose an affine linear transformation, the translation constant can be determined from X alone, as long as the sources are nondeterministic. Put differently, this means that instead of assuming k-sparsity of the sources we could also assume that at any fixed time t, only n − k source components are allowed to vary from a previously fixed constant (which can be different for each source). In the following, without loss of generality, we will assume m ≤ n: the easier undercomplete (or overdetermined) case can be reduced to the complete case by projection in the mixture space.

The following theorem shows that the mixing model (1) is essentially unique if fewer sources than mixtures are active, that is, if the sources are (n − m + 1)-sparse.

Theorem 1 (matrix identifiability). Consider the k-SCA problem from (1) for k := n − m + 1 and assume that every m × m submatrix of A is invertible. Furthermore, let S be sufficiently richly represented in the sense that for any index set of n − m + 1 elements I ⊂ {1, . . . , n} there exist at least m samples of S such that each of them has zero elements in places with indexes in I and each m − 1 of them are linearly independent. Then A is uniquely determined by X except for left multiplication with permutation and scaling matrices.

So if AS = ÂŜ, then A = ÂPL with a permutation matrix P and a nonsingular scaling matrix L. This means that we can recover the mixing matrix from the mixtures. The next theorem shows that in this case the sources can also be found uniquely.

Theorem 2 (source identifiability). Let H be the set of all x ∈ R^m such that the linear system As = x has an (n − m + 1)-sparse solution, that is, one with at least n − m + 1 zero components. If A fulfills the condition from Theorem 1, then there exists a subset H0 ⊂ H of measure zero with respect to H, such that for every x ∈ H \ H0 this system has no other solution with this property.

For proofs of these theorems we refer to [9]. The above two theorems show that in the case of overcomplete BSS using k-SCA with k = n − m + 1, both the mixing matrix and the sources can be uniquely recovered from X except for the omnipresent permutation and scaling indeterminacy. The essential idea of both theorems, as well as of a possible algorithm, is illustrated in Figure 1: by assuming sufficiently high sparsity of the sources, the mixture space clusters along a union of hyperplanes, which uniquely determine both mixing matrix and sources.



[Figure 1, three panels: (a) three hyperplanes span{ai, aj} for 1 ≤ i < j ≤ 3 in the 3 × 3 case; (b) the hyperplanes from (a) visualized by intersection with the sphere; (c) six hyperplanes span{ai, aj} for 1 ≤ i < j ≤ 4 in the 3 × 4 case.]

Figure 1: Visualization of the hyperplanes in the mixture space {x(t)} ⊂ R^3. Due to the source sparsity, the mixtures are generated by only two matrix columns ai, aj, and are hence contained in a union of hyperplanes. Identification of the hyperplanes gives mixing matrix and sources.

Data: samples x(1), . . . , x(T)
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Cluster the samples x(t) into \binom{n}{m-1} groups such that the span of the elements of each group produces one distinct hyperplane Hi.
Matrix identification.
(2) Cluster the normal vectors of these hyperplanes into the smallest number of groups Gj, j = 1, . . . , n (which gives the number of sources n) such that the normal vectors of the hyperplanes in each group Gj lie in a new hyperplane Ĥj.
(3) Calculate the normal vector âj of each hyperplane Ĥj, j = 1, . . . , n.
(4) The matrix Â with columns âj is an estimate of the mixing matrix (up to permutation and scaling of the columns).

Algorithm 1: SCA matrix identification algorithm.

The matrix and source identification algorithms from [9] are recalled in Algorithms 1 and 2. We will present a modification of the matrix identification part; the same source identification algorithm (Algorithm 2) will be used in the experiments. The "difficult" part of the matrix identification algorithm lies in the hyperplane detection; in Algorithm 1, a random sampling and clustering technique is used. Another, more efficient algorithm for finding the hyperplanes containing the data has been developed by Bradley and Mangasarian [21], essentially by extending k-means batch clustering. Their so-called k-plane clustering algorithm, in the special case of hyperplanes containing 0, is shown in Algorithm 3.

Data: samples x(1), . . . , x(T) and estimated mixing matrix Â
Result: estimated sources ŝ(1), . . . , ŝ(T)
(1) Identify the set of hyperplanes H produced by taking the linear hull of every subset of the columns of Â with m − 1 elements.
for t ← 1, . . . , T do
(2) Identify the hyperplane H ∈ H containing x(t) or, in the presence of noise, the one to which the distance from x(t) is minimal, and project x(t) onto H to obtain x̃.
(3) If H is produced by the linear hull of the column vectors âi(1), . . . , âi(m−1), find coefficients λi(j) such that x̃ = Σ_{j=1}^{m−1} λi(j) âi(j).
(4) Construct the solution ŝ(t): it contains λi(j) at index i(j) for j = 1, . . . , m − 1; the other components are zero.
end

Algorithm 2: SCA source identification algorithm.
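As a rough illustration of Algorithm 2 (a minimal sketch under the stated model assumptions, not the authors' implementation; all function and variable names are ours), the following Python code projects each sample onto the closest hyperplane spanned by m − 1 columns of the estimated mixing matrix and reads off the corresponding coefficients:

import numpy as np
from itertools import combinations

def sca_source_recovery(X, A_est):
    """Recover (n - m + 1)-sparse sources from mixtures X (m x T) and A_est (m x n)."""
    m, T = X.shape
    n = A_est.shape[1]
    S_est = np.zeros((n, T))
    # Precompute all hyperplanes spanned by (m - 1)-subsets of the columns of A_est.
    subsets = list(combinations(range(n), m - 1))
    bases = [A_est[:, idx] for idx in subsets]           # m x (m-1) generator matrices
    pinvs = [np.linalg.pinv(B) for B in bases]           # for projection coefficients
    for t in range(T):
        x = X[:, t]
        best, best_dist, best_coef = None, np.inf, None
        for idx, B, P in zip(subsets, bases, pinvs):
            coef = P @ x                                  # least-squares coefficients
            dist = np.linalg.norm(x - B @ coef)           # distance of x to the hyperplane
            if dist < best_dist:
                best, best_dist, best_coef = idx, dist, coef
        S_est[list(best), t] = best_coef                  # remaining components stay zero
    return S_est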

The finite termination of the k-plane clustering algorithm is proven in [21, Theorem 3.7]. We will later compare the proposed Hough algorithm with the k-hyperplane algorithm. The k-hyperplane algorithm has also been extended to a more general, orthogonal k-subspace clustering method [22, 23], thus allowing a search not only for hyperplanes but also for lower-dimensional subspaces.

3. HOUGH TRANSFORM

The Hough transform is a classical method for locating shapes in images, widely used in the field of image processing; see [10, 24]. It is robust to noise and occlusions and is used for extracting lines, circles, or other shapes from images. In addition to these nonlinear extensions, it can also be made more robust to noise using antialiasing techniques.



Data: samples x(1), . . . , x(T)
Result: estimated k hyperplanes Hi given by their unit normal vectors ui
(1) Initialize ui randomly with |ui| = 1 for i = 1, . . . , k.
do
Cluster assignment.
for t ← 1, . . . , T do
(2) Add x(t) to cluster Y^(i), where i is chosen to minimize |ui⊤x(t)| (distance to hyperplane Hi).
end
(3) Exit if the mean distance to the hyperplanes is smaller than some preset value.
Cluster update.
for i ← 1, . . . , k do
(4) Calculate the ith cluster correlation C := Y^(i)Y^(i)⊤.
(5) Choose an eigenvector v of C corresponding to a minimal eigenvalue.
(6) Set ui ← v/|v|.
end
end

Algorithm 3: k-hyperplane clustering algorithm.
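A compact Python sketch of this k-hyperplane clustering scheme (our illustration following the structure of Algorithm 3, with an iteration cap instead of a formal termination proof; not the code of [21]) could look as follows:

import numpy as np

def k_hyperplane_clustering(X, k, n_iter=100, tol=1e-6, seed=0):
    """Fit k hyperplanes through 0 to the columns of X (m x T); returns unit normals (k x m)."""
    rng = np.random.default_rng(seed)
    m, T = X.shape
    U = rng.standard_normal((k, m))
    U /= np.linalg.norm(U, axis=1, keepdims=True)      # random unit normal vectors
    for _ in range(n_iter):
        dist = np.abs(U @ X)                           # |u_i^T x(t)|: distance to hyperplane i
        labels = np.argmin(dist, axis=0)               # cluster assignment
        if dist[labels, np.arange(T)].mean() < tol:    # mean distance small enough: stop
            break
        for i in range(k):                             # cluster update
            Y = X[:, labels == i]
            if Y.shape[1] == 0:
                continue                               # empty cluster: keep old normal
            C = Y @ Y.T                                # cluster correlation matrix
            eigval, eigvec = np.linalg.eigh(C)
            v = eigvec[:, 0]                           # eigenvector of the minimal eigenvalue
            U[i] = v / np.linalg.norm(v)
    return U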

3.1. Definition

Its main idea can be described as follows: consider a parameterized object

Ma := {x ∈ R^n | f(x, a) = 0}    (2)

for a fixed parameter set a ∈ U ⊂ R^p. Here U ⊂ R^p is the parameter space, and the parameter function f : R^n × U → R^m is a set of m equations describing our types of objects (manifolds) Ma for different parameters a. We assume that the equations given by f are separating in the sense that if Ma ⊂ Ma′, then already a = a′. A simple example is the set of unit circles in R^2; then f(x, a) = |x − a| − 1. For a given a ∈ R^2, Ma is the circle of radius 1 centered at a. Obviously f is separating. Other object manifolds will be discussed later. A nonseparating object function is, for example, f(x, a) := 1 − 1_{[0,a]}(x) for (x, a) ∈ R × [0, ∞), where the characteristic function 1_{[0,a]}(x) equals 1 if and only if x ∈ [0, a] and 0 otherwise. Then M1 = [0, 1] ⊂ [0, 2] = M2, but the parameters are different.

Given a separating parameter function f(x, a), its Hough transform is defined as

η[f] : R^n → P(U), x ↦ {a ∈ U | f(x, a) = 0},    (3)

where P(U) denotes the set of all subsets of U. So η[f] maps a point x onto the set of all parameters describing objects containing x. An object Ma as a set, however, is mapped onto a single point {a}, that is,

∩_{x ∈ Ma} η[f](x) = {a}.    (4)

This follows because if a′ ∈ ∩_{x ∈ Ma} η[f](x), then for all x ∈ Ma we have f(x, a′) = 0, which means that Ma ⊂ Ma′; the parameter function f is assumed to be separating, so a = a′. Hence, objects Ma in a data set X = {x(1), . . . , x(T)} can be detected by analyzing clusters in η[f](X).

We will illustrate this concept for line detection in the following section before applying it to the hyperplane identification needed for our SCA problem.

3.2. Classical Hough transform

The (classical) Hough transform detects lines in a given two-dimensional data space as follows: an affine, nonvertical line in R^2 can be described by the equation x2 = a1x1 + a2 for fixed a = (a1, a2) ∈ R^2. If we define

fL(x, a) := a1x1 + a2 − x2,    (5)

then the above line equals the set Ma from (2) for the unique parameter a, and fL is clearly separating. Figures 2(a) and 2(b) illustrate this idea.

In practice, polar coordinates are used to describe the line in Hessian normal form; this also allows detection of vertical lines (θ = π/2) in the data set and moreover guarantees an isotropic error, in contrast to the parametrization (5). This leads to the parameter function

fP(x, θ, ρ) = x1 cos(θ) + x2 sin(θ) − ρ = 0    (6)

for parameters (θ, ρ) ∈ U := [0, π) × R. Points in the data space are then mapped to sine curves given by fP; see Figure 2(c).
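To fix ideas, the following Python sketch (our illustration, not from the paper; bin counts and the line used for the test data are arbitrary) performs line detection with the polar parametrization (6): every data point votes for all (θ, ρ) bins consistent with it, and the accumulator maximum yields the line parameters.

import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Classical Hough transform for 2-D points; returns the accumulator and bin centers."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.max(np.linalg.norm(points, axis=1))
    rhos = np.linspace(-rho_max, rho_max, n_rho)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x1, x2 in points:
        rho = x1 * np.cos(thetas) + x2 * np.sin(thetas)        # eq. (6) solved for rho
        bins = np.digitize(rho, rhos) - 1
        valid = (bins >= 0) & (bins < n_rho)
        acc[np.arange(n_theta)[valid], bins[valid]] += 1       # one vote per theta bin
    return acc, thetas, rhos

# Example: noisy points on the line x2 = 0.5 * x1 + 1
rng = np.random.default_rng(1)
x1 = rng.uniform(-5, 5, 300)
pts = np.column_stack([x1, 0.5 * x1 + 1 + 0.01 * rng.standard_normal(300)])
acc, thetas, rhos = hough_lines(pts)
i, j = np.unravel_index(np.argmax(acc), acc.shape)
print("theta =", thetas[i], "rho =", rhos[j])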

3.3. Generalization

The mixing matrix A in the case of (n − m + 1)-sparse SCA can be recovered by finding all 1-codimensional subvector spaces in the mixture data set. The algorithm presented here uses a generalized version of the Hough transform in order to determine hyperplanes through 0 as follows.

Vectors x ∈ R^m lying on such a hyperplane H can be described by the equation

fh(x, n) := n⊤x = 0,    (7)

where n is a nonzero vector orthogonal to H. After normalization |n| = 1, the normal vector n is uniquely determined by H if we additionally require n to lie on one hemisphere of the unit sphere S^{m−1} := {x ∈ R^m | |x| = 1}. This means that the parametrization fh is separating. In terms of spherical coordinates of S^{m−1}, n can be expressed as

n = ( cos ϕ sin θ1 sin θ2 · · · sin θ_{m−2},
      sin ϕ sin θ1 sin θ2 · · · sin θ_{m−2},
      cos θ1 sin θ2 · · · sin θ_{m−2},
      . . . ,
      cos θ_{m−3} sin θ_{m−2},
      cos θ_{m−2} )⊤    (8)

with (ϕ, θ1, . . . , θ_{m−2}) ∈ [0, 2π) × [0, π)^{m−2}; uniqueness of n can be achieved by requiring ϕ ∈ [0, π). Plugging n in spherical coordinates into (7) gives

cot θ_{m−2} = − Σ_{i=1}^{m−1} ν_i(ϕ, θ1, . . . , θ_{m−3}) x_i / x_m    (9)



for x ∈ R^m with x_m ≠ 0 and

ν_i :=  cos ϕ sin θ1 · · · sin θ_{m−3}                 for i = 1,
        sin ϕ sin θ1 · · · sin θ_{m−3}                 for i = 2,
        cos θ_{i−2} sin θ_{i−1} · · · sin θ_{m−3}      for i > 2.    (10)

With cot(θ + π/2) = − tan(θ) we finally get θ_{m−2} = arctan(Σ_{i=1}^{m−1} ν_i x_i / x_m) + π/2. Note that continuity is achieved if we set θ_{m−2} := 0 for x_m = 0.

We can then define the generalized "hyperplane detecting" Hough transform as

η[fh] : R^m → P([0, π)^{m−1}),
x ↦ {(ϕ, θ1, . . . , θ_{m−2}) ∈ [0, π)^{m−1} | θ_{m−2} = arctan(Σ_{i=1}^{m−1} ν_i x_i / x_m) + π/2}.    (11)

The parametrization fh is separating, so points lying on the same hyperplane are mapped to surfaces that intersect in precisely one point in [0, π)^{m−1}. This is demonstrated for the case m = 3 in Figure 3. The hyperplane structures of a data set X = {x(1), . . . , x(T)} can be analyzed by finding clusters in η[fh](X).

Let RP^{m−1} denote the (m − 1)-dimensional real projective space, that is, the manifold of all 1-dimensional subspaces of R^m. There is a canonical diffeomorphism between RP^{m−1} and the Grassmannian manifold of all (m − 1)-dimensional subspaces of R^m, induced by the scalar product. Using this diffeomorphism, we can reformulate our aim of identifying hyperplanes as finding elements of RP^{m−1}. So, the Hough transform η[fh] maps x onto a subset of RP^{m−1}, which is topologically equivalent to the upper hemisphere in R^m with identifications along the boundary. In fact, in (11) we have simply constructed a coordinate map of RP^{m−1} using spherical coordinates.

4. HOUGH SCA ALGORITHM

The SCA matrix detection algorithm (Algorithm 1) consists of two steps. In the first step, d := \binom{n}{m-1} hyperplanes given by their normal vectors n^(1), . . . , n^(d) are constructed such that the mixture data lie in the union of these hyperplanes; in the case of noise this will hold only approximately. In the second step, the mixing matrix columns ai are identified as generators of the n lines lying at the intersections of \binom{n-1}{m-2} hyperplanes each. We replace the first step by the following Hough SCA algorithm.

[Figure 2, three panels: (a) data space; (b) linear Hough space; (c) polar Hough space.]

Figure 2: Illustration of the "classical" Hough transform: a point (x1, x2) in the data space (a) is mapped (b) onto the line {(a1, a2) | a2 = −a1x1 + x2} in the linear parameter space R^2 or (c) onto a translated sine curve {(θ, ρ) | ρ = x1 cos θ + x2 sin θ} in the polar parameter space [0, π) × [0, ∞). The Hough curves of points belonging to one line in data space intersect in precisely one point a in the Hough space, and the data points lie on the line given by the parameter a.

4.1. Definition

The idea is to first gather the Hough curves η[fh](x(t)) corresponding to the samples x(t) in a discretized parameter space, in this context often called the Hough accumulator. Plotting these curves in the accumulator is sometimes denoted as voting for each bin, similar to histogram generation.
Accord<strong>in</strong>g to the previous section, all po<strong>in</strong>ts x from some



[Figure 3, two panels: (a) data space; (b) spherical Hough space.]

Figure 3: Illustration of the "hyperplane detecting" Hough transform in three dimensions: a point (x1, x2, x3) in the data space (a) is mapped onto the curve {(ϕ, θ) | θ = arctan((x1 cos ϕ + x2 sin ϕ)/x3) + π/2} in the parameter space [0, π)^2 (b). The Hough curves of points belonging to one plane in data space intersect in precisely one point (ϕ, θ) in the Hough space, and the points lie on the plane given by the normal vector (cos ϕ sin θ, sin ϕ sin θ, cos θ).

According to the previous section, all points x from some hyperplane H given by a normal vector with angles (ϕ, θ) are mapped onto a parameterized object that contains (ϕ, θ), for all possible x ∈ H. Hence, the corresponding angle bin will contain votes from all samples x(t) lying in H, whereas other bins receive far fewer votes. Therefore, maxima analysis of the accumulator gives the hyperplanes in the parameter space. This idea corresponds to clustering all possible normal vectors of planes through x(t) on RP^{m−1} for all t. The resulting Hough SCA algorithm is described in Algorithm 4. We see that only the hyperplane identification step differs from Algorithm 1; the matrix identification is the same.

The number β of bins is also called the grid resolution. As in histogram-based density estimation, the choice of β can seriously affect the algorithm performance: if chosen too small, possible maxima cannot be resolved, and if chosen too large, the sensitivity of the algorithm increases and the computational burden in terms of speed and memory grows considerably; see the next section. Note that Hough SCA performs a global search and is hence expected to be much slower than local update algorithms such as Algorithm 3, but also much more robust. In the following, its properties will be discussed; applications are given in the example in Section 5.

4.2. Complexity

We only discuss the complexity of the hyperplane estimation, because the matrix identification is performed on a data set of size d, which is typically much smaller than the sample size T.

The angle θ_{m−2} has to be calculated Tβ^{m−2} times. Because only discrete values of the angles are of interest, the trigonometric functions as well as the ν_i can be precalculated and stored in exchange for speed. Each calculation of θ_{m−2} then involves 2m − 1 operations (sums and products/divisions). The voting (without taking "lookup" costs in the accumulator into account) costs an additional operation.

Data: samples x(1), . . . , x(T) of the random vector X
Result: estimated mixing matrix Â
Hyperplane identification.
(1) Fix the number β of bins (it can be chosen separately for each angle).
(2) Initialize the β × · · · × β ((m − 1) factors) accumulator array α ∈ R^{β^{m−1}} with zeros.
for t ← 1, . . . , T do
for ϕ, θ1, . . . , θ_{m−3} ← 0, π/β, . . . , (β − 1)π/β do
(3) θ_{m−2} ← arctan(Σ_{i=1}^{m−1} ν_i(ϕ, . . . , θ_{m−3}) x_i(t)/x_m(t)) + π/2
(4) Increase (vote for) the accumulator value of α in the bin corresponding to (ϕ, θ1, . . . , θ_{m−2}) by one.
end
end
(5) The d := \binom{n}{m-1} largest local maxima of α correspond to the d hyperplanes present in the data set.
(6) Back transformation as in (8) gives the corresponding normal vectors n^(1), . . . , n^(d) of those hyperplanes.
Matrix identification.
(7) Clustering of the hyperplanes generated by (m − 1)-tuples in {n^(1), . . . , n^(d)} gives n separate hyperplanes.
(8) Their normal vectors are the n columns of the estimated mixing matrix Â.

Algorithm 4: Hough SCA algorithm for mixing matrix identification.
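For the three-dimensional mixture case m = 3, the hyperplane identification part of Algorithm 4 reduces to filling a two-dimensional accumulator over (ϕ, θ). A minimal Python sketch of this step (our illustration; the maxima search and the matrix identification step are omitted) is given below.

import numpy as np

def hough_sca_accumulator_3d(X, beta=360):
    """Fill the (phi, theta) accumulator for hyperplanes through 0 in R^3; X is 3 x T."""
    acc = np.zeros((beta, beta), dtype=int)
    phis = np.arange(beta) * np.pi / beta                 # phi grid on [0, pi)
    for t in range(X.shape[1]):
        x1, x2, x3 = X[:, t]
        if x3 == 0.0:
            theta = np.zeros(beta)                        # continuity convention: theta = 0
        else:
            theta = np.arctan((x1 * np.cos(phis) + x2 * np.sin(phis)) / x3) + np.pi / 2
        theta_bins = np.minimum((theta / np.pi * beta).astype(int), beta - 1)
        acc[np.arange(beta), theta_bins] += 1             # one vote per phi bin
    return acc

The d = \binom{n}{m-1} largest local maxima of the returned accumulator then correspond, via (8), to the normal vectors of the hyperplanes.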

Altogether, the accumulator can be filled with about 2Tβ^{m−2}m operations. This means that the algorithm depends linearly on the sample size, is polynomial in the grid resolution, and is exponential in the mixture dimension. The maxima search involves O(β^{m−1}) operations, which for small to medium dimensions can be ignored in comparison to the accumulator generation because usually β ≪ T.

So the main part of the algorithm does not depend on the source dimension n but only on the mixture dimension m. For applications this means that n can be quite large, but hyperplanes will still be found if the grid resolution is high enough.



Increasing the grid resolution (in polynomial time) results in increased accuracy also for higher source dimensions n. The memory requirement of the algorithm is dominated by the accumulator size, which is β^{m−1}. This can limit the grid resolution.

4.3. Resolution error

The choice of the grid resolution β in the algorithm induces a systematic resolution error in the estimation of A (as a tradeoff for robustness and speed). This error is calculated in this section.

Let A be the unknown mixing matrix and Â its estimate, constructed by the Hough SCA algorithm (Algorithm 4) with grid resolution β. Let n^(1), . . . , n^(d) be the normal vectors of the hyperplanes generated by (m − 1)-tuples of columns of A and let n̂^(1), . . . , n̂^(d) be their corresponding estimates. Ignoring permutations, it is sufficient to describe only how n̂^(i) differs from n^(i).

Assume that the maxima of the accumulator are correctly estimated; due to the discrete grid resolution, an average error of π/2β is nonetheless made when estimating the precise maximum position, because the size of one bin is π/β. How is this error propagated into n̂^(i)? By assumption, each estimated angle ϕ, θ1, . . . , θ_{m−2} differs from the true one by at most π/2β. As we are only interested in an upper bound, we simply calculate the deviation of each component of n̂^(i) from n^(i). Using the fact that sine and cosine are bounded by one and 1-Lipschitz, (8) gives the estimate |n̂^(i)_j − n^(i)_j| ≤ (m − 1)π/(2β) for each coordinate j, so altogether

‖n̂^(i) − n^(i)‖ ≤ (m − 1)√m π / (2β).    (12)

This estimate may be improved by using the Jacobian of the spherical coordinate transformation and its determinant, but for our purposes this bound is sufficient. In summary, we have shown that the grid resolution contributes a β^{−1}-perturbation to the estimation of A.
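As a numerical illustration of (12) (our example, not taken from the original text): for m = 3 and the grid resolution β = 360 used in the experiment of Section 5.1, the bound gives ‖n̂^(i) − n^(i)‖ ≤ 2·√3·π/720 ≈ 0.015, that is, the bin quantization alone perturbs each estimated unit normal vector by at most about 0.015 in Euclidean norm.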

4.4. Robustness

Robustness with regard to additive noise as well as outliers is important for any algorithm to be used in the real world. Here an outlier is roughly defined to be a sample far away from other observations, and indeed some researchers define outliers to be samples further away from the mean than, say, 5 standard deviations. However, such definitions necessarily depend on the underlying random variable to be estimated, so most books only give examples of outliers, and indeed no consistent, context-free, precise definition of outliers exists [25]. In the following, given samples of a fixed random variable of interest, we denote a sample as an outlier if it is drawn from another, sufficiently different distribution.

Fitting only one hyperplane to the data set can be achieved by linear regression, namely by minimizing the squared distance to such a possible hyperplane. These least squares fitting algorithms are well known to be sensitive to outliers, and various extensions of the LS method such as least median of squares and reweighted least squares [26] have been developed to overcome this problem. The breakdown point of the latter is 0.5, which means that the fit parameters are only stably estimated for data sets with less than 50% outliers. The other techniques typically have much lower breakdown points, usually below 0.3. The classical Hough transform, albeit not a regression method, is comparable in terms of breakdown with robust fitting algorithms such as the reweighted least squares algorithm [27]. In the experiments we will observe similar results for the generalized method presented above; namely, we achieve breakdown levels of up to 0.8 in the low-noise case, which considerably decrease with increasing noise.

From a mathematical point of view, the "classical" Hough transform has been studied quite extensively, both as an estimator (and extension of linear regression) and regarding algorithmic and implementational aspects; see, for example, [28] and references therein. Most of the presented theoretical results in the two-dimensional case could be extended to the more general objective presented here, but this is not within the scope of this manuscript. Simulations giving experimental evidence that the robustness also holds in our case are shown in Section 5.

4.5. Extensions

The following possible extensions of the Hough SCA algorithm can be employed to increase its performance.

If the noise level is known, smoothing of the accumulator (antialiasing) will help to give more robust results in terms of noise. For smoothing (usually with a Gaussian), the smoothing radius must be set according to the noise level. If the noise level is not known, smoothing can still be applied by gradually increasing the radius until the number of clearly detectable maxima equals d.

Furthermore, an additional fine-tuning step is possible: the estimated plane normals are slightly deteriorated by the systematic resolution error as shown previously. However, after application of Hough SCA, the data space can be clustered into data points lying close to the corresponding hyperplanes. Within each cluster, linear regression (or some more robust version of it; see Section 4.4) can then be applied to improve the hyperplane estimate; this is precisely the idea used locally in the k-hyperplane clustering algorithm (Algorithm 3). Such a method requires additional computational power, but makes the algorithm less dependent on the grid resolution, which is then only needed for the hyperplane clustering step. However, it is expected that this additional fine-tuning step may decrease robustness, especially against biased noise and outliers.

5. SIMULATIONS

We give a simulation example as well as batch runs to analyze the performance of the proposed algorithm.



[Figure 4, four panels: (a) source signals; (b) mixture signals; (c) normalized mixture scatter plot; (d) Hough accumulator with labeled maxima.]

Figure 4: Example: (a) shows the 2-sparse, sufficiently richly represented, 4-dimensional source signals, and (b) the randomly mixed, 3-dimensional mixtures. The normalized mixture scatter plot {x(t)/|x(t)| | t = 1, . . . , T} is given in (c), and the generated Hough accumulator in (d); note that the color scale in (d) was chosen to be nonlinear (γ_new := (1 − γ/γ_max)^10) in order to visualize structure in addition to the strong maxima.

5.1. Explicit example

In the first experiment, we consider the case of source dimension n = 4 and mixing dimension m = 3. The 4-dimensional sources have been generated from i.i.d. samples (two Laplacian and two Gaussian sequences), followed by setting some entries to zero in order to fulfill the sparsity constraints; see Figure 4(a). They are 2-sparse and consist of 1000 samples. Obviously all combinations (i, j), i < j, of active sources are present in the data set; this condition is needed by the matrix recovery step. The sources were mixed using a mixing matrix with randomly (uniformly in [−1, 1]) chosen coefficients to give mixtures as shown in Figure 4(b). The mixture density clearly lies in 6 disjoint hyperplanes, spanned by pairs (ai, aj), i < j, of mixing matrix columns, as indicated by the normalized scatter plot in Figure 4(c), similar to the illustration in Figure 1(c).

In order to detect the planes in the data space, we apply the generalized Hough transform as explained in Section 3.3. Figure 4(d) shows the Hough image with β = 360. Each sample results in a curve, and clearly 6 intersection points are visible, which correspond to the 6 hyperplanes in question. Maxima analysis retrieves these points (in Hough space) as shown in the same figure. After transforming these points back into R^3 with the inverse Hough transform, we get 6 normalized vectors corresponding to the 6 planes. Considering intersections of the hyperplanes, we notice that only 4 of the intersection lines lie in precisely 3 planes each, and these 4 lines are spanned by the matrix columns ai. For practical reasons, we recover them combinatorially from the plane normal vectors; see Algorithm 4. The deviation of the recovered mixing matrix Â from the original mixing matrix A in the overcomplete case can be measured by the generalized crosstalking error [8] defined as E(A, Â) := min_{M∈Π} ‖A − ÂM‖, where the minimum is taken over the group Π of all invertible real n × n matrices in which only one entry in each column differs from 0, and ‖·‖ denotes a fixed matrix norm.




In our case the generalized crosstalking error is very low, with E(A, Â) = 0.040. This essentially means that the two matrices, after permutation, differ only by 0.04 with respect to the chosen matrix norm, in our case the (squared) Frobenius norm. Then, the sources are recovered using the source recovery algorithm (Algorithm 2) with the approximated mixing matrix Â. The normalized signal-to-noise ratios (SNRs) of the recovered sources with respect to the original ones are high, at 36, 38, 36, and 37 dB, respectively.
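A small Python sketch for evaluating this error measure (our reading of the definition, not the authors' code; it uses the squared Frobenius norm mentioned above and solves the column assignment exactly with the Hungarian method, which is one possible way to compute the minimum):

import numpy as np
from scipy.optimize import linear_sum_assignment

def crosstalking_error(A, A_est):
    """min over scaled permutations M of ||A - A_est M||_F^2 (squared Frobenius norm)."""
    n = A.shape[1]
    cost = np.zeros((n, n))
    for j in range(n):                                   # column j of A_est ...
        a_hat = A_est[:, j]
        for k in range(n):                               # ... matched to column k of A
            lam = (a_hat @ A[:, k]) / (a_hat @ a_hat)    # optimal scaling for this pairing
            cost[j, k] = np.sum((A[:, k] - lam * a_hat) ** 2)
    rows, cols = linear_sum_assignment(cost)             # optimal permutation (Hungarian method)
    return cost[rows, cols].sum()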

As a modification of the previous example, we now also consider additive noise. We use the sources S (which have unit covariance) and the mixing matrix A from above, but add 1% random white noise to the mixtures, X = AS + 0.01N, where N is a normal random vector. This corresponds to a still high mean SNR of 38 dB. When considering the normalized scatter plot, again the 6 planes are visible, but the additive noise deteriorates the clear separation of the planes. We apply the generalized Hough transform to the mixture data; however, because of the noise we choose a coarser discretization (β = 180 bins). Curves in Hough space corresponding to a single plane no longer intersect in precisely one point due to the noise; a low-resolution Hough space, however, fuses these intersections into one point, so that our simple maxima detection still achieves good results. We recover the mixing matrix similarly to the above and get a low generalized crosstalking error of E(A, Â) = 0.12. The sources are recovered well with mean SNRs of 20 dB, which is quite satisfactory considering the noisy, overcomplete mixture situation.

The following example demonstrates the good performance in higher source dimensions. Consider 6-dimensional 2-sparse sources that are mixed again by a matrix A with coefficients drawn uniformly from [−1, 1]. Application of the generalized Hough transform to the mixtures retrieves the plane normal vectors. The recovered mixing matrix has a low generalized crosstalking error of E(A, Â) = 0.047. However, if the noise level increases, the performance drops considerably because many maxima, in this case 15, have to be located in the accumulator. After recovering the sources with this approximated matrix Â, we get SNRs of only 11, 8, 6, 10, 12, and 11 dB. The rather high source recovery error is most probably due to the sensitivity of the source recovery to slight perturbations in the approximated mixing matrix.

5.2. Outliers

We now perform experiments systematically analyzing the robustness of the proposed algorithm with respect to outliers in the sense of model-violating samples.

In the first explicit example we consider the sources from Figure 4(a), but 80% of the samples have been replaced by outliers (drawn from a 4-dimensional normal distribution). Due to the high percentage of outliers, the mixtures, mixed by the same random 3 × 4 matrix A as before, do not exhibit any obvious hyperplane structure. As discussed in Section 4.4, the Hough SCA algorithm is very robust against outliers. Indeed, in addition to a noisy background within the Hough accumulator, the intersection maxima are still noticeable, and local maxima detection finds the correct hyperplanes (cf. Figure 4(d)), although 80% of the data is corrupted. The recovered mixing matrix has an excellent generalized crosstalking error of E(A, Â) = 0.040. Of course the sparse source recovery from above cannot recover the outlying samples. Applying the corresponding algorithms, we get SNRs of the corrupted sources with the recovered ones of around 4 dB; source recovery with the pseudoinverse of Â, corresponding to maximum-likelihood recovery with a Gaussian prior, gives somewhat better SNRs of around 6 dB. But the sparse recovery method has the advantage that it can detect outliers by measuring the distance from the hyperplanes, so outlier rejection is possible. Note that we get similar results when the outliers are not added in the source space but only in the mixture space, that is, only after the mixing process.

We now perform a numerical comparison of the number of outliers versus the algorithm performance for varying noise levels; see Figure 5. The rationale behind this is that already small noise levels, in addition to the outliers, might be enough to destroy maxima in the accumulator, thus deteriorating the SCA performance. The same (uncorrupted) sources and mixing matrix as above are used. Numerically, we get breakdown points of 0.8 in the no-noise case, and values of 0.5, 0.3, and 0.1 for increasing noise levels of 0.1% (58 dB), 0.5% (44 dB), and 1% (38 dB). Better performance at higher noise levels could be achieved by applying antialiasing techniques before maxima detection, as described in Section 4.5.

5.3. Grid resolution

In this section we present numerical examples that confirm the linear dependence of the algorithm performance on the inverse grid resolution β^{−1}. We consider 4-dimensional sources S with 1000 samples, in which for each sample two source components were drawn from a distribution uniform in [−1, 1] and the other two were set to zero, so S is 2-sparse. For each grid resolution β we perform 50 runs, and in each run a new set of sources is generated as above. These are then mixed using a 3 × 4 mixing matrix A with random coefficients drawn uniformly from [−1, 1]. Application of the Hough SCA algorithm gives an estimated matrix Â. In Figure 6 we plot the mean generalized crosstalking error E(A, Â) for each grid resolution. With increasing β the accuracy increases; a logarithmic plot indeed confirms the linear dependence on β^{−1} as stated in Section 4.3. Furthermore we see that, for example, for β = 360, among all S and A as above we get a mean crosstalking error of 0.23 ± 0.5.

5.4. Batch runs and comparison with hyperplane k-means

In the last example, we consider the case of m = n = 4, and compare the proposed algorithm (now with a three-dimensional accumulator) with the k-hyperplane clustering algorithm (Algorithm 3).



[Figure 5, two panels: (a) noiseless breakdown analysis with respect to outliers; (b) breakdown analysis for varying noise levels (0%, 0.1%, 0.5%, 1%); both show the crosstalking error versus the percentage of outliers.]

Figure 5: Performance of Hough SCA with increasing number of outliers. Plotted is the percentage of outliers in the source data versus the matrix recovery performance (measured by the generalized crosstalking error). For each 1%-step one calculation was performed; in (b) the plots have been smoothed by taking the average over ten 1%-steps. In the no-noise case 360 bins were used, 180 bins in all other cases.

[Figure 6, two panels: (a) mean performance versus grid resolution; (b) fit of the logarithmic mean performance (ln E and line fit) versus grid resolution.]

Figure 6: Dependence of the Hough SCA performance (a) on the grid resolution β; the mean has been taken over 50 runs. With a logarithmic y-axis (b), a least squares line fit confirms the linear dependence of the performance on β^{−1}.

For this, random sources with T = 10^5 samples are drawn from a uniform distribution on [−1, 1], and a single coordinate is randomly set to zero, thus generating 1-sparse sources S. In 100 batch runs, a random 4 × 4 mixing matrix A with coefficients uniformly drawn from [−1, 1], but with columns normalized to 1, is constructed. The resulting mixtures X := AS are then separated both by the proposed Hough k-SCA algorithm and by the Bradley-Mangasarian k-hyperplane clustering algorithm (with 100 iterations, and without restarts).

The resulting median crosstalking error E(A, Â) of the Hough algorithm is 3.3 ± 2.3, and hence considerably lower than the k-hyperplane clustering result of 5.5 ± 1.9. This confirms the well-known fact that k-means and its extensions exhibit only local convergence and are therefore susceptible to local minima, as seems to be the case in our example. A possible solution would be to use many restarts, but global convergence cannot be guaranteed. For practical applications, we therefore suggest using a rather coarse (low grid resolution β) global search by Hough SCA followed by a finer local search using k-hyperplane clustering; see Section 4.5.



[Figure 7, four panels: (a) source signals; (b) Hough accumulator with three labeled maxima; (c) recovered sources; (d) recovered sources after outlier removal.]

Figure 7: Application to speech signals: (a) shows the original speech sources ("peace and love," "hello, how are you," and "to be or not to be"), and (b) the Hough accumulator when trained on mixtures of (a) with 20% outliers. A nonlinear gray scale γ_new := (1 − γ/γ_max)^10 was chosen for better visualization. (c) and (d) present the recovered sources, without and with outlier removal. They coincide with (a) up to permutation (reversed order) and scaling.

5.5. Application to the separation of speech signals<br />

In order to illustrate that the SCA assumptions are also valid<br />

for real data sets, we shortly present an application to audio<br />

source separation, namely, to the <strong>in</strong>stantaneous, robust BSS<br />

of speech signals—a problem of importance <strong>in</strong> the field of<br />

audio signal process<strong>in</strong>g. In the next section, we then refer to<br />

other works apply<strong>in</strong>g the model to biomedical data sets.<br />

We consider three speech signals S of length 2.2s, sampled<br />

at 22000 Hz; see Figure 7(a). They are spoken by the same<br />

person, but may still be assumed to be <strong>in</strong>dependent. The signals<br />

are mixed by a randomly chosen mix<strong>in</strong>g matrix A (coefficients<br />

uniform from [−1, 1]) to yield mixtures X = AS, but<br />

20% outliers are <strong>in</strong>troduced by replac<strong>in</strong>g 20% of the samples<br />

of X by i.i.d. Gaussian samples. Without the outliers, more<br />

classical BSS algorithms such as ICA would have been able to<br />

perfectly separate the mixtures; however, <strong>in</strong> this noisy sett<strong>in</strong>g,<br />

ICA performs very poorly: application of the popular fastICA<br />

algorithm [29] yields only a poor estimate �A f of the mix<strong>in</strong>g<br />

matrix A, with high crosstalk<strong>in</strong>g error of E(A, �A f ) = 3.73.<br />

Instead, we apply the complete-case Hough-SCA algorithm to this model with β = 360 bins; the sparseness assumption now means that we are searching for sources which have samples with at least one zero (quiet) source

component. The Hough accumulator very nicely exhibits three strong maxima; see Figure 7(b). And indeed, the crosstalking error of the corresponding estimated mixing matrix Â with the original one is very low at E(A, Â) = 0.020. This experimentally confirms that speech signals obey an (m−1)-sparse signal model, at least if m = n. An explanation for this fact is that in typical speech data sets considerable pauses are common, so with high probability we may find samples in which at least one source vanishes, and all such permutations occur, which is necessary for identifying the mixing matrix according to Theorem 1. We are dealing with a complete-case problem, so inverting Â directly yields recovered sources Ŝ. But of course, due to the outliers, the SNR of Ŝ with the original sources is low, at only −1.35 dB. We

therefore apply a simple outlier removal scheme by scanning each estimated source using a window of size w = 10 samples. A sample adjacent to the window is identified as an outlier if its absolute value is larger than 20% of the maximal signal amplitude, but the window sample variance is lower than half of the variance when including the sample. The outliers are then replaced by the window average. This rough outlier-detection algorithm works satisfactorily well, see Figure 7(d);


the perceptual audio quality increased considerably, see also the differences between Figures 7(c) and 7(d), although the nominal SNR increase is only roughly 4.1 dB. Altogether, this example illustrates the applicability of the Hough SCA algorithm and its corresponding SCA model to audio data sets also in noisy settings, where ICA algorithms perform very poorly.
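For completeness, here is a rough reconstruction of the window-based outlier removal used above. The thresholds (20% of the maximal amplitude, half of the variance) follow the text; the scan direction, boundary handling and the function name are assumptions.

```python
import numpy as np

def remove_outliers(s, w=10, amp_frac=0.2, var_ratio=0.5):
    """Window-based outlier removal for one estimated source (1-D array).

    A sample adjacent to the running window is flagged as an outlier if its
    magnitude exceeds amp_frac * max|s| while the window variance is smaller
    than var_ratio times the variance obtained when the sample is included;
    flagged samples are replaced by the window mean.
    """
    s = np.asarray(s, dtype=float).copy()
    thresh = amp_frac * np.max(np.abs(s))
    for t in range(w, len(s)):
        window = s[t - w:t]
        var_without = np.var(window)
        var_with = np.var(np.append(window, s[t]))
        if abs(s[t]) > thresh and var_without < var_ratio * var_with:
            s[t] = np.mean(window)
    return s
```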

5.6. Other applications
We are currently studying several biomedical applications of the proposed model and algorithm, including the separation of functional magnetic resonance imaging data sets as well as surface electromyograms. For results on the former data set, we refer to the detailed book chapters [22, 23].
The results of the k-SCA algorithm applied to the latter signals are shortly summarized in the following. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle; its study is relevant to the diagnosis of motoneuron diseases as well as to neurophysiological research. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use surface EMGs, which are measured using noninvasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and the overlap of several source signals. When applying the k-SCA model to real recordings, Hough-based separation outperforms classical approaches based on filtering and ICA in terms of a greater reduction of the zero-crossings, a common measure to analyze the unknown extracted sources. The relative sEMG enhancement was 24.6 ± 21.4%, where the mean was taken over a group of 9 subjects. For a detailed analysis, comparing various sparse factorization models both on toy and on real data, we refer to [30].

6. CONCLUSION
We have presented an algorithm for performing a global search for overcomplete SCA representations, and experiments confirm that Hough SCA is robust against noise and outliers with breakdown points up to 0.8. The algorithm employs hyperplane detection using a generalized Hough transform. Currently, we are working on applying the SCA algorithm to high-dimensional biomedical data sets to see how the different assumption of high sparsity contributes to the signal separation.

ACKNOWLEDGMENTS<br />

The authors gratefully thank W. Nakamura for her suggestion<br />

of us<strong>in</strong>g the Hough transform when detect<strong>in</strong>g<br />

hyperplanes, and the anonymous reviewers for their comments,<br />

which significantly improved the manuscript. The<br />

first author acknowledges partial f<strong>in</strong>ancial support by the<br />

JSPS (PE 05543).<br />

REFERENCES<br />

[1] A. Cichocki and S. Amari, Adaptive Bl<strong>in</strong>d Signal and Image<br />

Process<strong>in</strong>g, John Wiley & Sons, New York, NY, USA, 2002.<br />

[2] A. Hyvär<strong>in</strong>en, J. Karhunen, and E. Oja, <strong>Independent</strong> <strong>Component</strong><br />

<strong>Analysis</strong>, John Wiley & Sons, New York, NY, USA, 2001.<br />

[3] P. Comon, “<strong>Independent</strong> component analysis. A new concept?”<br />

Signal Process<strong>in</strong>g, vol. 36, no. 3, pp. 287–314, 1994.<br />

[4] F. J. Theis, “A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d<br />

source separation,” Neural Computation, vol. 16, no. 9, pp.<br />

1827–1850, 2004.<br />

[5] J. Eriksson and V. Koivunen, “Identifiability and separability<br />

of l<strong>in</strong>ear ica models revisited,” <strong>in</strong> Proceed<strong>in</strong>gs of the 4th International<br />

Symposium on <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong> and<br />

Bl<strong>in</strong>d Source Separation (ICA ’03), pp. 23–27, Nara, Japan,<br />

April 2003.<br />

[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition<br />

by basis pursuit,” SIAM Journal of Scientific Comput<strong>in</strong>g,<br />

vol. 20, no. 1, pp. 33–61, 1998.<br />

[7] D. L. Donoho and M. Elad, “Optimally sparse representation<br />

<strong>in</strong> general (nonorthogonal) dictionaries via l 1 m<strong>in</strong>imization,”<br />

Proceed<strong>in</strong>gs of the National Academy of Sciences of the United<br />

States of America, vol. 100, no. 5, pp. 2197–2202, 2003.<br />

[8] F. J. Theis, E. W. Lang, and C. G. Puntonet, “A geometric algorithm<br />

for overcomplete l<strong>in</strong>ear ICA,” Neurocomput<strong>in</strong>g, vol. 56,<br />

no. 1–4, pp. 381–398, 2004.<br />

[9] P. Georgiev, F. J. Theis, and A. Cichocki, “Sparse component<br />

analysis and bl<strong>in</strong>d source separation of underdeterm<strong>in</strong>ed mixtures,”<br />

IEEE Transactions on Neural Networks, vol. 16, no. 4, pp.<br />

992–996, 2005.<br />

[10] P. V. C. Hough, “Mach<strong>in</strong>e analysis of bubble chamber pictures,”<br />

<strong>in</strong> International Conference on High Energy Accelerators<br />

and Instrumentation, pp. 554–556, CERN, Geneva, Switzerland,<br />

1959.<br />

[11] J. K. L<strong>in</strong>, D. G. Grier, and J. D. Cowan, “Feature extraction approach<br />

to bl<strong>in</strong>d source separation,” <strong>in</strong> Proceed<strong>in</strong>gs of the IEEE<br />

Workshop on Neural Networks for Signal Process<strong>in</strong>g (NNSP ’97),<br />

pp. 398–405, Amelia Island, Fla, USA, September 1997.<br />

[12] H. Sh<strong>in</strong>do and Y. Hirai, “An approach to overcomplete-bl<strong>in</strong>d<br />

source separation us<strong>in</strong>g geometric structure,” <strong>in</strong> Proceed<strong>in</strong>gs of<br />

Annual Conference of Japanese Neural Network Society (JNNS<br />

’01), pp. 95–96, Naramachi Center, Nara, Japan, 2001.<br />

[13] F. J. Theis, C. G. Puntonet, and E. W. Lang, “Median-based<br />

cluster<strong>in</strong>g for underdeterm<strong>in</strong>ed bl<strong>in</strong>d signal process<strong>in</strong>g,” IEEE<br />

Signal Process<strong>in</strong>g Letters, vol. 13, no. 2, pp. 96–99, 2006.<br />

[14] L. Cirillo, A. Zoubir, and M. Am<strong>in</strong>, “Direction f<strong>in</strong>d<strong>in</strong>g of nonstationary<br />

signals us<strong>in</strong>g a time-frequency Hough transform,”<br />

<strong>in</strong> Proceed<strong>in</strong>gs of IEEE International Conference on Acoustics,<br />

Speech, and Signal Process<strong>in</strong>g (ICASSP ’05), pp. 2718–2721,<br />

Philadelphia, Pa, USA, March 2005.<br />

[15] S. Barbarossa, “<strong>Analysis</strong> of multicomponent LFM signals by<br />

a comb<strong>in</strong>ed Wigner-Hough transform,” IEEE Transactions on<br />

Signal Process<strong>in</strong>g, vol. 43, no. 6, pp. 1511–1515, 1995.<br />

[16] D. H. Ballard, “Generaliz<strong>in</strong>g the Hough transform to detect<br />

arbitrary shapes,” Pattern Recognition, vol. 13, no. 2, pp. 111–<br />

122, 1981.<br />

[17] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski,<br />

“Bl<strong>in</strong>d source separation of more sources than mixtures us<strong>in</strong>g<br />

overcomplete representations,” IEEE Signal Process<strong>in</strong>g Letters,<br />

vol. 6, no. 4, pp. 87–90, 1999.<br />

[18] K. Waheed and F. Salem, “Algebraic overcomplete <strong>in</strong>dependent<br />

component analysis,” <strong>in</strong> Proceed<strong>in</strong>gs of the 4th International<br />

Symposium on <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong> and<br />

Bl<strong>in</strong>d Source Separation (ICA ’03), pp. 1077–1082, Nara, Japan,<br />

April 2003.



[19] M. Zibulevsky and B. A. Pearlmutter, “Bl<strong>in</strong>d source separation<br />

by sparse decomposition <strong>in</strong> a signal dictionary,” Neural Computation,<br />

vol. 13, no. 4, pp. 863–882, 2001.<br />

[20] F. J. Theis, P. Georgiev, and A. Cichocki, “Robust overcomplete<br />

matrix recovery for sparse sources us<strong>in</strong>g a generalized Hough<br />

transform,” <strong>in</strong> Proceed<strong>in</strong>gs of 12th European Symposium on Artificial<br />

Neural Networks (ESANN ’04), pp. 343–348, Bruges,<br />

Belgium, April 2004, d-side, Evere, Belgium.<br />

[21] P. S. Bradley and O. L. Mangasarian, “k-plane cluster<strong>in</strong>g,” Journal<br />

of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.<br />

[22] P. Georgiev, P. Pardalos, F. J. Theis, A. Cichocki, and H.<br />

Bakardjian, “Sparse component analysis: a new tool for data<br />

m<strong>in</strong><strong>in</strong>g,” <strong>in</strong> Data M<strong>in</strong><strong>in</strong>g <strong>in</strong> Biomedic<strong>in</strong>e, Spr<strong>in</strong>ger, New York,<br />

NY, USA, 2005, <strong>in</strong> pr<strong>in</strong>t.<br />

[23] P. Georgiev, F. J. Theis, and A. Cichocki, “Optimization algorithms<br />

for sparse representations and applications,” <strong>in</strong> Multiscale<br />

Optimization Methods, P. Pardalos, Ed., Spr<strong>in</strong>ger, New<br />

York, NY, USA, 2005.<br />

[24] R. O. Duda and P. E. Hart, “Use of the Hough transformation<br />

to detect l<strong>in</strong>es and curves <strong>in</strong> pictures,” Communications of the<br />

ACM, vol. 15, no. 1, pp. 204–208, 1972.<br />

[25] R. Dudley, Department of <strong>Mathematics</strong>, MIT, course 18.465,<br />

2005.<br />

[26] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier<br />

Detection, John Wiley & Sons, New York, NY, USA, 1987.<br />

[27] P. Ballester, “Applications of the Hough transform,” <strong>in</strong> Astronomical<br />

Data <strong>Analysis</strong> Software and Systems III, J. Barnes, D.<br />

R. Crabtree, and R. J. Hanisch, Eds., vol. 61 of ASP Conference<br />

Series, 1994.<br />

[28] A. Goldenshluger and A. Zeevi, “The Hough transform estimator,”<br />

Annals of Statistics, vol. 32, no. 5, pp. 1908–1932, 2004.<br />

[29] A. Hyvär<strong>in</strong>en and E. Oja, “A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent<br />

component analysis,” Neural Computation, vol. 9,<br />

no. 7, pp. 1483–1492, 1997.<br />

[30] F. J. Theis and G. A. García, “On the use of sparse signal<br />

decomposition <strong>in</strong> the analysis of multi-channel surface electromyograms,”<br />

Signal Process<strong>in</strong>g, vol. 86, no. 3, pp. 603–623,<br />

2006.<br />

Fabian J. Theis obta<strong>in</strong>ed his M.S. degree <strong>in</strong><br />

mathematics and physics from the University<br />

of Regensburg, Germany, <strong>in</strong> 2000. He<br />

also received the Ph.D. degree <strong>in</strong> physics<br />

from the same university <strong>in</strong> 2002 and the<br />

Ph.D. degree <strong>in</strong> computer science from the<br />

University of Granada <strong>in</strong> 2003. He worked<br />

as a Visit<strong>in</strong>g Researcher at the Department<br />

of Architecture and Computer Technology<br />

(University of Granada, Spa<strong>in</strong>), at<br />

the RIKEN Bra<strong>in</strong> Science Institute (Wako, Japan), at FAMU-FSU<br />

(Florida State University, USA), and at TUAT’s Laboratory for Signal<br />

and Image Process<strong>in</strong>g (Tokyo, Japan). Currently, he is head<strong>in</strong>g<br />

the Signal Process<strong>in</strong>g & Information Theory Group at the Institute<br />

of Biophysics at the University of Regensburg and is work<strong>in</strong>g<br />

on his habilitation. He serves as an Associate Editor of “Computational<br />

Intelligence and Neuroscience,” and is a Member of<br />

IEEE, EURASIP, and ENNS. His research <strong>in</strong>terests <strong>in</strong>clude statistical<br />

signal process<strong>in</strong>g, mach<strong>in</strong>e learn<strong>in</strong>g, bl<strong>in</strong>d source separation,<br />

and biomedical data analysis.<br />

Pando Georgiev received his M.S., Ph.D.,<br />

and “Doctor of Mathematical Sciences” degrees<br />

<strong>in</strong> mathematics (operations research)<br />

from Sofia University “St. Kl. Ohridski,”<br />

Bulgaria, <strong>in</strong> 1982, 1987, and 2001, respectively.<br />

He has been with the Department<br />

of Probability, Operations Research, and<br />

Statistics at the Faculty of <strong>Mathematics</strong><br />

and Informatics, Sofia University “St. Kl.<br />

Ohridski,” Bulgaria, as an Assistant Professor<br />

(1989–1994), and s<strong>in</strong>ce 1994, as an Associate Professor. He was<br />

a Visit<strong>in</strong>g Professor at the University of Rome II, Italy (CNR grants,<br />

several one-month visits), the International Center for Theoretical<br />

Physics, Trieste, Italy (ICTP grant, six months), the University<br />

of Pau, France (NATO grant, three months), Hirosaki University,<br />

Japan (JSPS grant, n<strong>in</strong>e months), and so forth. He has been work<strong>in</strong>g<br />

for four years (2000–2004) as a research scientist at the Laboratory<br />

for Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g, Bra<strong>in</strong> Science Institute,<br />

the Institute of Physical and Chemical Research (RIKEN), Wako,<br />

Japan. After that and currently he is a Visit<strong>in</strong>g Scholar <strong>in</strong> ECECS<br />

Department, University of C<strong>in</strong>c<strong>in</strong>nati, USA. His <strong>in</strong>terests <strong>in</strong>clude<br />

mach<strong>in</strong>e learn<strong>in</strong>g and computational <strong>in</strong>telligence, <strong>in</strong>dependent and<br />

sparse component analysis, bl<strong>in</strong>d signal separation, statistics and<br />

<strong>in</strong>verse problems, signal and image process<strong>in</strong>g, optimization, and<br />

variational analysis. He is a Member of AMS, IEEE, and UBM.<br />

Andrzej Cichocki was born <strong>in</strong> Poland. He<br />

received the M.S. (with honors), Ph.D., and<br />

Habilitate Doctorate (Dr.Sc.) degrees, all<br />

<strong>in</strong> electrical eng<strong>in</strong>eer<strong>in</strong>g, from the Warsaw<br />

University of Technology (Poland) <strong>in</strong> 1972,<br />

1975, and 1982, respectively. He is the coauthor<br />

of three <strong>in</strong>ternational and successful<br />

books (two of them were translated to Ch<strong>in</strong>ese):<br />

Adaptive Bl<strong>in</strong>d Signal and Image Process<strong>in</strong>g<br />

(John Wiley, 2002) MOS Switched-<br />

Capacitor and Cont<strong>in</strong>uous-Time Integrated Circuits and Systems<br />

(Spr<strong>in</strong>ger, 1989), and Neural Networks for Optimization and Signal<br />

Process<strong>in</strong>g (J. Wiley and Teubner Verlag, 1993/1994) and the author<br />

or coauthor of more than three hundred papers. He is the Editor<strong>in</strong>-Chief<br />

of the Journal Computational Intelligence and Neuroscience<br />

and an Associate Editor of IEEE Transactions on Neural<br />

Networks. S<strong>in</strong>ce 1997, he has been the Head of the Laboratory for<br />

Advanced Bra<strong>in</strong> Signal Process<strong>in</strong>g <strong>in</strong> the Riken Bra<strong>in</strong> Science Institute,<br />

Japan.


Chapter 12<br />

LNCS 3195:718-725, 2004<br />

Paper F.J. Theis and S. Amari. Postnonl<strong>in</strong>ear overcomplete bl<strong>in</strong>d source separation<br />

us<strong>in</strong>g sparse sources. In Proc. ICA 2004, volume 3195 of LNCS, pages 718-<br />

725, Granada, Spa<strong>in</strong>, 2004<br />

Reference (Theis and Amari, 2004)<br />

Summary <strong>in</strong> section 1.4.1<br />

175



Postnonl<strong>in</strong>ear overcomplete bl<strong>in</strong>d source<br />

separation us<strong>in</strong>g sparse sources<br />

Fabian J. Theis 1,2 and Shun-ichi Amari 1<br />

1 Bra<strong>in</strong> Science Institute, RIKEN<br />

2-1, Hirosawa, Wako-shi, Saitama, 351-0198, Japan<br />

2 Institute of Biophysics, University of Regensburg<br />

D-93040 Regensburg, Germany<br />

fabian@theis.name,amari@bra<strong>in</strong>.riken.go.jp<br />

Abstract. We present an approach for blindly decomposing an observed random vector x into f(As), where f is a diagonal function, i.e. f = f1 × ... × fm with one-dimensional functions fi, and A an m × n matrix. This postnonlinear model is allowed to be overcomplete, which means that fewer observations than sources (m < n) are given. In contrast to Independent Component Analysis (ICA) we do not assume the sources s to be independent but to be sparse in the sense that at each time instant they have at most m − 1 non-zero components (Sparse Component Analysis or SCA). Identifiability of the model is shown, and an algorithm for model and source recovery is proposed. It first detects the postnonlinearities in each component, and then identifies the now linearized model using previous results.

Bl<strong>in</strong>d source separation (BSS) based on ICA is a rapidly grow<strong>in</strong>g field (see<br />

for <strong>in</strong>stance [1,2] and references there<strong>in</strong>), but most algorithms deal only with the<br />

case of at least as many observations as sources. However, there is an <strong>in</strong>creas<strong>in</strong>g<br />

<strong>in</strong>terest <strong>in</strong> (l<strong>in</strong>ear) overcomplete ICA [3–5], where matrix identifiability is known<br />

[6], but source identifiability does not hold. In order to approximatively detect<br />

the sources [7], additional requirements have to be made, usually sparsity of the<br />

sources.<br />

Recently, we have proposed a model based only upon the sparsity assumption<br />

(summarized <strong>in</strong> section 1) [8]. In this case identifiability of both matrix and<br />

sources given sufficiently high sparsity can be shown. Here, we extend these results<br />

to postnonl<strong>in</strong>ear mixtures (section 2); they describe a model often occurr<strong>in</strong>g<br />

<strong>in</strong> real situations, when the mixture is <strong>in</strong> pr<strong>in</strong>ciple l<strong>in</strong>ear, but the sensors <strong>in</strong>troduce<br />

an additional nonl<strong>in</strong>earity dur<strong>in</strong>g the record<strong>in</strong>g [9]. Section 3 presents an<br />

algorithm for identify<strong>in</strong>g such models, and section 4 f<strong>in</strong>ishes with an illustrative<br />

simulation.<br />

1 L<strong>in</strong>ear overcomplete SCA<br />

Def<strong>in</strong>ition 1. A vector v ∈ R n is said to be k-sparse if v has at most k non-zero<br />

entries.



If an n-dimensional vector is (n − 1)-sparse, that is it <strong>in</strong>cludes at least one<br />

zero component, it is simply said to be sparse. The goal of Sparse <strong>Component</strong><br />

<strong>Analysis</strong> of level k (k-SCA) is to decompose a given m-dimensional random<br />

vector x <strong>in</strong>to<br />

x = As (1)<br />

with a real m × n-matrix A and an n-dimensional k-sparse random vector s. s<br />

is called the source vector, x the mixtures and A the mix<strong>in</strong>g matrix. We speak<br />

of complete, overcomplete or undercomplete k-SCA if m = n, m < n or m > n<br />

respectively. In the follow<strong>in</strong>g without loss of generality we will assume m ≤ n<br />

because the undercomplete case can be easily reduced to the complete case by<br />

projection of x.<br />

Theorem 1 (Matrix identifiability). Consider the k-SCA problem from equation<br />

1 for k := m − 1 and assume that every m × m-submatrix of A is <strong>in</strong>vertible.<br />

Furthermore let s be sufficiently rich represented <strong>in</strong> the sense that for any <strong>in</strong>dex<br />

set of n − m + 1 elements I ⊂ {1, ..., n} there exist at least m samples of s such<br />

that each of them has zero elements <strong>in</strong> places with <strong>in</strong>dexes <strong>in</strong> I and each m − 1<br />

of them are l<strong>in</strong>early <strong>in</strong>dependent. Then A is uniquely determ<strong>in</strong>ed by x except for<br />

left-multiplication with permutation and scal<strong>in</strong>g matrices.<br />

Theorem 2 (Source identifiability). Let H be the set of all x ∈ Rᵐ such

that the l<strong>in</strong>ear system As = x has an (m − 1)-sparse solution s. If A fulfills<br />

the condition from theorem 1, then there exists a subset H0 ⊂ H with measure<br />

zero with respect to H, such that for every x ∈ H \ H0 this system has no other<br />

solution with this property.<br />

The above two theorems show that <strong>in</strong> the case of overcomplete BSS us<strong>in</strong>g<br />

(m−1)-SCA, both the mix<strong>in</strong>g matrix and the sources can uniquely be recovered<br />

from x except for the omnipresent permutation and scal<strong>in</strong>g <strong>in</strong>determ<strong>in</strong>acy. We<br />

refer to [8] for proofs of these theorems and algorithms based upon them. We<br />

also want to note that the present source recovery algorithm is quite different<br />

from the usual sparse source recovery us<strong>in</strong>g l1-norm m<strong>in</strong>imization [7] and l<strong>in</strong>ear<br />

programm<strong>in</strong>g. In the case of sources with sparsity as above, the latter will not<br />

be able to detect the sources.<br />
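As an illustration of the source identifiability statement, the following sketch recovers (m−1)-sparse sources when A is known by testing, for every sample, all subsets of m−1 columns and keeping the least-squares solution with the smallest residual. This brute-force search is only meant to illustrate Theorem 2; it is not the recovery algorithm of [8].

```python
import numpy as np
from itertools import combinations

def recover_sparse_sources(A, X):
    """Recover (m-1)-sparse sources from X = A S when A is known (illustration).

    For every sample x, solve the least-squares problem restricted to each
    subset of m-1 columns of A and keep the subset with the smallest residual.
    A: (m, n), X: (m, T); returns an (n, T) estimate of the sources.
    """
    m, n = A.shape
    T = X.shape[1]
    S_hat = np.zeros((n, T))
    subsets = [list(c) for c in combinations(range(n), m - 1)]
    for t in range(T):
        x = X[:, t]
        best = (np.inf, None, None)
        for cols in subsets:
            s_sub, *_ = np.linalg.lstsq(A[:, cols], x, rcond=None)
            r = np.linalg.norm(A[:, cols] @ s_sub - x)
            if r < best[0]:
                best = (r, cols, s_sub)
        S_hat[best[1], t] = best[2]
    return S_hat
```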

2 Postnonl<strong>in</strong>ear overcomplete SCA<br />

2.1 Model<br />

Consider n-dimensional k-sparse sources s with k < m. The postnonl<strong>in</strong>ear mix<strong>in</strong>g<br />

model [9] is def<strong>in</strong>ed to be<br />

x = f(As) (2)<br />

with a diagonal <strong>in</strong>vertible function f with f(0) = 0 and a real m × n-matrix A.<br />

Here a function f is said to be diagonal if each component fi only depends on<br />

xi. In abuse of notation we will <strong>in</strong> this case <strong>in</strong>terpret the components fi of f as



functions with doma<strong>in</strong> R and write f = f1 × . . . × fm. The goal of overcomplete<br />

postnonl<strong>in</strong>ear k-SCA is to determ<strong>in</strong>e the mix<strong>in</strong>g functions f and A and the<br />

sources s given only x.<br />

Without loss of generality we consider only the complete and the overcomplete case (i.e. m ≤ n). In the following we will assume that the sources are sparse of level k := m − 1 and that the components fi of f are continuously differentiable with fi′(t) ≠ 0. This is equivalent to saying that the fi are continuously differentiable with continuously differentiable inverse functions (diffeomorphisms).
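For experimentation, data following this model can be generated as in the short sketch below; the particular componentwise nonlinearity (tanh(u) + 0.1u, which satisfies f(0) = 0) and the sampling of the sources are illustrative assumptions.

```python
import numpy as np

def generate_pnl_mixtures(A, T, seed=0):
    """Generate (m-1)-sparse sources and postnonlinear mixtures x = f(A s).

    A: (m, n) mixing matrix.  Each source sample has at most m-1 non-zero
    entries (uniform in [-0.5, 0.5]); f acts componentwise with f(0) = 0.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    S = rng.uniform(-0.5, 0.5, size=(n, T))
    for t in range(T):
        zeros = rng.choice(n, size=n - (m - 1), replace=False)
        S[zeros, t] = 0.0                      # enforce (m-1)-sparsity
    f = lambda u: np.tanh(u) + 0.1 * u         # illustrative diffeomorphism, f(0) = 0
    X = f(A @ S)
    return S, X
```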

2.2 Identifiability<br />

Definition 2. Let A be an m × n matrix. Then A is said to be mixing if A has at least two nonzero entries in each row. And A = (aij), i = 1...m, j = 1...n, is said to be absolutely degenerate if there are two columns k ≠ l such that a_ik² = λ a_il² for all i and fixed λ ≠ 0, i.e. the normalized columns differ only by the signs of the entries.

Postnonlinear overcomplete SCA is a generalization of linear overcomplete SCA, so the indeterminacies of postnonlinear SCA contain at least the indeterminacies of linear overcomplete SCA: A can only be reconstructed up to scaling and permutation. Also, if L is an invertible scaling matrix, then
f(As) = (f ◦ L)((L⁻¹A)s),
so f and A can interchange scaling factors in each component.
Two further indeterminacies occur if A is either not mixing or absolutely degenerate. In the first case, this means that fi cannot be identified if the ith row of A contains only one non-zero element. In the case of an absolutely degenerate mixing matrix, sparseness alone cannot detect the nonlinearity, as the counterexample A = (1 1; 1 −1) with arbitrary f1 ≡ f2 shows.

If s is an n-dimensional random vector, its image (or the support of its<br />

density) is denoted as im s := {s(t)}.<br />

Theorem 3 (Identifiability). Let s be an n-dimensional k-sparse random vector (k < m), and x an m-dimensional random vector constructed from s as in equation 2. Furthermore assume that
(i) s is fully k-sparse in the sense that im s equals the union of all k-dimensional coordinate spaces (in which it is contained by the sparsity assumption),
(ii) A is mixing and not absolutely degenerate,
(iii) every m × m-submatrix of A is invertible.
If x = f̂(Âŝ) is another representation of x as in equation 2 with ŝ satisfying the same conditions as s, then there exists an invertible scaling L with f = f̂ ◦ L, and invertible scaling and permutation matrices L′, P′ with A = LÂL′P′.


Fig. 1. Illustration of the proof of theorem 3 in the case n = 3, m = 2. The 3-dimensional 1-sparse sources (leftmost figure) are first linearly mapped onto R² by A and then postnonlinearly distorted by f := f1 × f2 (middle figure). Separation is performed by first estimating the separating postnonlinearities g := g1 × g2 and then performing overcomplete source recovery (right figure) according to the algorithms from [8]. The idea of the proof now is that two lines spanned by coordinate vectors (thick lines, leftmost figure) are mapped onto two lines spanned by two columns of A. If the composition g ◦ f maps these lines onto some different lines (as sets), then we show that (given 'general position' of the two lines) the components of g ◦ f satisfy the conditions from lemma 1 and hence are already linear.

The proof relies on the fact that when s is fully k-sparse as formulated <strong>in</strong> 3(i),<br />

it <strong>in</strong>cludes all the k-dimensional coord<strong>in</strong>ate subspaces and hence <strong>in</strong>tersections<br />

of k such subspaces, which give the n coord<strong>in</strong>ate axes. They are transformed<br />

<strong>in</strong>to n curves <strong>in</strong> the x-space, pass<strong>in</strong>g through the orig<strong>in</strong>. By identification of<br />

these curves, we show that each nonl<strong>in</strong>earity is homogeneous and hence l<strong>in</strong>ear<br />

accord<strong>in</strong>g to the previous section. The proof is omitted due to lack of space.<br />

Figure 1 gives an illustration of the proof <strong>in</strong> the case n = 3 and m = 2. It uses<br />

the follow<strong>in</strong>g lemma (a generalization of the analytic case presented <strong>in</strong> [10]).<br />

Lemma 1. Let a, b ∈ R \ {−1, 0, 1}, a > 0, and f : [0, ε) → R be differentiable such that f(ax) = bf(x) for all x ∈ [0, ε) with ax ∈ [0, ε). If lim_{t→0+} f′(t) exists and does not vanish, then f is linear.

Theorem 3 shows that f and A are uniquely determined by x except for scaling and permutation ambiguities. Note that then obviously also s is identifiable by applying theorem 2 to the linearized mixtures y = f⁻¹(x) = As, given the additional assumptions on s from the theorem.
For brevity, the theorem assumes in (i) that im s is the whole union of the k-dimensional coordinate spaces; this condition can be relaxed (the proof is local in nature), but then the nonlinearities can only be found on intervals where the corresponding marginal densities of As are non-zero (in addition, however, the proof needs that locally at 0 they are nonzero). Furthermore, in practice the assumption about the image of s will have to be replaced by assuming the same with non-zero probability. Also note that almost any A ∈ R^{m×n} in the measure sense fulfills the conditions (ii) and (iii).
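The conditions (ii) and (iii) are easy to verify numerically for a given A; the following sketch (with an assumed tolerance) checks the mixing property, absolute non-degeneracy and invertibility of all m × m submatrices.

```python
import numpy as np
from itertools import combinations

def check_identifiability_conditions(A, tol=1e-10):
    """Numerically check conditions (ii) and (iii) of Theorem 3 for a given A."""
    m, n = A.shape
    # (ii) mixing: at least two non-zero entries in every row
    mixing = all(np.sum(np.abs(A[i]) > tol) >= 2 for i in range(m))
    # (ii) not absolutely degenerate: no two columns with proportional squared entries
    def abs_degenerate(u, v):
        return np.linalg.matrix_rank(np.column_stack((u ** 2, v ** 2)), tol=tol) <= 1
    not_degenerate = not any(abs_degenerate(A[:, k], A[:, l])
                             for k, l in combinations(range(n), 2))
    # (iii) every m x m submatrix invertible
    submatrices_ok = all(abs(np.linalg.det(A[:, list(c)])) > tol
                         for c in combinations(range(n), m))
    return mixing, not_degenerate, submatrices_ok
```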

3 Algorithm for postnonl<strong>in</strong>ear (over)complete SCA<br />

The separation is done <strong>in</strong> a two-stage procedure: In the first step, after geometrical<br />

preprocess<strong>in</strong>g the postnonl<strong>in</strong>earities are estimated us<strong>in</strong>g an idea similar to<br />




the one used <strong>in</strong> the identifiability proof of theorem 3, also see figure 1. In the<br />

second stage, the mix<strong>in</strong>g matrix A and then the sources s are reconstructed<br />

by apply<strong>in</strong>g the l<strong>in</strong>ear algorithms from [8], section 1, to the l<strong>in</strong>earized mixtures<br />

f −1 x. So <strong>in</strong> the follow<strong>in</strong>g it is enough to reconstruct f.<br />

3.1 Geometrical preprocess<strong>in</strong>g<br />

Let x(1), ..., x(T) ∈ Rᵐ be i.i.d. samples of the random vector x. The goal of geometrical preprocessing is to construct vectors y(1), ..., y(T) and z(1), ..., z(T) ∈ Rᵐ using clustering or interpolation on the samples x(t) such that f⁻¹(y(t)) and f⁻¹(z(t)) lie in two linearly independent lines of Rᵐ. In figure 1 they are to span the two thick lines which already determine the postnonlinearities.

Algorithmically, y and z can be constructed <strong>in</strong> the case m = 2 by first choos<strong>in</strong>g<br />

far away samples (on different ’non-opposite’ curves) as <strong>in</strong>itial start<strong>in</strong>g po<strong>in</strong>t<br />

and then advanc<strong>in</strong>g to the known data set center by always choos<strong>in</strong>g the closest<br />

samples of x with smaller modulus. Such an algorithm can also be implemented<br />

for larger m but only for sources with at most one non-zero coefficient at each<br />

time <strong>in</strong>stant, but it can be generalized to sources of sparseness m − 1 us<strong>in</strong>g more<br />

elaborate cluster<strong>in</strong>g.<br />
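One loose interpretation of this greedy tracing step for m = 2 is sketched below: starting from a far-away sample, the routine repeatedly moves to the nearest sample of smaller norm, so the visited points trace one of the curves through the origin. Starting-point selection and the stopping rule are assumptions.

```python
import numpy as np

def trace_curve(X, start_idx):
    """Greedily trace one curve of the mixture data towards the origin (m = 2 sketch).

    Starting from a far-away sample, repeatedly move to the nearest sample of
    smaller norm; the visited samples approximate one curve through the origin
    and can serve as the points y(t) (or z(t)) of the geometrical preprocessing.
    """
    norms = np.linalg.norm(X, axis=0)
    path = [start_idx]
    current = start_idx
    while True:
        candidates = np.where(norms < norms[current])[0]
        if candidates.size == 0:
            break
        d = np.linalg.norm(X[:, candidates] - X[:, [current]], axis=0)
        current = int(candidates[np.argmin(d)])
        path.append(current)
    return X[:, path]
```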

3.2 Postnonl<strong>in</strong>earity estimation<br />

Given the subspace vectors y(t) and z(t) from the previous section, the goal is to find C¹-diffeomorphisms gi : R → R such that g1 × ... × gm maps the vectors y(t) and z(t) onto two different linear subspaces.
In abuse of notation, we now assume that two curves (injective infinitely differentiable mappings) y, z : (−1, 1) → Rᵐ are given with y(0) = z(0) = 0. These can for example be constructed from the discrete sample points y(t) and z(t) from the previous section by polynomial or spline interpolation. If the two curves are mapped onto lines by g1 × ... × gm (and if these are in sufficiently general position), then gi = λi fi⁻¹ for some λi ≠ 0 according to theorem 3. By

requiring this condition only for the discrete sample points from the previous section we get an approximation of the unmixing nonlinearities gi. Let i ≠ j be fixed. It is then easy to see that by projecting x, y and z onto the i-th and j-th coordinate, the problem of finding the nonlinearities can be reduced to the case m = 2, with g2 to be reconstructed, which we will assume in the following.
A is chosen to be mixing, so we can assume that the indices i, j were chosen such that the two lines f⁻¹ ◦ y, f⁻¹ ◦ z : (−1, 1) → R² do not coincide with the coordinate axes. Reparametrization (ȳ := y ◦ y1⁻¹) of the curves lets us further assume that y1 = z1 = id. Then after some algebraic manipulation, the condition that the separating nonlinearities g = g1 × g2 must map y and z onto lines can be written as g2 ◦ y2 = a·g1 = (a/b)·(g2 ◦ z2) with constants a, b ∈ R \ {0}, a ≠ ±b.
So the goal of geometrical postnonlinearity detection is to find a C¹-diffeomorphism g on subsets of R with
g ◦ y = c·(g ◦ z)   (3)



for an unknown constant c ≠ 0, ±1 and given curves y, z : (−1, 1) → R with y(0) = z(0) = 0. By theorem 3, g (and also c) is uniquely determined by y and z except for scaling. Indeed, by taking derivatives in equation 3 we get c = y′(0)/z′(0), so c can be directly calculated from the known curves y and z.
In the following section, we propose to solve this problem numerically, given samples y(t1), z(t1), ..., y(tT), z(tT) of the curves. Note that here it is assumed that the samples of the curves y and z are given at the same time instants ti ∈ (−1, 1). In practice this is usually not the case, so values of z at the sample points of y (and vice versa) will first have to be estimated, for example by spline interpolation.
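A minimal numerical version of these two steps, assuming the curve samples are sorted with the first entries closest to the origin, could look as follows; linear interpolation is used here for simplicity where the text suggests splines.

```python
import numpy as np

def estimate_c(t_y, y, t_z, z):
    """Estimate c = y'(0)/z'(0) by forward differences at the samples nearest 0."""
    dy = (y[1] - y[0]) / (t_y[1] - t_y[0])
    dz = (z[1] - z[0]) / (t_z[1] - t_z[0])
    return dy / dz

def resample_z_at_y(t_y, t_z, z):
    """Evaluate z at the sample points of y (linear interpolation for simplicity;
    the text suggests spline interpolation; t_z must be increasing)."""
    return np.interp(t_y, t_z, z)
```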

3.3 MLP-based postnonl<strong>in</strong>earity approximation<br />

We want to find an approximation g̃ (in some parametrization) of g with g̃(y(ti)) = c·g̃(z(ti)) for i = 1, ..., T, so in the most general sense we want to find
g̃ = argmin_g E(g) := argmin_g (1/(2T)) ∑_{i=1}^{T} (g(y(ti)) − c·g(z(ti)))².   (4)

In order to minimize this energy function E(g), a single-input single-output multilayered neural network (MLP) is used to parametrize the nonlinearity g. Here we choose one hidden layer of size d. This means that the approximated g̃ can be written as
g̃(t) = w⁽²⁾ᵀ σ̄(w⁽¹⁾ t + b⁽¹⁾) + b⁽²⁾
with weight vectors w⁽¹⁾, w⁽²⁾ ∈ Rᵈ and biases b⁽¹⁾ ∈ Rᵈ, b⁽²⁾ ∈ R. Here σ denotes an activation function, usually the logistic sigmoid σ(t) := (1 + e⁻ᵗ)⁻¹, and we set σ̄ := σ × ... × σ, d times. The MLP weights are restricted in the sense that g̃(0) = 0 and g̃′(0) = 1. This implies b⁽²⁾ = −w⁽²⁾ᵀ σ̄(b⁽¹⁾) and ∑_{i=1}^{d} wᵢ⁽¹⁾ wᵢ⁽²⁾ σ′(bᵢ⁽¹⁾) = 1.

Especially the second normalization is very important for the learning step, since otherwise the weights could all converge to the (valid) zero solution. So the outer bias is not trained by the network; we could fix a second weight in order to guarantee the second condition, but this would result in an unstable quotient calculation. Instead it is preferable to perform network training on a submanifold in the weight space given by the second weight restriction. This results in an additional Lagrange term in the energy function from equation 4:

submanifold <strong>in</strong> the weight space given by the second weight restriction. This<br />

results <strong>in</strong> an additional Lagrange term <strong>in</strong> the energy function from equation 4<br />

Ē(˜g) := 1<br />

2T<br />

i=1<br />

T�<br />

(˜g(y(tj)) − c˜g(z(tj))) 2 �<br />

d�<br />

+ λ<br />

j=1<br />

i=1<br />

w (1)<br />

i w(2) i σ′ (b (1)<br />

1<br />

) − 1<br />

with suitably chosen λ > 0.<br />

Learn<strong>in</strong>g of the weights is performed via backpropagation on this energy<br />

function. The gradient of Ē(˜g) with respect to the weight matrix can be easily<br />

�2<br />

(5)



calculated from the Euclidean gradient of g. For the learning process we further note that all weights wᵢ⁽ʲ⁾ should be kept nonnegative in order to ensure invertibility of g̃.
In order to increase convergence speed, the Euclidean gradient of g should be replaced by the natural gradient [11], which in experiments enhances the algorithm performance in terms of speed by a factor of roughly 10.
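The following is a schematic reimplementation of this constrained MLP fit. It parametrizes g̃ with one hidden layer, fixes the outer bias so that g̃(0) = 0, and penalizes deviations of g̃′(0) from 1 as in equation (5). For brevity it uses plain gradient descent with finite-difference gradients instead of backpropagation and the natural gradient, so it illustrates the objective rather than the exact training procedure; all function and variable names are ours.

```python
import numpy as np

def sigma(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-t))

def g_tilde(t, w1, w2, b1):
    """One-hidden-layer MLP with the outer bias fixed so that g~(0) = 0."""
    b2 = -w2 @ sigma(b1)
    return w2 @ sigma(np.outer(w1, t) + b1[:, None]) + b2

def energy(params, y_vals, z_vals, c, d, lam=10.0):
    """Penalized energy: data term of eq. (4) plus a penalty enforcing g~'(0) = 1."""
    w1, w2, b1 = params[:d], params[d:2 * d], params[2 * d:]
    data = 0.5 * np.mean((g_tilde(y_vals, w1, w2, b1) - c * g_tilde(z_vals, w1, w2, b1)) ** 2)
    s = sigma(b1)
    slope = np.sum(w1 * w2 * s * (1.0 - s))          # g~'(0), since sigma'(t) = s (1 - s)
    return data + lam * (slope - 1.0) ** 2

def fit_postnonlinearity(y_vals, z_vals, c, d=9, lr=0.01, n_iter=5000, seed=0, eps=1e-6):
    """Fit g~ by plain gradient descent with finite-difference gradients (sketch)."""
    rng = np.random.default_rng(seed)
    params = np.concatenate([np.abs(rng.standard_normal(2 * d)) * 0.5,   # w1, w2 >= 0
                             rng.standard_normal(d) * 0.1])              # b1
    for _ in range(n_iter):
        base = energy(params, y_vals, z_vals, c, d)
        grad = np.zeros_like(params)
        for i in range(params.size):
            p = params.copy()
            p[i] += eps
            grad[i] = (energy(p, y_vals, z_vals, c, d) - base) / eps
        params -= lr * grad
        params[:2 * d] = np.maximum(params[:2 * d], 0.0)   # keep weights non-negative
    return params
```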

4 Experiment<br />

The postnonlinear mixture of three sources to two mixtures is considered. 10⁵ samples of artificially generated sources with one non-zero coefficient (drawn uniformly from [−0.5, 0.5]) are used. We refer to figure 2 for a plot of the sources, mixtures and recoveries. The sources were mixed using the postnonlinear mixing model x = f1 × f2(As) with mixing matrix
A = ( 4.3  7.8  0.59 ;  9  6.2  10 )
and postnonlinearities f1(x) = tanh(x) + 0.1x and f2(x) = x. For easier algorithm visualization and evaluation we chose f2 to be linear and did not add any noise.
The MLP-based postnonlinearity detection algorithm from section 3.3 with natural gradient-descent learning, 9 hidden neurons, a learning rate of η = 0.01 and 10⁵ iterations gives a good approximation of the unmixing nonlinearities gi. Linear overcomplete SCA is then applied to g1 × g2(x): for practical reasons (due to approximation errors, the data is not fully linearized), instead of the matrix recovery algorithm from [8] we use a modification of the geometric ICA algorithm [4], which is known to work well in the very sparse one-dimensional case, to get the recovered mixing matrix
Â = ( −0.46  −0.81  −0.069 ;  −0.89  −0.58  −1.0 ),
which except for scaling and permutation coincides well with A. Source recovery then yields estimated sources whose (normalized) signal-to-noise ratios (SNRs) with the original sources are high, at 26, 71 and 46 dB, respectively.
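One common way to compute such normalized SNR values, removing the scaling indeterminacy by a least-squares fit of the estimate to the true source, is sketched below; whether the quoted figures were obtained with exactly this normalization is an assumption.

```python
import numpy as np

def snr_db(s_true, s_est):
    """Normalized SNR in dB between a true and an estimated source signal.

    The scaling indeterminacy is removed by least-squares fitting of the
    estimate to the true signal before computing the error power.
    """
    alpha = np.dot(s_est, s_true) / np.dot(s_est, s_est)
    noise = s_true - alpha * s_est
    return 10.0 * np.log10(np.dot(s_true, s_true) / np.dot(noise, noise))
```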

References<br />

1. Cichocki, A., Amari, S.: Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g. John Wiley<br />

& Sons (2002)<br />

2. Hyvär<strong>in</strong>en, A., Karhunen, J., Oja, E.: <strong>Independent</strong> component analysis. John<br />

Wiley & Sons (2001)<br />

3. Lee, T., Lewicki, M., Girolami, M., Sejnowski, T.: Bl<strong>in</strong>d source separation of more<br />

sources than mixtures us<strong>in</strong>g overcomplete representations. IEEE Signal Process<strong>in</strong>g<br />

Letters 6 (1999) 87–90<br />

4. Theis, F., Lang, E., Puntonet, C.: A geometric algorithm for overcomplete l<strong>in</strong>ear<br />

ICA. Neurocomput<strong>in</strong>g 56 (2004) 381–398<br />

5. Zibulevsky, M., Pearlmutter, B.: Bl<strong>in</strong>d source separation by sparse decomposition<br />

<strong>in</strong> a signal dictionary. Neural Computation 13 (2001) 863–882<br />

6. Eriksson, J., Koivunen, V.: Identifiability and separability of l<strong>in</strong>ear ICA models<br />

revisited. In: Proc. of ICA 2003. (2003) 23–27



Fig. 2. Example: (a) shows the 1-sparse source signals, and (b) the postnonlinear overcomplete mixtures. The original source directions can be clearly seen in the structure of the mixture scatterplot (c). The crosses and stars indicate the found interpolation points used for approximating the separating nonlinearities, generated by geometrical preprocessing. Now, according to theorem 3, the sources can be recovered uniquely, figure (d), except for permutation and scaling.

7. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM<br />

J. Sci. Comput. 20 (1998) 33–61<br />

8. Georgiev, P., Theis, F., Cichocki, A.: Bl<strong>in</strong>d source separation and sparse component<br />

analysis of overcomplete mixtures. In: Proc. of ICASSP 2004, Montreal, Canada<br />

(2004)<br />

9. Taleb, A., Jutten, C.: Indeterm<strong>in</strong>acy and identifiability of bl<strong>in</strong>d identification.<br />

IEEE Transactions on Signal Process<strong>in</strong>g 47 (1999) 2807–2820<br />

10. Babaie-Zadeh, M., Jutten, C., Nayebi, K.: A geometric approach for separat<strong>in</strong>g<br />

post non-l<strong>in</strong>ear mixtures. In: Proc. of EUSIPCO ’02. Volume II., Toulouse, France<br />

(2002) 11–14<br />

11. Amari, S., Park, H., Fukumizu, K.: Adaptive method of realiz<strong>in</strong>g gradient learn<strong>in</strong>g<br />

for multilayer perceptrons. Neural Computation 12 (2000) 1399–1409




Chapter 13<br />

Proc. EUSIPCO 2005<br />

Paper F.J. Theis, K. Stadlthanner, and T. Tanaka. First results on uniqueness of<br />

sparse non-negative matrix factorization. In Proc. EUSIPCO 2005, Antalya,<br />

Turkey, 2005<br />

Reference (Theis et al., 2005c)<br />

Summary <strong>in</strong> section 1.4.2<br />

185



FIRST RESULTS ON UNIQUENESS OF SPARSE NON-NEGATIVE MATRIX<br />

FACTORIZATION<br />

Fabian J. Theis, Kurt Stadlthanner and Toshihisa Tanaka ∗<br />

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

phone: +49 941 943 2924, fax: +49 941 943 2479, email: fabian@theis.name<br />

∗ Department of Electrical and Electronic Eng<strong>in</strong>eer<strong>in</strong>g, Tokyo University of Agriculture and Technology (TUAT)<br />

2-24-16, Nakacho, Koganei-shi, Tokyo 184-8588 Japan and<br />

ABSP Laboratory, BSI, RIKEN, 2-1, Hirosawa, Wako-shi, Saitama 351-0198 Japan<br />

ABSTRACT<br />

Sparse non-negative matrix factorization (sNMF) allows for the decomposition<br />

of a given data set <strong>in</strong>to a mix<strong>in</strong>g matrix and a feature<br />

data set, which are both non-negative and fulfill certa<strong>in</strong> sparsity conditions.<br />

In this paper it is shown that the employed projection step<br />

proposed by Hoyer has a unique solution, and that it <strong>in</strong>deed f<strong>in</strong>ds<br />

this solution. Then <strong>in</strong>determ<strong>in</strong>acies of the sNMF model are identified<br />

and first uniqueness results are presented, both theoretically<br />

and experimentally.<br />

1. INTRODUCTION<br />

Non-negative matrix factorization (NMF) describes a promis<strong>in</strong>g<br />

new technique for decompos<strong>in</strong>g non-negative data sets <strong>in</strong>to a product<br />

of two smaller matrices thus captur<strong>in</strong>g the underly<strong>in</strong>g structure<br />

[3]. In applications it turns out that additional constra<strong>in</strong>ts like for<br />

example sparsity enhance the recoveries; one promis<strong>in</strong>g variant of<br />

such a sparse NMF algorithm has recently been proposed by Hoyer<br />

[2]. It consists of the common NMF update steps, but at each step<br />

a sparsity constra<strong>in</strong>t is posed. If factorization algorithms are to produce<br />

reliable results, their <strong>in</strong>determ<strong>in</strong>acies have to be known and<br />

uniqueness (except for the <strong>in</strong>determ<strong>in</strong>acies) has to be shown — so<br />

far only restricted and quite disappo<strong>in</strong>t<strong>in</strong>g results for NMF [1] and<br />

none for sNMF are known.<br />

In this paper we first present a novel uniqueness result show<strong>in</strong>g<br />

that the projection step of sparse NMF always possesses a unique<br />

solution (except for a set of measure zero), theorems 2.2 and 2.6.<br />

We then prove that Hoyer’s algorithm <strong>in</strong>deed detects these solutions,<br />

theorem 2.8. In section 3 after shortly repeat<strong>in</strong>g Hoyer’s<br />

sNMF algorithm, we analyze its <strong>in</strong>determ<strong>in</strong>acies and show uniqueness<br />

<strong>in</strong> some restricted cases, theorem 3.3. The result is both new<br />

and astonish<strong>in</strong>g, because the set of <strong>in</strong>determ<strong>in</strong>acies is much smaller<br />

than the one of NMF, namely of measure zero.<br />

2. SPARSE PROJECTION<br />

The sparse NMF algorithm enforces sparseness by using a projection step as follows: Given x ∈ Rⁿ and fixed λ1, λ2 > 0, find s such that
s = argmin_{‖s‖1=λ1, ‖s‖2=λ2, s≥0} ‖x − s‖2.   (1)
Here ‖s‖p := (∑_{i=1}^{n} |si|^p)^{1/p} denotes the p-norm; in the following we often omit the index in the case p = 2. Furthermore s ≥ 0 is defined as si ≥ 0 for all i = 1, ..., n, so s is to be non-negative. Our goal is to show that such a projection always exists and is unique for almost all x. This problem can be generalized by replacing the 1-norm by an arbitrary p-norm; however, the (Euclidean) 2-norm has to be used, as can be seen in the proof later. Other possible generalizations include projections in infinite-dimensional Hilbert spaces.
First note that the two norms are equivalent, i.e. induce the same topology; indeed ‖s‖2 ≤ ‖s‖1 ≤ √n ‖s‖2 for all s ∈ Rⁿ, as can be easily shown. So a necessary condition for any s to satisfy equation (1) is λ2 ≤ λ1 ≤ √n λ2.
We want to solve problem (1) by projecting x onto
M := {s | ‖s‖1 = λ1} ∩ {s | ‖s‖2 = λ2} ∩ {s ≥ 0}.   (2)

In order to solve equation (1), x has to be projected onto a point adjacent to it in M:
Definition 2.1. A point p ∈ M ⊂ Rⁿ is called adjacent to x ∈ Rⁿ in M, in symbols p ⊳_M x or shorter p ⊳ x, if ‖x − p‖2 ≤ ‖x − q‖2 for all q ∈ M.

In the following we will study in which cases this is possible, and which conditions are needed to guarantee that this projection is even unique.

2.1 Existence<br />

Assume that x lies <strong>in</strong> the closure of M, but not <strong>in</strong> M. Obviously<br />

there exists no p⊳x as x ‘touches’ M without be<strong>in</strong>g an element of<br />

it. In order to avoid these exceptions, it is enough to assume that M<br />

is closed:<br />

Theorem 2.2 (Existence). If M is closed and nonempty, then for every x ∈ Rⁿ there exists a p ∈ M with p ⊳ x.
Proof. Let x ∈ Rⁿ be fixed. Without loss of generality (by taking intersections with a large enough ball) we can assume that M is compact. Then f : M → R, p ↦ ‖x − p‖ is continuous and therefore has a minimum p0, so p0 ⊳ x.

2.2 Uniqueness<br />

Def<strong>in</strong>ition 2.3. Let X (M) := {x ∈ R n |there exists more than one<br />

po<strong>in</strong>t adjacent to x <strong>in</strong> M} = {x ∈ R n |#{p ∈ M|p⊳x} > 1} denote<br />

the exception set of M.<br />

In other words, the exception set conta<strong>in</strong>s the set of po<strong>in</strong>ts from<br />

which we can’t uniquely project. Our goal is to show that this set<br />

vanishes or is at least very small. Figure 1 shows the exception set<br />

of two different sets.<br />

Note that if x ∈ M then x ⊳ x, and x is the only po<strong>in</strong>t with<br />

that property. So M ∩X (M) = ∅. Obviously the exception set of<br />

an aff<strong>in</strong>e l<strong>in</strong>ear hyperspace is empty. Indeed, we can prove more<br />

generally:<br />

Lemma 2.4. Let M ⊂ R n be convex. Then X (M) = ∅.<br />

For the proof we need the follow<strong>in</strong>g simple lemma, which only<br />

works for the 2-norm as it uses the scalar product.<br />

Lemma 2.5. Let a, b ∈ Rⁿ such that ‖a + b‖2 = ‖a‖2 + ‖b‖2. Then a and b are collinear.
Proof. By taking squares we get ‖a + b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖², so
‖a‖² + 2⟨a, b⟩ + ‖b‖² = ‖a‖² + 2‖a‖‖b‖ + ‖b‖²


Figure 1: Two examples of exception sets: (a) the exception set of two points, (b) the exception set of a sector.

if ⟨·, ·⟩ denotes the (symmetric) scalar product. Hence ⟨a, b⟩ = ‖a‖‖b‖ and a and b are collinear according to the Schwarz inequality.
Proof of lemma 2.4. Assume X(M) ≠ ∅. Then let x ∈ X(M) and p1 ≠ p2 ∈ M such that pi ⊳ x. By assumption q := ½(p1 + p2) ∈ M. But
‖x − p1‖ ≤ ‖x − q‖ ≤ ½‖x − p1‖ + ½‖x − p2‖ = ‖x − p1‖
because both pi are adjacent to x. Therefore ‖x − q‖ = ‖½(x − p1)‖ + ‖½(x − p2)‖, and application of lemma 2.5 shows that x − p1 = α(x − p2). Taking norms (and using the fact that q ≠ x) shows that α = 1 and hence p1 = p2, which is a contradiction.

In a similar manner, it is easy to show for example that the<br />

exception set of the sphere consists only of its center, or to calculate<br />

the exception sets of the sets M from figure 1. Another property<br />

of the exception set is that it behaves nicely under non-degenerate<br />

aff<strong>in</strong>e l<strong>in</strong>ear transformation.<br />

Hence in general, we cannot expect X(M) to vanish altogether. However, we can show that in practical applications we can easily neglect it:

Theorem 2.6 (Uniqueness). vol(X(M)) = 0.

This means that the Lebesgue measure of the exception set is zero, i.e. that it does not contain any open ball. In other words, if x is drawn from a continuous probability distribution on R^n, then x ∈ X(M) with probability 0. We simplify the proof by introducing the following lemma:

Lemma 2.7. Let x ∈ X(M) with p ⊳ x, p′ ⊳ x and p ≠ p′. Assume y lies on the line between x and p. Then y ∉ X(M).

Proof. So y = αx + (1 − α)p with α ∈ (0, 1). Note that then also p ⊳ y — otherwise we would have another q ⊳ y with ‖q − y‖ < ‖p − y‖. But then ‖q − x‖ ≤ ‖q − y‖ + ‖y − x‖ < ‖p − y‖ + ‖y − x‖ = ‖p − x‖, which contradicts the assumption.

Now assume that y ∈ X(M). Then there exists p″ ⊳ y with p″ ≠ p. But ‖p″ − x‖ ≤ ‖p″ − y‖ + ‖y − x‖ = ‖p − y‖ + ‖y − x‖ = ‖p − x‖. Then p ⊳ x induces ‖p″ − x‖ = ‖p − x‖. So ‖p″ − x‖ = ‖p″ − y‖ + ‖y − x‖.

Application of lemma 2.5 then yields p″ − y = α(y − x), and hence p″ − y = β(p − y). Taking norms (and using p ⊳ x) shows that β = 1 and hence p = p″, which is a contradiction.

Proof of theorem 2.6. Assume there exists an open set U ⊂ X(M), and let x ∈ U. Then choose p ≠ p′ ∈ M with p ⊳ x, p′ ⊳ x. But

{αx + (1 − α)p | α ∈ (0, 1)} ∩ U ≠ ∅,

which contradicts lemma 2.7.

2.3 Algorithm

From here on, let M be defined by equation (2). In [2], Hoyer proposes algorithm 1 to project a given vector x onto p ∈ M such that p ⊳ x (we added a slight simplification by not setting all negative values of s to zero but only a single one in each step). The algorithm iteratively detects p by first satisfying the 1-norm condition (line 1) and then the 2-norm condition (line 3). The algorithm terminates if the constructed vector is already positive; otherwise a negative coordinate is selected, set to zero (line 4), and the search is continued in R^{n−1}.

Algorithm 1: Sparse projection
Input: vector x ∈ R^n, norm conditions λ1 and λ2
Output: closest non-negative s with ‖s‖_i = λ_i
1  Set r ← x + ((λ1 − ‖x‖1)/n) e with e = (1, ..., 1)⊤ ∈ R^n.
2  Set m ← (λ1/n) e.
3  Set s ← m + α(r − m) with α > 0 such that ‖s‖2 = λ2.
   if there exists j with s_j < 0 then
4    Fix s_j ← 0.
5    Remove the j-th coordinate of x.
6    Decrease the dimension n ← n − 1.
7    goto 1.
   end
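The following small Python sketch mirrors algorithm 1 (our own illustration, not code from the paper; the helper name sparse_projection and the assumption that λ2 ≤ λ1 ≤ √n λ2 are ours):

import numpy as np

def sparse_projection(x, l1, l2):
    """Sketch of algorithm 1: closest non-negative s with ||s||_1 = l1 and ||s||_2 = l2."""
    x = np.asarray(x, dtype=float)
    s = np.zeros(x.size)
    mask = np.ones(x.size, dtype=bool)        # coordinates not yet fixed to zero
    for _ in range(x.size):
        xm, n = x[mask], mask.sum()
        r = xm + (l1 - xm.sum()) / n          # line 1: shift onto the hyperplane sum(r) = l1
        m = np.full(n, l1 / n)                # line 2: midpoint of the hyperplane piece
        d = r - m
        if np.linalg.norm(d) < 1e-12:         # degenerate case: r already equals m
            sm = m
        else:                                 # line 3: walk from m towards r until ||s||_2 = l2
            alpha = np.sqrt(max(l2**2 - l1**2 / n, 0.0)) / np.linalg.norm(d)
            sm = m + alpha * d
        if (sm >= 0).all():                   # feasible, hence equal to the projection p
            s[mask] = sm
            return s
        j = np.where(mask)[0][np.argmin(sm)]  # lines 4-7: zero one negative coordinate,
        mask[j] = False                       # drop it and continue in dimension n-1
    s[mask] = np.clip(sm, 0.0, None)
    return s

The step size alpha uses the fact that ⟨m, r − m⟩ = 0, so ‖s‖² = ‖m‖² + α²‖r − m‖² with ‖m‖² = λ1²/n.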

The projection algorithm terminates after at most n − 1 iterations. However, it is not obvious that it indeed detects p. In the following we will prove this given that x ∉ X(M) — of course we have to exclude non-uniqueness points. The idea of the proof is to show that in each step the new estimate has p as closest point in M.

Theorem 2.8 (Sparse projection). Given x ≥ 0 such that x ∉ X(M). Let p ∈ M with p ⊳ x. Furthermore assume that r and s are constructed by lines 1 and 3 of algorithm 1. Then:
(i) Σ r_i = λ1, p ⊳ r and r ∉ X(M).
(ii) Σ s_i = λ1, ‖s‖2 = λ2, p ⊳ s and s ∉ X(M).
(iii) If s_j < 0 then p_j = 0.
(iv) Define u := s but set u_j = 0. Then p ⊳ u and u ∉ X(M).

This theorem shows that if s ≥ 0 then already s ∈ M and p ⊳ s (ii), so s = p. If s_j < 0 then it is enough to set s_j := 0 (because p_j = 0 by (iii)) and continue the search in one dimension lower (iv).

Proof. Let H := {x ∈ R^n | Σ x_i = λ1} denote the affine hyperplane given by the 1-norm condition. Note that M ⊂ H.

(i) By construction r ∈ H. Furthermore e ⊥ H, so r is the orthogonal projection of x onto H. Let q ∈ M be arbitrary. We then get ‖q − x‖² = ‖q − r‖² + ‖r − x‖². By definition ‖p − x‖ ≤ ‖q − x‖, so ‖p − r‖² = ‖p − x‖² − ‖r − x‖² ≤ ‖q − x‖² − ‖r − x‖² = ‖q − r‖², and therefore p ⊳ r. Furthermore r ∉ X(M), because if q ∈ M with q ⊳ r, then ‖q − r‖ = ‖p − r‖. Then by the above also ‖q − x‖ = ‖p − x‖, hence q = p (because x ∉ X(M)).

(ii) First note that s is an affine combination of m and r, and both lie in H, so also s ∈ H, i.e. Σ s_i = λ1. Furthermore, by construction ‖s‖ = λ2. Now let q ∈ M. For p ⊳ s to hold, we have to show that ‖p − s‖ ≤ ‖q − s‖. This follows (see (i)) if we can show

‖q − r‖² = ‖s − r‖² + (1/α0)‖q − s‖².   (3)

We can prove this equation as follows: since ‖q‖2 = ‖s‖2 = λ2 and Σ q_i = Σ s_i = λ1, we have ‖q − m‖² = λ2² − λ1²/n = ‖s − m‖². Moreover ‖q − m‖² = ‖q − s‖² + ‖s − m‖² + 2⟨q − s, s − m⟩, hence ‖q − s‖² = −2⟨q − s, s − m⟩ = −2(α0/(α0 − 1))⟨q − s, s − r⟩, where we have used s − m = α0(r − m), i.e. m = (s − α0 r)/(1 − α0), so s − m = (α0/(α0 − 1))(s − r).

Using the above, we can now calculate

‖q − r‖² = ‖q − s‖² + ‖s − r‖² + 2⟨q − s, s − r⟩
         = ‖q − s‖² + ‖s − r‖² + ((1 − α0)/α0)‖q − s‖²
         = ‖s − r‖² + (1/α0)‖q − s‖².

Similarly, from formula (3) we get s ∉ X(M), because if there exists q ∈ M with ‖q − s‖ = ‖p − s‖, then also ‖q − r‖ = ‖p − r‖, hence q = p.

(iii) Assume s_j < 0. First note that m does not lie on the line βs + (1 − β)p (in other words m ≠ (p + s)/2), because otherwise, due to symmetry, there would be at least two points in M closest to s, but s ∉ X(M). Now assume the claim is wrong; then p_j > 0 (because p ≥ 0). Define g : [0, 1] → H by g(β) := m + α_β(βs + (1 − β)p − m), where α_β > 0 has been chosen such that ‖g(β)‖ = λ2. The curve g describes the shortest arc in H ∩ {‖q‖ = λ2} connecting p to s. We notice that p_j > 0, s_j < 0 and g is continuous. Hence determine the (unique) β0 such that q := g(β0) has the property q_j = 0. By construction q ∈ M, but q lies closer to s than p (because ‖g(β) − s‖² = 2⟨g(β) − m, m − s⟩ + 2‖s − m‖² is decreasing with increasing β). But this is a contradiction to p ⊳ s.

(iv) The vector u is defined by u_i = s_i if i ≠ j and u_j = 0, i.e. u is the orthogonal projection of s onto the coordinate hyperplane given by x_j = 0. So we calculate ‖p − s‖² = ‖p − u‖² + ‖u − s‖², and the claim follows directly as in (i).

3. MATRIX FACTORIZATION

Matrix factorization models have already been used successfully in many applications when it comes to finding suitable data representations. Basically, a given m × T data matrix X is factorized into an m × n matrix W and an n × T matrix H,

X = WH,   (4)

where m ≤ n.

3.1 Sparse non-negative matrix factorization

In contrast to other matrix factorization models such as principal or independent component analysis, non-negative matrix factorization (NMF) strictly requires both matrices W and H to have non-negative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [3].

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [2] proposed a modification of the NMF model to include sparseness: he minimizes the deviation from (4) under the constraint of fixed sparseness of both W and H. Here, using a ratio of 1- and 2-norms of x ∈ R^n \ {0}, the sparseness is measured by σ(x) := (√n − ‖x‖1/‖x‖2)/(√n − 1). So σ(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
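For reference, this measure is easy to compute; the helper below is our own small illustration (not from the paper) and assumes n ≥ 2 and x ≠ 0:

import numpy as np

def hoyer_sparseness(x):
    """Hoyer's sparseness sigma(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

For example, hoyer_sparseness([1, 0, 0, 0]) returns 1.0, while hoyer_sparseness([1, 1, 1, 1]) returns 0.0.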

Formally, sparse NMF (sNMF) [2] can be defined as the task of finding, for a given data matrix X ≥ 0, matrices W, H ≥ 0 with

X = WH   subject to   σ(W∗i) = σ_W and σ(H_i∗) = σ_H.   (5)

Here σ_W, σ_H ∈ [0, 1] denote fixed constants describing the sparseness of the columns of W respectively the rows of H. Usually, the linear model in NMF is assumed to hold only approximately, hence the above formulation of sNMF represents the limit case of perfect factorization. sNMF is summarized by algorithm 2, which uses algorithm 1 separately on each column respectively row for the sparse projection.

Algorithm 2: Sparse non-negative matrix factorization
Input: observation data matrix X
Output: decomposition WH of X fulfilling the given sparseness constraints σ_H and σ_W
1  Initialize W and H to random non-negative matrices.
2  Project the rows of H and the columns of W such that they meet the sparseness constraints σ_H and σ_W respectively.
   repeat
3    Set H ← H − µ_H W⊤(WH − X).
4    Project the rows of H such that they meet the sparseness constraint σ_H.
5    Set W ← W − µ_W (WH − X)H⊤.
6    Project the columns of W such that they meet the sparseness constraint σ_W.
   until convergence
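A rough Python sketch of this alternating scheme (ours, reusing the sparse_projection and hoyer_sparseness helpers sketched above; the constant step size mu and the conversion from a sparseness target to a 1-norm target are simplifying assumptions):

import numpy as np

def project_rows(H, sigma):
    """Project each row of H onto the closest non-negative vector with Hoyer sparseness sigma,
    keeping the row's current 2-norm (uses the sparse_projection sketch of algorithm 1)."""
    n = H.shape[1]
    out = np.empty_like(H)
    for i, h in enumerate(H):
        l2 = np.linalg.norm(h) or 1.0
        l1 = l2 * (np.sqrt(n) - sigma * (np.sqrt(n) - 1))   # invert the definition of sigma
        out[i] = sparse_projection(h, l1, l2)
    return out

def sparse_nmf(X, n_comp, sigma_W, sigma_H, mu=1e-3, n_iter=500, rng=np.random):
    """Sketch of algorithm 2: projected gradient descent on ||X - WH||^2."""
    m, T = X.shape
    W = project_rows(np.abs(rng.standard_normal((m, n_comp))).T, sigma_W).T   # columns of W
    H = project_rows(np.abs(rng.standard_normal((n_comp, T))), sigma_H)       # rows of H
    for _ in range(n_iter):
        H = project_rows(H - mu * W.T @ (W @ H - X), sigma_H)
        W = project_rows((W - mu * (W @ H - X) @ H.T).T, sigma_W).T
    return W, H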

3.2 Indeterminacies

Obvious indeterminacies of model (5) are permutation and positive scaling of the columns of W (and correspondingly of the rows of H), because if P denotes a permutation matrix and L a positive scaling matrix, then X = WH = (WP⁻¹L⁻¹)(LPH), and the conditions of positivity and sparseness are invariant under scaling by a positive number. Another, maybe not as obvious, indeterminacy comes into play due to the sparseness assumption.

Definition 3.1. The n × T matrix H is said to be degenerate if there exist v ∈ R^n, v > 0 and λ_t ≥ 0 such that H∗t = λ_t v for all t.

Note that in this case all rows h_i⊤ := H_i∗ of H have the same sparseness σ(h_i) = (√T − ‖λ‖1/‖λ‖2)/(√T − 1), independent of v, where λ := (λ1, ..., λ_T)⊤. Furthermore, if W is any matrix with positive entries, then Wv > 0 and WH∗t = λ_t(Wv), so the signals H and their transformations WH have rows of equal sparseness. Hence, if the sources are degenerate, we get an indeterminacy of sNMF: Let W, W̃ be non-negative such that W̃⁻¹Wv > 0 (for example W > 0 arbitrary and W̃ := I), and let H be degenerate. Then H̃ := W̃⁻¹WH has the same sparseness as H and WH = W̃H̃, but the mixing matrices W and W̃ do not coincide up to permutation and scaling.

3.3 Uniqueness

In this section we will discuss the uniqueness of sNMF solutions, i.e. we will formulate conditions under which the set of solutions is satisfactorily small. We will see that in the perfect factorization case, it is enough to put the sparseness condition either onto W or onto H — we choose H in the following to match the picture of sources with a given sparseness.

Assume that two solutions (W, H) and (W̃, H̃) of the sNMF model (5) are given with W and W̃ of full rank; then

WH = W̃H̃,   (6)

and σ(H) = σ(H̃). As before let h_i = H_i∗⊤ respectively h̃_i = H̃_i∗⊤ denote the rows of the source matrices. In order to avoid the scaling indeterminacy, we can set the source scales to a given value, so we may assume

‖h_i‖2 = ‖h̃_i‖2 = 1   (7)

for all i. Hence, the sparseness of the rows is already fully determined by their 1-norms, and

‖h_i‖1 = ‖h̃_i‖1.   (8)

We can then show the following lemma (even without positive mixing matrices).



Lemma 3.2. Let W, W̃ ∈ R^{m×n} and H, H̃ ∈ R^{n×T}, H, H̃ ≥ 0, such that equations (6–8) hold. Then for all i ∈ {1, ..., m}
(i) Σ_j w_ij = Σ_j w̃_ij
(ii) Σ_j




Chapter 14

Proc. ICASSP 2006

Paper F.J. Theis and T. Tanaka. Sparseness by iterative projections onto spheres. In Proc. ICASSP 2006, Toulouse, France, 2006

Reference (Theis and Tanaka, 2006)

Summary in section 1.4.2



SPARSENESS BY ITERATIVE PROJECTIONS ONTO SPHERES

Fabian J. Theis∗
Inst. of Biophysics, Univ. of Regensburg
93040 Regensburg, Germany

Toshihisa Tanaka∗
Dept. EEE, Tokyo Univ. of Agri. and Tech.
Tokyo 184-8588, Japan

∗ Partial financial support by the JSPS (PE 05543) and the DFG (GRK 638) is acknowledged.

ABSTRACT

Many interesting signals share the property of being sparsely active. The search for such sparse components within a data set commonly involves a linear or nonlinear projection step in order to fulfill the sparseness constraints. In addition to the proximity measure used for the projection, the result of course is also intimately connected with the actual definition of the sparseness criterion. In this work, we introduce a novel sparseness measure and apply it to the problem of finding a sparse projection of a given signal. Here, sparseness is defined as the fixed ratio of p- over 2-norm, and existence and uniqueness of the projection holds. This framework extends previous work by Hoyer in the case of p = 1, where it is easy to give a deterministic, more or less closed-form solution. This is not possible for p ≠ 1, so we introduce an algorithm based on alternating projections onto spheres (POSH), which is similar to the projection onto convex sets (POCS). Although the assumption of convexity does not hold in our setting, we observe not only convergence of the algorithm, but also convergence to the correct minimal distance solution. Indications for a proof of this surprising property are given. Simulations confirm these results.

1. INTRODUCTION

Sparseness is an important property of many natural signals, and various definitions exist. Intuitively, a signal x ∈ R^n increases in sparseness with the increasing number of zeros; this is often measured by the 0-(pseudo)-norm ‖x‖0 := |{i | x_i ≠ 0}|, counting the number of non-zero entries of x. It is a pseudo-norm because ‖αx‖0 = |α|‖x‖0 does not necessarily hold; indeed ‖αx‖0 = ‖x‖0 if α ≠ 0, so ‖·‖0 is scale-invariant.

A typical problem in the field is the search for sparse instances or representations of a data set. Using the above 0-pseudo-norm as sparseness measure quickly turns out to be both theoretically and algorithmically unfeasible. The former simply follows because ‖·‖0 is discrete, so the indeterminacies of the problem can be expected to be very high, and the latter because optimization of such a discrete function is a combinatorial problem and indeed turns out to be NP-complete. Hence, this sparseness measure is commonly approximated by some continuous measure, e.g. by replacing it by the p-norm ‖x‖p := (Σ_{i=1}^n |x_i|^p)^{1/p} for p ∈ R₊. As lim_{p→0+} ‖x‖_p^p = ‖x‖0, this can be interpreted as a possible approximation. This, together with extensions to noisy situations, can be used for measuring sparseness, and the connection with ‖·‖0, especially in the case of p = 1, has been intensively studied [1].

Often, we are not interested in the scale of the signals, so ideally the sparseness measure should be independent of the scale — which


is the case for the 0-pseudo-norm, but not for the p-norms. In order to guarantee scaling invariance, some normalization has to be applied in the latter case, and a possible solution is the measure

σ_p(x) := ‖x‖p/‖x‖2   (1)

for x ∈ R^n \ {0} and p > 0. Then σ_p(αx) = σ_p(x) for α ≠ 0; moreover, the sparser x, the smaller σ_p(x). Indeed, it can still be interpreted as an approximation of the 0-pseudo-norm in the sense that it is scale-invariant and that lim_{p→0+} σ_p(x)^p = ‖x‖0. Altogether we infer that by minimizing σ_p(x) under some constraint, we can find a sparse representation of x. Hoyer [2] noticed this in the important case of p = 1; he defined a normalized sparseness measure by (√n − σ_1(x))/(√n − 1), which lies in [0, 1] and is maximal if x contains n − 1 zeros and minimal if the absolute values of all coefficients of x coincide.

Little attention has been paid to finding projections in the case of p ≠ 1, which is particularly important for p → 0 as a better approximation of ‖·‖0. Hence, the goal of this manuscript is to explore the general notion of sparseness in the sense of equation (1) and to construct algorithms to project a vector onto its closest vector of a given sparseness.

2. EUCLIDEAN PROJECTION

Let M ⊂ R^n be an arbitrary, non-empty set. A vector y ∈ M is called the Euclidean projection of x ∈ R^n in M, in symbols y ⊳_M x, if ‖x − y‖2 ≤ ‖x − z‖2 for all z ∈ M.

2.1. Existence and uniqueness

We review conditions [3] for existence and uniqueness of the Euclidean projection. For this, we need the following notion: Let X(M) := {x ∈ R^n | there exists more than one point adjacent to x in M} = {x ∈ R^n | #{y ∈ M | y ⊳_M x} > 1} denote the exception set of M.

Theorem 2.1 (Euclidean projection).
i. If M is closed and nonempty, then the Euclidean projection onto M exists, that is, for every x ∈ R^n there exists a y ∈ M with y ⊳_M x.
ii. The Euclidean projection onto M is unique from almost all points in R^n, i.e. vol(X(M)) = 0.

Proof. See [3], theorems 2.2 and 2.6.

So we can always project a vector x ∈ R^n onto a closed set M, and this projection will be unique almost everywhere. In this case, we denote the projection by π_M(x), or π(x) for short. Indeed, in the case of the p-spheres S_p^{n−1}, the exception set consists of a single point, X(S_p^{n−1}) = {0} if p ≥ 2, hence π_{S_p^{n−1}} is well-defined on R^n \ {0}. If p …


[Fig. 1. Constrained gradient t on the p-sphere, given by the projection of the unconstrained gradient ∇‖·−x‖²₂ onto the tangent space that is orthogonal to ∇‖·‖_p^p; see equation (6).]

2.2. Projection onto a p-sphere

Let S_p^{n−1} := {x ∈ R^n | ‖x‖_p = 1} denote the (n−1)-dimensional sphere with respect to the p-norm (p > 0). A scaled version of this unit sphere is given by cS_p^{n−1} := {x ∈ R^n | ‖x‖_p = c}. The spheres are smooth C¹-submanifolds of R^n for p ≥ 2; for p < 2 this need not hold.

Now, in the case p = 2, the projection is simply given by

π_{S_2^{n−1}}(x) = x/‖x‖2.   (2)

In the case p = 1, the sphere consists of a union of hyperplanes orthogonal to (±1, ..., ±1). Considering only the first quadrant (i.e. x_i > 0), this means that π_{S_1^{n−1}}(x) is given by the projection onto the hyperplane H := {x ∈ R^n | ⟨x, e⟩ = n^{−1/2}} followed by setting resulting negative coordinates to 0; here e := n^{−1/2}(1, ..., 1). So with x₊ := x if x ≥ 0 and 0 otherwise (componentwise), we get

π_{S_1^{n−1}}(x) = (x + (n^{−1/2} − ⟨x, e⟩)e)₊.   (3)
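The two closed-form cases translate directly into code; the helpers below are our own small sketch mirroring equations (2) and (3) literally (not code from the paper):

import numpy as np

def project_sphere_l2(x):
    """Equation (2): Euclidean projection onto the unit 2-sphere."""
    return x / np.linalg.norm(x)

def project_sphere_l1(x):
    """Equation (3), stated for x in the first quadrant: project onto the
    hyperplane <x, e> = n**-0.5 and clip negative coordinates to zero."""
    n = x.size
    e = np.full(n, n ** -0.5)
    return np.clip(x + (n ** -0.5 - x @ e) * e, 0.0, None)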

In the case of arbitrary p > 0, the projection is given by the unique solution of

π_{S_p^{n−1}}(x) = argmin_{y ∈ S_p^{n−1}} ‖x − y‖²₂.   (4)

Unfortunately, no closed-form solution exists in the general case, so we have to numerically determine the solution. We have experimented with a) explicit Lagrange multiplier calculation and minimization, b) constrained gradient descent and c) a constrained fixed-point algorithm (best). Ignoring the singular points at the coordinate hyperplanes, let us first assume that all x_i > 0. Then at a regular solution y of equation (4), the gradient of the function to be minimized is parallel to the gradient of the constraint, i.e. y − x = λ ∇‖·‖_p^p |_y for some Lagrange multiplier λ ∈ R, which can be calculated from the additional constraint equation ‖y‖_p^p = 1. Using the notation y^{⊙p} := (y_1^p, ..., y_n^p)⊤ for the componentwise exponentiation, we therefore get

y − x = λp y^{⊙(p−1)}   and   Σ_i y_i^p = 1.   (5)

Algorithm 1: Projection onto S_p^{n−1} by constrained gradient descent. Commonly, the iteration is stopped after the update stepsize lies below some given threshold.
Input: vector x ∈ R^n, learning rate η(i)
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1  Initialize y ∈ S_p^{n−1} randomly.
   for i ← 1, 2, ... do
2    df ← y − x,  dg ← p sgn(y)|y|^{⊙(p−1)}
3    t ← df − (df⊤dg) dg/(dg⊤dg)
4    y ← y − η(i) t
5    y ← y/‖y‖_p
   end
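A runnable sketch of this gradient-descent projection (our own code; the constant learning rate eta and the stopping rule are simplifying assumptions):

import numpy as np

def project_sphere_p_gd(x, p, eta=0.1, n_iter=1000, tol=1e-6, rng=np.random):
    """Sketch of algorithm 1: constrained gradient descent onto the unit p-sphere."""
    y = rng.standard_normal(x.size)
    y /= np.linalg.norm(y, ord=p)                    # line 1: start on the sphere
    for _ in range(n_iter):
        df = y - x                                   # line 2: gradient of the objective
        dg = p * np.sign(y) * np.abs(y) ** (p - 1)   #         gradient of the constraint
        t = df - (df @ dg) * dg / (dg @ dg)          # line 3: tangential component, eq. (6)
        y = y - eta * t                              # line 4: descent step
        y /= np.linalg.norm(y, ord=p)                # line 5: renormalize onto the sphere
        if eta * np.linalg.norm(t) < tol:
            break
    return y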

For p ∉ {1, 2}, these equations cannot be solved in closed form, hence we propose an alternative approach to solving the constrained minimization (4). The goal is to minimize f(y) := ‖y − x‖²₂ under the constraint g(y) := ‖y‖_p^p = 1. This can for example be achieved by gradient-descent methods, taking into account that the gradient has to be calculated on the submanifold given by the S_p^{n−1} constraint; see figure 1 for an illustration. The projection of the gradient ∇f onto the tangent space of S_p^{n−1} at y can easily be calculated as

t = ∇f − ⟨∇f, ∇g⟩∇g/‖∇g‖²₂.   (6)

Here, the explicit gradients are given by ∇f(y) = y − x and ∇g(y) = p sgn(y)|y|^{⊙(p−1)}, where sgn(y) denotes the vector of the componentwise signs of y, and |y| := sgn(y)y the componentwise absolute value. The projection algorithm is summarized in algorithm 1. Iteratively, after calculating the constrained gradient (lines 2 and 3), it performs a gradient-descent update step (line 4) followed by a projection onto S_p^{n−1} (line 5) to guarantee that the algorithm stays on the submanifold.

The method performs well; however, as with most gradient-descent-based algorithms, without further optimization it takes quite a few iterations to achieve acceptable convergence, and the choice of an optimal learning rate η(i) is non-trivial. We therefore propose a second projection method employing a fixed-point optimization strategy. Its idea is based on the fact that at local minima y of f(y) on S_p^{n−1}, the gradient ∇f(y) is orthogonal to S_p^{n−1}, so ∇f(y) ∝ ∇g(y). Ignoring signs for illustrative purposes, this means that y − x ∝ p y^{⊙(p−1)}, so y can be calculated from the fixed-point iteration y ← λp y^{⊙(p−1)} + x with additional normalization. Indeed, this can be equivalently derived from the previous Lagrange equations (5), which also yield equations for the proportionality factor λ: we can simply determine it from one component of equation (5), or, to increase numerical robustness, as the mean over all components. Taking into account the signs of the gradient (which we ignored in equation (5)), this yields the estimate λ̂ := (1/n) Σ_{i=1}^n (y_i − x_i)/(p sgn(y_i)|y_i|^{p−1}). Altogether, we get the fixed-point algorithm 2, which in experiments turns out to have a considerably higher convergence rate than algorithm 1.

In table 1, we compare algorithms 1 and 2 with respect to the number of iterations they need to achieve convergence below some given threshold. As expected, the fixed-point algorithm outperforms gradient descent except for the case of higher dimensions and p > 2 (the non-sparse case). In the following we will therefore use the fixed-point algorithm for projection onto S_1^{n−1}.


Algorithm 2: Projection onto S_p^{n−1} via fixed-point iteration. Again, the iteration is stopped once only sufficiently small updates are taken.
Input: vector x ∈ R^n
Output: Euclidean projection y = π_{S_p^{n−1}}(x)
1  Initialize y ∈ S_p^{n−1} randomly.
   for i ← 1, 2, ... do
2    λ ← Σ_{i=1}^n (y_i − x_i)/(n sgn(y_i)|y_i|^{p−1})
3    y ← x + λ sgn(y)|y|^{⊙(p−1)}
4    y ← y/‖y‖_p
   end
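A direct transcription of this iteration (our own sketch; like the derivation above it assumes the iterate keeps nonzero components):

import numpy as np

def project_sphere_p_fp(x, p, n_iter=200, tol=1e-8, rng=np.random):
    """Sketch of algorithm 2: fixed-point iteration for the projection onto the unit p-sphere."""
    n = x.size
    y = rng.standard_normal(n)
    y /= np.linalg.norm(y, ord=p)
    for _ in range(n_iter):
        g = np.sign(y) * np.abs(y) ** (p - 1)        # constraint gradient direction (factor p absorbed in lam)
        lam = np.sum((y - x) / (n * g))              # line 2: averaged proportionality factor
        y_new = x + lam * g                          # line 3: fixed-point update
        y_new /= np.linalg.norm(y_new, ord=p)        # line 4: renormalize onto the sphere
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y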

Table 1. Comparison of the gradient- and fixed-point-based projection algorithms 1 and 2 for finding the Euclidean projection onto cS_p^{n−1} for varying parameters; the mean was taken over 100 iterations with x ∈ [−1, 1]^n sampled uniformly. Here #its_gd and #its_fp denote the numbers of iterations the algorithms took to achieve update steps of size smaller than ε = 0.0001, and ‖y_gd − y_fp‖ equals the norm of the difference of the found minima.

  n    p     c     #its_gd        #its_fp       ‖y_gd − y_fp‖
  2    0.9   1.2    6.7 ± 4.7      3.7 ± 1.0     0.0 ± 0.0
  2    0.9   2.0   10.9 ± 6.9      4.1 ± 1.0     0.0 ± 0.0
  2    2.2   0.9   13.0 ± 21.0     5.5 ± 4.2     0.0 ± 0.0
  3    0.9   3     13.7 ± 6.9      4.4 ± 1.0     0.0 ± 0.0
  3    2.2   0.9    9.6 ± 16.6     7.2 ± 10.2    0.0 ± 0.0
  4    0.9   3      9.8 ± 6.8      4.4 ± 1.1     0.0 ± 0.0
  4    2.2   0.9    6.0 ± 5.0      9.2 ± 8.1     0.0 ± 0.0

2.3. Projection onto convex sets

If M is a convex set, then the Euclidean projection π_M(x) for any x ∈ R^n is already unique, so X(M) = ∅, and the operator π_M is called a convex projector, see e.g. [3], lemma 2.4, and [4]. The theory of projection onto convex sets (POCS) [4, 5] is a well-known technique in signal processing; given N convex sets M1, ..., MN ⊂ R^n and the operator defined by π = π_{MN} ··· π_{M1}, POCS can be formulated as the recursion defined by y_{i+1} = π(y_i). It always approaches the intersection of the convex sets, that is y_i → M* = ∩_{i=1}^N M_i.

Note that POCS only finds an arbitrary point in ∩_{i=1}^N M_i, which does not necessarily coincide with the Euclidean projection: for example, if M1 := {x ∈ R^n | ‖x‖2 ≤ 1} is the unit disc and M2 := {x ∈ R^n | x1 ≤ 0} the half-space of non-positive first coordinate, then the Euclidean projection of x := (1, 1, 0, ..., 0) onto M1 ∩ M2 equals π_{M1∩M2}(x) = (0, 1, 0, ..., 0), but application of POCS yields (0, 1/√2, 0, ..., 0).

3. SPARSE PROJECTION

In this section, we combine the notions from the previous sections to search for sparse projections. Given a signal x ∈ R^n, our goal is to find the closest signal y ∈ R^n of fixed sparseness σ_p(y) = c. Hence, we search for y ∈ R^n with

y = argmin_{σ_p(y)=c} ‖x − y‖2.   (7)

Due to the scale-invariance of σ_p, the problem (7) is equivalent to finding

y = argmin_{‖y‖2=1, ‖y‖p=c} ‖x − y‖2.   (8)

In other words, we are looking for the Euclidean projection y = π_M(x) onto M := S_2^{n−1} ∩ cS_p^{n−1}. Note that due to theorem 2.1, the solution to (8) exists if M ≠ ∅ and is almost always unique.

Algorithm 3: Projection onto spheres (POSH). In practice, some abort criterion has to be implemented. Often q = 2.
Input: vector x ∈ R^n \ X(S_p^{n−1} ∩ S_q^{n−1}) and p, q > 0
Output: y = π_{S_p^{n−1} ∩ S_q^{n−1}}(x)
1  Set y ← x.
   while y ∉ S_p^{n−1} ∩ S_q^{n−1} do
2    y ← π_{S_q^{n−1}}(π_{S_p^{n−1}}(y))
   end
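In Python, the alternating projections take only a few lines; the sketch below is our own, reuses project_sphere_p_fp from above, and handles a radius-c p-sphere via the scaling identity c·π_{S_p}(y/c):

import numpy as np

def posh(x, p, c=1.0, n_iter=100, tol=1e-8):
    """Sketch of algorithm 3 (POSH): alternate projections onto c*S_p and the unit 2-sphere."""
    y = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        y_new = c * project_sphere_p_fp(y / c, p)   # project onto the p-sphere of radius c
        y_new = y_new / np.linalg.norm(y_new)       # project onto the unit 2-sphere, eq. (2)
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y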

3.1. Projection onto spheres (POSH)

In the special case of p = 1 and nonnegative x, Hoyer has proposed an efficient algorithm for finding the projection [2], simply by using the explicit formulas for the p-sphere projection; such formulas do not exist for p ≠ 1, 2, so a more general algorithm for this situation is proposed in the following.

Its idea is a direct generalization of POCS: we alternately project first onto S_p^{n−1} and then onto S_2^{n−1}, using the Euclidean projection operators from section 2.2. However, the difference is that the spheres are clearly non-convex (if p ≠ 1), so in contrast to POCS, convergence is not obvious. We denote this projection algorithm by projection onto spheres (POSH), see algorithm 3.

First note that POSH obviously has the desired solution as a fixed point. In experiments, we observe that POSH indeed converges, and moreover it converges to the closest solution, i.e. to the Euclidean projection (which does not hold for POCS in general)! Finally, we see that in higher dimensions, all update vectors together with the starting point x lie in a single two-dimensional plane, so theoretically we can reduce proofs to two-dimensional cases as well as build algorithms using this fact.

In the following section, we will prove the above claims for the case of p = 1, where an explicit projection formula (3) is known. In the case of arbitrary p, so far we are only able to give experimental validation of the astonishing facts of convergence and convergence to the Euclidean projection.

3.2. Convergence

The proof needs the following simple convergence lemma, which somewhat extends a special case treated by the more general Banach fixed-point theorem.

Lemma 3.1. Let f : R → R be continuously differentiable with f(0) = 0, f′(0) > 1, and let f′ be positive and strictly decreasing with f′(x) < 1 for some x > 0. Then there exists a single positive fixed point x̂ of f, and f^i(x) converges to x̂ for i → ∞ and any x > 0.

Theorem 3.2. Let n ≥ 2, p > 0 and x ∈ R^n \ X(M). If y¹ := π_{S_2^{n−1}}(x) and iteratively y^i := π_{S_2^{n−1}}(π_{S_1^{n−1}}(y^{i−1})) according to the POSH algorithm, then y^i converges to some y^∞ ∈ S_2^{n−1}, and y^∞ = π_M(x).

Using lemma 3.1, we can prove the convergence theorem in the case of p = 1, but omit the proof due to lack of space.

3.3. Simulations

At first, we confirm the convergence results from theorem 3.2 for p = 1 by applying POSH with 100 iterations in 1000 runs onto vectors x ∈ R^6 sampled uniformly from [0, 1]^6; c was chosen to be sufficiently large (c = 2.4). We always get convergence. We also calculate the correct projection (using Hoyer's projection algorithm


[Fig. 2. Starting from x0 (◦), we alternately project onto cS_1 and S_2. POSH performance is illustrated for p = 1 in dimensions 2 (a) and 3 (b), where a projection via PCA is displayed — no information is lost, hence the sequence of points lies in a plane as shown in the proof of theorem 3.2. Panel (c) shows application of POSH for n = 2 and p = 0.5.]

Table 2. Performance of the POSH algorithm 3 for varying parameters. See text for details.

  n    p     c     ‖y_POSH − y_scan‖2
  2    0.8   1.2    0.005 ± 0.0008
  2    4     0.9    0.02 ± 0.005
  3    0.8   1.2    0.02 ± 0.009
  3    4     0.9    0.04 ± 0.03

[2]). The distance between his and our solution was calculated to give a mean value of 5·10^−13 ± 5·10^−12, i.e. we get virtually always the same solution.

In figures 2(a) and (b), we show the application for p = 1; we visualize the performance in 3 dimensions by projecting the data via PCA — which by the way throws away virtually no information (confirmed by experiment), indicating the validity of theorem 3.2 also in higher dimensions. In figure 2(c) a projection for p = 0.5 is shown.

Now, we perform batch simulations for varying p. For this, we uniformly sample the starting vector x ∈ [0, 1]^n in 100 runs, and compare the POSH algorithm result with the true projection. POSH is performed starting with the p-norm projection using algorithm 1 and 100 iterations. As the true projection π_M(x) cannot be determined in closed form, we scan [0, 1]^{n−1} using the stepsize ε = 0.01 to give the first (n−1) coordinates of our estimate y of π_M(x); its n-th coordinate is then constructed to guarantee y ∈ S_p^{n−1} (for p < 1 respectively p > 1). Using a Taylor approximation of (y + ε)^p, it can easily be shown that two adjacent grid points have maximal difference |‖(y1 + ε, ..., yn + ε)‖_p^p − ‖y‖_p^p| ≤ pnε + O(ε²) if y ∈ [0, 1]^n and p ≥ 1. Hence by taking only vectors y as approximation of π_M(x) with |‖y‖²₂ − 1| …


Chapter 15

Neurocomputing, 69:1485-1501, 2006

Paper P. Gruber, K. Stadlthanner, M. Böhm, F.J. Theis, E.W. Lang, A.M. Tomé, A.R. Teixeira, C.G. Puntonet, and J.M. Górriz Saéz. Denoising using local projective subspace methods. Neurocomputing, 69:1485-1501, 2006

Reference (Gruber et al., 2006)

Summary in section 1.5.2


Denoising Using Local Projective Subspace Methods

P. Gruber, K. Stadlthanner, M. Böhm, F. J. Theis, E. W. Lang
Institute of Biophysics, Neuro- and Bioinformatics Group, University of Regensburg, 93040 Regensburg, Germany
email: elmar.lang@biologie.uni-regensburg.de

A. M. Tomé, A. R. Teixeira
Dept. de Electrónica e Telecomunicações / IEETA, Universidade de Aveiro, 3810 Aveiro, Portugal
email: ana@ieeta.pt

C. G. Puntonet, J. M. Gorriz Saéz
Dep. Arquitectura y Tecnología de Computadores, Universidad de Granada, 18371 Granada, Spain
email: carlos@atc.ugr.es

Abstract

In this paper we present denoising algorithms for enhancing noisy signals based on Local ICA (LICA), Delayed AMUSE (dAMUSE) and Kernel PCA (KPCA). The algorithm LICA relies on applying ICA locally to clusters of signals embedded in a high dimensional feature space of delayed coordinates. The components resembling the signals can be detected by various criteria like estimators of kurtosis or the variance of autocorrelations, depending on the statistical nature of the signal. The algorithm proposed can be applied favorably to the problem of denoising multidimensional data. Another projective subspace denoising method using delayed coordinates has been proposed recently with the algorithm dAMUSE. It combines the solution of blind source separation problems with denoising efforts in an elegant way and proves to be very efficient and fast. Finally, KPCA represents a non-linear projective subspace method that is well suited for denoising also. Besides illustrative applications to toy examples and images, we provide an application of all algorithms considered to the analysis of protein NMR spectra.

Preprint submitted to Elsevier Science, 4 February 2005


1 Introduction

The interpretation of recorded signals is often hampered by the presence of noise. This is especially true for biomedical signals, which most often are buried in a large noise background. Statistical analysis tools like Principal Component Analysis (PCA), singular spectral analysis (SSA), Independent Component Analysis (ICA) etc. quickly degrade if the signals exhibit a low Signal to Noise Ratio (SNR). Furthermore, due to their statistical nature, the application of such analysis tools can also lead to extracted signals with a lower SNR than the original ones, as we will discuss below in the case of Nuclear Magnetic Resonance (NMR) spectra.

Hence, in the signal processing community many denoising algorithms have been proposed [5,12,18,38], including algorithms based on local linear projective noise reduction. The idea is to project noisy signals into a high-dimensional space of delayed coordinates, called feature space henceforth. A similar strategy is used in SSA [20], [9], where a matrix composed of the data and their delayed versions is considered. Then, a Singular Value Decomposition (SVD) of the data matrix or a PCA of the related correlation matrix is computed. Noise contributions to the signals are then removed locally by projecting the data onto a subset of principal directions of the eigenvectors of the SVD or PCA analysis related to the deterministic signals.

Modern multi-dimensional NMR spectroscopy is a very versatile tool for the determination of the native 3D structure of biomolecules in their natural aqueous environment [7,10]. Proton NMR is an indispensable contribution to this structure determination process but is hampered by the presence of the very intense water (H2O) proton signal. The latter causes severe baseline distortions and obscures weak signals lying under its skirts. It has been shown [26,29] that Blind Source Separation (BSS) techniques like ICA can contribute to the removal of the water artifact in proton NMR spectra.

ICA techniques extract a set of signals out of a set of measured signals without knowing how the mixing process is carried out [2,13]. Considering that the set of measured spectra X is a linear combination of a set of Independent Components (ICs) S, i.e. X = AS, the goal is to estimate the inverse of the mixing matrix A using only the measured spectra, and then compute the ICs. Then the spectra are reconstructed using the mixing matrix A and those ICs contained in S which are not related to the water artifact. Unfortunately, the statistical separation process in practice introduces additional noise not present in the original spectra. Hence denoising as a post-processing step applied to the artifact-free spectra is necessary to achieve the highest possible SNR of the reconstructed spectra. It is important that the denoising does not change spectral characteristics like integral peak intensities, as the deduction of the


3D structure of the proteins heavily relies on the latter.

We propose two new approaches to this denoising problem and compare the results to the established Kernel PCA (KPCA) denoising [19,25].

The first approach, Local ICA (LICA), concerns a local projective denoising algorithm using ICA. Here it is assumed that the noise can, at least locally, be represented by stationary Gaussian white noise. Signals usually come from a deterministic or at least predictable source and can be described as a smooth function evaluated at discrete time steps small enough to capture the characteristics of the function. Using a dynamical model for the data, this implies that the signal embedded in delayed coordinates resides within a sub-manifold of the feature space spanned by these delayed coordinates. With local projective denoising techniques, the task is to detect this signal manifold. We will use LICA to detect the statistically most interesting submanifold. In the following we call this manifold the signal+noise subspace, since it contains all of the signal plus that part of the noise components which lies in the same subspace. Parameter selection within LICA will be effected with a Minimum Description Length (MDL) criterion [40], [6], which selects optimal parameters based on the data themselves.

For the second approach we combine the ideas of solving BSS problems algebraically using a Generalized Eigenvector Decomposition (GEVD) [28] with local projective denoising techniques. We propose, as in the Algorithm for Multiple Unknown Signals Extraction (AMUSE) [37], a GEVD of two correlation matrices, i.e. the simultaneous diagonalization of a matrix pencil formed from a correlation matrix and a matrix of delayed correlations. These algorithms are exact and fast but sensitive to noise. There are several proposals to improve the efficiency and robustness of these algorithms when noise is present [2,8]. They mostly rely on an approximate joint diagonalization of a set of correlation or cumulant matrices, like the algorithm Second Order Blind Identification (SOBI) [1]. The algorithm we propose, called Delayed AMUSE (dAMUSE) [33], computes a GEVD of the congruent matrix pencil in a high-dimensional feature space of delayed coordinates. We show that the estimated signal components correspond to filtered versions of the underlying uncorrelated source signals. We also present an algorithm to compute the eigenvector matrix of the pencil which involves a two-step procedure based on the standard Eigenvector Decomposition (EVD) approach. The advantage of this two-step procedure is a dimension reduction between the two steps according to a threshold criterion. Thereby, estimated signal components related to noise only can be neglected, thus performing a denoising of the reconstructed signals.

As a third denoising method we consider KPCA based denoising techniques [19,25], which have been shown to be very efficient, outperforming linear PCA.


KPCA actually generalizes linear PCA, which hitherto has been used for denoising. PCA denoising follows the idea that, by retaining only the principal components with highest variance to reconstruct the decomposed signal, noise contributions, which should correspond to the low-variance components, can deliberately be omitted, hence reducing the noise contribution to the observed signal. KPCA extends this idea to non-linear signal decompositions. The idea is to project observed data non-linearly into a high-dimensional feature space and then to perform linear PCA in feature space. The trick is that the whole formalism can be cast into dot-product form, hence the latter can be replaced by suitable kernel functions to be evaluated in the lower dimensional input space instead of the high-dimensional feature space. Denoising then amounts to estimating appropriate pre-images in input space of the nonlinearly transformed signals.

The paper is organized as follows: Section 1 presents an introduction and discusses some related work. In section 2 some general aspects about embedding and clustering are discussed, before in section 3 the new denoising algorithms are discussed in detail. Section 4 presents some applications to toy as well as to real world examples, and section 5 draws some conclusions.

2 Feature Space Embedding

In this section we introduce new denoising techniques and propose algorithms using them. At first we present the signal processing tools we will use later on.

2.1 Embedding using delayed coordinates

A common theme of all three algorithms presented is to embed the data into a high dimensional feature space and to try to solve the noise separation problem there. With LICA and dAMUSE we embed signals in delayed coordinates and do all computations directly in the space of delayed coordinates. The KPCA algorithm considers a non-linear projection of the signals to a feature space but performs all calculations in input space using the kernel trick. It uses the space of delayed coordinates only implicitly as an intermediate step in the nonlinear transformation, since for that transformation the signal at different time steps is used.

Delayed coordinates are an ideal tool for representing the signal information. For example, in the context of chaotic dynamical systems, embedding an observable in delayed coordinates of sufficient dimension already captures the


full dynamical system [30]. There also exists a similar result in statistics for signals with a finite decaying memory [24].

Given a group of N sensor signals, x[l] = (x0[l], ..., x_{N−1}[l])^T, sampled at time steps l = 0, ..., L − 1, a very convenient representation of the signals embedded in delayed coordinates is to arrange them componentwise into component trajectory matrices X_i, i = 0, ..., N − 1 [20]. Hence embedding can be regarded as a mapping that transforms a one-dimensional time series x_i = (x_i[0], x_i[1], ..., x_i[L − 1]) into a multi-dimensional sequence of lagged vectors. Let M be an integer (window length) with M < L. The embedding procedure then forms L − M + 1 lagged vectors which constitute the columns of the component trajectory matrix. Hence, given sensor signals x[l] registered for a set of L samples, their related component trajectory matrices are given by

X_i = ( x_i[M−1]  x_i[M]    ···  x_i[L−1]
        x_i[M−2]  x_i[M−1]  ···  x_i[L−2]
          ...       ...     ···    ...
        x_i[0]    x_i[1]    ···  x_i[L−M] )       (1)

and encompass M delayed versions of each signal component, x_i[l − m], m = 0, ..., M − 1, collected at time steps l = M − 1, ..., L − 1. Note that a trajectory matrix has identical entries along each diagonal. The total trajectory matrix of the set X will be a concatenation of the component trajectory matrices X_i computed for each sensor, i.e.

X = (X_1, X_2, ..., X_N)^T       (2)

Note that the embedded sensor signal is also formed by a concatenation of embedded component vectors, i.e. x[l] = (x0[l], ..., x_{N−1}[l]). Also note that with LICA we deal with single column vectors of the trajectory matrix only, while with dAMUSE we consider the total trajectory matrix.
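Equations (1) and (2) translate into a few lines of code; the helpers below are our own illustration of the embedding (not code from the paper):

import numpy as np

def component_trajectory_matrix(x, M):
    """Equation (1): the M x (L-M+1) trajectory matrix of a single sensor signal x."""
    x = np.asarray(x)
    L = x.size
    # row m holds the signal delayed by m samples: x[M-1-m], ..., x[L-1-m]
    return np.array([x[M - 1 - m : L - m] for m in range(M)])

def total_trajectory_matrix(X, M):
    """Equation (2): stack the component trajectory matrices of all N sensor signals."""
    return np.vstack([component_trajectory_matrix(xi, M) for xi in X])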

2.2 Clustering

In our context, clustering of signals means rearranging the signal vectors, sampled at different time steps, by similarity. Hence for signals embedded in delayed coordinates, the idea is to look for K disjoint sub-trajectory matrices to group together similar column vectors of the trajectory matrix X.

A cluster<strong>in</strong>g algorithm like k-means [15] is appropriate for problems where the<br />

time structure of the signal is irrelevant. If, however, time or spatial correlations<br />

matter, cluster<strong>in</strong>g should be based on f<strong>in</strong>d<strong>in</strong>g an appropriate partition<strong>in</strong>g<br />

of {M − 1, . . . , L − 1} <strong>in</strong>to K successive segments, s<strong>in</strong>ce this preserves the <strong>in</strong>herent<br />

correlation structure of the signals. In any case the number of columns<br />

<strong>in</strong> each sub-trajectory matrix X (j) amounts to Lj such that the follow<strong>in</strong>g<br />

completeness relation holds:<br />

K�<br />

Lj = L − M + 1 (3)<br />

j=1<br />

The mean vector mj <strong>in</strong> each cluster can be considered a prototype vector and<br />

is given by<br />

mj = 1<br />

Xcj = 1<br />

X (j) [1, . . . , 1] T , j = 1, . . . , K (4)<br />

Lj<br />

Lj<br />

where cj is a vector with Lj entries equal to one which characterizes the<br />

cluster<strong>in</strong>g. Note that after the cluster<strong>in</strong>g the set {k = 0, . . . , L − M − 1} of<br />

<strong>in</strong>dices of the columns of X is split <strong>in</strong> K disjo<strong>in</strong>t subsets Kj. Each trajectory<br />

sub-matrix X (j) is formed with those columns of the matrix X, the <strong>in</strong>dices of<br />

which belong to the subset Kj of <strong>in</strong>dices.<br />
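A minimal sketch of the two clustering variants described above, assuming the trajectory matrix X from the previous sketch; k-means via scikit-learn and the segment-wise partition via numpy are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_columns_kmeans(X, K):
    """Cluster the columns of the trajectory matrix X by similarity."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X.T)
    return [np.where(labels == j)[0] for j in range(K)]

def cluster_columns_segments(X, K):
    """Partition the column indices into K successive segments,
    preserving temporal order (completeness relation (3))."""
    return np.array_split(np.arange(X.shape[1]), K)

def prototypes(X, index_sets):
    """Mean (prototype) vector m_j of each cluster, cf. equation (4)."""
    return [X[:, idx].mean(axis=1) for idx in index_sets]
```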

2.3 Principal Component Analysis and Independent Component Analysis

PCA [23] is one of the most common multivariate data analysis tools. It tries to linearly transform given data into uncorrelated data (feature space). Thus in PCA [4] a data vector is represented in an orthogonal basis system such that the projected data have maximal variance. PCA can be performed by an eigenvalue decomposition of the data covariance matrix. The orthogonal transformation is obtained by diagonalizing the centered covariance matrix of the data set.

In ICA, given a random vector, the goal is to find its statistically independent components (ICs). In contrast to correlation-based transformations like PCA, ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten and Hérault [11], while the term ICA was later coined by Comon [3]. With LICA we will use the popular FastICA algorithm by Hyvärinen and Oja [14], which performs ICA by maximizing the non-Gaussianity of the signal components.
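For orientation, one way to obtain such an ICA decomposition in practice is scikit-learn's FastICA; this short usage sketch (with arbitrary toy data) is my own addition and not part of the original text.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Xj: M x Lj sub-trajectory matrix of one cluster (rows = delayed coordinates)
rng = np.random.default_rng(0)
Xj = rng.standard_normal((40, 500))

ica = FastICA(n_components=Xj.shape[0], max_iter=1000, random_state=0)
S = ica.fit_transform(Xj.T).T        # estimated independent components, M x Lj
W = ica.components_                  # estimated de-mixing matrix
```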


3 Denoising Algorithms

3.1 Local ICA denoising

The LICA algorithm we present is based on a local projective denoising technique using an MDL criterion for parameter selection. The idea is to achieve denoising by locally projecting the embedded noisy signal into a lower dimensional subspace which contains the characteristics of the noise-free signal. Finally the signal has to be reconstructed using the various candidates generated by the embedding.

Consider the situation where we have a signal x_i^0[l] at discrete time steps l = 0, . . . , L - 1 but only its noise-corrupted version x_i[l] is measured:

\[
x_i[l] = x_i^0[l] + e_i[l] \qquad (5)
\]

where e_i[l] are samples of a random variable with Gaussian distribution, i.e. x_i equals x_i^0 up to additive stationary white noise.

3.1.1 Embedding and clustering

First the noisy signal x_i[l] is transformed into a high-dimensional signal \mathbf{x}_i[l] in the M-dimensional space of delayed coordinates according to

\[
\mathbf{x}_i[l] := \bigl(x_i[l], \ldots, x_i[l - M + 1 \bmod L]\bigr)^{T} \qquad (6)
\]

which corresponds to a column of the trajectory matrix in equation 1.

To simplify the implementation, we want to ensure that the delayed signal (trajectory matrix), like the original signal, is given at L time steps instead of L - M + 1. This can be achieved by using the samples in a round-robin manner, i.e. by closing the end and the beginning of each delayed signal and cutting out exactly L components in accord with the delay. If the signal contains a trend, or its statistical nature is significantly different at the end compared to the beginning, then this leads to compatibility problems between the beginning and the end of the signal. We can easily resolve this misfit by replacing the signal with a version where we append the signal in reverse order, hence avoiding any sudden change in signal amplitude which would otherwise be smoothed out by the algorithm.

The problem can now be localized by selecting K clusters in the feature space of delayed coordinates of the signal {\mathbf{x}_i[l] | l = 0, . . . , L - 1}. Clustering can be achieved by a k-means cluster algorithm as explained in section 2.2. But k-means clustering is only appropriate if the variance or the kurtosis of a signal does not depend on the inherent signal structure. For other noise selection schemes, like choosing the noise components based on the variance of the autocorrelation, it is usually better to find an appropriate partitioning of the set of time steps {0, . . . , L - 1} into K successive segments, since this preserves the inherent time structure of the signals.

Note that the clustering does not change the data but only its time sequence, i.e. it permutes and regroups the columns of the trajectory matrix and separates it into K sub-matrices.
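A sketch of the circular (round-robin) embedding of equation (6) together with the mirrored extension mentioned above; the function names and conventions are illustrative assumptions.

```python
import numpy as np

def mirror_extend(x):
    """Append the signal in reverse order so that wrapping around the
    end produces no sudden change in amplitude."""
    return np.concatenate([x, x[::-1]])

def circular_embedding(x, M):
    """Delayed coordinates with wrap-around (equation (6)): one
    M-dimensional vector per time step l = 0, ..., L-1."""
    L = len(x)
    lags = np.arange(M)                        # m = 0, ..., M-1
    idx = (np.arange(L)[:, None] - lags) % L   # index (l - m) mod L
    return x[idx].T                            # shape (M, L)
```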

3.1.2 Decomposition and denoising

After centering, i.e. removing the mean in each cluster, we can analyze the M-dimensional signals in these K clusters using PCA or ICA. The PCA case (Local PCA, LPCA) is studied in [39], so in the following we will propose an ICA based denoising.

Using ICA, we extract M ICs from each delayed signal. As in all projection based denoising algorithms, noise reduction is achieved by projecting the signal into a lower dimensional subspace. We used two different criteria to estimate the number p of signal+noise components, i.e. the dimension of the signal subspace onto which we project after applying ICA.

• One criterion is a consistent MDL estimator of p_MDL for the data model in equation 5 [39]:

\[
p_{\mathrm{MDL}} = \operatorname*{argmin}_{p=0,\ldots,M-1} \mathrm{MDL}\bigl(M, L, p, (\lambda_j), \gamma\bigr) \qquad (7)
\]
\[
= \operatorname*{argmin}_{p=0,\ldots,M-1}\Biggl\{-(M-p)\,L\,\ln\!\Biggl(\frac{\bigl(\prod_{j=p+1}^{M}\lambda_j\bigr)^{\frac{1}{M-p}}}{\frac{1}{M-p}\sum_{j=p+1}^{M}\lambda_j}\Biggr)
+\Bigl(pM-\tfrac{p^{2}}{2}+\tfrac{p}{2}+1\Bigr)\Bigl(\tfrac{1}{2}+\ln\gamma\Bigr)
+\Bigl(pM-\tfrac{p^{2}}{2}+\tfrac{p}{2}+1\Bigr)\tfrac{1}{2}\ln L
+\sum_{j=1}^{p}\ln\lambda_j-\sum_{j=1}^{M-1}\ln\lambda_j\Biggr\}
\]

where λ_j denote the variances of the signal components in feature space, i.e. after applying the de-mixing matrix which we estimate with the ICA algorithm. To retain the relative strength of the components in the mixture, we normalize the rows of the de-mixing matrix to unit norm. The variances are ordered such that the smallest eigenvalues λ_j correspond to the directions in feature space most likely to be associated with noise components only. The first term in the MDL estimator represents the likelihood of the M - p Gaussian white noise components. The third term stems from the estimation of the description length of the signal part (the first p components) of the mixture based on their variances. The second term acts as a penalty term to favor parsimonious representations of the data for short time series; it becomes insignificant in the limit L → ∞ since it does not depend on L while the other two terms grow without bounds. The parameter γ controls this behavior and is a parameter of the MDL estimator, hence of the final denoising algorithm. By experience, good values for γ seem to be 32 or 64.

• Based on the observations reported in [17] and our own observation that, in some situations, the MDL estimator tends to significantly underestimate the number of noise components, we also used another approach: we clustered the variances of the signal components into two clusters using k-means clustering and defined p_cl as the number of elements in the cluster which contains the largest eigenvalue. This yields a good estimate of the number of signal components if the noise variances are not clustered well enough together but, nevertheless, are separated from the signal by a large gap. More details and simulations corroborating our observations can be found in section 4.1.1.

Depending on the statistical nature of the data, the ordering of the components in the MDL estimator can be achieved using different methods. For data with a non-Gaussian distribution, we select the noise component as the component with the smallest value of the kurtosis, as Gaussian noise corresponds to a vanishing kurtosis. For non-stationary data with stationary noise, we identify the noise by the smallest variance of its autocorrelation.
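The second, clustering-based criterion can be sketched as follows; this is my own illustrative code, it assumes the component variances have already been estimated and uses scikit-learn's k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def p_cluster(variances):
    """Split the component variances into two clusters with k-means and
    return p_cl, the size of the cluster containing the largest variance."""
    v = np.sort(np.asarray(variances, dtype=float))[::-1]      # descending
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(v.reshape(-1, 1))
    return int(np.sum(labels == labels[0]))   # labels[0]: cluster of the largest variance
```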

3.1.3 Reconstruction

In each cluster the centering is reversed by adding back the cluster mean. To reconstruct the noise-reduced signal, we first have to reverse the clustering of the data to yield the signal \mathbf{x}^e_i[l] ∈ R^M by concatenating the trajectory sub-matrices and reversing the permutation done during clustering. The resulting trajectory matrix does not possess identical entries in each diagonal. Hence we average over the candidates in the delayed data, i.e. over all entries in each diagonal:

\[
x^e_i[l] := \frac{1}{M} \sum_{j=0}^{M-1} \mathbf{x}^e_i[l + j \bmod L]_j \qquad (8)
\]

where \mathbf{x}^e_i[l]_j stands for the j-th component of the enhanced vector \mathbf{x}^e_i at time step l. Note that the summation is done over the diagonals of the trajectory matrix, so it would yield x_i if performed on the original delayed signal \mathbf{x}_i.
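Equation (8) amounts to averaging, for each time step, the M candidates found along the corresponding diagonal of the (circularly) embedded data. A compact sketch, assuming the circular embedding convention used above (applying it to the unmodified embedding returns the original signal, which is a useful sanity check):

```python
import numpy as np

def diagonal_average(Xe):
    """Invert the circular embedding: average the M candidates of each
    time step (equation (8)). Xe has shape (M, L); column l is the
    enhanced delayed vector at time step l."""
    M, L = Xe.shape
    out = np.zeros(L)
    for j in range(M):
        # component j of the column at time (l + j) mod L estimates x[l]
        out += np.roll(Xe[j, :], -j)
    return out / M
```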


3.1.4 Parameter estimation

We still have to find optimal values for the global parameters M and K. Their selection can again be based on an MDL criterion for the detected noise e := x - x^e. Accordingly, we apply the LICA algorithm for different M and K, embed each of these error signals e(M, K) in delayed coordinates of a fixed, large enough dimension \hat M, and choose the parameters M_0 and K_0 such that the MDL criterion estimating the noisiness of the error signal is minimal. The MDL criterion is evaluated with respect to the eigenvalues λ_j(M, K) of the correlation matrix of e(M, K) such that

\[
(M_0, K_0) = \operatorname*{argmin}_{M,K}\; \mathrm{MDL}\bigl(\hat M, L, 0, (\lambda_j(M, K)), \gamma\bigr) \qquad (9)
\]
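The resulting parameter selection is a plain grid search; the following schematic assumes two placeholder callables, lica_denoise for the denoising of section 3.1 and mdl_noisiness for an MDL score of the residual as in equation (9). Both names are hypothetical.

```python
import itertools

def select_parameters(x, Ms, Ks, lica_denoise, mdl_noisiness, M_hat=60, gamma=32):
    """Pick (M0, K0) minimizing the MDL 'noisiness' of the residual e = x - x^e."""
    best = None
    for M, K in itertools.product(Ms, Ks):
        e = x - lica_denoise(x, M=M, K=K)
        score = mdl_noisiness(e, M_hat=M_hat, gamma=gamma)
        if best is None or score < best[0]:
            best = (score, M, K)
    return best[1], best[2]

# e.g. select_parameters(x, Ms=range(10, 61, 5), Ks=range(20, 81, 10), ...)
```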

3.2 Denoising using Delayed AMUSE

Signals with an inherent correlation structure like time series data can as well be analyzed using second-order blind source separation techniques only [22,34]. A GEVD of a matrix pencil [36,37] or a joint approximative diagonalization of a set of correlation matrices [1] is then usually considered. Recently we proposed an algorithm based on a generalized eigenvalue decomposition in a feature space of delayed coordinates [34]. It provides means for BSS and denoising simultaneously.

3.2.1 Embedding

Assuming that each sensor signal is a linear combination X = AS of N underlying but unknown source signals s_i, a source signal trajectory matrix S can be written in analogy to equation 1 and equation 2. Then the mixing matrix A is a block matrix with a diagonal matrix in each block:

\[
A = \begin{pmatrix}
a_{11} I_{M\times M} & a_{12} I_{M\times M} & \cdots & a_{1N} I_{M\times M} \\
a_{21} I_{M\times M} & a_{22} I_{M\times M} & \cdots & \cdots \\
\vdots & \vdots & \ddots & \vdots \\
a_{N1} I_{M\times M} & a_{N2} I_{M\times M} & \cdots & a_{NN} I_{M\times M}
\end{pmatrix} \qquad (10)
\]

The matrix I_{M×M} represents the identity matrix, and in accord with an instantaneous mixing model the mixing coefficient a_{ij} relates the sensor signal x_i with the source signal s_j.


3.2.2 Generalized Eigenvector Decomposition

The delayed correlation matrices of the matrix pencil are computed with one matrix X_r, obtained by eliminating the first k_i columns of X, and another matrix, X_l, obtained by eliminating the last k_i columns. Then the delayed correlation matrix R_x(k_i) = X_r X_l^T will be an NM × NM matrix. Each of these two matrices can be related with a corresponding matrix in the source signal domain:

\[
R_x(k_i) = A R_s(k_i) A^{T} = A S_r S_l^{T} A^{T} \qquad (11)
\]

Then the two pairs of matrices (R_x(k_1), R_x(k_2)) and (R_s(k_1), R_s(k_2)) represent congruent pencils [32] with the following properties:

• Their eigenvalues are the same, i.e., the eigenvalue matrices of both pencils are identical: D_x = D_s.
• If the eigenvalues are non-degenerate (distinct values in the diagonal of the matrix D_x = D_s), the corresponding eigenvectors are related by the transformation E_s = A^T E_x.

Assuming that all sources are uncorrelated, the matrices R_s(k_i) are block diagonal, having block matrices R_mm(k_i) = S_ri S_li^T along the diagonal. The eigenvector matrix of the GEVD of the pencil (R_s(k_1), R_s(k_2)) again forms a block diagonal matrix with block matrices E_mm forming M × M eigenvector matrices of the GEVD of the pencils (R_mm(k_1), R_mm(k_2)). The uncorrelated components can then be estimated from linearly transformed sensor signals via

\[
Y = E_x^{T} X = E_x^{T} A S = E_s^{T} S \qquad (12)
\]

and hence turn out to be filtered versions of the underlying source signals. As the eigenvector matrix E_s is a block diagonal matrix, there are M signals in each column of Y which are a linear combination of one of the source signals and its delayed versions. Then the columns of the matrix E_mm represent impulse responses of Finite Impulse Response (FIR) filters. Considering that all the columns of E_mm are different, their frequency responses might provide different spectral densities of the source signal spectra. Then the NM output signals y encompass M filtered versions of each of the N estimated source signals.

3.2.3 Implementation of the GEVD

There are several ways to compute the generalized eigenvalue decomposition. We summarize a procedure valid if one of the matrices of the pencil is symmetric positive definite. Thus, we consider the pencil (R_x(0), R_x(k_2)) and perform the following steps:

Step 1: Compute a standard eigenvalue decomposition of R_x(0) = V Λ V^T, i.e., compute the eigenvectors v_i and eigenvalues λ_i. As the matrix is symmetric positive definite, the eigenvalues can be arranged in descending order (λ_1 > λ_2 > · · · > λ_{NM}). This procedure corresponds to the usual whitening step in many ICA algorithms. It can be used to estimate the number of sources, but it can also be considered a strategy to reduce noise much like with PCA denoising. Dropping small eigenvalues amounts to a projection from a high-dimensional feature space onto a lower dimensional manifold representing the signal+noise subspace. Thereby it is tacitly assumed that small eigenvalues are related with noise components only. Here we consider a variance criterion to choose the most significant eigenvalues, those related with the embedded deterministic signal, according to

\[
\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_l}{\lambda_1 + \lambda_2 + \cdots + \lambda_{NM}} \ge TH \qquad (13)
\]

If we are interested in the eigenvectors corresponding to directions of high variance of the signals, the threshold TH should be chosen such that their maximum energy is preserved. Similar to the whitening phase in many BSS algorithms, the data matrix X can be transformed using

\[
Q = \Lambda^{-\frac{1}{2}} V^{T} \qquad (14)
\]

to calculate a transformed matrix of delayed correlations C(k_2) to be used in the next step. The transformation matrix can be computed using either the l most significant eigenvalues, in which case denoising is achieved, or all eigenvalues and respective eigenvectors. Also note that Q represents an l × NM matrix if denoising is considered.

Step 2: Use the transformed delayed correlation matrix C(k_2) = Q R_x(k_2) Q^T and its standard eigenvalue decomposition: the eigenvector matrix U and eigenvalue matrix D_x.

The eigenvectors of the pencil (R_x(0), R_x(k_2)), which are not normalized, form the columns of the eigenvector matrix E_x = Q^T U = V Λ^{-1/2} U. The ICs of the delayed sensor signals can then be estimated via the transformation given below, yielding l (or NM) signals, one signal per row of Y:

\[
Y = E_x^{T} X = U^{T} Q X = U^{T} \Lambda^{-\frac{1}{2}} V^{T} X \qquad (15)
\]

The first step of this algorithm is therefore equivalent to a PCA in a high-dimensional feature space [9,38], where a matrix similar to Q is used to project the data onto the signal manifold.
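A condensed numpy sketch of the two steps (PCA/whitening with the variance threshold TH, then the eigendecomposition of the transformed delayed correlation matrix); this is my own reading of the procedure, not the authors' code, and the symmetrization of R_x(k_2) is an extra numerical-stability assumption.

```python
import numpy as np

def damuse_gevd(X, k2, TH=0.95):
    """Two-step GEVD of the pencil (Rx(0), Rx(k2)) for a total trajectory
    matrix X of shape (N*M, L-M+1). Returns the output signals Y."""
    R0 = X @ X.T                                   # Rx(0), assumed positive definite
    Xr, Xl = X[:, k2:], X[:, :X.shape[1] - k2]
    Rk = Xr @ Xl.T                                 # Rx(k2), delayed correlations
    Rk = 0.5 * (Rk + Rk.T)                         # symmetrize (stability assumption)

    lam, V = np.linalg.eigh(R0)                    # step 1: EVD of Rx(0)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    l = int(np.searchsorted(np.cumsum(lam) / lam.sum(), TH) + 1)   # eq. (13)
    Q = np.diag(lam[:l] ** -0.5) @ V[:, :l].T      # eq. (14), l x NM

    C = Q @ Rk @ Q.T                               # step 2
    _, U = np.linalg.eigh(C)
    return U.T @ Q @ X                             # eq. (15): l output signals
```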


3.3 Kernel PCA based denoising

Kernel PCA has been developed by [19], hence we give here only a short summary for convenience. PCA only extracts linear features, though with suitable nonlinear features more information could be extracted. It has been shown [19] that KPCA is well suited to extract interesting nonlinear features in the data. KPCA first maps the data x_i into some high-dimensional feature space Ω through a nonlinear mapping Φ : R^n → R^m, m > n, and then performs linear PCA on the mapped data in the feature space Ω. Assuming centered data in feature space, i.e. ∑_{k=1}^{l} Φ(x_k) = 0, performing PCA in the space Ω amounts to finding the eigenvalues λ > 0 and eigenvectors ω ∈ Ω of the correlation matrix R̄ = (1/l) ∑_{j=1}^{l} Φ(x_j)Φ(x_j)^T.

Note that all ω with λ ≠ 0 lie in the subspace spanned by the vectors Φ(x_1), . . . , Φ(x_l). Hence the eigenvectors can be represented via

\[
\omega = \sum_{i=1}^{l} \alpha_i \Phi(x_i) \qquad (16)
\]

Multiplying the eigenequation with Φ(x_k) from the left, the following modified eigenequation is obtained

\[
K\alpha = l\lambda\alpha \qquad (17)
\]

with λ > 0. The eigenequation now is cast in the form of dot products occurring in feature space through the l × l matrix K with elements K_{ij} = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j), which are represented by kernel functions k(x_i, x_j) to be evaluated in the input space. For feature extraction any suitable kernel can be used and knowledge of the nonlinear function Φ(x) is not needed. Note that the latter can always be reconstructed from the principal components obtained. The image of a data vector under the map Φ can be reconstructed from its projections β_k via

\[
\hat P_n \Phi(x) = \sum_{k=1}^{n} \beta_k \omega_k = \sum_{k=1}^{n} \bigl(\omega_k \cdot \Phi(x)\bigr)\,\omega_k \qquad (18)
\]

which defines the projection operator \hat P_n. In denoising applications, n is deliberately chosen such that the squared reconstruction error

\[
e^{2}_{\mathrm{rec}} = \sum_{i=1}^{l} \bigl\|\hat P_n \Phi(x_i) - \Phi(x_i)\bigr\|^{2} \qquad (19)
\]

is minimized. To find a corresponding approximate representation of the data in input space, the so-called pre-image, it is necessary to estimate a vector z ∈ R^N in input space such that

\[
\rho(z) = \bigl\|\hat P_n \Phi(x) - \Phi(z)\bigr\|^{2} = k(z, z) - 2\sum_{k=1}^{n}\beta_k\sum_{i=1}^{l}\alpha^{k}_{i}\, k(x_i, z) \qquad (20)
\]

is minimized. Note that an analytic solution to the pre-image problem has been given recently in the case of invertible kernels [16]. In denoising applications it is hoped that the deliberately neglected dimensions of minor variance contain mostly noise and that z represents a denoised version of x. Equation (20) can be minimized via gradient descent techniques.
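A small self-contained sketch of KPCA denoising with a Gaussian kernel, combining equations (16)-(18) with a fixed-point form of the gradient condition for the pre-image problem (20); the omission of feature-space centering and the fixed-point iteration are simplifying assumptions of mine.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kpca_denoise(X_train, x, n, sigma2, n_iter=100):
    """Project Phi(x) onto the first n kernel principal axes learned from
    the rows of X_train and search an approximate pre-image z."""
    K = gaussian_kernel(X_train, X_train, sigma2)
    lam, A = np.linalg.eigh(K)                 # K alpha = (l * lambda) alpha, eq. (17)
    lam, A = lam[::-1], A[:, ::-1]             # descending order
    A = A[:, :n] / np.sqrt(np.clip(lam[:n], 1e-12, None))   # so that ||omega_k|| = 1
    kx = gaussian_kernel(X_train, x[None, :], sigma2).ravel()
    beta = A.T @ kx                            # projections beta_k, eq. (18)
    gamma = A @ beta                           # weights sum_k beta_k * alpha_i^k
    z = x.copy()
    for _ in range(n_iter):                    # pre-image fixed-point iteration
        w = gamma * gaussian_kernel(X_train, z[None, :], sigma2).ravel()
        z = (w @ X_train) / w.sum()
    return z
```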

4 Applications and simulations

In this section we will first present results and a concomitant interpretation of some experiments with toy data using different variations of the LICA denoising algorithm. Next we present some test simulations of the algorithm dAMUSE using toy data. Finally we will discuss the results of applying the three different denoising algorithms presented above to a real-world problem, i.e. enhancing protein NMR spectra contaminated with a huge water artifact.

4.1 Denoising with Local ICA applied to toy examples

We will present some sample experimental results using artificially generated signals and random noise. As the latter is characterized by a vanishing kurtosis, the LICA based denoising algorithm uses the component kurtosis for noise selection.

4.1.1 Discussion of an MDL based subspace selection

In the LICA denoising algorithm the MDL criterion is also used to select the number of noise components in each cluster. This works without prior knowledge of the noise strength. Since the estimation is based solely on statistical properties, however, it produces suboptimal results in some cases. In figure 1 we compare, for an artificial signal with a known additive white Gaussian noise, the denoising achieved with the MDL based estimation of the subspace dimension versus the estimation based on the noise level. The latter is done using a threshold on the variances of the components in feature space such that only the signal part is conserved. Fig. 1 shows that the threshold criterion works slightly better in this case, though the MDL based selection can obtain a comparable level of denoising.

[Figure 1: four panels showing the original signal, the noisy signal, the MDL based local ICA result, and the threshold based local ICA result.]

Fig. 1. Comparison between MDL and threshold denoising of an artificial signal with known SNR = 0 dB. The feature space dimension was M = 40 and the number of clusters was K = 35. (The MDL criterion achieved an SNR = 8.9 dB and the threshold criterion an SNR = 10.5 dB.)

However, the smaller SNR indicates that the MDL criterion favors some over-modelling of the signal subspace, i.e. it tends to underestimate the number of noise components in the registered signals. In [17] the conditions, such as the noise not being completely white, which lead to a strong over-modelling are identified. Over-modelling also happens frequently if the eigenvalues of the covariance matrix related with noise components are not sufficiently close together and are not separated from the signal components by a gap. In those cases a clustering criterion for the eigenvalues seems to yield better results, but it is not as generic as the MDL criterion.

4.1.2 Comparisons between LICA and LPCA

Consider the artificial signal shown in figure 1 with varying additive Gaussian white noise. We apply the LICA denoising algorithm using either an MDL criterion or a threshold criterion for parameter selection. The results are depicted in figure 2.

The first and second diagrams of figure 2 compare the performance, here the enhancement of the SNR and the mean square error, of LPCA and LICA depending on the input SNR. Note that a source SNR of 0 dB describes a case where signal and noise have the same strength, while negative values indicate situations where the signal is buried in the noise. The third graph shows the difference in kurtosis between the original signal and the source signal in dependence on the input SNR. All three diagrams were generated with the same data set, i.e. the same signal and, for a given input SNR, the same additive noise.

These results suggest that a LICA approach is more effective when the signal is infested with a large amount of noise, whereas LPCA seems better suited for signals with high SNRs. This might be due to the nature of our selection of subspaces based on the kurtosis or on the variance of the autocorrelation, as the comparison of higher statistical moments of the restored data, like the kurtosis, indicates that noise reduction can be enhanced if we use a LICA approach.

[Figure 2: three panels, each comparing ICA and PCA based denoising; SNR enhancement [dB] and the kurtosis error |kurt(s^e) - kurt(s)| are plotted against the source SNR [dB], and the error between original and recovered signal is plotted against the error between original and noisy signal (both ×10^3).]

Fig. 2. Comparison between LPCA and LICA based denoising. Here the mean square error of two signals x, y with L samples is (1/L) ∑_i ||x_i - y_i||^2. For all noise levels a complete parameter estimation was done in the sets {10, 15, . . . , 60} for M and {20, 30, . . . , 80} for K.

4.1.3 LICA denoising with multi-dimensional data sets

A generalization of the LICA algorithm to multidimensional data sets like images, where pixel intensities depend on two coordinates, is desirable. A simple generalization would be to look at delayed coordinates of vectors instead of scalars. However, this appears impractical due to the prohibitive computational effort. More importantly, this direct approach reduces the number of available samples significantly. This leads to far less accurate estimators of important quantities like the MDL estimate of the dimension of the signal subspace or the kurtosis criterion in the LICA case.

Another approach could be to convert the data to a 1D string by choosing some path through the data and concatenating the pixel intensities accordingly. But this can easily create unwanted artifacts along the chosen path. Further, local correlations are broken up, hence not all the available information is used.

But a more sophisticated and, depending on the nature of the signal, very effective alternative approach can be envisaged. Instead of converting the multidimensional data into 1D data strings prior to applying the LICA algorithm, we can use a modified delay transformation using shifts along all available dimensions. This concept is similar to the multidimensional auto-covariances used in the Multi Dimensional SOBI (mdSOBI) algorithm introduced in [31]. In the 2D case, for example, consider an n × n image represented by a matrix P = (a_{ij})_{i,j=1...n}. Then the transformed data set consists of copies of P which are shifted either along columns or rows or both.
[Figure 3: four image panels showing the noisy image and the results of Local PCA, Local ICA, and Local ICA + PCA denoising.]

Fig. 3. Comparison of LPCA and LICA based denoising of an image infested with Gaussian noise. Also note an improvement in denoising power if both are applied consecutively (Local PCA SNR = 8.8 dB, LICA SNR = 10.6 dB, LPCA and LICA consecutively SNR = 12.6 dB). All images were denoised using a fixed number of clusters K = 20 and a delay radius of M = 4, which results in a 49-dimensional feature space.

For instance, a translation a_{ij} → a_{i-1,j+1}, (i, j = 1, ..., n), yields the following transformed image:

\[
P^{-1,1} = \begin{pmatrix}
a_{n,2} & \cdots & a_{n,n} & a_{n,1} \\
a_{1,2} & \cdots & a_{1,n} & a_{1,1} \\
\vdots & & \vdots & \vdots \\
a_{n-1,2} & \cdots & a_{n-1,n} & a_{n-1,1}
\end{pmatrix} \qquad (21)
\]

Then instead of choosing a single delay dimension, we choose a delay radius M and use all P^ν with ‖ν‖ < M as delayed versions of the original signal. The remainder of the LICA based denoising algorithm works exactly as in the case of a 1D time series.
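A sketch of the 2D delay transformation: all cyclic shifts P^ν within a delay radius M are generated and flattened into feature vectors. The use of the maximum norm for ‖ν‖ is an assumption, chosen because it reproduces the 49-dimensional feature space quoted for M = 4 in figure 3.

```python
import numpy as np

def shifted_copies(P, M):
    """All cyclically shifted copies P^nu with max(|nu_1|, |nu_2|) < M,
    flattened into the rows of a feature matrix (one column per pixel)."""
    rows = []
    for di in range(-(M - 1), M):
        for dj in range(-(M - 1), M):
            rows.append(np.roll(P, shift=(di, dj), axis=(0, 1)).ravel())
    return np.vstack(rows)      # shape ((2M - 1)**2, n*n)

# with M = 4 this gives a (2*4 - 1)**2 = 49-dimensional feature space
```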

In figure 3 we show the results of this approach, using the MDL criterion to select the number of components, in a comparison between LPCA and LICA. In addition we see that the algorithm also works favorably if applied multiple times.


4.2 Denoising with dAMUSE applied to toy examples

A group of three artificial source signals with different frequency contents was chosen: one member of the group represents a narrow-band signal, a sinusoid; the second signal encompasses a wide frequency range; and the last one represents a sawtooth wave whose spectral density is concentrated in the low frequency band (see figure 4).

Fig. 4. Artificial signals (left column) and their frequency contents (right column).

The simulations were designed to illustrate the method and to study the influence of the threshold parameter TH on the performance when noise is added at different levels. Concerning noise we also try to find out if there is any advantage of using a GEVD instead of a PCA analysis. Hence the signals at the output of the first step of the algorithm (using the matrix Q to project the data) are also compared with the output signals. Results are collected in table 1.

Random noise was added to the sensor signals yielding an SNR in the range of [0, 20] dB. The parameters M = 4 and TH = 0.95 were kept fixed. As the noise level increases, the number of significant eigenvalues also increases. Hence at the output of the first step more signals need to be considered.

Table 1
Number of output signals correlated with noise or source signals after step 1 and step 2 of the algorithm dAMUSE.

                        1st step            2nd step
SNR       NM      Sources   Noise     Sources   Noise     Total
20 dB     12      6         0         6         0         6
15 dB     12      5         2         6         1         7
10 dB     12      6         2         7         1         8
5 dB      12      6         3         7         2         9
0 dB      12      7         4         8         3         11

Thus as the noise energy increases, the number l of signals, i.e. the dimension of the matrix C, also increases after the application of the first step (last column of table 1). As the noise increases, an increasing number of ICs will be available at the output of the two steps. Computing, in the frequency domain, the correlation coefficients between the output signals of each step of the algorithm and the noise or source signals, we confirm that some are related with the sources and others with noise. Table 1 (columns 3-6) shows that the maximal correlation coefficients are distributed between noise and source signals to a varying degree. We can see that the number of signals correlated with noise is always higher in the first level. Results show that for low noise levels the first step (which is mainly a principal component analysis in a space of dimension NM) already achieves good solutions. However, we can also see (for narrow-band signals and/or low M) that the time domain characteristics of the signals resemble the original source signals only after a GEVD, i.e. at the output of the second step, rather than with a PCA, i.e. at the output of the first step. Figure 5 shows examples of signals that have been obtained in the two steps of the algorithm for SNR = 10 dB. At the output of the first level the 3 signals with the highest frequency correlation were chosen among the 8 output signals. Using a similar criterion to choose 3 signals at the output of the 2nd step (last column of figure 5), we can see that their time course is more similar to the source signals than after the first step (middle column of figure 5).

4.3 Denoising of protein NMR spectra

In biophysics the determination of the 3D structure of biomolecules like proteins is of utmost importance. Nuclear magnetic resonance techniques provide indispensable tools to reach this goal. As hydrogen nuclei are the most abundant and most sensitive nuclei in proteins, mostly proton NMR spectra of proteins dissolved in water are recorded.

[Figure 5: three columns of 1D traces showing the sources, the output signals after the 1st step, and the output signals after the 2nd step.]

Fig. 5. Comparison of output signals resulting after the first step (second column) and the second step (last column) of dAMUSE.

Since the concentration of the solvent is by orders of magnitude larger than the protein concentration, there is always a large proton signal of the water solvent contaminating the protein spectrum. This water artifact cannot be suppressed completely with technical means, hence it would be interesting to remove it during the analysis of the spectra.

BSS techniques have been shown to solve this separation problem [27,28]. BSS algorithms are based on an ICA [2] which extracts a set of underlying independent source signals out of a set of measured signals without knowing how the mixing process is carried out. We have used an algebraic algorithm [35,36] based on second-order statistics using the time structure of the signals to separate this and related artifacts from the remaining protein spectrum. Unfortunately, due to the statistical nature of the algorithm, unwanted noise is introduced into the reconstructed spectrum, as can be seen in figure 6. The water artifact removal is effected by a decomposition of a series of NMR spectra into their uncorrelated spectral components applying a generalized eigendecomposition of a congruent matrix pencil [37]. The latter is formed with a correlation matrix of the signals and a correlation matrix with delayed or filtered signals [32]. Then we can detect and remove the components which contain only a signal generated by the water and reconstruct the remaining protein spectrum from its ICs.


But the latter now contains additional noise introduced by the statistical analysis procedure, hence denoising was deemed necessary.

The algorithms discussed above have been applied to an experimental 2D Nuclear Overhauser Effect Spectroscopy (NOESY) proton NMR spectrum of the polypeptide P11 dissolved in water. The synthetic peptide P11 consists of only 24 amino acids and represents the helix H11 of the human glutathione reductase [21]. A simple pre-saturation of the water resonance was applied to prevent saturation of the dynamic range of the Analog Digital Converter (ADC). Every data set comprises 512 Free Induction Decays (FIDs) S(t1, t2) ≡ x_n[l], or their corresponding spectra Ŝ(t1, ω2) ≡ x̂_n[l], with L = 2048 samples each, which correspond to N = 128 evolution periods t1 ≡ [n]. To each evolution period belong four FIDs with different phase modulations, hence only FIDs with equal phase modulations have been considered for the analysis. A BSS analysis, using both the algorithm GEVD using Matrix Pencil (GEVD-MP) [28] and the algorithm dAMUSE [33], was applied to all data sets. Note that the matrix pencil within GEVD-MP was conveniently computed in the frequency domain, while in the algorithm dAMUSE, in spite of the filtering operation being performed in the frequency domain, the matrix pencil was computed in the time domain. The GEVD is performed in dAMUSE as described above to achieve a dimension reduction and concomitant denoising.

4.3.1 Local ICA denoising

For denoising we first used the LICA denoising algorithm proposed above to enhance the reconstructed protein signal without the water artifact. We applied the denoising only to those components which were identified as water components. Then we removed the denoised versions of these water artifact components from the total spectrum. As a result, the additional noise is at least halved, as can also be seen from figure 7. On the part of the spectrum away from the center, i.e. not containing any water artifacts, we could estimate the increase of the SNR with the original spectrum as reference. We calculated an SNR of 17.3 dB for the noisy spectrum and an SNR of 21.6 dB after applying the denoising algorithm.

We compare the result, i.e. the reconstructed artifact-free protein spectrum of our denoising algorithm, to the result of a KPCA based denoising algorithm using a Gaussian kernel in figure 8. The figure depicts the differences between the denoised spectra and the original spectrum in the regions where the water signal is not very dominating. As can be seen, the LICA denoising algorithm reduces the noise but does not change the content of the signal, whereas the KPCA algorithm seems to influence the peak amplitudes of the protein resonances as well.

[Figure 6: two 1D traces, signal [a.u.] versus δ [ppm]: the original NMR spectrum of the P11 protein (top) and the spectrum after the water removal algorithm (bottom).]

Fig. 6. The graph shows a 1D slice of a proton 2D NOESY NMR spectrum of the polypeptide P11 before and after removing the water artifact with the GEVD-MP algorithm. The 1D spectrum corresponds to the shortest evolution period t1.

Further experiments are under way in our laboratory to investigate these differences in more detail and to establish an automatic artifact removal algorithm for multidimensional NMR spectra.

4.3.2 Kernel PCA denoising

As the removal of the water artifact led to additional noise in the spectra (compare figure 9(a) and figure 9(b)), KPCA based denoising was applied. First, (almost) noise-free samples had to be created in order to determine the principal axes in feature space. For that purpose, the first 400 data points of the real and the imaginary part of each of the 512 original spectra were used to form a 400 × 1024 sample matrix X^(1). Likewise five further sample matrices X^(m), m = 2, . . . , 6, were created, which consisted of the data points 401 to 800, 601 to 1000, 1101 to 1500, 1249 to 1648 and 1649 to 2048, respectively.

[Figure 7: two 1D traces, signal [a.u.] versus δ [ppm]: (a) LICA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP; (b) KPCA denoised spectrum of P11 after the water artifact has been removed with the algorithm GEVD-MP.]

Fig. 7. The figure shows the corresponding artifact-free P11 spectra after the denoising algorithms have been applied. The LICA algorithm was applied to all water components with M and K chosen with the MDL estimator (γ = 32) between 20 and 60 and between 20 and 80, respectively. The second graph shows the denoised spectrum obtained with a KPCA based algorithm using a Gaussian kernel.

Note that the region (1000 - 1101) of data points comprising the main part of the water resonance was nulled deliberately as it is of no use for the KPCA. For each of the sample matrices X^(m) the corresponding kernel matrix K was determined by

\[
K_{i,j} = k(x_i, x_j), \qquad i, j = 1, \ldots, 400 \qquad (22)
\]

where x_i denotes the i-th column of X^(m). For the kernel function a Gaussian kernel

\[
k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right) \qquad (23)
\]

was chosen.


Chapter 15. Neurocomput<strong>in</strong>g, 69:1485-1501, 2006 221<br />

Signal [a.u.] Signal [a.u.]<br />

10 8 6 4 2 0 -2 -4<br />

10 8 6 4 2 0<br />

14<br />

12<br />

10<br />

8<br />

6<br />

4<br />

2<br />

0<br />

-2<br />

-2 -4<br />

14<br />

12<br />

10<br />

8<br />

6<br />

4<br />

2<br />

0<br />

-2<br />

δ [ppm]<br />

10 8 6 4 2 0 -2 0<br />

10 8 6 4 2 0 -2 0<br />

δ [ppm]<br />

Fig. 8. The graph uncovers the differences of the LICA and KPCA denois<strong>in</strong>g algorithms.<br />

As a reference the correspond<strong>in</strong>g 1D slice of the orig<strong>in</strong>al P11 spectrum is<br />

displayed on top. From top to bottom the three curves show: The difference of the<br />

orig<strong>in</strong>al and the spectrum with the GEVD-MP algorithm applied, the difference between<br />

the orig<strong>in</strong>al and the LICA denoised spectrum and the difference between the<br />

orig<strong>in</strong>al and the KPCA denoised spectrum. To compare the graphs <strong>in</strong> one diagram<br />

the three graphs are translated vertically by 2, 4 and 6 respectively.<br />

The width parameter σ was chosen according to

\[
2\sigma^{2} = \frac{1}{400 \cdot 399}\sum_{i,j=1}^{400}\|x_i - x_j\|^{2} \qquad (24)
\]

Finally the kernel matrix K was expressed in terms of its EVD (equation 17), which leads to the expansion coefficients α necessary to determine the principal axes of the corresponding feature space Ω^(m):

\[
\omega = \sum_{i=1}^{400} \alpha_i \Phi(x_i). \qquad (25)
\]

Similar to the original data, the noisy data of the reconstructed spectra were used to form six 400 × 1024 dimensional pattern matrices P^(m), m = 1, . . . , 6.
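The width choice of equation (24), the mean squared pairwise distance between the samples, can be computed as in the following sketch; whether the l = 400 samples are the rows or the columns of the 400 × 1024 matrices is left generic here (one sample per row), and the use of scipy is purely for convenience.

```python
import numpy as np
from scipy.spatial.distance import pdist

def two_sigma_squared(samples):
    """2*sigma^2 as in equation (24): the mean squared distance over all
    ordered pairs of the l samples (one sample per row)."""
    l = samples.shape[0]
    d2 = pdist(samples, metric="sqeuclidean")   # unordered pairs i < j
    return 2.0 * d2.sum() / (l * (l - 1))
```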

[Figure 9: 1D slices of the spectrum, signal versus δ [ppm]: (a) original (noisy) spectrum; (b) reconstructed spectrum with the water artifact removed with the matrix pencil algorithm; (c) result of the KPCA denoising of the reconstructed spectrum.]

Fig. 9. 1D slice of a 2D NOESY spectrum of the polypeptide P11 in aqueous solution corresponding to the shortest evolution period t1. The chemical shift ranges roughly from -1 ppm to 10 ppm. The insert shows the region of the spectrum between 10 and 9 ppm. The upper trace corresponds to the denoised baseline and the lower trace shows the baseline of the original spectrum.

Then the principal components β_k of each column of P^(m) were calculated in the corresponding feature space Ω^(m). In order to denoise the patterns, only projections onto the first n = 112 principal axes were considered. This leads to

\[
\beta_k = \sum_{i=1}^{400} \alpha_i^{k}\, k(x_i, x), \qquad k = 1, \ldots, 112 \qquad (26)
\]

where x is a column of P^(m).

After reconstructing the image \hat P_n Φ(x) of the sample vector under the map Φ (equation 18), its approximate pre-image was determined by minimizing the cost function

\[
\rho(z) = -2\sum_{k=1}^{112}\beta_k\sum_{i=1}^{400}\alpha_i^{k}\, k(x_i, z) \qquad (27)
\]

Note that the method described above fails to denoise the region where the water resonance appears (data points 1001 to 1101) because there the samples formed from the original data differ too much from the noisy data. This is not a major drawback as protein peaks totally hidden under the water artifact cannot be uncovered by the presented blind source separation method. Figure 9(c) shows the resulting denoised protein spectrum on an identical vertical scale as figure 9(a) and figure 9(b). The insert compares the noise in a region of the spectrum between 10 and 9 ppm roughly, where no protein peaks are found. The upper trace shows the baseline of the denoised reconstructed protein spectrum and the lower trace the corresponding baseline of the original experimental spectrum before the water artifact has been separated out.

4.3.3 Denoising using Delayed AMUSE

LICA denoising of reconstructed protein spectra necessitates solving the BSS problem beforehand using some ICA algorithm. A much more elegant solution is provided by the recently proposed algorithm dAMUSE, which achieves BSS and denoising simultaneously. To test the performance of the algorithm, it was also applied to the 2D NOESY NMR spectra of the polypeptide P11. A 1D slice of the 2D NOESY spectrum of P11 corresponding to the shortest evolution period t1 is presented in figure 9(a), which shows a huge water artifact despite some pre-saturation on the water resonance. Figure 10 shows the reconstructed spectra obtained with the algorithms GEVD-MP and dAMUSE, respectively. The algorithm GEVD-MP yielded almost artifact-free spectra, but with clear changes in the peak intensities in some areas of the spectra. In contrast, the reconstructed spectra obtained with the algorithm dAMUSE still contain some remnants of the water artifact, but the protein peak intensities remained unchanged and all baseline distortions have been cured.


(a) 1D slice of the NOESY spectrum of the protein P11 reconstructed with the algorithm GEVD-MP. (b) Corresponding protein spectrum reconstructed with the algorithm dAMUSE.

Fig. 10. Comparison of denoising of the P11 protein spectrum.

All parameters of the algorithms are collected in Table 2.
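For orientation, the core of both GEVD-MP and dAMUSE is a generalized eigendecomposition of a pencil of (possibly filtered) correlation matrices. The sketch below shows only this second-order BSS step for a single time lag; the embedding in delayed coordinates, the Gaussian filtering and the dimension-reduction stage that distinguish the two algorithms are omitted, and the lag tau is an illustrative choice.

import numpy as np
from scipy.linalg import eigh

def gevd_pencil_bss(X, tau=1):
    """Second-order BSS via a generalized eigendecomposition of the
    matrix pencil (C_tau, C_0) of time-lagged covariance matrices.
    X has shape (channels, samples)."""
    X = X - X.mean(axis=1, keepdims=True)
    T = X.shape[1]
    C0 = X @ X.T / T                              # zero-lag covariance
    Ct = X[:, :-tau] @ X[:, tau:].T / (T - tau)   # lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                        # symmetrise
    eigvals, W = eigh(Ct, C0)                     # Ct w = lambda C0 w
    S = W.T @ X                                   # estimated source signals
    A = np.linalg.pinv(W.T)                       # estimated mixing matrix
    return S, A, eigvals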

5 Conclusions

We proposed two new denoising techniques and also considered KPCA denoising, all of which are based on the concept of embedding signals in delayed coordinates. We presented a detailed discussion of their properties and also discussed results obtained by applying them to illustrative toy examples. Furthermore, we compared all three algorithms by applying them to the real-world problem of removing the water artifact from NMR spectra and denoising the resulting reconstructed spectra.


Table 2
Parameter values for the embedding dimension of the feature space of dAMUSE (M_dAMUSE), the number (K) of sampling intervals used per delay in the trajectory matrix, the number N_pc of principal components retained after the first step of the GEVD, and the half-width (σ) of the Gaussian filter used in the algorithms GEVD-MP and dAMUSE.

Parameter:  N_IC   M_dAMUSE   N_pc   N_w(GEVD)
P11:        256    3          148    49

Parameter:  N_w(dAMUSE)   σ     SNR_GEVD-MP   SNR_dAMUSE
P11:        46            0.3   18.6 dB       22.9 dB

Although all three algorithms achieved good results concerning the final SNR, in the case of the NMR spectra it turned out that KPCA seems to alter the spectral shapes, while LICA and dAMUSE do not. At least with protein NMR spectra it is crucial that denoising algorithms do not alter integrated peak intensities in the spectra, as the latter form the basis for the structure elucidation process.

In the future we have to further investigate the dependence of the proposed algorithms on the situation at hand. Thereby it will be crucial to identify data models for which each one of the proposed denoising techniques works best and to find good measures of how well such models suit the given data.

6 Acknowledgements

This research has been supported by the BMBF (project ModKog) and the DFG (GRK 638: Nonlinearity and Nonequilibrium in Condensed Matter). We are grateful to W. Gronwald and H. R. Kalbitzer for providing the NMR spectrum of P11 and helpful discussions.

References<br />

[1] Adel Belouchrani, Karim Abed-Meraim, Jean-François Cardoso, and Eric<br />

Moulines. A blind source separation technique using second-order statistics.

IEEE Transactions on Signal Process<strong>in</strong>g, 45(2):434–444, 1997.<br />

[2] Andrzej Cichocki and Shun-Ichi Amari. Adaptive Bl<strong>in</strong>d Signal and Image<br />

Process<strong>in</strong>g. Wiley, 2002.<br />


[3] P. Comon. Independent component analysis - a new concept? Signal Processing,

36:287–314, 1994.<br />

[4] K. I. Diamantaras and S. Y. Kung. Pr<strong>in</strong>cipal <strong>Component</strong> Neural Networks,<br />

Theory and Applications. Wiley, 1996.<br />

[5] A. Effern, K. Lehnertz, T. Schreiber, T. Grunwald, P. David, and C. E.<br />

Elger. Nonl<strong>in</strong>ear denois<strong>in</strong>g of transient signals with application to event-related<br />

potentials. Physica D, 140:257–266, 2000.<br />

[6] E. Fishler and H. Messer. On the use of order statistics for improved detection<br />

of signals by the MDL criterion. IEEE Transactions on Signal Process<strong>in</strong>g,<br />

48:2242–2247, 2000.<br />

[7] R. Freeman. Sp<strong>in</strong> Choreography. Spektrum Academic Publishers, Oxford, 1997.<br />

[8] R. R. Gharieb and A. Cichocki. Second-order statistics based bl<strong>in</strong>d source<br />

separation us<strong>in</strong>g a bank of subband filters. Digital Signal Process<strong>in</strong>g, 13:252–<br />

274, 2003.<br />

[9] M. Ghil, M. R. Allen, M. D. Dett<strong>in</strong>ger, and K. Ide. Advanced spectral methods<br />

for climatic time series. Reviews of Geophysics, 40(1):1–41, 2002.<br />

[10] K. H. Hausser and H.-R. Kalbitzer. NMR <strong>in</strong> Medic<strong>in</strong>e and Biology. Berl<strong>in</strong>,<br />

1991.<br />

[11] J. Hérault and C. Jutten. Space or time adaptive signal process<strong>in</strong>g by neural<br />

network models. In J. S. Denker, editor, Neural Networks for Comput<strong>in</strong>g.<br />

Proceed<strong>in</strong>gs of the AIP Conference, pages 206–211, New York, 1986. American<br />

Institute of Physics.<br />

[12] Aapo Hyvär<strong>in</strong>en, Patrik Hoyer, and Erkki Oja. Intelligent Signal Process<strong>in</strong>g.<br />

IEEE Press, 2001.<br />

[13] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John
Wiley & Sons, 2001.

[14] A. Hyvär<strong>in</strong>en and E. Oja. A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent<br />

component analysis. Neural Computation, 9:1483–1492, 1997.<br />

[15] A. K. Ja<strong>in</strong> and R. C. Dubes. Algorithms for Cluster<strong>in</strong>g Data. Prentice Hall:<br />

New Jersey, 1988.<br />

[16] J. T. Kwok and I. W. Tsang. The pre-image problem <strong>in</strong> kernel methods. In<br />

Proceed. Int. Conf. Mach<strong>in</strong>e Learn<strong>in</strong>g (ICML03), 2003.<br />

[17] A. P. Liavas and P. A. Regalia. On the behavior of <strong>in</strong>formation theoretic criteria<br />

for model order selection. IEEE Transactions on Signal Process<strong>in</strong>g, 49:1689–<br />

1695, 2001.<br />

[18] Chor T<strong>in</strong> Ma, Zhi D<strong>in</strong>g, and Sze Fong Yau. A two-stage algorithm for MIMO<br />

bl<strong>in</strong>d deconvolution of nonstationary colored noise. IEEE Transactions on<br />

Signal Process<strong>in</strong>g, 48:1187–1192, 2000.<br />


[19] S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, and G. Rätsch. Kernel<br />

PCA and denois<strong>in</strong>g <strong>in</strong> feature spaces. Adv. Neural Information Process<strong>in</strong>g<br />

Systems, NIPS11, 11, 1999.<br />

[20] V. Moskv<strong>in</strong>a and K. M. Schmidt. Approximate projectors <strong>in</strong> s<strong>in</strong>gular spectrum<br />

analysis. SIAM Journal Mat. Anal. Appl., 24(4):932–942, 2003.<br />

[21] A. Nordhoff, Ch. Tziatzios, J. A. V. Broek, M. Schott, H.-R. Kalbitzer,<br />

K. Becker, D. Schubert, and R. H. Schirmer. Denaturation and reactivation of

dimeric human glutathione reductase. Eur. J. Biochem, pages 273–282, 1997.<br />

[22] L. Parra and P. Sajda. Blind source separation via generalized eigenvalue

decomposition. Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research, 4:1261–1269, 2003.<br />

[23] K. Pearson. On l<strong>in</strong>es and planes of closest fit to systems of po<strong>in</strong>ts <strong>in</strong> space.<br />

Philosophical Magaz<strong>in</strong>e, 2:559–572, 1901.<br />

[24] I. W. Sandberg and L. Xu. Uniform approximation of multidimensional myopic

maps. Transactions on Circuits and Systems, 44:477–485, 1997.<br />

[25] B. Schoelkopf, A. Smola, and K.-R. Mueller. Nonl<strong>in</strong>ear component analysis as<br />

a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.<br />

[26] K. Stadlthanner, E. W. Lang, A. M. Tomé, A. R. Teixeira, and C. G. Puntonet.<br />

Kernel-PCA denois<strong>in</strong>g of artifact-free prote<strong>in</strong> NMR spectra. Proc. IJCNN’2004,<br />

Budapest, Hungary, 2004.

[27] K. Stadlthanner, F. J. Theis, E. W. Lang, A. M. Tomé, W. Gronwald, and H.-R.<br />

Kalbitzer. A matrix pencil approach to the bl<strong>in</strong>d source separation of artifacts<br />

<strong>in</strong> 2D NMR spectra. Neural Information Process<strong>in</strong>g - Letters and Reviews,<br />

1:103–110, 2003.<br />

[28] K. Stadlthanner, F. Theis, E. W. Lang, A. M. Tomé, A. R. Teixeira,<br />

W. Gronwald, and H.-R. Kalbitzer. GEVD-MP. Neurocomput<strong>in</strong>g accepted,<br />

2005.<br />

[29] K. Stadlthanner, A. M. Tomé, F. J. Theis, W. Gronwald, H.-R. Kalbitzer, and<br />

E. W. Lang. Bl<strong>in</strong>d source separation of water artifacts <strong>in</strong> NMR spectra us<strong>in</strong>g a<br />

matrix pencil. In Fourth International Symposium On <strong>Independent</strong> <strong>Component</strong><br />

<strong>Analysis</strong> and Bl<strong>in</strong>d Source Separation, ICA’2003, pages 167–172, Nara, Japan,<br />

2003.<br />

[30] F. Takens. On the numerical determ<strong>in</strong>ation of the dimension of an attractor.<br />

Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 898:366–

381, 1981.<br />

[31] F. J. Theis, A. Meyer-Bäse, and E. W. Lang. Second-order bl<strong>in</strong>d source<br />

separation based on multi-dimensional autocovariances. In Proc. ICA 2004,<br />

volume 3195 of Lecture Notes <strong>in</strong> Computer Science, pages 726–733, Granada,<br />

Spa<strong>in</strong>, 2004.<br />

[32] Ana Maria Tomé and Nuno Ferreira. On-l<strong>in</strong>e source separation of temporally<br />

correlated signals. In European Signal Process<strong>in</strong>g Conference, EUSIPCO2002,<br />

Toulouse, France, 2002.<br />


[33] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

A. P. Rocha, and R. Almeida. dAMUSE - A new tool for denoising and BSS.

Digital Signal Process<strong>in</strong>g, 2005.<br />

[34] Ana Maria Tomé, Ana Rita Teixeira, Elmar Wolfgang Lang, Kurt Stadlthanner,<br />

and A. P. Rocha. Bl<strong>in</strong>d source separation us<strong>in</strong>g time-delayed signals. In<br />

International Jo<strong>in</strong>t Conference on Neural Networks, IJCNN’2004, volume CD,<br />

Budapest, Hungary, 2004.<br />

[35] Ana Maria Tomé. Bl<strong>in</strong>d source separation us<strong>in</strong>g a matrix pencil. In Int. Jo<strong>in</strong>t<br />

Conf. on Neural Networks, IJCNN’2000, Como, Italy, 2000.<br />

[36] Ana Maria Tomé. An iterative eigendecomposition approach to bl<strong>in</strong>d source<br />

separation. In 3rd Intern. Conf. on <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong> and Signal<br />

Separation, ICA’2001, pages 424–428, San Diego, USA, 2001.

[37] Lang Tong, Ruey wen Liu, Victor C. Soon, and Yih-Fang Huang. Indeterm<strong>in</strong>acy<br />

and identifiability of bl<strong>in</strong>d identification. IEEE Transactions on Circuits and<br />

Systems, 38(5):499–509, 1991.<br />

[38] Rolf Vetter, J. M. Ves<strong>in</strong>, Patrick Celka, Philippe Renevey, and Jens Krauss.<br />

Automatic nonl<strong>in</strong>ear noise reduction us<strong>in</strong>g local pr<strong>in</strong>cipal component analysis<br />

and MDL parameter selection. Proceed<strong>in</strong>gs of the IASTED International<br />

Conference on Signal Process<strong>in</strong>g Pattern Recognition and Applications (SPPRA<br />

02) Crete, pages 290–294, 2002.<br />

[39] Rolf Vetter. Extraction of efficient and characteristic features of<br />

multidimensional time series. PhD thesis, EPFL, Lausanne, 1999.<br />

[40] P. Vitányi and M. Li. Minimum description length induction, Bayesianism, and
Kolmogorov complexity. IEEE Transactions on Information Theory, 46:446–

464, 2000.<br />



Chapter 16<br />

Proc. ICA 2006, pages 917-925<br />

Paper F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis.

In Proc. ICA 2006, pages 917-925, Charleston, USA, 2006<br />

Reference (Theis and Kawanabe, 2006)<br />

Summary <strong>in</strong> section 1.5.2<br />




Uniqueness of Non-Gaussian Subspace <strong>Analysis</strong><br />

Fabian J. Theis 1 and Motoaki Kawanabe 2<br />

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

2 Fraunhofer FIRST.IDA, Kekuléstraße 7, 12439 Berl<strong>in</strong>, Germany<br />

fabian@theis.name and nabe@first.fhg.de

Abstract. Dimension reduction provides an important tool for preprocess<strong>in</strong>g<br />

large scale data sets. A possible model for dimension reduction is realized by<br />

project<strong>in</strong>g onto the non-Gaussian part of a given multivariate record<strong>in</strong>g. We prove<br />

that the subspaces of such a projection are unique given that the Gaussian subspace<br />

is of maximal dimension. This result therefore guarantees that projection<br />

algorithms uniquely recover the underly<strong>in</strong>g lower dimensional data signals.<br />

An important open problem <strong>in</strong> signal process<strong>in</strong>g is the task of efficient dimension<br />

reduction, i.e. the search for mean<strong>in</strong>gful signals with<strong>in</strong> a higher dimensional data set.<br />

Classical techniques such as pr<strong>in</strong>cipal component analysis hereby def<strong>in</strong>e ‘mean<strong>in</strong>gful’<br />

us<strong>in</strong>g second-order statistics (maximal variance), which may often be <strong>in</strong>adequate for<br />

signal detection, i.e. <strong>in</strong> the presence of strong noise. This contrasts to higher order models<br />

<strong>in</strong>clud<strong>in</strong>g projection pursuit [1,2] or non-Gaussian subspace analysis (NGSA) [3,4].<br />

While the former extracts a s<strong>in</strong>gle non-Gaussian <strong>in</strong>dependent component from the data<br />

set, the latter tries to detect a whole non-Gaussian subspace with<strong>in</strong> the data, and no<br />

assumption of <strong>in</strong>dependence with<strong>in</strong> the subspace is made.<br />

The goal of linear dimension reduction can be defined as the search for a projection W ∈ Mat(n×d) of a d-dimensional random vector X with n < d.



An <strong>in</strong>tuitive notion of how to choose the reduced dimension n is to require that WGX is<br />

maximally Gaussian, and hence WNX non-Gaussian.<br />

The dimension reduction problem itself can of course also be formulated within a generative model, which leads to the following linear mixing model

X = A_N S_N + A_G S_G   (1)

such that S_N and S_G are independent, and S_G is Gaussian. Then (A_N, A_G)^{−1} = (W_N^⊤, W_G^⊤)^⊤. This model includes the general noisy ICA model X = A_N S_N + G, where G is Gaussian and S_N is also assumed to be mutually independent; the dimension reduction then means projection onto the signal subspace, which might be deteriorated by the noise G along the subspace, while the components of G orthogonal to the subspace will be removed. However, (1) is more general in the sense that it does not assume mutual independence of S_N, only independence of S_N and S_G.

The paper is organized as follows: In the next section, we first discuss obvious<br />

<strong>in</strong>determ<strong>in</strong>acies of NGSA and possible regularizations. We then present our ma<strong>in</strong> result,<br />

theorem 1, and give an explicit proof <strong>in</strong> a special case. The general proof is divided<br />

up <strong>in</strong>to a series of lemmas, the proofs of which are omitted due to lack of space. In<br />

section 2, some simulations are performed to validate the uniqueness result. A practical<br />

algorithm for perform<strong>in</strong>g NGSA essentially us<strong>in</strong>g the idea of separated characteristic<br />

functions from the proof is presented <strong>in</strong> the co-paper [6].<br />

1 Uniqueness of NGSA-based dimension reduction<br />

This contribution aims at provid<strong>in</strong>g conditions such that the decomposition (1) is unique.<br />

More precisely, we will show under which conditions the non-Gaussian as well as the<br />

Gaussian subspace is unique.<br />

1.1 Indeterm<strong>in</strong>acies<br />

Clearly, the matrices A_N and A_G in the decomposition (1) cannot be unique: multiplication from the right by any invertible matrix leaves the model invariant, X = A_N S_N + A_G S_G = (A_N B_N)(B_N^{−1} S_N) + (A_G B_G)(B_G^{−1} S_G) with B_N ∈ Gl(n), B_G ∈ Gl(d−n), because B_N^{−1} S_N and B_G^{−1} S_G are again independent, and B_G^{−1} S_G is Gaussian.

An additional <strong>in</strong>determ<strong>in</strong>acy comes <strong>in</strong>to play due to the fact that we do not want to<br />

fix the reduced dimension <strong>in</strong> advance. Given a realization of the model (1) with d



1.2 Uniqueness theorem<br />

Definition 1. X = AS with A ∈ Gl(d), S = (S_N, S_G) and S_N ∈ L_2(Ω, R^n) is called an n-decomposition of X if S_N and S_G are stochastically independent and S_G is Gaussian. In this case, X is said to be n-decomposable.

Hence an n-decomposition of X corresponds to the NGSA problem. If as before A = (A_N, A_G), then the n-dimensional subspace im(A_N) ⊂ R^d is called the non-Gaussian subspace, and im(A_G) the Gaussian subspace of the decomposition; here im(A) denotes the image of the linear map A.

Definition 2. X is said to be minimally n-decomposable if X is not (n−1)-decomposable. Then dim_e(X) := n is called the essential dimension of X.

For example, the essential dimension dime(X) is zero if and only if X is Gaussian,<br />

whereas the essential dimension of a d-dimensional mutually <strong>in</strong>dependent Laplacian is<br />

d. The follow<strong>in</strong>g theorem is the ma<strong>in</strong> theoretical contribution of this work. It essentially<br />

connects uniqueness of the dimension reduction model with m<strong>in</strong>imality, and gives a<br />

simple characterization for it.<br />

Theorem 1 (Uniqueness of NGSA). Let n



1.3 Proof of theorem 1<br />

First note that the theorem holds trivially for n=0, because <strong>in</strong> this case X is Gaussian.<br />

So <strong>in</strong> the follow<strong>in</strong>g let 0



for all x_N ∈ R, because A is invertible.

Now a_NN ≠ 0, otherwise X_N = a_NG S_G, which contradicts (ii) for X. If also a_NG ≠ 0, then by equation (3), h''_N is constant and therefore S_N is Gaussian, which again contradicts (ii), now for S. Hence a_NG = 0. By (3), a_GN a_GG = 0, and again a_GG ≠ 0, otherwise X_G = a_GN S_N, contradicting (ii) for S. Hence also a_GN = 0, as was to be shown.

General proof. In order to give an idea of the main proof without getting lost in details, we have divided it up into a sequence of lemmas; these will not be proven due to lack of space. The characteristic function of the random vector X is defined by X̂(x) := E(exp(i x^⊤ X)), and since X is assumed to have existing covariance, X̂ is twice continuously differentiable. Moreover, by definition the characteristic function of AS satisfies (AS)^(x) = Ŝ(A^⊤ x), and the characteristic function of an independent random vector factorizes into the component characteristic functions. So instead of using p_X as in the 2-dimensional example, we use X̂, having similar properties except for the fact that the range is now complex and that the differentiability condition can be considerably relaxed.

We will need the following lemma, which has essentially been shown in [9]; here ∇f denotes the gradient of f and H_f its Hessian.

Lemma 1. Let X ∈ L_2(Ω, R^m) be a random vector. Then X is Gaussian with covariance 2C if and only if it satisfies X̂ H_X̂ − ∇X̂ (∇X̂)^⊤ + C X̂^2 ≡ 0.

Note that we may assume that the covariance of S (and hence also of X) is positive definite; otherwise, while still keeping the model, we can simply remove the subspace of deterministic components (i.e. components of variance 0), which have to be mapped onto each other by A. Hence we may even assume Cov(S_G) = I, after whitening as described in section 1.1. This uses the fact that the basis within the Gaussian subspace is not unique. The same holds also for the non-Gaussian subspace, so we may choose any B_N ∈ Gl(n) and B_G ∈ O(d−n) to get

X = [A_NN B_N ; A_GN B_N] (B_N^{−1} S_N) + [A_NG B_G ; A_GG B_G] (B_G^⊤ S_G),   (4)

where [· ; ·] stacks the two blocks vertically. Here only orthogonal matrices B_G are allowed in order for B_G^⊤ S_G to stay decorrelated, with S_G being decorrelated.

The next lemma uses the dimension reduction model for X and S to derive an explicit differential equation for Ŝ_N. The Gaussian part Ŝ_G in the following lemma vanishes after application of lemma 1.

Lemma 2. For any basis B_N ∈ Gl(n), the non-Gaussian source characteristic function Ŝ_N ∈ C^2(R^n, C) fulfills

A_NN B_N (Ŝ_N H_Ŝ_N − ∇Ŝ_N (∇Ŝ_N)^⊤) B_N^⊤ A_GN^⊤ + 2 A_NG A_GG^⊤ Ŝ_N^2 ≡ 0.   (5)

Lemma 3. Let (A_NN, A_NG) ∈ Mat(n × (n + (d−n))) be an arbitrary full-rank matrix. If rank A_NN < n, then we may choose coordinates B_N ∈ Gl(n), B_G ∈ O(d−n) and M ∈ Gl(n) such that for arbitrary matrices ∗ ∈ Mat((n−1) × (n−1)), ∗′ ∈ Mat((n−1) × (d−n−1)):

M A_NN B_N = [0 0 ; 0 ∗]   and   M A_NG B_G = [1 0 ; 0 ∗′].



The basis choice from lemma 3 together with assumption (ii) can be used to prove the following fact:

Lemma 4. The non-Gaussian transformation is invertible, i.e. A_NN ∈ Gl(n).

The next lemma can be seen as a modification of lemma 1, and indeed it can be shown similarly.

Lemma 5. If Ŝ_N fulfills (Ŝ_N H_Ŝ_N − ∇Ŝ_N (∇Ŝ_N)^⊤) e_1 + Ŝ_N^2 c ≡ 0 for some constant vector c ∈ R^n, then the source component (S_N)_1 is Gaussian and independent of (S_N)(2 : n).

Here more generally e_i ∈ R^n denotes the i-th unit vector. Putting these lemmas together, we can finally prove theorem 1: According to lemma 4, A_NN is invertible, so multiplying equation (5) from lemma 2 by B_N^{−1} A_NN^{−1} from the left yields

(Ŝ_N H_Ŝ_N − ∇Ŝ_N (∇Ŝ_N)^⊤) B_N^⊤ A_GN^⊤ + C Ŝ_N^2 ≡ 0   (6)

for any B_N ∈ Gl(n) and some fixed, real matrix C ∈ Mat(n × (d−n)).

We claim that A_GN = 0. If not, then there exists v ∈ R^{d−n} with ‖A_GN^⊤ v‖ = 1. Choose B_N from (4) such that B_N^{−1} S_N is decorrelated. This is invariant under left-multiplication by an orthogonal matrix, so we may moreover assume that B_N^⊤ A_GN^⊤ v = e_1. Multiplying equation (6) in turn by v from the right therefore shows that

(Ŝ_N H_Ŝ_N − ∇Ŝ_N (∇Ŝ_N)^⊤) e_1 + c Ŝ_N^2 ≡ 0,   (7)

where c := Cv ∈ R^n. This means that Ŝ_N fulfills the condition of lemma 5, which implies that (S_N)_1 is Gaussian and independent of the rest. But this contradicts (ii) for S, hence A_GN = 0. Plugging this result into equation (5), evaluation at s_N = 0 shows that A_NG A_GG^⊤ = 0. Since A_GN = 0 and A ∈ Gl(d), necessarily A_GG ∈ Gl(d−n), so A_NG = 0, as was to be proven.

2 Simulations<br />

In this section, we will provide experimental validation of the uniqueness result of<br />

corollary 1. In order to stay unbiased and not test a s<strong>in</strong>gle algorithm, we have to uniformly<br />

search the parameter space for possibly equivalent model representations. The<br />

model assumptions (1) will not be perfectly fulfilled, so we <strong>in</strong>troduce a measure of<br />

model deviation based on 4-th order cumulants <strong>in</strong> the follow<strong>in</strong>g.<br />

Let the non-Gaussian dimension n and the total dimension d be fixed. Given a random vector X = (X_N, X_G), we can without loss of generality assume that Cov(X) = I. Any possible model deviation consists of (i) a deviation from the independence of X_N and X_G and (ii) a deviation from the Gaussianity of X_G. In the case of non-vanishing kurtoses, the former can be approximated for example by

δ_I(X) := 1/(n(n−d)d^2) Σ_{i=1}^{n} Σ_{j=n+1}^{d} Σ_{k=1}^{d} Σ_{l=1}^{d} cum^2(X_i, X_j, X_k, X_l),



where the fourth-order cumulant tensor is defined as cum(X_i, X_j, X_k, X_l) := E(X_i X_j X_k X_l) − E(X_i X_j)E(X_k X_l) − E(X_i X_k)E(X_j X_l) − E(X_i X_l)E(X_j X_k). The deviation (ii) from Gaussianity of X_G can simply be measured by kurtosis, which in the case of white X means

δ_G(X) := 1/(n−d) Σ_{j=n+1}^{d} |E(X_j^4) − 3|.

Altogether, we can therefore define a total model deviation as the weighted sum of the above indices; the weight in the following was chosen experimentally to approximately yield even contributions of the two measures:

δ(X) = 10 n(d−n) δ_I(X) + δ_G(X).
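A small sketch of how δ_I, δ_G and δ can be estimated from T whitened samples is given below; the quadruple loop is written for clarity rather than speed, and the prefactor is taken as 1/(n(d−n)d^2) so that the index is nonnegative, whereas the printed formula carries (n−d).

import numpy as np

def model_deviation(X, n):
    """Empirical model deviation delta(X) for X = (X_N, X_G): squared
    cross fourth-order cumulants linking the first n (non-Gaussian)
    coordinates with the remaining d-n (Gaussian) ones, plus the mean
    absolute excess kurtosis of the Gaussian part.  X has shape (d, T),
    rows = coordinates; samples are assumed whitened (Cov(X) = I)."""
    d, T = X.shape
    def cum4(i, j, k, l):
        # empirical fourth-order cumulant for zero-mean data
        return (np.mean(X[i] * X[j] * X[k] * X[l])
                - np.mean(X[i] * X[j]) * np.mean(X[k] * X[l])
                - np.mean(X[i] * X[k]) * np.mean(X[j] * X[l])
                - np.mean(X[i] * X[l]) * np.mean(X[j] * X[k]))
    # independence deviation delta_I (nonnegative normalisation n(d-n)d^2)
    delta_I = sum(cum4(i, j, k, l) ** 2
                  for i in range(n) for j in range(n, d)
                  for k in range(d) for l in range(d)) / (n * (d - n) * d ** 2)
    # Gaussianity deviation delta_G: mean |E(X_j^4) - 3| over the Gaussian part
    delta_G = np.mean([abs(np.mean(X[j] ** 4) - 3.0) for j in range(n, d)])
    return 10 * n * (d - n) * delta_I + delta_G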

For numerical tests, we generate two different non-Gaussian source data sets, see<br />

figure 1(d) and also [4], figure 1. The first source set (I) is an n-dimensional dependent<br />

sub-Gaussian random vector given by an isotropic uniform density with<strong>in</strong> the unit<br />

disc, and source set (II) a 2-dimensional dependent super- and sub-Gaussian, given by<br />

p(s1, s2)∝exp(−|s1|)1[c(s1),c(s1)+1], where c(s1)=0 if|s1|≤ln 2 and c(s1)=−1 otherwise.<br />

Normalization was chosen to guarantee Cov(S N)=I<strong>in</strong> advance.<br />

In order to test for model violations, we have to find two representations X = AS and X = A′S′ of the same mixtures. After multiplication by A^{−1} we may as before assume that a single representation X = AS is given with X and S both fulfilling the dimension reduction model (1), and we have to show that A_NG = A_GN = 0 if the decomposition is minimal (corollary 1). The latter can be tested numerically by using the so-called normalized crosserror

E(A) := 1/(2n(d−n)) (‖A_NG‖_F^2 + ‖A_GN‖_F^2),

where ‖·‖_F is some matrix norm, in our case the Frobenius norm.

In order to reduce the d^2-dimensional search space, after whitening we may assume that A ∈ O(d), so only d(d−1)/2 dimensions have to be searched. O(d) can be uniformly sampled, for example, by choosing B with Gaussian i.i.d. coefficients and orthogonalizing A := (BB^⊤)^{−1/2} B. We perform 10^4 Monte-Carlo runs with random A ∈ O(d). Sources have been generated with T = 10^4 samples, an n-dimensional non-Gaussian part (a) and (b) from above, and (d−n)-dimensional i.i.d. Gaussians. We measure the model deviation δ(AS) and compare it with the deviation E(A) from block-diagonality.
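The two ingredients of this Monte-Carlo test, Haar-uniform sampling of O(d) via A := (BB^⊤)^{−1/2}B and the normalized crosserror E(A), can be sketched as follows; the orthogonalisation is computed here through an SVD, which yields the same matrix.

import numpy as np

def random_orthogonal(d, rng):
    """Uniform (Haar) sample from O(d): orthogonalise a Gaussian i.i.d.
    matrix B; (B B^T)^{-1/2} B equals U V^T from the SVD B = U S V^T."""
    B = rng.standard_normal((d, d))
    U, _, Vt = np.linalg.svd(B)
    return U @ Vt

def crosserror(A, n):
    """Normalised cross-error E(A): off-block-diagonal Frobenius mass of
    A = [[A_NN, A_NG], [A_GN, A_GG]], scaled by 1 / (2 n (d - n))."""
    d = A.shape[0]
    A_NG, A_GN = A[:n, n:], A[n:, :n]
    return (np.linalg.norm(A_NG, 'fro') ** 2
            + np.linalg.norm(A_GN, 'fro') ** 2) / (2 * n * (d - n))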

The results for vary<strong>in</strong>g parameters are given <strong>in</strong> figure 1(a-c). In all three cases we<br />

observe that the smaller the model deviation, the smaller also the crosserror. This gives<br />

an asymptotic confirmation of corollary 1, <strong>in</strong>dicat<strong>in</strong>g that by random sampl<strong>in</strong>g no nonuniqueness<br />

realizations have been found.<br />

3 Conclusion<br />

By m<strong>in</strong>imality of the decomposition (1), we gave a necessary condition for the uniqueness<br />

of non-Gaussian subspace analysis. Together with the assumption of exist<strong>in</strong>g covariance,<br />

this was already sufficient to guarantee model uniqueness. Our result allows<br />

NGSA algorithms to f<strong>in</strong>d the unknown, unique signal space with<strong>in</strong> a noisy highdimensional<br />

data set [6].<br />




[Figure 1, four panels; axes in (a-c): crosserror E(A) versus total model deviation δ(AS).] (a) n=2, d=5, source (I); (b) n=3, d=5, source (I); (c) n=2, d=4, source (II); (d) Laplacian & uniform source.

Fig. 1. (a-c): total model deviation δ(AS) of the transformed sources versus crosserror E(A) of the mixing matrix for 10^4 Monte-Carlo runs. The circle ◦ indicates the actual source model deviation (non-zero due to finite sample sizes). (d): 2-dimensional dependent sub-Gaussian source (II).

References<br />

1. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE<br />

Trans. on Computers 23 (1975) 881–890<br />

2. Hyvär<strong>in</strong>en, A., Karhunen, J., Oja, E.: <strong>Independent</strong> component analysis. John Wiley & Sons<br />

(2001)<br />

3. Blanchard, G., Kawanabe, M., Sugiyama, M., Spokoiny, V., Müller, K.R.: In search of non-Gaussian
components of a high-dimensional distribution. JMLR (2005) In revision. The

prepr<strong>in</strong>t is available at http://www.cs.titech.ac.jp/ tr/reports/2005/TR05-0003.pdf.<br />

4. Kawanabe, M.: L<strong>in</strong>ear dimension reduction based on the fourth-order cumulant tensor. In:<br />

Proc. ICANN 2005. Volume 3697 of LNCS., Warsaw, Poland, Spr<strong>in</strong>ger (2005) 151–156<br />

5. Theis, F.: Uniqueness of complex and multidimensional <strong>in</strong>dependent component analysis.<br />

Signal Process<strong>in</strong>g 84 (2004) 951–956<br />

6. Kawanabe, M., Theis, F.: Extract<strong>in</strong>g non-gaussian subspaces by characteristic functions. In:<br />

submitted to ICA 2006. (2006)<br />

7. Comon, P.: <strong>Independent</strong> component analysis - a new concept? Signal Process<strong>in</strong>g 36 (1994)<br />

287–314<br />

8. Theis, F.: A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation. Neural Computation<br />

16 (2004) 1827–1850<br />

9. Theis, F.: Multidimensional <strong>in</strong>dependent component analysis us<strong>in</strong>g characteristic functions.<br />

In: Proc. EUSIPCO 2005, Antalya, Turkey (2005)




Chapter 17<br />

IEEE SPL 13(2):96-99, 2006<br />

Paper F.J. Theis, C.G. Puntonet, and E.W. Lang. Median-based cluster<strong>in</strong>g for<br />

underdeterm<strong>in</strong>ed bl<strong>in</strong>d signal process<strong>in</strong>g. IEEE Signal Process<strong>in</strong>g Letters,<br />

13(2):96-99, 2006<br />

Reference (Theis et al., 2006)<br />

Summary <strong>in</strong> section 1.5.3<br />




Median-based cluster<strong>in</strong>g for underdeterm<strong>in</strong>ed bl<strong>in</strong>d<br />

signal process<strong>in</strong>g<br />

Fabian J. Theis, Member, IEEE, Carlos G. Puntonet, Member, IEEE, Elmar W. Lang<br />

Abstract— In underdetermined blind source separation, more sources are to be extracted from fewer observed mixtures without knowing either the sources or the mixing matrix. k-means-style clustering algorithms are commonly used to do this algorithmically given sufficiently sparse sources, but in any case other than deterministic sources this lacks theoretical justification. After establishing that mean-based algorithms converge to wrong solutions in practice, we propose a median-based clustering scheme. Theoretical justification as well as algorithmic realizations (both online and batch) are given and illustrated by some examples.


EDICS Category: SAS-ICAB<br />

BLIND source separation (BSS), ma<strong>in</strong>ly based on the assumption<br />

of <strong>in</strong>dependent sources, is currently the topic of<br />

many researchers [1], [2]. Given an observed m-dimensional<br />

mixture random vector x, which allows an unknown decomposition<br />

x = As, the goal is to identify the mix<strong>in</strong>g matrix<br />

A and the unknown n-dimensional source random vector s.<br />

Commonly, first A is identified, and only then are the sources<br />

recovered. We will therefore denote the former task by bl<strong>in</strong>d<br />

mix<strong>in</strong>g model recovery (BMMR), and the latter (with known<br />

A) by bl<strong>in</strong>d source recovery (BSR).<br />

In the difficult case of underdeterm<strong>in</strong>ed or overcomplete<br />

BSS, where fewer mixtures than sources are observed (m < n),

BSR is non-trivial, see section II. However, our ma<strong>in</strong> focus<br />

lies on the usually more elaborate matrix recovery. Assum<strong>in</strong>g<br />

statistically <strong>in</strong>dependent sources with exist<strong>in</strong>g variance and<br />

at most one Gaussian component, it is well-known that A<br />

is determ<strong>in</strong>ed uniquely by x [3]. However, how to do this<br />

algorithmically is far from obvious, and although quite a few<br />

algorithms have been proposed recently [4]–[6], performance<br />

is yet limited. The most commonly used overcomplete algorithms<br />

rely on sparse sources (after possible sparsification by<br />

preprocess<strong>in</strong>g), which can be identified by cluster<strong>in</strong>g, usually<br />

by k-means or some extension [5], [6]. But apart from the fact<br />

that theoretical justifications have not been found, mean-based<br />

cluster<strong>in</strong>g only identifies the correct A if the data density<br />

approaches a delta distribution. In figure 1, we illustrate the<br />

deficiency of mean-based cluster<strong>in</strong>g; we get an error of up to<br />

5° per mixing angle, which is rather substantial considering the sparse density and the simple, complete case of m = n = 2. Moreover, the figure indicates that median-based clustering performs much better. Indeed, mean-based clustering does not possess any equivariance property (performance independent of A). In the following we propose a novel median-based clustering method and prove its equivariance (lemma 1.2) and convergence. For brevity, the proofs are given for the case of arbitrary n but m = 2, although they can be readily extended to higher sensor signal dimensions. Corresponding algorithms are proposed and experimentally validated.

Fig. 1. Mean- versus median-based clustering. We consider the mixture x of two independent gamma-distributed sources (γ = 0.5, 10^5 samples) using a mixing matrix A with columns inclined by α and (π/2 − α) respectively. (a) Circle histogram for α = 0.4: the mixture density after projection onto the circle. (b) Comparison of mean and median: for α ∈ [0, π/4), the error when estimating A by the mean and by the median of the projected density in the receptive field F(α) = (−π/4, π/4) of the known column a1 of A. The former is the k-means convergence criterion.

Manuscript received xxx; revised xxx. Some preliminary results were reported at the conferences ESANN 2002, SIP 2002 and ICA 2003.

FT and EL are with the Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany (phone: +49 941 943 2924, fax: +49 941 943 2479, e-mail: fabian@theis.name), and CP is with the Dep. Arquitectura y Tecnología de Computadores, Universidad de Granada, 18071 Granada, Spain.

I. GEOMETRIC MATRIX RECOVERY<br />

Without loss of generality we assume that A has pairwise<br />

l<strong>in</strong>early <strong>in</strong>dependent columns, and m ≤ n. BMMR tries to<br />

identify A <strong>in</strong> x = As given x, where s is assumed to be<br />

statistically <strong>in</strong>dependent. Obviously, this can only be done up<br />

to equivalence [3], where B is said to be equivalent to A,<br />

B ∼ A, if B can be written as B = APL with an <strong>in</strong>vertible<br />

diagonal matrix L (scal<strong>in</strong>g matrix) and an <strong>in</strong>vertible matrix P<br />

with unit vectors <strong>in</strong> each row (permutation matrix). Hence we<br />

may assume the columns ai of A to have unit norm.<br />

For geometric matrix-recovery, we use a generalization [7]<br />

of the geometric ICA algorithm [8]. Let s be an <strong>in</strong>dependent<br />

n-dimensional, Lebesgue-cont<strong>in</strong>uous, random vector with<br />

density ps describ<strong>in</strong>g the sources. As s is <strong>in</strong>dependent, ps<br />

factorizes <strong>in</strong>to ps(s1, . . . , sn) = ps1(s1) · · · psn(sn) with the<br />

one-dimensional marg<strong>in</strong>al source density functions psi. We<br />

assume symmetric sources, i.e. psi(s) = psi(−s) for s ∈ R<br />

and i ∈ [1 : n] := {1, . . .,n}, <strong>in</strong> particular E(s) = 0.<br />

The geometric BMMR algorithm for symmetric distributions<br />

goes as follows [7]: Pick 2n start<strong>in</strong>g vectors<br />

w1,w ′ 1, . . . ,wn,w ′ n on the unit sphere Sm−1 ⊂ Rm such<br />

that the wi are pairwise l<strong>in</strong>early <strong>in</strong>dependent and wi = −w ′ i .<br />

Often, these wi are called neurons because they resemble the<br />

neurons used <strong>in</strong> cluster<strong>in</strong>g algorithms and <strong>in</strong> Kohonen’s selforganiz<strong>in</strong>g<br />

maps. Furthermore fix a learn<strong>in</strong>g rate η : N → R.<br />




The usual hypothesis in competitive learning is η(t) > 0, Σ_{t∈N} η(t) = ∞ and Σ_{t∈N} η(t)^2 < ∞. Then iterate the following steps until an appropriate abort condition has been met: Choose a sample x(t) ∈ R^m according to the distribution of x. If x(t) = 0 pick a new one; note that this case happens with probability zero. Project x(t) onto the unit sphere and get y(t) := π(x(t)), where π(x) := x/|x| ∈ S^{m−1}. Let i(t) ∈ [1 : n] such that w_{i(t)}(t) or w′_{i(t)}(t) is the neuron closest to y(t). Then set w_{i(t)}(t+1) := π(w_{i(t)}(t) + η(t) π(σ y(t) − w_{i(t)}(t))) and w′_{i(t)}(t+1) := −w_{i(t)}(t+1), where σ := 1 if w_{i(t)}(t) is closest to y(t), and σ := −1 otherwise. All other neurons are not moved in this iteration. This update rule equals online k-means on S^{n−1} except for the fact that the winner neuron is not moved proportionally to the sample but only in its direction, due to the normalization. We will see that instead of finding the mean in the receptive field (as in k-means), the algorithm searches for the corresponding median.

A. Model verification<br />

In this section we first calculate the densities of the random<br />

variables of our cluster<strong>in</strong>g problem. Then we will prove an<br />

asymptotic convergence result. For the theoretical analysis, we<br />

will restrict ourselves to m = 2 mixtures for simplicity. As<br />

above, let x denote the sensor signal vector and A the mix<strong>in</strong>g<br />

matrix such that x = As. We may assume that A has columns<br />

ai = (cosαi, s<strong>in</strong>αi) ⊤ with 0 ≤ α1 < . . . < αn < π.<br />

1) Neural update rule on the sphere: Due to the symmetry of s we can identify the two neurons w_i and w′_i. For this let ι(ϕ) := (ϕ mod π) ∈ [0, π) identify all angles modulo π, and set θ(v) := ι(arctan(v_2/v_1)); then θ(w_i) = θ(w′_i) and θ(a_i) = α_i. We are interested in the essentially one-dimensional projected sensor signal random vector π(x), so using θ we may approximate y := θ(π(x)) ∈ [0, π), measuring the argument of x. Note that the density p_y of y can be calculated from p_x by p_y(ϕ) = ∫_{−∞}^{∞} p_x(r cos ϕ, r sin ϕ) r dr.

Now let the (n × n)-matrix B be defined by

B := [A ; 0  I_{n−2}],

where the blocks are stacked vertically and I_{n−2} is the (n−2)-dimensional identity matrix; so B is invertible. The random vector Bs has the density p_{Bs} = |det B|^{−1} p_s ∘ B^{−1}. A equals B followed by the projection from R^n onto the first two coordinates, hence

p_y(ϕ) = (2/|det B|) ∫_0^∞ r dr ∫_{R^{n−2}} dx p_s(B^{−1}(r cos ϕ, r sin ϕ, x)^⊤)   (1)

for any ϕ ∈ [0, π), where we have used the symmetry of p_s.

The geometric learning algorithm induces the following n-dimensional Markov chain ω(t), defined recursively by a starting point ω(0) ∈ R^n and the iteration rule ω(t+1) = ι^n(ω(t) + η(t) ζ(y(t)e − ω(t))), where e = (1, . . . , 1)^⊤ ∈ R^n and ζ(x_1, . . . , x_n) ∈ R^n is such that

ζ_i(x_1, . . . , x_n) = sgn(x_i) if |x_i| ≤ |x_j| for all j, and 0 otherwise,

and y(0), y(1), . . . is a sequence of i.i.d. random variables with density p_y representing the samples in each online iteration. Note that the 'modulo π' map ι is only needed to guarantee that each component of ω(t+1) lies in [0, π).

Furthermore, we can assume that after enough iterations<br />

there is one po<strong>in</strong>t v ∈ S 1 that will not be traversed any<br />

more, and without loss of generality we assume θ(v) = 0<br />

so that the above algorithm simplifies to the planar case with<br />

the recursion rule ω(t + 1) = ω(t) + η(t)ζ(y(t)e − ω(t)).<br />

This is k-means-type learn<strong>in</strong>g with an additional sign function.<br />

Without the sign function and the additional condition that<br />

py is log-concave, it has been shown [9] that the process<br />

ω(t) converges to a unique constant process ω(∞) ∈ R n<br />

such that ωi(∞) = E(py|[β(i), β ′ (i)]), where F(ωi(∞)) :=<br />

{ϕ ∈ [0, π) | ι(|ϕ − ωi(∞)|) ≤ ι(|ϕ − ωj(∞)|) for all j ≠ i}

denotes the receptive field of the neuron ωi(∞) and β(i), β ′ (i)<br />

designate the receptive field borders. This is precisely the kmeans<br />

convergence condition illustrated <strong>in</strong> figure 1.<br />

2) Limit po<strong>in</strong>ts analysis: We now want to study the limit<br />

po<strong>in</strong>ts of geometric matrix-recovery, so we assume that the<br />

algorithm has already converged. The idea, generaliz<strong>in</strong>g our<br />

analysis <strong>in</strong> the complete case [7] then is to formulate a<br />

condition which the limit po<strong>in</strong>ts will have to satisfy and to<br />

show that the BMMR solutions are among them.<br />

The angles γ1, . . .,γn ∈ [0, π) are said to satisfy the geometric<br />

convergence condition (GCC) if they are the medians<br />

of y restricted to their receptive fields i.e. if γi is the median<br />

of py|F(γi). Moreover, a constant random vector ˆω ∈ R n is<br />

called fixed po<strong>in</strong>t if E(ζ(ye− ˆω)) = 0. Hence, the expectation<br />

of a Markov process ω(t) start<strong>in</strong>g at a fixed po<strong>in</strong>t will<br />

<strong>in</strong>deed not be changed by the geometric update rule because<br />

E(ω(t+1)) = E(ω(t))+η(t)E(ζ(y(t)e−ω(t))) = E(ω(t)).<br />

Lemma 1.1: Assume that the geometric algorithm converges<br />

to a constant random vector ω(∞). Then ω(∞) is a<br />

fixed po<strong>in</strong>t if and only if the ωi(∞) satisfy the GCC.<br />

Proof: Assume ω(∞) is a fixed point of geometric ICA in the expectation. Without loss of generality, let [β, β′] be the receptive field of ω_i(∞) such that β, β′ ∈ [0, π). Since ω(∞) is a fixed point of geometric ICA in the expectation, E(χ_{[β,β′]}(y(t)) sgn(y(t) − ω_i(∞))) = 0, where χ_{[β,β′]} denotes the characteristic function of that interval. But this means ∫_β^{ω_i(∞)} (−1) p_y(ϕ) dϕ + ∫_{ω_i(∞)}^{β′} p_y(ϕ) dϕ = 0, so ω_i(∞) satisfies the GCC. The other direction follows similarly.

Lemma 1.2: The angles α_i = θ(a_i) satisfy the GCC.

Proof: It is enough to show that α_1 satisfies the GCC. Let β := (α_1 + α_n − π)/2 and β′ := (α_1 + α_2)/2. Then the receptive field of α_1 can be written (modulo π) as F(α_1) = [β, β′]. Therefore, we have to show that α_1 is the median of p_y|F(α_1), which means ∫_β^{α_1} p_y(ϕ) dϕ = ∫_{α_1}^{β′} p_y(ϕ) dϕ. Using equation (1), the left hand side can be expanded as ∫_β^{α_1} p_y(ϕ) dϕ = 2|det B|^{−1} ∫_{K′} dx p_s(B^{−1}x), where K := θ^{−1}[β, α_1] denotes the cone of opening angle α_1 − β starting from angle β, and K′ = K × R^{n−2}. This implies ∫_β^{α_1} p_y(ϕ) dϕ = 2 ∫_{B^{−1}(K′)} ds p_s(s). Now note that the transformed extended cone B^{−1}(K′) is a cone ending at the s_1-axis of opening angle π/4, times R^{n−2}, because A is linear. Hence ∫_β^{α_1} p_y(ϕ) dϕ = 2 ∫_0^∞ ds_1 ∫_0^{s_1} ds_2 ∫_{R^{n−2}} ds_3 . . . ds_n p_s(s) = ∫_{α_1}^{β′} p_y(ϕ) dϕ, where we have used the same calculation for [α_1, β′] as for [β, α_1] at the last step.



[Figure 2: angular histogram (0° to 180°) with peaks annotated at 2°, 74°, 104° and 135°.]

Fig. 2. Approximated probability density function of a mixture of four speech signals using the mixing angles α_i = 2°, 74°, 104°, 135°. Plotted is the approximated density using a histogram with 180 bins; the thick line indicates the density after smoothing with a 5 degree radius polynomial kernel.

In the proof we show that the median condition is equivariant, meaning that it does not depend on A. Hence any algorithm based on such a condition is equivariant, as confirmed by figure 1. Set ξ(ω) := (cos ω, sin ω)^⊤; then θ(ξ(ω)) = ω. Combining both lemmata, we have therefore shown:

Theorem 1.3: The set Φ of fixed po<strong>in</strong>ts of geometric<br />

matrix-recovery conta<strong>in</strong>s an element (ˆω1, . . . , ˆωn) such that<br />

the matrix (ξ(ˆω1)...ξ(ˆωn)) solves the overcomplete BMMR.<br />

The stable fixed po<strong>in</strong>ts <strong>in</strong> the above set Φ can be found by<br />

the geometric matrix-recovery algorithm. Furthermore, experiments<br />

confirm that <strong>in</strong> the special case of unimodal, symmetric<br />

and non-Gaussian signals, the set Φ consists of only two<br />

elements: a stable and an unstable fixed po<strong>in</strong>t, where the stable<br />

fixed po<strong>in</strong>t will be found by the algorithm. Then, depend<strong>in</strong>g<br />

on the kurtosis of the sources, either the stable (supergaussian<br />

case) or the unstable (subgaussian case) fixed point represents

the image of the unit vectors. We have partially shown this <strong>in</strong><br />

the complete case, see [7], theorem 5:<br />

Theorem 1.4: If n = 2 and ps1 = ps2, then Φ conta<strong>in</strong>s only<br />

two elements given that py|[0, π) has exactly 4 local extrema.<br />

More elaborate studies of Φ are necessary to show full<br />

convergence, however the mathematics can be expected to be<br />

difficult. This can already be seen from the complicated and<br />

<strong>in</strong> higher dimensions yet unknown proofs of convergence of<br />

the related self-organiz<strong>in</strong>g-map algorithm [9].<br />

B. Turn<strong>in</strong>g the onl<strong>in</strong>e algorithm <strong>in</strong>to a batch algorithm<br />

The above theory can be used to derive a batch-type<br />

learn<strong>in</strong>g algorithm, by test<strong>in</strong>g all different receptive<br />

fields for the overcomplete GCC after histogrambased<br />

estimation of y. For simplicity let us assume<br />

that the cumulative distribution Py of y is <strong>in</strong>vertible.<br />

For ϕ = (ϕ_1, . . . , ϕ_{n−1}), define a function µ(ϕ) := ((γ_1(ϕ) + γ_2(ϕ))/2 − ϕ_1, . . . , (γ_n(ϕ) + γ_1(ϕ))/2 − ϕ_n), where γ_i(ϕ) := P_y^{−1}((P_y(ϕ_i) + P_y(ϕ_{i+1}))/2) is the median of y|[ϕ_i, ϕ_{i+1}] in [ϕ_i, ϕ_{i+1}] for i ∈ [1 : n], and ϕ_n := (ϕ_{n−1} + ϕ_1 + π)/2, ϕ_{n+1} := ϕ_1.

Lemma 1.5: If µ(ϕ) = 0 then the γ_i(ϕ) satisfy the GCC.

Proof: By definition, the receptive field of γ_i(ϕ) is given by [(γ_{i−1}(ϕ) + γ_i(ϕ))/2, (γ_i(ϕ) + γ_{i+1}(ϕ))/2]. Since µ(ϕ) = 0 implies (γ_i(ϕ) + γ_{i+1}(ϕ))/2 = ϕ_i, the receptive field of γ_i(ϕ) is precisely [ϕ_i, ϕ_{i+1}], and by construction γ_i(ϕ) is the median of y restricted to the above interval.

Algorithm (overcomplete FastGeo): Find the zeros of µ(ϕ). Algorithmically, we may simply estimate y using a histogram and search for the zeros exhaustively or by discrete gradient descent. Note that for m = 2 this is precisely the complete FastGeo algorithm [7]. Again, µ always has at least two zeros, representing the stable and the unstable fixed point of the neural algorithm, so for supergaussian sources we extract the stable fixed point by maximizing Π_{i=1}^{n} p_y(γ_i(ϕ_i)).

Similar to the complete case, histogram-based density approximation<br />

results <strong>in</strong> a quite ‘ragged’ distribution. Hence,<br />

zeros of µ are split up <strong>in</strong>to multiple close-by zeros. This<br />

can be improved by smooth<strong>in</strong>g the distribution us<strong>in</strong>g a kernel<br />

with sufficiently small radius. Too large kernel radii result <strong>in</strong><br />

lower accuracy because the calculation of the median is only<br />

roughly <strong>in</strong>dependent of the kernel radius. In figure 2 we use a<br />

polynomial kernel of radius 5 degree (zero everywhere else);<br />

one can see that <strong>in</strong>deed this smoothes the distribution nicely.<br />
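As a complement, the following batch sketch implements the median-in-receptive-field idea directly as a k-medians-style iteration on the projected angles; it is a simplification of the procedure above (no explicit zero search of µ and no density smoothing), and the initialisation, iteration count and modulo-π unwrapping are illustrative choices.

import numpy as np

def median_cluster_angles(X, n, n_iter=50, seed=0):
    """Batch sketch of median-based matrix recovery for m = 2 mixtures:
    project samples onto angles modulo pi, assign each angle to the
    nearest current estimate (its receptive field), and replace every
    estimate by the median of its field, i.e. the GCC condition."""
    rng = np.random.default_rng(seed)
    theta = np.arctan2(X[1], X[0]) % np.pi           # y = theta(pi(x))
    est = np.sort(rng.uniform(0, np.pi, n))          # initial angle estimates
    for _ in range(n_iter):
        # circular (mod pi) distance of every sample to every estimate
        d = np.abs(theta[:, None] - est[None, :])
        d = np.minimum(d, np.pi - d)
        labels = d.argmin(axis=1)
        for i in range(n):
            members = theta[labels == i]
            if members.size:
                # median within the receptive field; shift so the field
                # is not split at the 0 / pi boundary
                shifted = (members - est[i] + np.pi / 2) % np.pi
                est[i] = (np.median(shifted) + est[i] - np.pi / 2) % np.pi
    A_hat = np.vstack([np.cos(est), np.sin(est)])    # estimated mixing columns
    return est, A_hat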

C. Clustering in higher mixture dimensions

We now extend any BSS algorithm working in lower mixing dimension m′ < m to dimension m using the simple idea of projecting the mixtures x onto different subspaces and then estimating A from the recovered projected matrices. We eliminate scaling indeterminacies by normalization and permutation indeterminacies by comparing the source correlation matrices.

1) Equivalence after projections: Let m′ ∈ N with 1 < m′ < m, and let M denote the set of all subsets of [1 : m] of size m′. For an element τ ∈ M let τ = {τ1, . . . , τm′} such that 1 ≤ τ1 < . . . < τm′ ≤ m, and let πτ denote the ordered projection from R^m onto those coordinates. Consider the projected mixing matrix A^τ := πτ A. We will study how scaling-equivalence behaves under projection, where two matrices A, B are said to be scaling-equivalent, A ∼s B, if there exists an invertible diagonal matrix L with A = BL.

Lemma 1.6: Let τ^1, . . . , τ^k ∈ M such that ∪_i τ^i = [1 : m] and j ∈ ∩_i τ^i. If A^{τ^i} ∼s B^{τ^i} for all i and a_{jl} ≠ 0 for all l, then A ∼s B.

Proof: Fix a column l ∈ [1 : n], and let a := a_l, b := b_l. By assumption, for each i ∈ [1 : k] there exists λ_i ≠ 0 such that b^{τ^i} = λ_i a^{τ^i}. Index j occurs in all projections, so b_j = λ_i a_j for all i. Hence all λ_i =: λ coincide and b = λa.

This lemma gives the general idea how to combine matrices; now we will construct specific projections. Assume that the first row of A does not contain any zeros. This is a very mild assumption because A was assumed to be full-rank, and the set of A's with first row without zeros is dense in the set of full-rank matrices. As usual let ⌈λ⌉ denote the smallest integer larger than or equal to λ ∈ R. Then let k := ⌈(m−1)/(m′−1)⌉, and define τ^i := {1, 2+(m′−1)(i−1), . . . , 2+(m′−1)i−1} for i < k and τ^k := {1, m−m′+2, . . . , m}. Then ∪_i τ^i = [1 : m] and 1 ∈ ∩_i τ^i. Given (m′ × n)-matrices B^1, . . . , B^k with entries B^i = (b^i_{r,j}), define A_{B^1,...,B^k} to be composed of the columns (j ∈ [1 : n])

a_j := (1, b^1_{2,j}/b^1_{1,j}, . . . , b^1_{m′,j}/b^1_{1,j}, b^2_{2,j}/b^2_{1,j}, . . . , b^{k−1}_{m′,j}/b^{k−1}_{1,j}, b^k_{3+k(m′−1)−m,j}/b^k_{1,j}, . . . , b^k_{m′,j}/b^k_{1,j})^⊤.



Lemma 1.7: Let B^1, . . . , B^k be (m′ × n)-matrices such that A^{τ^i} ∼s B^i for i ∈ [1 : k]. Then A_{B^1,...,B^k} ∼s A.

Proof: By assumption there exist λ^i_l ∈ R \ {0} such that b^i_{j,l} = λ^i_l (A^{τ^i})_{j,l} for all i ∈ [1 : k], j ∈ [1 : m′] and l ∈ [1 : n], hence b^i_{j,l}/b^i_{1,l} = (A^{τ^i})_{j,l}/(A^{τ^i})_{1,l}. One can check that due to the choice of the τ^i's we then have (A_{B^1,...,B^k})_{j,l} = a_{j,l}/a_{1,l} for all j ∈ [1 : m] and therefore A_{B^1,...,B^k} ∼ A.

2) Reduction algorithm: The dimension reduction algorithm now is very simple. Pick k and τ^1, . . . , τ^k as in the previous section. Perform overcomplete BMMR with the projected mixtures π_{τ^i}(x) for i ∈ [1 : k] and get estimated mixing matrices B^i. If this recovery has been carried out without any error, then every B^i is equivalent to A^{τ^i}. Due to permutations, they might however not be scaling-equivalent. Therefore do the following iteratively for each i ∈ [1 : k]: Apply the overcomplete source-recovery, see next section, to π_{τ^i}(x) using B^i and get recovered sources s^i. For all j < i, consider the absolute crosscorrelation matrices (|Cor(s^i_r, s^j_s)|)_{r,s}. The row positions of the maxima of this matrix are pairwise different because the original sources were chosen to be independent. Thereby we get a permutation matrix P^i indicating how to permute B^i, C^i := B^i P^i, so that the new source correlation matrices are diagonal. Finally, we have constructed matrices C^i such that there exists a permutation P independent of i with C^i ∼s A^{τ^i} P for all i ∈ [1 : k]. Now we can apply lemma 1.7 and get a matrix A_{C^1,...,C^k} with A_{C^1,...,C^k} ∼s AP and therefore A_{C^1,...,C^k} ∼ A as desired.
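A minimal sketch of the matrix-combination step of lemma 1.7, assuming the per-projection estimates B^i have already been permutation- and sign-aligned as described above (function names are ours, not from the paper):

```python
import numpy as np

def projection_index_sets(m, mp):
    """tau^1,...,tau^k as in the text: k = ceil((m-1)/(m'-1)); each set contains index 1."""
    k = int(np.ceil((m - 1) / (mp - 1)))
    taus = []
    for i in range(1, k):
        start = 2 + (mp - 1) * (i - 1)
        taus.append([1] + list(range(start, start + mp - 1)))
    taus.append([1] + list(range(m - mp + 2, m + 1)))   # tau^k
    return taus                                          # 1-based coordinate indices

def combine_projected_matrices(Bs, taus, m):
    """Assemble A_{B^1,...,B^k}: entry (r, j) becomes a_{r,j} / a_{1,j}."""
    n = Bs[0].shape[1]
    A = np.full((m, n), np.nan)
    A[0, :] = 1.0                       # first row normalized to one
    for B, tau in zip(Bs, taus):
        ratios = B / B[0:1, :]          # divide by the row shared by all projections
        for row, r in enumerate(tau):
            A[r - 1, :] = ratios[row, :]
    return A
```

For the (3 x 3) example below, projection_index_sets(3, 2) returns [[1, 2], [1, 3]], i.e. the projections π{1,2} and π{1,3} used in the experiments.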

II. BLIND SOURCE RECOVERY

Using the results from the BMMR step, we can assume that an estimate of A has been found. In order to solve the overcomplete BSS problem, we are therefore left with the task of reconstructing the sources using the mixtures x and the estimated matrix (BSR). Since A has full rank, the equation x(t) = As(t) yields the (n − m)-dimensional affine vector space A^{−1}{x(t)} as solution space for s(t). Hence, if n > m the source-recovery problem is ill-posed without further assumptions. Using a maximum likelihood approach [4], [5] an appropriate assumption can be derived:

Given a prior probability p^0_s on the sources, it can be seen quickly [4], [10] that the most likely source sample is recovered by s = argmax_{x=As} p^0_s. Depending on the assumptions on the prior p^0_s of s, we get different optimization criteria. In the experiments we will assume a simple prior p^0_s ∝ exp(−|s|_p) with any p-norm |·|_p. Then s = argmin_{x=As} |s|_p, which can be solved linearly in the Gaussian case p = 2 and by linear programming or a shortest-path decomposition in the sparse, Laplacian case p = 1, see [5], [10].
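The two recovery rules can be sketched in a few lines; this is a generic illustration of the two priors, not the authors' reference code:

```python
import numpy as np
from scipy.optimize import linprog

def recover_sources_l2(A, x):
    """Gaussian prior (p = 2): minimum-norm solution of As = x via the pseudoinverse."""
    return np.linalg.pinv(A) @ x

def recover_sources_l1(A, x):
    """Laplacian prior (p = 1): min ||s||_1 s.t. As = x, as a linear program
    with the standard split s = u - v, u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
```

Applied column-wise over t this recovers s(t); as noted in the experiments, one would typically sparsify the signals first (for instance in the Fourier domain) before applying the p = 1 recovery.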

III. EXPERIMENTAL RESULTS

In order to compare the mixture matrix A with the recovered matrix B from the BMMR step, we calculate the generalized crosstalking error E(A,B) of A and B defined by E(A,B) := min_{M∈Π} ‖A − BM‖, where the minimum is taken over the group Π of all invertible matrices having only one non-zero entry per column and ‖·‖ denotes some matrix norm. It vanishes if and only if A and B are equivalent [10].
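Since the matrices M ∈ Π are exactly the scaled permutations, E(A,B) can be evaluated for small n by brute force over column permutations, with the optimal scaling per column obtained from a one-dimensional least-squares fit. The sketch below uses the Frobenius norm as the matrix norm; it is only an illustration of the definition, not the implementation used in the paper.

```python
import numpy as np
from itertools import permutations

def crosstalking_error(A, B):
    """Generalized crosstalking error min_{M in Pi} ||A - B M||_F by brute force
    over column permutations; feasible only for small n."""
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        D = np.empty_like(A)
        for j, pj in enumerate(perm):
            b = B[:, pj]
            lam = (b @ A[:, j]) / (b @ b)   # optimal scale of column b onto a_j
            D[:, j] = A[:, j] - lam * b
        best = min(best, np.linalg.norm(D))
    return best
```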

TABLE I
Performance of BMMR algorithms (n = 3, m = 2, 100 runs)

  algorithm                              mean E(A, Â)   deviation σ
  FastGeo (kernel r = 5, approx. 0.1)        0.60           0.60
  FastGeo (kernel r = 0, approx. 0.5)        0.40           0.46
  FastGeo (kernel r = 5, approx. 0.5)        0.29           0.42
  Soft-LOST (p = 0.01)                       0.68           0.57

The overcomplete FastGeo algorithm is applied to 4 speech signals s, mixed by a (2 × 4)-mixing matrix A with coefficients uniformly drawn from [−1, 1]; see figure 2 for their mixture density. The algorithm estimates the matrix well with E(A, Â) = 0.68, and BSR by 1-norm minimization yields recovered sources with a mean SNR of only 2.6 dB when compared with the original sources; as noted before [5], [10], without sparsification, for instance by FFT, source recovery is difficult. To analyze the overcomplete FastGeo algorithm more generally, we perform 100 Monte-Carlo runs using high-kurtotic gamma-distributed three-dimensional sources with 10^4 samples, mixed by a (2 × 3)-mixing matrix with weights uniformly chosen from [−1, 1]. In table I, the mean of the performance index depending on various parameters is presented. Noting that the mean error when using random (2 × 3)-matrices with coefficients uniformly taken from [−1, 1] is E = 1.9 ± 0.73, we observe good performance, especially for a larger kernel radius and higher approximation parameter (E = 0.29), also compared with Soft-LOST's E = 0.68 [6].

As an example in higher mixture dimension, three speech signals are mixed by a column-normalized (3 × 3) mixing matrix A. For n = m = 3, m′ = 2, the projection framework simplifies to k = 2 with projections π{1,2} and π{1,3}. Overcomplete geometric ICA is performed with 5·10^4 sweeps. The recoveries of the projected matrices π{1,2}A and π{1,3}A are quite good with E(π{1,2}A, B^1) = 0.084 and E(π{1,3}A, B^2) = 0.10. Taking out the permutations as described before, we get a recovered mixing matrix A_{B^1,B^2} with a low generalized crosstalking error of E(A, A_{B^1,B^2}) = 0.15 (compared with a mean random error of E = 3.2 ± 0.7).

REFERENCES

[1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.
[2] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. John Wiley & Sons, 2002.
[3] J. Eriksson and V. Koivunen, "Identifiability, separability and uniqueness of linear ICA models," IEEE Signal Processing Letters, vol. 11, no. 7, pp. 601–604, 2004.
[4] T. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 87–90, 1999.
[5] P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representations," Signal Processing, vol. 81, pp. 2353–2362, 2001.
[6] P. O'Grady and B. Pearlmutter, "Soft-LOST: EM on a mixture of oriented lines," in Proc. ICA 2004, ser. Lecture Notes in Computer Science, vol. 3195, Granada, Spain, 2004, pp. 430–436.
[7] F. Theis, A. Jung, C. Puntonet, and E. Lang, "Linear geometric ICA: Fundamentals and algorithms," Neural Computation, vol. 15, pp. 419–439, 2003.
[8] C. Puntonet and A. Prieto, "An adaptive geometrical procedure for blind separation of sources," Neural Processing Letters, vol. 2, 1995.
[9] M. Benaim, J.-C. Fort, and G. Pagès, "Convergence of the one-dimensional Kohonen algorithm," Adv. Appl. Prob., vol. 30, pp. 850–869, 1998.
[10] F. Theis, E. Lang, and C. Puntonet, "A geometric algorithm for overcomplete linear ICA," Neurocomputing, vol. 56, pp. 381–398, 2004.




Chapter 18

Proc. EUSIPCO 2006

Paper: P. Gruber and F.J. Theis. Grassmann clustering. In Proc. EUSIPCO 2006, Florence, Italy, 2006

Reference: (Gruber and Theis, 2006)

Summary in section 1.5.3



GRASSMANN CLUSTERING

Peter Gruber and Fabian J. Theis

Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
phone: +49 941 943 2924, fax: +49 941 943 2479
email: fabian@theis.name, web: http://fabian.theis.name

ABSTRACT

An important tool in high-dimensional, explorative data mining is given by clustering methods. They aim at identifying samples or regions of similar characteristics, and often code them by a single codebook vector or centroid. One of the most commonly used partitional clustering techniques is the k-means algorithm, which in its batch form partitions the data set into k disjoint clusters by simply iterating between cluster assignments and cluster updates. The latter step implies calculating a new centroid within each cluster. We generalize the concept of k-means by applying it not to the standard Euclidean space but to the manifold of subvectorspaces of a fixed dimension, also known as the Grassmann manifold. Important examples include projective space, i.e. the manifold of lines, and the space of all hyperplanes. Detecting clusters in multiple samples drawn from a Grassmannian is a problem arising in various applications. In this manuscript, we provide corresponding metrics for a Grassmann k-means algorithm, and solve the centroid calculation problem explicitly in closed form. An application to nonnegative matrix factorization illustrates the feasibility of the proposed algorithm.

1. PARTITIONAL CLUSTERING

Many algorithms for clustering, i.e. the detection of common features within a data set, are discussed in the literature. In the following, we will study clustering within the framework of k-means [2].

In general, its goal can be described as follows: Given a set A of points in some metric space (M,d), find a partition of A into disjoint non-empty subsets Bi, ∪_i Bi = A, together with centroids ci ∈ M so as to minimize the sum of the squares of the distances of each point of A to the centroid ci of the cluster Bi containing it. In other words, minimize

E(B1,c1,...,Bk,ck) := ∑_{i=1}^k ∑_{a∈Bi} d(a,ci)².   (1)

If the set A contains only finitely many elements a1,...,aT, then this can easily be re-formulated as a constrained non-linear optimization problem: minimize

E(W,C) := ∑_{i=1}^k ∑_{t=1}^T w_it d(a_t,ci)²   (2)

subject to

w_it ∈ {0,1},  ∑_{i=1}^k w_it = 1  for 1 ≤ i ≤ k, 1 ≤ t ≤ T.   (3)

Here C := {c1,...,ck} are the centroid locations, and W := (w_it) is the partition matrix corresponding to the partition Bi of A.

A common approach to minimizing (2) subject to (3) is partial optimization for W and C, i.e. alternating minimization of either W or C while keeping the other one fixed. The batch k-means algorithm employs precisely this strategy: After an initial, random choice of centroids c1,...,ck, it iterates between the following two steps until convergence, measured by a suitable stopping criterion:

• cluster assignment: for each a_t determine an index i(t) such that

  i(t) = argmin_i d(a_t, ci)   (4)

• cluster update: within each cluster Bi := {a_t | i(t) = i} determine the centroid ci by minimizing

  ci := argmin_c ∑_{a∈Bi} d(a,c)²   (5)

The cluster assignment step corresponds to minimizing (2) for fixed C, which means choosing the partition W such that each element of A is assigned to the i-th cluster if ci is the closest centroid. In the cluster update step, (2) is minimized for fixed partition W, implying that ci is constructed as centroid within the i-th cluster; this indeed corresponds to minimizing E(W,C) for fixed W because in this case the cost function is a sum of functions depending on different parameters, so we can minimize them separately, leading to the centroid equation (5). This general update rule converges to a local minimum under rather weak conditions [3, 7].

An important special case is given by M := R^n and the Euclidean distance d(x,y) := ‖x − y‖. The centroids from equation (5) can then be calculated in closed form, and each centroid is simply given by the cluster mean ci := (1/|Bi|) ∑_{a∈Bi} a; this follows directly from

∑_{a∈Bi} ‖a − ci‖² = ∑_{a∈Bi} ∑_{j=1}^n (a_j − c_ij)² = ∑_{j=1}^n ∑_{a∈Bi} (a_j² − 2 a_j c_ij + c_ij²),

which can be minimized separately for each coordinate j and is minimal with respect to c_ij if the derivative of the quadratic function is zero, so if |Bi| c_ij = ∑_{a∈Bi} a_j.

In the following, we are interested in more complex metric spaces. Typically, k-means can be implemented efficiently if the cluster centroids can be calculated quickly. In the example of R^n, we saw that it was crucial to minimize the squared distances and to use the Euclidean distance. Hence we will study metrics which also allow a closed-form centroid solution.
centroid solution.



The data space of interest will consist of subspaces of R^n, and the goal is to find subspace clusters. We will only be dealing with sub-vector-spaces; extensions to the affine case are discussed in section 3.3.

A somewhat related method is the so-called k-plane clustering algorithm [4], which does not cluster subspaces but solves the problem of fitting hyperplanes in R^n to a given point set A ⊂ R^n. A hyperplane H ⊂ R^n can be described by H = {x | c^⊤x = 0} = c^⊥ for some normal vector c, typically chosen such that ‖c‖ = 1. Bradley and Mangasarian [4] essentially choose the pseudo-metric d(a,b) := |a^⊤b| on the sphere S^{n−1} := {x ∈ R^n | ‖x‖ = 1} (the data can be assumed to lie on the sphere after normalization, which does not change cluster containment). They show that the centroid equation (5) is solved by any eigenvector of the cluster correlation Bi Bi^⊤ corresponding to the minimal eigenvalue, if by abuse of notation Bi is taken to indicate the (n × |Bi|)-matrix containing the elements of the set Bi in its columns. Alternative approaches to this subspace clustering problem are reviewed in [6].

2. PROJECTIVE CLUSTERING

A first step towards general subspace clustering is to consider one-dimensional subspaces, i.e. lines. Let RP^n denote the space of one-dimensional real vector subspaces of R^{n+1}. It is equivalent to S^n after identifying antipodal points, so it has the quotient representation RP^n = S^n/{−1,1}. We will represent lines by their equivalence class [x] := {λx | λ ∈ R} for x ≠ 0. A metric can be defined by

d0([x],[y]) := √( 1 − ( x^⊤y / (‖x‖ ‖y‖) )² ).   (6)

Clearly d0 is symmetric, and positive definite according to the Cauchy–Schwarz inequality.

Conveniently, the cluster centroid of a cluster Bi is given by any eigenvector of the cluster correlation Bi Bi^⊤ corresponding to the largest eigenvalue. In section 3.1, we will show that projective clustering is a special case of a more general clustering, and hence the derivation of the corresponding centroid clustering algorithm will be postponed until later.

Figure 1 shows an example application of the projective k-means algorithm. Note that projective k-means can be directly applied to the dual problem of clustering hyperplanes by using the description via their normal 'lines'.
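A small sketch of the two callbacks for this case, consistent with the statements above (the helper names are ours; proj_centroid uses the eigenvector for the largest eigenvalue of the cluster correlation):

```python
import numpy as np

def proj_dist(x, y):
    """d0 from equation (6) between the lines [x] and [y]."""
    c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 1.0 - c * c))

def proj_centroid(cluster):
    """Centroid of a cluster of lines: eigenvector of B B^T for the largest eigenvalue."""
    B = np.column_stack([x / np.linalg.norm(x) for x in cluster])
    w, E = np.linalg.eigh(B @ B.T)   # eigh returns eigenvalues in ascending order
    return E[:, -1]
```

Plugged into the generic batch_kmeans sketch above, these two functions yield the projective k-means algorithm used for figure 1.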

3. GRASSMANN CLUSTERING

More interestingly, we would like to perform clustering in the Grassmann manifold Gn,p of p-dimensional vector subspaces of R^n for 0 ≤ p ≤ n. If Vn,p denotes the Stiefel manifold consisting of orthonormal (n × p)-matrices for n ≥ p, then Gn,p has the natural quotient representation Gn,p = Vn,p/Op, where Op := Vp,p denotes the orthogonal group. This representation simply means that any p-dimensional subspace of R^n is given by p orthonormal vectors, i.e. by a basis V ∈ Vn,p, which is unique except for right multiplication by an orthogonal matrix. We will also write [V] for the subspace.

The geometric properties of optimization algorithms on Gn,p are nicely discussed by Edelman et al. [5]. They also summarize various metrics on the Grassmann manifold, which can all be naturally derived from the geodesic metric (arc length) induced by the natural Riemannian structure of Gn,p. Some equivalence relations between the metrics are known, but for computational purposes, we choose the very easy to calculate so-called projection F-norm given by

d([V],[W]) := 2^{−1/2} ‖VV^⊤ − WW^⊤‖_F,   (7)

where ‖V‖_F := √(tr(VV^⊤)) denotes the Frobenius norm of a matrix. Note that the projection F-norm is indeed well-defined, as (7) does not depend on the choice of class representatives.

Figure 1: Illustration of projective k-means clustering in three dimensions. 10^5 samples from a 4-dimensional strongly supergaussian distribution are projected onto three dimensions and serve as the generators of the lines. These were nicely clustered into k = 4 centroids, located at the density axes.

In order to perform k-means clustering on (Gn,p, d), we have to solve the centroid problem (5). One of our main results is that the centroid [Ci] of the subspaces in some cluster Bi is spanned by the p eigenvectors corresponding to the largest eigenvalues of the generalized cluster covariance (1/|Bi|) ∑_{[V]∈Bi} VV^⊤. This generalizes the projective and the hyperplane k-means algorithms from above.
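Following this result, the Grassmann case again only requires a distance and a centroid callback; a minimal sketch (our own naming), where each subspace is represented by an (n × p) orthonormal basis:

```python
import numpy as np

def grassmann_dist(V, W):
    """Projection F-norm (7); V, W are (n x p) orthonormal bases of the two subspaces."""
    return np.linalg.norm(V @ V.T - W @ W.T, 'fro') / np.sqrt(2.0)

def grassmann_centroid(cluster, p):
    """Centroid subspace: span of the p leading eigenvectors of the generalized
    cluster covariance (1/|B|) sum_V V V^T."""
    S = sum(V @ V.T for V in cluster) / len(cluster)
    w, E = np.linalg.eigh(S)      # ascending eigenvalues
    return E[:, -p:]              # orthonormal basis of the centroid
```

The returned basis is of course only unique up to right multiplication by an orthogonal matrix, matching the quotient representation above.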

3.1 Calculating the optimal centroids

For the cluster update step of the batch k-means algorithm we need to find [C] such that

f(C) := ∑_{i=1}^l d([Vi],[C])²

for l subspaces [Vi] represented by Vi ∈ V(n, p) is minimal, subject to g(C) := C^⊤C = Ip (pseudo-orthogonality). We may also assume that the Vi are pseudo-orthonormal, Vi^⊤Vi = Ip.

It is easy to see that

f(C) = (1/2) ( tr(∑_i ViVi^⊤) + tr(l CC^⊤ − 2 CC^⊤ ∑_i ViVi^⊤) )
     = (1/2) ( tr D + tr((l In − 2V) CC^⊤) ),

where

V := ∑_i ViVi^⊤ and EDE^⊤ = V

denote the eigenvalue decomposition of V with E orthonormal and D diagonal. This means that

f(C) = (1/2) ( ∑_{i=1}^n d_ii + lp − 2 tr(D E^⊤CC^⊤E) ) = (1/2) ( ∑_{i=1}^n d_ii + lp − 2 ∑_{i=1}^n d_ii x_ii ),

where d_ij are the matrix elements of D, and x_ij those of X = E^⊤CC^⊤E.

Here tr X = tr(CC^⊤) = p for pseudo-orthogonal C (p eigenvectors of CC^⊤ with eigenvalue 1), and all 0 ≤ x_ii ≤ 1 (again by pseudo-orthogonality). Hence this is a linear optimization problem on a convex set (see also figure 2) and therefore any optimum is located at the corners of the convex set, which in our case are {x ∈ {0,1}^n | ∑_{i=1}^n x_i = p}. If we assume that the d_ii are ordered in descending order, then a minimum of f is given by

X = ( Ip 0 ; 0 0 ),

which corresponds to

CC^⊤ = EXE^⊤ = E ( Ip 0 ; 0 0 ) E^⊤,   i.e. C = E ( Ip ; 0 ), the first p columns of E.

Figure 2: Illustration of the convex subsets on which the function ∑_{i=1}^n d_ii x_i for given D is optimized. Here n = 4 and the surfaces for p = 1,...,3 are depicted (normalized onto the standard simplex).

In this calculation we can also see the indeterminacies of the optimization:
1. If two or more eigenvalues of V are equal, any point on the corresponding edge of the convex set is optimal and hence the centroid can vary along the subspace generated by the corresponding eigenvectors (columns of E).
2. If some eigenvalues of V are zero, a similar indeterminacy occurs.

An example in RP^2 is demonstrated in figure 3.

Figure 3 (annotations: e1, e2, x = (x1, x2), d0(e1, x) = cos(x1)², d0(e2, x) = cos(x2)²): Let Vi be two samples which are orthogonal (w.l.o.g. we can assume Vi = ei, represented by the unit vectors). Hence V = ∑ ViVi^⊤ has degenerate eigenstructure. Then the quantisation error is given by d(e1,x)² + d(e2,x)², which is here (1/2)(2 + 2 − 2 tr(I XX^⊤)) = 2 − x1² − x2² = 1 for X represented by x = (x1, x2). Hence any X is a centroid in the sense of the batch k-means algorithm.

3.2 Relationship to projective clustering

The distance d0 on RP^n from above (equation (6)) was defined as

d0(V,W) = √( 1 − ( V^⊤W / (‖V‖ ‖W‖) )² ),

if according to our previous notation [V],[W] ∈ Gn,1 = RP^n. Note that if the two vectors represent time series, then this is the same as the correlation between the two.

It is now easy to see that this distance coincides with the definition of d on the general Grassmannian from above. Let V,W ∈ V(n,1) be two vectors. We may assume that V^⊤V = W^⊤W = 1. Then

2 d(V,W)² = tr(VV^⊤VV^⊤ + WW^⊤WW^⊤ − VV^⊤WW^⊤ − WW^⊤VV^⊤)
          = tr(VV^⊤) + tr(WW^⊤) − 2 tr(V(V^⊤W)W^⊤).

All matrices have rank 1 and hence the trace is the sole nonzero eigenvalue. Since VV^⊤V = V it is 1 for the first matrix, similarly for the second, and W^⊤V for the third, because VW^⊤V = (W^⊤V)V. Hence

2 d(V,W)² = 2 − 2(W^⊤V)² = 2 d0(V,W)².

3.3 Dealing with affine spaces

So far we have only dealt with the special case of clustering subspaces, i.e. linear subsets which contain the origin. But in practice the problem of clustering affine subspaces arises, see for example the applications in section 4. This can be dealt with quite easily.

Let F be a p-dimensional affine linear subset of R^n. Then F can be characterized by p + 1 points v0,...,vp such that



v1 − v0,...,vp − v0 are linearly independent. Consider the following embedding

R^n → R^{n+1} : (x1,...,xn) ↦ (x1,...,xn,1).

We may therefore identify the p-dimensional affine subspaces with the (p + 1)-dimensional linear subspaces in R^{n+1} by embedding the generators and taking the linear closure. In fact it is easy to see that we obtain a 1-to-1 mapping between the p-dimensional affine subspaces of R^n and the (p + 1)-dimensional linear subspaces in R^{n+1} which are not contained in the orthogonal complement of (0,...,0,1).

Hence we can reduce the affine case to calculations for linear subsets only. Note that since only eigenvectors of sums of projections onto the subsets Vi can become centroids in the batch version of the k-means algorithm, any centroid is also in the image of the above embedding and can be identified uniquely with an affine subspace of the original problem.
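The embedding itself is a one-liner in code; the sketch below (our own helper, assuming the generators are affinely independent) returns an orthonormal basis of the corresponding (p+1)-dimensional linear subspace of R^{n+1}, which can be fed directly to the Grassmann k-means sketch above.

```python
import numpy as np

def embed_affine(points):
    """Orthonormal basis in R^{n+1} of the linear closure of the embedded generators (v, 1)."""
    P = np.column_stack([np.append(np.asarray(v, float), 1.0) for v in points])
    Q, _ = np.linalg.qr(P)   # reduced QR: columns span the same (p+1)-dim subspace
    return Q
```

A centroid subspace is mapped back to an affine subspace by intersecting its span with the hyperplane x_{n+1} = 1.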

4. EXPERIMENTAL RESULTS

We finish by illustrating the algorithm in a few examples.

4.1 Toy example

As a toy example, let us first consider 10^4 samples of G4,2, namely uniformly randomly chosen from the 6 possible 2-dimensional coordinate planes. In order to avoid any bias within the algorithm, the non-zero coefficients of the plane-representing matrices have been chosen uniformly from O2. The samples have been deteriorated by Gaussian noise with a signal-to-noise ratio of 10 dB. Application of the Grassmann k-means algorithm with k = 6 yields convergence after only 6 epochs with the resulting 6 clusters with centroids [V^i]. The distance measure µ(V) := (|v_i1 + v_i2| + |v_i1 − v_i2|)_i should be large only in two coordinates if [V] is close to the corresponding 2-dimensional coordinate plane. And indeed, the found centroids have distance measures

µ(V^1),...,µ(V^6) = (0.02, 0, 1.9, 1.9)^⊤, (1.7, 0.01, 0.01, 1.7)^⊤, (1.7, 0.01, 1.7, 0.02)^⊤, (0.01, 1.5, 1.5, 0)^⊤, (2.0, 2.0, 0, 0.01)^⊤, (0.01, 2.0, 0.01, 2.0)^⊤.

Hence, the algorithm correctly chose all 6 coordinate planes as cluster centroids.

4.2 Polytope identification

As an example application of the Grassmann clustering algorithm, we want to solve the following approximation problem from computational geometry: given a set of points, identify the smallest convex polytope with a fixed number of faces k containing the points. In two dimensions, this implies the task of finding the k edges of a polytope where only samples in the inside are known. We use the QHull algorithm [1] to construct the convex hull, thus identifying the possible edges of the polytope. Then, we apply affine Grassmann k-means clustering to these edges in order to identify the k bounding edges. Figure 4 shows an example. Generalization to arbitrary dimensions is straight-forward.

Figure 4: An example of using hyperplane clustering (p = n − 1) to identify the contour of a sampled figure; panels: (a) samples, (b) QHull, (c) Grassmann clustering, (d) result. QHull was used to find the outer edges, then these are clustered into 4 clusters. The broken lines show the boundaries used to generate the 300 samples.
the 300 samples.



4.3 Nonnegative Matrix Factorization

(Overcomplete) Nonnegative Matrix Factorization (NMF) deals with the problem of finding a nonnegative decomposition X = AS + N of a nonnegative matrix X, where N denotes unknown Gaussian noise. S is often pictured as a source data set containing samples along its columns. If we assume that S spans the whole first quadrant, then X is a conic hull with cone lines given by the columns of A. After projection to the standard simplex, the conic hull reduces to the convex hull, and the projected, known mixture data set X lies within a convex polytope of the order given by the number of rows of S. Hence we face the problem of identifying edges of a sampled polytope, and, even in the overcomplete case, we may tackle this problem by the Grassmann clustering-based identification algorithm from the previous section.
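The projection onto the standard simplex referred to here is simply the central normalization x ↦ x / ∑_j x_j applied to each sample; a one-line sketch (ours, not from the paper):

```python
import numpy as np

def project_to_simplex(X):
    """Scale each nonnegative mixture sample (column of X) so its coordinates sum to one."""
    return X / X.sum(axis=0, keepdims=True)
```

The normalized samples can then be handed to the polytope-identification sketch of section 4.2.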

As an example, see figure 5, we choose a random mixing matrix

A = ( 0.76  0.39  0.14 ; 0.033  0.06  0.43 ; 0.20  0.56  0.43 )

and sources S given by i.i.d. samples from a squared Gaussian. 10^5 samples were drawn, and subsets of 10 to 10^5 samples were used for the comparison. We refer to the figure caption for further details.

5. CONCLUSION

We have studied k-means-style clustering problems on the non-Euclidean Grassmann manifold. In an adequate metric, we were able to reduce the arising centroid calculation problem to the calculation of eigenvectors of the cluster covariance, for which we gave a proof based on convex optimization. The algorithm was illustrated by applications to polytope fitting and to performing overcomplete nonnegative factorizations similar to NMF. In future work, besides extending the framework to other clustering algorithms and matrix manifolds together with proving convergence of the resulting algorithms, we plan on applying the algorithm to the stability analysis of multidimensional independent component analysis.

Acknowledgements

Partial financial support by the DFG (GRK 638) and the BMBF (project 'ModKog') is gratefully acknowledged.

Figure 5: Grassmann clustering can be used to solve the NMF problem. (a) Comparison of NMF (mean square error) and Grassmann clustering for NMF, averaged over 4 tries: crosserror versus number of samples (10 to 10^5). (b) Illustration of the NMF algorithm: projection onto the standard simplex; shown are the samples, the QHull contour, the Grassmann clustering result (100 samples), the NMF mixing matrix (100000 samples) and the true mixing matrix. The mixed signal to be analyzed is a 3-dimensional toy signal with a positive 3 × 3 matrix. The resulting mixture was analyzed with a mean square error implementation of NMF and compared to Grassmann clustering. In the clustering algorithm the data is first projected onto the standard simplex. This translates the task to the polytope identification discussed in section 4.2.

REFERENCES

[1] C.B. Barber, D.P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hull. Technical Report GCG53, The Geometry Center, University of Minnesota, Minneapolis, 1993.
[2] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] L. Bottou and Y. Bengio. Convergence properties of the k-means algorithms. In Proc. NIPS 1994, pages 585–592. MIT Press, 1995.
[4] P.S. Bradley and O.L. Mangasarian. k-plane clustering. Journal of Global Optimization, 16(1):23–32, 2000.
[5] A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1999.
[6] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105, 2004.
[7] S.Z. Selim and M.A. Ismail. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:81–87, 1984.


Chapter 19

LNCS 3195:977-984, 2004

Paper: I.R. Keck, F.J. Theis, P. Gruber, E.W. Lang, K. Specht, and C.G. Puntonet. 3D spatial analysis of fMRI data on a word perception task. In Proc. ICA 2004, volume 3195 of LNCS, pages 977-984, Granada, Spain, 2004. Springer

Reference: (Keck et al., 2004)

Summary in section 1.6.1



3D Spatial Analysis of fMRI Data on a Word Perception Task

Ingo R. Keck (1), Fabian J. Theis (1), Peter Gruber (1), Elmar W. Lang (1), Karsten Specht (2), and Carlos G. Puntonet (3)

(1) Institute of Biophysics, Neuro- and Bioinformatics Group, University of Regensburg, D-93040 Regensburg, Germany
{Ingo.Keck,elmar.lang}@biologie.uni-regensburg.de
(2) Institute of Medicine, Research Center Jülich, D-52425 Jülich, Germany
k.specht@fz-juelich.de
(3) Departamento de Arquitectura y Tecnologia de Computadores, Universidad de Granada/ESII, E-1807 Granada, Spain
carlos@atc.ugr.es

Abstract. We discuss a 3D spatial analysis of fMRI data taken during a combined word perception and motor task. The event-based experiment was part of a study to investigate the network of neurons involved in the perception of speech and the decoding of auditory speech stimuli. We show that a classical general linear model analysis using SPM does not yield reasonable results. With blind source separation (BSS) techniques using the FastICA algorithm it is possible to identify different independent components (IC) in the auditory cortex corresponding to the four different stimuli. Most interestingly, we could detect an IC representing a network of simultaneously active areas in the inferior frontal gyrus responsible for word perception.

1 Introduction

Since the early 90s [1, 2], functional magnetic resonance imaging (fMRI) based on the blood oxygen level dependent contrast (BOLD) has developed into one of the main technologies in human brain research. Its high spatial and temporal resolution combined with its non-invasive nature makes it an important tool to discover functional areas in the human brain and their interactions. However, its low signal to noise ratio (SNR) and the high number of activities in the passive brain require sophisticated analysis methods, which can be divided into two classes:

– model based approaches like the general linear model, which require prior knowledge of the time course of the activations,
– model free approaches like blind source separation (BSS), which try to separate the recorded activation into different classes according to statistical specifications without prior knowledge of the activation.




In this text we compare these analysis techniques in a study of an auditory task. We show an example where traditional model based methods do not yield reasonable results. Rather, blind source separation techniques have to be used to get meaningful and interesting results concerning the networks of activations related to a combined word recognition and motor task.

1.1 Model Based Approach: General Linear Model

The general linear model, as a kind of regression analysis, has been the classic way to analyze fMRI data in the past [3]. Basically it uses second order statistics to find the voxels whose activations correlate best to given time courses. The measured signal for each voxel in time, y = (y(t1), ..., y(tn))^T, is written as a linear combination of independent variables, y = Xb + e, with the vector b of regression coefficients and the matrix X of the independent variables, which in case of an fMRI analysis consists of the assumed time courses in the data and additional filters to account for the serial correlation of fMRI data. The residual error e ought to be minimized. The normal equation X^T Xb = X^T y of the problem is solved by b = (X^T X)^{-1} X^T y and has a unique solution if X^T X has full rank. Finally a significance test using e is applied to estimate the statistical significance of the found correlation.

As the model X must be known in advance to calculate b, this method is called "model-based". It can be used to test the accuracy of a given model, but cannot by itself find a better suited model even if one exists.
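The per-voxel least-squares fit can be sketched in a few lines (a minimal illustration only; SPM additionally applies filtering, pre-whitening and the statistical tests mentioned above):

```python
import numpy as np

def glm_fit(X, y):
    """Least-squares estimate of b in y = Xb + e; lstsq solves the normal
    equations in a numerically stable way."""
    b, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b        # residual error, used afterwards for the significance test
    return b, e
```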

1.2 Model Free Approach: BSS Using Independent Component Analysis

In case of fMRI data, blind source separation refers to the problem of separating a given sensor signal, i.e. the fMRI data at time t,

x(t) = A [s(t) + s_noise(t)] = ∑_{i=1}^n a_i s_i(t) + ∑_{i=1}^n a_i s_noise,i(t),

into its underlying n source signals s, with a_i being its contribution to the sensor signal, hence its mixing coefficient. A and s are unique except for permutation and scaling. The functional segregation of the brain [3] closely matches the requirement of spatially independent sources as assumed in spatial ICA. The term s_noise(t) is the time dependent noise. Unfortunately, in fMRI the noise level is of the same order of magnitude as the signal, so it has to be taken into account. As the noise term depends on time, it can be included as additional components into the problem. This problem is called "under-determined" or "over-complete" as the number of independent sources will always exceed the number of measured sensor signals x(t).

Various algorithms utilizing higher order statistics have been proposed to solve the BSS problem. In fMRI analysis, mostly the extended Infomax (based on entropy maximisation [4, 5]) and FastICA (based on negentropy using fixed-point iteration [6]) algorithms have been used so far. While the extended Infomax algorithm is expected to perform slightly better on real data due to its adaptive nature, FastICA does not depend on educated guesses about the probability density distribution of the unknown source signals. In this paper we choose to utilize FastICA because of its low demands on computational power.

2 Results

First, we will present the implementation of the algorithm we used. Then we will discuss an example of an event-designed experiment and its BSS based analysis, where we were able to identify a network of brain areas which could not be detected using classic regression methods.

2.1 Method

To implement spatial ICA for fMRI data, every three-dimensional fMRI image is considered as a single mixture of underlying independent components. The rows of every image matrix have to be concatenated to a single row-vector, and with these image-vectors the mixture matrix X is constructed.

For FastICA the second order correlation in the data has to be eliminated by a "whitening" preprocessing. This is done using a principal component analysis (PCA) step prior to the FastICA algorithm. In this step a data reduction can be applied by omitting principal components (PC) with a low variance in the signal reconstruction process. However, this should be handled with care, as valuable higher order statistical information can be contained in these low variance PCs. The maximal variations in the time trends of the supposed word-detection ICs in our example account for only 0.7% of the measured fMRI signal.

The FastICA algorithm calculates the de-mixing matrix W = A^{-1}. Then the underlying sources S can be reconstructed as well as the original mixing matrix A. The columns of A represent the time-courses of the underlying sources, which are contained in the rows of S. To display the ICs the rows of S have to be converted back to three-dimensional image matrices.

As noted before, because of the high noise present in fMRI data the ICA problem will always be under-determined or over-complete. As FastICA cannot separate more components than the number of mixtures available, the resulting IC will always be composed of a noise part and the "real" IC superimposed on that noise. This can be compensated by individually de-noising the IC. As a rule of thumb we decided that, to be considered a noise signal, the value has to be below 10 times the mean variance in the IC, which corresponds to a standard deviation of about 3.
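A compact sketch of this spatial-ICA data layout, using scikit-learn's FastICA purely for illustration (the study used the stabilized FastICA with a separate PCA retaining 340 components; the helper name and array shapes below are our assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

def spatial_ica(volumes, n_components):
    """Spatial ICA of an fMRI run. `volumes` has shape (T, nx, ny, nz); each scan is
    flattened into one row of the mixture matrix X, so the estimated sources are
    spatial maps and the mixing-matrix columns are their time courses."""
    T = volumes.shape[0]
    X = volumes.reshape(T, -1)                 # one flattened volume per row
    ica = FastICA(n_components=n_components,
                  fun='logcosh')               # 'logcosh' corresponds to the tanh nonlinearity
    S = ica.fit_transform(X.T).T               # rows of S: spatial independent components
    A = ica.mixing_                            # columns of A: component time courses
    maps = S.reshape(n_components, *volumes.shape[1:])
    return maps, A
```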

2.2 Example: Analysis of an Event-Based Experiment

This experiment was part of a study to investigate the network involved in the perception of speech and the decoding of auditory speech stimuli. Therefore one- and two-syllable words were divided into several frequency bands and then rearranged randomly to obtain a set of auditory stimuli. The set consisted of four different types of stimuli, containing 1, 2, 3 or 4 frequency bands (FB1–FB4) respectively. Only FB4 was perceivable as words.

During the functional imaging session these stimuli were presented pseudo-randomized to 5 subjects, according to the rules of a stochastic event-related paradigm. The task of the subjects was to press a button as soon as they were sure that they had just recognized a word in the sound presented. It was expected that these four types of stimuli activate different areas of the auditory system as well as, in the case of FB4, the superior temporal sulcus in the left hemisphere [8].

Prior to the statistical analysis the fMRI data were pre-processed with the SPM2 toolbox [9]. A slice-timing procedure was performed, movements were corrected, and the resulting images were normalized into a stereotactical standard space (defined by a template from the Montreal Neurological Institute) and smoothed with a Gaussian kernel to increase the signal-to-noise ratio.

Classical Fixed-Effect Analysis. First, a classic regression analysis with SPM2 was applied. No substantial differences in the activation of the auditory cortex, apart from an overall increase of activity with ascending number of frequency bands, were found in three subjects. One subject showed no correlated activity at all, two only had marginal activity located in the auditory cortex (figure 1 (c)). Only one subject showed obvious differences between FB1 and FB4: an activation of the left supplementary motor area, the cingulate gyrus and an increased size of active area in the left auditory cortex for FB4 (figure 1 (a),(b)).

Spatial ICA with FastICA. For the sICA with FastICA [6], up to 351 three-dimensional images of the fMRI sessions were interpreted as separate mixtures of the unknown spatially independent activity signals. Because of the high computational demand each subject was analyzed individually instead of a whole-group ICA as proposed in [10]. A principal component analysis (PCA) was applied to whiten the data. 340 components of this PCA were retained, corresponding to more than 99.999% of the original signals. This is still 100 times greater than the share of ICs like that shown in figure 3 on the fMRI signal. In one case only 317 fMRI images were measured and all resulting 317 PCA components were retained.

Then the stabilized version of the FastICA algorithm was applied using tanh as non-linearity. The resulting 340 (resp. 317) spatially independent components (IC) were sorted into different classes depending on their structural localization within the brain. Various ICs in the region of the auditory cortex could be identified in all subjects, figure 2 showing one example. Note that all brain images in this article are flipped, i.e. the left hemisphere appears on the right side of the picture. To calculate the contribution of the displayed ICs to the observed fMRI data, the value of its voxels has to be multiplied with the time course of its activation for each scan (lower subplot to the right of each IC plot). Also a component located at the position of the supplementary motor area (SMA) could be found in all subjects.

Fig. 1. Fixed-effect analysis of the experimental data (panels (a)-(c)). No substantial differences between the activation in the auditory cortex correlated to (a) FB1 and (b) FB4 can be seen. (c) shows the analysis for FB4 of a different subject.

Fig. 2. Independent component located in the auditory cortex and its time course.



Fig. 3. Independent component which corresponds to a proposed subsystem for word detection.

Fig. 4. Independent component with activation in Broca's area (speech motor area).

The most interesting finding was an IC which represents a network of three simultaneously active areas in the inferior frontal gyrus (figure 3) in one subject. This network was suggested to be a center for the perception of speech in [8]. Figure 4 shows an IC (of the same subject) that we assume to be a network for the decision to press the button. All other subjects except one had ICs that correspond to these networks, although often separated into different components. The time course of both components matches visually very well (figure 5), while their correlation coefficient remains rather low (k_corr = 0.36), apparently due to temporary time- and baseline-shifts.



Comparison of the Regression Analysis Versus ICA. To compare the results of the fixed-effect analysis with the results of the ICA, the correlation coefficients between the expected time-trends of the fixed-effect analysis and the time-trends of the ICs were calculated. No substantial correlation was found: 87% of all these coefficients were in the range of −0.1 to 0.1, the highest coefficient found being 0.36 for an IC within the auditory cortex (figure 2). The correlation coefficients for the proposed word detection network (figure 3) were 0.14, 0.08, 0.19 and 0.18 for FB1–FB4. Therefore it is quite obvious that this network of areas in the inferior frontal gyrus cannot be detected with a classic fixed-effect regression analysis.
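This comparison amounts to correlating each IC time course with each model regressor; a minimal sketch, assuming hypothetical arrays `ic_timecourses` (scans × ICs) and `regressors` (scans × conditions FB1–FB4) rather than the actual study data:

```python
import numpy as np

# Hypothetical inputs: IC activation time courses and the expected
# time-trends (regressors) of the fixed-effect analysis.
n_scans, n_ics, n_conditions = 351, 340, 4
rng = np.random.default_rng(0)
ic_timecourses = rng.standard_normal((n_scans, n_ics))
regressors = rng.standard_normal((n_scans, n_conditions))

# Pearson correlation of every IC with every regressor.
both = np.corrcoef(ic_timecourses.T, regressors.T)
corr = both[:n_ics, n_ics:]                      # (n_ics x n_conditions)

# Fraction of coefficients between -0.1 and 0.1, and the largest one.
small = np.mean(np.abs(corr) < 0.1)
print(f"{100 * small:.0f}% of coefficients in (-0.1, 0.1), max |r| = {np.abs(corr).max():.2f}")
```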

While the reasons for the differences between the activation-trends of the ICs and the assumed time-trends are still subject to on-going research, it can be expected that the results of this ICA will help to gain further information about the work flow of the brain concerning the task of word detection.

Fig. 5. The activation of the ICs shown in figure 3 (dotted) and 4 (solid), plotted for scan no. 25–75. While these time-trends obviously appear to be correlated, their correlation coefficient remains very low due to temporary baseline- and time-shifts in the trends.

3 Conclusions

We have shown that ICA can be a valuable tool to detect hidden or suspected links and activity in the brain that cannot be found using the classical approach of a model-based analysis like the general linear model. While clearly ICA cannot be used to validate a model (being in itself model-free), it can give useful hints to understand the internal organization of the brain and help to develop new models and study designs which then can be validated using a classic regression analysis.



Acknowledgment

This work was supported by the BMBF (project ModKog).

References

1. K. K. Kwong, J. W. Belliveau, D. A. Chester, I. E. Goldberg, R. M. Weisskoff, B. P. Poncelet, D. N. Kennedy, B. E. Hoppel, M. S. Cohen, R. Turner, H-M. Cheng, T. J. Brady, B. R. Rosen, "Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation", Proc. Natl. Acad. Sci. USA 89, 5675–5679 (1992).
2. S. Ogawa, T. M. Lee, A. R. Kay, D. W. Tank, "Brain magnetic-resonance-imaging with contrast dependent on blood oxygenation", Proc. Natl. Acad. Sci. USA 87, 9868–9872 (1990).
3. R. S. J. Frackowiak, K. J. Friston, Ch. D. Frith, R. J. Dolan, J. C. Mazziotta, "Human Brain Function", Academic Press, San Diego, USA, 1997.
4. A. J. Bell, T. J. Sejnowski, "An information-maximisation approach to blind separation and blind deconvolution", Neural Computation 7(6), 1129–1159 (1995).
5. M. J. McKeown, T. J. Sejnowski, "Independent Component Analysis of FMRI Data: Examining the Assumptions", Human Brain Mapping 6, 368–372 (1998).
6. A. Hyvärinen, "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis", IEEE Transactions on Neural Networks 10(3), 626–634 (1999).
7. F. Esposito, E. Formisano, E. Seifritz, R. Goebel, R. Morrone, G. Tedeschi, F. Di Salle, "Spatial Independent Component Analysis of Functional MRI Time-Series: To What Extent Do Results Depend on the Algorithm Used?", Human Brain Mapping 16, 146–157 (2002).
8. K. Specht, J. Reul, "Functional segregation of the temporal lobes into highly differentiated subsystems for auditory perception: an auditory rapid event-related fMRI-task", NeuroImage 20, 1944–1954 (2003).
9. SPM2: http://www.fil.ion.ucl.ac.uk/spm/spm2.html, July 2003.
10. V. D. Calhoun, T. Adali, G. D. Pearlson, J. J. Pekar, "A Method for Making Group Inferences from Functional MRI Data Using Independent Component Analysis", Human Brain Mapping 14, 140–151 (2001).




Chapter 20

Signal Processing 86(3):603-623, 2006

Paper F.J. Theis and G.A. García. On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms. Signal Processing, 86(3):603-623, 2006

Reference (Theis and García, 2006)

Summary in section 1.6.3



On the use of sparse signal decomposition in the analysis of multi-channel surface electromyograms

Fabian J. Theis a,∗, Gonzalo A. García b

a Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
b Department of Bioinformatic Engineering, Osaka University, Osaka, Japan

∗ Corresponding author. Email addresses: fabian@theis.name (Fabian J. Theis), garciaga@ieee.org (Gonzalo A. García).

Abstract

The decomposition of surface electromyogram data sets (s-EMG) is studied using blind source separation techniques based on sparseness; namely independent component analysis, sparse nonnegative matrix factorization, and sparse component analysis. When applied to artificial signals we find noticeable differences of algorithm performance depending on the source assumptions. In particular, sparse nonnegative matrix factorization outperforms the other methods with regards to increasing additive noise. However, in the case of real s-EMG signals we show that despite the fundamental differences in the various models, the methods yield rather similar results and can successfully separate the source signal. This can be explained by the fact that the different sparseness assumptions (super-Gaussianity, positivity together with minimal 1-norm, and fixed number of zeros, respectively) are all only approximately fulfilled, thus apparently forcing the algorithms to reach similar results, but from different initial conditions.

Key words: surface EMG, blind source separation, sparse component analysis, independent component analysis, sparse nonnegative matrix factorization

1 Introduction

A basic question in data analysis, signal processing, data mining as well as neuroscience is how to represent a large data set X (observed as an (m × N)-matrix) in different ways. One of the simplest approaches lies in a linear decomposition, X = AS, with A an (m × n)-matrix (mixing matrix) and S an (n × N)-matrix storing the sources. Both A and S are unknown, hence this problem is often described as blind source separation (BSS). To get a well-defined problem, A and S have to satisfy additional properties such as:

• the source components Si (rows of S) are assumed to be realizations of stochastically independent random variables — this method is called independent component analysis (ICA) [1,2],
• the sources S are required to contain as many zeros as possible — we then speak of sparse component analysis (SCA) [3,4],
• A and S are assumed to be nonnegative, which is denoted by nonnegative matrix factorization (NMF) [5].

The above-mentioned models as well as their interplay have recently been in the focus of many researchers, for instance concerning the question of how ICA and sparseness are related to each other and how they can be integrated into BSS algorithms [6–10], how to deal with nonnegativity in the ICA case [11,12] or how to extend NMF in order to include sparseness [13,14]. Much work has already been devoted to these subjects, and their applications to various fields are currently emerging. Indeed, linear representations such as the above have several potential applications including decomposition of objects into 'natural' components [5], redundancy and dimensionality reduction [2], biomedical data analysis, micro-array data mining or enhancement, feature extraction of images in nuclear medicine, etc. [1,2].

In this study, we will analyze and compare the above models, not from a theoretical point of view but rather on a concrete, real-world example, namely the analysis of surface electromyogram (s-EMG) data sets. An electromyogram (EMG) denotes the electric signal generated by a contracting muscle [15]; its study is relevant to the diagnosis of motoneuron diseases [16] as well as neurophysiological research [17]. In general, EMG measurements make use of invasive, painful needle electrodes. An alternative is to use s-EMG, which is measured using non-invasive, painless surface electrodes. However, in this case the signals are rather more difficult to interpret due to noise and overlap of several source signals as shown in Fig. 1(a). We have already applied ICA in order to solve the s-EMG decomposition problem [18]; however, performance in real-world noisy s-EMG is still problematic, and it is yet unknown whether the assumption of independent sources holds very well in the setting of s-EMG.

In the present work, we apply sparse BSS methods based on various model assumptions to s-EMG signals. We first outline each of those methods and the corresponding performance indices used for their comparison. We then present the decompositions obtained with each method, and finally discuss these results in section 4.


Fig. 1. Electromyogram; (a) Example of a signal obtained from a surface electrode showing the superposition of the activity of several motor units. (b) α-motoneurons convey the commands from the central nervous system to the muscles. A motor unit consists of an α-motoneuron and the muscle fibres it innervates.

2 Method

2.1 Signal origin and acquisition

The central nervous system conveys commands to the muscles by trains of electric impulses (firings) via α-motoneurons, whose bodies are located in the spinal cord. The terminal axons of an α-motoneuron innervate a group of muscle fibres. A motor unit (MU) consists of an α-motoneuron and the muscle fibres it innervates, Fig. 1(b). When a MU is active, it produces a train of electric impulses called motor unit action potential train (MUAPT). The s-EMG signal is composed of the weighted summation of several MUAPTs; an example is given in Fig. 1(a).

2.1.1 Artificial signal generator

We developed a multi-channel s-EMG generator based on Disselhorst-Klug et al.'s model [19], employing Andreassen and Rosenfalck's conductivity parameters [20], and the firing rate statistical characteristics described by Clamann [21].

The volume conductor between a muscle fibre and the skin surface where the electrodes are placed acts as a spatial low-pass filter; hence, a source signal is attenuated in direct proportion to its distance from the electrode. Using Griep's tripole model [22], we can calculate the spatial potential distribution generated on the skin surface by the excitation of a single muscle fibre. The potential distribution generated by the firing of a single motor unit can be calculated as the linear superposition of the contributions of all its constitutive muscle fibres, which are not located in one plane. The aforementioned model can be considered as a linear instantaneous mixing process. In reality, however, it is well-known that the model could exhibit slightly convolutive behavior, which is due to the fact that the media crossed by the s-EMG sources is anisotropic, provoking delayed mixtures of the same signals (convolutions) rather than an instantaneous mixture of them. However, in the muscle model used, the distances between the different muscle fibres have been increased by the anisotropy factor A, calculated as the ratio between the radial conductivity (0.063 S/m) and the axial conductivity (0.328 S/m), see Tab. 1 for parameter details. Afterwards, the potential distribution can be calculated as if the volume conductor were isotropic. The channels composing the synthetic s-EMG contain the same MUAP without any time delay among themselves, hence producing an instantaneous mixture with the other source signals.

Fig. 2. Artificially created signals. (a) Five synthetic motor unit action potential trains (source signals, MUAPTs) generated by the model. (b) Artificial surface electromyogram (s-EMG); location of the sources (motor units, encircled asterisks) with respect to the electrode locations marked by an x. The y-axis represents the depth inside the upper limb. (c) Artificially created mixture signal — only the first 5000 samples (out of 15000) are shown. The eight-channel signal is generated from the five source signals (a), using the mixing matrix illustrated in (d).

Table 1. Parameters used for artificial s-EMG generation.

name           meaning                                   value
num            number of MUs                             5
siz            size of each MU, in number of fibres      20-30 (uniformly distr.)
cfr            central firing rate                       20 Hz
ED             electrode inter-pole distance             2.54 mm
CH             number of generated channels              8
meanIPI        mean inter-pulse interval                 1/cfr · 1000 ms
IPIstd         Clamann formula for IPI deviation         9.1 · 10⁻⁴ · meanIPI² + 4
SY             radial conductivity                       0.063 S/m
SZ             axial conductivity                        0.328 S/m
A              anisotropy ratio                          SZ/SY = 5.2
SI             intracellular conductivity                1.010 S/m
D              fibre diameter                            50 · 10⁻⁶ m
CV             conduction velocity                       4.0 m/s
FatHeight      fat-skin layer thickness                  3 mm
MUterritory    XY-plane radius of each MU                1.0 mm
DistEndplate   endplate-to-electrode distance            20 mm
FibZsegment    length of the endplate in the Z axis      0.4 mm
CondVolCond    conductivity of the volume conductor      0.08 S/m

We have generated 100 synthetic, eight-channel s-EMG signals of 1.5-second duration. Each s-EMG was composed of five MUs randomly located under a 3 mm fat and skin layer in the detection area of the electrode-array (up to ±2 mm of the sides of the electrode array and up to a depth of 6 mm). Each MU in turn consisted of 20 to 30 fibres (uniform distribution) of infinite length located randomly in a 1 mm circle in the XY plane. The conduction velocity of the fibres was 4.0 m/s. Note that an average biceps brachii MU is generally composed of at least 200-300 muscle fibres [23]; however, for the sake of computational speed, that number was reduced to one tenth, which only resulted in the reduction of the amplitude of the final MUAPs. This does not have any influence in our case, as the algorithms are not affected by the general amplitude of the signals, only by their respective proportions, if at all.

The simulated electrode-array used to generate the mixed recordings had the same dimensions and characteristics as the real one and was virtually located 20 mm apart from the endplate region, which had a thickness of 0.4 mm. The X axis corresponded to the length of the electrode, the Y axis to the depth, and the Z axis to the fibre direction. The mean instantaneous firing rate of each MU was 20 firings/second (mean inter-pulse interval meanIPI of 50 ms), with a standard deviation calculated following Clamann's model as 9.1 · 10⁻⁴ · meanIPI² + 4 [21].

The s-EMG signals were generated as follows: each muscle fibre composing a firing MU generates an action potential following the moving tripole model ([24] and references therein); the final MUAP appearing in each channel is the summation of those action potentials after applying the anisotropy coefficient A and the fading produced by the media crossed by the signal ([19] and references therein). Note that the model generally holds for any skeletal muscle. However, some of the parameters (such as the fat layer and firing rate) were chosen following the average, normal values of the biceps brachii [25].

A single such data set is visualized in Fig. 2. The eight-dimensional s-EMG observations have been generated from the five underlying MUAPTs shown in Fig. 2(a). Their locations are depicted in Fig. 2(b), and the mixing matrix used to generate the synthetic data set (see Fig. 2(c)) is shown in Fig. 2(d). We will use several (100) of these signals with randomly chosen source locations in batch runs for statistical comparisons.

2.1.2 Real s-EMG signal acquisition

Fig. 3 shows the experimental setting for the acquisition of real s-EMG recordings. The subject (after giving informed consent) sat on a chair and was strapped to it with nylon belts as shown in Fig. 3(a). The subject's forearm was put in a cast fixed to the table, the shoulder angle being approximately 30°. To measure the torque exerted by the subject, a strain gauge was placed between the forearm cast and the table fixation. An electrode array was placed on the biceps short head muscle belly, out of the innervation zone, transverse to the direction of muscle fibers, after cleaning the skin with alcohol and SkinPure (Nihon Kohden Corp, Tokyo, Japan). The electrode array consists of 16 stainless steel poles of 1 mm in diameter, 3 mm in height, and 2.54 mm apart in both plane directions that are attached to a rigid acrylic plate, see Fig. 3(b). The poles of each bipolar electrode were placed at the shortest possible distance to obtain sharper MUAPs.


Fig. 3. Experimental setting; (a) Eight-channel s-EMG was recorded during an isometric, constant-force contraction at 30% of the subject's MVC. (b) Detail of the s-EMG bipolar electrode array used for the recordings.

Eight-channel EMG bipolar signals were measured (input impedance above 10¹² Ω, CMRR of 110 dB), filtered by a first-order, band-pass Butterworth filter (cut-off frequencies of 70 Hz and 1 kHz), and then amplified (gain of 80 dB). The target torque and the measured torque were displayed on an oscilloscope as a thick and as a thin line respectively, which the subject was asked to match. The eight-channel s-EMG was sampled at a frequency of 10 kHz, 12 bits per sample, by a PCI-MIO-16E-1 A/D converter card (National Instruments, Austin, Texas, USA) installed in a PC, which was also equipped with a LabView (National Instruments, Austin, Texas, USA) program in charge of controlling the experiments. The signals were recorded during a 5 s period of constant-force isometric contractions at 30% maximum voluntary contraction (MVC).

Before applying any BSS method, the signal was preprocessed using a modified, nonlinear dead-zone filter developed to retain the MUAPs belonging to the target MUAPT and to remove both the noise and the low-amplitude MUAPs that did not reach the given thresholds. In former works, these thresholds were initially set at a level three times above the noise level (recording at 0% MVC) and then adjusted with the aid of an incorporated graphical user interface [18]. In the present work, the thresholds were set to 10% for all the signals so that we may compare all obtained results easily. The algorithm looks for zero-crossings on the signal and defines a waveform as the samples of the considered signal comprised between two consecutive zero-crossings. If the maximum (respectively, minimum in the case of a negative waveform) is above (respectively, below) the threshold, that waveform is kept in the signal given as output. Otherwise, the waveform is substituted by a zero-voltage signal on that period.
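The zero-crossing logic above translates directly into code; the following is a minimal sketch of such a dead-zone filter, not the original implementation used in the paper (the function name and threshold convention are assumptions):

```python
import numpy as np

def dead_zone_filter(x, threshold):
    """Keep waveforms between consecutive zero-crossings whose peak exceeds the
    threshold; replace all other waveforms by zeros (a sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    sign = np.signbit(x)
    crossings = np.where(sign[:-1] != sign[1:])[0] + 1   # waveform boundaries
    bounds = np.concatenate(([0], crossings, [x.size]))
    for start, stop in zip(bounds[:-1], bounds[1:]):
        wave = x[start:stop]
        if wave.size and np.max(np.abs(wave)) > threshold:
            y[start:stop] = wave                          # waveform passes the dead zone
    return y

# Example: keep only waveforms whose peak exceeds 10% of the channel maximum.
# filtered = dead_zone_filter(semg_channel, 0.1 * np.abs(semg_channel).max())
```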

It is quite important to use such a filter, because although BSS algorithms are generally able to separate the noise as a component, they are usually unable to separate more source signals than available channels. Generally in our experiments there were more MUs than available channels even at low levels of contraction, but only few of them had amplitude above the noise level. In order to eliminate the activity of some MUs we delete the MUAPTs belonging to distant MUs, whose MUAPs are less powerful. This is enhanced by PCA preprocessing and dimension reduction as discussed later in section 3.2.

2.2 Blind source separation

Linear blind source separation (BSS) describes the task of blindly recovering A and S in the equation

X = AS + N    (1)

where X consists of m signals with N observations each, put together into an (m × N)-matrix. A is a full-rank (m × n)-matrix, and we typically assume that m ≥ n (complete and under-complete case). Moreover N models additive noise, which is commonly assumed to be white Gaussian. Depending on the assumptions on A and S we get different models. Our goal is to estimate the mixing matrix A. Then the sources can be recovered by applying the pseudoinverse A⁺ of A to the mixtures X (which is optimal in the maximum-likelihood sense when Gaussian noise is assumed).

In the case of s-EMG data, the original signals are the MUAPTs generated by the motor units active during a sustained contraction. In this setting, A quantifies how each source contributes to each observation. Of course additional requirements will have to be applied to the model to guarantee a satisfactorily small space of solutions, and depending on the assumptions on A and S we will obtain different models.
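As a minimal illustration of this setup (not the paper's own code; the sizes and noise level are arbitrary assumptions), one can generate a noisy instantaneous mixture and recover the sources with the pseudoinverse of the true mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 8, 5, 15000                         # channels, sources, samples (assumed)

S = rng.laplace(size=(n, N))                  # stand-in for sparse MUAPT sources
A = np.abs(rng.normal(size=(m, n)))           # nonnegative mixing matrix, full rank
X = A @ S + 0.05 * rng.normal(size=(m, N))    # Eq. (1): mixtures plus white Gaussian noise

# If A (or an estimate of it) is known, the sources are recovered by the
# pseudoinverse, which is the maximum-likelihood estimate under Gaussian noise.
S_hat = np.linalg.pinv(A) @ X
print(np.corrcoef(S[0], S_hat[0])[0, 1])      # close to 1 for low noise
```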


2.2.1 Independent component analysis

Independent component analysis (ICA) describes the task of (here: linearly) transforming a given multivariate random vector such that its transform is stochastically independent. In our setting the random vector is given by N realizations, and ICA is applied to solve the BSS problem (1), where S is assumed to be independent. As in all BSS problems, a key issue lies in the question of identifiability of the model, and it can be shown that A (and hence S) is already uniquely — except for column permutation and scaling — determined by X if S contains at most one Gaussian and is square integrable [26,27]. This enables us to apply ICA to the BSS problem and to recover the original sources.

The idea of ICA was first expressed by Jutten and Hérault [28], while the term ICA was later coined by Comon [26]. In contrast to principal component analysis (PCA), ICA uses higher-order statistics to fully separate the data. Typical algorithms are based on contrasts such as minimum mutual information, maximum entropy, diagonal cumulants or non-Gaussianity. For more details on ICA we refer to the two available excellent text-books [1,2].

In the following we will use the so-called JADE (joint approximate diagonalization of eigenmatrices) algorithm, which identifies the sources using the fact that, due to independence, the cross-cumulants of the sources vanish, i.e. the fourth-order cumulant tensor of the sources is diagonal. Furthermore, fixing two indices of the 4th-order cumulants, it is easy to see that such cumulant matrices of the mixtures are diagonalized by A.

After pre-processing by PCA, we can assume that A is orthogonal. Then diagonalization of one mixture cumulant matrix already yields A, given that its eigenvalues are pairwise different. However, in practice this is not always the case; furthermore, estimation errors could result in a bad estimate of the cumulant matrix and hence of A. Therefore, joint diagonalization of a whole set of cumulant matrices yields an improved estimate of A. Algorithms for actually performing joint diagonalization include gradient descent on the sum of off-diagonal terms, iterative construction of A by Givens rotations in two coordinates [29], an iterative two-step recovery of A [30] or — more recently — a linear least-squares algorithm for diagonalization [31], where the latter two algorithms can also search for non-orthogonal matrices A.
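The single-matrix case described above can be sketched in a few lines of numpy; this is a toy illustration of the principle behind JADE (whitening, one fourth-order cumulant matrix, eigendecomposition), not the JADE algorithm itself, and the data sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 8, 20000
S = rng.laplace(size=(n, N))                     # super-Gaussian toy sources
A = np.abs(rng.normal(size=(m, n)))              # nonnegative mixing matrix
X = A @ S

# PCA whitening down to n components: Z = W X, with Z = V S' and V orthogonal.
Xc = X - X.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(Xc))
order = np.argsort(evals)[::-1][:n]
W = np.diag(evals[order] ** -0.5) @ evecs[:, order].T
Z = W @ Xc

# One fourth-order cumulant matrix Q(M)_ij = sum_kl cum(z_i,z_j,z_k,z_l) M_kl.
# For whitened, zero-mean z this equals E[(z'Mz) zz'] - tr(M) I - M - M'.
B = rng.normal(size=(n, n))
M = B + B.T                                      # a generic symmetric matrix
zMz = np.einsum('it,ij,jt->t', Z, M, Z)
Q = (Z * zMz) @ Z.T / N - np.trace(M) * np.eye(n) - M - M.T

# Diagonalizing Q recovers the orthogonal factor V (up to sign/permutation)
# when its eigenvalues are pairwise different; JADE instead jointly
# diagonalizes a whole set of such matrices for robustness.
_, V = np.linalg.eigh(Q)
A_hat = np.linalg.pinv(W) @ V                    # mixing matrix estimate
S_hat = V.T @ Z                                  # recovered (whitened) sources
```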

One method we use for analyzing the s-EMG signals is JADE-based ICA. Confirming results from [32], we show that ICA can indeed extract the underlying sources. In the case of s-EMG signals, all sources are strongly super-Gaussian and can therefore safely be assumed to be non-Gaussian, so identifiability holds. However, due to the nonnegativity of A, the scaling indeterminacy is reduced to multiplication with a positive scalar in each column. If we additionally use the common assumption of unit variance of the sources, this already eliminates the scaling indeterminacy. In order to use an ordinary ICA algorithm, we simply have to add a 'postprocessing' stage: to guarantee a nonnegative matrix, column signs are flipped to have only (or as many as possible) nonnegative column entries. Also note that statistical independence, meaning that the multivariate probability densities factorize, is not related to the synchrony in the firing of MUs [33] — otherwise overlapping MUs could not be separated.

We finally want to remark that ICA can also be interpreted as a sparse signal decomposition method in the case of super-Gaussian sources. This follows from the fact that a good and often-used contrast for ICA is given by maximization of non-Gaussianity [34] — this can be approximately derived from the fact that, due to the central limit theorem, a mixture of independent sources tends to be more Gaussian than the sources, so the process can be inverted by maximizing non-Gaussianity. In our setting the sources are very sparse, hence strongly non-Gaussian. An ICA decomposition is therefore closely related to a decomposition into parts of maximal sparseness — at least if sparseness is measured using kurtosis.

2.2.2 (Sparse) nonnegative matrix factorization

In contrast to other matrix factorization models such as PCA, ICA or SCA, nonnegative matrix factorization (NMF) strictly requires both matrices A and S to have nonnegative entries, which means that the data can be described using only additive components. Such a constraint has many physical realizations and applications, for instance in object decomposition [5]. If additionally some sparseness constraints are put on A and S, we speak of sparse NMF, see [14] for more details.

Typically, NMF is performed using a least-squares (Euclidean) contrast

E(A, S) = ‖X − AS‖²,    (2)

which is to be minimized. This optimization problem, albeit convex in each variable separately, is not convex in both at the same time and hence direct estimation is not possible. Paatero [35] minimizes (2) using a gradient algorithm, whereas Lee and Seung [36] develop a multiplicative update rule increasing algorithm performance considerably.
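For reference, the Lee–Seung multiplicative updates for the Euclidean contrast (2) take the following form; this is a generic textbook sketch, not the specific implementation used later in the paper:

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=500, eps=1e-9, seed=0):
    """Minimize ||X - A S||^2 with A, S >= 0 via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    A = rng.random((m, n_components))
    S = rng.random((n_components, N))
    for _ in range(n_iter):
        S *= (A.T @ X) / (A.T @ A @ S + eps)     # update sources
        A *= (X @ S.T) / (A @ S @ S.T + eps)     # update mixing matrix
    return A, S

# Usage on nonnegative data, e.g. the cut-off s-EMG X+ described below:
# A_hat, S_hat = nmf_multiplicative(X_plus, n_components=3)
```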

Although NMF has recently gained popularity due to its simplicity and power in various applications, its solutions frequently fail to exhibit the desired sparse object decomposition. Therefore, Hoyer [14] proposes a modification of the NMF model to include sparseness. However, a simple modification of the cost function (2) could yield undesirable local minima, so instead he chooses to minimize (2) under the constraint of fixed sparseness of both A and S. Here, sparseness is measured by combining the Euclidean norm ‖·‖₂ and the 1-norm ‖x‖₁ := Σᵢ |xᵢ| as follows:

sparseness(x) := (√n − ‖x‖₁/‖x‖₂) / (√n − 1)    (3)

if x ∈ Rⁿ \ {0}. So sparseness(x) = 1 (maximal) if x contains n − 1 zeros, and it reaches zero if the absolute values of all coefficients of x coincide.
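Equation (3) is straightforward to compute; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def hoyer_sparseness(x):
    """Sparseness measure of Eq. (3): 1 for a vector with a single non-zero
    entry, 0 when all entries have the same absolute value."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

print(hoyer_sparseness([0, 0, 0, 5]))    # 1.0
print(hoyer_sparseness([1, -1, 1, -1]))  # 0.0
```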

The devised algorithm is based on the iterative application of a gradient descent step and a projection step, thus restricting the search to the subspace of sparse solutions. We perform the factorization using the publicly available Matlab library nmfpack (http://www.cs.helsinki.fi/u/phoyer/), which is used in [14].

So NMF decomposes X into nonnegative A and nonnegative S. The assumption that A has nonnegative coefficients is very well fulfilled by s-EMG recordings; however, as seen before, the sources also have negative entries. In order to be able to apply the algorithms, we therefore preprocess the data using the function

κ(x) = 0 for x < 0, and κ(x) = x for x ≥ 0    (4)

to cut off negative values; this yields the new random vector (sample matrix) X+ := (κ(X1), ..., κ(Xn))⊤. For comparison, we also construct a new sample set by simply leaving out samples that have at least one negative value. Here we model this by the random vector X∗.
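In numpy the two preprocessed data sets can be built in one line each; a small sketch (the variable names are assumptions):

```python
import numpy as np

# X: (m x N) mixture matrix with possibly negative entries (placeholder here)
X = np.random.default_rng(0).laplace(size=(8, 1000))

X_plus = np.maximum(X, 0)                      # kappa applied entrywise, Eq. (4)
X_star = X[:, (X >= 0).all(axis=0)]            # keep only all-nonnegative samples
```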

2.2.3 Sparse component analysis

Sparse component analysis (SCA) [3,4] requires strong sparseness in the sources only — this is then sufficient to decompose the observations. In order to define the SCA model, a vector x ∈ Rⁿ is said to be k-sparse if x has at most k non-zero entries. This k-sparseness implies k₀-sparseness for k₀ ≥ k. If an n-dimensional vector is k-sparse for k = n − 1, it is simply said to be sparse. The goal of sparse component analysis of level k (k-SCA) is to decompose X into X = AS as above such that each sample (i.e. column) of S is k-sparse. In the following we will assume k = n − 1.

Note that, in contrast to the ICA model, the above model is not translation invariant. However, it is easy to see that if instead of A we allow an affine linear transformation, the translation constant can be determined from X only as long as the sources are non-deterministic. In other words, instead of assuming k-sparseness of the sources we could also assume that at any time instant only k source components are allowed to vary from a previously fixed constant (which can be different for each source).

In [3] we showed that under slight conditions k-sparseness already guarantees identifiability of the model, even in the case of fewer observations than sources. In the setting of s-EMG however we are in the comfortable situation of having more observations than sources, so as in the ICA case we preprocess our data using PCA projection — this dimension reduction algorithm can be applied even to our case of non-decorrelated sources as (given low or no noise) the first three principal components will span the source signal subspace, see comment in section 3.2. The above uniqueness result is based on the fact that due to the assumed sparseness the data clusters into a fixed number of hyperplanes. This fact can also be used in an algorithm to actually reconstruct A by identifying the set of hyperplanes. From the hyperplanes, A can be recovered by simply taking intersections.

However, multiple hyperplane identification is non-trivial, and the involved cost function

σ(A) = (1/N) · Σ_{t=1}^{N} min_{i=1,...,n} |a_i^⊤ X(t)| / ‖X(t)‖,    (5)

where a_i denote the columns of A, is highly non-convex. In order to improve the robustness of the proposed, stochastic identifier, we developed an identification algorithm using a generalized Hough transform [37]. Alternatively a generalization of k-means clustering can be used, which iteratively clusters the data into groups belonging to the different hyperplanes, and then identifies a hyperplane within the cluster by regression [38].
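The cost (5) itself is easy to evaluate for a candidate matrix; a minimal sketch (the function name and the column normalization are assumptions):

```python
import numpy as np

def sca_cost(A, X, eps=1e-12):
    """sigma(A) of Eq. (5): mean over samples of min_i |a_i^T x(t)| / ||x(t)||."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)   # unit-norm columns (assumed)
    proj = np.abs(A.T @ X)                             # |a_i^T x(t)|, shape (n, N)
    return float(np.mean(proj.min(axis=0) / (np.linalg.norm(X, axis=0) + eps)))
```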

In this paper, we assume sparseness of the sources S in the sense that at least one coefficient of S at a certain time instant has to be zero. In the case of s-EMG, the maximum natural firing rate of a motor unit is about 30 pulses/second, with each pulse lasting less than 15 ms [39]. Therefore, a motor unit is active less than 450 ms per second; that is, at least 55% of the time each source signal is zero. In addition, the firings of different motor units are not synchronized (only their respective firing rates show the tendency to change together, the so-called common drive [40]). For these reasons, the probability of all n sources firing at a given time instant is bounded by 0.45ⁿ, which quickly approaches zero for increasing n. Even in the case of only n = 3 sources, at most 9% of the samples are fully active. Hence the source conditions for SCA should be rather well fulfilled, and we can find isolated MUAPs inside an s-EMG signal using SCA with high probability.


2.3 Measures used for comparison

In order to compare the recovered signals with the artificial sources, we simply compare the mixing matrices. For this we employ Amari's separation performance index [41], which is given by the equation

E1(P) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} |p_ij| / max_k |p_ik| − 1 ) + Σ_{j=1}^{n} ( Σ_{i=1}^{n} |p_ij| / max_k |p_kj| − 1 )

where P = (p_ij) = Â⁺A, A being the real mixing matrix and Â⁺ the pseudoinverse of its estimation Â. Note that E1(P) ≤ 2n(n − 1). For both the artificial and the real signals, we calculate E1(Â₁⁺Â₂), where Âᵢ are the two recovered mixing matrices.
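A small sketch of this index (the function name is an assumption); it is zero when P is a scaled permutation matrix and grows with cross-talk:

```python
import numpy as np

def amari_index(P):
    """Amari separation performance index E1(P); 0 iff P is a scaled permutation."""
    P = np.abs(np.asarray(P, dtype=float))
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return float(rows.sum() + cols.sum())

# Typical use: compare two estimated mixing matrices A1, A2 via
# amari_index(np.linalg.pinv(A1) @ A2).
```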

Furthermore, we also want to study the recovered signals. So in order to be able to compare between different methods, to each pair of components obtained with each method, as well as to the source signals and the synthetic s-EMG channels, we apply the following equivalence measures: Principe's quadratic mutual information (QMI) [42], Kullback-Leibler information distance (K-LD) [43], Renyi's entropy measure [43], mutual information measure (MuIn) [42], Rosenblatt's squared distance functional (RSD) [43], Skaug and Tjøstheim's weighted difference (STW) [43] and cross-correlation (Xcor). All the measures are normalized with respect to the maximum value obtained when applied to each component with itself; that is, we divide by the maximum of each comparison matrix diagonal (maximum mutual information).

For the calculation of the above-mentioned indices it is necessary to estimate both the joint and the marginal probability density functions of the signals. We have decided to use the data-driven Kn-nearest-neighbour (KnNN) method [44] (for details, refer to [32]). Furthermore, in order to compare separation performance in the presence of noise, we measure the strength of a one-dimensional signal S versus additive noise S̃ using the signal-to-noise ratio (SNR) defined by

SNR(S, S̃) := 20 log₁₀ ( ‖S‖ / ‖S − S̃‖ ).

3 Results

In this section we compare the various models for source separation when applied to both toy and real s-EMG recordings.


Fig. 4. Mixing matrices recovered for the synthetic s-EMG using different methods; (a) ICA using joint approximate diagonalization of eigenmatrices (JADE), (b, c) nonnegative matrix factorization with different preprocessing (NMF, NMF∗), (d, e) sparse NMF (sNMF, sNMF∗) with the same two preprocessing methods as NMF, and (f) sparse component analysis (SCA).

3.1 Artificial signals

In the first example, we compare performance in the well-known setting of artificially created toy-signals.

3.1.1 Single s-EMG

For visualization, we will first analyze a single artificial s-EMG recording, and only later present batch-runs over multiple separate realizations to test for statistical robustness. As data set, we use toy signals as in section 3.1 but with only three source components for easier visualization. The ICA result is produced using JADE after PCA to 3 components. Please note that here and in the following we perform dimension reduction because in the small sensor volumes in question not many MUs are present. This is confirmed by considering the eigenvalue structure of the covariance matrix: taking the mean over 10 real s-EMG data sets — further discussed in section 3.2 — the ratio of the third to the first largest eigenvalue is only 0.11, and the ratio of the fourth to the first only 0.04. Taking sums, in the mean we lose only 5.7% of raw data by dimension reduction to the first three eigenvalues, which lies easily in the range of the noise level of typical data sets.

Fig. 5. One of the three recovered sources after applying the pseudoinverse A⁺ of the estimated mixing matrices to the synthetic s-EMG mixtures X; (a-f) results obtained using the different methods, see Fig. 4; (g) for comparison, also the original source signal is shown.

In order to reduce the ever-present permutation and scaling ambiguity, the columns of the recovered mixing matrix AJADE are normalized to unit length; furthermore, the column sign is chosen to give a positive sum of coefficients (because A is assumed to have only positive coefficients), and the columns are permuted such that the index of the maximum is increasing, Fig. 4(a). The three components are mostly active in channels 3, 4 or 5 respectively, which coincides with our construction (the real sources lie closest to sensors 4, 3 and 5 respectively). Fig. 5(a) shows one of the recovered sources, and for comparison, Fig. 5(g) shows the original source signal.

We subsequently apply NMF and sparse NMF to both X+ and X∗ (denoted by NMF, NMF∗, sNMF and sNMF∗ respectively); in all cases we achieve fast convergence. The mixing matrices obtained are shown in Fig. 4(b-e). In all four cases we observe a good recovery of the source locations. Cross-multiplication of these matrices with their pseudoinverses shows a high similarity, and the recovered source signals are similar and match the original source well, see Fig. 5(b-e).


Fig. 6. Inter-component performance index comparisons. (a) Comparison of the Amari index of matrices acquired from two different methods each in the case of synthetic s-EMG decomposition. As the comparison matrix is symmetric, half of its values have been omitted for clarity. (b) Mean of the inter-component mutual information for each method estimated by different measures. For comparison, the measures corresponding to the source signals (minimal indices) and to the channels of the mixed s-EMG (maximal indices) are also shown.

Finally, we perform SCA using a generalized Hough transform [37]. Note that there are also other algorithms possible for such extraction, and model generalizations are possible, see for example [45]. After whitening and dimension reduction, we perform Hough hyperplane identification with bin-size 180 and manually identify the maxima in the obtained Hough accumulator. The recovered mixing matrix is again visualized in Fig. 4(f). Similar to the previous results, the three components are roughly most active at locations 2 to 3, 4 and 5 respectively. Multiplication with the pseudoinverse of the recovered matrices from the above experiments shows that the result coincides quite well with the matrices from NMF, but differs slightly more from the ICA recovery.

We calculate and compare the Amari index for each method and, as shown in Fig. 6(a), no major differences can be detected. A comparison of the recovered sources using the indices from section 2.3 is given in Fig. 6(b), where also the indices corresponding to the original source signals (minimum mutual information values) and to the channels of the s-EMG (maximum values) have been added. All the methods separate the mixtures (improvement in terms of indices) but the methods yield somewhat different results, with JADE giving quite different sources than the rest of the methods, and NMF and NMF∗ performing rather similarly. In terms of source independence, of course JADE scores best ratings in the indices, as can be confirmed by calculating the Amari index of the recovered matrix with the original source matrix:




                JADE    NMF    NMF∗   sNMF   sNMF∗   SCA    sources
mean kurtosis   10.80   8.48   8.80   8.85   9.64    7.14   11.4
sparseness       0.286  0.268  0.295  0.270  0.302   0.201   0.252
σ(A✷)            2.91   2.02   2.11   1.99   2.36    0.96    3.24

Table 2
Sparseness measures of the recovered sources using the various methods. In the first row, the mean kurtosis is given (the higher, the more 'spiky' the signal). The second row gives the sparseness index from equation (3) (the higher, the sparser the signal), and the third row the cost function (5) employed in SCA (the lower, the better the SCA criterion is fulfilled). Optimal values are printed in bold face.

             A_JADE   A_NMF   A_NMF∗   A_sNMF   A_sNMF∗   A_SCA
E1(A⁺A✷)      1.39     3.27    2.96     3.41     2.56      3.86

JADE clearly outperforms the other methods; this will be explained in the next paragraph. However, we have to add that the signal generation additionally involves a slightly nonlinear filter, so we can only estimate the real mixing matrix A from the sources and the mixtures, which yields a non-negligible error. Hence this result indicates that JADE best approximates the linearized system.
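For readers who want to reproduce such matrix comparisons, the following is a minimal Python sketch of a normalized cross-talk (Amari-type) index of the product of an estimated separation matrix and the (estimated) true mixing matrix. The variable names and the exact normalization are our own assumptions; the definition actually used in section 2.3 of the paper may differ.

import numpy as np

def amari_index(P):
    # Cross-talk error of P = W @ A: zero if P is a scaled permutation
    # matrix, growing with the amount of residual mixing.
    P = np.abs(P)
    n = P.shape[0]
    row_err = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_err = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return (row_err.sum() + col_err.sum()) / (2.0 * n * (n - 1))

# hypothetical usage: A_true is the estimated real mixing matrix,
# A_rec the matrix recovered by JADE, NMF, sNMF or SCA
# index = amari_index(np.linalg.pinv(A_true) @ A_rec)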

One key question in this work is how the different models induce sparseness. Clearly, sparseness is a rather ambiguous term, so we calculate three indices (kurtosis, 'sparseness' from equation (3) and σ from (5)) for the real as well as the recovered sources obtained with the proposed methods, see Tab. 2. As expected, ICA gives the highest kurtosis among all methods, whereas sparse NMF yields the highest values of the sparseness criterion, and SCA has the lowest, i.e. best, value regarding k-sparseness as measured by σ. We further see that the kurtosis and the sparseness criterion seem to be somewhat related on this data set, as high values in both indices are achieved by both JADE and sNMF. The k-sparseness criterion, which fixes only the zero-(semi)norm, i.e. requires a fixed number of zeros in the data without additional requirements on the other values, does not induce as high a sparseness when measured using kurtosis or the sparseness index, and vice versa. Finally, by looking at the sparseness indices of the real sources, we can now understand why JADE outperformed the other methods in this toy data setting: the kurtosis of the sources is indeed high and their mutual information low, see Fig. 6(b). But in terms of sparseness and especially σ, the sources are not as sparse as expected. Hence the (sparse) NMF and mainly the SCA algorithm could not perform as well as JADE when compared to the original sources, as noted above. However, we will see that in the case of real s-EMG signals this distinction breaks down; furthermore, sparse NMF turns out to be more robust against noise than JADE, as is shown in the following.
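As an illustration of how such sparseness indices can be computed, the following Python sketch shows a per-source kurtosis and a Hoyer-style sparseness measure. Equations (3) and (5) are not reproduced in this excerpt, so these formulas are common stand-ins rather than the exact criteria used above.

import numpy as np

def mean_kurtosis(S):
    # mean excess kurtosis over the rows (sources) of S
    S = S - S.mean(axis=1, keepdims=True)
    m2 = (S ** 2).mean(axis=1)
    m4 = (S ** 4).mean(axis=1)
    return float(np.mean(m4 / m2 ** 2 - 3.0))

def hoyer_sparseness(s):
    # sparseness in [0, 1]: 0 for a flat signal, 1 for a single spike;
    # whether this coincides with equation (3) is an assumption here
    n = s.size
    l1, l2 = np.abs(s).sum(), np.sqrt((s ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

# hypothetical usage on a (sources x samples) array S_rec:
# print(mean_kurtosis(S_rec), np.mean([hoyer_sparseness(s) for s in S_rec]))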




Fig. 7. Boxplot of the separation performance when identifying 5 sources in 8-dimensional observed artificial s-EMG recordings. (a) shows the mean Amari index of the product of the identified separation and the real mixing matrix, and (b) depicts the SNR of the recovered sources versus the real sources. Mean and variance were taken over 100 runs.

3.1.2 Multiple s-EMG experiments<br />

We now show the performance of the above algorithms when applied to 100 different realizations of artificial s-EMG data sets. The data consist of 8 channels with 5 underlying source activities; more details about the data generation are given in section 2.1.1. The algorithm parameters are the same as above, with the exception that automated SCA is performed using Mangasarian-style clustering [38] after PCA dimension reduction to 5 dimensions.

The ICA and sparse BSS algorithms from above are applied to these data sets, and the Amari index as well as the SNR of the recovered sources versus the original sources are stored. In Fig. 7, the means of these two indices as well as their deviations are shown in a box plot, separately for each algorithm. As in the single s-EMG experiment, these statistics confirm that JADE performs best, both in terms of matrix and in terms of source recovery (which is more or less the same here, since we are still dealing with the noiseless case). The NMF algorithms identify the mixing matrix with acceptable performance; however, (sparse) NMF taking only positive samples (sNMF∗) tends to separate the data slightly better than with sample preprocessing using κ from equation (4). SCA cannot detect the mixing matrix as well as the other BSS algorithms, as again the SCA conditions seem to be somewhat violated; however, it performs adequately well in recovering the sources, because some sources are recovered very well, resulting in a higher SNR than for the NMF algorithms. For practical reasons, it is important to check to what extent the signal-to-interference ratio (SIR) between the channels is improved after applying the BSS algorithms. For each run, we monitor the SIR of the original sources and of the recoveries by taking the mean over all channels. The two SIR means are




divided to give a mean improvement ratio. Taking again the mean over the 100 runs yields the following improvements:

                        JADE   NMF    NMF∗   sNMF    sNMF∗   SCA
mean SIR improvement    4.15   2.54   2.87   0.701   1.90    3.08

This confirms our results; JADE works very well for preprocessing, but so do NMF∗ and, interestingly, SCA.
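For completeness, a minimal Python sketch of such a mean SIR computation is given below. The exact SIR definition used in the paper is not reproduced in this excerpt, so the recipe here (matching each estimate to the most correlated source and measuring target versus interference power) is an assumption.

import numpy as np

def mean_sir_db(S_true, S_est):
    # mean signal-to-interference ratio (dB) of the rows of S_est;
    # each estimate is matched to the most correlated true source and
    # split into a target part (its projection onto it) and interference
    sirs = []
    for e in S_est:
        c = [abs(np.corrcoef(e, s)[0, 1]) for s in S_true]
        s = S_true[int(np.argmax(c))]
        a = np.dot(e, s) / np.dot(s, s)
        target, interference = a * s, e - a * s
        sirs.append(10 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2)))
    return float(np.mean(sirs))

# hypothetical usage: the ratio of the recovered-source SIR to the
# channel SIR gives an improvement ratio as reported above
# improvement = mean_sir_db(S, S_rec) / mean_sir_db(S, X)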

In order to test the robustness of the methods against noise, we recorded s-EMG at 0% MVC, that is, when the muscle was totally relaxed; in this way we obtained a recording of noise only. As expected, the resulting recordings have nearly the same means and variances, and are close to Gaussian. This is confirmed by a Jarque-Bera test, which asymptotically tests the goodness-of-fit of an observed signal to a normal distribution. We recorded two different noise signals, and the test was positive in 11 out of 16 cases (at significance level α = 5%); the 5 exceptions had p-values not lower than 0.001. Furthermore, the noise is independent, as expected, because it has a close-to-diagonal covariance matrix (Amari index of 2.1, which is quite low for 8-dimensional signals). The noise is not fully i.i.d. but exhibits slight non-stationarity. Nonetheless, we take these findings as justification to assume additive Gaussian noise in the following. We will show mean algorithm performance over 50 runs for varying noise levels.
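A minimal Python sketch of such a normality check on the recorded noise channels is shown below; the array name and shape are assumptions.

import numpy as np
from scipy.stats import jarque_bera

def normality_report(noise, alpha=0.05):
    # noise: (channels, samples) array recorded at 0% MVC;
    # returns per channel the Jarque-Bera statistic, the p-value and
    # whether normality is rejected at level alpha
    report = []
    for ch in noise:
        stat, p = jarque_bera(ch)
        report.append((stat, p, p < alpha))
    return report

# a close-to-diagonal covariance additionally indicates decorrelated channels:
# cov = np.cov(noise)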

Note that due to the Gaussian noise, the models (especially the NMF model, which already uses such a Gaussian error term) hold well under the additive noise. We multiplied this noise signal progressively by 0, 0.01, 0.05, 0.1, 0.5, 1 and 5 (which corresponds to mean source SNRs of ∞, 36, 22, 16, 2.1, -3.9 and -18 dB) and then added each of the obtained signals to a randomly generated synthetic s-EMG containing 5 sources as above. The Amari index was calculated for each method and for each noise level. We thus obtained the comparative graph shown in Fig. 8. Interestingly, sparse NMF∗ outperforms JADE in all cases, which indicates that the sNMF model (which already includes noise) works best in cases of slight to stronger additive noise, making it very well adapted to real applications. Again SCA performs somewhat problematically, but it separates the data well at the noise level of -3.9 dB. We believe that this is due to the thresholding parameter involved in SCA hyperplane detection; apparently an adaptive choice of this parameter would be necessary in order to achieve a separation as good as in the case of an SNR of -3.9 dB.
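The mean source SNR values quoted above follow from the usual power ratio; a minimal Python sketch of the noise-scaling loop is given below (array names are assumptions).

import numpy as np

def mean_snr_db(signal, noise):
    # mean SNR in dB of the signal rows against the (scaled) noise rows
    ps = np.mean(signal ** 2, axis=1)
    pn = np.mean(noise ** 2, axis=1)
    return float(np.mean(10 * np.log10(ps / pn)))

# scales as in the text; a scale of 0 corresponds to the noise-free case
# for scale in [0.01, 0.05, 0.1, 0.5, 1, 5]:
#     X_noisy = X + scale * noise
#     # run each BSS method on X_noisy and record the Amari index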




Fig. 8. Result of the experiment testing the robustness of the different methods against noise: median Amari index versus mean SNR (dB) for JADE, NMF, NMF∗, sNMF, sNMF∗ and SCA. Plotted is the mean over 50 runs.

Fig. 9. Real s-EMG (eight channels, in V, plotted over time in ms) obtained from a healthy subject performing a sustained contraction at 30% MVC.




Fig. 10. Mixing matrices (channel versus signal index) recovered for the real s-EMG using different methods; (a) JADE, (b, c) NMF with different preprocessing, (d, e) sparse NMF with the same two preprocessing methods as NMF, and (f) SCA.

3.2 Real s-EMG signals<br />

In this section we analyze real s-EMG recordings obtained from ten healthy subjects. At first we again study a single s-EMG and plot it for visual inspection, and then show statistics over multiple subjects. The first data set has been obtained from a single subject performing a sustained contraction at 30% MVC, see Fig. 9. The signal acquisition and preprocessing are described in section 2.1.2.

We initially use JADE as the ICA algorithm of choice. The estimated mixing matrix is visualized in Fig. 10(a). As in the case of the synthetic signals, we then apply NMF and sparse NMF to both X+ and X∗ and obtain the recovered mixing matrices visualized in Fig. 10(b-e).

We face the following problem when recovering the source signals by SCA. If we use PCA to reduce to n = 3 dimensions, we cannot achieve convergence; a generalized Hough plot [37] does not reveal such structure either. Hence we choose dimension reduction to n = 2. In the two-dimensional projected mixtures, the data clearly cluster along two lines, so the assumption of sparseness holds in 2 dimensions. We use Mangasarian-style clustering / SCA (similar to k-means) [38] to recover these directions. The mixing matrix recovered in this way in





Fig. 11. Inter-component performance index comparisons; (a) comparison of the Amari index; (b) mean of various source dependence measures.

                JADE    NMF    NMF∗   sNMF   sNMF∗   SCA
mean kurtosis    4.97    4.74   4.80   4.81   4.80    4.82
sparseness       0.387   0.424  0.413  0.408  0.407   0.405
σ(A✷)            0.76    0.50   0.55   0.60   0.62    0.70

Table 3
Comparison of the sparseness of the recovered sources (real s-EMG data).

two dimensions is plotted in Fig. 10(f). Note that the SCA matrix columns match two columns of the mixing matrices found by the other methods.

Similar to the toy data set, we compare the recoveries of the different methods using the various indices from section 2.3 for mixing matrices and recovered sources, see Fig. 11. In contrast to the artificial signals, all methods here yield rather similar performance; the mean Amari indices are roughly half the value of the indices in the toy data setting. This confirms that the methods recover rather similar sources.

As in the case of artificial signals, we again compare the sparseness of the recovered sources, see Tab. 3. Due to the noise present in the real data, the signals are clearly less sparse than the artificial data. Furthermore, a comparison between the various methods yields noticeably smaller differences than in the case of artificial signals. At first glance it seems unclear why SCA performs worse in terms of k-sparseness (using σ(A✷)) than the NMF methods, but better than JADE. This, however, can be explained by the fact that the PCA dimension reduction reduces the number of parameters, so SCA cannot search the whole space and therefore performs worse than NMF in that respect. It still outperforms JADE, which also uses PCA preprocessing.




[Figure 12 shows the component given by each method (a.u.): panels (a) JADE, (b) NMF, (c) NMF*, (d) sNMF, (e) sNMF*, (f) SCA and (g) s-EMG, plotted over time (ms).]

Fig. 12. One of the three recovered sources after applying A⁺ to s-EMG data recorded at 30% MVC; (a-f) results obtained using the different methods; (g) original source signal.

In order to draw statistically more relevant conclusions, we compare the various BSS algorithms on s-EMGs of nine subjects, recorded at 30% MVC. Fig. 12 plots a single extracted source for each BSS algorithm; Fig. 12(g) shows the s-EMG channel in which the dominant source signal is chosen for comparison. One problem of performing batch comparisons, however, lies in the fact that separation performance is commonly evaluated by visual inspection or at most by comparison with other separation methods, because the original sources are of course unknown. We cannot produce plots similar to Fig. 12 for each subject, so in order to provide a more objective measure, we consider the main application of BSS for real s-EMG analysis: preprocessing in order to achieve 'cleaner' data for template matching.

A common measure for this is to count the number of zero-crossings of the sources (after combining them into a full eight-dimensional observation vector by taking only the maximally active source in each channel). Note that this zero-crossing count is already output by the MDZ filter. It is directly related to the number of MUAPs present in an s-EMG signal [46], and by comparing this index before and after the BSS algorithms, we can analyze whether and to what extent they actually enhance the signal. With the aid of the MDZ filter, we count the number of waves (the excursion of the signal between two




subject ID        JADE     NMF      NMF∗     sNMF     sNMF∗    SCA
b                  9%       7%       9%      10%      11%      10%
f                  5%       3%       5%       4%       5%       4%
g                 37%      35%      35%      35%      36%      30%
k                 35%      35%      35%      35%      34%      38%
m                 16%      16%      17%      16%      16%      13%
og                 9%       8%       8%       8%       8%      10%
ok                 3%       7%       9%       9%      10%       7%
s                 41%      42%      42%      41%      41%      37%
y                 71%      71%      73%      71%      73%      74%
means             25.0%    25.1%    25.7%    25.4%    25.8%    24.6%
std. deviations   21.4%    21.4%    21.3%    20.9%    21.0%    21.4%

Table 4
Negative zero-crossing mean ratios, i.e. relative enhancements, for each subject, together with the mean performance of each algorithm.

consecutive zero crossings) and take the mean over all channels before and after applying each of the BSS algorithms. We then subtract the latter from the former value and divide by the initial number of waves in order to have an index that can be compared between different signals. Tab. 4 shows the resulting ratios for the nine subjects. All BSS algorithms result in a reduction of zero-crossings, and the best results per run are achieved by NMF∗, sNMF∗ and SCA. We see that in the mean all algorithms perform somewhat similarly, with sNMF∗ being best and SCA being worst. The best algorithm for this data set, sNMF∗, achieved a mean reduction of the number of waves of 25.8%, which means that after applying the algorithm roughly one fourth of the original waves in each channel are, on average, removed, making the template-matching technique easier to apply.
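A minimal Python sketch of this wave-count reduction index is given below. The MDZ filter itself is not reproduced here, so plain zero-crossing counting on already filtered channels is used as a stand-in, and the array names are assumptions.

import numpy as np

def wave_count(x):
    # number of waves, i.e. signal excursions between consecutive zero crossings
    signs = np.sign(x[np.nonzero(x)])
    return int(np.sum(signs[1:] != signs[:-1]))

def relative_reduction(channels_before, channels_after):
    # (mean waves before - mean waves after) / mean waves before, as in Tab. 4
    before = np.mean([wave_count(c) for c in channels_before])
    after = np.mean([wave_count(c) for c in channels_after])
    return (before - after) / before

# hypothetical usage: X are the recorded channels, X_bss the channels rebuilt
# from the maximally active recovered source per channel
# print(100 * relative_reduction(X, X_bss), '%')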

4 Discussion<br />

The main focus of this work lies in the application of three different sparse BSS models, namely source independence, (sparse) nonnegativity and k-sparseness, to the analysis of s-EMG signals. This application is motivated by the fact that the underlying MUAPTs exhibit properties (mainly sparseness) that fit quite well to these three in principle different models. Furthermore, we take an interest in how well these models behave in the case of slightly perturbed initial conditions (ICA, for instance, is known to be quite robust against




small errors in the independence assumption) and in how they can deal with additional noise.

In the first example of artificially created mixtures, we were able to demonstrate that a decomposition analysis using the three models is possible, and we gave comparisons over larger data sets. Although the recovered sources are all rather alike (Fig. 5 and Fig. 7), we found that ICA outperformed the other methods concerning the distance to the real solution, mainly because the artificial sources (but not the real signals, see below) fitted the ICA model best, Tab. 2. However, when considering additional noise with increasing power, sparse NMF turned out to be more robust than the ICA model, Fig. 8.

We then applied the BSS methods to real s-EMG data sets, and the three different models yielded surprisingly similar results, although in theory these models do not fully overlap. We speculate that this similarity is due to the fact that, as most probably in all applications, the models do not fully hold. This allows the various algorithms only to approximate the model solutions, and hence to arrive at similar solutions, but from different directions. Furthermore, this indicates that the three models look for different properties of the sources, and that these properties are fulfilled to varying extents, see Tab. 3 for numerical details. Comparisons over s-EMG data sets from multiple subjects again confirmed this rough equality of performance, where again sparse NMF slightly outperformed the other algorithms in the mean (in nice correspondence with the noise result from above).

Note that the aim of the present work is not the full recovery of a target MUAPT (source signal) in its original form. In fact, as shown previously [18], it is sufficient to increase the amplitude of a target MUAPT so that on average it lies above the noise and above the level of the other MUAPTs. Then we are able to cut off the interfering signals with a modified dead zone filter and thus isolate the target MUAPT. Indeed, all of the employed sparse BSS methods fulfill this requirement, which is confirmed by the decrease in the number of zero-crossings of the separated signals, see Tab. 4.

5 Conclusion<br />

We have compared the effectiveness of various sparse BSS methods for signal decomposition, namely ICA, NMF, sparse NMF and SCA, when applied to s-EMG signals. Surface EMG signals represent an attractive test signal, as they approximately fulfill all the requirements of these methods and are of major importance in medical diagnosis and basic neurophysiological research.




All methods, in spite of being based on very different approaches, gave similar results in the case of real data and decomposed the signals sufficiently well. We therefore suggest using sparse BSS as a preprocessing tool before applying the common template matching technique. In terms of algorithm comparisons, ICA performed better than the other algorithms in the noiseless case; however, sparse NMF∗ outperformed the other methods when noise was added, and slightly so in the case of multiple real s-EMG recordings. In later work on methods of s-EMG decomposition, we therefore want to focus on properties and possible improvements of sparse NMF regarding parameter choice (which level of sparseness to choose) and signal preprocessing (in order to deal with positive signals). Preprocessing to improve sparseness is currently being studied. In order to be better able to compare methods, we also plan to apply them to artificial sources generated using other available s-EMG generators [47]. Finally, extensions to convolutive mixing situations will have to be analyzed.

Acknowledgements<br />

We would like to thank Dr. S. Ra<strong>in</strong>ieri for helpful discussion and K. Maekawa<br />

for the first version of the s-EMG generator. This work was partly supported<br />

by the M<strong>in</strong>istry of Education, Culture, Sports, Science, and Technology of<br />

Japan (Grant-<strong>in</strong>-Aid for Scientific Research). G.A.G. is supported by a grant<br />

from the same M<strong>in</strong>istry. F.T. gratefully acknowledges partial f<strong>in</strong>ancial support<br />

by the DFG (GRK 638) and the BMBF (project ‘ModKog’).<br />

References<br />

[1] A. Hyvär<strong>in</strong>en, J. Karhunen, E. Oja, <strong>Independent</strong> <strong>Component</strong> <strong>Analysis</strong>, John<br />

Wiley & Sons, 2001.<br />

[2] A. Cichocki, S. Amari, Adaptive bl<strong>in</strong>d signal and image process<strong>in</strong>g, John Wiley<br />

& Sons, 2002.<br />

[3] P. Georgiev, F. Theis, A. Cichocki, Bl<strong>in</strong>d source separation and sparse<br />

component analysis of overcomplete mixtures, <strong>in</strong>: Proc. ICASSP 2004, Vol. 5,<br />

Montreal, Canada, 2004, pp. 493–496.<br />

[4] P. Georgiev, F. Theis, A. Cichocki, Sparse component analysis and bl<strong>in</strong>d source<br />

separation of underdeterm<strong>in</strong>ed mixtures, IEEE Trans. Neural Networks <strong>in</strong> press.<br />

[5] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.




[6] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit,<br />

SIAM J. Sci. Comput. 20 (1) (1998) 33–61.<br />

[7] F. Theis, E. Lang, C. Puntonet, A geometric algorithm for overcomplete l<strong>in</strong>ear<br />

ICA, Neurocomput<strong>in</strong>g 56 (2004) 381–398.<br />

[8] M. Zibulevsky, B. Pearlmutter, Bl<strong>in</strong>d source separation by sparse decomposition<br />

<strong>in</strong> a signal dictionary, Neural Computation 13 (4) (2001) 863–882.<br />

[9] M. Lewicki, T. Sejnowski, Learn<strong>in</strong>g overcomplete representations, Neural<br />

Computation 12 (2) (2000) 337–365.<br />

[10] B. Olshausen, D. Field, Emergence of simple-cell receptive field properties by<br />

learn<strong>in</strong>g a sparse code for natural images, Nature 381 (6583) (1996) 607–609.<br />

[11] M. D. Plumbley, E. Oja, A ’non-negative pca’ algorithm for <strong>in</strong>dependent<br />

component analysis, IEEE Transactions on Neural Networks 15 (1) (2004) 66–<br />

76.<br />

[12] E. Oja, M. D. Plumbley, Bl<strong>in</strong>d separation of positive sources us<strong>in</strong>g non-negative<br />

pca, <strong>in</strong>: Proc. ICA 2003, Nara, Japan, 2003, pp. 11–16.<br />

[13] W. Liu, N. Zheng, X. Lu, Non-negative matrix factorization for visual cod<strong>in</strong>g,<br />

<strong>in</strong>: Proc. ICASSP 2003, Vol. III, 2003, pp. 293–296.<br />

[14] P. Hoyer, Non-negative matrix factorization with sparseness constra<strong>in</strong>ts,<br />

Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research 5 (2004) 1457–1469.<br />

[15] J. Basmajian, C. D. Luca, Muscle Alive. Their Functions Revealed by<br />

Electromyography, 5th Edition, Williams & Wilk<strong>in</strong>s, Baltimore, 1985.<br />

[16] A. Halliday, S. Butler, R. Paul, A Textbook of Cl<strong>in</strong>ical Neurophysiology, John<br />

Wiley & Sons, New York, 1987.<br />

[17] D. Far<strong>in</strong>a, R. Merletti, R. Enoka, The extraction of neural strategies from the<br />

surface EMG, Journal of Applied Physiology 96 (2004) 1486–1495.<br />

[18] G. García, R. Okuno, K. Akazawa, Decomposition algorithm for surface<br />

electrode-array electromyogram <strong>in</strong> voluntary isometric contraction, IEEE<br />

Eng<strong>in</strong>eer<strong>in</strong>g <strong>in</strong> Medic<strong>in</strong>e and Biology Magaz<strong>in</strong>e <strong>in</strong> press.<br />

[19] C. Disselhorst-Klug, J. Silny, G. Rau, Estimation of the relationship<br />

between the non<strong>in</strong>vasively detected activity of s<strong>in</strong>gle motor units and their<br />

characteristic pathological changes by modell<strong>in</strong>g, Journal of Electromyography<br />

and K<strong>in</strong>esiology 8 (1998) 323–335.<br />

[20] S. Andreassen, A. Rosenfalck, Relationship of <strong>in</strong>tracellular and extracellular<br />

action potentials of skeletal muscle fibers, CRC critical reviews <strong>in</strong> bioeng<strong>in</strong>eer<strong>in</strong>g<br />

6 (4) (1981) 267–306.<br />

[21] H. Clamann, Statistical analysis of motor unit fir<strong>in</strong>g patterns <strong>in</strong> a human<br />

skeletal muscle, Biophysical Journal 9 (1969) 1233–1251.<br />




[22] P. Griep, F. Gielen, H. Boom, K. Boon, L. Hoogstraten, C. Pool, W. Wall<strong>in</strong>ga-<br />

De-Jonge, Calculation and registration of the same motor unit action potential,<br />

Electroenceph Cl<strong>in</strong> Neurophysiol 53 (1973) 388–404.<br />

[23] E. Stålberg, J. Trontelj, Single Fiber Electromyography, The Mirvalle Press, Old Woking, UK, 1979.

[24] S. Maekawa, T. Arimoto, M. Kotani, Y. Fujiwara, Motor unit decomposition<br />

of surface emg us<strong>in</strong>g multichannel bl<strong>in</strong>d deconvolution, <strong>in</strong>: Proc. ISEK 2002,<br />

Vienna, Austria, 2002, pp. 38–39.<br />

[25] K. McGill, K. Cumm<strong>in</strong>s, L. Dorfman, Automatic decomposition of the cl<strong>in</strong>ical<br />

electromyogram, IEEE Transactions on Biomedical Eng<strong>in</strong>eer<strong>in</strong>g 32 (7) (1985)<br />

470–477.<br />

[26] P. Comon, <strong>Independent</strong> component analysis - a new concept?, Signal Process<strong>in</strong>g<br />

36 (1994) 287–314.<br />

[27] F. Theis, A new concept for separability problems <strong>in</strong> bl<strong>in</strong>d source separation,<br />

Neural Computation 16 (2004) 1827–1850.<br />

[28] J. Hérault, C. Jutten, Space or time adaptive signal process<strong>in</strong>g by neural<br />

network models, <strong>in</strong>: J. Denker (Ed.), Neural Networks for Comput<strong>in</strong>g.<br />

Proceed<strong>in</strong>gs of the AIP Conference, American Institute of Physics, New York,<br />

1986, pp. 206–211.<br />

[29] J.-F. Cardoso, A. Souloumiac, Jacobi angles for simultaneous diagonalization,<br />

SIAM J. Mat. Anal. Appl. 17 (1) (1995) 161–164.<br />

[30] A. Yeredor, Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation, IEEE Trans. Signal Processing 50 (7) (2002) 1545–1553.

[31] A. Ziehe, P. Laskov, K.-R. Mueller, G. Nolte, A l<strong>in</strong>ear least-squares algorithm<br />

for jo<strong>in</strong>t diagonalization, <strong>in</strong>: Proc. of ICA 2003, Nara, Japan, 2003, pp. 469–474.<br />

[32] G. García, K. Maekawa, K. Akazawa, Decomposition of synthetic multi-channel<br />

surface-electromyogram us<strong>in</strong>g <strong>in</strong>dependent component analysis, <strong>in</strong>: Proc. ICA<br />

2004, Vol. 3195 of Lecture Notes <strong>in</strong> Computer Science, Granada, Spa<strong>in</strong>, 2004,<br />

pp. 985–991.<br />

[33] S. Takahashi, Y. Sakurai, M. Tsukada, Y. Anzai, Classification of neural<br />

activities from tetrode record<strong>in</strong>gs us<strong>in</strong>g <strong>in</strong>dependent component analysis,<br />

Neurocomput<strong>in</strong>g 49 (2002) 289–298.<br />

[34] A. Hyvär<strong>in</strong>en, E. Oja, A fast fixed-po<strong>in</strong>t algorithm for <strong>in</strong>dependent component<br />

analysis, Neural Computation 9 (1997) 1483–1492.<br />

[35] P. Paatero, U. Tapper, Positive matrix factorization: A non-negative factor<br />

model with optimal utilization of error estimates of data values, Environmetrics<br />

5 (1994) 111–126.<br />




[36] D. Lee, H. Seung, Algorithms for non-negative matrix factorization, <strong>in</strong>:<br />

Advances <strong>in</strong> Neural Information Process<strong>in</strong>g (Proc. NIPS 2000), Vol. 13, MIT<br />

Press, 2000, pp. 556–562.<br />

[37] F. Theis, P. Georgiev, A. Cichocki, Robust overcomplete matrix recovery for<br />

sparse sources us<strong>in</strong>g a generalized hough transform, <strong>in</strong>: Proc. ESANN 2004,<br />

d-side, Evere, Belgium, Bruges, Belgium, 2004, pp. 343–348.<br />

[38] P. Bradley, O. Mangasarian, k-plane cluster<strong>in</strong>g, Journal of Global Optimization<br />

16 (1) (2000) 23–32.<br />

[39] C. D. Luca, Physiology and mathematics of myoelectric signals, IEEE<br />

Transactions on Biomedical Eng<strong>in</strong>eer<strong>in</strong>g 26 (6) (1979) 313–325.<br />

[40] C. D. Luca, R. LeFever, M. McCue, A. Xenakis, Control scheme govern<strong>in</strong>g<br />

concurrently active human motor units dur<strong>in</strong>g voluntary contractions, Journal<br />

of Physiology 329 (1982) 129–142.<br />

[41] S. Amari, A. Cichocki, H. Yang, A new learn<strong>in</strong>g algorithm for bl<strong>in</strong>d signal<br />

separation, Advances <strong>in</strong> Neural Information Process<strong>in</strong>g Systems 8 (1996) 757–<br />

763.<br />

[42] D. Xu, J. Pr<strong>in</strong>cipe, J. F. III, H.-C. Wu, A novel measure for <strong>in</strong>dependent<br />

component analysis (ICA), <strong>in</strong>: Proc. ICASSP 1998, Vol. 2, Seattle, 1998, pp.<br />

1161–1164.<br />

[43] D. Tjøstheim, Measures of dependence and tests of <strong>in</strong>dependence, Statistics 28<br />

(1996) 249–282.<br />

[44] R. Duda, P. Hart, D. Stork, Pattern classification, 2nd Edition, Wiley, New<br />

York, 2001.<br />

[45] F. Theis, S. Amari, Postnonl<strong>in</strong>ear overcomplete bl<strong>in</strong>d source separation us<strong>in</strong>g<br />

sparse sources, <strong>in</strong>: Proc. ICA 2004, Vol. 3195 of Lecture Notes <strong>in</strong> Computer<br />

Science, Granada, Spa<strong>in</strong>, 2004, pp. 718–725.<br />

[46] P. Zhou, Z. Erim, W. Rymer, Motor unit action potential counts <strong>in</strong> surface<br />

electrode array EMG, <strong>in</strong>: Proc. IEEE EMBS 2003, Cancun, Mexico, 2003, pp.<br />

2067–2070.<br />

[47] B. Freriks, H. Hermens, European Recommendations for<br />

surface electromyography, results of the SENIAM project, Roess<strong>in</strong>gh Research<br />

and Development b.v. (CD-ROM), 1999.<br />



Chapter 21<br />

Proc. BIOMED 2005, pages 209-212<br />

Paper F.J. Theis, Z. Kohl, H.G. Kuhn, H.G. Stockmeier, and E.W. Lang. Automated<br />

count<strong>in</strong>g of labelled cells <strong>in</strong> rodent bra<strong>in</strong> section images. In Proc.<br />

BioMED 2004, pages 209-212, Innsbruck, Austria, 2004. ACTA Press,<br />

Canada<br />

Reference (Theis et al., 2004c)<br />

Summary <strong>in</strong> section 1.6.2<br />




Automated count<strong>in</strong>g of labelled cells <strong>in</strong> rodent bra<strong>in</strong> section images<br />

F.J. Theis 1 , Z. Kohl 2 , H.G. Kuhn 2 , H.G. Stockmeier 1 and E.W. Lang 1<br />

1 Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany<br />

2 Department of Neurology, University of Regensburg, 93053 Regensburg, Germany<br />

email: fabian@theis.name<br />

ABSTRACT<br />

The genesis of new cells, especially of neurons, <strong>in</strong> the adult<br />

human bra<strong>in</strong> is currently of great scientific <strong>in</strong>terest. In order<br />

to measure neurogenesis <strong>in</strong> animals new born cells are<br />

labelled with specific markers such as BrdU; <strong>in</strong> bra<strong>in</strong> sections<br />

these can later be analyzed and counted through the<br />

microscope. So far, the image analysis has been performed<br />

by hand. In this work, we present an algorithm to automatically<br />

segment the digital bra<strong>in</strong> section picture <strong>in</strong>to cell<br />

and noncell components, giv<strong>in</strong>g a count of the number of<br />

cells <strong>in</strong> the section. This is done by first tra<strong>in</strong><strong>in</strong>g a so-called<br />

cell classifier with cell and non-cell patches <strong>in</strong> a supervised<br />

manner. This cell classifier can later be used <strong>in</strong> an arbitrary<br />

number of sections by scann<strong>in</strong>g the section and choos<strong>in</strong>g<br />

maxima of this classifier as cell center locations. For tra<strong>in</strong><strong>in</strong>g,<br />

s<strong>in</strong>gle- and multi-layer perceptrons were used. In prelim<strong>in</strong>ary<br />

experiments, we get good performance of the classifier.<br />

KEY WORDS<br />

Cell count<strong>in</strong>g, image segmentation, cell classification, neurogenesis,<br />

BrdU<br />

1 Biological background<br />

1.1 New neurons <strong>in</strong> the adult bra<strong>in</strong><br />

During the last decades, the fact that new neurons are continuously generated in the adult mammalian brain, a phenomenon termed adult neurogenesis, came more and more into the focus of neuroscience research [1][2][7]. Under physiological conditions, neuroscientists found that adult neurogenesis seems to be restricted to two brain regions: the wall of the lateral ventricle and the granular cell layer of the hippocampus.

A large variety of factors including environmental signals, trophic factors, hormones and neurotransmitters have recently been identified to regulate the generation of new neurons in the adult brain. These studies were typically performed using a combination of different histological techniques, such as non-radioactive labeling of newly generated cells, stereological counting and confocal microscope analysis, in order to quantitatively analyze adult neurogenesis (review in [8]). However, this procedure is time consuming, since histological analysis currently depends on the assessment of positive signals in histological sections by individual investigators through manual or semiautomatic counting.

1.2 Method used<br />

Bromodeoxyuridine (BrdU), a thymidine analog, is given systemically and is integrated into the replicating DNA during cell division [3]. Using a specific antibody against BrdU, labelled cells can be detected by an immunohistochemical staining procedure. The nuclei of labelled cells on 40 µm thick brain sections appear in a dark brown or black dense color. To determine the number of BrdU-positive cells in the granular cell layer of the hippocampus, they were counted on a light microscope (Olympus IX 70; Hamburg, Germany) with a 20× objective. Digital images with a resolution of 1600 × 1200 pixels were taken by a color video camera using the analySIS software system (Soft Imaging System, Münster, Germany).

2 Automated count<strong>in</strong>g<br />

Figure 1 shows a section image in which the cells are to be counted.

Classical approaches such as thresholding and erosion after image normalization were not successful, mainly because cell clusters in the image cannot be detected properly and counted with this method.

We decided to adapt a method proposed by Nattkemper et al. [9] to evaluate fluorescence micrographs of lymphocytes invading human tissue. The main idea is to build, in a first step, a function mapping an image patch to a confidence value in [0, 1], indicating how probably a cell lies in this patch; we call this function the cell classifier. In the second step this function is used as a local filter on the whole image; its application gives a probability distribution over the whole image with local maxima at the cell positions. Nattkemper et al. call this distribution the confidence map. Maxima analysis of the confidence map reveals the number and the positions of the cells (image segmentation).

3 Cell classifier<br />

In this section, we will expla<strong>in</strong> how to generate a cell classifier<br />

that is a function mapp<strong>in</strong>g image patches to cell confidence<br />

values. For this we will generate a sample set of cells



Figure 1. Typical section image and, below, the manually bounded but automatically labelled image. In order to indicate the image size, a black scale bar of length 50 µm is given at the bottom of the top image. Here, the number of counted cells within the boundary (region of interest) is 84, and the number of cells in the whole image is 116.

and non-cells, and then tra<strong>in</strong> an artificial neural network to<br />

this sample set.<br />

3.1 Sample set<br />

After fixing the patch size (in the following we will use 20 by 20 pixel grey-level image patches), a training set of cell and non-cell patches has to be generated manually by the expert. These image patches are then to be classified by a neural network. Figure 2 shows some cell and non-cell patches.

Interpreting each 20 by 20 image patch as a 400-dimensional vector, we thus get a set of L training vectors

T := {(x1, t1), . . . , (xL, tL)}

with xi ∈ R^n (here n = 20^2 = 400) representing the image patch and ti ∈ {0, 1} depending on

Figure 2. Part of the tra<strong>in</strong><strong>in</strong>g set: The first row consists of<br />

20x20-pixel image patches that conta<strong>in</strong> cells, the lower row<br />

consists of non-cell image patches.<br />

whether xi is a non-cell or a cell. The goal is to find a mapping that correctly classifies this data set, that is, a mapping ζ : R^n → [0, 1] with ζ(xi) = ti for i = 1, . . . , L. We call such a mapping a cell classifier. Of course ζ is not uniquely defined by the above property, so some regularization has to be introduced. Any interpolation technique such as Fourier or Taylor approximation could be used to find ζ; we will use single- and multi-layer perceptrons as explained below.

3.2 Preprocess<strong>in</strong>g<br />

Before we apply neural network learning, we preprocess the data as follows: after mean removal, we apply principal component analysis (PCA) in order to reduce the data set dimension as well as to decorrelate the data in a first separation step. This is achieved by diagonalizing the data set covariance and projecting along the eigenvectors with the largest eigenvalues.

By taking only the first 5 eigenvalues of the training set, projection along those first 5 principal axes still retains 95% of the data variance. Thus, the 400-dimensional data space was reduced to a whitened 5-dimensional data set.
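A minimal Python sketch of this preprocessing step is shown below, assuming the patches are stacked as rows of a matrix; the variable names and the explicit whitening are assumptions consistent with the description above.

import numpy as np

def pca_whiten(X, n_components=5):
    # mean removal, PCA projection to n_components and whitening;
    # X is an (n_samples, n_features) array of vectorized 20x20 patches
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:n_components]
    W = eigvec[:, order]
    Z = (Xc @ W) / np.sqrt(eigval[order])           # whitened projection
    retained = eigval[order].sum() / eigval.sum()
    return Z, retained

# hypothetical usage: Z, frac = pca_whiten(patches.reshape(len(patches), -1))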

A visualization of the 120-sample data set is given in figure 3, here after projection to 3 dimensions. One can easily see that the cell and non-cell components can be linearly separated; thus a perceptron (see the next section) can indeed already learn the cell classifier. Furthermore, a k-means clustering algorithm has been applied with k = 2 in order to find the two data clusters. These correspond directly to the cell/non-cell components, see the figure.

3.3 Neural network learn<strong>in</strong>g<br />

Supervised learning algorithms try to approximate a given function f : R^n → A ⊂ R^m by using a number of given sample-observation pairs (xλ, f(xλ)) ∈ R^n × A. We will restrict ourselves to feed-forward layered neural networks [4]; we found that simple single-layered neural networks (perceptrons), in comparison to multi-layered networks (MLPs), already sufficed to learn the data set well, and they have the additional advantage of easier rule extraction and interpretation.



Figure 3. Data set with 120 samples after 3-dimensional PCA projection (91% of the data was retained). The dots mark the 60 samples representing cells, the x's mark the 60 non-cell data points. The two circles indicate clusters of a k-means application with a search for two clusters. Obviously, k-means nicely differentiates between the cell and the non-cell components.

A perceptron with output dimension 1 consists of a single neuron only, so the output function y can be written as

y(x) = θ(w⊤x + w0)

with weight vector w ∈ R^n (n the input dimension), bias w0 ∈ R and, as activation function θ, the Heaviside function (θ(x) = 0 for x < 0 and θ(x) = 1 for x ≥ 0). Often, the bias w0 is added as an additional weight to w with fixed input 1.

Learning in a perceptron means minimizing an error energy function, here the summed squared difference between the network outputs and the targets. This can be performed, for example, by gradient descent with respect to w and w0, which induces the well-known delta rule for the weight update

∆w = η (t − y(x)) x,

where η denotes a chosen learning rate, y(x) the output of the neural network at sample x and t the target for input x. It is easy to see that a perceptron separates the data linearly, with the boundary hyperplane given by {x ∈ R^n | w⊤x + w0 = 0}.
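A minimal Python sketch of this learning rule is given below. The patch vectors and labels are assumed to be available as numpy arrays, and the linear-activation variant mentioned further down is used so that the output can serve as a confidence value.

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=55):
    # delta-rule training of a single linear unit;
    # X: (n_samples, n_features) preprocessed patches, t: targets in {0, 1};
    # returns the weight vector with the bias appended as last component
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias as extra input of 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            y = x @ w                            # linear activation
            w += eta * (target - y) * x          # delta rule
    return w

def classify(w, x):
    # confidence value in the spirit of the cell classifier, clipped to [0, 1]
    return float(np.clip(np.append(x, 1.0) @ w, 0.0, 1.0))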

For the cell classifier, we use a single-unit perceptron with a linear activation function in order to obtain a measure of the certainty of the cell/non-cell classification. Applying delta-learning to the 5-dimensional data set from above gives excellent performance after only 4 epochs of batch learning. The final performance error (the variance of the perceptron's estimation error on the training set) after 55 epochs was 0.0038, which confirms the good performance as well as the linearity of the classification problem. This was further shown when we used a two-layered network with 5 hidden neurons in order to test for nonlinearities in the data set. After only 10 epochs the error was already very small, and it could finally be diminished to 3 · 10^-19. Still, the performance of the perceptron is more than sufficient for classification.

4 Confidence map<br />

4.1 Generation<br />

The cell classifier from above only has to be trained once. Given such a cell classifier, section pictures can now be analyzed as follows.

A pixelwise scan of the image gives an image patch with center location at the scan point; the cell classifier is then applied to this image patch to give a probability of whether or not a cell is at the given location. Altogether (after image extension at the boundaries) this yields a probability distribution over the whole image, which is called the confidence map. Each point of the confidence map is a value in [0, 1] stating how probable it is that a cell is depicted at the specified location.

In practice a pixelwise scan is too expensive in terms of calculation time, so instead a grid value (say 5 for 20 × 20 patches) is introduced, and the picture is scanned only at every 5th pixel. This yields a rasterization of the original confidence map, which is still good enough to detect cells. Figure 4 shows the rasterized confidence map of a section part. The maxima of the confidence map correspond to the cell locations; small but non-zero values in the confidence map typically indicate misclassifications that can be avoided by thresholding.
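A minimal Python sketch of this scanning step is given below; the classifier is assumed to be a function mapping a flattened patch to a confidence value (for example the perceptron together with the PCA projection from section 3), and edge padding as boundary extension is an assumption.

import numpy as np

def confidence_map(image, classify_patch, patch=20, grid=5):
    # scan the image on a grid and apply the cell classifier to each patch;
    # returns the rasterized confidence map (one value per grid point)
    half = patch // 2
    padded = np.pad(image, half, mode='edge')     # image extension at boundaries
    ys = np.arange(0, image.shape[0], grid)
    xs = np.arange(0, image.shape[1], grid)
    cmap = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            window = padded[y:y + patch, x:x + patch]
            cmap[i, j] = classify_patch(window.ravel())
    return cmap

# hypothetical usage with the sketches above, where project() denotes the
# (hypothetical) PCA projection of a raw patch:
# cmap = confidence_map(img, lambda p: classify(w, project(p)))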

4.2 Evaluation<br />

After the confidence map has been generated, it can be evaluated by simple maxima analysis. However, as seen in figure 4, maxima do not always correspond to cell positions, so thresholding of the confidence map is applied first. Values of 0.5 to 0.8 yield good results in experiments. Furthermore, the cell classifier can respond to the same cell when applied to image patches with large overlap. Therefore, after a maximum has been detected, adjacent points in the confidence map are also set to zero within a given radius (15 to 18 were good values for 20 × 20 image patches). Iterative application of this algorithm then gives the final cell positions and hence the image segmentation.
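A minimal Python sketch of this evaluation step, i.e. thresholding followed by iterative maximum picking with a suppression radius, is given below; the parameter values follow the text, but the loop structure is an assumption.

import numpy as np

def detect_cells(cmap, threshold=0.8, radius=18, grid=5):
    # iteratively pick maxima of the confidence map as cell positions;
    # after each detection the neighbourhood within the cut-out radius
    # (given in pixels, hence radius/grid map cells) is set to zero
    cmap = np.where(cmap >= threshold, cmap, 0.0)
    r = max(1, int(round(radius / grid)))
    centers = []
    while cmap.max() > 0:
        i, j = np.unravel_index(int(cmap.argmax()), cmap.shape)
        centers.append((i * grid, j * grid))
        cmap[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1] = 0.0
    return centers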

5 Results<br />

In practice we used perceptron learning after preprocessing with PCA and also ICA [5][6] in order to provide the learning algorithm with linearly separable data.

The patch size was chosen to be 20 × 20, a threshold of 0.8 was applied to the confidence map, and the


Chapter 21. Proc. BIOMED 2005, pages 209-212 295<br />

1<br />

0.5<br />

0<br />

0<br />

5<br />

10<br />

15<br />

20<br />

25<br />

30<br />

35<br />

40<br />

45<br />

50<br />

0<br />

5<br />

10<br />

Figure 4. The plot shows the confidence map with grid<br />

value 5 of the image part shown above.<br />

cut-out radius for cell detection <strong>in</strong> the confidence map was<br />

18 pixels.<br />

In figure 1 an automatically segmented picture is shown; the counting algorithm performs well. So far we have only compared the cell numbers per section counted by the algorithm with those counted by an expert, obtaining counting errors of about 5%. In further experiments we also want to compare the cell positions detected by the algorithm with those marked by an expert.
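The PCA preprocessing mentioned above can be sketched as a standard whitening of the flattened patches; an ICA rotation (as in [5, 6]) would then be applied to the whitened data. This is a generic sketch under these assumptions, not the exact pipeline used in the experiments:

```python
import numpy as np

def pca_whiten(X, n_components=None):
    """Project flattened patches onto the leading principal components and whiten.

    X : (n_samples, n_features) array of flattened image patches.
    Returns the whitened data together with the mean and projection matrix
    needed to map new patches into the same space.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = (Xc.T @ Xc) / (len(X) - 1)
    evals, evecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]   # keep the largest components
    W = evecs[:, order] / np.sqrt(evals[order] + 1e-12)
    return Xc @ W, mean, W
```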

6 Conclusion

We have presented a framework for brain section image segmentation and analysis. The feature detector, here a cell classifier, was first trained on a given sample set using a neural network. The detector was then applied by scanning over the image to obtain a confidence map, and maxima analysis of this map yields the cell locations. Experiments showed good performance of the classifier; however, larger tests still have to be performed.

In future work, various problems will have to be dealt with. First of all, the scanning performance should be increased in order to be able to use smaller grid values, which could significantly increase the classification rate. This could be done, for example, by using some kind of hierarchical neural network such as a cellular neural network, see [9]. In typical brain section images, some cells not lying directly in the focus plane appear blurred. In order to count those without counting them twice in two section images with different focus planes, a three-dimensional cell classifier could be trained for fixed focus-plane distances. A different approach to accounting for non-focused cells would be to simply allow 'overcounting' and then to remove duplicates in the segmented images according to location; this seems suitable given that cells do not vary greatly in size. Finally, sections typically span more than one microscope image. In order to count the cells of a whole section, some way of not counting cells twice in adjacent image parts has to be devised, for example by using techniques from image fusion. Furthermore, often not the whole image but only parts of the section are to be counted; so far this choice of the 'region of interest' is made manually. We hope to automate this in the future by finding separating features of these regions.

7 Acknowledgements

F.J.T. and E.W.L. gratefully acknowledge financial support by the DFG (graduate college 'nonlinear dynamics') and the BMBF (project 'ModKog').

References

[1] J. Altman and G. Das. Autoradiographic and histological evidence of postnatal hippocampal neurogenesis in rats. J. Comp. Neurol., 124(3):319–335, 1965.

[2] H. Cameron, C. Woolley, B. McEwen, and E. Gould. Differentiation of newly born neurons and glia in the dentate gyrus of the adult rat. Neuroscience, 56(2):336–344, 1993.

[3] F. Dolbeare. Bromodeoxyuridine: a diagnostic tool in biology and medicine, part I: historical perspectives, histochemical methods and cell kinetics. Histochem. J., 27(5):339–369, 1995.

[4] S. Haykin. Neural Networks. Macmillan College Publishing Company, 1994.

[5] J. Hérault and C. Jutten. Space or time adaptive signal processing by neural network models. In J. S. Denker, editor, Neural Networks for Computing. Proceedings of the AIP Conference, pages 206–211, New York, 1986. American Institute of Physics.

[6] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.

[7] H. G. Kuhn, H. Dickinson-Anson, and F. Gage. Neurogenesis in the dentate gyrus of the adult rat: age-related decrease of neuronal progenitor proliferation. J. Neurosci., 16(6):2027–2033, 1996.

[8] H. G. Kuhn, T. Palmer, and E. Fuchs. Adult neurogenesis: a compensatory mechanism for neuronal damage. Eur. Arch. Psychiatry Clin. Neurosci., 251(4):152–158, 2001.

[9] T. W. Nattkemper, H. Ritter, and W. Schubert. A neural classifier enabling high-throughput topological analysis of lymphocytes in tissue sections. IEEE Trans. ITB, 5:138–149, 2001.





