Whitening as a Tool for Estimating Mutual Information in Spatiotemporal Data Sets

Journal of Statistical Physics, Vol. 124, No. 5, September 2006 (© 2006)

DOI: 10.1007/s10955-006-9131-x

Andreas Galka,1,2 Tohru Ozaki,1 Jorge Bosch Bayard3 and Okito Yamashita4

Received September 1, 2004; accepted May 3, 2005

Published Online: August 1, 2006

We address the issue of inferring the connectivity structure of spatially extended dynamical systems by estimation of mutual information between pairs of sites. The well-known problems resulting from correlations within and between the time series are addressed by explicit temporal and spatial modelling steps which aim at approximately removing all spatial and temporal correlations, i.e. at whitening the data, such that it is replaced by spatiotemporal innovations; this approach provides a link to the maximum-likelihood method and, for appropriately chosen models, removes the problem of estimating probability distributions of unknown, possibly complicated shape. A parsimonious multivariate autoregressive model based on nearest-neighbour interactions is employed. Mutual information can be reinterpreted in the framework of dynamical model comparison (i.e. likelihood ratio testing), since it is shown to be equivalent to the difference of the log-likelihoods of coupled and uncoupled models for a pair of sites, and a parametric estimator of mutual information can be derived. We also discuss, within the framework of model comparison, the relationship between the coefficient of linear correlation and mutual information. The practical application of this methodology is demonstrated for simulated multivariate time series generated by a stochastic coupled-map lattice. The parsimonious modelling approach is compared to general multivariate autoregressive modelling and to Independent Component Analysis (ICA).

KEY WORDS: time series analysis, autoregressive modelling, innovations, mutual information, maximum likelihood, whitening, spatiotemporal modelling, independent component analysis

1 Institute of Statistical Mathematics (ISM), Minami-Azabu 4-6-7, Tokyo 106-8569, Japan; e-mail: galka@physik.uni-kiel.de
2 Institute of Experimental and Applied Physics, University of Kiel, 24098 Kiel, Germany
3 Cuban Neuroscience Center, Ave 25 No. 5202 esquina 158 Cubanacán, POB 6880, 6990, Ciudad Habana, Cuba
4 ATR Computational Neuroscience Laboratories, Hikaridai 2-2-2, Kyoto 619-0288, Japan


0022-4715/06/0900-1275/0 © 2006 Springer Science+Business Media, Inc.


1. INTRODUCTION

Recently there has been growing consensus that, for the investigation of complex extended systems, new approaches to the analysis of dynamical multivariate data sets are required.(1,2) Such data sets will typically arise in the guise of multivariate time series, such that the temporal dimension of the data reflects the dynamical nature of the underlying processes, while other aspects pertinent to these processes are accommodated in further dimensions. If the data, as a typical and very important case, in addition to the dimension of time also depends on physical space, it is commonly called spatiotemporal data.

The temporal dimension is characterised by the non-equivalence of the two possible directions of the arrow of time, which gives rise to the principle of causality, according to which events in the past may influence events in the future, whereas the opposite situation never occurs. The spatial dimension is (in most cases) characterised by the absence of such asymmetries, and by the concept of locality, i.e. the existence of a neighbourhood for each point in space, such that, for sufficiently short time intervals, direct interactions involving this point will typically be confined to points within this neighbourhood, and will not extend to spatially remote points.

In contemporary scientific research a vast number of systems are investigated which fall into the class of spatially extended dynamical systems. They cover a wide range of disciplines, including hydrodynamics,(3) optics,(4) meteorology, geophysics,(5) ecology,(6) biology, medicine(7,8) and engineering.(9) Presently, it has become standard in these disciplines to routinely record spatiotemporal data sets in large quantities, either through active experimentation or through field observations. Despite the diversity of these disciplines and of the corresponding data sets, it seems likely that for many cases a unified approach to analysing and modelling the basic properties of the underlying dynamics can be formulated.

As a central notion relevant for this field we mention connectivity, i.e. the presence of direct (and possibly directed) dynamical interactions between spatially distinct locations within the system. Information about connectivity in a system may reach far beyond the level of pure data description since it addresses an important aspect of the functional composition of the system. Correlations within multivariate time series can be described by measures such as linear and nonlinear correlation functions or mutual information;(10) especially mutual information has attracted considerable attention recently since it promises a very general quantification of statistical dependence. Its practical estimation from real data, however, is known to be a difficult task because of the need to estimate probability densities.(11) This is true in particular for the case of short time series; in many situations it may be difficult or impossible to record long time series from a system, e.g. due to technical limitations or limited stationarity of the system.


In this paper we discuss the estimation of mutual information from temporally and spatially correlated time series, and aim at clarifying the relations between predictive modelling, maximum-likelihood estimation, model comparison, linear correlation and mutual information. The close relationship between mutual information and likelihood ratio testing was recently also observed by Brillinger.(12,13) As our starting point, we interpret modelling as a process of spatial and temporal whitening of the raw data, i.e. transforming it into spatially and temporally uncorrelated innovations. We will argue that this step removes the need for estimating unknown and possibly complicated probability distributions, since they are replaced by Gaussian distributions. While it is well known that fitting a time series by a predictive model corresponds to temporal whitening, i.e. the identification of the noise process driving the (multivariate) dynamics, we will show that spatial whitening can be interpreted as fitting the covariance matrix of this noise process.

The main focus of this paper lies on the presentation of the methodology and its theoretical background, and the discussion is kept at a very general level, assuming just the presence of a spatially extended dynamical system, from which spatiotemporal data has been recorded. Although our interest in this subject was initially triggered by data sets recorded in brain research, a full discussion of this particular field of application would be beyond the scope of this paper. Instead we will illustrate the practical application of the methodology by showing results of analysing simulated data sets generated by a system of coupled stochastic oscillators; this system can be regarded as an example of a coupled map lattice.(14)

The structure of this paper is as follows. In Sec. 2 we will briefly review the definitions of the coefficient of linear correlation and of mutual information. Both quantify dependencies between pairs of data sets, i.e. they may be estimated from bivariate time series; while it is also possible to estimate the mutual information of a set of more than two data sets, we will not use this case in this paper.

In Sec. 3 we will introduce a different viewpoint on the definition and estimation of mutual information from bivariate time series; for this purpose we will discuss the concept of whitening by predictive autoregressive (AR) modelling and refer to an important theorem from the theory of Markov processes which provides the justification of this approach. The derivation of a central result on the relation between the coefficient of linear correlation and mutual information will be given in Appendix A. We will also briefly review linear modelling of bivariate time series by likelihood maximisation.

In Sec. 4 we will generalise the discussion by considering modelling of multivariate time series representing spatially extended systems; in such situations the issue of modelling the instantaneous correlations between the various spatial locations has to be addressed, which will lead us to the concept of spatial whitening. Also in this case linear correlation and mutual information will be regarded as properties of bivariate time series within the multivariate data set, i.e. of pairs of spatial locations. In this section we will also briefly discuss the wider field of AR modelling of multivariate time series and one of its well-established manifestations, the Principal Oscillation Pattern (POP) method;(41) in contrast to this method, in this paper we prefer to employ parsimonious multivariate autoregressive (MAR) models, i.e. models with sparse transition matrix.

In Sec. 5 we will discuss how deeper layers of correlation can be extracted from the innovations resulting from multivariate time series modelling; this will lead us to a fully parametric estimator of mutual information, providing an alternative to the common nonparametric estimators. The possibility of defining this estimator is a direct consequence of the fact that, for sufficiently good models, the distribution of the innovations is known to be Gaussian. The detailed derivation of the estimator will be deferred to Appendix B. The parametric estimator itself is valid for general situations, not only for the case of spatiotemporal data.

In Sec. 6 the modelling approach, as proposed in this paper, will be briefly compared with a widely applied class of algorithms for the analysis of multivariate data, known as Independent Component Analysis (ICA).

In Sec. 7 a simulation study will be presented, and a concluding discussion will be given in Sec. 8.

2. LINEAR CORRELATION AND MUTUAL INFORMATION

In this section we will review the definitions of the two commonly used statistics for measuring mutual dependencies between two time series, linear correlation and mutual information, and we will briefly discuss in which way the latter differs from the former.

2.1. Linear Correlation

For a given pair of time series x_t and y_t, t = 1, ..., N_t, the linear correlation structure can be quantified by the symmetric 2 × 2 sample covariance matrix

$$
S_{xy} = \frac{1}{N_t} \sum_{t=1}^{N_t} \big( x_t - \langle x_t \rangle,\; y_t - \langle y_t \rangle \big)^{\dagger} \big( x_t - \langle x_t \rangle,\; y_t - \langle y_t \rangle \big), \tag{1}
$$

where ⟨x_t⟩ = (1/N_t) Σ_t x_t. We choose to define the elements of this matrix as

$$
S_{xy} = \begin{pmatrix} \sigma_x^2 & \sigma_x \sigma_y\, r(x,y) \\ \sigma_x \sigma_y\, r(x,y) & \sigma_y^2 \end{pmatrix}, \tag{2}
$$

where σ_x² and σ_y² denote the (sample) variances of x_t and y_t, respectively, while the off-diagonal element quantifies the linear cross-covariance between the two


variables; here the coefficient of linear correlation r(x, y) has been defined as(13)

$$
r(x,y) = \frac{\sum_t (x_t - \langle x_t \rangle)(y_t - \langle y_t \rangle)}{N_t\, \sigma_x \sigma_y} = \frac{\sum_t (x_t - \langle x_t \rangle)(y_t - \langle y_t \rangle)}{\sqrt{\sum_t (x_t - \langle x_t \rangle)^2 \,\sum_t (y_t - \langle y_t \rangle)^2}}. \tag{3}
$$

Note that by this definition r(x, y) is normalised to −1 ≤ r(x, y) ≤ 1.
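As an illustration (not part of the original paper), Eq. (3) can be evaluated directly with NumPy; the function name and test data below are our own:

```python
import numpy as np

def linear_correlation(x, y):
    """Coefficient of linear correlation r(x, y) as in Eq. (3)."""
    dx = np.asarray(x, float) - np.mean(x)
    dy = np.asarray(y, float) - np.mean(y)
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)
r = linear_correlation(x, y)   # agrees with np.corrcoef(x, y)[0, 1]
```

The second form in Eq. (3) is used because it avoids computing σ_x and σ_y separately and is manifestly normalised to [−1, 1].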

2.2. Mutual Information

In contrast to the coefficient of linear correlation, mutual information, as introduced in 1948 by Shannon,(15) does not directly refer to the case of a pair of time series. Rather it is based on the probability distributions of two random variables x and y which can assume values out of a set of states; here we limit our attention to the case of the number of possible states being finite, say S. Let the index i, i = 1, ..., S, label these states, denote the corresponding values by x_i and y_i and assume that joint and marginal probability distributions p(x_i, y_j), p(x_i) and p(y_i) for the occurrence of these states exist. Then the mutual information between x and y is defined by

$$
I(x, y) = \sum_{i=1}^{S} \sum_{j=1}^{S} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}. \tag{4}
$$

Note that i and j do not label time, but states. Here we mention that it is also possible to define linear correlation by a summation over states instead of over time,(10) but in this paper we will not need this variant. We rewrite Eq. (4) as

$$
I(x, y) = \big\langle \log p(x_i, y_j) - \log\big( p(x_i)\, p(y_j) \big) \big\rangle_{p(x_i, y_j)}, \tag{5}
$$

where ⟨·⟩_{p(x_i, y_j)} denotes the average over (i, j) with respect to p(x_i, y_j).

We remark that both mutual information and Shannon entropy are special cases of a general measure of discrepancy between two probability distributions p_i and q_i, introduced by Boltzmann already in 1877(16) and sometimes known as Boltzmann entropy,(17) which is defined by

$$
B(p; q) = -\sum_i p_i \log \frac{p_i}{q_i}, \tag{6}
$$

where i may be an index vector. The negative of B(p; q) is known as Kullback–Leibler information or relative entropy.(18)

When estimating I(x, y) from paired data (x^(t), y^(t)), t = 1, ..., N_t, where t denotes simply an index for labelling the data, the probability distributions p(x_i, y_j), p(x_i) and p(y_j) have to be estimated numerically; while the most common method for this purpose is based on histograms,(13) recently various alternative approaches have been explored, in particular methods based on kernels,(19) correlation integrals(20) and k-nearest neighbour distances.(21) Obviously, the same situation is given for the case of the estimation of Shannon entropy.(11)
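A minimal sketch of the histogram-based (plug-in) estimate of Eq. (4), assuming NumPy; the bin count and the function name are our own choices, not taken from the paper:

```python
import numpy as np

def mutual_information_hist(x, y, bins=16):
    """Histogram (plug-in) estimate of I(x, y), Eq. (4), in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                      # joint distribution p(x_i, y_j)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x_i)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y_j)
    nz = pxy > 0                          # convention: 0 log 0 = 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
mi_dep = mutual_information_hist(x, x + 0.3 * rng.normal(size=5000))
mi_ind = mutual_information_hist(x, rng.normal(size=5000))
```

Since this is the Kullback–Leibler divergence between the empirical joint and the product of the empirical marginals, the estimate is always non-negative; for independent data it carries a small positive bias, which is one of the estimation problems mentioned in the text.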

Note that the concepts of "time" and "dynamics" are not involved in the definition of I(x, y). The index t just serves the purpose of determining which values in the data sets for x and y form pairs; this information is needed in the estimation of p(x_i, y_j). Apart from this constraint the sampled data (x^(t), y^(t)) are assumed to be independently drawn from the corresponding true probability distributions, and I(x, y) will be invariant with respect to any shuffling of the order of the data, as long as the pairs are preserved.

If the data comes in the guise of time series (such that t in fact denotes time), the assumption of independence will typically be invalid, since most time series show temporal correlations. It has been observed that such correlations pose a problem for the estimation of mutual information; while recently a few authors have begun to address this problem and to develop remedies,(22,23) in most applications the presence of temporal correlations is still ignored. But by ignoring these correlations the dynamics of the underlying process is ignored. In this paper we suggest regarding these correlations as a source of valuable information, rather than a nuisance.

3. INNOVATION APPROACH TO MUTUAL INFORMATION

In this section we will review the main theoretical foundations of this paper, namely the concept of whitening and its implications for the likelihood of the data; furthermore we will briefly discuss univariate and bivariate linear autoregressive models.

3.1. The Likelihood of Innovation Time Series

If we decide to take the temporal correlations in time series seriously, we have to regard x_t and y_t as different random variables for each value of t; then Eq. (5) should be replaced by

$$
I(x, y) = \log p\big( (x_1, y_1), \ldots, (x_{N_t}, y_{N_t}) \big) - \log\big( p(x_1, \ldots, x_{N_t})\, p(y_1, \ldots, y_{N_t}) \big), \tag{7}
$$

where in the joint distribution of all elements of both time series we have ordered the elements as pairs. If several time series aligned by an external trigger are available, the average with respect to the joint distribution (see Eq. (5)) could also be maintained, but this step is not essential and can be omitted. Reinterpreting Eq. (7) from the viewpoint of time series analysis, it can be seen that mutual information can be regarded as a difference between two terms representing log-likelihoods, the first referring to the bivariate time series (x_t, y_t), the second being the sum of the log-likelihoods of the two univariate time series x_t and y_t. Let log-likelihood be denoted by L; then Eq. (7) corresponds to

$$
I(x, y) = L(x, y) - \big( L(x) + L(y) \big). \tag{8}
$$

For later use we note that Eq. (7) can alternatively be written by using conditional probabilities as

$$
I(x, y) = \log\big( p(x_1, \ldots, x_{N_t} \mid y_1, \ldots, y_{N_t})\, p(y_1, \ldots, y_{N_t}) \big) - \log\big( p(x_1, \ldots, x_{N_t})\, p(y_1, \ldots, y_{N_t}) \big), \tag{9}
$$

corresponding to

$$
I(x, y) = L(x \mid y) - L(x). \tag{10}
$$

In order to estimate mutual information from Eq. (7), high-dimensional joint distributions need to be evaluated which, due to correlations between the elements of the two time series, generally will have very complicated structure; these correlations will occur both between x and y and within the values of each of these two variables at different points of time t. In order to simplify the structure of these distributions we propose to describe these correlations by the corresponding optimal predictors of x_t and y_t, based on the set of previous values of the two time series, {(x_{t_x}, y_{t_y}) | t_x, t_y < t}, i.e. we perform three different modelling steps, thereby estimating the conditional means for x_t, y_t and (x_t, y_t); the conditional mean will be denoted by E(·). By limiting the information available for prediction to the past we ensure that we stay in the domain of causal modelling. The residuals of these predictions are given by

$$
\varepsilon_t(x|x) = x_t - E(x_t \mid x_{t-1}, x_{t-2}, \ldots), \tag{11}
$$
$$
\varepsilon_t(y|y) = y_t - E(y_t \mid y_{t-1}, y_{t-2}, \ldots), \tag{12}
$$
$$
\big( \varepsilon_t(x|x, y),\, \varepsilon_t(y|x, y) \big)^{\dagger} = (x_t, y_t)^{\dagger} - E\big( (x_t, y_t)^{\dagger} \mid (x_{t-1}, y_{t-1})^{\dagger}, (x_{t-2}, y_{t-2})^{\dagger}, \ldots \big). \tag{13}
$$

Note that in general the residuals for x and y in Eq. (13) will be different from those in Eqs. (11) and (12), since in Eq. (13) additional information is employed for the predictions; this fact is expressed by the notation ε_t(x|x), ε_t(x|x, y), explicitly stating the conditioning on one or both of the variables x and y.

Following a suggestion of Wiener, residuals are also called innovations. The theoretical framework for the transformation of time series data into independent innovations is provided by the theory of stochastic differential equations(24,25) and by filtering theory,(26,27) alternatively also known as Markov process theory. The most general results were given by Lévy(28) (see Theorem 41 in Ref. 29). He has shown that, under mild conditions, any continuous-time Markov process can be modelled such that the corresponding innovations can be represented as the sum of two white noise processes, which have Gaussian and Poisson distributions, respectively; in the case of continuous dynamics only the Gaussian noise process will be present. The case of additional observation noise has been treated by Frost and Kailath.(30) Consequently, we expect that, under the assumption of continuous dynamics, for optimal predictors the time series of resulting innovations will be uncorrelated (in fact, independent) and Gaussian, even if, due to nonlinearities in the dynamics, the x_t and y_t are non-Gaussian.

Eqs. (11), (12) and (13) represent mappings from the original data to the corresponding innovations; let the Jacobians of these mappings be denoted by ∂ε(x)/∂x, ∂ε(y)/∂y and ∂ε(x, y)/∂(x, y), respectively; then the likelihoods of the original data and of the innovations are related according to(31)

$$
p(x_1, \ldots, x_{N_t}) = \left| \frac{\partial \varepsilon(x)}{\partial x} \right| p\big( \varepsilon_1(x|x), \ldots, \varepsilon_{N_t}(x|x) \big), \tag{14}
$$
$$
p(y_1, \ldots, y_{N_t}) = \left| \frac{\partial \varepsilon(y)}{\partial y} \right| p\big( \varepsilon_1(y|y), \ldots, \varepsilon_{N_t}(y|y) \big), \tag{15}
$$
$$
p\big( (x_1, y_1), \ldots, (x_{N_t}, y_{N_t}) \big) = \left| \frac{\partial \varepsilon(x, y)}{\partial (x, y)} \right| p\big( (\varepsilon_1(x|x, y), \varepsilon_1(y|x, y)), \ldots, (\varepsilon_{N_t}(x|x, y), \varepsilon_{N_t}(y|x, y)) \big), \tag{16}
$$

where the notation |·| denotes the absolute value of the determinant of a matrix. Now it can be seen from Eq. (11) that ∂ε_t/∂x_t = 1 and ∂ε_t/∂x_{t'} = 0 for t' > t (according to the principle of causality); therefore the determinant in Eq. (14) is unity, and in fact p(x_1, ..., x_{N_t}) is equal to p(ε_1(x|x), ..., ε_{N_t}(x|x)) for the specific series of innovations ε_1(x|x), ..., ε_{N_t}(x|x) corresponding to the data x_1, ..., x_{N_t}, although the shapes of these two distributions themselves may differ considerably. The same argument applies to Eqs. (15) and (16).
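The unit-determinant argument can be checked numerically for the simplest case; the following sketch (our own, assuming an AR(1) predictor with ε_1 = x_1) builds the Jacobian of the map from data to innovations:

```python
import numpy as np

# For an AR(1) whitening map eps_t = x_t - a * x_{t-1} (with eps_1 = x_1),
# the Jacobian d(eps)/d(x) is lower triangular with unit diagonal
# (d eps_t / d x_t = 1 and d eps_t / d x_{t'} = 0 for t' > t),
# so its determinant is 1, as used in Eq. (14).
Nt, a = 6, 0.7
J = np.eye(Nt)
for t in range(1, Nt):
    J[t, t - 1] = -a
det = np.linalg.det(J)   # equals 1 regardless of a
```

The same triangular structure holds for any causal predictor and any model order, which is why the likelihood of the data equals the likelihood of its innovations.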

Fortunately, this result removes the need to estimate the unknown and possibly complicated probability distributions in Eqs. (4) and (7), since, if sufficiently good models can be found, it follows from Markov process theory, as discussed above, that the distributions of the innovations are products of Gaussians. Then the corresponding log-likelihoods are given for x by

$$
L(x) = \log p(x_1, \ldots, x_{N_t}) = \log p\big( \varepsilon_1(x|x), \ldots, \varepsilon_{N_t}(x|x) \big) = -\frac{1}{2} \left( N_t \log \sigma^2_{\varepsilon(x|x)} + \sum_{t=1}^{N_t} \frac{\varepsilon_t^2(x|x)}{\sigma^2_{\varepsilon(x|x)}} + N_t \log(2\pi) \right), \tag{17}
$$

for y by a corresponding expression, and for (x, y) by

$$
L(x, y) = \log p\big( (x_1, y_1), \ldots, (x_{N_t}, y_{N_t}) \big) = \log p\big( (\varepsilon_1(x|x, y), \varepsilon_1(y|x, y)), \ldots, (\varepsilon_{N_t}(x|x, y), \varepsilon_{N_t}(y|x, y)) \big)
$$
$$
= -\frac{1}{2} \left( N_t \log |S_{\varepsilon(x,y|x,y)}| + \sum_{t=1}^{N_t} \big( \varepsilon_t(x|x, y), \varepsilon_t(y|x, y) \big)\, S^{-1}_{\varepsilon(x,y|x,y)}\, \big( \varepsilon_t(x|x, y), \varepsilon_t(y|x, y) \big)^{\dagger} + 2 N_t \log(2\pi) \right). \tag{18}
$$

Here σ²_{ε(x|x)} denotes the variance of the innovation process for the case of a predictive model for x only, and S_{ε(x,y|x,y)} denotes the covariance matrix of the bivariate innovation process for the case of a predictive model for (x, y). Note that, if x and y are replaced by the corresponding innovations, the sample covariance matrix defined in Eqs. (1) and (2) corresponds to S_{ε(x,y|x,y)}; therefore the same structure is chosen:

$$
S_{\varepsilon(x,y|x,y)} = \begin{pmatrix} \sigma^2_{\varepsilon(x|x,y)} & \sigma_{\varepsilon(x|x,y)} \sigma_{\varepsilon(y|x,y)}\, r(\varepsilon(x), \varepsilon(y)) \\ \sigma_{\varepsilon(x|x,y)} \sigma_{\varepsilon(y|x,y)}\, r(\varepsilon(x), \varepsilon(y)) & \sigma^2_{\varepsilon(y|x,y)} \end{pmatrix}. \tag{19}
$$

Here again we have defined a normalised linear correlation coefficient r(ε(x), ε(y)), analogous to Eq. (3), in order to describe instantaneous correlations between the innovations ε(x|x, y) and ε(y|x, y).

If we replace in Eqs. (17), (18) and (19) the parameters σ_{ε(x|x)}, σ_{ε(y|y)}, σ_{ε(x|x,y)}, σ_{ε(y|x,y)} and r(ε(x), ε(y)) by their appropriate maximum-likelihood estimators and insert the results into Eq. (8), we obtain

$$
I(x, y) = -\frac{1}{2} N_t \Big( \log\big( 1 - r^2(\varepsilon(x), \varepsilon(y)) \big) + \big( \log \sigma^2_{\varepsilon(x|x,y)} - \log \sigma^2_{\varepsilon(x|x)} \big) + \big( \log \sigma^2_{\varepsilon(y|x,y)} - \log \sigma^2_{\varepsilon(y|y)} \big) \Big); \tag{20}
$$

here for notational convenience we have abstained from denoting the estimators of the parameters differently from the parameters themselves. A detailed derivation of Eq. (20) can be found in Appendix A.

Strictly speaking, we have so far been ignoring the fact that for the first data points x_1, x_2, ..., x_p (where p is the model order) no optimal predictions can be performed, due to the lack of a sufficient number of previous points, so the corresponding contribution to the likelihood has to be calculated by different methods;(32) but for small p and sufficiently large N_t this missing contribution can usually be neglected.
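The estimator of Eq. (20) can be sketched as follows (our own code, assuming NumPy and simple least-squares AR fits in place of full maximum likelihood; function names and test process are hypothetical):

```python
import numpy as np

def ar_innovations(X, p):
    """Least-squares AR(p) fit with intercept for data X of shape (Nt, d);
    returns the innovation series of shape (Nt - p, d), cf. Eq. (22)."""
    Nt = X.shape[0]
    # regressor matrix: [1, X_{t-1}, ..., X_{t-p}] for t = p, ..., Nt - 1
    Z = np.hstack([np.ones((Nt - p, 1))] +
                  [X[p - k:Nt - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(Z, X[p:], rcond=None)
    return X[p:] - Z @ coef

def parametric_mi(x, y, p=2):
    """Estimate I(x, y) via Eq. (20) from univariate and bivariate AR fits."""
    ex = ar_innovations(x[:, None], p)[:, 0]        # eps(x|x)
    ey = ar_innovations(y[:, None], p)[:, 0]        # eps(y|y)
    E = ar_innovations(np.column_stack([x, y]), p)  # eps(x|x,y), eps(y|x,y)
    r = np.corrcoef(E[:, 0], E[:, 1])[0, 1]
    Nt = len(ex)
    return -0.5 * Nt * (np.log(1.0 - r**2)
                        + np.log(E[:, 0].var() / ex.var())
                        + np.log(E[:, 1].var() / ey.var()))

rng = np.random.default_rng(2)
n = 2000
x, y, y_ind = np.zeros(n), np.zeros(n), np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()  # y driven by x
    y_ind[t] = 0.5 * y_ind[t - 1] + rng.normal()           # independent of x
mi_coupled = parametric_mi(x, y)
mi_uncoupled = parametric_mi(x, y_ind)   # small positive bias expected
```

Since the bivariate model nests the two univariate models, the in-sample log-variance differences in Eq. (20) are non-positive, so this estimate is non-negative; for uncoupled data it shows exactly the small positive χ²-type bias discussed below.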

Note that Eq. (7) can be regarded as the likelihood ratio test (LRT) statistic of the null hypothesis of independence of the time series x_t and y_t.(13) By Eq. (20) possible deviations from independence are decomposed into three components, the first describing instantaneous correlations between the innovations of x and y (quantified by r(ε(x), ε(y))), while the second and the third describe dependence of x on the past of y and vice versa. If knowing the past of y does not improve predictions of x, and vice versa, the mutual information can still be non-zero, as given by the first term on the rhs of Eq. (20); however, since any estimates of the mutual information obtained from actual finite samples will follow a χ²-distribution (as is usually the case for any LRT statistic in the null case), it is to be expected that even in this case there will be a small positive bias resulting from the second and third terms on the rhs of Eq. (20). In contrast to this, in the non-null case the estimate of mutual information can be expected to follow a Gaussian distribution.(33)

3.2. Calculation of Innovations

The time series of innovations ε_t(x|x), ε_t(y|y), ε_t(x|x, y) and ε_t(y|x, y) are obtained by fitting causal dynamical models to the data (x_t, y_t); linear or nonlinear model classes may be employed, depending on the properties of the data. Nonlinear modelling will be necessary if the distribution of the data displays deviations from Gaussianity; this nonlinearity can be incorporated either directly in the model for the dynamics, or in a static observation function through which the underlying dynamics was observed, or in a combination of these two cases. The main classes of deviation from Gaussianity are given by heavy-tailed distributions, thin-tailed distributions and asymmetric distributions. In its full generality the task of identifying an optimal model may be very demanding, if not infeasible, but it can be expected that in many cases sub-optimal approximations will be sufficient. The class of linear models represents a convenient first-order approximation; experience has shown that in a substantial number of cases even this model class provides sufficient whitening.(34) In this paper we will focus exclusively on the case of linear modelling.

A specific model is characterised by its structure (or, equivalently, by belonging to a specific model class) and by a set of data-dependent parameters, collected in a parameter vector ϑ. The univariate linear AR model for the dynamics of a state variable ξ is given by

$$
\xi_t = \mu + \sum_{t'=1}^{p} a_{t'} \xi_{t-t'} + e_t, \tag{21}
$$

where p denotes the positive integer model order, µ denotes a constant intercept term (allowing for non-zero mean of ξ_t), and e_t represents the dynamical noise term which drives the dynamics; the variance of e_t is denoted by σ²_e. The parameter vector for this model is given by ϑ = (µ, a_1, ..., a_p, σ²_e). On the basis of this model the innovations can be estimated from given data x_t by

$$
\varepsilon_t(x|x) = x_t - \left( \mu + \sum_{t'=1}^{p} a_{t'} x_{t-t'} \right); \tag{22}
$$

the variance of the innovations ε_t(x|x) provides a sample estimate for σ²_e. For the case of a linear bivariate AR model for the state variables (ξ, η)† the corresponding model is given by

$$
\begin{pmatrix} \xi_t \\ \eta_t \end{pmatrix} = \begin{pmatrix} \mu_\xi \\ \mu_\eta \end{pmatrix} + \sum_{t'=1}^{p} A_{t'} \begin{pmatrix} \xi_{t-t'} \\ \eta_{t-t'} \end{pmatrix} + \begin{pmatrix} e_t(\xi) \\ e_t(\eta) \end{pmatrix}, \tag{23}
$$

where the dynamical noise term is represented by (e_t(ξ), e_t(η))†; its (2 × 2) covariance matrix is denoted by S_{e(ξ,η)}. The parameter vector ϑ for this model consists of the intercept terms (µ_ξ, µ_η), all elements of the transition matrices A_{t'} and the three independent elements of S_{e(ξ,η)}. The estimator of the innovations (ε_t(x|x, y), ε_t(y|x, y))† for data x_t and y_t (representing ξ and η) follows in analogy to Eq. (22). From now on we will always express models in terms of observed data and estimated innovations (instead of the not directly accessible "true" driving noises e_t), as in the example of Eq. (22).
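To illustrate the temporal-whitening step, here is a sketch (our own, assuming NumPy) that fits Eq. (21) by least squares, computes the innovations of Eq. (22), and checks that they have lost the serial correlation of the raw data:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares estimates of (mu, a_1, ..., a_p) in Eq. (21) and the
    innovations of Eq. (22) for scalar data x."""
    Nt = len(x)
    Z = np.column_stack([np.ones(Nt - p)] +
                        [x[p - k:Nt - k] for k in range(1, p + 1)])
    theta, *_ = np.linalg.lstsq(Z, x[p:], rcond=None)
    return theta, x[p:] - Z @ theta

def lag1_autocorr(u):
    u = u - u.mean()
    return float(np.sum(u[1:] * u[:-1]) / np.sum(u * u))

rng = np.random.default_rng(3)
n, a_true = 4000, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.2 + a_true * x[t - 1] + rng.normal()
theta, eps = fit_ar(x, 1)
# raw data are strongly correlated; the innovations are (nearly) white
rho_data, rho_eps = lag1_autocorr(x), lag1_autocorr(eps)
```

For this AR(1) example rho_data is close to a_true while rho_eps is close to zero, which is exactly the replacement of correlated data by innovations described in the text.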

The preferred method for choosing an appropriate model class and for fitting the parameters is maximisation of the likelihoods, as given by Eqs. (17) and (18). However, it should be mentioned that, when comparing the performance of different model classes having different numbers of data-dependent parameters (i.e. different dimension of ϑ), the model class with the larger number of data-dependent parameters will typically achieve the better likelihood, and therefore overfitted models will result. This phenomenon can be interpreted by stating that the likelihood represents a biased estimator of Boltzmann entropy, Eq. (6) (where p_i represents the true distribution of the quantity to be predicted, and q_i represents the predictive distribution).(35) Based on estimating this bias, corrections to the likelihood have been proposed, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).(36) Cross-validation represents an alternative approach to avoiding overfitting; it has been proved that minimisation of AIC and cross-validation are asymptotically equivalent.(37) In this paper we will employ minimisation of AIC for the purpose of model comparison.
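A sketch of AIC-based order selection for the univariate model of Eq. (21) (our own code, assuming NumPy; the Gaussian log-likelihood follows Eq. (17) with the maximum-likelihood variance inserted):

```python
import numpy as np

def aic_ar(x, p):
    """AIC of a least-squares Gaussian AR(p) fit; the log-likelihood is
    Eq. (17) evaluated at the ML variance, and k = p + 2 counts the
    parameters (mu, a_1, ..., a_p, sigma_e^2)."""
    Nt = len(x)
    Z = np.column_stack([np.ones(Nt - p)] +
                        [x[p - k:Nt - k] for k in range(1, p + 1)])
    theta, *_ = np.linalg.lstsq(Z, x[p:], rcond=None)
    eps = x[p:] - Z @ theta
    n = len(eps)   # the first p points are dropped, a negligible effect
    logL = -0.5 * n * (np.log(eps.var()) + 1.0 + np.log(2.0 * np.pi))
    return -2.0 * logL + 2.0 * (p + 2)

rng = np.random.default_rng(4)
n = 3000
x = np.zeros(n)
for t in range(2, n):   # true model order is p = 2
    x[t] = 0.5 * x[t - 1] - 0.4 * x[t - 2] + rng.normal()
best_p = min(range(1, 8), key=lambda p: aic_ar(x, p))   # typically p = 2
```

The likelihood alone would keep improving with p; the penalty term 2(p + 2) is what stops the order selection at a parsimonious model.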

Remembering that in general there is the possibility of choosing between various different model classes for whitening, such as classes based on different nonlinear functions or different state space representations, we note that consequently the innovations are not uniquely defined, a phenomenon which at first sight may appear problematic. We would like to remark that the situation is no different for the case of the nonparametric estimators of probability distributions mentioned earlier, such as histogram or kernel estimators: also for these methods the results will depend on various design choices, such as kernel functions, bandwidths, bin widths, etc.

1286 Galka, Ozaki, Bayard and Yamashita

4. MODELLING SPATIOTEMPORAL DATA

So far we have discussed the case of only two random variables x and y; now we shall turn to the general case of spatiotemporal dynamics. Data which depend both on time and on space can be modelled by MAR models; we will now describe how such models can be formulated and simplified for high-dimensional situations (i.e. data covering a large number of grid points), and we will discuss an approximative approach to modelling the covariance matrix of the driving noise of MAR models.

4.1. Multivariate AR Modelling: General Case

We assume that a discretisation of physical space into a rectangular grid is used, and that the grid points are labelled by an index v, such that the measurement consists of scalar time series x_t^{(v)} for each grid point. Let x_t and ε_t(x) denote the column vectors formed of the data x_t^{(v)} and the innovations ε_t(x^{(v)}) for all grid points, respectively, and let N_v denote the total number of grid points.

In general form the dynamics of such systems may be described by nonlinear MAR models; the linear case is given by the generalisation of Eq. (23), such that the corresponding innovations are estimated by (see Eq. (22))

ε_t(x) = x_t − ( μ + Σ_{t′=1}^{p} A_{t′} x_{t−t′} ),   (24)

where the N_v-dimensional intercept vector μ, the set of the N_v × N_v transition matrices A_{t′}, t′ = 1, …, p, and the N_v(N_v − 1)/2 independent elements of the covariance matrix S_{ε(x)} = E(ε_t(x) ε_t(x)†) form the parameter vector ϑ. For each pair of grid points (u, v) the corresponding elements of the transition matrices (A_{t′})_{uv} and (A_{t′})_{vu} describe the dynamical (i.e., delayed) interactions between these two grid points, while the instantaneous correlations are described by (S_{ε(x)})_{uv} ≡ (S_{ε(x)})_{vu}. The total number of parameters in this model (i.e. the dimension of ϑ), to be estimated from the data, is given by

N_par = N_v + p N_v² + ½ N_v (N_v − 1).   (25)
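The quadratic growth of this count is easy to underestimate; the following small helper (an illustrative sketch, not from the paper) evaluates Eq. (25), using the 64-node, 512-sample setting that appears later in Sec. 7.

```python
def n_par_full_mar(n_v, p):
    """Parameter count of the full MAR model, Eq. (25): intercepts,
    p transition matrices, and the independent covariance elements."""
    return n_v + p * n_v**2 + n_v * (n_v - 1) // 2

# the 64-node, 512-sample setting of Sec. 7 already exceeds the
# "10% of the data values" guideline discussed below by a wide margin
n_par = n_par_full_mar(64, 2)   # 10272 parameters
n_data = 64 * 512               # 32768 data values
```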

Efficient methods for estimating the parameters of Eq. (24) have been proposed by Levinson,(38) Whittle(39) and Neumaier and Schneider.(40) The special case of a model of first order, p = 1, forms the basis for the method known as Principal Oscillation Pattern (POP) analysis,(41) which has found widespread application in geophysics and related fields. This method is based on the analysis of the eigenvalues of the first-order transition matrix A_1; depending on whether these eigenvalues are real numbers or complex conjugate pairs, they correspond to (stochastically driven) relaxator or oscillator modes of the underlying spatiotemporal dynamics. This method has been generalised to MAR models of higher order, p > 1, by Neumaier and Schneider.(40)

However, it is a well-known disadvantage of MAR models that the number of parameters N_par from Eq. (25) may easily become very large, thereby leading to overparametrised, non-parsimonious models. Depending on the resolution of the spatial discretisation (e.g., the number of spatially distinct measurement sites), N_v may be a large number; if furthermore the available time series of length N_t is only short (a common situation in many fields of application), it may easily occur that N_par assumes a value comparable to (or even exceeding) the total number of data values, thereby rendering reliable estimation of parameters infeasible. According to a generally accepted guideline obtained from practical experience, the number of parameters to be estimated should not exceed a threshold of approximately 10% of the number of available data values. For this reason we will now discuss a parsimonious variant of MAR modelling.

4.2. Multivariate AR Modelling: Parsimonious Approach

For most spatially extended physical systems the assumption is justified that at sufficiently short time scales interactions will take place only over short distances; therefore we propose to restrict the dynamics to local neighbourhoods: as a first-order approximation we assume that each grid point will interact directly only with its immediate spatial neighbours on the grid. Most elements of A_1 become zero by this assumption, i.e. A_1 assumes a sparse structure. As a further simplification, motivated by regarding the spatially and temporally discrete model as an approximation of a (stochastic) partial differential equation, it is reasonable to limit the interaction between different grid points to the first lag, i.e. to set all off-diagonal elements of A_{t′}, t′ > 1, to zero. The diagonal elements, describing the dependence of each grid point on its own past, may in general be non-zero up to a lag (a model order) t′ = p > 1. This situation corresponds to the approximation of a pth-order time derivative in a partial differential equation.

For a single grid point v the temporally whitened innovations result as

ε_t(x^{(v)}) = x_t^{(v)} − ( μ^{(v)} + Σ_{t′=1}^{p} a_{t′}^{(v)} x_{t−t′}^{(v)} + Σ_{u∈N(v)} b_1^{(v,u)} x_{t−1}^{(u)} ),   (26)

where a_{t′}^{(v)}, t′ = 1, …, p, and b_1^{(v,u)} denote the parameters for self-interaction and neighbour interaction, respectively, N(v) denotes the set of labels of the neighbours of grid point v, and μ^{(v)} is the constant intercept for time series x_t^{(v)}. In analogy with Eq. (22), we should have denoted the innovations in Eq. (26) by ε_t(x^{(v)} | x^{(v)}) or possibly ε_t(x^{(v)} | x^{(v)}, x^{N(v)}) instead of ε_t(x^{(v)}), but in order to avoid excessively complicated notation we will from now on refrain from explicitly stating the conditioning on the past of the same grid point, and possibly its neighbours.

As a generalisation we mention the possibility that in Eq. (26) the model order p could also be chosen differently for each grid point, p = p(v), but in this paper we will not explore this option further. Rather, we will employ a common value of p for all grid points, which could be obtained by minimisation of an information criterion such as AIC or BIC; alternatively it may be argued that p should be set at most to 2, since higher orders of a time derivative would be unusual in partial differential equations describing excitable media.

This model can be interpreted as a decomposition of the high-dimensional dynamical system described by Eq. (24) into a set of coupled low-dimensional dynamical systems, each of which is focussed on one grid point; the influence of the neighbouring grid points is treated as an additional external disturbance, represented by the neighbourhood term. It should be stressed that when analysing spatiotemporal data with a view to the connectivity structure of the underlying system, we are not interested in the correlations between pairs of neighbouring grid points, since we regard their presence as natural; rather, we will be interested in correlations between pairs of grid points separated by larger distances, since they may represent some faster mode of interaction which is not described by the basic MAR model.

Given the data x_t^{(v)}, the parameters a_1^{(v)}, …, a_p^{(v)}, b_1^{(v,u)} and μ^{(v)} can be estimated separately for each grid point v by the linear least-squares method (which under the assumptions of Gaussian innovations and linear dynamics is equivalent to full likelihood maximisation), and a series of innovations ε_t(x^{(v)}) will result as an estimate of the noise process driving the local dynamics of grid point v. This simple and efficient pointwise model fitting approach replaces the usual methods for fitting of full MAR models, as mentioned above, but it does so at the cost of neglecting the need to estimate also the off-diagonal elements of the covariance matrix of the driving noise, S_{ε(x)}. We will deal with this point in the next section.
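The pointwise least-squares fit of Eq. (26) can be sketched as follows; this is a minimal illustration under stated assumptions (placeholder random data, a 1-D chain topology, and the hypothetical helper name `fit_node`), not the authors' implementation.

```python
import numpy as np

def fit_node(x, v, neighbours, p):
    """Least-squares fit of the parsimonious model of Eq. (26) for one
    grid point v: intercept, p self-lags and one first-lag term per
    neighbour. x has shape (n_t, n_v); returns the coefficient vector
    and the innovation series for node v."""
    n_t = x.shape[0]
    cols = [np.ones(n_t - p)]                                   # mu^(v)
    cols += [x[p - k - 1 : n_t - k - 1, v] for k in range(p)]   # a_k^(v)
    cols += [x[p - 1 : n_t - 1, u] for u in neighbours]         # b_1^(v,u)
    X = np.column_stack(cols)
    y = x[p:, v]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

# fitting every node independently yields the temporally whitened
# innovations; here on placeholder data with a periodic chain topology
rng = np.random.default_rng(1)
x = rng.standard_normal((500, 8))
innovations = np.column_stack(
    [fit_node(x, v, [(v - 1) % 8, (v + 1) % 8], p=2)[1] for v in range(8)])
```

Because each node is fitted separately, the cost grows only linearly with the number of grid points, in contrast to the full MAR fit.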

4.3. Spatial Whitening

In the previous section simplifications have been introduced by which most of the elements of the transition matrices A_{t′} could be set to zero. Now a similar step has to be accomplished for the covariance matrix S_{ε(x)}. In general, it has to be expected that the off-diagonal elements of S_{ε(x)} are non-zero, i.e. that instantaneous mutual correlations between the innovations ε_t(x^{(v)}) at different grid points are present; this will be true especially for neighbouring grid points. But if this is the case, the decomposition approach described in the previous section is invalid, since it assumes uncorrelated noises and provides estimates only for the diagonal elements σ²_{ε(x^{(v)})}.


Note that here we are dealing with instantaneous (as compared to the sampling time of the data), purely spatial correlations, which cannot be described by dynamical models like Eq. (26); therefore a separate step of spatial whitening is needed. We describe this step by an instantaneous linear transform applied to the original data

x̃_t = L x_t .   (27)

Various approaches for the choice of the matrix L may be explored, and again it will depend on the particular data which choice provides best spatial whitening; as a first approximation we propose to employ a Laplacian matrix

L = I_{N_v} − (1/k) N ,   (28)

where I_{N_v} denotes the N_v × N_v unity matrix, and N denotes a N_v × N_v matrix having N_{vu} = 1 if u belongs to N(v), and 0 otherwise. This transform corresponds to a discrete second-order spatial derivative; derivatives are natural tools for whitening. The parameter k in Eq. (28) should be chosen as k = 6 if for each grid point six spatial neighbours are considered; but the parameter may also be chosen in a data-adaptive way, e.g. by maximum likelihood, if a corresponding correction term is added to the likelihood (see Eq. (31)). A practical advantage of choosing a value for k different from the number of neighbours is that otherwise L may be very close to singular.
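The near-singularity for k equal to the number of neighbours can be checked directly. Below is a sketch constructing Eq. (28) for a one-dimensional periodic chain (two neighbours per point, an assumption made here for brevity; the paper's k = 6 refers to six neighbours on a 3-D grid).

```python
import numpy as np

def laplacian_chain(n_v, k):
    """Whitening matrix L = I - (1/k) N of Eq. (28) for a 1-D chain
    with periodic boundaries (two neighbours per grid point)."""
    N = np.zeros((n_v, n_v))
    for v in range(n_v):
        N[v, (v - 1) % n_v] = 1.0   # left neighbour
        N[v, (v + 1) % n_v] = 1.0   # right neighbour
    return np.eye(n_v) - N / k

# with k equal to the number of neighbours the row sums of L vanish,
# so L is singular; a slightly different k keeps it invertible
det_singular = np.linalg.det(laplacian_chain(8, k=2.0))
det_regular = np.linalg.det(laplacian_chain(8, k=2.5))
```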

We remark that the class of autoregressive integrated moving-average (ARIMA) models for time series modelling(32) results from a similar approach of taking a discrete derivative prior to any further analysis, but in the case of ARIMA this derivative is applied in the time domain, while here we are applying it to the spatial domain.

If consequently we replace x by x̃, Eq. (24) becomes

ε_t(x) = L⁻¹ ε_t(x̃) = x_t − ( L⁻¹ μ + Σ_{t′=1}^{p} L⁻¹ A_{t′} L x_{t−t′} ),   (29)

where S_{ε(x̃)} = E(ε_t(x̃) ε_t(x̃)†) is expected to be diagonal, or at least closer to diagonal than S_{ε(x)}. Therefore this approach to spatial whitening is equivalent to modelling the non-diagonal covariance matrix of the innovations corresponding to the original data by

S_{ε(x)} = L⁻¹ S_{ε(x̃)} (L⁻¹)† ;   (30)

recently a similar model has also been applied successfully for estimating unobserved brain states through spatiotemporal Kalman filtering.(42,43) Future research may succeed in identifying superior approaches to model instantaneous spatial correlations in spatiotemporal data. Clearly, depending on the data, this transformation will not remove all correlations from the ε_t(x^{(v)}), but it removes those correlations which are merely an artifact of spatial neighbourhood and therefore may be regarded as "trivial." Remaining correlations can be expected to contain more relevant information about the underlying system, i.e. its connectivity structure, as will be demonstrated in the next section.

Note that after the application of the Laplacian transform to the data, a correction needs to be applied to the log-likelihood, which is given by

L(x_1, …, x_{N_t}) = L(x̃_1, …, x̃_{N_t}) + (N_t − p) log |L| .   (31)

The number of parameters of the model given by Eq. (26) follows as

N_par = N_v ( ⟨card(N(v))⟩ + p + 1 ) + 1 ,   (32)

where ⟨card(N(v))⟩ denotes the average number of neighbours of a grid point, and the final +1 counts the parameter k of the Laplacian.
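For comparison with the full-model count of Eq. (25), a one-line evaluation of Eq. (32) (an illustrative sketch; the chain topology and order are assumptions):

```python
def n_par_parsimonious(n_v, p, mean_neighbours):
    """Parameter count of the parsimonious model, Eq. (32)."""
    return n_v * (mean_neighbours + p + 1) + 1

# 64 nodes on a periodic chain (two neighbours each) with order p = 2:
# 321 parameters, far below the ~10^4 of the corresponding full MAR model
n_par = n_par_parsimonious(64, p=2, mean_neighbours=2)
```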

In practice, once the data has been spatially whitened by multiplication with the Laplacian L, the model fitting can be performed as already described above, without paying any further attention to the spatial whitening step. This remains true for the case that within the multivariate data set the time series of a given pair of (typically non-neighbouring) grid points v and w are modelled, in addition to the model terms already present in Eq. (24) for each of the two grid points, by some kind of direct interaction terms, e.g. as described by Eq. (23). In this case the resulting series of conditional innovations can be denoted by ε_t(x̃^{(v)} | x̃^{(w)}) (and vice versa). From now on we will omit the tilde and use the following shorthand notations: ε_t(v) := ε_t(x^{(v)}) and ε_t(v|w) := ε_t(x^{(v)} | x^{(w)}); in the same way, the corresponding variances will be abbreviated as σ²_{ε(v)} := σ²_{ε(x^{(v)})} and σ²_{ε(v|w)} := σ²_{ε(x^{(v)} | x^{(w)})}.

5. ANALYSING CONNECTIVITY IN INNOVATION TIME SERIES

With this section we will complete the theoretical part of this paper by discussing how further information can be extracted from the innovations resulting from spatiotemporal modelling. For pairs of grid points within spatiotemporal data sets a modification of the modelling presented so far will be introduced; based on this modified model we will derive a parametric estimator of mutual information.

5.1. Spatial Correlations in the Innovations

If a spatiotemporal time series were transformed to completely independent innovations (with respect to time and space), this would mean that all available dynamical information had successfully been extracted from this time series and condensed into the model, as represented by both the dynamical model and the spatial whitening transformation (i.e. the covariance matrix of the innovations). Clearly, for real-world data this can never be achieved completely. As an example, there may exist very fast connections between grid points which are not neighbours, but separated by some larger distance; these would give rise to additional instantaneous spatial correlations that are not removed by multiplication with the Laplacian matrix L. We propose to regard the possibility of such incidents not as a weakness of the parsimonious modelling approach discussed so far, but rather as a strength, since it enables us to explore deeper layers of the spatiotemporal dynamics. By removing a first layer of spatial and temporal correlations we obtain a transformed representation of the original data set, within which we may be able to detect much more subtle correlations. This line of argumentation can be regarded as an illustration of the mathematical argument of Sec. 3.1, which established the equality of p(ε_1(x), …, ε_{N_t}(x)) and p(x_1, …, x_{N_t}).

Given the innovation time series ε_t(x), we may look for remaining instantaneous spatial correlations between pairs of grid points (v, w) by computing the coefficient of linear correlation r(ε(v), ε(w)) according to Eq. (3), or by estimating the mutual information I(ε(v), ε(w)) according to Eq. (4), thereby making use of the fact that temporal correlations have been removed by the temporal whitening step.

Alternatively, as mentioned already at the end of Sec. 4.3, the modelling of the data (i.e. the generation of the innovations) may be performed by including additional direct interaction terms, either by performing a full bivariate model fitting step for a given pair of grid points, following Eq. (23), or at least by allowing for a non-zero value of the covariance term E(ε_t(v) ε_t(w)) within S_{ε(x)}. If our aim is to capture instantaneous spatial correlations, the latter approach is sufficient; in this case those elements within S_{ε(x)} which correspond to grid points v and w can be defined as in Eq. (19).

Again the model fitting should be done by maximisation of likelihood, but in this case the linear least-squares method cannot be applied, and estimating the additional covariance term in S_{ε(x)} requires numerical optimisation. Since performing numerical optimisation steps for all pairs of grid points would consume considerable time, we will in the next section present an approximative method which renders this problem accessible to the linear least-squares method.

5.2. Approximative Model for Spatially Correlated Innovations

Consider for a moment again the case of modelling a pair of (typically non-neighbouring) grid points v and w within a full spatiotemporal model, without assuming a non-zero covariance term of the corresponding innovations E(ε_t(v) ε_t(w)); then each grid point is modelled according to Eq. (26), and we have a bivariate sub-model with innovations

ε_t(v) = x_t^{(v)} − ( μ^{(v)} + Σ_{t′=1}^{p} a_{t′}^{(v)} x_{t−t′}^{(v)} + Σ_{u∈N(v)} b_1^{(v,u)} x_{t−1}^{(u)} )

ε_t(w) = x_t^{(w)} − ( μ^{(w)} + Σ_{t′=1}^{p} a_{t′}^{(w)} x_{t−t′}^{(w)} + Σ_{u∈N(w)} b_1^{(w,u)} x_{t−1}^{(u)} ).   (33)

Motivated by the work of Geweke,(44) we suggest to represent instantaneous correlations between x_t^{(v)} and x_t^{(w)}, i.e. non-zero E(ε_t(v) ε_t(w)), by introducing an additional instantaneous coupling term into the bivariate model, which yields a new bivariate sub-model with innovations

ε_t(v) = x_t^{(v)} − ( μ^{(v)} + Σ_{t′=1}^{p} a_{t′}^{(v)} x_{t−t′}^{(v)} + Σ_{u∈N(v)} b_1^{(v,u)} x_{t−1}^{(u)} )

ε_t(w|v) = x_t^{(w)} − ( μ^{(w)} + Σ_{t′=1}^{p} a_{t′}^{(w)} x_{t−t′}^{(w)} + Σ_{u∈N(w)} b_1^{(w,u)} x_{t−1}^{(u)} + c_{vw} x_t^{(v)} ).   (34)

Note that, unlike with the standard definition of autoregressive models, here we use the state value of one grid point at time t in order to model the value of another grid point at the same time t.

The additional coupling parameter c_{vw} can be estimated conveniently by the same linear least-squares method which is also employed for estimating the other parameters in Eq. (34), thus avoiding computationally more demanding numerical optimisation. In this model the covariance matrix of the innovations is guaranteed to be diagonal; its diagonal elements shall be denoted by σ²_{ε(v)} and σ²_{ε(w|v)}. However, due to the inclusion of the instantaneous interaction term this covariance matrix does not directly refer to the original state variables x_t^{(v)} and x_t^{(w)}, but to transformed variables; see Appendix B for a discussion of the corresponding non-diagonal covariance matrix of the original x_t^{(v)} and x_t^{(w)}.

It has to be emphasised that the model underlying Eq. (34) is not equivalent to the original bivariate modelling, corresponding to Eq. (23), but rather it represents a useful approximation; in Appendix B we will show that the coupling parameter c_{vw} approximately corresponds to the coefficient of linear correlation r(ε(v), ε(w)). Also the result of the least-squares model fit does not represent a maximum-likelihood fit, although we expect that in most cases it will have very similar properties.
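A stripped-down stand-in for the c_{vw} estimation can be sketched as follows; this is an illustrative assumption, regressing one already-whitened innovation series directly on the other at the same time step, rather than the paper's full regression with past terms.

```python
import numpy as np

def fit_instantaneous_coupling(eps_v, eps_w):
    """Least-squares estimate of an instantaneous coupling between two
    innovation series: regress eps_w on eps_v at the same time step.
    A stripped-down stand-in for the c_vw term of Eq. (34), applied
    directly to already-whitened innovations."""
    c = (eps_v @ eps_w) / (eps_v @ eps_v)
    return c, eps_w - c * eps_v      # coupling and residual series

rng = np.random.default_rng(2)
eps_v = rng.standard_normal(5000)
eps_w = 0.5 * eps_v + rng.standard_normal(5000)   # true coupling 0.5
c_vw, resid = fit_instantaneous_coupling(eps_v, eps_w)
```

The residual series plays the role of ε(w|v); by construction it is orthogonal to eps_v, which mirrors the diagonal innovation covariance of the model.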

In Eq. (34) the two grid points v and w are no longer treated in a symmetrical way, since x_t^{(v)} is modelled only by its own past (and the past of its neighbours), whereas x_t^{(w)} is modelled by the past of both x_t^{(v)} and x_t^{(w)}. Experience has shown that if the roles of v and w are interchanged, in most cases the results remain essentially unchanged; however, in certain situations this remark no longer applies, e.g. when the variances of the original data x_t^{(v)} and x_t^{(w)} differ very much. In such cases we have adopted the approach of choosing, out of the two possibilities, the model which achieves the better likelihood. In Appendix B an explicit expression for the likelihood of model Eq. (34) is derived.

As already mentioned, it would also be possible to augment the model Eq. (34) by time-lagged interaction terms, corresponding to the full bivariate modelling of Eq. (23); by such terms further deviations of the innovations ε_t(x) from the ideal condition of being completely white and mutually independent could be modelled. In particular, this generalisation would offer the possibility of quantifying directed correlations between grid points, which can be interpreted as measuring causal connectivity (while by definition instantaneous correlations are non-directed).

Through direct comparison of the likelihoods obtained by models (33) and (34) it can be decided whether for given data the inclusion of the interaction term improves the model; this step can be regarded as a Likelihood Ratio Test (LRT).(45) Based on Eqs. (9) and (10), the LRT test statistic can also be used for deriving an estimator of mutual information; following this path we obtain

I(x^{(v)}, x^{(w)}) = (N_t − p) ( 2 c_{vw} σ²_{ε(v),ε(w|v)} − c²_{vw} σ²_{ε(v)} ) / ( 2 σ²_{ε(w|v)} ),   (35)

where we have defined σ²_{ε(v),ε(w|v)} = E(ε_t(v) ε_t(w|v)); see Appendix B for details on the derivation of this result. This estimator differs from most other currently known estimators by the complete absence of histograms or similar non-parametric elements.

However, it has to be mentioned that this estimator has the disadvantage of sometimes producing negative estimates of mutual information; some discussion of this point can be found in Appendix B. For small values of mutual information, nonparametric estimators, too, are known to occasionally produce negative results, due to statistical fluctuations.(21)
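Evaluating Eq. (35) is a purely arithmetic operation once the innovation series and c_{vw} are available; the following sketch exercises the formula on synthetic inputs chosen only for illustration (the series, seed and coupling value are assumptions, not results from the paper).

```python
import numpy as np

def mi_parametric(eps_v, eps_wv, c_vw, n_t, p):
    """Evaluate the parametric mutual-information estimator, Eq. (35),
    from the two innovation series and the coupling parameter c_vw."""
    sigma2_v = eps_v @ eps_v / len(eps_v)
    sigma2_wv = eps_wv @ eps_wv / len(eps_wv)
    cross = eps_v @ eps_wv / len(eps_v)   # sigma^2_{eps(v),eps(w|v)}
    return (n_t - p) * (2 * c_vw * cross - c_vw**2 * sigma2_v) / (2 * sigma2_wv)

# a pair of series with genuine instantaneous correlation gives a
# positive value; uncorrelated series can yield a negative one, which
# illustrates the caveat discussed above
rng = np.random.default_rng(4)
eps_v = rng.standard_normal(1000)
eps_wv = 0.5 * eps_v + rng.standard_normal(1000)
mi = mi_parametric(eps_v, eps_wv, c_vw=0.5, n_t=1000, p=2)
```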

6. COMPARISON WITH INDEPENDENT COMPONENT ANALYSIS

The concept of whitening also plays an important role in other approaches for the analysis of multivariate time series, such as Independent Component Analysis (ICA):(46) as a part of this method commonly a linear preprocessing step is applied to the multivariate data, also called "whitening," such that the channels become mutually uncorrelated; they are also rescaled in order to have unit variance, such that the sample covariance matrix of the transformed data becomes a unity matrix. While the rescaling step has to be regarded as problematic, since in most cases it will deteriorate the signal-to-noise ratio of the data, it can be said that otherwise this step roughly corresponds to the multiplication with the Laplacian and the inclusion of instantaneous terms in Eq. (34), i.e. to the identification of the instantaneous correlation structure of the driving noise.
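For comparison with the Laplacian approach, a common textbook form of this ICA-style whitening step (a sketch of standard practice, not the paper's method) computes the inverse square root of the sample covariance:

```python
import numpy as np

def ica_prewhiten(x):
    """ICA-style spatial whitening: centre the data and transform it so
    that its sample covariance matrix becomes the identity."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / len(xc)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric cov^{-1/2}
    return xc @ W

rng = np.random.default_rng(3)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
x = rng.standard_normal((2000, 3)) @ mixing   # spatially correlated data
z = ica_prewhiten(x)
cov_z = z.T @ z / len(z)                      # identity, up to rounding
```

Note that this transform rescales every channel to unit variance, which is exactly the step criticised above as deteriorating the signal-to-noise ratio.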

However, the concept of temporal whitening does not have a counterpart in the standard ICA methodology, since the description of data, as provided by ICA, ignores the constraint of causal modelling, i.e. the direction of time. The numerous ICA algorithms which have been proposed differ in whether they take the temporal ordering of the data into account; while the majority of algorithms ignores the temporal ordering completely, some approaches aim at simultaneously diagonalising instantaneous and lagged covariance matrices,(47) but these matrices, too, do not depend on the direction of time.

Most ICA algorithms are based on the assumption of the existence of a set of source components s_t = (s_t^{(1)}, …, s_t^{(N_s)}), which are assumed to be mutually independent and to have non-Gaussian distributions, such that the data is modelled as

x_t = C s_t ,   (36)

where C denotes the N_v × N_s mixing matrix; frequently N_v = N_s is chosen, but the case N_v > N_s is also possible, as in Factor Analysis.(51) Both C and s_t are unknown and need to be estimated; in many ICA algorithms the assumption of non-Gaussianity of all components s_t^{(v)} (except for at most one) is essential.

Obviously, the approach proposed in this paper differs distinctively from ICA, since it rests on the theorem from Markov process theory mentioned above, according to which a wide class of dynamical systems can be (temporally) whitened into Gaussian innovations, while deviations from Gaussianity in the data are assumed to be a result of nonlinear elements in the dynamics or in the observation process (see Sec. 3.1). Innovations at different grid points are not assumed to be independent or uncorrelated; rather, an explicit model for their non-diagonal covariance matrix is formulated. In the light of the theorem on Gaussian innovations, we would regard the whitening as unsuccessful if the resulting innovations were found to have a clearly non-Gaussian distribution (but, as mentioned in Sec. 3.1, a Poisson noise component may be accepted).

7. APPLICATION TO SIMULATED SPATIOTEMPORAL DYNAMICS

We will now illustrate the ideas presented in this paper by application to data generated by a simulated dynamical system of moderate complexity. The design of our simulation is motivated by earlier work of Schreiber,(48) but while he employed a strongly nonlinear deterministic system, we choose a stochastic system which is based on applying a sigmoid nonlinearity to a linear autoregressive structure; the case of employing a stochastic version of Schreiber's system will be briefly discussed in Sec. 8.

This section discusses the definition of the system, the method to produce simulated data from the system, and the results of analysing this data by the whitening approach, using the parsimonious MAR model, as presented above; for comparison we will also discuss the cases of analysing the data with a full MAR model (having a non-sparse transition matrix) and with standard ICA algorithms.

7.1. Chain of Coupled Stochastic Oscillators

The system consists of a one-dimensional chain of N_v nodes (representing "grid points") with periodic boundary conditions. At each node v, v = 1, …, N_v, a local dynamical process evolves, defined by

y_t^{(v)} = tanh( a_1^{(v)} y_{t−1}^{(v)} + a_2^{(v)} y_{t−2}^{(v)} + b_1^{(v,v−1)} y_{t−1}^{(v−1)} + b_1^{(v,v+1)} y_{t−1}^{(v+1)} ) + η_t^{(v)} .   (37)

Let H_t = (η_t^{(1)}, …, η_t^{(N_v)}) denote the vector of driving noise terms; its covariance is chosen as non-diagonal, following the form given by Eq. (30); the diagonal matrix corresponding to S_{ε(x̃)} is created from a set of node-dependent standard deviations σ_dyn^{(v)} which are generated by drawing N_v numbers from a Gaussian distribution N(χ_η, σ_η) with χ_η ≫ σ_η and additionally smoothing this set along the chain of nodes (observing periodic boundary conditions).

The sets of left and right neighbour interaction parameters b^{(v,v−1)}, b^{(v,v+1)}, v = 1, …, N_v (again observing periodic boundary conditions), are created in the same way as the set of standard deviations σ_dyn^{(v)}. By choosing slightly different means χ_{b−}, χ_{b+} for the Gaussian distributions of left and right neighbour interaction parameters, asymmetries can be introduced into the system, and more "interesting" dynamics results. The local dynamical parameters a_1^{(v)}, a_2^{(v)} are chosen according to the rules for designing stable second-order autoregressive processes, i.e. such that the roots of the corresponding characteristic equation, which shall be written as

(λ^{(v)})² − a_1^{(v)} λ^{(v)} − a_2^{(v)} = 0 ,   (38)

λ_±^{(v)} = r^{(v)} e^{±iϕ^{(v)}} ,   (39)

lie within the unit circle in the complex plane (but the resulting dynamics will become more "interesting" if a few moduli are allowed to be slightly larger than one). Moduli r^{(v)} and phases ϕ^{(v)} can be chosen independently from suitably chosen distributions; we have decided to choose r^{(v)} from a Gaussian distribution N(χ_r, σ_r) and ϕ^{(v)} from the asymmetric distribution

p(ϕ) = −log( p_U(ϕ)(1 − e^{−π}) + e^{−π} ) ,   (40)

where p_U(ϕ) is the uniform distribution on the interval [0, 1]. Parameters then follow by

a_1^{(v)} = 2 r^{(v)} cos ϕ^{(v)} ,   a_2^{(v)} = −(r^{(v)})² .   (41)

Again smoothing along the chain is applied, in order to enforce smoothly varying properties of local dynamics.
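The mapping of Eq. (41) from (r, ϕ) to AR(2) coefficients can be verified directly: the characteristic roots of Eq. (38) must recover the chosen modulus (a small sketch; the specific values are illustrative).

```python
import numpy as np

def ar2_from_roots(r, phi):
    """AR(2) coefficients from the root modulus r and phase phi,
    Eq. (41); r < 1 yields a stable (damped) oscillator mode."""
    return 2.0 * r * np.cos(phi), -r**2

a1, a2 = ar2_from_roots(0.9, np.pi / 8)
# the roots of lambda^2 - a1*lambda - a2 = 0 recover the modulus 0.9
roots = np.roots([1.0, -a1, -a2])
moduli = np.abs(roots)
```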

The hyperbolic tangent in Eq. (37) has been introduced following the example of the sigmoid nonlinearity in neural network models; its main purpose is to prevent divergence of the dynamics which may arise due to some roots λ_±^{(v)} lying outside the unit circle or due to feedback between neighbours. Furthermore this pronounced nonlinearity contributes to creating "interesting" dynamics.

Instantaneous long-distance correlations can be introduced into this system by choosing two (non-neighbouring) nodes v and w and driving them by a common noise process (but preserving the local variances of the two nodes):

η̃_t^{(m)} = (σ_dyn^{(m)} / 2) ( η_t^{(v)} / σ_dyn^{(v)} + η_t^{(w)} / σ_dyn^{(w)} ) ,   m = v, w ;   (42)

the resulting instantaneous long-distance correlations may be termed "intrinsic," in contrast to those which occur either due to the dynamical coupling of nodes within local neighbourhoods, or as a result of pure coincidence, as is not uncommon in short time series.

This system implements a coupled-map lattice consisting of stochastically driven oscillators; (14) after allowing for a transient to die out, data is sampled from the system by recording noisy observations of the states of each node:

$$x_t^{(v)} = y_t^{(v)} + n_t^{(v)}, \qquad (43)$$

where the observational noise $n_t^{(v)}$ is sampled from $N(0, \sigma_{\mathrm{obs}}^{(v)})$; the node-dependent standard deviations $\sigma_{\mathrm{obs}}^{(v)}$ are drawn from $N(\chi_n, \sigma_n)$ (where $\chi_n \gg \sigma_n$) and afterwards smoothed along the chain. Observation noise is uncorrelated between nodes.
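The data generation described above can be sketched as follows. The exact functional form of Eq. (37) is not reproduced in this section, so the placement of the hyperbolic tangent, the omission of parameter smoothing along the chain, and the normalisation of the common noise of Eq. (42) are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nt, burn = 64, 512, 1000

# Node-wise parameters, drawn from the distributions given in the text
# (the smoothing along the chain is omitted in this sketch).
sig_dyn = np.abs(rng.normal(0.25, 0.05, Nv))   # dynamical noise std (chi_eta, sigma_eta)
b_minus = rng.normal(0.55, 0.25, Nv)           # coupling to left neighbour
b_plus  = rng.normal(0.45, 0.20, Nv)           # coupling to right neighbour
r       = np.clip(rng.normal(0.50, 0.20, Nv), 0.0, 0.99)
phi     = -np.log(rng.uniform(0, 1, Nv) * (1 - np.exp(-np.pi)) + np.exp(-np.pi))
a1, a2  = 2 * r * np.cos(phi), -r**2           # AR(2) coefficients, Eq. (41)

vc, wc = 16, 41                                # intrinsically correlated pair
y = np.zeros((2, Nv))                          # states at t-1 (row 0) and t-2 (row 1)
data = np.empty((Nt, Nv))
for t in range(burn + Nt):
    eta = rng.normal(0.0, sig_dyn)
    # common driving noise for the pair, rescaled to the local stds, Eq. (42)
    common = eta[vc] / (2 * sig_dyn[vc]) + eta[wc] / (2 * sig_dyn[wc])
    eta[vc], eta[wc] = sig_dyn[vc] * common, sig_dyn[wc] * common
    drive = (a1 * y[0] + a2 * y[1]
             + b_minus * np.roll(y[0], 1) + b_plus * np.roll(y[0], -1))
    y_new = np.tanh(drive) + eta               # assumed placement of tanh in Eq. (37)
    y = np.vstack([y_new, y[0]])
    if t >= burn:
        data[t - burn] = y_new + rng.normal(0.0, 0.1, Nv)   # observations, Eq. (43)
print(data.shape)
```

The tanh keeps the deterministic part of the dynamics bounded, so the sketch cannot diverge even for coefficient draws with roots outside the unit circle.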

7.2. Results of Simulation Study

We implement the dynamical system described in the previous section with $N_v = 64$ nodes. Parameters of the Gaussian distributions are chosen as $(\chi_\eta, \sigma_\eta) = (0.25, 0.05)$, $(\chi_{b-}, \sigma_{b-}) = (0.55, 0.25)$, $(\chi_{b+}, \sigma_{b+}) = (0.45, 0.2)$, $(\chi_r, \sigma_r) = (0.5, 0.2)$ and $(\chi_n, \sigma_n) = (0.1, 0.01)$. Grid points 16 and 41 are correlated according to Eq. (42). The system is simulated, starting from random


Fig. 1. Simulated dynamics for a stochastic coupled-map chain consisting of 64 nodes; upper panel: amplitude (colour-coded) vs. node number (vertical axis) and time (horizontal axis); periodic boundary conditions apply to the vertical axis. Lower panel: time series of amplitudes at nodes 16 (red) and 41 (green) vs. time.

initial conditions, and after waiting for 1000 time points in order to allow for transient behaviour, a multivariate data set of $N_t = 512$ points length is recorded from all 64 nodes, including observational noise. The data is displayed in Fig. 1; the upper panel shows the time series of all nodes along the chain, while the time series of the intrinsically correlated nodes 16 and 41 are shown explicitly in the lower panel. In the figure it can be seen that, for the selected values of parameters, groups of nodes switch randomly between predominantly positive or negative values, thereby forming coherent spatiotemporal patterns. Furthermore it can be seen that, at least with respect to visual inspection, the correlation between nodes 16 and 41 is not very pronounced.

The same point is demonstrated in Fig. 2, where for all pairs of nodes $(v, w)$ the linear correlation measure $r(v, w) := r(x^{(v)}, x^{(w)})$ according to Eq. (3) (upper panel) and the mutual information $I(v, w) := I(x^{(v)}, x^{(w)})$ according to Eq. (4) (lower panel) are shown. The estimate of $I(v, w)$ is obtained by using a MATLAB implementation of a histogram-based estimator, provided by Moddemeijer. (22) Note that in this figure both quantities are calculated directly for the data $x_t^{(v)}$. The


Fig. 2. Linear correlation matrix (upper panel) and mutual information matrix (lower panel) for all pairs of nodes, estimated from the multivariate time series shown in Fig. 1. Mutual information was estimated by a histogram estimator. In both panels values on the diagonal have been omitted.


matrix of linear correlations shows broad areas of correlated or anti-correlated nodes, and the pair (16, 41), producing a value of $r(16, 41) = 0.5006$ (while $r$ is bounded by definition to the interval $[-1, 1]$), cannot be easily distinguished from these areas, although its correlation is of completely different origin than all other correlations which are visible in the figure. The matrix of mutual information reproduces some of the features of the matrix of linear correlations, but without distinguishing between correlation and anti-correlation; in this case the intrinsically correlated pair produces a rather low value of $I(16, 41) = 0.1286$ (here it should be mentioned that the typical scaling of this estimate depends on certain design choices of the mutual information estimator which we shall not discuss here in detail).
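A minimal plug-in version of such a histogram estimator can be sketched as follows; Moddemeijer's implementation additionally applies bias corrections, so the absolute scaling of its output differs from this sketch:

```python
import numpy as np

def mutual_info_hist(x, y, bins=16):
    """Plug-in histogram estimate of mutual information (in nats).
    A generic sketch; Moddemeijer's MATLAB routine adds bias corrections
    that are not reproduced here."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask])))

# sanity check on correlated Gaussians: the continuous-case value is
# -0.5 * log(1 - rho^2), about 0.51 nats for rho = 0.8; the binned
# estimate lies somewhat below it because of discretisation
rng = np.random.default_rng(1)
z = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=100_000)
print(mutual_info_hist(z[:, 0], z[:, 1]))
```

For independent inputs the plug-in estimate is small but positively biased, roughly $(B-1)^2/(2N)$ for $B$ bins and $N$ samples, which is one of the "design choices" affecting the scaling mentioned above.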

Clearly the particular estimates obtained for linear correlation and mutual information are to be regarded as not very reliable, since the underlying time series is rather short, $N_t = 512$. Repeating the analyses 1000 times for different realisations generated by the same system (each of 512 points length) yields, in terms of means $m(.)$ and standard deviations $s(.)$, for linear correlation $m(r(16, 41)) = 0.2536$ and $s(r(16, 41)) = 0.1741$, and for mutual information $m(I(16, 41)) = 0.0841$ and $s(I(16, 41)) = 0.0344$, i.e. fairly broad distributions. Note that in real life the option of repeating analyses for many equivalent time series recorded under exactly preserved conditions may frequently not be given.

We would like to briefly discuss the distribution of the simulated data. In Fig. 3 skewness and kurtosis are shown for the time series recorded for each node (open circles). It can be seen that some nodes display pronounced skewness, i.e. asymmetric distributions; and while many nodes display negative values of kurtosis, some reach sizable positive values. These results show a substantial deviation from Gaussianity, which is a result of the sigmoid nonlinearity in the dynamics. Negative values of kurtosis correspond to thin-tail behaviour, as should be expected for a sigmoid function, but it is obvious that not all nodes follow this pattern, since the distribution of some nodes is dominated by asymmetries.

Fig. 3. Skewness (left panel) and kurtosis (right panel) vs. node number (horizontal axis), calculated from the multivariate time series shown in Fig. 1 (open circles) and from the corresponding innovations (black dots).
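The quantities plotted in Fig. 3 are the standard moment-based definitions; "kurtosis" here denotes excess kurtosis (zero for a Gaussian), since negative values are reported. A sketch:

```python
import numpy as np

def skewness(x):
    # third standardised central moment; zero for symmetric distributions
    m = x - np.mean(x)
    return float(np.mean(m**3) / np.mean(m**2)**1.5)

def excess_kurtosis(x):
    # fourth standardised central moment minus 3; zero for a Gaussian,
    # negative for thin-tailed distributions
    m = x - np.mean(x)
    return float(np.mean(m**4) / np.mean(m**2)**2 - 3.0)

rng = np.random.default_rng(2)
g = rng.normal(size=200_000)       # Gaussian: both quantities near zero
u = rng.uniform(-1, 1, 200_000)    # thin tails: excess kurtosis near -1.2
print(skewness(g), excess_kurtosis(g), excess_kurtosis(u))
```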

Next we perform spatial whitening by applying the transformation according to Eqs. (27) and (28), and temporal whitening by fitting to each node the model Eq. (26); note that this is a completely linear model, in contrast to the true model used for generating the data, Eq. (37). Although we are thereby employing an incorrect model class, very good whitening can nevertheless be achieved. The parameter $k$ in Eq. (28) is chosen as 9, although one would rather expect 2 in a one-dimensional system; but we found that too strong spatial whitening produces spurious temporal correlations (or, more precisely, anticorrelations), especially for pairs of neighbouring nodes; therefore we recommend avoiding too large values of $1/k$.

The success of the whitening transformation can be seen in Fig. 4, where again linear correlation (upper panel) and mutual information (lower panel) are shown, but now for the innovations $\epsilon_t^{(v)}$ of the spatial and temporal whitening steps instead of for the original data $x_t^{(v)}$. In both panels the pair (16, 41) stands out clearly against a background of very small values, assuming a correlation value of $r(16, 41) = 0.6959$, while the second largest value in the matrix is 0.1560, and a mutual information value of $I(16, 41) = 0.2485$, while the second largest value is 0.0310. The values obtained by repeating the analysis for 1000 different realisations are, for linear correlation, $m(r(16, 41)) = 0.6735$ and $s(r(16, 41)) = 0.0262$, and, for mutual information, $m(I(16, 41)) = 0.2404$ and $s(I(16, 41)) = 0.0290$; these distributions are much narrower than those obtained above in the case without whitening, thereby demonstrating the degree of precision that can be achieved on the basis of time series of only 512 points length. Note that if the innovations could be estimated without any error, we would expect $r(16, 41) = 1.0$, but due to observational noise, model imperfection and finite-sample effects this value cannot be attained. The theoretical optimal value for $I(16, 41)$ is given by the entropy of the common innovations.

If we investigate the distribution of the innovations, we obtain the results shown by the black dots in Fig. 3. It can be seen that for most nodes both skewness and kurtosis are now much closer to zero; especially the pronounced asymmetries have been removed. The values of these two quantities scatter around zero for the set of all nodes, with mean values very close to zero: 0.0105 for skewness and 0.0175 for kurtosis. These results confirm that the whitening step has produced approximately Gaussian innovations, despite fitting a linear model to nonlinear data.

Furthermore we apply to each pair of nodes the model comparison given by Eqs. (33) and (34) and compute the corresponding parametric estimate of mutual information, given by Eq. (35); the result is shown in Fig. 5. From the figure it can be seen that also by this method the intrinsic correlation of the pair (16, 41) is clearly


Fig. 4. Linear correlation matrix (upper panel) and mutual information matrix (lower panel) for all pairs of nodes, estimated from the innovations of the multivariate time series shown in Fig. 1. Mutual information was estimated by a histogram estimator. In both panels values on the diagonal have been omitted.


Fig. 5. Mutual information matrix for all pairs of nodes, estimated by the parametric estimator based on the difference of log-likelihoods of pairwise model comparisons from the multivariate time series shown in Fig. 1. Values on the diagonal have been omitted.

detected, assuming a value of $I(16, 41) = 98.1332$, while the second largest value is 5.7633. Obviously the scaling of this estimate of mutual information is different from that based on histograms, but the relative behaviour is essentially the same. The values for a distribution of 1000 realisations are $m(I(16, 41)) = 88.7080$ and $s(I(16, 41)) = 21.3805$.
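Eq. (35) itself is not reproduced in this excerpt; for Gaussian innovations its leading contribution is the term $-\tfrac{N_t}{2}\log(1-\hat r^2)$ appearing in Eq. (47) of the Appendix, which already explains the different scaling relative to histogram estimates. A sketch of this leading term only (the variance-ratio terms of Eq. (47) are omitted, so the numbers quoted above are not reproduced exactly):

```python
import numpy as np

def parametric_mi_leading(eps_x, eps_y):
    """Leading term -(Nt/2) * log(1 - r^2) of the parametric estimate,
    valid for jointly Gaussian innovations; the variance-ratio terms of
    Eq. (47) are omitted in this sketch."""
    r = np.corrcoef(eps_x, eps_y)[0, 1]
    return float(-0.5 * len(eps_x) * np.log(1.0 - r**2))

rng = np.random.default_rng(4)
n = 512
common = rng.normal(size=n)                 # shared innovation component
ex = 0.7 * common + rng.normal(size=n)      # hypothetical coupled pair
ey = 0.7 * common + rng.normal(size=n)
ez = rng.normal(size=n)                     # uncoupled node
print(parametric_mi_leading(ex, ey), parametric_mi_leading(ex, ez))
```

Because the estimate scales linearly with $N_t$, its absolute values are far larger than those of per-sample histogram estimates, as observed in the text.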

Finally we investigate how the performance of the whitening approach in correctly identifying intrinsically correlated voxel pairs depends on the time series length $N_t$ and on the standard deviation of the observational noise $\chi_n$ (the latter averaged over nodes). For this purpose we formulate the null hypothesis "there is no intrinsic correlation for the pair (16, 41)" and try to reject it based on a simple ranking of the values of $r$ and $I$ for all pairs of voxels: if $r(16, 41) > r(v, w)$ for all pairs $(v, w) \ne (16, 41)$, the null hypothesis is rejected. Instead of the coefficient of linear correlation $r$, the mutual information $I$, estimated either by a histogram estimator or by Eq. (35), may also be used. While repeating this procedure 500 times (each time choosing new parameters for the simulated system according to the distributions given above, keeping the parameters of the distributions fixed at the values given above, and choosing a new set of observation noise standard deviations $\sigma_{\mathrm{obs}}^{(v)}$ from $N(\chi_n, \sigma_n)$, where $\sigma_n = \chi_n / 10$) for each pair of values of $N_t$ and $\chi_n$, we


Table I. Estimate of time series length $N_t$ required for achieving probability of type-II error $p_{\mathrm{II}} = 0.0, \ldots, 0.75$ for given (average) observation noise standard deviation $\chi_n = 0.02, 0.1, 0.2, 0.4$, using the correlation coefficient, the histogram estimate of mutual information, or the parametric estimate of mutual information (by Eq. (35)).

                                      p_II = 0.0    0.25    0.50    0.75
  correlation coefficient r
    chi_n = 0.02                             50      30      25      20
            0.1                             120      45      40      30
            0.2                             300     135     100      70
            0.4                            >700    >700     650     375
  mut. inf. I (histogram estimate)
    chi_n = 0.02                            105      60      50      40
            0.1                             240     130     100      80
            0.2                             700     430     330     240
            0.4                            >700    >700    >700    >700
  mut. inf. I (parametric estimate)
    chi_n = 0.02                             80      30      25      20
            0.1                             150      55      45      30
            0.2                             480     180     120      80
            0.4                            >700    >700    >700     690

Note. Values have been estimated by simulation (see text for details) and rounded to multiples of 5.

record the number of failures to reject the null hypothesis, i.e. we measure the rate of type-II errors $p_{\mathrm{II}}$.
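The ranking-based rejection rule and the resulting type-II error rate can be sketched as follows, here demonstrated on synthetic statistic matrices rather than on the simulated lattice:

```python
import numpy as np

def ranking_reject(M, pair):
    """Reject H0 ('no intrinsic correlation at `pair`') iff the pair's
    statistic exceeds that of every other off-diagonal pair."""
    v, w = pair
    mask = ~np.eye(M.shape[0], dtype=bool)
    mask[v, w] = mask[w, v] = False
    return bool(M[v, w] > M[mask].max())

def type2_rate(stat_matrices, pair):
    """Fraction of realisations in which H0 is NOT rejected (p_II)."""
    fails = sum(not ranking_reject(M, pair) for M in stat_matrices)
    return fails / len(stat_matrices)

# toy demonstration on synthetic symmetric statistic matrices
rng = np.random.default_rng(5)
mats = []
for _ in range(100):
    M = np.abs(rng.normal(0.0, 0.05, size=(8, 8)))
    M = (M + M.T) / 2
    M[2, 5] = M[5, 2] = 0.6            # planted 'intrinsic' pair (2, 5)
    mats.append(M)
print(type2_rate(mats, (2, 5)), type2_rate(mats, (0, 1)))
```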

When applying this analysis to estimates of linear correlation and of mutual information obtained from the original (unwhitened) data, $p_{\mathrm{II}}$ is always larger than 0.9 and in most cases 1.0, in agreement with the results presented so far; for estimates obtained from the innovations we obtain the results shown in Table I. The table gives the approximate length of time series $N_t$ which is required, at a given value of $\chi_n$, to reduce the probability of type-II error $p_{\mathrm{II}}$ to a given value; values for $N_t$ have been rounded to multiples of 5. Note that according to Eq. (37) the distribution of the simulated dynamics is essentially bounded by $-1 \le y_t^{(v)} \le 1$; in fact we find that the sample standard deviations of the simulated data sets before adding observational noise scatter around a value of 0.6. This value is to be compared with the values for $\chi_n$ given in the table.

From the table it can be seen that the parametric estimate of mutual information provides better detection of intrinsic correlation than the histogram estimate, while the linear correlation coefficient displays the best performance. In summary it can be stated that, after whitening, intrinsic correlation can be detected reliably in time series of a few hundred points length, even in the presence of strong observational noise.

7.3. Application of Full MAR Analysis

For the purpose of comparison, we shall consider the case of fitting a full MAR model, according to Eq. (24), to the same data as studied so far, instead of a parsimonious model. In the case of the smallest model order $p = 1$, as in POP analysis, we find from Eq. (25) that the number of model parameters $N_{\mathrm{par}}$ is 18.85% of the total number of available data values $N_v N_t$, while $p = 2$ (which is the "true" model order of this simulation) yields 31.35%, so both values exceed the 10%-rule mentioned in Sec. 4.1; the parsimonious model Eq. (32) gives a much more favourable ratio of 0.98%.
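Eq. (25) is not reproduced in this excerpt; the quoted percentages are consistent with counting the $p N_v^2$ transition-matrix entries plus the $N_v(N_v+1)/2$ entries of the symmetric innovation covariance, which is therefore the (assumed) count used in this arithmetic check:

```python
Nv, Nt = 64, 512

def mar_params(p, Nv):
    # p transition matrices (Nv x Nv) plus the symmetric innovation
    # covariance; this decomposition is an assumption that reproduces
    # the percentages quoted in the text
    return p * Nv**2 + Nv * (Nv + 1) // 2

for p in (1, 2):
    n_par = mar_params(p, Nv)
    print(f"p = {p}: N_par = {n_par}, ratio = {100 * n_par / (Nv * Nt):.2f}%")
# -> p = 1: 18.85%,  p = 2: 31.35%
```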

Ignoring this problem for a moment and fitting a MAR(2) model (i.e. a MAR model with model order $p = 2$) by Whittle's algorithm, (39) we obtain transition matrices $A_1$ and $A_2$; as should be expected, we find that these matrices appear "blurred" as compared to the correct matrices employed in the simulation, i.e. the non-vanishing values are systematically underestimated (in terms of their absolute values; diagonal elements of $A_2$, which should always be negative, are also sometimes estimated as positive), while the vanishing values (in $A_1$ all elements referring to non-neighbouring pairs of nodes, and in $A_2$ all off-diagonal elements) assume non-zero values, scattering around zero. A similar remark applies to the covariance matrix of the driving noise, $S_{\epsilon(x)}$. The intrinsic correlation between nodes 16 and 41 can also be detected by this analysis: it appears now partly in the corresponding off-diagonal elements of the estimate of $S_{\epsilon(x)}$ (which in the parsimonious model, due to the Laplacian transformation, corresponds to a diagonal matrix), and partly in the remaining correlations of the innovations of the full MAR model fit. Naturally, due to the overparametrisation, this intrinsic correlation will easily be lost in noise resulting either from estimation errors or present in the data itself.

In this simulation the data was created by a parsimonious MAR model, so it is not surprising that a similar model shows better performance than the full MAR model; in the analysis of real-world data it will depend on the properties of the data and the underlying system which model is more appropriate. In systems for which the assumption of fast local dynamics is invalid, the full MAR model may indeed be superior. Whether the much larger number of model parameters is justified can be determined by information criteria such as AIC and BIC; (36) e.g. for the example of the simulation presented above, we find that the AIC of the MAR models of first and second order assume values of 15158.61 and 17255.94, respectively, while the parsimonious model assumes a much lower value of 10044.47. We remark that the pure likelihood of the MAR models is larger than that of the parsimonious model; however, due to their large numbers of parameters, these models are heavily punished by AIC for overfitting. But since in this case the number of parameters is too large for reliable estimation of AR models anyway, compared to the number of available data values, the parsimonious model provides a convenient alternative.

7.4. Application of ICA

It is possible to apply ICA algorithms to the simulated data shown in Fig. 1; however, in doing so we impose the constraint of the existence of a set of non-Gaussian independent signals which are related to the data by a mixing matrix, as shown in Eq. (36). The only independent signals which were involved in generating the data are the driving noises $H_t = (\eta_t^{(1)}, \ldots, \eta_t^{(N_v)})$, but their relationship to the data is of a more complicated nature than described by Eq. (36); therefore these noises cannot be retrieved by ICA. The estimation of the driving noises is the goal of the whitening approach (assuming that observation noise can be neglected) and requires a dynamical model of the process generating the data. Furthermore, the available knowledge about neighbourhood relationships between nodes is ignored by standard ICA algorithms.

If we nevertheless perform an ICA decomposition of the simulated data, employing the "FastICA" algorithm introduced by Hyvärinen (46) (using "symmetric" estimation, i.e. estimating all independent components in parallel), we obtain the results shown in Fig. 6. The algorithm produces a set of 64 components, but there is no preferred ordering of these components, since the spatial neighbourhood relationships of the data have been lost; for this reason the plot shows no equivalent to spatial structures. In the lower panel of Fig. 6 two arbitrarily chosen components are shown explicitly; it can be seen that they display little temporal structure, except for one sharp spike-like maximum of each component. Upon closer inspection it is found that such sharp extrema are present for most of the independent components; since they never occur at the same time point for two components, such spikes contribute to reducing the residual dependencies between components.

Indeed, the residual correlations between the components provided by FastICA are virtually zero: the maximum value of linear correlation among all pairs of components is $r = 5.79 \times 10^{-16}$; for the residual mutual information, using the same histogram-based estimator as before, a value of $I = 0.0115$ is found. These values show that this decomposition was successful with respect to the underlying assumption of the ICA approach; nevertheless, the decomposition provides no useful information about the actual dynamics of this particular system or about its spatial correlation structures, such as the existence of one pair of intrinsically correlated non-neighbouring nodes.
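The symmetric FastICA estimation referred to above can be sketched in a few lines; this is a minimal NumPy re-implementation with the tanh nonlinearity, not Hyvärinen's reference code, demonstrated on a toy two-source mixture rather than on the lattice data:

```python
import numpy as np

def fastica_symmetric(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA with the tanh nonlinearity; a sketch in
    the spirit of Hyvarinen's algorithm, not the reference implementation.
    X has shape (n_signals, n_samples)."""
    n, m = X.shape
    X = X - X.mean(axis=1, keepdims=True)
    # pre-whitening via the eigendecomposition of the covariance
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    W = np.random.default_rng(seed).normal(size=(n, n))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # fixed-point update: w <- E[z g(w'z)] - E[g'(w'z)] w
        W = G @ Z.T / m - np.diag((1.0 - G**2).mean(axis=1)) @ W
        # symmetric decorrelation: W <- (W W')^(-1/2) W
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt
    return W @ Z            # estimated independent components

# toy check: unmix two independent non-Gaussian (uniform) sources
rng = np.random.default_rng(6)
S = rng.uniform(-1, 1, size=(2, 5000))
A = np.array([[1.0, 0.4], [0.3, 1.0]])
C = fastica_symmetric(A @ S)
print(np.abs(np.corrcoef(C)[0, 1]))   # residual correlation near zero
```

As in the text, the decorrelation step guarantees nearly vanishing residual linear correlation between components regardless of whether the mixing-matrix assumption actually holds for the data.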

Fig. 6. FastICA decomposition of the data shown in Fig. 1; upper panel: amplitude (colour-coded) vs. component number (vertical axis) and time (horizontal axis); the ordering of components is arbitrary. Lower panel: time series of amplitudes of components 1 (red) and 2 (green) vs. time.

Finally, we remark that some ICA algorithms have been proposed which are based on maximising a (log-)likelihood, as is also the case in time series modelling by dynamical models; since likelihood (after the bias correction against overfitting, i.e. employing AIC or BIC, see above) represents a global measure of the quality of models for given data, the performance of these methods can be compared directly to time series modelling. As examples we mention the work of Wu and Principe (49) and of Choi et al., (50) both of which are based on the "generalised Gaussian distribution," a class of distributions which contains the Gaussian distribution as a particular case. Fitting ICA models based on this class of distributions to data bears close similarity to the likelihood-based version of classical Factor Analysis, (51) but without the constraint of Gaussianity. By applying an ICA algorithm based on these ideas to the same simulated data as before, we obtain a value of AIC = 24986.08. The fact that this value is much larger than the AIC of parsimonious MAR modelling, AIC = 10044.47 (thereby demonstrating a much inferior description of the data, as compared to parsimonious MAR), reflects partly the benefit of including dynamics in the model, i.e. of temporal whitening, and partly the penalty for the much larger number of parameters of the ICA model (which in this case is dominated by the $N_v^2$ parameters in the mixing matrix $C$); but in this case even the pure likelihood of the parsimonious MAR model is considerably larger than the likelihood of the ICA model.

8. DISCUSSION AND CONCLUSION

In this paper we have explored a new viewpoint concerning the definition and estimation of mutual information between time series within multivariate data sets sampled from spatially extended dynamical systems; the link between mutual information and log-likelihood which we have demonstrated applies to the estimation of mutual information between any pair of temporally correlated time series. We have presented a conceptually simple approach for parsimonious spatiotemporal modelling of data, which is based on the notion of whitening with respect to both space and time. The whitening approach proposed in this paper can be summarised by the following steps:

1. For each time point the data vector is multiplied by the appropriate Laplacian matrix (see Eqs. (27) and (28)); thereby the data is spatially whitened.
2. The resulting transformed data is modelled by a suitable MAR model (see Eq. (26)); thereby temporal whitening is provided. For high-dimensional time series (representing a large number of grid points in space), parsimonious modelling is achieved by limiting direct interactions (i.e. non-zero model parameters in the transition matrices) to neighbours.
3. The resulting innovations are analysed for remaining correlations; such correlations will usually be instantaneous spatial correlations, as demonstrated in the simulation example, but they could also involve time lags between different spatial locations.
4. Finally there is the option of refitting the autoregressive model (i.e. repeating step 2), but augmented by the additional correlations identified in step 3; an example, limited to one pair of grid points, is given in Eq. (34).
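Steps 1-3 above can be sketched end-to-end as follows; an assumed Laplacian form stands in for Eqs. (27)-(28), a node-wise least-squares AR(2) stands in for Eq. (26), and step 4 (refitting the augmented model) is omitted:

```python
import numpy as np

def whitening_pipeline(X, k=9, thresh=0.4):
    """Sketch of steps 1-3 for data X of shape (Nt, Nv) on a periodic chain."""
    Nt, Nv = X.shape
    # step 1: spatial whitening by a Laplacian-type matrix (assumed form)
    L = np.eye(Nv)
    idx = np.arange(Nv)
    L[idx, (idx - 1) % Nv] -= 1.0 / k
    L[idx, (idx + 1) % Nv] -= 1.0 / k
    Xs = X @ L.T
    # step 2: temporal whitening by node-wise AR(2) least-squares fits
    eps = np.empty((Nt - 2, Nv))
    for v in range(Nv):
        x = Xs[:, v]
        D = np.column_stack([x[1:-1], x[:-2]])
        coef, *_ = np.linalg.lstsq(D, x[2:], rcond=None)
        eps[:, v] = x[2:] - D @ coef
    # step 3: scan the innovations for remaining instantaneous correlations
    R = np.corrcoef(eps, rowvar=False)
    flagged = [(v, w) for v in range(Nv) for w in range(v + 1, Nv)
               if abs(R[v, w]) > thresh]
    return R, flagged

rng = np.random.default_rng(7)
X = rng.normal(size=(512, 64))
X[:, 41] = 0.8 * X[:, 16] + 0.6 * rng.normal(size=512)   # planted correlation
R, flagged = whitening_pipeline(X)
print(flagged)
```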

Provided the model is sufficiently successful in whitening the data, the innovations can be assumed to have a multivariate Gaussian distribution without temporal or spatial correlations; therefore they represent a convenient basis for the estimation of mutual information, and hence connectivity, as has been shown in this paper. By exploiting this property of the innovations we have derived a parametric estimator of mutual information, which may replace the commonly used non-parametric estimators. But it should be stressed that the crucial point is the removal of spatiotemporal correlations, i.e. the twofold whitening step; once this has been done, non-parametric estimators also provide considerably improved results.

It is a typical situation in present-day scientific research that spatially extended dynamical systems are investigated by sampling at discrete spatial positions and time points. It may be argued that bivariate mutual information is insufficient for such situations, since it does not properly address the various potential sources of spatial and temporal correlations. One possible extension of the standard definition is given by conditional mutual information; (12) whitening by explicit modelling of the dynamics with respect to both space and time provides an alternative approach. As we have demonstrated in the simulation example, it is not required that the correct dynamical model be employed; we expect that for a large fraction of cases a linear low-order autoregressive model with nearest-neighbour interactions may prove to be a useful approximation.

As a natural consequence of the methodology discussed in this paper, the estimate of mutual information becomes dependent on the dynamical model used for whitening; as an example, compare the lower panels of Figs. 2 and 4: whereas in Fig. 2 all correlations contribute to the mutual information, in Fig. 4 most correlations have been removed by whitening, and only those remain which are not captured by the model. By this method it becomes possible to decompose correlations into different layers, such that each refinement of the model makes it possible to describe more correlations through the model and to delegate fewer correlations to the class of "unexplained" correlations remaining in the innovations.

Whether explicitly nonlinear model classes need to be employed depends on the data; generally, strong deviations from Gaussianity will indicate the necessity of employing nonlinear elements in the modelling of the dynamics or of the observation process. The case of data generated by a sigmoid nonlinearity is, to some degree, a well-behaved case, since it corresponds to thin-tail deviations from Gaussianity. On the other hand, it has been demonstrated in this paper that also pronounced asymmetries can efficiently be mapped back to approximately Gaussian distributions by employing a simplified linear model. We do not yet know how this model class would perform in the presence of heavy-tail distributions; further work is required in order to obtain more experience with respect to these issues.

While in this paper we have relied upon using a fairly simple model class for the dynamics, it has to be admitted that the true dynamics of many spatially extended systems occurring in Nature may be considerably more complex, and therefore it may in some cases be very difficult to find efficient models for whitening the data. In such situations it will be necessary to explore more sophisticated model classes and to spend considerable effort on the choice and adaptation of the models, possibly employing prior external knowledge about the dynamical properties of the system in question.

Also in simulations it is possible to find systems for which the approach as described in this paper fails. As an example we mention the logistic map lattice used in Ref. 48; a stochastic version of this lattice could be defined by (again using a one-dimensional chain of points with periodic boundary conditions)

$$y_t^{(v)} = a_1^{(v)} f\!\left( y_{t-1}^{(v)} \right) + b_1^{(v,v-1)} f\!\left( y_{t-1}^{(v-1)} \right) + b_1^{(v,v+1)} f\!\left( y_{t-1}^{(v+1)} \right) + \eta_t^{(v)}, \qquad (44)$$

where $f(\cdot)$ denotes the logistic map

$$f(y) = c - y^2; \qquad (45)$$

here $c$ denotes a constant parameter. Obviously Eq. (44) represents a strongly nonlinear system that cannot be easily approximated by linear models. When repeating all numerical experiments of this paper with data generated by Eq. (44) (using $c = 1.99$) instead of Eq. (37), again driving two nodes by a common noise input, none of the matrices of correlation and mutual information, whether for raw data or innovations, detects this intrinsically correlated pair. This would probably require modelling the data by employing the correct model class, as given by Eqs. (44) and (45). For real data the correct model class remains unknown; in this case the performance of linear and nonlinear models needs to be compared by their corresponding likelihoods, or preferably by their values of AIC or BIC.
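The stochastic logistic-map chain of Eqs. (44)-(45) can be sketched as follows; the coefficient values and the clipping guard are assumptions of this sketch (the paper draws the coefficients from the distributions of the earlier simulation):

```python
import numpy as np

def logistic_lattice(Nv=64, Nt=512, burn=500, c=1.99, noise=0.01, seed=8):
    """Stochastic coupled logistic-map chain, Eqs. (44)-(45), with periodic
    boundaries. Coefficient values and the clipping guard are illustrative
    assumptions of this sketch."""
    rng = np.random.default_rng(seed)
    a1 = np.full(Nv, 0.8)             # local weight a_1^(v)
    bm = np.full(Nv, 0.1)             # left-neighbour coupling b_1^(v,v-1)
    bp = np.full(Nv, 0.1)             # right-neighbour coupling b_1^(v,v+1)
    f = lambda y: c - y**2            # logistic map, Eq. (45)
    y = rng.uniform(-1.0, 1.0, Nv)
    out = np.empty((Nt, Nv))
    for t in range(burn + Nt):
        fy = f(y)
        y = (a1 * fy + bm * np.roll(fy, 1) + bp * np.roll(fy, -1)
             + rng.normal(0.0, noise, Nv))          # Eq. (44)
        y = np.clip(y, -c, c)         # guard against noise-driven escape from [-c, c]
        if t >= burn:
            out[t - burn] = y
    return out

lat = logistic_lattice()
print(lat.shape)
```

Unlike the tanh lattice of Eq. (37), the quadratic map has no globally contracting nonlinearity, so noise can push states out of the invariant interval; the clip is a pragmatic guard rather than part of Eq. (44).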

As has already been mentioned, it is possible to generalise the analysis discussed in this paper to the case of lagged correlations between different points in space. While here we have confined our attention to instantaneous correlations in the innovations, it would easily be possible to analyse more general cross-correlations within the innovations, and thereby investigate directed connectivity and causal relationships between pairs of points in space. Again we presume that it may be useful and even necessary to first remove the layers of easily explainable correlations before the deeper causality relationships can be uncovered.

Finally we mention that, not surprisingly, the results of modelling and of the estimation of mutual information will be the more reliable, the longer the available time series are. By using a time series of $N_t = 512$ points length in the simulation we have chosen an intermediate case; when using longer time series, the intrinsic correlation between pairs of nodes will gradually become completely hidden in the raw data, whereas in Fig. 2 it is still possible to discern positive values of $r(16, 41)$ and $I(16, 41)$ which weakly stand out against the background. In such cases whitening approaches to uncovering intrinsic correlations become even more useful. On the other hand, in various fields, such as biomedical data analysis or palaeoclimatic research, it is not uncommon that the length of the available time series is of the order of only 100 time points, or even shorter. We have found in additional simulations that also in such unfavourable cases the whitening approach succeeds in providing results similar to those shown in Figs. 4 and 5; however, in such cases the actual appearance of the matrices $r(v, w)$ and $I(v, w)$ will strongly depend on the particular realisation sampled from the system.

We are currently applying the methodology developed in this paper to spatiotemporal data sets obtained by functional magnetic resonance imaging (fMRI) from the brains of volunteers participating in stimulation experiments; the detailed results of these investigations will be presented and discussed elsewhere. However, we expect that the theory and methodology of whitening through parsimonious MAR modelling will bear relevance for the analysis of spatiotemporal data sets arising in numerous fields of science.

APPENDIX A: MUTUAL INFORMATION OF BIVARIATE

INNOVATION TIME SERIES

In order to evaluate Eq. (8), the two main contributions from the log-likelihoods, Eqs. (17) and (18), need to be evaluated, while the constant terms (those including log(2π)) cancel out. The first contribution is given by (omitting the factor (−1/2))

$$ N_t \log \bigl|\hat S_{\epsilon(x,y|x,y)}\bigr| - N_t \log \hat\sigma^2_{\epsilon(x|x)} - N_t \log \hat\sigma^2_{\epsilon(y|y)}; \qquad (46) $$

by inserting the determinant of Eq. (19) and rearranging, this contribution readily yields

$$ N_t \log\bigl(1 - \hat r^2(\epsilon(x),\epsilon(y))\bigr) + N_t \bigl(\log \hat\sigma^2_{\epsilon(x|x,y)} - \log \hat\sigma^2_{\epsilon(x|x)}\bigr) + N_t \bigl(\log \hat\sigma^2_{\epsilon(y|x,y)} - \log \hat\sigma^2_{\epsilon(y|y)}\bigr); \qquad (47) $$

here the “hat” notation denotes the maximum-likelihood (ML) estimates of the corresponding parameters.

The second contribution is given by (again omitting (−1/2))

$$ \sum_t \bigl(\epsilon_t(x|x,y),\, \epsilon_t(y|x,y)\bigr)\, \hat S^{-1}_{\epsilon(x,y|x,y)}\, \bigl(\epsilon_t(x|x,y),\, \epsilon_t(y|x,y)\bigr)^\dagger - \sum_t \frac{\epsilon_t^2(x|x)}{\hat\sigma^2_{\epsilon(x|x)}} - \sum_t \frac{\epsilon_t^2(y|y)}{\hat\sigma^2_{\epsilon(y|y)}}. \qquad (48) $$

In order to evaluate this contribution the appropriate ML estimates need to be inserted explicitly. From a standard theorem of multivariate statistics (see e.g. Theorem 4.3.1 in Ref. 52) it follows that the ML estimate of S_{ε(x,y|x,y)} is given by

$$ \hat S_{\epsilon(x,y|x,y)} = \begin{pmatrix} N_t^{-1} \sum_t \epsilon^2(x|x,y) & N_t^{-1} \sum_t \epsilon(x|x,y)\,\epsilon(y|x,y) \\ N_t^{-1} \sum_t \epsilon(x|x,y)\,\epsilon(y|x,y) & N_t^{-1} \sum_t \epsilon^2(y|x,y) \end{pmatrix}. \qquad (49) $$

By comparing Eqs. (19) and (49) the ML estimators of the variances and of the linear correlation coefficient result as

$$ \hat\sigma^2_{\epsilon(\cdot|\cdot)} = \frac{1}{N_t} \sum_t \epsilon_t^2(\cdot|\cdot) \quad\text{and}\quad \hat r(\epsilon(x),\epsilon(y)) = \frac{\sum_t \epsilon(x|x,y)\,\epsilon(y|x,y)}{N_t\, \hat\sigma_{\epsilon(x|x,y)}\, \hat\sigma_{\epsilon(y|x,y)}}, \qquad (50) $$

respectively. Here ε(·|·) stands for any of the combinations ε(x|x), ε(y|y), ε(x|x,y) and ε(y|x,y). By inserting these estimates, the second and third sum in expression (48) each reduce to N_t, while the first sum, after inserting the inverse of Eq. (19) and some further transformations, is found to reduce to 2N_t, such that expression (48), i.e. the second contribution, vanishes altogether.

Therefore the mutual information is given by taking expression (47) times (−1/2), as claimed in Eq. (20).
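The resulting parametric estimator, expression (47) times (−1/2), is easily evaluated once the four innovation series have been produced by the univariate and bivariate whitening models. The following Python sketch (function name and argument layout are our own; it assumes the innovations are given as equal-length arrays) combines the ML estimates of Eq. (50) with this expression:

```python
import numpy as np

def mutual_information_whitened(eps_x_x, eps_y_y, eps_x_xy, eps_y_xy):
    """Parametric MI estimate: expression (47) times (-1/2).

    eps_x_x  : innovations of x from the univariate (uncoupled) model
    eps_y_y  : innovations of y from the univariate model
    eps_x_xy : innovations of x from the bivariate (coupled) model
    eps_y_xy : innovations of y from the bivariate model
    All inputs are 1-d arrays of the same length N_t.
    """
    Nt = len(eps_x_x)
    # ML variance estimates, Eq. (50): (1/N_t) * sum of squares
    var = lambda e: np.mean(e ** 2)
    # ML estimate of the linear correlation of the bivariate innovations
    r = np.mean(eps_x_xy * eps_y_xy) / np.sqrt(var(eps_x_xy) * var(eps_y_xy))
    return -0.5 * Nt * (np.log(1.0 - r ** 2)
                        + np.log(var(eps_x_xy)) - np.log(var(eps_x_x))
                        + np.log(var(eps_y_xy)) - np.log(var(eps_y_y)))
```

For innovations that remain correlated after whitening, the first term dominates and the estimate is strictly positive; for independent innovations it stays close to zero.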

APPENDIX B: LOG-LIKELIHOOD FOR BIVARIATE TIME SERIES

WITH INSTANTANEOUS COUPLING

For simplicity, in this appendix we shall not use different notation for a statistical quantity and its estimate. We start by rewriting Eq. (34) as

$$ \begin{pmatrix} x_t^{(v)} \\ x_t^{(w)} \end{pmatrix} = C^{-1} \begin{pmatrix} \mu^{(v)} + \sum_{t'} a_{t'}^{(v)}\, x_{t-t'}^{(v)} + \sum_u b^{(v,u)}\, x_{t-1}^{(u)} \\ \mu^{(w)} + \sum_{t'} a_{t'}^{(w)}\, x_{t-t'}^{(w)} + \sum_u b^{(w,u)}\, x_{t-1}^{(u)} \end{pmatrix} + C^{-1} \begin{pmatrix} \epsilon_t(v) \\ \epsilon_t(w|v) \end{pmatrix}, \qquad (51) $$

where we have defined $C = \begin{pmatrix} 1 & 0 \\ -c_{vw} & 1 \end{pmatrix}$. The innovation term in Eq. (51) is given by $C^{-1}\,(\epsilon_t(v), \epsilon_t(w|v))^\dagger$, and the corresponding covariance matrix follows as

$$ S_{\epsilon(v,w|v)} = C^{-1} \begin{pmatrix} \sigma^2_{\epsilon(v)} & 0 \\ 0 & \sigma^2_{\epsilon(w|v)} \end{pmatrix} \bigl(C^{-1}\bigr)^\dagger = \begin{pmatrix} \sigma^2_{\epsilon(v)} & c_{vw}\,\sigma^2_{\epsilon(v)} \\ c_{vw}\,\sigma^2_{\epsilon(v)} & c_{vw}^2\,\sigma^2_{\epsilon(v)} + \sigma^2_{\epsilon(w|v)} \end{pmatrix}. \qquad (52) $$
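The second equality in Eq. (52) can be checked numerically. The sketch below (with arbitrarily assumed values for c_vw and the two innovation variances) builds the covariance from C⁻¹ and the diagonal innovation covariance and compares it with the closed form:

```python
import numpy as np

# Assumed example values: instantaneous coupling and innovation variances.
c_vw = 0.7
s2_v, s2_wv = 1.5, 0.8  # sigma^2_{eps(v)} and sigma^2_{eps(w|v)}

C = np.array([[1.0, 0.0], [-c_vw, 1.0]])
Cinv = np.linalg.inv(C)                  # = [[1, 0], [c_vw, 1]]
D = np.diag([s2_v, s2_wv])               # diagonal innovation covariance
S = Cinv @ D @ Cinv.T                    # left-hand side of Eq. (52)

# closed-form right-hand side of Eq. (52)
S_closed = np.array([[s2_v, c_vw * s2_v],
                     [c_vw * s2_v, c_vw ** 2 * s2_v + s2_wv]])
assert np.allclose(S, S_closed)
```

Note that det C = 1, so det S = σ²_{ε(v)} σ²_{ε(w|v)} regardless of c_vw; this is used in the reduction of the log-likelihood below.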

Note that here we are using the fact that the covariance matrix of the innovation term in Eq. (34), $(\epsilon_t(v), \epsilon_t(w|v))^\dagger$, is known to be diagonal, since all instantaneous correlations are captured by the coupling term $c_{vw}\, x_t^{(v)}$. The log-likelihood of model (34) is given by

$$ L(v,w|v) = -\frac{1}{2} \sum_{t=p+1}^{N_t} \Bigl( \log\bigl|S_{\epsilon(v,w|v)}\bigr| + \bigl(\epsilon_t(v), \epsilon_t(w|v)\bigr)\, S^{-1}_{\epsilon(v,w|v)}\, \bigl(\epsilon_t(v), \epsilon_t(w|v)\bigr)^\dagger + 2\log(2\pi) \Bigr), \qquad (53) $$

where |·| denotes the matrix determinant. After inserting Eq. (52) and some further transformations this expression becomes

$$ L(v,w|v) = -\frac{1}{2}(N_t - p)\bigl(\log\sigma^2_{\epsilon(v)} + \log\sigma^2_{\epsilon(w|v)}\bigr) - (N_t - p)\bigl(1 + \log(2\pi)\bigr) + (N_t - p)\,\frac{2\, c_{vw}\,\sigma^2_{\epsilon(v),\epsilon(w|v)} - c_{vw}^2\,\sigma^2_{\epsilon(v)}}{2\,\sigma^2_{\epsilon(w|v)}}, \qquad (54) $$

where we have defined $\sigma^2_{\epsilon(v),\epsilon(w|v)} = \mathrm{E}\bigl(\epsilon_t(v)\,\epsilon_t(w|v)\bigr)$. Since for the first p data points no predictions can be performed, $(N_t - p)$ arises in this expression instead of $N_t$. Typically p will be small, therefore the missing likelihood contribution of the first p points can be neglected.
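The reduction from Eq. (53) to Eq. (54) can be verified numerically. In the hypothetical check below, simulated innovations stand in for the fitted model, and the sample second moments take the place of the variance and cross-moment parameters; both expressions then agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
N, c = 400, 0.6                        # N plays the role of N_t - p; c = c_vw (assumed)
e_v = rng.standard_normal(N)           # eps_t(v)
e_wv = rng.standard_normal(N)          # eps_t(w|v)

# sample moments stand in for sigma^2_{eps(v)}, sigma^2_{eps(w|v)},
# and the cross moment sigma^2_{eps(v),eps(w|v)}
m_v, m_wv = np.mean(e_v ** 2), np.mean(e_wv ** 2)
m_x = np.mean(e_v * e_wv)

# direct evaluation of Eq. (53), with S_{eps(v,w|v)} built as in Eq. (52)
S = np.array([[m_v, c * m_v], [c * m_v, c ** 2 * m_v + m_wv]])
eps = np.column_stack([e_v, e_wv])     # rows are (eps_t(v), eps_t(w|v))
q = np.einsum('ti,ij,tj->t', eps, np.linalg.inv(S), eps)
L53 = -0.5 * np.sum(np.log(np.linalg.det(S)) + q + 2 * np.log(2 * np.pi))

# closed form, Eq. (54)
L54 = (-0.5 * N * (np.log(m_v) + np.log(m_wv))
       - N * (1 + np.log(2 * np.pi))
       + N * (2 * c * m_x - c ** 2 * m_v) / (2 * m_wv))
assert np.isclose(L53, L54)
```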

Note that the log-likelihood of the uncoupled model, Eq. (33), is given by

$$ L(v,w) = -\frac{1}{2}(N_t - p)\bigl(\log\sigma^2_{\epsilon(v)} + \log\sigma^2_{\epsilon(w)}\bigr) - (N_t - p)\bigl(1 + \log(2\pi)\bigr). \qquad (55) $$

From Eqs. (9) and (14)–(16) it follows that the mutual information of the time series at grid points v and w is given by (identifying $x_t^{(v)}$ with $y_t$ and $x_t^{(w)}$ with $x_t$)

$$ I\bigl(x^{(v)}, x^{(w)}\bigr) = L(v,w|v) - L(v,w). \qquad (56) $$

Thereby we obtain the estimator for mutual information

$$ I\bigl(x^{(v)}, x^{(w)}\bigr) = (N_t - p)\,\frac{2\, c_{vw}\,\sigma^2_{\epsilon(v),\epsilon(w|v)} - c_{vw}^2\,\sigma^2_{\epsilon(v)}}{2\,\sigma^2_{\epsilon(w|v)}}, \qquad (57) $$

as claimed in Eq. (35).
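Given fitted values for c_vw and the innovation moments, Eq. (57) is a one-line computation; the following sketch (the function name and argument names are our own) wraps it for later use:

```python
def mi_estimator(c_vw, var_v, var_wv, cov_v_wv, N_t, p):
    """Parametric mutual-information estimate of Eq. (57).

    var_v    : sigma^2_{eps(v)}, innovation variance at grid point v
    var_wv   : sigma^2_{eps(w|v)}, innovation variance at w given v
    cov_v_wv : sigma^2_{eps(v),eps(w|v)}, cross moment of the innovations
    N_t, p   : time series length and model order
    """
    return (N_t - p) * (2 * c_vw * cov_v_wv - c_vw ** 2 * var_v) / (2 * var_wv)
```

For c_vw = 0 the estimate vanishes, as it must for an uncoupled pair.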

Although the majority of values provided by this estimator are positive, it has to be mentioned that negative estimates also occur frequently, although mutual information cannot be negative. From the viewpoint of likelihood ratio testing, this behaviour corresponds to the case of the more general model performing worse than the special model. While closer examination may be required in order to fully understand this surprising behaviour, we interpret this effect as a mainly numerical problem resulting from replacing the correct symmetric bivariate model, Eq. (23), by the approximate asymmetric model, Eq. (34). For the numerical results presented in this paper we have taken the following approach: since for each pair of grid points (v, w) two asymmetric models can be formulated, only the model corresponding to the higher likelihood is accepted; and in the case that both models yield negative estimates of mutual information, a value of zero is assumed.
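The selection rule just described can be written down explicitly. The sketch below is one reading of that rule (the function name and argument layout are our own, and the case of a negative estimate from the higher-likelihood model alongside a positive one from the other is resolved here by following the likelihood criterion literally):

```python
def resolve_pair(I_v_to_w, L_v_to_w, I_w_to_v, L_w_to_v):
    """Combine the two asymmetric estimates for a grid-point pair (v, w):
    if both MI estimates are negative, report zero; otherwise accept the
    estimate of the asymmetric model with the higher log-likelihood."""
    if I_v_to_w < 0 and I_w_to_v < 0:
        return 0.0
    return I_v_to_w if L_v_to_w >= L_w_to_v else I_w_to_v
```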

Finally we briefly consider once again $S_{\epsilon(v,w|v)}$, as given by Eq. (52), and compare it with $S_{\epsilon(v,w|v,w)}$, as given by Eq. (19). Each of these two expressions contains three independent parameters, such that they can be interpreted as providing two different parametrisations for the covariance matrix of a bivariate time series; but it is obvious that Eq. (19) represents the general case, while Eq. (52) represents an approximation. By comparing both expressions it can be seen that the coupling parameter $c_{vw}$ corresponds roughly (but not precisely) to the coefficient of linear correlation $r(\epsilon(v), \epsilon(w))$. Further study may be required for understanding the differences and similarities between these two parametrisations in full detail.
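This rough correspondence can be made explicit: from the entries of Eq. (52), the correlation implied by the approximate parametrisation is $r = S_{12}/\sqrt{S_{11} S_{22}} = c_{vw}\,\sigma_{\epsilon(v)} / \sqrt{c_{vw}^2\,\sigma^2_{\epsilon(v)} + \sigma^2_{\epsilon(w|v)}}$, which is close to, but not equal to, $c_{vw}$ for weak coupling and comparable variances. A short sketch with assumed example values:

```python
import numpy as np

def implied_correlation(c_vw, s2_v, s2_wv):
    """Linear correlation of (eps(v), eps(w)) implied by Eq. (52):
    r = S_12 / sqrt(S_11 * S_22), with the matrix entries of Eq. (52)."""
    return c_vw * s2_v / np.sqrt(s2_v * (c_vw ** 2 * s2_v + s2_wv))

# for unit variances and weak coupling, r stays close to (but below) c_vw
print(implied_correlation(0.2, 1.0, 1.0))  # ≈ 0.196
```

As the coupling grows, the implied correlation falls increasingly short of c_vw, which illustrates why the correspondence is only approximate.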

ACKNOWLEDGMENTS

A. Galka gratefully acknowledges support by the Deutsche Forschungsgemeinschaft (DFG) through project GA 673/1-1 and by the Japan Society for the Promotion of Science (JSPS) through fellowship ID No. P 03059. T. Ozaki gratefully acknowledges support by the JSPS through grants KIBAN 13654075 and KIBAN 15500193. The comments of the anonymous referees have contributed considerably to improving this paper.

REFERENCES

1. P. E. Cladis and P. Palffy-Muhoray (eds.), Spatio-Temporal Patterns in Nonequilibrium Complex Systems. SFI Studies in the Sciences of Complexity (Addison Wesley Longman, Reading, MA, 1995).
2. L. B. Almeida, F. Lopes da Silva and J. C. Principe (eds.), Spatiotemporal Models in Biological and Artificial Systems, volume 37 of Frontiers in Artificial Intelligence and Applications (IOS Press, Amsterdam, 1997).
3. D. Walgraef, Spatio-temporal Pattern Formation (Springer, Berlin, Heidelberg, New York, 1997).
4. E. Gehrig and O. Hess, Spatio-Temporal Dynamics and Quantum Fluctuations in Semiconductor Lasers. Springer Tracts in Modern Physics, Vol. 189 (Springer, Berlin, Heidelberg, New York, 2003).
5. G. Christakos, Modern Spatiotemporal Geostatistics. Studies in Mathematical Geology, Vol. 6 (Oxford University Press, Oxford, 2000).
6. J. Bascompte and R. V. Solé (eds.), Modelling Spatiotemporal Dynamics in Ecology (Springer, Berlin, Heidelberg, New York, 1998).
7. E. Yago, C. Escera, K. Alho, M. H. Giard and J. M. Serra-Grabulosa, Spatiotemporal dynamics of the auditory novelty-P3 event-related brain potential. Cogn. Brain Res. 16:383–390, 2003.
8. C. Lukas, J. Falck, J. Bartkova, J. Bartek and J. Lukas, Distinct spatiotemporal dynamics of mammalian checkpoint regulators induced by DNA damage. Nature Cell Biol. 5:255–260, 2003.
9. M. Mori, M. Kaino, S. Kanemoto, M. Enomoto, S. Ebata and S. Tsunoyama, Development of advanced core noise monitoring system for BWRs. Prog. Nucl. Energy 43:201–207, 2003.
10. W. Li, Mutual information functions versus correlation functions. J. Stat. Phys. 60:823–837, 1990.
11. P. Hall and S. C. Morton, On the estimation of entropy. Ann. Inst. Statist. Math. 45:69–88, 1993.
12. D. R. Brillinger, Second-order moments and mutual information in the analysis of time series. In Y. P. Chaubey (ed.), Recent Advances in Statistical Methods, pp. 64–76 (Imperial College Press, London, 2002).
13. D. R. Brillinger, Some data analyses using mutual information. Brazilian J. Prob. Statist. 18:163–183, 2004.
14. K. Kaneko, Spatiotemporal chaos in one- and two-dimensional coupled map lattices. Physica D 37:60–82, 1989.
15. C. E. Shannon, A mathematical theory of communication. Bell Syst. Tech. J. 27:379–423, 623–656, 1948.
16. L. Boltzmann, Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht. Wiener Berichte 76:373–435, 1877.
17. H. Akaike, On the likelihood of a time series model. The Statistician 27:217–235, 1978.
18. T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley, New York, 1991).
19. Y. Moon, B. Rajagopalan and U. Lall, Estimation of mutual information using kernel density estimators. Phys. Rev. E 52:2318–2321, 1995.
20. D. Prichard and J. Theiler, Generalized redundancies for time series analysis. Physica D 84:476–493, 1995.

21. A. Kraskov, H. Stögbauer and P. Grassberger, Estimating mutual information. Phys. Rev. E 69:066138, 2004.
22. R. Moddemeijer, A statistic to estimate the variance of the histogram based mutual information estimator based on dependent pairs of observations. Signal Proc. 75:51–63, 1999.
23. H. Stögbauer, A. Kraskov, S. A. Astakhov and P. Grassberger, Least-dependent-component analysis based on mutual information. Phys. Rev. E 70:066123, 2004.

24. K. Ito, On stochastic differential equations. Mem. Am. Math. Soc. 4:1–51, 1951.
25. J. L. Doob, Stochastic Processes (Wiley, New York, 1953).
26. A. H. Jazwinski, Stochastic Processes and Filtering Theory (Academic Press, San Diego, 1970).
27. T. Kailath, A view of three decades of linear filtering theory. IEEE Trans. Inf. Theory 20:146–181, 1974.
28. P. Lévy, Sur une classe de courbes de l'espace de Hilbert et sur une équation intégrale non linéaire. Ann. Sci. École Norm. Sup. 73:121–156, 1956.
29. P. Protter, Stochastic Integration and Differential Equations (Springer-Verlag, Berlin, 1990).
30. P. A. Frost and T. Kailath, An innovations approach to least squares estimation—Part III: Nonlinear estimation in white Gaussian noise. IEEE Trans. Autom. Contr. 16:217–226, 1971.
31. B. V. Gnedenko, The Theory of Probability (Mir Publishers, Moscow, 1969).
32. G. E. P. Box and G. M. Jenkins, Time Series Analysis, Forecasting and Control, 2nd edition (Holden-Day, San Francisco, 1976).
33. R. Moddemeijer, On estimation of entropy and mutual information of continuous distributions. Signal Proc. 16:233–248, 1989.
34. A. Galka and T. Ozaki, Testing for nonlinearity in high-dimensional time series from continuous dynamics. Physica D 158:32–44, 2001.
35. H. Akaike, Prediction and entropy. In A. C. Atkinson and S. E. Fienberg (eds.), A Celebration of Statistics, pp. 1–24 (Springer, Berlin, Heidelberg, New York, 1985).
36. J. Kuha, AIC and BIC: Comparisons of assumptions and performance. Sociol. Methods Res. 33:188–229, 2004.
37. M. Stone, An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Statist. Soc. B 39:44–47, 1977.
38. N. Levinson, The Wiener rms error criterion in filter design and prediction. J. Math. Phys. 25:261–278, 1947.
39. P. Whittle, On the fitting of multivariate autoregressions, and the approximate canonical factorization of a spectral density matrix. Biometrika 50:129–134, 1963.
40. A. Neumaier and T. Schneider, Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans. Math. Softw. 27:27–57, 2001.
41. C. Penland, Random forcing and forecasting using principal oscillation pattern analysis. Monthly Weather Rev. 117:2165–2185, 1989.
42. A. Galka, O. Yamashita, T. Ozaki, R. Biscay and P. A. Valdés-Sosa, A solution to the dynamical inverse problem of EEG generation using spatiotemporal Kalman filtering. NeuroImage 23:435–453, 2004.
43. A. Galka, O. Yamashita and T. Ozaki, GARCH modelling of covariance in dynamical estimation of inverse solutions. Phys. Lett. A 333:261–268, 2004.
44. J. Geweke, Inference and causality in economic time series models. In Z. Grilliches and M. Intriligator (eds.), Handbook of Econometrics, pp. 1101–1144 (North-Holland, Amsterdam, 1984).
45. A. Buse, The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. Am. Stat. 36:153–157, 1982.
46. A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis (Wiley, New York, 2001).
47. L. Molgedey and H. G. Schuster, Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72:3634–3637, 1994.

48. T. Schreiber, Spatio-temporal structure in coupled map lattices: two-point correlations versus mutual information. J. Phys. A: Math. Gen. 23:L393–L398, 1990.
49. H. Wu and J. Principe, Generalized anti-Hebbian learning for source separation. In Proceedings of ICASSP'99, pp. 1073–1076 (IEEE Press, Piscataway, NJ, 1999).
50. S. Choi, A. Cichocki and S. Amari, Flexible independent component analysis. J. VLSI Signal Process. 26:25–38, 2000.
51. K. G. Jöreskog, Some contributions to maximum likelihood factor analysis. Psychometrika 32:443–482, 1967.
52. B. Flury, A First Course in Multivariate Statistics (Springer, New York, Berlin, Heidelberg, 1997).