
Karjalainen, Pasi A. Regularization and Bayesian methods for evoked potential estimation.
Kuopio University Publications C. Natural and Environmental Sciences 61. 1997. 139 p.
ISBN 951-781-559-X
ISSN 1235-0486

ABSTRACT

Evoked potentials (EP) are often defined as potentials caused by the electrical activity of the central nervous system after stimulation of the sensory system. In the analysis of evoked potentials the fundamental problem is to extract information about the potential from measurements that also contain the ongoing background electroencephalogram (EEG).

The most widely used tool for the analysis of evoked potentials has been averaging of the measurements over an ensemble of trials. This is the optimal way to improve the signal-to-noise ratio when the underlying model for the observations is that the evoked potential is a deterministic signal in independent additive background noise of zero mean. However, for over three decades it has been evident that the nature of evoked potentials is more or less stochastic. In particular, the latencies and the amplitudes of the peaks in the potentials can vary stochastically between repetitions of the stimuli.

Currently the goal in the analysis of evoked potentials is to obtain the best possible estimates for single potentials. The most common approach to this estimation is to form an estimator (filter) with which the unwanted contribution of the EEG can be filtered out from the evoked potential. A major difficulty in this task is often the very low signal-to-noise ratio.

There are two major aims in this thesis. The first is to review the existing methods for evoked potential estimation from the viewpoint of estimation theory in a unified and consistent formalism. Special attention is paid to the implicit assumptions about the evoked potentials that are made in the existing methods. The second aim is to form a unified estimation scheme for single evoked potentials. The proposed estimation scheme is based on Bayesian estimation and regularization theory.

Two practical estimation procedures are proposed. The first is applicable when the evoked potentials are assumed to be random samples from a time-independent distribution. The other is applicable if the parameters of the evoked potentials have trend-like variations between the repetitions of the stimuli. The proposed methods are evaluated using both simulated data and real measurements. Two systematic methods for the simulation of evoked potentials are also proposed for this purpose.

AMS Subject Classification: 93E10, 62H25, 62L12
National Library of Medicine Classification: WL 102, WL 150, QT 36
INSPEC Thesaurus: bioelectric potentials; estimation theory; Bayes methods; inverse problems; Kalman filters


To Tuula <strong>and</strong> Väinö


Acknowledgements

This work was carried out in the Department of Applied Physics at the University of Kuopio during 1995-1996.

I warmly thank my supervisor Associate Professor Jari Kaipio, Ph.D., for his guidance and advice during all phases of this work. Most of the results in this thesis took their final form during long discussions with him.

I am grateful to my second supervisor Professor Lauri Patomäki, Ph.D., for his support of this work and for the opportunity to work in the Department of Applied Physics in various positions during the last ten years.

I thank the official reviewers Professor Jouko Lampinen, Ph.D., and Associate Professor Erkki Somersalo, Ph.D., for their many suggestions and advice on how to improve this thesis.

I thank Marko Vauhkonen, M.Sc., whose work in the field of impedance tomography has been a link between me and regularization theory.

I also thank Anu Koistinen, M.Sc., who has done great work in keeping the bibliographic databases up to date. She has also provided the measurements, for which I also thank the Department of Clinical Neurophysiology, University of Kuopio.

I thank all my co-workers and students of the biomedical inverse problems group and the staff of the Department of Applied Physics.

Finally I thank my loving Tuula for her understanding while I prepared this thesis.

Kuopio, May 1997

Pasi Karjalainen


Abbreviations

EP      Evoked potential
ERP     Event related potential
EEG     Electroencephalogram
SNR     Signal-to-noise ratio
BAEP    Brainstem auditory evoked potential
ML      Maximum likelihood
MS      Mean square
LMMS    Linear minimum mean square
GMS     Generalized mean square
UC      Uniform cost
MAP     Maximum a posteriori
GM      Gauss–Markov
LS      Least squares
GLS     Generalized least squares
LMV     Linear minimum variance
GCV     Generalized cross-validation
RLS     Recursive least squares
LMS     Least mean square
NLMS    Normalized least mean square
PC      Principal component
ARMA    Autoregressive moving average
AR      Autoregressive
MA      Moving average
FIR     Finite impulse response
ARX     Autoregressive with exogenous input
LCA     Latency corrected averaging
CCA     Cross-correlation averaging
APWF    A posteriori "Wiener" filtering
PCA     Principal components analysis
MWA     Moving window averaging
EWA     Exponential window averaging

Notations

x ∼ N(η_x, C_x)       Jointly Gaussian random vector x = (x_1,...,x_n)^T
(·)^T                 Transpose
p(x)                  Joint density function of random vector x
p(x|y)                Conditional density function of x given y
η_x, E{x}             Expected value of x
E_x{f(x,y)}           Expected value of f(x,y) with respect to x
C_x                   Covariance of x
R_x                   Correlation of x
η_{x|y}, E{x|y}       Expected value of x given y, conditional mean
C_{x|y}               Conditional covariance of x given y
z                     Vector of observations z = (z_1,...,z_n)^T
h                     Model for observation
H                     Linear model for observation
θ                     Parameter vector
ˆθ(z)                 Estimator of parameter vector θ based on observations z
ˆθ                    Estimate of parameter vector θ
˜θ                    Estimation error
C(θ, ˆθ)              Cost function
B(ˆθ)                 Bayes cost
B(ˆθ|z)               Conditional Bayes cost
ˆθ_B                  Bayesian estimate
trace(A)              Trace of matrix A
W                     Weighting matrix
l                     Least squares index
J                     Jacobian
R(H)                  Range of H
N(H)                  Nullspace of H
ψ                     Basis vector
diag(λ_1,...,λ_p)     Diagonal matrix
L                     Regularization matrix
α                     Regularization parameter
S                     Subspace
S⊥                    Orthogonal complement of S
dim(S)                Dimension of S
K_S                   Matrix of basis vectors of subspace S
rank(A)               Rank of matrix A
q^{-1}                Time delay operator
A(q)                  Polynomial of operator q^{-1}
S(ω)                  Spectrum
Γ                     Matrix of eigenvectors
Λ                     Diagonal matrix of eigenvalues
U, V                  Matrices of left and right singular vectors, respectively
S                     Diagonal matrix of singular values
K_t                   Kalman gain
E{x}                  The empirical expectation, sample mean


CONTENTS

1 Introduction  15

2 Estimation theory  19
2.1 Introduction  19
2.2 Probability theory  20
2.3 Bayesian estimation  23
2.4 Maximum likelihood estimation  23
2.5 Bayes cost method  24
2.6 Mean square estimation  25
2.7 Maximum a posteriori estimation  27
2.8 Linear minimum mean square estimator  28
2.9 Minimum mean square estimator for Gaussian variables  29
2.10 Mean square estimation with observation model  31
2.11 Gauss–Markov estimate  32
2.12 Least squares estimation  33
2.13 Comparison of ML, MAP and MS estimates  36
2.14 Selection of the basis vectors  39
2.15 Modeling of prior information in Bayesian estimation  40
2.16 Recursive mean square estimation  41
2.17 Time-varying linear regression  45
2.18 Properties of the MS estimator  47

3 Regularization theory  49
3.1 Introduction  49
3.2 Tikhonov regularization  50
3.3 Principal component based regularization  52
3.4 Subspace regularization  54
3.5 Selection of the regularization parameters  55

4 Time series models  57
4.1 Stochastic processes  57
4.2 Stationary time series models  59
4.3 Time-dependent time series models  60
4.4 Adaptive filtering  62
4.5 Wiener filtering as estimation problem  63

5 Simulation of evoked potentials  66
5.1 Simulation of the background EEG  66
5.2 Component based simulation of evoked potentials  67
5.3 Principal component based simulation of evoked potentials  72
5.4 Discussion  75

6 Estimation of evoked potentials  76
6.1 Introduction  76
6.2 Ensemble analysis  77
6.2.1 Averaging  78
6.2.2 Weighted and selective averaging  78
6.2.3 Latency-dependent averaging  81
6.2.4 Filtering of the average response  82
6.2.5 Deconvolution methods  83
6.2.6 Principal components analysis  83
6.2.7 Classification  84
6.2.8 Other ensemble analysis methods  84
6.3 Single trial estimation  85
6.3.1 Filtering of single responses  85
6.3.2 Time-varying Wiener filtering  85
6.3.3 Linear least squares estimation  86
6.3.4 Adaptive filtering  86
6.3.5 Principal component regression approach  88
6.3.6 Subspace regularization of evoked potentials  89
6.3.7 Smoothness priors estimation of evoked potentials  89
6.3.8 Other single trial estimation methods  90
6.4 Miscellaneous topics  91
6.4.1 Frequency domain methods  91
6.4.2 Parametric modeling  91
6.4.3 Estimation of the peak latencies  93
6.5 Dynamical estimation of evoked potentials  93
6.5.1 Windowed averaging  94
6.5.2 Frequency domain filtering  94
6.5.3 Adaptive algorithms  94
6.5.4 Recursive mean square estimation of evoked potentials  95
6.5.5 Other dynamical estimation methods  96
6.6 Discussion  96

7 Two methods for evoked potential estimation  98
7.1 A systematic method for single trial estimation  98
7.2 A systematic method for dynamical estimation  100
7.3 The data sets  101
7.4 Case study 1  103
7.5 Case study 2  109
7.6 Case study 3  114
7.7 Case study 4  120
7.8 Discussion  122

8 Discussion and conclusions  124

References  126


CHAPTER I

Introduction

Evoked potentials (EP) were originally defined to be potentials that are caused by the electrical activity of the central nervous system after stimulation of the sensory system. Currently, all electrical potentials caused by physical stimulation are more often called evoked potentials. Evoked potentials can be classified into exogenous and endogenous evoked potentials [11]. Exogenous evoked potentials are determined by the physical characteristics of the stimulus only. Endogenous evoked potentials are determined by the psychological significance of the stimulus. Cognitive evoked potentials are an example of endogenous evoked potentials. Potentials are sometimes also emitted without any clear physical stimulus. Therefore it is currently common to use the notion of event related potentials (ERP). Evoked potentials are then event related potentials that are immediate or intermediate responses to physical stimuli.

In this thesis we are concerned with analysis methods that necessitate a physical trigger signal, that is, a physical stimulus. Therefore we use only the notion of evoked potential here. In most cases we are interested in evoked potentials that originate in the human brain. The potentials are usually measured from the outer layer, the scalp, of the human head. The measured potential is then a superposition of all electrical activity that originates in the head. This activity includes the electrical activity of muscles and eye movements as well as the spontaneous electrical activity of the brain. We call this activity the background electroencephalogram (EEG). The background EEG is usually assumed to be uncorrelated with the stimulus. We do not review the physiological origin of the evoked potentials in this thesis; for that see e.g. [162, 170, 150, 92]. The measurement techniques are also not reviewed here; for these see e.g. [179].

In the analysis of evoked potentials the fundamental problem is to extract information about the potential from measurements that also contain the ongoing background EEG. The most widely used tool for the analysis of evoked potentials has been averaging of the measurements over an ensemble of trials. In the mean square sense this is the optimal way to improve the signal-to-noise ratio (SNR) when the underlying model for the observations is that the evoked potential is a deterministic signal in independent additive background noise of zero mean.
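The SNR gain from plain ensemble averaging is easy to verify numerically. The following minimal sketch is not part of the thesis: the Gaussian peak waveform, the sampling, the noise level and the trial count are all illustrative assumptions. It averages simulated sweeps in which a fixed response is buried in independent zero-mean noise; the error of the average shrinks roughly as 1/sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 0.5, 256)                       # 0.5 s epoch, arbitrary sampling
ep = 5.0 * np.exp(-0.5 * ((t - 0.3) / 0.03) ** 2)    # assumed deterministic peak (arbitrary units)

n_trials = 200
noise_sd = 20.0                                      # background EEG much stronger than the EP
trials = ep + noise_sd * rng.standard_normal((n_trials, t.size))

average = trials.mean(axis=0)                        # ensemble average over trials

def snr_db(signal, estimate):
    err = estimate - signal
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(err ** 2))

print("single trial SNR: %.1f dB" % snr_db(ep, trials[0]))
print("average SNR:      %.1f dB" % snr_db(ep, average))   # roughly 10*log10(n_trials) higher
```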

However, for over three decades it has been evident that the nature of evoked potentials is more or less stochastic. In particular, the latencies and the amplitudes of the peaks in the potentials can vary stochastically between the repetitions of the stimuli [22]. In addition, the variations can be trend-like, and the means of the latencies or the amplitudes can change during the test. The information about these kinds of variations in evoked potentials vanishes when the signal is averaged. The resulting estimates for the potentials possibly do not correspond to any physical or neuroanatomical situation, and thus inference about the neurophysical structure is difficult. A concrete example is the use of average evoked potentials in source estimation, that is, the estimation of the location and other parameters of the electrical sources in the brain.

Statistically, the average evoked potential, the sample mean, is an example of the use of first order statistics, in which only the first moment of the population parameters is estimated. The next evident improvement is to use second order statistics, that is, covariance analysis. Typical examples of second order methods are principal component analysis [90] and factor analysis [173] of the potentials. With these methods a measure of the variation of the potentials can also be obtained. However, with these methods too the information carried by a single measurement is lost, as is all the information about the time variation between the repetitions of the test.

Currently the goal in the analysis of evoked potentials is to get information about single potentials, that is, to obtain the best possible estimate for a single potential. The notion of single trial analysis can be used in this context. The most common way to do single trial analysis is to form an estimator (filter) with which the unwanted contribution of the ongoing background activity can be filtered out from the evoked potential. A major difficulty in this task is often the very low signal-to-noise ratio, typically SNR < −10 dB.

In any such filtering or estimation method the performance of the estimator depends on the properties of the underlying signals. In the most realistic methods some models for the measurements and the signals are assumed, and the estimator that minimizes the mean square criterion is then derived. The performance of the estimators then depends strongly on how realistic the assumptions are. The most common assumptions concern the second order statistics of the evoked potentials and the ongoing background EEG. In addition, we sometimes have prior information about the evoked potentials. The information can relate, e.g., to the shape or the location of the peaks in the potential. Generally, we should take this prior information into account to obtain the best possible estimates.

In the literature we can find two approaches for taking the prior information or beliefs about the parameters (signal) to be estimated into account. The first is the Bayesian approach, with which the prior information can be taken into account in the statistical sense. This procedure usually necessitates some prior statistical information about the parameters in the form of prior distributions. The other approach, the use of regularization methods, arises from the field of ill-posed problems. In this class of methods the prior information about the underlying parameters can usually be implemented in the estimation procedure in a straightforward way. Typical interpretations of such knowledge are that the potentials are small, almost equal or slowly varying. The regularization approach has a direct connection to the Bayesian approach: the regularization solutions can usually be seen as Bayesian point estimates with some prior densities.

There are two key problems in the estimation of evoked potentials. The first is the implementation of the prior information in the estimation procedure. Due to the low signal-to-noise ratio, all the prior information that is available should be used in the estimation. A major problem here is the formulation of the information in mathematical form. The other problem is the evaluation of the proposed estimation methods. The most common way to form a "new" estimation method for evoked potentials seems to have been the blind use of some estimation method without proper investigation of the implicit assumptions. Comparison of different methods necessitates the use of a unified formalism for all the methods.

The aims and contents of the thesis

In this thesis "we" is used to denote the author of the thesis. There are two major aims in this thesis. The first is to review the existing methods for evoked potential estimation from the viewpoint of estimation theory in a unified and consistent formalism. Special attention is paid to finding the implicit assumptions about the evoked potentials that are made in the proposed methods. The second aim is to form a unified estimation scheme for single evoked potentials. The proposed estimation scheme is based on Bayesian estimation and regularization theory.

Some of the methods proposed here are discussed in [100, 101, 211, 212, 213, 214, 96]. The aim of this thesis is not only to summarize the results but also to give a systematic and detailed presentation of the methods and their theoretical background. We also present practical procedures for the application of the methods to real measurements. Although the estimation schemes are proposed in algorithmic form, the most important thing is their general structure. They can easily be modified for different models of evoked potentials.

In Chapter 2 the estimation theoretical background that is relevant for the rest of the thesis is presented. The main purpose of the chapter is to introduce a unified formalism with which the rest of the thesis – regularization theory, time series analysis, Bayesian statistics and the existing analysis methods for evoked potential studies – can be combined. Most of the topics are discussed in detail to make it possible to compare the different formulations discussed in the thesis in a consistent way.

In Chapter 3 regularization theory is discussed. The main purpose is to review the regularization methods from the viewpoint of Bayesian estimation and to introduce the forms of subspace regularization and principal component regression. These are used in Chapters 4-7.

In Chapter 4 some topics of time series analysis are discussed. Time series modeling and Wiener filtering are studied. It is also shown that the adaptive algorithms have a special connection to regularization theory.

Chapters 5, 6 and 7 contain the main novel parts of this thesis. In Chapter 5 two simulation methods for evoked potentials are introduced. Both methods are presented as systematic methods for different simulation needs, and the applicability of the methods is then discussed. In Chapter 6 the existing methods for evoked potential analysis are reviewed. The methods are analyzed in detail in light of estimation and regularization theory. In Chapter 6 we also introduce some novel estimation methods that form the basis for the two practical estimation procedures proposed in Chapter 7. The estimation methods are proposed in a general form that can easily be modified to correspond to more complicated assumptions about the physical reality. In Chapter 7 the proposed methods are also evaluated using both simulated data and real measurements. The applicability of the so-called generalized cross-validation criterion for the determination of the regularization parameters is also evaluated and discussed.

Chapter 8 contains the overall discussion and conclusions of this thesis.


CHAPTER II

Estimation theory

In this chapter we discuss estimation theory. The main goal is to present the estimation theoretical background that is necessary for understanding the properties of the regularization methods that are discussed in Chapter 3, as well as the time series modeling methods that are discussed in Chapter 4. A unified formalism that is introduced in this chapter is used throughout the thesis. With this formalism the results of estimation theory can be combined with regularization theory, time series analysis methods and Bayesian statistics. All these approaches are presented in multivariate vector formalism. In particular, the review of the existing methods for evoked potential analysis in Chapter 6 is based strictly on this presentation. This makes it possible to analyze the implicit assumptions made in the methods.

Some of the derivations are given in more detail than others. Many of the intermediate equations are needed in further analysis. A detailed presentation is also reasonable since this estimation theoretical background cannot be considered commonly known in the evoked potential community. The main sources for the theory discussed in this chapter are the references [192, 193, 152, 133, 115, 142, 66].

2.1 Introduction

In estimation theory the problem is to calculate (estimate) the parameters describing the underlying reality from measurements that may contain errors. We denote the measurements with a vector z. With the vector θ we denote the parameters that are to be estimated. We use ˆθ for an estimate of θ and ˆθ(z) for the estimator, the function

ˆθ = ˆθ(z)   (2.1)

that connects the observations to the estimate. For the estimation error, the difference between the actual and the estimated value of the parameter, we use ˜θ = θ − ˆθ. An estimator for which the expectation of the estimation error is zero, E{˜θ} = 0, is said to be unbiased.

When θ is random we speak about Bayesian estimation, and the goal is to find characteristics of the posterior density p(θ|z) of θ. Such estimates are e.g. the mean square (MS) and maximum a posteriori (MAP) estimates. These can be derived as minimizers of the expectation of some specific functions of the estimation error.

When θ is treated as unknown but non-random, we do not use any information about the prior density of θ. Usually the estimation rule is then to minimize some estimation criterion. This criterion can still be probabilistic, as in the Gauss–Markov estimate, since the estimator ˆθ(z) can still be random through the observation model. In some cases, as in least squares estimation, the estimation criterion is fully deterministic. We refer to non-random parameters with the notion of unknown parameters [192].

In some cases the observations z are related to the parameters θ by an observation model. For example

z = h(θ, v)   (2.2)

where v are random measurement errors, is a typical observation model. The notion of nuisance parameters is sometimes used for v [66]. In the most common cases v is additive, and the most general observation model used in this thesis is of the form

z = h(θ) + v   (2.3)

This is called the additive noise model. In many cases we restrict our attention to linear observation models that can be written in the form

z = Hθ + v   (2.4)

where H is a matrix that does not contain parameters to be estimated. In all observation models θ can be random or fixed but unknown. In this thesis we set θ ∈ R^p and z, v ∈ R^M.
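As a concrete illustration of the linear observation model (2.4), the sketch below builds an observation matrix H whose columns are sampled basis vectors and draws one noisy observation z = Hθ + v. The Gaussian-shaped basis vectors, the dimensions and the noise level are illustrative assumptions only and are not the model used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

M, p = 256, 8                                   # M samples per sweep, p basis vectors
t = np.linspace(0.0, 0.5, M)
centers = np.linspace(0.05, 0.45, p)

# Columns of H are sampled Gaussian bumps (an assumed basis, for illustration only).
H = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.03) ** 2)

theta = rng.normal(size=p)                      # parameter vector (random in the Bayesian setting)
v = 0.5 * rng.standard_normal(M)                # additive observation noise
z = H @ theta + v                               # linear observation model z = H theta + v
print(H.shape, z.shape)                         # (256, 8) (256,)
```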

The observations z are sometimes called the dependent variables, especially in the statistical literature. If the model H contains observations x, these are called the independent variables, regressor variables, predictor variables [91] or the explanatory variables [23]. For the matrix H we use the notion of observation matrix [133]. The notions of model matrix, carrier matrix [23] and design matrix [66] are also used.

When an observation model is used, the estimation can also be based on the minimization of some norm of the residual r = z − h(θ) with respect to θ. With a special selection of this norm and special assumptions about the joint density of θ and z this approach coincides with the Bayesian approach.

2.2 Probability theory

The selection of the details in this section is based on the needs of the following sections and chapters. We shall not review the fundamental definitions of probability theory, such as probability space, elementary events, random variables and probability measure, here. These are presented e.g. in [152]. For the definitions of the distribution and density functions of multidimensional random variables we refer to [171]. In this thesis random variables are not systematically denoted differently from deterministic variables. Usually all the random variables are real and vector valued. The joint density of the components of the random vector x = (x_1,...,x_n)^T is denoted by p(x). The superscript notation (·)^T denotes the transpose. The joint density of the components of two random vectors x and y is denoted by p(x,y). We consider only continuous distribution functions.

The probability densities of the single random variables x_i, the marginal densities p_i(x_i), are obtained by integration

p_i(x_i) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} p(x_1,...,x_n) dx_1 · · · dx_{i−1} dx_{i+1} · · · dx_n   (2.5)

The mean or expected value η_x ∈ R^n of x is

η_x = E{x} = ∫_{−∞}^{∞} x p(x) dx   (2.6)
           = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} (x_1,...,x_n)^T p(x_1,...,x_n) dx_1 · · · dx_n   (2.7)
           = (η_{x_1},...,η_{x_n})^T   (2.8)

With the notation E{x} we mean that the integral is over all random variables inside the braces. If x and y are random vectors,

E{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x,y) dx dy   (2.9)

With the subscript notation

E_x{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x) dx   (2.10)

we mean that the expectation is taken only over the random variables x. The non-normalized correlation matrix of a random vector x is defined as

R_x = E{xx^T} = [ E{x_1 x_1} · · · E{x_1 x_n} ; ... ; E{x_n x_1} · · · E{x_n x_n} ]   (2.11)

that is, the componentwise expectation of the outer product of x with itself. The cross-correlation of random vectors x and y is defined as

R_xy = E{xy^T}   (2.12)

Covariance is the correlation of the random vector (x − η_x),

C_x = E{(x − η_x)(x − η_x)^T} = E{xx^T} − η_x η_x^T   (2.13)

and the cross-covariance of x and y is defined as

C_xy = E{(x − η_x)(y − η_y)^T} = E{xy^T} − η_x η_y^T   (2.14)
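In practice the moments (2.6), (2.13) and (2.14) are replaced by sample (ensemble) estimates computed from repeated realizations. A minimal numpy sketch of such sample estimates is given below; the distributions and dimensions are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# N independent realizations of two correlated random vectors x (3-dim) and y (2-dim).
N = 10_000
x = rng.multivariate_normal([1.0, 0.0, -1.0], np.diag([1.0, 2.0, 0.5]), size=N)
y = x[:, :2] + 0.3 * rng.standard_normal((N, 2))

eta_x = x.mean(axis=0)                              # sample estimate of eta_x = E{x}
C_x = (x - eta_x).T @ (x - eta_x) / N               # sample covariance, cf. (2.13)
C_xy = (x - eta_x).T @ (y - y.mean(axis=0)) / N     # sample cross-covariance, cf. (2.14)

print(eta_x)
print(np.allclose(C_x, np.cov(x, rowvar=False, bias=True)))  # same estimate as numpy's
print(C_xy.shape)                                   # (3, 2)
```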


Clearly, we have C_xy = C_yx^T. The conditional density of x given y is defined to be

p(x|y) = p(x,y) / p(y)   (2.15)

whenever p(y) > 0, and 0 otherwise. Clearly, we can also write

p(y|x) = p(x,y) / p(x)   (2.16)

and we obtain

p(x|y) p(y) = p(y|x) p(x)   (2.17)

This is called Bayes' theorem. The conditional mean of x given y is

η_{x|y} = E{x|y} = ∫_{−∞}^{∞} x p(x|y) dx   (2.18)

which is a function of the random variable y. If x and y are random vectors,

E{f(x,y)|y} = ∫_{−∞}^{∞} f(x,y) p(x|y) dx   (2.19)

Using (2.10), (2.19) and (2.15) we can derive a useful result

E{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x,y) dx dy   (2.20)
          = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} f(x,y) p(x|y) dx ] p(y) dy   (2.21)
          = E_y{E{f(x,y)|y}}   (2.22)

The conditional covariance of x given y is [152]

C_{x|y} = E{(x − η_{x|y})(x − η_{x|y})^T | y} = E{xx^T | y} − η_{x|y} η_{x|y}^T   (2.23)

The random variables x_i are said to be statistically independent if

p(x) = ∏_i p_i(x_i)   (2.24)

and jointly Gaussian if their joint density can be written in the form

p(x) = (det C_x)^{−1/2} (2π)^{−n/2} exp( −(1/2)(x − η_x)^T C_x^{−1} (x − η_x) )   (2.25)

where det C_x denotes the determinant of C_x. When x is jointly Gaussian with mean η_x and covariance C_x, this is denoted by

x ∼ N(η_x, C_x)   (2.26)


2.3 Bayesian estimation

In Bayesian estimation we assume that the parameters θ are random, having some joint density p(z,θ) with the measurements z. In Bayesian estimation the goal is to solve the posterior density p(θ|z) of the parameters given the observations. In Bayesian point estimation some specific statistic of the posterior density is solved. Typical point estimates include e.g. the mean and the mode of the density p(θ|z). These are derived in the following sections under some assumptions about θ.

From (2.17) we obtain for the posterior density

p(θ|z) = p(z|θ) p(θ) / p(z)   (2.27)
       ∝ p(z|θ) p(θ)   (2.28)

where p(z) = ∫ p(θ,z) dθ = ∫ p(θ) p(z|θ) dθ. The posterior density thus contains two parts. The first part is p(z|θ), which is called the likelihood of the data. This part of the posterior density depends on the data z and thus contains the dependence of the parameters to be estimated on the data. The function p(θ) does not depend on the observed data and we call it the prior density of θ. It contains the prior assumptions about the parameters. Since θ is assumed to be random, it has some joint density that can further depend on some parameters φ having some prior density p(φ). We call such a model hierarchical and the parameters φ hyperparameters.

Sometimes it is possible to use prior densities that contain no information about the parameters. We call such densities noninformative [66]. A typical example is the function p(θ) = c, where c is a constant. This is an example of an improper density [26], since it does not integrate to unity. An improper prior density can be used in Bayesian analysis if the posterior density is a proper density.
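When θ is low-dimensional, the factorization (2.27)-(2.28) can be evaluated directly on a grid: multiply the likelihood by the prior and normalize by p(z). The sketch below does this for a single scalar parameter; the Gaussian prior and likelihood are assumed purely for illustration and are not taken from the thesis.

```python
import numpy as np

# Grid over a scalar parameter theta.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

z = 1.3                                        # one observation z = theta + v, v ~ N(0, 1) (assumed)
prior = gauss(theta, 0.0, 4.0)                 # p(theta), assumed N(0, 4)
likelihood = gauss(z, theta, 1.0)              # p(z | theta)

posterior = likelihood * prior
posterior /= np.sum(posterior) * dtheta        # divide by p(z), cf. (2.27)

post_mean = np.sum(theta * posterior) * dtheta # conditional mean (MS estimate)
post_mode = theta[np.argmax(posterior)]        # posterior mode (MAP estimate)
print(post_mean, post_mode)                    # both close to 0.8 * z in this Gaussian case
```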

2.4 Maximum likelihood estimation

In maximum likelihood estimation the probability density function of the observation z given the unknown parameter θ is assumed to be known. No probability density of θ is required. For given data z, ˆθ_ML = ˆθ_ML(z) is the maximum likelihood estimate if

p(z|ˆθ_ML) ≥ p(z|ˆθ)   (2.29)

for all ˆθ. Thus ˆθ_ML maximizes the likelihood function p(z|θ) for the given data z. In other words, the selection of ˆθ_ML makes the measurement z most probable in the class of all probability densities that are of the form p(z|ˆθ) [133, 102]. The maximum likelihood estimate is often thought of as the "true" estimate when θ is treated as non-random, and other estimates are usually compared to it. The reason is that the maximum likelihood estimate has many desirable properties of a "good" estimate; see e.g. [39] and [102].
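As a concrete example, let z_1,...,z_N be independent observations with z_i ∼ N(θ, σ²) and σ² known. Maximizing the likelihood (2.29) over θ then gives the sample mean. The sketch below checks this numerically by maximizing the log-likelihood on a grid; the data, the true value and the grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 2.0
theta_true = 0.7
z = theta_true + np.sqrt(sigma2) * rng.standard_normal(500)   # z_i ~ N(theta, sigma2)

def log_likelihood(theta):
    # log p(z | theta) for independent Gaussian observations, additive constants dropped
    return -0.5 * np.sum((z - theta) ** 2) / sigma2

grid = np.linspace(-2.0, 2.0, 4001)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_ml, z.mean())   # the grid maximizer agrees with the closed-form ML estimate z.mean()
```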


2.5 Bayes cost method

If we assume that θ is a random vector with a known joint density p(θ,z) with the observation z, we have made the so-called Bayesian assumption [142]. This assumption leads to the so-called Bayes cost method for solving the estimator ˆθ(z). For the cost method we define the function C(θ, ˆθ) that assigns to each combination of actual parameter value and estimate a unique real valued cost. We call C(θ, ˆθ) the cost function. The expected value of the cost is given by

B(ˆθ) = E{C(θ, ˆθ(z))} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(θ,z) dθ dz   (2.30)

which is called the Bayes cost. From (2.15) we obtain p(θ,z) = p(z|θ) p(θ), and the expectation can be written in the form

B(ˆθ) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(z|θ) dz ) p(θ) dθ   (2.31)

The inner integral is clearly the conditional expectation of the cost given θ, and we write

B(ˆθ|θ) = ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(z|θ) dz   (2.32)
        = E{C(θ, ˆθ) | θ}   (2.33)

This can be called the conditional Bayes cost given θ, and in terms of the conditional cost the Bayes cost of the estimator can be written as

B(ˆθ) = ∫_{−∞}^{∞} B(ˆθ|θ) p(θ) dθ   (2.34)
      = E_θ{E{C(θ, ˆθ) | θ}}   (2.35)
      = E_θ{B(ˆθ|θ)}   (2.36)

Similarly, using p(θ,z) = p(θ|z) p(z) we can write

B(ˆθ) = E{C(θ, ˆθ(z))}   (2.37)
      = E_z{E{C(θ, ˆθ) | z}}   (2.38)
      = E_z{B(ˆθ|z)}   (2.39)

where

B(ˆθ|z) = ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(θ|z) dθ   (2.40)
        = E{C(θ, ˆθ) | z}   (2.41)


is the conditional Bayes cost given z.

Now we can state the Bayes estimation criterion: for a given cost function C(θ, ˆθ), the Bayesian estimator ˆθ_B is selected so that

B(ˆθ_B) ≤ B(ˆθ)   (2.42)

for all ˆθ. Thus the Bayesian estimator is the one that minimizes the Bayes cost [133].

Different choices of the cost function lead to different estimators, and most common estimators can often be seen as minimizers of some specific cost function.
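The effect of the cost function can be illustrated numerically: for a fixed (here deliberately skewed and bimodal) posterior given on a grid, the conditional Bayes cost (2.41) can be evaluated for every candidate estimate and minimized directly. The sketch below is an illustration only; it anticipates the mean square cost of Section 2.6 and the uniform cost of Section 2.7 and shows that their minimizers are the posterior mean and the posterior mode, respectively.

```python
import numpy as np

# A deliberately skewed "posterior" p(theta|z) on a grid (assumed for illustration only).
theta = np.linspace(0.0, 10.0, 2001)
dtheta = theta[1] - theta[0]
posterior = 0.7 * np.exp(-0.5 * (theta - 3.0) ** 2 / 0.25) + 0.3 * np.exp(-0.5 * (theta - 6.0) ** 2 / 1.0)
posterior /= posterior.sum() * dtheta

def bayes_cost(cost):
    # Conditional Bayes cost B(theta_hat | z) evaluated for every candidate theta_hat on the grid.
    return np.array([np.sum(cost(theta, th) * posterior) * dtheta for th in theta])

eps = 0.05
cost_ms = lambda t, th: (t - th) ** 2                 # mean square cost, cf. (2.43)
cost_uc = lambda t, th: (np.abs(t - th) > eps) * 1.0  # uniform cost, cf. (2.76)

theta_ms = theta[np.argmin(bayes_cost(cost_ms))]
theta_uc = theta[np.argmin(bayes_cost(cost_uc))]

post_mean = np.sum(theta * posterior) * dtheta
post_mode = theta[np.argmax(posterior)]
print(theta_ms, post_mean)   # mean square cost is minimized by the conditional mean
print(theta_uc, post_mode)   # uniform cost is minimized by the posterior mode (MAP)
```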

2.6 Mean square estimation

Consider the situation in which the cost function is of a specific form, namely the squared norm of the estimation error ˜θ = θ − ˆθ,

C_MS(θ, ˆθ) = ‖θ − ˆθ‖² = ˜θ^T ˜θ   (2.43)

From (2.39) and (2.41) we can write the Bayes cost in the form

B(ˆθ) = E{C(θ, ˆθ)}   (2.44)
      = E_z{B(ˆθ|z)}   (2.45)
      = E_z{E{C(θ, ˆθ) | z}}   (2.46)

The outer expectation does not depend on θ and we can thus minimize the conditional Bayes cost

B(ˆθ|z) = E{C(θ, ˆθ) | z}   (2.47)

Inserting the mean square cost function (2.43) into this we obtain

B(ˆθ|z) = E{(θ − ˆθ)^T (θ − ˆθ) | z}   (2.48)
        = E{θ^T θ − 2 θ^T ˆθ + ˆθ^T ˆθ | z}   (2.49)

Next, by definition, the conditional mean of θ given z is

η_{θ|z} = E{θ|z}   (2.50)

and the conditional covariance is

C_{θ|z} = E{(θ − η_{θ|z})(θ − η_{θ|z})^T | z}   (2.51)
        = E{θθ^T | z} − η_{θ|z} E{θ^T | z} − E{θ|z} η_{θ|z}^T + η_{θ|z} η_{θ|z}^T   (2.52)
        = E{θθ^T | z} − η_{θ|z} η_{θ|z}^T   (2.53)


Now, since B(ˆθ|z) is a scalar,

B(ˆθ|z) = trace B(ˆθ|z)   (2.54)
        = trace ( E{θ^T θ | z} − 2 η_{θ|z}^T ˆθ + ˆθ^T ˆθ )   (2.55)

where trace(A) is defined to be the sum of the diagonal elements of the square matrix A. We can use the identities [69]

trace(A + B) = trace(A) + trace(B)   (2.56)
trace(ABC) = trace(CAB) = trace(BCA)   (2.57)

and then

B(ˆθ|z) = trace ( E{θθ^T | z} − 2 ˆθ η_{θ|z}^T + ˆθ ˆθ^T )   (2.58)
        = trace ( C_{θ|z} + η_{θ|z} η_{θ|z}^T − 2 ˆθ η_{θ|z}^T + ˆθ ˆθ^T )   (2.59)
        = trace C_{θ|z} + trace ( (ˆθ − η_{θ|z})(ˆθ − η_{θ|z})^T )   (2.60)
        = trace C_{θ|z} + trace ( (ˆθ − η_{θ|z})^T (ˆθ − η_{θ|z}) )   (2.61)
        = trace C_{θ|z} + ‖ˆθ − η_{θ|z}‖²   (2.62)

The first term on the right hand side of the equation does not depend on ˆθ(z) and is clearly positive, and the second can be made zero by choosing ˆθ = η_{θ|z}. Therefore we conclude that the optimal Bayesian minimum mean square estimator is the function η_{θ|z}, that is, the conditional mean

ˆθ_MS = ∫_{−∞}^{∞} θ p(θ|z) dθ = E{θ|z} = η_{θ|z}   (2.63)

This result holds for all densities p(θ|z) [193]. The estimator ˆθ_MS is sometimes also called the conditional mean estimator. The expected value of the estimation error ˜θ can be written as

E{˜θ} = E_z{E{˜θ | z}}   (2.64)
      = E_z{E{θ − ˆθ_MS | z}}   (2.65)

Now, since E{E{θ|z} | z} = E{θ|z},

E{˜θ} = E_z{ ∫_{−∞}^{∞} θ p(θ|z) dθ − E{ˆθ_MS | z} }   (2.66)
      = E_z{ ˆθ_MS − E{ˆθ_MS | z} }   (2.67)
E{˜θ} = 0   (2.68)

This means that the mean square estimator is unbiased.

The preceding results are easily modified to include a symmetric positive semidefinite (weighting) matrix W. Let the generalized mean square cost function be

C_GMS(θ, ˆθ) = ˜θ^T W ˜θ   (2.69)

The conditional Bayes cost is then

B(ˆθ|z) = E{(θ − ˆθ)^T W (θ − ˆθ) | z}   (2.70)
        = ˆθ^T W ˆθ − 2 ˆθ^T W η_{θ|z} + E{θ^T W θ | z}   (2.71)
        = (ˆθ − η_{θ|z})^T W (ˆθ − η_{θ|z}) + E{θ^T W θ | z} − η_{θ|z}^T W η_{θ|z}   (2.72)

Positive semidefiniteness means that a^T W a ≥ 0 for any a. Thus the first term on the right hand side of the equation is nonnegative and can be made equal to zero by choosing

ˆθ_GMS = E{θ|z}   (2.73)

This also minimizes the conditional Bayes cost, since the other terms do not depend on ˆθ. The result is identical to the result for ˆθ_MS.

The fact that the conditional mean minimizes the generalized index C_GMS(θ, ˆθ) has an important implication: it means that the conditional mean minimizes the error of each component of ˜θ individually [192]. This can be seen by choosing the weighting matrix W so that only one diagonal element, say the i'th, is nonzero and equal to one, and all the other elements are zero,

W_i = diag(0, ..., 0, 1, 0, ..., 0)   (2.74)

where the single 1 is the i'th diagonal element. The minimization of the mean square cost function (2.43) can clearly be seen as the minimization of the sum

C_MS(θ, ˆθ) = ˜θ^T ˜θ = Σ_i ˜θ^T W_i ˜θ   (2.75)

and thus the conditional mean minimizes each squared error term ˜θ_i² individually.

2.7 Maximum a posteriori estimation

Let us now define another cost function, the uniform cost function [133]

C_UC(θ, ˆθ) = { 0,  ˜θ ∈ I
             { 1,  otherwise   (2.76)

where I = ]−ɛ, ɛ[ × · · · × ]−ɛ, ɛ[ ⊂ R^p and ɛ is small. This cost thus gives zero penalty if all components of the estimation error are small, and a unit penalty if any of the components is larger than ɛ. When C_UC(θ, ˆθ) is substituted into the equation of the conditional Bayes cost (2.41) we obtain

B_UC(ˆθ|z) = ∫_{˜θ ∈ Ī} p(θ|z) dθ   (2.77)
           = 1 − ∫_{˜θ ∈ I} p(θ|z) dθ   (2.78)

where Ī stands for the complement of I, and using the mean value theorem for integrals [5] there is a value, say ˆθ, in I for which

B_UC(ˆθ|z) = 1 − (2ɛ)^p p(ˆθ|z)   (2.79)

To minimize B_UC(ˆθ|z) we must maximize p(ˆθ|z), so ˆθ_UC can be defined by

p(ˆθ_UC|z) ≥ p(ˆθ|z)   (2.80)

for all ˆθ. Since ˆθ_UC maximizes the posterior density of θ given the observations z, ˆθ_UC is also called the maximum a posteriori estimate ˆθ_MAP.

ˆθ_UC is the mode of the density p(θ|z), and yet another name for the estimator is the conditional mode estimator. It can be shown that if the prior distribution of θ is uniform in a region containing the maximum likelihood estimate, then the maximum likelihood estimate is identical to the maximum a posteriori estimate, that is, ˆθ_ML = ˆθ_MAP [133]. Clearly ˆθ_MAP = ˆθ_MS if the mode of the density p(θ|z) equals the mean η_{θ|z}. This is the case when p(θ|z) is symmetric and unimodal.
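When p(θ|z) is differentiable, the MAP estimate is in practice usually computed by minimizing the negative log posterior instead of evaluating (2.80) directly. The sketch below is an illustration under an assumed Gaussian prior and a linear Gaussian observation model; in that special case the MAP and MS estimates coincide, which provides a convenient closed-form check.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

p, M = 4, 50
H = rng.standard_normal((M, p))
C_theta = np.eye(p)                      # assumed prior covariance, theta ~ N(0, I)
C_v = 0.5 * np.eye(M)                    # assumed noise covariance
theta_true = rng.standard_normal(p)
z = H @ theta_true + rng.multivariate_normal(np.zeros(M), C_v)

def neg_log_posterior(theta):
    # -log p(theta|z) up to an additive constant: likelihood term + prior term
    r = z - H @ theta
    return 0.5 * r @ np.linalg.solve(C_v, r) + 0.5 * theta @ np.linalg.solve(C_theta, theta)

theta_map = minimize(neg_log_posterior, np.zeros(p)).x

# For the linear Gaussian model the closed-form MS/MAP estimate is available for comparison.
theta_ms = np.linalg.solve(np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_v, H),
                           H.T @ np.linalg.solve(C_v, z))
print(np.allclose(theta_map, theta_ms, atol=1e-4))
```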

2.8 Linear minimum mean square estimator

In this section we restrict the form of the estimator to be a linear function of the data and derive the optimum estimator under this structural constraint. If certain conditions on the densities p(θ) and p(z) are fulfilled, this optimal linear estimator turns out to be the overall optimal estimator.

Let the estimator be constrained to be a linear function of the data

ˆθ = Kz   (2.81)

Let θ and z be random vectors with zero means and known covariances. No other assumptions are made about the joint distribution of the parameters and data. We derive the estimator that is of the form (2.81) and minimizes the mean square Bayes cost B_MS(ˆθ). We first note that

B_MS(ˆθ) = E{˜θ^T ˜θ}   (2.82)
         = E{(θ − ˆθ)^T (θ − ˆθ)}   (2.83)
         = trace E{(θ − ˆθ)(θ − ˆθ)^T}   (2.84)
         = trace C_˜θ   (2.85)

and that

C_˜θ = E{(θ − ˆθ)(θ − ˆθ)^T}   (2.86)
     = E{(θ − Kz)(θ − Kz)^T}   (2.87)
     = C_θ − K C_zθ − C_θz K^T + K C_z K^T   (2.88)
     = C_θ + (K − C_θz C_z^{−1}) C_z (K − C_θz C_z^{−1})^T − C_θz C_z^{−1} C_zθ   (2.89)

where it is assumed that C_z is invertible. Only the second term on the right hand side of the equation depends on the matrix K. The trace, and each term of the diagonal (note that the diagonal terms are quadratic and thus nonnegative), of the matrix E{(θ − ˆθ)(θ − ˆθ)^T} can be minimized by choosing

K = C_θz C_z^{−1}   (2.90)

so that we can write for the linear minimum mean square estimate

ˆθ_LMMS = C_θz C_z^{−1} z   (2.91)

and the estimation error covariance is, from (2.89),

C_˜θ_LMMS = C_θ − C_θz C_z^{−1} C_zθ   (2.92)

Using the transformations

θ′ = θ − E{θ}   (2.93)
z′ = z − E{z}   (2.94)

the result extends to variables with nonzero means [192]. The estimator (2.91) is optimum when its structure is restricted to be linear. No assumptions have been made about the densities of the measurements and the parameters.
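The gain (2.90) and the error covariance (2.92) can be checked by simulation: draw zero-mean θ and z with known second order statistics, apply K = C_θz C_z^{-1}, and compare the empirical error covariance with the formula. All covariances and dimensions in the sketch below are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed joint second order statistics of theta (2-dim) and z (3-dim), zero means.
C_theta = np.array([[2.0, 0.3], [0.3, 1.0]])
A = rng.standard_normal((3, 2))
C_z = A @ C_theta @ A.T + 0.4 * np.eye(3)     # z = A theta + noise, so C_z is known
C_theta_z = C_theta @ A.T                     # cross-covariance C_{theta z}

K = C_theta_z @ np.linalg.inv(C_z)            # optimal linear gain, cf. (2.90)

# Monte Carlo check of the error covariance formula (2.92).
N = 200_000
theta = rng.multivariate_normal(np.zeros(2), C_theta, size=N)
z = theta @ A.T + rng.multivariate_normal(np.zeros(3), 0.4 * np.eye(3), size=N)
err = theta - z @ K.T                         # estimation error theta - K z
C_err_mc = err.T @ err / N
C_err_formula = C_theta - C_theta_z @ np.linalg.inv(C_z) @ C_theta_z.T
print(np.round(C_err_mc, 3))
print(np.round(C_err_formula, 3))             # the two agree up to Monte Carlo error
```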

2.9 Minimum mean square estimator <strong>for</strong> Gaussian variables<br />

If θ <strong>and</strong> z are jointly Gaussian, then the linear minimum mean square estimator<br />

is not only the optimal linear estimator but the overall optimal estimator <strong>for</strong> θ.<br />

Let the joint density function <strong>for</strong> θ <strong>and</strong> z is (without scaling terms)<br />

⎧<br />

⎨ (<br />

p(θ,z) ∝ exp<br />

⎩ −1 2<br />

) ( ) −1 (<br />

θ T z T C θ C θz<br />

C zθ C z<br />

θ<br />

z<br />

) ⎫ ⎬<br />

⎭<br />

(2.95)<br />

where θ <strong>and</strong> z have been assumed to have zero means. The nonzero mean situation<br />

can be treated with the trans<strong>for</strong>mations (2.93) <strong>and</strong> (2.94). It was shown that the<br />

mean square estimate equals to the conditional mean<br />

ˆθ MS = E {θ|z} (2.96)


30 2. Estimation theory<br />

For the calculation of the mean we first have to form the equation for the posterior density p(θ|z). First we note that the matrix inversion lemma [69] gives for the inverse of the joint covariance of θ and z

\begin{pmatrix} C_\theta & C_{\theta z} \\ C_{z\theta} & C_z \end{pmatrix}^{-1} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}   (2.97)

where

C_{11} = (C_\theta - C_{\theta z} C_z^{-1} C_{z\theta})^{-1} = C_\theta^{-1} + C_\theta^{-1} C_{\theta z} C_{22} C_{z\theta} C_\theta^{-1}   (2.98)
C_{22} = (C_z - C_{z\theta} C_\theta^{-1} C_{\theta z})^{-1} = C_z^{-1} + C_z^{-1} C_{z\theta} C_{11} C_{\theta z} C_z^{-1}   (2.99)
C_{12} = C_{21}^T = -C_{11} C_{\theta z} C_z^{-1} = -C_\theta^{-1} C_{\theta z} C_{22}   (2.100)

The density of z can be written in the form

p(z) \propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} \theta^T & z^T \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & C_z^{-1} \end{pmatrix} \begin{pmatrix} \theta \\ z \end{pmatrix} \right\}   (2.101)

so that the posterior density p(θ|z) is obtained by forming

p(\theta|z) = \frac{p(\theta,z)}{p(z)}   (2.102)
\propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} \theta^T & z^T \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} - C_z^{-1} \end{pmatrix} \begin{pmatrix} \theta \\ z \end{pmatrix} \right\}   (2.103)
= \exp\left\{ -\frac{1}{2} \left( \theta^T C_{11} \theta + 2\theta^T C_{12} z + z^T (C_{22} - C_z^{-1}) z \right) \right\}   (2.104)
= \exp\left\{ -\frac{1}{2} \left( \theta^T C_{11} \theta - 2\theta^T C_{11} C_{\theta z} C_z^{-1} z + z^T C_z^{-1} C_{z\theta} C_{11} C_{\theta z} C_z^{-1} z \right) \right\}   (2.105)-(2.106)
= \exp\left\{ -\frac{1}{2} (\theta - C_{\theta z} C_z^{-1} z)^T C_{11} (\theta - C_{\theta z} C_z^{-1} z) \right\}   (2.107)

This is clearly a Gaussian density. The Gaussian conditional density is of the form

p(\theta|z) \propto \exp\left\{ -\frac{1}{2} (\theta - E\{\theta|z\})^T C_{\theta|z}^{-1} (\theta - E\{\theta|z\}) \right\}   (2.108)

Since \hat{\theta}_{MS} = E\{\theta|z\} and C_{\theta|z} = C_{\tilde{\theta}_{MS}}, comparing with (2.107) we can conclude

\hat{\theta}_{MS} = C_{\theta z} C_z^{-1} z   (2.109)
C_{\tilde{\theta}_{MS}} = C_\theta - C_{\theta z} C_z^{-1} C_{z\theta}   (2.110)

that is exactly the linear minimum mean square estimator.



2.10 Mean square estimation with observation model

Next we consider a model for the dependence between observations and parameters. The most common observation model is the so-called additive noise model

z = h(\theta) + v   (2.111)

Now we have to assume some joint density p(θ,v) for the random parameters θ and the observation error v. Then, at least theoretically, the joint density of z and θ is known and we can form either p(z|θ) for maximum likelihood estimation or the posterior density p(θ|z) of θ for Bayesian estimation. In general the density p(θ|z) is needed, e.g. for the mean square estimator, but in certain situations, such as the case of linear estimates, knowledge of the second-order statistics is enough.

Let us now constrain the observations to be of a specific linear form in the parameters

z = H\theta + v   (2.112)

where v and θ are random. Let θ and v have zero means and known covariances. The measurements z then have zero mean and covariance

C_z = E\{(H\theta + v)(H\theta + v)^T\}   (2.113)
= H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v   (2.114)

and the cross covariance C_{\theta z} is

C_{\theta z} = E\{\theta (H\theta + v)^T\} = C_\theta H^T + C_{\theta v}   (2.115)

Using these for the linear mean square estimate we obtain

\hat{\theta}_{LMMS} = (C_\theta H^T + C_{\theta v})(H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v)^{-1} z   (2.116)

with error covariance matrix

C_{\tilde{\theta}_{LMMS}} = C_\theta - (C_\theta H^T + C_{\theta v})(H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v)^{-1} (H C_\theta + C_{v\theta})   (2.117)

A special case of this is when θ and v are uncorrelated, C_{\theta v} = 0. Then the equations for the estimate and the error reduce to

\hat{\theta}_{LMMS} = C_\theta H^T (H C_\theta H^T + C_v)^{-1} z   (2.118)
C_{\tilde{\theta}_{LMMS}} = C_\theta - C_\theta H^T (H C_\theta H^T + C_v)^{-1} H C_\theta   (2.119)

Applying the matrix inversion lemma (2.97–2.100) we obtain

\hat{\theta}_{LMMS} = (C_\theta^{-1} + H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z = C_{\tilde{\theta}_{LMMS}} H^T C_v^{-1} z   (2.120)
C_{\tilde{\theta}_{LMMS}} = (C_\theta^{-1} + H^T C_v^{-1} H)^{-1}   (2.121)
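As a concrete illustration (an assumed numerical example, not part of the original derivation), the following minimal numpy sketch checks that the covariance form (2.118) and the information form (2.120)–(2.121) of the linear minimum mean square estimator agree for randomly generated covariances.

```python
import numpy as np

# Minimal check of the equivalence of (2.118) and (2.120)-(2.121),
# assuming uncorrelated theta and v with known covariances.
rng = np.random.default_rng(0)
M, p = 8, 3                                                       # measurements, parameters
A = rng.standard_normal((p, p)); C_theta = A @ A.T + p * np.eye(p)  # prior covariance
B = rng.standard_normal((M, M)); C_v = B @ B.T + M * np.eye(M)      # noise covariance
H = rng.standard_normal((M, p))                                     # observation matrix
z = rng.standard_normal(M)                                          # one measurement vector

# Covariance form (2.118)
theta_cov = C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_v, z)

# Information form (2.120)-(2.121), via the matrix inversion lemma
C_err = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_v, H))
theta_inf = C_err @ H.T @ np.linalg.solve(C_v, z)

assert np.allclose(theta_cov, theta_inf)   # the two forms agree
```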



2.11 Gauss–Markov estimate

We consider the situation where the parameters are unknown but non-random and the observation model is linear, that is

z = H\theta + v   (2.122)

where v is random. Let v have zero mean and covariance C_v. We derive the linear unbiased estimator \hat{\theta} that minimizes the criterion

E\{\|\theta - \hat{\theta}\|^2\} = E\{(\theta - \hat{\theta})^T (\theta - \hat{\theta})\}   (2.123)
= \mathrm{trace}\, E\{(\theta - \hat{\theta})(\theta - \hat{\theta})^T\}   (2.124)
= \mathrm{trace}\, C_{\tilde{\theta}}   (2.125)

Let

\hat{\theta} = Kz + k   (2.126)

with the requirement

E\{\hat{\theta}\} = \theta   (2.127)

Then

E\{\hat{\theta}\} = E\{Kz + k\} = K E\{z\} + k   (2.128)
= K E\{H\theta + v\} + k = KH\theta + k   (2.129)

For \hat{\theta} to be unbiased K and k must satisfy

KH = I, \quad k = 0   (2.130)

For the error covariance we can write

C_{\tilde{\theta}} = E\{(\theta - \hat{\theta})(\theta - \hat{\theta})^T\}   (2.131)
= E\{(\theta - Kz)(\theta - Kz)^T\}   (2.132)
= E\{(\theta - KH\theta - Kv)(\theta - KH\theta - Kv)^T\}   (2.133)
= E\{Kvv^T K^T\} = K C_v K^T   (2.134)

Next we consider the matrix

K' = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1}   (2.135)

for which K'H = I. With this selection the estimator \hat{\theta} = K'z is unbiased. We will next show that K' minimizes the mean square error. First we note that

K' C_v K^T = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} C_v K^T   (2.136)
= (H^T C_v^{-1} H)^{-1} (KH)^T   (2.137)
= (H^T C_v^{-1} H)^{-1}   (2.138)



and

K C_v K'^T = K C_v C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.139)
= (KH)(H^T C_v^{-1} H)^{-1}   (2.140)
= (H^T C_v^{-1} H)^{-1}   (2.141)

and

K' C_v K'^T = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} C_v C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.142)
= (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.143)
= (H^T C_v^{-1} H)^{-1}   (2.144)

Next we form the matrix

(K - K') C_v (K - K')^T = K C_v K^T - K' C_v K^T - K C_v K'^T + K' C_v K'^T   (2.145)-(2.146)
= K C_v K^T - K' C_v K'^T   (2.147)

Note that the matrix (K - K') C_v (K - K')^T is positive semidefinite. Now we obtain for the trace of the error covariance matrix

\mathrm{trace}\, C_{\tilde{\theta}} = \mathrm{trace}\, K C_v K^T   (2.148)
= \mathrm{trace}\, (K - K') C_v (K - K')^T + \mathrm{trace}\, K' C_v K'^T   (2.149)

This can be minimized by choosing K = K' and the Gauss–Markov estimate can be written in the form

\hat{\theta}_{GM} = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z   (2.150)
C_{\tilde{\theta}_{GM}} = (H^T C_v^{-1} H)^{-1}   (2.151)

Comparing \hat{\theta}_{GM} with \hat{\theta}_{LMMS} we note that \hat{\theta}_{GM} is obtained by letting C_\theta^{-1} = 0. This means that the prior density of θ is flat, that is, no a priori information is available about θ [66, 26].

Note that the criterion to be minimized in Gauss–Markov estimation is identical to that of the mean square estimator. The difference is in the treatment of θ as a non-random but unknown parameter vector.

Since this estimator also minimizes the variance of the estimate, it is also called the linear minimum variance estimator \hat{\theta}_{LMV}.

2.12 Least squares estimation

Next we consider the situation where neither the parameters θ nor the error v in the observations is interpreted as random. The solution of this problem leads to the generalized least squares solution.

Let the observation model be

z = h(\theta) + v   (2.152)



where θ and v are unknown but non-random. Then the generalized least squares estimator \hat{\theta}_{GLS} is defined to be the minimizer of the generalized least squares index

l_{GLS} = (z - h(\theta))^T W (z - h(\theta))   (2.153)
= \|Lz - Lh(\theta)\|^2   (2.154)

where L^T L = W is a symmetric positive definite matrix. If W is diagonal the index is also called the weighted least squares index.

With a nonlinear h(θ), the minimization of l_{GLS} has to be done iteratively. First we form the Taylor expansion of the generalized least squares index in the neighborhood of some θ*

l(\theta) = l(\theta^*) + \frac{\partial l}{\partial \theta}(\theta^*)\,(\theta - \theta^*)   (2.155)
+ \frac{1}{2}(\theta - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*)\,(\theta - \theta^*)   (2.156)
+ O(\|\theta - \theta^*\|^3)   (2.157)

Note that ∂l/∂θ is a row vector and ∂²l/∂θ² is a symmetric matrix. Next we approximate l(θ) with the second order approximation

l(\theta) \approx l(\theta^*) + \left[ \frac{\partial l}{\partial \theta}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*) \right] (\theta - \theta^*) = f(\theta)   (2.158)

This approximation is at a minimum when θ = \hat{\theta} is such that

\frac{\partial f}{\partial \theta}(\hat{\theta}) = \frac{\partial l}{\partial \theta}(\theta^*) + (\hat{\theta} - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*) = 0   (2.159)

so that we can solve \hat{\theta}

\hat{\theta} = \theta^* - \left( \frac{\partial^2 l}{\partial \theta^2}(\theta^*) \right)^{-1} \left( \frac{\partial l}{\partial \theta}(\theta^*) \right)^T   (2.160)

The gradient of l is of the form

\frac{\partial l}{\partial \theta}(\theta^*) = -2 (z - h(\theta^*))^T W \frac{\partial h}{\partial \theta}(\theta^*)   (2.161)

and differentiating twice, we obtain

\frac{\partial^2 l}{\partial \theta^2}(\theta^*) = -2 \left( \sum_{i=1}^{M} (z_i - h_i(\theta^*)) W \frac{\partial^2 h_i}{\partial \theta^2}(\theta^*) \right) + 2 \left( \frac{\partial h}{\partial \theta}(\theta^*) \right)^T W \left( \frac{\partial h}{\partial \theta}(\theta^*) \right)   (2.162)

Finally, we obtain a recursion

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i \left( J_i^T W J_i - \sum_{j=1}^{M} (z_j - h_j(\hat{\theta}_i)) W \frac{\partial^2 h_j}{\partial \theta^2}(\hat{\theta}_i) \right)^{-1} J_i^T W (z - h(\hat{\theta}_i))   (2.163)



where J_i = \frac{\partial h}{\partial \theta}(\hat{\theta}_i), and k_i is the step size parameter that controls the convergence of the iteration [18]. This method is called the Newton–Raphson method. If the norm \|z - h(\theta)\| is small we can make the approximation

\frac{\partial^2 l}{\partial \theta^2}(\hat{\theta}_i) \approx 2 J_i^T W J_i   (2.164)

and the iteration can be written in the form

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i \left( J_i^T W J_i \right)^{-1} J_i^T W (z - h(\hat{\theta}_i))   (2.165)

This is called the Gauss–Newton method. Yet another well known search procedure results when (\partial^2 l/\partial \theta^2)(\hat{\theta}_i) is replaced with the identity matrix. With W = I we can write

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i J_i^T (z - h(\hat{\theta}_i))   (2.166)

This is called the steepest descent method.
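As an illustration only (a toy model assumed here, not taken from the thesis), the following numpy sketch runs the Gauss–Newton iteration (2.165) with W = I and unit step size for a simple exponential observation model.

```python
import numpy as np

# Gauss-Newton iteration (2.165) for a toy model h(theta) = theta_1 * exp(-theta_2 * t), W = I.
t = np.linspace(0.0, 4.0, 50)
h = lambda th: th[0] * np.exp(-th[1] * t)                  # observation model h(theta)
J = lambda th: np.column_stack((np.exp(-th[1] * t),        # Jacobian dh/dtheta
                                -th[0] * t * np.exp(-th[1] * t)))

rng = np.random.default_rng(1)
theta_true = np.array([2.0, 0.7])
z = h(theta_true) + 0.01 * rng.standard_normal(t.size)      # simulated measurements

theta = np.array([1.0, 0.2])                                # initial guess
for _ in range(20):                                         # step size k_i = 1
    Ji, r = J(theta), z - h(theta)
    theta = theta + np.linalg.solve(Ji.T @ Ji, Ji.T @ r)

print(theta)   # close to theta_true
```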

If h(θ) is linear the observation model is of the form

z = H\theta + v   (2.167)

Now the minimization of the generalized linear least squares index

l_{GLS} = (z - H\theta)^T W (z - H\theta)   (2.168)
= \|Lz - LH\theta\|^2   (2.169)

takes a simpler form. Let first W = I and denote the corresponding index by l_{LS}. Then form the gradient

\frac{\partial l_{LS}}{\partial \theta} = -2 (z - H\theta)^T H   (2.170)

Then, the least squares estimator satisfies

-(z - H\hat{\theta}_{LS})^T H = 0   (2.171)

or

H^T (z - H\hat{\theta}_{LS}) = 0   (2.172)

This can be rewritten in the form

H^T H \hat{\theta}_{LS} = H^T z   (2.173)

This system of equations is called the normal equations. The formal solution is

\hat{\theta}_{LS} = (H^T H)^{-1} H^T z   (2.174)

Note that the matrix H^T H equals \partial^2 l_{LS}/\partial\theta^2 up to the factor 2, so that if it is positive definite the stationary point is guaranteed to be the minimizer of l_{LS}.



Note that the criterion (2.172) implies that the residual r = z - \hat{z} = z - H\hat{\theta}_{LS} is orthogonal to all columns of the matrix H. In other words, r belongs to the null space of H^T, which equals the orthogonal complement of the range of H, that is r \in N(H^T) = R(H)^\perp. By definition \hat{z} \in R(H) and \dim(R(H)) = p. The measurement z \in R^M is now of the form

z = \hat{z} + r   (2.175)

and we can see that \hat{z} is the orthogonal projection of z onto R(H), the linear subspace spanned by the columns of the matrix H. We call these vectors the basis vectors. This interpretation of the least squares problem is central in Chapter 6.
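A minimal numerical sketch (an assumed example, not from the thesis) of the normal equations (2.173) and of the orthogonality of the residual (2.172) discussed above:

```python
import numpy as np

# Solve the normal equations (2.173) and check residual orthogonality (2.172).
rng = np.random.default_rng(2)
M, p = 20, 4
H = rng.standard_normal((M, p))                 # basis vectors in the columns of H
z = rng.standard_normal(M)                      # measurements

theta_ls = np.linalg.solve(H.T @ H, H.T @ z)    # normal equations; np.linalg.lstsq is the stabler route
r = z - H @ theta_ls                            # residual r = z - z_hat

print(np.abs(H.T @ r).max())                    # ~0: r is orthogonal to the column space R(H)
```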

The generalized solution is obtained by multiplying the equation (2.167) with a matrix L such that L^T L = W

Lz = LH\theta + Lv   (2.176)

and with the notations z' = Lz and H' = LH

z' = H'\theta + Lv   (2.177)

Now the minimizer of the generalized index

l_{GLS} = (z - H\theta)^T W (z - H\theta)   (2.178)
= (z' - H'\theta)^T (z' - H'\theta)   (2.179)

is obtained by using (2.174)

\hat{\theta}_{GLS} = (H'^T H')^{-1} H'^T z'   (2.180)
= (H^T L^T L H)^{-1} H^T L^T L z   (2.181)
= (H^T W H)^{-1} H^T W z   (2.182)

This is seen to be equivalent to the Gauss–Markov estimate \hat{\theta}_{GM} if we choose W = C_v^{-1}.

A classical reference for linear and nonlinear least squares problems is [110] and for nonlinear optimization in general [93].

2.13 Comparison of ML, MAP and MS estimates

In this section we compare the maximum likelihood estimation with maximum a posteriori estimation in the case of Gaussian densities. Let the observation model be

z = h(\theta) + v   (2.183)

where θ and v are random parameters. Only the parameters θ are to be estimated. Given θ, the observations z and the error v have the same density, except that the mean is E\{z\} = E\{v\} + h(\theta). The density of z given θ is thus

p(z|\theta) = p_v(z - h(\theta)|\theta)   (2.184)



Assuming v is Gaussian with zero mean and that θ and v are independent, we obtain

p(z|\theta) \propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\}   (2.185)

and taking logarithms

\log p(z|\theta) = \mathrm{const} - \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta))   (2.186)

Maximization of this so-called log-likelihood function gives the maximum likelihood estimate and is identical to minimization of

l_{ML} = \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta))   (2.187)

This is identical to the generalized least squares index. In the general case the minimization is nonlinear and can be done e.g. with the Gauss–Newton algorithm. In the linear case the minimizing estimate reduces to the Gauss–Markov estimate

\hat{\theta} = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z   (2.188)

Next we can write for the posterior density

p(\theta|z) \propto p(z|\theta) p(\theta)   (2.189)
= p_v(z - h(\theta)|\theta) p(\theta)   (2.190)

and if we assume that v \sim N(0, C_v) and \theta \sim N(\eta_\theta, C_\theta), this can be written in the form

p(\theta|z) \propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\} p(\theta)   (2.191)
\propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\} \exp\left\{ -\frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta) \right\}   (2.192)-(2.193)

The maximum a posteriori estimate is obtained now by maximizing the logarithm of the posterior density

\log p(\theta|z) = \mathrm{const} - \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) - \frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta)   (2.194)

This is seen to be identical to minimization of

l_{MAP} = \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) + \frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta)   (2.195)



The quadratic index (2.195) can be written in the form

l_{MAP} = \frac{1}{2} \begin{pmatrix} (z - h(\theta))^T & (\theta - \eta_\theta)^T \end{pmatrix} \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix} \begin{pmatrix} z - h(\theta) \\ \theta - \eta_\theta \end{pmatrix}   (2.196)
= \frac{1}{2} \left\| L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} - L \begin{pmatrix} h(\theta) \\ \theta \end{pmatrix} \right\|^2   (2.197)

where

L^T L = W, \quad W = \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix}   (2.198)

If we assume the linear observation model h(θ) = Hθ we obtain

l_{MAP} = \frac{1}{2} \left\| L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} - L \begin{pmatrix} H\theta \\ \theta \end{pmatrix} \right\|^2   (2.199)
= \frac{1}{2} \| L z' - L H' \theta \|^2   (2.200)

where

H' = \begin{pmatrix} H \\ I \end{pmatrix}, \quad z' = \begin{pmatrix} z \\ \eta_\theta \end{pmatrix}   (2.201)

This can be solved as a generalized linear LS problem and the formal solution is

\hat{\theta}_{GLS} = (H'^T L^T L H')^{-1} H'^T L^T L z'   (2.202)
= (H'^T W H')^{-1} H'^T W z'   (2.203)
= (H^T C_v^{-1} H + C_\theta^{-1})^{-1} (H^T C_v^{-1} z + C_\theta^{-1} \eta_\theta)   (2.204)

This is seen to be the mean square estimate with nonzero mean η_θ for θ.
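The following numpy sketch (an assumed example) computes the MAP estimate both directly from (2.204) and via the augmented generalized LS formulation (2.199)–(2.203), to make the equivalence concrete.

```python
import numpy as np

# MAP estimate computed two ways: direct form (2.204) and augmented GLS (2.199)-(2.203).
rng = np.random.default_rng(3)
M, p = 15, 3
H = rng.standard_normal((M, p))
C_v = 0.1 * np.eye(M)                        # noise covariance
C_th = np.diag([4.0, 1.0, 0.25])             # prior covariance
eta = np.array([1.0, 0.0, -1.0])             # prior mean
z = H @ (eta + rng.standard_normal(p)) + 0.3 * rng.standard_normal(M)

# Direct form (2.204)
A = H.T @ np.linalg.solve(C_v, H) + np.linalg.inv(C_th)
b = H.T @ np.linalg.solve(C_v, z) + np.linalg.solve(C_th, eta)
theta_map = np.linalg.solve(A, b)

# Augmented form: stack model and prior, whiten with L (L^T L = W), solve LS
H_aug = np.vstack((H, np.eye(p)))
z_aug = np.concatenate((z, eta))
W = np.block([[np.linalg.inv(C_v), np.zeros((M, p))],
              [np.zeros((p, M)), np.linalg.inv(C_th)]])
L = np.linalg.cholesky(W).T                  # upper-triangular factor, L^T L = W
theta_aug, *_ = np.linalg.lstsq(L @ H_aug, L @ z_aug, rcond=None)

print(np.allclose(theta_map, theta_aug))     # True
```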

If we further assume that C_\theta^{-1} = 0, corresponding to infinite prior variance of θ, we obtain exactly the Gauss–Markov estimate. As seen, this is equivalent to the maximum likelihood estimate. If we assume that the errors are independent with equal variance, C_v^{-1} = \sigma_v^{-2} I, the form of the estimator is

\hat{\theta} = (H^T H)^{-1} H^T z   (2.205)

This is identical to the linear least squares solution. The least squares and the Gauss–Markov estimates can thus be seen as Bayesian estimates with no prior information about the parameters θ.

We conclude that with Gaussian densities MAP and ML estimation reduce to linear or nonlinear generalized least squares estimation. If some of the densities are not Gaussian, the minimization is no longer the minimization of a quadratic function.



2.14 Selection of the basis vectors

As emphasized in Section 2.12, the linear least squares problem can be seen as an interpretation of the measurements z as a linear combination of some basis vectors. The basis vectors are contained in the columns of the observation matrix H = (ψ_1, ..., ψ_p) in the linear observation model

z = H\theta + v   (2.206)

In some applications it is common that the observation matrix H also contains measurements. In statistical applications these variables, other than z, are called the regressor variables [91]. Basically they can also be random in nature. In time series applications the observation matrix can even contain the values of the measurements z. This is the case e.g. in modeling of the time series with recursive models as discussed in Chapter 4. For example, in autoregressive modeling of time series the columns of the observation matrix are delayed versions of the measurement z. If the observation matrix contains random variables the optimality of the mean square estimator does not necessarily hold any more. This is discussed briefly in Section 2.18.

If we have some information about the "physical" nature of the measurements, we can use this information in basis vector selection. If the underlying model is originally nonlinear, the model can be linearized in the neighborhood of some point near the solution and this linearized model can be used as the basis of the model. Sometimes we do not have, or do not want to use, such information. In such cases we can select as basis a set of "generic" vectors that we believe to be able to model the data z. Such generic choices are for example sampled polynomials and trigonometric functions. Also sets of Gaussian or sigmoidally shaped vectors can be used. Any mixture of these is also possible. For example the basis vectors ψ_0 = 1 and ψ_1 = t, which can model the first order trend in the measurements, can be used with all generic bases.

A special case of H is when the basis vectors ψ_i are orthonormal to each other. This means that

H^T H = I   (2.207)

The linear least squares solution is then

\hat{\theta} = (H^T H)^{-1} H^T z = H^T z   (2.208)

and the estimate for the observations z is in this basis

\hat{z} = H\hat{\theta} = H H^T z   (2.209)
= \sum_{i=1}^{p} \psi_i c_i   (2.210)

where c_i = \psi_i^T z, that is, the inner product of the data with the basis vector. If the basis is the trigonometric basis and if p = M the sum (2.210) is called the discrete time Fourier series [125] or the discrete Fourier series [152] for z.



If the measurement z is random, the coefficients c_i are also random parameters. If they are required to be uncorrelated, we can write

E\{c c^T\} = E\{H^T z z^T H\}   (2.211)
= H^T R_z H = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2)   (2.212)

This is an eigenproblem. The basis vectors are then obtained as eigenvectors of the data correlation matrix R_z. The sum (2.210) can then be called e.g. the discrete Karhunen–Loeve transform or the principal component transform [157]. It can be shown that this selection of the basis gives the minimum mean square error in \hat{z} compared to any other set of the same number of basis vectors.
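As an illustrative sketch (an assumed example), the discrete Karhunen–Loeve basis can be estimated from an ensemble of measurements via the eigendecomposition (2.211)–(2.212) and used for the reconstruction (2.209)–(2.210):

```python
import numpy as np

# Estimate a principal component (KL) basis from an ensemble and project one measurement onto it.
rng = np.random.default_rng(4)
M, n_trials, p = 64, 200, 5
t = np.arange(M)
Z = (np.outer(rng.standard_normal(n_trials), np.sin(2 * np.pi * t / M))
     + np.outer(rng.standard_normal(n_trials), np.cos(2 * np.pi * t / M))
     + 0.2 * rng.standard_normal((n_trials, M)))      # ensemble with low-dimensional structure

R_z = Z.T @ Z / n_trials                    # data correlation matrix estimate
evals, evecs = np.linalg.eigh(R_z)          # ascending eigenvalues
H = evecs[:, ::-1][:, :p]                   # p leading eigenvectors as basis vectors

z = Z[0]
c = H.T @ z                                 # coefficients c_i = psi_i^T z
z_hat = H @ c                               # projection onto the KL subspace, (2.209)-(2.210)
print(np.linalg.norm(z - z_hat) / np.linalg.norm(z))   # relative residual
```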

In the wavelet transform the basis vectors are a set of orthogonal sampled functions with local support and (almost) non-overlapping spectra. The wavelet transform of the measurement can then be seen as filtering of the measurements with a bank of noncausal band pass filters [40, 134].

2.15 Modeling of prior information in Bayesian estimation

Modeling of prior information in Bayesian estimation is a fundamental problem. In some cases the parameters θ have some physical meaning and it is possible that we have some knowledge of their possible values. Typical interpretations for such knowledge (or assumptions) are that the parameters are positive, small, almost equal or smoothly varying.

A common way to implement the prior information in Bayesian estimation is to use any prior density with a realistic shape. For example if we know that the parameters should be in some specific p dimensional interval we select the density to be the multidimensional Gaussian density with appropriate mean and covariance to shrink the estimates towards the center of the desired area. The weaker our beliefs about the area are, the less information the prior density should contain. Then it becomes more noninformative, more flat. In the limiting case we can use some improper prior density that can still be useful for the estimation, e.g. as a positivity constraint.

Since the parameters θ are random we can assume that the form of the density of θ depends on the hyperparameters φ. Inference concerning θ is then based on the posterior density

p(\theta|z,\phi) = \frac{p(z,\theta|\phi)}{p(z|\phi)} = \frac{p(z,\theta|\phi)}{\int p(z,u|\phi)\,du}   (2.213)
= \frac{p(z|\theta)\, p(\theta|\phi)}{\int p(z|u)\, p(u|\phi)\,du}   (2.214)

If φ is not known, the proper full Bayesian approach would be to interpret it as random with hyperprior p(φ). The desired posterior for θ is now obtained by marginalization

p(\theta|z) = \frac{p(z,\theta)}{p(z)} = \frac{\int p(z,\theta,\phi)\,d\phi}{\int\!\!\int p(z,u,\phi)\,d\phi\,du}   (2.215)
= \frac{\int p(z|\theta)\, p(\theta|\phi)\, p(\phi)\,d\phi}{\int\!\!\int p(z|u)\, p(u|\phi)\, p(\phi)\,d\phi\,du}   (2.216)

The density p(φ) could also be parametrized. We would then obtain a hierarchical multistage model. However, in the last stage an assumption about the last prior density has to be made.

In the empirical Bayesian approach we reject the necessity of the hyperprior p(φ) in (2.216) and instead use an estimated value \hat{\phi} for φ in (2.214). The estimate is obtained as a maximizer of the marginal density p(z|φ) [26].

2.16 Recursive mean square estimation

In the preceding sections we have dealt with systems in which some fixed amount of data has been available for estimation of the unknown or random parameters. The estimators have been functions of the whole data set. In recursive estimation the question arises how to update the estimate when some amount of new data is received. The answer to this question, with certain observation models, is the so-called Kalman filtering approach. The Kalman filter is a tool with which one can estimate the sequence of states of a dynamical system that cannot be observed directly. The available information is the data sequence that carries some information about the states.

Among many other forms the state-space model for linear dynamical systems can be written as follows. Let z_t ∈ R^M and θ_t ∈ R^p be vector-valued processes. The state θ_t evolves according to the linear difference equation

\theta_{t+1} = F_t \theta_t + G_t w_t   (2.217)

with some initial distribution for θ_0. The state is not observed directly. Instead, the measurements z_t are available at discrete sampling times and are described as

z_t = H_t \theta_t + v_t   (2.218)

This is clearly a linear observation model. The other assumptions for the model are as follows

• F_t, G_t and H_t are known sequences of matrices.
• (θ_0, w_t, v_t) is a sequence of mutually uncorrelated random vectors with finite variance.
• E{w_t} = 0, E{v_t} = 0, ∀t
• the covariances C_{w_t}, C_{v_t} and C_{v_t,w_t} are known sequences of matrices.

The Kalman filtering problem is now to find the linear minimum mean square estimator \hat{\theta}_t for the state θ_t given the observations z_1, ..., z_t. This has been shown to be equal to the conditional mean

\hat{\theta}_t = E\{\theta_t | z_1, ..., z_t\}   (2.219)



In the derivation of the Kalman filter we also require that the estimate is recursive (sequential). The notation introduced here is a compromise between some historical conventions [97], diverse text book notations [133, 142] and our former notation for estimation problems.

As we have seen, we can use two approaches to obtain the linear mean square estimate. The first is to specify a linear conditional mean and find the best linear form. The second approach is to assume that θ_t, v_t and w_t are Gaussian. In this case the conditional mean is again linear. The results of these two approaches are identical. In other words the Kalman filter is the best sequential estimator if the Gaussian assumption is valid and it is the best linear estimator whatever the distributions are. For more general cases of recursive Bayesian estimation see [194].

We employ the second approach. We use the fact that the maximum a posteriori and mean square estimates are identical, as seen in Section 2.13. First we have to look at the expression for the density of θ_t given z_t, ..., z_1. We use the notation Z_t = (z_t, ..., z_1). Then

p(\theta_t | z_t, ..., z_1) = \frac{p(\theta_t, Z_t)}{p(Z_t)}   (2.220)
= \frac{p(\theta_t, z_t, Z_{t-1})}{p(z_t, Z_{t-1})}   (2.221)

For the numerator we can write

p(\theta_t, z_t, Z_{t-1}) = p(z_t | \theta_t, Z_{t-1})\, p(\theta_t, Z_{t-1})   (2.222)
= p(z_t | \theta_t, Z_{t-1})\, p(\theta_t | Z_{t-1})\, p(Z_{t-1})   (2.223)

If θ_t is given, the only random term in (2.218) is v_t, which does not depend on θ_t or Z_{t-1}. Thus

p(z_t | \theta_t, Z_{t-1}) = p(z_t | \theta_t)   (2.224)

and we can write

p(\theta_t | Z_t) = \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})\, p(Z_{t-1})}{p(z_t, Z_{t-1})}   (2.225)
= \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})\, p(Z_{t-1})}{p(z_t|Z_{t-1})\, p(Z_{t-1})}   (2.226)
= \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})}{p(z_t|Z_{t-1})}   (2.227)

Now since each of the densities on the right hand side is Gaussian, we only need to form the means and the covariances of the densities to find the density of θ_t given Z_t.

For p(z_t|θ_t) we obtain

E\{z_t|\theta_t\} = E\{H_t\theta_t + v_t|\theta_t\} = H_t\theta_t   (2.228)
C_{z_t|\theta_t} = E\{(H_t\theta_t + v_t - E\{z_t|\theta_t\})(H_t\theta_t + v_t - E\{z_t|\theta_t\})^T|\theta_t\}   (2.229)
= C_{v_t}   (2.230)



For p(θ_t|Z_{t-1}) we obtain

E\{\theta_t|Z_{t-1}\} = E\{F_{t-1}\theta_{t-1} + G_{t-1}w_{t-1}|Z_{t-1}\}   (2.231)
= F_{t-1}\hat{\theta}_{t-1} \doteq \hat{\theta}_{t|t-1}   (2.232)

\hat{\theta}_{t|t-1} is thus the prediction of θ_t based on \hat{\theta}_{t-1}, where \hat{\theta}_{t-1} = E\{\theta_{t-1}|Z_{t-1}\} is the optimal MS estimate at time t − 1, and

C_{\theta_t|Z_{t-1}} = E\{(\theta_t - \hat{\theta}_{t|t-1})(\theta_t - \hat{\theta}_{t|t-1})^T|Z_{t-1}\}   (2.233)
= E\{\tilde{\theta}_{t|t-1}\tilde{\theta}_{t|t-1}^T|Z_{t-1}\} \doteq C_{\tilde{\theta}_{t|t-1}}   (2.234)

Next we could form the mean and covariance of p(z_t|Z_{t-1}), form the density p(θ_t|Z_t) and calculate the conditional mean E{θ_t|Z_t}, which is the estimator we are looking for. However, to obtain the MAP estimator we only need to maximize the numerator of (2.227), or its logarithm. Now clearly

\log p(\theta_t|Z_t) = \mathrm{const} - (z_t - H_t\theta_t)^T C_{v_t}^{-1} (z_t - H_t\theta_t) - (\theta_t - \hat{\theta}_{t|t-1})^T C_{\tilde{\theta}_{t|t-1}}^{-1} (\theta_t - \hat{\theta}_{t|t-1})   (2.235)

The derivative of this with respect to θ_t equals zero when θ_t = \hat{\theta}_t satisfies

H_t^T C_{v_t}^{-1} (z_t - H_t\hat{\theta}_t) - C_{\tilde{\theta}_{t|t-1}}^{-1} (\hat{\theta}_t - \hat{\theta}_{t|t-1}) = 0   (2.236)

Solving for \hat{\theta}_t gives the MAP estimate

\hat{\theta}_t = (H_t^T C_{v_t}^{-1} H_t + C_{\tilde{\theta}_{t|t-1}}^{-1})^{-1} (C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.237)

This is clearly of the form (2.204), the Bayesian MAP estimate using the last available estimate as the mean of θ_t. The first term on the right hand side of the equation can be expanded using the matrix inversion lemma

\hat{\theta}_t = (C_{\tilde{\theta}_{t|t-1}} - C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1} H_t C_{\tilde{\theta}_{t|t-1}})\,(C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.238)-(2.239)
= (C_{\tilde{\theta}_{t|t-1}} - K_t H_t C_{\tilde{\theta}_{t|t-1}})\,(C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.240)

where the notation

K_t = C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1}   (2.241)

has been made. After some lengthy calculations this can be written in the form [133]

\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t (z_t - H_t \hat{\theta}_{t|t-1})   (2.242)

This is the desired recursive form; the new estimate can be updated from the prediction based on the previous one using the correction term K_t(z_t - H_t\hat{\theta}_{t|t-1}).



The matrix K_t is the so-called gain matrix and the term in parentheses is the error between the actual data value z_t and the prediction H_t\hat{\theta}_{t|t-1}.

Next we form the equation for C_{\tilde{\theta}_{t|t-1}}. First we note that

\tilde{\theta}_{t|t-1} = \theta_t - \hat{\theta}_{t|t-1}   (2.243)
= F_{t-1}\theta_{t-1} + G_{t-1}w_{t-1} - F_{t-1}\hat{\theta}_{t-1}   (2.244)
= F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1}   (2.245)

and for the covariance we can write

C_{\tilde{\theta}_{t|t-1}} = E\{(F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1})(F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1})^T\}   (2.246)
= F_{t-1} C_{\tilde{\theta}_{t-1}} F_{t-1}^T + G_{t-1} C_{w_{t-1}} G_{t-1}^T   (2.247)

since \tilde{\theta}_{t-1} and w_{t-1} are uncorrelated. For the error \tilde{\theta}_t we have

\tilde{\theta}_t = \theta_t - \hat{\theta}_t = \theta_t - (\hat{\theta}_{t|t-1} + K_t(z_t - H_t\hat{\theta}_{t|t-1}))   (2.248)

Inserting z_t = H_t\theta_t + v_t we obtain

\tilde{\theta}_t = \theta_t - (\hat{\theta}_{t|t-1} + K_t(H_t\theta_t + v_t - H_t\hat{\theta}_{t|t-1}))   (2.249)
= \tilde{\theta}_{t|t-1} - K_t(H_t\tilde{\theta}_{t|t-1} + v_t)   (2.250)
= (I - K_tH_t)\tilde{\theta}_{t|t-1} - K_tv_t   (2.251)

and for the error covariance

C_{\tilde{\theta}_t} = (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}\, (I - K_tH_t)^T + K_t C_{v_t} K_t^T   (2.252)
= (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}   (2.253)

since \tilde{\theta}_{t|t-1} = \theta_t - F_{t-1}\hat{\theta}_{t-1} and v_t are uncorrelated. Now equations (2.232), (2.247), (2.241), (2.253) and (2.242) form the Kalman filter. After adding the initializations we can summarize the Kalman filter algorithm

C_{\tilde{\theta}_0} = C_{\theta_0}   (2.254)
\hat{\theta}_0 = E\{\theta_0\}   (2.255)
\hat{\theta}_{t|t-1} = F_{t-1}\hat{\theta}_{t-1}   (2.256)
C_{\tilde{\theta}_{t|t-1}} = F_{t-1} C_{\tilde{\theta}_{t-1}} F_{t-1}^T + G_{t-1} C_{w_{t-1}} G_{t-1}^T   (2.257)
K_t = C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1}   (2.258)
C_{\tilde{\theta}_t} = (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}   (2.259)
\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t(z_t - H_t\hat{\theta}_{t|t-1})   (2.260)
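A minimal sketch (an assumed scalar example, not from the thesis) of the recursion (2.254)–(2.260) for a random-walk state observed in noise:

```python
import numpy as np

# Kalman filter recursion (2.254)-(2.260) for a scalar random-walk state.
rng = np.random.default_rng(5)
T = 200
F = G = H = np.eye(1)                          # F_t, G_t, H_t constant here
C_w, C_v = 0.01 * np.eye(1), 1.0 * np.eye(1)

theta = np.zeros((T, 1)); z = np.zeros((T, 1))
for t in range(1, T):                          # simulate state and measurements
    theta[t] = F @ theta[t-1] + G @ (np.sqrt(C_w) @ rng.standard_normal(1))
    z[t] = H @ theta[t] + np.sqrt(C_v) @ rng.standard_normal(1)

theta_hat = np.zeros((T, 1)); C = np.eye(1)    # initializations (2.254)-(2.255)
for t in range(1, T):
    theta_pred = F @ theta_hat[t-1]                              # (2.256)
    C_pred = F @ C @ F.T + G @ C_w @ G.T                         # (2.257)
    K = C_pred @ H.T @ np.linalg.inv(H @ C_pred @ H.T + C_v)     # (2.258)
    C = (np.eye(1) - K @ H) @ C_pred                             # (2.259)
    theta_hat[t] = theta_pred + K @ (z[t] - H @ theta_pred)      # (2.260)

print(float(np.mean((theta - theta_hat)**2)))  # filtered error, well below the noise variance
```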



2.17 Time-varying linear regression

In this section we discuss a special form of the recursive estimation problem, the time-varying linear regression with scalar measurements z_t ∈ R. The main point in this section is to emphasize that the most common recursive algorithms for time-varying linear regression can be obtained from the Kalman filter with some specific assumptions for the state space equations. Let

\theta_t = \theta_{t-1} + w_t   (2.261)
z_t = \varphi_t^T \theta_t + v_t   (2.262)

so that the parameters obey the random walk model [81]. In the scalar case \varphi_t is a regression vector that can contain any stochastic or deterministic variables. The solution \hat{\theta}_t that minimizes the conditional expectation given past observations is given by the Kalman filter discussed in Section 2.16.

First we make the notations

F_t = I   (2.263)
G_t = I   (2.264)
H_t = \varphi_t^T   (2.265)-(2.266)

in (2.217) and (2.218) and we can write the Kalman filter, equations (2.254–2.260), in the form

\hat{\theta}_{t|t-1} = \hat{\theta}_{t-1}   (2.267)
C_{\tilde{\theta}_{t|t-1}} = C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}}   (2.268)
K_t = C_{\tilde{\theta}_{t|t-1}} \varphi_t (\varphi_t^T C_{\tilde{\theta}_{t|t-1}} \varphi_t + C_{v_t})^{-1}   (2.269)
C_{\tilde{\theta}_t} = (I - K_t\varphi_t^T)\, C_{\tilde{\theta}_{t|t-1}}   (2.270)
\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t|t-1})   (2.271)

We insert (2.267) and (2.268) into (2.269–2.271) and obtain

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.272)
K_t = \frac{(C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t}{\varphi_t^T (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t + C_{v_t}}   (2.273)
C_{\tilde{\theta}_t} = \left( I - \frac{(C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t\varphi_t^T}{\varphi_t^T (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t + C_{v_t}} \right) (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})   (2.274)

If we now denote P_t = C_{\tilde{\theta}_t} + C_{w_t}, the recursive mean square estimate takes the form

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.275)
= \hat{\theta}_{t-1} + K_t \varepsilon_t   (2.276)
K_t = \frac{P_{t-1}\varphi_t}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}}   (2.277)
P_t = \left( I - \frac{P_{t-1}\varphi_t\varphi_t^T}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}} \right) P_{t-1} + C_{w_t}   (2.278)

P_t is then a recursive estimate of the covariance C_{\tilde{\theta}_t} + C_{w_t} and K_t is called the Kalman gain. If we now choose (make the assumption)

C_{v_t} = \lambda_t   (2.279)
C_{w_t} = (\lambda_t^{-1} - 1) \left( I - \frac{P_{t-1}\varphi_t\varphi_t^T}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}} \right) P_{t-1}   (2.280)

then the Kalman filter reduces to the so-called recursive least squares (RLS) algorithm. This leads to the conclusion that the RLS is optimal in the mean square sense if the assumptions (2.279) and (2.280) are valid. Otherwise the RLS only approximates the optimal recursive estimate. If λ_t ≡ λ the method is called the forgetting factor RLS. The forgetting factor RLS is popular e.g. in time series modeling, since its performance can be tuned with one parameter λ. Since λ is usually tuned in RLS near to unity, the implicit assumption is that C_{w_t} in the Kalman filter is "small", corresponding to slow variation of the model parameters.
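An illustrative sketch (an assumed example) of the forgetting factor RLS, i.e. (2.275)–(2.278) with the choices (2.279)–(2.280), tracking a slowly drifting second-order regression:

```python
import numpy as np

# Forgetting factor RLS tracking a slowly drifting regression z_t = phi_t^T theta_t + v_t.
rng = np.random.default_rng(6)
T, lam = 500, 0.98
theta_true = np.array([0.5, -0.2])
phi_hist = np.zeros(2)

theta_hat = np.zeros(2)
P = 100.0 * np.eye(2)                                  # large initial covariance P_0
for t in range(T):
    theta_true += 0.002 * rng.standard_normal(2)       # slow parameter drift
    phi = phi_hist.copy()                              # regressor: delayed measurements
    z = phi @ theta_true + 0.1 * rng.standard_normal()
    eps = z - phi @ theta_hat                          # prediction error (2.282)
    K = P @ phi / (phi @ P @ phi + lam)                # gain (2.277) with C_v = lambda
    theta_hat = theta_hat + K * eps                    # update (2.275)
    P = (np.eye(2) - np.outer(K, phi)) @ P / lam       # RLS covariance update (cf. Table 2.1)
    phi_hist = np.array([z, phi_hist[0]])              # shift delayed measurements

print(theta_hat, theta_true)
```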

With different choices of C_{w_t}, C_{v_t} and P_0 several algorithms of the form

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.281)
\varepsilon_t = z_t - \varphi_t^T \hat{\theta}_{t-1}   (2.282)

can be obtained. The most popular forms are the normalized least mean square (NLMS) algorithm

\hat{\theta}_t = \hat{\theta}_{t-1} + \frac{\mu\varphi_t}{\mu\varphi_t^T\varphi_t + 1}\, \varepsilon_t   (2.283)

and the least mean square (LMS) algorithm

\hat{\theta}_t = \hat{\theta}_{t-1} + \mu\varphi_t\varepsilon_t   (2.284)

where μ is the step size parameter that controls the convergence of the algorithm. The connection of the LMS with the steepest descent method is clearly visible. The connections of the RLS, NLMS and LMS algorithms to Kalman filtering are discussed in detail in [99]. The corresponding Kalman gain K_t and momentary covariance estimate P_t are summarized in Table 2.1. These connections are fundamental when the implicit assumptions of the recursive algorithms are compared.

The time-varying algorithms presented here are all in their generic forms. In many cases the effectiveness of the algorithm can be tuned with different matrix decompositions and scalings during the iteration. A review of adaptive algorithms can be found e.g. in [81]. Another common reference to recursive systems is [116]. The connection of the different computational forms of the adaptive algorithms with the Kalman filter is summarized in [183]. For historical reasons we refer also to the references [229] and [85]. The preprint books [189] and [193] contain many of the papers published on the subject of adaptive and Kalman filtering.

Table 2.1
Kalman gain K_t and covariance estimate P_t for different adaptive algorithms

          K_t                                                           P_t
Kalman    P_{t-1}\varphi_t(\varphi_t^T P_{t-1}\varphi_t + C_{v_t})^{-1}    (I - K_t\varphi_t^T) P_{t-1} + C_{w_t}
RLS       P_{t-1}\varphi_t(\varphi_t^T P_{t-1}\varphi_t + \lambda)^{-1}    \lambda^{-1}(I - K_t\varphi_t^T) P_{t-1}
NLMS      \mu\varphi_t(\mu\varphi_t^T\varphi_t + 1)^{-1}                   \mu I
LMS       \mu\varphi_t                                                     \mu(I - \mu\varphi_{t+1}\varphi_{t+1}^T)^{-1}

2.18 Properties of the MS estimator

The mean square estimate has many important properties. Perhaps the most important property is its optimality among a wide class of estimators. Let the cost function be a function of the estimation error \tilde{\theta} only, that is, C(\theta - \hat{\theta}) = C(\tilde{\theta}). Let C be symmetric about \tilde{\theta} = 0,

C(\tilde{\theta}) = C(-\tilde{\theta})   (2.285)

and convex,

C(\lambda\tilde{\theta}_1 + (1-\lambda)\tilde{\theta}_2) \le \lambda C(\tilde{\theta}_1) + (1-\lambda) C(\tilde{\theta}_2), \quad 0 \le \lambda \le 1   (2.286)

These properties cover a wide range of cost functions, for example the quadratic cost function C_{MS}(\tilde{\theta}) = \tilde{\theta}^T\tilde{\theta} and the absolute error cost function C_{ABS}(\tilde{\theta}) = \sum_i |\tilde{\theta}_i|. It can be shown that the conditional Bayes cost associated with any cost function with the properties (2.285) and (2.286) can never be less than that associated with the mean square cost function C_{MS}(\tilde{\theta}) [192, 133].

Another important property is the so-called orthogonality property. Let ξ = ξ(z) be any function of the data z. Then using (2.22) we can write

E\{(\theta - \hat{\theta}_{MS})\xi^T\} = E_z\{E\{(\theta - \hat{\theta}_{MS})\xi^T | z\}\}   (2.287)

But ξ is a function of z only and thus

E_z\{E\{(\theta - \hat{\theta}_{MS})\xi^T | z\}\} = E_z\{E\{(\theta - \hat{\theta}_{MS}) | z\}\, \xi^T\}   (2.288)
= E_z\{(E\{\theta|z\} - E\{\hat{\theta}_{MS} | z\})\, \xi^T\}   (2.289)
= E_z\{(\hat{\theta}_{MS} - \hat{\theta}_{MS})\xi^T\} = 0   (2.290)



since \hat{\theta}_{MS} is unbiased. The result is the so-called orthogonality principle [192].

It is important to note that in all derivations with an observation model we have assumed that the observation matrix H is deterministic. Thus all optimality results hold only if H does not depend on random variables, e.g. measurements. However, in many important estimation problems H is an explicit function of the measurements z. Such a case is e.g. parametric modeling of time series, which is discussed in Chapter 4. In such cases general optimality criteria can usually be derived only asymptotically, that is, when the number of measurements M increases without bound. This kind of analysis with stochastic H leads to the so-called stochastic approximation methods [115]. These are not discussed in this thesis.


CHAPTER III

Regularization theory

In this chapter we discuss some regularization methods that are applicable to the regularization of linear problems. The main goal is to motivate the so-called principal component regression and subspace regularization approaches. Both are presented in the formalism that is introduced in Chapter 2. The methods are directly applicable to several problems that arise in evoked potential analysis, as discussed in Chapter 6.

Another purpose of this chapter is to emphasize the interpretation of the most common regularization methods as Bayesian estimators. This makes it possible to see the implicit assumptions about the parameters when some specific regularization method is used.

3.1 Introduction

The notion of regularization arises from the area of ill-posed problems. The problem is to find a solution to the originally infinite dimensional problem

z = H\theta   (3.1)

that is ill-posed in Hadamard's sense. Originally Hadamard posed that the problem is well posed if there exists a unique solution that depends continuously on the data [73]. The latter condition means that arbitrarily small perturbations in the data cannot cause arbitrarily large perturbations in the solution.

Hansen [77] has used the notion of a discrete ill-posed problem for problems

z = H\theta, \quad H \in R^{n \times n}   (3.2)

and

\min_\theta \|H\theta - z\|, \quad H \in R^{m \times n}, \ m > n   (3.3)

if the singular values of H decay gradually to zero and the ratio between the largest and the smallest nonzero singular value is large. In this case the solution




can be very sensitive to errors in the data z even though the solutions are, strictly speaking, continuous functions of the data.

The need to obtain stable solutions to ill-posed problems has led to several methods in which the original problem is modified so that the solution of the modified problem is near the solution of the original problem but is less sensitive to errors in the data. These methods are often called regularization methods. In regularization methods information about the desirable solution is used for the modification of the original problem. In this sense the regularization approach is related to the Bayesian approach to estimation.

Several regularization methods can be interpreted as stabilization of the inversion of the matrix H^T H in the least squares solution. Such methods are e.g. Tikhonov regularization and principal component regression. Other approaches to regularization are e.g. the use of truncated iterative methods, such as the conjugate gradient method, for solving linear systems of equations [78]. The truncated singular value decomposition [75, 33] is also commonly used. All regularization methods can be interpreted in the same formalism with the so-called filter factors, see e.g. [182]. General references on regularization methods are e.g. [209, 72] or [56] and especially [201] in connection to Bayes estimation.

3.2 Tikhonov regularization

We discuss here the most popular regularization method, the Tikhonov regularization method [208, 207], for the solution of the least squares problem. We define here the generalized Tikhonov regularized solution \hat{\theta}_\alpha with the equation

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|L_1(H\theta - z)\|^2 + \alpha^2 \|L_2(\theta - \theta^*)\|^2 \right\}   (3.4)

where L_1^T L_1 = W is a positive definite matrix. The regularization parameter α controls the weight given to minimization of the side constraint

\Omega(\theta) = \|L_2(\theta - \theta^*)\|^2   (3.5)

relative to minimization of the weighted least squares index

l_{LS} = \|L_1(H\theta - z)\|^2   (3.6)

The term θ* is the initial (prior) guess for the solution. In more common definitions W = I, see e.g. [77].

The matrix L_2 is typically either the identity matrix I or a discrete approximation D_d of the d'th derivative operator. In this case L_2 is a banded matrix with full row rank. For example the 1st and 2nd difference matrices are

D_1 = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}   (3.7)

and

D_2 = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -2 & 1 \end{pmatrix}   (3.8)

The solution (3.4) can also be written in the form

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \left\| \begin{pmatrix} L_1 H \\ \alpha L_2 \end{pmatrix} \theta - \begin{pmatrix} L_1 z \\ \alpha L_2 \theta^* \end{pmatrix} \right\|^2 \right\}   (3.9)

and making the notations

H' = \begin{pmatrix} H \\ I \end{pmatrix}   (3.10)
z' = \begin{pmatrix} z \\ \theta^* \end{pmatrix}   (3.11)
L' = \begin{pmatrix} L_1 & 0 \\ 0 & \alpha L_2 \end{pmatrix}   (3.12)

the solution is of the form

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|L'H'\theta - L'z'\|^2 \right\}   (3.13)

from which it is easy to see that the formal solution is

\hat{\theta}_\alpha = (H'^T L'^T L' H')^{-1} H'^T L'^T L' z'   (3.14)
= (H^T W H + \alpha^2 L_2^T L_2)^{-1} (H^T W z + \alpha^2 L_2^T L_2 \theta^*)   (3.15)

Sometimes the notion of standard form regularization is used when L_2 = I; otherwise the notion of regularization in general form is used [74]. Tikhonov regularization is also called ridge regression [69] in the statistical literature or rubric graduation [218] in approximation theory. All the methods using difference approximations as the regularizing side constraint can be called smoothness priors methods [67]. This is also the notion used in this thesis.
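An illustrative numpy sketch (an assumed example, not from the thesis) of the generalized Tikhonov solution (3.15) with W = I, the second difference matrix D_2 of (3.8) as L_2 and θ* = 0, applied to an ill-conditioned smoothing model:

```python
import numpy as np

# Tikhonov (smoothness priors) regularization, equation (3.15) with W = I, theta* = 0.
rng = np.random.default_rng(7)
M = p = 60
t = np.linspace(0, 1, p)
theta_true = np.sin(2 * np.pi * t)                  # smooth "true" parameters
H = np.tril(np.ones((M, p))) / p                    # an ill-conditioned (cumulative) model
z = H @ theta_true + 0.01 * rng.standard_normal(M)

D2 = np.zeros((p - 2, p))                           # second difference matrix, (3.8)
for i in range(p - 2):
    D2[i, i:i + 3] = [1.0, -2.0, 1.0]

alpha = 1e-2
theta_ls = np.linalg.solve(H.T @ H, H.T @ z)                            # unregularized LS
theta_tik = np.linalg.solve(H.T @ H + alpha**2 * D2.T @ D2, H.T @ z)    # (3.15)

print(np.linalg.norm(theta_ls - theta_true), np.linalg.norm(theta_tik - theta_true))
```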

It is easy to see that Tikhonov regularization has a direct connection to Bayesian linear mean square estimation. We can compare equation (3.9) to the mean square estimate

\hat{\theta}_{MS} = \arg\min_\theta \left\| L \begin{pmatrix} H \\ I \end{pmatrix} \theta - L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} \right\|^2   (3.16)

where

L^T L = \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix}   (3.17)

Thus generalized Tikhonov regularization is the optimal mean square estimator if v \sim N(0, W^{-1}) and \theta \sim N(\theta^*, \alpha^{-2}(L_2^T L_2)^{-1}).

Sometimes we do not have any particular assumptions about the parameters θ themselves. They may not even have any physical meaning, as in the case of generic basis functions in linear least squares. More often we have beliefs about what the measurements should be. In this case we have to regularize the prediction of the model, that is, we want Hθ to have some desirable properties. The regularized solution for W = I is then

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|H\theta - z\|^2 + \alpha^2 \|L_2(H\theta - H\theta^*)\|^2 \right\}   (3.18)

and the solution for this is clearly

\hat{\theta}_\alpha = (H^T H + \alpha^2 H^T L_2^T L_2 H)^{-1} (H^T z + \alpha^2 H^T L_2^T L_2 H \theta^*)   (3.19)

This is of course of the same form as (3.15), with the regularization matrix L' = L_2 H. However, in the form (3.19) standard regularization matrices are directly applicable.

3.3 Principal component based regularization

In [23] it is emphasized that the notion of regularization has its origin in approximation theory and many of the regularization methods there can be put in the form

\hat{\theta} = G H^T z   (3.20)

This estimator clearly covers the ordinary least squares solution with the choice G = (H^T H)^{-1}. The selection of G can thus be seen as an approximation of the inverse of H^T H. The form (3.20) also includes Tikhonov regularization. The notion of ridge regression has been used for Tikhonov regularization with L = I, and the parameter α is then called the ridge parameter. In the general ridge regression problem we have some criterion, e.g. |θ| = c for some given c, for selecting the parameter α [69].

It is possible to select

G = \sum_{j \in S} \frac{1}{\alpha_j} v_j v_j^T   (3.21)

where v_j and α_j are the eigenvectors and eigenvalues of the matrix H^T H, respectively. S is the index set S ⊂ I, where I = (1, ..., rank(H^T H)). The selection of the index set S can be based on the eigenvalues α_j or on the correlation of the eigenvectors with the data z. In the first approach the goal is to eliminate the large variances in the regression by deleting the components for which α_j is small. This approach is sometimes called the traditional principal component regression [23]. In the second approach it is undesirable to delete components that have large correlations with the dependent variable z, the data. Several criteria for the selection of the principal components in (3.21) have been summarized in [91].

Two other methods based on the use of principal components are the latent root regression and the partial least squares [91]. In latent root regression the principal components are calculated for the p independent variables and the dependent variable. The principal components corresponding to the smallest eigenvalues are then examined and the eigenvector is deleted if the PC is small also for the dependent variable z. It then has only little predictive value for the data. In partial least squares the basic idea is to start with the principal components of the explanatory variables and then rotate them so that they become more highly correlated with the dependent variable z. These rotated principal components are then used to predict z.

When used in the form (3.21) the principal component regression is quite ad hoc. The eigendecomposition is used only to regularize the inversion of the matrix H^T H. As noted in [23], the principal component regression as an approach has only little to do with the prior assumptions about the problem. In the following we form an alternative form of the regression, where the connection to the problem is more clearly visible.

First we assume the ordinary linear observation model

z = H\theta + v   (3.22)

for which the ordinary least squares solution is

\hat{\theta} = (H^T H)^{-1} H^T z   (3.23)

and for the prediction we have

\hat{z} = H\hat{\theta}   (3.24)

Next we recall that the singular value decomposition

H = USV^T   (3.25)

gives the eigenvectors of the matrix H^T H in the columns of the matrix V. Thus we can make the decomposition

(H^T H)^{-1} = \sum_{j \in S} \frac{1}{\lambda_j} v_j v_j^T + \sum_{j \in \bar{S}} \frac{1}{\lambda_j} v_j v_j^T   (3.26)
= V_1 \Lambda_1^{-1} V_1^T + V_2 \Lambda_2^{-1} V_2^T   (3.27)

where \bar{S} = I \setminus S. In principal component regression we use only the first term, and thus the prediction can be written in the form

\hat{z} = H (V_1 \Lambda_1^{-1} V_1^T) H^T z   (3.28)
= USV^T (V_1 \Lambda_1^{-1} V_1^T) V S U^T z   (3.29)
= (U_1 S_1 V_1^T + U_2 S_2 V_2^T)(V_1 \Lambda_1^{-1} V_1^T)(V_1 S_1 U_1^T + V_2 S_2 U_2^T) z   (3.30)
= U_1 S_1 V_1^T V_1 \Lambda_1^{-1} V_1^T V_1 S_1 U_1^T z + U_1 S_1 V_1^T V_1 \Lambda_1^{-1} V_1^T V_2 S_2 U_2^T z   (3.31)
+ U_2 S_2 V_2^T V_1 \Lambda_1^{-1} V_1^T V_1 S_1 U_1^T z + U_2 S_2 V_2^T V_1 \Lambda_1^{-1} V_1^T V_2 S_2 U_2^T z   (3.32)



Now we recall that by the definition of the SVD the matrices U and V are unitary. This means that V_1^T V_1 = I and V_1^T V_2 = 0. We also note that S_1 \Lambda_1^{-1} S_1 = I. The equation of the prediction then reduces to

\hat{z} = U_1 U_1^T z   (3.33)
= U_1 (U_1^T U_1)^{-1} U_1^T z   (3.34)
= U_1 \hat{\theta}   (3.35)

where \hat{\theta} is the least squares solution of the modified problem

z = U_1 \theta + v   (3.36)

Thus the principal component regression can be seen as modifying the observation model so that the observations are presented in the linear subspace S spanned by the columns of U_1, which are eigenvectors of the matrix HH^T, the covariance matrix of the basis vectors of the original problem.
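An illustrative sketch (an assumed example) of principal component regression in the modified-model form (3.33)–(3.36), i.e. projection of z onto the leading left singular vectors U_1 of H:

```python
import numpy as np

# PCR prediction z_hat = U_1 U_1^T z, (3.33), with k retained components.
rng = np.random.default_rng(8)
M, p, k = 40, 10, 4
H = rng.standard_normal((M, p)) @ np.diag(np.logspace(0, -4, p))   # ill-conditioned H
z = H @ rng.standard_normal(p) + 0.05 * rng.standard_normal(M)

U, s, Vt = np.linalg.svd(H, full_matrices=False)
U1 = U[:, :k]                               # leading left singular vectors

z_hat_pcr = U1 @ (U1.T @ z)                 # (3.33): prediction in the subspace R(U_1)
theta_pcr = Vt[:k].T @ ((U1.T @ z) / s[:k]) # corresponding parameter estimate

print(np.linalg.norm(z - z_hat_pcr))        # residual outside the retained subspace
```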

A modification of the principal component regression is proposed in [211, 212, 96]. First a set of expectable noiseless measurements Hθ is generated. These simulations are interpreted as samples from the prior density. The principal component regression is then used to reduce the dimensionality of the problem. The method is called the basis constraint method.

3.4 Subspace regularization

In some cases we can make the assumption that the parameter vector is close to some linear subspace S = R(K_S), where K_S is a matrix containing an orthonormal basis of the subspace S in its columns. The projection of θ onto S is then K_S K_S^T \theta, and the distance of θ from S is

\Omega(\theta) = \|K_S K_S^T \theta - \theta\| = \|(K_S K_S^T - I)\theta\|   (3.37)

We can now use this quantity as a side constraint in the regularized least squares solution and we obtain

\hat{\theta} = \arg\min_\theta \left\{ \|L_1(z - H\theta)\|^2 + \alpha^2 \|(I - K_S K_S^T)\theta\|^2 \right\}   (3.38)

We call this solution the subspace regularized solution. Since the matrix L_2 = (I - K_S K_S^T) is idempotent and symmetric, we can write L_2 = L_2 L_2^T = (I - K_S K_S^T) and the solution is

\hat{\theta} = (H^T W H + \alpha^2 (I - K_S K_S^T))^{-1} H^T W z   (3.39)

The matrix (I − K S K T S ) is the orthogonal projector onto the null space of the<br />

operator K S <strong>and</strong> the subspace regularization can be seen as modifying the original<br />

least squares problem so that the solution is closer to the null space of K S .<br />

The subspace regularized solution reduces to the principal component regression


3.5. Selection of the regularization parameters 55<br />

when α goes to infinity. This has been shown in [213]. The method is originally<br />

introduced in [214].<br />

It is readily seen that the subspace regularization equals <strong>for</strong>mally to mean<br />

square estimation with the prior θ ∼ N(0,α −2 (I − K S KS T)−1 ) if v ∼ N(0,W −1 ).<br />

But the matrix (I − K S KS T ) is generally not of full rank <strong>and</strong> the inverse does not<br />

exist. This means that the prior is improper <strong>and</strong> the parameters can have infinite<br />

variance in some directions. In these directions the prior assumptions does not<br />

contain any in<strong>for</strong>mation about the distribution of the parameters. Thus we can<br />

call this kind of prior partially nonin<strong>for</strong>mative.<br />

Also in this case the prior in<strong>for</strong>mation can be stated in terms of the noiseless<br />

measurements Hθ. If the requirement is that Hθ is close to the subspace S ′ , the<br />

solution is<br />

ˆθ = (H T WH + α 2 H T (I − K S ′KS T ′)H)−1 H T Wz (3.40)<br />

3.5 Selection of the regularization parameters<br />

Selection of the regularization parameters is a fundamental problem in regularization.<br />

In the inverse problem literature the main attention has been paid to the<br />

optimal selection of the parameter α when the <strong>for</strong>m of the observation model is<br />

known <strong>and</strong> the regularization matrix is fixed. However, in real estimation problems<br />

the selection of the observation model can also be interpreted as an aspect in<br />

regularization. For example in [218] the number p of the basis functions in observation<br />

model is treated as regularization parameter. In [23] it is argued that from<br />

the <strong>Bayesian</strong> point of view the selection of the structure of the regularizing prior<br />

can be a more fundamental problem than the selection of the weight α. However,<br />

some practical tools have been developed <strong>for</strong> the selection of the regularization<br />

parameters. For a more detailed discussion on the selection of the regularization<br />

parameters see e.g. [77] <strong>and</strong> the references therein.<br />

The <strong>methods</strong> <strong>for</strong> the selection of the regularization parameters are classically<br />

divided into two classes depending on whether or not the norm ‖v‖ of the noise<br />

term in the observation model is known. The discrepancy principle of Morozov<br />

is an example of a method that relies on the knowledge of ‖v‖. The idea in<br />

discrepancy principle is to use the regularization parameters so that the index<br />

f MDP (α) =<br />

∥<br />

∥z − Hˆθ α<br />

∥ ∥∥<br />

2<br />

− ‖v‖<br />

2<br />

(3.41)<br />

is zero. This is a nonlinear equation <strong>for</strong> α <strong>and</strong> it can be solved iteratively. An<br />

improved modification of this is Greferer–Raus method [72].<br />

In the so-called posterior <strong>methods</strong> all the necessary in<strong>for</strong>mation <strong>for</strong> selection of<br />

the regularization parameters are extracted from the data. The quasi-optimality<br />

method is an example of a method in which no in<strong>for</strong>mation about ‖v‖ is needed<br />

In quasi-optimality method the parameter α is chosen so that the norm<br />

f QO (α) =<br />

∥ α d ∥ ∥∥∥<br />

dα ˆθ<br />

2<br />

α<br />

(3.42)


56 3. <strong>Regularization</strong> theory<br />

has a local minimum with α ≠ 0 [74]. Another method of this class is the socalled<br />

generalized cross-validation (GCV) that is based on the “leave-one-out”<br />

type procedure [69, 218]. The solution can be written as minimizer of the index<br />

∥<br />

∥Hˆθ α − z∥ 2<br />

f GCV (α) =<br />

(trace (I − HK)) 2 (3.43)<br />

where ˆθ α = Kz [77].<br />

In the L-curve method the choice of the regularization parameter is based<br />

on the graph of the penalty term ‖L 2ˆθα ‖ versus the residual norm ‖z − H ˆθ α ‖.<br />

When the graph is plotted in log-log scale, the L-curve will sometimes have a<br />

characteristic L-shaped apperance with a “corner” separating a steep part from<br />

the flat part of the graph. The value of α corresponding to this corner is selected<br />

<strong>for</strong> the regularizing parameter. The L-curve criterion is a fully heuristic approach<br />

although it has some asymptotic optimality properties [76]. The use of L-curve<br />

criterion has been criticized e.g. in [57].<br />

The <strong>methods</strong> <strong>for</strong> the selection of the regularization parameters are often proposed<br />

<strong>for</strong> the ridge class of <strong>methods</strong>. That is, <strong>for</strong> the Tikhonov regularization in<br />

st<strong>and</strong>ard <strong>for</strong>m. Also the analysis of the properties are usually carried <strong>for</strong> ridge estimators.<br />

Even in this simple case the theory justifying the use of e.g. generalized<br />

cross-validation is an asymptotic one as stressed in [218].<br />

Several different criteria <strong>for</strong> the selection of the regularization parameters are<br />

discussed in [23] where it is also emphasized that the selection can also be based on<br />

some approximative <strong>Bayesian</strong> <strong>for</strong>mulation. It is also stressed in [23] that nowadays<br />

with good numerical tools the fully <strong>Bayesian</strong> implementation of the problem would<br />

be straight<strong>for</strong>ward.


CHAPTER<br />

IV<br />

Time series models<br />

In this chapter some topics in time series analysis are discussed. The problem<br />

of time-series modeling is <strong>for</strong>mulated <strong>and</strong> the connection between time-dependent<br />

modeling <strong>and</strong> linear regression is emphasized. After that the special properties<br />

of the LMS algorithm are discussed <strong>and</strong> the <strong>for</strong>m of the LMS is compared with<br />

<strong>Bayesian</strong> MAP estimation <strong>and</strong> subspace regularization. These results are then<br />

used in Chapter 6 where the use of adaptive algorithms <strong>for</strong> the estimation of the<br />

evoked potentials is analyzed.<br />

Another topic of this chapter is optimal filtering. The so-called Wiener filtering<br />

scheme is discussed <strong>and</strong> the connection between Wiener filtering <strong>and</strong> mean<br />

square estimation is especially emphasized. The time-dependent Wiener filter is<br />

defined <strong>and</strong> connected to more traditional <strong>for</strong>mulations of Wiener filtering. These<br />

<strong>for</strong>mulations are fundamental in Chapter 6 where the corresponding <strong>methods</strong> <strong>for</strong><br />

evoked potential analysis are reviewed.<br />

4.1 Stochastic processes<br />

A scalar valued r<strong>and</strong>om (or stochastic) process x(t) is a set {x(t) ∈ R : t ∈ T } of<br />

r<strong>and</strong>om variables defined in the same probability space [142]. T is said to be the<br />

parameter set of process. When T is a finite or countably infinite set, we say that<br />

the process is a discrete parameter set process. When the parameter t denotes<br />

the time, the process is said to be a discrete-time process, a time series. Only<br />

discrete-time processes are discussed in this thesis.<br />

The mean of the r<strong>and</strong>om processes x(t) is a function of time<br />

η x (t) = E {x(t)} =<br />

∫ ∞<br />

−∞<br />

<strong>and</strong> covariance is a function of two time variables t <strong>and</strong> s<br />

x(t)p(x(t))dx(t) (4.1)<br />

C x (t,s) = E {(x(t) − η x (t))(x(s) − η x (s))} (4.2)<br />

= E {x(t)x(s)} − η x (t)η x (s) (4.3)<br />

57


58 4. Time series models<br />

If the mean is time-invariant<br />

= R x (t,s) − η x (t)η x (s) (4.4)<br />

E {x(t)} = η (4.5)<br />

we say that the process is first-order stationary. If in addition the autocorrelation<br />

R x (t,s) depends only on the time difference τ = s − t<br />

R x (t,s) = R x (τ) (4.6)<br />

we say, that x(t) is second-order stationary. The notion of wide sense stationarity<br />

is also used <strong>for</strong> second order stationarity. Discrete white noise process e(t) is<br />

defined to be a sequence of mutually independent zero mean r<strong>and</strong>om variables<br />

[191].<br />

A widely used notion in signal processing literature is the signal-to-noise ratio.<br />

For a stationary stochastic signal s(t) with additive noise<br />

the signal-to-noise ratio can be defined as [187]<br />

SNR = E { s 2 (t) }<br />

x(t) = s(t) + v(t) (4.7)<br />

E {v 2 (t)}<br />

For deterministic s(t) the ratio is sometimes defined as time-dependent function<br />

SNR(t) =<br />

|s(t)|<br />

√<br />

E {v2 (t)}<br />

(4.8)<br />

(4.9)<br />

as in [153].<br />

Let x(t) be a wide sense stationary process with autocorrelation R x (τ). Then<br />

its power spectrum is defined as<br />

S x (ω) =<br />

∞∑<br />

τ=−∞<br />

R x (τ)e −jτω (4.10)<br />

The spectrum is thus the discrete Fourier trans<strong>for</strong>m of the autocorrelation of the<br />

x(t).<br />

Some classes of discrete stochastic processes can be expressed as outputs of<br />

linear time-invariant systems with white noise input. We call the process z(t)<br />

autoregressive moving average process or ARMA(p,q) process if it can be expressed<br />

as<br />

p∑<br />

q∑<br />

z(t) = a j z(t − j) + b k e(t − k) (4.11)<br />

j=1<br />

where e(t) is a white noise process. Using operator notation, we can write this in<br />

the <strong>for</strong>m<br />

A(q)z(t) = (1 + B(q)) e(t) (4.12)<br />

k=0


4.2. Stationary time series models 59<br />

where<br />

A(q) = 1 −<br />

B(q) =<br />

p∑<br />

a j q −j (4.13)<br />

j=1<br />

q∑<br />

b k q −k (4.14)<br />

k=1<br />

<strong>and</strong> q −1 is the time delay operator q −1 z(t) = z(t − 1). A(q) <strong>and</strong> B(q) are thus<br />

operator polynomials. If we have B(q) ≡ 0 the process (4.11) is called autoregressive,<br />

AR process. The estimation of the coefficients of the polynomials A(q) <strong>and</strong><br />

B(q) based on measurements z(t), is called rational modeling.<br />

Using the results of the linear filtering theory it can be shown that the power<br />

spectrum of ARMA(p,q) process is [167]<br />

S(ω) = ∣<br />

∑ q 1 +<br />

σe<br />

2 k=1 b ke −iωk∣ ∣ 2<br />

∣<br />

∣1 − ∑ ∣<br />

p ∣∣<br />

j=1 a je −iωj 2<br />

(4.15)<br />

= ∣ 1 + B(e<br />

σe<br />

2 ) ∣ 2<br />

|A(e −iω )| 2 (4.16)<br />

4.2 Stationary time series models<br />

An approach to time series modeling is to model the measured signal as output<br />

of linear system. If the signal is stationary, this system can be time-invariant.<br />

For example if we want to model the signal z(t) with a p’th order AR model, the<br />

observation model is of the <strong>for</strong>m<br />

z(t) =<br />

p∑<br />

a k z(t − k) + e(t) t = p + 1,...,N (4.17)<br />

k=1<br />

We have emphasized the scalar nature of the observation z(t) ∈ R here using the<br />

time variable in parenthesis rather than in subscript. The measurements can be<br />

written in matrix <strong>for</strong>m<br />

z = Hθ + e (4.18)<br />

where θ = (a 1 ,...,a p ) T <strong>and</strong><br />

H =<br />

⎛<br />

⎜<br />

⎝<br />

z(p) · · · z(1)<br />

.<br />

.<br />

z(N − 1) · · · z(N − p)<br />

⎞<br />

⎟<br />

⎠ (4.19)<br />

z = (z(p + 1),...,z(N)) T (4.20)<br />

e = (e(p + 1),...,e(N)) T (4.21)


60 4. Time series models<br />

The least squares solution <strong>for</strong> the AR parameters is then<br />

ˆθ LS = (H T H) −1 H T z (4.22)<br />

ˆθ LS is the solution that minimizes the output least squares error norm<br />

ˆθ LS = arg min<br />

θ<br />

‖e‖ 2 (4.23)<br />

= arg min<br />

θ<br />

e T e (4.24)<br />

= arg min<br />

θ<br />

‖z − Hθ‖ 2 (4.25)<br />

The least squares estimate of the AR model parameters is thus a linear function<br />

of data. This is not the case with ARMA modeling. In ARMA modeling the<br />

observation model can be written in <strong>for</strong>m<br />

z(t) =<br />

p∑<br />

a j z(t − j) +<br />

j=1<br />

q∑<br />

b k e(t − k) + e(t) (4.26)<br />

The LS solution <strong>for</strong> parameters a <strong>and</strong> b is then the one that minimizes the norm<br />

of the error e. This minimization cannot be written in <strong>for</strong>m<br />

k=1<br />

ˆθ LS = arg min<br />

θ<br />

‖z − Hθ‖ 2 (4.27)<br />

where H does not depend on the parameters to be estimated. There<strong>for</strong>e the<br />

ARMA modeling of time series is a nonlinear optimization problem. For different<br />

<strong>methods</strong> <strong>for</strong> solving ARMA(p,q) model in case of stationary signals see e.g. [37].<br />

General references <strong>for</strong> estimation of stationary models are e.g. [125, 34, 103].<br />

4.3 Time-dependent time series models<br />

If the signal to be modeled is nonstationary it cannot be modeled as output of a<br />

time-invariant system. It is natural to assume in this case that the system has<br />

time-varying parameters. For example the time-varying ARMA model can be<br />

written in <strong>for</strong>m<br />

If we denote<br />

z(t) =<br />

p∑<br />

a j (t)z(t − j) +<br />

j=1<br />

q∑<br />

b k (t)e(t − k) + e(t) (4.28)<br />

k=1<br />

θ t = (a 1 (t),...,a p (t),b 1 (t),...,b q (t)) T (4.29)<br />

ϕ t = (z(t − 1),...,z(t − p),e(t − 1),...,e(t − q)) T (4.30)<br />

we can write the equation in the <strong>for</strong>m<br />

z(t) = ϕ T t θ t + e(t) (4.31)


4.3. Time-dependent time series models 61<br />

This is exactly of the <strong>for</strong>m of linear regression (2.262). The parameters θ t can now<br />

be estimated iteratively <strong>for</strong> example with the RLS algorithm.<br />

ˆϕ t = (z(t − 1),...,z(t − p),ε(t − 1),...,ε(t − q)) T (4.32)<br />

ε(t) = z(t) − ˆϕ T t ˆθ t−1 (4.33)<br />

K t = P t−1<br />

ˆϕ t<br />

λ + ˆϕ T t P t−1 ˆϕ t<br />

(4.34)<br />

ˆθ t = ˆθ t−1 + ε(t)K t (4.35)<br />

P t = λ −1 (I − K t ˆϕ T t )P t−1 (4.36)<br />

where<br />

ˆθ t = (â 1 (t),...,â p (t),ˆb 1 (t),...,ˆb q (t)) T (4.37)<br />

Note that the unknown process e(t) is here estimated with the prediction error<br />

process ε(t) in every step of the iteration. We use there<strong>for</strong>e the notation ˆϕ t instead<br />

of ϕ t .<br />

The application to AR <strong>and</strong> MA models is straight<strong>for</strong>ward. Selection of regressor<br />

<strong>and</strong> parameter vector structure <strong>for</strong> different model structures is summarized<br />

in Table 4.1.<br />

Other recursive estimators discussed in Section 2.17 are applied to timevarying<br />

time-series modeling analogously. The recursion<br />

ˆθ t = ˆθ t−1 + K t (z(t) − ˆϕ T t ˆθ t−1 ) (4.38)<br />

in combination of tables 4.1 <strong>and</strong> 2.1 summarizes the different models <strong>and</strong> algorithms.<br />

Table 4.1<br />

Regressor vector ˆϕ t <strong>and</strong> parameter vector ˆθ <strong>for</strong> different time series model structures<br />

AR<br />

ARMA<br />

MA<br />

AR<br />

ARMA<br />

MA<br />

ˆϕ T t<br />

(z(t − 1),...,z(t − p))<br />

(z(t − 1),...,z(t − p),ε(t − 1),...,ε(t − q))<br />

(ε(t − 1),...,ε(t − q))<br />

ˆθ T<br />

(â 1 (t),...,â p (t))<br />

(â 1 (t),...,â p (t),ˆb 1 (t),...,ˆb q (t))<br />

(ˆb 1 (t),...,ˆb q (t))<br />

The <strong>for</strong>m of the linear system can be different from the linear regression discussed<br />

here. AR, ARMA <strong>and</strong> MA models are all so-called transversal models <strong>for</strong>


62 4. Time series models<br />

time series. Another possibility is to use so called lattice structures [124]. Also <strong>for</strong><br />

these it is possible to <strong>for</strong>m recursive <strong>for</strong>ms [61]. Since the operator polynomials<br />

in transversal models can be parametrized also with the roots of the polynomials,<br />

yet another possibility is to use so-called root tracking <strong>methods</strong> <strong>for</strong> time varying<br />

modeling of the time series [99].<br />

4.4 Adaptive filtering<br />

In Section 4.3 the regressor vector contained previous terms of the measurements<br />

or the prediction error values. It is also possible that the regressor vector contains<br />

samples from another process, the reference signal. Methods that use this kind of<br />

model are referred in [81] to interference cancelling <strong>methods</strong>. The most famous<br />

applications in this class of <strong>methods</strong> are perhaps the adaptive noise cancelling<br />

applications [228]. Originally the idea in adaptive noise cancelling is to use measurements<br />

that contain signal <strong>and</strong> noise as the primary input z(t) = s(t) + n 0 (t).<br />

The noise is assumed to be uncorrelated with the signal. The regressor vector<br />

is called the reference input <strong>and</strong> it contains a signal n 1 (t) that is correlated with<br />

the noise n 0 (t) in the primary input. With this observation model some adaptive<br />

linear regression algorithm, typically LMS, is then used. Since the regressor<br />

now contains terms that are correlated with the noise term in the primary input,<br />

the term ˆϕ T t ˆθ t−1 models this noise n 0 (t). In this case the prediction error term<br />

ε(t) = z(t) − ˆϕ T t ˆθ t−1 thus approximates the underlying signal s(t).<br />

Adaptive noise cancellation can thus be seen as time varying linear regression.<br />

The primary input is modeled as linear combination of delayed reference inputs<br />

with time varying parameters. The error norm is minimized iteratively with an<br />

adaptive algorithm. For example, if the LMS algorithm is used, the corresponding<br />

iterative algorithm is the steepest descent algorithm. The measurement z(t) <strong>and</strong><br />

the regression model ϕ t are also time-dependent. In every step of the minimization<br />

new data <strong>and</strong> regressor values are received.<br />

Another important property of adaptive algorithms is their connection with<br />

regularization <strong>methods</strong>. We recall first, that the recursive mean square estimate<br />

is in fact a <strong>Bayesian</strong> MAP estimate using the previous error covariance as the<br />

estimate of the prior covariance of the parameters. This can be seen from equation<br />

(2.237). Also the implicit assumption in the LMS algorithm is that the covariance<br />

is of the <strong>for</strong>m<br />

P t = µ(I − µϕ t+1 ϕ T t+1) −1 (4.39)<br />

This can be compared to the assumptions on the prior in<strong>for</strong>mation in subspace<br />

regularization which is discussed in Section 3.4. It was emphasized that the prior<br />

covariance of the parameters is (<strong>for</strong>mally)<br />

C θ = α −2 (I − K S K T S ) −1 (4.40)<br />

The LMS algorithm thus drives the solution to be close to the rank-one subspace<br />

that is spanned by the vector ϕ t+1 . Thus ϕ t+1 serves as prior in<strong>for</strong>mation <strong>for</strong><br />

recursive estimation when LMS is used. The regularizing subspace is clearly timevarying.<br />

The <strong>for</strong>m (4.39) is fundamental when analyzing the prior assumptions


4.5. Wiener filtering as estimation problem 63<br />

made in the LMS algorithm. Note that in the NLMS algorithm the implicit assumption<br />

<strong>for</strong> the error covariance is<br />

P t = µI (4.41)<br />

<strong>and</strong> the method has thus close relationship with Tikhonov regularization that was<br />

discussed in Section 3.2. In Tikhonov regularization we have<br />

C θ = α −2 I (4.42)<br />

4.5 Wiener filtering as estimation problem<br />

Let s be a vector containing values s(t) sampled from a time-varying quantity. If<br />

this quantity is r<strong>and</strong>om, s is a sample from a stochastic process. In the signal<br />

processing literature a central problem is the estimation of s in presence of noise<br />

v from observations z. When the solution minimizes the mean square error, the<br />

estimation procedure is often called Wiener filtering [152]. The <strong>for</strong>m of the optimal<br />

estimator depends on the observation model <strong>and</strong> on the assumptions about the<br />

r<strong>and</strong>om signals s <strong>and</strong> v, that is, the joint density p(s,v).<br />

In many cases we require that the filter is a linear operator. In the Wiener<br />

filtering problem we thus seek the estimator of the <strong>for</strong>m<br />

ŝ = Kz (4.43)<br />

that minimizes the mean square error. By orthogonality principle we know that<br />

<strong>for</strong> mean square estimator<br />

From this we obtain <strong>for</strong> K<br />

E { (s − ŝ)z T } = E { (s − Kz)z T } (4.44)<br />

= E { sz T } − KE { zz T } (4.45)<br />

= R sz − KR z = 0 (4.46)<br />

K = R sz R −1<br />

z (4.47)<br />

We call this the time-varying Wiener filter. We use the notion of time-varying<br />

since we have not made any assumptions about the distribution of s. The solution<br />

(4.47) is thus the optimal linear estimator in mean square sense whatever the<br />

density of s is. The optimality thus holds also <strong>for</strong> nonstationary signals. The only<br />

assumptions made are that the filter is linear <strong>and</strong> yields the mean square criterion.<br />

If s <strong>and</strong> z would assumed to be zero mean, the <strong>for</strong>m of the estimator would yield<br />

the <strong>for</strong>m of (2.90).<br />

In the original work of Wiener [230] s <strong>and</strong> v were assumed to be stationary.<br />

Comparing the definitions in Sections 2.2 <strong>and</strong> 4.1 it can be seen that the correlation<br />

of the r<strong>and</strong>om vector <strong>and</strong> the autocorrelation of the stationary stochastic process<br />

have a special connection. If we interpret a stationary r<strong>and</strong>om process as a doubly<br />

infinite r<strong>and</strong>om vector [142], the autocorrelation matrix of such a process is an<br />

infinite-dimensional Toeplitz matrix. Thus K is also infinite-dimensional <strong>and</strong> it


64 4. Time series models<br />

is easy to show that in the case of stationary processes, the optimal Wiener filter<br />

can be expressed as the convolution<br />

s(t) =<br />

∞∑<br />

i=−∞<br />

k(t − i)z(i) (4.48)<br />

where k(t) is the impulse response of a linear time-invariant filter. The spectrum<br />

of the filter is<br />

K(ω) = S sz(ω)<br />

(4.49)<br />

S z (ω)<br />

where S sz (ω) <strong>and</strong> S z (ω) are the discrete Fourier trans<strong>for</strong>ms of any row of R sz <strong>and</strong><br />

R z respectively. This is often called the noncausal Wiener filter [187] since the<br />

summation in (4.48) is from −∞ to ∞.<br />

If we assume the additive noise model<br />

<strong>and</strong> that s <strong>and</strong> v are independent, we have<br />

<strong>and</strong><br />

z = s + v (4.50)<br />

R z = E { zz T } = E { (s + v)(s + v) T } (4.51)<br />

= E { ss T } + E { sv T } + E { vs T } + E { vv T } (4.52)<br />

= R s + R v (4.53)<br />

R sz = E { sz T } = E { s(s + v) T } (4.54)<br />

= E { ss T } + E { sv T } (4.55)<br />

= R s (4.56)<br />

Using again the interpretation of stochastic processes as doubly infinite r<strong>and</strong>om<br />

vectors we conclude that with the additive noise model the spectrum of the Wiener<br />

filter is<br />

S s (ω)<br />

K(ω) =<br />

(4.57)<br />

S s (ω) + S v (ω)<br />

This is also the original <strong>for</strong>m presented by Wiener [230].<br />

If we further assume, that s can be presented as linear combination of some<br />

basis vectors, we can write<br />

s = Hθ (4.58)<br />

If we also assume that v <strong>and</strong> θ are zero mean, we can write<br />

R z = E { Hθθ T H T } + C v (4.59)<br />

= HC θ H T + C v (4.60)<br />

<strong>and</strong><br />

R sz = HC θ H T (4.61)


4.5. Wiener filtering as estimation problem 65<br />

the Wiener filter can be written in <strong>for</strong>m<br />

Using the matrix inversion lemma we can write<br />

ŝ = HC θ H T (HC θ H T + C v ) −1 (4.62)<br />

ŝ = H(H T C −1<br />

v<br />

H + C −1<br />

θ<br />

) −1 H T Cv −1 z (4.63)<br />

= Hˆθ MS (4.64)<br />

The use of the notion Wiener filtering is diverse. In some cases it is used <strong>for</strong> the<br />

best (in mean square sense) causal estimator <strong>for</strong> the underlying signal with some<br />

specific structure of the filter, in some cases the notion simply means the mean<br />

square estimator. For the theory of Wiener filtering see e.g. [152, 167, 153, 187].


CHAPTER<br />

Simulation of evoked potentials<br />

V<br />

Simulation is an important part of the evaluation of any estimation method. When<br />

done properly, all prior in<strong>for</strong>mation about the problem should be implemented into<br />

the simulation procedure. With simulations one can compare the estimates to the<br />

true parameters. This can never be done with real measurements. This holds<br />

also <strong>for</strong> the evaluation of estimation <strong>methods</strong> in evoked potential analysis. Most<br />

commonly deterministic peaks with additive noise are used in simulations [2, 105].<br />

Other models <strong>for</strong> evoked potentials that are used are e.g. sinusoidal [29] or damped<br />

sinusoidal models [210, 63]. Polynomials [132] <strong>and</strong> splines [83] have also been used.<br />

In this chapter we introduce two <strong>methods</strong> <strong>for</strong> constructing realistic simulations<br />

of evoked potentials. The first is based on the r<strong>and</strong>om occurrence of the peaks<br />

with preselected shapes. The shape parameters of the peaks are extracted using<br />

non-linear fitting of Gaussian basis functions to the average wave<strong>for</strong>m. The peak<br />

locations <strong>and</strong> amplitudes are then r<strong>and</strong>omly selected using some joint density. We<br />

call this method component based simulation of evoked potentials. The other simulation<br />

method proposed here is based on interpretation of the evoked potentials<br />

as r<strong>and</strong>om vectors. The second order statistics of a set of realistic measurements is<br />

first extracted. The covariance is then approximated using the truncated eigendecomposition<br />

of the covariance matrix. Using this in<strong>for</strong>mation the simulations are<br />

then drawn as samples of this approximating jointly Gaussian distribution. We<br />

call this method principal component based simulation of evoked potentials. The<br />

simulation of the background EEG is also briefly discussed.<br />

5.1 Simulation of the background EEG<br />

In the simulation of the evoked potentials one important step is a simulation of the<br />

background EEG. If the properties of the background simulation does not fulfill the<br />

assumptions of the real EEG background, the evaluation of the evoked potential<br />

estimation <strong>methods</strong> can not give realistic results. The most common assumption<br />

about the background is that it is a r<strong>and</strong>om process additive to the evoked potential.<br />

If the background is assumed to be stationary, some time-independent<br />

66


5.2. Component based simulation of evoked potentials 67<br />

parametric model can be used in simulation. The model can be based on real<br />

measurements e.g. on pred-stimulus data The AR-model approach is used e.g. in<br />

[240].<br />

Nonstationary background can be generated e.g. with time-varying AR models.<br />

The stationary models can be <strong>for</strong>med <strong>for</strong> measurements be<strong>for</strong>e the stimulus<br />

<strong>and</strong> after the late potentials in the evoked potential measurement. Smoothly timevarying<br />

functions can then be selected to model the structure of the time evolution<br />

of the AR parameters. Nonstationary background simulations are then obtained<br />

as output of this time-varying system with white noise input. The simulation of<br />

nonstationary EEG is discussed in [95].<br />

5.2 Component based simulation of evoked potentials<br />

In some cases the goal in the estimation is to get in<strong>for</strong>mation about the location<br />

of the peaks in the evoked potential. In this case we need a simulation method in<br />

which the peak locations can be used as parameters. With this kind of a method<br />

the peak locations can be based e.g. on some pre-selected r<strong>and</strong>om distribution,<br />

such as uni<strong>for</strong>m or Gaussian distribution. With this kind of a method it is also<br />

easy to generate simulations in which the parameters of the peaks are correlated.<br />

This is done by using multidimensional joint densities in simulation. The shape of<br />

the peaks that are used in simulation can be based on real data.<br />

We propose here a method in which the peak distributions are based on real<br />

measurements. The peak amplitudes <strong>and</strong> locations can be generated r<strong>and</strong>omly<br />

with interactively selected limits. The proposed method is as follows:<br />

1. Measure a set of evoked potentials. The delay between the repetitions has to<br />

be long enough so that the background EEG samples can also be measured.<br />

A typical set of measurements is shown in Fig. 5.1.<br />

2. Measure a set of background EEG samples. Pre-stimulus data can be used<br />

<strong>for</strong> this purpose. If the background EEG is assumed to be stationary it is<br />

possible to use an AR model in background simulation. Several samples of<br />

the EEG can be modeled <strong>and</strong> then the AR parameters can be averaged.<br />

Some conventional method such as the modified covariance method or the<br />

Yule-Walker method can be used <strong>for</strong> AR modeling [125]. The estimation of<br />

the residual power can be done with the same algorithms. Trend removal<br />

can be done be<strong>for</strong>e modeling if necessary. A typical set of background EEG<br />

samples is shown in Fig. 5.1. Ten typical spectra of AR(8) models <strong>for</strong><br />

background EEG are shown in Fig. 5.2.<br />

3. Calculate the mean of the AR-parameters <strong>and</strong> the estimated residual powers.<br />

Create two sets of realizations using these parameters. One set is used<br />

<strong>for</strong> the simulation of the pre-stimulus EEG <strong>and</strong> the other set is used as<br />

additive background EEG in simulated measurements of evoked potentials.<br />

4. Remove the linear trend from the measured evoked potentials. Calculate<br />

the average of the measurements <strong>and</strong> determine the number <strong>and</strong> locations


68 5. Simulation of evoked potentials<br />

of the peaks in the average. Fit a function consisting of e.g. Gaussian<br />

shaped functions to the average. Fitting can be based on the nonlinear<br />

least squares scheme discussed in Section 2.12. The average, the location<br />

of the peaks <strong>and</strong> the fitted sum of Gaussian functions are shown in Fig. 5.3<br />

<strong>for</strong> measurements shown in Fig. 5.1.<br />

5. Select the limits of the desired location <strong>and</strong> amplitude variation of the peaks.<br />

The locations <strong>and</strong> the amplitudes can then be selected using some joint<br />

density. The use of uni<strong>for</strong>m <strong>and</strong> Gaussian densities are straight<strong>for</strong>ward. In<br />

the Gaussian case the limits can correspond e.g. to some multiple of the<br />

deviation of the density. Typical limits are shown in Fig. 5.4 <strong>and</strong> a set of<br />

simulations using uni<strong>for</strong>m densities <strong>for</strong> locations in Fig. 5.5.<br />

6. Add the simulated background EEG to simulated evoked potentials to obtain<br />

simulated measurements. A set of final simulations is shown in Fig.<br />

5.6.<br />

−50<br />

−50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

−50<br />

−50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

Figure 5.1: Twenty measured evoked potentials (top left) <strong>and</strong> the background<br />

EEG (bottom left) with the corresponding means (top right <strong>and</strong><br />

bottom right). The vertical axis is in microvolts <strong>and</strong> the horizontal axis in<br />

points.<br />

In the example that is carried out here, <strong>and</strong> shown in Figs. 5.1–5.6, the<br />

uncorrelated uni<strong>for</strong>m joint density is used <strong>for</strong> parameters of the peaks. This is<br />

suitable e.g. <strong>for</strong> the situations in which the per<strong>for</strong>mance of the latency estimation<br />

method is evaluated. In such cases we only need realistic single simulations. The<br />

statistics of the whole set of simulations need not to be realistic.


5.2. Component based simulation of evoked potentials 69<br />

50<br />

45<br />

40<br />

35<br />

30<br />

25<br />

20<br />

15<br />

10<br />

5<br />

0<br />

π/2 π<br />

Figure 5.2: AR(8) amplitude spectra of 10 samples of measured background<br />

EEG. The vertical axis is arbitrary <strong>and</strong> the horizontal axis is the<br />

normalized frequency.<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.3: The mean of the measured evoked potentials (solid rough),<br />

the fitted sum of Gaussians (solid smooth) <strong>and</strong> the marked peaks of the<br />

mean (dotted vertical).


70 5. Simulation of evoked potentials<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.4: The fitted sum of the Gaussians (solid) <strong>and</strong> given limits of<br />

the variation (dotted rectangles)<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.5: Component based simulations of evoked potentials.


5.2. Component based simulation of evoked potentials 71<br />

-40<br />

-30<br />

-20<br />

-10<br />

0<br />

10<br />

20<br />

30<br />

40<br />

0 20 40 60 80 100 120<br />

Figure 5.6: Simulated measurements of evoked potentials with component<br />

based simulation method.


72 5. Simulation of evoked potentials<br />

5.3 Principal component based simulation of evoked potentials<br />

In some cases also the statistical properties of the whole data set should be realistic.<br />

This is the situation in cases where the end to the estimation is e.g. classification<br />

of the measurements. In such a case one possibility is to interpret the evoked<br />

potential as a r<strong>and</strong>om vector <strong>and</strong> use the realizations of an appropriate joint density<br />

as simulations of evoked potentials.<br />

We propose here a method <strong>for</strong> evoked potential simulation. The method is<br />

based on use the of 2’nd order joint density approximation that is extracted from<br />

a set of real measurements. The proposed method is as follows:<br />

1. Measure a set z = (z 1 ,...,z N ) of evoked potentials. A typical set of measurements<br />

is shown in Fig. 5.1. The length of the measurement is T = 128<br />

points.<br />

2. Form the average ˆη <strong>and</strong> the centered covariance estimate Ĉz <strong>for</strong> the measurements.<br />

3. Form the eigendecomposition<br />

where ˆΓ = (γ 1 ,...,γ N ) <strong>and</strong> Λ = diag (λ 1 ,...,λ N ).<br />

Ĉ zˆΓ = ˆΓΛ (5.1)<br />

4. Use some small number p of eigenvectors to <strong>for</strong>m the matrix ˆΓ p . Usually<br />

p eigenvectors corresponding to p largest eigenvalues are selected, so that<br />

ˆΓ p = (γ 1 ,...,γ p ). The idea is that the eigenvectors should correspond the<br />

evoked potential part of the measurements, not the background EEG.<br />

5. If N is small the mean <strong>and</strong> the eigenvectors can be noisy. In such a case<br />

they can be smoothed using the smoothness priors approach<br />

ˆΓ α = (I + αD T 2 D 2 ) −1ˆΓp (5.2)<br />

ˆη α = (I + αD T 2 D 2 ) −1ˆη (5.3)<br />

using some suitable value <strong>for</strong> α, as explained in Section 3.2. The columns<br />

of ˆΓ α using p = 4 <strong>and</strong> α = 10 are shown in Fig. 5.7 <strong>for</strong> the data shown in<br />

Fig. 5.1.<br />

6. Generate the simulations <strong>for</strong> the evoked potentials with equation [191]<br />

s = ˆΓ α Λ 1/2<br />

p x + ˆη α (5.4)<br />

where x ∼ N(0,I) <strong>and</strong> Λ p = diag (λ 1 ,...,λ p ). The density of the simulations<br />

s is then jointly Gaussian with mean ˆη α <strong>and</strong> covariance C s = ˆΓ α Λ pˆΓT α .<br />

A set of typical simulations <strong>for</strong> evoked potentials is shown in Fig. 5.8.<br />

7. Add the simulated background EEG to simulated evoked potentials to obtain<br />

simulated measurements. The background EEG can be simulated as<br />

in Section 5.2. In Fig. 5.9 the original set of measurements is shown with<br />

the simulated ones with the corresponding means.


5.3. Principal component based simulation of evoked potentials 73<br />

If all the eigenvectors are used when the matrix ˆΓ p is <strong>for</strong>med, the decomposition<br />

contains also the second order statistical in<strong>for</strong>mation of the background EEG.<br />

This kind of scheme can be used e.g. <strong>for</strong> generation of the larger data sets than<br />

the original measured data set. The simulated measurements are still ensured to<br />

have almost the same statistics than the original measurements. However, in most<br />

cases p has to be selected with care to avoid the eigenvectors corresponding the<br />

background EEG to be taken into matrix ˆΓ p .<br />

If the simulated evoked potentials vary too little around the mean ˆη α <strong>for</strong><br />

practical simulations, the matrix Λ 1/2<br />

p can be replaced with the identity matrix.<br />

0.2<br />

0.15<br />

0.1<br />

0.05<br />

0<br />

-0.05<br />

-0.1<br />

-0.15<br />

-0.2<br />

0 20 40 60 80 100 120<br />

Figure 5.7: Four largest eigenvectors <strong>for</strong>ming the columns of the matrix ˆΓ α.


74 5. Simulation of evoked potentials<br />

-60<br />

-40<br />

-20<br />

0<br />

20<br />

40<br />

60<br />

0 20 40 60 80 100 120<br />

Figure 5.8: A set of evoked potentials simulated with principal component<br />

based simulation method.<br />

-50<br />

-50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

-50<br />

-50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

Figure 5.9: Measured evoked potentials (top left) <strong>and</strong> the measurements<br />

simulated with the principal component based method (bottom left) with<br />

the corresponding means (top right <strong>and</strong> bottom right). The vertical axes<br />

are in microvolts <strong>and</strong> the horizontal axes in points.


5.4. Discussion 75<br />

5.4 Discussion<br />

The two simulation <strong>methods</strong> introduced here are different in nature. The simulation<br />

by the component potentials is clearly not consistent with the data in strict<br />

sense. The parameters of the components are adopted from the mean of the measurements<br />

that may not correspond to any single potential. This is also the reason<br />

why the means of the measurements <strong>and</strong> the simulations are generally different.<br />

However, with this method many parameters of the final simulations e.g. the peak<br />

locations can be easily controlled. The joint density of the peak parameters can<br />

be selected with any kind of shape <strong>and</strong> correlation. To obtain more realistic simulations<br />

with this method the Gaussian functions can be replaced with any other<br />

functions.<br />

The principal component based simulation gives more realistic simulations<br />

when compared to the data. This is not surprising since the first <strong>and</strong> second<br />

moments of the measurements <strong>and</strong> the simulations are almost the same by construction.<br />

However, with this method some difficulties can arise. If the end to<br />

the simulation is e.g. to <strong>for</strong>m a data set <strong>for</strong> evaluation of the peak detector, the<br />

peaks have to be detected from the noiseless data set first. This is due to the fact<br />

that this simulation method is <strong>for</strong>mulated so that in evoked potentials there is no<br />

peaks by construction.<br />

Component based simulation method is more suitable <strong>for</strong> generation of data<br />

sets with trends. These are needed <strong>for</strong> evaluation of dynamical estimation <strong>methods</strong>.


CHAPTER<br />

VI<br />

Estimation of evoked potentials<br />

In this chapter we review the existing <strong>methods</strong> <strong>for</strong> the estimation of the evoked<br />

potentials. The <strong>methods</strong> are presented in the <strong>for</strong>malism introduced in Chapter<br />

2 <strong>and</strong> special attention is paid to the analysis of the implicit assumptions of the<br />

<strong>methods</strong>. The <strong>methods</strong> are categorized into ensemble analysis <strong>methods</strong>, single<br />

trial <strong>methods</strong> <strong>and</strong> dynamical <strong>methods</strong>. The <strong>methods</strong> that are discussed in Sections<br />

6.3.5, 6.3.6, 6.3.7 <strong>and</strong> 6.5.4 are partially novel. These <strong>methods</strong> are then used in<br />

Chapter 7 in which two systematic <strong>methods</strong> <strong>for</strong> evoked potential estimation are<br />

proposed.<br />

6.1 Introduction<br />

The notion of evoked potential is used here <strong>for</strong> the time-varying potential s in some<br />

location of the scalp. The potential is caused by stimulation of the somatosensory<br />

system. We assume that the measurements z of these potentials contain also noise<br />

v. The source of v is the spontaneous brain activity which is called the background<br />

EEG. Background EEG is thought to be independent of the stimulation<br />

<strong>and</strong> additive to the evoked potential. This is also called the additive noise model<br />

<strong>for</strong> observations<br />

z = s + v (6.1)<br />

The approaches that are used in analysis of evoked potentials are sometimes<br />

categorized also into deterministic <strong>and</strong> stochastic approaches. In the deterministic<br />

approach s is thought to be fixed between the repetitions of the test. Although<br />

it has been evident <strong>for</strong> over three decades [22] that this is not always a proper<br />

assumption, it is commonly used in evoked potential analysis. In the <strong>Bayesian</strong><br />

<strong>for</strong>malism the deterministic <strong>and</strong> stochastic approaches are h<strong>and</strong>led equivalently.<br />

The deterministic approach can be interpreted as the stochastic approach with no<br />

prior in<strong>for</strong>mation. In the stochastic approach the evoked potential s is assumed<br />

to be a r<strong>and</strong>om vector with some probability density p(s).<br />

We use the following categorization <strong>for</strong> the <strong>methods</strong> that are reviewed in this<br />

chapter:<br />

76


6.2. Ensemble analysis 77<br />

• ensemble analysis <strong>methods</strong> that give in<strong>for</strong>mation about first or second order<br />

statistics of set of evoked potentials<br />

• single trial <strong>methods</strong> that give in<strong>for</strong>mation about single responses.<br />

• dynamical <strong>methods</strong> that take account the dependence between the single<br />

trials or even try to model it<br />

Especially we call single trial estimation <strong>methods</strong> the <strong>methods</strong> that give estimate<br />

<strong>for</strong> a single evoked potential. Although the ensemble characteristics can<br />

be efficiently used e.g. in classification, there are many situations in which the<br />

single trial estimate is seen more appropriate. For example, if we are interested in<br />

the true mean latency of some peak of the evoked potential, the “Latency of the<br />

average is not the average of the latencies”, as argued in [25] <strong>and</strong> [190]. It has also<br />

been argued, that “...determination of the average is not enough...” [22] <strong>and</strong> even<br />

that “...the average may be one of the least significant variables of the response...”<br />

[6].<br />

It has been also suggested that the use of average wave<strong>for</strong>m is not a proper<br />

approach in source localization [185]. The use of single trials in topographic estimates<br />

also gives temporal in<strong>for</strong>mation about e.g. adaptation <strong>and</strong> habituation as<br />

emphasized in [114].<br />

Most of the <strong>methods</strong> that are reviewed here are applicable <strong>for</strong> analysis of<br />

several kind of evoked potentials. The potentials that are used in the evaluation<br />

of the <strong>methods</strong> include e.g. the somatosensory evoked potentials (SEP), visual<br />

evoked potentials (VEP) <strong>and</strong> auditive evoked potentials (AEP). A special example<br />

of auditive evoked potentials is the P300 potential, that is named according to its<br />

large positive peak that occurs approximately 300 ms after the stimulus.<br />

In this review the measurements <strong>and</strong> potentials are finite length vectors whose<br />

elements are samples from the originally continuous-time wave<strong>for</strong>m. The t’th<br />

coordinate of the vector corresponds to the t’th time lag after stimulation. These<br />

scalars are denoted by z(t). The other time course is between the repetitions of<br />

the test. In this case the evoked potentials can be seen as vector valued stochastic<br />

processes. The i’th measurement vector is then z i = (z(T i + ∆T),...,z(T i +<br />

M∆T)) T where T i is the absolute time of the i’th stimulus <strong>and</strong> ∆T is the sampling<br />

interval. All <strong>methods</strong> are presented in the estimation theoretical <strong>for</strong>malism that<br />

is introduced in Chapter 2. N is the number of the measurements in set of all<br />

measurements.<br />

Other reviews on the analysis <strong>methods</strong> of evoked potentials are e.g. [179, 130,<br />

7, 117].<br />

6.2 Ensemble analysis<br />

In this section we review in detail some of the ensemble analysis <strong>methods</strong> used in<br />

evoked potential analysis. In particular, we are interested in the conditions under<br />

which the reviewed <strong>methods</strong> are optimal.


78 6. Estimation of evoked potentials<br />

6.2.1 Averaging<br />

The most conventional method of analysis of the evoked potentials is ensemble<br />

averaging. The measurement vectors z i are averaged to reduce the signal-to-noise<br />

ratio. In the estimation theoretical <strong>for</strong>malism, if s does not vary between the<br />

repetitions, the measurements can be written in <strong>for</strong>m<br />

z = s + v = Hθ + v (6.2)<br />

where H = (I| · · · |I) T , θ = s, z = (z1 T , · · · ,zN T )T <strong>and</strong> v = (v1 T , · · · ,vN T )T . The<br />

least squares solution <strong>for</strong> s can then be written in <strong>for</strong>m<br />

ŝ = (H T H) −1 H T z (6.3)<br />

∑ N<br />

= (NI) −1 z i (6.4)<br />

= 1 N<br />

i=1<br />

N∑<br />

z i = z (6.5)<br />

i=1<br />

that is, the conventional average of the measurements. Thus we conclude that<br />

the average of the measurements is the best estimator in least squares sense <strong>for</strong><br />

deterministic s <strong>and</strong> additive noise model. The minimization of the least squares<br />

criterion is equivalent to the assumption v ∼ N(0,σ 2 I).<br />

If s is not deterministic, we obtain <strong>for</strong> expectation of ŝ<br />

E {ŝ} = 1 N<br />

= 1 N<br />

N∑<br />

E {z i } (6.6)<br />

i=1<br />

N∑<br />

E {s i } + 1 N<br />

i=1<br />

N∑<br />

E {v i } (6.7)<br />

i=1<br />

If s i are drawn from same joint density <strong>and</strong> v i are zero mean, we obtain<br />

E {ŝ} = 1 N<br />

N∑<br />

η s = η s (6.8)<br />

i=1<br />

so that with the stochastic assumption <strong>for</strong> the evoked potentials the average of<br />

the observations is an unbiased estimator of the expected value of the evoked<br />

potentials.<br />

6.2.2 Weighted <strong>and</strong> selective averaging<br />

In conventional averaging all measurements are treated with same weight. This<br />

is natural since we have assumed nothing about theOBB noise variance between<br />

the repetitions of the test. An approach to reduce the effect of the most noisy<br />

wave<strong>for</strong>ms is to take into account the covariance of the background EEG. With


6.2. Ensemble analysis 79<br />

H = (I| · · · |I) T , θ = s, z = (z T 1 , · · · ,z T N )T <strong>and</strong> v = (v T 1 , · · · ,v T N )T the Gauss–<br />

Markov estimator can be written in <strong>for</strong>m<br />

where<br />

ŝ = (H T Cv<br />

−1 H) −1 H T Cv −1 z (6.9)<br />

C v = E { (v T 1 ,...,v T N) T (v T 1 ,...,v T N) } (6.10)<br />

Usually we can assume that the noise vectors v i are mutually independent. This is<br />

especially true if the background EEG can be modeled as a process having rather<br />

short correlation time. This means that it is a wide b<strong>and</strong> process. In this case the<br />

covariance of the EEG is of the block diagonal <strong>for</strong>m<br />

C v = diag (C v1 ,...,C vN ) (6.11)<br />

When this is applied to equation (6.9), the resulting estimate is of the <strong>for</strong>m<br />

ŝ =<br />

( N<br />

∑<br />

i=1<br />

C −1<br />

v i<br />

) −1 N<br />

∑<br />

i=1<br />

C −1<br />

v i<br />

z i (6.12)<br />

This is the minimum variance estimate of s when the background covariances in<br />

the independent measurements are known or can be estimated.<br />

Equation (6.12) reduces to the conventional average if the background covariances<br />

are identical between the repetitions, that is, if C vi = C, i = 1,...,N. This<br />

result can be found from any st<strong>and</strong>ard text book of data analysis [119]. It is also<br />

obtained (totally unnecessarily) with Monte Carlo simulations <strong>for</strong> a homogenous<br />

set of artificial brainstem potentials in [118]. Note that the result holds even <strong>for</strong><br />

nonstationary background, that is, in cases when C is not a Toeplitz matrix.<br />

Background EEG is sometimes modeled as white noise. If we assume that the<br />

variance of the background can differ between the repetitions, we can write<br />

In this case, the estimated evoked potential is of the <strong>for</strong>m<br />

C vi = σ 2 i I (6.13)<br />

ŝ =<br />

w i = 1 σ 2 i<br />

( N<br />

∑<br />

i=1<br />

σ −2<br />

i I<br />

) −1 N<br />

∑<br />

i=1<br />

∑ N<br />

σ −2 i=1<br />

i z i =<br />

w iz i<br />

∑ N<br />

i=1 w i<br />

(6.14)<br />

(6.15)<br />

This is clearly a weighted average of the measurements. The estimate is the<br />

minimum variance estimate <strong>for</strong> the evoked potential. Although also this result can<br />

be found found from any st<strong>and</strong>ard text book of data analysis [119] it is derived<br />

(again unnecessarily) in [84] <strong>and</strong> [118].<br />

The requirement (6.13) is too restrictive. The result (6.14–6.15) holds also <strong>for</strong><br />

C vi = σ 2 i C (6.16)


80 6. Estimation of evoked potentials<br />

where C is any positive definite matrix. The weighted average (6.14–6.15) is thus<br />

optimal even <strong>for</strong> nonstationary background processes.<br />

The difficulty in this estimation scheme is the estimation of the noise variance<br />

in the measurement. An approach to the estimation of the covariance is to compare<br />

the individual measurements to the conventional average of the measurements. We<br />

know that if s is deterministic, the average is an unbiased estimator <strong>for</strong> s. We can<br />

then approximate the noise covariance by differentiating the measurements from<br />

the average <strong>and</strong> using the sample covariance of the residual as an estimate <strong>for</strong><br />

the error covariance. If the changes in the background statistics are slow, the<br />

background can be modeled e.g. as AR model based on the measurements be<strong>for</strong>e<br />

the stimulus. The covariance of the AR model can then be used as estimate <strong>for</strong><br />

the background covariance.<br />

Slightly different approach is adopted in [41]. There it is assumed that the<br />

noise variance is constant C vi = σvI 2 <strong>and</strong> the “amplitude” of the evoked potential<br />

varies so that<br />

s i = w i s (6.17)<br />

It is found that in this case the equation (6.14) maximizes the signal-to-noise ratio<br />

in average potential. The method <strong>for</strong> estimation of the weights w i proposed in [41]<br />

is quite lengthy <strong>and</strong> is based on the generalized eigenvalue analysis of covariances.<br />

However, the result can be obtained with a much more simple analysis: Let w =<br />

(w 1 ,...,w N ) T . Then the model <strong>for</strong> the observations is<br />

z = sw T + v = Hθ + v (6.18)<br />

where z = (z 1 ,...,z N ), H = s, θ = w T<br />

solution <strong>for</strong> the weights is then<br />

<strong>and</strong> v = (v 1 ,...,v N ). The least squares<br />

ŵ T = (s T s) −1 s T z (6.19)<br />

Since s is not known, we approximate it with the best estimate available, with the<br />

average (sample mean) z<br />

The result proposed in [41] is<br />

ŵ T = (z T z) −1 z T z (6.20)<br />

= zT z<br />

‖z‖ 2 (6.21)<br />

ŵ T = zT z<br />

∥ z T z ∥ (6.22)<br />

where only the scaling factor in denominator differs from (6.21). In weighted<br />

averaging the scaling of the weights is arbitrary.<br />

It is easy to modify the equations (6.14–6.15) with different selection of the<br />

weights w i . We can, <strong>for</strong> example, choose<br />

{<br />

1 ,σi 2 ≤ σ s<br />

0 ,σi 2 > σ (6.23)<br />

s


6.2. Ensemble analysis 81<br />

<strong>and</strong> thus reject those measurements from the average, that have the variance<br />

greater than σ 2 s. This approach is sometimes called the selective averaging. The<br />

method is proposed e.g. in [14] <strong>and</strong> [210] where the selection is based on the<br />

manual inspection of the measurements. This method is clearly an example of a<br />

robust estimation method [86].<br />

In [64] a method called selective averaging by cross covarying is proposed. Roughly speaking, in this method the measurements are weighted with the covariance between the measurements and the average. A fully deterministic model is used. The method is systematic but quite ad hoc.

In [161] selective averaging is based on the cross correlation of the measurement with a "template" waveform. The templates are derived from the average waveform by windowing. Only the waveforms with a high correlation with the template are selected for averaging.

In [65] the evoked potentials are first clustered according to the spectral content of the background EEG by fuzzy clustering methods. The selective averaging is then performed group by group.

6.2.3 Latency-dependent averaging

The fact that the evoked potentials have variations in latencies during the test has led to the design of methods that attempt to take these variations into account.

In Woody's "adaptive" method [235], the measured evoked potential is cross-correlated with a "template" waveform. The measurement vector is then shifted before the averaging according to the maximum of the cross-covariance. The procedure can then be repeated with the calculated average as the template. The method is usually called Woody averaging or cross correlation averaging (CCA). Woody's method is evaluated e.g. in [221]. The model for the evoked potential in Woody's method is thus a deterministic waveform with a random latency shift. If the latency shift of a single measurement can be estimated, the method gives the optimal estimate for the underlying waveform.
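For illustration, a bare-bones numpy sketch of the Woody/CCA iteration follows. The function name, the circular shifting and the shift limit are simplifying assumptions of ours, not part of the original method.

```python
import numpy as np

def woody_average(z, n_iter=5, max_shift=20):
    """Cross-correlation (Woody) averaging: a minimal sketch.

    z : (N_trials, N_samples) ensemble.  Each sweep is shifted to the lag that
    maximizes its cross-correlation with the current template (initially the
    plain average), and the template is re-computed from the aligned sweeps.
    Circular shifts are used here for brevity.
    """
    template = z.mean(axis=0)
    lags = np.arange(-max_shift, max_shift + 1)
    for _ in range(n_iter):
        aligned = np.empty_like(z, dtype=float)
        for i, zi in enumerate(z):
            cc = [np.dot(np.roll(zi, -lag), template) for lag in lags]
            aligned[i] = np.roll(zi, -lags[int(np.argmax(cc))])
        template = aligned.mean(axis=0)
    return template
```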

Another common approach to the problem is latency corrected averaging (LCA) [129]. In LCA every peak is aligned separately before the averaging. In the method, the waveforms are first filtered with a Wiener-type filter [6] and a pre-selected number of peaks is detected by cross-correlating the responses with a template waveform. A triangle-shaped template is suggested in [129]. By using the histogram of the peak locations, the peaks that correspond to each other are detected and aligned. The averaging is then carried out peak by peak. The resulting average produced by LCA is usually a discontinuous waveform. LCA and CCA are compared with the conventional averaging in [8]. It is found that CCA can lock onto the background if narrow-band processes, such as α activity, are present. The assumptions of LCA are more realistic than the assumptions of CCA, but the result of latency corrected averaging is no longer a continuous waveform. In LCA the peaks of the resulting average are estimates for the means of the peaks of the original waveforms.


In [131] a method for converting the disjoint segments of LCA into a continuous waveform is presented. The method is called continuous LCA. It is based on least squares fitting of the LCA result using sinusoidal basis functions (Fourier series).

In [239] the original LCA method is modified and the modification is called peak component latency corrected averaging (PC-LCA). The main differences between the original and modified methods are that in [239] adaptive filtering is used as a pre-processor, the template is a cosine-shaped waveform, and the amplitude information is also taken into account in the calculation of the latency distribution of the peaks.

6.2.4 Filtering of the average response

The fact that the average evoked potential can still be noisy has motivated many authors to propose methods with which the signal-to-noise ratio can be further improved after averaging. The filtering method that has received the most attention is the Wiener filtering approach. This is due to its optimality as a minimum mean square estimator, as explained in Section 4.5.

Most of the early works dealt with the original form of the Wiener filter

K(ω) = S_s(ω) / (S_s(ω) + S_v(ω))    (6.24)

The difficulty in this is clearly that the form of the filter necessitates prior knowledge about the spectra S_v(ω) and S_s(ω) of the background EEG and the evoked potential, respectively. This has led to a method that is called a posteriori "Wiener" filtering (APWF). The method is originally proposed in [219]. The idea in the method is that the spectra can be estimated a posteriori by comparing the spectrum S_{z̄}(ω) of the averaged measurements to the averaged spectrum S̄_z(ω) of the individual measurements. The method has been used for the smoothing of average visual evoked potentials, and it was found that the method gives unrealistic results [149]. In [54] it was found that the formulation of the APWF is biased and a correction was derived. After that, in [4] it was argued that both forms are erroneous because the rejection of the noise in the averaging of a sampled signal is not the same for different frequencies. A method for the comparison of APWF with other methods is presented in [55].

Papers in which APWF is found to be either ineffective or even to give erroneous results include e.g. [13, 210, 3, 144, 27] and [225]. There are two main reasons for the difficulties with the Wiener filtering. The first is that the methods for estimating the spectra necessitate that the signal is deterministic. The other is that the evoked potentials are transient-like waveforms that can seldom be thought to be stationary. In the derivation of the Wiener filter it was assumed that the signal is random and stationary.

The criticism is summarized in [47, 44]. It is also proposed that the so-called a posteriori time-varying filtering would solve the problem. The method is a sub-band weighting method for Wiener-type filtering of the average evoked potential.


The properties of the method are discussed in [43, 45, 46]. Finally, in [138] it was summarized that the spectra of the averaged background and the average potential are never so disjoint that any kind of digital filtering could be applied to the averaged signal in an a posteriori way. In cases where the method is applicable, no smoothing is usually needed. This is also reported in [105]. The properties of APWF have recently been studied e.g. in [62].

Other filtering approaches for the average potential are used e.g. in [1], where digital filtering is suggested with a wide-band amplifier, and in [24], where ARMA filtering is compared with APWF. Recent studies on the filtering of an average waveform include e.g. [60], where a hybrid FIR/median filter is suggested for the enhancement of the average.
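As an illustration of the a posteriori idea only (not of any particular corrected variant discussed above), the following numpy sketch estimates the signal and noise spectra by comparing the spectrum of the average with the averaged individual spectra, and applies the resulting Wiener-type gain to the average. The function name and the exact spectral bookkeeping are our assumptions.

```python
import numpy as np

def apwf(z):
    """A posteriori "Wiener" filtering of the average: a rough sketch.

    z : (N, M) ensemble of sweeps.  The spectra are estimated a posteriori
    from the data; the bias corrections discussed in the text are omitted.
    """
    N, M = z.shape
    avg = z.mean(axis=0)
    S_avg = np.abs(np.fft.rfft(avg)) ** 2                          # spectrum of the average
    S_ind = np.mean(np.abs(np.fft.rfft(z, axis=1)) ** 2, axis=0)   # averaged individual spectra
    S_v = np.maximum(S_ind - S_avg, 0.0) / (N - 1)                 # noise power left in the average
    S_s = np.maximum(S_avg - S_v, 0.0)                             # signal power estimate
    K = S_s / (S_s + S_v + 1e-12)                                  # Wiener-type gain, cf. (6.24)
    return np.fft.irfft(K * np.fft.rfft(avg), n=M)
```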

6.2.5 Deconvolution methods

If the latencies of the component potentials vary between the repetitions of the test, it is a well-known fact that the average of the potentials is the component waveform convolved with the probability density function of the latency variation. This has led to methods in which the average is sharpened by filtering with an inverse convolution filter. Since the probability density can often be assumed to be a smooth, e.g. Gaussian-shaped, function, the inverse filter is a high-pass filter. This is emphasized e.g. in [131], where the deconvolution method is called enhanced averaging. The same histogram information can be used for the design of the filter as in latency corrected averaging. Since the deconvolution of this kind of smoothing kernel is a highly ill-posed problem, the method is improved in [165] by using so-called iterative restoration algorithms. Both methods necessitate quite smooth average waveforms to be successful.

In [147] it is emphasized that the latency variation can be caused by asynchronous sampling. In that case the smoothing kernel is a uniform moving-average window of the length of the sampling period. In [147] the deconvolution is performed in the frequency domain.

It is worth stressing that deconvolution with this kind of kernel is a classical example of an inverse problem. If the probability density of the latency is known or can be estimated, e.g. with the methods explained in Section 6.4.3, the discrete form of the convolution kernel is easy to form. The deconvolution is then simple to perform with any of the regularization methods that are reviewed in Chapter 3.
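A minimal sketch of such a regularized deconvolution, assuming the latency density has already been discretized, is given below. The function name, the one-sided shift kernel and the zeroth-order Tikhonov penalty are our illustrative choices, not the specific regularization used in the cited works.

```python
import numpy as np

def deconvolve_average(avg, latency_pdf, alpha=1e-2):
    """Regularized deconvolution of an average waveform: a minimal sketch.

    avg         : averaged waveform of length M.
    latency_pdf : discretized probability density of the latency variation
                  (here assumed to describe non-negative shifts only).
    alpha       : Tikhonov regularization parameter.
    """
    M = len(avg)
    A = np.zeros((M, M))
    for k, p in enumerate(latency_pdf):
        A += p * np.eye(M, k=-k)            # contribution of a shift by k samples
    # Tikhonov-regularized solution of the ill-posed system A s = avg
    s = np.linalg.solve(A.T @ A + alpha**2 * np.eye(M), A.T @ avg)
    return s
```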

6.2.6 Principal components analysis

Standard multivariate statistical methods are sometimes used for the analysis of the evoked potentials [50, 52]. Principal Components Analysis (PCA) was perhaps first used for EPs in [90] and in [181]. In PCA the stochastic sample is represented as a weighted sum of orthogonal basis vectors. The weights and the basis vectors are obtained from the eigendecomposition of the covariance or correlation matrix of the ensemble of the measurement vectors [91]. The fact that the basis functions in PCA are orthonormal has led to the misinterpretation that the basis functions could represent some independent physiological generators. In fact, already in [90], p. 441, it was warned that "The principal factors, in themselves, do not have any physiological meaning and should not be constructed to represent a physiological system." After that, the use of PCA for the discrimination of the individual physiological component potentials was widely discussed in [53, 222] and [51]. The variations in the peak locations have been found to be the main reason for the existence of the "extra" components in principal components analysis [136]. These extra components are generated when the basis functions are rotated to be strictly positive or negative.

The applicability of PCA in the analysis of evoked potentials is also criticized at least in [176, 174, 223, 224, 233] and [139].

In itself, principal components analysis gives information on the second-order statistics of the ensemble of the measurements. This information can further be used e.g. for the classification of the measurements.
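In practice the decomposition reduces to an eigendecomposition of the sample correlation (or covariance) matrix of the ensemble, e.g. as in the following sketch; the function name and the number of retained components are ours.

```python
import numpy as np

def ensemble_pca(z, n_components=4):
    """PCA of an ensemble of sweeps: a small sketch.

    z : (N_trials, N_samples) matrix of measurement vectors.
    Returns the leading eigenvectors of the sample correlation matrix and
    the weight of each sweep in that basis.
    """
    R = z.T @ z / z.shape[0]                  # sample correlation matrix, cf. R_z = E{zz^T}
    eigval, eigvec = np.linalg.eigh(R)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    K = eigvec[:, order]                      # orthonormal basis of the principal subspace
    weights = z @ K                           # weights of each sweep
    return K, weights
```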

6.2.7 Classification

As emphasized in [133], estimation is the continuous extension of the classification and detection problems. In classification and detection problems the task is to decide to which group the evoked potential belongs, based on some features extracted from the data. Detection is classification into two classes. A common detection problem is the decision whether or not an evoked potential is present in the signal at all. However, we do not review the classification and detection methods here in detail. Classification and detection of the evoked potentials are studied e.g. in [196, 216, 137, 227, 35, 10, 172, 36, 168, 98, 68, 169, 199] and [204].

Matched filtering is a method for finding a known deterministic waveform in a noisy signal. In this sense matched filtering is a detection method and necessitates that the shape of the evoked potential is known a priori. The matched filter approach is used for the detection of the peaks for the improvement of Woody's method in [197]. The effect of the threshold setting is studied in [236]. In [234] it is found that matched filtering is not practical below an SNR of 25 dB. This means that detection of a single trial is hardly ever possible.

6.2.8 Other ensemble analysis methods

An evident modification of averaging is the use of some other point estimate, e.g. the median [20]. It has been reported that the median is less sensitive to large disturbances in the data than the average [220, 177, 20]. However, with small sample sizes the median can create additional disturbances, such as peak splitting [178]. The use of the mode has also been studied; in [109] the mode is calculated with a genetic algorithm.

In [31] it is assumed that the brainstem auditory evoked potential (BAEP) is an output of a linear system with an impulse input. In such a case the response of the system to several impulses is the convolution of the single impulse response with the input impulse sequence. In [31] maximum length sequences are used for the stimulation of the BAEP. In [188] it is suggested that the auditory system is nonlinear, and a method for nonlinear system identification using the same sequences is proposed.

An approach to the smoothing of the average is to use least squares fitting with specific basis functions for the average. In [131] trigonometric basis functions are used for the modeling of the discontinuous components of the LCA method. In [127] the use of Fourier, Haar and Walsh–Hadamard basis functions as well as the use of principal components is discussed. In [126] a polynomial basis is used for this purpose.

6.3 Single trial estimation

In this section we review the most common methods proposed for single trial estimation. The methods in Sections 6.3.5, 6.3.6 and 6.3.7 that are reviewed here are then used in Chapter 7, where two systematic methods for evoked potential estimation are proposed.

6.3.1 Filtering of single responses

Digital filtering is sometimes used also for single responses [180, 6]. The simplest filters are moving average FIR filters [180], and more complicated ones are designed to detect some specific peak such as the P300 [59]. Wiener filtering is also possible for single measurements [6, 29]. In all cases the filter has to have a symmetric impulse response to avoid phase distortion. The main problem in linear time-invariant filtering is the fact that the spectra of the evoked potential and the background noise usually overlap heavily. This is reported at least in [197, 105, 195]. This is natural since the evoked potential is a transient-like smooth waveform with no periodicity. The spectrum of this kind of waveform is not properly defined. The effect of digital filtering on single waveforms is studied e.g. in [180, 175, 120, 21] and [148].

6.3.2 Time-varying Wiener filtering

Although it is emphasized in Section 4.5 that the use of the notion of Wiener filtering is diverse, we here call time-varying Wiener filtering methods those methods that employ the time-varying Wiener filter equation

K = R_{sz} R_z^{-1}    (6.25)

The crucial problem in these methods is to obtain a good model for the cross-covariance R_{sz}. In general this task necessitates a model for the observations and some analytical model for the evoked potential s. This model can then be estimated using the observations. In [237] the filter is called the time-varying minimum mean square error filter. First it is noted that when the signal and the noise are uncorrelated, R_{sz} = R_{ss}. The signal is then modeled as a superposition of components with random location and amplitude. The parametric form of the covariance of the signal is then calculated. The presented form necessitates the probability densities of the peak locations as well as the means and the variances of the peak amplitudes. This information can be obtained using the LCA method. This approach is extended in [226] for multielectrode measurements.

6.3.3 Linear least squares estimation

One possibility for single trial estimation is to start with an observation model. The additive noise model for the measurements can be written in the form

z = s + v = Hθ + v    (6.26)

where z = (z_1, ..., z_N), v = (v_1, ..., v_N) and s = (s_1, ..., s_N) = (Hθ_1, ..., Hθ_N). Here z_i, s_i and v_i are column vectors. The evoked potentials are thus modeled as a linear combination of some basis vectors, namely the columns of the matrix H. With different choices of the basis, several different least squares estimation schemes can be obtained.

The most obvious scheme is to use some generic basis selection. For example, if we assume that the measured evoked potential is composed of Gaussian-shaped component potentials with a preselected shape, we can use these as the columns of the matrix H. In this scheme it is also easy to add the basis functions ψ_0 = 1 and ψ_1 = t that can model a first-order trend in the measurements. The most trivial selection is obviously H = I.

If the covariance of the background EEG does not change between the repetitions of the test, the Gauss–Markov estimate for the evoked potentials can be calculated using the equation

θ̂ = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z    (6.27)

where C_v = E{v v^T}. In the simplest case we assume C_v = I. The estimate for the single evoked potential is then of the form

ŝ = H θ̂    (6.28)

However, the method is not practical for single trial estimation without regularization.
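The Gauss–Markov estimate (6.27)–(6.28) amounts to one linear solve; a small numpy sketch (the function name is ours):

```python
import numpy as np

def gauss_markov(z, H, Cv=None):
    """Gauss-Markov estimate (6.27)-(6.28) for one sweep: a minimal sketch.

    z  : measurement vector of length M.
    H  : (M, p) observation matrix, e.g. Gaussian-shaped basis vectors.
    Cv : background covariance; the identity is used if None.
    """
    Cv_inv = np.eye(len(z)) if Cv is None else np.linalg.inv(Cv)
    theta = np.linalg.solve(H.T @ Cv_inv @ H, H.T @ Cv_inv @ z)   # (6.27)
    return H @ theta                                              # (6.28)
```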

6.3.4 Adaptive filtering

The use of adaptive filtering in the analysis of evoked potentials has been studied intensively during the last ten years. Most attention has been paid to the use of the LMS algorithm for the filtering of single evoked responses. Another application has been to use adaptive algorithms to model the trends in the measurements between the repetitions of the test. These methods are discussed in Section 6.5.3.

In most applications the procedure is as explained in Section 4.4. In the LMS algorithm

θ̂_t = θ̂_{t−1} + µ ϕ_t ε(t)    (6.29)
ε(t) = z(t) − ϕ_t^T θ̂_{t−1}    (6.30)

the measurement z(t) is selected as the primary input. Several choices have been proposed for the reference input, that is, the regressor vector ϕ_t. With the parenthesis notation it is again emphasized that the algorithm is used in scalar form, that is, for the scalar measurement. The time index then refers to the sample index within a single measurement vector. As is discussed in Section 4.4, this approach corresponds to time-varying linear regression, and the measurements z(t) are modeled as a linear combination of the components of the reference input with time-varying coefficients. The recursion corresponds to stochastic steepest descent minimization (the stochastic gradient method [85]). As explained in Section 2.17, in the LMS algorithm the parameters are updated in the direction of the regressor vector. If the reference input is fixed, the iterative method can serve at most as a regularization method, provided that the mean squared error diminishes in the direction that is determined by the reference input vector. In some cases, e.g. when the reference input changes during the iteration, true adaptation can also be achieved. Steepest descent methods could also be useful for nonlinear problems, but no references exist for that kind of application.
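In scalar form the recursion (6.29)–(6.30) amounts to only a few lines; a minimal sketch (the function name and step size are ours):

```python
import numpy as np

def lms_filter(z, phi, mu=0.01):
    """Scalar LMS recursion (6.29)-(6.30): a minimal sketch.

    z   : primary input of length T (one sweep, sample by sample).
    phi : (T, p) regressor vectors, e.g. delayed samples of a reference sweep.
    mu  : step size.
    """
    T, p = phi.shape
    theta = np.zeros(p)
    y = np.zeros(T)     # filter output
    e = np.zeros(T)     # error sequence
    for t in range(T):
        y[t] = phi[t] @ theta                 # prediction with the current coefficients
        e[t] = z[t] - y[t]                    # (6.30)
        theta = theta + mu * phi[t] * e[t]    # (6.29)
    return y, e
```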

In many studies it is emphasized that the adaptive filtering approach does not necessitate prior information about the evoked potentials, see e.g. [156]. In fact, the reference signal and the model structure are the prior information used in adaptive filtering. Also, the use of the LMS algorithm corresponds to the assumption that the covariance of the parameters is of the specific form that is discussed in Section 2.17.

One of the most cited works in this area is [203], in which it is proposed that two measurements z_i and z_j can be fed into the LMS algorithm, z_j as the reference and z_i as the primary input. The regressor vector is thus of the form

ϕ_t = (z_j(t), z_j(t − 1), ..., z_j(t − p))^T    (6.31)

It is assumed that z_i and z_j differ only with respect to the additive noise process, which means that the evoked potential is deterministic. The approach is thus to estimate (locally) one measurement with a set of delayed versions of another measurement. In [122] it is emphasized that if the signal is deterministic and no assumptions are made about the noise, this approach cannot give better results than averaging of the two measurements. In the least squares sense this is true, but in some situations it is possible that the variance of the estimate is reduced at the cost of increased bias.

Earlier, in [121], the performance of adaptive noise cancellation was tested using almost the same noise sequence as the reference as was used in the primary input. The primary signal was thus modeled as a linear combination of delayed noise sequences. The difference ε_t is then argued to correspond to the evoked potential.

In [217] the method that is presented in [203] is expanded for several reference signals. The regressor vector is thus of the form

ϕ_t = (z_{I(1)}(t), ..., z_{I(p)}(t))^T    (6.32)

where I is some permutation of the index set of the measurement set. In other words, any p measurements can be selected to form the regressor vector. The model for the measurement is then that it is a linear combination of several other measurements.


In [48] a method is proposed in which the signal is first filtered with two different low-pass filters. Based on the variances of the outputs of these, the signals are adaptively smoothed forwards and backwards.

In [112] the background EEG is first modeled as a stationary AR process using the pre-stimulus data. The average waveform is then modeled as a time-varying AR process with the RLS algorithm. The parameters are then used as a time-varying model for the evoked potentials. The observation is then modeled with state-space equations using the evoked potential as the state vector. The evoked potentials are then estimated using the Kalman filtering approach.

In [123] a modified adaptive line enhancement method is proposed. In this method the pre-stimulus data is adaptively modeled with an AR model. This model is then used to filter the post-stimulus data. The notion "modified" means that a non-adaptive filter is used for the post-stimulus data. The filtered waveforms can further be averaged.

In [32] adaptive filtering is used in the analysis of brainstem auditory evoked potentials (BAEP) in the same manner as in [203]. The average potential is used as the reference input. Thus the underlying model is that single BAEPs are linear combinations of shifted averages of all the evoked potentials.

In [156] several physical channels are used for measuring the somatosensory potential of a peripheral nerve. The primary input is selected to be the electrode nearest to the nerve. The other channels are used as references.

6.3.5 Principal component regression approach

In Section 3.3 it was emphasized that the selection of the basis vectors in least squares estimation can be based on the eigendecomposition of the data matrix. If the covariance R_s of the evoked potentials is known, we can write the model for the evoked potentials in the form

s = K_S θ    (6.33)

where the columns of K_S are some subset of the eigenvectors of R_s. If we use the additive noise model for the observations we can write

z = K_S θ + v    (6.34)

The Gauss–Markov estimate is then

θ̂ = (K_S^T C_v^{-1} K_S)^{-1} K_S^T C_v^{-1} z    (6.35)
ŝ = K_S θ̂    (6.36)

The correlation matrix of the evoked potentials is usually not known. In some cases the first eigenvectors of the data correlation matrix

R_z = E{z z^T}    (6.37)

span nearly the same subspace as those of the evoked potentials. Then we can approximate the signal subspace by the first eigenvectors of R_z. This case is also discussed in Section 2.14. This approach is clearly equivalent to first taking all measurements as regressors and then using principal component regression to reduce the dimensionality of the problem. In principal component regression the evoked potential is forced to lie in the principal subspace spanned by the columns of K_S.

This method is proposed in [100] for the estimation of visual evoked potentials. An almost identical approach is used in [107] with the white noise assumption C_v = σ_v^2 I. In this case the estimator reduces to

θ̂ = (K_S^T K_S)^{-1} K_S^T z = K_S^T z    (6.38)
ŝ = K_S θ̂ = K_S K_S^T z    (6.39)
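A small numpy sketch of the principal component regression estimate (the function name is ours; K_S is assumed to have orthonormal columns):

```python
import numpy as np

def pcr_estimate(z, K_S, Cv=None):
    """Principal component regression estimate (6.35)-(6.36)/(6.38)-(6.39): a sketch.

    z   : measurement vector.
    K_S : (M, k) orthonormal basis of the assumed signal subspace.
    Cv  : background covariance; the white-noise form (6.38)-(6.39) is used if None.
    """
    if Cv is None:
        theta = K_S.T @ z                                                   # (6.38)
    else:
        Cv_inv = np.linalg.inv(Cv)
        theta = np.linalg.solve(K_S.T @ Cv_inv @ K_S, K_S.T @ Cv_inv @ z)   # (6.35)
    return K_S @ theta                                                      # (6.36)/(6.39)
```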

6.3.6 Subspace regularization of evoked potentials

As discussed in Section 3.4, the principal component regression approach can be modified so that it is less restrictive. With evoked potentials we want the noiseless signals s = Hθ to be close to some pre-selected subspace S. It was shown in Section 3.4 that the solution is then of the form

θ̂ = arg min_θ { ‖L_1 (z − Hθ)‖^2 + α^2 ‖(I − K_S K_S^T) Hθ‖^2 }    (6.40)

Since the matrix L_2 = (I − K_S K_S^T) is symmetric and idempotent, we can write L_2 = L_2 L_2^T = (I − K_S K_S^T). By comparing this form with the results in Section 3.4 we note that L_1^T L_1 = C_v^{-1}, and the solution is then

θ̂ = (H^T C_v^{-1} H + α^2 H^T (I − K_S K_S^T) H)^{-1} H^T C_v^{-1} z    (6.41)
ŝ = H θ̂    (6.42)

In Section 3.4 this approach is also shown to have a connection to Bayesian mean square estimation. The matrix α^2 H^T (I − K_S K_S^T) H corresponds formally to the inverse of the covariance of the estimated parameters.

The basis selection in this approach can be based on the methods discussed in Section 6.3.3. The basis K_S of the regularizing subspace S can be selected using some approach discussed in Section 3.3. A systematic method for single trial estimation of evoked potentials using the subspace regularization method is proposed in Section 7.1.

A slightly different form of the method is proposed in [100], where subspace regularization is compared with the principal component regression approach.
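For reference, the closed-form solution (6.41)–(6.42) in a short numpy sketch (the function name and the default value of α are ours):

```python
import numpy as np

def subspace_regularized_estimate(z, H, K_S, Cv, alpha=1.0):
    """Subspace-regularized estimate (6.41)-(6.42): a minimal sketch.

    z     : measurement vector.
    H     : observation matrix.
    K_S   : orthonormal basis of the regularizing subspace S.
    Cv    : background noise covariance.
    alpha : regularization parameter.
    """
    Cv_inv = np.linalg.inv(Cv)
    P = np.eye(K_S.shape[0]) - K_S @ K_S.T           # projector onto the complement of S
    A = H.T @ Cv_inv @ H + alpha**2 * H.T @ P @ H    # left-hand matrix of (6.41)
    theta = np.linalg.solve(A, H.T @ Cv_inv @ z)
    return H @ theta                                 # (6.42)
```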

6.3.7 Smoothness priors estimation of evoked potentials

The smoothness priors method is directly applicable to the smoothing of evoked potentials. It is clearly a noncausal filtering method. The smoothness priors approach can be combined with any choice of the basis vectors in the observation model. When the smoothing is performed for the noiseless potentials s = Hθ, the solution can be written in the form

θ̂ = (H^T H + λ^2 H^T D_d^T D_d H)^{-1} H^T z    (6.43)
ŝ = H θ̂    (6.44)

The smoothness priors approach is used with a recursive estimator for brainstem auditory evoked potentials in [145]. The observation matrix is there based on knowledge of the impulse response of the earphone. A second-order difference is used for smoothing, and the Bayesian connection is also emphasized.

The potentials that are estimated with any estimation method may still be noisy, and the smoothness priors method can also be applied to estimated evoked potentials. For example, in subspace regularization the estimates can be noisy since the eigenvectors of the estimated covariance matrix are noisy when the sample size N is small. The model for the estimated evoked potentials can then be written in the form

ŝ = I s_2 + v_2    (6.45)

with C_{v_2} = I. That is, we assume that the estimated evoked potentials still contain some additive noise. The smoothness priors solution for ŝ_2 is then

ŝ_2 = (I + α_2^2 D_d^T D_d)^{-1} ŝ    (6.46)

This is the cascade form of the smoothness priors estimation. Another possibility is to use the smoothing in parallel form. For example, with subspace regularization the solution would then be

θ̂ = arg min_θ { ‖L_v (z − Hθ)‖^2 + α^2 ‖(I − K_S K_S^T) Hθ‖^2 + α_2^2 ‖D_d Hθ‖^2 }    (6.47)

However, the tuning of the hyperparameters α and α_2 is easier in the cascade form. The cascade form implementation of the smoothness priors method is used with recursive estimation of evoked potentials in [101] and in the methods that are proposed in Sections 7.1 and 7.2.

In some cases the basis vectors themselves can be smoothed with the smoothness priors method. This is the approach that is used in the simulation method that is proposed in Section 5.3.
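The cascade-form smoothing step (6.46) is easy to implement directly; a minimal numpy sketch (the difference order and the function name are ours):

```python
import numpy as np

def smoothness_priors_smoother(s_hat, alpha2=1.0, d=2):
    """Cascade-form smoothness priors smoothing (6.46): a minimal sketch.

    s_hat  : a (possibly noisy) evoked potential estimate of length M.
    alpha2 : smoothing hyperparameter.
    d      : order of the difference operator D_d (second order by default).
    """
    M = len(s_hat)
    D = np.eye(M)
    for _ in range(d):
        D = np.diff(D, axis=0)                 # builds the d-th order difference matrix D_d
    A = np.eye(M) + alpha2**2 * D.T @ D        # I + alpha_2^2 D_d^T D_d
    return np.linalg.solve(A, s_hat)           # (6.46)
```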

6.3.8 Other single trial estimation methods

An interesting approach is adopted in [17], in which the estimation of the single evoked potentials is based on outlier information. A set of measurements containing EEG both with and without evoked potentials is first modeled with a robust AR model estimator. If the evoked potentials are thought to be outliers in the data, the robust estimator models only the background part of the samples. In the ideal case, the residual of the model would then correspond to the evoked potential in the sample. The performance of the method is further improved in [128].

In [31] it is assumed that the brainstem auditory evoked response is an output of a linear system with an impulse input. In such a case the response of the system to several impulses is the convolution of the single impulse response with the input impulse sequence. In [31] the method of maximum length sequences has been used for the stimulation of brainstem auditory evoked potentials. However, in [188] it is suggested that the auditory system is nonlinear, and a nonlinear system identification method using the same sequences is proposed.

Neural networks have also been used in the analysis of evoked potentials. In [71] a method is proposed for artifact reduction in somatosensory evoked potential analysis.

6.4 Miscellaneous topics

The methods that are reviewed in this section are either applicable to both average and single measurements of evoked potentials, or they do not fit under other topics in a natural way.

6.4.1 Frequency domain methods

In [130] it is emphasized that filtering can also be performed in the frequency domain. This approach has the same limitations as the time domain filtering approach that is discussed in Sections 6.2.4 and 6.3.1. Different frequency domain filtering schemes can be found e.g. in [231, 146, 148]. Frequency domain filtering is applied to the single trial analysis of P300 measurements in [199].

The same argumentation holds also for spectral analysis or for classification using the spectral content of evoked potentials. If the components do not differ spectrally, they cannot be distinguished by linear spectral methods. Spectral methods have been used e.g. in [42, 143, 140, 141, 184].

The use of time-frequency distribution methods, such as the Wigner distribution and wavelets, is even more problematic, since they deal with the temporal spectrum of the signal. The definition of the temporal spectrum is always based on a time window of some length and shape [167]. The use of the temporal spectrum as a measure of the instantaneous frequency content of a signal is reasonable only if the time variation of the second-order statistical properties of the signal is slow. This means that the signal is nearly stationary within the time window. The Wigner distribution is used in [87] and wavelet-based methods e.g. in [12, 111, 16]. Also the method that is presented in [108] is basically a frequency domain filter bank method, although the notion of matched filtering is used.

As discussed in [117], frequency domain methods are mainly applicable to the analysis of two special classes of event-related potentials. The first is event-related synchronization/desynchronization, where the phase of the potential is not locked to the stimuli. The other is the analysis of so-called steady-state potentials that are evoked using continuous stimuli, e.g. temporally modulated light. The analysis of this kind of potential is outside the scope of this thesis.

6.4.2 Parametric modeling

Yet another class of methods that arises from the theory of time series analysis is the parametric modeling of the time series with rational models. The problem with these methods in the context of evoked potentials is that the most common models are time-invariant. Thus the use of these methods relies on tight assumptions about the stationarity of the signal.

In [2] it is assumed that evoked potentials can be modeled as deterministic waveforms with added random fluctuations. These random fluctuations are assumed to be such that they can be modeled as an ARMA process since "...they and spontaneus EEG are basically similar in nature [sic]", and ARMA models are sometimes used for the modeling of the background EEG. Another ARMA model is used for the background. The scalar Kalman filter is then used to predict the evoked potential. The basic assumption in [2] is that the difference between the potential and the average is a stationary process. This is not consistent with [38], where it is found that the variations are larger in the late components than in the early components.

The scalar Kalman filtering scheme is used also in [195], in which an ARMA model is first fitted to the average response. Using Kalman filtering, the evoked potential is then predicted with the noisy measurement as input. The model is capable of taking into account random amplification and shifting of the whole potential, but not variations in the individual peaks.

Another study in which ARMA modeling is used for the average potential is [88]. It is proposed that ARMA modeling can be used for the modeling of the average, since "The frequency domain characteristics of the simulated sequence and the corresponding power spectral density of the ARMA filter were quite close to the periodogram of the original data sequence [sic]". The simulations were created by feeding white noise sequences to the ARMA model. The simulations or the average of the simulations are not inspected in the time domain.

In [82] and [83] an ARMA model is used for the spectrum analysis of the evoked potential. Based on this model, a digital filter is designed to extract the evoked potential from additive noise. As emphasized in [82], "...the proposed method has the advantage that no assumptions about the EP to be found are necessary except the generally accepted fact that the EP is a deterministic, the EEG a stochastic signal [sic]". In [83] it is argued that "The single responses show a great variability of latency and amplitude..." and it is concluded that this is "...demonstrating that the single evoked response is not a stationary signal. [sic]"

Parametric models with an exogenous input are also used. In [30] and [28] an ARX model is used. The underlying model is then that if the background EEG can be modeled as an AR process, then for the evoked potential "...the same process is driven instead by a deterministic signal... [sic]". The use of the average of the responses is suggested in [30]. In [113] the average is replaced with the running average of the latest 20 responses.

A damped sinusoidal model is sometimes used for modeling the evoked potentials. With additive noise we can write for a single measurement

z(t) = Σ_{i=1}^{p} A_i ρ_i^t sin(ω_i t) + v(t)    (6.48)

Estimation of the parameters A_i, ρ_i and ω_i is a nonlinear problem. A well-known approximation method for solving the parameters is Prony's method [94]. It is proposed for use with average evoked potentials in [9]. In [80] it is found that for single trial estimates the signal-to-noise ratio should be about 10 dB. The use of the generalized singular value decomposition with Prony's method is proposed in [63, 79].

6.4.3 Estimation of the peak latencies

The estimation of the peak latencies can sometimes be informative in itself, as emphasized in [25]. The main goal in the estimation can e.g. be to find the histogram of the latencies for some specific peaks such as the P300. The estimates for the peak latencies are also needed for some other estimation methods such as LCA and CCA.

The simplest method for peak latency estimation is called peak picking [190]. It simply means that the maximum or minimum of the signal in some desired time interval is selected to represent the peak location. The signal is sometimes prefiltered [180] with some time domain filter.

Another method is to cross-correlate the signal with a template waveform [58, 70] and to find the maximum of the correlation. The cross-covariance is also used [158]. Other variations are to use normalized cross-correlation functions [15] or the generalized cross-correlation method [166].

In [132] a low-order polynomial is used to approximate the signal in the vicinity of the peak. The fitting of the polynomial is a simple linear least squares problem, and if a second-order polynomial is used, the minimum or maximum is unique and simple to calculate.

Different methods for the estimation of the single trial P300 latencies are compared in [190, 70].
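As an example of the polynomial approach, the following sketch picks the extremum in a window and refines it with a parabola through the three neighbouring samples. The function name and the window handling are ours; the peak is assumed not to lie at the very edge of the record.

```python
import numpy as np

def peak_latency_parabolic(x, t0, t1):
    """Peak latency by peak picking plus second-order polynomial fitting: a sketch.

    x      : sampled waveform.
    t0, t1 : sample indices bounding the search window.
    """
    k = t0 + int(np.argmax(np.abs(x[t0:t1])))                # coarse peak location (peak picking)
    y0, y1, y2 = x[k - 1], x[k], x[k + 1]
    denom = y0 - 2.0 * y1 + y2
    delta = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom   # parabola vertex offset in samples
    return k + delta
```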

6.5 Dynamical estimation of evoked potentials

As discussed in Section 6.1, we refer to the methods that take into account the time variation between the single evoked potentials as dynamical methods. In the dynamical estimation methods the preceding or following estimates are used as information in the estimation of single evoked potentials.

If the set of measurements is assumed to be homogeneous, all measurements can be taken into account with equal weights in the estimation. With a homogeneous set we mean that the measurements can be thought to be sampled from the same joint probability density. In this case the estimation reduces to the methods we have reviewed as single trial estimation methods. The dynamical methods can give additional information about the evoked potentials if our assumption is that the parameters of the joint density of the evoked potentials or the background noise have trend-like variations during the test. Typical situations are e.g. trend-like changes in the latency or amplitude of some specific peak. Changes in the latency are reported in somatosensory recordings e.g. in [22].


6.5.1 Windowed averaging

The most obvious way to handle the time variations between the single measurements is sub-averaging the measurements in groups. Sub-averaging is used e.g. in [22] to demonstrate the decrease of the amplitude in visual evoked potentials. The same optimality criteria hold for sub-averaging as for averaging. Sub-averaging gives optimal estimators if the evoked potentials are assumed to be deterministic within the sub-averaged groups.

Another obvious method for the dynamical estimation of the evoked potentials is the windowed averaging of the measurements. This can also be called sliding window averaging. The estimator then takes the form of a moving average filter of the measurement values at every time lag. In vector form we have

ŝ_t^{MWA} = Σ_{i=0}^{n−1} w_i z_{t−i}    (6.49)

In [205] the term moving window averager (MWA) is used for the sliding window averager with equal weights w_i = 1/n. Another method that is proposed in [205] is the exponentially weighted averager (EWA), in which the weights are of the form

w_i = γ^i / Σ_{j=0}^{n−1} γ^j    (6.50)

for some 0 < γ < 1. It can be shown that the equivalent form of the EWA is then

ŝ_t^{EWA} = γ z_t + (1 − γ) ŝ_{t−1}^{EWA}    (6.51)

Some methods have been proposed in which the single estimates are first processed with single trial estimation methods and the sliding window averaging is then performed. In [49] the adaptive Kalman smoother that is proposed in [48] for single trial estimation is combined with the exponentially weighted averager.
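The recursive form (6.51) is particularly simple to implement; a minimal sketch (the function name and default γ are ours):

```python
import numpy as np

def ewa(Z, gamma=0.1):
    """Exponentially weighted averager, recursive form (6.51): a minimal sketch.

    Z     : (N_trials, M) ensemble of sweeps in presentation order.
    gamma : forgetting factor, 0 < gamma < 1.
    Returns one EWA estimate per trial.
    """
    s = np.zeros(Z.shape[1])
    out = np.empty(Z.shape, dtype=float)
    for t, z in enumerate(Z):
        s = gamma * z + (1.0 - gamma) * s      # (6.51)
        out[t] = s
    return out
```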

6.5.2 Frequency domain filtering

Frequency domain filtering is implemented in [186], in which a low-pass filtering is also applied to each measurement separately. The estimator can then be represented as a two-dimensional separable low-pass filter. The use of a non-separable filter is also straightforward. The implementation of the filtering is easiest in the frequency domain, which is emphasized in [151].

In [232] polynomial fitting is proposed for the tracking of the variations in the spectrum of the evoked potentials; fifth-order orthogonal polynomials are fitted for every frequency lag separately.

6.5.3 Adaptive algorithms

Adaptive algorithms can take the time variations into account in a natural way, as discussed in Section 2.16. Although it would be simple to use the adaptive algorithms in vector form, the most popular way has been to use the algorithms in scalar form, as discussed in Section 2.17.

One of the first approaches that is based on adaptive algorithms is presented in [215]. The evoked potential is assumed to be a deterministic almost periodic signal

s_{t+T} ≈ s_t    (6.52)

This kind of signal can be approximated using the Fourier series expansion. In [215] the harmonic basis functions of the Fourier series are adaptively fitted to the evoked potential data using the LMS algorithm. The regressor vector is then of the form

ϕ_t = (sin(ω_0 t), sin(2ω_0 t), ..., cos(ω_0 t), cos(2ω_0 t), ...)    (6.53)

The method can track slow changes in evoked potentials, and it is used for monitoring the response to medication during neurosurgical procedures in [205]. The effect of slight nonperiodicity is studied in [89].

In [106] the periodicity assumption is used again. The model for the signal is a deterministic evoked potential in noise, and the reference input is an impulse function synchronized with the stimulus. The signal is thus modeled as a linear combination of delayed impulses, that is, with the natural basis as the observation model. As is discussed in Section 6.2.1, the least squares solution with this kind of observation model H = I reduces to ensemble averaging. Since the algorithm that is used in [106] is the LMS algorithm and the weight update is performed only once during the period of M scalar samples of one trial, the estimate corresponds to sliding window averaging with adaptive weight adjustment.

Two adaptive methods are presented in [200]. One is to use the measurement z_t as the reference signal and the exponentially weighted average ŝ_{t−1}^{EWA} as the primary input. The other method is to use ŝ_t^{EWA} in place of the reference. In the first approach the model is then to present the sliding average as a linear combination of the delayed versions of one measurement. The other model is to use delayed sliding averages as a basis for another sliding average.

In [238] the steepest descent method is used directly for the estimation of the time-varying Wiener filter matrix.

In [104] the LMS is used for the estimation of the delay between two identical potentials embedded in noise. The method is proposed for the tracking of abnormal delays in somatosensory potentials if a normal SEP sample is available.

In [202] the use of the LMS is extended to filter banks. The method is proposed for the tracking of time-varying evoked potentials.

6.5.4 Recursive mean square estimation of evoked potentials

A possible approach is to assume that the evoked potential is a vector-valued random process with slow variations between the repetitions of the test. It can then be modeled with a state-space model. The recursive mean square estimate for the state is then given by the Kalman filter.

Let the observation model be linear and time-invariant. Assume that the time variations of the state vector can be modeled with the so-called random-walk model. Then the state-space equations are of the form

θ_{t+1} = θ_t + w_t    (6.54)
z_t = H θ_t + v_t    (6.55)

where v_t and w_t are random vectors. As discussed in Section 2.16, the recursive mean square estimate for θ_t is given by the Kalman filter. The Kalman filter is the recursive linear mean square estimator for θ_t using the previous estimate as the prior information. With the model (6.54–6.55) the Kalman filter takes the form

K_t = P_t H^T (H P_t H^T + C_v)^{-1}    (6.56)
P_{t+1} = (I − K_t H) P_t + C_w    (6.57)
θ̂_{t+1} = θ̂_t + K_t (z_t − H θ̂_t)    (6.58)

where K_t is the Kalman gain and θ̂_t is a time-varying estimate for the parameters θ_t.

With this approach several possibilities exist for the selection of the observation model. With the trivial choice H = I the parameter vector θ = s equals the evoked potential. Some generic basis selection is also possible, such as the Gaussian basis. Another possibility is to use the principal component approach, so that H = K_S. The recursive estimate for s then takes the form

K_t = P_t K_S^T (K_S P_t K_S^T + C_v)^{-1}    (6.59)
P_{t+1} = (I − K_t K_S) P_t + C_w    (6.60)
θ̂_{t+1} = θ̂_t + K_t (z_t − K_S θ̂_t)    (6.61)
ŝ_{t+1} = K_S θ̂_{t+1}    (6.62)

The smoothness priors method can be combined with this estimator. This approach is used in [101] and a systematic method is proposed in Section 7.2.
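A compact numpy sketch of the recursion (6.56)–(6.58), applied sweep by sweep, is given below. The function name, the initialization and the argument layout are our own choices; setting H = K_S gives the principal component form (6.59)–(6.62).

```python
import numpy as np

def kalman_track(Z, H, Cv, Cw, theta0=None, P0=None):
    """Random-walk Kalman recursion (6.56)-(6.58) over an ensemble: a minimal sketch.

    Z  : (N_trials, M) measurements z_t, one sweep per row.
    H  : (M, p) observation matrix.
    Cv : (M, M) measurement noise covariance.
    Cw : (p, p) state (random-walk) noise covariance.
    Returns the sequence of estimates s_hat_t = H theta_hat_t.
    """
    M, p = H.shape
    theta = np.zeros(p) if theta0 is None else theta0
    P = np.eye(p) if P0 is None else P0
    s_hats = []
    for z in Z:
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Cv)   # (6.56) Kalman gain
        P = (np.eye(p) - K @ H) @ P + Cw                # (6.57) covariance update
        theta = theta + K @ (z - H @ theta)             # (6.58) state update
        s_hats.append(H @ theta)
    return np.array(s_hats)
```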

6.5.5 Other dynamical estimation methods

In [206] the running average of the wavelet transform of the running average of the measurements is used to track trends in evoked potentials.

In [105] it is assumed that the deviations of the evoked potentials from the mean are a continuous function of some external measurable condition. Based on this, it is proposed that a nonparametric regression method, iterative kernel estimation, could be used for the estimation of single trials of evoked potentials. The exact mathematical formulation of the method is unpublished.

6.6 Discussion

A wide range of methods has been proposed for evoked potential analysis during the last decades. Some of the methods are proposed for the estimation of the properties of the whole data set and some for the estimation of single evoked potentials. Some of the methods try to model the time variations between the single trials.

Several methods that are proposed for the estimation of the ensemble properties are optimal under some specific conditions. These conditions are often not very realistic. For example, Woody's method is optimal for a deterministic waveform with random latency. If more realistic assumptions are used, the information may no longer be representable with ensemble parameters. For example, the assumptions of LCA are realistic, but the result of LCA is not a good estimator for any physical waveform. The conventional average is often still required for historical reasons to make clinical studies comparable. It seems that different methods for "enhancing" the conventional average often yield estimators that still are not good estimators for anything. In this sense these methods are not very often applicable.

There are not many methods for the single trial estimation of evoked potentials. It seems that the most reliable results can be obtained with the minimum mean square estimation methods. We have reviewed here two types of mean square estimation methods: the time-varying Wiener filtering based approach and the observation model based approach. The origin of both approaches is the same. However, in the model based approach it is easier to incorporate prior information into the estimation procedure. In particular, we have reviewed different methods that take into account information about the ensemble statistics and the smoothness assumptions about the underlying evoked potentials. We believe that the model based approach will be more popular in the future. The extension of these models to several channels and realistic observation models is straightforward.

The model based approach to mean square estimation is easy to extend to time-varying situations. The optimal recursive estimate is then the Kalman filter. In Section 6.5.4 we proposed a method in which the principal component regression is combined with the Kalman filter. The proposed method is the only published method in which a general purpose adaptive algorithm is used for the estimation of evoked potentials in vector form. Kalman filtering also allows the use of more realistic models for evoked potential fluctuations than the random walk model. We believe that these approaches are worth further investigation.


CHAPTER
VII
Two methods for evoked potential estimation

In this chapter we propose two systematic methods for the estimation of evoked potentials. The basis of these methods has been discussed in Sections 6.3.5, 6.3.6, 6.3.7 and 6.5.4. The first method is applicable to single trial analysis and the other to the dynamical estimation of evoked potentials. The proposed methods are evaluated using both simulated data and real measurements. Although the proposed methods are widely applicable to different types of evoked potentials, the simulations and the real measurements here are based on the P300 potential. The P300 is one of the most studied responses of the human brain. The P300 test is performed using the so-called oddball paradigm. Two kinds of stimuli, the standard and the target, are used in stimulation. The test person is asked to perform some task when the target stimulus occurs. For definitions and significance of the P300 potential see e.g. [198, 159, 160, 164, 135, 58, 163].

7.1 A systematic method for single trial estimation

In this section we propose a systematic method for the single trial estimation of evoked potentials. The method is based on subspace regularization with a generic choice for the observation model. The final smoothing of the estimates is based on the smoothness priors approach. The proposed method is as follows:

1. Measure a set of noisy evoked potentials z = (z_1, ..., z_Nz) and background EEG v = (v_1, ..., v_Nv), where z_i and v_i are column vectors. The background measurements v_i are typically measured before the stimulus. The time between the repetitions should be random and long enough to prevent the late potentials from corrupting the background estimate and to avoid the locking of the background to the stimulus.

2. Calculate C_v ≈ N_v^{-1} Σ_i v_i v_i^T. This is an estimate for the background covariance. Other methods are also applicable for the estimation of the covariance. The background can be modeled e.g. as an AR model and the covariance can then be calculated using the spectrum of the model.

3. Form the matrix H in some generic way. A set of Gaussian shaped vectors with different delays and a pre-selected shape is a suitable choice. If the measurements have a trend, it is possible to include the constant and first order polynomial basis vectors as columns of H.

4. Calculate R_z ≈ N_z^{-1} Σ_i z_i z_i^T. This is an estimate for the correlation matrix of the measurements. Note that when the correlation matrix is used here, the mean of the evoked potentials is modeled automatically as the first eigenvector of the correlation matrix. The use of the covariance matrix is also possible, but the mean of the measurements then has to be included explicitly in the equations of the estimates.

5. Solve the ordinary eigendecomposition R_z U = U Λ of the correlation matrix. The solution of only the principal eigenspace is also possible.

6. Form K_S = (u_1, ..., u_p), where u_i are eigenvectors of R_z. The common choice is to use the p eigenvectors associated with the p largest eigenvalues.

7. Form the matrix D_d. Usually the second or third order difference matrices are used.

8. Calculate the estimate ŝ for the evoked potentials with

   ŝ = (I + α_2^2 D_d^T D_d)^{-1} H (H^T C_v^{-1} H + α^2 H^T (I − K_S K_S^T) H)^{-1} H^T C_v^{-1} z    (7.1)

   The selection of α and α_2 can be based on some posterior rule as discussed in Section 3.5. Another possibility is to select the parameters experimentally. For example, the parameter α_2 can be based on visual inspection of the smoothing of the eigenvectors. One possibility to select the parameter α is to make realistic simulations and inspect the estimation error as a function of the regularization parameters. A sketch of the whole procedure is given below.
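As a concrete illustration, the following sketch collects steps 2-8 into a single routine. It is a minimal sketch in Python/NumPy under assumptions of our own (the variable names, the dense eigendecomposition, the helper for the difference matrix, and an invertible sample covariance C_v), not a definitive implementation of the method.

```python
import numpy as np

def difference_matrix(n, order=2):
    """d-th order difference matrix D_d of size (n - order) x n."""
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)
    return D

def single_trial_estimates(z, v, H, p=3, alpha=0.01, alpha2=10.0, d_order=2):
    """Subspace-regularized single trial estimates, eq. (7.1).

    z : (N, Nz) noisy evoked potentials as columns
    v : (N, Nv) pre-stimulus background EEG sweeps as columns
    H : (N, k)  generic observation model (e.g. a Gaussian basis)
    """
    N = z.shape[0]
    Cv = v @ v.T / v.shape[1]                  # step 2: background covariance estimate
    Rz = z @ z.T / z.shape[1]                  # step 4: correlation matrix of the data
    eigval, U = np.linalg.eigh(Rz)             # step 5: eigendecomposition of R_z
    Ks = U[:, np.argsort(eigval)[::-1][:p]]    # step 6: p principal eigenvectors
    Dd = difference_matrix(N, d_order)         # step 7: difference matrix D_d
    Cvi = np.linalg.inv(Cv)                    # assumes C_v is well conditioned
    Proj = np.eye(N) - Ks @ Ks.T               # projector onto the complement of S
    M = np.linalg.inv(H.T @ Cvi @ H + alpha**2 * (H.T @ Proj @ H))
    Smooth = np.linalg.inv(np.eye(N) + alpha2**2 * (Dd.T @ Dd))
    return Smooth @ H @ M @ H.T @ Cvi @ z      # step 8: eq. (7.1), one column per trial
```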

This systematic method is based on the subspace regularization approach. The observation model is based on a generic basis selection. The selection of the basis vectors can vary, but the idea is that the basis is quite general, that is, several different types of measurements can be modeled with it.

In subspace regularization the solution is regularized towards the null space of the regularization matrix K_S. Our approach is to include all the prior information in this part of the estimation. There are several other possibilities to select the matrix K_S. The selection of the subspace S is here based purely on the observations, that is, on the dependent variable. It could also be based on independent variables. This kind of approach is used in [211, 96], in which possible measurements are first simulated and K_S is formed as the principal eigenspace of this set of vectors. The selection could also be based on both independent and dependent variables, as in the partial least squares method.

In the form (7.1) the background EEG is assumed to be Gaussian, but no stationarity is assumed. In the stationary case the matrix C_v would be a Toeplitz matrix. One possibility to take the nonstationarity of the background EEG into account is to model the background before and after the stimulation as an AR model. If the model parameters differ, it is possible to model the background as a time-varying AR process with a smooth transition of the parameters from one state to another. A method for this is proposed in [96].

The selection of the subspace dimension p and of the number of basis vectors in H are also regularization parameter selection problems, as emphasized in [218]. The selection can be based on some posterior rule. It is possible to formulate the whole problem as a fully empirical Bayesian problem, in which all regularization parameters are treated as hyperparameters with a suitable prior distribution. The selection of the hyperparameters can then be based on the likelihood of the data given the hyperparameters.

7.2 A systematic method for dynamical estimation

In this section we propose a systematic method for the recursive mean square estimation of single evoked potentials. The method is based on the state space representation of the evolution of the evoked potentials and the use of Kalman filtering. Principal component regression is used for the formulation of the observation model. The final smoothing of the estimates is based on the smoothness priors assumption. The proposed method is as follows:

1. Measure a set of noisy evoked potentials z = (z_1, ..., z_Nz) and background EEG v = (v_1, ..., v_Nv), where z_i and v_i are column vectors. The background measurements v_i are typically measured before the stimulus. The time between the repetitions should be random and long enough to prevent the late potentials from corrupting the background estimate and to avoid the locking of the background to the stimulus.

2. Calculate C_v ≈ N_v^{-1} Σ_i v_i v_i^T. This is an estimate for the background covariance. The background can also be modeled e.g. as an AR model and the covariance can then be calculated using the spectrum of the model.

3. Calculate R_z ≈ N_z^{-1} Σ_i z_i z_i^T. This is an estimate for the correlation matrix of the measurements. Another possibility is to use the covariance matrix. The mean of the measurements can then be included in the estimation e.g. by adding it to the observation matrix as a column.

4. Solve the ordinary eigendecomposition R_z U = U Λ of the correlation matrix. The solution of only the principal eigenspace is also possible with some suitable method.

5. Form K_S = (u_1, ..., u_p), where u_i are eigenvectors of R_z. Commonly the p eigenvectors associated with the p largest eigenvalues are used.

6. Form the matrix D_d and fix α_2. Usually the second or third order difference matrices are used. The selection of α_2 can be based on some posterior rule.

7. Select θ̂_0, the initial estimate for the parameters. It is possible to use the average of the least squares estimates as the initial parameter vector.

8. Select P_0, the initial estimate for the estimation error covariance. The selection of this can be based on experience.

9. Select C_w, the covariance of the error in the state update. This is the variable that controls the variance and the bias (tracking speed) of the estimation. It is possible to use experience and visual inspection to select e.g. a diagonal covariance.

10. Calculate the estimate ŝ_{2,t} for the evoked potentials using the Kalman filter and the smoothness priors approach:

    K_t = P_t K_S^T (K_S P_t K_S^T + C_v)^{-1}    (7.2)
    P_{t+1} = (I − K_t K_S) P_t + C_w    (7.3)
    θ̂_{t+1} = θ̂_t + K_t (z_t − K_S θ̂_t)    (7.4)
    ŝ_{t+1} = K_S θ̂_{t+1}    (7.5)
    ŝ_{2,(t+1)} = (I + α_2^2 D_d^T D_d)^{-1} ŝ_{t+1}    (7.6)

    A sketch of this recursion is given below.
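The following is a minimal sketch of the recursion (7.2)-(7.6), again in Python/NumPy and under assumptions of our own (the function and variable names, and the observation matrix being passed as a parameter so that the same routine can also be reused with the generic basis H of case study 4). It is an illustration, not a definitive implementation.

```python
import numpy as np

def kalman_ep(z, Hobs, Cv, Cw, theta0, P0, Dd, alpha2):
    """Random-walk Kalman filter with smoothness priors smoothing, eqs. (7.2)-(7.6).

    z    : (N, Nz) measurements as columns, in their temporal order
    Hobs : (N, p)  observation matrix (K_S in this section)
    Returns the smoothed estimates and the final state and error covariance.
    """
    N, p = Hobs.shape
    Smooth = np.linalg.inv(np.eye(N) + alpha2**2 * (Dd.T @ Dd))   # eq. (7.6) operator
    theta, P = theta0.copy(), P0.copy()
    s2 = np.zeros_like(z, dtype=float)
    for t in range(z.shape[1]):
        K = P @ Hobs.T @ np.linalg.inv(Hobs @ P @ Hobs.T + Cv)    # eq. (7.2) gain
        P = (np.eye(p) - K @ Hobs) @ P + Cw                       # eq. (7.3) covariance
        theta = theta + K @ (z[:, t] - Hobs @ theta)              # eq. (7.4) state update
        s2[:, t] = Smooth @ (Hobs @ theta)                        # eqs. (7.5)-(7.6)
    return s2, theta, P
```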

This systematic method is based on the use of the Kalman filter and principal component regression. The observation model is then based on the statistics of the set of measurements. The selection of the basis vectors can also be based on some other strategy. They can be selected e.g. to be of some generic form, or by any strategy discussed in Section 3.3.

Also in this method the background EEG is assumed to be Gaussian, but no stationarity is assumed. If the covariance of the background is not estimated, the covariance can be selected to be of the form C_v = σ_v^2 I. The selection of the covariance can also be based on experience and visual inspection. The selection of the covariance C_w is more important in this method. If the covariance is tuned to be too small, the estimates are biased towards the previous estimates. If the covariance is selected to be too large, the estimates have too much variance. Usually C_w is selected to be of the form C_w = σ_w^2 I and the selection of the variance term σ_w^2 is based on experience. It is also possible to use the so-called state-space identification methods, one of the first of which is proposed in [19]. In these the covariance terms are treated as hyperparameters, and the selection is based on e.g. output-error least squares estimation of the parameters.

The effect of the initial values θ̂_0 and P_0 can be removed by using the algorithm first backwards, with the data in reversed order. The last estimates of the backward run can then be used as the initial estimates in the forward run, as in the sketch below.
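A hedged illustration of this initialization, reusing the kalman_ep sketch above (it assumes the variables of that sketch are already defined; the rough starting values theta0 and P0 are placeholders):

```python
# Backward pass over the trials in reversed order, started from rough guesses.
_, theta_b, P_b = kalman_ep(z[:, ::-1], Ks, Cv, Cw, theta0, P0, Dd, alpha2)
# Forward pass, initialised with the final state of the backward run.
s2, _, _ = kalman_ep(z, Ks, Cv, Cw, theta_b, P_b, Dd, alpha2)
```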

If the measurements do not depend on the previous measurements, the problem reduces to single trial estimation and the method proposed in Section 7.1 should be used instead of the dynamical method.

7.3 The data sets

For the evaluation of the proposed methods we form three sets of data. Two data sets are formed using the simulation methods introduced in Sections 5.2 and 5.3. These data sets are referred to as data set #1 and data set #2, respectively. In addition, the evoked potentials measured from three test persons were used as data. The measurements are named 878p, 1173p and 958p. The simulations are based on the measurement 878p, which is shown together with the simulations in Fig. 7.1. The number of measurements is between 76 and 80, and the number of simulations is 80 in both sets.

Figure 7.1: The measurement 878p (top left) and the mean of the measurements (top right), data set #1 (middle left) and data set #2 (bottom left) with corresponding noiseless simulations (middle right and bottom right, respectively). The vertical axis is in microvolts and the horizontal axis in points. Note the different scale in data set #1. The positive axis is downwards.

The different nature of the simulation methods is again clearly visible in Fig. 7.1. The background activity is measured before the stimulus in the case of the measurement 878p. In the case of the data sets #1 and #2, realizations of an AR process are used as background, as explained in Section 5.1. In the case of data set #1 the latency of the positive peak is uniformly distributed in the interval 66–110. In the case of data set #2, p = 4 eigenvectors are used in the simulation.

7.4 Case study 1

In the first study the data set #1 is used. The observation model is selected to be a generic set of Gaussian shaped functions, not intended to be optimal in any sense:

H = (ψ_1, ..., ψ_k)    (7.7)

where

ψ_i = exp( −(t − τ_i)^2 / (2d^2) )    (7.8)

The width parameter is d = 10 and k = 20 basis functions are used. The basis functions, that is, the columns of the matrix H, are plotted in Fig. 7.2; a construction sketch is given below.
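A short sketch of this basis follows. The signal length and the even spacing of the delays τ_i are assumptions made for illustration only, since the text fixes just the width d = 10 and the number of functions k = 20.

```python
import numpy as np

N, k, d = 128, 20, 10.0                       # N is an assumed signal length
t = np.arange(N)
tau = np.linspace(0, N - 1, k)                # assumed evenly spaced delays tau_i
H = np.exp(-((t[:, None] - tau[None, :])**2) / (2 * d**2))   # columns psi_1..psi_k, eq. (7.8)
```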

Figure 7.2: The basis of the observation model in case study 1.

The evoked potentials are then estimated using the systematic estimation method proposed in Section 7.1. Different values are used for the regularization parameter α and for the dimension p of the regularizing subspace. The norm of the estimation error between the true (simulated) and the estimated evoked potential is then calculated. The estimation error norm is shown in Fig. 7.3.

The error norm is seen to have a minimum with respect to both the number of basis functions and the degree of regularization. This means that the regularization diminishes the error compared to ordinary Gauss–Markov estimation, which corresponds to the value α = 0. The proper selection of both parameters is thus important. The over-estimation of p is called over-fitting. If the number of basis vectors is too high, the model tends to fit also the noise in the observations. If α goes to infinity, the method equals the principal component regression discussed in Section 3.3. In this case subspace regularization yields a smaller error than the principal component regression.

Figure 7.3: The estimation error norm in case study 1 as a function of the regularization parameter α with different values of p (top left) and as a function of p with different values of α (top right). The same error norm as a gray scale image and a contour plot (bottom left and bottom right, respectively) as a function of α (vertical, 10-base logarithmic) and p (horizontal).

The mean squared error is seen to be at its minimum with p = 2 or p = 3, and the optimal value for α is between 10^{-2} and 10^{-1}. The value p = 3 is selected for the dimension of the regularizing subspace. The columns of the matrix K_S in this case are plotted in Fig. 7.4. The structure of the basis vectors enables the approximation of the mean of the potentials and of the variations in the positive peak. This can be easily seen by considering different linear combinations of these vectors.

Figure 7.4: The columns of the matrix K_S with p = 3. The first, second and third eigenvectors (bold, medium and thin lines, respectively).

The evoked potentials are then estimated from the simulated noisy measurements of data set #1 using the method proposed in Section 7.1. The regularization parameter values α = 0.01 and α_2 = 10 are used. In Fig. 7.5 eight single trial estimates are shown together with the noiseless simulations and the simulated measurements. The effect of the regularization is clearly visible in estimates 2 and 3 (numbering from left to right, top to bottom). The method tends to minimize the error between the true potential and the estimate in the vicinity of the second negative peak, although there are great variations in the data in this interval. The opposite effect is seen in estimate 7, in which the estimate tends to minimize the residual. These single trial estimates can further be used e.g. in source estimation.

As an example we also localize the maximum of the positive peak (latency) in the single estimated potentials. For that we use the method introduced in [132] and discussed in Section 6.4.3. The latency estimation is based on fitting a second order polynomial to the data in the vicinity of the absolute maximum of the estimate; the width of the fitting window is 40 points. A sketch of this estimator is given below. The latency estimation is carried out for the estimated evoked potentials as well as for the noiseless simulated potentials. The latency of the estimated potentials is shown as a function of the latency of the noiseless simulations in Fig. 7.6. The errors in latency estimation are seen to be homogeneous throughout the simulations.
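A minimal sketch of this latency estimator follows. The sub-sample refinement via the vertex of the fitted parabola and the handling of the window edges are our reading of the description of [132], not a verbatim implementation.

```python
import numpy as np

def peak_latency(s, window=40):
    """Latency of the dominant peak of one estimate s (1-D array, in samples)."""
    k = int(np.argmax(s))                          # location of the absolute maximum
    lo, hi = max(0, k - window // 2), min(len(s), k + window // 2)
    t = np.arange(lo, hi)
    a, b, _ = np.polyfit(t, s[lo:hi], 2)           # second order polynomial fit
    return -b / (2.0 * a) if a != 0 else float(k)  # vertex of the fitted parabola
```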

In Fig. 7.7 the mean and the standard deviations of the latencies are shown together with the histograms of the latencies for the estimated and the simulated potentials. Both histograms show that the estimated latencies are distributed uniformly over the interval in which the peak maxima lie by construction.


Figure 7.5: Eight randomly selected single estimates (dotted) with simulations with (solid rough) and without (solid smooth) noise in case study 1.


Figure 7.6: The latency of the estimated evoked potentials as a function of the latency of the noiseless simulations (+) in case study 1. The straight line is the line with slope 1 that goes through the origin (the ideal case).


Figure 7.7: The mean and the standard deviations (vertical lines) of the latencies of the positive peak for the noiseless simulations with the average of simulated measurements (top left) and the histogram of the latencies (top right) in case study 1. The mean and the standard deviations (vertical lines) of the latencies of the positive peak for the estimated evoked potentials with the average of simulated measurements (bottom left) and the histogram of the latencies (bottom right).


7.5 Case study 2

The procedure of Section 7.4 is repeated for data set #2. The mean square errors are shown in Fig. 7.8. The minimum of the error norm is no longer as clear as in case study 1. This is due to the structure of the joint statistics of the simulations. The simulated evoked potentials are generated so that their joint statistics are strictly Gaussian. For this type of data the optimal α would be infinite and the method would reduce to the principal component regression if the eigenspace of the evoked potentials were known. However, because of the noise term, the principal eigenspace of the simulated measurements, which is used in regularization, does not equal the principal eigenspace of the simulated evoked potentials.

Figure 7.8: As in Fig. 7.3 but for case study 2.

The data set #2 is also used for the evaluation of the generalized cross-validation criterion (3.43) for the selection of the parameters p and α. The generalized cross-validation criterion is calculated for the whole data set; thus the residual norm in the numerator of (3.43) is replaced with the stacked residual of the whole data set. The criterion is shown in Fig. 7.9. It is seen that the criterion does not have a clear minimum as a function of either of the parameters p or α. This is not surprising in view of the analysis carried out in [218], in which it is emphasized that the theory of GCV is asymptotic and the criterion may not have a minimum for small data sets. However, when compared to Fig. 7.8 it is seen that the criterion has a corner in the vicinity of the optimal parameter values that are obtained using the mean square error. We therefore use this corner in the selection of the parameters in this thesis.
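For reference, a hedged sketch of the criterion as it is used here is given below. Equation (3.43) is not reproduced in this chapter, so the standard GCV form for a linear estimator A (stacked residual norm over the squared trace of I − A, up to a normalizing constant that does not affect the minimizer) is assumed.

```python
import numpy as np

def gcv(A, z):
    """Generalized cross-validation score for a linear estimator s_hat = A z.

    A : (N, N) influence matrix of the estimator
    z : (N, Nz) all measurements stacked as columns
    """
    R = np.eye(A.shape[0]) - A
    residual = np.linalg.norm(R @ z)**2        # stacked residual of the whole data set
    return residual / np.trace(R)**2
```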


Figure 7.9: The generalized cross-validation criterion in case study 2 as a function of the logarithm of the regularization parameter α (top left) and of p (top right). The same criterion as a gray scale image and a contour plot (bottom right and bottom left, respectively) as a function of α (vertical, logarithmic) and p (horizontal).

The optimal regularization parameters in this case are seen to be approximately p = 5 and α = 0.05. The evoked potentials are estimated with the estimation method proposed in Section 7.1 using these values as parameters. The smoothing parameter was α_2 = 10. In Fig. 7.11 eight randomly selected estimates are shown together with the simulated noiseless evoked potentials and the simulated measurements. It is again seen that the estimate is a compromise between the data and the prior assumptions. The estimate rejects clear disturbances, as seen in estimate 3. However, there are now no peaks in the simulations by construction. This means that the question about e.g. the latency or amplitude of some specific peak may not be unique at all.

The latencies of the positive peak are again extracted using the same method as in Section 7.4. Latency estimation is carried out for the estimated evoked potentials as well as for the noiseless simulations, since in this case we do not by construction have any information about the latencies that correspond to the simulations. The latency of the estimate is shown as a function of the latency of the noiseless simulation in Fig. 7.10. The error in the latency estimates is homogeneous except for some outliers. The latency distribution is clearly not homogeneous in this case.

Figure 7.10: As in Fig. 7.6 but for case study 2.

In Fig. 7.12 the mean and the standard deviations of the latencies are shown together with the histograms of the latencies for the estimated and the simulated potentials. It is seen that the histograms are skew but similar. It is evident that these latency distributions are consistent with the real evoked potential measurements, since the simulations were generated using a realistic second order approximation of the joint distribution of the measured evoked potentials. In this case it is clear that the mean of the latencies does not necessarily correspond to the latency of the mean of the measurements. Here, e.g., the median of the latencies is closer to the latency of the mean, as seen in Fig. 7.13.


Figure 7.11: Eight randomly selected single estimates (dotted) and simulations with (solid rough) and without (solid smooth) noise in case study 2.


Figure 7.12: As in Fig. 7.7 but for case study 2.

Figure 7.13: The median (dotted vertical) and the mean (solid vertical) of the latencies of the positive peak of the estimated evoked potentials with the mean of simulated measurements in case study 2.


7.6 Case study 3

The estimation procedure is now applied to the real evoked potential measurements 878p, 1173p and 958p. In this case the selection of the regularization parameters cannot be based on the true mean square error, since we cannot compare the estimated evoked potentials with the true ones. The parameter selection can again be based on the GCV criterion. The criterion is calculated for the measurements 878p and the result is shown in Fig. 7.15. The criterion is seen to have a corner approximately at p = 5 and α = 0.05. These are also the values selected in case study 2, in which the simulated data were formed using real measurements. It is thus evident that these values are suitable also for these measurements. The values p = 5 and α = 0.05 are therefore selected as the regularization parameters. The same parameters are used for all measurement sets. The single trial estimates are then calculated. In Fig. 7.14 eight randomly selected estimates are shown together with the measurements 878p. It is again seen that the estimates reject clear disturbances.

The detection of the positive peak is then carried out as before. The results are shown in Figs. 7.16–7.18.

With all measurement sets the estimated latency of the third peak (P300) has a more or less nonsymmetric histogram. This is natural, since it is to be expected that the latency of this component is closely related (correlated) to the latencies of the other components. It is even probable that the components have a causal relationship. A more interesting observation is that the distribution of the latency is not always time-independent. In the case of the measurements 1173p and 878p the variance of the latency of the P300 peak grows during the test. This means e.g. that the average is not a good estimator for any true potential of this kind. Thus we stress that in this kind of a test, single trial estimation should always be done prior to any other estimation method. The trend in the variance of the latencies is in this case more interesting than the absolute error. In the case of the measurement 958p there seems to be variation in the latency although no clear trend is visible. In the averaging of such data the shape of the average potential will be the convolution of the shape of the underlying peak and the probability density of the latency of the peak, provided that the shape of the peak is independent of the latency. This means that the amplitude information in the peak depends on the variance of the latency. In such a case the method that is evaluated here can be used for the design of the deconvolution filter, as discussed in Section 6.2.5.

Figure 7.14: Eight randomly selected single estimates (dotted) with the real measurements 878p (solid).

The single trials are also estimated with the regularization parameter α = 0 for the measurements 878p. The estimate then corresponds to the Gauss–Markov estimate. Note that the covariance of the background EEG is still taken into account in this case. The modeling of the measurements as linear combinations of Gaussian basis functions also smooths the estimates. In Fig. 7.19 eight estimates are shown together with the measurement 878p. It is seen that the estimates minimize the residual homogeneously. If the degree of smoothing is increased, the estimates tend to smooth also the true peaks. This property is common to all time-invariant filtering methods. The latency estimation is also carried out for these smoothed estimates and the results are shown in Fig. 7.20. It is seen that the trend in the variance of the latencies is no longer visible.


Figure 7.15: As in Fig. 7.9 but for the measurement 878p.

Figure 7.16: The mean and the standard deviations (vertical lines) of the latencies of the positive peak of the estimated potentials with the mean of simulated measurements (top left) and the histogram of the latencies (top right) in case 3 for the measurement 878p. The median of the latencies (vertical lines) with the mean of simulated measurements (bottom left) and the estimated latencies as a function of the number of the simulation.


Figure 7.17: As in Fig. 7.16 but for the measurement 1173p.

Figure 7.18: As in Fig. 7.16 but for the measurement 958p.


Figure 7.19: Eight randomly selected single estimates using α = 0 (dotted) with the real measurement 878p (solid).


Figure 7.20: The mean and the standard deviations (vertical lines) of the latencies of the positive peak of the Gauss–Markov estimated potentials together with the mean of simulated measurements (top left) and the histogram of the latencies (top right) for the measurement 878p. The median of the latencies (vertical line) with the mean of simulated measurements (bottom left) and the estimated latencies as a function of the number of the measurement.
the measurement.


7.7 Case study 4

The dynamical estimation method introduced in Section 7.2 is evaluated in this section. The simulations of data set #1 are sorted according to the latency of the third peak. By this construction we obtain a data set in which trend-like variation of the evoked potentials is guaranteed. The sorted data set is shown in Fig. 7.21.

Figure 7.21: The sorted noiseless simulations as a waterfall plot (top left) and as a gray scale image (top right). The sorted simulations with noise as a waterfall plot (bottom left) and as a gray scale image (bottom right).

The systematic estimation method introduced in Section 7.2 is then applied to the estimation of the single evoked potentials. In the first case the observation matrix H is selected as in Section 7.4. The columns of H are plotted in Fig. 7.2. The initial conditions and the other parameters are selected for simplicity so that

θ̂_0 = E{ (H^T H)^{-1} H^T z }    (7.9)
C_w = 0.01 I    (7.10)
C_v = 0.1 I    (7.11)
P_0 = 0.1 I    (7.12)

where the notation E{·} means the average over the vectors. The dimension of the problem is thus 20, the number of basis vectors in H. The mean of the least squares solutions has been used as the initial parameter estimate. The other values are selected so that the tracking performance of the algorithm is visually acceptable; more systematic ways are discussed in Section 7.2. The value α_2 = 10 is used as the smoothing parameter. A sketch of this setup is given below. The results of the estimation are shown in Fig. 7.22.
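The following sketch illustrates this setup, assuming the kalman_ep routine of Section 7.2 and the Gaussian basis H of case study 1 are available; the sorting key and the helper names are illustrative assumptions.

```python
import numpy as np

# Sort the trials by the latency of the third peak of the noiseless simulations
# (latencies_third_peak is an assumed, precomputed vector of those latencies).
order = np.argsort(latencies_third_peak)
z_sorted = z[:, order]

theta0 = np.mean(np.linalg.pinv(H) @ z_sorted, axis=1)   # eq. (7.9): mean of LS solutions
Cw = 0.01 * np.eye(H.shape[1])                           # eq. (7.10)
Cv = 0.10 * np.eye(H.shape[0])                           # eq. (7.11)
P0 = 0.10 * np.eye(H.shape[1])                           # eq. (7.12)

s2, _, _ = kalman_ep(z_sorted, H, Cv, Cw, theta0, P0, Dd, alpha2=10.0)
```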

It is seen that the tracking performance of the algorithm is comparable to the performance of the non-dynamical method in case study 1. However, in this method the observation model is generic and no prior information is implemented in the observation model. The assumption of slow variation, that is, the random walk model, serves as the prior information in this case.

Figure 7.22: The results of the dynamical estimation with the generic Gaussian observation model H. The true latencies (solid) and the estimated latencies (+) as functions of the observation (top left). The estimated latencies as a function of the true latencies (top right). The estimates as a waterfall plot (bottom left) and as a gray scale image (bottom right).
waterfall plot (bottom left) <strong>and</strong> as a gray scale image (bottom right).<br />

The estimation procedure is repeated using the principal component observation<br />

model as proposed in Section 7.2. The subspace dimension is selected to be<br />

p = 3. The other initial values <strong>and</strong> parameters are the same as in (7.9–7.12). The<br />

results are shown in Fig. 7.23.<br />

The dimension of the problem is thus 3, the number of basis vectors in K S .<br />

The results in Fig. 7.22 can be compared to the results in 7.23. It is seen that<br />

the implementation of prior in<strong>for</strong>mation in <strong>for</strong>m of subspace constraint reduces<br />

the errors in latency estimates. In both cases the first estimate can be seen to be<br />

inconsistent with rest of the estimates. The more accurate selection rule than the<br />

mean is proposed in 7.2. However with this selection it can be seen that the effect<br />

of the initial estimates is negligible after few estimates.


Figure 7.23: As in Fig. 7.22 but with the principal component observation model K_S.

When the variance of the estimates is tuned to be small, the bias (tracking error) of the estimates increases. This is clearly seen in Fig. 7.23. The bias can be reduced by using the estimation method in the forward and backward directions and then taking the averages of the estimates. A more accurate approach is to use the so-called Kalman smoothing methods instead of Kalman filtering in the estimation [154, 155].

7.8 Discussion

The proposed systematic method for single trial estimation is seen to be robust and reliable. The estimates are realistic by visual inspection. As seen from the simulations, the mean square estimation error has a minimum, which means that the use of the subspace regularization improves the estimate as compared to both Gauss–Markov (minimum variance) estimation with the proposed basis and the principal component regression. The latency histograms of the estimates are comparable with the histograms of the simulations. The method can be further modified to use more realistic basis functions.

It is evident that a reliable single trial estimation method can give much more information than simple averaging. From these tests it is clear that, for example, the latency in the P300 test should be examined on a single trial basis prior to any other estimation method.

In the estimation of real evoked potentials there is a need to develop a reliable method for the selection of the regularization parameters. The implementation of some posterior method would be most desirable, but it might not be possible to form such a method for general use. Fully Bayesian implementations of the method are a future topic to be investigated.

The extension of the method proposed in Section 7.1 to multichannel measurements is straightforward. However, in multichannel estimation methods the observation model has to contain the dependence of the measurements at different electrode locations on each other. The proper modeling of this dependence necessitates a model of the head as a volume conductor. The complete estimation problem can then be formulated as a source estimation problem.

Single trial estimation methods are also needed as the first step in dynamical estimation, since the selection of the evolution model has to be based on some assumptions about the evoked potentials. For example, from the results of the single trial estimation in case study 3 one can conclude that in some cases there is a trend in the variance of the latencies. This implies that the covariance C_w should not be time-independent in the state evolution model.

In case study 4 a realistic state evolution model can be used. The assumption of trend-like changes is valid for the third peak in the simulated potentials. The estimated latencies are comparable to the true latencies. The covariances C_v and C_w were selected here by visual inspection. With these covariances the momentary variance and bias of the Kalman filter can be tuned so that the bias and the variance of the estimated latencies are acceptable. The performance of the method is not very critical with respect to the selection of the covariances. However, they can also be selected with a posterior approach based on the data using the so-called state space identification methods. In these methods the system matrices of the state space equations are parametrized and the parameters are solved e.g. as output-error least squares estimates. The implementation of these methods is a further topic to be investigated. Another future improvement is the use of the so-called hyperparameter models for the state equations.

The transient effect of the initial values of θ̂_0 and P_0 can be reduced by using the algorithm forward and backward and taking the final backward estimates as the initial values in the forward estimation.


CHAPTER
VIII
Discussion and conclusions

The presentation of all parts of this thesis is based on the unified formulation that is introduced in Chapter 2. With this formalism the results of estimation theory can be combined with regularization theory, time series analysis methods and Bayesian statistics. The formalism is a multivariate vector formalism. In particular, the review of the existing methods for evoked potential analysis in Chapter 6 is based strictly on this presentation. This makes it possible to analyze the implicit assumptions that are made in the various methods. This kind of unified presentation also makes the properties of the methods easier to compare. The review carried out in Chapter 6 should be helpful for anyone who is interested in the estimation of evoked potentials.

One problem in the evaluation of estimation methods has been the lack of realistic simulation methods. The simulation method is usually tailored to suit the estimation method that is being evaluated. In this sense the results of the evaluations may in some cases have been too optimistic. For this reason we have proposed two methods for the simulation of evoked potentials. Both methods are useful for different simulation needs. The component based simulations were found to be somewhat inconsistent with the data. In this method the parameters of the components were adopted from the mean of the measurements, which may not correspond to any single potential. However, with this method many parameters of the final simulations, e.g. the peak locations, can be easily controlled. The joint density of the component potential parameters can be selected with any kind of shape and correlation. In order to obtain more realistic simulations with this method, the selected probability density for the latency of the component should be deconvolved out of the shape of the component. However, this can be a very unstable problem. Principal component based simulation gives more realistic simulations when compared to the data. This is not surprising, since the first and second moments of the measurements and the simulations are by construction almost equal. We conclude that the proposed methods can serve as a systematic approach to the simulation of evoked potentials. The simulation of a nonstationary background is also easy to connect to the methods.


We also proposed a systematic method for the single trial estimation of evoked potentials. The proposed method was applied to the estimation of the evoked potentials in the P300 test. Only the target responses were estimated. The estimates were seen to be able to reject clear disturbances in the data. This can be meaningful e.g. if the amplitude ratios of the peaks are of interest.

As an example the single trial estimates were used for latency estimation. In all three cases the estimated latency of the P300 peak had a more or less nonsymmetric histogram. This is natural, since it is to be expected that the latency of this peak is closely related to the latencies of the other peaks. It is even probable that the peaks have a causal relationship. The probability density of the latency is then the convolution of the densities of all causal components. Another interesting thing to note is that the distribution of the latency is not always time independent. In some cases the variance of the latency of the P300 peak grows with time. The proposed method is also easily modified to include even nonstationary background processes. Also, different kinds of observation model selection schemes and prior assumptions are directly applicable to this observation model based approach.

The other systematic estimation method presented is a method for the dynamical estimation of the evoked potentials when the parameters of the potentials have trends. In the proposed algorithm the Kalman filtering and the principal component approaches are combined with the smoothness priors approach. The simulations show that the algorithm is capable of tracking slow trends in the evoked potential parameters. In these simulations the negative peaks varied randomly and did not fulfill the assumption of trend-like variation. The proposed method is the only method in which evoked potentials are treated as vector-valued stochastic processes and a recursive estimator is used. Also, the use of principal component regression in connection with the Kalman filter is of great importance. The proposed method allows the use of a great variety of more specific models both for the observations and for the time evolution of the evoked potentials.

Overall conclusions and future directions

In this thesis it has been shown that the regularization and Bayesian approaches can be useful in the analysis of single evoked potentials. It has also been shown that the proposed methods are related to several optimality criteria. We stress once again that although the simulations show that the proposed methods can be useful in themselves, the most important property of the estimators is their general structure, which allows one to develop estimators with more realistic observation models and more realistic choices for the prior. The following lines for further research are suggested:

• Formulation of fully <strong>Bayesian</strong> single trial estimator <strong>for</strong> evoked potentials.<br />

This should contain as few user tunable parameters as possible. For this<br />

we need a good method <strong>for</strong> the estimation of the regularization parameters<br />

in a posterior way. It is possible to <strong>for</strong>mulate the estimation method as a<br />

hierarchical <strong>Bayesian</strong> model <strong>and</strong> then use an empirical <strong>Bayesian</strong> approach<br />

<strong>for</strong> the estimation of the regularization parameters.


126 References<br />

• It has to be kept in mind that a single trial estimate is only a time-varying<br />

estimate <strong>for</strong> a potential in some location of scalp. When the proposed<br />

<strong>methods</strong> are extended to multichannel measurements <strong>and</strong> realistic observation<br />

models, the whole estimation problem can be interpreted as a source<br />

localization or potential mapping problem.<br />

• In recursive estimation the random walk model is the simplest choice. The true evolution of the evoked potentials is more complex, and a more realistic model could be based on the data. This leads to so-called hypermodel and state-space identification problems.
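
As noted in the first item above, one possible way to realize the empirical Bayesian estimation of a regularization parameter is to maximize the marginal likelihood (evidence) of a linear-Gaussian observation model. The sketch below illustrates this under simple assumptions and is not the procedure developed in this thesis; all matrices, noise levels, and grid values are illustrative.

import numpy as np

def log_evidence(z, H, alpha, sigma2):
    # Log marginal likelihood of z for the model z = H theta + v,
    # v ~ N(0, sigma2 I), with the prior theta ~ N(0, (1/alpha) I).
    n = len(z)
    C = H @ H.T / alpha + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + z @ np.linalg.solve(C, z) + n * np.log(2 * np.pi))

def empirical_bayes_map(z, H, sigma2, alphas):
    # Choose the regularization hyperparameter by maximizing the evidence
    # over a grid, then return the corresponding MAP estimate of theta.
    best = max(alphas, key=lambda a: log_evidence(z, H, a, sigma2))
    A = H.T @ H / sigma2 + best * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ z / sigma2), best

# Illustrative example with simulated data.
rng = np.random.default_rng(1)
H = rng.normal(size=(120, 10))
theta_true = rng.normal(scale=0.5, size=10)
z = H @ theta_true + rng.normal(scale=1.0, size=120)
theta_hat, alpha_hat = empirical_bayes_map(z, H, 1.0, np.logspace(-3, 3, 25))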

Overall, we argue that the goal in the future development of the methods is a full empirical Bayesian recursive multichannel method for the estimation of brain activity, with realistic observation and evolution models for the evoked potentials.


