
Karjalainen, Pasi A. Regularization and Bayesian methods for evoked potential estimation.
Kuopio University Publications C. Natural and Environmental Sciences 61. 1997. 139 p.
ISBN 951-781-559-X
ISSN 1235-0486

ABSTRACT

Evoked potentials (EP) are often defined as potentials caused by the electrical activity of the central nervous system after stimulation of the sensory system. In the analysis of evoked potentials the fundamental problem is to extract information about the potential from measurements that also contain the ongoing background electroencephalogram (EEG).

The most widely used tool for the analysis of evoked potentials has been averaging of the measurements over an ensemble of trials. This is the optimal way to improve the signal-to-noise ratio when the underlying model for the observations is that the evoked potential is a deterministic signal in independent additive background noise of zero mean. However, for over three decades it has been evident that the nature of evoked potentials is more or less stochastic. In particular, the latencies and the amplitudes of the peaks in the potentials can vary stochastically between repetitions of the stimuli.

Currently the goal in the analysis of evoked potentials is to obtain the best possible estimates for single potentials. The most common approach to this estimation is to form an estimator (filter) with which the unwanted contribution of the EEG can be filtered out from the evoked potential. A major difficulty in this task is often the very low signal-to-noise ratio.

There are two major aims in this thesis. The first is to review the existing methods for evoked potential estimation from the viewpoint of estimation theory in a unified and consistent formalism. Special attention is paid to the implicit assumptions about the evoked potentials that are made in the existing methods. The second aim is to form a unified estimation scheme for single evoked potentials. The proposed estimation scheme is based on Bayesian estimation and regularization theory.

Two practical estimation procedures are proposed. The first is applicable when the evoked potentials are assumed to be random samples from a time-independent distribution. The other is applicable if the parameters of the evoked potentials have trend-like variations between the repetitions of the stimuli. The proposed methods are evaluated using both simulated data and real measurements. Two systematic methods for the simulation of evoked potentials are also proposed for this purpose.

AMS Subject Classification: 93E10, 62H25, 62L12
National Library of Medicine Classification: WL 102, WL 150, QT 36
INSPEC Thesaurus: bioelectric potentials; estimation theory; Bayes methods; inverse problems; Kalman filters


To Tuula <strong>and</strong> Väinö


Acknowledgements

This work was carried out in the Department of Applied Physics at the University of Kuopio during 1995-1996.

I warmly thank my supervisor Associate Professor Jari Kaipio, Ph.D., for his guidance and advice during all phases of this work. Most of the results in this thesis took their final form during long discussions with him.

I am grateful to my second supervisor Professor Lauri Patomäki, Ph.D., for his support of this work and for the opportunity to work in the Department of Applied Physics in various positions during the last ten years.

I thank the official reviewers Professor Jouko Lampinen, Ph.D., and Associate Professor Erkki Somersalo, Ph.D., for their many suggestions and advice on how to improve this thesis.

I thank Marko Vauhkonen, M.Sc., whose work in the field of impedance tomography has been a link between me and regularization theory.

I also thank Anu Koistinen, M.Sc., who has done great work in keeping the bibliographic databases up to date. She has also provided the measurements, for which I also thank the Department of Clinical Neurophysiology, University of Kuopio.

I thank all my co-workers and students of the biomedical inverse problems group and the staff of the Department of Applied Physics.

Finally I thank my loving Tuula for her understanding while I prepared this thesis.

Kuopio, May 1997

Pasi Karjalainen


Abbreviations

EP      Evoked potential
ERP     Event related potential
EEG     Electroencephalogram
SNR     Signal-to-noise ratio
BAEP    Brainstem auditory evoked potential
ML      Maximum likelihood
MS      Mean square
LMMS    Linear minimum mean square
GMS     Generalized mean square
UC      Uniform cost
MAP     Maximum a posteriori
GM      Gauss–Markov
LS      Least squares
GLS     Generalized least squares
LMV     Linear minimum variance
GCV     Generalized cross-validation
RLS     Recursive least squares
LMS     Least mean square
NLMS    Normalized least mean square
PC      Principal component
ARMA    Autoregressive moving average
AR      Autoregressive
MA      Moving average
FIR     Finite impulse response
ARX     Autoregressive with exogenous input
LCA     Latency corrected averaging
CCA     Cross-correlation averaging
APWF    A posteriori "Wiener" filtering
PCA     Principal components analysis
MWA     Moving window averaging
EWA     Exponential window averaging

Notations

x ∼ N(η_x, C_x)       Jointly Gaussian random vector x = (x_1,...,x_n)^T
(·)^T                 Transpose
p(x)                  Joint density function of random vector x
p(x|y)                Conditional density function of x given y
η_x, E{x}             Expected value of x
E_x{f(x,y)}           Expected value of f(x,y) with respect to x
C_x                   Covariance of x
R_x                   Correlation of x
η_{x|y}, E{x|y}       Expected value of x given y, conditional mean
C_{x|y}               Conditional covariance of x given y
z                     Vector of observations z = (z_1,...,z_n)^T
h                     Model for observation
H                     Linear model for observation
θ                     Parameter vector
ˆθ(z)                 Estimator of parameter vector θ based on observations z
ˆθ                    Estimate of parameter vector θ
˜θ                    Estimation error
C(θ, ˆθ)              Cost function
B(ˆθ)                 Bayes cost
B(ˆθ|z)               Conditional Bayes cost
ˆθ_B                  Bayesian estimate
trace(A)              Trace of matrix A
W                     Weighting matrix
l                     Least squares index
J                     Jacobian
R(H)                  Range of H
N(H)                  Nullspace of H
ψ                     Basis vector
diag(λ_1,...,λ_p)     Diagonal matrix
L                     Regularization matrix
α                     Regularization parameter
S                     Subspace
S⊥                    Orthogonal complement of S
dim(S)                Dimension of S
K_S                   Matrix of basis vectors of subspace S
rank(A)               Rank of matrix A
q^{-1}                Time delay operator
A(q)                  Polynomial of operator q^{-1}
S(ω)                  Spectrum
Γ                     Matrix of eigenvectors
Λ                     Diagonal matrix of eigenvalues
U, V                  Matrices of left and right singular vectors, respectively
S                     Diagonal matrix of singular values
K_t                   Kalman gain
E{x}                  The empirical expectation, sample mean


CONTENTS

1 Introduction  15

2 Estimation theory  19
2.1 Introduction  19
2.2 Probability theory  20
2.3 Bayesian estimation  23
2.4 Maximum likelihood estimation  23
2.5 Bayes cost method  24
2.6 Mean square estimation  25
2.7 Maximum a posteriori estimation  27
2.8 Linear minimum mean square estimator  28
2.9 Minimum mean square estimator for Gaussian variables  29
2.10 Mean square estimation with observation model  31
2.11 Gauss–Markov estimate  32
2.12 Least squares estimation  33
2.13 Comparison of ML, MAP and MS estimates  36
2.14 Selection of the basis vectors  39
2.15 Modeling of prior information in Bayesian estimation  40
2.16 Recursive mean square estimation  41
2.17 Time-varying linear regression  45
2.18 Properties of the MS estimator  47

3 Regularization theory  49
3.1 Introduction  49
3.2 Tikhonov regularization  50
3.3 Principal component based regularization  52
3.4 Subspace regularization  54
3.5 Selection of the regularization parameters  55

4 Time series models  57
4.1 Stochastic processes  57
4.2 Stationary time series models  59
4.3 Time-dependent time series models  60
4.4 Adaptive filtering  62
4.5 Wiener filtering as estimation problem  63

5 Simulation of evoked potentials  66
5.1 Simulation of the background EEG  66
5.2 Component based simulation of evoked potentials  67
5.3 Principal component based simulation of evoked potentials  72
5.4 Discussion  75

6 Estimation of evoked potentials  76
6.1 Introduction  76
6.2 Ensemble analysis  77
6.2.1 Averaging  78
6.2.2 Weighted and selective averaging  78
6.2.3 Latency-dependent averaging  81
6.2.4 Filtering of the average response  82
6.2.5 Deconvolution methods  83
6.2.6 Principal components analysis  83
6.2.7 Classification  84
6.2.8 Other ensemble analysis methods  84
6.3 Single trial estimation  85
6.3.1 Filtering of single responses  85
6.3.2 Time-varying Wiener filtering  85
6.3.3 Linear least squares estimation  86
6.3.4 Adaptive filtering  86
6.3.5 Principal component regression approach  88
6.3.6 Subspace regularization of evoked potentials  89
6.3.7 Smoothness priors estimation of evoked potentials  89
6.3.8 Other single trial estimation methods  90
6.4 Miscellaneous topics  91
6.4.1 Frequency domain methods  91
6.4.2 Parametric modeling  91
6.4.3 Estimation of the peak latencies  93
6.5 Dynamical estimation of evoked potentials  93
6.5.1 Windowed averaging  94
6.5.2 Frequency domain filtering  94
6.5.3 Adaptive algorithms  94
6.5.4 Recursive mean square estimation of evoked potentials  95
6.5.5 Other dynamical estimation methods  96
6.6 Discussion  96

7 Two methods for evoked potential estimation  98
7.1 A systematic method for single trial estimation  98
7.2 A systematic method for dynamical estimation  100
7.3 The data sets  101
7.4 Case study 1  103
7.5 Case study 2  109
7.6 Case study 3  114
7.7 Case study 4  120
7.8 Discussion  122

8 Discussion and conclusions  124

References  126


CHAPTER I

Introduction

Evoked potentials (EP) were originally defined to be potentials that are caused by the electrical activity of the central nervous system after stimulation of the sensory system. Currently, all electrical potentials caused by physical stimulation are more often called evoked potentials. Evoked potentials can be classified into exogenous and endogenous evoked potentials [11]. Exogenous evoked potentials are determined by the physical characteristics of the stimulus only. Endogenous evoked potentials are determined by the psychological significance of the stimulus. Cognitive evoked potentials are an example of endogenous evoked potentials. Potentials are sometimes also emitted without any clear physical stimulus. Therefore it is currently common to use the notion of event related potentials (ERP). Evoked potentials are then event related potentials that are immediate or intermediate responses to physical stimuli.

In this thesis we are concerned with analysis methods that necessitate a physical trigger signal, that is, a physical stimulus. Therefore we use only the notion of evoked potential here. In most cases we are interested in evoked potentials that originate in the human brain. The potentials are usually measured from the outer layer, the scalp, of the human head. The measured potential is then a superposition of all electrical activity that originates in the head. This activity includes the electrical activity of muscles and eye movements as well as the spontaneous electrical activity of the brain. We call this activity the background electroencephalogram (EEG). The background EEG is usually assumed to be uncorrelated with the stimulus. We do not review the physiological origin of the evoked potentials in this thesis; for that see e.g. [162, 170, 150, 92]. The measurement techniques are also not reviewed here; for these see e.g. [179].

In the analysis of evoked potentials the fundamental problem is to extract information about the potential from measurements that also contain the ongoing background EEG. The most widely used tool for the analysis of evoked potentials has been averaging of the measurements over an ensemble of trials. In the mean square sense this is the optimal way to improve the signal-to-noise ratio (SNR) when the underlying model for the observations is that the evoked potential is a deterministic signal in independent additive background noise of zero mean.
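The SNR gain from plain ensemble averaging is easy to verify numerically. The following minimal sketch is not part of the thesis: the Gaussian peak waveform, the sampling, the noise level and the trial count are all illustrative assumptions. It averages simulated sweeps in which a fixed response is buried in independent zero-mean noise; the error of the average shrinks roughly as 1/sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 0.5, 256)                       # 0.5 s epoch, arbitrary sampling
ep = 5.0 * np.exp(-0.5 * ((t - 0.3) / 0.03) ** 2)    # assumed deterministic peak (arbitrary units)

n_trials = 200
noise_sd = 20.0                                      # background EEG much stronger than the EP
trials = ep + noise_sd * rng.standard_normal((n_trials, t.size))

average = trials.mean(axis=0)                        # ensemble average over trials

def snr_db(signal, estimate):
    err = estimate - signal
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(err ** 2))

print("single trial SNR: %.1f dB" % snr_db(ep, trials[0]))
print("average SNR:      %.1f dB" % snr_db(ep, average))   # roughly 10*log10(n_trials) higher
```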

However, for over three decades it has been evident that the nature of evoked potentials is more or less stochastic. In particular, the latencies and the amplitudes of the peaks in the potentials can vary stochastically between the repetitions of the stimuli [22]. In addition, the variations can be trend-like, and the means of the latencies or the amplitudes can change during the test. The information about these kinds of variations in evoked potentials vanishes when the signal is averaged. The resulting estimates for the potentials possibly do not correspond to any physical or neuroanatomical situation, and thus inference about the neurophysical structure is difficult. A concrete example is the use of average evoked potentials in source estimation, that is, the estimation of the location and other parameters of the electrical sources in the brain.

Statistically, the average evoked potential, the sample mean, is an example of the use of first order statistics, in which only the first moment of the population parameters is estimated. The next evident improvement is to use second order statistics, that is, covariance analysis. Typical examples of second order methods are principal component analysis [90] and factor analysis [173] of the potentials. With these methods a measure of the variation of the potentials can also be obtained. However, with these methods too the information carried by a single measurement is lost, as is all the information about the time variation between the repetitions of the test.

Currently the goal in the analysis of evoked potentials is to get information about single potentials, that is, to obtain the best possible estimate for a single potential. The notion of single trial analysis can be used in this context. The most common way to do single trial analysis is to form an estimator (filter) with which the unwanted contribution of the ongoing background activity can be filtered out from the evoked potential. A major difficulty in this task is often the very low signal-to-noise ratio, typically SNR < −10 dB.

In any such filtering or estimation method the performance of the estimator depends on the properties of the underlying signals. In the most realistic methods some models for the measurements and the signals are assumed, and the estimator that minimizes the mean square criterion is then derived. The performance of the estimators then depends strongly on how realistic the assumptions are. The most common assumptions concern the second order statistics of the evoked potentials and the ongoing background EEG. In addition, we sometimes have prior information about the evoked potentials. The information can relate, e.g., to the shape or the location of the peaks in the potential. Generally, we should take this prior information into account to obtain the best possible estimates.

In the literature we can find two approaches for taking the prior information or beliefs about the parameters (signal) to be estimated into account. The first is the Bayesian approach, with which the prior information can be taken into account in the statistical sense. This procedure usually necessitates some prior statistical information about the parameters in the form of prior distributions. The other approach, the use of regularization methods, arises from the field of ill-posed problems. In this class of methods the prior information about the underlying parameters can usually be implemented in the estimation procedure in a straightforward way. Typical interpretations of such knowledge are that the potentials are small, almost equal or slowly varying. The regularization approach has a direct connection to the Bayesian approach: the regularization solutions can usually be seen as Bayesian point estimates with some prior densities.

There are two key problems in the estimation of evoked potentials. The first is the implementation of the prior information in the estimation procedure. Due to the low signal-to-noise ratio, all the prior information that is available should be used in the estimation. A major problem here is the formulation of the information in mathematical form. The other problem is the evaluation of the proposed estimation methods. The most common way to form a "new" estimation method for evoked potentials seems to have been the blind use of some estimation method without proper investigation of the implicit assumptions. Comparison of different methods necessitates the use of a unified formalism for all the methods.

The aims and contents of the thesis

In this thesis "we" is used to denote the author of the thesis. There are two major aims in this thesis. The first is to review the existing methods for evoked potential estimation from the viewpoint of estimation theory in a unified and consistent formalism. Special attention is paid to finding the implicit assumptions about the evoked potentials that are made in the proposed methods. The second aim is to form a unified estimation scheme for single evoked potentials. The proposed estimation scheme is based on Bayesian estimation and regularization theory.

Some of the methods proposed here are discussed in [100, 101, 211, 212, 213, 214, 96]. The aim of this thesis is not only to summarize the results but also to give a systematic and detailed presentation of the methods and their theoretical background. We also present practical procedures for the application of the methods to real measurements. Although the estimation schemes are proposed in algorithmic form, the most important thing is their general structure. They can easily be modified for different models of evoked potentials.

In Chapter 2 the estimation theoretical background that is relevant for the rest of the thesis is presented. The main purpose of the chapter is to introduce a unified formalism with which the rest of the thesis – regularization theory, time series analysis, Bayesian statistics and the existing analysis methods for evoked potential studies – can be combined. Most of the topics are discussed in detail to make it possible to compare the different formulations discussed in the thesis in a consistent way.

In Chapter 3 regularization theory is discussed. The main purpose is to review the regularization methods from the viewpoint of Bayesian estimation and to introduce the forms of subspace regularization and principal component regression. These are used in Chapters 4-7.

In Chapter 4 some topics of time series analysis are discussed. Time series modeling and Wiener filtering are studied. It is also shown that the adaptive algorithms have a special connection to regularization theory.

Chapters 5, 6 and 7 contain the main novel parts of this thesis. In Chapter 5 two simulation methods for evoked potentials are introduced. Both methods are presented as systematic methods for different simulation needs, and the applicability of the methods is then discussed. In Chapter 6 the existing methods for evoked potential analysis are reviewed. The methods are analyzed in detail in light of estimation and regularization theory. In Chapter 6 we also introduce some novel estimation methods that form the basis for the two practical estimation procedures proposed in Chapter 7. The estimation methods are proposed in a general form that can easily be modified to correspond to more complicated assumptions about the physical reality. In Chapter 7 the proposed methods are also evaluated using both simulated data and real measurements. The applicability of the so-called generalized cross-validation criterion for the determination of the regularization parameters is also evaluated and discussed.

Chapter 8 contains the overall discussion and conclusions of this thesis.


CHAPTER II

Estimation theory

In this chapter we discuss estimation theory. The main goal is to present the estimation theoretical background that is necessary for understanding the properties of the regularization methods that are discussed in Chapter 3, as well as the time series modeling methods that are discussed in Chapter 4. A unified formalism that is introduced in this chapter is used throughout the thesis. With this formalism the results of estimation theory can be combined with regularization theory, time series analysis methods and Bayesian statistics. All these approaches are presented in multivariate vector formalism. In particular, the review of the existing methods for evoked potential analysis in Chapter 6 is based strictly on this presentation. This makes it possible to analyze the implicit assumptions made in the methods.

Some of the derivations are given in more detail than others. Many of the intermediate equations are needed in further analysis. A detailed presentation is also reasonable since this estimation theoretical background cannot be considered commonly known in the evoked potential community. The main sources for the theory discussed in this chapter are the references [192, 193, 152, 133, 115, 142, 66].

2.1 Introduction

In estimation theory the problem is to calculate (estimate) the parameters describing the underlying reality from measurements that may contain errors. We denote the measurements with a vector z. With the vector θ we denote the parameters that are to be estimated. We use ˆθ for an estimate of θ and ˆθ(z) for the estimator, the function

ˆθ = ˆθ(z)   (2.1)

that connects the observations to the estimate. For the estimation error, the difference between the actual and the estimated value of the parameter, we use ˜θ = θ − ˆθ. An estimator for which the expectation of the estimation error is zero, E{˜θ} = 0, is said to be unbiased.

When θ is random we speak about Bayesian estimation, and the goal is to find characteristics of the posterior density p(θ|z) of θ. Such estimates are e.g. the mean square (MS) and maximum a posteriori (MAP) estimates. These can be derived as minimizers of the expectation of some specific functions of the estimation error.

When θ is treated as unknown but non-random, we do not use any information about the prior density of θ. Usually the estimation rule is then to minimize some estimation criterion. This criterion can still be probabilistic, as in the Gauss–Markov estimate, since the estimator ˆθ(z) can still be random through the observation model. In some cases, as in least squares estimation, the estimation criterion is fully deterministic. We refer to non-random parameters with the notion of unknown parameters [192].

In some cases the observations z are related to the parameters θ by an observation model. For example

z = h(θ, v)   (2.2)

where v are random measurement errors, is a typical observation model. The notion of nuisance parameters is sometimes used for v [66]. In the most common cases v is additive, and the most general observation model used in this thesis is of the form

z = h(θ) + v   (2.3)

This is called the additive noise model. In many cases we restrict our attention to linear observation models that can be written in the form

z = Hθ + v   (2.4)

where H is a matrix that does not contain parameters to be estimated. In all observation models θ can be random or fixed but unknown. In this thesis we set θ ∈ R^p and z, v ∈ R^M.
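As a concrete illustration of the linear observation model (2.4), the sketch below builds an observation matrix H whose columns are sampled basis vectors and draws one noisy observation z = Hθ + v. The Gaussian-shaped basis vectors, the dimensions and the noise level are illustrative assumptions only and are not the model used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

M, p = 256, 8                                   # M samples per sweep, p basis vectors
t = np.linspace(0.0, 0.5, M)
centers = np.linspace(0.05, 0.45, p)

# Columns of H are sampled Gaussian bumps (an assumed basis, for illustration only).
H = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.03) ** 2)

theta = rng.normal(size=p)                      # parameter vector (random in the Bayesian setting)
v = 0.5 * rng.standard_normal(M)                # additive observation noise
z = H @ theta + v                               # linear observation model z = H theta + v
print(H.shape, z.shape)                         # (256, 8) (256,)
```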

The observations z are sometimes called the dependent variables, especially in the statistical literature. If the model H contains observations x, these are called the independent variables, regressor variables, predictor variables [91] or the explanatory variables [23]. For the matrix H we use the notion of observation matrix [133]. The notions of model matrix, carrier matrix [23] and design matrix [66] are also used.

When an observation model is used, the estimation can also be based on the minimization of some norm of the residual r = z − h(θ) with respect to θ. With a special selection of this norm and special assumptions about the joint density of θ and z this approach coincides with the Bayesian approach.

2.2 Probability theory

The selection of the details in this section is based on the needs of the following sections and chapters. We shall not review the fundamental definitions of probability theory, such as probability space, elementary events, random variables and probability measure, here. These are presented e.g. in [152]. For the definitions of the distribution and density functions of multidimensional random variables we refer to [171]. In this thesis random variables are not systematically denoted differently from deterministic variables. Usually all the random variables are real and vector valued. The joint density of the components of the random vector x = (x_1,...,x_n)^T is denoted by p(x). The superscript notation (·)^T denotes the transpose. The joint density of the components of two random vectors x and y is denoted by p(x,y). We consider only continuous distribution functions.

The probability densities of the single random variables x_i, the marginal densities p_i(x_i), are obtained by integration

p_i(x_i) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} p(x_1,...,x_n) dx_1 · · · dx_{i−1} dx_{i+1} · · · dx_n   (2.5)

The mean or expected value η_x ∈ R^n of x is

η_x = E{x} = ∫_{−∞}^{∞} x p(x) dx   (2.6)
           = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} (x_1,...,x_n)^T p(x_1,...,x_n) dx_1 · · · dx_n   (2.7)
           = (η_{x_1},...,η_{x_n})^T   (2.8)

With the notation E{x} we mean that the integral is over all random variables inside the braces. If x and y are random vectors,

E{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x,y) dx dy   (2.9)

With the subscript notation

E_x{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x) dx   (2.10)

we mean that the expectation is taken only over the random variables x. The non-normalized correlation matrix of a random vector x is defined as

R_x = E{xx^T} = [ E{x_1 x_1} · · · E{x_1 x_n} ; ... ; E{x_n x_1} · · · E{x_n x_n} ]   (2.11)

that is, the componentwise expectation of the outer product of x with itself. The cross-correlation of random vectors x and y is defined as

R_xy = E{xy^T}   (2.12)

Covariance is the correlation of the random vector (x − η_x),

C_x = E{(x − η_x)(x − η_x)^T} = E{xx^T} − η_x η_x^T   (2.13)

and the cross-covariance of x and y is defined as

C_xy = E{(x − η_x)(y − η_y)^T} = E{xy^T} − η_x η_y^T   (2.14)
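In practice the moments (2.6), (2.13) and (2.14) are replaced by sample (ensemble) estimates computed from repeated realizations. A minimal numpy sketch of such sample estimates is given below; the distributions and dimensions are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# N independent realizations of two correlated random vectors x (3-dim) and y (2-dim).
N = 10_000
x = rng.multivariate_normal([1.0, 0.0, -1.0], np.diag([1.0, 2.0, 0.5]), size=N)
y = x[:, :2] + 0.3 * rng.standard_normal((N, 2))

eta_x = x.mean(axis=0)                              # sample estimate of eta_x = E{x}
C_x = (x - eta_x).T @ (x - eta_x) / N               # sample covariance, cf. (2.13)
C_xy = (x - eta_x).T @ (y - y.mean(axis=0)) / N     # sample cross-covariance, cf. (2.14)

print(eta_x)
print(np.allclose(C_x, np.cov(x, rowvar=False, bias=True)))  # same estimate as numpy's
print(C_xy.shape)                                   # (3, 2)
```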


Clearly, we have C_xy = C_yx^T. The conditional density of x given y is defined to be

p(x|y) = p(x,y) / p(y)   (2.15)

whenever p(y) > 0, and 0 otherwise. Clearly, we can also write

p(y|x) = p(x,y) / p(x)   (2.16)

and we obtain

p(x|y) p(y) = p(y|x) p(x)   (2.17)

This is called Bayes' theorem. The conditional mean of x given y is

η_{x|y} = E{x|y} = ∫_{−∞}^{∞} x p(x|y) dx   (2.18)

which is a function of the random variable y. If x and y are random vectors,

E{f(x,y)|y} = ∫_{−∞}^{∞} f(x,y) p(x|y) dx   (2.19)

Using (2.10), (2.19) and (2.15) we can derive a useful result

E{f(x,y)} = ∫_{−∞}^{∞} f(x,y) p(x,y) dx dy   (2.20)
          = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} f(x,y) p(x|y) dx ] p(y) dy   (2.21)
          = E_y{E{f(x,y)|y}}   (2.22)

The conditional covariance of x given y is [152]

C_{x|y} = E{(x − η_{x|y})(x − η_{x|y})^T | y} = E{xx^T | y} − η_{x|y} η_{x|y}^T   (2.23)

The random variables x_i are said to be statistically independent if

p(x) = ∏_i p_i(x_i)   (2.24)

and jointly Gaussian if their joint density can be written in the form

p(x) = (det C_x)^{−1/2} (2π)^{−n/2} exp( −(1/2)(x − η_x)^T C_x^{−1} (x − η_x) )   (2.25)

where det C_x denotes the determinant of C_x. When x is jointly Gaussian with mean η_x and covariance C_x, this is denoted by

x ∼ N(η_x, C_x)   (2.26)


2.3 Bayesian estimation

In Bayesian estimation we assume that the parameters θ are random, having some joint density p(z,θ) with the measurements z. In Bayesian estimation the goal is to solve the posterior density p(θ|z) of the parameters given the observations. In Bayesian point estimation some specific statistic of the posterior density is solved. Typical point estimates include e.g. the mean and the mode of the density p(θ|z). These are derived in the following sections under some assumptions about θ.

From (2.17) we obtain for the posterior density

p(θ|z) = p(z|θ) p(θ) / p(z)   (2.27)
       ∝ p(z|θ) p(θ)   (2.28)

where p(z) = ∫ p(θ,z) dθ = ∫ p(θ) p(z|θ) dθ. The posterior density thus contains two parts. The first part is p(z|θ), which is called the likelihood of the data. This part of the posterior density depends on the data z and thus contains the dependence of the parameters to be estimated on the data. The function p(θ) does not depend on the observed data and we call it the prior density of θ. It contains the prior assumptions about the parameters. Since θ is assumed to be random, it has some joint density that can further depend on some parameters φ having some prior density p(φ). We call such a model hierarchical and the parameters φ hyperparameters.

Sometimes it is possible to use prior densities that contain no information about the parameters. We call such densities noninformative [66]. A typical example is the function p(θ) = c, where c is a constant. This is an example of an improper density [26], since it does not integrate to unity. An improper prior density can be used in Bayesian analysis if the posterior density is a proper density.
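When θ is low-dimensional, the factorization (2.27)-(2.28) can be evaluated directly on a grid: multiply the likelihood by the prior and normalize by p(z). The sketch below does this for a single scalar parameter; the Gaussian prior and likelihood are assumed purely for illustration and are not taken from the thesis.

```python
import numpy as np

# Grid over a scalar parameter theta.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

z = 1.3                                        # one observation z = theta + v, v ~ N(0, 1) (assumed)
prior = gauss(theta, 0.0, 4.0)                 # p(theta), assumed N(0, 4)
likelihood = gauss(z, theta, 1.0)              # p(z | theta)

posterior = likelihood * prior
posterior /= np.sum(posterior) * dtheta        # divide by p(z), cf. (2.27)

post_mean = np.sum(theta * posterior) * dtheta # conditional mean (MS estimate)
post_mode = theta[np.argmax(posterior)]        # posterior mode (MAP estimate)
print(post_mean, post_mode)                    # both close to 0.8 * z in this Gaussian case
```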

2.4 Maximum likelihood estimation

In maximum likelihood estimation the probability density function of the observation z given the unknown parameter θ is assumed to be known. No probability density of θ is required. For given data z, ˆθ_ML = ˆθ_ML(z) is the maximum likelihood estimate if

p(z|ˆθ_ML) ≥ p(z|ˆθ)   (2.29)

for all ˆθ. Thus ˆθ_ML maximizes the likelihood function p(z|θ) for the given data z. In other words, the selection of ˆθ_ML makes the measurement z most probable in the class of all probability densities that are of the form p(z|ˆθ) [133, 102]. The maximum likelihood estimate is often thought of as the "true" estimate when θ is treated as non-random, and other estimates are usually compared to it. The reason is that the maximum likelihood estimate has many desirable properties of a "good" estimate; see e.g. [39] and [102].
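As a concrete example, let z_1,...,z_N be independent observations with z_i ∼ N(θ, σ²) and σ² known. Maximizing the likelihood (2.29) over θ then gives the sample mean. The sketch below checks this numerically by maximizing the log-likelihood on a grid; the data, the true value and the grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 2.0
theta_true = 0.7
z = theta_true + np.sqrt(sigma2) * rng.standard_normal(500)   # z_i ~ N(theta, sigma2)

def log_likelihood(theta):
    # log p(z | theta) for independent Gaussian observations, additive constants dropped
    return -0.5 * np.sum((z - theta) ** 2) / sigma2

grid = np.linspace(-2.0, 2.0, 4001)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_ml, z.mean())   # the grid maximizer agrees with the closed-form ML estimate z.mean()
```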


2.5 Bayes cost method

If we assume that θ is a random vector with a known joint density p(θ,z) with the observation z, we have made the so-called Bayesian assumption [142]. This assumption leads to the so-called Bayes cost method for solving the estimator ˆθ(z). For the cost method we define the function C(θ, ˆθ) that assigns to each combination of actual parameter value and estimate a unique real valued cost. We call C(θ, ˆθ) the cost function. The expected value of the cost is given by

B(ˆθ) = E{C(θ, ˆθ(z))} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(θ,z) dθ dz   (2.30)

which is called the Bayes cost. From (2.15) we obtain p(θ,z) = p(z|θ) p(θ), and the expectation can be written in the form

B(ˆθ) = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(z|θ) dz ) p(θ) dθ   (2.31)

The inner integral is clearly the conditional expectation of the cost given θ, and we write

B(ˆθ|θ) = ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(z|θ) dz   (2.32)
        = E{C(θ, ˆθ) | θ}   (2.33)

This can be called the conditional Bayes cost given θ, and in terms of the conditional cost the Bayes cost of the estimator can be written as

B(ˆθ) = ∫_{−∞}^{∞} B(ˆθ|θ) p(θ) dθ   (2.34)
      = E_θ{E{C(θ, ˆθ) | θ}}   (2.35)
      = E_θ{B(ˆθ|θ)}   (2.36)

Similarly, using p(θ,z) = p(θ|z) p(z) we can write

B(ˆθ) = E{C(θ, ˆθ(z))}   (2.37)
      = E_z{E{C(θ, ˆθ) | z}}   (2.38)
      = E_z{B(ˆθ|z)}   (2.39)

where

B(ˆθ|z) = ∫_{−∞}^{∞} C(θ, ˆθ(z)) p(θ|z) dθ   (2.40)
        = E{C(θ, ˆθ) | z}   (2.41)


is the conditional Bayes cost given z.

Now we can state the Bayes estimation criterion: for a given cost function C(θ, ˆθ), the Bayesian estimator ˆθ_B is selected so that

B(ˆθ_B) ≤ B(ˆθ)   (2.42)

for all ˆθ. Thus the Bayesian estimator is the one that minimizes the Bayes cost [133].

Different choices of the cost function lead to different estimators, and most common estimators can often be seen as minimizers of some specific cost function.
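The effect of the cost function can be illustrated numerically: for a fixed (here deliberately skewed and bimodal) posterior given on a grid, the conditional Bayes cost (2.41) can be evaluated for every candidate estimate and minimized directly. The sketch below is an illustration only; it anticipates the mean square cost of Section 2.6 and the uniform cost of Section 2.7 and shows that their minimizers are the posterior mean and the posterior mode, respectively.

```python
import numpy as np

# A deliberately skewed "posterior" p(theta|z) on a grid (assumed for illustration only).
theta = np.linspace(0.0, 10.0, 2001)
dtheta = theta[1] - theta[0]
posterior = 0.7 * np.exp(-0.5 * (theta - 3.0) ** 2 / 0.25) + 0.3 * np.exp(-0.5 * (theta - 6.0) ** 2 / 1.0)
posterior /= posterior.sum() * dtheta

def bayes_cost(cost):
    # Conditional Bayes cost B(theta_hat | z) evaluated for every candidate theta_hat on the grid.
    return np.array([np.sum(cost(theta, th) * posterior) * dtheta for th in theta])

eps = 0.05
cost_ms = lambda t, th: (t - th) ** 2                 # mean square cost, cf. (2.43)
cost_uc = lambda t, th: (np.abs(t - th) > eps) * 1.0  # uniform cost, cf. (2.76)

theta_ms = theta[np.argmin(bayes_cost(cost_ms))]
theta_uc = theta[np.argmin(bayes_cost(cost_uc))]

post_mean = np.sum(theta * posterior) * dtheta
post_mode = theta[np.argmax(posterior)]
print(theta_ms, post_mean)   # mean square cost is minimized by the conditional mean
print(theta_uc, post_mode)   # uniform cost is minimized by the posterior mode (MAP)
```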

2.6 Mean square estimation

Consider the situation in which the cost function is of a specific form, namely the squared norm of the estimation error ˜θ = θ − ˆθ,

C_MS(θ, ˆθ) = ‖θ − ˆθ‖² = ˜θ^T ˜θ   (2.43)

From (2.39) and (2.41) we can write the Bayes cost in the form

B(ˆθ) = E{C(θ, ˆθ)}   (2.44)
      = E_z{B(ˆθ|z)}   (2.45)
      = E_z{E{C(θ, ˆθ) | z}}   (2.46)

The outer expectation does not depend on θ and we can thus minimize the conditional Bayes cost

B(ˆθ|z) = E{C(θ, ˆθ) | z}   (2.47)

Inserting the mean square cost function (2.43) into this we obtain

B(ˆθ|z) = E{(θ − ˆθ)^T (θ − ˆθ) | z}   (2.48)
        = E{θ^T θ − 2 θ^T ˆθ + ˆθ^T ˆθ | z}   (2.49)

Next, by definition, the conditional mean of θ given z is

η_{θ|z} = E{θ|z}   (2.50)

and the conditional covariance is

C_{θ|z} = E{(θ − η_{θ|z})(θ − η_{θ|z})^T | z}   (2.51)
        = E{θθ^T | z} − η_{θ|z} E{θ^T | z} − E{θ|z} η_{θ|z}^T + η_{θ|z} η_{θ|z}^T   (2.52)
        = E{θθ^T | z} − η_{θ|z} η_{θ|z}^T   (2.53)


Now, since B(ˆθ|z) is a scalar,

B(ˆθ|z) = trace B(ˆθ|z)   (2.54)
        = trace ( E{θ^T θ | z} − 2 η_{θ|z}^T ˆθ + ˆθ^T ˆθ )   (2.55)

where trace(A) is defined to be the sum of the diagonal elements of the square matrix A. We can use the identities [69]

trace(A + B) = trace(A) + trace(B)   (2.56)
trace(ABC) = trace(CAB) = trace(BCA)   (2.57)

and then

B(ˆθ|z) = trace ( E{θθ^T | z} − 2 ˆθ η_{θ|z}^T + ˆθ ˆθ^T )   (2.58)
        = trace ( C_{θ|z} + η_{θ|z} η_{θ|z}^T − 2 ˆθ η_{θ|z}^T + ˆθ ˆθ^T )   (2.59)
        = trace C_{θ|z} + trace ( (ˆθ − η_{θ|z})(ˆθ − η_{θ|z})^T )   (2.60)
        = trace C_{θ|z} + trace ( (ˆθ − η_{θ|z})^T (ˆθ − η_{θ|z}) )   (2.61)
        = trace C_{θ|z} + ‖ˆθ − η_{θ|z}‖²   (2.62)

The first term on the right hand side of the equation does not depend on ˆθ(z) and is clearly positive, and the second can be made zero by choosing ˆθ = η_{θ|z}. Therefore we conclude that the optimal Bayesian minimum mean square estimator is the function η_{θ|z}, that is, the conditional mean

ˆθ_MS = ∫_{−∞}^{∞} θ p(θ|z) dθ = E{θ|z} = η_{θ|z}   (2.63)

This result holds for all densities p(θ|z) [193]. The estimator ˆθ_MS is sometimes also called the conditional mean estimator. The expected value of the estimation error ˜θ can be written as

E{˜θ} = E_z{E{˜θ | z}}   (2.64)
      = E_z{E{θ − ˆθ_MS | z}}   (2.65)

Now, since E{E{θ|z} | z} = E{θ|z},

E{˜θ} = E_z{ ∫_{−∞}^{∞} θ p(θ|z) dθ − E{ˆθ_MS | z} }   (2.66)
      = E_z{ ˆθ_MS − E{ˆθ_MS | z} }   (2.67)
E{˜θ} = 0   (2.68)

This means that the mean square estimator is unbiased.

The preceding results are easily modified to include a symmetric positive semidefinite (weighting) matrix W. Let the generalized mean square cost function be

C_GMS(θ, ˆθ) = ˜θ^T W ˜θ   (2.69)

The conditional Bayes cost is then

B(ˆθ|z) = E{(θ − ˆθ)^T W (θ − ˆθ) | z}   (2.70)
        = ˆθ^T W ˆθ − 2 ˆθ^T W η_{θ|z} + E{θ^T W θ | z}   (2.71)
        = (ˆθ − η_{θ|z})^T W (ˆθ − η_{θ|z}) + E{θ^T W θ | z} − η_{θ|z}^T W η_{θ|z}   (2.72)

Positive semidefiniteness means that a^T W a ≥ 0 for any a. Thus the first term on the right hand side of the equation is nonnegative and can be made equal to zero by choosing

ˆθ_GMS = E{θ|z}   (2.73)

This also minimizes the conditional Bayes cost, since the other terms do not depend on ˆθ. The result is identical to the result for ˆθ_MS.

The fact that the conditional mean minimizes the generalized index C_GMS(θ, ˆθ) has an important implication: it means that the conditional mean minimizes the error of each component of ˜θ individually [192]. This can be seen by choosing the weighting matrix W so that only one diagonal element, say the i'th, is nonzero and equal to one, and all the other elements are zero,

W_i = diag(0, ..., 0, 1, 0, ..., 0)   (2.74)

where the single 1 is the i'th diagonal element. The minimization of the mean square cost function (2.43) can clearly be seen as the minimization of the sum

C_MS(θ, ˆθ) = ˜θ^T ˜θ = Σ_i ˜θ^T W_i ˜θ   (2.75)

and thus the conditional mean minimizes each squared error term ˜θ_i² individually.

2.7 Maximum a posteriori estimation

Let us now define another cost function, the uniform cost function [133]

C_UC(θ, ˆθ) = { 0,  ˜θ ∈ I
             { 1,  otherwise   (2.76)

where I = ]−ɛ, ɛ[ × · · · × ]−ɛ, ɛ[ ⊂ R^p and ɛ is small. This cost thus gives zero penalty if all components of the estimation error are small, and a unit penalty if any of the components is larger than ɛ. When C_UC(θ, ˆθ) is substituted into the equation of the conditional Bayes cost (2.41) we obtain

B_UC(ˆθ|z) = ∫_{˜θ ∈ Ī} p(θ|z) dθ   (2.77)
           = 1 − ∫_{˜θ ∈ I} p(θ|z) dθ   (2.78)

where Ī stands for the complement of I, and using the mean value theorem for integrals [5] there is a value, say ˆθ, in I for which

B_UC(ˆθ|z) = 1 − (2ɛ)^p p(ˆθ|z)   (2.79)

To minimize B_UC(ˆθ|z) we must maximize p(ˆθ|z), so ˆθ_UC can be defined by

p(ˆθ_UC|z) ≥ p(ˆθ|z)   (2.80)

for all ˆθ. Since ˆθ_UC maximizes the posterior density of θ given the observations z, ˆθ_UC is also called the maximum a posteriori estimate ˆθ_MAP.

ˆθ_UC is the mode of the density p(θ|z), and yet another name for the estimator is the conditional mode estimator. It can be shown that if the prior distribution of θ is uniform in a region containing the maximum likelihood estimate, then the maximum likelihood estimate is identical to the maximum a posteriori estimate, that is, ˆθ_ML = ˆθ_MAP [133]. Clearly ˆθ_MAP = ˆθ_MS if the mode of the density p(θ|z) equals the mean η_{θ|z}. This is the case when p(θ|z) is symmetric and unimodal.
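When p(θ|z) is differentiable, the MAP estimate is in practice usually computed by minimizing the negative log posterior instead of evaluating (2.80) directly. The sketch below is an illustration under an assumed Gaussian prior and a linear Gaussian observation model; in that special case the MAP and MS estimates coincide, which provides a convenient closed-form check.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

p, M = 4, 50
H = rng.standard_normal((M, p))
C_theta = np.eye(p)                      # assumed prior covariance, theta ~ N(0, I)
C_v = 0.5 * np.eye(M)                    # assumed noise covariance
theta_true = rng.standard_normal(p)
z = H @ theta_true + rng.multivariate_normal(np.zeros(M), C_v)

def neg_log_posterior(theta):
    # -log p(theta|z) up to an additive constant: likelihood term + prior term
    r = z - H @ theta
    return 0.5 * r @ np.linalg.solve(C_v, r) + 0.5 * theta @ np.linalg.solve(C_theta, theta)

theta_map = minimize(neg_log_posterior, np.zeros(p)).x

# For the linear Gaussian model the closed-form MS/MAP estimate is available for comparison.
theta_ms = np.linalg.solve(np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_v, H),
                           H.T @ np.linalg.solve(C_v, z))
print(np.allclose(theta_map, theta_ms, atol=1e-4))
```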

2.8 Linear minimum mean square estimator

In this section we restrict the form of the estimator to be a linear function of the data and derive the optimum estimator under this structural constraint. If certain conditions on the densities p(θ) and p(z) are fulfilled, this optimal linear estimator turns out to be the overall optimal estimator.

Let the estimator be constrained to be a linear function of the data

ˆθ = Kz   (2.81)

Let θ and z be random vectors with zero means and known covariances. No other assumptions are made about the joint distribution of the parameters and data. We derive the estimator that is of the form (2.81) and minimizes the mean square Bayes cost B_MS(ˆθ). We first note that

B_MS(ˆθ) = E{˜θ^T ˜θ}   (2.82)
         = E{(θ − ˆθ)^T (θ − ˆθ)}   (2.83)
         = trace E{(θ − ˆθ)(θ − ˆθ)^T}   (2.84)
         = trace C_˜θ   (2.85)

and that

C_˜θ = E{(θ − ˆθ)(θ − ˆθ)^T}   (2.86)
     = E{(θ − Kz)(θ − Kz)^T}   (2.87)
     = C_θ − K C_zθ − C_θz K^T + K C_z K^T   (2.88)
     = C_θ + (K − C_θz C_z^{−1}) C_z (K − C_θz C_z^{−1})^T − C_θz C_z^{−1} C_zθ   (2.89)

where it is assumed that C_z is invertible. Only the second term on the right hand side of the equation depends on the matrix K. The trace, and each term of the diagonal (note that the diagonal terms are quadratic and thus nonnegative), of the matrix E{(θ − ˆθ)(θ − ˆθ)^T} can be minimized by choosing

K = C_θz C_z^{−1}   (2.90)

so that we can write for the linear minimum mean square estimate

ˆθ_LMMS = C_θz C_z^{−1} z   (2.91)

and the estimation error covariance is, from (2.89),

C_˜θ_LMMS = C_θ − C_θz C_z^{−1} C_zθ   (2.92)

Using the transformations

θ′ = θ − E{θ}   (2.93)
z′ = z − E{z}   (2.94)

the result extends to variables with nonzero means [192]. The estimator (2.91) is optimum when its structure is restricted to be linear. No assumptions have been made about the densities of the measurements and the parameters.
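The gain (2.90) and the error covariance (2.92) can be checked by simulation: draw zero-mean θ and z with known second order statistics, apply K = C_θz C_z^{-1}, and compare the empirical error covariance with the formula. All covariances and dimensions in the sketch below are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed joint second order statistics of theta (2-dim) and z (3-dim), zero means.
C_theta = np.array([[2.0, 0.3], [0.3, 1.0]])
A = rng.standard_normal((3, 2))
C_z = A @ C_theta @ A.T + 0.4 * np.eye(3)     # z = A theta + noise, so C_z is known
C_theta_z = C_theta @ A.T                     # cross-covariance C_{theta z}

K = C_theta_z @ np.linalg.inv(C_z)            # optimal linear gain, cf. (2.90)

# Monte Carlo check of the error covariance formula (2.92).
N = 200_000
theta = rng.multivariate_normal(np.zeros(2), C_theta, size=N)
z = theta @ A.T + rng.multivariate_normal(np.zeros(3), 0.4 * np.eye(3), size=N)
err = theta - z @ K.T                         # estimation error theta - K z
C_err_mc = err.T @ err / N
C_err_formula = C_theta - C_theta_z @ np.linalg.inv(C_z) @ C_theta_z.T
print(np.round(C_err_mc, 3))
print(np.round(C_err_formula, 3))             # the two agree up to Monte Carlo error
```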

2.9 Minimum mean square estimator <strong>for</strong> Gaussian variables<br />

If θ <strong>and</strong> z are jointly Gaussian, then the linear minimum mean square estimator<br />

is not only the optimal linear estimator but the overall optimal estimator <strong>for</strong> θ.<br />

Let the joint density function <strong>for</strong> θ <strong>and</strong> z is (without scaling terms)<br />

⎧<br />

⎨ (<br />

p(θ,z) ∝ exp<br />

⎩ −1 2<br />

) ( ) −1 (<br />

θ T z T C θ C θz<br />

C zθ C z<br />

θ<br />

z<br />

) ⎫ ⎬<br />

⎭<br />

(2.95)<br />

where θ <strong>and</strong> z have been assumed to have zero means. The nonzero mean situation<br />

can be treated with the trans<strong>for</strong>mations (2.93) <strong>and</strong> (2.94). It was shown that the<br />

mean square estimate equals to the conditional mean<br />

ˆθ MS = E {θ|z} (2.96)


30 2. Estimation theory<br />

For the calculation of the mean we first have to form the equation for the posterior density p(θ|z). First we note that the matrix inversion lemma [69] gives for the inverse of the joint covariance of θ and z

\begin{pmatrix} C_\theta & C_{\theta z} \\ C_{z\theta} & C_z \end{pmatrix}^{-1} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}   (2.97)

where

C_{11} = (C_\theta - C_{\theta z} C_z^{-1} C_{z\theta})^{-1} = C_\theta^{-1} + C_\theta^{-1} C_{\theta z} C_{22} C_{z\theta} C_\theta^{-1}   (2.98)
C_{22} = (C_z - C_{z\theta} C_\theta^{-1} C_{\theta z})^{-1} = C_z^{-1} + C_z^{-1} C_{z\theta} C_{11} C_{\theta z} C_z^{-1}   (2.99)
C_{12} = C_{21}^T = -C_{11} C_{\theta z} C_z^{-1} = -C_\theta^{-1} C_{\theta z} C_{22}   (2.100)

The density of z can be written in the form

p(z) \propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} \theta^T & z^T \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & C_z^{-1} \end{pmatrix} \begin{pmatrix} \theta \\ z \end{pmatrix} \right\}   (2.101)

so that the posterior density p(θ|z) is obtained by forming

p(\theta|z) = \frac{p(\theta,z)}{p(z)}   (2.102)
\propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} \theta^T & z^T \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} - C_z^{-1} \end{pmatrix} \begin{pmatrix} \theta \\ z \end{pmatrix} \right\}   (2.103)
= \exp\left\{ -\frac{1}{2} \left( \theta^T C_{11} \theta + 2\theta^T C_{12} z + z^T (C_{22} - C_z^{-1}) z \right) \right\}   (2.104)
= \exp\left\{ -\frac{1}{2} \left( \theta^T C_{11} \theta - 2\theta^T C_{11} C_{\theta z} C_z^{-1} z + z^T C_z^{-1} C_{z\theta} C_{11} C_{\theta z} C_z^{-1} z \right) \right\}   (2.105)-(2.106)
= \exp\left\{ -\frac{1}{2} (\theta - C_{\theta z} C_z^{-1} z)^T C_{11} (\theta - C_{\theta z} C_z^{-1} z) \right\}   (2.107)

This is clearly a Gaussian density. The Gaussian conditional density is of the form

p(\theta|z) \propto \exp\left\{ -\frac{1}{2} (\theta - E\{\theta|z\})^T C_{\theta|z}^{-1} (\theta - E\{\theta|z\}) \right\}   (2.108)

Since \hat{\theta}_{MS} = E\{\theta|z\} and C_{\theta|z} = C_{\tilde{\theta}_{MS}}, comparing with (2.107) we can conclude

\hat{\theta}_{MS} = C_{\theta z} C_z^{-1} z   (2.109)
C_{\tilde{\theta}_{MS}} = C_\theta - C_{\theta z} C_z^{-1} C_{z\theta}   (2.110)

that is exactly the linear minimum mean square estimator.



2.10 Mean square estimation with observation model

Next we consider a model for the dependence between observations and parameters. The most common observation model is the so-called additive noise model

z = h(\theta) + v   (2.111)

Now we have to assume some joint density p(θ,v) for the random parameters θ and the observation error v. Then, at least theoretically, the joint density of z and θ is known and we can form either p(z|θ) for maximum likelihood estimation or the posterior density p(θ|z) of θ for Bayesian estimation. In general the density p(θ|z) is needed, e.g. for the mean square estimator, but in certain situations, such as the case of linear estimates, knowledge of the second-order statistics is enough.

Let us now constrain the observations to be of a specific linear form in the parameters

z = H\theta + v   (2.112)

where v and θ are random. Let θ and v have zero means and known covariances. The measurements z then have zero mean and covariance

C_z = E\{(H\theta + v)(H\theta + v)^T\}   (2.113)
= H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v   (2.114)

and the cross covariance C_{\theta z} is

C_{\theta z} = E\{\theta (H\theta + v)^T\} = C_\theta H^T + C_{\theta v}   (2.115)

Using these for the linear mean square estimate we obtain

\hat{\theta}_{LMMS} = (C_\theta H^T + C_{\theta v})(H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v)^{-1} z   (2.116)

with error covariance matrix

C_{\tilde{\theta}_{LMMS}} = C_\theta - (C_\theta H^T + C_{\theta v})(H C_\theta H^T + H C_{\theta v} + C_{v\theta} H^T + C_v)^{-1} (H C_\theta + C_{v\theta})   (2.117)

A special case of this is when θ and v are uncorrelated, C_{\theta v} = 0. Then the equations for the estimate and the error reduce to

\hat{\theta}_{LMMS} = C_\theta H^T (H C_\theta H^T + C_v)^{-1} z   (2.118)
C_{\tilde{\theta}_{LMMS}} = C_\theta - C_\theta H^T (H C_\theta H^T + C_v)^{-1} H C_\theta   (2.119)

Applying the matrix inversion lemma (2.97–2.100) we obtain

\hat{\theta}_{LMMS} = (C_\theta^{-1} + H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z = C_{\tilde{\theta}_{LMMS}} H^T C_v^{-1} z   (2.120)
C_{\tilde{\theta}_{LMMS}} = (C_\theta^{-1} + H^T C_v^{-1} H)^{-1}   (2.121)
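As a concrete illustration (an assumed numerical example, not part of the original derivation), the following minimal numpy sketch checks that the covariance form (2.118) and the information form (2.120)–(2.121) of the linear minimum mean square estimator agree for randomly generated covariances.

```python
import numpy as np

# Minimal check of the equivalence of (2.118) and (2.120)-(2.121),
# assuming uncorrelated theta and v with known covariances.
rng = np.random.default_rng(0)
M, p = 8, 3                                                       # measurements, parameters
A = rng.standard_normal((p, p)); C_theta = A @ A.T + p * np.eye(p)  # prior covariance
B = rng.standard_normal((M, M)); C_v = B @ B.T + M * np.eye(M)      # noise covariance
H = rng.standard_normal((M, p))                                     # observation matrix
z = rng.standard_normal(M)                                          # one measurement vector

# Covariance form (2.118)
theta_cov = C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_v, z)

# Information form (2.120)-(2.121), via the matrix inversion lemma
C_err = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_v, H))
theta_inf = C_err @ H.T @ np.linalg.solve(C_v, z)

assert np.allclose(theta_cov, theta_inf)   # the two forms agree
```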



2.11 Gauss–Markov estimate

We consider the situation where the parameters are unknown but non-random and the observation model is linear, that is

z = H\theta + v   (2.122)

where v is random. Let v have zero mean and covariance C_v. We derive the linear unbiased estimator \hat{\theta} that minimizes the criterion

E\{\|\theta - \hat{\theta}\|^2\} = E\{(\theta - \hat{\theta})^T (\theta - \hat{\theta})\}   (2.123)
= \mathrm{trace}\, E\{(\theta - \hat{\theta})(\theta - \hat{\theta})^T\}   (2.124)
= \mathrm{trace}\, C_{\tilde{\theta}}   (2.125)

Let

\hat{\theta} = Kz + k   (2.126)

with the requirement

E\{\hat{\theta}\} = \theta   (2.127)

Then

E\{\hat{\theta}\} = E\{Kz + k\} = K E\{z\} + k   (2.128)
= K E\{H\theta + v\} + k = KH\theta + k   (2.129)

For \hat{\theta} to be unbiased K and k must satisfy

KH = I, \quad k = 0   (2.130)

For the error covariance we can write

C_{\tilde{\theta}} = E\{(\theta - \hat{\theta})(\theta - \hat{\theta})^T\}   (2.131)
= E\{(\theta - Kz)(\theta - Kz)^T\}   (2.132)
= E\{(\theta - KH\theta - Kv)(\theta - KH\theta - Kv)^T\}   (2.133)
= E\{Kvv^T K^T\} = K C_v K^T   (2.134)

Next we consider the matrix

K' = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1}   (2.135)

for which K'H = I. With this selection the estimator \hat{\theta} = K'z is unbiased. We will next show that K' minimizes the mean square error. First we note that

K' C_v K^T = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} C_v K^T   (2.136)
= (H^T C_v^{-1} H)^{-1} (KH)^T   (2.137)
= (H^T C_v^{-1} H)^{-1}   (2.138)



and

K C_v K'^T = K C_v C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.139)
= (KH)(H^T C_v^{-1} H)^{-1}   (2.140)
= (H^T C_v^{-1} H)^{-1}   (2.141)

and

K' C_v K'^T = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} C_v C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.142)
= (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} H (H^T C_v^{-1} H)^{-1}   (2.143)
= (H^T C_v^{-1} H)^{-1}   (2.144)

Next we form the matrix

(K - K') C_v (K - K')^T = K C_v K^T - K' C_v K^T - K C_v K'^T + K' C_v K'^T   (2.145)-(2.146)
= K C_v K^T - K' C_v K'^T   (2.147)

Note that the matrix (K - K') C_v (K - K')^T is positive semidefinite. Now we obtain for the trace of the error covariance matrix

\mathrm{trace}\, C_{\tilde{\theta}} = \mathrm{trace}\, K C_v K^T   (2.148)
= \mathrm{trace}\, (K - K') C_v (K - K')^T + \mathrm{trace}\, K' C_v K'^T   (2.149)

This can be minimized by choosing K = K' and the Gauss–Markov estimate can be written in the form

\hat{\theta}_{GM} = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z   (2.150)
C_{\tilde{\theta}_{GM}} = (H^T C_v^{-1} H)^{-1}   (2.151)

Comparing \hat{\theta}_{GM} with \hat{\theta}_{LMMS} we note that \hat{\theta}_{GM} is obtained by letting C_\theta^{-1} = 0. This means that the prior density of θ is flat, that is, no a priori information is available about θ [66, 26].

Note that the criterion to be minimized in Gauss–Markov estimation is identical to that of the mean square estimator. The difference is in the treatment of θ as a non-random but unknown parameter vector.

Since this estimator also minimizes the variance of the estimate, it is also called the linear minimum variance estimator \hat{\theta}_{LMV}.

2.12 Least squares estimation

Next we consider the situation where neither the parameters θ nor the error v in the observations is interpreted as random. The solution of this problem leads to the generalized least squares solution.

Let the observation model be

z = h(\theta) + v   (2.152)



where θ and v are unknown but non-random. Then the generalized least squares estimator \hat{\theta}_{GLS} is defined to be the minimizer of the generalized least squares index

l_{GLS} = (z - h(\theta))^T W (z - h(\theta))   (2.153)
= \|Lz - Lh(\theta)\|^2   (2.154)

where L^T L = W is a symmetric positive definite matrix. If W is diagonal the index is also called the weighted least squares index.

With a nonlinear h(θ), the minimization of l_{GLS} has to be done iteratively. First we form the Taylor expansion of the generalized least squares index in the neighborhood of some θ*

l(\theta) = l(\theta^*) + \frac{\partial l}{\partial \theta}(\theta^*)\,(\theta - \theta^*)   (2.155)
+ \frac{1}{2}(\theta - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*)\,(\theta - \theta^*)   (2.156)
+ O(\|\theta - \theta^*\|^3)   (2.157)

Note that ∂l/∂θ is a row vector and ∂²l/∂θ² is a symmetric matrix. Next we approximate l(θ) with the second order approximation

l(\theta) \approx l(\theta^*) + \left[ \frac{\partial l}{\partial \theta}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*) \right] (\theta - \theta^*) = f(\theta)   (2.158)

This approximation is at a minimum when θ = \hat{\theta} is such that

\frac{\partial f}{\partial \theta}(\hat{\theta}) = \frac{\partial l}{\partial \theta}(\theta^*) + (\hat{\theta} - \theta^*)^T \frac{\partial^2 l}{\partial \theta^2}(\theta^*) = 0   (2.159)

so that we can solve \hat{\theta}

\hat{\theta} = \theta^* - \left( \frac{\partial^2 l}{\partial \theta^2}(\theta^*) \right)^{-1} \left( \frac{\partial l}{\partial \theta}(\theta^*) \right)^T   (2.160)

The gradient of l is of the form

\frac{\partial l}{\partial \theta}(\theta^*) = -2 (z - h(\theta^*))^T W \frac{\partial h}{\partial \theta}(\theta^*)   (2.161)

and differentiating twice, we obtain

\frac{\partial^2 l}{\partial \theta^2}(\theta^*) = -2 \left( \sum_{i=1}^{M} (z_i - h_i(\theta^*)) W \frac{\partial^2 h_i}{\partial \theta^2}(\theta^*) \right) + 2 \left( \frac{\partial h}{\partial \theta}(\theta^*) \right)^T W \left( \frac{\partial h}{\partial \theta}(\theta^*) \right)   (2.162)

Finally, we obtain a recursion

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i \left( J_i^T W J_i - \sum_{j=1}^{M} (z_j - h_j(\hat{\theta}_i)) W \frac{\partial^2 h_j}{\partial \theta^2}(\hat{\theta}_i) \right)^{-1} J_i^T W (z - h(\hat{\theta}_i))   (2.163)



where J_i = \frac{\partial h}{\partial \theta}(\hat{\theta}_i), and k_i is the step size parameter that controls the convergence of the iteration [18]. This method is called the Newton–Raphson method. If the norm \|z - h(\theta)\| is small we can make the approximation

\frac{\partial^2 l}{\partial \theta^2}(\hat{\theta}_i) \approx 2 J_i^T W J_i   (2.164)

and the iteration can be written in the form

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i \left( J_i^T W J_i \right)^{-1} J_i^T W (z - h(\hat{\theta}_i))   (2.165)

This is called the Gauss–Newton method. Yet another well known search procedure results when (\partial^2 l/\partial \theta^2)(\hat{\theta}_i) is replaced with the identity matrix. With W = I we can write

\hat{\theta}_{i+1} = \hat{\theta}_i + k_i J_i^T (z - h(\hat{\theta}_i))   (2.166)

This is called the steepest descent method.
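As an illustration only (a toy model assumed here, not taken from the thesis), the following numpy sketch runs the Gauss–Newton iteration (2.165) with W = I and unit step size for a simple exponential observation model.

```python
import numpy as np

# Gauss-Newton iteration (2.165) for a toy model h(theta) = theta_1 * exp(-theta_2 * t), W = I.
t = np.linspace(0.0, 4.0, 50)
h = lambda th: th[0] * np.exp(-th[1] * t)                  # observation model h(theta)
J = lambda th: np.column_stack((np.exp(-th[1] * t),        # Jacobian dh/dtheta
                                -th[0] * t * np.exp(-th[1] * t)))

rng = np.random.default_rng(1)
theta_true = np.array([2.0, 0.7])
z = h(theta_true) + 0.01 * rng.standard_normal(t.size)      # simulated measurements

theta = np.array([1.0, 0.2])                                # initial guess
for _ in range(20):                                         # step size k_i = 1
    Ji, r = J(theta), z - h(theta)
    theta = theta + np.linalg.solve(Ji.T @ Ji, Ji.T @ r)

print(theta)   # close to theta_true
```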

If h(θ) is linear the observation model is of the form

z = H\theta + v   (2.167)

Now the minimization of the generalized linear least squares index

l_{GLS} = (z - H\theta)^T W (z - H\theta)   (2.168)
= \|Lz - LH\theta\|^2   (2.169)

takes a simpler form. Let first W = I and denote the corresponding index by l_{LS}. Then form the gradient

\frac{\partial l_{LS}}{\partial \theta} = -2 (z - H\theta)^T H   (2.170)

Then, the least squares estimator satisfies

-(z - H\hat{\theta}_{LS})^T H = 0   (2.171)

or

H^T (z - H\hat{\theta}_{LS}) = 0   (2.172)

This can be rewritten in the form

H^T H \hat{\theta}_{LS} = H^T z   (2.173)

This system of equations is called the normal equations. The formal solution is

\hat{\theta}_{LS} = (H^T H)^{-1} H^T z   (2.174)

Note that the matrix H^T H equals \partial^2 l_{LS}/\partial\theta^2 up to the factor 2, so that if it is positive definite the stationary point is guaranteed to be the minimizer of l_{LS}.



Note that the criterion (2.172) implies that the residual r = z - \hat{z} = z - H\hat{\theta}_{LS} is orthogonal to all columns of the matrix H. In other words, r belongs to the null space of H^T, which equals the orthogonal complement of the range of H, that is r \in N(H^T) = R(H)^\perp. By definition \hat{z} \in R(H) and \dim(R(H)) = p. The measurement z \in R^M is now of the form

z = \hat{z} + r   (2.175)

and we can see that \hat{z} is the orthogonal projection of z onto R(H), the linear subspace spanned by the columns of the matrix H. We call these vectors the basis vectors. This interpretation of the least squares problem is central in Chapter 6.
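A minimal numerical sketch (an assumed example, not from the thesis) of the normal equations (2.173) and of the orthogonality of the residual (2.172) discussed above:

```python
import numpy as np

# Solve the normal equations (2.173) and check residual orthogonality (2.172).
rng = np.random.default_rng(2)
M, p = 20, 4
H = rng.standard_normal((M, p))                 # basis vectors in the columns of H
z = rng.standard_normal(M)                      # measurements

theta_ls = np.linalg.solve(H.T @ H, H.T @ z)    # normal equations; np.linalg.lstsq is the stabler route
r = z - H @ theta_ls                            # residual r = z - z_hat

print(np.abs(H.T @ r).max())                    # ~0: r is orthogonal to the column space R(H)
```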

The generalized solution is obtained by multiplying the equation (2.167) with a matrix L such that L^T L = W

Lz = LH\theta + Lv   (2.176)

and with the notations z' = Lz and H' = LH

z' = H'\theta + Lv   (2.177)

Now the minimizer of the generalized index

l_{GLS} = (z - H\theta)^T W (z - H\theta)   (2.178)
= (z' - H'\theta)^T (z' - H'\theta)   (2.179)

is obtained by using (2.174)

\hat{\theta}_{GLS} = (H'^T H')^{-1} H'^T z'   (2.180)
= (H^T L^T L H)^{-1} H^T L^T L z   (2.181)
= (H^T W H)^{-1} H^T W z   (2.182)

This is seen to be equivalent to the Gauss–Markov estimate \hat{\theta}_{GM} if we choose W = C_v^{-1}.

A classical reference for linear and nonlinear least squares problems is [110] and for nonlinear optimization in general [93].

2.13 Comparison of ML, MAP and MS estimates

In this section we compare the maximum likelihood estimation with maximum a posteriori estimation in the case of Gaussian densities. Let the observation model be

z = h(\theta) + v   (2.183)

where θ and v are random parameters. Only the parameters θ are to be estimated. Given θ, the observations z and the error v have the same density, except that the mean is E\{z\} = E\{v\} + h(\theta). The density of z given θ is thus

p(z|\theta) = p_v(z - h(\theta)|\theta)   (2.184)



Assuming v is Gaussian with zero mean and that θ and v are independent, we obtain

p(z|\theta) \propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\}   (2.185)

and taking logarithms

\log p(z|\theta) = \mathrm{const} - \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta))   (2.186)

Maximization of this so-called log-likelihood function gives the maximum likelihood estimate and is identical to minimization of

l_{ML} = \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta))   (2.187)

This is identical to the generalized least squares index. In the general case the minimization is nonlinear and can be done e.g. with the Gauss–Newton algorithm. In the linear case the minimizing estimate reduces to the Gauss–Markov estimate

\hat{\theta} = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z   (2.188)

Next we can write for the posterior density

p(\theta|z) \propto p(z|\theta) p(\theta)   (2.189)
= p_v(z - h(\theta)|\theta) p(\theta)   (2.190)

and if we assume that v \sim N(0, C_v) and \theta \sim N(\eta_\theta, C_\theta), this can be written in the form

p(\theta|z) \propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\} p(\theta)   (2.191)
\propto \exp\left\{ -\frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) \right\} \exp\left\{ -\frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta) \right\}   (2.192)-(2.193)

The maximum a posteriori estimate is obtained now by maximizing the logarithm of the posterior density

\log p(\theta|z) = \mathrm{const} - \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) - \frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta)   (2.194)

This is seen to be identical to minimization of

l_{MAP} = \frac{1}{2} (z - h(\theta))^T C_v^{-1} (z - h(\theta)) + \frac{1}{2} (\theta - \eta_\theta)^T C_\theta^{-1} (\theta - \eta_\theta)   (2.195)



The quadratic index (2.195) can be written in the form

l_{MAP} = \frac{1}{2} \begin{pmatrix} (z - h(\theta))^T & (\theta - \eta_\theta)^T \end{pmatrix} \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix} \begin{pmatrix} z - h(\theta) \\ \theta - \eta_\theta \end{pmatrix}   (2.196)
= \frac{1}{2} \left\| L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} - L \begin{pmatrix} h(\theta) \\ \theta \end{pmatrix} \right\|^2   (2.197)

where

L^T L = W, \quad W = \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix}   (2.198)

If we assume the linear observation model h(θ) = Hθ we obtain

l_{MAP} = \frac{1}{2} \left\| L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} - L \begin{pmatrix} H\theta \\ \theta \end{pmatrix} \right\|^2   (2.199)
= \frac{1}{2} \| L z' - L H' \theta \|^2   (2.200)

where

H' = \begin{pmatrix} H \\ I \end{pmatrix}, \quad z' = \begin{pmatrix} z \\ \eta_\theta \end{pmatrix}   (2.201)

This can be solved as a generalized linear LS problem and the formal solution is

\hat{\theta}_{GLS} = (H'^T L^T L H')^{-1} H'^T L^T L z'   (2.202)
= (H'^T W H')^{-1} H'^T W z'   (2.203)
= (H^T C_v^{-1} H + C_\theta^{-1})^{-1} (H^T C_v^{-1} z + C_\theta^{-1} \eta_\theta)   (2.204)

This is seen to be the mean square estimate with nonzero mean η_θ for θ.
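The following numpy sketch (an assumed example) computes the MAP estimate both directly from (2.204) and via the augmented generalized LS formulation (2.199)–(2.203), to make the equivalence concrete.

```python
import numpy as np

# MAP estimate computed two ways: direct form (2.204) and augmented GLS (2.199)-(2.203).
rng = np.random.default_rng(3)
M, p = 15, 3
H = rng.standard_normal((M, p))
C_v = 0.1 * np.eye(M)                        # noise covariance
C_th = np.diag([4.0, 1.0, 0.25])             # prior covariance
eta = np.array([1.0, 0.0, -1.0])             # prior mean
z = H @ (eta + rng.standard_normal(p)) + 0.3 * rng.standard_normal(M)

# Direct form (2.204)
A = H.T @ np.linalg.solve(C_v, H) + np.linalg.inv(C_th)
b = H.T @ np.linalg.solve(C_v, z) + np.linalg.solve(C_th, eta)
theta_map = np.linalg.solve(A, b)

# Augmented form: stack model and prior, whiten with L (L^T L = W), solve LS
H_aug = np.vstack((H, np.eye(p)))
z_aug = np.concatenate((z, eta))
W = np.block([[np.linalg.inv(C_v), np.zeros((M, p))],
              [np.zeros((p, M)), np.linalg.inv(C_th)]])
L = np.linalg.cholesky(W).T                  # upper-triangular factor, L^T L = W
theta_aug, *_ = np.linalg.lstsq(L @ H_aug, L @ z_aug, rcond=None)

print(np.allclose(theta_map, theta_aug))     # True
```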

If we further assume that C_\theta^{-1} = 0, corresponding to infinite prior variance of θ, we obtain exactly the Gauss–Markov estimate. As seen, this is equivalent to the maximum likelihood estimate. If we assume that the errors are independent with equal variance, C_v^{-1} = \sigma_v^{-2} I, the form of the estimator is

\hat{\theta} = (H^T H)^{-1} H^T z   (2.205)

This is identical to the linear least squares solution. The least squares and the Gauss–Markov estimates can thus be seen as Bayesian estimates with no prior information about the parameters θ.

We conclude that with Gaussian densities MAP and ML estimation reduce to linear or nonlinear generalized least squares estimation. If some of the densities are not Gaussian, the minimization is no longer the minimization of a quadratic function.



2.14 Selection of the basis vectors

As emphasized in Section 2.12, the linear least squares problem can be seen as an interpretation of the measurements z as a linear combination of some basis vectors. The basis vectors are contained in the columns of the observation matrix H = (ψ_1, ..., ψ_p) in the linear observation model

z = H\theta + v   (2.206)

In some applications it is common that the observation matrix H also contains measurements. In statistical applications these variables, other than z, are called the regressor variables [91]. Basically they can also be random in nature. In time series applications the observation matrix can even contain the values of the measurements z. This is the case e.g. in modeling of the time series with recursive models as discussed in Chapter 4. For example, in autoregressive modeling of time series the columns of the observation matrix are delayed versions of the measurement z. If the observation matrix contains random variables the optimality of the mean square estimator does not necessarily hold any more. This is discussed briefly in Section 2.18.

If we have some information about the "physical" nature of the measurements, we can use this information in basis vector selection. If the underlying model is originally nonlinear, the model can be linearized in the neighborhood of some point near the solution and this linearized model can be used as the basis of the model. Sometimes we do not have, or do not want to use, such information. In such cases we can select as basis a set of "generic" vectors that we believe to be able to model the data z. Such generic choices are for example sampled polynomials and trigonometric functions. Also sets of Gaussian or sigmoidally shaped vectors can be used. Any mixture of these is also possible. For example the basis vectors ψ_0 = 1 and ψ_1 = t, which can model the first order trend in the measurements, can be used with all generic bases.

A special case of H is when the basis vectors ψ_i are orthonormal to each other. This means that

H^T H = I   (2.207)

The linear least squares solution is then

\hat{\theta} = (H^T H)^{-1} H^T z = H^T z   (2.208)

and the estimate for the observations z is in this basis

\hat{z} = H\hat{\theta} = H H^T z   (2.209)
= \sum_{i=1}^{p} \psi_i c_i   (2.210)

where c_i = \psi_i^T z, that is, the inner product of the data with the basis vector. If the basis is the trigonometric basis and if p = M the sum (2.210) is called the discrete time Fourier series [125] or the discrete Fourier series [152] for z.



If the measurement z is random, the coefficients c_i are also random parameters. If they are required to be uncorrelated, we can write

E\{c c^T\} = E\{H^T z z^T H\}   (2.211)
= H^T R_z H = \mathrm{diag}(\sigma_1^2, ..., \sigma_p^2)   (2.212)

This is an eigenproblem. The basis vectors are then obtained as eigenvectors of the data correlation matrix R_z. The sum (2.210) can then be called e.g. the discrete Karhunen–Loeve transform or the principal component transform [157]. It can be shown that this selection of the basis gives the minimum mean square error in \hat{z} compared to any other set of the same number of basis vectors.
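As an illustrative sketch (an assumed example), the discrete Karhunen–Loeve basis can be estimated from an ensemble of measurements via the eigendecomposition (2.211)–(2.212) and used for the reconstruction (2.209)–(2.210):

```python
import numpy as np

# Estimate a principal component (KL) basis from an ensemble and project one measurement onto it.
rng = np.random.default_rng(4)
M, n_trials, p = 64, 200, 5
t = np.arange(M)
Z = (np.outer(rng.standard_normal(n_trials), np.sin(2 * np.pi * t / M))
     + np.outer(rng.standard_normal(n_trials), np.cos(2 * np.pi * t / M))
     + 0.2 * rng.standard_normal((n_trials, M)))      # ensemble with low-dimensional structure

R_z = Z.T @ Z / n_trials                    # data correlation matrix estimate
evals, evecs = np.linalg.eigh(R_z)          # ascending eigenvalues
H = evecs[:, ::-1][:, :p]                   # p leading eigenvectors as basis vectors

z = Z[0]
c = H.T @ z                                 # coefficients c_i = psi_i^T z
z_hat = H @ c                               # projection onto the KL subspace, (2.209)-(2.210)
print(np.linalg.norm(z - z_hat) / np.linalg.norm(z))   # relative residual
```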

In the wavelet transform the basis vectors are a set of orthogonal sampled functions with local support and (almost) non-overlapping spectra. The wavelet transform of the measurement can then be seen as filtering of the measurements with a bank of noncausal band pass filters [40, 134].

2.15 Modeling of prior information in Bayesian estimation

Modeling of prior information in Bayesian estimation is a fundamental problem. In some cases the parameters θ have some physical meaning and it is possible that we have some knowledge of their possible values. Typical interpretations for such knowledge (or assumptions) are that the parameters are positive, small, almost equal or smoothly varying.

A common way to implement the prior information in Bayesian estimation is to use any prior density with a realistic shape. For example if we know that the parameters should be in some specific p dimensional interval we select the density to be the multidimensional Gaussian density with appropriate mean and covariance to shrink the estimates towards the center of the desired area. The weaker our beliefs about the area are, the less information the prior density should contain. Then it becomes more noninformative, more flat. In the limiting case we can use some improper prior density that can still be useful for the estimation, e.g. as a positivity constraint.

Since the parameters θ are random we can assume that the form of the density of θ depends on the hyperparameters φ. Inference concerning θ is then based on the posterior density

p(\theta|z,\phi) = \frac{p(z,\theta|\phi)}{p(z|\phi)} = \frac{p(z,\theta|\phi)}{\int p(z,u|\phi)\,du}   (2.213)
= \frac{p(z|\theta)\, p(\theta|\phi)}{\int p(z|u)\, p(u|\phi)\,du}   (2.214)

If φ is not known, the proper full Bayesian approach would be to interpret it as random with hyperprior p(φ). The desired posterior for θ is now obtained by marginalization

p(\theta|z) = \frac{p(z,\theta)}{p(z)} = \frac{\int p(z,\theta,\phi)\,d\phi}{\int\!\!\int p(z,u,\phi)\,d\phi\,du}   (2.215)
= \frac{\int p(z|\theta)\, p(\theta|\phi)\, p(\phi)\,d\phi}{\int\!\!\int p(z|u)\, p(u|\phi)\, p(\phi)\,d\phi\,du}   (2.216)

The density p(φ) could also be parametrized. We would then obtain a hierarchical multistage model. However, in the last stage an assumption about the last prior density has to be made.

In the empirical Bayesian approach we reject the necessity of the hyperprior p(φ) in (2.216) and instead use an estimated value \hat{\phi} for φ in (2.214). The estimate is obtained as a maximizer of the marginal density p(z|φ) [26].

2.16 Recursive mean square estimation

In the preceding sections we have dealt with systems in which some fixed amount of data has been available for estimation of the unknown or random parameters. The estimators have been functions of the whole data set. In recursive estimation the question arises how to update the estimate when some amount of new data is received. The answer to this question, with certain observation models, is the so-called Kalman filtering approach. The Kalman filter is a tool with which one can estimate the sequence of states of a dynamical system that cannot be observed directly. The available information is the data sequence that carries some information about the states.

Among many other forms the state-space model for linear dynamical systems can be written as follows. Let z_t ∈ R^M and θ_t ∈ R^p be vector-valued processes. The state θ_t evolves according to the linear difference equation

\theta_{t+1} = F_t \theta_t + G_t w_t   (2.217)

with some initial distribution for θ_0. The state is not observed directly. Instead, the measurements z_t are available at discrete sampling times and are described as

z_t = H_t \theta_t + v_t   (2.218)

This is clearly a linear observation model. The other assumptions for the model are as follows

• F_t, G_t and H_t are known sequences of matrices.
• (θ_0, w_t, v_t) is a sequence of mutually uncorrelated random vectors with finite variance.
• E{w_t} = 0, E{v_t} = 0, ∀t
• the covariances C_{w_t}, C_{v_t} and C_{v_t,w_t} are known sequences of matrices.

The Kalman filtering problem is now to find the linear minimum mean square estimator \hat{\theta}_t for the state θ_t given the observations z_1, ..., z_t. This has been shown to be equal to the conditional mean

\hat{\theta}_t = E\{\theta_t | z_1, ..., z_t\}   (2.219)



In the derivation of the Kalman filter we also require that the estimate is recursive (sequential). The notation introduced here is a compromise between some historical conventions [97], diverse text book notations [133, 142] and our former notation for estimation problems.

As we have seen, we can use two approaches to obtain the linear mean square estimate. The first is to specify a linear conditional mean and find the best linear form. The second approach is to assume that θ_t, v_t and w_t are Gaussian. In this case the conditional mean is again linear. The results of these two approaches are identical. In other words the Kalman filter is the best sequential estimator if the Gaussian assumption is valid and it is the best linear estimator whatever the distributions are. For more general cases of recursive Bayesian estimation see [194].

We employ the second approach. We use the fact that the maximum a posteriori and mean square estimates are identical, as seen in Section 2.13. First we have to look at the expression for the density of θ_t given z_t, ..., z_1. We use the notation Z_t = (z_t, ..., z_1). Then

p(\theta_t | z_t, ..., z_1) = \frac{p(\theta_t, Z_t)}{p(Z_t)}   (2.220)
= \frac{p(\theta_t, z_t, Z_{t-1})}{p(z_t, Z_{t-1})}   (2.221)

For the numerator we can write

p(\theta_t, z_t, Z_{t-1}) = p(z_t | \theta_t, Z_{t-1})\, p(\theta_t, Z_{t-1})   (2.222)
= p(z_t | \theta_t, Z_{t-1})\, p(\theta_t | Z_{t-1})\, p(Z_{t-1})   (2.223)

If θ_t is given, the only random term in (2.218) is v_t, which does not depend on θ_t or Z_{t-1}. Thus

p(z_t | \theta_t, Z_{t-1}) = p(z_t | \theta_t)   (2.224)

and we can write

p(\theta_t | Z_t) = \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})\, p(Z_{t-1})}{p(z_t, Z_{t-1})}   (2.225)
= \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})\, p(Z_{t-1})}{p(z_t|Z_{t-1})\, p(Z_{t-1})}   (2.226)
= \frac{p(z_t|\theta_t)\, p(\theta_t|Z_{t-1})}{p(z_t|Z_{t-1})}   (2.227)

Now since each of the densities on the right hand side is Gaussian, we only need to form the means and the covariances of the densities to find the density of θ_t given Z_t.

For p(z_t|θ_t) we obtain

E\{z_t|\theta_t\} = E\{H_t\theta_t + v_t|\theta_t\} = H_t\theta_t   (2.228)
C_{z_t|\theta_t} = E\{(H_t\theta_t + v_t - E\{z_t|\theta_t\})(H_t\theta_t + v_t - E\{z_t|\theta_t\})^T|\theta_t\}   (2.229)
= C_{v_t}   (2.230)



For p(θ_t|Z_{t-1}) we obtain

E\{\theta_t|Z_{t-1}\} = E\{F_{t-1}\theta_{t-1} + G_{t-1}w_{t-1}|Z_{t-1}\}   (2.231)
= F_{t-1}\hat{\theta}_{t-1} \doteq \hat{\theta}_{t|t-1}   (2.232)

\hat{\theta}_{t|t-1} is thus the prediction of θ_t based on \hat{\theta}_{t-1}, where \hat{\theta}_{t-1} = E\{\theta_{t-1}|Z_{t-1}\} is the optimal MS estimate at time t − 1, and

C_{\theta_t|Z_{t-1}} = E\{(\theta_t - \hat{\theta}_{t|t-1})(\theta_t - \hat{\theta}_{t|t-1})^T|Z_{t-1}\}   (2.233)
= E\{\tilde{\theta}_{t|t-1}\tilde{\theta}_{t|t-1}^T|Z_{t-1}\} \doteq C_{\tilde{\theta}_{t|t-1}}   (2.234)

Next we could form the mean and covariance of p(z_t|Z_{t-1}), form the density p(θ_t|Z_t) and calculate the conditional mean E{θ_t|Z_t}, which is the estimator we are looking for. However, to obtain the MAP estimator we only need to maximize the numerator of (2.227), or its logarithm. Now clearly

\log p(\theta_t|Z_t) = \mathrm{const} - (z_t - H_t\theta_t)^T C_{v_t}^{-1} (z_t - H_t\theta_t) - (\theta_t - \hat{\theta}_{t|t-1})^T C_{\tilde{\theta}_{t|t-1}}^{-1} (\theta_t - \hat{\theta}_{t|t-1})   (2.235)

The derivative of this with respect to θ_t equals zero when θ_t = \hat{\theta}_t satisfies

H_t^T C_{v_t}^{-1} (z_t - H_t\hat{\theta}_t) - C_{\tilde{\theta}_{t|t-1}}^{-1} (\hat{\theta}_t - \hat{\theta}_{t|t-1}) = 0   (2.236)

Solving for \hat{\theta}_t gives the MAP estimate

\hat{\theta}_t = (H_t^T C_{v_t}^{-1} H_t + C_{\tilde{\theta}_{t|t-1}}^{-1})^{-1} (C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.237)

This is clearly of the form (2.204), the Bayesian MAP estimate using the last available estimate as the mean of θ_t. The first term on the right hand side of the equation can be expanded using the matrix inversion lemma

\hat{\theta}_t = (C_{\tilde{\theta}_{t|t-1}} - C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1} H_t C_{\tilde{\theta}_{t|t-1}})\,(C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.238)-(2.239)
= (C_{\tilde{\theta}_{t|t-1}} - K_t H_t C_{\tilde{\theta}_{t|t-1}})\,(C_{\tilde{\theta}_{t|t-1}}^{-1} \hat{\theta}_{t|t-1} + H_t^T C_{v_t}^{-1} z_t)   (2.240)

where the notation

K_t = C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1}   (2.241)

has been made. After some lengthy calculations this can be written in the form [133]

\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t (z_t - H_t \hat{\theta}_{t|t-1})   (2.242)

This is the desired recursive form; the new estimate can be updated from the prediction based on the previous one using the correction term K_t(z_t - H_t\hat{\theta}_{t|t-1}).



The matrix K_t is the so-called gain matrix and the term in parentheses is the error between the actual data value z_t and the prediction H_t\hat{\theta}_{t|t-1}.

Next we form the equation for C_{\tilde{\theta}_{t|t-1}}. First we note that

\tilde{\theta}_{t|t-1} = \theta_t - \hat{\theta}_{t|t-1}   (2.243)
= F_{t-1}\theta_{t-1} + G_{t-1}w_{t-1} - F_{t-1}\hat{\theta}_{t-1}   (2.244)
= F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1}   (2.245)

and for the covariance we can write

C_{\tilde{\theta}_{t|t-1}} = E\{(F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1})(F_{t-1}\tilde{\theta}_{t-1} + G_{t-1}w_{t-1})^T\}   (2.246)
= F_{t-1} C_{\tilde{\theta}_{t-1}} F_{t-1}^T + G_{t-1} C_{w_{t-1}} G_{t-1}^T   (2.247)

since \tilde{\theta}_{t-1} and w_{t-1} are uncorrelated. For the error \tilde{\theta}_t we have

\tilde{\theta}_t = \theta_t - \hat{\theta}_t = \theta_t - (\hat{\theta}_{t|t-1} + K_t(z_t - H_t\hat{\theta}_{t|t-1}))   (2.248)

Inserting z_t = H_t\theta_t + v_t we obtain

\tilde{\theta}_t = \theta_t - (\hat{\theta}_{t|t-1} + K_t(H_t\theta_t + v_t - H_t\hat{\theta}_{t|t-1}))   (2.249)
= \tilde{\theta}_{t|t-1} - K_t(H_t\tilde{\theta}_{t|t-1} + v_t)   (2.250)
= (I - K_tH_t)\tilde{\theta}_{t|t-1} - K_tv_t   (2.251)

and for the error covariance

C_{\tilde{\theta}_t} = (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}\, (I - K_tH_t)^T + K_t C_{v_t} K_t^T   (2.252)
= (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}   (2.253)

since \tilde{\theta}_{t|t-1} = \theta_t - F_{t-1}\hat{\theta}_{t-1} and v_t are uncorrelated. Now equations (2.232), (2.247), (2.241), (2.253) and (2.242) form the Kalman filter. After adding the initializations we can summarize the Kalman filter algorithm

C_{\tilde{\theta}_0} = C_{\theta_0}   (2.254)
\hat{\theta}_0 = E\{\theta_0\}   (2.255)
\hat{\theta}_{t|t-1} = F_{t-1}\hat{\theta}_{t-1}   (2.256)
C_{\tilde{\theta}_{t|t-1}} = F_{t-1} C_{\tilde{\theta}_{t-1}} F_{t-1}^T + G_{t-1} C_{w_{t-1}} G_{t-1}^T   (2.257)
K_t = C_{\tilde{\theta}_{t|t-1}} H_t^T (H_t C_{\tilde{\theta}_{t|t-1}} H_t^T + C_{v_t})^{-1}   (2.258)
C_{\tilde{\theta}_t} = (I - K_tH_t)\, C_{\tilde{\theta}_{t|t-1}}   (2.259)
\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t(z_t - H_t\hat{\theta}_{t|t-1})   (2.260)
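A minimal sketch (an assumed scalar example, not from the thesis) of the recursion (2.254)–(2.260) for a random-walk state observed in noise:

```python
import numpy as np

# Kalman filter recursion (2.254)-(2.260) for a scalar random-walk state.
rng = np.random.default_rng(5)
T = 200
F = G = H = np.eye(1)                          # F_t, G_t, H_t constant here
C_w, C_v = 0.01 * np.eye(1), 1.0 * np.eye(1)

theta = np.zeros((T, 1)); z = np.zeros((T, 1))
for t in range(1, T):                          # simulate state and measurements
    theta[t] = F @ theta[t-1] + G @ (np.sqrt(C_w) @ rng.standard_normal(1))
    z[t] = H @ theta[t] + np.sqrt(C_v) @ rng.standard_normal(1)

theta_hat = np.zeros((T, 1)); C = np.eye(1)    # initializations (2.254)-(2.255)
for t in range(1, T):
    theta_pred = F @ theta_hat[t-1]                              # (2.256)
    C_pred = F @ C @ F.T + G @ C_w @ G.T                         # (2.257)
    K = C_pred @ H.T @ np.linalg.inv(H @ C_pred @ H.T + C_v)     # (2.258)
    C = (np.eye(1) - K @ H) @ C_pred                             # (2.259)
    theta_hat[t] = theta_pred + K @ (z[t] - H @ theta_pred)      # (2.260)

print(float(np.mean((theta - theta_hat)**2)))  # filtered error, well below the noise variance
```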



2.17 Time-varying linear regression

In this section we discuss a special form of the recursive estimation problem, the time-varying linear regression with scalar measurements z_t ∈ R. The main point in this section is to emphasize that the most common recursive algorithms for time-varying linear regression can be obtained from the Kalman filter with some specific assumptions for the state space equations. Let

\theta_t = \theta_{t-1} + w_t   (2.261)
z_t = \varphi_t^T \theta_t + v_t   (2.262)

so that the parameters obey the random walk model [81]. In the scalar case \varphi_t is a regression vector that can contain any stochastic or deterministic variables. The solution \hat{\theta}_t that minimizes the conditional expectation given past observations is given by the Kalman filter discussed in Section 2.16.

First we make the notations

F_t = I   (2.263)
G_t = I   (2.264)
H_t = \varphi_t^T   (2.265)-(2.266)

in (2.217) and (2.218) and we can write the Kalman filter, equations (2.254–2.260), in the form

\hat{\theta}_{t|t-1} = \hat{\theta}_{t-1}   (2.267)
C_{\tilde{\theta}_{t|t-1}} = C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}}   (2.268)
K_t = C_{\tilde{\theta}_{t|t-1}} \varphi_t (\varphi_t^T C_{\tilde{\theta}_{t|t-1}} \varphi_t + C_{v_t})^{-1}   (2.269)
C_{\tilde{\theta}_t} = (I - K_t\varphi_t^T)\, C_{\tilde{\theta}_{t|t-1}}   (2.270)
\hat{\theta}_t = \hat{\theta}_{t|t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t|t-1})   (2.271)

We insert (2.267) and (2.268) into (2.269–2.271) and obtain

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.272)
K_t = \frac{(C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t}{\varphi_t^T (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t + C_{v_t}}   (2.273)
C_{\tilde{\theta}_t} = \left( I - \frac{(C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t\varphi_t^T}{\varphi_t^T (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})\,\varphi_t + C_{v_t}} \right) (C_{\tilde{\theta}_{t-1}} + C_{w_{t-1}})   (2.274)

If we now denote P_t = C_{\tilde{\theta}_t} + C_{w_t}, the recursive mean square estimate takes the form

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.275)
= \hat{\theta}_{t-1} + K_t \varepsilon_t   (2.276)
K_t = \frac{P_{t-1}\varphi_t}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}}   (2.277)
P_t = \left( I - \frac{P_{t-1}\varphi_t\varphi_t^T}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}} \right) P_{t-1} + C_{w_t}   (2.278)

P_t is then a recursive estimate of the covariance C_{\tilde{\theta}_t} + C_{w_t} and K_t is called the Kalman gain. If we now choose (make the assumption)

C_{v_t} = \lambda_t   (2.279)
C_{w_t} = (\lambda_t^{-1} - 1) \left( I - \frac{P_{t-1}\varphi_t\varphi_t^T}{\varphi_t^T P_{t-1}\varphi_t + C_{v_t}} \right) P_{t-1}   (2.280)

then the Kalman filter reduces to the so-called recursive least squares (RLS) algorithm. This leads to the conclusion that the RLS is optimal in the mean square sense if the assumptions (2.279) and (2.280) are valid. Otherwise the RLS only approximates the optimal recursive estimate. If λ_t ≡ λ the method is called the forgetting factor RLS. The forgetting factor RLS is popular e.g. in time series modeling, since its performance can be tuned with one parameter λ. Since λ is usually tuned in RLS near to unity, the implicit assumption is that C_{w_t} in the Kalman filter is "small", corresponding to slow variation of the model parameters.
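An illustrative sketch (an assumed example) of the forgetting factor RLS, i.e. (2.275)–(2.278) with the choices (2.279)–(2.280), tracking a slowly drifting second-order regression:

```python
import numpy as np

# Forgetting factor RLS tracking a slowly drifting regression z_t = phi_t^T theta_t + v_t.
rng = np.random.default_rng(6)
T, lam = 500, 0.98
theta_true = np.array([0.5, -0.2])
phi_hist = np.zeros(2)

theta_hat = np.zeros(2)
P = 100.0 * np.eye(2)                                  # large initial covariance P_0
for t in range(T):
    theta_true += 0.002 * rng.standard_normal(2)       # slow parameter drift
    phi = phi_hist.copy()                              # regressor: delayed measurements
    z = phi @ theta_true + 0.1 * rng.standard_normal()
    eps = z - phi @ theta_hat                          # prediction error (2.282)
    K = P @ phi / (phi @ P @ phi + lam)                # gain (2.277) with C_v = lambda
    theta_hat = theta_hat + K * eps                    # update (2.275)
    P = (np.eye(2) - np.outer(K, phi)) @ P / lam       # RLS covariance update (cf. Table 2.1)
    phi_hist = np.array([z, phi_hist[0]])              # shift delayed measurements

print(theta_hat, theta_true)
```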

With different choices of C_{w_t}, C_{v_t} and P_0 several algorithms of the form

\hat{\theta}_t = \hat{\theta}_{t-1} + K_t(z_t - \varphi_t^T \hat{\theta}_{t-1})   (2.281)
\varepsilon_t = z_t - \varphi_t^T \hat{\theta}_{t-1}   (2.282)

can be obtained. The most popular forms are the normalized least mean square (NLMS) algorithm

\hat{\theta}_t = \hat{\theta}_{t-1} + \frac{\mu\varphi_t}{\mu\varphi_t^T\varphi_t + 1}\, \varepsilon_t   (2.283)

and the least mean square (LMS) algorithm

\hat{\theta}_t = \hat{\theta}_{t-1} + \mu\varphi_t\varepsilon_t   (2.284)

where μ is the step size parameter that controls the convergence of the algorithm. The connection of the LMS with the steepest descent method is clearly visible. The connections of the RLS, NLMS and LMS algorithms to Kalman filtering are discussed in detail in [99]. The corresponding Kalman gain K_t and momentary covariance estimate P_t are summarized in Table 2.1. These connections are fundamental when the implicit assumptions of the recursive algorithms are compared.

The time-varying algorithms presented here are all in their generic forms. In many cases the effectiveness of the algorithm can be tuned with different matrix decompositions and scalings during the iteration. A review of adaptive algorithms can be found e.g. in [81]. Another common reference to recursive systems is [116]. The connection of the different computational forms of the adaptive algorithms with the Kalman filter is summarized in [183]. For historical reasons we refer also to the references [229] and [85]. The preprint books [189] and [193] contain many of the papers published on the subject of adaptive and Kalman filtering.

Table 2.1
Kalman gain K_t and covariance estimate P_t for different adaptive algorithms

          K_t                                                           P_t
Kalman    P_{t-1}\varphi_t(\varphi_t^T P_{t-1}\varphi_t + C_{v_t})^{-1}    (I - K_t\varphi_t^T) P_{t-1} + C_{w_t}
RLS       P_{t-1}\varphi_t(\varphi_t^T P_{t-1}\varphi_t + \lambda)^{-1}    \lambda^{-1}(I - K_t\varphi_t^T) P_{t-1}
NLMS      \mu\varphi_t(\mu\varphi_t^T\varphi_t + 1)^{-1}                   \mu I
LMS       \mu\varphi_t                                                     \mu(I - \mu\varphi_{t+1}\varphi_{t+1}^T)^{-1}

2.18 Properties of the MS estimator

The mean square estimate has many important properties. Perhaps the most important property is its optimality among a wide class of estimators. Let the cost function be a function of the estimation error \tilde{\theta} only, that is, C(\theta - \hat{\theta}) = C(\tilde{\theta}). Let C be symmetric about \tilde{\theta} = 0,

C(\tilde{\theta}) = C(-\tilde{\theta})   (2.285)

and convex,

C(\lambda\tilde{\theta}_1 + (1-\lambda)\tilde{\theta}_2) \le \lambda C(\tilde{\theta}_1) + (1-\lambda) C(\tilde{\theta}_2), \quad 0 \le \lambda \le 1   (2.286)

These properties cover a wide range of cost functions, for example the quadratic cost function C_{MS}(\tilde{\theta}) = \tilde{\theta}^T\tilde{\theta} and the absolute error cost function C_{ABS}(\tilde{\theta}) = \sum_i |\tilde{\theta}_i|. It can be shown that the conditional Bayes cost associated with any cost function with the properties (2.285) and (2.286) can never be less than that associated with the mean square cost function C_{MS}(\tilde{\theta}) [192, 133].

Another important property is the so-called orthogonality property. Let ξ = ξ(z) be any function of the data z. Then using (2.22) we can write

E\{(\theta - \hat{\theta}_{MS})\xi^T\} = E_z\{E\{(\theta - \hat{\theta}_{MS})\xi^T | z\}\}   (2.287)

But ξ is a function of z only and thus

E_z\{E\{(\theta - \hat{\theta}_{MS})\xi^T | z\}\} = E_z\{E\{(\theta - \hat{\theta}_{MS}) | z\}\, \xi^T\}   (2.288)
= E_z\{(E\{\theta|z\} - E\{\hat{\theta}_{MS} | z\})\, \xi^T\}   (2.289)
= E_z\{(\hat{\theta}_{MS} - \hat{\theta}_{MS})\xi^T\} = 0   (2.290)



since \hat{\theta}_{MS} is unbiased. The result is the so-called orthogonality principle [192].

It is important to note that in all derivations with an observation model we have assumed that the observation matrix H is deterministic. Thus all optimality results hold only if H does not depend on random variables, e.g. measurements. However, in many important estimation problems H is an explicit function of the measurements z. Such a case is e.g. parametric modeling of time series, which is discussed in Chapter 4. In such cases general optimality criteria can usually be derived only asymptotically, that is, when the number of measurements M increases without bound. This kind of analysis with stochastic H leads to the so-called stochastic approximation methods [115]. These are not discussed in this thesis.


CHAPTER III

Regularization theory

In this chapter we discuss some regularization methods that are applicable to the regularization of linear problems. The main goal is to motivate the so-called principal component regression and subspace regularization approaches. Both are presented in the formalism that is introduced in Chapter 2. The methods are directly applicable to several problems that arise in evoked potential analysis, as discussed in Chapter 6.

Another purpose of this chapter is to emphasize the interpretation of the most common regularization methods as Bayesian estimators. This makes it possible to see the implicit assumptions about the parameters when some specific regularization method is used.

3.1 Introduction

The notion of regularization arises from the area of ill-posed problems. The problem is to find a solution to the originally infinite dimensional problem

z = H\theta   (3.1)

that is ill-posed in Hadamard's sense. Originally Hadamard posed that the problem is well posed if there exists a unique solution that depends continuously on the data [73]. The latter condition means that arbitrarily small perturbations in the data cannot cause arbitrarily large perturbations in the solution.

Hansen [77] has used the notion of a discrete ill-posed problem for problems

z = H\theta, \quad H \in R^{n \times n}   (3.2)

and

\min_\theta \|H\theta - z\|, \quad H \in R^{m \times n}, \ m > n   (3.3)

if the singular values of H decay gradually to zero and the ratio between the largest and the smallest nonzero singular value is large. In this case the solution




can be very sensitive to errors in the data z even though the solutions are, strictly speaking, continuous functions of the data.

The need to obtain stable solutions to ill-posed problems has led to several methods in which the original problem is modified so that the solution of the modified problem is near the solution of the original problem but is less sensitive to errors in the data. These methods are often called regularization methods. In regularization methods information about the desirable solution is used for the modification of the original problem. In this sense the regularization approach is related to the Bayesian approach to estimation.

Several regularization methods can be interpreted as stabilization of the inversion of the matrix H^T H in the least squares solution. Such methods are e.g. Tikhonov regularization and principal component regression. Other approaches to regularization are e.g. the use of truncated iterative methods, such as the conjugate gradient method, for solving linear systems of equations [78]. The truncated singular value decomposition [75, 33] is also commonly used. All regularization methods can be interpreted in the same formalism with the so-called filter factors, see e.g. [182]. General references on regularization methods are e.g. [209, 72] or [56] and especially [201] in connection to Bayes estimation.

3.2 Tikhonov regularization

We discuss here the most popular regularization method, the Tikhonov regularization method [208, 207], for the solution of the least squares problem. We define here the generalized Tikhonov regularized solution \hat{\theta}_\alpha with the equation

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|L_1(H\theta - z)\|^2 + \alpha^2 \|L_2(\theta - \theta^*)\|^2 \right\}   (3.4)

where L_1^T L_1 = W is a positive definite matrix. The regularization parameter α controls the weight given to minimization of the side constraint

\Omega(\theta) = \|L_2(\theta - \theta^*)\|^2   (3.5)

relative to minimization of the weighted least squares index

l_{LS} = \|L_1(H\theta - z)\|^2   (3.6)

The term θ* is the initial (prior) guess for the solution. In more common definitions W = I, see e.g. [77].

The matrix L_2 is typically either the identity matrix I or a discrete approximation D_d of the d'th derivative operator. In this case L_2 is a banded matrix with full row rank. For example the 1st and 2nd difference matrices are

D_1 = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}   (3.7)

and

D_2 = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -2 & 1 \end{pmatrix}   (3.8)

The solution (3.4) can also be written in the form

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \left\| \begin{pmatrix} L_1 H \\ \alpha L_2 \end{pmatrix} \theta - \begin{pmatrix} L_1 z \\ \alpha L_2 \theta^* \end{pmatrix} \right\|^2 \right\}   (3.9)

and making the notations

H' = \begin{pmatrix} H \\ I \end{pmatrix}   (3.10)
z' = \begin{pmatrix} z \\ \theta^* \end{pmatrix}   (3.11)
L' = \begin{pmatrix} L_1 & 0 \\ 0 & \alpha L_2 \end{pmatrix}   (3.12)

the solution is of the form

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|L'H'\theta - L'z'\|^2 \right\}   (3.13)

from which it is easy to see that the formal solution is

\hat{\theta}_\alpha = (H'^T L'^T L' H')^{-1} H'^T L'^T L' z'   (3.14)
= (H^T W H + \alpha^2 L_2^T L_2)^{-1} (H^T W z + \alpha^2 L_2^T L_2 \theta^*)   (3.15)

Sometimes the notion of standard form regularization is used when L_2 = I; otherwise the notion of regularization in general form is used [74]. Tikhonov regularization is also called ridge regression [69] in the statistical literature or rubric graduation [218] in approximation theory. All the methods using difference approximations as the regularizing side constraint can be called smoothness priors methods [67]. This is also the notion used in this thesis.
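An illustrative numpy sketch (an assumed example, not from the thesis) of the generalized Tikhonov solution (3.15) with W = I, the second difference matrix D_2 of (3.8) as L_2 and θ* = 0, applied to an ill-conditioned smoothing model:

```python
import numpy as np

# Tikhonov (smoothness priors) regularization, equation (3.15) with W = I, theta* = 0.
rng = np.random.default_rng(7)
M = p = 60
t = np.linspace(0, 1, p)
theta_true = np.sin(2 * np.pi * t)                  # smooth "true" parameters
H = np.tril(np.ones((M, p))) / p                    # an ill-conditioned (cumulative) model
z = H @ theta_true + 0.01 * rng.standard_normal(M)

D2 = np.zeros((p - 2, p))                           # second difference matrix, (3.8)
for i in range(p - 2):
    D2[i, i:i + 3] = [1.0, -2.0, 1.0]

alpha = 1e-2
theta_ls = np.linalg.solve(H.T @ H, H.T @ z)                            # unregularized LS
theta_tik = np.linalg.solve(H.T @ H + alpha**2 * D2.T @ D2, H.T @ z)    # (3.15)

print(np.linalg.norm(theta_ls - theta_true), np.linalg.norm(theta_tik - theta_true))
```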

It is easy to see that Tikhonov regularization has a direct connection to Bayesian linear mean square estimation. We can compare equation (3.9) to the mean square estimate

\hat{\theta}_{MS} = \arg\min_\theta \left\| L \begin{pmatrix} H \\ I \end{pmatrix} \theta - L \begin{pmatrix} z \\ \eta_\theta \end{pmatrix} \right\|^2   (3.16)

where

L^T L = \begin{pmatrix} C_v^{-1} & 0 \\ 0 & C_\theta^{-1} \end{pmatrix}   (3.17)

Thus generalized Tikhonov regularization is the optimal mean square estimator if v \sim N(0, W^{-1}) and \theta \sim N(\theta^*, \alpha^{-2}(L_2^T L_2)^{-1}).

Sometimes we do not have any particular assumptions about the parameters θ themselves. They may not even have any physical meaning, as in the case of generic basis functions in linear least squares. More often we have beliefs about what the measurements should be. In this case we have to regularize the prediction of the model, that is, we want Hθ to have some desirable properties. The regularized solution for W = I is then

\hat{\theta}_\alpha = \arg\min_\theta \left\{ \|H\theta - z\|^2 + \alpha^2 \|L_2(H\theta - H\theta^*)\|^2 \right\}   (3.18)

and the solution for this is clearly

\hat{\theta}_\alpha = (H^T H + \alpha^2 H^T L_2^T L_2 H)^{-1} (H^T z + \alpha^2 H^T L_2^T L_2 H \theta^*)   (3.19)

This is of course of the same form as (3.15), with the regularization matrix L' = L_2 H. However, in the form (3.19) standard regularization matrices are directly applicable.

3.3 Principal component based regularization

In [23] it is emphasized that the notion of regularization has its origin in approximation theory and many of the regularization methods there can be put in the form

\hat{\theta} = G H^T z   (3.20)

This estimator clearly covers the ordinary least squares solution with the choice G = (H^T H)^{-1}. The selection of G can thus be seen as an approximation of the inverse of H^T H. The form (3.20) also includes Tikhonov regularization. The notion of ridge regression has been used for Tikhonov regularization with L = I, and the parameter α is then called the ridge parameter. In the general ridge regression problem we have some criterion, e.g. |θ| = c for some given c, for selecting the parameter α [69].

It is possible to select

G = \sum_{j \in S} \frac{1}{\alpha_j} v_j v_j^T   (3.21)

where v_j and α_j are the eigenvectors and eigenvalues of the matrix H^T H, respectively. S is the index set S ⊂ I, where I = (1, ..., rank(H^T H)). The selection of the index set S can be based on the eigenvalues α_j or on the correlation of the eigenvectors with the data z. In the first approach the goal is to eliminate the large variances in the regression by deleting the components for which α_j is small. This approach is sometimes called the traditional principal component regression [23]. In the second approach it is undesirable to delete components that have large correlations with the dependent variable z, the data. Several criteria for the selection of the principal components in (3.21) have been summarized in [91].

Two other methods based on the use of principal components are the latent root regression and the partial least squares [91]. In latent root regression the principal components are calculated for the p independent variables and the dependent variable. The principal components corresponding to the smallest eigenvalues are then examined and the eigenvector is deleted if the PC is small also for the dependent variable z. It then has only little predictive value for the data. In partial least squares the basic idea is to start with the principal components of the explanatory variables and then rotate them so that they become more highly correlated with the dependent variable z. These rotated principal components are then used to predict z.

When used in the form (3.21) the principal component regression is quite ad hoc. The eigendecomposition is used only to regularize the inversion of the matrix H^T H. As noted in [23], the principal component regression as an approach has only little to do with the prior assumptions about the problem. In the following we form an alternative form of the regression, where the connection to the problem is more clearly visible.

First we assume the ordinary linear observation model

z = H\theta + v   (3.22)

for which the ordinary least squares solution is

\hat{\theta} = (H^T H)^{-1} H^T z   (3.23)

and for the prediction we have

\hat{z} = H\hat{\theta}   (3.24)

Next we recall that the singular value decomposition

H = USV^T   (3.25)

gives the eigenvectors of the matrix H^T H in the columns of the matrix V. Thus we can make the decomposition

(H^T H)^{-1} = \sum_{j \in S} \frac{1}{\lambda_j} v_j v_j^T + \sum_{j \in \bar{S}} \frac{1}{\lambda_j} v_j v_j^T   (3.26)
= V_1 \Lambda_1^{-1} V_1^T + V_2 \Lambda_2^{-1} V_2^T   (3.27)

where \bar{S} = I \setminus S. In principal component regression we use only the first term, and thus the prediction can be written in the form

\hat{z} = H (V_1 \Lambda_1^{-1} V_1^T) H^T z   (3.28)
= USV^T (V_1 \Lambda_1^{-1} V_1^T) V S U^T z   (3.29)
= (U_1 S_1 V_1^T + U_2 S_2 V_2^T)(V_1 \Lambda_1^{-1} V_1^T)(V_1 S_1 U_1^T + V_2 S_2 U_2^T) z   (3.30)
= U_1 S_1 V_1^T V_1 \Lambda_1^{-1} V_1^T V_1 S_1 U_1^T z + U_1 S_1 V_1^T V_1 \Lambda_1^{-1} V_1^T V_2 S_2 U_2^T z   (3.31)
+ U_2 S_2 V_2^T V_1 \Lambda_1^{-1} V_1^T V_1 S_1 U_1^T z + U_2 S_2 V_2^T V_1 \Lambda_1^{-1} V_1^T V_2 S_2 U_2^T z   (3.32)



Now we recall that by the definition of the SVD the matrices U and V are unitary. This means that V_1^T V_1 = I and V_1^T V_2 = 0. We also note that S_1 \Lambda_1^{-1} S_1 = I. The equation of the prediction then reduces to

\hat{z} = U_1 U_1^T z   (3.33)
= U_1 (U_1^T U_1)^{-1} U_1^T z   (3.34)
= U_1 \hat{\theta}   (3.35)

where \hat{\theta} is the least squares solution of the modified problem

z = U_1 \theta + v   (3.36)

Thus the principal component regression can be seen as modifying the observation model so that the observations are presented in the linear subspace S spanned by the columns of U_1, which are eigenvectors of the matrix HH^T, the covariance matrix of the basis vectors of the original problem.
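An illustrative sketch (an assumed example) of principal component regression in the modified-model form (3.33)–(3.36), i.e. projection of z onto the leading left singular vectors U_1 of H:

```python
import numpy as np

# PCR prediction z_hat = U_1 U_1^T z, (3.33), with k retained components.
rng = np.random.default_rng(8)
M, p, k = 40, 10, 4
H = rng.standard_normal((M, p)) @ np.diag(np.logspace(0, -4, p))   # ill-conditioned H
z = H @ rng.standard_normal(p) + 0.05 * rng.standard_normal(M)

U, s, Vt = np.linalg.svd(H, full_matrices=False)
U1 = U[:, :k]                               # leading left singular vectors

z_hat_pcr = U1 @ (U1.T @ z)                 # (3.33): prediction in the subspace R(U_1)
theta_pcr = Vt[:k].T @ ((U1.T @ z) / s[:k]) # corresponding parameter estimate

print(np.linalg.norm(z - z_hat_pcr))        # residual outside the retained subspace
```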

A modification of the principal component regression is proposed in [211, 212, 96]. First a set of expectable noiseless measurements Hθ is generated. These simulations are interpreted as samples from the prior density. The principal component regression is then used to reduce the dimensionality of the problem. The method is called the basis constraint method.

3.4 Subspace regularization

In some cases we can make the assumption that the parameter vector is close to some linear subspace S = R(K_S), where K_S is a matrix containing an orthonormal basis of the subspace S in its columns. The projection of θ onto S is then K_S K_S^T \theta, and the distance of θ from S is

\Omega(\theta) = \|K_S K_S^T \theta - \theta\| = \|(K_S K_S^T - I)\theta\|   (3.37)

We can now use this quantity as a side constraint in the regularized least squares solution and we obtain

\hat{\theta} = \arg\min_\theta \left\{ \|L_1(z - H\theta)\|^2 + \alpha^2 \|(I - K_S K_S^T)\theta\|^2 \right\}   (3.38)

We call this solution the subspace regularized solution. Since the matrix L_2 = (I - K_S K_S^T) is idempotent and symmetric, we can write L_2 = L_2 L_2^T = (I - K_S K_S^T) and the solution is

\hat{\theta} = (H^T W H + \alpha^2 (I - K_S K_S^T))^{-1} H^T W z   (3.39)

The matrix (I − K S K T S ) is the orthogonal projector onto the null space of the<br />

operator K S <strong>and</strong> the subspace regularization can be seen as modifying the original<br />

least squares problem so that the solution is closer to the null space of K S .<br />

The subspace regularized solution reduces to the principal component regression


3.5. Selection of the regularization parameters 55<br />

when α goes to infinity. This has been shown in [213]. The method is originally<br />

introduced in [214].<br />

It is readily seen that the subspace regularization equals <strong>for</strong>mally to mean<br />

square estimation with the prior θ ∼ N(0,α −2 (I − K S KS T)−1 ) if v ∼ N(0,W −1 ).<br />

But the matrix (I − K S KS T ) is generally not of full rank <strong>and</strong> the inverse does not<br />

exist. This means that the prior is improper <strong>and</strong> the parameters can have infinite<br />

variance in some directions. In these directions the prior assumptions does not<br />

contain any in<strong>for</strong>mation about the distribution of the parameters. Thus we can<br />

call this kind of prior partially nonin<strong>for</strong>mative.<br />

Also in this case the prior in<strong>for</strong>mation can be stated in terms of the noiseless<br />

measurements Hθ. If the requirement is that Hθ is close to the subspace S ′ , the<br />

solution is<br />

ˆθ = (H T WH + α 2 H T (I − K S ′KS T ′)H)−1 H T Wz (3.40)<br />

3.5 Selection of the regularization parameters<br />

Selection of the regularization parameters is a fundamental problem in regularization.<br />

In the inverse problem literature the main attention has been paid to the<br />

optimal selection of the parameter α when the <strong>for</strong>m of the observation model is<br />

known <strong>and</strong> the regularization matrix is fixed. However, in real estimation problems<br />

the selection of the observation model can also be interpreted as an aspect in<br />

regularization. For example in [218] the number p of the basis functions in observation<br />

model is treated as regularization parameter. In [23] it is argued that from<br />

the <strong>Bayesian</strong> point of view the selection of the structure of the regularizing prior<br />

can be a more fundamental problem than the selection of the weight α. However,<br />

some practical tools have been developed <strong>for</strong> the selection of the regularization<br />

parameters. For a more detailed discussion on the selection of the regularization<br />

parameters see e.g. [77] <strong>and</strong> the references therein.<br />

The <strong>methods</strong> <strong>for</strong> the selection of the regularization parameters are classically<br />

divided into two classes depending on whether or not the norm ‖v‖ of the noise<br />

term in the observation model is known. The discrepancy principle of Morozov<br />

is an example of a method that relies on the knowledge of ‖v‖. The idea in<br />

discrepancy principle is to use the regularization parameters so that the index<br />

f MDP (α) =<br />

∥<br />

∥z − Hˆθ α<br />

∥ ∥∥<br />

2<br />

− ‖v‖<br />

2<br />

(3.41)<br />

is zero. This is a nonlinear equation <strong>for</strong> α <strong>and</strong> it can be solved iteratively. An<br />

improved modification of this is Greferer–Raus method [72].<br />

In the so-called posterior <strong>methods</strong> all the necessary in<strong>for</strong>mation <strong>for</strong> selection of<br />

the regularization parameters are extracted from the data. The quasi-optimality<br />

method is an example of a method in which no in<strong>for</strong>mation about ‖v‖ is needed<br />

In quasi-optimality method the parameter α is chosen so that the norm<br />

f QO (α) =<br />

∥ α d ∥ ∥∥∥<br />

dα ˆθ<br />

2<br />

α<br />

(3.42)


56 3. <strong>Regularization</strong> theory<br />

has a local minimum with α ≠ 0 [74]. Another method of this class is the socalled<br />

generalized cross-validation (GCV) that is based on the “leave-one-out”<br />

type procedure [69, 218]. The solution can be written as minimizer of the index<br />

∥<br />

∥Hˆθ α − z∥ 2<br />

f GCV (α) =<br />

(trace (I − HK)) 2 (3.43)<br />

where ˆθ α = Kz [77].<br />

In the L-curve method the choice of the regularization parameter is based<br />

on the graph of the penalty term ‖L 2ˆθα ‖ versus the residual norm ‖z − H ˆθ α ‖.<br />

When the graph is plotted in log-log scale, the L-curve will sometimes have a<br />

characteristic L-shaped apperance with a “corner” separating a steep part from<br />

the flat part of the graph. The value of α corresponding to this corner is selected<br />

<strong>for</strong> the regularizing parameter. The L-curve criterion is a fully heuristic approach<br />

although it has some asymptotic optimality properties [76]. The use of L-curve<br />

criterion has been criticized e.g. in [57].<br />

The <strong>methods</strong> <strong>for</strong> the selection of the regularization parameters are often proposed<br />

<strong>for</strong> the ridge class of <strong>methods</strong>. That is, <strong>for</strong> the Tikhonov regularization in<br />

st<strong>and</strong>ard <strong>for</strong>m. Also the analysis of the properties are usually carried <strong>for</strong> ridge estimators.<br />

Even in this simple case the theory justifying the use of e.g. generalized<br />

cross-validation is an asymptotic one as stressed in [218].<br />

Several different criteria <strong>for</strong> the selection of the regularization parameters are<br />

discussed in [23] where it is also emphasized that the selection can also be based on<br />

some approximative <strong>Bayesian</strong> <strong>for</strong>mulation. It is also stressed in [23] that nowadays<br />

with good numerical tools the fully <strong>Bayesian</strong> implementation of the problem would<br />

be straight<strong>for</strong>ward.


CHAPTER<br />

IV<br />

Time series models<br />

In this chapter some topics in time series analysis are discussed. The problem<br />

of time-series modeling is <strong>for</strong>mulated <strong>and</strong> the connection between time-dependent<br />

modeling <strong>and</strong> linear regression is emphasized. After that the special properties<br />

of the LMS algorithm are discussed <strong>and</strong> the <strong>for</strong>m of the LMS is compared with<br />

<strong>Bayesian</strong> MAP estimation <strong>and</strong> subspace regularization. These results are then<br />

used in Chapter 6 where the use of adaptive algorithms <strong>for</strong> the estimation of the<br />

evoked potentials is analyzed.<br />

Another topic of this chapter is optimal filtering. The so-called Wiener filtering<br />

scheme is discussed <strong>and</strong> the connection between Wiener filtering <strong>and</strong> mean<br />

square estimation is especially emphasized. The time-dependent Wiener filter is<br />

defined <strong>and</strong> connected to more traditional <strong>for</strong>mulations of Wiener filtering. These<br />

<strong>for</strong>mulations are fundamental in Chapter 6 where the corresponding <strong>methods</strong> <strong>for</strong><br />

evoked potential analysis are reviewed.<br />

4.1 Stochastic processes<br />

A scalar valued r<strong>and</strong>om (or stochastic) process x(t) is a set {x(t) ∈ R : t ∈ T } of<br />

r<strong>and</strong>om variables defined in the same probability space [142]. T is said to be the<br />

parameter set of process. When T is a finite or countably infinite set, we say that<br />

the process is a discrete parameter set process. When the parameter t denotes<br />

the time, the process is said to be a discrete-time process, a time series. Only<br />

discrete-time processes are discussed in this thesis.<br />

The mean of the r<strong>and</strong>om processes x(t) is a function of time<br />

η x (t) = E {x(t)} =<br />

∫ ∞<br />

−∞<br />

<strong>and</strong> covariance is a function of two time variables t <strong>and</strong> s<br />

x(t)p(x(t))dx(t) (4.1)<br />

C x (t,s) = E {(x(t) − η x (t))(x(s) − η x (s))} (4.2)<br />

= E {x(t)x(s)} − η x (t)η x (s) (4.3)<br />

57


58 4. Time series models<br />

If the mean is time-invariant<br />

= R x (t,s) − η x (t)η x (s) (4.4)<br />

E {x(t)} = η (4.5)<br />

we say that the process is first-order stationary. If in addition the autocorrelation<br />

R x (t,s) depends only on the time difference τ = s − t<br />

R x (t,s) = R x (τ) (4.6)<br />

we say, that x(t) is second-order stationary. The notion of wide sense stationarity<br />

is also used <strong>for</strong> second order stationarity. Discrete white noise process e(t) is<br />

defined to be a sequence of mutually independent zero mean r<strong>and</strong>om variables<br />

[191].<br />

A widely used notion in signal processing literature is the signal-to-noise ratio.<br />

For a stationary stochastic signal s(t) with additive noise<br />

the signal-to-noise ratio can be defined as [187]<br />

SNR = E { s 2 (t) }<br />

x(t) = s(t) + v(t) (4.7)<br />

E {v 2 (t)}<br />

For deterministic s(t) the ratio is sometimes defined as time-dependent function<br />

SNR(t) =<br />

|s(t)|<br />

√<br />

E {v2 (t)}<br />

(4.8)<br />

(4.9)<br />

as in [153].<br />

Let x(t) be a wide sense stationary process with autocorrelation R x (τ). Then<br />

its power spectrum is defined as<br />

S x (ω) =<br />

∞∑<br />

τ=−∞<br />

R x (τ)e −jτω (4.10)<br />

The spectrum is thus the discrete Fourier trans<strong>for</strong>m of the autocorrelation of the<br />

x(t).<br />

Some classes of discrete stochastic processes can be expressed as outputs of<br />

linear time-invariant systems with white noise input. We call the process z(t)<br />

autoregressive moving average process or ARMA(p,q) process if it can be expressed<br />

as<br />

p∑<br />

q∑<br />

z(t) = a j z(t − j) + b k e(t − k) (4.11)<br />

j=1<br />

where e(t) is a white noise process. Using operator notation, we can write this in<br />

the <strong>for</strong>m<br />

A(q)z(t) = (1 + B(q)) e(t) (4.12)<br />

k=0


4.2. Stationary time series models 59<br />

where<br />

A(q) = 1 −<br />

B(q) =<br />

p∑<br />

a j q −j (4.13)<br />

j=1<br />

q∑<br />

b k q −k (4.14)<br />

k=1<br />

<strong>and</strong> q −1 is the time delay operator q −1 z(t) = z(t − 1). A(q) <strong>and</strong> B(q) are thus<br />

operator polynomials. If we have B(q) ≡ 0 the process (4.11) is called autoregressive,<br />

AR process. The estimation of the coefficients of the polynomials A(q) <strong>and</strong><br />

B(q) based on measurements z(t), is called rational modeling.<br />

Using the results of the linear filtering theory it can be shown that the power<br />

spectrum of ARMA(p,q) process is [167]<br />

S(ω) = ∣<br />

∑ q 1 +<br />

σe<br />

2 k=1 b ke −iωk∣ ∣ 2<br />

∣<br />

∣1 − ∑ ∣<br />

p ∣∣<br />

j=1 a je −iωj 2<br />

(4.15)<br />

= ∣ 1 + B(e<br />

σe<br />

2 ) ∣ 2<br />

|A(e −iω )| 2 (4.16)<br />

4.2 Stationary time series models<br />

An approach to time series modeling is to model the measured signal as output<br />

of linear system. If the signal is stationary, this system can be time-invariant.<br />

For example if we want to model the signal z(t) with a p’th order AR model, the<br />

observation model is of the <strong>for</strong>m<br />

z(t) =<br />

p∑<br />

a k z(t − k) + e(t) t = p + 1,...,N (4.17)<br />

k=1<br />

We have emphasized the scalar nature of the observation z(t) ∈ R here using the<br />

time variable in parenthesis rather than in subscript. The measurements can be<br />

written in matrix <strong>for</strong>m<br />

z = Hθ + e (4.18)<br />

where θ = (a 1 ,...,a p ) T <strong>and</strong><br />

H =<br />

⎛<br />

⎜<br />

⎝<br />

z(p) · · · z(1)<br />

.<br />

.<br />

z(N − 1) · · · z(N − p)<br />

⎞<br />

⎟<br />

⎠ (4.19)<br />

z = (z(p + 1),...,z(N)) T (4.20)<br />

e = (e(p + 1),...,e(N)) T (4.21)


60 4. Time series models<br />

The least squares solution <strong>for</strong> the AR parameters is then<br />

ˆθ LS = (H T H) −1 H T z (4.22)<br />

ˆθ LS is the solution that minimizes the output least squares error norm<br />

ˆθ LS = arg min<br />

θ<br />

‖e‖ 2 (4.23)<br />

= arg min<br />

θ<br />

e T e (4.24)<br />

= arg min<br />

θ<br />

‖z − Hθ‖ 2 (4.25)<br />

The least squares estimate of the AR model parameters is thus a linear function<br />

of data. This is not the case with ARMA modeling. In ARMA modeling the<br />

observation model can be written in <strong>for</strong>m<br />

z(t) =<br />

p∑<br />

a j z(t − j) +<br />

j=1<br />

q∑<br />

b k e(t − k) + e(t) (4.26)<br />

The LS solution <strong>for</strong> parameters a <strong>and</strong> b is then the one that minimizes the norm<br />

of the error e. This minimization cannot be written in <strong>for</strong>m<br />

k=1<br />

ˆθ LS = arg min<br />

θ<br />

‖z − Hθ‖ 2 (4.27)<br />

where H does not depend on the parameters to be estimated. There<strong>for</strong>e the<br />

ARMA modeling of time series is a nonlinear optimization problem. For different<br />

<strong>methods</strong> <strong>for</strong> solving ARMA(p,q) model in case of stationary signals see e.g. [37].<br />

General references <strong>for</strong> estimation of stationary models are e.g. [125, 34, 103].<br />

4.3 Time-dependent time series models<br />

If the signal to be modeled is nonstationary it cannot be modeled as output of a<br />

time-invariant system. It is natural to assume in this case that the system has<br />

time-varying parameters. For example the time-varying ARMA model can be<br />

written in <strong>for</strong>m<br />

If we denote<br />

z(t) =<br />

p∑<br />

a j (t)z(t − j) +<br />

j=1<br />

q∑<br />

b k (t)e(t − k) + e(t) (4.28)<br />

k=1<br />

θ t = (a 1 (t),...,a p (t),b 1 (t),...,b q (t)) T (4.29)<br />

ϕ t = (z(t − 1),...,z(t − p),e(t − 1),...,e(t − q)) T (4.30)<br />

we can write the equation in the <strong>for</strong>m<br />

z(t) = ϕ T t θ t + e(t) (4.31)


4.3. Time-dependent time series models 61<br />

This is exactly of the <strong>for</strong>m of linear regression (2.262). The parameters θ t can now<br />

be estimated iteratively <strong>for</strong> example with the RLS algorithm.<br />

ˆϕ t = (z(t − 1),...,z(t − p),ε(t − 1),...,ε(t − q)) T (4.32)<br />

ε(t) = z(t) − ˆϕ T t ˆθ t−1 (4.33)<br />

K t = P t−1<br />

ˆϕ t<br />

λ + ˆϕ T t P t−1 ˆϕ t<br />

(4.34)<br />

ˆθ t = ˆθ t−1 + ε(t)K t (4.35)<br />

P t = λ −1 (I − K t ˆϕ T t )P t−1 (4.36)<br />

where<br />

ˆθ t = (â 1 (t),...,â p (t),ˆb 1 (t),...,ˆb q (t)) T (4.37)<br />

Note that the unknown process e(t) is here estimated with the prediction error<br />

process ε(t) in every step of the iteration. We use there<strong>for</strong>e the notation ˆϕ t instead<br />

of ϕ t .<br />

The application to AR <strong>and</strong> MA models is straight<strong>for</strong>ward. Selection of regressor<br />

<strong>and</strong> parameter vector structure <strong>for</strong> different model structures is summarized<br />

in Table 4.1.<br />

Other recursive estimators discussed in Section 2.17 are applied to timevarying<br />

time-series modeling analogously. The recursion<br />

ˆθ t = ˆθ t−1 + K t (z(t) − ˆϕ T t ˆθ t−1 ) (4.38)<br />

in combination of tables 4.1 <strong>and</strong> 2.1 summarizes the different models <strong>and</strong> algorithms.<br />

Table 4.1<br />

Regressor vector ˆϕ t <strong>and</strong> parameter vector ˆθ <strong>for</strong> different time series model structures<br />

AR<br />

ARMA<br />

MA<br />

AR<br />

ARMA<br />

MA<br />

ˆϕ T t<br />

(z(t − 1),...,z(t − p))<br />

(z(t − 1),...,z(t − p),ε(t − 1),...,ε(t − q))<br />

(ε(t − 1),...,ε(t − q))<br />

ˆθ T<br />

(â 1 (t),...,â p (t))<br />

(â 1 (t),...,â p (t),ˆb 1 (t),...,ˆb q (t))<br />

(ˆb 1 (t),...,ˆb q (t))<br />

The <strong>for</strong>m of the linear system can be different from the linear regression discussed<br />

here. AR, ARMA <strong>and</strong> MA models are all so-called transversal models <strong>for</strong>


62 4. Time series models<br />

time series. Another possibility is to use so called lattice structures [124]. Also <strong>for</strong><br />

these it is possible to <strong>for</strong>m recursive <strong>for</strong>ms [61]. Since the operator polynomials<br />

in transversal models can be parametrized also with the roots of the polynomials,<br />

yet another possibility is to use so-called root tracking <strong>methods</strong> <strong>for</strong> time varying<br />

modeling of the time series [99].<br />

4.4 Adaptive filtering<br />

In Section 4.3 the regressor vector contained previous terms of the measurements<br />

or the prediction error values. It is also possible that the regressor vector contains<br />

samples from another process, the reference signal. Methods that use this kind of<br />

model are referred in [81] to interference cancelling <strong>methods</strong>. The most famous<br />

applications in this class of <strong>methods</strong> are perhaps the adaptive noise cancelling<br />

applications [228]. Originally the idea in adaptive noise cancelling is to use measurements<br />

that contain signal <strong>and</strong> noise as the primary input z(t) = s(t) + n 0 (t).<br />

The noise is assumed to be uncorrelated with the signal. The regressor vector<br />

is called the reference input <strong>and</strong> it contains a signal n 1 (t) that is correlated with<br />

the noise n 0 (t) in the primary input. With this observation model some adaptive<br />

linear regression algorithm, typically LMS, is then used. Since the regressor<br />

now contains terms that are correlated with the noise term in the primary input,<br />

the term ˆϕ T t ˆθ t−1 models this noise n 0 (t). In this case the prediction error term<br />

ε(t) = z(t) − ˆϕ T t ˆθ t−1 thus approximates the underlying signal s(t).<br />

Adaptive noise cancellation can thus be seen as time varying linear regression.<br />

The primary input is modeled as linear combination of delayed reference inputs<br />

with time varying parameters. The error norm is minimized iteratively with an<br />

adaptive algorithm. For example, if the LMS algorithm is used, the corresponding<br />

iterative algorithm is the steepest descent algorithm. The measurement z(t) <strong>and</strong><br />

the regression model ϕ t are also time-dependent. In every step of the minimization<br />

new data <strong>and</strong> regressor values are received.<br />

Another important property of adaptive algorithms is their connection with<br />

regularization <strong>methods</strong>. We recall first, that the recursive mean square estimate<br />

is in fact a <strong>Bayesian</strong> MAP estimate using the previous error covariance as the<br />

estimate of the prior covariance of the parameters. This can be seen from equation<br />

(2.237). Also the implicit assumption in the LMS algorithm is that the covariance<br />

is of the <strong>for</strong>m<br />

P t = µ(I − µϕ t+1 ϕ T t+1) −1 (4.39)<br />

This can be compared to the assumptions on the prior in<strong>for</strong>mation in subspace<br />

regularization which is discussed in Section 3.4. It was emphasized that the prior<br />

covariance of the parameters is (<strong>for</strong>mally)<br />

C θ = α −2 (I − K S K T S ) −1 (4.40)<br />

The LMS algorithm thus drives the solution to be close to the rank-one subspace<br />

that is spanned by the vector ϕ t+1 . Thus ϕ t+1 serves as prior in<strong>for</strong>mation <strong>for</strong><br />

recursive estimation when LMS is used. The regularizing subspace is clearly timevarying.<br />

The <strong>for</strong>m (4.39) is fundamental when analyzing the prior assumptions


4.5. Wiener filtering as estimation problem 63<br />

made in the LMS algorithm. Note that in the NLMS algorithm the implicit assumption<br />

<strong>for</strong> the error covariance is<br />

P t = µI (4.41)<br />

<strong>and</strong> the method has thus close relationship with Tikhonov regularization that was<br />

discussed in Section 3.2. In Tikhonov regularization we have<br />

C θ = α −2 I (4.42)<br />

4.5 Wiener filtering as estimation problem<br />

Let s be a vector containing values s(t) sampled from a time-varying quantity. If<br />

this quantity is r<strong>and</strong>om, s is a sample from a stochastic process. In the signal<br />

processing literature a central problem is the estimation of s in presence of noise<br />

v from observations z. When the solution minimizes the mean square error, the<br />

estimation procedure is often called Wiener filtering [152]. The <strong>for</strong>m of the optimal<br />

estimator depends on the observation model <strong>and</strong> on the assumptions about the<br />

r<strong>and</strong>om signals s <strong>and</strong> v, that is, the joint density p(s,v).<br />

In many cases we require that the filter is a linear operator. In the Wiener<br />

filtering problem we thus seek the estimator of the <strong>for</strong>m<br />

ŝ = Kz (4.43)<br />

that minimizes the mean square error. By orthogonality principle we know that<br />

<strong>for</strong> mean square estimator<br />

From this we obtain <strong>for</strong> K<br />

E { (s − ŝ)z T } = E { (s − Kz)z T } (4.44)<br />

= E { sz T } − KE { zz T } (4.45)<br />

= R sz − KR z = 0 (4.46)<br />

K = R sz R −1<br />

z (4.47)<br />

We call this the time-varying Wiener filter. We use the notion of time-varying<br />

since we have not made any assumptions about the distribution of s. The solution<br />

(4.47) is thus the optimal linear estimator in mean square sense whatever the<br />

density of s is. The optimality thus holds also <strong>for</strong> nonstationary signals. The only<br />

assumptions made are that the filter is linear <strong>and</strong> yields the mean square criterion.<br />

If s <strong>and</strong> z would assumed to be zero mean, the <strong>for</strong>m of the estimator would yield<br />

the <strong>for</strong>m of (2.90).<br />

In the original work of Wiener [230] s <strong>and</strong> v were assumed to be stationary.<br />

Comparing the definitions in Sections 2.2 <strong>and</strong> 4.1 it can be seen that the correlation<br />

of the r<strong>and</strong>om vector <strong>and</strong> the autocorrelation of the stationary stochastic process<br />

have a special connection. If we interpret a stationary r<strong>and</strong>om process as a doubly<br />

infinite r<strong>and</strong>om vector [142], the autocorrelation matrix of such a process is an<br />

infinite-dimensional Toeplitz matrix. Thus K is also infinite-dimensional <strong>and</strong> it


64 4. Time series models<br />

is easy to show that in the case of stationary processes, the optimal Wiener filter<br />

can be expressed as the convolution<br />

s(t) =<br />

∞∑<br />

i=−∞<br />

k(t − i)z(i) (4.48)<br />

where k(t) is the impulse response of a linear time-invariant filter. The spectrum<br />

of the filter is<br />

K(ω) = S sz(ω)<br />

(4.49)<br />

S z (ω)<br />

where S sz (ω) <strong>and</strong> S z (ω) are the discrete Fourier trans<strong>for</strong>ms of any row of R sz <strong>and</strong><br />

R z respectively. This is often called the noncausal Wiener filter [187] since the<br />

summation in (4.48) is from −∞ to ∞.<br />

If we assume the additive noise model<br />

<strong>and</strong> that s <strong>and</strong> v are independent, we have<br />

<strong>and</strong><br />

z = s + v (4.50)<br />

R z = E { zz T } = E { (s + v)(s + v) T } (4.51)<br />

= E { ss T } + E { sv T } + E { vs T } + E { vv T } (4.52)<br />

= R s + R v (4.53)<br />

R sz = E { sz T } = E { s(s + v) T } (4.54)<br />

= E { ss T } + E { sv T } (4.55)<br />

= R s (4.56)<br />

Using again the interpretation of stochastic processes as doubly infinite r<strong>and</strong>om<br />

vectors we conclude that with the additive noise model the spectrum of the Wiener<br />

filter is<br />

S s (ω)<br />

K(ω) =<br />

(4.57)<br />

S s (ω) + S v (ω)<br />

This is also the original <strong>for</strong>m presented by Wiener [230].<br />

If we further assume, that s can be presented as linear combination of some<br />

basis vectors, we can write<br />

s = Hθ (4.58)<br />

If we also assume that v <strong>and</strong> θ are zero mean, we can write<br />

R z = E { Hθθ T H T } + C v (4.59)<br />

= HC θ H T + C v (4.60)<br />

<strong>and</strong><br />

R sz = HC θ H T (4.61)


4.5. Wiener filtering as estimation problem 65<br />

the Wiener filter can be written in <strong>for</strong>m<br />

Using the matrix inversion lemma we can write<br />

ŝ = HC θ H T (HC θ H T + C v ) −1 (4.62)<br />

ŝ = H(H T C −1<br />

v<br />

H + C −1<br />

θ<br />

) −1 H T Cv −1 z (4.63)<br />

= Hˆθ MS (4.64)<br />

The use of the notion Wiener filtering is diverse. In some cases it is used <strong>for</strong> the<br />

best (in mean square sense) causal estimator <strong>for</strong> the underlying signal with some<br />

specific structure of the filter, in some cases the notion simply means the mean<br />

square estimator. For the theory of Wiener filtering see e.g. [152, 167, 153, 187].


CHAPTER<br />

Simulation of evoked potentials<br />

V<br />

Simulation is an important part of the evaluation of any estimation method. When<br />

done properly, all prior in<strong>for</strong>mation about the problem should be implemented into<br />

the simulation procedure. With simulations one can compare the estimates to the<br />

true parameters. This can never be done with real measurements. This holds<br />

also <strong>for</strong> the evaluation of estimation <strong>methods</strong> in evoked potential analysis. Most<br />

commonly deterministic peaks with additive noise are used in simulations [2, 105].<br />

Other models <strong>for</strong> evoked potentials that are used are e.g. sinusoidal [29] or damped<br />

sinusoidal models [210, 63]. Polynomials [132] <strong>and</strong> splines [83] have also been used.<br />

In this chapter we introduce two <strong>methods</strong> <strong>for</strong> constructing realistic simulations<br />

of evoked potentials. The first is based on the r<strong>and</strong>om occurrence of the peaks<br />

with preselected shapes. The shape parameters of the peaks are extracted using<br />

non-linear fitting of Gaussian basis functions to the average wave<strong>for</strong>m. The peak<br />

locations <strong>and</strong> amplitudes are then r<strong>and</strong>omly selected using some joint density. We<br />

call this method component based simulation of evoked potentials. The other simulation<br />

method proposed here is based on interpretation of the evoked potentials<br />

as r<strong>and</strong>om vectors. The second order statistics of a set of realistic measurements is<br />

first extracted. The covariance is then approximated using the truncated eigendecomposition<br />

of the covariance matrix. Using this in<strong>for</strong>mation the simulations are<br />

then drawn as samples of this approximating jointly Gaussian distribution. We<br />

call this method principal component based simulation of evoked potentials. The<br />

simulation of the background EEG is also briefly discussed.<br />

5.1 Simulation of the background EEG<br />

In the simulation of the evoked potentials one important step is a simulation of the<br />

background EEG. If the properties of the background simulation does not fulfill the<br />

assumptions of the real EEG background, the evaluation of the evoked potential<br />

estimation <strong>methods</strong> can not give realistic results. The most common assumption<br />

about the background is that it is a r<strong>and</strong>om process additive to the evoked potential.<br />

If the background is assumed to be stationary, some time-independent<br />

66


5.2. Component based simulation of evoked potentials 67<br />

parametric model can be used in simulation. The model can be based on real<br />

measurements e.g. on pred-stimulus data The AR-model approach is used e.g. in<br />

[240].<br />

Nonstationary background can be generated e.g. with time-varying AR models.<br />

The stationary models can be <strong>for</strong>med <strong>for</strong> measurements be<strong>for</strong>e the stimulus<br />

<strong>and</strong> after the late potentials in the evoked potential measurement. Smoothly timevarying<br />

functions can then be selected to model the structure of the time evolution<br />

of the AR parameters. Nonstationary background simulations are then obtained<br />

as output of this time-varying system with white noise input. The simulation of<br />

nonstationary EEG is discussed in [95].<br />

5.2 Component based simulation of evoked potentials<br />

In some cases the goal in the estimation is to get in<strong>for</strong>mation about the location<br />

of the peaks in the evoked potential. In this case we need a simulation method in<br />

which the peak locations can be used as parameters. With this kind of a method<br />

the peak locations can be based e.g. on some pre-selected r<strong>and</strong>om distribution,<br />

such as uni<strong>for</strong>m or Gaussian distribution. With this kind of a method it is also<br />

easy to generate simulations in which the parameters of the peaks are correlated.<br />

This is done by using multidimensional joint densities in simulation. The shape of<br />

the peaks that are used in simulation can be based on real data.<br />

We propose here a method in which the peak distributions are based on real<br />

measurements. The peak amplitudes <strong>and</strong> locations can be generated r<strong>and</strong>omly<br />

with interactively selected limits. The proposed method is as follows:<br />

1. Measure a set of evoked potentials. The delay between the repetitions has to<br />

be long enough so that the background EEG samples can also be measured.<br />

A typical set of measurements is shown in Fig. 5.1.<br />

2. Measure a set of background EEG samples. Pre-stimulus data can be used<br />

<strong>for</strong> this purpose. If the background EEG is assumed to be stationary it is<br />

possible to use an AR model in background simulation. Several samples of<br />

the EEG can be modeled <strong>and</strong> then the AR parameters can be averaged.<br />

Some conventional method such as the modified covariance method or the<br />

Yule-Walker method can be used <strong>for</strong> AR modeling [125]. The estimation of<br />

the residual power can be done with the same algorithms. Trend removal<br />

can be done be<strong>for</strong>e modeling if necessary. A typical set of background EEG<br />

samples is shown in Fig. 5.1. Ten typical spectra of AR(8) models <strong>for</strong><br />

background EEG are shown in Fig. 5.2.<br />

3. Calculate the mean of the AR-parameters <strong>and</strong> the estimated residual powers.<br />

Create two sets of realizations using these parameters. One set is used<br />

<strong>for</strong> the simulation of the pre-stimulus EEG <strong>and</strong> the other set is used as<br />

additive background EEG in simulated measurements of evoked potentials.<br />

4. Remove the linear trend from the measured evoked potentials. Calculate<br />

the average of the measurements <strong>and</strong> determine the number <strong>and</strong> locations


68 5. Simulation of evoked potentials<br />

of the peaks in the average. Fit a function consisting of e.g. Gaussian<br />

shaped functions to the average. Fitting can be based on the nonlinear<br />

least squares scheme discussed in Section 2.12. The average, the location<br />

of the peaks <strong>and</strong> the fitted sum of Gaussian functions are shown in Fig. 5.3<br />

<strong>for</strong> measurements shown in Fig. 5.1.<br />

5. Select the limits of the desired location <strong>and</strong> amplitude variation of the peaks.<br />

The locations <strong>and</strong> the amplitudes can then be selected using some joint<br />

density. The use of uni<strong>for</strong>m <strong>and</strong> Gaussian densities are straight<strong>for</strong>ward. In<br />

the Gaussian case the limits can correspond e.g. to some multiple of the<br />

deviation of the density. Typical limits are shown in Fig. 5.4 <strong>and</strong> a set of<br />

simulations using uni<strong>for</strong>m densities <strong>for</strong> locations in Fig. 5.5.<br />

6. Add the simulated background EEG to simulated evoked potentials to obtain<br />

simulated measurements. A set of final simulations is shown in Fig.<br />

5.6.<br />

−50<br />

−50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

−50<br />

−50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

Figure 5.1: Twenty measured evoked potentials (top left) <strong>and</strong> the background<br />

EEG (bottom left) with the corresponding means (top right <strong>and</strong><br />

bottom right). The vertical axis is in microvolts <strong>and</strong> the horizontal axis in<br />

points.<br />

In the example that is carried out here, <strong>and</strong> shown in Figs. 5.1–5.6, the<br />

uncorrelated uni<strong>for</strong>m joint density is used <strong>for</strong> parameters of the peaks. This is<br />

suitable e.g. <strong>for</strong> the situations in which the per<strong>for</strong>mance of the latency estimation<br />

method is evaluated. In such cases we only need realistic single simulations. The<br />

statistics of the whole set of simulations need not to be realistic.


5.2. Component based simulation of evoked potentials 69<br />

50<br />

45<br />

40<br />

35<br />

30<br />

25<br />

20<br />

15<br />

10<br />

5<br />

0<br />

π/2 π<br />

Figure 5.2: AR(8) amplitude spectra of 10 samples of measured background<br />

EEG. The vertical axis is arbitrary <strong>and</strong> the horizontal axis is the<br />

normalized frequency.<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.3: The mean of the measured evoked potentials (solid rough),<br />

the fitted sum of Gaussians (solid smooth) <strong>and</strong> the marked peaks of the<br />

mean (dotted vertical).


70 5. Simulation of evoked potentials<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.4: The fitted sum of the Gaussians (solid) <strong>and</strong> given limits of<br />

the variation (dotted rectangles)<br />

-15<br />

-10<br />

-5<br />

0<br />

5<br />

10<br />

15<br />

0 20 40 60 80 100 120<br />

Figure 5.5: Component based simulations of evoked potentials.


5.2. Component based simulation of evoked potentials 71<br />

-40<br />

-30<br />

-20<br />

-10<br />

0<br />

10<br />

20<br />

30<br />

40<br />

0 20 40 60 80 100 120<br />

Figure 5.6: Simulated measurements of evoked potentials with component<br />

based simulation method.


72 5. Simulation of evoked potentials<br />

5.3 Principal component based simulation of evoked potentials<br />

In some cases also the statistical properties of the whole data set should be realistic.<br />

This is the situation in cases where the end to the estimation is e.g. classification<br />

of the measurements. In such a case one possibility is to interpret the evoked<br />

potential as a r<strong>and</strong>om vector <strong>and</strong> use the realizations of an appropriate joint density<br />

as simulations of evoked potentials.<br />

We propose here a method <strong>for</strong> evoked potential simulation. The method is<br />

based on use the of 2’nd order joint density approximation that is extracted from<br />

a set of real measurements. The proposed method is as follows:<br />

1. Measure a set z = (z 1 ,...,z N ) of evoked potentials. A typical set of measurements<br />

is shown in Fig. 5.1. The length of the measurement is T = 128<br />

points.<br />

2. Form the average ˆη <strong>and</strong> the centered covariance estimate Ĉz <strong>for</strong> the measurements.<br />

3. Form the eigendecomposition<br />

where ˆΓ = (γ 1 ,...,γ N ) <strong>and</strong> Λ = diag (λ 1 ,...,λ N ).<br />

Ĉ zˆΓ = ˆΓΛ (5.1)<br />

4. Use some small number p of eigenvectors to <strong>for</strong>m the matrix ˆΓ p . Usually<br />

p eigenvectors corresponding to p largest eigenvalues are selected, so that<br />

ˆΓ p = (γ 1 ,...,γ p ). The idea is that the eigenvectors should correspond the<br />

evoked potential part of the measurements, not the background EEG.<br />

5. If N is small the mean <strong>and</strong> the eigenvectors can be noisy. In such a case<br />

they can be smoothed using the smoothness priors approach<br />

ˆΓ α = (I + αD T 2 D 2 ) −1ˆΓp (5.2)<br />

ˆη α = (I + αD T 2 D 2 ) −1ˆη (5.3)<br />

using some suitable value <strong>for</strong> α, as explained in Section 3.2. The columns<br />

of ˆΓ α using p = 4 <strong>and</strong> α = 10 are shown in Fig. 5.7 <strong>for</strong> the data shown in<br />

Fig. 5.1.<br />

6. Generate the simulations <strong>for</strong> the evoked potentials with equation [191]<br />

s = ˆΓ α Λ 1/2<br />

p x + ˆη α (5.4)<br />

where x ∼ N(0,I) <strong>and</strong> Λ p = diag (λ 1 ,...,λ p ). The density of the simulations<br />

s is then jointly Gaussian with mean ˆη α <strong>and</strong> covariance C s = ˆΓ α Λ pˆΓT α .<br />

A set of typical simulations <strong>for</strong> evoked potentials is shown in Fig. 5.8.<br />

7. Add the simulated background EEG to simulated evoked potentials to obtain<br />

simulated measurements. The background EEG can be simulated as<br />

in Section 5.2. In Fig. 5.9 the original set of measurements is shown with<br />

the simulated ones with the corresponding means.


5.3. Principal component based simulation of evoked potentials 73<br />

If all the eigenvectors are used when the matrix ˆΓ p is <strong>for</strong>med, the decomposition<br />

contains also the second order statistical in<strong>for</strong>mation of the background EEG.<br />

This kind of scheme can be used e.g. <strong>for</strong> generation of the larger data sets than<br />

the original measured data set. The simulated measurements are still ensured to<br />

have almost the same statistics than the original measurements. However, in most<br />

cases p has to be selected with care to avoid the eigenvectors corresponding the<br />

background EEG to be taken into matrix ˆΓ p .<br />

If the simulated evoked potentials vary too little around the mean ˆη α <strong>for</strong><br />

practical simulations, the matrix Λ 1/2<br />

p can be replaced with the identity matrix.<br />

0.2<br />

0.15<br />

0.1<br />

0.05<br />

0<br />

-0.05<br />

-0.1<br />

-0.15<br />

-0.2<br />

0 20 40 60 80 100 120<br />

Figure 5.7: Four largest eigenvectors <strong>for</strong>ming the columns of the matrix ˆΓ α.


74 5. Simulation of evoked potentials<br />

-60<br />

-40<br />

-20<br />

0<br />

20<br />

40<br />

60<br />

0 20 40 60 80 100 120<br />

Figure 5.8: A set of evoked potentials simulated with principal component<br />

based simulation method.<br />

-50<br />

-50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

-50<br />

-50<br />

0<br />

0<br />

50<br />

20 40 60 80 100 120<br />

50<br />

20 40 60 80 100 120<br />

Figure 5.9: Measured evoked potentials (top left) <strong>and</strong> the measurements<br />

simulated with the principal component based method (bottom left) with<br />

the corresponding means (top right <strong>and</strong> bottom right). The vertical axes<br />

are in microvolts <strong>and</strong> the horizontal axes in points.


5.4. Discussion 75<br />

5.4 Discussion<br />

The two simulation <strong>methods</strong> introduced here are different in nature. The simulation<br />

by the component potentials is clearly not consistent with the data in strict<br />

sense. The parameters of the components are adopted from the mean of the measurements<br />

that may not correspond to any single potential. This is also the reason<br />

why the means of the measurements <strong>and</strong> the simulations are generally different.<br />

However, with this method many parameters of the final simulations e.g. the peak<br />

locations can be easily controlled. The joint density of the peak parameters can<br />

be selected with any kind of shape <strong>and</strong> correlation. To obtain more realistic simulations<br />

with this method the Gaussian functions can be replaced with any other<br />

functions.<br />

The principal component based simulation gives more realistic simulations<br />

when compared to the data. This is not surprising since the first <strong>and</strong> second<br />

moments of the measurements <strong>and</strong> the simulations are almost the same by construction.<br />

However, with this method some difficulties can arise. If the end to<br />

the simulation is e.g. to <strong>for</strong>m a data set <strong>for</strong> evaluation of the peak detector, the<br />

peaks have to be detected from the noiseless data set first. This is due to the fact<br />

that this simulation method is <strong>for</strong>mulated so that in evoked potentials there is no<br />

peaks by construction.<br />

Component based simulation method is more suitable <strong>for</strong> generation of data<br />

sets with trends. These are needed <strong>for</strong> evaluation of dynamical estimation <strong>methods</strong>.


CHAPTER<br />

VI<br />

Estimation of evoked potentials<br />

In this chapter we review the existing <strong>methods</strong> <strong>for</strong> the estimation of the evoked<br />

potentials. The <strong>methods</strong> are presented in the <strong>for</strong>malism introduced in Chapter<br />

2 <strong>and</strong> special attention is paid to the analysis of the implicit assumptions of the<br />

<strong>methods</strong>. The <strong>methods</strong> are categorized into ensemble analysis <strong>methods</strong>, single<br />

trial <strong>methods</strong> <strong>and</strong> dynamical <strong>methods</strong>. The <strong>methods</strong> that are discussed in Sections<br />

6.3.5, 6.3.6, 6.3.7 <strong>and</strong> 6.5.4 are partially novel. These <strong>methods</strong> are then used in<br />

Chapter 7 in which two systematic <strong>methods</strong> <strong>for</strong> evoked potential estimation are<br />

proposed.<br />

6.1 Introduction<br />

The notion of evoked potential is used here <strong>for</strong> the time-varying potential s in some<br />

location of the scalp. The potential is caused by stimulation of the somatosensory<br />

system. We assume that the measurements z of these potentials contain also noise<br />

v. The source of v is the spontaneous brain activity which is called the background<br />

EEG. Background EEG is thought to be independent of the stimulation<br />

<strong>and</strong> additive to the evoked potential. This is also called the additive noise model<br />

<strong>for</strong> observations<br />

z = s + v (6.1)<br />

The approaches that are used in analysis of evoked potentials are sometimes<br />

categorized also into deterministic <strong>and</strong> stochastic approaches. In the deterministic<br />

approach s is thought to be fixed between the repetitions of the test. Although<br />

it has been evident <strong>for</strong> over three decades [22] that this is not always a proper<br />

assumption, it is commonly used in evoked potential analysis. In the <strong>Bayesian</strong><br />

<strong>for</strong>malism the deterministic <strong>and</strong> stochastic approaches are h<strong>and</strong>led equivalently.<br />

The deterministic approach can be interpreted as the stochastic approach with no<br />

prior in<strong>for</strong>mation. In the stochastic approach the evoked potential s is assumed<br />

to be a r<strong>and</strong>om vector with some probability density p(s).<br />

We use the following categorization <strong>for</strong> the <strong>methods</strong> that are reviewed in this<br />

chapter:<br />

76


6.2. Ensemble analysis 77<br />

• ensemble analysis <strong>methods</strong> that give in<strong>for</strong>mation about first or second order<br />

statistics of set of evoked potentials<br />

• single trial <strong>methods</strong> that give in<strong>for</strong>mation about single responses.<br />

• dynamical <strong>methods</strong> that take account the dependence between the single<br />

trials or even try to model it<br />

Especially we call single trial estimation <strong>methods</strong> the <strong>methods</strong> that give estimate<br />

<strong>for</strong> a single evoked potential. Although the ensemble characteristics can<br />

be efficiently used e.g. in classification, there are many situations in which the<br />

single trial estimate is seen more appropriate. For example, if we are interested in<br />

the true mean latency of some peak of the evoked potential, the “Latency of the<br />

average is not the average of the latencies”, as argued in [25] <strong>and</strong> [190]. It has also<br />

been argued, that “...determination of the average is not enough...” [22] <strong>and</strong> even<br />

that “...the average may be one of the least significant variables of the response...”<br />

[6].<br />

It has been also suggested that the use of average wave<strong>for</strong>m is not a proper<br />

approach in source localization [185]. The use of single trials in topographic estimates<br />

also gives temporal in<strong>for</strong>mation about e.g. adaptation <strong>and</strong> habituation as<br />

emphasized in [114].<br />

Most of the <strong>methods</strong> that are reviewed here are applicable <strong>for</strong> analysis of<br />

several kind of evoked potentials. The potentials that are used in the evaluation<br />

of the <strong>methods</strong> include e.g. the somatosensory evoked potentials (SEP), visual<br />

evoked potentials (VEP) <strong>and</strong> auditive evoked potentials (AEP). A special example<br />

of auditive evoked potentials is the P300 potential, that is named according to its<br />

large positive peak that occurs approximately 300 ms after the stimulus.<br />

In this review the measurements <strong>and</strong> potentials are finite length vectors whose<br />

elements are samples from the originally continuous-time wave<strong>for</strong>m. The t’th<br />

coordinate of the vector corresponds to the t’th time lag after stimulation. These<br />

scalars are denoted by z(t). The other time course is between the repetitions of<br />

the test. In this case the evoked potentials can be seen as vector valued stochastic<br />

processes. The i’th measurement vector is then z i = (z(T i + ∆T),...,z(T i +<br />

M∆T)) T where T i is the absolute time of the i’th stimulus <strong>and</strong> ∆T is the sampling<br />

interval. All <strong>methods</strong> are presented in the estimation theoretical <strong>for</strong>malism that<br />

is introduced in Chapter 2. N is the number of the measurements in set of all<br />

measurements.<br />

Other reviews on the analysis <strong>methods</strong> of evoked potentials are e.g. [179, 130,<br />

7, 117].<br />

6.2 Ensemble analysis<br />

In this section we review in detail some of the ensemble analysis <strong>methods</strong> used in<br />

evoked potential analysis. In particular, we are interested in the conditions under<br />

which the reviewed <strong>methods</strong> are optimal.


78 6. Estimation of evoked potentials<br />

6.2.1 Averaging<br />

The most conventional method of analysis of the evoked potentials is ensemble<br />

averaging. The measurement vectors z i are averaged to reduce the signal-to-noise<br />

ratio. In the estimation theoretical <strong>for</strong>malism, if s does not vary between the<br />

repetitions, the measurements can be written in <strong>for</strong>m<br />

z = s + v = Hθ + v (6.2)<br />

where H = (I| · · · |I) T , θ = s, z = (z1 T , · · · ,zN T )T <strong>and</strong> v = (v1 T , · · · ,vN T )T . The<br />

least squares solution <strong>for</strong> s can then be written in <strong>for</strong>m<br />

ŝ = (H T H) −1 H T z (6.3)<br />

∑ N<br />

= (NI) −1 z i (6.4)<br />

= 1 N<br />

i=1<br />

N∑<br />

z i = z (6.5)<br />

i=1<br />

that is, the conventional average of the measurements. Thus we conclude that<br />

the average of the measurements is the best estimator in least squares sense <strong>for</strong><br />

deterministic s <strong>and</strong> additive noise model. The minimization of the least squares<br />

criterion is equivalent to the assumption v ∼ N(0,σ 2 I).<br />

If s is not deterministic, we obtain <strong>for</strong> expectation of ŝ<br />

E {ŝ} = 1 N<br />

= 1 N<br />

N∑<br />

E {z i } (6.6)<br />

i=1<br />

N∑<br />

E {s i } + 1 N<br />

i=1<br />

N∑<br />

E {v i } (6.7)<br />

i=1<br />

If s i are drawn from same joint density <strong>and</strong> v i are zero mean, we obtain<br />

E {ŝ} = 1 N<br />

N∑<br />

η s = η s (6.8)<br />

i=1<br />

so that with the stochastic assumption <strong>for</strong> the evoked potentials the average of<br />

the observations is an unbiased estimator of the expected value of the evoked<br />

potentials.<br />

6.2.2 Weighted <strong>and</strong> selective averaging<br />

In conventional averaging all measurements are treated with same weight. This<br />

is natural since we have assumed nothing about theOBB noise variance between<br />

the repetitions of the test. An approach to reduce the effect of the most noisy<br />

wave<strong>for</strong>ms is to take into account the covariance of the background EEG. With


6.2. Ensemble analysis 79<br />

H = (I| · · · |I) T , θ = s, z = (z T 1 , · · · ,z T N )T <strong>and</strong> v = (v T 1 , · · · ,v T N )T the Gauss–<br />

Markov estimator can be written in <strong>for</strong>m<br />

where<br />

ŝ = (H T Cv<br />

−1 H) −1 H T Cv −1 z (6.9)<br />

C v = E { (v T 1 ,...,v T N) T (v T 1 ,...,v T N) } (6.10)<br />

Usually we can assume that the noise vectors v i are mutually independent. This is<br />

especially true if the background EEG can be modeled as a process having rather<br />

short correlation time. This means that it is a wide b<strong>and</strong> process. In this case the<br />

covariance of the EEG is of the block diagonal <strong>for</strong>m<br />

C v = diag (C v1 ,...,C vN ) (6.11)<br />

When this is applied to equation (6.9), the resulting estimate is of the <strong>for</strong>m<br />

ŝ =<br />

( N<br />

∑<br />

i=1<br />

C −1<br />

v i<br />

) −1 N<br />

∑<br />

i=1<br />

C −1<br />

v i<br />

z i (6.12)<br />

This is the minimum variance estimate of s when the background covariances in<br />

the independent measurements are known or can be estimated.<br />

Equation (6.12) reduces to the conventional average if the background covariances<br />

are identical between the repetitions, that is, if C vi = C, i = 1,...,N. This<br />

result can be found from any st<strong>and</strong>ard text book of data analysis [119]. It is also<br />

obtained (totally unnecessarily) with Monte Carlo simulations <strong>for</strong> a homogenous<br />

set of artificial brainstem potentials in [118]. Note that the result holds even <strong>for</strong><br />

nonstationary background, that is, in cases when C is not a Toeplitz matrix.<br />

Background EEG is sometimes modeled as white noise. If we assume that the<br />

variance of the background can differ between the repetitions, we can write<br />

In this case, the estimated evoked potential is of the <strong>for</strong>m<br />

C vi = σ 2 i I (6.13)<br />

ŝ =<br />

w i = 1 σ 2 i<br />

( N<br />

∑<br />

i=1<br />

σ −2<br />

i I<br />

) −1 N<br />

∑<br />

i=1<br />

∑ N<br />

σ −2 i=1<br />

i z i =<br />

w iz i<br />

∑ N<br />

i=1 w i<br />

(6.14)<br />

(6.15)<br />

This is clearly a weighted average of the measurements. The estimate is the<br />

minimum variance estimate <strong>for</strong> the evoked potential. Although also this result can<br />

be found found from any st<strong>and</strong>ard text book of data analysis [119] it is derived<br />

(again unnecessarily) in [84] <strong>and</strong> [118].<br />

The requirement (6.13) is too restrictive. The result (6.14–6.15) holds also <strong>for</strong><br />

C vi = σ 2 i C (6.16)


80 6. Estimation of evoked potentials<br />

where C is any positive definite matrix. The weighted average (6.14–6.15) is thus<br />

optimal even <strong>for</strong> nonstationary background processes.<br />

The difficulty in this estimation scheme is the estimation of the noise variance<br />

in the measurement. An approach to the estimation of the covariance is to compare<br />

the individual measurements to the conventional average of the measurements. We<br />

know that if s is deterministic, the average is an unbiased estimator <strong>for</strong> s. We can<br />

then approximate the noise covariance by differentiating the measurements from<br />

the average <strong>and</strong> using the sample covariance of the residual as an estimate <strong>for</strong><br />

the error covariance. If the changes in the background statistics are slow, the<br />

background can be modeled e.g. as AR model based on the measurements be<strong>for</strong>e<br />

the stimulus. The covariance of the AR model can then be used as estimate <strong>for</strong><br />

the background covariance.<br />

Slightly different approach is adopted in [41]. There it is assumed that the<br />

noise variance is constant C vi = σvI 2 <strong>and</strong> the “amplitude” of the evoked potential<br />

varies so that<br />

s i = w i s (6.17)<br />

It is found that in this case the equation (6.14) maximizes the signal-to-noise ratio<br />

in average potential. The method <strong>for</strong> estimation of the weights w i proposed in [41]<br />

is quite lengthy <strong>and</strong> is based on the generalized eigenvalue analysis of covariances.<br />

However, the result can be obtained with a much more simple analysis: Let w =<br />

(w 1 ,...,w N ) T . Then the model <strong>for</strong> the observations is<br />

z = sw T + v = Hθ + v (6.18)<br />

where z = (z 1 ,...,z N ), H = s, θ = w T<br />

solution <strong>for</strong> the weights is then<br />

<strong>and</strong> v = (v 1 ,...,v N ). The least squares<br />

ŵ T = (s T s) −1 s T z (6.19)<br />

Since s is not known, we approximate it with the best estimate available, with the<br />

average (sample mean) z<br />

The result proposed in [41] is<br />

ŵ T = (z T z) −1 z T z (6.20)<br />

= zT z<br />

‖z‖ 2 (6.21)<br />

ŵ T = zT z<br />

∥ z T z ∥ (6.22)<br />

where only the scaling factor in denominator differs from (6.21). In weighted<br />

averaging the scaling of the weights is arbitrary.<br />

It is easy to modify the equations (6.14–6.15) with different selection of the<br />

weights w i . We can, <strong>for</strong> example, choose<br />

{<br />

1 ,σi 2 ≤ σ s<br />

0 ,σi 2 > σ (6.23)<br />

s


6.2. Ensemble analysis 81<br />

<strong>and</strong> thus reject those measurements from the average, that have the variance<br />

greater than σ 2 s. This approach is sometimes called the selective averaging. The<br />

method is proposed e.g. in [14] <strong>and</strong> [210] where the selection is based on the<br />

manual inspection of the measurements. This method is clearly an example of a<br />

robust estimation method [86].<br />

In [64] a method called selective averaging by cross covarying is proposed. Roughly speaking, in this method the measurements are weighted with the covariance between the measurements and the average. A fully deterministic model is used. The method is systematic but quite ad hoc.

In [161] selective averaging is based on the cross correlation of the measurement with a "template" waveform. The templates are derived from the average waveform by windowing. Only the waveforms with a high correlation with the template are selected for averaging.

In [65] the evoked potentials are first clustered according to the spectral content of the background EEG by fuzzy clustering methods. The selective averaging is then performed group by group.

6.2.3 Latency-dependent averaging

The fact that the evoked potentials have variations in latencies during the test has led to the design of methods that attempt to take these variations into account.

In Woody's "adaptive" method [235], the measured evoked potential is cross-correlated with a "template" waveform. The measurement vector is then shifted before the averaging according to the maximum of the cross-covariance. The procedure can then be repeated with the calculated average as the template. The method is usually called Woody averaging or cross correlation averaging (CCA). Woody's method is evaluated e.g. in [221]. The model for the evoked potential in Woody's method is thus a deterministic waveform with a random latency shift. If the latency shift of a single measurement can be estimated, the method gives the optimal estimate for the underlying waveform.
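For illustration, a bare-bones numpy sketch of the Woody/CCA iteration follows. The function name, the circular shifting and the shift limit are simplifying assumptions of ours, not part of the original method.

```python
import numpy as np

def woody_average(z, n_iter=5, max_shift=20):
    """Cross-correlation (Woody) averaging: a minimal sketch.

    z : (N_trials, N_samples) ensemble.  Each sweep is shifted to the lag that
    maximizes its cross-correlation with the current template (initially the
    plain average), and the template is re-computed from the aligned sweeps.
    Circular shifts are used here for brevity.
    """
    template = z.mean(axis=0)
    lags = np.arange(-max_shift, max_shift + 1)
    for _ in range(n_iter):
        aligned = np.empty_like(z, dtype=float)
        for i, zi in enumerate(z):
            cc = [np.dot(np.roll(zi, -lag), template) for lag in lags]
            aligned[i] = np.roll(zi, -lags[int(np.argmax(cc))])
        template = aligned.mean(axis=0)
    return template
```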

Another common approach to the problem is latency corrected averaging (LCA) [129]. In LCA every peak is aligned separately before the averaging. In the method, the waveforms are first filtered with a Wiener-type filter [6] and a pre-selected number of peaks is detected by cross-correlating the responses with a template waveform. A triangle-shaped template is suggested in [129]. By using the histogram of the peak locations, the peaks that correspond to each other are detected and aligned. The averaging is then carried out peak by peak. The resulting average produced by LCA is usually a discontinuous waveform. LCA and CCA are compared with the conventional averaging in [8]. It is found that CCA can lock onto the background if narrow-band processes, such as α activity, are present. The assumptions of LCA are more realistic than the assumptions of CCA, but the result of latency corrected averaging is no longer a continuous waveform. In LCA the peaks of the resulting average are estimates for the means of the peaks of the original waveforms.


In [131] a method for converting the disjoint segments of LCA into a continuous waveform is presented. The method is called continuous LCA. It is based on least squares fitting of the LCA result using sinusoidal basis functions (Fourier series).

In [239] the original LCA method is modified and the modification is called peak component latency corrected averaging (PC-LCA). The main differences between the original and modified methods are that in [239] adaptive filtering is used as a pre-processor, the template is a cosine-shaped waveform, and the amplitude information is also taken into account in the calculation of the latency distribution of the peaks.

6.2.4 Filtering of the average response

The fact that the average evoked potential can still be noisy has motivated many authors to propose methods with which the signal-to-noise ratio can be further improved after averaging. The filtering method that has received the most attention is the Wiener filtering approach. This is due to its optimality as a minimum mean square estimator, as explained in Section 4.5.

Most of the early works dealt with the original form of the Wiener filter

K(ω) = S_s(ω) / (S_s(ω) + S_v(ω))    (6.24)

The difficulty in this is clearly that the form of the filter necessitates prior knowledge about the spectra S_v(ω) and S_s(ω) of the background EEG and the evoked potential, respectively. This has led to a method that is called a posteriori "Wiener" filtering (APWF). The method is originally proposed in [219]. The idea in the method is that the spectra can be estimated a posteriori by comparing the spectrum S_{z̄}(ω) of the averaged measurements to the averaged spectrum S̄_z(ω) of the individual measurements. The method has been used for the smoothing of average visual evoked potentials, and it was found that the method gives unrealistic results [149]. In [54] it was found that the formulation of the APWF is biased and a correction was derived. After that, in [4] it was argued that both forms are erroneous because the rejection of the noise in the averaging of a sampled signal is not the same for different frequencies. A method for the comparison of APWF with other methods is presented in [55].

Papers in which APWF is found to be either ineffective or even to give erroneous results include e.g. [13, 210, 3, 144, 27] and [225]. There are two main reasons for the difficulties with the Wiener filtering. The first is that the methods for estimating the spectra necessitate that the signal is deterministic. The other is that the evoked potentials are transient-like waveforms that can seldom be thought to be stationary. In the derivation of the Wiener filter it was assumed that the signal is random and stationary.

The criticism is summarized in [47, 44]. It is also proposed that the so-called a posteriori time-varying filtering would solve the problem. The method is a sub-band weighting method for Wiener-type filtering of the average evoked potential.


The properties of the method are discussed in [43, 45, 46]. Finally, in [138] it was summarized that the spectra of the averaged background and the average potential are never so disjoint that any kind of digital filtering could be applied to the averaged signal in an a posteriori way. In cases where the method is applicable, no smoothing is usually needed. This is also reported in [105]. The properties of APWF have recently been studied e.g. in [62].

Other filtering approaches for the average potential are used e.g. in [1], where digital filtering is suggested with a wide-band amplifier, and in [24], where ARMA filtering is compared with APWF. Recent studies on the filtering of an average waveform include e.g. [60], where a hybrid FIR/median filter is suggested for the enhancement of the average.
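As an illustration of the a posteriori idea only (not of any particular corrected variant discussed above), the following numpy sketch estimates the signal and noise spectra by comparing the spectrum of the average with the averaged individual spectra, and applies the resulting Wiener-type gain to the average. The function name and the exact spectral bookkeeping are our assumptions.

```python
import numpy as np

def apwf(z):
    """A posteriori "Wiener" filtering of the average: a rough sketch.

    z : (N, M) ensemble of sweeps.  The spectra are estimated a posteriori
    from the data; the bias corrections discussed in the text are omitted.
    """
    N, M = z.shape
    avg = z.mean(axis=0)
    S_avg = np.abs(np.fft.rfft(avg)) ** 2                          # spectrum of the average
    S_ind = np.mean(np.abs(np.fft.rfft(z, axis=1)) ** 2, axis=0)   # averaged individual spectra
    S_v = np.maximum(S_ind - S_avg, 0.0) / (N - 1)                 # noise power left in the average
    S_s = np.maximum(S_avg - S_v, 0.0)                             # signal power estimate
    K = S_s / (S_s + S_v + 1e-12)                                  # Wiener-type gain, cf. (6.24)
    return np.fft.irfft(K * np.fft.rfft(avg), n=M)
```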

6.2.5 Deconvolution methods

If the latencies of the component potentials vary between the repetitions of the test, it is a well-known fact that the average of the potentials is the component waveform convolved with the probability density function of the latency variation. This has led to methods in which the average is sharpened by filtering with an inverse convolution filter. Since the probability density can often be assumed to be a smooth, e.g. Gaussian-shaped, function, the inverse filter is a high-pass filter. This is emphasized e.g. in [131], where the deconvolution method is called enhanced averaging. The same histogram information can be used for the design of the filter as in latency corrected averaging. Since the deconvolution of this kind of smoothing kernel is a highly ill-posed problem, the method is improved in [165] by using so-called iterative restoration algorithms. Both methods necessitate quite smooth average waveforms to be successful.

In [147] it is emphasized that the latency variation can be caused by asynchronous sampling. In that case the smoothing kernel is a uniform moving-average window of the length of the sampling period. In [147] the deconvolution is performed in the frequency domain.

It is worth stressing that deconvolution with this kind of kernel is a classical example of an inverse problem. If the probability density of the latency is known or can be estimated, e.g. with the methods explained in Section 6.4.3, the discrete form of the convolution kernel is easy to form. The deconvolution is then simple to perform with any of the regularization methods that are reviewed in Chapter 3.
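A minimal sketch of such a regularized deconvolution, assuming the latency density has already been discretized, is given below. The function name, the one-sided shift kernel and the zeroth-order Tikhonov penalty are our illustrative choices, not the specific regularization used in the cited works.

```python
import numpy as np

def deconvolve_average(avg, latency_pdf, alpha=1e-2):
    """Regularized deconvolution of an average waveform: a minimal sketch.

    avg         : averaged waveform of length M.
    latency_pdf : discretized probability density of the latency variation
                  (here assumed to describe non-negative shifts only).
    alpha       : Tikhonov regularization parameter.
    """
    M = len(avg)
    A = np.zeros((M, M))
    for k, p in enumerate(latency_pdf):
        A += p * np.eye(M, k=-k)            # contribution of a shift by k samples
    # Tikhonov-regularized solution of the ill-posed system A s = avg
    s = np.linalg.solve(A.T @ A + alpha**2 * np.eye(M), A.T @ avg)
    return s
```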

6.2.6 Principal components analysis

Standard multivariate statistical methods are sometimes used for the analysis of the evoked potentials [50, 52]. Principal Components Analysis (PCA) was perhaps first used for EPs in [90] and in [181]. In PCA the stochastic sample is represented as a weighted sum of orthogonal basis vectors. The weights and the basis vectors are obtained from the eigendecomposition of the covariance or correlation matrix of the ensemble of the measurement vectors [91]. The fact that the basis functions in PCA are orthonormal has led to the misinterpretation that the basis functions could represent some independent physiological generators. In fact, already in [90], p. 441, it was warned that "The principal factors, in themselves, do not have any physiological meaning and should not be constructed to represent a physiological system." After that, the use of PCA for the discrimination of the individual physiological component potentials was widely discussed in [53, 222] and [51]. The variations in the peak locations have been found to be the main reason for the existence of the "extra" components in principal components analysis [136]. These extra components are generated when the basis functions are rotated to be strictly positive or negative.

The applicability of PCA in the analysis of evoked potentials is also criticized at least in [176, 174, 223, 224, 233] and [139].

In itself, principal components analysis gives information on the second-order statistics of the ensemble of the measurements. This information can further be used e.g. for the classification of the measurements.
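In practice the decomposition reduces to an eigendecomposition of the sample correlation (or covariance) matrix of the ensemble, e.g. as in the following sketch; the function name and the number of retained components are ours.

```python
import numpy as np

def ensemble_pca(z, n_components=4):
    """PCA of an ensemble of sweeps: a small sketch.

    z : (N_trials, N_samples) matrix of measurement vectors.
    Returns the leading eigenvectors of the sample correlation matrix and
    the weight of each sweep in that basis.
    """
    R = z.T @ z / z.shape[0]                  # sample correlation matrix, cf. R_z = E{zz^T}
    eigval, eigvec = np.linalg.eigh(R)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    K = eigvec[:, order]                      # orthonormal basis of the principal subspace
    weights = z @ K                           # weights of each sweep
    return K, weights
```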

6.2.7 Classification

As emphasized in [133], estimation is the continuous extension of the classification and detection problems. In classification and detection problems the task is to decide to which group the evoked potential belongs, based on some features extracted from the data. Detection is classification into two classes. A common detection problem is the decision whether or not an evoked potential is present in the signal at all. However, we do not review the classification and detection methods here in detail. Classification and detection of the evoked potentials are studied e.g. in [196, 216, 137, 227, 35, 10, 172, 36, 168, 98, 68, 169, 199] and [204].

Matched filtering is a method for finding a known deterministic waveform in a noisy signal. In this sense matched filtering is a detection method and necessitates that the shape of the evoked potential is known a priori. The matched filter approach is used for the detection of the peaks for the improvement of Woody's method in [197]. The effect of the threshold setting is studied in [236]. In [234] it is found that matched filtering is not practical below an SNR of 25 dB. This means that detection of a single trial is hardly ever possible.

6.2.8 Other ensemble analysis methods

An evident modification of averaging is the use of some other point estimate, e.g. the median [20]. It has been reported that the median is less sensitive to large disturbances in the data than the average [220, 177, 20]. However, with small sample sizes the median can create additional disturbances, such as peak splitting [178]. The use of the mode has also been studied; in [109] the mode is calculated with a genetic algorithm.

In [31] it is assumed that the brainstem auditory evoked potential (BAEP) is an output of a linear system with an impulse input. In such a case the response of the system to several impulses is the convolution of the single impulse response with the input impulse sequence. In [31] maximum length sequences are used for the stimulation of the BAEP. In [188] it is suggested that the auditory system is nonlinear, and a method for nonlinear system identification using the same sequences is proposed.

An approach to the smoothing of the average is to use least squares fitting with specific basis functions for the average. In [131] trigonometric basis functions are used for the modeling of the discontinuous components of the LCA method. In [127] the use of Fourier, Haar and Walsh–Hadamard basis functions as well as the use of principal components is discussed. In [126] a polynomial basis is used for this purpose.

6.3 Single trial estimation

In this section we review the most common methods proposed for single trial estimation. The methods in Sections 6.3.5, 6.3.6 and 6.3.7 that are reviewed here are then used in Chapter 7, where two systematic methods for evoked potential estimation are proposed.

6.3.1 Filtering of single responses

Digital filtering is sometimes used also for single responses [180, 6]. The simplest filters are moving average FIR filters [180], and more complicated ones are designed to detect some specific peak such as the P300 [59]. Wiener filtering is also possible for single measurements [6, 29]. In all cases the filter has to have a symmetric impulse response to avoid phase distortion. The main problem in linear time-invariant filtering is the fact that the spectra of the evoked potential and the background noise usually overlap heavily. This is reported at least in [197, 105, 195]. This is natural since the evoked potential is a transient-like smooth waveform with no periodicity. The spectrum of this kind of waveform is not properly defined. The effect of digital filtering on single waveforms is studied e.g. in [180, 175, 120, 21] and [148].

6.3.2 Time-varying Wiener filtering

Although it is emphasized in Section 4.5 that the use of the notion of Wiener filtering is diverse, we here call time-varying Wiener filtering methods those methods that employ the time-varying Wiener filter equation

K = R_{sz} R_z^{-1}    (6.25)

The crucial problem in these methods is to obtain a good model for the cross-covariance R_{sz}. In general this task necessitates a model for the observations and some analytical model for the evoked potential s. This model can then be estimated using the observations. In [237] the filter is called the time-varying minimum mean square error filter. First it is noted that when the signal and the noise are uncorrelated, R_{sz} = R_{ss}. The signal is then modeled as a superposition of components with random location and amplitude. The parametric form of the covariance of the signal is then calculated. The presented form necessitates the probability densities of the peak locations as well as the means and the variances of the peak amplitudes. This information can be obtained using the LCA method. This approach is extended in [226] for multielectrode measurements.

6.3.3 Linear least squares estimation

One possibility for single trial estimation is to start with an observation model. The additive noise model for the measurements can be written in the form

z = s + v = Hθ + v    (6.26)

where z = (z_1, ..., z_N), v = (v_1, ..., v_N) and s = (s_1, ..., s_N) = (Hθ_1, ..., Hθ_N). Here z_i, s_i and v_i are column vectors. The evoked potentials are thus modeled as a linear combination of some basis vectors, namely the columns of the matrix H. With different choices of the basis, several different least squares estimation schemes can be obtained.

The most obvious scheme is to use some generic basis selection. For example, if we assume that the measured evoked potential is composed of Gaussian-shaped component potentials with a preselected shape, we can use these as the columns of the matrix H. In this scheme it is also easy to add the basis functions ψ_0 = 1 and ψ_1 = t that can model a first-order trend in the measurements. The most trivial selection is obviously H = I.

If the covariance of the background EEG does not change between the repetitions of the test, the Gauss–Markov estimate for the evoked potentials can be calculated using the equation

θ̂ = (H^T C_v^{-1} H)^{-1} H^T C_v^{-1} z    (6.27)

where C_v = E{v v^T}. In the simplest case we assume C_v = I. The estimate for the single evoked potential is then of the form

ŝ = H θ̂    (6.28)

However, the method is not practical for single trial estimation without regularization.
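The Gauss–Markov estimate (6.27)–(6.28) amounts to one linear solve; a small numpy sketch (the function name is ours):

```python
import numpy as np

def gauss_markov(z, H, Cv=None):
    """Gauss-Markov estimate (6.27)-(6.28) for one sweep: a minimal sketch.

    z  : measurement vector of length M.
    H  : (M, p) observation matrix, e.g. Gaussian-shaped basis vectors.
    Cv : background covariance; the identity is used if None.
    """
    Cv_inv = np.eye(len(z)) if Cv is None else np.linalg.inv(Cv)
    theta = np.linalg.solve(H.T @ Cv_inv @ H, H.T @ Cv_inv @ z)   # (6.27)
    return H @ theta                                              # (6.28)
```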

6.3.4 Adaptive filtering

The use of adaptive filtering in the analysis of evoked potentials has been studied intensively during the last ten years. Most attention has been paid to the use of the LMS algorithm for the filtering of single evoked responses. Another application has been to use adaptive algorithms to model the trends in the measurements between the repetitions of the test. These methods are discussed in Section 6.5.3.

In most applications the procedure is as explained in Section 4.4. In the LMS algorithm

θ̂_t = θ̂_{t−1} + µ ϕ_t ε(t)    (6.29)
ε(t) = z(t) − ϕ_t^T θ̂_{t−1}    (6.30)

the measurement z(t) is selected as the primary input. Several choices have been proposed for the reference input, that is, the regressor vector ϕ_t. With the parenthesis notation it is again emphasized that the algorithm is used in scalar form, that is, for the scalar measurement. The time index then refers to the sample index within a single measurement vector. As is discussed in Section 4.4, this approach corresponds to time-varying linear regression, and the measurements z(t) are modeled as a linear combination of the components of the reference input with time-varying coefficients. The recursion corresponds to stochastic steepest descent minimization (the stochastic gradient method [85]). As explained in Section 2.17, in the LMS algorithm the parameters are updated in the direction of the regressor vector. If the reference input is fixed, the iterative method can serve at most as a regularization method, provided that the mean squared error diminishes in the direction that is determined by the reference input vector. In some cases, e.g. when the reference input changes during the iteration, true adaptation can also be achieved. Steepest descent methods could also be useful for nonlinear problems, but no references exist for that kind of application.
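In scalar form the recursion (6.29)–(6.30) amounts to only a few lines; a minimal sketch (the function name and step size are ours):

```python
import numpy as np

def lms_filter(z, phi, mu=0.01):
    """Scalar LMS recursion (6.29)-(6.30): a minimal sketch.

    z   : primary input of length T (one sweep, sample by sample).
    phi : (T, p) regressor vectors, e.g. delayed samples of a reference sweep.
    mu  : step size.
    """
    T, p = phi.shape
    theta = np.zeros(p)
    y = np.zeros(T)     # filter output
    e = np.zeros(T)     # error sequence
    for t in range(T):
        y[t] = phi[t] @ theta                 # prediction with the current coefficients
        e[t] = z[t] - y[t]                    # (6.30)
        theta = theta + mu * phi[t] * e[t]    # (6.29)
    return y, e
```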

In many studies it is emphasized that the adaptive filtering approach does not necessitate prior information about the evoked potentials, see e.g. [156]. In fact, the reference signal and the model structure are the prior information used in adaptive filtering. Also, the use of the LMS algorithm corresponds to the assumption that the covariance of the parameters is of the specific form that is discussed in Section 2.17.

One of the most cited works in this area is [203], in which it is proposed that two measurements z_i and z_j can be fed into the LMS algorithm, z_j as the reference and z_i as the primary input. The regressor vector is thus of the form

ϕ_t = (z_j(t), z_j(t − 1), ..., z_j(t − p))^T    (6.31)

It is assumed that z_i and z_j differ only with respect to the additive noise process, which means that the evoked potential is deterministic. The approach is thus to estimate (locally) one measurement with a set of delayed versions of another measurement. In [122] it is emphasized that if the signal is deterministic and no assumptions are made about the noise, this approach cannot give better results than averaging of the two measurements. In the least squares sense this is true, but in some situations it is possible that the variance of the estimate is reduced at the cost of increased bias.

Earlier, in [121], the performance of adaptive noise cancellation was tested using almost the same noise sequence as the reference as was used in the primary input. The primary signal was thus modeled as a linear combination of delayed noise sequences. The difference ε_t is then argued to correspond to the evoked potential.

In [217] the method that is presented in [203] is expanded for several reference signals. The regressor vector is thus of the form

ϕ_t = (z_{I(1)}(t), ..., z_{I(p)}(t))^T    (6.32)

where I is some permutation of the index set of the measurement set. In other words, any p measurements can be selected to form the regressor vector. The model for the measurement is then that it is a linear combination of several other measurements.


In [48] a method is proposed in which the signal is first filtered with two different low-pass filters. Based on the variances of the outputs of these, the signals are adaptively smoothed forwards and backwards.

In [112] the background EEG is first modeled as a stationary AR process using the pre-stimulus data. The average waveform is then modeled as a time-varying AR process with the RLS algorithm. The parameters are then used as a time-varying model for the evoked potentials. The observation is then modeled with state-space equations using the evoked potential as the state vector. The evoked potentials are then estimated using the Kalman filtering approach.

In [123] a modified adaptive line enhancement method is proposed. In this method the pre-stimulus data is adaptively modeled with an AR model. This model is then used to filter the post-stimulus data. The notion "modified" means that a non-adaptive filter is used for the post-stimulus data. The filtered waveforms can further be averaged.

In [32] adaptive filtering is used in the analysis of brainstem auditory evoked potentials (BAEP) in the same manner as in [203]. The average potential is used as the reference input. Thus the underlying model is that single BAEPs are linear combinations of shifted averages of all the evoked potentials.

In [156] several physical channels are used for measuring the somatosensory potential of a peripheral nerve. The primary input is selected to be the electrode nearest to the nerve. The other channels are used as references.

6.3.5 Principal component regression approach

In Section 3.3 it was emphasized that the selection of the basis vectors in least squares estimation can be based on the eigendecomposition of the data matrix. If the covariance R_s of the evoked potentials is known, we can write the model for the evoked potentials in the form

s = K_S θ    (6.33)

where the columns of K_S are some subset of the eigenvectors of R_s. If we use the additive noise model for the observations we can write

z = K_S θ + v    (6.34)

The Gauss–Markov estimate is then

θ̂ = (K_S^T C_v^{-1} K_S)^{-1} K_S^T C_v^{-1} z    (6.35)
ŝ = K_S θ̂    (6.36)

The correlation matrix of the evoked potentials is usually not known. In some cases the first eigenvectors of the data correlation matrix

R_z = E{z z^T}    (6.37)

span nearly the same subspace as those of the evoked potentials. Then we can approximate the signal subspace by the first eigenvectors of R_z. This case is also discussed in Section 2.14. This approach is clearly equivalent to first taking all measurements as regressors and then using principal component regression to reduce the dimensionality of the problem. In principal component regression the evoked potential is forced to lie in the principal subspace spanned by the columns of K_S.

This method is proposed in [100] for the estimation of visual evoked potentials. An almost identical approach is used in [107] with the white noise assumption C_v = σ_v^2 I. In this case the estimator reduces to

θ̂ = (K_S^T K_S)^{-1} K_S^T z = K_S^T z    (6.38)
ŝ = K_S θ̂ = K_S K_S^T z    (6.39)
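A small numpy sketch of the principal component regression estimate (the function name is ours; K_S is assumed to have orthonormal columns):

```python
import numpy as np

def pcr_estimate(z, K_S, Cv=None):
    """Principal component regression estimate (6.35)-(6.36)/(6.38)-(6.39): a sketch.

    z   : measurement vector.
    K_S : (M, k) orthonormal basis of the assumed signal subspace.
    Cv  : background covariance; the white-noise form (6.38)-(6.39) is used if None.
    """
    if Cv is None:
        theta = K_S.T @ z                                                   # (6.38)
    else:
        Cv_inv = np.linalg.inv(Cv)
        theta = np.linalg.solve(K_S.T @ Cv_inv @ K_S, K_S.T @ Cv_inv @ z)   # (6.35)
    return K_S @ theta                                                      # (6.36)/(6.39)
```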

6.3.6 Subspace regularization of evoked potentials

As discussed in Section 3.4, the principal component regression approach can be modified so that it is less restrictive. With evoked potentials we want the noiseless signals s = Hθ to be close to some pre-selected subspace S. It was shown in Section 3.4 that the solution is then of the form

θ̂ = arg min_θ { ‖L_1 (z − Hθ)‖^2 + α^2 ‖(I − K_S K_S^T) Hθ‖^2 }    (6.40)

Since the matrix L_2 = (I − K_S K_S^T) is symmetric and idempotent, we can write L_2 = L_2 L_2^T = (I − K_S K_S^T). By comparing this form with the results in Section 3.4 we note that L_1^T L_1 = C_v^{-1}, and the solution is then

θ̂ = (H^T C_v^{-1} H + α^2 H^T (I − K_S K_S^T) H)^{-1} H^T C_v^{-1} z    (6.41)
ŝ = H θ̂    (6.42)

In Section 3.4 this approach is also shown to have a connection to Bayesian mean square estimation. The matrix α^2 H^T (I − K_S K_S^T) H corresponds formally to the inverse of the covariance of the estimated parameters.

The basis selection in this approach can be based on the methods discussed in Section 6.3.3. The basis K_S of the regularizing subspace S can be selected using some approach discussed in Section 3.3. A systematic method for single trial estimation of evoked potentials using the subspace regularization method is proposed in Section 7.1.

A slightly different form of the method is proposed in [100], where subspace regularization is compared with the principal component regression approach.
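For reference, the closed-form solution (6.41)–(6.42) in a short numpy sketch (the function name and the default value of α are ours):

```python
import numpy as np

def subspace_regularized_estimate(z, H, K_S, Cv, alpha=1.0):
    """Subspace-regularized estimate (6.41)-(6.42): a minimal sketch.

    z     : measurement vector.
    H     : observation matrix.
    K_S   : orthonormal basis of the regularizing subspace S.
    Cv    : background noise covariance.
    alpha : regularization parameter.
    """
    Cv_inv = np.linalg.inv(Cv)
    P = np.eye(K_S.shape[0]) - K_S @ K_S.T           # projector onto the complement of S
    A = H.T @ Cv_inv @ H + alpha**2 * H.T @ P @ H    # left-hand matrix of (6.41)
    theta = np.linalg.solve(A, H.T @ Cv_inv @ z)
    return H @ theta                                 # (6.42)
```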

6.3.7 Smoothness priors estimation of evoked potentials

The smoothness priors method is directly applicable to the smoothing of evoked potentials. It is clearly a noncausal filtering method. The smoothness priors approach can be combined with any choice of the basis vectors in the observation model. When the smoothing is performed for the noiseless potentials s = Hθ, the solution can be written in the form

θ̂ = (H^T H + λ^2 H^T D_d^T D_d H)^{-1} H^T z    (6.43)
ŝ = H θ̂    (6.44)

The smoothness priors approach is used with a recursive estimator for brainstem auditory evoked potentials in [145]. The observation matrix is there based on knowledge of the impulse response of the earphone. A second-order difference is used for smoothing, and the Bayesian connection is also emphasized.

The potentials that are estimated with any estimation method may still be noisy, and the smoothness priors method can also be applied to estimated evoked potentials. For example, in subspace regularization the estimates can be noisy since the eigenvectors of the estimated covariance matrix are noisy when the sample size N is small. The model for the estimated evoked potentials can then be written in the form

ŝ = I s_2 + v_2    (6.45)

with C_{v_2} = I. That is, we assume that the estimated evoked potentials still contain some additive noise. The smoothness priors solution for ŝ_2 is then

ŝ_2 = (I + α_2^2 D_d^T D_d)^{-1} ŝ    (6.46)

This is the cascade form of the smoothness priors estimation. Another possibility is to use the smoothing in parallel form. For example, with subspace regularization the solution would then be

θ̂ = arg min_θ { ‖L_v (z − Hθ)‖^2 + α^2 ‖(I − K_S K_S^T) Hθ‖^2 + α_2^2 ‖D_d Hθ‖^2 }    (6.47)

However, the tuning of the hyperparameters α and α_2 is easier in the cascade form. The cascade form implementation of the smoothness priors method is used with recursive estimation of evoked potentials in [101] and in the methods that are proposed in Sections 7.1 and 7.2.

In some cases the basis vectors themselves can be smoothed with the smoothness priors method. This is the approach that is used in the simulation method that is proposed in Section 5.3.
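The cascade-form smoothing step (6.46) is easy to implement directly; a minimal numpy sketch (the difference order and the function name are ours):

```python
import numpy as np

def smoothness_priors_smoother(s_hat, alpha2=1.0, d=2):
    """Cascade-form smoothness priors smoothing (6.46): a minimal sketch.

    s_hat  : a (possibly noisy) evoked potential estimate of length M.
    alpha2 : smoothing hyperparameter.
    d      : order of the difference operator D_d (second order by default).
    """
    M = len(s_hat)
    D = np.eye(M)
    for _ in range(d):
        D = np.diff(D, axis=0)                 # builds the d-th order difference matrix D_d
    A = np.eye(M) + alpha2**2 * D.T @ D        # I + alpha_2^2 D_d^T D_d
    return np.linalg.solve(A, s_hat)           # (6.46)
```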

6.3.8 Other single trial estimation methods

An interesting approach is adopted in [17], in which the estimation of the single evoked potentials is based on outlier information. A set of measurements containing EEG both with and without evoked potentials is first modeled with a robust AR model estimator. If the evoked potentials are thought to be outliers in the data, the robust estimator models only the background part of the samples. In the ideal case, the residual of the model would then correspond to the evoked potential in the sample. The performance of the method is further improved in [128].

In [31] it is assumed that the brainstem auditory evoked response is an output of a linear system with an impulse input. In such a case the response of the system to several impulses is the convolution of the single impulse response with the input impulse sequence. In [31] the method of maximum length sequences has been used for the stimulation of brainstem auditory evoked potentials. However, in [188] it is suggested that the auditory system is nonlinear, and a nonlinear system identification method using the same sequences is proposed.

Neural networks have also been used in the analysis of evoked potentials. In [71] a method is proposed for artifact reduction in somatosensory evoked potential analysis.

6.4 Miscellaneous topics

The methods that are reviewed in this section are either applicable to both average and single measurements of evoked potentials, or they do not fit under other topics in a natural way.

6.4.1 Frequency domain methods

In [130] it is emphasized that filtering can also be performed in the frequency domain. This approach has the same limitations as the time domain filtering approach that is discussed in Sections 6.2.4 and 6.3.1. Different frequency domain filtering schemes can be found e.g. in [231, 146, 148]. Frequency domain filtering is applied to the single trial analysis of P300 measurements in [199].

The same argumentation holds also for spectral analysis or for classification using the spectral content of evoked potentials. If the components do not differ spectrally, they cannot be distinguished by linear spectral methods. Spectral methods have been used e.g. in [42, 143, 140, 141, 184].

The use of time-frequency distribution methods, such as the Wigner distribution and wavelets, is even more problematic, since they deal with the temporal spectrum of the signal. The definition of the temporal spectrum is always based on a time window of some length and shape [167]. The use of the temporal spectrum as a measure of the instantaneous frequency content of a signal is reasonable only if the time variation of the second-order statistical properties of the signal is slow. This means that the signal is nearly stationary within the time window. The Wigner distribution is used in [87] and wavelet-based methods e.g. in [12, 111, 16]. Also the method that is presented in [108] is basically a frequency domain filter bank method, although the notion of matched filtering is used.

As discussed in [117], frequency domain methods are mainly applicable to the analysis of two special classes of event-related potentials. The first is event-related synchronization/desynchronization, where the phase of the potential is not locked to the stimuli. The other is the analysis of so-called steady-state potentials that are evoked using continuous stimuli, e.g. temporally modulated light. The analysis of this kind of potential is outside the scope of this thesis.

6.4.2 Parametric modeling

Yet another class of methods that arises from the theory of time series analysis is the parametric modeling of the time series with rational models. The problem with these methods in the context of evoked potentials is that the most common models are time-invariant. Thus the use of these methods relies on tight assumptions about the stationarity of the signal.

In [2] it is assumed that evoked potentials can be modeled as deterministic waveforms with added random fluctuations. These random fluctuations are assumed to be such that they can be modeled as an ARMA process since "...they and spontaneus EEG are basically similar in nature [sic]", and ARMA models are sometimes used for the modeling of the background EEG. Another ARMA model is used for the background. The scalar Kalman filter is then used to predict the evoked potential. The basic assumption in [2] is that the difference between the potential and the average is a stationary process. This is not consistent with [38], where it is found that the variations are larger in the late components than in the early components.

The scalar Kalman filtering scheme is used also in [195], in which an ARMA model is first fitted to the average response. Using Kalman filtering, the evoked potential is then predicted with the noisy measurement as input. The model is capable of taking into account random amplification and shifting of the whole potential, but not variations in the individual peaks.

Another study in which ARMA modeling is used for the average potential is [88]. It is proposed that ARMA modeling can be used for the modeling of the average, since "The frequency domain characteristics of the simulated sequence and the corresponding power spectral density of the ARMA filter were quite close to the periodogram of the original data sequence [sic]". The simulations were created by feeding white noise sequences to the ARMA model. The simulations or the average of the simulations are not inspected in the time domain.

In [82] and [83] an ARMA model is used for the spectrum analysis of the evoked potential. Based on this model, a digital filter is designed to extract the evoked potential from additive noise. As emphasized in [82], "...the proposed method has the advantage that no assumptions about the EP to be found are necessary except the generally accepted fact that the EP is a deterministic, the EEG a stochastic signal [sic]". In [83] it is argued that "The single responses show a great variability of latency and amplitude..." and it is concluded that this is "...demonstrating that the single evoked response is not a stationary signal. [sic]"

Parametric models with an exogenous input are also used. In [30] and [28] an ARX model is used. The underlying model is then that if the background EEG can be modeled as an AR process, then for the evoked potential "...the same process is driven instead by a deterministic signal... [sic]". The use of the average of the responses is suggested in [30]. In [113] the average is replaced with the running average of the latest 20 responses.

A damped sinusoidal model is sometimes used for modeling the evoked potentials. With additive noise we can write for a single measurement

z(t) = Σ_{i=1}^{p} A_i ρ_i^t sin(ω_i t) + v(t)    (6.48)

Estimation of the parameters A_i, ρ_i and ω_i is a nonlinear problem. A well-known approximation method for solving the parameters is Prony's method [94]. It is proposed for use with average evoked potentials in [9]. In [80] it is found that for single trial estimates the signal-to-noise ratio should be about 10 dB. The use of the generalized singular value decomposition with Prony's method is proposed in [63, 79].

6.4.3 Estimation of the peak latencies

The estimation of the peak latencies can sometimes be informative in itself, as emphasized in [25]. The main goal in the estimation can e.g. be to find the histogram of the latencies for some specific peaks such as the P300. The estimates for the peak latencies are also needed for some other estimation methods such as LCA and CCA.

The simplest method for peak latency estimation is called peak picking [190]. It simply means that the maximum or minimum of the signal in some desired time interval is selected to represent the peak location. The signal is sometimes prefiltered [180] with some time domain filter.

Another method is to cross-correlate the signal with a template waveform [58, 70] and to find the maximum of the correlation. The cross-covariance is also used [158]. Other variations are to use normalized cross-correlation functions [15] or the generalized cross-correlation method [166].

In [132] a low-order polynomial is used to approximate the signal in the vicinity of the peak. The fitting of the polynomial is a simple linear least squares problem, and if a second-order polynomial is used, the minimum or maximum is unique and simple to calculate.

Different methods for the estimation of the single trial P300 latencies are compared in [190, 70].
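As an example of the polynomial approach, the following sketch picks the extremum in a window and refines it with a parabola through the three neighbouring samples. The function name and the window handling are ours; the peak is assumed not to lie at the very edge of the record.

```python
import numpy as np

def peak_latency_parabolic(x, t0, t1):
    """Peak latency by peak picking plus second-order polynomial fitting: a sketch.

    x      : sampled waveform.
    t0, t1 : sample indices bounding the search window.
    """
    k = t0 + int(np.argmax(np.abs(x[t0:t1])))                # coarse peak location (peak picking)
    y0, y1, y2 = x[k - 1], x[k], x[k + 1]
    denom = y0 - 2.0 * y1 + y2
    delta = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom   # parabola vertex offset in samples
    return k + delta
```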

6.5 Dynamical estimation of evoked potentials

As discussed in Section 6.1, we refer to the methods that take into account the time variation between the single evoked potentials as dynamical methods. In the dynamical estimation methods the preceding or following estimates are used as information in the estimation of single evoked potentials.

If the set of measurements is assumed to be homogeneous, all measurements can be taken into account with equal weights in the estimation. With a homogeneous set we mean that the measurements can be thought to be sampled from the same joint probability density. In this case the estimation reduces to the methods we have reviewed as single trial estimation methods. The dynamical methods can give additional information about the evoked potentials if our assumption is that the parameters of the joint density of the evoked potentials or the background noise have trend-like variations during the test. Typical situations are e.g. trend-like changes in the latency or amplitude of some specific peak. Changes in the latency are reported in somatosensory recordings e.g. in [22].


6.5.1 Windowed averaging

The most obvious way to handle the time variations between the single measurements is sub-averaging the measurements in groups. Sub-averaging is used e.g. in [22] to demonstrate the decrease of the amplitude in visual evoked potentials. The same optimality criteria hold for sub-averaging as for averaging. Sub-averaging gives optimal estimators if the evoked potentials are assumed to be deterministic within the sub-averaged groups.

Another obvious method for the dynamical estimation of the evoked potentials is the windowed averaging of the measurements. This can also be called sliding window averaging. The estimator then takes the form of a moving average filter of the measurement values at every time lag. In vector form we have

ŝ_t^{MWA} = Σ_{i=0}^{n−1} w_i z_{t−i}    (6.49)

In [205] the term moving window averager (MWA) is used for the sliding window averager with equal weights w_i = 1/n. Another method that is proposed in [205] is the exponentially weighted averager (EWA), in which the weights are of the form

w_i = γ^i / Σ_{j=0}^{n−1} γ^j    (6.50)

for some 0 < γ < 1. It can be shown that the equivalent form of the EWA is then

ŝ_t^{EWA} = γ z_t + (1 − γ) ŝ_{t−1}^{EWA}    (6.51)

Some methods have been proposed in which the single estimates are first processed with single trial estimation methods and the sliding window averaging is then performed. In [49] the adaptive Kalman smoother that is proposed in [48] for single trial estimation is combined with the exponentially weighted averager.
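The recursive form (6.51) is particularly simple to implement; a minimal sketch (the function name and default γ are ours):

```python
import numpy as np

def ewa(Z, gamma=0.1):
    """Exponentially weighted averager, recursive form (6.51): a minimal sketch.

    Z     : (N_trials, M) ensemble of sweeps in presentation order.
    gamma : forgetting factor, 0 < gamma < 1.
    Returns one EWA estimate per trial.
    """
    s = np.zeros(Z.shape[1])
    out = np.empty(Z.shape, dtype=float)
    for t, z in enumerate(Z):
        s = gamma * z + (1.0 - gamma) * s      # (6.51)
        out[t] = s
    return out
```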

6.5.2 Frequency domain filtering

Frequency domain filtering is implemented in [186], in which a low-pass filtering is also applied to each measurement separately. The estimator can then be represented as a two-dimensional separable low-pass filter. The use of a non-separable filter is also straightforward. The implementation of the filtering is easiest in the frequency domain, which is emphasized in [151].

In [232] polynomial fitting is proposed for the tracking of the variations in the spectrum of the evoked potentials; fifth-order orthogonal polynomials are fitted for every frequency lag separately.

6.5.3 Adaptive algorithms

Adaptive algorithms can take the time variations into account in a natural way, as discussed in Section 2.16. Although it would be simple to use the adaptive algorithms in vector form, the most popular way has been to use the algorithms in scalar form, as discussed in Section 2.17.

One of the first approaches that is based on adaptive algorithms is presented in [215]. The evoked potential is assumed to be a deterministic almost periodic signal

s_{t+T} ≈ s_t    (6.52)

This kind of signal can be approximated using the Fourier series expansion. In [215] the harmonic basis functions of the Fourier series are adaptively fitted to the evoked potential data using the LMS algorithm. The regressor vector is then of the form

ϕ_t = (sin(ω_0 t), sin(2ω_0 t), ..., cos(ω_0 t), cos(2ω_0 t), ...)    (6.53)

The method can track slow changes in evoked potentials, and it is used for monitoring the response to medication during neurosurgical procedures in [205]. The effect of slight nonperiodicity is studied in [89].

In [106] the periodicity assumption is used again. The model for the signal is a deterministic evoked potential in noise, and the reference input is an impulse function synchronized with the stimulus. The signal is thus modeled as a linear combination of delayed impulses, that is, with the natural basis as the observation model. As is discussed in Section 6.2.1, the least squares solution with this kind of observation model H = I reduces to ensemble averaging. Since the algorithm that is used in [106] is the LMS algorithm and the weight update is performed only once during the period of M scalar samples of one trial, the estimate corresponds to sliding window averaging with adaptive weight adjustment.

Two adaptive methods are presented in [200]. One is to use the measurement z_t as the reference signal and the exponentially weighted average ŝ_{t−1}^{EWA} as the primary input. The other method is to use ŝ_t^{EWA} in place of the reference. In the first approach the model is then to present the sliding average as a linear combination of the delayed versions of one measurement. The other model is to use delayed sliding averages as a basis for another sliding average.

In [238] the steepest descent method is used directly for the estimation of the time-varying Wiener filter matrix.

In [104] the LMS is used for the estimation of the delay between two identical potentials embedded in noise. The method is proposed for the tracking of abnormal delays in somatosensory potentials if a normal SEP sample is available.

In [202] the use of the LMS is extended to filter banks. The method is proposed for the tracking of time-varying evoked potentials.

6.5.4 Recursive mean square estimation of evoked potentials

A possible approach is to assume that the evoked potential is a vector-valued random process with slow variations between the repetitions of the test. It can then be modeled with a state-space model. The recursive mean square estimate for the state is then given by the Kalman filter.

Let the observation model be linear and time-invariant. Assume that the time variations of the state vector can be modeled with the so-called random-walk model. Then the state-space equations are of the form

θ_{t+1} = θ_t + w_t    (6.54)
z_t = H θ_t + v_t    (6.55)

where v_t and w_t are random vectors. As discussed in Section 2.16, the recursive mean square estimate for θ_t is given by the Kalman filter. The Kalman filter is the recursive linear mean square estimator for θ_t using the previous estimate as the prior information. With the model (6.54–6.55) the Kalman filter takes the form

K_t = P_t H^T (H P_t H^T + C_v)^{-1}    (6.56)
P_{t+1} = (I − K_t H) P_t + C_w    (6.57)
θ̂_{t+1} = θ̂_t + K_t (z_t − H θ̂_t)    (6.58)

where K_t is the Kalman gain and θ̂_t is a time-varying estimate for the parameters θ_t.

With this approach several possibilities exist for the selection of the observation model. With the trivial choice H = I the parameter vector θ = s equals the evoked potential. Some generic basis selection is also possible, such as the Gaussian basis. Another possibility is to use the principal component approach, so that H = K_S. The recursive estimate for s then takes the form

K_t = P_t K_S^T (K_S P_t K_S^T + C_v)^{-1}    (6.59)
P_{t+1} = (I − K_t K_S) P_t + C_w    (6.60)
θ̂_{t+1} = θ̂_t + K_t (z_t − K_S θ̂_t)    (6.61)
ŝ_{t+1} = K_S θ̂_{t+1}    (6.62)

The smoothness priors method can be combined with this estimator. This approach is used in [101] and a systematic method is proposed in Section 7.2.
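A compact numpy sketch of the recursion (6.56)–(6.58), applied sweep by sweep, is given below. The function name, the initialization and the argument layout are our own choices; setting H = K_S gives the principal component form (6.59)–(6.62).

```python
import numpy as np

def kalman_track(Z, H, Cv, Cw, theta0=None, P0=None):
    """Random-walk Kalman recursion (6.56)-(6.58) over an ensemble: a minimal sketch.

    Z  : (N_trials, M) measurements z_t, one sweep per row.
    H  : (M, p) observation matrix.
    Cv : (M, M) measurement noise covariance.
    Cw : (p, p) state (random-walk) noise covariance.
    Returns the sequence of estimates s_hat_t = H theta_hat_t.
    """
    M, p = H.shape
    theta = np.zeros(p) if theta0 is None else theta0
    P = np.eye(p) if P0 is None else P0
    s_hats = []
    for z in Z:
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Cv)   # (6.56) Kalman gain
        P = (np.eye(p) - K @ H) @ P + Cw                # (6.57) covariance update
        theta = theta + K @ (z - H @ theta)             # (6.58) state update
        s_hats.append(H @ theta)
    return np.array(s_hats)
```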

6.5.5 Other dynamical estimation methods

In [206] the running average of the wavelet transform of the running average of the measurements is used to track trends in evoked potentials.

In [105] it is assumed that the deviations of the evoked potentials from the mean are a continuous function of some external measurable condition. Based on this, it is proposed that a nonparametric regression method, iterative kernel estimation, could be used for the estimation of single trials of evoked potentials. The exact mathematical formulation of the method is unpublished.

6.6 Discussion

A wide range of methods has been proposed for evoked potential analysis during the last decades. Some of the methods are proposed for the estimation of the properties of the whole data set and some for the estimation of single evoked potentials. Some of the methods try to model the time variations between the single trials.

Several methods that are proposed for the estimation of the ensemble properties are optimal under some specific conditions. These conditions are often not very realistic. For example, Woody's method is optimal for a deterministic waveform with random latency. If more realistic assumptions are used, the information may no longer be representable with ensemble parameters. For example, the assumptions of LCA are realistic, but the result of LCA is not a good estimator for any physical waveform. The conventional average is often still required for historical reasons to make clinical studies comparable. It seems that different methods for "enhancing" the conventional average often yield estimators that still are not good estimators for anything. In this sense these methods are not very often applicable.

There are not many methods for the single trial estimation of evoked potentials. It seems that the most reliable results can be obtained with the minimum mean square estimation methods. We have reviewed here two types of mean square estimation methods: the time-varying Wiener filtering based approach and the observation model based approach. The origin of both approaches is the same. However, in the model based approach it is easier to incorporate prior information into the estimation procedure. In particular, we have reviewed different methods that take into account information about the ensemble statistics and the smoothness assumptions about the underlying evoked potentials. We believe that the model based approach will be more popular in the future. The extension of these models to several channels and realistic observation models is straightforward.

The model based approach to mean square estimation is easy to extend to time-varying situations. The optimal recursive estimate is then the Kalman filter. In Section 6.5.4 we proposed a method in which the principal component regression is combined with the Kalman filter. The proposed method is the only published method in which a general purpose adaptive algorithm is used for the estimation of evoked potentials in vector form. Kalman filtering also allows the use of more realistic models for evoked potential fluctuations than the random walk model. We believe that these approaches are worth further investigation.


CHAPTER
VII
Two methods for evoked potential estimation

In this chapter we propose two systematic methods for the estimation of evoked potentials. The basis of these methods has been discussed in Sections 6.3.5, 6.3.6, 6.3.7 and 6.5.4. The first method is applicable to single trial analysis and the other to the dynamical estimation of evoked potentials. The proposed methods are evaluated using both simulated data and real measurements. Although the proposed methods are widely applicable to different types of evoked potentials, the simulations and the real measurements here are based on the P300 potential. The P300 is one of the most studied responses of the human brain. The P300 test is performed using the so-called oddball paradigm. Two kinds of stimuli, the standard and the target, are used in stimulation. The test person is asked to perform some task when the target stimulus occurs. For definitions and significance of the P300 potential see e.g. [198, 159, 160, 164, 135, 58, 163].

7.1 A systematic method for single trial estimation

In this section we propose a systematic method for the single trial estimation of evoked potentials. The method is based on subspace regularization with a generic choice for the observation model. The final smoothing of the estimates is based on the smoothness priors approach. The proposed method is as follows:

1. Measure a set of noisy evoked potentials z = (z_1, ..., z_Nz) and background EEG v = (v_1, ..., v_Nv), where z_i and v_i are column vectors. The background measurements v_i are typically measured before the stimulus. The time between the repetitions should be random and long enough to prevent the late potentials from corrupting the background estimate and to avoid the locking of the background to the stimulus.

2. Calculate C_v ≈ N_v^{-1} Σ_i v_i v_i^T. This is an estimate for the background covariance. Other methods are also applicable for the estimation of the covariance. The background can be modeled e.g. as an AR model and the covariance can then be calculated using the spectrum of the model.

3. Form the matrix H in some generic way. A set of Gaussian shaped vectors with different delays and a pre-selected shape is a suitable choice. If the measurements have a trend, it is possible to include the constant and first order polynomial basis vectors as columns of H.

4. Calculate R_z ≈ N_z^{-1} Σ_i z_i z_i^T. This is an estimate for the correlation matrix of the measurements. Note that when the correlation matrix is used here, the mean of the evoked potentials is modeled automatically as the first eigenvector of the correlation matrix. The use of the covariance matrix is also possible, but the mean of the measurements then has to be included explicitly in the equations of the estimates.

5. Solve the ordinary eigendecomposition R_z U = U Λ of the correlation matrix. The solution of only the principal eigenspace is also possible.

6. Form K_S = (u_1, ..., u_p), where u_i are eigenvectors of R_z. The common choice is to use the p eigenvectors associated with the p largest eigenvalues.

7. Form the matrix D_d. Usually the second or third order difference matrices are used.

8. Calculate the estimate ŝ for the evoked potentials with

   ŝ = (I + α_2^2 D_d^T D_d)^{-1} H (H^T C_v^{-1} H + α^2 H^T (I − K_S K_S^T) H)^{-1} H^T C_v^{-1} z    (7.1)

   The selection of α and α_2 can be based on some posterior rule as discussed in Section 3.5. Another possibility is to select the parameters experimentally. For example, the parameter α_2 can be based on visual inspection of the smoothing of the eigenvectors. One possibility to select the parameter α is to make realistic simulations and inspect the estimation error as a function of the regularization parameters. A sketch of the whole procedure is given below.
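As a concrete illustration, the following sketch collects steps 2-8 into a single routine. It is a minimal sketch in Python/NumPy under assumptions of our own (the variable names, the dense eigendecomposition, the helper for the difference matrix, and an invertible sample covariance C_v), not a definitive implementation of the method.

```python
import numpy as np

def difference_matrix(n, order=2):
    """d-th order difference matrix D_d of size (n - order) x n."""
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)
    return D

def single_trial_estimates(z, v, H, p=3, alpha=0.01, alpha2=10.0, d_order=2):
    """Subspace-regularized single trial estimates, eq. (7.1).

    z : (N, Nz) noisy evoked potentials as columns
    v : (N, Nv) pre-stimulus background EEG sweeps as columns
    H : (N, k)  generic observation model (e.g. a Gaussian basis)
    """
    N = z.shape[0]
    Cv = v @ v.T / v.shape[1]                  # step 2: background covariance estimate
    Rz = z @ z.T / z.shape[1]                  # step 4: correlation matrix of the data
    eigval, U = np.linalg.eigh(Rz)             # step 5: eigendecomposition of R_z
    Ks = U[:, np.argsort(eigval)[::-1][:p]]    # step 6: p principal eigenvectors
    Dd = difference_matrix(N, d_order)         # step 7: difference matrix D_d
    Cvi = np.linalg.inv(Cv)                    # assumes C_v is well conditioned
    Proj = np.eye(N) - Ks @ Ks.T               # projector onto the complement of S
    M = np.linalg.inv(H.T @ Cvi @ H + alpha**2 * (H.T @ Proj @ H))
    Smooth = np.linalg.inv(np.eye(N) + alpha2**2 * (Dd.T @ Dd))
    return Smooth @ H @ M @ H.T @ Cvi @ z      # step 8: eq. (7.1), one column per trial
```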

This systematic method is based on the subspace regularization approach. The observation model is based on a generic basis selection. The selection of the basis vectors can vary, but the idea is that the basis is quite general, that is, several different types of measurements can be modeled with it.

In subspace regularization the solution is regularized towards the null space of the regularization matrix K_S. Our approach is to include all the prior information in this part of the estimation. There are several other possibilities to select the matrix K_S. The selection of the subspace S is here based purely on the observations, that is, on the dependent variable. It could also be based on independent variables. This kind of approach is used in [211, 96], in which possible measurements are first simulated and K_S is formed as the principal eigenspace of this set of vectors. The selection could also be based on both independent and dependent variables, as in the partial least squares method.

In the form (7.1) the background EEG is assumed to be Gaussian, but no stationarity is assumed. In the stationary case the matrix C_v would be a Toeplitz matrix. One possibility to take the nonstationarity of the background EEG into account is to model the background before and after the stimulation as an AR model. If the model parameters differ, it is possible to model the background as a time-varying AR process with a smooth transition of the parameters from one state to another. A method for this is proposed in [96].

The selection of the subspace dimension p and of the number of basis vectors in H are also regularization parameter selection problems, as emphasized in [218]. The selection can be based on some posterior rule. It is possible to formulate the whole problem as a fully empirical Bayesian problem, in which all regularization parameters are treated as hyperparameters with a suitable prior distribution. The selection of the hyperparameters can then be based on the likelihood of the data given the hyperparameters.

7.2 A systematic method for dynamical estimation

In this section we propose a systematic method for the recursive mean square estimation of single evoked potentials. The method is based on the state space representation of the evolution of the evoked potentials and the use of Kalman filtering. Principal component regression is used for the formulation of the observation model. The final smoothing of the estimates is based on the smoothness priors assumption. The proposed method is as follows:

1. Measure a set of noisy evoked potentials z = (z_1, ..., z_Nz) and background EEG v = (v_1, ..., v_Nv), where z_i and v_i are column vectors. The background measurements v_i are typically measured before the stimulus. The time between the repetitions should be random and long enough to prevent the late potentials from corrupting the background estimate and to avoid the locking of the background to the stimulus.

2. Calculate C_v ≈ N_v^{-1} Σ_i v_i v_i^T. This is an estimate for the background covariance. The background can also be modeled e.g. as an AR model and the covariance can then be calculated using the spectrum of the model.

3. Calculate R_z ≈ N_z^{-1} Σ_i z_i z_i^T. This is an estimate for the correlation matrix of the measurements. Another possibility is to use the covariance matrix. The mean of the measurements can then be included in the estimation e.g. by adding it to the observation matrix as a column.

4. Solve the ordinary eigendecomposition R_z U = U Λ of the correlation matrix. The solution of only the principal eigenspace is also possible with some suitable method.

5. Form K_S = (u_1, ..., u_p), where u_i are eigenvectors of R_z. Commonly the p eigenvectors associated with the p largest eigenvalues are used.

6. Form the matrix D_d and fix α_2. Usually the second or third order difference matrices are used. The selection of α_2 can be based on some posterior rule.

7. Select θ̂_0, the initial estimate for the parameters. It is possible to use the average of the least squares estimates as the initial parameter vector.

8. Select P_0, the initial estimate for the estimation error covariance. The selection of this can be based on experience.

9. Select C_w, the covariance of the error in the state update. This is the variable that controls the variance and the bias (tracking speed) of the estimation. It is possible to use experience and visual inspection to select e.g. a diagonal covariance.

10. Calculate the estimate ŝ_{2,t} for the evoked potentials using the Kalman filter and the smoothness priors approach:

    K_t = P_t K_S^T (K_S P_t K_S^T + C_v)^{-1}    (7.2)
    P_{t+1} = (I − K_t K_S) P_t + C_w    (7.3)
    θ̂_{t+1} = θ̂_t + K_t (z_t − K_S θ̂_t)    (7.4)
    ŝ_{t+1} = K_S θ̂_{t+1}    (7.5)
    ŝ_{2,(t+1)} = (I + α_2^2 D_d^T D_d)^{-1} ŝ_{t+1}    (7.6)

    A sketch of this recursion is given below.
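The following is a minimal sketch of the recursion (7.2)-(7.6), again in Python/NumPy and under assumptions of our own (the function and variable names, and the observation matrix being passed as a parameter so that the same routine can also be reused with the generic basis H of case study 4). It is an illustration, not a definitive implementation.

```python
import numpy as np

def kalman_ep(z, Hobs, Cv, Cw, theta0, P0, Dd, alpha2):
    """Random-walk Kalman filter with smoothness priors smoothing, eqs. (7.2)-(7.6).

    z    : (N, Nz) measurements as columns, in their temporal order
    Hobs : (N, p)  observation matrix (K_S in this section)
    Returns the smoothed estimates and the final state and error covariance.
    """
    N, p = Hobs.shape
    Smooth = np.linalg.inv(np.eye(N) + alpha2**2 * (Dd.T @ Dd))   # eq. (7.6) operator
    theta, P = theta0.copy(), P0.copy()
    s2 = np.zeros_like(z, dtype=float)
    for t in range(z.shape[1]):
        K = P @ Hobs.T @ np.linalg.inv(Hobs @ P @ Hobs.T + Cv)    # eq. (7.2) gain
        P = (np.eye(p) - K @ Hobs) @ P + Cw                       # eq. (7.3) covariance
        theta = theta + K @ (z[:, t] - Hobs @ theta)              # eq. (7.4) state update
        s2[:, t] = Smooth @ (Hobs @ theta)                        # eqs. (7.5)-(7.6)
    return s2, theta, P
```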

This systematic method is based on the use of the Kalman filter and principal component regression. The observation model is then based on the statistics of the set of measurements. The selection of the basis vectors can also be based on some other strategy. They can be selected e.g. to be of some generic form, or by any strategy discussed in Section 3.3.

Also in this method the background EEG is assumed to be Gaussian, but no stationarity is assumed. If the covariance of the background is not estimated, the covariance can be selected to be of the form C_v = σ_v^2 I. The selection of the covariance can also be based on experience and visual inspection. The selection of the covariance C_w is more important in this method. If the covariance is tuned to be too small, the estimates are biased towards the previous estimates. If the covariance is selected to be too large, the estimates have too much variance. Usually C_w is selected to be of the form C_w = σ_w^2 I and the selection of the variance term σ_w^2 is based on experience. It is also possible to use the so-called state-space identification methods, one of the first of which is proposed in [19]. In these the covariance terms are treated as hyperparameters, and the selection is based on e.g. output-error least squares estimation of the parameters.

The effect of the initial values θ̂_0 and P_0 can be removed by using the algorithm first backwards, with the data in reversed order. The last estimates of the backward run can then be used as the initial estimates in the forward run, as in the sketch below.
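A hedged illustration of this initialization, reusing the kalman_ep sketch above (it assumes the variables of that sketch are already defined; the rough starting values theta0 and P0 are placeholders):

```python
# Backward pass over the trials in reversed order, started from rough guesses.
_, theta_b, P_b = kalman_ep(z[:, ::-1], Ks, Cv, Cw, theta0, P0, Dd, alpha2)
# Forward pass, initialised with the final state of the backward run.
s2, _, _ = kalman_ep(z, Ks, Cv, Cw, theta_b, P_b, Dd, alpha2)
```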

If the measurements do not depend on the previous measurements, the problem reduces to single trial estimation and the method proposed in Section 7.1 should be used instead of the dynamical method.

7.3 The data sets

For the evaluation of the proposed methods we form three sets of data. Two data sets are formed using the simulation methods introduced in Sections 5.2 and 5.3. These data sets are referred to as data set #1 and data set #2, respectively. In addition, the evoked potentials measured from three test persons were used as data. The measurements are named 878p, 1173p and 958p. The simulations are based on the measurement 878p, which is shown together with the simulations in Fig. 7.1. The number of measurements is between 76 and 80, and the number of simulations is 80 in both sets.

Figure 7.1: The measurement 878p (top left) and the mean of the measurements (top right), data set #1 (middle left) and data set #2 (bottom left) with corresponding noiseless simulations (middle right and bottom right, respectively). The vertical axis is in microvolts and the horizontal axis in points. Note the different scale in data set #1. The positive axis is downwards.

The different nature of the simulation methods is again clearly visible in Fig. 7.1. The background activity is measured before the stimulus in the case of the measurement 878p. In the case of the data sets #1 and #2, realizations of an AR process are used as background, as explained in Section 5.1. In the case of data set #1 the latency of the positive peak is uniformly distributed in the interval 66–110. In the case of data set #2, p = 4 eigenvectors are used in the simulation.

7.4 Case study 1

In the first study the data set #1 is used. The observation model is selected to be a generic set of Gaussian shaped functions, not intended to be optimal in any sense:

H = (ψ_1, ..., ψ_k)    (7.7)

where

ψ_i = exp( −(t − τ_i)^2 / (2d^2) )    (7.8)

The width parameter is d = 10 and k = 20 basis functions are used. The basis functions, that is, the columns of the matrix H, are plotted in Fig. 7.2; a construction sketch is given below.
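A short sketch of this basis follows. The signal length and the even spacing of the delays τ_i are assumptions made for illustration only, since the text fixes just the width d = 10 and the number of functions k = 20.

```python
import numpy as np

N, k, d = 128, 20, 10.0                       # N is an assumed signal length
t = np.arange(N)
tau = np.linspace(0, N - 1, k)                # assumed evenly spaced delays tau_i
H = np.exp(-((t[:, None] - tau[None, :])**2) / (2 * d**2))   # columns psi_1..psi_k, eq. (7.8)
```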

Figure 7.2: The basis of the observation model in case study 1.

The evoked potentials are then estimated using the systematic estimation method proposed in Section 7.1. Different values are used for the regularization parameter α and for the dimension p of the regularizing subspace. The norm of the estimation error between the true (simulated) and the estimated evoked potential is then calculated. The estimation error norm is shown in Fig. 7.3.

The error norm is seen to have a minimum with respect to both the number of basis functions and the degree of regularization. This means that the regularization diminishes the error compared to ordinary Gauss–Markov estimation, which corresponds to the value α = 0. The proper selection of both parameters is thus important. The over-estimation of p is called over-fitting. If the number of basis vectors is too high, the model tends to fit also the noise in the observations. If α goes to infinity, the method equals the principal component regression discussed in Section 3.3. In this case subspace regularization yields a smaller error than the principal component regression.

Figure 7.3: The estimation error norm in case study 1 as a function of the regularization parameter α with different values of p (top left) and as a function of p with different values of α (top right). The same error norm as a gray scale image and a contour plot (bottom left and bottom right, respectively) as a function of α (vertical, 10-base logarithmic) and p (horizontal).

The mean squared error is seen to be at its minimum with p = 2 or p = 3, and the optimal value for α is between 10^{-2} and 10^{-1}. The value p = 3 is selected for the dimension of the regularizing subspace. The columns of the matrix K_S in this case are plotted in Fig. 7.4. The structure of the basis vectors enables the approximation of the mean of the potentials and of the variations in the positive peak. This can be easily seen by considering different linear combinations of these vectors.

Figure 7.4: The columns of the matrix K_S with p = 3. The first, second and third eigenvectors (bold, medium and thin lines, respectively).

The evoked potentials are then estimated from the simulated noisy measurements of data set #1 using the method proposed in Section 7.1. The regularization parameter values α = 0.01 and α_2 = 10 are used. In Fig. 7.5 eight single trial estimates are shown together with the noiseless simulations and the simulated measurements. The effect of the regularization is clearly visible in estimates 2 and 3 (numbering from left to right, top to bottom). The method tends to minimize the error between the true potential and the estimate in the vicinity of the second negative peak, although there are great variations in the data in this interval. The opposite effect is seen in estimate 7, in which the estimate tends to minimize the residual. These single trial estimates can further be used e.g. in source estimation.

As an example we also localize the maximum of the positive peak (latency) in the single estimated potentials. For that we use the method introduced in [132] and discussed in Section 6.4.3. The latency estimation is based on fitting a second order polynomial to the data in the vicinity of the absolute maximum of the estimate; the width of the fitting window is 40 points. A sketch of this estimator is given below. The latency estimation is carried out for the estimated evoked potentials as well as for the noiseless simulated potentials. The latency of the estimated potentials is shown as a function of the latency of the noiseless simulations in Fig. 7.6. The errors in latency estimation are seen to be homogeneous throughout the simulations.
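A minimal sketch of this latency estimator follows. The sub-sample refinement via the vertex of the fitted parabola and the handling of the window edges are our reading of the description of [132], not a verbatim implementation.

```python
import numpy as np

def peak_latency(s, window=40):
    """Latency of the dominant peak of one estimate s (1-D array, in samples)."""
    k = int(np.argmax(s))                          # location of the absolute maximum
    lo, hi = max(0, k - window // 2), min(len(s), k + window // 2)
    t = np.arange(lo, hi)
    a, b, _ = np.polyfit(t, s[lo:hi], 2)           # second order polynomial fit
    return -b / (2.0 * a) if a != 0 else float(k)  # vertex of the fitted parabola
```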

In Fig. 7.7 the mean and the standard deviations of the latencies are shown together with the histograms of the latencies for the estimated and the simulated potentials. Both histograms show that the estimated latencies are distributed uniformly over the interval in which the peak maxima lie by construction.


Figure 7.5: Eight randomly selected single estimates (dotted) with simulations with (solid rough) and without (solid smooth) noise in case study 1.


Figure 7.6: The latency of the estimated evoked potentials as a function of the latency of the noiseless simulations (+) in case study 1. The straight line is the line with slope 1 that goes through the origin (the ideal case).


Figure 7.7: The mean and the standard deviations (vertical lines) of the latencies of the positive peak for the noiseless simulations with the average of simulated measurements (top left) and the histogram of the latencies (top right) in case study 1. The mean and the standard deviations (vertical lines) of the latencies of the positive peak for the estimated evoked potentials with the average of simulated measurements (bottom left) and the histogram of the latencies (bottom right).


7.5 Case study 2

The procedure of Section 7.4 is repeated for data set #2. The mean square errors are shown in Fig. 7.8. The minimum of the error norm is no longer as clear as in case study 1. This is due to the structure of the joint statistics of the simulations. The simulated evoked potentials are generated so that their joint statistics are strictly Gaussian. For this type of data the optimal α would be infinite and the method would reduce to the principal component regression if the eigenspace of the evoked potentials were known. However, because of the noise term, the principal eigenspace of the simulated measurements, which is used in regularization, does not equal the principal eigenspace of the simulated evoked potentials.

Figure 7.8: As in Fig. 7.3 but for case study 2.

The data set #2 is also used for the evaluation of the generalized cross-validation criterion (3.43) for the selection of the parameters p and α. The generalized cross-validation criterion is calculated for the whole data set; thus the residual norm in the numerator of (3.43) is replaced with the stacked residual of the whole data set. The criterion is shown in Fig. 7.9. It is seen that the criterion does not have a clear minimum as a function of either of the parameters p or α. This is not surprising in view of the analysis carried out in [218], in which it is emphasized that the theory of GCV is asymptotic and the criterion may not have a minimum for small data sets. However, when compared to Fig. 7.8 it is seen that the criterion has a corner in the vicinity of the optimal parameter values that are obtained using the mean square error. We therefore use this corner in the selection of the parameters in this thesis.
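For reference, a hedged sketch of the criterion as it is used here is given below. Equation (3.43) is not reproduced in this chapter, so the standard GCV form for a linear estimator A (stacked residual norm over the squared trace of I − A, up to a normalizing constant that does not affect the minimizer) is assumed.

```python
import numpy as np

def gcv(A, z):
    """Generalized cross-validation score for a linear estimator s_hat = A z.

    A : (N, N) influence matrix of the estimator
    z : (N, Nz) all measurements stacked as columns
    """
    R = np.eye(A.shape[0]) - A
    residual = np.linalg.norm(R @ z)**2        # stacked residual of the whole data set
    return residual / np.trace(R)**2
```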


Figure 7.9: The generalized cross-validation criterion in case study 2 as a function of the logarithm of the regularization parameter α (top left) and of p (top right). The same criterion as a gray scale image and a contour plot (bottom right and bottom left, respectively) as a function of α (vertical, logarithmic) and p (horizontal).

The optimal regularization parameters in this case are seen to be approximately p = 5 and α = 0.05. The evoked potentials are estimated with the estimation method proposed in Section 7.1 using these values as parameters. The smoothing parameter was α_2 = 10. In Fig. 7.11 eight randomly selected estimates are shown together with the simulated noiseless evoked potentials and the simulated measurements. It is again seen that the estimate is a compromise between the data and the prior assumptions. The estimate rejects clear disturbances, as seen in estimate 3. However, there are now no peaks in the simulations by construction. This means that the question about e.g. the latency or amplitude of some specific peak may not be unique at all.

The latencies of the positive peak are again extracted using the same method as in Section 7.4. Latency estimation is carried out for the estimated evoked potentials as well as for the noiseless simulations, since in this case we do not by construction have any information about the latencies that correspond to the simulations. The latency of the estimate is shown as a function of the latency of the noiseless simulation in Fig. 7.10. The error in the latency estimates is homogeneous except for some outliers. The latency distribution is clearly not homogeneous in this case.

Figure 7.10: As in Fig. 7.6 but for case study 2.

In Fig. 7.12 the mean and the standard deviations of the latencies are shown together with the histograms of the latencies for the estimated and the simulated potentials. It is seen that the histograms are skew but similar. It is evident that these latency distributions are consistent with the real evoked potential measurements, since the simulations were generated using a realistic second order approximation of the joint distribution of the measured evoked potentials. In this case it is clear that the mean of the latencies does not necessarily correspond to the latency of the mean of the measurements. Here, e.g., the median of the latencies is closer to the latency of the mean, as seen in Fig. 7.13.


Figure 7.11: Eight randomly selected single estimates (dotted) and simulations with (solid rough) and without (solid smooth) noise in case study 2.


Figure 7.12: As in Fig. 7.7 but for case study 2.

Figure 7.13: The median (dotted vertical) and the mean (solid vertical) of the latencies of the positive peak of the estimated evoked potentials with the mean of simulated measurements in case study 2.


7.6 Case study 3

The estimation procedure is now applied to the real evoked potential measurements 878p, 1173p and 958p. In this case the selection of the regularization parameters cannot be based on the true mean square error, since we cannot compare the estimated evoked potentials with the true ones. The parameter selection can again be based on the GCV criterion. The criterion is calculated for the measurements 878p and the result is shown in Fig. 7.15. The criterion is seen to have a corner approximately at p = 5 and α = 0.05. These are also the values selected in case study 2, in which the simulated data were formed using real measurements. It is thus evident that these values are suitable also for these measurements. The values p = 5 and α = 0.05 are therefore selected as the regularization parameters. The same parameters are used for all measurement sets. The single trial estimates are then calculated. In Fig. 7.14 eight randomly selected estimates are shown together with the measurements 878p. It is again seen that the estimates reject clear disturbances.

The detection of the positive peak is then carried out as before. The results are shown in Figs. 7.16–7.18.

With all measurement sets the estimated latency of the third peak (P300) has a more or less nonsymmetric histogram. This is natural, since it is to be expected that the latency of this component is closely related (correlated) to the latencies of the other components. It is even probable that the components have a causal relationship. A more interesting observation is that the distribution of the latency is not always time-independent. In the case of the measurements 1173p and 878p the variance of the latency of the P300 peak grows during the test. This means e.g. that the average is not a good estimator for any true potential of this kind. Thus we stress that in this kind of a test, single trial estimation should always be done prior to any other estimation method. The trend in the variance of the latencies is in this case more interesting than the absolute error. In the case of the measurement 958p there seems to be variation in the latency although no clear trend is visible. In the averaging of such data the shape of the average potential will be the convolution of the shape of the underlying peak and the probability density of the latency of the peak, provided that the shape of the peak is independent of the latency. This means that the amplitude information in the peak depends on the variance of the latency. In such a case the method that is evaluated here can be used for the design of the deconvolution filter, as discussed in Section 6.2.5.

Figure 7.14: Eight randomly selected single estimates (dotted) with the real measurements 878p (solid).

The single trials are also estimated with the regularization parameter α = 0 for the measurements 878p. The estimate then corresponds to the Gauss–Markov estimate. Note that the covariance of the background EEG is still taken into account in this case. The modeling of the measurements as linear combinations of Gaussian basis functions also smooths the estimates. In Fig. 7.19 eight estimates are shown together with the measurement 878p. It is seen that the estimates minimize the residual homogeneously. If the degree of smoothing is increased, the estimates tend to smooth also the true peaks. This property is common to all time-invariant filtering methods. The latency estimation is also carried out for these smoothed estimates and the results are shown in Fig. 7.20. It is seen that the trend in the variance of the latencies is no longer visible.


Figure 7.15: As in Fig. 7.9 but for the measurement 878p.

Figure 7.16: The mean and the standard deviations (vertical lines) of the latencies of the positive peak of the estimated potentials with the mean of simulated measurements (top left) and the histogram of the latencies (top right) in case 3 for the measurement 878p. The median of the latencies (vertical lines) with the mean of simulated measurements (bottom left) and the estimated latencies as a function of the number of the simulation.


Figure 7.17: As in Fig. 7.16 but for the measurement 1173p.

Figure 7.18: As in Fig. 7.16 but for the measurement 958p.


Figure 7.19: Eight randomly selected single estimates using α = 0 (dotted) with the real measurement 878p (solid).


Figure 7.20: The mean and the standard deviations (vertical lines) of the latencies of the positive peak of the Gauss–Markov estimated potentials together with the mean of simulated measurements (top left) and the histogram of the latencies (top right) for the measurement 878p. The median of the latencies (vertical line) with the mean of simulated measurements (bottom left) and the estimated latencies as a function of the number of the measurement.
the measurement.


7.7 Case study 4

The dynamical estimation method introduced in Section 7.2 is evaluated in this section. The simulations of data set #1 are sorted according to the latency of the third peak. By this construction we obtain a data set in which trend-like variation of the evoked potentials is guaranteed. The sorted data set is shown in Fig. 7.21.

Figure 7.21: The sorted noiseless simulations as a waterfall plot (top left) and as a gray scale image (top right). The sorted simulations with noise as a waterfall plot (bottom left) and as a gray scale image (bottom right).

The systematic estimation method introduced in Section 7.2 is then applied to the estimation of the single evoked potentials. In the first case the observation matrix H is selected as in Section 7.4. The columns of H are plotted in Fig. 7.2. The initial conditions and the other parameters are selected for simplicity so that

θ̂_0 = E{ (H^T H)^{-1} H^T z }    (7.9)
C_w = 0.01 I    (7.10)
C_v = 0.1 I    (7.11)
P_0 = 0.1 I    (7.12)

where the notation E{·} means the average over the vectors. The dimension of the problem is thus 20, the number of basis vectors in H. The mean of the least squares solutions has been used as the initial parameter estimate. The other values are selected so that the tracking performance of the algorithm is visually acceptable; more systematic ways are discussed in Section 7.2. The value α_2 = 10 is used as the smoothing parameter. A sketch of this setup is given below. The results of the estimation are shown in Fig. 7.22.
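The following sketch illustrates this setup, assuming the kalman_ep routine of Section 7.2 and the Gaussian basis H of case study 1 are available; the sorting key and the helper names are illustrative assumptions.

```python
import numpy as np

# Sort the trials by the latency of the third peak of the noiseless simulations
# (latencies_third_peak is an assumed, precomputed vector of those latencies).
order = np.argsort(latencies_third_peak)
z_sorted = z[:, order]

theta0 = np.mean(np.linalg.pinv(H) @ z_sorted, axis=1)   # eq. (7.9): mean of LS solutions
Cw = 0.01 * np.eye(H.shape[1])                           # eq. (7.10)
Cv = 0.10 * np.eye(H.shape[0])                           # eq. (7.11)
P0 = 0.10 * np.eye(H.shape[1])                           # eq. (7.12)

s2, _, _ = kalman_ep(z_sorted, H, Cv, Cw, theta0, P0, Dd, alpha2=10.0)
```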

It is seen that the tracking performance of the algorithm is comparable to the performance of the non-dynamical method in case study 1. However, in this method the observation model is generic and no prior information is implemented in the observation model. The assumption of slow variation, that is, the random walk model, serves as the prior information in this case.

Figure 7.22: The results of the dynamical estimation with the generic Gaussian observation model H. The true latencies (solid) and the estimated latencies (+) as functions of the observation (top left). The estimated latencies as a function of the true latencies (top right). The estimates as a waterfall plot (bottom left) and as a gray scale image (bottom right).
waterfall plot (bottom left) <strong>and</strong> as a gray scale image (bottom right).<br />

The estimation procedure is repeated using the principal component observation<br />

model as proposed in Section 7.2. The subspace dimension is selected to be<br />

p = 3. The other initial values <strong>and</strong> parameters are the same as in (7.9–7.12). The<br />

results are shown in Fig. 7.23.<br />

The dimension of the problem is thus 3, the number of basis vectors in K S .<br />

The results in Fig. 7.22 can be compared to the results in 7.23. It is seen that<br />

the implementation of prior in<strong>for</strong>mation in <strong>for</strong>m of subspace constraint reduces<br />

the errors in latency estimates. In both cases the first estimate can be seen to be<br />

inconsistent with rest of the estimates. The more accurate selection rule than the<br />

mean is proposed in 7.2. However with this selection it can be seen that the effect<br />

of the initial estimates is negligible after few estimates.


Figure 7.23: As in Fig. 7.22 but with the principal component observation model K_S.

When the variance of the estimates is tuned to be small, the bias (tracking error) of the estimates increases. This is clearly seen in Fig. 7.23. The bias can be reduced by using the estimation method in the forward and backward directions and then taking the averages of the estimates. A more accurate approach is to use the so-called Kalman smoothing methods instead of Kalman filtering in the estimation [154, 155].

7.8 Discussion

The proposed systematic method for single trial estimation is seen to be robust and reliable. The estimates are realistic by visual inspection. As seen from the simulations, the mean square estimation error has a minimum, which means that the use of the subspace regularization improves the estimate as compared to both Gauss–Markov (minimum variance) estimation with the proposed basis and the principal component regression. The latency histograms of the estimates are comparable with the histograms of the simulations. The method can be further modified to use more realistic basis functions.

It is evident that a reliable single trial estimation method can give much more information than simple averaging. From these tests it is clear that, for example, the latency in the P300 test should be examined on a single trial basis prior to any other estimation method.

In the estimation of real evoked potentials there is a need to develop a reliable method for the selection of the regularization parameters. The implementation of some posterior method would be most desirable, but it might not be possible to form such a method for general use. Fully Bayesian implementations of the method are a future topic to be investigated.

The extension of the method proposed in Section 7.1 to multichannel measurements is straightforward. However, in multichannel estimation methods the observation model has to contain the dependence of the measurements at different electrode locations on each other. The proper modeling of this dependence necessitates a model of the head as a volume conductor. The complete estimation problem can then be formulated as a source estimation problem.

Single trial estimation methods are also needed as the first step in dynamical estimation, since the selection of the evolution model has to be based on some assumptions about the evoked potentials. For example, from the results of the single trial estimation in case study 3 one can conclude that in some cases there is a trend in the variance of the latencies. This implies that the covariance C_w should not be time-independent in the state evolution model.

In case study 4 a realistic state evolution model can be used. The assumption of trend-like changes is valid for the third peak in the simulated potentials. The estimated latencies are comparable to the true latencies. The covariances C_v and C_w were selected here by visual inspection. With these covariances the momentary variance and bias of the Kalman filter can be tuned so that the bias and the variance of the estimated latencies are acceptable. The performance of the method is not very critical with respect to the selection of the covariances. However, they can also be selected with a posterior approach based on the data using the so-called state space identification methods. In these methods the system matrices of the state space equations are parametrized and the parameters are solved e.g. as output-error least squares estimates. The implementation of these methods is a further topic to be investigated. Another future improvement is the use of the so-called hyperparameter models for the state equations.

The transient effect of the initial values of θ̂_0 and P_0 can be reduced by using the algorithm forward and backward and taking the final backward estimates as the initial values in the forward estimation.


CHAPTER
VIII
Discussion and conclusions

The presentation of all parts of this thesis is based on the unified formulation that is introduced in Chapter 2. With this formalism the results of estimation theory can be combined with regularization theory, time series analysis methods and Bayesian statistics. The formalism is a multivariate vector formalism. In particular, the review of the existing methods for evoked potential analysis in Chapter 6 is based strictly on this presentation. This makes it possible to analyze the implicit assumptions that are made in the various methods. This kind of unified presentation also makes the properties of the methods easier to compare. The review carried out in Chapter 6 should be helpful for anyone who is interested in the estimation of evoked potentials.

One problem in the evaluation of estimation methods has been the lack of realistic simulation methods. The simulation method is usually tailored to suit the estimation method that is being evaluated. In this sense the results of the evaluations may in some cases have been too optimistic. For this reason we have proposed two methods for the simulation of evoked potentials. Both methods are useful for different simulation needs. The component based simulations were found to be somewhat inconsistent with the data. In this method the parameters of the components were adopted from the mean of the measurements, which may not correspond to any single potential. However, with this method many parameters of the final simulations, e.g. the peak locations, can be easily controlled. The joint density of the component potential parameters can be selected with any kind of shape and correlation. In order to obtain more realistic simulations with this method, the selected probability density for the latency of the component should be deconvolved out of the shape of the component. However, this can be a very unstable problem. Principal component based simulation gives more realistic simulations when compared to the data. This is not surprising, since the first and second moments of the measurements and the simulations are by construction almost equal. We conclude that the proposed methods can serve as a systematic approach to the simulation of evoked potentials. The simulation of a nonstationary background is also easy to connect to the methods.


We also proposed a systematic method for the single trial estimation of evoked potentials. The proposed method was applied to the estimation of the evoked potentials in the P300 test. Only the target responses were estimated. The estimates were seen to be able to reject clear disturbances in the data. This can be meaningful e.g. if the amplitude ratios of the peaks are of interest.

As an example the single trial estimates were used for latency estimation. In all three cases the estimated latency of the P300 peak had a more or less nonsymmetric histogram. This is natural, since it is to be expected that the latency of this peak is closely related to the latencies of the other peaks. It is even probable that the peaks have a causal relationship. The probability density of the latency is then the convolution of the densities of all causal components. Another interesting thing to note is that the distribution of the latency is not always time independent. In some cases the variance of the latency of the P300 peak grows with time. The proposed method is also easily modified to include even nonstationary background processes. Also, different kinds of observation model selection schemes and prior assumptions are directly applicable to this observation model based approach.

The other systematic estimation method presented is a method for the dynamical estimation of the evoked potentials when the parameters of the potentials have trends. In the proposed algorithm the Kalman filtering and the principal component approaches are combined with the smoothness priors approach. The simulations show that the algorithm is capable of tracking slow trends in the evoked potential parameters. In these simulations the negative peaks varied randomly and did not fulfill the assumption of trend-like variation. The proposed method is the only method in which evoked potentials are treated as vector-valued stochastic processes and a recursive estimator is used. Also, the use of principal component regression in connection with the Kalman filter is of great importance. The proposed method allows the use of a great variety of more specific models both for the observations and for the time evolution of the evoked potentials.

Overall conclusions and future directions

In this thesis it has been shown that the regularization and Bayesian approaches can be useful in the analysis of single evoked potentials. It has also been shown that the proposed methods are related to several optimality criteria. We stress once again that although the simulations show that the proposed methods can be useful in themselves, the most important property of the estimators is their general structure, which allows one to develop estimators with more realistic observation models and more realistic choices for the prior. The following lines for further research are suggested:

• Formulation of fully <strong>Bayesian</strong> single trial estimator <strong>for</strong> evoked potentials.<br />

This should contain as few user tunable parameters as possible. For this<br />

we need a good method <strong>for</strong> the estimation of the regularization parameters<br />

in a posterior way. It is possible to <strong>for</strong>mulate the estimation method as a<br />

hierarchical <strong>Bayesian</strong> model <strong>and</strong> then use an empirical <strong>Bayesian</strong> approach<br />

<strong>for</strong> the estimation of the regularization parameters.


126 References<br />

• It has to be kept in mind that a single trial estimate is only a time-varying<br />

estimate <strong>for</strong> a potential in some location of scalp. When the proposed<br />

<strong>methods</strong> are extended to multichannel measurements <strong>and</strong> realistic observation<br />

models, the whole estimation problem can be interpreted as a source<br />

localization or potential mapping problem.<br />

• In recursive estimation the random walk model is the simplest choice. The true evolution of the evoked potentials is more complex, and a more realistic model could be based on the data. This leads to so-called hypermodel and state-space identification problems.
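
As noted in the first item above, one possible way to realize the empirical Bayesian estimation of a regularization parameter is to maximize the marginal likelihood (evidence) of a linear-Gaussian observation model. The sketch below illustrates this under simple assumptions and is not the procedure developed in this thesis; all matrices, noise levels, and grid values are illustrative.

import numpy as np

def log_evidence(z, H, alpha, sigma2):
    # Log marginal likelihood of z for the model z = H theta + v,
    # v ~ N(0, sigma2 I), with the prior theta ~ N(0, (1/alpha) I).
    n = len(z)
    C = H @ H.T / alpha + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + z @ np.linalg.solve(C, z) + n * np.log(2 * np.pi))

def empirical_bayes_map(z, H, sigma2, alphas):
    # Choose the regularization hyperparameter by maximizing the evidence
    # over a grid, then return the corresponding MAP estimate of theta.
    best = max(alphas, key=lambda a: log_evidence(z, H, a, sigma2))
    A = H.T @ H / sigma2 + best * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ z / sigma2), best

# Illustrative example with simulated data.
rng = np.random.default_rng(1)
H = rng.normal(size=(120, 10))
theta_true = rng.normal(scale=0.5, size=10)
z = H @ theta_true + rng.normal(scale=1.0, size=120)
theta_hat, alpha_hat = empirical_bayes_map(z, H, 1.0, np.logspace(-3, 3, 25))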

Overall, we argue that the goal in the future development of the methods is a full empirical Bayesian recursive multichannel method for the estimation of brain activity, with realistic observation and evolution models for the evoked potentials.


