
KALMAN FILTERING AND NEURAL NETWORKS


KALMAN FILTERING AND NEURAL NETWORKS

Edited by
Simon Haykin
Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada

A WILEY-INTERSCIENCE PUBLICATION
JOHN WILEY & SONS, INC.
New York · Chichester · Weinheim · Brisbane · Singapore · Toronto


Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.

ISBN 0-471-22154-6

This title is also available in print as ISBN 0-471-36998-5.

For more information about Wiley products, visit our web site at www.Wiley.com.


CONTENTS

Preface
Contributors

1 Kalman Filters (Simon Haykin)
  1.1 Introduction
  1.2 Optimum Estimates
  1.3 Kalman Filter
  1.4 Divergence Phenomenon: Square-Root Filtering
  1.5 Rauch–Tung–Striebel Smoother
  1.6 Extended Kalman Filter
  1.7 Summary
  References

2 Parameter-Based Kalman Filter Training: Theory and Implementation (Gintaras V. Puskorius and Lee A. Feldkamp)
  2.1 Introduction
  2.2 Network Architectures
  2.3 The EKF Procedure
    2.3.1 Global EKF Training
    2.3.2 Learning Rate and Scaled Cost Function
    2.3.3 Parameter Settings
  2.4 Decoupled EKF (DEKF)
  2.5 Multistream Training


    2.5.1 Some Insight into the Multistream Technique
    2.5.2 Advantages and Extensions of Multistream Training
  2.6 Computational Considerations
    2.6.1 Derivative Calculations
    2.6.2 Computationally Efficient Formulations for Multiple-Output Problems
    2.6.3 Avoiding Matrix Inversions
    2.6.4 Square-Root Filtering
  2.7 Other Extensions and Enhancements
    2.7.1 EKF Training with Constrained Weights
    2.7.2 EKF Training with an Entropic Cost Function
    2.7.3 EKF Training with Scalar Errors
  2.8 Automotive Applications of EKF Training
    2.8.1 Air/Fuel Ratio Control
    2.8.2 Idle Speed Control
    2.8.3 Sensor-Catalyst Modeling
    2.8.4 Engine Misfire Detection
    2.8.5 Vehicle Emissions Estimation
  2.9 Discussion
    2.9.1 Virtues of EKF Training
    2.9.2 Limitations of EKF Training
    2.9.3 Guidelines for Implementation and Use
  References

3 Learning Shape and Motion from Image Sequences (Gaurav S. Patel, Sue Becker, and Ron Racine)
  3.1 Introduction
  3.2 Neurobiological and Perceptual Foundations of our Model
  3.3 Network Description
  3.4 Experiment 1
  3.5 Experiment 2
  3.6 Experiment 3
  3.7 Discussion
  References


4 Chaotic Dynamics (Gaurav S. Patel and Simon Haykin)
  4.1 Introduction
  4.2 Chaotic (Dynamic) Invariants
  4.3 Dynamic Reconstruction
  4.4 Modeling Numerically Generated Chaotic Time Series
    4.4.1 Logistic Map
    4.4.2 Ikeda Map
    4.4.3 Lorenz Attractor
  4.5 Nonlinear Dynamic Modeling of Real-World Time Series
    4.5.1 Laser Intensity Pulsations
    4.5.2 Sea Clutter Data
  4.6 Discussion
  References

5 Dual Extended Kalman Filter Methods (Eric A. Wan and Alex T. Nelson)
  5.1 Introduction
  5.2 Dual EKF – Prediction Error
    5.2.1 EKF – State Estimation
    5.2.2 EKF – Weight Estimation
    5.2.3 Dual Estimation
  5.3 A Probabilistic Perspective
    5.3.1 Joint Estimation Methods
    5.3.2 Marginal Estimation Methods
    5.3.3 Dual EKF Algorithms
    5.3.4 Joint EKF
  5.4 Dual EKF Variance Estimation
  5.5 Applications
    5.5.1 Noisy Time-Series Estimation and Prediction
    5.5.2 Economic Forecasting – Index of Industrial Production
    5.5.3 Speech Enhancement
  5.6 Conclusions
  Acknowledgments


  Appendix A: Recurrent Derivative of the Kalman Gain
  Appendix B: Dual EKF with Colored Measurement Noise
  References

6 Learning Nonlinear Dynamical Systems Using the Expectation-Maximization Algorithm (Sam T. Roweis and Zoubin Ghahramani)
  6.1 Learning Stochastic Nonlinear Dynamics
    6.1.1 State Inference and Model Learning
    6.1.2 The Kalman Filter
    6.1.3 The EM Algorithm
  6.2 Combining EKS and EM
    6.2.1 Extended Kalman Smoothing (E-step)
    6.2.2 Learning Model Parameters (M-step)
    6.2.3 Fitting Radial Basis Functions to Gaussian Clouds
    6.2.4 Initialization of Models and Choosing Locations for RBF Kernels
  6.3 Results
    6.3.1 One- and Two-Dimensional Nonlinear State-Space Models
    6.3.2 Weather Data
  6.4 Extensions
    6.4.1 Learning the Means and Widths of the RBFs
    6.4.2 On-Line Learning
    6.4.3 Nonstationarity
    6.4.4 Using Bayesian Methods for Model Selection and Complexity Control
  6.5 Discussion
    6.5.1 Identifiability and Expressive Power
    6.5.2 Embedded Flows
    6.5.3 Stability
    6.5.4 Takens' Theorem and Hidden States
    6.5.5 Should Parameters and Hidden States be Treated Differently?
  6.6 Conclusions
  Acknowledgments


  Appendix: Expectations Required to Fit the RBFs
  References

7 The Unscented Kalman Filter (Eric A. Wan and Rudolph van der Merwe)
  7.1 Introduction
  7.2 Optimal Recursive Estimation and the EKF
  7.3 The Unscented Kalman Filter
    7.3.1 State-Estimation Examples
    7.3.2 The Unscented Kalman Smoother
  7.4 UKF Parameter Estimation
    7.4.1 Parameter-Estimation Examples
  7.5 UKF Dual Estimation
    7.5.1 Dual Estimation Experiments
  7.6 The Unscented Particle Filter
    7.6.1 The Particle Filter Algorithm
    7.6.2 UPF Experiments
  7.7 Conclusions
  Appendix A: Accuracy of the Unscented Transformation
  Appendix B: Efficient Square-Root UKF Implementations
  References

Index


PREFACE

This self-contained book, consisting of seven chapters, is devoted to Kalman filter theory applied to the training and use of neural networks, and some applications of learning algorithms derived in this way.

It is organized as follows:

• Chapter 1 presents an introductory treatment of Kalman filters, with emphasis on basic Kalman filter theory, the Rauch–Tung–Striebel smoother, and the extended Kalman filter.

• Chapter 2 presents the theoretical basis of a powerful learning algorithm for the training of feedforward and recurrent multilayered perceptrons, based on the decoupled extended Kalman filter (DEKF); the theory presented here also includes a novel technique called multistreaming.

• Chapters 3 and 4 present applications of the DEKF learning algorithm to the study of image sequences and the dynamic reconstruction of chaotic processes, respectively.

• Chapter 5 studies the dual estimation problem, which refers to the problem of simultaneously estimating the state of a nonlinear dynamical system and the model that gives rise to the underlying dynamics of the system.

• Chapter 6 studies how to learn stochastic nonlinear dynamics. This difficult learning task is solved in an elegant manner by combining two algorithms:
  1. The expectation-maximization (EM) algorithm, which provides an iterative procedure for maximum-likelihood estimation with missing hidden variables.
  2. The extended Kalman smoothing (EKS) algorithm for a refined estimation of the state.


• Chapter 7 studies yet another novel idea – the unscented Kalman filter – the performance of which is superior to that of the extended Kalman filter.

Except for Chapter 1, all the other chapters present illustrative applications of the learning algorithms described here, some of which involve the use of simulated as well as real-life data.

Much of the material presented here has not appeared in book form before. This volume should be of serious interest to researchers in neural networks and nonlinear dynamical systems.

SIMON HAYKIN
Communications Research Laboratory,
McMaster University, Hamilton, Ontario, Canada


CONTRIBUTORS

Sue Becker, Department of Psychology, McMaster University, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1

Lee A. Feldkamp, Ford Research Laboratory, Ford Motor Company, 2101 Village Road, Dearborn, MI 48121-2053, U.S.A.

Simon Haykin, Communications Research Laboratory, McMaster University, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1

Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London, Alexandra House, 17 Queen Square, London WC1N 3AR, U.K.

Alex T. Nelson, Department of Electrical and Computer Engineering, Oregon Graduate Institute of Science and Technology, 19600 N.W. von Neumann Drive, Beaverton, OR 97006-1999, U.S.A.

Gaurav S. Patel, 1553 Manton Blvd., Canton, MI 48187, U.S.A.

Gintaras V. Puskorius, Ford Research Laboratory, Ford Motor Company, 2101 Village Road, Dearborn, MI 48121-2053, U.S.A.

Ron Racine, Department of Psychology, McMaster University, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1

Sam T. Roweis, Gatsby Computational Neuroscience Unit, University College London, Alexandra House, 17 Queen Square, London WC1N 3AR, U.K.

Rudolph van der Merwe, Department of Electrical and Computer Engineering, Oregon Graduate Institute of Science and Technology, 19600 N.W. von Neumann Drive, Beaverton, OR 97006-1999, U.S.A.

Eric A. Wan, Department of Electrical and Computer Engineering, Oregon Graduate Institute of Science and Technology, 19600 N.W. von Neumann Drive, Beaverton, OR 97006-1999, U.S.A.




Adaptive and Learning Systems for Signal Processing, Communications, and Control
Editor: Simon Haykin

Beckerman / ADAPTIVE COOPERATIVE SYSTEMS
Chen and Gu / CONTROL-ORIENTED SYSTEM IDENTIFICATION: An H∞ Approach
Cherkassky and Mulier / LEARNING FROM DATA: Concepts, Theory, and Methods
Diamantaras and Kung / PRINCIPAL COMPONENT NEURAL NETWORKS: Theory and Applications
Haykin / KALMAN FILTERING AND NEURAL NETWORKS
Haykin / UNSUPERVISED ADAPTIVE FILTERING: Blind Source Separation
Haykin / UNSUPERVISED ADAPTIVE FILTERING: Blind Deconvolution
Haykin and Puthussarypady / CHAOTIC DYNAMICS OF SEA CLUTTER
Hrycej / NEUROCONTROL: Towards an Industrial Control Methodology
Hyvärinen, Karhunen, and Oja / INDEPENDENT COMPONENT ANALYSIS
Krstić, Kanellakopoulos, and Kokotović / NONLINEAR AND ADAPTIVE CONTROL DESIGN
Nikias and Shao / SIGNAL PROCESSING WITH ALPHA-STABLE DISTRIBUTIONS AND APPLICATIONS
Passino and Burgess / STABILITY ANALYSIS OF DISCRETE EVENT SYSTEMS
Sánchez-Peña and Sznaier / ROBUST SYSTEMS THEORY AND APPLICATIONS
Sandberg, Lo, Fancourt, Principe, Katagiri, and Haykin / NONLINEAR DYNAMICAL SYSTEMS: Feedforward Neural Network Perspectives
Tao and Kokotović / ADAPTIVE CONTROL OF SYSTEMS WITH ACTUATOR AND SENSOR NONLINEARITIES
Tsoukalas and Uhrig / FUZZY AND NEURAL APPROACHES IN ENGINEERING
Van Hulle / FAITHFUL REPRESENTATIONS AND TOPOGRAPHIC MAPS: From Distortion- to Information-Based Self-Organization
Vapnik / STATISTICAL LEARNING THEORY
Werbos / THE ROOTS OF BACKPROPAGATION: From Ordered Derivatives to Neural Networks and Political Forecasting


1 KALMAN FILTERS

Simon Haykin
Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada
(haykin@mcmaster.ca)

Kalman Filtering and Neural Networks, Edited by Simon Haykin. ISBN 0-471-36998-5. © 2001 John Wiley & Sons, Inc.

1.1 INTRODUCTION

The celebrated Kalman filter, rooted in the state-space formulation of linear dynamical systems, provides a recursive solution to the linear optimal filtering problem. It applies to stationary as well as nonstationary environments. The solution is recursive in that each updated estimate of the state is computed from the previous estimate and the new input data, so only the previous estimate requires storage. In addition to eliminating the need for storing the entire past observed data, the Kalman filter is computationally more efficient than computing the estimate directly from the entire past observed data at each step of the filtering process.

In this chapter, we present an introductory treatment of Kalman filters to pave the way for their application in subsequent chapters of the book. We have chosen to follow the original paper by Kalman [1] for the


derivation; see also the books by Lewis [2] and Grewal and Andrews [3]. The derivation is not only elegant but also highly insightful.

Consider a linear, discrete-time dynamical system described by the block diagram shown in Figure 1.1. The concept of state is fundamental to this description. The state vector, or simply state, denoted by $x_k$, is defined as the minimal set of data that is sufficient to uniquely describe the unforced dynamical behavior of the system; the subscript $k$ denotes discrete time. In other words, the state is the least amount of data on the past behavior of the system that is needed to predict its future behavior. Typically, the state $x_k$ is unknown. To estimate it, we use a set of observed data, denoted by the vector $y_k$.

In mathematical terms, the block diagram of Figure 1.1 embodies the following pair of equations:

1. Process equation
$$x_{k+1} = F_{k+1,k}\, x_k + w_k, \qquad (1.1)$$
where $F_{k+1,k}$ is the transition matrix taking the state $x_k$ from time $k$ to time $k+1$. The process noise $w_k$ is assumed to be additive, white, and Gaussian, with zero mean and with covariance matrix defined by
$$E[w_n w_k^T] = \begin{cases} Q_k & \text{for } n = k, \\ 0 & \text{for } n \neq k, \end{cases} \qquad (1.2)$$
where the superscript $T$ denotes matrix transposition. The dimension of the state space is denoted by $M$.

Figure 1.1 Signal-flow graph representation of a linear, discrete-time dynamical system.


2. Measurement equation
$$y_k = H_k x_k + v_k, \qquad (1.3)$$
where $y_k$ is the observable at time $k$ and $H_k$ is the measurement matrix. The measurement noise $v_k$ is assumed to be additive, white, and Gaussian, with zero mean and with covariance matrix defined by
$$E[v_n v_k^T] = \begin{cases} R_k & \text{for } n = k, \\ 0 & \text{for } n \neq k. \end{cases} \qquad (1.4)$$
Moreover, the measurement noise $v_k$ is uncorrelated with the process noise $w_k$. The dimension of the measurement space is denoted by $N$.

The Kalman filtering problem, namely, the problem of jointly solving the process and measurement equations for the unknown state in an optimum manner, may now be formally stated as follows:

• Use the entire observed data, consisting of the vectors $y_1, y_2, \ldots, y_k$, to find for each $k \ge 1$ the minimum mean-square error estimate of the state $x_i$.

The problem is called filtering if $i = k$, prediction if $i > k$, and smoothing if $1 \le i < k$.

1.2 OPTIMUM ESTIMATES

Before proceeding to derive the Kalman filter, we find it useful to review some concepts basic to optimum estimation. To simplify matters, this review is presented in the context of scalar random variables; generalization of the theory to vector random variables is a straightforward matter.

Suppose we are given the observable
$$y_k = x_k + v_k,$$
where $x_k$ is an unknown signal and $v_k$ is an additive noise component. Let $\hat{x}_k$ denote the a posteriori estimate of the signal $x_k$, given the observations $y_1, y_2, \ldots, y_k$. In general, the estimate $\hat{x}_k$ is different from the unknown


signal $x_k$. To derive this estimate in an optimum manner, we need a cost (loss) function for incorrect estimates. The cost function should satisfy two requirements:

• The cost function is nonnegative.
• The cost function is a nondecreasing function of the estimation error $\tilde{x}_k$ defined by
$$\tilde{x}_k = x_k - \hat{x}_k.$$

These two requirements are satisfied by the mean-square error defined by
$$J_k = E[(x_k - \hat{x}_k)^2] = E[\tilde{x}_k^2],$$
where $E$ is the expectation operator. The dependence of the cost function $J_k$ on time $k$ emphasizes the nonstationary nature of the recursive estimation process.

To derive an optimal value for the estimate $\hat{x}_k$, we may invoke two theorems taken from stochastic process theory [1, 4]:

Theorem 1.1 (Conditional mean estimator). If the stochastic processes $\{x_k\}$ and $\{y_k\}$ are jointly Gaussian, then the optimum estimate $\hat{x}_k$ that minimizes the mean-square error $J_k$ is the conditional mean estimator:
$$\hat{x}_k = E[x_k \mid y_1, y_2, \ldots, y_k].$$

Theorem 1.2 (Principle of orthogonality). Let the stochastic processes $\{x_k\}$ and $\{y_k\}$ be of zero means; that is,
$$E[x_k] = E[y_k] = 0 \quad \text{for all } k.$$
Then, if either (i) the stochastic processes $\{x_k\}$ and $\{y_k\}$ are jointly Gaussian, or (ii) the optimal estimate $\hat{x}_k$ is restricted to be a linear function of the observables and the cost function is the mean-square error, the optimum estimate $\hat{x}_k$, given the observables $y_1, y_2, \ldots, y_k$, is the orthogonal projection of $x_k$ on the space spanned by these observables.
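As a quick numerical check (not part of the original text; the signal and noise variances below are arbitrary choices), both theorems can be illustrated by Monte Carlo simulation of the scalar observable $y_k = x_k + v_k$: for jointly Gaussian variables the conditional mean is a linear function of the observable, it attains a lower mean-square error than the raw observable, and its estimation error is orthogonal to the data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Scalar model y = x + v with jointly Gaussian signal and noise
x = rng.normal(0.0, 1.0, n)      # signal, variance sigma_x^2 = 1.0
v = rng.normal(0.0, 0.5, n)      # noise, std 0.5, variance sigma_v^2 = 0.25
y = x + v

# For Gaussian x and v, the conditional mean E[x | y] is linear in y:
# x_hat = sigma_x^2 / (sigma_x^2 + sigma_v^2) * y
x_hat = (1.0 / 1.25) * y

mse_opt = np.mean((x - x_hat) ** 2)   # theory: 1.0 * 0.25 / 1.25 = 0.2
mse_raw = np.mean((x - y) ** 2)       # theory: 0.25
orth = np.mean((x - x_hat) * y)       # theory: 0 (principle of orthogonality)
print(mse_opt, mse_raw, orth)
```

The sample value of `mse_opt` matches the a posteriori variance $\sigma_x^2 \sigma_v^2 / (\sigma_x^2 + \sigma_v^2)$, and the near-zero correlation between the estimation error and the observable is exactly the orthogonality condition of Theorem 1.2.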


With these two theorems at hand, the derivation of the Kalman filter follows.

1.3 KALMAN FILTER

Suppose that a measurement on a linear dynamical system, described by Eqs. (1.1) and (1.3), has been made at time $k$. The requirement is to use the information contained in the new measurement $y_k$ to update the estimate of the unknown state $x_k$. Let $\hat{x}_k^-$ denote the a priori estimate of the state, which is already available at time $k$. With a linear estimator as the objective, we may express the a posteriori estimate $\hat{x}_k$ as a linear combination of the a priori estimate and the new measurement, as shown by
$$\hat{x}_k = G_k^{(1)} \hat{x}_k^- + G_k y_k, \qquad (1.5)$$
where the multiplying matrix factors $G_k^{(1)}$ and $G_k$ are to be determined. To find these two matrices, we invoke the principle of orthogonality stated under Theorem 1.2. The state-error vector is defined by
$$\tilde{x}_k = x_k - \hat{x}_k. \qquad (1.6)$$
Applying the principle of orthogonality to the situation at hand, we may thus write
$$E[\tilde{x}_k y_i^T] = 0 \quad \text{for } i = 1, 2, \ldots, k-1. \qquad (1.7)$$
Using Eqs. (1.3), (1.5), and (1.6) in (1.7), we get
$$E[(x_k - G_k^{(1)} \hat{x}_k^- - G_k H_k x_k - G_k v_k)\, y_i^T] = 0 \quad \text{for } i = 1, 2, \ldots, k-1. \qquad (1.8)$$
Since the measurement noise $v_k$ is white and uncorrelated with the process noise $w_k$, it follows that
$$E[v_k y_i^T] = 0 \quad \text{for } i = 1, 2, \ldots, k-1.$$


Using this relation and rearranging terms, we may rewrite Eq. (1.8) as
$$E[(I - G_k H_k - G_k^{(1)})\, x_k y_i^T + G_k^{(1)} (x_k - \hat{x}_k^-)\, y_i^T] = 0, \qquad (1.9)$$
where $I$ is the identity matrix. From the principle of orthogonality, we now note that
$$E[(x_k - \hat{x}_k^-)\, y_i^T] = 0.$$
Accordingly, Eq. (1.9) simplifies to
$$(I - G_k H_k - G_k^{(1)})\, E[x_k y_i^T] = 0 \quad \text{for } i = 1, 2, \ldots, k-1. \qquad (1.10)$$
For arbitrary values of the state $x_k$ and observable $y_i$, Eq. (1.10) can only be satisfied if the scaling factors $G_k^{(1)}$ and $G_k$ are related as follows:
$$I - G_k H_k - G_k^{(1)} = 0,$$
or, equivalently, $G_k^{(1)}$ is defined in terms of $G_k$ as
$$G_k^{(1)} = I - G_k H_k. \qquad (1.11)$$
Substituting Eq. (1.11) into (1.5), we may express the a posteriori estimate of the state at time $k$ as
$$\hat{x}_k = \hat{x}_k^- + G_k (y_k - H_k \hat{x}_k^-), \qquad (1.12)$$
in light of which the matrix $G_k$ is called the Kalman gain.

There now remains the problem of deriving an explicit formula for $G_k$. Since, from the principle of orthogonality, we have
$$E[(x_k - \hat{x}_k)\, y_k^T] = 0, \qquad (1.13)$$
it follows that
$$E[(x_k - \hat{x}_k)\, \hat{y}_k^T] = 0, \qquad (1.14)$$


where $\hat{y}_k$ is an estimate of $y_k$ given the previous measurements $y_1, y_2, \ldots, y_{k-1}$.

Define the innovations process
$$\tilde{y}_k = y_k - \hat{y}_k. \qquad (1.15)$$
The innovations process represents a measure of the "new" information contained in $y_k$; it may also be expressed as
$$\tilde{y}_k = y_k - H_k \hat{x}_k^- = H_k x_k + v_k - H_k \hat{x}_k^- = H_k \tilde{x}_k^- + v_k, \qquad (1.16)$$
where $\tilde{x}_k^- = x_k - \hat{x}_k^-$ is the a priori estimation error. Hence, subtracting Eq. (1.14) from (1.13) and then using the definition of Eq. (1.15), we may write
$$E[(x_k - \hat{x}_k)\, \tilde{y}_k^T] = 0. \qquad (1.17)$$
Using Eqs. (1.3) and (1.12), we may express the state-error vector $x_k - \hat{x}_k$ as
$$x_k - \hat{x}_k = \tilde{x}_k^- - G_k (H_k \tilde{x}_k^- + v_k) = (I - G_k H_k)\, \tilde{x}_k^- - G_k v_k. \qquad (1.18)$$
Hence, substituting Eqs. (1.16) and (1.18) into (1.17), we get
$$E[\{(I - G_k H_k)\, \tilde{x}_k^- - G_k v_k\}\,(H_k \tilde{x}_k^- + v_k)^T] = 0. \qquad (1.19)$$
Since the measurement noise $v_k$ is independent of the state $x_k$ and therefore of the error $\tilde{x}_k^-$, the expectation of Eq. (1.19) reduces to
$$(I - G_k H_k)\, E[\tilde{x}_k^- (\tilde{x}_k^-)^T]\, H_k^T - G_k E[v_k v_k^T] = 0. \qquad (1.20)$$
Define the a priori covariance matrix
$$P_k^- = E[(x_k - \hat{x}_k^-)(x_k - \hat{x}_k^-)^T] = E[\tilde{x}_k^- (\tilde{x}_k^-)^T]. \qquad (1.21)$$


Then, invoking the covariance definitions of Eqs. (1.4) and (1.21), we may rewrite Eq. (1.20) as
$$(I - G_k H_k)\, P_k^- H_k^T - G_k R_k = 0.$$
Solving this equation for $G_k$, we get the desired formula
$$G_k = P_k^- H_k^T [H_k P_k^- H_k^T + R_k]^{-1}, \qquad (1.22)$$
where the symbol $[\cdot]^{-1}$ denotes the inverse of the matrix inside the square brackets. Equation (1.22) is the desired formula for computing the Kalman gain $G_k$, which is defined in terms of the a priori covariance matrix $P_k^-$.

To complete the recursive estimation procedure, we consider the error covariance propagation, which describes the effects of time on the covariance matrices of estimation errors. This propagation involves two stages of computation:

1. The a priori covariance matrix $P_k^-$ at time $k$ is defined by Eq. (1.21). Given $P_k^-$, compute the a posteriori covariance matrix $P_k$, which, at time $k$, is defined by
$$P_k = E[\tilde{x}_k \tilde{x}_k^T] = E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]. \qquad (1.23)$$
2. Given the "old" a posteriori covariance matrix $P_{k-1}$, compute the "updated" a priori covariance matrix $P_k^-$.

To proceed with stage 1, we substitute Eq. (1.18) into (1.23) and note that the noise process $v_k$ is independent of the a priori estimation error $\tilde{x}_k^-$. We thus obtain¹
$$P_k = (I - G_k H_k)\, E[\tilde{x}_k^- (\tilde{x}_k^-)^T]\,(I - G_k H_k)^T + G_k E[v_k v_k^T]\, G_k^T = (I - G_k H_k)\, P_k^- (I - G_k H_k)^T + G_k R_k G_k^T. \qquad (1.24)$$

¹ Equation (1.24) is referred to as the "Joseph" version of the covariance update equation [5].


Expanding terms in Eq. (1.24) and then using Eq. (1.22), we may reformulate the dependence of the a posteriori covariance matrix $P_k$ on the a priori covariance matrix $P_k^-$ in the simplified form
$$P_k = (I - G_k H_k)\, P_k^- - (I - G_k H_k)\, P_k^- H_k^T G_k^T + G_k R_k G_k^T = (I - G_k H_k)\, P_k^- - G_k R_k G_k^T + G_k R_k G_k^T = (I - G_k H_k)\, P_k^-. \qquad (1.25)$$
For the second stage of error covariance propagation, we first recognize that the a priori estimate of the state is defined in terms of the "old" a posteriori estimate as follows:
$$\hat{x}_k^- = F_{k,k-1}\, \hat{x}_{k-1}. \qquad (1.26)$$
We may therefore use Eqs. (1.1) and (1.26) to express the a priori estimation error in yet another form:
$$\tilde{x}_k^- = x_k - \hat{x}_k^- = (F_{k,k-1}\, x_{k-1} + w_{k-1}) - F_{k,k-1}\, \hat{x}_{k-1} = F_{k,k-1}\, \tilde{x}_{k-1} + w_{k-1}. \qquad (1.27)$$
Accordingly, using Eq. (1.27) in (1.21) and noting that the process noise $w_{k-1}$ is independent of $\tilde{x}_{k-1}$, we get
$$P_k^- = F_{k,k-1}\, E[\tilde{x}_{k-1} \tilde{x}_{k-1}^T]\, F_{k,k-1}^T + E[w_{k-1} w_{k-1}^T] = F_{k,k-1}\, P_{k-1}\, F_{k,k-1}^T + Q_{k-1}, \qquad (1.28)$$
which defines the dependence of the a priori covariance matrix $P_k^-$ on the "old" a posteriori covariance matrix $P_{k-1}$.

With Eqs. (1.26), (1.28), (1.22), (1.12), and (1.25) at hand, we may now summarize the recursive estimation of state as shown in Table 1.1. This table also includes the initialization. In the absence of any observed data at time $k = 0$, we may choose the initial estimate of the state as
$$\hat{x}_0 = E[x_0], \qquad (1.29)$$


and the initial value of the a posteriori covariance matrix as
$$P_0 = E[(x_0 - E[x_0])(x_0 - E[x_0])^T]. \qquad (1.30)$$
This choice for the initial conditions not only is intuitively satisfying but also has the advantage of yielding an unbiased estimate of the state $x_k$.

Table 1.1 Summary of the Kalman filter

State-space model:
$$x_{k+1} = F_{k+1,k}\, x_k + w_k,$$
$$y_k = H_k x_k + v_k,$$
where $w_k$ and $v_k$ are independent, zero-mean, Gaussian noise processes of covariance matrices $Q_k$ and $R_k$, respectively.

Initialization: For $k = 0$, set
$$\hat{x}_0 = E[x_0],$$
$$P_0 = E[(x_0 - E[x_0])(x_0 - E[x_0])^T].$$

Computation: For $k = 1, 2, \ldots$, compute:

State estimate propagation:
$$\hat{x}_k^- = F_{k,k-1}\, \hat{x}_{k-1};$$

Error covariance propagation:
$$P_k^- = F_{k,k-1}\, P_{k-1}\, F_{k,k-1}^T + Q_{k-1};$$

Kalman gain matrix:
$$G_k = P_k^- H_k^T [H_k P_k^- H_k^T + R_k]^{-1};$$

State estimate update:
$$\hat{x}_k = \hat{x}_k^- + G_k (y_k - H_k \hat{x}_k^-);$$

Error covariance update:
$$P_k = (I - G_k H_k)\, P_k^-.$$

1.4 DIVERGENCE PHENOMENON: SQUARE-ROOT FILTERING

The Kalman filter is prone to serious numerical difficulties that are well documented in the literature [6]. For example, the a posteriori covariance matrix $P_k$ is defined as the difference between two matrices $P_k^-$ and $G_k H_k P_k^-$; see Eq. (1.25).
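Before examining these numerical difficulties further, it may help to see the recursion of Table 1.1 in executable form. The following sketch is not part of the original text; the constant-velocity model and all noise levels are arbitrary assumptions chosen for illustration. It simulates the state-space model of Eqs. (1.1) and (1.3) and then runs the filter on the resulting observables:

```python
import numpy as np

def kalman_filter(y, F, H, Q, R, x0, P0):
    """Run the recursion of Table 1.1 over observations y[0..N-1].

    Returns the a posteriori state estimates and the final covariance.
    """
    x_hat, P = x0, P0
    estimates = []
    for yk in y:
        # State estimate and error covariance propagation (Eqs. 1.26, 1.28)
        x_prior = F @ x_hat
        P_prior = F @ P @ F.T + Q
        # Kalman gain matrix (Eq. 1.22)
        G = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)
        # State estimate and error covariance updates (Eqs. 1.12, 1.25)
        x_hat = x_prior + G @ (yk - H @ x_prior)
        P = (np.eye(len(x_hat)) - G @ H) @ P_prior
        estimates.append(x_hat)
    return np.array(estimates), P

# Toy constant-velocity model (all numbers are illustrative assumptions)
rng = np.random.default_rng(1)
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # transition matrix F_{k+1,k}
H = np.array([[1.0, 0.0]])               # observe position only
Q = 0.01 * np.eye(2)                     # process noise covariance
R = np.array([[1.0]])                    # measurement noise covariance

# Simulate the state-space model of Eqs. (1.1) and (1.3)
N, x = 200, np.array([0.0, 1.0])
states, obs = [], []
for _ in range(N):
    x = F @ x + rng.multivariate_normal([0.0, 0.0], Q)
    states.append(x)
    obs.append(H @ x + rng.normal(0.0, 1.0, 1))
states, obs = np.array(states), np.array(obs)

est, _ = kalman_filter(obs, F, H, Q, R, x0=np.zeros(2), P0=np.eye(2))
raw_rmse = np.sqrt(np.mean((obs[:, 0] - states[:, 0]) ** 2))
kf_rmse = np.sqrt(np.mean((est[:, 0] - states[:, 0]) ** 2))
print(raw_rmse, kf_rmse)   # the filtered estimate should have lower error
```

The sketch uses the plain update $P_k = (I - G_k H_k) P_k^-$ exactly as in Table 1.1; in ill-conditioned problems one would prefer the Joseph form (1.24) or square-root methods.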


Hence, unless the numerical accuracy of the algorithm is high enough, the matrix $P_k$ resulting from this computation may not be nonnegative-definite. Such a situation is clearly unacceptable, because $P_k$ represents a covariance matrix. The unstable behavior of the Kalman filter, which results from numerical inaccuracies due to the use of finite-wordlength arithmetic, is called the divergence phenomenon.

A refined method of overcoming the divergence phenomenon is to use numerically stable unitary transformations at every iteration of the Kalman filtering algorithm [6]. In particular, the matrix $P_k$ is propagated in a square-root form by using the Cholesky factorization:
$$P_k = P_k^{1/2} P_k^{T/2}, \qquad (1.31)$$
where $P_k^{1/2}$ is reserved for a lower-triangular matrix, and $P_k^{T/2}$ is its transpose. In linear algebra, the Cholesky factor $P_k^{1/2}$ is commonly referred to as the square root of the matrix $P_k$. Accordingly, any variant of the Kalman filtering algorithm based on the Cholesky factorization is referred to as square-root filtering. The important point to note here is that the matrix product $P_k^{1/2} P_k^{T/2}$ is much less likely to become indefinite, because the product of any square matrix and its transpose is always nonnegative-definite. Indeed, even in the presence of roundoff errors, the numerical conditioning of the Cholesky factor $P_k^{1/2}$ is generally much better than that of $P_k$ itself.

1.5 RAUCH–TUNG–STRIEBEL SMOOTHER

In Section 1.3, we addressed the optimum linear filtering problem. The solution to the linear prediction problem follows in a straightforward manner from the basic theory of Section 1.3. In this section, we consider the optimum smoothing problem.

To proceed, suppose that we are given a set of data over the time interval $0 < k \le N$.
Smoothing is a non-real-time operation in that it involves estimation of the state $x_k$ for $0 < k \le N$, using all the available data, past as well as future. In what follows, we assume that the final time $N$ is fixed.

To determine the optimum state estimates $\hat{x}_k$ for $0 < k \le N$, we need to account for past data $y_j$ defined by $0 < j \le k$, and future data $y_j$ defined by $k < j \le N$. The estimation pertaining to the past data, which we refer to as forward filtering theory, was presented in Section 1.3. To deal with the


issue of state estimation pertaining to the future data, we use backward filtering, which starts at the final time $N$ and runs backwards. Let $\hat{x}_k^f$ and $\hat{x}_k^b$ denote the state estimates obtained from the forward and backward recursions, respectively. Given these two estimates, the next issue to be considered is how to combine them into an overall smoothed estimate $\hat{x}_k$, which accounts for data over the entire time interval. Note that the symbol $\hat{x}_k$ used for the smoothed estimate in this section is not to be confused with the filtered (i.e., a posteriori) estimate used in Section 1.3.

We begin by rewriting the process equation (1.1) as a recursion for decreasing $k$, as shown by
$$x_k = F_{k+1,k}^{-1}\, x_{k+1} - F_{k+1,k}^{-1}\, w_k, \qquad (1.32)$$
where $F_{k+1,k}^{-1}$ is the inverse of the transition matrix $F_{k+1,k}$. The rationale for backward filtering is depicted in Figure 1.2a, where the recursion begins at the final time $N$. This rationale is to be contrasted with that of forward filtering depicted in Figure 1.2b. Note that the a priori estimate $\hat{x}_k^{b-}$ and the a posteriori estimate $\hat{x}_k^b$ for backward filtering occur to the right and left of time $k$ in Figure 1.2a, respectively. This situation is the exact opposite to that occurring in the case of forward filtering depicted in Figure 1.2b.

To simplify the presentation, we introduce the two definitions:
$$S_k = [P_k^b]^{-1}, \qquad (1.33)$$
$$S_k^- = [P_k^{b-}]^{-1}, \qquad (1.34)$$

Figure 1.2 Illustrating the smoother time-updates for (a) backward filtering and (b) forward filtering.


and the two intermediate variables

$$\hat{z}_k = [P_k^b]^{-1}\hat{x}_k^b = S_k \hat{x}_k^b, \qquad (1.35)$$
$$\hat{z}_k^- = [P_k^{b-}]^{-1}\hat{x}_k^{b-} = S_k^- \hat{x}_k^{b-}. \qquad (1.36)$$

Then, building on the rationale of Figure 1.2a, we may derive the following updates for the backward filter [2]:

1. Measurement updates:

$$S_k = S_k^- + H_k^T R_k^{-1} H_k, \qquad (1.37)$$
$$\hat{z}_k = \hat{z}_k^- + H_k^T R_k^{-1} y_k, \qquad (1.38)$$

where $y_k$ is the observable defined by the measurement equation (1.3), $H_k$ is the measurement matrix, and $R_k^{-1}$ is the inverse of the covariance matrix of the measurement noise $v_k$.

2. Time updates:

$$G_k^b = S_{k+1}[S_{k+1} + Q_k^{-1}]^{-1}, \qquad (1.39)$$
$$S_k^- = F_{k+1,k}^T (I - G_k^b) S_{k+1} F_{k+1,k}, \qquad (1.40)$$
$$\hat{z}_k^- = F_{k+1,k}^T (I - G_k^b) \hat{z}_{k+1}, \qquad (1.41)$$

where $G_k^b$ is the Kalman gain for backward filtering and $Q_k^{-1}$ is the inverse of the covariance matrix of the process noise $w_k$. The backward filter defined by the measurement and time updates of Eqs. (1.37)–(1.41) is the information formulation of the Kalman filter. The information filter is distinguished from the basic Kalman filter in that it propagates the inverse of the error covariance matrix rather than the error covariance matrix itself.

Given observable data over the interval $0 \le k \le N$ for fixed $N$, suppose we have obtained the following two estimates:

- The forward a posteriori estimate $\hat{x}_k^f$, obtained by operating the Kalman filter on data $y_j$ for $0 \le j \le k$.
- The backward a priori estimate $\hat{x}_k^{b-}$, obtained by operating the information filter on data $y_j$ for $k < j \le N$.
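The information-form updates of Eqs. (1.35)–(1.41) can be sketched in a few lines of code. The function names and the one-dimensional demonstration values below are ours, chosen for illustration only:

```python
import numpy as np

def info_measurement_update(S_prior, z_prior, H, R, y):
    """Eqs. (1.37)-(1.38): fold the observation y into the
    information matrix S and information state z."""
    HtRinv = H.T @ np.linalg.inv(R)
    return S_prior + HtRinv @ H, z_prior + HtRinv @ y

def info_time_update_backward(S_next, z_next, F, Q):
    """Eqs. (1.39)-(1.41): propagate the information pair
    backwards from step k+1 to step k."""
    G = S_next @ np.linalg.inv(S_next + np.linalg.inv(Q))  # Eq. (1.39)
    M = F.T @ (np.eye(S_next.shape[0]) - G)
    return M @ S_next @ F, M @ z_next                      # Eqs. (1.40)-(1.41)

# One-dimensional demonstration with illustrative values.
S1, z1 = info_measurement_update(np.array([[1.0]]), np.array([0.0]),
                                 np.array([[1.0]]), np.array([[0.5]]),
                                 np.array([2.0]))
S0, z0 = info_time_update_backward(S1, z1, np.array([[1.0]]), np.array([[1.0]]))
```

In the demonstration, the measurement update raises the scalar information from 1 to 3 (the observation adds information), and the backward time update then deflates it, as Eqs. (1.37) and (1.40) prescribe.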


With these two estimates and their respective error covariance matrices at hand, the next issue of interest is how to determine the smoothed estimate $\hat{x}_k$ and its error covariance matrix, which incorporate the overall data over the entire time interval $0 \le k \le N$.

Recognizing that the process noise $w_k$ and measurement noise $v_k$ are independent, we may formulate the error covariance matrix of the a posteriori smoothed estimate $\hat{x}_k$ as follows:

$$P_k = [[P_k^f]^{-1} + [P_k^{b-}]^{-1}]^{-1} = [[P_k^f]^{-1} + S_k^-]^{-1}. \qquad (1.42)$$

To proceed further, we invoke the matrix inversion lemma, which may be stated as follows [7]. Let A and B be two positive-definite matrices related by

$$A = B^{-1} + C D^{-1} C^T,$$

where D is another positive-definite matrix and C is a matrix with compatible dimensions. The matrix inversion lemma states that we may express the inverse of the matrix A as follows:

$$A^{-1} = B - BC[D + C^T B C]^{-1} C^T B.$$

For the problem at hand, we set

$$A = P_k^{-1}, \quad B = P_k^f, \quad C = I, \quad D = [S_k^-]^{-1},$$

where I is the identity matrix. Then, applying the matrix inversion lemma to Eq. (1.42), we obtain

$$P_k = P_k^f - P_k^f [P_k^{b-} + P_k^f]^{-1} P_k^f = P_k^f - P_k^f S_k^- [I + P_k^f S_k^-]^{-1} P_k^f. \qquad (1.43)$$

From Eq. (1.43), we find that the a posteriori smoothed error covariance matrix $P_k$ is smaller than or equal to the a posteriori error covariance


Figure 1.3 Illustrating the error covariance for forward filtering, backward filtering, and smoothing.

matrix $P_k^f$ produced by the Kalman filter, which is naturally due to the fact that smoothing uses additional information contained in the future data. This point is borne out by Figure 1.3, which depicts the variations of $P_k$, $P_k^f$, and $P_k^b$ with $k$ for a one-dimensional situation.

The a posteriori smoothed estimate of the state is defined by

$$\hat{x}_k = P_k([P_k^f]^{-1}\hat{x}_k^f + [P_k^{b-}]^{-1}\hat{x}_k^{b-}). \qquad (1.44)$$

Using Eqs. (1.36) and (1.43) in (1.44) yields, after simplification,

$$\hat{x}_k = \hat{x}_k^f + (P_k \hat{z}_k^- - G_k \hat{x}_k^f), \qquad (1.45)$$

where the smoother gain is defined by

$$G_k = P_k^f S_k^- [I + P_k^f S_k^-]^{-1}, \qquad (1.46)$$

which is not to be confused with the Kalman gain of Eq. (1.22).

The optimum smoother just derived consists of three components:

- A forward filter in the form of a Kalman filter.
- A backward filter in the form of an information filter.
- A separate smoother, which combines results embodied in the forward and backward filters.

The Rauch–Tung–Striebel smoother, however, is more efficient than the three-part smoother in that it incorporates the backward filter and separate


smoother into a single entity [8, 9]. Specifically, the measurement update of the Rauch–Tung–Striebel smoother is defined by

$$P_k = P_k^f - A_k(P_{k+1}^{f-} - P_{k+1})A_k^T, \qquad (1.47)$$

where $A_k$ is the new gain matrix:

$$A_k = P_k^f F_{k+1,k}^T [P_{k+1}^{f-}]^{-1}. \qquad (1.48)$$

The corresponding time update is defined by

$$\hat{x}_k = \hat{x}_k^f + A_k(\hat{x}_{k+1} - \hat{x}_{k+1}^{f-}). \qquad (1.49)$$

The Rauch–Tung–Striebel smoother thus proceeds as follows:

1. The Kalman filter is applied to the observable data in a forward manner, that is, k = 0, 1, 2, ..., in accordance with the basic theory summarized in Table 1.1.
2. The recursive smoother is applied to the observable data in a backward manner, that is, k = N-1, N-2, ..., in accordance with Eqs. (1.47)–(1.49).
3. The initial conditions are defined by

$$P_N = P_N^f, \qquad (1.50)$$
$$\hat{x}_N = \hat{x}_N^f. \qquad (1.51)$$

Table 1.2 summarizes the computations involved in the Rauch–Tung–Striebel smoother.

1.6 EXTENDED KALMAN FILTER

The Kalman filtering problem considered up to this point in the discussion has addressed the estimation of a state vector in a linear model of a dynamical system. If, however, the model is nonlinear, we may extend the use of Kalman filtering through a linearization procedure. The resulting filter is referred to as the extended Kalman filter (EKF) [10–12]. Such an


Table 1.2 Summary of the Rauch–Tung–Striebel smoother

State-space model
$$x_{k+1} = F_{k+1,k} x_k + w_k,$$
$$y_k = H_k x_k + v_k,$$
where $w_k$ and $v_k$ are independent, zero-mean, Gaussian noise processes of covariance matrices $Q_k$ and $R_k$, respectively.

Forward filter
Initialization: For k = 0, set
$$\hat{x}_0 = E[x_0],$$
$$P_0 = E[(x_0 - E[x_0])(x_0 - E[x_0])^T].$$
Computation: For k = 1, 2, ..., compute
$$\hat{x}_k^{f-} = F_{k,k-1}\hat{x}_{k-1}^f,$$
$$P_k^{f-} = F_{k,k-1} P_{k-1}^f F_{k,k-1}^T + Q_{k-1},$$
$$G_k^f = P_k^{f-} H_k^T [H_k P_k^{f-} H_k^T + R_k]^{-1},$$
$$\hat{x}_k^f = \hat{x}_k^{f-} + G_k^f (y_k - H_k \hat{x}_k^{f-}).$$

Recursive smoother
Initialization: For k = N, set
$$P_N = P_N^f,$$
$$\hat{x}_N = \hat{x}_N^f.$$
Computation: For k = N-1, N-2, ..., compute
$$A_k = P_k^f F_{k+1,k}^T [P_{k+1}^{f-}]^{-1},$$
$$P_k = P_k^f - A_k(P_{k+1}^{f-} - P_{k+1})A_k^T,$$
$$\hat{x}_k = \hat{x}_k^f + A_k(\hat{x}_{k+1} - \hat{x}_{k+1}^{f-}).$$

extension is feasible by virtue of the fact that the Kalman filter is described in terms of difference equations in the case of discrete-time systems.

To set the stage for a development of the extended Kalman filter, consider a nonlinear dynamical system described by the state-space model

$$x_{k+1} = f(k, x_k) + w_k, \qquad (1.52)$$
$$y_k = h(k, x_k) + v_k, \qquad (1.53)$$


where, as before, $w_k$ and $v_k$ are independent, zero-mean, white Gaussian noise processes with covariance matrices $Q_k$ and $R_k$, respectively. Here, however, the functional $f(k, x_k)$ denotes a nonlinear transition matrix function that is possibly time-variant. Likewise, the functional $h(k, x_k)$ denotes a nonlinear measurement matrix that may be time-variant, too.

The basic idea of the extended Kalman filter is to linearize the state-space model of Eqs. (1.52) and (1.53) at each time instant around the most recent state estimate, which is taken to be either $\hat{x}_k$ or $\hat{x}_k^-$, depending on which particular functional is being considered. Once a linear model is obtained, the standard Kalman filter equations are applied.

More explicitly, the approximation proceeds in two stages.

Stage 1. The following two matrices are constructed:

$$F_{k+1,k} = \left.\frac{\partial f(k, x)}{\partial x}\right|_{x = \hat{x}_k}, \qquad (1.54)$$
$$H_k = \left.\frac{\partial h(k, x)}{\partial x}\right|_{x = \hat{x}_k^-}. \qquad (1.55)$$

That is, the ijth entry of $F_{k+1,k}$ is equal to the partial derivative of the ith component of $f(k, x)$ with respect to the jth component of $x$. Likewise, the ijth entry of $H_k$ is equal to the partial derivative of the ith component of $h(k, x)$ with respect to the jth component of $x$. In the former case, the derivatives are evaluated at $\hat{x}_k$, while in the latter case, the derivatives are evaluated at $\hat{x}_k^-$.
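As a concrete check on Stage 1, the Jacobian of Eq. (1.54) can be approximated by finite differences and compared against the analytic derivative. The two-state dynamics $f$ below are a hypothetical example of ours, not a system from the text:

```python
import numpy as np

# Hypothetical two-state nonlinear transition f(k, x); this
# pendulum-like example is ours, chosen only to illustrate Eq. (1.54).
def f(x, dt=0.01):
    return np.array([x[0] + dt * x[1],
                     x[1] - dt * np.sin(x[0])])

def jacobian_fd(fun, x, eps=1e-6):
    """Finite-difference estimate of F_{k+1,k} = df/dx evaluated at x;
    entry (i, j) approximates d f_i / d x_j, as in Eq. (1.54)."""
    n = x.size
    J = np.zeros((n, n))
    fx = fun(x)
    for j in range(n):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (fun(xp) - fx) / eps
    return J

x_hat = np.array([0.3, -0.1])        # stand-in for the estimate at time k
F_fd = jacobian_fd(f, x_hat)
# Analytic Jacobian of the same f, for comparison.
F_exact = np.array([[1.0, 0.01],
                    [-0.01 * np.cos(0.3), 1.0]])
```

In an EKF implementation the analytic Jacobians are preferred when available; the finite-difference form is mainly useful as a consistency check on hand-derived expressions.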
The entries of the matrices $F_{k+1,k}$ and $H_k$ are all known (i.e., computable), by having $\hat{x}_k$ and $\hat{x}_k^-$ available at time $k$.

Stage 2. Once the matrices $F_{k+1,k}$ and $H_k$ are evaluated, they are then employed in a first-order Taylor approximation of the nonlinear functions $f(k, x_k)$ and $h(k, x_k)$ around $\hat{x}_k$ and $\hat{x}_k^-$, respectively. Specifically, $f(k, x_k)$ and $h(k, x_k)$ are approximated as follows:

$$f(k, x_k) \approx f(k, \hat{x}_k) + F_{k+1,k}(x_k - \hat{x}_k), \qquad (1.56)$$
$$h(k, x_k) \approx h(k, \hat{x}_k^-) + H_k(x_k - \hat{x}_k^-). \qquad (1.57)$$

With the above approximate expressions at hand, we may now proceed to approximate the nonlinear state equations (1.52) and (1.53) as shown by, respectively,

$$x_{k+1} \approx F_{k+1,k} x_k + w_k + d_k,$$
$$\bar{y}_k \approx H_k x_k + v_k,$$


where we have introduced two new quantities:

$$\bar{y}_k = y_k - \{h(k, \hat{x}_k^-) - H_k \hat{x}_k^-\}, \qquad (1.58)$$
$$d_k = f(k, \hat{x}_k) - F_{k+1,k}\hat{x}_k. \qquad (1.59)$$

The entries in the term $\bar{y}_k$ are all known at time $k$, and, therefore, $\bar{y}_k$ can be regarded as an observation vector at time $k$. Likewise, the entries in the term $d_k$ are all known at time $k$.

Table 1.3 Extended Kalman filter

State-space model
$$x_{k+1} = f(k, x_k) + w_k,$$
$$y_k = h(k, x_k) + v_k,$$
where $w_k$ and $v_k$ are independent, zero-mean, Gaussian noise processes of covariance matrices $Q_k$ and $R_k$, respectively.

Definitions
$$F_{k+1,k} = \left.\frac{\partial f(k, x)}{\partial x}\right|_{x = \hat{x}_k},$$
$$H_k = \left.\frac{\partial h(k, x)}{\partial x}\right|_{x = \hat{x}_k^-}.$$

Initialization: For k = 0, set
$$\hat{x}_0 = E[x_0],$$
$$P_0 = E[(x_0 - E[x_0])(x_0 - E[x_0])^T].$$

Computation: For k = 1, 2, ..., compute:
State estimate propagation
$$\hat{x}_k^- = f(k, \hat{x}_{k-1});$$
Error covariance propagation
$$P_k^- = F_{k,k-1} P_{k-1} F_{k,k-1}^T + Q_{k-1};$$
Kalman gain matrix
$$G_k = P_k^- H_k^T [H_k P_k^- H_k^T + R_k]^{-1};$$
State estimate update
$$\hat{x}_k = \hat{x}_k^- + G_k (y_k - h(k, \hat{x}_k^-));$$
Error covariance update
$$P_k = (I - G_k H_k) P_k^-.$$
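The recursion of Table 1.3 can be collected into a single update function. The sketch below assumes the user supplies $f$, $h$, and routines for the Jacobians of Eqs. (1.54)–(1.55); the scalar demonstration system and all numerical values are ours:

```python
import numpy as np

def ekf_step(x_hat, P, y, f, h, F_jac, H_jac, Q, R):
    """One cycle of the EKF recursion of Table 1.3: propagate the previous
    a posteriori estimate and covariance, then apply the measurement update."""
    x_pred = f(x_hat)                         # state estimate propagation
    F = F_jac(x_hat)
    P_pred = F @ P @ F.T + Q                  # error covariance propagation
    H = H_jac(x_pred)
    G = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # Kalman gain
    x_new = x_pred + G @ (y - h(x_pred))      # state estimate update
    P_new = (np.eye(P.shape[0]) - G @ H) @ P_pred           # covariance update
    return x_new, P_new

# Demonstration on a hypothetical scalar system x_{k+1} = 0.9 x_k with a
# direct observation of the state; the numbers are illustrative only.
x_new, P_new = ekf_step(np.array([0.0]), np.array([[1.0]]), np.array([1.0]),
                        f=lambda x: 0.9 * x,
                        h=lambda x: x,
                        F_jac=lambda x: np.array([[0.9]]),
                        H_jac=lambda x: np.array([[1.0]]),
                        Q=np.array([[0.1]]), R=np.array([[1.0]]))
```

For this linear demonstration the EKF reduces to the ordinary Kalman filter, which makes the intermediate quantities easy to verify by hand.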


Given the linearized state-space model of Eqs. (1.58) and (1.59), we may then proceed and apply the Kalman filter theory of Section 1.3 to derive the extended Kalman filter. Table 1.3 summarizes the recursions involved in computing the extended Kalman filter.

1.7 SUMMARY

The basic Kalman filter is a linear, discrete-time, finite-dimensional system, which is endowed with a recursive structure that makes a digital computer well suited for its implementation. A key property of the Kalman filter is that it is the minimum mean-square (variance) estimator of the state of a linear dynamical system.

The Kalman filter, summarized in Table 1.1, applies to a linear dynamical system, the state-space model of which consists of two equations:

- The process equation, which defines the evolution of the state with time.
- The measurement equation, which defines the observable in terms of the state.

The model is stochastic owing to the additive presence of process noise and measurement noise, which are assumed to be Gaussian with zero mean and known covariance matrices.

The Rauch–Tung–Striebel smoother, summarized in Table 1.2, builds on the Kalman filter to solve the optimum smoothing problem in an efficient manner. This smoother consists of two components: a forward filter based on the basic Kalman filter, and a combined backward filter and smoother.

Applications of Kalman filter theory may be extended to nonlinear dynamical systems, as summarized in Table 1.3.
The derivation of the extended Kalman filter hinges on linearization of the nonlinear state-space model, on the assumption that deviation from linearity is of first order.

REFERENCES

[1] R.E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Ser. D, Journal of Basic Engineering, 82, 34–45 (1960).


[2] F.L. Lewis, Optimal Estimation with an Introduction to Stochastic Control Theory. New York: Wiley, 1986.
[3] M.S. Grewal and A.P. Andrews, Kalman Filtering: Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[4] H.L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. New York: Wiley, 1968.
[5] R.S. Bucy and P.D. Joseph, Filtering for Stochastic Processes, with Applications to Guidance. New York: Wiley, 1968.
[6] P.G. Kaminski, A.E. Bryson, Jr., and S.F. Schmidt, "Discrete square root filtering: a survey of current techniques," IEEE Transactions on Automatic Control, 16, 727–736 (1971).
[7] S. Haykin, Adaptive Filter Theory, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.
[8] H.E. Rauch, "Solutions to the linear smoothing problem," IEEE Transactions on Automatic Control, 11, 371–372 (1963).
[9] H.E. Rauch, F. Tung, and C.T. Striebel, "Maximum likelihood estimates of linear dynamic systems," AIAA Journal, 3, 1445–1450 (1965).
[10] A.H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1970.
[11] P.S. Maybeck, Stochastic Models, Estimation, and Control, Vol. 1. New York: Academic Press, 1979.
[12] P.S. Maybeck, Stochastic Models, Estimation, and Control, Vol. 2. New York: Academic Press, 1982.


2 PARAMETER-BASED KALMAN FILTER TRAINING: THEORY AND IMPLEMENTATION

Gintaras V. Puskorius and Lee A. Feldkamp
Ford Research Laboratory, Ford Motor Company, Dearborn, Michigan, U.S.A.
(gpuskori@ford.com, lfeldkam@ford.com)

Kalman Filtering and Neural Networks, Edited by Simon Haykin. ISBN 0-471-36998-5 © 2001 John Wiley & Sons, Inc.

2.1 INTRODUCTION

Although the rediscovery in the mid-1980s of the backpropagation algorithm by Rumelhart, Hinton, and Williams [1] has long been viewed as a landmark event in the history of neural network computing and has led to a sustained resurgence of activity, the relative ineffectiveness of this simple gradient method has motivated many researchers to develop enhanced training procedures. In fact, the neural network literature has been inundated with papers proposing alternative training


methods that are claimed to exhibit superior capabilities in terms of training speed, mapping accuracy, generalization, and overall performance relative to standard backpropagation and related methods.

Amongst the most promising and enduring of enhanced training methods are those whose weight update procedures are based upon second-order derivative information (whereas standard backpropagation exclusively utilizes first-derivative information). A variety of second-order methods began to be developed and appeared in the published neural network literature shortly after the seminal article on backpropagation was published. The vast majority of these methods can be characterized as batch update methods, where a single weight update is based on a matrix of second derivatives that is approximated on the basis of many training patterns. Popular second-order methods have included weight updates based on quasi-Newton, Levenberg–Marquardt, and conjugate gradient techniques. Although these methods have shown promise, they are often plagued by convergence to poor local optima, which can be partially attributed to the lack of a stochastic component in the weight update procedures. Note that, unlike these second-order methods, weight updates using standard backpropagation can be performed either in batch or instance-by-instance mode.

The extended Kalman filter (EKF) forms the basis of a second-order neural network training method that is a practical and effective alternative to the batch-oriented, second-order methods mentioned above.
The essence of the recursive EKF procedure is that, during training, in addition to evolving the weights of a network architecture in a sequential (as opposed to batch) fashion, an approximate error covariance matrix that encodes second-order information about the training problem is also maintained and evolved. The global EKF (GEKF) training algorithm was introduced by Singhal and Wu [2] in the late 1980s, and has served as the basis for the development and enhancement of a family of computationally effective neural network training methods that has enabled the application of feedforward and recurrent neural networks to problems in control, signal processing, and pattern recognition.

In their work, Singhal and Wu developed a second-order, sequential training algorithm for static multilayered perceptron networks that was shown to be substantially more effective (by orders of magnitude in the number of training epochs) than standard backpropagation for a series of pattern classification problems. However, the computational complexity of GEKF scales as the square of the number of weights, due to the development and use of second-order information that correlates every pair of network weights, and was thus found to be impractical for all but


the simplest network architectures, given the state of standard computing hardware in the early 1990s.

In response to the then-intractable computational complexity of GEKF, we developed a family of training procedures, which we named the decoupled EKF (DEKF) algorithm [3]. Whereas the GEKF procedure develops and maintains correlations between each pair of network weights, the DEKF family provides an approximation to GEKF by developing and maintaining second-order information only between weights that belong to mutually exclusive groups. We have concentrated on what appear to be some relatively natural groupings; for example, the node-decoupled (NDEKF) procedure models only the interactions between weights that provide inputs to the same node. In one limit of a separate group for each network weight, we obtain the fully decoupled EKF procedure, which tends to be only slightly more effective than standard backpropagation. In the other extreme of a single group for all weights, DEKF reduces exactly to the GEKF procedure of Singhal and Wu.

In our work, we have successfully applied NDEKF to a wide range of network architectures and classes of training problems. We have demonstrated that NDEKF is extremely effective at training feedforward as well as recurrent network architectures, for problems ranging from pattern classification to the on-line training of neural network controllers for engine idle speed control [4, 5].
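The storage saving behind decoupling can be illustrated directly: GEKF keeps a full M×M covariance, whereas DEKF retains only within-group blocks. The five-weight grouping below is a made-up example of ours, not one of the network groupings discussed in the text:

```python
import numpy as np

# Hypothetical grouping of M = 5 weights into two mutually exclusive groups,
# e.g. the weights feeding two different nodes under node-decoupling (NDEKF).
groups = [np.array([0, 1, 2]), np.array([3, 4])]

# GEKF maintains a full M x M approximate error covariance, correlating
# every pair of weights; the values here are arbitrary but symmetric.
P_full = np.arange(25, dtype=float).reshape(5, 5)
P_full = (P_full + P_full.T) / 2 + 5.0 * np.eye(5)

# DEKF retains only the within-group blocks of that matrix.
P_blocks = [P_full[np.ix_(g, g)] for g in groups]

# Storage drops from M^2 entries to the sum of squared group sizes.
full_entries = P_full.size                     # 5^2 = 25
block_entries = sum(B.size for B in P_blocks)  # 3^2 + 2^2 = 13
```

The two extremes mentioned in the text correspond to five singleton groups (fully decoupled, 5 stored entries) and one group of five (GEKF, all 25 entries).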
We have demonstrated the effective use of dynamic derivatives computed by both forward methods, for example those based on real-time recurrent learning (RTRL) [6, 7], as well as by truncated backpropagation through time (BPTT(h)) [8], with the parameter-based DEKF methods, and have extended this family of methods to optimize cost functions other than the sum of squared errors [9], which we describe below in Sections 2.7.2 and 2.7.3.

Of the various extensions and enhancements of EKF training that we have developed, perhaps the most enabling is one that allows for EKF procedures to perform a single update of a network's weights on the basis of more than a single training instance [10–12]. As mentioned above, EKF algorithms are intrinsically sequential procedures, where, at any given time during training, a network's weight values are updated on the basis of one and only one training instance. When EKF methods or any other sequential procedures are used to train networks with distributed representations, as in the case of multilayered perceptrons and time-lagged recurrent neural networks, there is a tendency for the training procedure to concentrate on the most recently observed training patterns, to the detriment of training patterns that had been observed and processed a long time in the past. This situation, which has been called the recency


phenomenon, is particularly troublesome for training of recurrent neural networks and/or neural network controllers, where the temporal order of presentation of data during training must be respected. It is likely that sequential training procedures will perform greedily for these systems, for example by merely changing a network's output bias during training to accommodate a new region of operation. On the other hand, the off-line training of static networks can circumvent difficulties associated with the recency effect by employing a scrambling of the sequence of data presentation during training.

The recency phenomenon can be at least partially mitigated in these circumstances by providing a mechanism that allows for multiple training instances, preferably from different operating regions, to be simultaneously considered for each weight vector update. Multistream EKF training is an extension of EKF training methods that allows for multiple training instances to be batched, while remaining consistent with the Kalman methods.

We begin with a brief discussion of the types of feedforward and recurrent network architectures that we are going to consider for training by EKF methods. We then discuss the global EKF training method, followed by recommendations for setting of parameters for EKF methods, including the relationship of the choice of learning rate to the initialization of the error covariance matrix. We then provide treatments of the decoupled extended Kalman filter (DEKF) method as well as the multistream procedure that can be applied with any level of decoupling.
We discuss at length a variety of issues related to computer implementation, including derivative calculations, computationally efficient formulations, methods for avoiding matrix inversions, and square-root filtering for computational stability. This is followed by a number of special topics, including training with constrained weights and alternative cost functions. We then provide an overview of applications of EKF methods to a series of problems in control, diagnosis, and modeling of automotive powertrain systems. We conclude the chapter with a discussion of the virtues and limitations of EKF training methods, and provide a series of guidelines for implementation and use.

2.2 NETWORK ARCHITECTURES

We consider in this chapter two types of network architecture: the well-known feedforward layered network and its dynamic extension, the recurrent multilayered perceptron (RMLP). A block-diagram representation of these types of networks is given in Figure 2.1. Figure 2.2 shows an example network, denoted as a 3-3-3-2 network, with three inputs, two hidden layers of three nodes each, and an output layer of two nodes. Figure 2.3 shows a similar network, but modified to include interlayer, time-delayed recurrent connections. We denote this as a 3-3R-3R-2R RMLP, where the letter "R" denotes a recurrent layer. In this case, both hidden layers as well as the output layer are recurrent. The essential difference between the two types of networks is the recurrent network's ability to encode temporal information.

Figure 2.1 Block-diagram representation of two hidden-layer networks. (a) depicts a feedforward layered neural network that provides a static mapping between the input vector $u_k$ and the output vector $y_k$. (b) depicts a recurrent multilayered perceptron (RMLP) with two hidden layers. In this case, we assume that there are time-delayed recurrent connections between the outputs and inputs of all nodes within a layer. The signals $v_k^i$ denote the node activations for the ith layer. Both of these block representations assume that bias connections are included in the feedforward connections.

Figure 2.2 A schematic diagram of a 3-3-3-2 feedforward network architecture corresponding to the block diagram of Figure 2.1a.

Once trained, the feedforward


network merely carries out a static mapping from input signals $u_k$ to outputs $y_k$, such that the output is independent of the history in which input signals are presented. On the other hand, a trained RMLP provides a dynamic mapping, such that the output $y_k$ is not only a function of the current input pattern $u_k$, but also implicitly a function of the entire history of inputs through the time-delayed recurrent node activations, given by the vectors $v_{k-1}^i$, where $i$ indexes the layer number.

Figure 2.3 A schematic diagram of a 3-3R-3R-2R recurrent network architecture corresponding to the block diagram of Figure 2.1b. Note the presence of time-delay operators and recurrent connections between the nodes of a layer.

2.3 THE EKF PROCEDURE

We begin with the equations that serve as the basis for the derivation of the EKF family of neural network training algorithms. A neural network's behavior can be described by the following nonlinear discrete-time system:

$$w_{k+1} = w_k + \omega_k, \qquad (2.1)$$
$$y_k = h_k(w_k, u_k, v_{k-1}) + \nu_k. \qquad (2.2)$$

The first of these, known as the process equation, merely specifies that the state of the ideal neural network is characterized as a stationary process corrupted by process noise $\omega_k$, where the state of the system is given by the network's weight parameter values $w_k$. The second equation, known as the observation or measurement equation, represents the network's desired


response vector $y_k$ as a nonlinear function of the input vector $u_k$, the weight parameter vector $w_k$, and, for recurrent networks, the recurrent node activations $v_k$; this equation is augmented by random measurement noise $\nu_k$. The measurement noise $\nu_k$ is typically characterized as zero-mean, white noise with covariance given by $E[\nu_k \nu_l^T] = \delta_{k,l} R_k$. Similarly, the process noise $\omega_k$ is also characterized as zero-mean, white noise with covariance given by $E[\omega_k \omega_l^T] = \delta_{k,l} Q_k$.

2.3.1 Global EKF Training

The training problem using Kalman filter theory can now be described as finding the minimum mean-squared error estimate of the state $w$ using all observed data so far. We assume a network architecture with $M$ weights and $N_o$ output nodes and cost function components. The EKF solution to the training problem is given by the following recursion (see Chapter 1):

$$A_k = [R_k + H_k^T P_k H_k]^{-1}, \qquad (2.3)$$
$$K_k = P_k H_k A_k, \qquad (2.4)$$
$$\hat{w}_{k+1} = \hat{w}_k + K_k \xi_k, \qquad (2.5)$$
$$P_{k+1} = P_k - K_k H_k^T P_k + Q_k. \qquad (2.6)$$

The vector $\hat{w}_k$ represents the estimate of the state (i.e., weights) of the system at update step $k$. This estimate is a function of the Kalman gain matrix $K_k$ and the error vector $\xi_k = y_k - \hat{y}_k$, where $y_k$ is the target vector and $\hat{y}_k$ is the network's output vector for the kth presentation of a training pattern. The Kalman gain matrix is a function of the approximate error covariance matrix $P_k$, a matrix $H_k$ of derivatives of the network's outputs with respect to all trainable weight parameters, and a global scaling matrix $A_k$. The matrix $H_k$ may be computed via static backpropagation or backpropagation through time for feedforward and recurrent networks, respectively (described below in Section 2.6.1).
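A minimal sketch of one GEKF update per Eqs. (2.3)–(2.6) follows. The dimensions and the large diagonal initialization of $P$ are illustrative choices of ours; parameter-setting recommendations are given in Section 2.3.3:

```python
import numpy as np

def gekf_update(w, P, H, xi, R, Q):
    """One GEKF weight update, Eqs. (2.3)-(2.6).
    w:  (M,)     weight estimate
    P:  (M, M)   approximate error covariance
    H:  (M, N_o) derivatives of network outputs w.r.t. weights
    xi: (N_o,)   error vector y_k - yhat_k
    R, Q: measurement- and process-noise covariance matrices."""
    A = np.linalg.inv(R + H.T @ P @ H)       # global scaling matrix, Eq. (2.3)
    K = P @ H @ A                            # Kalman gain matrix,   Eq. (2.4)
    return w + K @ xi, P - K @ H.T @ P + Q   # Eqs. (2.5) and (2.6)

# Illustrative update: M = 3 weights, a single output component (N_o = 1),
# weights initialized to zero and P to a large diagonal.
w1, P1 = gekf_update(np.zeros(3), 100.0 * np.eye(3),
                     H=np.array([[1.0], [0.0], [0.0]]),
                     xi=np.array([1.0]),
                     R=np.array([[1.0]]), Q=0.01 * np.eye(3))
```

Because only the first output derivative is nonzero in this toy example, the update moves only the first weight, and only the corresponding covariance entry shrinks.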
The scaling matrix $A_k$ is a function of the measurement noise covariance matrix $R_k$, as well as of the matrices $H_k$ and $P_k$. Finally, the approximate error covariance matrix $P_k$ evolves recursively with the weight vector estimate; this matrix encodes second-derivative information about the training problem, and is augmented by the covariance matrix of the process noise $Q_k$. This algorithm attempts to find weight values that minimize the sum of squared error $\sum_k \xi_k^T \xi_k$. Note that the algorithm requires that the measurement and


process noise covariance matrices, $R_k$ and $Q_k$, be specified for all training instances. Similarly, the approximate error covariance matrix $P_k$ must be initialized at the beginning of training. We consider these issues below in Section 2.3.3.

GEKF training is carried out in a sequential fashion as shown in the signal flow diagram of Figure 2.4. One step of training involves the following steps:

1. An input training pattern $u_k$ is propagated through the network to produce an output vector $\hat{y}_k$. Note that the forward propagation is a function of the recurrent node activations $v_{k-1}$ from the previous time step for RMLPs. The error vector $\xi_k$ is computed in this step as well.
2. The derivative matrix $H_k$ is obtained by backpropagation. In this case, there is a separate backpropagation for each component of the output vector $\hat{y}_k$, and the backpropagation phase will involve a time history of recurrent node activations for RMLPs.
3. The Kalman gain matrix is computed as a function of the derivative matrix $H_k$, the approximate error covariance matrix $P_k$, and the measurement noise covariance matrix $R_k$. Note that this step includes the computation of the global scaling matrix $A_k$.
4. The network weight vector is updated using the Kalman gain matrix $K_k$, the error vector $\xi_k$, and the current values of the weight vector $\hat{w}_k$.

Figure 2.4 Signal flow diagram for EKF neural network training. The first two steps, comprising the forward- and backpropagation operations, will depend on whether or not the network being trained has recurrent connections. On the other hand, the EKF calculations encoded by steps (3)–(5) are independent of network type.


5. The approximate error covariance matrix is updated using the Kalman gain matrix $K_k$, the derivative matrix $H_k$, and the current values of the approximate error covariance matrix $P_k$. Although not shown, this step also includes augmentation of the error covariance matrix by the covariance matrix of the process noise $Q_k$.

2.3.2 Learning Rate and Scaled Cost Function

We noted above that $R_k$ is the covariance matrix of the measurement noise and that this matrix must be specified for each training pattern. Generally speaking, training problems that are characterized by noisy measurement data usually require that the elements of $R_k$ be scaled larger than for those problems with relatively noise-free training data. In [5, 7, 12], we interpret this measurement error covariance matrix to represent an inverse learning rate: $R_k = \eta_k^{-1} S_k^{-1}$, where the training cost function at time step $k$ is now given by $\epsilon_k = \frac{1}{2}\xi_k^T S_k \xi_k$, and $S_k$ allows the various network output components to be scaled nonuniformly. Thus, the global scaling matrix $A_k$ of equation (2.3) can be written as

$$A_k = \left[ \frac{1}{\eta_k} S_k^{-1} + H_k^T P_k H_k \right]^{-1}. \tag{2.7}$$

The use of the weighting matrix $S_k$ in Eq. (2.7) poses numerical difficulties when the matrix is singular.¹ We reformulate the GEKF algorithm to eliminate this difficulty by distributing the square root of the weighting matrix into both the derivative matrices as $H_k^* = H_k S_k^{1/2}$ and the error vector as $\xi_k^* = S_k^{1/2}\xi_k$. The matrices $H_k^*$ thus contain the scaled derivatives of network outputs with respect to the weights of the network. The rescaled extended Kalman recursion is then given by

$$A_k^* = \left[ \frac{1}{\eta_k} I + (H_k^*)^T P_k H_k^* \right]^{-1}, \tag{2.8}$$
$$K_k^* = P_k H_k^* A_k^*, \tag{2.9}$$
$$\hat{w}_{k+1} = \hat{w}_k + K_k^* \xi_k^*, \tag{2.10}$$
$$P_{k+1} = P_k - K_k^* (H_k^*)^T P_k + Q_k. \tag{2.11}$$

Note that this rescaling does not change the evolution of either the weight vector or the approximate error covariance matrix, and eliminates the need

¹ This may occur when we utilize penalty functions to impose explicit constraints on network outputs. For example, when a constraint is not violated, we set the corresponding diagonal element of $S_k$ to zero, thereby rendering the matrix singular.


to compute the inverse of the weighting matrix $S_k$ for each training pattern. For the sake of clarity in the remainder of this chapter, we shall assume a uniform scaling of output signals, $S_k = I$, which implies $R_k = \eta_k^{-1} I$, and drop the asterisk notation.

2.3.3 Parameter Settings

EKF training algorithms require the setting of a number of parameters. In practice, we have employed the following rough guidelines. First, we typically assume that the input–output data have been scaled and transformed to reasonable ranges (e.g., zero mean, unit variance for all continuous input and output variables). We also assume that weight values are initialized to small random values drawn from a zero-mean uniform or normal distribution. The approximate error covariance matrix is initialized to reflect the fact that no a priori knowledge was used to initialize the weights; this is accomplished by setting $P_0 = \epsilon^{-1} I$, where $\epsilon$ is a small number (of the order of 0.001–0.01). As noted above, we assume uniform scaling of outputs: $S_k = I$. Then, training data that are characterized by noisy measurements usually require small values for the learning rate $\eta_k$ to achieve good training performance; we typically bound the learning rate to values between 0.001 and 1. Finally, the covariance matrix $Q_k$ of the process noise is represented by a scaled identity matrix $q_k I$, with the scale factor $q_k$ ranging from as small as zero (to represent no process noise) to values of the order of 0.1. This factor is generally annealed from a large value to a limiting value of the order of $10^{-6}$. This annealing process helps to accelerate convergence and, by keeping a nonzero value for the process noise term, helps to avoid divergence of the error covariance update in Eqs. (2.6) and (2.11).

We show here that the setting of the learning rate, the process noise covariance matrix, and the initialization of the approximate error covariance matrix are interdependent, and that an arbitrary scaling can be applied to $R_k$, $P_k$, and $Q_k$ without altering the evolution of the weight vector $\hat{w}$ in Eqs. (2.5) and (2.10). First consider the Kalman gain of Eqs. (2.4) and (2.9). An arbitrary positive scaling factor $\mu$ can be applied to $R_k$ and $P_k$ without altering the contents of $K_k$:

$$K_k = P_k H_k [R_k + H_k^T P_k H_k]^{-1} = \mu P_k H_k [\mu R_k + H_k^T \mu P_k H_k]^{-1} = P_k^\dagger H_k [R_k^\dagger + H_k^T P_k^\dagger H_k]^{-1} = P_k^\dagger H_k A_k^\dagger,$$


where we have defined $R_k^\dagger = \mu R_k$, $P_k^\dagger = \mu P_k$, and $A_k^\dagger = \mu^{-1} A_k$. Similarly, the approximate error covariance update becomes

$$P_{k+1}^\dagger = \mu P_{k+1} = \mu P_k - K_k H_k^T \mu P_k + \mu Q_k = P_k^\dagger - K_k H_k^T P_k^\dagger + Q_k^\dagger.$$

This implies that a training trial characterized by the parameter settings $R_k = \eta^{-1} I$, $P_0 = \epsilon^{-1} I$, and $Q_k = qI$ would behave identically to a training trial with scaled versions of these parameter settings: $R_k = \mu\eta^{-1} I$, $P_0 = \mu\epsilon^{-1} I$, and $Q_k = \mu q I$. Thus, for any given EKF training problem, there is no one best set of parameter settings, but a continuum of related settings that must take into account the properties of the training data for good performance. This also implies that only two effective parameters need to be set. Regardless of the training problem considered, we have typically chosen the initial error covariance matrix to be $P_0 = \epsilon^{-1} I$, with $\epsilon = 0.01$ and 0.001 for sigmoidal and linear activation functions, respectively. This leaves us to specify values for $\eta_k$ and $Q_k$, which are likely to be problem-dependent.

2.4 DECOUPLED EKF (DEKF)

The computational requirements of GEKF are dominated by the need to store and update the approximate error covariance matrix $P_k$ at each time step. For a network architecture with $N_o$ outputs and $M$ weights, GEKF's computational complexity is $O(N_o M^2)$ and its storage requirements are $O(M^2)$. The parameter-based DEKF algorithm is derived from GEKF by assuming that the interactions between certain weight estimates can be ignored. This simplification introduces many zeroes into the matrix $P_k$. If the weights are decoupled so that the weight groups are mutually exclusive of one another, then $P_k$ can be arranged into block-diagonal form.
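Returning briefly to the scaling invariance established in Section 2.3.3, the argument is easy to check numerically. The following sketch (shapes, seeds, and settings are illustrative, not from the text) computes the Kalman gain of Eqs. (2.4)/(2.9) with settings $(R_k, P_k)$ and with scaled settings $(\mu R_k, \mu P_k)$ and confirms that the gain is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_o, mu = 12, 3, 50.0            # weights, outputs, arbitrary scale factor

H = rng.standard_normal((M, N_o))   # derivative matrix H_k (assumed given)
P = np.eye(M) / 0.01                # P_0 = eps^{-1} I, with eps = 0.01
R = np.eye(N_o)                     # R_k = eta^{-1} I, with eta = 1

def gain(P, R, H):
    # K_k = P_k H_k [R_k + H_k^T P_k H_k]^{-1}   (Eqs. (2.4)/(2.9))
    A = np.linalg.inv(R + H.T @ P @ H)
    return P @ H @ A

K_plain = gain(P, R, H)
K_scaled = gain(mu * P, mu * R, H)  # R_k -> mu R_k, P_k -> mu P_k

assert np.allclose(K_plain, K_scaled)   # identical Kalman gain
```

The scale factor $\mu$ cancels between $P_k H_k$ and the inverted bracket, which is exactly the cancellation carried out symbolically above.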
Let $g$ refer to the number of such weight groups. Then, for group $i$, the vector $\hat{w}_k^i$ refers to the estimated weight parameters, $H_k^i$ is the submatrix of derivatives of network outputs with respect to the $i$th group's weights, $P_k^i$ is the weight group's approximate error covariance matrix, and $K_k^i$ is its Kalman gain matrix. The concatenation of the vectors $\hat{w}_k^i$ forms the vector $\hat{w}_k$. Similarly, the global derivative matrix $H_k$ is composed via concatenation of the individual submatrices $H_k^i$. The DEKF algorithm for the $i$th weight group is given by

$$A_k = \left[ R_k + \sum_{j=1}^{g} (H_k^j)^T P_k^j H_k^j \right]^{-1}, \tag{2.12}$$
$$K_k^i = P_k^i H_k^i A_k, \tag{2.13}$$
$$\hat{w}_{k+1}^i = \hat{w}_k^i + K_k^i \xi_k, \tag{2.14}$$
$$P_{k+1}^i = P_k^i - K_k^i (H_k^i)^T P_k^i + Q_k^i. \tag{2.15}$$

A single global scaling matrix $A_k$, computed with contributions from all of the approximate error covariance matrices and derivative matrices, is used to compute the Kalman gain matrices $K_k^i$. These gain matrices are used to update the error covariance matrices for all weight groups, and are combined with the global error vector $\xi_k$ for updating the weight vectors. In the limit of a single weight group ($g = 1$), the DEKF algorithm reduces exactly to the GEKF algorithm.

The computational complexity and storage requirements for DEKF can be significantly less than those of GEKF. For $g$ disjoint weight groups, the computational complexity of DEKF becomes $O(N_o^2 M + N_o \sum_{i=1}^{g} M_i^2)$, where $M_i$ is the number of weights in group $i$, while the storage requirements become $O(\sum_{i=1}^{g} M_i^2)$. Note that this complexity analysis does not include the computational requirements for the matrix of derivatives, which is independent of the level of decoupling.
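The DEKF recursion of Eqs. (2.12)–(2.15) can be sketched in a few lines of NumPy. The grouping, shapes, and settings below are illustrative only; with $g = 1$ the same code performs a full GEKF update:

```python
import numpy as np

def dekf_step(w, P, H, xi, R, Q):
    """One DEKF update. w, P, H, Q are lists over the g weight groups:
    w[i]: (M_i,) weights, P[i]: (M_i, M_i) covariance,
    H[i]: (M_i, N_o) derivatives; xi: (N_o,) error vector."""
    g = len(w)
    # Global scaling matrix, Eq. (2.12): contributions from every group.
    A = np.linalg.inv(R + sum(H[j].T @ P[j] @ H[j] for j in range(g)))
    for i in range(g):
        K = P[i] @ H[i] @ A                      # Eq. (2.13): Kalman gain
        w[i] = w[i] + K @ xi                     # Eq. (2.14): weight update
        P[i] = P[i] - K @ H[i].T @ P[i] + Q[i]   # Eq. (2.15): covariance update
    return w, P

rng = np.random.default_rng(1)
N_o, sizes = 2, [5, 7]                           # two illustrative weight groups
w = [rng.normal(0, 0.1, M) for M in sizes]       # small random initial weights
P = [np.eye(M) / 0.01 for M in sizes]            # P_0 = eps^{-1} I
H = [rng.standard_normal((M, N_o)) for M in sizes]
xi = rng.standard_normal(N_o)
R = np.eye(N_o)                                  # eta = 1, S_k = I
Q = [1e-4 * np.eye(M) for M in sizes]
w, P = dekf_step(w, P, H, xi, R, Q)
```

Note how the single matrix `A`, shared across groups, couples the per-group updates, while only the block covariances `P[i]` are ever stored, which is the source of the memory savings quoted above.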
It should be noted that in the case of training recurrent networks or networks as feedback controllers, the computational complexity of the derivative calculations can be significant.

We have found that decoupling of the weights of the network by node (i.e., each weight group is composed of a single node's weights) is rather natural and leads to compact and efficient computer implementations. Furthermore, this level of decoupling typically exhibits substantial computational savings relative to GEKF, often with little sacrifice in network performance after completion of training. We refer to this level of decoupling as node-decoupled EKF or NDEKF. Other forms of decoupling considered have been fully decoupled EKF, in which each individual weight constitutes a unique group (thereby resulting in an error covariance matrix that has diagonal structure), and layer-decoupled EKF, in which weights are grouped by the layer to which they belong [13]. We show an example of the effect of all four levels of decoupling on the structure of the approximate error covariance matrix in Figure 2.5. For the remainder of this chapter, we explicitly consider only two different levels of decoupling for EKF training: global and node-decoupled EKF.

Figure 2.5 Block-diagonal representation of the approximate error covariance matrix $P_k$ for the RMLP network shown in Figure 2.3 for four different levels of decoupling. This network has two recurrent layers with three nodes each, each node with seven incoming connections. The output layer is also recurrent, but its two nodes only have six connections each. Only the shaded portions of these matrices are updated and maintained for the various forms of decoupling shown. Note that we achieve a reduction by nearly a factor of 8 in computational complexity for the case of node decoupling relative to GEKF in this example.

2.5 MULTISTREAM TRAINING

Up to this point, we have considered forms of EKF training in which a single weight-vector update is performed on the basis of the presentation of a single input–output training pattern. However, there may be situations for which a coordinated weight update, on the basis of multiple training


patterns, would be advantageous. We consider in this section an abstract example of such a situation, and describe the means by which the EKF method can be naturally extended to simultaneously handle multiple training instances for a single weight update.²

Consider the standard recurrent network training problem: training on a sequence of input–output pairs. If the sequence is in some sense homogeneous, then one or more linear passes through the data may well produce good results. However, in many training problems, especially those in which external inputs are present, the data sequence is heterogeneous. For example, regions of rapid variation of inputs and outputs may be followed by regions of slow change. Alternatively, a sequence of outputs that centers about one level may be followed by one that centers about a different level. In any case, the tendency always exists in a straightforward training process for the network weights to be adapted unduly in favor of the currently presented training data. This recency effect is analogous to the difficulty that may arise in training feedforward networks if the data are repeatedly presented in the same order.

In this latter case, an effective solution is to scramble the order of presentation; another is to use a batch update algorithm. For recurrent networks, the direct analog of scrambling the presentation order is to present randomly selected subsequences, making an update only for the last input–output pair of the subsequence (when the network would be expected to be independent of its initialization at the beginning of the sequence). A full batch update would involve running the network through the entire data set, computing the required derivatives that correspond to each input–output pair, and making an update based on the entire set of errors.

The multistream procedure largely circumvents the recency effect by combining features of both scrambling and batch updates. Like full batch methods, multistream training [10–12] is based on the principle that each weight update should attempt to satisfy simultaneously the demands from multiple input–output pairs. However, it retains the useful stochastic aspects of sequential updating, and requires much less computation time between updates. We now describe the mechanics of multistream training.

² In the case of purely linear systems, there is no advantage in batching up a collection of training instances for a single weight update via Kalman filter methods, since all weight updates are completely consistent with previously observed data. On the other hand, derivative calculations and the extended Kalman recursion for nonlinear networks utilize first-order approximations, so that weight updates are no longer guaranteed to be consistent with all previously processed data.


In a typical training problem, we deal with one or more files, each of which contains a sequence of data. Breaking the overall data into multiple files is typical in practical problems, where the data may be acquired in different sessions, for distinct modes of system operation, or under different operating conditions.

In each cycle of training, we choose a specified number $N_s$ of randomly selected starting points in a chosen set of files. Each such starting point is the beginning of a stream. In the multistream procedure we progress sequentially through each stream, carrying out weight updates according to the set of current points. Copies of recurrent node outputs must be maintained separately for each stream. Derivatives are also computed separately for each stream, generally by truncated backpropagation through time (BPTT(h)) as discussed in Section 2.6.1 below. Because we generally have no prior information with which to initialize the recurrent network, we typically set all state nodes to values of zero at the start of each stream. Accordingly, the network is executed but updates are suspended for a specified number $N_p$ of time steps, called the priming length, at the beginning of each stream. Updates are performed until a specified number $N_t$ of time steps, called the trajectory length, have been processed. Hence, $N_t - N_p$ updates are performed in each training cycle.

If we take $N_s = 1$ and $N_t - N_p = 1$, we recover the order-scrambling procedure described above; $N_t$ may be identified with the subsequence length.
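The stream bookkeeping just described can be sketched as a loop skeleton. Everything here is a hypothetical illustration (the sequence length, state size, and elided network calls are not from the text): $N_s$ random starting points are drawn, each stream's recurrent state is zeroed, updates are suspended for the $N_p$ priming steps, and $N_t - N_p$ coordinated updates follow:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N_s, N_p, N_t = 1000, 4, 5, 25            # sequence length; stream settings

# N_s randomly selected starting points, each beginning a stream.
starts = rng.integers(0, T - N_t, size=N_s)
states = [np.zeros(3) for _ in range(N_s)]   # per-stream recurrent state, zeroed

updates = 0
for step in range(N_t):
    for s in range(N_s):
        t = starts[s] + step
        # ... run the network forward on pattern t, updating states[s] ...
        if step >= N_p:
            # ... accumulate derivatives for this stream, e.g. by BPTT(h) ...
            pass
    if step >= N_p:
        updates += 1    # one coordinated weight update across all N_s streams

assert updates == N_t - N_p   # N_t - N_p updates per training cycle
```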
On the other hand, we recover the batch procedure if we take $N_s$ equal to the number of time steps for which updates are to be performed, assemble streams systematically to end at the chosen $N_s$ steps, and again take $N_t - N_p = 1$.

Generally speaking, apart from the computational overhead involved, we find that performance tends to improve as the number of streams is increased. Various strategies are possible for file selection. If the number of files is small, it is convenient to choose $N_s$ equal to a multiple of the number of files and to select each file the same number of times. If the number of files is too large to make this practical, then we tend to select files randomly. In this case, each set of $N_t - N_p$ updates is based on only a subset of the files, so it seems reasonable not to make the trajectory length $N_t$ too large.

An important consideration is how to carry out the EKF update procedure. If gradient updates were being used, we would simply average the updates that would have been performed had the streams been treated separately. In the case of EKF training, however, averaging separate updates is incorrect. Instead, we treat this problem as that of training a single, shared-weight network with $N_o N_s$ outputs. From the standpoint of


the EKF method, we are simply training a multiple-output network in which the number of original outputs is multiplied by the number of streams. The nature of the Kalman recursion, because of the global scaling matrix $A_k$, is then to produce weight updates that are not a simple average of the weight updates that would be computed separately for each output, as is the case for a simple gradient descent weight update. Note that we are still minimizing the same sum of squared error cost function.

In single-stream EKF training, we place derivatives of network outputs with respect to network weights in the matrix $H_k$, constructed from $N_o$ column vectors, each of dimension equal to the number of trainable weights, $N_w$. In multistream training, the number of columns is correspondingly increased to $N_o N_s$. Similarly, the vector of errors $\xi_k$ has $N_o N_s$ elements. Apart from these augmentations of $H_k$ and $\xi_k$, the form of the Kalman recursion is unchanged.

Given these considerations, we define the decoupled multistream EKF recursion as follows. We shall alter the temporal indexing by specifying a range of training patterns that indicates how the multistream recursion should be interpreted. We define $l = k + N_s - 1$ and allow the range $k\!:\!l$ to specify the batch of training patterns for which a single weight vector update will be performed. Then, the matrix $H_{k:l}^i$ is the concatenation of the derivative matrices for the $i$th group of weights and for training patterns that have been assigned to the range $k\!:\!l$. Similarly, the augmented error vector is denoted by $\xi_{k:l}$. We construct the derivative matrices and error vector, respectively, by

$$H_{k:l} = (H_k \; H_{k+1} \; H_{k+2} \; \cdots \; H_{l-1} \; H_l),$$
$$\xi_{k:l} = (\xi_k^T \; \xi_{k+1}^T \; \xi_{k+2}^T \; \cdots \; \xi_{l-1}^T \; \xi_l^T)^T.$$

We use a similar notation for the measurement error covariance matrix $R_{k:l}$ and the global scaling matrix $A_{k:l}$, both square matrices of dimension $N_o N_s$, and for the Kalman gain matrices $K_{k:l}^i$, with size $M_i \times N_o N_s$. The multistream DEKF recursion is then given by

$$A_{k:l} = \left[ R_{k:l} + \sum_{j=1}^{g} (H_{k:l}^j)^T P_k^j H_{k:l}^j \right]^{-1}, \tag{2.16}$$
$$K_{k:l}^i = P_k^i H_{k:l}^i A_{k:l}, \tag{2.17}$$
$$\hat{w}_{k+N_s}^i = \hat{w}_k^i + K_{k:l}^i \xi_{k:l}, \tag{2.18}$$
$$P_{k+N_s}^i = P_k^i - K_{k:l}^i (H_{k:l}^i)^T P_k^i + Q_k^i. \tag{2.19}$$
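The augmentation of $H$ and $\xi$ in Eqs. (2.16)–(2.19) amounts to simple concatenation. The following sketch, under assumed shapes, performs one multistream update for a single weight group (i.e., multistream GEKF):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N_o, N_s = 10, 2, 3                  # weights, outputs, streams (illustrative)

# Per-stream derivative matrices (M x N_o) and error vectors (N_o,).
H_list = [rng.standard_normal((M, N_o)) for _ in range(N_s)]
xi_list = [rng.standard_normal(N_o) for _ in range(N_s)]

H_kl = np.hstack(H_list)                # H_{k:l}: M x (N_o N_s)
xi_kl = np.concatenate(xi_list)         # xi_{k:l}: (N_o N_s,)

P = np.eye(M) / 0.01                    # P_0 = eps^{-1} I
R_kl = np.eye(N_o * N_s)                # R_{k:l}, dimension N_o N_s
Q = 1e-4 * np.eye(M)
w = rng.normal(0, 0.1, M)

A_kl = np.linalg.inv(R_kl + H_kl.T @ P @ H_kl)   # Eq. (2.16)
K_kl = P @ H_kl @ A_kl                           # Eq. (2.17)
w_next = w + K_kl @ xi_kl                        # Eq. (2.18)
P_next = P - K_kl @ H_kl.T @ P + Q               # Eq. (2.19)
```

The only structural change from the single-stream recursion is the width of `H_kl`, `xi_kl`, and the $N_o N_s \times N_o N_s$ inverse in `A_kl`, consistent with the dimension counts given above.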


Note that this formulation reduces correctly to the original DEKF recursion in the limit of a single stream, and that multistream GEKF is given in the case of a single weight group. We provide a block-diagram representation of the multistream GEKF procedure in Figure 2.6. Note that the steps of training are very similar to the single-stream case, with the exception of multiple forward-propagation and backpropagation steps, and the concatenation operations for the derivative matrices and error vectors.

Let us consider the computational implications of the multistream method. The sizes of the approximate error covariance matrices $P_k^i$ and the weight vectors $w_k^i$ are independent of the chosen number of streams. On the other hand, we noted above the increase in size of the derivative matrices $H_{k:l}^i$, as well as of the Kalman gain matrices $K_{k:l}^i$. However, the computation required to obtain $H_{k:l}^i$ and to compute updates to $P_k^i$ is the same as for $N_s$ separate updates. The major additional computational burden is the inversion required to obtain the matrix $A_{k:l}$, whose dimension is $N_s$ times larger than in the single-stream case. Even this cost tends to be small compared with that associated with the $P_k^i$ matrices, as long as

Figure 2.6 Signal flow diagram for multistream EKF neural network training. The first two steps are comprised of multiple forward- and backpropagation operations, determined by the number of streams $N_s$ selected; these steps also depend on whether or not the network being trained has recurrent connections. On the other hand, once the derivative matrix $H_{k:l}$ and error vector $\xi_{k:l}$ are formed, the EKF steps encoded by steps (3)–(5) are independent of the number of streams and network type.


$N_o N_s$ is smaller than the number of network weights (GEKF) or the maximum number of weights in a group (DEKF). If the number of streams chosen is so large as to make the inversion of $A_{k:l}$ impractical, the inversion may be avoided by using one of the alternative EKF formulations described below in Section 2.6.3.

2.5.1 Some Insight into the Multistream Technique

A simple means of motivating how multiple training instances can be used simultaneously for a single weight update via the EKF procedure is to consider the training of a single linear node. In this case, the application of EKF training is equivalent to that of the recursive least-squares (RLS) algorithm. Assume that a training data set is represented by $m$ unique training patterns. The $k$th training pattern is represented by a $d$-dimensional input vector $u_k$, where we assume that all input vectors include a constant bias component of value equal to 1, and a one-dimensional output target $y_k$. The simple linear model for this system is given by

$$\hat{y}_k = u_k^T w_f, \tag{2.20}$$

where $w_f$ is the single node's $d$-dimensional weight vector. The weight vector $w_f$ can be found by applying $m$ iterations of the RLS procedure as follows:

$$a_k = [1 + u_k^T P_k u_k]^{-1}, \tag{2.21}$$
$$k_k = P_k u_k a_k, \tag{2.22}$$
$$w_{k+1} = w_k + k_k (y_k - \hat{y}_k), \tag{2.23}$$
$$P_{k+1} = P_k - k_k u_k^T P_k, \tag{2.24}$$

where the diagonal elements of $P_0$ are initialized to large positive values, and $w_0$ to a vector of small random values. Also, $w_f = w_m$ after a single presentation of all training data (i.e., after a single epoch).

We recover a batch, least-squares solution to this single-node training problem via an extreme application of the multistream concept, where we associate a unique stream with each of the $m$ training instances.
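The RLS recursion of Eqs. (2.21)–(2.24) can be transcribed directly. In the sketch below (synthetic, noise-free data; all sizes and seeds are illustrative), one epoch over the $m$ patterns drives the weight vector essentially to the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 4, 200
# Input matrix U (d x m) with a constant bias row of 1s, as assumed in the text.
U = np.vstack([rng.standard_normal((d - 1, m)), np.ones((1, m))])
w_true = rng.standard_normal(d)       # hypothetical generating weights
y = U.T @ w_true                      # noise-free linear targets

P = 1e4 * np.eye(d)                   # large positive diagonal elements for P_0
w = rng.normal(0, 0.01, d)            # small random initial weights

for k in range(m):                    # one epoch of RLS
    u = U[:, k]
    a = 1.0 / (1.0 + u @ P @ u)       # Eq. (2.21)
    kk = P @ u * a                    # Eq. (2.22)
    w = w + kk * (y[k] - u @ w)       # Eq. (2.23)
    P = P - np.outer(kk, u @ P)       # Eq. (2.24)

assert np.allclose(w, w_true, atol=1e-3)   # w_f = w_m recovers the solution
```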
In this case, we arrange the input vectors into a matrix $U$ of size $d \times m$, where each column corresponds to a unique training pattern. Similarly, we arrange the target values into a single $m$-dimensional column vector $y$,


where the elements of $y$ are ordered identically with the columns of $U$. As before, we select the initial weight vector $w_0$ to consist of randomly chosen values, and we select $P_0 = \epsilon^{-1} I$, with $\epsilon$ small. Given the choice of initial weight vector, we can compute the network output for each training pattern, and arrange all the results using the matrix notation

$$\hat{y}_0 = U^T w_0. \tag{2.25}$$

A single weight update step of the Kalman filter recursion applied to this $m$-dimensional output problem at the beginning of training can be written as

$$A_0 = [I + U^T P_0 U]^{-1}, \tag{2.26}$$
$$K_0 = P_0 U A_0, \tag{2.27}$$
$$w_1 = w_0 + K_0 (y - \hat{y}_0), \tag{2.28}$$

where we have chosen not to include the error covariance update here for reasons that will soon become clear. At the beginning of training, we recognize that $P_0$ is large, and we assume that the training data set is scaled so that $U^T P_0 U \gg I$. This allows $A_0$ to be approximated by

$$A_0 \approx \epsilon [\epsilon I + U^T U]^{-1}, \tag{2.29}$$

since $P_0$ is diagonal. Given this approximation, we can write the Kalman gain matrix as

$$K_0 = U[\epsilon I + U^T U]^{-1}. \tag{2.30}$$

We now substitute Eqs. (2.25) and (2.30) into Eq. (2.28) to derive the weight vector after one time step of this $m$-stream Kalman filter procedure:

$$w_1 = w_0 + U[\epsilon I + U^T U]^{-1}[y - U^T w_0] = w_0 - U[\epsilon I + U^T U]^{-1} U^T w_0 + U[\epsilon I + U^T U]^{-1} y. \tag{2.31}$$

If we apply the matrix equality $\lim_{\epsilon \to 0} U[\epsilon I + U^T U]^{-1} U^T = I$, we obtain the pseudoinverse solution:

$$w_f = w_1 = [U U^T]^{-1} U y, \tag{2.32}$$


where we have made use of

$$\lim_{\epsilon \to 0} U[\epsilon I + U^T U]^{-1} U^T = I, \tag{2.33}$$
$$\lim_{\epsilon \to 0} U[\epsilon I + U^T U]^{-1} U^T = [U U^T]^{-1} U U^T, \tag{2.34}$$
$$\lim_{\epsilon \to 0} U[\epsilon I + U^T U]^{-1} = [U U^T]^{-1} U. \tag{2.35}$$

Thus, one step of the multistream Kalman recursion recovers very closely the least-squares solution. If $m$ is too large to make the inversion operation practical, we could instead divide the problem into subsets and perform the procedure sequentially for each subset, arriving eventually at nearly the same result (in this case, however, the covariance update needs to be performed).

As illustrated in this one-node example, the multistream EKF update is not an average of the individual updates, but rather is coordinated through the global scaling matrix $A$. It is intuitively clear that this coordination is most valuable when the various streams place contrasting demands on the network.

2.5.2 Advantages and Extensions of Multistream Training

Discussions of the training of networks with external recurrence often distinguish between series–parallel and parallel configurations. In the former, target values are substituted for the corresponding network outputs during the training process. This scheme, which is also known as teacher forcing, helps the network to get "on track" and stay there during training. Unfortunately, it may also compromise the performance of the network when, in use, it must depend on its own output. Hence, it is not uncommon to begin with the series–parallel configuration, then switch to the parallel configuration as the network learns the task. Multistream training seems to lessen the need for the series–parallel scheme; the response of the training process to the demands of multiple streams tends to keep the network from getting too far off-track. In this respect, multistream training seems particularly well suited for training networks with internal recurrence (e.g., recurrent multilayered perceptrons), where the opportunity to use teacher forcing is limited, because correct values for most if not all outputs of recurrent nodes are unknown.

Though our presentation has concentrated on multistreaming simply as an enhanced training technique, one can also exploit the fact that the


streams used to provide input–output data need not arise homogeneously, that is, from the same training task. Indeed, we have demonstrated that a single fixed-weight, recurrent neural network, trained by multistream EKF, can carry out multiple tasks in a control context, namely, to act as a stabilizing controller for multiple distinct and unrelated systems, without explicit knowledge of system identity [14]. This work demonstrated that the trained network was capable of exhibiting what could be considered to be adaptive behavior: the network, acting as a controller, observed the behavior of the system (through the system's output), implicitly identified which system the network was being subjected to, and then took actions to stabilize the system. We view this somewhat unexpected behavior as the direct result of combining an effective training procedure with the enabling representational capabilities that recurrent networks provide.

2.6 COMPUTATIONAL CONSIDERATIONS

We discuss here a number of topics related to the implementation of the various EKF training procedures from a computational perspective. In particular, we consider issues related to computation of the derivatives that are critical to the EKF methods, followed by discussions of computationally efficient formulations, methods for avoiding matrix inversions, and the use of square-root filtering as an alternative means of ensuring stable performance.

2.6.1 Derivative Calculations

We discussed above both the global and decoupled versions of the EKF algorithm, where we consider the global EKF to be a limiting form of decoupled EKF (i.e., DEKF with a single weight group). In addition, we have described the multistream EKF procedure as a means of batching training instances, and have noted that multistreaming can be used with any form of decoupled EKF training, for both feedforward and recurrent networks. The various EKF procedures can all be compactly described by the DEKF recursion of Eqs. (2.12)–(2.15), where we have assumed that the derivative matrices $H_k^i$ are given. However, the implications for computationally efficient and clear implementations of the various forms of EKF training depend upon the derivative calculations, which are dictated by whether a network architecture is static or dynamic (i.e., feedforward or recurrent), and whether or not multistreaming is used. Here


44 2 PARAMETER-BASED KALMAN FILTER TRAININGwe provi<strong>de</strong> insight into the nature of <strong>de</strong>rivative calculations for training ofboth static <strong>and</strong> dynamic networks with EKF methods (see [12] forimplementation <strong>de</strong>tails).We assume the convention that a network’s weights are organized byno<strong>de</strong>, regardless of the <strong>de</strong>gree of <strong>de</strong>coupling, which allows us to naturallypartition the matrix of <strong>de</strong>rivatives of network outputs with respect toweight parameters, H k , into a set of G submatrices H i k, where G is thenumber of no<strong>de</strong>s of the network. Then, each matrix H i k <strong>de</strong>notes the matrixof <strong>de</strong>rivatives of network outputs with respect to the weights associatedwith the ith no<strong>de</strong> of the network. For feedforward networks, thesesubmatrices can be written as the outer product of two vectors [3],H i k ¼ u i kðc i kÞ T ;where u i k is the ith no<strong>de</strong>’s input vector <strong>and</strong> ci k is a vector of partial<strong>de</strong>rivatives of the network’s outputs with respect to the ith no<strong>de</strong>’s net input,<strong>de</strong>fined as the dot product of the weight vector w i kwith the correspondinginput vector u i k . Note that the vectors ci k are computed via the backpropagationprocess, where the dimension of each of these vectors is<strong>de</strong>termined by the number of network outputs. 
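The outer-product structure of the submatrices is easy to verify numerically. The following sketch (our illustration, not the authors' code; the network sizes and variable names are arbitrary) forms $H_k^i = u_k^i (c_k^i)^T$ for one hidden node of a small feedforward network and checks it against a finite-difference Jacobian:

```python
import numpy as np

# Sketch (ours, not the authors' code): for a feedforward network, the
# derivative submatrix for node i factors as H_k^i = u_k^i (c_k^i)^T.
# We check this against a finite-difference Jacobian for one hidden node.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.normal(size=(n_hid, n_in + 1))   # hidden-node weights (bias first)
W2 = rng.normal(size=(n_out, n_hid + 1))  # linear output weights (bias first)

def forward(W1, x):
    u = np.concatenate(([1.0], x))        # node input vector incl. bias input
    net = W1 @ u                          # net inputs of the hidden nodes
    y = W2 @ np.concatenate(([1.0], np.tanh(net)))
    return u, net, y

x = rng.normal(size=n_in)
u, net, y = forward(W1, x)

i = 1                                     # consider the ith hidden node
# c_k^i: derivatives of the network outputs w.r.t. node i's net input,
# one component per output (a separate backpropagated signal per output)
c_i = W2[:, 1 + i] * (1.0 - np.tanh(net[i]) ** 2)
H_i = np.outer(u, c_i)                    # the M_i x N_o submatrix H_k^i

eps = 1e-6                                # finite-difference check
H_num = np.zeros_like(H_i)
for j in range(n_in + 1):
    W1p = W1.copy()
    W1p[i, j] += eps
    H_num[j] = (forward(W1p, x)[2] - y) / eps
assert np.allclose(H_i, H_num, atol=1e-4)
```

Note that $c_k^i$ has one component per network output, consistent with the convention of backpropagating a separate unity signal for each output.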
In contrast to the standard backpropagation algorithm, which begins the derivative calculation process (i.e., backpropagation) with error signals for each of the network's outputs, and effectively combines these error signals (for multiple-output problems) during the backpropagation process, the EKF methods begin the process with signals of unity for each network output and backpropagate a separate signal for each unique network output.

In the case of recurrent networks, we assume the use of truncated backpropagation through time for the calculation of derivatives, with a truncation depth of $h$ steps; this process is denoted by BPTT(h). Now, each submatrix $H_k^i$ can no longer be expressed as a simple outer product of two vectors; rather, each of these submatrices is expressed as the sum of a series of outer products:

$$H_k^i = \sum_{j=1}^{h} H_k^{i,j} = \sum_{j=1}^{h} u_k^{i,j} (c_k^{i,j})^T,$$

where the matrix $H_k^{i,j}$ is the contribution from the $j$th step of backpropagation to the computation of the total derivative matrix for the $i$th node; the vector $u_k^{i,j}$ is the vector of inputs to the $i$th node at the $j$th step of backpropagation; and $c_k^{i,j}$ is the vector of backpropagated derivatives of network outputs with respect to the $i$th node's net input at the $j$th step of backpropagation. Here, we have chosen arbitrarily to have $j$ increase as we step back in time.

Finally, consider multistream training, where we assume that the training problem involves a recurrent network architecture, with derivatives computed by BPTT(h) (feedforward networks are subsumed by the case of $h = 1$). We again assume an $N_o$-component cost function with $N_s$ streams, and define $l = k + N_s - 1$. Then, each submatrix $H_{k:l}^i$ becomes a concatenation of a series of submatrices, each of which is expressed as the sum of a series of outer products:

$$H_{k:l}^i = \left[ \sum_{j=1}^{h} H_k^{i,j,1} \quad \sum_{j=1}^{h} H_k^{i,j,2} \quad \cdots \quad \sum_{j=1}^{h} H_k^{i,j,N_s} \right] \qquad (2.36)$$

$$= \left[ \sum_{j=1}^{h} u_k^{i,j,1} (c_k^{i,j,1})^T \quad \sum_{j=1}^{h} u_k^{i,j,2} (c_k^{i,j,2})^T \quad \cdots \quad \sum_{j=1}^{h} u_k^{i,j,N_s} (c_k^{i,j,N_s})^T \right]. \qquad (2.37)$$

Here we have expressed each submatrix $H_{k:l}^i$ as an $M_i \times (N_o N_s)$ matrix, where $M_i$ is the number of weights corresponding to the network's $i$th node. The submatrices $H_k^{i,j,m}$ are of size $M_i \times N_o$, corresponding to a single training stream. For purposes of a compact representation, we express each matrix $H_{k:l}^i$ as a sum of matrices (as opposed to a concatenation) by forming vectors $C_k^{i,j,m}$ from the vectors $c_k^{i,j,m}$ in the following fashion. The vector $C_k^{i,j,m}$ is of length $N_o N_s$, with components set to zero everywhere except in the $m$th (out of $N_s$) block of length $N_o$, where this subvector is set equal to the vector $c_k^{i,j,m}$.
Then,

$$H_{k:l}^i = \sum_{j=1}^{h} \sum_{m=1}^{N_s} u_k^{i,j,m} (C_k^{i,j,m})^T.$$

Note that the matrix is expressed in this fashion for notational convenience and consistency, and that we would make use of the sparse nature of the vector $C_k^{i,j,m}$ in implementation.

2.6.2 Computationally Efficient Formulations for Multiple-Output Problems

We now consider the implications for the computational complexity of EKF training due to expressing the derivative calculations as a series of vector outer products as shown above. We consider the simple case of feedforward networks trained by node-decoupled EKF (NDEKF), in which each node's weights comprise a unique group for purposes of the error covariance update. The NDEKF recursion can then be written as

$$A_k = \left[ R_k + \sum_{j=1}^{G} a_k^j c_k^j (c_k^j)^T \right]^{-1}, \qquad (2.38)$$

$$K_k^i = v_k^i (g_k^i)^T, \qquad (2.39)$$

$$\hat{w}_{k+1}^i = \hat{w}_k^i + \left[ (c_k^i)^T (A_k \xi_k) \right] v_k^i, \qquad (2.40)$$

$$P_{k+1}^i = P_k^i - b_k^i v_k^i (v_k^i)^T + Q_k^i, \qquad (2.41)$$

where we have used the following equations in intermediate steps:

$$v_k^i = P_k^i u_k^i, \qquad (2.42)$$

$$g_k^i = A_k c_k^i, \qquad (2.43)$$

$$a_k^i = (u_k^i)^T v_k^i, \qquad (2.44)$$

$$b_k^i = (g_k^i)^T c_k^i. \qquad (2.45)$$

Based upon this partitioning of the derivative matrix $H_k$, we find that the computational complexity of NDEKF is reduced from $O(N_o^2 M + N_o \sum_{i=1}^{G} M_i^2)$ to $O(N_o^2 G + \sum_{i=1}^{G} M_i^2)$, indicating a distinct advantage for feedforward networks with multiple output nodes. On the other hand, the partitioning of the derivative matrix does not provide any computational advantage for GEKF training of feedforward networks.

2.6.3 Avoiding Matrix Inversions

A complicating factor for effective implementation of EKF training schemes is the need to perform matrix inversions for those problems with multiple cost-function components. We typically perform these types of calculations with matrix inversion routines based on singular-value decomposition [15]. Although these techniques have served us well over the years, we recognize that this often discourages "quick-and-dirty" implementations and may pose a large obstacle to hardware implementation. Two classes of methods have been developed that allow EKF training to be performed for multiple-output problems without explicitly resorting to matrix inversion routines.
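The NDEKF recursion of Eqs. (2.38)–(2.45) above can be exercised with a minimal NumPy sketch (ours; dimensions and values are arbitrary), cross-checking the partitioned form of Eq. (2.38) against the unpartitioned global scaling matrix:

```python
import numpy as np

# Minimal sketch (ours) of one NDEKF step, Eqs. (2.38)-(2.45).
rng = np.random.default_rng(1)
N_o, G = 2, 3
M = [3, 4, 2]                                  # weights per node group
R = np.eye(N_o)                                # measurement covariance R_k
u = [rng.normal(size=m) for m in M]            # node input vectors u_k^i
c = [rng.normal(size=N_o) for _ in M]          # backpropagated vectors c_k^i
P = [np.eye(m) for m in M]                     # per-node covariances P_k^i
w = [rng.normal(size=m) for m in M]            # weight vectors w_k^i
xi = rng.normal(size=N_o)                      # error vector xi_k
Q = [1e-4 * np.eye(m) for m in M]              # artificial process noise Q_k^i

v = [P[i] @ u[i] for i in range(G)]            # (2.42)
a = [u[i] @ v[i] for i in range(G)]            # (2.44)
A = np.linalg.inv(R + sum(a[i] * np.outer(c[i], c[i]) for i in range(G)))  # (2.38)
g = [A @ c[i] for i in range(G)]               # (2.43)
b = [g[i] @ c[i] for i in range(G)]            # (2.45)
w_new = [w[i] + (c[i] @ (A @ xi)) * v[i] for i in range(G)]            # (2.40)
P_new = [P[i] - b[i] * np.outer(v[i], v[i]) + Q[i] for i in range(G)]  # (2.41)

# Cross-check (2.38) against the unpartitioned form A = (R + H^T P H)^{-1},
# using the outer-product structure H_k^i = u_k^i (c_k^i)^T.
H = [np.outer(u[i], c[i]) for i in range(G)]
A_direct = np.linalg.inv(R + sum(H[i].T @ P[i] @ H[i] for i in range(G)))
assert np.allclose(A, A_direct)
```

The per-node scalars $a_k^i$ and $b_k^i$ are what replace the matrix–matrix products of the unpartitioned update, which is the source of the complexity reduction noted above.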
The first class [16] depends on the partitioning


Now, the idea is to find a unitary transformation $\Theta$ such that

$$\begin{bmatrix} A_k^{-1/2} & 0 \\ P_k H_k A_k^{1/2} & P_{k+1}^{1/2} \end{bmatrix} = \begin{bmatrix} R_k^{1/2} & H_k^T P_k^{1/2} \\ 0 & P_k^{1/2} \end{bmatrix} \Theta. \qquad (2.53)$$

This is easily accomplished by applying a series of $2 \times 2$ Givens rotations to annihilate the elements of the submatrix $H_k^T P_k^{1/2}$, thereby yielding the left-hand-side matrix. Given this result of the square-root filtering procedure, we can perform the network weight update via the following additional steps: (1) compute $A_k^{1/2}$ by inverting $A_k^{-1/2}$; (2) compute the Kalman gain matrix by $K_k = (P_k H_k A_k^{1/2}) A_k^{1/2}$; (3) perform the weight update via Eq. (2.5).

2.6.4.2 With Artificial Noise

In our original work [3] on EKF-based training, we introduced the use of artificial process noise as a simple and easily controlled mechanism to help assure that the approximate error covariance matrix $P_k$ would retain the necessary property of nonnegative-definiteness, thereby allowing us to avoid the more computationally complicated square-root formulations. In addition to controlling the proper evolution of $P_k$, we have also found that artificial process noise, when carefully applied, helps to accelerate the training process and, more importantly, leads to solutions superior to those found without artificial process noise. We emphasize that the use of artificial process noise is not ad hoc, but appears due to the process-noise term in Eq. (2.1) (i.e., the covariance matrix $Q_k$ in Eq. (2.6) disappears only when $v_k = 0$ for all $k$). We have continued to use this feature in our implementations as an effective means of escaping poor local minima, and have not experienced problems with divergence.
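The stabilizing role of $Q_k$ can be illustrated with a small simulation (ours; the update form follows Eq. (2.6) with $K_k = P_k H_k A_k$, but the dimensions and constants are arbitrary): with $Q_k = 0$, the smallest eigenvalue of $P_k$ decays toward zero over repeated updates, while a small artificial process-noise term keeps $P_k$ bounded away from singularity:

```python
import numpy as np

# Illustration (ours) of the role of Q_k in the covariance update of
# Eq. (2.6): P <- P - P H A H^T P + Q, with A = (R + H^T P H)^{-1}.
rng = np.random.default_rng(6)
M, N_o = 4, 2
R = 0.1 * np.eye(N_o)

def run(Q, steps=200):
    """Repeat the covariance update with random derivative matrices H."""
    P = np.eye(M)
    for _ in range(steps):
        H = rng.normal(size=(M, N_o))
        A = np.linalg.inv(R + H.T @ P @ H)
        P = P - P @ H @ A @ H.T @ P + Q
    return np.linalg.eigvalsh(P).min()      # smallest eigenvalue of P

lam_no_noise = run(np.zeros((M, M)))        # Q_k = 0
lam_noise = run(1e-3 * np.eye(M))           # small artificial process noise
assert lam_noise > lam_no_noise > 0         # Q_k keeps P well-conditioned
```

Because the noise-free update adds information at every step, $P_k^{-1}$ grows without bound and $P_k$ becomes progressively closer to singular; the additive $Q_k$ term places a floor under its smallest eigenvalue.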
Other researchers with independent implementations of various forms of EKF [13, 17, 20] for neural network training have also found the use of artificial process noise to be beneficial. Furthermore, other gradient-based training algorithms have effectively exploited weight noise (e.g., see [21]).

We now demonstrate that the square-root filtering formulation of Eq. (2.53) is easily extended to include artificial process noise, and that the use of artificial process noise and square-root filtering are not mutually exclusive. Again, we wish to express the error covariance update in a factorized form, but we now augment the left-hand side of Eq. (2.52) to include the square root of the process-noise covariance matrix:

$$\begin{bmatrix} R_k + H_k^T P_k H_k & H_k^T P_k \\ P_k H_k & P_k + Q_k \end{bmatrix} = \begin{bmatrix} R_k^{1/2} & H_k^T P_k^{1/2} & 0 \\ 0 & P_k^{1/2} & Q_k^{1/2} \end{bmatrix} \begin{bmatrix} R_k^{1/2} & 0 \\ P_k^{1/2} H_k & P_k^{1/2} \\ 0 & Q_k^{1/2} \end{bmatrix}. \qquad (2.54)$$

Similarly, the right-hand side of Eq. (2.52) is augmented by blocks of zeros, so that the matrices are of the same size as those of the right-hand side of Eq. (2.54):

$$\begin{bmatrix} A_k^{-1} & H_k^T P_k \\ P_k H_k & P_{k+1} + P_k H_k A_k H_k^T P_k \end{bmatrix} = \begin{bmatrix} A_k^{-1/2} & 0 & 0 \\ P_k H_k A_k^{1/2} & P_{k+1}^{1/2} & 0 \end{bmatrix} \begin{bmatrix} A_k^{-1/2} & A_k^{1/2} H_k^T P_k \\ 0 & P_{k+1}^{1/2} \\ 0 & 0 \end{bmatrix}. \qquad (2.55)$$

Here, Eqs. (2.54) and (2.55) are equivalent to one another, and Eq. (2.53) is appropriately modified to

$$\begin{bmatrix} A_k^{-1/2} & 0 & 0 \\ P_k H_k A_k^{1/2} & P_{k+1}^{1/2} & 0 \end{bmatrix} = \begin{bmatrix} R_k^{1/2} & H_k^T P_k^{1/2} & 0 \\ 0 & P_k^{1/2} & Q_k^{1/2} \end{bmatrix} \Theta. \qquad (2.56)$$

Thus, square-root filtering can easily accommodate the artificial process-noise extension, where the matrices $H_k^T P_k^{1/2}$ and $Q_k^{1/2}$ are both annihilated via a sequence of Givens rotations. Note that this extension involves substantial additional computational costs beyond those incurred when $Q_k = 0$. For a network with $M$ weights and $N_o$ outputs, the use of artificial process noise introduces $O(M^3)$ additional computations for the annihilation of $Q_k^{1/2}$, whereas the annihilation of the matrix $H_k^T P_k^{1/2}$ involves only $O(M^2 N_o)$ computations (here we assume $M \gg N_o$).
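The augmented update of Eq. (2.56) can be sketched as follows (our illustration, not the authors' implementation; we realize the unitary transformation $\Theta$ with a QR factorization, which is equivalent to the sequence of Givens rotations described in the text):

```python
import numpy as np

# Sketch (ours) of the augmented square-root update of Eq. (2.56).
rng = np.random.default_rng(3)
M, N_o = 5, 2
Hk = rng.normal(size=(M, N_o))              # derivative matrix H_k
P = np.eye(M)                               # error covariance P_k
R = 0.5 * np.eye(N_o)                       # measurement covariance R_k
Q = 1e-3 * np.eye(M)                        # artificial process noise Q_k

P_s, R_s, Q_s = (np.linalg.cholesky(X) for X in (P, R, Q))
# Pre-array [[R^{1/2}, H^T P^{1/2}, 0], [0, P^{1/2}, Q^{1/2}]]
pre = np.block([[R_s, Hk.T @ P_s, np.zeros((N_o, M))],
                [np.zeros((M, N_o)), P_s, Q_s]])
# Annihilating H^T P^{1/2} and Q^{1/2} by an orthogonal Theta applied from
# the right is done by QR-factorizing the transposed pre-array:
# post = [[A^{-1/2}, 0, 0], [P H A^{1/2}, P_{k+1}^{1/2}, 0]]
post = np.linalg.qr(pre.T, mode="r").T

P_next_sqrt = post[N_o:, N_o:]              # square root of P_{k+1}
A = np.linalg.inv(R + Hk.T @ P @ Hk)        # scaling matrix A_k
P_next = P - P @ Hk @ A @ Hk.T @ P + Q      # covariance update, Eq. (2.6)
assert np.allclose(P_next_sqrt @ P_next_sqrt.T, P_next)
```

Setting `Q` to zero recovers the unaugmented form of Eq. (2.53), with the top-left block of `post` equal to a square root of $A_k^{-1}$.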


2.7 OTHER EXTENSIONS AND ENHANCEMENTS

2.7.1 EKF Training with Constrained Weights

Due to the second-order properties of the EKF training procedure, we have observed that, for certain problems, networks trained by EKF tend to develop large weight values (e.g., between 10 and 100 in magnitude). We view this capability as a double-edged sword: on the one hand, some problems may require that large weight values be developed, and the EKF procedures are effective at finding solutions for these problems. On the other hand, trained networks may need to be deployed with execution performed in fixed-point arithmetic, which requires that limits be imposed on the range of values of network inputs, outputs, and weights. For nonlinear sigmoidal nodes, the node outputs are usually limited to values between $-1$ and $+1$, and input signals can usually be linearly transformed so that they fall within this range. On the other hand, the EKF procedures as described above place no limit on the weight values. We describe here a natural mechanism, imposed during training, that limits the range of weight values. In addition to allowing for fixed-point deployment of trained networks, this weight-limiting mechanism may also promote better network generalization.

We wish to set constraints on weight values during the training process while maintaining rigorous consistency with the EKF recursion. We can accomplish this by converting the unconstrained nonlinear optimization problem into one of optimization with constraints.
The general idea is to treat each of the network's weight values as the output of a monotonically increasing function $f(\cdot)$ with saturating limits at the function's extremes (e.g., a sigmoid function). Thus, the EKF recursion is performed in an unconstrained space, while the network's weight values are nonlinear transformations of the corresponding unconstrained values that evolve during the training process. This transformation requires that the EKF recursion be modified to take into account the function $f(\cdot)$ as applied to the parameters (i.e., unconstrained weight values) that evolve during training.

Assume that the vector of the network's weights $w_k$ is constrained to take on values in the range $-a$ to $+a$, and that each component $w_k^{i,j}$ (the $j$th weight of the $i$th node) of the constrained weight vector is related to an unconstrained value $\tilde{w}_k^{i,j}$ via a function $w_k^{i,j} = f(\tilde{w}_k^{i,j}; a)$. We formulate the EKF recursion so that weight updates are performed in the unconstrained weight space, while the steps of forward propagation and backpropagation of derivatives are performed in the constrained weight space.


The training steps are carried out as follows. At time step $k$, an input vector is propagated through the network, and the network outputs are computed and stored in the vector $\hat{y}_k$. The error vector $\xi_k$ is also formed as defined above. Subsequently, the derivatives of each component of $\hat{y}_k$ with respect to each node's weight vector $w_k^i$ are computed and stored in the matrices $H_k^i$, where the component $H_k^{i,j,l}$ contains the derivative of the $l$th component of $\hat{y}_k$ with respect to the $j$th weight of the $i$th node. In order to perform the EKF recursion in the unconstrained space, we must perform three steps in addition to those that are normally carried out. First, we transform weight values from the constrained space via $\tilde{w}_k^{i,j} = f^{-1}(w_k^{i,j}; a)$ for all trainable weights of all nodes of the network, which yields the vectors $\tilde{w}_k^i$. Second, the derivatives that have been previously computed with respect to weights in the constrained space must be transformed to derivatives with respect to weights in the unconstrained space. This is easily performed by the following transformation for each derivative component:

$$\tilde{H}_k^{i,j,l} = H_k^{i,j,l} \frac{\partial w_k^{i,j}}{\partial \tilde{w}_k^{i,j}} = H_k^{i,j,l} \frac{\partial f(\tilde{w}_k^{i,j}; a)}{\partial \tilde{w}_k^{i,j}}. \qquad (2.57)$$

The EKF weight update procedure of Eqs. (2.8)–(2.11) is then applied using the unconstrained weights and derivatives with respect to unconstrained weights.
Note that no transformation is applied to either the scaling matrix $S_k$ or the error vector $\xi_k$ before they are used in the update. Finally, after the weight updates are performed, the unconstrained weights are transformed back to the constrained space by $w_k^{i,j} = f(\tilde{w}_k^{i,j}; a)$ for all weights of all nodes in the network.

We now consider specific forms for the function $f(\tilde{w}_k^{i,j}; a)$ that transforms weight values from an unconstrained space to a constrained space. We require that the function obey the following properties:

1. $f(\tilde{w}_k^{i,j}; a)$ is monotonically increasing.
2. $f(0; a) = 0$.
3. $f(-\infty; a) = -a$.
4. $f(+\infty; a) = +a$.
5. $\left. \partial w_k^{i,j} / \partial \tilde{w}_k^{i,j} \right|_{\tilde{w}_k^{i,j} = 0} = 1$.
6. $\lim_{a \to \infty} f(\tilde{w}_k^{i,j}; a) = \tilde{w}_k^{i,j} = w_k^{i,j}$.


Property 5 imposes a constraint on the gain of the transformation, while property 6 imposes the constraint that, in the limit of large $a$, the constrained optimization problem operates identically to the unconstrained problem. This last constraint also implies that $\lim_{a \to \infty} (\partial w_k^{i,j} / \partial \tilde{w}_k^{i,j}) = 1$.

One candidate function is a symmetric saturating linear transformation, where $w_k^{i,j} = \tilde{w}_k^{i,j}$ when $-a \le \tilde{w}_k^{i,j} \le a$, and is otherwise equal to either saturating value. The major disadvantage of this constraint function is that its inverse is multivalued outside the linear range. Thus, once the training process pushes the constrained weight value into the saturated region, the derivative of the constrained weight with respect to the unconstrained weight becomes zero, and no further training of that particular weight value will occur, due to the zero-valued derivative.

Alternatively, we may consider various forms of symmetric sigmoid functions that are everywhere differentiable and have well-defined inverses. The Elliott sigmoid [22] conveniently does not involve transcendental functions. We choose to consider a generalization of this monotonic and symmetric saturating function given by

$$w_k^{i,j} = f(\tilde{w}_k^{i,j}; a) = \frac{a \tilde{w}_k^{i,j}}{b + |\tilde{w}_k^{i,j}|} = \frac{\tilde{w}_k^{i,j}}{b/a + |\tilde{w}_k^{i,j}|/a}, \qquad (2.58)$$

where $b$ is a positive quantity that determines the function's gain. We must choose the value of $b$ so that the derivative of $f(\tilde{w}_k^{i,j}; a)$ with respect to $\tilde{w}_k^{i,j}$, evaluated at $\tilde{w}_k^{i,j} = 0$, is equal to 1. This condition can be shown to be satisfied by the choice $b = a$.
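With $b = a$, the resulting constraint function, together with its inverse and derivative as stated below in Eqs. (2.59)–(2.61), can be checked numerically; this snippet is ours, not the authors':

```python
import numpy as np

# Numerical check (ours) of the constraint function with b = a, i.e.,
# f(wt; a) = a*wt/(a + |wt|), against properties 1-6 and the stated
# inverse and derivative (Eqs. (2.59)-(2.61)).
a = 2.0
f = lambda wt: a * wt / (a + np.abs(wt))         # constraint function
f_inv = lambda w: a * w / (a - np.abs(w))        # its inverse
df = lambda wt: (a / (a + np.abs(wt))) ** 2      # d(constrained)/d(unconstrained)

wt = np.linspace(-50.0, 50.0, 1001)
assert np.all(np.diff(f(wt)) > 0)                # property 1: monotone increasing
assert f(0.0) == 0.0                             # property 2
assert abs(f(-1e12) + a) < 1e-9                  # property 3: saturates at -a
assert abs(f(1e12) - a) < 1e-9                   # property 4: saturates at +a
eps = 1e-7
slope0 = (f(eps) - f(-eps)) / (2 * eps)
assert abs(slope0 - 1.0) < 1e-6                  # property 5: unit gain at 0
big = 1e9                                        # property 6: identity as a -> inf
assert np.allclose(big * wt / (big + np.abs(wt)), wt)

assert np.allclose(f_inv(f(wt)), wt)             # inverse is well defined
num_df = (f(wt + eps) - f(wt - eps)) / (2 * eps)
assert np.allclose(num_df, df(wt), atol=1e-5)    # matches the derivative form
```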
Thus, the constraint function we choose is given by

$$w_k^{i,j} = f(\tilde{w}_k^{i,j}; a) = \frac{a \tilde{w}_k^{i,j}}{a + |\tilde{w}_k^{i,j}|} = \frac{\tilde{w}_k^{i,j}}{1 + |\tilde{w}_k^{i,j}|/a}. \qquad (2.59)$$

By inspection, we see that this function satisfies the various requirements. For example, for large $a$ (i.e., as $a \to \infty$), $w_k^{i,j} \to \tilde{w}_k^{i,j}$; for smaller values of $a$, when $|\tilde{w}_k^{i,j}| \gg a$, $w_k^{i,j} \to a \, \mathrm{sgn}(\tilde{w}_k^{i,j})$. The inverse of this function is easily found to be given by

$$\tilde{w}_k^{i,j} = f^{-1}(w_k^{i,j}; a) = \frac{a w_k^{i,j}}{a - |w_k^{i,j}|} = \frac{w_k^{i,j}}{1 - |w_k^{i,j}|/a}. \qquad (2.60)$$

In this case, for $a \gg |w_k^{i,j}|$, $\tilde{w}_k^{i,j} \to w_k^{i,j}$; similarly, as $w_k^{i,j} \to a$, $\tilde{w}_k^{i,j} \to \infty$. As a final note, the derivative of the constrained weight with respect to the unconstrained weight, which is needed for computing the proper derivatives in the EKF recursion, can be expressed in many different ways, some of which are given by

$$\frac{\partial w_k^{i,j}}{\partial \tilde{w}_k^{i,j}} = \left( \frac{a}{a + |\tilde{w}_k^{i,j}|} \right)^2 = \left( \frac{a - |w_k^{i,j}|}{a} \right)^2 = \left( 1 - \frac{|w_k^{i,j}|}{a} \right)^2 = \left( \frac{w_k^{i,j}}{\tilde{w}_k^{i,j}} \right)^2. \qquad (2.61)$$

2.7.2 EKF Training with an Entropic Cost Function

As defined above, the EKF training algorithm assumes that a quadratic function of some error signal is being minimized over all network outputs and all training patterns. However, other cost functions are often useful or necessary. One such function, which has been found to be particularly appropriate for pattern classification problems and for which a sound statistical basis exists, is a cost function based on minimizing cross-entropy [23]. We consider a prototypical problem in which a network is trained to act as a pattern classifier; here network outputs encode binary pattern classifications. We assume that target values of $\pm 1$ are provided for each training pattern. Then the contribution to the total entropic cost function at time step $k$ is given by

$$e_k = \sum_{l=1}^{N_o} e_k^l = \sum_{l=1}^{N_o} \left[ (1 + y_k^l) \log \frac{1 + y_k^l}{1 + \hat{y}_k^l} + (1 - y_k^l) \log \frac{1 - y_k^l}{1 - \hat{y}_k^l} \right]. \qquad (2.62)$$

Since the components of the vector $y_k$ are constrained to be either $+1$ or $-1$, we note that only one of the two terms for each output $l$ will be nonzero. This allows the cost function to be expressed as

$$e_k \big|_{y_k^l = \pm 1} = \sum_{l=1}^{N_o} e_k^l = \sum_{l=1}^{N_o} 2 \log \frac{2}{1 + y_k^l \hat{y}_k^l}. \qquad (2.63)$$

The EKF training procedure assumes that at each time step $k$ a quadratic cost function is being minimized, which we write as $C_k = \sum_{l=1}^{N_o} (z_k^l - \hat{z}_k^l)^2 = \sum_{l=1}^{N_o} (\xi_k^l)^2$, where $z_k^l$ and $\hat{z}_k^l$ are target and output values, respectively. (We assume here the case of $S_k = I$; this procedure is easily extended to nonuniform weighting matrices.)
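The equivalence of Eqs. (2.62) and (2.63) for $\pm 1$ targets is easy to confirm numerically (our snippet; values are arbitrary):

```python
import numpy as np

# Check (ours) that the two-term entropic cost of Eq. (2.62) collapses to
# the single-log form of Eq. (2.63) when the targets are +/-1.
rng = np.random.default_rng(4)
y_hat = rng.uniform(-0.99, 0.99, size=8)     # network outputs in (-1, 1)
y = rng.choice([-1.0, 1.0], size=8)          # binary targets

def e_full(y, y_hat):
    """Eq. (2.62), evaluating only the nonzero term for each output."""
    e = np.zeros_like(y_hat)
    m = y == 1.0
    e[m] = (1 + y[m]) * np.log((1 + y[m]) / (1 + y_hat[m]))
    e[~m] = (1 - y[~m]) * np.log((1 - y[~m]) / (1 - y_hat[~m]))
    return e

e_simple = 2.0 * np.log(2.0 / (1.0 + y * y_hat))   # Eq. (2.63)
assert np.allclose(e_full(y, y_hat), e_simple)
assert np.all(e_simple > 0)                  # each component is positive
```

The positivity of each component is what permits the square-root target construction described next.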
At this point, we would like to find appropriate transformations between the $\{z_k^l, \hat{z}_k^l\}$ and $\{y_k^l, \hat{y}_k^l\}$ so that $C_k$ and $e_k$ are equivalent, thereby allowing us to use the EKF procedure to minimize the entropic cost function. We first note that both the quadratic and entropic cost functions are calculated by summing individual cost-function components from all targets and output nodes. We immediately see that this leads to the equality $(\xi_k^l)^2 = e_k^l$ for all $N_o$ outputs, which implies $z_k^l - \hat{z}_k^l = (e_k^l)^{1/2}$. At this point, we assume that we can assign all target values $z_k^l = 0$,³ so that $\xi_k^l = -\hat{z}_k^l = (e_k^l)^{1/2}$. Now the EKF recursion can be applied to minimize the entropic cost function, since $C_k = \sum_{l=1}^{N_o} (\xi_k^l)^2 = \sum_{l=1}^{N_o} [(e_k^l)^{1/2}]^2 = \sum_{l=1}^{N_o} e_k^l = e_k$.

The remainder of the derivation is straightforward. The EKF recursion in the case of the entropic cost function requires that derivatives of $\hat{z}_k^l = -(e_k^l)^{1/2}$ be computed for all $N_o$ outputs and all weight parameters, which are subsequently stored in the matrices $H_k^i$.
Applying the chain rule, these derivatives are expressed as a function of the derivatives of network outputs with respect to weight parameters:

$$H_k^{i,j,l} = \frac{\partial \hat{z}_k^l}{\partial w_k^{i,j}} = -\frac{\partial (e_k^l)^{1/2}}{\partial w_k^{i,j}} = \frac{1}{(e_k^l)^{1/2} (y_k^l + \hat{y}_k^l)} \frac{\partial \hat{y}_k^l}{\partial w_k^{i,j}}. \qquad (2.64)$$

Note that the effect of the relative entropy cost function on the calculation of derivatives is handled entirely in the initialization of the backpropagation process, where the term $1/[(e_k^l)^{1/2} (y_k^l + \hat{y}_k^l)]$ is used for each of the $N_o$ output nodes to start the backpropagation process, rather than starting with a value of unity for each output node as in the nominal formulation.

In general, the EKF procedure can be modified in the manner just described for a wide range of cost functions, provided that they meet at least three simple requirements. First, the cost function must be a differentiable function of network outputs. Second, the cost function should be expressed as a sum of contributions, where there is a separate target value for each individual component. Third, each component of the cost function must be non-negative.

2.7.3 EKF Training with Scalar Errors

When applied to a multiple-output training problem, the EKF formulation in Eqs. (2.3)–(2.6) requires a separate backpropagation for each output and a matrix inversion. In this section, we describe an approximation to the EKF neural network training procedure that allows us to treat such problems with single-output training complexity. In this approximation, we require only the computation of derivatives of a scalar quantity with respect to trainable weights, thereby reducing the backpropagation computation and eliminating the need for a matrix inversion in the multiple-output EKF recursion.

³ The idea of using a modified target value of zero, with the actual targets appearing in expressions for system outputs, can be applied to the EKF formulation of Eqs. (2.3)–(2.6) without any change in its underlying behavior.

For the sake of simplicity, we consider here the prototypical network training problem for which network outputs directly encode signals for which targets are defined. The square root of the contribution to the total cost function at time step $k$ is given by

$$\tilde{y}_k = C_k^{1/2} = \left( \sum_{l=1}^{N_o} |y_k^l - \hat{y}_k^l|^2 \right)^{1/2}, \qquad (2.65)$$

where we are again treating the simple case of uniform scaling of network errors (i.e., $S_k = I$). The goal here is to train a network so that the sum of squares of this scalar error measure is minimized over time. As in the case of the entropic cost function, we consider the target for training to be zero for all training instances, and the scalar error signal used in the Kalman recursion to be given by $\xi_k = 0 - \tilde{y}_k$. The EKF recursion requires that the derivatives of the scalar observation $\tilde{y}_k$ be computed with respect to all weight parameters. The derivative of the scalar error with respect to the $j$th weight of the $i$th node is given by

$$H_k^{i,j,1} = \frac{\partial \tilde{y}_k}{\partial w_k^{i,j}} = \sum_{l=1}^{N_o} \frac{\partial \tilde{y}_k}{\partial \hat{y}_k^l} \frac{\partial \hat{y}_k^l}{\partial w_k^{i,j}} = \sum_{l=1}^{N_o} \frac{y_k^l - \hat{y}_k^l}{\xi_k} \frac{\partial \hat{y}_k^l}{\partial w_k^{i,j}}. \qquad (2.66)$$

In this scalar formulation, the derivative calculations via backpropagation are initialized with the terms $(y_k^l - \hat{y}_k^l)/\xi_k$ for all $N_o$ network output nodes (as opposed to initializing the backpropagation calculations with values of unity, as in the nominal EKF recursion of Eqs. (2.3)–(2.6)).
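The scalar derivative of Eq. (2.66) can be checked with a stand-in network (our sketch; a linear map `J` plays the role of the matrix of output derivatives that backpropagation would supply):

```python
import numpy as np

# Sketch (ours) of the scalar-error derivative of Eq. (2.66): the gradient
# of ytilde = C_k^{1/2} is a sum of per-output gradients weighted by the
# terms (y^l - yhat^l)/xi_k that seed the single backward pass.
rng = np.random.default_rng(5)
N_o, M = 3, 6
J = rng.normal(size=(N_o, M))            # stand-in for d(yhat)/d(w)
w = rng.normal(size=M)
y = rng.normal(size=N_o)                 # targets

yhat = lambda w: J @ w                   # linear stand-in "network"
ytilde = lambda w: np.sqrt(np.sum((y - yhat(w)) ** 2))   # Eq. (2.65)

xi = 0.0 - ytilde(w)                     # scalar error signal
H = ((y - yhat(w)) / xi) @ J             # Eq. (2.66): one weighted pass

eps = 1e-6                               # finite-difference gradient check
H_num = np.array([(ytilde(w + eps * np.eye(M)[j]) - ytilde(w)) / eps
                  for j in range(M)])
assert np.allclose(H, H_num, atol=1e-4)
```

Only the single vector `H` is produced, rather than the $N_o \times M$ derivative matrix of the nominal formulation.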
Furthermore, only one quantity is backpropagated, rather than $N_o$ quantities as in the nominal formulation. Note that this scalar approximation reduces exactly to the nominal EKF algorithm in the limit of a single-output problem:

$$\tilde{y}_k = (|y_k^1 - \hat{y}_k^1|^2)^{1/2} = |y_k^1 - \hat{y}_k^1|, \qquad (2.67)$$

$$\xi_k = 0 - |y_k^1 - \hat{y}_k^1|, \qquad (2.68)$$

$$H_k^{i,j,1} = \frac{\partial \tilde{y}_k}{\partial w_k^{i,j}} = -\mathrm{sgn}(y_k^1 - \hat{y}_k^1) \frac{\partial \hat{y}_k^1}{\partial w_k^{i,j}}. \qquad (2.69)$$


Consider the case $y_k^1 \le \hat{y}_k^1$. Then the error signal is given by $\xi_k = y_k^1 - \hat{y}_k^1$. Similarly, $\partial \tilde{y}_k / \partial w = \partial \hat{y}_k^1 / \partial w$, since $\partial \tilde{y}_k / \partial \hat{y}_k^1 = 1$. Otherwise, when $y_k^1 > \hat{y}_k^1$, the error signal is given by $\xi_k = -(y_k^1 - \hat{y}_k^1)$, and $\partial \tilde{y}_k / \partial w = -\partial \hat{y}_k^1 / \partial w$, since $\partial \tilde{y}_k / \partial \hat{y}_k^1 = -1$. Since both the error and the derivatives are the negatives of what the nominal EKF recursion provides, the effects of negation cancel one another. Thus, in either case, the scalar formulation for a single-output problem is exactly equivalent to that of the EKF procedure of Eqs. (2.3)–(2.6).

Because the procedure described here is an approximation to the base procedure, we suspect that classes of problems exist for which it is not as effective; further work will be required to clarify this question. In this regard, we note that once criteria are available to guide the decision of whether or not to scalarize, one may also consider a hybrid approach to problems with many outputs. In this approach, selected outputs would be combined as described above to produce scalar error variables; the latter would then be treated with the original procedure.

2.8 AUTOMOTIVE APPLICATIONS OF EKF TRAINING

The general area of automotive powertrain control, diagnosis, and modeling has offered substantial opportunity for the application of neural network methods. These opportunities are driven by the steadily increasing demands that are placed on the performance of vehicle control and diagnostic systems as a consequence of global competition and government mandates.
Modern automotive powertrain control systems involve several interacting subsystems, any one of which can involve significant engineering challenges. We summarize the application of EKF training to three signal processing problems related to automotive diagnostics and emissions modeling, as well as its application to two automotive control problems. In all five cases, we have found EKF training of recurrent neural networks to be an enabler for developing effective solutions to these problems.

Figure 2.7 provides a diagrammatic representation of these five neural network applications and how they potentially interact with one another. We observe that the neural network controllers for engine idle speed and air/fuel (A/F) ratio control produce signals that affect the operation of the engine, while the remaining neural network models are used to describe various aspects of engine operation as a function of measurable engine outputs.


Figure 2.7 Block-diagram representation of neural network applications for automotive engine control and diagnosis. Solid boxes represent physical components of the engine system, double-lined solid boxes represent neural network models or diagnostic processes, and double-lined dashed boxes represent neural network controllers. For the sake of simplicity, we have not shown all relevant sensors and their corresponding signals (e.g., engine coolant temperature).

2.8.1 Air/Fuel Ratio Control

At a very basic level, the role of the A/F controller is to supply fuel to the engine such that it matches the amount of air pumped into the engine via the throttle and idle-speed bypass valve. This is accomplished with an electronic feedback control system that utilizes a heated exhaust gas oxygen (HEGO) sensor, whose role is to indicate whether the engine-out exhaust is rich (i.e., too much fuel) or lean (too much air). Depending on the measured state of the exhaust gases, as well as engine operating conditions such as engine speed and load, the A/F control is changed so as to drive the system toward stoichiometry. Since the HEGO sensor is largely considered to be a binary sensor (i.e., it produces high/low voltage levels for rich/lean operation, respectively), and since there are time-varying transport delays, the closed-loop A/F control strategy often takes the form of a jump/ramp strategy, which effectively causes the HEGO output to oscillate between the two voltage levels. We have demonstrated that an open-loop recurrent neural network controller can be trained to provide a correction signal to the closed-loop A/F control in the face of transient conditions (i.e., dynamic changes in engine speed and load), thereby eliminating large deviations from stoichiometry. This is accomplished by using an auxiliary universal EGO (UEGO) sensor, which provides a continuous measure of A/F ratio (as opposed to the rich/lean indication provided by the HEGO), during the in-vehicle training process. Deviations of measured A/F ratio from the stoichiometric A/F ratio provide the error signal for the EKF training process. However, the measured A/F ratio is not used as an input, and since the A/F control does not have a major effect on engine operating conditions when operated near stoichiometry, this can be viewed as a problem of training an open-loop controller. Nevertheless, we use recurrent network controllers to provide the capability of representing the condition-dependent dynamics associated with the operation of the engine system under A/F control, and must take care to properly compute derivatives with BPTT(h).

2.8.2 Idle Speed Control

A second engine control task is that of maintaining smooth engine operation at idle conditions.
In this case, no air is provided to the intake manifold of the engine via the throttle; in order to keep the engine running, a bypass air valve is used to regulate the flow of air into the engine. The role of the idle speed control system is to maintain a relatively low (for purposes of fuel economy) and constant engine speed, in the face of disturbances that place and remove additional loads on the engine (e.g., shifting from neutral to drive, activating the air conditioning system, and locking up the power steering); feedforward signals encoding these events are provided as input to the idle speed controller. The control range of the bypass air signal is large (more than 1000 rpm under idle conditions), but its effect is delayed by a time inversely proportional to engine speed. The spark advance command, which regulates the timing of ignition, has an immediate effect on engine speed, but over a small range (on the order of 100 rpm). Thus, an effective engine idle speed controller coordinates the two controls to maintain a constant engine speed. The error signals for the EKF training process are a weighted sum of squared deviations of engine


speed from a desired speed, combined with constraints on the controls expressed as squared error signals. We have used recurrent neural networks, trained by on-line EKF methods, to develop effective idle speed control strategies, and have documented this work in [5]. Note that, unlike the case of the A/F controller, this is an example of a closed-loop controller, since the bypass air and spark advance controls affect engine speed, which is used as a controller input.

2.8.3 Sensor-Catalyst Modeling

A particularly critical component of a vehicle's emissions control system is the catalytic converter. The role of the catalytic converter is to chemically transform noxious and environmentally damaging engine-out emissions, which are the byproduct of the engine's combustion process, to environmentally benign chemical compounds. An ideal three-way catalytic converter should completely perform the following three tasks during continuous vehicle operation: (1) oxidation of hydrocarbon (HC) exhaust gases to carbon dioxide (CO2) and water (H2O); (2) oxidation of carbon monoxide (CO) to CO2; and (3) reduction of nitrogen oxides (NOx) to nitrogen (N2) and oxygen (O2). In practice, it is possible to achieve high conversion efficiencies for all three types of exhaust gases only when the engine is operating near stoichiometry. An effective A/F control strategy enables such conversion.

However, even in the presence of effective A/F control, vehicle-out (i.e., tailpipe) emissions may be unreasonably high if the catalytic converter has been damaged.
Government regulations require that the performance of a vehicle's catalytic converter be continuously monitored to detect when conversion efficiencies have dropped below some threshold. Unfortunately, it is currently infeasible to equip vehicles with sensors that can measure the various exhaust gas species directly. Instead, catalytic converter monitors are based on comparing the output of a HEGO sensor that is exposed to engine-out emissions with the output of a second sensor that is mounted downstream of the catalytic converter and is exposed to the tailpipe emissions. This approach is based on the observation that the postcatalyst HEGO sensor switches infrequently, relative to the precatalyst HEGO sensor, when the catalyst is operating efficiently. Similarly, the average rate of switching of the postcatalyst sensor increases as catalyst efficiency decreases (due to decreasing oxygen storage capability).

A catalyst monitor can be developed based on a neural network model of the dynamic operation of the postcatalyst HEGO sensor as a function of


the precatalyst HEGO sensor and engine operating conditions [12] for a catalyst of nominal conversion efficiency. This is a difficult task, especially given the nonlinear responses of the various components and the condition-dependent time delays, which can range from less than 0.1 s at high engine speeds to more than 1 s at low speeds. We employed an RMLP network with structure 15-20R-15R-10R-1 and a sparse tapped delay line representation to directly capture the long-term temporal characteristics of the precatalyst HEGO sensor. Because of the size of the network (over 1,500 weights) and the number of training samples (63,000), we chose to employ decoupled EKF training. The trained network effectively represented the condition-dependent time delays and nonlinearities of the system, as shown in [12].

2.8.4 Engine Misfire Detection

Engine misfire is broadly defined as the condition in which a substantial fraction of a cylinder's air–fuel mixture fails to ignite. Frequent misfire will lead to a deterioration of the catalytic converter, ultimately resulting in unacceptable levels of emitted pollutants. Consequently, government mandates require that onboard misfire detection capability be provided for nearly all engine operating conditions.

While there are many ways of detecting engine misfire, all currently practical methods rely on observing engine crankshaft dynamics with a position sensor located at one end of the shaft.
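Such methods derive an event-based acceleration value from the measured intervals between marks on a crankshaft timing wheel: each interval corresponds to a fixed rotation angle, so a per-event velocity can be formed, and an acceleration from the change between successive velocities. A minimal illustrative sketch follows; the function name and the normalization by the mean interval are our assumptions, not the chapter's code.

```python
# Illustrative sketch: event-based crankshaft acceleration from successive
# timing-wheel intervals. Each interval dt_i spans a fixed crankshaft angle
# dtheta, so the per-event velocity is dtheta/dt_i; acceleration is
# approximated as the velocity change divided by the mean of the two
# adjacent intervals (an assumed choice of time normalization).
def event_accels(intervals, dtheta=1.0):
    vels = [dtheta / dt for dt in intervals]          # angular velocity per event
    return [(v1 - v0) / (0.5 * (dt0 + dt1))           # delta-v over elapsed time
            for v0, v1, dt0, dt1 in zip(vels, vels[1:], intervals, intervals[1:])]
```

With constant intervals the computed acceleration is zero; a lengthening interval (a slowing crankshaft, as after a misfire) yields a negative value.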
Briefly stated, one looks for a crankshaft acceleration deficit following a cylinder firing and attempts to determine whether such a deficit is attributable to a lack of power provided on the most recent firing stroke.

Since every engine firing must be evaluated, the natural "clock" for misfire detection is based on crankshaft rotation, rather than on time. For an n-cylinder engine, there are n engine firings, or events, per engine cycle, which requires two engine revolutions. The actual time interval between events varies considerably, from 20 ms at 750 rpm to 2.5 ms at 6000 rpm for an eight-cylinder engine. Engine speed, as required for control, is typically derived from measured intervals between marks on a timing wheel. As used in misfire detection, an acceleration value is calculated from the difference between successive intervals.

A serious problem associated with measuring crankshaft acceleration is the presence of complex torsional dynamics of the crankshaft, even in the absence of misfire. This is due to the finite stiffness of the crankshaft. The magnitude of acceleration induced by such torsional vibrations may be


62 2 PARAMETER-BASED KALMAN FILTER TRAININGlarge enough to dwarf acceleration <strong>de</strong>ficits from misfire. Further, thetorsional vibrations are themselves altered by misfire, so that normalengine firings followed by misfire may be misinterpreted.We have approached the misfire <strong>de</strong>tection problem with recurrentneural networks trained by GEKF [12] to act as dynamic patternclassifiers. We use as inputs engine speed, engine load, crankshaftacceleration, <strong>and</strong> a binary flag to i<strong>de</strong>ntify the beginning of the cylin<strong>de</strong>rfiring sequence. The training target is a binary signal, according towhether a misfire had been artificially induced for the current cylin<strong>de</strong>rduring the previous engine cycle. This phasing enables the network tomake use of information contained in measured accelerations that followthe engine event being classified. We find that trained networks makeremarkably few classification errors, most of which occur during momentsof rapid acceleration or <strong>de</strong>celeration.2.8.5 Vehicle Emissions EstimationIncreasing levels of pollutants in the atmosphere – observed <strong>de</strong>spite theimposition of stricter emission st<strong>and</strong>ards <strong>and</strong> technological improvementsin emissions control systems – have led to mo<strong>de</strong>ls being <strong>de</strong>veloped topredict emissions inventories. These are typically based on the emissionslevels that are m<strong>and</strong>ated by the government for a particular drivingschedule <strong>and</strong> a given mo<strong>de</strong>l year. It has been found that the emissionsinventories based on these m<strong>and</strong>ated levels do not accurately reflect thosethat are actually found to exist. 
That is, actual emission rates depend heavily upon driving patterns, and real-world driving patterns are not comprehensively represented by the mandated driving schedules. To better assess the emissions that occur in practice and to predict emissions inventories, experiments have been conducted using instrumented vehicles that are driven in actual traffic. Unfortunately, such vehicles are costly and are difficult to operate and maintain.

We have found that recurrent neural networks can be trained to estimate instantaneous engine-out emissions from a small number of easily measured engine variables. Under the assumption of a properly operating fuel control system and catalytic converter, this leads to estimates of tailpipe emissions as well. This capability then allows one to estimate the sensitivity of emissions to driving style (e.g., aggressive versus conservative). Once trained, the network requires only information already available to the powertrain processor. Because of engine dynamics, we have found the use of recurrent networks trained by EKF methods to enable


accurate estimation of instantaneous emissions levels. We provide a detailed description of this application in [24].

2.9 DISCUSSION

We have presented in this chapter an overview of neural network training methods based on the principles of extended Kalman filtering. We summarize our findings by considering the virtues and limitations of these methods, and provide guidelines for implementation.

2.9.1 Virtues of EKF Training

The EKF family of training algorithms develops and employs second-order information during the training process using only first-order approximations. The use of second-order information, as embedded in the approximate error covariance matrix, which co-evolves with the weight vector during training, provides enhanced capabilities relative to first-order methods, both in terms of training speed and quality of solution. The amount of second-order information utilized is controlled by the level of decoupling, which is chosen on the basis of computational considerations. Thus, the computational complexity of the EKF methods can be scaled to meet the needs of specific applications.

We have found that EKF methods have enabled the training of recurrent neural networks, for both modeling and control of nonlinear dynamical systems. The sequential nature of the EKF provides advantages relative to batch second-order methods, since weight updates can be performed on an instance-by-instance basis with EKF training.
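As a concrete illustration, a single global EKF (GEKF) weight update can be sketched in a few lines. The recursion below follows the standard form of EKF network training (scaled gain computation, weight update, and Riccati covariance update with artificial process noise); all variable names, shapes, and parameter values are illustrative assumptions rather than the chapter's own code.

```python
# Minimal sketch of one instance-by-instance GEKF weight update for a
# network with M weights and No outputs. Illustrative only.
import numpy as np

def gekf_update(w, P, H, xi, eta=1.0, q=1e-4):
    """w: (M,) weights; P: (M,M) approximate error covariance;
    H: (M,No) Jacobian of outputs w.r.t. weights; xi: (No,) error vector;
    eta: learning rate; q: artificial process-noise level."""
    # Global scaling matrix A = [ (1/eta) I + H^T P H ]^{-1}
    A = np.linalg.inv(np.eye(H.shape[1]) / eta + H.T @ P @ H)
    K = P @ H @ A                                   # Kalman gain
    w_new = w + K @ xi                              # weight update
    P_new = P - K @ H.T @ P + q * np.eye(len(w))    # Riccati update + process noise
    return w_new, P_new
```

Each call consumes one training instance, which is exactly the sequential behavior contrasted with batch second-order methods above.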
On the other hand, the ability to batch multiple training instances with multistream EKF training provides a level of scalability in addition to that provided by decoupling. The sequential nature of the EKF, in both single- and multistream operation, provides a stochastic component that allows for more effective search of the weight space, especially when used in combination with artificial process noise.

The EKF methods are easily implemented in software, and there is substantial promise for hardware implementation as well. Methods for avoiding matrix inversions in the EKF have been developed, thereby enabling easy implementations. Finally, we believe that the greatest virtue of EKF training of neural networks is its established and proven applicability to a wide range of difficult modeling and control problems.
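The multistream idea can be made concrete with a small sketch: Jacobians and error vectors from several independently propagated streams are concatenated and applied in a single measurement update, as if the network had correspondingly many outputs. The shapes and names below are illustrative assumptions.

```python
# Hedged sketch of a multistream EKF update: Ns streams, each contributing
# an (M, No) Jacobian and an (No,) error vector, batched into one update.
import numpy as np

def multistream_update(w, P, H_streams, xi_streams, eta=1.0):
    H = np.concatenate(H_streams, axis=1)   # (M, Ns*No): streams batched as outputs
    xi = np.concatenate(xi_streams)         # (Ns*No,)
    A = np.linalg.inv(np.eye(H.shape[1]) / eta + H.T @ P @ H)
    K = P @ H @ A                           # gain for the batched update
    return w + K @ xi, P - K @ H.T @ P
```

The single matrix inverse now has dimension Ns·No, which is the computational price paid for batching streams into one update.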


2.9.2 Limitations of EKF Training

Perhaps the most significant limitation of EKF training is its limited applicability to cost functions other than minimization of the sum of squared errors. Although we have shown that other cost functions can be used (e.g., entropic measures), we are nevertheless restricted to those optimization problems that can be converted to a minimum-sum-of-squared-error criterion. On the other hand, many problems, particularly in control, require other optimization criteria. For example, in a portfolio optimization problem, we should like to maximize the total return over time. Converting such an optimization criterion to a sum-of-squared-errors criterion is usually not straightforward. However, we do not view the sum-of-squared-error optimization criterion as a limitation for most problems that can be viewed as belonging to the class of traditional supervised training problems.

The EKF procedures described in this chapter are derived on the basis of a first-order linearization of the nonlinear system; this may impose a limitation in the form of large errors in the weight estimates and covariance matrix, since the second-order information is effectively developed by taking outer products of the gradients. Chapter 7 introduces the unscented Kalman filter (UKF) as an alternative to the EKF. The UKF is expected to provide a more accurate means of developing the required second-order information than the EKF, without increasing the computational complexity.

2.9.3 Guidelines for Implementation and Use

1. Decoupling should be used when computation is a concern (e.g., for on-line applications).
Node and layer decoupling are the two most appropriate choices. Otherwise, we recommend the use of global EKF, regardless of network architecture, as it should be expected to find better solutions than any of the decoupled versions because of its use of full second-order information.

2. Effectively, two parameter values need to be chosen for training of networks with EKF methods. We assume that the approximate error covariance matrices are always initialized with diagonal values of 100 and 1,000 for weights corresponding to nonlinear and linear nodes, respectively. Then, the user of these methods must set values for the learning rate and process-noise term according to characteristics of the training problem.


3. Training of recurrent networks, either as supervised training tasks or for controller training, can often be improved by multistreaming. The choice of the number of streams is dictated by problem characteristics.

4. Matrix inversions can be avoided by use of sequential EKF update procedures. In the case of decoupling, the order in which outputs are processed can affect training performance in detail. We recommend that outputs be processed in random order when these methods are used.

5. Square-root filtering can be employed to ensure computational stability for the error covariance update equation. However, the use of square-root filtering with artificial process noise for covariance updates results in a substantial increase in computational complexity. We have noted that nonzero artificial process noise benefits training, by providing a mechanism to escape poor local minima and a mechanism that maintains stable covariance updates when using the Riccati update equation. We recommend that square-root filtering only be employed when no artificial process noise is used (and only for GEKF).

6. The EKF procedures can be modified to allow for alternative cost functions (e.g., entropic cost functions) and for weight constraints to be imposed during training, which thereby allows networks to be deployed in fixed-point arithmetic.

REFERENCES

[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning representations by back-propagating errors," Nature, 323, 533–536 (1986).

[2] S. Singhal and L. Wu, "Training multilayer perceptrons with the extended Kalman algorithm," in D.S. Touretzky, Ed., Advances in Neural Information Processing Systems 1. San Mateo, CA: Morgan Kaufmann, 1989, pp. 133–140.

[3] G.V.
Puskorius and L.A. Feldkamp, "Decoupled extended Kalman filter training of feedforward layered networks," in Proceedings of the International Joint Conference on Neural Networks, Seattle, WA, 1991, Vol. 1, pp. 771–777.

[4] G.V. Puskorius and L.A. Feldkamp, "Automotive engine idle speed control with recurrent neural networks," in Proceedings of the 1993 American Control Conference, San Francisco, CA, pp. 311–316.


[5] G.V. Puskorius, L.A. Feldkamp, and L.I. Davis, Jr., "Dynamic neural network methods applied to on-vehicle idle speed control," Proceedings of the IEEE, 84, 1407–1420 (1996).

[6] R.J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, 1, 270–280 (1989).

[7] G.V. Puskorius and L.A. Feldkamp, "Neurocontrol of nonlinear dynamical systems with Kalman filter-trained recurrent networks," IEEE Transactions on Neural Networks, 5, 279–297 (1994).

[8] P.J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, 78, 1550–1560 (1990).

[9] G.V. Puskorius and L.A. Feldkamp, "Extensions and enhancements of decoupled extended Kalman filter training," in Proceedings of the 1997 International Conference on Neural Networks, Houston, TX, Vol. 3, pp. 1879–1883.

[10] L.A. Feldkamp and G.V. Puskorius, "Training controllers for robustness: multi-stream DEKF," in Proceedings of the IEEE International Conference on Neural Networks, Orlando, FL, 1994, Vol. IV, pp. 2377–2382.

[11] L.A. Feldkamp and G.V. Puskorius, "Training of robust neurocontrollers," in Proceedings of the 33rd IEEE International Conference on Decision and Control, Orlando, FL, 1994, Vol. III, pp. 2754–2760.

[12] L.A. Feldkamp and G.V. Puskorius, "A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering and classification," Proceedings of the IEEE, 86, 2259–2277 (1998).

[13] F.
Heimes, "Extended Kalman filter neural network training: experimental results and algorithm improvements," in Proceedings of the 1998 IEEE Conference on Systems, Man, and Cybernetics, Orlando, FL, pp. 1639–1644.

[14] L.A. Feldkamp and G.V. Puskorius, "Fixed weight controller for multiple systems," in Proceedings of the 1997 IEEE International Conference on Neural Networks, Houston, TX, Vol. 2, pp. 773–778.

[15] G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. Baltimore, MD: The Johns Hopkins University Press, 1989.

[16] G.V. Puskorius and L.A. Feldkamp, "Avoiding matrix inversions for the decoupled extended Kalman filter training algorithm," in Proceedings of the World Congress on Neural Networks, Washington, DC, 1995, pp. I-704–I-709.

[17] E.S. Plumer, "Training neural networks using sequential extended Kalman filtering," in Proceedings of the World Congress on Neural Networks, Washington, DC, 1995, pp. I-764–I-769.

[18] P. Sun and K. Marko, "The square root Kalman filter training of recurrent neural networks," in Proceedings of the 1998 IEEE Conference on Systems, Man, and Cybernetics, Orlando, FL, pp. 1645–1651.


[19] S. Haykin, Adaptive Filter Theory, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall.

[20] E.W. Saad, D.V. Prokhorov, and D.C. Wunsch III, "Comparative study of stock trend prediction using time delay, recurrent and probabilistic neural networks," IEEE Transactions on Neural Networks, 9, 1456–1470 (1998).

[21] K.-C. Jim, C.L. Giles, and B.G. Horne, "An analysis of noise in recurrent neural networks: convergence and generalization," IEEE Transactions on Neural Networks, 7, 1424–1438 (1996).

[22] D.L. Elliot, "A better activation function for artificial neural networks," Institute for Systems Research, University of Maryland, Technical Report TR93-8, 1993.

[23] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.

[24] G. Jesion, C.A. Gierczak, G.V. Puskorius, L.A. Feldkamp, and J.W. Butler, "The application of dynamic neural networks to the estimation of feedgas vehicle emissions," in Proceedings of the 1998 International Joint Conference on Neural Networks, Anchorage, AK, pp. 69–73.


3 LEARNING SHAPE AND MOTION FROM IMAGE SEQUENCES

Gaurav S. Patel
Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada

Sue Becker and Ron Racine
Department of Psychology, McMaster University, Hamilton, Ontario, Canada (beckers@mcmaster.ca)

3.1 INTRODUCTION

In Chapter 2, Puskorius and Feldkamp described a procedure for the supervised training of a recurrent multilayer perceptron – the node-decoupled extended Kalman filter (NDEKF) algorithm. We now use this model to deal with high-dimensional signals: moving visual images. Many complexities arise in visual processing that are not present in one-dimensional prediction problems: the scene may be cluttered with background objects, the object of interest may be occluded, and the system may have to deal with tracking differently shaped objects at different times. The problem we have dealt with initially is tracking objects that vary in both shape and location. Tracking differently shaped objects is challenging for a system that begins by performing local feature extraction, because the features of two different objects may appear identical locally even though the objects differ in global shape (e.g., squares versus rectangles). However, adequate tracking may still be achievable without a perfect three-dimensional model of the object, using locally extracted features as a starting point, provided there is continuity between image frames.

Our neural network model is able to make use of short-term continuity to track a range of different geometric shapes (circles, squares, and triangles). We evaluate the model's abilities in three experiments. In the first experiment, the model was trained on images of two different moving shapes, where each shape had its own characteristic movement trajectory. In the second experiment, the training set was made more difficult by adding a third object, which also had a unique motion trajectory. In the third and final experiment, the restriction of one direction of motion per shape was lifted. Thus, the model experienced the same shape traveling in different trajectories, as well as different shapes traveling in the same trajectory.
Even under these conditions, the model was able to learn to track a given shape for many time steps and anticipate both its shape and location many time steps into the future.

3.2 NEUROBIOLOGICAL AND PERCEPTUAL FOUNDATIONS

The architecture of our model is motivated by two key anatomical features of the mammalian neocortex: the extensive use of feedback connections, and the hierarchical multiscale structure. We discuss briefly the evidence for, and benefits of, each of these in turn.

Feedback is a ubiquitous feature of the brain, both between and within cortical areas. Whenever two cortical areas are interconnected, the connections tend to be bidirectional [1]. Additionally, within every neocortical area, neurons within the superficial layers are richly interconnected laterally via a network of horizontal connections [2]. The dense web of feedback connections within the visual system has been shown to be important in suppressing background stimuli and amplifying salient or foreground stimuli [3]. Feedback is also likely to play an important role in processing sequences. Clearly, we view the world as a continuously


varying sequence rather than as a disconnected collection of snapshots. Seeing the world in this way allows recent experience to play a role in the anticipation or prediction of what will come next. The generation of predictions in a perceptual system may serve at least two important functions: (1) To the extent that an incoming sensory signal is consistent with expectations, intelligent filtering may be done to increase the signal-to-noise ratio and resolve ambiguities using context. (2) When the signal violates expectations, an organism can react quickly to such changing or salient conditions by de-emphasizing the expected part of the signal and devoting more processing capacity to the unexpected information. Top-down connections between processing layers, or lateral connections within layers, or both, might be used to accomplish this. Lateral connections allow local constraints about moving contours to guide one's expectations, and this is the basis for our model.

Prediction in a high-dimensional space is computationally complex in a fully connected network architecture. The problem requires a more constrained network architecture that will reduce the number of free parameters. The visual system has done just that. In the earliest stages of processing, cells' receptive fields span only a few degrees of visual angle, while in higher visual areas, cells' receptive fields span almost the entire visual field (for a review, see [4]).
Therefore, we designed our model network with a similar hierarchical architecture, in which the first layer of units was connected to relatively small, local regions of the image and a subsequent layer spanned the entire visual field (see Figure 3.1).

3.3 NETWORK DESCRIPTION

Prediction in a high-dimensional space such as a 50 x 50 pixel image, using a fully connected recurrent network, is not feasible, because the number of connections is typically one or more orders of magnitude larger than the dimensionality of the input, and the NDEKF training procedure requires adapting these parameters for typically hundreds to thousands of iterations. The problem requires a more constrained network architecture that will reduce the number of free parameters. Motivated by the hierarchical architecture of real visual systems, we designed our model network with a similar hierarchical architecture, in which the first layer of units was connected to relatively small, local 5 x 5 pixel regions of the image and a subsequent layer spanned the entire visual field (see Figure 3.1).
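The local receptive-field decomposition can be sketched directly: a square image is split into non-overlapping patches, each destined for its own bank of first-hidden-layer units. The function below is our illustrative assumption; the 5 x 5 patch size matches the regions described above.

```python
# Sketch of splitting an image into non-overlapping local receptive fields,
# each flattened into a vector for its bank of first-layer units.
import numpy as np

def receptive_fields(image, patch=5):
    """Split a square image into non-overlapping patch x patch receptive
    fields, each returned as a flattened vector, in row-major order."""
    n = image.shape[0]
    return [image[r:r + patch, c:c + patch].ravel()
            for r in range(0, n, patch)
            for c in range(0, n, patch)]
```

For a 10 x 10 image this yields four 25-element vectors, one per quadrant.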


Figure 3.1 A diagram of the network used. The numbers in the boxes indicate the number of units in each layer or module, except in the input layer, where the receptive fields are numbered 1, ..., 4. Local receptive fields of size 5 x 5 at the input are fed to the four banks of four units in the first hidden layer. The second layer of eight units then combines these local features learned by the first hidden layer. Note the recurrence in the second hidden layer.

A four-layer network of size 100-16-8R-100, as depicted in Figures 3.1a and 3.1b, was used in the following experiments. Training images of size 10 x 10, arranged in a vector format of size 100 x 1, were used to form the input to the networks. As depicted in Figure 3.1a, the input image is divided into four non-overlapping receptive fields of size 5 x 5. Further, the 16 units in the first hidden layer are divided into four banks of four units each. Each of the four units within a bank receives inputs from one of the four receptive fields. This describes how the 10 x 10 image is connected to the 16 units in the first hidden layer. Each of these 16 units feeds into a second hidden layer of 8 units. The second hidden layer has recurrent connections (note that the recurrence is only within the layer and not between layers).
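A forward pass through this 100-16-8R-100 architecture can be sketched as follows. The banked first layer, within-layer recurrence, and full-image output follow the description above, while the tanh activations, linear output, exact weight shapes, and absence of bias terms are simplifying assumptions of ours.

```python
# Hedged sketch of one time step of the 100-16-8R-100 network.
import numpy as np

def forward(fields, h_prev, W1, W2, Wr, W3):
    """fields : four 25-vectors (the 5x5 receptive fields of a 10x10 image)
    h_prev : (8,) recurrent hidden state from the previous time step
    W1     : list of four (4, 25) bank weight matrices (16 units total)
    W2     : (8, 16) first-to-second hidden layer weights
    Wr     : (8, 8) within-layer recurrent weights
    W3     : (100, 8) output weights producing the predicted next image"""
    # Each bank of four units sees only its own receptive field.
    z1 = np.concatenate([np.tanh(Wb @ f) for Wb, f in zip(W1, fields)])
    # Recurrence only within the second hidden layer, not between layers.
    h = np.tanh(W2 @ z1 + Wr @ h_prev)
    return W3 @ h, h  # predicted 100-pixel image, new hidden state
```

In multistep (autonomous) operation, the returned 100-vector would be reshaped to 10 x 10, split into receptive fields again, and fed back as the next input.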


3.4 EXPERIMENT 1 73Thus, the input layer of the network is connected to small <strong>and</strong> localregions of the image. The first layer processes these local receptive fieldsseparately, in an effort to extract relevant local features. These features arethen combined by the second hid<strong>de</strong>n layer to predict the next image in thesequence. The predicted image is represented at the output layer. Theprediction error is then used in the EKF equations to update the weights.This process is repeated over several epochs through the training imagesequences until a sufficiently small incremental mean-squared error isobtained.3.4 EXPERIMENT 1In the first experiment, the mo<strong>de</strong>l is trained on images of two differentmoving shapes, where each shape has its own characteristic movement,that is, shape <strong>and</strong> direction of movement are perfectly correlated. Thesequence of eight 10 10 pixel images in Figure 3.2a is used to train afour-layered (100-16-8R-100) network to make one-step predictions of theimage sequence. In the first four time steps, a circle moves upward withinthe image; <strong>and</strong> in the last four time steps, a triangle moves downwardFigure 3.2 Experiment 1: one-step <strong>and</strong> iterated prediction of imagesequence. (a) Training sequence used. (b) One-step prediction. (c) multistepprediction. In (b) <strong>and</strong> (c), the three rows correspond to input, prediction,<strong>and</strong> error, respectively.


within the image. At each time step, the network is presented with one of the eight 10 × 10 images as input (divided into four 5 × 5 receptive fields as described above), and generates in its output layer a prediction of the input at the next time step, but it is always given the correct input at the next time step. Training was stopped after 20 epochs through the training sequence. Figure 3.2b shows the network operating in one-step prediction mode on the training sequence after training. It makes excellent predictions of the object shape and also of its motion. Figure 3.2c shows the network operating in an autonomous mode after being shown only the first image of the sequence. In this multistep prediction case, the network is given external input only at the first time step in the sequence. Beyond the first time step, the network is given its prediction from time t − 1 as its input at time t, which could potentially lead to a buildup of prediction errors over many time steps. This shows that the network has reconstructed the entire dynamics to which it was exposed during training, when provided with only the first image. This is indeed a difficult task.
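The multistep (autonomous) mode amounts to feeding each prediction back as the next input. A minimal sketch of that loop, assuming a one-step predictor `step(image, state)` of the kind described above (the name and signature are illustrative, not from the chapter):

```python
import numpy as np

def iterate_prediction(step, first_image, n_steps, state_dim=8):
    """Closed-loop prediction: seed with the first image only, then feed
    the network's own output back as its next input."""
    preds = []
    state = np.zeros(state_dim)      # recurrent units initialized to zero
    x = first_image
    for _ in range(n_steps):
        y, state = step(x, state)    # one-step prediction + updated recurrent state
        preds.append(y)
        x = y                        # prediction at time t-1 becomes input at time t
    return preds

# Toy stand-in predictor: cyclically shifts the "image", leaves the state unchanged.
toy = lambda x, s: (np.roll(x, 1), s)
out = iterate_prediction(toy, np.arange(4.0), n_steps=3, state_dim=1)
print(out[-1])  # [1. 2. 3. 0.]
```

Because each output becomes the next input, any residual error is re-injected at every step, which is why the errors in Figure 3.2c grow as the iteration proceeds.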
It is seen that, as the iterative prediction proceeds, the residual errors (the third row in Figure 3.2c) are amplified at each step.

3.5 EXPERIMENT 2

Next, an ND-EKF network with the same 100-16-8R-100 architecture used in Experiment 1 was trained with three sequences, each consisting of four images, in the following order:

- circle moving right and up;
- triangle moving right and down;
- square moving right and up.

During training, at the beginning of each sequence, the network states were initialized to zero, so that the network would not learn the order of presentation of the sequences. The network was therefore expected to learn the motions associated with each of the three shapes, and not the order of presentation of the shapes.

During testing, the order of presentation of the three sequences was varied, as shown in Figure 3.3a. The trained network does well at the task of one-step prediction, failing only momentarily at the transition points where we switch between sequences. It is important to note that one-step prediction, in this case, is a difficult and challenging task because the network has to determine (1) what shape is present and (2) which direction it is moving in, without direct knowledge of inputs some time in the past. In order to


make good predictions, it must rely on its recurrent or feedback connections, which play a crucial role in the present model.

We also tested the model on a set of occluded images, that is, images with regions that are intentionally zeroed. Remarkably, the network makes correct one-step predictions even in the presence of occlusions, as shown in Figure 3.3b. In addition, the predictions do not contain occlusions; that is, the occluded regions are correctly filled in, demonstrating the robustness of the model to occlusions. In Figure 3.3c, when the network is presented with

Figure 3.3 Experiment 2: one-step prediction of image sequences using the trained network. (a) Various combinations of sequences used in training. (b) Same sequences as in (a), but with occlusions. (c) Prediction on some sequences not seen during training. The three rows in each image correspond to input, prediction, and error, respectively.


sequences that it had not been exposed to during training, a larger residual error is obtained, as expected. However, the network is still capable of identifying the shape and motion, although not as accurately as before.

3.6 EXPERIMENT 3

In Experiment 1, the network was presented with short sequences (four images) of only two shapes (circle and triangle), and in Experiment 2 an extra shape (square) was added. In Experiment 3, to make the learning task even more challenging, the length of the sequences was increased to 10 and the restriction of one direction of motion per shape was lifted. Specifically, each shape was permitted to move right and either up or down. Thus, the network was exposed to different shapes traveling in similar directions and also to the same shape traveling in different directions, increasing the total number of images presented to the network from 8 images in Experiment 1 and 12 images in Experiment 2 to 100 images in this experiment. In effect, there is a substantial increase in the number of learning patterns, and thus a substantial increase in the complexity of the learning task. However, since the number of weights in the network is limited and remains the same as in the other experiments, the network cannot simply memorize the sequences.

We trained a network of the same 100-16-8R-100 architecture on six sequences, each consisting of 10 images (see Fig.
3.4) in the following order:

- circle moving right and up;
- square moving right and down;
- triangle moving right and up;
- circle moving right and down;
- square moving right and up;
- triangle moving right and down.

Training was performed in a similar manner to Experiment 2. During testing, the order of presentation of the six sequences was varied; several examples are shown in Figure 3.5. As in the previous experiments, even with the larger number of training patterns, the network is able to predict the correct motion of the shapes, failing only during transitions between shapes. It is able to distinguish between the same shape moving in different directions as well as different shapes moving in the same direction, using context available via the recurrent connections.


Figure 3.4 Experiment 3: six image sequences used for training.

The failure of the model to make accurate predictions at transitions between shapes can also be seen in the residual error that is obtained during prediction. The residual error in the predicted image is quantified by calculating the mean-squared prediction error, as shown in Figure 3.6. The figure shows how the mean-squared prediction error varies as the prediction continues. Note the transient increase in error at transitions between shapes.

3.7 DISCUSSION

In this chapter, we have dealt with time-series prediction of high-dimensional signals: moving visual images. This situation is much more


complicated than the one-dimensional case, in that the system has to deal with simultaneous shape and motion prediction. The network was trained by the EKF method to perform one-step prediction of image sequences in a specific order. Then, during testing, the order of the sequences was varied and the network was asked to predict the correct shape and location of the next image in the sequence. The complexity of the problem was increased from Experiments 1 to 3 as we introduced occlusions, increased both the length of the training sequences and the number of shapes presented, and allowed shape and motion to vary independently. In all

Figure 3.5 Experiment 3: one-step prediction of image sequences using the trained network. The three rows in each image correspond to input, prediction, and error, respectively.


cases, the network was able to predict the correct motion of the shapes, failing only momentarily at transitions between shapes.

Figure 3.6 Mean-squared prediction error in one-step prediction of image sequences using the trained network. The graphs show how the mean-squared prediction error varies as the prediction progresses. Notice the increase in error at transitions between shapes.

The network described here is a first step toward modeling the mechanisms by which the human brain might simultaneously recognize and track moving stimuli. Any attempt to model both shape and motion processing simultaneously within a single network may seem to be at odds with the well-established finding that shape and spatial information are processed in separate pathways of the visual system [5]. An extreme version of this view posits that form-related features are processed strictly by the ventral "what" pathway and motion features are processed strictly


by the dorsal "where" pathway. Anatomically, however, there are cross-connections between the two pathways at several points [6]. Furthermore, there is ample behavioral evidence that the processes of shape and motion perception are not completely separate. For example, it has long been established that we are able to infer shape from motion (see, e.g., [7]). Conversely, under certain conditions, object recognition can be shown to drive motion perception [8]. In addition, Stone [9] has shown that viewers are much better at recognizing objects when they are moving in characteristic, familiar trajectories as compared with unfamiliar trajectories. These data suggest that when shape and motion are tightly correlated, viewers will learn to use them together to recognize objects. This is exactly what happens in our model.

To accomplish temporal processing in our model, we have incorporated within-layer recurrent connections in the architecture used here. Another possibility would be to incorporate top-down recurrent connections. A key anatomical feature of the visual system is top-down feedback between visual areas [3]. Top-down connections could allow global expectations about the three-dimensional shape of a moving object to guide predictions. Thus, an important direction for future work is to extend the model to allow top-down feedback. Rao and Ballard [10] have proposed an alternative neural network implementation of the EKF that employs top-down feedback between layers, and have applied their model to both static images and time-varying image sequences [10, 11].
Other models of cortical feedback for modeling the generation of expectations have also been proposed (see, e.g., [12, 13]).

Natural visual systems can deal with an enormous space of possible images, under widely varying viewing conditions. Another important direction for future work is to extend our model to deal with more realistic images. Many additional complexities arise in natural images that were not present in the artificial image sequences used here. For example, the simultaneous presence of both foreground and background objects may hinder the prediction accuracy. Natural visual systems likely use attentional filtering and binding strategies to alleviate this problem; for example, Moran and Desimone [14] have observed cells that show a suppressed neural response to a preferred stimulus if it is unattended and in the presence of an attended stimulus. Another simplification of our images is that shape remained constant for many time frames, whereas for real three-dimensional objects, the shape projected onto a two-dimensional image may change dramatically over time, because of rotations as well as non-rigid motions (e.g., bending). Humans are able to infer three-dimensional shape from non-rigid motion, even from highly impoverished stimuli such


as moving light displays [7]. It is likely that the architecture described here could handle changes in shape, provided shape changes predictably and gradually over time.

REFERENCES

[1] D.J. Felleman and D.C. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex", Cerebral Cortex, 1, 1–47 (1991).
[2] J.S. Lund, Q. Wu and J.B. Levitt, "Visual cortex cell types and connections", in M.A. Arbib, Ed., Handbook of Brain Theory and Neural Networks, Cambridge, MA: MIT Press, 1995.
[3] J.M. Hupé, A.C. James, B.R. Payne, S.G. Lomber, P. Girard and J. Bullier, "Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons", Nature, 394, 784–787 (1998).
[4] M.W. Oram and D.I. Perrett, "Modeling visual recognition from neurobiological constraints", Neural Networks, 7, 945–972 (1994).
[5] M. Mishkin, L.G. Ungerleider and K.A. Macko, "Object vision and spatial vision: Two cortical pathways", Trends in Neurosciences, 6, 414–417 (1983).
[6] E.A. De Yoe and D.C. Van Essen, "Concurrent processing streams in monkey visual cortex", Trends in Neurosciences, 11, 219–226 (1988).
[7] G. Johanssen, "Visual perception of biological motion and a model for its analysis", Perception and Psychophysics, 14, 201–211 (1973).
[8] V.S. Ramachandran, C. Armel, C. Foster and R. Stoddard, "Object recognition can drive motion perception", Nature, 395, 852–853 (1998).
[9] J.V.
Stone, "Object recognition: View-specificity and motion-specificity", Vision Research, 39, 4032–4044 (1999).
[10] R.P.N. Rao and D.H. Ballard, "Dynamic model of visual recognition predicts neural response properties in the visual cortex", Neural Computation, 9(4), 721–763 (1997).
[11] R.P.N. Rao, "Correlates of attention in a model of dynamic visual recognition", in M.I. Jordan, M.J. Kearns and S.A. Solla, Eds., Advances in Neural Information Processing Systems, Vol. 10. Cambridge, MA: MIT Press, 1998.
[12] E. Harth, K.P. Unnikrishnan and A.S. Panday, "The inversion of sensory processing by feedback pathways: A model of visual cognitive functions", Science, 237, 184–187 (1987).
[13] D. Mumford, "On the computational architecture of the neocortex", Biological Cybernetics, 65, 135–145 (1991).
[14] J. Moran and R. Desimone, "Selective attention gates visual processing in the extrastriate cortex", Science, 229, 782–784 (1985).


4 CHAOTIC DYNAMICS

Gaurav S. Patel
Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada

Simon Haykin
Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada
(haykin@mcmaster.ca)

4.1 INTRODUCTION

In this chapter, we consider another application of the extended Kalman filter recurrent multilayer perceptron (EKF-RMLP) scheme: the modeling of a chaotic time series, or one that could potentially be chaotic. The generation of a chaotic process is governed by a coupled set of nonlinear differential or difference equations. The hallmark of a chaotic process is sensitivity to initial conditions, which means that if the starting point of motion is perturbed by a very small increment, the deviation in

Kalman Filtering and Neural Networks, Edited by Simon Haykin
ISBN 0-471-36998-5 © 2001 John Wiley & Sons, Inc.


Table 4.1 Summary of data sets used in the study

                Network      Training   Testing   Sampling      Largest Lyapunov    Correlation
                size         length     length    frequency     exponent λ_max      dimension
                                                  f_s (Hz)      (nats/sample)       D_ML
  Logistic      6-4R-2R-1     5,000     25,000        1             0.69               1.04
  Ikeda         6-6R-5R-1     5,000     25,000        1             0.354              1.51
  Lorenz        3-8R-7R-1     5,000     25,000       40             0.040              2.09
  NH3 laser     9-10R-8R-1    1,000      9,000        1^a           0.147              2.01
  Sea clutter   6-8R-7R-1    40,000     10,000     1000             0.228              4.69

^a The sampling frequency for the laser data was not known. It was assumed to be 1 Hz for the Lyapunov exponent calculations.

the resulting waveform, compared to the original waveform, increases exponentially with time. Consequently, unlike an ordinary deterministic process, a chaotic process is predictable only in the short term.

Specifically, we consider five data sets, categorized as follows:

- The logistic map, Ikeda map, and Lorenz attractor, whose dynamics are governed by known equations; the corresponding time series can therefore be numerically generated by using the known equations of motion.

- Laser intensity pulsations and sea clutter (i.e., radar backscatter from an ocean surface), whose underlying equations of motion are unknown; in this second case, the data are obtained from real-life experiments.

Table 4.1 shows a summary of the data sets used for model validation. The table also shows the lengths of the data sets used, and their division into the training and test sets, respectively. Also shown are a partial summary of the dynamic invariants for each of the data sets used and the size of the network used for modeling the dynamics of each set.

4.2 CHAOTIC (DYNAMIC) INVARIANTS

The correlation dimension is a measure of the complexity of a chaotic process [1].
This chaotic invariant is always a fractal number, which is one reason for referring to a chaotic process as a "strange" attractor. The other


chaotic invariants, the Lyapunov exponents, are, in part, responsible for the sensitivity of the process to initial conditions, the occurrence of which requires having at least one positive Lyapunov exponent. The horizon of predictability (HOP) of the process is determined essentially by the largest positive Lyapunov exponent [1]. Another useful parameter of a chaotic process is the Kaplan–Yorke dimension, or Lyapunov dimension, which is defined in terms of the Lyapunov spectrum by

    D_{KY} = K + \frac{\sum_{i=1}^{K} \lambda_i}{|\lambda_{K+1}|},        (4.1)

where the λ_i are the Lyapunov exponents arranged in decreasing order and K is the largest integer for which the following inequalities hold:

    \sum_{i=1}^{K} \lambda_i \ge 0    and    \sum_{i=1}^{K+1} \lambda_i < 0.

Typically, the Kaplan–Yorke dimension is close in numerical value to the correlation dimension. Yet another byproduct of the Lyapunov spectrum is the Kolmogorov entropy, which provides a measure of the information generated due to sensitivity to initial conditions. It can be calculated as the sum of all the positive Lyapunov exponents of the process. The chaotic invariants were estimated as follows:

1. The correlation dimension was estimated using an algorithm based on the method of maximum likelihood [2]; hence the notation D_ML for the correlation dimension.

2. The Lyapunov exponents were estimated using an algorithm involving the QR-decomposition applied to a Jacobian that pertains to the underlying dynamics of the time series.

3. The Kolmogorov entropy was estimated directly from the time series using an algorithm based on the method of maximum likelihood [2]; hence the notation KE_ML for the Kolmogorov entropy so estimated. The indirect estimate of the Kolmogorov entropy from the Lyapunov spectrum is denoted by KE_LE.
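Equation (4.1) and the Lyapunov-spectrum estimate of the Kolmogorov entropy, KE_LE, are easily computed from a given spectrum. A small sketch (the example spectrum is invented for illustration):

```python
def kaplan_yorke(spectrum):
    """Kaplan-Yorke (Lyapunov) dimension, Eq. (4.1): largest K with a
    non-negative partial sum of exponents, plus the fractional term."""
    lams = sorted(spectrum, reverse=True)   # decreasing order
    total, K = 0.0, 0
    for lam in lams:
        if total + lam < 0:
            break
        total += lam
        K += 1
    if K == len(lams):
        return None                         # sum never goes negative: D_KY undefined
    return K + total / abs(lams[K])

def kolmogorov_entropy_le(spectrum):
    """Indirect Kolmogorov entropy KE_LE: sum of the positive exponents."""
    return sum(lam for lam in spectrum if lam > 0)

spec = [0.9, 0.0, -1.2]                     # illustrative spectrum (nats/sample)
print(kaplan_yorke(spec))                   # 2 + 0.9/1.2 = 2.75
print(kolmogorov_entropy_le(spec))          # 0.9
```

Note the `None` branch: for a spectrum with no negative sum, such as the logistic map's single positive exponent discussed later, D_KY cannot be calculated, exactly as the text observes.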


4.3 DYNAMIC RECONSTRUCTION

The attractor of a dynamical system is constructed by plotting the evolution of the state vector in state space. This construction is possible when we have access to every state variable of the system. In practical situations dealing with dynamical systems of unknown state-space equations, however, all that we have available is a set of measurements taken from the system. Given such a situation, we may raise the following question: Is it possible to reconstruct the attractor of a system (with many state variables) using a single time series of measurements? The answer to this question is an emphatic yes; it was first illustrated by Packard et al. [3], and then given a firm mathematical foundation by Takens [4] and Mañé [5]. In essence, the celebrated Takens embedding theorem guarantees that, by applying the delay-coordinate method to the measurement time series, the original dynamics can be reconstructed under certain assumptions. In the delay-coordinate method (sometimes referred to as the method of delays), delay-coordinate vectors are formed using time-delayed values of the measurements, as shown here:

    \mathbf{s}(n) = [s(n), s(n - \tau), \ldots, s(n - (d_E - 2)\tau), s(n - (d_E - 1)\tau)]^T,

where d_E is called the embedding dimension and τ is known as the embedding delay, taken to be some suitable multiple of the sampling time t_s. By means of such an embedding, it is possible to reconstruct the true dynamics using only one measurement.
Takens' theorem assumes the existence of d_E and τ such that the mapping from s(n) to s(n + τ) is possible. The concept of dynamic reconstruction using delay-coordinate embedding is very elegant, because we can use it to build a model of a nonlinear dynamical system, given a set of measured data on the system. We can use it to "reverse-engineer" the dynamics, that is, use the time series to deduce characteristics of the physical system that was responsible for its generation. Put another way, the reconstruction of the dynamics from a time series is in reality an ill-posed inverse problem. The direct problem is: given the dynamics, describe the iterates; the inverse problem is: given the iterates, describe the dynamics. The inverse problem is ill-posed because, depending on the quality of the data, a solution may not be stable, may not be unique, or may not even exist. One way to make the problem well-posed is to include prior knowledge about the input–output mapping. In effect, the use of delay-coordinate embedding inserts some prior knowledge into the model, since the embedding parameters are determined from the data.
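Forming the delay-coordinate vectors s(n) is a simple slicing operation. A sketch, assuming a scalar measurement series:

```python
import numpy as np

def delay_embed(series, d_E, tau):
    """Build delay-coordinate vectors
    s(n) = [s(n), s(n - tau), ..., s(n - (d_E - 1) tau)]^T,
    one row per valid n."""
    series = np.asarray(series)
    n0 = (d_E - 1) * tau                    # first index with a full delayed history
    rows = [series[n - np.arange(d_E) * tau] for n in range(n0, len(series))]
    return np.vstack(rows)

x = np.arange(10)
V = delay_embed(x, d_E=3, tau=2)
print(V[0])   # [4 2 0]: s(4), s(4-2), s(4-4)
```

These rows are exactly the input vectors fed to the RMLP in the experiments below; for example, d_E = 6 taps explain the "6" in the 6-4R-2R-1 and 6-6R-5R-1 network sizes.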


To estimate the embedding delay τ, we used the method of mutual information proposed by Fraser [6]. According to this method, the embedding delay is determined by finding the particular delay for which the mutual information between the observable time series and its delayed version is minimized for the first time. Given such an embedding delay, we can construct a delay-coordinate vector whose adjacent samples are as statistically independent as possible.

To estimate the embedding dimension d_E, we use the method of false nearest neighbors [1]; the embedding dimension is the smallest integer dimension that unfolds the attractor.

4.4 MODELING NUMERICALLY GENERATED CHAOTIC TIME SERIES

4.4.1 Logistic Map

In this experiment, the EKF-RMLP-based modeling scheme is applied to the logistic map (also known as the quadratic map), which was first used as a model of population growth. The logistic map is described by the difference equation

    x(k + 1) = a x(k)[1 - x(k)],        (4.2)

where the nonlinearity parameter a is chosen to be 4.0 so that the produced behavior is chaotic. The logistic map exhibits deterministic chaos in the interval a ∈ (3.5699, 4]. An initial value of x(0) = 0.5 was used, and 35,000 points were generated, of which the first 5,000 points were discarded, leaving a data set of 30,000 samples. A training set, consisting of the first 5,000 samples, was used to train an RMLP on a one-step prediction task by means of the EKF method. An RMLP configuration of 6-4R-2R-1, which has a total of 61 weights including the bias terms, was selected for this modeling problem.
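The data-generation procedure just described (35,000 iterates of Eq. (4.2), discard the first 5,000, split 5,000/25,000) can be sketched as follows. One caveat: the text quotes x(0) = 0.5, but for a = 4 that value maps 0.5 → 1 → 0, a fixed point, so the sketch assumes a nearby seed off that special orbit.

```python
import numpy as np

def logistic_series(n, a=4.0, x0=0.3, discard=5000):
    """Iterate x(k+1) = a*x(k)*(1 - x(k)), discarding the initial transient.
    x0 = 0.3 is an assumption; the text's x(0) = 0.5 lands on the
    fixed point 0 after two steps when a = 4."""
    x = x0
    out = np.empty(n + discard)
    for k in range(n + discard):
        out[k] = x
        x = a * x * (1.0 - x)
    return out[discard:]

data = logistic_series(30_000)              # 35,000 points, first 5,000 discarded
train, test = data[:5000], data[5000:]      # training/test split used in the text
print(len(train), len(test))                # 5000 25000
```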
The training converged after only 5 epochs, and a sufficiently low MSE was achieved, as shown in Figure 4.1.

Open-Loop Evaluation   A test set, consisting of the unexposed 25,000 samples, was used to evaluate the performance of the network at the task of one-step prediction as well as recursive prediction. Figure 4.2a shows the one-step prediction performance of the network on a short portion of the test data. It is visually observed that the two curves are


almost identical. Also, for numerical one-step performance evaluation, the signal-to-error ratio (SER) is used. This measure, expressed in decibels, is defined by

    SER = 10 \log_{10} \frac{MSS}{MSE},        (4.3)

where MSS is the mean-squared value of the actual test data and MSE is the mean-squared value of the prediction error at the output. MSS is found to be 0.374 for the 25,000-sample test sequence, and the MSE is found to be 1.09 × 10^{-5} for the trained RMLP network's prediction error. This gives an SER of 45.36 dB, which is certainly impressive, because it means that the power of the one-step prediction error over the 25,000 test samples is many times smaller than the power of the signal.

Figure 4.1 Training MSE versus epochs for the logistic map.

Closed-Loop Evaluation   To evaluate the autonomous behavior of the network, its node outputs are first initialized to zero; the network is then seeded with points selected from the test data, and passed through a priming phase in which it operates in one-step mode for p_l = 30 steps. At the end of priming, the network's output is fed back to its input, and autonomous


Figure 4.2 Results for the dynamic reconstruction of the logistic map. (a) One-step prediction. (b) Iterated prediction. (c) Attractor of original signal. (d) Attractor of iteratively reconstructed signal.


operation begins. At this point, the network is operating on its own without further inputs, and the task asked of the network is indeed challenging. The autonomous behavior of the network, which begins after priming, is shown in Figure 4.2b. It is observed that the predictions closely follow the actual data for about 5 steps on average [which is close to the theoretical horizon of predictability (HOP) of 5 calculated from the Lyapunov spectrum], after which they start to deviate significantly. Figure 4.3 plots the one-step prediction of the logistic map for three different starting points.

The overall trajectory of the predicted signal, in the long term, has a structure that is very similar to that of the actual logistic map. The similarity is clearly seen by observing their attractors, which are shown in Figures 4.2c and 4.2d. For numerical evaluation of the autonomous performance, the dynamical invariants of both the actual data and the model-generated data are compared in Table 4.2. For the logistic map, d_L = 1; it therefore has only one Lyapunov exponent, which happens to be 0.69 nats/sample. This means that the sum of the Lyapunov exponents is not negative, thus violating one of the conditions in the Kaplan–Yorke method, and it is for this reason that the Kaplan–Yorke dimension D_KY could not be calculated. However, by comparing the other calculated invariants, it is seen that the Lyapunov exponent and the correlation dimension of the two signals are in close agreement with each other. In addition, the Kolmogorov entropy values for the two signals also match very closely.
The theoretical horizons of predictability of the two signals are also in agreement with each other. These results demonstrate very convincingly that the original dynamics have been accurately modeled by the trained RMLP. Furthermore, the robustness of the model is tested by starting the predictions from various locations on the test data, corresponding to indices of N_0 = 3060, 5060, and 10,060. The results, shown in Figure 4.4, clearly indicate that the RMLP network is able to reconstruct the logistic series beginning from any location chosen at random.

Table 4.2 Comparison of chaotic invariants of the logistic map

                                                  λ_1             KE_LE           KE_ML           HOP
  Time series      d_E    τ    d_L   D_ML   D_KY  (nats/sample)   (nats/sample)   (nats/sample)   (samples)
  Actual logistic   6     5     1    1.04   —^a       0.69            0.69            0.64            5
  Reconstructed     6    12     1    1.00   —^a       0.61            0.61            0.65            6

^a Since the sum of the Lyapunov exponents is not negative, D_KY could not be calculated.


Figure 4.3 One-step prediction of the logistic map from different starting points. Note that A = initialization and B = one-step phase.

4.4.2 Ikeda Map

This second experiment uses the Ikeda map (which is substantially more complicated than the logistic map) to test the performance of the EKF-RMLP modeling scheme. The Ikeda map is a complex-valued map, generated using the following difference equations:

    m(k) = 0.4 - \frac{6.0}{1 + x_1^2(k) + x_2^2(k)},        (4.4)

    x_1(k + 1) = 1.0 + \mu \{ x_1(k) \cos[m(k)] - x_2(k) \sin[m(k)] \},        (4.5)

    x_2(k + 1) = \mu \{ x_1(k) \sin[m(k)] + x_2(k) \cos[m(k)] \},        (4.6)

where x_1 and x_2 are the real and imaginary components of x, respectively, and the parameter μ is carefully chosen to be 0.7 so that the produced behavior is chaotic. The initial values x_1(0) = 0.5 and x_2(0) = 0.5 were selected and, as pointed out earlier, a data set of 30,000 samples was


generated. In this experiment, only the x_1 component of the Ikeda map is used, for which the embedding parameters d_E = 6 and τ = 10 were determined. The first 5,000 samples of this data set were used to train an RMLP with the EKF algorithm at one-step prediction. During training, a truncation depth t_d = 10 was used for the backpropagation-through-time (BPTT) derivative calculations. The RMLP configuration of 6-6R-5R-1, which has a total of 144 weights including the bias terms, was chosen to model the Ikeda series. The training converged after only 15 epochs, and a sufficiently low incremental training mean-squared error was achieved, as shown in Figure 4.5.

Figure 4.4 Iterative prediction of the logistic map from different starting points, corresponding to N_0 = 3060, 5060, and 10,060, respectively, with p_l = 30. Note that A = initialization, B = priming phase, and C = autonomous phase.

Open-Loop Evaluation   The test set, consisting of the unexposed 25,000 samples of data, is used for performance evaluation. Figure 4.6a shows the one-step performance of the network on a short portion of the test data. It is indeed difficult to distinguish between the actual and predicted signals, thus visually verifying the goodness of the predictions.
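Equations (4.4)–(4.6) can be iterated directly to generate such a data set. A sketch, with an SER check from Eq. (4.3) appended, using the MSS/MSE values the text reports for this series:

```python
import numpy as np

def ikeda_series(n, mu=0.7, x1=0.5, x2=0.5):
    """Iterate the Ikeda map, Eqs. (4.4)-(4.6); return the x1 component."""
    out = np.empty(n)
    for k in range(n):
        out[k] = x1
        m = 0.4 - 6.0 / (1.0 + x1 * x1 + x2 * x2)                # Eq. (4.4)
        x1, x2 = (1.0 + mu * (x1 * np.cos(m) - x2 * np.sin(m)),  # Eq. (4.5)
                  mu * (x1 * np.sin(m) + x2 * np.cos(m)))        # Eq. (4.6)
    return out

def ser_db(mss, mse):
    """Signal-to-error ratio of Eq. (4.3), in decibels."""
    return 10.0 * np.log10(mss / mse)

x = ikeda_series(30_000)
print(x.shape)
# SER from the quoted Ikeda figures (MSS = 0.564, MSE = 1.4e-4):
print(round(ser_db(0.564, 1.4e-4), 1))   # ~36.1 dB; the text quotes 36.02 dB
```

The tuple assignment updates x_1 and x_2 simultaneously, as the map requires; the small gap between 36.1 and the quoted 36.02 dB presumably reflects rounding of the MSE in the text.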


Figure 4.5 Training MSE versus epochs for the Ikeda map.

For a numerical measure, the mean-squared value of the 25,000-sample Ikeda test series was calculated to be MSS = 0.564, and a mean-squared prediction error of MSE = 1.4 × 10^-4 was produced by the trained RMLP network, thus giving an SER of 36.02 dB.

Closed-Loop Evaluation  To begin autonomous prediction, a delay vector consisting of 6 taps spaced 10 samples apart is constructed, as dictated by the embedding parameters d_E and tau. The RMLP is initialized with a delay vector, constructed from the test samples, and passed through a priming phase with p_l = 60, after which the network operates in closed-loop mode. The autonomous continuation from where the training data end is shown in Figure 4.6b. Note that the predictions follow closely for about 10 steps on average, which is in agreement with the theoretical horizon of predictability of 11 calculated from the Lyapunov spectrum. A length of 25,000 autonomous samples was generated using the trained EKF-RMLP model, and the reconstructed attractor is plotted in Figure 4.6d. The reconstructed attractor has exactly the same form as the original attractor, which is plotted in Figure 4.6c using the actual Ikeda samples. These figures clearly demonstrate that the RMLP network has captured the underlying dynamics of the Ikeda map series. For numerical performance


Figure 4.6 Results for the dynamic reconstruction of the Ikeda map. (a) One-step prediction. (b) Iterated prediction. (c) Attractor of original signal. (d) Attractor of iteratively reconstructed signal.


Figure 4.7 One-step prediction of the Ikeda series from different starting points. Note that A = initialization and B = one-step phase.

evaluation, the correlation dimension, Lyapunov exponents, and Kolmogorov entropy of both the actual Ikeda series and the autonomously generated samples are calculated. Table 4.3, which summarizes the results, shows that the dynamic invariants of both the actual and reconstructed signals are in very close agreement with each other. This illustrates that the true dynamics of the data were captured by the trained network. Figure 4.7 plots the one-step prediction of the Ikeda map for three different starting points. The reconstruction produced here is robust and stable, regardless of the position of the initializing delay vector on the test data, as demonstrated in Figure 4.8, which shows autonomous operation starting at indices of N_0 = 3120, 10,120, and 17,120, respectively.

Noisy Ikeda Series  It was shown above that the noise-free Ikeda series can be modeled by the RMLP scheme. In a real environment, observable signals are usually corrupted by additive noise, which makes the problem more difficult. Thus, to make the modeling task more challenging than it already is, computer-generated noise is added to the Ikeda series such that the resulting signal-to-noise ratios (SNRs) of the two sets of noisy observable signals are 25 dB and 10 dB, respectively.
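The delay vectors used throughout these experiments (d_E taps spaced tau samples apart) can be built with a small helper; this is an illustrative construction, not code from the text:

```python
import numpy as np

def delay_embed(series, d_e, tau):
    """Return an array whose k-th row is the delay vector
    [x(k), x(k+tau), ..., x(k+(d_e-1)*tau)]."""
    series = np.asarray(series)
    n = len(series) - (d_e - 1) * tau   # number of complete delay vectors
    return np.stack([series[i * tau : i * tau + n] for i in range(d_e)], axis=1)

# For the Ikeda experiment, d_E = 6 and tau = 10:
vectors = delay_embed(np.arange(100.0), d_e=6, tau=10)
# vectors.shape == (50, 6); vectors[0] == [0, 10, 20, 30, 40, 50]
```

Each row is one network input; the prediction target for row k is the sample that follows the last tap.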


Figure 4.8 Iterative prediction of the Ikeda series from different starting points, corresponding to indices of N_0 = 3120, 10,120, and 17,120, respectively, with p_l = 60. Note that A = initialization, B = priming phase, and C = autonomous phase.

The attractors of the noisy signals are shown in the left-hand parts of Figures 4.9a and 4.9b, respectively. The increase in noise level was more substantial for the 10 dB case, and this corrupted the signal very significantly. It is apparent from Figure 4.9b that the intricate details of the attractor trajectories are lost owing to the high level of noise.

The noisy signals were used to train two distinct 6-6R-5R-1 networks using the first 5000 samples, in the same fashion as in the noise-free case. The right-hand plots of Figures 4.9a and 4.9b show the attractors of the autonomously generated Ikeda series produced by the two trained RMLP networks. Whereas the network trained with a 25 dB SNR was able to capture the Ikeda dynamics, the network trained with a 10 dB SNR was unable to do so. This shows that, because of the substantial amount of noise in the 10 dB case, the network was unable to capture the dynamics of the Ikeda series. However, for the 25 dB case, the network was not only able to capture the predictable part of the Ikeda series but also filtered the noise. Table 4.3 displays the chaotic invariants of the original and


Table 4.3 Comparison of chaotic invariants of the Ikeda map under noiseless and noisy conditions

Time series      d_E  tau  d_L  D_ML  D_KY  lam1   lam2   lam3   lam4   lam5   lam6   KE_LE (nats/sample)  KE_ML (nats/sample)  HOP (samples)
Actual Ikeda     6    10   2    1.51  1.44  0.354  0.802  —      —      —      —      0.354                0.316                11
Reconstructed    6    19   2    1.68  1.33  0.248  0.736  —      —      —      —      0.248                0.347                16
25 dB SNR Ikeda  6    1    5    1.92  3.9   0.354  0.076  0.081  0.296  0.765  —      0.3548               0.314                14
Reconstructed    6    15   3    1.54  2.15  0.248  0.14   0.792  —      —      —      0.262                0.304                15
10 dB SNR Ikeda  6    6    6    3.46  —^a   1.145  0.948  0.778  0.543  0.215  0.495  0.363                0.662                3
Reconstructed    6    2    3    0.26  —^a   3.47   6.01   3.47   —      —      —      —^a                  —^a                  —^a

^a Quantities that could not be calculated.
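The noisy observables at 25 dB and 10 dB SNR amount to adding white noise whose power is scaled against the signal power. A sketch of that scaling (the helper name and the Gaussian noise model are my assumptions; the text only specifies the target SNRs):

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise so that the resulting SNR
    (signal power over noise power) equals snr_db."""
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)   # SNR = 10*log10(Ps/Pn)
    rng = np.random.default_rng(seed)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
```

For a long series, the empirically achieved SNR matches the target to within a fraction of a decibel.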


Figure 4.9 Dynamic reconstruction of the noisy Ikeda map. (a) Autonomous performance for the Ikeda map with 25 dB SNR. (b) Autonomous performance for the Ikeda map with 10 dB SNR. Plots on the left pertain to the noisy original signals, and those on the right to the reconstructed signals.

reconstructed signals for both levels of noise. The addition of noise has the effect of increasing the number of active degrees of freedom, and thus the number of Lyapunov exponents increases in a corresponding way. The invariants of the reconstructed signal corresponding to the 25 dB SNR case match more closely with the original noise-free Ikeda invariants as compared with the noisy invariants. However, for the failed reconstruction in the 10 dB case, there is a large disagreement between the reconstructed invariants and the actual invariants, which is to be expected. In fact, some of the invariants could not even be calculated.
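The SER figures quoted throughout (such as the 36.02 dB for the Ikeda test set, and the entries of Table 4.4) are the ratio of the mean-squared signal to the mean-squared prediction error, in decibels. A sketch (the function name is mine):

```python
import numpy as np

def ser_db(actual, predicted):
    """Signal-to-error ratio in dB: 10*log10(MSS / MSE)."""
    actual = np.asarray(actual, dtype=float)
    mss = np.mean(actual ** 2)                            # mean-squared signal
    mse = np.mean((actual - np.asarray(predicted)) ** 2)  # mean-squared error
    return 10.0 * np.log10(mss / mse)
```

For example, with MSS = 0.564 and MSE = 1.4 × 10^-4 as reported for the Ikeda test series, 10 log10(0.564 / 1.4 × 10^-4) is approximately 36 dB, consistent with the value quoted in the text.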


Table 4.4 SER results (dB) for several test cases for the Ikeda series

                      Training noise
Testing noise    Clean    25 dB SNR    10 dB SNR
Clean            36.0     27.1         16.3
25 dB SNR        14.1     18.1         14.9
10 dB SNR        1.7      5.8          6.6

Evaluation of one-step prediction for the more challenging noisy cases was also done. Table 4.4 summarizes the one-step SER results collected over a number of distinct test cases. For example, the first column of the table shows how a network trained with clean training data performs on test data with various levels of noise. It is important to note that it is difficult to achieve an SER larger than the SNR of the test data. The best overall generalization results are obtained when the network is trained with 25 dB SNR.

4.4.3 Lorenz Attractor

The Lorenz attractor is more challenging than the Ikeda or logistic map; it is described by a coupled set of three nonlinear differential equations:

    dx/dt = sigma [y(t) - x(t)],          (4.7a)
    dy/dt = x(t)[r - z(t)] - y(t),        (4.7b)
    dz/dt = x(t)y(t) - b z(t),            (4.7c)

where the fixed parameters r = 45.92, sigma = 16, and b = 4 are used, and the derivatives are taken with respect to time t. As before, a data set of 30,000 samples was generated at 40 Hz, of which the first 5000 samples were used to train the RMLP model and the remaining 25,000 were used for testing. For the experimental Lorenz series, an embedding dimension of d_E = 3 and a delay of tau = 4 were calculated. An RMLP network configuration of 3-8R-7R-1, consisting of 216 weights including the biases, was trained with the EKF algorithm, and the convergence of the training MSE is shown in Figure 4.10.

Open-Loop Evaluation  The results shown in Figure 4.11 were arrived at in only 10 epochs of working through the training set. The


Figure 4.10 Training MSE versus epochs for the Lorenz series.

plot of one-step prediction over a portion of the test data is depicted in Figure 4.11a. The one-step MSE over the 25,000 samples of test data was calculated to be 1.57 × 10^-5, which corresponds to an SER value of 40.2 dB.

Closed-Loop Evaluation  The autonomous continuation, from where the test data end, is shown in Figure 4.11b, which demonstrates that the network has learned the dynamics of the Lorenz attractor very well. The iterated predictions follow the trajectory very closely for about 80 time steps on average, and then demonstrate chaotic divergence, as expected. This is in close agreement with the theoretical horizon of predictability of 97 calculated from the Lyapunov spectrum. A further testament to the success of the EKF-RMLP model is that the reconstructed attractor, shown in Figure 4.11d, is similar in shape to the attractor of the original series, which is shown in Figure 4.11c, demonstrating that the network has indeed captured the dynamics well. In addition, the dynamic invariants of the original and reconstructed series are compared in Table 4.5, which shows close agreement between their respective correlation dimension, Lyapunov spectrum, and Kolmogorov entropy, thus indicating the strong presence of the original dynamics in the reconstructed signal. Figure 4.12


Figure 4.11 Results for the dynamic reconstruction of the Lorenz series. (a) One-step prediction. (b) Iterated prediction. (c) Attractor of original signal. (d) Attractor of iteratively reconstructed signal.


Table 4.5 Comparison of chaotic invariants of the Lorenz series

Time series       d_E  tau  d_L  D_ML  D_KY  lam1   lam2    lam3    lam4   lam5  lam6   lam7   KE_LE (nats/sample)  KE_ML (nats/sample)  HOP (samples)
Actual Lorenz     3    4    3    2.09  2.03  0.040  0.0005  1.286   —      —     —      —      0.040                0.042                97
Reconstructed     3    4    3    1.99  2.26  0.050  0.0009  0.1828  —      —     —      —      0.050                0.037                78
25 dB SNR Lorenz  5    4    4    2.84  3.75  0.489  0.249   0.092   0.850  —     —      —      0.738                0.151                8
Reconstructed     3    4    3    2.00  2.11  0.039  0.006   0.303   —      —     —      —      0.039                0.022                99
10 dB SNR Lorenz  7    14   7    7.18  6.71  1.554  0.419   0.269   0.109  0.10  0.442  1.128  1.35                 1.06                 7
Reconstructed     3    5    3    1.84  2.18  0.062  0.003   0.360   —      —     —      —      0.065                0.024                63
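The 40 Hz Lorenz data set of Eqs. (4.7a)-(4.7c) can be generated, for example, with a fourth-order Runge-Kutta integrator. The integrator choice and the initial state below are my assumptions; the text does not specify either:

```python
import numpy as np

def lorenz_series(n, dt=1.0 / 40.0, sigma=16.0, r=45.92, b=4.0):
    """Integrate Eqs. (4.7a)-(4.7c) with RK4 and return n samples of (x, y, z)."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

    out = np.empty((n, 3))
    s = np.array([1.0, 1.0, 1.0])       # assumed initial state
    for i in range(n):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        out[i] = s
    return out

samples = lorenz_series(30000)          # the x component is samples[:, 0]
```

In practice one would also discard an initial transient so that the retained samples lie on the attractor.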


Figure 4.12 One-step prediction of the Lorenz series from different starting points. Note that A = initialization and B = one-step phase.

plots one-step predictions of the Lorenz time series for three different starting points. The reconstruction of the Lorenz series was indeed very robust and stable, regardless of the starting position on the test series, as demonstrated in Figure 4.13, which shows autonomous operation starting at indices of N_0 = 3042, 10,042, and 17,042, respectively.

Noisy Lorenz Series  To make the problem even more challenging than it already is, a significant level of noise was added to the clean Lorenz series, just as was done for the Ikeda map. An RMLP network architecture similar to that of the noise-free case was selected, and two distinct networks were trained using the noisy Lorenz signals with 25 dB SNR and 10 dB SNR, respectively. The networks were trained with a learning rate of p_r = 0.001 for 15 epochs through the first 5000 samples, as was done for the noise-free case. Then, both one-step prediction and autonomous prediction results were obtained with the remaining unexposed 25,000 samples. Figures 4.14a and 4.14b show that excellent dynamic reconstruction of the Lorenz series was possible even in the presence of 25 dB and 10 dB SNR,


Figure 4.13 Iterative prediction of the Lorenz series from different starting points, corresponding to indices of N_0 = 3042, 10,042, and 17,042, respectively, with p_l = 30. Note that A = initialization, B = priming phase, and C = autonomous phase.

respectively, despite the fact that the additive noise corrupts the attractors considerably. Moreover, comparison of the dynamic invariants of the noisy and reconstructed signals in Table 4.5 reveals that the reconstructed signals and their invariants are reasonably close to those of the noise-free signal, and the iterated predictions are smoother in comparison with the noisy signals, as shown in Figures 4.15a and 4.15b.

It was shown earlier that dynamic reconstruction of the Ikeda map was not possible for the case of 10 dB SNR. In order to explain this, the normalized frequency spectra of both the Ikeda map and the Lorenz series are plotted in Figure 4.16. This figure illustrates how much more significantly the Ikeda spectrum is corrupted by 10 dB noise as compared with the Lorenz spectrum. Whereas the Ikeda spectrum is completely drowned at 10 dB SNR, the Lorenz spectrum is still discernible, and as such its dynamic reconstruction was possible. Another way of explaining the


Figure 4.14 Dynamic reconstruction of the noisy Lorenz series. (a) Autonomous performance for the Lorenz series with 25 dB SNR. (b) Autonomous performance for the Lorenz series with 10 dB SNR. Plots on the left pertain to the original data, and those on the right to the reconstructed data.

reason for the robust behavior of the Lorenz attractor in the presence of additive noise is to recall that the correlation dimension of the Lorenz attractor is greater than that of the Ikeda map; see Table 4.1. Accordingly, the Lorenz attractor is more complex than the Ikeda map, hence the greater robustness of the Lorenz attractor to noise.

Evaluation of one-step performance for the more challenging noisy cases was also done. Table 4.6 summarizes the one-step SER results collected over a number of distinct test cases. For example, the first column of the table shows how a network trained with clean training data performs on test data with various levels of noise. The best overall generalization results are obtained when the network is trained with 25 dB SNR.
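The normalized frequency spectra compared in Figure 4.16 can be computed, for example, as a mean-removed FFT power spectrum scaled to a unit peak. This is one of several reasonable normalizations; the text does not specify which was used:

```python
import numpy as np

def normalized_spectrum(x):
    """Power spectrum of a real time series, normalized to a peak value of 1."""
    x = np.asarray(x, dtype=float)
    p = np.abs(np.fft.rfft(x - x.mean())) ** 2   # remove DC, then FFT power
    return p / p.max()
```

Plotting this quantity for the clean and 10 dB SNR versions of each series makes the difference visible: a noise floor at -10 dB relative signal power swamps a broad, flat spectrum (Ikeda) far more than a spectrum with strong low-frequency peaks (Lorenz).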


Figure 4.15 Iterative prediction of the noisy Lorenz series with (a) 25 dB and (b) 10 dB SNR. Note that A = initialization, B = priming phase, and C = autonomous phase.

4.5 NONLINEAR DYNAMIC MODELING OF REAL-WORLD TIME SERIES

4.5.1 Laser Intensity Pulsations

In the first experiment with real-world data, a series recorded from a far-infrared laser in a chaotic state is used. This data set was provided as part of a data package used in the Santa Fe Institute Time Series Prediction and Analysis Competition [7]. During the time of the competition, only the first 1000 samples were provided, and the goal was to predict the following


Figure 4.16 Lorenz and Ikeda series: normalized frequency spectra.

Table 4.6 SER results (dB) for several test cases for the Lorenz series

                      Training noise
Testing noise    Clean    25 dB SNR    10 dB SNR
Clean            40.2     30.5         36.8
25 dB SNR        17.8     23.1         17.2
10 dB SNR        3.5      9.3          3.8

100 samples as accurately as possible. Now, however, the full data set consisting of 10,000 samples is available. The first 1100 samples are plotted in Figure 4.17. The data consist of high-frequency pulsations, which gradually rise in amplitude and suddenly collapse from a high to a low amplitude. The rapid decays of oscillations in this data set occur with no periodicity, and are a challenge to model. During these collapses, the


Figure 4.17 First 1100 samples of the original laser series.

signal abruptly jumps from one region of the attractor to another distinct region, and these transitions are the most difficult parts of this series to model.

To remain consistent with what was done in the SFI competition, only the first 1000 samples were presented to the network, while the remaining 9000 samples of data were reserved for testing purposes and not shown to the network. The optimal embedding delay was found to be tau = 2 and the embedding dimension to be d_E = 9. A network architecture of 9-10R-8R-1, consisting of 361 weights including the biases, was used to model this data. The network was trained through the first 1000 points for 50 epochs.

Open-Loop Evaluation  The network makes excellent one-step predictions, as shown in Figure 4.18a, with very low prediction errors in most regions. The accuracy of prediction drops slightly during the sudden signal collapses, but then increases quickly as the network adapts. The SER value for the one-step prediction performance on the unexposed 9000 samples of the test set was calculated to be 26.05 dB.

Closed-Loop Evaluation  The iterated predictions of the network are also very remarkable. The autonomous continuation from sample 1000, where the training data end, is shown in Figure 4.18b. This was achieved by first initializing the network with a delay vector, consisting of 9 taps, each spaced 2 samples apart, from the beginning of the training set, and then priming for p_l = 982 steps, after which the network operates in an autonomous mode. Since the network was initialized with the first samples of the actual data, autonomous operation does not begin until the point (9 × 2) + 982 = 1000, as shown in Figure 4.19.
The autonomous output of the network closely follows the original series for about 60 steps (which is higher than the average horizon of predictability of 26 calculated from the Lyapunov spectrum), after which it begins to diverge. Figure 4.20 shows the one-step prediction of the laser time series with three different starting points.
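The three-phase procedure just described (A: initialization from actual samples, B: priming on actual samples, C: autonomous closed-loop operation) can be sketched generically. Here `predict` stands in for the trained one-step predictor; for an RMLP the priming calls matter because they settle the network's recurrent state. The function name and structure are mine:

```python
import numpy as np

def iterated_prediction(series, predict, d_e, tau, n0, prime_len, n_auto):
    """Phase A: fill the delay window from actual data; Phase B: prime the
    predictor on actual samples; Phase C: feed predictions back as inputs."""
    span = (d_e - 1) * tau
    hist = list(series[n0 : n0 + span + 1])            # A: initialization
    for k in range(prime_len):                         # B: priming phase
        predict(np.asarray(hist[-span - 1 :: tau]))
        hist.append(series[n0 + span + 1 + k])         # feed actual samples
    out = []
    for _ in range(n_auto):                            # C: autonomous phase
        x_next = predict(np.asarray(hist[-span - 1 :: tau]))
        hist.append(x_next)                            # feed the prediction back
        out.append(x_next)
    return np.asarray(out)

# Sanity check with a toy stateless predictor that continues x(k) = k exactly,
# using the laser parameters d_E = 9, tau = 2, p_l = 982:
out = iterated_prediction(np.arange(2000.0), lambda v: v[-1] + 1.0,
                          d_e=9, tau=2, n0=0, prime_len=982, n_auto=10)
```

For a perfect predictor of a deterministic series the autonomous output continues the series exactly; for a chaotic series it diverges after roughly the horizon of predictability.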


Figure 4.18 Results of the dynamic reconstruction of the laser series. (a) One-step prediction. (b) Iterated prediction. (c) Attractor of original signal. (d) Attractor of iteratively reconstructed signal.


Figure 4.19 Iterated prediction of the laser data beginning at sample 1000, using p_l = 982. Note that A = initialization, B = priming phase, and C = autonomous phase. Phase C is enlarged in Figure 4.18b.

Figure 4.20 One-step prediction of the laser series with different starting points. Note that A = initialization and B = one-step phase.

Figure 4.21 shows that the network is capable of stable and robust reconstruction even when it is initialized from different starting points on the test set, corresponding to indices of N_0 = 48, 1048, and 4048, respectively. Note that the network has not seen this data set before, and is still able to predict the signal collapses up to an impressive degree of accuracy. Figure 4.22, which shows the 9000-sample iterated output of the network, demonstrates that the chaotic pulsations have been learned by the


Figure 4.21 Iterative prediction of the laser series from different starting points, corresponding to indices of N_0 = 48, 1048, and 4048, respectively, with p_l = 30. Note that A = initialization, B = priming phase, and C = autonomous phase.

network from just the first 1000 samples. The attractor of the original series is plotted in Figure 4.18c using the actual laser data. The attractor in Figure 4.18d is plotted using the reconstructed points as produced by the trained RMLP network. It is clear from the attractor of the reconstructed signal that the dynamics of the laser data have been faithfully captured by the EKF-trained RMLP network. Table 4.7 shows that the correlation dimension, Lyapunov exponents, and Kolmogorov entropy of the reconstructed signal are in close agreement with those obtained from the original data, thus indicating a faithful reconstruction of the original dynamics. The correlation dimension of the NH3 laser pulsations calculated here is consistent with what is reported by Hubner et al. [8].

Longer Training Length  In another experiment, the same-sized network was trained on the first 5000 points, as compared with the first 1000 used above. The iterated output produced by the network is shown in Figure 4.23. Since the network was exposed to more of the signal


Figure 4.22 9000 samples of iterated output of the RMLP trained on the first 1000 samples.

Table 4.7 Comparison of chaotic invariants of the laser series

Time series    d_E  tau  d_L  D_ML  D_KY  lam1   lam2   lam3   KE_LE (nats/sample)  KE_ML (nats/sample)  HOP (samples)
Actual laser   9    2    3    2.01  2.19  0.147  0.039  0.547  0.147                0.195                26
Reconstructed  9    2    3    1.90  2.13  0.115  0.037  0.602  0.115                0.133                34

Figure 4.23 9000 samples of iterated output of the RMLP trained on the first 5000 samples.

Figure 4.24 Last 9000 samples of the actual laser series.


dynamics in this case, the iterated output of the network demonstrates more frequent chaotic collapses and is closer in appearance to the actual signal shown in Figure 4.24.

4.5.2 Sea Clutter Data

Sea clutter, or sea echo, refers to the radar backscatter from an ocean surface. Radars operating in a maritime environment have a serious limitation imposed on their performance by unwanted sea clutter. Therefore, a problem of fundamental interest in the radar literature is the modeling of sea clutter to design optimum target detectors.

For many years, clutter echoes in radar systems were modeled as a stochastic process. However, in recent years, some researchers have concluded that the generation of sea clutter is governed by deterministic chaos [9-13]. These claims, in varying degrees, have been largely based on the estimation of chaotic invariants of sea clutter, namely, the correlation dimension, Lyapunov exponents, Kaplan-Yorke dimension, and the Kolmogorov entropy; the latter two are derived from the Lyapunov exponents.
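Claims of chaos based on such invariants are commonly tested against surrogate data: stochastic series constructed to share the linear properties of the original. One standard construction is phase randomization, which keeps the amplitude spectrum and scrambles the Fourier phases, destroying any deterministic structure while preserving the linear correlations. A sketch (the studies cited below may use other surrogate schemes as well):

```python
import numpy as np

def phase_randomized_surrogate(x, seed=0):
    """Surrogate series with the same amplitude spectrum as x, random phases."""
    x = np.asarray(x, dtype=float)
    X = np.fft.rfft(x)
    rng = np.random.default_rng(seed)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(X))
    phases[0] = 0.0                    # keep the DC component real
    if len(x) % 2 == 0:
        phases[-1] = 0.0               # keep the Nyquist component real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=len(x))
```

If an invariant estimator returns nearly the same value on the original series and on many such surrogates, the invariant carries no evidence of determinism.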
Recently, however, serious concerns have been raised about the discriminative power of the state-of-the-art algorithms currently available for distinguishing between sea clutter and different forms of surrogate data derived from the original data; the surrogate data are known to be stochastic, because randomization is purposely introduced into their generation. By estimating the correlation dimension and Lyapunov exponents, it is found that there is little difference between the chaotic invariants of sea clutter and those of their surrogate counterparts [14, 15], which casts doubt on deterministic chaos being the nonlinear dynamical model for sea clutter. In [14], it is pointed out that physical realities dictate the presence of noise not just in the measurement equation (owing to instrument errors) but also in the process equation (owing to the rate of variability of the forces affecting the ocean dynamics). According to Sugihara [16], there is an unavoidable practical difficulty in disentangling the process noise from the measurement noise when we try to reconstruct an invariant measure such as the Lyapunov exponents. This may explain the inability of chaotic-invariant estimation algorithms to distinguish between sea clutter and its surrogates.

Our primary interest in this chapter is in dynamic reconstruction. In this context, we may raise the following question: Can the EKF-RMLP be configured to perform dynamic reconstruction on sea clutter in a robust manner? In what follows, an attempt to answer this question is presented;


Table 4.8 Comparison of chaotic invariants of sea clutter

Time series                    d_E  tau  d_L  D_ML  D_KY  lam1   lam2   lam3   lam4   lam5   KE_LE (nats/sample)  KE_ML (nats/sample)  HOP (samples)
Actual sea clutter             6    10   5    4.69  4.42  0.228  0.102  0.005  0.118  0.448  0.335                0.369                17
Reconstructed from N_0 = 0     6    9    5    3.12  4.51  0.214  0.078  0.010  0.083  0.427  0.302                0.292                18
Reconstructed from N_0 = 5160  7    6    5    3.58  4.75  0.216  0.129  0.049  0.112  0.373  0.394                0.303                18
Reconstructed from N_0 = 6160  6    8    5    4.43  4.62  0.232  0.136  0.023  0.113  0.434  0.391                0.317                17
Reconstructed from N_0 = 7160  6    6    5    3.67  4.23  0.191  0.088  0.029  0.142  0.713  0.308                0.220                20


the answer has a bearing on whether sea clutter is chaotic or not. Figure 4.25 shows the in-phase component of typical sea clutter, recorded by an instrument-quality radar on the East Coast of Canada, with the radar operating in a dwelling mode at a low grazing angle. The data consist of 50,000 samples, of which the first 40,000 are used for network training, while the remaining 10,000 samples are reserved for testing. An embedding dimension of d_E = 6 and an embedding delay of tau = 10 were calculated for sea clutter.

Open-Loop Evaluation  The results for the modeling of sea clutter by the EKF-RMLP method are shown in Figure 4.26. A 6-8R-7R-1 network trained with p_r = 0.5 was used. The SER over the unexposed 10,000 test samples was found to be 36.65 dB, which is certainly impressive, since in the EKF-RMLP-based method presented here, the synaptic weights remain fixed after the training was completed.

Closed-Loop Evaluation  The iterated continuation from where the training data end, corresponding to an index of N_0 = 0, is shown in Figure

Figure 4.25 In-phase component of typical sea-clutter data.


Figure 4.26 Results for the dynamic reconstruction of sea clutter. (a) One-step prediction. (b) Iterated prediction. (c) Attractor of original signal. (d) Attractor of iteratively reconstructed signal.


Figure 4.27 One-step prediction of sea clutter from different starting points. Note that A = initialization and B = one-step mode of operation.

4.26b. Observe that the prediction is close to the actual data for roughly less than 10 steps (which is significantly less than the average theoretical horizon of predictability of 17 calculated from the Lyapunov spectrum), after which it diverges into a distinct trajectory of its own. The network, when iterated in this fashion for several thousand steps, produces an output whose attractor is plotted in Figure 4.26d. It is difficult to compare the reconstructed attractor with the original attractor in Figure 4.26c because we are looking at a five-dimensional attractor in three dimensions. Since it is impossible to plot attractors in five dimensions, we may determine the success of dynamic reconstruction by a numerical comparison of the dynamic invariants of the reconstructed signal with those of the actual signal. As shown in Table 4.8, the Lyapunov exponents of the reconstructed signal are in close agreement with those obtained from the actual data. In addition, the correlation dimension and Kolmogorov entropy of both signals are also close, except for the slightly lower than expected value of D_ML for the reconstructed signal. Furthermore, D_KY and


KE_LE are close to D_ML and KE_ML, respectively. Figure 4.27 plots one-step predictions of sea clutter for three different starting points.

Stability and Robustness  Furthermore, the more difficult case of autonomous prediction from different starting points chosen from the test data is shown in Figure 4.28. This shows that stable reconstruction is also possible from indices of N_0 = 5160, 6160, and 7160 on the test data. The dynamic invariants calculated from the signals reconstructed at these indices are in close agreement with the actual values in Table 4.8. Again, there seems to be a slight disagreement in the values of D_ML.

This shows that the EKF-RMLP scheme was able to reconstruct sea-clutter dynamics on several occasions in a reasonable manner, depending on the data used. However, it is important to point out that robust reconstruction was not possible from all initialization points, for this and other types of networks used. The different types of reconstruction

Figure 4.28 Iterative prediction of sea clutter from different starting points, corresponding to indices of N_0 = 5160, 6160, and 7160, respectively, with p_l = 100. Note that A = initialization, B = priming phase, and C = autonomous phase.


failures encountered are illustrated in Figure 4.29. In the first type of reconstruction failure, the output becomes constant after a few steps. In other words, the output behaves as a fixed-point attractor. In the second type of failure, the output becomes periodic after a few time steps. In the third type of failure, the output breaks down after a few steps of good prediction. How do we explain the reconstruction failures of Figure 4.29? Dynamic reconstruction, in accordance with Takens' embedding theorem, tolerates a moderate level of measurement noise, but the process equation has to be noise-free. The presence of dynamical noise in the process equation is therefore in violation of Takens' theorem, hence the difficulty in devising a robust dynamic reconstruction system for sea clutter.

4.6 DISCUSSION

Takens' theorem provides the theoretical basis for an experimental approach to the autoregressive modeling of a nonlinear dynamical system. Specifically, the state evolution of a system with a d-dimensional state space can be reconstructed by observing 2d + 1 time lags of a single output of the system for an arbitrary delay.
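Although the theorem allows an arbitrary delay, in practice the delay is chosen; a common recipe, also used in this study, takes the lag at the first minimum of the mutual information between the series and a delayed copy of itself. A histogram-based sketch (the bin count, search range, and function names are my assumptions):

```python
import numpy as np

def mutual_information(x, lag, bins=32):
    """Histogram estimate of I(x(k); x(k+lag)) in nats."""
    x = np.asarray(x, dtype=float)
    pxy, _, _ = np.histogram2d(x[:-lag], x[lag:], bins=bins)
    pxy /= pxy.sum()                       # joint distribution of (x(k), x(k+lag))
    px = pxy.sum(axis=1)                   # marginals
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def optimum_delay(x, max_lag=50):
    """Lag of the first local minimum of the mutual-information curve."""
    mi = [mutual_information(x, lag) for lag in range(1, max_lag + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i] < mi[i - 1] and mi[i] <= mi[i + 1]:
            return i + 1
    return int(np.argmin(mi)) + 1          # fall back to the global minimum
```

The first minimum gives adjacent delay-vector components that are as statistically independent as practically possible while still dynamically related.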
The theorem requires that the observable time series be noiseless and of infinite length. Unfortunately, neither of these two conditions can be satisfied in practice. To mitigate these difficulties, the recommended procedure is to use an "optimum" time delay for the delay-coordinate method, one for which the adjacent samples are as statistically independent as practically possible. For the experimental study presented here, we minimized the mutual information between the observable time series and its delayed version as the basis for estimating the optimum time delay.

In this chapter, we have presented experimental results demonstrating the ability of the EKF-RMLP scheme to reconstruct the dynamics of chaotic processes in accordance with Takens' embedding theorem. The results presented for (1) the numerically generated logistic map, Ikeda map, and Lorenz attractor, and (2) the experimentally generated laser pulsations are indeed remarkable in that the reconstructed data closely match all the important chaotic characteristics of the original time series. Moreover, the dynamic reconstruction performed here is not only stable


120 4 CHAOTIC DYNAMICS


Figure 4.29 Illustrative failure modes in iterative prediction of sea clutter by various networks. (a) Output converges to a constant (fixed-point attractor); N_0 = 4160 and 160, respectively. (b) Output becomes periodic; N_0 = 7160. (c) Output breaks down after a few time steps; N_0 = 160 and 7160, respectively. Note that A = initialization, B = priming phase, and C = autonomous phase.




5

DUAL EXTENDED KALMAN FILTER METHODS

Eric A. Wan and Alex T. Nelson
Department of Electrical and Computer Engineering, Oregon Graduate Institute of Science and Technology, Beaverton, Oregon, U.S.A.

5.1 INTRODUCTION

The extended Kalman filter (EKF) provides an efficient method for generating approximate maximum-likelihood estimates of the state of a discrete-time nonlinear dynamical system (see Chapter 1). The filter involves a recursive procedure to optimally combine noisy observations with predictions from the known dynamic model. A second use of the EKF involves estimating the parameters of a model (e.g., a neural network) given clean input and output training data (see Chapter 2). In this case, the EKF represents a modified-Newton type of algorithm for on-line system identification. In this chapter, we consider the dual estimation problem, in which both the states of the dynamical system and its parameters are estimated simultaneously, given only noisy observations.

Kalman Filtering and Neural Networks, Edited by Simon Haykin. ISBN 0-471-36998-5 © 2001 John Wiley & Sons, Inc.


To be more specific, we consider the problem of learning both the hidden states x_k and parameters w of a discrete-time nonlinear dynamical system,

    x_{k+1} = F(x_k, u_k, w) + v_k,
    y_k = H(x_k, w) + n_k,                                                    (5.1)

where both the system states x_k and the set of model parameters w for the dynamical system must be simultaneously estimated from only the observed noisy signal y_k. The process noise v_k drives the dynamical system, observation noise is given by n_k, and u_k corresponds to observed exogenous inputs. The model structure, F(·) and H(·), may represent multilayer neural networks, in which case w are the weights.

The problem of dual estimation can be motivated either from the need for a model to estimate the signal or (in other applications) from the need for good signal estimates to estimate the model. In general, applications can be divided into the tasks of modeling, estimation, and prediction. In estimation, all noisy data up to the current time are used to approximate the current value of the clean state. Prediction is concerned with using all available data to approximate a future value of the clean state. Modeling (sometimes referred to as identification) is the process of approximating the underlying dynamics that generated the states, again given only the noisy observations. Specific applications may include noise reduction (e.g., speech or image enhancement), or prediction of financial and economic time series. Alternatively, the model may correspond to the explicit equations derived from first principles of a robotic or vehicle system.
In this case, w corresponds to a set of unknown parameters. Applications include adaptive control, where parameters are used in the design process and the estimated states are used for feedback.

Heuristically, dual estimation methods work by alternating between using the model to estimate the signal, and using the signal to estimate the model. This process may be either iterative or sequential. Iterative schemes work by repeatedly estimating the signal using the current model and all available data, and then estimating the model using the estimates and all the data (see Fig. 5.1a). Iterative schemes are necessarily restricted to off-line applications, where a batch of data has been previously collected for processing. In contrast, sequential approaches use each individual measurement as soon as it becomes available to update both the signal and model estimates. This characteristic makes these algorithms useful in either on-line or off-line applications (see Fig. 5.1b).
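To make the iterative scheme of Figure 5.1a concrete, the following toy sketch alternates a scalar Kalman filter (the signal step) with a least-squares fit (the model step) on an invented linear problem; the constants, and the use of a plain least-squares model step, are illustrative assumptions rather than the chapter's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: x_{k+1} = a*x_k + v_k, y_k = x_k + n_k, with a unknown.
a_true, N = 0.9, 500
x = np.zeros(N)
for k in range(N - 1):
    x[k + 1] = a_true * x[k] + 0.5 * rng.standard_normal()
y = x + 0.5 * rng.standard_normal(N)

def estimate_signal(y, a, q=0.25, r=0.25):
    """Signal step: scalar Kalman filter run with the current model a."""
    xh = np.empty_like(y)
    xh[0], p = y[0], 1.0
    for k in range(1, len(y)):
        xp, pp = a * xh[k - 1], a * a * p + q      # time update
        K = pp / (pp + r)                          # Kalman gain
        xh[k], p = xp + K * (y[k] - xp), (1 - K) * pp
    return xh

def estimate_model(xh):
    """Model step: least-squares fit of a from consecutive estimates."""
    return float(xh[:-1] @ xh[1:] / (xh[:-1] @ xh[:-1]))

# Iterative scheme (Fig. 5.1a): repeated whole-data passes, alternating steps.
a_hat = 0.0
for _ in range(10):
    xh = estimate_signal(y, a_hat)
    a_hat = estimate_model(xh)
# A sequential scheme (Fig. 5.1b) would instead interleave the two updates
# inside a single pass over the measurements, one y_k at a time.
print(a_hat)
```

Each pass re-filters the entire record with the latest model, then refits the model to the entire record of estimates, which is precisely why such schemes are confined to off-line use.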


Figure 5.1 Two approaches to the dual estimation problem. (a) Iterative approaches use large blocks of data repeatedly. (b) Sequential approaches are designed to pass over the data one point at a time.

The vast majority of work on dual estimation has been for linear models. In fact, one of the first applications of the EKF combines both the state vector x_k and unknown parameters w in a joint bilinear state-space representation. An EKF is then applied to the resulting nonlinear estimation problem [1, 2]; we refer to this approach as the joint extended Kalman filter. Additional improvements and analysis of this approach are provided in [3, 4]. An alternative approach, proposed in [5], uses two separate Kalman filters: one for signal estimation, and another for model estimation. The signal filter uses the current estimate of w, and the weight filter uses the signal estimates \hat{x}_k to minimize a prediction-error cost. In [6], this dual Kalman approach is placed in a general family of recursive prediction-error algorithms. Apart from these sequential approaches, some iterative methods developed for linear models include maximum-likelihood approaches [7–9] and expectation-maximization (EM) algorithms [10–13]. These algorithms are suitable only for off-line applications, although sequential EM methods have been suggested.

Fewer papers have appeared in the literature that are explicitly concerned with dual estimation for nonlinear models. One algorithm (proposed in [14]) alternates between applying a robust form of the


EKF to estimate the time series and using these estimates to train a neural network via gradient descent. A joint EKF is used in [15] to model partially unknown dynamics in a model-reference adaptive control framework. Furthermore, iterative EM approaches to the dual estimation problem have been investigated for radial basis function networks [16] and other nonlinear models [17]; see also Chapter 6. Errors-in-variables (EIV) models appear in the nonlinear statistical regression literature [18], and are used for regressing on variables related by a nonlinear function, but measured with some error. However, errors-in-variables is an iterative approach involving batch computation; it tends not to be practical for dynamical systems because the computational requirements increase in proportion to N^2, where N is the length of the data. A heuristic method known as Clearning minimizes a simplified approximation to the EIV cost function. While it allows for sequential estimation, the simplification can lead to severely biased results [19]. The dual EKF [19] is a nonlinear extension of the linear dual Kalman approach of [5], and of the recursive prediction-error algorithm of [6]. Application of the algorithm to speech enhancement appears in [20], while extensions to other cost functions have been developed in [21] and [22]. The crucial, but often overlooked, issue of sequential variance estimation is also addressed in [22].

Overview   The goal of this chapter is to present a unified probabilistic and algorithmic framework for nonlinear dual estimation methods. In the next section, we start with the basic dual EKF prediction-error method. This approach is the most intuitive, and involves simply running two EKF filters in parallel.
The section also provides a quick review of the EKF for both state and weight estimation, and introduces some of the complications in coupling the two. An example in noisy time-series prediction is also given. In Section 5.3, we develop a general probabilistic framework for dual estimation. This allows us to relate the various methods that have been presented in the literature, and also provides a general algorithmic approach leading to a number of different dual EKF algorithms. Results on additional example data sets are presented in Section 5.5.

5.2 DUAL EKF–PREDICTION ERROR

In this section, we present the basic dual EKF prediction-error algorithm. For completeness, we start with a quick review of the EKF for state estimation, followed by a review of EKF weight estimation (see Chapters


1 and 2 for more details). We then discuss coupling the state and weight filters to form the dual EKF algorithm.

5.2.1 EKF–State Estimation

For a linear state-space system with known model and Gaussian noise, the Kalman filter [23] generates optimal estimates and predictions of the state x_k. Essentially, the filter recursively updates the (posterior) mean \hat{x}_k and covariance P_{x_k} of the state by combining the predicted mean \hat{x}_k^- and covariance P_{x_k}^- with the current noisy measurement y_k. These estimates are optimal in both the MMSE and MAP senses. Maximum-likelihood signal estimates are obtained by letting the initial covariance P_{x_0} approach infinity, thus causing the filter to ignore the value of the initial state \hat{x}_0.

For nonlinear systems, the extended Kalman filter provides approximate maximum-likelihood estimates. The mean and covariance of the state are again recursively updated; however, a first-order linearization of the dynamics is necessary in order to analytically propagate the Gaussian random-variable representation. Effectively, the nonlinear dynamics are approximated by a time-varying linear system, and the linear Kalman filter equations are applied. The full set of equations is given in Table 5.1.
While there are more accurate methods for dealing with the nonlinear dynamics (e.g., particle filters [24, 25], second-order EKF, etc.), the standard EKF remains the most popular approach owing to its simplicity. Chapter 7 investigates the use of the unscented Kalman filter as a potentially superior alternative to the EKF [26–29].

Another interpretation of Kalman filtering is that of an optimization algorithm that recursively determines the state x_k in order to minimize a cost function. It can be shown that the cost function consists of weighted prediction-error and estimation-error components, given by

    J(x_1^k) = \sum_{t=1}^{k} \{ [y_t - H(x_t, w)]^T (R_n)^{-1} [y_t - H(x_t, w)]
               + (x_t - \bar{x}_t)^T (R_v)^{-1} (x_t - \bar{x}_t) \},         (5.10)

where \bar{x}_t = F(x_{t-1}, w) is the predicted state, and R_n and R_v are the additive-noise and innovations-noise covariances, respectively. This interpretation will be useful when dealing with alternate forms of the dual EKF in Section 5.3.3.
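The EKF recursion summarized in Table 5.1 can be sketched generically as follows; the finite-difference Jacobians stand in for the analytic linearizations A_k and C_k, and the model functions are illustrative placeholders with the exogenous input u_k and the weights w folded into F:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x (stand-in for the analytic A_k, C_k)."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(f(x + dx)) - fx) / eps
    return J

def ekf_step(x, P, y, F, H, Rv, Rn):
    """One EKF cycle: time update (5.4)-(5.5), then measurement update (5.6)-(5.8)."""
    A = jacobian(F, x)                                   # A_{k-1}, Eq. (5.9)
    x_pred = np.atleast_1d(F(x))                         # predicted mean, Eq. (5.4)
    P_pred = A @ P @ A.T + Rv                            # predicted covariance, Eq. (5.5)
    C = jacobian(H, x_pred)                              # C_k, Eq. (5.9)
    S = C @ P_pred @ C.T + Rn                            # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)                  # Kalman gain, Eq. (5.6)
    x_new = x_pred + K @ (np.atleast_1d(y) - np.atleast_1d(H(x_pred)))  # Eq. (5.7)
    P_new = (np.eye(x.size) - K @ C) @ P_pred            # Eq. (5.8)
    return x_new, P_new

# One step on an invented mildly nonlinear model with a scalar observation:
F = lambda x: np.array([x[0] + 0.1 * np.tanh(x[1]), 0.95 * x[1]])
H = lambda x: x[:1]
x, P = np.zeros(2), np.eye(2)
x, P = ekf_step(x, P, np.array([0.5]), F, H, 0.01 * np.eye(2), np.array([[0.25]]))
print(x)
```

For a linear F and H this step reduces exactly to the ordinary Kalman filter, which is a useful sanity check on any implementation.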


Table 5.1 Extended Kalman filter (EKF) equations

Initialize with:

    \hat{x}_0 = E[x_0],                                                       (5.2)
    P_{x_0} = E[(x_0 - \hat{x}_0)(x_0 - \hat{x}_0)^T].                        (5.3)

For k \in \{1, ..., \infty\}, the time-update equations of the extended Kalman filter are

    \hat{x}_k^- = F(\hat{x}_{k-1}, u_k, w),                                   (5.4)
    P_{x_k}^- = A_{k-1} P_{x_{k-1}} A_{k-1}^T + R_v,                          (5.5)

and the measurement-update equations are

    K_k^x = P_{x_k}^- C_k^T (C_k P_{x_k}^- C_k^T + R_n)^{-1},                 (5.6)
    \hat{x}_k = \hat{x}_k^- + K_k^x [y_k - H(\hat{x}_k^-, w)],                (5.7)
    P_{x_k} = (I - K_k^x C_k) P_{x_k}^-,                                      (5.8)

where

    A_k \triangleq \partial F(x, u_k, w)/\partial x |_{x = \hat{x}_k},
    C_k \triangleq \partial H(x, w)/\partial x |_{x = \hat{x}_k^-},           (5.9)

and where R_v and R_n are the covariances of v_k and n_k, respectively.

5.2.2 EKF–Weight Estimation

As proposed initially in [30], and further developed in [31] and [32], the EKF can also be used for estimating the parameters of nonlinear models (i.e., training neural networks) from clean data. Consider the general problem of learning a mapping using a parameterized nonlinear function G(x_k, w). Typically, a training set is provided with sample pairs consisting of known input and desired output, \{x_k, d_k\}. The error in the model is defined as e_k = d_k - G(x_k, w), and the goal of learning involves solving for the parameters w in order to minimize the expected squared error. The EKF may be used to estimate the parameters by writing a new state-space representation

    w_{k+1} = w_k + r_k,                                                      (5.11)
    d_k = G(x_k, w_k) + e_k,                                                  (5.12)

where the parameters w_k correspond to a stationary process with identity state transition matrix, driven by process noise r_k. The output d_k


Table 5.2 The extended Kalman weight filter equations

Initialize with:

    \hat{w}_0 = E[w],                                                         (5.13)
    P_{w_0} = E[(w - \hat{w}_0)(w - \hat{w}_0)^T].                            (5.14)

For k \in \{1, ..., \infty\}, the time-update equations of the Kalman filter are

    \hat{w}_k^- = \hat{w}_{k-1},                                              (5.15)
    P_{w_k}^- = P_{w_{k-1}} + R_{k-1}^r,                                      (5.16)

and the measurement-update equations are

    K_k^w = P_{w_k}^- (C_k^w)^T (C_k^w P_{w_k}^- (C_k^w)^T + R_e)^{-1},       (5.17)
    \hat{w}_k = \hat{w}_k^- + K_k^w (d_k - G(x_k, \hat{w}_k^-)),              (5.18)
    P_{w_k} = (I - K_k^w C_k^w) P_{w_k}^-,                                    (5.19)

where

    C_k^w \triangleq \partial G(x_k, w)^T/\partial w |_{w = \hat{w}_k^-}      (5.20)

corresponds to a nonlinear observation on w_k. The EKF can then be applied directly, with the equations given in Table 5.2. In the linear case, the relationship between the Kalman filter (KF) and the popular recursive least-squares (RLS) algorithm is given in [33] and [34]. In the nonlinear case, EKF training corresponds to a modified-Newton optimization method [22].

As an optimization approach, the EKF minimizes the prediction-error cost

    J(w) = \sum_{t=1}^{k} [d_t - G(x_t, w)]^T (R_e)^{-1} [d_t - G(x_t, w)].   (5.21)

If the "noise" covariance R_e is a constant diagonal matrix, then, in fact, it cancels out of the algorithm (this can be shown explicitly), and hence can be set arbitrarily (e.g., R_e = 0.5 I). Alternatively, R_e can be set to specify a weighted MSE cost. The innovations covariance E[r_k r_k^T] = R_k^r, on the other hand, affects the convergence rate and tracking performance. Roughly speaking, the larger the covariance, the more quickly older data are discarded. There are several options on how to choose R_k^r:

- Set R_k^r to an arbitrary diagonal value, and anneal it towards zero as training continues.


- Set R_k^r = (\lambda^{-1} - 1) P_{w_k}, where \lambda \in (0, 1] is often referred to as the "forgetting factor." This provides for an approximately exponentially decaying weighting on past data, and is described more fully in [22].

- Set R_k^r = (1 - \alpha) R_{k-1}^r + \alpha K_k^w [d_k - G(x_k, \hat{w})][d_k - G(x_k, \hat{w})]^T (K_k^w)^T, which is a Robbins–Monro stochastic approximation scheme for estimating the innovations [6]. The method assumes that the covariance of the Kalman update model is consistent with the actual update model.

Typically, R_k^r is also constrained to be a diagonal matrix, which implies an independence assumption on the parameters.

Study of the various trade-offs between these different approaches is still an area of open research. For the experiments performed in this chapter, the forgetting-factor approach is used.

Returning to the dynamic system of Eq. (5.1), the EKF weight filter can be used to estimate the model parameters for either F or H. To learn the state dynamics, we simply make the substitutions G → F and d_k → x_{k+1}. To learn the measurement function, we make the substitutions G → H and d_k → y_k. Note that in both cases, it is assumed that the noise-free state x_k is available for training.

5.2.3 Dual Estimation

When the clean state is not available, a dual estimation approach is required. In this section, we introduce the basic dual EKF algorithm, which combines the Kalman state and weight filters. Recall that the task is to estimate both the state and the model from only noisy observations. Essentially, two EKFs are run concurrently.
At every time step, an EKF state filter estimates the state using the current model estimate \hat{w}_k, while the EKF weight filter estimates the weights using the current state estimate \hat{x}_k. The system is shown schematically in Figure 5.2. In order to simplify the presentation of the equations, we consider the slightly less general state-space model:

    x_{k+1} = F(x_k, u_k, w) + v_k,                                           (5.22)
    y_k = C x_k + n_k,    C = [1  0  ...  0],                                 (5.23)

in which we take the scalar observation y_k to be one of the states. Thus, we only need to consider estimating the parameters associated with a single


Figure 5.2 The dual extended Kalman filter. The algorithm consists of two EKFs that run concurrently. The top EKF generates state estimates, and requires \hat{w}_{k-1} for the time update. The bottom EKF generates weight estimates, and requires \hat{x}_{k-1} for the measurement update.

nonlinear function F. The dual EKF equations for this system are presented in Table 5.3. Note that, for clarity, we have specified the equations for the additive white-noise case. The case of colored measurement noise n_k is treated in Appendix B.

Recurrent Derivative Computation   While the dual EKF equations appear to be a simple concatenation of the previous state and weight EKF equations, there is actually a necessary modification of the linearization C_k^w = C \partial\hat{x}_k^-/\partial\hat{w} associated with the weight filter. This is due to the fact that the signal filter, whose parameters are being estimated by the weight filter, has a recurrent architecture; that is, \hat{x}_k^- is a function of \hat{x}_{k-1}, and both are functions of w.¹ Thus, the linearization must be computed using recurrent derivatives with a routine similar to real-time recurrent learning

¹ Note that a linearization is also required for the state EKF, but this derivative, \partial F(\hat{x}_{k-1}, \hat{w}_k^-)/\partial\hat{x}_{k-1}, can be computed with a simple technique (such as backpropagation) because \hat{w}_k^- is not itself a function of \hat{x}_{k-1}.


Table 5.3 The dual extended Kalman filter equations. The definitions of \epsilon_k and C_k^w depend on the particular form of the weight filter being used. See the text for details

Initialize with:

    \hat{w}_0 = E[w],    P_{w_0} = E[(w - \hat{w}_0)(w - \hat{w}_0)^T],
    \hat{x}_0 = E[x_0],  P_{x_0} = E[(x_0 - \hat{x}_0)(x_0 - \hat{x}_0)^T].

For k \in \{1, ..., \infty\}, the time-update equations for the weight filter are

    \hat{w}_k^- = \hat{w}_{k-1},                                              (5.24)
    P_{w_k}^- = P_{w_{k-1}} + R_{k-1}^r = \lambda^{-1} P_{w_{k-1}},           (5.25)

and those for the state filter are

    \hat{x}_k^- = F(\hat{x}_{k-1}, u_k, \hat{w}_k^-),                         (5.26)
    P_{x_k}^- = A_{k-1} P_{x_{k-1}} A_{k-1}^T + R_v.                          (5.27)

The measurement-update equations for the state filter are

    K_k^x = P_{x_k}^- C^T (C P_{x_k}^- C^T + R_n)^{-1},                       (5.28)
    \hat{x}_k = \hat{x}_k^- + K_k^x (y_k - C \hat{x}_k^-),                    (5.29)
    P_{x_k} = (I - K_k^x C) P_{x_k}^-,                                        (5.30)

and those for the weight filter are

    K_k^w = P_{w_k}^- (C_k^w)^T [C_k^w P_{w_k}^- (C_k^w)^T + R_e]^{-1},       (5.31)
    \hat{w}_k = \hat{w}_k^- + K_k^w \epsilon_k,                               (5.32)
    P_{w_k} = (I - K_k^w C_k^w) P_{w_k}^-,                                    (5.33)

where

    A_{k-1} \triangleq \partial F(x, \hat{w}_k^-)/\partial x |_{x = \hat{x}_{k-1}},
    \epsilon_k = (y_k - C \hat{x}_k^-),    C_k^w \triangleq C \partial\hat{x}_k^-/\partial w |_{w = \hat{w}_k^-}.    (5.34)

(RTRL) [35]. Taking the derivative of the signal filter equations results in the following system of recursive equations:

    \partial\hat{x}_{k+1}^-/\partial\hat{w} = (\partial F(\hat{x}, \hat{w})/\partial\hat{x}_k)(\partial\hat{x}_k/\partial\hat{w}) + \partial F(\hat{x}, \hat{w})/\partial\hat{w}_k,    (5.35)
    \partial\hat{x}_k/\partial\hat{w} = (I - K_k^x C)(\partial\hat{x}_k^-/\partial\hat{w}) + (\partial K_k^x/\partial\hat{w})(y_k - C \hat{x}_k^-),                                   (5.36)


where \partial F(\hat{x}, \hat{w})/\partial\hat{x}_k and \partial F(\hat{x}, \hat{w})/\partial\hat{w}_k are evaluated at \hat{w}_k^- and contain static linearizations of the nonlinear function.

The last term in Eq. (5.36) may be dropped if we assume that the Kalman gain K_k^x is independent of w. Although this greatly simplifies the algorithm, the exact value of \partial K_k^x/\partial\hat{w} may be computed, as shown in Appendix A. Whether the computational expense of calculating the recursive derivatives (especially that of calculating \partial K_k^x/\partial\hat{w}) is worth the improvement in performance is clearly a design issue. Experimentally, the recursive derivatives appear to be more critical when the signal is highly nonlinear, or is corrupted by a high level of noise.

Example   As an example application, consider the noisy time series \{x_k\}_1^N generated by a nonlinear autoregression:

    x_k = f(x_{k-1}, ..., x_{k-M}, w) + v_k,
    y_k = x_k + n_k,    for all k \in \{1, ..., N\}.                          (5.37)

The observations of the series y_k contain measurement noise n_k in addition to the signal. The dual EKF requires reformulating this model into a state-space representation. One such representation is given by

    \mathbf{x}_k = F(\mathbf{x}_{k-1}, w) + B v_k,                            (5.38)

    [ x_k       ]     [ f(x_{k-1}, ..., x_{k-M}, w) ]     [ 1 ]
    [ x_{k-1}   ]  =  [ x_{k-1}                     ]  +  [ 0 ] v_k,
    [   ...     ]     [   ...                       ]     [...]
    [ x_{k-M+1} ]     [ x_{k-M+1}                   ]     [ 0 ]

    y_k = C \mathbf{x}_k + n_k = [1  0  ...  0] \mathbf{x}_k + n_k,           (5.39)

where the state \mathbf{x}_k is chosen to be lagged values of the time series, and the state transition function F(·) has its first element given by f(·), with the remaining elements corresponding to shifted values of the previous state.

The results of a controlled time-series experiment are shown in Figure 5.3. The clean signal, shown by the thin curve in Figure 5.3a, is generated by a neural network (10-5-1) with chaotic dynamics, driven by white Gaussian process noise (\sigma_v^2 = 0.36).
Colored noise generated by a linear autoregressive model is added at 3 dB signal-to-noise ratio (SNR) to produce the noisy data indicated by + symbols. Figure 5.3b shows the


Figure 5.3 The dual EKF estimate (heavy curve) of a signal generated by a neural network (thin curve) and corrupted by adding colored noise at 3 dB (+). For clarity, the last 150 points of a 20,000-point series are shown. Only the noisy data are available: both the signal and the weights are estimated by the dual EKF. (a) Clean neural network signal and noisy measurements. (b) Dual EKF estimates versus EKF estimates. (c) Estimates with full and static derivatives. (d) MSE profiles of EKF versus dual EKF.


time series estimated by the dual EKF. The algorithm estimates both the clean time series and the neural network weights. The algorithm is run sequentially over 20,000 points of data; for clarity, only the last 150 points are shown. For comparison, the estimates using an EKF with the known neural network model are also shown. The MSE for the dual EKF, computed over the final 1000 points of the series, is 0.2171, whereas the EKF produces an MSE of 0.2153, indicating that the dual algorithm has successfully learned both the model and the state estimates.²

Figure 5.3c shows the estimate when the static approximation to the recursive derivatives is used. In this example, the static derivative actually provides a slight advantage, with an MSE of 0.2122. The difference, however, is not statistically significant. Finally, Figure 5.3d assesses the convergence behavior of the algorithm. The mean-squared error (MSE) is computed over 500-point segments of the time series at 50-point intervals to produce the MSE profile (dashed line). For comparison, the solid line is the MSE profile of the EKF signal-estimation algorithm, which uses the true neural network model. The dual EKF appears to converge to the optimal solution after only about 2000 points.

5.3 A PROBABILISTIC PERSPECTIVE

In this section, we present a unified framework for dual estimation. We start by developing a probabilistic perspective, which leads to a number of possible cost functions that can be used in the estimation process. Various approaches in the literature, which may differ in their actual optimization procedure, can then be related based on the underlying cost function.
We then show how a Kalman-based optimization procedure can be used to provide a common algorithmic framework for minimizing each of the cost functions.

MAP Estimation   Dual estimation can be cast as a maximum a posteriori (MAP) solution. The statistical information contained in the sequence of data \{y_k\}_1^N about the signal and parameters is embodied by the joint conditional probability density of the sequence of states \{x_k\}_1^N and weights

² A surprising result is that the dual EKF sometimes actually outperforms the EKF, even though the EKF appears to have the unfair advantage of knowing the true model. Our explanation is that the EKF, even with the known model, is still an approximate estimation algorithm. While the dual EKF also learns an approximate model, this model can actually be better matched to the state-estimation approximation.


w, given the noisy data \{y_k\}_1^N. For notational convenience, define the column vectors x_1^N and y_1^N, with elements from \{x_k\}_1^N and \{y_k\}_1^N, respectively. The joint conditional density function is written as

    \rho_{x_1^N w | y_1^N}(X = x_1^N, W = w | Y = y_1^N),                     (5.40)

where X, Y, and W are the vectors of random variables associated with x_1^N, y_1^N, and w, respectively. This joint density is abbreviated as \rho_{x_1^N w | y_1^N}. The MAP estimation approach consists of determining instances of the states and weights that maximize this conditional density. For Gaussian distributions, the MAP estimate also corresponds to the minimum mean-squared-error (MMSE) estimator. More generally, as long as the density is unimodal and symmetric around the mean, the MAP estimate provides the Bayes estimate for a broad class of loss functions [36].

Taking MAP as the starting point allows dual estimation approaches to be divided into two basic classes. The first, referred to here as joint estimation methods, attempt to maximize \rho_{x_1^N w | y_1^N} directly. We can write this optimization problem explicitly as

    (\hat{x}_1^N, \hat{w}) = arg max_{x_1^N, w} \rho_{x_1^N w | y_1^N}.       (5.41)

The second class of methods, which will be referred to as marginal estimation methods, operate by expanding the joint density as

    \rho_{x_1^N w | y_1^N} = \rho_{x_1^N | w y_1^N} \rho_{w | y_1^N}          (5.42)

and maximizing the two terms separately, that is,

    \hat{x}_1^N = arg max_{x_1^N} \rho_{x_1^N | w y_1^N},    \hat{w} = arg max_{w} \rho_{w | y_1^N}.    (5.43)

The cost functions associated with joint and marginal approaches will be discussed in the following sections.
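The distinction between the two classes is easy to visualize numerically: on a multimodal posterior, maximizing the joint density and maximizing the marginal density of w can select different modes. The two-mode "posterior" below is invented purely for illustration (a sharp narrow mode and a broad, heavier mode on a grid):

```python
import numpy as np

# Grid over (x, w) and an invented two-mode density rho(x, w | y).
xs = np.linspace(-4.0, 4.0, 401)
ws = np.linspace(-4.0, 4.0, 401)
X, W = np.meshgrid(xs, ws, indexing="ij")

def mode(mx, mw, s, amp):
    """Unnormalized isotropic Gaussian bump of width s centered at (mx, mw)."""
    return amp * np.exp(-((X - mx) ** 2 + (W - mw) ** 2) / (2 * s ** 2)) / s ** 2

rho = mode(-1.5, -1.5, 0.3, 0.15) + mode(1.5, 1.5, 1.0, 1.0)

# Joint estimation (Eq. 5.41): maximize rho(x, w | y) over both arguments.
i, j = np.unravel_index(np.argmax(rho), rho.shape)
x_joint, w_joint = xs[i], ws[j]

# Marginal estimation (Eq. 5.43): maximize rho(w | y), then rho(x | w, y).
j_m = int(np.argmax(rho.sum(axis=0)))      # marginalize x out
w_marg = ws[j_m]
x_marg = xs[int(np.argmax(rho[:, j_m]))]

print((x_joint, w_joint), (x_marg, w_marg))
```

Here the joint maximizer sits on the tall narrow peak, while marginalizing over x favors the broad mode carrying more probability mass, so the two estimates land on different modes; for unimodal symmetric densities, as noted above, the distinction disappears.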


5.3.1 Joint Estimation Methods

Using Bayes' rule, the joint conditional density can be expressed as

    \rho_{x_1^N w | y_1^N} = \rho_{y_1^N | x_1^N w} \rho_{x_1^N w} / \rho_{y_1^N}
                           = \rho_{y_1^N | x_1^N w} \rho_{x_1^N | w} \rho_w / \rho_{y_1^N}.    (5.44)

Although \{y_k\}_1^N is statistically dependent on \{x_k\}_1^N and w, the prior \rho_{y_1^N} is nonetheless functionally independent of \{x_k\}_1^N and w. Therefore, \rho_{x_1^N w | y_1^N} can be maximized by maximizing the terms in the numerator alone. Furthermore, if no prior information is available on the weights, \rho_w can be dropped, leaving the maximization of

    \rho_{y_1^N x_1^N | w} = \rho_{y_1^N | x_1^N w} \rho_{x_1^N | w}          (5.45)

with respect to \{x_k\}_1^N and w.

To derive the corresponding cost function, we assume v_k and n_k are both zero-mean white Gaussian noise processes. It can then be shown (see [22]) that

    \rho_{y_1^N x_1^N | w} = \frac{1}{\sqrt{(2\pi)^N (\sigma_n^2)^N}} \exp\left[ -\sum_{k=1}^{N} \frac{(y_k - C x_k)^2}{2\sigma_n^2} \right]
                             \times \frac{1}{\sqrt{(2\pi)^N |R_v|^N}} \exp\left[ -\sum_{k=1}^{N} \frac{1}{2} (x_k - \bar{x}_k)^T (R_v)^{-1} (x_k - \bar{x}_k) \right],    (5.46)

where

    \bar{x}_k \triangleq E[x_k | \{x_t\}_1^{k-1}, w] = F(x_{k-1}, w).         (5.47)


Here we have used the structure given in Eq. (5.37) to compute the prediction \bar{x}_k using the model F(·, w). Taking the logarithm, the corresponding cost function is given by

    J = \sum_{k=1}^{N} \left[ \log(2\pi\sigma_n^2) + \frac{(y_k - C x_k)^2}{\sigma_n^2} \right.                      (5.48)
        \left. + \log(2\pi |R_v|) + (x_k - \bar{x}_k)^T (R_v)^{-1} (x_k - \bar{x}_k) \right].                        (5.49)

This cost function can be minimized with respect to any of the unknown quantities (including the variances, which we will consider in Section 5.4). For the time being, consider only the optimization of \{x_k\}_1^N and w. Because the log terms in the above cost are independent of the signal and weights, they can be dropped, providing a more specialized cost function:

    J_j(x_1^N, w) = \sum_{k=1}^{N} \left[ \frac{(y_k - C x_k)^2}{\sigma_n^2} + (x_k - \bar{x}_k)^T (R_v)^{-1} (x_k - \bar{x}_k) \right].    (5.50)

The first term is a soft constraint keeping \{x_k\}_1^N close to the observations \{y_k\}_1^N. The smaller the measurement noise variance \sigma_n^2, the stronger this constraint will be. The second term keeps the state estimates and model estimates mutually consistent with the model structure. This constraint will be strong when the state is highly deterministic (i.e., R_v is small). J_j(x_1^N, w) should be minimized with respect to both \{x_k\}_1^N and w to find the estimates that maximize the joint density \rho_{y_1^N x_1^N | w}. This is a difficult optimization problem because of the high degree of coupling between the unknown quantities \{x_k\}_1^N and w. In general, we can classify approaches as being either direct or decoupled.
In direct approaches, both the signal and the weights are determined jointly as a multivariate optimization problem. Decoupled approaches optimize one variable at a time while the other is held fixed, alternating between the two. Direct algorithms include the joint EKF algorithm (see Section 5.1), which attempts to minimize the cost sequentially by combining the signal and weights into a single (joint) state vector. The decoupled approaches are elaborated below.

Decoupled Estimation   To minimize $J_j(x_1^N, w)$ with respect to the signal, the cost function is evaluated using the current estimate $\hat{w}$ of the


weights to generate the predictions. The simplest approach is to substitute the predictions $\hat{\bar{x}}_k \triangleq F(x_{k-1}, \hat{w})$ directly into Eq. (5.50), obtaining

$$J_j(x_1^N, \hat{w}) = \sum_{k=1}^N \left[ \frac{(y_k - C x_k)^2}{\sigma_n^2} + (x_k - \hat{\bar{x}}_k)^T (R_v)^{-1} (x_k - \hat{\bar{x}}_k) \right]. \quad (5.51)$$

This cost function is then minimized with respect to $\{x_k\}_1^N$. To minimize the joint cost function with respect to the weights, $J_j(x_1^N, w)$ is evaluated using the current signal estimate $\{\hat{x}_k\}_1^N$ and the associated (redefined) predictions $\hat{\bar{x}}_k \triangleq F(\hat{x}_{k-1}, w)$. Again, this results in a straightforward substitution in Eq. (5.50):

$$J_j(\hat{x}_1^N, w) = \sum_{k=1}^N \left[ \frac{(y_k - C \hat{x}_k)^2}{\sigma_n^2} + (\hat{x}_k - \hat{\bar{x}}_k)^T (R_v)^{-1} (\hat{x}_k - \hat{\bar{x}}_k) \right]. \quad (5.52)$$

An alternative simplified cost function can be used if it is assumed that only $\hat{\bar{x}}_k$ is a function of the weights:

$$J_{ji}(\hat{x}_1^N, w) = \sum_{k=1}^N (\hat{x}_k - \hat{\bar{x}}_k)^T (R_v)^{-1} (\hat{x}_k - \hat{\bar{x}}_k). \quad (5.53)$$

This is essentially a type of prediction error cost, where the model is trained to predict the estimated state. Effectively, the method maximizes $\rho_{x_1^N | w}$, evaluated at $x_1^N = \hat{x}_1^N$. A potential problem with this approach is that it is not directly constrained by the actual data $\{y_k\}_1^N$. An inaccurate (yet self-consistent) pair of estimates $(\hat{x}_1^N, \hat{w})$ could conceivably be obtained as a solution. Nonetheless, this is essentially the approach used in [14] for robust prediction of time series containing outliers.

In the decoupled approach to joint estimation, by separately minimizing each cost with respect to its argument, the values are found that maximize (at least locally) the joint conditional density function. Algorithms that fall into this class include a sequential two-observation form of the dual EKF algorithm [21], and the errors-in-variables (EIV) method applied to batch-style minimization [18, 19].
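The decoupled scheme just described can be sketched for a linear AR(1) model: with the signal fixed, the weight subproblem of Eq. (5.52) has a closed-form least-squares solution; with the weight fixed, the signal estimate of Eq. (5.51) is improved by gradient steps. All names, noise levels, and step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, sigma_v, sigma_n = 0.8, 0.3, 0.1
N = 200
x_true = np.zeros(N)
for k in range(1, N):
    x_true[k] = w_true * x_true[k - 1] + sigma_v * rng.standard_normal()
y = x_true + sigma_n * rng.standard_normal(N)

def joint_cost(x, w):
    # Eq. (5.50) specialized to the scalar AR(1) case
    return (np.sum((y[1:] - x[1:]) ** 2) / sigma_n**2
            + np.sum((x[1:] - w * x[:-1]) ** 2) / sigma_v**2)

x, w = y.copy(), 0.0              # initialize the signal estimate at the data
J0 = joint_cost(x, w)
for _ in range(50):
    # weight step: exact least-squares minimizer of Eq. (5.52) given x
    w = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    # signal step: a few gradient-descent sweeps on Eq. (5.51) given w
    for _ in range(10):
        g = np.zeros(N)
        g[1:] += -2 * (y[1:] - x[1:]) / sigma_n**2
        g[1:] += 2 * (x[1:] - w * x[:-1]) / sigma_v**2
        g[:-1] += -2 * w * (x[1:] - w * x[:-1]) / sigma_v**2
        x = x - 1e-4 * g
J1 = joint_cost(x, w)
```

Because each half-step can only lower the shared cost, the joint cost decreases across the alternation, illustrating why the decoupled iterations converge to a (local) maximizer of the joint density.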
An alternative approach, referred to as error coupling, takes the extra step of taking the errors in the estimates into account. However, this error-coupled approach (investigated in [22]) does not appear to perform reliably, and is not described further in this chapter.


5.3.2 Marginal Estimation Methods

Recall that in marginal estimation, the joint density function is expanded as

$$\rho_{x_1^N w | y_1^N} = \rho_{x_1^N | w y_1^N}\, \rho_{w | y_1^N}, \quad (5.54)$$

and $\hat{x}_1^N$ is found by maximizing the first factor on the right-hand side, while $\hat{w}$ is found by maximizing the second factor. Note that only the first factor ($\rho_{x_1^N | w y_1^N}$) is dependent on the state. Hence, maximizing this factor for the state will yield the same solution as when maximizing the joint density (assuming the optimal weights have been found). However, because both factors also depend on $w$, maximizing the second ($\rho_{w | y_1^N}$) alone with respect to $w$ is not the same as maximizing the joint density $\rho_{x_1^N w | y_1^N}$ with respect to $w$. Nonetheless, the resulting estimates $\hat{w}$ are consistent and unbiased, if conditions of sufficient excitation are met [37]. The marginal estimation approach is exemplified by the maximum-likelihood approaches [8, 9] and EM approaches [11, 12]. Motivation for these methods usually comes from considering only the marginal density $\rho_{w | y_1^N}$ to be the relevant quantity to maximize, rather than the joint density $\rho_{x_1^N w | y_1^N}$. However, in order to maximize the marginal density, it is necessary to generate signal estimates, which are invariably produced by maximizing the first term $\rho_{x_1^N | w y_1^N}$.

Maximum-Likelihood Cost   To derive a cost function for weight estimation, we further expand the marginal density as

$$\rho_{w | y_1^N} = \frac{\rho_{y_1^N | w}\, \rho_w}{\rho_{y_1^N}}. \quad (5.55)$$

If there is no prior information on $w$, maximizing this posterior density is equivalent to maximizing the likelihood function $\rho_{y_1^N | w}$.
Assuming Gaussian statistics, the chain rule for conditional probabilities can be used to express this likelihood function as

$$\rho_{y_1^N | w} = \prod_{k=1}^N \frac{1}{\sqrt{2\pi\sigma_{e_k}^2}} \exp\left[ -\frac{(y_k - \bar{y}_{k|k-1})^2}{2\sigma_{e_k}^2} \right], \quad (5.56)$$


where

$$\bar{y}_{k|k-1} \triangleq E[y_k \mid \{y_t\}_1^{k-1}, w] \quad (5.57)$$

is the conditional mean (and optimal prediction), and $\sigma_{e_k}^2$ is the prediction error variance. Taking the logarithm yields the following maximum-likelihood cost function:

$$J_{ml}(w) = \sum_{k=1}^N \left[ \log(2\pi\sigma_{e_k}^2) + \frac{(y_k - \bar{y}_{k|k-1})^2}{\sigma_{e_k}^2} \right]. \quad (5.58)$$

Note that the log-likelihood function takes the same form whether the measurement noise is colored or white. In evaluating this cost function, the term $\bar{y}_{k|k-1} = C\hat{x}_k^-$ must be computed. Thus, the signal estimate must be determined as a step to weight estimation. For linear models, this can be done exactly using an ordinary Kalman filter. For nonlinear models, however, the expectation is approximated by an extended Kalman filter, which equivalently attempts to minimize the joint cost $J_j(x_1^k, \hat{w})$ defined in Section 5.3.1 by Eq. (5.51).

An iterative maximum-likelihood approach for linear models is described in [7] and [8]; this chapter presents a sequential maximum-likelihood approach for nonlinear models, developed in [21].

Prediction Error Cost   Often the variance $\sigma_{e_k}^2$ in the maximum-likelihood cost is assumed (incorrectly) to be independent of the weights $w$ and the time index $k$. Under this assumption, the log-likelihood can be maximized by minimizing the squared prediction error cost function:

$$J_{pe}(w) = \sum_{k=1}^N (y_k - \bar{y}_{k|k-1})^2. \quad (5.59)$$

The basic dual EKF algorithm described in the previous section minimizes this simplified cost function with respect to the weights $w$, and is an example of a recursive prediction error algorithm [6, 19].
While questionable from a theoretical perspective, these algorithms have been shown in the literature to be quite useful. In addition, they benefit from reduced computational cost, because the derivative of the variance $\sigma_{e_k}^2$ with respect to $w$ is not computed.
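For a linear AR(1) model, the quantities entering Eqs. (5.58) and (5.59) are computed exactly by a scalar Kalman filter: $\bar{y}_{k|k-1}$ is the innovation mean and $\sigma_{e_k}^2 = \sigma_n^2 + C P_k^- C^T$ (see Eq. (5.75c) below). A hedged sketch, with illustrative names:

```python
import numpy as np

def ml_cost(w, y, sigma_v2, sigma_n2):
    """Maximum-likelihood cost of Eq. (5.58) for a scalar AR(1) model,
    accumulated from Kalman-filter innovations."""
    x_hat, p = 0.0, 1.0
    J = 0.0
    for yk in y:
        x_prior = w * x_hat                  # time update
        p_prior = w * w * p + sigma_v2
        s_e = sigma_n2 + p_prior             # innovation variance
        e = yk - x_prior                     # innovation y_k - ybar_{k|k-1}
        J += np.log(2 * np.pi * s_e) + e * e / s_e
        K = p_prior / s_e                    # measurement update
        x_hat = x_prior + K * e
        p = (1 - K) * p_prior
    return J

# Synthetic data: the cost should prefer the generating coefficient.
rng = np.random.default_rng(1)
w_true, sigma_v2, sigma_n2 = 0.9, 0.09, 0.04
x = np.zeros(500)
for k in range(1, 500):
    x[k] = w_true * x[k - 1] + np.sqrt(sigma_v2) * rng.standard_normal()
y = x + np.sqrt(sigma_n2) * rng.standard_normal(500)
```

Dropping the `np.log(...)` term from the accumulation turns this into the prediction error cost of Eq. (5.59), which is what the basic dual EKF minimizes.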


EM Algorithm   Another approach to maximizing $\rho_{w | y_1^N}$ is offered by the expectation-maximization (EM) algorithm [10, 12, 38]. The EM algorithm can be derived by first expanding the log-likelihood as

$$\log \rho_{y_1^N | w} = \log \rho_{y_1^N x_1^N | w} - \log \rho_{x_1^N | w y_1^N}. \quad (5.60)$$

Taking the conditional expectation of both sides using the conditional density $\rho_{x_1^N | w y_1^N}$ gives

$$\log \rho_{y_1^N | w} = E_{X|YW}[\log \rho_{y_1^N x_1^N | w} \mid y_1^N, \hat{w}] - E_{X|YW}[\log \rho_{x_1^N | w y_1^N} \mid y_1^N, \hat{w}], \quad (5.61)$$

where the expectation over $X$ of the left-hand side has no effect, because $X$ does not appear in $\log \rho_{y_1^N | w}$. Note that the expectation is conditional on a previous estimate of the weights, $\hat{w}$. The second term on the right is concave by Jensen's inequality [39],³ so choosing $w$ to maximize the first term on the right-hand side alone will always increase the log-likelihood on the left-hand side. Thus, the EM algorithm repeatedly maximizes $E_{X|YW}[\log \rho_{y_1^N x_1^N | w} \mid y_1^N, \hat{w}]$ with respect to $w$, each time setting $\hat{w}$ to the new maximizing value. The procedure results in maximizing the original marginal density $\rho_{y_1^N | w}$.

For the white-noise case, it can be shown (see [12, 22]) that the EM cost function is

$$J_{em} = E_{X|YW}\left[ \sum_{k=1}^N \left\{ \log(2\pi\sigma_n^2) + \frac{(y_k - C x_k)^2}{\sigma_n^2} + \log(2\pi|R_v|) + (x_k - \bar{x}_k)^T (R_v)^{-1} (x_k - \bar{x}_k) \right\} \,\middle|\, y_1^N, \hat{w} \right], \quad (5.62)$$

where $\bar{x}_k \triangleq F(x_{k-1}, w)$, as before. The evaluation of this expectation is computable on a term-by-term basis (see [12] for the linear case). However, for the sake of simplicity, we present here the resulting

³ Jensen's inequality states that $E[g(x)] \le g(E[x])$ for a concave function $g(\cdot)$.


expression for the special case of time-series estimation, represented in Eq. (5.37). As shown in [22], the expectation evaluates to

$$J_{em} = N \log(4\pi^2 \sigma_v^2 \sigma_n^2) + \sum_{k=1}^N \left[ \frac{(y_k - \hat{x}_{k|N})^2 + p_{k|N}}{\sigma_n^2} + \frac{(\hat{x}_{k|N} - \hat{\bar{x}}_{k|N})^2 + p_{k|N} - 2 p^y_{k|N} + \bar{p}_{k|N}}{\sigma_v^2} \right], \quad (5.63)$$

where $\hat{x}_{k|N}$ and $p_{k|N}$ are defined as the conditional mean and variance of $x_k$ given $\hat{w}$ and all the data, $\{y_k\}_1^N$. The terms $\hat{\bar{x}}_{k|N}$ and $\bar{p}_{k|N}$ are the conditional mean and variance of $\bar{x}_k = f(x_{k-1}, w)$, given all the data. The additional term $p^y_{k|N}$ represents the cross-variance of $x_k$ and $\bar{x}_k$, conditioned on all the data. Again we see that determining state estimates is a necessary step to determining the weights. In this case, the estimates $\hat{x}_{k|N}$ are found by minimizing the joint cost $J_j(x_1^N, \hat{w})$, which can be approximated using an extended Kalman smoother. A sequential version of EM can be implemented by replacing $\hat{x}_{k|N}$ with the usual causal estimates $\hat{x}_k$, found using the EKF.

Summary of Cost Functions   The various cost functions given in this section are summarized in Table 5.4. No explicit signal-estimation cost is given for the marginal estimation methods, because signal estimation is only an implicit step of the marginal approach, and uses the joint cost $J_j(x_1^N, \hat{w})$.
These cost functions, combined with specific optimization methods, lead to the variety of algorithms that appear in the literature.

Table 5.4  Summary of dual estimation cost functions

           Symbol                    Name of cost                 Density                    Eq.
Joint      $J_j(x_1^N, w)$           Joint                        $\rho_{x_1^N w | y_1^N}$   (5.50)
           $J_j(x_1^N, \hat{w})$     Joint signal                 $\rho_{x_1^N w | y_1^N}$   (5.51)
           $J_j(\hat{x}_1^N, w)$     Joint weight                 $\rho_{x_1^N w | y_1^N}$   (5.52)
           $J_{ji}(\hat{x}_1^N, w)$  Joint weight (independent)   $\rho_{x_1^N | w}$         (5.53)
Marginal   $J_{pe}(w)$               Prediction error             $\rho_{w | y_1^N}$         (5.59)
           $J_{ml}(w)$               Maximum likelihood           $\rho_{w | y_1^N}$         (5.58)
           $J_{em}(w)$               EM                           n.a.                       (5.62)


In the next section, we shall show how each of these cost functions can be minimized using a general dual EKF-based approach.

5.3.3 Dual EKF Algorithms

In this section, we show how the dual EKF algorithm can be modified to minimize any of the cost functions discussed earlier. Recall that the basic dual EKF as presented in Section 5.2.3 minimized the prediction error cost of Eq. (5.59). As was shown in the last section, all approaches use the same joint cost function for the state-estimation component. Thus, the state EKF remains unchanged. Only the weight EKF must be modified. We shall show that this involves simply redefining the error term $\epsilon_k$.

To develop the method, consider again the general state-space formulation for weight estimation (Eq. (5.11)):

$$w_{k+1} = w_k + r_k, \quad (5.64)$$
$$d_k = G(x_k, w_k) + e_k. \quad (5.65)$$

We may reformulate this state-space representation as

$$w_k = w_{k-1} + r_k, \quad (5.66)$$
$$0 = -\epsilon_k + e_k, \quad (5.67)$$

where $\epsilon_k = d_k - G(x_k, w_k)$ and the target "observation" is fixed at zero. This observed error formulation yields the exact same set of Kalman equations as before, and hence minimizes the same prediction error cost, $J(w) = \sum_{t=1}^k [d_t - G(x_t, w)]^T (R^e)^{-1} [d_t - G(x_t, w)] = \sum_{t=1}^k J_t$. However, if we consider the modified-Newton algorithm interpretation, it can be shown [22] that the EKF weight filter is also equivalent to the recursion

$$\hat{w}_k = \hat{w}_k^- + P_{w_k} (C_k^w)^T (R^e)^{-1} (0 + \epsilon_k), \quad (5.68)$$

where

$$C_k^w \triangleq \frac{\partial(-\epsilon_k)}{\partial w^T}\bigg|_{w = \hat{w}_k^-} \quad (5.69)$$


and

$$P_{w_k}^{-1} = (\lambda^{-1} P_{w_{k-1}})^{-1} + (C_k^w)^T (R^e)^{-1} C_k^w. \quad (5.72)$$

The weight update in Eq. (5.68) is of the form

$$\hat{w}_k = \hat{w}_k^- - S_k \left[ \nabla_w J(\hat{w}_k^-) \right]^T, \quad (5.73)$$

where $\nabla_w J$ is the gradient of the cost $J$ with respect to $w$, and $S_k$ is a symmetric matrix that approximates the inverse Hessian of the cost. Both the gradient and Hessian are evaluated at the previous value of the weight estimate. Thus, we see that by using the observed error formulation, it is possible to redefine the error term $\epsilon_k$, which in turn allows us to minimize an arbitrary cost function that can be expressed as a sum of instantaneous terms $J_k = \epsilon_k^T \epsilon_k$. This basic idea was presented by Puskorius and Feldkamp [40] for minimizing an entropic cost function; see also Chapter 2. Note that $J_k = \epsilon_k^T \epsilon_k$ does not uniquely specify $\epsilon_k$, which can be vector-valued. The error must be chosen such that the gradient and inverse Hessian approximations (Eqs. (5.70) and (5.72)) are consistent with the desired batch cost.

In the following sections, we give the exact specification of the error term (and corresponding gradient $C_k^w$) necessary to modify the dual EKF algorithm to minimize the different cost functions. The original set of dual EKF equations given in Table 5.3 remains the same, with only $\epsilon_k$ being redefined. Note that for each case, the full evaluation of $C_k^w$ requires taking recursive gradients. The procedure for this is analogous to that taken in Section 5.2.3. Furthermore, we restrict ourselves to the autoregressive time-series model with state-space representation given in Eqs. (5.38) and (5.39).

Joint Estimation Forms   The corresponding weight cost function (see also Eq. (5.52)) and error terms are given in Table 5.5.
Note that this represents a special two-observation form of the weight filter, where

$$\hat{\bar{x}}_t = f(\hat{x}_{t-1}, w), \qquad e_k \triangleq (y_k - \hat{x}_k), \qquad \tilde{x}_k \triangleq (\hat{x}_k - \hat{\bar{x}}_k).$$

Note that this dual EKF algorithm represents a sequential form of the decoupled approach to joint optimization; that is, the two EKFs minimize the overall joint cost function by alternately optimizing one argument at a


Table 5.5  Joint cost function observed error terms for the dual EKF weight filter

$$J_j(\hat{x}_1^k, w) = \sum_{t=1}^k \left[ \frac{(y_t - \hat{x}_t)^2}{\sigma_n^2} + \frac{(\hat{x}_t - \hat{\bar{x}}_t)^2}{\sigma_v^2} \right], \quad (5.74)$$

$$\epsilon_k \triangleq \begin{bmatrix} \sigma_n^{-1} e_k \\ \sigma_v^{-1} \tilde{x}_k \end{bmatrix}, \qquad \text{with} \qquad C_k^w = \begin{bmatrix} \sigma_n^{-1} \nabla_w^T e_k \\ \sigma_v^{-1} \nabla_w^T \tilde{x}_k \end{bmatrix}.$$

time while the other argument is fixed. A direct approach found using the joint EKF is described later in Section 5.3.4.

Marginal Estimation Forms – Maximum-Likelihood Cost   The corresponding weight cost function (see Eq. (5.58)) and error terms are given in Table 5.6, where $e_k = y_k - \hat{x}_k^-$ and $\lambda_{e,k}$ is a positive weighting term (see [22] for a discussion of the selection and interpretation of $\lambda_{e,k}$). Note that the prediction error variance is given by

$$\sigma_{e_k}^2 = E[(y_k - \bar{y}_{k|k-1})^2 \mid \{y_t\}_1^{k-1}, w] \quad (5.75a)$$
$$= E[(n_k + x_k - \hat{x}_k^-)^2 \mid \{y_t\}_1^{k-1}, w] \quad (5.75b)$$
$$= \sigma_n^2 + C P_k^- C^T, \quad (5.75c)$$

where $P_k^-$ is computed by the Kalman signal filter.

Table 5.6  Maximum-likelihood cost function observed error terms for the dual EKF weight filter

$$J_{ml}(w) = \sum_{k=1}^N \left[ \log(2\pi\sigma_{e_k}^2) + \frac{(y_k - \hat{x}_k^-)^2}{\sigma_{e_k}^2} \right],$$

$$\epsilon_k \triangleq \begin{bmatrix} (\lambda_{e,k})^{1/2} \\ \sigma_{e_k}^{-1} e_k \end{bmatrix}, \qquad C_k^w = \begin{bmatrix} \dfrac{(\lambda_{e,k})^{-1/2}}{2\,\sigma_{e_k}^2}\, \nabla_w^T (\sigma_{e_k}^2) \\[2mm] \sigma_{e_k}^{-1}\, \nabla_w^T e_k + \dfrac{e_k}{2 (\sigma_{e_k}^2)^{3/2}}\, \nabla_w^T (\sigma_{e_k}^2) \end{bmatrix}.$$


Table 5.7  Prediction error cost function observed error terms for the dual EKF weight filter

$$J_{pe}(w) = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N (y_k - \hat{x}_k^-)^2, \quad (5.76)$$

$$\epsilon_k \triangleq e_k = (y_k - \hat{x}_k^-), \qquad C_k^w = -\nabla_w e_k = C \frac{\partial \hat{x}_k^-}{\partial w}\bigg|_{w = \hat{w}_k^-}.$$

Marginal Estimation Forms – Prediction Error Cost   If $\sigma_{e_k}^2$ is assumed to be independent of $w$, then we are left with the formulas corresponding to the original basic dual EKF algorithm (for the time-series case); see Table 5.7.

Marginal Estimation Forms – EM Cost   The dual EKF can be modified to implement a sequential EM algorithm. Note that the M-step, which relates to the weight filter, corresponds to a generalized M-step, in which the cost function is decreased (but not necessarily minimized) at each iteration. The formulation is given in Table 5.8, where $\tilde{x}_{k|k} = \hat{x}_k - \hat{\bar{x}}_{k|k}$. Note that $J_k^{em}(w)$ was specified by dropping terms in Eq. (5.63) that are independent of the weights (see [22]). While $\hat{x}_k$ are found by the usual state EKF, the variance terms $p^y_{k|k}$ and $\bar{p}_{k|k}$, as well as $\hat{\bar{x}}_{k|k}$ (a noncausal prediction), are not typically computed in the normal implementation of the state EKF. To compute these, the state vector is augmented by one additional lagged value of the signal:

$$x_k^+ \triangleq \begin{bmatrix} \mathbf{x}_k \\ x_{k-M} \end{bmatrix}, \quad (5.78)$$

Table 5.8  EM cost function observed error terms for the dual EKF weight filter

$$J_k^{em}(w) = \frac{(\hat{x}_k - \hat{\bar{x}}_{k|k})^2 - 2 p^y_{k|k} + \bar{p}_{k|k}}{\sigma_v^2}, \quad (5.77)$$

$$\epsilon_k \triangleq \begin{bmatrix} \sigma_v^{-1}\, \tilde{x}_{k|k} \\ \sqrt{2}\, \sigma_v^{-1} (p^y_{k|k})^{1/2} \\ \sigma_v^{-1} (\bar{p}_{k|k})^{1/2} \end{bmatrix}, \qquad C_k^w = \begin{bmatrix} \sigma_v^{-1}\, \nabla_w^T \tilde{x}_{k|k} \\[1mm] \dfrac{\sqrt{2}\, \sigma_v^{-1}}{2 (p^y_{k|k})^{1/2}}\, \nabla_w^T p^y_{k|k} \\[1mm] \dfrac{\sigma_v^{-1}}{2 (\bar{p}_{k|k})^{1/2}}\, \nabla_w^T \bar{p}_{k|k} \end{bmatrix}.$$


The state Kalman filter is then modified by adding a final zero element to the vectors $B$ and $C$ (see Eqs. (5.38) and (5.39)), and the linearized state transition matrix is redefined as

$$A_k \triangleq \begin{bmatrix} \nabla_x^T f \\ I \quad 0 \end{bmatrix},$$

whose first row is the linearization of $f(\cdot)$ and whose lower block shifts the lagged signal values. Now the estimate $\hat{x}_k^+$ will contain $\hat{x}_{k-1|k}$ in its last $M$ elements. As shown in [22], the variance terms are then approximated by

$$\bar{p}_{k|k} = C A_{k-1|k} (P_{k-1|k}) A_{k-1|k}^T C^T, \qquad p^y_{k|k} = C (P_k^{\sharp}) A_{k-1|k}^T C^T, \quad (5.79)$$

where the covariance $P_{k-1|k}$ is provided as the lower right block of the augmented covariance $P_k^+$, and $P_k^{\sharp}$ is the upper right block of $P_k^+$. The usual error covariance $P_k$ is provided in the upper left block of $P_k^+$. Furthermore, $A_{k-1|k}$ is found by linearizing $f(\cdot)$ at $\hat{x}_{k-1|k}$. The noncausal prediction is $\hat{\bar{x}}_{k|k} = E[f(x_{k-1}, w) \mid \{y_t\}_1^k, \hat{w}] \approx f(\hat{x}_{k-1|k}, w)$.

Finally, the necessary gradient terms in the dual EKF algorithm can be evaluated as follows:

$$\nabla_w^T \tilde{x}_{k|k} = -\nabla_w^T \hat{\bar{x}}_{k|k} = -\nabla_w^T f(\hat{x}_{k-1|k}, w), \quad (5.80)$$

which is evaluated at $\hat{w}_k^-$. The $i$th element of the gradient vector $\nabla_w \bar{p}_{k|k}$ is constructed from the expression

$$\frac{\partial \bar{p}_{k|k}}{\partial w^{(i)}} = C \left[ \frac{\partial A_{k-1|k}}{\partial w^{(i)}} (P_{k-1|k}) A_{k-1|k}^T + A_{k-1|k} (P_{k-1|k}) \frac{\partial A_{k-1|k}^T}{\partial w^{(i)}} \right] C^T, \quad (5.81)$$

and the elements of the gradient $\nabla_w p^y_{k|k}$ are given by

$$\frac{\partial p^y_{k|k}}{\partial w^{(i)}} = C \frac{\partial A_{k-1|k}}{\partial w^{(i)}} (P_k^{\sharp})^T C^T. \quad (5.82)$$

Note that all components in the cost function are recursive functions of $\hat{w}$, but not of $w$. Hence, no recurrent derivative computations are required for the EM algorithm. Furthermore, it can be shown that the actual error variances $\bar{p}_{k|k}$ and $p^y_{k|k}$ (not their gradients) cancel out in the formula for $C_k^w$, and thus should be replaced with large constant values to obtain a good Hessian approximation.
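The observed-error weight filter of Eqs. (5.66)-(5.68) reduces, for the scalar prediction-error case of Table 5.7, to a recursion of the following shape. This is only a sketch: for illustration, the raw observations stand in for the signal estimates, $R^e = 1$, and the forgetting factor and noise level are assumed values:

```python
import numpy as np

def weight_filter(y, lam=1.0, sigma_r2=1e-4):
    """Sketch of the observed-error weight EKF for a scalar AR(1) model
    x_k = w * x_{k-1}, with error term eps_k = y_k - w * y_{k-1}."""
    w, P = 0.0, 1.0
    for k in range(1, len(y)):
        P = P / lam + sigma_r2        # time update of the weight covariance
        eps = y[k] - w * y[k - 1]     # observed error (prediction error)
        C = y[k - 1]                  # C_k^w = d(-eps_k)/dw = y_{k-1}
        S = C * P * C + 1.0           # innovation variance with R^e = 1
        K = P * C / S                 # Kalman gain
        w = w + K * eps               # gain form of the update, Eq. (5.68)
        P = (1 - K * C) * P
    return w
```

Redefining `eps` and the matching `C` (e.g., stacking the two components of Table 5.5, or the maximum-likelihood terms of Table 5.6) changes which batch cost the same recursion minimizes; that is the whole point of the observed error formulation.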


5.3.4 Joint EKF

In the previous section, the dual EKF was modified to minimize the joint cost function. This implementation represented a decoupled type of approach, in which separate state-space representations were used to estimate $x_k$ and $w$. An alternative direct approach is given by the joint EKF, which generates simultaneous MAP estimates of $x_k$ and $w$. This is accomplished by defining a new joint state-space representation with concatenated state:

$$z_k = \begin{bmatrix} x_k \\ w_k \end{bmatrix}. \quad (5.83)$$

It is clear that maximizing the density $\rho_{z_k | y_1^k}$ is equivalent to maximizing $\rho_{x_k w | y_1^k}$. Hence, the MAP-optimal estimate of $z_k$ will contain the values of $x_k$ and $w_k$ that minimize the batch cost $J(x_1^k, w)$. Running the single EKF with this state vector provides a sequential estimation algorithm. The joint EKF first appeared in the literature for the estimation of linear systems (in which there is a bilinear relation between the states and weights) [1, 2]. The general equations for nonlinear systems are given in Table 5.9.

Note that because the gradient of $F(z)$ with respect to $w$ is taken with the other elements (namely, $\hat{x}_k$) fixed, it will not involve recursive derivatives of $\hat{x}_k$ with respect to $w$. This fact is cited in [3] and [6] as a potential source of convergence problems for the joint EKF. Additional results and citations in [5] corroborate the difficulties of the approach, although the cause of divergence is linked there to the linearization of the coupled system, rather than the lack of recurrent derivatives (note that no divergence problems were encountered in preparing the experimental results in this chapter). Although the use of recurrent derivatives is suggested in [3] and [6], there is no theoretical justification for this.
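For a scalar AR(1) model the joint state is $z_k = [x_k, w_k]^T$ with $F(z) = [w x, w]^T$, and the bilinearity between state and weight appears only in the linearization $A_k$. A hedged sketch of the joint EKF recursion (the parameter names and the small weight process noise `q_w` are assumptions, not values from the chapter):

```python
import numpy as np

def joint_ekf(y, sigma_v2, sigma_n2, q_w=1e-5):
    """Joint EKF sketch for a scalar AR(1): concatenated state z = [x, w]."""
    z = np.array([0.0, 0.0])
    P = np.eye(2)
    C = np.array([1.0, 0.0])                 # observe the signal component only
    for yk in y:
        xk, wk = z
        z_prior = np.array([wk * xk, wk])    # F(z) = [F(x, w), w]
        A = np.array([[wk, xk],              # linearization of F at z
                      [0.0, 1.0]])
        P = A @ P @ A.T + np.diag([sigma_v2, q_w])
        S = C @ P @ C + sigma_n2             # innovation variance
        K = P @ C / S                        # Kalman gain
        z = z_prior + K * (yk - z_prior[0])
        P = (np.eye(2) - np.outer(K, C)) @ P
    return z

rng = np.random.default_rng(3)
w_true, sigma_v2, sigma_n2 = 0.6, 0.04, 0.01
x = np.zeros(2000)
for k in range(1, 2000):
    x[k] = w_true * x[k - 1] + np.sqrt(sigma_v2) * rng.standard_normal()
y = x + np.sqrt(sigma_n2) * rng.standard_normal(2000)
z_hat = joint_ekf(y, sigma_v2, sigma_n2)
```

The off-diagonal entry $x_k$ in $A$ is what couples the weight estimate to the signal; note that, as the text states, it is a static linearization and carries no recurrent derivative of $\hat{x}_k$ with respect to $w$.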
In summary, the joint EKF provides an alternative to the dual EKF for sequential minimization of the joint cost function. Note that the joint EKF cannot be readily adapted to minimize the other cost functions discussed in this chapter.

5.4 DUAL EKF VARIANCE ESTIMATION

The implementation of the EKF requires the noise variances $\sigma_v^2$ and $\sigma_n^2$ as parameters in the algorithm. Often these can be determined from physical knowledge of the problem (e.g., sensor accuracy or ambient noise


Table 5.9  The joint extended Kalman filter equations (time-series case)

Initialize with

$$\hat{z}_0 = E[z_0], \quad (5.84)$$
$$P_0 = E[(z_0 - \hat{z}_0)(z_0 - \hat{z}_0)^T]. \quad (5.85)$$

For $k \in \{1, \ldots, \infty\}$, the time-update equations of the Kalman filter are

$$\hat{z}_k^- = F(\hat{z}_{k-1}), \quad (5.86)$$
$$P_k^- = A_{k-1} P_{k-1} A_{k-1}^T + V_k, \quad (5.87)$$

and the measurement-update equations are

$$K_k = P_k^- \tilde{C}^T (\tilde{C} P_k^- \tilde{C}^T + \sigma_n^2)^{-1}, \quad (5.88)$$
$$\hat{z}_k = \hat{z}_k^- + K_k (y_k - \tilde{C} \hat{z}_k^-), \quad (5.89)$$
$$P_k = (I - K_k \tilde{C}) P_k^-, \quad (5.90)$$

where

$$V_k \triangleq \mathrm{Cov}\begin{bmatrix} B v_k \\ u_k \end{bmatrix} = \begin{bmatrix} B \sigma_v^2 B^T & 0 \\ 0 & R^r \end{bmatrix}, \quad (5.91)$$

$$F(z_{k-1}) \triangleq \begin{bmatrix} F(x_{k-1}, w_{k-1}) \\ w_{k-1} \end{bmatrix}, \qquad \tilde{C} \triangleq [C \;\; 0 \;\cdots\; 0], \quad (5.92)$$

$$A_k \triangleq \frac{\partial F(z)}{\partial z}\bigg|_{z = \hat{z}_k} = \begin{bmatrix} \dfrac{\partial f(\hat{x}_k, w)^T}{\partial x} & \dfrac{\partial f(\hat{x}_k, w)^T}{\partial w} \\ I \;\; 0 & 0 \\ 0 & I \end{bmatrix}. \quad (5.93)$$

measurements). However, if the variances are unknown, their values can be estimated within the dual EKF framework using cost functions similar to those derived in Section 5.3. A full treatment of the variance estimation filter is presented in [22]; here we focus on the maximum-likelihood cost function

$$J_{ml}(\sigma^2) = \sum_{k=1}^N \left[ \log(2\pi\sigma_{e_k}^2) + \frac{(y_k - \bar{y}_{k|k-1})^2}{\sigma_{e_k}^2} \right]. \quad (5.94)$$

Note that this cost function is identical to the weight-estimation cost function, except that the argument is changed to emphasize the estimation of the unknown variances.


As with the weight filter, a modified-Newton algorithm can be found for each variance by using an observed error form of the Kalman filter and modeling the variances as

$$\sigma_{k+1}^2 = \sigma_k^2 + r_k, \quad (5.95)$$
$$0 = -\epsilon_k + e_k, \quad (5.96)$$

which gives a one-dimensional state-space representation. However, a peculiar difficulty in the estimation of variances is that these quantities must be positive-valued. Because this constraint is not built explicitly into the cost functions, it is conceivable that negative values can be obtained. One solution to this problem (inspired by [41]) is to estimate $\lambda \triangleq \log(\sigma^2)$ instead. Negative values of $\lambda$ map to small positive values of $\sigma^2$, and $\lambda = -\infty$ maps to $\sigma^2 = 0$. The logarithm is a monotonic function, so a one-to-one mapping exists between the optimal value of $\lambda$ and the optimal value of $\sigma^2$. An additional benefit of the logarithmic function is that it expands the dynamic range near $\sigma^2 = 0$, where the solution is more likely to reside; this can improve the numerical properties of the optimization. Of course, this new formulation requires computing the gradients and Hessians of the cost $J$ with respect to $\lambda$. Fortunately, the change is fairly straightforward; these expressions are simple functions of the derivatives with respect to $\sigma^2$. The variance estimation filter is given in Table 5.10. Note that the dimension of the state space is 1 in the case of variance estimation, while the observation $\epsilon_k$ is generally multidimensional. For this reason, the covariance form of the KF is more efficient than the forms shown earlier for signal or weight estimation, which employ the matrix inversion lemma and use a Kalman gain term.
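The effect of the $\lambda = \log(\sigma^2)$ reparameterization can be seen in a few lines: gradients with respect to $\sigma^2$ map through the chain rule, and every iterate stays positive by construction. A toy quadratic cost stands in for $J$ here; the function name and step size are illustrative assumptions:

```python
import numpy as np

def lam_step(lam, grad_sigma2, step=0.1):
    """One gradient step in lam = log(sigma^2); the chain rule gives
    dJ/dlam = sigma^2 * dJ/dsigma^2, and exp() maps back to a positive variance."""
    grad_lam = np.exp(lam) * grad_sigma2
    lam = lam - step * grad_lam
    return lam, np.exp(lam)

# Minimize the toy cost J(sigma^2) = (sigma^2 - 0.5)^2 in the lam domain.
lam = np.log(1.0)
for _ in range(300):
    sigma2 = np.exp(lam)
    lam, sigma2 = lam_step(lam, 2.0 * (sigma2 - 0.5))
```

No projection or clipping is ever needed: a step of any size in $\lambda$ still yields $\sigma^2 = e^{\lambda} > 0$, which is exactly why the variance filter works in the log domain.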
This form of the variance filter is used in the experiments in the next section.

In Table 5.10, the mean $\bar{y}_{k|k-1}$ and variance $\sigma_{e_k}^2$ are computed in the same way as before (see Eq. (5.75)), except that the unknown variance $\sigma^2$ is now an additional conditioning argument in the expectations. Hence, the derivatives are

$$\frac{\partial \bar{y}_{k|k-1}}{\partial \sigma^2} = C \frac{\partial \hat{x}_k^-}{\partial \sigma^2}, \quad (5.103)$$

$$\frac{\partial \sigma_{e_k}^2}{\partial \sigma^2} = \frac{\partial \sigma_n^2}{\partial \sigma^2} + \frac{\partial P_k^-(1,1)}{\partial \sigma^2}, \quad (5.104)$$

where $P_k^-(1,1)$ is the upper left element of the matrix, and $\partial \sigma_n^2 / \partial \sigma^2$ is either 1 or 0, depending on whether $\sigma^2$ is $\sigma_n^2$ or $\sigma_v^2$. The other derivatives


Table 5.10  Variance estimation filter of the dual EKF

Initialize with

$$\hat{\sigma}_0^2 = E[\sigma^2], \qquad p_{\sigma_0} = E[(\sigma^2 - \hat{\sigma}_0^2)(\sigma^2 - \hat{\sigma}_0^2)^T].$$

For $k \in \{1, \ldots, \infty\}$, the time-update equations for the variance filter are

$$\hat{\sigma}_k^{2-} = \hat{\sigma}_{k-1}^2, \quad (5.97)$$
$$p_{\sigma_k}^- = p_{\sigma_{k-1}} + \sigma_u^2, \qquad \sigma_u^2 = (\lambda^{-1} - 1)\, p_{\sigma_{k-1}}, \quad (5.98)$$

and the measurement-update equations are

$$p_{\sigma_k} = \left[ (p_{\sigma_k}^-)^{-1} + (C_k^\sigma)^T (R^e)^{-1} C_k^\sigma\, (\hat{\sigma}_k^{2-})^2 + (C_k^\sigma)^T (R^e)^{-1} \epsilon_k\, \hat{\sigma}_k^{2-} \right]^{-1}, \quad (5.99)$$
$$\hat{\lambda}_k^- = \log(\hat{\sigma}_k^{2-}), \quad (5.100)$$
$$\hat{\lambda}_k = \hat{\lambda}_k^- + p_{\sigma_k} (C_k^\sigma)^T (R^e)^{-1} \epsilon_k\, \hat{\sigma}_k^{2-}, \quad (5.101)$$
$$\hat{\sigma}_k^2 = e^{\hat{\lambda}_k}. \quad (5.102)$$

The observed error term and gradient are

$$\epsilon_k \triangleq \begin{bmatrix} (\lambda_{e,k})^{1/2} \\ \sigma_{e_k}^{-1} e_k \end{bmatrix}, \qquad C_k^\sigma = \begin{bmatrix} \dfrac{(\lambda_{e,k})^{-1/2}}{2\, \sigma_{e_k}^2} \dfrac{\partial \sigma_{e_k}^2}{\partial \sigma^2} \\[2mm] \dfrac{1}{\sigma_{e_k}} \dfrac{\partial e_k}{\partial \sigma^2} + \dfrac{e_k}{2 (\sigma_{e_k}^2)^{3/2}} \dfrac{\partial \sigma_{e_k}^2}{\partial \sigma^2} \end{bmatrix},$$

where $\sigma^2$ represents either $\sigma_n^2$ or $\sigma_v^2$, and where

$$\frac{\partial e_k}{\partial \sigma^2} = -\frac{\partial \bar{y}_{k|k-1}}{\partial \sigma^2}.$$

are computed by taking the derivative of the Kalman filter equations with respect to either variance term (represented by $\sigma^2$). This results in the following system of recursive equations:

$$\frac{\partial \hat{x}_{k+1}^-}{\partial \sigma^2} = \frac{\partial F(\hat{x}, \hat{w})}{\partial \hat{x}_k} \frac{\partial \hat{x}_k}{\partial \sigma^2}, \quad (5.105)$$

$$\frac{\partial \hat{x}_k}{\partial \sigma^2} = (I - K_k^x C) \frac{\partial \hat{x}_k^-}{\partial \sigma^2} + \frac{\partial K_k^x}{\partial \sigma^2} (y_k - C \hat{x}_k^-), \quad (5.106)$$

where $\partial F(\hat{x}, \hat{w}) / \partial \hat{x}_k$ is evaluated at $\hat{w}_k$, and represents a static linearization of the neural network. Note that $[\partial F(\hat{x}, \hat{w}) / \partial \hat{w}_k][\partial \hat{w} / \partial \sigma^2]$ does not appear in Eq. (5.105), under the assumption that $\partial \hat{w} / \partial \sigma^2 = 0$. The last term in


Eq. (5.106) may be dropped if we assume that the Kalman gain $K^x$ is independent of $\sigma^2$. However, for accurate computation of the recursive derivatives, $\partial K_k^x / \partial \sigma^2$ must be calculated; this is shown along with the derivative $\partial P_k^-(1,1) / \partial \sigma^2$ in Appendix A.

5.5 APPLICATIONS

In this section, we present the results of using the dual EKF methods on a number of different applications.

5.5.1 Noisy Time-Series Estimation and Prediction

For the first example, we return to the noisy time-series example of Section 5.2. Figure 5.4 compares the performance of the various dual EKF and joint EKF methods presented in this chapter. Box plots are used to show the mean, median, and quartile values based on 10 different runs (note that the higher mean for the maximum-likelihood cost is a consequence of a single outlier). The figure also compares performances for both known variances and estimated variances. The results in the example are fairly consistent with our findings on a number of controlled


experiments using different time series and parameter settings [22]. For both white and colored noises, the maximum-likelihood weight-estimation cost $J_{ml}(w)$ generally produces the best results, but often exhibits instability. This fact is analyzed in [22], and reduces the desirability of using this cost function. The joint cost function $J_j(w)$ has better stability properties and produces excellent results for colored measurement noise. For white noise, the prediction error cost performs nearly as well as $J_{ml}(w)$, but without stability problems. Hence, the dual EKF cost functions $J_{pe}(w)$ and $J_j(w)$ are generally the best choices for white and colored measurement noise, respectively. The joint EKF and dual EKF perform similarly in many cases, although the joint EKF is considerably less robust to inaccuracies in the assumed model structure and noise variances.

Chaotic Hénon Map   As a second time-series example, we consider modeling the long-term behavior of the chaotic Hénon map:

$$a_{k+1} = 1 - 1.4 a_k^2 + b_k, \quad (5.107)$$
$$b_{k+1} = 0.3 a_k. \quad (5.108)$$

To obtain a one-dimensional time series for the following experiment, the signal is defined as $x_k = a_k$. The phase plot of $x_{k+1}$ versus $x_k$ (Fig. 5.5a) shows the chaotic attractor. A neural network (5-7-1) can easily be trained as a single-step predictor on this clean signal. The network is then iterated, feeding back the predictions of the network as future inputs, to produce the attractor shown in Figure 5.5b. However, if the signal is corrupted by white noise at 10 dB SNR (Fig. 5.5c), and a neural network with the same architecture is trained on these noisy data, the dynamics are not adequately captured.
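The experiment's data can be generated directly from Eqs. (5.107) and (5.108); the 10 dB SNR corruption corresponds to white noise whose variance is one-tenth of the signal power. A sketch (function names and the seed are illustrative):

```python
import numpy as np

def henon_series(N, a=1.4, b=0.3):
    """Iterate the Henon map of Eqs. (5.107)-(5.108) and return x_k = a_k."""
    ak, bk = 0.0, 0.0
    xs = np.empty(N)
    for k in range(N):
        ak, bk = 1.0 - a * ak**2 + bk, b * ak   # simultaneous update
        xs[k] = ak
    return xs

rng = np.random.default_rng(0)
x = henon_series(1000)
noise_var = np.var(x) / 10.0                    # 10 dB SNR
y = x + np.sqrt(noise_var) * rng.standard_normal(len(x))
```

A scatter of `x[1:]` against `x[:-1]` reproduces the clean attractor of Figure 5.5a, while the same plot for `y` shows the smeared cloud of Figure 5.5c.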
The iterated predictions exhibit limit-cycle behavior with far less complexity.

In contrast, using the dual EKF to train the neural network on the noisy data captures significantly more of the chaotic dynamics, as shown in Figure 5.5d. Here, $J_{pe}(w)$ is used for weight estimation, and the maximum-likelihood cost is used for estimating $\sigma_v^2$. The measurement-noise variance is assumed to be known. Parameter covariances are initialized at 0.1, and the initial signal covariance is $P_{x_0} = I$. Forgetting factors are $\lambda_w = 0.9999$ and $\lambda_{\sigma_v^2} = 0.9993$. Although the attractor is not reproduced with total fidelity, its general structure has been extracted from the noisy data.

This example also illustrates an interesting interpretation of dual EKF prediction training. During the training process, estimates from the output of the predictor are fed back as inputs, which are optimally


weighted by the Kalman gain with the next noisy observation. The effective recurrent learning that takes place is analogous to compromise methods [42], which use the same principle of combining new observations with output predictions during training, in order to improve the robustness of iterated performance.

Figure 5.5  Phase plots of $x_{k+1}$ versus $x_k$ for the original Hénon series (a), the series generated by a neural network trained on $x_k$ (b), the series generated by a neural network trained on $y_k$ (c), and the series generated by a neural network trained on $y_k$ using the dual EKF (d).

5.5.2 Economic Forecasting – Index of Industrial Production

Economic and financial data are inherently noisy. As an example, we consider the index of industrial production (IP). As with most macro-


156 5 DUAL EXTENDED KALMAN FILTER METHODSeconomic series, the IP is a composite in<strong>de</strong>x of many different economicindicators, each of which is generally measured by a survey of some kind.The monthly IP data are shown in Figure 5.6a for January 1950 to January1990. To remove the trend, the differences between the log 2 values forFigure 5.6 (a ) In<strong>de</strong>x of industrial production (IP) in the United States fromJanuary 1940 through March 2000. Data available from Fe<strong>de</strong>ral Reserve[43]. (b) Monthly rate of return of the IP in the United States, 1950–1990. (c )The dual KF prediction for a typical run (middle), along with the signalestimates (dotted line). (d ) The prediction residual.


adjacent months are computed; this is called the IP monthly rate of return, and is shown in Figure 5.6b.

An important baseline approach is to predict the IP from its past values, using a standard linear autoregressive model. Results with an AR-14 model are reported by Moody et al. [44]. For comparison, both a linear AR-14 model and a neural network (14 inputs, 4 hidden units) model are tested using the dual EKF methods. Consistent with experiments reported in [44], data from January 1950 to December 1979 are used for a training set, and the remainder of the data is reserved for testing. The dual KF (or dual EKF) is iterated over the training set for several epochs, and the resultant model, consisting of $\hat{\mathbf{w}}$, $\hat{\sigma}_v^2$, and $\hat{\sigma}_n^2$, is used with a standard KF (or EKF) to produce causal predictions on the test set.

All experiments are repeated 10 times with different initial weights $\hat{\mathbf{w}}_0$ to produce the boxplots in Figure 5.7a. The advantage of dual estimation on linear models in comparison to the benchmark AR-14 model trained with least-squares (LS) is clear. For the nonlinear model, overtraining is a serious concern, because the algorithm is being run repeatedly over a very short training set (only 360 points). This effect is shown in Figure 5.7. Based on experience with other types of data (which may not have been optimal in this case), all runs were halted after only 5 training epochs. Nevertheless, the dual EKF with the $J_j(\mathbf{w})$ cost produces significantly better results.

Although better results are reported on this problem in [44] using models with external inputs from other series, the dual EKF single-time-series results are quite competitive.
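The detrending transform used for the IP series (differencing the $\log_2$ values of adjacent months to obtain the monthly rate of return of Fig. 5.6b) is a one-line operation. This sketch is our own; the index values in the check are hypothetical, not the actual FRED data:

```python
import numpy as np

def monthly_rate_of_return(index_values):
    """Difference the log2 values of adjacent months: the detrending
    used to turn the raw IP index into its monthly rate of return."""
    v = np.asarray(index_values, dtype=float)
    return np.diff(np.log2(v))
```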
Future work to incorporate additional inputs would, of course, be a straightforward extension within the dual EKF framework.

5.5.3 Speech Enhancement

Speech enhancement is concerned with the processing of noisy speech in order to improve the quality or intelligibility of the signal. Applications range from front-ends for automatic speech recognition systems, to telecommunications in aviation, military, teleconferencing, and cellular environments. While there exists a broad array of traditional enhancement techniques (e.g., spectral subtraction, signal-subspace embedding, time-domain iterative approaches [45]), such methods frequently result in audible distortion of the signal, and are somewhat unsatisfactory in real-world noisy environments. Different neural-network-based approaches are reviewed in [45].
Figure 5.7 (a) Boxplots of the prediction NMSE on the test set (1980-1990). (b) and (c) show the average convergence behavior of linear and neural network model structures, respectively.

The dual estimation algorithms presented in this chapter have the advantage of generating estimates using only the noisy signal itself. To address its nonstationarity, the noisy speech is segmented into short overlapping windows. The dual estimation algorithms are then iterated over each window to generate the signal estimate. Parameters learned from one window are used to initialize the algorithm for the next window. Effectively, a sequence of time-series models is trained on the specific
noisy speech signal of interest, resulting in a nonstationary model that can be used to remove noise from the given signal.

A number of controlled experiments using 8 kHz sampled speech have been performed in order to compare the different algorithms (joint EKF versus dual EKF with various costs). It was generally concluded that the best results were obtained with the dual EKF with the $J_{ml}(\mathbf{w})$ cost, using a 10-4-1 neural network (versus a linear model), and a window length of 512 samples (with an overlap of 64 points). Preferred nominal settings were found to be: $P_{x_0} = I$, $P_{w_0} = 0.01I$, $p_{\sigma_0} = 0.1$, $\lambda_w = 0.9997$, and $\lambda_{\sigma_v^2} = 0.9993$. The process-noise variance $\sigma_v^2$ is estimated with the dual EKF using $J_{ml}(\sigma_v^2)$, and is given a lower limit (e.g., $10^{-8}$) to avoid potential divergence of the filters during silence periods. While, in practice, the additive-noise variance $\sigma_n^2$ could be estimated as well, we used the common procedure of estimating it from the start of the recording (512 points), where no speech is assumed present. In addition, linear AR-12 (or AR-10) filters were used to model colored additive noise. Using the settings found from these controlled experiments, several enhancement applications are reviewed below.

SpEAR Database   The dual EKF algorithm was applied to a portion of CSLU's Speech Enhancement Assessment Resource (SpEAR [47]). As opposed to artificially adding noise, the database is constructed by acoustically combining prerecorded speech (e.g., TIMIT) and noise (e.g., the SPIB database [48]). Synchronous playback and recording in a room is used to provide exact time-aligned references to the clean speech, such that objective measures can still be calculated. Table 5.11 presents sample results in terms of average segmental SNR.
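The average segmental SNR reported in Table 5.11 can be computed from time-aligned clean and estimated waveforms as defined in footnote 4 (240-point Hanning-weighted frames, each frame's SNR floored at -10 dB). The non-overlapping frame hop and the small numerical guard on the denominator below are our own assumptions:

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len=240, floor_db=-10.0):
    """Average segmental SNR in dB over Hanning-weighted frames,
    with each frame's SNR floored at floor_db (footnote 4)."""
    clean = np.asarray(clean, dtype=float)
    err = clean - np.asarray(estimate, dtype=float)
    win = np.hanning(frame_len)
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len] * win
        e = err[start:start + frame_len] * win
        num, den = np.sum(c ** 2), np.sum(e ** 2)
        snr_db = 10.0 * np.log10(num / max(den, 1e-12))
        snrs.append(max(snr_db, floor_db))   # floor noise-only frames
    return float(np.mean(snrs))
```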
Car Phone Speech   In this example, the dual EKF is used to process an actual recording of a woman talking on her cellular telephone while driving on the highway. The signal contains a significant level of road and engine noise, in addition to the distortion introduced by the telephone channel. The results appear in Figure 5.8, along with the noisy signal.

4 Segmental SNR is considered to be a more perceptually relevant measure than standard SNR, and is computed as the average of the SNRs computed within 240-point windows, or frames, of speech: $\mathrm{SSNR} = (\#\,\mathrm{frames})^{-1} \sum_i \max(\mathrm{SNR}_i, -10\ \mathrm{dB})$. Here, $\mathrm{SNR}_i$ is the SNR of the $i$th frame (weighted by a Hanning window), which is thresholded from below at $-10$ dB. The thresholding reduces the contribution of portions of the series where no speech is present (i.e., where the SNR is strongly negative) [49], and is expected to improve the measure's perceptual relevance.
Table 5.11 Dual EKF enhancement results using a portion of the SpEAR database(a)

            Male voice (segmental SNR)      Female voice (segmental SNR)
Noise       Before   After   Static         Before   After   Static
F-16        2.27     2.65    1.69           0.16     4.51    3.46
Factory     1.63     2.58    2.48           1.07     4.19    4.24
Volvo       1.60     5.60    6.42           4.10     6.78    8.10
Pink        2.59     1.44    1.06           0.23     4.39    3.54
White       1.35     2.87    2.68           1.05     4.96    5.05
Bursting    1.60     5.05    4.24           7.82     9.36    9.61

(a) Different noise sources are used for the same male and female speaker. All results are in dB, and represent the segmental SNR averaged over the length of the waveform. Results labeled "static" were obtained using the static approximation to the derivatives. For reference, in this range of values, an improvement of 3 dB in segmental SNR corresponds to approximately an improvement of 5 dB in normal SNR.

Spectrograms of both the noisy speech and estimated speech are included to aid in the comparison. The noise reduction is most successful in nonspeech portions of the signal, but is also apparent in the visibility of formants of the estimated signal, which are obscured in the noisy signal. The perceptual quality of the result is quite good, with an absence of the "musical noise" artifacts often present in spectral subtraction results.

Seminar Recording   The next example comes from an actual recording made of a lecture at the Oregon Graduate Institute. In this instance, the audio recording equipment was configured improperly, resulting in a very loud buzzing noise throughout the entire recording. The noise has a fundamental frequency of 60 Hz (indicating that improper grounding was the likely culprit), but many other harmonics and frequencies are present as well, owing to some additional nonlinear clipping.
As suggested by Figure 5.9, the SNR is extremely low, making for an unusually difficult audio enhancement problem.

Digit Recognition   As the final example, we consider the application of speech enhancement for use as a front-end to automatic speech recognition (ASR) systems. The effectiveness of the dual EKF in this

5 The authors wish to thank Edward Kaiser for his invaluable assistance in this experiment.
Figure 5.8 Enhancement of car phone speech. The noisy waveform appears in (a), with its spectrogram in (b). The spectrogram and waveform of the dual EKF result are shown in (c) and (d), respectively. To make the spectrograms easier to view, the spectral tilt is removed, and their histograms are equalized according to the range of intensities of the enhanced speech spectrogram.

application is demonstrated using a speech corpus and ASR system(5) developed at the Oregon Graduate Institute's Center for Spoken Language Understanding (CSLU). The speech corpus consists of zip-codes, addresses, and other digits read over the telephone by various people; the
Figure 5.9 Enhancement of high-noise seminar recording. The noisy waveform appears in (a), with its spectrogram in (b). The spectrogram and waveform of the dual EKF result are shown in (c) and (d), respectively.

ASR system is a speaker-independent digit recognizer, trained exclusively to recognize numbers from zero to nine when read over the phone.

A subset of 599 sentences was used in this experiment. As seen in Table 5.12, the recognition rates on the clean telephone speech are quite good. However, adding white Gaussian noise to the speech at 6 dB significantly reduces the performance. In addition to the dual EKF, a standard spectral subtraction routine and an enhancement algorithm built into the speech codec TIA/EIA/IS-718 for digital cellular phones (published by the
Table 5.12 Automatic speech recognition rates for clean recordings of telephone speech (spoken digits), as compared with the same speech corrupted by white noise, and subsequently processed by spectral subtraction (SSUB), a cellular phone enhancement standard (IS-718), and the dual EKF

            Correct words    Correct sentences
Clean       96.37%           85.81% (514/599)
Noisy       59.21%           21.37% (128/599)
SSUB        77.45%           38.06% (228/599)
IS-718      67.32%           29.22% (175/599)
Dual EKF    82.19%           52.92% (317/599)

Telecommunications Industry Association) was used for comparison. As shown by Table 5.12, the dual EKF outperforms both the IS-718 and spectral subtraction recognition rates by a significant amount.

5.6 CONCLUSIONS

This chapter has detailed a unified approach to dual estimation based on a maximum a posteriori perspective. By maximizing the joint conditional density $\rho_{x_1^N, \mathbf{w} | y_1^N}$, the most probable values of the signal and parameters are sought, given the noisy observations. This probabilistic perspective elucidates the relationships between various dual estimation methods proposed in the literature, and allows their categorization in terms of methods that maximize the joint conditional density function directly, and those that maximize a related marginal conditional density function.

Cost functions associated with the joint and marginal densities have been derived under a Gaussian assumption. This approach offers a number of insights about previously developed methods.
For example, the prediction error cost is viewed as an approximation to the maximum-likelihood cost; moreover, both are classified as marginal estimation cost functions. Thus, the recursive prediction error method of [5] and [6] is quite different from the joint EKF approach of [1] and [2], which minimizes a joint estimation cost.(6) Furthermore, the joint EKF and errors-in-variables algorithms are shown to offer two different ways of minimizing the same joint cost function; one is a sequential method and the other is iterative.

The dual EKF algorithm has been presented, which uses two extended Kalman filters run concurrently, one for state estimation and one for

6 This fact is overlooked in [6], which emphasizes the similarity of these two algorithms.
weight estimation. By modification of the weight filter into an observed-error form, it is possible to minimize each of the cost functions that are developed.(7) This provided a common algorithmic platform for the implementation of a broad variety of methods. In general, the dual EKF algorithm represents a sequential approach, which is applicable to both linear and nonlinear models, and which can be used in the presence of white or colored measurement noise. In addition, the algorithm has been extended to provide estimation of noise variance parameters within the same theoretical framework; this contribution is crucial in applications for which this information is not otherwise available.

Finally, a number of examples have been presented to illustrate the performance of the dual EKF methods. The ability of the dual EKF to capture the underlying dynamics of a noisy time series has been illustrated using the chaotic Hénon map. The application of the algorithm to the IP series demonstrates its potential in a real-world prediction context. On speech enhancement problems, the lack of musical noise in the enhanced speech underscores the advantages of a time-domain approach; the usefulness of the dual EKF as a front-end to a speech recognizer has also been demonstrated. In general, the state-space formulation of the algorithm makes it applicable to a much wider variety of contexts than has been explored here.
The intent of this chapter was to show the utility of the dual EKF as a fundamental method for solving a range of problems in signal processing and modeling.

ACKNOWLEDGMENTS

This work was sponsored in part by the NSF under Grants ECS-0083106 and IRI-9712346.

APPENDIX A: RECURRENT DERIVATIVE OF THE KALMAN GAIN

(1) With Respect to the Weights

In the state-estimation filter, the derivative of the Kalman gain with respect to the weights $\mathbf{w}$ is computed as follows. Denoting the derivative of $K_k^x$

7 Note that Kalman algorithms are approximate MAP optimization procedures for nonlinear systems. Hence, future work considers alternative optimization procedures (e.g., unscented Kalman filters [29]), which can still be cast within the same theoretically motivated dual estimation framework.
APPENDIX A 165

with respect to the $i$th element of $\hat{\mathbf{w}}$ by $\partial K_k^x/\partial \hat{w}(i)$ (the $i$th column of $\partial K_k^x/\partial \hat{\mathbf{w}}$) gives

$$\frac{\partial K_k^x}{\partial \hat{w}(i)} = \frac{I - K_k^x C}{C P_{x_k}^- C^T + \sigma_n^2}\,\frac{\partial P_{x_k}^-}{\partial \hat{w}(i)}\,C^T, \qquad (5.109)$$

where the derivatives of the error covariances are

$$\frac{\partial P_{x_k}^-}{\partial \hat{w}(i)} = \frac{\partial A_{k-1}}{\partial \hat{w}(i)}\,P_{x_{k-1}} A_{k-1}^T + A_{k-1}\,\frac{\partial P_{x_{k-1}}}{\partial \hat{w}(i)}\,A_{k-1}^T + A_{k-1} P_{x_{k-1}}\,\frac{\partial A_{k-1}^T}{\partial \hat{w}(i)}, \qquad (5.110)$$

$$\frac{\partial P_{x_{k-1}}}{\partial \hat{w}(i)} = -\frac{\partial K_{k-1}^x}{\partial \hat{w}(i)}\,C P_{x_{k-1}}^- + (I - K_{k-1}^x C)\,\frac{\partial P_{x_{k-1}}^-}{\partial \hat{w}(i)}. \qquad (5.111)$$

Note that $A_{k-1}$ depends not only on the weights $\hat{\mathbf{w}}$, but also on the point of linearization, $\hat{x}_{k-1}$. Therefore,

$$\frac{\partial A_{k-1}}{\partial \hat{w}(i)} = \frac{\partial^2 F}{\partial \hat{x}_{k-1}\,\partial \hat{w}(i)} + \frac{\partial^2 F}{(\partial \hat{x}_{k-1})^2}\,\frac{\partial \hat{x}_{k-1}}{\partial \hat{w}(i)}, \qquad (5.112)$$

where the first term is the static derivative of $A_{k-1} = \partial F/\partial x_{k-1}$ with $\hat{x}_{k-1}$ fixed, and the second term includes the recurrent derivative of $\hat{x}_{k-1}$. The term $\partial^2 F/(\partial \hat{x}_{k-1})^2$ actually represents a three-dimensional tensor (rather than a matrix), and care must be taken with this calculation. However, when $A_{k-1}$ takes on a sparse structure, as with time-series applications, its derivative with respect to $x$ contains mostly zeroes, and is in fact entirely zero for linear models.

(2) With Respect to the Variances

In the variance-estimation filter, the derivatives $\partial K_k^x/\partial \sigma^2$ may be calculated as follows:

$$\frac{\partial K_k^x}{\partial \sigma^2} = \frac{I - K_k^x C}{C P_{x_k}^- C^T + \sigma_n^2}\,\frac{\partial P_{x_k}^-}{\partial \sigma^2}\,C^T, \qquad (5.113)$$
where

$$\frac{\partial P_{x_k}^-}{\partial \sigma^2} = \frac{\partial A_{k-1}}{\partial \sigma^2}\,P_{x_{k-1}} A_{k-1}^T + A_{k-1}\,\frac{\partial P_{x_{k-1}}}{\partial \sigma^2}\,A_{k-1}^T + A_{k-1} P_{x_{k-1}}\,\frac{\partial A_{k-1}^T}{\partial \sigma^2}, \qquad (5.114)$$

$$\frac{\partial P_{x_{k-1}}}{\partial \sigma^2} = -\frac{\partial K_{k-1}^x}{\partial \sigma^2}\,C P_{x_{k-1}}^- + (I - K_{k-1}^x C)\,\frac{\partial P_{x_{k-1}}^-}{\partial \sigma^2}. \qquad (5.115)$$

Because $A_{k-1}$ depends on the linearization point, $\hat{x}_{k-1}$, its derivative is

$$\frac{\partial A_{k-1}}{\partial \sigma^2} = \frac{\partial A_{k-1}}{\partial \hat{x}_{k-1}}\,\frac{\partial \hat{x}_{k-1}}{\partial \sigma^2}, \qquad (5.116)$$

where again the derivative $\partial \hat{\mathbf{w}}/\partial \sigma^2$ is assumed to be zero.

APPENDIX B: DUAL EKF WITH COLORED MEASUREMENT NOISE

In this appendix, we give dual EKF equations for additive colored noise. Colored noise is modeled as a linear AR process:

$$n_k = \sum_{i=1}^{M_n} a_n^{(i)} n_{k-i} + v_{n,k}, \qquad (5.128)$$

where the parameters $a_n^{(i)}$ are assumed to be known, and $v_{n,k}$ is a white Gaussian process with (possibly unknown) variance $\sigma_{v_n}^2$. The noise $n_k$ can now be thought of as a second signal added to the first, but with the distinction that it has been generated by a known system. Note that the constraint $y_k = x_k + n_k$ requires that the estimates $\hat{x}_k$ and $\hat{n}_k$ must also sum to $y_k$. To enforce this constraint, both the signal and noise are incorporated into a combined state-space representation:

$$\xi_k = \begin{bmatrix} \mathbf{x}_k \\ \mathbf{n}_k \end{bmatrix} = F_c(\xi_{k-1}, \mathbf{w}, \mathbf{a}_n) + B_c v_{c,k} \qquad (5.129)$$

$$= \begin{bmatrix} F(\mathbf{x}_{k-1}, \mathbf{w}) \\ A_n \mathbf{n}_{k-1} \end{bmatrix} + \begin{bmatrix} B & 0 \\ 0 & B_n \end{bmatrix} \begin{bmatrix} v_k \\ v_{n,k} \end{bmatrix}, \qquad (5.130)$$

$$y_k = C_c \xi_k, \qquad y_k = [\,C \ \ C_n\,] \begin{bmatrix} \mathbf{x}_k \\ \mathbf{n}_k \end{bmatrix},$$
where

$$A_n = \begin{bmatrix} a_n^{(1)} & a_n^{(2)} & \cdots & a_n^{(M_n)} \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad C_n = B_n^T = [\,1 \ \ 0 \ \cdots \ 0\,].$$

The effective measurement noise is zero, and the process noise $v_{c,k}$ is white, as required, with covariance

$$V_c = \begin{bmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_{v_n}^2 \end{bmatrix}.$$

Because $n_k$ can be viewed as a second signal, it should be estimated on an equal footing with $x_k$. Consider, therefore, maximizing $\rho_{x_1^N, n_1^N, \mathbf{w} | y_1^N}$ (where $n_1^N$ is a vector comprising the elements in $\{n_k\}_1^N$) instead of $\rho_{x_1^N, \mathbf{w} | y_1^N}$. We can write this term as

$$\rho_{x_1^N, n_1^N, \mathbf{w} | y_1^N} = \frac{\rho_{y_1^N, x_1^N, n_1^N | \mathbf{w}}\ \rho_{\mathbf{w}}}{\rho_{y_1^N}}, \qquad (5.131)$$

and (in the absence of prior information about $\mathbf{w}$) maximize $\rho_{y_1^N, x_1^N, n_1^N | \mathbf{w}}$ alone. As before, the cost functions that result from this approach can be categorized as joint or marginal costs, and their derivations are similar to those for the white-noise case. The associated dual EKF algorithm for colored noise is given in Table 5.13. Minimization of the different cost functions is again achieved by simply redefining the error term. These modifications are presented without derivation below.

Joint Estimation Forms

The corresponding weight cost function and error terms for a decoupled approach are given in Table 5.14.
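The companion-form matrices $A_n$, $C_n$, and $B_n$ above are mechanical to build from the AR coefficients $a_n^{(i)}$. A small sketch (our own helper; the AR(2) coefficients in the check are arbitrary):

```python
import numpy as np

def noise_state_space(a_n):
    """Companion-form state-space matrices for the AR noise model
    n_k = sum_i a_n[i] * n_{k-i} + v_{n,k} (Appendix B)."""
    M = len(a_n)
    A_n = np.zeros((M, M))
    A_n[0, :] = a_n                  # first row holds the AR coefficients
    A_n[1:, :-1] = np.eye(M - 1)     # shift delayed noise samples down
    C_n = np.zeros((1, M))
    C_n[0, 0] = 1.0                  # observation picks out n_k
    B_n = C_n.T                      # process noise enters the first state
    return A_n, C_n, B_n
```

Iterating the companion matrix reproduces the scalar AR recursion, which is what the combined state-space construction relies on.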
Table 5.13 The dual extended Kalman filter equations for colored measurement noise. The definitions of $\epsilon_k$ and $C_k^w$ will depend on the cost function used for weight estimation

Initialize with

$$\hat{\mathbf{w}}_0 = E[\mathbf{w}], \qquad P_{w_0} = E[(\mathbf{w} - \hat{\mathbf{w}}_0)(\mathbf{w} - \hat{\mathbf{w}}_0)^T],$$
$$\hat{\xi}_0 = E[\xi_0], \qquad P_{\xi_0} = E[(\xi_0 - \hat{\xi}_0)(\xi_0 - \hat{\xi}_0)^T].$$

For $k \in \{1, \ldots, \infty\}$, the time-update equations for the weight filter are

$$\hat{\mathbf{w}}_k^- = \hat{\mathbf{w}}_{k-1}, \qquad (5.117)$$
$$P_{w_k}^- = P_{w_{k-1}} + R_{k-1}^r, \qquad (5.118)$$

and those for the signal filter are

$$\hat{\xi}_k^- = F_c(\hat{\xi}_{k-1}, \hat{\mathbf{w}}_k^-), \qquad (5.119)$$
$$P_{\xi_k}^- = A_{k-1} P_{\xi_{k-1}} A_{k-1}^T + B_c R^{v_c} B_c^T. \qquad (5.120)$$

The measurement-update equations for the signal filter are

$$K_k^\xi = P_{\xi_k}^- C_c^T (C_c P_{\xi_k}^- C_c^T)^{-1}, \qquad (5.121)$$
$$\hat{\xi}_k = \hat{\xi}_k^- + K_k^\xi (y_k - C_c \hat{\xi}_k^-), \qquad (5.122)$$
$$P_{\xi_k} = (I - K_k^\xi C_c) P_{\xi_k}^-, \qquad (5.123)$$

and those for the weight filter are

$$K_k^w = P_{w_k}^- (C_k^w)^T [C_k^w P_{w_k}^- (C_k^w)^T + R_e]^{-1}, \qquad (5.124)$$
$$\hat{\mathbf{w}}_k = \hat{\mathbf{w}}_k^- + K_k^w \epsilon_k, \qquad (5.125)$$
$$P_{w_k} = (I - K_k^w C_k^w) P_{w_k}^-, \qquad (5.126)$$

where

$$A_{k-1} \triangleq \frac{\partial F_c(\xi, \hat{\mathbf{w}}_k^-, \mathbf{a}_n)}{\partial \xi}\bigg|_{\hat{\xi}_{k-1}}, \qquad \epsilon_k = y_k - C_c \hat{\xi}_k^-, \qquad C_k^w \triangleq -\frac{\partial \epsilon_k}{\partial \mathbf{w}} = C_c\,\frac{\partial \hat{\xi}_k^-}{\partial \mathbf{w}}\bigg|_{\mathbf{w} = \hat{\mathbf{w}}_k^-}. \qquad (5.127)$$

Marginal Estimation: Maximum-Likelihood Cost

The corresponding weight cost function and error terms are given in Table 5.15, where

$$e_k \triangleq y_k - (\hat{x}_k^- + \hat{n}_k^-), \qquad \lambda_{e,k} = \frac{\sigma_{e,k}^2 - 3e_k^2}{2\sigma_{e,k}^2}.$$
Table 5.14 Colored-noise joint cost function: observed error terms for the dual EKF weight filter

$$J_k = J(\hat{x}_{k-1}, \hat{n}_{k-1}, \mathbf{w}) + \sum_t \left[\frac{(\hat{x}_t - \hat{x}_t^-)^2}{\sigma_v^2} + \frac{(\hat{n}_t - \hat{n}_t^-)^2}{\sigma_{v_n}^2}\right], \quad \text{or} \quad J_k = \sum_{t=1}^{k}\left(\frac{\tilde{x}_t^2}{\sigma_v^2} + \frac{\tilde{n}_t^2}{\sigma_{v_n}^2}\right), \qquad (5.132)$$

$$\epsilon_k \triangleq \begin{bmatrix} \sigma_v^{-1}\,\tilde{x}_k \\ \sigma_{v_n}^{-1}\,\tilde{n}_k \end{bmatrix}, \qquad C_k^w = \begin{bmatrix} \sigma_v^{-1}\, H_w^T \tilde{x}_k \\ \sigma_{v_n}^{-1}\, H_w^T \tilde{n}_k \end{bmatrix}.$$

Table 5.15 Maximum-likelihood cost function: observed error terms for the dual EKF weight filter

$$J_{ml}^c(\mathbf{w}) = \sum_{k=1}^{N}\left[\log(2\pi\sigma_{e_k}^2) + \frac{(y_k - \hat{x}_k^- - \hat{n}_k^-)^2}{\sigma_{e_k}^2}\right],$$

$$\epsilon_k \triangleq \begin{bmatrix} (\lambda_{e,k})^{1/2} \\ \sigma_{e_k}^{-1}\, e_k \end{bmatrix}, \qquad C_k^w = \begin{bmatrix} \tfrac{1}{2}(\lambda_{e,k})^{-1/2}\,\sigma_{e_k}^{-2}\, H_w^T(\sigma_{e_k}^2) \\ \sigma_{e_k}^{-1}\, H_w^T e_k + \dfrac{e_k}{2(\sigma_{e_k}^2)^{3/2}}\, H_w^T(\sigma_{e_k}^2) \end{bmatrix}.$$

Marginal Estimation Forms: Prediction Error Cost

If $\sigma_{e_k}^2$ is assumed to be independent of $\mathbf{w}$, then we have the prediction error cost shown in Table 5.16. Note that an alternative prediction error form may be derived by including $H_w \hat{n}_k^-$ in the calculation of $C_k^w$. However, the performance appears superior if this term is neglected.

Table 5.16 Colored-noise prediction-error cost function: observed error terms for the dual EKF weight filter

$$J_{pe}^c(\mathbf{w}) = \sum_{k=1}^{N} e_k^2 = \sum_{k=1}^{N} (y_k - \hat{x}_k^- - \hat{n}_k^-)^2, \qquad (5.133)$$

$$\epsilon_k \triangleq y_k - \hat{n}_k^- - \hat{x}_k^-, \qquad C_k^w = H_w \hat{x}_k^-.$$
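To make the two-filter structure concrete, here is a deliberately simplified scalar sketch in the spirit of Table 5.13: a linear AR(1) signal in white measurement noise, so the colored-noise augmentation is unnecessary, $F$ is linear, and the static approximation $C_k^w \approx \hat{x}_{k-1}$ is used (recurrent derivatives of the Kalman gain are neglected). The variable names and the omission of variance estimation and forgetting-factor tuning are our own simplifications, not the chapter's full algorithm:

```python
import numpy as np

def dual_kf_ar1(y, var_v, var_n, a0=0.0, p_w0=1.0, lam=1.0):
    """Scalar dual Kalman filter for y_k = x_k + n_k, x_k = a x_{k-1} + v_k.
    One filter tracks the signal x; a second tracks the AR coefficient a
    via the prediction-error (observed-error) form."""
    a_hat, p_w = a0, p_w0
    x_hat, p_x = y[0], var_n
    for k in range(1, len(y)):
        x_prev = x_hat
        p_w = p_w / lam                      # weight filter time update
        x_pred = a_hat * x_prev              # signal filter time update
        p_pred = a_hat ** 2 * p_x + var_v
        e = y[k] - x_pred                    # innovation shared by both filters
        var_e = p_pred + var_n
        K_x = p_pred / var_e                 # signal measurement update
        x_hat = x_pred + K_x * e
        p_x = (1.0 - K_x) * p_pred
        C_w = x_prev                         # static derivative d(x_pred)/da
        K_w = p_w * C_w / (C_w ** 2 * p_w + var_e)
        a_hat = a_hat + K_w * e              # weight measurement update
        p_w = (1.0 - K_w * C_w) * p_w
    return a_hat
```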
Marginal Estimation: EM Cost

The cost and observed error terms for weight estimation with colored noise are identical to those for the white-noise case, shown in Table 5.8. In this case, the on-line statistics are found by augmenting the combined state vector with one additional lagged value for both the signal and noise. Specifically,

$$\xi_k^+ = \begin{bmatrix} \mathbf{x}_k \\ x_{k-M} \\ \mathbf{n}_k \\ n_{k-M_n} \end{bmatrix} = \begin{bmatrix} x_k \\ \mathbf{x}_{k-1} \\ n_k \\ \mathbf{n}_{k-1} \end{bmatrix}, \qquad (5.134)$$

so that the estimate $\hat{\xi}_k^+$ produced by a Kalman filter will contain $\hat{x}_{k-1|k}$ in elements $2, \ldots, 1+M$, and $\hat{n}_{k-1|k}$ in its last $M_n$ elements. Furthermore, the error variances $p_{k|k}$ and $p_{k|k}^y$ can be obtained from the covariance $P_{c,k}^+$ of $\xi_k^+$ produced by the KF.

REFERENCES

[1] R.E. Kopp and R.J. Orford, "Linear regression applied to system identification for adaptive control systems," AIAA Journal, 1, 2300-2306 (1963).
[2] H. Cox, "On the estimation of state variables and parameters for noisy dynamic systems," IEEE Transactions on Automatic Control, 9, 5-12 (1964).
[3] L. Ljung, "Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems," IEEE Transactions on Automatic Control, 24, 36-50 (1979).
[4] M. Niedźwiecki and K. Cisowski, "Adaptive scheme of elimination of broadband noise and impulsive disturbances from AR and ARMA signals," IEEE Transactions on Signal Processing, 44, 528-537 (1996).
[5] L.W. Nelson and E. Stear, "The simultaneous on-line estimation of parameters and states in linear systems," IEEE Transactions on Automatic Control, AC-12, 438-442 (1967).
[6] L. Ljung and T.
Söderström, Theory and Practice of Recursive Identification. Cambridge, MA: MIT Press, 1983.
[7] H. Akaike, "Maximum likelihood identification of Gaussian autoregressive moving average models," Biometrika, 60, 255-265 (1973).
[8] N.K. Gupta and R.K. Mehra, "Computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations," IEEE Transactions on Automatic Control, 19, 774-783 (1974).
[9] J.S. Lim and A.V. Oppenheim, "All-pole modeling of degraded speech," IEEE Transactions on Acoustics, Speech and Signal Processing, 26, 197-210 (1978).
[10] A. Dempster, N.M. Laird, and D.B. Rubin, "Maximum-likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-38 (1977).
[11] B.R. Musicus and J.S. Lim, "Maximum likelihood parameter estimation of noisy data," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE, April 1979, pp. 224-227.
[12] R.H. Shumway and D.S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," Journal of Time Series Analysis, 3, 253-264 (1982).
[13] E. Weinstein, A.V. Oppenheim, M. Feder, and J.R. Buck, "Iterative and sequential algorithms for multisensor signal enhancement," IEEE Transactions on Signal Processing, 42, 846-859 (1994).
[14] J.T. Connor, R.D. Martin, and L.E. Atlas, "Recurrent neural networks and robust time series prediction," IEEE Transactions on Neural Networks, 5(2), 240-254 (1994).
[15] S.C. Stubberud and M. Owen, "Artificial neural network feedback loop with on-line training," in Proceedings of International Symposium on Intelligent Control, IEEE, September 1996, pp. 514-519.
[16] Z. Ghahramani and S.T.
Roweis, "Learning nonlinear dynamical systems using an EM algorithm," in Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference. Cambridge, MA: MIT Press, 1999.
[17] T. Briegel and V. Tresp, "Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models," in Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference. Cambridge, MA: MIT Press, 1999.
[18] G. Seber and C. Wild, Nonlinear Regression. New York: Wiley, 1989, pp. 491-527.
[19] E.A. Wan and A.T. Nelson, "Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing," in Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, 1997.
[20] E.A. Wan and A.T. Nelson, "Neural dual extended Kalman filtering: Applications in speech enhancement and monaural blind signal separation," in Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 1997.
[21] A.T. Nelson and E.A. Wan, "A two-observation Kalman framework for maximum-likelihood modeling of noisy time series," in Proceedings of International Joint Conference on Neural Networks, IEEE/INNS, May 1998.
[22] A.T. Nelson, "Nonlinear Estimation and Modeling of Noisy Time-Series by Dual Kalman Filter Methods," PhD thesis, Oregon Graduate Institute of Science and Technology, 2000.
[23] R.E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Ser. D, Journal of Basic Engineering, 82, 35-45 (1960).
[24] J.F.G. de Freitas, M. Niranjan, A.H. Gee, and A. Doucet, "Sequential Monte Carlo methods for optimisation of neural network models," Technical Report TR-328, Cambridge University Engineering Department, November 1998.
[25] R. van der Merwe, N. de Freitas, A. Doucet, and E. Wan, "The unscented particle filter," Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, August 2000.
[26] S.J. Julier, J.K. Uhlmann, and H. Durrant-Whyte, "A new approach for filtering nonlinear systems," in Proceedings of the American Control Conference, 1995, pp. 1628-1632.
[27] S.J. Julier and J.K. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Technical Report, RRG, Department of Engineering Science, University of Oxford, November 1996. http://www.robots.ox.ac.uk/~siju/work/publications/letter_size/Unscented.zip.
[28] E.A. Wan and R.
van der Merwe, "The unscented Kalman filter for nonlinear estimation," in Proceedings of Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), Lake Louise, Alberta, Canada, IEEE, October 2000.
[29] E.A. Wan, R. van der Merwe, and A.T. Nelson, "Dual estimation and the unscented transformation," in Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference. Cambridge, MA: MIT Press, 2000.
[30] S. Singhal and L. Wu, "Training multilayer perceptrons with the extended Kalman filter," in Advances in Neural Information Processing Systems 1. San Mateo, CA: Morgan Kaufmann, 1989, pp. 133-140.
[31] G.V. Puskorius and L.A. Feldkamp, "Neural control of nonlinear dynamic systems with Kalman filter trained recurrent networks," IEEE Transactions on Neural Networks, 5 (1994).
[32] E.S. Plumer, "Training neural networks using sequential-update forms of the extended Kalman filter," Informal Report LA-UR-95-422, Los Alamos National Laboratory, January 1995.
[33] A.H. Sayed and T. Kailath, "A state-space approach to adaptive RLS filtering," IEEE Signal Processing Magazine, 11(3), 18-60 (1994).
[34] S. Haykin, Adaptive Filter Theory, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.
[35] R. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, 1, 270-280 (1989).
[36] F.L. Lewis, Optimal Estimation. New York: Wiley, 1986.
[37] R.K. Mehra, "Identification of stochastic linear dynamic systems using Kalman filter representation," AIAA Journal, 9, 28-31 (1971).
[38] B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[39] T.M. Cover and J.A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[40] G.V. Puskorius and L.A. Feldkamp, "Extensions and enhancements of decoupled extended Kalman filter training," in Proceedings of International Conference on Neural Networks, ICNN'97, IEEE, June 1997, Vol. 3.
[41] N.N. Schraudolph, "Online local gain adaptation for multi-layer perceptrons," Technical report, IDSIA, Lugano, Switzerland, March 1998.
[42] P.J. Werbos, Handbook of Intelligent Control. New York: Van Nostrand Reinhold, 1992, pp. 283-356.
[43] FRED: Federal Reserve Economic Data. Available on the Internet at http://www.stls.frb.org/fred/. Accessed May 9, 2000.
[44] J. Moody, U. Levin, and S. Rehfuss, "Predicting the U.S. index of industrial production," Neural Network World, 3, 791-794 (1993).
[45] J.H.L. Hansen, J.R. Deller, and J.G. Proakis, Discrete-Time Processing of Speech Signals. New York: Macmillan, 1993.
[46] E.A. Wan and A.T.
Nelson, ‘‘Removal of noise from speech using the dual EKF algorithm,’’ in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, IEEE, May 1998.
[47] Center for Spoken Language Understanding, Speech Enhancement Assessment Resource (SpEAR). Available on the Internet at http://cslu.ece.ogi.edu/nsel/data/index.html. Accessed May 2000.
[48] Rice University, Signal Processing Information Base (SPIB). Available on the Internet at http://spib.ece.rice.edu/signal.html. Accessed September 15, 1999.
[49] J.H.L. Hansen and B.L. Pellom, ‘‘An effective quality evaluation protocol for speech enhancement algorithms,’’ in Proceedings of International Conference on Spoken Language Processing, ICSLP-98, Sydney, 1998.


6

LEARNING NONLINEAR DYNAMICAL SYSTEMS USING THE EXPECTATION–MAXIMIZATION ALGORITHM

Sam Roweis and Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London, London, U.K.
(zoubin@gatsby.ucl.ac.uk)

Kalman Filtering and Neural Networks, Edited by Simon Haykin. ISBN 0-471-36998-5 © 2001 John Wiley & Sons, Inc.

6.1 LEARNING STOCHASTIC NONLINEAR DYNAMICS

Since the advent of cybernetics, dynamical systems have been an important modeling tool in fields ranging from engineering to the physical and social sciences. Most realistic dynamical system models have two essential features. First, they are stochastic – the observed outputs are a noisy function of the inputs, and the dynamics itself may be driven by some unobserved noise process. Second, they can be characterized by


some finite-dimensional internal state that, while not directly observable, summarizes at any time all information about the past behavior of the process relevant to predicting its future evolution.

From a modeling standpoint, stochasticity is essential to allow a model with a few fixed parameters to generate a rich variety of time-series outputs.¹ Explicitly modeling the internal state makes it possible to decouple the internal dynamics from the observation process. For example, to model a sequence of video images of a balloon floating in the wind, it would be computationally very costly to directly predict the array of camera pixel intensities from a sequence of arrays of previous pixel intensities. It seems much more sensible to attempt to infer the true state of the balloon (its position, velocity, and orientation) and decouple the process that governs the balloon dynamics from the observation process that maps the actual balloon state to an array of measured pixel intensities.

Often we are able to write down equations governing these dynamical systems directly, based on prior knowledge of the problem structure and the sources of noise – for example, from the physics of the situation. In such cases, we may want to infer the hidden state of the system from a sequence of observations of the system’s inputs and outputs. Solving this inference or state-estimation problem is essential for tasks such as tracking or the design of state-feedback controllers, and there exist well-known algorithms for this.

However, in many cases, the exact parameter values, or even the gross structure of the dynamical system itself, may be unknown.
In such cases, the dynamics of the system have to be learned or identified from sequences of observations only. Learning may be a necessary precursor if the ultimate goal is effective state inference. But learning nonlinear state-based models is also useful in its own right, even when we are not explicitly interested in the internal states of the model, for tasks such as prediction (extrapolation), time-series classification, outlier detection, and filling-in of missing observations (imputation). This chapter addresses the problem of learning time-series models when the internal state is hidden. Below, we briefly review the two fundamental algorithms that form the basis of our learning procedure. In Section 6.2, we introduce our algorithm

¹ There are, of course, completely deterministic but chaotic systems with this property. If we separate the noise processes in our models from the deterministic portions of the dynamics and observations, we can think of the noises as another deterministic (but highly chaotic) system that depends on initial conditions and exogenous inputs that we do not know. Indeed, when we run simulations using a pseudo-random-number generator started with a particular seed, this is precisely what we are doing.


and derive its learning rules. Section 6.3 presents results of using the algorithm to identify nonlinear dynamical systems. Finally, we present some conclusions and potential extensions to the algorithm in Sections 6.4 and 6.5.

6.1.1 State Inference and Model Learning

Two remarkable algorithms from the 1960s – one developed in engineering and the other in statistics – form the basis of modern techniques in state estimation and model learning. The Kalman filter, introduced by Kalman and Bucy in 1961 [1], was developed in a setting where the physical model of the dynamical system of interest was readily available; its goal is optimal state estimation in systems with known parameters. The expectation–maximization (EM) algorithm, pioneered by Baum and colleagues [2] and later generalized and named by Dempster et al. [3], was developed to learn the parameters of statistical models in the presence of incomplete data or hidden variables.

In this chapter, we bring together these two algorithms in order to learn the dynamics of stochastic nonlinear systems with hidden states. Our goal is twofold: both to develop a method for identifying the dynamics of nonlinear systems whose hidden states we wish to infer, and to develop a general nonlinear time-series modeling tool.
We examine inference and learning in discrete-time² stochastic nonlinear dynamical systems with hidden states x_k, external inputs u_k, and noisy outputs y_k. (All lower-case characters (except indices) denote vectors. Matrices are represented by upper-case characters.) The systems are parametrized by a set of tunable matrices, vectors, and scalars, which we shall collectively denote as θ. The inputs, outputs, and states are related to each other by

x_{k+1} = f(x_k, u_k) + w_k,    (6.1a)
y_k = g(x_k, u_k) + v_k,    (6.1b)

² Continuous-time dynamical systems (in which derivatives are specified as functions of the current state and inputs) can be converted into discrete-time systems by sampling their outputs and using ‘‘zero-order holds’’ on their inputs. In particular, for a continuous-time linear system ẋ(t) = A_c x(t) + B_c u(t) sampled at interval t, the corresponding dynamics and input driving matrices, so that x_{k+1} = A x_k + B u_k, are A = Σ_{k=0}^∞ A_c^k t^k / k! = exp(A_c t) and B = A_c^{-1}(A − I) B_c.
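The zero-order-hold conversion in footnote 2 is easy to check numerically. Below is a minimal sketch using a truncated power series for the matrix exponential; the continuous-time system is an arbitrary stable example chosen for illustration, not taken from the chapter:

```python
import numpy as np

def expm_series(M, terms=30):
    """Matrix exponential via its truncated power series sum_k M^k / k!."""
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def zoh_discretize(Ac, Bc, t):
    """Zero-order-hold discretization per footnote 2:
    A = exp(Ac t), B = Ac^{-1} (A - I) Bc (valid when Ac is invertible)."""
    A = expm_series(Ac * t)
    B = np.linalg.inv(Ac) @ (A - np.eye(len(Ac))) @ Bc
    return A, B

Ac = np.array([[-1.0, 0.0], [0.0, -2.0]])   # arbitrary stable example
Bc = np.array([[1.0], [1.0]])
A, B = zoh_discretize(Ac, Bc, 0.1)
```

Because this example is diagonal, the result can be verified against the scalar formulas A_ii = exp(a_i t) and B_i = (exp(a_i t) − 1)/a_i.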


Figure 6.1 A probabilistic graphical model for stochastic dynamical systems with hidden states x_k, inputs u_k, and observables y_k.

where w_k and v_k are zero-mean Gaussian noise processes. The state vector x evolves according to a nonlinear but stationary Markov dynamics³ driven by the inputs u and by the noise source w. The outputs y are nonlinear, noisy, but stationary and instantaneous functions of the current state and current input. The vector-valued nonlinearities f and g are assumed to be differentiable, but otherwise arbitrary. The goal is to develop an algorithm that can be used to model the probability density of output sequences (or the conditional density of outputs given inputs) using only a finite number of example time series. The crux of the problem is that both the hidden state trajectory and the parameters are unknown.

Models of this kind have been examined for decades in systems and control engineering. They can also be viewed within the framework of probabilistic graphical models, which use graph theory to represent the conditional dependencies between a set of variables [4, 5]. A probabilistic graphical model has a node for each (possibly vector-valued) random variable, with directed arcs representing stochastic dependences. Absent connections indicate conditional independence.
In particular, nodes are conditionally independent from their non-descendants, given their parents – where parents, children, descendants, etc., are defined with respect to the directionality of the arcs (i.e., arcs go from parent to child). We can capture the dependences in Eqs. (6.1a,b) compactly by drawing the graphical model shown in Figure 6.1.

One of the appealing features of probabilistic graphical models is that they explicitly diagram the mechanism that we assume generated the data. This generative model starts by picking randomly the values of the nodes that have no parents. It then picks randomly the values of their children

³ Stationarity means here that neither f nor the covariance of the noise process w_k depends on time; that is, the dynamics are time-invariant. Markov refers to the fact that, given the current state, the next state does not depend on the past history of the states.


given the parents’ values, and so on. The random choices for each child given its parents are made according to some assumed noise model. The combination of the graphical model and the assumed noise model at each node fully specifies a probability distribution over all variables in the model.

Graphical models have helped clarify the relationship between dynamical systems and other probabilistic models such as hidden Markov models and factor analysis [6]. Graphical models have also made it possible to develop probabilistic inference algorithms that are vastly more general than the Kalman filter.

If we knew the parameters, the operation of interest would be to infer the hidden state sequence. The uncertainty in this sequence would be encoded by computing the posterior distributions of the hidden state variables given the sequence of observations. The Kalman filter (reviewed in Chapter 1) provides a solution to this problem in the case where f and g are linear. If, on the other hand, we had access to the hidden state trajectories as well as to the observables, then the problem would be one of model-fitting, i.e., estimating the parameters of f and g and the noise covariances.
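The generative (ancestral-sampling) view described above is easy to sketch in code for the model of Eqs. (6.1a,b): sample the parentless root state, then repeatedly sample each child given its parents. The scalar nonlinearities and noise levels below are hypothetical, chosen only for illustration:

```python
import numpy as np

def ancestral_sample(f, g, q_std, r_std, x1, inputs, rng):
    """Sample one trajectory from the generative model
    x_{k+1} = f(x_k, u_k) + w_k,  y_k = g(x_k, u_k) + v_k."""
    xs, ys = [], []
    x = x1
    for u in inputs:
        xs.append(x)
        # Child y_k given its parents x_k and u_k (observation noise v_k).
        ys.append(g(x, u) + r_std * rng.standard_normal())
        # Child x_{k+1} given its parents x_k and u_k (process noise w_k).
        x = f(x, u) + q_std * rng.standard_normal()
    return np.array(xs), np.array(ys)

# Hypothetical scalar nonlinearities, for illustration only.
f = lambda x, u: 0.9 * np.sin(x) + 0.5 * u
g = lambda x, u: x ** 2
rng = np.random.default_rng(0)
xs, ys = ancestral_sample(f, g, 0.1, 0.1, 0.0, np.zeros(50), rng)
```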
Given observations of the (no longer hidden) states and outputs, f and g can be obtained as the solution to a possibly nonlinear regression problem, and the noise covariances can be obtained from the residuals of the regression. How should we proceed when both the system model and the hidden states are unknown?

The classical approach to solving this problem is to treat the parameters θ as ‘‘extra’’ hidden variables, and to apply an extended Kalman filtering (EKF) algorithm (see Chapter 1) to the nonlinear system with the state vector augmented by the parameters [7, 8]. For stationary models, the dynamics of the parameter portion of this extended state vector are set to the identity function. The approach can be made inherently on-line, which may be important in certain applications. Furthermore, it provides an estimate of the covariance of the parameters at each time step. Finally, its objective, probabilistically speaking, is to find an optimum in the joint space of parameters and hidden state sequences.

In contrast, the algorithm we present is a batch algorithm (although, as we discuss in Section 6.4.2, online extensions are possible), and does not attempt to estimate the covariance of the parameters.
Like other instances of the EM algorithm, which we describe below, its goal is to integrate over the uncertain estimates of the unknown hidden states and optimize the resulting marginal likelihood of the parameters given the observed data. An extended Kalman smoother (EKS) is used to estimate the approximate state distribution in the E-step, and a radial basis function (RBF) network [9, 10] is used for nonlinear regression in the M-step. It is important not to


confuse this use of the extended Kalman algorithm, namely, to estimate just the hidden state as part of the E-step of EM, with the use that we described in the previous paragraph, namely, to simultaneously estimate parameters and hidden states.

6.1.2 The Kalman Filter

Linear dynamical systems with additive white Gaussian noises are the most basic models to examine when considering the state-estimation problem, because they admit exact and efficient inference. (Here, and in what follows, we call a system linear if both the state evolution function and the state-to-output observation function are linear, and nonlinear otherwise.) The linear dynamics and observation processes correspond to matrix operations, which we denote by A, B and C, D, respectively, giving the classic state-space formulation of input-driven linear dynamical systems:

x_{k+1} = A x_k + B u_k + w_k,    (6.2a)
y_k = C x_k + D u_k + v_k.    (6.2b)

The Gaussian noise vectors w and v have zero mean and covariances Q and R, respectively. If the prior probability distribution p(x_1) over initial states is taken to be Gaussian, then the joint probabilities of all states and outputs at future times are also Gaussian, since the Gaussian distribution is closed under the linear operations applied by state evolution and output mapping and under the convolution applied by additive Gaussian noise. Thus, all distributions over hidden state variables are fully described by their means and covariance matrices.
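The covariance propagation implied by this Gaussian closure can be checked numerically: pushing a Gaussian state through Eq. (6.2a) yields covariance A V Aᵀ + Q. A quick Monte Carlo sketch (the system matrices are illustrative placeholders, not values from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative system matrices (placeholders).
A = np.array([[0.9, 0.2], [0.0, 0.7]])
C = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[0.05]])

# Draw many independent one-step transitions x_2 = A x_1 + w,
# starting from x_1 ~ n(0, I), and compare the empirical covariance
# of x_2 with the closed-form propagation A V A^T + Q (here V = I).
n = 200_000
x1 = rng.standard_normal((n, 2))
w = rng.multivariate_normal(np.zeros(2), Q, size=n)
x2 = x1 @ A.T + w
y2 = x2 @ C.T + rng.normal(0.0, np.sqrt(R[0, 0]), size=(n, 1))
emp_cov = np.cov(x2.T)
pred_cov = A @ A.T + Q
```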
The algorithm for exactly computing the posterior mean and covariance for x_k given some sequence of observations consists of two parts: a forward recursion, which uses the observations from y_1 to y_k, known as the Kalman filter [11], and a backward recursion, which uses the observations from y_T to y_{k+1}. The combined forward and backward recursions are known as the Kalman or Rauch–Tung–Striebel (RTS) smoother [12]. These algorithms are reviewed in detail in Chapter 1.

There are three key insights to understanding the Kalman filter. The first is that the Kalman filter is simply a method for implementing Bayes’ rule. Consider the very general setting where we have a prior p(x) on some


state variable and an observation model p(y|x) for the noisy outputs given the state. Bayes’ rule gives us the state-inference procedure:

p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Z,    (6.3a)
Z = p(y) = ∫ p(y|x) p(x) dx,    (6.3b)

where the normalizer Z is the unconditional density of the observation. All we need to do in order to convert our prior on the state into a posterior is to multiply by the likelihood from the observation equation, and then renormalize.

The second insight is that there is no need to invert the output or dynamics functions, as long as we work with easily normalizable distributions over hidden states. We see this by applying Bayes’ rule to the linear Gaussian case for a single time step.⁴ We start with a Gaussian belief n(x_{k−1}, V_{k−1}) on the current hidden state, use the dynamics to convert this to a prior n(x^+, V^+) on the next state, and then condition on the observation to convert this prior into a posterior n(x_k, V_k). This gives the classic Kalman filtering equations:

p(x_k | y_1, …, y_{k−1}) = n(x^+, V^+),    (6.4a)
x^+ = A x_{k−1},    V^+ = A V_{k−1} A^T + Q,    (6.4b)
p(y_k | x_k) = n(C x_k, R),    (6.4c)
p(x_k | y_k) = n(x_k, V_k),    (6.4d)
x_k = x^+ + K(y_k − C x^+),    V_k = (I − KC) V^+,    (6.4e)
K = V^+ C^T (C V^+ C^T + R)^{-1}.    (6.4f)

The posterior is again Gaussian and analytically tractable. Notice that neither the dynamics matrix A nor the observation matrix C needed to be inverted.

The third insight is that the state-estimation procedures can be implemented recursively. The posterior from the previous time step is run through the dynamics model and becomes our prior for the current time step.
We then convert this prior into a new posterior by using the current observation.

⁴ Some notation: A multivariate normal (Gaussian) distribution with mean m and covariance matrix S is written as n(m, S). The same Gaussian evaluated at the point z is denoted by n(m, S)|_z. The determinant of a matrix A is denoted by |A|, and matrix inversion by A^{-1}. The symbol ∼ means ‘‘distributed according to.’’
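A single predict/update cycle of Eqs. (6.4a–f) can be sketched directly in code. The matrices below are illustrative placeholders, not values from the chapter:

```python
import numpy as np

def kalman_step(x_prev, V_prev, y, A, C, Q, R):
    """One forward recursion of the Kalman filter, Eqs. (6.4a-f)."""
    # Predict: push the previous posterior through the dynamics (6.4b).
    x_plus = A @ x_prev
    V_plus = A @ V_prev @ A.T + Q
    # Update: condition on the observation y (6.4e, 6.4f).
    K = V_plus @ C.T @ np.linalg.inv(C @ V_plus @ C.T + R)
    x_k = x_plus + K @ (y - C @ x_plus)
    V_k = (np.eye(len(x_prev)) - K @ C) @ V_plus
    return x_k, V_k

# Illustrative 2-state, 1-output system.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])
x, V = kalman_step(np.zeros(2), np.eye(2), np.array([1.0]), A, C, Q, R)
```

Note that, exactly as the text says, only the innovation covariance C V⁺ Cᵀ + R is inverted – never A or C themselves.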


For the general case of a nonlinear system with non-Gaussian noise, state estimation is much more complex. In particular, mapping through arbitrary nonlinearities f and g can result in arbitrary state distributions, and the integrals required for Bayes’ rule can become intractable. Several methods have been proposed to overcome this intractability, each providing a distinct approximate solution to the inference problem. Assuming f and g are differentiable and the noise is Gaussian, one approach is to locally linearize the nonlinear system about the current state estimate, so that, applying the Kalman filter to the linearized system, the approximate state distribution remains Gaussian. Such algorithms are known as extended Kalman filters (EKF) [13, 14]. The EKF has been used both in the classical setting of state estimation for nonlinear dynamical systems and as a basis for on-line learning algorithms for feedforward neural networks [15] and radial basis function networks [16, 17]. For more details, see Chapter 2.

State inference in nonlinear systems can also be achieved by propagating a set of random samples in state space through f and g, while at each time step re-weighting them using the likelihood p(y|x). We shall refer to algorithms that use this general strategy as particle filters [18], although variants of this sampling approach are known as sequential importance sampling, bootstrap filters [19], Monte Carlo filters [20], condensation [21], and dynamic mixture models [22, 23]. A recent survey of these methods is provided in [24].
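As a rough sketch of this propagate/re-weight strategy, here is one step of a bootstrap-style particle filter for a scalar system; the nonlinearities, noise levels, and observation sequence are hypothetical:

```python
import numpy as np

def particle_filter_step(particles, y, f, g, q_std, r_std, rng):
    """One bootstrap particle filter step: propagate samples through
    the dynamics, re-weight by the likelihood p(y|x), and resample."""
    # Propagate each sample through f, adding process noise.
    particles = f(particles) + q_std * rng.standard_normal(len(particles))
    # Weight by the Gaussian observation likelihood p(y|x).
    w = np.exp(-0.5 * ((y - g(particles)) / r_std) ** 2)
    w /= w.sum()
    # Resample according to the weights (the "bootstrap" step).
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(1)
f = lambda x: 0.5 * x + np.sin(x)   # hypothetical dynamics
g = lambda x: x                     # identity observation, for simplicity
particles = rng.standard_normal(500)
for y in [0.2, 0.4, 0.3]:           # a short synthetic observation sequence
    particles = particle_filter_step(particles, y, f, g, 0.2, 0.3, rng)
```

The surviving particle cloud is a sample-based approximation of the filtering distribution; no Gaussian assumption is needed.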
A third approximate state-inference method, known as the unscented filter [25–27], deterministically chooses a set of balanced points and propagates them through the nonlinearities in order to recursively approximate a Gaussian state distribution; for more details, see Chapter 7. Finally, there are algorithms for approximate inference and learning based on mean field theory and variational methods [28, 29]. Although we have chosen to make local linearization (EKS) the basis of our algorithms below, it is possible to formulate the same learning algorithms using any approximate inference method (e.g., the unscented filter).

6.1.3 The EM Algorithm

The EM or expectation–maximization algorithm [3, 30] is a widely applicable iterative parameter re-estimation procedure. The objective of the EM algorithm is to maximize the likelihood of the observed data P(Y|θ) in the presence of hidden⁵ variables X. (We shall denote the entire

⁵ Hidden variables are often also called latent variables; we shall use both terms. They can also be thought of as missing data for the problem or as auxiliary parameters of the model.


6.1 LEARNING STOCHASTIC NONLINEAR DYNAMICS 183sequence of observed data by Y ¼fy 1 ; ...; y t g, observed inputs byU ¼fu 1 ; ...; u T g, the sequence of hid<strong>de</strong>n variables by X ¼fx 1 ; ...; x t g,<strong>and</strong> the parameters of the mo<strong>de</strong>l by y.) Maximizing the likelihood as afunction of y is equivalent to maximizing the log-likelihood:ðLðyÞ ¼log PðYjU; yÞ ¼log PðX ; Y jU; yÞ dX: ð6:5ÞXUsing any distribution QðX Þ over the hid<strong>de</strong>n variables, we can obtain alower bound on L:ððlog PðY ; X jU; yÞ dX ¼ logXðð¼XXðXPðX ; Y jU; yÞQðX ÞQðX ÞPðX ; Y jU; yÞQðX Þ logQðX ÞdXdXQðX Þ log PðX ; Y jU; yÞ dXX¼FðQ; yÞ;QðX Þ log QðX Þ dXð6:6aÞð6:6bÞð6:6cÞð6:6d Þwhere the middle inequality (6.6b) is known as Jensen’s inequality <strong>and</strong> canbe proved using the concavity of the log function. If we <strong>de</strong>fine the energyof a global configuration ðX ; Y Þ to be log PðX ; Y jU; yÞ, then the lowerbound FðQ; yÞ LðyÞ is the negative of a quantity known in statisticalphysics as the free energy: the expected energy un<strong>de</strong>r Q minus the entropyof Q [31]. The EM algorithm alternates between maximizing F withrespect to the distribution Q <strong>and</strong> the parameters y, respectively, holdingthe other fixed. Starting from some initial parameters y 0 we alternatelyapply:E-step: Q kþ1 arg max FðQ; y k Þ;QM-step: y kþ1 arg max FðQ kþ1 ; yÞ:yð6:7aÞð6:7bÞIt is easy to show that the maximum in the E-step results when Q is exactlythe conditional distribution of X , Q* kþ1 ðX Þ¼PðX jY ; U; y k Þ, at whichpoint the bound becomes an equality: FðQ* kþ1 ; y k Þ¼Lðy k Þ. The maxi-


184 6 LEARNING NONLINEAR DYNAMICAL SYSTEMS USING EMmum in the M-step is obtained by maximizing the first term in (6.6c),since the entropy of Q does not <strong>de</strong>pend on y:M-step: y* kþ1 arg maxyðXPðX jY; U; y k Þ log PðX ; Y jU; yÞ dX :ð6:8ÞThis is the expression most often associated with the EM algorithm, but itobscures the elegant interpretation [31] of EM as coordinate ascent in F(see Fig. 6.2). Since F¼Lat the beginning of each M-step, <strong>and</strong> since theE-step does not change y, we are guaranteed not to <strong>de</strong>crease the likelihoodafter each combined EM step. (While this is obviously true of ‘‘complete’’EM algorithms as <strong>de</strong>scribed above, it may also be true for ‘‘incomplete’’ or‘‘sparse’’ variants in which approximations are used during the E- <strong>and</strong>=orM-steps so long as F always goes up; see also the earlier work in [32].)For example, this can take the form of a gradient M- step algorithm (wherewe increase PðY jyÞ with respect to y but do not strictly maximize it), orany E-step which improves the bound F without saturating it [31].)In dynamical systems with hid<strong>de</strong>n states, the E-step correspondsexactly to solving the smoothing problem: estimating the hid<strong>de</strong>n statetrajectory given both the observations=inputs <strong>and</strong> the parameter values.The M-step involves system i<strong>de</strong>ntification using the state estimates fromthe smoother. Therefore, at the heart of the EM learning procedure is thefollowing i<strong>de</strong>a: use the solutions to the filtering=smoothing problem toestimate the unknown hid<strong>de</strong>n states given the observations <strong>and</strong> the currentFigure 6.2 The EM algorithm can be thought of as coordinate ascent in thefunctional FðQðXÞ, yÞ (see text). The E-step maximizes F with respect to QðXÞgiven fixed y (horizontal moves), while the M-step maximizes F with respectto y given fixed QðXÞ (vertical moves).


model parameters. Then use this fictitious complete data to solve for new model parameters. Given the estimated states obtained from the inference algorithm, it is usually easy to solve for new parameters. For example, when working with linear Gaussian models, this typically involves minimizing quadratic forms, which can be done with linear regression. This process is repeated, using these new model parameters to infer the hidden states again, and so on. Keep in mind that our goal is to maximize the log-likelihood (6.5) (or equivalently to maximize the total likelihood) of the observed data with respect to the model parameters. This means integrating (or summing) over all the ways in which the model could have produced the data (i.e., hidden state sequences). As a consequence of using the EM algorithm to do this maximization, we find ourselves needing to compute (and maximize) the expected log-likelihood of the joint data (6.8), where the expectation is taken over the distribution of hidden values predicted by the current model parameters and the observations.

In the past, the EM algorithm has been applied to learning linear dynamical systems in specific cases, such as ‘‘multiple-indicator multiple-cause’’ (MIMC) models with a single latent variable [33] or state-space models with the observation matrix known [34], as well as more generally [35]. This chapter applies the EM algorithm to learning nonlinear dynamical systems, and is an extension of our earlier work [36]. Since then, there has been similar work applying EM to nonlinear dynamical systems [37, 38].
Whereas other work uses sampling for the E-step and gradient M-steps, our algorithm uses RBF networks to obtain a computationally efficient and exact M-step.

The EM algorithm has four important advantages over classical approaches. First, it provides a straightforward and principled method for handling missing inputs or outputs. (Indeed, this was the original motivation for Shumway and Stoffer’s application of the EM algorithm to learning partially unknown linear dynamical systems [34].) Second, EM generalizes readily to more complex models with combinations of discrete and real-valued hidden variables. For example, one can formulate EM for a mixture of nonlinear dynamical systems [39, 40]. Third, whereas it is often very difficult to prove or analyze stability within the classical on-line approach, the EM algorithm is always attempting to maximize the likelihood, which acts as a Lyapunov function for stable learning. Fourth, the EM framework facilitates Bayesian extensions to learning – for example, through the use of variational approximations [29].
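The monotone-likelihood property behind the third advantage is easy to check on a toy problem. A minimal, self-contained sketch (EM for a two-component 1-D Gaussian mixture with equal, fixed weights – not the dynamical-system case treated in this chapter) records the log-likelihood at each iteration, which must never decrease:

```python
import numpy as np

def em_gmm_1d(x, n_iter=20):
    """EM for a two-component 1-D Gaussian mixture with equal weights.
    Returns the per-iteration log-likelihoods (non-decreasing by EM theory)."""
    mu = np.array([x.min(), x.max()])  # crude initialization
    var = np.array([x.var(), x.var()])
    lls = []
    for _ in range(n_iter):
        # E-step: posterior responsibilities Q(component | x_i).
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        lls.append(np.log(0.5 * dens.sum(axis=1)).sum())
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and variances from responsibilities.
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0)
    return np.array(lls)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
lls = em_gmm_1d(x)
```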


6.2 COMBINING EKS AND EM

In the next sections, we shall describe the basic components of our EM learning algorithm. For the expectation step of the algorithm, we infer an approximate conditional distribution of the hidden states using extended Kalman smoothing (Section 6.2.1). For the maximization step, we first discuss the general case (Section 6.2.2), and then describe the particular case where the nonlinearities are represented using Gaussian radial basis function (RBF) networks (Section 6.2.3). Since, as with all EM or likelihood ascent algorithms, our algorithm is not guaranteed to find the globally optimal solution, good initialization is a key factor in practical success. We typically use a variant of factor analysis followed by estimation of a purely linear dynamical system as the starting point for training our nonlinear models (Section 6.2.4).

6.2.1 Extended Kalman Smoothing (E-step)

Given a system described by Eqs. (6.1a,b), the E-step of an EM learning algorithm needs to infer the hidden states from a history of observed inputs and outputs. The quantities at the heart of this inference problem are two conditional densities

P(x_k | u_1, …, u_T, y_1, …, y_T),    1 ≤ k ≤ T,    (6.9)
P(x_k, x_{k+1} | u_1, …, u_T, y_1, …, y_T),    1 ≤ k ≤ T − 1.    (6.10)

For nonlinear systems, these conditional densities are in general non-Gaussian, and can in fact be quite complex. For all but a very few nonlinear systems, exact inference equations cannot be written down in closed form.
Furthermore, for many nonlinear systems of interest, exact inference is intractable (even numerically), meaning that, in principle, the amount of computation required grows exponentially in the length of the observed time series. The intuition behind all extended Kalman algorithms is that they approximate a stationary nonlinear dynamical system with a non-stationary (time-varying) but linear system. In particular, extended Kalman smoothing (EKS) simply applies regular Kalman smoothing to a local linearization of the nonlinear system. At every point $\tilde{x}$ in $x$ space, the derivatives of the vector-valued functions $f$ and $g$ define the matrices

$$A_{\tilde{x}} \equiv \left.\frac{\partial f}{\partial x}\right|_{x = \tilde{x}} \quad \text{and} \quad C_{\tilde{x}} \equiv \left.\frac{\partial g}{\partial x}\right|_{x = \tilde{x}},$$


respectively. The dynamics are linearized about $\hat{x}_k$, the mean of the current filtered (not smoothed) state estimate at time $k$. The output equation can be similarly linearized. These linearizations yield

$$x_{k+1} \approx f(\hat{x}_k, u_k) + A_{\hat{x}_k}(x_k - \hat{x}_k) + w, \qquad (6.11)$$

$$y_k \approx g(\hat{x}_k, u_k) + C_{\hat{x}_k}(x_k - \hat{x}_k) + v. \qquad (6.12)$$

If the noise distributions and the prior distribution of the hidden state at $k = 1$ are Gaussian, then, in this progressively linearized system, the conditional distribution of the hidden state at any time $k$ given the history of inputs and outputs will also be Gaussian. Thus, Kalman smoothing can be used on the linearized system to infer this conditional distribution; this is illustrated in Figure 6.3.

Notice that although the algorithm performs smoothing (in other words, it takes into account all observations, including future ones, when inferring the state at any time), the linearization is done only in the forward direction. Why not re-linearize about the backward estimates during the RTS recursions? While, in principle, this approach might give better results, it is difficult to implement in practice because it requires the dynamics function to be uniquely invertible, which it often is not. Unlike the normal (linear) Kalman smoother, in the EKS the error covariances for the state estimates and the Kalman gain matrices do

Figure 6.3 Illustration of the information used in extended Kalman smoothing (EKS), which infers the hidden state distribution during the E-step of our algorithm. The nonlinear model is linearized about the current state estimate at each time, and then Kalman smoothing is used on the linearized system to infer Gaussian state estimates.


depend on the observed data, not just on the time index $k$. Furthermore, it is no longer necessarily true that, if the system is stationary, the Kalman gain will converge to a value that makes the smoother act as the optimal Wiener filter in the steady state.

6.2.2 Learning Model Parameters (M-step)

The M-step of our EM algorithm re-estimates the parameters of the model given the observed inputs, outputs, and the conditional distributions over the hidden states. For the model we have described, the parameters define the nonlinearities $f$ and $g$, and the noise covariances $Q$ and $R$ (as well as the mean and covariance of the initial state, $x_1$).

Two complications can arise in the M-step. First, fully re-estimating $f$ and $g$ in each M-step may be computationally expensive. For example, if they are represented by neural network regressors, a single full M-step would be a lengthy training procedure using backpropagation, conjugate gradients, or some other optimization method. To avoid this, one could use partial M-steps that increase but do not maximize the expected log-likelihood (6.8), for example each consisting of one or a few gradient steps. However, this will in general make the fitting procedure much slower.

The second complication is that $f$ and $g$ have to be trained using the uncertain state estimates output by the EKS algorithm. This makes it difficult to apply standard curve-fitting or regression techniques. Consider fitting $f$, which takes as inputs $x_k$ and $u_k$ and outputs $x_{k+1}$. For each $k$, the conditional density estimated by EKS is a full-covariance Gaussian in $(x_k, x_{k+1})$ space.
So $f$ has to be fit not to a set of data points, but instead to a mixture of full-covariance Gaussians in input-output space (Gaussian ''clouds'' of data). Ideally, to follow the EM framework, this conditional density should be integrated over during the fitting process. Integrating over this type of data is nontrivial for almost any form of $f$. One simple but inefficient approach to bypass this problem is to draw a large sample from these Gaussian clouds of data and then fit $f$ to these samples in the usual way. A similar situation occurs with the fitting of the output function $g$.

We present an alternative approach, which is to choose the form of the function approximator to make the integration easier. As we shall show, using Gaussian radial basis function (RBF) networks [9, 10] to model $f$ and $g$ allows us to do the integrals exactly and efficiently. With this choice of representation, both of the above complications vanish.
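The simple-but-inefficient sampling approach mentioned above can be sketched as follows; the cloud parameters here are made up for illustration (in the actual algorithm they would come from the EKS posteriors):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clouds(means, covs, n_samples=50):
    """Draw n_samples points from each full-covariance Gaussian 'cloud'
    over joint (x, z) space; f can then be fit to the pooled samples by
    any standard regression method."""
    return np.concatenate(
        [rng.multivariate_normal(m, C, size=n_samples)
         for m, C in zip(means, covs)], axis=0)

# Two illustrative clouds over (x_k, x_{k+1}) space.
means = [np.array([0.0, 0.5]), np.array([1.0, 0.2])]
covs = [0.01 * np.eye(2), 0.02 * np.eye(2)]
data = sample_clouds(means, covs)
x_in, z_out = data[:, 0], data[:, 1]  # regression pairs for fitting f
```

The RBF approach of the next section makes this sampling unnecessary by integrating over the clouds analytically.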


6.2.3 Fitting Radial Basis Functions to Gaussian Clouds

We shall present a general formulation of an RBF network from which it should be clear how to fit special forms for $f$ and $g$. Consider the following nonlinear mapping from input vectors $x$ and $u$ to an output vector $z$:

$$z = \sum_{i=1}^{I} h_i \rho_i(x) + Ax + Bu + b + w, \qquad (6.13)$$

where $w$ is a zero-mean Gaussian noise variable with covariance $Q$, and the $\rho_i$ are scalar-valued RBFs defined below. This general mapping can be used in several ways to represent dynamical systems, depending on which of the input-to-hidden-to-output mappings are assumed to be nonlinear. Three examples are: (1) representing $f$ using (6.13) with the substitutions $x \leftarrow x_k$, $u \leftarrow u_k$, and $z \leftarrow x_{k+1}$; (2) representing $f$ using $x \leftarrow (x_k, u_k)$, $u \leftarrow \emptyset$, and $z \leftarrow x_{k+1}$; and (3) representing $g$ using the substitutions $x \leftarrow x_k$, $u \leftarrow u_k$, and $z \leftarrow y_k$. (Indeed, for different simulations, we shall use different forms.) The parameters are the $I$ coefficients $h_i$ of the RBFs; the matrices $A$ and $B$ multiplying inputs $x$ and $u$, respectively; an output bias vector $b$; and the noise covariance $Q$. Each RBF is assumed to be a Gaussian in $x$ space, with center $c_i$ and width given by the covariance matrix $S_i$:

$$\rho_i(x) = |2\pi S_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x - c_i)^\top S_i^{-1}(x - c_i)\right], \qquad (6.14)$$

where $|S_i|$ is the determinant of the matrix $S_i$. For now, we assume that the centers and widths of the RBFs are fixed, although we discuss learning their locations in Section 6.4.

The goal is to fit this RBF model to data $(u, x, z)$.
The complication is that the data set comes in the form of a mixture of Gaussian distributions. Here we show how to analytically integrate over this mixture distribution to fit the RBF model.

Assume the data set is

$$P(x, z, u) = \frac{1}{J}\sum_j n_j(x, z)\,\delta(u - u_j). \qquad (6.15)$$

That is, we observe samples from the $u$ variables, each paired with a Gaussian ''cloud'' of data, $n_j$, over $(x, z)$. The Gaussian $n_j$ has mean $\mu_j$ and covariance matrix $C_j$.
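The mapping of Eqs. (6.13) and (6.14) is easy to evaluate directly. A minimal sketch of the deterministic part (the noise $w$ omitted); the toy parameter values are illustrative only:

```python
import numpy as np

def rbf(x, c, S):
    """Gaussian RBF of Eq. (6.14):
    |2*pi*S|^(-1/2) * exp(-0.5 * (x - c)^T S^{-1} (x - c))."""
    d = x - c
    return (np.linalg.det(2.0 * np.pi * S) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.solve(S, d)))

def rbf_map(x, u, centers, widths, h, A, B, b):
    """Deterministic part of Eq. (6.13):
    sum_i h_i rho_i(x) + A x + B u + b."""
    rho = np.array([rbf(x, c, S) for c, S in zip(centers, widths)])
    return h.T @ rho + A @ x + B @ u + b

# One RBF at the origin with unit width, in one dimension; at x = 0
# the kernel takes its peak value (2*pi)^(-1/2).
out = rbf_map(np.array([0.0]), np.array([0.0]),
              [np.array([0.0])], [np.eye(1)],
              h=np.array([[1.0]]), A=np.zeros((1, 1)),
              B=np.zeros((1, 1)), b=np.zeros(1))
```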


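Once the bracketed expectations are available (they are derived in closed form below, Eq. (6.18)), the M-step reduces to a linear solve. A sketch, exercised here on point data, i.e., clouds with zero covariance, where the expectations reduce to outer products:

```python
import numpy as np

def m_step(zPhi_list, PhiPhi_list, zz_list):
    """Closed-form M-step of Eq. (6.18): given per-cloud expectations
    <z Phi^T>_j, <Phi Phi^T>_j, and <z z^T>_j, solve for theta and Q."""
    J = len(zPhi_list)
    zPhi, PhiPhi = sum(zPhi_list), sum(PhiPhi_list)
    theta = zPhi @ np.linalg.inv(PhiPhi)
    Q = (sum(zz_list) - theta @ zPhi.T) / J
    return theta, Q

# Noiseless point data generated by theta_true recovers theta exactly
# and gives Q = 0.
theta_true = np.array([[1.0, 2.0]])
Phis = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
zs = [theta_true @ P for P in Phis]
theta, Q = m_step([np.outer(z, P) for z, P in zip(zs, Phis)],
                  [np.outer(P, P) for P in Phis],
                  [np.outer(z, z) for z in zs])
```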
Let $\hat{z}_\theta(x, u) = \sum_{i=1}^{I} h_i \rho_i(x) + Ax + Bu + b$, where $\theta$ is the set of parameters. The log-likelihood of a single fully observed data point under the model would be

$$-\tfrac{1}{2}[z - \hat{z}_\theta(x, u)]^\top Q^{-1}[z - \hat{z}_\theta(x, u)] - \tfrac{1}{2}\ln|Q| + \text{const}.$$

Since the $(x, z)$ values in the data set are uncertain, the maximum expected log-likelihood RBF fit to the mixture of Gaussian data is obtained by minimizing the following integrated quadratic form:

$$\min_{\theta, Q}\left\{\sum_j \int_x \int_z n_j(x, z)\,[z - \hat{z}_\theta(x, u_j)]^\top Q^{-1}[z - \hat{z}_\theta(x, u_j)]\,dx\,dz + J\ln|Q|\right\}. \qquad (6.16)$$

We rewrite this in a slightly different notation, using angular brackets $\langle\cdot\rangle_j$ to denote expectation over $n_j$, and defining

$$\theta \equiv [h_1, h_2, \ldots, h_I, A, B, b],$$
$$\Phi \equiv [\rho_1(x), \rho_2(x), \ldots, \rho_I(x), x^\top, u^\top, 1]^\top.$$

Then, the objective is written as

$$\min_{\theta, Q}\left\{\sum_j \left\langle(z - \theta\Phi)^\top Q^{-1}(z - \theta\Phi)\right\rangle_j + J\ln|Q|\right\}. \qquad (6.17)$$

Taking derivatives with respect to $\theta$, premultiplying by $Q$, and setting the result to zero gives the linear equations $\sum_j \langle(z - \theta\Phi)\Phi^\top\rangle_j = 0$, which we can solve for $\theta$ and $Q$:

$$\hat{\theta} = \left(\sum_j \langle z\Phi^\top\rangle_j\right)\left(\sum_j \langle\Phi\Phi^\top\rangle_j\right)^{-1}, \qquad \hat{Q} = \frac{1}{J}\left(\sum_j \langle zz^\top\rangle_j - \hat{\theta}\sum_j \langle\Phi z^\top\rangle_j\right). \qquad (6.18)$$

In other words, given the expectations in the angular brackets, the optimal parameters can be solved for via a set of linear equations. In the Appendix, we show that these expectations can be computed analytically and efficiently, which means that we can take full and exact M-steps. The derivation is somewhat laborious, but the intuition is very simple: the


Gaussian RBFs multiply the Gaussian densities $n_j$ to form new unnormalized Gaussians in $(x, z)$ space. Expectations under these new Gaussians are easy to compute. This fitting algorithm is illustrated in Figure 6.4.

Note that, among the four advantages we mentioned previously for the EM algorithm (the ability to handle missing observations, generalizability to extensions of the basic model, Bayesian approximations, and guaranteed stability through a Lyapunov function), we have had to forgo one. There is no guarantee that extended Kalman smoothing increases the lower bound on the true likelihood, and therefore stability cannot be assured. In practice, the algorithm is rarely found to become unstable, and the approximation works well: in our experiments, the likelihoods increased monotonically and good density models were learned. Nonetheless, it may be desirable to derive guaranteed-stable algorithms for certain special cases using lower-bound-preserving variational approximations [29] or other approaches that can provide such proofs.

The ability to fully integrate over uncertain state estimates provides practical benefits as well as being theoretically pleasing. We have compared fitting our RBF networks using only the means of the state estimates with performing the full integration as derived above. When using only the means, we found it necessary to introduce a ridge

Figure 6.4 Illustration of the regression technique employed during the M-step. A fit to a mixture of Gaussian densities is required; if Gaussian RBF networks are used, then this fit can be solved analytically.
The dashed line shows a regular RBF fit to the centers of the four Gaussian densities, while the solid line shows the analytical RBF fit using the covariance information. The dotted lines below show the support of the RBF kernels.


regression (weight decay) parameter in the M-step to penalize the very large coefficients that would otherwise occur owing to precise cancellations between inputs. Since the model is linear in the parameters, this ridge-regression regularizer is like adding white noise to the radial basis outputs $\rho_i(x)$ (i.e., after the RBF kernels have been applied).^6 By linearization, this is approximately equivalent to Gaussian noise at the inputs $x$ with a covariance determined by the derivatives of the RBFs at the input locations. The uncertain state estimates provide exactly this sort of noise, and thus automatically regularize the RBF fit in the M-step. This naturally avoids the need to introduce a penalty on large coefficients, and improves generalization.

6.2.4 Initialization of Models and Choosing Locations for RBF Kernels

The practical success of our algorithm depends on two design choices that need to be made at the beginning of the training procedure. The first is to judiciously select the placement of the RBF kernels in the representation of the state dynamics and/or output function. The second is to sensibly initialize the parameters of the model so that iterative improvement with the EM algorithm (which finds only local maxima of the likelihood function) finds a good solution.

In models with low-dimensional hidden states, placement of RBF kernel centers can be done by gridding the state space and placing one kernel on each grid point. Since the scaling of the state variables is given by the covariance matrix of the state dynamics noise $w_k$ in Eq.
(6.1a), which, without loss of generality, we have set to $I$, it is possible to determine both a suitable size for the gridding region over the state space and a suitable scaling of the RBF kernels themselves. However, the number of kernels in such a grid increases exponentially with the grid dimension, so, for more than three or four state variables, gridding the state space is impractical. In these cases, we first use a simple initialization, such as a linear dynamical system, to infer the hidden states, and then place RBF kernels on a randomly chosen subset of the inferred state means.^7 We set the widths (variances) of the RBF kernels once we have

^6 Consider a simple scalar linear regression example $y_j = \theta z_j$, which can be solved by minimizing $\sum_j (y_j - \theta z_j)^2$. If each $z_j$ has mean $\bar{z}_j$ and variance $\lambda$, the expected value of this cost function is $\sum_j (y_j - \theta\bar{z}_j)^2 + J\lambda\theta^2$, which is exactly ridge regression, with $\lambda$ controlling the amount of regularization.

^7 In order to properly cover the portions of the state space that are most frequently used, we require a minimum distance between RBF kernel centers. Thus, in practice, we reject centers that fall too close together.


the spacing of their centers by attempting to make neighboring kernels cross when their outputs are half of their peak value. This ensures that, with all the coefficients set approximately equal, the RBF network will have an almost ''flat'' output across the space.^8

These heuristics can be used both for fixed assignments of centers and widths, and as initialization to an adaptive RBF placement procedure. In Section 6.4.1, we discuss techniques for adapting both the positions of the RBF centers and their widths during training of the model.

For systems with nonlinear dynamics but approximately linear output functions, we initialize using maximum-likelihood factor analysis (FA) trained on the collection of output observations (or conditional factor analysis for models with inputs). Factor analysis is a very simple model, which assumes that the output variables are generated by linearly combining a small number of independent Gaussian hidden state variables and then adding independent Gaussian noise to each output variable [6]. One can think of factor analysis as a special case of linear dynamical systems with Gaussian noise in which the states are not related in time (i.e., $A = 0$). We used the weight matrix (called the loading matrix) learned by factor analysis to initialize the observation matrix $C$ in the dynamical system. By doing time-independent inference through the factor analysis model, we can also obtain approximate estimates for the state at each time. These estimates can be used to initialize the nonlinear RBF regressor by fitting the estimates at one time step as a function of those at the previous time step.
(We also sometimes do a few iterations of training using a purely linear dynamical system before initializing the nonlinear RBF network.) Since such systems are nonlinear flows embedded in linear manifolds, this initialization estimates the embedding manifold using a linear statistical technique (FA) and the flow using a nonlinear regression based on projections into the estimated manifold.

If the output function is nonlinear but the dynamics are approximately linear, then a mixture of factor analyzers (MFA) can be trained on the output observations [41, 42]. A mixture of factor analyzers is a model that assumes that the data were generated from several Gaussian clusters with differing means, with the covariance within each cluster being modeled by a factor analyzer. Systems with a nonlinear output function but linear dynamics capture linear flows in a nonlinear embedding manifold, and

^8 One way to see this is to consider Gaussian RBFs on an $n$-dimensional grid (i.e., a square lattice), all with heights 1. The RBF centers define a hypercube, the distance between neighboring RBFs being $2d$, where $d$ is chosen such that $e^{-d^2/(2\sigma^2)} = \frac{1}{2}$. At the centers of the hypercubes, there are $2^n$ contributions from neighboring Gaussians, each of which is a distance $\sqrt{n}\,d$ away, and so contributes $(\frac{1}{2})^n$ to the height. Therefore, the height at the interiors is approximately equal to the height at the corners.


Figure 6.5 Summary of the main steps of the NLDS-EM algorithm.

the goal of the MFA initialization is to capture the nonlinear shape of the output manifold. Estimating the dynamics is difficult (since the hidden states of the individual analyzers in the mixture cannot be combined easily into a single internal state representation), but is still possible.^9 A summary of the algorithm, including these initialization techniques, is shown in Figure 6.5.

Ideally, Bayesian methods would be used to control the complexity of the model by estimating the internal state dimension and the optimal number of RBF centers. However, in general, only approximate techniques such as cross-validation or variational approximations can be implemented in practice (see Section 6.4.4). Currently, we have set these complexity parameters either by hand or with cross-validation.

6.3 RESULTS

We tested how well our algorithm could learn the dynamics of a nonlinear system by observing only the system inputs and outputs. We investigated its behavior on simple one- and two-dimensional state-space problems whose nonlinear dynamics were known, as well as on a weather time-series problem involving real temperature data.

6.3.1 One- and Two-Dimensional Nonlinear State-Space Models

In order to be able to compare our algorithm's learned internal state representation with a ground-truth state representation, we first tested it on

^9 As an approximate solution to the problem of getting a single hidden state from an MFA, we can use the following procedure: (1) Estimate the ''similarity'' between analyzer centers using the average separation in time between data points for which they are active.
(2) Use standard embedding techniques such as multidimensional scaling (MDS) [43] to place the MFA centers in a Euclidean space of dimension $k$. (3) Time-independent state inference for each observation then consists of the responsibility-weighted low-dimensional MFA centers, where the responsibilities are the posterior probabilities of each analyzer given the observation under the MFA.


synthetic data generated by nonlinear dynamics whose form was known. The systems we considered consisted of three inputs and four observables at each time, with either one or two hidden state variables. The relation of the state from one time step to the next was given by a variety of nonlinear functions followed by Gaussian noise. The outputs were a linear function of the state and inputs plus Gaussian noise. The inputs affected the state only through a linear driving function. The true and learned state transition functions for these systems, as well as sample outputs in response to Gaussian noise inputs and internal driving noise, are shown in Figures 6.6c,d, 6.7c, and 6.8c.

We initialized each nonlinear model with a linear dynamical model trained with EM, which, in turn, we initialized with a variant of factor analysis (see Section 6.2.4). The one-dimensional state-space models were given 11 RBFs in $x$ space, which were uniformly spaced. (The range of maximum and minimum $x$ values was automatically determined from the density of inferred points.) Two-dimensional state-space models were given 25 RBFs spaced in a $5 \times 5$ grid uniformly over the range of inferred

Figure 6.6 Example of fitting a system with nonlinear dynamics and a linear observation function. The panels show the fitting of a nonlinear system with a one-dimensional hidden state and four noisy outputs driven by Gaussian noise inputs and internal state noise. (a) The true dynamics function (line) and states (dots) used to generate the training data (the inset is the histogram of internal states).
(b) The learned dynamics function and states inferred on the training data (the inset is the histogram of inferred internal states). (c) The first component of the observable time series from the training data. (d) The first component of fantasy data generated from the learned model (on the same scale as c).


Figure 6.7 More examples of fitting systems with nonlinear dynamics and linear observation functions. Each of the five rows shows the fitting of a nonlinear system with a one-dimensional hidden state and four noisy outputs driven by Gaussian noise inputs and internal-state noise. (a) The true dynamics function (line) and states (dots) used to generate the training data. (b) The learned dynamics function and states inferred on the training data. (c) The first component of the observable time series: training data on the top and fantasy data generated from the learned model on the bottom. The nonlinear dynamics can produce quasi-periodic outputs in response to white driving noise.

states. After the initialization was over, the algorithm discovered the nonlinearities in the dynamics within fewer than 5 iterations of EM (see Figs. 6.6a,b, 6.7a,b, and 6.8a,b).

After training the models on input-output observations from the dynamics, we examined the learned internal state representation and


Figure 6.8 Multidimensional example of fitting a system with nonlinear dynamics and linear observation functions. The true system is piecewise-linear across the state space. The plots show the fitting of a nonlinear system with a two-dimensional hidden state and four noisy outputs driven by Gaussian noise inputs and internal state noise. (a) The true dynamics vector field (arrows) and states (dots) used to generate the training data. (b) The learned dynamics vector field and states inferred on the training data. (c) The first component of the observable time series: training data on the top and fantasy data generated from the learned model on the bottom.

compared it with the known structure of the generating system. As the figures show, the algorithm recovers the form of the nonlinear dynamics quite well. We are also able to generate ''fantasy'' data from the models once they have been learned, by exciting them with Gaussian noise of variance similar to that applied during training. The resulting observation streams look qualitatively very similar to the time series from the true systems.

We can quantify this quality of fit by comparing the log-likelihood of the training sequences and of novel test sequences under our nonlinear model with the likelihood under a basic linear dynamical system model or a static model such as factor analysis. Figure 6.9 presents this comparison. The nonlinear dynamical system had significantly superior likelihood on both training and test data for all the example systems. (Notice that for system E, the linear dynamical system is much better than factor analysis because of the strong hysteresis (mode-locking) in the system.
Thus, the output at the previous time step is an excellent predictor of the current output.)

6.3.2 Weather Data

As an example of a real system with a nonlinear output function as well as important dynamics, we trained our model on records of the daily maximum and minimum temperatures in Melbourne, Australia, over the period 1981-1990.^10 We used a model with two internal state variables,

^10 These data are available on the world wide web from the Australian Bureau of Meteorology at http://www.bom.gov.au/climate.


Figure 6.9 Differences in log-likelihood assigned by various models to training and test data from the systems in Figures 6.6 and 6.7. Each adjacent group of five bars shows the log-likelihood of the five examples (A-E) under factor analysis (FA), linear dynamical systems (LDS), and our nonlinear dynamical system model (NLDS). Results on training data appear on the left and results on test data on the right; taller bars represent better models. Log-likelihoods are offset so that FA on the training data is zero. Error bars represent the 68% quantile about the median across 100 repetitions of training or testing. For NLDS, the exact likelihood cannot be computed; what is shown is the pseudo-likelihood computed by EKS.

three outputs, and no inputs. During the training phase, the three outputs were the minimum and maximum daily temperatures as well as a real-valued output indicating the time of year (month) in the range [0, 12]. The model was trained on 1500 days of temperature records, or just over four seasons. We tested on the remaining 2150 days by showing the model only the minimum and maximum daily temperatures and attempting to predict the time of year (month). The prediction was performed by using the EKS algorithm to do state inference given only the two available observation streams. Once state inference was performed, the learned output function of the model could be used to predict the time of year. This prediction problem inherently requires the use of information from previous and/or future times, since the static relationship between temperature and season is ambiguous during spring/fall.
Figure 6.10 shows the results of this prediction after training; the algorithm has discovered a relationship between the hidden state and the observations that allows it to perform reasonable prediction for this task. Also shown


Figure 6.10 Model of maximum and minimum daily temperatures in Melbourne, Australia, from 1981 to 1990. Left of vertical line: a system with two hidden states governed by linear dynamics and a nonlinear output function was trained on observation vectors of a three-dimensional time series consisting of the maximum and minimum temperatures for each day as well as the (real-valued) month of the year. Training points are shown as triangles (maximum temperature), squares (minimum temperature), and a solid line (sawtooth wave below). Right of vertical line: after training, the system can infer its internal state from only the temperature observations. Having inferred its internal state, it can predict the month of the year as a missing output (line below). The solid lines in the upper plots show the model's prediction of minimum and maximum temperature given the inferred state at the time.

are the model's predictions of minimum and maximum temperatures given the inferred state.

Although not explicitly part of the generative model, the learned system implicitly parameterizes a relationship between time of year and temperature. We can discover this relationship by evaluating the nonlinear output function at many points in the state space. Each evaluation yields a triple of month, minimum temperature, and maximum temperature. These triples can then be plotted against each other, as in Figure 6.11, to show that the model has discovered Melbourne's seasonal temperature variations.
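The procedure just described, evaluating the learned output function over a grid of internal states, can be sketched as follows; `g_demo` is an illustrative stand-in for the trained output function, not the model actually fit in this section:

```python
import numpy as np

def scan_output_function(g, lo, hi, n=50):
    """Evaluate an output function g over an n x n grid of 2-D internal
    states; each evaluation yields a (month, min temp, max temp) triple."""
    grid = np.linspace(lo, hi, n)
    return np.array([g(np.array([a, b])) for a in grid for b in grid])

# Illustrative stand-in: a seasonal cycle read off the state angle.
def g_demo(x):
    phase = np.arctan2(x[1], x[0])
    return np.array([6.0 + 6.0 * phase / np.pi,    # month in [0, 12]
                     10.0 + 5.0 * np.cos(phase),   # minimum temperature
                     20.0 + 8.0 * np.cos(phase)])  # maximum temperature

triples = scan_output_function(g_demo, -1.0, 1.0)
```

Plotting the columns of `triples` against each other gives curves of the kind shown in Figure 6.11.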


Figure 6.11 Prediction of maximum and minimum daily temperatures based on time of year. The model from Figure 6.10 implicitly learns a relationship between time of year and minimum/maximum temperature. This relationship is not directly invertible, but the temporal information used by extended Kalman smoothing correctly infers the month given the temperatures, as shown in Figure 6.10.

6.4 EXTENSIONS

6.4.1 Learning the Means and Widths of the RBFs

It is possible to relax the assumption that the Gaussian radial basis functions have fixed centers and widths, although this results in a somewhat more complicated and slower fitting algorithm. To derive learning rules for the RBF centers $c_i$ and width matrices $S_i$, we need to consider how they enter the cost function (6.17) through the RBF kernel (6.14). We take derivatives of the expectation of the cost function $c$, and exchange the order of the expectation and the derivative:

$$\frac{\partial c}{\partial c_i} = \left\langle \frac{\partial c}{\partial \rho_i}\frac{\partial \rho_i}{\partial c_i} \right\rangle = 2\left\langle (\theta\Phi - z)^\top Q^{-1} h_i\,\rho_i(x)\,S_i^{-1}(x - c_i) \right\rangle. \qquad (6.19)$$

Recalling that $\Phi = [\rho_1(x)\;\rho_2(x)\;\ldots\;\rho_I(x)\;x^\top\;u^\top\;1]^\top$, it is clear that $c_i$ figures nonlinearly in several places in this equation, and therefore it is not possible to solve for $c_i$ in closed form. We can, however, use the above gradient to move the center $c_i$ so as to decrease the cost, which corresponds to taking a partial M-step with respect to $c_i$. Equation (6.19) requires the computation of three third-order expectations in


addition to the first- and second-order expectations needed to optimize \theta and Q: \langle \rho_i(x)\rho_k(x)x_l \rangle_j, \langle \rho_i(x)x_k x_l \rangle_j, and \langle \rho_i(x)z_k x_l \rangle_j. Similarly, differentiating the cost with respect to S_i^{-1} gives

\frac{\partial c}{\partial S_i^{-1}} = \frac{\partial c}{\partial \rho_i} \frac{\partial \rho_i}{\partial S_i^{-1}} = \left\langle \left[ (\theta\Phi - z)^\top Q^{-1} h_i \right] \rho_i(x) \left[ S_i - (x - c_i)(x - c_i)^\top \right] \right\rangle.   (6.20)

We now need three fourth-order expectations as well: \langle \rho_i(x)\rho_k(x)x_l x_m \rangle_j, \langle \rho_i(x)x_k x_l x_m \rangle_j, and \langle \rho_i(x)z_k x_l x_m \rangle_j.

These additional expectations increase both the storage and computation time of the algorithm – a cost that may not be compensated by the added advantage of moving centers and widths by small gradient steps. One heuristic is to place centers and widths using unsupervised techniques such as the EM algorithm for Gaussian mixtures, which considers solely the input density and not the output nonlinearity. Alternatively, some of these higher-order expectations can be approximated using, for example, \langle \rho_i(x) \rangle \approx \rho_i(\langle x \rangle).

6.4.2 On-line Learning

One of the major limitations of the algorithm that we have presented in this chapter is that it is a batch algorithm; that is, it assumes that we use the entire sequence of observations to estimate the model parameters. Fortunately, it is relatively straightforward to derive an on-line version of the algorithm, which updates parameters as it receives observations. This is achieved using the recursive least-squares (RLS) algorithm, which is in fact just a special case of the discrete Kalman filter (see, e.g., [8, 44]). The key observation is that the cost minimized in the M-step of the algorithm (6.17) is a quadratic function of the parameters \theta.
RLS is simply a way of solving quadratic problems on-line. Using k to index time steps, the resulting algorithm for scalar z is as follows:

\theta_k = \theta_{k-1} + \left( \langle z\Phi \rangle_k - \theta_{k-1} \langle \Phi\Phi^\top \rangle_k \right) P_k,   (6.21)

P_k = P_{k-1} - \frac{P_{k-1} \langle \Phi\Phi^\top \rangle_k P_{k-1}}{1 + \langle \Phi^\top P_{k-1} \Phi \rangle_k},   (6.22)

Q_k = Q_{k-1} + \frac{1}{k} \left[ \langle z^2 \rangle_k - \theta_k \langle \Phi z \rangle_k - Q_{k-1} \right].   (6.23)
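The recursion (6.21)-(6.23) can be sketched in a few lines. This is a minimal point-observation sketch: the expectations \langle \cdot \rangle_k are replaced by the observed \Phi and z themselves (i.e., the E-step uncertainty is ignored), and the toy data-generating model is illustrative.

```python
import numpy as np

def rls_step(theta, P, Q, phi, z, k):
    # One recursive least-squares update for scalar output z: a
    # point-observation sketch of Eqs. (6.21)-(6.23), with the
    # expectations <.>_k replaced by the observed phi and z.
    phi = phi.reshape(-1, 1)
    P = P - (P @ phi @ phi.T @ P) / (1.0 + (phi.T @ P @ phi).item())  # (6.22)
    err = z - (theta @ phi).item()                                    # innovation
    theta = theta + err * (phi.T @ P)                                 # (6.21)
    Q = Q + ((z * z - (theta @ phi).item() * z) - Q) / k              # (6.23)
    return theta, P, Q

# With theta_0 = 0 and P_0 very large, the recursion converges to the
# batch least-squares solution for data z = theta_true . phi + noise.
rng = np.random.default_rng(0)
theta_true = np.array([[1.5, -0.7]])
theta, P, Q = np.zeros((1, 2)), 1e6 * np.eye(2), 1.0
for k in range(1, 501):
    phi = rng.standard_normal(2)
    z = (theta_true @ phi).item() + 0.1 * rng.standard_normal()
    theta, P, Q = rls_step(theta, P, Q, phi, z, k)
```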


Let us ignore the expectations for now. Initializing \theta_0 = 0, Q_0 = I, and P_0 very large, it is easy to show that, after a few iterations, the estimates of \theta_k will rapidly converge to the exact values obtained by the least-squares solution. The estimate of Q will converge to the correct values plus a bias incurred by the fact that the early estimates of Q were based on residuals from \theta_k rather than \lim_{k \to \infty} \theta_k. P_k is a recursive estimate of \left( \sum_{j=1}^{k} \langle \Phi\Phi^\top \rangle_j \right)^{-1}, obtained by using the matrix inversion lemma.

There is an important way in which this on-line algorithm is an approximation to the batch EM algorithm we have described for nonlinear state-space models. The expectations \langle \cdot \rangle_k in the on-line algorithm are computed by running a single step of the extended Kalman filter using the previous parameters \theta_{k-1}. In the batch EM algorithm, the expectations are computed by running an extended Kalman smoother over the entire sequence using the current parameter estimate. Moreover, these expectations are used to re-estimate the parameters, the smoother is then re-run, the parameters are re-re-estimated, and so on, to perform the usual iterations of EM. In general, we can expect that, unless the time series is nonstationary, the parameter estimates obtained by the batch algorithm after convergence will model the data better than those obtained by the on-line algorithm.

Interestingly, the updates for the RLS on-line algorithm described here are very similar to the parameter updates used in a dual extended Kalman filter approach to system identification [45] (see Chapter 5 and Section 6.5.5).
This similarity is not coincidental, since, as mentioned, the Kalman filter can be derived as a generalization of the RLS algorithm. In fact, this similarity can be exploited in an elegant manner to derive an on-line algorithm for parameter estimation in nonstationary nonlinear dynamical systems.

6.4.3 Nonstationarity

To handle nonstationary time series, we assume that the parameters \theta can drift according to a Gaussian random walk with covariance \Sigma_\theta:

\theta_k = \theta_{k-1} + \epsilon_k, where \epsilon_k \sim \mathcal{N}(0, \Sigma_\theta).

As before, we have the following function relating the z variables to the parameters \theta and nonlinear kernels \Phi:

z_k = \theta_k \Phi_k + w_k, where w_k \sim \mathcal{N}(0, Q),


which we can view as the observation model for a "state variable" \theta_k with time-varying "output matrix" \Phi_k. Since both the dynamics and observation models are linear in \theta and the noise is Gaussian, we can apply the following Kalman filter to recursively compute the distribution of the drifting parameters \theta:

\hat{\theta}_k = \hat{\theta}_{k-1} + \frac{\left( \langle z\Phi \rangle_k - \hat{\theta}_{k-1} \langle \Phi\Phi^\top \rangle_k \right) P_{k|k-1}}{Q_{k-1} + \langle \Phi^\top P_{k|k-1} \Phi \rangle_k},   (6.24)

P_{k|k-1} = P_{k-1} + \Sigma_\theta,   (6.25)

P_k = P_{k|k-1} - \frac{P_{k|k-1} \langle \Phi\Phi^\top \rangle_k P_{k|k-1}}{Q_{k-1} + \langle \Phi^\top P_{k|k-1} \Phi \rangle_k},   (6.26)

Q_k = Q_{k-1} + \lambda \left( \langle z^2 \rangle_k - \hat{\theta}_k \langle \Phi z \rangle_k - Q_{k-1} \right).   (6.27)

There are two important things to note. First, these equations describe an ordinary Kalman filter, except that both the "output" z and "output matrix" \Phi_k are jointly uncertain with a Gaussian distribution. Second, we have also assumed that the output noise covariance can drift, by introducing a forgetting factor \lambda in its re-estimation equation. As before, the expectations are computed by running one step of the EKF over the hidden variables using \theta_{k-1}.

While we derived this on-line algorithm starting from the batch EM algorithm, what we have ended up with appears almost identical to the dual extended Kalman filter (discussed in Chapter 5). Indeed, we have two Kalman filters – one extended and one ordinary – running in parallel, estimating the hidden states and parameters, respectively.

We can also view this on-line algorithm as an approximation to the Bayesian posterior over parameters and hidden variables.
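As with RLS, the drifting-parameter filter (6.24)-(6.27) is easy to sketch for a scalar output. Again this is a point-observation sketch (expectations replaced by observed values); the small positive floor on Q is a practical safeguard added here to keep the innovation variance positive, and is not part of the equations above.

```python
import numpy as np

def drift_kf_step(theta, P, Q, phi, z, Sigma_theta, lam):
    # Point-observation sketch of Eqs. (6.24)-(6.27): a Kalman filter over
    # randomly drifting parameters, with forgetting factor lam on the
    # output-noise estimate.
    phi = phi.reshape(-1, 1)
    P_pred = P + Sigma_theta                                  # (6.25)
    s = Q + (phi.T @ P_pred @ phi).item()                     # innovation variance
    err = z - (theta @ phi).item()
    theta = theta + err * (phi.T @ P_pred) / s                # (6.24)
    P = P_pred - (P_pred @ phi @ phi.T @ P_pred) / s          # (6.26)
    Q = max(Q + lam * (z * z - (theta @ phi).item() * z - Q),
            1e-3)                                             # (6.27), floored
    return theta, P, Q

# Track a parameter vector that drifts as a Gaussian random walk.
rng = np.random.default_rng(1)
theta_true = np.array([[1.0, -1.0]])
theta, P, Q = np.zeros((1, 2)), np.eye(2), 1.0
Sigma_theta, lam = 1e-4 * np.eye(2), 0.02
for _ in range(2000):
    theta_true = theta_true + 0.01 * rng.standard_normal((1, 2))
    phi = rng.standard_normal(2)
    z = (theta_true @ phi).item() + 0.1 * rng.standard_normal()
    theta, P, Q = drift_kf_step(theta, P, Q, phi, z, Sigma_theta, lam)
```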
The true posterior would be some complicated distribution over the hidden variables x, the outputs z, and the parameters \theta. Here, we have recursively approximated it with two independent Gaussians – one over (x, z) and one over \theta. The approximated posterior for \theta_k has mean \hat{\theta}_k and covariance P_k.

6.4.4 Using Bayesian Methods for Model Selection and Complexity Control

Like any other maximum-likelihood procedure, the EM algorithm described in this chapter has the potential to overfit the data set – that is, to find spurious patterns in noise in the data, thereby generalizing


poorly. In our implementation, we used some ridge regression, that is, a weight-decay regularizer on the h_i parameters, which seemed to work well in practice but required some heuristics for setting regularization parameters. (Although, as mentioned previously, integrating over the hidden variables acts as a sort of modulated input noise, and so, in effect, performs ridge regression, which can eliminate the need for explicit regularization.)

A second, closely related problem faced by maximum-likelihood methods is that there is no built-in procedure for doing model selection. That is, the value of the maximum of the likelihood is not a suitable way to choose between different model structures. For example, consider the problems of choosing the dimensionality of the state space x and choosing the number of basis functions I. Higher dimensions of x and more basis functions should always, in principle, result in higher maxima of the likelihood, which means that more complex models will always be preferred to simpler ones. But this, of course, leads to overfitting.

Bayesian methods provide a very general framework for simultaneously handling the overfitting and model selection problems in a consistent manner. The key idea of the Bayesian approach is to avoid maximization wherever possible. Instead, possible models, structures, parameters – in short, all settings of unknown quantities – should be weighted by their posterior probabilities, and predictions should be made according to this weighted posterior.

For our nonlinear dynamical system, we can, for example, treat the parameters \theta as an unknown.
Then the model's prediction of the output at time k + 1 is

p(y_{k+1} \mid u_{1:k+1}, y_{1:k}) = \int d\theta \; p(y_{k+1} \mid u_{k+1}, y_{1:k}, u_{1:k}, \theta) \, p(\theta \mid y_{1:k}, u_{1:k})
= \int d\theta \; p(\theta \mid y_{1:k}, u_{1:k}) \int dx_{k+1} \; p(y_{k+1} \mid u_{k+1}, x_{k+1}, \theta) \, p(x_{k+1} \mid u_{1:k+1}, y_{1:k}, \theta),

where the first integral on the last line is over the posterior distribution of the parameters and the second integral is over the posterior distribution of the hidden variables.

The posterior distribution over parameters can be obtained recursively from Bayes' rule:

p(\theta \mid y_{1:k}, u_{1:k}) = \frac{p(y_k \mid u_{1:k}, y_{1:k-1}, \theta) \, p(\theta \mid u_{1:k-1}, y_{1:k-1})}{p(y_k \mid u_{1:k}, y_{1:k-1})}.
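For intuition, in the special case of a linear-Gaussian observation model with known noise variance, this recursion has a closed form, and applying it one observation at a time yields exactly the same posterior as conditioning on all the data at once. The scalar model and names below are illustrative, not the chapter's model.

```python
import numpy as np

# Recursive Bayes for a toy model y_k = theta * u_k + noise, noise ~ N(0, r),
# with conjugate Gaussian prior theta ~ N(m0, v0). Each step multiplies the
# current Gaussian posterior by one Gaussian likelihood and renormalizes.
def posterior_update(m, v, u, y, r):
    v_new = 1.0 / (1.0 / v + u * u / r)      # precisions add
    m_new = v_new * (m / v + u * y / r)
    return m_new, v_new

rng = np.random.default_rng(2)
r, m0, v0 = 0.25, 0.0, 10.0
m, v = m0, v0
us = rng.standard_normal(50)
ys = 2.0 * us + np.sqrt(r) * rng.standard_normal(50)
for u, y in zip(us, ys):
    m, v = posterior_update(m, v, u, y, r)

# Batch posterior from conditioning on all 50 observations at once:
v_batch = 1.0 / (1.0 / v0 + np.sum(us * us) / r)
m_batch = v_batch * (m0 / v0 + np.sum(us * ys) / r)
```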


The dual extended Kalman filter, the joint extended Kalman filter, and the nonstationary on-line algorithm from Section 6.4.3 are all coarse approximations to these Bayesian recursions.

The above equations are all implicitly conditioned on some choice of model structure s_m, that is, the dimension of x and the number of basis functions. Although the Bayesian modeling philosophy advocates averaging the predictions of different model structures, if necessary it is also possible to use Bayes' rule to choose between model structures according to their probabilities:

P(s_m \mid y_{1:k}, u_{1:k}) = \frac{p(y_{1:k} \mid u_{1:k}, s_m) \, P(s_m)}{\sum_n p(y_{1:k} \mid u_{1:k}, s_n) \, P(s_n)}.

Tractable approximations to the required integrals can be obtained in several ways. We highlight three ideas, without going into much detail; an adequate solution to this problem for nonlinear dynamical systems requires further research. The first idea is the use of Markov-chain Monte Carlo (MCMC) techniques to sample over both parameters and hidden variables. Sampling can be an efficient way of computing high-dimensional integrals if the samples are concentrated in regions where parameters and states have high probability. MCMC methods such as Gibbs sampling have been used for linear dynamical systems [46, 47], while a promising method for nonlinear systems is particle filtering [18, 24], in which samples ("particles") can be used to represent the joint distribution over parameters and hidden states at each time step. The second idea is the use of so-called "automatic relevance determination" (ARD [48, 49]).
This consists of using a zero-mean Gaussian prior on each parameter, with tunable variances. Since these variances are parameters that control the prior distribution of the model parameters, they are referred to as hyperparameters. Optimizing these variance hyperparameters while integrating over the parameters results in "irrelevant" parameters being eliminated from the model. This occurs when the variance controlling a particular parameter goes to zero. ARD for RBF networks with a center on each data point has been used successfully by Tipping [50] for nonlinear regression, and given the name "relevance vector machine" in analogy to support vector machines. The third idea is the use of variational methods to lower-bound the model structure posterior probabilities. In exactly the same way that EM can be thought of as forming a lower bound on the likelihood using a distribution over the hidden variables, variational Bayesian methods lower-bound the evidence using a distribution over both hidden variables and parameters. Variational Bayesian methods have been used in [51] to infer the structure of linear


dynamical systems, although the generalization to nonlinear systems of the kind described in this chapter is not straightforward.

Of course, in principle, the Bayesian approach would advocate averaging over all possible choices of c_i, S_i, I, Q, etc. It is easy to see how this can rapidly get very unwieldy.

6.5 DISCUSSION

6.5.1 Identifiability and Expressive Power

As we saw from the experiments described above, the algorithm that we have presented is capable of learning good density models for a variety of nonlinear time series. Specifying the class of nonlinear systems that our algorithm can model well defines its expressive power. A related question is: What is the ability of this model, in principle, to recover the actual parameters of specific nonlinear systems? This is the question of model identifiability. These two questions are intimately tied, since they both describe the mapping between actual nonlinear systems and model parameter settings.

There are three trivial degeneracies that make our model technically unidentifiable, but they should not concern us. First, it is always possible to permute the dimensions of the state space and, by permuting the domain of the output mapping and dynamics in the corresponding fashion, obtain an exactly equivalent model. Second, the state variables can be rescaled or, in fact, transformed by any invertible linear mapping. This transformation can be absorbed by the output and dynamics functions, yielding a model with identical input–output behavior.
Without loss of generality, we always set the covariance of the state evolution noise to be the identity matrix, which both sets the scale of the state space and disallows certain state transformations, without reducing the expressive power of the model. Third, we take the observation noise to be uncorrelated with the state noise and both noises to be zero-mean, since, again without loss of generality, these can be absorbed into the f and g functions. [11]

There exist other forms of unidentifiability that are more difficult to overcome. For example, if both f and g are nonlinear, then (at least in the noise-free case), for any arbitrary invertible transformation of the state,

11 Imagine that the joint noise covariance was nonzero: \langle w_k v_k^\top \rangle = S. Replacing A with A' = A - S R^{-1} C gives a new noise process w' with covariance Q' = Q - S R^{-1} S^\top that is uncorrelated with v, leaving the input–output behavior invariant. Similarly, any nonzero noise means can be absorbed into the b terms in the functions f and g.


there exist transformations of f and g that result in identical input–output behavior. In this case, it would be very hard to detect that the recovered model is indeed a faithful model of the actual system, since the estimated and actual states would appear to be unrelated.

Clearly, not all systems can be modeled by assuming that f is linear and g is nonlinear. Similarly, not all systems can be modeled by assuming that f is nonlinear and g is linear. For example, consider the case where the observations y_k and y_{k+n} are statistically independent, but each observation lies on a curved low-dimensional manifold in a high-dimensional space. Modeling this would require a nonlinear g, as in nonlinear factor analysis, but an f = 0. Therefore, choosing either f or g to be linear restricts the expressive power of the model.

Unlike the state noise covariance Q, assuming that the observation noise covariance R is diagonal does restrict the expressive power of the model. This is easy to see for the case where the dimension of the state space is small and the dimension of the observation vector is large. A full covariance R can capture all correlations between observations at a single time step, while a diagonal R cannot.

For nonlinear dynamical systems, the Gaussian-noise assumption is not as restrictive as it may initially appear. This is because the nonlinearity can be used to turn Gaussian noise into non-Gaussian noise [6].

Of course, we have restricted our expressive power by using an RBF network, especially one in which the centers and widths of the RBFs are fixed.
One could try to appeal to universal approximation theorems to make the claim that one could, in principle, model any nonlinear dynamical system. But this would be misleading in the light of the noise assumptions and the fact that only a finite, and usually small, number of RBFs are going to be used in practice.

6.5.2 Embedded Flows

There are two ways to think about the dynamical models we have investigated, shown in Figure 6.12. One is as a nonlinear Markov process (flow) x_k that has been embedded (or potentially projected) into a manifold y_k. From this perspective, the function f controls the evolution of the stochastic process, and the function g specifies the nonlinear embedding (or projection) operation. [12]

12 To simplify presentation, we shall neglect driving inputs u_k in this section, although the arguments extend as well to systems with inputs.


Figure 6.12 Two interpretations of the graphical model for stochastic (non)linear dynamical systems (see text). (a) A Markov process embedded in a manifold. (b) Nonlinear factor analysis through time.

Another way to think of the same model is as a nonlinear version of a latent-variable model such as factor analysis (but possibly with external inputs as well), in which the latent variables or factors evolve through time rather than being drawn independently for each observation. The nonlinear factor analysis model is represented by g, and the time evolution of the latent variables by f.

If the state space is of lower dimension than the observation space and the observation noise is additive, then a useful geometrical intuition applies. In such cases, we have observed a flow inside an embedded manifold. The observation function g specifies the structure (shape) of the manifold, while the dynamics f specifies the flow within the manifold. Armed with this intuition, the learning problem looks as if it might be decoupled into two separate stages: first, find the manifold by doing some sort of density modeling on the collection of observed outputs (ignoring their time order); second, find the flow (dynamics) by projecting the observations into the manifold and doing nonlinear regression from one time step to the next. This intuition is partly true, and indeed provides the basis for many of the practical and effective initialization schemes that we have tried.
However, the crucial point as far as the design of learning algorithms is concerned is that the two learning problems interact in a way that makes the problem easier. Once we know something about the dynamics, this information gives us some prior knowledge when trying to learn the manifold shape. For example, if the dynamics suggest that the next state will be near a certain point, we can use this information to do better than naive projection when we locate a noisy observation on the manifold. Conversely, knowing something about the manifold allows us to estimate the dynamics more effectively.
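Setting aside this interaction, the two-stage decoupling can be sketched on a linear example, with PCA standing in for the density-modeling step and linear least squares for the regression step (both are illustrative simplifications of the more general tools discussed in the text):

```python
import numpy as np

# Two-stage initialization sketch: (1) model the manifold by PCA on the
# outputs, ignoring time order; (2) fit dynamics by regressing projected
# states at time k+1 on those at time k.
rng = np.random.default_rng(3)
A_true = np.array([[0.95, -0.2], [0.2, 0.95]])   # stable, rotation-like
C_true = rng.standard_normal((10, 2))            # embedding into 10-d outputs
x = np.zeros(2)
Y = []
for _ in range(500):
    x = A_true @ x + 0.1 * rng.standard_normal(2)
    Y.append(C_true @ x + 0.01 * rng.standard_normal(10))
Y = np.array(Y)

# Stage 1: find the 2-d principal subspace of the outputs (the manifold).
U, s, Vt = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
proj = (Y - Y.mean(0)) @ Vt[:2].T                # projected states

# Stage 2: estimate dynamics by least squares on consecutive projections.
# The estimate is similar (in the matrix sense) to A_true, so its
# eigenvalues match those of the true dynamics.
A_hat, *_ = np.linalg.lstsq(proj[:-1], proj[1:], rcond=None)
A_hat = A_hat.T
```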


We discuss separately two special cases of flows in manifolds: systems with linear output functions but nonlinear dynamics, and systems with linear dynamics but a nonlinear output function.

When the output function g is linear and the dynamics f is nonlinear (Fig. 6.13), the observed sequence forms a nonlinear flow in a linear subspace of the observation space. The manifold estimation is made easier, even with high levels of observation noise, by the fact that its shape is known to be a hyperplane. All that is required is to find its orientation and the character of the output noise. Time-invariant analysis of the observations by algorithms such as factor analysis is an excellent way to initialize estimates of the hyperplane and noises. However, during learning, we may have cause to tilt the hyperplane to make the dynamics fit better, or conversely cause to modify the dynamics to make the hyperplane model better.

This setting is actually more expressive than it might seem initially. Consider a nonlinear output function g(x) that is "invertible" in the sense that it can be written in the form g(x) = C \tilde{g}(x) for invertible \tilde{g} and non-square matrix C. Any such nonlinear output function can be made strictly linear if we transform to a new state variable \tilde{x}:

\tilde{x} = \tilde{g}(x) \;\Rightarrow\; \tilde{x}_{k+1} = \tilde{f}(\tilde{x}_k, w_k) = \tilde{g}\big( f(\tilde{g}^{-1}(\tilde{x}_k)) + w_k \big),   (6.28a)
y_k = C \tilde{x}_k + v_k = g(x_k) + v_k,   (6.28b)

Figure 6.13 Linear and nonlinear dynamical systems represent flow fields embedded in manifolds. For systems with linear output functions, such as the one illustrated, the manifold is a hyperplane, while the dynamics may be complex. For systems with nonlinear output functions, the shape of the embedding manifold is also curved.


which gives an equivalent model, but with a purely linear output process and potentially nonadditive dynamics noise.

For nonlinear output functions g paired with linear dynamics f, the observation sequence forms a linear flow in a nonlinear manifold:

x_{k+1} = A x_k + w_k,   (6.29a)
y_k = g(x_k) + v_k.   (6.29b)

The manifold learning is harder now, because we must estimate a thin, curved subspace of the observation space in the presence of noise. However, once we have learned this manifold approximately, we can project the observations into it and learn only linear dynamics. The win comes from the following fact: in the locations where the projected dynamics do not look linear, we know that we should bend the manifold to make the dynamics more linear. Thus, not only the shape of the outputs (ignoring time) but also the linearity of the dynamics gives us clues to learning the manifold.

6.5.3 Stability

Stability is a key issue in the study of any dynamical system. Here we have to consider stability at two levels: the stability of the learning procedure, and the stability of the learned nonlinear dynamical system.

Since every step of the EM algorithm is guaranteed to increase the log-likelihood until convergence, it has a built-in Lyapunov function for stable learning. However, as we have pointed out, our use of extended Kalman smoothing in the E-step of the algorithm represents an approximation to the exact E-step, and therefore we have to forego any guarantees of stability of learning. While we rarely had problems with stability of learning, this is sure to be problem-specific, depending both on the quality of the EKS approximation and on how close the true system dynamics is to the boundary of stability.
In contrast to the EKS approximations, certain variational approximations [29] transform the intractable Lyapunov function into a tractable one, and therefore preserve stability of learning. It is not clear how to apply these variational approximations to nonlinear dynamics, although this would clearly be an interesting area of research.

Stability of the learned nonlinear dynamical system can be analyzed by making use of some linear systems theory. We know that, for discrete-time linear dynamical systems, if all eigenvalues of the A matrix lie inside the unit circle, then the system is globally stable. The nonlinear dynamics of


our RBF network f can be decomposed into two parts (cf. Eq. (6.13)): a linear component given by A, and a nonlinear component given by \sum_i h_i \rho_i(x). Clearly, for the system to be globally stable, A has to satisfy the eigenvalue criterion for linear systems. Moreover, if the RBF coefficients for both f and g have bounded norm (i.e., \max_i |h_i| < \bar{h}) and the RBFs are bounded, with \min_i \det(S_i) > s_{\min} > 0 and \max_{ij} |c_i - c_j| < \bar{c}, then the nonlinear system is stable in the following sense. The conditions on S_i and \bar{h} mean that

\left\| \sum_i h_i \rho_i(x) \right\| < I \bar{h} (2\pi)^{-d/2} s_{\min}^{-1/2} \equiv \kappa.

Therefore, the noise-free nonlinear component of the dynamics alone will always maintain the state within a sphere of radius \kappa around the centers. So, if the linear component is stable, then for any sequence of bounded inputs, the output sequence of the noise-free system will be bounded. Intuitively, although unstable behavior might occur in the region of RBF support, once x leaves this region, it is drawn back in by A.

For the on-line EM learning algorithm, the hidden state dynamics and the parameter re-estimation dynamics will interact, and therefore a stability analysis would be quite challenging. However, since there is no stability guarantee for the batch EKS-EM algorithm, it seems very unlikely that a simple form of the on-line algorithm could be provably stable.

6.5.4 Takens' Theorem and Hidden States

It has long been known that, for linear systems, there is an equivalence between so-called state-space formulations involving hidden variables and direct vector autoregressive models of the time series.
In 1980, Takens proved a remarkable theorem [52] that tells us that, for almost any deterministic nonlinear dynamical system with a d-dimensional state space, the state can be effectively reconstructed by observing 2d + 1 time lags of any one of its outputs. In particular, Takens showed that such a lag vector will be a smooth embedding (diffeomorphism) of the true state, if one exists. This notion of finding an "embedding" for the state has been used to justify a nonlinear regression approach to learning nonlinear dynamical systems. That is, if you suspect that the system is nonlinear and that it has d state dimensions, then instead of building a state-space model, you can do away with representing states and just build an autoregressive (AR) model directly on the observations that nonlinearly relates previous


outputs and the current output. (Chapter 4 discusses the case of chaotic dynamics.) This view begs the question: Do we need our models to have hidden states at all?

While no constructive realization of Takens' theorem exists in general, there are very strong results for linear systems. For purely linear systems, we can appeal to the Cayley–Hamilton theorem [13] to show that the hidden state can always be eliminated to obtain an equivalent vector autoregressive model by taking only d time lags of the output. Furthermore, there is a construction that allows this conversion to be performed explicitly. [14] Takens' theorem offers us a similar guarantee for the elimination of hidden states in nonlinear dynamical systems, as long as we take 2d + 1 output lags. (However, no similar recipe exists for explicitly converting to an autoregressive form.) These results appear to make hidden states unnecessary.

The problem with this view is that it does not generalize well to many realistic high-dimensional and noisy scenarios. Consider the example mentioned in the introduction. While it is mathematically true that the pixels in the video frame of a balloon floating in the wind are a (highly nonlinear) function of the pixels in the previous video frames, it would be ludicrous from the modeling perspective to build an AR model of the video images. This would require a number of parameters of the order of the number of pixels squared. Furthermore, unlike the noise-free case of Takens' theorem, when the dynamics are noisy, the optimal prediction of the observation would have to depend on the entire history of past observations.
Any truncation of this history throws away potentially valuable information about the unobserved state. The state-space formulation of nonlinear dynamical systems allows us to overcome both of these limitations of nonlinear autoregressive models. That is, it allows us to have compact representations of dynamics, and to integrate uncertain information over time. The price paid for this is that it requires inference over the hidden state.

13 Any square matrix A of size n satisfies its own characteristic equation. Equivalently, any matrix power A^m for m \ge n can be written as a linear combination of the lower matrix powers I, A, A^2, \ldots, A^{n-1}.

14 Start with the system x_{k+1} = A x_k + w_k, y_k = C x_k + v_k. Create a d-dimensional lag vector z_k = [y_k, y_{k+1}, \ldots, y_{k+d-1}] that holds the current and d - 1 future outputs. Write z_k = \Gamma x_k + n_k for \Gamma = [C, CA, CA^2, \ldots, CA^{d-1}] (stacked) and Gaussian noise n (although with nondiagonal covariance). The Cayley–Hamilton theorem assures us that \Gamma is full rank, and thus we need not take any more lags. Given the lag vector z_k, we can solve the system z_k = \Gamma x_k for x_k; write this solution as \Gamma^{+} z_k. Using the original observation equation d times to solve for y_k, \ldots, y_{k+d-1} in terms of z_k, we can write an autoregression for z_k as z_{k+1} = \Gamma A \Gamma^{+} z_k + m_k for Gaussian noise m.
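The construction in footnote 14 is easy to check numerically in the noise-free case; the following is a small sketch with an arbitrarily chosen stable pair (A, C):

```python
import numpy as np

# Noise-free check of the lag-vector construction: with Gamma built by
# stacking C, CA, ..., CA^(d-1), the lag vector obeys
# z_{k+1} = Gamma A Gamma^+ z_k exactly (Gamma^+ is the pseudoinverse).
rng = np.random.default_rng(5)
d = 3
A = 0.9 * np.linalg.qr(rng.standard_normal((d, d)))[0]   # stable (scaled orthogonal)
C = rng.standard_normal((1, d))                          # scalar output

Gamma = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(d)])
assert np.linalg.matrix_rank(Gamma) == d                 # full rank: no more lags needed

# Simulate the noise-free system and form lag vectors z_k = [y_k, ..., y_{k+d-1}].
x = rng.standard_normal(d)
ys = []
for _ in range(20):
    ys.append((C @ x).item())
    x = A @ x
z = np.array([ys[k:k + d] for k in range(20 - d)])

T = Gamma @ A @ np.linalg.pinv(Gamma)                    # autoregression matrix
assert np.allclose(z[1:], z[:-1] @ T.T)
```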


6.5.5 Should Parameters and Hidden States Be Treated Differently?

The maximum-likelihood framework on which the EM algorithm is based makes a distinction between parameters and hidden variables: it attempts to integrate over hidden variables to maximize the likelihood as a function of parameters. This leads to the two-step approach, which computes sufficient statistics over the hidden variables in the E-step and optimizes parameters in the M-step. In contrast, a fully Bayesian approach to learning nonlinear dynamical state-space models would treat both hidden variables and parameters as unknown and attempt to compute or approximate the joint posterior distribution over them – in effect, integrating over both.

It is important to compare these approaches to system identification with more traditional ones. We highlight two such approaches: joint EKF approaches and dual EKF approaches.

In joint EKF approaches [7, 8], an augmented hidden state space is constructed that comprises the original hidden state space and the parameters. Since parameters and hidden states interact, even for linear dynamical systems this approach results in nonlinear dynamics over the augmented hidden states. Initializing a Gaussian prior distribution both over parameters and over states, an extended Kalman filter is then used to recursively update the joint distribution over states and parameters based on the observations, p(X, θ | Y).
This approach has the advantage that it can model uncertainties in the parameters and correlations between parameters and hidden variables. In fact, this approach treats parameters and state variables completely symmetrically, and can be thought of as iteratively implementing a Gaussian approximation to the recursive Bayes' rule computations. Nonstationarity can easily be built in by giving the parameters (e.g., random-walk) dynamics. Although it has some very appealing properties, this approach is known to suffer from instability problems, which is why dual EKF approaches have been proposed.

In dual EKF approaches (see Chapter 5), two interacting but distinct extended Kalman filters run simultaneously. One computes a Gaussian approximation of the state posterior given a parameter estimate and the observations, p(X | θ̂_old, Y), while the other computes a Gaussian approximation of the parameter posterior given the estimated states, p(θ | X̂_old, Y). The two EKFs interact by each feeding its estimate (i.e., the posterior means X̂ and θ̂) into the other. One can think of the dual EKF as performing approximate coordinate ascent in p(X, θ | Y) by iteratively


maximizing p(X | θ̂_old, Y) and p(θ | X̂_old, Y) under the assumption that each conditional is Gaussian. Since the only interaction between parameters and hidden variables occurs through their respective means, the procedure has the flavor of mean-field methods in physics and neural networks [53]. Like these methods, it is also likely to suffer from the overconfidence problem – namely, since the parameter estimate does not take into account the uncertainty in the states, the parameter covariance will be overly narrow, and likewise for the states.

For large systems, both joint and dual EKF methods suffer from the fact that the parameter covariance matrix is quadratic in the number of parameters. This problem is more pronounced for the joint EKF, since it considers the concatenated state space. Furthermore, both joint and dual EKF methods rely on Gaussian approximations to parameter distributions. This can sometimes be problematic – for example, consider retaining positive-definiteness of a noise covariance matrix under the assumption that its parameters are Gaussian-distributed.

6.6 CONCLUSIONS

This chapter has brought together two classic algorithms – one from statistics and another from systems engineering – to address the learning of stochastic nonlinear dynamical systems. We have shown that by pairing the extended Kalman smoothing algorithm for approximate state estimation in the E-step with a radial basis function learning model that permits exact analytic solution of the M-step, the EM algorithm is capable of learning a nonlinear dynamical model from data.
As a side-effect, we have derived an algorithm for training a radial basis function network to fit data in the form of a mixture of Gaussians. We have also derived an on-line version of the algorithm and a version for dealing with nonstationary time series.

We have demonstrated the algorithm on a series of synthetic and realistic nonlinear dynamical systems, and have shown that it is able to learn accurate models from only observations of inputs and outputs. Initialization of model parameters and placement of the radial basis kernels are important to the practical success of the algorithm. We have discussed techniques for making these choices, and have provided gradient rules for adapting the centers and widths of the basis functions.

The main strength of our algorithm is that, by making a specific choice of nonlinear estimator (Gaussian radial basis networks), we are able to exactly account for the uncertain state estimates generated during inference. Furthermore, the parameter-update procedures still only require the


solution of systems of linear equations. However, we rely on the standard, but potentially inaccurate, extended Kalman smoother for approximate inference. For certain problems where local linearization is an extremely poor approximation, greater accuracy may be achieved using other approximate inference techniques such as the unscented filter (see Chapter 7). Another area worthy of further investigation is how to initialize the parameters more effectively when the data lie on a nonlinear manifold; in these cases, factor analysis provides an inadequate static model.

The belief network literature has recently been dominated by two methods for approximate inference: Markov-chain Monte Carlo [54] and variational approximations [29]. To the best of our knowledge, [36] and [45] were the first instances where extended Kalman smoothing was used to perform approximate inference in the E-step of EM. While EKS does not have the theoretical guarantees of variational methods (which are also approximate, but monotonically optimize a computable objective function during learning), its simplicity has gained it wide acceptance in the estimation and control literatures as a method for doing inference in nonlinear dynamical systems. Our practical success in modeling a variety of nonlinear time series suggests that the combination of extended Kalman algorithms and the EM algorithm can provide powerful tools for learning nonlinear dynamical systems.

ACKNOWLEDGMENTS

The authors acknowledge support from the Gatsby Charitable Fund, and thank Simon Haykin for helpful comments that helped to clarify the presentation. S.R.
is supported in part by a National Science Foundation PDF Fellowship and a grant from the Natural Sciences and Engineering Research Council of Canada.

APPENDIX: EXPECTATIONS REQUIRED TO FIT THE RBFs

The expectations that we need to compute for Eq. (6.78) are $\langle x\rangle_j$, $\langle z\rangle_j$, $\langle xx^\top\rangle_j$, $\langle zz^\top\rangle_j$, $\langle xz^\top\rangle_j$, $\langle \rho_i(x)\rangle_j$, $\langle x\rho_i(x)\rangle_j$, $\langle z\rho_i(x)\rangle_j$, and $\langle \rho_i(x)\rho_l(x)\rangle_j$. Starting with some of the easier ones that do not depend on the RBF kernel $\rho$, we have

$$\langle x\rangle_j = \mu_j^x, \qquad \langle z\rangle_j = \mu_j^z,$$
$$\langle xx^\top\rangle_j = \mu_j^x \mu_j^{x,\top} + C_j^{xx}, \qquad \langle zz^\top\rangle_j = \mu_j^z \mu_j^{z,\top} + C_j^{zz},$$
$$\langle xz^\top\rangle_j = \mu_j^x \mu_j^{z,\top} + C_j^{xz}.$$
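The remaining expectations all hinge on the product of the Gaussian RBF kernel with the Gaussian $n_j$, whose normalizing constant $\beta_{ij}$ is derived below. In one dimension (where the block matrices collapse to scalars) the closed form can be sanity-checked against numerical integration; the kernel center, widths, and posterior moments used here are assumed purely for illustration:

```python
import numpy as np

# 1-D check (z absent, d_x = 1): rho_i is a normalized Gaussian bump at c_i
# with variance S_i; n_j is a Gaussian density with mean mu_j, variance C_j.
c_i, S_i = 0.5, 0.3     # assumed kernel center and width
mu_j, C_j = -0.2, 0.8   # assumed posterior mean and variance

# Closed-form quantities from the appendix, specialized to one dimension:
C_ij = 1.0 / (1.0 / C_j + 1.0 / S_i)
mu_ij = C_ij * (mu_j / C_j + c_i / S_i)
delta_ij = c_i**2 / S_i + mu_j**2 / C_j - mu_ij**2 / C_ij
beta_ij = ((2 * np.pi) ** -0.5
           * (S_i * C_j / C_ij) ** -0.5
           * np.exp(-0.5 * delta_ij))

# Numerical check: <rho_i(x)>_j = integral of rho_i(x) * n_j(x) dx.
x = np.arange(-10.0, 10.0, 0.001)
rho_i = np.exp(-0.5 * (x - c_i) ** 2 / S_i) / np.sqrt(2 * np.pi * S_i)
n_j = np.exp(-0.5 * (x - mu_j) ** 2 / C_j) / np.sqrt(2 * np.pi * C_j)
numeric = float(np.sum(rho_i * n_j) * 0.001)  # Riemann sum on a fine grid
assert abs(numeric - beta_ij) < 1e-7
```

The same value also follows from the standard Gaussian-product identity $\int \mathcal{N}(x; c_i, S_i)\,\mathcal{N}(x; \mu_j, C_j)\,dx = \mathcal{N}(c_i; \mu_j, S_i + C_j)$, which is a useful cross-check when debugging the multivariate block formulas.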


Observe that when we multiply the Gaussian RBF kernel $\rho_i(x)$ (Eq. (6.14)) and $n_j$, we get a Gaussian density over $(x, z)$ with mean and covariance

$$\mu_{ij} = C_{ij}\left( C_j^{-1}\mu_j + \begin{bmatrix} S_i^{-1} c_i \\ 0 \end{bmatrix} \right), \qquad C_{ij} = \left( C_j^{-1} + \begin{bmatrix} S_i^{-1} & 0 \\ 0 & 0 \end{bmatrix} \right)^{-1},$$

and an extra constant (due to lack of normalization),

$$\beta_{ij} = (2\pi)^{-d_x/2}\, |S_i|^{-1/2}\, |C_j|^{-1/2}\, |C_{ij}|^{1/2} \exp\left(-\tfrac{1}{2}\delta_{ij}\right),$$

where

$$\delta_{ij} = c_i^\top S_i^{-1} c_i + \mu_j^\top C_j^{-1} \mu_j - \mu_{ij}^\top C_{ij}^{-1} \mu_{ij}.$$

Using $\beta_{ij}$ and $\mu_{ij}$, we can evaluate the other expectations:

$$\langle \rho_i(x)\rangle_j = \beta_{ij}, \qquad \langle x\rho_i(x)\rangle_j = \beta_{ij}\,\mu_{ij}^x, \qquad \langle z\rho_i(x)\rangle_j = \beta_{ij}\,\mu_{ij}^z.$$

Finally,

$$\langle \rho_i(x)\rho_l(x)\rangle_j = (2\pi)^{-d_x}\, |C_j|^{-1/2}\, |S_i|^{-1/2}\, |S_l|^{-1/2}\, |C_{ilj}|^{1/2} \exp\left(-\tfrac{1}{2}\gamma_{ilj}\right),$$

where

$$C_{ilj} = \left( C_j^{-1} + \begin{bmatrix} S_i^{-1} + S_l^{-1} & 0 \\ 0 & 0 \end{bmatrix} \right)^{-1},$$
$$\mu_{ilj} = C_{ilj}\left( C_j^{-1}\mu_j + \begin{bmatrix} S_i^{-1} c_i + S_l^{-1} c_l \\ 0 \end{bmatrix} \right),$$
$$\gamma_{ilj} = c_i^\top S_i^{-1} c_i + c_l^\top S_l^{-1} c_l + \mu_j^\top C_j^{-1} \mu_j - \mu_{ilj}^\top C_{ilj}^{-1} \mu_{ilj}.$$

REFERENCES

[1] R.E. Kalman and R.S. Bucy, "New results in linear filtering and prediction," Transactions of the ASME, Ser. D, Journal of Basic Engineering, 83, 95–108 (1961).


[2] L.E. Baum and J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, 73, 360–363 (1967).
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1–38 (1977).
[4] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[5] S.L. Lauritzen and D.J. Spiegelhalter, "Local computations with probabilities on graphical structures and their application to expert systems," Journal of the Royal Statistical Society, Ser. B, 50, 157–224 (1988).
[6] S. Roweis and Z. Ghahramani, "A unifying review of linear Gaussian models," Neural Computation, 11, 305–345 (1999).
[7] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. Cambridge, MA: MIT Press, 1983.
[8] G.C. Goodwin and K.S. Sin, Adaptive Filtering, Prediction and Control. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[9] J. Moody and C. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, 1, 281–294 (1989).
[10] D.S. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Systems, 2, 321–355 (1988).
[11] R.E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, 82, 35–45 (1960).
[12] H.E.
Rauch, "Solutions to the linear smoothing problem," IEEE Transactions on Automatic Control, 8, 371–372 (1963).
[13] R.E. Kopp and R.J. Orford, "Linear regression applied to system identification for adaptive control systems," AIAA Journal, 1, 2300–2306 (1963).
[14] H. Cox, "On the estimation of state variables and parameters for noisy dynamic systems," IEEE Transactions on Automatic Control, 9, 5–12 (1964).
[15] S. Singhal and L. Wu, "Training multilayer perceptrons with the extended Kalman algorithm," in Advances in Neural Information Processing Systems, Vol. 1. San Mateo, CA: Morgan Kaufmann, 1989, pp. 133–140.
[16] V. Kadirkamanathan and M. Niranjan, "A functional estimation approach to sequential learning with neural networks," Neural Computation, 5, 954–975 (1993).
[17] I.T. Nabney, A. McLachlan, and D. Lowe, "Practical methods of tracking of nonstationary time series applied to real-world data," in S.K. Rogers and D.W. Ruck, Eds., AeroSense '96: Applications and Science of Artificial Neural Networks II. SPIE Proceedings, 1996, pp. 152–163.


[18] J.E. Handschin and D.Q. Mayne, "Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering," International Journal of Control, 9, 547–559 (1969).
[19] N.J. Gordon, D.J. Salmond, and A.F.M. Smith, "A novel approach to nonlinear/non-Gaussian Bayesian state space estimation," IEE Proceedings F: Radar and Signal Processing, 140, 107–113 (1993).
[20] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," Journal of Computational and Graphical Statistics, 5, 1–25 (1996).
[21] M. Isard and A. Blake, "CONDENSATION – conditional density propagation for visual tracking," International Journal of Computer Vision, 29, 5–28 (1998).
[22] M. West, "Approximating posterior distributions by mixtures," Journal of the Royal Statistical Society, Ser. B, 54, 553–568 (1993).
[23] M. West, "Mixture models, Monte Carlo, Bayesian updating and dynamic models," Computing Science and Statistics, 24, 325–333 (1993).
[24] A. Doucet, J.F.G. de Freitas, and N.J. Gordon, Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag, 2000.
[25] S.J. Julier, J.K. Uhlmann, and H.F. Durrant-Whyte, "A new approach for filtering nonlinear systems," in Proceedings of the 1995 American Control Conference, Seattle, 1995, pp. 1628–1632.
[26] S.J. Julier and J.K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, Orlando, FL, 1997.
[27] E.A. Wan, R. van der Merwe, and A.T.
Nelson, "Dual estimation and the unscented transformation," in Advances in Neural Information Processing Systems, Vol. 12. Cambridge, MA: MIT Press, 1999.
[28] Z. Ghahramani and G.E. Hinton, "Variational learning for switching state-space models," Neural Computation, 12, 831–864 (2000).
[29] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul, "An introduction to variational methods in graphical models," Machine Learning, 37, 183–233 (1999).
[30] L.E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," Annals of Mathematical Statistics, 37, 1554–1563 (1966).
[31] R.M. Neal and G.E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in M.I. Jordan, Ed., Learning in Graphical Models. Dordrecht: Kluwer Academic, 1998, pp. 355–368.
[32] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics & Decisions, Supplement Issue 1, pp. 205–237 (1984).


[33] C.F. Chen, "The EM approach to the multiple indicators and multiple causes model via the estimation of the latent variables," Journal of the American Statistical Association, 76, 704–708 (1981).
[34] R.H. Shumway and D.S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," Journal of Time Series Analysis, 3, 253–264 (1982).
[35] Z. Ghahramani and G. Hinton, "Parameter estimation for linear dynamical systems," Technical Report CRG-TR-96-2, Department of Computer Science, University of Toronto, February 1996.
[36] Z. Ghahramani and S. Roweis, "Learning nonlinear dynamical systems using an EM algorithm," in Advances in Neural Information Processing Systems, Vol. 11. Cambridge, MA: MIT Press, 1999, pp. 431–437.
[37] J.F.G. de Freitas, M. Niranjan, and A.H. Gee, "Nonlinear state space estimation with neural networks and the EM algorithm," Technical Report, Cambridge University Engineering Department, 1999.
[38] T. Briegel and V. Tresp, "Fisher scoring and a mixture of modes approach for approximate inference and learning in nonlinear state space models," in Advances in Neural Information Processing Systems, Vol. 11. Cambridge, MA: MIT Press, 1999.
[39] Z. Ghahramani and G. Hinton, "Switching state-space models," Technical Report CRG-TR-96-3, Department of Computer Science, University of Toronto, July 1996.
[40] K. Murphy, "Switching Kalman filters," Technical Report, Department of Computer Science, University of California, Berkeley, August 1998.
[41] G.E. Hinton, P. Dayan, and M.
Revow, "Modeling the manifolds of images of handwritten digits," IEEE Transactions on Neural Networks, 8, 65–74 (1997).
[42] Z. Ghahramani and G. Hinton, "The EM algorithm for mixtures of factor analyzers," Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, May 1996 (revised February 1997).
[43] W.S. Torgerson, "Multidimensional scaling I. Theory and method," Psychometrika, 17, 401–419 (1952).
[44] S. Haykin, Adaptive Filter Theory, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.
[45] E.A. Wan and A.T. Nelson, "Dual Kalman filtering methods for nonlinear prediction," in Advances in Neural Information Processing Systems, Vol. 9. Cambridge, MA: MIT Press, 1997.
[46] C.K. Carter and R. Kohn, "On Gibbs sampling for state space models," Biometrika, 81, 541–553 (1994).
[47] S. Frühwirth-Schnatter, "Bayesian model discrimination and Bayes factors for linear Gaussian state space models," Journal of the Royal Statistical Society, Ser. B, 57, 237–246 (1995).


[48] D.J.C. MacKay, "Bayesian non-linear modelling for the prediction competition," ASHRAE Transactions, 100, 1053–1062 (1994).
[49] R.M. Neal, "Assessing relevance determination methods using DELVE," in C.M. Bishop, Ed., Neural Networks and Machine Learning. New York: Springer-Verlag, 1998, pp. 97–129.
[50] M.E. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems, Vol. 12. Cambridge, MA: MIT Press, 2000, pp. 652–658.
[51] Z. Ghahramani and M.J. Beal, "Propagation algorithms for variational Bayesian learning," in Advances in Neural Information Processing Systems, Vol. 13. Cambridge, MA: MIT Press, 2001.
[52] F. Takens, "Detecting strange attractors in turbulence," in D.A. Rand and L.-S. Young, Eds., Dynamical Systems and Turbulence, Warwick 1980, Lecture Notes in Mathematics, Vol. 898. Berlin: Springer-Verlag, 1981, pp. 365–381.
[53] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[54] R.M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.


7 THE UNSCENTED KALMAN FILTER

Eric A. Wan and Rudolph van der Merwe
Department of Electrical and Computer Engineering, Oregon Graduate Institute of Science and Technology, Beaverton, Oregon, U.S.A.

Kalman Filtering and Neural Networks, Edited by Simon Haykin. ISBN 0-471-36998-5. © 2001 John Wiley & Sons, Inc.

7.1 INTRODUCTION

In this book, the extended Kalman filter (EKF) has been used as the standard technique for performing recursive nonlinear estimation. The EKF algorithm, however, provides only an approximation to optimal nonlinear estimation. In this chapter, we point out the underlying assumptions and flaws in the EKF, and present an alternative filter with performance superior to that of the EKF. This algorithm, referred to as the unscented Kalman filter (UKF), was first proposed by Julier et al. [1–3], and further developed by Wan and van der Merwe [4–7].

The basic difference between the EKF and UKF stems from the manner in which Gaussian random variables (GRVs) are represented for propagation through the system dynamics. In the EKF, the state distribution is


approximated by a GRV, which is then propagated analytically through the first-order linearization of the nonlinear system. This can introduce large errors in the true posterior mean and covariance of the transformed GRV, which may lead to suboptimal performance and sometimes divergence of the filter. The UKF addresses this problem by using a deterministic sampling approach. The state distribution is again approximated by a GRV, but is now represented using a minimal set of carefully chosen sample points. These sample points completely capture the true mean and covariance of the GRV, and, when propagated through the true nonlinear system, capture the posterior mean and covariance accurately to second order (of the Taylor series expansion) for any nonlinearity. The EKF, in contrast, only achieves first-order accuracy. No explicit Jacobian or Hessian calculations are necessary for the UKF. Remarkably, the computational complexity of the UKF is of the same order as that of the EKF.

Julier and Uhlmann demonstrated the substantial performance gains of the UKF in the context of state estimation for nonlinear control. A number of theoretical results were also derived. This chapter reviews this work, and presents extensions to a broader class of nonlinear estimation problems, including nonlinear system identification, training of neural networks, and dual estimation problems.
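The deterministic sampling idea can be sketched with a minimal scaled unscented transform. The helper name, parameter values, and the linear sanity check below are illustrative assumptions, not the chapter's full UKF (which Section 7.3 develops):

```python
import numpy as np

def unscented_transform(f, mean, cov, alpha=1.0, beta=2.0, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through f via 2n+1 sigma points.

    alpha/beta/kappa are illustrative scaling choices, not canonical values.
    """
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)            # matrix square root
    sigma = np.vstack([mean, mean + S.T, mean - S.T])  # 2n+1 sample points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))     # mean weights
    wc = wm.copy()                                     # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    Y = np.array([f(s) for s in sigma])                # push points through f
    y_mean = wm @ Y
    d = Y - y_mean
    y_cov = (wc[:, None] * d).T @ d
    return y_mean, y_cov

# Sanity check: for a linear map f(x) = A x the transform must be exact.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
m = np.array([1.0, -1.0])
P = np.array([[0.5, 0.1], [0.1, 0.3]])
ym, yP = unscented_transform(lambda x: A @ x, m, P)
assert np.allclose(ym, A @ m)
assert np.allclose(yP, A @ P @ A.T)
```

For a genuinely nonlinear f, the same routine returns the second-order-accurate mean and covariance described above, with no Jacobian ever formed.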
Additional material includes the development of an unscented Kalman smoother (UKS), specification of efficient recursive square-root implementations, and a novel use of the UKF to improve particle filters [6].

In presenting the UKF, we shall cover a number of application areas of nonlinear estimation in which the EKF has been applied. General application areas may be divided into state estimation, parameter estimation (e.g., learning the weights of a neural network), and dual estimation (e.g., the expectation–maximization (EM) algorithm). Each of these areas places specific requirements on the UKF or EKF, and will be developed in turn. An overview of the framework for these areas is briefly reviewed next.

State Estimation. The basic framework for the EKF involves estimation of the state of a discrete-time nonlinear dynamical system,

x_{k+1} = F(x_k, u_k, v_k),   (7.1)
y_k = H(x_k, n_k),   (7.2)

where x_k represents the unobserved state of the system, u_k is a known exogenous input, and y_k is the observed measurement signal. The process


noise v_k drives the dynamic system, and the observation noise is given by n_k. Note that we are not assuming additivity of the noise sources. The system dynamical models F and H are assumed known. A simple block diagram of this system is shown in Figure 7.1. In state estimation, the EKF is the standard method of choice to achieve a recursive (approximate) maximum-likelihood estimate of the state x_k. For completeness, we shall review the EKF and its underlying assumptions in Section 7.2 to help motivate the presentation of the UKF for state estimation in Section 7.3.

Parameter Estimation. Parameter estimation, sometimes referred to as system identification or machine learning, involves determining a nonlinear mapping

y_k = G(x_k, w),   (7.3)

where x_k is the input, y_k is the output, and the nonlinear map G(·) is parameterized by the vector w. The nonlinear map, for example, may be a feedforward or recurrent neural network (w are the weights), with numerous applications in regression, classification, and dynamic modeling. Learning corresponds to estimating the parameters w. Typically, a training set is provided with sample pairs consisting of known inputs and desired outputs, {x_k, d_k}.
The error of the machine is defined as e_k = d_k − G(x_k, w), and the goal of learning involves solving for the parameters w in order to minimize the expectation of some given function of the error.

While a number of optimization approaches exist (e.g., gradient descent using backpropagation), the EKF may be used to estimate the parameters by writing a new state-space representation,

w_{k+1} = w_k + r_k,   (7.4)
d_k = G(x_k, w_k) + e_k,   (7.5)

Figure 7.1 Discrete-time nonlinear dynamical system. (Block diagram: the input u_k and process noise v_k drive the state x_k; the output y_k is observed through measurement noise n_k.)
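For a model that happens to be linear in w, the state-space formulation (7.4)–(7.5) reduces the Kalman parameter filter to recursive least squares, which makes it easy to sketch. The toy model, noise levels, and priors below are assumed purely for illustration, not taken from the chapter:

```python
import numpy as np

# Toy model G(x, w) = w0 + w1*sin(x); fit w from noisy samples via the
# state-space form w_{k+1} = w_k + r_k, d_k = G(x_k, w_k) + e_k.
rng = np.random.default_rng(1)
w_true = np.array([0.5, 2.0])
xs = rng.uniform(-3, 3, 200)
ds = w_true[0] + w_true[1] * np.sin(xs) + 0.05 * rng.standard_normal(200)

w = np.zeros(2)        # parameter estimate
P = 10.0 * np.eye(2)   # parameter covariance (broad assumed prior)
R_r = 1e-6 * np.eye(2) # process noise on w: controls tracking vs. convergence
R_e = 0.05**2          # measurement noise variance (assumed known)

for x, d in zip(xs, ds):
    P = P + R_r                        # time update for w_{k+1} = w_k + r_k
    H = np.array([1.0, np.sin(x)])     # Jacobian dG/dw (exact: G linear in w)
    s = H @ P @ H + R_e                # innovation variance
    K = P @ H / s                      # Kalman gain
    w = w + K * (d - w @ H)            # measurement update
    P = P - np.outer(K, H @ P)

assert np.allclose(w, w_true, atol=0.1)
```

With a model that is nonlinear in w (e.g., a neural network), H becomes the Jacobian of G with respect to w evaluated at the current estimate, and the identical recursion gives the EKF training rule referred to in the text.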


where the parameters w_k correspond to a stationary process with identity state transition matrix, driven by process noise r_k (the choice of variance determines convergence and tracking performance, and will be discussed in further detail in Section 7.4). The output d_k corresponds to a nonlinear observation on w_k. The EKF can then be applied directly as an efficient "second-order" technique for learning the parameters. The use of the EKF for training neural networks has been developed by Singhal and Wu [8] and Puskorius and Feldkamp [9], and is covered in Chapter 2 of this book. The use of the UKF in this role is developed in Section 7.4.

Dual Estimation. A special case of machine learning arises when the input x_k is unobserved, and requires coupling both state estimation and parameter estimation. For these dual estimation problems, we again consider a discrete-time nonlinear dynamical system,

x_{k+1} = F(x_k, u_k, v_k, w),   (7.6)
y_k = H(x_k, n_k, w),   (7.7)

where both the system states x_k and the set of model parameters w for the dynamical system must be simultaneously estimated from only the observed noisy signal y_k. Example applications include adaptive nonlinear control, noise reduction (e.g., speech or image enhancement), determining the underlying price of financial time series, etc. A general theoretical and algorithmic framework for dual Kalman-based estimation has been presented in Chapter 5. An expectation–maximization approach has also been covered in Chapter 6.
Approaches to dual estimation utilizing the UKF are developed in Section 7.5.

In the next section, we review optimal estimation to explain the basic assumptions and flaws of the EKF. This will motivate the use of the UKF as a method to amend these flaws. A detailed development of the UKF is given in Section 7.3. The remainder of the chapter will then be divided based on the application areas reviewed above. We conclude the chapter in Section 7.6 with the unscented particle filter, in which the UKF is used to improve sequential Monte-Carlo-based filtering methods. Appendix A provides a derivation of the accuracy of the UKF. Appendix B details an efficient square-root implementation of the UKF.

7.2 OPTIMAL RECURSIVE ESTIMATION AND THE EKF

Given observations y_k, the goal is to estimate the state x_k. We make no assumptions about the nature of the system dynamics at this point. The


optimal estimate in the minimum mean-squared error (MMSE) sense is given by the conditional mean

x̂_k = E[x_k | Y_0^k],   (7.8)

where Y_0^k is the sequence of observations up to time k. Evaluation of this expectation requires knowledge of the a posteriori density p(x_k | Y_0^k).¹ Given this density, we can determine not only the MMSE estimator, but any "best" estimator under a specified performance criterion. The problem of determining the a posteriori density is in general referred to as the Bayesian approach, and the density can be evaluated recursively according to the following relations:

p(x_k | Y_0^k) = p(x_k | Y_0^{k−1}) p(y_k | x_k) / p(y_k | Y_0^{k−1}),   (7.9)

where

p(x_k | Y_0^{k−1}) = ∫ p(x_k | x_{k−1}) p(x_{k−1} | Y_0^{k−1}) dx_{k−1},   (7.10)

and the normalizing constant p(y_k | Y_0^{k−1}) is given by

p(y_k | Y_0^{k−1}) = ∫ p(x_k | Y_0^{k−1}) p(y_k | x_k) dx_k.   (7.11)

This recursion specifies the current state density as a function of the previous density and the most recent measurement data. The state-space model comes into play by specifying the state transition probability p(x_k | x_{k−1}) and the measurement probability or likelihood, p(y_k | x_k). Specifically, p(x_k | x_{k−1}) is determined by the process noise density p(v_k) together with the state-update equation

x_{k+1} = F(x_k, u_k, v_k).   (7.12)

For example, given an additive noise model with Gaussian density, p(v_k) = N(0, R^v), then p(x_k | x_{k−1}) = N(F(x_{k−1}, u_{k−1}), R^v). Similarly,

¹ Note that we do not write the implicit dependence on the observed input u_k, since it is not a random variable.
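For a scalar state, the recursion (7.9)–(7.11) can be made concrete by brute-force discretization of the densities on a grid. The linear-Gaussian model, noise variances, and grid below are assumed purely for illustration (a case the Kalman filter of course handles exactly):

```python
import numpy as np

# Grid-based Bayes recursion for x_{k+1} = 0.9 x_k + v_k, y_k = x_k + n_k.
rng = np.random.default_rng(2)
grid = np.linspace(-5.0, 5.0, 601)
dx = grid[1] - grid[0]

def gauss(u, var):
    return np.exp(-0.5 * u**2 / var) / np.sqrt(2 * np.pi * var)

q, r = 0.2, 0.5                                        # assumed noise variances
trans = gauss(grid[:, None] - 0.9 * grid[None, :], q)  # p(x_k | x_{k-1})

x, post = 0.0, gauss(grid, 1.0)                        # true state; prior N(0,1)
for _ in range(50):
    x = 0.9 * x + np.sqrt(q) * rng.standard_normal()   # simulate the system
    y = x + np.sqrt(r) * rng.standard_normal()
    prior = trans @ post * dx                          # Eq. (7.10): prediction
    like = gauss(y - grid, r)                          # likelihood p(y_k | x_k)
    post = prior * like                                # numerator of Eq. (7.9)
    post /= post.sum() * dx                            # normalize, Eq. (7.11)

mmse = float(np.sum(grid * post) * dx)                 # conditional mean (7.8)
assert abs(post.sum() * dx - 1.0) < 1e-9
assert abs(mmse - x) < 3.0 * np.sqrt(r)                # estimate tracks the state
```

The cost of the integrals grows exponentially with the state dimension, which is exactly the intractability the following paragraphs describe and which motivates Gaussian approximations and Monte Carlo methods.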


p(y_k | x_k) is determined by the observation-noise density p(n_k) together with the measurement equation

    y_k = H(x_k, n_k).    (7.13)

In principle, knowledge of these densities and the initial condition p(x_0 | y_0) = p(y_0 | x_0) p(x_0) / p(y_0) determines p(x_k | Y_0^k) for all k. Unfortunately, the multidimensional integration indicated by Eqs. (7.9)–(7.11) makes a closed-form solution intractable for most systems. The only general approach is to apply Monte Carlo sampling techniques that essentially convert the integrals to finite sums, which converge to the true solution in the limit. The particle filter discussed in the last section of this chapter is an example of such an approach.

If we make the basic assumption that all densities remain Gaussian, then the Bayesian recursion can be greatly simplified. In this case, only the conditional mean x̂_k = E[x_k | Y_0^k] and covariance P_{x_k} need to be evaluated. It is straightforward to show that this leads to the recursive estimation

    x̂_k = (prediction of x_k) + K_k [y_k − (prediction of y_k)],    (7.14)
    P_{x_k} = P⁻_{x_k} − K_k P_{ỹ_k} K_k^T.    (7.15)

While this is a linear recursion, we have not assumed linearity of the model. The optimal terms in this recursion are given by

    x̂⁻_k = E[F(x_{k−1}, u_{k−1}, v_{k−1})],    (7.16)
    K_k = P_{x_k y_k} P⁻¹_{ỹ_k ỹ_k},    (7.17)
    ŷ⁻_k = E[H(x⁻_k, n_k)],    (7.18)

where the optimal prediction (i.e., prior mean) of x_k is written as x̂⁻_k, and corresponds to the expectation of a nonlinear function of the random variables x_{k−1} and v_{k−1} (with a similar interpretation for the optimal prediction ŷ⁻_k). The optimal gain term K_k is expressed as a function of posterior covariance matrices (with ỹ_k = y_k − ŷ⁻_k). Note that evaluation of the covariance terms also requires taking expectations of a nonlinear function of the prior state variable. P⁻_{x_k} is the prediction of the covariance of x_k, and P_{ỹ_k} is the covariance of ỹ_k.

The celebrated Kalman filter [10] calculates all terms in these equations exactly in the linear case, and can be viewed as an efficient method for


analytically propagating a GRV through linear system dynamics. For nonlinear models, however, the EKF approximates the optimal terms as

    x̂⁻_k ≈ F(x̂_{k−1}, u_{k−1}, v̄),    (7.19)
    K_k ≈ P̂_{x_k y_k} P̂⁻¹_{ỹ_k ỹ_k},    (7.20)
    ŷ⁻_k ≈ H(x̂⁻_k, n̄),    (7.21)

where predictions are approximated simply as functions of the prior mean value (no expectation taken).² The covariances are determined by linearizing the dynamical equations (x_{k+1} ≈ A x_k + B_u u_k + B v_k, y_k ≈ C x_k + D n_k), and then determining the posterior covariance matrices analytically for the linear system. In other words, in the EKF the state distribution is approximated by a GRV, which is then propagated analytically through the "first-order" linearization of the nonlinear system. The explicit equations for the EKF are given in Table 7.1. As such, the EKF

Table 7.1  Extended Kalman filter (EKF) equations

Initialize with
    x̂_0 = E[x_0],    (7.22)
    P_{x_0} = E[(x_0 − x̂_0)(x_0 − x̂_0)^T].    (7.23)

For k ∈ {1, ..., ∞}, the time-update equations of the extended Kalman filter are
    x̂⁻_k = F(x̂_{k−1}, u_{k−1}, v̄),    (7.24)
    P⁻_{x_k} = A_{k−1} P_{x_{k−1}} A^T_{k−1} + B_k R^v B^T_k,    (7.25)

and the measurement-update equations are
    K_k = P⁻_{x_k} C^T_k (C_k P⁻_{x_k} C^T_k + D_k R^n D^T_k)⁻¹,    (7.26)
    x̂_k = x̂⁻_k + K_k [y_k − H(x̂⁻_k, n̄)],    (7.27)
    P_{x_k} = (I − K_k C_k) P⁻_{x_k},    (7.28)

where
    A_k ≜ ∂F(x, u_k, v̄)/∂x |_{x̂_k},    B_k ≜ ∂F(x̂_k, u_k, v)/∂v |_{v̄},
    C_k ≜ ∂H(x, n̄)/∂x |_{x̂⁻_k},    D_k ≜ ∂H(x̂⁻_k, n)/∂n |_{n̄},    (7.29)

and where R^v and R^n are the covariances of v_k and n_k, respectively.

² The noise means are denoted by n̄ = E[n] and v̄ = E[v], and are usually assumed to equal zero.


can be viewed as providing "first-order" approximations to the optimal terms.³ These approximations, however, can introduce large errors in the true posterior mean and covariance of the transformed (Gaussian) random variable, which may lead to suboptimal performance and sometimes divergence of the filter.⁴ It is these "flaws" that will be addressed in the next section using the UKF.

7.3 THE UNSCENTED KALMAN FILTER

The UKF addresses the approximation issues of the EKF. The state distribution is again represented by a GRV, but is now specified using a minimal set of carefully chosen sample points. These sample points completely capture the true mean and covariance of the GRV, and, when propagated through the true nonlinear system, capture the posterior mean and covariance accurately to second order (of the Taylor series expansion) for any nonlinearity. To elaborate on this, we begin by explaining the unscented transformation.

Unscented Transformation  The unscented transformation (UT) is a method for calculating the statistics of a random variable that undergoes a nonlinear transformation [3]. Consider propagating a random variable x (dimension L) through a nonlinear function, y = f(x). Assume x has mean x̄ and covariance P_x. To calculate the statistics of y, we form a matrix X of 2L + 1 sigma vectors X_i according to the following:

    X_0 = x̄,
    X_i = x̄ + (√((L + λ) P_x))_i,       i = 1, ..., L,    (7.30)
    X_i = x̄ − (√((L + λ) P_x))_{i−L},   i = L + 1, ..., 2L,

³ While "second-order" versions of the EKF exist, their increased implementation and computational complexity tend to prohibit their use.
⁴ A popular technique to improve the "first-order" approach is the iterated EKF, which effectively iterates the EKF equations at the current time step by redefining the nominal state estimate and re-linearizing the measurement equations. It is capable of providing better performance than the basic EKF, especially in the case of significant nonlinearity in the measurement function [11]. We have not performed a comparison with the UKF at this time, though a similar procedure may also be adapted to iterate the UKF.


where λ = α²(L + κ) − L is a scaling parameter. The constant α determines the spread of the sigma points around x̄, and is usually set to a small positive value (e.g., 10⁻⁴ ≤ α ≤ 1). The constant κ is a secondary scaling parameter, which is usually set to 3 − L (see [1] for details), and β is used to incorporate prior knowledge of the distribution of x (for Gaussian distributions, β = 2 is optimal). (√((L + λ) P_x))_i is the ith column of the matrix square root (e.g., lower-triangular Cholesky factorization). These sigma vectors are propagated through the nonlinear function,

    Y_i = f(X_i),   i = 0, ..., 2L,    (7.31)

and the mean and covariance for y are approximated using a weighted sample mean and covariance of the posterior sigma points,

    ȳ ≈ Σ_{i=0}^{2L} W_i^{(m)} Y_i,    (7.32)
    P_y ≈ Σ_{i=0}^{2L} W_i^{(c)} (Y_i − ȳ)(Y_i − ȳ)^T,    (7.33)

with weights W_i given by

    W_0^{(m)} = λ / (L + λ),
    W_0^{(c)} = λ / (L + λ) + 1 − α² + β,    (7.34)
    W_i^{(m)} = W_i^{(c)} = 1 / [2(L + λ)],   i = 1, ..., 2L.

A block diagram illustrating the steps in performing the UT is shown in Figure 7.2. Note that this method differs substantially from general Monte Carlo sampling methods, which require orders of magnitude more sample points in an attempt to propagate an accurate (possibly non-Gaussian) distribution of the state. The deceptively simple approach taken with the UT results in approximations that are accurate to third order for Gaussian inputs for all nonlinearities. For non-Gaussian inputs, approximations are accurate to at least second order, with the accuracy of third- and higher-order moments being determined by the choice of α and β. The proof of this is provided in Appendix A. Valuable insight into the UT can also be gained by relating it to a numerical technique called


Gaussian quadrature for the numerical evaluation of integrals. Ito and Xiong [12] recently showed the relation between the UT and the Gauss–Hermite quadrature rule⁵ in the context of state estimation. A close similarity also exists between the UT and the central difference interpolation filtering (CDF) techniques developed separately by Ito and Xiong [12] and Nørgaard, Poulsen, and Ravn [13]. In [7], van der Merwe and Wan show how the UKF and CDF can be unified in a general family of derivative-free Kalman filters for nonlinear estimation.

Figure 7.2  Block diagram of the UT.

A simple example is shown in Figure 7.3 for a two-dimensional system: Figure 7.3a shows the true mean and covariance propagation using Monte Carlo sampling; Figure 7.3b shows the results using a linearization approach, as would be done in the EKF; Figure 7.3c shows the performance of the UT (note that only five sigma points are required). The superior performance of the UT is clear.

Unscented Kalman Filter  The unscented Kalman filter (UKF) is a straightforward extension of the UT to the recursive estimation in Eq. (7.14), where the state RV is redefined as the concatenation of the original state and noise variables: x_k^a = [x_k^T  v_k^T  n_k^T]^T. The UT sigma-point selection scheme, Eq. (7.30), is applied to this new augmented state RV to calculate the corresponding sigma matrix, X_k^a. The UKF equations are

⁵ In the scalar case, the Gauss–Hermite rule is given by ∫_{−∞}^{∞} f(x)(2π)^{−1/2} e^{−x²/2} dx = Σ_{i=1}^{m} w_i f(x_i), where the equality holds for all polynomials f(·) of degree up to 2m − 1, and the quadrature points x_i and weights w_i are determined according to the rule type (see [12] for details). For higher dimensions, the Gauss–Hermite rule requires on the order of m^L functional evaluations, where L is the dimension of the state. For the scalar case, the UT with α = 1, β = 0, and κ = 2 coincides with the three-point Gauss–Hermite quadrature rule.


Figure 7.3  Example of the UT for mean and covariance propagation: (a) actual; (b) first-order linearization (EKF); (c) UT.

given in Table 7.2. Note that no explicit calculations of Jacobians or Hessians are necessary to implement this algorithm. Furthermore, the overall number of computations is of the same order as for the EKF.

Implementation Variations  For the special (but often encountered) case where the process and measurement noise are purely additive, the computational complexity of the UKF can be reduced. In such a case, the system state need not be augmented with the noise RVs. This reduces the dimension of the sigma points as well as the total number of sigma points used. The covariances of the noise sources are then incorporated into the state covariance using a simple additive procedure. This implementation is given in Table 7.3. The complexity of the algorithm is of order L³, where L is the dimension of the state. This is the same complexity as the EKF. The most costly operation is forming the sample prior covariance matrix P⁻_k. Depending on the form of F, this may be simplified; for example, for univariate time series or with parameter estimation (see Section 7.4), the complexity reduces to order L².
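As a concrete illustration of Eqs. (7.30)–(7.34), the sketch below generates the sigma points and weights and applies the UT to a nonlinear function. This is a minimal numpy sketch written for this edition; the function names and the use of `numpy.linalg.cholesky` as the matrix square root are our choices, not the chapter's code.

```python
import numpy as np

def sigma_points_and_weights(x_mean, P, alpha=1.0, beta=2.0, kappa=0.0):
    """Sigma vectors (columns of X) and weights, Eqs. (7.30) and (7.34)."""
    L = x_mean.size
    lam = alpha**2 * (L + kappa) - L              # scaling parameter lambda
    S = np.linalg.cholesky((L + lam) * P)         # lower-triangular matrix square root
    X = np.tile(x_mean[:, None], (1, 2 * L + 1))
    X[:, 1:L + 1] += S                            # x_bar + i-th column of sqrt((L+lam)P)
    X[:, L + 1:] -= S                             # x_bar - i-th column
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (L + lam)
    Wc[0] = lam / (L + lam) + (1 - alpha**2 + beta)
    return X, Wm, Wc

def unscented_transform(f, x_mean, P, **kw):
    """Propagate (x_mean, P) through y = f(x), Eqs. (7.31)-(7.33)."""
    X, Wm, Wc = sigma_points_and_weights(x_mean, P, **kw)
    Y = np.column_stack([f(X[:, i]) for i in range(X.shape[1])])  # Eq. (7.31)
    y_mean = Y @ Wm                                               # Eq. (7.32)
    D = Y - y_mean[:, None]
    P_y = (Wc * D) @ D.T                                          # Eq. (7.33)
    return y_mean, P_y
```

For a linear map f(x) = Ax, the UT recovers the exact mean Ax̄ and covariance A P_x Aᵀ, which is a useful sanity check of any implementation.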


Table 7.2  Unscented Kalman filter (UKF) equations

Initialize with
    x̂_0 = E[x_0],    (7.35)
    P_0 = E[(x_0 − x̂_0)(x_0 − x̂_0)^T],    (7.36)
    x̂_0^a = E[x^a] = [x̂_0^T  0  0]^T,    (7.37)
    P_0^a = E[(x_0^a − x̂_0^a)(x_0^a − x̂_0^a)^T] =
        [ P_0   0     0
          0     R^v   0
          0     0     R^n ].    (7.38)

For k ∈ {1, ..., ∞}, calculate the sigma points:
    X^a_{k−1} = [x̂^a_{k−1}   x̂^a_{k−1} + γ√(P^a_{k−1})   x̂^a_{k−1} − γ√(P^a_{k−1})].    (7.39)

The time-update equations are
    X^x_{k|k−1} = F(X^x_{k−1}, u_{k−1}, X^v_{k−1}),    (7.40)
    x̂⁻_k = Σ_{i=0}^{2L} W_i^{(m)} X^x_{i,k|k−1},    (7.41)
    P⁻_k = Σ_{i=0}^{2L} W_i^{(c)} (X^x_{i,k|k−1} − x̂⁻_k)(X^x_{i,k|k−1} − x̂⁻_k)^T,    (7.42)
    Y_{k|k−1} = H(X^x_{k|k−1}, X^n_{k−1}),    (7.43)
    ŷ⁻_k = Σ_{i=0}^{2L} W_i^{(m)} Y_{i,k|k−1},    (7.44)

and the measurement-update equations are
    P_{ỹ_k ỹ_k} = Σ_{i=0}^{2L} W_i^{(c)} (Y_{i,k|k−1} − ŷ⁻_k)(Y_{i,k|k−1} − ŷ⁻_k)^T,    (7.45)
    P_{x_k y_k} = Σ_{i=0}^{2L} W_i^{(c)} (X_{i,k|k−1} − x̂⁻_k)(Y_{i,k|k−1} − ŷ⁻_k)^T,    (7.46)
    K_k = P_{x_k y_k} P⁻¹_{ỹ_k ỹ_k},    (7.47)
    x̂_k = x̂⁻_k + K_k (y_k − ŷ⁻_k),    (7.48)
    P_k = P⁻_k − K_k P_{ỹ_k ỹ_k} K_k^T,    (7.49)

where
    x^a = [x^T  v^T  n^T]^T,   X^a = [(X^x)^T  (X^v)^T  (X^n)^T]^T,   γ = √(L + λ);

λ is the composite scaling parameter, L is the dimension of the augmented state, R^v is the process-noise covariance, R^n is the measurement-noise covariance, and W_i are the weights as calculated in Eq. (7.34).


Table 7.3  UKF – additive (zero-mean) noise case

Initialize with
    x̂_0 = E[x_0],    (7.50)
    P_0 = E[(x_0 − x̂_0)(x_0 − x̂_0)^T].    (7.51)

For k ∈ {1, ..., ∞}, calculate the sigma points:
    X_{k−1} = [x̂_{k−1}   x̂_{k−1} + γ√(P_{k−1})   x̂_{k−1} − γ√(P_{k−1})].    (7.52)

The time-update equations are
    X*_{k|k−1} = F(X_{k−1}, u_{k−1}),    (7.53)
    x̂⁻_k = Σ_{i=0}^{2L} W_i^{(m)} X*_{i,k|k−1},    (7.54)
    P⁻_k = Σ_{i=0}^{2L} W_i^{(c)} (X*_{i,k|k−1} − x̂⁻_k)(X*_{i,k|k−1} − x̂⁻_k)^T + R^v,    (7.55)

(augment sigma points)⁶
    X_{k|k−1} = [X*_{k|k−1}   X*_{0,k|k−1} + γ√(R^v)   X*_{0,k|k−1} − γ√(R^v)],    (7.56)
    Y_{k|k−1} = H(X_{k|k−1}),    (7.57)
    ŷ⁻_k = Σ_{i=0}^{2L} W_i^{(m)} Y_{i,k|k−1},    (7.58)

and the measurement-update equations are
    P_{ỹ_k ỹ_k} = Σ_{i=0}^{2L} W_i^{(c)} (Y_{i,k|k−1} − ŷ⁻_k)(Y_{i,k|k−1} − ŷ⁻_k)^T + R^n,    (7.59)
    P_{x_k y_k} = Σ_{i=0}^{2L} W_i^{(c)} (X_{i,k|k−1} − x̂⁻_k)(Y_{i,k|k−1} − ŷ⁻_k)^T,    (7.60)
    K_k = P_{x_k y_k} P⁻¹_{ỹ_k ỹ_k},    (7.61)
    x̂_k = x̂⁻_k + K_k (y_k − ŷ⁻_k),    (7.62)
    P_k = P⁻_k − K_k P_{ỹ_k ỹ_k} K_k^T,    (7.63)

where γ = √(L + λ); λ is the composite scaling parameter, L is the dimension of the state, R^v is the process-noise covariance, R^n is the measurement-noise covariance, and W_i are the weights as calculated in Eq. (7.34).

⁶ Here we augment the sigma points with additional points derived from the matrix square root of the process-noise covariance. This requires setting L → 2L and recalculating the various weights W_i accordingly. Alternatively, we may redraw a complete new set of sigma points, i.e., X_{k|k−1} = [x̂⁻_k   x̂⁻_k + γ√(P⁻_k)   x̂⁻_k − γ√(P⁻_k)]. This alternative approach results in fewer sigma points being used, but also discards any odd-moment information captured by the original propagated sigma points.
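A minimal numpy sketch of one additive-noise UKF cycle follows. Note the simplification relative to Table 7.3: instead of augmenting or redrawing the sigma points after the time update (Eq. (7.56) or footnote 6), it reuses the propagated points directly in the measurement update, a common shortcut that sacrifices the noise-coupling terms. All names are ours, not the chapter's code.

```python
import numpy as np

def ukf_step_additive(x, P, y, F, H, Rv, Rn, alpha=1.0, beta=2.0, kappa=0.0):
    """One UKF time + measurement update for purely additive noise (cf. Table 7.3).
    Simplified: measurement sigma points are the propagated points (no redraw)."""
    L = x.size
    lam = alpha**2 * (L + kappa) - L
    g = np.sqrt(L + lam)
    S = np.linalg.cholesky(P)
    X = np.hstack([x[:, None], x[:, None] + g * S, x[:, None] - g * S])  # Eq. (7.52)
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam))); Wc = Wm.copy()
    Wm[0] = lam / (L + lam); Wc[0] = Wm[0] + (1 - alpha**2 + beta)

    # Time update, Eqs. (7.53)-(7.55): propagate through F, add R^v.
    Xp = np.column_stack([F(X[:, i]) for i in range(2 * L + 1)])
    x_pred = Xp @ Wm
    Dx = Xp - x_pred[:, None]
    P_pred = (Wc * Dx) @ Dx.T + Rv

    # Measurement update, Eqs. (7.57)-(7.63).
    Yp = np.column_stack([H(Xp[:, i]) for i in range(2 * L + 1)])
    y_pred = Yp @ Wm
    Dy = Yp - y_pred[:, None]
    P_yy = (Wc * Dy) @ Dy.T + Rn
    P_xy = (Wc * Dx) @ Dy.T
    K = P_xy @ np.linalg.inv(P_yy)
    x_new = x_pred + K @ (y - y_pred)
    P_new = P_pred - K @ P_yy @ K.T
    return x_new, P_new
```

For linear F and H (and zero process noise, where the shortcut above is exact), this step reproduces the standard Kalman filter update, which makes a convenient unit test.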


A number of variations for numerical purposes are also possible. For example, the matrix square root, which can be implemented directly using a Cholesky factorization, is in general of order (1/6)L³. However, the covariance matrices are expressed recursively, and thus the square root can be computed in only order M·L² (where M is the dimension of the output y_k) by performing a recursive update to the Cholesky factorization. Details of an efficient recursive square-root UKF implementation are given in Appendix B.

7.3.1 State-Estimation Examples

The UKF was originally designed for state estimation applied to nonlinear control applications requiring full-state feedback [1–3]. We provide an example for a double inverted pendulum control system. In addition, we provide a new application example corresponding to noisy time-series estimation with neural networks.

Double Inverted Pendulum  A double inverted pendulum (see Fig. 7.4) has states corresponding to cart position and velocity, and top and bottom pendulum angle and angular velocity, x = [x, ẋ, θ₁, θ̇₁, θ₂, θ̇₂]. The system parameters correspond to the length and mass of each pendulum, and the cart mass, w = [l₁, l₂, m₁, m₂, M]. The dynamical equations are

    (M + m₁ + m₂)ẍ − (m₁ + 2m₂)l₁θ̈₁ cos θ₁ − m₂l₂θ̈₂ cos θ₂
        = u + (m₁ + 2m₂)l₁(θ̇₁)² sin θ₁ + m₂l₂(θ̇₂)² sin θ₂,    (7.64)

    −(m₁ + 2m₂)l₁ẍ cos θ₁ + 4(⅓m₁ + m₂)(l₁)²θ̈₁ + 2m₂l₁l₂θ̈₂ cos(θ₂ − θ₁)
        = (m₁ + 2m₂)g l₁ sin θ₁ + 2m₂l₁l₂(θ̇₂)² sin(θ₂ − θ₁),    (7.65)

    −m₂ẍl₂ cos θ₂ + 2m₂l₁l₂θ̈₁ cos(θ₂ − θ₁) + (4/3)m₂(l₂)²θ̈₂
        = m₂g l₂ sin θ₂ − 2m₂l₁l₂(θ̇₁)² sin(θ₂ − θ₁).    (7.66)

Figure 7.4  Double inverted pendulum.


These continuous-time dynamics are discretized with a sampling period of 0.02 seconds. The pendulum is stabilized by applying a control force u to the cart. In this case, we use a state-dependent Riccati equation (SDRE) controller to stabilize the system.⁷ A state estimator is run outside the control loop in order to compare the EKF with the UKF (i.e., the estimated states are not used in the feedback control, for evaluation purposes). The observation corresponds to noisy measurements of the cart position, cart velocity, and angle of the top pendulum. This is a challenging problem, since no measurements are made of the bottom pendulum, nor of the angular velocity of the top pendulum. For this experiment, the pendulum is initialized in a jack-knife position (+25°/−25°), with a cart offset of 0.5 meters. The resulting state estimates are shown in Figure 7.5. Clearly, the UKF is better able to track the unobserved states.⁸ If the estimated states are used for feedback in the control loop, the UKF system is still able to stabilize the pendulum, while the EKF system crashes. We shall return to the double inverted pendulum problem later in this chapter for both model estimation and dual estimation.

Noisy Time-Series Estimation  In this example, the UKF is used to estimate an underlying clean time series corrupted by additive Gaussian white noise. The time series used is the Mackey–Glass-30 chaotic series [15, 16]. The clean time series is first modeled as a nonlinear autoregression

    x_k = f(x_{k−1}, ..., x_{k−M}; w) + v_k,    (7.67)

where the model f (parameterized by w) was approximated by training a feedforward neural network on the clean sequence. The residual error after convergence was taken to be the process-noise variance.

Next, white Gaussian noise was added to the clean Mackey–Glass series to generate a noisy time series y_k = x_k + n_k. The corresponding

⁷ An SDRE controller [11] is designed by formulating the dynamical equations as x_{k+1} = A(x_k)x_k + B(x_k)u_k. Note that this representation is not a linearization, but rather a reformulation of the nonlinear dynamics into a pseudo-linear form. Based on this state-space representation, we design an optimal LQR controller, u_k = −R⁻¹B^T(x_k)P(x_k)x_k ≜ −K(x_k)x_k, where P(x_k) is a solution of the standard Riccati equations using state-dependent matrices A(x_k) and B(x_k). The procedure is repeated at every time step at the current state x_k, and provides local asymptotic stability of the plant [14]. The approach has been found to be far more robust than LQR controllers based on standard linearization techniques.
⁸ Note that if all six states are observed with noise, then the performances of the EKF and UKF are comparable.


Figure 7.5  State estimation for the double inverted pendulum problem. Only three noisy states are observed: cart position, cart velocity, and the angle of the top pendulum. (10 dB SNR; α = 1, β = 0, κ = 0.)

state-space representation is given by

    x_{k+1} = F(x_k, w) + B v_k:

    ⎡ x_{k+1}   ⎤   ⎡ f(x_k, ..., x_{k−M+1}; w) ⎤   ⎡ 1 ⎤
    ⎢ x_k       ⎥ = ⎢                           ⎥ + ⎢ 0 ⎥ v_k,
    ⎢   ⋮       ⎥   ⎢    [ I_{M−1}  0 ] x_k     ⎥   ⎢ ⋮ ⎥
    ⎣ x_{k−M+2} ⎦   ⎣                           ⎦   ⎣ 0 ⎦

    y_k = [ 1  0  ⋯  0 ] x_k + n_k,    (7.68)

where the lower block of F simply shifts the lagged samples down (the identity on the first M − 1 components of x_k). In the estimation problem, the noisy time series y_k is the only observed input to either the EKF or UKF algorithm (both utilize the known neural network model). Figure 7.6 shows a subsegment of the estimates generated by both the EKF and the UKF (the original noisy time series has a 3 dB SNR). The superior performance of the UKF is clearly visible.
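The shift-register structure of Eq. (7.68) can be sketched directly; this is a minimal illustration with our own function names, where `f` stands in for the trained neural-network autoregression.

```python
import numpy as np

def ar_state_update(x, f, v):
    """State update of Eq. (7.68): the new sample f(x) + v enters at the top
    and the oldest lag drops off. f is any callable on the lag vector."""
    x_next = np.empty_like(x)
    x_next[0] = f(x) + v       # x_{k+1} = f(x_k, ..., x_{k-M+1}; w) + v_k
    x_next[1:] = x[:-1]        # shift the remaining lags down by one
    return x_next

def observe(x, n):
    """y_k = [1 0 ... 0] x_k + n_k : only the newest sample is observed, noisily."""
    return x[0] + n
```

Because only `observe` links the noisy measurements to the state, the filter must infer the entire lag vector from the scalar observations, which is exactly where the EKF and UKF differ in this example.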


Figure 7.6  Estimation of the Mackey–Glass time series using a known model: (a) with the EKF; (b) with the UKF; (c) comparison of normalized MSE estimation errors for the complete sequence.

7.3.2 The Unscented Kalman Smoother

As has been discussed, the Kalman filter is a recursive algorithm providing the conditional expectation of the state x_k given all observations Y_0^k up to the current time k. In contrast, the Kalman smoother estimates the state given all observations past and future, Y_0^N, where N is the final time. Kalman smoothers are commonly used for applications such as trajectory planning, noncausal noise reduction, and the E-step in the EM algorithm


[17, 18]. A thorough treatment of the Kalman smoother in the linear case is given in [19]. The basic idea is to run a Kalman filter forward in time to estimate the mean and covariance (x̂_k^f, P_k^f) of the state, given past data. A second Kalman filter is then run backward in time to produce a backward-time predicted mean and covariance (x̂_k^b, P_k^b), given the future data. These two estimates are then combined, producing the following smoothed statistics, given all the data:

    (P_k^s)⁻¹ = (P_k^f)⁻¹ + (P_k^b)⁻¹,    (7.69)
    x̂_k^s = P_k^s [(P_k^b)⁻¹ x̂_k^b + (P_k^f)⁻¹ x̂_k^f].    (7.70)

For the nonlinear case, the EKF replaces the Kalman filter. The use of the EKF for the forward filter is straightforward. However, implementation of the backward filter is achieved by using the following linearized backward-time system:

    x_{k−1} = A⁻¹ x_k − A⁻¹ B v_k;    (7.71)

that is, the forward nonlinear dynamics are linearized, and then inverted for the backward model. A linear Kalman filter is then applied.

Our proposed unscented Kalman smoother (UKS) replaces the EKF with the UKF. In addition, we consider using a nonlinear backward model as well, either derived from first principles or by training a backward predictor using a neural network model, as illustrated for the time-series case in Figure 7.7. The nonlinear backward model allows us to take full advantage of the UKF, which requires no linearization step.

To illustrate performance, we reconsider the noisy Mackey–Glass time-series problem of the previous section, as well as a second time series generated using a chaotic autoregressive neural network. Table 7.4 compares smoother performance.
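The forward/backward combination of Eqs. (7.69)–(7.70) is a simple information-form fusion; a minimal numpy sketch (our names, not the chapter's code):

```python
import numpy as np

def combine_forward_backward(x_f, P_f, x_b, P_b):
    """Fuse forward-filtered and backward-predicted estimates, Eqs. (7.69)-(7.70).
    Inverse covariances (information matrices) add; means are information-weighted."""
    Pf_inv = np.linalg.inv(P_f)
    Pb_inv = np.linalg.inv(P_b)
    P_s = np.linalg.inv(Pf_inv + Pb_inv)          # Eq. (7.69)
    x_s = P_s @ (Pb_inv @ x_b + Pf_inv @ x_f)     # Eq. (7.70)
    return x_s, P_s
```

Note that the smoothed covariance P_s is never larger than either input covariance, reflecting that the smoother uses strictly more data than either filter alone.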
In this case, the network models are trained on the clean time series, and then tested on the noisy data using the standard extended Kalman smoother with linearized backward model

Figure 7.7  Forward/backward neural network prediction training.


Table 7.4  Comparison of smoother performance (normalized MSE)

Mackey–Glass
    Algorithm   F       B       S
    EKS1        0.20    0.70    0.27
    EKS2        0.20    0.31    0.19
    UKS         0.10    0.24    0.08

Chaotic AR–NN
    Algorithm   F       B       S
    EKS1        0.35    0.32    0.28
    EKS2        0.35    0.22    0.23
    UKS         0.23    0.21    0.16

(EKS1), an extended Kalman smoother with a second nonlinear backward model (EKS2), and the unscented Kalman smoother (UKS). The forward (F), backward (B), and smoothed (S) estimation errors are reported. Again, the performance benefits of the unscented approach are clear.

7.4 UKF PARAMETER ESTIMATION

Recall that parameter estimation involves learning a nonlinear mapping y_k = G(x_k, w), where w corresponds to the set of unknown parameters. G(·) may be a neural network or another parameterized function. The EKF may be used to estimate the parameters by writing a new state-space representation

    w_{k+1} = w_k + r_k,    (7.73)
    d_k = G(x_k, w_k) + e_k,    (7.74)

where w_k corresponds to a stationary process with identity state-transition matrix, driven by process noise r_k. The desired output d_k corresponds to a nonlinear observation on w_k. In the linear case, the relationship between the Kalman filter (KF) and the popular recursive least-squares (RLS) algorithm is given in [20] and [25]. In the nonlinear case, EKF training corresponds to a modified Newton method [22] (see also Chapter 2).
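The state-space formulation of Eqs. (7.73)–(7.74) can be filtered directly with the UKF. The sketch below shows one such cycle using output option 1 of Table 7.5 (the sigma-point mean of the predictions); it is a minimal numpy illustration with our own names, not the chapter's code.

```python
import numpy as np

def ukf_param_step(w, Pw, x_k, d_k, G, Rr, Re, alpha=1.0, beta=2.0, kappa=0.0):
    """One UKF parameter-estimation cycle (cf. Table 7.5, output option 1).
    G(x, w) -> prediction; Rr, Re are the process/measurement noise covariances."""
    L = w.size
    lam = alpha**2 * (L + kappa) - L
    g = np.sqrt(L + lam)
    Pw_prior = Pw + Rr                           # time update, Eq. (7.79)
    S = np.linalg.cholesky(Pw_prior)
    W = np.hstack([w[:, None], w[:, None] + g * S, w[:, None] - g * S])  # Eq. (7.80)
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam))); Wc = Wm.copy()
    Wm[0] = lam / (L + lam); Wc[0] = Wm[0] + (1 - alpha**2 + beta)
    # Propagate parameter sigma points through G, Eqs. (7.81)-(7.82).
    D = np.column_stack([np.atleast_1d(G(x_k, W[:, i])) for i in range(2 * L + 1)])
    d_hat = D @ Wm
    Dd = D - d_hat[:, None]
    Dw = W - w[:, None]
    P_dd = (Wc * Dd) @ Dd.T + Re                 # Eq. (7.84)
    P_wd = (Wc * Dw) @ Dd.T                      # Eq. (7.85)
    K = P_wd @ np.linalg.inv(P_dd)               # Eq. (7.86)
    w_new = w + K @ (np.atleast_1d(d_k) - d_hat) # Eq. (7.87)
    Pw_new = Pw_prior - K @ P_dd @ K.T           # Eq. (7.88)
    return w_new, Pw_new
```

When G is linear in w, this step coincides exactly with the linear Kalman-filter (RLS-like) parameter update, which provides a check on the implementation.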


From an optimization perspective, the following prediction-error cost is minimized:

    J(w) = Σ_{t=1}^{k} [d_t − G(x_t, w)]^T (R^e)⁻¹ [d_t − G(x_t, w)].    (7.75)

Thus, if the "noise" covariance R^e is a constant diagonal matrix, then, in fact, it cancels out of the algorithm (this can be shown explicitly), and hence can be set arbitrarily (e.g., R^e = 0.5I). Alternatively, R^e can be set to specify a weighted MSE cost. The innovations covariance E[r_k r_k^T] = R^r_k, on the other hand, affects the convergence rate and tracking performance. Roughly speaking, the larger the covariance, the more quickly older data is discarded. There are several options on how to choose R^r_k:

- Set R^r_k to an arbitrary "fixed" diagonal value, which may then be "annealed" towards zero as training continues.

- Set R^r_k = (λ_RLS⁻¹ − 1) P_{w_k}, where λ_RLS ∈ (0, 1] is often referred to as the "forgetting factor," as defined in the recursive least-squares (RLS) algorithm [21]. This provides an approximately exponentially decaying weighting on past data, and is described more fully in [22]. Note that λ_RLS should not be confused with the λ used for sigma-point calculation.

- Set

      R^r_k = (1 − α_RM) R^r_{k−1} + α_RM K^w_k [d_k − G(x_k, ŵ)][d_k − G(x_k, ŵ)]^T (K^w_k)^T,

  which is a Robbins–Monro stochastic approximation scheme for estimating the innovations [23]. The method assumes that the covariance of the Kalman update model is consistent with the actual update model. Typically, R^r_k is also constrained to be a diagonal matrix, which implies an independence assumption on the parameters.
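The Robbins–Monro option above amounts to an exponentially weighted outer-product update; a minimal sketch (the step size `a_rm` and function name are ours):

```python
import numpy as np

def robbins_monro_Rr(Rr_prev, K, innovation, a_rm=0.05, diagonal=True):
    """Robbins-Monro update of the process-noise covariance R^r_k.
    innovation = d_k - G(x_k, w_hat); K is the Kalman gain K^w_k."""
    Ke = K @ innovation                         # K^w_k [d_k - G(x_k, w_hat)]
    Rr = (1 - a_rm) * Rr_prev + a_rm * np.outer(Ke, Ke)
    if diagonal:                                # independence assumption on parameters
        Rr = np.diag(np.diag(Rr))
    return Rr
```

The diagonal constraint mirrors the text's independence assumption; dropping it retains cross-parameter terms at the cost of that assumption.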
Note that a similar update may also be used for R^e_k.

Our experience indicates that the Robbins–Monro method provides the fastest rate of absolute convergence and the lowest final MMSE values (see the experiments in the next section). The "fixed" R^r_k in combination with annealing can also achieve good final MMSE performance, but requires more monitoring and greater prior knowledge of the noise levels. For problems where the MMSE is zero, the covariance should be lower-bounded to prevent the algorithm from stalling and potential numerical


problems. The "forgetting-factor" and "fixed" R^r_k methods are most appropriate for on-line learning problems in which tracking of time-varying parameters is necessary. In this case, the parameter covariance stays lower-bounded, allowing the most recent data to be emphasized. This leads to some misadjustment, but also keeps the Kalman gain sufficiently large to maintain good tracking. In general, the study of the various trade-offs between these different approaches is still an area of open research.

The UKF represents an alternative to the EKF for parameter estimation. However, as the state transition function is linear, the advantage of the UKF may not be as obvious. Note that the observation function is still nonlinear. Furthermore, the EKF essentially builds up an approximation to the expected Hessian by taking outer products of the gradient. The UKF, however, may provide a more accurate estimate through direct approximation of the expectation of the Hessian. While both the EKF and UKF can be expected to achieve similar final MMSE performance, their convergence properties may differ. In addition, a distinct advantage of the UKF occurs when either the architecture or the error metric is such that differentiation with respect to the parameters is not easily derived, as is necessary in the EKF. The UKF effectively evaluates both the Jacobian and Hessian precisely through its sigma-point propagation, without the need to perform any analytical differentiation.

Specific equations for UKF parameter estimation are given in Table 7.5. Simplifications have been made relative to the state UKF, accounting for the specific form of the state transition function. In Table 7.5, we have provided two options for how the function output d̂_k is achieved. In the first option, the output is given as

    d̂_k = Σ_{i=0}^{2L} W_i^{(m)} D_{i,k|k−1} ≈ E[G(x_k, w_k)],    (7.89)

corresponding to the direct interpretation of the UKF equations. The output is the expected value (mean) of a function of the random variable w_k. In the second option, we have

    d̂_k = G(x_k, ŵ⁻_k),    (7.90)

corresponding to the typical interpretation, in which the output is the function evaluated with the current "best" set of parameters. This option yields convergence performance that is indistinguishable from the EKF. The first option, however, has different convergence characteristics, and requires


Table 7.5  UKF parameter estimation

Initialize with
    ŵ_0 = E[w],    (7.76)
    P_{w_0} = E[(w − ŵ_0)(w − ŵ_0)^T].    (7.77)

For k ∈ {1, ..., ∞}, the time update and sigma-point calculation are given by
    ŵ⁻_k = ŵ_{k−1},    (7.78)
    P⁻_{w_k} = P_{w_{k−1}} + R^r_{k−1},    (7.79)
    W_{k|k−1} = [ŵ⁻_k   ŵ⁻_k + γ√(P⁻_{w_k})   ŵ⁻_k − γ√(P⁻_{w_k})],    (7.80)

    option 1:  D_{k|k−1} = G(x_k, W_{k|k−1}),    (7.81)
               d̂_k = Σ_{i=0}^{2L} W_i^{(m)} D_{i,k|k−1};    (7.82)
    option 2:  d̂_k = G(x_k, ŵ⁻_k),    (7.83)

and the measurement-update equations are
    P_{d̃_k d̃_k} = Σ_{i=0}^{2L} W_i^{(c)} (D_{i,k|k−1} − d̂_k)(D_{i,k|k−1} − d̂_k)^T + R^e_k,    (7.84)
    P_{w_k d_k} = Σ_{i=0}^{2L} W_i^{(c)} (W_{i,k|k−1} − ŵ⁻_k)(D_{i,k|k−1} − d̂_k)^T,    (7.85)
    K_k = P_{w_k d_k} P⁻¹_{d̃_k d̃_k},    (7.86)
    ŵ_k = ŵ⁻_k + K_k (d_k − d̂_k),    (7.87)
    P_{w_k} = P⁻_{w_k} − K_k P_{d̃_k d̃_k} K_k^T,    (7.88)

where γ = √(L + λ); λ is the composite scaling parameter, L is the dimension of the state, R^r is the process-noise covariance, R^e is the measurement-noise covariance, and W_i are the weights as calculated in Eq. (7.34).

further explanation. In the state-space approach to parameter estimation, absolute convergence is achieved when the parameter covariance P_{w_k} goes to zero (this also forces the Kalman gain to zero). At this point, the output for either option is identical. However, prior to this, the finite covariance provides a form of averaging on the output of the function, which in turn prevents the parameters from going to the minimum of the error surface. Thus, the method may help avoid falling into a local minimum. Furthermore, it provides a form of built-in regularization for short or noisy data


sets that are prone to overfitting (exact specification of the level of regularization requires further study).

Note that the complexity of the UKF algorithm is still of order L^3 (where L is the number of parameters), owing to the need to compute a matrix square root at each time step. An order-L^2 complexity (the same as the EKF) can be achieved by using a recursive square-root formulation, as given in Appendix B.

7.4.1 Parameter Estimation Examples

We have performed a number of experiments to illustrate the performance of the UKF parameter-estimation approach. The first set of experiments corresponds to benchmark problems for neural network training, and serves to illustrate some of the differences between the EKF and UKF, as well as the different options discussed above. Two parametric optimization problems are also included, corresponding to model estimation of the double pendulum and the benchmark "Rosenbrock's Banana" optimization problem.

Benchmark NN Regression and Time-Series Problems   The MacKay robot-arm dataset [24, 25] and the Ikeda chaotic time series [26] are used as benchmark problems to compare neural network training. Figure 7.8 illustrates the differences in learning curves for the EKF versus the UKF (option 1). Note the slightly lower final MSE performance of the UKF weight training. If option 2 for the UKF output is used (see Eq. (7.83)), then the learning curves for the EKF and UKF are indistinguishable; this has been found to be consistent across all experiments, so we shall not show explicit learning curves for the UKF with option 2.

Figure 7.9 illustrates performance differences based on the choice of process-noise covariance R_k^r. The Mackey–Glass and Ikeda time series are used.
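The Table 7.5 recursion is compact enough to sketch directly. The following is a minimal NumPy illustration of a single time/measurement update for a model with scalar output (our own sketch, not the authors' code; the function name `ukf_param_step`, the default scaling values, and the scalar-output simplification are assumptions):

```python
import numpy as np

def ukf_param_step(w_hat, P_w, x, d, G, Rr, Re, alpha=1e-3, beta=2.0, kappa=0.0):
    """One step of the Table 7.5 UKF parameter-estimation recursion.
    w_hat: (L,) parameter mean; P_w: (L, L) covariance; (x, d): current
    input/target pair; G(x, w): model with scalar output; Rr, Re: process-
    and measurement-noise covariances."""
    L = len(w_hat)
    lam = alpha**2 * (L + kappa) - L              # composite scaling parameter
    gamma = np.sqrt(L + lam)
    P_w = P_w + Rr                                # time update (Eq. 7.79)
    S = np.linalg.cholesky(P_w)                   # matrix square root
    W = np.vstack([w_hat, w_hat + gamma * S.T, w_hat - gamma * S.T])  # Eq. 7.80
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))                    # Eq. 7.34
    Wc = Wm.copy()
    Wm[0] = lam / (L + lam)
    Wc[0] = lam / (L + lam) + (1.0 - alpha**2 + beta)
    D = np.array([G(x, Wi) for Wi in W])          # propagate sigma points (Eq. 7.81)
    d_hat = Wm @ D                                # option-1 output (Eq. 7.82)
    Pdd = Wc @ ((D - d_hat) ** 2) + Re            # Eq. 7.84 (scalar output)
    Pwd = (W - w_hat).T @ (Wc * (D - d_hat))      # Eq. 7.85
    K = Pwd / Pdd                                 # Kalman gain (Eq. 7.86)
    w_hat = w_hat + K * (d - d_hat)               # Eq. 7.87
    P_w = P_w - np.outer(K, K) * Pdd              # Eq. 7.88
    return w_hat, P_w
```

Iterating this step over input/target pairs drives `w_hat` toward the parameters that minimize the prediction MSE.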
The plots show only comparisons for the UKF (the differences are similar for the EKF). In general, the Robbins–Monro method is the most robust approach, with the fastest rate of convergence. In some examples, we have seen faster convergence with the "annealed" approach; however, this also requires additional insight and heuristic methods to monitor the learning. We should reiterate that the "fixed" and "lambda" approaches are more appropriate for on-line tracking problems.

Four-Regions Classification   In the next example, we consider a benchmark pattern classification problem having four interlocking regions


Figure 7.8  (a) MacKay robot-arm problem: comparison of learning curves for EKF and UKF training; 2-12-2 MLP; annealing noise estimation. (b) Ikeda chaotic time series: comparison of learning curves for EKF and UKF training; 10-7-1 MLP; Robbins–Monro noise estimation.

[8]. A three-layer feedforward network (MLP) with 2-10-10-4 nodes is trained using inputs randomly drawn within the pattern space, S = [-1, 1] \times [-1, 1], with a desired output value of +0.8 if the pattern fell within the assigned region and -0.8 otherwise. Figure 7.10 illustrates the classification task, the learning curves for the UKF and EKF, and the final classification regions. For the learning curves, each epoch represents 100 randomly drawn input samples. The test set evaluated at each epoch corresponds to a uniform grid of 10,000 points. Again, we see the superior performance of the UKF.

Double Inverted Pendulum   Returning to the double inverted pendulum (Section 7.3.1), we consider learning the system parameters, w = [l_1, l_2, m_1, m_2, M]. These parameter values are treated as unknown (all initialized to 1.0). The full state, x = [x, \dot{x}, \theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2], is observed.


Figure 7.9  Neural network parameter estimation using different methods for noise estimation: training-set error (MSE) versus epochs for the "fixed", "lambda", "anneal", and Robbins–Monro methods. (a) Ikeda chaotic time series. (b) Mackey–Glass chaotic time series. (UKF settings: \alpha = 10^{-4}, \beta = 2, \kappa = 3 - L, where L is the state dimension.)

Figure 7.11 shows the total model MSE versus iteration, comparing the EKF with the UKF. Each iteration represents a pendulum crash with different initial conditions for the state (no control is applied). The final converged parameter estimates are as follows:

              l_1    l_2    m_1    m_2    M
True model    0.50   0.75   0.75   0.50   1.50
UKF estimate  0.50   0.75   0.75   0.50   1.49
EKF estimate  0.50   0.75   0.68   0.45   1.35

In this case, the EKF has converged to a biased solution, possibly corresponding to a local minimum in the error surface.


Figure 7.10  Singhal and Wu's four-region classification problem. (a) True mapping. (b) Learning curves on the test set. (c) NN classification: EKF-trained. (d) NN classification: UKF-trained. (UKF settings: \alpha = 10^{-4}, \beta = 2, \kappa = 3 - L, where L is the state dimension; 2-10-10-4 MLP; Robbins–Monro; 1 epoch = 100 random examples.)

Figure 7.11  Inverted double pendulum parameter estimation: model MSE versus iteration for the EKF and UKF. (UKF settings: \alpha = 10^{-4}, \beta = 2, \kappa = 3 - L, where L is the state dimension; Robbins–Monro.)


Rosenbrock's Banana Function   For the last parameter-estimation example, we turn to a pure optimization problem. The Banana function [27] can be thought of as a two-dimensional surface with a saddle-like curvature that bends around the origin. Specifically, we wish to find the values of x_1 and x_2 that minimize the function

f(x_1, x_2) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2.   (7.91)

The true minimum is at x_1 = 1 and x_2 = 1. The Banana function is a well-known test problem used to compare the convergence rates of competing minimization techniques.

In order to use the UKF or EKF, the basic parameter-estimation equations need to be reformulated to minimize a non-MSE cost function. To do this, we write the state-space equations in observed-error form [28]:

w_k = w_{k-1} + r_k,   (7.92)
0 = -\varepsilon_k + e_k,   (7.93)

where the target "observation" is fixed at zero, and \varepsilon_k is an error term resulting in the optimization of the sum of instantaneous costs J_k = \varepsilon_k^T \varepsilon_k. The MSE cost is optimized by setting \varepsilon_k = d_k - G(x_k, w_k). However, arbitrary costs (e.g., cross-entropy) can also be minimized simply by specifying \varepsilon_k appropriately. Further discussion of this approach has been given in Chapter 5. Reformulation of the UKF equations requires changing only the effective output to \varepsilon_k and setting the desired response to zero.

For the example at hand, we set \varepsilon_k = [10(x_2 - x_1^2) \quad 1 - x_1]^T. Furthermore, since this optimization problem is a special case of "noiseless" parameter estimation, where the actual error can be minimized to zero, we make use of Eq. (7.90) (option 2) to calculate the output of the UKF algorithm. This allows the UKF to reach the true minimum of the error surface more rapidly.
We also set the scaling parameter \alpha to a small value, which we have found to be appropriate for zero-MSE problems.^9 Under these circumstances, the performances of the UKF and EKF are indistinguishable, as illustrated in Figure 7.12. Overall, the performances

^9 Note that the use of option 1, where the expected value of the function is used as the output, essentially involves averaging of the output based on the current parameter covariance. This slows convergence in the case where zero MSE is possible, since convergence of the state covariance to zero would also be necessary, through proper annealing of the state-noise innovations R^r.
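The observed-error reformulation above is easy to make concrete. The following small sketch (the function names are our own) encodes the Banana cost of Eq. (7.91) as the error term \varepsilon_k of Eq. (7.93):

```python
import numpy as np

def banana_error(w):
    """Error term eps for the observed-error form: the 'observation' is fixed
    at zero, and eps is chosen so that eps^T eps equals the cost of Eq. (7.91)."""
    x1, x2 = w
    return np.array([10.0 * (x2 - x1**2), 1.0 - x1])

def banana_cost(w):
    eps = banana_error(w)
    return float(eps @ eps)   # = 100(x2 - x1^2)^2 + (1 - x1)^2
```

Any filter step that drives `banana_error` toward the zero "observation" therefore descends the Banana surface.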


Figure 7.12  Rosenbrock's "Banana" optimization problem: function value and model error versus iteration k for the EKF and UKF. (a) Function value. (b) Model error. (UKF settings: \alpha = 10^{-4}, \beta = 2, \kappa = 3 - L, where L is the state dimension; "fixed" noise estimation.)

of the two filters are comparable to, or superior to, those of a number of alternative optimization approaches (e.g., Davidon–Fletcher–Powell, Levenberg–Marquardt, etc.; see "optdemo" in Matlab). The main purpose of this example was to illustrate the versatility of the UKF in general optimization problems.

7.5 UKF DUAL ESTIMATION

Recall that the dual estimation problem consists of simultaneously estimating the clean state x_k and the model parameters w from the noisy data y_k (see Eq. (7.7)). A number of algorithmic approaches exist for this problem, including joint and dual EKF methods (recursive-prediction-error and maximum-likelihood versions), and expectation–maximization (EM) approaches. A thorough coverage of these algorithms


is given in Chapters 5 and 6. In this section, we present results for the dual UKF (prediction-error) and joint UKF methods.

In the dual extended Kalman filter [29], a separate state-space representation is used for the signal and for the weights. Two EKFs are run simultaneously, for signal and weight estimation. At every time step, the current estimate of the weights is used in the signal filter, and the current estimate of the signal state is used in the weight filter. In the dual UKF algorithm, both state and weight estimation are done with the UKF.

In the joint extended Kalman filter [30], the signal-state and weight vectors are concatenated into a single joint state vector, [x_k^T \; w_k^T]^T. Estimation is done recursively by writing the state-space equations for the joint state as

[x_{k+1}; w_{k+1}] = [F(x_k, u_k, w_k); I w_k] + [B v_k; r_k],   (7.94)
y_k = [1 \; 0 \; \cdots \; 0] [x_k; w_k] + n_k,   (7.95)

and running an EKF on the joint state space to produce simultaneous estimates of the states x_k and w. Again, our approach is to use the UKF instead of the EKF.

7.5.1 Dual Estimation Experiments

Noisy Time Series   We present results on two time series to provide a clear illustration of the use of the UKF over the EKF. The first series is again the Mackey–Glass-30 chaotic series with additive noise (SNR \approx 3 dB). The second time series (also chaotic) comes from an autoregressive neural network with random weights, driven by Gaussian process noise and also corrupted by additive white Gaussian noise (SNR \approx 3 dB). A standard 6-10-1 MLP with tanh hidden activation functions and a linear output layer was used for all the filters in the Mackey–Glass problem.
A 5-3-1 MLP was used for the second problem. The process- and measurement-noise variances associated with the state were assumed to be known. Note that, in contrast to the state-estimation example in the previous section, only the noisy time series is observed; a clean reference is never provided for training.

Example training curves for the different dual and joint Kalman-based estimation methods are shown in Figure 7.13. A final estimate for the Mackey–Glass series is also shown for the dual UKF. The superior performance of the UKF-based algorithms is clear.
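The joint formulation of Eqs. (7.94)-(7.95) amounts to bookkeeping: stack the signal state and the weights, give the weights a random-walk dynamic, and observe only the first state component. A minimal sketch (the helper name and calling convention are our own):

```python
import numpy as np

def make_joint_dynamics(F, n_x, n_w):
    """Build the joint transition and observation maps of Eqs. (7.94)-(7.95).
    F(x, u, w): signal dynamics; n_x, n_w: state and weight dimensions."""
    def joint_F(z, u, v, r):
        x, w = z[:n_x], z[n_x:]
        # Signal follows F with process noise v; weights follow a random walk.
        return np.concatenate([F(x, u, w) + v, w + r])
    def joint_H(z):
        return z[0]          # y_k = [1 0 ... 0] [x; w] + n_k
    return joint_F, joint_H
```

A single UKF (or EKF) is then run on the concatenated vector, producing simultaneous state and weight estimates.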


Figure 7.13  Comparative learning curves and results for the dual estimation experiments: normalized MSE versus epoch for the dual UKF, dual EKF, joint UKF, and joint EKF. Curves are averaged over 10 and 3 runs, respectively, using different initial weights. "Fixed" innovation covariances are used in the joint algorithms; "annealed" covariances are used for the weight filter in the dual algorithms. (a) Chaotic AR neural network. (b) Mackey–Glass chaotic time series. (c) Estimation of the Mackey–Glass time series: dual UKF.


Figure 7.14  Mass-and-spring system.

Mode Estimation   This example illustrates the use of the joint UKF for estimating the modes of a mass-and-spring system (see Fig. 7.14). This work was performed at the University of Washington by Mark Campbell and Shelby Brunke. While the system is linear, direct estimation of the natural frequencies \omega_1 and \omega_2 jointly with the states is a nonlinear estimation problem. Figure 7.15 compares the performance of the EKF and UKF. Note that the EKF does not converge to the true value of \omega_2. For this experiment, the input process-noise SNR is approximately 100 dB, and the measured positions y_1 and y_2 have additive noise at a 60 dB SNR (these settings effectively turn the task into a pure parameter-estimation

Figure 7.15  Linear mode prediction.


problem). A fixed innovations covariance R^r was used for the parameter estimation in the joint algorithms. Sampling was done at the Nyquist rate (based on \omega_2), which emphasizes the effect of linearization in the EKF. For faster sampling rates, the performances of the EKF and UKF become more similar.

F15 Flight Simulation   In this example (also performed at the University of Washington), joint estimation is done on an F15 aircraft model [31]. The simulation includes vehicle nonlinear dynamics, engine and sensor noise modeling, and atmospheric modeling (densities, pressure, etc.) based on look-up tables. Also incorporated are aerodynamic forces based on data from Wright-Patterson AFB. A closed-loop system using a gain-scheduled TECS controller is used to control the model [32]. A simulated mission was used to test the UKF estimator; it involved a quick descent, a short tactical run, a 180° turn, and an ascent, with a possible failure in the stabilator (the horizontal control surface on the tail of the aircraft). Measurements consisted of the states with additive noise (20 dB SNR). Turbulence was approximately 1 m/s RMS. During the mission, the joint UKF estimated the 12 states (positions, orientations, and their derivatives) as well as parameters corresponding to aerodynamic forces and moments. This was done "off-line"; that is, the estimated states were not used within the control loop. Illustrative results are shown in Figure 7.16 for estimation of the altitude, velocity, and lift parameter (the overall lift force on the aircraft). The left column shows the mission without a failure. The right column includes a 50% stabilator failure at 65 seconds.
Note that even with this failure, the UKF is still capable of tracking the state and parameters. It should be pointed out that the "black-box" nature of the simulator was not conducive to taking the Jacobians necessary for running the EKF; hence, an EKF implementation was not performed for comparison.

Double Inverted Pendulum   For the final dual estimation example, we again consider the double inverted pendulum, but this time we estimate both the states and the system parameters using the joint UKF. Observations correspond to noisy measurements of the six states. The estimated states are then fed back for closed-loop control. In addition, the parameter estimates are used at every time step to design the controller using the SDRE approach. Figure 7.17 illustrates the performance of this adaptive control system by showing the evolution of the estimated and actual states. At the start of the simulation, both the states and the parameters are unknown (the control system is unstable at this point). However, within one trial, the UKF


Figure 7.16  F15 model joint estimation (note that the estimated and true values of the state are indistinguishable at this resolution).

enables convergence and stabilization of the pendulum without a single crash.

7.6 THE UNSCENTED PARTICLE FILTER

The particle filter is a sequential Monte Carlo method that allows for a complete representation of the state distribution using sequential importance sampling and resampling [33–35]. Whereas the standard EKF and UKF make a Gaussian assumption to simplify the optimal recursive Bayesian estimation (see Section 7.2), particle filters make no assumptions on the form of the probability densities in question; that is, they perform full nonlinear, non-Gaussian estimation. In this section, we present a method that utilizes the UKF to augment and improve the standard particle filter, specifically through generation of the importance proposal distribution. This section will review the background fundamentals necessary to introduce particle filtering, and the extension based on the UKF. The


Figure 7.17  Double inverted pendulum joint estimation: estimated states (a) and parameters (b). Only \theta_1 and \theta_2 are plotted (in radians).

material is based on work done by van der Merwe, de Freitas, Doucet, and Wan in [6], which also provides a more thorough review and treatment of particle filters in general.

Monte Carlo Simulation and Sequential Importance Sampling   Particle filtering is based on Monte Carlo simulation with sequential importance sampling (SIS). The overall goal is to directly implement optimal Bayesian estimation (see Eqs. (7.9)–(7.11)) by recursively approximating the complete posterior state density. In Monte Carlo simulation, a set of weighted particles (samples), drawn from the posterior distribution, is used to map integrals to discrete sums. More precisely, the posterior filtering density can be approximated by the following empirical estimate:

\hat{p}(x_k | Y_0^k) = \frac{1}{N} \sum_{i=1}^N \delta(x_k - X_k^{(i)}),


where the random samples \{X_k^{(i)}; i = 1, \ldots, N\} are drawn from p(x_k | Y_0^k), and \delta(\cdot) denotes the Dirac delta function. The posterior filtering density p(x_k | Y_0^k) is a marginal of the full posterior density, p(X_0^k | Y_0^k). Consequently, any expectation of the form

E(g(x_k)) = \int g(x_k) p(x_k | Y_0^k) \, dx_k   (7.96)

may be approximated by the following estimate:

E(g(x_k)) \approx \frac{1}{N} \sum_{i=1}^N g(X_k^{(i)}).   (7.97)

For example, letting g(x) = x yields the optimal MMSE estimate \hat{x}_k = E[x_k | Y_0^k]. The particles X_k^{(i)} are assumed to be independent and identically distributed (i.i.d.) for the approximation to hold. As N goes to infinity, the estimate converges to the true expectation almost surely. Sampling from the filtering posterior is only a special case of Monte Carlo simulation, which in general deals with the complete posterior density, p(X_0^k | Y_0^k). We shall use this more general form to derive the particle filter algorithm.

It is often impossible to sample directly from the posterior density function. However, we can circumvent this difficulty by making use of importance sampling and alternatively sampling from a known proposal distribution q(X_0^k | Y_0^k). The exact form of this distribution is a critical design issue, and it is usually chosen in order to facilitate easy sampling. The details of this are discussed later.
Given this proposal distribution, we can make use of the following substitution:

E(g_k(X_0^k)) = \int g_k(X_0^k) \frac{p(X_0^k | Y_0^k)}{q(X_0^k | Y_0^k)} q(X_0^k | Y_0^k) \, dX_0^k
             = \int g_k(X_0^k) \frac{p(Y_0^k | X_0^k) p(X_0^k)}{p(Y_0^k) q(X_0^k | Y_0^k)} q(X_0^k | Y_0^k) \, dX_0^k
             = \int g_k(X_0^k) \frac{w_k(X_0^k)}{p(Y_0^k)} q(X_0^k | Y_0^k) \, dX_0^k,

where the variables w_k(X_0^k) are known as the unnormalized importance weights,

w_k = \frac{p(Y_0^k | X_0^k) p(X_0^k)}{q(X_0^k | Y_0^k)}.   (7.98)
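The weight identity of Eq. (7.98) translates directly into a self-normalized importance-sampling estimator. A sketch with our own function names; the densities are passed as log-density handles known only up to a constant, since constants cancel in the normalization:

```python
import numpy as np

def importance_estimate(g, samples, log_p, log_q):
    """Estimate E_p[g(x)] from samples drawn from the proposal q, using
    normalized importance weights as in Eq. (7.98)."""
    w = np.exp(log_p(samples) - log_q(samples))   # unnormalized weights
    w_tilde = w / w.sum()                         # normalized weights
    return np.sum(w_tilde * g(samples)), w_tilde
```

For example, the mean of a target density can be estimated from samples of a broader proposal, at the cost of increased variance when the overlap between the two is poor.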


We can eliminate the unknown normalizing density p(Y_0^k) as follows:

E(g_k(X_0^k)) = \frac{1}{p(Y_0^k)} \int g_k(X_0^k) w_k(X_0^k) q(X_0^k | Y_0^k) \, dX_0^k
             = \frac{\int g_k(X_0^k) w_k(X_0^k) q(X_0^k | Y_0^k) \, dX_0^k}{\int \frac{p(Y_0^k | X_0^k) p(X_0^k)}{q(X_0^k | Y_0^k)} q(X_0^k | Y_0^k) \, dX_0^k}
             = \frac{\int g_k(X_0^k) w_k(X_0^k) q(X_0^k | Y_0^k) \, dX_0^k}{\int w_k(X_0^k) q(X_0^k | Y_0^k) \, dX_0^k}
             = \frac{E_{q(\cdot | Y_0^k)}(w_k(X_0^k) g_k(X_0^k))}{E_{q(\cdot | Y_0^k)}(w_k(X_0^k))},

where the notation E_{q(\cdot | Y_0^k)} has been used to emphasize that the expectations are taken over the proposal distribution q(\cdot | Y_0^k).

A sequential update to the importance weights is achieved by expanding the proposal distribution as q(X_0^k | Y_0^k) = q(X_0^{k-1} | Y_0^{k-1}) q(x_k | X_0^{k-1}, Y_0^k), where we are making the assumption that the current state is not dependent on future observations. Furthermore, under the assumptions that the states correspond to a Markov process and that the observations are conditionally independent given the states, we can arrive at the recursive update

w_k = w_{k-1} \frac{p(y_k | x_k) p(x_k | x_{k-1})}{q(x_k | X_0^{k-1}, Y_0^k)}.   (7.99)

Equation (7.99) provides a mechanism to sequentially update the importance weights, given an appropriate choice of proposal distribution, q(x_k | X_0^{k-1}, Y_0^k). Since we can sample from the proposal distribution and evaluate the likelihood p(y_k | x_k) and the transition probabilities p(x_k | x_{k-1}), all we need to do is generate a prior set of samples and iteratively compute the importance weights. This procedure then allows us to evaluate the expectations of interest by the following estimate:

E(g(X_0^k)) \approx \frac{\frac{1}{N} \sum_{i=1}^N g(X_{0:k}^{(i)}) w_k(X_{0:k}^{(i)})}{\frac{1}{N} \sum_{i=1}^N w_k(X_{0:k}^{(i)})} = \sum_{i=1}^N g(X_{0:k}^{(i)}) \tilde{w}_k(X_{0:k}^{(i)}),   (7.100)


where the normalized importance weights are given by \tilde{w}_k^{(i)} = w_k^{(i)} / \sum_{j=1}^N w_k^{(j)}, and X_{0:k}^{(i)} denotes the ith sample trajectory drawn from the proposal distribution q(x_k | X_0^{k-1}, Y_0^k). This estimate asymptotically converges if the expectation and variance of g(X_0^k) and w_k exist and are bounded, and if the support of the proposal distribution includes the support of the posterior distribution. Thus, as N tends to infinity, the posterior density function can be approximated arbitrarily well by the point-mass estimate

\hat{p}(X_0^k | Y_0^k) = \sum_{i=1}^N \tilde{w}_k^{(i)} \delta(X_0^k - X_{0:k}^{(i)}),   (7.101)

and the posterior filtering density by

\hat{p}(x_k | Y_0^k) = \sum_{i=1}^N \tilde{w}_k^{(i)} \delta(x_k - X_k^{(i)}).   (7.102)

In the case of filtering, we do not need to keep the whole history of the sample trajectories: only the current set of samples at time k is needed to calculate expectations of the form given in Eqs. (7.96) and (7.97). To do this, we simply set g(X_0^k) = g(x_k). These point-mass estimates can approximate any general distribution arbitrarily well, limited only by the number of particles used and by how well the above-mentioned importance-sampling conditions are met. In contrast, the posterior distribution calculated by the EKF is a minimum-variance Gaussian approximation to the true distribution, which inherently cannot capture complex structure such as multimodality, skewness, or other higher-order moments.

Resampling and MCMC Step   The sequential importance sampling (SIS) algorithm discussed so far has a serious limitation: the variance of the importance weights increases stochastically over time. Typically, after a few iterations, one of the normalized importance weights tends to unity, while the remaining weights tend to zero.
A large number of samples arethus effectively removed from the sample set because their importanceweights become numerically insignificant. To avoid this <strong>de</strong>generacy, aresampling or selection stage may be used to eliminate samples with lowimportance weights <strong>and</strong> multiply samples with high importance weights.This is often followed by a Markov-chain Monte Carlo (MCMC) movestep, which introduces sample variety without affecting the posteriordistribution they represent.
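The selection stage just described can be sketched in a few lines for the simplest scheme, in which particles are redrawn in proportion to their normalized weights (the function name is our own):

```python
import numpy as np

def sir_resample(particles, weights, rng):
    """Map the weighted measure {x^(i), w~^(i)} to an equally weighted one by
    drawing N indices with probabilities w~^(i)."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)   # multiply/suppress by weight
    return particles[idx], np.full(N, 1.0 / N)
```

Particles with large weights are duplicated, while those with negligible weights are likely to disappear from the new set.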


A selection scheme associates with each particle X_k^{(i)} a number of "children," N_i, such that \sum_{i=1}^N N_i = N. Several selection schemes have been proposed in the literature, including sampling-importance resampling (SIR) [36–38], residual resampling [25, 39], and minimum-variance sampling [34].

Sampling-importance resampling involves mapping the Dirac random measure \{X_k^{(i)}, \tilde{w}_k^{(i)}\} into an equally weighted random measure \{X_k^{(j)}, N^{-1}\}. In other words, we produce N new samples, all with weight 1/N. This can be accomplished by sampling uniformly from the discrete set \{X_k^{(i)}; i = 1, \ldots, N\} with probabilities \{\tilde{w}_k^{(i)}; i = 1, \ldots, N\}. Figure 7.18 gives a graphical representation of this process. This procedure effectively replicates the original X_k^{(i)} particle N_i times (N_i may be zero).

In residual resampling [25, 39], a two-step process is used that makes use of SIR. In the first step, the number of children is set deterministically using the floor function, N_i^A = \lfloor N \tilde{w}_k^{(i)} \rfloor, and each X_k^{(i)} particle is replicated N_i^A times. In the second step, SIR is used to select the remaining \bar{N}_k = N - \sum_{i=1}^N N_i^A samples, with new weights w_k'^{(i)} = \bar{N}_k^{-1} (\tilde{w}_k^{(i)} N - N_i^A). These samples form a second set of children, N_i^B, such that \bar{N}_k = \sum_{i=1}^N N_i^B, and are drawn as described previously. The total number of children of each particle is then set to N_i = N_i^A + N_i^B. This procedure is computationally cheaper than pure SIR, and also has lower sample variance. Thus, residual resampling is used for all experiments in Section 7.6.2 (in general, we have found that the specific choice of resampling scheme does not significantly affect the performance of the particle filter).

After the selection/resampling step at time k, we obtain N particles distributed approximately according to the posterior distribution.
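The two-step residual scheme described above can be sketched as follows (our own implementation of that description; it returns the indices of the selected particles):

```python
import numpy as np

def residual_resample(weights, rng):
    """Residual resampling: deterministic copies via the floor function,
    then SIR on the remaining (residual) weights."""
    N = len(weights)
    n_copies = np.floor(N * weights).astype(int)   # N_i^A
    residual = N * weights - n_copies              # proportional to the new weights
    n_rest = N - n_copies.sum()                    # samples still to be drawn
    idx = np.repeat(np.arange(N), n_copies)        # deterministic replication
    if n_rest > 0:
        residual = residual / residual.sum()
        idx = np.concatenate([idx, rng.choice(N, size=n_rest, p=residual)])
    return idx
```

Only the residual part is random, which is what reduces the sample variance relative to pure SIR.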
Since the selection step favors the creation of multiple copies of the "fittest"

Figure 7.18  Resampling process, whereby a random measure \{x_k^{(i)}, \tilde{w}_k^{(i)}\} is mapped into an equally weighted random measure \{x_k^{(j)}, N^{-1}\}. The index i is drawn from a uniform distribution.


particle, many particles may end up having no children (N_i = 0), whereas others might end up having a large number of children, the extreme case being N_i = N for a particular value of i. In this case, there is a severe depletion of samples. Therefore, an additional procedure is often required to introduce sample variety after the selection step without affecting the validity of the approximation inferred. This is achieved by performing a single MCMC step on each particle. The basic idea is that if the particles are already distributed according to the posterior p(x_k | Y_0^k) (which is the case), then applying a Markov-chain transition kernel with the same invariant distribution to each particle results in a set of new particles distributed according to the posterior of interest. However, the new particles may move to more interesting areas of the state space. Details of the MCMC step are given in [6]. For our experiments in Section 7.6.2, we found an MCMC step to be unnecessary; however, this cannot be assumed in general.

7.6.1 The Particle Filter Algorithm

The pseudocode of a generic particle filter is presented in Table 7.6. In implementing this algorithm, the choice of the proposal distribution q(x_k | X_0^{k-1}, Y_0^k) is the most critical design issue. The optimal proposal distribution (which minimizes the variance of the importance weights) is given by [40–43]

q(x_k | X_0^{k-1}, Y_0^k) = p(x_k | X_0^{k-1}, Y_0^k),   (7.103)

that is, the true conditional state density given the previous state history and all observations. Sampling from this is, of course, impractical for arbitrary densities (recall the motivation for using importance sampling in the first place).
Consequently, the transition prior is the most popular choice of proposal distribution [35, 44–47]:^{10}

q(x_k | X_0^{k-1}, Y_0^k) \doteq p(x_k | x_{k-1}).   (7.104)

For example, if an additive Gaussian process-noise model is used, the transition prior is simply

p(x_k | x_{k-1}) = n(F(x_{k-1}, 0), R_{k-1}^v).   (7.105)

^{10} The notation \doteq denotes "chosen as," to indicate a subtle difference from an approximation.
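With the transition prior of Eq. (7.104) as the proposal, the weight update collapses to multiplication by the likelihood p(y_k | x_k), giving the classic "bootstrap" form of the Table 7.6 algorithm. A compact sketch (function names are our own; SIR resampling is applied at every step for simplicity):

```python
import numpy as np

def bootstrap_pf(y_seq, N, sample_prior, sample_trans, lik, rng):
    """Generic particle filter with the transition-prior proposal.
    sample_prior(n): draw n samples from p(x_0); sample_trans(x): draw from
    p(x_k | x_{k-1}) for each particle; lik(y, x): evaluate p(y_k | x_k)."""
    x = sample_prior(N)
    estimates = []
    for y in y_seq:
        x = sample_trans(x)                # importance sampling from the prior
        w = lik(y, x)                      # weights reduce to the likelihood
        w = w / w.sum()                    # normalize
        estimates.append(np.sum(w * x))    # MMSE estimate (weighted average)
        x = x[rng.choice(N, size=N, p=w)]  # SIR resampling
    return np.array(estimates)
```

On a scalar model with informative observations, the filtered estimates track the observed sequence closely.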


Table 7.6  Algorithm for the generic particle filter

1. Initialization: k = 0.
   - For i = 1, \ldots, N, draw the states X_0^{(i)} from the prior p(x_0).
2. For k = 1, 2, \ldots:
   (a) Importance sampling step
       - For i = 1, \ldots, N, sample X_k^{(i)} \sim q(x_k | X_{0:k-1}^{(i)}, Y_0^k).
       - For i = 1, \ldots, N, evaluate the importance weights up to a normalizing constant:

         w_k^{(i)} = w_{k-1}^{(i)} \frac{p(y_k | X_k^{(i)}) p(X_k^{(i)} | X_{k-1}^{(i)})}{q(X_k^{(i)} | X_{0:k-1}^{(i)}, Y_0^k)}.   (7.106)

       - For i = 1, \ldots, N, normalize the importance weights: \tilde{w}_k^{(i)} = w_k^{(i)} / \sum_{j=1}^N w_k^{(j)}.
   (b) Selection step (resampling)
       - Multiply/suppress samples X_k^{(i)} with high/low importance weights \tilde{w}_k^{(i)}, respectively, to obtain N random samples approximately distributed according to p(x_k | Y_0^k).
       - For i = 1, \ldots, N, set w_k^{(i)} = \tilde{w}_k^{(i)} = N^{-1}.
   (c) MCMC move step (optional)
   (d) Output: The output of the algorithm is a set of samples that can be used to approximate the posterior distribution as follows:

       \hat{p}(x_k | Y_0^k) = \frac{1}{N} \sum_{i=1}^N \delta(x_k - X_k^{(i)}).

       The optimal MMSE estimator is given as

       \hat{x}_k = E(x_k | Y_0^k) \approx \frac{1}{N} \sum_{i=1}^N X_k^{(i)}.

       Similar expectations of the function g(x_k) can also be calculated as a sample average.

The effectiveness of this approximation depends on how close the proposal distribution is to the true posterior distribution. If there is not sufficient overlap, only a few particles will have significant importance weights when their likelihoods are evaluated.

The EKF and UKF Particle Filter   An improvement in the choice of proposal distribution over the simple transition prior, which also addresses the problem of sample depletion, can be accomplished by moving the


Figure 7.19  Including the most current observation in the proposal distribution allows us to move the samples in the prior to regions of high likelihood. This is of paramount importance if the likelihood happens to lie in one of the tails of the prior distribution, or if it is too narrow (low measurement error).

particles towards the regions of high likelihood, based on the most recent observations y_k (see Fig. 7.19). An effective approach to accomplishing this is to use an EKF-generated Gaussian approximation to the optimal proposal, that is,

q(x_k | X_0^{k-1}, Y_0^k) \doteq q_n(x_k | Y_0^k),   (7.107)

which is accomplished by using a separate EKF to generate and propagate a Gaussian proposal distribution for each particle,

q_n(x_k^{(i)} | Y_0^k) = n(\bar{x}_k^{(i)}, P_k^{(i)}),   i = 1, \ldots, N.   (7.108)

That is, at time k one uses the EKF equations, with the new data, to compute the mean and covariance of the importance distribution for each particle from the previous time step k - 1. Next, we redraw the ith particle (at time k) from this new updated distribution. While still making a Gaussian assumption, this approach provides a better approximation to the optimal conditional proposal distribution, and it has been shown to improve performance on a number of applications [33, 48].

By replacing the EKF with the UKF, we can more accurately propagate the mean and covariance of the Gaussian approximation to the state distribution. Distributions generated by the UKF will have a greater support overlap with the true posterior distribution than the overlap achieved by the EKF estimates. In addition, the scaling parameters used for sigma-point selection can be optimized to capture certain characteristics of the prior distribution, if known; for example, the algorithm can be modified to work with distributions that have heavier tails than Gaussians, such as Cauchy or Student-t distributions.
The new filter that results from usinga UKF for proposal distribution generation within a particle filter frameworkis called the unscented particle filter (UPF). Referring to the


algorithm in Table 7.6 for the generic particle filter, the first item in the importance sampling step,

- For $i = 1, \ldots, N$, sample $X_k^{(i)} \sim q(x_k \mid x_{0:k-1}^{(i)}, Y_0^k)$,

is replaced with the following UKF update:

- For $i = 1, \ldots, N$:
  – Update the prior ($k-1$) distribution for each particle with the UKF:
    Calculate sigma points:
    $$X_{k-1}^{(i)a} = \left[ \bar{x}_{k-1}^{(i)a}, \ \bar{x}_{k-1}^{(i)a} + \gamma\sqrt{P_{k-1}^{(i)a}}, \ \bar{x}_{k-1}^{(i)a} - \gamma\sqrt{P_{k-1}^{(i)a}} \right]. \quad (7.109)$$
    Propagate particle into future (time update):
    $$X_{k|k-1}^{(i)x} = F(X_{k-1}^{(i)x}, u_k, X_{k-1}^{(i)v}), \qquad \bar{x}_{k|k-1}^{(i)} = \sum_{j=0}^{2L} W_j^{(m)} X_{j,k|k-1}^{(i)x}, \quad (7.110)$$
    $$P_{k|k-1}^{(i)} = \sum_{j=0}^{2L} W_j^{(c)} \left(X_{j,k|k-1}^{(i)x} - \bar{x}_{k|k-1}^{(i)}\right)\left(X_{j,k|k-1}^{(i)x} - \bar{x}_{k|k-1}^{(i)}\right)^T, \quad (7.111)$$
    $$Y_{k|k-1}^{(i)} = H(X_{k|k-1}^{(i)x}, X_{k-1}^{(i)n}), \qquad \bar{y}_{k|k-1}^{(i)} = \sum_{j=0}^{2L} W_j^{(m)} Y_{j,k|k-1}^{(i)}. \quad (7.112)$$
    Incorporate new observation (measurement update):
    $$P_{\tilde{y}_k \tilde{y}_k} = \sum_{j=0}^{2L} W_j^{(c)} \left(Y_{j,k|k-1}^{(i)} - \bar{y}_{k|k-1}^{(i)}\right)\left(Y_{j,k|k-1}^{(i)} - \bar{y}_{k|k-1}^{(i)}\right)^T, \quad (7.113)$$
    $$P_{x_k y_k} = \sum_{j=0}^{2L} W_j^{(c)} \left(X_{j,k|k-1}^{(i)} - \bar{x}_{k|k-1}^{(i)}\right)\left(Y_{j,k|k-1}^{(i)} - \bar{y}_{k|k-1}^{(i)}\right)^T, \quad (7.114)$$
    $$K_k = P_{x_k y_k} P_{\tilde{y}_k \tilde{y}_k}^{-1}, \qquad \bar{x}_k^{(i)} = \bar{x}_{k|k-1}^{(i)} + K_k \left(y_k - \bar{y}_{k|k-1}^{(i)}\right), \quad (7.115)$$
    $$\hat{P}_k^{(i)} = P_{k|k-1}^{(i)} - K_k P_{\tilde{y}_k \tilde{y}_k} K_k^T. \quad (7.116)$$
  – Sample $X_k^{(i)} \sim q(x_k^{(i)} \mid x_{0:k-1}^{(i)}, Y_0^k) \approx N(\bar{x}_k^{(i)}, \hat{P}_k^{(i)})$.

All other steps in the particle filter formulation remain unchanged.
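For concreteness, the generic algorithm of Table 7.6 can be sketched in NumPy for a toy scalar model, using the simple transition prior as the proposal, so that the weight update (7.106) reduces to the likelihood. The model functions below are illustrative stand-ins, not the filters used in the experiments that follow:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # number of particles

# Illustrative scalar model (an assumption for this sketch):
#   x_k = 0.5 x_{k-1} + v_k,   y_k = x_k + n_k,  with unit-variance Gaussian noise.
def transition(x):
    return 0.5 * x + rng.normal(0.0, 1.0, size=x.shape)

def likelihood(y, x):
    return np.exp(-0.5 * (y - x) ** 2)

# 1. Initialization: draw particles from the prior p(x_0)
particles = rng.normal(0.0, 1.0, size=N)
weights = np.full(N, 1.0 / N)

for y_k in [0.3, -0.1, 0.8]:  # stream of observations
    # (a) Importance sampling: with the transition prior as proposal q,
    #     the weight update (7.106) reduces to the likelihood p(y_k | x_k^(i))
    particles = transition(particles)
    weights *= likelihood(y_k, particles)
    weights /= weights.sum()              # normalize

    # (b) Selection (resampling): multiply/suppress particles according to weight
    idx = rng.choice(N, size=N, p=weights)
    particles = particles[idx]
    weights = np.full(N, 1.0 / N)

    # (d) Output: MMSE estimate as the sample mean of the resampled particles
    x_hat = particles.mean()
```

Multinomial resampling is used here for brevity; the experiments in the text use residual resampling, which has lower variance.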


7.6.2 UPF Experiments

The performance of the UPF is evaluated on two estimation problems. The first problem is a synthetic scalar estimation problem, and the second is a real-world problem concerning the pricing of financial instruments.

Synthetic Experiment   For this experiment, a time series was generated by the following process model:

$$x_{k+1} = 1 + \sin(\omega \pi k) + \phi_1 x_k + v_k, \quad (7.117)$$

where $v_k$ is a Gamma $Ga(3, 2)$ random variable modeling the process noise, and $\omega = 0.04$ and $\phi_1 = 0.5$ are scalar parameters. A nonstationary observation model,

$$y_k = \begin{cases} \phi_2 x_k^2 + n_k, & k \le 30, \\ \phi_3 x_k - 2 + n_k, & k > 30, \end{cases} \quad (7.118)$$

is used, with $\phi_2 = 0.2$ and $\phi_3 = 0.5$. The observation noise, $n_k$, is drawn from a Gaussian distribution $N(0, 0.00001)$. Given only the noisy observations $y_k$, the different filters were used to estimate the underlying clean state sequence $x_k$ for $k = 1, \ldots, 60$. The experiment was repeated 100 times with random re-initialization for each run. All of the particle filters used 200 particles and residual resampling. The UKF parameters were set to $\alpha = 1$, $\beta = 0$, and $\kappa = 2$. These parameters are optimal for the scalar case. Table 7.7 summarizes the performance of the different filters. The table shows the means and variances of the mean-square error (MSE) of the state estimates.
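The data-generating process of Eqs. (7.117)–(7.118) can be sketched as follows. Two details are assumptions in this sketch: the time argument of the sinusoid is taken to be the step index $k$, and $Ga(3, 2)$ is read as shape 3 and scale 2:

```python
import numpy as np

rng = np.random.default_rng(42)
omega, phi1, phi2, phi3 = 0.04, 0.5, 0.2, 0.5
T = 60

x = np.zeros(T + 1)
y = np.zeros(T + 1)
for k in range(T):
    v = rng.gamma(shape=3.0, scale=2.0)          # process noise v_k ~ Ga(3, 2), Eq. (7.117)
    x[k + 1] = 1.0 + np.sin(omega * np.pi * k) + phi1 * x[k] + v
    n = rng.normal(0.0, np.sqrt(1e-5))           # observation noise n_k ~ N(0, 0.00001)
    if k + 1 <= 30:
        y[k + 1] = phi2 * x[k + 1] ** 2 + n      # quadratic regime, Eq. (7.118), k <= 30
    else:
        y[k + 1] = phi3 * x[k + 1] - 2.0 + n     # switched linear regime, k > 30
```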
Table 7.7 State-estimation experiment results: the mean and variance of the MSE were calculated over 100 independent runs

Algorithm                                                MSE mean   MSE variance
Extended Kalman filter (EKF)                             0.374      0.015
Unscented Kalman filter (UKF)                            0.280      0.012
Particle filter: generic                                 0.424      0.053
Particle filter: MCMC move step                          0.417      0.055
Particle filter: EKF proposal                            0.310      0.016
Particle filter: EKF proposal and MCMC move step         0.307      0.015
Particle filter: UKF proposal ("unscented particle
  filter")                                               0.070      0.006
Particle filter: UKF proposal and MCMC move step         0.074      0.008

Figure 7.20 compares the estimates generated from a


single run of the different particle filters. The superior performance of the unscented particle filter (UPF) is clear.

Figure 7.20 Plot of estimates generated by the different filters on the synthetic state-estimation experiment (true $x$, PF estimate, PF-EKF estimate, and PF-UKF estimate over time $k = 1, \ldots, 60$).

Pricing Financial Options   Derivatives are financial instruments whose value depends on some basic underlying cash product, such as interest rates, equity indices, commodities, foreign exchange, or bonds [49]. A call option allows the holder to buy a cash product, at a specified date in the future, for a price determined in advance. The price at which the option is exercised is known as the strike price, while the date on which the option lapses is often referred to as the maturity time. Put options, on the other hand, allow the holder to sell the underlying cash product. In their seminal work [50], Black and Scholes derived the following industry-standard equations for pricing European call and put options:

$$C = S\, n_c(d_1) - X e^{-r t_m} n_c(d_2), \quad (7.119)$$
$$P = -S\, n_c(-d_1) + X e^{-r t_m} n_c(-d_2), \quad (7.120)$$
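Equations (7.119)–(7.120) translate directly into code. The sketch below is a minimal implementation (with $d_1$ and $d_2$ as defined in the text that follows); the numerical inputs are arbitrary illustrative values, not data from the FTSE-100 experiment:

```python
from math import log, sqrt, exp, erf

def norm_cdf(z):
    # cumulative normal distribution n_c(.)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def black_scholes(S, X, r, sigma, t_m):
    """European call and put prices, Eqs. (7.119)-(7.120)."""
    d1 = (log(S / X) + (r + 0.5 * sigma ** 2) * t_m) / (sigma * sqrt(t_m))
    d2 = d1 - sigma * sqrt(t_m)
    C = S * norm_cdf(d1) - X * exp(-r * t_m) * norm_cdf(d2)
    P = -S * norm_cdf(-d1) + X * exp(-r * t_m) * norm_cdf(-d2)
    return C, P

# Illustrative inputs (assumed, not from the experiment)
C, P = black_scholes(S=3100.0, X=3225.0, r=0.06, sigma=0.15, t_m=0.5)
```

A quick sanity check on any such implementation is put-call parity, $C - P = S - X e^{-r t_m}$, which follows from $n_c(z) + n_c(-z) = 1$.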


where $C$ denotes the price of a call option, $P$ the price of a put option, $S$ the current value of the underlying cash product, $X$ the desired strike price, $t_m$ the time to maturity, and $n_c(\cdot)$ the cumulative normal distribution, and $d_1$ and $d_2$ are given by

$$d_1 = \frac{\ln(S/X) + (r + \sigma^2/2)\, t_m}{\sigma \sqrt{t_m}},$$
$$d_2 = d_1 - \sigma \sqrt{t_m},$$

where $\sigma$ is the (unknown) volatility of the cash product and $r$ is the risk-free interest rate.

The volatility, $\sigma$, is usually estimated from a small moving window of data over the most recent 50–180 days [49]. The risk-free interest rate $r$ is often estimated by monitoring interest rates in the bond markets.

Figure 7.21 Probability smile for options on the FTSE-100 index (1994). Although the volatility smile indicates that the option with strike price equal to 3225 is underpriced, the shape of the probability gives us a warning against the hypothesis that the option is under-priced. Posterior mean estimates were obtained with the Black–Scholes model and particle filter, a fourth-order polynomial fit, and hypothesized volatility.

Our approach is to treat $r$ and $\sigma$ as the hidden states, and $C$ and $P$ as the output


observations. $S$ and $t_m$ are treated as known control signals (input observations). This represents a parameter estimation problem, with the nonlinear observation given by Eqs. (7.119) or (7.120). This allows us to compute daily complete probability distributions for $r$ and $\sigma$ and to decide whether the current value of an option in the market is either over-priced or under-priced. See [51] and [52] for details.

As an example, Figure 7.21 shows the implied probability density function of each volatility against several strike prices using five pairs of call and put option contracts on the British FTSE-100 index (from February 1994 to December 1994). Figure 7.22 shows the estimated volatility and interest rate for a contract with a strike price of 3225. In Table 7.8, we compare the one-step-ahead normalized square errors on a pair of options with strike price 2925. The square errors were only measured over the last 100 days of trading, so as to allow the algorithms to converge. The experiment was repeated 100 times with 100 particles in each particle filter (the mean value is reported; all variances were essentially zero). In this example, both the EKF and UKF approaches to improving the proposal distribution lead to a significant improvement over the standard particle filters.

Figure 7.22 Estimated interest rate and volatility.

The main advantage of the UKF over the


Table 7.8 One-step-ahead normalized square errors over 100 runs. The trivial prediction is obtained by assuming that the price on the following day corresponds to the current price

Option type   Algorithm                          Mean NSE
Call          Trivial                            0.078
              Extended Kalman filter (EKF)       0.037
              Unscented Kalman filter (UKF)      0.037
              Particle filter: generic           0.037
              Particle filter: EKF proposal      0.009
              Unscented particle filter          0.009
Put           Trivial                            0.035
              Extended Kalman filter (EKF)       0.023
              Unscented Kalman filter (UKF)      0.023
              Particle filter: generic           0.023
              Particle filter: EKF proposal      0.007
              Unscented particle filter          0.008

EKF is the ease of implementation, which avoids the need to analytically differentiate the Black–Scholes equations.

7.7 CONCLUSIONS

The EKF has been widely accepted as a standard tool in the control and machine-learning communities. In this chapter, we have presented an alternative to the EKF using the unscented Kalman filter. The UKF addresses many of the approximation issues of the EKF, and consistently achieves an equal or better level of performance at a comparable level of complexity. The performance benefits of the UKF-based algorithms have been demonstrated in a number of application domains, including state estimation, dual estimation, and parameter estimation.

There are a number of clear advantages to the UKF. First, the mean and covariance of the state estimate are calculated to second order or better, as opposed to first order in the EKF. This provides for a more accurate implementation of the optimal recursive estimation equations, which is the basis for both the EKF and UKF. While the equations specifying the UKF may appear more complicated than those of the EKF, the actual computational complexity is equivalent. For state estimation, both algorithms are in


general of order $L^3$ (where $L$ is the dimension of the state). For parameter estimation, both algorithms are of order $L^2$ (where $L$ is the number of parameters). An efficient recursive square-root implementation (see Appendix B) was necessary to achieve this level of complexity in the parameter-estimation case. Furthermore, a distinct advantage of the UKF is its ease of implementation. In contrast to the EKF, no analytical derivatives (Jacobians or Hessians) need to be calculated. The utility of this is especially valuable in situations where the system is a "black box" model in which the internal dynamic equations are unavailable. In order to apply an EKF to such systems, derivatives must be found either from a principled analytical re-derivation of the system, or through costly and often inaccurate numerical methods (e.g., by perturbation). In contrast, the UKF relies on only functional evaluations (inputs and outputs) through the use of deterministically drawn samples from the prior distribution of the state random variable. From a coding perspective, this also allows for a much more general and modular implementation.

Even though the UKF has clear advantages over the EKF, there are still a number of limitations. As in the EKF, it makes a Gaussian assumption on the probability density of the state random variable. Often this assumption is valid, and numerous real-world applications have been successfully implemented based on this assumption. However, for certain problems (e.g., multimodal object tracking), a Gaussian assumption will not suffice, and the UKF (or EKF) cannot be applied with confidence. In such cases, one has to resort to more powerful, but also more computationally expensive, filtering paradigms such as particle filters (see Section 7.6). Finally, another implementation limitation leading to some uncertainty is the necessity to choose the three unscented transformation parameters (i.e., $\alpha$, $\beta$, and $\kappa$). While we have attempted to provide some guidelines on how to choose these parameters, the optimal selection clearly depends on the specifics of the problem at hand, and is not fully understood. In general, the choice of settings does not appear critical for state estimation, but has a greater effect on performance and convergence properties for parameter estimation. Our current work focuses on addressing this issue through developing a unified and adaptive way of calculating the optimal values of these parameters. Other areas of open research include utilizing the UKF for estimation of noise covariances, extension of the UKF to recurrent architectures that may require dynamic derivatives (see Chapters 2 and 5), and the use of the UKF and smoother in the expectation–maximization algorithm (see Chapter 6). Clearly, we have only begun to scratch the surface of the numerous applications that can benefit from use of the UKF.


APPENDIX A: ACCURACY OF THE UNSCENTED TRANSFORMATION

In this appendix, we show how the unscented transformation achieves second-order accuracy in the prediction of the posterior mean and covariance of a random variable that undergoes a nonlinear transformation. For the purpose of this analysis, we assume that all nonlinear transformations are analytic across the domain of all possible values of $x$. This condition implies that the nonlinear function can be expressed as a multidimensional Taylor series consisting of an arbitrary number of terms. As the number of terms in the sum tends to infinity, the residual of the series tends to zero. This implies that the series always converges to the true value of the function.

If we consider the prior variable $x$ as being perturbed about a mean $\bar{x}$ by a zero-mean disturbance $\delta x$ with covariance $P_x$, then the Taylor series expansion of the nonlinear transformation $f(x)$ about $\bar{x}$ is

$$f(x) = f(\bar{x} + \delta x) = \sum_{n=0}^{\infty} \left. \frac{(\delta x \cdot \nabla_x)^n f(x)}{n!} \right|_{x=\bar{x}}. \quad (7.121)$$

If we define the operator $D_{\delta x}^n f$ as

$$D_{\delta x}^n f \doteq \left. (\delta x \cdot \nabla_x)^n f(x) \right|_{x=\bar{x}}, \quad (7.122)$$

then the Taylor series expansion of the nonlinear transformation $y = f(x)$ can be written as

$$y = f(x) = f(\bar{x}) + D_{\delta x} f + \tfrac{1}{2} D_{\delta x}^2 f + \tfrac{1}{3!} D_{\delta x}^3 f + \tfrac{1}{4!} D_{\delta x}^4 f + \cdots. \quad (7.123)$$

Accuracy of the Mean

The true mean of $y$ is given by

$$\bar{y} = E[y] = E[f(x)] \quad (7.124)$$
$$= E\left[ f(\bar{x}) + D_{\delta x} f + \tfrac{1}{2} D_{\delta x}^2 f + \tfrac{1}{3!} D_{\delta x}^3 f + \tfrac{1}{4!} D_{\delta x}^4 f + \cdots \right]. \quad (7.125)$$


If we assume that $x$ is a symmetrically distributed¹¹ random variable, then all odd moments will be zero. Also note that $E[\delta x\, \delta x^T] = P_x$. Given this, the mean can be reduced further to

$$\bar{y} = f(\bar{x}) + \tfrac{1}{2} \left[ (\nabla^T P_x \nabla) f(x) \right]_{x=\bar{x}} + E\left[ \tfrac{1}{4!} D_{\delta x}^4 f + \tfrac{1}{6!} D_{\delta x}^6 f + \cdots \right]. \quad (7.126)$$

The UT calculates the posterior mean from the propagated sigma points using Eq. (7.32). The sigma points are given by

$$X_i = \bar{x} \pm \sqrt{L + \lambda}\, \sigma_i = \bar{x} \pm \tilde{\sigma}_i,$$

where $\sigma_i$ denotes the $i$th column¹² of the matrix square root of $P_x$. This implies that $\sum_{i=1}^L \sigma_i \sigma_i^T = P_x$. Given this formulation of the sigma points, we can again write the propagation of each point through the nonlinear function as a Taylor series expansion about $\bar{x}$:

$$Y_i = f(X_i) = f(\bar{x}) + D_{\tilde{\sigma}_i} f + \tfrac{1}{2} D_{\tilde{\sigma}_i}^2 f + \tfrac{1}{3!} D_{\tilde{\sigma}_i}^3 f + \tfrac{1}{4!} D_{\tilde{\sigma}_i}^4 f + \cdots.$$

Using Eq. (7.32), the UT predicted mean is

$$\bar{y}_{UT} = \frac{\lambda}{L+\lambda} f(\bar{x}) + \frac{1}{2(L+\lambda)} \sum_{i=1}^{2L} \left[ f(\bar{x}) + D_{\tilde{\sigma}_i} f + \tfrac{1}{2} D_{\tilde{\sigma}_i}^2 f + \tfrac{1}{3!} D_{\tilde{\sigma}_i}^3 f + \tfrac{1}{4!} D_{\tilde{\sigma}_i}^4 f + \cdots \right]$$
$$= f(\bar{x}) + \frac{1}{2(L+\lambda)} \sum_{i=1}^{2L} \left[ D_{\tilde{\sigma}_i} f + \tfrac{1}{2} D_{\tilde{\sigma}_i}^2 f + \tfrac{1}{3!} D_{\tilde{\sigma}_i}^3 f + \tfrac{1}{4!} D_{\tilde{\sigma}_i}^4 f + \cdots \right].$$

Since the sigma points are symmetrically distributed around $\bar{x}$, all the odd moments are zero. This results in the simplification

$$\bar{y}_{UT} = f(\bar{x}) + \frac{1}{2(L+\lambda)} \sum_{i=1}^{2L} \left[ \tfrac{1}{2} D_{\tilde{\sigma}_i}^2 f + \tfrac{1}{4!} D_{\tilde{\sigma}_i}^4 f + \tfrac{1}{6!} D_{\tilde{\sigma}_i}^6 f + \cdots \right],$$

¹¹ This includes probability distributions such as Gaussian, Student-t, etc.
¹² See Section 7.3 for details of exactly how the sigma points are calculated.
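The second-order behavior described here is easy to check numerically. The sketch below applies the UT (with the scalar-optimal parameters used earlier in the chapter) to the quadratic $f(x) = x^2$, chosen because its Taylor series terminates, so the UT mean should be exact, while the linearized (EKF-style) mean $f(\bar{x})$ misses the $\sigma^2$ term:

```python
import numpy as np

f = lambda x: x ** 2          # nonlinear transformation (quadratic, so the
                              # Taylor series terminates after the second term)
x_bar, P_x = 1.0, 0.5         # prior mean and variance (L = 1)
L, alpha, beta, kappa = 1, 1.0, 0.0, 2.0
lam = alpha ** 2 * (L + kappa) - L
gamma = np.sqrt(L + lam)

# Sigma points X_i = x_bar +/- gamma * sigma_i and UT mean weights
sigma = np.sqrt(P_x)
X = np.array([x_bar, x_bar + gamma * sigma, x_bar - gamma * sigma])
Wm = np.array([lam / (L + lam), 0.5 / (L + lam), 0.5 / (L + lam)])

y_ut = Wm @ f(X)              # UT posterior mean
y_lin = f(x_bar)              # linearized posterior mean, Eq. (7.128)
y_true = x_bar ** 2 + P_x     # exact E[x^2] for any symmetric prior with this
                              # mean and variance
```

Here `y_ut` recovers the exact mean (1.5), while `y_lin` returns only 1.0, in line with Eqs. (7.126)–(7.128).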


and since

$$\frac{1}{2(L+\lambda)} \sum_{i=1}^{2L} \tfrac{1}{2} D_{\tilde{\sigma}_i}^2 f = \frac{1}{4(L+\lambda)} \nabla^T \left[ \sum_{i=1}^{2L} \left(\sqrt{L+\lambda}\, \sigma_i\right)\left(\sqrt{L+\lambda}\, \sigma_i\right)^T \right] \nabla f(x) \Big|_{x=\bar{x}} = \tfrac{1}{2} \left[ (\nabla^T P_x \nabla) f(x) \right]_{x=\bar{x}},$$

the UT predicted mean can be further simplified to

$$\bar{y}_{UT} = f(\bar{x}) + \tfrac{1}{2} \left[ (\nabla^T P_x \nabla) f(x) \right]_{x=\bar{x}} + \frac{1}{2(L+\lambda)} \sum_{i=1}^{2L} \left[ \tfrac{1}{4!} D_{\tilde{\sigma}_i}^4 f + \tfrac{1}{6!} D_{\tilde{\sigma}_i}^6 f + \cdots \right]. \quad (7.127)$$

When we compare Eqs. (7.127) and (7.126), we can clearly see that the true posterior mean and the mean calculated by the UT agree exactly to the third order, and that errors are only introduced in the fourth- and higher-order terms. The magnitudes of these errors depend on the choice of the composite scaling parameter $\lambda$ as well as the higher-order derivatives of $f$. In contrast, a linearization approach calculates the posterior mean as

$$\bar{y}_{LIN} = f(\bar{x}), \quad (7.128)$$

which only agrees with the true posterior mean up to the first order. Julier and Uhlmann [2] show that, on a term-by-term basis, the errors in the higher-order terms of the UT are consistently smaller than those for linearization.

Accuracy of the Covariance

The true posterior covariance is given by

$$P_y = E\left[ (y - \bar{y})(y - \bar{y})^T \right] = E[y y^T] - \bar{y}\bar{y}^T, \quad (7.129)$$


where the expectation is taken over the distribution of $y$. Substituting Eqs. (7.123) and (7.125) into (7.129), and recalling that all odd moments of $\delta x$ are zero owing to symmetry, we can write the true posterior covariance as

$$P_y = A_x P_x A_x^T - \tfrac{1}{4} \left\{ \left[(\nabla^T P_x \nabla) f(x)\right]\left[(\nabla^T P_x \nabla) f(x)\right]^T \right\}_{x=\bar{x}} + E\left[ \underbrace{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \frac{1}{i!\,j!} D_{\delta x}^i f \left(D_{\delta x}^j f\right)^T}_{ij>1} \right] - \underbrace{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \frac{1}{(2i)!\,(2j)!} E\left[D_{\delta x}^{2i} f\right] E\left[D_{\delta x}^{2j} f\right]^T}_{ij>1}, \quad (7.130)$$

where $A_x$ is the Jacobian matrix of $f(x)$ evaluated at $\bar{x}$. It can be shown (using a similar approach as for the posterior mean) that the posterior covariance calculated by the UT is given by

$$(P_y)_{UT} = A_x P_x A_x^T - \tfrac{1}{4} \left\{ \left[(\nabla^T P_x \nabla) f(x)\right]\left[(\nabla^T P_x \nabla) f(x)\right]^T \right\}_{x=\bar{x}} + \frac{1}{2(L+\lambda)} \sum_{k=1}^{2L} \underbrace{\left[ \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \frac{1}{i!\,j!} D_{\tilde{\sigma}_k}^i f \left(D_{\tilde{\sigma}_k}^j f\right)^T \right]}_{ij>1} - \frac{1}{4(L+\lambda)^2} \sum_{k=1}^{2L} \sum_{m=1}^{2L} \underbrace{\left[ \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \frac{1}{(2i)!\,(2j)!} D_{\tilde{\sigma}_k}^{2i} f \left(D_{\tilde{\sigma}_m}^{2j} f\right)^T \right]}_{ij>1}. \quad (7.131)$$

Comparing Eqs. (7.130) and (7.131), it is clear that the UT again calculates the posterior covariance accurately to the first two terms, with errors only introduced in the fourth- and higher-order moments. Julier and Uhlmann [2] show how the absolute term-by-term errors of these higher-order moments are again consistently smaller for the UT than for the linearized case, which truncates the Taylor series after the first term, that is,

$$(P_y)_{LIN} = A_x P_x A_x^T. \quad (7.132)$$

For this derivation, we have assumed the value of the $\beta$ parameter in the UT to be zero. If prior knowledge about the shape of the prior distribution of $x$ is known, $\beta$ can be set to a non-zero value that minimizes the error in


some of the higher ($\ge 4$) order moments. Julier [53] shows how the error in the kurtosis of the posterior distribution is minimized for a Gaussian $x$ when $\beta = 2$.

APPENDIX B: EFFICIENT SQUARE-ROOT UKF IMPLEMENTATIONS

In the standard Kalman implementation, the state (or parameter) covariance $P_k$ is recursively calculated. The UKF requires taking the matrix square root $S_k S_k^T = P_k$ at each time step, which is $O(\frac{1}{6}L^3)$ using a Cholesky factorization. In the square-root UKF (SR-UKF), $S_k$ will be propagated directly, avoiding the need to refactorize at each time step. The algorithm will in general still be $O(L^3)$ for state estimation, but with improved numerical properties (e.g., guaranteed positive-semidefiniteness of the state covariances), similar to those of standard square-root Kalman filters [20]. However, for the special state-space formulation of parameter estimation, an $O(L^2)$ implementation becomes possible (equivalent complexity to EKF parameter estimation).

The square-root form of the UKF makes use of three powerful linear-algebra techniques,¹³ QR decomposition, Cholesky factor updating, and efficient least squares, which we briefly review below:

- QR decomposition. The QR decomposition or factorization of a matrix $A \in \mathbb{R}^{L \times N}$ is given by $A^T = QR$, where $Q \in \mathbb{R}^{N \times N}$ is orthogonal, $R \in \mathbb{R}^{N \times L}$ is upper-triangular, and $N \ge L$. The upper-triangular part of $R$, $\tilde{R}$, is the transpose of the Cholesky factor of $P = AA^T$; that is, $\tilde{R} = S^T$, such that $\tilde{R}^T \tilde{R} = AA^T$. We use the shorthand notation $\mathrm{qr}\{\cdot\}$ to denote a QR decomposition of a matrix where only $\tilde{R}$ is returned. The computational complexity of a QR decomposition is $O(NL^2)$. Note that performing a Cholesky factorization directly on $P = AA^T$ is $O(\frac{1}{6}L^3)$ plus $O(NL^2)$ to form $AA^T$.

- Cholesky factor updating. If $S$ is the original lower-triangular Cholesky factor of $P = AA^T$, then the Cholesky factor of the rank-1 update (or downdate) $P \pm \sqrt{\nu}\, u u^T$ is denoted by $S = \mathrm{cholupdate}\{S, u, \pm\nu\}$. If $u$ is a matrix and not a vector, then the result is $M$ consecutive updates of the Cholesky factor using the $M$ columns of $u$. This algorithm (available in Matlab as cholupdate) is only $O(L^2)$ per update.

¹³ See [54] for theoretical and implementation details.
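A brief numerical check of the first two techniques may be helpful. The sketch below verifies the QR-as-Cholesky identity with NumPy, and includes a plain-Python rank-1 update routine as a stand-in for Matlab's `cholupdate` (written here in the lower-triangular convention, update case only):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- QR decomposition as a Cholesky factorization of A A^T -----------------
A = rng.normal(size=(3, 8))                # A in R^{L x N}, with N >= L
P = A @ A.T
R = np.linalg.qr(A.T, mode="r")            # upper-triangular R with A^T = Q R
R = np.sign(np.diag(R))[:, None] * R       # fix row signs so the diagonal is > 0
# Now R^T R = A A^T, i.e., R is the transpose of the Cholesky factor of P.

# --- Rank-1 Cholesky update (lower-triangular convention, nu > 0) ----------
def cholupdate(S, u, nu):
    """Return S' with S' S'^T = S S^T + nu * u u^T, in O(L^2) operations."""
    S, x = S.copy(), np.sqrt(nu) * u.copy()
    for k in range(len(x)):
        r = np.hypot(S[k, k], x[k])
        c, s = r / S[k, k], x[k] / S[k, k]
        S[k, k] = r
        S[k+1:, k] = (S[k+1:, k] + s * x[k+1:]) / c
        x[k+1:] = c * x[k+1:] - s * S[k+1:, k]
    return S

S = np.linalg.cholesky(P)                  # lower-triangular factor of P
u = rng.normal(size=3)
S_new = cholupdate(S, u, 0.5)              # factor of P + 0.5 * u u^T
```

Matlab's `cholupdate` operates on upper-triangular factors and also supports downdates; the sketch above only shows the update case needed to see why the operation is $O(L^2)$ rather than $O(L^3)$.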


- Efficient least squares. The solution to the equation $(AA^T)x = A^T b$ also corresponds to the solution of the overdetermined least-squares problem $Ax = b$. This can be solved efficiently using a QR decomposition with pivoting (implemented in Matlab's "/" operator).

The complete specifications for the new square-root filters are given in Table 7.9 for state estimation and Table 7.10 for parameter estimation. Below we describe the key parts of the square-root algorithms, and how they contrast with the standard implementations. Experimental results and further discussion are presented in [7] and [55].

Square-Root State Estimation

As in the original UKF, the filter is initialized by calculating the matrix square root of the state covariance once via a Cholesky factorization, Eq. (7.133). However, the propagated and updated Cholesky factor is then used in subsequent iterations to directly form the sigma points. In Eq. (7.137) the time update of the Cholesky factor, $S^-$, is calculated using a QR decomposition of the compound matrix containing the weighted propagated sigma points and the matrix square root of the additive process-noise covariance. The subsequent Cholesky update (or downdate) in Eq. (7.138) is necessary since the zeroth weight, $W_0^{(c)}$, may be negative. These two steps replace the time update of $P^-$ in Eq. (7.55), and are also $O(L^3)$.

The same two-step approach is applied to the calculation of the Cholesky factor, $S_{\tilde{y}}$, of the observation-error covariance in Eqs. (7.142) and (7.143). This step is $O(LM^2)$, where $M$ is the observation dimension. In contrast to the way that the Kalman gain is calculated in the standard UKF (see Eq. (7.61)), we now use two nested inverse (or least-squares) solutions to the following expansion of Eq. (7.60): $K_k (S_{\tilde{y}_k} S_{\tilde{y}_k}^T) = P_{x_k y_k}$. Since $S_{\tilde{y}}$ is square and triangular, efficient "back-substitutions" can be used to solve for $K_k$ directly without the need for a matrix inversion.

Finally, the posterior measurement update of the Cholesky factor of the state covariance is calculated in Eq. (7.147) by applying $M$ sequential Cholesky downdates to $S_k^-$. The downdate vectors are the columns of $U = K_k S_{\tilde{y}_k}$. This replaces the posterior update of $P_k$ in Eq. (7.63), and is also $O(LM^2)$.

Square-Root Parameter Estimation

The parameter-estimation algorithm follows a similar framework to that of the state-estimation square-root UKF. However, an $O(ML^2)$ algorithm, as


Table 7.9 Square-root UKF for state estimation

Initialize with
$$\hat{x}_0 = E[x_0], \qquad S_0 = \mathrm{chol}\left\{ E\left[(x_0 - \hat{x}_0)(x_0 - \hat{x}_0)^T\right] \right\}. \quad (7.133)$$

For $k \in \{1, \ldots, \infty\}$, the sigma-point calculation and time update are given by
$$X_{k-1} = \left[\hat{x}_{k-1}, \ \hat{x}_{k-1} + \gamma S_{k-1}, \ \hat{x}_{k-1} - \gamma S_{k-1}\right], \quad (7.134)$$
$$X^*_{k|k-1} = F(X_{k-1}, u_{k-1}), \quad (7.135)$$
$$\hat{x}_k^- = \sum_{i=0}^{2L} W_i^{(m)} X^*_{i,k|k-1}, \quad (7.136)$$
$$S_k^- = \mathrm{qr}\left\{ \left[ \sqrt{W_1^{(c)}} \left(X^*_{1:2L,k|k-1} - \hat{x}_k^-\right), \ \sqrt{R^v} \right] \right\}, \quad (7.137)$$
$$S_k^- = \mathrm{cholupdate}\left\{ S_k^-, \ X^*_{0,k} - \hat{x}_k^-, \ W_0^{(c)} \right\}, \quad (7.138)$$
(augment sigma points)¹⁴
$$X_{k|k-1} = \left[X^*_{k|k-1}, \ X^*_{0,k|k-1} + \gamma\sqrt{R^v}, \ X^*_{0,k|k-1} - \gamma\sqrt{R^v}\right], \quad (7.139)$$
$$Y_{k|k-1} = H(X_{k|k-1}), \quad (7.140)$$
$$\hat{y}_k^- = \sum_{i=0}^{2L} W_i^{(m)} Y_{i,k|k-1}, \quad (7.141)$$

and the measurement-update equations are
$$S_{\tilde{y}_k} = \mathrm{qr}\left\{ \left[ \sqrt{W_1^{(c)}} \left(Y_{1:2L,k} - \hat{y}_k\right), \ \sqrt{R_k^n} \right] \right\}, \quad (7.142)$$
$$S_{\tilde{y}_k} = \mathrm{cholupdate}\left\{ S_{\tilde{y}_k}, \ Y_{0,k} - \hat{y}_k, \ W_0^{(c)} \right\}, \quad (7.143)$$
$$P_{x_k y_k} = \sum_{i=0}^{2L} W_i^{(c)} \left(X_{i,k|k-1} - \hat{x}_k^-\right)\left(Y_{i,k|k-1} - \hat{y}_k^-\right)^T, \quad (7.144)$$
$$K_k = \left(P_{x_k y_k} / S_{\tilde{y}_k}^T\right) / S_{\tilde{y}_k}, \quad (7.145)$$
$$\hat{x}_k = \hat{x}_k^- + K_k \left(y_k - \hat{y}_k^-\right), \qquad U = K_k S_{\tilde{y}_k}, \quad (7.146)$$
$$S_k = \mathrm{cholupdate}\left\{ S_k^-, \ U, \ -1 \right\}. \quad (7.147)$$

¹⁴ Alternatively, redraw a new set of sigma points that incorporate the additive process noise, i.e., $X_{k|k-1} = [\hat{x}_k^-, \ \hat{x}_k^- + \gamma S_k^-, \ \hat{x}_k^- - \gamma S_k^-]$.

Table 7.10 Square-root UKF for parameter estimation

Initialize with
$$\hat{w}_0 = E[w], \qquad S_{w_0} = \mathrm{chol}\left\{ E\left[(w - \hat{w}_0)(w - \hat{w}_0)^T\right] \right\}. \quad (7.148)$$

For $k \in \{1, \ldots, \infty\}$, the time update and sigma-point calculation are given by
$$\hat{w}_k^- = \hat{w}_{k-1}, \quad (7.149)$$
$$S_{w_k}^- = \lambda_{RLS}^{-1/2} S_{w_{k-1}} \quad \text{or} \quad S_{w_k}^- = S_{w_{k-1}} + D_{r_{k-1}}, \quad (7.150)$$
$$W_{k|k-1} = \left[\hat{w}_k^-, \ \hat{w}_k^- + \gamma S_{w_k}^-, \ \hat{w}_k^- - \gamma S_{w_k}^-\right], \quad (7.151)$$
$$D_{k|k-1} = G(x_k, W_{k|k-1}), \quad (7.152)$$
$$\hat{d}_k = \sum_{i=0}^{2L} W_i^{(m)} D_{i,k|k-1}, \quad (7.153)$$

and the measurement-update equations are
$$S_{d_k} = \mathrm{qr}\left\{ \left[ \sqrt{W_1^{(c)}} \left(D_{1:2L,k} - \hat{d}_k\right), \ \sqrt{R^e} \right] \right\}, \quad (7.154)$$
$$S_{d_k} = \mathrm{cholupdate}\left\{ S_{d_k}, \ D_{0,k} - \hat{d}_k, \ W_0^{(c)} \right\}, \quad (7.155)$$
$$P_{w_k d_k} = \sum_{i=0}^{2L} W_i^{(c)} \left(W_{i,k|k-1} - \hat{w}_k^-\right)\left(D_{i,k|k-1} - \hat{d}_k\right)^T, \quad (7.156)$$
$$K_k = \left(P_{w_k d_k} / S_{d_k}^T\right) / S_{d_k}, \quad (7.157)$$
$$\hat{w}_k = \hat{w}_k^- + K_k \left(d_k - \hat{d}_k\right), \qquad U = K_k S_{d_k}, \quad (7.158)$$
$$S_{w_k} = \mathrm{cholupdate}\left\{ S_{w_k}^-, \ U, \ -1 \right\}, \quad (7.159)$$
where
$$D_{r_{k-1}} = -\mathrm{Diag}\{S_{w_{k-1}}\} + \sqrt{ \mathrm{Diag}\{S_{w_{k-1}}\}^2 + \mathrm{Diag}\{R_{k-1}^r\} }.$$

opposed to $O(L^3)$, is possible by taking advantage of the linear state-transition function. Specifically, the time update of the state covariance is given simply by $P_{w_k}^- = P_{w_{k-1}} + R_{k-1}^r$ (see Section 7.4 for a discussion on selecting $R_{k-1}^r$). In the square-root filters, $S_{w_k}$ may thus be updated directly in Eq. (7.150) using one of two options: (1) $S_{w_k}^- = \lambda_{RLS}^{-1/2} S_{w_{k-1}}$, corresponding to an exponential weighting on past data; (2) $S_{w_k}^- = S_{w_{k-1}} + D_{r_{k-1}}$, where the diagonal matrix $D_{r_{k-1}}$ is chosen to approximate the effects of annealing a diagonal process-noise covariance $R_k^r$.¹⁵ Both options avoid the costly $O(L^3)$ QR and Cholesky-based updates necessary in the state-estimation filter.

¹⁵ This update ensures that the main diagonal of $P_{w_k}^-$ is exact. However, additional off-diagonal cross-terms $S_{w_{k-1}} D_{r_{k-1}}^T + D_{r_{k-1}} S_{w_{k-1}}^T$ are also introduced (though the effect appears negligible).
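The nested triangular solves of Eqs. (7.145) and (7.157), $K_k = (P / S^T) / S$ in Matlab notation, can be illustrated as follows. The matrices here are random placeholders, and NumPy's general-purpose `np.linalg.solve` stands in for the dedicated triangular back-substitution routines a production implementation would use:

```python
import numpy as np

rng = np.random.default_rng(2)
L_x, M = 4, 2                                  # state and observation dimensions

P_xy = rng.normal(size=(L_x, M))               # cross-covariance P_{x_k y_k}
A = rng.normal(size=(M, 2 * M))
S_y = np.linalg.cholesky(A @ A.T)              # lower-triangular innovation factor

# Two nested "back-substitutions" solving K (S_y S_y^T) = P_xy without forming
# an explicit inverse of the innovation covariance:
tmp = np.linalg.solve(S_y, P_xy.T)             # S_y^{-1} P_xy^T
K = np.linalg.solve(S_y.T, tmp).T              # (S_y^{-T} S_y^{-1} P_xy^T)^T
```

Because `S_y` is triangular, each solve costs only $O(M^2)$ per right-hand-side column with a true back-substitution routine, which is the point of factoring the innovation covariance in the first place.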


REFERENCES

[1] S.J. Julier, J.K. Uhlmann, and H. Durrant-Whyte, "A new approach for filtering nonlinear systems," in Proceedings of the American Control Conference, 1995, pp. 1628–1632.
[2] S.J. Julier and J.K. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Technical Report, RRG, Department of Engineering Science, University of Oxford, November 1996. http://www.robots.ox.ac.uk/siju/work/publications/letter_size/Unscented.zip.
[3] S.J. Julier and J.K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997.
[4] E.A. Wan, R. van der Merwe, and A.T. Nelson, "Dual estimation and the unscented transformation," in S.A. Solla, T.K. Leen, and K.-R. Müller, Eds., Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press, 2000, pp. 666–672.
[5] E.A. Wan and R. van der Merwe, "The unscented Kalman filter for nonlinear estimation," in Proceedings of Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), IEEE, Lake Louise, Alberta, Canada, October 2000.
[6] R. van der Merwe, J.F.G. de Freitas, A. Doucet, and E.A. Wan, "The unscented particle filter," Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, August 2000.
[7] R. van der Merwe and E.A. Wan, "Efficient derivative-free Kalman filters for online learning," in Proceedings of European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, April 2001.
[8] S. Singhal and L. Wu, "Training multilayer perceptrons with the extended Kalman filter," in Advances in Neural Information Processing Systems 1. San Mateo, CA: Morgan Kaufmann, 1989, pp. 133–140.
[9] G.V. Puskorius and L.A. Feldkamp, "Decoupled extended Kalman filter training of feedforward layered networks," in Proceedings of IJCNN, Vol. 1, International Joint Conference on Neural Networks, 1991, pp. 771–777.
[10] R.E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Ser. D, Journal of Basic Engineering, 82, 35–45 (1960).
[11] A. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1970.
[12] K. Ito and K. Xiong, "Gaussian filters for nonlinear filtering problems," IEEE Transactions on Automatic Control, 45, 910–927 (2000).
[13] M. Nørgaard, N.K. Poulsen, and O. Ravn, "Advances in derivative-free state estimation for nonlinear systems," Technical Report IMM-REP-1998-15,


278 7 THE UNSCENTED KALMAN FILTERDepartment of Mathematical Mo<strong>de</strong>lling=Department of Automation, TechnicalUniversity of Denmark, Lyngby, April 2000.[14] J.R. Cloutier, C.N. D’Souza, <strong>and</strong> C.P. Mracek, ‘‘Nonlinear regulation <strong>and</strong>nonlinear H-infinity controls via the state-<strong>de</strong>pen<strong>de</strong>nt Riccati equationtechnique: Part 1, Theory,’’ in Proceedings of the International Conferenceon Nonlinear Problems in Aviation <strong>and</strong> Aerospace, Daytona Beach, FL, May1996.[15] M. Mackey <strong>and</strong> L. Glass, ‘‘Oscillation <strong>and</strong> chaos in a physiological controlsystem,’’ Science, 197, 287–289 1977.[16] A. Lape<strong>de</strong>s <strong>and</strong> R. Farber, ‘‘Nonlinear signal processing using neuralnetworks: Prediction <strong>and</strong> system mo<strong>de</strong>lling,’’ Technical Report LAUR872662, Los Alamos National Laboratory, 1987.[17] R.H. Shumway <strong>and</strong> D.S. Stoffer, ‘‘An approach to time series smoothing <strong>and</strong>forecasting using the EM algorithm,’’ Time Series Analysis, 3, 253–264(1982).[18] Z. Ghahramani <strong>and</strong> S.T. Roweis, ‘‘Learning nonlinear dynamical systemsusing an EM algorithm,’’ in M.J. Kearns, S.A. Solla, <strong>and</strong> D.A. Cohn, Eds.,Advances in <strong>Neural</strong> Information Processing Systems 11: Proceedings of the1998 Conference. Cambridge, MA: MIT Press, 1999.[19] F.L. Lewis, Optimal Estimation. New York: Wiley, 1986.[20] A.H. Sayed <strong>and</strong> T. Kailath, ‘‘A state-space approach to adaptive RLSfiltering,’’ IEEE Signal Processing Magazine, pp. 18–60 (July 1994).[21] S. Haykin. Adaptive Filter Theory, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.[22] A.T. Nelson, ‘‘Nonlinear estimation <strong>and</strong> mo<strong>de</strong>ling of noisy time-series bydual <strong>Kalman</strong> filtering methods,’’ PhD Thesis, Oregon Graduate Institute,2000.[23] L. 
Ljung <strong>and</strong> T. Sö<strong>de</strong>rström, Theory <strong>and</strong> Practice of Recursive I<strong>de</strong>ntification.Cambridge, MA: MIT Press, 1983.[24] D.J.C. MacKay,http:==wol.ra.phy.cam.ac.uk=mackay=sourcedata. html.www.[25] D.J.C. MacKay, ‘‘A practical Bayesian framework for backpropagationnetworks,’’ <strong>Neural</strong> Computation, 4, 448–472 (1992).[26] K. Ikeda, ‘‘Multiple-valued stationary state <strong>and</strong> its instability of light by aring cavity system,’’ Optics Communications 30, 257–261 (1979).[27] H.H. Rosenbrock, ‘‘An automatic method for finding the greatest or leastvalue of a function,’’ Computer Journal, 3, 175–184 (1960).[28] G.V. Puskorius <strong>and</strong> L.A. Feldkamp, ‘‘Extensions <strong>and</strong> enhancementsof <strong>de</strong>coupled exten<strong>de</strong>d <strong>Kalman</strong> filter training,’’ in Proceedings of ICNN,Vol. 3, International Conference on <strong>Neural</strong> <strong>Networks</strong> 1997, pp. 1879–1883.


[29] E.A. Wan and A.T. Nelson, "Neural dual extended Kalman filtering: Applications in speech enhancement and monaural blind signal separation," in Proceedings of Neural Networks for Signal Processing Workshop, IEEE, 1997.
[30] M.B. Matthews, "A state-space approach to adaptive nonlinear filtering using recurrent neural networks," in Proceedings of IASTED International Symposium on Artificial Intelligence Application and Neural Networks, 1990, pp. 197–200.
[31] R.W. Brumbaugh, "An Aircraft Model for the AIAA Controls Design Challenge," PRC Inc., Edwards, CA.
[32] J.P. Dutton, "Development of a Nonlinear Simulation for the McDonnell Douglas F-15 Eagle with a Longitudinal TECS Control Law," Master's Thesis, Department of Aeronautics and Astronautics, University of Washington, 1994.
[33] A. Doucet, "On sequential simulation-based methods for Bayesian filtering," Technical Report CUED/F-INFENG/TR 310, Cambridge University Engineering Department, 1998.
[34] A. Doucet, J.F.G. de Freitas, and N.J. Gordon, "Introduction to sequential Monte Carlo methods," in A. Doucet, J.F.G. de Freitas, and N.J. Gordon, Eds., Sequential Monte Carlo Methods in Practice. Berlin: Springer-Verlag, 2000.
[35] N.J. Gordon, D.J. Salmond, and A.F.M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings, Part F, 140, 107–113 (1993).
[36] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: SIAM, 1982.
[37] D.B. Rubin, "Using the SIR algorithm to simulate posterior distributions," in J.M. Bernardo, M.H. DeGroot, D.V. Lindley, and A.F.M. Smith, Eds., Bayesian Statistics 3. Oxford University Press, 1988, pp. 395–402.
[38] A.F.M. Smith and A.E. Gelfand, "Bayesian statistics without tears: a sampling–resampling perspective," American Statistician, 46, 84–88 (1992).
[39] T. Higuchi, "Monte Carlo filter using the genetic algorithm operators," Journal of Statistical Computation and Simulation, 59, 1–23 (1997).
[40] A. Kong, J.S. Liu, and W.H. Wong, "Sequential imputations and Bayesian missing data problems," Journal of the American Statistical Association, 89, 278–288 (1994).
[41] J.S. Liu and R. Chen, "Blind deconvolution via sequential imputations," Journal of the American Statistical Association, 90, 567–576 (1995).
[42] J.S. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," Journal of the American Statistical Association, 93, 1032–1044 (1998).


[43] V.S. Zaritskii, V.V. Svetnik, and L.I. Shimelevich, "Monte-Carlo techniques in problems of optimal information processing," Automation and Remote Control, 36, 2015–2022 (1975).
[44] D. Avitzour, "A stochastic simulation Bayesian approach to multitarget tracking," IEE Proceedings on Radar, Sonar and Navigation, 142, 41–44 (1995).
[45] E.R. Beadle and P.M. Djurić, "A fast weighted Bayesian bootstrap filter for nonlinear model state estimation," IEEE Transactions on Aerospace and Electronic Systems, 33, 338–343 (1997).
[46] M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," in Proceedings of European Conference on Computer Vision, Cambridge, UK, 1996, pp. 343–356.
[47] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space model," Journal of Computational and Graphical Statistics, 5, 1–25 (1996).
[48] J.F.G. de Freitas, "Bayesian Methods for Neural Networks," PhD Thesis, Cambridge University Engineering Department, 1999.
[49] J.C. Hull, Options, Futures, and Other Derivatives, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1997.
[50] F. Black and M. Scholes, "The pricing of options and corporate liabilities," Journal of Political Economy, 81, 637–659 (1973).
[51] M. Niranjan, "Sequential tracking in pricing financial options using model based and neural network approaches," in M.C. Mozer, M.I. Jordan, and T. Petsche, Eds., Advances in Neural Information Processing Systems 8, 1996, pp. 960–966.
[52] J.F.G. de Freitas, M. Niranjan, A.H. Gee, and A. Doucet, "Sequential Monte Carlo methods to train neural network models," Neural Computation, 12, 955–993 (2000).
[53] S.J. Julier, "The scaled unscented transformation." In preparation.
[54] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press, 1992.
[55] R. van der Merwe and E.A. Wan, "The square-root unscented Kalman filter for state and parameter-estimation," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, May 2001.


INDEX

A priori covariance matrix, 7
Air/fuel ratio control, 58
Approximate error covariance matrix, 24, 29–34, 49, 63
Artificial process noise, 48–50
Attentional filtering, 80
Automatic relevance determination (ARD), 205
Automotive applications, 57
Automotive powertrain control systems, 57
Avoiding matrix inversions, 46
Backpropagation, 30, 39, 44, 51, 56
Backpropagation process, 55
Backward filtering, 12
Bayesian methods, 203
Bayes' rule, 181
BPTT(h), 45
Cayley–Hamilton theorem, 212
Central difference interpolation, 230
Chaotic (dynamic) invariants, 84
Chaotic dynamics, 83
Cholesky factorization, 11
Closed-loop controller, 60
Closed-loop evaluation, 88, 93, 100, 108, 115
Colored noise, 166
Comparison of chaotic invariants of Ikeda map, 97
Comparison of chaotic invariants of logistic map, 90
Comparison of chaotic invariants of Lorenz series, 102
Comparison of chaotic invariants of sea clutter, 114
Computational complexity, 24, 33, 34, 39, 46, 63
Conditional mean estimator, 4
Constrained weights, 51
Correlation dimension, 84
Cortical feedback, 80
Cost functions, 64
Covariance matrix of the process noise, 31, 32
Cross-entropy, 54
Decoupled extended Kalman filter (DEKF), 26, 33, 39, 47
Decoupled extended Kalman filter (NDEKF) algorithm, 69
DEKF algorithm, 34
Delay coordinate method, 86
Derivative calculations, 43, 56
Derivative matrices, 31, 34, 38
Derivative matrix, 30, 31, 33
Derivatives of network outputs, 44
Divergence phenomenon, 10


Double inverted pendulum, 234
Dual EKF, 213
Dual estimation, 123, 130, 224, 249
Dual Kalman, 125
Dynamic pattern classifiers, 62
Dynamic reconstruction, 85
Dynamic reconstruction of the laser series, 109
Dynamic reconstruction of the Lorenz series, 101
Dynamic reconstruction of the noisy Lorenz series, 105
Dynamic reconstruction of the noisy Ikeda map, 98
EKF, 37, 43, 52, 54, 56, 62, see Extended Kalman filter
EKF procedure, 28
Elliott sigmoid, 53
EM algorithm, 142
Embedding delay, 86
Embedding dimension, 86
Embedding, 211
Engine misfire detection, 61
Entropic cost function, 54, 55
Error covariance propagation, 8
Error covariance matrices, 48
Error covariance matrix, 26
Error covariance update, 49
Error vector, 29–31, 34, 38, 52
Estimation, 124
Expectation–maximization (EM) algorithm, 177, 182
Extended Kalman filter (EKF), 16, 24, 123, 182, 221, 227
Extended Kalman filtering (EKF) algorithm, 179
Extended Kalman filter–recurrent multilayered perceptron, 83
Extended Kalman filter, summary of, 19
Factor analysis (FA), 193
Filtering, 3
Forward filtering, 12
Fully decoupled EKF, 25, 34
Gauss–Hermite quadrature rule, 230
GEKF, 30, 33, 34, 39, 62, see Global EKF
GEKF, decoupled EKF algorithm, 25
Generative model, 178
Givens rotations, 49, 50
Global EKF (GEKF), 24, 26
Global scaling matrix, 29, 31, 38
Global scaling matrix A_k, 34
Global EKF training, 29
Graphical models, 178, 179
Hidden variables, 177
Hierarchical architecture, 71
Identifiability, 206
Idle speed control, 59
Ikeda map, 91
Inference, 176
Innovations, 7
Jensen's inequality, 183
Joint EKF, 213
Joint estimation, 137
Joint extended Kalman filter, 125
"Joseph" version of the covariance update equation, 8
Kalman filter, 1, 5, 177
Kalman filter, information formulation of, 13
Kalman gain, 6
Kalman gain matrix, 29, 30, 31, 33, 49
Kalman gain matrices, 34, 38
Kaplan–Yorke dimension, 85
Kernel, 192
Kolmogorov entropy, 85
Laser intensity pulsations, 106
Layer-decoupled EKF, 34
Learning rate, 31, 32, 48


Least-squares solution, 42
Logistic map, 87
Lorenz attractor, 99
Lyapunov exponents, 84
Lyapunov dimension, 85
MAP estimation, 135
Mammalian neocortex, 70
Marginal estimation, 140
Markov-chain Monte Carlo, 258
Matrices H_k^i, 52
Matrix of derivatives, 29
Matrix factorization lemma, 48
Matrix inversion lemma, 14
Matrix inversions, 55, 63
Maximum a posteriori (MAP), 135
Maximum-likelihood cost, 140
Measurement covariance, 30
Measurement equation, 3
Measurement error covariance matrix, 38
Mixture of factor analyzers (MFA), 193
MMSE estimator, 225
Modeling, 124
Model selection, 203
Monte Carlo simulation, 255
Moving object, 80
Multistream, 39, 40
Multistream EKF, 26, 38
Multistream EKF training, 63
Multistream Kalman recursion, 42
Multistream training, 34, 36, 45
Neurobiological foundations, 70
Node-decoupled extended Kalman filter (NDEKF) algorithm, 69
Node-decoupled EKF, 25, 34, 46
Noise, 166
Noisy time-series estimation, 153, 235
Noisy Ikeda series, 95
Noisy Lorenz series, 103
Nonlinear dynamics, 175
Nonlinear dynamic modelling of real-world time series, 106
Non-rigid motion, 80
Nonstationarity, 202
Occlusions, 75, 78
On-line learning, 201
Open-loop evaluation, 87, 92, 99, 108, 115
Optimal recursive estimation, 224
Optimization with constraints, 51
Optimum smoothing problem, 11
Overfitting, 204
Parameter estimation, 223, 240
Partial M-step, 200
Particle filters, 182, 259
Perceptual foundations, 70
Prediction error, 126
Prediction, 3, 124
Priming length, 37
Principles of orthogonality, 4
Process equation, 2
Process noise, 29
Proposal distribution, 254
Radial basis function (RBF) networks, 188
Rauch–Tung–Striebel (RTS) smoother, 11, 17, 180
Rauch–Tung–Striebel, 15, 25
Real-time-recurrent learning (RTRL), 25
Recency effect, 36
Recency phenomenon, 25, 26
Reconstruction failures, 119
Recurrent derivative, 131, 164
Recurrent multilayered perceptron (RMLP), 26, 28, 30, 61, 69
Recurrent network, 28, 44, 59
Recurrent multilayered perceptron, 69
Recurrent neural networks, 60, 62, 63
Recursive least-squares (RLS) algorithm, 201


Rescaled extended Kalman recursion, 31
Sea clutter data, 113
Sensor-catalyst modeling, 60
Sequential DEKF, 47
Sequential importance sampling, 255
Sequential update, 47
Shape and motion perception, 80
Signal-to-error ratio (SER), 87
Simultaneous DEKF, 47
Singular-value decomposition, 46
Smoothing, 3
Speech enhancement, 157
Square-root filtering, 10, 48–50
Square-root UKF, 273
Stability, 210
Stability and robustness, 118
State-error vector, 5
State estimation, 127, 176, 222
Sum of squared error, 29, 64
Summary of the Rauch–Tung–Striebel smoother, 17
Summary of the Kalman filter, 10
Takens embedding, 86
Takens' theorem, 211
Teacher forcing, 42
Tracking, 70
Training cost function, 31
Trajectory length, 37
Truncated backpropagation through time (BPTT(h)), 25, 37, 44
Unscented transformation, 228
Unscented filter, 182
Unscented particle filter, 254
Unscented Kalman filter (UKF), 221, 228, 230
Unscented Kalman smoother, 237
Variance estimation, 149
Variational approximations, 210
Vehicle emissions estimation, 62
Weather data, 197
Weight estimation, 128
Weighting matrix, 31, 32
"What" pathway, 79
"Where" pathway, 80
